Internet Measurements

by Jensen J.

Why is it important to measure the network?

Distributed multi-domain network
- → Information only partially available
Moving target
- Requirements change
- Growth, usage, structure changes
Highly interactive system
Heterogeneity in all directions
The total is more than the sum of its pieces
Built, driven, and used by humans
- → Errors, misconfigurations, flaws, failures, misuse,…

=> Active network measurements are an important research area to understand the Internet and interactions between all its components.

Why is network measurement important for network providers?

Manage traffic
- Model reality
- Predict future
- Plan network
- Avoid bottlenecks in advance
Reduce cost
Accounting

Why is network measurement important for service providers?

Get information about clients
Adjust service to demands
Reduce load on servers
Accounting

Why is network measurement important for clients?

Get the best possible service
Do I get what I paid for?

Why is network measurement important for security?

Detect malicious traffic
Detect malicious hosts
Detect malicious networks

Why is network measurement important for researchers?

Understand the Internet better
Could our new routing algorithm handle all this real-world traffic?
...

What are some problems with measuring the internet?

Creates additional traffic
Creates load on routers and hosts
Might uncover personal information
Might be intrusive

What should be considered in terms of problems in internet measurements?

Scan with a moderate rate
Distribute the load as evenly as possible
Do not publish data without anonymization or access restrictions
Inform about the scanning behavior and react to complaints

What tools are widely used for internet measurements?

nmap
zmap

What is nmap? What modes of operation does it have?

network mapping
modes
- host discovery
- service discovery
- os detection
- execution of custom scripts

What scanning techniques does nmap have?

TCP raw socket scans
- SYN -> find open ports
- NULL/FIN/XMAS
  - differ in which flags are set
  - NULL: no flag set
  - FIN: only FIN flag set
  - XMAS: FIN, PSH, and URG flags set
- ACK -> determine filtered/unfiltered ports in a firewall
- Window: same as ACK, but lists ports whose RST response has a window > 0 as open
- Maimon: send FIN + ACK; according to the RFC, all hosts should respond with RST no matter if the port is open or closed…
TCP connect scans
ICMP ping scans
UDP payload scans
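
The flag combinations above can be written down as bit masks. This is only a sketch of the header values, not of how nmap builds packets; the names `SCAN_FLAGS` and `describe` are made up for illustration.

```python
# TCP flag bits as they appear in the flags field of the TCP header.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

# Flags set in the probe packet of each raw-socket scan type listed above.
SCAN_FLAGS = {
    "SYN": SYN,
    "NULL": 0,
    "FIN": FIN,
    "XMAS": FIN | PSH | URG,
    "ACK": ACK,
    "Window": ACK,        # same probe as ACK, only the RST window is inspected
    "Maimon": FIN | ACK,
}

FLAG_NAMES = [(FIN, "FIN"), (SYN, "SYN"), (RST, "RST"),
              (PSH, "PSH"), (ACK, "ACK"), (URG, "URG")]


def describe(flags: int) -> str:
    """Return the names of all flags set in the given flags byte."""
    names = [name for bit, name in FLAG_NAMES if flags & bit]
    return "+".join(names) if names else "none"


for scan, flags in SCAN_FLAGS.items():
    print(f"{scan:7s} 0x{flags:02x} {describe(flags)}")
```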

How can one perform internet-wide scanning with nmap? What is the performance?

stateful scanning
- -> nmap keeps state for every packet in transit
- catch timeouts and send retry packets…

Performance:

full scan from one system takes 10 days (4k IP addr / sec)
25 Amazon EC2 instances -> 25 hours (1.6k IP addr / sec per instance)
Typically 1 packet sent and 1 packet received per IP addr
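
A rough cross-check of these figures, assuming one probe per address and the full 2^32 address space (an upper bound, since reserved ranges are normally not scanned). The helper name `scan_seconds` is made up for this sketch.

```python
IPV4_SPACE = 2 ** 32  # upper bound: includes reserved/unrouted ranges


def scan_seconds(addresses: int, rate_per_host: float, hosts: int = 1) -> float:
    """Time to probe every address once at a constant per-host rate."""
    return addresses / (rate_per_host * hosts)


single_days = scan_seconds(IPV4_SPACE, 4_000) / 86_400
cluster_hours = scan_seconds(IPV4_SPACE, 1_600, hosts=25) / 3_600

print(f"1 host at 4k addr/s: {single_days:.1f} days")
print(f"25 hosts at 1.6k addr/s each: {cluster_hours:.1f} hours")
```

With the full 2^32 space this gives roughly 12 days and roughly 30 hours, so the 10 days and 25 hours above are plausible once unscanned ranges are subtracted.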

What is zmap?

adaptation of the nmap approach to Internet-wide scans
able to scan the entire IPv4 Internet in 45 minutes
can saturate 10 GBit/s links (initially 1 GBit/s) with scanning activity…

How does ZMAP do internet-wide scanning?

use TCP SYN or UDP payload scan to find open ports
-> input randomization to distribute the scan and not hit an individual network all at once…
can use multiple workers on different machines
- -> still scan each IP only once…
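
One way to get a random-looking order that still hits every target exactly once, even across workers, is to walk a cyclic multiplicative group modulo a prime, which is the idea the ZMap authors describe (there with a prime slightly above 2^32). The sketch below is a toy version for an 8-bit "address space"; `P = 257`, `G = 3`, and the function name are assumptions chosen for illustration.

```python
P = 257  # toy prime just above the 2**8 target space (illustrative assumption)
G = 3    # a primitive root modulo 257, so powers of G visit 1..256 exactly once


def shard_targets(shard: int, num_shards: int, start_exp: int = 0):
    """Yield the targets (0..255) assigned to one worker.

    Worker `shard` visits G**(start_exp + shard + k*num_shards) mod P,
    so the workers partition the group without any coordination.
    """
    step = pow(G, num_shards, P)
    x = pow(G, start_exp + shard, P)
    for _ in range(len(range(shard, P - 1, num_shards))):
        yield x - 1  # group elements are 1..256, targets are 0..255
        x = (x * step) % P


workers = [list(shard_targets(i, 3)) for i in range(3)]
print(workers[0][:6])  # scattered targets, not 0, 1, 2, ...
```

No worker needs to remember which targets were already scanned: only the current group element is kept, which fits the stateless design.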

What is the basic scanning approach of zmap?

stateless (compared to stateful in nmap… (considering the whole Internet…))
- no state kept for sent packets
- timeout detection not possible
- -> Identify responses belonging to the scan with:
  - IP ID = 54321
  - generate validation value from packet input (e.g. dest IP) using AES
  - store validation value in the packet to be sent (e.g. in the sequence number)
  - check validation value (e.g. acknowledgment number - 1) on the received packet
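
A minimal sketch of this stateless validation. The notes name AES; to stay within the standard library the sketch swaps in HMAC-SHA256 as the keyed function, which is an assumption and not what ZMap uses. A SYN-ACK acknowledges the probe's sequence number plus one, so the check is done on the acknowledgment number minus one. All function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress
import os
import struct

SCAN_KEY = os.urandom(16)  # per-scan secret, never sent on the wire


def validation(ip: str, port: int, key: bytes = SCAN_KEY) -> int:
    """32-bit value derived from the probed address and port."""
    msg = ipaddress.ip_address(ip).packed + struct.pack("!H", port)
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def probe_sequence_number(dst_ip: str, dst_port: int, key: bytes = SCAN_KEY) -> int:
    """Sequence number to place in the outgoing SYN."""
    return validation(dst_ip, dst_port, key)


def is_scan_response(src_ip: str, src_port: int, ack_number: int,
                     key: bytes = SCAN_KEY) -> bool:
    """Recompute the value from the responder's address; no per-probe state."""
    return (ack_number - 1) % 2 ** 32 == validation(src_ip, src_port, key)


seq = probe_sequence_number("192.0.2.1", 80)
print(is_scan_response("192.0.2.1", 80, (seq + 1) % 2 ** 32))
```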

How can zmap identify responses?

encode some information in the request (e.g. IP ID)
-> that should be echoed by the response…

further validation in other fields
- encode the probed IP in the TCP sequence number (-> i.e. acknowledgment number - 1 of the response should match the value derived from the IP address…)

How does ZMAP separate send and receive threads?

using raw sockets
-> directly use sockets to send and receive packets (-> bypass kernel TCP stack…)
-> no locking needed
=> send TCP SYN via raw socket
receive TCP SYN-ACK via raw socket
use TCP stack (kernel) to send TCP reset

What modules does ZMAP have?

separate probe and output modules
Probe module:
- implement scanning technique
- e.g. TCP SYN, SYN-ACK, UDP payload
output module
- implement processing and output of received responses
- e.g. IP address only, CSV, database….

What additional tools does ZMAP offer?

ZGrab
- stateful application-layer scanner
- e.g. for HTTPS, SSH, BACnet
ZDNS
- utility for fast DNS lookups
ZCrypto
- TLS and X.509 library
- certificate parsing and TLS handshake transcription

Does ZMAP support IPv6?

originally not -> only IPv4
=> ZMAPv6…
-> extends ZMAP with IPv6 capabilities

What are hitlists?

List of (scan) targets that are most likely responsive
list has feasible size…

With what does ZMAPv6 extend ZMAP?

adaptation of the scanning core to send and receive IPv6 packets
Port probe modules for IPv6 scanning
- ICMPv6
- TCP over IPv6
- UDP over IPv6

What are hitlists responsive to?

to at least one protocol
-> ICMP, HTTP,…
differs between addresses (not all addresses respond to the same protocol…)
changes over time…

What is considered in the feasibility of hitlists?

scan duration
bandwidth limitations

What are toplists?

type of hitlist
-> list of domains ranked by their popularity
=> ranked list of domains
=> popularity calculated by different measures
=> Normally one million entries

What are the most popular toplist types?

Alexa Top List
Majestic Million
Umbrella

What is the Alexa toplist?

Provided by Amazon
Based on HTTP requests
- Collected with a browser toolbar
- Depends on volunteers to install the toolbar
- Captures statistics about visited web pages
→ Strong focus towards web pages

What is the Majestic Million toplist?

Independent organization
Based on link metrics (similar to page rank…)
- Combination of outgoing and incoming links (hyperlink…)
- Collected by a web crawler
- Data updated several times a day
→ Focus towards web pages

What is the Umbrella toplist?

Provided by Cisco
Based on DNS requests to the Umbrella global Network (formerly OpenDNS)
Algorithm based on unique client IPs visiting a domain
Calculates Internet popularity independent of the port

→ No focus towards web traffic

What should be considered when using toplists?

Treat Top Lists carefully:

Frequent changes over time
Weekend effect
- Different user behavior changes lists on the weekend
- Focus towards entertainment and streaming on the weekend
Clustering Effect
- Large clusters with same rank
- Ordered alphabetically
Size is not always 1 million

Is there a way to rank prefixes instead of domains?

use prefix top lists…

What is Zipf's law?

Internet traffic is assumed to follow Zipf's law [7]
- A few sites consist of millions of pages, but millions of sites only contain a handful of pages.
- Millions of users flock to a few select sites, giving little attention to millions of others.
- w_k = weight of the object at rank k
- k = rank of object
- N = number of ranked objects
- s = slope of distribution
- s is set to 1 based on related work [8]

w_k = (1/k^s) / SUM(n=1 to N) (1/n^s)
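
The weight formula can be evaluated directly. A small sketch with s = 1 as in the notes; the function name `zipf_weights` is chosen for illustration.

```python
def zipf_weights(n, s=1.0):
    """Return the weights w_1..w_n, normalized so that they sum to 1."""
    norm = sum(1 / k ** s for k in range(1, n + 1))
    return [(1 / k ** s) / norm for k in range(1, n + 1)]


weights = zipf_weights(4)
for rank, w in enumerate(weights, start=1):
    print(rank, round(w, 3))
```

With s = 1 the rank-1 object gets twice the weight of rank 2 and three times the weight of rank 3, which is the "few sites get most of the attention" shape described above.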

How to construct prefix top lists using Zipf's law?

Aggregate top lists over a week
Collect A and AAAA records for domain based top lists
Assign Zipf weight of domain to IP addresses
Aggregate on prefixes and ASes
Useful for:
- Prefix prioritization
- Security impact assessment
prefixtoplists.net.in.tum.de
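
The aggregation steps can be sketched as follows. The inputs are hypothetical toy data (example domains, documentation prefixes). The notes do not specify how a domain's weight is handled when it resolves to several addresses, so counting it once per matching prefix is an assumption of this sketch.

```python
import ipaddress
from collections import defaultdict


def prefix_weights(domain_weights, dns_records, prefixes):
    """Sum the Zipf weights of domains per routed prefix.

    domain_weights: {domain: weight}, e.g. from a weekly aggregated top list
    dns_records:    {domain: [A and AAAA addresses as strings]}
    prefixes:       announced prefixes; the longest matching prefix wins
    """
    nets = sorted((ipaddress.ip_network(p) for p in prefixes),
                  key=lambda net: net.prefixlen, reverse=True)
    totals = defaultdict(float)
    for domain, weight in domain_weights.items():
        matched = set()
        for addr in dns_records.get(domain, []):
            ip = ipaddress.ip_address(addr)
            for net in nets:
                if ip.version == net.version and ip in net:
                    matched.add(str(net))
                    break
        for net in matched:  # assumption: a domain counts once per prefix
            totals[net] += weight
    return dict(totals)


example = prefix_weights(
    {"a.example": 0.5, "b.example": 0.3, "c.example": 0.2},
    {"a.example": ["192.0.2.10", "2001:db8::1"],
     "b.example": ["192.0.2.20"],
     "c.example": ["198.51.100.5"]},
    ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"],
)
print(example)
```

Aggregating on ASes would work the same way with a prefix-to-AS mapping in place of the prefix list.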

What is the difference between IPv4 and IPv6 lists?

-> IPv4 lists => full scan of the whole IPv4 address space

IPv6:

vast address space -> full scan not possible…
=> would take longer than the universe exists…
-> there is multitude of possible IPv6 hitlist sources and a lack of understanding of the sources
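
A back-of-the-envelope check of why brute force fails. The probe rate is derived from the 45-minute IPv4 figure in these notes; the constant for the age of the universe (about 13.8 billion years) and the helper name are assumptions of this sketch.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9

# Rate implied by scanning all 2**32 IPv4 addresses in 45 minutes.
RATE = 2 ** 32 / (45 * 60)  # probes per second


def scan_years(prefix_len: int, probes_per_second: float = RATE) -> float:
    """Years needed to probe every address in an IPv6 prefix once."""
    return 2 ** (128 - prefix_len) / probes_per_second / SECONDS_PER_YEAR


print(f"rate: {RATE:,.0f} probes/s")
print(f"single /64 subnet: {scan_years(64):.2e} years")
print(f"whole IPv6 space:  {scan_years(0):.2e} years")
```

Even a single /64 is far out of reach at this rate, which is why IPv6 scanning depends on hitlists.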

What is a solution to the problem of scanning IPv6?

=> Solution:

different approaches to create hitlists might suit different use cases
-> evaluate biases of hitlists and aliased prefixes
combine hitlists into a tailored IPv6 hitlist…

What are possible sources for the creation of an IPv6 hitlist?

List of addresses
List of domains
- ranked and unranked
active scans
machine learning

What are possible lists of addresses for IPv6 hitlists?

raw packet traces
- -> extract IPv6 addresses from live traffic
flow data (NetFlow, IPFIX)
- export flow data from routers and collect at measurement point
- extract IPv6 addresses from flow data
Traceroutes
- Often used for the analysis of network paths and structure
- Reveals addresses of hops on the path
- e.g. with Scamper

What are possible lists of domains for IPv6 hitlists?

A list of existing domains can be resolved into used addresses.

Unranked lists
Extracted from other datasets
Side products of other scans

→ Targets highly depend on the source

What are possible sources for unranked IPv6 list of domains?

DNS zone files
- Content of complete top-level domain name zone
- .com, .net, .org, . . . are available via contract with Verisign or paid services (e.g. premiumdrops.com)
- New gTLDs are available via ICANN’s Centralized Zone Data Service (CZDS)
Certificate Transparency (CT)
- Extract domains from Common Name, Subject Alternative Name entries of logged certificates
Rapid7 IPv4 rDNS
- Complete reverse DNS resolution of IPv4 addresses
- Published weekly on scans.io
Rapid7 DNS ANY
- Use domains gathered from other scans for DNS ANY scans
- Published weekly on scans.io
CAIDA IPv6 router DNS names
- rDNS resolution of IPv6 addresses obtained from traceroute measurements on the Ark measurement infrastructure
- Request access on caida.org

What is IPv6 rDNS walking?

example of active scan resulting in hitlist (IPv6)

How does rDNS walk work?

start at root ip6.arpa (reverse DNS zone for IPv6)
-> query first nibble value => e.g. 0…
in case NXDOMAIN is returned, prune the whole subtree…
else, descend into the subtree and query the first value of the next nibble…
descend until the full address is reached…
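
The walk can be sketched against a mock zone so that it runs without any network access. `make_mock_zone` stands in for a real resolver and is purely hypothetical; it answers "exists" for every name on the path to a known address and NXDOMAIN (False) otherwise, which is the behavior the pruning relies on.

```python
import ipaddress


def arpa_name(nibbles: str) -> str:
    """Reverse-DNS name for a nibble prefix, e.g. '20' -> '0.2.ip6.arpa.'"""
    return ".".join(reversed(nibbles)) + ".ip6.arpa." if nibbles else "ip6.arpa."


def make_mock_zone(addresses):
    """Hypothetical stand-in for a resolver; counts the queries it receives."""
    names = set()
    for addr in addresses:
        nibbles = ipaddress.IPv6Address(addr).exploded.replace(":", "")
        for depth in range(33):
            names.add(arpa_name(nibbles[:depth]))
    stats = {"queries": 0}

    def exists(name: str) -> bool:
        stats["queries"] += 1
        return name in names  # False plays the role of NXDOMAIN

    return exists, stats


def rdns_walk(exists, prefix: str = ""):
    """Depth-first walk over ip6.arpa, pruning subtrees that do not exist."""
    if len(prefix) == 32:  # 32 nibbles = one full address
        yield ipaddress.IPv6Address(int(prefix, 16))
        return
    for nibble in "0123456789abcdef":
        child = prefix + nibble
        if exists(arpa_name(child)):
            yield from rdns_walk(exists, child)


exists, stats = make_mock_zone(["2001:db8::1", "2001:db8::2", "2a00:1450::1"])
found = list(rdns_walk(exists))
print(found, stats["queries"])
```

Real name servers do not always answer empty non-terminals consistently, so a real walker needs extra handling that this sketch leaves out.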

How long does a full IPv6 scan using rDNS walk take? How much overhead?

query rate: 200 queries per name server
-> scan duration 7 to 10 days
large query overhead
- all 16 permutations of each nibble are queried
- -> majority replies are NXDOMAIN

What are the current results of rDNS walk?

1.2 million /64 prefixes
9 million addresses
addresses cover > 5k ASes
most popular ASes:
- Yandex
- KPN
- Yahoo

What is a useful side-result of rDNS walking?

one can see distribution of nibble values…
=> first nibble always 2
=> and patterns like ff:fe exist… (SLAAC)

How can one use machine learning to create IPv6 hitlists?

use addressing schemes found in existing datasets to learn about used IPv6 addresses
=> rely on responsive addresses as seed list

What patterns exist in IPv6 addresses that ML approaches can make use of?

MAC based IIDs -> ff:fe
servers with fixed schema
=> use them to learn new addresses
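
The ff:fe pattern comes from modified EUI-64 interface IDs, where the MAC address is split in half, ff:fe is inserted in the middle, and the universal/local bit is flipped. A sketch that detects the pattern and recovers the MAC; the function name is made up.

```python
import ipaddress


def mac_from_eui64(addr: str):
    """Return the embedded MAC address, or None if the IID is not EUI-64 style."""
    iid = ipaddress.IPv6Address(addr).packed[8:]  # lower 64 bits
    if iid[3:5] != b"\xff\xfe":
        return None
    # Undo the universal/local bit flip in the first byte, drop ff:fe.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


print(mac_from_eui64("2001:db8::211:22ff:fe33:4455"))
print(mac_from_eui64("2001:db8::1"))
```

Since the first three MAC bytes identify the vendor, addresses of this kind narrow the search space for further candidates considerably.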

How can ML make use of existing patterns (features…)?

Entropy/IP
- calculate entropy of address
- transform to Bayesian network model
- walk model to generate addresses
- e.g. distribution of used 1s and 0s
6GEN
- cluster addresses
=> basically good approach to extend hitlists with comparable responsiveness
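
The entropy step can be illustrated per nibble position: positions that never vary have entropy 0, positions that look random approach 1. This is only a sketch of the first step, assuming normalization by the 4-bit maximum; the Bayesian network and the address generation are not shown.

```python
import ipaddress
import math
from collections import Counter


def nibble_entropy(addresses):
    """Normalized Shannon entropy (0..1) for each of the 32 nibble positions."""
    rows = [ipaddress.IPv6Address(a).exploded.replace(":", "") for a in addresses]
    total = len(rows)
    result = []
    for pos in range(32):
        counts = Counter(row[pos] for row in rows)
        h = -sum(c / total * math.log2(c / total) for c in counts.values())
        result.append(h / 4)  # a nibble carries at most 4 bits
    return result


seeds = ["2001:db8::1", "2001:db8::2", "2001:db8::3", "2001:db8::4"]
entropy = nibble_entropy(seeds)
print([round(e, 2) for e in entropy])
```

Here only the last nibble varies, so a generator built on these seeds would keep the rest of the address fixed and enumerate values in that position.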

What is the target bias and how can it be used to create ML-based IPv6 hitlists?

evaluate the IID (interface ID) portion of IPv6 addresses to determine device type
=> traceroute contains routers
Router IP addresses are assigned mostly manually
=> most commonly only one bit of IID set to 1 -> e.g. ::1 for default gateway…

OR:

IXP (Internet exchange point) sources contain many client devices
=> clients make extensive use of IPv6 privacy extensions
=> central limit theorem applies -> sum of single-bit distributions approximates a normal distribution (number of set bits is normally distributed over the 64 bits…)

=> such considerations and biases apply to different types of devices and ASes
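
The difference between the two device classes shows up in the number of set bits in the IID. A sketch, assuming privacy-extension IIDs behave like 64 random bits; the function name and the sample size are chosen for illustration.

```python
import ipaddress
import random


def iid_hamming_weight(addr: str) -> int:
    """Number of bits set to 1 in the lower 64 bits (interface ID)."""
    iid = int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
    return bin(iid).count("1")


# Manually assigned router address: a single bit set.
print(iid_hamming_weight("2001:db8::1"))

# Simulated privacy-extension IIDs: weights cluster around 32.
rng = random.Random(1)
prefix = int(ipaddress.IPv6Address("2001:db8::"))
clients = [str(ipaddress.IPv6Address(prefix | rng.getrandbits(64)))
           for _ in range(1000)]
mean_weight = sum(iid_hamming_weight(a) for a in clients) / len(clients)
print(round(mean_weight, 1))
```

A histogram of these weights over a hitlist source gives a quick indication of whether it is dominated by routers and servers (low weights) or by clients (weights around 32).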

What can pose a problem in IPv6 hitlists?

Aliases…
=> different IPv6 addresses for the same host…
=> i.e. aliased prefixes -> whole prefix bound to the same host…
=> resulting in some hosts being overrepresented due to these aliased prefixes…

How can one detect aliased prefixes? What is it required for?

- multi-level pseudo-random probing
-> choose one nibble in the host part and vary it (e.g. assign all 16 values from 0 to f)
-> compare things such as initial TTL, TCP options, timestamps,… => fingerprinting…

=> crucial to reduce bias in IPv6 dataset…
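
Generating the probe targets for one prefix can be sketched like this: the nibble directly after the prefix takes all 16 values and the remaining host bits are filled pseudo-randomly. The function name and the fixed seed are assumptions; sending the probes and comparing fingerprints is not shown.

```python
import ipaddress
import random


def alias_probe_targets(prefix: str, seed: int = 0):
    """Return 16 addresses inside `prefix`, one per value of the next nibble."""
    net = ipaddress.ip_network(prefix)
    if net.version != 6 or net.prefixlen % 4 or net.prefixlen > 124:
        raise ValueError("expected a nibble-aligned IPv6 prefix up to /124")
    rng = random.Random(seed)
    rest_bits = 128 - net.prefixlen - 4  # host bits after the varied nibble
    base = int(net.network_address)
    targets = []
    for nibble in range(16):
        rest = rng.getrandbits(rest_bits) if rest_bits else 0
        targets.append(ipaddress.IPv6Address(base | (nibble << rest_bits) | rest))
    return targets


targets = alias_probe_targets("2001:db8:1:2::/64")
for t in targets[:4]:
    print(t)
```

If all 16 pseudo-random addresses answer, it is very unlikely that they are 16 distinct hosts, so the prefix is a candidate for an aliased prefix; repeating this at several prefix lengths gives the multi-level part.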

How should one filter hitlists before testing for responsiveness of addresses?

Multiple steps necessary
-

Why use randomization in zmap?

randomize the scanned IP address
-> to distribute the scanning over time instead of scanning the IP address range sequentially…
