Why is it important to measure the network?
Distributed multi-domain network
→ Information only partially available
Moving target
Requirements change
Growth, usage, structure changes
Highly interactive system
Heterogeneity in all directions
The total is more than the sum of its pieces
Built, driven, and used by humans
→ Errors, misconfigurations, flaws, failures, misuse,…
=> Active network measurements are an important research area to understand the Internet and interactions between all its components.
Why is network measurement important for network providers?
Manage traffic
Model reality
Predict future
Plan network
Avoid bottlenecks in advance
Reduce cost
Accounting
Why is network measurement important for service providers?
Get information about clients
Adjust service to demands
Reduce load on servers
Why is network measurement important for clients?
Get the best possible service
Do I get what I paid for?
Why is network measurement important for security?
Detect malicious traffic
Detect malicious hosts
Detect malicious networks
Why is network measurement important for researchers?
Understand the Internet better
Could our new routing algorithm handle all this real-world traffic?
...
What are some problems with measuring the internet?
Creates additional traffic
Creates load on routers and hosts
Might uncover personal information
Might be intrusive
What are some considerations regarding the problems of Internet measurements?
Scan with a moderate rate
Distribute the load as evenly as possible
Do not publish data without anonymization or access restrictions
Inform about the scanning behavior and react to complaints
What tools are widely used for internet measurements?
nmap
zmap
What is nmap? What modes of operation does it have?
network mapping
modes
host discovery
service discovery
os detection
execution of custom scripts
What scanning techniques does nmap have?
TCP raw socket scans
SYN -> find open ports
NULL/FIN/XMAS
have different flags set
NULL: no bit
FIN: only fin flag
XMAS: fin push urg flags set
ACK -> determine filtered/unfiltered ports in a firewall
window: same as ACK, but lists ports whose RST response has a window > 0 as open
Maimon: send FIN + ACK; according to the RFC, all hosts should respond with RST no matter whether the port is open or closed…
TCP connect scans
ICMP ping scans
UDP payload scans
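The flag combinations behind the raw TCP scan types above can be written down as bit masks. This is a minimal sketch using the standard TCP flag bit values; the `PROBE_FLAGS` table and the `describe` helper are illustrative names, not part of nmap.

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

# Flags set in the probe packet of each scan type (hypothetical table name).
PROBE_FLAGS = {
    "SYN": SYN,
    "NULL": 0,                # no flag set
    "FIN": FIN,               # only FIN
    "XMAS": FIN | PSH | URG,  # FIN, PSH and URG
    "ACK": ACK,
    "WINDOW": ACK,            # same probe as ACK, the RST is interpreted differently
    "MAIMON": FIN | ACK,
}


def describe(scan):
    """Return the names of the flags set in the probe of the given scan type."""
    names = {"FIN": FIN, "SYN": SYN, "RST": RST, "PSH": PSH, "ACK": ACK, "URG": URG}
    mask = PROBE_FLAGS[scan]
    return [name for name, bit in names.items() if mask & bit]
```

For example, `describe("XMAS")` lists the three flags that give the scan its name, while `describe("NULL")` returns an empty list.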
How can one perform internet-wide scanning with nmap? What is the performance?
stateful scanning
-> nmap keeps state for every packet in transit
catch timeouts and send retry packets…
Performance:
full scan from one system takes 10 days (4k IP addr / sec)
25 Amazon EC2 instances -> 25 hours (1.6k IP addr / sec per instance)
Typically 1 packet sent and 1 packet received per IP addr
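The two performance figures can be checked with simple arithmetic. The sketch below assumes roughly 3.7 billion routable IPv4 addresses (an assumption, the notes do not state the target count) and reads the 1.6k rate as a per-instance rate.

```python
# Assumption: rough number of routable IPv4 addresses (not given in the notes).
ROUTABLE_IPV4 = 3.7e9


def scan_seconds(rate_per_host, hosts=1, targets=ROUTABLE_IPV4):
    """Time for one pass over all targets at a fixed per-host probe rate."""
    return targets / (rate_per_host * hosts)


single_days = scan_seconds(4_000) / 86_400           # one system at 4k addr/s
ec2_hours = scan_seconds(1_600, hosts=25) / 3_600    # 25 instances at 1.6k addr/s each
```

Under these assumptions `single_days` comes out between 10 and 11 days and `ec2_hours` between 25 and 26 hours, which is consistent with the numbers above.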
What is zmap?
network scanner designed for Internet-wide scans (in contrast to the general-purpose nmap)
able to scan the entire IPv4 Internet in 45 minutes
can saturate 10 GBit/s links (initially 1 GBit/s) with scanning activity…
How does ZMAP do internet-wide scanning?
use TCP syn or UDP payload scan to find open ports
-> input randomization to distribute the scan and not hit an individual network all at once…
can use multiple workers on different machines
-> still scan each IP only once…
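One way to get a randomized order that still covers every address exactly once, even when split across workers, is to iterate over a cyclic multiplicative group. ZMap's published design does this modulo a prime slightly larger than 2^32; the sketch below shrinks the idea to an 8-bit "address space" with the prime 257 and the primitive root 3. The function name `shard_targets` and its parameters are hypothetical.

```python
P = 257   # small prime (2**8 + 1), stand-in for a prime just above 2**32
G = 3     # a primitive root modulo 257, so its powers hit 1..256 exactly once


def shard_targets(shard, num_shards, start_exp=0):
    """Yield the 8-bit 'addresses' assigned to one worker.

    Worker `shard` takes every `num_shards`-th element of the cycle
    G**start_exp, G**(start_exp + 1), ... so shards never overlap.
    """
    group_order = P - 1
    x = pow(G, start_exp + shard, P)
    step = pow(G, num_shards, P)
    for _ in range(shard, group_order, num_shards):
        yield x - 1            # map group element 1..256 to address 0..255
        x = (x * step) % P
```

With a single worker and `start_exp=0` the order starts 0, 2, 8, 26, so neighbouring addresses are not probed back to back. Choosing a random `start_exp` per scan changes where the cycle begins without changing the coverage guarantee.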
What is the basic scanning approach of zmap?
stateless (compared to stateful in nmap… (considering the whole Internet…))
no state for sent packets kept
timeout detection not possible
-> identify responses belonging to the scan with:
IP ID = 54321
generate validation based on packet input (e.g. dest IP) using AES
store validation in packet which will be sent (e.g. in sequence number)
check validation (e.g. acknowledgment number - 1) on received packet
How can zmap identify responses?
encode some information in the request (e.g. IP ID)
-> that should be echoed by the response…
further validation in other fields
encode a value derived from the destination IP in the TCP sequence number (-> i.e. acknowledgment number - 1 of the response should match it…)
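The stateless validation can be sketched as follows. Since the Python standard library has no AES, HMAC-SHA256 is used here as a stand-in keyed function (an assumption that differs from ZMap). The dictionary layout of the probe and all function names are hypothetical; the check relies on a SYN-ACK acknowledging the probe's sequence number plus one.

```python
import hashlib
import hmac
import ipaddress
import os

SCAN_KEY = os.urandom(16)   # fresh secret per scan


def validation(ip, key=SCAN_KEY):
    """Derive a 32-bit value from the target IP (HMAC stand-in for AES)."""
    mac = hmac.new(key, ipaddress.ip_address(ip).packed, hashlib.sha256).digest()
    return int.from_bytes(mac[:4], "big")


def build_probe(dst_ip):
    """Describe the SYN probe; nothing about it is stored by the scanner."""
    return {"dst": dst_ip, "ip_id": 54321, "seq": validation(dst_ip)}


def is_valid_response(src_ip, ack):
    """Recompute the value from the responder's IP and compare with ack - 1."""
    return (ack - 1) % 2**32 == validation(src_ip)
```

Because the expected value is recomputed from the response itself, the receiver needs no table of outstanding probes, which is what makes the sender and receiver independent.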
How does ZMAP separate send and receive threads?
using RAW sockets
-> directly use sockets to send and receive packets (-> bypass kernel TCP stack…)
-> no locking needed
=> send TCP syn via RAW socket
receive TCP syn ack via RAW socket
use TCP stack (kernel) to send TCP reset
What modules does ZMAP have?
separate probe and output modules
Probe module:
implement scanning technique
e.g. TCP SYN, SYN-ACK, UDP payload
output module
implement processing and output of received responses
e.g. IP address only, CSV, database….
What additional tools does ZMAP offer?
ZGrab
stateful application-layer scanner
e.g. for HTTPS, SSH, BACnet
ZDNS
utility for fast DNS lookups
ZCrypto
TLS and X.509 library
certificate parsing and TLS handshake transcription
Does ZMAP support IPv6?
originally not -> only IPv4
=> ZMAPv6…
-> extends ZMAP with IPv6 capabilities
What are hitlists?
List of (scan) targets that are most likely responsive
list has feasible size…
With what does ZMAPv6 extend ZMAP?
adaptation of scanning core to send and receive IPv6 packets
Port probe modules for IPv6 scanning
ICMPv6
TCP over IPv6
UDP over IPv6
What are hitlists responsive to?
to at least one protocol
-> ICMP, HTTP,…
differs between addresses (not all addresses respond to the same protocol…)
changes over time…
What is considered in the feasibility of hitlists?
scan duration
bandwidth limitations
What are toplists?
type of hitlist
-> list of domains ranked by their popularity
=> ranked list of domains
=> popularity calculated by different measures
=> Normally one million entries
What are the most popular toplist types?
Alexa top list
majestic million
umbrella
What is the alexa toplist?
Provided by Amazon
Based on HTTP requests
Collected with a browser toolbar
Depends on volunteers to install the toolbar
Captures statistics about visited web pages
→ Strong focus towards web pages
What is the Majestic Million toplist?
Independent organization
Based on link metrics (similar to PageRank…)
Combination of outgoing and incoming links (hyperlink…)
Collected by a web crawler
Data updated several times a day
→ Focus towards web pages
What is the umbrella toplist?
Provided by Cisco
Based on DNS requests to the Umbrella global Network (formerly OpenDNS)
Algorithm based on unique client IPs visiting a domain
Calculates Internet popularity independent of the port
→ No focus towards web traffic
What should be considered when using toplists?
Treat Top Lists carefully:
Frequent changes over time
Weekend effect
Different user behavior changes lists on the weekend
Focus towards entertainment and streaming on the weekend
Clustering Effect
Large clusters with same rank
Ordered alphabetically
Size is not always 1 million
Is there a way to rank prefixes instead of domains?
use prefix top lists…
What is Zipf's law?
Internet traffic is assumed to follow Zipf's law [7]
A few sites consist of millions of pages, but millions of sites only contain a handful of pages.
Millions of users flock to a few select sites, giving little attention to millions of others.
k = rank of object
N = total number of ranked objects
s = slope of distribution
s is set to 1 based on related work [8]
w_k = (1 / k^s) / SUM(n=1 to N) (1 / n^s)
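The weight formula translates directly into code. A minimal sketch, with `zipf_weights` as a hypothetical helper name and `n` standing for the number of ranked objects:

```python
def zipf_weights(n, s=1.0):
    """Return the normalized Zipf weights w_1 .. w_n for ranks 1 .. n."""
    norm = sum(1.0 / (i ** s) for i in range(1, n + 1))
    return [(1.0 / (k ** s)) / norm for k in range(1, n + 1)]
```

The weights sum to 1, and with s = 1 the object at rank 1 gets twice the weight of rank 2 and three times the weight of rank 3.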
How to construct prefix top lists using Zipf's law?
Aggregate top lists over a week
Collect A and AAAA records for domain based top lists
Assign Zipf weight of domain to IP addresses
Aggregate on prefixes and ASes
Useful for:
Prefix prioritization
Security impact assessment
prefixtoplists.net.in.tum.de
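The construction steps can be sketched in a few lines. The input data (ranked domains, resolved A/AAAA records, prefix list) is assumed to be given, the weekly aggregation and the AS level are left out, and each domain's weight is counted once per prefix it resolves into, which is one possible reading of the weight assignment step. The function name is hypothetical.

```python
import ipaddress
from collections import defaultdict


def prefix_toplist(ranked_domains, resolved, prefixes, s=1.0):
    """Rank prefixes by the summed Zipf weight of the domains hosted in them."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    norm = sum(1.0 / n ** s for n in range(1, len(ranked_domains) + 1))
    scores = defaultdict(float)
    for rank, domain in enumerate(ranked_domains, start=1):
        weight = (1.0 / rank ** s) / norm
        hit = set()
        for addr in resolved.get(domain, ()):
            ip = ipaddress.ip_address(addr)
            covering = [n for n in networks if n.version == ip.version and ip in n]
            if covering:
                # Longest-prefix match decides which prefix gets the weight.
                hit.add(max(covering, key=lambda n: n.prefixlen))
        for net in hit:
            scores[str(net)] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A prefix hosting several highly ranked domains ends up above one hosting a single lower-ranked domain, which is the property needed for prefix prioritization.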
What is the difference between IPv4 and IPv6 lists?
-> IPv4 lists => full scan of the whole IPv4 address space
IPv6:
vast address space -> full scan not possible…
=> would take longer than the universe exists…
-> there is a multitude of possible IPv6 hitlist sources and a lack of understanding of these sources
What is a solution to the problem of scanning IPv6?
=> Solution:
different approaches to create hitlists might suit different use cases
-> evaluate biases of hitlists and aliased prefixes
combine hitlists into a tailored IPv6 hitlist…
What are possible sources for the creation of an IPv6 hitlist?
List of addresses
List of domains
ranked and unranked
active scans
machine learning
What are possible lists of addresses for IPv6 hitlists?
raw packet traces
-> extract IPv6 addresses from live traffic
flow data (NetFlow, IPFIX)
export flow data from routers and collect at measurement point
extract IPv6 addresses from flow data
Traceroutes
Often used for the analysis of network paths and structure
Reveals addresses of hops on the path
e.g. with Scamper
What are possible lists of domains for IPv6 hitlists?
A list of existing domains can be resolved into used addresses.
Unranked lists
Extracted from other datasets
Side products of other scans
→ Targets highly depend on the source
What are possible sources for unranked IPv6 list of domains?
DNS zone files
Content of complete top-level domain name zone
.com, .net, .org, . . . are available via contract with Verisign or paid services (e.g. premiumdrops.com)
New gTLDs are available via ICANN’s Centralized Zone Data Service (CZDS)
Certificate Transparency (CT)
Extract domains from Common Name, Subject Alternative Name entries of logged certificates
Rapid7 IPv4 rDNS
Complete reverse DNS resolution of IPv4 addresses
Published weekly on scans.io
Rapid7 DNS ANY
Use domains gathered from other scans for DNS ANY scans
CAIDA IPv6 router DNS names
rDNS resolution of IPv6 addresses obtained from traceroute measurements on the Ark measurement infrastructure
Request access on caida.org
What is IPv6 rDNS walking?
example of active scan resulting in hitlist (IPv6)
How does rDNS walk work?
start at root ip6.arpa (reverse DNS lookup for IPv6)
-> query first nibble value => e.g. 0…
in case NXDOMAIN is returned, prune whole subtree…
else, descend into subtree and query first value of next nibble…
descend until full address is reached…
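The walk is a depth-first search over the nibble tree with NXDOMAIN as the pruning signal. The sketch below replaces real DNS queries with a simulated name server built from a set of known addresses, so it runs offline; `make_fake_resolver`, `rdns_walk` and `arpa_name` are hypothetical names.

```python
import ipaddress

HEX = "0123456789abcdef"


def arpa_name(nibbles):
    """Query name for a (partial) nibble prefix, e.g. ('2', '0') -> 0.2.ip6.arpa."""
    return ".".join(list(reversed(nibbles)) + ["ip6", "arpa"])


def make_fake_resolver(known_addresses):
    """Simulated name server: a name exists if a known address lies below it."""
    rows = {ipaddress.IPv6Address(a).exploded.replace(":", "") for a in known_addresses}
    stats = {"queries": 0}

    def exists(nibbles):
        stats["queries"] += 1
        prefix = "".join(nibbles)
        return any(row.startswith(prefix) for row in rows)

    return exists, stats


def rdns_walk(exists, prefix=()):
    """Yield all full addresses; a negative answer (NXDOMAIN) prunes the subtree."""
    if len(prefix) == 32:
        yield ipaddress.IPv6Address(int("".join(prefix), 16))
        return
    for nibble in HEX:
        candidate = prefix + (nibble,)
        if exists(candidate):
            yield from rdns_walk(exists, candidate)
```

The `stats["queries"]` counter makes the overhead visible: every visited tree node costs 16 queries, most of which are answered negatively.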
How long does a full IPv6 scan using rDNS walk take? How much overhead?
query rate: 200 queries per name server
-> scan duration 7 to 10 days
large query overhead
all 16 permutations of each nibble are queried
-> majority replies are NXDOMAIN
What are the current results of rDNS walk?
1.2 million /64 prefixes
9 million addresses
addresses cover > 5k ASes
most popular ASes:
yandex
KPN
yahoo
What is a useful side-result of rDNS walking?
one can see distribution of nibble values…
=> first nibble always 2
=> and patterns like ff:fe exist… (SLAAC)
How can one use machine learning to create IPv6 hitlists?
use addressing schemes found in existing datasets to learn about used IPv6 addresses
=> rely on responsive addresses as seed list
What patterns exist in IPv6 addresses that ML approaches can make use of?
MAC based IIDs -> ff:fe
servers with fixed schema
=> use them to learn new addresses
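The ff:fe pattern is easy to test for: in a MAC-based (modified EUI-64) interface ID the two bytes ff and fe sit in the middle of the lower 64 bits, and the universal/local bit of the first byte is flipped. A minimal sketch with a hypothetical helper name:

```python
import ipaddress


def eui64_mac(addr):
    """Return the embedded MAC address if the IID carries the ff:fe marker, else None."""
    iid = ipaddress.IPv6Address(addr).packed[8:]   # lower 64 bits
    if iid[3] != 0xFF or iid[4] != 0xFE:
        return None
    # Undo the flipped universal/local bit and drop the ff:fe filler.
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)
```

An address such as `2001:db8::211:22ff:fe33:4455` yields the MAC `00:11:22:33:44:55`, while a manually assigned address like `2001:db8::1` returns `None`. A random IID can contain ff:fe by chance, so the marker is a strong hint rather than proof.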
How can ML make use of existing pattern (features…)?
Entropy/IP
calculate entropy of address
transform to Bayesian network model
-> walk model to generate addresses
e.g. distribution of used 1s and 0s
6GEN
cluster addresses
=> basically good approach to extend hitlists with comparable responsiveness
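The entropy step can be illustrated per nibble position: positions that never change across a seed list have entropy 0, positions that look random approach 1. This is only a sketch of that first step under the assumption that entropy is taken per hex character; the Bayesian network and the address generation are not shown, and `nibble_entropy` is a hypothetical name.

```python
import ipaddress
import math
from collections import Counter


def nibble_entropy(addresses):
    """Normalized Shannon entropy (0..1) for each of the 32 nibble positions."""
    rows = [ipaddress.IPv6Address(a).exploded.replace(":", "") for a in addresses]
    total = len(rows)
    result = []
    for pos in range(32):
        counts = Counter(row[pos] for row in rows)
        h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        result.append(h / 4.0)   # a nibble carries at most 4 bits
    return result
```

Low-entropy positions mark the fixed part of an addressing scheme, high-entropy positions mark where new candidate addresses have to be searched.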
What is the target bias and how can it be used to create ML-based IPv6 hitlists?
evaluate the IID (interface ID) portion of IPv6 addresses to determine device type
=> traceroute contains routers
Router IP addresses are assigned mostly manually
=> most commonly only one bit of IID set to 1 -> e.g. ::1 for default gateway…
OR:
IXP (Internet exchange point) sources contain many client devices
=> clients make extensive use of IPv6 privacy extensions
=> central limit theorem applies -> sum of single-bit distributions approximates a normal distribution (the number of set bits is approximately normally distributed over the 64 bits…)
=> such considerations and biases apply to different types of devices and ASes
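Both observations come down to the number of set bits (Hamming weight) in the interface ID. The sketch below computes that weight and, as an idealized stand-in for privacy-extension addresses, draws fully random 64-bit IIDs; the function names and the fixed seed are assumptions for illustration.

```python
import ipaddress
import random


def iid_hamming_weight(addr):
    """Number of bits set to 1 in the lower 64 bits (interface ID) of an address."""
    iid = int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
    return bin(iid).count("1")


def random_iid_weights(count, seed=1):
    """Hamming weights of `count` uniformly random 64-bit IIDs (idealized clients)."""
    rng = random.Random(seed)
    return [bin(rng.getrandbits(64)).count("1") for _ in range(count)]
```

A router-style address such as `2001:db8::1` has weight 1, whereas the random IIDs cluster around 32 set bits, so a histogram of weights per source hints at which device types dominate it.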
What can pose a problem in IPv6 hitlists?
Aliases..
=> different IPv6 address for same host…
=> i.e. aliased prefixes -> whole prefix bound to same host….
=> resulting in some hosts being overrepresented due to these aliased prefixes….
How can one detect aliased prefixes? What is it required for?
- multi-level pseudo-random probing
-> choose one nibble in the host part and vary it (e.g. assign all 16 values from 0 to f)
-> compare things such as initial TTL, TCP options, timestamps,…. => fingerprinting…
=> crucial to reduce bias in IPv6 dataset…
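Target generation for one prefix can be sketched as follows: the nibble directly after the prefix takes each of the 16 values once and the remaining host bits are filled pseudo-randomly. Treating a prefix as aliased when all 16 targets answer is an assumption of this sketch; the fingerprint comparison (initial TTL, TCP options, timestamps) is not implemented, and `responds` is a hypothetical callback standing in for an actual probe.

```python
import ipaddress
import random


def apd_targets(prefix, seed=0):
    """Return 16 addresses in `prefix`, one per value of the first host nibble."""
    net = ipaddress.IPv6Network(prefix)
    host_bits = 128 - net.prefixlen
    if host_bits < 4:
        raise ValueError("prefix too long to vary a full nibble")
    rng = random.Random(seed)
    base = int(net.network_address)
    targets = []
    for nibble in range(16):
        rest = rng.getrandbits(host_bits - 4) if host_bits > 4 else 0
        host = (nibble << (host_bits - 4)) | rest
        targets.append(ipaddress.IPv6Address(base + host))
    return targets


def looks_aliased(prefix, responds):
    """Flag the prefix if every pseudo-random target answers the probe."""
    return all(responds(addr) for addr in apd_targets(prefix))
```

Since a randomly chosen address in a non-aliased prefix is very unlikely to be in use, 16 out of 16 responses are hard to explain by anything other than a single host answering for the whole prefix.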
How should one filter hitlists before testing for responsiveness of addresses?
Multiple steps necessary
-
Why use randomization in zmap?
randomize the scanned IP address
-> to distribute the scanning over time instead of scanning the IP address range sequentially…