2/4/08
OSDI '04
CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Vivek Pai, Larry Peterson, Zhe Wang
Princeton University
Domain Name System (DNS)
Human-friendly names → IP addresses
Operational for over 20 years
Essential part of the Web
Two components
Server-side: name owners
Client-side: contacting name owners
Two Kinds of DNS Problems
Server-side problems [Danzig92], [Jung01]
Nameserver bugs
Misconfigurations
Hardening/replacing server
infrastructure
Client-side problems
Between local nameservers (LDNS) and clients
Larger memories = higher LDNS hit rate
LDNS cache hit rate: 80-90%
Result: LDNS problems magnified
Contributions
Measure LDNS problems, causes
Client-side DNS helper, CoDNS
Communicates with other CoDNS peers
Incrementally deployable
Works with all DNS lookups (CDN, etc.)
Benefits
Latency reduction: 27-82%
Availability: generally adds an extra "9"
Local DNS Lookup Problems
Local DNS lookup failures
5+ second delays for cached records
Frequent & widely-distributed*
Unpredictable service
Directly affects user-perceived latency
Random delays in web access
Kills HTTP proxies, web services, and busy mail servers
Demonstrating Local Problems
Local name lookup every 6 seconds
"yy.domain" on xxx.domain at 200 sites
"planetlab-2.cs.princeton.edu" for planetlab-1.cs.princeton.edu
Lookup should be handled locally
LDNS is site-shared, NOT PlanetLab's
Failure criteria
5+ seconds of latency
zero answer
Rolling average of the past 100 queries
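The failure criteria above can be sketched as a small tracker. This is only an illustration: the class and function names are hypothetical, and "zero answer" is assumed here to mean a response that carries no answer records.

```python
from collections import deque

FAILURE_LATENCY_S = 5.0  # "5+ seconds of latency" from the slide
WINDOW = 100             # "rolling average of the past 100 queries"


def is_failure(latency_s, num_answers):
    """A lookup counts as failed if it is slow or returns no answers."""
    return latency_s >= FAILURE_LATENCY_S or num_answers == 0


class RollingFailureRate:
    """Fraction of failed lookups among the most recent `window` probes."""

    def __init__(self, window=WINDOW):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def record(self, latency_s, num_answers):
        self.samples.append(is_failure(latency_s, num_answers))

    def rate(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)
```

A probe loop would call `record()` once per lookup (every 6 seconds in the experiment) and plot `rate()` over time.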
Expected DNS Behavior
[Figure panels: University of Utah; Rice University]
DNS Failure on Various Nodes
[Figure panels: Cornell; Texas A&M; University of Oregon]
Possible Causes
Packet loss
LDNS overloading
Cron jobs
Maintenance problems
Packet Loss
UDP inherently unreliable
Single loss triggers query retransmission
Less than ~0.1% in LAN environment
Heavily dependent on local traffic
Losses last for ~1 min
Cable modem/DSL may be worse
Our sites have ~4 LAN hops, cable ~8
Nameserver Overloading
[Figure panels: University of Michigan; University of Torino, Italy; Technical University Berlin, Germany; time axis marked at 8 am and 6 pm]
Nameserver Overloading
Many responses take 1 to 5 seconds
Not a timeout, simply late
Pr(Overloading | DNS Failure) = 90% for some nodes
Bursts cause socket buffer overflow
Experiment in the paper
Cron Jobs / Heavy Processes
[Figure panels: University of Tennessee 1; University of Tennessee 2; Moscow State University]
Not a client problem!
Why Do We See This?
Large memory → large cache
Large cache → high hit rate
High hit rate → CPU load drops
Low CPU load → add more services
More services → memory pressure
Memory pressure → failures, delays
Maintenance Problems
/etc/resolv.conf
Configured to dead nameservers
Blocking services
Outside the firewall
Complete outage
Berkeley Millennium nodes, 3/17/2004
Blackout / natural disaster
Duke hit by Hurricane Isabel, Fall 2003
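As an illustration of the /etc/resolv.conf point, the sketch below lists the nameservers a host is configured to use, so that each one could then be probed for liveness. The function name is hypothetical and the parsing is deliberately minimal (it only looks at `nameserver` lines and strips `#` and `;` comments).

```python
def configured_nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments introduced by '#' or ';', then surrounding whitespace.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers
```

On a real host the input would be the contents of /etc/resolv.conf; a stale entry in the returned list is the kind of misconfiguration the slide describes.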
Solution: CoDNS
[Diagram labels: Wide Area Network (WAN); My LAN (My Machine, Client Programs, CoDNS, LDNS); remote LAN (CoDNS, LDNS); remote query; remote answer]
CoDNS: Cooperative DNS
Cooperative name lookup scheme
If local server OK, use local server
When failing, ask peers to do lookup
Insurance model
Share risk, share benefits
Aggregate name lookup service
Aggregate cache effect
Incrementally deployable, no server change
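The "use local if OK, otherwise ask peers" rule can be modeled with a small latency calculation. This is a simplified sketch, not the CoDNS implementation: it assumes a single peer, that the first answer to arrive wins, and that the peer is asked only after the local server has been silent for `remote_delay_ms`. All names are hypothetical.

```python
def effective_latency_ms(local_ms, peer_ms, remote_delay_ms):
    """Latency the client sees when the first answer to arrive is used.

    local_ms: time for the local nameserver to answer (None = never answers).
    peer_ms: time for the peer to answer once asked (None = never answers).
    remote_delay_ms: how long to wait on the local server before asking the peer.
    """
    candidates = []
    if local_ms is not None:
        candidates.append(local_ms)
    # The peer is consulted only if the local answer has not arrived in time.
    local_is_late = local_ms is None or local_ms > remote_delay_ms
    if peer_ms is not None and local_is_late:
        candidates.append(remote_delay_ms + peer_ms)
    return min(candidates) if candidates else None
```

With a healthy local server the peer is never contacted; with a 5-second local stall, an 80 ms peer and a 200 ms delay, the client would wait 280 ms instead of 5000 ms under this model.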
Design Issues
Proximity / liveness
Select nearby peers
Monitors nameserver's health as well
Request locality
Pick same peer for same names
Highest Random Weight (HRW)
Remote request timeout
Dynamically adjusted to local server's health
Exponentially backed off for each remote query
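Highest Random Weight (HRW) hashing gives the "same peer for same names" property: every (peer, name) pair is hashed, and the peer with the largest hash value is chosen. The sketch below is a generic HRW example; the use of SHA-1, the key format, and the peer names are assumptions made for illustration, not details from the talk.

```python
import hashlib


def hrw_pick(name, peers):
    """Pick the peer with the highest hash weight for this DNS name."""
    key = name.lower()  # DNS names are case-insensitive

    def weight(peer):
        digest = hashlib.sha1(f"{peer}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(peers, key=weight)


peers = ["peer-a.example", "peer-b.example", "peer-c.example"]
chosen = hrw_pick("www.example.com", peers)
```

Because each peer's weight depends only on that peer and the name, removing some other peer from the list does not change the choice, which keeps repeated lookups of one name landing on the same peer's cache.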
How Many Peers Needed?
One extra peer halves avg response time!
[Figure axis: Average Response Time]
Effect of Timeout
[Figure axis: Average Number of Lookups]
200 ms: slope changes
500 ms: virtually flat
Deployment Status
CoDNS deployed on all PlanetLab nodes
Running 24/7 since August 2003
CoDeeN uses CoDNS as primary DNS
After CoDeeN's own DNS cache
Remote query configuration
One extra peer, 200 ms starting timeout
On total LDNS failure, send immediately
Monitor 10 nodes as neighbors
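The remote query configuration can be read as a schedule of when peers are asked. The sketch below is one interpretation, not the deployed logic: it assumes the wait before each further remote query doubles (one reading of "exponentially backed off"), and that a total LDNS failure makes the first remote query go out at time zero. The function and parameter names are hypothetical.

```python
def remote_query_offsets_ms(start_timeout_ms=200, extra_peers=1, ldns_down=False):
    """Offsets (ms after the client request) at which remote queries are sent."""
    offsets = []
    elapsed = 0
    wait = start_timeout_ms
    for i in range(extra_peers):
        if i == 0 and ldns_down:
            # Local nameserver known to be dead: do not wait at all.
            pass
        else:
            elapsed += wait
            wait *= 2  # assumed doubling between successive remote queries
        offsets.append(elapsed)
    return offsets
```

With the slide's settings (one extra peer, 200 ms starting timeout) the schedule has a single entry; the `extra_peers` parameter only shows how the assumed backoff would extend it.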
Evaluation
Live traffic for one week for CoDeeN (20k-30k)
Finer-grained View
Live traffic for one day
Effectively flattens the spikes
Cache miss + WAN problem
[Figure legend: LDNS, CoDNS]
Availability
Adds one "9": from 99% to 99.9%
[Figure: availability (%) of CoDNS vs. LDNS, nodes sorted by LDNS availability]
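"Adding a 9" can be made concrete by counting nines as the negative base-10 logarithm of the unavailability. This is the common convention for the phrase, shown as an illustrative sketch; the function name is hypothetical.

```python
import math


def nines(availability):
    """Number of nines in an availability fraction, e.g. 0.99 is about 2."""
    if availability >= 1.0:
        return math.inf  # no downtime observed at all
    return -math.log10(1.0 - availability)


gain = nines(0.999) - nines(0.99)  # roughly one extra nine
```

Going from 99% to 99.9% cuts the unavailable fraction by a factor of ten, which is what one additional nine means under this convention.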
What About CDNs?
CDN uses DNS to pick "best" replica
CoDNS used only when LDNS failing
Pro: faster lookup time
Con: maybe worse/farther replica
In reality, peer's answer is better 30% of the time
CDN Pro/Con Measurements
Overhead
Heartbeat packet: 1/sec
Memory: 600 KB
Remote queries: median 25% more lookups
CoDNS Alternatives
In the paper:
Private Nameservers
Secondary Nameservers
TCP Queries
Conclusion
Local failures relatively frequent
Failure time dominates latency
CoDNS provides low-cost "insurance" service
Masks local failures
Reduces avg response time 27-82%
Improves availability by additional "9"
Incrementally deployable, no server change
More Information
CoDNS homepage:
http://codeen.cs.princeton.edu/codns/
Email:
princeton_codeen@slices.planet-lab.org
TCP Queries
DNS supports TCP
Failure rate is better
Not used except for AXFR or when the answer is big
Simple TCP
2 packets (UDP) vs. 9 packets (TCP: 3 + 2 + 4 = 9)
Persistent TCP
ACK overhead
Resource waste for idle connections
Vulnerable to overloading/server down
S-TCP, P-TCP, UDP, CoDNS
Replay test (10792 names) on 107 nodes
CoDNS First
CoDNS vs. Persistent TCP
[Figure axis: Average Response Time (ms)]
Lookup Distribution
Live traffic on a node for one week (20333 queries)
2043135 ms / 5809265 ms = 35.1%
100 ms vs. 286 ms per query
Great improvement on W-CDF
5.5% → 0.06%
76% → 17.8%
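The totals on this slide can be checked with a few lines of arithmetic. The slide does not label the two totals, so treating the smaller one as CoDNS and the larger one as LDNS is an assumption, as are the variable names.

```python
QUERIES = 20333              # queries in the one-week trace
CODNS_TOTAL_MS = 2043135     # assumed: total lookup time with CoDNS
LDNS_TOTAL_MS = 5809265      # assumed: total lookup time with LDNS only

ratio = CODNS_TOTAL_MS / LDNS_TOTAL_MS          # roughly 35%
codns_per_query_ms = CODNS_TOTAL_MS / QUERIES   # about 100 ms
ldns_per_query_ms = LDNS_TOTAL_MS / QUERIES     # about 286 ms

print(f"ratio = {ratio:.3f}")
print(f"per query: {codns_per_query_ms:.0f} ms vs. {ldns_per_query_ms:.0f} ms")
```

The per-query averages follow directly from dividing each total by the 20333 queries.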
Analysis on Wins
80% at first query, 95% at second query
[Figure axis: Percentage]