2/4/08
OSDI '04
1
CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups
KyoungSoo Park, Vivek Pai, Larry Peterson, Zhe Wang
Princeton University
Domain Name System (DNS)
- Human-friendly names -> IP addresses
- Operational for over 20 years
- Essential part of the Web
- Two components
  - Server-side: name owners
  - Client-side: contacting name owners
Two Kinds of DNS Problems
- Server-side problems [Danzig92], [Jung01]
  - Nameserver bugs
  - Misconfigurations
  - Hardening/replacing server infrastructure
- Client-side problems
  - Between local nameservers (LDNS) and clients
  - Larger memories = higher LDNS hit rate
  - LDNS cache hit rate: 80 ~ 90%
  - Result: LDNS problems magnified
Contributions
- Measure LDNS problems, causes
- Client-side DNS helper, CoDNS
  - Communicates with other CoDNS peers
  - Incrementally deployable
  - Works with all DNS lookups (CDN, etc.)
- Benefits
  - Latency reduction: 27-82%
  - Availability: generally adds an extra "9"
Local DNS Lookup Problems
- Local DNS lookup failures
  - 5+ seconds delay for cached records
  - Frequent & widely-distributed*
- Unpredictable service
  - Directly affects user-perceived latency
  - Random delays in web access
  - Kills HTTP proxies, web services, and busy mail servers
Demonstrating Local Problems
- Local name lookup every 6 seconds
  - "yyy.domain" on xxx.domain at 200 sites
  - "planetlab-2.cs.princeton.edu" for planetlab-1.cs.princeton.edu
  - Lookup should be handled locally
  - LDNS is site-shared, NOT PlanetLab's
- Failure criteria
  - 5+ seconds of latency
  - Zero answer
  - Rolling average of the past 100 queries
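The failure criteria above can be sketched as a small rolling-window monitor. This is only an illustration: the class name `FailureMonitor` and its methods are hypothetical and not taken from the CoDNS code; only the 5-second threshold, the zero-answer rule, and the 100-query window come from the slide.

```python
from collections import deque

FAILURE_LATENCY_S = 5.0   # slide's threshold: 5+ seconds counts as a failure
WINDOW = 100              # slide's window: the past 100 queries


class FailureMonitor:
    """Tracks the failure rate over the most recent lookups (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        # Oldest results fall off automatically once the window is full.
        self.results = deque(maxlen=window)  # True = failed lookup

    def record(self, latency_s, num_answers):
        # A lookup fails if it is too slow or returns no answer at all.
        failed = latency_s >= FAILURE_LATENCY_S or num_answers == 0
        self.results.append(failed)

    def failure_rate(self):
        # Fraction of failed lookups in the current window.
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)
```

A measurement loop would call `record()` once per probe (every 6 seconds in the experiment) and plot `failure_rate()` over time.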
Expected DNS Behavior
[Figure panels: University of Utah; Rice University]
DNS Failure on Various Nodes
[Figure panels: Cornell; Texas A&M; University of Oregon]
Possible Causes
- Packet loss
- LDNS overloading
- Cron jobs
- Maintenance problems
Packet Loss
- UDP inherently unreliable
  - Single loss triggers query retransmission
- Less than ~0.1% in LAN environment
  - Heavily dependent on local traffic
  - Losses last for ~1 min
- Cable modem/DSL may be worse
  - Our sites have ~4 LAN hops, cable ~8
Nameserver Overloading
[Figure panels: University of Michigan; University of Torino, Italy; Technical University Berlin, Germany. Time axis marked at 8 am and 6 pm]
Nameserver Overloading
- Many responses take 1 ~ 5 sec
  - Not timed out, simply late
- Pr(Overloading | DNS Failure) = 90% for some nodes
- Bursts cause socket buffer overflow
  - Experiment in the paper
Cron jobs/heavy processes
[Figure panels: University of Tennessee 1; University of Tennessee 2; Moscow State University]
Not a client problem!
Why Do We See This?
- Large memory -> large cache
- Large cache -> high hit rate
- High hit rate -> CPU load drops
- Low CPU load -> add more services
- More services -> memory pressure
- Memory pressure -> failures, delays
Maintenance Problems
- /etc/resolv.conf
  - Configured to dead nameservers
  - Blocking services
  - Outside the firewall
- Complete outage
  - Berkeley Millennium nodes, 3/17/2004
- Blackout / natural disaster
  - Duke hit by Hurricane Isabel, Fall 2003
Solution: CoDNS
[Diagram: Client Programs and CoDNS on My Machine, with an LDNS on My LAN; a peer CoDNS and its LDNS on another LAN across the Wide Area Network (WAN); arrows labeled "remote query" and "remote answer"]
CoDNS: Cooperative DNS
- Cooperative name lookup scheme
  - If local server OK, use local server
  - When failing, ask peers to do lookup
- Insurance model
  - Share risk, share benefits
- Aggregate name lookup service
  - Aggregate cache effect
- Incrementally deployable, no server change
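The "use local if OK, ask peers when failing" scheme can be sketched roughly as follows. This is a minimal illustration, not the CoDNS implementation: `cooperative_lookup`, `local_resolver`, and `peer_resolver` are hypothetical names, the resolvers are assumed to be async callables, and the default `initial_delay` of 0.2 s is an assumption for the sketch.

```python
import asyncio


async def cooperative_lookup(name, local_resolver, peer_resolver, initial_delay=0.2):
    """Return the first answer from the local resolver or, if it is slow, a peer."""
    local = asyncio.ensure_future(local_resolver(name))
    # Give the local nameserver a head start; healthy lookups finish here.
    done, _ = await asyncio.wait({local}, timeout=initial_delay)
    if done and local.exception() is None:
        return local.result()
    # Local lookup is late or failed: also ask a peer, take whichever answers first.
    remote = asyncio.ensure_future(peer_resolver(name))
    pending = {remote} if done else {local, remote}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise LookupError(f"no resolver answered for {name}")
    finally:
        # Drop whichever lookup is still outstanding.
        for task in pending:
            task.cancel()
```

The local query is never abandoned when the peer is asked, so a late local answer can still win the race; the peer only acts as insurance.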
Design Issues
- Proximity / liveness
  - Select nearby peers
  - Monitors nameserver's health as well
- Request locality
  - Pick same peer for same names
  - Highest Random Weight (HRW)
- Remote request timeout
  - Dynamically adjusted to local server's health
  - Exponentially backed off for each remote query
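Request locality via Highest Random Weight hashing can be illustrated with a short sketch. The function names, the SHA-1 hash, and the `peer|name` key format are assumptions chosen for the example; the slides only say that HRW is used to pick the same peer for the same name.

```python
import hashlib


def hrw_score(peer, name):
    """Deterministic pseudo-random weight for a (peer, name) pair."""
    digest = hashlib.sha1(f"{peer}|{name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_peers(name, live_peers, count=1):
    """Return the `count` live peers with the highest weight for this name."""
    ranked = sorted(live_peers, key=lambda p: hrw_score(p, name), reverse=True)
    return ranked[:count]
```

Because each peer's weight depends only on the peer and the name, every node that sees the same set of live peers picks the same peer for a given name, and a peer dropping out only remaps the names that were assigned to it.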
How many peers needed?
One extra peer halves avg response time!
[Figure axis: Average Response Time]
Effect of Timeout
[Figure axis: Average Number of Lookups]
- 200 ms: slope changes
- 500 ms: virtually flat
Deployment Status
- CoDNS deployed on all PlanetLab nodes
  - Running 24/7 since August 2003
- CoDeeN uses CoDNS as primary DNS
  - After CoDeeN's own DNS cache
- Remote query configuration
  - One extra peer, 200 ms starting timeout
  - On total LDNS failure, send immediately
  - Monitor 10 nodes as neighbors
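One way to read the remote query configuration is as a schedule of send times. The sketch below is an assumption-heavy illustration: `remote_query_delays` is a hypothetical helper, and the doubling rule is only one possible interpretation of exponential backoff between remote queries; the slides give the 200 ms starting timeout, the single extra peer, and the send-immediately rule on total LDNS failure.

```python
def remote_query_delays(num_peers=1, start_ms=200, ldns_down=False):
    """Delay (ms after the local query) at which each remote query is sent."""
    delays = []
    elapsed = 0 if ldns_down else start_ms   # total LDNS failure: send immediately
    step = start_ms
    for _ in range(num_peers):
        delays.append(elapsed)
        step *= 2            # assumed backoff: double the wait per remote query
        elapsed += step
    return delays
```

With the deployed settings (one extra peer, healthy LDNS) this yields a single remote query 200 ms after the local one.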
Evaluation
Live traffic for one week for CoDeeN (20k - 30k)
Finer-grained View
- Live traffic for one day
- Effectively flattens the spikes
[Figure: LDNS vs. CoDNS; annotation "Cache miss + WAN problem"]
Availability
Adds one "9": from 99% to 99.9%
[Figure: Availability (%) of CoDNS vs. LDNS, y-axis from 90% to 99.99%; x-axis: Nodes Sorted By LDNS Availability]
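The "one extra 9" phrasing corresponds to a tenfold drop in unavailability. A small sketch makes the arithmetic explicit; the helper name `nines` is hypothetical.

```python
import math


def nines(availability):
    """Count of nines in an availability fraction: roughly 2 for 0.99, 3 for 0.999."""
    return -math.log10(1.0 - availability)


before, after = 0.99, 0.999
extra_nines = nines(after) - nines(before)        # about 1
downtime_ratio = (1.0 - before) / (1.0 - after)   # about 10x less failure time
```

Going from 99% to 99.9% means lookups fail about one time in a thousand instead of one time in a hundred.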
What About CDNs?
- CDN uses DNS to pick "best" replica
- CoDNS used only when LDNS failing
  - Pro: faster lookup time
  - Con: maybe worse/farther replica
- In reality, peer's answer is better 30% of the time
CDN Pro/Con Measurements
Overhead
- Heartbeat packet: 1/sec
- Memory: 600 KB
- Remote queries: median 25% more lookups
CoDNS Alternatives
In the paper:
- Private Nameservers
- Secondary Nameservers
- TCP Queries
Conclusion
- Local failures relatively frequent
  - Failure time dominates latency
- CoDNS provides low-cost "insurance" service
  - Masks local failures
  - Reduces avg response time 27-82%
  - Improves availability by an additional "9"
- Incrementally deployable, no server change
More Information
CoDNS homepage:
http://codeen.cs.princeton.edu/codns/
Email:
princeton_codeen@slices.planet-lab.org
TCP Queries
- DNS supports TCP
  - Failure rate is better
  - Not used except for AXFR or when the answer is big
- Simple TCP
  - 2 packets vs. 9 packets (3 + 2 + 4 = 9)
- Persistent TCP
  - ACK overhead
  - Resource waste for idle connections
  - Vulnerable to overloading/server down
S-TCP, P-TCP, UDP, CoDNS
- Replay test (10792 names) on 107 nodes
- CoDNS First
CoDNS vs. Persistent TCP
Average Response Time (ms)
Lookup Distribution
- Live traffic on a node for one week (20333 queries)
- 2043135 ms / 5809265 ms = 35.1%
- 100 ms vs. 286 ms per query
- Great improvement on W-CDF
[Figure annotations: 5.5% -> 0.06%; 76% -> 17.8%]
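The per-query figures follow from the two totals and the query count. The sketch below redoes that arithmetic; attributing the smaller total to CoDNS and the larger to LDNS is an assumption based on the surrounding slides, and the variable names are illustrative.

```python
QUERIES = 20333
CODNS_TOTAL_MS = 2043135   # assumed: total lookup time with CoDNS
LDNS_TOTAL_MS = 5809265    # assumed: total lookup time with the local nameserver only

ratio = CODNS_TOTAL_MS / LDNS_TOTAL_MS        # roughly 35% of the LDNS time
codns_avg_ms = CODNS_TOTAL_MS / QUERIES       # about 100 ms per query
ldns_avg_ms = LDNS_TOTAL_MS / QUERIES         # about 286 ms per query

print(f"CoDNS / LDNS total time: {ratio:.0%}")
print(f"average per query: {codns_avg_ms:.0f} ms vs. {ldns_avg_ms:.0f} ms")
```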
Analysis on Wins
80% at first query, 95% at second query
[Figure axis: Percentage]
