Improvement to DNS resolving in Ubuntu

Registered by Stéphane Graber on 2011-10-20

There are two big topics to discuss in this session:
 - Fixing the way we handle /etc/resolv.conf to be consistent across the whole distro
 - Integrating a DNS resolver daemon into the default Ubuntu desktop installation

== The notes below are about including a local resolver in the default Ubuntu Desktop installation ==
The idea is to have a local resolver like dnsmasq or unbound running on all Ubuntu desktop installations and controlled by Network Manager.

All machines would then use "nameserver 127.0.0.1" in their /etc/resolv.conf.
The search path/domain may be retained in resolv.conf or moved into dnsmasq's or unbound's configuration as well.
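A minimal sketch of the resulting file, assuming the search path stays in resolv.conf rather than moving into the resolver's configuration. `render_resolv_conf` and the `example.com` search domain are hypothetical, for illustration only; they are not part of Network Manager or resolvconf.

```python
from typing import List


def render_resolv_conf(search_domains: List[str]) -> str:
    """Build /etc/resolv.conf text pointing libc at the local resolver.

    Hypothetical helper: it only illustrates the file layout discussed above.
    """
    lines = ["nameserver 127.0.0.1"]  # every query goes to dnsmasq/unbound
    if search_domains:
        # Option 1: keep the search path in resolv.conf (shown here).
        # Option 2: move it into the resolver's own configuration (not shown).
        lines.append("search " + " ".join(search_domains))
    return "\n".join(lines) + "\n"


print(render_resolv_conf(["example.com"]), end="")
```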

The main benefit is improved overall desktop responsiveness, thanks to a local DNS cache, and better dispatching of DNS queries.

For example, consider the following scenario:
 - Wired connection with DNS 1.1.1.1 and domain blah.com
 - Wireless connection with DNS 2001::2 and domain example.com
 - VPN connection with DNS 2.2.2.2 and domain ubuntu.com

With the current Network Manager, all DNS queries go to 2.2.2.2, making everything slow if latency on that link is high.

Also, any request for blah.com or example.com is sent to 2.2.2.2, which doesn't necessarily know these domains (for example, if they are internal domains).

The local resolver fixes all of that: it sends each DNS query to the right server for its domain, caches the results, and properly handles timeouts and detects broken servers, reducing delays and improving the reliability of DNS on Ubuntu.
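A minimal sketch of the per-domain dispatching described above, using the scenario's domains and servers. It matches whole DNS labels, most specific suffix first. The `DEFAULT_SERVER` for names outside every announced domain is an assumption (the scenario doesn't say which server should get them), and `pick_upstream` is a hypothetical helper, not Network Manager or dnsmasq code. In dnsmasq itself this kind of routing is expressed with `server=/domain/address` entries.

```python
from typing import Dict

# Per-domain upstream servers, taken from the scenario above.
SPLIT_DNS: Dict[str, str] = {
    "blah.com": "1.1.1.1",     # wired connection
    "example.com": "2001::2",  # wireless connection
    "ubuntu.com": "2.2.2.2",   # VPN connection
}
# Assumption: names outside every announced domain go to the wired DNS.
DEFAULT_SERVER = "1.1.1.1"


def pick_upstream(qname: str,
                  table: Dict[str, str] = SPLIT_DNS,
                  default: str = DEFAULT_SERVER) -> str:
    """Return the upstream server for qname by longest matching domain suffix."""
    labels = qname.rstrip(".").lower().split(".")
    # Try www.ubuntu.com, then ubuntu.com, then com: most specific first.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in table:
            return table[suffix]
    return default


print(pick_upstream("www.ubuntu.com"))    # VPN server
print(pick_upstream("launchpad.net"))     # default server
```

Matching on whole labels (not raw string suffixes) means a name like `notubuntu.com` is not sent to the VPN server.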

Blueprint information

Status: Started
Approver: Steve Langasek
Priority: Medium
Drafter: Stéphane Graber
Direction: Approved
Assignee: Stéphane Graber
Definition: Approved
Series goal: Accepted for precise
Implementation: Good progress
Milestone target: None
Started by: Colin Watson on 2012-02-06

Whiteboard

== Notes from the session ==

Having a local resolver:
 - Support for VPNs, sending queries for the announced subnets and domains to the remote DNS and sending the rest to the local DNS
 - Local resolver to "fix" the libc behaviour of waiting 20s before falling back to the next entry
 - Work around broken IPv6 support on home routers
 - Support DNSSEC
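A local resolver can avoid paying a dead server's timeout on every lookup by remembering which upstreams recently failed. The sketch below shows one simple policy (skip a server for a fixed cooldown after it times out). The policy, `UpstreamPool` and the simulated `fake_query` are all hypothetical illustrations, not the actual behaviour of dnsmasq, unbound or libc.

```python
import time
from typing import Callable, Dict, List, Optional

# query(server, qname) returns an answer, or None if the server timed out.
QueryFn = Callable[[str, str], Optional[str]]


class UpstreamPool:
    """Try upstream servers in order, skipping ones that recently timed out.

    Illustrative sketch: the fixed cooldown is an assumption, not the
    algorithm used by any real resolver.
    """

    def __init__(self, servers: List[str], cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.servers = list(servers)
        self.cooldown = cooldown
        self.clock = clock
        self.dead_until: Dict[str, float] = {}  # server -> earliest retry time

    def resolve(self, qname: str, query: QueryFn) -> Optional[str]:
        now = self.clock()
        alive = [s for s in self.servers if self.dead_until.get(s, 0.0) <= now]
        # If every server is marked dead, try them all rather than fail outright.
        for server in alive or self.servers:
            answer = query(server, qname)
            if answer is not None:
                return answer
            # Pay this server's timeout once, then skip it for `cooldown` seconds.
            self.dead_until[server] = now + self.cooldown
        return None


# Simulated upstreams: 10.0.0.1 is dead, 10.0.0.2 answers.
calls: Dict[str, int] = {}


def fake_query(server: str, qname: str) -> Optional[str]:
    calls[server] = calls.get(server, 0) + 1
    return None if server == "10.0.0.1" else "192.0.2.10"


now = [0.0]
pool = UpstreamPool(["10.0.0.1", "10.0.0.2"], clock=lambda: now[0])
pool.resolve("www.example.com", fake_query)
pool.resolve("www.example.com", fake_query)
print(calls)  # {'10.0.0.1': 1, '10.0.0.2': 2}
```

In a real resolver each failed call would cost the full per-server timeout; the counts show that this cost is paid once per cooldown rather than on every lookup.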
/etc/resolv.conf problems:
 - Scripts directly modifying resolv.conf without coordination (dhclient-script, for example)
Problems:
 - About 5 packages that write to /etc/resolv.conf without coordination
   - resolvconf
   - NetworkManager
   - dhclient
   - installer
   - bind9, dnsmasq and other things that talk to resolvconf sometimes muck with resolv.conf themselves
 - virt-manager uses dnsmasq to provide name resolution for VMs on the host
 - dnsmasq has memory issues
Need to do the same thing on desktop and server, or not fix it for the LTS
Can resolvconf start early enough to be usable?
 - NM can talk to resolvconf
 - dnsmasq installed by default for connection sharing
Problems with resolvconf:
 - In universe
 - Not upstartified, starts too late
 - Uses inappropriate directories (should switch to /run)
 - Doesn't solve the crappy-IPv6-router issue
Desktop side:
 - NM supports using dnsmasq for DNS by default
 - Ubuntu not using it currently
Lots of software starts its own dnsmasq. If the system runs one as well...
libvirt:
 - Each virbr# has a dnsmasq with static IP entries for running VMs
 - Want to add each dnsmasq to list of resolvers
 - Probably needs resolvconf integration to do this
 - But libc only allows 3 nameservers; this falls down with lots of bridges
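glibc's stub resolver reads at most three `nameserver` lines (its compile-time `MAXNS` limit) and silently ignores the rest, which is why one dnsmasq per libvirt bridge doesn't scale. A small sketch of that truncation; `usable_nameservers` is a hypothetical helper, and apart from libvirt's usual default network address 192.168.122.1 the addresses are made up.

```python
from typing import List, Tuple

MAXNS = 3  # glibc's limit on "nameserver" entries in resolv.conf


def usable_nameservers(resolv_conf: str) -> Tuple[List[str], List[str]]:
    """Split resolv.conf nameservers into (used by libc, silently ignored)."""
    servers: List[str] = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';' and never match "nameserver".
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:MAXNS], servers[MAXNS:]


# Hypothetical host: one upstream DNS plus a dnsmasq per libvirt bridge.
conf = """\
nameserver 10.0.0.1
nameserver 192.168.122.1
nameserver 192.168.100.1
nameserver 192.168.101.1
"""
used, ignored = usable_nameservers(conf)
print("used:", used)        # used: ['10.0.0.1', '192.168.122.1', '192.168.100.1']
print("ignored:", ignored)  # ignored: ['192.168.101.1']
```

A single `nameserver 127.0.0.1` entry avoids the limit, since the local resolver can forward to as many upstream servers as it is configured with.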
What does dhclient do?
 - Should just use resolvconf
 - Only supporting resolvconf; promote to required

mathieu-tl, 2011-11-25:
- linking bug 366967; apparently Sander van Griekan has already done some of the work of upstartifying resolvconf and is communicating with the Debian maintainer about other fixes. A preliminary patch may already be available to be reviewed and/or sponsored.

stgraber 2011-12-09:
 - dnsmasq supports IPv6 properly, both as a client and as a server, so it should work fine in both single-stack and dual-stack IPv4/IPv6 setups. I have been running Network Manager with dnsmasq on all my machines since UDS, on a dual-stack network where around 30% of the traffic is IPv6, and haven't seen any issues.
 - dnsmasq doesn't do DNSSEC validation itself. This is unfortunate but "by design": dnsmasq is essentially a DNS proxy, so it has no code for full recursion (from the root servers down) or for record validation. We'd need bind or unbound for that additional functionality.
 - dnsmasq resource usage and stability are very good. After more than a month of uptime, typical memory usage is VIRT: 13M, RES: 348K, SHR: 252K.

mdeslaur 2011-12-13:
- The selected local resolver must maintain a separate cache per user, to prevent privacy issues, and to prevent local users from spying on source ports and trivially performing a birthday attack in order to poison the cache for other users on a multiuser system.
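To make the per-user requirement concrete, here is a minimal sketch of a cache keyed on the requesting user's uid as well as the query name. A user who never looked up a name gets a miss, so they can neither read a decaying TTL left by someone else's lookup nor be served an entry another user might have poisoned. It assumes the resolver can learn the caller's uid, which a plain UDP listener on 127.0.0.1 does not get directly; `PerUserCache` is hypothetical and is not an existing dnsmasq feature.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PerUserCache:
    """DNS answer cache partitioned by the requesting user's uid (sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        # (uid, qname) -> (answer, absolute expiry time)
        self.entries: Dict[Tuple[int, str], Tuple[str, float]] = {}

    def put(self, uid: int, qname: str, answer: str, ttl: float) -> None:
        self.entries[(uid, qname.lower())] = (answer, self.clock() + ttl)

    def get(self, uid: int, qname: str) -> Optional[str]:
        key = (uid, qname.lower())
        hit = self.entries.get(key)
        if hit is None:
            return None  # miss: this user never resolved the name
        answer, expires = hit
        if self.clock() >= expires:
            del self.entries[key]  # expired: force a fresh upstream lookup
            return None
        return answer


now = [0.0]
cache = PerUserCache(clock=lambda: now[0])
cache.put(1000, "www.example.com", "192.0.2.1", ttl=300)
print(cache.get(1000, "www.example.com"))  # 192.0.2.1
print(cache.get(1001, "www.example.com"))  # None: another user's entry is invisible
```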

stgraber 2011-12-13:
 - After discussion with mdeslaur, we'll still turn on dnsmasq by default in Network Manager but document how to turn it off (comment out the dns= line in /etc/NetworkManager/NetworkManager.conf). Should there be major concerns before or after release, a potential workaround is to pass --cache-size=0 to dnsmasq, turning off the cache completely. This fixes the two issues Marc mentioned, at the cost of losing dnsmasq's caching ability.
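A sketch of the opt-out described above, written as a text transformation: it comments out the `dns=` key in the `[main]` section of NetworkManager.conf and leaves everything else alone. `disable_nm_dnsmasq` is a hypothetical helper, and the sample file is an assumption about what the shipped configuration looks like; in practice an administrator would simply edit the file by hand.

```python
from typing import List, Optional


def disable_nm_dnsmasq(conf_text: str) -> str:
    """Comment out the dns= key in the [main] section of NetworkManager.conf."""
    out: List[str] = []
    section: Optional[str] = None
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped[1:-1].strip()  # entering a new [section]
        elif section == "main" and stripped.split("=", 1)[0].strip() == "dns":
            line = "#" + line  # commented out, so Network Manager no longer reads it
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed example of the shipped file.
sample = """\
[main]
plugins=ifupdown,keyfile
dns=dnsmasq

[ifupdown]
managed=false
"""
print(disable_nm_dnsmasq(sample), end="")
```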

jdstrand 2011-12-13:
I want to make the security team's stance clearer after discussing this with mdeslaur. While having a local DNS cache is highly desirable and something we should work toward in Ubuntu, I am worried about the implementation, especially with respect to the LTS. This seems like something we should be working on in LTS+1. The problems with the current implementation are many:
 - DNS cache poisoning is made easier on multiuser systems
 - any user on the system will be able to enumerate the domains other users have accessed
 - diagnostic tools will all show that the DNS resolver is 127.0.0.1 which will invariably cause problems with support desk calls from users when they call their ISP because something isn't working right
 - users are not able to flush their individual caches (indeed, a reboot is needed unless they HUP dnsmasq somehow (which requires root)).
 - If we add to that the behavior requested in bug #903854, then we are also radically changing the behavior of Ubuntu as a DNS client.
This is a lot of change and does not in my mind make sense as a default for Ubuntu, especially as an LTS. We could simply say "Ubuntu isn't really a multiuser system", but I think that goes against our Ubuntu/Linux tradition, university-style deployments, and our current focus to better support corporate environments. I also wonder if we need TB approval for this change of focus.
If Ubuntu proceeds with this implementation, we will have to provide documentation (release notes, security team FAQ) on how to run securely in a multiuser environment. However, just because we say the new behavior is intentional, it doesn't invalidate when someone publishes an exploit and we are compelled to use '--cache-size=0' in a security update (at which point we have to ask, why did we do this in the first place?).
While I don't know how feasible either of these options is, the proper long-term solution seems to be spending our time either on fixing the (aiui many) problems in nscd (since it has per-user caches and glibc integration), or on updating dnsmasq so that it has glibc integration and per-user caches. Perhaps other software could fill this niche as well. This work can start now as a non-default option for 12.04 and then possibly be rolled out as the default in 12.10. Bottom line: this seems risky for an LTS, per-user caches would solve most of the problems, and I still think it is worthwhile to remove the resolv.conf/127.0.0.1 change so diagnostic tools work as intended.

stgraber 2011-12-13:
Just responding to the various points above:
 - DNS cache poisoning => I'm fine turning the cache off. I consider the cache a nice-to-have, not the main reason for having the local resolver in the first place. If we think it's too risky, let's turn the cache off.
 - any user on the system can enumerate domains => how exactly? I understand you can check the TTL to tell whether a domain was already accessed by another user, but I don't see how someone could enumerate the whole list, as the cache is in RAM. Even if it's possible, I'm fine turning the cache off.
 - diagnostic tools: on desktop systems (the only ones affected by this change), the reliable source of information should be Network Manager; its "Connection information" screen shows the actual DNS servers.
 - users are not able to flush their individual caches: Anyone who can touch the network connection, either through network manager or physically will trigger a restart of dnsmasq, flushing the cache in the process. Once again, doesn't apply if we turn the cache off.
 - I think the change in behaviour proposed in bug #903854 will greatly improve the reliability of DNS resolution on our desktop systems at the cost of a slightly higher DNS traffic to the upstream DNS servers. I think it's worth it as having most of your apps hang on DNS resolution is rather annoying and unfortunately happens pretty often.
 - As we chose to limit the scope of this change to desktop systems running Network Manager, I don't believe it needs TB approval. However, I'm fine discussing it there if you feel it's necessary (but will let someone else lead the discussion).

 - To clarify, my main interest in having dnsmasq by default on desktop systems is to support split DNS (for people using VPNs), better fallback when one of the DNS servers is dead, and support for more than 3 DNS servers with a mix of IPv4 and IPv6. I would certainly love to have caching too, but if that's too much trouble, I'm fine going without a cache for 12.04 and discussing per-user caching for 12.10.

== Actions ==

Work items for precise-alpha-1:
[stgraber] Test dnsmasq robustness, viability, and memory consumption tendencies: DONE
[stgraber] Investigate state of IPv6 and dnssec in dnsmasq: DONE

Work items for precise-alpha-2:
[stgraber] Submit patch to network-manager changing the default dnsmasq flags (drop --strict-order, add --cache-size=0) (attached to bug 903854): DONE
[stgraber] Correct resolvconf's design oversights: DONE
[stgraber] Make dhclient use resolvconf if it's present (already done through the resolvconf hook being sourced by dhclient-script and re-defining make_resolv_conf): DONE
[stgraber] Get resolvconf promoted to main and turned on by default (added to ubuntu-minimal): DONE

Work items:
[mathieu-tl] Enable dnsmasq in NetworkManager if tests are conclusive: DONE
[vorlon] Upstartify resolvconf: DONE
[mdeslaur] Determine if virt-manager/libvirt can integrate with resolvconf to get the default dnsmasq setup in the resolver: POSTPONED


Work Items