Add Mesos-DNS to the Ubuntu Mesos driver

Registered by Bertrand NOEL

Mesos-DNS provides DNS-based service discovery for Mesos. It lets tasks launched on a Mesos cluster be reached by DNS name (by default, only from inside the cluster).
For example, a service launched by Marathon with the id "nginx" could be reached at the following name: nginx.marathon.mesos
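
As an illustration of the naming scheme in the example above, here is a minimal Python sketch that composes such a name. The helper name mesos_dns_name and its defaults are hypothetical, chosen only to match the "nginx" example; it does not handle nested Marathon ids or any normalization Mesos-DNS may apply to task names.

```python
def mesos_dns_name(task_id, framework="marathon", domain="mesos"):
    """Compose a <task>.<framework>.<domain> name (hypothetical helper)."""
    labels = [task_id, framework, domain]
    # Strip stray dots and lowercase each label before joining.
    return ".".join(label.strip(".").lower() for label in labels)


print(mesos_dns_name("nginx"))
```

From a node whose resolv.conf points at Mesos-DNS, the resulting name could then be looked up with any ordinary resolver call.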

This blueprint proposes adding Mesos-DNS to the existing Ubuntu Mesos template.

Notes about Mesos-DNS:
- Mesos-DNS can be installed anywhere on the Mesos cluster: master or node.
- There is no package yet, only a binary to download.
- Mesos-DNS is included in DCOS.
- It can be launched manually, or deployed by Marathon.
- When installing Mesos-DNS, the resolv.conf file of every node has to be changed to point to the IP of the machine where Mesos-DNS is deployed.

The Mesos-DNS binary could be added to the image at build time.
The last point, about resolv.conf, might be problematic for a cluster deployed by Magnum, because the cluster is dynamic: a node running Mesos-DNS could disappear, and every other node would then need to be updated. This could be solved using the SoftwareGroup agent on nodes (instead of configuring nodes by cloud-init, run an agent that polls Heat for changes, as is already done for Mesos masters), but that would mean big changes to the Mesos Magnum template. Another option is to run the Mesos-DNS service on all nodes and set 127.0.0.1 in resolv.conf. I tried this and it works.
By default, the DNS entries are only available from inside the cluster, but Mesos-DNS can be configured so that machines outside the cluster can resolve them as well.
Mesos-DNS would answer DNS queries from inside the cluster, and if nothing matches it would fall back to the DNS servers defined in its configuration file.
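
The fallback behaviour described above is driven by the Mesos-DNS configuration file. The following Python sketch shows how a cloud-init step might generate one. The function name build_mesos_dns_config is hypothetical, the addresses are placeholders, and the listener value is an assumption that fits the one-instance-per-node layout; the key names should be checked against the configuration parameters page linked below.

```python
import json


def build_mesos_dns_config(masters, upstream_resolvers, domain="mesos"):
    """Return a Mesos-DNS config as a dict (hypothetical helper)."""
    return {
        "masters": list(masters),               # Mesos master host:port pairs
        "domain": domain,                       # suffix of generated names
        "port": 53,                             # DNS port to listen on
        "listener": "127.0.0.1",                # assumption: local queries only
        "resolvers": list(upstream_resolvers),  # used when nothing matches
        "refreshSeconds": 60,
        "ttl": 60,
    }


# Placeholder addresses, for illustration only.
config = build_mesos_dns_config(["10.0.0.3:5050"], ["8.8.8.8"])
print(json.dumps(config, indent=2))
```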

Proposal:
- Change the image creation to install the Mesos-DNS binary and create a Mesos-DNS upstart file.
- Add a label to the template to enable it. By default, the scripts would not be executed and the service would not run.
- Add a cloud-init script for nodes that creates the config file, adds 127.0.0.1 as the first nameserver in the resolv.conf file, and starts the Mesos-DNS service.
- Update the documentation to say that it is available, with a simple use case to test it.
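
The resolv.conf step of the proposal could be sketched as follows, in Python for illustration (the actual cloud-init script may well be shell). The function name prepend_nameserver is hypothetical. It works on the file content as a string and is idempotent, so re-running it does not add duplicate entries.

```python
def prepend_nameserver(resolv_conf_text, address="127.0.0.1"):
    """Return resolv.conf content with `address` as the first nameserver."""
    entry = "nameserver " + address
    # Drop any existing copy of the entry so repeated runs stay idempotent.
    kept = [line for line in resolv_conf_text.splitlines()
            if line.strip() != entry]
    return "\n".join([entry] + kept) + "\n"


# Placeholder content, for illustration only.
original = "search example.internal\nnameserver 10.0.0.2\n"
updated = prepend_nameserver(original)
print(updated, end="")
```

On Ubuntu, /etc/resolv.conf may be regenerated by resolvconf or the DHCP client, so a direct edit like this might need to be applied through those tools to persist.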

Links:
Site of the project:
http://mesosphere.github.io/mesos-dns/
Basic tutorial:
https://mesosphere.github.io/mesos-dns/docs/tutorial.html
Which DNS entries are created:
https://dcos.io/docs/1.9/usage/service-discovery/mesos-dns/service-naming/
Parameters of the config file:
https://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html

Blueprint information

Status:
Not started
Approver:
Adrian Otto
Priority:
Undefined
Drafter:
Bertrand NOEL
Direction:
Needs approval
Assignee:
Bertrand NOEL
Definition:
New
Series goal:
None
Implementation:
Unknown
Milestone target:
None

Related branches

Sprints

Whiteboard

Let's tighten up the blueprint description to zero in on the one-service-per-node deployment as the desired configuration, not just an option. Our team has already discussed and approved that deployment pattern for foundation-level services. If improperly implemented, this service would become a single point of failure for clusters. I suggest that the resolv.conf have a secondary address (perhaps the address of one of the masters) in addition to the 127.0.0.1 address, to allow for continued service even if the local instance of the service fails or is temporarily stopped for something like a rolling upgrade.

I checked and verified that the license is Apache 2.0, so we don't have a problem including it. However, I'm reluctant to approve a deployment process that includes downloading an unsigned binary off the internet and just blindly running it. I'm not comfortable with the window of security vulnerability this opens. It would leave us no way to audit that binary to verify that it was in fact derived from source code that we can inspect. This is one key benefit that upstream Linux packaging offers. My concern can be addressed by one of the following:

1) Arrange for this to be packaged by Debian or Ubuntu.
2) Propose a workaround in which the checksums of the source code tarball and of the binary are compared against known values that we verify in advance, so we have confidence that the code has not been tampered with at some point in the delivery pipeline.

With a suitable adjustment to the proposal, I would be happy to mark this with a directional approval, and plan it for the Ocata release cycle. --adrian_otto
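
A possible shape for the checksum workaround in option 2 above is sketched here in Python. The function names are hypothetical, and SHA-256 is an assumption, since the comment does not name a digest algorithm. The expected value would have to be recorded in advance from a copy of the binary that has been verified.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True only if the file matches the pre-verified digest."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

The image build would call verify_download on the downloaded binary and abort if it returns False, rather than installing an unverified file.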

Work Items
