
johnbelamaric commented on August 15, 2024

There are a number of reasons.

When running as cluster DNS, CoreDNS is configured with the Kubernetes plugin. This puts a watch on all EndpointSlices and Services (and other things, depending on your config). This means a persistent connection to the API server for each instance of CoreDNS, and the API server sending watch events down that channel for any changes to those resources. For clusters with thousands of nodes, that would put a substantial burden on the API server.

NodeLocalDNS, on the other hand, is only a cache and a stub resolver. It does not put a watch on the API server. This makes it much less of a burden on the API server, and also makes it a much smaller process since it does not need to use memory to hold those API resources.

NodeLocalDNS also solves a second problem. Early versions of Kubernetes would sometimes have failures due to the conntrack table filling up. This was found to be because UDP entries need to age out of the conntrack table, so a burst of DNS traffic could fill that table up (I seem to recall some kernel bugs may have also been involved, but this is several years ago). NodeLocalDNS turns off connection tracking for UDP traffic to the node local DNS IP address, and it upgrades requests made to cluster DNS from UDP to TCP. TCP is not subject to this issue since entries can be removed when the connection is closed.

Finally, even if we did use a DaemonSet, it wouldn't work the way you would hope. There is no guarantee that requests from a client would go to the local CoreDNS instance. In fact, at the time NodeLocalDNS was created, that would have been rare, because the local node had no higher weight in the kube-proxy based load balancing. So if you had 1000 instances of CoreDNS, only 1/1000 of requests would go to your local CoreDNS instance. I am not sure if that has changed; there has been some work on more topology-aware services, but I am not sure how far it has progressed - you would have to check with SIG Network.


dudicoco commented on August 15, 2024

Thanks for the info @johnbelamaric.

  1. Regarding the API server connections, I have addressed that in coredns/helm#86 (comment) - other daemonsets also make API calls: kube-proxy, CNI plugins, log collectors, etc.

  2. Regarding the conntrack issue - can't we turn off connection tracking for a coredns daemonset?

  3. Regarding directing requests from the client to the local coredns instance - this is now possible with a Service's internal traffic policy (see the sketch after this list), but in any case this problem would be present with nodelocaldns as well, which could negate its benefits.
    One possible issue when using coredns as a daemonset with internal traffic policy is that, until the coredns pod on a node is ready, no DNS requests can be made by other pods on that node.
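For reference, a minimal sketch of a Service using the internal traffic policy mentioned in point 3 (the name and selector follow the usual kube-dns conventions and are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: kube-dns
  internalTrafficPolicy: Local   # only route to endpoints on the client's own node
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP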


johnbelamaric commented on August 15, 2024
  1. Correct. But in general they limit the scope of what they are querying for to the things local to that node. For example, kubelet, kube-proxy, etc. do not monitor all pods and endpoints across the cluster, but instead only those assigned to their node.
  2. Possibly. It's not controlled in that way; it's done through iptables rules, IIRC. So you would need to do some magic, but it's theoretically possible (a sketch of what that could look like follows this list).
  3. No, Node Local DNS changes the way pods running on the node do their DNS so that it goes to the local cache. It is not subject to this issue (it is not accessed via kube-proxy based load balancing rules).
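To make point 2 concrete: the "magic" would amount to NOTRACK rules in the raw table, which is what node-local-dns installs for itself. A minimal sketch, assuming the conventional node-local listen address 169.254.20.10 and shown as a hypothetical privileged container fragment in a daemonset spec (the image is a placeholder for anything that ships iptables):

containers:
- name: install-notrack-rules
  image: example.com/iptables:latest   # hypothetical; requires hostNetwork: true on the pod
  securityContext:
    privileged: true
  command:
  - sh
  - -c
  - |
    # Skip conntrack for DNS traffic to the node-local address, in both directions.
    iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK
    iptables -t raw -A OUTPUT -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK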


dudicoco commented on August 15, 2024
  1. What is the measured impact of a coredns daemonset querying pods on the API server? I know that Zalando is using a coredns daemonset and I'm pretty sure they're running at scale.
  2. I believe that using coredns as a daemonset might mean not experiencing the conntrack issue without working around it, since each pod would receive a much lower volume of requests than with a coredns deployment.
  3. It is also possible to have clients bypass the kube-proxy Service and send requests to the local coredns pod by using the downward API:
- name: HOST_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP
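The other half of that approach is publishing CoreDNS on each node's IP, for example with a hostPort on the daemonset's pod template, so that clients can resolve against $(HOST_IP). A sketch (image tag illustrative):

containers:
- name: coredns
  image: coredns/coredns:1.11.1   # tag illustrative
  args: ["-conf", "/etc/coredns/Corefile"]
  ports:
  - name: dns
    containerPort: 53
    hostPort: 53        # publish on the node IP so $(HOST_IP):53 reaches this pod
    protocol: UDP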


dpasiukevich commented on August 15, 2024
>   1. What is the measured impact of a coredns daemonset querying pods on the API server? I know that Zalando is using a coredns daemonset and I'm pretty sure they're running at scale.
>   2. I believe that using coredns as a daemonset might mean not experiencing the conntrack issue without working around it, since each pod would receive a much lower volume of requests than with a coredns deployment.
>   3. It is also possible to have clients bypass the kube-proxy Service and send requests to the local coredns pod by using the downward API:
> - name: HOST_IP
>   valueFrom:
>     fieldRef:
>       apiVersion: v1
>       fieldPath: status.hostIP
  1. As @johnbelamaric mentioned, the load scales linearly. In a CoreDNS daemonset, each CoreDNS instance would initialize watchers for EndpointSlices, Services and ConfigMaps.
    The overall effect is how frequently these objects change in your cluster, multiplied by N (the number of nodes).

  2. Nodelocaldns uses TCP to talk to the cluster DNS pods, so it is much less affected by the conntrack issue than it would be over UDP.

Also keep in mind that with a CoreDNS daemonset there would be no guarantee that a client pod talks to the CoreDNS pod on its own node.
/etc/resolv.conf points to the kube-dns Service, so the traffic would go to any pod in the cluster.
Plus, since the default DNS protocol is UDP and the client will reach an arbitrary CoreDNS pod, the conntrack exhaustion issue would reappear in such a setup.

Whereas with nodelocaldns (with the iptables rules) the client is guaranteed to talk to the local nodelocaldns pod on the same node.

  3. This should work, but I personally see this as an inelegant solution, as you'd have to define and maintain this override for all pods in your cluster.
    Plus there may be unexplored consistency problems with HOST_IP pointing to the right IP at all times (e.g. some redeploys and status changes may cause brief unexpected outages).


dudicoco commented on August 15, 2024

@dpasiukevich

  1. It's still not clear to me whether the possible strain on the API server was ever tested. Did the relevant group in the Kubernetes project run tests on a coredns daemonset and find that it produces significant load on the API server at scale, or is it just speculation?
  2. Why can't the same solution be applied to coredns? We could have an iptables rule to direct DNS traffic to the local coredns pod (this also negates the need for a downward-API-based solution; see the sketch after this list).
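One conceivable shape for the rule in point 2 is a DNAT that rewrites traffic aimed at the cluster DNS Service IP to the node's own coredns pod. A hypothetical sketch as a privileged init container (10.96.0.10, the image and the surrounding scaffolding are all illustrative; the pod would need hostNetwork: true):

initContainers:
- name: redirect-dns-to-local
  image: example.com/iptables:latest   # hypothetical image shipping iptables
  securityContext:
    privileged: true
  env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  command:
  - sh
  - -c
  - |
    # Rewrite queries aimed at the cluster DNS Service IP to this node's resolver.
    iptables -t nat -A PREROUTING -d 10.96.0.10/32 -p udp --dport 53 \
      -j DNAT --to-destination ${NODE_IP}:53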


dpasiukevich commented on August 15, 2024
  1. That's just an estimate. I don't expect there were any scalability benchmarks measuring API server performance and resource usage as a function of daemonset size and the frequency/size of Service/EndpointSlice object changes.
  2. It definitely can be done. And it's definitely a good optimisation in certain cases, at the cost of more DIY.


dudicoco commented on August 15, 2024
>   1. That's just an estimate. I don't expect there were any scalability benchmarks measuring API server performance and resource usage as a function of daemonset size and the frequency/size of Service/EndpointSlice object changes.
>   2. It definitely can be done. And it's definitely a good optimisation in certain cases, at the cost of more DIY.

Why would it require more DIY? Couldn't it be implemented in coredns directly?

Another idea: have a nodelocaldns container and a coredns sidecar container in the same pod and direct traffic from nodelocaldns to coredns via localhost. This would simplify the architecture while preserving the benefits of nodelocaldns, without requiring new features or code changes (see the sketch below).
A possible issue would be if the nodelocaldns container starts before the coredns container; in that case DNS resolution would fail. I assume this can be solved by having nodelocaldns wait for coredns to be available.
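A sketch of that sidecar layout (images, tags and the localhost port are illustrative; the respective Corefiles would bind nodelocaldns to its link-local address and CoreDNS to 127.0.0.1:5353):

apiVersion: v1
kind: Pod
metadata:
  name: node-dns        # in practice, the pod template of a daemonset
spec:
  hostNetwork: true
  containers:
  - name: node-cache    # nodelocaldns, configured to forward cache misses to 127.0.0.1:5353
    image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1   # tag illustrative
  - name: coredns       # full CoreDNS with the kubernetes plugin, bound to localhost only
    image: coredns/coredns:1.11.1                          # tag illustrative
    args: ["-conf", "/etc/coredns/Corefile"]

On Kubernetes 1.28+ the startup-ordering concern could plausibly be addressed by running coredns as a native sidecar (an init container with restartPolicy: Always), which starts, and must pass its startup probe, before the other containers run.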


dudicoco commented on August 15, 2024

@johnbelamaric @dpasiukevich any updates?


chrisohaver commented on August 15, 2024

> It's still not clear to me whether the possible strain on the API server was ever tested. Did the relevant group in the Kubernetes project run tests on a coredns daemonset and find that it produces significant load on the API server at scale, or is it just speculation?

Yes. It doesn't scale.


johnbelamaric commented on August 15, 2024

By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.
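For illustration, a sketch of the kind of Corefile nodelocaldns effectively runs - cache plus forward, no kubernetes plugin (both IPs are illustrative; the real manifest templates them in):

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        bind 169.254.20.10
        cache 30
        forward . 10.96.0.10 {  # upstream cluster DNS Service IP
            force_tcp           # upgrade to TCP to sidestep the UDP conntrack issue
        }
    }
    .:53 {
        bind 169.254.20.10
        cache 30
        forward . /etc/resolv.conf
    }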

By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to have just the DNS node-local cache, with only a small DNS cache for the workloads on that node, taking, say, < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.


dudicoco commented on August 15, 2024

> By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.

What I wrote was that the coredns container should be co-located in the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
This is what Zalando are doing, but with dnsmasq instead of nodelocaldns; according to them it performs better.

> By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use an extra 500MB on every node to cache the entire cluster's worth of services and headless endpoints, that is 5,000 GB of RAM. It's expensive. Much better to have just the DNS node-local cache, with only a small DNS cache for the workloads on that node, taking, say, < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.

Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.


chrisohaver commented on August 15, 2024

> What I wrote was that the coredns container should be co-located in the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.

That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes-enabled nodelocaldns on each node by itself. Of course, with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.

> Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.

The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.

> We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.

That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.


dudicoco commented on August 15, 2024

> > What I wrote was that the coredns container should be co-located in the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
>
> That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes-enabled nodelocaldns on each node by itself. Of course, with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.

I don't think it is more complex than a nodelocaldns daemonset + coredns deployment + dns autoscaler. However, using just nodelocaldns with the kubernetes plugin would be preferable; I'm not sure how it would deal with non-cached responses in that case, though.

> > Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50MB of memory.
>
> The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large-scale cluster, and thus does not have a large number of services and endpoints.

What is considered a large cluster? There is no info on the number of services/endpoints in https://kubernetes.io/docs/setup/best-practices/cluster-large/.

We are running ~500 services and ~500 endpoints.

> > We can assume that if it ran as a daemonset it would consume even less memory, since there would be much less load on each pod.
>
> That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.

Thanks for the clarification.


chrisohaver commented on August 15, 2024

> What is considered a large cluster? ... We are running ~500 services and ~500 endpoints.

Per the link, up to 150,000 pods per cluster. Each pod can back multiple services and endpoints.


vaskozl commented on August 15, 2024

I expect endpoint churn (per unit time) to be a more useful number than the absolute number of endpoints.

There's nothing stopping one from using a DaemonSet with maxSurge=1 and maxUnavailable=0, together with internalTrafficPolicy: Local, with the vanilla coredns image.
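A sketch of those rollout settings (image and labels illustrative; maxSurge for DaemonSets requires Kubernetes 1.22+), to be paired with an internalTrafficPolicy: Local Service like the one sketched earlier in the thread:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # bring the replacement pod up first
      maxUnavailable: 0    # never leave a node without its local resolver
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:1.11.1   # tag illustrative
        args: ["-conf", "/etc/coredns/Corefile"]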

The suggested way to autoscale coredns is proportional to cluster size, exactly the same as scaling with a daemonset, except with a configurable coresPerReplica rather than coresPerReplica being equal to the number of cores per machine.

The suggested config in the doc is "coresPerReplica":256,"nodesPerReplica":16, which is also "linear" any way you look at it; the advantage is that you can choose K and run a fraction of the CoreDNS pods when you have small nodes. At worst the DaemonSet method results in 16x the load on the API server.
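For comparison, the quoted settings in cluster-proportional-autoscaler form (ConfigMap name illustrative). With this config, a 1,000-node cluster of 8-core machines gets max(ceil(8000/256), ceil(1000/16)) = 63 replicas, versus 1,000 pods for a daemonset - which is where the 16x worst case comes from:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16}'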

As such, I see no good argument against running CoreDNS as a DaemonSet.

On the contrary, I can think of quite a few advantages to the DaemonSet approach:

  • no iptables/ipvs kube-proxy or other CNI caveats due to the NOTRACK rules
  • simple scaling on vanilla clusters with no cluster-proportional-autoscaler deployed
  • quicker DNS resolution without needing a second node-local DNS config with hardcoded service IPs
  • easier to debug and reason about when encountering issues

