Comments (14)

omoerbeek commented on June 2, 2024

Took a quick look, I'm not understanding completely yet what is going on. But here is what I found out so far:

For google.com this code in syncres.cc:updateCacheFromRecords() is hit:

    if (i->first.place == DNSResourceRecord::ANSWER && ednsmask) {
      d_wasVariable = true;
    }

which causes the code in pdns_recursor.cc:startDoResolve() that adds the subnet info to the response to be skipped:

    if (g_useIncomingECS && dc->d_ecsFound && !sr.wasVariable() && !variableAnswer) {

Here sr.wasVariable() is true, so the condition fails.
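
A minimal sketch of how these two checks interact, written as a Python model rather than the actual recursor code. The names `Record`, `was_variable` and `should_add_ecs_to_response` are hypothetical; only the two conditions are taken from the snippets above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a cached resource record."""
    place: str  # "ANSWER", "AUTHORITY" or "ADDITIONAL"
    name: str


def was_variable(records: Iterable[Record], ednsmask: Optional[str]) -> bool:
    """Mirror of the first check: an ANSWER record obtained with an
    ECS mask marks the whole resolution as variable."""
    return any(r.place == "ANSWER" and ednsmask is not None for r in records)


def should_add_ecs_to_response(use_incoming_ecs: bool, ecs_found: bool,
                               variable: bool,
                               variable_answer: bool = False) -> bool:
    """Mirror of the second check: the subnet info is only added when
    the answer is not variable."""
    return (use_incoming_ecs and ecs_found
            and not variable and not variable_answer)


records = [Record("ANSWER", "google.com")]
variable = was_variable(records, "192.0.2.0/24")
print(should_add_ecs_to_response(True, True, variable))
```

In this model, any ECS-dependent answer sets the variable flag, so the ECS option is left out of exactly those responses.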

I'll continue analyzing this later.

from pdns.

omoerbeek commented on June 2, 2024

In the meantime a question popped up: what's your use case for having the ECS value sent to the client?

mizzy241 commented on June 2, 2024

Actually the client is dnsdist in this case :) The idea is to have several clusters of dnsdists + pdns-recursors distributed around the globe, where each cluster serves clients from several countries. Of course it's always possible to make it more decentralized, but we expected that dnsdist (with 'useClientSubnet' and 'parseECS') + pdns-recursor would allow us to

  1. Get more optimal resolving results
  2. Keep centralization to some degree
  3. Not kill caching at the same time

Client subnets (I mean real clients here, not dnsdists) are aggregated quite well in our case

omoerbeek commented on June 2, 2024

From the recursor's point of view: ECS kills the packet cache (answers that are ECS dependent are not inserted into the packet cache, only answers that are valid for any client are). The record cache does store ECS info and uses it to retrieve the right ECS-enabled answer. But with respect to performance, the packet cache is much better than the record cache.

Zero scoped ECSs are passed to the client, as they are valid for any client as well. This was done with dnsdist in mind as it has special case handling for zero scopes, see https://dnsdist.org/advanced/passing-source-address.html?highlight=scope#influence-on-caching
But this is the exception. Originally we never intended for non-zero scoped ECS info to be passed to the client.

mizzy241 commented on June 2, 2024

But doesn't it violate RFC 7871 (section 7.2.2)?

[7.2.2] Intermediate Nameserver

   When an Intermediate Nameserver uses ECS, whether it passes an ECS
   option in its own response to its client is predicated on whether the
   client originally included the option.  Because a client that did not
   use an ECS option might not be able to understand it, the server MUST
   NOT provide one in its response.  If the client query did include the
   option, the server MUST include one in its response, especially as it
   could be talking to a Forwarding Resolver, which would need the
   information for its own caching.
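
The rule quoted above can be sketched as a small Python function. This is an illustration only: `EcsOption` and `ecs_for_response` are hypothetical names, and the choice to echo the address and source prefix length from the query while filling in the scope follows my reading of RFC 7871, not any recursor code.

```python
from typing import NamedTuple, Optional


class EcsOption(NamedTuple):
    """Hypothetical, simplified view of an ECS option."""
    address: str
    source_prefix_len: int
    scope_prefix_len: int


def ecs_for_response(query_ecs: Optional[EcsOption],
                     scope_used: int) -> Optional[EcsOption]:
    """Return the ECS option to put in the response, or None.

    No option in the query: none in the response (MUST NOT).
    Option in the query: echo it back with the scope that applied (MUST).
    """
    if query_ecs is None:
        return None
    return EcsOption(query_ecs.address,
                     query_ecs.source_prefix_len,
                     scope_used)


query = EcsOption("192.0.2.0", 24, 0)
print(ecs_for_response(query, 24))
print(ecs_for_response(None, 24))
```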

About caching - I was pretty sure that options like ecs-ipv4-cache-bits/ecs-ipv6-cache-bits control caching of answers in the packet cache. I'll take a note for the future that I was wrong.

omoerbeek commented on June 2, 2024

I created a PR to improve the docs a bit. As for RFC 7871, I'd like to have @rgacogne's opinion.

rgacogne commented on June 2, 2024

Technically this is correct, the recursor should include the ECS scope whenever it uses a record that has a non-zero scope. In some cases the lack of a scope could be a problem, in particular when recursors are chained. It isn't a problem for DNSdist because DNSdist has a packet cache only, not a record cache, so it will only serve a given answer to clients including the exact same ECS source (or lack thereof). There is some code to handle the zero-scope special case as "can be served to everyone", but this is limited to that case and is not intended to be extended.
If we want to include the proper ECS scope in the recursor it will require a bit of work: we need to keep track of the narrowest scope we have used when processing a query, in addition to just keeping track of whether the query was variable or not.
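
A rough sketch of what such bookkeeping could look like, as a Python model. `ScopeTracker` is a hypothetical name, and the sketch assumes that the "narrowest" scope is the longest scope prefix length among the records used for one query; it ignores the IPv4/IPv6 distinction entirely.

```python
class ScopeTracker:
    """Hypothetical per-query state: the variable flag plus the
    narrowest (longest-prefix) ECS scope seen so far."""

    def __init__(self) -> None:
        self.was_variable = False
        self.narrowest_scope = 0

    def note_record_scope(self, scope_prefix_len: int) -> None:
        """Record the scope of one cache entry or upstream answer."""
        if scope_prefix_len > 0:
            self.was_variable = True
            self.narrowest_scope = max(self.narrowest_scope,
                                       scope_prefix_len)


tracker = ScopeTracker()
for scope in (0, 16, 24, 20):  # e.g. scopes met along a CNAME chain
    tracker.note_record_scope(scope)
print(tracker.was_variable, tracker.narrowest_scope)
```

The final value is the scope that would go into the response, since the combined answer is only valid for the most specific network range involved.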

mizzy241 commented on June 2, 2024

Thanks for the comments! My current understanding is that, at the moment, dnsdist has no problems with caching ECS-specific answers (with obvious implications for the cache hit rate, though the zero-scope feature should improve the overall situation). If that's true, it would be natural for the recursor to be able to provide correct ECS scopes to dnsdist (this functionality could be controlled by configuration and not enabled by default). Otherwise it looks slightly strange that you need a 3rd party recursor to fully utilize dnsdist's potential.

rgacogne commented on June 2, 2024

I think there is some misunderstanding here. From DNSdist's point of view, there are only two cases:

  • zero-scope processing is enabled, the incoming query does not have any ECS information, DNSdist adds ECS before forwarding the query to its backend, and the reply from the backend contains a scope value of zero: DNSdist knows it can use this response for all incoming queries, and caches it in a special way
  • for all other queries, DNSdist hashes the incoming query (after adding ECS if configured to do so) and caches the incoming response based on the hash of the query, so it can only be used for queries whose hash matches the one from the initial query, which means the exact same ECS source value.

The recursor already covers both cases, so DNSdist's potential is fully usable.
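
The two cases can be modelled with a toy cache in Python. This is only a sketch of the behaviour described above, not DNSdist's implementation: `PacketCacheModel`, its method names and the SHA-256 key are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional


class PacketCacheModel:
    """Toy model: the key is a hash of (name, type, ECS source)."""

    def __init__(self, parse_ecs: bool) -> None:
        self.parse_ecs = parse_ecs
        self.entries: Dict[str, str] = {}

    @staticmethod
    def _key(qname: str, qtype: str, ecs: Optional[str]) -> str:
        raw = f"{qname.lower()}|{qtype}|{ecs}".encode()
        return hashlib.sha256(raw).hexdigest()

    def insert(self, qname: str, qtype: str, client_had_ecs: bool,
               forwarded_ecs: str, response_scope: int,
               answer: str) -> None:
        # Case 1: zero scope, no ECS from the client: usable by everyone,
        # so store it under a key without any ECS source.
        zero_scope = (self.parse_ecs and not client_had_ecs
                      and response_scope == 0)
        ecs = None if zero_scope else forwarded_ecs
        self.entries[self._key(qname, qtype, ecs)] = answer

    def lookup(self, qname: str, qtype: str, client_had_ecs: bool,
               forwarded_ecs: str) -> Optional[str]:
        if self.parse_ecs and not client_had_ecs:
            hit = self.entries.get(self._key(qname, qtype, None))
            if hit is not None:
                return hit
        # Case 2: only an identical ECS source produces the same key.
        return self.entries.get(self._key(qname, qtype, forwarded_ecs))


cache = PacketCacheModel(parse_ecs=True)
cache.insert("example.org", "A", False, "192.0.2.0/24", 0, "shared")
cache.insert("google.com", "A", False, "192.0.2.0/24", 24, "scoped")
print(cache.lookup("example.org", "A", False, "198.51.100.0/24"))
print(cache.lookup("google.com", "A", False, "198.51.100.0/24"))
```

In this model the response scope only matters when it is zero; any non-zero value leads to the same per-source entry, which is why the missing scope from the recursor does not change DNSdist's caching.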

mizzy241 commented on June 2, 2024

Ok, maybe I really misunderstood something. The root cause of my investigation was an experiment with

  1. dnsdist configured to add ECS before forwarding queries to backend (useClientSubnet=true for all backend servers) and zero-scope enabled (parseECS=true for packet cache)
  2. recursor configured to trust ECS from dnsdist and use this information (full recursor config provided above)

What I observed in this configuration was a really miserable cache hit rate on dnsdist and a huge number of lookup collisions:

> getPool(''):getCache():printStats()
Entries: 44553/250000
Hits: 1938325
Misses: 51838720
Deferred inserts: 2832
Deferred lookups: 6662
Lookup Collisions: 1758469
Insert Collisions: 2393
TTL Too Shorts: 0
Cleanup Count: 5602

The same configuration without parseECS=true provides a better cache hit rate and no significant number of lookup collisions:

> getPool(""):getCache():printStats()
Entries: 29297/250000
Hits: 11986
Misses: 171982
Deferred inserts: 5
Deferred lookups: 23
Lookup Collisions: 1
Insert Collisions: 1
TTL Too Shorts: 0
Cleanup Count: 16

That's why my idea was that the lack of non-zero scope ECS information from the recursor could be the source of the caching problems on the dnsdist side. But taking your explanation into account, it really looks like the problem is somewhere else. That still looks strange to me, because in my experiment I used something like 15-20 client subnets with /24 netmasks, and expected much better cache hit rates.
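
For reference, the hit rates implied by the two outputs above can be computed with a short Python helper. `parse_stats` and `hit_rate` are hypothetical names; the numbers are copied from the printStats() output quoted in this comment.

```python
def parse_stats(text: str) -> dict:
    """Parse 'Name: value' lines from printStats() output.
    Lines whose value is not numeric (such as the prompt) are skipped."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not value.replace("/", "").isdigit():
            continue
        if key == "Entries":
            used, _, capacity = value.partition("/")
            stats["Entries"] = int(used)
            stats["Capacity"] = int(capacity)
        else:
            stats[key] = int(value)
    return stats


def hit_rate(stats: dict) -> float:
    """Hits divided by all lookups (hits + misses)."""
    total = stats["Hits"] + stats["Misses"]
    return stats["Hits"] / total if total else 0.0


WITH_PARSE_ECS = """Entries: 44553/250000
Hits: 1938325
Misses: 51838720
Lookup Collisions: 1758469"""

WITHOUT_PARSE_ECS = """Entries: 29297/250000
Hits: 11986
Misses: 171982
Lookup Collisions: 1"""

for label, text in (("parseECS=true", WITH_PARSE_ECS),
                    ("without parseECS", WITHOUT_PARSE_ECS)):
    print(f"{label}: {hit_rate(parse_stats(text)):.1%}")
```

This gives roughly 3.6% with parseECS=true and 6.5% without it, so both runs have a low hit rate and the difference is less than a factor of two.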

I'm not sure it's right to continue the discussion about the nuances of dnsdist caching here (I don't want to turn this issue into a support request :) but at the same time I'll be grateful for any hints about what I might be missing in my setup.

rgacogne commented on June 2, 2024

That's quite unexpected. Deferred inserts and Deferred lookups values being higher suggests lock contention between threads, which should not be impacted by parseECS. Collisions could in theory be impacted, but that would indicate a weird behaviour of the hash algorithm we are using, or a bug somewhere. If you can share an easy way for us to reproduce this behaviour, we'll look into it.

mizzy241 commented on June 2, 2024

I've tried to create an easy way to reproduce the problem and got even more confusing results. It looks like the lookup collisions are somehow connected with DoH in dnsdist (at the very beginning I didn't think it was important, but DoH is an actual part of our configuration so I decided to run some tests with it as well). So, the easy way to see lookup collisions is the following:

  1. Initial dnsdist configuration - DoH endpoint, backend servers with useClientSubnet=true, packet cache with parseECS=true. Now let's do some test queries. I'll make simultaneous queries for 3 domain names from 3 different /24 client subnets. One of these domain names is ECS-dependent (google.com), the other two are any two ECS zero-scope domains. dnsdist is restarted for each scenario so we have an empty packet cache at the very beginning of each test.

  2. Resolving via UDP, parseECS=true:

> getPool(''):getCache():printStats()
Entries: 5/250000
Hits: 5685
Misses: 19
Deferred inserts: 0
Deferred lookups: 0
Lookup Collisions: 0
Insert Collisions: 0
TTL Too Shorts: 0
Cleanup Count: 13

Expected results, 3 entries in the cache for google.com, 2 entries for zero-scope domain names

  3. Resolving via UDP, parseECS=false:

> getPool(''):getCache():printStats()
Entries: 9/250000
Hits: 49
Misses: 9
Deferred inserts: 0
Deferred lookups: 0
Lookup Collisions: 0
Insert Collisions: 0
TTL Too Shorts: 0
Cleanup Count: 0

Expected result, 3 entries in the cache for each domain name

  4. Resolving via DoH, parseECS=false:

> getPool(''):getCache():printStats()
Entries: 18/250000
Hits: 139
Misses: 18
Deferred inserts: 0
Deferred lookups: 0
Lookup Collisions: 0
Insert Collisions: 0
TTL Too Shorts: 0
Cleanup Count: 0

Hmm, 18 entries in the cache. Unexpected, at least for me

  5. Resolving via DoH, parseECS=true:

> getPool(''):getCache():printStats()
Entries: 18/250000
Hits: 458
Misses: 18
Deferred inserts: 0
Deferred lookups: 0
Lookup Collisions: 914
Insert Collisions: 0
TTL Too Shorts: 0
Cleanup Count: 2

Still 18 entries in the cache, but lookup collisions are growing quickly.

So, it looks like caching is working just fine in all scenarios (almost all queries are cache hits), but when DoH is used we have an increased cache size and tons of lookup collisions.
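
The "expected" entry counts for the two UDP runs follow from simple arithmetic, sketched here in Python. `expected_entries` is a hypothetical helper that assumes one cache entry per name and ECS source; it reproduces the 5 and 9 seen over UDP and does not explain the 18 entries seen over DoH.

```python
def expected_entries(subnets: int, ecs_dependent_names: int,
                     zero_scope_names: int, parse_ecs: bool) -> int:
    """Expected number of packet cache entries in the test setup.

    ECS-dependent names always need one entry per client subnet.
    Zero-scope names share one entry when parseECS is enabled,
    otherwise they also need one entry per subnet.
    """
    per_subnet = ecs_dependent_names * subnets
    shared = zero_scope_names * (1 if parse_ecs else subnets)
    return per_subnet + shared


print(expected_entries(3, 1, 2, parse_ecs=True))
print(expected_entries(3, 1, 2, parse_ecs=False))
```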

I must admit that the whole discussion made an unexpected turn from the original bug report :)

rgacogne commented on June 2, 2024

Would you mind sharing the full dnsdist configurations used during your tests and more details on the queries you are sending, including the tool(s) used? I'm confused by the fact that the first run suggests 5685+19 queries while the second one does 49+9 queries, etc?

mizzy241 commented on June 2, 2024

Yes, sure. Full dnsdist config:

-- setKey("An actual key is here in real config file, hope it's not relevant")
controlSocket("127.0.0.1")

addLocal('0.0.0.0', {reusePort=true})

-- setACL({An actual ACL is here in real config file, hope it's not relevant})

certPath = "/opt/dnsdist/certs/fullchain.pem"
keyPath = "/opt/dnsdist/certs/privkey.pem"

addDOHLocal("0.0.0.0:443", certPath, keyPath, { "/dns-query" }, { reusePort=true })

newServer({address='10.10.0.2', useClientSubnet=true})

setServerPolicy(whashed)

pc = newPacketCache(250000, {maxTTL=86400, minTTL=0, temporaryFailureTTL=60, staleTTL=60, dontAge=false, parseECS=true })
getPool(''):setCache(pc)

10.10.10.2 is a pdns-recursor with config from the very beginning of this thread.

The testing script is actually very dumb (make curl use our dnsdist as the DoH resolver and try to connect to several sites, ECS-dependent and non ECS-dependent):

#!/usr/bin/env bash

DNSDIST_IP="<dnsdist_ip_address>"

while :
do
    echo "Press [CTRL+C] to stop..."

    curl --doh-url "https://${DNSDIST_IP}/dns-query" --doh-insecure --insecure https://google.com
    curl --doh-url "https://${DNSDIST_IP}/dns-query" --doh-insecure --insecure http://obsilf.kiev.ua
    curl --doh-url "https://${DNSDIST_IP}/dns-query" --doh-insecure --insecure https://froster.org

    sleep 1
done

This script was executed on machines from 3 different /24 subnets, and that was enough to observe the results described above. The different numbers of hits and misses in the different tests depend on how quickly getPool(''):getCache():printStats() was executed :)
