Comments (14)

StarlightIbuki commented on June 6, 2024

@chobits Could you take a look?

from kong.

jyc5120 commented on June 6, 2024

After downgrading Kong to 3.4.2, the error is very rare, but it still happens.


chobits commented on June 6, 2024

It seems that Kong attempted many domain:type queries in the query sequence but could not get any available records; see the Tried ... attempts log. A similar troubleshooting case is in #12890 (reply in thread), which contains a detailed explanation of Kong's Tried ... log.

If Kong reports this error sporadically, it means your local DNS occasionally replied NXDOMAIN for all of the domain:type queries.


chobits commented on June 6, 2024

> After downgrading Kong to 3.4.2, the error is very rare, but it still happens.

Yes, you can mitigate this problem by increasing dns_stale_ttl or by setting the option dns_no_sync=off. You still need to check your local DNS server, though: at least once it failed to reply with available records for the query.


jyc5120 commented on June 6, 2024

With 3.6.1 the error is quite frequent, and increasing dns_stale_ttl did not mitigate it. After downgrading to 3.4.2 it is much better.
So I still suspect there is something wrong in Kong. We run Kong inside our K8s cluster, and the DNS server is CoreDNS. I did not find any errors when debugging kube-dns.
The issue is critical because every error means one failed user request, even if it happens rarely.


chobits commented on June 6, 2024

> With 3.6.1 the error is quite frequent, and increasing dns_stale_ttl did not mitigate it. After downgrading to 3.4.2 it is much better. So I still suspect there is something wrong in Kong. We run Kong inside our K8s cluster, and the DNS server is CoreDNS. I did not find any errors when debugging kube-dns. The issue is critical because every error means one failed user request, even if it happens rarely.

If you can easily reproduce this problem, it is not hard to debug. You need to follow the query chain given in the error log and check whether you can get each DNS result from your local DNS. We can tell you how to debug, but we cannot debug it for you unless you provide reproduction steps.

"(short)service-name:(na) - cache-hit/stale",
 "service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dereferencing SRV",
"(short)6233306365613731.service-name.default.svc.cluster.local:(na) - cache-hit/stale",
"6233306365613731.service-name.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns client error: 101 empty record received",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error"

For this chain, Kong tried every domain:type combination and all of them failed, so you could also check them manually at that time with a DNS client. For example, for the entry "6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error", run $ dig @<local_dns_server_ip> 6233306365613731.service-name.default.svc.cluster.local.ec2.internal CNAME

DNS protocol type number:

33 - SRV
5 - CNAME
1 - A
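To check each attempt by hand, the entries from the attempts log can be turned into dig commands. The following is a minimal sketch, assuming every entry has the shape "<name>:<type> - <status>" seen in the log above. The helper names parse_entry and dig_commands are hypothetical, only the three record types listed above are mapped, and cache-only entries with type (na) are skipped.

```python
import re

# Record type numbers from the list above; other types are skipped.
TYPE_NAMES = {1: "A", 5: "CNAME", 33: "SRV"}

# One entry looks like "<name>:<type> - <status>". The type is "(na)" for
# cache-only lookups, and the name may carry a "(short)" prefix.
ENTRY_RE = re.compile(
    r"^(?:\(short\))?(?P<name>[^:]+):(?P<qtype>\(na\)|\d+) - (?P<status>.*)$"
)


def parse_entry(entry):
    """Parse one attempts-log entry into a dict, or return None if it does not match."""
    # Drop surrounding whitespace, quotes and the trailing comma of the log list.
    cleaned = entry.strip().strip('",')
    match = ENTRY_RE.match(cleaned)
    if match is None:
        return None
    qtype = match["qtype"]
    return {
        "name": match["name"],
        "qtype": None if qtype == "(na)" else int(qtype),
        "status": match["status"],
    }


def dig_commands(entries, server):
    """Build one dig command per entry that carries a known record type."""
    commands = []
    for entry in entries:
        parsed = parse_entry(entry)
        if parsed is None or parsed["qtype"] not in TYPE_NAMES:
            continue
        # The "@" marks the server; without it dig treats the address as a name.
        commands.append(f"dig @{server} {parsed['name']} {TYPE_NAMES[parsed['qtype']]}")
    return commands


if __name__ == "__main__":
    sample = [
        '"(short)service-name:(na) - cache-hit/stale",',
        ' "service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dereferencing SRV",',
    ]
    for command in dig_commands(sample, "100.64.0.10"):
        print(command)
```

Running the printed commands from a pod in the same cluster, at the time the error occurs, shows whether the local DNS server really answers NXDOMAIN for each name and type.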


chobits commented on June 6, 2024

And for this entry: 6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected

Could you provide the output of $ dig @<dns ip> 6233306365613731.service-name.default.svc.cluster.local SRV? (Note the @ before the server address; without it, dig treats the address as another name to look up.) The output could tell us why the recursion detected error was reported by the DNS client. This message from Kong's DNS client means that there is a recursion loop in the SRV result.


jyc5120 commented on June 6, 2024

Thank you for helping. We tested dig on our cluster.
Yes, we occasionally see one warning line: ;; WARNING: recursion requested but not available. But nslookup 6233306365613731.xxx.xx always fails.
I don't know why we have the 6233306365613731.xxx.xx domain. The service-name itself can always be resolved by DNS.

kubectl exec -i -t dnsutils -- dig 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV

Got Answer:

; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14191
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV

;; AUTHORITY SECTION:
cluster.local.		30	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1715361132 7200 1800 86400 30

;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:17:36 UTC 2024
;; MSG SIZE  rcvd: 207

And sometimes the warning does not appear:

; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54383
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;100.64.0.10.			IN	A

;; AUTHORITY SECTION:
.			20	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2024051300 1800 900 604800 86400

;; Query time: 1 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE  rcvd: 115

;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 63690
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV

;; AUTHORITY SECTION:
cluster.local.		60	IN	SOA	ns.dns.cluster.local. hostmaster.cluster.local. 1715616000 28800 7200 604800 60

;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE  rcvd: 196


jyc5120 commented on June 6, 2024

I tested the domain:
kubectl exec -i -t dnsutils -- nslookup 6233306365613731.service-name.default.svc.cluster.local
This intermittently fails with: ** server can't find 6233306365613731.service-name.default.svc.cluster.local: NXDOMAIN

kubectl exec -i -t dnsutils -- nslookup service-name.default.svc.cluster.local
This domain always resolves successfully.

Do you think it could be the cause? How can we fix that?


jyc5120 commented on June 6, 2024

I don't know what is happening in CoreDNS. The logs show the following:

[INFO] 100.97.128.6:52380 - 33150 "SRV IN service-name.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000204145s
[INFO] 100.97.128.6:52766 - 10810 "SRV IN service-name.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000145161s
[INFO] 100.97.128.6:37468 - 45120 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000114833s
[INFO] 100.97.128.6:43556 - 11260 "SRV IN service-name. udp 42 false 512" NXDOMAIN qr,aa,rd,ra 117 0.000069646s
[INFO] 100.113.0.2:36230 - 54693 "SRV IN service-name.default.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000164341s
[INFO] 100.110.0.4:37313 - 58428 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000196046s
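To see at a glance which names CoreDNS answers with NXDOMAIN, log lines like the ones above can be tallied by name, type and response code. This is a minimal sketch that assumes the log format shown here (query in double quotes, followed by the response code); the function name tally_rcodes is hypothetical.

```python
import re
from collections import Counter

# Matches the quoted query ("<type> IN <name> ...") and the response code after it.
LOG_RE = re.compile(r'"(?P<qtype>\S+) IN (?P<name>\S+) [^"]*" (?P<rcode>[A-Z]+)')


def tally_rcodes(lines):
    """Count log lines per (name, query type, response code)."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            counts[(match["name"], match["qtype"], match["rcode"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[INFO] 100.97.128.6:52380 - 33150 "SRV IN service-name.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000204145s',
        '[INFO] 100.97.128.6:37468 - 45120 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000114833s',
    ]
    for key, count in tally_rcodes(sample).items():
        print(count, *key)
```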


chobits commented on June 6, 2024

> Thank you for helping. We tested dig on our cluster. Yes, we occasionally see one warning line: ;; WARNING: recursion requested but not available. But nslookup 6233306365613731.xxx.xx always fails. I don't know why we have the 6233306365613731.xxx.xx domain. The service-name itself can always be resolved by DNS.

I feel that this is related to the k8s/DNS configuration, but that is beyond my understanding 😢

From Kong's output, it seems service-name.default.svc.cluster.local:SRV returns SRV records pointing to 6233306365613731.service-name.default.svc.cluster.local. Kong then tries to dereference and resolve 6233306365613731.service-name.default.svc.cluster.local:A, but gets NXDOMAIN. So you can go through Kong's attempts list of every domain and type, pick the one that you expect to return IP addresses, and configure your local DNS server to return an IP address for that domain and type (usually type A). Then Kong's DNS client can return the IP address to its caller.



chobits commented on June 6, 2024

> I tested the domain: kubectl exec -i -t dnsutils -- nslookup 6233306365613731.service-name.default.svc.cluster.local intermittently ** server can't find 6233306365613731.service-name.default.svc.cluster.local: NXDOMAIN
>
> kubectl exec -i -t dnsutils -- nslookup service-name.default.svc.cluster.local this domain always resolved successfully.
>
> Do you think it could be the cause? How can we fix that?

If you are sure that you can use type A for service-name.default.svc.cluster.local, you can remove SRV from the dns_order=... option in kong.conf, which is LAST,SRV,A,CNAME by default.


jyc5120 commented on June 6, 2024

Thank you again!
I tested removing SRV (dns_order=LAST,A,CNAME), and the errors have not appeared since.
I thought Kong would try all 4 DNS types and only report an error if all of them failed. Now it looks like it ends up trying SRV records only?


chobits commented on June 6, 2024

If you remove SRV from dns_order, Kong will not try SRV.

Kong queries the domain:type combinations for the requested domain one by one until it gets an available result, such as an IP address or an SRV target. If it gets an IP address along the way, it returns it directly. If it gets an SRV target, it re-queries the domain that the SRV target points to.
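The behaviour described above can be modelled in a few lines. This is a simplified sketch, not Kong's implementation: resolve, sequence_for and lookup are hypothetical names, only A and SRV answers are modelled, only the first SRV target is followed, and it assumes that once an SRV answer is found the remaining types for the original name are not tried.

```python
def resolve(name, sequence_for, lookup, seen=None):
    """Return a list of IP addresses for `name`, or None if resolution fails.

    sequence_for(name) -> list of (fqdn, rtype) pairs to try, in order.
    lookup(fqdn, rtype) -> list of answers, or None on NXDOMAIN / empty reply.
    A answers are IP strings; SRV answers are target names.
    """
    seen = set() if seen is None else seen
    if name in seen:
        # The SRV chain points back to a name already being resolved.
        return None
    seen.add(name)

    for fqdn, rtype in sequence_for(name):
        answers = lookup(fqdn, rtype)
        if not answers:
            continue  # no usable record, move on to the next domain:type
        if rtype == "A":
            return answers  # an IP address is returned directly
        if rtype == "SRV":
            # An SRV answer counts as a result: follow its target instead of
            # falling back to the remaining types for the original name.
            return resolve(answers[0], sequence_for, lookup, seen)
    return None


if __name__ == "__main__":
    # service-name has both SRV and A records, but the SRV target has no A record.
    records = {
        ("service-name", "SRV"): ["6233.service-name"],
        ("service-name", "A"): ["10.0.0.1"],
    }
    lookup = lambda fqdn, rtype: records.get((fqdn, rtype))
    with_srv = lambda n: [(n, "SRV"), (n, "A")]
    without_srv = lambda n: [(n, "A")]
    print(resolve("service-name", with_srv, lookup))
    print(resolve("service-name", without_srv, lookup))
```

In this model the first call fails because the SRV target cannot be resolved, even though service-name itself has an A record, while the second call (SRV removed from the order) returns the address. That matches what was observed in this thread after changing dns_order.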

The sequence of these domain:type combinations is generated from the domain/search options in resolv.conf and the dns_order option in kong.conf. For example, this test case shows how Kong's DNS client generates the query sequence: https://github.com/Kong/kong/blob/master/spec/01-unit/21-dns-client/02-client_spec.lua#L190
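As a rough illustration, the ordering visible in the attempts log earlier in this thread (one record type at a time, the name as given first, then the name with each search suffix appended) can be reproduced as follows. This is a sketch under assumptions, not Kong's code: build_query_sequence and last_type are hypothetical names, LAST is modelled as "the type that succeeded last time, if any", and ndots handling is left out.

```python
def build_query_sequence(name, search, dns_order, last_type=None):
    """Return the (fqdn, rtype) pairs to try, in order.

    name      -- the name being resolved
    search    -- search suffixes, as in the resolv.conf "search" option
    dns_order -- record types to try, e.g. ["LAST", "SRV", "A", "CNAME"]
    last_type -- type that resolved successfully last time (stands in for LAST)
    """
    types = []
    for rtype in dns_order:
        if rtype == "LAST":
            rtype = last_type  # None when there was no earlier success
        if rtype is not None and rtype not in types:
            types.append(rtype)

    # Name as given first, then one candidate per search suffix.
    names = [name] + [f"{name}.{suffix}" for suffix in search]
    return [(fqdn, rtype) for rtype in types for fqdn in names]


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local",
              "cluster.local", "ec2.internal"]
    order = ["LAST", "SRV", "A", "CNAME"]
    for fqdn, rtype in build_query_sequence("service-name", search, order, last_type="A"):
        print(f"{fqdn}:{rtype}")
```

With four search suffixes and three record types this already yields 15 queries per name, which is why a failing lookup produces such a long attempts list.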

