Comments (14)
@chobits Could you take a look?
from kong.
After downgrading Kong to 3.4.2, the error is very rare, but it still happens.
from kong.
It seems that Kong attempted many domain:type queries in the query sequence but could not get any available records; see the Tried ... attempts log. A similar troubleshooting case in #12890 (reply in thread) contains a detailed explanation of Kong's Tried ... log.
If Kong reports this error sporadically, it means your local DNS occasionally replied NXDOMAIN to every domain:type query in the sequence.
from kong.
Yes, you can increase dns_stale_ttl to a larger value or set the option dns_no_sync=off to mitigate this problem, but you still need to check your local DNS server: at least once, it failed to reply with available records for the query.
from kong.
On 3.6.1 the error is quite frequent, and increasing dns_stale_ttl did not mitigate it. Downgrading to 3.4.2 made it much better.
So I still suspect something in Kong. We run Kong inside our K8s cluster, and the DNS server is CoreDNS. I did not find any errors when debugging kube-dns.
The issue is critical because every error means one failed user request, even at a low frequency.
from kong.
If you can easily reproduce this problem, it is not hard to debug. You need to follow the query chain shown in the error log and check whether you can get each DNS result from your local DNS. We can tell you how to debug, but we cannot debug it for you unless you provide reproduction steps.
"(short)service-name:(na) - cache-hit/stale",
"service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dereferencing SRV",
"(short)6233306365613731.service-name.default.svc.cluster.local:(na) - cache-hit/stale",
"6233306365613731.service-name.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:1 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:33 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns client error: 101 empty record received",
"6233306365613731.service-name.default.svc.cluster.local.default.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.cluster.local:5 - cache-hit/stale/scheduled/dns server error: 3 name error",
"6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error"
For this chain, Kong tried every domain:type combination and all of them failed, so I think you should also check them manually when the error occurs, using a DNS client. For example, for the entry "6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5 - cache-hit/stale/scheduled/dns server error: 3 name error", run:
$ dig @<local_dns_server_ip> 6233306365613731.service-name.default.svc.cluster.local.ec2.internal CNAME
DNS protocol type numbers:
33 - SRV
5 - CNAME
1 - A
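To check each entry of the chain by hand, it may help to turn the log entries into dig commands. Below is a minimal Python sketch, assuming every entry has the "name:type - status" shape seen in the log above. The helper names parse_tried and dig_command are made up for this illustration and are not part of Kong.

```python
# Map the numeric DNS type codes listed above to their names.
DNS_TYPES = {1: "A", 5: "CNAME", 33: "SRV"}


def parse_tried(entry):
    """Split one 'name:type - status' entry into its parts.

    Assumes the entry format shown in the log excerpt above.
    """
    name_and_type, _, status = entry.partition(" - ")
    name, _, qtype = name_and_type.rpartition(":")
    # Entries for short names carry a "(short)" prefix and the type "(na)".
    if name.startswith("(short)"):
        name = name[len("(short)"):]
    type_name = DNS_TYPES.get(int(qtype)) if qtype.isdigit() else None
    # The last '/'-separated element of the status is the final outcome.
    return {"name": name, "type": type_name, "result": status.split("/")[-1]}


def dig_command(entry, server):
    """Build the matching dig command, or None if the type is not known."""
    parsed = parse_tried(entry)
    if parsed["type"] is None:
        return None
    return f"dig @{server} {parsed['name']} {parsed['type']}"


if __name__ == "__main__":
    entry = (
        "6233306365613731.service-name.default.svc.cluster.local.ec2.internal:5"
        " - cache-hit/stale/scheduled/dns server error: 3 name error"
    )
    print(dig_command(entry, "100.64.0.10"))
```

Running each generated command against the local DNS server at the time of the failure shows which domain:type combination the server really answers.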
from kong.
As for this entry: 6233306365613731.service-name.default.svc.cluster.local:33 - cache-hit/stale/scheduled/recursion detected
Could you provide the output of $ dig @<dns ip> 6233306365613731.service-name.default.svc.cluster.local SRV? It could tell us why the recursion detected error was reported by the DNS client. This message from Kong's DNS client means that there is a recursion loop in the SRV result.
from kong.
Thank you for helping. We tested dig on our cluster.
Yes, we occasionally see a one-line warning: ;; WARNING: recursion requested but not available. But nslookup 6233306365613731.xxx.xx always fails.
I don't know why we have the 6233306365613731.xxx.xx domain. The service-name itself can always be resolved by DNS.
kubectl exec -i -t dnsutils -- dig 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
Got Answer:
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14191
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV
;; AUTHORITY SECTION:
cluster.local. 30 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1715361132 7200 1800 86400 30
;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:17:36 UTC 2024
;; MSG SIZE rcvd: 207
And sometimes the warning does not appear:
; <<>> DiG 9.9.5-9+deb8u19-Debian <<>> 100.64.0.10 6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54383
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;100.64.0.10. IN A
;; AUTHORITY SECTION:
. 20 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2024051300 1800 900 604800 86400
;; Query time: 1 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE rcvd: 115
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 63690
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;6233306365613731.service-name.default.svc.cluster.local.svc.cluster.local. IN SRV
;; AUTHORITY SECTION:
cluster.local. 60 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1715616000 28800 7200 604800 60
;; Query time: 2 msec
;; SERVER: 100.64.0.10#53(100.64.0.10)
;; WHEN: Mon May 13 16:28:26 UTC 2024
;; MSG SIZE rcvd: 196
from kong.
I tested the domain:
kubectl exec -i -t dnsutils -- nslookup 6233306365613731.service-name.default.svc.cluster.local
It intermittently fails with: ** server can't find 6233306365613731.service-name.default.svc.cluster.local: NXDOMAIN
kubectl exec -i -t dnsutils -- nslookup service-name.default.svc.cluster.local
This domain always resolves successfully.
Do you think this could be the cause? How can we fix it?
from kong.
I don't know what is happening in CoreDNS. The logs show the following:
[INFO] 100.97.128.6:52380 - 33150 "SRV IN service-name.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000204145s
[INFO] 100.97.128.6:52766 - 10810 "SRV IN service-name.svc.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000145161s
[INFO] 100.97.128.6:37468 - 45120 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000114833s
[INFO] 100.97.128.6:43556 - 11260 "SRV IN service-name. udp 42 false 512" NXDOMAIN qr,aa,rd,ra 117 0.000069646s
[INFO] 100.113.0.2:36230 - 54693 "SRV IN service-name.default.svc.cluster.local. udp 68 false 512" NOERROR qr,aa,rd 254 0.000164341s
[INFO] 100.110.0.4:37313 - 58428 "SRV IN service-name.cluster.local. udp 56 false 512" NXDOMAIN qr,aa,rd 149 0.000196046s
from kong.
I feel that it's related to the k8s/DNS configuration, but that is beyond my understanding 😢
From Kong's output, it seems service-name.default.svc.cluster.local.svc.cluster.local:SRV returns SRV records pointing to 6233306365613731.service-name.default.svc.cluster.local, then Kong tries to dereference and resolve 6233306365613731.service-name.default.svc.cluster.local:A, but gets NXDOMAIN. So you can check Kong's attempts list of every domain and type, pick the one that you expect to contain IP addresses, and configure your local DNS server to return IP addresses for that domain and type (usually the A type). Then Kong's DNS client can return the IP address to the caller.
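To make the dereferencing step concrete, here is a toy Python model of it. It is only a sketch under stated assumptions: the records table stands in for the local DNS server, the resolve function and its status strings are invented for this illustration, and Kong's real client is more involved (search domains, caching, CNAME handling).

```python
def resolve(name, records, order=("SRV", "A"), tried=None, seen=None):
    """Return (ip_list_or_None, tried_log) for a toy records table.

    records maps (name, type) to a list of answers: IP strings for "A",
    target names for "SRV". A missing key models an NXDOMAIN reply.
    """
    tried = [] if tried is None else tried
    seen = set() if seen is None else seen
    if name in seen:
        # An SRV target that leads back to a name already being resolved.
        tried.append(f"{name} - recursion detected")
        return None, tried
    seen.add(name)
    for qtype in order:
        answers = records.get((name, qtype), [])
        if not answers:
            tried.append(f"{name}:{qtype} - name error")
        elif qtype == "A":
            return answers, tried
        else:
            # SRV answers are names, so each target must be resolved again.
            tried.append(f"{name}:{qtype} - dereferencing SRV")
            for target in answers:
                ips, _ = resolve(target, records, order, tried, seen)
                if ips:
                    return ips, tried
    return None, tried


if __name__ == "__main__":
    svc = "service-name.default.svc.cluster.local"
    target = "6233306365613731." + svc
    # Only the SRV record exists; the target has no A record.
    records = {(svc, "SRV"): [target]}
    ips, tried = resolve(svc, records)
    print(ips)
    for line in tried:
        print(line)
```

In this model, giving the SRV target an A record, or giving the service name itself an A record, is enough for resolve to return an address, which is the kind of DNS-side fix suggested above.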
from kong.
If you are sure that you can use the A type for service-name.default.svc.cluster.local, you can remove SRV from the dns_order=... option in kong.conf, which is LAST,SRV,A,CNAME by default.
from kong.
Thank you again!
I tested removing SRV, setting dns_order=LAST,A,CNAME, and the errors have not appeared since.
I thought Kong would try all 4 DNS types and only report an error if they all failed. But it looks like it ends up trying SRV records only?
from kong.
If you remove SRV from dns_order, Kong will not try SRV.
Kong queries the domain:type combinations for the queried domain in order until it gets an available result, such as an IP address or an SRV target. If it gets an IP address, it returns it directly. If it gets an SRV target, it re-queries the domain that the SRV target points to.
The query sequence of these domain:type combinations is generated from the domain/search options in resolv.conf and the dns_order option in kong.conf. For example, this test case shows how Kong's DNS client generates the query sequence: https://github.com/Kong/kong/blob/master/spec/01-unit/21-dns-client/02-client_spec.lua#L190
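As a rough illustration of how such a sequence can be built, here is a simplified Python sketch. It is modeled on the ordering visible in the Tried log earlier in this thread (all names for one type, then the next type), not on Kong's source code. The function name query_sequence and the last_type parameter are assumptions made for this example, and ndots handling is left out.

```python
# Numeric type codes as they appear in the Tried log.
TYPE_CODES = {"A": 1, "CNAME": 5, "SRV": 33}


def query_sequence(name, search, dns_order, last_type=None):
    """List 'name:type_code' pairs in the order they would be tried.

    search    -- suffixes from the search/domain option of resolv.conf
    dns_order -- e.g. ["LAST", "SRV", "A", "CNAME"] as in kong.conf
    last_type -- type of the last successful lookup, used for "LAST"
                 (assumed here; "LAST" is skipped when it is None)
    """
    types = []
    for entry in dns_order:
        qtype = last_type if entry == "LAST" else entry
        if qtype is not None and qtype not in types:
            types.append(qtype)
    # Assumption: the name as given is tried first, then each search suffix.
    names = [name] + [f"{name}.{suffix}" for suffix in search]
    return [f"{n}:{TYPE_CODES[t]}" for t in types for n in names]


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local",
              "cluster.local", "ec2.internal"]
    name = "service-name.default.svc.cluster.local"
    for item in query_sequence(name, search, ["LAST", "A", "CNAME"], "A"):
        print(item)
```

With SRV removed from dns_order, no ":33" entry appears in the generated sequence, which matches the statement that Kong will not try SRV in that case.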
from kong.