Coder Social home page Coder Social logo

Comments (18)

henrythethird avatar henrythethird commented on July 20, 2024 1

Because of a hunch I replaced the http dependency of our plugin with resty.http. So far we haven't had any more issues. Seems like for some reason the lua-http dependency was hogging the DNS-Socket. Not sure why that was only happening with newer versions of Kong, probably some kind of race condition. Anyways, cheers for the support

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Could it be related to this issue (currently open): #12290
It's not the same error in the logs, but a similar response and related to DNS as well

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

We found the change that causes worsens our issue: #11871

We updated our kong.conf with dns_no_sync = on (the name confused me a bit, first I turned it off, thinking it would turn off dns_sync, then I noticed the no in there)

This resolves our problem, but the underlying issue is still there if someone cares to investigate
The problem persists, but happens quite rarely.

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Possibly related to #3223?

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Any input on this?

from kong.

chronolaw avatar chronolaw commented on July 20, 2024

@chobits , could you give us some comments about this?

from kong.

chobits avatar chobits commented on July 20, 2024

will look into it later

from kong.

chobits avatar chobits commented on July 20, 2024

socket busy

This might be sock:setpeername() is trying to resolve the name, during this process, if someone tries to execute sock:send(), then it will report socket busy error

We updated our kong.conf with dns_no_sync = on, The problem persists, but happens quite rarely.

So it is as expected when you enable sync, which means a large number network queries for the same domain at the same time in one worker have been merged into one coroutine. It is like fewer DNS queries trigger fewer DNS errors.

If you could provide debug log in the context of error message, it would be beneficial to debug for us.

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Hi @chobits, thanks for chipping in

Here are the logs. I included some startup logs as well so you can see the configuration. It happened right off the bat as well
Edit: I tried with Kong 3.6.1 (the latest available version on Dockerhub)
kong-logs.csv

The problem is quite severe for us (at least with the DNS sync enabled) and happens even if there are very few request being made. The example above are a few system tests that run through. I guess that there are about 50 calls within 2 minutes. I don't imagine it is a heavy load problem, rather a configuration problem on our end or something else

Some more information that could possibly be relevant:

  • We have a custom auth plugin installed with a few lua deps
  • We use the official ubuntu image and update the dependencies in our custom dockerfile (because of our vulnerability management policies)
  • We are inside a Google Kubernetes Cluster and behind a Load Balancer that provides the SSL-Support
  • We deploy via a custom helm chart, since we just need a minimal setup

from kong.

chobits avatar chobits commented on July 20, 2024

Sry for late reply, I have reviewed your kong-logs.csv, but I dont find some useful logs to help us to investigate. Because most useful debug logs have been commentted in DNS client library.

And I guess there might be some incorrect DNS resoluton operate executed by sock:connect() in lua-nginx-module, I still need to spend some time on that and then tell you how to debug it.

BTW, Is xxx hostname in your logs a hidden domain for the real domain name?

from kong.

chobits avatar chobits commented on July 20, 2024

hi I have thought of a method to mitigate your issue, you could configure dns_stale_ttl from 4s to a longer value. Currently in kong latest version, we have set it to 3600s, you could also set it to this value, which gives your dns records a longer lifespan.

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Thanks for the reply.

Yes xxx is a real kubernetes domain name(s) (internal DNS)

It seems like the staleTtl is already at 3600 as far as I can tell. Please correct me if I'm wrong here. Here a snippet from the logs that describes the DNS settings more verbosely (staleTtl being the second from the bottom):

"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:647: init(): [dns-client] emptyTtl = 30 s","2024-03-07T08:24:08.956815518Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:645: init(): [dns-client] badTtl = 1 s","2024-03-07T08:24:08.956632467Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:633: init(): [dns-client] search = king.svc.cluster.local, svc.cluster.local, cluster.local, google.internal","2024-03-07T08:24:08.956478982Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:631: init(): [dns-client] ndots = 5","2024-03-07T08:24:08.956330928Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:627: init(): [dns-client] timeout = 2000 ms","2024-03-07T08:24:08.956134323Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:618: init(): [dns-client] no_random = true","2024-03-07T08:24:08.955963978Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:611: init(): [dns-client] attempts = 5","2024-03-07T08:24:08.955823817Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:606: init(): [dns-client] nameserver 10.0.0.10","2024-03-07T08:24:08.955685968Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:565: init(): [dns-client] validTtl = nil","2024-03-07T08:24:08.955670959Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: localhost = [::1]","2024-03-07T08:24:08.955662768Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:541: init(): [dns-client] adding A-record from 'hosts' file: localhost = 127.0.0.1","2024-03-07T08:24:08.955627713Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allrouters = [fe00::2]","2024-03-07T08:24:08.954856727Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localnet = [fe00::0]","2024-03-07T08:24:08.954672873Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-allnodes = [fe00::1]","2024-03-07T08:24:08.954364051Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-loopback = [::1]","2024-03-07T08:24:08.954047780Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-localhost = [::1]","2024-03-07T08:24:08.953833227Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:541: init(): [dns-client] adding A-record from 'hosts' file: xxx = xxx","2024-03-07T08:24:08.953549257Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:556: init(): [dns-client] adding AAAA-record from 'hosts' file: ip6-mcastprefix = [fe00::0]","2024-03-07T08:24:08.953099393Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:501: init(): [dns-client] query order = LAST, SRV, A, CNAME","2024-03-07T08:24:08.952530616Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:482: init(): [dns-client] noSynchronisation = false","2024-03-07T08:24:08.950206705Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:478: init(): [dns-client] staleTtl = 3600","2024-03-07T08:24:08.950063140Z"
"2024/03/07 08:24:08 [debug] 1325#0: *1294 [lua] client.lua:473: init(): [dns-client] (re)configuring dns client","2024-03-07T08:24:08.949858585Z"

We don't override any DNS related options in the config file. Here our current kong.conf (injected into the kong:3.4.2 image):

plugins = auth-internal
declarative_config = /etc/kong/kong.yml
nginx_http_lua_shared_dict=auth_internal 10m
# Change for kong>=3.5 cf https://github.com/Kong/kong/issues/12568
# dns_no_sync = on

Thanks @chobits for your support 🙏

from kong.

chobits avatar chobits commented on July 20, 2024

hi @henrythethird

Could you try this docker image: kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu which is built from #12305 based 3.6.x.x. It has passed all our test cases.

In this PR, we refactor our DNS client, if you could try this PR's docker image, it could help us to verify whether this issue still be there in new DNS client. If both versions of kong have this problem, I guess DNS client socket might be messed up by some other subsystem in kong.

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Thanks for the update @chobits. We get the following error with kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu:

2024/03/26 08:27:37 [error] 1331#0: *90612 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/resty/dns_client/init.lua:669: attempt to call method 'lower' (a nil value)
stack traceback:
coroutine 0:
	/usr/local/share/lua/5.1/kong/resty/dns_client/init.lua: in function 'execute_original_func'
	/usr/local/share/lua/5.1/kong/dynamic_hook/init.lua:144: in function 'raw_func'
	/usr/local/share/lua/5.1/kong/tracing/instrumentation.lua:288: in function 'toip'
	/usr/local/share/lua/5.1/kong/runloop/balancer/init.lua:364: in function 'execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:832: in function 'balancer_execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:1352: in function 'after'
	/usr/local/share/lua/5.1/kong/init.lua:1186: in function 'access'
	access_by_lua(nginx-kong.conf:112):2: in main chunk, client: xxx, server: kong, request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"

Additionally in the beginning we get the following (still the same)

[error] 1330#0: *90586 [lua] init.lua:371: execute(): DNS resolution failed: DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000. Tried: [["cluster.address.local:SRV","DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000"]], client: xxx, server: kong, request: "GET /path/to/resource", host: "xxx", request_id: "0b7dc268d13c6648f089681aa55df436"

from kong.

chobits avatar chobits commented on July 20, 2024

Thanks for the update @chobits. We get the following error with kong/kong:3a034e9417743fcf946deb6f98d2993049b1e589-ubuntu:

2024/03/26 08:27:37 [error] 1331#0: *90612 lua entry thread aborted: runtime error: /usr/local/share/lua/5.1/kong/resty/dns_client/init.lua:669: attempt to call method 'lower' (a nil value)
stack traceback:
coroutine 0:
	/usr/local/share/lua/5.1/kong/resty/dns_client/init.lua: in function 'execute_original_func'
	/usr/local/share/lua/5.1/kong/dynamic_hook/init.lua:144: in function 'raw_func'
	/usr/local/share/lua/5.1/kong/tracing/instrumentation.lua:288: in function 'toip'
	/usr/local/share/lua/5.1/kong/runloop/balancer/init.lua:364: in function 'execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:832: in function 'balancer_execute'
	/usr/local/share/lua/5.1/kong/runloop/handler.lua:1352: in function 'after'
	/usr/local/share/lua/5.1/kong/init.lua:1186: in function 'access'
	access_by_lua(nginx-kong.conf:112):2: in main chunk, client: xxx, server: kong, request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"

Did it constantly occur if you request this request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"?

Did your plugin call this kind of methods: kong.service.set_target() or kong.service.set_upstream().

From kong's logic, it seems that the balancer_data.host should be a string, so if we call balancer_data.host:lower(), it will success while not reporting a runtime error.

Additionally in the beginning we get the following (still the same)

[error] 1330#0: *90586 [lua] init.lua:371: execute(): DNS resolution failed: DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000. Tried: [["cluster.address.local:SRV","DNS server error: failed to send request to UDP server 10.0.0.10:53: socket busy, Query Time: 1711441647.111 0.000"]], client: xxx, server: kong, request: "GET /path/to/resource", host: "xxx", request_id: "0b7dc268d13c6648f089681aa55df436"

So it might be the dns socket messed up, but currently your image has no nginx debug log enabled like --with-debug, so if you enable debug level, we could not get debug log from nginx's logic/lua-nginx-module's logic, only get the debug log from kong's lua land. Let me think of how to trace it.

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

Did it constantly occur if you request this request: "POST /path/to/resource HTTP/1.1", host: "xxx", request_id: "2a1bbe203f4c2a309845b0777f2b7862"?

No only sporadically. It happens even less with the dns_no_sync turned on

Did your plugin call this kind of methods: kong.service.set_target() or kong.service.set_upstream().

No we don't call any of these methods. We have a static yaml that sets those parameters

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

I just tried again with the latest 3.6.1-Dockerfile and now it seems to work 😕. Strange, but I'll take it. Thanks for your support @chobits. Must be something with the dependencies I'd imagine

from kong.

henrythethird avatar henrythethird commented on July 20, 2024

False alarm. Still happens, just less often

from kong.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.