
Comments (10)

jchristgit commented on September 27, 2024

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.


jchristgit commented on September 27, 2024

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?


jchristgit commented on September 27, 2024

Sorry, but I have no environment where I can reproduce this at all.

Since you mentioned that the only difference between stage and prod is the AWS NAT Gateway, I cannot help but assume that the NAT Gateway is at fault and nostrum works properly here. Perhaps the NAT Gateway is messing around with HTTP/2 connections, or it is doing some other strange form of inspection of the request. In nostrum, the bot gateway request does not do anything out of the ordinary in terms of HTTP requests.


tspenov commented on September 27, 2024

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

Thanks for the quick reply! I'll test it out and let you know.


tspenov commented on September 27, 2024

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

@jchristgit I tested it, and I now see this error in the logs:

Date: 2023-06-29T14:03:02.570
Level: error

State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating

Last event = {
  info, {
    gun_down, <0.5653.0>, http2, closed, [#Ref<0.2297931708.4051173377.228586>]
  }
}

Server state = {
  connected, #{
    conn => <0.5653.0>,
    inflight => #{},
    outstanding => #{
      <<"/users/@me">> => {
        initial, {
          [{
            #{body => <<>>,
              headers => [{<<"content-type">>, <<"application/json">>}],
              method => get,
              params => [],
              route => <<"/users/@me">>
            },
            {<0.6601.0>, #Ref<0.2297931708.4051173377.228585>}
          }], []
        }
      }
    },
    remaining_in_window => 48,
    running => #{}
  }
}

Reason for termination = error: {badkey, #Ref<0.2297931708.4051173377.228586>}

Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']

Callback mode = state_functions

Stacktrace = [
  {erlang, map_get, [#Ref<0.2297931708.4051173377.228586>, #{}], [{error_info, #{module => erl_erts_errors}}]},
  {'Elixir.Nostrum.Api.Ratelimiter', '-connected/3-fun-1-', 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 881}]},
  {'Elixir.Enum', '-map/2-lists^map/1-0-', 2, [{file, "lib/enum.ex"}, {line, 1658}]},
  {'Elixir.Nostrum.Api.Ratelimiter', connected, 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 878}]},
  {gen_statem, loop_state_callback, 11, [{file, "gen_statem.erl"}, {line, 1426}]},
  {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 240}]}
]

Time-outs: {1, [{{timeout, reset_bot_calls_window}, expired}]}
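
For context, the {badkey, ...} crash boils down to the :gun_down message carrying stream references that are looked up in a tracking map which is already empty. A minimal reduction of that pattern, plus a defensive variant (hypothetical variable names, not Nostrum's actual code):

# Hypothetical reduction: the :gun_down message lists killed stream
# references, but the map tracking running requests is already empty,
# so the raw lookup raises a {:badkey, ref} error like the one above.
killed_streams = [make_ref()]
running = %{}

# Roughly what blows up:
# Enum.map(killed_streams, fn ref -> :erlang.map_get(ref, running) end)

# Defensive variant: silently skip references that are no longer tracked.
requeueable =
  Enum.flat_map(killed_streams, fn ref ->
    case Map.fetch(running, ref) do
      {:ok, request} -> [request]
      :error -> []
    end
  end)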


tspenov commented on September 27, 2024

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Thank you! I'll test it out tomorrow and let you know.

The Elixir app is deployed to AWS, so there shouldn't be any problem with the network.
I'm not sure why the WS connection to Discord gets closed so often.
Another thing the app runs into very often is {:stream_error, :closed} from gun, in which case I just retry the command/interaction.
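
The retry itself is just a thin wrapper around the call. A simplified sketch (hypothetical module, not part of nostrum, assuming the failure surfaces as an {:error, {:stream_error, :closed}} tuple):

defmodule MyBot.Retry do
  # Hypothetical retry wrapper for the {:stream_error, :closed} case.
  # Assumes the failure surfaces as an {:error, {:stream_error, :closed}} tuple.
  def with_stream_retry(fun, retries \\ 3) do
    case fun.() do
      {:error, {:stream_error, :closed}} when retries > 0 ->
        with_stream_retry(fun, retries - 1)

      other ->
        other
    end
  end
end

# Example:
# MyBot.Retry.with_stream_retry(fn -> Nostrum.Api.create_message(channel_id, "hello") end)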


tspenov commented on September 27, 2024

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Sometimes after Nostrum.Api.get_current_user() is executed I see "Request to \"/users/@me\" was closed abnormally, requeueing" in the logs, but the call neither returns nor raises an exception. It seems to block indefinitely.

To work around this, I have wrapped it in a Task like this:

  defp warm_up(msg_or_interaction, retries \\ 3) do
    t1 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "WARM UP STARTING...")

    # Run the API call in a task so it cannot block the caller indefinitely.
    task = Task.async(fn -> Nostrum.Api.get_current_user() end)

    case Task.yield(task, 5000) || Task.shutdown(task) do
      nil ->
        log(msg_or_interaction, "WARM UP ERROR: Timeout reached.", type: :error)

        if retries > 0 do
          log(msg_or_interaction, "WARM UP TIMEOUT: Retrying...", type: :error)

          warm_up(msg_or_interaction, retries - 1)
        else
          log(msg_or_interaction, "WARM UP TIMEOUT: No more retries.", type: :error)
        end

      {:exit, reason} ->
        # The task itself crashed; log and give up on this attempt.
        log(msg_or_interaction, "WARM UP ERROR: #{inspect(reason)}", type: :error)

      {:ok, result} ->
        # handle the result
        handle_result(result, msg_or_interaction, retries)
    end

    t2 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "Time spent warming up #{t2 - t1}ms.")
  end

Here are some log excerpts:

{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... msg.content=<@1039814526708764742> hi metadata=%
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:32.178"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:32.178"}
{"level":"warning","message":"Request to \"/users/@me\" was closed abnormally, requeueing","timestamp":"2023-06-30T15:16:32.183"}
{"level":"info","message":"[id=1124357812546457661] WARM UP ERROR: Timeout reached. msg.content=<@1039814526708764742> {"level":"info","message":"[id=1124357812546457661] WARM UP TIMEOUT: Retrying...\"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... \"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:37.202"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:37.202"}
{"level":"info","message":"[id=1124357812546457661] WARM UP SUCCESS ….
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 190ms.
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 5191ms …


jchristgit commented on September 27, 2024

I cannot reproduce this here, not with spamming the endpoint nor with running it one by one like you do. Do you have a proxy sitting in front of nostrum? Some outgoing firewall?

tspenov commented on September 27, 2024

I cannot reproduce this here, not with spamming the endpoint nor with running it one by one like you do. Do you have a proxy sitting in front of nostrum? Some outgoing firewall?

@jchristgit The Elixir app is deployed in a Kubernetes cluster on AWS.
The problem occurs only on the production cluster; locally and on the stage cluster everything is fine.
On production I can reproduce it every time the WS connection has been idle for 5-10 minutes (no other API call to Discord has been made).
The only difference between the stage and prod clusters is that the prod cluster sends outgoing requests through an AWS NAT Gateway. The default idle timeout for TCP connections on the AWS NAT Gateway is 350 seconds (~6 minutes).
But I wonder why this happens, since there are HEARTBEAT_ACKs in the logs roughly every 40 seconds.
They should be keeping the connection alive, so this AWS timeout should not kick in.
Do you have any insight on this?


tspenov commented on September 27, 2024

@jchristgit My current workaround is a job that makes an API call every couple of minutes so the connection never goes idle. A rough sketch is below.
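
Sketch of that keep-alive job (hypothetical module name; the interval is chosen to stay under the NAT Gateway's ~350-second idle timeout):

defmodule MyBot.ConnectionKeepAlive do
  # Hypothetical keep-alive process: fires a cheap authenticated API call
  # every few minutes so the HTTP connection to Discord never sits idle
  # long enough for the NAT Gateway to silently drop it.
  use GenServer

  # Well below the ~350 s NAT Gateway idle timeout.
  @interval :timer.minutes(4)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, %{}}
  end

  @impl true
  def handle_info(:ping, state) do
    _ = Nostrum.Api.get_current_user()
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :ping, @interval)
end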

Thank you for the quick responses and the fixes, really appreciated! 💙

I am closing the issue.

