Coder Social home page Coder Social logo

TestFleet* is failing about cloud-on-k8s HOT 13 OPEN

barkbay avatar barkbay commented on June 11, 2024
TestFleet* is failing

from cloud-on-k8s.

Comments (13)

pebrc avatar pebrc commented on June 11, 2024 1

@cmacknz could you help us assess what the problem might be? It is an older version of the stack, 8.1.3, that we are testing here that might have known issues. If that were the case we would be also glad for some guidance for which versions to mute our tests.

from cloud-on-k8s.

cmacknz avatar cmacknz commented on June 11, 2024

From the logs posted here, this is what I see:

  1. Elastic Agent can't communicate with Fleet Server, which is the cause of the connection refused errors.
  2. Fleet Server seems to be gracefully shutting down. The Error: fleet-server failed: context canceled is not a real error, it is us failing to filter this out because the internal contexts are expected to be cancelled during shutdown.

I suspect that Fleet Server restarting is going to be the root cause here but I can't tell why it is shutting down in the logs.

Fleet Server doesn't actually depend on Kibana at all, it depends on Elasticsearch and information between Fleet Server and Kibana is exchanged via the .fleet indices.

Was Elasticsearch working as expected during this test? Are there any more Fleet Server logs besides those posted here?

from cloud-on-k8s.

barkbay avatar barkbay commented on June 11, 2024

Was Elasticsearch working as expected during this test?

Cluster was green with the following indices:

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size sth
green  open   .apm-agent-configuration            Phgbu1G_QuKRypuhfrlu5g   1   1          0            0       450b           225b false
green  open   .apm-custom-link                    IdCZhr3qRdOVN8NhDucCYg   1   1          0            0       450b           225b false
green  open   .ds-ilm-history-5-2024.04.29-000001 mSSaUyQMQcmlaa8G_PU8dA   1   1          6            0     37.2kb         18.6kb false
green  open   .fleet-agents-7                     0Qw2CLFgS6Ols6CBQ2hXRQ   1   1          3            1      429kb        208.3kb false
green  open   .fleet-enrollment-api-keys-7        kCxvPPR6T7y9co0Tv6NNUg   1   1          2            0     13.2kb          6.6kb false
green  open   .geoip_databases                    _uqV82zNRZCZbpkQFPu8sw   1   1         33            0     62.3mb         31.1mb false
green  open   .kibana-event-log-8.1.3-000001      SCl-CCDPSxCTKKFvNyhxVQ   1   1          2            0     24.6kb         12.3kb false
green  open   .kibana_8.1.3_001                   mbYdeM1-RKCDNPXsN_iybw   1   1       1739          998     19.1mb          9.5mb false
green  open   .kibana_security_session_1          Kl6U1-TzRHqeEPcdw7nInA   1   1          0            0       450b           225b false
green  open   .kibana_task_manager_8.1.3_001      7HnEovZaRt-MI7Ds-qcsFg   1   1         20          984    430.1kb        180.8kb false
green  open   .security-7                         bkRgaX0eSN2LjRcumxkiHA   1   1        110            0    721.1kb        364.4kb false
green  open   .tasks                              njUPJ0uPSOy1htV2deDGzg   1   1          2            0     22.3kb         14.2kb false

I can't find any error in the logs.

Are there any more Fleet Server logs besides those posted here?

Sorry for the screenshot, I have some issues when I try to get an export:

image

Please note that Kibana has used my TZ (GMT+2) in the screenshot above

from cloud-on-k8s.

cmacknz avatar cmacknz commented on June 11, 2024

@michel-laterman does anything jump out in these 8.1.3 agent logs?

from cloud-on-k8s.

michel-laterman avatar michel-laterman commented on June 11, 2024

The "Fleet Server - Waiting in policy..." message indicates that the policy self-monitor is awaiting for the policy to be detected, but .fleet-policies isn't a part of the indices listed in #7760 (comment)

from cloud-on-k8s.

cmacknz avatar cmacknz commented on June 11, 2024

.fleet-policies is written by the Fleet plugin in Kibana correct?

So the problem may be Kibana after all.

from cloud-on-k8s.

michel-laterman avatar michel-laterman commented on June 11, 2024

yes, i think it's written by kibana. but i'm not sure why there's no error logs from fleet-server to indicate the reason for the shutdown

from cloud-on-k8s.

barkbay avatar barkbay commented on June 11, 2024

I'll try to reproduce it in my dev env, with log level increased.

from cloud-on-k8s.

barkbay avatar barkbay commented on June 11, 2024

FWIW the last successful run was on Saturday 2024-04-27T23:06:08.133397756Z , first failure is reported on Monday 2024-04-28. There is no commit in ECK during that time frame 😕

from cloud-on-k8s.

barkbay avatar barkbay commented on June 11, 2024

I'll try to reproduce it in my dev env, with log level increased.

image

😞

from cloud-on-k8s.

barkbay avatar barkbay commented on June 11, 2024

I reproduced the problem in our CI, attempted to run the Agent diagnostic tool using https://github.com/elastic/eck-diagnostics/, but for some reasons there is no Agent diagnostic data in the archive 😞

I also attempted to enable debug logs without success, it does not seem to give us more logs, I'm a bit out of ideas...

from cloud-on-k8s.

kvalliyurnatt avatar kvalliyurnatt commented on June 11, 2024

We seem to hit a similar failure on 8.15.0-SNAPSHOT as well
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/538#018f5503-4478-47bc-a9f5-ad5f1178dbf3

from cloud-on-k8s.

kvalliyurnatt avatar kvalliyurnatt commented on June 11, 2024

I think I was able to repro this on my dev cluster, and I saw no container kills or pod restarts but on the node where the agent pods were running, I see the agent process being continously OOM killed

[Wed May  8 22:10:22 2024] agentbeat invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=-997
[Wed May  8 22:10:22 2024] CPU: 7 PID: 64018 Comm: agentbeat Not tainted 6.1.75+ #1
[Wed May  8 22:10:22 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
[Wed May  8 22:10:22 2024] Call Trace:
[Wed May  8 22:10:22 2024]  <TASK>
[Wed May  8 22:10:22 2024]  dump_stack_lvl+0x4a/0x70
[Wed May  8 22:10:22 2024]  dump_header+0x52/0x250
[Wed May  8 22:10:22 2024]  oom_kill_process+0x10a/0x220
[Wed May  8 22:10:22 2024]  out_of_memory+0x3dc/0x5c0
[Wed May  8 22:10:22 2024]  ? __wake_up_locked_key+0x52/0x80
[Wed May  8 22:10:22 2024]  try_charge_memcg+0x827/0xa90
[Wed May  8 22:10:22 2024]  charge_memcg+0x3f/0x1f0
[Wed May  8 22:10:22 2024]  __mem_cgroup_charge+0x2b/0x80
[Wed May  8 22:10:22 2024]  wp_page_copy+0x3a8/0xa40
[Wed May  8 22:10:22 2024]  handle_mm_fault+0x84a/0x16b0
[Wed May  8 22:10:22 2024]  do_user_addr_fault+0x271/0x4d0
[Wed May  8 22:10:22 2024]  exc_page_fault+0x78/0xf0
[Wed May  8 22:10:22 2024]  asm_exc_page_fault+0x22/0x30
[Wed May  8 22:10:22 2024] RIP: 0033:0x55c6379eea4e
[Wed May  8 22:10:22 2024] Code: 24 38 48 8b 7c 24 78 48 8d 05 ae 23 7b 09 e8 c9 22 90 ff 48 89 44 24 58 84 00 48 8b 94 24 98 00 00 00 48 8b b4 24 90 00 00 00 <48> 89 30 83 3d f8 6e 79 0d 00 74 02 eb 04 eb 14 66 90 e8 5b 69 96
[Wed May  8 22:10:22 2024] RSP: 002b:000000c0015bc330 EFLAGS: 00010246
[Wed May  8 22:10:22 2024] RAX: 000000c00363d048 RBX: 0000000000000088 RCX: 000000c00362dec0
[Wed May  8 22:10:22 2024] RDX: 000055c644855cc8 RSI: 000055c6402ca1a0 RDI: 0000000000000000
[Wed May  8 22:10:22 2024] RBP: 000000c0015bc430 R08: 0000000000000110 R09: ffffffffffffffff
[Wed May  8 22:10:22 2024] R10: 0000000000000000 R11: 000000000001b1e7 R12: ffffffffffffffff
[Wed May  8 22:10:22 2024] R13: 000055c6408e9de0 R14: 000000c000f024e0 R15: 0000000000000000
[Wed May  8 22:10:22 2024]  </TASK>
[Wed May  8 22:10:22 2024] memory: usage 358400kB, limit 358400kB, failcnt 52936
[Wed May  8 22:10:22 2024] memory+swap: usage 358400kB, limit 9007199254740988kB, failcnt 0
[Wed May  8 22:10:22 2024] kmem: usage 7228kB, limit 9007199254740988kB, failcnt 0
[Wed May  8 22:10:22 2024] Memory cgroup stats for /kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c:
[Wed May  8 22:10:22 2024] anon 359591936
                           file 4096
                           kernel 7401472
                           kernel_stack 1343488
                           pagetables 4644864
                           sec_pagetables 0
                           percpu 35920
                           sock 0
                           vmalloc 65536
                           shmem 0
                           file_mapped 0
                           file_dirty 0
                           file_writeback 0
                           swapcached 0
                           anon_thp 0
                           file_thp 0
                           shmem_thp 0
                           inactive_anon 359559168
                           active_anon 32768
                           inactive_file 0
                           active_file 4096
                           unevictable 0
                           slab_reclaimable 223696
                           slab_unreclaimable 828072
                           slab 1051768
                           workingset_refault_anon 0
                           workingset_refault_file 32589
                           workingset_activate_anon 0
                           workingset_activate_file 258
                           workingset_restore_anon 0
                           workingset_restore_file 3
                           workingset_nodereclaim 0
                           pgscan 123548
                           pgsteal 34679
                           pgscan_kswapd 0
                           pgscan_direct 123548
                           pgsteal_kswapd 0
                           pgsteal_direct 34679
                           pgfault 5820413
                           pgmajfault 0
                           pgrefill 6278
                           pgactivate 6538
                           pgdeactivate 6278
                           pglazyfree 0
                           pglazyfreed 0
                           thp_fault_alloc 0
                           thp_collapse_alloc 0
[Wed May  8 22:10:22 2024] Tasks state (memory values in pages):
[Wed May  8 22:10:22 2024] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed May  8 22:10:22 2024] [  48825]     0 48825      257        1    28672        0          -998 pause
[Wed May  8 22:10:22 2024] [  48859]     0 48859      625      147    49152        0          -997 tini
[Wed May  8 22:10:22 2024] [  48871]     0 48871   640830    18324   450560        0          -997 elastic-agent
[Wed May  8 22:10:22 2024] [  49408]     0 49408   602607    44255   847872        0          -997 agentbeat
[Wed May  8 22:10:22 2024] [  60311]     0 60311   584174    44946   847872        0          -997 agentbeat
[Wed May  8 22:10:22 2024] [  63295]     0 63295   584238    46602   839680        0          -997 agentbeat
[Wed May  8 22:10:22 2024] [  63384]     0 63384   584254    47449   843776        0          -997 agentbeat
[Wed May  8 22:10:22 2024] [  64018]     0 64018   602687    47556   856064        0          -997 agentbeat
[Wed May  8 22:10:22 2024] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=803821597f2bfbedf470b149da5dcac07dda15484666b779993163c3edeafcc7,mems_allowed=0,oom_memcg=/kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c,task_memcg=/kubepods/pod6b3e8711-ce89-4a87-b342-fc12cd732b9c/803821597f2bfbedf470b149da5dcac07dda15484666b779993163c3edeafcc7,task=agentbeat,pid=64018,uid=0
[Wed May  8 22:10:22 2024] Memory cgroup out of memory: Killed process 64018 (agentbeat) total-vm:2410748kB, anon-rss:69512kB, file-rss:120712kB, shmem-rss:0kB, UID:0 pgtables:836kB oom_score_adj:-997

from cloud-on-k8s.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.