
Comments (11)

cf-gitbot commented on September 13, 2024

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160065986

The labels on this github issue will be updated when the story is started.

from garden-runc-release.

emalm commented on September 13, 2024

Hey @achawki, could you try running sudo ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep "D " on that Diego cell VM and report the output here? That particular ps invocation does not read the process cmdline or environment from memory, so it should not hang the way ps aux has. Also, if you can inspect the memory cgroup for that container, could you report the values in its memory.limit_in_bytes, memory.memsw.usage_in_bytes, and memory.usage_in_bytes cgroup files? This issue sounds like one that we on the Diego and Garden teams have seen before, infrequently, but have never been able to reproduce satisfactorily or get enough insight into the kernel to understand.
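As an illustration of the cgroup request above, here is a minimal sketch that collects the three named files for one container. It assumes a cgroup v1 hierarchy mounted at /sys/fs/cgroup/memory; the function name read_memory_cgroup and the example container path are hypothetical and not part of Garden or Diego. The memsw file is only present when swap accounting is enabled, so missing files are skipped rather than treated as errors.

```python
from pathlib import Path

# Assumption: cgroup v1 memory controller mounted at this path.
CGROUP_MEMORY_ROOT = Path("/sys/fs/cgroup/memory")

# The three files requested in the comment above.
FILES = (
    "memory.limit_in_bytes",
    "memory.memsw.usage_in_bytes",
    "memory.usage_in_bytes",
)


def read_memory_cgroup(cgroup_dir):
    """Return {file name: integer value} for the requested files that exist."""
    values = {}
    for name in FILES:
        path = Path(cgroup_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


# Hypothetical usage on a cell, run as root:
#   read_memory_cgroup(CGROUP_MEMORY_ROOT / "garden" / "<container-handle>")
```

Reading these small files does not go through ps, but whether it can itself block on a wedged cell is not something this sketch can guarantee.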

Thanks,
Eric, CF Diego PM


achawki commented on September 13, 2024

@emalm overnight (CET, 7 hours ago) it became possible again to execute ps -efT on all cells (we have monitoring in place). It seems that the app was stopped or re-pushed, and the PID on the cell is also gone. The CPU load on the cell is back to normal as well. So unfortunately I am not able to run the command (its output is empty now) or check the cgroups.


Callisto13 commented on September 13, 2024

@BooleanCat do you agree that we can close this and ask @achawki to re-open if it occurs again?

The fact that it recovered by itself means it was not a case of permanent D state, which is what @emalm was angling for (there is no recovery from that).


achawki commented on September 13, 2024

@emalm @Callisto13 the issue happened again. I am not able to re-open the issue.

sudo ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep "D "
    PID     TID    PPID USER        COMMAND         S WCHAN
   2142    2142    2138 root        monit           D call_rwsem_down_read_failed
  58662   58662       1 2040        ps              D call_rwsem_down_read_failed
  66368   66368       1 2088        ps              D call_rwsem_down_read_failed
 146197  146197       1 2041        ps              D call_rwsem_down_read_failed
 154075  154075       1 2089        ps              D call_rwsem_down_read_failed
 236707  236707       1 2042        ps              D call_rwsem_down_read_failed
 242384  242384       1 2090        ps              D call_rwsem_down_read_failed
 329547  329547       1 2091        ps              D call_rwsem_down_read_failed
 333824  333824       1 2043        ps              D call_rwsem_down_read_failed
 417809  417809       1 2092        ps              D call_rwsem_down_read_failed
 425653  425653       1 2044        ps              D call_rwsem_down_read_failed
 440205  440296  440175 cvcap       MemHistoryThrea D call_rwsem_down_write_failed
 440205  446859  440175 cvcap       Thread-4        D call_rwsem_down_read_failed
 513969  513969       1 2093        ps              D call_rwsem_down_read_failed
 523314  523314       1 2045        ps              D call_rwsem_down_read_failed
 570247  570247       1 2002        ps              D call_rwsem_down_read_failed
 606393  606393       1 2094        ps              D call_rwsem_down_read_failed
 622799  622799       1 2046        ps              D call_rwsem_down_read_failed
 674396  674396       1 2003        ps              D call_rwsem_down_read_failed
 697575  697575       1 2095        ps              D call_rwsem_down_read_failed
 733100  733100       1 2047        ps              D call_rwsem_down_read_failed
 772002  772002       1 2004        ps              D call_rwsem_down_read_failed
 789378  789378       1 2096        ps              D call_rwsem_down_read_failed
 836739  836739       1 2048        ps              D call_rwsem_down_read_failed
 875252  875252       1 2005        ps              D call_rwsem_down_read_failed
 885677  885677       1 2097        ps              D call_rwsem_down_read_failed
 933411  933411       1 2049        ps              D call_rwsem_down_read_failed
 978127  978127       1 2006        ps              D call_rwsem_down_read_failed
 982162  982162       1 2098        ps              D call_rwsem_down_read_failed
1031153 1031153       1 2050        ps              D call_rwsem_down_read_failed
1076275 1076275       1 2099        ps              D call_rwsem_down_read_failed
1079333 1079333       1 2007        ps              D call_rwsem_down_read_failed
1127120 1127120       1 2054        ps              D call_rwsem_down_read_failed
1172680 1172680       1 2100        ps              D call_rwsem_down_read_failed
1177911 1177911       1 2008        ps              D call_rwsem_down_read_failed
1222366 1222366       1 2055        ps              D call_rwsem_down_read_failed
1269097 1269097       1 2101        ps              D call_rwsem_down_read_failed
1278764 1278764       1 2009        ps              D call_rwsem_down_read_failed
1318064 1318064       1 2056        ps              D call_rwsem_down_read_failed
1365794 1365794       1 2102        ps              D call_rwsem_down_read_failed
1379354 1379354       1 2010        ps              D call_rwsem_down_read_failed
1413216 1413216       1 2057        ps              D call_rwsem_down_read_failed
1462503 1462503       1 2103        ps              D call_rwsem_down_read_failed
1477259 1477259       1 2011        ps              D call_rwsem_down_read_failed
1508848 1508848       1 2058        ps              D call_rwsem_down_read_failed
1559530 1559530       1 2104        ps              D call_rwsem_down_read_failed
1575203 1575203       1 2012        ps              D call_rwsem_down_read_failed
1602112 1602112       1 2059        ps              D call_rwsem_down_read_failed
1653254 1653254       1 2105        ps              D call_rwsem_down_read_failed
1676993 1676993       1 2013        ps              D call_rwsem_down_read_failed
1694430 1694430       1 2060        ps              D call_rwsem_down_read_failed
1749184 1749184       1 2106        ps              D call_rwsem_down_read_failed
1776951 1776951       1 2014        ps              D call_rwsem_down_read_failed
1789108 1789108       1 2061        ps              D call_rwsem_down_read_failed
1843712 1843712       1 2107        ps              D call_rwsem_down_read_failed
1888355 1888355       1 2015        ps              D call_rwsem_down_read_failed
1889728 1889728       1 2062        ps              D call_rwsem_down_read_failed
1939576 1939576       1 2108        ps              D call_rwsem_down_read_failed
1986841 1986841       1 2063        ps              D call_rwsem_down_read_failed
1989740 1989740       1 2016        ps              D call_rwsem_down_read_failed
2044444 2044444       1 2109        ps              D call_rwsem_down_read_failed
2081197 2081197       1 2064        ps              D call_rwsem_down_read_failed
2097182 2097182       1 2017        ps              D call_rwsem_down_read_failed
2131670 2131670       1 2110        ps              D call_rwsem_down_read_failed
2176861 2176861       1 2065        ps              D call_rwsem_down_read_failed
2201714 2201714       1 2018        ps              D call_rwsem_down_read_failed
2231518 2231518       1 2111        ps              D call_rwsem_down_read_failed
2275386 2275386       1 2066        ps              D call_rwsem_down_read_failed
2305218 2305218       1 2019        ps              D call_rwsem_down_read_failed
2323888 2323888       1 2112        ps              D call_rwsem_down_read_failed
2371857 2371857       1 2067        ps              D call_rwsem_down_read_failed
2401132 2401132       1 2020        ps              D call_rwsem_down_read_failed
2413761 2413761       1 2113        ps              D call_rwsem_down_read_failed
2466801 2466801       1 2068        ps              D call_rwsem_down_read_failed
2496788 2496788       1 2021        ps              D call_rwsem_down_read_failed
2504788 2504788       1 2114        ps              D call_rwsem_down_read_failed
2561984 2561984       1 2069        ps              D call_rwsem_down_read_failed
2591867 2591867       1 2022        ps              D call_rwsem_down_read_failed
2595906 2595906 2595905 2115        ps              D call_rwsem_down_read_failed
2600206 2600206 2599585 root        ps              D call_rwsem_down_read_failed
2658511 2658511       1 2070        ps              D call_rwsem_down_read_failed
2711599 2711599       1 2023        ps              D call_rwsem_down_read_failed
2753021 2753021       1 2071        ps              D call_rwsem_down_read_failed
2803180 2803180       1 2024        ps              D call_rwsem_down_read_failed
2848935 2848935       1 2072        ps              D call_rwsem_down_read_failed
2890274 2890274       1 2025        ps              D call_rwsem_down_read_failed
2939240 2939240       1 2073        ps              D call_rwsem_down_read_failed
2985853 2985853       1 2026        ps              D call_rwsem_down_read_failed
3028048 3028048       1 2074        ps              D call_rwsem_down_read_failed
3085875 3085875       1 2027        ps              D call_rwsem_down_read_failed
3113356 3113356       1 2075        ps              D call_rwsem_down_read_failed
3176546 3176546       1 2028        ps              D call_rwsem_down_read_failed
3201512 3201512       1 2076        ps              D call_rwsem_down_read_failed
3268319 3268319       1 2029        ps              D call_rwsem_down_read_failed
3289076 3289076       1 2077        ps              D call_rwsem_down_read_failed
3357731 3357731       1 2030        ps              D call_rwsem_down_read_failed
3378215 3378215       1 2078        ps              D call_rwsem_down_read_failed
3447175 3447175       1 2031        ps              D call_rwsem_down_read_failed
3465316 3465316       1 2079        ps              D call_rwsem_down_read_failed
3533724 3533724       1 2032        ps              D call_rwsem_down_read_failed
3553034 3553034       1 2080        ps              D call_rwsem_down_read_failed
3621918 3621918       1 2033        ps              D call_rwsem_down_read_failed
3647014 3647014       1 2081        ps              D call_rwsem_down_read_failed
3708964 3708964       1 2034        ps              D call_rwsem_down_read_failed
3734654 3734654       1 2082        ps              D call_rwsem_down_read_failed
3796680 3796680       1 2035        ps              D call_rwsem_down_read_failed
3822280 3822280       1 2083        ps              D call_rwsem_down_read_failed
3883675 3883675       1 2036        ps              D call_rwsem_down_read_failed
3910044 3910044       1 2084        ps              D call_rwsem_down_read_failed
3978464 3978464       1 2037        ps              D call_rwsem_down_read_failed
3997721 3997721       1 2085        ps              D call_rwsem_down_read_failed
4077681 4077681       1 2038        ps              D call_rwsem_down_read_failed
4085444 4085444       1 2086        ps              D call_rwsem_down_read_failed
4164867 4164867       1 2039        ps              D call_rwsem_down_read_failed
4173116 4173116       1 2087        ps              D call_rwsem_down_read_failed

The corresponding PID is 440205
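One way to read the listing above: almost every entry is waiting on call_rwsem_down_read_failed, and the only thread waiting on call_rwsem_down_write_failed belongs to PID 440205. The sketch below shows how that summary could be derived from the ps output. It embeds only a four-row excerpt of the listing, and the helper names parse_rows and summarize are hypothetical.

```python
from collections import Counter

# Four rows copied from the listing above (not the full output).
SAMPLE = """\
   2142    2142    2138 root        monit           D call_rwsem_down_read_failed
  58662   58662       1 2040        ps              D call_rwsem_down_read_failed
 440205  440296  440175 cvcap       MemHistoryThrea D call_rwsem_down_write_failed
 440205  446859  440175 cvcap       Thread-4        D call_rwsem_down_read_failed
"""


def parse_rows(text):
    """Parse 'pid tid ppid user comm state wchan' rows, skipping headers."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue  # header line or unexpected shape
        pid, tid, ppid, user, comm, state, wchan = fields
        rows.append({"pid": int(pid), "tid": int(tid), "ppid": int(ppid),
                     "user": user, "comm": comm, "state": state, "wchan": wchan})
    return rows


def summarize(rows):
    """Count threads per wait channel and list PIDs with a waiting writer."""
    by_wchan = Counter(row["wchan"] for row in rows)
    writer_pids = sorted({row["pid"] for row in rows if "down_write" in row["wchan"]})
    return by_wchan, writer_pids


by_wchan, writer_pids = summarize(parse_rows(SAMPLE))
print(dict(by_wchan))
print(writer_pids)
```

This only groups what ps reports. It does not show which task actually holds the semaphore, which would need kernel-level inspection.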

ps -o cgroup 440205
CGROUP
12:pids:/garden/589444c6-c397-41b3-7bea-2ad3,11:hugetlb:/garden/589444c6-c397-41b3-7bea-2ad3,10:net_prio:/garden/589444c6-c397-41b3-7bea-2ad3,9:perf_event:/garden/589444c6-c397-41b3-7bea-2ad3,8:net_cls:/garden/589444c6-c397-41b3-7bea-2ad3,7:freezer:/garden/589444c6-c397-41b3-7bea-2ad3,6:devices:/garden/589444c6-c397-41b3-7bea-2ad3,5:memory:/garden/589444c6-c397-41b3-7bea-2ad3,4:blkio:/garden/589444c6-c397-41b3-7bea-2ad3,3:cpuacct:/garden/589444c6-c397-41b3-7bea-2ad3,2:cpu:/garden/589444c6-c397-41b3-7bea-2ad3,1:cpuset:/garden/589444c6-c397-41b3-7bea-2ad3
memory.limit_in_bytes: 209715200
memory.memsw.usage_in_bytes: 209727488
memory.usage_in_bytes: 206069760
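To put the three values above in proportion, here is a small arithmetic sketch. The function name summarize_memory is hypothetical. Note that memory.memsw.limit_in_bytes was not reported, so comparing the memory+swap usage against memory.limit_in_bytes is only indicative.

```python
# Values copied from the cgroup files reported above.
LIMIT = 209715200          # memory.limit_in_bytes
MEMSW_USAGE = 209727488    # memory.memsw.usage_in_bytes
USAGE = 206069760          # memory.usage_in_bytes


def summarize_memory(limit, usage, memsw_usage):
    """Express the raw byte counts relative to the memory limit."""
    return {
        "limit_mib": limit / 2**20,
        "headroom_bytes": limit - usage,
        "memsw_minus_limit_bytes": memsw_usage - limit,
        "usage_fraction": usage / limit,
    }


summary = summarize_memory(LIMIT, USAGE, MEMSW_USAGE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With these numbers the limit is exactly 200 MiB, memory usage sits 3645440 bytes below it, and memory+swap usage is 12288 bytes above it, so the container appears to be running right at its memory limit.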


Callisto13 commented on September 13, 2024

Hey @achawki!

The team will not be back online until Monday 9:00 GMT.

In the meantime, could you give us the Garden Runc release, Stemcell and Kernel versions for this deployment (if they are different from the last time) and also please run the following script as root from inside the VM and attach the resulting tar to this issue: curl bit.ly/garden-ordnance-survey -sSfL | bash
The script may also go into D state, in which case could you remove the command(s) that hang (script is here) and try it again?

We have encountered various problems which result in processes stuck in permanent uninterruptible state and have documented them here: https://docs.google.com/document/d/1Ph7j__TJco1ZO592re3fJyCCP-kgyfd2YRxLRD_IzFU/edit?usp=sharing
Could you check whether any of the symptoms match, in case this is not the problem we have seen before?

If possible, it would be great if you could keep the VM around until Monday 9:00 GMT, but if you can't, please tell us any other details that may help us reproduce.

Thanks!


cf-gitbot commented on September 13, 2024

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/162516147

The labels on this github issue will be updated when the story is started.


achawki commented on September 13, 2024

Hi @Callisto13,

Thanks for the quick reply.
Unfortunately, we had to recreate the VM immediately.

IaaS: AWS
garden-runc-release: 1.16.4
cf-deployment: 4.5
Stemcell: 3586.54
Kernel: Linux d8b9a837-c69d-4444-b5a3-b71e53a15f09 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I am not able to access the document.

Regarding "details which may help us reproduce":

The landscape is very large and I have no insight into the applications running on the corresponding cells.


Callisto13 commented on September 13, 2024

@achawki apologies, this link should work: https://docs.google.com/document/d/1Ph7j__TJco1ZO592re3fJyCCP-kgyfd2YRxLRD_IzFU/edit?usp=sharing


achawki commented on September 13, 2024

@Callisto13 thanks


julz commented on September 13, 2024

Closing due to inactivity, since we don't think we can do much on this without more information. @achawki, please feel free to create a new bug (or comment on this one) if you are able to reproduce the problem again and supply the information requested above.

