Comments (11)
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/160065986
The labels on this github issue will be updated when the story is started.
from garden-runc-release.
Hey, @achawki, could you try running sudo ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep "D " on that Diego cell VM and reporting the output here? That particular ps invocation won't read the process cmdline or environment from memory and so shouldn't hang the way ps aux has.
Also, if you can inspect the memory cgroups for that container, could you report the values in its memory.limit_in_bytes, memory.memsw.usage_in_bytes, and memory.usage_in_bytes cgroup files?
This issue sounds like one that we on the Diego and Garden teams have seen before, though infrequently, and have never been able to reproduce satisfactorily or get enough insight into the kernel to understand.
Thanks,
Eric, CF Diego PM
from garden-runc-release.
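The grep "D " filter in the command above matches any line that contains "D " anywhere, not only lines whose state column is D. A slightly stricter variant is sketched below. It assumes the seven-column layout produced by that exact ps invocation and that command names contain no spaces; the function name summarize_d_state is hypothetical, and the sample rows are copied from the output posted later in this thread.

```shell
#!/bin/sh
# Sketch: count D-state threads per (command, wait channel).
# Reads the output of
#   ps -eLo pid,tid,ppid,user:11,comm,state,wchan
# on stdin. Assumes column 5 is comm, column 6 is state and column 7 is
# wchan, which only holds when comm contains no spaces.
summarize_d_state() {
    awk '$6 == "D" { count[$5 " " $7]++ }
         END { for (k in count) print count[k], k }' | sort -rn
}

# Demo with a few rows taken from the thread (header line included).
summarize_d_state <<'EOF'
PID TID PPID USER COMMAND S WCHAN
2142 2142 2138 root monit D call_rwsem_down_read_failed
58662 58662 1 2040 ps D call_rwsem_down_read_failed
66368 66368 1 2088 ps D call_rwsem_down_read_failed
440205 440296 440175 cvcap MemHistoryThrea D call_rwsem_down_write_failed
440205 446859 440175 cvcap Thread-4 D call_rwsem_down_read_failed
EOF
```

On a live cell the same function could be fed directly, for example sudo ps -eLo pid,tid,ppid,user:11,comm,state,wchan | summarize_d_state. Grouping by wait channel makes it easier to spot the few threads waiting on the write side of the semaphore among many readers.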
@emalm overnight (CET, about 7 hours ago) it became possible again to execute ps -efT on all cells (we have monitoring in place). It seems that the app was stopped or re-pushed; the PID on the cell is gone, and the CPU load on the cell is back to normal. So unfortunately I am not able to run the command (its output is empty now) or check the cgroups.
from garden-runc-release.
@BooleanCat do you agree that we can close this and ask @achawki to re-open if it occurs again?
The fact that it recovered by itself means it was not a case of permanent D-State which @emalm was angling for (there is no recovery from that).
from garden-runc-release.
@emalm @Callisto13 the issue happened again. I am not able to re-open the issue.
sudo ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep "D "
PID TID PPID USER COMMAND S WCHAN
2142 2142 2138 root monit D call_rwsem_down_read_failed
58662 58662 1 2040 ps D call_rwsem_down_read_failed
66368 66368 1 2088 ps D call_rwsem_down_read_failed
146197 146197 1 2041 ps D call_rwsem_down_read_failed
154075 154075 1 2089 ps D call_rwsem_down_read_failed
236707 236707 1 2042 ps D call_rwsem_down_read_failed
242384 242384 1 2090 ps D call_rwsem_down_read_failed
329547 329547 1 2091 ps D call_rwsem_down_read_failed
333824 333824 1 2043 ps D call_rwsem_down_read_failed
417809 417809 1 2092 ps D call_rwsem_down_read_failed
425653 425653 1 2044 ps D call_rwsem_down_read_failed
440205 440296 440175 cvcap MemHistoryThrea D call_rwsem_down_write_failed
440205 446859 440175 cvcap Thread-4 D call_rwsem_down_read_failed
513969 513969 1 2093 ps D call_rwsem_down_read_failed
523314 523314 1 2045 ps D call_rwsem_down_read_failed
570247 570247 1 2002 ps D call_rwsem_down_read_failed
606393 606393 1 2094 ps D call_rwsem_down_read_failed
622799 622799 1 2046 ps D call_rwsem_down_read_failed
674396 674396 1 2003 ps D call_rwsem_down_read_failed
697575 697575 1 2095 ps D call_rwsem_down_read_failed
733100 733100 1 2047 ps D call_rwsem_down_read_failed
772002 772002 1 2004 ps D call_rwsem_down_read_failed
789378 789378 1 2096 ps D call_rwsem_down_read_failed
836739 836739 1 2048 ps D call_rwsem_down_read_failed
875252 875252 1 2005 ps D call_rwsem_down_read_failed
885677 885677 1 2097 ps D call_rwsem_down_read_failed
933411 933411 1 2049 ps D call_rwsem_down_read_failed
978127 978127 1 2006 ps D call_rwsem_down_read_failed
982162 982162 1 2098 ps D call_rwsem_down_read_failed
1031153 1031153 1 2050 ps D call_rwsem_down_read_failed
1076275 1076275 1 2099 ps D call_rwsem_down_read_failed
1079333 1079333 1 2007 ps D call_rwsem_down_read_failed
1127120 1127120 1 2054 ps D call_rwsem_down_read_failed
1172680 1172680 1 2100 ps D call_rwsem_down_read_failed
1177911 1177911 1 2008 ps D call_rwsem_down_read_failed
1222366 1222366 1 2055 ps D call_rwsem_down_read_failed
1269097 1269097 1 2101 ps D call_rwsem_down_read_failed
1278764 1278764 1 2009 ps D call_rwsem_down_read_failed
1318064 1318064 1 2056 ps D call_rwsem_down_read_failed
1365794 1365794 1 2102 ps D call_rwsem_down_read_failed
1379354 1379354 1 2010 ps D call_rwsem_down_read_failed
1413216 1413216 1 2057 ps D call_rwsem_down_read_failed
1462503 1462503 1 2103 ps D call_rwsem_down_read_failed
1477259 1477259 1 2011 ps D call_rwsem_down_read_failed
1508848 1508848 1 2058 ps D call_rwsem_down_read_failed
1559530 1559530 1 2104 ps D call_rwsem_down_read_failed
1575203 1575203 1 2012 ps D call_rwsem_down_read_failed
1602112 1602112 1 2059 ps D call_rwsem_down_read_failed
1653254 1653254 1 2105 ps D call_rwsem_down_read_failed
1676993 1676993 1 2013 ps D call_rwsem_down_read_failed
1694430 1694430 1 2060 ps D call_rwsem_down_read_failed
1749184 1749184 1 2106 ps D call_rwsem_down_read_failed
1776951 1776951 1 2014 ps D call_rwsem_down_read_failed
1789108 1789108 1 2061 ps D call_rwsem_down_read_failed
1843712 1843712 1 2107 ps D call_rwsem_down_read_failed
1888355 1888355 1 2015 ps D call_rwsem_down_read_failed
1889728 1889728 1 2062 ps D call_rwsem_down_read_failed
1939576 1939576 1 2108 ps D call_rwsem_down_read_failed
1986841 1986841 1 2063 ps D call_rwsem_down_read_failed
1989740 1989740 1 2016 ps D call_rwsem_down_read_failed
2044444 2044444 1 2109 ps D call_rwsem_down_read_failed
2081197 2081197 1 2064 ps D call_rwsem_down_read_failed
2097182 2097182 1 2017 ps D call_rwsem_down_read_failed
2131670 2131670 1 2110 ps D call_rwsem_down_read_failed
2176861 2176861 1 2065 ps D call_rwsem_down_read_failed
2201714 2201714 1 2018 ps D call_rwsem_down_read_failed
2231518 2231518 1 2111 ps D call_rwsem_down_read_failed
2275386 2275386 1 2066 ps D call_rwsem_down_read_failed
2305218 2305218 1 2019 ps D call_rwsem_down_read_failed
2323888 2323888 1 2112 ps D call_rwsem_down_read_failed
2371857 2371857 1 2067 ps D call_rwsem_down_read_failed
2401132 2401132 1 2020 ps D call_rwsem_down_read_failed
2413761 2413761 1 2113 ps D call_rwsem_down_read_failed
2466801 2466801 1 2068 ps D call_rwsem_down_read_failed
2496788 2496788 1 2021 ps D call_rwsem_down_read_failed
2504788 2504788 1 2114 ps D call_rwsem_down_read_failed
2561984 2561984 1 2069 ps D call_rwsem_down_read_failed
2591867 2591867 1 2022 ps D call_rwsem_down_read_failed
2595906 2595906 2595905 2115 ps D call_rwsem_down_read_failed
2600206 2600206 2599585 root ps D call_rwsem_down_read_failed
2658511 2658511 1 2070 ps D call_rwsem_down_read_failed
2711599 2711599 1 2023 ps D call_rwsem_down_read_failed
2753021 2753021 1 2071 ps D call_rwsem_down_read_failed
2803180 2803180 1 2024 ps D call_rwsem_down_read_failed
2848935 2848935 1 2072 ps D call_rwsem_down_read_failed
2890274 2890274 1 2025 ps D call_rwsem_down_read_failed
2939240 2939240 1 2073 ps D call_rwsem_down_read_failed
2985853 2985853 1 2026 ps D call_rwsem_down_read_failed
3028048 3028048 1 2074 ps D call_rwsem_down_read_failed
3085875 3085875 1 2027 ps D call_rwsem_down_read_failed
3113356 3113356 1 2075 ps D call_rwsem_down_read_failed
3176546 3176546 1 2028 ps D call_rwsem_down_read_failed
3201512 3201512 1 2076 ps D call_rwsem_down_read_failed
3268319 3268319 1 2029 ps D call_rwsem_down_read_failed
3289076 3289076 1 2077 ps D call_rwsem_down_read_failed
3357731 3357731 1 2030 ps D call_rwsem_down_read_failed
3378215 3378215 1 2078 ps D call_rwsem_down_read_failed
3447175 3447175 1 2031 ps D call_rwsem_down_read_failed
3465316 3465316 1 2079 ps D call_rwsem_down_read_failed
3533724 3533724 1 2032 ps D call_rwsem_down_read_failed
3553034 3553034 1 2080 ps D call_rwsem_down_read_failed
3621918 3621918 1 2033 ps D call_rwsem_down_read_failed
3647014 3647014 1 2081 ps D call_rwsem_down_read_failed
3708964 3708964 1 2034 ps D call_rwsem_down_read_failed
3734654 3734654 1 2082 ps D call_rwsem_down_read_failed
3796680 3796680 1 2035 ps D call_rwsem_down_read_failed
3822280 3822280 1 2083 ps D call_rwsem_down_read_failed
3883675 3883675 1 2036 ps D call_rwsem_down_read_failed
3910044 3910044 1 2084 ps D call_rwsem_down_read_failed
3978464 3978464 1 2037 ps D call_rwsem_down_read_failed
3997721 3997721 1 2085 ps D call_rwsem_down_read_failed
4077681 4077681 1 2038 ps D call_rwsem_down_read_failed
4085444 4085444 1 2086 ps D call_rwsem_down_read_failed
4164867 4164867 1 2039 ps D call_rwsem_down_read_failed
4173116 4173116 1 2087 ps D call_rwsem_down_read_failed
The corresponding PID is 440205
ps -o cgroup 440205
CGROUP
12:pids:/garden/589444c6-c397-41b3-7bea-2ad3,11:hugetlb:/garden/589444c6-c397-41b3-7bea-2ad3,10:net_prio:/garden/589444c6-c397-41b3-7bea-2ad3,9:perf_event:/garden/589444c6-c397-41b3-7bea-2ad3,8:net_cls:/garden/589444c6-c397-41b3-7bea-2ad3,7:freezer:/garden/589444c6-c397-41b3-7bea-2ad3,6:devices:/garden/589444c6-c397-41b3-7bea-2ad3,5:memory:/garden/589444c6-c397-41b3-7bea-2ad3,4:blkio:/garden/589444c6-c397-41b3-7bea-2ad3,3:cpuacct:/garden/589444c6-c397-41b3-7bea-2ad3,2:cpu:/garden/589444c6-c397-41b3-7bea-2ad3,1:cpuset:/garden/589444c6-c397-41b3-7bea-2ad3
memory.limit_in_bytes: 209715200
memory.memsw.usage_in_bytes: 209727488
memory.usage_in_bytes: 206069760
from garden-runc-release.
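The three memory values reported above can be put side by side with a little arithmetic. The sketch below is only illustrative: the function name mem_cgroup_summary is hypothetical, and comparing memory.memsw.usage_in_bytes against memory.limit_in_bytes is a rough indicator at best, because the thread does not report memory.memsw.limit_in_bytes.

```shell
#!/bin/sh
# Sketch: relate the reported cgroup v1 memory counters to the limit.
# Arguments: memory.limit_in_bytes, memory.memsw.usage_in_bytes,
#            memory.usage_in_bytes (all in bytes).
mem_cgroup_summary() {
    limit=$1
    memsw=$2
    usage=$3
    echo "limit_mib=$(( limit / 1048576 ))"
    # Integer percentage, rounded down.
    echo "usage_pct=$(( usage * 100 / limit ))"
    echo "headroom_bytes=$(( limit - usage ))"
    # Positive when memory+swap usage exceeds the plain memory limit.
    echo "memsw_over_limit_bytes=$(( memsw - limit ))"
}

# Values copied from the comment above.
mem_cgroup_summary 209715200 209727488 206069760
```

With the reported numbers this gives a 200 MiB limit, usage at 98% of it, 3645440 bytes of headroom, and memory+swap usage 12288 bytes above the memory limit. That suggests the container was running at or very near its limit while the threads were stuck, although it does not establish the cause on its own.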
Hey @achawki!
The team will not be back online until Monday 9:00 GMT.
In the meantime, could you give us the garden-runc-release, stemcell, and kernel versions for this deployment (if they differ from last time)? Also, please run the following script as root from inside the VM and attach the resulting tar to this issue: curl bit.ly/garden-ordnance-survey -sSfL | bash
The script may also go into D state, in which case could you remove the command(s) which hang (script is here) and try it again?
We have encountered various problems which result in processes stuck in a permanent uninterruptible state and have documented them here: https://docs.google.com/document/d/1Ph7j__TJco1ZO592re3fJyCCP-kgyfd2YRxLRD_IzFU/edit?usp=sharing
Could you check whether any of those symptoms match, in case this is not the previously seen problem?
If possible, it would be great if you could keep the VM around until Monday 9:00 GMT, but if you can't, please tell us any other details which may help us reproduce.
Thanks!
from garden-runc-release.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/162516147
The labels on this github issue will be updated when the story is started.
from garden-runc-release.
Hi @Callisto13,
thanks for the quick reply.
Unfortunately, we had to recreate the VM immediately.
IaaS: AWS
garden-runc-release: 1.16.4
cf-deployment: 4.5
Stemcell: 3586.54
Kernel: Linux d8b9a837-c69d-4444-b5a3-b71e53a15f09 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I am not able to access the document.
Regarding "details which may help us reproduce": the landscape is very large, and I have no insight into the applications running on the corresponding cells.
from garden-runc-release.
@achawki apologies, this link should work https://docs.google.com/document/d/1Ph7j__TJco1ZO592re3fJyCCP-kgyfd2YRxLRD_IzFU/edit?usp=sharing
from garden-runc-release.
@Callisto13 thanks
from garden-runc-release.
Closing due to inactivity, since we don't think we can do much on this without more information. @achawki, please feel free to create a new bug (or comment on this one) if you are able to reproduce it again and supply the information requested above.
from garden-runc-release.