Coder Social home page Coder Social logo

vmcompute issue with latest patch - encountered an error during hcs::System::Start: context deadline exceeded related to (Splunk Universal Forwarder) about windows-containers HOT 14 CLOSED

brwilkinson avatar brwilkinson commented on June 3, 2024 1
vmcompute issue with latest patch - encountered an error during hcs::System::Start: context deadline exceeded related to (Splunk Universal Forwarder)

from windows-containers.

Comments (14)

Howard-Haiyang-Hao avatar Howard-Haiyang-Hao commented on June 3, 2024 1

@brwilkinson

I am trying to repro the issue:

PS C:\Users\Administrator> docker run jenkins/agent:jdk17-windowsservercore-ltsc2022 powershell.exe -c echo hello
hello
PS C:\Users\Administrator> get-hotfix

Source Description HotFixID InstalledBy InstalledOn


WIN-35ILNJ... Update KB5022507 5/5/2023 12:00:00 AM
WIN-35ILNJ... Security Update KB5026370 5/5/2023 12:00:00 AM
WIN-35ILNJ... Update KB5026493 5/5/2023 12:00:00 AM

Can you call get-hotfix to see the packages that installed on your system?

Thanks,
Howard.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024 1

Today I installed the base image that had been tested as working. A fresh image, not the "corporate" image, via a standard deployment.

Then installed each monthly patch, and tested each iteration of the updates.

SW_DVD9_Win_Server_STD_CORE_2022_2108.20_64Bit_English_DC_STD_MLF_X23-42932

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1668_none_a001fbb8ddff3a5d\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1726_none_9ff35834de0abdb0\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1787_none_9ff94328de056f5b\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1850_none_9fec11d6de0f8be0\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1906_none_9fdbb7c8de1cc2e4\vmcompute.exe

C:\Windows\WinSxS\amd64_hyperv-compute-host-service_31bf3856ad364e35_10.0.20348.1970_none_9fe3427ede15da7f\vmcompute.exe

I was not able to reproduce the error.

I am continuing to investigate.

I am leaning towards some "scanning" and "security" software as a possible cause, however will update once I have completed more testing.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

Possibly related:

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

when the server gets deployed (without updates) it works and has the following.

get-hotfix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
server001  Update           KB5028956     NT AUTHORITY\SYSTEM  9/22/2023 12:00:00 AM
server001  Security Update  KB5012170     NT AUTHORITY\SYSTEM  9/21/2023 12:00:00 AM
server001  Security Update  KB5029250     NT AUTHORITY\SYSTEM  9/22/2023 12:00:00 AM
server001  Update           KB5029395     NT AUTHORITY\SYSTEM  9/21/2023 12:00:00 AM

#-----------------------------------------
Package Identity : Package_for_DotNetRollup31bf3856ad364e35amd64~~10.0.4614.6
State : Installed
Release Type : Update
Install Time : 4/5/2023 12:21 AM

Package Identity : Package_for_RollupFix31bf3856ad364e35amd64~~20348.1668.1.8
State : Installed
Release Type : Security Update
Install Time : 4/5/2023 12:33 AM 

Package Identity : Package_for_ServicingStack_166331bf3856ad364e35amd64~~20348.1663.1.1
State : Installed
Release Type : Security Update
Install Time : 4/5/2023 12:20 AM

#-----------------------------------------
Which is from a server 2022 base image: Win_Server_2022_2108.20
#-----------------------------------------

Then as part of the deployment process when we install the latest patch it breaks.

Get-HotFix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
server002  Update           KB5029928     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5012170     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5030216     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM
server002  Security Update  KB5025314                          4/5/2023 12:00:00 AM
server002  Update           KB5030369     NT AUTHORITY\SYSTEM  9/28/2023 12:00:00 AM

encountered an error during hcs::System::Start:context deadline exceeded.

I have tested on Server Core and with GUI and same result.

Based on the other thread it likely wasn't this latest roll up or patch that had the breaking change... however I haven't been able to test on different patch versions. Also when I uninstalled this update the server was still broken, I noticed vmcompute.exe was an even earlier version in this case (after uninstall).

I am using the exact same process to deploy all of the servers so they are identical other than the working servers did not have the updates applied at build time.

Then I noticed the vmcomute.exe did have some changes between the versions.

we should test with this since it's a smaller image and straight from mcr... not that it makes that much difference.

docker run mcr.microsoft.com/windows/nanoserver:ltsc2022 ping localhost -r 5

also docker pull works fine, it's just run that times out.

This is the patch to install top test a repro...
Security patch
September 12, 2023—KB5030216 (OS Build 20348.1970) - Microsoft Support
https://support.microsoft.com/en-us/topic/september-12-2023-kb5030216-os-build-20348-1970-34d4aff3-fd05-4270-b288-4ab6379c7f81

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

@Howard-Haiyang-Hao also thank you for the follow up.

If there are some lower level tests or logs that we can look into, please let us know.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

Adding below log from the Application logs:

6:25:02PM - Warning
Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [traceID=bb85376784de4797faa7730c7d12ce40 spanID=323e301be76b14aa timeout=4m0s]

6:25:02PM - Error
failed to start container [error=container 6ac12218a7dfcf3e6507274fc678b46fd8f30f2224dcbc0bb27c05c58b543641 encountered an error during hcs::System::Start: context deadline

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

Hi @Howard-Haiyang-Hao were you able to get the update installed? Any luck with the repro?

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

Adding some further information.

  1. Expanded testing from VMware host machines to Hyper-V host machines. No change in experience.

docker run --isolation=process mcr.microsoft.com/windows/nanoserver:ltsc2022 ping localhost -r 5

ltsc2022: Pulling from windows/nanoserver
452f3ad1d32c: Pulling fs layer
452f3ad1d32c: Verifying Checksum
452f3ad1d32c: Download complete
452f3ad1d32c: Pull complete
Digest: sha256:6562c9a2580260e4f2e3a081cc2cf1d960899cbce7c4568fb851e4848ca50e07
Status: Downloaded newer image for mcr.microsoft.com/windows/nanoserver:ltsc2022
docker: Error response from daemon: container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered
an error during hcs::System::Start: context deadline exceeded.

Event logs

warning

Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [spanID=be1a8d4cdaa4cc79 traceID=3c3d99e94347624bac4b3aa9b63f8d8b timeout=4m0s]

errors

failed to start container [module=libcontainerd namespace=moby container=62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 error=container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Start: context deadline exceeded]

failed to cleanup after a failed Start [module=libcontainerd namespace=moby container=62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 error=container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Terminate: context deadline exceeded]

Handler for POST /v1.43/containers/62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8/start returned error: container 62f69d70bdfc7a8bcb86e4d824221cb3ed0381e48fb2f6b2aebefab59a12a7b8 encountered an error during hcs::System::Start: context deadline exceeded

OSinfo

WindowsBuildLabEx                                       : 20348.1.amd64fre.fe_release.210507-1500
WindowsCurrentVersion                                   : 6.3
WindowsEditionId                                        : ServerStandard
WindowsInstallationType                                 : Server
WindowsInstallDateFromRegistry                          : 10/2/2023 3:38:09 PM
WindowsProductId                                        : 00454-10000-00001-AA777
WindowsProductName                                      : Windows Server 2022 Standard
OsName                                                  : Microsoft Windows Server 2022 Standard
OsType                                                  : WINNT
OsOperatingSystemSKU                                    : StandardServerEdition
OsVersion                                               : 10.0.20348


get-hotfix | ft -AutoSize

Source         Description     HotFixID  InstalledBy         InstalledOn
------         -----------     --------  -----------         -----------
BWILKINSON-VM1 Update          KB5029928 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5012170 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5030216 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM
BWILKINSON-VM1 Security Update KB5025314                     4/5/2023 12:00:00 AM
BWILKINSON-VM1 Update          KB5030369 NT AUTHORITY\SYSTEM 10/2/2023 12:00:00 AM

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

I am out of cycle for deploying more nodes right now, however I expect to continue doing so in the next few days.

Once I have more data, I will provide a further update.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

I will close this for now.

At this time, I don't have a root cause... however I will come back later and add notes in case it's useful for others.

I do have a way to make the vmcompute start working again... however I am just not definitive on exactly why or which single things breaks the service.

1 time restarting vmcompute (and then rebooting) brought the containers online
1 time uninstalling 2 enterprise monitoring apps brought the containers online
1 time uninstalling 1 enterprise monitoring app brought the containers online (also I have reproduced this several times).

I can only assume that the vmcompute is sensitive to something, however I am unsure what exactly OR why uninstalling the apps allowed the containers to start as normal. Also reinstalling the same combination of apps did not break the container services.

So I have a workaround, however not any sure solution on what exactly the issue is or was.

I will provide an update in the future.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

In case it's helpful for others...

In our environment I have been able to reproduce the issue many times.

  1. The container timeout error occurs.

    • Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time. [timeout=4m0s traceID=6b59dd25c46e317d63303f745be2ed33 spanID=540b21032cc68875]
    • failed to cleanup after a failed Start [namespace=moby container=1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 error=container 1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 encountered an error during hcs::System::Terminate: context deadline exceeded module=libcontainerd]
    • Handler for POST /v1.43/containers/1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324/start returned error: container 1b4b560791f9c2fcf36aeafb6251c6bc883b4b1a788f9178fba4d87e5b186324 encountered an error during hcs::System::Start: context deadline exceeded
  2. Stop the "SplunkForwarder" Service.

  3. The container starts successfully.

  4. Start the "SplunkForwarder" the containers continue to work.

I have not seen any regression on this afterwards.

I have a Guess that it maybe related to the network setup for the "Microsoft-Windows-Hyper-V-VmSwitch" etc.

  • However this is only based on reviewing logs in the Application and System Events between stopping/starting the Splunk service as well as running docker.

If I have more time I will continue to capture these states as I deploy more new resources in the future.

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

@Howard-Haiyang-Hao @akarshm just a ping for this issue, which I closed, however is worth follow up in:

from windows-containers.

hopenbr avatar hopenbr commented on June 3, 2024

@brwilkinson ,
stopping the splunk forwarding svc, restarting docker service, and starting the splunk fwd service seems to have work for me. Thanks a million for that update.

Any further issues? Reoccurrences?

from windows-containers.

brwilkinson avatar brwilkinson commented on June 3, 2024

@hopenbr

This is being tracked over on

from windows-containers.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.