
Comments (7)

chaochenq commented on May 20, 2024

This is a good point. Thanks for diving deep into this!

Even 12 seconds is probably not safe enough. We want the babysit to see a non-empty $(get_agent_pid), which requires waiting for a full stop plus start. In the worst case shutdown can take 11 seconds, and the start could also take some time. What do you think?

from amazon-kinesis-agent.

chris-gilmore commented on May 20, 2024

After thinking more about this, I now realize that sleeping for any amount of time will not solve the underlying issue. It merely shifts the window in which the race condition can occur. For example, if my restart were to occur 12 seconds after the babysit started, we would be back in the same situation.
I think I need to noodle on this some more ...
What do you think about having the babysit make a status call to the agent's SysVinit script, with the do_status function guarded by the same MUTEXFILE as the do_start() and do_stop() functions?
And then the babysit script can look more like the awslogs-nanny script (reproduced below):

#!/bin/sh
# Version: 1.1.2-rpm

# This script will restart the awslogs service if it has died

service awslogs status >/dev/null
status=$?

if [ "$status" -eq "1" -o "$status" -eq "2" ]; then
    service awslogs restart
fi


chris-gilmore commented on May 20, 2024

After some more testing, I believe a mutex alone cannot fix the real underlying issue, since start-aws-kinesis-agent daemonizes and continues its initialization in the background, outside of the mutexed region.
If another process then enters do_stop() while the initialization of start-aws-kinesis-agent is still in progress, a core dump is possible. See for example this real-world error from /tmp/aws-kinesis-agent.*.initlog:

Terminated (core dumped)
awk: (FILENAME=- FNR=3) warning: error writing standard output (Broken pipe)

However, if we wrap the do_status call with the same MUTEXFILE, we can at least prevent the babysit from falsely assuming the agent has unexpectedly died, and thus prevent a double restart of the agent.
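
As a rough illustration of that idea, here is a minimal sketch of a status check that takes the same lock as start and stop. It assumes `flock` from util-linux is available. The file paths and the `with_mutex` and `agent_status` helpers are hypothetical names used only for this example; the real init script defines its own MUTEXFILE and PIDFILE and may lock differently.

```shell
#!/bin/sh
# Hypothetical paths for illustration only.
MUTEXFILE="${MUTEXFILE:-/tmp/aws-kinesis-agent.mutex}"
PIDFILE="${PIDFILE:-/tmp/aws-kinesis-agent.pid}"

# Run a command while holding an exclusive lock on MUTEXFILE.
# The lock is released when the subshell exits and fd 9 is closed.
with_mutex() {
    (
        flock -x 9 || exit 1
        "$@"
    ) 9>"$MUTEXFILE"
}

# Unlocked status check (hypothetical helper).
# Return codes follow the LSB convention:
#   0 = running, 1 = dead but PID file exists, 3 = not running.
agent_status() {
    [ -f "$PIDFILE" ] || return 3
    pid=$(cat "$PIDFILE")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null && return 0
    return 1
}

# Status check that waits for any in-progress start or stop to finish,
# assuming do_start() and do_stop() take the same lock.
do_status() {
    with_mutex agent_status
}
```

With this shape, a babysit that calls do_status blocks until a concurrent do_stop() has both killed the process and removed the PID file, so it should not observe the half-stopped state.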


chaochenq commented on May 20, 2024

Thanks @chris-gilmore for looking deep into this, I appreciate it. I will take a look at this next week :)


chaochenq commented on May 20, 2024

Actually, I don't remember any specific reason why I put the conditions separately. How about we simply check the PID file and the process at the same time? That would reduce the chance of a race condition too:

# Restart only if the PID file exists but no agent process is running
if [ -f "$PIDFILE" ] && [ -z "$(get_agent_pid)" ]; then
    start_agent
fi

exit 0

What do you think?


chris-gilmore commented on May 20, 2024

I tested your proposal and confirmed that it does not help alleviate the problem.

The issue is with running the babysit check while another process is in the middle of running do_stop() from the sysvinit script. There is a small but perceptible window within do_stop() in which the agent process has been killed but the PIDFILE has not yet been deleted. Therefore, we want to prevent the babysit check from running at the same time as do_stop().
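
One possible way to keep the check out of that window, sketched under assumptions: the babysit tries a non-blocking lock on the mutex file and skips the run if a start or stop currently holds it. This is not necessarily what the PR does. The paths, the `babysit_check` name, and the bodies of `get_agent_pid` and `start_agent` below are placeholders for illustration, and `flock` from util-linux is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical paths for illustration only.
MUTEXFILE="${MUTEXFILE:-/tmp/aws-kinesis-agent.mutex}"
PIDFILE="${PIDFILE:-/tmp/aws-kinesis-agent.pid}"

# Placeholder for the real helper: prints the PID only if the process is alive.
get_agent_pid() {
    [ -f "$PIDFILE" ] || return 0
    pid=$(cat "$PIDFILE")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "$pid"
    fi
}

# Placeholder for the real restart logic.
start_agent() {
    echo "restarting agent"
}

babysit_check() {
    (
        # Non-blocking lock: if a start or stop holds the mutex, skip this run
        # instead of inspecting a half-stopped agent.
        flock -n 9 || { echo "start/stop in progress, skipping"; exit 0; }
        if [ -f "$PIDFILE" ] && [ -z "$(get_agent_pid)" ]; then
            start_agent
        fi
    ) 9>"$MUTEXFILE"
}
```

Skipping rather than waiting suits a check that runs periodically, since the next run picks up a genuinely dead agent; a blocking lock would work too if the check must never be skipped.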


chaochenq commented on May 20, 2024

Makes sense. I'll test it out and merge the PR. Thanks!

