Comments (13)

tsackton commented on July 29, 2024

I am also seeing the ST_KV_DATABASE_EXCEPTION error, running with singularity 2.6.1 and latest version of progressiveCactus on the same test data.

Specifically, I see a lot of network errors, e.g.,

(py27) [tsackton@bioinf02 cactus]$ grep "Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: " *.log
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:W/u/jobMAMhgY    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/W/jobcYd5p3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/W/jobcYd5p3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/W/jobcYd5p3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/W/jobcYd5p3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/W/jobcYd5p3    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:0/K/jobT_Nt8l    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:Y/4/jobT1BXlw    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:8/b/jobjS_1G2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.184 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:R/Y/jobziuk_f    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:J/O/jobZkDBNH    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:M/U/jobPEGnpz    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.130.249 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:T/k/jobH39D4e    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.128.189 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
logfile_slurm_3.log:WARNING:toil.leader:X/x/jobuUaQQ2    Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.31.133.81 with error: network error
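
Grep output like the above is easier to read once the failures are tallied per ktserver host. A minimal sketch, assuming Python 3 and log lines in exactly the format shown; `count_failures_by_host` and `count_failures_in_file` are hypothetical helper names, not part of Cactus or Toil.

```python
import re
from collections import Counter

# Captures the host from lines of the form
# "... ST_KV_DATABASE_EXCEPTION: Opening connection to host: <ip> with error: ..."
HOST_RE = re.compile(
    r"ST_KV_DATABASE_EXCEPTION: Opening connection to host: (\S+) with error"
)


def count_failures_by_host(lines):
    """Return a Counter mapping each host to the number of failed connections."""
    counts = Counter()
    for line in lines:
        match = HOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_failures_in_file(path):
    """Tally failures in one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return count_failures_by_host(handle)
```

Calling `count_failures_in_file("logfile_slurm_3.log").most_common()` lists the hosts with the most failed connections first, which makes it easier to see whether the errors cluster on particular ktserver hosts.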

The jobs that fail with this error are not always SavePrimaryDB, and some jobs succeed, so it seems to have to do with whether the host that the ktserver is running on is reachable from the specific compute node a later job lands on.

Is there any way to control the node that the ktserver lands on, e.g. force it to run on the high memory machine the job is launched from? In previous versions of progressiveCactus I've managed this by tinkering with the bigBatchSystem options (to force the ktserver job to use the bigBatchSystem), but it is not clear if this is still an option with the new Toil version.

from cactus.

joelarmstrong commented on July 29, 2024

Hmm, unfortunately toil (the descendant of the old jobTree framework we used to use) doesn't have the bigBatchSystem hack that jobTree did. Sadly there isn't really a way that I know of to control which host the ktserver process gets launched on.

It should be somewhat possible to put the same hack into toil. There is a "local" batch system function that basically reintroduced a very similar thing to handle small CWL housekeeping jobs. Currently it's hardcoded to only run those jobs, but it could be made customizable without too much hassle.

Re: the ports possibly being off, that's interesting... maybe Singularity is doing some sort of NAT? As written, the DB code expects the address/port combination that it binds to to be reachable from all workers.

lparsons commented on July 29, 2024

I'm not sure about the ports (thus removed that comment)... digging through so many scattered logs and temp directories is confusing. I'll put more info if/when I can confirm anything.

On my system at least, there should be no issues connecting from one node to another, so I'm a bit confused about the errors. Any ideas as to how we can get some more detail on this problem? Or perhaps some easier way to test/debug the ktserver stuff? I'm thinking something like running a simple ktserver instance and issuing some commands to see what does/doesn't work.
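
Before testing ktserver itself, a bare TCP check run from a compute node can show whether the address and port are reachable at all. This is a minimal sketch, assuming Python 3; `can_connect` is a hypothetical helper, and the host and port have to be taken from your own run (the thread does not say which port ktserver binds to).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (True, None) on success, or (False, error_text) if the
    connection is refused, times out, or the host cannot be resolved.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:  # covers timeouts, refusals, and DNS failures
        return False, str(exc)
```

Running `can_connect(host, port)` on the node where a failing job landed, with the host from the exception message, distinguishes "nothing can reach that address" (firewall, wrong IP) from problems that only appear at the ktserver protocol level.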

lparsons commented on July 29, 2024

OK, a bit more testing has revealed two separate issues. I am now simply trying to run the example workflow on our interactive node and it gets "stuck" on the SavePrimaryDB step every time. No errors that I can find, but things just stop. See attached logfile: cactus_localrun.log.gz

The other issue occurs when I run using SLURM. That's where I run into ST_KV_DATABASE_EXCEPTION, even when trying to connect on localhost. Nodes on my system should be able to communicate with one another, so I'm not sure what the problem actually is.

lparsons commented on July 29, 2024

I seem to be able to get things to complete (with various workarounds for Singularity 3) now. However, whenever I set --logLevel DEBUG or --logDebug I run into this problem. This leads me to believe that it is not an actual network error, but instead some invalid argument formatting that is causing a non-zero exit code from a process, which is incorrectly interpreted as a network error.

@tsackton Could you try running without turning on --logLevel DEBUG and see if you can get things to complete? Thanks.

lparsons commented on July 29, 2024

Oddly enough, when I reran the pipeline with debug on I still got errors, but this time it completed successfully (eventually). This appears to be something that happens occasionally. Is there some non-deterministic part of this workflow? Also, any thoughts as to why I see this only with DEBUG on? Is that because the error is silenced otherwise?

lparsons commented on July 29, 2024

Continued testing shows that I get this error even when the server is on the same host:

Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 127.0.0.1 with error: network error

This occurs with debug off, at least on some runs and at least on some cluster systems. At this point, I am not able to replicate runs even on the same system when using the example data.

@joelarmstrong Is there some way to set a random seed to ensure any non-deterministic parts of the workflow run the same? The inability to get the same errors or results from one run to the next is disconcerting, and makes debugging quite difficult.

lparsons commented on July 29, 2024

On at least one of our clusters, the code used to determine a "public IP" doesn't work and ends up returning 127.0.0.1, which inevitably fails. I'm not sure of the appropriate fix at the moment, except that it seems necessary to be able to choose a method that works for the given grid environment, as the current approach doesn't generalize well.

# Borrowed from toil code.
import socket
from contextlib import closing


def getPublicIP():
    """Get the IP that this machine uses to contact the internet.
    If behind a NAT, this will still be this computer's IP, and not the router's."""
    try:
        # Try to get the internet-facing IP by attempting a connection
        # to a non-existent server and reading what IP was used.
        with closing(socket.socket(socket.AF_INET, socket.SOCK_DGRAM)) as sock:
            # 203.0.113.0/24 is reserved as TEST-NET-3 by RFC 5737, so
            # there is guaranteed to be no one listening on the other
            # end (and we won't accidentally DOS anyone).
            sock.connect(('203.0.113.1', 1))
            ip = sock.getsockname()[0]
        return ip
    except:
        # Something went terribly wrong. Just give loopback rather
        # than killing everything, because this is often called just
        # to provide a default argument
        return '127.0.0.1'
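
One possible shape for a configurable variant is sketched below; it is not what Cactus or Toil actually do. It assumes Python 3 and an environment variable named `KTSERVER_ADVERTISE_IP`, which is a made-up name for illustration, and it raises an error instead of silently handing back loopback when detection fails.

```python
import ipaddress
import os
import socket
from contextlib import closing

# Hypothetical variable name; neither Cactus nor Toil reads this.
OVERRIDE_ENV_VAR = "KTSERVER_ADVERTISE_IP"


def get_advertised_ip(environ=None):
    """Return the IP other nodes should use to reach this machine.

    An explicit override wins. Otherwise fall back to the same UDP
    "connect" trick as getPublicIP, but refuse to return loopback.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(OVERRIDE_ENV_VAR)
    if override:
        # ip_address raises ValueError on a malformed address.
        return str(ipaddress.ip_address(override))

    try:
        with closing(socket.socket(socket.AF_INET, socket.SOCK_DGRAM)) as sock:
            # No packet is sent; connect() on UDP only selects a route.
            sock.connect(("203.0.113.1", 1))
            ip = sock.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"

    if ipaddress.ip_address(ip).is_loopback:
        raise RuntimeError(
            "could not determine a routable IP; set %s" % OVERRIDE_ENV_VAR
        )
    return ip
```

Failing loudly is a design choice: a loopback address handed to workers on other nodes produces the same "network error" seen above, only later and with less context.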

amizeranschi commented on July 29, 2024

@lparsons and @tsackton, did you ever get this fixed somehow? I'm having a similar issue on a SGE cluster while running the evolverMammals example: #63.

Things run fine on a single node, but fail with Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 172.16.13.37 with error: network error when running distributed on an SGE queue. The cluster nodes should be able to communicate with each other, so I'm not sure what could cause the connection to fail.

I'm using a local Cactus install, as Docker and Singularity aren't supported by the cluster.

lparsons commented on July 29, 2024

I ended up with a couple of different solutions to this. The first was a simple firewall rule that was preventing communication (despite assurances communication was allowed ;-) ).

The second was resolved by the system admins who kinda changed the cluster configuration to allow the getPublicIP code to work. I'd prefer to get a more robust (or at least configurable) method for the getPublicIP, but I don't have any great options at the moment.

amizeranschi commented on July 29, 2024

@lparsons Thanks a lot for the reply. Do you remember anything more specific about the firewall rule that was preventing communication? I've gotten in touch with the cluster admin and I suspect that it may be a similar issue, so any further clues you could provide would be much appreciated.

lparsons commented on July 29, 2024

On the cluster where getPublicIP did work but I still got the error (which seems more similar to your case), the problem was iptables. The fix implemented was to turn off iptables completely on the nodes (which are only on the private cluster network).

lparsons commented on July 29, 2024

I believe I have addressed all of the reasons that this error occurred for us. See #60 and #67 for additional info.
