This is a subtle problem that I have noticed on my MacBook Pro and when using Travis CI with osx
VM builds over the past few months. I have just now diagnosed the issue, but it would be good if others can reproduce this and discuss a workaround (if full macOS support is desired).
Bug report
Bug summary
The make()
command in tst/regression/scripts/utils/athena.py
, invoked by a regression test module's prepare()
step fails in the linker with
make: write error: stdout
being written in the middle of the stdout stream.
The challenge in tracking down this bug is that it manifests in any regression test only after an MPI-enabled regression test (which itself passes). Currently, the only regression test scripts that use MPI are:
mpi/mpi_linwave.py
grav/jeans_3d.py
Code for reproduction
cd tst/regression
python run_tests.py mpi/mpi_linwave hydro/sod_shock
Or, another example:
cd tst/regression
python run_tests.py grav/jeans_3d outputs/all_outputs
The error only occurs in the make
commands invoked in the separate test scripts following the MPI regression test. In other words, even though the mpi_linwave.py
first compiles an MPI binary and then a serial binary in the same module, there is no write error in that test.
Furthermore, the bug only occurs either when:
- Both tests are run in the same Python command, as above
- The multiple
python run_tests.py testname
commands are executed in the same process/script.
So, command line execution of
cd tst/regression
python run_tests.py grav/jeans_3d
python outputs/all_outputs
works fine, but running those commands in a Bash script will produce the error. Hence, in VM environments like Travis CI, which wraps the user's commands in a command called travis_run_script
, this issue may appear.
Actual outcome
...
g++ -O3 -o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/athena /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/main.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/globals.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/parameter_input.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/get_boundary_flag.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/reflect.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_mg.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/polarwedge.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_base.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_fc.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/bvals_grav.o /Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/flux_correction_cc.o /Users/kfelker/Desktop/athena-trunk-clean/tsmake: write error: stdout
Traceback (most recent call last):
File "run_tests.py", line 76, in main
module.prepare()
File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/tests/hydro/hydro_linwave.py", line 21, in prepare
athena.make()
File "/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/scripts/utils/athena.py", line 46, in make
.format(err.returncode,' '.join(err.cmd)))
AthenaError: Return code 1 from command 'make -j EXE_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/bin/ OBJ_DIR:=/Users/kfelker/Desktop/athena-trunk-clean/tst/regression/obj/'
---> Error in scripts/tests/hydro/hydro_linwave.py
Results:
mpi.mpi_linwave: passed
hydro.sod_shock: failed -- unexpected failure in prepare() stage
hydro.hydro_linwave: failed -- unexpected failure in prepare() stage
Summary: 1 out of 3 tests passed
I have also observed:
IOError: [Errno 35] Resource temporarily unavailable
instead of make: write error
when using alternative subprocess
commands; see below.
Version Information
My current MacBook Pro environment is:
- Operating System: macOS Sierra 10.12.6
- Python Version: Python 2.7.13, installed by Homebrew
- C++ compiler version: macOS system
clang
Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
- MPI version: MPICH 3.2.0, installed by Homebrew.
mpicxx -show
:
clang++ -Wl,-flat_namespace -Wl,-commons,use_dylibs -I/usr/local/Cellar/mpich/3.2_3/include -L/usr/local/Cellar/mpich/3.2_3/lib -lmpicxx -lmpi -lpmpi
corresponding to one of these Homebrew-built binaries (bottles):
...
commit d414b50c6744a47b1cbfa72f716bf8b39720684d
Author: BrewTestBot <[email protected]>
Date: Mon Sep 18 18:37:15 2017 +0000
mpich: update 3.2_3 bottle.
commit 54c585fbd1d02b4271a5e154e1cda4458944cfb0
Author: BrewTestBot <[email protected]>
Date: Thu May 4 02:36:36 2017 +0000
mpich: update 3.2_3 bottle.
...
It the process of debugging, I have tried many different versions of compilers, Python environments, build options, and macOS versions. It also occurs with:
gcc
7.1 or 4.9 installed by Homebrew
- System-managed or user Homebrew-managed Python 2.7 installations. Also tried:
virtualenv
for Python.
- Starting Python in unbuffered binary stdout and stderr mode, via
python -u
- OpenMPI or MPICH installed by Homebrew or source, compiled with
gcc
or clang
- GNU Make versions 4.2.1, 3.8.1, either macOS system or Homebrew managed
- Serial or parallel Make
- Tried
--output-sync
option for GNU Make version > 4.0 to ensure that parallel make output is buffered and well-ordered.
- Various releases of macOS 10.10, 10.11, 10.12
Explanation
I have encountered related bug reports in the Travis CI, GNU Make, and macOS communities, but never found a complete explanation until recently. The 12/1/17 reply on travis-ci/travis-ci#4704 explains that this bug is caused by the EAGAIN signal, "try again/ data not ready" from nonblocking socket.
To query if O_NONBLOCK
is set, you can use:
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'
Embedding this command in the regression test driver returns 0 (O_NONBLOCK
disabled) after every test until an MPI-enabled test executes, then it returns a nonzero number. So, the open question is: why do the MPI-enabled regression test scripts turn on nonblocking IO, and why does it occur only after the overall completion of the test?
Possible fixes
- Disable nonblocking IO after all regression tests, or after an MPI-enabled regression test. I am currently placing the following command in my driver script:
python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags&~os.O_NONBLOCK);'
- Redirect the
athena.py
output of make()
command to a file by replacing the subprocess.check_call()
commands with subprocess.Popen()
and pipes. Or, suppress the output by replacing with subprocess.check_output()
(without ever communicating the stream to stdout).
- Figure out how to prevent the MPI-enabled regression test from switching to nonblocking IO in the first place.