
# -*- Mode: sh; sh-basic-offset:2 ; indent-tabs-mode:nil -*-
#
# Copyright (c) 2014-2017 Los Alamos National Security, LLC.  All rights
#                         reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

HIO Readme
==========

Last updated 2017-01-12

libHIO is a flexible, high-performance parallel IO package developed at LANL.
libHIO supports IO to either a conventional PFS or to DataWarp, including
management of DataWarp space and stage-in from and stage-out to the PFS.

libHIO has been released as open source and is available at:

https://github.com/hpc/libhio

For more information on using libHIO, see the GitHub repository, in particular:

README
libhio_api.pdf
hio_example.c
README.datawarp

See the NEWS file for a description of changes to HIO.  Note that this README
refers to various LANL clusters that have been used for testing HIO.  Using
HIO in other environments may require some adjustments.


Building
--------

HIO builds via a standard autoconf/automake build.  So, to build (a sketch
follows the list):

1) Untar
2) cd to root of tarball
3) module load needed compiler or MPI environment 
4) ./configure
5) make 
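
For example, a complete build session from a fresh tarball might look like
the following (the module name is illustrative; load whatever compiler or MPI
module your system requires):

tar -xzf libhio-1.3.0.6.tar.gz    # 1) untar
cd libhio-1.3                     # 2) cd to root of tarball
module load PrgEnv-gnu            # 3) illustrative compiler/MPI module
./configure                       # 4)
make                              # 5)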

Additional generally useful make targets include clean and docs.  make docs
builds the HIO API document, but it requires doxygen and various LaTeX
packages to run, so you may prefer to use the document distributed as
design/libhio_api.pdf.

Our target build environments include gcc with OpenMPI on Mac OS for unit
testing, and the gcc, Intel, and Cray compilers with Cray MPI or OpenMPI on
LANL Cray systems and TOSS clusters.

Included with HIO is a build script named hiobuild.  It performs all of the
above steps in one invocation.  The HIO development team uses it to launch
builds on remote systems.  You may find it useful; a typical invocation might
look like:

./hiobuild -c -s PrgEnv-intel,PrgEnv-gnu

hiobuild will also create a small script named hiobuild.modules.bash that
can be sourced to recreate the module environment used for build.
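
For example, to recreate that module environment in a new shell:

source ./hiobuild.modules.bash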


API Example
-----------

The HIO distribution contains a sample program, test/hio_example.c, which is
built along with libHIO.  The script hio_example.sh will run the sample
program.
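
A minimal run, assuming the script sits alongside the sample in the test
subdirectory, might look like:

cd test
./hio_example.sh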


Simple DataWarp Test Job
------------------------

The HIO source contains a script test/dw_simple_sub.sh that will submit a
simple, small scale test job on a system with Moab/DataWarp integration.  See
the comments in the file for instructions and a more detailed description.
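
Assuming the script needs no arguments (the comments in the file are the
authoritative reference), a typical use might be:

cd test
./dw_simple_sub.sh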


Testing
-------

HIO's tests are in the test subdirectory.  There is a simple API test named
test01 which can also serve as a coding example.  Additionally, other tests
are named run02, run03, etc.  These tests are able to run in a variety of
environments:

1) On Mac OS for unit testing
2) On a non-DataWarp cluster in interactive or batch mode
3) On one of the Trinity systems with DataWarp in interactive or batch mode

run02 and run03 are N-N and N-1 tests, respectively.  Option help can be
displayed by invoking a test with the -h option.  These tests use a common
script named run_setup to process options and establish the testing
environment.  They invoke HIO using a program named xexec which is driven by
command strings contained in the runxx test scripts.
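
For example, to display the option help:

cd <tarball>/test
./run02 -h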

A typical usage to submit a test DataWarp batch job on the small LANL test system
named buffy might look like:

cd <tarball>/test
./run02 -s m -r 32 -n 2 -b 

Options used:
  -s m    ---> Size medium (200 MB per rank)
  -r 32   ---> Use 32 ranks
  -n 2    ---> Use 2 nodes
  -b      ---> Submit a batch job  

The runxx tests will use the hiobuild.modules.bash files saved by hiobuild
(if available) to reestablish the same module environment used at build
time.

A multi-job submission script, run_combo, is available to facilitate running
a large number of tests with one command.  A typical usage for a fairly
thorough test on a large system like Trinity might look like:

run_combo -t ./run02 ./run03 ./run12 -s x y z -n 32 64 128 256 512 1024 -p 32 -b

This will submit 54 jobs (3 x 3 x 6) with all combinations of the specified
tests and parameters.  The job scripts and output will be in the test/run
subdirectory.


Step by step procedure for building and running HIO tests on LANL system Trinitite:
------------------------------------------------------------------------------------

This procedure is accurate as of 2017-01-12 with HIO 1.3.0.6.

1) Get the distribution tarball libhio-1.3.0.6.tar.gz from github at
   https://github.com/hpc/libhio/releases

2) Untar

3) cd <dir>/libhio-1.3       ( <dir> is where you untarred HIO )

4) ./hiobuild -cf -s PrgEnv-intel,PrgEnv-gnu

   At the end of the build you will see:

    tt-fey1 ====[HIOBUILD_RESULT_START]===()===========================================
    tt-fey1 hiobuild : Checking /users/cornell/tmp/libhio-1.3/hiobuild.out for build problems
    24:configure: WARNING: using cross tools not prefixed with host triplet
    259:Warning:
    tt-fey1 hiobuild : Checking for build target files
    tt-fey1 hiobuild : Build errors found, see above.
    tt-fey1 ====[HIOBUILD_RESULT_END]===()=============================================

   Ideally, the two warning messages would not be present, but at the moment, they can be ignored.

5) cd test

6) ./run_combo -t ./run02 ./run03 ./run12 ./run20 -s s m -n 1 2 4 -p 32 -b

   This will create 24 job scripts in the libhio-1.3/test/run directory and submit the jobs.
   Msub messages are in the corresponding .jobid files in the same directory. Job output is
   directed to corresponding .out files.  The number and mix of jobs is controlled by the
   parameters. Issue run_combo -h for more information.

7) After the jobs complete, issue the following:

   grep -c "RESULT: SUCCESS" run/*.out

   If all jobs ran OK, grep should show 24 files each with a count of 1, like this:

   cornell@tr-login1:~/pgm/hio/tr-gnu/libhio-1.3/test> grep -c "RESULT: SUCCESS" run/*.out
   run/job.20170108.080917.out:1
   run/job.20170108.080927.out:1
   run/job.20170108.080936.out:1
   run/job.20170108.081422.out:1
     . . . .
   run/job.20170108.082133.out:1
   run/job.20170108.082141.out:1

   Investigate any missing job output or counts of 0.
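
   One way to spot the problem outputs is to list the files that lack a
   success line (grep -L prints the names of files with no matching lines):

   grep -L "RESULT: SUCCESS" run/*.out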

8) Alternatively, cd to the libhio-1.3/test directory and run the script

   ./check_test

   This will show how many jobs are queued and currently running and how many 
   output files are incomplete or have failures.

9) Resources for better understanding and/or modifying these procedures:

   libhio-1.3/README
   libhio-1.3/README.datawarp
   libhio-1.3/hiobuild -h
   libhio-1.3/test/run_combo -h
   libhio-1.3/test/run_setup -h
   libhio-1.3/test/run02, run03, run12, run20
   libhio-1.3/test/xexec -h
   libhio-1.3/design/libhio_api.pdf
   libhio-1.3/test/hio_example.c

10) Additional test commands; check the results the same way as above:

   Very simple small single job Moab/DataWarp test:

     ./run02 -s t -n 1 -r 1 -b

   Alternate multi job test suitable for a large system like Trinity:

     ./run_combo -t ./run02 ./run03 ./run12 ./run20 -s l x -n 1024 512 256 128 64 -p 32 -b

   Additional many-job submission contention test:

     ./run90 -p 5 -s t -n 1 -b

      This test submits two jobs that each submit two additional jobs.  Job
      submission continues until the -p parameter is exhausted, so the total
      number of jobs is given by (2^p) - 2.  Be cautious about increasing
      the -p parameter.  Since this is only a job submission test, the normal
      scan for RESULT: SUCCESS is not applicable.  Simply wait for the queue to
      empty and look for the expected number of .sh and .out files in the run
      directory.  If there are any .sh files without corresponding .out files,
      look for errors via checkjob -v on the job IDs in the .jobid file.
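
      A quick way to find .sh files without corresponding .out files is a
      small shell loop, for example:

        for f in run/*.sh; do
          [ -f "${f%.sh}.out" ] || echo "no output for $f"
        done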

  DataWarp stage-out can impose a significant load on the scratch file system.
  To inhibit stage-out (which will reduce test coverage), set the environment
  variable:

    export HIO_datawarp_stage_mode=disable


--- End of README ---

libhio's People

Contributors

bws, cornellwright, gshipman, hjelmn, hppritcha, hugegreenbug, m3morin, plamborn, zwparchman


libhio's Issues

Create new libhio release

Can someone cut a new libhio release?
#51 fixes an issue for builds on POWER9 with XL, and we'd like to get this pulled into our production builds.

test

To see if Howard gets an email

libhio-1.4.1.2 fails with openmpi-3.1.0

Making all in xexec
make[2]: Entering directory '/tmp/junghans/spack-stage/spack-stage-ik_yg78y/libhio-1.4.1.2/test/xexec'
  CC       xexec_x-xexec.o
  CC       xexec_x-xexec_base.o
  CC       xexec_x-xexec_fio.o
  CC       xexec_x-xexec_hio.o
  CC       xexec_x-xexec_mpi.o
  CC       xexec_x-cw_misc.o
  CCLD     xexec.x
xexec_x-xexec_hio.o: In function `hi_run':
/tmp/junghans/spack-stage/spack-stage-ik_yg78y/libhio-1.4.1.2/test/xexec/xexec_hio.c:242: undefined reference to `hio_init_mpi'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_aggregate_statistics'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_list_resize'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_list_alloc'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_header_cleanup'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_list_release'
../../src/.libs/libhio.so: undefined reference to `hioi_dataset_list_get'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:393: xexec.x] Error 1
make[2]: Leaving directory '/tmp/junghans/spack-stage/spack-stage-ik_yg78y/libhio-1.4.1.2/test/xexec'
make[1]: *** [Makefile:533: all-recursive] Error 1
make[1]: Leaving directory '/tmp/junghans/spack-stage/spack-stage-ik_yg78y/libhio-1.4.1.2/test'
make: *** [Makefile:504: all-recursive] Error 1

Full spack-build.txt

trace file analyzer

Per discussions with @hjelmn, it would be useful to have an HIO trace file
analyzer to enable reproduction of an application's usage of HIO with a simple
program.  This would facilitate debugging and hopefully performance tuning
within libHIO.

Optimization strategy for DataWarp

Hi, I have a question about the allocation strategy of DataWarp.  According to
the DataWarp documentation, the default optimization strategy is "bandwidth",
which assigns as many servers as possible (as determined by the capacity
request, pool granularity, and available space) to maximize bandwidth.  Does
the "bandwidth" strategy consider other factors (such as the workload status
of each BB server) to avoid placing new jobs on BB servers that are already
very busy, or does it only consider the requested capacity and available space
(as the document says) and allocate BB servers to new jobs on a round-robin
basis?

Thank you!

v1.4.1.1 tarball is broken

wget https://github.com/hpc/libhio/releases/download/v1.4.1.1/libhio-1.4.1.1.tar.bz2
tar -xf libhio-1.4.1.1.tar.bz2
cd libhio-1.4.1.1
./configure
...
config.status: error: cannot find input file: `hdf5-hio/doc/Makefile.in'

My guess is this was broken in 6aec1b9.

libhio with internal json-c build fails on power9 nodes

The json-c tarball included in libhio is too old to recognize the system type
(ppc64le), so the build fails when it tries to build the json-c library.

The json-c tarball needs to be updated to one of the 0.13.1 releases.  These
releases work on Darwin POWER9 nodes, for example.

The tarball also needs to be patched to support the doc-gen removal and
function renaming; see the json-c.patch file.  Note that the current patch
fails to apply cleanly to either json-c master or the 0.13.1 tags, so it will
have to be manually redone.

HDF5 HIO plugin doesn't work with newer versions of HDF5

Looks like the HDF5 HIO plugin is suffering from bit rot. If one tries to build against anything newer than the 1.8.x releases, it fails to compile cleanly and fails to link the test:

H5FDhio.c:238:22: warning: implicit declaration of function 'H5P_get_driver'; did you mean 'H5P_set_driver'? [-Wimplicit-function-declaration]
     if(H5FD_HIO_g != H5P_get_driver(plist))
                      ^~~~~~~~~~~~~~
                      H5P_set_driver
H5FDhio.c:240:41: warning: implicit declaration of function 'H5P_get_driver_info'; did you mean 'H5Pget_driver_info'? [-Wimplicit-function-declaration]
     if(NULL == (fa = (H5FD_hio_fapl_t *)H5P_get_driver_info(plist)))
                                         ^~~~~~~~~~~~~~~~~~~
                                         H5Pget_driver_info
H5FDhio.c:240:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     if(NULL == (fa = (H5FD_hio_fapl_t *)H5P_get_driver_info(plist)))
                      ^
H5FDhio.c: In function 'H5FD_hio_open':
H5FDhio.c:575:26: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
         if(NULL == (fa = (const H5FD_hio_fapl_t *)H5P_get_driver_info(plist)))
                          ^
  CCLD     libh5fdhio.la

Looks like configury and code enhancements are needed to allow building against the 1.8.x code base and newer HDF5 releases.

Libhio file/node on Lustre hangs

Hi folks,

I'm trying to use libhio on a Lustre file system and it is hanging here (see
the stack trace below).  Interestingly, it does not hang when using file/node
on GPFS.

0x00002aaab5a9e9d3 in __pwrite64_nocancel () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00002aaab5a9e9d3 in __pwrite64_nocancel () from /lib64/libpthread.so.0
#1  0x00002aaab1dd3ebd in hioi_file_write (file=0xa, ptr=0x7fffffe1c4a0, count=1921280) at hio_internal.c:622
#2  0x00002aaab1ddae2c in builtin_posix_module_element_io_internal (posix_module=0xa, element=0x7fffffe1c4a0, 
    offset=1921280, iovec=0x2aaab5a9e9d3 <__pwrite64_nocancel+10>, count=-20878584, reading=128)
    at builtin-posix_component.c:2123
#3  0x00002aaab1dda7d6 in builtin_posix_module_process_reqs (dataset=0xa, reqs=0x7fffffe1c4a0, req_count=1921280)
    at builtin-posix_component.c:2233
#4  0x00002aaab1de3312 in hio_element_write_strided_nb (element=0xa, request=0x7fffffe1c4a0, offset=1921280, 
    reserved0=46912680618451, ptr=0x3a668ffec16b08, count=46912496271488, size=1921280, stride=0)
    at api/element_write.c:151
#5  0x00002aaab1de31cc in hio_element_write_strided (element=0xa, offset=140737486374048, reserved0=1921280, 
    ptr=0x2aaab5a9e9d3 <__pwrite64_nocancel+10>, count=16438317289663240, size=46912496271488, stride=69)
    at api/element_write.c:99
#6  0x00002aaab1de3191 in hio_element_write (element=0xa, offset=140737486374048, reserved0=1921280, 
    ptr=0x2aaab5a9e9d3 <__pwrite64_nocancel+10>, count=16438317289663240, size=46912496271488)
    at api/element_write.c:85
#7  0x000000000196e912 in hioc_writeat2 (unit=10, serial=-1981280, data=0x1d5100, 
    offset0=0x2aaab5a9e9d3 <__pwrite64_nocancel+10>, buf_bytes=1921280) at hio.c:752
#8  0x000000000196b875 in hio_module::my_hio_file_write2 (
    unit=<error reading variable: Cannot access memory at address 0xa>, serial=.FALSE., 
    numdata=<error reading variable: Cannot access memory at address 0x1d5100>, pos0=3779645987002334536, 
    data_c1=<error reading variable: Cannot access memory at address 0x3a668ffec16b48>, 
    data_i32=<error reading variable: Cannot access memory at address 0x2aaaaaad00c0>, 
    data_i64=<error reading variable: Location address is not set.>, 
    data_r32=<error reading variable: Location address is not set.>, 
    data_r64=<error reading variable: value requires 1921280 bytes, which is more than max-value-size>)
    at module_hio.f90:255

Array bounds overrun

I recently built libhio with Clang and it pointed out an array bounds overrun in hio_dataset.c:

int hioi_dataset_aggregate_statistics (hio_dataset_t dataset) {
  hio_context_t context = hioi_object_context (&dataset->ds_object);
  uint64_t tmp[7];

  /* collect statistics now so they can be included in the manifest */
  tmp[0] = dataset->ds_stat.s_bread;
  tmp[1] = dataset->ds_stat.s_bwritten;
  tmp[2] = dataset->ds_stat.s_rtime;
  tmp[3] = dataset->ds_stat.s_wtime;
  tmp[4] = atomic_load(&dataset->ds_stat.s_rcount);
  tmp[5] = atomic_load(&dataset->ds_stat.s_wcount);
  tmp[6] = dataset->ds_stat.s_ctime;
  tmp[7] = dataset->ds_stat.s_ftime;
  tmp[8] = atomic_load(&dataset->ds_stat.s_fcount);
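  /* tmp is declared with only 7 elements, so the stores to tmp[7] and
     tmp[8] above write past the end of the array */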

Thought you might like to know.

dw_simple_sub.sh possibly broken

This script does not appear to function currently, possibly because it assumes
Moab (not the current Slurm scheduler), which impacts multiple parts of the
script.

hio_example.sh needs a fix

Line 24 of hio_example.sh should be changed from "aprun" to "srun" (or a
conditional should be introduced to allow for different launch commands) in
order for the script to function with Slurm.

how to do a release

Add a blurb to the README about how to generate a release tarball (use a
Mac OS X machine with LaTeX installed, probably running at least High Sierra).
