russhyde / data_buddy

Hooks for setting up data-analysis projects

R 2.32% Shell 37.19% Python 58.94% TeX 1.55%

data_buddy's Introduction

data_buddy

This repository contains some scripts I use when running a data-analysis project.

My data analysis projects contain

- details of a project-specific conda environment that should be created and activated before running anything
- some non-conda packages and scripts that I use in multiple projects. These are either
  - copied in from local directories / files (in which case the current project should keep them under version control); or
  - (preferably) cloned from a github or bitbucket repository. In the latter case, an explicit package version is included by specifying a git-commit SHA and branch (in which case the current project does not keep the included package under version control).
- packages & scripts that are developed specifically for the current project
- a Snakefile for controlling the running of the project scripts
- links to data
- subjobs (which are nested copies of the project structure, but which are version-controlled and environment-defined within the main project)

Since data_buddy will progressively change, it should be copied into any new project (for the moment at least).

All config files for use in data_buddy should be stored in ./.sidekick/setup

To run ./sidekick setup your environment should contain:

sh
pyyaml
# and for R-based projects
r-base
r-desc
r-devtools

data_buddy's Issues

'setup' should fail if a cloned package's deps aren't available

When building / installing a cloned package, ensure that its dependencies are present in the environment and fail if they are not. The pipeline currently just prints the names of missing dependencies:

...
Building package: ./lib/cloned_packages/miiq
✔  checking for file ‘/home/ah327h/jobs_llr/drug_markers/lib/cloned_packages/miiq/DESCRIPTION’ ...
─  preparing ‘miiq’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘miiq_0.1.0.tar.gz’
   
[1] "./lib/built_packages/miiq_0.1.0.tar.gz"
*** Installing into /home/ah327h/tools/miniconda3/envs/drug_markers/lib/R/library ***
ERROR: dependencies ‘Biobase’, ‘limma’, ‘preprocessCore’, ‘AnnotationDbi’, ‘ArrayExpress’, ‘GEOquery’, ‘RCurl’, ‘XML’ are not available for package ‘miiq’
* removing ‘/home/ah327h/tools/miniconda3/envs/drug_markers/lib/R/library/miiq’
Warning message:
In install.packages(pkg, repos = NULL, type = "source") :
  installation of package ‘./lib/built_packages/miiq_0.1.0.tar.gz’ had non-zero exit status
...

Allow OSTYPE to include MacOS

Setup scripts don't work on MacOS.
See here for a way to check whether the User is running on MacOS
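
The setup scripts themselves are shell, but the same check can be sketched in Python. This is only an illustration: the function name `is_supported_platform` is hypothetical, and the prefix test assumes platform strings of the usual form (`linux`/`linux-gnu` on Linux, `darwin`/`darwin19` on MacOS).

```python
import sys


def is_supported_platform(platform_name: str = sys.platform) -> bool:
    """Return True if the platform string looks like Linux or MacOS.

    Hypothetical helper: accepts either Python's sys.platform value
    ("linux", "darwin") or a bash-style $OSTYPE value ("linux-gnu",
    "darwin19"), since both start with the same prefixes.
    """
    return platform_name.startswith(("linux", "darwin"))
```

Called without arguments it checks the running interpreter's platform; passing a string allows a value read from `$OSTYPE` to be checked instead.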

update README.md to use `./sidekick` not `./scripts/setup.sh`

Add further default code to start of `notebook.Rmd`

Add sketches and gallery as empty lists at start of notebook.Rmd

These are for storing results within the notebook
Were previously called

pith (sketches) (store for immediate presentation) and
synopsis (gallery) (store for presentation in executive summary section)

Reinstall `buddy` if it has been updated since installation

only export R package name if R package is required

Snakefile, .gitignore, TODO.md, include_into_rpackage.txt within the template

Then remove these files from .setup_config/touch_these_files.txt

`validate_dir_existence` should autoexpand `~` as home dir

Calling ./sidekick setup with the existing directory ~/snap stated in ./.sidekick/setup/check_these_dirs.yaml threw an error

./sidekick setup

JOB: /home/ah327h/temp/my_new_project
./scripts/setup.sh: Running the work-package setup-script.
./scripts/setup.sh: 'buddy' has already been installed
Traceback (most recent call last):
  File "./bin/buddy/buddy/validate_dir_existence.py", line 46, in <module>
    run_workflow(ARGS.required_dirs_yaml[0])
  File "./bin/buddy/buddy/validate_dir_existence.py", line 28, in run_workflow
    errno.ENOENT, os.strerror(errno.ENOENT), current_dir
FileNotFoundError: [Errno 2] No such file or directory: '~/snap'
Traceback (most recent call last):
  File "./sidekick", line 117, in <module>
    main()
  File "./sidekick", line 111, in main
    args.func(args)
  File "./sidekick", line 29, in setup
    subprocess.run(["./scripts/setup.sh"], check=True)
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['./scripts/setup.sh']' returned non-zero exit status 1.
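
A minimal sketch of the proposed fix, assuming the check can be reduced to one function per directory. The name `validate_dir_exists` is hypothetical (it is not the function in `validate_dir_existence.py`); the only change that matters is the call to `os.path.expanduser` before the existence test.

```python
import errno
import os


def validate_dir_exists(dir_path: str) -> str:
    """Expand a leading "~" and check that the directory exists.

    Returns the expanded path, or raises FileNotFoundError in the same
    style as the traceback above if the directory is missing.
    """
    expanded = os.path.expanduser(dir_path)
    if not os.path.isdir(expanded):
        raise FileNotFoundError(
            errno.ENOENT, os.strerror(errno.ENOENT), expanded
        )
    return expanded
```

With this in place, `~/snap` is checked as `<home>/snap`, and any error message reports the expanded path.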

.gitignore should contain...

Cloned data:

**/lib/cloned_packages # note that lib/cloned_packages only ignores cloned code in the top project

Raw and computed data:

.RData

Temp files:

*.swp

Local configs:

.Rproj.user

Log files:

.Rhistory
.ipynb_checkpoints/*
*.Rcheck

Computed files:

md5sum Validator should check if input_file exists

If a validation-tests.yaml file mentions an input-file that doesn't exist, the md5sum-based validation pipeline fails with the following error:

(nil_cpi_rita_rnaseq) ah327h@15:43:12:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation_tests.yaml 

Traceback (most recent call last):
  File "bin/buddy/buddy/validate_file_contents.py", line 30, in <module>
    run_workflow(ARGS.validate_yaml[0])                                                                                           
  File "bin/buddy/buddy/validate_file_contents.py", line 13, in run_workflow
    report = workflow.format_failure_report()                                                                                     
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 54, in format_failure_report
    failures = self.get_failing_validators()                                                                                      
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 42, in get_failing_validators
    return {k: v for k, v in self.validators.items() if not v.is_valid()}                                                         
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 42, in <dictcomp>
    return {k: v for k, v in self.validators.items() if not v.is_valid()}                                                         
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_classes.py", line 14, in is_valid
    return get_md5sum(self.input_file) == self.expected_md5sum                                                                    
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_classes.py", line 26, in get_md5sum
    md5 = str(sh.md5sum(filepath)).strip().split()[0]                                                                             
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 1427, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)                                                                  
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 774, in __init__
    self.wait()                                                                                                                   
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 792, in wait
    self.handle_command_exit_code(exit_code)                                                                                      
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 815, in handle_command_exit_code
    raise exc                                                                                                                     
sh.ErrorReturnCode_1:

  RAN: /usr/bin/md5sum results/notebook_pdf_output/nested_lrt/cpi_in_rita.nested_contrast_lrt_top_tags.tsv                        
   
  STDOUT:                                                                                                                         
            
   
  STDERR:                                                                                                                         
/usr/bin/md5sum: results/notebook_pdf_output/nested_lrt/cpi_in_rita.nested_contrast_lrt_top_tags.tsv: No such file or directory
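
One way this could be handled is sketched below. The names `is_valid`, `get_md5sum`, `input_file` and `expected_md5sum` come from the traceback; the class name `Md5sumValidator` is hypothetical, and `hashlib` is swapped in for the `sh.md5sum` call so that the example has no external dependencies. Treating a missing file as a failed validation (rather than raising) is an assumption about the desired behaviour.

```python
import hashlib
import os


def get_md5sum(filepath: str) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return md5.hexdigest()


class Md5sumValidator:
    """Hypothetical validator: compares a file's md5sum to an expected value."""

    def __init__(self, input_file: str, expected_md5sum: str) -> None:
        self.input_file = input_file
        self.expected_md5sum = expected_md5sum

    def is_valid(self) -> bool:
        # A missing input file counts as a validation failure,
        # instead of crashing the whole workflow.
        if not os.path.isfile(self.input_file):
            return False
        return get_md5sum(self.input_file) == self.expected_md5sum
```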

rename since there is a project_buddy and a data_buddy already in existence

`buddy` should be a git subtree of `data_buddy`

To allow independent development of buddy the python package: split it out

Use `utils::package.skeleton` instead of `usethis::create_package` or (now defunct) `devtools::create`

For example, devtools::create has been moved to usethis::create as of devtools-v2.1.0; and devtools::create is used in setup.DESCRIPTION.R

Similarly,

usethis::use_package
options(usethis.full_name = ...)
usethis::use_testthat

Don't reinstall `buddy` package if it is present

Currently, setup.sh installs the buddy package using pip, but it does this every time that ./scripts/setup.sh is called.

Would prefer if

- buddy was installed at the first run of ./scripts/setup.sh
- and then installed again only if the version installed in the current environment predates the source code in ./scripts/helpers_for_setup/buddy

For the latter, see #20

given R is required and R is not in current env: setup should fail informatively

When R is not in the current env, and the project has "IS_R_REQUIRED=1", setup.sh fails silently.
An error code is raised, but no informative message is written to stderr.

Suggest splitting check_env.sh into:

check the name of the current env matches the expected (can do this in setup.sh)
check the content of the current env is valid (do this in python)
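
The second check (validating the content of the env in python) might look roughly like the sketch below. The function name `check_r_available` is hypothetical, and testing for an `Rscript` executable on the PATH is an assumption about what "R is in the current env" should mean.

```python
import shutil


def check_r_available(is_r_required: bool, which=shutil.which) -> None:
    """Raise an informative error if R is required but cannot be found.

    `which` is injectable so the check can be tested without R installed.
    """
    if is_r_required and which("Rscript") is None:
        raise RuntimeError(
            "IS_R_REQUIRED=1 but no Rscript executable was found in the "
            "current environment; install r-base into the active conda env."
        )
```

A caller could catch the `RuntimeError`, write its message to stderr and exit with a non-zero status, so that setup no longer fails silently.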

`sidekick` should fail when `setup` fails

Error codes produced when running setup.sh are not caught / dealt with by sidekick at present

call `./sidekick validate my.yaml` not `./sidekick validate --yaml my.yaml`

yaml file is obligatory. It shouldn't need an argument flag
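
A sketch of the proposed command line using `argparse` subcommands. The parser layout and the `build_parser` name are assumptions rather than the existing `sidekick` code; the point is only that `yaml` becomes a required positional argument.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a sidekick-style parser with a positional yaml argument."""
    parser = argparse.ArgumentParser(prog="sidekick")
    subparsers = parser.add_subparsers(dest="command")

    validate = subparsers.add_parser(
        "validate", help="check results files against a yaml of tests"
    )
    # Positional and required: no --yaml flag is needed.
    # Without nargs, argparse stores this as a single string.
    validate.add_argument("yaml", help="path to the validation-tests yaml")
    return parser
```

`build_parser().parse_args(["validate", "my.yaml"])` then yields `args.yaml == "my.yaml"`.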

script (?python) to copy specific github repos to a project

use yaml.safe_load() not yaml.load()

See here for why yaml.load() is considered unsafe: https://www.kevinlondon.com/2015/08/15/dangerous-python-functions-pt2.html

integration tests for git clone / checkout

Require integration tests to check that a repo can be cloned and that a specific commit can be checked out.

The unit tests only check that the correct git commands are called.

Suggest:

setup-method makes a bare git repo in a random directory, and commits twice to the repo (first add file1, then add file2)
test1: clone the new repo to a new position within the temporary directory; assert that file1 and file2 are present in the copied location
test2: obtain the commit hashes for the two commits; checkout the first commit; ensure that file1 is present and file2 is absent
test3: attempt to checkout a random commit hash "abcdef1" in the new repo, assert that an exception is raised.

allow user specified build/installation order for R packages

R packages may depend on each other for installation.

Typically, copied packages may depend on cloned packages and project-specific package may depend on either copied or cloned packages.

User should be able to configure which gets built / installed in which order.

should fail if a package can't be built / installed

within setup_libs.sh for project tki_abl_rnaseq: installation of reeq failed due to some missing packages (eg, bioconductor-edger) in the environment.

R threw a warning that it couldn't install the package

This didn't stop ./sidekick setup from running.

If a package cannot be installed, it should throw an error and initiate pipe-fail in the pipeline.

md5sum compare a comment-stripped file

eg, featureCounts results files produced by two different versions of featureCounts may be identical modulo the header:

These two files have identical body, but differing header (205615, lane1, dtg_rnaseq project) when using subread 1.6.2 versus 1.5.0-p3

# Program:featureCounts v1.5.0-p3; Command:"featureCounts" "-p" "-s0" "-T2" "-t" "exon" "-g" "gene_id" "-a" "temp.gtf" "-o" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.fcount" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.bam"

# Program:featureCounts v1.6.2; Command:"featureCounts" "-p" "-s0" "-T2" "-t" "exon" "-g" "gene_id" "-a" "temp.gtf" "-o" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.fcount" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.bam"
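
A minimal sketch of such a comparison, assuming that comment lines are exactly those starting with `#`. The function name `md5sum_without_comments` and the `comment_prefix` parameter are hypothetical.

```python
import hashlib


def md5sum_without_comments(filepath: str, comment_prefix: str = "#") -> str:
    """Return the md5 digest of a file, skipping lines that start with a prefix."""
    md5 = hashlib.md5()
    prefix = comment_prefix.encode()
    with open(filepath, "rb") as handle:
        for line in handle:
            if not line.startswith(prefix):
                md5.update(line)
    return md5.hexdigest()
```

Two featureCounts files that differ only in their `# Program:...` header line would then produce the same digest.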

move `.setup_config` into `.sidekick/config`

Not all files used by .sidekick relate to project 'setup', eg, validation rules. Therefore suggest renaming the folder.

use `./buddy setup|run|validate <... args ...>` or `./sidekick setup ...` to call the python code

At present, we run the code using ./scripts/setup.sh - this mainly uses the shell scripts, and calls any components of buddy that are required.

For the validate-file-contents.py script it doesn't make sense for this to be called during setup (since it will typically be used to validate the contents of results files after a run). For consistency we could add a ./scripts/validate.sh runner script.

Alternatively, we could add a higher level program that decides which script to run. If this is in a project-root, this could be called as

./buddy setup     # run all project (and subproject)-setup code
./buddy run        # run the project (eg, call snakemake)
./buddy validate # check that the project results match expectations (eg, md5sums)

or replace buddy with whatever this tool gets renamed to, eg, sidekick

Use relative path to subjobs in `.sidekick/setup/subjob_names.txt`

Rather than <subjob_name> use ./subjobs/<subjob_name> when specifying subjob order in the config. This would allow a data_subjob to add itself to the subjob_names of its parent at initialisation (see russHyde/data_subjob#3).

initial conda-env requirements for `data_buddy`

Need a requirements.txt / environment.yml for creating a minimal conda env that will allow the sidekick setup program, setup-scripts and Snakefiles to run.

set -euo pipefail in job_specific_vars.sh

links to links

If I specify a link in "./sidekick/setup/make_these_links.txt" as follows:

../some_path    ./local_link_name

and ../some_path is itself a softlink, I get some weird results.

The target of the link created by ./sidekick setup is expanded to its full filepath rather than just using a path relative to the position of the linkname.

I'm worried that if the resulting links get pushed into version control they won't be portable.

My example at present is when setting up a link to ../int_data/some_dir from ./data/int/; but such links are not included in version control (since ./data is not included in version control).
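
One possible way to keep such links portable is to store the target relative to the directory that holds the link, without resolving any symlinks along the way. The sketch below is an assumption about how that could be done, not the current setup code; `make_relative_link` is a hypothetical name.

```python
import os


def make_relative_link(target: str, link_name: str) -> str:
    """Create a symlink whose stored target is relative to the link's directory.

    os.path.abspath normalises paths but, unlike os.path.realpath, does not
    resolve symlinks, so a target that is itself a softlink is kept as-is.
    Returns the relative target that was written into the link.
    """
    link_dir = os.path.dirname(os.path.abspath(link_name))
    relative_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(relative_target, link_name)
    return relative_target
```

For a link at the project root pointing to `../some_path`, the stored target stays `../some_path` even when `../some_path` is itself a softlink.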

validation test shouldn't print anything if there are no failures (currently prints empty string)

suggest changing validate_file_contents::run_workflow to only print report if it is a non-empty string.

top-level `sidekick` should be a link to the `bin` version

The data_buddy template has a soft-link from ./sidekick to ./bin/sidekick.py
But, when making a new project this soft-link is disregarded and an actual copy of ./bin/sidekick.py is put into ./sidekick

In a correctly made project, ./sidekick should be a link to ./bin/sidekick

"subjob_names" should be called "subjob_order"

Also, subjobs should be sidekicked by default; use alphanumeric ordering if none is specified

add `./sidekick run <...args...>` for running a project

allow subjobs with no sidekick/buddy present

Don't run any setup steps on subjobs that don't have a sidekick in place.

split python project `buddy` away from cookie-cutter project

eg, put buddy into its own repository and use git submodules

R package name should (by default) be '.'-contracted project name

eg,
If project name is "some_Drug_and-technique"
then package_name should be "some.drug.and.technique"
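
The rule in this example could be sketched as follows, assuming that every run of non-alphanumeric characters becomes a single dot and that the result is lower-cased. The function name is hypothetical, and the sketch does not check R's other naming rules (for instance, that the name starts with a letter).

```python
import re


def default_r_package_name(project_name: str) -> str:
    """Derive a '.'-contracted, lower-case package name from a project name."""
    contracted = re.sub(r"[^A-Za-z0-9]+", ".", project_name)
    return contracted.strip(".").lower()
```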

consistency checks for results files

User can specify a range of checks to be performed on named results or data files

Purpose:

Boss asks for the top X results in experiment Y
- If I send these results on, and subsequently refactor the result-generating code in my analysis scripts (or add independent functionality), I want to know that restructuring the code does not alter the results obtained (at least for those X results) since experiments based on those results may be performed.
- Also, if I replace the result-generating code (say, replacing edgeR with limma), I'd like to know if some aspect of those results is stable (eg, the top-10 genes may be setwise equal, but have different statistics)

Checks should be specified in yaml

test1:
    input_file: some_file
    expected_md5sum: xyz345.....

test2:
    input_file: some_other_file
    test_script: some_script.py
    test_args: "--head=10 --column=2 --sort"
    expected_file: "my_expected_results.[tsv|txt]" ## contains the second column from the first 10 lines

User shouldn't be prompted for an R package name if they aren't going to make an R package

remove `set -u` in check_env.sh

When running check_env.sh, I've written tests to check whether CONDA_PREFIX or CONDA_DEFAULT_ENV are defined and to die with an informative message if they are not.

If the user has not activated a conda environment, these variables will not be set.

With set -u in place, an undefined variable such as CONDA_PREFIX causes check_env.sh to die immediately, before it can give the user an informative message.

Therefore, suggest removing set -u flag from the header of check_env.sh

fix `validate` in sidekick:

touch abc.yaml

./sidekick validate --yaml abc.yaml

Traceback (most recent call last):
  File "./sidekick", line 98, in <module>
    main()
  File "./sidekick", line 92, in main
    args.func(args)
  File "./sidekick", line 43, in validate
    subprocess.call(["python", validation_script, args.yaml])
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: expected str, bytes or os.PathLike object, not list

prompt for env-name, is-r-required, is-r-package-required, is-jupyter-r-kernel-required

When using cookiecutter to build a new project, the user should be prompted for all project-wide variables that are currently stored in ./.setup_config/job_specific_vars.sh or that are required in other template files.

That is:

unbound PKGNAME when not making R package

For example:

JOB: /home/ah327h/jobs_repos/genomic_refs
./scripts/setup.sh: Running the work-package setup-script.
./scripts/setup.sh: 'buddy' has already been installed
./scripts/helpers_for_setup/setup_libs.sh: line 119: PKGNAME: unbound variable
Traceback (most recent call last):
  File "./sidekick", line 117, in <module>
    main()
  File "./sidekick", line 111, in main
    args.func(args)
  File "./sidekick", line 29, in setup
    subprocess.run(["./scripts/setup.sh"], check=True)
  File "/home/ah327h/tools/miniconda3/envs/genomic_refs/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['./scripts/setup.sh']' returned non-zero exit status 1

validity check the `project_name`

The project_name variable should be checked to see that it can be used as a directory name, both locally and at github / bitbucket. Simplest check: no-whitespace, yes-alphanumeric, yes-underscore/dash, no-dots
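
The simplest check described above maps onto a single regular expression. This is a sketch under that reading of the rule; the function name `is_valid_project_name` is hypothetical.

```python
import re

# Letters, digits, underscore and dash only: no whitespace, no dots.
_VALID_PROJECT_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_project_name(project_name: str) -> bool:
    """Return True if the name uses only alphanumerics, underscores and dashes."""
    return _VALID_PROJECT_NAME.fullmatch(project_name) is not None
```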

use cookiecutter logic to specify optional files

eg, if is_r_required is false, there's no need for doc/notebook.Rmd etc.

This might be possible using logic in the filenames, or in the .setup_config/* file contents.

See cookiecutter/cookiecutter#723

Put "[FAILURE]\t" prefix into `validate` report

Make it more obvious when a validation test has failed

Current validation-test output looks like:

ah327h@13:01:02:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation.yaml
test_name:test1 test_type:md5sum    input_file:results/.../<filename>

Would prefer

ah327h@13:01:02:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation.yaml
[FAILURE]    test_name:test1 test_type:md5sum    input_file:results/.../<filename>
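
A sketch of how the preferred report could be produced. The field names `test_name`, `test_type` and `input_file` are taken from the output above; the standalone function, its input shape (a mapping from test name to a dict of details) and the tab separators are assumptions.

```python
from typing import Dict


def format_failure_report(failures: Dict[str, Dict[str, str]]) -> str:
    """Return one '[FAILURE]'-prefixed line per failing test, or '' if none."""
    lines = [
        "[FAILURE]\ttest_name:{}\ttest_type:{}\tinput_file:{}".format(
            name, details["test_type"], details["input_file"]
        )
        for name, details in sorted(failures.items())
    ]
    return "\n".join(lines)
```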

Ensure different versions of the same built packages aren't installed

Eg, if lib/built_packages contains multiple copies of the same package (if pkg_0.1.tar.gz and pkg_0.2.tar.gz then only install pkg_0.2.tar.gz)
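
Selecting the newest tarball per package could be sketched as below. It assumes file names of the form `<pkg>_<version>.tar.gz` with purely numeric version components separated by `.` or `-`; the function names are hypothetical and the input is a list of base names, not full paths.

```python
import re
from typing import Dict, Iterable, List, Tuple

_TARBALL = re.compile(
    r"(?P<name>[A-Za-z][A-Za-z0-9.]*)_(?P<version>\d+(?:[.-]\d+)*)\.tar\.gz"
)


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '0.10' or '1.0-2' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def latest_tarballs(filenames: Iterable[str]) -> List[str]:
    """Keep only the highest-versioned tarball for each package name."""
    latest: Dict[str, Tuple[Tuple[int, ...], str]] = {}
    for filename in filenames:
        match = _TARBALL.fullmatch(filename)
        if match is None:
            continue  # ignore files that are not built-package tarballs
        name = match.group("name")
        key = version_key(match.group("version"))
        if name not in latest or key > latest[name][0]:
            latest[name] = (key, filename)
    return sorted(filename for _, filename in latest.values())
```

Comparing versions as integer tuples means `pkg_0.10.tar.gz` is correctly treated as newer than `pkg_0.2.tar.gz`, which a plain string sort would get wrong.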

Recursive `./sidekick validate`?

It should be possible to run ./sidekick validate in the main project folder and have it call ./sidekick validate on all subjobs within <main>/subjobs/

better description of how to setup the project-environment

Should state that to make the exact linux environment use

conda create --name <project> --file envs/requirements.txt

whereas to make an approximate environment (eg, on macOS, or when the exact builds are no longer available to allow using envs/requirements.txt) use

conda env create --name <env> --file envs/environment.yml

make buddy's default position ./bin/buddy

The buddy python project currently sits in ./scripts/helpers_for_setup/buddy
But, buddy now has scripts that aren't tied to 'setup' steps - eg, results-file validation steps.
So it doesn't make sense for it to be put into helpers_for_'setup'

But also, in a given project, the buddy source code should not be modified by the user. So it might make more sense for this package to reside in ./bin/ and be considered "execute-only".

setup_libs bugs

lib/Makefile should check that lib/built_packages dir exists

setup_libs.sh should check for existence of ./scripts/helpers_for_setup/package_builder.R