data_buddy's Introduction

data_buddy

This repository contains some scripts I use when running a data-analysis project.

My data-analysis projects contain:

  • details of a project-specific conda environment that should be created and activated before running anything

  • some non-conda packages and scripts that I use in multiple projects. These are either

    • copied in from local directories / files (in which case the current project should keep them under version control); or

    • (preferably) cloned from a GitHub or Bitbucket repository. For cloned packages, an explicit package version is pinned by specifying a git commit SHA and branch, and the current project does not keep the cloned package under version control.

  • packages & scripts that are developed specifically for the current project

  • a Snakefile for controlling the running of the project scripts

  • links to data

  • subjobs (which are nested copies of the project structure, but which are version-controlled and environment-defined within the main project)


Since data_buddy will progressively change, it should be copied into any new project (for the moment at least).

All config files for use in data_buddy should be stored in ./.sidekick/setup


To run ./sidekick setup, your environment should contain:

sh
pyyaml
# and for R-based projects
r-base
r-desc
r-devtools

data_buddy's People

Contributors

russhyde

data_buddy's Issues

'setup' should fail if a cloned package's deps aren't available

When building / installing a cloned package, ensure that its dependencies are present in the environment and fail if they are not. The pipeline currently just prints the names of missing dependencies:

...
Building package: ./lib/cloned_packages/miiq
✔  checking for file ‘/home/ah327h/jobs_llr/drug_markers/lib/cloned_packages/miiq/DESCRIPTION’ ...
─  preparing ‘miiq’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘miiq_0.1.0.tar.gz’
   
[1] "./lib/built_packages/miiq_0.1.0.tar.gz"
*** Installing into /home/ah327h/tools/miniconda3/envs/drug_markers/lib/R/library ***
ERROR: dependencies ‘Biobase’, ‘limma’, ‘preprocessCore’, ‘AnnotationDbi’, ‘ArrayExpress’, ‘GEOquery’, ‘RCurl’, ‘XML’ are not available for package ‘miiq’
* removing ‘/home/ah327h/tools/miniconda3/envs/drug_markers/lib/R/library/miiq’
Warning message:
In install.packages(pkg, repos = NULL, type = "source") :
  installation of package ‘./lib/built_packages/miiq_0.1.0.tar.gz’ had non-zero exit status
...

Add further default code to start of `notebook.Rmd`

Add sketches and gallery as empty lists at the start of notebook.Rmd.

These are for storing results within the notebook. They were previously called

  • pith (now sketches): results stored for immediate presentation; and

  • synopsis (now gallery): results stored for presentation in the executive-summary section

`validate_dir_existence` should autoexpand `~` as home dir

Calling ./sidekick setup with the existing directory ~/snap listed in ./.sidekick/setup/check_these_dirs.yaml threw an error:

./sidekick setup

JOB: /home/ah327h/temp/my_new_project
./scripts/setup.sh: Running the work-package setup-script.
./scripts/setup.sh: 'buddy' has already been installed
Traceback (most recent call last):
  File "./bin/buddy/buddy/validate_dir_existence.py", line 46, in <module>
    run_workflow(ARGS.required_dirs_yaml[0])
  File "./bin/buddy/buddy/validate_dir_existence.py", line 28, in run_workflow
    errno.ENOENT, os.strerror(errno.ENOENT), current_dir
FileNotFoundError: [Errno 2] No such file or directory: '~/snap'
Traceback (most recent call last):
  File "./sidekick", line 117, in <module>
    main()
  File "./sidekick", line 111, in main
    args.func(args)
  File "./sidekick", line 29, in setup
    subprocess.run(["./scripts/setup.sh"], check=True)
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['./scripts/setup.sh']' returned non-zero exit status 1.
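
One possible fix is to expand `~` before testing for existence. The sketch below is not the existing buddy code: `resolve_dir` and `check_dirs_exist` are hypothetical names, and the only part taken from the traceback is the `FileNotFoundError` that is raised.

```python
import errno
import os


def resolve_dir(path):
    """Expand `~` and environment variables, then return an absolute path."""
    return os.path.abspath(os.path.expandvars(os.path.expanduser(path)))


def check_dirs_exist(dirs):
    """Raise FileNotFoundError for the first entry that is not an existing directory.

    The error reports the path as the user wrote it, so it can be found
    in the yaml file.
    """
    for current_dir in dirs:
        if not os.path.isdir(resolve_dir(current_dir)):
            raise FileNotFoundError(
                errno.ENOENT, os.strerror(errno.ENOENT), current_dir
            )
```

With this in place, an entry such as `~/snap` is checked against the user's home directory rather than against a literal `~` directory in the project root.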

.gitignore should contain...

Cloned data:

  • **/lib/cloned_packages # note that lib/cloned_packages only ignores cloned code in the top project

Raw and computed data:

  • .RData

Temp files:

  • *.swp

Local configs:

  • .Rproj.user

Log files:

  • .Rhistory
  • .ipynb_checkpoints/*
  • *.Rcheck

Computed files:

  • results
  • *.pdf
  • *.png
  • *.html
  • doc/figure
  • jars/*
  • *.jar
  • man
  • *.knit.md # intermediates during
  • *.utf8.md # .. rmarkdown compilation

md5sum Validator should check if input_file exists

If a validation-tests.yaml file mentions an input-file that doesn't exist, the md5sum-based validation pipeline fails with the following error:

(nil_cpi_rita_rnaseq) ah327h@15:43:12:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation_tests.yaml 

Traceback (most recent call last):
  File "bin/buddy/buddy/validate_file_contents.py", line 30, in <module>
    run_workflow(ARGS.validate_yaml[0])                                                                                           
  File "bin/buddy/buddy/validate_file_contents.py", line 13, in run_workflow
    report = workflow.format_failure_report()                                                                                     
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 54, in format_failure_report
    failures = self.get_failing_validators()                                                                                      
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 42, in get_failing_validators
    return {k: v for k, v in self.validators.items() if not v.is_valid()}                                                         
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_workflow.py", line 42, in <dictcomp>
    return {k: v for k, v in self.validators.items() if not v.is_valid()}                                                         
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_classes.py", line 14, in is_valid
    return get_md5sum(self.input_file) == self.expected_md5sum                                                                    
  File "/home/ah327h/jobs_llr/nil_cpi_rita_rnaseq/bin/buddy/buddy/validation_classes.py", line 26, in get_md5sum
    md5 = str(sh.md5sum(filepath)).strip().split()[0]                                                                             
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 1427, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)                                                                  
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 774, in __init__
    self.wait()                                                                                                                   
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 792, in wait
    self.handle_command_exit_code(exit_code)                                                                                      
  File "/home/ah327h/tools/miniconda3/envs/nil_cpi_rita_rnaseq/lib/python3.6/site-packages/sh.py", line 815, in handle_command_exit_code
    raise exc                                                                                                                     
sh.ErrorReturnCode_1:

  RAN: /usr/bin/md5sum results/notebook_pdf_output/nested_lrt/cpi_in_rita.nested_contrast_lrt_top_tags.tsv                        
   
  STDOUT:                                                                                                                         
            
   
  STDERR:                                                                                                                         
/usr/bin/md5sum: results/notebook_pdf_output/nested_lrt/cpi_in_rita.nested_contrast_lrt_top_tags.tsv: No such file or directory
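
A minimal sketch of one way to handle this, assuming a missing input file should simply count as a failed validation. The `is_valid` and `get_md5sum` names come from the traceback; the class name `Md5sumValidator` is hypothetical, and `hashlib` is swapped in for the `sh.md5sum` call so that the example needs no external command.

```python
import hashlib
import os


def get_md5sum(filepath, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


class Md5sumValidator:
    """Hypothetical stand-in for the validator class in validation_classes.py."""

    def __init__(self, input_file, expected_md5sum):
        self.input_file = input_file
        self.expected_md5sum = expected_md5sum

    def is_valid(self):
        # A missing file is reported as a failed test rather than a crash.
        if not os.path.isfile(self.input_file):
            return False
        return get_md5sum(self.input_file) == self.expected_md5sum
```

Whether a missing file should be a failed test or a distinct, explicitly reported error is a design choice; the sketch takes the simpler route.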

Don't reinstall `buddy` package if it is present

Currently, setup.sh installs the buddy package using pip, but it does this every time ./scripts/setup.sh is called.

Would prefer if

  • buddy was installed at the first run of ./scripts/setup.sh

  • and then installed again only if the version installed in the current environment predates the source code in ./scripts/helpers_for_setup/buddy

For the latter, see #20

given R is required and R is not in current env: setup should fail informatively

When R is not in the current env, and the project has "IS_R_REQUIRED=1", setup.sh fails silently.
An error code is raised, but no informative message is written to stderr.

Suggest splitting check_env.sh into:

  • check the name of the current env matches the expected (can do this in setup.sh)
  • check the content of the current env is valid (do this in python)

integration tests for git clone / checkout

Require integration tests to check that a repo can be cloned and that a specific commit can be checked out.

The unit tests only check that the correct git commands are called.

Suggest:

  • setup-method makes a bare git repo in a random directory, and commits twice to the repo (first add file1, then add file2)

  • test1: clone the new repo to a new position within the temporary directory; assert that file1 and file2 are present in the copied location

  • test2: obtain the commit hashes for the two commits; checkout the first commit; ensure that file1 is present and file2 is absent

  • test3: attempt to checkout a random commit hash "abcdef1" in the new repo, assert that an exception is raised.
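
The suggested tests could be sketched as follows. This assumes `git` is on the PATH and uses a normal (non-bare) source repo, since commits cannot be made directly in a bare one. The helper names (`git`, `make_source_repo`, `test_clone_and_checkout`) are illustrative only, and the sketch drives git directly rather than through buddy's own clone / checkout functions.

```python
import os
import subprocess
import tempfile


def git(*args, cwd):
    """Run a git command in `cwd`; return stripped stdout, raise on failure."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def make_source_repo(parent):
    """Create a repo with two commits (file1, then file2).

    Returns the repo path and the list of the two commit hashes.
    """
    repo = os.path.join(parent, "source")
    os.mkdir(repo)
    git("init", "--quiet", cwd=repo)
    hashes = []
    for name in ("file1", "file2"):
        with open(os.path.join(repo, name), "w") as handle:
            handle.write(name + "\n")
        git("add", name, cwd=repo)
        # Identity is passed inline so the test does not rely on global config.
        git("-c", "user.name=tester", "-c", "user.email=tester@example.com",
            "commit", "--quiet", "-m", "add " + name, cwd=repo)
        hashes.append(git("rev-parse", "HEAD", cwd=repo))
    return repo, hashes


def test_clone_and_checkout():
    with tempfile.TemporaryDirectory() as tmp:
        source, (first, _second) = make_source_repo(tmp)
        clone = os.path.join(tmp, "clone")
        file1 = os.path.join(clone, "file1")
        file2 = os.path.join(clone, "file2")

        # test1: both files are present after cloning
        git("clone", "--quiet", source, clone, cwd=tmp)
        assert os.path.exists(file1) and os.path.exists(file2)

        # test2: at the first commit, file1 is present and file2 is absent
        git("checkout", "--quiet", first, cwd=clone)
        assert os.path.exists(file1) and not os.path.exists(file2)

        # test3: checking out an unknown commit hash fails
        try:
            git("checkout", "--quiet", "abcdef1", cwd=clone)
        except subprocess.CalledProcessError:
            pass
        else:
            raise AssertionError("checkout of unknown commit should fail")
```

In a real test suite the three checks would probably be separate test functions sharing a fixture; they are combined here to keep the sketch short.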

allow user specified build/installation order for R packages

R packages may depend on each other for installation.

Typically, copied packages may depend on cloned packages, and project-specific packages may depend on either copied or cloned packages.

User should be able to configure which gets built / installed in which order.

should fail if a package can't be built / installed

Within setup_libs.sh for project tki_abl_rnaseq, installation of reeq failed because some packages (eg, bioconductor-edger) were missing from the environment.

R threw a warning that it couldn't install the package.

This didn't stop ./sidekick setup from running.

If a package cannot be installed, setup should throw an error and trigger pipe-fail in the pipeline.

md5sum compare a comment-stripped file

Eg, featureCounts results files produced by two different versions of featureCounts may be identical modulo the header.

These two files have an identical body but differing headers (205615, lane1, dtg_rnaseq project) when using subread 1.6.2 versus 1.5.0-p3:

# Program:featureCounts v1.5.0-p3; Command:"featureCounts" "-p" "-s0" "-T2" "-t" "exon" "-g" "gene_id" "-a" "temp.gtf" "-o" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.fcount" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.bam"
# Program:featureCounts v1.6.2; Command:"featureCounts" "-p" "-s0" "-T2" "-t" "exon" "-g" "gene_id" "-a" "temp.gtf" "-o" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.fcount" "data/job/align/ID205615/205615_S9_L001.hisat2.pe.bam"
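
A minimal sketch of such a comparison, assuming that comment lines are exactly those starting with `#`. The function name `md5sum_without_comments` is hypothetical and not part of buddy.

```python
import hashlib


def md5sum_without_comments(filepath, comment_prefix=b"#"):
    """Return the hex md5 digest of a file, skipping lines that start
    with `comment_prefix`."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as handle:
        for line in handle:
            if not line.startswith(comment_prefix):
                md5.update(line)
    return md5.hexdigest()
```

Two featureCounts outputs that differ only in their `# Program:...` header line would then give the same digest.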

use `./buddy setup|run|validate <... args ...>` or `./sidekick setup ...` to call the python code

At present, we run the code using ./scripts/setup.sh - this mainly uses the shell scripts, and calls any components of buddy that are required.

For the validate-file-contents.py script it doesn't make sense for this to be called during setup (since it will typically be used to validate the contents of results files after a run). For consistency we could add a ./scripts/validate.sh runner script.

Alternatively, we could add a higher level program that decides which script to run. If this is in a project-root, this could be called as

./buddy setup     # run all project (and subproject)-setup code
./buddy run        # run the project (eg, call snakemake)
./buddy validate # check that the project results match expectations (eg, md5sums)

or replace buddy with whatever this tool gets renamed to, eg, sidekick
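
A rough sketch of such a dispatcher using argparse sub-commands. The commands it runs are assumptions: `snakemake` for `run` follows the "eg" above, the validation script path is taken from tracebacks elsewhere in these issues, and `command_for`, `build_parser` and `main` are illustrative names.

```python
import argparse
import subprocess


def command_for(args):
    """Map a parsed sub-command to the command line it should run."""
    if args.command == "setup":
        return ["./scripts/setup.sh"]
    if args.command == "run":
        return ["snakemake"]  # assumption: the project is run via snakemake
    # assumption: the validation script takes the yaml file as a positional arg
    return ["python", "bin/buddy/buddy/validate_file_contents.py", args.yaml]


def build_parser():
    parser = argparse.ArgumentParser(prog="buddy")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # set as an attribute to also work on older Python 3
    subparsers.add_parser("setup", help="run all project (and subproject) setup code")
    subparsers.add_parser("run", help="run the project")
    validate = subparsers.add_parser("validate", help="check results against expectations")
    validate.add_argument("--yaml", required=True, help="validation-tests yaml file")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    subprocess.run(command_for(args), check=True)
```

Calling `main(["validate", "--yaml", "results_validation_tests.yaml"])` would then run the validation script on that file.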

links to links

If I specify a link in "./sidekick/setup/make_these_links.txt" as follows:

../some_path    ./local_link_name

and ../some_path is itself a softlink, I get some weird results.

The target of the link created by ./sidekick setup is expanded to its full filepath rather than just using a path relative to the position of the linkname.

I'm worried that if the resulting links get pushed into version control they won't be portable.

My example at present is when setting up a link to ../int_data/some_dir from ./data/int/; but such links are not included in version control (since ./data is not included in version control).
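
One way to keep such links portable is to store a target that is relative to the link's own directory, without resolving any symlinks along the way. A sketch under that assumption (`make_relative_link` is a hypothetical helper, not the current sidekick code):

```python
import os


def make_relative_link(target, link_name):
    """Create `link_name` as a symlink to `target`, storing the target
    relative to the directory that holds the link.

    os.path.abspath normalises paths lexically and, unlike
    os.path.realpath, does not resolve symlinks, so a target that is
    itself a soft-link is kept as written.
    """
    link_dir = os.path.dirname(os.path.abspath(link_name))
    relative_target = os.path.relpath(os.path.abspath(target), start=link_dir)
    os.symlink(relative_target, link_name)
    return relative_target
```

Because the normalisation is purely lexical, a `..` that follows a symlinked directory in the path may be collapsed differently from how the filesystem would resolve it, so this is only a starting point.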

top-level `sidekick` should be a link to the `bin` version

The data_buddy template has a soft-link from ./sidekick to ./bin/sidekick.py
But, when making a new project this soft-link is disregarded and an actual copy of ./bin/sidekick.py is put into ./sidekick

In a correctly made project, ./sidekick should be a link to ./bin/sidekick

consistency checks for results files

User can specify a range of checks to be performed on named results or data files

Purpose:

  • Boss asks for the top X results in experiment Y
    • If I send these results on, and subsequently refactor the result-generating code in my analysis scripts (or add independent functionality), I want to know that restructuring the code does not alter the results obtained (at least for those X results) since experiments based on those results may be performed.
    • Also, if I replace the result-generating code (say, replacing edgeR with limma), I'd like to know if some aspect of those results is stable (eg, the top-10 genes may be setwise equal, but have different statistics)

Checks should be specified in yaml

test1:
    input_file: some_file
    expected_md5sum: xyz345.....

test2:
    input_file: some_other_file
    test_script: some_script.py
    test_args: "--head=10 --column=2 --sort"
    expected_file: "my_expected_results.[tsv|txt]" ## contains the second column from the first 10 lines

remove `set -u` in check_env.sh

When running check_env.sh, I've written tests to check whether CONDA_PREFIX or CONDA_DEFAULT_ENV are defined and to die with an informative message if they are not.

If the user has not activated a conda environment, these variables will not be set.

With set -u in place, an undefined variable such as CONDA_PREFIX causes check_env.sh to die immediately, before it can give the user an informative message.

Therefore, suggest removing the set -u flag from the header of check_env.sh

fix `validate` in sidekick:

touch abc.yaml

./sidekick validate --yaml abc.yaml

Traceback (most recent call last):
  File "./sidekick", line 98, in <module>
    main()
  File "./sidekick", line 92, in main
    args.func(args)
  File "./sidekick", line 43, in validate
    subprocess.call(["python", validation_script, args.yaml])
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 267, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/ah327h/tools/miniconda3/envs/temp/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: expected str, bytes or os.PathLike object, not list
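
The TypeError suggests that args.yaml reaches subprocess.call as a list, so the command list ends up nested. A possible fix, assuming that is the cause (for example, if --yaml is declared with nargs), is to flatten the argument before building the command. `build_validate_command` is a hypothetical helper.

```python
def build_validate_command(validation_script, yaml_arg):
    """Return a flat command list, whether `yaml_arg` is a single path
    or a list of paths (as argparse produces when nargs is set)."""
    if isinstance(yaml_arg, (list, tuple)):
        yaml_files = list(yaml_arg)
    else:
        yaml_files = [yaml_arg]
    return ["python", validation_script] + [str(path) for path in yaml_files]
```

The result can be passed directly to subprocess.call or subprocess.run.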

unbound PKGNAME when not making R package

For example:

JOB: /home/ah327h/jobs_repos/genomic_refs
./scripts/setup.sh: Running the work-package setup-script.
./scripts/setup.sh: 'buddy' has already been installed
./scripts/helpers_for_setup/setup_libs.sh: line 119: PKGNAME: unbound variable
Traceback (most recent call last):
  File "./sidekick", line 117, in <module>
    main()
  File "./sidekick", line 111, in main
    args.func(args)
  File "./sidekick", line 29, in setup
    subprocess.run(["./scripts/setup.sh"], check=True)
  File "/home/ah327h/tools/miniconda3/envs/genomic_refs/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['./scripts/setup.sh']' returned non-zero exit status 1

validity check the `project_name`

The project_name variable should be checked to see that it can be used as a directory name, both locally and at GitHub / Bitbucket. Simplest check: no whitespace, no dots; only alphanumerics, underscores and dashes.
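
The simplest check could be sketched with a regular expression; `is_valid_project_name` is an illustrative name.

```python
import re

# Only ASCII letters, digits, underscores and dashes; at least one character.
_VALID_PROJECT_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_project_name(project_name):
    """Return True if the name has no whitespace, dots or other symbols."""
    return bool(_VALID_PROJECT_NAME.fullmatch(project_name))
```

For example, `my_project-1` passes, whereas `my project` and `my.project` are rejected.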

Put "[FAILURE]\t" prefix into `validate` report

Make it more obvious when a validation test has failed

Current validation-test output looks like:

ah327h@13:01:02:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation.yaml
test_name:test1 test_type:md5sum    input_file:results/.../<filename>

Would prefer

ah327h@13:01:02:~/jobs_llr/nil_cpi_rita_rnaseq$ ./sidekick validate --yaml results_validation.yaml
[FAILURE]    test_name:test1 test_type:md5sum    input_file:results/.../<filename>
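
A sketch of the preferred report line, assuming the fields are tab-separated as the title suggests. `format_failure` is a hypothetical helper, not the existing `format_failure_report` method.

```python
def format_failure(test_name, test_type, input_file):
    """Return one tab-separated report line for a failed validation test."""
    fields = [
        "[FAILURE]",
        "test_name:" + test_name,
        "test_type:" + test_type,
        "input_file:" + input_file,
    ]
    return "\t".join(fields)
```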

Recursive `./sidekick validate`?

Should be able to run ./sidekick validate in the main project folder, and have it call ./sidekick validate on all subjobs within <main>/subjobs/
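
Finding the subjobs could be sketched as below, assuming every subjob has its own `sidekick` entry point in its root and may itself contain a `subjobs` directory. `find_subjobs` is a hypothetical name; which yaml file to validate in each subjob is left open.

```python
import os


def find_subjobs(project_root):
    """Yield each subjob directory under <project_root>/subjobs that has
    its own `sidekick`, recursing into nested subjobs."""
    subjobs_dir = os.path.join(project_root, "subjobs")
    if not os.path.isdir(subjobs_dir):
        return
    for name in sorted(os.listdir(subjobs_dir)):
        subjob = os.path.join(subjobs_dir, name)
        if os.path.exists(os.path.join(subjob, "sidekick")):
            yield subjob
            yield from find_subjobs(subjob)
```

The validate command could then be run once per yielded directory, for example with `subprocess.run(..., cwd=subjob, check=True)`.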

better description of how to setup the project-environment

Should state that, to make the exact Linux environment, use

conda create --name <project> --file envs/requirements.txt

whereas to make an approximate environment (eg, on macOS, or when the exact builds needed by envs/requirements.txt are no longer available), use

conda env create --name <env> --file envs/environment.yml

make buddy's default position ./bin/buddy

The buddy python project currently sits in ./scripts/helpers_for_setup/buddy
But, buddy now has scripts that aren't tied to 'setup' steps - eg, results-file validation steps.
So it doesn't make sense for it to be put into helpers_for_'setup'

But also, in a given project, the buddy source code should not be modified by the user. So it might make more sense for this package to reside in ./bin/ and be considered "execute-only".

setup_libs bugs

lib/Makefile should check that lib/built_packages dir exists

setup_libs.sh should check for existence of ./scripts/helpers_for_setup/package_builder.R
