The epilogos from meuleman

No reason to call this "bootstrapping" -- rather just call it sampling

epilogos/src/computeEpilogosPairwiseVisual.py

Lines 223 to 240 in d3786b2

    
           def fitOnBootstrap(distanceArrNull, samplingSize): 
        
               if len(distanceArrNull) <= samplingSize: 
        
                   bootstrapData = distanceArrNull 
        
               else: 
        
                   np.random.seed() # On linux, multiprocessing inherits the master seed and doesn't generate fully random numbers 
        
                   bootstrapData = pd.Series(np.random.choice(distanceArrNull, size=samplingSize, replace=False)) 
        
               # ignore warnings 
        
               with warnings.catch_warnings(): 
        
                   warnings.simplefilter("ignore") 
        
                   # Fit the data 
        
                   params = st.gennorm.fit(bootstrapData) 
        
                   # Calculate SSE and MLE 
        
                   mle = st.gennorm.nnlf(params, pd.Series(distanceArrNull)) 
        
               return params, mle

Several lines mention this, let's replace with sampling/sample to avoid confusion

Consider wrapping `subprocess` calls in `try/except` blocks

epilogos/src/computeEpilogosSlurm.py

Line 453 in 80bf1f5

subprocess.run("scancel {}".format(allJobIDs), shell=True)

Using a try/except block where you use subprocess calls can help with catching local issues with running command-line tools outside of Python:

try:
    subprocess.run("scancel {}".format(allJobIDs), shell=True)
except subprocess.CalledProcessError as e:
    # handle error...

Certain error codes might have to do with the binary not being found, or in the case of scancel, one or more job IDs passed to it might not found associated with any jobs internally, etc. Better handling of errors can help end users catch odd problems, or help with debugging.

Consider using `click` library to handle command-line options

epilogos/src/columnSplitterAdsera.py

Lines 108 to 127 in 80bf1f5

    
           # Helper to convert taken in string into a boolean  
        
           def strToBool(string): 
        
               if string in ["true", "t", "T", "y", "Y", "True", "yes", "Yes"]: 
        
                   return True 
        
               elif string in ["false", "f", "F", "n", "N", "False", "no", "No"]: 
        
                   return False 
        
               else: 
        
                   print("ERROR: STRTOBOOL FAILED") 
        
                   return "ERROR: STRTOBOOL FAILED" 
        
           if __name__ == "__main__": 
        
               if len(sys.argv) <= 2: 
        
                   main(sys.argv[1], column="") 
        
               elif len(sys.argv) <= 5: 
        
                   main(sys.argv[1], column=str(sys.argv[-1])) 
        
               elif len(sys.argv) <= 6: 
        
                   main(sys.argv[1], randomSampling=strToBool(sys.argv[2]), equalSize=int(sys.argv[3]), column=sys.argv[4], type1=sys.argv[5]) 
        
               else: 
        
                   main(sys.argv[1], randomSampling=strToBool(sys.argv[2]), equalSize=int(sys.argv[3]), column=sys.argv[4], type1=sys.argv[5], type2=sys.argv[6])

Using click or similar libraries could help simplify handling options (and boolean-flag options).

Consider using type hints in function definitions to indicate types of variables and returned data

epilogos/src/computeEpilogosExpectedMaster.py

Lines 87 to 113 in 80bf1f5

    
           def s1Calc(file1Path, file2Path, rowsToCalc, numStates, verbose): 
        
               """ 
        
               Function responsible for expected frequency calculation over a set of rows for a saliency metric of 1 
        
               Input: 
        
               file1Path -- The path of the only (single epilogos) or first (paired epilogos) file to read states from 
        
               file2Path -- The path of the second file to read states from (paired epilogos) 
        
               rowsToCalc -- The rows to count expected frequencies from the files 
        
               numStates -- The number of states in the state model 
        
               verbose -- Boolean which if True, causes much more detailed prints 
        
               Output: 
        
               A numpy array containing the counts of each state within the specified rows of the file 
        
               """ 
        
               dataArr = readStates(file1Path=file1Path, file2Path=file2Path, rowsToCalc=rowsToCalc, verbose=verbose) 
        
               expFreqArr = np.zeros(numStates, dtype=np.int32) 
        
               if verbose and rowsToCalc[0] == 0: print("Calculating expected frequencies...", flush=True); tExp = time() 
        
               # Simply count all states across out our subset of data 
        
               uniqueStates, stateCounts = np.unique(dataArr, return_counts=True) 
        
               for i, state in enumerate(uniqueStates): 
        
                   expFreqArr[state] += stateCounts[i] 
        
               if verbose and rowsToCalc[0] == 0: print("    Time:", time() - tExp, flush=True) 
        
               return expFreqArr

Type hints can be passed into documentation and linter scripts to automate writing docs and checking the types of data being passed around, including Python primitives (str, int, etc.) and numpy data:

https://www.python.org/dev/peps/pep-0484/

https://numpy.org/devdocs/reference/typing.html

This can also be useful to read the function at a glance, to see what goes in and out, and help with debugging.

computeEpilogos_singleChromosomeSingleProcessor.sh needs to send fewer arguments to computeEpilogosPart2_perChrom.cpp

The arguments that computeEpilogos_singleChromosomeSingleProcessor.sh sends to computeEpilogosPart2_perChrom.cpp need to be the same as the ones that computeEpilogos.sh sends to computeEpilogosPart2_perChrom.cpp.

Comments for /epilogos/src/computeEpilogosGreatestHits.py

I think it would be useful to have more comments on what the output for the greatest hits is for people who are less familiar with it.

Negative values interpretation

Hello,

I wounder if you can elaborate on the interpretation of the negative values in epilogos?
Another question is: does the high value at a particular poistion means that the state at this particular position is the same across all the used tracks? .i.e not so much variability across the different tracks?

Thank you!
Abdul

Consider having one entry point to running epilogos in various forms

This may be looked at in the course of preparing epilogos for distribution via PyPI, but without reading the README and just looking at the scripts and src directories, I wasn't quite sure what the starting point or entry point was to running things.

A lot of people don't read documentation and just jump in. If I didn't read the README, I'd have assumed src/epilogos.py might be the starting point.

Perhaps simplifying things to a dedicated, single "Pythonic" entry point might be helpful. This could call either the Slurm and single-node compute mode of the epilogos kit, instead of pointing users to one or another script.

https://amir.rachum.com/blog/2017/07/28/python-entry-points/

https://chriswarrick.com/blog/2014/09/15/python-apps-the-right-way-entry_points-and-scripts/

Uncertain what the use of `sleep` here is for

epilogos/src/computeEpilogosSlurm.py

Lines 463 to 466 in 80bf1f5

    
           if saliency == 1: 
        
               sleep(2) 
        
           else: 
        
               sleep(10)

Is there a reason you use sleep here, and different time values? (Is it to delay submission of separate jobs to the cluster?)

only get the statePairs.txt, no starch or score.txt.gz file

Hi,

After successfully run epilogos on the test data, I tried computeEpilogos_minimal.sh on my own data using the S3 mode, but got only the statePairs.txt for each chromosome, no observations.starch, scores.txt.gz or exemplarRegions.txt file. Also no error messages. I checked my input data format and it seems to be exactly the seem with the test data in the epilogo package. So I am now really puzzled..Could you please point out what could be wrong? Thanks.

Yan

qcat file specification

Hi,

Could you please give some specification of the output?

Thanks,
Tommy

Consider using Python `None` instead of string to represent null value

epilogos/src/computeEpilogosSlurm.py

Line 30 in 80bf1f5

    
           @click.option("-b", "--background-directory", "expFreqDir", type=str, default=["null"], multiple=True, help="Path to where the background frequency array is read from (-m s) or written to (-m bg, -m both) [default: output-directory]")

Instead of using a default string called null, I would suggest using the Python None keyword to define a null value.

Later on, comparisons can be made more "Pythonic", e.g.:

epilogos/src/computeEpilogosSlurm.py

Lines 73 to 74 in 80bf1f5

    
           if expFreqDir == "null": 
        
               print("Background Directory =", outputDirPath)

can become:

if not expFreqDir: print("Background Directory = {}".format(outputDirPath))

epilogos/src/computeEpilogosSlurm.py

Lines 83 to 84 in 80bf1f5

    
           if expFreqDir == "null": 
        
               expFreqDir = outputDirectory

can become a simpler ternary, perhaps:

expFreqDir = outputDirectory if not expFreqDir else None

computeEpilogos_minimal.sh * buffer overflow detected * error

Hello again, I'm running into another issue that is unfortunately limiting my ability to use epilogos outputs for downstream analysis.

This is 1 example command from our pipeline (I'm using a singularity container, so "singularity exec episead.sif" just runs the rest of the command, but inside the container):

singularity exec episead.sif computeEpilogos_minimal.sh wt/04epilogos_input_by_chr/chr5.txt 13 1 $output_dir "1-9" > wt/05epi_out/chr5/scores.txt.gz

This is effectively:

computeEpilogos_minimal.sh wt/04epilogos_input_by_chr/chr5.txt 13 1 $output_dir "1-9" > wt/05epi_out/chr5/scores.txt.gz

Here is my error message:

*** buffer overflow detected ***: /usr/local/bin/computeEpilogosPart2_perChrom terminated
/usr/local/bin/computeEpilogos_minimal.sh: line 213: 66552 Aborted (core dumped) $EXE1 $infile1 $metric $numStates $outfileP $outfileQ $outfileNsites $groupAspec $groupBspec $outfileRand $outfileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr

Or:

/usr/local/bin/computeEpilogos_minimal.sh: line 301: 451205 Aborted (core dumped) $EXE2 $infile $metric $totalNumSites $infileQ $outfileObserved $outfileScores $chr $infileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr

Expected behavior:
Computing epilogos on a single chromosome bed file

Actual behavior:

Epilogos experiences errors inconsistently (approximately 50% of the time)
And if you run it repeatedly until it works, output files are incomplete (missing between 0 and 99.8% of the expected lines).

All missing lines are from the end of the files, and across 26 chromosome files, the number of missing lines is consistent for each file from run-to-run (in other words, when it runs, it appears to run normally on the first part of the file then stop without error).
The error can occur in either the $EXE1 or $EXE2 command and appears to relate to calling the /bin scripts (part1_perChrom and part2_perChrom specifically)

How I'm using epilogos:
I am running computeEpilogos_minimal.sh on a per-chromosome basis on an LSF cluster as part of a pipeline. I'm running it like that because we use a pipeline based on an older version of epilogos where it was easier to run per-chromosome, and computeEpilogos_minimal.sh appears to be written to fulfill the same function. The error occurs within the pipeline and when run manually.

Environment:

I am using the current version of epilogos, installed using git clone + make, with the 3 /bin scripts added to my path.
I am running epilogos inside a singularity container environment that was built from scratch to run epilogos, and I can include the commands I used to install dependencies below if you would like.
I have requested up to 64gb memory to run a single chromosome with no change in behavior

The test-run recommended in the ReadMe works without error.

Here are lines from a file that's breaking after 4580 lines (the original file is tab-separated):
line number 400:
chr5 400000 401000 5 6 2 5
line number 500:
chr5 500000 501000 11 7 10 10
line number 600:
chr5 600000 601000 12 11 12 12

line number 17600:
chr5 17600000 17601000 8 8 8 8
line number 17700:
chr5 17700000 17701000 8 8 8 8
line number 17800:
chr5 17800000 17801000 4 8 2 4

I don't see any differences in the lines so I don't think the error is because of our input files.

Here are the lines that would be breaking it:
line number 4578:
chr5 4578000 4579000 4 8 8 3
line number 4579:
chr5 4579000 4580000 13 8 8 3
line number 4580:
chr5 4580000 4581000 13 13 13 9
line number 4581:
chr5 4581000 4582000 13 13 13 9
line number 4582:
chr5 4582000 4583000 13 13 4 3

Here are the final lines of the apparently truncated epilogos output file:
chr5 4575000 4576000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4576000 4577000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4577000 4578000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4578000 4579000 0 0 0.813 0.1082 0 0 0 0.1067 0 0 0 0 0
chr5 4579000 4580000 0 0 0.813 0 0 0 0 0.1067 0 0 0 0 0.8291

Any thoughts or feedback would be greatly appreciated. Thank you.

Question about groupSpec argument

Hi, I'm trying to use epilogos and am confused about the groupSpec argument. With how I understand the documentation and comments in the epilogos code, it appears that groupSpec is supposed to state which columns should be included in analysis, and that the first sample (from column 4) should be "1" in this argument because it is the first sample. If this is wrong, please let me know.

Based on this understanding, if I have 9 samples, I would expect "1-9" to work for the groupSpec argument, and it does, but I also tried some other inputs to make sure I understood, such as "2-10", which also works and has slightly different outputs to the scores.txt.gz file. Following that trend, "3-11" raises an error:

/usr/local/bin/computeEpilogos_minimal.sh: line 213: 390467 Segmentation fault (core dumped) $EXE1 $infile1 $metric $numStates $outfileP $outfileQ $outfileNsites $groupAspec $groupBspec $outfileRand $outfileQB > ${outdir}/${jobName}.stdout 2> ${outdir}/${jobName}.stderr

But "4-12" works and so does "4-13".

This is from my input file:
chr3 0 1000 8 8 8 8 8 8 8 8 8
chr3 1000 2000 8 8 8 8 8 8 8 8 8
chr3 2000 3000 8 8 8 8 8 8 8 8 8
chr3 3000 4000 8 8 8 8 8 8 8 8 8
chr3 4000 5000 8 8 8 8 8 8 8 8 8

This is the command I run (for "4-12"):

computeEpilogos_minimal.sh 04epilogos_input_by_chr/chr3.txt 13 1 05epilogos_output_by_chr/chr3 "4-12" > 05epilogos_output_by_chr/chr3/scores.txt.gz

Any input on this topic would be appreciated.

Consider using touch approach to create empty files

Opening and immediately closing a file handle looks a bit odd. Is this perhaps being used to create an empty file?

epilogos/src/computeEpilogosSlurm.py

Lines 365 to 373 in 80bf1f5

    
           try: 
        
               jout = open(jobOutPath, 'x') 
        
               jout.close() 
        
               jerr = open(jobErrPath, 'x') 
        
               jerr.close() 
        
           except FileExistsError as err: 
        
               # This error should never occur because we are deleting the files first 
        
               print(err) 
        
               return

It might be harmless and work fine, if so, but for readability, there are other ways to touch a file that might be a bit clearer, e.g.:

from pathlib import Path

Path('path/to/file').touch()

why need to process by chromosome

Hi,

I want to know the reason why one needs to process by chromosome using computeEpilogos_singleChromosomeSingleProcessor.sh . Is it because the memory usage is too big if I process all chromosomes at once?

Thanks,
Tommy

Consider using `requirements.txt` at the root level to specify dependencies

Having a requirements.txt file at the root level of the project makes installing dependencies as simple as: pip install -r requirements.txt.

You can edit this file whenever new dependencies are needed, without having to keep track of the list of packages in the project README.

Another advantage is that you can specify specific or minimum versions of numpy, etc. in this text file. This helps with making a reproducible setup on other systems, in case you need to debug things. Newer versions of numpy etc. can change function parameters and so on, which could make your code dependent on specific versions.

To create this file with a current environment where you know things are working, you can run pip freeze > requirements.txt in the root directory. The file will be a list of all the dependencies and versions (I think).

This could also be generally useful later on for packaging.

test run failed

Hi,

I was trying the examples.

computeEpilogos_singleChromosomeSingleProcessor.sh data/chr1_127epigenomes_15observedStates.txt.gz 0 15 KL "33-34,37-45,47-48,61"

ls KL
chr1_Pnumerators.txt  p1a_chr1.stderr  p1a_chr1.stdout  p2_chr1.stderr  p2_chr1.stdout  Q.txt

cat p2_chr1.stderr 
Usage type #1:  /scratch/genomic_med/apps/epilogos/bin/computeEpilogosPart2_perChrom infile KLtype NsitesGenomewide infileQ outfileObs outfileQcat chr [infileQ2]
where
* infile holds the tab-delimited state or state pair IDs observed in the epigenomes or pairs of epigenomes, one line per genomic segment
* KLtype is either 0 (for KL), 1 (for KL*), or 2 (for KL**)
  KL compares states, KL* compares tallies of state pairs, and KL** compares state pairs of individual epigenome pairs
* NsitesGenomewide is the total number of sites observed genome-wide
* infileQ contains the Q, Q*, or Q** tally matrix (also see below)
* outfileObs will receive genomic coordinates (regions on chromosome "chr" of width regionWidth, starting at firstBegPos),
  the state (or state pair) making the largest contribution to the metric,
  the magnitude of that contribution, and the total value of the metric.
  If two groups are specified (see below), it will also include a column containing +/-1,
  specifying whether the first group (+1) or the second (-1) contributes more to the overall metric.
* outfileQcat will be in "qcat" format, uncompressed
* Optional additional argument infileQ2 can be used to specify Q, Q*, or Q** for a 2nd group of epigenomes,
  in which case the metric quantifies the difference (distance) between them.

Usage type #2:  /scratch/genomic_med/apps/epilogos/bin/computeEpilogosPart2_perChrom infile KLtype NsitesGenomewide infileQ1 infileQ2 outfileNulls
where
* infile contains random permutations of states (or state pairs) observed in the initial input data
* outfileNulls will receive the total difference metric for each line of permuted states (or state pairs)
* the remaining arguments are the same as described above
This second "usage type" is used to generate a null distribution, for estimating significance
of the metric values calculated via "usage type 1."

other files are empty except (60M in size) chr1_Pnumerators.txt. any ideas what's wrong?
BTW, do you have an estimate on time and CPU usage for the examples?

Thanks!
Tommy

For bash scripts, consider using `set -u` with other options

epilogos/scripts/computeEpilogos.sh

Line 3 in 80bf1f5

# set -e -o pipefail

With production bash scripts shared with others, I suggest using set -u with -e -o pipefail.

This will cause the script to exit with a non-zero error, if a variable is left unset or is otherwise undefined.

This is useful when using positional variables like ${1}, ${2}, etc. which happens later on in this script. If a variable is left out by accident, the script will stop early.

I also saw this here:

epilogos/scripts/computeEpilogos_minimal.sh

Line 3 in 80bf1f5

# set -e -o pipefail

It may be useful to add set -ueo pipeline in here, as well:

https://github.com/meuleman/epilogos/blob/80bf1f5b0846b495d40eda0cd0315c8c0d371b2c/scripts/preprocess_data_ChromHMM.sh

Minimal non-Slurm paired test failed

When running the minimal, non-Slurm paired test, I get the following numpy-related error:

TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'str'

Details:

(epilogos_052421_v3) areynolds@Earl-Grey epilogos % tmp_dir=$(mktemp -d -t epilogos-paired-XXXXXXXXXX)
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % epilogos -m paired -l -a data/pyData/male/ -b data/pyData/female/ -n data/state_metadata/human/Boix_et_al_833_sample/hg19/18/metadata.tsv -o ${tmp_dir}


                  d8b 888                                     
                  Y8P 888                                     
                      888                                     
 .d88b.  88888b.  888 888  .d88b.   .d88b.   .d88b.  .d8888b  
d8P  Y8b 888 "88b 888 888 d88""88b d88P"88b d88""88b 88K      
88888888 888  888 888 888 888  888 888  888 888  888 "Y8888b. 
Y8b.     888 d88P 888 888 Y88..88P Y88b 888 Y88..88P      X88 
 "Y8888  88888P"  888 888  "Y88P"   "Y88888  "Y88P"   88888P' 
         888                            888                   
         888                       Y8b d88P                   
         888                        "Y88P"                    


Input Directory 1 = /Users/areynolds/Developer/Github/epilogos/data/pyData/male
Input Directory 2 = /Users/areynolds/Developer/Github/epilogos/data/pyData/female
State Model = 18
Saliency level = 1
Output Directory = /var/folders/h5/sggy_r7s355gnwssgl9mw44c0000gq/T/epilogos-paired-XXXXXXXXXX.v0iSPsWB
Number of Cores = All available
Quiescent State = 18

STEP 1: Per data file background frequency calculation
    epilogos_matrix_chr1		[Done]

STEP 2: Background frequency combination
    [Done]

STEP 3: Score calculation
    epilogos_matrix_chr1	.........	[Done]

STEP 4: Generating p-values and figures
    Fitting distances		[Done]
    Reading in files		[Done]
    Calculating p-vals		[Done]
    Writing metrics		[Done]
    Benjamini-Hochberg procedure		[Done]
    Greatest hits txt	Traceback (most recent call last):
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/bin/epilogos", line 33, in <module>
    sys.exit(load_entry_point('epilogos', 'console_scripts', 'epilogos')())
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/areynolds/Developer/Github/epilogos/epilogos/__main__.py", line 64, in main
    runEpilogos(mode, commandLineBool, inputDirectory, inputDirectory1, inputDirectory2, outputDirectory, stateInfo, saliency,
  File "/Users/areynolds/Developer/Github/epilogos/epilogos/epilogos.py", line 214, in main
    pairwiseVisual.main(inputDirPath.name, inputDirPath2.name, stateInfo, outputDirPath, fileTag, numProcesses,
  File "/Users/areynolds/Developer/Github/epilogos/epilogos/pairwiseVisual.py", line 106, in main
    createTopScoresTxt(roiPath, locationArr, chrDict, distanceArrReal, maxDiffArr, stateNameList, pvals, False, mhPvals)
  File "/Users/areynolds/Developer/Github/epilogos/epilogos/pairwiseVisual.py", line 548, in createTopScoresTxt
    locations = pd.DataFrame(np.concatenate((locationArr[indices], distanceArr[indices].reshape(len(indices), 1),
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4379, in replace
    return super().replace(
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6500, in replace
    return self.replace(
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4379, in replace
    return super().replace(
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6518, in replace
    return self._replace_columnwise(mapping, inplace, regex)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/frame.py", line 4415, in _replace_columnwise
    newobj = ser.replace(target, value, regex=regex)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/series.py", line 4563, in replace
    return super().replace(
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/generic.py", line 6543, in replace
    new_data = self._mgr.replace_list(
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 642, in replace_list
    masks = [comp(s, mask, regex) for s in src_list]
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 642, in <listcomp>
    masks = [comp(s, mask, regex) for s in src_list]
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 636, in comp
    return _compare_or_regex_search(values, s, regex, mask)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1992, in _compare_or_regex_search
    _check_comparison_types(False, a, b)
  File "/Users/areynolds/miniconda3/envs/epilogos_052421_v3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1971, in _check_comparison_types
    raise TypeError(
TypeError: Cannot compare types 'ndarray(dtype=int32)' and 'str'

My run environment:

(epilogos_052421_v3) areynolds@Earl-Grey epilogos % python --version
Python 3.9.4
(epilogos_052421_v3) areynolds@Earl-Grey epilogos % conda list                                                                               
# packages in environment at /Users/areynolds/miniconda3/envs/epilogos_052421_v3:
#
# Name                    Version                   Build  Channel
ca-certificates           2020.12.5            h033912b_0    conda-forge
certifi                   2020.12.5        py39h6e9494a_1    conda-forge
click                     7.1.2                    pypi_0    pypi
cycler                    0.10.0                   pypi_0    pypi
cython                    0.29.23                  pypi_0    pypi
kiwisolver                1.3.1                    pypi_0    pypi
libcxx                    11.1.0               habf9029_0    conda-forge
libffi                    3.3                  h046ec9c_2    conda-forge
matplotlib                3.3.2                    pypi_0    pypi
natsort                   7.1.1                    pypi_0    pypi
ncls                      0.0.57                   pypi_0    pypi
ncurses                   6.2                  h2e338ed_4    conda-forge
numpy                     1.19.2                   pypi_0    pypi
openssl                   1.1.1k               h0d85af4_0    conda-forge
pandas                    1.1.3                    pypi_0    pypi
patsy                     0.5.1                    pypi_0    pypi
pillow                    8.2.0                    pypi_0    pypi
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7                    pypi_0    pypi
pyranges                  0.0.97                   pypi_0    pypi
pyrle                     0.0.32                   pypi_0    pypi
python                    3.9.4           h9133fd0_0_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.9                      1_cp39    conda-forge
pytz                      2021.1                   pypi_0    pypi
readline                  8.1                  h05e3726_0    conda-forge
scipy                     1.5.2                    pypi_0    pypi
setuptools                49.6.0           py39h6e9494a_3    conda-forge
six                       1.16.0                   pypi_0    pypi
sorted-nearest            0.0.33                   pypi_0    pypi
sqlite                    3.35.5               h44b9ce1_0    conda-forge
statsmodels               0.12.0                   pypi_0    pypi
tabulate                  0.8.9                    pypi_0    pypi
tk                        8.6.10               h0419947_1    conda-forge
tzdata                    2021a                he74cb21_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                haf1e3a3_1    conda-forge
zlib                      1.2.11            h7795811_1010    conda-forge

General bash suggestions

The following script may not be obsolete:

epilogos/scripts/preprocess_data_ChromHMM.sh

Lines 1 to 33 in 80bf1f5

    
           #!/bin/bash 
        
           ### The ${GENOME}.genome file contains chromosome sizes, and can be obtained in several ways, including: 
        
           ### wget -q -O - ftp://hgdownload.cse.ucsc.edu/goldenPath/${GENOME}/database/chromInfo.txt.gz | gunzip - | cut -f 1-2 
        
           GENOME=$1 
        
           ### the METADATA file is expected to contain a header line 
        
           METADATA=$2 
        
           ### Directory where all STATEBYLINE ChromHMM output is stored, one file per biosample-chromosome combination. 
        
           DIR="/home/meuleman/work/projects/imputation//WM20191003_curated_data_mixed/public_ChromHMM_released/observed_aux_18_hg19/CALLS/STATEBYLINE/" 
        
           for CHR in $(cut -f1 ${GENOME}.genome); do 
        
             echo -n "Processing ${CHR}: " 
        
             CMD="paste";  
        
             NUMFILES=`ls -1 ${DIR}/*${CHR}_*.txt* 2>/dev/null | wc -l` 
        
             if [ ${NUMFILES} -gt 0 ]; 
        
             then 
        
               echo "${NUMFILES} files" 
        
             for BIOSAMPLE in $(cut -f1 ${METADATA} | tail -n +2); do 
        
               NAM=`stat -t -c %n ${DIR}/*${BIOSAMPLE}*${CHR}_*.txt*` 
        
               CMD="$CMD <(zcat -f -- ${NAM})"; 
        
             done 
        
             eval $CMD | awk 'BEGIN{OFS="\t";chr=""} 
        
               {if (NR==1) { chr=$2 } else { if (NR>2) print chr, (NR-3)*200, (NR-2)*200, $0}}' \ 
        
               > matrix_${CHR}.txt 
        
             else 
        
               echo "no files found" 
        
             fi 
        
           done

Some general suggestions that I have are:

Use set -ueo pipefail below the shebang line to ensure variables are set and to handle error states more closely. You could even add an -x to log each command. This can be useful for pipelines, log statements, or debugging a problematic script.
Try to avoid using uppercase variable names in bash scripts. Some variables in bash are reserved POSIX names and your intended values will get clobbered, if you accidently use these variable names. It can lead to non-obvious and frustrating errors.
Use curly braces around positional variables like $1 to ${1}, $2 to ${2} etc. Not a big deal, here, but a good habit to get into as this allows you to specify more than nine input parameters ($10 will not work, but ${10} will, for example).

Consider using `sys.stderr.write` or logging library in place of `print`

epilogos/src/columnSplitterAdsera.py

Lines 65 to 67 in 80bf1f5

    
           print("{}\t1,2,3,{}".format(type1, str(shuffledList[:size]).strip('[]').replace(" ", ""))) 
        
           print() 
        
           print("All Others\t1,2,3,{}".format(str(shuffledList[size:2*size]).strip('[]').replace(" ", "")))

The print function hides that it prints by default to standard error, and it doesn't work the same between Python 2 and 3. Using sys.stderr.write is explicit and portable, and isn't much more typing.

I would suggest replacing print() with a sys.stderr.write('\n') or similar to add newline characters. Newlines could also be added to previous print calls.

If these print commands are used for logging, the logging functionality might be useful to look into (https://docs.python.org/3/library/logging.html) for more explicit control of where messages go.

Consider exposing Slurm-specific parameters to `click` options

epilogos/src/computeEpilogosSlurm.py

Lines 378 to 383 in 80bf1f5

    
           if saliency == 1: 
        
               slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand) 
        
           elif saliency == 2: 
        
               slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand) 
        
           elif saliency == 3: 
        
               slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand)

Instead of hardcoding Slurm parameters like partition, memory, tasks-per-CPU etc., consider exposing them via click options. These parameters could be set to Altius-specific defaults, to make the kit easier to run locally.

	def fitOnBootstrap(distanceArrNull, samplingSize):
	if len(distanceArrNull) <= samplingSize:
	bootstrapData = distanceArrNull
	else:
	np.random.seed() # On linux, multiprocessing inherits the master seed and doesn't generate fully random numbers
	bootstrapData = pd.Series(np.random.choice(distanceArrNull, size=samplingSize, replace=False))

	# ignore warnings
	with warnings.catch_warnings():
	warnings.simplefilter("ignore")

	# Fit the data
	params = st.gennorm.fit(bootstrapData)

	# Calculate SSE and MLE
	mle = st.gennorm.nnlf(params, pd.Series(distanceArrNull))

	return params, mle

	# Helper to convert taken in string into a boolean
	def strToBool(string):
	if string in ["true", "t", "T", "y", "Y", "True", "yes", "Yes"]:
	return True
	elif string in ["false", "f", "F", "n", "N", "False", "no", "No"]:
	return False
	else:
	print("ERROR: STRTOBOOL FAILED")
	return "ERROR: STRTOBOOL FAILED"


	if __name__ == "__main__":
	if len(sys.argv) <= 2:
	main(sys.argv[1], column="")
	elif len(sys.argv) <= 5:
	main(sys.argv[1], column=str(sys.argv[-1]))
	elif len(sys.argv) <= 6:
	main(sys.argv[1], randomSampling=strToBool(sys.argv[2]), equalSize=int(sys.argv[3]), column=sys.argv[4], type1=sys.argv[5])
	else:
	main(sys.argv[1], randomSampling=strToBool(sys.argv[2]), equalSize=int(sys.argv[3]), column=sys.argv[4], type1=sys.argv[5], type2=sys.argv[6])

	def s1Calc(file1Path, file2Path, rowsToCalc, numStates, verbose):
	"""
	Function responsible for expected frequency calculation over a set of rows for a saliency metric of 1

	Input:
	file1Path -- The path of the only (single epilogos) or first (paired epilogos) file to read states from
	file2Path -- The path of the second file to read states from (paired epilogos)
	rowsToCalc -- The rows to count expected frequencies from the files
	numStates -- The number of states in the state model
	verbose -- Boolean which if True, causes much more detailed prints

	Output:
	A numpy array containing the counts of each state within the specified rows of the file
	"""
	dataArr = readStates(file1Path=file1Path, file2Path=file2Path, rowsToCalc=rowsToCalc, verbose=verbose)

	expFreqArr = np.zeros(numStates, dtype=np.int32)

	if verbose and rowsToCalc[0] == 0: print("Calculating expected frequencies...", flush=True); tExp = time()

	# Simply count all states across out our subset of data
	uniqueStates, stateCounts = np.unique(dataArr, return_counts=True)
	for i, state in enumerate(uniqueStates):
	expFreqArr[state] += stateCounts[i]
	if verbose and rowsToCalc[0] == 0: print(" Time:", time() - tExp, flush=True)

	return expFreqArr

	if expFreqDir == "null":
	print("Background Directory =", outputDirPath)

	try:
	jout = open(jobOutPath, 'x')
	jout.close()
	jerr = open(jobErrPath, 'x')
	jerr.close()
	except FileExistsError as err:
	# This error should never occur because we are deleting the files first
	print(err)
	return

	#!/bin/bash

	### The ${GENOME}.genome file contains chromosome sizes, and can be obtained in several ways, including:
	### wget -q -O - ftp://hgdownload.cse.ucsc.edu/goldenPath/${GENOME}/database/chromInfo.txt.gz \| gunzip - \| cut -f 1-2
	GENOME=$1

	### the METADATA file is expected to contain a header line
	METADATA=$2

	### Directory where all STATEBYLINE ChromHMM output is stored, one file per biosample-chromosome combination.
	DIR="/home/meuleman/work/projects/imputation//WM20191003_curated_data_mixed/public_ChromHMM_released/observed_aux_18_hg19/CALLS/STATEBYLINE/"

	for CHR in $(cut -f1 ${GENOME}.genome); do
	echo -n "Processing ${CHR}: "
	CMD="paste";
	NUMFILES=`ls -1 ${DIR}/${CHR}_.txt* 2>/dev/null \| wc -l`
	if [ ${NUMFILES} -gt 0 ];
	then
	echo "${NUMFILES} files"
	for BIOSAMPLE in $(cut -f1 ${METADATA} \| tail -n +2); do
	NAM=`stat -t -c %n ${DIR}/${BIOSAMPLE}${CHR}_.txt`
	CMD="$CMD <(zcat -f -- ${NAM})";
	done
	eval $CMD \| awk 'BEGIN{OFS="\t";chr=""}
	{if (NR==1) { chr=$2 } else { if (NR>2) print chr, (NR-3)200, (NR-2)200, $0}}' \
	> matrix_${CHR}.txt
	else
	echo "no files found"
	fi
	done

	print("{}\t1,2,3,{}".format(type1, str(shuffledList[:size]).strip('[]').replace(" ", "")))
	print()
	print("All Others\t1,2,3,{}".format(str(shuffledList[size:2*size]).strip('[]').replace(" ", "")))

	if saliency == 1:
	slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand)
	elif saliency == 2:
	slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand)
	elif saliency == 3:
	slurmCommand = "sbatch --dependency=afterok:{} --job-name={}.job --output={} --partition=queue1 --error={} --ntasks=1 --mem-per-cpu=8000 --wrap='{}'".format(scoreJobIDStr, jobName, jobOutPath, jobErrPath, pythonCommand)

meuleman / epilogos Goto Github PK

epilogos's People

Contributors

Stargazers

Watchers

Forkers

epilogos's Issues

Recommend Projects

Recommend Topics

Recommend Org