bmds-lab / crackling

CRISPR, faster, better – The Crackling method for whole-genome target detection

License: BSD 3-Clause "New" or "Revised" License

Python 11.93% C++ 86.05% Makefile 0.07% C 1.95%
bioinformatics crispr crispr-analysis crispr-design efficacy issl specificity

Contributors

ceschmitz, jakeb1996, systemsresearch

crackling's Issues

The RNAfold output is at risk of being overwritten if multiple instances of Crackling were launched from the same directory

The RNAfold -o flag will write to a file named RNAfold_output.fold if no value for -o is provided.

In the code below, no filename is provided. If multiple instances of Crackling are launched from the same directory and more than one of them reaches the folding step at the same time, one RNAfold run may overwrite the output of another. I ran into this issue when running many instances of Crackling on an HPC.

To fix this issue, an output file should be specified so that the default is not used.

runner('{} --noPS -j{} -i {} -o'.format(
        configMngr['rnafold']['binary'],
        configMngr['rnafold']['threads'],
        configMngr['rnafold']['input']
    ),
    shell=True,
    check=True
)
os.replace('RNAfold_output.fold' ,configMngr['rnafold']['output'])
printer('\t\tStarting to process the RNAfold results.')
RNAstructures = {}
with open(configMngr['rnafold']['output'], 'r') as fRnaOutput:

I used this code as my temporary fix:

runner('{} --noPS -j{} -i {} -o {}'.format(
        configMngr['rnafold']['binary'],
        configMngr['rnafold']['threads'],
        configMngr['rnafold']['input'],  
        configMngr['rnafold']['output']
    ),
    shell=True,
    check=True
)

#os.replace('RNAfold_output.fold' ,configMngr['rnafold']['output'])

printer('\t\tStarting to process the RNAfold results.')

RNAstructures = {}
with open(configMngr['rnafold']['output'], 'r') as fRnaOutput:

The datetime directives used to format elapsed time are incorrect

When the elapsed time for a batch and the total elapsed time are reported, the format string %d %H:%M:%S is used. Because %d is the day of the month, which starts at 1 (there is no 0'th day of a month), a run that takes less than one day is still reported with a day value of 01, as if it had taken at least 24 hours.

The code used to report elapsed time needs to be changed so that days/hours/minutes/seconds are reported accurately.

time.strftime('%d %H:%M:%S', time.gmtime((time.time() - startTime))),

time.strftime('%d %H:%M:%S', time.gmtime((time.time() - batchStartTime))),
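One possible way to report the elapsed time accurately is to split the number of seconds with divmod instead of passing it through time.gmtime. This is only a sketch: the helper name format_elapsed is hypothetical and is not part of Crackling.

```python
import time


def format_elapsed(seconds):
    """Format a duration in seconds as 'dd hh:mm:ss', with days counted from zero.

    Hypothetical helper, not part of the Crackling code base.
    """
    total = int(seconds)
    days, remainder = divmod(total, 86400)    # 86400 seconds per day
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{days:02d} {hours:02d}:{minutes:02d}:{secs:02d}'


# The current approach reports day 01 for a run of about 40 minutes,
# because gmtime() maps the duration onto 1 January 1970.
current = time.strftime('%d %H:%M:%S', time.gmtime(2421.15))

# The divmod-based helper reports zero days for the same duration.
fixed = format_elapsed(2421.15)
```

It could then replace the two strftime calls above, e.g. format_elapsed(time.time() - startTime).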

Extracting off-targets is at risk of crashing if only one sequence is processed

The extract off-target sites utility will crash if only one FASTA sequence is provided. In that case only one intermediate file exists. The intermediate files are sorted and then merged; importantly, the sort succeeds but the merge does not.

I do not believe there will be an issue when the input(s) are either: (1) multiple FASTA files or (2) a single multi-FASTA file.

The crash is caused by a variable being referenced before assignment.
Line 191 sits outside of the while loop, so if the loop body never runs, mergedFile is never assigned.

shutil.move(mergedFile.name, fpOutput)
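A minimal sketch of one way to avoid the unassigned variable is to move whichever file is left in the list once the loop finishes, rather than relying on a name that is only set inside the loop. The function merge_sorted_files and its parameters are hypothetical; this is not the project's implementation, and it omits the errno 24 handling and the clean-up of intermediate files.

```python
import heapq
import shutil
import tempfile


def merge_sorted_files(sorted_files, output_path, max_open_files=400):
    """Merge already-sorted text files into output_path (hypothetical helper)."""
    if not sorted_files:
        raise ValueError('no sorted files to merge')
    # At least two files must be merged per pass, otherwise the list never shrinks
    max_open_files = max(2, max_open_files)
    remaining = list(sorted_files)

    while len(remaining) > 1:
        handles = [open(path, 'r') for path in remaining[:max_open_files]]
        try:
            # delete=False keeps the file on disk after the with block closes it
            with tempfile.NamedTemporaryFile('w', delete=False) as merged:
                merged.writelines(heapq.merge(*handles))
        finally:
            for handle in handles:
                handle.close()
        remaining = remaining[max_open_files:] + [merged.name]

    # With a single input the loop never runs and remaining[0] is that input,
    # so the move works whether or not any merging took place.
    shutil.move(remaining[0], output_path)
```

Note that in the single-file case this moves the input file itself, which is reasonable only if the inputs are intermediate files, as they are described above.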

installing issue

Hi Crackling developers,
Thanks for making crackling available.
I ran into an issue while trying to install it.
The error messages are listed below:

g++ -O3 -std=c++11 -fopenmp -mpopcnt -Isrc/ISSL/include -o bin/isslScoreOfftargets src/ISSL/isslScoreOfftargets.cpp
src/ISSL/isslScoreOfftargets.cpp: In function ‘int main(int, char**)’:
src/ISSL/isslScoreOfftargets.cpp:193:14: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
  193 |         fread(&mask, sizeof(uint64_t), 1, fp);
      |         ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ISSL/isslScoreOfftargets.cpp:194:14: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
  194 |         fread(&score, sizeof(double), 1, fp);
      |         ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/bin/ld: cannot open output file bin/isslScoreOfftargets: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [Makefile:13: isslScoreOfftargets] Error 1

Can you please help me figure it out?

Thanks a lot in advance.

Best,
Huanle

Does not compile

Dear authors,

g++ -o search_ots_score search_ots_score.cpp -O3 -std=c++11 -fopenmp -mpopcnt

fails with

   34 | #include <phmap.h>
      |          ^~~~~~~~~
compilation terminated.

despite phmap.h being present.
What am I doing wrong?

Formatting of the stdout text

Some of the stdout log is inconsistently formatted, or is incorrect. See examples below.

In several places, large numbers are displayed with a comma as the thousands separator. In other places, no commas are used.

>>> 2022-02-17 16:14:29:442676: Done.

>>> 2022-02-17 16:14:29:442723: 2500000 guides evaluated.

>>> 2022-02-17 16:14:29:442841: This batch ran in 01 00:40:21 (dd hh:mm:ss) or 2421.1502072811127 seconds

>>> 2022-02-17 16:14:29:442913: Processing batch file 2 of 7

>>> 2022-02-17 16:14:35:888107:         Loaded 2,500,000 guides

>>> 2022-02-17 16:14:35:888266: CHOPCHOP - remove those without G in position 20.

>>> 2022-02-17 16:15:13:670180:         773,162 of 1,043,323 failed here.

...

In the run that produced this output, the batch size was set to 2.5 million guides, yet the second line of the output below says a page will contain 5 million guides.
A page should never be larger than the batch it belongs to.
I believe pageSize = min(batchSize, maxPageSize, actualPageSize) would be more accurate.
Also, notice the lack of commas again.


>>> 2022-02-17 16:17:10:869049: mm10db - check secondary structure.

>>> 2022-02-17 16:17:49:964502:         Processing page 1 (5,000,000 per page).

>>> 2022-02-17 16:17:49:964671:                 Constructing the RNAfold input file.

>>> 2022-02-17 16:17:50:481058:                 678,423 guides in this page.
...

>>> 2022-02-17 16:24:15:592360: Calculating mm10db final result.

>>> 2022-02-17 16:24:18:294110:         426284 accepted.

>>> 2022-02-17 16:24:18:294204:         2073716 failed.
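The two suggestions above (capping the reported page size and using a thousands separator everywhere) could look roughly like the sketch below. The function names are hypothetical and the argument names simply mirror the pageSize expression proposed above; none of this is taken from the Crackling source.

```python
def effective_page_size(batch_size, max_page_size, actual_page_size):
    """Page size to report: never larger than the batch or the guides present."""
    return min(batch_size, max_page_size, actual_page_size)


def format_count(n):
    """Render an integer with ',' as the thousands separator."""
    return f'{n:,}'


# Values taken from the log excerpt above
page_size = effective_page_size(2_500_000, 5_000_000, 678_423)
message = f'{format_count(426284)} accepted.'
```

Routing every count through a single helper such as format_count is one way to keep the log output consistent.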

[Windows] Extract off-targets: opened temporary file cannot be moved

When running the extract off-targets script on Windows, an OS error may be thrown when trying to move the temporary file.

The temporary file is never closed, so Windows refuses to move it.

The error occurs here:

shutil.move(mergedFile.name, fpOutput)

According to the Python docs for tempfile.NamedTemporaryFile, this behavior is expected:

tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, delete=True, *, errors=None)

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows). If delete is true (the default), the file is deleted as soon as it is closed. The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.

One way to fix this is by closing the temporary file inside the while loop:

while len(sortedFiles) > 1:
    # A file to write the merged sequences to
    mergedFile = tempfile.NamedTemporaryFile(delete = False)
    # Select the files to merge
    while True:
        try:
            sortedFilesPointers = [open(file, 'r') for file in sortedFiles[:maxNumOpenFiles]]
            break
        except OSError as e:
            if e.errno == 24:
                printer(f'Attempted to open too many files at once (OSError errno 24)')
                maxNumOpenFiles = max(1, int(maxNumOpenFiles / 2))
                printer(f'Reducing the number of files that can be opened by half to {maxNumOpenFiles}')
                continue
            raise e
    printer(f'Merging {len(sortedFilesPointers):,}')
    # Merge and write
    with open(mergedFile.name, 'w') as f:
        f.writelines(heapq.merge(*sortedFilesPointers))
    # Close all of the open files
    for file in sortedFilesPointers:
        file.close()
    # prepare for the next set to be merged
    sortedFiles = sortedFiles[maxNumOpenFiles:] + [mergedFile.name]
shutil.move(mergedFile.name, fpOutput)

mergedFile.close()
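As a standalone illustration of why the close matters, the sketch below writes to a NamedTemporaryFile created with delete=False, closes it, and only then moves it. The function write_then_move is hypothetical and is not part of the extract off-targets script.

```python
import shutil
import tempfile


def write_then_move(lines, destination):
    """Write lines to a named temporary file, close it, then move it.

    Hypothetical helper. Closing the handle before shutil.move() is the
    step that the Windows behavior described above requires.
    """
    tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
    try:
        tmp.writelines(lines)
    finally:
        # delete=False means closing does not remove the file
        tmp.close()
    shutil.move(tmp.name, destination)
    return destination
```

The same ordering (close first, then move) is expected to work on Unix as well, so it should not need a platform check.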
