The crackling from bmds-lab

The RNAfold output is at risk of being overwritten if multiple instances of Crackling are launched from the same directory

When the -o flag is given without a value, RNAfold writes to a file named RNAfold_output.fold.

In the code below, no filename is provided. If multiple instances of Crackling are launched from the same directory and more than one reaches the folding step at the same time, one RNAfold process may overwrite the output of another. I ran into this issue when running many instances of Crackling on an HPC cluster.

To fix this issue, an output file should be specified so that the default is not used.

Crackling/src/crackling/Crackling.py

Lines 426 to 440 in b00de36

    
    runner('{} --noPS -j{} -i {} -o'.format(
            configMngr['rnafold']['binary'],
            configMngr['rnafold']['threads'],
            configMngr['rnafold']['input']
        ),
        shell=True,
        check=True
    )

    os.replace('RNAfold_output.fold' ,configMngr['rnafold']['output'])

    printer('\t\tStarting to process the RNAfold results.')

    RNAstructures = {}
    with open(configMngr['rnafold']['output'], 'r') as fRnaOutput:

I used this code as my temporary fix:

runner('{} --noPS -j{} -i {} -o {}'.format(
        configMngr['rnafold']['binary'],
        configMngr['rnafold']['threads'],
        configMngr['rnafold']['input'],  
        configMngr['rnafold']['output']
    ),
    shell=True,
    check=True
)

#os.replace('RNAfold_output.fold' ,configMngr['rnafold']['output'])

printer('\t\tStarting to process the RNAfold results.')

RNAstructures = {}
with open(configMngr['rnafold']['output'], 'r') as fRnaOutput:
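
If the configured output path is itself shared between runs, the same collision could reappear. One way around that, sketched here under assumptions, is to derive a per-run filename. The helper `unique_output_path` and its `prefix` and `suffix` parameters are hypothetical and not part of Crackling; only the standard `os` and `uuid` modules are used.

```python
import os
import uuid


def unique_output_path(directory, prefix='rnafold-', suffix='.fold'):
    """Return a path inside `directory` that differs on every call.

    Hypothetical helper: the process ID separates concurrent instances,
    and the random UUID separates repeated calls within one instance.
    """
    name = f'{prefix}{os.getpid()}-{uuid.uuid4().hex}{suffix}'
    return os.path.join(directory, name)


# Two calls never return the same path, even from the same directory.
first = unique_output_path('.')
second = unique_output_path('.')
print(first != second)
```

The resulting path could then be stored in the configuration entry that the command above reads its output filename from.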

The datetime directives used to format elapsed time are incorrect

When the elapsed time for a batch and the total elapsed time are reported, the format string %d %H:%M:%S is used. The elapsed seconds are passed through time.gmtime, which treats them as an offset from 1 January 1970, and %d reports the day of the month. Because there is no 0th day of a month, the day field reads 01 even when the pipeline takes less than one day to run, which suggests a run of at least 24 hours.

The code used to report elapsed time needs to be changed so that days/hours/minutes/seconds are reported accurately.

Crackling/src/crackling/Crackling.py

Line 886 in b00de36

time.strftime('%d %H:%M:%S', time.gmtime((time.time() - startTime))),

Crackling/src/crackling/Crackling.py

Line 879 in b00de36

time.strftime('%d %H:%M:%S', time.gmtime((time.time() - batchStartTime))),
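
One possible replacement, offered as a sketch rather than the project's actual fix, computes days, hours, minutes and seconds directly with `divmod` instead of going through `time.gmtime`. The helper name `format_elapsed` is hypothetical. The last lines reproduce the current behaviour for comparison.

```python
import time


def format_elapsed(seconds):
    """Format a duration in seconds as 'dd hh:mm:ss', counting days from zero."""
    total = int(seconds)
    days, remainder = divmod(total, 86400)    # 24 * 60 * 60 seconds per day
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{days:02d} {hours:02d}:{minutes:02d}:{secs:02d}'


elapsed = 2421.1502072811127  # a batch duration of about 40 minutes

# Direct arithmetic: the day field is 00 for runs shorter than 24 hours.
print(format_elapsed(elapsed))  # 00 00:40:21

# Current approach: %d is the day of the month, so it starts at 01.
print(time.strftime('%d %H:%M:%S', time.gmtime(elapsed)))  # 01 00:40:21
```

The direct calculation also keeps working for runs longer than a month, where a calendar-based day field would wrap around.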

sgRNAScorer 2.0 model training script does not work from the command line

The function main() in this file is intended to be called from the command line but it is not called.

When I run python3.9 trainModel.py then main() should run, but it does not.

Crackling/src/crackling/utils/trainModel.py

Line 119 in b00de36

def main():

Adding the following code to the bottom of the script is a potential solution:

if __name__ == '__main__':
    main()

Extracting off-targets is at risk of crashing if only one sequence is processed

The extract off-target sites utility will crash if only one FASTA sequence is provided. In that case only one intermediate file exists. The intermediate files are sorted and then merged: the sort succeeds, but the merge does not.

I do not believe there will be an issue when the input(s) are either: (1) multiple FASTA files or (2) a single multi-FASTA file.

The crash is caused by a variable being referenced before assignment.
Line 191 sits outside the while loop, which only runs when more than one sorted file exists, so mergedFile may never be assigned.

Crackling/src/crackling/utils/extractOfftargets.py

Line 191 in bb50af9

shutil.move(mergedFile.name, fpOutput)
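
A minimal sketch of one way to avoid the unassigned variable: move whichever file is left in the list, instead of relying on a name that is only bound inside the loop. The function `merge_sorted_files` and its parameters are hypothetical and simplified (the errno 24 retry logic is omitted). It is not the code in extractOfftargets.py.

```python
import heapq
import shutil
import tempfile
from contextlib import ExitStack


def merge_sorted_files(sorted_files, output_path, max_open=2):
    """Merge pre-sorted text files into `output_path` (hypothetical helper)."""
    remaining = list(sorted_files)
    if not remaining:
        raise ValueError('at least one sorted file is required')
    # Merging fewer than two files per round would never shrink the list.
    max_open = max(2, max_open)

    while len(remaining) > 1:
        batch = remaining[:max_open]
        with ExitStack() as stack:
            handles = [stack.enter_context(open(path, 'r')) for path in batch]
            # delete=False keeps the file on disk after it is closed.
            with tempfile.NamedTemporaryFile(mode='w', delete=False) as merged:
                merged.writelines(heapq.merge(*handles))
        remaining = remaining[max_open:] + [merged.name]

    # With a single input the loop never runs, so use the list, not `merged`.
    shutil.move(remaining[0], output_path)
```

With one input file the loop body is skipped and that file is moved straight to the output location. Note that this moves the input itself, which assumes the inputs are disposable intermediate files.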

installing issue

Hi Crackling developers,
Thanks for making Crackling available.
I ran into an issue while trying to install it.
The error messages are listed below:

g++ -O3 -std=c++11 -fopenmp -mpopcnt -Isrc/ISSL/include -o bin/isslScoreOfftargets src/ISSL/isslScoreOfftargets.cpp
src/ISSL/isslScoreOfftargets.cpp: In function ‘int main(int, char**)’:
src/ISSL/isslScoreOfftargets.cpp:193:14: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
  193 |         fread(&mask, sizeof(uint64_t), 1, fp);
      |         ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/ISSL/isslScoreOfftargets.cpp:194:14: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
  194 |         fread(&score, sizeof(double), 1, fp);
      |         ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/bin/ld: cannot open output file bin/isslScoreOfftargets: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [Makefile:13: isslScoreOfftargets] Error 1

Can you please help me figure it out?

Thanks a lot in advance.

Best,
Huanle

Does not compile

Dear authors,

g++ -o search_ots_score search_ots_score.cpp -O3 -std=c++11 -fopenmp -mpopcnt

fails with

   34 | #include <phmap.h>
      |          ^~~~~~~~~
compilation terminated.

despite phmap.h being present.
What am I doing wrong?

Formatting of the stdout text

Some of the stdout log is inconsistently formatted or incorrect. See the examples below.

In several places large numbers are displayed with commas as thousands separators, while in other places no commas are used.

>>> 2022-02-17 16:14:29:442676: Done.
>>> 2022-02-17 16:14:29:442723: 2500000 guides evaluated.
>>> 2022-02-17 16:14:29:442841: This batch ran in 01 00:40:21 (dd hh:mm:ss) or 2421.1502072811127 seconds
>>> 2022-02-17 16:14:29:442913: Processing batch file 2 of 7
>>> 2022-02-17 16:14:35:888107:         Loaded 2,500,000 guides
>>> 2022-02-17 16:14:35:888266: CHOPCHOP - remove those without G in position 20.
>>> 2022-02-17 16:15:13:670180:         773,162 of 1,043,323 failed here.
...
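
For the separator inconsistency, Python's format specification already supports a comma option, so every count could go through the same formatting path. The sketch below uses a hypothetical helper name, `format_count`, and the numbers are taken from the log lines above.

```python
def format_count(value):
    """Render an integer with commas as thousands separators."""
    return f'{value:,}'


# Counts that currently appear without separators in the log.
print(f'{format_count(2500000)} guides evaluated.')  # 2,500,000 guides evaluated.
print(f'{format_count(426284)} accepted.')           # 426,284 accepted.
```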

In the run that produced this output, the batch size was set to 2.5 million guides, yet the second line below says each page will contain 5 million guides.
The page size should not be larger than the batch size.
I believe pageSize = min(batchSize, maxPageSize, actualPageSize) would be more accurate.
Also, notice the lack of commas again.


>>> 2022-02-17 16:17:10:869049: mm10db - check secondary structure.
>>> 2022-02-17 16:17:49:964502:         Processing page 1 (5,000,000 per page).
>>> 2022-02-17 16:17:49:964671:                 Constructing the RNAfold input file.
>>> 2022-02-17 16:17:50:481058:                 678,423 guides in this page.
...
>>> 2022-02-17 16:24:15:592360: Calculating mm10db final result.
>>> 2022-02-17 16:24:18:294110:         426284 accepted.
>>> 2022-02-17 16:24:18:294204:         2073716 failed.
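
The suggested page size calculation could look like the sketch below. The function name `effective_page_size` and its parameter names are hypothetical stand-ins for the batchSize, maxPageSize and actualPageSize values mentioned above. How Crackling stores these values is an assumption.

```python
def effective_page_size(batch_size, max_page_size, guides_in_page):
    """Return the page size to report: never more than any of the three limits."""
    return min(batch_size, max_page_size, guides_in_page)


# Values from the run above: 2.5 million per batch, 5 million maximum per page,
# and 678,423 guides actually placed in the page.
size = effective_page_size(2_500_000, 5_000_000, 678_423)
print(f'Processing page 1 ({size:,} per page).')  # Processing page 1 (678,423 per page).
```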

[Windows] Extract off-targets: opened temporary file cannot be moved

When running the extract off-targets script on Windows, an OS error may be thrown when trying to move the temporary file.

The temporary file is never closed, so Windows refuses to move it.

The error occurs here:

Crackling/src/crackling/utils/extractOfftargets.py

Line 191 in bb50af9

shutil.move(mergedFile.name, fpOutput)

According to the Python docs, this behavior is expected:

tempfile.NamedTemporaryFile(mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, delete=True, *, errors=None)

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows). If delete is true (the default), the file is deleted as soon as it is closed. The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.

One way to fix this is by closing the temporary file inside the while loop:

Crackling/src/crackling/utils/extractOfftargets.py

Lines 161 to 191 in bb50af9

    
    while len(sortedFiles) > 1:
        # A file to write the merged sequences to
        mergedFile = tempfile.NamedTemporaryFile(delete = False)

        # Select the files to merge
        while True:
            try:
                sortedFilesPointers = [open(file, 'r') for file in sortedFiles[:maxNumOpenFiles]]
                break
            except OSError as e:
                if e.errno == 24:
                    printer(f'Attempted to open too many files at once (OSError errno 24)')
                    maxNumOpenFiles = max(1, int(maxNumOpenFiles / 2))
                    printer(f'Reducing the number of files that can be opened by half to {maxNumOpenFiles}')
                    continue
                raise e

        printer(f'Merging {len(sortedFilesPointers):,}')

        # Merge and write
        with open(mergedFile.name, 'w') as f:
            f.writelines(heapq.merge(*sortedFilesPointers))

        # Close all of the open files
        for file in sortedFilesPointers:
            file.close()

        # prepare for the next set to be merged
        sortedFiles = sortedFiles[maxNumOpenFiles:] + [mergedFile.name]

    shutil.move(mergedFile.name, fpOutput)

mergedFile.close()
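
As an isolated illustration of the same idea, the sketch below writes to a named temporary file, closes it explicitly, and only then moves it. The helper `write_and_move` is hypothetical and not part of Crackling. It assumes the file is created with delete=False, as in the excerpt above, so that closing it does not remove it.

```python
import shutil
import tempfile


def write_and_move(lines, destination):
    """Write `lines` to a temporary file, close it, then move it into place."""
    tmp = tempfile.NamedTemporaryFile(mode='w', delete=False)
    try:
        tmp.writelines(lines)
    finally:
        # Release the handle first: Windows will not move a file that is still open.
        tmp.close()
    shutil.move(tmp.name, destination)
    return destination
```

Closing in a finally block means the handle is released even if the write fails, so a later cleanup step is not blocked by a lingering open file.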

