Too many files open error about peaks2utr (closed)

abs-yy commented on June 15, 2024
Too many files open error

from peaks2utr.

Comments (14)

lldelisle commented on June 15, 2024

Hi,
I got the same error.

INFO     Iterating over peaks to annotate 3' UTRs.:   0%|▏                                    | 589/118256 [00:03<08:41, 225.69it/s]
Process Process-3:
Traceback (most recent call last):
  File "/ssoft/spack/syrah/v1/opt/spack/linux-rhel8-icelake/gcc-11.3.0/python-3.10.4-mcloxmxhlovfwwkxzyafm35bq545ntqx/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/ssoft/spack/syrah/v1/opt/spack/linux-rhel8-icelake/gcc-11.3.0/python-3.10.4-mcloxmxhlovfwwkxzyafm35bq545ntqx/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/peaks2utr/annotations.py", line 52, in _iter_peaks
    annotate_utr_for_peak(db, queue, peak, args.max_distance, args.override_utr, args.extend_utr, args.five_prime_ext)
  File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/peaks2utr/annotations.py", line 82, in annotate_utr_for_peak
    bt = pybedtools.BedTool(cached(k + "_coverage_gaps.bed")).filter(lambda x: x.chrom == peak.chr)
  File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/bedtool.py", line 529, in __init__
    self._isbam = isBAM(fn)
  File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/helpers.py", line 214, in isBAM
    if isBGZIP(fn) and (in_.read(4).decode() == "BAM\x01"):
  File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/helpers.py", line 182, in isBGZIP
    with open(fn, "rb") as fh:
OSError: [Errno 24] Too many open files: '/scratch/ldelisle/new_gtf_mouse/mm10/.cache/forward_coverage_gaps.bed'
INFO     Iterating over peaks to annotate 3' UTRs.:   1%|▎                                   | 1008/118256 [00:03<07:01, 278.01it/s]

lldelisle commented on June 15, 2024

I found the cause:
https://github.com/daler/pybedtools/blob/be6e8f91a367337775a5b8d2327128d456b54ba1/docs/source/FAQs.rst#too-many-files-open-error
The trick is to increase the number of processes with the -p option so that each process opens fewer files than the limit reported by ulimit -n. For example, in my case I had 118256 peaks (you can read this off your progress bar), and with -p 200 I managed to stay under the limit of 1024 open files.
Of course, this is not ideal, because you end up with 200 processes even though you don't have 200 CPUs, but at least you do not hit the limit.
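
To make the arithmetic behind this workaround concrete, here is a rough sketch. It assumes one leaked file handle per peak handled by a process and ignores descriptors the process already holds, so the result is only an estimate. The helper names are hypothetical and not part of peaks2utr; the batch size follows the math.ceil(total_peaks/args.processors) expression visible in a traceback later in this thread.

```python
import math


def peaks_per_process(total_peaks: int, processes: int) -> int:
    """Batch size when peaks are split evenly across worker processes."""
    return math.ceil(total_peaks / processes)


def min_processes(total_peaks: int, open_file_limit: int) -> int:
    """Smallest process count whose batch size stays below the limit.

    Assumes (hypothetically) one leaked file handle per peak and ignores
    descriptors the process already holds, so treat it as a rough estimate.
    """
    return math.ceil(total_peaks / (open_file_limit - 1))


def current_soft_limit() -> int:
    """Soft open-file limit of this process (what `ulimit -n` reports on Unix)."""
    import resource  # Unix-only standard library module

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


print(peaks_per_process(118256, 200))  # 592
print(min_processes(118256, 1024))     # 116
```

With 118256 peaks and a limit of 1024, -p 200 gives 592 peaks per process, which is under the limit; by this estimate anything from roughly -p 116 upwards would do.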

haessar commented on June 15, 2024

I'm just testing a simple fix for this. There was no reason for every parallel process to open its own BedTool object when only two are needed in total (one each for the forward and reverse strands). I'm therefore opening them upstream and passing them down to the annotate_utr_for_peak function (which will hopefully also give a minor improvement in runtime).

haessar commented on June 15, 2024

@lldelisle @abs-yy Please could you try again with the latest release, v0.5.1?

abs-yy commented on June 15, 2024

Hi, I installed peaks2utr v0.5.1 from scratch and tried it on one of my BAM files, but it failed with the same "Too many open files" error. There is also a missing index error; could that be relevant?

% peaks2utr --version
2023-04-19 09:19:08,577 - INFO - Make .cache directory
peaks2utr 0.5.1


% peaks2utr -p 12 genes.gff3  mapped.bam 
2023-04-19 09:13:23,050 - INFO - Make .cache directory
2023-04-19 09:13:26,167 - INFO - Splitting forward strand from mapped.bam.
2023-04-19 09:13:38,862 - INFO - Finished splitting forward strand.
2023-04-19 09:13:38,863 - INFO - Splitting reverse strand from mapped.bam.
2023-04-19 09:13:53,116 - INFO - Finished splitting reverse strand.
2023-04-19 09:13:53,864 - INFO - Merging SPAT outputs.
2023-04-19 09:13:53,938 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] [E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/peaks2utr/.cache/mapped.forward.bam'
Could not retrieve index file for '/prm/scratch/home/abs/peaks2utr/.cache/mapped.reverse.bam'
2023-04-19 09:14:07,293 - INFO - Creating gff db.
2023-04-19 09:14:07,294 - INFO - Calling peaks for forward strand with MACS3.
2023-04-19 09:14:07,521 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-19 09:14:09,533 - INFO - Populating features
2023-04-19 09:14:09,533 - INFO - Populating features
2023-04-19 09:14:33,507 - INFO - Finished calling forward strand peaks.
2023-04-19 09:14:34,175 - INFO - Finished calling reverse strand peaks.
2023-04-19 09:14:40,597 - INFO - Populating features table and first-order relations: 311769 features
2023-04-19 09:14:40,597 - INFO - Populating features table and first-order relations: 311769 features
2023-04-19 09:14:40,597 - INFO - Updating relations
2023-04-19 09:14:40,597 - INFO - Updating relations
2023-04-19 09:14:46,731 - INFO - Creating relations(parent) index
2023-04-19 09:14:46,731 - INFO - Creating relations(parent) index
2023-04-19 09:14:47,008 - INFO - Creating relations(child) index
2023-04-19 09:14:47,008 - INFO - Creating relations(child) index
2023-04-19 09:14:47,326 - INFO - Creating features(featuretype) index
2023-04-19 09:14:47,326 - INFO - Creating features(featuretype) index
2023-04-19 09:14:47,526 - INFO - Creating features (seqid, start, end) index
2023-04-19 09:14:47,526 - INFO - Creating features (seqid, start, end) index
2023-04-19 09:14:47,738 - INFO - Creating features (seqid, start, end, strand) index
2023-04-19 09:14:47,738 - INFO - Creating features (seqid, start, end, strand) index
2023-04-19 09:14:48,009 - INFO - Running ANALYZE features
2023-04-19 09:14:48,009 - INFO - Running ANALYZE features
2023-04-19 09:14:49,172 - INFO - Finished creating gff db.
INFO     Iterating over peaks to annotate 3' UTRs.:  58%|██████████████████████████████████████                            | 10877/18845 [03:20<02:36, 50.88it/s]Process Process-3:
Traceback (most recent call last):
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 52, in _iter_peaks
    annotate_utr_for_peak(db, queue, peak, spat_pileups.get(peak.strand), bedtools.get(peak.strand), args.max_distance, args.override_utr, args.extend_utr, args.five_prime_ext)
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 84, in annotate_utr_for_peak
    bt = bedtool.filter(lambda x: x.chrom == peak.chr)
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 968, in filter
    return BedTool((f for f in self if func(f, *args, **kwargs)))
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 1197, in __iter__
    if isGZIP(self.fn):
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/helpers.py", line 171, in isGZIP
    with open(fn, "rb") as f:
OSError: [Errno 24] Too many open files: '/prm/scratch/home/abs/peaks2utr/.cache/forward_coverage_gaps.bed'

haessar commented on June 15, 2024
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 84, in annotate_utr_for_peak
    bt = bedtool.filter(lambda x: x.chrom == peak.chr)
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 968, in filter
    return BedTool((f for f in self if func(f, *args, **kwargs)))

It looks like pybedtools internally returns a new BedTool on each filter query, which is why the problem persists. I'll attempt to implement @lldelisle's suggestion.

haessar commented on June 15, 2024

@abs-yy Could you attach your .log/peaks2utr_debug.log file here so I can take a look?

haessar commented on June 15, 2024

I've completely removed the dependence on pybedtools within the annotate_utr_for_peak function and built a bespoke in-memory class to handle filtering. A knock-on effect is that iterating over peaks is now substantially faster.
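
For readers wondering what an in-memory replacement for the per-peak filter could look like, here is a minimal sketch. It is not the actual peaks2utr class: the names Interval, InMemoryIntervals and for_chrom are hypothetical, and it assumes a plain whitespace-delimited BED file with chrom, start and end in the first three columns. The idea is to read the file once and answer every per-chromosome query from a dictionary, so no file is opened per peak.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


class InMemoryIntervals:
    """Hypothetical stand-in: parse BED lines once, then filter from memory."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._by_chrom: Dict[str, List[Interval]] = defaultdict(list)
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            chrom, start, end = line.split()[:3]
            self._by_chrom[chrom].append(Interval(chrom, int(start), int(end)))

    @classmethod
    def from_file(cls, path: str) -> "InMemoryIntervals":
        # The file is opened exactly once and closed before returning.
        with open(path) as fh:
            return cls(fh)

    def for_chrom(self, chrom: str) -> List[Interval]:
        """Return a copy of the intervals on one chromosome (empty if none)."""
        return list(self._by_chrom.get(chrom, []))


bed_lines = ["chr1\t0\t100", "chr2\t50\t80", "chr1\t200\t300"]
gaps = InMemoryIntervals(bed_lines)
chr1_gaps = gaps.for_chrom("chr1")
```

A dictionary lookup per peak also avoids rescanning the whole file for each query, which is in line with the speed-up reported here.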

haessar commented on June 15, 2024

@abs-yy Can you try again with v0.5.2?

abs-yy commented on June 15, 2024

Hi @haessar, here's a test with v0.5.2.

% peaks2utr --version
2023-04-20 13:33:37,143 - INFO - Make .cache directory
peaks2utr 0.5.2

% peaks2utr -p 4 ./genes.gff3  ./reads.bam 
2023-04-20 13:31:58,637 - INFO - Make .cache directory
2023-04-20 13:31:58,896 - INFO - Splitting forward strand from ./reads.bam.
2023-04-20 13:32:00,938 - INFO - Finished splitting forward strand.
2023-04-20 13:32:00,939 - INFO - Splitting reverse strand from ./reads.bam.
2023-04-20 13:32:03,030 - INFO - Finished splitting reverse strand.
2023-04-20 13:32:03,040 - INFO - Merging SPAT outputs.
2023-04-20 13:32:03,042 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.forward.bam'
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.reverse.bam'
2023-04-20 13:32:10,550 - INFO - Creating gff db.
2023-04-20 13:32:10,552 - INFO - Calling peaks for forward strand with MACS3.
2023-04-20 13:32:10,559 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-20 13:32:10,574 - INFO - Populating features
2023-04-20 13:32:10,574 - INFO - Populating features
2023-04-20 13:32:14,653 - INFO - Finished calling reverse strand peaks.
2023-04-20 13:32:14,843 - INFO - Finished calling forward strand peaks.
2023-04-20 13:32:35,795 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:32:35,795 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:32:35,796 - INFO - Updating relations
2023-04-20 13:32:35,796 - INFO - Updating relations
2023-04-20 13:32:40,527 - INFO - Creating relations(parent) index
2023-04-20 13:32:40,527 - INFO - Creating relations(parent) index
2023-04-20 13:32:40,768 - INFO - Creating relations(child) index
2023-04-20 13:32:40,768 - INFO - Creating relations(child) index
2023-04-20 13:32:41,059 - INFO - Creating features(featuretype) index
2023-04-20 13:32:41,059 - INFO - Creating features(featuretype) index
2023-04-20 13:32:41,279 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:32:41,279 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:32:41,507 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:32:41,507 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:32:41,823 - INFO - Running ANALYZE features
2023-04-20 13:32:41,823 - INFO - Running ANALYZE features
2023-04-20 13:32:42,198 - INFO - Finished creating gff db.
2023-04-20 13:32:42,494 - INFO - Clearing cache.
Traceback (most recent call last):
  File "/lustre/home/abs/anaconda3/envs/bio/bin/peaks2utr", line 8, in <module>
    sys.exit(main())
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 51, in main
    asyncio.run(_main())
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 122, in _main
    processes = [batch_annotate_strand(db, batch, queue, args) for batch in iter_batches(peaks, math.ceil(total_peaks/args.processors))]
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 122, in <listcomp>
    processes = [batch_annotate_strand(db, batch, queue, args) for batch in iter_batches(peaks, math.ceil(total_peaks/args.processors))]
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 61, in batch_annotate_strand
    truncation_points[symbol] = models.SPATTruncationPoints(cached(strand + "_unmapped.json"))
  File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/models.py", line 168, in __init__
    self.data.update(json.load(f))
TypeError: 'NoneType' object is not iterable

The log file doesn't say much:

cat .log/peaks2utr_debug.log 
2023-04-20 13:34:32,197 - INFO - Splitting forward strand from ./reads.bam.
2023-04-20 13:34:34,166 - INFO - Finished splitting forward strand.
2023-04-20 13:34:34,167 - INFO - Splitting reverse strand from ./reads.bam.
2023-04-20 13:34:36,059 - INFO - Finished splitting reverse strand.
2023-04-20 13:34:36,067 - INFO - Merging SPAT outputs.
2023-04-20 13:34:36,069 - INFO - Filtering intervals with zero coverage.
2023-04-20 13:34:43,305 - INFO - Creating gff db.
2023-04-20 13:34:43,308 - INFO - Calling peaks for forward strand with MACS3.
2023-04-20 13:34:43,315 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-20 13:34:43,335 - INFO - Populating features
2023-04-20 13:34:47,637 - INFO - Finished calling reverse strand peaks.
2023-04-20 13:34:47,847 - INFO - Finished calling forward strand peaks.
2023-04-20 13:35:09,938 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:35:09,939 - INFO - Updating relations
2023-04-20 13:35:14,670 - INFO - Creating relations(parent) index
2023-04-20 13:35:14,917 - INFO - Creating relations(child) index
2023-04-20 13:35:15,189 - INFO - Creating features(featuretype) index
2023-04-20 13:35:15,387 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:35:15,606 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:35:15,911 - INFO - Running ANALYZE features
2023-04-20 13:35:16,249 - INFO - Finished creating gff db.
2023-04-20 13:35:16,403 - INFO - Clearing cache.

haessar commented on June 15, 2024

Hmm, I'm surprised that happened. I've released v0.5.3, which among other things includes a patch to prevent an error being thrown at self.data.update(json.load(f)).
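
The thread does not show the actual patch, so the following is only a guess at the failure mode and one possible guard. The TypeError in the traceback above is what dict.update raises when json.load returns None, which happens if the cached JSON file contains null. The class name TruncationPoints is a hypothetical stand-in, not the real peaks2utr model.

```python
import io
import json


class TruncationPoints:
    """Hypothetical stand-in for a class that loads a JSON mapping from a file."""

    def __init__(self, fh) -> None:
        self.data = {}
        loaded = json.load(fh)
        # json.load returns None for a file containing `null`;
        # dict.update(None) raises TypeError, so fall back to an empty dict.
        self.data.update(loaded or {})


empty = TruncationPoints(io.StringIO("null"))
filled = TruncationPoints(io.StringIO('{"chr1": [10, 20]}'))
```

With this guard a null file yields an empty mapping instead of a crash, and a normal JSON object loads as before.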

lldelisle commented on June 15, 2024

I got the same error message when installing the master branch and testing on the demo gff and bam.

abs-yy commented on June 15, 2024

That did the trick! And the 0.5.2 update really made the iterations blazing fast.
I also tried it on a larger BAM file and the whole run went smoothly.

Thanks for the help!

% peaks2utr --version
2023-04-21 10:29:41,053 - INFO - Make .cache directory
peaks2utr 0.5.3

% peaks2utr -p 16 genes.gff3  reads.bam
2023-04-21 10:26:40,648 - INFO - Make .cache directory
2023-04-21 10:26:40,810 - INFO - Splitting forward strand from reads.bam.
2023-04-21 10:26:45,925 - INFO - Finished splitting forward strand.
2023-04-21 10:26:45,926 - INFO - Splitting reverse strand from reads.bam.
2023-04-21 10:26:50,665 - INFO - Finished splitting reverse strand.
2023-04-21 10:26:50,668 - INFO - Merging SPAT outputs.
2023-04-21 10:26:50,669 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.reverse.bam'
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.forward.bam'
2023-04-21 10:27:03,584 - INFO - Creating gff db.
2023-04-21 10:27:03,585 - INFO - Calling peaks for forward strand with MACS3.
2023-04-21 10:27:03,599 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-21 10:27:03,601 - INFO - Populating features
2023-04-21 10:27:03,601 - INFO - Populating features
2023-04-21 10:27:14,174 - INFO - Finished calling reverse strand peaks.
2023-04-21 10:27:14,498 - INFO - Finished calling forward strand peaks.
2023-04-21 10:27:33,698 - INFO - Populating features table and first-order relations: 311769 features
2023-04-21 10:27:33,698 - INFO - Populating features table and first-order relations: 311769 features
2023-04-21 10:27:33,699 - INFO - Updating relations
2023-04-21 10:27:33,699 - INFO - Updating relations
2023-04-21 10:27:37,819 - INFO - Creating relations(parent) index
2023-04-21 10:27:37,819 - INFO - Creating relations(parent) index
2023-04-21 10:27:38,032 - INFO - Creating relations(child) index
2023-04-21 10:27:38,032 - INFO - Creating relations(child) index
2023-04-21 10:27:38,291 - INFO - Creating features(featuretype) index
2023-04-21 10:27:38,291 - INFO - Creating features(featuretype) index
2023-04-21 10:27:38,480 - INFO - Creating features (seqid, start, end) index
2023-04-21 10:27:38,480 - INFO - Creating features (seqid, start, end) index
2023-04-21 10:27:38,688 - INFO - Creating features (seqid, start, end, strand) index
2023-04-21 10:27:38,688 - INFO - Creating features (seqid, start, end, strand) index
2023-04-21 10:27:38,934 - INFO - Running ANALYZE features
2023-04-21 10:27:38,934 - INFO - Running ANALYZE features
2023-04-21 10:27:39,190 - INFO - Finished creating gff db.
INFO     Iterating over peaks to annotate 3' UTRs.: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18845/18845 [00:51<00:00, 369.43it/s]
2023-04-21 10:28:34,539 - INFO - Merging annotations with canonical gff file.
2023-04-21 10:28:52,723 - INFO - Successfully merged three_prime_UTRs into canonical gff: genes.new.gff.
2023-04-21 10:28:52,731 - INFO - Writing summary statistics file.
2023-04-21 10:28:56,313 - INFO - peaks2utr finished successfully.
2023-04-21 10:28:57,314 - INFO - Clearing cache.

% cat summary_stats.txt 
Total peaks: 18845
Total 3' UTRs annotated: 1887
Peaks with no nearby features: 14997 (79%)
Peaks corresponding to an already annotated 3' UTR: 0 (0%)
Peaks contained within a feature: 1552 (8%)
Peaks corresponding to 5'-end of a feature: 365 (1%)


lldelisle commented on June 15, 2024

Sorry, the TypeError: 'NoneType' object is not iterable has been fixed for me too (I forgot to pull the latest commits...).
