Comments (14)
Hi,
I got the same error.
INFO Iterating over peaks to annotate 3' UTRs.: 0%|▏ | 589/118256 [00:03<08:41, 225.69it/s]
Process Process-3:
Traceback (most recent call last):
File "/ssoft/spack/syrah/v1/opt/spack/linux-rhel8-icelake/gcc-11.3.0/python-3.10.4-mcloxmxhlovfwwkxzyafm35bq545ntqx/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/ssoft/spack/syrah/v1/opt/spack/linux-rhel8-icelake/gcc-11.3.0/python-3.10.4-mcloxmxhlovfwwkxzyafm35bq545ntqx/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/peaks2utr/annotations.py", line 52, in _iter_peaks
annotate_utr_for_peak(db, queue, peak, args.max_distance, args.override_utr, args.extend_utr, args.five_prime_ext)
File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/peaks2utr/annotations.py", line 82, in annotate_utr_for_peak
bt = pybedtools.BedTool(cached(k + "_coverage_gaps.bed")).filter(lambda x: x.chrom == peak.chr)
File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/bedtool.py", line 529, in __init__
self._isbam = isBAM(fn)
File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/helpers.py", line 214, in isBAM
if isBGZIP(fn) and (in_.read(4).decode() == "BAM\x01"):
File "/scratch/ldelisle/new_gtf_mouse/.venv_peaks2utr/lib/python3.10/site-packages/pybedtools/helpers.py", line 182, in isBGZIP
with open(fn, "rb") as fh:
OSError: [Errno 24] Too many open files: '/scratch/ldelisle/new_gtf_mouse/mm10/.cache/forward_coverage_gaps.bed'
INFO Iterating over peaks to annotate 3' UTRs.: 1%|▎ | 1008/118256 [00:03<07:01, 278.01it/s]
from peaks2utr.
I found the cause:
https://github.com/daler/pybedtools/blob/be6e8f91a367337775a5b8d2327128d456b54ba1/docs/source/FAQs.rst#too-many-files-open-error
The workaround is to increase the number of processes with the -p option, so that each process opens fewer files than the limit reported by ulimit -n.
For example, in my case I had 118256 peaks (you can read the total from your progress bar), so with -p 200 each process handles about 592 peaks and I do not hit the limit of 1024 open files.
Of course, this is not ideal, because you then run 200 processes without having 200 CPUs, but at least you do not hit the limit.
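The arithmetic behind this workaround can be sketched as follows. This is only a rough estimate: it assumes that peaks are split evenly across processes and that each process leaks about one file descriptor per peak, which is how the workaround above is reasoned. The function names are hypothetical and are not part of peaks2utr.

```python
import math


def peaks_per_process(total_peaks: int, processes: int) -> int:
    """Peaks handled by each process when the total is split evenly."""
    return math.ceil(total_peaks / processes)


def smallest_safe_process_count(total_peaks: int, open_file_limit: int) -> int:
    """Smallest -p value keeping peaks per process strictly below the limit.

    Assumption: roughly one leaked file descriptor per peak per process.
    """
    if open_file_limit < 2:
        raise ValueError("open_file_limit must be at least 2")
    # ceil(total / p) < limit  is equivalent to  p >= total / (limit - 1)
    return max(1, math.ceil(total_peaks / (open_file_limit - 1)))


if __name__ == "__main__":
    print(peaks_per_process(118256, 200))             # 592
    print(smallest_safe_process_count(118256, 1024))  # 116
```

In practice each process also holds a few other descriptors (standard streams, the database file, and so on), so it is safer to pick a value of -p somewhat above the computed minimum. On Unix, resource.getrlimit(resource.RLIMIT_NOFILE) returns the soft and hard limits that ulimit -n reports.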
from peaks2utr.
I'm just testing a simple fix for this. There was no reason for every parallel process to open its own BedTool object when only two BedTool objects need to be read from in total (one each for the forward and reverse strands). Therefore I'm opening them upstream and passing them down to the annotate_utr_for_peak function (hopefully this will bring a minor improvement to runtimes too).
from peaks2utr.
@lldelisle @abs-yy Could you please try again with the latest release, v0.5.1?
from peaks2utr.
Hi, I installed peaks2utr v0.5.1 from scratch and tried it on one of my BAM files, but it failed with the same "Too many open files" error. There is also a missing index error; could that be relevant?
% peaks2utr --version
2023-04-19 09:19:08,577 - INFO - Make .cache directory
peaks2utr 0.5.1
% peaks2utr -p 12 genes.gff3 mapped.bam
2023-04-19 09:13:23,050 - INFO - Make .cache directory
2023-04-19 09:13:26,167 - INFO - Splitting forward strand from mapped.bam.
2023-04-19 09:13:38,862 - INFO - Finished splitting forward strand.
2023-04-19 09:13:38,863 - INFO - Splitting reverse strand from mapped.bam.
2023-04-19 09:13:53,116 - INFO - Finished splitting reverse strand.
2023-04-19 09:13:53,864 - INFO - Merging SPAT outputs.
2023-04-19 09:13:53,938 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] [E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/peaks2utr/.cache/mapped.forward.bam'
Could not retrieve index file for '/prm/scratch/home/abs/peaks2utr/.cache/mapped.reverse.bam'
2023-04-19 09:14:07,293 - INFO - Creating gff db.
2023-04-19 09:14:07,294 - INFO - Calling peaks for forward strand with MACS3.
2023-04-19 09:14:07,521 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-19 09:14:09,533 - INFO - Populating features
2023-04-19 09:14:09,533 - INFO - Populating features
2023-04-19 09:14:33,507 - INFO - Finished calling forward strand peaks.
2023-04-19 09:14:34,175 - INFO - Finished calling reverse strand peaks.
2023-04-19 09:14:40,597 - INFO - Populating features table and first-order relations: 311769 features
2023-04-19 09:14:40,597 - INFO - Populating features table and first-order relations: 311769 features
2023-04-19 09:14:40,597 - INFO - Updating relations
2023-04-19 09:14:40,597 - INFO - Updating relations
2023-04-19 09:14:46,731 - INFO - Creating relations(parent) index
2023-04-19 09:14:46,731 - INFO - Creating relations(parent) index
2023-04-19 09:14:47,008 - INFO - Creating relations(child) index
2023-04-19 09:14:47,008 - INFO - Creating relations(child) index
2023-04-19 09:14:47,326 - INFO - Creating features(featuretype) index
2023-04-19 09:14:47,326 - INFO - Creating features(featuretype) index
2023-04-19 09:14:47,526 - INFO - Creating features (seqid, start, end) index
2023-04-19 09:14:47,526 - INFO - Creating features (seqid, start, end) index
2023-04-19 09:14:47,738 - INFO - Creating features (seqid, start, end, strand) index
2023-04-19 09:14:47,738 - INFO - Creating features (seqid, start, end, strand) index
2023-04-19 09:14:48,009 - INFO - Running ANALYZE features
2023-04-19 09:14:48,009 - INFO - Running ANALYZE features
2023-04-19 09:14:49,172 - INFO - Finished creating gff db.
INFO Iterating over peaks to annotate 3' UTRs.: 58%|██████████████████████████████████████ | 10877/18845 [03:20<02:36, 50.88it/s]Process Process-3:
Traceback (most recent call last):
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 52, in _iter_peaks
annotate_utr_for_peak(db, queue, peak, spat_pileups.get(peak.strand), bedtools.get(peak.strand), args.max_distance, args.override_utr, args.extend_utr, args.five_prime_ext)
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 84, in annotate_utr_for_peak
bt = bedtool.filter(lambda x: x.chrom == peak.chr)
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 968, in filter
return BedTool((f for f in self if func(f, *args, **kwargs)))
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 1197, in __iter__
if isGZIP(self.fn):
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/helpers.py", line 171, in isGZIP
with open(fn, "rb") as f:
OSError: [Errno 24] Too many open files: '/prm/scratch/home/abs/peaks2utr/.cache/forward_coverage_gaps.bed'
from peaks2utr.
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 84, in annotate_utr_for_peak
bt = bedtool.filter(lambda x: x.chrom == peak.chr)
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/pybedtools/bedtool.py", line 968, in filter
return BedTool((f for f in self if func(f, *args, **kwargs)))
It looks like pybedtools internally returns a new BedTool on each filter query, which is why this problem persists. I'll attempt to implement @lldelisle's suggestion.
from peaks2utr.
@abs-yy Could you attach your .log/peaks2utr_debug.log file here so I can take a look?
from peaks2utr.
I've completely removed the dependence on pybedtools within the annotate_utr_for_peak function and built a bespoke in-memory class to handle filtering. A knock-on effect is that iterating over peaks is now substantially faster.
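For readers wondering what such an in-memory approach might look like, here is a minimal sketch. It is not the class that peaks2utr actually uses: the names Interval, InMemoryIntervals, from_bed and filter_chrom are hypothetical. The idea is to read the BED file once, group intervals by chromosome, and answer every per-peak filter from memory, so that no file is reopened per peak.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


class InMemoryIntervals:
    """Hypothetical in-memory interval store (not peaks2utr's own class)."""

    def __init__(self, intervals: Iterable[Interval]) -> None:
        # Group intervals by chromosome once, up front.
        self._by_chrom = defaultdict(list)
        for interval in intervals:
            self._by_chrom[interval.chrom].append(interval)

    @classmethod
    def from_bed(cls, path: str) -> "InMemoryIntervals":
        """Read a BED file a single time; the file is closed on return."""
        intervals = []
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                # Skip blank, comment, and track/browser header lines.
                if (len(fields) < 3 or line.startswith("#")
                        or fields[0] in ("track", "browser")):
                    continue
                intervals.append(
                    Interval(fields[0], int(fields[1]), int(fields[2]))
                )
        return cls(intervals)

    def filter_chrom(self, chrom: str) -> List[Interval]:
        """Return the intervals on one chromosome without touching the disk."""
        return list(self._by_chrom.get(chrom, []))


if __name__ == "__main__":
    store = InMemoryIntervals(
        [Interval("chr1", 0, 100), Interval("chr2", 5, 50)]
    )
    print(store.filter_chrom("chr1"))
```

Because the lookup is a dictionary access rather than a scan of the whole file, a design along these lines would also be consistent with the speed-up reported above, although the real implementation may differ.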
from peaks2utr.
@abs-yy can you try again with v0.5.2 ?
from peaks2utr.
Hi @haessar, here's a test with v0.5.2.
% peaks2utr --version
2023-04-20 13:33:37,143 - INFO - Make .cache directory
peaks2utr 0.5.2
% peaks2utr -p 4 ./genes.gff3 ./reads.bam
2023-04-20 13:31:58,637 - INFO - Make .cache directory
2023-04-20 13:31:58,896 - INFO - Splitting forward strand from ./reads.bam.
2023-04-20 13:32:00,938 - INFO - Finished splitting forward strand.
2023-04-20 13:32:00,939 - INFO - Splitting reverse strand from ./reads.bam.
2023-04-20 13:32:03,030 - INFO - Finished splitting reverse strand.
2023-04-20 13:32:03,040 - INFO - Merging SPAT outputs.
2023-04-20 13:32:03,042 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.forward.bam'
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.reverse.bam'
2023-04-20 13:32:10,550 - INFO - Creating gff db.
2023-04-20 13:32:10,552 - INFO - Calling peaks for forward strand with MACS3.
2023-04-20 13:32:10,559 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-20 13:32:10,574 - INFO - Populating features
2023-04-20 13:32:10,574 - INFO - Populating features
2023-04-20 13:32:14,653 - INFO - Finished calling reverse strand peaks.
2023-04-20 13:32:14,843 - INFO - Finished calling forward strand peaks.
2023-04-20 13:32:35,795 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:32:35,795 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:32:35,796 - INFO - Updating relations
2023-04-20 13:32:35,796 - INFO - Updating relations
2023-04-20 13:32:40,527 - INFO - Creating relations(parent) index
2023-04-20 13:32:40,527 - INFO - Creating relations(parent) index
2023-04-20 13:32:40,768 - INFO - Creating relations(child) index
2023-04-20 13:32:40,768 - INFO - Creating relations(child) index
2023-04-20 13:32:41,059 - INFO - Creating features(featuretype) index
2023-04-20 13:32:41,059 - INFO - Creating features(featuretype) index
2023-04-20 13:32:41,279 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:32:41,279 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:32:41,507 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:32:41,507 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:32:41,823 - INFO - Running ANALYZE features
2023-04-20 13:32:41,823 - INFO - Running ANALYZE features
2023-04-20 13:32:42,198 - INFO - Finished creating gff db.
2023-04-20 13:32:42,494 - INFO - Clearing cache.
Traceback (most recent call last):
File "/lustre/home/abs/anaconda3/envs/bio/bin/peaks2utr", line 8, in <module>
sys.exit(main())
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 51, in main
asyncio.run(_main())
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 122, in _main
processes = [batch_annotate_strand(db, batch, queue, args) for batch in iter_batches(peaks, math.ceil(total_peaks/args.processors))]
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/__init__.py", line 122, in <listcomp>
processes = [batch_annotate_strand(db, batch, queue, args) for batch in iter_batches(peaks, math.ceil(total_peaks/args.processors))]
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/annotations.py", line 61, in batch_annotate_strand
truncation_points[symbol] = models.SPATTruncationPoints(cached(strand + "_unmapped.json"))
File "/lustre/home/abs/anaconda3/envs/bio/lib/python3.9/site-packages/peaks2utr/models.py", line 168, in __init__
self.data.update(json.load(f))
TypeError: 'NoneType' object is not iterable
The log file doesn't say much:
cat .log/peaks2utr_debug.log
2023-04-20 13:34:32,197 - INFO - Splitting forward strand from ./reads.bam.
2023-04-20 13:34:34,166 - INFO - Finished splitting forward strand.
2023-04-20 13:34:34,167 - INFO - Splitting reverse strand from ./reads.bam.
2023-04-20 13:34:36,059 - INFO - Finished splitting reverse strand.
2023-04-20 13:34:36,067 - INFO - Merging SPAT outputs.
2023-04-20 13:34:36,069 - INFO - Filtering intervals with zero coverage.
2023-04-20 13:34:43,305 - INFO - Creating gff db.
2023-04-20 13:34:43,308 - INFO - Calling peaks for forward strand with MACS3.
2023-04-20 13:34:43,315 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-20 13:34:43,335 - INFO - Populating features
2023-04-20 13:34:47,637 - INFO - Finished calling reverse strand peaks.
2023-04-20 13:34:47,847 - INFO - Finished calling forward strand peaks.
2023-04-20 13:35:09,938 - INFO - Populating features table and first-order relations: 311769 features
2023-04-20 13:35:09,939 - INFO - Updating relations
2023-04-20 13:35:14,670 - INFO - Creating relations(parent) index
2023-04-20 13:35:14,917 - INFO - Creating relations(child) index
2023-04-20 13:35:15,189 - INFO - Creating features(featuretype) index
2023-04-20 13:35:15,387 - INFO - Creating features (seqid, start, end) index
2023-04-20 13:35:15,606 - INFO - Creating features (seqid, start, end, strand) index
2023-04-20 13:35:15,911 - INFO - Running ANALYZE features
2023-04-20 13:35:16,249 - INFO - Finished creating gff db.
2023-04-20 13:35:16,403 - INFO - Clearing cache.
from peaks2utr.
Hmm, I'm surprised that happened. I've released v0.5.3, which among other things includes a patch to prevent an error being thrown at self.data.update(json.load(f)).
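As background on this error: if self.data is a dict, then dict.update(None) raises exactly "TypeError: 'NoneType' object is not iterable", and json.load returns None when the file contains the JSON value null. The sketch below shows one way such a guard could look. It is an assumption for illustration only; the actual v0.5.3 patch is not shown in this thread, and the class name TruncationPointsSketch is hypothetical.

```python
import json


class TruncationPointsSketch:
    """Hypothetical loader illustrating a guard around json.load."""

    def __init__(self, path: str) -> None:
        self.data = {}
        with open(path) as f:
            loaded = json.load(f)
        # json.load returns None for a file containing "null";
        # only merge the result when it is actually a mapping.
        if isinstance(loaded, dict):
            self.data.update(loaded)
```

With this guard, a cache file that holds null simply leaves data empty instead of raising.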
from peaks2utr.
I got the same error message when installing the master branch and testing on the demo gff and bam.
from peaks2utr.
That did the trick! And the 0.5.2 update really made the iterations blazing fast.
I also tried it on a larger BAM file and the whole run went smoothly.
Thanks for the help!
% peaks2utr --version
2023-04-21 10:29:41,053 - INFO - Make .cache directory
peaks2utr 0.5.3
% peaks2utr -p 16 genes.gff3 reads.bam
2023-04-21 10:26:40,648 - INFO - Make .cache directory
2023-04-21 10:26:40,810 - INFO - Splitting forward strand from reads.bam.
2023-04-21 10:26:45,925 - INFO - Finished splitting forward strand.
2023-04-21 10:26:45,926 - INFO - Splitting reverse strand from reads.bam.
2023-04-21 10:26:50,665 - INFO - Finished splitting reverse strand.
2023-04-21 10:26:50,668 - INFO - Merging SPAT outputs.
2023-04-21 10:26:50,669 - INFO - Filtering intervals with zero coverage.
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.reverse.bam'
[E::idx_find_and_load] Could not retrieve index file for '/prm/scratch/home/abs/tmp/.cache/reads.forward.bam'
2023-04-21 10:27:03,584 - INFO - Creating gff db.
2023-04-21 10:27:03,585 - INFO - Calling peaks for forward strand with MACS3.
2023-04-21 10:27:03,599 - INFO - Calling peaks for reverse strand with MACS3.
2023-04-21 10:27:03,601 - INFO - Populating features
2023-04-21 10:27:03,601 - INFO - Populating features
2023-04-21 10:27:14,174 - INFO - Finished calling reverse strand peaks.
2023-04-21 10:27:14,498 - INFO - Finished calling forward strand peaks.
2023-04-21 10:27:33,698 - INFO - Populating features table and first-order relations: 311769 features
2023-04-21 10:27:33,698 - INFO - Populating features table and first-order relations: 311769 features
2023-04-21 10:27:33,699 - INFO - Updating relations
2023-04-21 10:27:33,699 - INFO - Updating relations
2023-04-21 10:27:37,819 - INFO - Creating relations(parent) index
2023-04-21 10:27:37,819 - INFO - Creating relations(parent) index
2023-04-21 10:27:38,032 - INFO - Creating relations(child) index
2023-04-21 10:27:38,032 - INFO - Creating relations(child) index
2023-04-21 10:27:38,291 - INFO - Creating features(featuretype) index
2023-04-21 10:27:38,291 - INFO - Creating features(featuretype) index
2023-04-21 10:27:38,480 - INFO - Creating features (seqid, start, end) index
2023-04-21 10:27:38,480 - INFO - Creating features (seqid, start, end) index
2023-04-21 10:27:38,688 - INFO - Creating features (seqid, start, end, strand) index
2023-04-21 10:27:38,688 - INFO - Creating features (seqid, start, end, strand) index
2023-04-21 10:27:38,934 - INFO - Running ANALYZE features
2023-04-21 10:27:38,934 - INFO - Running ANALYZE features
2023-04-21 10:27:39,190 - INFO - Finished creating gff db.
INFO Iterating over peaks to annotate 3' UTRs.: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18845/18845 [00:51<00:00, 369.43it/s]
2023-04-21 10:28:34,539 - INFO - Merging annotations with canonical gff file.
2023-04-21 10:28:52,723 - INFO - Successfully merged three_prime_UTRs into canonical gff: genes.new.gff.
2023-04-21 10:28:52,731 - INFO - Writing summary statistics file.
2023-04-21 10:28:56,313 - INFO - peaks2utr finished successfully.
2023-04-21 10:28:57,314 - INFO - Clearing cache.
% cat summary_stats.txt
Total peaks: 18845
Total 3' UTRs annotated: 1887
Peaks with no nearby features: 14997 (79%)
Peaks corresponding to an already annotated 3' UTR: 0 (0%)
Peaks contained within a feature: 1552 (8%)
Peaks corresponding to 5'-end of a feature: 365 (1%)
from peaks2utr.
Sorry, the TypeError: 'NoneType' object is not iterable has been fixed for me too (I forgot to pull the latest commits...).
from peaks2utr.