That seems reasonable. Under the hood I have to iterate over the vector anyway, so testing for and skipping NaNs shouldn't be a big deal (just need to look up the C API for numpy and find how to do that).
from pybigwig.
They're written, exactly for the reason you suspected.
Following up on @rcasero's observation, I'm providing a reproducible example showing that even when only the indices of the intervals with non-NaN values are passed to bw.addEntries, the resulting gaps are filled with NaN values when the file is read back.
- Save array by creating gaps for NaN values
```python
import pyBigWig
import numpy as np

# Initialise array as fake genomic signal for one chromosome
data = np.ones(1000)
# Replace half of the array with NaNs
data[500:1000] = np.nan

# Create dictionary from array
data_dict = dict()
data_dict['chr1'] = data

def to_bigwig_create_gaps(x, filename):
    """Save dictionary to bigWig by creating gaps for NaN values"""
    with pyBigWig.open(filename, 'wb') as bw:
        # add header (list of (chromosome, length) tuples)
        bw.addHeader([(k, len(v)) for k, v in x.items()])
        for chrom in x.keys():
            valid_idx = ~np.isnan(x[chrom])
            if np.all(valid_idx):
                # no NaN values, so the chromosome can be written in one go
                bw.addEntries(chrom, 0, values=x[chrom], span=1, step=1)
            else:
                # skip NaN values by giving the location of every non-NaN
                # value and treating each one as a segment of span=1
                bw.addEntries(chrom, np.where(valid_idx)[0],
                              values=x[chrom][valid_idx], span=1)

to_bigwig_create_gaps(data_dict, "by_gaps.bigwig")
```
As pointed out by @rcasero above, the array can also be saved segment-wise, resulting in files without NaN values, but at the cost of speed.
- Saving array segment-wise
```python
import itertools

def to_bigwig_segment_wise(x, filename):
    """Save dictionary to bigWig segment-wise"""
    with pyBigWig.open(filename, 'wb') as bw:
        # add header (list of (chromosome, length) tuples)
        bw.addHeader([(k, len(v)) for k, v in x.items()])
        for chrom in x.keys():
            # first segment is at the beginning of the chromosome
            pos = 0
            # group values into segments by whether they are NaN or not
            for k, g in itertools.groupby(x[chrom], lambda a: not np.isnan(a)):
                # values in current segment
                g = np.array(list(g))
                # if this is a segment of valid numbers, write it to file
                if k:
                    bw.addEntries(chrom, pos, values=g, span=1, step=1)
                # move pointer to the beginning of next segment
                pos += len(g)

to_bigwig_segment_wise(data_dict, "segment_wise.bigwig")
```
The resulting file `by_gaps.bigwig` is 1,528 bytes in size, while `segment_wise.bigwig` is 757 bytes.
Another colleague figured out what was going on with the different file sizes. Basically, the different forms of bw.addEntries() write different types of wiggle blocks (even though they are all binary), each consisting of a certain header followed by the values. Depending on the type of data you have (the same value repeating over segments vs. lots of different values), you want to choose the form of bw.addEntries() that is more efficient for that data. In practice, this means writing a function that traverses the data, finds segments of repeated values and segments of changing values, and then chooses the bw.addEntries() form accordingly.
I think the package would benefit from this being documented, specifying what kind of wiggle block each bw.addEntries() syntax writes.
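The traversal step could be sketched with a NaN-aware run-length encoder (the helper name `find_runs` and its output layout are my own, not part of pyBigWig):

```python
import numpy as np

def find_runs(v):
    """Run-length encode a 1-D array into half-open [start, end) runs,
    treating consecutive NaNs as one run (NaN != NaN in numpy)."""
    v = np.asarray(v, dtype=float)
    # True where an element repeats the previous one
    same = (v[1:] == v[:-1]) | (np.isnan(v[1:]) & np.isnan(v[:-1]))
    starts = np.concatenate(([0], np.flatnonzero(~same) + 1))
    ends = np.concatenate((starts[1:], [len(v)]))
    return starts, ends, v[starts]

starts, ends, values = find_runs([1, 1, 1, np.nan, np.nan, 2, 3, 3])
# runs: [0,3) of 1.0, [3,5) of NaN, [5,6) of 2.0, [6,8) of 3.0
```

With the runs in hand, one could write long constant runs as single bedGraph-style intervals (`bw.addEntries([chrom], [start], ends=[end], values=[val])`) and short variable stretches as fixedStep blocks; whether mixing entry types within one file is supported is worth verifying against the pyBigWig documentation.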
On a related note, if zeros are provided to addEntries(), are they written or also left as gaps? I imagine they have to be written, because otherwise they'd be read back as nans.
Some more details. So, pyBigWig can save the whole vector of a chromosome,

```python
bw.addEntries(chrom, 0, values=x[chrom], span=1, step=1)
```

but then it saves the NaNs as `np.nan` in the file, which can then break applications like peak callers.
Alternatively, it can save only the non-NaN values together with their locations (indices),

```python
valid_idx = ~np.isnan(x[chrom])
bw.addEntries(chrom, np.where(valid_idx)[0], values=x[chrom][valid_idx], span=1)
```

which is relatively fast, but produces a larger bigWig file.
It is also possible to look for segments `g` of consecutive non-NaN values in the chromosome and write those with

```python
bw.addEntries(chrom, pos, values=g, span=1, step=1)
```

This is much slower, but the resulting file is smaller too. An example of times/file sizes:
| Method | File size | Processing time |
|---|---|---|
| Using indices of non-NaN values | 9.6 GB | 14 min 30 s (+flush time) |
| Iterating segments of non-NaN values | 5.3 GB | 1 h 8 min (+flush time) |
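Much of the segment-wise slowness comes from itertools.groupby iterating element by element in Python; the segment boundaries can instead be found vectorised (the helper name `valid_segments` is mine, not part of pyBigWig):

```python
import numpy as np

def valid_segments(v):
    """Return (start, end) pairs of maximal non-NaN stretches,
    found without a per-element Python loop."""
    valid = ~np.isnan(v)
    # pad with False so every stretch has a rising and a falling edge
    edges = np.diff(np.concatenate(([False], valid, [False])).astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts, ends))

segs = valid_segments(np.array([np.nan, 1.0, 2.0, np.nan, 3.0, np.nan]))
# segs == [(1, 3), (4, 5)]
```

Each (start, end) pair can then be written with `bw.addEntries(chrom, int(start), values=x[chrom][start:end], span=1, step=1)`, so the Python loop runs once per segment rather than once per base.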
I'm surprised there's a difference in size and that the performance difference is so large. The actual data should be the same in either case as are the zoom levels created when the file is closed. I don't fully understand what's going on there. Do you have any data handy that I could use for testing? I have some higher priority things on my plate at the moment but this is vexing me.