allenneuraldynamics / aind-data-transfer
License: MIT License
Describe the bug
The Ephys job doesn't appear to work on Windows.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The job should work on Windows
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.
Describe the bug
Due to an indentation error in the NP-opto correction, the settings files in an OpenEphys folder are retrieved 384 times. This doesn't affect the result.
To Reproduce
See the correct_np_opto_electrode_locations function: https://github.com/AllenNeuralDynamics/nd-data-transfer/blob/main/src/transfer/util/npopto_correction.py#L21-L28
Expected behavior
Settings files are retrieved only once.
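For illustration only (not the repository code), the difference is whether the settings-file lookup sits inside or outside the per-channel loop, assuming 384 channels per probe:

import zarr  # not needed here; see sketch below
from pathlib import Path

def correct_np_opto_locations(open_ephys_folder, num_channels=384):
    # Buggy pattern: the glob runs once per channel, i.e. 384 times.
    for channel in range(num_channels):
        settings_files = list(Path(open_ephys_folder).glob("**/settings*.xml"))
        # ... correct this channel's position using settings_files ...

    # Fixed pattern: retrieve the settings files once, outside the loop.
    settings_files = list(Path(open_ephys_folder).glob("**/settings*.xml"))
    for channel in range(num_channels):
        # ... correct this channel's position using settings_files ...
        pass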
The spike sorting capsule does not work out of the box on compressed data.
wavpack-numcodecs is installed
'inter_sample_shift' is not a property!
As a developer, I want to test the writing methods easily, so I can catch bugs with writes before commits.
Add any helpful notes here.
As a user, I want to be able to run a transcode job easily using a configuration file
Add any helpful notes here.
As a developer, I want a clean code base, so I can maintain it easier.
Add any helpful notes here.
Once data has landed in a cloud bucket, we need to tell CodeOcean about it.
<modality>_<subject-id>_<acq_date>_<acq_time>
ecephys
data_description.json
Example curl request that does the right thing:
curl --location --request POST 'https://codeocean.allenneuraldynamics.org/api/v1/data_assets' \
--header 'Content-Type: application/json' \
-u '{API_TOKEN}:' \
--data '{
    "name": "ecephys_625463_2022-09-28_16-34-22",
    "description": "",
    "mount": "ecephys_625463_2022-09-28_16-34-22",
    "tags": [ "ecephys" ],
    "source": {
        "aws": {
            "bucket": "{BUCKET_NAME}",
            "prefix": "ecephys_625463_2022-09-28_16-34-22",
            "keep_on_external_storage": true,
            "index_data": true,
            "access_key_id": "'"{ACCESS_KEY_ID}"'",
            "secret_access_key": "'"{SECRET_ACCESS_KEY}"'"
        }
    }
}'
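For reference, a rough Python equivalent of the curl call above using the requests library; the token, bucket, and key placeholders are values to substitute, not real credentials:

import requests

# Placeholders -- substitute real values before running.
API_TOKEN = "<codeocean-api-token>"
BUCKET_NAME = "<bucket-name>"
ACCESS_KEY_ID = "<aws-access-key-id>"
SECRET_ACCESS_KEY = "<aws-secret-access-key>"

payload = {
    "name": "ecephys_625463_2022-09-28_16-34-22",
    "description": "",
    "mount": "ecephys_625463_2022-09-28_16-34-22",
    "tags": ["ecephys"],
    "source": {
        "aws": {
            "bucket": BUCKET_NAME,
            "prefix": "ecephys_625463_2022-09-28_16-34-22",
            "keep_on_external_storage": True,
            "index_data": True,
            "access_key_id": ACCESS_KEY_ID,
            "secret_access_key": SECRET_ACCESS_KEY,
        }
    },
}

response = requests.post(
    "https://codeocean.allenneuraldynamics.org/api/v1/data_assets",
    json=payload,
    auth=(API_TOKEN, ""),  # basic auth: token as username, empty password
)
response.raise_for_status()
print(response.json())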
As a developer, I want efficient and robust unit tests, so I can have a healthy and maintainable code base.
Add any helpful notes here.
As a developer, I want to re-use code from aind-data-schema, so I can maintain the code base easier.
We might need to wait until the aind-data-schema project is published to pypi
As a user, I want to see documentation about the input parameters of a method, so I can understand how to use it easier.
As a developer, I want clean, well-documented code, so I can maintain it easier.
Originally this issue was about automating file reorganization. I've updated it to be about simply making sure that the ephys team can run this job themselves.
Acceptance Criteria:
old version below:
Right now the ephys team is manually reorganizing files from different hard drives.
Describe the bug
Currently, the NI-DAQ filter is in the read stream, and so those recordings aren't being compressed.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It will make things easier if the NI-DAQ recording blocks are also compressed
Additional context
That data doesn't need to be scaled though, so we can move the filter into the scale_read_blocks method.
Describe the bug
Currently, the configs are parsed from JSON, but some of the parameters need to be mapped to enums.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The config parser should parse configs correctly
Additional context
We'll want to modify the ephys job config parser to handle this:
The enum values are stored here: https://numcodecs.readthedocs.io/en/stable/blosc.html
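For example, a minimal sketch of the mapping, assuming the JSON config stores the Blosc shuffle mode as a string (the config keys here are illustrative, not the job's actual schema):

from numcodecs import Blosc

# Map the string from the JSON config to the numcodecs Blosc shuffle constants.
SHUFFLE_MAP = {
    "NOSHUFFLE": Blosc.NOSHUFFLE,
    "SHUFFLE": Blosc.SHUFFLE,
    "BITSHUFFLE": Blosc.BITSHUFFLE,
    "AUTOSHUFFLE": Blosc.AUTOSHUFFLE,
}

def blosc_from_config(cfg: dict) -> Blosc:
    # cfg is the parsed JSON, e.g. {"cname": "zstd", "clevel": 9, "shuffle": "BITSHUFFLE"}
    return Blosc(
        cname=cfg.get("cname", "zstd"),
        clevel=cfg.get("clevel", 9),
        shuffle=SHUFFLE_MAP[cfg.get("shuffle", "SHUFFLE")],
    )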
For nd-data-transfer, expose an option to upload data in a BIDS-like directory structure.
https://bids-specification.readthedocs.io/en/stable/01-introduction.html
https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/10-microscopy.html
this may help: https://github.com/bids-standard/pybids
Proposed structure:
<modality>_<subject-id>_<acquisition-date> (root directory, optional suffix for _<acquisition-time>)
LICENSE (CC-BY-4.0)
dataset_description.json, minimally:
<modality>
chunk-<N>_stain-<label>.<ext>
This does not need to validate for now.
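For illustration only, the root directory name above could be assembled like this (the helper is hypothetical, not part of nd-data-transfer):

from datetime import datetime

def build_root_name(modality: str, subject_id: str,
                    acq_datetime: datetime, include_time: bool = False) -> str:
    # <modality>_<subject-id>_<acquisition-date>, with the optional _<acquisition-time> suffix
    name = f"{modality}_{subject_id}_{acq_datetime:%Y-%m-%d}"
    if include_time:
        name += f"_{acq_datetime:%H-%M-%S}"
    return name

# build_root_name("ecephys", "625463", datetime(2022, 9, 28, 16, 34, 22), include_time=True)
# -> "ecephys_625463_2022-09-28_16-34-22"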
As a user, I want to see which version of the code I'm working on, so I can update or retrace my steps if I want to.
Add any helpful notes here.
Describe the bug
Use experiment notation instead of block index notation.
To Reproduce
Run the compressor and inspect the zarr file names.
Expected behavior
The compressed files have experiment names.
Additional context
Unify naming between compressed data and CodeOcean capsules.
Acceptance Criteria:
Naming follows aind-data-schema conventions (see: RawDataAsset and DerivedDataAsset).
As a user, I want to control a few directory options, so I can manage data better.
The aws sync and gsutil rsync commands are used to upload folders to cloud storage. We may want to look into whether we want to overwrite the existing cloud folder, update the data if a cloud folder already exists, or send back a warning if the cloud folder already exists.
Add any helpful notes here.
Describe the bug
It takes over 4 minutes to install the dependencies. We should explore ways to streamline this process.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
We can save time by streamlining this operation.
Additional context
Add any other context about the problem here.
As a user, I want to install nd-data-transfer via pip, so I can easily run the code on local machines.
Add any helpful notes here.
Describe the bug
Windows has a built-in maximum file path length of roughly 256 characters (see e.g. here). When Zarr attempts to create files, this can throw a cryptic FileNotFoundError.
To Reproduce
This simple code shows the faulty behavior:
import shutil
from pathlib import Path

import zarr

# Placeholder values -- the original base_folder and zarr_name are machine-specific.
base_folder = "D:\\ephys_data\\zarr_path_length_test"
zarr_name = "\\recording.zarr"

# simple function to create a dataset in a group
def create_zarr_group_dataset(root_path):
    if Path(root_path).is_dir():
        shutil.rmtree(root_path)
    zarr_root = zarr.open(root_path, mode="w", storage_options=None)
    zarr_group = zarr_root.create_group("a_group")
    g_dset = zarr_group.create_dataset(name="group_data",
                                       data=[str(i) for i in range(100)],
                                       compressor=None)

# here we extend the file path by appending \\new_folder n_iter times
for n_iter in range(20):
    zarr_path = base_folder
    for i in range(n_iter):
        zarr_path += "\\new_folder"
    zarr_path += zarr_name
    print(f"N iter: {n_iter} - len file path {len(zarr_path)}")
    try:
        create_zarr_group_dataset(zarr_path)
    except Exception as e:
        print(f"Failed for iter {n_iter}")
Which produces:
N iter: 0 - len file path 57
N iter: 1 - len file path 68
N iter: 2 - len file path 79
N iter: 3 - len file path 90
N iter: 4 - len file path 101
N iter: 5 - len file path 112
N iter: 6 - len file path 123
N iter: 7 - len file path 134
N iter: 8 - len file path 145
N iter: 9 - len file path 156
N iter: 10 - len file path 167
N iter: 11 - len file path 178
N iter: 12 - len file path 189
N iter: 13 - len file path 200
Failed for iter 13
N iter: 14 - len file path 211
Failed for iter 14
N iter: 15 - len file path 222
Failed for iter 15
N iter: 16 - len file path 233
Failed for iter 16
N iter: 17 - len file path 244
Failed for iter 17
N iter: 18 - len file path 255
Failed for iter 18
N iter: 19 - len file path 266
Failed for iter 19
Expected behavior
We should raise an error with an informative message asking the user to reduce the depth of the destination folder.
Also opened an issue on the zarr project: zarr-developers/zarr-python#1235
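A minimal sketch of such a check, run before creating the Zarr store; the 256-character limit and the helper name are assumptions based on the default Windows MAX_PATH behavior:

import sys

# Windows' default MAX_PATH is 260 characters; leave headroom for chunk file names.
MAX_WINDOWS_PATH_LENGTH = 256

def check_output_path_length(output_path: str) -> None:
    if sys.platform == "win32" and len(str(output_path)) > MAX_WINDOWS_PATH_LENGTH:
        raise ValueError(
            f"Output path is {len(str(output_path))} characters long, which exceeds "
            f"the Windows limit of {MAX_WINDOWS_PATH_LENGTH} characters. "
            "Please reduce the depth of the destination folder."
        )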
Desktop (please complete the following information):
This task is to create an easily installable script that rig operators can use to reliably upload ephys data to cloud storage.
This script contains a prototypical workflow for doing a related upload + transcode job for raw ephys data:
nd-data-transfer is being developed to take raw imaging data, compress and convert it to OME-Zarr, and upload it to AWS or GCP object storage.
Non-requirements:
The script does not have to use nd-data-transfer, but this would be preferable (nd-data-transfer is currently used for imaging data).
Acceptance Criteria:
The script is pip-installable.
In the future we will migrate to a WavPack compressor.
Is your feature request related to a problem? Please describe.
The OpenEphys GUI does not correctly handle the geometry of NP-opto probes when saving the settings.xml, and saves the NP-opto as having the NP1.0 configuration.
Since we automatically read the probe information from the settings file, this can lead to errors for downstream analysis.
Describe the solution you'd like
Before compressing and clipping, we should check for opto probes (easy from the settings file) and correct the electrode locations, so that the correct probe configuration is loaded in SpikeInterface.
Describe alternatives you've considered
An alternative could be to correct the probe geometry a posteriori, but since we plan to trigger the computational pipeline as soon as a new data asset is created, this is not possible.
As a user, I want to compress ephys data, so I can upload a smaller data set to the cloud.
Add any helpful notes here.
As a user, I want to encrypt the video data folder, so I can manage who can access it.
We can try using pyminizip, but it might require extra installation steps on a Windows machine.
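If pyminizip works out, the call might look roughly like this (file names, password handling, and the empty path prefix are placeholders; pyminizip ships a compiled extension, so Windows installs may need a prebuilt wheel):

import pyminizip

# Placeholder paths -- the real ones would come from the job configuration.
video_file = "camera_0.avi"
encrypted_zip = "videos_encrypted.zip"
password = "change-me"      # should come from a secret store, not the config file
compression_level = 5       # 1 (fastest) .. 9 (smallest)

# compress(source_file, prefix_inside_zip, destination_zip, password, level)
pyminizip.compress(video_file, "", encrypted_zip, password, compression_level)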
As a user and developer, I want to see useful log messages, so I can more easily track the progress or debug the processing pipeline.
Add any helpful notes here.
Update 7/7
Update 7/11
Update 7/19
Update 7/22
Acceptance Criteria:
pip install aind-data-transfer[ephys] works
pip install aind-data-transfer[imaging] works
pip install aind-data-transfer / pip install aind-data-transfer[full] does it all.
Describe the bug
When decompressing the last chunk with wavpack-numcodecs, you get a wrong chunk size error.
To Reproduce
https://github.com/AllenNeuralDynamics/ephys-spikesort-kilosort25-full/issues/2
Expected behavior
Wavpack correctly decompresses the last chunk
Additional context
Fixed with AllenNeuralDynamics/wavpack-numcodecs#6 and https://pypi.org/project/wavpack-numcodecs/
Describe the bug
Currently, calling write_ome_zarr.py with the --resume flag will perform the following check to see if a tile has already been written:
aind-data-transfer/src/aind_data_transfer/transcode/ome_zarr.py
Lines 260 to 267 in 720a57c
This was meant as a placeholder, and is incorrect for a couple of reasons: 1) it only takes into account the shape of the array, not the stored data; 2) it only checks the lowest resolution level, which made sense when levels were written in order from high to low, but all levels are written simultaneously now.
To actually resume in case of failure, what I've been doing is checking the logs to see which tile was currently being written, then deleting that (partially written) tile from the output location, and finally restarting the job using the --resume flag, which skips over all the existing arrays (whether or not they were fully written) in the output store.
This issue zarr-developers/zarr-python#587 describes a few different ways to detect missing chunks in a Zarr array. The one that seemed most promising to me was taking the ratio of nchunks_initialized and nchunks, which should be 1 if all chunks were written (assuming there are no empty "zero" chunks). The methods which scan the entire array and/or compute checksums feel less appealing to me, since it might end up being faster to just re-write the tile, but they could be worth investigating depending on how thorough we want to be.
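A sketch of that check, assuming zarr-python 2.x and that no chunks are intentionally empty (the helper name is ours, not part of aind-data-transfer):

import zarr

def tile_fully_written(output_store: str, array_path: str) -> bool:
    # True only if every chunk of the array has been initialized in the store.
    arr = zarr.open_array(store=output_store, path=array_path, mode="r")
    return arr.nchunks_initialized == arr.nchunks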
As a user, I want a clipped .dat file in addition to a compressed .dat file, so I can still use the SpikeInterface API.
Add any helpful notes here.
As a developer, I want a github action to manage the version number, so I don't have to worry about the updates manually.
Add any helpful notes here.
Move the Videos subdirectory to the same level as ecephys_clipped and ephys_compressed.
Rename Videos to videos to be consistent.
ecephys_<subject_id>_<acq_date>_<acq_time>
As a user, I want to be able to pip install from pypi, so I can run the jobs without cloning the repo.
Add any helpful notes here.
As a user, I want to upload datasets to the cloud, so I can run my analysis in the cloud.
Do any file re-organization locally.
As a user, I want to keep my data set without any modifications from the ephys pipeline, so I can keep the data sets as raw as possible.
Add any helpful notes here.
As a user, I want to retain tile position metadata from the Imaris files when converting to OME-Zarr, so that I can stitch the dataset.
According to the OME-NGFF spec https://ngff.openmicroscopy.org/latest/ , the translation field specifies the offset from the origin, in physical coordinates, and must go in the .zattrs under coordinateTransformations, like so:
{
    "multiscales": [
        {
            "datasets": [
                {
                    "coordinateTransformations": [
                        {
                            "scale": [
                                1.0,
                                0.75,
                                0.75
                            ],
                            "type": "scale"
                        },
                        {
                            "translation": [
                                -12000.0,
                                0.0,
                                0.0
                            ],
                            "type": "translation"
                        }
                    ],
                    "path": "0"
                }
            ]
        }
    ]
}
translation must be specified after scale
The translation field shows up in the .zattrs file
Add any helpful notes here.
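As a rough sketch (not the repository's actual conversion code), the metadata above could be written straight into the group attributes with zarr; the output path is illustrative:

import zarr

# hypothetical output path for a single converted tile
root = zarr.open_group("tile_0000.zarr", mode="a")
root.attrs["multiscales"] = [
    {
        "datasets": [
            {
                "path": "0",
                "coordinateTransformations": [
                    # scale must come before translation per the NGFF spec
                    {"type": "scale", "scale": [1.0, 0.75, 0.75]},
                    {"type": "translation", "translation": [-12000.0, 0.0, 0.0]},
                ],
            }
        ]
    }
]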
Describe the bug
A lot of warnings show up when running the ephys upload job.
/miniconda3/envs/nd-data-transfer/lib/python3.8/site-packages/packaging/version.py:111: DeprecationWarning: Creating a LegacyVersion has been deprecated and will be removed in the next major release
  warnings.warn(
To Reproduce
Steps to reproduce the behavior:
Expected behavior
There shouldn't be any warnings if we can avoid them.
Additional context
Add any other context about the problem here.
As a developer, I want to re-use code, so I can maintain things better.
We might need to wait until aind-codeocean-api is published to pypi
For example:
Describe the bug
Github actions are failing.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The checks during github actions should be passing.
Screenshots
Check the stack trace here:
https://github.com/AllenNeuralDynamics/nd-data-transfer/runs/8218772895?check_suite_focus=true
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context
One of the unit tests is implicitly creating a client to Google Cloud Storage. As a quick fix, the unit test can be suppressed. The long-term fix is to separate the client from the class in which it's instantiated, and mock it in the unit test. The file that's probably causing the issue is:
https://github.com/AllenNeuralDynamics/nd-data-transfer/blob/main/tests/test_gcs_uploader.py
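A sketch of the mocking approach, assuming the test uses unittest; patching google.cloud.storage.Client keeps the test offline (the test name and assertions are illustrative):

import unittest
from unittest import mock

class TestGCSUploader(unittest.TestCase):
    # Patching the constructor means any code that creates a
    # google.cloud.storage.Client during the test gets a MagicMock instead,
    # so no credentials or network access are required.
    @mock.patch("google.cloud.storage.Client")
    def test_client_is_mocked(self, mock_client_cls):
        from google.cloud import storage

        client = storage.Client()  # MagicMock; no real GCS call is made
        client.bucket("test-bucket")

        mock_client_cls.assert_called_once()
        mock_client_cls.return_value.bucket.assert_called_once_with("test-bucket")

if __name__ == "__main__":
    unittest.main()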