
datalad-ukbiobank's Introduction

DataLad extension for working with the UKbiobank


This software is a DataLad extension that equips DataLad with a set of commands to obtain (and monitor) imaging data releases of the UKbiobank (see documentation for more information).

UKbiobank is a national and international health resource with unparalleled research opportunities, open to all bona fide health researchers. UK Biobank aims to improve the prevention, diagnosis and treatment of a wide range of serious and life-threatening illnesses – including cancer, heart diseases, stroke, diabetes, arthritis, osteoporosis, eye disorders, depression and forms of dementia. It is following the health and well-being of 500,000 volunteer participants and provides health information, which does not identify them, to approved researchers in the UK and overseas, from academia and industry.

Command(s) provided by this extension

  • ukb-init -- Initialize an existing dataset to track a UKBiobank participant
  • ukb-update -- Update an existing dataset of a UKbiobank participant

Installation

Before you install this package, please make sure that you install a recent version of git-annex. Afterwards, install the latest version of datalad-ukbiobank from PyPI. It is recommended to use a dedicated virtualenv:

# create and enter a new virtual environment (optional)
virtualenv --system-site-packages --python=python3 ~/env/datalad
. ~/env/datalad/bin/activate

# install from PyPi
pip install datalad_ukbiobank

You will also need to download the ukbfetch utility provided by the UK Biobank. See the ukbfetch documentation for specifics.
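Once downloaded, ukbfetch only needs to be executable and discoverable on the PATH used when running ukb-update. A minimal sketch, assuming the utility was saved to ~/Downloads and the virtualenv from above is used (both paths are examples):

# make the downloaded utility executable and expose it on the PATH
chmod +x ~/Downloads/ukbfetch
ln -s ~/Downloads/ukbfetch ~/env/datalad/bin/ukbfetch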

Use

To track UKB data for a single participant (example ID: 1234), start by creating and initializing a new dataset:

% datalad create 1234
% cd 1234
% datalad ukb-init --bids 1234 20227_2_0 20227_3_0 25755_2_0 25755_3_0

In this example only two data records with two instances each are selected. However, any other selection is supported too. The --bids flag enables an additional dataset layout with a BIDS-like structure.

After initialization, run ukb-update at any time to (re-)download data from UKB, and update the dataset in order to track changes longitudinally.

datalad -c datalad.ukbiobank.keyfile=<pathtoaccesstoken> ukb-update
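Alternatively, the keyfile can be passed directly via the command's own --keyfile option instead of the configuration setting:

# equivalent invocation using the --keyfile option
datalad ukb-update --keyfile <pathtoaccesstoken>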

This will maintain two or three branches:

  • incoming: tracking the pristine UKB downloads
  • incoming-native: a "native" representation of the extracted downloads for single file access using UKB naming conventions
  • incoming-bids: an alternative dataset layout using BIDS conventions (if enabled with ukb-init --bids)

Changes can then be merged manually into the main branch. Alternatively, ukb-update --merge merges incoming-native (or incoming-bids if enabled) automatically.
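For a manual merge, a sketch (assuming the mainline branch is called main and the non-BIDS layout is used) could look like this:

# merge the updated native layout into the mainline branch
git checkout main
git merge incoming-native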

Use with pre-downloaded data

Re-download can be avoided (while maintaining all other functionality) if the ukbfetch utility is replaced by a shim that obtains the relevant files from wherever they have already been downloaded to. An example script is provided at tools/ukbfetch_surrogate.sh.

One simple way to use this script is to add a symlink in ~/env/datalad/bin/, for example:

ln -s tools/ukbfetch_surrogate.sh ~/env/datalad/bin/ukbfetch
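Note that, depending on the version, ukb-update may still expect a keyfile setting even when the surrogate performs no real download; the extension's own test suite simply configures a dummy value, for example:

# a placeholder keyfile value is sufficient when ukbfetch is replaced by a surrogate
datalad -c datalad.ukbiobank.keyfile=dummy ukb-update --merge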

Use on non-UNIX-like operating systems

This code relies on a number of POSIX filesystem features that may make it somewhat hard to get working on Windows. Contributions to port this extension to non-POSIX platforms are welcome, but presently this is not supported.

Support

For general information on how to use or contribute to DataLad (and this extension), please see the DataLad website or the main GitHub project page.

All bugs, concerns and enhancement requests for this software can be submitted here: https://github.com/datalad/ukbiobank/issues

If you have a problem or would like to ask a question about how to use DataLad, please submit a question to NeuroStars.org with a datalad tag. NeuroStars.org is a platform similar to StackOverflow but dedicated to neuroinformatics.

All previous DataLad questions are available here: http://neurostars.org/tags/datalad/

Acknowledgements

This development was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement VirtualBrainCloud (H2020-EU.3.1.5.3, grant no. 826421).

datalad-ukbiobank's People

Contributors

adswa, bpoldrack, jbpoline, loj, ltetrel, mih, yarikoptic


datalad-ukbiobank's Issues

Test failure `test_drop`

===================================
__________________________________ test_drop ___________________________________

dspath = '/tmp/datalad_temp_test_drop3madglh0'
records = '/tmp/datalad_temp_test_drop8s8mzg2j'

    @skip_if_on_windows  # see gh-61
    @with_tempfile
    @with_tempfile(mkdir=True)
    def test_drop(dspath=None, records=None):
        make_datarecord_zips('12345', records)
        ds = create(dspath, **ckwa)
        ds.ukb_init(
            '12345',
            ['20227_2_0', '25747_2_0', '25748_2_0', '25748_3_0'], **ckwa)
        ds.config.add('datalad.ukbiobank.keyfile', 'dummy', where='local')
        bin_dir = make_ukbfetch(ds, records)
    
        # baseline
        with patch.dict('os.environ', {'PATH': '{}:{}'.format(
                str(bin_dir),
                os.environ['PATH'])}):
            ds.ukb_update(merge=True, force=True, **ckwa)
        zips_in_ds = list(ds.pathobj.glob('**/*.zip'))
        neq_(zips_in_ds, [])
    
        # drop archives
        with patch.dict('os.environ', {'PATH': '{}:{}'.format(
                str(bin_dir),
                os.environ['PATH'])}):
            ds.ukb_update(merge=True, force=True, drop='archives', **ckwa)
        # no ZIPs can be found, also not in the annex
        eq_(list(ds.pathobj.glob('**/*.zip')), [])
        # we can get all we want (or rather still have it)
>       assert_status('notneeded', ds.get('.', **ckwa))

/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/datalad_ukbiobank/tests/test_update.py:212: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

label = ['notneeded'], results = []

    def assert_status(label, results):
        """Verify that each status dict in the results has a given status label
    
        `label` can be a sequence, in which case status must be one of the items
        in this sequence.
        """
        label = ensure_list(label)
        results = ensure_result_list(results)
        if len(results) == 0:
            # If there are no results, an assertion about all results must fail.
>           raise AssertionError("No results retrieved")
E           AssertionError: No results retrieved


datalad-extensions FAILs in test_base

e.g. https://github.com/datalad/datalad-extensions/runs/7694639181?check_suite_focus=true#step:10:190

FAIL: datalad_ukbiobank.tests.test_init.test_base
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/datalad/tests/utils_pytest.py", line 954, in _wrap_with_tempfile
    return t(*(arg + (filename,)), **kw)
  File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/datalad_ukbiobank/tests/test_init.py", line 26, in test_base
    for b in ['git-annex', 'incoming', 'incoming-native', DEFAULT_BRANCH])
  File "/opt/hostedtoolcache/Python/3.7.13/x64/lib/python3.7/site-packages/datalad/tests/utils_pytest.py", line [191](https://github.com/datalad/datalad-extensions/runs/7694639181?check_suite_focus=true#step:10:192), in assert_true
    assert expr
AssertionError:

Using with pre-downloaded data

Hi and thanks for your work,
It is really helpful to have a dataset per participant, and this helps a lot with data management.

I was wondering how to use your tool if data was pre-downloaded.
We basically have a list of zip files (bulk files I suppose?):

2005646_20227_2_0.zip
3013463_20227_2_0.zip
4020412_20227_2_0.zip
5029399_20227_2_0.zip

Specifically, how do we replace the "ukbfetch utility" as described in https://github.com/datalad/datalad-ukbiobank#use-with-pre-downloaded-data? Do we need to modify the source in the folder where datalad-ukbiobank was installed?

Document how pre-downloaded content can be used as a permanent data source

tools/ukbfetch_surrogate.sh currently uses rsync to obtain the data for a participant dataset. However, it may be that a site has the UKB download in a pretty permanent and secure place already, and it would make sense to reference it properly, rather than just making a copy. Here is a sketch of an alternative approach:

#!/bin/bash

set -u -e

cmd="git annex addurl --raw --pathdepth=-1"
baseurl="https://some.host/ukb-downloads"

for line in $(cat .ukbbatch |  sed 's/ /,/g'); do
    sub_id=${line%,*}
    modality=${line#*,}

    $cmd $baseurl/${sub_id}/${sub_id}_${modality}.zip \
       || $cmd $baseurl/${sub_id}/${sub_id}_${modality}.txt \
       || $cmd $baseurl/${sub_id}/${sub_id}_${modality}.adv \
       || $cmd $baseurl/${sub_id}/${sub_id}_${modality}.ed2
done

The outcome would be that all pristine downloads are also referenced by a URL. This makes it possible to git annex drop -A --force and keep no duplicate downloads around. Any datalad get would resolve to the respective UKB data record that contains a particular file, download the record, extract it, and make that content available locally.
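Under the assumptions of the sketch above, the resulting workflow could then look like this (the file path is a placeholder):

# remove all locally stored (duplicate) downloads; the URL registrations remain
git annex drop -A --force
# later, re-obtain any file; the containing data record is fetched and extracted on demand
datalad get <some/file/of/interest>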

Zenodo release

This needs to become citable eventually - registering it as a TODO.

AnnexBatchCommandError: 'addurl' [Error, annex reported failure for addurl

Hi,

when I try to download and BIDSify a subsample of UKB subjects with datalad-ukbiobank on the head node of our HPC (which has an internet connection), an error occurs. Interestingly, executing the same set of commands locally works flawlessly. I also had a similar error when trying to establish datalad-hirni (psychoinformatics-de/datalad-hirni#201). Maybe something is wrong with my environment.

The error:

[INFO   ] Initiating special remote datalad-archives 
AnnexBatchCommandError: 'addurl' [Error, annex reported failure for addurl (url='dl+archive:MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip#path=fMRI/unusable/rfMRI_SBREF.nii.gz&size=801230'): {'command': 'addurl', 'note': 'from datalad-archives\nto 20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz', 'success': False, 'input': ['dl+archive:MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip#path=fMRI/unusable/rfMRI_SBREF.nii.gz&size=801230 20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz'], 'error-messages': ["  Failed to fetch any archive containing URL-s801230--dl,43archive:MD5E-s348562673--4-d47d48693f84afad33301a3ae2467f14. Tried: ['MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip', 'MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip', 'MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip'] [archives.py:_transfer:407]"], 'file': '20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz'}]

The whole output:

+ '[' -d 'sub-5088058/ses*' ']'
+ datalad create sub-5088058
[INFO   ] Creating a new annex repo at /work/fatx405/projects/BIDS_UKB/sub-5088058 
[INFO   ] scanning for unlocked files (this may take some time) 
create(ok): /work/fatx405/projects/BIDS_UKB/sub-5088058 (dataset)
+ pushd sub-5088058
/work/fatx405/projects/BIDS_UKB/sub-5088058 /work/fatx405/projects/BIDS_UKB /work/fatx405/projects/BIDS_UKB
+ datalad ukb-init --bids 5088058 20227_2_0 20227_3_0 20250_2_0 20250_3_0 20252_2_0 20252_3_0 20253_2_0 20253_3_0
ukb_init(ok): . (dataset)                          
+ datalad ukb-update --keyfile /work/fatx405/projects/BIDS_UKB/k71359r46151.key --merge --drop extracted
[INFO   ] == Command start (output follows) ===== 

ukbfetch on unx - ver Jan 30 2019 15:39:51 - using Glibc2.17(stable)
Run start : 2021-08-11T20:43:52 
Verbose mode activated
Registering repository "biota.ndph.ox.ac.uk"
Registering repository "chest.ndph.ox.ac.uk"
UsrNm: fatx405
AppID: 71359
Loaded 8 lines from ".ukbbatch"
Request(1) for EncID:5088058, Field:20227, Instance:2, Array:0
Contacting "chest.ndph.ox.ac.uk"
348672958 bytes fetched
Download has been logged against IP address 134.100.32.114
Unpacking 348672346 -> 348562673 ... done 348562673 bytes
Opening output file "ukb1_1628707432_1227.tmp_bulk"...
348562673 bytes written
Renaming tmp file "ukb1_1628707432_1227.tmp_bulk" to output file "5088058_20227_2_0.zip"...
Opening output listfile ".git/tmp/ukb.lis"
Created 5088058_20227_2_0.zip
Request(2) for EncID:5088058, Field:20227, Instance:3, Array:0
Contacting "chest.ndph.ox.ac.uk"
323 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20227/Instance=3/Array=0

Contacting "biota.ndph.ox.ac.uk"
343 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20227/Instance=3/Array=0

Download failure
Request(3) for EncID:5088058, Field:20250, Instance:2, Array:0
Contacting "chest.ndph.ox.ac.uk"
323 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20250/Instance=2/Array=0

Contacting "biota.ndph.ox.ac.uk"
343 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20250/Instance=2/Array=0

Download failure
Request(4) for EncID:5088058, Field:20250, Instance:3, Array:0
Contacting "biota.ndph.ox.ac.uk"
343 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20250/Instance=3/Array=0

Contacting "chest.ndph.ox.ac.uk"
323 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20250/Instance=3/Array=0

Download failure
Request(5) for EncID:5088058, Field:20252, Instance:2, Array:0
Contacting "biota.ndph.ox.ac.uk"
50668109 bytes fetched
Download has been logged against IP address 134.100.32.114
Unpacking 50667481 -> 50659551 ... done 50659551 bytes
Opening output file "ukb5_1628707485_1227.tmp_bulk"...
50659551 bytes written
Renaming tmp file "ukb5_1628707485_1227.tmp_bulk" to output file "5088058_20252_2_0.zip"...
Created 5088058_20252_2_0.zip
Request(6) for EncID:5088058, Field:20252, Instance:3, Array:0
Contacting "biota.ndph.ox.ac.uk"
343 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20252/Instance=3/Array=0

Contacting "chest.ndph.ox.ac.uk"
323 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20252/Instance=3/Array=0

Download failure
Request(7) for EncID:5088058, Field:20253, Instance:2, Array:0
Contacting "biota.ndph.ox.ac.uk"
34576101 bytes fetched
Download has been logged against IP address 134.100.32.114
Unpacking 34575473 -> 34564840 ... done 34564840 bytes
Opening output file "ukb7_1628707503_1227.tmp_bulk"...
34564840 bytes written
Renaming tmp file "ukb7_1628707503_1227.tmp_bulk" to output file "5088058_20253_2_0.zip"...
Created 5088058_20253_2_0.zip
Request(8) for EncID:5088058, Field:20253, Instance:3, Array:0
Contacting "biota.ndph.ox.ac.uk"
343 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20253/Instance=3/Array=0

Contacting "chest.ndph.ox.ac.uk"
323 bytes fetched
Download has been logged against IP address 134.100.32.114
Error: Bulk data not present for  Encoded_id=5088058 Field=20253/Instance=3/Array=0

Download failure
Fetched 3/8 datafiles
Run end : 2021-08-11T20:45:26
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] Adding content of the archive MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip into annex AnnexRepo(/work/fatx405/projects/BIDS_UKB/sub-5088058) 
[INFO   ] Initiating special remote datalad-archives 
AnnexBatchCommandError: 'addurl' [Error, annex reported failure for addurl (url='dl+archive:MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip#path=fMRI/unusable/rfMRI_SBREF.nii.gz&size=801230'): {'command': 'addurl', 'note': 'from datalad-archives\nto 20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz', 'success': False, 'input': ['dl+archive:MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip#path=fMRI/unusable/rfMRI_SBREF.nii.gz&size=801230 20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz'], 'error-messages': ["  Failed to fetch any archive containing URL-s801230--dl,43archive:MD5E-s348562673--4-d47d48693f84afad33301a3ae2467f14. Tried: ['MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip', 'MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip', 'MD5E-s348562673--4e8652e17e5570f4dc4da0722e0bd53e.zip'] [archives.py:_transfer:407]"], 'file': '20227_2_0/fMRI/unusable/rfMRI_SBREF.nii.gz'}]
+ popd

The script I am executing

#!/bin/bash

#source activate datalad
ROOT_DIR=$(realpath .)
KEY=$ROOT_DIR/key
export PATH=$ROOT_DIR/:$PATH

pushd $ROOT_DIR
for sub in $(cat ukb_matched_subjects.txt);do
    [ -d sub-${sub}/ses* ] && continue
    datalad create sub-${sub}; pushd sub-${sub}
    datalad ukb-init --bids $sub 20227_2_0 20227_3_0 20250_2_0 20250_3_0 20252_2_0 20252_3_0 20253_2_0 20253_3_0
    datalad ukb-update --keyfile $KEY --merge --drop extracted
    popd

done
popd

Datalad wtf output

datalad wtf
# WTF
## configuration <SENSITIVE, report disabled by configuration>
## credentials 
  - keyring: 
    - active_backends: 
      - PlaintextKeyring with no encyption v.1.0 at /home/fatx405/.local/share/python_keyring/keyring_pass.cfg
    - config_file: /home/fatx405/.config/python_keyring/keyringrc.cfg
    - data_root: /home/fatx405/.local/share/python_keyring
## datalad 
  - full_version: 0.14.6
  - version: 0.14.6
## dependencies 
  - annexremote: 1.5.0
  - appdirs: 1.4.4
  - boto: 2.49.0
  - cmd:7z: 16.02
  - cmd:annex: 8.20201104-g13bab4f2c
  - cmd:bundled-git: 2.29.2
  - cmd:git: 2.29.2
  - cmd:system-git: 2.29.2
  - cmd:system-ssh: 7.4p1
  - humanize: 3.10.0
  - iso8601: 0.1.14
  - keyring: 23.0.1
  - keyrings.alt: 4.0.2
  - msgpack: 1.0.2
  - requests: 2.25.1
  - wrapt: 1.12.1
## environment 
  - LANG: en_US.UTF-8
  - PATH: /work/fatx405/miniconda3/envs/datalad/bin:/work/fatx405/miniconda3/condabin:/work/fatx405/miniconda3/bin:/sw/link/git/2.32.0/bin:/sw/env/system-gcc/singularity/3.5.2-overlayfix/bin:/sw/batch/slurm/19.05.6/bin:/sw/rrz/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin
## extensions 
  - container: 
    - description: Containerized environments
    - entrypoints: 
      - datalad_container.containers_add.ContainersAdd: 
        - class: ContainersAdd
        - load_error: None
        - module: datalad_container.containers_add
        - names: 
          - containers-add
          - containers_add
      - datalad_container.containers_list.ContainersList: 
        - class: ContainersList
        - load_error: None
        - module: datalad_container.containers_list
        - names: 
          - containers-list
          - containers_list
      - datalad_container.containers_remove.ContainersRemove: 
        - class: ContainersRemove
        - load_error: None
        - module: datalad_container.containers_remove
        - names: 
          - containers-remove
          - containers_remove
      - datalad_container.containers_run.ContainersRun: 
        - class: ContainersRun
        - load_error: None
        - module: datalad_container.containers_run
        - names: 
          - containers-run
          - containers_run
    - load_error: None
    - module: datalad_container
    - version: 1.1.5
  - metalad: 
    - description: DataLad semantic metadata command suite
    - entrypoints: 
      - datalad_metalad.aggregate.Aggregate: 
        - class: Aggregate
        - load_error: None
        - module: datalad_metalad.aggregate
        - names: 
          - meta-aggregate
          - meta_aggregate
      - datalad_metalad.dump.Dump: 
        - class: Dump
        - load_error: None
        - module: datalad_metalad.dump
        - names: 
          - meta-dump
          - meta_dump
      - datalad_metalad.extract.Extract: 
        - class: Extract
        - load_error: None
        - module: datalad_metalad.extract
        - names: 
          - meta-extract
          - meta_extract
    - load_error: None
    - module: datalad_metalad
    - version: 0.2.1
  - neuroimaging: 
    - description: Neuroimaging tools
    - entrypoints: 
      - datalad_neuroimaging.bids2scidata.BIDS2Scidata: 
        - class: BIDS2Scidata
        - load_error: None
        - module: datalad_neuroimaging.bids2scidata
        - names: 
          - bids2scidata
    - load_error: None
    - module: datalad_neuroimaging
    - version: 0.3.1
  - ukbiobank: 
    - description: UKBiobank dataset support
    - entrypoints: 
      - datalad_ukbiobank.init.Init: 
        - class: Init
        - load_error: None
        - module: datalad_ukbiobank.init
        - names: 
          - ukb-init
          - ukb_init
      - datalad_ukbiobank.update.Update: 
        - class: Update
        - load_error: None
        - module: datalad_ukbiobank.update
        - names: 
          - ukb-update
          - ukb_update
    - load_error: None
    - module: datalad_ukbiobank
    - version: 0.3.3
## git-annex 
  - build flags: 
    - Assistant
    - Webapp
    - Pairing
    - Inotify
    - DBus
    - DesktopNotify
    - TorrentParser
    - MagicMime
    - Feeds
    - Testsuite
    - S3
    - WebDAV
  - dependency versions: 
    - aws-0.22
    - bloomfilter-2.0.1.0
    - cryptonite-0.26
    - DAV-1.3.4
    - feed-1.3.0.1
    - ghc-8.8.4
    - http-client-0.6.4.1
    - persistent-sqlite-2.10.6.2
    - torrent-10000.1.1
    - uuid-1.3.13
    - yesod-1.6.1.0
  - key/value backends: 
    - SHA256E
    - SHA256
    - SHA512E
    - SHA512
    - SHA224E
    - SHA224
    - SHA384E
    - SHA384
    - SHA3_256E
    - SHA3_256
    - SHA3_512E
    - SHA3_512
    - SHA3_224E
    - SHA3_224
    - SHA3_384E
    - SHA3_384
    - SKEIN256E
    - SKEIN256
    - SKEIN512E
    - SKEIN512
    - BLAKE2B256E
    - BLAKE2B256
    - BLAKE2B512E
    - BLAKE2B512
    - BLAKE2B160E
    - BLAKE2B160
    - BLAKE2B224E
    - BLAKE2B224
    - BLAKE2B384E
    - BLAKE2B384
    - BLAKE2BP512E
    - BLAKE2BP512
    - BLAKE2S256E
    - BLAKE2S256
    - BLAKE2S160E
    - BLAKE2S160
    - BLAKE2S224E
    - BLAKE2S224
    - BLAKE2SP256E
    - BLAKE2SP256
    - BLAKE2SP224E
    - BLAKE2SP224
    - SHA1E
    - SHA1
    - MD5E
    - MD5
    - WORM
    - URL
    - X*
  - operating system: linux x86_64
  - remote types: 
    - git
    - gcrypt
    - p2p
    - S3
    - bup
    - directory
    - rsync
    - web
    - bittorrent
    - webdav
    - adb
    - tahoe
    - glacier
    - ddar
    - git-lfs
    - httpalso
    - hook
    - external
  - supported repository versions: 
    - 8
  - upgrade supported from repository versions: 
    - 0
    - 1
    - 2
    - 3
    - 4
    - 5
    - 6
    - 7
  - version: 8.20201104-g13bab4f2c
## location 
  - path: /work/fatx405/projects/BIDS_UKB
  - type: directory
## metadata_extractors 
  - annex (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: None
    - module: datalad.metadata.extractors.annex
    - version: None
  - audio (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: No module named 'mutagen' [audio.py:<module>:17]
    - module: datalad.metadata.extractors.audio
  - bids (datalad-neuroimaging 0.3.1): 
    - distribution: datalad-neuroimaging 0.3.1
    - load_error: None
    - module: datalad_neuroimaging.extractors.bids
    - version: None
  - datacite (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: None
    - module: datalad.metadata.extractors.datacite
    - version: None
  - datalad_core (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: None
    - module: datalad.metadata.extractors.datalad_core
    - version: None
  - datalad_rfc822 (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: None
    - module: datalad.metadata.extractors.datalad_rfc822
    - version: None
  - dicom (datalad-neuroimaging 0.3.1): 
    - distribution: datalad-neuroimaging 0.3.1
    - load_error: None
    - module: datalad_neuroimaging.extractors.dicom
    - version: None
  - exif (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: No module named 'exifread' [exif.py:<module>:16]
    - module: datalad.metadata.extractors.exif
  - frictionless_datapackage (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: None
    - module: datalad.metadata.extractors.frictionless_datapackage
    - version: None
  - image (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: No module named 'PIL' [image.py:<module>:16]
    - module: datalad.metadata.extractors.image
  - metalad_annex (datalad-metalad 0.2.1): 
    - distribution: datalad-metalad 0.2.1
    - load_error: None
    - module: datalad_metalad.extractors.annex
    - version: None
  - metalad_core (datalad-metalad 0.2.1): 
    - distribution: datalad-metalad 0.2.1
    - load_error: None
    - module: datalad_metalad.extractors.core
    - version: None
  - metalad_custom (datalad-metalad 0.2.1): 
    - distribution: datalad-metalad 0.2.1
    - load_error: None
    - module: datalad_metalad.extractors.custom
    - version: None
  - metalad_runprov (datalad-metalad 0.2.1): 
    - distribution: datalad-metalad 0.2.1
    - load_error: None
    - module: datalad_metalad.extractors.runprov
    - version: None
  - nidm (datalad-neuroimaging 0.3.1): 
    - distribution: datalad-neuroimaging 0.3.1
    - load_error: None
    - module: datalad_neuroimaging.extractors.nidm
    - version: None
  - nifti1 (datalad-neuroimaging 0.3.1): 
    - distribution: datalad-neuroimaging 0.3.1
    - load_error: None
    - module: datalad_neuroimaging.extractors.nifti1
    - version: None
  - xmp (datalad 0.14.6): 
    - distribution: datalad 0.14.6
    - load_error: No module named 'libxmp' [xmp.py:<module>:20]
    - module: datalad.metadata.extractors.xmp
## metadata_indexers 
## python 
  - implementation: CPython
  - version: 3.8.1
## system 
  - distribution: centos/7/Core
  - encoding: 
    - default: utf-8
    - filesystem: utf-8
    - locale.prefered: UTF-8
  - max_path_length: 287
  - name: Linux
  - release: 4.14.240-1.0.33.el7.rrz.x86_64
  - type: posix
  - version: #1 SMP Thu Jul 22 18:29:43 CEST 2021

As always grateful for any input and happy to provide further details.

Cheers,
Marvin

(Super)dataset entry points

We will likely need to have multiple superdatasets. One for all raw datasets (#3), and one for BIDS-normalized datasets. In addition, it will be helpful to be able to programmatically generate additional superdatasets that only contain a subset of participants, e.g. based on a prior query.

git-annex 8.20210223+git33-gee4fd38ec breaks things

======================================================================
ERROR: datalad_ukbiobank.tests.test_update.test_drop
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad_ukbiobank/tests/test_update.py", line 195, in test_drop
    ds.ukb_update(merge=True, force=True, drop='archives')
Obscure filename: str=b' |;&%b5{}\'"\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 .datc ' repr=' |;&%b5{}\'"ΔЙקم๗あ .datc '
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/distribution/dataset.py", line 503, in apply_func
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
    return f(**kwargs)
Environment: LANG='C.UTF-8' PATH='/opt/hostedtoolcache/Python/3.7.10/x64/bin:/opt/hostedtoolcache/Python/3.7.10/x64:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/opt/pipx_bin:/usr/share/rust/.cargo/bin:/home/runner/.config/composer/vendor/bin:/usr/local/.ghcup/bin:/home/runner/.dotnet/tools:/snap/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin' GIT_CONFIG_PARAMETERS="'init.defaultBranch=master'"
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/interface/utils.py", line 482, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/interface/utils.py", line 470, in return_func
    results = list(results)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/interface/utils.py", line 401, in generator_func
    allkwargs):
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/interface/utils.py", line 557, in _process_results
    for res in results:
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad_ukbiobank/update.py", line 289, in __call__
    for rec in repo.call_annex_records(['drop'] + drop_opts):
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/support/annexrepo.py", line 1206, in call_annex_records
    return self._call_annex_records(args, files=files)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/support/annexrepo.py", line 1127, in _call_annex_records
    raise e
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/support/annexrepo.py", line 1087, in _call_annex_records
    **kwargs,
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/support/annexrepo.py", line 987, in _call_annex
    **kwargs)
  File "/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/datalad/cmd.py", line 413, in run
    **results,
datalad.support.exceptions.CommandError: CommandError: 'git annex drop --force --branch incoming -I '*.zip' --json --json-error-messages -c annex.dotfiles=true -c annex.retry=3' failed with exitcode 1 under /tmp/datalad_temp_test_drop6c80q7z0 [err: 'git-annex: Cannot use --all or --unused or --key or --incomplete with options that match on file names.'] [info keys: stdout_json]

Performance anecdotes

We generated datasets for all 42716 participants with NIfTI data we have access to. Each dataset is about 1k files and tracks 4GB. All datasets are set up in a way that they track their content by tracking only the pristine downloads that are kept on a local mirror webserver. Because there is nothing in the annex, each dataset is small.

When placed in a RIA store, the entire store with all 42k datasets together is about 20.5GB and needs 1.5M inodes.

We also built a single gigantic BIDS-like superdataset that tracks the master branches of all 42k participant datasets as subdatasets. The entire repo is just 5.5MB.

While this is exploring the edges of Git's capabilities, it is still functional. Even though a git status takes 16 min, a datalad subdatasets call completes its report in 12 s.

`ukb-update` broken with `git-annex >= 10.20220526`

Tested to be broken with 10.20220526-gc6b112108 (linux) and 10.20220624 (brew) -- last known-to-work version is 10.20220504.

Revealed by #87

The path to the situation that shows the issue is a bit complicated, and I failed to find a minimal reproducer. However, within the failing test, this is the situation:

% git annex add -- ses-2/func/sub-12345_ses-2_task-hariri_sbref.json                                                                   
add ses-2/func/sub-12345_ses-2_task-hariri_sbref.json 
git-annex: ses-2/func/sub-12345_ses-2_task-hariri_sbref.json: rename: does not exist (No such file or directory)
failed
add: 1 failed
% git status --untracked=all 
On branch incoming-bids
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        ses-2/func/sub-12345_ses-2_task-hariri_sbref.json

nothing added to commit but untracked files present (use "git add" to track)

% ls -l ses-2/func/sub-12345_ses-2_task-hariri_sbref.json
lrwxrwxrwx 1 mih mih 126 Jul 22 15:01 ses-2/func/sub-12345_ses-2_task-hariri_sbref.json -> ../../.git/annex/objects/Gq/5F/MD5E-s16--7717b383a501e2b750b7147032ec5bc9.json/MD5E-s16--7717b383a501e2b750b7147032ec5bc9.json

Essentially, the untracked file is a moved/renamed annex symlink. It was "imported" from another branch via git read-tree -u --reset <otherbranch>, unstaged via git reset HEAD ., and then renamed. The annex add call shown above subsequently fails.

The full debug log does not reveal anything interesting beyond that:

% git annex --debug add -- ses-2/func/sub-12345_ses-2_task-hariri_sbref.json\ 

[2022-07-22 16:08:08.700985299] (Utility.Process) process [447621] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"]
[2022-07-22 16:08:08.703691231] (Utility.Process) process [447621] done ExitSuccess
[2022-07-22 16:08:08.703925333] (Utility.Process) process [447622] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"]
[2022-07-22 16:08:08.706469813] (Utility.Process) process [447622] done ExitSuccess
[2022-07-22 16:08:08.706802798] (Utility.Process) process [447623] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","log","refs/heads/git-annex..e0e504a6b64f4377e530ed9cf997339f05ee5a4b","--pretty=%H","-n1"]
[2022-07-22 16:08:08.70958952] (Utility.Process) process [447623] done ExitSuccess
[2022-07-22 16:08:08.710889571] (Utility.Process) process [447624] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"]
[2022-07-22 16:08:08.713634081] (Utility.Process) process [447625] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","symbolic-ref","-q","HEAD"]
[2022-07-22 16:08:08.716216812] (Utility.Process) process [447625] done ExitSuccess
[2022-07-22 16:08:08.716413387] (Utility.Process) process [447626] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","show-ref","refs/heads/incoming-bids"]
[2022-07-22 16:08:08.718956163] (Utility.Process) process [447626] done ExitSuccess
[2022-07-22 16:08:08.71915702] (Utility.Process) process [447627] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","-z","--others","--exclude-standard","--","ses-2/func/sub-12345_ses-2_task-hariri_sbref.json"]
[2022-07-22 16:08:08.721819403] (Utility.Process) process [447628] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","check-attr","-z","--stdin","annex.backend","annex.largefiles","annex.numcopies","annex.mincopies","--"]
add ses-2/func/sub-12345_ses-2_task-hariri_sbref.json 
git-annex: ses-2/func/sub-12345_ses-2_task-hariri_sbref.json: rename: does not exist (No such file or directory)
failed
[2022-07-22 16:08:08.725364409] (Utility.Process) process [447627] done ExitSuccess
[2022-07-22 16:08:08.725505447] (Utility.Process) process [447629] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","-z","--modified","--","ses-2/func/sub-12345_ses-2_task-hariri_sbref.json"]
[2022-07-22 16:08:08.728228073] (Utility.Process) process [447629] done ExitSuccess
[2022-07-22 16:08:08.728438983] (Utility.Process) process [447630] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","diff","--name-only","--diff-filter=T","-z","--cached","--","ses-2/func/sub-12345_ses-2_task-hariri_sbref.json"]
[2022-07-22 16:08:08.731222872] (Utility.Process) process [447630] done ExitSuccess
[2022-07-22 16:08:08.731581192] (Utility.Process) process [447624] done ExitSuccess
[2022-07-22 16:08:08.731805783] (Utility.Process) process [447628] done ExitSuccess
add: 1 failed

Add data record subdirectory for any ZIP file content in non-BIDS mode

Some ZIP files for data records contain obfuscated names that make it hard to reconnect them to a specific data record (e.g., ZIPs that contain only numbered PNGs in no subdirectory). Given the complexity and variability, it makes sense to extract ZIP file content into a directory named after the data record ID.

The change is related but not identical to #28

It will also be required to change the BIDS mapping to include the new data record directory layer.

Adjust for deprecated module import

DeprecationWarning: AddArchiveContent has been moved to datalad.local.add_archive_content. This module was deprecated in 0.16.0, and will be removed in a future release. Please adjust the import.

BIDS validation issues: Incorrect slice timing values?

I am exploring whether the resulting datasets are BIDS-compliant enough to run fMRIprep on them. I will report on all problems I encounter as issues.

Invalid slice timing

Some files seem to have invalid slice timing:

I have checked it in two subjects, and they have this issue for different modalities. Subject 10027** has it for resting state fMRI:

2: [ERR] "SliceTiming" value/s contains invalid value as it is greater than RepetitionTime.  SliceTiming values should be in seconds not milliseconds (common mistake). (code: 66 - SLICETIMING_VALUES_GREATOR_THAN_REPETITION_TIME)
./sub-10027**/ses-3/func/sub-10027**_ses-3_task-rest_bold.nii.gz
Evidence: 0.805,0.895,0.9825,1.0725,1.1625,1.25,1.34,1.43,1.52,1.6075,1.6975,1.7875,1.8775,1.965,2.055,2.145,2.235,2.3225,2.4125,2.5025,2.5925,2.68,2.77,2.86,2.95,3.0375,3.1275,3.2175,3.3075,3.395,3.485,3.575,3.665,3.7525,3.8425,3.9325,4.0225,4.11,4.2,4.29,4.38,4.4675,4.5575,4.6475,4.7375,4.825,4.915,5.005,5.095,5.1825,5.2725,5.3625,5.4525,5.54,5.63

Subject 10025** has it for task-hariri:

1: [ERR] "SliceTiming" value/s contains invalid value as it is greater than RepetitionTime.  SliceTiming values should be in seconds not milliseconds (common mistake). (code: 66 - SLICETIMING_VALUES_GREATOR_THAN_REPETITION_TIME)
./sub-10025**/ses-2/func/sub-10025**_ses-2_task-hariri_bold.nii.gz
Evidence: 0.805,0.8925,0.9825,1.0725,1.1625,1.25,1.34,1.43,1.52,1.6075,1.6975,1.7875,1.8775,1.965,2.055,2.145,2.235,2.3225,2.4125,2.5025,2.59,2.68,2.77,2.86,2.9475,3.0375,3.1275,3.2175,3.305,3.395,3.485,3.575,3.6625,3.7525,3.8425,3.9325,4.02,4.11,4.2,4.29,4.3775,4.4675,4.5575,4.6475,4.735,4.825,4.915,5.005,5.0925,5.1825,5.2725,5.3625,5.45,5.54,5.63
Here is the corresponding faulty (?) json sidecar for subject 10025**'s hariri task, and how it looks in subject 10027**:

cat fmriprep/sub-10025**/ses-2/func/sub-10025**_ses-2_task-hariri_bold.json | jq
{
  "Manufacturer": "Siemens",
  "ManufacturersModelName": "Skyra",
  "ImageType": [
    "ORIGINAL",
    "PRIMARY",
    "M",
    "MB",
    "ND",
    "MOSAI"
  ],
  "AcquisitionTime": 90735.7425,
  "AcquisitionDate": 20170512,
  "MagneticFieldStrength": 3,
  "FlipAngle": 51,
  "EchoTime": 0.0424,
  "RepetitionTime": 0.735,
  "EffectiveEchoSpacing": 0.000639989,
  "SliceTiming": [
    0,
    0.09,
    0.1775,
    0.2675,
    0.3575,
    0.4475,
    0.535,
    0.625,
    0.715,
    0.805,
    0.8925,
    0.9825,
    1.0725,
    1.1625,
    1.25,
    1.34,
    1.43,
    1.52,
    1.6075,
    1.6975,
    1.7875,
    1.8775,
    1.965,
    2.055,
    2.145,
    2.235,
    2.3225,
    2.4125,
    2.5025,
    2.59,
    2.68,
    2.77,
    2.86,
    2.9475,
    3.0375,
    3.1275,
    3.2175,
    3.305,
    3.395,
    3.485,
    3.575,
    3.6625,
    3.7525,
    3.8425,
    3.9325,
    4.02,
    4.11,
    4.2,
    4.29,
    4.3775,
    4.4675,
    4.5575,
    4.6475,
    4.735,
    4.825,
    4.915,
    5.005,
    5.0925,
    5.1825,
    5.2725,
    5.3625,
    5.45,
    5.54,
    5.63
  ],
  "PhaseEncodingDirection": "j-"
}

Here is how it looks for subject 10027** (where this task does not throw an error):

cat fmriprep/sub-10027**/ses-2/func/sub-10027**_ses-2_task-hariri_bold.json | jq      
{
  "Manufacturer": "Siemens",
  "ManufacturersModelName": "Skyra",
  "ImageType": [
    "ORIGINAL",
    "PRIMARY",
    "M",
    "MB",
    "ND",
    "MOSAI"
  ],
  "AcquisitionTime": 82349.7575,
  "AcquisitionDate": 20170829,
  "MagneticFieldStrength": 3,
  "FlipAngle": 51,
  "EchoTime": 0.0424,
  "RepetitionTime": 0.735,
  "EffectiveEchoSpacing": 0.000639989,
  "SliceTiming": [
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445,
    0,
    0.2675,
    0.535,
    0.0875,
    0.3575,
    0.625,
    0.1775,
    0.445
  ],
  "PhaseEncodingDirection": "j-"
}
Here is the corresponding faulty (?) json sidecar for subject 10027**'s rest task, and how it looks in subject 10025**:

faulty:

 cat fmriprep/sub-10027**/ses-3/func/sub-10027**_ses-3_task-rest_bold.json | jq
{
  "Manufacturer": "Siemens",
  "ManufacturersModelName": "Skyra",
  "ImageType": [
    "ORIGINAL",
    "PRIMARY",
    "M",
    "MB",
    "ND",
    "MOSAI"
  ],
  "AcquisitionTime": 83200.67,
  "AcquisitionDate": 20190907,
  "MagneticFieldStrength": 3,
  "FlipAngle": 51,
  "EchoTime": 0.039,
  "RepetitionTime": 0.735,
  "EffectiveEchoSpacing": 0.000639989,
  "SliceTiming": [
    0,
    0.09,
    0.18,
    0.2675,
    0.3575,
    0.4475,
    0.5375,
    0.625,
    0.715,
    0.805,
    0.895,
    0.9825,
    1.0725,
    1.1625,
    1.25,
    1.34,
    1.43,
    1.52,
    1.6075,
    1.6975,
    1.7875,
    1.8775,
    1.965,
    2.055,
    2.145,
    2.235,
    2.3225,
    2.4125,
    2.5025,
    2.5925,
    2.68,
    2.77,
    2.86,
    2.95,
    3.0375,
    3.1275,
    3.2175,
    3.3075,
    3.395,
    3.485,
    3.575,
    3.665,
    3.7525,
    3.8425,
    3.9325,
    4.0225,
    4.11,
    4.2,
    4.29,
    4.38,
    4.4675,
    4.5575,
    4.6475,
    4.7375,
    4.825,
    4.915,
    5.005,
    5.095,
    5.1825,
    5.2725,
    5.3625,
    5.4525,
    5.54,
    5.63
  ],
  "PhaseEncodingDirection": "j-"
}

Here is how it looks for the other subject (where the task does not throw an error)

cat fmriprep/sub-10025**/ses-3/func/sub-10025**_ses-3_task-rest_bold.json | jq 
{
  "Manufacturer": "Siemens",
  "ManufacturersModelName": "Skyra",
  "ImageType": [
    "ORIGINAL",
    "PRIMARY",
    "M",
    "MB",
    "ND",
    "MOSAI"
  ],
  "AcquisitionTime": 93213.705,
  "AcquisitionDate": 20190806,
  "MagneticFieldStrength": 3,
  "FlipAngle": 51,
  "EchoTime": 0.0424,
  "RepetitionTime": 0.735,
  "EffectiveEchoSpacing": 0.000639989,
  "SliceTiming": [
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475,
    0,
    0.2675,
    0.535,
    0.09,
    0.3575,
    0.625,
    0.1775,
    0.4475
  ],
  "PhaseEncodingDirection": "j-"
}

support for data types not downloaded as zip files

Some data types (e.g., the rfMRI correlation matrix; field 25751) are not downloaded as zip files but rather as txt files. As a result, these files get left behind in the incoming branch.

Ability to drop files from annex

Aiming for something like this:

  --drop {extracted|archives}
                        Drop file content to avoid storage duplication.
                        'extracted': drop all content of files extracted from
                        downloaded archives to yield the most compact storage
                        at the cost of partial re-extraction when accessing
                        archive content; 'archives': keep extracted content,
                        but drop archives instead. By default no content is
                        dropped, duplicating archive content in extracted
                        form. Constraints: value must be one of
                        ('extracted', 'archives')
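A corresponding invocation, mirroring the usage shown in other reports above, might then look like:

# update, merge, and keep only the pristine archives by dropping the extracted copies
datalad ukb-update --merge --drop extracted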

Juelich / Montreal UKB project todo list

@jb : Get the mapping

  • Send email to Alan Simon (cc Laura)

fmriprep reproducibility

- @jb : talk to Basile for LTS : is that fixing FS / ANTS etc reproducibility?
- timeline 

Mtl : Distributing computing

Overwrite error when downloading two data instances of one participant with ukb-update

I tried ukb-update to download the bulk data for a participant who has two instances of the same field-id. Data were downloaded successfully but in the end, I received this error:
[ERROR] File path_to/FreeSurfer/stats/lh.aparc.stats already exists, but new (?) file FreeSurfer/stats/lh.aparc.stats was instructed to be placed there while overwrite=False [add_archive_content.py:call:404] (RuntimeError)

As I understand it, the two instances are downloaded as zip files with different file names, but since they are automatically unzipped, the extracted content has the same file/folder names and this caused the problem! On the other hand, I want to store both instances separately, so how should I solve this issue?

adding new DATARECORD-IDs to the dataset

Right now, when I want to initialize the dataset, I have to provide a PARTICIPANT-ID and values for DATARECORD-ID, and there is no option of adding more DATARECORD-IDs after initialization. Is that right?

If that is true, what would you recommend if I realize that I need more IDs in the future?

I tried to use --force, but I am not sure if this is the way to do it (and I had an error due to merge conflicts anyway).

cc:@Hoda1394

Implement name prefix support for managed branches?

Two sites (on one application) might have different subsets of data (storage limits, interest differences), but still might want to collaborate on the same participant datasets. ATM the implementation only supports full updates, i.e. only what is available at the time of an update will make it into the managed branches.

Supporting proper incremental updates is not trivial, as it would involve determining which files came from which download and selectively maintaining those that existed before.

A cheaper approach might be to add support for a branch-name prefix. Each site would have its own incoming-* branches, and the union of the site contributions could be achieved by merging the incoming-* branches from each site into a mainline branch whenever an update is made (see the sketch below).
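A purely illustrative sketch, with hypothetical siteA-/siteB- prefixes and assuming the mainline branch is called main:

# hypothetical branch names; each site maintains its own prefixed incoming branches
git checkout main
git merge siteA-incoming-native siteB-incoming-native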

Ping @Hoda1394

hardlink issue with datalad, datalad-ukbiobank (likely a git-annex issue)

This may be more directly a git-annex issue, but I will post it here first:

For our UK Biobank processing I've written a wrapper that either hardlinks to a local file in some other directory or fetches it remotely. However, this fails in some local filemode modification step because it's a hardlink.

The hardlink would allow us to not duplicate 100+ TB and would allow us to remove things after everything is taken care of.

$ datalad ukb-update -k keyfile
[INFO   ] == Command start (output follows) ===== 
Found: linking to local file
... repeat for all files
[INFO   ] == Command exit (modification check follows) =====                                                                                        
[ERROR  ]   2196131_21011_0_0.fda: setFileMode: permission denied (Operation not permitted) [add(/om4/project/biobank/testukb/2196131_21011_0_0.fda)] 
... repeats for all files

Missing json files for BIDS layout

Hi,

I am using datalad-ukbiobank with pre-downloaded data to convert the downloaded archives into a BIDS-compliant dataset.
However, I don't see any JSON files associated with the NIfTI images; is that expected?

I run the following:

datalad create ${output_dir}/sub-${participant_id}
cd ${output_dir}/sub-${participant_id}
datalad ukb-init --bids --force ${participant_id} 20227_2_0 20252_2_0
datalad ukb-update --merge --force --keyfile ${keyfile_path}

And here is the content of the BIDS dataset:

.
└── sub-1159104
    └── ses-2
        ├── anat
        │   └── sub-1159104_ses-2_T1w.nii.gz -> ../../.git/annex/objects/ww/xf/MD5E-s13185752--68ed198d992da40e39d220e7358e97bc.nii.gz/MD5E-s13185752--68ed198d992da40e39d220e7358e97bc.nii.gz
        ├── func
        │   ├── sub-1159104_ses-2_task-hariri_sbref.nii.gz -> ../../.git/annex/objects/Q4/9M/MD5E-s850576--76d9f31599639ba9cc7c12c6d40b52b9.nii.gz/MD5E-s850576--76d9f31599639ba9cc7c12c6d40b52b9.nii.gz
        │   └── sub-1159104_ses-2_task-rest_bold.nii.gz -> ../../.git/annex/objects/2K/xM/MD5E-s384519774--4f8059fea8a822f4945c11ef3ec0b022.nii.gz/MD5E-s384519774--4f8059fea8a822f4945c11ef3ec0b022.nii.gz
        └── non-bids
            ├── fMRI
            │   ├── rfMRI_100.dr
            │   │   └── dr_stage1.txt -> ../../../../.git/annex/objects/jz/6M/MD5E-s658825--01c0101a0ab68f5185abd1430855a700.txt/MD5E-s658825--01c0101a0ab68f5185abd1430855a700.txt
            │   ├── rfMRI_25.dr
            │   │   └── dr_stage1.txt -> ../../../../.git/annex/objects/F6/8W/MD5E-s160316--18925e533b1dc16b423d2569dd2a82f4.txt/MD5E-s160316--18925e533b1dc16b423d2569dd2a82f4.txt
            │   └── rfMRI.ica
            │       ├── absbrainthresh.txt -> ../../../../.git/annex/objects/xx/KP/MD5E-s7--af9a779a7d48a1a5f1cddf0bde3b238b.txt/MD5E-s7--af9a779a7d48a1a5f1cddf0bde3b238b.txt
            │       ├── design.fsf -> ../../../../.git/annex/objects/10/Xv/MD5E-s7844--b5f65beee695ef9a19b61a1db4f2eb22.fsf/MD5E-s7844--b5f65beee695ef9a19b61a1db4f2eb22.fsf
            │       ├── example_func.nii.gz -> ../../../../.git/annex/objects/K5/21/MD5E-s1471475--a95c366c2d28d13f2db90e2549818fc5.nii.gz/MD5E-s1471475--a95c366c2d28d13f2db90e2549818fc5.nii.gz
            │       ├── filtered_func_data_clean.nii.gz -> ../../../../.git/annex/objects/21/gW/MD5E-s271074906--a87d26296349e9dd8519741128f647a5.nii.gz/MD5E-s271074906--a87d26296349e9dd8519741128f647a5.nii.gz
            │       ├── filtered_func_data.ica
            │       │   ├── log.txt -> ../../../../../.git/annex/objects/p9/vq/MD5E-s44357--e0dcd5a19bc83320527ba7fc092166cc.txt/MD5E-s44357--e0dcd5a19bc83320527ba7fc092166cc.txt
            │       │   ├── mask.nii.gz -> ../../../../../.git/annex/objects/2J/7q/MD5E-s14857--eae5ac170963e772c748159687c3e2ad.nii.gz/MD5E-s14857--eae5ac170963e772c748159687c3e2ad.nii.gz
            │       │   ├── mean.nii.gz -> ../../../../../.git/annex/objects/p9/Kq/MD5E-s553118--6bf2e937ff9fada684b7a049e7d6029a.nii.gz/MD5E-s553118--6bf2e937ff9fada684b7a049e7d6029a.nii.gz
            │       │   ├── melodic_FTmix -> ../../../../../.git/annex/objects/f6/00/MD5E-s597412--12bdf986ec9521c91de76679d3ba55b1/MD5E-s597412--12bdf986ec9521c91de76679d3ba55b1
            │       │   ├── melodic_IC.nii.gz -> ../../../../../.git/annex/objects/q4/V1/MD5E-s109324412--55c691985e03d1284cfa6c6e85433ddf.nii.gz/MD5E-s109324412--55c691985e03d1284cfa6c6e85433ddf.nii.gz
            │       │   ├── melodic_ICstats -> ../../../../../.git/annex/objects/9G/PP/MD5E-s8043--cd962f3faa199f8fc29674db33c0d02a/MD5E-s8043--cd962f3faa199f8fc29674db33c0d02a
            │       │   ├── melodic_mix -> ../../../../../.git/annex/objects/2f/MK/MD5E-s1313864--fbc89bb9c47b31d57136dbee33bb0d96/MD5E-s1313864--fbc89bb9c47b31d57136dbee33bb0d96
            │       │   ├── melodic_PPCA -> ../../../../../.git/annex/objects/PG/G4/MD5E-s44662--6b015d3d7b229fec1d748ceb7e9d4fff/MD5E-s44662--6b015d3d7b229fec1d748ceb7e9d4fff
            │       │   └── melodic_Tmodes -> ../../../../../.git/annex/objects/2f/MK/MD5E-s1313864--fbc89bb9c47b31d57136dbee33bb0d96/MD5E-s1313864--fbc89bb9c47b31d57136dbee33bb0d96
            │       ├── fix4melview_UKBiobank_thr20.txt -> ../../../../.git/annex/objects/0Q/Gq/MD5E-s6257--2ec348c000604ebd1f074394fec4f753.txt/MD5E-s6257--2ec348c000604ebd1f074394fec4f753.txt
            │       ├── mask.nii.gz -> ../../../../.git/annex/objects/P9/1x/MD5E-s14846--785df6b1b9d0a7217c034d3bf3aa5edd.nii.gz/MD5E-s14846--785df6b1b9d0a7217c034d3bf3aa5edd.nii.gz
            │       ├── mc
            │       │   ├── disp.png -> ../../../../../.git/annex/objects/Xf/QM/MD5E-s5148--edc8e9f7bf6b5dcc64d2460feb1bd2b0.png/MD5E-s5148--edc8e9f7bf6b5dcc64d2460feb1bd2b0.png
            │       │   ├── prefiltered_func_data_mcf_abs_mean.rms -> ../../../../../.git/annex/objects/V5/M8/MD5E-s8--e45161698eb23bea634aaf1d7e98e5fa.rms/MD5E-s8--e45161698eb23bea634aaf1d7e98e5fa.rms
            │       │   ├── prefiltered_func_data_mcf_abs.rms -> ../../../../../.git/annex/objects/2W/w1/MD5E-s3912--28d92c3404c03f46f65ff4ca7d5cc8f7.rms/MD5E-s3912--28d92c3404c03f46f65ff4ca7d5cc8f7.rms
            │       │   ├── prefiltered_func_data_mcf.cat -> ../../../../../.git/annex/objects/GM/K6/MD5E-s74489--dcd356fb01ba626dea5974fbbc062557.cat/MD5E-s74489--dcd356fb01ba626dea5974fbbc062557.cat
            │       │   ├── prefiltered_func_data_mcf_conf_hp.nii.gz -> ../../../../../.git/annex/objects/p9/f8/MD5E-s43798--1a3aab41bdef47c1a75779ef26fdf175.nii.gz/MD5E-s43798--1a3aab41bdef47c1a75779ef26fdf175.nii.gz
            │       │   ├── prefiltered_func_data_mcf_conf.nii.gz -> ../../../../../.git/annex/objects/Q1/fP/MD5E-s43644--2d5d0429b2ee0fdddcdced733277751a.nii.gz/MD5E-s43644--2d5d0429b2ee0fdddcdced733277751a.nii.gz
            │       │   ├── prefiltered_func_data_mcf.par -> ../../../../../.git/annex/objects/Pg/GZ/MD5E-s32566--d1652e3ddb38bb6bf611e6fa5c5723c4.par/MD5E-s32566--d1652e3ddb38bb6bf611e6fa5c5723c4.par
            │       │   ├── prefiltered_func_data_mcf_rel_mean.rms -> ../../../../../.git/annex/objects/3P/mk/MD5E-s9--0c62fa760e910cbd6a489eda776ba0a1.rms/MD5E-s9--0c62fa760e910cbd6a489eda776ba0a1.rms
            │       │   ├── prefiltered_func_data_mcf_rel.rms -> ../../../../../.git/annex/objects/mx/p9/MD5E-s4389--a541dfb887b9b8f2c5cc47bd6a33b764.rms/MD5E-s4389--a541dfb887b9b8f2c5cc47bd6a33b764.rms
            │       │   ├── rot.png -> ../../../../../.git/annex/objects/4Z/M1/MD5E-s4225--2a3b6ca8e4886fc56fad285db2643c3a.png/MD5E-s4225--2a3b6ca8e4886fc56fad285db2643c3a.png
            │       │   └── trans.png -> ../../../../../.git/annex/objects/3z/gm/MD5E-s5657--8e1959fa13d9f243969067fb5caec5de.png/MD5E-s5657--8e1959fa13d9f243969067fb5caec5de.png
            │       ├── mean_func.nii.gz -> ../../../../.git/annex/objects/7P/77/MD5E-s553272--ef64c68bc90b329983af142d4a087288.nii.gz/MD5E-s553272--ef64c68bc90b329983af142d4a087288.nii.gz
            │       ├── reg
            │       │   ├── example_func2highres.mat -> ../../../../../.git/annex/objects/zP/p9/MD5E-s185--f59c0ce4d7f7999a069c067a29840285.mat/MD5E-s185--f59c0ce4d7f7999a069c067a29840285.mat
            │       │   ├── example_func2highres.png -> ../../../../../.git/annex/objects/v7/17/MD5E-s2062548--0ebc194d182b713a6b8c40d75f36ab9f.png/MD5E-s2062548--0ebc194d182b713a6b8c40d75f36ab9f.png
            │       │   ├── example_func2standard1.png -> ../../../../../.git/annex/objects/Jp/0K/MD5E-s319695--17ca023f22fde3e49cdfcf10041ce0da.png/MD5E-s319695--17ca023f22fde3e49cdfcf10041ce0da.png
            │       │   ├── example_func2standard.mat -> ../../../../../.git/annex/objects/W2/XF/MD5E-s187--3e8c94f30a78f33a099d8df65115975d.mat/MD5E-s187--3e8c94f30a78f33a099d8df65115975d.mat
            │       │   ├── example_func2standard.nii.gz -> ../../../../../.git/annex/objects/fg/2v/MD5E-s2757548--29e7541037791885acba38dc936cffa1.nii.gz/MD5E-s2757548--29e7541037791885acba38dc936cffa1.nii.gz
            │       │   ├── example_func2standard.png -> ../../../../../.git/annex/objects/xx/kX/MD5E-s533384--b539d1cc99d230fc8f187d583f095e83.png/MD5E-s533384--b539d1cc99d230fc8f187d583f095e83.png
            │       │   ├── example_func2standard_warp.nii.gz -> ../../../../../.git/annex/objects/fJ/jP/MD5E-s8878337--b70086c3ab91c2f3e0f714d458ac636f.nii.gz/MD5E-s8878337--b70086c3ab91c2f3e0f714d458ac636f.nii.gz
            │       │   ├── highres2standard.png -> ../../../../../.git/annex/objects/zZ/6k/MD5E-s489218--8d9950545b094b962952bcbd163928d4.png/MD5E-s489218--8d9950545b094b962952bcbd163928d4.png
            │       │   └── unwarp
            │       │       ├── EF_D_edges.gif -> ../../../../../../.git/annex/objects/ZP/m2/MD5E-s3429185--52bae5f82866786ae2f73cc477fc3a5d.gif/MD5E-s3429185--52bae5f82866786ae2f73cc477fc3a5d.gif
            │       │       ├── EF_UD_movie.gif -> ../../../../../../.git/annex/objects/Ff/J8/MD5E-s6896516--0c39208c79668f353a177a5ebe34cafb.gif/MD5E-s6896516--0c39208c79668f353a177a5ebe34cafb.gif
            │       │       ├── EF_UD_shift+mag.png -> ../../../../../../.git/annex/objects/pk/94/MD5E-s279023--603ec6e0ea418e75d208d8fe29714343.png/MD5E-s279023--603ec6e0ea418e75d208d8fe29714343.png
            │       │       ├── EF_U_edges.gif -> ../../../../../../.git/annex/objects/g8/xP/MD5E-s3467264--ffb4d63617fa8b1cadd5df8e2e8c8e24.gif/MD5E-s3467264--ffb4d63617fa8b1cadd5df8e2e8c8e24.gif
            │       │       ├── example_func_distorted2highres.png -> ../../../../../../.git/annex/objects/2J/jm/MD5E-s1949162--41d84b768223492b4c0c906819c8cd9e.png/MD5E-s1949162--41d84b768223492b4c0c906819c8cd9e.png
            │       │       ├── fieldmap2edges.png -> ../../../../../../.git/annex/objects/0w/V6/MD5E-s3371871--bb6eca0d76538099ed3b50265893a4e0.png/MD5E-s3371871--bb6eca0d76538099ed3b50265893a4e0.png
            │       │       ├── fieldmap_fout_to_T1_brain_rad.nii.gz -> ../../../../../../.git/annex/objects/vv/8Q/MD5E-s6767006--a57cc73f20883f63cae05b16cdeb7b2d.nii.gz/MD5E-s6767006--a57cc73f20883f63cae05b16cdeb7b2d.nii.gz
            │       │       ├── fmap+mag.png -> ../../../../../../.git/annex/objects/M8/GF/MD5E-s2038397--e4701e74e60a07a117ff8c70105c54a7.png/MD5E-s2038397--e4701e74e60a07a117ff8c70105c54a7.png
            │       │       ├── FM_UD_fmap_mag_brain2str.png -> ../../../../../../.git/annex/objects/Vw/fW/MD5E-s2433108--53704ebc728eeea803d3b48d33c63647.png/MD5E-s2433108--53704ebc728eeea803d3b48d33c63647.png
            │       │       └── FM_UD_sigloss+mag.png -> ../../../../../../.git/annex/objects/j0/3X/MD5E-s546163--a91bbde1e450ef314dae44729f091c6a.png/MD5E-s546163--a91bbde1e450ef314dae44729f091c6a.png
            │       ├── report.html -> ../../../../.git/annex/objects/04/12/MD5E-s910--3e3b39b50d61177aca93102689e933f2.html/MD5E-s910--3e3b39b50d61177aca93102689e933f2.html
            │       ├── report_log.html -> ../../../../.git/annex/objects/fG/G0/MD5E-s72344--37b74d02cc07000d0c8dd0f3043f92d6.html/MD5E-s72344--37b74d02cc07000d0c8dd0f3043f92d6.html
            │       ├── report_prestats.html -> ../../../../.git/annex/objects/Zz/mx/MD5E-s1475--753b9246df4197deaf7a41d365876845.html/MD5E-s1475--753b9246df4197deaf7a41d365876845.html
            │       ├── report_reg.html -> ../../../../.git/annex/objects/3P/42/MD5E-s2256--9853a45ff585f8838d9794df06eab5b1.html/MD5E-s2256--9853a45ff585f8838d9794df06eab5b1.html
            │       └── report_unwarp.html -> ../../../../.git/annex/objects/KQ/gW/MD5E-s2415--c7bf406d7acea715551ae0e35ef59b6e.html/MD5E-s2415--c7bf406d7acea715551ae0e35ef59b6e.html
            └── T1
                ├── T1_brain_mask.nii.gz -> ../../../.git/annex/objects/Gx/P7/MD5E-s539998--bcff151200027ab12c65a07384697c80.nii.gz/MD5E-s539998--bcff151200027ab12c65a07384697c80.nii.gz
                ├── T1_brain.nii.gz -> ../../../.git/annex/objects/5Q/X5/MD5E-s3819921--f44ab545040d35be8a5604d491fd73c1.nii.gz/MD5E-s3819921--f44ab545040d35be8a5604d491fd73c1.nii.gz
                ├── T1_brain_to_MNI.nii.gz -> ../../../.git/annex/objects/G7/kP/MD5E-s3862977--7ad20e5ef12ee7c992b83340c86b7032.nii.gz/MD5E-s3862977--7ad20e5ef12ee7c992b83340c86b7032.nii.gz
                ├── T1_fast
                │   ├── T1_brain_bias.nii.gz -> ../../../../.git/annex/objects/Jx/Fq/MD5E-s5952225--8153c188faa2cb009c13829e0124cc84.nii.gz/MD5E-s5952225--8153c188faa2cb009c13829e0124cc84.nii.gz
                │   ├── T1_brain_pve_0.nii.gz -> ../../../../.git/annex/objects/X2/ZJ/MD5E-s833340--9d7193d9f224aada4b5da4fb1933380e.nii.gz/MD5E-s833340--9d7193d9f224aada4b5da4fb1933380e.nii.gz
                │   ├── T1_brain_pve_1.nii.gz -> ../../../../.git/annex/objects/Kw/84/MD5E-s1555592--c229c69751bb6e51f46ee201c943d512.nii.gz/MD5E-s1555592--c229c69751bb6e51f46ee201c943d512.nii.gz
                │   ├── T1_brain_pve_2.nii.gz -> ../../../../.git/annex/objects/jz/9Z/MD5E-s921396--cacb431b800f09cb1d8536b13013000f.nii.gz/MD5E-s921396--cacb431b800f09cb1d8536b13013000f.nii.gz
                │   └── T1_brain_seg.nii.gz -> ../../../../.git/annex/objects/FQ/6w/MD5E-s412320--5b33d9627ef62db33a9fec0c0f623f59.nii.gz/MD5E-s412320--5b33d9627ef62db33a9fec0c0f623f59.nii.gz
                ├── T1_first
                │   ├── T1_first_all_fast_firstseg.nii.gz -> ../../../../.git/annex/objects/Q5/1P/MD5E-s54672--42901bf6dfbc7e9d25a88e16591c9f73.nii.gz/MD5E-s54672--42901bf6dfbc7e9d25a88e16591c9f73.nii.gz
                │   ├── T1_first-BrStem_first.bvars -> ../../../../.git/annex/objects/jk/vQ/MD5E-s1555--c6a462407f403340adfa290952a73871/MD5E-s1555--c6a462407f403340adfa290952a73871
                │   ├── T1_first-BrStem_first.vtk -> ../../../../.git/annex/objects/xK/fW/MD5E-s32580--91ee8d365c67e30255605b08ba7f1ea3.vtk/MD5E-s32580--91ee8d365c67e30255605b08ba7f1ea3.vtk
                │   ├── T1_first-L_Accu_first.bvars -> ../../../../.git/annex/objects/JK/xj/MD5E-s1555--103cfa0f07c4c73b38989ffaed8dc647/MD5E-s1555--103cfa0f07c4c73b38989ffaed8dc647
                │   ├── T1_first-L_Accu_first.vtk -> ../../../../.git/annex/objects/20/qf/MD5E-s32614--4f7472bbb644438306679196a8253784.vtk/MD5E-s32614--4f7472bbb644438306679196a8253784.vtk
                │   ├── T1_first-L_Amyg_first.bvars -> ../../../../.git/annex/objects/QX/PX/MD5E-s1563--0d4eb8067bba528ebccb86749a178c3f/MD5E-s1563--0d4eb8067bba528ebccb86749a178c3f
                │   ├── T1_first-L_Amyg_first.vtk -> ../../../../.git/annex/objects/WX/94/MD5E-s32620--523fff20ac3c565553c30fc97af3d78f.vtk/MD5E-s32620--523fff20ac3c565553c30fc97af3d78f.vtk
                │   ├── T1_first-L_Caud_first.bvars -> ../../../../.git/annex/objects/MW/V2/MD5E-s1563--62bc6953894bfe640e6342097d7e0b24/MD5E-s1563--62bc6953894bfe640e6342097d7e0b24
                │   ├── T1_first-L_Caud_first.vtk -> ../../../../.git/annex/objects/4m/1j/MD5E-s49544--1e0a78cea67d0987c5e401a0a770a152.vtk/MD5E-s49544--1e0a78cea67d0987c5e401a0a770a152.vtk
                │   ├── T1_first-L_Hipp_first.bvars -> ../../../../.git/annex/objects/pq/0x/MD5E-s1563--82479b46cefacb21ac14b1aedc9b8ce4/MD5E-s1563--82479b46cefacb21ac14b1aedc9b8ce4
                │   ├── T1_first-L_Hipp_first.vtk -> ../../../../.git/annex/objects/jj/qX/MD5E-s37224--5eaad87fa8ec5918e06236b71d8b7078.vtk/MD5E-s37224--5eaad87fa8ec5918e06236b71d8b7078.vtk
                │   ├── T1_first-L_Pall_first.bvars -> ../../../../.git/annex/objects/j9/8f/MD5E-s1561--b4c3c5ac646008bf6f485093cffc1cbc/MD5E-s1561--b4c3c5ac646008bf6f485093cffc1cbc
                │   ├── T1_first-L_Pall_first.vtk -> ../../../../.git/annex/objects/8z/PJ/MD5E-s32582--d32de3b9b9bfa791d075b133d3ff87a2.vtk/MD5E-s32582--d32de3b9b9bfa791d075b133d3ff87a2.vtk
                │   ├── T1_first-L_Puta_first.bvars -> ../../../../.git/annex/objects/Z0/kf/MD5E-s1561--b6d28dd6090e271626453b9e173a93bd/MD5E-s1561--b6d28dd6090e271626453b9e173a93bd
                │   ├── T1_first-L_Puta_first.vtk -> ../../../../.git/annex/objects/XW/5F/MD5E-s32602--46d5c5dafd5ffd1086ef5cf0bce37950.vtk/MD5E-s32602--46d5c5dafd5ffd1086ef5cf0bce37950.vtk
                │   ├── T1_first-L_Thal_first.bvars -> ../../../../.git/annex/objects/jK/fP/MD5E-s1561--b1d9daa8ffdbdd734df2529c707d9169/MD5E-s1561--b1d9daa8ffdbdd734df2529c707d9169
                │   ├── T1_first-L_Thal_first.vtk -> ../../../../.git/annex/objects/jG/47/MD5E-s32593--a4f9c5f963122eda47220c1af667483a.vtk/MD5E-s32593--a4f9c5f963122eda47220c1af667483a.vtk
                │   ├── T1_first-R_Accu_first.bvars -> ../../../../.git/annex/objects/Xm/Xq/MD5E-s1555--d3a235caa580934f00e2d2dd2d733ca4/MD5E-s1555--d3a235caa580934f00e2d2dd2d733ca4
                │   ├── T1_first-R_Accu_first.vtk -> ../../../../.git/annex/objects/0J/k2/MD5E-s32604--3f4485c8b8c662ff995d67f4307aaa9c.vtk/MD5E-s32604--3f4485c8b8c662ff995d67f4307aaa9c.vtk
                │   ├── T1_first-R_Amyg_first.bvars -> ../../../../.git/annex/objects/Xv/Gg/MD5E-s1563--5f15072c8946d1e28c63f5436ae217ed/MD5E-s1563--5f15072c8946d1e28c63f5436ae217ed
                │   ├── T1_first-R_Amyg_first.vtk -> ../../../../.git/annex/objects/jQ/mm/MD5E-s32612--272407084716c891b1da3fbbf025f960.vtk/MD5E-s32612--272407084716c891b1da3fbbf025f960.vtk
                │   ├── T1_first-R_Caud_first.bvars -> ../../../../.git/annex/objects/4G/z2/MD5E-s1563--462573fd1aed8118a56746ea9761a7a4/MD5E-s1563--462573fd1aed8118a56746ea9761a7a4
                │   ├── T1_first-R_Caud_first.vtk -> ../../../../.git/annex/objects/xf/63/MD5E-s54989--7c24733d70a35436aa3be836beccbbb2.vtk/MD5E-s54989--7c24733d70a35436aa3be836beccbbb2.vtk
                │   ├── T1_first-R_Hipp_first.bvars -> ../../../../.git/annex/objects/02/MQ/MD5E-s1563--719ce159c5e2e25a187b4209142801e0/MD5E-s1563--719ce159c5e2e25a187b4209142801e0
                │   ├── T1_first-R_Hipp_first.vtk -> ../../../../.git/annex/objects/Gg/Xj/MD5E-s33706--c209a1f3ed1bb1cdcc13287b190288ba.vtk/MD5E-s33706--c209a1f3ed1bb1cdcc13287b190288ba.vtk
                │   ├── T1_first-R_Pall_first.bvars -> ../../../../.git/annex/objects/PV/MF/MD5E-s1561--4ff3ebb4caa8201285e1282fa2f7d07e/MD5E-s1561--4ff3ebb4caa8201285e1282fa2f7d07e
                │   ├── T1_first-R_Pall_first.vtk -> ../../../../.git/annex/objects/zw/5V/MD5E-s32597--ec7642786b38a7ecdc7d9380c7e99611.vtk/MD5E-s32597--ec7642786b38a7ecdc7d9380c7e99611.vtk
                │   ├── T1_first-R_Puta_first.bvars -> ../../../../.git/annex/objects/q0/0v/MD5E-s1561--1a9eb3a675d160f73eefd6e689ed9a8f/MD5E-s1561--1a9eb3a675d160f73eefd6e689ed9a8f
                │   ├── T1_first-R_Puta_first.vtk -> ../../../../.git/annex/objects/Mq/kX/MD5E-s32628--1d2f4344df2c6c75a6d5b2ff010e5cfe.vtk/MD5E-s32628--1d2f4344df2c6c75a6d5b2ff010e5cfe.vtk
                │   ├── T1_first-R_Thal_first.bvars -> ../../../../.git/annex/objects/2x/j2/MD5E-s1561--081cb66efb9573a491b74b4d848b68e6/MD5E-s1561--081cb66efb9573a491b74b4d848b68e6
                │   └── T1_first-R_Thal_first.vtk -> ../../../../.git/annex/objects/G7/5p/MD5E-s32608--3a981d0920d30d989911ab700fbe7be8.vtk/MD5E-s32608--3a981d0920d30d989911ab700fbe7be8.vtk
                ├── T1_orig_defaced.nii.gz -> ../../../.git/annex/objects/k5/Wz/MD5E-s20284968--5468c35d0dd550615764326d5ac10ab1.nii.gz/MD5E-s20284968--5468c35d0dd550615764326d5ac10ab1.nii.gz
                ├── T1_unbiased_brain.nii.gz -> ../../../.git/annex/objects/p9/33/MD5E-s6410788--5a9b65be9947cab036f35ad18e96eda8.nii.gz/MD5E-s6410788--5a9b65be9947cab036f35ad18e96eda8.nii.gz
                └── transforms
                    ├── T1_to_MNI_linear.mat -> ../../../../.git/annex/objects/4k/J6/MD5E-s189--c6abbec22f6c11eecb928e5544eedfc6.mat/MD5E-s189--c6abbec22f6c11eecb928e5544eedfc6.mat
                    └── T1_to_MNI_warp_coef.nii.gz -> ../../../../.git/annex/objects/px/x9/MD5E-s118638--591edf07be6feff791ca0c16861283c1.nii.gz/MD5E-s118638--591edf07be6feff791ca0c16861283c1.nii.gz

Thank you,

`ukb-update` fails with `--drop extracted` if only non-zip data types are available

Sometimes (rarely) participants have only non-zip data types available for download. If --drop extracted is used in this case, it will fail with

Remote 'datalad-archives' is not available. Command failed:                                              
RemoteNotAvailableError: 'annex drop --in datalad-archives --branch incoming-native --json --json-error-messages'

This is understandable given that datalad-archives was never enabled.

To reproduce:

❱ datalad create sub-0001234 && cd sub-0001234

❱ datalad ukb-init --bids 0001234 25747_2_0 25748_2_0 25749_2_0 

❱ datalad -c datalad.ukbiobank.keyfile=none ukb-update --merge --drop extracted 
[INFO   ] == Command start (output follows) ===== 
sending incremental file list
0001234_25747_2_0.adv
         86.06K 100%   50.83MB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 86.20K bytes  received 35 bytes  172.46K bytes/sec
total size is 86.06K  speedup is 1.00
sending incremental file list
0001234_25748_2_0.txt
        133.69K 100%   96.25MB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 133.84K bytes  received 35 bytes  267.75K bytes/sec
total size is 133.69K  speedup is 1.00
sending incremental file list
0001234_25749_2_0.ed2
         75.36K 100%   40.61MB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 75.49K bytes  received 35 bytes  151.04K bytes/sec
total size is 75.36K  speedup is 1.00
[INFO   ] == Command exit (modification check follows) ===== 
ukb_update(ok): . (dataset)                                                                              
ukb_bidsify(ok): 25747_2_0.adv (file)                                                                    
ukb_bidsify(ok): 25748_2_0.txt (file)
ukb_bidsify(ok): 25749_2_0.ed2 (file)
Remote 'datalad-archives' is not available. Command failed:                                              
RemoteNotAvailableError: 'annex drop --in datalad-archives --branch incoming-native --json --json-error-messages'
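A possible interim guard, assuming the failure is solely due to the never-enabled special remote, would be to skip the drop step when no datalad-archives remote was ever configured. A minimal shell sketch (git-annex records special remote configurations in remote.log on the git-annex branch):

# run the drop only if the 'datalad-archives' special remote was ever initialized
if git show git-annex:remote.log 2>/dev/null | grep -q 'name=datalad-archives'; then
    git annex drop --in datalad-archives --branch incoming-native
else
    echo "datalad-archives was never enabled; nothing to drop" >&2
fi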

Proposal: per participant raw data tracking

The starting point for all processing should be a per-participant dataset that tracks the data in as raw a form as possible, given what is provided. For fMRI that is DICOM; for other imaging modalities it is NIfTI.

Datasets should use the add-archive functionality of DataLad and track the downloadable ZIP files directly, but also register their extracted content. This makes it possible to use these datasets directly as input for further processing and normalization, but also use their original form (ZIP files) for preservation.
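A minimal sketch of that flow, assuming a hypothetical download named <eid>_<field>.zip and that the add-archive-content command is available:

# track the pristine ZIP itself
datalad save -m "Add pristine UKB download" <eid>_<field>.zip
# extract it and register its content, so individual files remain re-obtainable from the archive
datalad add-archive-content <eid>_<field>.zip
# extracted copies can later be dropped; the ZIP remains the authoritative source
datalad drop <extracted-path>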

It seems that the number of files across all relevant data records is manageable enough to not require further subdatasets (at least for the NIfTI parts), but DICOM handling needs to be investigated. Possibly, we can use https://github.com/psychoinformatics-de/datalad-hirni to put DICOMs automatically in subdatasets.

BIDS validation issues: Non-compliant non-bids directory

I am exploring whether the resulting datasets are BIDS-compliant enough to run fMRIprep on them. I will report on all problems I encounter as issues.

non-bids directories

1. The non-bids directory is not compliant:
	1: [ERR] Files with such naming scheme are not part of BIDS specification. This error is most commonly caused by typos in file names that make them not BIDS compatible. Please consult the specification and make sure your files are named correctly. If this is not a file naming issue (for example when including files not yet covered by the BIDS specification) you should include a ".bidsignore" file in your dataset (see https://github.com/bids-standard/bids-validator#bidsignore for details). Please note that derived (processed) data should be placed in /derivatives folder and source data (such as DICOMS or behavioural logs in proprietary formats) should be placed in the /sourcedata folder. (code: 1 - NOT_INCLUDED)
		./sub-100****/ses-2/non-bids/SWI/SOS_TE1.nii.gz
			Evidence: SOS_TE1.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SOS_TE2.nii.gz
			Evidence: SOS_TE2.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SWI.nii.gz
			Evidence: SWI.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SWI_TOTAL_MAG_TE2_orig.nii.gz
			Evidence: SWI_TOTAL_MAG_TE2_orig.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SWI_TOTAL_MAG_orig.nii.gz
			Evidence: SWI_TOTAL_MAG_orig.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SWI_TOTAL_MAG_to_T1.nii.gz
			Evidence: SWI_TOTAL_MAG_to_T1.nii.gz
		./sub-100****/ses-2/non-bids/SWI/SWI_to_T1.mat
			Evidence: SWI_to_T1.mat
		./sub-100****/ses-2/non-bids/SWI/T1_to_SWI.mat
			Evidence: T1_to_SWI.mat
		./sub-100****/ses-2/non-bids/SWI/T2star.nii.gz
			Evidence: T2star.nii.gz
		./sub-100****/ses-2/non-bids/SWI/T2star_to_T1.nii.gz
			Evidence: T2star_to_T1.nii.gz
		... and 866 more files having this issue (Use --verbose to see them all).

This can be fixed by placing a .bidsignore file containing **/non-bids/** into the directory that the subject directories live in (i.e., not inside sub-*/, but one level up). Since it cannot live inside the subject directories, I don't think this is something that can be accommodated during the data download of individual subjects. I'm reporting it here anyway, because it could be addressed by placing a .bidsignore file into a final superdataset.
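For reference, a minimal sketch of that fix, run in a hypothetical superdataset root one level above the sub-*/ directories:

printf '**/non-bids/**\n' > .bidsignore
datalad save -m "Ignore non-bids directories during BIDS validation" .bidsignore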

Adjust for API deprecation

DeprecationWarning: datalad add_archive_content's annex parameter is deprecated and will be removed in a future release. Use the 'dataset' parameter instead.

Add unittests

Now that we have identified the ability to shim ukbfetch, nothing prevents adding tests (no matter how simple).
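For example, a test could prepend a directory containing a trivial shim to PATH so that ukb-update "downloads" pre-staged fixture files. A rough sketch of such a shim (UKB_TESTDATA is a hypothetical variable pointing at the fixture directory):

#!/bin/sh
# fake ukbfetch for tests: ignore all arguments and copy fixture files
# into the current directory instead of contacting UKB
set -eu
cp "${UKB_TESTDATA:?point this at a directory with fixture files}"/* .

A test would then put this shim's directory at the front of PATH before invoking ukb-update.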

IDs used by datalad-ukb

I have to put a dataset that contains data from the UK Biobank under DataLad, and decided to try datalad-ukbiobank. I'm receiving various errors, but I believe I should start by making sure that I understand the IDs DataLad expects.

When I check my Biobank project description, I have something called an Application ID and a Bucket ID, but ukb-init expects PARTICIPANT-ID and DATARECORD-ID. I've used the Application ID for PARTICIPANT-ID and the Bucket ID for DATARECORD-ID, but I'm not sure whether this was right; the IDs shown in the help have a different format.

document removal of subjects

The UKB occasionally provides notifications of subjects who withdraw from the study. We should document how one can/should remove a specific subject's data; specifically the right way to drop annex keys to ensure compliance.
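A rough sketch of what such documentation could start from, assuming a per-participant subdataset layout (names are illustrative) and a recent DataLad where drop supports --what; copies held by storage siblings and special remotes must be dropped as well to actually purge the data:

# first drop copies held by a special remote (run against the participant dataset), e.g.
git -C sub-1234567 annex drop --from <special-remote-name> --all
# then drop all local annexed content and keys of the withdrawn participant's dataset
datalad drop --what allkeys -r -d sub-1234567
# finally unregister and delete the subdataset from the superdataset
# (may need --reckless availability, since no copy is meant to survive)
datalad remove -d . sub-1234567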

creating a datalad-ukb dataset in a non-empty directory

Could you give me some advice on how to create a datalad-ukb dataset from already downloaded data? I can create a DataLad dataset and add the files, and I can also run ukb-init and ukb-update, but I believe ukb-update doesn't recognize the existing files and just downloads everything again. Am I right?

download of 1000 subject subset with condor

I tested out the ukb_create_participant_ds and ukb_update_participant_ds scripts created by @mih, using condor to download a 1000-subject subset.

To start, I created a csv file with a list of the subjects and modalities that I wanted.

0001234,20227_2_0,20249_2_0,20252_2_0
0001235,20227_2_0,20249_2_0,20252_2_0
0001236,20227_2_0,20249_2_0,20252_2_0
0001237,20227_2_0,20249_2_0,20252_2_0
0001238,20227_2_0,20249_2_0,20252_2_0
0001239,20227_2_0,20249_2_0,20252_2_0
...

Then, I used the following to call the scripts and submit jobs to condor:

To create the single-participant datasets:
./ukb_create_submit_gen.sh | condor_submit

ukb_create_submit_gen.sh

#!/bin/sh

logs_dir=~/logs/ukb/create
# create the logs dir if it doesn't exist
[ ! -d "$logs_dir" ] && mkdir -p "$logs_dir"

# print the .submit header
printf "# The environment
universe       = vanilla
getenv         = True
request_cpus   = 1
request_memory = 1G

# Execution
initial_dir    = /data/project/rehab_biobank/1000_subset/
executable     = /data/project/rehab_biobank/1000_subset/ukb_create_participant_ds
\n"

# create a job for each subject
for line in $(cat subset_rfrmi_tfrmi_t1.csv); do
    subject_id=${line%%,*} && line=${line#${subject_id},}
    modalities=$(echo ${line} | sed 's/,/ /g')
    printf "arguments = ${subject_id} ${subject_id} ${modalities}\n"
    printf "log       = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).log\n"
    printf "output    = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).out\n"
    printf "error     = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).err\n"
    printf "Queue\n\n"
done

To download the data:
./ukb_update_submit_gen.sh | condor_submit

ukb_update_submit_gen.sh

#!/bin/sh

logs_dir=~/logs/ukb/update
# create the logs dir if it doesn't exist
[ ! -d "$logs_dir" ] && mkdir -p "$logs_dir"

# print the .submit header
printf "# The environment
universe       = vanilla
getenv         = True
request_cpus   = 1
request_memory = 1G

# Execution
initial_dir    = /data/project/rehab_biobank/1000_subset/
executable     = /data/project/rehab_biobank/1000_subset/ukb_update_participant_ds
\n"

# create a job for each subject
for line in $(cat subset_rfrmi_tfrmi_t1.csv); do
    subject_id=${line%%,*} && line=${line#${subject_id},}
    printf "arguments = ${subject_id} ../.ukbkey\n"
    printf "log       = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).log\n"
    printf "output    = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).out\n"
    printf "error     = ${logs_dir}/sub-${subject_id}_\$(Cluster).\$(Process).err\n"
    printf "Queue\n\n"
done

UKBB possible steps for raw data laid out by Michael

@mih : lmk if I missed anything

  1. Download -> UKBB zip
  2. DL save in branch incoming
  3. DL add-archive in branch incoming-processed
    Note: this is to extract the zip - see add-archive DL command
  4. git merge incoming-processed in master
  5. DL drop all extracted
  6. config RIA special remote pointing to the actual storage
  7. DL publish onto storage
    Note: the actual repo needs to be pushed somewhere as well
  8. throw away local clones
  9. create BIDS - code lives on super DS
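A rough shell sketch of steps 1-8, with illustrative branch and sibling names, assuming the add-archive-content and create-sibling-ria commands are available:

# 1-2: download the UKB ZIP and save it on the incoming branch
git checkout incoming
# ... ukbfetch places <eid>_<field>.zip here ...
datalad save -m "Pristine UKB download"

# 3: extract and register the archive content on incoming-processed
git checkout -b incoming-processed
datalad add-archive-content <eid>_<field>.zip

# 4-5: merge into master and drop the extracted copies locally
git checkout master
git merge incoming-processed
datalad drop .

# 6-7: register a RIA store and push dataset plus content to it
datalad create-sibling-ria -s storage "ria+file:///path/to/store"
datalad push --to storage

# 8: the local clone can now be thrown away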

add all fields but fetch some

Currently, datalad ukb-init -d . <eid> field1 field2 allows adding, for a participant, all fields a project has access to. However, we may only want to download incrementally. It would be nice if ukb-update processed all fields by default, but only specified fields if asked to.

An option to add fields after init would also be nice. It is easy to do by going in and changing .ukbbatch, but a helper may be useful.
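Until such a helper exists, a rough sketch of the manual route, assuming the .ukbbatch file lists one <eid> <data-record> pair per line:

# add another data record for this participant to the batch file
echo "<eid> <data-record>" >> .ukbbatch
datalad save -m "Track an additional data record" .ukbbatch
# re-run the update to download and integrate the extended selection
datalad -c datalad.ukbiobank.keyfile=<pathtoaccesstoken> ukb-update --merge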

Update Appveyor config to use new codecov uploader

On June 9, Codecov released a new version of their coverage uploader program to replace the old bash uploader, which will stop working on February 1 and is currently experiencing scheduled brownouts; see this blog post for more information. This repository's AppVeyor config installs the codecov uploader provided by Chocolatey, which has not yet been updated to the newer version. If the Chocolatey package is not updated in time, the AppVeyor config must be updated to install the new codecov uploader directly; see the linked blog post for instructions.

[ERROR] dataset containing given paths is not underneath the reference dataset

What is the problem?

I am not able to run ukb-update under a directory structure which contains a git repository. I get the following error:

[ERROR] dataset containing given paths is not underneath the reference dataset

However, if the dataset is under a directory structure which does not contain a git repository, it works.

What steps will reproduce the problem?

# Create a test structure with a git repository initialized at the top
mkdir test && cd test && git init && mkdir data && cd data
# datalad commands
datalad create 1005393 && cd 1005393
datalad ukb-init -f --bids 1005393 20252_2_0
datalad -c datalad.ukbiobank.keyfile=keyfile ukb-update

What version of DataLad are you using (run datalad --version)? On what operating system (consider running datalad wtf)?

version: datalad 0.14
os: ubuntu 20.04

Is there anything else that would be useful to know in this context?

No

Have you had any success using DataLad before? (to assess your expertise/prior luck. We would welcome your testimonial additions to https://github.com/datalad/datalad/wiki/Testimonials as well)

Yes. ukb-update works if the directories above do not have a git repo.

no update tests on windows

While enabling Windows tests in PR #59, the tests for update were skipped. This is because they rely on a drop-in replacement for ukbfetch defined in test_update.py. The drop-in currently is a shebang script that probably(?) doesn't work on Windows. In addition, the respective change to PATH hardcodes : as the separator; I think this needs to become ; on Windows (i.e., use os.pathsep).

BIDS validation issues: Unsupported BEP004?

I am exploring whether the resulting datasets are BIDS-compliant enough to run fMRIprep on them. I will report on all problems I encounter as issues.

swi modality seems unsupported?

2. The SWI files don't seem to comply with the specification
	1: [ERR] Files with such naming scheme are not part of BIDS specification. This error is most commonly caused by typos in file names that make them not BIDS compatible. Please consult the specification and make sure your files are named correctly. If this is not a file naming issue (for example when including files not yet covered by the BIDS specification) you should include a ".bidsignore" file in your dataset (see https://github.com/bids-standard/bids-validator#bidsignore for details). Please note that derived (processed) data should be placed in /derivatives folder and source data (such as DICOMS or behavioural logs in proprietary formats) should be placed in the /sourcedata folder. (code: 1 - NOT_INCLUDED)
		./sub-100****/ses-2/swi/sub-100****_ses-2_part-mag_echo-1_rec-norm_GRE.json
			Evidence: sub-1002708_ses-2_part-mag_echo-1_rec-norm_GRE.json
		./sub-100****/ses-2/swi/sub-100****_ses-2_part-mag_echo-1_rec-norm_GRE.nii.gz
			Evidence: sub-1002708_ses-2_part-mag_echo-1_rec-norm_GRE.nii.gz
		./sub-100****/ses-2/swi/sub-100****_ses-2_part-mag_echo-2_rec-norm_GRE.json
			Evidence: sub-1002708_ses-2_part-mag_echo-2_rec-norm_GRE.json
		./sub-100****/ses-2/swi/sub-100****_ses-2_part-phase_echo-1_GRE.json
			Evidence: sub-1002708_ses-2_part-phase_echo-1_GRE.json
		./sub-100****/ses-2/swi/sub-100****_ses-2_part-phase_echo-2_GRE.json
			Evidence: sub-1002708_ses-2_part-phase_echo-2_GRE.json
		./sub-100****/ses-3/swi/sub-100****_ses-3_part-mag_echo-1_rec-norm_GRE.json
			Evidence: sub-1002708_ses-3_part-mag_echo-1_rec-norm_GRE.json
		./sub-100****/ses-3/swi/sub-100****_ses-3_part-mag_echo-1_rec-norm_GRE.nii.gz
			Evidence: sub-1002708_ses-3_part-mag_echo-1_rec-norm_GRE.nii.gz
		./sub-100****/ses-3/swi/sub-100****_ses-3_part-mag_echo-2_rec-norm_GRE.json
			Evidence: sub-1002708_ses-3_part-mag_echo-2_rec-norm_GRE.json
		./sub-100****/ses-3/swi/sub-100****_ses-3_part-phase_echo-1_GRE.json
			Evidence: sub-1002708_ses-3_part-phase_echo-1_GRE.json
		./sub-100****/ses-3/swi/sub-100****_ses-3_part-phase_echo-2_GRE.json
			Evidence: sub-1002708_ses-3_part-phase_echo-2_GRE.json

It seems as if the naming is compliant with BEP004, but BEP004 hasn't made it into the BIDS specification yet. Adding **/swi/** to the .bidsignore file mentioned in #23 seems like a sensible approach.
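A minimal sketch, extending the same hypothetical top-level .bidsignore from the non-bids issue:

printf '**/swi/**\n' >> .bidsignore
datalad save -m "Ignore SWI data until BEP004 is part of BIDS" .bidsignore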
