razorx89 / roco-dataset

Radiology Objects in COntext (ROCO): A Multimodal Image Dataset

Python 100.00%

dataset radiology medical image-retrieval umls semantic-types cuis

roco-dataset's Issues

Windows compatibility issue

On Windows, multiprocessing starts worker processes with the spawn method rather than fork, so workers cannot rely on seeing global variables that the parent process set at runtime. On Linux, fork has traditionally been the default start method, which is why the same code works there.

Questions about the RoCo version released on Kaggle

After multiple failed download attempts, I discovered a ROCO dataset on the Kaggle platform. Is there any difference between that dataset and the one I was trying to download, or is it just another version uploaded to Kaggle by someone else? I'm very confused...

zlib error

Hi,

I keep getting "zlib.error: Error -3 while decompressing data: invalid code lengths set" or other zlib errors during the download. Is there a fix for the script, or is this a data problem?

Cheers
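
Errors like this usually point to a truncated or corrupted archive rather than a bug in the decompression itself, although that is an assumption here and not something confirmed in this thread. A minimal sketch of a check that could be run on a downloaded .tar.gz before extracting it, so that a broken file can be deleted and fetched again; `gzip_is_intact` is a hypothetical helper, not part of fetch.py.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as stream:
            # Reading to the end forces the CRC and length checks.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (Python 3.8+) and I/O failures,
        # EOFError signals a truncated file, zlib.error a corrupt stream.
        return False
    return True
```

Catching OSError instead of gzip.BadGzipFile keeps the check usable on Python versions older than 3.8, where that exception class does not exist.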

Windows Error when running fetch.py

Hello, I wanted to know if someone can help me with the following issue when running the script fetch.py on Windows 10 with Python 3.8.11.

I get the following output and error message:

Configuration:
Subdirectory: images
Extraction directory: C:\Users\franc\AppData\Local\Temp\roco-dataset
Keep archives: False
Delete contents of extraction directory: True
Number of processes: 4
Number of download retries: 10
Fetching ROCO dataset images...
multiprocessing.pool.RemoteTraceback:

Traceback (most recent call last):
  File "C:\Users\franc\anaconda3\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\franc\roco-dataset\scripts\fetch.py", line 131, in process_group
    result = download_archive(extraction_dir_name, archive_url,
  File "C:\Users\franc\roco-dataset\scripts\fetch.py", line 209, in download_archive
    return subprocess.call(['wget', '-nc', '-nd', '-c', '-q', '-P',
  File "C:\Users\franc\anaconda3\lib\subprocess.py", line 340, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\franc\anaconda3\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\franc\anaconda3\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts\fetch.py", line 330, in <module>
    for i, pmc_id in enumerate(pool.imap_unordered(process_group,
  File "C:\Users\franc\anaconda3\lib\multiprocessing\pool.py", line 868, in next
    raise value
FileNotFoundError: [WinError 2] The system cannot find the file specified

I'm not sure what I'm doing wrong. Any ideas on how to solve the issue?

Thanks in advance
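
The traceback ends in subprocess.call(['wget', ...]) raising [WinError 2], which is what Windows reports when the program to launch cannot be found. That suggests no wget executable is on PATH, though the thread does not confirm it. A minimal sketch of a pre-flight check that turns this into a clearer message; `require_tool` is a hypothetical helper, not part of fetch.py.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            "'{0}' was not found on PATH; install it or add its folder "
            "to PATH before running the script".format(name))
    return path


if __name__ == "__main__":
    try:
        print("wget found at", require_tool("wget"))
    except RuntimeError as error:
        print(error)
```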

Need ur help

Could you please let me know how you downloaded the data from PubMed, and why this URL is not working: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=[email protected]&id=
Thank you.
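
Note that the URL as quoted ends in id= with no article ID, whereas the script's log output in other reports shows a PMC ID appended to it. A minimal sketch of how such a query URL can be assembled; `build_oa_query` is a hypothetical helper, the parameter names are taken from the URL above, and the email address is a placeholder rather than the one used by fetch.py.

```python
from urllib.parse import urlencode

OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def build_oa_query(pmc_id, email, tool="roco-fetch"):
    """Build the OA web service URL for a single PMC article ID."""
    params = {"tool": tool, "email": email, "id": pmc_id}
    return OA_SERVICE + "?" + urlencode(params)


if __name__ == "__main__":
    # "user@example.org" is a placeholder address for illustration only.
    print(build_oa_query("PMC4608653", "user@example.org"))
```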

Image PMC4954863/JoU-2016-0019-g003.jpg not found in archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/2f/12/PMC4954863.tar.gz, skipping

Thanks very much for sharing this wonderful dataset!
When downloading the dataset using the file 'fetch.py', I got this problem:

Can you give me some suggestions about it?

Error: download failed, retrying

My environment is behind a proxy. I cannot download any images of the dataset, but I can download the dataset's other text files.

python scripts/fetch.py -n 1
Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 1
Number of download retries: 10
Fetching ROCO dataset images...
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC4608653
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying

"""
Traceback (most recent call last):
  File "/home/work/lisa/miniconda3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 151, in process_group
    result = download_archive(extraction_dir_name, archive_url,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 212, in download_archive
    raise Exception("Giving up download of archive {0} after {1} tries"
Exception: Giving up download of archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/8d/34/PMC4608653.tar.gz after 11 tries
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 338, in <module>
    pool = multiprocessing.Pool(processes=args.num_processes,
  File "/home/work/lisa/miniconda3/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
Exception: Giving up download of archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/8d/34/PMC4608653.tar.gz after 11 tries
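
Plain FTP is often blocked or mishandled by HTTP proxies, which would fit the symptom that the text files download but the ftp:// archives do not. One thing that may be worth trying is requesting the same path over HTTPS. This is a sketch under the assumption that the server offers the same files over HTTPS at the same path, which this thread does not confirm; `ftp_to_https` is a hypothetical helper, not part of fetch.py.

```python
from urllib.parse import urlsplit


def ftp_to_https(url):
    """Rewrite an ftp:// URL to https:// on the same host and path.

    Assumption: the server exposes the same files over HTTPS.
    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return parts._replace(scheme="https").geturl()


if __name__ == "__main__":
    print(ftp_to_https(
        "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/8d/34/PMC4608653.tar.gz"))
```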

I don't know how to complete the download?

I ran the script "python scripts/fetch.py", but the download stopped at 9.895%, so I immediately re-ran the script.

Should I add -d at the end of the command?

Question regarding ROCOv2

I had some questions regarding ROCOv2.

What is the difference between train_concepts_manual.csv and train_concept.csv?
ROCO had keywords but ROCOv2 doesn't. What is the best way to generate keywords for ROCOv2?

FileNotFoundError

Hi there, I ran fetch.py under Ubuntu on WSL and ran into some problems. Here is the traceback:

Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 12
Number of download retries: 10
Fetching ROCO dataset images...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "fetch.py", line 199, in process_group
    shutil.copy(image_filename, target_filename)
  File "/usr/lib/python3.8/shutil.py", line 415, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'data/test/radiology/images/ROCO_00176.jpg'

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "fetch.py", line 340, in <module>
    for i, pmc_id in enumerate(pool.imap_unordered(process_group,
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'data/test/radiology/images/ROCO_00176.jpg'

I'm not sure how that is happening. I also tried running fetch.py on Windows. At the beginning it works for a little while and some pictures are indeed downloaded into images, but then it throws errors like:
zlib.error: Error -3 while decompressing data: invalid stored block lengths
module gzip has no attribute BadGzip etc.

Could you suggest some ways to fix this? I would appreciate it.
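
In the WSL traceback the script is invoked as fetch.py, apparently from inside the scripts directory, while the failing path data/test/radiology/images/... is relative. One possible cause, stated as an assumption, is that this directory does not exist relative to that working directory, so the copy cannot open its destination. Running the script from the repository root as python scripts/fetch.py, as other reports here do, may be the simpler fix. A minimal sketch of a copy step that creates the destination directory first; `copy_into` is a hypothetical helper, not part of fetch.py.

```python
import os
import shutil


def copy_into(src, dst):
    """Copy src to dst, creating dst's parent directory if it is missing."""
    parent = os.path.dirname(os.path.abspath(dst))
    os.makedirs(parent, exist_ok=True)
    return shutil.copy(src, dst)
```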

Download fails after 11 tries

Having the same issue as #11, even from Ubuntu. Is the link active? I tried -n 1 as suggested, but it is still not working.

download failed

Hi, I tried the download script, but it always shows "retrying", even when setting -n 1.

(base) [yupei@login roco-dataset-master]$ python scripts/fetch.py
Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 64
Number of download retries: 10
Fetching ROCO dataset images...
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC3395713
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC3130474

Could you tell me why?
THX

filename 'PMC4889020/nihms790182f4.jpg' not found

Some pictures cannot be downloaded.
Error: failed to extract image PMC4889020/nihms790182f4.jpg from archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/da/a9/PMC4889020.tar.gz: "filename 'PMC4889020/nihms790182f4.jpg' not found"
Image PMC4889020/nihms790182f4.jpg not found in archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/da/a9/PMC4889020.tar.gz, skipping
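
The quoted text "filename ... not found" is the message of the KeyError that Python's tarfile raises when a member name is absent from an archive. A plausible but unconfirmed explanation is that the file was renamed or removed when the archive was last updated. A minimal sketch of a lookup that falls back to matching on the base filename; `find_member` is a hypothetical helper, not part of fetch.py, and the fallback may still find nothing.

```python
import posixpath
import tarfile


def find_member(archive, wanted):
    """Return the TarInfo for `wanted`, falling back to a basename match."""
    try:
        return archive.getmember(wanted)
    except KeyError:
        # tarfile raises KeyError when the exact member name is missing.
        base = posixpath.basename(wanted).lower()
        for member in archive.getmembers():
            if member.isfile() and posixpath.basename(member.name).lower() == base:
                return member
    return None
```

A caller would open the archive with tarfile.open(path, "r:gz"), pass it to find_member, and skip the image only when the result is None.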

slow speed

Hi, thank you so much for your work!

May I ask whether it is normal for the download to be very slow? It takes about 10 seconds to download one image. The command I used was: python fetch.py -n 1
I used a single process because many errors occur when the number of processes is 10.

Thanks

Download fails

Is the script still working, or has something changed?
I also tried with 1 process, and it failed after 10 attempts.

dataset issues

Does the image name in the downloaded dataset match the name of the image corresponding to the caption? How can I solve this, or could you directly provide a link to the dataset? Thanks!