razorx89 / roco-dataset
Radiology Objects in COntext (ROCO): A Multimodal Image Dataset
Multiprocessing on Windows starts worker processes by spawning rather than forking, so workers cannot rely on global variables set in the parent process being visible. On Linux, fork has traditionally been the default start method; macOS has defaulted to spawn since Python 3.8.
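A minimal sketch of one way to avoid depending on inherited globals: pass the shared state to each worker through the pool's `initializer`. The names `init_worker`, `describe`, `run` and the `config` dictionary are hypothetical and are not taken from fetch.py.

```python
import multiprocessing

# Per-process state, filled in by init_worker() inside each worker.
_config = None


def init_worker(config):
    """Runs once in every worker process, under fork and spawn alike."""
    global _config
    _config = config


def describe(pmc_id):
    """Hypothetical task: reads the per-process config, not a parent global."""
    return "{0}/{1}".format(_config["subdir"], pmc_id)


def run(pmc_ids, config, processes=2):
    # initializer/initargs hand the config to each worker explicitly,
    # so nothing depends on memory inherited from the parent process.
    with multiprocessing.Pool(processes=processes,
                              initializer=init_worker,
                              initargs=(config,)) as pool:
        return sorted(pool.imap_unordered(describe, pmc_ids))
```

On platforms that spawn, `run(...)` should be called from under an `if __name__ == "__main__":` guard so that workers can import the module without re-running the pool setup.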
After multiple failed download attempts, I discovered a ROCO dataset on the Kaggle platform. Is there any difference between that dataset and the one I was trying to download, or is it just another version uploaded to Kaggle by someone else? I'm very confused...
Hi,
I keep getting "zlib.error: Error -3 while decompressing data: invalid code lengths set" or other zlib errors during the download. Is there a fix for the script, or is this a data problem?
Cheers
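zlib errors of this kind usually point to a truncated or corrupted download rather than to the extraction code itself. As a rough sketch, a downloaded archive could be read through once before extraction and fetched again if the check fails. The helper name `gzip_is_intact` is hypothetical and not part of fetch.py.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so that truncation or corruption surfaces here.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (Python 3.8+) is a subclass of OSError,
        # so it is covered on both older and newer interpreters.
        return False
    return True
```

Catching `OSError` instead of `gzip.BadGzipFile` keeps the check usable on Python versions older than 3.8, where that attribute does not exist.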
Hello, I wanted to know if someone can help me with the following issue when running the script fetch.py on Windows 10 with Python 3.8.11.
I get the following output and error message:
Configuration:
Subdirectory: images
Extraction directory: C:\Users\franc\AppData\Local\Temp\roco-dataset
Keep archives: False
Delete contents of extraction directory: True
Number of processes: 4
Number of download retries: 10
Fetching ROCO dataset images...
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "C:\Users\franc\anaconda3\lib\multiprocessing\pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "C:\Users\franc\roco-dataset\scripts\fetch.py", line 131, in process_group
result = download_archive(extraction_dir_name, archive_url,
File "C:\Users\franc\roco-dataset\scripts\fetch.py", line 209, in download_archive
return subprocess.call(['wget', '-nc', '-nd', '-c', '-q', '-P',
File "C:\Users\franc\anaconda3\lib\subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Users\franc\anaconda3\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\franc\anaconda3\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts\fetch.py", line 330, in <module>
for i, pmc_id in enumerate(pool.imap_unordered(process_group,
File "C:\Users\franc\anaconda3\lib\multiprocessing\pool.py", line 868, in next
raise value
FileNotFoundError: [WinError 2] The system cannot find the file specified
Not sure what I'm doing wrong. Any ideas on how to solve the issue?
Thanks in advance
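The traceback above fails inside `subprocess.call(['wget', ...])` with `[WinError 2]`, which is what Windows reports when the executable itself cannot be found, so a likely cause is that `wget` is not installed or not on `PATH`. A small sketch of a check that could run before any download starts; the helper name `require_tool` is hypothetical.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable, or fail with a clear message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            "Required program '{0}' was not found on PATH; "
            "install it or add its folder to PATH.".format(name))
    return path
```

Calling `require_tool("wget")` at startup would turn the nested multiprocessing traceback into a single readable error.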
Could you please let me know how you downloaded the data from PubMed, and why this URL is not working: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&email=[email protected]&id=.
Thank you.
My environment is behind a proxy.
I cannot download any images of the dataset, but I can download the dataset's other text files.
python scripts/fetch.py -n 1
Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 1
Number of download retries: 10
Fetching ROCO dataset images...
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC4608653
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying
Error: download failed, retrying
"""
Traceback (most recent call last):
File "/home/work/lisa/miniconda3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 151, in process_group
result = download_archive(extraction_dir_name, archive_url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 212, in download_archive
raise Exception("Giving up download of archive {0} after {1} tries"
Exception: Giving up download of archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/8d/34/PMC4608653.tar.gz after 11 tries
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/work/lisa/distributed/visual-med-alpaca/roco-dataset/scripts/fetch.py", line 338, in
pool = multiprocessing.Pool(processes=args.num_processes,
File "/home/work/lisa/miniconda3/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
Exception: Giving up download of archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/8d/34/PMC4608653.tar.gz after 11 tries
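The archive URL in the log uses the `ftp://` scheme, and plain FTP is often blocked behind a proxy even when HTTPS traffic passes. One thing to try, sketched below, is to request the same path over HTTPS. This assumes the NCBI host serves the same path over HTTPS, which should be verified in the affected environment; the helper name `ftp_to_https` is hypothetical and not part of fetch.py.

```python
from urllib.parse import urlsplit, urlunsplit


def ftp_to_https(url):
    """Rewrite an ftp:// URL to https:// on the same host and path.

    Assumption: the server exposes the same files over HTTPS.
    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url
    return urlunsplit(("https", parts.netloc, parts.path,
                       parts.query, parts.fragment))
```

Tools such as wget generally honor the `https_proxy` environment variable, so an HTTPS URL can be routed through the proxy where an FTP URL might not be.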
I had some questions regarding ROCOv2.
Hi there, I ran fetch.py under Ubuntu on WSL and ran into some problems. Here is the traceback:
Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 12
Number of download retries: 10
Fetching ROCO dataset images...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "fetch.py", line 199, in process_group
shutil.copy(image_filename, target_filename)
File "/usr/lib/python3.8/shutil.py", line 415, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/usr/lib/python3.8/shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'data/test/radiology/images/ROCO_00176.jpg'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "fetch.py", line 340, in <module>
for i, pmc_id in enumerate(pool.imap_unordered(process_group,
File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: 'data/test/radiology/images/ROCO_00176.jpg'
I'm not sure how that is happening. I also tried running fetch.py on Windows. At the beginning it works for a little while, and some pictures are indeed downloaded into images, but then it throws errors like:
zlib.error: Error -3 while decompressing data: invalid stored block lengths
module gzip has no attribute BadGzip
etc.
Could you suggest some ways to fix this? I would appreciate it.
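In the WSL traceback, `shutil.copy` fails on the relative path `data/test/radiology/images/ROCO_00176.jpg`, and the script was started as `fetch.py` rather than `scripts/fetch.py`. One plausible reading is that the target directory does not exist relative to the current working directory. A sketch of a copy helper that creates the target directory first; `copy_into` is a hypothetical name, not a function from fetch.py.

```python
import os
import shutil


def copy_into(image_filename, target_filename):
    """Copy a file, creating the target's parent directory if needed."""
    target_dir = os.path.dirname(target_filename)
    if target_dir:
        # exist_ok avoids an error when several workers create the same folder.
        os.makedirs(target_dir, exist_ok=True)
    return shutil.copy(image_filename, target_filename)
```

Running the script from the repository root, as in `python scripts/fetch.py`, may also resolve the relative paths without any code change.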
Having the same issue as #11, even on Ubuntu. Is the link active? I tried -n 1 as suggested, but it is still not working.
Hi, I tried the download code, but it always shows "retrying", even when setting -n 1.
(base) [yupei@login roco-dataset-master]$ python scripts/fetch.py
Configuration:
Subdirectory: images
Extraction directory: /tmp/roco-dataset
Keep archives: False
Delete contents of extraction directory: False
Number of processes: 64
Number of download retries: 10
Fetching ROCO dataset images...
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC3395713
Error: download failed, retrying
Trying to get new archive URL: https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?tool=roco-fetch&[email protected]&id=PMC3130474
Could you tell me why?
THX
Some pictures cannot be downloaded.
Error: failed to extract image PMC4889020/nihms790182f4.jpg from archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/da/a9/PMC4889020.tar.gz: "filename 'PMC4889020/nihms790182f4.jpg' not found"
Image PMC4889020/nihms790182f4.jpg not found in archive ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/da/a9/PMC4889020.tar.gz, skipping
Hi, thank you so much for your work!
May I ask whether it is normal for the download to be very slow? It takes about 10 seconds to download one image. The command I used was: python fetch.py -n 1
I used a single process because many errors occur when the number of processes is 10.
Thanks
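If the errors with many processes come from the server refusing rapid repeated requests, which is only a guess, spacing out the retries may help more than retrying immediately. A minimal sketch of retrying with exponential backoff follows; it is not what fetch.py does, and `retry_with_backoff` and its parameters are hypothetical.

```python
import time


def retry_with_backoff(action, tries=10, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(tries):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt + 1 < tries:
                # Waits base_delay, then 2x, 4x, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        "giving up after {0} tries".format(tries)) from last_error
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.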
Is the script still working, or has something changed?
I also tried with 1 process, and it failed after 10 attempts.
Do the image names in the downloaded dataset match the image names that the captions refer to? How can I solve this, or can you directly provide a link to the dataset? Thanks!