gap-decoder / gapdecoder
This project is forked from emelyanenkok/gapdownloader
Google Arts And Culture Downloader. Python script to download high-resolution images from google arts & culture.
Hi @lovasoa,
Small update before the issue: I know you've been working on dezoomify-rs. I've made some progress on the two open issues on gap-decoder. I believe I can now download images with all metadata embedded (viewable with exiftool or similar), and I have new naming logic that parses the metadata to find the artist and creation date, then names images as [author + date + image_name (with the author name parsed out) + info.image_id + '.jpg']. The new naming logic will make it easier to keep track of downloads by the same artist, a feature I am guessing some would be very interested in.
There is also logic for a batch download feature - not using xargs, as I've forgotten about that easy option - but still working nonetheless. The batch cache doubles as an archive file, so if the command breaks midway, it can pick up right where it left off and not redownload images that were already finished.
That said, I'm still testing these small updates, as I'm sure I have not covered all scenarios. During testing, I noticed that the following link maxed out my RAM (32 GB) very quickly at zoom 8 and crashed the script:
https://artsandculture.google.com/asset/the-birth-of-venus-sandro-botticelli/MQEeq50LABEBVg
I also noticed that the currently maintained Dezoomify-rs did not crash on this link, and downloaded it fairly steadily.
Have there been any updates on the Dezoomify side to the actual tile downloading logic? It would be great to continue testing the new naming scheme/metadata and batch files, if some patch could be worked out here. Then we can all move over to Dezoomify-rs if you feel that is the better option.
Edit: This issue may be linked to the maximum size of a JPEG file, since dezoomify-rs saved the image as a PNG. If 65k x 65k is the most a JPEG can hold, I believe this image at zoom 8 has a width of 108k.
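To make the size limit in the edit above concrete, here is a minimal sketch of a pre-flight check. It relies on the fact that JPEG stores width and height in 16-bit fields, so neither side can exceed 65,535 pixels. The function names are hypothetical and not part of gapdecoder or dezoomify-rs, and the 60,000-pixel height in the usage note is a made-up value for illustration.

```python
JPEG_MAX_DIMENSION = 65535  # JPEG stores width and height as 16-bit values


def fits_in_jpeg(width: int, height: int) -> bool:
    """Return True if both sides fit within the JPEG dimension limit."""
    return 0 < width <= JPEG_MAX_DIMENSION and 0 < height <= JPEG_MAX_DIMENSION


def pick_output_format(width: int, height: int) -> str:
    """Hypothetical helper: fall back to PNG when the image is too large for JPEG."""
    return "jpeg" if fits_in_jpeg(width, height) else "png"
```

With a width of roughly 108,000 pixels, `pick_output_format(108000, 60000)` returns `"png"`, which matches the PNG fallback observed with dezoomify-rs.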
I'm loving the tool! I thought it would be useful if a user could point gapdecoder to a text file with links, and have gapdecoder go through downloading each one. What do you think?
Also, I thought it might be useful for gapdecoder to write the links it has already processed/downloaded to a JSON or text file, so users can see which ones are done. In the above case, this would help if the command stops in the middle and needs to be restarted.
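The batch idea above could be sketched roughly as follows: read links from a list, skip any already recorded in an archive file, and append each link to the archive as soon as it finishes. This is only an illustration under assumptions. `run_batch`, `load_archive`, and the `download` callable are hypothetical names, not gapdecoder functions, and the archive is assumed to be a plain text file with one URL per line.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Set


def load_archive(archive_path: Path) -> Set[str]:
    """Return the set of URLs already recorded as finished."""
    if not archive_path.exists():
        return set()
    lines = archive_path.read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def run_batch(urls: Iterable[str], archive_path: Path,
              download: Callable[[str], None]) -> List[str]:
    """Download every URL not yet archived; return the URLs fetched in this run."""
    done = load_archive(archive_path)
    fetched = []
    for url in urls:
        url = url.strip()
        if not url or url in done:
            continue  # blank line, or already finished in an earlier run
        download(url)  # hypothetical downloader; an exception aborts the batch
        # Record success right away so a later crash does not lose progress.
        with archive_path.open("a") as archive:
            archive.write(url + "\n")
        done.add(url)
        fetched.append(url)
    return fetched
```

Because the archive is only appended to after a successful download, re-running the same command after a crash resumes with the first unfinished link.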
Consistent with any url and any zoom level.
Hi @lovasoa! I've recently hit this error a few times. Any idea if we could put in a retry in the code?
https://artsandculture.google.com/asset/the-last-day-of-pompeii-karl-brullov/tAFrCGFUhXM8Jg
Defaulting to highest zoom level (8).
Zoom level 8 too high for JPEG output, using next zoom level 7 instead
Using zoom level 7.
Downloading tiles...
Traceback (most recent call last):
File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 8, in modified
return await f(*args, **kwargs)
File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 22, in fetch
async with session.get(url) as response:
File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\client.py", line 1012, in __aenter__
self._resp = await self._coro
File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\client.py", line 582, in _request
break
File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\helpers.py", line 596, in __exit__
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tile_fetch_3.py", line 353, in <module>
main()
File "tile_fetch_3.py", line 306, in main
loop.run_until_complete(coro)
File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 608, in run_until_complete
return future.result()
File "tile_fetch_3.py", line 162, in load_tiles
tiles = await async_tile_fetcher.gather_progress(awaitable_tiles)
File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 42, in gather_progress
total = await asyncio.gather(*[
File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 36, in print_percent
res = await awaitable
File "tile_fetch_3.py", line 118, in fetch_tile
encrypted_bytes = await async_tile_fetcher.fetch(session, image_url, file_path)
File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 13, in modified
raise err
Exception
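As a rough illustration of the retry being asked for, the sketch below retries an awaitable a few times when it raises `asyncio.TimeoutError`, the error at the bottom of the first traceback. `fetch_with_retry` and its parameters are hypothetical and are not the code in `async_tile_fetcher.py`; `fetch_once` stands in for whatever zero-argument coroutine function performs a single tile request.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def fetch_with_retry(fetch_once: Callable[[], Awaitable[Any]],
                           retries: int = 3,
                           base_delay: float = 1.0) -> Any:
    """Await fetch_once() up to `retries` times, backing off after each timeout."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch_once()
        except asyncio.TimeoutError:
            if attempt == retries:
                raise  # out of attempts: let the caller see the timeout
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A call site might wrap the request as `await fetch_with_retry(lambda: fetch(session, image_url, file_path))`, assuming `fetch` is the coroutine shown in the traceback.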
Hi - when I try the tool recently (zoom 7), I get the following error:
Downloading image meta-information...
Traceback (most recent call last):
File "tile_fetch.py", line 178, in <module>
main()
File "tile_fetch.py", line 159, in main
image_info = ImageInfo(url)
File "tile_fetch.py", line 61, in __init__
self.token = re.findall(token_regex, page_source)[0]
IndexError: list index out of range
Thanks!
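The `IndexError` above comes from indexing `[0]` into an empty `re.findall` result, which suggests the token pattern no longer matches the page. A minimal sketch of a more descriptive failure is shown below. `first_match` is a hypothetical helper, and the real `token_regex` is not reproduced here.

```python
import re


def first_match(pattern: str, text: str, what: str) -> str:
    """Return the first regex match, or raise a descriptive error if none exists."""
    matches = re.findall(pattern, text)
    if not matches:
        # An empty list here is what causes "list index out of range" with [0].
        raise ValueError(
            f"could not find {what} in the page source; "
            "the page layout may have changed"
        )
    return matches[0]
```

This does not fix the extraction itself, but it turns an opaque `IndexError` into a message that points at the likely cause.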
Hello!! Thanks so much for creating this program. Running into a new error (this worked a few weeks ago). Looking to see if there is a new replacement for aiohttp.
PS C:\Users\Jozef Schutzman\Documents\gapdecoder-master\gapdecoder-master> python3 tile_fetch.py "https://artsandculture.google.com/asset/flag-of-east-timor-dom%C3%ADnio-p%C3%BAblico/WAHnaoBLfc9U6g"
Traceback (most recent call last):
File "tile_fetch.py", line 15, in <module>
import aiohttp
ModuleNotFoundError: No module named 'aiohttp'
The decoder is now fast, efficient, and through xargs, iterative. Is there any way to embed the art description and metadata from the "Details" section along with the tiles?
For example, in the following picture: https://artsandculture.google.com/asset/thinking-of-history-at-my-space/4wHJZ6r2X7NOFQ
There's title, creator, date, and type, as well as the description right below the painting.
Suggested improvements:
Seeing as I'm not active on this project, perhaps it would be better to move it to an organisation repo or something?
For the following image, dezooming fails:
https://artsandculture.google.com/asset/wildflower-painting-of-red-grevillea/wwEzEHEBAqxv4w
(reported in #9 (comment))
After a quick inspection, it looks like it's because the token extraction heuristic yields the wrong token.
See #13
Hey! So I'm really enjoying the tool, even though I'm not familiar with the methods used to obtain the images.
One thing I noticed is that Zoom 7 (the highest I've tried so far) seems to hang at around 40%, and at other times takes considerably longer than Zoom 6. Generally speaking, the scans I was pointing at had 4x the tiles at Zoom 7 compared with Zoom 6. However, while Zoom 6 takes around 3 minutes, Zoom 7 takes 30 minutes or so before hitting that arbitrary point where it stops.
If this is part and parcel of the way the script works, that's great - Zoom 7 isn't even remotely available in the other GA&C tools. I just thought I'd reach out and ask if it's something expected.
The scans in question are part of the Gigapixel collection. I believe this was a problem file.
Edit: I'm now seeing most files fail at zoom 7: the script downloads a number of tiles and then stops. It is possible to pick up where it left off by re-running the command, but that only goes so far if a glitch is making it get stuck. Unfortunately, there are times when restarting only downloads a few more tiles.
@lovasoa I really appreciate your responsiveness to my previous suggestions. The new tool is a huge improvement :) Thank you. I humbly make just one more suggestion:
The issue is that zoom levels do not refer to consistent pixel dimensions. For some images, zoom 5 can mean 6000 x 4000 pixels, while for others it can mean 12,000 x 8000. This is annoying because the user has in mind the dimensions they want, but the zoom is only an imperfect indicator. So my suggestion is to also allow users to specify the dimensions, and you can keep the zoom option too if you think it's useful. In terms of user input, I think it makes the most sense to specify megapixels as integers like 24 (i.e., 24 = 6000 x 4000), rather than the pixel length or width, since artwork can be long or narrow. I hope the fallback option can be kept for when the user specifies a size that is not available.
This size specification is implemented in Boquete's tool but I like gapdecoder better :)
https://github.com/Boquete/google-arts-crawler
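One way the megapixel suggestion could work is sketched below: given the pixel dimensions available at each zoom level, pick the level whose size is closest to the requested megapixels. The function name and the `level_sizes` mapping are assumptions for illustration only. They are not gapdecoder's or Boquete's API, and the example dimensions are made up apart from the 6000 x 4000 = 24 megapixel case mentioned above.

```python
from typing import Dict, Tuple


def pick_zoom_for_megapixels(level_sizes: Dict[int, Tuple[int, int]],
                             target_megapixels: float) -> int:
    """Return the zoom level whose pixel count is closest to the target.

    level_sizes maps zoom level -> (width, height) in pixels.
    Ties are broken in favour of the lower zoom level.
    """
    def megapixels(size: Tuple[int, int]) -> float:
        return size[0] * size[1] / 1_000_000

    return min(
        level_sizes,
        key=lambda zoom: (abs(megapixels(level_sizes[zoom]) - target_megapixels), zoom),
    )
```

For a hypothetical image with `{4: (3000, 2000), 5: (6000, 4000), 6: (12000, 8000)}`, a request for 24 megapixels selects zoom 5. A request larger than anything available falls back to the largest level, which keeps the existing fallback behaviour in spirit.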
On macOS Catalina:
git clone https://github.com/gap-decoder/gapdecoder.git
pip3 install -r requirements.txt --user
python3 tile_fetch.py --zoom 4 "https://artsandculture.google.com/asset/the-kiss-gustav-klimt/HQGxUutM_F6ZGg"
This is the traceback I get:
Downloading image meta-information...
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
encode_chunked=req.has_header('Transfer-encoding'))
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1016, in _send_output
self.send(msg)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 956, in send
self.connect()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1392, in connect
server_hostname=server_hostname)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 412, in wrap_socket
session=session
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 853, in _create
self.do_handshake()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 1117, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tile_fetch.py", line 181, in <module>
main()
File "tile_fetch.py", line 162, in main
image_info = ImageInfo(url)
File "tile_fetch.py", line 43, in __init__
page_source = urllib.request.urlopen(url).read()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
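A `CERTIFICATE_VERIFY_FAILED` error with "unable to get local issuer certificate" often means the Python build cannot find a CA bundle. One possible workaround, sketched here under that assumption, is to pass `urlopen` an SSL context built from the third-party `certifi` bundle when it is installed. `make_ssl_context` and `fetch_page` are hypothetical names and are not part of `tile_fetch.py`.

```python
import ssl
import urllib.request


def make_ssl_context() -> ssl.SSLContext:
    """Build a verifying SSL context, preferring certifi's CA bundle if available."""
    try:
        import certifi  # third-party package; assumed installed via pip
    except ImportError:
        # Fall back to whatever CA store this Python build can locate.
        return ssl.create_default_context()
    return ssl.create_default_context(cafile=certifi.where())


def fetch_page(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a page with certificate verification still enabled."""
    with urllib.request.urlopen(url, context=make_ssl_context(), timeout=timeout) as response:
        return response.read()
```

Verification stays on in both branches, so this changes only where the trusted certificates come from.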
I've been getting this error pretty often. Is there a way to add error handling that loops back to the original command when this happens? Restarting tile_fetch.py solves the issue temporarily, so calling it again when an error occurs might help.
Edit: This still happens, but less frequently now for some reason. It's possible it's related to internet speed and breakages in uptime.