gap-decoder / gapdecoder

This project forked from emelyanenkok/gapdownloader

Google Arts And Culture Downloader. Python script to download high-resolution images from google arts & culture.

Python 100.00%

art culture google high-resolution image large-files

gapdecoder's Issues

RAM maxed on certain images, fails on higher zoom levels

Hi @lovasoa,

Small update before the issue: I know you've been working on dezoomify-rs. I've made some progress on the two open issues on gap-decoder. I believe I can now download images with all metadata embedded (viewable with exiftool or similar), and with a new naming logic that parses the metadata for the artist and creation date and names images as [author + date + image_name (with the author name parsed out) + info.image_id + '.jpg']. The new naming logic will make it easier to keep track of downloads by the same artist, a feature I am guessing some would be very interested in.
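The naming scheme described above could be sketched roughly as follows. This is only an illustration, not the code from the update: `build_filename`, the " - " separator, and the set of stripped characters are all assumptions.

```python
import re


def build_filename(author, date, title, image_id):
    """Compose '<author> - <date> - <title> - <image_id>.jpg'.

    Hypothetical helper: the separator and the cleaning rules are
    assumptions, not taken from gapdecoder itself.
    """
    # Drop the author's name from the title so it is not repeated.
    if author:
        title = re.sub(re.escape(author), "", title, flags=re.IGNORECASE)

    cleaned = []
    for part in (author, date, title, image_id):
        # Remove characters that Windows does not allow in file names.
        part = re.sub(r'[\\/:*?"<>|]', "", part or "")
        # Collapse whitespace and trim leftover separators at the ends.
        part = re.sub(r"\s+", " ", part).strip(" -,")
        if part:
            cleaned.append(part)
    return " - ".join(cleaned) + ".jpg"
```

Empty fields are skipped, so an image without a known artist or date still gets a usable name built from the title and image id.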

There is also logic for a batch download feature - not using xargs, as I've forgotten about that easy option - but still working nonetheless. The batch cache doubles as an archive file, so if the command breaks midway, it can pick up right where it left off and not redownload images that were already finished.

That said, I'm still testing these small updates, as I'm sure I have not covered all scenarios. Through testing, I noticed the following link maxed out my RAM (32 GB) very quickly on zoom 8 and crashed the script:

https://artsandculture.google.com/asset/the-birth-of-venus-sandro-botticelli/MQEeq50LABEBVg

I also noticed that the currently maintained Dezoomify-rs did not crash on this link, and downloaded it fairly steadily.

Have there been any updates on the Dezoomify side to the actual tile downloading logic? It would be great to continue testing the new naming scheme/metadata and batch files, if some patch could be worked out here. Then we can all move over to Dezoomify-rs if you feel that is the better option.

Edit: This issue may be linked to the maximum size of a JPEG file, since dezoomify-rs saved the image as a PNG. If 65k x 65k pixels is the most a JPEG can hold, I believe this image at zoom 8 exceeds it with a width of 108k.

Suggestion to take a text file with URLs

I'm loving the tool! I thought it would be useful if a user could point gapdecoder to a text file with links, and have gapdecoder go through downloading each one. What do you think?

Also, I thought it might be useful for gapdecoder to write the links it has already processed/downloaded to a JSON or text file, so users can see which ones are done. That would help in the case above if the command stops in the middle and needs to be restarted.
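A minimal sketch of this idea, assuming a plain text file with one URL per line and a second text file used as the archive of finished downloads. The names `pending_urls` and `run_batch`, and the `download` callback, are hypothetical and not part of gapdecoder:

```python
from pathlib import Path


def pending_urls(url_file, archive_file):
    """Return the URLs in url_file that are not yet listed in archive_file."""
    archive = Path(archive_file)
    done = set(archive.read_text(encoding="utf-8").split()) if archive.exists() else set()
    urls = []
    for line in Path(url_file).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip blank lines, comments, finished URLs, and duplicates.
        if url and not url.startswith("#") and url not in done and url not in urls:
            urls.append(url)
    return urls


def run_batch(url_file, archive_file, download):
    """Call download(url) for each pending URL and record each success."""
    for url in pending_urls(url_file, archive_file):
        download(url)  # if this raises, the URL is not recorded and is retried next run
        with open(archive_file, "a", encoding="utf-8") as fh:
            fh.write(url + "\n")
```

Because a URL is appended to the archive only after its download returns, rerunning the same command after a crash picks up at the first unfinished link.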

Error: %b requires a bytes-like object, or an object that implements bytes, not 'NoneType'

Consistent with any url and any zoom level.

asyncio.exceptions.TimeoutError

Hi @lovasoa! I've recently hit this error a few times. Any idea if we could put in a retry in the code?

https://artsandculture.google.com/asset/the-last-day-of-pompeii-karl-brullov/tAFrCGFUhXM8Jg
Defaulting to highest zoom level (8).
Zoom level 8 too high for JPEG output, using next zoom level 7 instead
Using zoom level 7.
Downloading tiles...
Traceback (most recent call last):
  File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 8, in modified
    return await f(*args, **kwargs)
  File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 22, in fetch
    async with session.get(url) as response:
  File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\client.py", line 1012, in __aenter__
    self._resp = await self._coro
  File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\client.py", line 582, in _request
    break
  File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\site-packages\aiohttp\helpers.py", line 596, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tile_fetch_3.py", line 353, in <module>
    main()
  File "tile_fetch_3.py", line 306, in main
    loop.run_until_complete(coro)
  File "C:\Users\i\AppData\Local\Programs\Python\Python38\lib\asyncio\base_events.py", line 608, in run_until_complete
    return future.result()
  File "tile_fetch_3.py", line 162, in load_tiles
    tiles = await async_tile_fetcher.gather_progress(awaitable_tiles)
  File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 42, in gather_progress
    total = await asyncio.gather(*[
  File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 36, in print_percent
    res = await awaitable
  File "tile_fetch_3.py", line 118, in fetch_tile
    encrypted_bytes = await async_tile_fetcher.fetch(session, image_url, file_path)
  File "C:\Users\i\Downloads\gapdecoder\async_tile_fetcher.py", line 13, in modified
    raise err
Exception
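One way the retry asked about above might look, as a sketch only. `fetch_with_retry` is a hypothetical wrapper, and `fetch` stands for any coroutine function such as the tile fetcher; the retry count and backoff delays are assumptions:

```python
import asyncio


async def fetch_with_retry(fetch, *args, retries=3, base_delay=1.0):
    """Await fetch(*args), retrying on asyncio.TimeoutError.

    Waits base_delay, then 2x, then 4x ... between attempts and
    re-raises the timeout once all retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return await fetch(*args)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

Only timeouts are retried here; any other exception propagates immediately, so genuine bugs are not hidden behind repeated attempts.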

"Index Out of Range" error on all links tried

Hi - when I tried the tool recently (zoom 7), I got the following error:

Downloading image meta-information...

Traceback (most recent call last):
  File "tile_fetch.py", line 178, in <module>
    main()
  File "tile_fetch.py", line 159, in main
    image_info = ImageInfo(url)
  File "tile_fetch.py", line 61, in __init__
    self.token = re.findall(token_regex, page_source)[0]
IndexError: list index out of range

Thanks!
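The IndexError comes from indexing `[0]` into an empty `re.findall` result, which happens whenever the pattern no longer matches the page. A hedged sketch of a guard that reports this more clearly is below. `TOKEN_REGEX` is a made-up pattern for illustration, since the real `token_regex` is not shown in this report, and `extract_token` is a hypothetical helper:

```python
import re

# Hypothetical pattern, for illustration only; not the regex used by tile_fetch.py.
TOKEN_REGEX = re.compile(r'"token":"([^"]+)"')


def extract_token(page_source, pattern=TOKEN_REGEX):
    """Return the first captured token, or raise a descriptive error."""
    match = pattern.search(page_source)
    if match is None:
        # No match usually means the page markup changed, not a bad URL.
        raise ValueError("Unable to find image token; the page layout may have changed")
    return match.group(1)
```

This does not fix the extraction itself, it only turns a bare "list index out of range" into a message that points at the likely cause.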

No Module Named: 'aiohttp'

Hello!! Thanks so much for creating this program. Running into a new error (this worked a few weeks ago). Looking to see if there is a new replacement for aiohttp.

PS C:\Users\Jozef Schutzman\Documents\gapdecoder-master\gapdecoder-master> python3 tile_fetch.py "https://artsandculture.google.com/asset/flag-of-east-timor-dom%C3%ADnio-p%C3%BAblico/WAHnaoBLfc9U6g"
Traceback (most recent call last):
  File "tile_fetch.py", line 15, in <module>
    import aiohttp
ModuleNotFoundError: No module named 'aiohttp'

ValueError: Unable to find google arts image token

I'm getting the above error on all images that I try (example here). @lovasoa Is there any chance you are able to fix the token extractor?

Suggestion: Scrape art metadata

The decoder is now fast, efficient, and through xargs, iterative. Is there any way to embed the art description and metadata from the "Details" section along with the tiles?

For example, in the following picture: https://artsandculture.google.com/asset/thinking-of-history-at-my-space/4wHJZ6r2X7NOFQ

There's title, creator, date, and type, as well as the description right below the painting.

Add max zoom option

Suggested improvements:

1. Add an argument to allow users to download the max zoom that doesn't exceed 65K (the JPEG limit, I believe).
2. If the user specifies a zoom level that exceeds the max, then automatically download the max zoom. (This is similar in purpose to 1.)
3. If zoom is not specified, once the available levels are shown, ask users to input the level they want.
4. If the URL is not provided, fetch it from the clipboard. For reference, see https://github.com/Boquete/google-arts-crawler/blob/master/crawler.py
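The first two suggestions could be sketched as below, assuming the tool already knows the pixel size of each zoom level. `highest_jpeg_zoom` and the `levels` list of (width, height) tuples are hypothetical names, not gapdecoder's own; the 65,535-pixel limit per side comes from the JPEG format's 16-bit dimension fields:

```python
JPEG_MAX_SIDE = 65535  # largest width or height a JPEG file can store


def highest_jpeg_zoom(levels, requested=None):
    """Pick a zoom level whose image fits in a JPEG.

    levels is a list of (width, height) tuples indexed by zoom level.
    Returns the requested level if it fits, otherwise the highest
    level that does.
    """
    fitting = [z for z, (w, h) in enumerate(levels) if max(w, h) <= JPEG_MAX_SIDE]
    if not fitting:
        raise ValueError("no zoom level fits in a JPEG")
    if requested in fitting:
        return requested
    return max(fitting)
```

With no `requested` value this returns the max usable zoom (suggestion 1), and an oversized request falls back to that same level (suggestion 2).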

Move to organisation repo?

Seeing as I'm not active on this project, perhaps it would be better to move it to an organisation repo or something?

Wrong token extraction

For the following image, dezooming fails:

https://artsandculture.google.com/asset/wildflower-painting-of-red-grevillea/wwEzEHEBAqxv4w

(reported in #9 (comment))

After a quick inspection, it looks like it's because the token extraction heuristic yields the wrong token.

The highest zoom levels are not accessible for some images

See #13

Zoom 7 hangs, or has greatly increased download time vs Zoom 6

Hey! So I'm really enjoying the tool, even though I'm not familiar with the methods used to obtain the images.

One thing I noticed is that Zoom 7 (the highest I've tried so far) seems to hang at around ~40%, and at other times takes considerably longer than Zoom 6. Generally speaking the scans being pointed at had 4x the tiles in Zoom 7 vs 6. However, while Zoom 6 takes around ~3 minutes, Zoom 7 takes 30 min or so before hitting that arbitrary point where it stops.

If this is part and parcel of the way the script works, that's great - Zoom 7 isn't even remotely available in the other GA&C tools. I just thought I'd reach out and ask if it's something expected.

The scans in question are part of the Gigapixel collection. I believe this was a problem file.

Edit: So I'm seeing most files fail on zoom 7: the script downloads a number of tiles and then stops. It is possible to pick up where it left off by re-running the command, but that only goes so far if a glitch is what makes it get stuck. Unfortunately, there seem to be times when restarting only downloads a few more tiles.

suggestion to specify megapixels instead of zoom

@lovasoa I really appreciate your responsiveness to my previous suggestions. The new tool is a huge improvement :) Thank you. I humbly make just one more suggestion:

The issue is that zoom levels do not refer to consistent pixel dimensions. For some images, zoom 5 can mean 6000 x 4000 pixels, while for others it can mean 12,000 x 8000. This is annoying because the user has in mind the dimensions they want, but the zoom is only an imperfect indicator. So my suggestion is to also allow users to specify the dimensions, and you can keep the zoom option too if you think it's useful. In terms of user input, I think it makes the most sense to specify megapixels as integers like 24 (i.e., 24 = 6000 x 4000), rather than the pixel length or width, since artwork can be long/narrow. I hope the fallback option, for when the user specifies a size that is not available, can be kept.

This size specification is implemented in Boquete's tool but I like gapdecoder better :)
https://github.com/Boquete/google-arts-crawler
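The megapixel option suggested above might be sketched like this. `zoom_for_megapixels` and the `levels` list of (width, height) tuples are hypothetical, and picking the closest level is one assumed way to handle a size that is not available exactly:

```python
def zoom_for_megapixels(levels, target_mp):
    """Return the zoom level whose pixel count is closest to target_mp megapixels.

    levels is a list of (width, height) tuples indexed by zoom level.
    """
    if not levels:
        raise ValueError("no zoom levels available")
    target = target_mp * 1_000_000
    return min(
        range(len(levels)),
        key=lambda z: abs(levels[z][0] * levels[z][1] - target),
    )
```

For an image whose levels are 3000 x 2000, 6000 x 4000 and 12,000 x 8000, a request for 24 megapixels selects the middle level, whatever zoom number that happens to be for that image.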

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

On macOS Catalina:

git clone https://github.com/gap-decoder/gapdecoder.git
pip3 install -r requirements.txt --user
python3 tile_fetch.py --zoom 4 "https://artsandculture.google.com/asset/the-kiss-gustav-klimt/HQGxUutM_F6ZGg"

This is the traceback I get:

Downloading image meta-information...
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1317, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1016, in _send_output
    self.send(msg)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 956, in send
    self.connect()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/http/client.py", line 1392, in connect
    server_hostname=server_hostname)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 412, in wrap_socket
    session=session
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 853, in _create
    self.do_handshake()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/ssl.py", line 1117, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tile_fetch.py", line 181, in <module>
    main()
  File "tile_fetch.py", line 162, in main
    image_info = ImageInfo(url)
  File "tile_fetch.py", line 43, in __init__
    page_source = urllib.request.urlopen(url).read()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>
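"Unable to get local issuer certificate" generally means Python could not find a CA bundle to verify the server against. A possible workaround, sketched under the assumption that the third-party `certifi` package is installed, is to pass an explicit SSL context to `urlopen`. `make_ssl_context` and `fetch_page` are hypothetical helpers, not functions from tile_fetch.py:

```python
import ssl
import urllib.request


def make_ssl_context():
    """Build a verifying SSL context, preferring certifi's CA bundle."""
    try:
        import certifi  # third-party package; assumed to be installed
        return ssl.create_default_context(cafile=certifi.where())
    except ImportError:
        # Fall back to whatever CA store the system Python can find.
        return ssl.create_default_context()


def fetch_page(url, context=None):
    """Download url with certificate verification still enabled."""
    context = context or make_ssl_context()
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

This keeps certificate verification on and only changes where the trusted certificates are loaded from.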

HTTP Error 500: Internal Server Error

I've been getting this error pretty often. Is there a way to add error handling that reruns the original command when this happens? Restarting tile_fetch.py solves the issue temporarily, so calling it again when an error occurs might help.

Edit: This still happens, but less frequently now for some reason. It's possible it's related to internet speed and breakages in uptime.
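The error handling requested here could take the form of a small retry wrapper around whatever function makes the request. This is a sketch under assumptions: `retry_on_server_error` is a hypothetical helper, and the retry count and fixed delay are arbitrary choices:

```python
import time
import urllib.error


def retry_on_server_error(func, *args, retries=3, delay=2.0):
    """Call func(*args), retrying when the server answers with a 5xx status.

    Client errors (4xx) are re-raised at once, since repeating the same
    request is unlikely to help.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == retries:
                raise
            time.sleep(delay)
```

Wrapping only the failing request avoids restarting the whole script, so tiles that were already downloaded do not need to be fetched again.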