artstation-scraper's Introduction

ArtStation Scraper

This is my personal project for downloading images from the ArtStation website. The program downloads artworks by the specified artists to the specified download directory. Inside the download directory, it creates one subdirectory per artist, named after the artist ID, and saves that artist's artworks there. File modification times are set in order from newest to oldest so that you can sort files by modified date. On each run, the program checks every artist directory to see whether an update is needed, so only new uploads are downloaded.
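The modification-time ordering described above could be done along these lines. This is a minimal sketch, not the project's actual code: the function name `stamp_newest_first` and the one-second spacing between files are assumptions made for illustration.

```python
import os
import time


def stamp_newest_first(paths, base_time=None):
    """Set modification times so that paths[0] sorts as the most recent file.

    `paths` is assumed to be ordered from newest to oldest artwork.
    Each later file is stamped one second older than the one before it
    (the one-second step is an arbitrary choice for this sketch).
    """
    if base_time is None:
        base_time = time.time()
    for offset, path in enumerate(paths):
        stamp = base_time - offset
        # os.utime takes a (access_time, modification_time) tuple in seconds
        os.utime(path, (stamp, stamp))
```

Sorting the artist directory by modified date in a file manager would then list the artworks in upload order.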

Instructions

  1. install Python 3.6+

  2. install requests library

    pip install --user requests

  3. edit the config.json file in the data folder manually or via the command-line interface

    • artists: the artist IDs as shown in each artist's URL
    • save directory: the save directory path

Usage

display help message

$ python main.py -h

usage: main.py [-h] [-f FILE] [-l] [-s SAVE_DIR] [-a  [ID ...]]
               [-d all [ID ...]] [-c all [ID ...]] [-t THREADS] [-r]

optional arguments:
  -h, --help       show this help message and exit
  -f FILE          set config file
  -l               list current settings
  -s SAVE_DIR      set save directory path
  -a  [ID ...]     add artist ids
  -d all [ID ...]  delete artist ids and their directories
  -c all [ID ...]  clear artists directories
  -t THREADS       set the number of threads
  -r               run program

run the program with current configuration (i.e. update artists' artworks)

python main.py

add artist IDs then run the program

python main.py -a wlop trungbui42 -r

load temp.json file in data folder then add artist IDs. Note that temp.json is only used for this instance and is not a replacement for the default config.json file

python main.py -f data/temp.json -a wlop trungbui42

clear update information (i.e. re-download artworks), set threads to 24, then run the program

python main.py -c all -t 24 -r

Challenges

  1. get all artwork URLs of an artist from a specific URL. There are two ways to do this: through the AJAX URL or through the normal URL. The former is preferred because it returns a JSON object that is easy to work with. However, ArtStation has a security check that prevents direct access to the AJAX URL. Below are the methods I tried:

    • Attempt 1: for the AJAX URL, the request works in a browser but not in Python, so I changed the request headers (user-agent, cookies, etc.) to match those sent by the browser, hoping to bypass the security check. Unfortunately, this did not work.

    • Solution 1: for the AJAX URL, use Selenium with ChromeDriver to request the link. This only works if the driver is not in headless mode; I tried modifying the request headers, but to no avail. It is therefore not a good solution because (1) users will see the browser automation, which may not be desirable, and (2) performance suffers from the overhead of the driver itself and from being unable to run headless.

    • Solution 2: use the normal URL and parse the plain HTML to get the information. The trick is to use the artist's website instead of the portfolio page, as the latter is generated dynamically from the AJAX request and thus contains no useful content.

  2. invalid folder name. I originally planned to name subdirectories using the artist names, but there are two problems with this approach: (1) if an artist name contains special characters, the program may not be able to find the folder path, depending on the OS (for example, Windows automatically removes a trailing . character from a folder name), and then terminates with an error; (2) if artists change their names, the program leaves multiple directories for the same artist.

    • Solution: use artist IDs as the folder names

  3. file duplicate issue. On ArtStation, artists can give different artworks identical file names, which causes the program to overwrite downloaded files.

    • Solution: append the artwork ID to each file name
  4. update mechanism

    • Attempt: download artworks from newest to oldest until an existing file is found on disk. This does not work well with the multi-threaded implementation, because handling the thread stopping condition makes the program much more complicated.

    • Solution: record the last visited artwork for each artist and use it to check whether an update is needed. This does not work if the artist has deleted the newest upload, because the stored information can no longer be found in the retrieved HTML. One alternative is to record all downloaded artworks for each artist and compare that list with the parsed data, but this costs considerably more storage and memory.
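The solutions to challenges 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names `unique_filename` and `new_artworks`, the underscore separator, and the fallback of treating everything as new when the stored marker is missing are all hypothetical choices.

```python
from pathlib import Path


def unique_filename(original_name, artwork_id):
    """Append the artwork ID before the extension so equal names cannot collide.

    The "<stem>_<id><suffix>" layout is an assumption for this sketch.
    """
    path = Path(original_name)
    return f"{path.stem}_{artwork_id}{path.suffix}"


def new_artworks(parsed_ids, last_seen_id):
    """Return the IDs that are newer than the recorded one.

    `parsed_ids` is assumed to be ordered from newest to oldest, as parsed
    from the artist's page. `last_seen_id` is the marker stored on the
    previous run, or None on the first run.
    """
    if last_seen_id is None:
        return list(parsed_ids)
    try:
        cutoff = parsed_ids.index(last_seen_id)
    except ValueError:
        # The marker is gone (for example, the artist deleted that upload).
        # This sketch falls back to treating every parsed artwork as new.
        return list(parsed_ids)
    return list(parsed_ids[:cutoff])
```

For example, `new_artworks(["c", "b", "a"], "b")` returns `["c"]`, and an empty result means the artist is up-to-date. The `ValueError` branch is where the limitation described in the solution shows up: with only one stored marker, a deleted upload cannot be told apart from a first run.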

Todo

  • add more functionality (e.g. ranking)

artstation-scraper's Issues

UnicodeEncodeError: 'charmap' codec can't encode character '\u2726' in position 30: character maps to <undefined>

Doesn't seem to handle certain encoding characters?

OS: Windows 10
Python 3.8
Using the GitBash terminal

Running under normal conditions for the following users:

{
    "save_directory": "./art",
    "artists": [
        "kuvshinov_ilya",
        "guweiz",
        "nababa",
        "polkin",
        "rossdraws",
        "tsuaii",
        "viccolatte",
        "wlop"
    ]
}

Error code:

$ python main.py -r

there are 8 artists

download for artist Ilya Kuvshinov begins

artist Ilya Kuvshinov is up-to-date

download for artist Z.W. Gu begins

artist Z.W. Gu is up-to-date

download for artist arata yokoyama begins

artist arata yokoyama is up-to-date

download for artist Ilya Knyazev begins

artist Ilya Knyazev is up-to-date

Traceback (most recent call last):
  File "main.py", line 53, in <module>
    main()
  File "main.py", line 49, in main
    download_artists(api, config)
  File "main.py", line 9, in download_artists
    result = api.save_artists(config.artists, config.save_dir)
  File "H:\Dropbox\Projects\Python\scrapers\artstation-scraper\lib\artstation.py", line 106, in save_artists
    files = self.save_artist(id, dir_path)
  File "H:\Dropbox\Projects\Python\scrapers\artstation-scraper\lib\artstation.py", line 89, in save_artist
    print(f"download for artist {artist_name} begins\n")
  File "C:\Users\<username>\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2726' in position 30: character maps to <undefined>
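The traceback shows `print` failing because the console encoding (cp1252) cannot represent a character in the artist name. One possible workaround, sketched here and not taken from the project, is to make the output stream replace unencodable characters instead of raising. The helper name `make_stream_unicode_safe` is hypothetical; `reconfigure` is available on text streams from Python 3.7.

```python
import sys


def make_stream_unicode_safe(stream=None):
    """Make a text stream substitute '?' for characters it cannot encode.

    Defaults to sys.stdout. Streams without a reconfigure() method
    (for example, some replaced or wrapped stdout objects) are left as-is.
    """
    if stream is None:
        stream = sys.stdout
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(errors="replace")
```

Calling this once at program start, before any artist names are printed, should avoid the crash at the cost of showing `?` for unsupported characters. Setting the `PYTHONIOENCODING` environment variable to `utf-8` before running the script is another option that may work, depending on the terminal.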


python script opens then immediately closes

I don't think the instructions are clear. What do I need to download, and where do I put it? I'm completely new to this, and it's a lot of stress just to collect reference images for learning art.

System installation

There is a cardinal rule: any application should be installed through the system package manager. While creating a Gentoo ebuild (a script for installing a package system-wide), I found that it is currently impossible to install this program system-wide.
So I am creating this ticket.
