xA-Scraper

This is an automated tool for scraping content from a number of art sites:

  • DeviantArt
  • Patreon
  • FurAffinity
  • HentaiFoundry
  • Pixiv
  • InkBunny
  • SoFurry
  • Weasyl
  • Newgrounds art galleries

To Add:

Decrepit:

  • Tumblr art blogs

Checked so far:

  • hf, df, wy, ng, ib, fa

Todo:

  • da, pat, px

It has also grown a lot of other functions over time, including a fairly complex, interactive web interface for browsing the local gallery mirrors.

Dependencies:

  • Linux
  • Postgres >= 9.3 or Sqlite
  • CherryPy
  • Pyramid
  • Mako
  • BeautifulSoup 4
  • others
  • google-chrome (for da)

The backend can either use a local sqlite database (which has poor performance, particularly when cold, but is very easy to set up), or a full postgresql instance.

Configuration is done via a file named settings.py, which must be placed in the repository root. settings.base.py is an example config to work from. In general, you will want to copy settings.base.py to settings.py, and then add your usernames, passwords, and database configuration.

The DB backend is selected via the USE_POSTGRESQL parameter in settings.py.

If using PostgreSQL, DB setup is left to the user. xA-Scraper requires its own database, and the ability to make IP-based connections to the hosting PG instance. The connection information, DB name, and client name must be set in settings.py.
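For the PostgreSQL case, provisioning might look something like the following. The role and database names here are purely illustrative examples; use whatever matches your settings.py.

```shell
# Hypothetical names; adjust to match your settings.py.
sudo -u postgres createuser --pwprompt xascraper
sudo -u postgres createdb --owner=xascraper xa_scraper
# Ensure pg_hba.conf permits password-authenticated TCP connections
# from the host running the scraper, then reload PostgreSQL.
```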

When using sqlite, you just have to specify the path where you want the sqlite DB to be located (or use the default, which is ./sqlite_db.db).
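A minimal sketch of the database-related portion of settings.py. USE_POSTGRESQL and the sqlite default path come from this README; the remaining variable names are illustrative assumptions and may differ from the real settings.base.py.

```python
# Database backend selection. False -> sqlite, True -> PostgreSQL.
USE_POSTGRESQL = False

# sqlite backend: just a file path (default shown).
SQLITE_PATH = "./sqlite_db.db"

# PostgreSQL backend: connection details for a dedicated database.
# (Names below are illustrative, not the actual settings keys.)
DATABASE_IP      = "127.0.0.1"
DATABASE_DB_NAME = "xa_scraper"
DATABASE_USER    = "xascraper"
DATABASE_PASS    = "change-me"
```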

settings.py is also where the login information for the various plugins goes.

Individual plugins can be disabled by commenting out the appropriate line in main.py. The JOBS list dictates the various scheduled scraper tasks that are placed into the scheduling system.
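The JOBS pattern can be sketched as below. The entry format and plugin identifiers are assumptions for illustration, not the actual contents of main.py.

```python
# Illustrative sketch of the JOBS list in main.py: each entry schedules
# one scraper plugin; commenting an entry out disables that plugin.
JOBS = [
    # (plugin, interval_hours)
    ("da", 24),
    ("fa", 24),
    # ("px", 24),   # commented out -> Pixiv plugin disabled
]

enabled = [name for name, _interval in JOBS]
```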

The preferred bootstrap method is to use run_scraper.sh from the repository root. It will ensure the required packages are available (build-essential, libxml2, libxslt1-dev, python3-dev, libz-dev), and then install all the required Python modules in a local virtualenv. Additionally, it checks whether the virtualenv is already present, so once it has been created, ./run_scraper.sh will simply source the venv and run the scraper without any reinstallation.

To run the web UI (which handles adding names to scrape, viewing fetched files, and so on), run run_web.sh. The expected use is to run both run_scraper.sh and run_web.sh as daemons.

Some aspects still need work. The artist selection system is currently a bit broken: there is no clean way to remove artists from the scrape list, though you can add or modify them.

Notes:

  • There have been reports that things are actively broken on non-Linux platforms. Realistically, all development is done on an Ubuntu 18.04 LTS install, and running on anything else is at your own risk.

  • The Yiff-Party scraper requires significant external infrastructure, as it currently depends on threading its fetch requests through the autotriever project. This depends on having both a publicly available RabbitMQ instance and a running instance of the FetchAgent components of the ReadableWebProxy fetch-agent RPC service on your local LAN.

  • FurAffinity has a login captcha. This requires that you either manually log the FA scraper in (via the "Manual FA Login" facility in the web interface) or use an automated captcha-solving service. Currently, the only solver supported is the 2Captcha service.

  • This is my oldest "maintained" project, and the codebase is commensurately horrible. Portions of it were designed and written while I was still learning Python, so there are a bunch of really terrible design decisions baked into the class structure, and much of the code just does stupid things.


Anyways, Pictures!

These are a few DeviantArt Artists culled from the Reddit ImaginaryLandscapes subreddit.

The web-interface has a lot of fancy mouseover preview stuff. Since this is primarily intended to run off a local network, bandwidth concerns are not too relevant, and I went a bit nuts with jQuery.

Basic Popups

There is also a somewhat experimental "gallery slice" viewing system, where horizontal mouse movement seeks through a spaced subset of each artist's images. The artist is determined by the row, and each horizontal 10 pixels selects a different image.
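The position-to-image mapping can be sketched as follows; this is an illustrative reconstruction of the described behavior, not the project's actual front-end code.

```python
def slice_index(mouse_x: int, num_images: int) -> int:
    """Map a horizontal cursor position to an image index: each
    10-pixel band selects the next image in the artist's row,
    clamped to the size of the gallery subset."""
    if num_images <= 0:
        raise ValueError("gallery is empty")
    return min(mouse_x // 10, num_images - 1)
```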

Fancy Popups

Lastly, there is also a basic, chronological view of each artist's work, with infinite scrolling through their entire gallery. The scraper also preserves the description that accompanies each item, and it is presented with the corresponding image.

Contributors

  • fake-name
  • helios-vmg
  • herp-a-derp
  • importtaste
  • pyup-bot
