
pixivmediascraper's People

Contributors

sliceofcake


Forkers

wittybaji

pixivmediascraper's Issues

Even faster would be nice

My artist list is currently at 55 artists, and it takes 8 minutes, just about on the nose, for the JavaScript scraper to run. Although this process saves enormous amounts of time, it would be nice if the JavaScript scraper ran even faster.

Maybe we can do bolder guessing on filenames and page-counts. Maybe there's a better way to throttle the iframe batches.
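
A rough illustration of the page-count guessing idea: probe consecutive page indexes until the first miss instead of asking the site for the count. Here `page_exists` is a hypothetical predicate (e.g. a cheap HEAD request against the guessed filename); the real scraper would also need whatever cookies and headers pixiv requires, which this sketch leaves out.

```python
# Sketch only: guess an illustration's page count by probing consecutive page
# indexes until the first miss. "page_exists" is a hypothetical predicate,
# standing in for a cheap existence check such as a HEAD request.
def guess_page_count(page_exists, hard_cap=500):
    count = 0
    while count < hard_cap and page_exists(count):
        count += 1
    return count

# Stand-in predicate: pretend the illustration has exactly 3 pages.
print(guess_page_count(lambda page: page < 3))  # -> 3
```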

Supported platforms?

Which Apple OSs come with Python already loaded? Do they all have these libraries preloaded as well, like my OS does?
Does the script work on Windows?

Breakage - Sign In Failure [29 Jun 2019]

Script fails when attempting to sign in. When signing in manually, the error message is the completely irrelevant one saying that too many bad password attempts were made and a temporary lock is therefore in place. You can still sign in manually and reliably by just answering the grid captcha. Their system is a mess, though: it seems to randomly ask for a captcha and randomly show or not show that message. I tried visiting the page again to copy-paste the text here, but I can't reproduce the error message; it now signs in manually without issue.

I see a bunch of criteo.com stuff in the page-load triggers, so I'm wondering whether they recently partnered with some third-party group and whether that affected their authentication.

Or it could just be my account and/or bad luck. I'll wait this one out for a few days because I'm currently at a loss...

ADD: Was working fine on [22 Jun 2019].

Fight with python to spawn >2046 threads

The stop-go gate every 256 threads works, but it significantly slows down the script's execution time. There needs to be a way to start all the threads at the beginning, just not literally all at once, because Python will complain.
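
A minimal sketch of one alternative to the stop-go gates, assuming the usual worker-pool pattern rather than anything in the current script: keep a fixed number of threads alive and feed them work through a Queue, so the process never tries to create thousands of OS threads. `download_one` and `urls` are stand-ins for the script's real download function and URL list.

```python
import threading
try:
    import queue               # Python 3
except ImportError:
    import Queue as queue      # Python 2.7, which the traceback further down shows

NUM_WORKERS = 64               # tune: well below the ~2046-thread ceiling

def download_one(url):         # stand-in for the script's real per-file download
    pass

urls = []                      # stand-in for the real list of image URLs

jobs = queue.Queue()

def worker():
    while True:
        url = jobs.get()
        if url is None:        # sentinel: this worker is done
            jobs.task_done()
            return
        try:
            download_one(url)
        finally:
            jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
for u in urls:
    jobs.put(u)
for _ in workers:
    jobs.put(None)             # one sentinel per worker
for w in workers:
    w.join()
```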

Artist changed their name, script redownloaded everything from them

Because existence checks are done by folder name, and the folder name includes the username, when an artist changed their name the scraper dutifully failed to recognize that we already have a folder for that artist. Instead, it treated them as a new artist, so now I have two more-or-less identical folders.
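
A small sketch of matching on the stable userID instead of the full folder name. It assumes the scraper names artist folders something like "<userID> <username>", which is a guess about the current layout, not a statement of it.

```python
import os

def find_existing_folder(base_dir, user_id):
    # Return an existing artist folder whose name begins with this userID,
    # regardless of what the (possibly renamed) username half currently says.
    prefix = str(user_id)
    for name in os.listdir(base_dir):
        if name == prefix or name.startswith(prefix + " "):
            return os.path.join(base_dir, name)
    return None  # genuinely new artist
```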

Stuck on 1. Slow.

I tried a second artist and it got stuck on 1. Also, it runs really slowly.
[This self-reported issue is sparse on details; working on a solution.]

threads cap out for some reason [2046 max per run]

Downloading 2681 images...
△△△△△△△△Traceback (most recent call last):
  File "pixivRoot.py", line 390, in <module>
    t.start()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
[thousands of △/◯ progress markers omitted]


Two special things:
• a lot of files were to be downloaded
• my RAM was pretty much pegged

pixiv changed their page format

WARNING : could not find username on page - falling back to solely userID [developer's fault - pixiv changed their artist page format]

Putting this here. If I can find time, I'll try to fix it.

ampersand showing as HTML

Because we load the page raw, an ampersand obviously stays as its HTML entity when it ends up in an artist-name string. I suppose we need to convert single-character HTML entities before turning the text into a name. To be on the safe side, maybe just handle the ampersand for now?
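
A possible approach, not the project's current code: decode HTML character references in general before using the text as a name, which also covers the ampersand case. Written so it should run on both Python 2.7 (which the traceback elsewhere in these issues shows) and Python 3.

```python
try:
    from html import unescape              # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser      # Python 2
    unescape = HTMLParser().unescape

print(unescape("Tom &amp; Jerry"))          # -> Tom & Jerry
print(unescape("&quot;quoted&quot; name"))  # -> "quoted" name
```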

[21 Oct 2018] Batch

• It's annoying to see the waves of white text, since that text doesn't matter.
• Occasionally threads fail with "returnFalseOnFailure" not defined at line 94.

Sometimes Images Get An Incorrect 404

There was an image that got a 404 for me at first. When I went through the images on my computer I noticed the bad file and deleted it, and when I ran the DL script again it downloaded just fine [although I did load that image manually in my browser before redoing the update process, and it initially showed a broken-image symbol before I refreshed].

I'm guessing that issues can arise with the downloads. Maybe there could be a way to identify that a jpg or png isn't correct [it'll be a text file with a 404 message or something, or maybe blank, but either way it'll be missing the magic identifier at the start of the file: https://en.wikipedia.org/wiki/List_of_file_signatures]. From there, it could either retry the download, or just remove the incorrect file and wait for the user to re-run the script another day, when the chances of success may be higher.
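
A minimal sketch of the magic-number check described above: treat a saved .jpg/.png as suspect if it does not start with the expected file signature, since an error page saved to disk will usually be HTML or empty instead. The extensions and handling here are assumptions about what the scraper saves.

```python
import os

# Leading bytes taken from the file-signature list linked above.
SIGNATURES = {
    ".jpg": b"\xff\xd8\xff",
    ".jpeg": b"\xff\xd8\xff",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".gif": b"GIF8",
}

def looks_valid(path):
    ext = os.path.splitext(path)[1].lower()
    expected = SIGNATURES.get(ext)
    if expected is None:
        return True                        # unknown extension: don't judge it
    with open(path, "rb") as f:
        return f.read(len(expected)) == expected

# A failing file could then be deleted (or re-downloaded) on the next run.
```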

Suggestion : stop script earlier

The --disable-page option should be more aggressive: once it recognizes a few galleries [or, alternatively, a single subgallery] that it has already downloaded, stop scanning the artist.
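
Roughly what that early stop could look like, assuming galleries are scanned newest-first and that the scraper exposes some existence check; `galleries` and `already_downloaded` are placeholders, not real identifiers from the script.

```python
ALREADY_SEEN_LIMIT = 3     # stop after this many consecutive known galleries

def scan_artist(galleries, already_downloaded):
    seen_in_a_row = 0
    new_galleries = []
    for gallery in galleries:              # assumed newest-first order
        if already_downloaded(gallery):
            seen_in_a_row += 1
            if seen_in_a_row >= ALREADY_SEEN_LIMIT:
                break                      # assume everything older is already here
        else:
            seen_in_a_row = 0
            new_galleries.append(gallery)
    return new_galleries
```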

Not viable with 46 artists

• I ran the JavaScript script on a list of 46 artists and it didn't complete. After 2 hours and 45 minutes I stopped the timer because my browser had stopped making progress [see the note about RAM usage].
• Firefox pegged my RAM at over 10.5 GB before it ran out of free RAM and seemingly stopped functioning.
• It frequently seemed to have many more than 4 outstanding pages; think 17 in some bursts.

I'd like the JavaScript script to run much faster. Think 10x faster. Maybe look into sending cookies along with cURL requests and doing this scanning process on the command line, if it's possible to transfer the scanning logic over.

Or, in the meantime, maybe there's an issue with iframes not being released properly? Unless the text file is somehow becoming erroneously enormous, Firefox should not be demanding the 10.5 GB+ of RAM that it was.
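
For the "scan outside the browser" idea, a very rough sketch of reusing the browser's session cookie for plain HTTP fetches. The cookie name, the required headers, and whether pixiv accepts requests like this at all are assumptions, not verified behaviour.

```python
try:
    from urllib.request import Request, urlopen   # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2

def fetch(url, session_cookie):
    # session_cookie would be copied out of the browser, e.g. "PHPSESSID=..."
    # (which cookie pixiv actually relies on is an assumption here).
    req = Request(url, headers={
        "Cookie": session_cookie,
        "User-Agent": "Mozilla/5.0",
    })
    return urlopen(req).read()
```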

Not downloading everything

Sometimes there will be fewer than 10 missing downloads out of fewer than 1000 total images. It'd be nice if this could somehow be detected and dealt with. There may be cURL errors or 404 responses that can be checked. Or maybe there are file types other than .jpg and .png.
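
One way the stragglers could be handled, sketched with the standard library only (the real script's download path may look different): check for HTTP errors and retry a few times with a pause, leaving the file absent if every attempt fails so a later run can pick it up.

```python
import time
try:
    from urllib.request import urlopen                  # Python 3
    from urllib.error import HTTPError, URLError
except ImportError:
    from urllib2 import urlopen, HTTPError, URLError    # Python 2

def download_with_retry(url, dest, attempts=3, pause=5):
    for _ in range(attempts):
        try:
            data = urlopen(url).read()
            with open(dest, "wb") as f:
                f.write(data)
            return True
        except (HTTPError, URLError):
            time.sleep(pause)              # wait, then try again
    return False                           # leave it for a later run
```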

Recommended way to stop script, along with all threads

When I use Control+Z to quit the script in the middle of a run, there are sometimes leftover Python jobs [I haven't tested this enough; I'm a little scared to]. When I go to close the Terminal tab, it warns me about them. What's the recommended way to kill all Python threads?
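
Not an official answer from the project, just the usual Python pattern: Control+Z only suspends the process (resume it with `fg`), it does not quit it. Control+C raises KeyboardInterrupt in the main thread, and worker threads marked as daemons are torn down when the main thread exits, so nothing should be left behind.

```python
import threading
import time

def worker():
    while True:
        time.sleep(1)          # stand-in for real download work

t = threading.Thread(target=worker)
t.daemon = True                # daemon threads die with the main thread
t.start()

try:
    while t.is_alive():
        time.sleep(0.5)
except KeyboardInterrupt:      # Control+C
    print("interrupted; daemon threads will not keep the process alive")
```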

Performance Improvement Suggestion

Before we check the extension, we can just use the ID to check whether ANY matching file with that ID exists locally. If 1000.jpg exists, we don't need to verify that it's actually a .jpg; we can just trust that it is.

This would need to be a flag because if people mess with their files, this could generate incorrect results.
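
A sketch of what that flag could gate, assuming files are saved as "<id>.<ext>" (an assumption about the scraper's naming, not a fact): any local file whose name starts with the image ID counts as already downloaded, whatever its extension.

```python
import glob
import os

def already_have(folder, image_id):
    pattern = os.path.join(folder, "%s.*" % image_id)
    return len(glob.glob(pattern)) > 0

# e.g. with the flag enabled, skip extension verification entirely:
# if trust_existing and already_have(dest_folder, 1000):
#     continue
```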
