sliceofcake / pixivMediaScraper
Keeps a local folder tree of your favorite pixiv artists' works, quickly gets newly posted works. [Python 2] [Mac OS X]
Since 19 Dec 2016, the Python script barely gets through the initial page scans before creeping along extremely slowly, pegging a single CPU core at 100%.
Maybe it's a problem with the account I'm using.
The --disable-page option should be more aggressive: once it recognizes a few galleries [alternatively, a single subgallery] that it has already downloaded, it should stop scanning that artist.
^
Sometimes there will be <10 missing downloads out of <1000 total images. It'd be nice if this could somehow be detected and dealt with. There may be cURL errors or 404 responses that can be checked. Or maybe there are files in formats other than .jpg and .png.
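The 404 check suggested above could live in a small retry wrapper around the download call. A minimal sketch, assuming the downloader raises an error on an HTTP failure; `fetch()` here is a stand-in stub (it fails twice, then succeeds) so the retry path is actually exercised, not the repo's real download code:

```python
import time

def fetch(url, _attempts={"n": 0}):
    # Stub standing in for the real cURL/urllib download; it fails twice
    # with a fake 404, then succeeds, so the retry loop below runs fully.
    _attempts["n"] += 1
    if _attempts["n"] < 3:
        raise IOError("HTTP 404")
    return b"\xff\xd8image-bytes"

def download_with_retry(url, tries=3, delay=0):
    # Retry transient failures; re-raise after the final attempt so a
    # persistent 404 surfaces instead of an error page landing on disk.
    for attempt in range(tries):
        try:
            return fetch(url)
        except IOError:
            if attempt == tries - 1:
                raise
            time.sleep(delay)

print(download_with_retry("img1000.jpg"))  # b'\xff\xd8image-bytes'
```

A persistent failure would still raise after the final attempt, which is the point: the script can then skip writing the file rather than saving the 404 body.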
Because we load the page raw, an ampersand obviously stays as its entity form when the text becomes an artist-name string. I suppose we need to convert HTML character entities before turning page text into a name. To be on the safe side, maybe just handle the ampersand for now?
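Entity decoding is a one-liner in the standard library. The repo targets Python 2 (where the equivalent is `HTMLParser().unescape`); this sketch uses the Python 3 form, and it handles all named entities, not just the ampersand:

```python
# Decode HTML entities in scraped text before using it as a folder name.
import html

raw_name = "Tom &amp; Jerry"       # what the raw page source contains
artist_name = html.unescape(raw_name)
print(artist_name)                 # Tom & Jerry
```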
The script fails when attempting to sign in. Signing in manually produces a completely irrelevant error message saying that too many bad password attempts were made and a temporary lock is therefore in place. You can still get in reliably by hand just by answering the grid captcha. Their system is a mess, though: it seems to ask for a captcha at random and to show or hide that message at random. I tried revisiting the page to copy-paste the text here, but I can't reproduce the error message; now it signs in manually without issue.
I see a bunch of criteo.com requests firing on page load, so I'm wondering if they recently partnered with some third-party group and maybe it affected their authentication.
Or it could just be my account and/or bad luck. I'll wait this one out for a few days because I'm currently at a loss...
ADD: Was working fine on [22 Jun 2019].
When I use Control+Z to quit the script mid-run, there are sometimes leftover python jobs [I haven't tested this enough; I'm a little scared to]. When I go to close the Terminal tab, it warns me about them. What's the recommended way to kill all python threads?
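Part of the answer is that Control+Z only suspends the process; the threads still exist, which is why Terminal warns about them. One standard fix, sketched here (not the repo's actual code), is to mark worker threads as daemon threads so the whole process can die cleanly when the main thread exits:

```python
# Daemon threads do not keep the interpreter alive: when the main thread
# exits (e.g. after Ctrl+C), they are torn down with it, so no leftover
# python jobs remain.
import threading
import time

def worker():
    time.sleep(0.1)  # stand-in for a download

threads = []
for _ in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True  # thread dies automatically with the process
    t.start()
    threads.append(t)

for t in threads:
    t.join()
print(all(not t.is_alive() for t in threads))  # True
```

From the shell, a job suspended with Control+Z can also be killed with `kill %1`, or resumed with `fg` and then stopped with Ctrl+C.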
Before we check the extension, we can just use the ID to check whether ANY matching file with that ID exists locally. If 1000.jpg exists, we don't need to verify that it's actually a .jpg; we can just trust that it is.
This would need to be behind a flag, because if people mess with their files, it could produce incorrect results.
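A sketch of that ID-only existence check as a hypothetical helper. It assumes files are named `<id>.<extension>` as in the example above; real pixiv filenames may carry page suffixes, in which case the pattern would need adjusting:

```python
# Any file named "<id>.<anything>" counts as already downloaded; the
# extension is trusted, not verified (hence the opt-in flag).
import glob
import os

def already_have(folder, work_id):
    return bool(glob.glob(os.path.join(folder, "%s.*" % work_id)))
```

The literal `.` after the ID keeps `1000.*` from matching `10000.jpg`, so IDs that share a prefix don't collide.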
My artist list is currently at 55 artists, and the JavaScript scraper takes just about 8 minutes on the nose to run. This process already saves an enormous amount of time, but it would be nice if the JavaScript scraper ran even faster.
Maybe we can do bolder guessing on filenames and page counts. Maybe there's a better way to throttle the iframe batches.
I tried a second artist and it got stuck on 1. Also, it runs really slowly.
[self-reported Issue is sparse on details. working on a solution]
Downloading 2681 images...
△△△△△△△△
Traceback (most recent call last):
  File "pixivRoot.py", line 390, in <module>
    t.start()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 745, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
[several thousand ◯/△ progress symbols omitted]
Two special things:
• a lot of files were to be downloaded
• my RAM was pretty much pegged
The stop-go gates every 256 threads work, but they significantly slow down the script's execution time. There needs to be a way to queue all the work at the beginning without ~literally~ starting every thread at once, because Python will complain.
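A fixed-size worker pool is the usual way out of both problems: every job can be enqueued up front, but only a bounded number of OS threads ever exist, so `thread.error: can't start new thread` can't happen and there are no stop-go waves. The repo is Python 2; this sketch uses Python 3's stdlib pool, with a placeholder in place of the real download:

```python
# Queue all 2681 jobs immediately; only max_workers threads actually run.
from concurrent.futures import ThreadPoolExecutor

def download(url):
    return url  # placeholder for the real cURL/urllib fetch

urls = ["img%05d.jpg" % i for i in range(2681)]
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(download, urls))
print(len(results))  # 2681
```

This also bounds memory: only the in-flight downloads hold buffers, which matters given the pegged-RAM observation above.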
The <title> tag contents has a different format now.
The API seems to have changed.
• I ran the JavaScript script on a list of 46 artists and it didn't complete. I stopped the timer at 2 hours and 45 minutes, when my browser stopped making progress [see the note about RAM usage].
• Firefox pegged my RAM at over 10.5 GB before it ran out of free RAM and seemingly stopped functioning.
• It frequently seemed to have many more than 4 outstanding pages; think 17 in some bursts.
I'd like the JavaScript script to run much faster; think 10x faster. Maybe look into sending cookies along with cURL requests and doing this scanning process on the command line, if the scanning logic can be transferred over.
Or, in the meantime, maybe iframes aren't being properly released? Unless the text file is somehow becoming erroneously enormous, Firefox should not be demanding the more than 10.5 GB of RAM that it was.
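The cookie idea boils down to attaching the browser session's cookie to plain HTTP requests. A sketch of where the cookie would go; nothing is sent here, the request is only constructed. The URL, cookie name, and value are all placeholders, not pixiv's confirmed endpoints (shown with Python 3's stdlib, though the same header works with cURL's `--cookie` flag):

```python
# Build a request carrying a session cookie copied from the browser's
# dev tools, so scanning could run outside the browser entirely.
import urllib.request

req = urllib.request.Request(
    "https://www.pixiv.net/member_illust.php?id=12345",  # placeholder URL
    headers={
        "Cookie": "PHPSESSID=<copied-from-browser>",  # placeholder cookie
        "User-Agent": "Mozilla/5.0",
    },
)
print(req.get_header("Cookie"))  # PHPSESSID=<copied-from-browser>
```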
• It's annoying to see the waves of white text, since that text doesn't matter.
• Occasionally threads fail with NameError: name "returnFalseOnFailure" is not defined, at line 94.
WARNING : could not find username on page - falling back to solely userID [developer's fault - pixiv changed their artist page format]
Putting this here. If I can find time, I'll try to fix it.
Existence checks are done by folder name, and the folder name includes the username. An artist changed their name, so the scraper, doing its duty, didn't recognize that we already have a folder for that artist. Instead, it treated them as a new artist, and now I have two more-or-less duplicate folders.
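One way out is to key the existence check on the numeric artist ID instead of the full folder name. This is a hypothetical sketch: it assumes folders are named "<userID> <username>", which may not match the scraper's real naming scheme:

```python
# Find an artist's existing folder by ID prefix, so a renamed artist still
# maps to the old folder instead of spawning a duplicate.
import os
import re

def find_artist_folder(root, user_id):
    pattern = re.compile(r"^%d\b" % user_id)  # ID must end at a boundary
    for entry in sorted(os.listdir(root)):
        if pattern.match(entry):
            return entry  # existing folder, even if the name part changed
    return None
```

The `\b` boundary keeps user 12 from matching a folder for user 123.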
There was an image that initially returned a 404 for me. When I went through the images on my computer I noticed the bad file and deleted it, and when I ran the DL script again it downloaded just fine [although I did load that image manually in my browser before redoing the update process, and it initially showed a broken-image icon before I refreshed].
I'm guessing issues can arise with the downloads. Maybe there could be a way to identify that a jpg or png isn't correct [it'll be a text file with a 404 or something, or maybe blank, but it'll be missing the magic identifier at the start of the file: https://en.wikipedia.org/wiki/List_of_file_signatures]. From there, it could either retry the download, or just remove the incorrect file and wait for the user to re-run the script another day, when/if the chances of success will be higher.
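A sketch of that magic-number check. A real JPEG begins with the bytes FF D8 and a PNG with a fixed 8-byte signature; a saved-over 404 page or an empty file fails the test and can be deleted for a later retry:

```python
# Validate downloaded images by their file signatures rather than trusting
# the extension.
JPEG_MAGIC = b"\xff\xd8"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_valid(path):
    with open(path, "rb") as f:
        head = f.read(8)
    if path.lower().endswith((".jpg", ".jpeg")):
        return head[:2] == JPEG_MAGIC
    if path.lower().endswith(".png"):
        return head == PNG_MAGIC
    return True  # unknown extension: don't judge
```

Running this over the folder tree after a download pass would flag exactly the broken-file case described above.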
The script runs slowly and returns 0 changes. Pixiv just changed their HTML recently, so I guess they did it again...
Already fixed. Filing this report for record-keeping.
Which Apple OSs have Python preloaded? Do they all have these libraries preloaded as well, like my OS does?
Does the script work for Windows?