radiolarian / ao3scraper
A Python scraper for getting fan fiction content and metadata from Archive of Our Own.
Thanks so much for this scraper! It's awesome. I'm running into an issue with ao3_get_fanfics.py: if a work has no kudos, an error referencing the get_kudos function is raised and the program exits.
This does not appear to happen with works that have no bookmarks or no comments.
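Not the repository's actual code, but a minimal sketch of the likely fix: guard the metadata lookup so a missing kudos element yields a placeholder instead of crashing. The helper name, the `"null"` placeholder, and the `dd.kudos` selector in the usage line are assumptions.

```python
def safe_text(tag, default="null"):
    """Return tag.text stripped, or a default when the element is absent.

    Works with zero kudos simply have no kudos element in their metadata
    block, so BeautifulSoup's find() returns None; calling .text on that
    None is the likely cause of the crash described above.
    """
    if tag is None:
        return default
    return tag.text.strip()
```

Usage would look something like `kudos = safe_text(meta.find("dd", class_="kudos"))`, and the same guard applies to any other optional metadata field.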
Hi there, this has been brought up before, but unfortunately I can't figure out why it's happening (it's not just Excel for me). The output of the ao3_get_fanfics CSV looks like this:
Not sure what's causing the body text to spill into other rows; any insights on how to fix this?
EDIT: the breaks seem to coincide with paragraph breaks in the actual text of the fic, but they also occur mid-sentence with no delimiter other than a space.
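One workaround (not from the repo, just a sketch): normalise newlines in the fic body before writing, and quote every field, so spreadsheet programs that mishandle embedded line breaks see exactly one physical line per row. The helper name is illustrative.

```python
import csv
import io
import re

def flatten_body(text):
    """Replace runs of newlines and surrounding whitespace with single spaces."""
    return re.sub(r"\s*\n\s*", " ", text).strip()

# Write one row per work, with every field quoted.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["5937274", flatten_body("First paragraph.\n\nSecond one.")])

# Reading it back yields a single row, body intact.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
```

The trade-off is that paragraph structure is lost; if that matters, replacing `"\n"` with a visible sentinel like `"<p>"` instead of a space preserves it reversibly.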
Hello,
when I tried ao3_work_ids.py on the example from the README, it returned no results (after removing --lang per the other issue). Note that I used Python 3, since I was unable to install the dependencies for Python 2.
I tracked the problem to the get_ids function, specifically the line works = soup.find_all(class_="work blurb group"). Apparently, when given multiple classes as a single string, find_all returns only elements whose class attribute is exactly that string, but the works in the list also carry other classes like work-some_number.
Based on this SO question: https://stackoverflow.com/questions/40305678/beautifulsoup-multiple-class-selector
using works = soup.select("li.work.blurb.group") instead seems to work.
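The difference can be shown in a few lines; the HTML fragment below is a cut-down stand-in for an AO3 work listing, not the real markup.

```python
from bs4 import BeautifulSoup

html = '<ol><li class="work blurb group work-123">a fic</li></ol>'
soup = BeautifulSoup(html, "html.parser")

# A multi-word string passed to class_ must match the class attribute
# exactly, so the extra "work-123" class makes this return nothing:
exact = soup.find_all(class_="work blurb group")

# A CSS selector only requires that all three classes be present,
# regardless of any extras:
selected = soup.select("li.work.blurb.group")
```

So `exact` comes back empty while `selected` finds the item, which matches the behaviour described in this issue.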
Hi, the link to the HASTAC 2017 slides appears to be broken.
Sorry for the vague header.
Issue 1: I tried running this: python ao3_work_ids.py "https://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=revised_at&work_search%5Bother_tag_names%5D=&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=18&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=&commit=Sort+and+Filter&tag_id=IT+%28Movies+-+Muschietti%29" --out_csv itworks
I think it worked the first time, after I installed lxml. It created a spreadsheet, but honestly it looked like it was just the above URL repeated for 1000+ lines (with a blank line between each). I then ran the second command:
python ao3_get_fanfics.py itworks.csv
And received:
C:\Users\[REDACTED]\ao3_get_fanfics.py:269: SyntaxWarning: "is" with a literal. Did you mean "=="?
if restart is '':
Writing a header row for the csv.
Traceback (most recent call last):
File "C:\Users\[REDACTED]\ao3_get_fanfics.py", line 292, in <module>
main()
File "C:\Users\[REDACTED]\ao3_get_fanfics.py", line 267, in main
with open(csv_fname, 'r+') as f_in:
PermissionError: [Errno 13] Permission denied: 'itworks.csv'
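Two separate things seem to be going on here. The SyntaxWarning points at a real, if usually harmless, bug in the script: `is` tests object identity, and CPython only sometimes makes equal strings the same object. A sketch of the one-character fix:

```python
restart = ''

# `if restart is '':` asks whether both sides are the *same object*,
# which is an implementation accident for strings. Equality is what
# the script actually means:
if restart == '':
    mode = "fresh start"
else:
    mode = "resuming"
```

The PermissionError is more likely environmental: on Windows it typically means itworks.csv is still open in Excel or OpenOffice, which locks the file; closing the spreadsheet before rerunning usually clears it.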
Issue 2: I noticed one of those spam fics with a ton of tags; when I refreshed, it had been reported, so I ran the first command again just to make sure it wasn't included in my scrape. But this somehow resulted in a spreadsheet with a single line, all in foreign characters, and OpenOffice telling me my sheet was broken because of too many characters. This also broke my original spreadsheet somehow?
ETA: Attaching a folder where I have the spreadsheet working again, but the links in it lead nowhere because of extra encoding tacked onto the end.
itworks links unusable.zip
Hi there, I seem to be having a problem where the scraper only captures the first 50 or so kudos and bookmarks per fic, even for fics with well over a hundred. Is this a bug, or is there a workaround?
Thanks!
Hey there,
at least on Windows, the ao3_work_ids.py script produces a file with every other line blank. Adding newline="" as an argument to the open() call on line 193 fixes it for me. I will test whether this is relevant for Linux as well.
Edit: This seems to be Windows-specific, but the fix I mentioned does not affect Linux behaviour at all, so I guess it could be worth implementing.
Edit 2: ao3_get_fanfics.py is also affected (lines 269, 271, 280).
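A self-contained illustration of the fix (the file name here is arbitrary): without `newline=""`, Windows text mode translates the csv module's `"\r\n"` row ending into `"\r\r\n"`, which spreadsheet apps display as a blank line after every row.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "work_ids.csv")

# newline="" hands line-ending control to the csv module, as the csv
# docs require for files opened for writing.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["work_id"])
    writer.writerow(["5937274"])

# Reading with newline="" as well yields the rows back cleanly.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
```

On Linux and macOS the text-mode translation is a no-op, which is why the same change is harmless there.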
Hey there! I just wanted to let you know that I had to install additional dependencies, not listed in your README, for ao3_work_ids.py to run correctly :)
They included:
pip install datetime
pip install argparse
pip install lxml
The last one I had to search for online because I was receiving the bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? error in my Git Bash terminal.
No biggie, but thought I'd let you know in case you'd like to update the README!
Mads
Hello,
ao3_work_ids.py seems to be missing --lang option support. When I try the README example, I get:
> python3 ao3_work_ids.py "http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29" --num_to_retrieve 100 --out_csv sherlock -lang English
> usage: ao3_work_ids.py [-h] [--out_csv OUT_CSV] [--header HEADER] [--num_to_retrieve NUM_TO_RETRIEVE] [--multichapter_only MULTICHAPTER_ONLY] [--tag_csv TAG_CSV] URL
> ao3_work_ids.py: error: unrecognized arguments: -lang English
Hi! First of all, let me say that this scraper was super useful for a project I'm doing on fandom statistics and text. So thank you!
Second of all, I'd like to do a PR that scrapes user-related features such as
I've tried to do a PR but it seems I need some sort of special access to create a new branch. Any advice?
Hi!
I'm interested in setting up a pip-installable version of the repository so I can (hopefully) use the code without having to be in the same directory as the scripts. (This is a step toward having a pip installable script; with a setup.py you can either pip install from the downloaded folder or just directly from the github repo.)
I've got a standard template and I'm happy to make a PR, but there are fields that you, as the primary developer, should probably fill in yourself: names, email addresses, and how you want to license the code.
Thanks!
Hi! I've tried a few times to run python ao3_get_fanfics.py, and it successfully scrapes around half of the stories, but the rest come back "Access Denied." I tried adding this HTTP header flag, but it didn't seem to help: --header 'Chrome/88.0.4324.146 (Macintosh; Intel Mac OS X 10.15.7); Theo Evans/University of Chicago/[email protected]'
Any ideas of what might be going wrong?
Thank you!
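A hedged guess at the cause: some sites reject requests whose User-Agent looks unusual, and the --header value quoted above is not a conventional UA string. A minimal stdlib sketch of attaching a descriptive User-Agent; the URL and the UA string itself are illustrative examples, not values from the repo.

```python
import urllib.request

# A conventional User-Agent is product/version plus contact details, so
# site operators can identify the bot. The exact string is an example.
req = urllib.request.Request(
    "https://archiveofourown.org/works/5937274",
    headers={"User-Agent": "ao3scraper/1.0 (research; [email protected])"},
)
```

Pausing a few seconds between requests can also reduce rejections, since some denials are rate-based rather than header-based.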
When I try to run ao3_work_ids.py, I get an error: inconsistent use of tabs and spaces in indentation, on line 112. I'm far from an expert on Python, but I'm definitely running it in 2.7. Hm.
I'm not a CS major by training, so I'm not sure what data structure would best suit storing these, or whether they might get corrupted by the text being stored in a CSV. Any ideas?
Hi there! First, I wanna say thank you so much for building this scraper! I'm extremely new to data scraping (I'm a physicist in training) so this is very helpful in building toy data for a project of mine.
I need individual txt files instead of a single CSV, so I'm trying to use csv_to_txts. However, I run into the following error:
Changing "rb" to "r" in line 29 (apologies if I'm not supposed to do this!) results in this:
Thanks for your help!
Hi, I am running the Sherlock example but ran into a problem. Here is my command:
python ao3_work_ids.py "http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29" --num_to_retrieve 10 --out_csv sherlock
But then I got:
line 60
<title>AO3Scraper/ao3_work_ids.py at master · radiolarian/AO3Scraper · GitHub</title>
^
SyntaxError: invalid character '·' (U+00B7)
Could anyone help me fix this issue? Thank you so much in advance!
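A guess at the diagnosis: the character '·' (U+00B7) never appears in the scraper's source, but it does appear in GitHub page titles ("… at master · radiolarian/AO3Scraper"), and the offending line is a `<title>` tag. That suggests the saved file is the repository's HTML page rather than the raw script. A quick sanity check one can run on a downloaded file (the helper is hypothetical, not part of the repo):

```python
def looks_like_html(first_line):
    """Heuristic: a downloaded 'script' whose first line is markup is
    really a saved web page, not Python source."""
    markers = ("<!doctype", "<html", "<head", "<title")
    return first_line.lstrip().lower().startswith(markers)
```

If this returns True for the file's first line, re-downloading via GitHub's Raw button (or `git clone`) should fix the SyntaxError.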
I know ao3_get_fanfics.py has a restart command-line argument, but ao3_work_ids.py doesn't have the same functionality. When downloading a lot of fanfiction, it can be quite frustrating to have to start all over again just because I lost my internet connection. Please add a restart argument to ao3_work_ids.py, or explain how I can resume scraping without starting from the beginning. Thank you.
Is there a straightforward way to capture author data as well - namely, the name of the author?
When I tried to scrape larger amounts for an NLP project, I stumbled over the problem that only a fraction of the ids got scraped. The same has already been observed by @celegant in #20 (comment).
The reason is mainly that there is no handling for a 429 status code, which AO3 seems to return when its security system blocks your IP for around 3 minutes; this happened to me roughly every 70-100 requests no matter the delay.
What happens right now is that the scraper just skips to the next id, and so on, for as long as AO3 blocks it; for me this meant only about 3.5k of 44k requested ids were actually scraped.
To fix this I implemented status-code checks that either write the result of the request to the error log (in case of 404 and so on) or, in case of 429, wait a certain period of time and then restart scraping at the current id, which is easy since the restart functionality is already implemented :)
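The approach described above can be sketched like this; the three-minute wait comes from the description, while the helper name, the retry cap, and the `fetch` callable are illustrative rather than the contributor's actual code.

```python
import time

def fetch_with_backoff(fetch, url, max_tries=5, wait_seconds=180):
    """Retry `fetch(url)` while it returns HTTP 429 (rate limited).

    `fetch` is any callable returning an object with a .status_code
    attribute (e.g. requests.get). Since the block reportedly lasts
    about three minutes, we wait that long and retry the *same* id
    instead of skipping ahead.
    """
    for _ in range(max_tries):
        response = fetch(url)
        if response.status_code != 429:
            return response          # success, or a real error (404, ...) to log
        time.sleep(wait_seconds)     # wait out the block, then retry
    raise RuntimeError("still rate-limited after %d tries" % max_tries)
```

Honouring a `Retry-After` header when the server sends one would be a natural refinement, but the fixed wait matches the behaviour described here.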
...
In advance, I'm really sorry if this is a dumb question, I only learned Python about two weeks ago.
I'm having this issue when I run extract_metadata on the output of ao3_get_fanfics:
extract_metadata.py:26: DeprecationWarning: 'U' mode is deprecated
  with open(csv_name, 'rU') as csvfile:
Traceback (most recent call last):
  File "extract_metadata.py", line 36, in <module>
    main()
  File "extract_metadata.py", line 31, in main
    work_id = row[0]
IndexError: list index out of range
None of the solutions I've tried so far have worked; do you have any insight?
Hi, apologies if this is a simple question, as I have not used Python much outside of my data science classes, so that may be the root of the problem. When I run the example code (the Sherlock one), I get a syntax error pointing to the URL and saying "invalid syntax".
Additionally, I tried to run one of the example lines for ao3_get_fanfics.py (python ao3_get_fanfics.py 5937274) and got "invalid syntax" again, this time pointing at the beginning of "ao3_get_fanfics.py". How do I resolve this issue?
When I run the command in your README (i.e. getting work IDs for 100 Sherlock fics), I get this error:
Traceback (most recent call last):
File "ao3_work_ids.py", line 261, in <module>
main()
File "ao3_work_ids.py", line 257, in main
process_for_ids(header_info)
File "ao3_work_ids.py", line 240, in process_for_ids
ids = get_ids(header_info)
File "ao3_work_ids.py", line 108, in get_ids
soup = BeautifulSoup(req.text, "lxml")
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bs4/__init__.py", line 165, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Happens with both Python 2 and 3.
Hello,
when using --restart in ao3_get_fanfics.py, the script crashes. The cause is a missing lang argument in the corresponding call of write_fic_to_csv on line 295.