radiolarian / ao3scraper
A Python scraper for getting fan fiction content and metadata from Archive of Our Own.
Thanks so much for this scraper! It's awesome. I'm running into an issue with ao3_get_fanfics.py: if a work has no kudos, an error referencing the get_kudos function is raised and the program exits.
This does not appear to happen with works that have no bookmarks or no comments.
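Not the repository's actual code, but a minimal sketch of the likely fix: guard the metadata lookup so a missing kudos element yields a placeholder instead of crashing. The helper name, the `"null"` placeholder, and the `dd.kudos` selector in the usage line are assumptions.

```python
def safe_text(tag, default="null"):
    """Return tag.text stripped, or a default when the element is absent.

    Works with zero kudos simply have no kudos element in their metadata
    block, so BeautifulSoup's find() returns None; calling .text on that
    None is the likely cause of the crash described above.
    """
    if tag is None:
        return default
    return tag.text.strip()
```

Usage would look something like `kudos = safe_text(meta.find("dd", class_="kudos"))`, and the same guard applies to any other optional metadata field.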
Hi there, this has been brought up before, but unfortunately I can't figure out why it's happening (it's not just Excel for me). The output of the ao3_get_fanfics CSV looks like this:
Not sure what's causing the body text to spill into other rows; any insights on how to fix this?
EDIT: the breaks seem to coincide with paragraph breaks in the actual text of the fic, but they also occur mid-sentence with no delimiter other than a space.
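One workaround (not from the repo, just a sketch): normalise newlines in the fic body before writing, and quote every field, so spreadsheet programs that mishandle embedded line breaks see exactly one physical line per row. The helper name is illustrative.

```python
import csv
import io
import re

def flatten_body(text):
    """Replace runs of newlines and surrounding whitespace with single spaces."""
    return re.sub(r"\s*\n\s*", " ", text).strip()

# Write one row per work, with every field quoted.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
writer.writerow(["5937274", flatten_body("First paragraph.\n\nSecond one.")])

# Reading it back yields a single row, body intact.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
```

The trade-off is that paragraph structure is lost; if that matters, replacing `"\n"` with a visible sentinel like `"<p>"` instead of a space preserves it reversibly.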
Hello,
when I tried ao3_work_ids.py on the example from the README, it returned no results (after removing --lang per the other issue). Note that I used Python 3, since I was unable to install the dependencies for Python 2.
I tracked the problem to the get_ids function, specifically the line works = soup.find_all(class_="work blurb group"). Apparently, when given multiple classes as a single string, find_all returns only elements whose class attribute is exactly that string, but the works in the list also carry other classes like work-some_number.
Based on this SO question: https://stackoverflow.com/questions/40305678/beautifulsoup-multiple-class-selector
using works = soup.select("li.work.blurb.group") instead seems to work.
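The difference can be shown in a few lines; the HTML fragment below is a cut-down stand-in for an AO3 work listing, not the real markup.

```python
from bs4 import BeautifulSoup

html = '<ol><li class="work blurb group work-123">a fic</li></ol>'
soup = BeautifulSoup(html, "html.parser")

# A multi-word string passed to class_ must match the class attribute
# exactly, so the extra "work-123" class makes this return nothing:
exact = soup.find_all(class_="work blurb group")

# A CSS selector only requires that all three classes be present,
# regardless of any extras:
selected = soup.select("li.work.blurb.group")
```

So `exact` comes back empty while `selected` finds the item, which matches the behaviour described in this issue.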
Hi, the link to the HASTAC 2017 slides appears to be broken.
Sorry for the vague header.
Issue 1: I tried running this: python ao3_work_ids.py "https://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=revised_at&work_search%5Bother_tag_names%5D=&exclude_work_search%5Barchive_warning_ids%5D%5B%5D=18&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=&work_search%5Bdate_to%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=&commit=Sort+and+Filter&tag_id=IT+%28Movies+-+Muschietti%29" --out_csv itworks
I think it worked the first time, after I installed lxml. It created a spreadsheet, but honestly it looked like it was just the above URL repeated for 1000+ lines (with a blank line between each). I then ran the second command:
python ao3_get_fanfics.py itworks.csv
And received:
C:\Users\[REDACTED]\ao3_get_fanfics.py:269: SyntaxWarning: "is" with a literal. Did you mean "=="?
if restart is '':
Writing a header row for the csv.
Traceback (most recent call last):
File "C:\Users\[REDACTED]\ao3_get_fanfics.py", line 292, in <module>
main()
File "C:\Users\[REDACTED]\ao3_get_fanfics.py", line 267, in main
with open(csv_fname, 'r+') as f_in:
PermissionError: [Errno 13] Permission denied: 'itworks.csv'
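Two separate things seem to be going on here. The SyntaxWarning points at a real, if usually harmless, bug in the script: `is` tests object identity, and CPython only sometimes makes equal strings the same object. A sketch of the one-character fix:

```python
restart = ''

# `if restart is '':` asks whether both sides are the *same object*,
# which is an implementation accident for strings. Equality is what
# the script actually means:
if restart == '':
    mode = "fresh start"
else:
    mode = "resuming"
```

The PermissionError is more likely environmental: on Windows it typically means itworks.csv is still open in Excel or OpenOffice, which locks the file; closing the spreadsheet before rerunning usually clears it.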
Issue 2: I noticed one of those spam fics with a ton of tags; when I refreshed, it had been reported, so I ran the first command again just to make sure it wasn't included in my scrape. But this somehow resulted in a spreadsheet with a single line, all in foreign characters, and OpenOffice telling me my sheet was broken because of too many characters. This also broke my original spreadsheet somehow?
ETA: Attaching a folder where I have the spreadsheet working again, but the links in it lead nowhere because of extra encoding tacked onto the end.
itworks links unusable.zip
Hi there, I seem to be having a problem where the scraper only captures the first 50 or so kudos and bookmarks per fic, even for fics with well over a hundred. Is this a bug, or is there a workaround?
Thanks!
Hey there,
at least on Windows, the ao3_work_ids.py script produces a file with every other line blank. Adding newline="" as an argument to the open() call on line 193 fixes it for me. I will test whether this is relevant for Linux as well.
Edit: This seems to be Windows-specific, but the fix I mentioned does not affect Linux behaviour at all, so I guess it could be worth implementing.
Edit 2: ao3_get_fanfics.py is also affected (lines 269, 271, 280).
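A self-contained illustration of the fix (the file name here is arbitrary): without `newline=""`, Windows text mode translates the csv module's `"\r\n"` row ending into `"\r\r\n"`, which spreadsheet apps display as a blank line after every row.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "work_ids.csv")

# newline="" hands line-ending control to the csv module, as the csv
# docs require for files opened for writing.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["work_id"])
    writer.writerow(["5937274"])

# Reading with newline="" as well yields the rows back cleanly.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
```

On Linux and macOS the text-mode translation is a no-op, which is why the same change is harmless there.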
Hey there! I just wanted to let you know that I had to install additional dependencies, not listed in your README, for ao3_work_ids.py to run correctly :)
They included:
pip install datetime
pip install argparse
pip install lxml
The last one I had to search for online because I was receiving the bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? error in my Git Bash terminal.
No biggie, but thought I'd let you know in case you'd like to update the README!
Mads
Hello,
ao3_work_ids.py seems to be missing --lang option support. When I try the README example, I get:
> python3 ao3_work_ids.py "http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29" --num_to_retrieve 100 --out_csv sherlock -lang English
> usage: ao3_work_ids.py [-h] [--out_csv OUT_CSV] [--header HEADER] [--num_to_retrieve NUM_TO_RETRIEVE] [--multichapter_only MULTICHAPTER_ONLY] [--tag_csv TAG_CSV] URL
> ao3_work_ids.py: error: unrecognized arguments: -lang English
Hi! First of all, let me say that this scraper was super useful for a project I'm doing on fandom statistics and text. So thank you!
Second of all, I'd like to do a PR that scrapes user-related features such as
I've tried to do a PR but it seems I need some sort of special access to create a new branch. Any advice?
Hi!
I'm interested in setting up a pip-installable version of the repository so I can (hopefully) use the code without having to be in the same directory as the scripts. (This is a step toward having a pip installable script; with a setup.py you can either pip install from the downloaded folder or just directly from the github repo.)
I've got a standard template and I'm happy to make a PR, but there are fields that you, as the primary developer, should probably fill in yourself: names, email addresses, and how you want to license the code.
Thanks!
Hi! I've tried a few times to run python ao3_get_fanfics.py, and it successfully scrapes around half of the stories, but the rest come back "Access Denied." I tried adding this HTTP header flag, but it didn't seem to help: --header 'Chrome/88.0.4324.146 (Macintosh; Intel Mac OS X 10.15.7); Theo Evans/University of Chicago/[email protected]'
Any ideas of what might be going wrong?
Thank you!
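A hedged guess at the cause: some sites reject requests whose User-Agent looks unusual, and the --header value quoted above is not a conventional UA string. A minimal stdlib sketch of attaching a descriptive User-Agent; the URL and the UA string itself are illustrative examples, not values from the repo.

```python
import urllib.request

# A conventional User-Agent is product/version plus contact details, so
# site operators can identify the bot. The exact string is an example.
req = urllib.request.Request(
    "https://archiveofourown.org/works/5937274",
    headers={"User-Agent": "ao3scraper/1.0 (research; [email protected])"},
)
```

Pausing a few seconds between requests can also reduce rejections, since some denials are rate-based rather than header-based.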
When I try to run ao3_work_ids.py, I get an error: inconsistent use of tabs and spaces in indentation, on line 112. I'm far from an expert on Python, but I'm definitely running it in 2.7. Hm.
I'm not a CS major by training, so I'm not sure what data structure would best suit storing these, or whether they might get corrupted by the text being stored in a CSV. Any ideas?
Hi there! First, I wanna say thank you so much for building this scraper! I'm extremely new to data scraping (I'm a physicist in training) so this is very helpful in building toy data for a project of mine.
I need individual txt files instead of a single CSV, so I'm trying to use csv_to_txts. However, I run into the following error:
Changing "rb" to "r" in line 29 (apologies if I'm not supposed to do this!) results in this:
Thanks for your help!
Hi, I am running the Sherlock example but ran into a problem. Here is my command:
python ao3_work_ids.py "http://archiveofourown.org/works?utf8=%E2%9C%93&work_search%5Bsort_column%5D=kudos_count&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=1&work_search%5Bcomplete%5D=0&work_search%5Bcomplete%5D=1&commit=Sort+and+Filter&tag_id=Sherlock+%28TV%29" --num_to_retrieve 10 --out_csv sherlock
But then I got:
line 60
<title>AO3Scraper/ao3_work_ids.py at master · radiolarian/AO3Scraper · GitHub</title>
^
SyntaxError: invalid character '·' (U+00B7)
Could anyone help me fix this issue? Thank you so much in advance!
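A guess at the diagnosis: the character '·' (U+00B7) never appears in the scraper's source, but it does appear in GitHub page titles ("… at master · radiolarian/AO3Scraper"), and the offending line is a `<title>` tag. That suggests the saved file is the repository's HTML page rather than the raw script. A quick sanity check one can run on a downloaded file (the helper is hypothetical, not part of the repo):

```python
def looks_like_html(first_line):
    """Heuristic: a downloaded 'script' whose first line is markup is
    really a saved web page, not Python source."""
    markers = ("<!doctype", "<html", "<head", "<title")
    return first_line.lstrip().lower().startswith(markers)
```

If this returns True for the file's first line, re-downloading via GitHub's Raw button (or `git clone`) should fix the SyntaxError.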
I know ao3_get_fanfics.py has a restart command-line argument, but ao3_work_ids.py doesn't have the same functionality. When downloading a lot of fanfiction, it can be quite frustrating to have to start all over again just because I lost my internet connection. Please add a restart argument to ao3_work_ids.py, or explain how I can resume scraping without starting from the beginning. Thank you.
Is there a straightforward way to capture author data as well - namely, the name of the author?
When I tried to scrape larger amounts for an NLP project, I stumbled over the problem that only a fraction of the ids got scraped. The same has already been observed by @celegant in #20 (comment).
The reason is mainly that there is no handling for a 429 status code, which AO3 seems to return when its security system blocks your IP for around 3 minutes; this happened to me roughly every 70-100 requests no matter the delay.
What happens right now is that the scraper just skips to the next id, and so on, for as long as AO3 blocks it; for me this meant only about 3.5k of 44k requested ids were actually scraped.
To fix this I implemented status-code checks that either write the result of the request to the error log (in case of 404 and so on) or, in case of 429, wait a certain period of time and then restart scraping at the current id, which is easy since the restart functionality is already implemented :)
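The approach described above can be sketched like this; the three-minute wait comes from the description, while the helper name, the retry cap, and the `fetch` callable are illustrative rather than the contributor's actual code.

```python
import time

def fetch_with_backoff(fetch, url, max_tries=5, wait_seconds=180):
    """Retry `fetch(url)` while it returns HTTP 429 (rate limited).

    `fetch` is any callable returning an object with a .status_code
    attribute (e.g. requests.get). Since the block reportedly lasts
    about three minutes, we wait that long and retry the *same* id
    instead of skipping ahead.
    """
    for _ in range(max_tries):
        response = fetch(url)
        if response.status_code != 429:
            return response          # success, or a real error (404, ...) to log
        time.sleep(wait_seconds)     # wait out the block, then retry
    raise RuntimeError("still rate-limited after %d tries" % max_tries)
```

Honouring a `Retry-After` header when the server sends one would be a natural refinement, but the fixed wait matches the behaviour described here.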
...
In advance, I'm really sorry if this is a dumb question, I only learned Python about two weeks ago.
I'm having this issue when I run extract_metadata on the output of ao3_get_fanfics:
extract_metadata.py:26: DeprecationWarning: 'U' mode is deprecated
  with open(csv_name, 'rU') as csvfile:
Traceback (most recent call last):
  File "extract_metadata.py", line 36, in <module>
    main()
  File "extract_metadata.py", line 31, in main
    work_id = row[0]
IndexError: list index out of range
None of the solutions I've tried so far have worked; do you have any insight?
Hi, apologies if this is a simple question, as I have not used Python much outside of my data science classes, so that may be the root of the problem. When I run the example code (the Sherlock one), I get a syntax error pointing to the URL and saying "invalid syntax".
Additionally, I tried to run one of the example lines for ao3_get_fanfics.py (python ao3_get_fanfics.py 5937274) and got "invalid syntax" again, this time pointing at the beginning of "ao3_get_fanfics.py". How do I resolve this issue?
When I run the command in your README (i.e. getting work IDs for 100 Sherlock fics), I get this error:
Traceback (most recent call last):
File "ao3_work_ids.py", line 261, in <module>
main()
File "ao3_work_ids.py", line 257, in main
process_for_ids(header_info)
File "ao3_work_ids.py", line 240, in process_for_ids
ids = get_ids(header_info)
File "ao3_work_ids.py", line 108, in get_ids
soup = BeautifulSoup(req.text, "lxml")
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bs4/__init__.py", line 165, in __init__
% ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Happens with both Python 2 and 3.
Hello,
when using --restart in ao3_get_fanfics.py, the script crashes. The cause is a missing lang argument in the corresponding call of write_fic_to_csv on line 295.