
fandom-search's Introduction

The Archive of Our Own script ao3.py can be used to scrape and analyze fanworks and prepare the results for visualization in JavaScript. A markup version of the script of the original work is required to search the fanworks for n-gram matches.

The basic workflow is below. This assumes you have a scripts folder, a fanworks folder, and a results folder, with a particular structure that can be inferred from the example commands below. (Sorry, very busy!) Take sw-all to be a stand-in for a folder of fan works, sw-new-hope.txt to be a stand-in for a correctly formatted script, and sw-new-hope (without the .txt) to be a stand-in for the results folder for the given movie.

A todo for this repo is to create options for where to save error and log files, and search results.

Another todo for this repo is to create more thorough documentation, especially of the script format, which is idiosyncratic but effective.

  • Scrape AO3 (Ooops! Currently broken!)

    python ao3.py scrape \
        -t "Star Wars - All Media Types" \
        -o fanworks/sw-all/html
    

The scrape command will save log and error files; check to see that the scrape went OK, and then move the (generically named) error file to fanworks/sw-all/sw-all-errors.txt.

  • Clean the HTML

    python ao3.py clean \
        fanworks/sw-all/html/ \
        -o fanworks/sw-all/plaintext/
    

The clean command will save an error file; check to see that the cleaning process went OK, and then rename the error file (this time in the root dir) from clean-html-errors.txt to sw-all-clean-errors.txt.

  • Perform the reuse search

    python ao3.py search \
        fanworks/sw-all/ \
        scripts/sw-new-hope.txt
    

The search command will create several (in some cases many, even hundreds of) separate CSV files. Each one contains the results for 500 fan works. The script aggregates them automatically at the end of the process, but the separate files are also saved so that the results are still usable if the search is interrupted.

If the search completes without any errors, the final aggregated data will be in a file with a date stamp in YYYYMMDD format. It will be named something like match-6gram-20190604. Create a new folder results/sw-new-hope/20190604/, and move all the CSV files into that folder.
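The manual move described above could be scripted along the following lines. This is only a sketch, not part of ao3.py: the function name collect_results is hypothetical, and it assumes that the CSV files sit together in one directory and that the aggregated file is named match-<n>gram-<YYYYMMDD>.csv, as in the example commands.

```python
import re
import shutil
from pathlib import Path


def collect_results(src_dir, results_root):
    """Move every CSV in src_dir into results_root/<YYYYMMDD>/.

    The date stamp is taken from the aggregated file, assumed to be named
    like match-6gram-20190604.csv. All other CSV files in src_dir are
    assumed to be batch files from the same search and are moved with it.
    """
    csv_files = sorted(Path(src_dir).glob("*.csv"))
    stamps = []
    for path in csv_files:
        found = re.fullmatch(r"match-\d+gram-(\d{8})", path.stem)
        if found:
            stamps.append(found.group(1))
    if not stamps:
        raise FileNotFoundError("no aggregated match-<n>gram-<date>.csv found")

    # Use the most recent stamp if more than one aggregated file is present.
    target = Path(results_root) / max(stamps)
    target.mkdir(parents=True, exist_ok=True)
    for path in csv_files:
        shutil.move(str(path), str(target / path.name))
    return target
```

Calling collect_results(".", "results/sw-new-hope") would then create results/sw-new-hope/20190604/ for the example above. Because it moves every CSV in the source directory, run it only where no unrelated CSV files are lying around.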

  • Aggregate the results over the script (i.e. "format" the results)

    python ao3.py format \
        results/sw-new-hope/20190604/match-6gram-20190604.csv \
        scripts/sw-new-hope.txt \
        -o results/sw-new-hope/fandom-data-new-hope.csv
    
  • Create a Bokeh visualization of the aggregated results

    python ao3.py vis \
        results/sw-new-hope/fandom-data-new-hope.csv \
        -o results/sw-new-hope/new_hope_reuse.html
    

This is not a perfect workflow and needs to be tidied up in several ways. I will get around to that someday.

usage: ao3.py [-h] {scrape,clean,getmeta,search,matrix,format} ...

process fanworks scraped from Archive of Our Own.

positional arguments:
  {scrape,clean,getmeta,search,matrix,format}
                        scrape, clean, getmeta, search, matrix, or format
    scrape              find and scrape fanfiction works from Archive of Our
                        Own
    clean               takes a directory of html files and yields a new
                        directory of text files
    getmeta             takes a directory of html files and yields a csv file
                        containing metadata
    search              compare fanworks with the original script
    matrix              deduplicates and builds matrix for best n-gram matches
    format              takes a script and outputs a csv with sentiment
                        information for each word formatted for javascript
                        visualization

optional arguments:
  -h, --help            show this help message and exit

There are three scraping options for Archive of Our Own: (1) Use the '-s' option to provide a search term and see a list of possible tags. (2) Use the '-t' option to scrape fanworks from a tag. (3) Use the '-u' option to scrape fanworks from a URL. The URL should be to the /works page, e.g. https://archiveofourown.org/tags/Rogue%20One:%20A%20Star%20Wars%20Story%20(2016)/works

usage: ao3.py scrape [-h] [-s SEARCH | -t TAG | -u URL] [-o OUT]
                     [-p STARTPAGE]

optional arguments:
  -h, --help            show this help message and exit
  -s SEARCH, --search SEARCH
                        search term to search for a tag to scrape
  -t TAG, --tag TAG     the tag to be scraped
  -u URL, --url URL     the full URL of first page to be scraped
  -o OUT, --out OUT     target directory for scraped html files
  -p STARTPAGE, --startpage STARTPAGE
                        page on which to begin downloading (to resume a
                        previous job)
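
For the '-u' option, a /works URL like the Rogue One example above can be built from a tag name by percent-encoding it. The helper below is a hypothetical illustration, not something ao3.py provides; it keeps ":", "(" and ")" literal only so that the result matches the example URL. Tags containing other special characters may need site-specific handling that this sketch does not attempt.

```python
from urllib.parse import quote

AO3_TAG_ROOT = "https://archiveofourown.org/tags/"


def works_url(tag):
    """Return the /works URL for a tag name (hypothetical helper)."""
    # Spaces become %20; ":", "(" and ")" are left as they are.
    return AO3_TAG_ROOT + quote(tag, safe=":()") + "/works"


if __name__ == "__main__":
    print(works_url("Rogue One: A Star Wars Story (2016)"))
```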

Clean and convert the scraped html files into plain text files.

usage: ao3.py clean [-h] [-o O] i

positional arguments:
  i           directory of input html files to clean

optional arguments:
  -h, --help  show this help message and exit
  -o O        target directory for output txt files
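
To give a rough idea of what turning HTML into plain text involves, here is a minimal sketch using only the standard library. It is not the cleaning logic in ao3.py, which may well select specific parts of the AO3 page; the names TextExtractor and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())
```

Joining text nodes with a space keeps words from adjacent paragraphs apart, at the cost of inserting a space wherever inline markup splits a word.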

Extract Archive of Our Own metadata from the scraped html files.

usage: ao3.py getmeta [-h] [-o O] i

positional arguments:
  i           directory of input html files to process

optional arguments:
  -h, --help  show this help message and exit
  -o O        filename for metadata csv file

The search process compares fanworks with the original work script and is based on 6-gram matches.

usage: ao3.py search [-h] d s

positional arguments:
  d           directory of fanwork text files
  s           filename for markup version of script

optional arguments:
  -h, --help  show this help message and exit
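
The idea behind a 6-gram search can be sketched as follows. This shows exact n-gram matching only, under the assumption that both texts are lowercased and split into simple word tokens; the actual search in ao3.py works on a marked-up script and may match differently. All function names here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n=6):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_matches(script_text, fan_text, n=6):
    """Return (script_position, fanwork_position, words) for shared n-grams."""
    # Index the script's n-grams by content so each lookup is a dict access.
    script_index = {}
    for pos, gram in enumerate(ngrams(tokenize(script_text), n)):
        script_index.setdefault(gram, []).append(pos)

    matches = []
    for fan_pos, gram in enumerate(ngrams(tokenize(fan_text), n)):
        for script_pos in script_index.get(gram, []):
            matches.append((script_pos, fan_pos, " ".join(gram)))
    return matches
```

A quoted line that runs longer than six words produces several overlapping matches, one for each 6-gram it contains.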

The n-gram search results can be used to create a matrix.

usage: ao3.py matrix [-h] [-n N] i m

positional arguments:
  i           input csv file
  m           fandom/movie name for output file prefix

optional arguments:
  -h, --help  show this help message and exit
  -n N        n-gram size, default is 6-grams

The n-gram search results can be prepared for JavaScript visualization.

usage: ao3.py format [-h] [-o O] s

positional arguments:
  s           filename for markup version of script

optional arguments:
  -h, --help  show this help message and exit
  -o O        filename for csv output file of data formatted for visualization

fandom-search's People

Contributors

annamarion, emontp, joelsjlee, senderle

fandom-search's Issues

Look into graph resizing

We tried to make the plot resize its width to match the width of the RadioButtonGroup, but could not get it working: we had trouble reading the width of the RadioButtonGroup, because its .width kept giving us None. This should be possible, and would be a good way of fixing the sizing issue where the RadioButtonGroup is larger than the plot.

Alternatively, if the number of buttons in each RadioButtonGroup is going to be set in stone, we could just make the plot bigger manually, e.g. width=1000. That approach works without any problems; the trouble only arose when we tried to make the plot width depend on the button group (e.g. plot.width = emotion_button_group.width).

Add Character "Select" Dropdown Feature in Vis.py

Picking up from the new "Character_" columns from Joel's branch, work needs to be done in Vis.py to read in the CSV and do the following:

  • Find a way to add to the global _Fields variable/create a new variable list;

  • Run the data through same/similar functions to parse the data into similar chunks as existing elements in "_Fields";

  • Calculate the geometric mean of characters' speech participation in each chunk;

  • Create two button groups using the Bokeh "Select" Widget which will give the user the ability to select two characters at a time;

  • Ensure that these new features are incorporated into the ratio feature of the existing code;

  • Create a function that builds a stacked area plot to display up to two selected characters, based on the calculations for geometric mean.
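
The geometric-mean step in the list above could look roughly like this. It is a sketch under stated assumptions: each chunk is represented here as a dict mapping a character name to that character's share of the speech in the chunk, which is a made-up layout rather than the actual "Character_" column format, and both function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of non-negative numbers; 0.0 if any value is zero."""
    values = list(values)
    if not values:
        raise ValueError("geometric mean of an empty sequence")
    if any(v < 0 for v in values):
        raise ValueError("geometric mean needs non-negative values")
    if any(v == 0 for v in values):
        return 0.0
    # Average the logarithms, then exponentiate.
    return math.exp(sum(math.log(v) for v in values) / len(values))


def chunk_geometric_means(chunks, char_a, char_b):
    """Per-chunk geometric mean of two characters' speech shares.

    A character who does not appear in a chunk counts as a share of 0.0.
    """
    return [
        geometric_mean([chunk.get(char_a, 0.0), chunk.get(char_b, 0.0)])
        for chunk in chunks
    ]
```

With this definition the value for a chunk is zero unless both selected characters speak in it, so it is high only where the two share the dialogue.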

Double Hover

Currently, hovering over the graph shows text boxes for both the reuse area and the emotion line graph. In some cases this creates a double pop-up.

We should fix this so that only the blue reuse area is hoverable.
