
major-scraper's Introduction

GraduateNU Major Scraper

This repo houses GraduateNU's major requirements scraper. It scrapes the Northeastern Academic Catalog.

Setup

Clone the repo and run:
pnpm install

Running

After installing dependencies, you can run the scraper with:
pnpm scrape

The scraper scrapes the current catalog by default, but you can specify one or more years for it to scrape as command-line arguments. For example, to scrape the catalog for 2021, 2022, and the current year, you'd write the following:
pnpm scrape 2021 2022 current

This will populate the results folder with parsed JSON files and the catalogCache folder with cached HTML.
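
For illustration, here is a minimal sketch of how the year arguments might be handled; the scrapeCatalog entry point is a hypothetical name, not the scraper's actual API:

// Hypothetical sketch of the year-argument handling described above.
const args = process.argv.slice(2);
// With no arguments, default to the current catalog.
const years = args.length === 0 ? ["current"] : args;
for (const year of years) {
  console.log(`Scraping catalog for: ${year}`);
  // await scrapeCatalog(year); // hypothetical entry point
}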

major-scraper's People

Contributors

meebs1, rael346, alpacafur, cindy1u0, dependabot[bot], clue4

Watchers

Arun Jeevanantham, Krish Sharma, Ryan Drew

major-scraper's Issues

Extra Brackets Outputted in OR on Computer_Science_and_Behavioral_Neuroscience_BS

Summary

The following major is output in an incorrect format on line 347 (the courses property should not contain another nested array):
degrees/Major/2023/computer-information-science/Computer_Science_and_Behavioral_Neuroscience_BS/parsed.json

You can see the extra nested array brackets marked with - below:

This is likely a parser postprocessor edge case issue (in src/parse/postprocess.ts).

{
  "type": "OR",
  "courses": [
-   [
      {
        "type": "AND",
        "courses": [
          {
            "subject": "PT",
            "classId": 5410,
            "description": "Functional Human Neuroanatomy",
            "type": "COURSE"
          },
          {
            "subject": "PT",
            "classId": 5411,
            "description": "Lab for PT 5410",
            "type": "COURSE"
          }
        ]
      }
-   ],
    {
      "type": "COURSE",
      "classId": 3200,
      "subject": "PSYC"
    }
  ]
}

Tasks

  • Investigate and leave a comment on this issue explaining why it happened
  • Submit a PR fixing this post-processor issue (one possible direction is sketched after this list).
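
A hedged sketch of that direction: a post-processing pass that flattens stray nested arrays inside a requirement's courses. The Requirement type below is inferred from the JSON above and is an assumption, not the repo's real types:

// Hypothetical sketch; the real pass lives in src/parse/postprocess.ts.
type Requirement =
  | { type: "COURSE"; subject: string; classId: number; description?: string }
  | { type: "AND" | "OR"; courses: (Requirement | Requirement[])[] };

function flattenCourses(req: Requirement): Requirement {
  if (req.type === "COURSE") return req;
  // Splice nested arrays directly into the parent's courses list.
  const courses = req.courses.flatMap((child) =>
    Array.isArray(child) ? child.map(flattenCourses) : [flattenCourses(child)]
  );
  return { ...req, courses };
}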

Investigate shrinking prod Graduate RDS space allocation

Summary

Currently the prod Graduate database has 100 GB allocated to it, which is far more than we use (it appears to use <1 GB at present). Shrinking a database is non-trivial, but it will bring some cost savings if we can pull it off. The ideal target is 20 GB (the minimum size) for Graduate for now (we can always increase storage size later, but we can't shrink it).
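
As a starting point, current allocation can be listed with the AWS SDK for JavaScript v3; a hedged sketch, assuming region and credentials come from the environment:

import { RDSClient, DescribeDBInstancesCommand } from "@aws-sdk/client-rds";

async function main(): Promise<void> {
  const client = new RDSClient({}); // region/credentials from the environment
  const { DBInstances } = await client.send(new DescribeDBInstancesCommand({}));
  for (const db of DBInstances ?? []) {
    // AllocatedStorage is reported in GiB.
    console.log(db.DBInstanceIdentifier, `${db.AllocatedStorage} GiB allocated`);
  }
}

main();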

Tasks

  • Determine how to shrink a database's size (start here: https://repost.aws/knowledge-center/rds-db-storage-size).
  • Create a new DB instance with the shrunk size.
  • Migrate prod to use the new instance.
  • Verify that the new instance contains the same data.
  • Shut down the old instance.

Implement New Scraper Output Format

  • Blocks #5

Summary

  • Now that we've landed on a good output format in #3, we should have the scraper output in that format.

Tasks

  • Based on the decisions in this notion doc, implement this output format for the scraper.
  • The scraper should output into the ../degrees/majors directory by default, and should exit with an error if that directory doesn't exist. This is where the sandboxnu/degrees repo will be installed for scraper development, so we'd like to output to a folder called majors within that repo. (This will be made configurable in #5; a rough sketch follows this list.)
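
A rough sketch of the output step, using plain node:fs; the writeMajor name and file layout here are illustrative assumptions, not the final design:

import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: write one parsed major into ../degrees/majors.
function writeMajor(fileName: string, parsed: unknown): void {
  const outputDir = path.resolve(process.cwd(), "../degrees/majors");
  if (!fs.existsSync(outputDir)) {
    // Exit with an error when sandboxnu/degrees isn't cloned alongside us.
    console.error(`Output directory not found: ${outputDir}`);
    process.exit(1);
  }
  fs.writeFileSync(path.join(outputDir, fileName), JSON.stringify(parsed, null, 2));
}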

Tests for parsed majors in production

This ticket is mainly about backwards compatibility. Since we are iterating on a large number of majors while still supporting the already hand-validated majors in GraduateNU's backend, we need to ensure that:

  • Changes we make to fix the newer majors don't break the old ones
  • Fixes for hand-validated errors are ones the scraper can handle

The current tests in the scraper are fairly rudimentary and don't address many of the edge cases that the major catalog is notorious for.

A few guidelines

  • Errors that are mainly the parser's responsibility (i.e., the tokens are correct but the parser parses them incorrectly) should be tested in parse.test.ts (a hypothetical example follows this list)
  • Errors that stem from the tokens being wrong (not tokenizing a comment as an XOM, etc.) should be tested in tokenize.test.ts
  • Snapshot tests should be reserved for majors that have a distinct quality (Business with its separate concentration pages, etc.)
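
A hypothetical example of what such a test might look like in parse.test.ts, assuming vitest and a JSON token fixture; the parse import, fixture path, and major name are assumptions, not the repo's actual layout:

import { readFileSync } from "node:fs";
import { describe, expect, it } from "vitest"; // test runner is an assumption
import { parse } from "../src/parse"; // assumed entry point

// Hypothetical helper: token fixtures checked in next to the tests.
const loadTokens = (name: string) =>
  JSON.parse(readFileSync(`./fixtures/${name}.tokens.json`, "utf-8"));

describe("parse", () => {
  it("keeps hand-validated majors stable", () => {
    expect(parse(loadTokens("Business_BSBA"))).toMatchSnapshot();
  });
});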

Reduce Superfluous Debug Logging from Graduate

Summary

Right now we have a little over 3 GB worth of logs in AWS CloudWatch from Graduate's prod and staging ECS containers. This is because we (a) log a lot of unnecessary info and (b) never expire the logs. In order to stay within the CloudWatch free tier, we should reduce this total.

Tasks

  • Reduce the log retention time to a reasonable level (1 month seems like a reasonable retention time, but depending on how much the second step reduces logs, we could bump it higher).
  • Disable Graduate's debug and info logging in production (link to logging settings). You can use the NODE_ENV env variable to differentiate between production and development environments; a minimal sketch follows this list. This should reduce the size of the /ecs/*-graduatenu-api logs pretty significantly.
  • Verify the log reduction locally by changing the env variable, and after a deploy in CloudWatch. This will involve setting up the graduate repo locally, so feel free to reach out if you have any questions about that.
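
A minimal sketch of the NODE_ENV gating, assuming a simple level-aware wrapper; Graduate's actual logging settings live at the link above:

const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

// In production, drop debug/info to cut CloudWatch volume.
const minLevel: Level = process.env.NODE_ENV === "production" ? "warn" : "debug";

function log(level: Level, message: string): void {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(minLevel)) return;
  console[level](message);
}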

Address comments that are not tokenized as `XOM`

Currently, any element from the HTML that can't be recognized by the tokenizer is tokenized into a COMMENT token. Since the catalog can be inconsistent in its phrasing, our tokenizer can miss some of these cases. This ticket mostly involves going through comments.json to see which cases we can address in the tokenizer, as well as writing tests for them to ensure backwards compatibility with older majors (especially the XOM phrasing).
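
As a hypothetical sketch of the kind of change involved, here is a tokenizer case that promotes one known phrasing from COMMENT to XOM; the token shapes, field names, and regex are all assumptions, not the repo's real types:

type Token =
  | { type: "XOM"; numCreditsMin: number; description: string }
  | { type: "COMMENT"; text: string };

function tokenizeComment(text: string): Token {
  // e.g. "Complete 8 semester hours from the following:"
  const match = text.match(/complete (\d+) semester hours/i);
  if (match) {
    return { type: "XOM", numCreditsMin: Number(match[1]), description: text };
  }
  // Anything unrecognized still falls back to a COMMENT token.
  return { type: "COMMENT", text };
}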

Determine a Good Scraper Data Format

Summary

Before we start implementing anything we want to make sure that the data format we settle on is good. That way, we can start implementing the Scraper output code and Tooling ingest code in parallel without blocking each other.

Ideally this structure will be both easy for us (as humans) to navigate through while also being easy for our Tooling (and Graduate's backend) to read through.

Tasks

  • Determine the output format for the flat files. This should include:
    • Directory structure (where are major files located? within a year? within a folder for that specific major?) (one possible layout is sketched after this list)
    • Where do the individual steps (HTML, tokens, parsed) fall within that structure?
    • Where does metadata live in this structure? (separate file? per major? per college? etc)
    • How many copies of the same file do we keep? (just one? one "current" and one "new" if something has changed? more than two?)
    • What kind of metadata do we need to store? (last updated time, major review status, etc) (this can and likely will change over time)
  • Make sure the data format will be easily readable by Graduate's backend too! (ideally Graduate should be able to just pull our repo and reload the majors at runtime).
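
For reference, the path already appearing in the extra-brackets issue above suggests one possible layout:

degrees/
  Major/
    2023/
      computer-information-science/
        Computer_Science_and_Behavioral_Neuroscience_BS/
          parsed.json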

Implement Scraper Configuration

  • Depends on #4.

Summary

We'd like to lay down some infrastructure that lets us configure the scraper in one place without modifying code. This will be less fragile and will prevent accidental commits when changing these properties.

Tasks

  • There will be two files in the repo root:
    • scraper.defaults.json
      • This file provides the default values for any config setting.
      • Each property in the file should have a ${PROPERTYNAME}.description property before it explaining it (what it does, with some examples). These description properties will be ignored when parsing either config file.
    • scraper.config.json
      • This file provides the local-only overrides for the scraper config
      • When a property is not defined in this file, it should fall back to the defaults file (see the sketch after this list).
      • This file should be listed in .gitignore so it doesn't get committed by accident.
  • Initial config properties to support:
    • outputDirectory - where to output scraper files
    • useCache - whether to use cached HTML files when running the scraper
    • [optional] any others you feel are worth breaking out into this file.
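
A minimal sketch of the merge behavior described above, using the two file names from this ticket; the parsing details are illustrative, not a final implementation:

import { existsSync, readFileSync } from "node:fs";

function loadJson(filePath: string): Record<string, unknown> {
  return existsSync(filePath) ? JSON.parse(readFileSync(filePath, "utf-8")) : {};
}

export function loadConfig(): Record<string, unknown> {
  const defaults = loadJson("scraper.defaults.json");
  const overrides = loadJson("scraper.config.json"); // local-only, gitignored
  const merged = { ...defaults, ...overrides };
  // ${PROPERTYNAME}.description entries exist only as docs; drop them.
  return Object.fromEntries(
    Object.entries(merged).filter(([key]) => !key.endsWith(".description"))
  );
}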

Fix `parsing fail` for section without headers

Currently many of the majors are failing the parse stage because they have a section that doesn't have a header token. This is because the major itself doesn't have a corresponding header token in the table.

For example, in Civil Engineering & Architectural Studies, the Architecture Requirements section doesn't have a bold header like Architecture Electives below it, instead only having an h2 header.

Currently our tokenizer converts this h2 header into the description of a token. The goal for this ticket is to have this header converted into an actual HEADER token for the parser.
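
A hypothetical sketch of the desired behavior; the token shapes and the h2 handling are assumptions about the tokenizer, not its actual code:

type Token =
  | { type: "HEADER"; description: string }
  | { type: "COMMENT"; text: string };

// When a section's table has no bold header row, promote the section's
// h2 text to a HEADER token so the parser sees a well-formed section.
function withHeader(h2Text: string, tokens: Token[]): Token[] {
  const hasHeader = tokens.some((t) => t.type === "HEADER");
  return hasHeader ? tokens : [{ type: "HEADER", description: h2Text.trim() }, ...tokens];
}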

Fix `unexpected end of tokens` error in parse stage

This ticket covers fixing the unexpected end of tokens error, which happens when the parser runs out of tokens without finding a valid parse for a major. This error is currently caused mostly by business concentrations (and a few others).

AWS Status Inventory

Summary

We want to get a better picture overall of what Graduate and Search's AWS stacks look like and how they fit together. We don't need to do a super deep dive on any component yet, but it'll be helpful to have a concrete high-level view.

Tasks

  • Review what our setup is at the moment in the AWS Console / Team Repos
    • Graduate’s Setup (frontend, backend, DB)
    • Search’s Setup (frontend, course-catalog-api, course-catalog-scraper, terraform, DB)
  • Investigate whether any of the AWS tools we're using have built-in downtime/restart alerting, or whether we need to choose another option / roll our own using an API
    • Search currently has an external downtime bot in their #search-support channel; we could probably set something similar up for Graduate if that seems like a good option
  • Find out what error logging we have in place for Graduate's backend and Search's scraper/backend.
  • Take note of anything you come across that you're unsure about/seems like it should be different.
  • Write up these findings in a Notion Doc and informally present to interested people on the team.

Investigate Tooling Tech Stack Options

Summary

Before we start actually writing code (and while we're landing on a JSON storage spec), we should investigate what our options are for the tech stack.

Unless we come upon a pretty compelling reason, we're likely going to stick with a web-based frontend (whether in browser or a webview).

This ticket then focuses on evaluating our options for the backend, which will primarily be a layer between the flat JSON files in the filesystem and the web frontend.

The primary thing our backend needs to do is read/write JSON files from the filesystem and send them to the frontend, so whichever tech we choose won't have to be too involved; a minimal sketch follows.
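
To make that concrete, here is a minimal sketch of the read/write loop using only node built-ins; the file path and routes are assumptions, and any of the stacks below would wrap the same idea:

import { readFile, writeFile } from "node:fs/promises";
import { createServer } from "node:http";

const filePath = "./majors/example.json"; // hypothetical target file

createServer(async (req, res) => {
  if (req.method === "GET") {
    res.setHeader("Content-Type", "application/json");
    res.end(await readFile(filePath, "utf-8"));
  } else if (req.method === "PUT") {
    let body = "";
    for await (const chunk of req) body += chunk;
    await writeFile(filePath, body);
    res.end("ok");
  } else {
    res.statusCode = 405;
    res.end();
  }
}).listen(3000);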

Tasks

  • Investigate tooling backend possibilities. Below are a few that I heard brought up by team members in 1-on-1s, but also feel free to look into another option if it seems promising.
    • TypeScript API-backed webapp (sticking to what we know best as a club)
    • Rust Tauri-backed desktop app (exploring a popular language in a small-scale environment)
    • TypeScript htmx-backed website (keeping the frontend simple by sending HTML forms)
    • Java API-backed webapp (trying a new-to-sandbox language as a backend option)
    • Anything else you feel like investigating/come across while looking into these.
  • Write up a Notion Doc detailing the pros/cons of our different options. These can include the following, but feel free to tweak them as you feel appropriate:
    • How interesting the language might be to team members/the club as a whole
    • Difficulty to learn
    • Project boilerplate/overhead
    • Long-term flexibility
    • Maintainability when knowledgeable members graduate/leave Sandbox.
  • Informally present your findings to the team and I'll set up a ranked-choice vote on the different technologies presented.
