
major-scraper's Introduction

GraduateNU Major Scraper

This repo houses GraduateNU's major requirements scraper. It scrapes the Northeastern Academic Catalog.

Setup

Clone the repo and run:
pnpm install

Running

After installing dependencies, you can run the scraper with:
pnpm scrape

The scraper scrapes the current catalog by default, but you can specify one or more years for it to scrape as command-line arguments. For example, to scrape the catalog for 2021, 2022, and the current year, you'd write the following:
pnpm scrape 2021 2022 current

This will populate the results folder with parsed JSON files and the catalogCache folder with cached HTML.
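
For illustration, here is a minimal sketch of how the year arguments might be handled; the scrapeCatalog entry point is a hypothetical name, not the scraper's actual API:

// Hypothetical sketch of the year-argument handling described above.
const args = process.argv.slice(2);
// With no arguments, default to the current catalog.
const years = args.length === 0 ? ["current"] : args;
for (const year of years) {
  console.log(`Scraping catalog for: ${year}`);
  // await scrapeCatalog(year); // hypothetical entry point
}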

major-scraper's People

Contributors

meebs1, rael346, alpacafur, cindy1u0, dependabot[bot], clue4

Watchers

Arun Jeevanantham, Krish Sharma, Ryan Drew

major-scraper's Issues

Extra Brackets Outputted in OR on Computer_Science_and_Behavioral_Neuroscience_BS

Summary

The following major is output in an incorrect format on line 347 (the courses property should not contain another nested array):
degrees/Major/2023/computer-information-science/Computer_Science_and_Behavioral_Neuroscience_BS/parsed.json

You can see the extra nested array brackets marked with - below:

This is likely a parser postprocessor edge case issue (in src/parse/postprocess.ts).

{
  "type": "OR",
  "courses": [
-   [
      {
        "type": "AND",
        "courses": [
          {
            "subject": "PT",
            "classId": 5410,
            "description": "Functional Human Neuroanatomy",
            "type": "COURSE"
          },
          {
            "subject": "PT",
            "classId": 5411,
            "description": "Lab for PT 5410",
            "type": "COURSE"
          }
        ]
      }
-   ],
    {
      "type": "COURSE",
      "classId": 3200,
      "subject": "PSYC"
    }
  ]
}

Tasks

  • Investigate and leave a comment on this issue explaining why it happened
  • Submit a PR fixing this post-processor issue (one possible direction is sketched after this list).
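
A hedged sketch of that direction: a post-processing pass that flattens stray nested arrays inside a requirement's courses. The Requirement type below is inferred from the JSON above and is an assumption, not the repo's real types:

// Hypothetical sketch; the real pass lives in src/parse/postprocess.ts.
type Requirement =
  | { type: "COURSE"; subject: string; classId: number; description?: string }
  | { type: "AND" | "OR"; courses: (Requirement | Requirement[])[] };

function flattenCourses(req: Requirement): Requirement {
  if (req.type === "COURSE") return req;
  // Splice nested arrays directly into the parent's courses list.
  const courses = req.courses.flatMap((child) =>
    Array.isArray(child) ? child.map(flattenCourses) : [flattenCourses(child)]
  );
  return { ...req, courses };
}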

Investigate shrinking prod Graduate RDS space allocation

Summary

Currently the prod Graduate database has 100 GB allocated to it, which is far more than we use (it appears to use <1 GB at present). Shrinking a database is non-trivial, but it will bring some cost savings if we can pull it off. The ideal target is 20 GB (the minimum size) for Graduate for now (we can always increase storage size later, but we can't shrink it).
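
As a starting point, current allocation can be listed with the AWS SDK for JavaScript v3; a hedged sketch, assuming region and credentials come from the environment:

import { RDSClient, DescribeDBInstancesCommand } from "@aws-sdk/client-rds";

async function main(): Promise<void> {
  const client = new RDSClient({}); // region/credentials from the environment
  const { DBInstances } = await client.send(new DescribeDBInstancesCommand({}));
  for (const db of DBInstances ?? []) {
    // AllocatedStorage is reported in GiB.
    console.log(db.DBInstanceIdentifier, `${db.AllocatedStorage} GiB allocated`);
  }
}

main();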

Tasks

  • Determine how to shrink a database's size (start here: https://repost.aws/knowledge-center/rds-db-storage-size).
  • Create a new DB instance with the shrunk size.
  • Migrate prod to use the new instance.
  • Verify that the new instance contains the same data.
  • Shut down the old instance.

Implement New Scraper Output Format

  • Blocks #5

Summary

  • Now that we've landed on a good output format in #3, we should have the scraper output in that format.

Tasks

  • Based on the decisions in this notion doc, implement this output format for the scraper.
  • The scraper should output into the ../degrees/majors directory by default, and should exit with an error if that directory doesn't exist. This is where the sandboxnu/degrees repo will be installed for scraper development, so we'd like to output to a folder called majors within that repo. (This will be made configurable in #5; a rough sketch follows this list.)
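
A rough sketch of the output step, using plain node:fs; the writeMajor name and file layout here are illustrative assumptions, not the final design:

import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: write one parsed major into ../degrees/majors.
function writeMajor(fileName: string, parsed: unknown): void {
  const outputDir = path.resolve(process.cwd(), "../degrees/majors");
  if (!fs.existsSync(outputDir)) {
    // Exit with an error when sandboxnu/degrees isn't cloned alongside us.
    console.error(`Output directory not found: ${outputDir}`);
    process.exit(1);
  }
  fs.writeFileSync(path.join(outputDir, fileName), JSON.stringify(parsed, null, 2));
}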

Tests for parsed majors in production

This ticket is mainly about backwards compatibility. Since we are iterating on a large number of majors while still supporting the already hand-validated majors in GraduateNU's backend, we need to ensure that:

  • Changes we make to fix the newer majors don't break the old ones
  • Fixes for hand-validated errors are ones the scraper can handle

The current tests in the scraper are fairly rudimentary and don't address many of the edge cases that the major catalog is notorious for.

A few guidelines

  • Errors that are mainly the parser's responsibility (i.e., the tokens are correct but the parser parses them incorrectly) should be tested in parse.test.ts (a hypothetical example follows this list)
  • Errors that stem from the tokens being wrong (not tokenizing a comment as an XOM, etc.) should be tested in tokenize.test.ts
  • Snapshot tests should be reserved for majors that have a distinct quality (Business with its separate concentration pages, etc.)
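
A hypothetical example of what such a test might look like in parse.test.ts, assuming vitest and a JSON token fixture; the parse import, fixture path, and major name are assumptions, not the repo's actual layout:

import { readFileSync } from "node:fs";
import { describe, expect, it } from "vitest"; // test runner is an assumption
import { parse } from "../src/parse"; // assumed entry point

// Hypothetical helper: token fixtures checked in next to the tests.
const loadTokens = (name: string) =>
  JSON.parse(readFileSync(`./fixtures/${name}.tokens.json`, "utf-8"));

describe("parse", () => {
  it("keeps hand-validated majors stable", () => {
    expect(parse(loadTokens("Business_BSBA"))).toMatchSnapshot();
  });
});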

Reduce Superfluous Debug Logging from Graduate

Summary

Right now we have a little over 3 GB worth of logs in AWS CloudWatch from Graduate's prod and staging ECS containers. This is because we (a) log a lot of unnecessary info and (b) never expire the logs. In order to stay within the CloudWatch free tier, we should reduce this total.

Tasks

  • Reduce the log retention time to a reasonable level (1 month seems like a reasonable retention time, but depending on how much the second step reduces logs, we could bump it higher).
  • Disable Graduate's debug and info logging in production (link to logging settings). You can use the NODE_ENV env variable to differentiate between production and development environments; a minimal sketch follows this list. This should reduce the size of the /ecs/*-graduatenu-api logs pretty significantly.
  • Verify the log reduction locally by changing the env variable, and after a deploy in CloudWatch. This will involve setting up the graduate repo locally, so feel free to reach out if you have any questions about that.
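
A minimal sketch of the NODE_ENV gating, assuming a simple level-aware wrapper; Graduate's actual logging settings live at the link above:

const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

// In production, drop debug/info to cut CloudWatch volume.
const minLevel: Level = process.env.NODE_ENV === "production" ? "warn" : "debug";

function log(level: Level, message: string): void {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(minLevel)) return;
  console[level](message);
}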

Address comments that are not tokenized as `XOM`

Currently, any element from the HTML that can't be recognized by the tokenizer is tokenized into a COMMENT token. Since the catalog can be inconsistent in its phrasing, our tokenizer can miss some of these cases. This ticket mostly involves going through comments.json to see which cases we can address in the tokenizer, as well as writing tests for them to ensure backwards compatibility with older majors (especially the XOM phrasing).
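
As a hypothetical sketch of the kind of change involved, here is a tokenizer case that promotes one known phrasing from COMMENT to XOM; the token shapes, field names, and regex are all assumptions, not the repo's real types:

type Token =
  | { type: "XOM"; numCreditsMin: number; description: string }
  | { type: "COMMENT"; text: string };

function tokenizeComment(text: string): Token {
  // e.g. "Complete 8 semester hours from the following:"
  const match = text.match(/complete (\d+) semester hours/i);
  if (match) {
    return { type: "XOM", numCreditsMin: Number(match[1]), description: text };
  }
  // Anything unrecognized still falls back to a COMMENT token.
  return { type: "COMMENT", text };
}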

Determine a Good Scraper Data Format

Summary

Before we start implementing anything we want to make sure that the data format we settle on is good. That way, we can start implementing the Scraper output code and Tooling ingest code in parallel without blocking each other.

Ideally this structure will be both easy for us (as humans) to navigate through while also being easy for our Tooling (and Graduate's backend) to read through.

Tasks

  • Determine the output format for the flat files. This should include:
    • Directory structure (where are major files located? within a year? within a folder for that specific major?) (one possible layout is sketched after this list)
    • Where do the individual steps (HTML, tokens, parsed) fall within that structure?
    • Where does metadata live in this structure? (separate file? per major? per college? etc)
    • How many copies of the same file do we keep? (just one? one "current" and one "new" if something has changed? more than two?)
    • What kind of metadata do we need to store? (last updated time, major review status, etc) (this can and likely will change over time)
  • Make sure the data format will be easily readable by Graduate's backend too! (ideally Graduate should be able to just pull our repo and reload the majors at runtime).
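
For reference, the path already appearing in the extra-brackets issue above suggests one possible layout:

degrees/
  Major/
    2023/
      computer-information-science/
        Computer_Science_and_Behavioral_Neuroscience_BS/
          parsed.json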

Implement Scraper Configuration

  • Depends on #4.

Summary

We'd like to lay down some infrastructure that lets us configure the scraper in one place without modifying code. This will be less fragile and will prevent accidental commits when changing these properties.

Tasks

  • There will be two files in the repo root:
    • scraper.defaults.json
      • This file provides the default values for any config setting.
      • Each property in the file should have a ${PROPERTYNAME}.description property before it explaining it (what it does, with some examples). These description properties will be ignored when parsing either config file.
    • scraper.config.json
      • This file provides the local-only overrides for the scraper config
      • When a property is not defined in this file, it should fall back to the defaults file (see the sketch after this list).
      • This file should be listed in .gitignore so it doesn't get committed by accident.
  • Initial config properties to support:
    • outputDirectory - where to output scraper files
    • useCache - whether to use cached HTML files when running the scraper
    • [optional] any others you feel are worth breaking out into this file.
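
A minimal sketch of the merge behavior described above, using the two file names from this ticket; the parsing details are illustrative, not a final implementation:

import { existsSync, readFileSync } from "node:fs";

function loadJson(filePath: string): Record<string, unknown> {
  return existsSync(filePath) ? JSON.parse(readFileSync(filePath, "utf-8")) : {};
}

export function loadConfig(): Record<string, unknown> {
  const defaults = loadJson("scraper.defaults.json");
  const overrides = loadJson("scraper.config.json"); // local-only, gitignored
  const merged = { ...defaults, ...overrides };
  // ${PROPERTYNAME}.description entries exist only as docs; drop them.
  return Object.fromEntries(
    Object.entries(merged).filter(([key]) => !key.endsWith(".description"))
  );
}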

Fix `parsing fail` for section without headers

Currently many of the majors are failing the parse stage because they have a section that doesn't have a header token. This is because the major itself doesn't have a corresponding header token in the table.

For example, in Civil Engineering & Architectural Studies, the Architecture Requirements section doesn't have a bold header like Architecture Electives below it, instead only having an h2 header.

Currently our tokenizer converts this h2 header into the description of a token. The goal for this ticket is to have this header converted into an actual HEADER token for the parser.
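
A hypothetical sketch of the desired behavior; the token shapes and the h2 handling are assumptions about the tokenizer, not its actual code:

type Token =
  | { type: "HEADER"; description: string }
  | { type: "COMMENT"; text: string };

// When a section's table has no bold header row, promote the section's
// h2 text to a HEADER token so the parser sees a well-formed section.
function withHeader(h2Text: string, tokens: Token[]): Token[] {
  const hasHeader = tokens.some((t) => t.type === "HEADER");
  return hasHeader ? tokens : [{ type: "HEADER", description: h2Text.trim() }, ...tokens];
}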

Fix `unexpected end of tokens` error in parse stage

This ticket covers fixing the unexpected end of tokens error, which happens when the parser runs out of tokens without finding a valid parse for a major. This error is currently caused mostly by business concentrations (and a few others).

AWS Status Inventory

Summary

We want to get a better picture overall of what Graduate and Search's AWS stacks look like and how they fit together. We don't need to do a super deep dive on any component yet, but it'll be helpful to have a concrete high-level view.

Tasks

  • Review what our setup is at the moment in the AWS Console / Team Repos
    • Graduate’s Setup (frontend, backend, DB)
    • Search’s Setup (frontend, course-catalog-api, course-catalog-scraper, terraform, DB)
  • Investigate whether any of the AWS tools we're using have built-in downtime/restart alerting, or whether we need to choose another option / roll our own using an API
    • Search currently has an external downtime bot in their #search-support channel; we could probably set something similar up for Graduate if that seems like a good option
  • Find out what error logging we have in place for Graduate's backend and Search's scraper/backend.
  • Take note of anything you come across that you're unsure about/seems like it should be different.
  • Write up these findings in a Notion Doc and informally present to interested people on the team.

Investigate Tooling Tech Stack Options

Summary

Before we start actually writing code (and while we're landing on a JSON storage spec), we should investigate what our options are for the tech stack.

Unless we come upon a pretty compelling reason, we're likely going to stick with a web-based frontend (whether in browser or a webview).

This ticket then focuses on evaluating our options for the backend, which will primarily be a layer between the flat JSON files in the filesystem and the web frontend.

The primary thing our backend needs to do is read/write JSON files from the filesystem and send them to the frontend, so whichever tech we choose won't have to be too involved; a minimal sketch follows.
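
To make that concrete, here is a minimal sketch of the read/write loop using only node built-ins; the file path and routes are assumptions, and any of the stacks below would wrap the same idea:

import { readFile, writeFile } from "node:fs/promises";
import { createServer } from "node:http";

const filePath = "./majors/example.json"; // hypothetical target file

createServer(async (req, res) => {
  if (req.method === "GET") {
    res.setHeader("Content-Type", "application/json");
    res.end(await readFile(filePath, "utf-8"));
  } else if (req.method === "PUT") {
    let body = "";
    for await (const chunk of req) body += chunk;
    await writeFile(filePath, body);
    res.end("ok");
  } else {
    res.statusCode = 405;
    res.end();
  }
}).listen(3000);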

Tasks

  • Investigate tooling backend possibilities. Below are a few that I heard brought up by team members in 1-on-1s, but also feel free to look into another option if it seems promising.
    • TypeScript API-backed webapp (sticking to what we know best as a club)
    • Rust Tauri-backed desktop app (exploring a popular language in a small-scale environment)
    • TypeScript htmx-backed website (keeping the frontend simple by sending HTML forms)
    • Java API-backed webapp (trying a new-to-sandbox language as a backend option)
    • Anything else you feel like investigating/come across while looking into these.
  • Write up a Notion Doc detailing the pros/cons of our different options. These can include the following, but feel free to tweak them as you feel appropriate:
    • How interesting the language might be to team members/the club as a whole
    • Difficulty to learn
    • Project boilerplate/overhead
    • Long-term flexibility
    • Maintainability when knowledgeable members graduate/leave Sandbox.
  • Informally present your findings to the team and I'll set up a ranked-choice vote on the different technologies presented.
