sandboxnu / major-scraper
Scraping Northeastern's Academic Catalog for use in GraduateNU.
License: GNU General Public License v3.0
Right now we have a little over 3 gigabytes worth of logs in AWS CloudWatch from Graduate's prod and staging ECS containers. This is because (a) we log a lot of unnecessary info and (b) the logs never expire. In order to stay within the CloudWatch free tier, we should reduce this total.
debug and info logging in production (link to logging settings). You can use the NODE_ENV env variable to differentiate between production and development environments. This should reduce the size of /ecs/*-graduatenu-api logs pretty significantly.
../degrees/majors directory by default, and should exit with an error if such a directory doesn't exist. This is where sandboxnu/degrees will be installed for scraper development, so we'd like to use a folder called majors within that repo. (This will be made configurable in #5)
Before we start implementing anything we want to make sure that the data format we settle on is good. That way, we can start implementing the Scraper output code and Tooling ingest code in parallel without blocking each other.
Ideally this structure will be easy for us (as humans) to navigate while also being easy for our Tooling (and Graduate's backend) to read.
We want to take a look at the resource usage (visible on individual container pages, i.e. clusters>services>[backend/frontend service]). Then, based on the maximum usage, we can propose new CPU/Memory numbers for the services to decrease costs.
When fetching a lot of majors, about 1% of them fail because the other side closed the connection on us. This ticket will investigate why and find a way to mitigate it.
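One common mitigation for intermittent connection drops is to retry the request with a growing delay. The sketch below is only an illustration under that assumption: `withRetry`, `RetryOptions`, and the default numbers are hypothetical names, not part of the scraper, and retrying does not explain why the connection is closed in the first place.

```typescript
// Hypothetical helper: retry an async operation with exponential backoff.
// Names and values are illustrative and not taken from the scraper code.
type RetryOptions = {
  retries: number; // maximum number of retries after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles on each retry
};

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  { retries, baseDelayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Wait baseDelayMs, then 2x, 4x, ... before trying again.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

A catalog fetch could then be wrapped as `withRetry(() => fetchMajorPage(url), { retries: 3, baseDelayMs: 500 })`, where `fetchMajorPage` stands in for whatever function performs the request. If the failures turn out to come from too many simultaneous requests, limiting concurrency may help more than retrying.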
Currently any element from the HTML that couldn't be recognized by the tokenizer will be tokenized into a COMMENT token. Since the catalog can be inconsistent in its phrasing, our tokenizer can miss some of these cases. This ticket will mostly involve going through comments.json to see which cases we can address with the tokenizer, as well as writing some tests for them to ensure backwards compatibility with older majors (especially the XOM phrasing).
Currently the prod graduate database has 100gb allocated to it, which is far overkill compared to usage (it appears to use <1gb at present). Shrinking a database is non-trivial but will have some cost-savings if we can pull it off. The ideal target is 20gb (the minimum size) for graduate for now (we can always increase storage size later, but can't shrink it).
This ticket is mainly for backwards compatibility. Since we are iterating on a large number of majors while still supporting the already hand-validated majors in GraduateNU's backend, we need to ensure:
Current tests in scrapers are fairly rudimentary and don't address many edge cases that the major catalog is notorious for.
A few guidelines
parse.test.ts
tokenize.test.ts
We want to get a better picture overall of what Graduate and Search's AWS stacks look like and how they fit together. We don't need to do a super deep dive on any component yet, but it'll be helpful to have a concrete high-level view.
#search-support channel, we could probably set something similar up for Graduate if that seems like a good option
We'd like to lay down some infrastructure to allow us to configure the scraper in one place without modifying code. This will be less fragile and will prevent accidental commits when changing these properties.
scraper.defaults.json
${PROPERTYNAME}.description property before it explaining it (what it does and some examples). These will be ignored when parsing either config file.
scraper.config.json
.gitignore so it doesn't get committed by accident.
outputDirectory - where to output scraper files
useCache - whether to use cached HTML files when running the scraper
The following major is output in the incorrect format on line 347 (the courses property should not have another nested array in it):
degrees/Major/2023/computer-information-science/Computer_Science_and_Behavioral_Neuroscience_BS/parsed.json
You can see the nested array brackets highlighted below:
This is likely a parser postprocessor edge case issue (in src/parse/postprocess.ts).
  {
    "type": "OR",
    "courses": [
-     [
        {
          "type": "AND",
          "courses": [
            {
              "subject": "PT",
              "classId": 5410,
              "description": "Functional Human Neuroanatomy",
              "type": "COURSE"
            },
            {
              "subject": "PT",
              "classId": 5411,
              "description": "Lab for PT 5410",
              "type": "COURSE"
            }
          ]
        }
-     ],
      {
        "type": "COURSE",
        "classId": 3200,
        "subject": "PSYC"
      }
    ]
  }
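One possible way to handle this case is to flatten any stray arrays found inside a courses list. The sketch below assumes simplified requirement types (`Course`, `Group`, `LooseEntry`) and a hypothetical `flattenCourses` function. It is not the code in src/parse/postprocess.ts, and the real fix may belong earlier in the parser.

```typescript
// Simplified, hypothetical shapes modeled on the JSON above.
type Course = {
  type: "COURSE";
  subject: string;
  classId: number;
  description?: string;
};
type Group = { type: "AND" | "OR"; courses: Requirement[] };
type Requirement = Course | Group;

// Loose input shape: a courses list that may contain stray nested arrays.
type LooseGroup = { type: "AND" | "OR"; courses: LooseEntry[] };
type LooseEntry = Course | LooseGroup | LooseEntry[];

// Recursively splice nested arrays into their parent courses list.
function flattenCourses(entries: LooseEntry[]): Requirement[] {
  const result: Requirement[] = [];
  for (const entry of entries) {
    if (Array.isArray(entry)) {
      // A stray array: lift its (flattened) contents into the parent list.
      result.push(...flattenCourses(entry));
    } else if (entry.type === "COURSE") {
      result.push(entry);
    } else {
      // An AND/OR group: keep it, but clean up its own courses list too.
      result.push({ type: entry.type, courses: flattenCourses(entry.courses) });
    }
  }
  return result;
}
```

Applied to the courses list above, this would yield a list with two entries: the AND group holding PT 5410 and PT 5411, followed by PSYC 3200.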
Before we start actually writing code (and while we're landing on a JSON storage spec), we should investigate what our options are for the tech stack.
Unless we come upon a pretty compelling reason, we're likely going to stick with a web-based frontend (whether in browser or a webview).
This ticket then focuses on evaluating our options for the backend, which will primarily be a layer between the flat JSON files in the filesystem and the web frontend.
The primary thing our backend needs to do is read/write JSON files from the filesystem and send them to the frontend, so whichever tech we choose, the code we write in it shouldn't need to be very involved.
Currently many of the majors are failing the parse stage because they have a section that doesn't have a header token. This is because the major itself doesn't have a corresponding header in the table.
For example, in Civil Engineer & Architectural studies, the Architecture Requirements section doesn't have a bold header like Architecture Electives below it, and only has the h2 header.
Currently our tokenizer converts this h2 header into the description of a token. The goal for this ticket is to somehow have this header be converted into an actual HEADER token for the parser.
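One way this could work is to synthesize a HEADER token from the h2 text whenever a section's tokens don't already start with one. The sketch below is an assumption about the approach, not the tokenizer's actual design: the `Token` and `Section` shapes and the `ensureLeadingHeader` function are hypothetical.

```typescript
// Hypothetical token shapes; the real tokenizer's types may differ.
type Token =
  | { type: "HEADER"; description: string }
  | { type: "COMMENT"; description: string }
  | { type: "COURSE"; subject: string; classId: number; description: string };

// Hypothetical section shape: the h2 text plus the tokens from its table.
type Section = { h2Title: string; tokens: Token[] };

// If the table has no bold header row, fall back to the h2 text as the header.
function ensureLeadingHeader(section: Section): Token[] {
  const first = section.tokens[0];
  if (first !== undefined && first.type === "HEADER") {
    return section.tokens; // already has a header, leave untouched
  }
  return [{ type: "HEADER", description: section.h2Title }, ...section.tokens];
}
```

With this approach the parser would always see a HEADER token first, whether it came from a bold table row or from the h2 element.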
This ticket will fix the unexpected end of tokens error, which happens when the parser can't find a solution to a major. This error is currently caused mostly by business concentrations (and a few others).
Since we broke the scraper code out into this repo, it doesn't need to exist in Graduate's repo any more.