getmanfred / dev-story-scraper

Scraper to download the profile information from a Stack Overflow Dev Story

License: Creative Commons Attribution Share Alike 4.0 International

Shell 0.01% JavaScript 0.03% TypeScript 3.68% HTML 96.25% Dockerfile 0.03%

dev-story-scraper's Introduction

StackOverflowgeddon Survival Kit
(Developer Stories Scraper)

With this code, you will survive "the StackOverflowgeddon": the discontinuation of the «Developer Story» feature on Stack Overflow, which means that more than 4 million people will lose their professional data forever. This script scrapes the data contained in your Dev Story (using its public URL) and downloads it as a JSON file compliant with the open-source MAC (Manfred Awesomic CV) format.

General Overview | How to run | Code | Deployment | Why | Who we are | License | Spread the word!

StackOverflowgeddon landing

General Overview

The only external dependency is the Google Maps API. If no Google Maps API key is provided, the scraper simply does not autocomplete the whereILive field; in that case, the location text from the Dev Story is stored in aboutMe.profile.whereILive.notes.
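As a rough illustration of that fallback, here is a minimal TypeScript sketch. Only the field names whereILive and notes come from this README; the `WhereILive` interface, the `buildWhereILive` function, and the idea of passing the geocoding result as an optional argument are assumptions made for the example, not the project's actual code.

```typescript
// Hypothetical shape of the whereILive sub-document (assumption for this sketch).
interface WhereILive {
  country?: string;
  region?: string;
  municipality?: string;
  notes?: string;
}

// `geocoded` is undefined when no Google Maps API key is configured.
function buildWhereILive(location: string, geocoded?: WhereILive): WhereILive {
  if (!geocoded) {
    // No autocompletion available: keep the raw Dev Story text in `notes`.
    return { notes: location };
  }
  return { ...geocoded };
}

// Without a key, the free-text location is preserved as-is.
console.log(buildWhereILive("Madrid, Spain"));
```

Keeping the raw text in notes means no information from the Dev Story is lost when the geocoder is unavailable.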

General overview diagram

How to run

Plain Node.js

yarn install

yarn build

yarn start

curl http://localhost:3000\?username\=<Dev Story username>

Docker

docker build . -t username/dev-story-scraper

docker run -p3000:3000 -d username/dev-story-scraper

# If you have a Google Maps API key
docker run -p3000:3000 -e SO_GOOGLE_MAPS_API_KEY=<key value> -d username/dev-story-scraper

Code

Code organization

The scraping process is designed to follow the structure of the MAC JSON Schema, independently of where the data sits in Stack Overflow's HTML. So each "large" sub-document in the JSON Schema usually has its own parser class.

The DevStoryDownloader and the Geocoder are created at the beginning so that we can inject mocks for testing. This avoids overusing Stack Overflow's or Google's systems and also prevents false red tests. We could use dependency injection in better ways, but for a project that is going to be used for a few days and then discarded, it isn't worth the price.
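The injection idea can be sketched roughly as follows. The names DevStoryDownloader and Geocoder come from this README, but the method signatures, the `Scraper` class, and the `data-location` attribute are hypothetical, invented for the example; they are not Stack Overflow's real markup or the project's real API.

```typescript
// Assumed interfaces: only the two names appear in the README.
interface DevStoryDownloader {
  download(username: string): Promise<string>;
}

interface Geocoder {
  geocode(location: string): Promise<string>;
}

class Scraper {
  // Both collaborators are passed in, so tests can swap in fakes.
  constructor(
    private readonly downloader: DevStoryDownloader,
    private readonly geocoder: Geocoder,
  ) {}

  async locationOf(username: string): Promise<string> {
    const html = await this.downloader.download(username);
    // Toy extraction over made-up markup; real parsing would use an HTML parser.
    const match = /data-location="([^"]*)"/.exec(html);
    return this.geocoder.geocode(match ? match[1] : "");
  }
}

// Fakes that never touch the network.
const fakeDownloader: DevStoryDownloader = {
  download: async () => '<div data-location="Madrid, Spain"></div>',
};
const fakeGeocoder: Geocoder = {
  geocode: async (location) => `${location} (geocoded offline)`,
};

const scraper = new Scraper(fakeDownloader, fakeGeocoder);
scraper.locationOf("any-username").then((location) => console.log(location));
```

Because the fakes return fixed values, a test built this way cannot fail due to rate limits or network outages on either external service.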

Deployment

Components architecture

We need to configure the GOOGLE_MAPS_API_KEY to use the geocoder.

Design decisions

To adapt the scraped data to the MAC JSON Schema, we made some design decisions.

Name to name and surnames

Stack Overflow stores the full name as a single string. To create the name and surnames fields, we decided to use the first word as the name and the rest as the surnames.

Example:

{
  name: 'Ryan Reynolds'
}

to

{
  name: 'Ryan',
  surnames: 'Reynolds'
}
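A minimal sketch of that rule in TypeScript is shown below. The function name `splitFullName` and the `SplitName` type are hypothetical, and the whitespace handling is an assumption about how edge cases might be treated.

```typescript
interface SplitName {
  name: string;
  surnames: string;
}

// First word becomes the name; every remaining word goes into surnames.
function splitFullName(fullName: string): SplitName {
  const [name = "", ...rest] = fullName.trim().split(/\s+/);
  return { name, surnames: rest.join(" ") };
}

console.log(splitFullName("Ryan Reynolds"));
// { name: 'Ryan', surnames: 'Reynolds' }
```

Note that this heuristic places any second given name into surnames, which is the trade-off the rule accepts for simplicity.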

Location completion

Location is a free-text field, so we use the Google Maps API to get more data. The whereILive field is composed of country, region, and municipality, but a Dev Story usually has only 2 of those fields.

Examples:

  • Tampa, Florida > Tampa, Florida, US
  • Madrid, Spain > Madrid, Community of Madrid, ES
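One way such a completion could work is to map the `address_components` array of a Google Geocoding API result onto the three fields. The sketch below assumes that approach: the `toWhereILive` function is hypothetical, the choice of component types (`country`, `administrative_area_level_1`, `locality`) is inferred from the examples above, and the sample data is hand-written for illustration rather than a captured API response.

```typescript
// One entry of a Geocoding API result's `address_components` array.
interface AddressComponent {
  long_name: string;
  short_name: string;
  types: string[];
}

interface WhereILive {
  country?: string;
  region?: string;
  municipality?: string;
}

function findComponent(
  components: AddressComponent[],
  type: string,
): AddressComponent | undefined {
  return components.find((component) => component.types.includes(type));
}

// Assumed mapping: short country code, long region and municipality names.
function toWhereILive(components: AddressComponent[]): WhereILive {
  return {
    country: findComponent(components, "country")?.short_name,
    region: findComponent(components, "administrative_area_level_1")?.long_name,
    municipality: findComponent(components, "locality")?.long_name,
  };
}

// Hand-written sample shaped like a geocoding result for "Madrid, Spain".
const madrid: AddressComponent[] = [
  { long_name: "Madrid", short_name: "Madrid", types: ["locality", "political"] },
  { long_name: "Community of Madrid", short_name: "MD", types: ["administrative_area_level_1", "political"] },
  { long_name: "Spain", short_name: "ES", types: ["country", "political"] },
];

console.log(toWhereILive(madrid));
```

Any component missing from the result simply leaves the corresponding field undefined.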

Geocoder flow diagram

Job parsing

job parsing details diagram

Assessment parsing

assessment parsing details diagram

Stack Overflow top answers parsing

stack overflow top answers parsing diagram

Why

We were committed to building an open platform to manage careers during 2022, including online CVs. Then we learned that Stack Overflow had decided to sunset the Developer Stories (a kind of... online CVs), which meant that more than 4 million developers would lose their professional data forever.

So, we thought "what if we hustle to deliver, in just 2 months, at least a subset of what we wanted to build in 6, to give those people a way to preserve their data?". The rest is history.

  1. First, we defined an open-source CV format: the MAC
  2. Second, we created a tool to recover and store the data from your Dev Story (as a MAC-compliant JSON file), available through a landing page. The code running behind it is what you have in this repo.
  3. Third, we improved our online CV platform to import (if wanted) data from Dev Stories easily.

Who we are

The Kit was made with ❤️ and care by the Manfred team.

We are a bunch of developers trying to create a better approach to technical talent recruiting. This is our manifesto:

Manfred Manifesto

License

This code is free and open-source software licensed and distributed under the Creative Commons Attribution Share Alike 4.0 International (CC BY-SA 4.0 International).

🌟 Spread the word!

If you want to say thank you and/or support active development of the Kit:

  • Add a GitHub Star to the project!
  • Tweet about the project on your Twitter!

Thanks so much for your interest in growing the reach of this initiative!
