getmanfred / dev-story-scraper

Scraper to download the profile information from a Stack Overflow Dev Story

License: Creative Commons Attribution Share Alike 4.0 International

Shell 0.01% JavaScript 0.03% TypeScript 3.68% HTML 96.25% Dockerfile 0.03%

dev-story-scraper's People

Contributors

dbonillaf, ydarias

Forkers

lpirir

dev-story-scraper's Issues

Competences at studies elements

For timeline items related to studies, the competences are not being generated.

For example, in this profile we are not getting the tags related to the Surrey University studies.

Timeline items without a date

The Stack Overflow Dev Story supports defining a timeline item without a date. To sort the items, it is quite possible that they are using the creation date, but we don't have that information in the HTML.

A start date is mandatory in the MAC JSON schema, so we have two options:

  1. We assign the current date to the start date when this value is undefined.
  2. We search the elements before and after the current one and select a date between them.

For example, this profile has elements without a publishing date.
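Option 2 could be sketched roughly as follows. The `TimelineItem` shape and the `fillMissingDates` helper are hypothetical names, not part of the scraper; the sketch assumes dates are `YYYY-MM-DD` strings and that the items keep the order shown in the Dev Story. When only one dated neighbour exists its date is reused, and when none exists the supplied fallback is used (which covers option 1 if the caller passes the current date).

```typescript
// Hypothetical shape; the scraper's real timeline type may differ.
interface TimelineItem {
  title: string;
  startDate?: string; // "YYYY-MM-DD", undefined when the Dev Story shows no date
}

// Returns a copy of `items` in which every undated item receives a start date.
function fillMissingDates(items: TimelineItem[], fallback: string): TimelineItem[] {
  const toMs = (iso: string): number => Date.parse(`${iso}T00:00:00Z`);
  const toIso = (ms: number): string => new Date(ms).toISOString().slice(0, 10);

  return items.map((item, index) => {
    if (item.startDate !== undefined) return item;

    // Closest dated neighbours on each side, taken from the original list.
    const before = items.slice(0, index).reverse().find((i) => i.startDate !== undefined);
    const after = items.slice(index + 1).find((i) => i.startDate !== undefined);

    let startDate: string;
    if (before?.startDate && after?.startDate) {
      // Option 2: pick the midpoint between the two neighbouring dates.
      const middle = Math.floor((toMs(before.startDate) + toMs(after.startDate)) / 2);
      startDate = toIso(middle);
    } else {
      // Only one dated neighbour (or none): reuse it, or use the fallback.
      startDate = before?.startDate ?? after?.startDate ?? fallback;
    }
    return { ...item, startDate };
  });
}
```

Note that consecutive undated items all receive the same midpoint in this sketch, because neighbours are looked up in the original list rather than in the partially filled one.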

Better names separation

Right now we treat the first element of a name as the first name and the rest of the string as surnames.

But for some names, like Robert C. Martin, we can do a bit better by splitting them like this:

{
   name: 'Robert C.',
   surnames: 'Martin'
}
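One possible heuristic, as a minimal sketch: keep the first token as the name, and also attach any following tokens that look like initials (a single letter, optionally followed by a dot), while always leaving at least one token for the surnames. The `splitName` helper and the `SplitName` type are hypothetical, and the initial check is ASCII-only, which is an assumption that will not hold for every name.

```typescript
interface SplitName {
  name: string;
  surnames: string;
}

// Splits a full name, keeping middle initials together with the first name.
function splitName(fullName: string): SplitName {
  const parts = fullName.trim().split(/\s+/).filter((p) => p.length > 0);
  if (parts.length === 0) return { name: "", surnames: "" };

  // Assumption: an initial is one ASCII letter with an optional trailing dot.
  const isInitial = (token: string): boolean => /^[A-Za-z]\.?$/.test(token);

  let givenCount = 1;
  // Absorb initials into the name, but keep the last token for the surnames.
  while (givenCount < parts.length - 1 && isInitial(parts[givenCount] ?? "")) {
    givenCount++;
  }

  return {
    name: parts.slice(0, givenCount).join(" "),
    surnames: parts.slice(givenCount).join(" "),
  };
}
```

Names without initials keep the current behaviour: the first token is the name and the rest are surnames.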

Actual organization link

For a job position, we extract the name and link of the company, but the company link in the Dev Story points to the company's page on Stack Overflow.

We can load that page and extract the actual link from there.

Open source project could have the right type

The projects parser uses other as the default type. Because a project comes from a DevStoryArtifact, we can make a better effort:

  • Feature or Apps is other.
  • Open source is openSource.
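The mapping above could look like the sketch below. The `projectTypeFor` helper is hypothetical, and the exact label text that the Dev Story HTML uses for each artifact kind is an assumption here; it should be checked against real profiles.

```typescript
type ProjectType = "openSource" | "other";

// Maps a Dev Story artifact label to a project type, defaulting to "other".
function projectTypeFor(artifactLabel: string): ProjectType {
  // Assumption: the label reads "Open source" for open source artifacts.
  const normalized = artifactLabel.trim().toLowerCase();
  return normalized === "open source" ? "openSource" : "other";
}
```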

Custom type dev story artifacts

During the scraping process, there are some items that are not present in the final JSON.

All the Dev Story items that are not scraped should be returned as highlights.

Create a dictionary to translate a Dev Story tag to a MAC mean

Stack Overflow Dev Stories contain tags for each timeline item. This allows the user to categorize each experience, publication, or artifact item in her Dev Story.

In Manfred's JSON Schema that element is equivalent to

  • $defs.role.properties.means
  • or properties.experience.allOf[2].properties.publicArtifacts.items.properties.means.

A mean has 2 mandatory fields, name and type. We must decide if we assign a default type to each mean, or if we create a dictionary to assign the right type, at least for the most important ones.
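Both approaches can be combined: a small dictionary for the most important tags, plus a default type for everything else. The sketch below is only illustrative. The `tagToMean` helper, the dictionary entries, and the type values ("technology", "tool", "practice") are placeholders that have not been checked against the values the MAC schema actually allows.

```typescript
interface Mean {
  name: string;
  type: string;
}

// Placeholder default; replace with a type value the MAC schema accepts.
const DEFAULT_MEAN_TYPE = "technology";

// Placeholder entries; the real dictionary would cover the most common tags.
const MEAN_TYPE_BY_TAG = new Map<string, string>([
  ["git", "tool"],
  ["scrum", "practice"],
]);

// Translates a Dev Story tag into a mean, using the dictionary when possible.
function tagToMean(tag: string, dictionary: Map<string, string> = MEAN_TYPE_BY_TAG): Mean {
  const name = tag.trim().toLowerCase();
  return { name, type: dictionary.get(name) ?? DEFAULT_MEAN_TYPE };
}
```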

Self-employment jobs

There are some profiles that have positions not assigned to a company, like this one

In these cases we will create an organization, so that the job position looks like:

{
    "organization": {
        "name": "Self Employed"
    },
    "type": "freelance",
    "roles": [
        {
            "name": "Computer Science Tutor",
            "startDate": "2009-05-01",
            "competences": [
                {
                    "name": "java",
                    "type": "technology"
                },
                {
                    "name": "python",
                    "type": "technology"
                },
                {
                    "name": "linux",
                    "type": "technology"
                }
            ]
        }
    ]
}

Use last day of the month for finish date fields

A Dev Story only provides the year and month for dates. We always use the 1st as the day, but for finish dates it is better to use the last day of the month, so that there is no one-month gap between experiences.
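A minimal sketch of the last-day calculation, assuming the year and month have already been parsed into numbers (the `finishDateFor` helper is a hypothetical name). It relies on the standard JavaScript `Date` behaviour where day 0 of a month resolves to the last day of the previous month.

```typescript
// Builds a "YYYY-MM-DD" finish date using the last day of the given month.
// `month` is 1-based (1 = January, 12 = December).
function finishDateFor(year: number, month: number): string {
  // Date.UTC takes a 0-based month, so passing the 1-based `month` selects
  // the following month; day 0 then rolls back to the last day of `month`.
  const lastDay = new Date(Date.UTC(year, month, 0)).getUTCDate();
  const mm = month < 10 ? `0${month}` : `${month}`;
  return `${year}-${mm}-${lastDay}`;
}
```

For example, `finishDateFor(2020, 2)` gives `2020-02-29`, so leap years are handled without a lookup table.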

*All the users*?

I noticed this on HN a little while ago, and immediately wondered how hard it would be to simply just grab everyone's developer story (it's always fun to be able to turn around and go "here y'go :D"). I can see a decent bit of work has gone into this, and it would be a small shame for only a few people to benefit from it.

Sadly given the timing it looks like it's a bit 11th-hour to try and do everything, but I wonder what's maybe still possible.

After doing some digging and initially getting discouraged by custom URLs, I realized that you can basically access every non-hidden developer story indexed by user ID, ie /story/12345678.

Furthermore, the StackExchange Network data dumps (https://archive.org/download/stackexchange) include an active list of StackOverflow users! Yay!

So I went off, downloaded that, extracted the user ids out, and started working on a little parallel downloader to start vacuuming everything up.

After a few hitches this worked grea429 Too Many Requests

...noooooo :(

I find I'm consistently hitting 429 after only a minute or two making requests 400ms apart, which doesn't feel fast enough.

If someone has access to (or is motivated enough to go sign up for) one of those scraping/proxy services, ...well that's probably the only sure-fire, realistic solution that would nail this given that there are approximately 17,053,414 potential pages (a *large* majority will 404, but they still need to be requested, and get past the rate limiter) - and 9 days left to fetch everything in. If someone is crazy enough to think that's interesting then I might be able to follow along (and, possibly, even help download a few things), but I frankly don't have the disposable income to kick it off myself.

In any case, it's not much, but... here:

userids.txt.xz.zip (5MB)

Now you can skip wading through 5GB of XML, at least :v. ¯\_(ツ)_/¯

(GitHub insisted on ZIP, which is... not quite in the ballpark of compression of XZ)

Caveat emptor: I wouldn't be surprised if the people working on this project have already gone down this path and identified everything in this issue. Perhaps the individual-download approach is a compromise based on that realization.
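For the 429 responses described above, a common mitigation is to back off exponentially between retries and to honour a `Retry-After` value when the server sends one. The sketch below only computes the wait time; the `backoffDelayMs` helper is hypothetical, the 400 ms base mirrors the request spacing mentioned above, and the 60 s cap is an arbitrary assumption. Whether Stack Overflow sends `Retry-After`, and what limits it really applies, is not known here, so this may still not be enough to get past the rate limiter.

```typescript
// Returns how long to wait (in ms) before retrying after a 429 response.
// `attempt` is the number of retries already made (0 for the first retry).
function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  baseMs: number = 400,
  maxMs: number = 60000
): number {
  // Prefer the server's own hint when one was provided.
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  // Otherwise double the wait on every attempt, up to the cap.
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```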

Remove organization type

We have no good strategy to determine which type of organization the person is working for. Instead of using other, we can just leave that field empty, since other does not give much information.
