getmanfred / dev-story-scraper

Scraper to download the profile information from a Stack Overflow Dev Story

License: Creative Commons Attribution Share Alike 4.0 International

Shell 0.01% JavaScript 0.03% TypeScript 3.68% HTML 96.25% Dockerfile 0.03%

dev-story-scraper's Issues

Competences at studies elements

For timeline items related to studies, the competences are not being generated.

For example, in this profile we are not getting the tags related to the Surrey University studies.

Timeline items without a date

The Stack Overflow Dev Story supports defining a timeline item without a date. To sort the items, they are quite possibly using the creation date, but that information is not available in the HTML.

A start date is mandatory in the MAC JSON schema, so we have 2 different options:

Assign the current date as the start date when this value is undefined.
Look at the elements before and after the current one and select a date between them.

For example, this profile has elements without publishing date.
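
A minimal sketch of how the two options could be combined: take the midpoint between the nearest dated neighbours when both exist, and fall back to the current date otherwise. The `TimelineItem` shape and the function names are hypothetical and not taken from the scraper's code.

```typescript
// Hypothetical item shape; the scraper's real types may differ.
interface TimelineItem {
  title: string;
  startDate?: string; // "YYYY-MM-DD", undefined when the Dev Story item has no date
}

// Format a timestamp (milliseconds) as a UTC "YYYY-MM-DD" string.
function toIsoDate(ms: number): string {
  return new Date(ms).toISOString().slice(0, 10);
}

// Walk from `from` in direction `step` (-1 or +1) until a dated item is found.
function nearestDate(items: TimelineItem[], from: number, step: number): string | undefined {
  for (let i = from + step; i >= 0 && i < items.length; i += step) {
    const date = items[i]?.startDate;
    if (date !== undefined) return date;
  }
  return undefined;
}

// Option 2 (date between neighbours) with option 1 (current date) as the fallback.
function fillMissingDates(items: TimelineItem[], today: Date = new Date()): TimelineItem[] {
  return items.map((item, index) => {
    if (item.startDate !== undefined) return item;
    const before = nearestDate(items, index, -1);
    const after = nearestDate(items, index, +1);
    if (before !== undefined && after !== undefined) {
      const midpoint = (Date.parse(before) + Date.parse(after)) / 2;
      return { ...item, startDate: toIsoDate(midpoint) };
    }
    return { ...item, startDate: toIsoDate(today.getTime()) };
  });
}
```

The midpoint keeps an undated item between its neighbours when the timeline is later sorted by date, which the current-date fallback alone would not guarantee.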

Better names separation

Right now we take the first element of a name as the first name and the rest of the string as surnames.

But for names like Robert C. Martin we can do a bit better and split them like this:

{
   name: 'Robert C.',
   surnames: 'Martin'
}
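
One possible heuristic, sketched below, is to treat single-letter tokens (optionally followed by a dot) that come after the first name as middle initials and keep them in `name`. The function name `splitFullName` and the ASCII-only initial pattern are assumptions made for illustration.

```typescript
interface SplitName {
  name: string;
  surnames: string;
}

// Assumption: a token such as "C." or "C" is a middle initial and belongs to the name.
// The pattern is ASCII-only to keep the sketch simple.
const INITIAL = /^[A-Za-z]\.?$/;

function splitFullName(fullName: string): SplitName {
  const parts = fullName.trim().split(/\s+/).filter((part) => part.length > 0);
  if (parts.length === 0) return { name: "", surnames: "" };

  // Start with the first token as the name, then absorb following initials,
  // always leaving at least one token for the surnames.
  let nameEnd = 1;
  while (nameEnd < parts.length - 1 && INITIAL.test(parts[nameEnd] ?? "")) {
    nameEnd++;
  }
  return {
    name: parts.slice(0, nameEnd).join(" "),
    surnames: parts.slice(nameEnd).join(" "),
  };
}
```

Names without initials keep the current behaviour: the first token is the name and the rest are surnames.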

Actual organization link

For a job position, we extract the name and link of the company, but the company link in the Dev Story points to the company's page on Stack Overflow.

We can load that page and extract the actual link from there.

Open source project could have the right type

The project parser uses other as the default type. Because the value comes from a DevStoryArtifact, we can make a better effort:

Feature or Apps is other.
Open source is openSource.
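
The mapping above could be sketched as a small lookup. The label strings are assumed to match what the Dev Story page shows for a DevStoryArtifact, and `projectTypeFromArtifact` is a hypothetical name.

```typescript
type ProjectType = "openSource" | "other";

// Assumption: the artifact label is compared case-insensitively against "open source".
// Anything else, including "Feature or Apps", stays as "other".
function projectTypeFromArtifact(artifactLabel: string): ProjectType {
  const normalized = artifactLabel.trim().toLowerCase();
  return normalized === "open source" ? "openSource" : "other";
}
```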

Recommendations default type is 'reading'

The recommendation element should be of type reading.

Custom type dev story artifacts

During the scraping process, there are some items that are not present in the final JSON.

All the Dev Story items that are not scraped should be returned as highlights.

Create a dictionary to translate a Dev Story tag to a MAC mean

Stack Overflow Dev Stories contain tags for each timeline item. This allows the user to categorize each experience, publication, or artifact item in her Dev Story.

In Manfred's JSON Schema, that element is equivalent to

$defs.role.properties.means
or properties.experience.allOf[2].properties.publicArtifacts.items.properties.means.

A mean has 2 mandatory fields, name and type. We must decide if we assign a default type to each mean, or if we create a dictionary to assign the right type, at least for the most important ones.
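
Both approaches can coexist: a dictionary for the most important tags plus a default type for everything else. The sketch below assumes that; the dictionary entries and their type values are placeholders, since the valid mean types are defined by the MAC schema and are not listed here.

```typescript
interface Mean {
  name: string;
  type: string;
}

// Placeholder default, borrowed from the "technology" type used for competences
// elsewhere in these issues. Check it against the MAC schema before relying on it.
const DEFAULT_MEAN_TYPE = "technology";

// Hypothetical entries: both the tags and the type values are illustrative only.
const TAG_TYPES: ReadonlyMap<string, string> = new Map([
  ["git", "tool"],
  ["scrum", "practice"],
]);

function tagToMean(tag: string): Mean {
  const name = tag.trim().toLowerCase();
  return { name, type: TAG_TYPES.get(name) ?? DEFAULT_MEAN_TYPE };
}
```

Starting with a default keeps every mean valid against the two mandatory fields, and the dictionary can grow as the most common tags are identified.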

Self-employment jobs

There are some profiles that have positions not assigned to a company, like this one.

In these cases we will create a placeholder organization, so that the job position looks like:

{
    "organization": {
        "name": "Self Employed"
    },
    "type": "freelance",
    "roles": [
        {
            "name": "Computer Science Tutor",
            "startDate": "2009-05-01",
            "competences": [
                {
                    "name": "java",
                    "type": "technology"
                },
                {
                    "name": "python",
                    "type": "technology"
                },
                {
                    "name": "linux",
                    "type": "technology"
                }
            ]
        }
    ]
}

Sometimes a comma is used instead of 'at' to separate the organization

Some Dev Story users reference the organization in plain text, separating it from the item title with a comma.

For example, in this profile there are some studies whose organization is not referenced by a link, e.g. Dartmouth College.

The solution is to use a comma (,) as a separator to split the title.
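
A rough sketch of that split, assuming the title is first checked for the usual " at " separator and the last comma is used only as a fallback. The `TitleParts` shape and the function name are hypothetical.

```typescript
interface TitleParts {
  title: string;
  organization?: string;
}

// Assumption: " at " is the regular separator and the LAST comma is the fallback,
// so titles that contain earlier commas keep them in the title part.
function splitTitle(raw: string): TitleParts {
  const text = raw.trim();

  const atIndex = text.lastIndexOf(" at ");
  if (atIndex > 0) {
    return {
      title: text.slice(0, atIndex).trim(),
      organization: text.slice(atIndex + " at ".length).trim(),
    };
  }

  const commaIndex = text.lastIndexOf(",");
  if (commaIndex > 0) {
    const organization = text.slice(commaIndex + 1).trim();
    if (organization.length > 0) {
      return { title: text.slice(0, commaIndex).trim(), organization };
    }
  }

  // No separator found: leave the organization undefined.
  return { title: text };
}
```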

Use last day of the month for finish date fields

A Dev Story provides only the year and month for dates. We always add the 1st as the day, but for finish dates it is better to use the last day of the month, so there is no one-month gap between experiences.
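
As an illustration, the last day of a month can be computed by asking for day 0 of the following month. The function names are hypothetical, and `month` is assumed to be 1-based (1 = January).

```typescript
// Start dates keep the current behaviour: first day of the month.
function startDate(year: number, month: number): string {
  return new Date(Date.UTC(year, month - 1, 1)).toISOString().slice(0, 10);
}

// Finish dates use the last day of the month. Date.UTC takes a 0-based month,
// so passing the 1-based month with day 0 rolls back to the last day of that month.
function finishDate(year: number, month: number): string {
  return new Date(Date.UTC(year, month, 0)).toISOString().slice(0, 10);
}
```

Working in UTC avoids the date shifting by a day depending on the time zone of the machine running the scraper.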

All the users?

I noticed this on HN a little while ago, and immediately wondered how hard it would be to simply just grab everyone's developer story (it's always fun to be able to turn around and go "here y'go :D"). I can see a decent bit of work has gone into this, and it would be a small shame for only a few people to benefit from it.

Sadly given the timing it looks like it's a bit 11th-hour to try and do everything, but I wonder what's maybe still possible.

After doing some digging and initially getting discouraged by custom URLs, I realized that you can basically access every non-hidden developer story indexed by user ID, ie /story/12345678.

Furthermore, the StackExchange Network data dumps (https://archive.org/download/stackexchange) includes an active list of StackOverflow users! Yay!

So I went off, downloaded that, extracted the user ids out, and started working on a little parallel downloader to start vacuuming everything up.

After a few hitches this worked grea429 Too Many Requests

...noooooo :(

I find I'm consistently hitting 429 after only a minute or two making requests 400ms apart, which doesn't feel fast enough.

If someone has access to (or is motivated enough to go sign up for) one of those scraping/proxy services, ...well that's probably the only sure-fire, realistic solution that would nail this given that there are approximately 17,053,414 potential pages (a *large* majority will 404, but they still need to be requested, and get past the rate limiter) - and 9 days left to fetch everything in. If someone is crazy enough to think that's interesting, then I might be able to follow along (and, possibly, even help download a few things), but I frankly don't have the disposable income to kick it off myself.

In any case, it's not much, but... here:

userids.txt.xz.zip (5MB)

Now you can skip wading through 5GB of XML, at least :v. ¯\_(ツ)_/¯

(GitHub insisted on ZIP, which is... not quite in the ballpark of compression of XZ)

Caveat emptor: I wouldn't be surprised if the people working on this project have already gone down this path and identified everything in this issue. Perhaps the individual-download approach is a compromise based on that realization.

Remove organization type

We have no good strategy to know which type of organization the person is working for. Instead of using other, we can just leave that field empty, since other does not give much information.