getmanfred / dev-story-scraper
Scraper to download the profile information from a Stack Overflow Dev Story
License: Creative Commons Attribution Share Alike 4.0 International
For a timeline item related to studies, the `competences` are not being generated.
For example, at this profile we are not getting the tags related to the Surrey University studies.
The Stack Overflow Dev Story supports defining a timeline item without a date. To sort the items they are quite possibly using the creation date, but we don't have that information in the HTML.
That information is mandatory in the MAC JSON schema, so we have 2 different options:
`undefined`.
For example, this profile has elements without a publishing date.
Right now we treat the first element of a name as the first name and the rest of the string as surnames.
But there are some names, like Robert C. Martin, where we can do a bit better by dividing it like:
{
  name: 'Robert C.',
  surnames: 'Martin'
}
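One way the split above could be approached is sketched below. It assumes that any token after the first that looks like an initial (a single letter, optionally followed by a period) belongs to the first name, and that at least one token is always left for the surnames. The function name `splitName` and the initial-detection rule are illustrative assumptions, not the scraper's actual code.

```javascript
// Hypothetical helper: split a full name into `name` and `surnames`,
// keeping middle initials such as "C." attached to the first name.
function splitName(fullName) {
  const parts = fullName.trim().split(/\s+/).filter(Boolean);
  if (parts.length === 0) {
    return { name: '', surnames: '' };
  }

  // Start after the first token and absorb initials, but always leave
  // the last token available for the surnames.
  let end = 1;
  while (end < parts.length - 1 && /^[A-Za-z]\.?$/.test(parts[end])) {
    end += 1;
  }

  return {
    name: parts.slice(0, end).join(' '),
    surnames: parts.slice(end).join(' '),
  };
}

console.log(splitName('Robert C. Martin'));
```

Names without initials fall back to the current behaviour: the first token is the name and the rest are surnames.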
The projects parser is using `other` as the default type. Because that comes from a `DevStoryArtifact` we can make a better effort:
`other`.
`openSource`.
The recommendation element should be of type `reading`.
The Stack Overflow Dev Stories contain tags for each timeline item. This allows the user to categorize each experience, publication, or artifact item in her Dev Story.
In Manfred's JSON Schema that element is equivalent to:
`$defs.role.properties.means`
`properties.experience.allOf[2].properties.publicArtifacts.items.properties.means`
A `mean` has 2 mandatory fields, `name` and `type`. We must decide whether we assign a default type to each mean, or create a dictionary to assign the right type, at least for the most important ones.
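The two options can be combined: a small dictionary for the most important tags, with a default type for everything else. The sketch below is only an illustration. `MEAN_TYPES`, `DEFAULT_MEAN_TYPE` and `toMean` are hypothetical names, `technology` is the only type value that appears in the examples on this page, and `practice` is an assumed value that would have to be checked against the schema.

```javascript
// Hypothetical dictionary: tag name -> mean type.
// 'practice' is an assumed type value; verify it against the MAC schema.
const MEAN_TYPES = new Map([
  ['java', 'technology'],
  ['python', 'technology'],
  ['linux', 'technology'],
  ['scrum', 'practice'],
]);

// Fallback used for every tag that is not in the dictionary (an assumption).
const DEFAULT_MEAN_TYPE = 'technology';

// Build a mean with both mandatory fields, `name` and `type`.
function toMean(tag) {
  const name = tag.trim().toLowerCase();
  return { name, type: MEAN_TYPES.get(name) || DEFAULT_MEAN_TYPE };
}

console.log(['Java', 'scrum', 'some-unknown-tag'].map(toMean));
```

Starting with a default keeps the output valid against the schema from day one, and the dictionary can grow as the most common tags are classified.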
There are some profiles that have positions not assigned to a company, like this one.
In these cases we will create an organization that returns a job position like:
{
  "organization": {
    "name": "Self Employed"
  },
  "type": "freelance",
  "roles": [
    {
      "name": "Computer Science Tutor",
      "startDate": "2009-05-01",
      "competences": [
        {
          "name": "java",
          "type": "technology"
        },
        {
          "name": "python",
          "type": "technology"
        },
        {
          "name": "linux",
          "type": "technology"
        }
      ]
    }
  ]
}
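A minimal sketch of that fallback follows. The input shape (`title`, `company`, `startDate`, `tags`) and the function name `toJob` are assumptions made for illustration; the scraper's real intermediate objects may look different.

```javascript
// Hypothetical mapper: turn a scraped position into a job entry.
// When no company is present, fall back to a "Self Employed" organization
// and mark the job as freelance.
function toJob(position) {
  const company = (position.company || '').trim();
  const hasCompany = company.length > 0;

  return {
    organization: { name: hasCompany ? company : 'Self Employed' },
    // Only set `type` for the fallback case.
    ...(hasCompany ? {} : { type: 'freelance' }),
    roles: [
      {
        name: position.title,
        startDate: position.startDate,
        competences: (position.tags || []).map((tag) => ({
          name: tag,
          type: 'technology',
        })),
      },
    ],
  };
}

console.log(JSON.stringify(toJob({
  title: 'Computer Science Tutor',
  startDate: '2009-05-01',
  tags: ['java', 'python', 'linux'],
}), null, 2));
```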
Some Dev Story users reference the organization in plain text, separating it from the item title with a comma.
For example, at this profile there are some studies whose organization is not referenced by a link, e.g. Dartmouth College.
The solution is to use `,` as a separator to split the title.
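The split could look roughly like the sketch below. It assumes the organization is whatever follows the last comma, which is a guess about how users write these titles. `splitTitle` and the sample title are hypothetical.

```javascript
// Hypothetical helper: split "Title, Organization" on the last comma.
// If there is no comma, or one side would be empty, the title is left as-is.
function splitTitle(rawTitle) {
  const index = rawTitle.lastIndexOf(',');
  if (index === -1) {
    return { title: rawTitle.trim(), organization: undefined };
  }

  const title = rawTitle.slice(0, index).trim();
  const organization = rawTitle.slice(index + 1).trim();
  if (!title || !organization) {
    return { title: rawTitle.trim(), organization: undefined };
  }

  return { title, organization };
}

console.log(splitTitle('BA Computer Science, Dartmouth College'));
```

This should only run when the item has no linked organization, so titles that legitimately contain a comma are not split when a link already identifies the organization.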
A Dev Story only fills in the year and month for dates. We always use the 1st as the day, but for finish dates it is better to use the last day of the month, so there is no one-month gap between experiences.
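A small sketch of both conversions, assuming the year and a 1-based month are already parsed as numbers (`toStartDate` and `toFinishDate` are hypothetical names). It relies on the standard `Date.UTC` behaviour where day `0` of a month resolves to the last day of the previous month.

```javascript
// Zero-pad a month or day to two digits.
const pad = (value) => String(value).padStart(2, '0');

// Start dates keep using the 1st of the month.
function toStartDate(year, month) {
  return `${year}-${pad(month)}-01`;
}

// Finish dates use the last day of the month. `month` is 1-based, so passing
// it as the 0-based month index with day 0 lands on the last day of `month`.
function toFinishDate(year, month) {
  const lastDay = new Date(Date.UTC(year, month, 0)).getUTCDate();
  return `${year}-${pad(month)}-${pad(lastDay)}`;
}

console.log(toStartDate(2009, 5));  // 2009-05-01
console.log(toFinishDate(2009, 5)); // 2009-05-31
console.log(toFinishDate(2020, 2)); // 2020-02-29
```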
I noticed this on HN a little while ago, and immediately wondered how hard it would be to simply just grab everyone's developer story (it's always fun to be able to turn around and go "here y'go :D"). I can see a decent bit of work has gone into this, and it would be a small shame for only a few people to benefit from it.
Sadly given the timing it looks like it's a bit 11th-hour to try and do everything, but I wonder what's maybe still possible.
After doing some digging and initially getting discouraged by custom URLs, I realized that you can basically access every non-hidden developer story indexed by user ID, i.e. `/story/12345678`.
Furthermore, the StackExchange Network data dumps (https://archive.org/download/stackexchange) include an active list of StackOverflow users! Yay!
So I went off, downloaded that, extracted the user ids out, and started working on a little parallel downloader to start vacuuming everything up.
After a few hitches this worked grea429 Too Many Requests
...noooooo :(
I find I'm consistently hitting 429 after only a minute or two making requests 400ms apart, which doesn't feel fast enough to warrant rate limiting.
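For anyone picking this up, a common way to cope with 429 responses is to back off and retry. The sketch below is not the downloader described above and makes no promise of getting past the limiter. `fetchWithBackoff` is a hypothetical name, and the `request` and `sleep` functions are injected so the example runs without touching the network.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `request()` until it returns something other than a 429, doubling
// the wait after every rate-limited attempt. Gives up after `retries` retries.
async function fetchWithBackoff(request, options = {}) {
  const { retries = 5, baseDelayMs = 400, sleep = defaultSleep } = options;
  let delay = baseDelayMs;

  for (let attempt = 0; ; attempt += 1) {
    const response = await request();
    if (response.status !== 429) {
      return response;
    }
    if (attempt >= retries) {
      throw new Error(`Still rate limited after ${retries} retries`);
    }
    await sleep(delay);
    delay *= 2;
  }
}

// Usage with a fake request that is rate limited twice before succeeding.
const statuses = [429, 429, 200];
fetchWithBackoff(async () => ({ status: statuses.shift() }), { baseDelayMs: 10 })
  .then((response) => console.log('final status:', response.status));
```

In a real run `request` would wrap an HTTP call to a `/story/<id>` URL, and a 404 would simply be returned to the caller as a normal response.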
If someone has access to (or is motivated enough to go sign up for) one of those scraping/proxy services, ...well that's probably the only sure-fire, realistic solution that would nail this given that there are approximately 17,053,414 potential pages (a *large* majority will 404, but they still need to be requested, and get past the rate limiter) - and 9 days left to fetch everything in. If someone is crazy enough to think that's interesting then I might be able to follow along (and, possibly, even help download a few things), but I frankly don't have the disposable income to kick it off myself.
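To put those numbers in perspective, here is the back-of-the-envelope arithmetic, using only the figures quoted above (17,053,414 pages, 9 days, one request every 400ms). The function names are just for illustration.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Sustained request rate needed to cover `pages` within `daysLeft`.
function requiredRequestsPerSecond(pages, daysLeft) {
  return pages / (daysLeft * SECONDS_PER_DAY);
}

// Days a single sequential client needs at one request every `delayMs`.
function daysToFetch(pages, delayMs) {
  return (pages * delayMs) / 1000 / SECONDS_PER_DAY;
}

const pages = 17053414;
console.log(requiredRequestsPerSecond(pages, 9).toFixed(1)); // 21.9 requests per second
console.log(daysToFetch(pages, 400).toFixed(0));             // 79 days at 400ms per request
```

So a single client at 400ms per request is almost nine times too slow, even before the 429s kick in.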
In any case, it's not much, but... here:
userids.txt.xz.zip (5MB)
Now you can skip wading through 5GB of XML, at least :v. ¯\_(ツ)_/¯
(GitHub insisted on ZIP, which is... not quite in the ballpark of compression of XZ)
Caveat emptor: I wouldn't be surprised if the people working on this project have already gone down this path and identified everything in this issue. Perhaps the individual-download approach is a compromise based on that realization.
We have no good strategy to know which type of organization the person is working for. Instead of using `other` we can just leave that field empty, since `other` does not give much information.