innovationgraph's Introduction

GitHub Innovation Graph

This repo contains structured data files of public activity on GitHub, aggregated by economy on a quarterly basis from 2020 onward.

Through offerings such as the GitHub Innovation Graph, we hope to inform research and public policy that could benefit from data on software development activity globally. We welcome developers, data analysts, researchers, policymakers, and all other interested stakeholders to explore the data, discover insights, and create visualizations, among much more.

The GitHub Innovation Graph provides data on the following areas:

See the datasheet for more information.

Exploring Innovation Graph data

For an overview of the dataset, check out the charts and tables at the GitHub Innovation Graph website.

To dive deeper into the data and run your own analyses, feel free to fork this repo, explore the structured data files using the exploratory data analysis tool of your choice, and share your findings in our Discussions page.
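As a starting point for that kind of exploration, here is a minimal sketch that parses a per-economy quarterly CSV and groups it into one time series per economy. The column names (`developers`, `iso2_code`, `year`, `quarter`), the economy codes `AA` and `BB`, and all values are assumptions made up for illustration; check the actual data files and the datasheet for the real schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample shaped like a per-economy quarterly data file.
# Column names, economy codes, and values are illustrative assumptions.
SAMPLE_CSV = """developers,iso2_code,year,quarter
1000,AA,2020,1
1100,AA,2020,2
500,BB,2020,1
650,BB,2020,2
"""


def load_rows(text):
    """Parse CSV text into dicts with the numeric fields converted to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "economy": row["iso2_code"],
            "period": (int(row["year"]), int(row["quarter"])),
            "developers": int(row["developers"]),
        })
    return rows


def quarterly_series(rows):
    """Group values by economy, ordered by (year, quarter)."""
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["period"]):
        series[row["economy"]].append((row["period"], row["developers"]))
    return dict(series)


series = quarterly_series(load_rows(SAMPLE_CSV))
for economy, points in series.items():
    print(economy, points)
```

To run this against a real file from a fork of the repo, replace `io.StringIO(text)` with an open file handle and adjust the column names to match.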

Limitations

The GitHub Innovation Graph dataset contains data on (1) public activity (2) on GitHub (3) aggregated by economy (4) on a quarterly basis. As such, this dataset would not be useful for understanding activity that is:

  1. private;
  2. outside of GitHub;
  3. at a more granular geographic level than economy; or
  4. at a more granular temporal level than quarterly.

Additionally, economies that have fewer developers on GitHub (which generally correlates with the population of an economy) will have less data associated with them in this dataset.

See the datasheet for more information on limitations.

Representativeness of Innovation Graph data

How many economies are included?

We endeavor to publish as much data about public activity on GitHub as possible. However, the number of developers varies considerably by economy, and in some cases we decline to publish specific statistics for economies with fewer than 100 unique developers performing the relevant activity during the specified quarter out of an abundance of caution for developers’ privacy. You can find more information on our methodology in the datasheet.
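The effect of such a threshold can be sketched as a simple filter. This is only an illustration of the idea, not the actual publication pipeline: the text says statistics are withheld "in some cases", whereas this sketch applies the cutoff uniformly. The record layout and the field name `unique_developers` are hypothetical.

```python
MIN_DEVELOPERS = 100  # threshold mentioned in the paragraph above


def publishable(records, threshold=MIN_DEVELOPERS):
    """Keep only records with at least `threshold` unique developers.

    Records with fewer than `threshold` developers are withheld.
    """
    return [r for r in records if r["unique_developers"] >= threshold]


# Hypothetical per-economy records for one activity in one quarter.
records = [
    {"economy": "AA", "unique_developers": 2500},
    {"economy": "BB", "unique_developers": 100},
    {"economy": "CC", "unique_developers": 99},
]

kept = publishable(records)
print([r["economy"] for r in kept])
```

With this sample, economy `CC` falls below the threshold and is dropped, while `BB` sits exactly at 100 and is kept.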

The heatmap below shows the count of economies reported for each data file by quarter:

Count of economies by data file by quarter

A heatmap of the count of economies for each GitHub Innovation Graph data file by quarter, which shows that the data for repositories and developers are fairly comprehensive, with over 215 distinct economies represented since Q1 2020. The other data files (with the exception of the topics data file) have fewer economies represented, ranging from about 110 to 180 economies. The topics data file shows distinct economy counts ranging from about 45 to 130 over time.

You can also find the CSV for this heatmap in the data/representativeness_data directory.
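For readers who want to derive similar counts themselves, the following is a rough sketch of counting distinct economies per data file and quarter. The input tuples, file names, and economy codes are hypothetical, and this is not necessarily how the published CSV was produced.

```python
from collections import defaultdict

# Hypothetical observations: (data_file, economy, year, quarter).
# A data file may contain several rows for the same economy and quarter,
# for example one row per topic, so economies are deduplicated with a set.
OBSERVATIONS = [
    ("developers", "AA", 2020, 1),
    ("developers", "BB", 2020, 1),
    ("developers", "AA", 2020, 2),
    ("topics", "AA", 2020, 1),
    ("topics", "AA", 2020, 1),
]


def economies_per_file_quarter(observations):
    """Count distinct economies for each (data_file, year, quarter)."""
    seen = defaultdict(set)
    for data_file, economy, year, quarter in observations:
        seen[(data_file, year, quarter)].add(economy)
    return {key: len(economies) for key, economies in seen.items()}


economy_counts = economies_per_file_quarter(OBSERVATIONS)
for key in sorted(economy_counts):
    print(key, economy_counts[key])
```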

Which economies are included?

We aggregate GitHub activity for economies using a definition broader than recognized UN member states. For example, AQ reports activity from developers stationed in Antarctica. The heatmap below reports the count of data files for each economy by quarter:

A heatmap of the count of GitHub Innovation Graph data files for each economy by quarter, which shows that the more populous economies are more likely to be represented in more data files.

You can also find the CSV for this heatmap in the data/representativeness_data directory.
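The per-economy view can be sketched the same way, this time counting how many distinct data files mention each economy in each quarter. As before, the file names, economy codes, and tuple layout are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical observations: (data_file, economy, year, quarter).
OBSERVATIONS = [
    ("developers", "AA", 2020, 1),
    ("repos", "AA", 2020, 1),
    ("topics", "AA", 2020, 1),
    ("developers", "BB", 2020, 1),
    ("developers", "BB", 2020, 2),
    ("repos", "BB", 2020, 2),
]


def files_per_economy_quarter(observations):
    """Count distinct data files for each (economy, year, quarter)."""
    seen = defaultdict(set)
    for data_file, economy, year, quarter in observations:
        seen[(economy, year, quarter)].add(data_file)
    return {key: len(files) for key, files in seen.items()}


file_counts = files_per_economy_quarter(OBSERVATIONS)
for key in sorted(file_counts):
    print(key, file_counts[key])
```

In this made-up sample, `AA` appears in three data files in Q1 2020 while `BB` appears in only one, which mirrors the pattern the heatmap describes for more and less populous economies.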

License

This project is released under CC0-1.0.

Maintainers

See CODEOWNERS

Support

See SUPPORT

innovationgraph's Issues

ASP.NET is not a language

👋

While I am SUPER interested in tracking the popularity of specific framework use, I will say that ASP.NET seems out of place in this graph. For example, we don't list Rails, so I would expect us to also not list ASP.NET.

Thanks!

Consider separating "programming languages" into logical high level groups

Generally I think it's harder for me to wrap my head around the graph we currently have because it's mixing things that are used for different purposes.

For example, Makefile / Shell / Dockerfile / Batchfile / PowerShell all seem to be a pretty different logical grouping than Java / C / C# / TypeScript / Ruby etc. It's hard to really make sense of what seeing the plots side by side means. Have you considered showing labels or groups or something that would allow the viewer of the graph to separate on an axis that shows more interesting views? Like "Imperative programming languages"? "Shell" languages? Build scripts?

Mostly I'm interested in being able to see things visualized for categories that are "apples to apples".

Insights into top contributors' employment

Would be great if we could have data on the employers behind git pushes.
Assigning an employer to an account may not be trivial, but through the email address or the orgs they belong to, it might be feasible.

Would be great to see the top companies contributing to open-source, as well as the top universities (both globally and per region).

Thank you!

P.S.: I'm going to have 250+ students submitting patches to various GitHub projects soon. I would be thrilled to see the impact and benchmark it against other universities. They are required to use the university email address in the commits and GitHub account.

How to cite?

This is an awesome new dataset, thanks for releasing it. As it's for research purposes, I recommend you explicitly state in the README how you want the dataset to be cited (I didn't find it, but perhaps I missed it).

Can't export licenses to any compressed format

While trying to export the licenses chart to PNG, JPEG, PDF or SVG, the website loads for quite a while and then redirects to the following URL:
image

Can someone else reproduce this or is this only on my machine?

Use programming language colors specified in linguist for programming languages chart

Some are very recognizable, either due to ubiquity, such as JavaScript, or because the assigned color naturally goes with the language name, like Ruby or Rust: https://github.com/github-linguist/linguist/blob/7ca3799b8b5f1acde1dd7a8dfb7ae849d3dfb4cd/lib/linguist/languages.yml#L6139C11-L6139C18

This would make the chart a little more instantly legible for heavy GitHub users (and perhaps users of other tools, I'm mildly curious now how widely those color associations have spread).

[Feature request] "Popular" projects?

This is a really cool project!

I was looking at the India figures and they are astonishing. 14M git pushes. 30M repositories. 12M developers. 450k orgs. In the FOSS communities that we're a part of here in India, a conversation that often comes up is the low number of widely used FOSS projects originating from here. But these numbers are hard to reconcile with that observation (anecdotal, of course) and perhaps indicate a different reality altogether.

It would be great if there was a way to get visibility into "popular" projects from geographies. Stars, forks etc. don't of course directly imply quality, usage, or even popularity, but may serve as an indicator. Perhaps there are other data points that serve as indicators.

Thank you

What is the working definition of repo in the csv file?

I am looking into the reason that the number of repos declined rapidly in China around Q1 2022. I wonder whether there is a working definition of a repo?

  • Would any active repo count as a repo?

  • If there is no action on the repo for a given time, would it still count?

  • If a repo is archived, would it still count?

Similarly, I wonder whether there is a working definition for organization in the dataset.

Monthly frequency

Thanks a lot for this great project! I was wondering whether it would be possible to also provide this dataset at a monthly frequency? I assume that the underlying data is available. Any feedback is greatly appreciated!
