
chaoss / augur

Python library and web service for Open Source Software Health and Sustainability metrics & data collection. You can find our documentation and new contributor information easily here: https://oss-augur.readthedocs.io/en/main/ and learn more about Augur at our website https://augurlabs.io

Home Page: https://oss-augur.readthedocs.io/en/main/

License: MIT License

Languages: Python 56.51%, HTML 0.13%, JavaScript 0.02%, CSS 0.33%, Makefile 0.14%, Shell 0.72%, Dockerfile 0.06%, PLpgSQL 39.23%, Mako 0.01%, Jinja 2.85%

Topics: chaoss, linux, linux-foundation, open-source, opensource, github, data-visualization, facade, git, metrics

augur's Introduction

Augur NEW Release v0.63.3

Augur is primarily a data engineering tool that makes it possible for data scientists to gather open source software community data. Less data carpentry for everyone else! The primary way of looking at Augur data is through 8Knot. A public instance of 8Knot is available at https://metrix.chaoss.io, and it is tied to a public instance of Augur at https://ai.chaoss.io.

We follow the First Timers Only philosophy of tagging issues for first timers only, and walking one newcomer through the resolution process weekly. You can find these issues tagged with "first timers only" on our issues list.


NEW RELEASE ALERT!

Augur is now releasing a dramatically improved new version to the main branch. It is also available here: https://github.com/chaoss/augur/releases/tag/v0.63.3

  • The main branch is a stable version of our new architecture, which features:
    • Dramatic improvement in the speed of large scale data collection (100,000+ repos). All data is obtained for 100k+ repos within 2 weeks.
    • A new job management architecture that uses Celery and Redis to manage queues, and enables users to run a Flower job monitoring dashboard
    • Materialized views to increase the snappiness of APIs and frontends on large-scale data
    • Changes to primary keys, which now employ a UUID strategy that ensures unique keys across all Augur instances
    • Support for https://github.com/oss-aspen/8knot dashboards (view a sample here: https://eightknot.osci.io/). (beautification coming soon!)
    • Data collection completeness assurance enabled by a structured, relational data set that is easily compared with platform API Endpoints
  • The next release of the new version will include a hosted version of Augur where anyone can create an account and add repos “they care about”. If the hosted instance already has a requested organization or repository, it will be added to the user's view. If it's a new repository or organization, the user will be notified that collection will take (time required for the scale of repositories added).
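As a rough illustration of the cross-instance key idea mentioned above, deterministic UUIDs can be derived from a record's natural identity. The namespace, table names, and key layout below are illustrative assumptions, not Augur's actual schema:

```python
import uuid

# Hypothetical per-deployment namespace (illustrative only).
AUGUR_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://ai.chaoss.io")

def record_key(table: str, natural_key: str) -> uuid.UUID:
    """Derive a deterministic key that is unique across instances:
    the same record always maps to the same UUID, and records from
    different tables or source platforms cannot collide."""
    return uuid.uuid5(AUGUR_NAMESPACE, f"{table}:{natural_key}")

# The same input always yields the same key, so re-collection is idempotent.
k1 = record_key("issues", "github.com/chaoss/augur/issues/42")
k2 = record_key("issues", "github.com/chaoss/augur/issues/42")
assert k1 == k2
```

Because the derivation is deterministic, two Augur instances that collect the same repository independently produce identical keys for identical records, which is what makes data shareable across instances.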

What is Augur?

Augur is a software suite for collecting and measuring structured data about free and open-source software (FOSS) communities.

We gather trace data for a group of repositories, normalize it into our data model, and provide a variety of metrics about that data. The structure of our data model enables us to synthesize data across various platforms to provide meaningful context for questions about the way these communities evolve. Augur's main focus is to measure the overall health and sustainability of open source projects, as these projects are system-critical for nearly every software organization or company. For example, one of our metrics is Burstiness: how are short timeframes of intense activity, followed by a corresponding return to a typical pattern of activity, observed in a project?

This can paint a picture of a project's focus and offer insight into the project's potential stability and how its typical cycle of updates occurs.
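One standard way to quantify this kind of pattern is the Goh-Barabasi burstiness parameter, B = (sigma - mu) / (sigma + mu), computed over activity counts or inter-event times. The sketch below illustrates the general idea; it is not necessarily the exact formula Augur implements:

```python
from statistics import mean, pstdev

def burstiness(counts):
    """Goh-Barabasi burstiness B = (sigma - mu) / (sigma + mu).
    B near 1 means bursty, near 0 means Poisson-like, and
    negative values mean regular, steady activity."""
    mu, sigma = mean(counts), pstdev(counts)
    if mu + sigma == 0:
        return 0.0
    return (sigma - mu) / (sigma + mu)

# Weekly commit counts: a steady project vs. one burst of activity.
steady = [5, 5, 5, 5, 5, 5]
bursty = [0, 0, 40, 0, 0, 0]
print(burstiness(steady))  # -1.0: perfectly regular
print(burstiness(bursty))  # positive: activity concentrated in one burst
```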

We are a CHAOSS project, and many of our metrics are implementations of the metrics defined by our awesome community. You can find a full list of them here.

For more information on how to get involved, visit the CHAOSS website.

Collecting Data

Augur supports Python 3.6 through Python 3.9 on all platforms. Python 3.10 and above do not yet work because of machine learning worker dependencies. On macOS, you can create a Python 3.9 environment this way: python3.9 -m venv path/to/venv.
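If you want to fail fast before installing, a version guard like the following captures the constraint. This is a convenience sketch, not part of Augur's codebase:

```python
import sys

def augur_python_supported(version=None):
    """True if the interpreter is in Augur's supported range:
    Python 3.6 through 3.9 (3.10+ breaks ML worker dependencies)."""
    if version is None:
        version = sys.version_info
    return (3, 6) <= tuple(version[:2]) <= (3, 9)

# Example: check candidate versions before building a venv.
print(augur_python_supported((3, 9)))   # True
print(augur_python_supported((3, 10)))  # False
```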

Augur's main focus is to measure the overall health and sustainability of open source projects.

Augur collects more data about open source software projects than any other available software. One of Augur's core tenets is a desire to openly gather data that people can trust, and then to provide useful, well-defined metrics that give important context to the larger stories being told by that data. We do this in a variety of ways, one of which is doing all of our own data collection in house. We currently collect data from a few main sources:

  1. Raw Git commit logs (commits, contributors)
  2. GitHub's API (issues, pull requests, contributors, releases, repository metadata)
  3. The Linux Foundation's Core Infrastructure Initiative API (repository metadata)
  4. Succinct Code Counter, a blazingly fast Sloc, Cloc, and Code tool that also performs COCOMO calculations
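The COCOMO output mentioned in item 4 can be approximated with the classic basic COCOMO model. The constants below are the standard organic-mode values from the model's literature, which may differ from the defaults SCC uses:

```python
def basic_cocomo(sloc, a=2.4, b=1.05, c=2.5, d=0.38):
    """Basic COCOMO (organic mode): estimate effort in person-months
    and schedule in calendar months from source lines of code."""
    kloc = sloc / 1000.0
    effort = a * kloc ** b          # person-months
    months = c * effort ** d        # development time
    return effort, months

# A hypothetical 50 KLOC project.
effort, months = basic_cocomo(50_000)
```

COCOMO's point is to turn a raw line count into a rough, comparable effort estimate; SCC computes this automatically as part of its per-repository report.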

This data is collected by dedicated data collection workers controlled by Augur, each of which is responsible for querying some subset of these data sources. We are also hard at work building workers for new data sources. If you have an idea for a new one, please tell us - we'd love your input!

Getting Started

If you're interested in collecting data with our tool, the Augur team has worked hard to develop a detailed guide to get started with our project which can be found in our documentation.

If you're looking to contribute to Augur's code, you can find installation instructions, development guides, architecture references (coming soon), best practices, and more in our developer documentation. Please know that while it's still rather sparse right now, we are actively adding to it all the time. If you get stuck, please feel free to ask for help!

Contributing

To contribute to Augur, please follow the guidelines found in our CONTRIBUTING.md and our Code of Conduct. Augur is a welcoming community that is open to all, regardless of whether you're working on your 1000th contribution to open source or your 1st. We strongly believe that much of what makes open source so great is the incredible communities it brings together, so we invite you to join us!

License, Copyright, and Funding

Copyright © 2023 University of Nebraska at Omaha, University of Missouri, Brian Warner, and the CHAOSS Project.

Augur is free software: you can redistribute it and/or modify it under the terms of the MIT License as published by the Open Source Initiative. See the LICENSE file for more details.

This work has been funded through the Alfred P. Sloan Foundation, Mozilla, The Reynolds Journalism Institute, contributions from VMWare, Red Hat Software, Grace Hopper's Open Source Day, GitHub, Microsoft, Twitter, Adobe, the Gluster Project, Open Source Summit (NA/Europe), and the Linux Foundation Compliance Summit. Significant design contributors include Kate Stewart, Dawn Foster, Duane O'Brien, Remy Decausemaker, others omitted due to the memory limitations of project maintainers, and 15 Google Summer of Code Students.

Current maintainers

  • Derek Howard (https://github.com/howderek)
  • Andrew Brain (https://github.com/ABrain7710)
  • Isaac Milarsky (https://github.com/IsaacMilarky)
  • John McGinnis (https://github.com/Ulincys)
  • Sean P. Goggins (https://github.com/sgoggins)

Former maintainers

  • Carter Landis (https://github.com/ccarterlandis)
  • Gabe Heim (https://github.com/gabe-heim)
  • Matt Snell (https://github.com/Nebrethar)
  • Christian Cmehil-Warn (https://github.com/christiancme)
  • Jonah Zukosky (https://github.com/jonahz5222)
  • Carolyn Perniciaro (https://github.com/CMPerniciaro)
  • Elita Nelson (https://github.com/ElitaNelson)
  • Michael Woodruff (https://github.com/michaelwoodruffdev/)
  • Max Balk (https://github.com/maxbalk/)

Contributors

  • Dawn Foster (https://github.com/geekygirldawn/)
  • Ivana Atanasova (https://github.com/ivanayov/)
  • Georg J.P. Link (https://github.com/GeorgLink/)
  • Gary P White (https://github.com/garypwhite/)

GSoC 2022 participants

  • Kaxada (https://github.com/kaxada)
  • Mabel F (https://github.com/mabelbot)
  • Priya Srivastava (https://github.com/Priya730)
  • Ramya Kappagantu (https://github.com/RamyaKappagantu)
  • Yash Prakash (https://gist.github.com/yash-yp)

GSoC 2021 participants

  • Dhruv Sachdev (https://github.com/Dhruv-Sachdev1313)
  • Rashmi K A (https://github.com/Rashmi-K-A)
  • Yash Prakash (https://github.com/yash2002109/)
  • Anuj Lamoria (https://github.com/anujlamoria/)
  • Yeming Gu (https://github.com/gymgym1212/)
  • Ritik Malik (https://gist.github.com/ritik-malik)

GSoC 2020 participants

  • Akshara P (https://github.com/aksh555/)
  • Tianyi Zhou (https://github.com/tianyichow/)
  • Pratik Mishra (https://github.com/pratikmishra356/)
  • Sarit Adhikari (https://github.com/sarit-adh/)
  • Saicharan Reddy (https://github.com/mrsaicharan1/)
  • Abhinav Bajpai (https://github.com/abhinavbajpai2012/)

GSoC 2019 participants

  • Bingwen Ma (https://github.com/bing0n3/)
  • Parth Sharma (https://github.com/parthsharma2/)

GSoC 2018 participants

  • Keanu Nichols (https://github.com/kmn5409/)


augur's Issues

Use of Django

Earlier, Derek and I discussed using Django vs Flask for our web portion, and we had initially decided on Flask. However, Matt and I discussed today that Django may allow for a known organization of what goes where in the code, which may help others to understand our project. This could lead to easier contributions from others if they also understand the Django framework.

I had installed Django for some of the initial work I did on learning to connect to the API, and I found it was quite easy to use. Here is a tutorial about making a Django app: https://docs.djangoproject.com/en/1.10/intro/tutorial01/. One thing I did not follow from the tutorial was setting it up to work with a database (I imported a separate driver), so if we were to follow the framework we'd probably need to do it that way instead. I am about to head to the lab meeting, but I have a views.py I will also post later to show an example of using Django (though I admit I didn't organize it correctly to be Django-like).

Do not expose individual users

Last week on the OSS Health Group call, we discussed exposing individual contributors through the metrics.
Some of our metrics currently return users' login names.
Would these metrics still be informative without exposing individual people?

Usage Issues

Once the install is done, how do I use the system?

Here is my opening screen:

[screenshot: screen shot 2017-04-07 at 4 06 40 pm]

I think that owner/repository is like: cakephp/cakephp. This isn't super clear from the UI.

When I do get the new repo, there are no indicators, just a green 'healthy' button. Am I supposed to be seeing more?

Proposed Removal of Wiki Pages

Hi all,

I'd like to propose that we move the Wiki pages to their own .md documents in the /docs folder. This was a recommendation from several folks here but primarily those who are actively using GH to build out communities. The suggestion is that Wiki pages are just not a very highly used or observed part of GH and that keeping things in files like this allows for better tracking through pull requests.

Tests

We need to support CI. We will probably use Travis.

Some challenges include:

  • Testing against a large database hosted elsewhere
  • Testing commands that will have different results each run, since they work with real-time data
  • Testing the initialization of the database

Tests are not the highest priority right now, but they should be supported before 1.0.0.

Date Change Not Updating Comparison on Deployed Test

At ghdata test server the comparison projects are not having their date ranges updated when filtering by date. There also seems to be an issue with the 100% setting on compared projects in the latest deploy.

Add batch requests

Adding batch requests to our API will improve the efficiency of communication between our frontend and backend greatly.

This API should also support different resolutions like weekly, monthly, etc. to further reduce the work required on the frontend to render the data. This could potentially reduce code duplication between projects that use our API and definitely improve performance on slow connections.
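The shape of such a batch endpoint might look like the sketch below. The metric names, payload format, and dispatch table are hypothetical, not Augur's actual API:

```python
# Hypothetical metric registry: each metric maps to a function that
# produces data for a repo (stubbed here with fixed values).
METRICS = {
    "commits": lambda repo: [1, 2, 3],
    "issues": lambda repo: [4, 5],
}

def handle_batch(requests):
    """Resolve many metric requests in one round trip instead of
    one HTTP call per metric, returning one result per request."""
    out = []
    for req in requests:
        fn = METRICS.get(req["metric"])
        if fn is None:
            out.append({"metric": req["metric"], "error": "unknown metric"})
        else:
            out.append({"metric": req["metric"], "data": fn(req["repo"])})
    return out
```

A single batched response like this is what lets a slow connection fetch an entire dashboard's worth of metrics in one request.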

ghdata.cfg options are not handled well

When a new parameter needs to be added to ghdata.cfg, the entire file has to be regenerated or the new options have to be appended by hand.

Ideally, all options should be optional and if they don't appear in ghdata.cfg, they should be added automatically with a sensible default.

I totally take responsibility for causing this bug, and I'll fix it myself eventually if no one else does but wanted to make a more accessible way to contribute available publicly in case anyone else wants to tackle it.

Installation Issues

I can never ever seem to get this installed on a fresh install. I always end up with 404 errors. The install seems fine but when I fire up the server, things just don't come up on the browser. This happens a lot to me.

Also, opendata.missouri.edu seems to be timing out. It is possible this is me as I'm at a community center doing this.

Documentation

Good practices in documentation will help our code to be maintainable over the long run. It also allows for better teamwork and easier contribution from outside users.

  1. Comments. We should have comments explaining what each part of the code is doing and why. This helps both new contributors and ourselves to remember or understand the code. Comments also need to be kept up to date with the code so that they match what is actually happening.

  2. Clean code. While when initially solving a problem we may end up with code that is messy and hard to read, we should strive to clean up the code such that it is as self-explanatory as possible. This includes but is not limited to intuitive variable names, good formatting, and refactoring (with tests) to make code easier to follow when necessary.

  3. Unit tests. While the main purpose of unit tests is to determine whether something is working and detect breaking changes, they can also be a form of documentation. If a unit test is complete, well-written, and up-to-date with the code it refers to, it shows the reader the expected results of running a particular piece of the code.

Google BigQuery with GitHub data

Today, Patrick from eBay pointed us to Google BigQuery because they have the GitHub Archive dataset and can run large queries in a short amount of time.

One idea of what could be done:

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it.

Another example combines datasets to show the effect of Hacker News on the stars of a project (how well is the PR working?)

I leave the exploration of whether this is useful to our able programmers :-)

GHTorrent difficult to synchronize

GHTorrent is an incredible resource, but it's difficult to deploy a version of it that is in sync with GitHub. We have a few options:

  • Write scripts to make it easier/containerize the deployment process
  • Lessen our dependency on GHTorrent and build directly on the GitHub API

What should be our path moving forward?

Watchers, stargazers, starring a repo

Do stargazers, watchers, and starring a repo mean the same thing?

This code and the comments use different terms for what the query is getting. If, for example, I was to link or copy/paste this code to the "watchers" metric on the wiki, would that be correct?

# Basic timeseries queries
def stargazers(self, repoid, start=None, end=None):
    """
    Timeseries of when people starred a repo
    :param repoid: The id of the project in the projects table. Use repoid() to get this.
    :return: DataFrame with stargazers/day
    """
    stargazersSQL = s.sql.text(self.__single_table_count_by_date('watchers', 'repo_id'))
    return pd.read_sql(stargazersSQL, self.db, params={"repoid": str(repoid)})

Refactor frontend

GHData's frontend currently has a steep learning curve. One would have to understand the whole thing before adding a new metric.

It should be more modular to make it easier for potential contributors to add metrics.

Fix labels

The graphs need to be labeled like they were pre-Vue

Switch/Incorporate Go?

Python is fantastic for what we've been doing so far, which consists of making queries to our GHTorrent database and a Flask API to those queries.

Now that we are going to begin working on metrics that require analyzing repos, I'm not sure Python+GitPython will meet our needs well. GitPython has a beautiful API but it is probably too slow to make a usable web app that is able to deliver the more complicated metrics without some serious caching.

PyGit2 would probably be fast enough for our needs, but then the installation requirements for Windows users would be more involved. We could containerize it, but then we lose the benefits of having a Python library. Also it's GPL, and a modified GPL at that, so I'm not even sure we can use it.

One solution is to store every repo we analyze and keep them up to sync, analyzing them when they change and storing the results. Essentially we'd be keeping a database up to date with our metrics. That would allow for practically instant response when users look for a repo, but limit the scope of which repos they can look at, much like http://gittrends.io/

I think a better solution would be switching to Go.

I was looking for solutions when I stumbled upon go-git which is written in Go, and would let us download and manipulate git repos entirely in memory. It is created by a company that is doing some ML stuff on the code from every GitHub repo out there, so it was made to tackle a problem similar to ours. We would still cache, and we'd likely want to still download large/popular repos to avoid re-downloading them over and over, but for smaller, more obscure repos our Go package could download them and analyze them in seconds. The entire GitHub ecosystem would be open to our users, while still being practical for us and fast for them.

We would also get the other benefits of Go, including easy cross-platform binaries with no dependencies, and much better general performance vs Python. This would make it trivial for our users to host their own ghdata or use it behind firewalls.

The disadvantage is that go-git doesn't have close to the same breadth as GitPython, so we'd have to reimplement a lot of GitPython's functionality. One important example is that it lacks git blame.

What do you all think?

If Go is not an option, any thoughts on the best way to deal with the slowness of GitPython?

Debian Developer install failure

Running the developer install on Debian (fedora 27): it looks like there is a hard-coded assumption about where bash exists.

  Successfully installed ghdata-0.4.1
  "" pip3 install --upgrade .
  /bin/bash: : command not found
  make: *** [Makefile:43: install-dev] Error 127

Uninstall

Using the current install command to update the version of ghdata conflicts with previously installed versions at the OS level and causes errors whenever you try to start ghdata afterward. If you add an uninstall script or post the command for it in the README, it would make it a lot easier to stay current.

Thanks,
Spencer Robinson

Windows Docker Install Error Message

When installing on Windows 10 with docker (as described in the README) I skipped step 2 (adding environment variables) before running docker-compose build. It gave me this error message:

Building ghdata
ERROR: Windows named pipe error: The system cannot find the file specified. (code: 2)

This may be due to not having the database credentials, in which case a more explicit error message would be desirable, but if it is not then it is likely an error in the build process itself.

Rewire GHData-CHAOSS Metric Description Links

The metrics committee changed the links on the wiki, but many of the links are dead.

As an example, the 'community activity' link on the current ghdata deployment: http://ghdata.sociallycompute.io/?repo=rails+rails

under the "community activity" graph at the top, links here: https://wiki.linuxfoundation.org/oss-health-metrics/metrics/community-activity

Which links to a dead github link, here: https://github.com/chaoss/metrics/blob/master/metrics/community-activity.md

I think we do need links for EACH metric pointing to specific MD files on the CHAOSS Repository.

@GeorgLink : I think this is yours, but let's start a discussion.

Looking to work on front end

Hey GHData community,

I would like to build some additional front-end capabilities for your repo. How would you recommend my partner and I begin? We are fairly new to open source, so any guidance would be greatly appreciated! A good start may just be a more detailed description of how to get the front end that you already have up and running.

Thanks,
srobins259 and VRRobbie

Using a sqlite database that can be created for a single project

Sean (and others),

I think the biggest challenge of using ghdata is that it relies on GHtorrent.

What about having the ability to use a database, created from a given repository, that mimics the schema of GHtorrent? The idea is that if someone is interested in running ghdata on a given project, they can create the corresponding database (with a script in ghdata or maybe another tool) and then run ghdata on that database.

--dmg

Development Workflow

Hi all,

I strongly encourage us to create a formalized workflow around the development of this system. This should include things like:

  1. Contributor agreements
  2. Defined processes for accepting pull requests
  3. Defined processes for merging branches (we should have master and development)

matt

Source Code Structure

Could we come up with a better structure for the source code? Not a fan of dumping the .py source on the top level.

src
install
tests

Something simple like this.

Also, I'd highly, highly recommend consistent and detailed code commenting.

Improve Brunch custom server

Update the brunch custom server to serve API routes to the API.

That update will allow us to run the entire application on a single port, and improve consistency between the upcoming Docker image and local testing.

Exportable Data

The data used to render visualizations should be downloadable as both CSV and JSON. One potential solution is making the API return CSVs directly, which would preserve formatting whether GHData is used as a Python library or as a web application.
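A minimal sketch of serving both formats from one code path, using only stdlib serializers; the function name and row shape are assumptions for illustration:

```python
import csv
import io
import json

def export_rows(rows, fmt):
    """Serialize a metric timeseries (a list of dicts, e.g. one per
    day) as CSV or JSON so one endpoint can serve either download."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Serializing from the same row objects guarantees the two formats never drift apart, whichever interface the user reaches the data through.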

Technical Progress General Inquiry

Hi everyone:

@bkeepers @howderek @wingr especially:

I am writing to give you an overview of the technical progress we are making, and to foreshadow future requests we may make for accelerated or privileged access to the GitHub API. There are headings below so you can scan.

BACKGROUND (we all hopefully share this now, for the most part):

Looking at changes over time in GitHub repositories will be essential to the aims of our project: understanding their health and sustainability. We hypothesize (and, based on preliminary work, we think with some likelihood that we are right) the following:

  1. H1: There is a relationship between derivable indicators of repository activity on GitHub and the type of organization governing the project
  2. H2: There is a relationship between derivable indicators of repository activity on GitHub and performance, as perceived from the perspective of various stakeholders.
  3. H3: Different stakeholders (owners, contributors, users, regulators, etc.) will be influenced by different combinations of indicators.

I think these flow as lower level operating tests from our research questions:

  1. How and to what extent are community health and sustainability indicators identifiable from GitHub open source community data?
  2. What are dominant genres of community based on health and sustainability indicators, and how and to what extent are health and sustainability indicators different between these communities?
  3. How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?
  4. How and to what extent do health and sustainability indicators change over time as communities evolve to include increased membership, new governance structures, and support from foundations?

TECHNICAL APPROACH:

Here, to some extent, we are looking to Brandon and Rowan to validate that we are not missing any key concepts or attributes of the available resources from GitHub. In particular, if there are limitations in the data archives and torrents we are referencing, those would be good to be aware of.

  1. We are doing our indicator development against GHTorrent and the GitHub Archive.

    1. Since data about deleted repositories and users may play a role in our research, it's necessary to use archives of GitHub data as opposed to the timestamp information included in GitHub API requests.
    2. From our initial exploration, it appears there will be two projects that will meet our needs, GHTorrent and GitHub Archive
    3. GHTorrent provides a SQL database of metadata created from the events stream, and GitHub Archive archives those events themselves.
    4. There is a lot of overlap between the datasets, but both are needed. A fast interface to the data is needed, such as the SQL database that is populated by GHTorrent.
  2. Once indicators are mature enough to evaluate (estimated 4-6 weeks), we will need more current information to validate with project stakeholders, who will likely have less recall of things going on a month or two ago than last week. We think less archival indicators are also going to be more compelling for GitHub users generally. To that end,

    1. The data we use will need to become quite “up to date”. What is the best strategy?
      1. Daily dumps provided by the GitHub Archive to fill in the gaps between the SQL backups provided by GHTorrent and the realtime data provided by the GitHub API?
      2. Privileged API Access?
      3. Both?
      4. Other?
    2. Ideally, we would like to demonstrate indicators and provide an indicator exploration site with the hope of prototyping a system that could be used to gain wider evaluation of the indicators (from GitHub’s ecosystem).

Perhaps this is too much for an email and a call is warranted? But I thought I would start here!

Thanks!

Security Metric

Update on Security Metric search:

I looked into the cii-census project on GitHub. It calculates and assigns a point value for security (more information here: https://www.coreinfrastructure.org/programs/census-project). By reverse-engineering it, I found that the data it gets appears to be obtained manually (for example, through the command line) and from the Debian security tracker website, so I'm not sure we could follow the same procedures other than perhaps assigning some sort of point values.

They mention the following sources for their data:
"The data represented here is derived from: DSAs issued by the Security Team; issues tracked in the CVE database, issues tracked in the National Vulnerability Database (NVD), maintained by NIST; and security issues discovered in Debian packages as reported in the BTS."

As far as CVEs (Common Vulnerabilities and Exposures), what I have found when reading about them is that they appear to be voluntary to register. In other words, more CVEs for a project may indicate acceptance of CVEs as the standard, greater ability to detect vulnerabilities, and/or more reporting rather than less security.

I did find this site that gives a CVE count for various products https://www.cvedetails.com/product-list/firstchar-M/vendor_id-0/products.html . As of now I'm not sure of a way to use it other than manually, though.

The National Vulnerability Database also keeps track of vulnerabilities. I have not looked into it in depth.

In short, as of this writing, I have not found a clear way to automatically evaluate the security of a GitHub project or discover how many CVEs it has.

I do know that there are automated tools that can attack a running web application and create a results report based on any vulnerabilities, but that would require us manually installing and running our own copy of a web project (so, not viable).

Links in README end in a 404.

The "it can be synchronized with current data" link in the README leads to a 404 page on GitHub.
The roadmap link also leads to a 404.

License Information Metric

Update on search for ways to get license information:

I looked into DoSOCSv2 for license information. It examines project files that are present locally. We would have to download a project's files from GitHub in order to run something similar. I think it might be technically possible to download the files automatically. We could use the API's contents to get the download URL https://developer.github.com/v3/repos/contents/#get-contents and then download using that URL. However, I'm not sure downloading and storing entire projects, even temporarily, would be viable in terms of speed and storage.

As far as I can tell, DoSOCSv2 uses https://github.com/ardumont/fossology-nomossa to examine files for license information. Fossology lists the following as things it uses. The top two would appear to scan files for license information (though I haven't looked in depth into either Nomos or Monk).

  • Nomos, a license scanner based on regular expressions.
  • Monk, a license scanner based on text comparison.
  • MonkBulk, an extension to Monk for user-based phrase searching.
  • Copyright, an agent searching for copyright, URL, e-mail, and authorship statements.
  • ECC, export control and customs, as an extension to Copyright.
  • Package Agent, an agent exporting metadata from installation packages.
  • Maintenance Agent (new in 2.4.0).
  • Mimetype Agent, running over files trying to determine the mimetype.
  • Buckets, an agent to categorize files based on user-definable definitions.

Best way to determine organization diversity for a project

I created a few queries earlier that count the number of organizations or companies (they are separate concepts in the data) with pull requests on a project.

However, I have been looking more into some of the GHTorrent tables today and found that a user can be a member of multiple organizations. So if user Jane made a pull request, and Jane is a member of 4 different organizations, do we want to count 4 organizations towards diversity for that single pull request? Organizations have unique ids, so it is clear which users are members of which organizations (as compared to companies, described below). Organizations may not match one-to-one with real world companies. "Google" is a separate organization from "Google Drive", "Google Page Speed" and "Google Cloud Platform"

There is another field we can use to determine such diversity, the "company" field. Each user can have only one company. However, looking at the data in this field, I think users can type whatever they want into it. Thus, there are users whose company is "Google Inc.", others whose company is "Google, Inc.", and also those whose company is "Google".
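One way to tame the free-text company field is a rough normalizer. This is only a sketch of the idea; the suffix list and cleanup rules are my own assumptions, not anything in the GHTorrent data:

```python
import re

# Rough normalizer for the free-text "company" field (illustrative only):
# lowercase, strip punctuation, and drop common corporate suffixes so that
# "Google Inc.", "Google, Inc.", and "Google" collapse to the same key.
SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co"}

def normalize_company(raw):
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

for name in ("Google Inc.", "Google, Inc.", "Google"):
    print(normalize_company(name))
```

Even with normalization, typos and abbreviations would still slip through, which is part of why the organization field (with its unique ids) may be the safer choice.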

What do others think about how to decide which user is a member of which organization or company?

  1. Should we be using organizations or companies? I think organizations may be better due to unique ids. It is clear which users are members of which organizations, less so with companies. However, neither seems to match one-to-one with companies as we would normally think of them.
  2. If a user is a member of multiple organizations, do we count them all when a user contributes to a project?
  3. If a user is determined to not be a member of an organization or company (through whatever method we use) do we assume they are independent and count independent users as an organization/company?
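Question 2 can be made concrete with a tiny in-memory example. The table and column names below are a simplified, made-up approximation of the GHTorrent schema, not its real DDL:

```python
import sqlite3

# Sketch of question 2 against a simplified GHTorrent-like schema
# (illustrative table/column names): counting every organization a PR
# author belongs to inflates the diversity tally when one user sits in
# several organizations.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pull_requests (id INTEGER, project_id INTEGER, author_id INTEGER);
CREATE TABLE organization_members (org_id INTEGER, user_id INTEGER);
INSERT INTO pull_requests VALUES (1, 10, 100);                 -- one PR by Jane
INSERT INTO organization_members VALUES (1,100),(2,100),(3,100),(4,100);
""")
# Jane's single pull request contributes 4 "organizations" to the count.
(n,) = con.execute("""
    SELECT COUNT(DISTINCT om.org_id)
    FROM pull_requests pr
    JOIN organization_members om ON om.user_id = pr.author_id
    WHERE pr.project_id = 10
""").fetchone()
print(n)
```

Deciding between this behavior and, say, attributing each user to a single "primary" organization is exactly the policy question being asked here.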

Commands not working

I've tried to run some of the commands in cli.py, either from the command line or by adding a call from the code itself. They don't seem to run, and they produce error messages.

I've tried:

Command line (with no changes to the repo code):
python cli.py repo releases
python cli.py repo releases test_username
and various other combinations, but I get error messages, usually about having the wrong number of arguments (regardless of how many arguments I add or don't add).

I've also tried just putting calls like
releases()
in the code of cli.py and seeing if it will run:
python cli.py
This also results in error messages.

Am I running these commands wrong? Do they work right now? (I mean, do they run without error messages; I know they are skeleton commands.)
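For reference, here is a stand-alone sketch (not Augur's actual cli.py) of how a `repo releases [username]` style command is typically wired with argparse subparsers; wrong-number-of-arguments errors usually mean the subcommand tree or its arguments don't match what's being typed:

```python
import argparse

# Minimal illustration of a two-level "repo releases [username]" command.
# This is a hypothetical layout, not the real cli.py.
parser = argparse.ArgumentParser(prog="cli.py")
sub = parser.add_subparsers(dest="group", required=True)
repo = sub.add_parser("repo").add_subparsers(dest="command", required=True)
releases = repo.add_parser("releases")
releases.add_argument("username", nargs="?", default=None)  # optional positional

args = parser.parse_args(["repo", "releases", "test_username"])
print(args.group, args.command, args.username)
```

If the real cli.py defines its commands differently (e.g. required options rather than positionals), invocations like `python cli.py repo releases` would fail with exactly the argument-count errors described above.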

Install Issue

Hey,

Doing an install. Everything seems fine, but I get a 404.

[screenshot of the 404, 2017-04-06 9:43 AM]

Ubuntu 16+

Github Health Categories and Metrics Suggestions

Hello Everyone,

In working with @germonprez in his class at the University of Nebraska Omaha, I was asked to identify categories and some metrics associated with them. He suggested posting this here for discussion.

When presented with this problem, I first took a step back and looked at how repositories flourish on GitHub. It is important to note that the code itself is not the only thing of concern. The goal here is to assess and rank repositories on what the user is concerned about. From this I derived five key areas, which I named:
• Community – Active contributors to the Repository and their growth and activity.
• Code – Quality and reliability of the source code itself.
• Assistance – Quality and helpfulness of issue resolution.
• Adaptability – The ability for the code to have a variety of uses.
• Licensing – Usability of the code.

It is important to separate these concerns and be able to judge them individually, as this helps the system adapt to varying concerns. An entity that only plans to use the source internally may not need to consider the license, but is still concerned about the quality of the code and the community that supports it.

Community refers to the active contributors that support a repository. We look at the activity of each contributor, both on the repository and on GitHub in general, to determine how much time they commit to this repo compared to other projects. This should also consider the interaction, closeness, and growth of the contributors over time. A few metrics that would apply to this are:
• number of contributors
• frequency of contributions
• activity level of contributors
• Truck Factor ("The number of developers it would need to lose to destroy its progress," from the Influence Analysis of GitHub Repositories)
• Time to become contributor
• Distribution of work across community
• rate of acceptance of new contributions
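The truck factor cited above can be approximated very crudely: the smallest number of top contributors who together account for a given share of all contributions. The data and the 50% threshold below are my own illustrative assumptions, not the definition from the cited paper:

```python
# Crude truck-factor sketch (made-up data): the smallest number of top
# contributors whose combined contributions reach the threshold share.
def truck_factor(contributions, threshold=0.5):
    total = sum(contributions.values())
    covered, count = 0, 0
    for share in sorted(contributions.values(), reverse=True):
        covered += share
        count += 1
        if covered >= threshold * total:
            return count
    return count

repo = {"alice": 400, "bob": 300, "carol": 200, "dave": 100}
print(truck_factor(repo))  # alice + bob cover 700 of 1000 lines
```

Published truck-factor algorithms work on per-file authorship rather than raw totals, so this sketch only conveys the intuition.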

Code is probably both the easiest category to describe and the hardest to evaluate. Ideally, we want to know that it is routinely kept up to date, is clean and well documented, and will continue to stay that way for the foreseeable future. This is easier said than done; one thing that makes it easier to analyze is that it has the most metadata to work with. A few metrics that would apply are:
• number of updates
• regularity of updates
• time since last update
• number of pull rejections
• number of CVEs and the percentage still open – https://www.coreinfrastructure.org/programs/census-project
• 3rd party Dependencies (if obtainable) – https://www.coreinfrastructure.org/programs/census-project
• stars
• overall size of the repository / commits

Assistance is exactly what it sounds like: as a user of the code, how much assistance can you get in implementing it? Additionally, while this may not be directly relevant to some entities, it is indirectly relevant to everyone. Lack of support leads to lower adoption, which in turn leads to a smaller set of stakeholders willing to keep the project going. A few metrics that would apply are:
• number of open issues / label
• time to close
• communication level (response time and number)

Adaptability refers to the degree to which the project could be easily adapted to your specific needs. While this is very useful, it is also extremely hard to determine from metrics. However, I believe a couple could give small indirect indications of flexibility. The first is the number of forks of the repository, followed by the number of downloads. A large number of forks with fewer downloads tends to indicate useful code that can be expanded upon in many ways, whereas a low number of forks but a large number of downloads may indicate a project that is specific but widely useful. More research will need to be done to identify and refine these assumptions.

Licensing refers to the usability of the code. More restrictive licenses may be a turn-off, or may simply require more adaptability and community to be viable. A couple of metrics for licenses would be:
• Is there a license
• Number of licenses
• Flexibility of licenses

Organization History

The organizationHistory folder has now been added under the dev branch. It contains the pythonBlameHistoryTree.py file, which gets the percentage of a repo written by different organizations at the time of each commit. If no organizations are output for a certain commit, or the percentages add up to less than 100%, this is expected, because not all users are members of an organization.

However, with the way users are currently matched to organizations, it is possible for a user to be a member of multiple organizations (leading to a potential sum of more than 100% of the repo, and/or other misleading statistics).

Data for the percentages comes from Git. The git blame command shows which user most recently changed each line of code in a file. By looping through every commit and every file in the repo, we can compute the percentage of the repo written by each user over the history of the repo.
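The per-user tally can be sketched by parsing `git blame --line-porcelain` output, where every blamed line carries its own `author` header. The sample output below is invented for illustration; the real script would run git via subprocess per commit and per file:

```python
from collections import Counter

# Sketch: count lines per author from `git blame --line-porcelain` output.
# In --line-porcelain format, each blamed line repeats an "author <name>" header.
def lines_per_author(porcelain_text):
    counts = Counter()
    for line in porcelain_text.splitlines():
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts

# Invented sample of porcelain output for a 3-line file.
sample = "\n".join([
    "abc123 1 1 2",
    "author Jane",
    "\tfirst line of code",
    "abc123 2 2",
    "author Jane",
    "\tsecond line",
    "def456 3 3 1",
    "author Bob",
    "\tthird line",
])
counts = lines_per_author(sample)
total = sum(counts.values())
print({author: n / total for author, n in counts.items()})
```

Summing these per-file counts across the whole tree at a given commit, then mapping authors to organizations, yields the per-organization percentages described above.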

The user's organization is obtained by querying the GHTorrent database (in my local case, MSR14) with the user's email address.
