google / project-ocean

Project OCEAN is an open science collaboration focused on understanding open source ecosystems, creating datasets that enable research, and forming a clear understanding of the state of open source communities.

Home Page: https://vermontcomplexsystems.org/partner/OCEAN/

License: Apache License 2.0

Python 61.50% Go 38.50%
angular golang go nodejs python opensource research ecosystems graphnetwork

project-ocean's Introduction

Project OCEAN 🦦

"This is not an officially supported Google product"

This repository contains code and related content for the OCEAN open source project.

Overview / Goal

Project OCEAN is an open science collaboration focused on understanding open source ecosystems and creating datasets that enable research and help form a clear understanding of the state of open source communities. OCEAN's goal is to understand the health of open source communities.

Open Source Community Ecosystem Focus

We are focused on studying the following ecosystems:

  • Angular
  • Go
  • Node
  • Python

Project Datasets

We are collecting a list of datasets, based on our ecosystem focus, that would be useful for this project. The link below provides the latest list, which is a work in progress.

OCEAN Open Source Ecosystem Data Map

If you know of a dataset that should be on the list, or have updates to recommend, for now submit an Issue with as much of the following information as you have:

  • Dataset Name
  • Brief Description
  • EcoSystem
  • Data Category (governance model, source code, issue tracker, project docs, release infra, package repo, package manager, social board, community org)
  • Raw Data Location (where is it stored)
  • Size (GB/TB/?) - (if you record a different size, note it next to the number)
  • Accessible (How we can access it: API, scraping with permission, no access option, or something else)
  • Start Date (When it started being collected - at a minimum what year)
  • End Date (When it stopped being collected, or note today if it is kept current - at a minimum what year)
  • Update Frequency (How often it is updated - daily, monthly, etc)
  • Reference Links (especially dataset schema and other info that is useful)
  • License Information (if there is any licensing or terms and conditions attached to the data source)
  • Other Info

OCEAN External Faculty Program

If you want to participate in prioritizing the datasets we capture for research and collaborate with researchers on this effort, there will be a process to apply to join the group. More details will be posted when we have them. For insights into the group, check out the OCEAN Vermont External Faculty Program.

Contributing

We welcome outside contributions to the project, especially since we are studying open source communities. Junior and senior contributors are all welcome. We have a list of Issues that provide ideas on where to start. Feel free to send in PRs if you have something to change or add. Check out the Contributing page for more information on how to participate.

Resources

Resources related to this project:

  • More to come

Source Code Headers

Every file containing source code must include copyright and license information. This includes any JS/CSS files that you might be serving out to the browsers. Please make sure to add the following to any files before you submit.

Apache header:

Copyright 2020 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

project-ocean's People

Contributors

amcasari, amygdala, dependabot[bot], glasnt, nyghtowl, tpryan, tymsai

project-ocean's Issues

Archiving mailing list data pipelines

OCEAN currently doesn't have any active users of this research dataset. The code may still be useful, but it is not going to be actively maintained.

Additionally, given ongoing dependency updates (e.g. #96) and issues with some data sources over time (#94), these pipelines will be moved to an archive folder.

Add python tests to CI/CD

Expected Behavior

Python tests run in GitHub actions

Actual Behavior

Tests are not currently automated to run

Reload data into GCS and BQ

Clean up data load

  • Move all mailing list content into a specific GCS bucket
  • Fix golang file names to drop gg
  • Add groupname to filename for files in the mailing list folders
  • Load all mailing list content into BQ again because the structure has changed

List datasets in Repo

List datasets for the ecosystems that we have found and have interest in assessing.

Improve To parsing in the mailing list data that is loaded to BQ

Expected Behavior

The mailing list To field should be populated with the target person that the email is responding to.

Actual Behavior

In the python_mailinglist table there are some messages where the To value shows up in the body but does not populate the To field.

Body: "B Zy < zy at gmail.com> wrote:

Hello
Help my code."

The To field is capturing the mailing list name instead.

Steps to Reproduce the Problem

  1. Review python mailing list examples
  2. Improve parsing in the extract_msgs script, probably a regex for the body
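
A minimal sketch of the body regex idea, in Python. It assumes the attribution line always looks like the example above ("Name < user at domain> wrote:"); the pattern and the function name extract_reply_target are hypothetical and are not taken from the extract_msgs script.

```python
import re

# Hypothetical pattern for an attribution line such as
# "B Zy < zy at gmail.com> wrote:". Pipermail obfuscates "@" as " at ".
ATTRIBUTION_RE = re.compile(
    r"^(?P<name>[^<\n]*?)\s*<\s*(?P<user>\S+)\s+at\s+(?P<domain>[^>\s]+)\s*>\s*wrote:",
    re.MULTILINE,
)


def extract_reply_target(body: str):
    """Return (name, address) of the person being replied to, or None."""
    match = ATTRIBUTION_RE.search(body)
    if match is None:
        return None
    address = f"{match['user']}@{match['domain']}"
    return match["name"].strip(), address


body = "B Zy < zy at gmail.com> wrote:\n\nHello\nHelp my code."
print(extract_reply_target(body))
```

Other mail clients write attribution lines differently (for example "On <date>, <name> wrote:"), so a real fix would likely need several patterns and a fallback to the existing To value.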

Automate Pipermail and Mailman data ingestion

Set up a scheduler or something similar to run pipermail and mailman to pull mailing list data monthly at the end of the month

Date range should be 1st day to last day of the previous month and run (triggered) on 1st day of the new month.

This will then connect to the work on Issue #26 to move into BQ
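
The date range described above (first to last day of the previous month, computed on the 1st of the new month) can be sketched in Python as follows. The function name previous_month_range is hypothetical; how the scheduler passes the range to the pipermail and mailman scripts is not specified here.

```python
from datetime import date, timedelta


def previous_month_range(today: date):
    """Return (first_day, last_day) of the month before `today`."""
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st lands on the previous month's last day.
    last_of_previous = first_of_this_month - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    return first_of_previous, last_of_previous


start, end = previous_month_range(date(2022, 11, 1))
print(start.isoformat(), end.isoformat())
```

Computing the range from the run date, instead of hard-coding month lengths, handles year boundaries and leap years without special cases.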

Google Groups formatting changed, unit test issues

TL;DR: No Google Groups ingestion currently because of changes to Google Groups, causing scraping code to fail.

Discovered while trying to update dependencies.

Zero topics

Monthly pipeline processing was showing 0 topics returned:

2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.

Checking the go code for how topic counts are captured, the regex doesn't match current Google Groups UI (there may have been some MaterialUI changes since this code was written).

E.g. https://groups.google.com/g/golang-checkins shows 1–30 of 81553 (the dash is specifically \u2013 EN DASH). The regex in getTotalTopics specifies - (\u002D HYPHEN-MINUS).

Because the topic counts are 0, this is affecting loops later on (in my estimation).
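
To illustrate the mismatch, here is a small Python sketch of a count pattern that accepts either dash. The real getTotalTopics is Go code and its exact regex is not reproduced here; the pattern and the name parse_total_topics are assumptions for illustration only.

```python
import re

# Accept HYPHEN-MINUS (U+002D) or EN DASH (U+2013) between the range bounds.
# The hyphen is placed first in the class so it is read as a literal.
TOTAL_TOPICS_RE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)\s+of\s+(\d+)")


def parse_total_topics(text: str):
    """Return the total topic count from text like '1–30 of 81553', or None."""
    match = TOTAL_TOPICS_RE.search(text)
    if match is None:
        return None
    return int(match.group(3))


print(parse_total_topics("1\u201330 of 81553"))  # en dash, as the page shows now
print(parse_total_topics("1-30 of 81553"))       # hyphen-minus, as the old regex expects
```

Returning None instead of 0 when nothing matches would also make this kind of failure visible rather than looking like an empty group.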

Nested unit tests

Additionally, when trying to run unit tests, it appears that running just mailinglists/ doesn't run the nested mailing list packages, so the unit tests for googlegroups weren't being run (and are currently breaking).

Failing topic unit tests

Now running the unit tests:

=== RUN   TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
    googlegroups_data_test.go:300: Result response does not match.
         got: map[2018-09.txt:[]]
        want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]

Infinite redirects

This URL is no longer a valid URL format, as trying to curl it gets stuck in an infinite 301 redirect loop:

$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>

Summary

This is going to take some re-engineering to work out what's changed in the Google Groups format to bring this code back to working.

Fix Python BQ script date timezone handling

Expected Behavior

The parse_datestring function will handle all timezones regardless of format and parse them correctly

Actual Behavior

parse_datestring only handles timezones if a numeric offset is provided. It loses information when the timezone is given as letters or words, like:

Wed, 25 Oct 2006 19:21:24 GMT

Test output should be 2006-10-25 21:21:24
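
One possible direction, sketched in Python with the standard library: email.utils.parsedate_to_datetime understands both numeric offsets and the zone names defined for mail headers (GMT, UT, and the North American abbreviations such as EST). The name parse_datestring_utc is hypothetical and this is not the repository's parse_datestring. The issue does not state which zone the stored value should be in, so normalizing to UTC here is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_datestring_utc(raw: str):
    """Parse a mail-style date string and normalize it to UTC.

    Returns None when the string cannot be parsed or carries no usable zone.
    """
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Unknown zone names (and "-0000") come back as naive datetimes.
        return None
    return parsed.astimezone(timezone.utc)


print(parse_datestring_utc("Wed, 25 Oct 2006 19:21:24 GMT"))
print(parse_datestring_utc("Wed, 25 Oct 2006 19:21:24 +0200"))
```

Zone names outside the mail standard (for example CEST) are returned as None by this sketch, so they would still need a lookup table or a third-party parser.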

Security Policy violation Binary Artifacts

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Binary Artifacts policy: binaries present in source code

Rule Description
Binary Artifacts are an increased security risk in your repository. Binary artifacts cannot be reviewed, allowing the introduction of possibly obsolete or maliciously subverted executables. For more information see the Security Scorecards Documentation for Binary Artifacts.

Remediation Steps
To remediate, remove the generated executable artifacts from the repository.

Artifacts Found

  • 2-transform-data/cloud_func_bq_ingest/__pycache__/msgs_storage_bq.cpython-38.pyc

Additional Information
This policy is drawn from Security Scorecards, which is a tool that scores a project's adherence to security best practices. You may wish to run a Scorecards scan directly on this repository for more details.


Allstar has been installed on all Google managed GitHub orgs. Policies are gradually being rolled out and enforced by the GOSST and OSPO teams. Learn more at http://go/allstar

This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Google Groups not loading all topics | Topic hidden because it was flagged

UPDATE: Found that the messages that were not loading were the ones reported for abuse and hidden. A topic id and message id are found for these messages, but there wasn't a date to generate the filename. The fix is to create an abuse.txt catchall filename and add it to the table structure in BigQuery.
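
The catchall idea can be sketched in Python as below. The capture code itself is Go, so this only illustrates the logic: the function name filename_for_message is hypothetical, and the monthly YYYY-MM.txt naming is assumed from filenames such as 2018-09.txt that appear in the Google Groups unit test output.

```python
from datetime import date
from typing import Optional

ABUSE_FILENAME = "abuse.txt"  # catchall for hidden messages without a date


def filename_for_message(msg_date: Optional[date]) -> str:
    """Pick the output file for a message based on its date, if it has one."""
    if msg_date is None:
        # Messages flagged for abuse are hidden, so no date can be read.
        return ABUSE_FILENAME
    return f"{msg_date:%Y-%m}.txt"


print(filename_for_message(date(2018, 9, 3)))
print(filename_for_message(None))
```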

Expected Behavior

All topics should be captured from Google Groups

Actual Behavior

The capture is coming up short by fewer than 100 topics. Potentially an issue in the goroutine

Steps to Reproduce the Problem

  1. Run capture for angular or nodejs google groups and look at the total expected vs actual reported

This was an issue before, and it had to do with the goroutine collapsing the content, but it's unclear where the miss is now.

Include Google Groups original message url

Expected Behavior

Pull out url where the data originated from for pipermail and mailman and put in BQ

Actual Behavior

Currently the urls used to get the data are not stored and this would be good for reference

Ideas for how to fix

  • Include the url in the filename (the syntax is an issue)
  • Open files and append url on the end

Corruption in Pipermail Files | Special Character Preservation

Expected Behavior

Preserve special characters.

Actual Behavior

It appears that the files didn't fully preserve special characters. Files may just be ASCII, using literal '?'s for non-ASCII chars. Further investigation is needed to determine how much of an issue this is and how to resolve it.

Example

One file I'm using as a check is pipermail-python-list-gzip/2011-February.txt.gz. If you search for the name "Westley" in that file, you'll see lines like "Westley Mart?nez wrote:", with (apparently) a literal "?". The header lines in the file, like "From: anikom15 at gmail.com (Westley =?ISO-8859-1?Q?Mart=EDnez?=)", are treated differently: those are decodable.

Investigation confirmed that these malformations are in the original zipped files that were on the site.
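
For the decodable header case mentioned above, a short Python sketch using the standard library's email.header module shows how the encoded word can be turned back into text. The function name decode_mime_header is hypothetical. This does not help with the literal "?" characters in message bodies, since that information is already missing from the source archives.

```python
from email.header import decode_header, make_header


def decode_mime_header(value: str) -> str:
    """Decode MIME encoded words (=?charset?Q?...?=) in a header value."""
    return str(make_header(decode_header(value)))


header = "anikom15 at gmail.com (Westley =?ISO-8859-1?Q?Mart=EDnez?=)"
print(decode_mime_header(header))
```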

Handle corrupted content in mailman mailing list data

Expected Behavior

Load all mailman mailing list text

Actual Behavior

Errored out on a post from March 9th, 2002 in the python-dev list because it had diamond question marks in the content

Steps to Reproduce the Problem

  1. Add code to catch unknown content and parse around it to pull the uncorrupted text
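
One way to parse around undecodable content, sketched in Python: decode with errors="replace" so bad bytes become U+FFFD (the "diamond question mark") instead of raising, and report how many were replaced. The function name decode_lenient and the choice of UTF-8 as the default encoding are assumptions; the loader's actual encoding handling is not shown in this issue.

```python
def decode_lenient(raw: bytes, encoding: str = "utf-8"):
    """Decode bytes without raising, returning (text, replacement_count).

    Undecodable bytes are replaced with U+FFFD so the rest of the
    message text is still loaded.
    """
    text = raw.decode(encoding, errors="replace")
    # Note: this also counts any U+FFFD characters already present in the input.
    return text, text.count("\ufffd")


text, bad = decode_lenient(b"caf\xe9 ok")
print(repr(text), bad)
```

Keeping the replacement count alongside the text makes it possible to flag affected rows in BQ rather than silently dropping or failing on them.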

Add original content url to pipermail and mailman data

Expected Behavior

Pull out url where the data originated from for pipermail and mailman and put in BQ

Actual Behavior

Currently the urls used to get the data are not stored and this would be good for reference

Ideas for how to fix

  • Include the url in the filename (the syntax is an issue)
  • Open files and append url on the end

Data Analysis and Research

Expected Behavior

Data analysis using the existing datasets to show the contributions to these ecosystems and their impact

Actual Behavior

Nothing built specifically at this time. Blank canvas that welcomes someone to contribute.

This is a very general issue and needs to be broken down into smaller chunks. Starting place is data scripts for analysis and reporting.
