
code-pile's People

Contributors

louiscastricato, ncoop57, reshinthadithyan


code-pile's Issues

Catalog Licenses/Copyright for each data source

For every data source, we need to keep track of the license to ensure we are not violating it, especially around redistribution.

The main sources we need to catalog for the first thrust of Code Pile are the following:

  • Programming contests
  • Stack Exchange
  • GitHub Diffs
  • GitHub Issues
  • Reddit
  • AI4Code
  • Discourse
  • Wikibooks

Gitter Discussions


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Gitter is a chat and networking platform that helps to manage, grow and connect communities through messaging, content and discovery.

It has a rich set of discussions around specific topics such as Docker, webpack, etc.

Procedure
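The procedure is still to be defined. As a starting point, here is a minimal sketch against the Gitter REST API (v1); the endpoints follow the public API docs, but GITTER_TOKEN is a placeholder, and /rooms only lists rooms the authenticated account has joined, so room discovery would need its own step.

```python
# Hedged sketch: pull recent messages from the Gitter REST API (v1).
# Assumes a personal access token in GITTER_TOKEN; /rooms only returns
# rooms the token's account has joined.
import os
import requests

API = "https://api.gitter.im/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['GITTER_TOKEN']}"}

rooms = requests.get(f"{API}/rooms", headers=HEADERS).json()
for room in rooms:
    messages = requests.get(
        f"{API}/rooms/{room['id']}/chatMessages",
        params={"limit": 100},
        headers=HEADERS,
    ).json()
    for msg in messages:
        print(room["name"], msg["fromUser"]["username"], msg["text"][:80])
```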

Programming Contest Sites


Dataset URL - Collect Dataset from Programming Contest Sites
Does the dataset exist in a scraped format? Partially; some resources exist, such as CodeContests from DeepMind, APPS, and LeetCode.

Description

Code from competitive programming sites is a high-quality resource for code generation. Websites like Codeforces, AtCoder, and others provide good resources pairing competitive programming problems with solution code.

Procedure

  • Collect the available CodeContests dataset (see the sketch after this list)
  • Processing and formatting
  • lm_dataformat processing
  • Crawl TopCoder
  • Crawl LeetCode
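As a first step, a minimal sketch for pulling CodeContests from the Hugging Face Hub and dumping a sample to parquet; the dataset id deepmind/code_contests is real, but the field names used below ("description", "solutions") are assumptions to check against the dataset card.

```python
# Hedged sketch: sample the DeepMind CodeContests dataset into parquet.
# Field names are assumptions to verify against the dataset card.
from datasets import load_dataset
import pandas as pd

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)

rows = []
for example in ds:
    rows.append({
        "problem": example["description"],              # natural-language statement
        "solutions": example["solutions"]["solution"],  # accepted submissions
        "source": "code_contests",
    })
    if len(rows) >= 1000:  # small sample for a dry run
        break

pd.DataFrame(rows).to_parquet("code_contests_sample.parquet")
```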

Google AI4Code – Kaggle


Google AI4Code – Understand Code in Python Notebooks

Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

The dataset comprises about 160,000 Jupyter notebooks published by the Kaggle community. Jupyter notebooks are the tool of choice for many data scientists for their ability to tell a narrative with both code and natural language. These two types of discourse are contained within cells of the notebook, and we refer to these cells as either code cells or markdown cells (markdown being the text formatting language used by Jupyter).

Procedure

  • Download the Kaggle dataset
  • Process the dataset (see the sketch after this list)
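A minimal processing sketch, assuming the Kaggle layout of one JSON file per notebook with "cell_type" and "source" dicts keyed by cell id (verify against the competition's data description); the ai4code/train path is a placeholder.

```python
# Hedged sketch: flatten AI4Code notebooks into (cell_type, source) records.
import json
from pathlib import Path

import pandas as pd

records = []
for nb_path in Path("ai4code/train").glob("*.json"):
    nb = json.loads(nb_path.read_text())
    for cell_id, cell_type in nb["cell_type"].items():
        records.append({
            "notebook_id": nb_path.stem,
            "cell_id": cell_id,
            "cell_type": cell_type,          # "code" or "markdown"
            "source": nb["source"][cell_id],
        })

pd.DataFrame(records).to_parquet("ai4code_cells.parquet")
```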

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
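For concreteness, a sketch of the kind of unit test each dataset module might ship with; my_dataset_module and its collect() entry point are hypothetical placeholders for whatever the actual dataset code exposes.

```python
# Hedged sketch of a unit test against dummy_dataset.parquet.
# `my_dataset_module.collect` is a hypothetical entry point.
import pandas as pd

def test_dummy_dataset_roundtrip():
    expected = pd.read_parquet("dummy_dataset.parquet")

    from my_dataset_module import collect
    actual = collect(source="dummy")

    assert list(actual.columns) == list(expected.columns)
    pd.testing.assert_frame_equal(
        actual.reset_index(drop=True), expected.reset_index(drop=True)
    )
```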

Discourse Forums


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Discourse is a self-hosted forum platform that communities use to create discussions around a particular topic. Each site contains threads of posts and an ecosystem of replies around its topic.

Procedure

  • Find a good way of indexing all current Discourse websites
  • Apply filtering to only include active/popular Discourse sites
  • Scrape posts that have at least one reply (see the sketch after this list)
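A minimal scraping sketch using Discourse's public JSON endpoints (most Discourse URLs return JSON when ".json" is appended); the response shapes below match the common Discourse API but should be verified per site, and meta.discourse.org is only an example host.

```python
# Hedged sketch: scrape topics with at least one reply from a Discourse site.
import requests

BASE = "https://meta.discourse.org"  # example host

latest = requests.get(f"{BASE}/latest.json").json()
for topic in latest["topic_list"]["topics"]:
    if topic["posts_count"] < 2:  # posts_count includes the opening post
        continue
    thread = requests.get(f"{BASE}/t/{topic['id']}.json").json()
    for post in thread["post_stream"]["posts"]:
        print(post["username"], post["cooked"][:80])  # "cooked" = rendered HTML
```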

Postprocessing and Formatting of Datasets

This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc.

Initial draft of postprocessing:

  1. Exact duplicate removal
  2. Near-duplicate removal (see the sketch after this list)
  3. Removal of specific HTML tags
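For the near-duplicate step, a minimal sketch using the datasketch MinHash LSH index; the shingle size, num_perm, and the 0.85 Jaccard threshold are assumptions to tune on real data.

```python
# Hedged sketch: near-duplicate filtering with MinHash LSH (datasketch).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 2, 1)):  # word 3-shingles
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

documents = ["print('hello world')  # greet", "print('hello world')  # greet!"]

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for doc_id, text in enumerate(documents):
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate indexed yet
        lsh.insert(str(doc_id), m)
        kept.append(doc_id)
```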

Questions for formatting:

  1. How to format forums?
  2. How to format general website articles?
  3. How to format books?

LinusTechTip Programming Forum


Dataset URL - LinusTechTip

Does the dataset exist in a scraped format? No

Description

This is a well-known technology forum; a quick scan shows more than 10,000 programming topics dating back to 2013.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

StackExchange Sites

StackExchange

Dataset URL - The Stack Exchange data dump, which can be used to obtain the dumps of all sites here

Does the dataset exist in a scraped format?

Description

Stack Exchange is a network of question-and-answer websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.

Potential sites

The Entire Stack Exchange Dump.

Procedure

  • Download all the dumps (see the parsing sketch after this list)
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing
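A minimal parsing sketch for the extracted dumps, assuming the public data-dump schema where Posts.xml holds one &lt;row .../&gt; element per post; the attribute names below (Id, PostTypeId, Body, ...) follow that schema.

```python
# Hedged sketch: stream posts out of a (multi-GB) Stack Exchange Posts.xml.
import xml.etree.ElementTree as ET

def iter_posts(path="Posts.xml"):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield {
                "id": elem.get("Id"),
                "post_type": elem.get("PostTypeId"),  # 1 = question, 2 = answer
                "score": elem.get("Score"),
                "body": elem.get("Body"),             # HTML-formatted text
            }
            elem.clear()  # keep memory flat while streaming

questions = sum(1 for p in iter_posts() if p["post_type"] == "1")
```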

GitHub Issues


Dataset URL - here

Does the dataset exist in a scraped format?
URL if yes - here (only for the Hugging Face datasets repository)

Description

GitHub Issues are bug reports, feature requests, and discussions related to a repository. They contain text in GitHub-flavored markdown, along with comments and reactions.

Procedure

We can use the procedure discussed in this blog post, which outlines how to do it for a specific repository. We just need to apply the exact same procedure, but for multiple repositories.
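A minimal sketch of that procedure generalized across repositories via the GitHub REST API; GITHUB_TOKEN and the repo list are placeholders, and API rate limits will require throttling at scale.

```python
# Hedged sketch: page through all issues of several repos via the GitHub API.
import os
import requests

HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
REPOS = ["huggingface/datasets", "pytorch/pytorch"]  # example repo list

def fetch_issues(repo):
    page = 1
    while True:
        batch = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=HEADERS,
        ).json()
        if not batch:
            return
        for issue in batch:
            if "pull_request" not in issue:  # the issues endpoint also returns PRs
                yield issue
        page += 1

issues = [i for repo in REPOS for i in fetch_issues(repo)]
```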

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

issue_post: issue_text
comments: [comment_1, comment_2, ...]
authors: [issue_author, comment_1_author, comment_2_author, ...]
reactions: [[reactions], [reactions], ...]

Google Code Archive


Dataset URL - here

Does the dataset exist in a scraped format?
No

Description

The Google Code Archive contains the data found on the Google Code Project Hosting Service, which was turned down in early 2016.

This archive contains over 1.4 million projects, 1.5 million downloads, and 12.6 million issues. You can learn more about the data served from Google Cloud Storage here.

Google Code offered open-source project hosting on other domains besides just code.google.com, too.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

Reddit

Programming & Computing Sub-Reddits

Dataset URL - awesome list of programming subreddits (Code Pile Spreadsheet); another list of programming subreddits (thanks to @ncoop57!)

Does the dataset exist in a scraped format?

No; we need to format the data into a dialogue format.

Description

Obtain data from the Pushshift Reddit dumps using wget/HTTP requests (2009-2022) and filter for programming-related subreddits.

Procedure

  • Obtain data from the Pushshift Reddit dumps for the years 2006-2022; we probably need to write a script that issues wgets for the data dumps (see the sketch after this list)
  • Store the data dump in a GCP bucket
  • Create 3 tables (authors, submissions, and comments) in BigQuery from the GCP buckets
  • Merge posts with reply chains and author metadata (specifically bio)
  • (Optionally) filter for long dialogue chains following OPT
  • Process Reddit threads (posts and replies) into a conversational form using this script
  • Filter for programming subreddits in the list of subreddits, then process non-programming and programming subreddits separately
  • Process into the output format {"text": string, "meta": obj}
  • Run MinHash dedup
  • Run the lm_format script
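For the first step, a minimal filtering sketch, assuming the public Pushshift archive layout of zstandard-compressed NDJSON files named like RS_YYYY-MM.zst; SUBREDDITS is a placeholder list.

```python
# Hedged sketch: filter a Pushshift submissions dump to programming subreddits.
import io
import json

import zstandard  # pip install zstandard

SUBREDDITS = {"programming", "learnpython", "machinelearning"}  # placeholder

def iter_submissions(path):
    with open(path, "rb") as fh:
        # Pushshift dumps need a large decompression window.
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8", errors="ignore"):
            yield json.loads(line)

kept = [
    s for s in iter_submissions("RS_2022-06.zst")
    if s.get("subreddit", "").lower() in SUBREDDITS
]
```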

Final Data Format inside text

[Context]:
	"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
	using deep learning with SGD to design the learning algorithms of another deep network   *

Extra Contexts:
	[context/2]:
		Could someone there post a summary of the insightful moments.
	[context/1]:
		Basically L2L is the new deep learning.
	[context/0]:
		What's "L2L" mean?

Other features:
	[context_author]:
		goodside
	[response_author]:
		NetOrBrain
	[subreddit]:
		MachineLearning
	[thread_id]:
		5h6yvl

Bitbucket Code


Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

We obtained 1,261,420 repos from Bitbucket that we can download. For each repo, this data includes the fields: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent'].

Procedure

  • Attempt to clone each repo based on the information in the parquet file above (see the sketch after this list)
  • Filter by license, following this list:
MIT-0
MIT
MIT-feh
Apache-2.0
BSD-3-Clause
BSD-3-Clause-Clear
BSD-3-Clause-No-Nuclear-License-2014
BSD-2-Clause
CC0-1.0
EPL-1.0
MPL-2.0
Unlicense
ISC
Artistic-2.0
deprecated_LGPL-3.0+
deprecated_LGPL-2.1+
ECL-2.0
SHL-0.51
MPL-2.0-no-copyleft-exception
  • Process like GitHub CodeParrot
  • Convert to lm_dataformat
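A minimal sketch of the clone-and-filter step; the parquet path and the "license" column name are assumptions about the upstream metadata file, and ALLOWED is only a subset of the list above.

```python
# Hedged sketch: clone only permissively licensed repos from the metadata parquet.
import subprocess

import pandas as pd

ALLOWED = {"MIT", "MIT-0", "Apache-2.0", "BSD-3-Clause", "BSD-2-Clause",
           "CC0-1.0", "MPL-2.0", "Unlicense", "ISC"}  # subset of the list above

repos = pd.read_parquet("bitbucket_repos.parquet")  # placeholder path
for _, repo in repos.iterrows():
    if repo.get("license") not in ALLOWED:          # assumed column name
        continue
    url = f"https://bitbucket.org/{repo['full_name']}.git"
    subprocess.run(["git", "clone", "--depth", "1", url], check=False)
```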

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

Data Processing

We should follow a process similar to the BigScience workshop's dataset processing. They provide many tools ready for us to use, such as data deduplication (both exact-match and near dedup), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:

  • Filtering of low-quality documents
  • Filtering of documents with specific removal words
  • Filtering of exact duplicate content
  • Filtering of near-duplicate content
  • Removal of PII (see the sketch after this list)
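For the PII step, an illustrative regex-based sketch in the spirit of the BigScience data_tooling filters; these two patterns (emails, IPv4 addresses) are examples, not the complete PII policy.

```python
# Hedged sketch: regex-based PII scrubbing for emails and IPv4 addresses.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    text = IPV4.sub("<IP_ADDRESS>", text)
    return text

assert scrub_pii("mail me at a@b.com from 10.0.0.1") == \
    "mail me at <EMAIL> from <IP_ADDRESS>"
```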

Google Code Jam Project - Archive


Dataset URL - here

Does the dataset exist in a scraped format?
Yes - here

Description

Google Code Jam is one of the most famous programming contests, conducted with large-scale participation.

Procedure

  • Add downloader
  • Decompress the .bz archives (see the sketch after this list)
  • Post-process
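For the decompression step, a minimal sketch: Python's bz2 module reads bzip2 streams regardless of the .bz/.bz2 extension; the gcj_archive path is a placeholder.

```python
# Hedged sketch: decompress the archive's bzip2 files to plain text.
import bz2
from pathlib import Path

for archive in Path("gcj_archive").glob("*.bz*"):
    with bz2.open(archive, "rt", encoding="utf-8", errors="ignore") as fh:
        text = fh.read()
    archive.with_suffix(".txt").write_text(text)
```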

GitLab


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

GitLab is like GitHub, but its data is not available in BigQuery.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Texas Instruments Forum


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Texas Instruments Forum. Likely we only care about the Microcontroller, Processor, and Tools subforums.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Programming and Computing Books

Open Access and Free Programming and Computing Books

Dataset URL - Computing Wikibooks; we can download the dump here and filter for computing Wikibooks.
Free Computing Books -- not sure if the books there are safe to use; we need to check.

Does the dataset exist in a scraped format? Yes for HTML/websites; no for books in PDF.

Description

Books contain rich information and present accumulations of knowledge on specific topics. They can also be home to exercises and solutions. Pretraining a model on them could perhaps enhance its chain-of-thought capabilities.

Procedure

  • Process the Wikibooks dataset (see the sketch after this list)
  • Scrape PDF books into text
  • Scrape books on websites/webpages
  • Deduplicate and store in S3
  • Run lm_format
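For the Wikibooks step, a minimal sketch that streams pages out of the XML dump with the standard library; the MediaWiki export namespace below is the usual one for recent dumps but should be checked against the dump's header, and the title-based "Computing" filter is deliberately naive.

```python
# Hedged sketch: stream (title, wikitext) pairs from a Wikibooks XML dump.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check the dump header

def iter_pages(path="enwikibooks-latest-pages-articles.xml"):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == f"{NS}page":
            title = elem.findtext(f"{NS}title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # keep memory flat while streaming

computing = [(t, x) for t, x in iter_pages() if "Computing" in t]  # naive filter
```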

Mailing Lists


Dataset URL -

Does the dataset exist in a scraped format? No

Description

In general:
  • (Almost) every programmer uses a programming language, and huge swathes of programming are organized around these languages
  • Most of these languages have some kind of package manager
  • This package manager usually has download statistics

Procedure

  • Determine the top 50-100 programming languages as shown by GitHub statistics (or similar)
  • Ignore this list and immediately add Coq, Lean, Haskell, and OCaml no matter what, since we need them for proof solving
  • Then add the other 50 languages
  • Locate the mailing list(s) for each programming language and scrape its archives (see the sketch after this list)
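A minimal parsing sketch for a downloaded archive, assuming the list exports mbox (as Mailman/pipermail archives typically do); the file path is a placeholder.

```python
# Hedged sketch: iterate messages in a mailing-list mbox archive.
import mailbox

for message in mailbox.mbox("caml-list.mbox"):  # placeholder path
    payload = message.get_payload(decode=True)  # bytes for single-part mail
    if payload is None:                         # multipart: take first text part
        parts = [p for p in message.walk()
                 if p.get_content_type() == "text/plain"]
        payload = parts[0].get_payload(decode=True) if parts else b""
    print(message["subject"], len(payload or b""))
```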

Zulip Discussions


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Zulip is a real-time chat application for self-hosted or cloud-based discussions of various communities.

There are many CS and SE communities that use Zulip for discussions, such as Coq.

Procedure

  • See if communities are in Common Crawl
  • If so, see if they have an archive like coq
  • If so, easily scrape from their archive

Leetcode


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Leetcode contains many computer science programming questions, with a rich community that shares solutions and discusses the problems.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Bitbucket diffs

Bitbucket has an API for public repos

Dataset URL - None

Does the dataset exist in a scraped format? No (searched using Google, Papers with Code, and Kaggle).

Description

Bitbucket is far less popular for open-source git repos, but it does have them, and it does provide an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate popularity with the number of watchers or contributors. Repos can also be filtered by language, but they do not appear to be filterable by license.

Procedure

  1. Approximate the value of a Bitbucket dataset by pulling metrics on open source. Using the Bitbucket API (see the sketch after this list), pull the following information:
  • number of public repositories
  • distribution of watchers per repository
  • distribution of contributors per repository
  • number of commits per repository

  2. With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list with this metric.

  3. Start pulling commit diffs from the highest-priority repos. Docs
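For step 1, a minimal sketch against the public Bitbucket Cloud 2.0 API; the /repositories endpoint pages through public repos via a "next" cursor, and the field names below follow the documented response shape but should be double-checked.

```python
# Hedged sketch: sample public repo metadata from the Bitbucket 2.0 API.
import requests

url = "https://api.bitbucket.org/2.0/repositories?pagelen=100"
repos = []
while url and len(repos) < 1000:   # small sample for the metric study
    page = requests.get(url).json()
    for repo in page.get("values", []):
        repos.append({
            "full_name": repo["full_name"],
            "language": repo.get("language"),
            "size": repo.get("size"),
            "updated_on": repo.get("updated_on"),
        })
    url = page.get("next")         # cursor to the next page, if any
```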

Arduino Forum


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Forum for Arduino development.

Procedure

  • Scrape the forum
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

GitHub Diffs


Description

The dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From the commit hash and message, produce a dict containing:

  • Raw files before changes
  • Commit message
  • Diff file

This requires, for each commit, downloading the files after the changes and applying the reverse patch to obtain the files before the changes.

We also need to decide on a suitable length threshold to filter on, since we need to include most or all of the before-file in the context window, which significantly restricts the number of lines we can include.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
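Conceptually, the reverse-patch step might look like the following sketch (non-merge commits only; the repo path and commit hash are placeholders, and the gist above is the authoritative version):

```python
# Hedged sketch: recover the "before" files by applying a commit's own
# patch in reverse on a tree checked out at that commit.
import subprocess

def before_files(repo_dir: str, commit: str) -> None:
    # Working tree now holds the files *after* the changes.
    subprocess.run(["git", "checkout", "--detach", commit],
                   cwd=repo_dir, check=True)
    # The commit's own patch; --format= suppresses the log message.
    diff = subprocess.run(["git", "show", "--format=", commit],
                          cwd=repo_dir, check=True,
                          capture_output=True, text=True).stdout
    # Reverse-apply: working tree now holds the files *before* the changes.
    subprocess.run(["git", "apply", "-R"], cwd=repo_dir,
                   input=diff, text=True, check=True)

before_files("some-repo", "abc1234")  # placeholder repo path and commit hash
```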

  • Minimal working example
  • Decide on length threshold
  • parquet output
  • Inherit from dataset.py base classes
  • Parallel processing
  • Bitbucket modifications - see #5

Example

Give an example of the columns and data:

before_file: ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]
commit_message: Change version
diff: [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]
