
code-pile's People

Contributors

louiscastricato, ncoop57, reshinthadithyan


code-pile's Issues

Catalog Licenses/Copyright for each data source

For every data source, we need to keep track of the license to ensure we are not violating it, especially around redistribution.

The main sources we need to catalog for the first thrust of Code Pile are the following:

  • Programming contests
  • Stack Exchange
  • GitHub Diffs
  • GitHub Issues
  • Reddit
  • AI4Code
  • Discourse
  • Wikibooks

Gitter Discussions


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Gitter is a chat and networking platform that helps to manage, grow and connect communities through messaging, content and discovery.

It has a rich set of discussions around specific topics such as Docker, webpack, etc.

Procedure
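The procedure is still to be defined. As a starting point, here is a minimal sketch against the Gitter REST API (v1); the endpoints follow the public API docs, but GITTER_TOKEN is a placeholder, and /rooms only lists rooms the authenticated account has joined, so room discovery would need its own step.

```python
# Hedged sketch: pull recent messages from the Gitter REST API (v1).
# Assumes a personal access token in GITTER_TOKEN; /rooms only returns
# rooms the token's account has joined.
import os
import requests

API = "https://api.gitter.im/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['GITTER_TOKEN']}"}

rooms = requests.get(f"{API}/rooms", headers=HEADERS).json()
for room in rooms:
    messages = requests.get(
        f"{API}/rooms/{room['id']}/chatMessages",
        params={"limit": 100},
        headers=HEADERS,
    ).json()
    for msg in messages:
        print(room["name"], msg["fromUser"]["username"], msg["text"][:80])
```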

Programming Contest Sites


Dataset URL - Collect Dataset from Programming Contest Sites
Does the dataset exist in a scraped format? Partially; some resources exist, such as CodeContests from DeepMind, APPS, and LeetCode.

Description

Code from competitive programming sites is a high-quality resource for code generation. Websites like Codeforces, AtCoder, and others provide good resources pairing competitive programming problems with solution code.

Procedure

  • Collect the available CodeContests dataset (see the sketch after this list)
  • Processing and formatting
  • lm_dataformat processing
  • Crawl TopCoder
  • Crawl LeetCode
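As a first step, a minimal sketch for pulling CodeContests from the Hugging Face Hub and dumping a sample to parquet; the dataset id deepmind/code_contests is real, but the field names used below ("description", "solutions") are assumptions to check against the dataset card.

```python
# Hedged sketch: sample the DeepMind CodeContests dataset into parquet.
# Field names are assumptions to verify against the dataset card.
from datasets import load_dataset
import pandas as pd

ds = load_dataset("deepmind/code_contests", split="train", streaming=True)

rows = []
for example in ds:
    rows.append({
        "problem": example["description"],              # natural-language statement
        "solutions": example["solutions"]["solution"],  # accepted submissions
        "source": "code_contests",
    })
    if len(rows) >= 1000:  # small sample for a dry run
        break

pd.DataFrame(rows).to_parquet("code_contests_sample.parquet")
```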

Google AI4Code – Kaggle


Google AI4Code – Understand Code in Python Notebooks

Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

The dataset comprises about 160,000 Jupyter notebooks published by the Kaggle community. Jupyter notebooks are the tool of choice for many data scientists for their ability to tell a narrative with both code and natural language. These two types of discourse are contained within cells of the notebook, and we refer to these cells as either code cells or markdown cells (markdown being the text formatting language used by Jupyter).

Procedure

  • Download the Kaggle dataset
  • Process the dataset (see the sketch after this list)
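A minimal processing sketch, assuming the Kaggle layout of one JSON file per notebook with "cell_type" and "source" dicts keyed by cell id (verify against the competition's data description); the ai4code/train path is a placeholder.

```python
# Hedged sketch: flatten AI4Code notebooks into (cell_type, source) records.
import json
from pathlib import Path

import pandas as pd

records = []
for nb_path in Path("ai4code/train").glob("*.json"):
    nb = json.loads(nb_path.read_text())
    for cell_id, cell_type in nb["cell_type"].items():
        records.append({
            "notebook_id": nb_path.stem,
            "cell_id": cell_id,
            "cell_type": cell_type,          # "code" or "markdown"
            "source": nb["source"][cell_id],
        })

pd.DataFrame(records).to_parquet("ai4code_cells.parquet")
```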

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
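For concreteness, a sketch of the kind of unit test each dataset module might ship with; my_dataset_module and its collect() entry point are hypothetical placeholders for whatever the actual dataset code exposes.

```python
# Hedged sketch of a unit test against dummy_dataset.parquet.
# `my_dataset_module.collect` is a hypothetical entry point.
import pandas as pd

def test_dummy_dataset_roundtrip():
    expected = pd.read_parquet("dummy_dataset.parquet")

    from my_dataset_module import collect
    actual = collect(source="dummy")

    assert list(actual.columns) == list(expected.columns)
    pd.testing.assert_frame_equal(
        actual.reset_index(drop=True), expected.reset_index(drop=True)
    )
```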

Discourse Forums


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Discourse is a self-hosted forum platform that communities use to create discussions around a particular topic. Each site contains threads of posts and an ecosystem of replies around its topic.

Procedure

  • Find a good way of indexing all current Discourse websites
  • Apply filtering to only include active/popular Discourse sites
  • Scrape posts that have at least one reply (see the sketch after this list)
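A minimal scraping sketch using Discourse's public JSON endpoints (most Discourse URLs return JSON when ".json" is appended); the response shapes below match the common Discourse API but should be verified per site, and meta.discourse.org is only an example host.

```python
# Hedged sketch: scrape topics with at least one reply from a Discourse site.
import requests

BASE = "https://meta.discourse.org"  # example host

latest = requests.get(f"{BASE}/latest.json").json()
for topic in latest["topic_list"]["topics"]:
    if topic["posts_count"] < 2:  # posts_count includes the opening post
        continue
    thread = requests.get(f"{BASE}/t/{topic['id']}.json").json()
    for post in thread["post_stream"]["posts"]:
        print(post["username"], post["cooked"][:80])  # "cooked" = rendered HTML
```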

Postprocessing and Formatting of Datasets

This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc.

Initial draft of postprocessing:

  1. Exact duplicate removal
  2. Near-duplicate removal (see the sketch after this list)
  3. Removal of specific HTML tags
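For the near-duplicate step, a minimal sketch using the datasketch MinHash LSH index; the shingle size, num_perm, and the 0.85 Jaccard threshold are assumptions to tune on real data.

```python
# Hedged sketch: near-duplicate filtering with MinHash LSH (datasketch).
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 2, 1)):  # word 3-shingles
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

documents = ["print('hello world')  # greet", "print('hello world')  # greet!"]

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for doc_id, text in enumerate(documents):
    m = minhash(text)
    if not lsh.query(m):  # no near-duplicate indexed yet
        lsh.insert(str(doc_id), m)
        kept.append(doc_id)
```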

Questions for formatting:

  1. How to format forums?
  2. How to format general website articles?
  3. How to format books?

LinusTechTip Programming Forum


Dataset URL - LinusTechTip

Does the dataset exist in a scraped format? No

Description

This is a well-known technology forum; a quick scan shows more than 10,000 programming topics dating back to 2013.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

StackExchange Sites

StackExchange

Dataset URL - The Stack Exchange data dump, which can be used to obtain the dumps of all sites here

Does the dataset exist in a scraped format?

Description

Stack Exchange is a network of question-and-answer websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.

Potential sites

The Entire Stack Exchange Dump.

Procedure

  • Download all the dumps (see the parsing sketch after this list)
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing
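A minimal parsing sketch for the extracted dumps, assuming the public data-dump schema where Posts.xml holds one &lt;row .../&gt; element per post; the attribute names below (Id, PostTypeId, Body, ...) follow that schema.

```python
# Hedged sketch: stream posts out of a (multi-GB) Stack Exchange Posts.xml.
import xml.etree.ElementTree as ET

def iter_posts(path="Posts.xml"):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield {
                "id": elem.get("Id"),
                "post_type": elem.get("PostTypeId"),  # 1 = question, 2 = answer
                "score": elem.get("Score"),
                "body": elem.get("Body"),             # HTML-formatted text
            }
            elem.clear()  # keep memory flat while streaming

questions = sum(1 for p in iter_posts() if p["post_type"] == "1")
```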

GitHub Issues


Dataset URL - here

Does the dataset exist in a scraped format?
URL if yes - here (only for the Hugging Face datasets repository)

Description

GitHub Issues are bug reports, feature requests, and discussions related to a repository. They contain text in GitHub-flavored markdown, along with comments and reactions.

Procedure

We can use the procedure discussed in this blog post, which outlines how to do it for a specific repository. We just need to apply the exact same procedure, but for multiple repositories.
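A minimal sketch of that procedure generalized across repositories via the GitHub REST API; GITHUB_TOKEN and the repo list are placeholders, and API rate limits will require throttling at scale.

```python
# Hedged sketch: page through all issues of several repos via the GitHub API.
import os
import requests

HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
REPOS = ["huggingface/datasets", "pytorch/pytorch"]  # example repo list

def fetch_issues(repo):
    page = 1
    while True:
        batch = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=HEADERS,
        ).json()
        if not batch:
            return
        for issue in batch:
            if "pull_request" not in issue:  # the issues endpoint also returns PRs
                yield issue
        page += 1

issues = [i for repo in REPOS for i in fetch_issues(repo)]
```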

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

issue_post: issue_text
comments: [comment_1, comment_2, ...]
authors: [issue_author, comment_1_author, comment_2_author, ...]
reactions: [[reactions], [reactions], ...]

Google Code Archive


Dataset URL - here

Does the dataset exist in a scraped format?
No

Description

The Google Code Archive contains the data found on the Google Code Project Hosting Service, which was turned down in early 2016.

This archive contains over 1.4 million projects, 1.5 million downloads, and 12.6 million issues. You can learn more about the data served from Google Cloud Storage here.

Google Code offered open-source project hosting on other domains besides just code.google.com, too.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

Reddit

Programming & Computing Sub-Reddits

Dataset URL - awesome list of programming subreddits (Code Pile Spreadsheet); another list of programming subreddits (thanks to @ncoop57!)

Does the dataset exist in a scraped format?

No; we need to format the data into a dialogue format.

Description

Obtain data from the Pushshift Reddit dumps using wget/HTTP requests (2009-2022) and filter for programming-related subreddits.

Procedure

  • Obtain data from the Pushshift Reddit dumps for the years 2006-2022; we probably need to write a script that issues wgets for the data dumps (see the sketch after this list)
  • Store the data dump in a GCP bucket
  • Create 3 tables (authors, submissions, and comments) in BigQuery from the GCP buckets
  • Merge posts with reply chains and author metadata (specifically bio)
  • (Optionally) filter for long dialogue chains following OPT
  • Process Reddit threads (posts and replies) into a conversational form using this script
  • Filter for programming subreddits in the list of subreddits, then process non-programming and programming subreddits separately
  • Process into the output format {"text": string, "meta": obj}
  • Run MinHash dedup
  • Run the lm_format script
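For the first step, a minimal filtering sketch, assuming the public Pushshift archive layout of zstandard-compressed NDJSON files named like RS_YYYY-MM.zst; SUBREDDITS is a placeholder list.

```python
# Hedged sketch: filter a Pushshift submissions dump to programming subreddits.
import io
import json

import zstandard  # pip install zstandard

SUBREDDITS = {"programming", "learnpython", "machinelearning"}  # placeholder

def iter_submissions(path):
    with open(path, "rb") as fh:
        # Pushshift dumps need a large decompression window.
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8", errors="ignore"):
            yield json.loads(line)

kept = [
    s for s in iter_submissions("RS_2022-06.zst")
    if s.get("subreddit", "").lower() in SUBREDDITS
]
```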

Final Data Format inside text

[Context]:
	"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
	using deep learning with SGD to design the learning algorithms of another deep network   *

Extra Contexts:
	[context/2]:
		Could someone there post a summary of the insightful moments.
	[context/1]:
		Basically L2L is the new deep learning.
	[context/0]:
		What's "L2L" mean?

Other features:
	[context_author]:
		goodside
	[response_author]:
		NetOrBrain
	[subreddit]:
		MachineLearning
	[thread_id]:
		5h6yvl

Bitbucket Code


Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

We obtained 1,261,420 repos from Bitbucket that we can download. For each repo, this data includes the fields: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent'].

Procedure

  • Attempt to clone each repo based on the information in the parquet file above (see the sketch after this list)
  • Filter by license, following this list:
MIT-0
MIT
MIT-feh
Apache-2.0
BSD-3-Clause
BSD-3-Clause-Clear
BSD-3-Clause-No-Nuclear-License-2014
BSD-2-Clause
CC0-1.0
EPL-1.0
MPL-2.0
Unlicense
ISC
Artistic-2.0
deprecated_LGPL-3.0+
deprecated_LGPL-2.1+
ECL-2.0
SHL-0.51
MPL-2.0-no-copyleft-exception
  • Process like GitHub CodeParrot
  • Convert to lm_dataformat
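A minimal sketch of the clone-and-filter step; the parquet path and the "license" column name are assumptions about the upstream metadata file, and ALLOWED is only a subset of the list above.

```python
# Hedged sketch: clone only permissively licensed repos from the metadata parquet.
import subprocess

import pandas as pd

ALLOWED = {"MIT", "MIT-0", "Apache-2.0", "BSD-3-Clause", "BSD-2-Clause",
           "CC0-1.0", "MPL-2.0", "Unlicense", "ISC"}  # subset of the list above

repos = pd.read_parquet("bitbucket_repos.parquet")  # placeholder path
for _, repo in repos.iterrows():
    if repo.get("license") not in ALLOWED:          # assumed column name
        continue
    url = f"https://bitbucket.org/{repo['full_name']}.git"
    subprocess.run(["git", "clone", "--depth", "1", url], check=False)
```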

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (these will later be converted into the final format for language-model consumption), along with an example row or rows so that you can verify your code collects them correctly. In addition to this file, include a unit test that evaluates your code against this dummy dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....

Data Processing

We should follow a process similar to the BigScience workshop's dataset processing. They provide many tools ready for us to use, such as data deduplication (both exact-match and near dedup), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:

  • Filtering of low-quality documents
  • Filtering of documents with specific removal words
  • Filtering of exact duplicate content
  • Filtering of near-duplicate content
  • Removal of PII (see the sketch after this list)
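For the PII step, an illustrative regex-based sketch in the spirit of the BigScience data_tooling filters; these two patterns (emails, IPv4 addresses) are examples, not the complete PII policy.

```python
# Hedged sketch: regex-based PII scrubbing for emails and IPv4 addresses.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    text = IPV4.sub("<IP_ADDRESS>", text)
    return text

assert scrub_pii("mail me at a@b.com from 10.0.0.1") == \
    "mail me at <EMAIL> from <IP_ADDRESS>"
```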

Google Code Jam Project - Archive


Dataset URL - here

Does the dataset exist in a scraped format?
Yes - here

Description

Google Code Jam is one of the most famous programming contests, conducted with large-scale participation.

Procedure

  • Add downloader
  • Decompress the .bz archives (see the sketch after this list)
  • Post-process
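For the decompression step, a minimal sketch: Python's bz2 module reads bzip2 streams regardless of the .bz/.bz2 extension; the gcj_archive path is a placeholder.

```python
# Hedged sketch: decompress the archive's bzip2 files to plain text.
import bz2
from pathlib import Path

for archive in Path("gcj_archive").glob("*.bz*"):
    with bz2.open(archive, "rt", encoding="utf-8", errors="ignore") as fh:
        text = fh.read()
    archive.with_suffix(".txt").write_text(text)
```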

GitLab


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

GitLab is like GitHub, but its data is not available in BigQuery.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Texas Instruments Forum


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Texas Instruments Forum. Likely we only care about the Microcontroller, Processor, and Tools subforums.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Programming and Computing Books

Open Access and Free Programming and Computing Books

Dataset URL - Computing Wikibooks; we can download the dump here and filter for computing Wikibooks.
Free Computing Books -- not sure if the books there are safe to use; we need to check.

Does the dataset exist in a scraped format? Yes for HTML/websites; no for books in PDF.

Description

Books contain rich information and present accumulations of knowledge on specific topics. They can also be home to exercises and solutions. Pretraining a model on them could perhaps enhance its chain-of-thought capabilities.

Procedure

  • Process the Wikibooks dataset (see the sketch after this list)
  • Scrape PDF books into text
  • Scrape books on websites/webpages
  • Deduplicate and store in S3
  • Run lm_format
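For the Wikibooks step, a minimal sketch that streams pages out of the XML dump with the standard library; the MediaWiki export namespace below is the usual one for recent dumps but should be checked against the dump's header, and the title-based "Computing" filter is deliberately naive.

```python
# Hedged sketch: stream (title, wikitext) pairs from a Wikibooks XML dump.
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check the dump header

def iter_pages(path="enwikibooks-latest-pages-articles.xml"):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == f"{NS}page":
            title = elem.findtext(f"{NS}title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # keep memory flat while streaming

computing = [(t, x) for t, x in iter_pages() if "Computing" in t]  # naive filter
```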

Mailing Lists


Dataset URL -

Does the dataset exist in a scraped format? No

Description

In general:
  • (Almost) every programmer uses a programming language, and huge swathes of programming are organized around these languages
  • Most of these languages have some kind of package manager
  • This package manager usually has download statistics

Procedure

  • Determine the top 50-100 programming languages as shown by GitHub statistics (or similar)
  • Ignore this list and immediately add Coq, Lean, Haskell, and OCaml no matter what, since we need them for proof solving
  • Then add the other 50 languages
  • Locate the mailing list(s) for each programming language and scrape its archives (see the sketch after this list)
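A minimal parsing sketch for a downloaded archive, assuming the list exports mbox (as Mailman/pipermail archives typically do); the file path is a placeholder.

```python
# Hedged sketch: iterate messages in a mailing-list mbox archive.
import mailbox

for message in mailbox.mbox("caml-list.mbox"):  # placeholder path
    payload = message.get_payload(decode=True)  # bytes for single-part mail
    if payload is None:                         # multipart: take first text part
        parts = [p for p in message.walk()
                 if p.get_content_type() == "text/plain"]
        payload = parts[0].get_payload(decode=True) if parts else b""
    print(message["subject"], len(payload or b""))
```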

Zulip Discussions


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Zulip is a real-time chat application for self-hosted or cloud-based discussions of various communities.

There are many CS and SE communities that use Zulip for discussions, such as Coq.

Procedure

  • See if communities are in Common Crawl
  • If so, see if they have an archive like coq
  • If so, easily scrape from their archive

Leetcode


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Leetcode contains many computer science programming questions, with a rich community that shares solutions and discusses the problems.

Procedure

  • Scrape the website
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

Bitbucket diffs

Bitbucket has an API for public repos

Dataset URL - None

Does the dataset exist in a scraped format? No (searched using Google, Papers with Code, and Kaggle).

Description

Bitbucket is far less popular for open-source git repos, but it does have them, and it does provide an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate popularity with the number of watchers or contributors. Repos can also be filtered by language, but they do not appear to be filterable by license.

Procedure

  1. Approximate the value of a Bitbucket dataset by pulling metrics on open source. Using the Bitbucket API (see the sketch after this list), pull the following information:
  • number of public repositories
  • distribution of watchers per repository
  • distribution of contributors per repository
  • number of commits per repository

  2. With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list with this metric.

  3. Start pulling commit diffs from the highest-priority repos. Docs
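For step 1, a minimal sketch against the public Bitbucket Cloud 2.0 API; the /repositories endpoint pages through public repos via a "next" cursor, and the field names below follow the documented response shape but should be double-checked.

```python
# Hedged sketch: sample public repo metadata from the Bitbucket 2.0 API.
import requests

url = "https://api.bitbucket.org/2.0/repositories?pagelen=100"
repos = []
while url and len(repos) < 1000:   # small sample for the metric study
    page = requests.get(url).json()
    for repo in page.get("values", []):
        repos.append({
            "full_name": repo["full_name"],
            "language": repo.get("language"),
            "size": repo.get("size"),
            "updated_on": repo.get("updated_on"),
        })
    url = page.get("next")         # cursor to the next page, if any
```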

Arduino Forum


Dataset URL - here

Does the dataset exist in a scraped format? No

Description

Forum for Arduino development.

Procedure

  • Scrape the forum
  • Formulate appropriate filters, if any are needed
  • Processing and formatting
  • lm_dataformat processing

GitHub Diffs


Description

The dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From the commit hash and message, produce a dict containing:

  • Raw files before changes
  • Commit message
  • Diff file

This requires, for each commit, downloading the files after the changes and applying the reverse patch to obtain the files before the changes.

We also need to decide on a suitable length threshold to filter on, since we need to include most or all of the before-file in the context window, which significantly restricts the number of lines we can include.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
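Conceptually, the reverse-patch step might look like the following sketch (non-merge commits only; the repo path and commit hash are placeholders, and the gist above is the authoritative version):

```python
# Hedged sketch: recover the "before" files by applying a commit's own
# patch in reverse on a tree checked out at that commit.
import subprocess

def before_files(repo_dir: str, commit: str) -> None:
    # Working tree now holds the files *after* the changes.
    subprocess.run(["git", "checkout", "--detach", commit],
                   cwd=repo_dir, check=True)
    # The commit's own patch; --format= suppresses the log message.
    diff = subprocess.run(["git", "show", "--format=", commit],
                          cwd=repo_dir, check=True,
                          capture_output=True, text=True).stdout
    # Reverse-apply: working tree now holds the files *before* the changes.
    subprocess.run(["git", "apply", "-R"], cwd=repo_dir,
                   input=diff, text=True, check=True)

before_files("some-repo", "abc1234")  # placeholder repo path and commit hash
```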

  • Minimal working example
  • Decide on length threshold
  • parquet output
  • Inherit from dataset.py base classes
  • Parallel processing
  • Bitbucket modifications - see #5

Example

Give an example of the columns and data:

before_file: ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]
commit_message: Change version
diff: [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]
