carperai / code-pile
This repository contains all the code for collecting large-scale amounts of code from GitHub.
License: MIT License
Follow work in the data documentation space, such as https://arxiv.org/abs/1803.09010 and https://arxiv.org/abs/2201.07311
We will be basing our documentation on the template from Hugging Face: https://github.com/huggingface/datasets/blob/main/templates/README.md
Discuss, standardize and track how we want to test the submodules of the Code-Pile here.
For every data source, we need to keep track of the license to ensure we are not violating it, especially around redistribution.
The main sources we need to catalog for the first thrust of Code-Pile are the following:
Dataset URL - here
Does the dataset exist in a scraped format? No
Gitter is a chat and networking platform that helps to manage, grow and connect communities through messaging, content and discovery.
It has a rich set of discussions around specific topics such as Docker, webpack, etc.
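Collection could go through Gitter's REST API. A minimal sketch, assuming a personal access token and the v1 room/message endpoints (worth verifying against the current API docs):

```python
import requests

GITTER_API = "https://api.gitter.im/v1"
TOKEN = "..."  # placeholder: a Gitter personal access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def fetch_room_messages(room_id, limit=100):
    """Page backwards through a room's history using beforeId."""
    before_id = None
    while True:
        params = {"limit": limit}
        if before_id:
            params["beforeId"] = before_id
        resp = requests.get(f"{GITTER_API}/rooms/{room_id}/chatMessages",
                            headers=HEADERS, params=params)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from batch
        before_id = batch[0]["id"]  # oldest message in this batch
```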
Dataset URL - Collect Dataset from Programming Contest Sites
Does the dataset exist in a scraped format? Some resources exist, such as CodeContests from DeepMind, APPS, and LeetCode.
Code data from competitive programming sites is a high-quality resource for code generation. Websites like Codeforces, AtCoder, and others provide good collections of competitive programming problems and code.
lm_dataformat
Processing
Google AI4Code - Understand Code in Python Notebooks
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here
The dataset comprises about 160,000 Jupyter notebooks published by the Kaggle community. Jupyter notebooks are the tool of choice for many data scientists for their ability to tell a narrative with both code and natural language. These two types of discourse are contained within cells of the notebook, and we refer to these cells as either code cells or markdown cells (markdown being the text formatting language used by Jupyter).
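Since .ipynb files are plain JSON, separating code cells from markdown cells needs no special tooling; a minimal sketch, assuming the standard nbformat field names:

```python
import json

def split_cells(notebook_path):
    """Return (code_cells, markdown_cells) as lists of strings."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)
    code, markdown = [], []
    for cell in nb.get("cells", []):
        # 'source' may be a list of lines or a single string
        src = cell.get("source", "")
        text = "".join(src) if isinstance(src, list) else src
        if cell.get("cell_type") == "code":
            code.append(text)
        elif cell.get("cell_type") == "markdown":
            markdown.append(text)
    return code, markdown
```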
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
Dataset URL - here
Does the dataset exist in a scraped format? No
Discourse is a self-hosted platform for communities to create discussions around a particular topic. A Discourse site includes threads of posts and an ecosystem for discussing that topic.
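Discourse forums expose a JSON API by appending .json to most page URLs, which should make scraping straightforward. A minimal sketch of fetching one topic, assuming the standard endpoint shape (only the first page of posts is returned; rate limits vary per forum):

```python
import requests

def fetch_topic(base_url, topic_id):
    """Fetch a topic and its posts from a Discourse forum."""
    resp = requests.get(f"{base_url}/t/{topic_id}.json")
    resp.raise_for_status()
    topic = resp.json()
    # 'post_stream' holds the thread's posts in order; 'cooked' is rendered HTML
    posts = topic["post_stream"]["posts"]
    return [{"author": p["username"], "text": p["cooked"]} for p in posts]

# Usage (hypothetical forum/topic): fetch_topic("https://discuss.python.org", 12345)
```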
This issue focuses on collecting ideas and formalizing the postprocessing steps and formatting of data instances for datasets in different categories, e.g., forums, articles, books, etc.
Initial draft of postprocessing:
Questions for formatting:
Dataset URL - LinusTechTip
Does the dataset exist in a scraped format? No
This is a well-known programming forum; a quick scan shows more than 10,000 topics dating back to 2013.
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
Dataset URL - All the site dumps can be obtained from the data dump here
Does the dataset exist in a scraped format?
Stack Exchange is a network of question-and-answer websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.
The Entire Stack Exchange Dump.
lm_dataformat
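As a minimal sketch of the processing step, the following streams questions out of a site dump's Posts.xml and writes them with lm_dataformat; attribute names follow the standard dump schema (PostTypeId 1 = question, 2 = answer):

```python
import xml.etree.ElementTree as ET
from lm_dataformat import Archive

def convert_posts(posts_xml_path, output_dir):
    ar = Archive(output_dir)
    # iterparse streams the file so multi-GB dumps fit in memory
    for _, row in ET.iterparse(posts_xml_path, events=("end",)):
        if row.tag != "row":
            continue
        attrs = row.attrib
        if attrs.get("PostTypeId") == "1":  # question
            text = attrs.get("Title", "") + "\n" + attrs.get("Body", "")
            meta = {"id": attrs["Id"], "score": attrs.get("Score"),
                    "tags": attrs.get("Tags", "")}
            ar.add_data(text, meta=meta)
        # answers (PostTypeId 2) could be joined to questions via ParentId
        row.clear()  # free memory as we go
    ar.commit()
```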
Processing
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here
Only for HF datasets repository
GitHub Issues are bug reports, feature requests, and discussions related to a repository. They contain text in GitHub-flavored Markdown, along with comments and reactions.
We can use the procedure discussed in this blog post, which outlines how to do it for a single repository; we just need to apply the same procedure across multiple repositories (a sketch follows the example table below).
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
issue_post | comments | authors | reactions |
---|---|---|---|
issue_text | [comment_1, comment_2, ...] | [issue_author, comment_1_author, comment_2_author, ...] | [[reactions], [reactions], ...] |
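A minimal sketch of that collection loop, assuming only the public REST endpoints (pagination and rate-limit handling omitted; an auth token raises the limits; the repo list is hypothetical):

```python
import requests

API = "https://api.github.com"
HEADERS = {"Accept": "application/vnd.github+json"}  # add an Authorization header for higher rate limits

def fetch_issues(repo):
    """Yield (issue, comments) pairs for one 'owner/name' repository."""
    resp = requests.get(f"{API}/repos/{repo}/issues",
                        headers=HEADERS,
                        params={"state": "all", "per_page": 100})
    resp.raise_for_status()
    for issue in resp.json():
        if "pull_request" in issue:  # the issues endpoint also returns PRs
            continue
        comments = requests.get(issue["comments_url"], headers=HEADERS).json()
        yield issue, comments

rows = []
for repo in ["huggingface/datasets"]:  # hypothetical repo list
    for issue, comments in fetch_issues(repo):
        rows.append({
            "issue_post": issue["body"],
            "comments": [c["body"] for c in comments],
            "authors": [issue["user"]["login"]] + [c["user"]["login"] for c in comments],
        })
```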
Dataset URL - here
Does the dataset exist in a scraped format?
No
The Google Code Archive contains the data found on the Google Code Project Hosting Service, which was turned down in early 2016.
This archive contains over 1.4 million projects, 1.5 million downloads, and 12.6 million issues. You can learn more about the data served from Google Cloud Storage here.
Google Code offered open-source project hosting on other domains besides just code.google.com, too.
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
Dataset URL - awesome list of programming subreddits and the Code Pile Spreadsheet
Another list of programming subreddits Thanks to @ncoop57!
Does the dataset exist in a scraped format?
No, we need to format them into a dialogue format.
Obtain data from Pushshift Reddit using wget/HTTP requests from 2009-2022 and filter for programming-related subreddits.
{"text": string, "meta": obj}
lm_format
script
text
[Context]:
"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
using deep learning with SGD to design the learning algorithms of another deep network *
Extra Contexts:
[context/2]:
Could someone there post a summary of the insightful moments.
[context/1]:
Basically L2L is the new deep learning.
[context/0]:
What's "L2L" mean?
Other features:
[context_author]:
goodside
[response_author]:
NetOrBrain
[subreddit]:
MachineLearning
[thread_id]:
5h6yvl
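A minimal sketch of producing the layout above from a Pushshift comment chain; the input dicts and their field names (body, author, subreddit, link_id) are assumptions about the Pushshift schema:

```python
def format_thread(chain, response):
    """chain: ancestor comments, oldest first, ending with the immediate context;
    response: the final comment treated as the target."""
    context = chain[-1]
    lines = ["[Context]:", context["body"], "[Response]:", response["body"]]
    extras = chain[:-1]
    if extras:
        lines.append("Extra Contexts:")
        for i, comment in enumerate(extras):  # oldest ancestor gets the highest label
            lines += [f"[context/{len(extras) - 1 - i}]:", comment["body"]]
    lines += ["Other features:",
              "[context_author]:", context["author"],
              "[response_author]:", response["author"],
              "[subreddit]:", response["subreddit"],
              "[thread_id]:", response["link_id"].split("_")[-1]]  # 't3_5h6yvl' -> '5h6yvl'
    return {"text": "\n".join(lines),
            "meta": {"subreddit": response["subreddit"],
                     "thread_id": response["link_id"]}}
```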
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here
We got 1,261,420 repos from Bitbucket that we can download. The data includes the following fields for each repo: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent']
MIT-0
MIT
MIT-feh
Apache-2.0
BSD-3-Clause
BSD-3-Clause-Clear
BSD-3-Clause-No-Nuclear-License-2014
BSD-2-Clause
CC0-1.0
EPL-1.0
MPL-2.0
Unlicense
ISC
Artistic-2.0
deprecated_LGPL-3.0+
deprecated_LGPL-2.1+
ECL-2.0
SHL-0.51
MPL-2.0-no-copyleft-exception
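A minimal sketch of walking the paginated public-repository listing, assuming the Bitbucket 2.0 API shape; note that (as discussed below) the listing is not filterable by license, so any license filtering has to happen after download:

```python
import requests

def iter_public_repos(language=None):
    """Follow the 'next' links of the paginated /2.0/repositories listing."""
    url = "https://api.bitbucket.org/2.0/repositories?pagelen=100"
    while url:
        page = requests.get(url).json()
        for repo in page.get("values", []):
            if language is None or repo.get("language") == language:
                yield {k: repo.get(k) for k in
                       ("full_name", "language", "size", "updated_on", "mainbranch")}
        url = page.get("next")  # absent on the last page
```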
Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
col1 | col2 | .... |
---|---|---|
row1 | row1 | .... |
We should follow a process similar to the BigScience workshop's dataset processing. They include many tools ready for us to use, such as data deduplication (both exact-match and near-duplicate), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.
They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling
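For instance, the exact-match half of deduplication can be a single hash pass; a minimal sketch (near-duplicate detection would need MinHash/LSH on top, as in the BigScience tooling):

```python
import hashlib

def exact_dedup(documents):
    """Drop byte-identical documents, keeping the first occurrence."""
    seen = set()
    for doc in documents:
        # hash a normalized form so trailing whitespace doesn't defeat the match
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc
```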
Here is an initial set of tasks to perform:
Dataset URL - UsenetArchives and InternetArchive
Does the dataset exist in a scraped format? No
Dataset URL - here
Does the dataset exist in a scraped format? No
URL if Yes - here
GitLab is like GitHub, but its data is not available in BigQuery.
Dataset URL - here
Does the dataset exist in a scraped format? No
URL if Yes - here
Texas Instruments Forum. Likely we only care about the Microcontroller, Processor, and Tools subforums.
Dataset URL - Computing Wikibooks. We can download the dump here and filter for computing wikibooks.
Free Computing Books -- not sure if the books on here are safe to use; we need to check.
Does the dataset exist in a scraped format? Yes if the book is an HTML page/website; no if the book is a PDF.
Books contain rich information and present accumulations of knowledge on specific topics. They can also be home to exercises and solutions. If a model is pretrained on them, it could perhaps enhance its chain-of-thought capabilities.
Dataset URL -
Does the dataset exist in a scraped format? No
In general.
(Almost) every programmer uses a programming language, and huge swathes of programming are organized around these languages
Most of these languages have some kind of package manager
This package manager usually has download statistics
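As an illustration from the Python ecosystem, a sketch of pulling download counts; the pypistats.org endpoint and response shape are assumptions to verify, and other ecosystems have analogous services:

```python
import requests

def recent_downloads(package):
    """Fetch recent PyPI download counts from pypistats.org."""
    resp = requests.get(f"https://pypistats.org/api/packages/{package}/recent")
    resp.raise_for_status()
    # expected shape: {'data': {'last_day': ..., 'last_week': ..., 'last_month': ...}, ...}
    return resp.json()["data"]

# Usage: rank candidate packages by recent_downloads(pkg)["last_month"]
```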
Dataset URL - here
Does the dataset exist in a scraped format? No
Zulip is a real-time chat application, available self-hosted or cloud-based, for discussions among various communities.
There are many CS and SE communities that use Zulip for discussions, such as Coq.
Dataset URL - here
Does the dataset exist in a scraped format? No
URL if Yes - here
LeetCode contains many computer science programming questions, with a rich community that shares solutions and discusses the problems.
Questions:
Let's use this issue to discuss this topic.
Resources:
Dataset URL - None
Does the dataset exist in a scraped format? No (searched using Google, Papers with Code, and Kaggle).
Bitbucket is far less popular for open-source git repos, but it does have them, and it provides an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate with the number of watchers or contributors. Repos can also be filtered by language. They do not appear to be filterable by license.
With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list with this metric.
Start pulling commit diffs from the highest priority repos. Docs
Dataset is on BigQuery as a table of commit hashes and messages.
From commit hash and message, produce dict containing:
This requires, for each commit, downloading the files after the changes and applying the reverse patch to obtain the files before the changes.
We also need to decide on a suitable length threshold to filter on, since we need to fit most or all of the before-file in the context window, which significantly restricts the number of lines.
Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
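A minimal sketch of recovering a before/after pair for one file; rather than reverse-applying the patch by hand, it asks git for the parent's version directly (assumes a local clone and a single-parent commit; files added in the commit have no parent version and will raise):

```python
import subprocess

def _git(repo_dir, *args):
    """Run a git command in repo_dir and return its stdout."""
    return subprocess.run(["git", "-C", repo_dir, *args],
                          capture_output=True, text=True, check=True).stdout

def before_after(repo_dir, commit, path):
    """Return (before_file, after_file, commit_message) for one changed file."""
    before = _git(repo_dir, "show", f"{commit}~1:{path}")  # parent's version
    after = _git(repo_dir, "show", f"{commit}:{path}")
    message = _git(repo_dir, "show", "-s", "--format=%B", commit)
    return before, after, message.strip()
```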
dataset.py
base classes
Give an example of the columns and data:
before_file | commit_message | diff |
---|---|---|
['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ] | Change version | [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}] |
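The actual base classes live in dataset.py; purely as a hypothetical illustration of the shape such an interface might take (all names below are invented, not the repo's API):

```python
from abc import ABC, abstractmethod

class CodePileDataset(ABC):
    """Hypothetical interface: each source implements download + rows,
    and the pipeline turns parsed rows into the final LM format."""

    @abstractmethod
    def download(self, cache_dir: str) -> None:
        """Fetch raw data (dumps, API pages, ...) into cache_dir."""

    @abstractmethod
    def rows(self):
        """Yield dicts matching the dataset's documented columns."""

    def to_lm_format(self):
        # split each row into the text payload and its metadata
        for row in self.rows():
            yield {"text": row.pop("text"), "meta": row}
```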