techthiyanes / code-pile

This project forked from carperai/code-pile

This repository contains all the code for collecting large scale amounts of code from GitHub.

License: MIT License

Python 100.00%

Code-Pile

This repository contains the processing scripts to scrape/process the code-pile dataset.

Table of Contents

  • Project Description
  • How to use the Code-Pile (todo)
  • How to Contribute
  • Additional Resources

Project Description

Check out the Code-Pile proposal.

The Code-Pile will be released in the same form as The Pile: a folder of .jsonl.zst files (see lm-dataformat).
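
As a rough illustration of that release format, the sketch below writes and reads one .jsonl.zst shard. It assumes each line is a JSON object with a "text" field and a "meta" field, which is the layout lm-dataformat is understood to use, and it relies on the third-party zstandard package for compression. The function names are made up for this example and are not part of this repository.

```python
import json


def encode_jsonl(records):
    """Serialize (text, meta) pairs as UTF-8 JSON Lines, one document per line."""
    lines = [json.dumps({"text": text, "meta": meta}) for text, meta in records]
    return ("\n".join(lines) + "\n").encode("utf-8")


def decode_jsonl(payload):
    """Yield (text, meta) pairs from UTF-8 JSON Lines bytes."""
    for line in payload.decode("utf-8").split("\n"):
        if line.strip():
            doc = json.loads(line)
            yield doc["text"], doc.get("meta", {})


def write_shard(path, records):
    """Write one zstd-compressed shard (assumes `pip install zstandard`)."""
    import zstandard  # third-party dependency, imported lazily

    compressed = zstandard.ZstdCompressor(level=3).compress(encode_jsonl(records))
    with open(path, "wb") as fh:
        fh.write(compressed)


def read_shard(path):
    """Read one zstd-compressed shard back into (text, meta) pairs."""
    import zstandard  # third-party dependency, imported lazily

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from decode_jsonl(reader.read())
```

For real shards, the lm-dataformat package itself is probably the better choice; this sketch is only meant to show what is inside the files.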

How to use the Code-Pile

The dataset is not finished yet; ask on Discord.

How to Contribute

Think about which code data would be most useful for the next generation of text-based code models.

The most valuable dataset properties (use your own judgment) are:

  1. Open License
  2. Data quality
  3. Dataset size
  4. Data variance/variety/nicheness
  5. Ease of obtaining/processing

To add a new dataset, open an issue using the dataset-request template, gather all the relevant information in it, and use the issue to track progress.

Check whether code already exists or someone is already working on the dataset (see Additional Resources):

  1. EleutherAI's Pile V1 repos
  2. Ask on Carper #code-pile
  3. Ask on EleutherAI
  4. Consult the linked Spreadsheets below

Then implement it through the following steps:

  1. Fork this repo
  2. Use the working branch
  3. Read the shared classes in datasets.py and codepile.py
  4. Create an MVP/example for your dataset
  5. Create a pull request
  6. Keep building the data-domain specific classes and repeat
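
As a loose sketch of what an MVP for a new dataset might look like, the example below defines a small base class and one toy subclass. Every name here (DatasetInfo, BaseDataset, ToyDataset, download, process) is hypothetical and chosen for illustration; the actual shared classes live in datasets.py and codepile.py and should be read first, since their interfaces may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass
class DatasetInfo:
    """Hypothetical metadata record; the field names are illustrative only."""
    name: str
    license: str
    source_url: str


class BaseDataset(ABC):
    """Hypothetical stand-in for the shared base classes, not the repo's API."""

    def __init__(self, info: DatasetInfo) -> None:
        self.info = info

    @abstractmethod
    def download(self) -> Iterator[str]:
        """Yield raw documents from the data source."""

    def process(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Strip each raw document, skip empty ones, and attach metadata."""
        for raw in self.download():
            text = raw.strip()
            if text:
                yield text, {"source": self.info.name, "license": self.info.license}


class ToyDataset(BaseDataset):
    """Minimal example subclass that 'downloads' a few hard-coded snippets."""

    def download(self) -> Iterator[str]:
        yield "print('hello')\n"
        yield "   \n"  # whitespace-only document, dropped by process()
        yield "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    dataset = ToyDataset(DatasetInfo("toy", "MIT", "https://example.com"))
    for text, meta in dataset.process():
        print(meta["source"], len(text))
```

Keeping the download and process steps separate makes it easier to start with a tiny hard-coded sample and swap in the real scraper later.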

Citation Placeholder:

@misc{Code-Pile,
  author = {},
  doi = {},
  month = {},
  title = {},
  url = {https://github.com/CarperAI/Code-Pile},
  version = {},
  year = {2022}
}

Additional Resources

Closely related projects:

Previous work:

Contributors

ncoop57, reshinthadithyan
