
License: Apache License 2.0

Languages: Jupyter Notebook 92.29%, Python 6.72%, Shell 0.99%
Topics: dataset, type-inference, machine-learning, python, benchmark, manytypes4py, type-annotations, visible-type-hints, mt4py, msr

many-types-4-py-dataset's Introduction

ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference


Intro

  • It has clean and complete versions (from v0.7):
    • The clean version has 5.1K type-checked Python repositories and 1.2M type annotations.
    • The complete version has 5.2K Python repositories and 3.3M type annotations.
  • Its source files are type-checked using mypy (clean version).
  • Its projects were processed in JSON-formatted files using the LibSA4Py pipeline.
  • Its source files were already split into training, validation, and test sets for training ML models.
  • It is de-duplicated using CD4Py.
  • It contains Visible Type Hints (VTHs), which are extracted by a deep, recursive, and dynamic analysis of types from the import statements of source files and their dependencies.
  • It is published in the Data Showcase of the MSR'21 conference.

Downloading dataset

The latest version of the dataset is publicly available on Zenodo.
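
If you prefer scripting the download, the minimal sketch below fetches the archive with Python's standard library and unpacks it. The record URL and archive name are placeholders; replace them with the actual file link shown on the Zenodo record page.

    # Minimal download-and-extract sketch; the URL below is a placeholder,
    # not the real Zenodo file link.
    import tarfile
    import urllib.request

    DATASET_URL = "https://zenodo.org/record/<RECORD_ID>/files/ManyTypes4PyDataset.tar.gz"  # placeholder
    ARCHIVE = "ManyTypes4PyDataset.tar.gz"

    urllib.request.urlretrieve(DATASET_URL, ARCHIVE)        # download the archive
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        tar.extractall("ManyTypes4PyDataset")               # unpack into a local folder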

Dataset preparation

We highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.

Requirements

  • Python 3.5 or newer
  • Python dependencies from scripts/requirements.txt installed (run pip install -r scripts/requirements.txt)
  • The libsa4py package installed (run git clone https://github.com/saltudelft/libsa4py.git && cd libsa4py && pip install .)

Steps

  1. Clone the dataset's source repositories:

    python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos
    
  2. To reset the cloned repositories to the commits recorded in ManyTypes4PyDataset.spec, run the following command:

    ./scripts/reset_commits.sh  ./ManyTypes4PyDataset.spec repos
    
  3. Detect duplicate files in the dataset using cd4py:

    cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024
    
  4. Gather the duplicate files from the cd4py output and write them to a single text file (using collect_dupes.py):

    python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt
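
   Before collecting the duplicates, you can sanity-check the cd4py output with a small script such as the sketch below. It only assumes the standard JSON Lines convention implied by the .jsonl.gz name (one JSON record per line) and does not rely on cd4py's record schema.

    # Peek at the gzipped JSON Lines output of cd4py (schema left opaque).
    import gzip
    import json

    with gzip.open("manytypes4py_dataset_duplicates.jsonl.gz", "rt") as f:
        records = [json.loads(line) for line in f if line.strip()]

    print(f"{len(records)} records")
    if records:
        print(records[0])    # inspect the first record to see the actual schema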
    
  5. Create a copy of the dataset with the previously collected duplicate files removed (using process_dataset.py):

    python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]
    
  6. Split the dataset into training, validation, and test sets (using split_dataset.py):

    python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv
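
   To verify the split before running the pipeline, a quick look at the CSV is enough; the sketch below prints a few rows and makes no assumption about the column layout beyond it being a plain CSV file.

    # Print the first few rows of the split file to see how files are assigned
    # to the training, validation, and test sets.
    import csv

    with open("manytypes4py_split.csv", newline="") as f:
        rows = list(csv.reader(f))

    print(f"{len(rows)} rows")
    for row in rows[:5]:
        print(row)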
    
  7. To process the Python repositories and produce JSON output files, run the libsa4py pipeline as follows:

    libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]
    

    Check out the libsa4py README for more info on its usage.
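
     After step 7, each processed project is stored as a JSON file. The hedged sketch below simply loads one output file and lists its top-level keys; the directory name is a placeholder for [processed projects path], and the actual output layout and schema are documented in the libsa4py README.

    # Load one libsa4py output file and list its top-level keys.
    import json
    from pathlib import Path

    processed_dir = Path("processed_projects")        # placeholder for [processed projects path]
    sample = next(processed_dir.rglob("*.json"))       # pick any JSON output file
    with open(sample) as f:
        project = json.load(f)

    print(sample)
    print(list(project)[:10])                          # top-level keys (assuming a JSON object)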

  8. Create a tarball of the full dataset and all artifacts (gathered in one folder):

    tar -czvf [output path] [dataset artifacts path]
    

Citing the dataset

If you use the dataset in your research, please consider citing it:

@inproceedings{mt4py2021,
  author    = {A. M. Mir and E. Latoskinas and G. Gousios},
  booktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
  title     = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},
  year      = {2021},
  pages     = {585-589},
  doi       = {10.1109/MSR52588.2021.00079},
  publisher = {IEEE Computer Society},
  month     = {May}
}

Roadmap

  • Gather Python projects that depend on type checkers other than mypy, i.e., pyre, pytype, and pyright.
  • Apply type annotations from typeshed to the dataset.

many-types-4-py-dataset's People

Contributors

elatoskinas, mir-am


many-types-4-py-dataset's Issues

process_dataset.py bug

Hi,
I think there might be a bug in the process_dataset.py script.
According to your README, the [copy target] should be the folder with duplicates removed.
In fact, the script removes the duplicate files from the input dataset instead of from the output copy.

Where are the human annotations?

Hello, I am trying to use the tool to predict types on this dataset, but I need the human annotations. Where are the human annotations in the dataset? You did not upload a ground-truth file; could you please upload it?
Thank you
