A curated list of awesome research papers, datasets and software projects devoted to machine learning and source code. #MLonCode
- Digests
- Conferences
- Competitions
- Papers
- Program Synthesis and Induction
- Source Code Analysis and Language modeling
- Neural Network Architectures and Algorithms
- Embeddings in Software Engineering
- Program Translation
- Code Suggestion and Completion
- Program Repair and Bug Detection
- APIs and Code Mining
- Code Optimization
- Topic Modeling
- Sentiment Analysis
- Code Summarization
- Clone Detection
- Differentiable Interpreters
- Related research
(links require "Related research" spoiler to be open)
- Posts
- Talks
- Software
- Datasets
- Credits
- Contributions
- License
- 2018 IEEE 25th International Conference on Software Analysis, Evolution, and Reengineering (SANER)
- Machine Learning for Programming
- Workshop on NLP for Software Engineering
- SysML
- Mining Software Repositories
- AIFORSE
- source{d} tech talks
- NIPS Neural Abstract Machines and Program Induction workshop
- NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System - Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, 2018.
- Recent Advances in Neural Program Synthesis - Neel Kant, 2018.
- Neural Sketch Learning for Conditional Program Generation - Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine, 2018.
- Neural Program Search: Solving Programming Tasks from Description and Examples - Illia Polosukhin, Alexander Skidanov, 2018.
- Neural Program Synthesis with Priority Queue Training - Daniel A. Abolafia, Mohammad Norouzi, Quoc V. Le, 2018.
- Towards Synthesizing Complex Programs from Input-Output Examples - Xinyun Chen, Chang Liu, Dawn Song, 2018.
- SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning - Xiaojun Xu, Chang Liu, Dawn Song, 2017.
- Learning to Select Examples for Program Synthesis - Yewen Pu, Zachery Miranda, Armando Solar-Lezama, Leslie Pack Kaelbling, 2017.
- Neural Program Meta-Induction - Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew Hausknecht, Pushmeet Kohli, 2017.
- Glass-Box Program Synthesis: A Machine Learning Approach - Konstantina Christakopoulou, Adam Tauman Kalai, 2017.
- Learning to Infer Graphics Programs from Hand-Drawn Images - Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Joshua B. Tenenbaum, 2017.
- Neural Attribute Machines for Program Generation - Matthew Amodio, Swarat Chaudhuri, Thomas Reps, 2017.
- Abstract Syntax Networks for Code Generation and Semantic Parsing - Maxim Rabinovich, Mitchell Stern, Dan Klein, 2017.
- Making Neural Programming Architectures Generalize via Recursion - Jonathon Cai, Richard Shin, Dawn Song, 2017.
- A Syntactic Neural Model for General-Purpose Code Generation - Pengcheng Yin, Graham Neubig, 2017.
- Program Synthesis from Natural Language Using Recurrent Neural Networks - Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Luke Zettlemoyer, Michael Ernst, 2017.
- RobustFill: Neural Program Learning under Noisy I/O - Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, Pushmeet Kohli, 2017.
- Lifelong Perceptual Programming By Example - Gaunt, Alexander L., Marc Brockschmidt, Nate Kushman, and Daniel Tarlow, 2017.
- Neural Programming by Example - Chengxun Shu, Hongyu Zhang, 2017.
- DeepCoder: Learning to Write Programs - Balog, Matej, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow, 2017.
- A Differentiable Approach to Inductive Logic Programming - Yang, Fan, Zhilin Yang, and William W. Cohen, 2017.
- Latent Attention For If-Then Program Synthesis - Xinyun Chen, Chang Liu, Richard Shin, Dawn Song, Mingcheng Chen, 2016.
- Latent Predictor Networks for Code Generation - Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom, 2016.
- Meta-Interpretive Learning of Efficient Logic Programs - Cropper, Andrew, and Stephen H. Muggleton, 2016.
- Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version) - Liang, Chen, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao, 2016.
- Programs as Black-Box Explanations - Singh, Sameer, Marco Tulio Ribeiro, and Carlos Guestrin, 2016.
- Structured Generative Models of Natural Source Code - Chris J. Maddison, Daniel Tarlow, 2014.
- Syntax and Sensibility: Using language models to detect and correct syntax errors - Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral, 2018.
- code2vec: Learning Distributed Representations of Code - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, 2018.
- A Survey of Machine Learning for Big Code and Naturalness - Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2017.
- Learning to Represent Programs with Graphs - Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi, 2017.
- A deep language model for software code - Hoa Khanh Dam, Truyen Tran, Trang Pham, 2016.
- Suggesting Accurate Method and Class Names - Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton, 2015.
- Mining Source Code Repositories at Massive Scale using Language Modeling - Miltiadis Allamanis, Charles Sutton, 2013.
- A General Path-Based Representation for Predicting Program Properties - Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav, 2018.
- Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks - Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu, 2017.
- Syntax-Directed Variational Autoencoder for Structured Data - Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, Le Song, 2018.
- Divide and Conquer with Neural Networks - Nowak, Alex, and Joan Bruna, 2017.
- Learning Efficient Algorithms with Hierarchical Attentive Memory - Andrychowicz, Marcin, and Karol Kurach, 2016.
- Learning Operations on a Stack with Neural Turing Machines - Deleu, Tristan, and Joseph Dureau, 2016.
- Probabilistic Neural Programs - Murray, Kenton W., and Jayant Krishnamurthy, 2016.
- Learning Latent Multiscale Structure Using Recurrent Neural Networks - Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio, 2016.
- Neural Programmer: Inducing Latent Programs with Gradient Descent - Neelakantan, Arvind, Quoc V. Le, and Ilya Sutskever, 2016.
- Neural Programmer-Interpreters - Reed, Scott, and Nando de Freitas, 2016.
- Neural GPUs Learn Algorithms - Kaiser, Łukasz, and Ilya Sutskever, 2016.
- Neural Random-Access Machines - Karol Kurach, Marcin Andrychowicz, Ilya Sutskever, 2016.
- Learning to Execute - Wojciech Zaremba, Ilya Sutskever, 2015.
- Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets - Joulin, Armand, and Tomas Mikolov, 2015.
- Neural Turing Machines - Graves, Alex, Greg Wayne, and Ivo Danihelka, 2014.
- From Machine Learning to Machine Reasoning - Bottou, Leon, 2011.
- Word Embeddings for the Software Engineering Domain - Vasiliki Efstathiou, Christos Chatzilenas, Diomidis Spinellis, 2018.
- Document Distance Estimation via Code Graph Embedding - Zeqi Lin, Junfeng Zhao, Yanzhen Zou, Bing Xie, 2017.
- Combining Word2Vec with revised vector space model for better code retrieval - Thanh Van Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Trong Duc Nguyen, Tien N. Nguyen, 2017.
- From word embeddings to document similarities for improved information retrieval in software engineering - Xin Ye, Hui Shen, Xiao Ma, Razvan Bunescu, Chang Liu, 2016.
- Mapping API Elements for Code Migration with Vector Representation - Trong Duc Nguyen, Anh Tuan Nguyen, Tien N. Nguyen, 2016.
- Tree-to-tree Neural Networks for Program Translation - Xinyun Chen, Chang Liu, Dawn Song, 2018.
- Code Attention: Translating Code to Comments by Exploiting Domain Features - Wenhao Zheng, Hong-Yu Zhou, Ming Li, Jianxin Wu, 2017.
- Automatically Generating Commit Messages from Diffs using Neural Machine Translation - Siyuan Jiang, Ameer Armaly, Collin McMillan, 2017.
- A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation - Antonio Valerio Miceli Barone, Rico Sennrich, 2017.
- A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes - Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo, 2017.
- Code Completion with Neural Attention and Pointer Networks - Jian Li, Yue Wang, Irwin King, Michael R. Lyu, 2017.
- Learning Python Code Suggestion with a Sparse Pointer Network - Avishkar Bhoopchand, Tim Rocktäschel, Earl Barr, Sebastian Riedel, 2016.
- Code Completion with Statistical Language Models - Veselin Raychev, Martin Vechev, Eran Yahav, 2014.
- Learning to Repair Software Vulnerabilities with Generative Adversarial Networks - Jacob A. Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin, 2018.
- Dynamic Neural Program Embedding for Program Repair - Ke Wang, Rishabh Singh, Zhendong Su, 2018.
- To Type or Not to Type: Quantifying Detectable Bugs in JavaScript - Zheng Gao, Christian Bird, Earl Barr, 2017.
- Semantic Code Repair using Neuro-Symbolic Transformation Networks - Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli, 2017.
- Automated Identification of Security Issues from Commit Messages and Bug Reports - Yaqin Zhou and Asankhaya Sharma, 2017.
- SmartPaste: Learning to Adapt Source Code - Miltiadis Allamanis, Marc Brockschmidt, 2017.
- End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks - Min-je Choi, Sehun Jeong, Hakjoo Oh, Jaegul Choo, 2017.
- Tailored Mutants Fit Bugs Better - Miltiadis Allamanis, Earl T. Barr, René Just, Charles Sutton, 2016.
- Estimating defectiveness of source code: A predictive model using GitHub content - Ritu Kapur, Balwinder Sodhi, 2018.
- Automated software vulnerability detection with machine learning - Jacob A. Harer, Louis Y. Kim, Rebecca L. Russell, Onur Ozdemir, Leonard R. Kosta, Akshay Rangamani, Lei H. Hamilton, Gabriel I. Centeno, Jonathan R. Key, Paul M. Ellingwood, Marc W. McConley, Jeffrey M. Opper, Peter Chin, Tomo Lazovich, 2018.
- DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, 2017.
- Deep API Learning - Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim, 2017.
- API usage pattern recommendation for software development - Haoran Niu, Iman Keivanloo, Ying Zou, 2017.
- Exploring API Embedding for API Usages and Applications - Nguyen, Nguyen, Phan and Nguyen, 2017.
- Parameter-Free Probabilistic API Mining across GitHub - Jaroslav Fowkes, Charles Sutton, 2016.
- A Subsequence Interleaving Model for Sequential Pattern Mining - Jaroslav Fowkes, Charles Sutton, 2016.
- Lean GHTorrent: GitHub data on demand - Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, Andy Zaidman, 2014.
- Mining idioms from source code - Miltiadis Allamanis, Charles Sutton, 2014.
- The GHTorrent Dataset and Tool Suite - Georgios Gousios, 2013.
- The Case for Learned Index Structures - Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis, 2017.
- Learning to superoptimize programs - Rudy Bunel, Alban Desmaison, M. Pawan Kumar, Philip H.S. Torr, Pushmeet Kohli, 2017.
- Neural Nets Can Learn Function Type Signatures From Binaries - Zheng Leong Chua, Shiqi Shen, Prateek Saxena, and Zhenkai Liang, 2017.
- Adaptive Neural Compilation - Rudy Bunel, Alban Desmaison, Pushmeet Kohli, Philip H.S. Torr, M. Pawan Kumar, 2016.
- Learning to Superoptimize Programs - Workshop Version - Bunel, Rudy, Alban Desmaison, M. Pawan Kumar, Philip H. S. Torr, and Pushmeet Kohli, 2016.
- Topic modeling of public repositories at scale using names in source code - Vadim Markovtsev, Eiso Kant, 2017.
- Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code - Miltiadis Allamanis, Charles Sutton, 2013.
- Semantic clustering: Identifying topics in source code - Adrian Kuhn, Stéphane Ducasse, Tudor Girba, 2007.
- A Benchmark Study on Sentiment Analysis for Software Engineering Research - Nicole Novielli, Daniela Girardi, Filippo Lanubile, 2018.
- Sentiment Analysis for Software Engineering: How Far Can We Go? - Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, Rocco Oliveto, 2018.
- Leveraging Automated Sentiment Analysis in Software Engineering - Md Rakibul Islam, Minhaz F. Zibran, 2017.
- Sentiment Polarity Detection for Software Development - Fabio Calefato, Filippo Lanubile, Federico Maiorano, Nicole Novielli, 2017.
- SentiCR: A Customized Sentiment Analysis Tool for Code Review Interactions - Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, Shahram Rahimi, 2017.
- A Convolutional Attention Network for Extreme Summarization of Source Code - Miltiadis Allamanis, Hao Peng, Charles Sutton, 2016.
- TASSAL: Autofolding for Source Code Summarization - Jaroslav Fowkes, Pankajan Chanthirasegaran, Razvan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton, 2016.
- Summarizing Source Code using a Neural Attention Model - Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer, 2016.
- DéjàVu: a map of code duplicates on GitHub - Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, Jan Vitek, 2017.
- Some from Here, Some from There: Cross-project Code Reuse in GitHub - Mohammad Gharehyazie, Baishakhi Ray, Vladimir Filkov, 2017.
- Deep Learning Code Fragments for Code Clone Detection - Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk, 2016.
- A study of repetitiveness of code changes in software evolution - HA Nguyen, AT Nguyen, TT Nguyen, TN Nguyen, H Rajan, 2013.
- DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer - Joseph Suarez, Justin Johnson, Fei-Fei Li, 2018.
- Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction - Da Xiao, Jo-Yu Liao, Xingyuan Yuan, 2018.
- Differentiable Programs with Neural Libraries - Alexander L. Gaunt, Marc Brockschmidt, Nate Kushman, Daniel Tarlow, 2017.
- Differentiable Functional Program Interpreters - John K. Feser, Marc Brockschmidt, Alexander L. Gaunt, Daniel Tarlow, 2017.
- Programming with a Differentiable Forth Interpreter - Bošnjak, Matko, Tim Rocktäschel, Jason Naradowsky, and Sebastian Riedel, 2017.
- Neural Functional Programming - Feser, John K., Marc Brockschmidt, Alexander L. Gaunt, and Daniel Tarlow, 2017.
- TerpreT: A Probabilistic Programming Language for Program Induction - Gaunt, Alexander L., Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow, 2016.
Related research
- Clustering Binary Data with Bernoulli Mixture Models - Neal S. Grantham
- A Family of Blockwise One-Factor Distributions for Modelling High-Dimensional Binary Data - Matthieu Marbac and Mohammed Sedki
- BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data - Panagiotis Papastamoulis and Magnus Rattray
- Robust mixture modelling using the t distribution - D. Peel and G. J. McLachlan
- Robust mixture modeling using the skew t distribution - Tsung I. Lin, Jack C. Lee and Wan J. Hsieh
- Learning from Source Code
- Training a Model to Summarize Github Issues
- Sequence Intent Classification Using Hierarchical Attention Networks
- Syntax-Directed Variational Autoencoder for Structured Data
- Weighted MinHash on GPU helps to find duplicate GitHub repositories.
- Source Code Identifier Embeddings
- Using recurrent neural networks to predict next tokens in the Java solutions
- The half-life of code & the ship of Theseus
- The eigenvector of "Why we moved from language X to language Y"
- Analyzing Github, How Developers Change Programming Languages Over Time
- Topic Modeling of GitHub Repositories
- Machine Learning on Source Code
- Similarity of GitHub Repositories by Source Code Identifiers
- Using deep RNN to model source code
- Source code abstracts classification using CNN (1)
- Source code abstracts classification using CNN (2)
- Source code abstracts classification using CNN (3)
- Embedding the GitHub contribution graph
- Measuring code sentiment in a Git repository
- Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
- sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
- vecino - Finds similar Git repositories.
- apollo - Source code deduplication at scale, research.
- gemini - Source code deduplication at scale, production.
- enry - Insanely fast file based programming language detector.
- hercules - Git repository mining framework with batteries on top of go-git.
- Code Neuron - Recurrent neural network to detect code blocks in natural language text.
- Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
- Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
- Summarizing Source Code using a Neural Attention Model - CODE-NN uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Torch over C#/SQL.
- Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
- Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
- TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
- JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
- Clone Digger - Clone detection for Python and Java.
- go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
- bblfsh - Self-hosted server for source code parsing.
- engine - Scalable and distributed data retrieval pipeline for source code.
- minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
- kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
- wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
- Public Git Archive - 3 TB of Git repositories from GitHub.
- StackOverflow Question-Code Dataset - ~148K Python and ~120K SQL question-code pairs mined from StackOverflow.
- GitHub Issue Titles and Descriptions for NLP Analysis - ~8 million GitHub issue titles and descriptions from 2017.
- GitHub repositories - languages distribution - Programming languages distribution in 14,000,000 repositories on GitHub (October 2016).
- 452M commits on GitHub - ≈ 452M commits' metadata from 16M repositories on GitHub (October 2016).
- GitHub readme files - Readme files of all GitHub repositories (16M) (October 2016).
- from language X to Y - Cache file Erik Bernhardsson collected for his awesome blog post.
- GitHub word2vec 120k - Sequences of identifiers extracted from top starred 120,000 GitHub repositories.
- GitHub Source Code Names - Names of source code identifiers (not of people) extracted from 13M GitHub repositories.
- GitHub duplicate repositories - GitHub repositories not marked as forks but very similar to each other.
- GitHub lng keyword frequencies - Programming language keyword frequency extracted from 16M GitHub repositories.
- GitHub Java Corpus - GitHub Java corpus is a set of Java projects collected from GitHub that we have used in a number of our publications. The corpus consists of 14,785 projects and 352,312,696 LOC.
- 150k Python Dataset - Dataset consisting of 150,000 Python ASTs.
- 150k JavaScript Dataset - Dataset consisting of 150,000 JavaScript files and their parsed ASTs.
- card2code - This dataset contains the language to code datasets described in the paper Latent Predictor Networks for Code Generation.
- NL2Bash - This dataset contains a set of ~10,000 bash one-liners collected from websites such as StackOverflow and their English descriptions written by Bash programmers, as described in the paper.
- A lot of references and articles were taken from mast-group.
- Inspired by Awesome Machine Learning.
See CONTRIBUTING.md.