Comments (5)
Hi @Vitvicky and @kambehmw! Thanks for trying out our code.
It takes quite a bit of time to preprocess the data, so we provide pre-augmented datasets, as part of this dataset folder, at https://drive.google.com/file/d/1YfPTPPOv4evldpN-n_4QBDWDWFImv7xO/view?usp=sharing.
However, if you want to modify some of the augmentations or add a new transform, here is an example set of commands we used to generate the augmented, pre-processed dataset:
$ split -l$((`wc -l < javascript_dedupe_definitions_nonoverlap_v2_train.jsonl`/136)) javascript_dedupe_definitions_nonoverlap_v2_train.jsonl javascript_dedupe_definitions_nonoverlap_v2_train.split.jsonl -da 3
$ mkdir -p $OUTDIR
$ find . -name "*split.jsonl*" | parallel --files -I% --max-args 1 -j137 wc -l %
$ find . -name "*split.jsonl*" | parallel --files -I% --max-args 1 -j137 node node_src/transform_jsonl.js % ../augmented/%.augmented
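If you are adapting the sharding step, here is a minimal, self-contained sketch of the same split-then-count logic on a toy file (all file names and the chunk count here are illustrative stand-ins; the real commands above split into ~136 chunks plus a remainder, which is why `-j137` is used for the parallel jobs):

```shell
# Create a stand-in for the real JSONL training file (272 lines).
seq 272 > toy_train.jsonl

# Split into equal chunks, numeric suffixes 3 digits wide, as in the
# commands above: split -l <lines_per_chunk> -d -a 3 <input> <prefix>
CHUNKS=4
split -l $(( $(wc -l < toy_train.jsonl) / CHUNKS )) -d -a 3 \
    toy_train.jsonl toy_train.split.jsonl

# Sanity check: each shard can now be processed independently
# (the real pipeline runs node_src/transform_jsonl.js on each shard).
ls toy_train.split.jsonl* | wc -l   # -> 4 shards of 68 lines each
```

Because 272 divides evenly by 4 this toy run produces no remainder shard; with the real file, `split` emits one extra shard holding the leftover lines, so the number of shards is one more than the divisor.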
from contracode.
Thank you for sharing this code. I would also like to know how to perform the pre-processing and compiler transforms. I would appreciate it if you could update the README with the procedure.
Hi Parasj, thanks for publishing your code. May I ask what hardware you used in your experiments? I found that 16 GB of memory is not enough when using the javascript_augmented.pickle.gz file.
Hi @QZH-eng -- thank you for trying our repository!
Pretraining is memory hungry as contrastive learning benefits from large batch sizes (see https://arxiv.org/abs/2002.05709). Moreover, the transformer backbone we leverage uses significantly more memory than typical image classification architectures.
We generally performed pretraining over 2-4 16GB V100 GPUs. We provide pretrained checkpoints due to the large cost of pretraining. Finetuning is very cheap and was performed on 1 V100 GPU.
Some recommendations to reduce memory consumption:
(1) reducing the sequence length for the Transformer encoder
(2) decreasing the hidden dimension size of our model
(3) adding checkpoint annotations for gradient checkpointing (e.g. `torch.utils.checkpoint` in PyTorch)
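To illustrate recommendation (3), here is a minimal sketch of wrapping transformer encoder layers in PyTorch's `torch.utils.checkpoint`, which recomputes activations during the backward pass instead of storing them. This is not the repository's actual model; the class name, layer sizes, and structure are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoder(nn.Module):
    """Toy transformer encoder with per-layer gradient checkpointing."""

    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead) for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            if self.training and x.requires_grad:
                # Trade compute for memory: activations inside `layer`
                # are recomputed in backward rather than stored.
                x = checkpoint(layer, x)
            else:
                x = layer(x)
        return x

model = CheckpointedEncoder()
src = torch.randn(16, 8, 64, requires_grad=True)  # (seq_len, batch, d_model)
out = model(src)
out.sum().backward()  # gradients flow through the checkpointed layers
```

The memory savings grow with sequence length and depth, at the cost of roughly one extra forward pass per checkpointed layer during backward.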
@QZH-eng I created a new issue #17 -- please use that issue for further discussion.
Related Issues (11)
- Memory requirements for ContraCode
- Proper Pytorch version
- Memory explosion when pretrain Bidirectional LSTM
- ask help for the codeclone dataset
- code embedding
- Python functions extension
- data.zip
- Cannot obtain the checkpoint
- How many GPUs are used by this project?
- What kind of gpu environment did you use to train the model?