
jerryji1993 / dnabert


DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083

License: Apache License 2.0

Python 100.00%
deep-learning dnabert-model genome gpu kmer kmer-format machine-learning natural-language-processing nlp sequence

dnabert's People

Contributors

hjgwak, jerryji1993, project-delphi, timlautk, zhihan1996


dnabert's Issues

prediction outputs & model classes

Hello,

1. Prediction outputs

When I run the fine-tuning script on the prom-core set, I get the following results:

01/12/2021 16:02:40 - INFO - __main__ -   ***** Eval results  *****
01/12/2021 16:02:40 - INFO - __main__ -     acc = 0.49248183814833585
01/12/2021 16:02:40 - INFO - __main__ -     auc = 0.5877368768457378
01/12/2021 16:02:40 - INFO - __main__ -     aupr = 0.5465873116339481
01/12/2021 16:02:40 - INFO - __main__ -     f1 = 0.6599501924383065
01/12/2021 16:02:40 - INFO - __main__ -     mcc = 0.0
01/12/2021 16:02:40 - INFO - __main__ -     precision = 0.49248183814833585
01/12/2021 16:02:40 - INFO - __main__ -     recall = 1.0

After that, running the prediction script gives the following results:

01/12/2021 16:47:11 - INFO - __main__ -   ***** Pred results  *****
01/12/2021 16:47:11 - INFO - __main__ -     acc = 0.49248183814833585
01/12/2021 16:47:11 - INFO - __main__ -     auc = 0.587710896620401
01/12/2021 16:47:11 - INFO - __main__ -     aupr = 0.5466122449761592
01/12/2021 16:47:11 - INFO - __main__ -     f1 = 0.6599501924383065
01/12/2021 16:47:11 - INFO - __main__ -     mcc = 0.0
01/12/2021 16:47:11 - INFO - __main__ -     precision = 0.49248183814833585
01/12/2021 16:47:11 - INFO - __main__ -     recall = 1.0

The scores for the two different tasks are almost completely identical. Also, I am not sure which dataset in the paper prom-core corresponds to (maybe the general one), but the scores here look quite different from the ones in the paper.

Finally, when I first run the fine-tuning script and then predict on my own dataset, the results look like this:

01/11/2021 19:39:51 - INFO - __main__ -   ***** Eval results  *****
01/11/2021 19:39:51 - INFO - __main__ -     acc = 0.5020841847159787
01/11/2021 19:39:51 - INFO - __main__ -     auc = 0.4945063291829262
01/11/2021 19:39:51 - INFO - __main__ -     aupr = 0.4952829066522459
01/11/2021 19:39:51 - INFO - __main__ -     f1 = 0.0
01/11/2021 19:39:51 - INFO - __main__ -     mcc = 0.0
01/11/2021 19:39:51 - INFO - __main__ -     precision = 0.0
01/11/2021 19:39:51 - INFO - __main__ -     recall = 0.0

I'm wondering why the last four metrics are zero. I am sure the data is formatted similarly to your example; otherwise it wouldn't be able to train in the first place, right? What could be the problem?

2. Different model classes

In run_pretrain.py there are several model classes, including RoBERTa and GPT. Can we also use these for pre-training if we use DNATokenizer?

I'd appreciate it a lot if you could give some information about these two questions.

Over 512 length sequence

Hi,

Thanks for the good work.

Is it possible to tweak the program so it can handle an input sequence longer than 512?
Also, does 512 refer to nucleotides or k-mers (e.g., for a 4-mer, could it take 4*512 nucleotides)?

Thanks,
Nina
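
For reference, a minimal sketch (not code from this repo, just the arithmetic it implies): DNABERT tokenizes a sequence into overlapping k-mers, so the 512-position limit counts k-mer tokens plus the [CLS] and [SEP] special tokens, not nucleotides.

def seq_to_kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATCG" * 200                # 800 nucleotides
kmers = seq_to_kmers(seq, k=6)    # 800 - 6 + 1 = 795 k-mer tokens
print(len(kmers))                 # 795, already more than the 510 usable token positions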

How does DNABERT pre-train with long sequences?

HI!

Today I found that there is no option called "dnalong" in the pre-training process; it only exists in the fine-tuning part.
I wonder whether there will be problems if I use "dna" to pre-train on long sequences. For example, if I use 1000bp-long sequences, what problems will I run into?
I know there are already some issues about how DNABERT deals with long sequences; I browsed them, but they do not resolve this question.
Thank you so much in advance!

DNATokenizer and BertTokenizer

Hi, I'd like to know the difference between DNATokenizer and BertTokenizer.
I use pyenv to create my virtual env, so I cannot use conda install pytorch torchvision cudatoolkit=10.0 -c pytorch, and DNATokenizer can never be imported, so I tried using BertTokenizer to make it work.

Is this the right way?
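
For comparison, a minimal sketch (assuming the modified transformers package bundled in this repo, under src/, is installed, e.g. in editable mode): DNATokenizer maps space-separated k-mers onto the dna3–dna6 vocabularies, whereas the stock BertTokenizer from PyPI uses a WordPiece vocabulary and does not know these k-mer tokens.

from transformers import DNATokenizer   # this class only exists in the DNABERT fork of transformers

tokenizer = DNATokenizer.from_pretrained("dna6")
kmer_text = "ATCGGA TCGGAT CGGATC GGATCC GATCCA"   # 6-mers separated by spaces
encoded = tokenizer.encode_plus(kmer_text, add_special_tokens=True, max_length=512)
print(encoded["input_ids"])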

Genome split questions

Hi,
I have a few questions related to the training sub-sequences:

  1. when you split the genome, do you merge the sub-sequences fro both strategies : non-overlapping sub-sequences and randomly sampled sub-sequences ? if so, why ? did you try each of them separately first ?

  2. The randomly sampled sub-sequences range from 5 to 510bp. How do you use 6-mers for the 5bp sub-sequences?

  3. how do you train the tokenizer ?

Thanks

DNA Sequences Embedding

Hi,
Thanks for this very good work. I was wondering whether I can retrieve the embeddings of DNA sequences and use them later for downstream tasks. Can you please confirm this and guide me on how to get the embeddings?

Thanks!
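
As a starting point, a minimal sketch (not an official recipe; the checkpoint path is illustrative and assumes the repo's modified transformers package is installed) of pooling the last hidden states of a pre-trained DNABERT checkpoint into one vector per sequence:

import torch
from transformers import BertModel, DNATokenizer

model_path = "./6-new-12w-0"                       # illustrative path to a DNABERT6 checkpoint
tokenizer = DNATokenizer.from_pretrained("dna6")
model = BertModel.from_pretrained(model_path)
model.eval()

kmer_text = "ATCGGA TCGGAT CGGATC GGATCC"          # sequence already split into 6-mers
ids = tokenizer.encode_plus(kmer_text, add_special_tokens=True, max_length=512)["input_ids"]
input_ids = torch.tensor(ids, dtype=torch.long).unsqueeze(0)   # add batch dimension
with torch.no_grad():
    outputs = model(input_ids)                     # this transformers version returns a tuple
token_embeddings = outputs[0]                      # (1, seq_len, 768), one vector per token
sequence_embedding = token_embeddings.mean(dim=1)  # mean-pool over tokens -> (1, 768)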

How many sequences were used for pre-training?

Hi,

I'm trying to use this model for microbiome data.
As a practice run, I trained a model using a small dataset (10K sequences extracted from virus genomes).
Unfortunately, a model fine-tuned from that pre-trained model shows extremely bad performance, like a random model.

Can you inform me how many sequences were used for pre-training?

sequences with multiple labels

The Basset dataset has DNA sequences, each with 164 binary labels. I would like to fine-tune DNABERT with this dataset. However, DNABERT is built only for sequences with one label each. Is it possible to modify DNABERT so it can perform fine-tuning on data with multiple labels?

I know I will have to edit src/transformers/data/processors/glue.py and src/transformers/data/processors/utils.py, but I'm lost after that. Any help would be appreciated. Thank you.

I want DNABERT to have more than one output neuron; I will have a go at modifying the code myself and will share it if I can hack it.
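
A minimal sketch of the general idea (an assumption about one way to do it, not code from this repo): replace the single-label classification head with a linear layer of 164 outputs trained with BCEWithLogitsLoss.

import torch
import torch.nn as nn
from transformers import BertModel

class BertForMultiLabel(nn.Module):              # illustrative class name
    def __init__(self, model_path, num_labels=164):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_path)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.loss_fct = nn.BCEWithLogitsLoss()   # independent sigmoid per label

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs[1])        # pooled [CLS] representation
        logits = self.classifier(pooled)         # (batch, num_labels)
        if labels is not None:
            return self.loss_fct(logits, labels.float()), logits
        return logits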

No module named 'transformers' error

Hi, I am Heejo.
Thank you for the amazing package!
I followed your guide and got an error in run_finetune.py

Traceback (most recent call last):
File "run_finetune.py", line 36, in
from transformers import (
ModuleNotFoundError: No module named 'transformers'

I think that I should not install the module using 'pip install transformers'
because you extended the code from huggingface.
How can I install your modified transformers package?

Segmentation fault when running run_finetune.py

Hi,

I just installed your DNABERT and downloaded the DNABERT6 model.
I created the conda env following all the steps.
Finally, I tried to run the fine-tuning example, but the code ends with a Segmentation fault. The error log is below. Can you please help?

Thanks

11/30/2020 15:16:01 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}

11/30/2020 15:16:01 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 15:16:01 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
/var/spool/slurmd/job44157571/slurm_script: line 46: 9764 Segmentation fault python DNABERT/examples/run_finetune.py --model_type dna --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8

Hello! I met "StopIteration" Error when trying to run DNABERT using example data

Hello. I have a huge interest in this DNABERT model.
I ran the exact same command for pre-training using the example data,
but I got the following error.

Traceback (most recent call last):
  File "run_pretrain.py", line 888, in <module>
    main()
  File "run_pretrain.py", line 838, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_pretrain.py", line 434, in train
    outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/DNABERT/src/transformers/modeling_bert.py", line 998, in forward
    encoder_attention_mask=encoder_attention_mask,
  File "/root/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/DNABERT/src/transformers/modeling_bert.py", line 745, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

The versions and builds of the relevant packages are listed below:

pytorch      |  1.8.1   |    py3.6_cuda10.2_cudnn7.6.5_0
cudatoolkit  |  10.2.89 |    hfd86e86_0
transformers |  2.5.0   |    dev_0

OS: Linux 16.04
Graphic Cards INFO (nvidia-smi)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0    39W / 300W |     51MiB / 32505MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   36C    P0    39W / 300W |      1MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   37C    P0    38W / 300W |      1MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   36C    P0    37W / 300W |      1MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     37975      G   /usr/lib/xorg/Xorg                 49MiB |
|    1   N/A  N/A     37975      G   /usr/lib/xorg/Xorg                  0MiB |
|    2   N/A  N/A     37975      G   /usr/lib/xorg/Xorg                  0MiB |
|    3   N/A  N/A     37975      G   /usr/lib/xorg/Xorg                  0MiB |
+-----------------------------------------------------------------------------+
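
One likely cause (an assumption based on the traceback, not a confirmed diagnosis): with PyTorch >= 1.5, nn.DataParallel replicas no longer expose parameters, so next(self.parameters()) inside the replicated BertModel raises StopIteration on multi-GPU machines. A sketch of a workaround that restricts the job to a single GPU (these lines would go at the very top of run_pretrain.py, before torch is imported):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only one GPU so DataParallel is never used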

6 Motif analysis does not work (bug)

Hello, Step 6 Motif finding does not work, even with the provided example data and commands.

Motif analysis converts patterns detected by DNABERT into actionable biological insights. I really hope this bug can be fixed, because DNABERT is working out great so far and I don't want to be stopped at the crucial last step. @Zhihan1996

The problem has also been identified as a bug here: #2

(Fixing the issue with the header lines only results in other error messages, and fixing those errors causes even more confusing error messages.)

python find_motifs.py \
    --data_dir $DATA_PATH \
    --predict_dir $PREDICTION_PATH \
    --window_size 24 \
    --min_len 5 \
    --pval_cutoff 0.005 \
    --min_n_motif 3 \
    --align_all_ties \
    --save_file_dir $MOTIF_PATH \
    --verbose
 File "find_motifs.py", line 110, in <module>
    main()
  File "find_motifs.py", line 86, in main
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=None)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 460, in _read
    data = parser.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 1198, in read
    ret = self._engine.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2

How do I get the DNABERT-TF dataset

The code has the associated file name, but I can't find where to get this data set. Where should I download the data set?
I am trying to replicate the DNABERT-TF results.


Fine-tuning by Transfer learning

Hi, I want to do transfer learning from a big dataset to a small dataset with DNABERT.
First, I run fine-tuning with the big dataset and get a global model.
Then, taking this model's parameters as a starting point, I continue to fine-tune on the small dataset.
The following log appears:
06/11/2021 22:21:21 - INFO - __main__ - Continuing training from checkpoint, will skip to saved global_step
06/11/2021 22:21:21 - INFO - __main__ - Continuing training from epoch 119
06/11/2021 22:21:21 - INFO - __main__ - Continuing training from global step 20000
06/11/2021 22:21:21 - INFO - __main__ - Will skip the first 8 steps in the first epoch
Epoch: 0it [00:00, ?it/s]
06/11/2021 22:21:21 - INFO - __main__ - global_step = 20000, average loss = 0.0
06/11/2021 22:21:21 - INFO - __main__ - Saving model checkpoint to ./ft/out

It doesn't seem to be fine-tuning on the small dataset.
Why does this happen?

Pre-trained weights for long sequence models

Hi,

I am looking for pre-trained weights for the BertForLongSequenceClassification/BertForLongSequenceClassificationCat models. Are these available? Any kmer split would be fine but ideally 4-mer.

Thanks,
Nina

DNABERT-XL availability

Greetings everyone,

In the paper, you mentioned that DNABERT-XL, which takes sequences of 10k bp, was used for promoter-region prediction, TF-binding motifs, and variant prediction outside coding regions.

Is this model available, or will it be made available?

Ty!

Obtain vector embeddings of k-mer tokens

Hello,

I'm trying to obtain the vector embeddings for each token in a DNA sequence. I noticed a similar question was asked in #11, but that question specifically asks for the embeddings of the entire sequence. I would appreciate it if you could give some guidance on how to obtain the token embeddings.

Thanks!

run_finetune.py valueerror: number of classes in y_true not equal to the number of columns in 'y_score'

I have an input of 23bp sequences with a 2-class classification problem. I put my data, as instructed, in $DATA_PATH/train.tsv and dev.tsv.

Input:
export KMER=3
export MODEL_PATH='/mnt/d/M3/Projects/BCB/DNABERT/models/3-new-12w-0/'
export DATA_PATH='/mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/3'
export OUTPUT_PATH='/mnt/d/M3/Projects/BCB/DNABERT/examples/OUTPUT/fit/3mer/'

python run_finetune.py \
    --model_type dna \
    --tokenizer_name=dna$KMER \
    --model_name_or_path $MODEL_PATH \
    --task_name dnasplice \
    --do_train \
    --do_eval \
    --data_dir $DATA_PATH \
    --max_seq_length 23 \
    --per_gpu_eval_batch_size=8 \
    --per_gpu_train_batch_size=8 \
    --learning_rate 2e-4 \
    --num_train_epochs 3.0 \
    --output_dir $OUTPUT_PATH \
    --evaluate_during_training \
    --logging_steps 100 \
    --save_steps 4000 \
    --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 \
    --overwrite_output \
    --weight_decay 0.01 \
    --n_process 8

I get an error:
07/28/2021 20:37:24 - WARNING - main - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
07/28/2021 20:37:24 - INFO - transformers.configuration_utils - loading configuration file /mnt/d/M3/Projects/BCB/DNABERT/models/3-new-12w-0/config.json
07/28/2021 20:37:24 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnasplice",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 3,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 69
}

============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
07/28/2021 20:37:24 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-3/vocab.txt from cache at /home/woreom/.cache/torch/transformers/e1e7221d086d0af09215b2c6ef3ded41de274c79ace1930c48dfce242a7b36fa.b24b7bce4d95258cccdbc46b651c8283db3a0f1324fb97567c8b22b19970f82c
07/28/2021 20:37:24 - INFO - transformers.modeling_utils - loading weights file /mnt/d/M3/Projects/BCB/DNABERT/models/3-new-12w-0/pytorch_model.bin
07/28/2021 20:37:26 - INFO - transformers.modeling_utils - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
07/28/2021 20:37:26 - INFO - transformers.modeling_utils - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
07/28/2021 20:37:26 - INFO - main - finish loading model
07/28/2021 20:37:28 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-08, attention_probs_dropout_prob=0.1, beta1=0.9, beta2=0.999, cache_dir='', config_name='', data_dir='/mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/3', device=device(type='cuda'), do_ensemble_pred=False, do_eval=True, do_lower_case=False, do_predict=False, do_train=True, do_visualize=False, early_stop=0, eval_all_checkpoints=False, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, hidden_dropout_prob=0.1, learning_rate=0.0002, local_rank=-1, logging_steps=100, max_grad_norm=1.0, max_seq_length=75, max_steps=-1, model_name_or_path='/mnt/d/M3/Projects/BCB/DNABERT/models/3-new-12w-0/', model_type='dna', n_gpu=1, n_process=8, no_cuda=False, num_rnn_layer=2, num_train_epochs=3.0, output_dir='/mnt/d/M3/Projects/BCB/DNABERT/examples/OUTPUT/fit/3mer/', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=8, per_gpu_pred_batch_size=8, per_gpu_train_batch_size=8, predict_dir=None, predict_scan_size=1, result_dir=None, rnn='lstm', rnn_dropout=0.0, rnn_hidden=768, save_steps=4000, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, task_name='dnasplice', tokenizer_name='dna3', visualize_data_dir=None, visualize_models=None, visualize_train=False, warmup_percent=0.1, warmup_steps=0, weight_decay=0.01)
07/28/2021 20:37:28 - INFO - main - Loading features from cached file /mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/3/cached_train_3-new-12w-0_75_dnasplice
07/28/2021 20:37:29 - INFO - main - ***** Running training *****
07/28/2021 20:37:29 - INFO - main - Num examples = 16748
07/28/2021 20:37:29 - INFO - main - Num Epochs = 3
07/28/2021 20:37:29 - INFO - main - Instantaneous batch size per GPU = 8
07/28/2021 20:37:29 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 8
07/28/2021 20:37:29 - INFO - main - Gradient Accumulation steps = 1
07/28/2021 20:37:29 - INFO - main - Total optimization steps = 6282
07/28/2021 20:37:29 - INFO - main - Continuing training from checkpoint, will skip to saved global_step
07/28/2021 20:37:29 - INFO - main - Continuing training from epoch 0
07/28/2021 20:37:29 - INFO - main - Continuing training from global step 0
07/28/2021 20:37:29 - INFO - main - Will skip the first 0 steps in the first epoch
Epoch: 0%| | 0/3 [00:00<?, ?it/s07/28/2021 20:38:09 - INFO - main - Loading features from cached file /mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/3/cached_dev_3-new-12w-0_75_dnasplice
07/28/2021 20:38:09 - INFO - main - ***** Running evaluation *****
07/28/2021 20:38:09 - INFO - main - Num examples = 424
07/28/2021 20:38:09 - INFO - main - Batch size = 8
Evaluating: 100%|███████████████████████████████████████████████████████████████████████| 53/53 [00:05<00:00, 9.22it/s]
/home/woreom/anaconda3/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/woreom/anaconda3/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_classification.py:873: RuntimeWarning: invalid value encountered in double_scalars
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
Iteration: 5%|███▎ | 99/2094 [00:46<15:32, 2.14it/s]
Epoch: 0%| | 0/3 [00:46<?, ?it/s]
Traceback (most recent call last):
File "run_finetune.py", line 1282, in
main()
File "run_finetune.py", line 1097, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_finetune.py", line 304, in train
results = evaluate(args, model, tokenizer)
File "run_finetune.py", line 447, in evaluate
result = compute_metrics(eval_task, preds, out_label_ids, probs)
File "/mnt/d/M3/Projects/BCB/DNABERT/src/transformers/data/metrics/init.py", line 110, in glue_compute_metrics
return acc_f1_mcc_auc_pre_rec(preds, labels, probs)
File "/mnt/d/M3/Projects/BCB/DNABERT/src/transformers/data/metrics/init.py", line 79, in acc_f1_mcc_auc_pre_rec
auc = roc_auc_score(labels, probs, average="macro", multi_class="ovo")
File "/home/woreom/anaconda3/envs/dnabert/lib/python3.6/site-packages/sklearn/utils/validation.py", line 63, in inner_f
return f(*args, **kwargs)
File "/home/woreom/anaconda3/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_ranking.py", line 538, in roc_auc_score
multi_class, average, sample_weight)
File "/home/woreom/anaconda3/envs/dnabert/lib/python3.6/site-packages/sklearn/metrics/_ranking.py", line 632, in _multiclass_roc_auc_score
"Number of classes in y_true not equal to the number of "
ValueError: Number of classes in y_true not equal to the number of columns in 'y_score'

Problem with reading the dev.tsv in find_motifs.py

I tried to run the script to find motifs, but there seems to be a problem with reading the file dev.tsv: pandas reads it as a single column. When the script reaches line 87, dev.columns = ['sequence','label'], it fails.
Further investigation showed that pandas finds only one column where the script expects two. Probably at line 86, dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=None), the \t separation is not taken into account, resulting in one column. I am using Linux as the host; I do not know if that has an effect.
Does it work correctly for you @jerryji1993?

Index(['sequence  label'], dtype='object')
Traceback (most recent call last)

 File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "find_motifs.py", line 113, in <module>
    main()
  File "find_motifs.py", line 88, in main
    print(dev[1])
  File "/home/miniconda3/envs/dnabert/lib/python3.6/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/miniconda3/envs/dnabert/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 1
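
A minimal sketch of the two read modes at play (paths illustrative; this is an illustration, not the repo's code): whether dev.tsv has a header row determines whether header=None plus explicit column names is appropriate.

import pandas as pd

# dev.tsv written with a header row ("sequence<TAB>label"):
dev = pd.read_csv("dev.tsv", sep="\t")               # header row is inferred
print(dev.columns.tolist())                           # ['sequence', 'label']

# dev.tsv written without a header row:
dev = pd.read_csv("dev.tsv", sep="\t", header=None)   # two unnamed columns
dev.columns = ["sequence", "label"]                   # what find_motifs.py expects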

On getting last hidden layer weights through python

Hi,

I want to run the DNABERT model on a given sequence loaded in Python, e.g.:

sequence = "ATCTACTACTGTACTACGTAG"

Currently, the documentation only covers running sequences through the CLI. What is the best way to load the DNABERT model so I can do something like:

kmer_sequence = some_kmer_function(sequence)
embedding = DNABERT(kmer_sequence).last_hidden_states

I took this DNABERT(kmer_sequence).last_hidden_states pattern from the HF documentation.

shape of atten_scores

Hi,
thanks for this package!
I was replicating your example and I got an error in the last step, when I call find_motifs.py:

Traceback (most recent call last):
File "DNABERT/motif/find_motifs.py", line 110, in
main()
File "DNABERT/motif/find_motifs.py", line 91, in main
pos_atten_scores = atten_scores[dev_pos.index.values]
IndexError: index 5918 is out of bounds for axis 0 with size 5918

it seems there is a mismatch between the shape of the atten_scores/pred_results
atten_scores.shape - > (5918, 81)
pred.shape - > (5918, 81)

and the dev file shape
dev.shape - > (5919, 81)

any idea what is causing this?
cheers
Michele

Error when running run_language_modeling.py

Hi,

I tried to run DNABERT as shown in the README, but I cannot run the following command:

cd examples

export KMER=6
export TRAIN_FILE=sample_data/pre/6_3k.txt
export TEST_FILE=sample_data/pre/6_3k.txt
export SOURCE="/home/malab14/Research/DeepTFBS/DNABERT"
export OUTPUT_PATH="/home/malab14/Research/DeepTFBS/DNABERT/output"

python run_language_modeling.py \
    --output_dir $OUTPUT_PATH \
    --model_type=dna \
    --tokenizer_name=dna$KMER \
    --config_name=$SOURCE/src/transformers/dnabert-config/bert-config-$KMER/config.json \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm \
    --gradient_accumulation_steps 25 \
    --per_gpu_train_batch_size 10 \
    --per_gpu_eval_batch_size 6 \
    --fp16 \
    --save_steps 500 \
    --save_total_limit 20 \
    --max_steps 200000 \
    --evaluate_during_training \
    --logging_steps 500 \
    --line_by_line \
    --learning_rate 4e-4 \
    --block_size 512 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --beta1 0.9 \
    --beta2 0.98 \
    --mlm_probability 0.025 \
    --warmup_steps 10000 \
    --overwrite_output_dir \
    --n_process 10

I got an error like this:

Traceback (most recent call last):
  File "run_language_modeling.py", line 858, in <module>
    main()
  File "run_language_modeling.py", line 762, in main
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
  File "/home/malab14/Research/DeepTFBS/DNABERT/src/transformers/tokenization_utils.py", line 377, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/malab14/Research/DeepTFBS/DNABERT/src/transformers/tokenization_utils.py", line 469, in _from_pretrained
    raise EnvironmentError(msg)
OSError: Couldn't reach server at '{}' to download vocabulary files.

Could you give me a hand?

finetune issue

Hi!
I tried to use the DNABERT6 pre-trained model to run the fine-tuning process using the provided examples. However, an error occurred:
Traceback (most recent call last):
File "run_finetune.py", line 1281, in
main()
File "run_finetune.py", line 1095, in main
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
File "run_finetune.py", line 704, in load_and_cache_examples
features = torch.load(cached_features_file)
File "/.local/lib/python3.6/site-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/.local/lib/python3.6/site-packages/torch/serialization.py", line 224, in init
super(_open_zipfile_reader, self).init(torch.C.PyTorchFileReader(name_or_buffer))
RuntimeError: version
<= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b7d94ae6193 in /.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x2b7d4ce169eb in /.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x2b7d4ce17c04 in /.local/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x6c53a6 (0x2b7d493cd3a6 in /.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2961c4 (0x2b7d48f9e1c4 in /.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #39: __libc_start_main + 0xf5 (0x2b7bb9c17555 in /lib64/libc.so.6)
frame #40: python() [0x400e02]

Our PyTorch version is 1.4.0, but if we upgrade PyTorch, we hit the "StopIteration" error instead.
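
One possible workaround sketch (an assumption, not the authors' fix): the cached features file seems to have been written by a newer PyTorch (zip-based format, version 3), which PyTorch 1.4 cannot read. Deleting the cached_* file so run_finetune.py regenerates it is the simplest option; alternatively, from an environment with PyTorch >= 1.6 and the repo's transformers package importable, it can be re-saved in the legacy format (path illustrative):

import torch

path = "sample_data/ft/prom-core/6/cached_train_6-new-12w-0_75_dnaprom"
features = torch.load(path)                                        # requires PyTorch >= 1.6 here
torch.save(features, path, _use_new_zipfile_serialization=False)   # legacy format readable by 1.4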

find_motifs.py does not work

#### ::: DNABERT-viz find motifs ::: ####

import os
import pandas as pd
import numpy as np
import argparse
import motif_utils as utils

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the sequence+label .tsv files (or other data files) for the task.",
    )

    parser.add_argument(
        "--predict_dir",
        default=None,
        type=str,
        required=True,
        help="Path where the attention scores were saved. Should contain both pred_results.npy and atten.npy",
    )

    parser.add_argument(
        "--window_size",
        default=24,
        type=int,
        help="Specified window size to be final motif length",
    )

    parser.add_argument(
        "--min_len",
        default=5,
        type=int,
        help="Specified minimum length threshold for contiguous region",
    )

    parser.add_argument(
        "--pval_cutoff",
        default=0.005,
        type=float,
        help="Cutoff FDR/p-value to declare statistical significance",
    )

    parser.add_argument(
        "--min_n_motif",
        default=3,
        type=int,
        help="Minimum instance inside motif to be filtered",
    )

    parser.add_argument(
        "--align_all_ties",
        action='store_true',
        help="Whether to keep all best alignments when ties encountered",
    )

    parser.add_argument(
        "--save_file_dir",
        default='.',
        type=str,
        help="Path to save outputs",
    )

    parser.add_argument(
        "--verbose",
        action='store_true',
        help="Verbosity controller",
    )

    parser.add_argument(
        "--return_idx",
        action='store_true',
        help="Whether the indices of the motifs are only returned",
    )

    # TODO: add the conditions
    args = parser.parse_args()

    atten_scores = np.load(os.path.join(args.predict_dir,"atten.npy"))
    pred = np.load(os.path.join(args.predict_dir,"pred_results.npy"))
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t')
    #dev.columns = ['sequence','label']
    dev['sequence'] = dev['sequence'].apply(utils.kmer2seq)
    dev_pos = dev[dev['label'] == 1]
    dev_neg = dev[dev['label'] == 0]
    pos_atten_scores = atten_scores[dev_pos.index.values]
    neg_atten_scores = atten_scores[dev_neg.index.values]
    assert len(dev_pos) == len(pos_atten_scores)

    # run motif analysis
    merged_motif_seqs = utils.motif_analysis(dev_pos['sequence'],
                                             dev_neg['sequence'],
                                             pos_atten_scores,
                                             window_size = args.window_size,
                                             min_len = args.min_len,
                                             pval_cutoff = args.pval_cutoff,
                                             min_n_motif = args.min_n_motif,
                                             align_all_ties = args.align_all_ties,
                                             save_file_dir = args.save_file_dir,
                                             verbose = args.verbose,
                                             ###return_idx  = args.return_idx      //TypeError: filter_motifs() got multiple values for keyword argument 'return_idx'
                                             )

if __name__ == "__main__":
    main()

*** Begin motif analysis ***

  • pos_seqs: 9674; neg_seqs: 9709
  • Finding high attention motif regions
  • Filtering motifs by hypergeometric test
    motif CACCT: N=19383; K=9674; n=3316; x=1956; p=7.539397922768265e-31
  • Merging similar motif instances
  • Making fixed_length window = 24
  • Removing motifs with less than 3 instances
  • Saving outputs to directory

No files were saved.
Changing --min_n_motif 3 to --min_n_motif 1 leads to the following (screenshot omitted).

How do you get the right result?

run_finetune.py issue

When running the test.sh script (run_finetune.py), I am getting the following error message:

<class 'transformers.tokenization_dna.DNATokenizer'>
01/18/2021 11:20:34 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/mcb/users/zipcode/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
01/18/2021 11:20:34 - INFO - transformers.modeling_utils - loading weights file /home/mcb/users/zipcode/code/DNABERT/6-new-12w-0/pytorch_model.bin
01/18/2021 11:20:40 - INFO - transformers.modeling_utils - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
01/18/2021 11:20:40 - INFO - transformers.modeling_utils - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
01/18/2021 11:20:54 - INFO - main - Training/evaluation parameters Namespace(adam_epsilon=1e-08, attention_probs_dropout_prob=0.1, beta1=0.9, beta2=0.999, cache_dir='', config_name='', data_dir='sample_data/ft/prom-core/6', device=device(type='cuda', index=8), do_ensemble_pred=False, do_eval=True, do_lower_case=False, do_predict=False, do_train=True, do_visualize=False, early_stop=0, eval_all_checkpoints=False, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, hidden_dropout_prob=0.1, learning_rate=0.0002, local_rank=-1, logging_steps=100, max_grad_norm=1.0, max_seq_length=75, max_steps=-1, model_name_or_path='/home/mcb/users/zipcode/code/DNABERT/6-new-12w-0', model_type='dna', n_gpu=1, n_process=8, no_cuda=False, num_rnn_layer=2, num_train_epochs=3.0, output_dir='./ft/prom-core/6', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=16, per_gpu_pred_batch_size=8, per_gpu_train_batch_size=16, predict_dir=None, predict_scan_size=1, result_dir=None, rnn='lstm', rnn_dropout=0.0, rnn_hidden=768, save_steps=4000, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, task_name='dnaprom', tokenizer_name='dna6', visualize_data_dir=None, visualize_models=None, visualize_train=False, warmup_percent=0.1, warmup_steps=0, weight_decay=0.01)
01/18/2021 11:20:54 - INFO - main - Loading features from cached file sample_data/ft/prom-core/6/cached_train_6-new-12w-0_75_dnaprom
Traceback (most recent call last):
File "run_finetune.py", line 1288, in
main()
File "run_finetune.py", line 1102, in main
train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
File "run_finetune.py", line 704, in load_and_cache_examples
features = torch.load(cached_features_file)
File "/home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/serialization.py", line 224, in init
super(_open_zipfile_reader, self).init(torch.C.PyTorchFileReader(name_or_buffer))
RuntimeError: version
<= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1579027003190/work/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /opt/conda/conda-bld/pytorch_1579027003190/work/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f2c10e32627 in /home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f2a99e01e2b in /home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f2a99e03044 in /home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x6d1326 (0x7f2c11aca326 in /home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x28c076 (0x7f2c11685076 in /home/mcb/users/zipcode/miniconda3/envs/dnabert/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #40: __libc_start_main + 0xe7 (0x7f2c154d4bf7 in /lib/x86_64-linux-gnu/libc.so.6)

Any idea how to fix this? PyTorch version 1.4.0.

example for Fine-tune with pre-trained model is not working

So, with this input, following the instructions in the README file:

export KMER=6
export MODEL_PATH='/mnt/d/M3/Projects/BCB/DNABERT/models/6mer/'
export DATA_PATH='/mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/6'
export OUTPUT_PATH='/mnt/d/M3/Projects/BCB/DNABERT/examples/OUTPUT/'

python run_finetune.py \
    --model_type dna \
    --tokenizer_name=dna$KMER \
    --model_name_or_path $MODEL_PATH \
    --task_name dnaprom \
    --do_train \
    --do_eval \
    --data_dir $DATA_PATH \
    --max_seq_length 75 \
    --per_gpu_eval_batch_size=16   \
    --per_gpu_train_batch_size=16   \
    --learning_rate 2e-4 \
    --num_train_epochs 3.0 \
    --output_dir $OUTPUT_PATH \
    --evaluate_during_training \
    --logging_steps 100 \
    --save_steps 4000 \
    --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 \
    --overwrite_output \
    --weight_decay 0.01 \
    --n_process 8

I get the error

Loading features from cached file /mnt/d/M3/Projects/BCB/DNABERT/examples/sample_data/ft/prom-core/6/cached_train_6mer_75_dnaprom
06/23/2021 06:47:35 - INFO - __main__ -   ***** Running training *****
06/23/2021 06:47:35 - INFO - __main__ -     Num examples = 53277
06/23/2021 06:47:35 - INFO - __main__ -     Num Epochs = 3
06/23/2021 06:47:35 - INFO - __main__ -     Instantaneous batch size per GPU = 16
06/23/2021 06:47:35 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 16
06/23/2021 06:47:35 - INFO - __main__ -     Gradient Accumulation steps = 1
06/23/2021 06:47:35 - INFO - __main__ -     Total optimization steps = 9990
/mnt/d/M3/Projects/BCB/DNABERT/models/6mer/
Traceback (most recent call last):
  File "run_finetune.py", line 1282, in <module>
    main()
  File "run_finetune.py", line 1097, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_finetune.py", line 237, in train
    global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
ValueError: invalid literal for int() with base 10: ''
(dnabert) woreom@AsusFanap:/mnt/d/M3/Projects/BCB/DNABERT/examples$

I also added a print(args.model_name_or_path) to run_finetune.py to see what this variable is.
It loads the model fine, but I couldn't figure out what this line is doing.
Can you help me? Am I doing something wrong?
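
An illustration of what the failing line computes (not a fix from the repo): it tries to recover a global step from a checkpoint-style directory name such as ".../checkpoint-4000". A MODEL_PATH like ".../models/6mer/" has no trailing step number and ends with a slash, so the parse yields an empty string and int('') raises the ValueError shown above.

path = "/mnt/d/M3/Projects/BCB/DNABERT/models/6mer/"
print(repr(path.split("-")[-1].split("/")[0]))   # '' -> int('') raises ValueError

path = "./ft/prom-core/6/checkpoint-4000"        # checkpoint-style directory name
print(int(path.split("-")[-1].split("/")[0]))    # 4000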

Model name 'dna6' was not found in tokenizers model name list

Hi there,

I am running the DNABERT run_finetune.py as instructed by the readme file. It works well at my workstation, but when I run the same code on the server, it reports the following error:

OSError: Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6). We assumed 'dna6' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

I wonder why "Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6)"? It seems so strange, because dna6 is definitely in the list.

Thanks for the answer!

Token indices sequence length is longer than the specified maximum sequence length for this model (3000 > 512). Running this sequence through the model will result in indexing errors

Token indices sequence length is longer than the specified maximum sequence length for this model (3000 > 512). Running this sequence through the model will result in indexing errors

Traceback (most recent call last):
File "", line 1, in
File "F:\PyCharm 2020.2.1\plugins\python\helpers\pydev_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "F:\PyCharm 2020.2.1\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/Documents/PycharmProjects/bert/getBertWordvec.py", line 7, in
outputs = model(input_ids)
File "F:\Anaconda3\envs\dnabert\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "F:\Anaconda3\envs\dnabert\lib\site-packages\pytorch_transformers\modeling_bert.py", line 707, in forward
embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
File "F:\Anaconda3\envs\dnabert\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "F:\Anaconda3\envs\dnabert\lib\site-packages\pytorch_transformers\modeling_bert.py", line 252, in forward

Hi, my input data length is 3000, so this error happened. Could I fix it by changing your code, for example by changing the maximum token sequence length?
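
For context, a sketch of one common workaround (an assumption, not a feature of this repo): the positional embeddings are fixed at 512, so rather than raising the limit, a long k-mer token list can be split into chunks of at most 510 tokens, each fitting under the limit once [CLS] and [SEP] are added; how to combine the per-chunk outputs is left to the downstream task.

def chunk_tokens(tokens, max_tokens=510):
    """Split a long token list into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

tokens = ["ATCGGA"] * 3000           # 3000 k-mer tokens, far above the limit
chunks = chunk_tokens(tokens)        # 6 chunks: five of 510 tokens and one of 450
print([len(c) for c in chunks])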

MODEL_TYPE and TASK_NAME : more details

Hi,

could you please explain the MODEL_TYPE and TASK_NAME options ?

I have a large number of DHS sequences and I want to use the sequence embeddings for downstream processing

Thanks

Can't recreate your results

Hi, I tried to recreate the model with the examples you have given, but the resulting model is random (acc: 0.53).

running DNABERT pretrain.py file output environment error

Hi there,

I am running the DNABERT pretrain.py as instructed by the readme file. It worked well once and started the model training, but when I re-built the environment after a week, the system output the error below:

08/30/2021 21:23:17 - WARNING - main - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Traceback (most recent call last):
File "/home/wuchao/dl/DNABERT/src/transformers/configuration_utils.py", line 225, in get_config_dict
raise EnvironmentError
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_pretrain.py", line 885, in
main()
File "run_pretrain.py", line 781, in main
config = config_class.from_pretrained(args.config_name, cache_dir=args.cache_dir)
File "/home/wuchao/dl/DNABERT/src/transformers/configuration_utils.py", line 176, in from_pretrained
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/wuchao/dl/DNABERT/src/transformers/configuration_utils.py", line 241, in get_config_dict
raise EnvironmentError(msg)
OSError: Model name 'PATH_TO_DNABERT_REPO/src/transformers/dnabert-config/bert-config-6/config.json' was not found in model name list. We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/PATH_TO_DNABERT_REPO/src/transformers/dnabert-config/bert-config-6/config.json/config.json' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

This happens not only for the terminal training version but also for the Google Colab version. May I ask if anyone could help me solve this issue?

Thanks a lot!

Best regards,
Chao

Pre-Training DNABERT

Hi, I have attached the actual scores and plots that I got when I pre-trained DNABERT on my data. I just wanted to know the optimal values for the perplexity score and, if possible, could you please share the perplexity scores of your corpus?


Pretraining DNABERT

Hi, I am pre-training the DNABERT model on my custom data, and I am getting these perplexity scores: 1.005971908569336
1.0055251121520996
1.0050606727600098
1.005359411239624
1.005840539932251
1.0052825212478638
1.0051802396774292
1.0055253505706787
1.0054320096969604
1.0054410696029663
1.0058468580245972
1.0049262046813965
1.0057575702667236
1.0051915645599365
1.0054072141647339
1.0055351257324219
1.0054702758789062
1.0053589344024658
1.0051729679107666
1.005456566810608
1.0054833889007568
1.0049924850463867
1.0052168369293213
1.0055359601974487
1.0054214000701904
1.0054751634597778
1.005573034286499
1.0051946640014648
1.0053223371505737
1.0050946474075317
1.0055451393127441
1.0052800178527832
1.0052553415298462
1.005454421043396
1.0052385330200195
1.0048243999481201
1.005685806274414
1.0053269863128662
1.0049481391906738
1.0052223205566406
1.0053377151489258
1.0051454305648804
1.0050266981124878
1.005757451057434
1.005202054977417
1.005906343460083
1.0050561428070068
1.0051881074905396
1.0052803754806519
1.0053002834320068
1.005397915840149
1.0059492588043213
1.0059244632720947
1.0054737329483032
1.00540030002594
1.0050368309020996
1.0050461292266846
1.005406141281128
1.005310297012329
1.0049501657485962
1.0049052238464355
1.005474328994751
1.0050350427627563
1.0050352811813354
1.0047125816345215
1.0053828954696655
1.0057741403579712
1.0050772428512573
1.0055228471755981
1.0052945613861084
1.005362868309021
1.0057356357574463
1.0052978992462158

I am new to the ML/AI domain, so I was a little confused about the bounds of the perplexity score. Can you please help validate these scores?
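
For background, a small sketch of the relationship these scripts typically report (perplexity as the exponential of the mean masked-LM cross-entropy loss), which also gives the lower bound: perplexity can never go below 1.0, and values around 1.005 correspond to a per-token loss of roughly 0.006.

import math

perplexity = 1.005971908569336
loss = math.log(perplexity)     # invert perplexity = exp(loss)
print(loss)                     # ~0.00595 average cross-entropy per masked token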

Fine-tuning for multiclass data

Hi, I am working with a dataset that has around 60 classes labeled from 0 to 59. I first created a new DataProcessor for the task, which returns those labels from its get_labels() function.

However, after following the pre-training script in the README.md, I get a KeyError:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/mr/miniconda3/envs/bert/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/mr/Workspace/Team8993/THESIS/DNABERT/examples/transformers/data/processors/glue.py", line 120, in glue_convert_examples_to_features
    label = label_map[example.label]
KeyError: '35'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_finetune.py", line 1283, in <module>
    main()
  File "run_finetune.py", line 1097, in main
    train_dataset = load_and_cache_examples(args, args.task_name, tokenizer, evaluate=False)
  File "run_finetune.py", line 760, in load_and_cache_examples
    features.extend(result.get())
  File "/home/mr/miniconda3/envs/bert/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
KeyError: '35'

Can you please tell me where this might be originating from?
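
One common cause to check (an assumption, not a confirmed diagnosis): glue_convert_examples_to_features builds label_map from the values returned by get_labels() and then looks up example.label, which is read from the TSV as a string. If get_labels() returns integers 0..59 while the file contains the string "35", the lookup fails with exactly this KeyError. A hypothetical sketch of a processor that keeps the two consistent:

class MultiClassProcessor:                      # illustrative name, not the repo's class
    def get_labels(self):
        # return the labels as strings, matching what is read from train.tsv/dev.tsv
        return [str(i) for i in range(60)]      # "0" .. "59"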

error while running examples - segmentation fault

Hi,
I'm trying to run the example. I created the dnabert env and downloaded the packages and files. I get an error at step 3.3 while trying to run the fine-tuning with the pre-trained model (DNABERT6). I get the following error message:

<class 'transformers.tokenization_dna.DNATokenizer'>
01/05/2021 17:08:16 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/mcb/users/zipcode/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
01/05/2021 17:08:16 - INFO - transformers.modeling_utils - loading weights file /home/mcb/users/zipcode/code/DNABERT/6-new-12w-0/pytorch_model.bin
Segmentation fault (core dumped)

I have tried re-downloading the pre-trained model, but got the same error. Strangely, I do not get this error locally on my Mac, but without any GPU it would take too long to run there. I get this error on a Linux server.

Any ideas on how to fix this? Thanks!

No module named 'dateutil'

Dear author:

I am new to this application. I set up the dnabert conda environment on an HPC cluster. I followed the instructions in step 2.2 (Model training) to run the test; however, I hit an error message:

Traceback (most recent call last):
File "run_pretrain.py", line 42, in
from transformers import (
File "/risapps/noarch/dnabert/20210826/src/transformers/init.py", line 22, in
from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
File "/risapps/noarch/dnabert/20210826/src/transformers/configuration_albert.py", line 18, in
from .configuration_utils import PretrainedConfig
File "/risapps/noarch/dnabert/20210826/src/transformers/configuration_utils.py", line 25, in
from .file_utils import CONFIG_NAME, cached_path, hf_bucket_url, is_remote_url
File "/risapps/noarch/dnabert/20210826/src/transformers/file_utils.py", line 22, in
import boto3
File "/risapps/rhel7/python/3.7.3/envs/dnabert/lib/python3.6/site-packages/boto3/init.py", line 16, in
from boto3.session import Session
File "/risapps/rhel7/python/3.7.3/envs/dnabert/lib/python3.6/site-packages/boto3/session.py", line 17, in
import botocore.session
File "/risapps/rhel7/python/3.7.3/envs/dnabert/lib/python3.6/site-packages/botocore/session.py", line 29, in
import botocore.configloader
File "/risapps/rhel7/python/3.7.3/envs/dnabert/lib/python3.6/site-packages/botocore/configloader.py", line 19, in
from botocore.compat import six
File "/risapps/rhel7/python/3.7.3/envs/dnabert/lib/python3.6/site-packages/botocore/compat.py", line 27, in
from dateutil.tz import tzlocal
ModuleNotFoundError: No module named 'dateutil'

Did I miss anything? Thank you for your suggestions on how to fix this.

Regards,

Need example of using compute_result.py

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, confusion_matrix

# excerpt from compute_scan() in compute_result.py;
# generate_pred() and the args namespace are defined elsewhere in that script
predict_results = np.load(args.pred_path)   # per-window prediction scores saved by the prediction step
labels = np.load(args.label_path)           # ground-truth labels, one per original example
labels = list(labels.astype(int))

results = []
for i in range(len(labels)):
    # aggregate the sliding-window scores that belong to example i
    pred = generate_pred(predict_results, i, args.slide, args.metric)

    if pred >= args.bound:
        results.append(1)
    else:
        results.append(0)

f1 = f1_score(y_true=labels, y_pred=results)
mcc = matthews_corrcoef(labels, results)
tn, fp, fn, tp = confusion_matrix(labels, results).ravel()

count = 0
for i in range(len(results)):
    if results[i] == labels[i]:
        count += 1

print("number of examples: " + str(len(labels)))
print("number of positive examples: " + str(sum(labels)))
print("number of negative examples: " + str(len(labels) - sum(labels)))
print("f1: " + str(f1))
print("mcc: " + str(mcc))
print("accuracy: " + str(float(count) / len(results)))
print("tn: " + str(tn))
print("fp: " + str(fp))
print("fn: " + str(fn))
print("tp: " + str(tp))

Hi,

In compute_result.py, I found functions such as compute_scan (the code above) and compute_mouse, but I cannot understand how they load data. Does np.load(args.pred_path) load pred_results.npy into this function to generate the confusion matrix? However, pred_results.npy holds the prediction results per batch, so I cannot figure out how to generate the ground-truth label file.
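
If it helps, this is roughly how I imagine the ground-truth file would be produced (a minimal sketch; it assumes the labels can be taken from the tab-separated dev.tsv used for fine-tuning, with a sequence/label header line, and saved as a .npy array in the same order as pred_results.npy, which is exactly the part I am unsure about):

import numpy as np

# hypothetical: build true_labels.npy from the fine-tuning dev.tsv
labels = []
with open("sample_data/ft/prom-core/6/dev.tsv") as f:
    next(f)  # skip the assumed "sequence<TAB>label" header line
    for line in f:
        seq, label = line.strip().split("\t")
        labels.append(int(label))

np.save("true_labels.npy", np.array(labels))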

Thanks for your help,
Henry

something about feature extraction

Hello, I could not find in your README how to extract feature vectors with your pre-trained model. So I used your DNABERT in pytorch-transformers (Huggingface/transformers grew out of pytorch-transformers), but I always get a vector of shape [batch_size, 1, 768], not the expected [batch_size, kmer_len, 768].

I would like to know how I can use DNABERT to obtain such per-k-mer feature vectors. Do you provide a feature-extraction function?
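
For reference, here is roughly what I am doing (a minimal sketch, not the repo's official recipe: it assumes the stock Hugging Face BertModel/BertTokenizer can load the released checkpoint, with do_lower_case=False so the uppercase k-mers match the vocab; the repo's own DNATokenizer may be the intended class). Index 1 of the output tuple is the pooled [CLS] vector, which may be the [batch_size, 1, 768]-like tensor I keep getting, while index 0 should be the per-k-mer embeddings:

import torch
from transformers import BertModel, BertTokenizer

model_path = "6-new-12w-0"  # hypothetical local path to the pre-trained DNABERT-6 checkpoint
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
model = BertModel.from_pretrained(model_path)
model.eval()

# a sequence already converted to space-separated 6-mers
kmer_sentence = "ATCGAT TCGATC CGATCG GATCGA"
inputs = tokenizer.encode_plus(kmer_sentence, add_special_tokens=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs["input_ids"])

token_embeddings = outputs[0]  # [batch_size, kmer_len + 2, 768] per-token vectors ([CLS]/[SEP] included)
pooled_embedding = outputs[1]  # [batch_size, 768] pooled [CLS] vector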

Sincerely, PBC

how to use utility function kmer2seq

Dear DNABERT author:

Could you please provide a Python example of how to call kmer2seq to convert a text file (for example examples/sample_data/pre/6_3k.txt) back to its original sequences?
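
Something along these lines is what I am after (a minimal sketch of what I am imagining; it assumes kmer2seq can be imported from motif/motif_utils.py, where the repo's seq2kmer/kmer2seq helpers appear to live, and that each line of 6_3k.txt is one space-separated k-mer sentence):

from motif.motif_utils import kmer2seq  # assumed import path

with open("examples/sample_data/pre/6_3k.txt") as f:
    for line in f:
        kmers = line.strip()
        if kmers:                      # skip blank lines
            print(kmer2seq(kmers))     # reconstruct the original DNA sequence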

Thank you very much,
Rong

Conflict in package version

Hi,

The current repo requires tokenizers==0.5.0; however, it calls a tokenizer function that only exists in 0.10.0.

Installation with conda strictly followed the order in the README, but run_finetune.py fails with import issues.

export PATH_TO_DNABERT_REPO=/gpfs/bin/DNABERT
export SOURCE=/gpfs/bin/DNABERT
export KMER=6
export MODEL_PATH=/gpfs/bin/DNABERT/pretrained/6-new-12w-0
export DATA_PATH=sample_data/ft/prom-core/$KMER
export OUTPUT_PATH=./ft/prom-core/$KMER

python run_finetune.py \
    --model_type dna \
    --tokenizer_name=dna$KMER \
    --model_name_or_path $MODEL_PATH \
    --task_name dnaprom \
    --do_train \
    --do_eval \
    --data_dir $DATA_PATH \
    --max_seq_length 75 \
    --per_gpu_eval_batch_size=16 \
    --per_gpu_train_batch_size=16 \
    --learning_rate 2e-4 \
    --num_train_epochs 3.0 \
    --output_dir $OUTPUT_PATH \
    --evaluate_during_training \
    --logging_steps 100 \
    --save_steps 4000 \
    --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 \
    --overwrite_output \
    --weight_decay 0.01 \
    --n_process 8

Traceback (most recent call last):
  File "run_finetune.py", line 69, in <module>
    from transformers import glue_compute_metrics as compute_metrics
ImportError: cannot import name 'glue_compute_metrics'

Yi

training issue

Hi,

I tried to train from scratch to reproduce the paper's results. I started from the genome FASTA and split it into 510 bp sub-sequences with the BBMap script shred.sh.

I found a script under the DNABERT data_process_template folder, process_pretrain_data.py, to convert the split sequences into k-mers (though I am not sure about the "sampling_rate" parameter). Anyway, I used the vocab.txt under bert-config-6.
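
For reference, my understanding is that the k-mer conversion itself amounts to something like the sketch below (my own re-implementation for illustration, not necessarily identical to what process_pretrain_data.py does):

def seq2kmer(seq, k=6):
    """Turn a DNA sequence into a space-separated sentence of overlapping k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# e.g. seq2kmer("ATCGATCG") -> "ATCGAT TCGATC CGATCG"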

I ran run_pretrain.py as described, with the same parameters, and I get the error below. Can you please help?

01/08/2021 01:44:15 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True
01/08/2021 01:44:15 - INFO - transformers.configuration_utils - loading configuration file /home/DNABERT/src/transformers/dnabert-config/bert-config-6-mm/config.json
01/08/2021 01:44:15 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}

01/08/2021 01:44:15 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
01/08/2021 01:44:15 - INFO - __main__ - Training new model from scratch
01/08/2021 01:44:19 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-06, beta1=0.9, beta2=0.98, block_size=512, cache_dir=None, config_name='/home/DNABERT/src/transformers/dnabert-config/bert-config-6-mm', device=device(type='cuda', index=0), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/home/DNABERT/examples/sample_data/pre/GRCh38.dna.primary_assembly_cut6_test.txt', evaluate_during_training=True, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=25, learning_rate=0.0004, line_by_line=True, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=200000, mlm=True, mlm_probability=0.025, model_name_or_path=None, model_type='dna', n_gpu=1, n_process=4, no_cuda=False, num_train_epochs=1.0, output_dir='output6-mm', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=6, per_gpu_train_batch_size=10, save_steps=500, save_total_limit=20, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name='dna6', train_data_file='/home/DNABERT/examples/sample_data/pre/GRCh38.dna.primary_assembly_cut6_train.txt', warmup_steps=10000, weight_decay=0.01)
01/08/2021 01:44:19 - INFO - __main__ - Creating features from dataset file at /home/DNABERT/examples/sample_data/pre/GRCh38.dna.primary_assembly_cut6_train.txt
Traceback (most recent call last):
  File "run_pretrain.py", line 885, in <module>
    main()
  File "run_pretrain.py", line 830, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "run_pretrain.py", line 200, in load_and_cache_examples
    return LineByLineTextDataset(tokenizer, args, file_path=file_path, block_size=args.block_size)
  File "run_pretrain.py", line 183, in __init__
    ids = result.get()
  File "/gpfs/project/conda_envs/dnabert/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/gpfs/project/conda_envs/dnabert/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/gpfs/project/conda_envs/dnabert/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/gpfs/project/conda_envs/dnabert/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Generate Attention Scores on dnalongcat finetuned models

Hi, I'm currently running step 5.1, which generates attention scores. When using a model fine-tuned with max_seq_length = 2048 and a batch size of 8, I run into the following error.

ValueError: could not broadcast input array from shape (32,12,512,512) into shape (8,12,2048,2048)

I assume this has something to do with how the k-mers are stored during fine-tuning of the model, i.e. reshaped to [max_seq_length/512, 512].
Is there a quick fix for this?
Cheers

KeyError '23' while running examples

Has anyone else seen this while running the examples?

export MODEL_PATH=../examples/ft/prom-core/$KMER
export DATA_PATH=examples
export PREDICTION_PATH=examples
python ../examples/run_finetune.py \
    --model_type dna \
    --tokenizer_name=dna$KMER \
    --model_name_or_path $MODEL_PATH \
    --task_name dnaprom \
    --do_predict \
    --data_dir $DATA_PATH  \
    --max_seq_length 75 \
    --per_gpu_pred_batch_size=128   \
    --output_dir $MODEL_PATH \
    --predict_dir $PREDICTION_PATH \
    --fp16 \
    --n_process 48

The error message we are seeing is:

12/14/2020 01:54:26 - INFO - transformers.data.processors.glue -   Writing example 0/5
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/DNABERT/src/transformers/data/processors/glue.py", line 120, in glue_convert_examples_to_features
    label = label_map[example.label]
KeyError: '23'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "../examples/run_finetune.py", line 1281, in <module>
    main()
  File "../examples/run_finetune.py", line 1152, in main
    prediction = predict(args, model, tokenizer, prefix=prefix)
  File "../examples/run_finetune.py", line 484, in predict
    pred_dataset = load_and_cache_examples(args, pred_task, tokenizer, evaluate=True)
  File "../examples/run_finetune.py", line 761, in load_and_cache_examples
    features.extend(result.get())
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
KeyError: '23'
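
For reference, my current guess is that the label column is the problem: '23' is not in the label map of the dnaprom task, which I believe only knows the classes 0 and 1. Below is a minimal sketch of the dev.tsv layout I think is expected (my own assumption about the format, with a tab-separated header line and 0/1 labels):

# hypothetical: write a dev.tsv in the layout the dnaprom task appears to expect
rows = [
    ("ATCGAT TCGATC CGATCG GATCGA", "1"),
    ("GGCTAA GCTAAT CTAATC TAATCG", "0"),
]
with open("dev.tsv", "w") as f:
    f.write("sequence\tlabel\n")
    for seq, label in rows:
        f.write(seq + "\t" + label + "\n")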

Hi, I ran into a problem while trying to install the dnabert env.

Running setup.py install for apex ... error
ERROR: Command errored out with exit status 1: /nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-4x23nl6p/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-4x23nl6p/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-elj3n3dy/install-record.txt --single-version-externally-managed --compile --install-headers /nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/include/python3.6m/apex Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 854, in install
req_description=str(self.req),
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/operations/install/legacy.py", line 86, in install
raise LegacyInstallFailure
pip._internal.operations.install.legacy.LegacyInstallFailure

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 224, in _main
status = self.run(options, args)
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", line 180, in wrapper
return func(self, options, args)
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 403, in run
pycompile=options.compile,
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/req/init.py", line 90, in install_given_reqs
pycompile=pycompile,
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 858, in install
six.reraise(*exc.parent)
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_vendor/six.py", line 703, in reraise
raise value
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/operations/install/legacy.py", line 76, in install
cwd=unpacked_source_directory,
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
spinner=spinner,
File "/nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 240, in call_subprocess
raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-4x23nl6p/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-4x23nl6p/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-elj3n3dy/install-record.txt --single-version-externally-managed --compile --install-headers /nas/longleaf/home/lehuang/tool/anaconda/envs/dnabert/include/python3.6m/apex Check the logs for full command output.
1 location(s) to search for versions of pip:
