kr-colab / relernn

Recombination Landscape Estimation using Recurrent Neural Networks

License: MIT License

Python 100.00%

deep-learning population-genomics recombination recurrent-neural-networks

relernn's People

jgallowa07 jradrion lzeitler skyclub3 brandonlind peterdfields silastittes octpalacios ziyimo ningshuang-yao thomasbrazier wjbmattingly

relernn's Issues

ReLERNN_TRAIN step has a problem: ValueError: The filepath provided must end in `.keras` (Keras model format)

Hi, I got an error when I ran the TRAIN step. Could you help? Thank you very much. Here are my commands:

/data2/software-use/ReLERNN/ReLERNN/ReLERNN_SIMULATE -v species.vcf \
-d myfile \
--demographicHistory species_plot.csv \
-u 3.3e-9 \
-l 1 \
-t 40 \
-s 123 \
-g genome.length.bed \
--unphased

/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN -d myfile -t 20 -s 123
The output is:
Total params: 76,002,769 (289.93 MB)
Trainable params: 76,002,769 (289.93 MB)
Non-trainable params: 0 (0.00 B)
Traceback (most recent call last):
File "/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN", line 130, in <module>
main()
File "/data2/software-use/ReLERNN/ReLERNN/ReLERNN_TRAIN", line 109, in main
runModels(ModelFuncPointer=GRU_TUNED84,
File "/data2/software-use/anaconda3/envs/ReLERNN_python/lib/python3.10/site-packages/ReLERNN/helpers.py", line 353, in runModels
ModelCheckpoint(
File "/data2/software-use/anaconda3/envs/ReLERNN_python/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
raise ValueError(
ValueError: The filepath provided must end in .keras (Keras model format). Received: filepath=myfile/networks/weights.h5
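
The traceback shows ModelCheckpoint rejecting a path that ends in weights.h5. The following is a minimal sketch, not ReLERNN's own code: it assumes the installed Keras is version 3, that full-model checkpoints must end in .keras (as the error message states), and that weights-only checkpoints must end in .weights.h5. The helper name checkpoint_path is hypothetical.

```python
from pathlib import PurePosixPath


def checkpoint_path(network_dir, keras_major, weights_only=False):
    """Pick a checkpoint filename whose suffix the given Keras major version accepts.

    Assumption: Keras 3 wants '.keras' for full models and '.weights.h5'
    for weights-only checkpoints; Keras 2 accepted a plain '.h5' file.
    """
    if keras_major >= 3:
        name = "weights.weights.h5" if weights_only else "weights.keras"
    else:
        name = "weights.h5"
    return str(PurePosixPath(network_dir) / name)


print(checkpoint_path("myfile/networks", keras_major=3))
```

A path built this way would have to replace the weights.h5 path passed to ModelCheckpoint in helpers.py. Installing an older TensorFlow/Keras 2 release in the environment is another route that avoids editing the package.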

ReLERNN_TRAIN_POOL is slow

I had to remove multiprocessing from model.fit to remedy a memory leak with tensorflow 2, which means that the generation of training data when training on pooled sequences is now painfully slow. I will be working on a fix for this issue, but I do not currently have a resolution.

ReLERNN_PREDICT_HOTSPOT unable to run

When I run ReLERNN_PREDICT_HOTSPOT with default parameters, I get the following error:

File "/project-whj/software/ReLERNN/ReLERNN/ReLERNN_PREDICT_HOTSPOT", line 100, in main
pred_sequence = VCFBatchGenerator(**bds_pred_params)
TypeError: __init__() got an unexpected keyword argument 'WIN'

The error occurs even with the example files shipped with the software. I don't know how to solve it; please advise.

Hi, I'm wondering how to generate the example.vcf haplotype file format

I imitated the example.vcf file format and split a phased VCF file into a VCF file containing two haplotypes (CHR like 3L 3R) using my own Python script. I made sure the haplotype VCF file can be processed by other VCF tools (VCFtools, bcftools), but ReLERNN_SIMULATE told me 'Error: chromosomes have different sample sizes!'.
Please help me, and tell me how to get the haplotype VCF file quickly. Thanks!

Error running example of ReLERNN

Hello,
After installing all the dependencies and packages when I try to run the example I am getting the below error.
ReLERNN/examples$ python2.7 ./example_pipeline.sh
File "./example_pipeline.sh", line 14
${SIMULATE}
^
SyntaxError: invalid syntax
ReLERNN/examples$ python3.5 ./example_pipeline.sh
File "./example_pipeline.sh", line 14
${SIMULATE}
^
SyntaxError: invalid syntax

/ReLERNN/examples$ ./example_pipeline.sh
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/ReLERNN_SIMULATE", line 4, in <module>
__import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1462, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 7, in <module>
from ReLERNN.imports import *
File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/ReLERNN/__init__.py", line 4, in <module>
from ReLERNN.helpers import *
File "/usr/local/lib/python2.7/dist-packages/ReLERNN-0.1-py2.7.egg/ReLERNN/helpers.py", line 184
out_fp.write(f"pop0,{time_years},{size}\n")
^
SyntaxError: invalid syntax
./example_pipeline.sh: line 15: --vcf: command not found
./example_pipeline.sh: line 16: --genome: command not found
./example_pipeline.sh: line 17: --mask: command not found
./example_pipeline.sh: line 18: --phased: command not found
./example_pipeline.sh: line 19: --projectDir: command not found
./example_pipeline.sh: line 20: --assumedMu: command not found
./example_pipeline.sh: line 21: --upperRhoThetaRatio: command not found
./example_pipeline.sh: line 22: --nTrain: command not found
./example_pipeline.sh: line 23: --nVali: command not found
./example_pipeline.sh: line 24: --nTest: command not found
./example_pipeline.sh: line 25: --nCPU: command not found

Any help or suggestion is appreciated.
Thank you,
Tanushree
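
Two things are visible in the log above. example_pipeline.sh is a shell script, so handing it to python2.7 or python3.5 produces a SyntaxError at the first shell construct. Separately, the installed package ran under Python 2.7, which cannot parse the f-string at helpers.py line 184; f-strings require Python 3.6 or newer. A minimal sketch of a version check (the function name supports_fstrings is hypothetical):

```python
import sys

MIN_VERSION = (3, 6)  # f-strings were introduced in Python 3.6


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if not supports_fstrings():
        sys.exit("This interpreter is older than Python 3.6 and cannot parse f-strings.")
    print("Interpreter version is new enough for f-strings.")
```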

NumPy Version Related Error

Hi! I am trying to get the example for ReLERNN working but I keep getting a NumPy version error where it wants a newer version of NumPy. Do you know which line in which script specifies the version of NumPy?

memory allocate error on PREDICT

Hi,
I have successfully run SIMULATE and TRAIN, but am now running into a memory error on PREDICT:
"MemoryError: Unable to allocate 32.4 GiB for an array with shape (14212, 1801, 340) and data type float32"
I realize that this is likely on my side of things, but I wanted to know whether it is possible to pass PREDICT a VCF that is a subset of the VCF used in the SIMULATE step. PREDICT seems to look for the VCF as an hdf5 ...
thanks,
@stsmall
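
The 32.4 GiB figure in the error follows directly from the array shape and dtype. A small worked calculation, assuming only that float32 takes 4 bytes per element (the issue does not say which axis corresponds to windows, sites, or samples):

```python
from math import prod


def array_gib(shape, itemsize):
    """Memory needed for a dense array, in GiB (2**30 bytes)."""
    return prod(shape) * itemsize / 2**30


shape = (14212, 1801, 340)   # shape reported in the MemoryError
print(round(array_gib(shape, 4), 1))  # prints 32.4
```

Because the requirement is linear in every axis, halving the first axis would halve the allocation.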

Demographic history error message

Hi Jeff,
Just following up about the --demographicHistory error message with SIMULATE. In the --help output of SIMULATE it says it needs the output from SMC++ which is currently a model.final.json file but this returns an error. --demographicHistory is looking for .csv file and I've confirmed this works if you have SMC++ produce a .csv file. So, if you change the error message of SIMULATE to supply a .csv or include it in the help message it should clear up any confusion. Thanks for your help!
Best,
Kenny

The train step needs large memory

Hi,
When I run ReLERNN_TRAIN with default settings, the process is killed because it uses too much memory. How can I deal with this? Could you help? Thank you very much.

RELERNN SIMULATE issue with vcf?

Hello I think I may be having a similar issue to the closed issue #7. I commented on that thread as well but wasn't sure if GitHub alerts of a closed issue comment?

I’ve removed all the hemizygous/haploid chromosomes from my vcf and my windowSizes file only has chromosomes with sample size 6

However, I am getting the following error:
Reading HDF5 mask: /home/ddebaun/mendel-nas1/redo_recombination/splitVCFs/Leioheterodon_madagascarensis_B_biallelic_7204_RagTag:0-11000_md_mask.hdf5...
Traceback (most recent call last):
File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 245, in <module>
main()
File "/home/ddebaun/mendel-nas1/miniconda3/bin/ReLERNN_SIMULATE", line 152, in main
md_mask = np.concatenate(md_mask)
File "<array_function internals>", line 180, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 6 and the array at index 6 has size 3

I’m including the information I used for this run (first 50k lines of vcf). Is this also an issue with the vcf for scikit-allel? Any help would be appreciated!
filestorun.zip
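
The ValueError says that the array at index 0 has size 6 along the sample axis while the array at index 6 has size 3. A minimal sketch for spotting such outliers, assuming the per-chromosome sample counts have already been collected into a dictionary (the scaffold names and the helper find_sample_size_outliers are hypothetical; how the counts are obtained is left open):

```python
from collections import Counter


def find_sample_size_outliers(sizes_by_chrom):
    """Return chromosomes whose sample count differs from the most common count."""
    if not sizes_by_chrom:
        return {}
    expected, _ = Counter(sizes_by_chrom.values()).most_common(1)[0]
    return {chrom: n for chrom, n in sizes_by_chrom.items() if n != expected}


sizes = {"scaffold_1": 6, "scaffold_2": 6, "scaffold_3": 6, "scaffold_7": 3}
print(find_sample_size_outliers(sizes))  # prints {'scaffold_7': 3}
```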

loss: nan - val_loss nan

Hi,
For some of my training runs, not all, the loss values eventually start being reported as 'nan'.
Epoch 308/1000
99/100 [============================>.] - ETA: 2s - loss: 0.2445Epoch 1/1000
100/100 [==============================] - 328s 3s/step - loss: 0.2445 - val_loss: 0.2342
Epoch 309/1000
99/100 [============================>.] - ETA: 3s - loss: nanEpoch 1/1000
100/100 [==============================] - 332s 3s/step - loss: nan - val_loss: nan
Epoch 310/1000
99/100 [============================>.] - ETA: 2s - loss: nanEpoch 1/1000
100/100 [==============================] - 325s 3s/step - loss: nan - val_loss: nan

It usually terminates successfully after a repeat of 'nan'. Just wanted to verify that there is nothing wrong with the training that would influence the predictions in this case.
thanks,
@stsmall
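
To find where a run first goes to nan, the training log can be scanned. A minimal sketch, assuming the log lines look like the excerpt above (an "Epoch N/M" line followed by a summary line containing "loss:" and "val_loss:"); the function name first_nan_epoch is hypothetical:

```python
import math
import re

EPOCH_RE = re.compile(r"^Epoch (\d+)/\d+")
LOSS_RE = re.compile(r"- loss: (\S+) - val_loss: (\S+)")


def first_nan_epoch(log_lines):
    """Return the first epoch whose summary line reports a nan loss, or None."""
    epoch = None
    for line in log_lines:
        match = EPOCH_RE.match(line)
        if match:
            epoch = int(match.group(1))
            continue
        match = LOSS_RE.search(line)
        if match and any(math.isnan(float(value)) for value in match.groups()):
            return epoch
    return None


log = [
    "Epoch 308/1000",
    "100/100 [==============================] - 328s 3s/step - loss: 0.2445 - val_loss: 0.2342",
    "Epoch 309/1000",
    "100/100 [==============================] - 332s 3s/step - loss: nan - val_loss: nan",
]
print(first_nan_epoch(log))  # prints 309
```

Keras also provides a TerminateOnNaN callback that stops training at the first nan loss; whether ReLERNN uses it is not stated in this issue.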

How can I get a good fitting effect model

Hello,

I am running ReLERNN to estimate the recombination rate of the maize genome (85% TE, 2.2 Gb genome size).

No matter how I adjust the window size, the R2 and MAE values of the model for this population (F2, n=180, depth=5x, per-plant sequencing for SNP calling, low_qual variants filtered, ...) are not ideal.

Can you give me some advice?

Thank you very much

Error: chromosomes have different sample sizes!

Hi! I am running ReLERNN on two populations of sheep (each with 10 samples) and chromosomes 1-26. I got this error for both populations. My initial thoughts were that some of my filtering may have left some samples with a 0 snp count for certain chromosomes but when I printed out the snp count per sample per chromosome, all had many snps (well into thousands). I followed a previous resolved issue and checked if windowSizes.txt was empty and it was. Do you know what is happening or what I should try? I attached part of my vcf and my ReLERNN script for one population. As background, I have run the example and other species successfully.

first_6000_IR_sheep_autosomes_biallelic_fixup.vcf.gz

IR.txt

ReLERNN_SIMULATE not splitting vcf file properly

Dear users,

I have used ReLERNN with a previous individuals resequencing data with no problem at all.

However, I am running into problems with a new dataset. It seems ReLERNN fails in the step of splitting the vcf into the different chromosome vcfs. Comparing this directory with the one from the analysis that previously worked, I see there are files missing.

[s_menb@jupiter SS1]$ cd splitVCFs/
[s_menb@jupiter splitVCFs]$ ll
total 4204
-rw-r--r--. 1 s_menb clusteruser 1285117 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr1:0-61357614.hdf5
-rw-r--r--. 1 s_menb clusteruser 1284380 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr2:0-58906861.hdf5
-rw-r--r--. 1 s_menb clusteruser 1162613 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr3:0-53163979.hdf5
-rw-r--r--. 1 s_menb clusteruser 566572 Mar 1 10:59 SS1.m2M2.recode.CLEAN_Chr4:0-17018963.hdf5

My vcf file (for a single individual) was created using Platypus and filtering for biallelic positions.

My bed file is tab separated as follows:

Chr1 0 61357614
Chr2 0 58906861
Chr3 0 53163979
Chr4 0 17018963

And below the error message:

Traceback (most recent call last):
File "/cluster/software/relernn/ReLERNN-1.0.0_ve/bin/ReLERNN_SIMULATE", line 245, in <module>
main()
File "/cluster/software/relernn/ReLERNN-1.0.0_ve/bin/ReLERNN_SIMULATE", line 160, in main
thetaW=maxS/a
ZeroDivisionError: division by zero

Any ideas or suggestions?

Best wishes
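
The failing line is thetaW=maxS/a. The sketch below is an assumption about what a represents, since the traceback does not show how it is computed: if a is the usual denominator of Watterson's estimator, a = sum of 1/i for i from 1 to n-1 with n sampled chromosomes, then the sum is empty (zero) whenever fewer than two chromosomes are counted, and the division fails. The function name watterson_theta is hypothetical.

```python
def watterson_theta(num_segregating_sites, num_chromosomes):
    """Watterson's estimator S / a_n, with a_n = sum_{i=1}^{n-1} 1/i.

    Raises ValueError instead of ZeroDivisionError when n < 2,
    because the harmonic sum is then empty.
    """
    if num_chromosomes < 2:
        raise ValueError("need at least two sampled chromosomes, got %d" % num_chromosomes)
    a_n = sum(1.0 / i for i in range(1, num_chromosomes))
    return num_segregating_sites / a_n


print(watterson_theta(10, 2))  # prints 10.0
```

Under that assumption, checking how many samples (and which ploidy) were read from the split files would be a reasonable first step.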

empty files in SplitVCF/

When I run it on my PC, all the files listed in the SplitVCF/ folder are empty after running the simulation, and I didn't see any related errors in the console log.
PS: I am just using the example.

tensorflow needed in requirements.txt

After creating an anaconda environment conda create --name relernn python=3.7, and installing dependencies following the README, I was not able to run ./example_pipeline.sh without manually installing tensorflow : pip install tensorflow. After pip installing tensorflow (tensorflow-2.2.0-cp37-cp37m-manylinux2010_x86_64.whl), the example script ran through just fine.

Genome bed file

Hello,
I am trying to run ReLERNN for my species on pool sample and it is failing due to this error:
Error: genome file must be formatted as a bed file (i.e.'chromosome start end')
head of my genome.bed is :
CM009931.2 0 27754200
CM009932.2 0 16093500
CM009933.2 0 13619445
CM009934.2 0 13404451
CM009935.2 0 13920984

Any suggestions what I am doing wrong?
Thank You in advance.
Tanu
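
The pasted lines look like valid BED at a glance, so invisible differences (spaces instead of tabs, trailing whitespace, blank lines) are worth ruling out. A minimal sketch of a strict checker, assuming the expected layout is exactly three tab-separated fields, 'chromosome start end'; this is not ReLERNN's own parser, and check_bed_lines is a hypothetical helper:

```python
def check_bed_lines(lines):
    """Return (line_number, problem) pairs for lines that are not 'chrom<TAB>start<TAB>end'."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            problems.append((number, "empty line"))
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append((number, "expected 3 tab-separated fields, found %d" % len(fields)))
            continue
        _, start, end = fields
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start and end must be non-negative integers"))
        elif int(start) >= int(end):
            problems.append((number, "start must be smaller than end"))
    return problems


example = ["CM009931.2\t0\t27754200\n", "CM009932.2 0 16093500\n"]
print(check_bed_lines(example))
```

In this example the second line is flagged because it is separated by spaces rather than tabs.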

Error with ReLERNN example script.

Hello,
I was able to install and run the script but in the second step I get this warning and error

ReLERNN_SIMULATE_POOL.py FINISHED!

Using TensorFlow backend.
Warning: training data to be treated as if generated by pool-seq
Model: "model_1"

Layer (type)                  Output Shape       Param #
input_1 (InputLayer)          (None, 2930, 2)    0
bidirectional_1 (Bidirection  (None, 168)        44352
dense_1 (Dense)               (None, 256)        43264
dropout_1 (Dropout)           (None, 256)        0
dense_2 (Dense)               (None, 1)          257

Total params: 87,873
Trainable params: 87,873
Non-trainable params: 0

2019-12-13 14:29:39.255505: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-13 14:29:39.289945: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2497315000 Hz
2019-12-13 14:29:39.293212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x532a2b0 executing computations on platform Host. Devices:
2019-12-13 14:29:39.293241: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1374, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' used by {{node bidirectional_1/CudnnRNN}}with these attrs: [dropout=0, seed=87654321, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="gru", is_training=true, seed2=0]
Registered devices: [CPU, XLA_CPU]
Registered kernels:

     [[bidirectional_1/CudnnRNN]]

During handling of the above exception, another exception occurred:

I don't have GPUs, is there any other way I can run it.
Any suggestions?
Thank You,
Tanushree

Which python version needed to successfully run ReLERNN?

Hi, I'm having difficulty getting all the dependencies to install when running pip install. What python version should I use? Perhaps this could be included in documentation? Thank you!!

Andre Moncrieff
Postdoc at Louisiana State University

Using --mask option does not change the output

Hello,

I've run ReLERNN both with and without the --mask option, yet I've obtained identical results (same window size and the resulting table with nSites). For --mask option I provide a .bed file containing masked transposable elements obtained from EDTA.
In the log file, I notice the following message:

'Accessibility mask found: calculating the proportion of the genome that is masked...
44.0% of the genome inaccessible'

Despite this, there is no impact on the output. Could you shed some light on why this might be the case?

missing comma from line 9 in setup.py

Hi, I believe a comma is missing from line 9 of the newly committed setup.py; this causes the dependency installation command "pip install ." to fail.

what is the recommended threshold for maf filtering

When I run ReLERNN with different MAF parameters, I get the exact opposite pattern of three populations. How do I solve this problem, what is the recommended threshold for maf filtering?

Error with ReLERNN_SIMULATE

Hi!

Really excited about using ReLERNN to estimate recombination in some natural data with a low-ish sample size (n = 22), and also to have a go on some poolseq data too.

Just tried to run on my natural data, and I get the following error message when reading the HDF5 files:

Reading HDF5: "ReLERNN/splitVCFs/paria_marianne_1027798.final_chr1:0-34343053.hdf5"...
Process Process-2:
Error: chromosomes have different numbers of samples
Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/ReLERNN_SIMULATE", line 4, in <module>
    __import__('pkg_resources').run_script('ReLERNN==0.1', 'ReLERNN_SIMULATE')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 750, in run_script
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/setuptools-38.4.0-py3.6.egg/pkg_resources/__init__.py", line 1527, in run_script
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 219, in <module>
Traceback (most recent call last):
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 199, in worker_countSites
    if md_mask.any():
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/allel/abc.py", line 43, in __getattr__
    return getattr(self.values, item)
AttributeError: 'Dataset' object has no attribute 'any'
    main()
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/EGG-INFO/scripts/ReLERNN_SIMULATE", line 108, in main
    wins, nSamps, maxS, maxLen = vcf_manager.countSites(nProc=nProc)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/ReLERNN-0.1-py3.6.egg/ReLERNN/manager.py", line 170, in countSites
    return sorted_wins, nSamps[0], maxS, maxLen
IndexError: list index out of range

My vcf file is pretty standard, although there is some missing data, and I'm running ReLERNN like this:
ReLERNN_SIMULATE -v paria_marianne_1027798.final.vcf -g STAR.extents.bed -m STAR.chromosomes.release.repeats.bed -d ReLERNN/ -u 4.8e-8 --unphased

I checked the vcf files generated in the first step of the script, and they all have the same number of samples:
for i in *vcf; do bcftools query -l $i | wc -l; done | sort | uniq
22

Is there a method to convert the result to other window size based results

Hi,
I got the predicted results from ReLERNN, and now I want to convert them to another window size, such as 50 kb, and then to a map file for use as input to Relate (https://myersgroup.github.io/relate/input_data.html#Prepare). Could you share a method for doing this?

Thanks.
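
One possible approach is a length-weighted average of the predicted rates over new fixed windows. This is a sketch under stated assumptions, not a ReLERNN feature: predictions for one chromosome are taken to be non-overlapping (start, end, rate) windows with the rate in crossovers per base pair per generation, and the cM/Mb conversion uses the small-rate approximation 1 Morgan = 100 cM. The names rebin_rates and to_cm_per_mb are hypothetical, and the exact Relate map layout should be taken from the Relate documentation.

```python
def rebin_rates(windows, new_size, chrom_length):
    """Length-weighted mean rate in fixed windows of new_size bp.

    windows: iterable of (start, end, rate), half-open [start, end) intervals.
    Returns a list of (start, end, rate) with rate None where nothing overlaps.
    """
    windows = list(windows)
    result = []
    for new_start in range(0, chrom_length, new_size):
        new_end = min(new_start + new_size, chrom_length)
        covered = 0
        weighted = 0.0
        for start, end, rate in windows:
            overlap = min(end, new_end) - max(start, new_start)
            if overlap > 0:
                covered += overlap
                weighted += overlap * rate
        result.append((new_start, new_end, weighted / covered if covered else None))
    return result


def to_cm_per_mb(rate_per_bp):
    """Convert crossovers/bp/generation to cM/Mb (x 100 cM per Morgan x 1e6 bp per Mb)."""
    return rate_per_bp * 1e8


predicted = [(0, 100_000, 1e-8), (100_000, 200_000, 3e-8)]
for start, end, rate in rebin_rates(predicted, 50_000, 200_000):
    print(start, end, to_cm_per_mb(rate))
```

Windows with no overlapping prediction are returned with a rate of None so that gaps stay visible rather than being silently filled.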

ReLERNN train TF2 model.fit memory leak and errors

I have problems running the TF2 version of relernn.
I'm using:
tensorflow 2.1
cudatk 10.1.243
cudnn 7.6.4
CUDA enabled GPU (1080Ti)

Memory leak
Each training iteration memory usage keeps increasing which eventually leads to >200GB RAM usage. I think it's related to these issues
tensorflow/tensorflow#33030
tensorflow/tensorflow#35100
I also tried nightly which has the same issue.

Error message
I'm also getting error and warning messages in each epoch with TF2.

2020-02-22 01:44:32.078164: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

I don't know if these problems are related but maybe they are.

On a side note, I can run the example pipeline with producing output, even though another error comes up when loading modules (Could not load dynamic library 'libnvinfer.so.6')

There was another issue with ReLERNN train (earlier TF1 commits, where model.fit_generator was used). The model fitting would not succeed after all epochs ran, without any error message. Maybe you have an idea what the problem could be here? Then I could use the TF1 version of ReLERNN and run my stuff that way.

I'm running it on a dataset with 5 individuals and about 2M SNPs (unphased, with some missing data).

Any help would be greatly appreciated.

very different corrections when bootstrapping

Hi!

I used ReLERNN to estimate the recombination rate along a very long genome, and ran the analyses by pieces of 500Mb. The results between the different parts are comparable when I use the results of the "predict" function, but differ a lot after correction with the "bscorrect" function.
For example, before correction:

And after correction:

Any idea what could cause such differences? Is there anything I should do?
Thanks in advance!

Recommendations for how to parameterize ReLERNN

Dear @jradrion @andrewkern et al,

I am exploring the possibility of using ReLERNN to infer recombination rates in a non-model arthropod with a large genome and high levels of nucleotide diversity (1-1.5%). Nothing is known currently about the recombination in this species and we have no ground-truth evidence to fall back on to verify results. Our assembly is relatively fragmented and our sample size is just below 70 diploids. The decay of LD seems relatively rapid in our data (phased with BEAGLE 4.0), similar to what is seen in many other arthropods.

The ReLERNN paper says much about the relationship between mutation and recombination rates, and both the mutation rate and the parameter "--upperRhoThetaRatio" seem to be key to successful inference.

I tried setting --upperRhoThetaRatio to 35 as in the paper and used a mutation rate typical for arthropods, and while all steps in ReLERNN worked on my machine with a powerful GPU, the inferred recombination rates came out very flat, with dips around contig breaks along scaffolds or genes with reduced levels of variation, suggesting training and parameterization have not worked well.

ReLERNN is new to me and I am not sure how to move forward.

Can you give some hints as for what parameters to tweak?

A higher or lower "--upperRhoThetaRatio"?

Removing variants with low minor allele frequencies?

Separate predictions on different samples from same vcf?

I have a use case where samples are all in the same vcf and have the same hyperparameters (mutation rate etc.), but I would like to make predictions for separate taxa. Out of the box, prediction on new samples from the same vcf didn't work, I guess because the windows are broken up according to the vcf/samples given to ReLERNN_SIMULATE. It seems like a waste to do distinct sims for this. I would be happy to try tackling this if it seems feasible. Naively, it seems like it would require basing the windows on a reference genome instead of a vcf? Are there issues I'm overlooking?

Question about software usage

Hi
I would like to know if it is possible to use this software to estimate recombination rates and then analyze population demographics using linkage-disequilibrium-based Ne estimation that use recombination rates information.
If you have time, please tell me.

Installation instructions

The dependencies in requirements.txt and setup.py are identical, so it seems to me the instructions could be simplified to

$ git clone https://github.com/kern-lab/ReLERNN.git
$ cd ReLERNN
$ pip install .

instead of

$ git clone https://github.com/kern-lab/ReLERNN.git
$ cd ReLERNN
$ pip install -r requirements.txt
$ python setup.py install

Or is there any reason for the two steps and legacy install?

Chromosome length bounded to 20 Mbp

Hello,

I have tried the ReLERNN pipeline on some poolseq data for a genome including 5 chromosomes of about 45 Mbp each.
The pipeline works fine but I noticed that the maximal position considered in the splitPOOL files is 20Mbp.
I guess this is a result of the max number of sites the pipeline can handle at once, is this correct ? Or is there any other issue I should worry about ?

In any case, great tool and impressive computational performances.

Best,
Guillaume

How many diploid samples should be at least used

Hi, I am wondering how many diploid samples should be used, I found the paper used at least 4 chromosomes, so, for diploid samples, at least two samples should be used. Am I right? And what about the accuracy?
Thank you very much

Issue with chromosome length

Hi!

I am facing a very usual issue with ReLERNN regarding chromosome length. I am studying a species with chromosomes longer than 2,147,483,647 bp which is the usual limit for integer storage in memory.
I can of course divide my chromosomes to take that into account (which I usually have to do, as most software have the same issue), but if you could consider that for one of your next releases, it would be amazing!

Thanks a lot!
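
The workaround mentioned above, dividing long chromosomes, can be sketched as follows. This is an illustration only: the "_partN" naming scheme and the function split_chromosome are hypothetical, and variant positions would have to be shifted by each piece's offset separately.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def split_chromosome(name, length, max_chunk=INT32_MAX):
    """Return (piece_name, offset, piece_length) tuples, each piece at most max_chunk bp."""
    pieces = []
    for index, offset in enumerate(range(0, length, max_chunk), start=1):
        piece_length = min(offset + max_chunk, length) - offset
        pieces.append(("%s_part%d" % (name, index), offset, piece_length))
    return pieces


# Each piece becomes its own BED record starting at 0.
for piece_name, offset, piece_length in split_chromosome("chr1", 3_000_000_000):
    print("%s\t0\t%d\t# original offset %d" % (piece_name, piece_length, offset))
```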

Unable to allocate memory with ReLERNN_TRAIN_POOL

I'm running into memory issues with ReLERNN_TRAIN_POOL. I'm not sure if this is a ReLERNN problem, or (more likely) something about the way my cluster and GPUs are set up.

I'm running my analysis on a cluster (CPU: Intel Xeon Gold 6240 @ 2.60GHz, GPU: NVIDIA RTX 2080Ti), using 24 threads. I installed the dependencies through conda, here are the versions I'm using:

While ReLERNN_TRAIN_POOL runs I get frequent warnings:

WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

It runs for a while and then eventually (~24 hours, 49 epochs) relernn gives a memory allocation error:

Traceback (most recent call last):
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/utils/data_utils.py", line 843, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/utils/data_utils.py", line 820, in pool_fn
    pool = get_pool_class(True)(
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/context.py", line 276, in _Popen
    return Popen(process_obj)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/tt164677e/anaconda3/envs/tf/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

This error doesn't cause any out-of-memory errors in the scheduler and doesn't halt the program: I just notice that the logs stop updating and I have to manually stop the job.

Here's the command I ran:

ReLERNN_TRAIN_POOL -d genome_MQ20_minDP90_maf05 --readDepth 255 --maf 0.05 -t 24

If it helps, this is coming from a pool of 81 diploid individuals (so I specified --sampleDepth 162 in ReLERNN_SIMULATE_POOL). The genome size is ~3Gb, and I have just under 55k SNPs at the current level of filtering. The cluster I'm running on has I think up to 512 Gb of memory to work with, though if I check memory usage of the failed job with sacct I get some nonsensical numbers (MaxRSS = 18130.31G, MaxVMSize = 19035.73G), so I'm not sure what's going on there.

Also if it helps, I was successfully able to run the example pooled pipeline, but it took longer than I was expecting given what the readme says for the non-pooled example: ~60 minutes running on 4 cores.

Please let me know if you need any other info. I'm very new with running GPU-based analyses, so even if you can just point me in the right direction in terms of questions to ask my sysadmin, I'd appreciate it.

Illegal instruction 4 errors when running the example

Hello,

I am running ReLERNN on a Mac with an M1 chip and suspect that this might be the main cause of the following error when running the example file. Is there an update for macOS installations with the M1-M3 chips?

Here is the error:

(msprime-env) frankburbrink@Mac-Studio examples % ./example_pipeline_pool.sh
./example_pipeline_pool.sh: line 25: 54699 Illegal instruction: 4 ${SIMULATE} --pool ${POOL} --sampleDepth 20 --genome ${GENOME} --mask ${MASK} --projectDir ${DIR} --assumedMu ${MU} --upperRhoThetaRatio ${URTR} --nTrain 13000 --nVali 2000 --nTest 100 --seed ${SEED}
./example_pipeline_pool.sh: line 34: 54779 Illegal instruction: 4 ${TRAIN} --projectDir ${DIR} --readDepth 20 --maf 0.05 --nEpochs 2 --nValSteps 2 --seed ${SEED}
./example_pipeline_pool.sh: line 40: 54783 Illegal instruction: 4 ${PREDICT} --pool ${POOL} --projectDir ${DIR} --seed ${SEED}
./example_pipeline_pool.sh: line 47: 54788 Illegal instruction: 4 ${BSCORRECT} --projectDir ${DIR} --nSlice 2 --nReps 2 --seed ${SEED}

Thanks for any advice!

Frank

The filepath provided must end in `.keras` (Keras model format)

Hello,
I'm trying to test ReLERNN installation running the example_pipeline.sh and I'm having the following error during training step:

Traceback (most recent call last):
File "/home/quaranta/anaconda3/bin/ReLERNN_TRAIN", line 130, in <module>
main()
File "/home/quaranta/anaconda3/bin/ReLERNN_TRAIN", line 109, in main
runModels(ModelFuncPointer=GRU_TUNED84,
File "/home/quaranta/anaconda3/lib/python3.10/site-packages/ReLERNN/helpers.py", line 353, in runModels
ModelCheckpoint(
File "/home/quaranta/anaconda3/lib/python3.10/site-packages/keras/src/callbacks/model_checkpoint.py", line 191, in __init__
raise ValueError(
ValueError: The filepath provided must end in .keras (Keras model format). Received: filepath=./example_output/networks/weights.h5

How can I solve this? Thanks

Wrong auto-estimate of #CPUs (Slurm)

The automatic estimate of the number of CPUs available is wrong on an HPC cluster with Slurm scheduler. The program counts all CPUs on the node, not just the ones allocated by Slurm. As a result, if it has not been allocated the whole node, it tries to start too many processes and it crashes.

To fix this, it should detect whether the environment variable SLURM_NTASKS has been set, and if so, set the number of processes equal to: either SLURM_NTASKS * SLURM_CPUS_PER_TASK (if SLURM_CPUS_PER_TASK has been set), or SLURM_NTASKS (if SLURM_CPUS_PER_TASK has not been set). If SLURM_NTASKS is unset, proceed as before.

The problem is of course resolved by using the -t flag. It is, however, inconvenient that one must then also alter the example_pipeline.sh and example_pipeline_pool.sh scripts, which are meant as elementary ready-to-run tests.
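
The detection rule proposed above can be sketched as follows. The function name available_cpus is hypothetical, and os.cpu_count() stands in for "proceed as before", since the existing ReLERNN logic is not shown in this issue.

```python
import os


def available_cpus(environ=os.environ):
    """Number of worker processes to start, honouring a Slurm allocation if present."""
    ntasks = environ.get("SLURM_NTASKS")
    if ntasks is None:
        # Not running under Slurm (or the variable is unset): count all CPUs.
        return os.cpu_count() or 1
    cpus_per_task = environ.get("SLURM_CPUS_PER_TASK")
    if cpus_per_task is None:
        return int(ntasks)
    return int(ntasks) * int(cpus_per_task)


print(available_cpus({"SLURM_NTASKS": "2", "SLURM_CPUS_PER_TASK": "4"}))  # prints 8
```

On Linux, len(os.sched_getaffinity(0)) is another way to count only the CPUs the current process may use, which may also reflect scheduler limits depending on how the cluster is configured.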

Issue with seed in examples

Hi!

I just installed ReLERNN and tried it on the example dataset, but I got some unexpected issue during the simulation stage.
Here is the error message

Traceback (most recent call last):
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/j/jbruxaux/.local/lib/python3.11/site-packages/ReLERNN/simulator.py", line 301, in worker_simulate
    result_q.put([i,self.runOneMsprimeSim(i,direc)])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j/jbruxaux/.local/lib/python3.11/site-packages/ReLERNN/simulator.py", line 87, in runOneMsprimeSim
    random.seed(SEED)
  File "/hpc2n/eb/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/random.py", line 160, in seed
    raise TypeError('The only supported seed types are: None,\n'
TypeError: The only supported seed types are: None,
int, float, str, bytes, and bytearray.

I am using python 3.11, TensorFlow 2.13.0, CUDA 11.4.1 and cuDNN 8.2.2.26 on a v100 gpu node.
Am I missing something?

Thanks for your help!
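
Since Python 3.11, random.seed() accepts only None, int, float, str, bytes, and bytearray, so an integer-like object of another type is rejected. The issue does not show what type SEED has at simulator.py line 87; the sketch below assumes it is an integer-like object (for example a NumPy integer) and converts it to a built-in int first. The helper as_builtin_seed and the stand-in class IntegerLike are hypothetical.

```python
import operator
import random


def as_builtin_seed(seed):
    """Convert an integer-like seed to a plain int; pass None through unchanged."""
    if seed is None:
        return None
    return operator.index(seed)  # raises TypeError for non-integer objects


class IntegerLike:
    """Stand-in for a non-int integer type such as a NumPy integer."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


random.seed(as_builtin_seed(IntegerLike(123)))
first_draw = random.random()
random.seed(123)
print(first_draw == random.random())  # prints True
```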
