This looks like a comprehensive workflow. I'd like to use it to do DE analysis and var

Ah I see. I work on <a href="https://plasmodb.org/plasmo/app/record/organism/TMPTX

Adapting workflow to other species,about sanjaynagi/rna-seq-pop

Comments (67)

sanjaynagi commented on September 26, 2024 1

Thank you, I've made it a template now, but would recommend maybe just doing a git clone --recursive for now. I'm not fully sure of the purpose of the templating, this was residual from the snakemake cookie-cutter template.

I'll edit the readme accordingly.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024 1

I will do. For now, can you add this into resources folder (remove the .txt). It wont be used.
exampleMutations.tsv.txt

from rna-seq-pop.

sanjaynagi commented on September 26, 2024 1

Ive just added a fix which removes the need for exampleMutations and makes a dummy dataframe instead. can you try a git pull and try again :)

from rna-seq-pop.

sanjaynagi commented on September 26, 2024 1

Yeah sure :) would you mind making a new issue re allowing custom snpEff databases and then have a go? Thanks

from rna-seq-pop.

sanjaynagi commented on September 26, 2024 1

Morning Victor. Could you send over an abundance.tsv file and also your full gene_names.tsv file? I need to fix something relating to mapping transcripts to genes.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Hi there! :)

currently there is a limitation of the pipeline is that it assumes we have genes anchored to chromosomes. This is not a problem for DE and related analysis but becomes so for variant calling and the subsequent analyses, as the workflow splits the genome by chromosome for convenience. This would be fine for falciparum/vivax but I assume your non-human malaria species does not have genes anchored to chromosomes. is this the case?

However, the only real part of the pipeline which truly requires genes->chromosomes is the windowed Fst and PBS analysis to detect selection. This could still be done solely at the gene-level, so it would definitely be possible to alter the workflow to work for non-model organisms. However, I think this would take a decent amount of work to allow the two options seamlessly (model v non-model). Its something that I have been thinking about implementing, but unfortunately I don't currently have the time to do so. If you know snakemake and wanted to give it a go I'd be happy to help.

Cheers
Sanj

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I'm not sure what you mean by 'genes anchored to a chromosome'. My species has a published, annotated genome which I map to. Is that sufficient?

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Apologies - what I mean is that, for example, the plasmodium falciparum 3D7 reference genome has 14 full chromosomes, almost all genes are assigned to a chromosome, and their specific position on the chromosome is known. In this workflow, currently, 14 different VCFs for each chromosome would be produced.

Alternatively, in less-well annotated organisms, instead of a few full length chromosomes, the genome assembly might have hundreds to thousands of small scaffolds. Currently, that would make it problematic for the workflow, and for interpreting the results. I hope that helps.

What is your malaria species?

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Ah I see.
I work on Plasmodium relictum, avian malaria.
As you can see, I have 14 chromosomes and the mito and apicoplast chromosomes. Is there a way I could remove the 460 short scaffolds after mapping to simplify the representation? Or perhaps the the short scaffolds could be concatenated into a meta-chromosome for simpler representation of results?

Anyway, I don't think that is the hardest part for me. The part I am struggling with is getting started. How should I remove the analyses from the workflow that I can't use, like the resistance genes, or account for the lack a of pre-existing snp database?

EDIT:
Ah I see there is an 'Activate' option in the example config file. Very helpful!

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Ok, awesome! Well certainly if you just list the 14 chromosomes + MIto + apicoplast in the config file, for now, the other small scaffolds will be ignored for the purposes of the variant calling and extra variant analyses. Genes on the small scaffolds will still be used for things like differential expression, etc.

Apologies - I am in the process of writing a README in the config/ folder, with documentation on how to configure the workflow (currently there is not a lot of documentation, so im not surprised if it might be confusing!). This is a priority and so hoping to get this done in the next week or so. I'm also writing the manuscript for the workflow which may also aid interpretation. Anyway, if you want to give it a go, ill try and give you a hand where there are issues! :)

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Thank you so much! I am definitely going to try.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

When I try to create a template clone of the directory using the github cli app, I get his error:
GraphQL error: Could not clone: sanjaynagi/rna-seq-ir is not a template repository.
I think you need to mark this repo as a template repo by doing this.

Btw, the link you have in the README about cloning/forking the directory as a template does not help. Github's UI has changed, or perhaps it's because you have not made this repo a template repo.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

OK, I've hit my first speed bump.
I've adapted the config file, all the resources, sample name and gene name files.
then I try the dry run.
And I get this error.

snakemake --use-conda -n
---------------------------- RNA-Seq-IR ----------------------------
Running RNA-Seq-IR snakemake workflow in /scratch/victor/rna-seq-pr/workflow

Author:   Sanjay Curtis Nagi
Workflow Version: v0.4.0
Execution time:  2021-05-27 14:27:38.687554
Dataset: six_nonsiskins


AttributeError in line 40 of /scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk:
'list' object has no attribute 'Name'
  File "/scratch/victor/rna-seq-pr/workflow/Snakefile", line 32, in <module>
  File "/scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk", line 40, in <module>

This is line 40 of variantAnalysis.smk
mut=mutationData.Name,

Here is the config file. Ignore the .txt at the end. Just needed that to upload to github.
config.yaml.txt

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Ok - this was a bug relating to that rule mpileupIR still needing a mutations file even when IRmutations is inactivated in the config. Strange, as snakemake shouldnt really be needing to look for that when the rule isnt gonna be ran. I've just pushed a temporary fix to github which reads in the exampleMutations.tsv file, if you want to do a git pull! :)

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Thanks!

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I'm not a frequent git user, so I'm missing something obvious.
First I :
git clone --recursive https://github.com/sanjaynagi/rna-seq-ir.git
then I
git pull
But it says
Already up to date.
How do I git pull the latest version?

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Is this still the case? sorry, i find it difficult to check these things, as being the repo owner gives me more permissions...

if you run grep example workflow/Snakefile the result should be mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t") if the fix has been incorporated.

otherwise, does completing the steps in section 6 of the README help at all?
Thanks, I appreciate your help in testing this!

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

It's still the old Snakefile file.
This is what happened when trying to follow section 6


git remote add -f upstream https://github.com/snakemake-workflows/rna-seq-ir.git
fatal: remote upstream already exists.

git fetch upstream
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I feel really stupid..

Of course I can manually make the change to the Snakefile, but that is a bit of a hack.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Dont feel stupid!! :) its my fault that I havent sorted this yet, you're my first user of the workflow outside of my department, thats all. Im going to try and use a different machine without my github ssh so i can test these steps.

In the meantime - do you want to make a backup of your config.yaml, samples.tsv, and anything extra you've put in the repo (reference files etc). And then remove the folder and do a fresh git clone --recursive ? hopefully it'll work then.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

out of interest - did you originally do a git clone --recursive thing to get the repo or did you do the template thing?

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

You did a template thing I see. I've just been attempting to do the same and Im struggling to get it working. Id recommend just doing git clone --recursive on the rna-seq-ir (or forking it), and im gonna remove all the templating advice from the README

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

out of interest - did you originally do a git clone --recursive thing to get the repo or did you do the template thing?

At first I did a template, but since I complained about the upstream issue, I've done the git clone --recursive.

Snakemake is running now.
But I'm getting this error. Perhaps you can make it optional to use example mutations, since I don't have any pre-existing snp database.

snakemake -n --use-conda
FileNotFoundError in line 24 of /scratch/victor/rna-seq-ir/workflow/Snakefile:
[Errno 2] No such file or directory: 'resources/exampleMutations.tsv'
  File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 24, in <module>
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 610, in read_csv
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 462, in _read
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 819, in __init__
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1867, in __init__
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/common.py", line 642, in get_handle

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

That did not fix it. I still ge the same error.
I also tried commenting out these 2 lines in the Snakemake file:

else:
    mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t")

but then I get this error, because some rules still need mutation data.

NameError in line 40 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
name 'mutationData' is not defined
  File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 31, in <module>
  File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 40, in <module>

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

The dry run completed successfully! Thank you for your help Sanjay.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I found a Plasmodium relictum snpEff database called Plasmodium_relictum_gca_900005765

I can run
snpEff download Plasmodium_relictum_gca_900005765
without error (though I don't know where the file goes to).
However, when I try run the pipeline, that snpEff database download rule fails.
This is the failing code:

snpEff download Plasmodium_relictum_gca_900005765 2> logs/snpEff/snpEffDbDownload.log

This is the log:

java.lang.RuntimeException: Property: 'Plasmodium_relictum_gca_900005765.genome' not found
at org.snpeff.interval.Genome.(Genome.java:106)
at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:681)
at org.snpeff.snpEffect.Config.readConfig(Config.java:649)
at org.snpeff.snpEffect.Config.init(Config.java:480)
at org.snpeff.snpEffect.Config.(Config.java:117)
at org.snpeff.SnpEff.loadConfig(SnpEff.java:451)
at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80)
at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72)
at org.snpeff.SnpEff.run(SnpEff.java:1183)
at org.snpeff.SnpEff.main(SnpEff.java:162)

Do you know why the ".download" is added at the end of the name? Cause it seems that is the cause of the problem.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

What did you put in the config.yaml under snpeff ? I think it might be just plasmodium_relictum as opposed to the full name but I could be wrong here. What does it show if you do snpEff databases | grep ‘plasmodium_relictum’ ? ( or whatever grep finds relictum). I’m on a train without laptop otherwise would have looked myself. I think the snpeff version is the latest version (5.0) however there was a known bug in 4.5 in which it couldn’t find the Aedes aegypti genome

…

On 28 May 2021, at 15:06, vmkalbskopf ***@***.***> wrote: I found a Plasmodium relictum snpEff database called Plasmodium_relictum_gca_900005765 I can run snpEff download Plasmodium_relictum_gca_900005765 without error (though I don't know where the file goes to). However, when I try run the pipeline, that snpEff database download rule fails. This is the log: java.lang.RuntimeException: Property: 'Plasmodium_relictum_gca_900005765.genome' not found at org.snpeff.interval.Genome.(Genome.java:106) at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:681) at org.snpeff.snpEffect.Config.readConfig(Config.java:649) at org.snpeff.snpEffect.Config.init(Config.java:480) at org.snpeff.snpEffect.Config.(Config.java:117) at org.snpeff.SnpEff.loadConfig(SnpEff.java:451) at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80) at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72) at org.snpeff.SnpEff.run(SnpEff.java:1183) at org.snpeff.SnpEff.main(SnpEff.java:162) Do you know why the ".download" is added at the end of the name? Cause it seems that is the cause of the problem. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Plasmodium_relictum_gca_900005765 is the record in the database when I grep for Plasmodium_relictum.
And Plasmodium_relictum_gca_900005765 is what I put in the config file.
Conda is installing version SnpEff 4.3t 2017-11-24 into the environment.
Do you know why that might be?

I'll try specifying the latest version in the env file.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

OK! I think upgrading to a new version fixed that.
New error.

Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)[Fri May 28 17:49:48 2021]
Error in rule Ag1000gSweepsDE:
    jobid: 347
    output: results/genediff/ag1000gSweeps/cond1_cond2_swept.tsv
    log: logs/Ag1000gSweepsDE.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930

I think those are two separate errors, because when I look at that log file, I get something that is unrelated to the number of threads (btw, I am running this on a system with 256 threads, and I'm specifying 100 jobs at a time for snakemake).
Here is the log file:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpi1y8xn68.Ag1000gSweepsDE.py", line 24, in <module>
    signals = pd.read_csv("resources/signals.csv")
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 814, in __init__
    self._engine = self._make_engine(self.engine)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1862, in __init__
    self._open_handles(src, kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1363, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/common.py", line 647, in get_handle
    newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'resources/signals.csv'

There is indeed no signals.csv file there.

EDIT:
I'm stupid. I forgot to disable the Ag1000g sweep in the config file.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Still getting the error about the number of threads even when running with 63 threads.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Ok, it should be version 5.00... that must be why. And try with less threads perhaps. I probably won’t be responsive until Tuesday now as it’s holiday over here but good luck!

…

On 28 May 2021, at 16:43, vmkalbskopf ***@***.***> wrote: Plasmodium_relictum_gca_900005765 is the record in the database when I grep for Plasmodium_relictum. And Plasmodium_relictum_gca_900005765 is what I put in the config file. Conda is installing version SnpEff 4.3t 2017-11-24 into the environment. Do you know why that might be? I'll try specifying the latest version in the env file. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Ok, it should be version 5.00... that must be why. And try with less threads perhaps. I probably won’t be responsive until Tuesday now as it’s holiday over here but good luck!

No worries. Enjoy the wonderful weather (I hope)!

I have restricted it to running with one core. Now it stops with the same error consistently:

Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
-------------- Reading VCF for chromosome PRELSG_01_v1 --------------
------- Filtering VCF at QUAL=30 and missingness proportion of 0.8 -------
[Sat May 29 17:24:05 2021]
Error in rule WindowedFstPBS:
*I've removed the details about the job specs, jumping straight to the error*

log: logs/WindowedFstPCA.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930

RuleException:
CalledProcessError in line 204 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930'; set -euo pipefail;  python /scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py' returned non-zero exit status 1.
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
  File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 204, in __rule_WindowedFstPBS
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper
Shutting down, this might take some time.

Here is the log mentioned in the error:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py", line 45, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 115, in readAndFilterVcf
    geno = allel.GenotypeArray(vcf['calldata/GT'].compress(passfilter, axis=0))
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1476, in __init__
    check_ndim(self.values, 3)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/util.py", line 66, in check_ndim
    raise TypeError('bad number of dimensions: expected %s; found %s' % (ndim, a.ndim))
TypeError: bad number of dimensions: expected 3; found 2

I'm wondering if there is something amiss with the vcf file, causing numpy to spit out that error (bad number of dimensions: expected 3; found) as part of the WindowedFstPBS.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Hey Victor,

I realised that the reason snpEff was v4.3.1 was because I had changed it to 5.0.0, but then snpSift latest conda version is 4.3.1, and it complained of being incompatible (snpSift is used later for the differential SNPs module which AFAIK is not working right now, I need to fix). I might try and remove the snpSift rule as its just filtering and i assume bcftools will be able to do the same.

re the latest error - what level of ploidy did you use in the config? if 1, this could be why, although allel.GenotypeArray() should work with variable ploidy. Are these pooled samples? I guess with plasmodium you need to pool lots of parasites for RNA-Seq? is that right?

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Im incorrect - GenotypeArray needs a ploidy of ABOVE 1 (I didnt know about this), and I should instead use a haplotypeArray. I will implement this as its important, but I do think if you are using pooled data then it might actually make more sense to use a higher ploidy, as allele frequencies in the sample will be captured more accurately then if we try and force it to call one allele only.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Hi Sanjay

I did indeed specify a ploidy of one. The samples are not pooled, instead we sequence each sample very deeply (50x -80x). They are blood samples are from infected hosts. Each host will have a multitude of parasites that get sequenced, and this is RNA-seq so ploidy is tricky.. Not sure what ploidy to use.

I've rereun with a ploidy of 5 (why not). The filtering is working now, but I'm still getting the error about the number of threads:
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
I ran it with 1 cpu.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

OK, in that case we can probably assume they are (relatively) clonal? and therefore I do agree that a ploidy of 1 makes the most sense. You'll have to give me a couple of days to implement.

Looks as though the nthreads issue is related to this cggh/scikit-allel#285 . I haven't run before on a machine with more than 64 threads so haven't come across it.

supposedly adding this to the top of the python script can fix it. Could you try and manually edit the workflow/scripts/tools.py script to do the following directly at the top? if it works, I could add it to the workflow. I think change 272 to whatever number of cores you have maybe, but even if not it should be OK.

import os
os.environ["NUMEXPR_MAX_THREADS"]="272"
import allel

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Specifying the environmental core count did the trick!

Now onto the next hiccup.

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp6xi5rrhl.WindowedFstPBS.py", line 45, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 123, in readAndFilterVcf
    ac = geno.count_alleles()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1839, in count_alleles
    max_allele = self.max()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

I'm going to rerun all the variant analysis steps, in case there are issues with uncompleted jobs that aren't cleaned up correctly. If I say nothing after this, it didn't help.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Yo. Ive edited the pipeline and it should now work with haploids :)

I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Yo. Ive edited the pipeline and it should now work with haploids :)

I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.

Working on haploids is great! I'll pull that change :-)

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I've discovered that the snpEff Plasmodium relictum database is (unusably) out of date. I am building a new one based on the latest genome and annotations.
But snpEff stores it's databases in it's own working directory, which gets regenerated each time the conda environment is created. This makes it hard to specify a custom snpEff database on the fly, as a parameter in a Snakemake rule.

Of course, one could simply add a rule that builds the database based on params from the config file.
Would you like me to try my hand at that and push it to you?
Should this be its own issue?

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Latest error when running StatisticsAndPCA.py:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjm6zc5wp.StatisticsAndPCA.py", line 59, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 128, in readAndFilterVcf
    ac = geno.count_alleles()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 2394, in count_alleles
    max_allele = self.max()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I'm all of a sudden running into a new error that has not occured before, which doesn't make sense, because they used to work.
It applies to both DifferentialIsoformExpression and DifferentialGeneExpression

Error in rule DifferentialIsoformExpression:
    jobid: 228
    output: results/isoformdiff/cond1_cond2.csv, results/isoformdiff/six_nonsiskins_isoformdiffexp.xlsx
    log: logs/DifferentialIsoformExpression.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591

RuleException:
CalledProcessError in line 100 of /scratch/victor/rna-seq-ir/workflow/rules/diff.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591'; set -euo pipefail;  Rscript --vanilla /scratch/victor/rna-seq-ir/.snakemake/scripts/tmpnn9aq7u2.SleuthIsoformsDE.R' returned non-zero exit status 1.
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
  File "/scratch/victor/rna-seq-ir/workflow/rules/diff.smk", line 100, in __rule_DifferentialIsoformExpression
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper

Here is the relevant part of the log:

Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
Backtrace:

This script has quite a few input files:

config/samples.tsv, resources/GeneNames.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52

Here's the samples file:

samples treatment       species strain
R47     cond1   prelictum       SGS1
R48     cond1   prelictum       SGS1
R49     cond1   prelictum       SGS1
R50     cond2   prelictum       SGS1
R51     cond2   prelictum       SGS1
R52     cond2   prelictum       SGS1

and part of GeneNames.tsv file

Gene_stable_ID  Gene_name       Gene_description
PRELSG_03_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_14_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_08_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_02_v1    ""      transcription initiation factor TFB%2C putative
PRELSG_12_v1    ""      GTP-binding protein%2C putative

Here's a quant file abundance.tsv:

target_id       length  eff_length      est_counts      tpm
PRELSG_0100100.1        1764    1568.65 0       0
PRELSG_0100200.1        73      29.154  0       0
PRELSG_0100300.1        84      32.6917 0       0
PRELSG_0100400.1        312     119.706 0       0
PRELSG_0100500.1        5073    4877.65 0       0
PRELSG_0100600.1        534     338.669 0       0
PRELSG_0100700.1        3777    3581.65 0       0
PRELSG_0100800.1        420     224.851 0       0
PRELSG_0100900.1        2061    1865.65 0       0
PRELSG_0101000.1        6195    5999.65 14      3.04701
PRELSG_0101100.1        786     590.645 24      53.0586

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

The last error should now be fixed I think, the case_when statement was not evaluating quite right. Try a git pull :)

FYI I've changed the 'samples' column to be named 'sampleID' in the samples.tsv file, so that will need changing.

Could you also try a much lower missingness (under pbs: missingness: in config file, I probably need to change that as its confusing) and see if the StatisticsAndPCA works then? thanks.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Thanks!

The differential expression is still giving the same error:

Joining, by = "sampleID"
 --- Running DESeq2 differential expression analysis on cond1_cond2 ---
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

As far as I can tell, I am using the latest version, as the modified time stamps on the rule files are from today.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

gene_names.tsv.txt
abundance.tsv.txt

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Thanks Victor. I've edited the workflow - previously it was only suited to using geneIDs from VectorBase.

it now needs the gene names file in a slightly different format which maps genes to transcripts - tab separated 4 columns - GeneID, TranscriptID, GeneName, GeneDescription although the description is optional. In the config file it is now called genes2transcripts also instead of gene_names. this will be in the exampleconfig that gets pulled, and there is an example Gene2TranscriptMap.tsv.

This should fix it but let me know if it doesnt.

edit: just had a look at your gene_names file and there is no column names? which would mean it definitely wouldn't work anyway. But it needed the above change in any case.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

OK, I actually sent you the wrong gene names file. Mine is called GeneNames.tsv
It has the header which gene_names.tsvwas missing.

Gene_stable_ID  Gene_name       Gene_description
PRELSG_03_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_14_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_08_v1    ""      conserved Plasmodium protein%2C unknown function

But this is irrelevant now anyway.
My genes don't have canonical names, or at least not ones I can quickly find, so I assumed that the "" would be enough to specify an empty value.

I'll attempt the new version.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I don't think you've pushed your changes to the repo.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

whoops. good point! Done now. yeah, empty gene names is fine :)

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

When executing Differential expression rule:

 
------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
  missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted

When executing Isoform Differential expression rule:

Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

genesList.tsv.txt
Here is the Gene2transcripts file.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

The transcriptIDs should match with that found in abundance.tsv - Kallisto should be mapping to the reference transcriptome fasta file, and so the 'target_id' from Kallisto should match with the fasta headers.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

It was reporting the exon IDs before, but now it's reporting the identical ID in the transcript file.
But I am getting the same error as before.

Gene diff expr.

 ------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
  missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted

Isoform DE:

Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

difficult to figure out whats going on, without being able to run the script. Do you have any missing rows or duplicates in the Transcript ID column? I would try and run the script in rstudio and see where it falters, just edit the snakemake@input bits to be the paths to your files

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

How do I edit the snakemake@input bits' to be paths to my files? I'm quite lost.

I see the lines

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()

contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])

but I don't know how to change them to refer to the file paths instead.
metadata and genes2transcripts refers to inputs in the rules.

I have a list of input files generated when the rule fails:

input: config/samples.tsv, resources/Gene2TranscriptMap.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52, results/quant/B7, results/quant/B8, results/quant/B9

EDIT!

I'm figuring it out...

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

Hey Victor. What i mean, is to open the Rscript outside of the snakemake pipeline in Rstudio. You may want to make a copy of the script of or something.

then change

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()

metadata = fread("config/samples.tsv", sep="\t") %>% as.data.frame()
contrastsdf = data.frame("contrast" = list("cond1_cond2"))

and do similar for gene_names (genes2transcript file). Then you should be able to run the Rscript step by step (you may need to install the packages), and it will be clearer where the script is failing.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Yup! You're too nice Sanjay, and I'm too impatient. I struggled a bit after asking for help, and figured it.

I've gotten to the same point in the script where it's failing when run with sn.

I'll now try figure out why it's saying there are missing row names.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

>sum(is.na(counts$GeneID))
[1] 1

> sum(is.na(gene_names$GeneID))
[1] 0

The very last row of counts has a missing GeneID.


5177 | PRELSG_MIT00300 | 2108.0000 | 1961 | 1186.00000 | 3249.000 | 1.24500e+03 | 1341.00000


5178 | NA | 53001.0000 | 77939 | 32034.00000 | 46699.000 | 1.09758e+05 | 38210.00000

I hate off-by-one errors.
I'm trying to figure out if there there's a line in my Gene2TranscriptMap.tsv that's missing a GeneID.

EDIT
I can't find the missing gene ID. I visually inspected the whole file twice.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I have 5312 entries in the abundance.tsv file. None of the transcript ID's repeat, but I have 5178 entries in the Gene2Transcript file.
All of the genome resources are the same version.
I'm lost now.

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

I've just stripped them from the PrelictumSG1-like transcriptome.fasta. it has 5312 entries now, but you'll have to do some sort of join if you want gene names and descriptions.

i did the following....

grep -e ">" PlasmoDB-52_PrelictumSGS1-like_AnnotatedTranscripts.fasta | cut -c 2-50 > prelictum_headers.fa
then in r...

library(data.table)
library(tidyverse)
names = fread("~/prelictum_headers.fa", header = FALSE)[,1:2]
colnames(names) = c("TranscriptID", "GeneID")
names$GeneID = str_remove(names$GeneID, "gene=")
names$GeneName = ""
fwrite(names, "~/Prelictum_Genes2TranscriptMap.tsv", sep="\t", col.names = TRUE)

Prelictum_Genes2TranscriptMap.tsv.txt

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Huh.. It works now. Thanks.

I built my Genes2Transcripts file from the gff file. And that has 5178 mRNA entries. This is not relevant to your project, but I wonder why the gff file has fewer transcripts than the Annotated Transcripts fasta file.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I've had a frustrating few days trying to figure out why the code running on the local R session worked, but not when run part of the sm workflow.
Well, turns out it actually wasn't running all the way through locally.
Here is the error while running the DESeq2 function:

 --- Running DESeq2 differential expression analysis on cond1_cond2 --- 
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` input `Gene_name`.
x must be a character vector, not a logical vector.
ℹ Input `Gene_name` is `case_when(...)`.

I get this identical error locally and in the SM workflow.
I don't have canonical gene names. but not sure if that is what the error is referring too. It's just the GeneID, right?

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I think I found the problem.

 labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                     is.na(GeneName) ~ GeneID,
                                     TRUE ~ GeneID)) %>% select(Gene_name) %>% deframe()

I changed TRUE ~ Gene_name to TRUE ~ GeneID and it seems to work now.

I was getting nearly the identical errors for the Isoform difexpr rule.
So I made the same change, swapping out GeneID for TranscriptID in the same function.

I'm not sure if I broke it though, since I don't understand the function.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Next up!
ExtractBedVCF throws this error

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmprgz_st1m.ExtractBedVCF.py", line 21, in <module>
    pos = vcf['variants/POS']
TypeError: 'NoneType' object is not subscriptable

I tried

print(vcf)
pos = vcf['variants/POS']

And it printed a long a log vcf file to the log file, so I know it exists.

The last chromosome it reads is annot.missense.PRELSG_API_v1.vcf.
It only has one SNP according to grep -v '#' annot.missense.PRELSG_API_v1.vcf | wc -l

Here is that vcf file.
annot.missense.PRELSG_API_v1.vcf.txt

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Here's 2 more unrelated errors

When running PerGeneFstPBS.py

  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpin7thiea.PerGeneFstPBS.py", line 143, in <module>
    ndf.columns = ['GeneID', 'nSNPs']
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 5475, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 669, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 221, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

and I get this when running StatisticsAndPCA.py

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjvl4kifb.StatisticsAndPCA.py", line 113, in <module>
    pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 253, in pca
    fig_pca(coords1, model1, f"PCA {chrom} {dataset}", f"results/variants/plots/PCA-{chrom}-{dataset}", samples, pop_colours, sample_population=populations)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 234, in fig_pca
    plot_pca_coords(coords, model, 0, 1, ax, sample_population, samples, pop_colours)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 215, in plot_pca_coords
    y = coords[:, pc2]
IndexError: index 1 is out of bounds for axis 1 with size 1

from rna-seq-pop.

sanjaynagi commented on September 26, 2024

those latest errors must be to do with the lack of SNPs on the apicoplast and possibly mitochondrial chromosome - I would remove this from the analysis personally.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

Interesting! We're making progress!
Removing those chromosomes worked. Perhaps the workflow would be less fragile if a lack of SNPs wouldn't stall it.

Anyway, here is the next error from AlleleTables:

[mpileup] 1 samples in 1 input files
r/rna-seq-ir/workflow/scripts/mpileup2readcounts/mpileup2readcounts: No such file or directory

The /rna-seq-ir/workflow/scripts/mpileup2readcounts/ is indeed empty.
I don't know why. It should have been cloned with everything else.
I see that it is part of another repo.

I don't know why there is a leading r in the error though r/rna-seq-ir/workfl.... Probably just a printing artifact in samtools.

from rna-seq-pop.

vmkalbskopf commented on September 26, 2024

I downloaded mpileup2readcounts.cc directly from the remote repo into the correct directory and compiled it according to the readme. And it works!

I'm still getting the same PerGeneFstPBS error as before. Removing the 2 small chromosomes did not fix this one.

from rna-seq-pop.

Adapting workflow to other species about rna-seq-pop HOT 67 CLOSED

Comments (67)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent