Comments (67)
Thank you, I've made it a template now, but would recommend maybe just doing a git clone --recursive for now. I'm not fully sure of the purpose of the templating, this was residual from the snakemake cookie-cutter template.
I'll edit the readme accordingly.
from rna-seq-pop.
I will do. For now, can you add this into resources folder (remove the .txt). It wont be used.
exampleMutations.tsv.txt
from rna-seq-pop.
Ive just added a fix which removes the need for exampleMutations and makes a dummy dataframe instead. can you try a git pull and try again :)
from rna-seq-pop.
Yeah sure :) would you mind making a new issue re allowing custom snpEff databases and then have a go? Thanks
from rna-seq-pop.
Morning Victor. Could you send over an abundance.tsv file and also your full gene_names.tsv file? I need to fix something relating to mapping transcripts to genes.
from rna-seq-pop.
Hi there! :)
currently there is a limitation of the pipeline is that it assumes we have genes anchored to chromosomes. This is not a problem for DE and related analysis but becomes so for variant calling and the subsequent analyses, as the workflow splits the genome by chromosome for convenience. This would be fine for falciparum/vivax but I assume your non-human malaria species does not have genes anchored to chromosomes. is this the case?
However, the only real part of the pipeline which truly requires genes->chromosomes is the windowed Fst and PBS analysis to detect selection. This could still be done solely at the gene-level, so it would definitely be possible to alter the workflow to work for non-model organisms. However, I think this would take a decent amount of work to allow the two options seamlessly (model v non-model). Its something that I have been thinking about implementing, but unfortunately I don't currently have the time to do so. If you know snakemake and wanted to give it a go I'd be happy to help.
Cheers
Sanj
from rna-seq-pop.
I'm not sure what you mean by 'genes anchored to a chromosome'. My species has a published, annotated genome which I map to. Is that sufficient?
from rna-seq-pop.
Apologies - what I mean is that, for example, the plasmodium falciparum 3D7 reference genome has 14 full chromosomes, almost all genes are assigned to a chromosome, and their specific position on the chromosome is known. In this workflow, currently, 14 different VCFs for each chromosome would be produced.
Alternatively, in less-well annotated organisms, instead of a few full length chromosomes, the genome assembly might have hundreds to thousands of small scaffolds. Currently, that would make it problematic for the workflow, and for interpreting the results. I hope that helps.
What is your malaria species?
from rna-seq-pop.
Ah I see.
I work on Plasmodium relictum, avian malaria.
As you can see, I have 14 chromosomes and the mito and apicoplast chromosomes. Is there a way I could remove the 460 short scaffolds after mapping to simplify the representation? Or perhaps the the short scaffolds could be concatenated into a meta-chromosome for simpler representation of results?
Anyway, I don't think that is the hardest part for me. The part I am struggling with is getting started. How should I remove the analyses from the workflow that I can't use, like the resistance genes, or account for the lack a of pre-existing snp database?
EDIT:
Ah I see there is an 'Activate' option in the example config file. Very helpful!
from rna-seq-pop.
Ok, awesome! Well certainly if you just list the 14 chromosomes + MIto + apicoplast in the config file, for now, the other small scaffolds will be ignored for the purposes of the variant calling and extra variant analyses. Genes on the small scaffolds will still be used for things like differential expression, etc.
Apologies - I am in the process of writing a README in the config/ folder, with documentation on how to configure the workflow (currently there is not a lot of documentation, so im not surprised if it might be confusing!). This is a priority and so hoping to get this done in the next week or so. I'm also writing the manuscript for the workflow which may also aid interpretation. Anyway, if you want to give it a go, ill try and give you a hand where there are issues! :)
from rna-seq-pop.
Thank you so much! I am definitely going to try.
from rna-seq-pop.
When I try to create a template clone of the directory using the github cli app, I get his error:
GraphQL error: Could not clone: sanjaynagi/rna-seq-ir is not a template repository.
I think you need to mark this repo as a template repo by doing this.
Btw, the link you have in the README about cloning/forking the directory as a template does not help. Github's UI has changed, or perhaps it's because you have not made this repo a template repo.
from rna-seq-pop.
OK, I've hit my first speed bump.
I've adapted the config file, all the resources, sample name and gene name files.
then I try the dry run.
And I get this error.
snakemake --use-conda -n
---------------------------- RNA-Seq-IR ----------------------------
Running RNA-Seq-IR snakemake workflow in /scratch/victor/rna-seq-pr/workflow
Author: Sanjay Curtis Nagi
Workflow Version: v0.4.0
Execution time: 2021-05-27 14:27:38.687554
Dataset: six_nonsiskins
AttributeError in line 40 of /scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk:
'list' object has no attribute 'Name'
File "/scratch/victor/rna-seq-pr/workflow/Snakefile", line 32, in <module>
File "/scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk", line 40, in <module>
This is line 40 of variantAnalysis.smk
mut=mutationData.Name,
Here is the config file. Ignore the .txt at the end. Just needed that to upload to github.
config.yaml.txt
from rna-seq-pop.
Ok - this was a bug relating to that rule mpileupIR still needing a mutations file even when IRmutations is inactivated in the config. Strange, as snakemake shouldnt really be needing to look for that when the rule isnt gonna be ran. I've just pushed a temporary fix to github which reads in the exampleMutations.tsv file, if you want to do a git pull! :)
from rna-seq-pop.
Thanks!
from rna-seq-pop.
I'm not a frequent git user, so I'm missing something obvious.
First I :
git clone --recursive https://github.com/sanjaynagi/rna-seq-ir.git
then I
git pull
But it says
Already up to date.
How do I git pull the latest version?
from rna-seq-pop.
Is this still the case? sorry, i find it difficult to check these things, as being the repo owner gives me more permissions...
if you run grep example workflow/Snakefile
the result should be mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t")
if the fix has been incorporated.
otherwise, does completing the steps in section 6 of the README help at all?
Thanks, I appreciate your help in testing this!
from rna-seq-pop.
It's still the old Snakefile file.
This is what happened when trying to follow section 6
git remote add -f upstream https://github.com/snakemake-workflows/rna-seq-ir.git
fatal: remote upstream already exists.
git fetch upstream
ERROR: Repository not found.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
I feel really stupid..
Of course I can manually make the change to the Snakefile, but that is a bit of a hack.
from rna-seq-pop.
Dont feel stupid!! :) its my fault that I havent sorted this yet, you're my first user of the workflow outside of my department, thats all. Im going to try and use a different machine without my github ssh so i can test these steps.
In the meantime - do you want to make a backup of your config.yaml, samples.tsv, and anything extra you've put in the repo (reference files etc). And then remove the folder and do a fresh git clone --recursive ? hopefully it'll work then.
from rna-seq-pop.
out of interest - did you originally do a git clone --recursive thing to get the repo or did you do the template thing?
from rna-seq-pop.
You did a template thing I see. I've just been attempting to do the same and Im struggling to get it working. Id recommend just doing git clone --recursive on the rna-seq-ir (or forking it), and im gonna remove all the templating advice from the README
from rna-seq-pop.
out of interest - did you originally do a git clone --recursive thing to get the repo or did you do the template thing?
At first I did a template, but since I complained about the upstream issue, I've done the git clone --recursive.
Snakemake is running now.
But I'm getting this error. Perhaps you can make it optional to use example mutations, since I don't have any pre-existing snp database.
snakemake -n --use-conda
FileNotFoundError in line 24 of /scratch/victor/rna-seq-ir/workflow/Snakefile:
[Errno 2] No such file or directory: 'resources/exampleMutations.tsv'
File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 24, in <module>
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 610, in read_csv
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 462, in _read
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 819, in __init__
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1867, in __init__
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/common.py", line 642, in get_handle
from rna-seq-pop.
That did not fix it. I still ge the same error.
I also tried commenting out these 2 lines in the Snakemake file:
else:
mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t")
but then I get this error, because some rules still need mutation data.
NameError in line 40 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
name 'mutationData' is not defined
File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 31, in <module>
File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 40, in <module>
from rna-seq-pop.
The dry run completed successfully! Thank you for your help Sanjay.
from rna-seq-pop.
I found a Plasmodium relictum snpEff database called Plasmodium_relictum_gca_900005765
I can run
snpEff download Plasmodium_relictum_gca_900005765
without error (though I don't know where the file goes to).
However, when I try run the pipeline, that snpEff database download rule fails.
This is the failing code:
snpEff download Plasmodium_relictum_gca_900005765 2> logs/snpEff/snpEffDbDownload.log
This is the log:
java.lang.RuntimeException: Property: 'Plasmodium_relictum_gca_900005765.genome' not found
at org.snpeff.interval.Genome.(Genome.java:106)
at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:681)
at org.snpeff.snpEffect.Config.readConfig(Config.java:649)
at org.snpeff.snpEffect.Config.init(Config.java:480)
at org.snpeff.snpEffect.Config.(Config.java:117)
at org.snpeff.SnpEff.loadConfig(SnpEff.java:451)
at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80)
at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72)
at org.snpeff.SnpEff.run(SnpEff.java:1183)
at org.snpeff.SnpEff.main(SnpEff.java:162)
Do you know why the ".download" is added at the end of the name? Cause it seems that is the cause of the problem.
from rna-seq-pop.
from rna-seq-pop.
Plasmodium_relictum_gca_900005765 is the record in the database when I grep for Plasmodium_relictum.
And Plasmodium_relictum_gca_900005765 is what I put in the config file.
Conda is installing version SnpEff 4.3t 2017-11-24 into the environment.
Do you know why that might be?
I'll try specifying the latest version in the env file.
from rna-seq-pop.
OK! I think upgrading to a new version fixed that.
New error.
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)[Fri May 28 17:49:48 2021]
Error in rule Ag1000gSweepsDE:
jobid: 347
output: results/genediff/ag1000gSweeps/cond1_cond2_swept.tsv
log: logs/Ag1000gSweepsDE.log (check log file(s) for error message)
conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930
I think those are two separate errors, because when I look at that log file, I get something that is unrelated to the number of threads (btw, I am running this on a system with 256 threads, and I'm specifying 100 jobs at a time for snakemake).
Here is the log file:
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpi1y8xn68.Ag1000gSweepsDE.py", line 24, in <module>
signals = pd.read_csv("resources/signals.csv")
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 605, in read_csv
return _read(filepath_or_buffer, kwds)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 814, in __init__
self._engine = self._make_engine(self.engine)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1862, in __init__
self._open_handles(src, kwds)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1363, in _open_handles
storage_options=kwds.get("storage_options", None),
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/common.py", line 647, in get_handle
newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'resources/signals.csv'
There is indeed no signals.csv file there.
EDIT:
I'm stupid. I forgot to disable the Ag1000g sweep in the config file.
from rna-seq-pop.
Still getting the error about the number of threads even when running with 63 threads.
from rna-seq-pop.
from rna-seq-pop.
Ok, it should be version 5.00... that must be why. And try with less threads perhaps. I probably won’t be responsive until Tuesday now as it’s holiday over here but good luck!
No worries. Enjoy the wonderful weather (I hope)!
I have restricted it to running with one core. Now it stops with the same error consistently:
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
-------------- Reading VCF for chromosome PRELSG_01_v1 --------------
------- Filtering VCF at QUAL=30 and missingness proportion of 0.8 -------
[Sat May 29 17:24:05 2021]
Error in rule WindowedFstPBS:
*I've removed the details about the job specs, jumping straight to the error*
log: logs/WindowedFstPCA.log (check log file(s) for error message)
conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930
RuleException:
CalledProcessError in line 204 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930'; set -euo pipefail; python /scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py' returned non-zero exit status 1.
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 204, in __rule_WindowedFstPBS
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper
Shutting down, this might take some time.
Here is the log mentioned in the error:
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py", line 45, in <module>
missingfltprop=missingprop)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 115, in readAndFilterVcf
geno = allel.GenotypeArray(vcf['calldata/GT'].compress(passfilter, axis=0))
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1476, in __init__
check_ndim(self.values, 3)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/util.py", line 66, in check_ndim
raise TypeError('bad number of dimensions: expected %s; found %s' % (ndim, a.ndim))
TypeError: bad number of dimensions: expected 3; found 2
I'm wondering if there is something amiss with the vcf file, causing numpy to spit out that error (bad number of dimensions: expected 3; found
) as part of the WindowedFstPBS.
from rna-seq-pop.
Hey Victor,
I realised that the reason snpEff was v4.3.1 was because I had changed it to 5.0.0, but then snpSift latest conda version is 4.3.1, and it complained of being incompatible (snpSift is used later for the differential SNPs module which AFAIK is not working right now, I need to fix). I might try and remove the snpSift rule as its just filtering and i assume bcftools will be able to do the same.
re the latest error - what level of ploidy did you use in the config? if 1, this could be why, although allel.GenotypeArray() should work with variable ploidy. Are these pooled samples? I guess with plasmodium you need to pool lots of parasites for RNA-Seq? is that right?
from rna-seq-pop.
Im incorrect - GenotypeArray needs a ploidy of ABOVE 1 (I didnt know about this), and I should instead use a haplotypeArray. I will implement this as its important, but I do think if you are using pooled data then it might actually make more sense to use a higher ploidy, as allele frequencies in the sample will be captured more accurately then if we try and force it to call one allele only.
from rna-seq-pop.
Hi Sanjay
I did indeed specify a ploidy of one. The samples are not pooled, instead we sequence each sample very deeply (50x -80x). They are blood samples are from infected hosts. Each host will have a multitude of parasites that get sequenced, and this is RNA-seq so ploidy is tricky.. Not sure what ploidy to use.
I've rereun with a ploidy of 5 (why not). The filtering is working now, but I'm still getting the error about the number of threads:
Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
I ran it with 1 cpu.
from rna-seq-pop.
OK, in that case we can probably assume they are (relatively) clonal? and therefore I do agree that a ploidy of 1 makes the most sense. You'll have to give me a couple of days to implement.
Looks as though the nthreads issue is related to this cggh/scikit-allel#285 . I haven't run before on a machine with more than 64 threads so haven't come across it.
supposedly adding this to the top of the python script can fix it. Could you try and manually edit the workflow/scripts/tools.py script to do the following directly at the top? if it works, I could add it to the workflow. I think change 272 to whatever number of cores you have maybe, but even if not it should be OK.
import os
os.environ["NUMEXPR_MAX_THREADS"]="272"
import allel
from rna-seq-pop.
Specifying the environmental core count did the trick!
Now onto the next hiccup.
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp6xi5rrhl.WindowedFstPBS.py", line 45, in <module>
missingfltprop=missingprop)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 123, in readAndFilterVcf
ac = geno.count_alleles()
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1839, in count_alleles
max_allele = self.max()
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
I'm going to rerun all the variant analysis steps, in case there are issues with uncompleted jobs that aren't cleaned up correctly. If I say nothing after this, it didn't help.
from rna-seq-pop.
Yo. Ive edited the pipeline and it should now work with haploids :)
I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.
from rna-seq-pop.
Yo. Ive edited the pipeline and it should now work with haploids :)
I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.
Working on haploids is great! I'll pull that change :-)
from rna-seq-pop.
I've discovered that the snpEff Plasmodium relictum database is (unusably) out of date. I am building a new one based on the latest genome and annotations.
But snpEff stores it's databases in it's own working directory, which gets regenerated each time the conda environment is created. This makes it hard to specify a custom snpEff database on the fly, as a parameter in a Snakemake rule.
Of course, one could simply add a rule that builds the database based on params from the config file.
Would you like me to try my hand at that and push it to you?
Should this be its own issue?
from rna-seq-pop.
Latest error when running StatisticsAndPCA.py:
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjm6zc5wp.StatisticsAndPCA.py", line 59, in <module>
missingfltprop=missingprop)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 128, in readAndFilterVcf
ac = geno.count_alleles()
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 2394, in count_alleles
max_allele = self.max()
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
from rna-seq-pop.
I'm all of a sudden running into a new error that has not occured before, which doesn't make sense, because they used to work.
It applies to both DifferentialIsoformExpression
and DifferentialGeneExpression
Error in rule DifferentialIsoformExpression:
jobid: 228
output: results/isoformdiff/cond1_cond2.csv, results/isoformdiff/six_nonsiskins_isoformdiffexp.xlsx
log: logs/DifferentialIsoformExpression.log (check log file(s) for error message)
conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591
RuleException:
CalledProcessError in line 100 of /scratch/victor/rna-seq-ir/workflow/rules/diff.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591'; set -euo pipefail; Rscript --vanilla /scratch/victor/rna-seq-ir/.snakemake/scripts/tmpnn9aq7u2.SleuthIsoformsDE.R' returned non-zero exit status 1.
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
File "/scratch/victor/rna-seq-ir/workflow/rules/diff.smk", line 100, in __rule_DifferentialIsoformExpression
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper
Here is the relevant part of the log:
Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
Backtrace:
This script has quite a few input files:
config/samples.tsv, resources/GeneNames.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52
Here's the samples file:
samples treatment species strain
R47 cond1 prelictum SGS1
R48 cond1 prelictum SGS1
R49 cond1 prelictum SGS1
R50 cond2 prelictum SGS1
R51 cond2 prelictum SGS1
R52 cond2 prelictum SGS1
and part of GeneNames.tsv file
Gene_stable_ID Gene_name Gene_description
PRELSG_03_v1 "" conserved Plasmodium protein%2C unknown function
PRELSG_14_v1 "" conserved Plasmodium protein%2C unknown function
PRELSG_08_v1 "" conserved Plasmodium protein%2C unknown function
PRELSG_02_v1 "" transcription initiation factor TFB%2C putative
PRELSG_12_v1 "" GTP-binding protein%2C putative
Here's a quant file abundance.tsv:
target_id length eff_length est_counts tpm
PRELSG_0100100.1 1764 1568.65 0 0
PRELSG_0100200.1 73 29.154 0 0
PRELSG_0100300.1 84 32.6917 0 0
PRELSG_0100400.1 312 119.706 0 0
PRELSG_0100500.1 5073 4877.65 0 0
PRELSG_0100600.1 534 338.669 0 0
PRELSG_0100700.1 3777 3581.65 0 0
PRELSG_0100800.1 420 224.851 0 0
PRELSG_0100900.1 2061 1865.65 0 0
PRELSG_0101000.1 6195 5999.65 14 3.04701
PRELSG_0101100.1 786 590.645 24 53.0586
from rna-seq-pop.
The last error should now be fixed I think, the case_when statement was not evaluating quite right. Try a git pull :)
FYI I've changed the 'samples' column to be named 'sampleID' in the samples.tsv file, so that will need changing.
Could you also try a much lower missingness (under pbs: missingness: in config file, I probably need to change that as its confusing) and see if the StatisticsAndPCA works then? thanks.
from rna-seq-pop.
Thanks!
The differential expression is still giving the same error:
Joining, by = "sampleID"
--- Running DESeq2 differential expression analysis on cond1_cond2 ---
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
As far as I can tell, I am using the latest version, as the modified time stamps on the rule files are from today.
from rna-seq-pop.
gene_names.tsv.txt
abundance.tsv.txt
from rna-seq-pop.
Thanks Victor. I've edited the workflow - previously it was only suited to using geneIDs from VectorBase.
it now needs the gene names file in a slightly different format which maps genes to transcripts - tab separated 4 columns - GeneID, TranscriptID, GeneName, GeneDescription although the description is optional. In the config file it is now called genes2transcripts also instead of gene_names. this will be in the exampleconfig that gets pulled, and there is an example Gene2TranscriptMap.tsv.
This should fix it but let me know if it doesnt.
edit: just had a look at your gene_names file and there is no column names? which would mean it definitely wouldn't work anyway. But it needed the above change in any case.
from rna-seq-pop.
OK, I actually sent you the wrong gene names file. Mine is called GeneNames.tsv
It has the header which gene_names.tsv
was missing.
Gene_stable_ID Gene_name Gene_description
PRELSG_03_v1 "" conserved Plasmodium protein%2C unknown function
PRELSG_14_v1 "" conserved Plasmodium protein%2C unknown function
PRELSG_08_v1 "" conserved Plasmodium protein%2C unknown function
But this is irrelevant now anyway.
My genes don't have canonical names, or at least not ones I can quickly find, so I assumed that the "" would be enough to specify an empty value.
I'll attempt the new version.
from rna-seq-pop.
I don't think you've pushed your changes to the repo.
from rna-seq-pop.
whoops. good point! Done now. yeah, empty gene names is fine :)
from rna-seq-pop.
When executing Differential expression rule:
------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted
When executing Isoform Differential expression rule:
Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
from rna-seq-pop.
genesList.tsv.txt
Here is the Gene2transcripts file.
from rna-seq-pop.
The transcriptIDs should match with that found in abundance.tsv - Kallisto should be mapping to the reference transcriptome fasta file, and so the 'target_id' from Kallisto should match with the fasta headers.
from rna-seq-pop.
It was reporting the exon IDs before, but now it's reporting the identical ID in the transcript file.
But I am getting the same error as before.
Gene diff expr.
------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted
Isoform DE:
Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
from rna-seq-pop.
difficult to figure out whats going on, without being able to run the script. Do you have any missing rows or duplicates in the Transcript ID column? I would try and run the script in rstudio and see where it falters, just edit the snakemake@input bits to be the paths to your files
from rna-seq-pop.
How do I edit the snakemake@input bits' to be paths to my files? I'm quite lost.
I see the lines
metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()
contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
but I don't know how to change them to refer to the file paths instead.
metadata
and genes2transcripts
refers to inputs in the rules.
I have a list of input files generated when the rule fails:
input: config/samples.tsv, resources/Gene2TranscriptMap.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52, results/quant/B7, results/quant/B8, results/quant/B9
EDIT!
I'm figuring it out...
from rna-seq-pop.
Hey Victor. What i mean, is to open the Rscript outside of the snakemake pipeline in Rstudio. You may want to make a copy of the script of or something.
then change
metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
to
metadata = fread("config/samples.tsv", sep="\t") %>% as.data.frame()
contrastsdf = data.frame("contrast" = list("cond1_cond2"))
and do similar for gene_names (genes2transcript file). Then you should be able to run the Rscript step by step (you may need to install the packages), and it will be clearer where the script is failing.
from rna-seq-pop.
Yup! You're too nice Sanjay, and I'm too impatient. I struggled a bit after asking for help, and figured it.
I've gotten to the same point in the script where it's failing when run with sn.
I'll now try figure out why it's saying there are missing row names.
from rna-seq-pop.
>sum(is.na(counts$GeneID))
[1] 1
> sum(is.na(gene_names$GeneID))
[1] 0
The very last row of counts
has a missing GeneID.
5177 | PRELSG_MIT00300 | 2108.0000 | 1961 | 1186.00000 | 3249.000 | 1.24500e+03 | 1341.00000
5178 | NA | 53001.0000 | 77939 | 32034.00000 | 46699.000 | 1.09758e+05 | 38210.00000
I hate off-by-one errors.
I'm trying to figure out if there there's a line in my Gene2TranscriptMap.tsv that's missing a GeneID.
EDIT
I can't find the missing gene ID. I visually inspected the whole file twice.
from rna-seq-pop.
I have 5312 entries in the abundance.tsv file. None of the transcript ID's repeat, but I have 5178 entries in the Gene2Transcript file.
All of the genome resources are the same version.
I'm lost now.
from rna-seq-pop.
I've just stripped them from the PrelictumSG1-like transcriptome.fasta. it has 5312 entries now, but you'll have to do some sort of join if you want gene names and descriptions.
i did the following....
grep -e ">" PlasmoDB-52_PrelictumSGS1-like_AnnotatedTranscripts.fasta | cut -c 2-50 > prelictum_headers.fa
then in r...
library(data.table)
library(tidyverse)
names = fread("~/prelictum_headers.fa", header = FALSE)[,1:2]
colnames(names) = c("TranscriptID", "GeneID")
names$GeneID = str_remove(names$GeneID, "gene=")
names$GeneName = ""
fwrite(names, "~/Prelictum_Genes2TranscriptMap.tsv", sep="\t", col.names = TRUE)
Prelictum_Genes2TranscriptMap.tsv.txt
from rna-seq-pop.
Huh.. It works now. Thanks.
I built my Genes2Transcripts file from the gff file. And that has 5178 mRNA entries. This is not relevant to your project, but I wonder why the gff file has fewer transcripts than the Annotated Transcripts fasta file.
from rna-seq-pop.
I've had a frustrating few days trying to figure out why the code running on the local R session worked, but not when run part of the sm workflow.
Well, turns out it actually wasn't running all the way through locally.
Here is the error while running the DESeq2 function:
--- Running DESeq2 differential expression analysis on cond1_cond2 ---
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` input `Gene_name`.
x must be a character vector, not a logical vector.
ℹ Input `Gene_name` is `case_when(...)`.
I get this identical error locally and in the SM workflow.
I don't have canonical gene names. but not sure if that is what the error is referring too. It's just the GeneID, right?
from rna-seq-pop.
I think I found the problem.
labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
is.na(GeneName) ~ GeneID,
TRUE ~ GeneID)) %>% select(Gene_name) %>% deframe()
I changed TRUE ~ Gene_name
to TRUE ~ GeneID
and it seems to work now.
I was getting nearly the identical errors for the Isoform difexpr rule.
So I made the same change, swapping out GeneID
for TranscriptID
in the same function.
I'm not sure if I broke it though, since I don't understand the function.
from rna-seq-pop.
Next up!
ExtractBedVCF throws this error
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmprgz_st1m.ExtractBedVCF.py", line 21, in <module>
pos = vcf['variants/POS']
TypeError: 'NoneType' object is not subscriptable
I tried
print(vcf)
pos = vcf['variants/POS']
And it printed a long a log vcf file to the log file, so I know it exists.
The last chromosome it reads is annot.missense.PRELSG_API_v1.vcf.
It only has one SNP according to grep -v '#' annot.missense.PRELSG_API_v1.vcf | wc -l
Here is that vcf file.
annot.missense.PRELSG_API_v1.vcf.txt
from rna-seq-pop.
Here's 2 more unrelated errors
When running PerGeneFstPBS.py
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpin7thiea.PerGeneFstPBS.py", line 143, in <module>
ndf.columns = ['GeneID', 'nSNPs']
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 5475, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 669, in _set_axis
self._mgr.set_axis(axis, labels)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 221, in set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
and I get this when running StatisticsAndPCA.py
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjvl4kifb.StatisticsAndPCA.py", line 113, in <module>
pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 253, in pca
fig_pca(coords1, model1, f"PCA {chrom} {dataset}", f"results/variants/plots/PCA-{chrom}-{dataset}", samples, pop_colours, sample_population=populations)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 234, in fig_pca
plot_pca_coords(coords, model, 0, 1, ax, sample_population, samples, pop_colours)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 215, in plot_pca_coords
y = coords[:, pc2]
IndexError: index 1 is out of bounds for axis 1 with size 1
from rna-seq-pop.
those latest errors must be to do with the lack of SNPs on the apicoplast and possibly mitochondrial chromosome - I would remove this from the analysis personally.
from rna-seq-pop.
Interesting! We're making progress!
Removing those chromosomes worked. Perhaps the workflow would be less fragile if a lack of SNPs wouldn't stall it.
Anyway, here is the next error from AlleleTables
:
[mpileup] 1 samples in 1 input files
r/rna-seq-ir/workflow/scripts/mpileup2readcounts/mpileup2readcounts: No such file or directory
The /rna-seq-ir/workflow/scripts/mpileup2readcounts/
is indeed empty.
I don't know why. It should have been cloned with everything else.
I see that it is part of another repo.
I don't know why there is a leading r
in the error though r/rna-seq-ir/workfl...
. Probably just a printing artifact in samtools.
from rna-seq-pop.
I downloaded mpileup2readcounts.cc directly from the remote repo into the correct directory and compiled it according to the readme. And it works!
I'm still getting the same PerGeneFstPBS error as before. Removing the 2 small chromosomes did not fix this one.
from rna-seq-pop.
Related Issues (20)
- add venn and heatmap to documentation HOT 1
- check codecov
- add tests for main functions in rnaseqpoptools
- In docs, do example analysis of bouake
- remove compkaryo and mpileup2readcounts submodules HOT 1
- rule that unzips a .fa.gz ref file HOT 1
- implement own gsea from GAF file
- move gwss into different folders based on window size HOT 1
- go on a code removal and tidying spree HOT 1
- Create Jupyter book and convert appropriate analyses to ipynb and papermill rather than py HOT 2
- Ensure meets FAIR principles and add to workflowhub.eu
- Configure conda envs so they work with conda channel priority = strict
- Allow .tsv or .xlsx or .csv metadata input HOT 1
- fix single-end reads with column input HOT 1
- xlsx and csv fail due to common.smk
- Change some config naming to make more sense HOT 1
- make isoform level DE optional
- provide singularity containers for the conda envs
- matplotlib can't import docstring causes error in checkinputs HOT 3
- remove venn diagrams HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rna-seq-pop.