evolinc / evolinc-i

Python 29.75% Shell 37.89% R 27.42% Dockerfile 4.94%

rna-seq lincs cyverse-discovery-environment

evolinc-i's Issues

Modify "updated gff"

Change gene IDs in the updated gff so that they contain no "_" (underscores), as HTSeq version 0.6.1 appears to have an issue with underscores when they are part of the gene_id.
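A minimal sketch of the requested change, assuming the updated file uses GTF-style attributes of the form gene_id "..."; the function name and the choice of "." as the replacement character are hypothetical and not taken from Evolinc-I:

```python
import re

# Matches the quoted value of a gene_id attribute (GTF-style, assumed).
# The leading \b keeps keys such as transcript_id or ref_gene_id untouched.
GENE_ID_RE = re.compile(r'(\bgene_id\s+")([^"]*)(")')


def strip_underscores_from_gene_id(line: str, replacement: str = ".") -> str:
    """Replace underscores inside the gene_id value of one annotation line."""
    if line.startswith("#"):
        return line  # leave comment/header lines alone
    return GENE_ID_RE.sub(
        lambda m: m.group(1) + m.group(2).replace("_", replacement) + m.group(3),
        line,
    )


example = (
    'Chr1\tEvolinc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "TCONS_00000001"; transcript_id "TCONS_00000001.1";'
)
fixed = strip_underscores_from_gene_id(example)
print(fixed)
```

Only the gene_id value is rewritten here; whether transcript IDs also need changing for HTSeq is not stated in the issue.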

Hi,
I am interested in using the Evolinc pipeline. All my data is on a cluster, so I want to run Evolinc there from the command line. I tried to get Docker, but it cannot be installed on the cluster. Is there a way to run Evolinc on the command line on the cluster without Docker?

IUPAC ambiguity codes in FASTA file

An error (e.g. KeyError: 'R') occurs when running Evolinc-I on FASTA files that contain IUPAC ambiguity codes. The error did not occur when the ambiguous characters were replaced with 'N' and Evolinc-I was run again.
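The workaround described above (replacing ambiguous characters with 'N' before running Evolinc-I) could be sketched as follows. The function name is hypothetical, and the set of codes is the standard IUPAC nucleotide ambiguity letters other than N:

```python
import re

# IUPAC nucleotide ambiguity codes other than N, in either case.
AMBIGUITY_RE = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_ambiguity_codes(fasta_text: str) -> str:
    """Replace ambiguity codes in sequence lines with N, keeping letter case.

    Header lines (starting with '>') are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.append(
                AMBIGUITY_RE.sub(lambda m: "n" if m.group(0).islower() else "N", line)
            )
    return "\n".join(out)


masked = mask_ambiguity_codes(">seq_R1\nACGRTY\nnnkA")
print(masked)
```

Lower-case codes become 'n' so that any soft-masking in the genome is preserved; that choice is an assumption, not something the issue asks for.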

Add Rfam automatic screen

Add an Rfam screen to the end of Evolinc-I so that this doesn't have to be done outside of the DE/command line. This will entail adding the Rfam library of RNAs (except snoRNAs).

Add an option for long read filtering/comparison.

People should be able to run long read transcripts through Evolinc. Alternatively, they should be able to compare their short read derived lncRNAs against any long read transcripts that are available. This would be an optional argument that would provide further support for the lncRNA annotation.

Questions about Evolinc modifications since publication

Hi,
I have read the paper describing the most recent version of Evolinc and I have the following questions:

The paper says to run the output FASTA against Rfam, but I see in the resolved issues on GitHub that this feature has been added. Do you still suggest that I run my output against Rfam?
Does Evolinc detect only long intergenic non-coding RNAs, or does it detect other types too?
Since the paper came out, the developers of cuffcompare have also released gffcompare, which is analogous to cuffcompare and, I believe, produces the same output. I would like to use gffcompare because its usage is simpler. Can I use the output gtf from gffcompare in Evolinc, or do you suggest I stick to cuffcompare?
Thank you

Create "intronic space" parameter

Allow for variable distance for removing gaps and merging hits on similar scaffolds (max gap length currently set to length of query lncRNA).
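As a rough illustration of what a user-settable gap parameter might look like, the sketch below merges hits whose separation is at most max_gap. It assumes the hits are (start, end) pairs already restricted to one scaffold and strand; merge_hits and max_gap are hypothetical names, not part of the existing pipeline:

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def merge_hits(hits: List[Interval], max_gap: int) -> List[Interval]:
    """Merge hits separated by no more than max_gap bases."""
    merged: List[Interval] = []
    for start, end in sorted(hits):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous hit: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


hits = [(100, 200), (250, 300), (1000, 1100)]
print(merge_hits(hits, max_gap=100))
print(merge_hits(hits, max_gap=1000))
```

Passing the query lncRNA length as max_gap would mimic the behaviour described as the current default.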

Add AOTs and SOTs to updated GFF file

Users have requested the addition of AOT and SOT lncRNAs to the GFF file in order to perform differential expression.

Error in calling unlink in diamondBlast step

As a result, no longest_ORFS_cat.pep.blastp file is produced. It is not clear whether this occurs on all systems or only within a Windows Docker container.

Replace underscores in Known_lincRNA bed file

There is a known issue when appending the gene ID of a known lincRNA to the final summary table if that known lincRNA has an underscore in its name in the bed/gff file used as input.

Chromosome IDs of Evolinc identified lincRNAs do not match parent annotation.

After running Evolinc 1.7.5 and Evolinc-Merge on the Discovery Environment, the output annotations have chromosome IDs of newly identified lincRNAs that do not match the parent (input) annotation.

The input annotation uses the nomenclature 'Chr1', 'Chr2', 'Chr3', etc. and 'Scaffold12345'. The 'Final_updated.gtf' from Evolinc-Merge keeps this pattern for existing features, but new lincRNAs lose the "Chr" or "Scaffold" identifier in column 1. Additionally, scaffold numbers that begin with a 0 in the parent annotation (e.g. "Scaffold00123") lose their leading zeros and show "123" as the new chromosome.

Is this an issue for lincRNA identification if Evolinc is not able to assign the lincRNAs to the "known" chromosomes?

I have attached gzipped input and output annotations for your reference.

Thank you for your support.
Final_updated.gtf.gz
Cs_genes_v2.1_annot.gff3.gz
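One way to list the mismatched IDs is to compare column 1 of the two annotations. This is only a diagnostic sketch that assumes both files are tab-separated GFF/GTF text already read into strings; the function names are hypothetical:

```python
from typing import List, Set


def seqids(annotation_text: str) -> Set[str]:
    """Collect the column-1 sequence IDs of a tab-separated GFF/GTF."""
    ids = set()
    for line in annotation_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(parent_text: str, output_text: str) -> List[str]:
    """Return IDs present in the output annotation but not in the parent."""
    return sorted(seqids(output_text) - seqids(parent_text))


parent = "##gff-version 3\nChr1\tsrc\tgene\t1\t10\t.\t+\t.\tID=g1\nScaffold00123\tsrc\tgene\t1\t10\t.\t+\t.\tID=g2\n"
output = "Chr1\tsrc\texon\t1\t10\t.\t+\t.\tx\n123\tEvolinc\texon\t5\t9\t.\t+\t.\ty\n"
print(unmatched_seqids(parent, output))
```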

Using merged from gffcompare on DE

I tried running Evolinc-I on a cluster with Singularity and ran into a number of issues, so I am opting to run it on the DE instead. My question is this: my merged gtf is from gffcompare and not cuffcompare (since cuffcompare is now outdated and gffcompare is basically its newer version). Previously I had been told this was fine, as long as I used the -r flag. What can I do when running it on the DE? Is there an option for this?

Offer the option for FPKM filter in Evolinc-I

This could be done in a similar way to the coverage/base filter.

Error running Evolinc-I with both mandatory and optional files for sample data

I am getting this error message at the end of the Evolinc run with optional files for sample data:

Error in as.data.table(newmat) : could not find function "as.data.table"
Calls: cSplit -> is.data.table
Execution halted
cp: cannot stat 'final_Summary_table.tsv': No such file or directory
All necessary files written to test_out
Finished Evolinc-part-I!

Error in parsing transcripts

Hi,
I was able to run Evolinc with the test data, but now I am getting an error when using it on BRAKER genome annotations. This is the error message:

Tue Mar 9 17:39:39 UTC 2021
No fasta index found for referencegenome.fa. Rebuilding, please wait..
Fasta index rebuilt.
Generating Number of transcripts
##################################
grep: transcripts.*.fa: No such file or directory
transcripts.*.fa 
##################################
cat: transcripts.*.filter.fa: No such file or directory
[INFO] read file 'transcripts.all.overlapping.filter.fa'
[INFO] Predicting coding potential, please wait ...
[INFO] Running Done!
[INFO] cost time: 0s
[ERROR] putative_intergenic.genes.fa is not a file
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
Can't open putative_intergenic.genes.fa: No such file or directory.
Generating Number of coding and noncoding
##################################
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_coding_transcripts
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_noncoding_transcripts
overlapping_coding_transcripts 1
overlapping_coding_transcripts 0

It looks like it's not able to extract the transcript sequences and run transdecoder correctly?
This is the format of my gtf file:

CsWA_scaf115    AUGUSTUS        gene    1563351 1564313 .       -       .       jg29579
CsWA_scaf115    AUGUSTUS        transcript      1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579"
CsWA_scaf115    AUGUSTUS        stop_codon      1563351 1563353 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        CDS     1563351 1564313 0.88    -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        exon    1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        start_codon     1564311 1564313 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_chr04      AUGUSTUS        gene    6431667 6433016 .       +       .       jg761
CsWA_chr04      AUGUSTUS        transcript      6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761"
CsWA_chr04      AUGUSTUS        start_codon     6431667 6431669 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        CDS     6431667 6433016 0.94    +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        exon    6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        stop_codon      6433014 6433016 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_scaf115    AUGUSTUS        gene    4180987 4181720 .       +       .       jg31437
CsWA_scaf115    AUGUSTUS        transcript      4180987 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437"
CsWA_scaf115    AUGUSTUS        start_codon     4180987 4180989 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4180987 4181063 0.59    +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4180987 4181063 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        intron  4181064 4181137 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4181138 4181720 0.54    +       1       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4181138 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        stop_codon      4181718 4181720 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";

Is there anything wrong with that?
Thank you in advance
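A quick check of the attribute column can show how lines like those above differ from the key "value"; layout that GTF parsers commonly expect. The sketch assumes the real file is tab-separated and that values contain no ';'. The function name is hypothetical, and the thread does not confirm whether these differences are what breaks Evolinc:

```python
import re
from typing import List

# One attribute of the form: key "value"
ATTR_RE = re.compile(r'^\s*(\S+)\s+"([^"]*)"\s*$')


def attribute_problems(line: str) -> List[str]:
    """List formatting problems in the attribute column of one GTF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    attrs = fields[8].strip()
    problems = []
    if not attrs.endswith(";"):
        problems.append("attribute column does not end with ';'")
    keys = set()
    for part in attrs.split(";"):
        if not part.strip():
            continue
        match = ATTR_RE.match(part)
        if match is None:
            problems.append(f"not a key/value pair: {part.strip()!r}")
        else:
            keys.add(match.group(1))
    for required in ("gene_id", "transcript_id"):
        if required not in keys:
            problems.append(f"missing {required}")
    return problems


gene = "CsWA_scaf115\tAUGUSTUS\tgene\t1563351\t1564313\t.\t-\t.\tjg29579"
tx = 'CsWA_scaf115\tAUGUSTUS\ttranscript\t1563351\t1564313\t.\t-\t.\ttranscript_id "jg29579.t1"; gene_id "jg29579"'
exon = 'CsWA_scaf115\tAUGUSTUS\texon\t1563351\t1564313\t.\t-\t.\ttranscript_id "jg29579.t1"; gene_id "jg29579";'
for record in (gene, tx, exon):
    print(record.split("\t")[2], attribute_problems(record))
```

On the sample above, the gene lines carry a bare ID instead of key/value attributes and the transcript lines lack the trailing semicolon, while the exon lines pass.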

final_summary_table_gen_evo-I.R sub() function?

What does "AGE_PLUS" refer to in line 422 of the R script?

(422) merge2$V1_2 <- sub("AGE_PLUS", "Yes", merge2$V1_2)