dessimozlab / omastandalone

OMA standalone repository

License: Other

CSS 2.34% Shell 39.21% Python 57.12% Dockerfile 1.33%

omastandalone's People

yiyanyang0728 ericdeveaud

omastandalone's Issues

Error message "Error, (in LoadSpliceMap) string to parse is too long"

Hello,

I have been using OMA standalone lately and everything went well until I got the error message below.
I am allocating 32 GB of memory using "#SBATCH --mem=32G".
I am trying to get orthologs from 12 different genomes, most of which are not in OMA yet.
Thanks in advance for your answer.
Best
Josquin

"Reading the all-against-all files...

Matches loaded. Mem: 1.339GB

Error, (in LoadSpliceMap) string to parse is too long"

Error segmentation fault

Hello,

I have a problem when trying to run OMA on an HPC cluster running a Debian GNU/Linux distribution.
After following the instructions to launch OMA using Slurm sbatch, it returns a segmentation fault error.
I would appreciate any suggestions for finding a solution.
Thank you.

Best regards,

Irene

OMA standalone vs. OMAmer HOGs

As mentioned in my previous issue (#6), I am a bit confused when comparing the results produced by OMA standalone and OMAmer. In particular, despite using the same dataset, I found large differences in the number of HOGs (at the topmost taxonomic level) inferred by both programs with default settings. That is, OMAmer grouped the proteins into 4 different HOGs, while OMA standalone divided them into 16 HOGs. In the case of OMAmer, three of the HOGs included a relatively large percentage of species [around 80-90% of the total each] and the other HOG a relatively smaller number [around 10% of the species included]. Nonetheless, the 4 HOGs detected by OMAmer included a relatively similar percentage of sequences each (49%, 20%, 16%, 15%). However, this was not the case with OMA standalone. In the latter, 7 HOGs (out of 16) consisted of only 2 sequences each, and other HOGs also seem to be over-specific in general. I believe so because the dataset was built by joining proteins that very likely have the same functionality. Thus, I think there shouldn't be more than one or a few gene families.

I noticed that OMAmer includes a flag called "--threshold" which could help reduce the number of HOGs. However, in the case of OMA standalone, it seems there are several parameters (MinScore, LengthTol, MinSeqLen) which could make the grouping more stringent. I believe the most important one is "MinScore" (default value is 181). Unfortunately, I have little knowledge about the algorithms behind the two programs, so it is not clear to me what a more stringent value would be in either case. How can I determine the best MinScore value for my dataset?

Moreover, I would like to know if the "top-down" algorithm in OMA standalone (combined with the use of "StableIdsForGroups") is able to determine the actual HOG id in a similar way as OMAmer does (e.g. "HOG:B0561231" instead of "HOG1")?

Error, (in EstablishSpliceMap) splicing variant prefix is not uniq

Hello!

I've been trying to run OMA standalone (v 2.5.0), but it already fails during database conversion. None of the genomes are in OMA; they all come from internal efforts.

Full error message:

only_run_allall := true
Starting database conversion and checks...
Process 56886 on n-hpc-ca5: job nr 1 of 124
Error, (in EstablishSpliceMap) splicing variant prefix 'genome1_anno1.g17209.t1' is not uniq in the proteome. There are at least two proteins with that ID prefix

I checked for that prefix with grep in my fasta:

genome1_anno1.g17209.t1 gene=g_16998 seq_id=scaffold_6 type=cds
genome1_anno1.g17209.t10 gene=g_16998 seq_id=scaffold_6 type=cds

And in my Splice File (this represents one line):

genome1_anno1.g17209.t4; genome1_anno1.g17209.t5; genome1_anno1.g17209.t6; genome1_anno1.g17209.t1; genome1_anno1.g17209.t2; genome1_anno1.g17209.t3; genome1_anno1.g17209.t7; genome1_anno1.g17209.t8; genome1_anno1.g17209.t9; genome1_anno1.g17209.t10; genome1_anno2.ENSACAT00000040140.6; genome1_anno2.ENSACAT00000045760.6

I am not sure if/why OMA thinks genome1_anno1.g17209.t1 isn't unique.
Or am I formatting my splice file incorrectly? The splicing variants for a gene are always separated by a semicolon and a space.

Grateful for any help/input!
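
The collision described above can be reproduced outside OMA. The following is a minimal sketch, assuming (from the wording of the error message only, not from OMA's source) that each ID in the splice file is matched as a prefix of the FASTA header lines. The function names and the shortened sample data are hypothetical.

```python
def parse_splice_line(line):
    """Split one splice-file line into IDs (separator: semicolon, optional spaces)."""
    return [tok.strip() for tok in line.split(";") if tok.strip()]


def ambiguous_prefixes(splice_ids, fasta_headers):
    """Return {splice_id: [headers]} for IDs that are a prefix of more than one header."""
    hits = {}
    for sid in splice_ids:
        matches = [h for h in fasta_headers if h.startswith(sid)]
        if len(matches) > 1:
            hits[sid] = matches
    return hits


# Headers as reported by grep in the issue (without the leading '>').
headers = [
    "genome1_anno1.g17209.t1 gene=g_16998 seq_id=scaffold_6 type=cds",
    "genome1_anno1.g17209.t10 gene=g_16998 seq_id=scaffold_6 type=cds",
]
# Shortened version of the splice-file line from the issue.
splice_line = "genome1_anno1.g17209.t1; genome1_anno1.g17209.t10"

collisions = ambiguous_prefixes(parse_splice_line(splice_line), headers)
for sid, matches in collisions.items():
    print(sid, "is a prefix of", len(matches), "headers")
```

Under that assumption, "genome1_anno1.g17209.t1" is flagged because it is also a prefix of "genome1_anno1.g17209.t10", while "t10" itself matches only one header.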

Error, (in BuildSpeciesTree) range selector out of bounds

Following up on my question in Biostars (https://www.biostars.org/p/9506368), I tried to run OMA standalone on a subset of my dataset including 263 species + 2 outgroup species and 1,663 protein sequences in total. The species with the most data includes 111 sequences, while at the other extreme (with the least data), there are 15 species with 1 sequence each. The first stage of OMA ran successfully. However, when running the second stage, OMA stopped unexpectedly as shown below (these are the last 20 lines):

shrinking VerifiedPairs 46 246 480604
shrinking VerifiedPairs 52 150 517508
shrinking VerifiedPairs 63 80 543926
shrinking VerifiedPairs 70 244 566373
shrinking VerifiedPairs 83 163 584692
shrinking VerifiedPairs 105 220 602237
shrinking VerifiedPairs 123 178 617710
shrinking VerifiedPairs 168 219 631428
shrinking VerifiedPairs 210 256 644808
Iteration with 31607 Verified pairs, 0 in Orthologous
Iteration with 471 Verified pairs, 83 in Orthologous
118 orthologous groups, histog=[0, 34, 21, 13, 12, 4, 3, 1, 3, 2, 1, 3, 1, 0, 2, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

SpeciesTree Reconstruction started ...
# keeping the top 56 largest orthologs groups with [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 orthologous groups before Filtering=BestDimless, AveVarWeighted
dimensionless fitting index 1.24
dimensionless fitting index 1.676
Error, (in BuildSpeciesTree) range selector out of bounds
/home/biomendi/bioinfo_bin/OMA/bin/OMA: line 196: kill: (8564) - No such process

Unfortunately, the error message is not very informative to me, so I do not know what exactly went wrong. It seems to be related to the reconstruction of the species tree. Interestingly, a grep search for the word "Error" shows that the same error message appeared two more times before the program crashed (three times in total). A possible (blind) explanation that occurred to me is that OMA cannot estimate the species tree because of the large differences in the number of sequences between species. Could that be the case?
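
To put numbers on the imbalance between species, the sequences per input file can be counted. This is a minimal sketch, assuming one FASTA file per species with a ".fa" extension in a single directory; the helper names, the demo files and the threshold of 2 are hypothetical, and the sketch says nothing about whether this imbalance actually causes the error.

```python
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records in a file by counting header lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def sequences_per_species(db_dir, pattern="*.fa"):
    """Map each file stem (taken here as the species name) to its record count."""
    return {p.stem: count_fasta_records(p) for p in sorted(Path(db_dir).glob(pattern))}


def species_below(counts, minimum):
    """Return the species that have fewer than `minimum` sequences, sorted by name."""
    return sorted(name for name, n in counts.items() if n < minimum)


# Demo on two tiny, made-up proteomes written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "speciesA.fa").write_text(">a1\nMKT\n>a2\nMKV\n>a3\nMKL\n")
    Path(tmp, "speciesB.fa").write_text(">b1\nMKT\n")
    counts = sequences_per_species(tmp)

sparse = species_below(counts, minimum=2)
print(counts)
print(sparse)
```

Pointing sequences_per_species at the real input directory would list the species that contribute only a single sequence.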

Virtualenv in installation required?

Whilst creating an easyconfig for the most recent version (see easybuilders/easybuild-easyconfigs#13447), I realized that Python is required. On closer inspection, it seems that Python is actually a runtime dependency for certain functionality. Is that correct?

How to get paralog genes from OMA standalone

Hello,

Thank you for the tool, which is very handy. I was able to run OMA standalone, but I need information on paralogous genes. I see that there is an option "WriteOutput_Paralogs := true;", but even with it set there is no output. Can you suggest how I can get the paralog information?

Thank you

OMA analysis with around 400 proteomes

Dear Sir/Madam,

Thank you for providing this excellent tool. I am using OMA to infer HOGs for >400 proteomes. The database checking step and the all-vs-all step completed without problems, but the next step (the orthology calling) takes a very long time (more than 1 month)...
Do you have any suggestions for reducing the computation time of this last step? I'd also appreciate any advice about how to modify the parameters.

Thanks,
Alex

Update OMA on dockerhub

Hello,
Thanks for the tool. Would it be possible to update OMA's Docker image (on Docker Hub) to version 2.6.0, please?

Error during database conversion step

Hi,

I'm facing a problem during the database conversion step. Since this is the first time it has occurred, I think it is related to my dataset, but I have no idea what the cause is.

Can you help me?

The error follows:

Reading 6101 characters from file Cache/DB/OG0000332.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Reading 8460 characters from file Cache/DB/OG0000336.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Reading 10157 characters from file Cache/DB/OG0000339.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Error, (in ReadDb) cannot open index file

And the file
OG0000339.db.txt

Best regards!

running OmaStandalone without network access

Hello,
Our cluster compute nodes do not have internet access, so OMA fails when it first tries to download http://purl.obolibrary.org/obo/go.obo

I could execute a run on a machine that has internet access and provide the resulting $HOME/.cache/oma/GOdata.drw to our users.

However, I saw that darwinlib/Taxonomy also performs a download from http://www.uniprot.org/taxonomy/?query=*&compress=yes&format=tab
Is there a way for me to download and process (ConvertRawFile) this file and provide the resulting UniProtTaxonomy.drw file to our users, so that they can run OMA without internet access?
That way OMA would be truly standalone ;-)
regards

Eric

[Error] Too many mapped files

Dear Sir/Madam,

Thank you for providing this useful tool. I am using OMA to infer HOGs for >3000 genomes (the fa files in the DB/ folder total about 4.3 GB). After the database checking step, I started the all-vs-all step, but I encountered an issue and the log file said:

The program terminated incorrectly: Too many mapped files.

Here is my sbatch code for OMA part2:

#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --partition=norm
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100GB
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --time=8:00:00
cd /gpfs/gsfs12/users/yangy34/softwares/OMA.2.4.2
export NR_PROCESSES=1000
./bin/oma -s -W 20000
if [[ "$?" == "99" ]] ; then
    scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0

Do you have any idea about this? I'd also appreciate any advice about how to modify the parameters.

Thanks,
Yiyan

HierarchicalGroups.orthoxml file contains all splicing variants

The HierarchicalGroups.orthoxml file contains all splicing variants as individual genes in the header part. As only one (the inferred main isoform) will appear in any orthology relation, tools like pyham will list the alternative isoforms as species-specific gene gains, which is wrong.
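
The genes affected can be listed by comparing the header with the groups. This is a minimal sketch, assuming the usual OrthoXML layout in which genes are declared as "gene" elements (with "id" and "protId" attributes) and group members are "geneRef" elements pointing at those ids. The function name and the sample document are hypothetical. Note that a gene with no orthologs at all is also unreferenced, so this listing alone cannot tell alternative isoforms from true singletons.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://orthoXML.org/2011/}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def unreferenced_genes(xml_text):
    """Return {gene id: protId} for genes declared but never used in a geneRef."""
    root = ET.fromstring(xml_text)
    declared, referenced = {}, set()
    for element in root.iter():
        name = local_name(element.tag)
        if name == "gene":
            declared[element.get("id")] = element.get("protId")
        elif name == "geneRef":
            referenced.add(element.get("id"))
    return {gid: pid for gid, pid in declared.items() if gid not in referenced}


# Made-up document: geneA.t2 is an alternative isoform that is declared in the
# header but does not take part in any group.
SAMPLE = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="example" originVersion="0">
  <species name="speciesA" NCBITaxId="0">
    <database name="example" version="0">
      <genes>
        <gene id="1" protId="geneA.t1"/>
        <gene id="2" protId="geneA.t2"/>
      </genes>
    </database>
  </species>
  <species name="speciesB" NCBITaxId="0">
    <database name="example" version="0">
      <genes>
        <gene id="3" protId="geneB.t1"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="1">
      <geneRef id="1"/>
      <geneRef id="3"/>
    </orthologGroup>
  </groups>
</orthoXML>"""

orphans = unreferenced_genes(SAMPLE)
print(orphans)
```

For a real file, the text can be read from HierarchicalGroups.orthoxml and the resulting protId values compared with the splice file to see which unreferenced genes are non-main isoforms.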
