dessimozlab / omastandalone
OMA standalone repository
License: Other
Hello,
I have been using OMA standalone lately and everything went well until I got the error message below.
I am giving 32GB of memory using "#SBATCH --mem=32G".
I am trying to get orthologs from 12 different genomes, most of which are not in OMA yet.
Thanks in advance for your answer.
Best
Josquin
"Reading the all-against-all files...
Error, (in LoadSpliceMap) string to parse is too long"
Hello,
I have a problem when trying to run OMA on an HPC cluster running a Debian GNU/Linux distribution.
After following the instructions to launch OMA with Slurm (sbatch), it fails with a segmentation fault.
I would appreciate any suggestions for finding a solution.
Thank you.
Best regards,
Irene
As mentioned in my previous issue (#6), I am a bit confused when comparing the results produced by OMA standalone and OMAmer. In particular, despite using the same dataset, I found large differences in the number of HOGs (at the topmost taxonomic level) inferred by both programs with default settings. That is, OMAmer grouped the proteins into 4 different HOGs, while OMA standalone divided them into 16 HOGs. In the case of OMAmer, three of the HOGs included a relatively large percentage of species [around 80-90% of the total each] and the other HOG a relatively small number [around 10% of the species included]. Nonetheless, the 4 HOGs detected by OMAmer each included a relatively similar percentage of sequences (49%, 20%, 16%, 15%). However, this was not the case with OMA standalone. In the latter, 7 HOGs (out of 16) consisted of only 2 sequences each, and the other HOGs also seem to be over-specific in general. I believe so because the dataset was built by joining proteins that very likely have the same functionality. Thus, I think there shouldn't be more than one or a few gene families.
I noticed that OMAmer includes a flag called "--threshold" which could help reduce the number of HOGs. However, in the case of OMA standalone, it seems there are several parameters (MinScore, LengthTol, MinSeqLen) which could make the grouping more stringent. I believe the most important one is "MinScore" (default value is 181). Unfortunately, I have little knowledge about the algorithms behind the two programs, so it is not clear to me what a more stringent value would be in either case. How can I determine the best MinScore value for my dataset?
Moreover, I would like to know if the "top-down" algorithm in OMA standalone (combined with the use of "StableIdsForGroups") is able to determine the actual HOG id in a similar way as OMAmer does (e.g. "HOG:B0561231" instead of "HOG1")?
Hello!
I've been trying to run OMA standalone (v 2.5.0), but it already fails during database conversion. None of the genomes are in OMA; they all come from internal efforts.
Full error message:
only_run_allall := true
Starting database conversion and checks...
Process 56886 on n-hpc-ca5: job nr 1 of 124
Error, (in EstablishSpliceMap) splicing variant prefix 'genome1_anno1.g17209.t1' is not uniq in the proteome. There are at least two proteins with that ID prefix
I checked for that prefix with grep in my fasta:
genome1_anno1.g17209.t1 gene=g_16998 seq_id=scaffold_6 type=cds
genome1_anno1.g17209.t10 gene=g_16998 seq_id=scaffold_6 type=cds
And in my Splice File (this represents one line):
genome1_anno1.g17209.t4; genome1_anno1.g17209.t5; genome1_anno1.g17209.t6; genome1_anno1.g17209.t1; genome1_anno1.g17209.t2; genome1_anno1.g17209.t3; genome1_anno1.g17209.t7; genome1_anno1.g17209.t8; genome1_anno1.g17209.t9; genome1_anno1.g17209.t10; genome1_anno2.ENSACAT00000040140.6; genome1_anno2.ENSACAT00000045760.6
I am not sure if/why OMA thinks genome1_anno1.g17209.t1 isn't unique.
Or am I formatting my splice file wrong? The splicing variants for a gene are always separated by a semicolon and a space.
Grateful for any help/input!
F
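The error text and the grep output above suggest that the ID genome1_anno1.g17209.t1 is being matched as a prefix, so it also matches genome1_anno1.g17209.t10. A minimal sketch for finding all such collisions before running OMA follows. It assumes that OMA matches splice-file IDs as prefixes of the FASTA IDs (inferred from the error message, not confirmed), and the function names are hypothetical helpers, not part of OMA.

```python
from collections import defaultdict


def read_fasta_ids(lines):
    """Return the first whitespace-delimited token of every FASTA header line."""
    return [line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()]


def parse_splice_line(line):
    """Split one splice-file line on ';' and drop surrounding whitespace."""
    return [tok.strip() for tok in line.split(";") if tok.strip()]


def ambiguous_prefixes(splice_ids, fasta_ids):
    """Map each splice ID to the FASTA IDs it is a prefix of.

    Only IDs that match more than one sequence are returned.
    Assumption: prefix matching mirrors what the error message describes.
    The double loop is quadratic, which is fine for a one-off check.
    """
    hits = defaultdict(list)
    for sid in splice_ids:
        for fid in fasta_ids:
            if fid.startswith(sid):
                hits[sid].append(fid)
    return {sid: fids for sid, fids in hits.items() if len(fids) > 1}


# Example using the two headers and part of the splice line from the report.
fasta = [
    ">genome1_anno1.g17209.t1 gene=g_16998 seq_id=scaffold_6 type=cds",
    ">genome1_anno1.g17209.t10 gene=g_16998 seq_id=scaffold_6 type=cds",
]
splice = parse_splice_line("genome1_anno1.g17209.t1; genome1_anno1.g17209.t10")
print(ambiguous_prefixes(splice, read_fasta_ids(fasta)))
```

For the two headers shown, the check flags genome1_anno1.g17209.t1 because it is a prefix of both its own ID and the .t10 ID. If the prefix assumption holds, renaming the isoforms so that no ID is a prefix of another (for example zero-padding to t01 ... t10) would be one way to avoid the collision.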
Following up on my question in Biostars (https://www.biostars.org/p/9506368), I tried to run OMA standalone on a subset of my dataset including 263 species + 2 outgroup species and 1,663 protein sequences in total. The species with the most data includes 111 sequences, while at the other extreme (with the least data), there are 15 species with 1 sequence each. The first stage of OMA ran successfully. However, when running the second stage, OMA stopped unexpectedly as shown below (these are the last 20 lines):
shrinking VerifiedPairs 46 246 480604
shrinking VerifiedPairs 52 150 517508
shrinking VerifiedPairs 63 80 543926
shrinking VerifiedPairs 70 244 566373
shrinking VerifiedPairs 83 163 584692
shrinking VerifiedPairs 105 220 602237
shrinking VerifiedPairs 123 178 617710
shrinking VerifiedPairs 168 219 631428
shrinking VerifiedPairs 210 256 644808
Iteration with 31607 Verified pairs, 0 in Orthologous
Iteration with 471 Verified pairs, 83 in Orthologous
118 orthologous groups, histog=[0, 34, 21, 13, 12, 4, 3, 1, 3, 2, 1, 3, 1, 0, 2, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
SpeciesTree Reconstruction started ...
# keeping the top 56 largest orthologs groups with [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 orthologous groups before Filtering=BestDimless, AveVarWeighted
dimensionless fitting index 1.24
dimensionless fitting index 1.676
Error, (in BuildSpeciesTree) range selector out of bounds
/home/biomendi/bioinfo_bin/OMA/bin/OMA: line 196: kill: (8564) - No such process
Unfortunately, the error message is not very informative to me, so I do not know what exactly went wrong. It seems to be related to the reconstruction of the species tree. Interestingly, a grep search for the word "Error" shows that the same error message appeared two more times before the program crashed (three times in total). A possible (blind) explanation that I just thought of is that OMA cannot estimate the species tree because of the large differences in the number of sequences between species. Could that be the case?
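To quantify how uneven the per-species sequence counts are, a small sketch like the following can be used. It assumes one FASTA file per species with a .fa extension in the DB/ folder; the function names and the threshold are hypothetical, and the result does not by itself confirm that the imbalance causes the BuildSpeciesTree error.

```python
from pathlib import Path


def count_sequences(fasta_path):
    """Count FASTA records (lines starting with '>') in one file."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def sequences_per_species(db_dir, pattern="*.fa"):
    """Return {file stem: number of sequences} for each FASTA file in db_dir."""
    return {path.stem: count_sequences(path)
            for path in sorted(Path(db_dir).glob(pattern))}


def sparse_species(counts, min_sequences=2):
    """List species whose sequence count is below min_sequences (illustrative cutoff)."""
    return sorted(name for name, n in counts.items() if n < min_sequences)


if __name__ == "__main__":
    counts = sequences_per_species("DB")  # assumed location of the input genomes
    print(len(counts), "species,", sum(counts.values()), "sequences in total")
    print("species with a single sequence:", sparse_species(counts))
```

Re-running the second stage without the sparsest species would be one way to test whether they are involved in the failure.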
Whilst creating an easyconfig for the most recent version (see easybuilders/easybuild-easyconfigs#13447), I realized that Python is required. Taking a closer look, it now seems that Python is actually a runtime dependency for certain functionality. Is that correct?
Hello,
Thanks for this handy tool. I was able to run OMA standalone, but I need information on paralogous genes. I see that there is an option WriteOutput_Paralogs := true; but there is still no paralog output.
Can you suggest how I can get the paralog information?
Thank you
Dear Sir/Madam,
Thank you for providing this perfect tool. I am using OMA to infer HOGs for >400 proteomes. The database checking step and the All-vs-All step worked perfectly, but the next step (the orthology calling) takes a very long time (more than 1 month)...
Do you have any suggestions for decreasing the computation time of this last step? I'd also appreciate any advice about how to modify the parameters.
Thanks,
Alex
Hello,
Thanks for the tool. Is it possible to update OMA's docker (on dockerhub) to the 2.6.0 version please?
Hi,
I've run into a problem during the database conversion step. Since this is the first time it has occurred, I think it is related to my dataset, but I have no idea what the reason is.
Can you help me?
The error follows:
Reading 6101 characters from file Cache/DB/OG0000332.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Reading 8460 characters from file Cache/DB/OG0000336.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Reading 10157 characters from file Cache/DB/OG0000339.db
Pre-processing input (peptides)
72 sequences within 72 entries considered
Error, (in ReadDb) cannot open index file
And the file
OG0000339.db.txt
Best regards!
Hello,
Our cluster compute nodes do not have internet access, so OMA fails when it first tries to download http://purl.obolibrary.org/obo/go.obo
I could execute a run on a machine that has internet access and then provide $HOME/.cache/oma/GOdata.drw to our users,
but I saw that darwinlib/Taxonomy
also performs a download, from http://www.uniprot.org/taxonomy/?query=*&compress=yes&format=tab
Is there a way to download and process (ConvertRawFile) this file and provide the resulting UniProtTaxonomy.drw
file to our users, so that they are able to run OMA without internet access?
That way OMA would be really standalone ;-)
regards
Eric
Dear Sir/Madam,
Thank you for providing this useful tool. I am using OMA to infer HOGs for >3000 genomes (the fa files in the DB/ folder total about 4.3 GB). After the database checking step, I began to run the All-vs-All step, but I encountered an issue and the log file said:
The program terminated incorrectly: Too many mapped files.
Here is my sbatch code for OMA part2:
#!/bin/bash
#SBATCH --array=1-1000
#SBATCH --partition=norm
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100GB
#SBATCH --job-name=oma2
#SBATCH --output=logs/oma2-%A.%a.log
#SBATCH --export=None
#SBATCH --error=logs/oma2-%A.%a.err
#SBATCH --time=8:00:00
cd /gpfs/gsfs12/users/yangy34/softwares/OMA.2.4.2
export NR_PROCESSES=1000
./bin/oma -s -W 20000
if [[ "$?" == "99" ]] ; then
scontrol requeue ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi
exit 0
Do you have any idea about this? I'd also appreciate any advice about how to modify the parameters.
Thanks,
Yiyan
The HierarchicalGroups.orthoxml file contains all splicing variants as individual genes in the header part. As only one (the inferred main isoform) will appear in any orthology relation, tools like pyham will list the alternative isoforms as species-specific gene gains, which is wrong.
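The effect can be inspected with a short sketch that lists genes declared in the species header of an OrthoXML document but never referenced by a geneRef in any group. It assumes the standard OrthoXML 2011 namespace; the function name is a hypothetical helper. Note that the result also includes genuine singletons (genes with no inferred orthologs), so it is an upper bound on the alternative isoforms, not an exact list.

```python
import xml.etree.ElementTree as ET

OX = "{http://orthoXML.org/2011/}"  # standard OrthoXML namespace (assumed)


def unreferenced_genes(orthoxml_text):
    """Return {gene id: protId} for declared genes that no geneRef points to."""
    root = ET.fromstring(orthoxml_text)
    declared = {g.get("id"): g.get("protId") for g in root.iter(OX + "gene")}
    referenced = {r.get("id") for r in root.iter(OX + "geneRef")}
    return {gid: pid for gid, pid in declared.items() if gid not in referenced}


# Tiny illustrative document: g1.t2 is declared but appears in no group.
EXAMPLE = """
<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="example" originVersion="1">
  <species name="sp1" NCBITaxId="0">
    <database name="db" version="1">
      <genes>
        <gene id="1" protId="g1.t1"/>
        <gene id="2" protId="g1.t2"/>
      </genes>
    </database>
  </species>
  <species name="sp2" NCBITaxId="0">
    <database name="db" version="1">
      <genes>
        <gene id="3" protId="h1.t1"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="HOG1">
      <geneRef id="1"/>
      <geneRef id="3"/>
    </orthologGroup>
  </groups>
</orthoXML>
""".strip()

print(unreferenced_genes(EXAMPLE))
```

For a real run, the text of HierarchicalGroups.orthoxml would be read from disk and passed to the function; cross-checking the returned protIds against the splice file would then separate alternative isoforms from true singletons.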