darcy220606 / ampcombi
AMPcombi parses and filters results from AMP prediction tools
License: MIT License
Currently, when AMPcombi is run from bash, the --tooldict input must be a Python dictionary:
--tooldict '{"ampir":"ampir.tsv", "amplify":".tsv", "macrel":".prediction", "neubi":"neubi.fasta", "hmmer_hmmsearch":".txt"}'
This results in an inconsistent interface (you have to think in both bash and Python), and furthermore the nested quotes make it difficult to wrap the command in, e.g., pipeline code.
I would recommend replacing this with an explicit flag that defines the input file suffix for each tool, e.g.
--ampir-suffix 'ampir.tsv' --amplify-suffix '.tsv' --macrel-suffix 'prediction' --neubi-suffix 'neubi.fasta' --hmmsearch-suffix '.txt'
This will make the UX much smoother, lower the mental load when using the tool, and make it easier to integrate into pipelines.
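A minimal sketch of how the per-tool suffix flags proposed above could be declared with argparse. The flag names follow the suggestion in this issue and the defaults are copied from the --tooldict example; `build_parser` and `suffixes_from_args` are hypothetical helper names, not existing AMPcombi functions.

```python
import argparse

# Defaults copied from the --tooldict example in this issue.
DEFAULT_SUFFIXES = {
    "ampir": "ampir.tsv",
    "amplify": ".tsv",
    "macrel": ".prediction",
    "neubi": "neubi.fasta",
    "hmmsearch": ".txt",
}


def build_parser():
    """Create one --<tool>-suffix flag per supported tool."""
    parser = argparse.ArgumentParser(prog="ampcombi")
    for tool, suffix in DEFAULT_SUFFIXES.items():
        parser.add_argument(
            f"--{tool}-suffix",
            dest=f"{tool}_suffix",
            default=suffix,
            help=f"file suffix of the {tool} results (default: %(default)s)",
        )
    return parser


def suffixes_from_args(args):
    """Rebuild the tool -> suffix mapping that --tooldict used to carry."""
    return {tool: getattr(args, f"{tool}_suffix") for tool in DEFAULT_SUFFIXES}
```

With this layout no nested quoting is needed on the command line, and the rest of the code can keep working with a plain dictionary.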
The log file is not appended to but rewritten for every sample when run in funcscan. The patch should make AMPcombi look for a file with the same name in the output folder and append to it.
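One way to get the appending behaviour is to open the log in append mode, which creates the file on first use and extends it afterwards. This is only a sketch: `write_log` and the default file name `ampcombi.log` are assumptions for illustration.

```python
from pathlib import Path


def write_log(outdir, message, log_name="ampcombi.log"):
    """Append a message to the log file, creating it if it does not exist yet."""
    log_path = Path(outdir) / log_name  # log_name is a hypothetical default
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" keeps earlier samples' entries instead of overwriting them.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(message.rstrip("\n") + "\n")
    return log_path
```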
The summary.tsv contains the index numbers.
Associate the file endings with the tools: this is currently hardcoded, and the user should be able to assign a file ending to each individual tool, e.g. in a dictionary.
Add pydamage output as an optional flag
Add MAGs identification input as an optional flag (from metawrap output)
Add the MMseqs contig classification as an optional flag
create a bash script that runs diamond
[ ] Check which tools were run, read in and merge accordingly
add input command options with argparse:
First command to implement:
--amp-results: path to the folder that contains the tools' result outputs
Others (see issues):
Add an if statement: if a conda environment named diamond cannot be found, create the diamond conda environment from the yaml.
For alignment: build new fasta from merged table contig_ids? (see code in https://github.com/louperelo/longmetarg/blob/main/bin/read_analysis.py ) -> fasta with sequences of AMP hits as input to Diamond or MMseq
Idea: convert the input contig faa to a table, join the sequences to the merged table, and then use the AMP contig IDs and sequences to create a fasta.
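The idea above could look roughly like the following standard-library sketch. It assumes that the contig ID is the first word of each FASTA header; `read_fasta` and `amp_hits_to_fasta` are hypothetical names.

```python
def read_fasta(text):
    """Parse FASTA text into {contig_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Assumption: the contig ID is the first word of the header.
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {cid: "".join(parts) for cid, parts in records.items()}


def amp_hits_to_fasta(amp_contig_ids, sequences):
    """Build FASTA text containing only the contigs reported as AMP hits."""
    lines = []
    for cid in amp_contig_ids:
        if cid in sequences:  # skip IDs without a sequence in the faa
            lines.append(f">{cid}")
            lines.append(sequences[cid])
    return "\n".join(lines) + ("\n" if lines else "")
```

The resulting text can be written to a file and passed to the alignment step.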
For release 0.1.9
This issue requests the addition of a gbk extractor, which constructs gbk files based on the contig id. This helps the user downstream to obtain only the gbk files with the contig of interest for easy manipulation and exploration - requested by Rosa Herbst
Tool input:
[ ] Give a path to the folder which contains the files with --amp-results (like funcscan); put in the README how the folder should be structured
[ ] The input folder should be structured like funcscan: run/amp/amptool/Sample1/output.tsv -> subdirectory names will be the sample prefixes, which we can use to create the output directories, which should be /ampcombi/sample1/table.tsv
[ ] OR work with sample sheets?
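A sketch of how sample names could be derived from the folder layout described in the checklist, assuming the path given to --amp-results contains one subdirectory per tool and, inside it, one subdirectory per sample. `discover_samples` and `make_output_dirs` are hypothetical names.

```python
from pathlib import Path


def discover_samples(amp_results):
    """Map sample name -> {tool name: [result files]} for <tool>/<sample>/<file>."""
    samples = {}
    for tool_dir in sorted(Path(amp_results).iterdir()):
        if not tool_dir.is_dir():
            continue
        for sample_dir in sorted(tool_dir.iterdir()):
            if not sample_dir.is_dir():
                continue
            files = sorted(p for p in sample_dir.iterdir() if p.is_file())
            samples.setdefault(sample_dir.name, {})[tool_dir.name] = files
    return samples


def make_output_dirs(samples, outdir="ampcombi"):
    """Create <outdir>/<sample>/ for every discovered sample."""
    paths = {}
    for sample in samples:
        path = Path(outdir) / sample
        path.mkdir(parents=True, exist_ok=True)
        paths[sample] = path
    return paths
```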
Filtering is not done in funcscan but in AMPcombi!
Put a probability threshold on the tools (except hmmer, which would use an evalue) BEFORE merging, AFTER formatting
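As a rough illustration, the per-tool filter could be applied to each formatted table before the merge. The column names `prob` and `evalue` and both cutoff values are placeholders chosen for this sketch, not AMPcombi defaults.

```python
def filter_hits(rows, tool, prob_cutoff=0.5, evalue_cutoff=1e-5):
    """Keep only the rows of one tool's formatted table that pass its cutoff.

    rows is a list of dicts; the cutoffs are hypothetical example values.
    """
    if tool == "hmmer_hmmsearch":
        # hmmsearch reports an e-value: smaller is better.
        return [r for r in rows if float(r["evalue"]) <= evalue_cutoff]
    # All other tools report a probability: larger is better.
    return [r for r in rows if float(r["prob"]) >= prob_cutoff]
```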
hmmer hmmsearch has different output options.
At the moment AMPcombi is designed to parse the default output (no parameter set)
Also it is set to parse only one hmm-file per sample, corresponding to one hmmsearch model.
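To support more than one model, the model names could first be collected from the output. The sketch below assumes the default hmmsearch text output, where each searched model is introduced by a line starting with `Query:`; `model_names` is a hypothetical helper.

```python
def model_names(hmmsearch_text):
    """Return the model names found on the 'Query:' lines of hmmsearch output."""
    names = []
    for line in hmmsearch_text.splitlines():
        if line.startswith("Query:"):
            fields = line.split()
            if len(fields) > 1:
                names.append(fields[1])  # second field is the model name
    return names
```

A file that yields more than one name would indicate that several models were searched and that the hits need to be split per model before merging.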
Note: The neubi results are in a fasta file, so they should first be converted to a table and then reformatted.
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Should we change the .append to pd.concat instead?
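For reference, the replacement is mostly mechanical. `DataFrame.append` was removed in pandas 2.0, so code that still calls it fails there; the frames and column names below are made up for illustration.

```python
import pandas as pd

# Two small example tables standing in for per-tool results.
frames = [
    pd.DataFrame({"contig_id": ["c1"], "prob": [0.9]}),
    pd.DataFrame({"contig_id": ["c2"], "prob": [0.7]}),
]

# Before (deprecated): merged = frames[0].append(frames[1], ignore_index=True)
# After: collect the frames in a list and concatenate once.
merged = pd.concat(frames, ignore_index=True)
```

Collecting the frames in a list and concatenating once is also cheaper than appending inside a loop, because each append copies the whole table.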
Name output files according to samples (use sample names from directory), use --outdir to name the output folder.
Create subdirectory with sample names, and each subdir should have one resulting table.
Add the summary tables from new tool : AMPGram
Add the different thresholds to AMPcombi summary as requested by Rosa Herbst:
Change it so that the function 'html_generator()' is controlled by a separate parameter that is activated independently of the complete summary, since coupling the two creates a problem in funcscan
Eliminate the non-AMPs if a tool's settings (e.g. AMPLIFY) output them as well. If more than one tool was run, the condition would be that a hit was non-AMP in all of them.
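The "non-AMP in all tools" condition could be expressed as below. This sketch assumes the merged table is available as a mapping from contig ID to per-tool probabilities and treats a probability below a placeholder cutoff as a non-AMP call; `drop_non_amps` is a hypothetical name.

```python
def drop_non_amps(merged, cutoff=0.5):
    """Drop contigs that every tool classified as non-AMP.

    merged maps contig_id -> {tool: probability}; cutoff is a hypothetical value.
    """
    return {
        cid: probs
        for cid, probs in merged.items()
        if any(p >= cutoff for p in probs.values())
    }
```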
[ ] Check if the given path and files exist and print error message if folders/files do not exist or are not found
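A small sketch of such a check using pathlib; `missing_paths` and `check_inputs` are hypothetical names and the error text is only an example.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the given paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


def check_inputs(paths):
    """Raise a readable error that lists every missing folder or file."""
    missing = missing_paths(paths)
    if missing:
        raise FileNotFoundError("Input not found: " + ", ".join(missing))
```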
Add an 'if' statement to look for the same date.fasta file/directory; if there is one, don't download it again. This is for funcscan, so that the file is not downloaded again every time a sample is run.
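A sketch of the check, assuming the database file is named after the download date (for example `2023-01-02.fasta`). `ensure_database` is a hypothetical name, and `download` stands for whatever callable actually fetches the file.

```python
from datetime import date
from pathlib import Path


def ensure_database(db_dir, download, today=None):
    """Download the reference fasta only if today's file is not already present."""
    today = today or date.today()
    # Assumption: the file is named <ISO date>.fasta inside db_dir.
    target = Path(db_dir) / f"{today.isoformat()}.fasta"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)  # the caller supplies the download function
    return target
```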
Sort it by the probability that is found in all tools
Remove the head output directory <- This was chosen!
OR
Make it optional
If only one sample is processed, a head directory would not be necessary and the results can be written into a directory with the sample name.
If outdir is chosen, subdirectories with sample names are created inside it
Make sure it is created in the working directory
Retain the downloaded DRAMP db file in the results output
[ ] We give the user the possibility to indicate specific file paths, like: --ampir-outfile, --amplify-outfile etc. (would be in case of few samples - AMPcombi has to be run manually for each sample)
[ ] How to handle different samples? --amp-results points to output of 1 sample? Output is one table per sample in one output directory
Which marker should be used to indicate that a contig was not attributed to AMPs in one of the tools? (Python gives NaN at merge; add zero instead.)
Sometimes the DRAMP database is down, so perhaps add an option to use an alternative AMP database like those referenced here, e.g. CAMP/APD/dbAMP
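The zero-fill can be illustrated without pandas. The sketch assumes each tool's hits are available as a mapping from contig ID to probability; `merge_tools` is a hypothetical name. With pandas, calling `fillna(0)` on the merged table has the same effect.

```python
def merge_tools(per_tool):
    """Outer-join per-tool hits, writing 0.0 where a tool did not report a contig.

    per_tool maps tool -> {contig_id: probability}.
    """
    contigs = sorted({cid for hits in per_tool.values() for cid in hits})
    return {
        cid: {tool: hits.get(cid, 0.0) for tool, hits in per_tool.items()}
        for cid in contigs
    }
```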
It would be nice to have the option to change the output format. nf-core/funcscan would be happier with TSV output, not CSV.
Put an extra column with sample_name first.
Then, add an extra flag to concatenate all output summaries into one summary (the sample name needs to be in an extra column, see issue #36)
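A sketch of the concatenation with `sample_name` as the first column and tab-separated output, using only the standard library. `concat_summaries` is a hypothetical name, and the per-sample summaries are assumed to be lists of row dictionaries.

```python
import csv
import io


def concat_summaries(summaries):
    """Combine per-sample summaries into one TSV text, sample_name first.

    summaries maps sample_name -> list of row dicts.
    """
    columns = ["sample_name"]
    for rows in summaries.values():
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    for sample, rows in summaries.items():
        for row in rows:
            writer.writerow({"sample_name": sample, **row})
    return buffer.getvalue()
```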
hmmsearch has a special output format, with a large header and footer. The output refers to one run with one hmm model.
The model name is in the line Query:.
How to treat results from several runs with different models on the same samples?
Query name (model), evalue, contig_id? (prior concatenation)
We need to be able to get the version number!
Jasmin observed that AMPcombi fails in the funcscan pipeline if only AMPIR is used as input. Macrel alone works, as does Ampir+Macrel. Test the other input tools.