lozuponelab / amon

Annotation of Metabolite Origin via Networks: A tool for predicting putative metabolite origins for microbes or between microbes and host with or without metabolomics data

License: MIT License

Python 100.00%

microbiome metabolome bioinformatics omics-data-integration

amon's Introduction

AMON

A command line tool for predicting the compounds produced by microbes and the host.

Installation

It is recommended to install AMON in a conda environment. The environment can be created by first downloading the environment file.

wget https://raw.githubusercontent.com/shafferm/AMON/master/environment.yaml

Then create a new conda environment using the environment file and activate it.

conda env create -f environment.yaml -n AMON
conda activate AMON

Then it can be installed via pip.

pip install AMON-bio

Alternative installation

Alternatively, AMON can be installed directly with pip, without first creating the conda environment.

pip install AMON-bio

Running AMON

AMON includes two scripts. extract_ko_genome_from_organism.py takes a KEGG organism flat file and makes a list of the KOs present in that file. amon.py predicts the metabolites that could be produced by the KOs used as input. This prediction can be compared to the KOs present in the host or in some other gene set, as well as to a set of KEGG metabolites.

`extract_ko_genome_from_organism.py`

A simple script. It takes a download of an organism file from KEGG, or a KEGG organism ID, and outputs a newline-separated list of the KOs present in that file.

extract_ko_genome_from_organism.py --help
usage: extract_ko_genome_from_organism.py [-h] -i INPUT -o OUTPUT
                                          [--from_flat_file]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        KEGG organism identifier or KEGG organism flat file
                        (default: None)
  -o OUTPUT, --output OUTPUT
                        Output file of new line separated list of KOs from
                        genome (default: None)
  --from_flat_file      Indicates that input is a flat flile to be parsered
                        directly (default: False)
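
As a rough illustration of the KO extraction step, the sketch below scans flat-file text for KO identifiers (the letter K followed by five digits) and returns them without duplicates. It is not the script's actual implementation: scanning the whole text with a regular expression is an assumption, the function names are hypothetical, and the example text is made up.

```python
import re

# KEGG Orthology identifiers are the letter K followed by five digits.
KO_PATTERN = re.compile(r"\bK\d{5}\b")


def extract_kos(flat_file_text):
    """Return the sorted, de-duplicated KO ids found in the text."""
    return sorted(set(KO_PATTERN.findall(flat_file_text)))


def write_ko_list(kos, output_path):
    """Write one KO id per line (a newline-separated list)."""
    with open(output_path, "w") as handle:
        handle.write("\n".join(kos) + "\n")


# Made-up text, for illustration only.
example = (
    "ORTHOLOGY   K00001  first entry\n"
    "ORTHOLOGY   K00002  second entry\n"
    "ORTHOLOGY   K00001  repeated entry\n"
)
print(extract_kos(example))
```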

`amon.py`

The full script to perform an analysis of possible metabolites originating from the list of KOs. From this list, together with optional lists of compounds detected via metabolomics and of KOs present in a host or other environment, a table of possible compound origins is generated. A pathway enrichment is also run on the compounds that could be generated, using the hypergeometric test. If either of the optional lists is included, a Venn diagram is generated showing the overlap between the compounds that can be produced and those that were measured. If both the bacterial and host KOs are given, a heatmap of pathway enrichments is also generated, and the enrichment test uses only compounds predicted to be generated uniquely by the bacteria or uniquely by the host.
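
To make the hypergeometric test concrete, here is a minimal sketch of an enrichment p-value for one pathway, using only the standard library. It assumes an upper-tail test (the probability of seeing at least the observed overlap); which tail AMON uses and how it defines the background are not stated here, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(background_size, pathway_size, produced_size, overlap):
    """P(X >= overlap) when produced_size compounds are drawn without
    replacement from a background that holds pathway_size pathway members."""
    upper = min(pathway_size, produced_size)
    total = comb(background_size, produced_size)
    tail = sum(
        comb(pathway_size, i)
        * comb(background_size - pathway_size, produced_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# 10 background compounds, 5 in the pathway, 5 produced, all 5 in the pathway.
print(hypergeom_enrichment_p(10, 5, 5, 5))
```

A small p-value here means the produced compounds fall in the pathway more often than a random draw from the background would.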

Inputs

The gene_set parameter is a list that can be in the form of a plain text file containing a whitespace-separated list of KO ids, a tsv or csv where the column labels are KO ids, or a biom formatted file where the observation ids are KO ids. These are the KOs that will be used to determine the compounds that could be generated by the bacterial community. This and the output directory, where all results will be written, are the only required parameters. There are two other optional inputs: detected_compounds and other_gene_set. detected_compounds is a set of compounds that were detected in metabolomics of the sample and can come in any of the forms available for gene_set. other_gene_set is a set of KO ids that are encoded by the host, or another set of genes that can be expressed as KO ids. This can also take any of the forms available for gene_set.
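
The two text-based input forms can be illustrated with a short sketch. This is not AMON's reader: the function name and the `fmt` argument are hypothetical, skipping the first column label of a table (on the assumption that it names the sample column) is a guess, and biom files are not handled.

```python
import csv
import io


def read_ko_ids(text, fmt="list"):
    """Return KO ids from a whitespace-separated list, or from the header
    row of a tsv/csv table whose column labels are KO ids."""
    if fmt == "list":
        return set(text.split())
    delimiter = "\t" if fmt == "tsv" else ","
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Assumption: the first label names the sample column, so it is skipped.
    return {label for label in header[1:] if label}


print(sorted(read_ko_ids("K00001 K00002\nK00003")))
print(sorted(read_ko_ids("sample\tK00001\tK00002\ns1\t3\t0\n", fmt="tsv")))
```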

Two flags are available that affect the Venn diagram and the enrichment analysis. detected_only restricts the background set of compounds for the hypergeometric test to the compounds that were detected. This flag requires the detected_compounds parameter to be given. The rn_compound_only flag restricts both the Venn diagram and the hypergeometric test to detected compounds that have a reaction associated with them in KEGG.

Finally, a set of locations for KEGG FTP downloaded files is available. These inputs are optional; if they are not provided, the KEGG API will be used to retrieve the necessary records. It is much faster to run with the KEGG FTP downloaded files if you have access to them.

NOTE: the KEGG API has limits. For small datasets (< 100 KOs/COs), data can be pulled quickly and in parallel. However, pulling all data for a reasonably sized dataset from the KEGG API will be rate-limited by KEGG and cannot be done in parallel. Sometimes KEGG will even deny the connection for this synchronous download if you have hit the request rate limit. If this happens, you may have to wait 30-60 minutes before trying again. If you have any suggestions for how to work within these limits, please create an issue or pull request with a fix. Otherwise, paying for a subscription to the KEGG FTP will avoid this issue entirely.
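
One general way to cope with refused connections is to retry with increasing waits. The sketch below shows the pattern only; it is not part of AMON, `fetch` stands in for whatever function downloads a single record, and the delays are placeholders that are far shorter than the 30-60 minute waits mentioned above.

```python
import time


def fetch_with_backoff(fetch, identifier, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(identifier), doubling the wait after each refused attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch(identifier)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetch that is refused twice, then succeeds.
calls = []
delays = []


def flaky_fetch(identifier):
    calls.append(identifier)
    if len(calls) < 3:
        raise ConnectionError("simulated refusal")
    return "record for " + identifier


# delays.append replaces time.sleep so the demonstration does not wait.
result = fetch_with_backoff(flaky_fetch, "K00001", sleep=delays.append)
print(result, delays)
```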

Outputs

All outputs are written to the output directory. If only the gene_set input is given, three files will be generated: origin_table.tsv, kegg_mapper.tsv and bacteria_enrichment.tsv. The origin_table.tsv has one row per compound that could be generated; its first column is true or false, indicating whether the bacterial KOs provided could generate that compound. If the other_gene_set input is provided, an additional column is added with true/false values indicating whether that set of KOs could generate each compound. If the detected_compounds parameter is given, an additional column is added with true/false values indicating whether each compound was detected.
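
The layout of the origin table can be sketched as follows. This is an illustration rather than AMON's code: the column names are hypothetical, and listing detected compounds that neither gene set can produce is an assumption.

```python
def build_origin_table(bacteria_cos, host_cos=None, detected_cos=None):
    """Return (header, rows): one row per compound and one True/False
    column for each compound set that was supplied."""
    columns = [("bacteria", set(bacteria_cos))]
    if host_cos is not None:
        columns.append(("host", set(host_cos)))
    if detected_cos is not None:
        columns.append(("detected", set(detected_cos)))
    all_cos = sorted(set().union(*(members for _, members in columns)))
    header = ["compound"] + [name for name, _ in columns]
    rows = [[co] + [co in members for _, members in columns] for co in all_cos]
    return header, rows


header, rows = build_origin_table(
    {"C00001", "C00002"}, host_cos={"C00002"}, detected_cos={"C00003"}
)
print(header)
for row in rows:
    print(row)
```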

To visualize the compounds predicted to be produced by the microbiome, and optionally by the host, along with the measured compounds, the kegg_mapper.tsv file can be used as input to KEGG Mapper. This will color the compounds. Blue compounds are generated only by the microbiome and yellow are generated only by the host. Green compounds could have been generated by both. Compounds that were detected have an orange outline, with a light orange fill if that compound was not predicted to be produced by the microbiome or the host.

The bacteria_enrichment.tsv file, and the host_enrichment.tsv file if the other_gene_set parameter is given, gives the results of the pathway enrichment analysis from the compounds able to be produced by the KOs provided. When the other_gene_set parameter is given a heatmap is made to compare the significant pathways present from the bacteria and host KO lists.

When the other_gene_set and/or detected_compounds parameters are given, a Venn diagram will be made to show the overlap in compounds possibly generated or detected.
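
The counts behind a three-set Venn diagram come down to set arithmetic. The sketch below is an assumption about how the regions could be tallied, with hypothetical names, and does not draw anything.

```python
def venn_region_sizes(bacteria_cos, host_cos, detected_cos):
    """Count the compounds in each of the seven regions of a 3-set Venn."""
    b, h, d = set(bacteria_cos), set(host_cos), set(detected_cos)
    return {
        "bacteria only": len(b - h - d),
        "host only": len(h - b - d),
        "detected only": len(d - b - h),
        "bacteria and host": len((b & h) - d),
        "bacteria and detected": len((b & d) - h),
        "host and detected": len((h & d) - b),
        "all three": len(b & h & d),
    }


sizes = venn_region_sizes(
    {"C00001", "C00002", "C00003"}, {"C00002", "C00004"}, {"C00003", "C00002"}
)
print(sizes)
```

A region count of zero for an entire set is a quick sign that one of the inputs produced no compounds.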

Full help

amon.py --help
usage: amon.py [-h] -i GENE_SET -o OUTPUT_DIR
               [--detected_compounds DETECTED_COMPOUNDS]
               [--other_gene_set OTHER_GENE_SET] [--detected_only]
               [--rn_compound_only] [--ko_file_loc KO_FILE_LOC]
               [--rn_file_loc RN_FILE_LOC] [--co_file_loc CO_FILE_LOC]
               [--pathway_file_loc PATHWAY_FILE_LOC] [--save_entries]
               [--verbose]

optional arguments:
  -h, --help            show this help message and exit
  -i GENE_SET, --gene_set GENE_SET
                        KEGG KO's from bacterial community or organism of
                        interest in the form of a white space separated list,
                        a tsv or csv with KO ids as column names or a biom
                        file with KO ids as observations (default: None)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        directory to store output (default: None)
  --detected_compounds DETECTED_COMPOUNDS
                        list of compounds detected via metabolomics (default:
                        None)
  --other_gene_set OTHER_GENE_SET
                        white space separated list of KEGG KO's from the host,
                        another organism or other environment (default: None)
  --detected_only       only use detected compounds in enrichment analysis
                        (default: False)
  --rn_compound_only    only use compounds with associated reactions (default:
                        False)
  --ko_file_loc KO_FILE_LOC
                        Location of ko file from KEGG FTP download (default:
                        None)
  --rn_file_loc RN_FILE_LOC
                        Location of reaction file from KEGG FTP download
                        (default: None)
  --co_file_loc CO_FILE_LOC
                        Location of compound file from KEGG FTP download
                        (default: None)
  --pathway_file_loc PATHWAY_FILE_LOC
                        Location of pathway file from KEGG FTP download
                        (default: None)
  --save_entries        Save json file of KEGG entries at all levels used in
                        analysis for deeper analysis (default: False)
  --verbose             verbose output (default: False)

amon's Issues

KEGG API timeouts

Documenting email conversation with @acolorado1 in case others run into this

Sofia:

I am currently trying to run 3 pairwise comparisons with AMON (i.e., 3 files total, and AMON takes 2 at a time). I have managed to run 2 of the comparisons, but the third keeps failing with a connection error. Does this have to do with the limits of the KEGG API? I thought that limitation had more to do with the file size, which would not apply in this case, as the files have already been used in a previous comparison without any issue. I completely understand if this is not enough information (or if I explained it poorly); I was just hoping you might have some thoughts on the issue.

The issue miraculously resolved itself. I guess it needed time between queries? Not sure, would still be interested in your thoughts on this.

John:

What exactly was the connection error? Sometimes the KEGG API will boot you out for like an hour and then you have to wait to get back in. This "time out" used to be shorter but they've been gradually making it longer.

Sofia:

I definitely think that is what happened. There was a long message of code and pathway information that ended in a connection error, which I unfortunately did not document. This happened only after I had been running AMON for over 3 hours; when I returned about an hour later, it worked again. Maybe this would be something good to mention in the AMON README, as it was a bit unexpected.

I don't necessarily think the issue is file size, just seems like KEGG API requests are limited on a roughly per minute basis and maybe per hour as well (so larger files are more susceptible, but if we can slow the request rate it wouldn't be an issue). It used to just time users out for a few minutes and then you could start again, but it seems now like it may be timing users out for 30+ minutes, so longer wait periods are sometimes necessary. This issue only happens if AMON users rely on making requests from KEGG because they haven't paid for a local copy of the database itself.

I do agree that mentioning this either in the README or here would be good. For quality of life and robustness, I'm also thinking that if a user passes --save_entries, AMON should export/save any parsed entries if the connection times out, and we could add something like a --resume flag to pick up where that left off, so that a user doesn't have to start from scratch after a failed connection issue.

What do you think, @acolorado1?
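
To make the proposal concrete, here is a minimal sketch of how saving partial results and resuming might work. Nothing here exists in AMON: the `--resume` flag is only the suggestion above, and `fetch`, `checkpoint_path` and the JSON layout are hypothetical.

```python
import json
import os


def fetch_all_with_checkpoint(identifiers, fetch, checkpoint_path):
    """Fetch each identifier once, saving progress so a rerun can resume."""
    entries = {}
    if os.path.exists(checkpoint_path):
        # Resume: reuse whatever an earlier, interrupted run already saved.
        with open(checkpoint_path) as handle:
            entries = json.load(handle)
    try:
        for identifier in identifiers:
            if identifier not in entries:
                entries[identifier] = fetch(identifier)
    finally:
        # Runs on success and on failure, so partial results are kept.
        with open(checkpoint_path, "w") as handle:
            json.dump(entries, handle)
    return entries
```

If the connection drops partway through, the records fetched so far are written out, and a second call with the same checkpoint_path requests only the missing identifiers.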

Bad http request error

Hi,

I am getting a bad http request error when trying to run AMON. Doesn't seem to matter what I use as input, even with 10 KO's as input I still get the same error.

$ amon.py -i amon_input/picrust_kos.tsv --detected_compounds amon_input/AMON_input-metabolome.tsv --other_gene_set_name amon_input/mouse_genome_ko_list.tsv -o amon_output

Traceback (most recent call last):
File "C:/Users/adamsorbie/miniconda3/envs/amon/Scripts/amon.py", line 76, in <module>
co_file_loc=co_file_loc, pathway_file_loc=pathway_file_loc, write_json=write_json)
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\site-packages\AMON\predict_metabolites.py", line 283, in main
ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\site-packages\KEGG_parser\downloader.py", line 55, in get_kegg_record_dict
records = get_from_kegg_api(loop, list_of_ids, parser)
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\site-packages\KEGG_parser\downloader.py", line 49, in get_from_kegg_api
return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\asyncio\base_events.py", line 587, in run_until_complete
return future.result()
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\site-packages\KEGG_parser\downloader.py", line 43, in kegg_download_manager
results = await asyncio.gather(*tasks)
File "C:\Users\adamsorbie\miniconda3\envs\amon\lib\site-packages\KEGG_parser\downloader.py", line 32, in download_coroutine
raise ValueError('Bad HTTP request status %s: %s\n%s' % (response.status, response.reason, url))
ValueError: Bad HTTP request status 400: Bad Request
http://rest.kegg.jp/get/Unnamed: 2+Unnamed: 3+Unnamed: 1+1

Do you have any idea why this is happening?
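
The identifiers in the failing URL (Unnamed: 2, Unnamed: 3, Unnamed: 1, 1) are not KO ids. Labels of the form "Unnamed: N" are what pandas assigns to blank header cells, which suggests, as an inference rather than a confirmed diagnosis, that the header row of the input table does not hold KO ids. A minimal pre-flight check, with hypothetical names, could separate valid from invalid labels before anything is sent to KEGG:

```python
import re

# KEGG Orthology identifiers are the letter K followed by five digits.
KO_ID = re.compile(r"K\d{5}")


def split_valid_kos(labels):
    """Separate labels that look like KO ids from those that do not."""
    valid = [label for label in labels if KO_ID.fullmatch(str(label))]
    invalid = [label for label in labels if not KO_ID.fullmatch(str(label))]
    return valid, invalid


valid, invalid = split_valid_kos(["K00001", "Unnamed: 2", "Unnamed: 3", "1"])
print(valid)
print(invalid)
```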

Min pathway size

Error message says to decrease min pathway size. Let's not add that as a param for now, min pathway should be a decent size (pathways < 10 may not be the most reliable for our use).

Therefore, the error shouldn't say to decrease min pathway size.

Error in specific samples

Hi,

I'm using AMON on my metagenomic data. I have 79 MAGs, and for 70 of them I was able to run it without problems. But for 9 of them I get the following error. I believe that it may be due to an unrecognized KO annotation, but I don't know how to figure out which ones.

amon.py -i ko_list.txt -o ../teste
Traceback (most recent call last):
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/bin/amon.py", line 74, in <module>
main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/AMON/predict_metabolites.py", line 283, in main
ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 55, in get_kegg_record_dict
records = get_from_kegg_api(loop, list_of_ids, parser)
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 49, in get_from_kegg_api
return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 43, in kegg_download_manager
results = await asyncio.gather(*tasks)
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/KEGG_parser/downloader.py", line 30, in download_coroutine
return await response.text()
File "/home/ABTLUS/lucas.maciel/anaconda3/envs/AMON/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1014, in text
return self._body.decode(encoding, errors=errors) # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 80011: invalid start byte
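
The traceback ends where the response body is decoded as UTF-8 and hits byte 0xa0, which is a non-breaking space in Latin-1. One possible way to tolerate such bytes is a decoding fallback, sketched below with a hypothetical function name. This is not a change in KEGG_parser or aiohttp, and it does not show which KO record contains the byte.

```python
def decode_kegg_body(raw_bytes):
    """Decode as UTF-8, falling back to Latin-1 for stray bytes like 0xa0."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw_bytes.decode("latin-1")


print(repr(decode_kegg_body(b"NAME\xa0example")))
```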

Issue

Hello,
Thanks for your work.

I am trying to run AMON but I do not seem to be able to get any results out.

Specifically, I always encounter this problem:
amon.py -i Komicro.txt -o output_AMON --detected_compounds COlist1.txt --other_gene_set mus_mucus1.txt
Traceback (most recent call last):
File "/Users/gabri/opt/miniconda3/envs/AMON/bin/amon.py", line 74, in <module>
main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
File "/Users/gabri/opt/miniconda3/envs/AMON/lib/python3.10/site-packages/AMON/predict_metabolites.py", line 283, in main
ko_dict = get_kegg_record_dict(set(all_kos), parse_ko, ko_file_loc)
File "/Users/gabri/opt/miniconda3/envs/AMON/lib/python3.10/site-packages/KEGG_parser/downloader.py", line 55, in get_kegg_record_dict
records = get_from_kegg_api(loop, list_of_ids, parser)
File "/Users/gabri/opt/miniconda3/envs/AMON/lib/python3.10/site-packages/KEGG_parser/downloader.py", line 49, in get_from_kegg_api
return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
File "/Users/gabri/opt/miniconda3/envs/AMON/lib/python3.10/site-packages/KEGG_parser/downloader.py", line 49, in <listcomp>
return [parser(raw_record) for raw_record in loop.run_until_complete(kegg_download_manager(loop, list_of_ids))]
File "/Users/gabri/opt/miniconda3/envs/AMON/lib/python3.10/site-packages/KEGG_parser/parsers.py", line 195, in parse_ko
raise ValueError('What is {} in {}?'.format(current_entry_name, ko_dict['ENTRY']))
ValueError: What is SYMBOL in K09637?

Any idea of why this is happening?

Thanks in advance!
Gabri
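
The error is raised by parse_ko when it meets a field name (here SYMBOL) that it does not recognize. For illustration only, the sketch below shows a parser that keeps unknown fields instead of raising. It is not the KEGG_parser implementation; it assumes the usual KEGG flat-file layout, where the field name occupies the first 12 columns and continuation lines leave them blank, and the record content is made up.

```python
def parse_kegg_fields(raw_record):
    """Group the lines of a KEGG flat-file record by field name, keeping
    fields that are not known in advance instead of raising an error."""
    fields = {}
    current = None
    for line in raw_record.splitlines():
        if line.startswith("///") or not line.strip():
            continue  # skip the record terminator and blank lines
        key, value = line[:12].strip(), line[12:].strip()
        if key:  # a new field starts when the first 12 columns hold a name
            current = key
            fields.setdefault(current, [])
        if current is not None:
            fields[current].append(value)
    return fields


# Made-up record content, padded to the 12-column layout with ljust.
record = "\n".join([
    "ENTRY".ljust(12) + "K09637 KO",
    "SYMBOL".ljust(12) + "EXAMPLE",
    "NAME".ljust(12) + "illustrative entry",
    "".ljust(12) + "continued text",
    "///",
])
print(parse_kegg_fields(record))
```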

Format of KEGG input files?

Thank you very much for making your program available!
It looks like the user would be able to input their own reaction and pathway files instead of KEGG's. Would you be able to share the expected file formats for those for use in AMON?

Output explanation

Hello!
Could you tell me where to find information about the "pathway size" and "overlap" columns in the "gene_set_1_compound_pathway_enrichment.tsv" file? I do not know what they mean.
Thanks!

Error after install

Hello,
I installed AMON on an HPC cluster with conda and the YAML file, and I get these errors:

/home/csanchez/.conda/envs/AMON/bin/amon.py: line 1: import: command not found
/home/csanchez/.conda/envs/AMON/bin/amon.py: line 3: from: command not found
/home/csanchez/.conda/envs/AMON/bin/amon.py: line 6: syntax error near unexpected token `('
/home/csanchez/.conda/envs/AMON/bin/amon.py: line 6: `    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)'

If I go to the environment folder and execute python amon.py, it works. So maybe it's a problem with the environment setup.

AMON is getting no reactions from the KOs...

This is a new issue...

With polyomic dataset

Total number of KOs: 12707
Total number of reactions: 0
...
Number of cos produced across samples: 0

Previously was:

Total number of KOs: 717
Total number of reactions: 1019
...
Number of cos produced across samples: 796

REACTION was added 2 weeks ago to the KEGG_Parser KO not-captured fields - I wonder if something is going on there. If it isn't there, the parser errors with "What is REACTION in ?"

Error while using KEGG FTP Ko list.

Dear Team,

I ran AMON with the picrust2 predicted KOs and the detected compound KEGG IDs. I was able to get an enrichment heatmap and a KEGG Mapper file. But when I added the host KO list and the corresponding KEGG FTP file, I ran into an error. It would be of great help if you could let me know your insights on this. Please find the command used and the error message below. Looking forward to your suggestions.

Thank you

command used:
amon.py --gene_set Picrust2_metagenome_KO.txt --detected_compounds Human_mesured_metabolite_withKEGG_id.txt --other_gene_set Host_human_KO_id.txt -o Test_with_hsa_KO --ko_file_loc /AMON/HSA_GENES/hsa_link/hsa_ko.list

Error message:
Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/matplotlib_venn/_venn3.py:57: UserWarning: Circle B has zero area
warnings.warn("Circle B has zero area")
/Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/matplotlib_venn/_venn3.py:61: UserWarning: Circle C has zero area
warnings.warn("Circle C has zero area")
Traceback (most recent call last):
File "/Tools/anaconda3/envs/AMON/bin/amon.py", line 74, in <module>
main(kos_loc, output_dir, other_kos_loc, detected_compounds, name1, name2, keep_separated, samples_are_columns,
File "Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/AMON/predict_metabolites.py", line 364, in main
pathway_enrichment_df = calculate_enrichment(cos_produced, pathway_to_compound_dict)
File "/Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/AMON/predict_metabolites.py", line 243, in calculate_enrichment
enrichment_table['adjusted probability'] = p_adjust(enrichment_table.probability)
File "/Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/AMON/predict_metabolites.py", line 38, in p_adjust
res = multipletests(pvalues, method=method)
File "/Tools/anaconda3/envs/AMON/lib/python3.9/site-packages/statsmodels/stats/multitest.py", line 147, in multipletests
alphacSidak = 1 - np.power((1. - alphaf), 1./ntests)
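
The division that fails is 1./ntests, which is consistent with an empty list of p-values reaching the correction step, although that reading is an inference from the traceback. A minimal sketch of a guard is shown below. It uses a plain Benjamini-Hochberg adjustment as a stand-in, since the method AMON passes to multipletests is not shown here, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    if n == 0:
        # Nothing to correct; returning early avoids dividing by zero tests.
        return []
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


print(benjamini_hochberg([]))
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```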
ZeroDivisionError: float division by zero