evolgeniusteam / gmrepoprogrammableaccess

programmable access to GM repo

License: GNU General Public License v3.0

gmrepoprogrammableaccess's Issues

mapping microbiome->phenotype labels by run/sample id

Hi, I have downloaded the microbial abundances and phenotype information. But I can't find a data dictionary or a way to map the run/sample ids to the microbiome+phenotype data (e.g., I want to be able to map the microbial abundances to the phenotypes of each run/sample so I can have labeled data for testing a machine learning classifier). How can I do that by using the programmable access tool?
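As a rough illustration of the kind of mapping being asked about, here is a minimal sketch that pairs each run's abundances with its phenotype labels by run id. It assumes each per-run record carries a "run" object, a "species" list with scientific_name and relative_abundance fields, and a "phenotypes" list with a "disease" MeSH ID, as in the response excerpts quoted in other issues on this page. The function name build_labeled_rows and the example record are hypothetical, and real responses may be shaped differently.

```python
def build_labeled_rows(run_records):
    """Turn per-run records into (run_id, features, labels) tuples."""
    rows = []
    for record in run_records:
        run_id = record["run"]["run_id"]
        # Feature vector: taxon name -> relative abundance (percent).
        features = {
            taxon["scientific_name"]: taxon["relative_abundance"]
            for taxon in record.get("species") or []
        }
        # Labels: MeSH IDs of the phenotypes attached to this run.
        labels = [p["disease"] for p in record.get("phenotypes") or []]
        rows.append((run_id, features, labels))
    return rows


# Hypothetical record, assembled from values quoted elsewhere on this page.
example = [{
    "run": {"run_id": "ERR475468"},
    "species": [
        {"scientific_name": "Ruminococcus bromii", "relative_abundance": 11.7594},
        {"scientific_name": "Bacteroides vulgatus", "relative_abundance": 3.96235},
    ],
    "phenotypes": [{"disease": "D006262", "term": "Health"}],
}]
labeled = build_labeled_rows(example)
print(labeled)
```

Keying everything on run_id keeps the features and labels aligned even if the abundance and phenotype data are fetched separately.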

Cannot access GMrepo (10 November 2022)

Hi,

Is there some maintenance task running on GMrepo? I have been trying to access it all day from different browsers, but in every case the session times out and the GMrepo main page does not load. I also cannot access specific projects...

Does anyone else have the same problem?

I searched online for a reported incident but did not find anything, so I am writing here. Apologies.

Thanks,

HTTP 500 error when fetching abundance through API

Hi, I get an HTTP 500 error when fetching data through the Python API for certain projects. The following code reproduces the error.

import json
import requests

mesh_id = "D008103"
project_id = "PRJNA431746"
query = {"mesh_id": mesh_id, "project_id": project_id}

# Query data
url = 'https://gmrepo.humangut.info/api/getMicrobeAbundancesByPhenotypeMeshIDAndProjectID'
post_result = requests.post(url, data=json.dumps(query))

print(post_result)
# <Response [500]>
print(post_result.text == "")
# True
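
For anyone hitting the same thing, a small guard before decoding the response can make the failure easier to spot than a JSON decoding error further down. This is only a sketch: parse_api_response is a hypothetical helper, not part of the GMrepo tooling, and it says nothing about why the server returns 500.

```python
import json


def parse_api_response(status_code, body_text):
    """Return decoded JSON, or raise a descriptive error for failed or empty responses."""
    if status_code != 200:
        raise RuntimeError(f"API request failed with HTTP {status_code}")
    if not body_text:
        raise RuntimeError("API returned an empty body")
    return json.loads(body_text)


# With requests, this would be called as
# parse_api_response(post_result.status_code, post_result.text).
try:
    parse_api_response(500, "")
except RuntimeError as err:
    message = str(err)
print(message)
```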

Misalignment between columns and values when downloading with search functionality

Hi there,

When using the "download data as TSV" functionality on the main GMrepo webpage, the header row of the .txt file has 13 columns while the value rows have only 12. I believe "Disease name" should be dropped from the header, but it would be great to confirm this with the team.
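
A quick way to check a downloaded file for this kind of misalignment is to compare the field count of every row against the header. The sketch below uses only the standard library; the function name find_column_mismatches and the tiny sample text are made up for illustration and are not actual GMrepo output.

```python
import csv
import io


def find_column_mismatches(tsv_text):
    """Return (header_width, [(line_number, row_width), ...]) for rows that differ from the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    mismatches = [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)
        if row and len(row) != len(header)
    ]
    return len(header), mismatches


# Made-up sample: three header fields but only two values per row.
sample = "run_id\tdisease_name\tcountry\nRUN1\tFrance\nRUN2\tChina\n"
header_width, bad_rows = find_column_mismatches(sample)
print(header_width, bad_rows)
```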

Does GMRepo still exist? The link does not seem to be working

The link for GMRepo (https://gmrepo.humangut.info/) produces a "Site can't be reached" error.

If anyone knows how to solve this issue, that would be highly appreciated. Thank you.

incomplete number of runs by restful api getAssociatedRunsByPhenotypeMeshIDLimit

Hi, when I use the countAssociatedRunsByPhenotypeMeshID API to get the total number of runs for a phenotype, for example D003424, I get 2671 runs. But when I use the getAssociatedRunsByPhenotypeMeshIDLimit API to retrieve the runs with a limit of 1000 per request, I can only get 2000 runs. This is very weird, please help me out, thanks!

For reproducibility, the Python script is attached.
gmrepo_age_distribution.zip
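
For readers without the attachment, the paging pattern described above can be sketched roughly as follows. The loop itself is generic. The idea that the endpoint takes an offset and a page size (called skip and limit here) is an assumption, and fetch_page is a hypothetical callable that would wrap the actual requests.post call. The fake backend below only stands in for the server so the sketch runs offline.

```python
def fetch_all_runs(fetch_page, page_size=1000):
    """Collect every run by requesting pages until one comes back short or empty."""
    runs = []
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        runs.extend(page)
        if len(page) < page_size:
            break
        skip += page_size
    return runs


# Fake backend holding 2671 run ids, standing in for the real API.
all_ids = [f"RUN{i:05d}" for i in range(2671)]


def fake_fetch_page(skip, limit):
    return all_ids[skip:skip + limit]


runs = fetch_all_runs(fake_fetch_page)
print(len(runs))
```

Comparing len(runs) against the number reported by countAssociatedRunsByPhenotypeMeshID makes a shortfall like the one described here visible immediately.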

Curated projects

Is there a way to get a list of curated projects through the API?

relative abundance to counts

Hello, and thanks for your repository.
I have been looking at the data, and I see that all accessible abundances are relative abundances. I haven't found any way to download count data. Is count data available?
Also, since I am interested in counts and am looking for a way to calculate them from relative abundance, I found a field called "nr_reads_sequenced" and I wonder what that number represents. For example, for metagenomic data, does nr_reads_sequenced mean all the reads sequenced, or bins?

taxonomy database for amplicon data

Could I please clarify one technical question? It is stated in the article (https://academic.oup.com/nar/article/50/D1/D777/6426060#authorNotesSectionTitle) that mapping to the GreenGenes database was used for amplicon data, while I see the NCBI taxonomy in the database itself. Could you please tell me how the transition from the GreenGenes taxonomy to NCBI was made?

Thank you in advance!

Query full abundance of certain run

Hi, I followed the API docs for "Get relative species/genus abundances for a sample/run" but only retrieved run information.

code

import json

import pandas as pd
import requests

query = {"run_id": "ERR475468"}
url = 'https://gmrepo.humangut.info/api/getRunDetailsByRunID'
data = requests.post(url, data=json.dumps(query)).json()

## --get run List
run = data.get("run")

## --get DataFrames
species = pd.DataFrame(data.get("species"))
genus = pd.DataFrame(data.get("genus"))

response data

{'run': {'project_id': 'PRJEB6070',
  'original_sample_description': 'Potential of fecal microbiota for early stage detection of colorectal cancer',
  'run_id': 'ERR475468',
  'experiment_type': 'Amplicon',
  'instrument_model': 'Illumina',
  'nr_reads_sequenced': None,
  'host_age': 74,
  'sex': None,
  'BMI': 27,
  'country': 'France',
  'longitude': None,
  'latitude': None,
  'loaded_uid': 54204,
  'QCStatus': 0,
  'QCMessage': 'a single taxon  unknown  account for 100 percent of abundance, which is too much!!',
  'Original_Project_description': 'Several bacterial species have been implicated in the development of colorectal carcinoma (CRC), but CRC-associated changes of fecal microbiota and their potential for cancer screening remain to be explored. Here we used metagenomic sequencing of fecal samples to identify taxonomic markers that distinguished CRC patients from tumor-free controls in a study population of 156 participants. Accuracy of metagenomic CRC detection was similar to the standard fecal occult blood test (FOBT) and when both approaches were combined, sensitivity improved >45% relative to the FOBT while maintaining its specificity. Accuracy of metagenomic CRC detection did not differ significantly between early and late-stage cancer and could be validated in independent patient and control populations (N=335) from different countries. CRC-associated changes in the fecal microbiome at least partially reflected microbial community composition at the tumor itself, indicating that observed gene pool differences may reveal tumor-related host-microbe interactions. Indeed, we deduced a metabolic shift from fiber degradation in controls to utilization of host carbohydrates and amino acids in CRC patients accompanied by an increase of lipopolysaccharide metabolism. '},
 'phenotypes': [{'disease': 'D006262', 'term': 'Health'}],
 'phenotypes_exist': True}

Could you please help me retrieve full info?

API and website report different abundance profiles

Retrieving the abundance profile for the same sample through the API and through the website returns different results.

When using getRunDetailsByRunID as described in the documentation, the abundance profile for sample ERR475468 is as follows:

scientific_name	relative_abundance
Others	33.1793295
Unknown	30.255
Ruminococcus bromii	11.7594
Faecalibacterium sp. MC_41	5.4017
Bacteroides vulgatus	3.96235
[Eubacterium] eligens	3.09785
Bacteroides uniformis	2.79168
Escherichia coli	2.6621
Sphingomonas sanguinis	2.39532
Dialister invisus	2.33688
Sphingomonas paucimobilis	2.15839

However, when downloading the relative species abundance table as a TSV from the website, the abundance profile for sample ERR475468 is as follows:

relative_abundance	scientific_name
30.255	Unknown
11.7594	Ruminococcus bromii
5.4017	Faecalibacterium sp. MC_41
3.96235	Bacteroides vulgatus
3.09785	[Eubacterium] eligens
2.79168	Bacteroides uniformis
2.6621	Escherichia coli
2.39532	Sphingomonas sanguinis
2.33688	Dialister invisus
2.15839	Sphingomonas paucimobilis
2.04215	Oscillibacter valericigenes
1.88589	Blautia obeum
1.88018	Streptococcus salivarius
1.09825	Methanobrevibacter smithii
0.971213	Streptococcus mutans
etc	etc

When downloading data from the website, the taxonomic breakdown of the "Others" group is reported. I am interested in 100s of samples and don't want to download their profiles manually.

How can I retrieve the full taxonomic profile programmatically?
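
As a side check on the numbers quoted above, the "Others" value in the API profile appears to be the remainder after the ten listed taxa, i.e. 100 minus their sum. The sketch below just redoes that arithmetic with the values copied from the API table in this issue. It does not retrieve anything from GMrepo and does not answer how to get the full breakdown.

```python
# Values copied from the API profile for ERR475468 quoted in this issue.
api_profile = {
    "Others": 33.1793295,
    "Unknown": 30.255,
    "Ruminococcus bromii": 11.7594,
    "Faecalibacterium sp. MC_41": 5.4017,
    "Bacteroides vulgatus": 3.96235,
    "[Eubacterium] eligens": 3.09785,
    "Bacteroides uniformis": 2.79168,
    "Escherichia coli": 2.6621,
    "Sphingomonas sanguinis": 2.39532,
    "Dialister invisus": 2.33688,
    "Sphingomonas paucimobilis": 2.15839,
}

# Sum of the explicitly listed taxa, excluding the aggregated "Others" row.
listed_total = sum(v for name, v in api_profile.items() if name != "Others")
remainder = 100.0 - listed_total

print(round(listed_total, 5), round(remainder, 5), api_profile["Others"])
```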

Incomplete curated projects

Hi there,

This is an extension to #8, where getCuratedProjectsList method was added to the API.

Here's the code I used to fetch the list of curated project IDs.

import json

import requests


def get_curated_project_ids():
    # Fetch the curated project list and return the project IDs as a set
    query = {}
    url = 'https://gmrepo.humangut.info/api/getCuratedProjectsList'
    content = requests.post(url, data=json.dumps(query))

    project_id_set = {x["project_id"] for x in content.json()}
    return project_id_set

After running this code, I manually checked whether known curated project IDs are included in the output. For example, PRJEB1775 is a project involving metagenomic samples with diarrhea. However,

pid_set = get_curated_project_ids()
"PRJEB1775" in pid_set
# False

Is it possible that getCuratedProjectsList returns an incomplete list of project IDs?