Generates sample-specific protein databases from raw peptide spectra

License: BSD 2-Clause "Simplified" License


Kaiko Pipeline

Introduction

Put simply, this tool takes raw proteomic input and outputs a FASTA file covering the organisms most likely to be present in that input.

The pipeline uses a neural network to identify peptide sequences from the raw proteomic input, then aligns them against the UniRef100 protein database using a DIAMOND search. The alignment results indicate which organisms are most likely present in the samples, and the pipeline assembles a FASTA file from those organisms.
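The stages above can be sketched as follows. This is a hypothetical outline, not the actual implementation: the function names are stand-ins for the real pipeline modules in Kaiko_pipeline_main.py, and the bodies are toy placeholders.

```python
# Hypothetical sketch of the Kaiko pipeline stages; the real pipeline
# (Kaiko_pipeline_main.py) differs in names and details.
from collections import Counter

def denovo_sequence(spectra):
    """Stand-in for the neural-network de novo sequencing step."""
    # Each raw spectrum is translated into a candidate peptide string.
    return ["PEPTIDE%d" % i for i, _ in enumerate(spectra)]

def diamond_align(peptides):
    """Stand-in for the DIAMOND search against UniRef100."""
    # Each peptide maps to the taxon of its best-scoring protein hit.
    return {pep: "taxon_%d" % (i % 2) for i, pep in enumerate(peptides)}

def top_taxa(hits, n=1):
    """Rank taxa by how many peptides aligned to them."""
    counts = Counter(hits.values())
    return [taxon for taxon, _ in counts.most_common(n)]

def build_fasta(taxa):
    """Stand-in for writing the proteomes of the top taxa to FASTA."""
    return "\n".join(">%s\nMSEQUENCE" % t for t in taxa)

spectra = ["spec1", "spec2", "spec3"]
fasta = build_fasta(top_taxa(diamond_align(denovo_sequence(spectra)), n=1))
```

The real steps (model download, database construction, alignment) are described in the Setup and Usage sections below.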

Setup

The pipeline uses Python 3.10 and TensorFlow 2.11.0. The full list of requirements can be found in Kaiko_volume/setup_libraries.txt.

Before first use, a few files are needed.

Downloading Files

  1. Run the file Kaiko_denovo/model/get_data.sh to download the trained Kaiko denovo model.

Download the following files to the Kaiko_volume/Kaiko_stationary_files folder.

  1. UniRef100 FASTA (large file, 80 GB+).

  2. UniRef100 XML (large file, 100 GB+).

  3. NCBI Taxonomy dump (less than 1 GB).

  4. DIAMOND, choosing the release appropriate for your system. If using Docker, get the Linux version.

Processing

  1. Extract the DIAMOND archive from step 4 into its own folder within Kaiko_volume/Kaiko_stationary_files, e.g. Kaiko_volume/Kaiko_stationary_files/diamond.

  2. In a command prompt, navigate to the diamond folder created in the previous step and run diamond makedb --in ../uniref100.fasta.gz --db ../uniref100. This builds the DIAMOND database and can take a while. Note: on Linux or Mac, replace diamond with ./diamond.

  3. Extract the contents of the NCBI Taxonomy dump into its own folder within Kaiko_volume/Kaiko_stationary_files, e.g. Kaiko_volume/Kaiko_stationary_files/ncbi_taxa.

  4. In a command prompt, navigate to the Kaiko_volume/Kaiko_stationary_files folder and run python ExtractUniRefMembers.py. This creates the file uniref100_member_taxa_tbl.csv within Kaiko_volume/Kaiko_stationary_files. Copy it into the taxa folder from step 3, e.g. Kaiko_volume/Kaiko_stationary_files/ncbi_taxa. This step can also take some time.
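The extraction in step 4 maps UniRef entries to the taxa of their member proteins. The actual logic of ExtractUniRefMembers.py may differ; the sketch below assumes a simplified, namespace-free stand-in for the UniRef XML layout, in which members carry an "NCBI taxonomy" property.

```python
# Sketch of extracting member taxa from UniRef100 XML entries.
# The real ExtractUniRefMembers.py may differ; the XML below is a
# simplified stand-in for the UniRef format (the real file is namespaced).
import xml.etree.ElementTree as ET
import csv, io

SAMPLE = """<UniRef100>
  <entry id="UniRef100_P99999">
    <member><dbReference type="UniProtKB ID" id="CYC_HUMAN">
      <property type="NCBI taxonomy" value="9606"/>
    </dbReference></member>
    <member><dbReference type="UniProtKB ID" id="CYC_PANTR">
      <property type="NCBI taxonomy" value="9598"/>
    </dbReference></member>
  </entry>
</UniRef100>"""

def member_taxa(xml_text):
    """Map each UniRef entry id to the taxa of its members."""
    root = ET.fromstring(xml_text)
    table = {}
    for entry in root.findall("entry"):
        taxa = [p.get("value")
                for p in entry.iter("property")
                if p.get("type") == "NCBI taxonomy"]
        table[entry.get("id")] = taxa
    return table

def write_table(table, fh):
    """Write entry-to-taxa rows, roughly like uniref100_member_taxa_tbl.csv."""
    writer = csv.writer(fh)
    writer.writerow(["uniref_id", "member_taxa"])
    for uid, taxa in table.items():
        writer.writerow([uid, ";".join(taxa)])

buf = io.StringIO()
write_table(member_taxa(SAMPLE), buf)
```

The real script streams the 100 GB+ XML rather than loading it whole, which is why the step takes time.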

Check

When setup is complete, Kaiko_volume/Kaiko_stationary_files should contain two new files, uniref100.dmnd and uniref100.fasta. It should also contain two folders (if using the default names): Kaiko_volume/Kaiko_stationary_files/diamond and Kaiko_volume/Kaiko_stationary_files/ncbi_taxa. The diamond folder should contain the diamond executable, while the taxa folder should contain the contents of the NCBI Taxonomy dump (the .dmp files) and the file uniref100_member_taxa_tbl.csv. If the names of these two folders differ from the defaults used in this readme, edit config.yaml to point to the new folders; see the repo's config.yaml for an example.
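The checks above can be automated. This is a small sketch, not part of the repo; the paths follow the defaults in this readme and can be adjusted to match your config.yaml.

```python
# Quick sanity check of the setup described above. Not part of the repo;
# paths follow this readme's defaults.
from pathlib import Path

def check_setup(stationary=Path("Kaiko_volume/Kaiko_stationary_files")):
    """Return a list of expected files/folders that are missing."""
    expected = [
        stationary / "uniref100.dmnd",
        stationary / "uniref100.fasta",
        stationary / "diamond",
        stationary / "ncbi_taxa",
        stationary / "ncbi_taxa" / "uniref100_member_taxa_tbl.csv",
    ]
    return [str(p) for p in expected if not p.exists()]
```

Run check_setup() from the repo root; an empty list means the setup looks complete.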

Usage

Currently, only .mgf files are supported. To use, simply follow these steps.

  1. Place the input into a separate folder WITHIN the Kaiko_volume/Kaiko_input_files/ directory. This folder should have a descriptive name.

  2. Edit the config.yaml file within the Kaiko_volume directory to include the location of the folder with the input. An example can be found in the repo's config.yaml.

  3. Run the command python Kaiko_pipeline_main.py within the main directory of this repo. The kaiko_defaults.yaml file fills in any necessary parameters not present in config.yaml.
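A config.yaml for steps 1-2 might look like the fragment below. The key names here are hypothetical illustrations only; the actual schema is defined by kaiko_defaults.yaml and the example config.yaml in the repo.

```yaml
# Hypothetical config.yaml sketch; the real key names come from
# kaiko_defaults.yaml, which this file overrides.
denovo:
  input_dir: Kaiko_volume/Kaiko_input_files/my_experiment   # folder from step 1
diamond:
  database: Kaiko_volume/Kaiko_stationary_files/uniref100.dmnd
taxa:
  folder: Kaiko_volume/Kaiko_stationary_files/ncbi_taxa
```

Any key omitted here is filled in from kaiko_defaults.yaml.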

The Kaiko_volume/Kaiko_intermediate/ folder will be populated with a few intermediate files. These are named using the mgf_input folder name. The final FASTA output can be found within Kaiko_volume/Kaiko_output/ folder, again named using the folder name of the input.

Note: to profile the pipeline using cProfile, add the profile = True flag to the config file. To use memory-profiler, run mprof run --include-children Kaiko_pipeline_main.py from within the main repo directory.
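Independent of the pipeline's own profile flag, cProfile can be applied directly to any entry point. A minimal sketch (the work function is a placeholder, not a pipeline step):

```python
# Minimal cProfile usage, independent of the pipeline's profile flag.
import cProfile
import pstats
import io

def work():
    # Placeholder for a pipeline step worth profiling.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Summarize the top entries by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

The same pattern works from the command line as python -m cProfile Kaiko_pipeline_main.py, at the cost of profiling the whole run.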

Usage with Docker

To use the pipeline within Docker, follow steps 1-2 in Usage, then jump here:

  1. (Docker) Run the command docker build . -t kaiko-py310 (within the Kaiko_metaproteome folder) to build the Kaiko docker image.

  2. (Docker) Run the command docker run --name Kaiko_container-py310 -v path_Kaiko_volume:/Kaiko_metaproteome/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py, where path_Kaiko_volume is the absolute path to the Kaiko_volume folder. Mounting this volume lets Docker store the outputs in Kaiko_volume. For example, such a command may look like docker run --name Kaiko_container-py310 -v C:/Users/memmys/Documents/GitHub/Kaiko_metaproteome/Kaiko_volume/:/Kaiko_metaproteome/Kaiko_volume kaiko-py310 python Kaiko_pipeline_main.py --config Kaiko_volume/sample_config.yaml

  3. (Docker) Make sure to update the config file to point to the Linux version of diamond. See the setup for more details.

As with the non-Docker usage, intermediate files appear in Kaiko_volume/Kaiko_intermediate/ and the final FASTA output in Kaiko_volume/Kaiko_output/, both named using the input folder name.

Unit Tests

After downloading the required files, we should ensure the denovo network produces the expected output given the model. To do this, navigate to the main repo folder in a command prompt and run python kaiko_unit_test.py. This runs the denovo model on a predetermined dataset and compares its output line by line against stored reference output.
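The line-by-line comparison the unit test performs can be sketched as below. This is illustrative only; the actual logic in kaiko_unit_test.py may differ.

```python
# Sketch of a line-by-line comparison against stored reference output,
# in the spirit of kaiko_unit_test.py (whose actual logic may differ).
from itertools import zip_longest

def diff_lines(actual, expected):
    """Return (line_number, actual, expected) for each mismatching line."""
    mismatches = []
    for n, (a, e) in enumerate(zip_longest(actual, expected), start=1):
        if a != e:
            mismatches.append((n, a, e))
    return mismatches

# Example: compare model output against the stored reference.
got = ["PEPTIDEK", "SEQUENCER"]
ref = ["PEPTIDEK", "SEQUENCEK"]
result = diff_lines(got, ref)
```

zip_longest (rather than zip) makes the comparison flag files of different lengths instead of silently truncating.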
