Coder Social home page Coder Social logo

biotukey's Introduction

Logo BioTukey


BioTukey: a comprehensive toolkit for interactive data analysis, engineering, and classification of biological sequences


Summary


This work proposes a toolkit named BioTukey, a graphical application that allows users to gather different kinds of insights about biological sequences such as DNA, RNA, and protein by providing a set of statistics, visualizations as multiple sequence alignment and sequence structure visualization, and machine learning approaches to proper show relevant characteristics and discriminate between types of these sequences. BioTukey provides different sequence descriptors, such as domain-specific biological and mathematical such as complex networks and entropy, to contribute to these insights. Machine learning models such as Random Forests and XGBoost can be selected along with choosing specific hyperparameters or using AutoML to recommend descriptors for classification. BioTukey can provide insightful perspectives to researchers working with multiple views on biological sequences, and even machine learning nonexperts could better apply ML pipelines considering the interactiveness of the application.

BioTukey was developed using Python programming, with the Streamlit framework as the foundation for its GUI. Streamlit is a web application framework that simplifies creating interactive data-driven applications. In this context, BioTukey leverages the capabilities of Streamlit to function as a web application, enabling users to interact with the application through its GUI. The application is designed as a responsive single-page application (SPA) to provide users with a user experience akin to a desktop application. It is organized into three main modules/pages: Overview, Feature Engineering, and Classification. Each module offers distinct functionalities to support analyzing and exploring biological sequences.

The Overview module provides comprehensive data visualization and summary statistics. It presents users with diverse statistics on the provided sequences, including sequence alignment visualization, $k$-mer distribution, and structure visualization for DNA/RNA and protein sequences. This module aims to provide users with a holistic understanding of the characteristics and properties of the input sequences.

The Feature Engineering module empowers users to extract different types of descriptors from the sequences, generating feature vectors that can be concatenated. These feature vectors serve as input for visualizations such as distribution plots, dimensionality reduction plots using techniques like PCA, t-SNE, or UMAP, correlation heatmaps, and feature importance bar charts. By allowing users to manipulate and explore the extracted features, this module facilitates a deeper understanding of the data and enables informed decision-making in subsequent analysis steps.

The Classification module allows users to perform classification tasks on the provided sequences. Users can select between two popular machine learning models, Random Forest, and XGBoost, as the algorithms for classification. They can further choose to apply 10-fold cross-validation, either with or without a separate test set. Additionally, users have the option to engage with the model manually by creating a custom pipeline. This includes selecting hyperparameters, filtering for the most important features, and applying dimensionality reduction techniques. The module also provides performance scores of the model, such as accuracy, precision, recall, and F1-score, along with a confusion matrix to evaluate the model's predictive capabilities.

In summary, BioTukey combines the capabilities of Python, Streamlit, and various data analysis techniques to create a user-friendly and comprehensive application for the analysis and classification of biological sequences. The application's modules, including Overview, Feature Engineering, and Classification, enable users to visualize, explore, and model biological data effectively, facilitating informed decision-making in bioinformatics research and applications.

You need the packages in environment.yml. You can use Miniconda to easily configure the environment with the packages using:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

chmod +x Miniconda3-latest-Linux-x86_64.sh

./Miniconda3-latest-Linux-x86_64.sh

export PATH=~/miniconda3/bin:$PATH

After installing Miniconda, clone this repository using the following:

git clone https://github.com/brenoslivio/BioTukey.git BioTukey

cd BioTukey

git submodule init

git submodule update

This also guarantees that the MathFeature package is also appropriately configured. With Miniconda, create the environment with:

conda env create -f environment.yml -n biotukey

Activate the environment with:

conda activate biotukey

Finally, you can run the application with Streamlit using:

streamlit run main.py

You can access the application in a web browser in http://localhost:8501.

Distributed under the MIT License. See LICENSE for more information.

biotukey's People

Contributors

brenoslivio avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.