This project forked from kaoutardaoudhiri/nlp-based-classification-of-metagenomics-tools

License: GNU General Public License v3.0

Python 0.22% Jupyter Notebook 99.78%

Classification of HTS data analysis tools for plant virus detection

by Kaoutar Daoud Hiri, Matjaž Hren and Tomaž Curk

This paper has been submitted for publication in Bioinformatics

We used machine learning methods to develop a classification system for metagenomics tools, based on text descriptions extracted from their published papers. The system lets users quickly filter tools across 13 categories according to each tool's specific function.

[Figure: confusion matrix]

Abstract

Motivation: The explosion of metagenomics data makes metagenomics increasingly dependent on computational and statistical methods for fast and efficient analysis. As a direct consequence, novel analysis tools for big data metagenomics are continuously emerging. One of the biggest challenges for plant virome researchers arises at the planning stage of the analysis: selecting the most suitable bioinformatics tool for extracting valuable insights from the HTS data. Knowing how a tool can be applied and which bioinformatics tasks it is suitable for is a fundamental part of the recommendation process, and gathering this information manually from the tools' papers would be laborious and time-consuming.

Results: We have addressed this challenge by using machine learning methods to develop a metagenomics tool classification system. We trained three classifiers (Naive Bayes, Logistic Regression, and Random Forest) on three manually gathered data sets covering 224 software tools assigned to 13 different classes, using 12(+3) text feature extraction techniques. The first data set includes only the abstract section of the tools' publications, the second includes only the methods section, and the third includes both the abstract and methods sections. We conclude that Logistic Regression using BioBERT for text representation of the abstracts-only data set is the best model, achieving an Area Under the Precision-Recall Curve score of 0.85.
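For readers unfamiliar with the metric, the sketch below shows one common way to estimate the area under the precision-recall curve for a single class: average precision, which sums the precision at each correctly ranked positive, weighted by the resulting increase in recall. This is an illustration only. The function name `average_precision` and the toy labels and scores are assumptions made for this example, and the notebooks may compute the score differently (for instance with a library routine, or averaged over the 13 classes).

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive, 0 = negative).

    Illustrative helper, not taken from the repository. Tied scores are
    not grouped; they are ranked in the order Python's stable sort gives.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank items from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision = true_pos / rank
            # Recall increases by 1 / total_pos at every positive item.
            ap += precision / total_pos
    return ap


# Toy example: four tools, two of which belong to the class of interest.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))  # 0.8333
```

In the toy example the positives sit at ranks 1 and 3, so the score is (1/1 + 2/3) / 2. A classifier that ranks every positive above every negative scores 1.0.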

Software implementation

All source code used to generate the results and figures in the paper is in the code folder. The data used in this study is in the datasets folder, and the results generated by the code are in the results folder. The calculations and figure generation are all run inside Jupyter notebooks. See the README.md file in each directory for a full description.

Getting the code

You can download a copy of all the files in this repository by cloning the git repository:

git clone https://github.com/kaoutarDaoudHiri/HTS-data-analysis-tools-classifier.git

Dependencies

You'll need a working Python environment to run the code. The recommended way to set up your environment is through the Anaconda Python distribution, which provides the conda package manager. Anaconda can be installed in your user directory and does not interfere with the system Python installation.

Reproducing the results

To explore the results, you can run the Jupyter notebooks individually. First start the notebook server from the top level of the repository:

jupyter notebook

This will start the server and open your default web browser to the Jupyter interface. On that page, go into the code/notebooks folder and select the notebook that you wish to view or run. In the Jupyter notebooks, the names of the EDAM classes are shortened: only the part outside the parentheses is kept. The class names are (Sequence) alignment, (Taxonomic) classification, (Sequence) assembly, (Sequence) trimming, (Sequencing) quality control, (Sequence) annotation, (Sequence) assembly validation, (RNA-seq quantification for) abundance estimation, SNP-Discovery, Visualization.
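The naming rule for the shortened EDAM class names can be sketched in a few lines of Python. The helper `shorten_class_name` is hypothetical and is not taken from the notebooks. It only drops the parenthesised part and tidies the whitespace, and the notebooks may additionally change capitalisation.

```python
import re


def shorten_class_name(full_name):
    """Return the class name with any parenthesised part removed.

    Hypothetical helper for illustration; not part of the repository.
    """
    without_parens = re.sub(r"\([^)]*\)", "", full_name)
    # Collapse the whitespace left behind by the removed text.
    return " ".join(without_parens.split())


full_names = [
    "(Sequence) alignment",
    "(Taxonomic) classification",
    "(RNA-seq quantification for) abundance estimation",
    "SNP-Discovery",
]

for name in full_names:
    print(f"{name!r} -> {shorten_class_name(name)!r}")
```

Names without parentheses, such as SNP-Discovery and Visualization, pass through unchanged.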
