Classification of HTS data analysis tools for plant virus detection

by Kaoutar Daoud Hiri, Matjaž Hren and Tomaž Curk

This paper has been submitted for publication in Bioinformatics

We used machine learning methods to develop a classification system for metagenomics tools based on text descriptions extracted from their published papers. It lets users quickly and efficiently filter tools across 13 categories by identifying each tool's specific function.

Abstract

Motivation: The explosion of metagenomics data makes the field increasingly dependent on computational and statistical methods for fast and efficient analysis. As a direct consequence, novel analysis tools for big-data metagenomics are continuously emerging. One of the biggest challenges for plant virome researchers arises already at the analysis-planning stage: selecting the most suitable bioinformatics tool for extracting valuable insights from high-throughput sequencing (HTS) data. Knowing how a tool can be applied and which bioinformatics tasks it is suitable for is a fundamental and critical part of the recommendation process; manually gathering this information from the tools' papers would be laborious and time-consuming.

Results: We addressed this challenge by using machine learning methods to develop a metagenomics tool classification system. We trained three classifiers (Naive Bayes, Logistic Regression, and Random Forest) on three manually gathered data sets covering 224 software tools assigned to 13 different classes, using 12(+3) text feature extraction techniques. The first data set includes only the abstract section of the tools' publications, the second includes only the methods section, and the third combines the abstract and methods sections. We conclude that Logistic Regression using BioBERT for text representation on the abstracts-only data set is the best model, achieving an Area Under the Precision-Recall Curve (AUPRC) score of 0.85.
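For readers unfamiliar with the reported metric, the following is a minimal sketch of average precision, one common estimator of the area under the precision-recall curve. It is not taken from the repository's notebooks: the function name `average_precision` and the toy labels and scores are hypothetical, and the sketch does not treat tied scores specially, so library implementations may differ on ties.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive item appears.

    labels: 0/1 ground-truth values for one class (hypothetical toy input).
    scores: classifier scores for the same items, higher = more confident.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank items from highest to lowest score (stable for equal scores).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision among the top `rank` items at this recall step.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Toy example: positives are ranked 1st and 3rd out of four items.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(toy_labels, toy_scores), 4))
```

In a multi-class setting such as the 13 tool classes, a score like this would typically be computed per class (one class against the rest) and then averaged; how the paper aggregates across classes is not stated here.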

Software implementation

All source code used to generate the results and figures in the paper is in the code folder, the data used in this study are in the datasets folder, and the results generated by the code are in the results folder. The calculations and figure generation are all run inside Jupyter notebooks. See the README.md files in each directory for a full description.

Getting the code

You can download a copy of all the files in this repository by cloning the git repository:

git clone https://github.com/kaoutarDaoudHiri/HTS-data-analysis-tools-classifier.git

Dependencies

You'll need a working Python environment to run the code. The recommended way to set up your environment is through the Anaconda Python distribution, which provides the conda package manager. Anaconda can be installed in your user directory and does not interfere with the system Python installation.

Reproducing the results

To explore the results, you can execute the Jupyter notebooks individually. To do this, first start the notebook server by going to the repository's top-level directory and running:

jupyter notebook

This will start the server and open your default web browser to the Jupyter interface. On that page, go into the code/notebooks folder and select the notebook that you wish to view or run. In the Jupyter notebooks, the names of the EDAM classes are shortened: we keep only the part outside the parentheses in (Sequence) alignment, (Taxonomic) classification, (Sequence) assembly, (Sequence) trimming, (Sequencing) quality control, (Sequence) annotation, (Sequence) assembly validation, (RNA-seq quantification for) abundance estimation, SNP-Discovery, Visualization.
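The shortening rule above can be sketched in a few lines of Python. This is an illustration only, not the code used in the notebooks; the helper name `shorten_class_name` is hypothetical, and the sketch assumes the parenthesised part never contains nested parentheses.

```python
import re

# Matches any parenthesised part, e.g. "(Sequence) ", plus trailing whitespace.
PARENTHESISED = re.compile(r"\([^)]*\)\s*")


def shorten_class_name(name):
    """Drop the parenthesised part of a class name and keep the rest."""
    return PARENTHESISED.sub("", name).strip()


# Class names as listed in this README.
full_names = [
    "(Sequence) alignment",
    "(Taxonomic) classification",
    "(Sequence) assembly",
    "(Sequence) trimming",
    "(Sequencing) quality control",
    "(Sequence) annotation",
    "(Sequence) assembly validation",
    "(RNA-seq quantification for) abundance estimation",
    "SNP-Discovery",
    "Visualization",
]

for full_name in full_names:
    print(f"{full_name!r} -> {shorten_class_name(full_name)!r}")
```

Names without parentheses, such as SNP-Discovery and Visualization, pass through unchanged.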
