by Kaoutar Daoud Hiri, Matjaž Hren and Tomaž Curk
We used machine learning methods to develop a classification system for metagenomics tools, based on the text descriptions extracted from their published papers. Users can quickly filter tools across 13 categories according to each tool's specific function.
Motivation: The explosion of metagenomics data makes the field increasingly dependent on computational and statistical methods for fast and efficient analysis. As a direct consequence, novel analysis tools for big-data metagenomics are continuously emerging. One of the biggest challenges for plant virome researchers arises at the planning stage of the analysis: selecting the bioinformatics tool best suited to extracting valuable insights from the high-throughput sequencing (HTS) data. Knowing how a tool can be applied and which bioinformatics tasks it is suitable for is a fundamental part of the recommendation process, but manually gathering this information from the tools' papers would be laborious and time-consuming.
Results: We addressed this challenge by using machine learning methods to develop a metagenomics tool classification system. We trained three classifiers (Naive Bayes, Logistic Regression, and Random Forest) on three manually gathered data sets covering 224 software tools assigned to 13 different classes, using 12(+3) text feature extraction techniques. The first data set includes only the abstract section of the tools' publications, the second includes only the methods section, and the third combines the abstract and methods sections. We conclude that Logistic Regression using BioBERT for text representation on the abstracts-only data set is the best model, achieving an Area Under the Precision-Recall Curve score of 0.85.
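The reported score is an area under the precision-recall curve. As a rough illustration of what that metric measures, the sketch below computes average precision, one common estimator of that area, for a single class treated as a binary problem. The function name `average_precision` and the toy labels and scores are hypothetical; this is not the code from the notebooks, and how the per-class scores are combined across the 13 classes is not specified here.

```python
def average_precision(y_true, scores):
    """Average precision for one class (binary labels, higher score = more confident).

    Illustrative sketch only: ties in `scores` are broken by input order,
    which a production implementation would handle more carefully.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")

    # Rank examples from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            # Precision at this rank, counted once per recovered positive.
            total += hits / rank
    return total / n_pos


# Toy example: two positives ranked 1st and 3rd out of four tools.
example = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(example, 4))
```

A classifier that ranks every positive example above every negative one scores 1.0, and the score drops as negatives are ranked above positives.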
All source code used to generate the results and figures in the paper is in the code folder. The data used in this study is in the datasets folder, and the results generated by the code are in the results folder. The calculations and figure generation are all run inside Jupyter notebooks. See the README.md files in each directory for a full description.
You can download a copy of all the files in this repository by cloning the git repository:
git clone https://github.com/kaoutarDaoudHiri/HTS-data-analysis-tools-classifier.git
You'll need a working Python environment to run the code. The recommended way to set up your environment is through the Anaconda Python distribution, which provides the conda package manager. Anaconda can be installed in your user directory and does not interfere with the system Python installation.
To explore the code results you can execute the Jupyter notebooks individually. To do this, you must first start the notebook server by going into the repository top level and running:
jupyter notebook
This will start the server and open your default web browser to the Jupyter interface. On that page, go into the code/notebooks folder and select the notebook that you wish to view or run. In the Jupyter notebooks, the names of the EDAM classes are shortened: we keep only the part outside the parentheses in (Sequence) alignment, (Taxonomic) classification, (Sequence) assembly, (Sequence) trimming, (Sequencing) quality control, (Sequence) annotation, (Sequence) assembly validation, (RNA-seq quantification for) abundance estimation, SNP-Discovery, and Visualization.
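The shortening rule for class names can be sketched in a few lines of Python. The helper name `shorten_edam_name` is hypothetical and the notebooks may implement the mapping differently (for example with a hand-written dictionary); the sketch only shows the stated rule of dropping the parenthesised part.

```python
import re


def shorten_edam_name(name):
    """Drop every parenthesised fragment from a class name (illustrative helper)."""
    # Replace "(...)" groups with a space, then collapse repeated whitespace.
    without_parens = re.sub(r"\([^)]*\)", " ", name)
    return " ".join(without_parens.split())


for full_name in [
    "(Sequence) alignment",
    "(RNA-seq quantification for) abundance estimation",
    "Visualization",
]:
    print(full_name, "->", shorten_edam_name(full_name))
```

Names without parentheses, such as SNP-Discovery and Visualization, pass through unchanged.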