Reference database curation

Intro

This repository contains the database curation pipeline for ITS2 (meta-)barcoding of vascular plants. The corresponding article is available as a preprint while under review: https://www.biorxiv.org/content/10.1101/2023.06.12.544582v1

Requirements

Compatible Database

These functions are intended for use with databases whose taxonomy is stored in the specific format used by classifiers. Such databases can be created with BCdatabaser: https://github.com/molbiodiv/bcdatabaser

Documentation: https://molbiodiv.github.io/bcdatabaser/

And particularly the syntax information: https://molbiodiv.github.io/bcdatabaser/output.html

Dependencies

Software dependencies are declared in bin/externals.txt.

Tested

The pipeline was tested under:

  • macOS 11.0.1
  • Ubuntu 20.04

Functions:

Automated curation

  1. Fungal removal
  2. Non-target (non-ITS2) sequence removal
  3. Incomplete taxonomy removal
  4. Chlorophyta removal
  5. Iterative identification and removal of intraspecific outliers

(Details on these filters are provided in the article above)

The following script performs the automated curation:

bash /bin/_curation.sh YOUR.DB.NAME.fa

Manual list-based curation of incorrect NCBI taxonomies

  • taxonomy corrections
  • sequence removal
  1. Place a .txt file, formatted as in the examples, into the folder corrections. The format is:
NCBI-Accession;Wrong_ScientificName;Corrected_ScientificName;Your_Name

Multiple files can be created; all .txt files in that folder will be used for corrections.

  2. Then call the script on your database:
bash /bin/_correct_manuals.sh YOUR.DB.NAME.fa

This can take a while for large databases.
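Because the correction step can be slow, it may help to check the corrections files against the four-field format described above before running it. The following is a minimal sketch, not part of the repository: `check_corrections` is a hypothetical helper, and it assumes the files contain no header line. How sequence removals are encoded is not stated here, so only the field count and the presence of an accession are checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: list every line of a
# corrections file that does not have exactly four semicolon-separated
# fields or that lacks an accession in the first field.
# Returns non-zero if any line is flagged.
check_corrections() {
    awk -F';' '
        /^[ \t]*$/ { next }                          # ignore blank lines
        NF != 4 || $1 == "" {
            printf "%s:%d: %s\n", FILENAME, FNR, $0  # file, line number, content
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Calling `check_corrections` on each .txt file in the corrections folder prints any offending lines together with their file name and line number.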

Manual addition of sequences: taxonomy patching and inclusion in the database

  • adding taxonomy and appending to DB
  1. Place one or more .fasta files, formatted as in the examples, into the folder additions. The format is:
>Scientific_name
ACGT

Multiple files can be created; all .fasta files in that folder will be used for additions.

  2. Then call the script on your database:
bash /bin/_add_manuals.sh YOUR.DB.NAME.fa

This can take a while for a large number of sequences.
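The addition files can be checked in a similar way before they are appended. The sketch below is not part of the repository: `check_additions` is a hypothetical helper. It assumes that headers consist of `>` followed by a name without spaces (as in `>Scientific_name` above) and that sequence lines contain only IUPAC nucleotide codes; adjust both patterns if your files legitimately differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: flag FASTA headers that
# contain whitespace and sequence lines with non-nucleotide characters.
# Returns non-zero if any line is flagged.
check_additions() {
    awk '
        /^[ \t]*$/ { next }                          # ignore blank lines
        /^>/ {
            # Assumed header form: ">" followed by a name without spaces
            if ($0 !~ /^>[^ \t]+$/) {
                printf "%s:%d: unexpected header: %s\n", FILENAME, FNR, $0
                bad = 1
            }
            next
        }
        # Sequence lines: IUPAC nucleotide codes only (assumption)
        $0 !~ /^[ACGTURYSWKMBDHVNacgturyswkmbdhvn]+$/ {
            printf "%s:%d: unexpected sequence characters\n", FILENAME, FNR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Calling `check_additions` on each .fasta file in the additions folder reports the file name and line number of anything that does not match these assumptions.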

Subsetting: Input DB, list -> Output DB

Subset your input database into a geographically restricted database:

SeqFilter --ids_pattern LOCAL.FLORA.csv YOUR.DB.NAME.fa -o LOCAL.FLORA.DB.fa
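As a quick sanity check after subsetting, the number of records in the input and output databases can be compared. This is a minimal sketch using only `grep`; `count_seqs` is a hypothetical helper, not part of the repository or of SeqFilter.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository: count FASTA records
# by counting header lines (lines starting with ">").
count_seqs() {
    grep -c '^>' "$1"
}
```

Running `count_seqs YOUR.DB.NAME.fa` and `count_seqs LOCAL.FLORA.DB.fa` should show a smaller count for the subset; an output count of zero suggests that the entries in the list did not match any sequence headers.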
