
This project forked from eslam-samir-ragab/sequence-database-curator

This program dereplicates and/or filters nucleotide and/or protein databases using a list of names or sequences (by exact match).

Home Page: https://sites.google.com/pharma.cu.edu.eg/eslam-ibrahim/github-and-softwares/sddc-program

License: GNU General Public License v3.0

Sequence Dereplicator and Database Curator (SDDC) program

This software is licensed under the GNU General Public License v3.0.

Please cite: DOI: 10.1007/s00284-017-1327-6

How to use:

  1. Install Python 2.7 or Python 3 on your machine.
  2. Install NumPy and Biopython.
  3. Install the future module using the pip command.
  4. Click “Clone or download” > “Download ZIP”, then extract the downloaded file.
  5. Open the file “sddc.py” with Python:
     • Windows: open it with python.exe
     • Unix/Linux: use the command chmod u+x sddc.py
     • Mac: use the command python sddc.py
  6. State your variables and press Enter.
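
Before running the program, it can help to confirm that the modules named in the steps above are importable. The following is a minimal sketch, assuming Python 3 (it uses `importlib.util.find_spec`, which Python 2.7 does not have). The helper name `missing_modules` is hypothetical and is not part of SDDC.

```python
import importlib.util

# Import names of the required packages; "Bio" is the import name of Biopython.
REQUIRED_MODULES = ["numpy", "Bio", "future"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available.")
```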

The full SDDC commands, Cheat sheet and notes are here

Updates in SDDC v3.0:

  1. Bug fixes.
  2. Usage of -org_order with -kw is updated.
  3. Exchange FASTA headers mode is now available.

Updates in SDDC v2.0:

  1. You can filter sequences by keywords only (separated by commas), inclusively or exclusively, by adding the (-kw) argument to your normal command line.
  2. You can keep your sequences in their original order after dereplication and/or sequence filtration by adding (-org_order) to your normal command line.

Notes:

  • The rate of SDDC was determined using an Intel(R) Pentium(R) CPU G630 @ 2.70 GHz processor, 4.00 GB RAM, and a 32-bit operating system.

  • The list of options and commands in the program can be downloaded from here:

Examples

If you want to dereplicate protein sequences, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep

If you want to dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order
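
To illustrate what exact-match dereplication with preserved order means, here is a minimal sketch in plain Python. It is not SDDC's implementation: the function names `parse_fasta` and `dereplicate` are hypothetical, and keeping the first occurrence of each duplicated sequence is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dereplicate(records):
    """Keep the first record of each distinct sequence, in input order."""
    seen = set()
    unique = []
    for header, seq in records:
        if seq not in seen:  # exact match on the sequence string
            seen.add(seq)
            unique.append((header, seq))
    return unique


example = """>seq1
MKTAYIAK
>seq2
MKVLAA
>seq3
MKTAYIAK
"""

for header, seq in dereplicate(parse_fasta(example)):
    print(">" + header)
    print(seq)
```

In this toy input, seq3 repeats the sequence of seq1, so only seq1 and seq2 are printed, in their original order.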

If you want to dereplicate protein sequences with a minimum length of 30, and the sequences are in multiple files, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi

If you want to dereplicate nucleotide sequences using the optimum approach with a normal protein length of 300, use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300

If you want to filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive

If you want to filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords you have specified, separated by commas), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw

If you want to filter protein sequences exclusively by name (i.e. you want to retrieve the sequences that aren't present in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive
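
The difference between the inclusive and exclusive name filters can be sketched as follows. This is an illustration only, not SDDC's code: the function `filter_by_name` is hypothetical, records are assumed to be (header, sequence) tuples, and a name is assumed to match when it equals the whole header.

```python
def filter_by_name(records, names, approach="inclusive"):
    """Filter (header, sequence) records by exact header match.

    inclusive: keep only records whose header is listed in `names`.
    exclusive: keep only records whose header is not listed in `names`.
    """
    if approach not in ("inclusive", "exclusive"):
        raise ValueError("approach must be 'inclusive' or 'exclusive'")
    listed = set(names)
    keep_listed = approach == "inclusive"
    return [(h, s) for h, s in records if (h in listed) == keep_listed]


records = [("protA", "MKTAYIAK"), ("protB", "MKVLAA"), ("protC", "MLSRAV")]
print(filter_by_name(records, ["protA", "protC"], "inclusive"))
print(filter_by_name(records, ["protA", "protC"], "exclusive"))
```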

If you want to filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names don't contain any of the keywords, separated by commas, listed in your filter file), use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw
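
Keyword filtering differs from name filtering in that a keyword only needs to appear somewhere in the header. Here is a minimal sketch under stated assumptions: the helpers `parse_keywords` and `filter_by_keywords` are hypothetical, matching is assumed to be a case-sensitive substring test, and a header is assumed to match if it contains at least one keyword.

```python
def parse_keywords(line):
    """Split a comma-separated line into a list of non-empty keywords."""
    return [k.strip() for k in line.split(",") if k.strip()]


def filter_by_keywords(records, keywords, approach="inclusive"):
    """Filter (header, sequence) records by keywords found in the header.

    inclusive: keep records whose header contains at least one keyword.
    exclusive: keep records whose header contains none of the keywords.
    """
    if approach not in ("inclusive", "exclusive"):
        raise ValueError("approach must be 'inclusive' or 'exclusive'")
    keep_matching = approach == "inclusive"
    result = []
    for header, seq in records:
        matches = any(k in header for k in keywords)
        if matches == keep_matching:
            result.append((header, seq))
    return result


records = [("kinase A", "MKTAYIAK"), ("transporter B", "MKVLAA"), ("ligase C", "MLSRAV")]
keywords = parse_keywords("kinase, transporter")
print(filter_by_keywords(records, keywords, "inclusive"))
print(filter_by_keywords(records, keywords, "exclusive"))
```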

If you want to filter nucleotide sequences by sequence (exclusive only), use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)
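
Exclusive filtering by sequence can be pictured as dropping every record whose sequence appears in the filter list. The sketch below assumes exact string equality between sequences. The function `exclude_by_sequence` is hypothetical and is not taken from SDDC.

```python
def exclude_by_sequence(records, filter_sequences):
    """Drop (header, sequence) records whose sequence is in `filter_sequences`."""
    unwanted = set(filter_sequences)
    return [(h, s) for h, s in records if s not in unwanted]


records = [("gene1", "ATGGCC"), ("gene2", "ATGTTT"), ("gene3", "ATGGCC")]
print(exclude_by_sequence(records, ["ATGGCC"]))
```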

If you want to exchange words in the FASTA headers of your protein sequences, use the following command:

python sddc.py -in (input_file) -p -out (output_file) -mode exchange_headers -ex_file (exchange_file in csv)

If you want to exchange words in the FASTA headers of your nucleotide sequences, use the following command:

python sddc.py -in (input_file) -n -out (output_file) -mode exchange_headers -ex_file (exchange_file in csv)
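
The header exchange mode can be pictured as a find-and-replace applied to each FASTA header. The layout of the exchange file is not described here, so this sketch assumes each CSV row holds an old word followed by its replacement. The helpers `load_exchanges` and `exchange_headers` are hypothetical and only illustrate the idea.

```python
import csv
import io


def load_exchanges(csv_text):
    """Read (old_word, new_word) pairs from CSV text, one pair per row."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip rows with fewer than two columns or an empty old word.
        if len(row) >= 2 and row[0].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def exchange_headers(records, pairs):
    """Apply every (old, new) replacement to each header; sequences are unchanged."""
    updated = []
    for header, seq in records:
        for old, new in pairs:
            header = header.replace(old, new)
        updated.append((header, seq))
    return updated


records = [("hypothetical protein A", "MKTAYIAK"), ("kinase B", "MKVLAA")]
pairs = load_exchanges("hypothetical,putative\nkinase,protein kinase\n")
for header, seq in exchange_headers(records, pairs):
    print(">" + header)
    print(seq)
```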

Example (1)

Example (2)

If you find any errors, please send me an email at [email protected]

Visit my website for more details, other publications, and contact
