I stumbled across this program while looking for alternatives to the FASTX-Toolkit and CD-HIT for removing duplicate sequences from fasta files. My fasta file is very large (~2.5 GB) and contains many genes of interest. I'm working on a system whose nodes have up to 40 cores and 754 GB of memory per job, but we have a time limit of 72 hours.
This was my command:
module load sddc/3.0
eval "$(/util/common/python/py37/anaconda-2020.02/bin/conda shell.bash hook)"
conda activate sddc
python /util/common/sddc/3.0/Sequence-database-curator/sddc.py -in GENES.fasta -out GENES_set_sddc.fasta -n -mode derep -org_order
In your experience, should your program be able to de-replicate my fasta file of ~925,000 sequences within my 72-hour time limit? Based on the time calculation on the main tab, I suspect it wouldn't, but I wanted to ask. If not, do you recommend any other options for this kind of job? Thank you.
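For context, the operation I need is only exact-duplicate removal (keep the first record for each distinct sequence), which I believe can be done in a single streaming pass. A minimal sketch of that idea in plain Python (my own illustration, not sddc's code; it assumes exact, case-insensitive duplicates only, not clustering by identity threshold):

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def dereplicate(lines):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for header, seq in parse_fasta(lines):
        key = seq.upper()  # treat case variants as the same sequence
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept
```

Since memory here only grows with the set of unique sequences, I'd expect ~925,000 records to fit comfortably in 754 GB, so my question is really about whether sddc's derep mode scales similarly.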