This script performs sequence analysis on FASTA files. It calculates sequence statistics and writes the results to a TSV output file. The protein-specific features listed below (domain analysis, alignment parameters, and pairwise comparison) are not yet implemented.
- Parsing FASTA files to extract sequence data
- Calculating sequence statistics such as total sequences, average length, minimum length, and maximum length
- Saving the analysis results to a tab-separated values (TSV) file
- Identifying sequences with minimum and maximum lengths
- Optional protein domain analysis using InterProScan or Pfam (not implemented)
- Customizable alignment parameters for protein sequences (not implemented)
- Comparison of sequences using pairwise alignment (not implemented)
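The parsing and statistics features above can be sketched roughly as follows. This is a minimal illustration, not the code in `fasta_parser.py`: it uses only the standard library so it runs without extra packages, and the names `parse_fasta` and `sequence_stats` are hypothetical. Since the script depends on Biopython, it presumably reads records with something like `Bio.SeqIO.parse(path, "fasta")` instead of a hand-written parser.

```python
from statistics import mean


def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the record name is the first token of the header line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        yield name, "".join(chunks)


def sequence_stats(records):
    """Return summary statistics for a non-empty list of (name, sequence) pairs."""
    lengths = [len(seq) for _, seq in records]
    shortest = min(records, key=lambda r: len(r[1]))
    longest = max(records, key=lambda r: len(r[1]))
    return {
        "total": len(records),
        "average_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "shortest": shortest[0],
        "longest": longest[0],
    }


# Made-up example input; a real run would read lines from a FASTA file.
example = """>seq1 example protein
MKTAYIAKQR
QISFVKSHFS
>seq2
MKV
>seq3
MKTAYIA
""".splitlines()

records = list(parse_fasta(example))
stats = sequence_stats(records)
print(stats)
```

When several sequences share the minimum or maximum length, `min` and `max` return the first one encountered. The actual script may handle ties differently.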
The following dependencies are required to run the script:
- Python 3.x
- Biopython
Install the dependencies using the following command:
pip install biopython
- Ensure you have Python 3.x installed on your system.
- Install the required dependencies using the command mentioned above.
- Prepare a FASTA file containing the input sequences.
- Update the `input_file` and `output_file` variables in the script with the appropriate file paths.
- Run the script using the following command:
  python fasta_parser.py
- The output is saved to the specified output file in TSV format, containing the sequence names, lengths, and the sequences themselves.
- You can modify the script parameters for your specific analysis by editing `fasta_parser.py`. Alignment settings are not available yet, because the alignment features are not implemented.
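For reference, the TSV output described in the steps above (name, length, sequence) could be produced along these lines. This is a sketch under assumptions: the function name `write_tsv`, the header row, and the column names are illustrative and may not match what `fasta_parser.py` actually writes.

```python
import csv
import io


def write_tsv(records, handle):
    """Write (name, sequence) pairs as tab-separated rows with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "length", "sequence"])  # assumed column names
    for name, seq in records:
        writer.writerow([name, len(seq), seq])


# Made-up records; in practice these would come from the parsed FASTA file.
records = [("seq1", "MKTAYIAKQR"), ("seq2", "MKV")]

# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_tsv(records, buffer)
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

To write to disk instead, open the path stored in `output_file` with `open(output_file, "w", newline="")` and pass the resulting handle to `write_tsv`.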
Contributions to the project are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue or submit a pull request.
This project is licensed under the MIT License.