prunus-cp-genome-assembly

Project for the Genomics Course 2021 (MSc Bioinformatics for Computational Genomics) held by Prof. Aureliano Gomez Bombarely at Università degli Studi di Milano.

Aim

The project aims to investigate and compare de novo assemblies of the chloroplast genome built from Illumina short reads and from PacBio and Oxford Nanopore long reads.

The chloroplast genome of Prunus avium was chosen for the assembly.

Description

1. Retrieving raw data

The raw sequencing data were retrieved from the NCBI SRA repository under the following accession numbers:

SRR10362958: Illumina
SRR4280451: PacBio SMRT
SRR7786091: Oxford Nanopore

The reference chloroplast genomes of Prunus avium (MK622380.1) and Prunus apetala (NC_053693.1) were taken from the NCBI Nucleotide database.

The raw data were downloaded and converted into FASTQ files using fastq-dump.

Summary statistics for the FASTQ files were obtained using fastq-stats.
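As an illustration of the kind of summary such a tool reports, here is a minimal Python sketch that counts reads and bases and computes read-length statistics. It is not the fastq-stats implementation; the function name `fastq_stats` and the toy records are hypothetical, and the sketch assumes plain four-line FASTQ records (no wrapped sequences).

```python
import io

def fastq_stats(handle):
    """Return read count, total bases, and min/mean/max read length.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    lengths = []
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line (ignored)
        handle.readline()  # quality line (ignored)
        lengths.append(len(seq))
    n = len(lengths)
    total = sum(lengths)
    return {
        "reads": n,
        "bases": total,
        "min_len": min(lengths) if lengths else 0,
        "mean_len": total / n if n else 0.0,
        "max_len": max(lengths) if lengths else 0,
    }

# Toy records invented for illustration (not project data).
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
stats = fastq_stats(example)
print(stats)
```

For real files, the same function could be given an open file handle instead of the in-memory example.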

2. Mapping

Reads were mapped to both the Prunus avium and the Prunus apetala reference using:

bowtie2 for short-reads
minimap2 and ngmlr for long-reads

Before mapping, Illumina reads were pre-processed with fastq-mcf to remove adapters.

The reads mapping to the chloroplast genome were extracted using samtools view.

The mapping stats were evaluated with samtools stats.
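The extraction of mapped reads can be pictured as a filter on the SAM FLAG field: a read is unmapped when bit 0x4 is set. The Python sketch below applies that test to SAM text lines. It is an assumption about how the filter works in principle, not the exact command used in the project (the README does not list the samtools options); the function names and the toy alignment lines are hypothetical.

```python
UNMAPPED_BIT = 0x4  # SAM FLAG bit meaning "segment unmapped"

def is_mapped(sam_line):
    """True if the alignment record does not have the unmapped bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second SAM column
    return (flag & UNMAPPED_BIT) == 0

def filter_mapped(lines):
    """Yield header lines unchanged and only the mapped alignment records."""
    for line in lines:
        if line.startswith("@") or is_mapped(line):
            yield line

# Toy SAM content invented for illustration (not project data).
sam = [
    "@SQ\tSN:MK622380.1\tLN:1000",
    "r1\t0\tMK622380.1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tMK622380.1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_mapped(sam))
mapped_names = [l.split("\t")[0] for l in kept if not l.startswith("@")]
print(mapped_names)
```

On the command line, the same selection is commonly expressed with `samtools view -F 4`, which excludes records that have the 0x4 bit set.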

3. Coverage evaluation

The mapped reads were sorted by position using samtools sort.

bedtools genomecov with the option -d was used to obtain the depth at each genome position.
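With -d, bedtools genomecov writes one tab-separated line per genome position: sequence name, 1-based position, and depth. A short Python sketch for summarizing such output is shown below. The function name `summarize_depth`, the `min_depth` threshold, and the toy depth values are hypothetical and only illustrate one way the per-position depths could be evaluated.

```python
def summarize_depth(lines, min_depth=1):
    """Summarize per-position depth lines (name, position, depth).

    Returns the number of positions, the mean depth, and the breadth of
    coverage, i.e. the fraction of positions with depth >= min_depth.
    """
    total = 0
    covered = 0
    n = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, _pos, depth = line.split("\t")
        d = int(depth)
        n += 1
        total += d
        if d >= min_depth:
            covered += 1
    return {
        "positions": n,
        "mean_depth": total / n if n else 0.0,
        "breadth": covered / n if n else 0.0,
    }

# Toy depth table invented for illustration (not project data).
depth_lines = [
    "MK622380.1\t1\t0",
    "MK622380.1\t2\t10",
    "MK622380.1\t3\t20",
    "MK622380.1\t4\t30",
]
summary = summarize_depth(depth_lines)
print(summary)
```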

4. Assembly

The BAM files obtained from the mapping to Prunus avium were converted into FASTQ with bedtools bamtofastq.

To find the best assembly, several subsets of reads were generated with seqtk sample.
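The idea behind generating read subsets can be sketched in Python as seeded random sampling, so that the same seed always returns the same subset. This is a stand-in for illustration and not the algorithm implemented in seqtk; the function name `subsample`, the seed value, and the toy read names are hypothetical, and the subset sizes used in the project are not stated here.

```python
import random

def subsample(records, k, seed=11):
    """Return k records chosen at random, keeping their original order.

    A fixed seed makes the subset reproducible across runs.
    """
    if k >= len(records):
        return list(records)
    rng = random.Random(seed)
    # Sample indices, then sort them so the output preserves input order.
    chosen = sorted(rng.sample(range(len(records)), k))
    return [records[i] for i in chosen]

# Toy read names invented for illustration (not project data).
reads = [f"read{i}" for i in range(100)]
subset_a = subsample(reads, 10, seed=42)
subset_b = subsample(reads, 10, seed=42)
print(subset_a)
```

The sketch works on a flat list of reads and does not model read pairing.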

The assemblies were generated with:

canu for long-reads
ABySS for short-reads, comparing different k-mer sizes

The statistics of each assembly were obtained with FastaSeqStats.
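A common summary statistic for comparing assemblies is N50: the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. The Python sketch below computes it from a list of contig lengths. Whether FastaSeqStats reports exactly this value is an assumption, and the function name `n50` and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length  # first contig that brings the total to >= 50%
    return 0

# Toy contig lengths invented for illustration (not project data).
contig_lengths = [100, 200, 300, 400]
result = n50(contig_lengths)
print(result)
```

In this toy case the total is 1000, so the threshold of 500 is reached after adding the 400 and 300 contigs, and the N50 is 300.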

5. Annotation of the longest reconstructed contigs

The longest contigs for each dataset were selected using FastaExtract and aligned to the reference using BLASTN.
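Selecting the longest contig from an assembly amounts to parsing the FASTA file and keeping the record with the greatest sequence length. The following Python sketch shows one way to do that. It is not the FastaExtract tool; the function names `read_fasta` and `longest_contig` and the toy sequences are hypothetical.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle.

    The name is the first whitespace-delimited token after '>'.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def longest_contig(handle):
    """Return the (name, sequence) pair with the longest sequence.

    Raises ValueError if the input contains no records.
    """
    return max(read_fasta(handle), key=lambda rec: len(rec[1]))

# Toy assembly invented for illustration (not project data).
fasta = io.StringIO(">contig1\nACGT\nAC\n>contig2 len=3\nACG\n")
best_name, best_seq = longest_contig(fasta)
print(best_name, len(best_seq))
```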

The annotation was performed using GeSeq.
