Project to test/play with genome-size estimates from kmers and correlations with other traits.
Kmer counting failed, as coverage was too low. Quick and dirty trials suggested it was giving reasonable estimates only for genomes >>10X coverage.
Here, the idea is that if we have a known quantity of DNA in each line, genome size should negatively correlate with the percent of reads mapping to that known quantity. We start off using the cDNA.
It turns out a lot fo the cDNA are probably repetitive, as a large number of reads map to them (sometimes 2% of the genome!). This is clearly unrealistic, and probably due to some TEs being annotated as genes and/or TEs picking up bits of exon of real genes.
Our solution is to filter the set of cDNAs to those that have a beleivable number of reads mapping, and use that as the constant amount of "known" DNA.
- RIMMAs_list.txt: list of 51 RIMMA lines from the centromere diversity panel for which we have flow cytommetry estimates of genomes size
- hm2.files: list of the hm2 lines for which I extracted a small amount of fastq sequence for testing
- gsize_rimma.csv: csv file of the genome sizes from flow cytometry of the RIMMAs
- Zea_mays.AGPv3.22.cdna.all.fa.gz: initial cDNA file to be used
- gethm2.sh: script to subsample some fastq from the hapmap2 bamfiles
- jelly.sh: (don't use) script to run jellyfish kmer counter
- depth.sh: ignore for now
- fgs.sh: this maps to the cDNA reference using bwa, write to a bam file in the alignments dir. currently all commented out, needs to be run differently for HM2 and Paul's files
- get_mapping.sh: takes aligned bam files and runs 1) Paul's perl script (doesn't work for paired end reads) and a simple read counter using wc -l
- seed_size.Rmd: R markdown analysis of genome size correlation and some phenotypes using hapmap2 data. Shows plant height flowering time correlation, but not with seed size.
- fixed_genes_precent.txt: output of % mapping for the "good" set of standard genes. columns are line,total_reads,mapped_reads_corrected,percent_corrected where "corrected" means we're ignoring those genes that had too many reads hit
- mapped_uncorrected_rimma.txt: total number of reads and reads mapped to v3.22 cDNA, uncorrected for genes with too many reads mapping. from get_mapping.sh
- example.abundance.txt: example abundance per gene output from Paul's parsesam.pl script
- errors/outs: You will need an "outs" and an "errors" directory to run these scripts on the cluster, but these are not included in the git as we don't need to track zillions of logfiles.
- data/hm2: this directory is referenced in scripts and is where I store the subsampled fastq from HM2. See "gethm2.sh" script for details. Not included in repo.
- alignments: where bam alignments are stored. not included in repo
- parsesam.pl: Paul's script for splitting sam output into reads per gene, counting multiplt mapping reads as fractions. Currently does not work correctly for paired end reads
You can see results of R analyses here and here.
Then we index the file:
bwa index Zea_mays.AGPv3.22.cdna.all.fa
Then we make sure fgs file is tweaked appropriately for our samples/dirs, and align:
sbatch -p bigmem fgs.sh
Then we get mapping data (this uses Paul's script and assumes single end reads). May need to tweak headers and list of files names for this script to get it to run appropriately too.
sbatch -p bigmem get_mapping.sh
Note that the abundance files that come out of get_mapping.sh each have a line with "*" which has the number of unmapped reads.
Now we have a list of files names "abundance.blah" that have per-gene counts of reads mapping. We want to use these to filter out genes that have way too many reads mapping and should be removed from our "standard".
First we run through each file, get the total number of reads. Run back through each file and calculate the % of reads mapping to each genes. We flag any gene that has more than 0.00001% of reads mapping. We do this across all files, make a list of genes to ignore, and write that to "skip_genes.txt" in the results directory.
cut -f 1 <( for i in $( ls abundance*); do perl -e '@file=<>; $sum=0; foreach(@file){ ($gene,$reads)=split(/,/,$_); $sum+=$reads}; foreach(@file){ ($gene,$reads)=split(/,/,$_); next if $gene=~m/\*/; print "$gene\t",$reads/$sum,"\n" if $reads/$sum>5E-5; } ' < $i; done ) | sort -n | uniq > skip_genes.txt
Note that 5E-5 is not a perfect cutoff. For some files you may need higher/lower. Key is to rule out the worst set of repetitive stuff, so probably not more than 5000 genes you want to be removing.
Using the skip_genes.txt file, we run back through each abundance file, and ignore genes that should be skipped, recalculating the reads mapping to our "good" reference, and writing that to a file.
for i in $( ls abundance*); do echo "$i,$( perl -e 'open BAD, "<skip_genes.txt"; while(<BAD>){ chomp; $badgenes{$_}=1;}; close BAD; @file=<>; $count=0; $totcount=0; $sum=0; $bigsum=0; foreach(@file){ ($gene,$reads)=split(/,/,$_); $bigsum+=$reads; $totcount+=1; next if $gene=~m/\*/; next if $badgenes{$gene}; $count+=1; $sum+=$reads; } print "$bigsum,$totcount,$sum,$count,",$sum/$bigsum,"\n";' < $i )" ; done > fixed_genes_percent.txt
This columns of this file are: total_reads, total_genes, corrected_reads, corrected_genes, percent_corrected. Percent corrected is corrected_reads/total_reads.