

Working on step 7 ("HISAT2 confirmation of removal of human data") about covid-19-signal HOT 14 CLOSED

kmsmith137 commented on August 22, 2024

Working on step 7 ("HISAT2 confirmation of removal of human data")

from covid-19-signal.

Comments (14)

jts commented on August 22, 2024

I think a useful analysis to inform this discussion would be:

a) map human reads to the ncov reference to see what maps
b) map coronavirus reads to human to see what gets lost

(NB: in real amplicon sequencing sets, very few reads (if any) will be human, so a) will massively overestimate what happens in a real experiment.)
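Either cross-mapping experiment, a) or b), boils down to asking what fraction of reads from one source align to the other reference. A minimal sketch of that summary, assuming the alignments are available as SAM text lines (the function name `mapped_fraction` is hypothetical and not part of the pipeline):

```python
def mapped_fraction(sam_lines):
    """Return the fraction of primary SAM records that are mapped.

    Secondary (0x100) and supplementary (0x800) records are skipped so
    each read (or mate) is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0
```

For a), the input would be human reads aligned to the SARS-CoV-2 reference; for b), SARS-CoV-2 reads aligned to the human reference.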


fmaguire commented on August 22, 2024

This was resolved in commit 7b580ba, right?


jaleezyy commented on August 22, 2024

I believe we wanted confirmation to be through HISAT2 alignment against the human genome? This looks like the removal of non-SARS-CoV-2 reads, but not necessarily the confirmation step.


fmaguire commented on August 22, 2024

Considering we are throwing away any reads that don't map to the reference with the new "core" (see current branch) we can probably dispense with this entirely.


agmcarthur commented on August 22, 2024

There was discussion in one of the groups that step 5 (removal of non-SARS-CoV-2) be via alignment to the SARS-CoV-2 genome but step 7 verification was alignment against human genome as this combination would absolutely pass ethics/privacy requirements and provide full confidence to our healthcare colleagues.


fmaguire commented on August 22, 2024

Just chatting about that on the cancogen call. Sounds like: map raw sorted reads to human reference and remove any reads that map. Then go ahead with the remaining reads for trimming.

The human-removed but otherwise raw reads can then be uploaded to SRA?

I can add that and remake the PR.
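The "remove any reads that map" step could be sketched as below, assuming the IDs of reads that aligned to the human reference have already been collected from the aligner output. The helper names `read_id` and `drop_host_reads` are hypothetical, and the sketch assumes plain four-line FASTQ records:

```python
def read_id(header):
    """Extract the read ID from a FASTQ header, dropping any /1 or /2 mate suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_host_reads(fastq_lines, host_read_ids):
    """Return FASTQ lines for every record whose ID is not in host_read_ids."""
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    kept = []
    # Walk the file four lines at a time: header, sequence, '+', qualities.
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        if read_id(record[0]) not in host_read_ids:
            kept.extend(record)
    return kept
```

Because both mates share one ID after the suffix is stripped, a pair is dropped together if either mate was flagged as human.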


agmcarthur commented on August 22, 2024

Are we sure removing based on alignment to human won't exclude some legitimate SARS-CoV-2 data? Don't want to create a region of false low coverage.


jts commented on August 22, 2024

Thinking about it a bit more, human WGS won't be representative of the type of off-target sequences we might see, so I'm going off the idea of a) a bit. b) is worth doing, though.


robynslee commented on August 22, 2024

Yeah, not sure about the benefit of a). b) seems useful, as that's the issue we're worried about. Might be good to try b) with a diverse sample of strains from GISAID to get a sense of this.


jts commented on August 22, 2024

Yeah, the thought with a) was to test whether mapping to human is necessary to remove any potential host reads. It would be far simpler if we could just map to coronavirus and discard everything else. I wanted to address that with a), but realized it won't really answer the question.


robynslee commented on August 22, 2024

Yes, agree, a) doesn't address that question.


fmaguire commented on August 22, 2024

Just throwing this in here so it's all together.

Need to consider mapping scores: galaxyproject/SARS-CoV-2#49
Could do with a set of SRA archives across the nextstrain tree for doing this analysis (among other QC): e.g. create a composite reference and see if there is a good threshold for distinguishing host contamination.
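One way to look for such a threshold is to count, for each candidate MAPQ cutoff, how many reads that hit both viral and human contigs would be treated as host. A rough sketch, assuming the per-read best mapping qualities have already been extracted as `(viral_mapq, human_mapq)` pairs (the function name and default cutoffs are illustrative assumptions):

```python
def sweep_human_mapq_thresholds(mapq_pairs, thresholds=(0, 10, 20, 30, 40, 50, 60)):
    """Count multi-hit reads whose human MAPQ meets each candidate threshold.

    mapq_pairs: iterable of (viral_mapq, human_mapq) tuples, one per read
    that aligned to both viral and human contigs.
    """
    pairs = list(mapq_pairs)
    return {t: sum(1 for _, human in pairs if human >= t) for t in thresholds}
```

A cutoff where the count drops sharply, while reads with a clear viral hit survive, would be a candidate for the host-removal step.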


fmaguire commented on August 22, 2024

I took the one Wuhan scheme Illumina sample I had to hand and ran BWA-MEM versus a composite human + viral reference.

Of those reads which mapped to both viral and human contigs (~200 or 0.02%), the distribution of the respective mapping qualities looks like this:

So most of the small number of problematic reads are a clear viral hit and lower quality human hit (as Torsten suggests in the linked thread).
If we just take those multihit reads with a MAPQ>=30 to the human reference, we are left with 13 (0.002%) reads:

We could save a whole 4 of them by comparing the map-qualities between human and viral. The remaining 11 reads with equally good hits to viral and human aren't likely to majorly affect the viral consensus or variant calling.

I could grab a bunch more SRAs and do this across a lot more samples but honestly, we are probably fine just using BWA-MEM MAPQ>=30 in the host removal stage and calling it a good'un.
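The MAPQ>=30 rule, with the optional human-versus-viral comparison, might look roughly like this when applied to SAM text from an alignment against the composite reference. This is a sketch under assumptions, not the pipeline's implementation: the viral contig name `MN908947.3` is assumed, every other contig is treated as human, mates are pooled by read name, and both function names are hypothetical.

```python
from collections import defaultdict

VIRAL_CONTIGS = {"MN908947.3"}  # assumption: name of the viral contig in the composite reference
MIN_HUMAN_MAPQ = 30             # threshold proposed in the comment above


def best_mapq_per_read(sam_lines, viral_contigs=VIRAL_CONTIGS):
    """Collect the best viral and best human MAPQ seen for each read name."""
    best = defaultdict(lambda: {"viral": -1, "human": -1})
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM headers
        fields = line.rstrip("\n").split("\t")
        name, flag, contig, mapq = fields[0], int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4 or contig == "*":
            continue  # unmapped record
        origin = "viral" if contig in viral_contigs else "human"
        best[name][origin] = max(best[name][origin], mapq)
    return dict(best)


def reads_to_remove(best, min_human_mapq=MIN_HUMAN_MAPQ, compare_to_viral=False):
    """Return names of reads treated as host contamination.

    With compare_to_viral=True, a read is rescued when its viral MAPQ is
    strictly higher than its human MAPQ.
    """
    removed = set()
    for name, q in best.items():
        if q["human"] < min_human_mapq:
            continue
        if compare_to_viral and q["viral"] > q["human"]:
            continue
        removed.add(name)
    return removed
```

Reads with equally good viral and human hits are still removed under both settings, matching the conservative choice described above.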


agmcarthur commented on August 22, 2024

This is excellent. Please make sure it is clear in the documentation, including the supporting data.

