Coder Social home page Coder Social logo

qtlseqr's Introduction

QTLseqr v0.7.5.2

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis.

QTLseqr is still under development and is offered with out any guarantee.

For more detailed instructions please read the vignette here

For updates read the NEWS.md

Installation

You can install QTLseqr from github with:

# install devtools first to download packages from github
install.packages("devtools")

# use devtools to install QTLseqr
devtools::install_github("bmansfeld/QTLseqr")

Note: Apart from regular package dependencies, there are some Bioconductor tools that we use as well, as such you will be prompted to install support for Bioconductor, if you haven’t already. QTLseqr makes use of C++ to make some tasks significantly faster (like counting SNPs). Because of this, in order to install QTLseqr from github you will be required to install some compiling tools (Rtools and Xcode, for Windows and Mac, respectively).

If you use QTLseqr in published research, please cite:

Mansfeld B.N. and Grumet R, QTLseqr: An R package for bulk segregant analysis with next-generation sequencing The Plant Genome doi:10.3835/plantgenome2018.01.0006

We also recommend citing the paper for the corresponding method you work with.

QTL-seq method:

Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J, 74: 174–183. doi:10.1111/tpj.12105

G prime method:

Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi.org/10.1371/journal.pcbi.1002255

Abstract

Next Generation Sequencing Bulk Segregant Analysis (NGS-BSA) is efficient in detecting quantitative trait loci (QTL). Despite the popularity of NGS-BSA and the R statistical platform, no R packages are currently available for NGS-BSA. We present QTLseqr, an R package for NGS-BSA that identifies QTL using two statistical approaches: QTL-seq and G’. These approaches use a simulation method and a tricube smoothed G statistic, respectively, to identify and assess statistical significance of QTL. QTLseqr, can import and filter SNP data, calculate SNP distributions, relative allele frequencies, G’ values, and log10(p-values), enabling identification and plotting of QTL.

Examples:

Example figure

Example figure

###For more detailed instructions please read the vignette here

This is a basic example which shows you how to import and analyze NGS-BSA data.

#load the package
library("QTLseqr")

#Set sample and file names
HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
file <- "SNPs_from_GATK.table"

#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
Chroms <- paste0(rep("Chr", 12), 1:12)

#Import SNP data from file
df <-
    importFromGATK(
        file = file,
        highBulk = HighBulk,
        lowBulk = LowBulk,
        chromList = Chroms
     )

#Filter SNPs based on some criteria
df_filt <-
    filterSNPs(
        SNPset = df,
        refAlleleFreq = 0.20,
        minTotalDepth = 100,
        maxTotalDepth = 400,
        minSampleDepth = 40,
        minGQ = 99
    )


#Run G' analysis
df_filt <- runGprimeAnalysis(
    SNPset = df_filt,
    windowSize = 1e6,
    outlierFilter = "deltaSNP")

#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
    SNPset = df_filt,
    windowSize = 1e6,
    popStruc = "F2",
    bulkSize = c(25, 25),
    replications = 10000,
    intervals = c(95, 99)
)

#Plot
plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)

#export summary CSV
getQTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "my_BSA_QTL.csv")

qtlseqr's People

Contributors

bmansfeld avatar rvosa avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

qtlseqr's Issues

error: fewer than one row in the data in runGprimeAnalysis

Hi Ben,
I got a problem in runGprimeAnalysis when I use version 0.7.5.2
Here is my code, same with your guide.

> df_filt <- runGprimeAnalysis(
+   SNPset = df_filt,
+   windowSize = 1e6,
+   outlierFilter = "deltaSNP")

Here is output:

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Error in locfit::locfit(Stat ~ locfit::lp(POS, h = windowSize, deg = 0),  : 
  fewer than one row in the data
In addition: There were 37 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base,  ... :
  Estimated rdf < 1.0; not estimating variance

Here is code I use to get table, same with your guide for -F and -GF:

gatk VariantsToTable -R ${ref}/$reference_file -V ${reference}_Mr_cohort_snps.vcf -F CHROM -F POS -F REF -F ALT -GF AD -GF DP -GF GQ -GF PL -O ${reference}_Mr_db.snps.vcf.table

Here is head of df:

> head(df)
  CHROM   POS REF ALT AD_REF.LOW AD_ALT.LOW DP.LOW GQ.LOW     PL.LOW SNPindex.LOW AD_REF.HIGH AD_ALT.HIGH DP.HIGH GQ.HIGH     PL.HIGH SNPindex.HIGH
1  Chr1 31071   A   G         34         36     70     99  897,0,855    0.5142857          26          22      48      99   522,0,698     0.4583333
2  Chr1 31478   C   T         34         52     86     99 1363,0,844    0.6046512          40          34      74      99  848,0,1099     0.4594595
3  Chr1 33667   A   G         20         48     68     99 1331,0,438    0.7058824          24          29      53      99   765,0,599     0.5471698
4  Chr1 34057   C   T         38         40     78     99 1059,0,996    0.5128205          29          26      55      99   673,0,772     0.4727273
5  Chr1 35239   A   C         25         36     61     99  987,0,630    0.5901639          40          60     100      99 1632,0,1015     0.6000000
6  Chr1 38389   T   C         36         42     78     99 1066,0,906    0.5384615          42          40      82      99  984,0,1105     0.4878049
    REF_FRQ     deltaSNP
1 0.5084746 -0.055952381
2 0.4625000 -0.145191703
3 0.3636364 -0.158712542
4 0.5037594 -0.040093240
5 0.4037267  0.009836066
6 0.4875000 -0.050656660

Could you tell me what might cause that error?
Thanks!

Issue with plotQTLstat

Hi Ben,

I am new user of QTLseqr, but it seems a really nice package!! Congrats!!
However, I got an error when trying to plot graphs from plotQTLstat as described above.

p2 <- plotQTLStats(SNPset = data_filt, var = "deltaSNP", plotIntervals = TRUE)
p2
Error in f(..., self = self) : Breaks and labels are different lengths
In addition: Warning message:
In x/limits[i] :
longer object length is not a multiple of shorter object length

I am trying to plot a yeast QTL analysis having 16 chromossomes with distinct sizes.
By using subset() and manualy adding the chromossomes I figured out that I got the graph when I exclude the two majors chromossomes > 1Mb.
So seems the error happens when one of these "large" chromossomes is on my plot.

Do you know how to fix it? Sorry if the question sounds idiot.

Best,

Alessandro

Error: Problem with `mutate()` column `tricubeDeltaSNP`

prepare the file

gatk VariantsToTable -R ${RefDir}/$Ref -V merge.SNP.filter.vcf
-F CHROM -F POS -F REF -F ALT
-GF AD -GF DP -GF GQ -GF PL
-O merge.snp.table

HighBulk="R"
LowBulk="S"
rawData="merge.snp.table"
Chroms <- "Chr1"

read file

df <- importFromGATK( file =rawData, highBulk =HighBulk, lowBulk =LowBulk, chromList =Chroms )
#output
`
head(df)

CHROM POS REF ALT AD_REF.LOW AD_ALT.LOW DP.LOW GQ.LOW PL.LOW SNPindex.LOW AD_REF.HIGH AD_ALT.HIGH DP.HIGH GQ.HIGH PL.HIGH SNPindex.HIGH REF_FRQ deltaSNP
1 Chr1 144949 G C 0 5 5 15 225,15,0 1 0 0 0 NA 0,0,0 NaN 0 NaN
2 Chr1 144955 G A 0 5 5 15 225,15,0 1 0 0 0 NA 0,0,0 NaN 0 NaN
3 Chr1 144979 T C 0 6 6 18 270,18,0 1 0 0 0 NA 0,0,0 NaN 0 NaN
4 Chr1 144984 C A 0 6 6 18 270,18,0 1 0 0 0 NA 0,0,0 NaN 0 NaN
5 Chr1 144986 T C 0 6 6 18 270,18,0 1 0 0 0 NA 0,0,0 NaN 0 NaN
6 Chr1 519510 A T 0 0 0 NA 0,0,0 NaN 0 1 1 3 45,3,0 1 0 NaN
`

filter SNP

df_filt <- filterSNPs( SNPset =df, refAlleleFreq =0.20, minTotalDepth =100, maxTotalDepth =400, depthDifference =100, minSampleDepth =40, minGQ =99, verbose =TRUE )
#output

Filtering by reference allele frequency: 0.2 <= REF_FRQ <= 0.8
...Filtered 27647 SNPs
Filtering by total sample read depth: Total DP >= 100
...Filtered 11087 SNPs
Filtering by total sample read depth: Total DP <= 400
...Filtered 447 SNPs
Filtering by per sample read depth: DP >= 40
...Filtered 156 SNPs
Filtering by Genotype Quality: GQ >= 99
...Filtered 486 SNPs
Filtering by difference between bulks <= 100
...Filtered 119 SNPs
Original SNP number: 41064, Filtered: 39942, Remaining: 1122

Run G' analysis

df_filt <- runGprimeAnalysis( SNPset = df_filt, windowSize = 1e6, outlierFilter = "deltaSNP")
#output

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Error: Problem with mutate() column tricubeDeltaSNP.
i tricubeDeltaSNP = tricubeStat(POS = POS, Stat = deltaSNP, windowSize, ...).
x newsplit: out of vertex space
i The error occurred in group 1: CHROM = Chr1.
Run rlang::last_error() to see where the error occurred.
addation: There were 50 or more warnings (use warnings() to see the first 50)

version

platform x86_64-w64-mingw32 arch x86_64 os mingw32 system x86_64, mingw32 status major 4 minor 0.5 year 2021 month 03 day 31 svn rev 80133 language R version.string R version 4.0.5 (2021-03-31) nickname Shake and Throw

Replication

Just having a play with QTLseqr and it seems to work quite nicely. Great job.

Just wondering if/can you consider replicates in the analysis. So far it looks like it one uses a single high and low bulk as input.

Cheers,
Steve

i meet a bug and need your help.

Hi,i m so great sorry to trouble you.when i ues runQTLseqAnalysis to calculate the CI of 99% and 95%,i found when i set the windowSize is 1Mb,the error will arrival 'newsplit :out of vertex space',i thought my computer‘s space to little,but i can not sovle the problem when i use a bigger space computer.so i have to take the qustion to you for help.

Evaluation error: newsplit: out of vertex space.

when i use my data analysis, and i meet the error

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Error in mutate_impl(.data, dots) : 
  Evaluation error: newsplit: out of vertex space.
besides: There were 50 or more warnings (use warnings() to see the first 50)

my code is
df_filt <- runGprimeAnalysis(SNPset = df_filt,windowSize = 1000,outlierFilter = "deltaSNP")

df <- importFromTable() troubles

Hello bmansfeld

I am having problems running QTLseqr. When I use the "importFromTable" function to assign the value to the variable "df" I am getting the following failures.

I have defined the variables as follows (the names "POOLR_sorted.bam" and "POOLS_sorted.bam" are the names of the two pools I use):

library("QTLseqr")
HighBulk <- "POOLR_sorted.bam"
LowBulk <- "POOLS_sorted.bam"
Chroms <- paste0(rep("chr2",793229), 2:793230)

On the other hand, I created a table or csv document, and saved it using the "character set: Unicod (UTF-8)", as "Field delimiter: {Tables}", and as 'String delimiter: " ', and saved with the option "Save cell contents as shown" checked.

Subsequently, I have executed the function "importFromTable()", assigning the result to the variable "df" as it is done in the guide. However, I get the following error:

**df <- importFromTable(file="BSA_tabla.csv,
highBulk = "POOLR_sorted.bam",
lowBulk = "POOLS_sorte.bam",
chromList = Chroms)

Error in importFromTable(file = "tablaBSA.csv", highBulk = HighBulk, :
No 'CHROM' coloumn found.
Además: Warning message:
The following named parsers don't match the column names: CHROM, POS**

After seeing this error, I looked in the help or support offered by R: "?importFromTable()". After looking at this help, I saw that the chromosome list was not needed and that the highBulk and lowBulk values could be assigned directly, so I tried the following:

**df <- importFromTable(file="BSA_tabla.csv,
highBulk = "POOLR_sorted.bam",
lowBulk = "POOLS_sorte.bam")

Error in importFromTable(file = "tablaBSA.csv", highBulk="POOLR_sorted.bam", :
No 'CHROM' coloumn found.
Además: Warning message:
The following named parsers don't match the column names: CHROM, POS**

However, I kept getting the same error, and it would not allow me to move forward with the analysis.
Could you please help me?

Thank you very much in advance for your help.

Best regards,

Juan Luis

Cannot install "tools" for QTLseqr

Hi.

I'm getting the following error when trying to install QTLseqr:

> devtools::install_github("bmansfeld/QTLseqr")
Downloading GitHub repo bmansfeld/QTLseqr@master
Error: Could not find tools necessary to compile a package

Any idea what this means?

Thanks for any help

The script running normal and no error produced, why no plot generated? just only produce a table of CSV

Help my QTLseqr was installed in linux R envs, R was installed use anconda, the QTLseqr installed success, but when running my script, though no error produced, but the Rplots.pdf and df had nothing!
Has anyone meet this situation?

plotQTLStats(

  • SNPset = df_filt,
  • var = "Gprime",
  • plotThreshold = TRUE,
  • q = 0.01
  • )

plotQTLStats(

  • SNPset = df_filt,
  • var = "deltaSNP",
  • plotIntervals = TRUE)

getQTLTable(

  • SNPset = df_filt,
  • alpha = 0.01,
  • export = TRUE,
  • fileName = "my_BSA_QTL1.csv"
  • )
    CHROM qtl start end length nSNPs avgSNPs_Mb peakDeltaSNP
    1 R05 1 2404 2923609 2921205 1689 578 0.3715301
    2 R08 2 356332 3308448 2952116 316 107 0.5995971
    3 R08 3 5518535 21451977 15933442 2755 173 0.6366040
    4 R08 4 27354643 27493493 138850 4 29 0.3348856
    posPeakDeltaSNP avgDeltaSNP maxGprime posMaxGprime meanGprime sdGprime
    1 1892524 0.3606035 17.80104 1892524 16.34035 1.059533
    2 3027509 0.4555751 19.03438 1280127 17.04648 1.551114
    3 6880353 0.4669769 32.52107 6885361 22.81858 3.327091
    4 27493493 0.2800255 19.11753 27493493 16.10660 2.012530
    AUCaT meanPval meanQval
    1 5917302 0.0002306070 0.006173703
    2 7621865 0.0001949842 0.005554059
    3 147785964 0.0000461582 0.002235581
    4 381450 0.0002735495 0.006789872

for help

library("QTLseqr")
library("Yang2013data")
rawData <- system.file("extdata","Yang_et_al_2013.table",package = "Yang2013data",mustWork = TRUE)
HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
Chroms <- paste0(rep("Chr", 12), 1:12)
df <- importFromGATK(file = rawData,highBulk = HighBulk,lowBulk = LowBulk,chromList = Chroms)
###Run here and the error is as follows
Error in setNames(rep(list(readr::col_integer()), length(int_matches)), :
unused argument (int_matches)
###Please help me how should this solve?

argument 'x' contains missing values

df_filt <- runGprimeAnalysis(df_filt,

  •                          windowSize = 1e6,
    
  •                          outlierFilter = "deltaSNP",
    
  •                          filterThreshold = 0.1)
    

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.1
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Error: argument 'x' contains missing values
I encountered the above error. Please help.

Request for Help: no points with non-zero weight

I am having some problems running QTLseqR on a data set.

For both #Run QTLseq analysis and #Run Gprime analysis I am getting the following error

#Run G' analysis
df_filt <- runGprimeAnalysis(
SNPset = df_filt,
windowSize = 1e6,
outlierFilter = "deltaSNP")
Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.1
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Calculating p-values...
There were 50 or more warnings (use warnings() to see the first 50)

#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
SNPset = df_filt,
windowSize = 1e6,
popStruc = "F2:3",
bulkSize = c(25, 25),
replications = 10000,
intervals = c(95, 99)
)
Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Returning the following two sided confidence intervals: 95, 99
Variable 'depth' not defined, using min and max depth from data: 40-191
Assuming bulks selected from RIL population, with 2525 individuals per bulk.
Simulating 10000 SNPs with reads at each depth: 40-191
Keeping SNPs with >= 0.3 SNP-index in both simulated bulks
Joining, by = "tricubeDP"
There were 50 or more warnings (use warnings() to see the first 50)

warnings gives me

warnings()
Warning messages:
1: In lfproc(x, y, weights = weights, cens = cens, base = base, ... :
procv: no points with non-zero weight

repeated 50 times.

When running the same script with the Yang dataset from the vignette example everything works fine, however once I try on my data it does not seem to work.

Comparing the df_filt variables from the yang data set and my own visually lead to me concluding there was no differences is format.

Here is my df_filt

head(df_filt)
CHROM POS REF ALT AD_REF.LOW AD_ALT.LOW DP.LOW GQ.LOW PL.LOW
1 Chr1 154894 T C 19 57 76 99 2418,0,594
2 Chr1 676832 A G 23 42 65 99 1690,0,872
3 Chr1 676847 T G 23 43 66 99 1711,0,866
4 Chr1 676883 A G 23 43 66 99 1736,0,944
5 Chr1 676884 T G 23 43 66 99 1736,0,944
6 Chr1 868399 A G 23 48 71 99 1915,0,1126
SNPindex.LOW AD_REF.HIGH AD_ALT.HIGH DP.HIGH GQ.HIGH PL.HIGH
1 0.7500000 21 60 81 99 2485,0,740
2 0.6461538 19 40 59 99 1461,0,654
3 0.6515152 15 40 55 99 1480,0,490
4 0.6515152 21 40 61 99 1616,0,886
5 0.6515152 21 40 61 99 1616,0,886
6 0.6760563 15 78 93 99 3167,0,703
SNPindex.HIGH REF_FRQ deltaSNP nSNPs tricubeDeltaSNP G
1 0.7407407 0.2547771 -0.009259259 1 0.04474525 0.017710620
2 0.6779661 0.3387097 0.031812256 8 0.07020005 0.139879722
3 0.7272727 0.3140496 0.075757576 8 0.07020079 0.803952922
4 0.6557377 0.3464567 0.004222553 8 0.07020254 0.002496501
5 0.6557377 0.3464567 0.004222553 8 0.07020259 0.002496501
6 0.8387097 0.2317073 0.162653339 8 0.05748516 5.948665346
Gprime pvalue negLog10Pval qvalue minDP tricubeDP CI_95 CI_99
1 1.156295 0.4455630 0.3510909 0.7744778 76 64 -0.328125 -0.40625
2 2.068951 0.2106250 0.6764902 0.6470770 59 64 -0.328125 -0.40625
3 2.068978 0.2106208 0.6764988 0.6470770 55 64 -0.328125 -0.40625
4 2.069040 0.2106107 0.6765196 0.6470770 61 64 -0.328125 -0.40625
5 2.069042 0.2106104 0.6765202 0.6470770 61 64 -0.328125 -0.40625
6 2.214282 0.1888550 0.7238714 0.6470770 71 64 -0.328125 -0.40625

as well as the R script I am using, which is modified from your documentations example.

#load the package
library("QTLseqr")

#Set sample and file names
HighBulk <- "egusi_bulk"
LowBulk <- "normal_bulk"
file <- "normal_egusi_SNPs_forR2.table"

#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
Chroms <- paste0(rep("Chr", 11), 1:11)

#Import SNP data from file
df <-
importFromGATK(
file = file,
highBulk = HighBulk,
lowBulk = LowBulk,
chromList = Chroms
)

#Filter SNPs based on some criteria
df_filt <-
filterSNPs(
SNPset = df,
refAlleleFreq = 0.20,
minTotalDepth = 100,
maxTotalDepth = 400,
minSampleDepth = 40,
minGQ = 99
)

#Run G' analysis
df_filt <- runGprimeAnalysis(
SNPset = df_filt,
windowSize = 1e6,
outlierFilter = "deltaSNP")

#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
SNPset = df_filt,
windowSize = 1e6,
popStruc = "F2:3",
bulkSize = c(25, 25),
replications = 10000,
intervals = c(95, 99)
)

#Plot
plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)

#export summary CSV
getQTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "egusi_QTL.csv")

Any help you could give would be much appreciated. I have not been able to find any answers online.

Thanks!

plotQTLStats scaling

really nice package you created, it is really easy to use.
I think it would be better though to set the scale and space in the plotQTLStats function to free_x otherwise the plots looks misleading as chromosomes look differently large.

Error: argument 'x' contains missing values

Please help me in resolving the error of missing values for x, which is not allowing me to calculate G prime.

df_filt <- runGprimeAnalysis(

  • SNPset = df_filt,
    
  • windowSize = 1e6,
    
  • outlierFilter = "deltaSNP")
    

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.1
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Error: argument 'x' contains missing values

My file looks like this:

head(df_filt)
CHROM POS REF ALT AD_REF.LOW AD_ALT.LOW DP.LOW GQ.LOW PL.LOW SNPindex.LOW AD_REF.HIGH AD_ALT.HIGH DP.HIGH
1 Chr1 1058 T TA 28 18 46 NA NA 0.3913043 41 17 58
2 Chr1 2541 T TC 31 4 35 NA NA 0.1142857 25 2 27
3 Chr1 2559 CTAGG ATAGT,CTAGT 32 10 42 NA NA 0.2380952 28 9 37
4 Chr1 2627 GCTTC ACTCT,ACGTC,GCTCC 41 18 59 NA NA 0.3050847 NA NA 40
5 Chr1 2653 C T 33 31 64 NA NA 0.4843750 29 13 42
6 Chr1 2722 ATTTTTTTTGAG ATTTTTTTTTAAG,ATTTTTTTTAAA NA NA 57 NA NA NA 24 15 39
GQ.HIGH PL.HIGH SNPindex.HIGH REF_FRQ deltaSNP nSNPs tricubeDeltaSNP minDP tricubeDP CI_95 CI_99
1 NA NA 0.29310345 0.6634615 -0.098200900 619 -0.009587027 46 30 -0.4333333 -0.5333333
2 NA NA 0.07407407 0.9032258 -0.040211640 622 -0.009638414 27 30 -0.4333333 -0.5333333
3 NA NA 0.24324324 0.7594937 0.005148005 622 -0.009639038 37 30 -0.4333333 -0.5333333
4 NA NA NA NA NA 622 -0.009641394 40 30 -0.4333333 -0.5333333
5 NA NA 0.30952381 0.5849057 -0.174851190 622 -0.009642295 42 30 -0.4333333 -0.5333333
6 NA NA 0.38461538 NA NA 622 -0.009644686 39 30 -0.4333333 -0.5333333

QTLseqR on GBS data

Hi Ben,
in a previous experiment, we run your pipeline with really nice results performing wgs on the bulks.
That's not economically feasible in species with really wide genomes (7-9Gb).

What I would like to do now is to use QTLseqR with GBS data, using a catalog from stacks as reference genome.
Is it something you think is doable with your pipeline?

I can try anylize the data again, but last time I tried I ended up with several errors.

Thanks!
Matteo

Unknown or uninitialised column: 'CHROM'

when i'm useing the qtlseqr, and i have a promble, i can find the mathod
my file is 'zhang'

CHROM	POS	REF	ALT	H.AD	H.DP	H.GQ	H.PL	L.AD	L.DP	L.GQ	L.PL
Chr1	1984	T	C	19,0	19	57	0,57,956	10,3	13	87	87,0,420
Chr1	1994	C	T	19,1	20	34	0,34,532	10,4	14	61	61,0,297
Chr1	1995	G	A	21,0	21	63	0,63,1046	11,3	14	84	84,0,436
Chr1	2025	G	A	20,7	27	99	128,0,515	17,4	21	75	75,0,515
Chr1	2031	C	T	31,0	31	93	0,93,1355	19,5	24	76	76,0,667
Chr1	16386	C	A	3,2	5	75	75,0,140	6,0	6	18	0,18,250
Chr1	16402	TA	T	5,0	5	15	0,15,185	3,2	5	49	49,0,106
Chr1	16417	G	C	5,2	7	69	69,0,187	7,0	7	21	0,21,277
Chr1	17148	T	A	16,6	22	68	68,0,526	18,0	18	60	0,60,663
high<-"H"
low<-"L"
df<-importFromGATK(file="zhang",highBulk=high,lowBulk=low,chromList="Chr1")
Parsed with column specification:
cols(
  CHROM = col_integer(),
  POS = col_integer(),
  REF = col_character(),
  ALT = col_character(),
  H.AD = col_character(),
  H.DP = col_integer(),
  H.GQ = col_integer(),
  H.PL = col_character(),
  L.AD = col_character(),
  L.DP = col_integer(),
  L.GQ = col_integer(),
  L.PL = col_character()
)
Removing the following chromosomes: 
Warning messages:
1: Unknown or uninitialised column: 'CHROM'. 
2: Unknown or uninitialised column: 'CHROM'. 
3: Unknown or uninitialised column: 'CHROM'. 
4: Length of logical index must be 1 or 99, not 0 
5: Unknown or uninitialised column: 'CHROM'. 
6: Unknown or uninitialised column: 'CHROM'. 

What's the increment of the sliding window analysis?

Dear Ben,
I am trying to use the QTLseqr to treat my bulked data for a rice QTL-seq analysis. In your example of rice study, you showed the fixed size (1000kb) of the sliding window analysis but didn't mention the increment of the sliding window. In Takagi's paper, they chose the window size as 2Mb with 10kb increment to apply the sliding window analysis. Could you tell me the increment size in the rice example? Or how can I set the sliding window increment size? Thank you very much.
Sincerely
Jifeng

some error comes

#Import SNP data from file
df <-

  • importFromGATK(
    
  •     file = file,
    
  •     highBulk = HighBulk,
    
  •     lowBulk = LowBulk,
    
  •     chromList = Chroms
    
  • )
    

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'SNPs_from_GATK.table': No such file or directory

No QTL signal when running runQTLseqAnalysis() but a clear signal when running runGprimeAnalysis()

Hi,

When I run the Takagi method I get no singal at all. Below you have the command that I am running

df_filt_qtlseq <- runQTLseqAnalysis(df_filt, windowSize = 5e6,
                                    popStruc = "F2",
                                    bulkSize = c(14, 14),
                                    replications = 10000,
                                    intervals = c(95, 99))

plotQTLStats(SNPset = df_filt_qtlseq, var = "deltaSNP", plotIntervals = TRUE) + 
  ggtitle('deltaSNP - 5Mbp window') + 
  theme(axis.text.x = element_text(angle = 90, size = 6)) + 
  theme(plot.title = element_text(size = 10))

with does not give me any significant signal at all: Please see figure: QTLseqR_deltaSNP_5Mbp.pdf

However, running the G prime method, I get a clear signal in chromosome 1.

runGprimeAnalysis(SNPset = df_filt, windowSize = 5e6, outlierFilter = "deltaSNP", 
                                    filterThreshold = 0.01)

plotQTLStats(SNPset = df_filt_gprime, var = "Gprime", plotThreshold = TRUE, q=0.01) + 
  ggtitle('Gprime - 5Mbp') +
  theme(axis.text.x = element_text(angle = 90, size = 6)) +
  theme(plot.title = element_text(size = 10))

G prime analysis figure: QTLseqR_Gprime-distribution-deltaSNPoutlierFilter.pdf

Please let me know if you need any additional information. I am using QTLseqr_0.7.5.2.

Best
Kostas

"runGprimeAnalysis" cannot work well

QTLseq analysis can be successfully runned and get the good QTL analysis plots. Howerver, when I try to
extract significant regions using "getSigRegions", I was told the following warnings:
"Error in****getSigRegions(SNPset = df_filt, alpha = 0.01) : Please first use runGprimeAnalysis to calculate q-values".
Thus, I run "> df_filt <- runGprimeAnalysis(SNPset = df_filt, windowSize = 2e8, outlierFilter = "deltaSNP")"
Then more warnings coming.........

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.1
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Error in mutate_impl(.data, dots) :
Evaluation error: $ operator is invalid for atomic vectors.

May you give me some advices ?

Narrow down QTL region

Is there a way to further narrow down a QTL region?
I have whole genome sequencing data and my QTL after a standard run of QTLseqr ist over 10 mb including 43126 SNPs. Is there a way I can further narrow this down?

Attempting to use QTLseqR..learning :)

Hi Ben I just downloaded QTLseqr latest version and tried to run the example. It kept giving the below message. Thanks for any help.
Shaker

df<-importFromGATK(file = file, highBulk = HighBulk, lowBulk = LowBulk, chromList = Chroms)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'SNPs_from_GATK.table': No such file or directory

cant install this R package

Hi!
i used the install code on Rstudio ,but connection is timeout, same for conda.
how can i fix this?

best
jennies

getQTLTable without running runGprimeAnalysis

Hello,
To my knowledge , runGprimeAnalysis and runQTLseqAnalysis are two seperate function. after running runQTLseqAnalysis, I want to get QTL region using getQTLTable, but the screen says "Error in getQTLTable(SNPset = df_filt, alpha = 0.01, interval = 99.9999999, : Please first use runGprimeAnalysis to calculate q-values".

Can't I only use runQTLseqAnalysis and get the QTL region ?

If I run runGprimeAnalysis and then run runQTLseqAnalysis , which is the result comes from ? runGprimeAnalysis or runQTLseqAnalysis

windowSize change, popStruc change

Hello Ben,
I have 2 questions in runing the under lines of QTLseq analysis.

#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
SNPset = df_filt,
windowSize = 1e6,
popStruc = "F2",#根据群体结构调整F2或者RIL
bulkSize = c(160, 160),#根据池中的样本单株数量
replications = 10000,
intervals = c(95, 99)
)

The 1st is that how can I change the "windowSize = 1e6" to 1e5.
The 2nd is that how can I change the popStruc = "F2" to "BC1".

looking for your help.
Thanks.
Rui

BUG?

when i try to run runGprimeAnalysis function, the following messages showed,and the result was become quite strange...

Counting SNPs in each window...
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.05
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Calculating p-values...
Warning messages:
1: Problem with `mutate()` input `pvalue`.
i encountered a tie, and the difference between minimal and 
                   maximal value is > length('x') * 'tie.limit'
the distribution could be multimodal
i Input `pvalue` is `getPvals(...)`. 
2: encountered a tie, and the difference between minimal and 
                   maximal value is > length('x') * 'tie.limit'
the distribution could be multimodal 

improving `bulkSize` option in `runQTLseqAnalysis()`

(before moving on to the issue, firstly thanks a lot for developing this very convenient package!)

In runQTLseqAnalysis() the option bulkSize only uses an integer of length 1. Therefore, if I understood the code rightly, the allele frequency simulation samples frequencies assuming both bulks have the same number of individuals.

However, in practice the two bulks might not have the same number of individuals. Perhaps it should be considered that the user could provide either one or two values for bulkSize. Then, instead of only generating one vector of "null" allele frequencies (temp_freq), two vectors could be generated, one for each pool?

Something like this might work?:

# Check it's a positive integer
if(!is.integer(bulkSize) | bulkSize <=0) stop("bulkSize must be a positive integer.")

# If length one assume both pools have same size
if(length(bulkSize) == 1){
   message("Assuming both pools have the same size: ", bulkSize)
   bulkSize <- c(bulkSize, bulkSize)
}

# Sample null allele frequencies for each pool separately
tmp_freq1 <-
   replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[1], pop = popStruc))
tmp_freq2 <-
   replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[2], pop = popStruc))

# Apply to depths
CI <- sapply(
            X = depth,
            FUN = function(x)
            {
                quantile(
                    x = simulateSNPindex(
                        depth = x,
                        altFreq1 = sample(
                            x = tmp_freq1,
                            size = replications,
                            replace = TRUE
                        ),
                        altFreq2 = sample(
                            x = tmp_freq2,
                            size = replications,
                            replace = TRUE
                        ),
                        replicates = replications,
                        filter = filter
                    ),
                    probs = intervals,
                    names = TRUE
                )
            }
)

Also, is it the case that this parameter should be the number of genome copies pooled, which for a diploid it would be 2x number of individuals pooled? If so, the documentation could be a bit more explicit about this, maybe?

I'm glad to fork and implement the suggestion above, but just wanted to check if it makes sense.

a little question of G stat

hi, I am so sorry for trouble you !
In script QTLseqr/R/G_functions.R when stat G
obs <- c(LowRef, HighRef, LowAlt, HighAlt)
G <-2 * (rowSums(obs * log(matrix(obs, ncol = 4) / matrix(exp, ncol = 4) )

so my question is "if the element of obs be equal to zero, how to stat log(0)?"

G' Distribution Fitting

Hello Ben,

I have another question this time regarding G' distribution fitting. We have noticed that making relatively small changes to our SNP sets can result in fairly profound differences in G' distribution fits. For example, we have seen that two almost identical SNP sets (where individual SNPs are assigned comparable G and G' values between SNP sets) are assigned different G' distribution fits, resulting in different adjusted p.values (q.values) between the SNP sets. A first glance at the respective deltaSNP and G' maps suggest that the maps are nearly identical between SNP sets, however the only significant change between them is where the FDR threshold gets drawn. Sure enough, if you compare the two SNP sets G' distribution fits, you can see that the two SNP sets are being fit to two distinct null G' distributions. The difference in distribution fits results in 5 significant peaks in one SNP set and zero significant peaks in the other, both at an FDR of 5% with other plotting and filtering settings constant.

I should also add that we have only seen this occur for one of our 6 experiments. In each of the other experiments we made similar edits to our SNP sets, and while there were usually some slight tweaks in distribution fits, the FDR thresholds and interpretation of significant peaks were largely unchanged.

Have you encountered anything similar before? If so, any insight about where to begin would be greatly appreciated!

Best,
Tanner

runGprimeAnalysis

when I run runGprimeAnalysis command, I got an error :
Calculating tricube smoothed delta SNP index...
Calculating G and G' statistics...
Using deltaSNP-index to filter outlier regions with a threshold of 0.1
Estimating the mode of a trimmed G prime set using the 'modeest' package...
Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to mathematical function.

Error: The subscript for category ' closure ' is wrong

Hi, Ben

when I call function "getSigRegions" to get the QTL snps, error was reported by R, as below:

Error: The subscript for category ' closure ' is wrong


I check source code of this function,found this:

qq 20180326113724

'table' is a function in R, is this the source of error?

Thanks!

a new question

i have calculate the snp-index of bulk_1 and bulk_2,and i take the office-excel to plot the picture,but i didn't plot the CI of 95% and 99%,i sincerely hope you can give me a script to sovle my trouble. the date is as follows:

image
and i have plot the picture as follows:
! Rplot01-1-1

but i need to plot the picture which contains the line of 95% CI and 99% CI, so i want to know can i use the data i povided above to cauculate the CI of 95% and 99% ?

Help me please...

Hello bmansfeld.
I keep getting the same error message while installing your package.
It was difficult to solve even with my Google search, and I would like to help you.
First, let's talk about my environment, my OS is Windows 10, and the installation program uses RStudio.
When I run the command devtools::install_github("bmansfeld/QTLseqr") , I get "Error: Failed to install 'unknown package' from GitHub:
Line starting 'LinkingTo ...' is malformed!"
Maybe it's a reference to LinkingTo in the last line, I think.
On the Internet, using remotes::install_github or using bioconductor, various methods were suggested, but the package could not be installed by either method, and the above message was displayed repeatedly. Here is the code I used to install your package attached and sent. If you have a solution, please let me know.

Thank you in advance.

QTLseqr v0.7.5.2

Installation

install.packages("devtools")
library(devtools)

use devtools to install QTL_seqr

devtools::install_github("bmansfeld/QTLseqr")

Error : Failed to install 'unknown package' from Github :

install.packages("remotes")
library(remotes)
remotes::install_github("bmansfeld/QTLseqr")
install_github("bmansfeld/QTLseqr")

remotes library x.

Error massage print

Error : Failed to install 'unknown package' from GitHub:

Line starting 'LinkingTo ...' is malformed!

BiocManager::install("QTLseqr")

why install BiocManager... but Error in library("QTLseqr")...

#REEOR# importFromGATK# Error in file(file, "rt") : cannot open the connection

df <-

  • importFromGATK(
  • file = file,
  • highBulk = HighBulk,
  • lowBulk = LowBulk,
  • chromList = Chroms
  • )
    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") :
    cannot open file 'Yang_et_al_2013.table': No such file or directory

)> #Set sample and file names

HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
file <- "SNPs_from_GATK.table"
#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
Chroms <- paste0(rep("Chr", 12), 1:12)
#Import SNP data from file
df <-

  • importFromGATK(
  • file = file,
  • highBulk = HighBulk,
  • lowBulk = LowBulk,
  • chromList = Chroms
  • )
    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") :
    cannot open file 'SNPs_from_GATK.table': No such file or directory

trouble with plotQTLstats

Hello!

I'm having some issues with the plotQTLStats function. I have used QTLseqr in the past on other datasets and haven't had any issues. I prepared my current dataset using the exact same scripts, however this time plotQTLStats is throwing me an error.

When I input:

f4 <- plotQTLStats(SNPset = G_filt_five_hr, var = "Gprime", plotThreshold = TRUE, q = 0.01)
f4

I get the output:

Error in f(..., self = self) : Breaks and labels are different lengths
In addition: Warning message:
In x/limits[i] :
longer object length is not a multiple of shorter object length

Any idea what may be causing this?

Thanks for the help,
Tanner

Yang2013data

Hi,

I have done NGS-BSA on 8 bulks of samples. SNPs have been called and filtered using vcf tools. The vcf file is now ready to be used for QTLseqr.

Referring to the QTLseqr workflow:

#load the package
library("QTLseqr")

#Set sample and file names
HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
file <- "SNPs_from_GATK.table"
#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
Chroms <- paste0(rep("Chr", 12), 1:12)

#Import SNP data from file
df <- importFromGATK(

    file = file,
    highBulk = HighBulk,
    lowBulk = LowBulk,
    chromList = Chroms

)

How do I convert my vcf file into the "SNPs_from_GATK.table format?

I checked the Yang2013data paper. They only have Table S1-16. How does the raw data look like?

Also, how do I define "HighBulk" and "LowBulk" since I have 8 bulks?

Appreciate your help.

Many thanks.

Considerable difference between "posPeakDeltaSNP" and "posMaxGprime" numbers

Hi Steve,

First of all let me congratulate you for this nice tool, bringing together both statistical methods developed by Takagi et al. and Magwene et al.

After running successfully the QTL analysis, I extracted the QTLs using the getQTLTable() function and, for my major QTL in chr5, I see a considerable difference between the peak positions of deltaSNP and Gprime statistics. This holds true for the minor QTLs as well. I think that, in general, these two values should be quite close, no?

Below I provide the workflow of the R script used and the resulting QTL table:

library("QTLseqr")

file20 <- "SR_filt_perc20_2.table"
hb20 <- "BWA_output/R_all_20perc.bam"
lb20 <- "BWA_output/S_all_20perc.bam"
Chroms <- paste0(rep("chr", 7), 1:7)

df20 <- importFromGATK(file = file20, highBulk = hb20, lowBulk = lb20, chromList = Chroms)

df_filt20 <- filterSNPs(SNPset = df20, minTotalDepth = 10, refAlleleFreq = 0.2, maxTotalDepth = 150, minSampleDepth = 15, minGQ = 30)
`

QTLseq analysis (Takagi et al. (2013))

df_filt20 <- runQTLseqAnalysis(df_filt20,
                                     windowSize = 4e6,
                                     popStruc = "F2",
                                     bulkSize = c(15, 15),
                                     replications = 10000,
                                     intervals = c(95, 99))

G' analysis (Magwene et al. (2011))

df_filt20 <- runGprimeAnalysis(SNPset = df_filt20, windowSize = 4e6, outlierFilter = "deltaSNP", filterThreshold = 0.2)

Plotting

plotQTLStats(SNPset = df_filt20, var = "nSNPs") + ggtitle('nSNPs - perc20_samtools')
plotQTLStats(SNPset = df_filt20, var = "deltaSNP", plotIntervals = TRUE) + ggtitle('deltaSNP - perc20_samtools') + ylim(-0.8,0.8)

plotQTLStats(SNPset = df_filt20, var = "Gprime", plotThreshold = TRUE, q = 0.001) + ggtitle('Gprime - perc20_samtools')

Extract QTLs

results_20 <- getQTLTable(SNPset = df_filt20, method = "Gprime",alpha = 0.001, export = FALSE)
results_20

CHROM qtl start      end   length nSNPs avgSNPs_Mb peakDeltaSNP posPeakDeltaSNP avgDeltaSNP maxGprime posMaxGprime meanGprime    sdGprime     AUCaT
1  chr3   1  3141  7533286  7530145  4234        562    0.1477947            3141   0.1058236  5.367394      4987052    5.13946  0.09072959   6933819
2  chr5   1   691 24469450 24468759 37979       1552   -0.5857466         5981863  -0.2947747 39.274411      8086713   18.43512 11.06228565 430543716
      meanPval     meanQval
1 9.428524e-06 7.545695e-05
2 1.243138e-06 9.720765e-06

Thanks
Kostas

How to deal with the multi-sample ?

I know the QTLseqr is based on BSA-seq,it can calculate the 2 F2 sample's SNP -index. but usually we have 4 samples or more,which contains P1&P2 and 2 X F2 sample.
How to filter out the SNPs that P1 & P2 & F2 contains, Then calculate the SNP -index.
I will be appreciate that you consider my questions.

Help

Hello !

I just have a couple questions about how to edit the generated graphs. is this possible ? I would like to change the background and maybe add some more details to the Axis, my data only shows 0 and 100 in the horizontal axis and would like it to show every 20 in the scale.

In another note, i noticed that when i tried to plot deltaSNP, I don't get any of the confidence intervals some short lines would appear in some regions. along with this i always get a message saying that there were a high amount of rows that had to be removed because they contained missing values. this is the message:

: Removed 55135 row(s) containing missing values (geom_path).

but checking the file for # of NAs i can only find 35699

and my plot showing only the 95 interval and not 99

image

thanks !

Marco

Tag releases?

Can you tag a release? The last one was in 2018 and it seems there have been a number of version number increases since. Having tagged releases aids in reproducibility and incorporation into projects like Bioconda.

importFromTable

Hello!
I am trying to use the QTLsegr package, but I have trouble with the ImportFromTable function.
I have my data in comma delimited file (csv). The head is:

CHROM | POS | id | REF | ALT | AD_REF.LV | AD_ALT.LV | AD_REF.CV | AD_ALT.CV
Chr01 | 22492233 | 1122 | G | A | 9 | 5 | 8 | 4
Chr01 | 22492228 | 1141 | G | A | 10 | 6 | 11 | 3
Chr01 | 22492206 | 1393 | G | A | 11 | 5 | 9 | 3
Chr01 | 22492264 | 1429 | C | T | 11 | 5 | 10 | 3

When I run the function, I get this message:
"Error in importFromTable(file = "slSHQTLseqr.csv", highBulk = "HighBulk", :
No 'CHROM' coloumn found.
Además: Warning message:
The following named parsers don't match the column names: CHROM, POS "

I will be very grateful if you can help me solve the problem!

Thanks in advance!

Graciela Caruso

Handling experimental replicates

Hello again Ben,

I don't have a technical problem this time, but I do have a statistics / theory question for you! Do you have any suggestions on how to handle multiple experimental replicates under a G' statistical framework? For example, say you performed two entire BSA experiments on two distinct segregant populations (derived from the same parents), how would you best leverage your replicate data? My first thought was to simply pool all the reads in the upstream read processing pipeline in order to generate VCF files that contain both experimental replicates, although this seems problematic for several reasons. My next thought was to calculate a raw G statistic for each SNP in both replicate SNPsets independently and then use the mean G value for each SNP during the G' sliding-window calculation, although this doesn't seem like a perfect approach either.

As always, thank you in advance for your help!

Tanner

PS - I took up learning some GGplot2 (thanks again for the advice) and used your QTL-map-plotting source code to figure out how to plot multiple experimental replicates on a single QTL map. Although, the replicates are a little noisy which is part of what is motivating us to consider both replicates in a single statistical framework

chromosomes removed without a reason

Hi,
I'm using QTLseqr and when it comes to the step
df <- importFromGATK(file = rawData, highBulk = aBulk, lowBulk = bBulk, chromList = Chroms)
it shows
Removing the following chromosomes: 1, 2, 3, 4, 5
That's all chromosomes I have! Does anyone have any idea why this is happening?

difference between Upward peaks and Downward peaks ?

Hi,
When run QTLseqr, sometimes value of ∆(SNP-index) was positive, the peak was upward. But sometimes value of ∆(SNP-index) was negative, and the peak was downward.

So, how to interpret negative ∆(SNP-index) ?What is the biological significance?

Thank you very much.

I cann't install QTLseqr. Could you help me?

Dear Ben

I got this error message when I installed the QTLseqr.

devtools::install_github("bmansfeld/QTLseqr")
Error: Failed to install 'unknown package' from GitHub:
invalid multibyte string, element 1

I tried to install QTLseqr in various versions of R; R 3.3, R3.4, R3.5, and R3.6

'Rtools' and 'devtools' were already installed in my computer.

Could you help me?

Sincerely,

HTKim

Problem in G G` statistic calculation: Error in locfit::locfit(Stat ~ locfit::lp(POS, h = windowSize, deg = 0), : fewer than one row in the data

Hi Ben, I have a problem calculating G statics using function runGprimeAnalysis. Here is the copy from my console:

df_filt <- runGprimeAnalysis(df_filt,

  • windowSize = 1e7,
  • outlierFilter = "deltaSNP",
  • filterThreshold = 0.1)
    Counting SNPs in each window...
    Calculating tricube smoothed delta SNP index...
    Calculating G and G' statistics...
    Error in locfit::locfit(Stat ~ locfit::lp(POS, h = windowSize, deg = 0), :
    fewer than one row in the data
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

I am trying to run on the wild population, where F1s are open pollinated to get F2s and bulks are aligned to a de novo reference assembly with ~100000 contigs.

I checked to see if the df is missing any col, but its complete. Attached is the tail of the df.

Screen Shot 2020-01-15 at 7 35 27 AM

generalise `simulateAlleleFreq()` function to any ploidy

In #5 I asked about the bulkSize option in runQTLseqAnalysis() if:

is it the case that this parameter should be the number of genome copies pooled, which for a diploid it would be 2x number of individuals pooled?

The answer is no, bulkSize should be the number of diploid individuals. I see this is right, because of the way the null expectation is being simulated.

I guess there are two levels to the simulation, because there are two levels of sampling:

  • First we sample individuals from the population and calculate what the alternative allele frequency is (simulateAlleleFreq() function).
  • Then we sample some reads in the locus (simulateSNPindex() function).

I hand't noticed but indeed the first level of the simulation is assuming individuals are diploid, because it samples diploid individual genotypes (c(0, 0.5, 1) with probabilities relating to the expected segregation ratios in an F2 (c(1, 2, 1)/4).

But what if one is working with higher ploidy? Then the above simulation would not work.

However, the way it is implement at the moment, I think is equivalent to sampling from a binomial with probability of the event (picking an alternative allele) being 0.5 and number of trials being equal to the number of alleles sampled (2 x number of individuals).

To illustrate with code:

# Define parameters of simulation
nind <- 100  # number of sampled individuals
ploidy <- 2  # assuming diploid
replications <- 1e6  # make a lot of replications also to show how second method is faster

# Current implementation - uses number of sampled individuals
sim1 <- replicate(replications, 
                  mean(sample(c(0, 0.5, 1), size = nind, 
                              prob = c(1, 2, 1)/4, replace = TRUE)))

# Other way - sample alternative alleles with probability 0.5 from n trials given by 
## the number of alleles, which is ploidy * nind
sim2 <- rbinom(replications, ploidy*nind, 0.5)/(ploidy*nind)

# Check they are equivalent
ggplot(data.frame(sim1, sim2)) +
  geom_density(aes(sim1)) +
  geom_density(aes(sim2), colour = "pink", linetype = "dashed")

I guess the advantage is that this is general, regardless of the ploidy (besides the bonus of being faster).

I think the RIL implementation is already general as it is, because in that case we assume the individuals are fully homozygous, in which case they are equivalent to "haploid" organisms. In any case, I think the implementation can be also be made faster, by sampling from a binomial:

# Current implementation
ril1 <- replicate(replications, 
                  mean(sample(
                    x = c(0, 1),
                    size = nind,
                    prob = c(1, 1),
                    replace = TRUE
                  )))


# Other way - null expectation is also allele frequency of 0.5, except now the number of trials
## is the number of individuals (because we can look at them as equivalent to haploid organisms)
ril2 <- rbinom(replications, nind, 0.5)/(nind)

# Check they are equivalent
ggplot(data.frame(ril1, ril2)) +
  geom_density(aes(ril1)) +
  geom_density(aes(ril2), colour = "pink", linetype = "dashed")

@bmansfeld please do check all of this, as I might be making some wrong assumption somewhere (I should also probably go and read the Tagaki paper in more detail! 😄)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.