Coder Social home page Coder Social logo

msmc's Introduction

The multiple sequentially Markovian coalescent (MSMC)

This software implements MSMC, a method to infer population size and gene flow from multiple genome sequences (Schiffels and Durbin, 2014, Nature Genetics, or Preprint).

In short, msmc can infer

  • the scaled population size of a single population as a function of time
  • the timing and nature of population separations between two populations

from multiple phased haplotypes. When only two haplotypes are given, MSMC is similar to PSMC, and we call it PSMC' because of subtle differences in the method and the underlying model, which allows PSMC' to infer more accurately the recombination rate.

Installation and Requirements

Precompiled versions for Mac and Linux (both 64 bit) can be downloaded on the "Releases" tab within this github repository.

To build MSMC yourself, the GNU Scientific Library (GSL) must be installed on your system.

To build the program, have a look at the two Makefiles. Adjust the path to the GSL and eventually run make -f Makefile.linux or make -f Makefile.mac, respectively. The program is written in the D programming language. The reference compiler from Digitalmars can be downloaded here.

For generating the input files using my scripts, you need Python 3.4. I am sorry for this cutting edge dependency, I may make things compatible with Python 3.2 soon, but at the moment apparently my scripts won't work unless you use python 3.4.

Getting Help

A general guide can be found here, and a more complete tutorial (although for MSMC2) here.

To get help, please submit an issue, or email me directly

MSMC2

Note that there exists also MSMC2, a newer method which should be generally preferred over MSMC.

msmc's People

Contributors

creeki avatar songsy avatar stschiff avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

msmc's Issues

errors with installing from source

I am installing from source on a Linux cluster and keep running into errors.

I installed DMD as advised, changed dmd.conf to reflect location of phobos libraries, and changed the makefile to reflection location of GSL libraries.

The error I get is:

gcc build/msmc.o -o build/msmc -m64 -L/mnt/lustre/home/sonal.singhal1/bin/dmd2/linux/lib64  -Xlinker --export-dynamic /mnt/lustre/home/sonal.singhal1/bin/lib/libgsl.a /mnt/lustre/home/sonal.singhal1/bin/lib/libgslcblas.a -l:libphobos2.a -lpthread -lm -lrt
/usr/bin/ld: cannot find -l:libphobos2.a
collect2: ld returned 1 exit status
--- errorlevel 1

Any thoughts?

I would use the pre-compiled binaries, but they link to much newer versions of header files than what I have...

Best,
sonal

decode time units

Hi Stephan,

I am running the decode program and I was having trouble interpreting the times.
Here is how I run it:

./decode -m 0.0005 -r 0.0004 msmc_input/${SAMPLE}_chr${CHROM}.msmc

Here is the first few lines of output:

#time_boundaries:       0       0.0253178       0.0512933       0.0779615       0.105361        0.133531        0.162519        0.192372        0.223144        0.254892        0.287682        0.321584        0.356675        0.393043        0.430783        0.470004        0.510826        0.553385        0.597837        0.644357        0.693147        0.74444 0.798508        0.855666        0.916291        0.980829        1.04982 1.12393 1.20397 1.29098 1.38629 1.49165 1.60944 1.74297 1.89712 2.07944 2.30259 2.59027 2.99573 3.68888 
1902    0.911539        0.0679054       0.00602025      0.00247097      0.00173682      0.00135048      0.00109878      0.000920002     0.000785785     0.000680943     0.000596553     0.000527016     0.000468629     0.000418846     0.000375854     0.000338324     0.000305261     0.000275906     0.000249664     0.000226068     0.000204741     0.000185379     0.000167731     0.00015159      0.000136783     0.000123162     0.000110605     9.9004e-05
      8.82681e-05     7.83175e-05     6.90825e-05     6.05015e-05     5.25191e-05     4.50846e-05     3.81496e-05     3.16647e-05     2.55725e-05     1.97895e-05     1.41384e-05     7.43505e-06     
2902    0.915115        0.0684846       0.00585743      0.00225607      0.00152013      0.00113692      0.000889969     0.000716868     0.000588942     0.000490812     0.000413428     0.000351098     0.000300054     0.000257694     0.000222162     0.000192096     0.000166471     0.000144503     0.000125577     0.000109208     9.50032e-05     8.26449e-05     7.18707e-05     6.24628e-05     5.42383e-05     4.70425e-05     4.07436e-05     3.52282e-05
     3.03982e-05     2.61683e-05     2.24632e-05     1.92165e-05     1.63685e-05     1.38651e-05     1.16558e-05     9.69212e-06     7.92387e-06     6.29146e-06     4.69901e-06     2.68994e-06  

Should I scale the times using the 0.0005 mutation rate? i.e. t/0.0005 to get generations? Or is there a different conversion I should do? Using the genome wide mutation rate of 1.25e-8 gives me coalescent times that are too old to be plausible.

Thanks!
Arun

question on analysis with whole genome data

Hi Stephan,
I am applying MSMC my dataset. I want to analysis all of the autosomes. I have already generated the vcf.gz file using:
samtools mpileup -q 20 -Q 20 -C 50 -u -r <chr> -f <ref.fa> <bam> | bcftools call -c -V indels | bgzip -c > chr.vcf.gz
In order to work on all of those chromosomes for an individual in a single run, can I concat all of those autosomes together directly using bcftools concat directly preceding the analysis. And after that, I need to run ./bamCaller.py <mean_cov> <out_mask.bed.gz> right? Is it right to set the mean_cov as the average depth of whole autosomes?

Thanks

trying to replicate Fig2 of Schiffels & Durbin (2014)

Hey there-
I'm trying to replicate Figure 2 from your original paper using the zig-zag simulation that you specify in the supplement of that paper. The particular "problem" that I'm getting is that population size estimates from recent time look way too large. I imagine I'm not performing the inference procedure correctly, but here are my simple steps:

#create simulation data, convert to msmc input using msmc-tools
ms 4 1 -t 7156.0000000 -r 2000.0000 10000000 -eN 0 5 -eG 0.000582262 1318.18 -eG 0.00232905 -329.546 -eG 0.00931619 82.3865 -eG 0.0372648 -20.5966 -eG 0.149059 5.14916 -eN 0.596236 0.5 -T | ./ms2multihetsep.py 1 10000000 > test.txt

#run HMM for 200 iterations
msmc --fixedRecombination -o my_msmc_output msmc-tools/test.txt -i 200

I then have plotted the results using the step plot tools from plot_utils.py and get the following result
Figure_1

Clearly I'm doing something wrong here as the most recent population size estimates are orders of magnitude too large. Could you point me in the right direction?
thanks,
Andy

memory consumption of MSMC

I was running MSMC on 4 genomes (8 haplotypes) from the same human population. I was particularly interested the demographic history in the last 30,000 years. After converting the data into appropriate input file of MSMC, it is around 300 MB.

The issue is that MSMC is consuming too much memory now. On our 256GB server, it rapidly took all the memory and got terminated by the system daemon. If I instead run on chr22 only (4.1MB input file), it runs fine but took 4 GB RAM.

My question is that, is there a way I can reduce the memory usage and still get a decent resolution on demographic history in the last 30,000 years? If not, would you recommend reducing the number of chromosomes or the number of haplotypes included in the MSMC run? I.e., which option compromises the accuracy of the results least?

Thanks very much for your help,
Hao Hu

Segmentation Falt on OSX (Mac)

Why am I getting a seg fault?

On this basic data file:

1	530673	530672	AAAG
1	2621645	2090972	AAAG
1	19804316	17182671	AAAG
1	22466822	2662506	GAGA
1	22915237	448415	GAGA
1	23048515	133278	GGGA
1	24445215	1396700	AAAG
1	28004741	3559526	AAAG
1	29001118	996377	GGGA
1	31071573	2070455	AAAG
1	34816438	3744865	GAGA
1	35314840	498402	AAAG
1	38538312	3223472	AAAG
1	43088121	4549809	AAAG
1	43796385	708264	GAGA
1	47664747	3868362	AGAA
1	51461120	3796373	AGAA
1	56143120	4682000	GGGA
1	63973759	7830639	GGGA
1	72298048	8324289	AAAG
1	75398992	3100944	AGAA
1	75903941	504949	GAGA
1	80413141	4509200	AAAG
1	83284762	2871621	GAGA
1	83522772	238010	AAAG
1	89044382	5521610	AGAA
1	93987208	4942826	GAGA
1	94955877	968669	AAAG
1	95645609	689732	AAAG

I get a seg fault:

> msmc/build/msmc --fixedRecombination -o t msmc.data 
read 29 SNPs from file msmc.data
estimating scaled mutation rate: 8.3153e-08
Version:             1.0.1
input files:         ["msmc.data"]
maxIterations:       20
mutationRate:        8.3153e-08
recombinationRate:   2.07882e-08
subpopLabels:        [0, 0, 0, 0]
timeSegmentPattern:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads:           2
nrTtotSegments:      40
verbose:             false
outFilePrefix:       t
naiveImplementation: false
hmmStrideWidth:      1000
fixedPopSize:        false
fixedRecombination:  true
initialLambdaVec:    []
directedEmissions:   false
skipAmbiguous:       false
indices:             [0, 1, 2, 3]
logging information written to t.log
loop information written to t.loop.txt
final results written to t.final.txt
[1/1] estimating total branchlengthsSegmentation fault: 11

How to install MSMC

How to install MSMC? In fact, I can't read the installation instructions. I'm a novice.
Anyone can help????
Thanks!!!
I have already installed the GSL, but what's the meaning of "Adjust the path to the GSL"?
And where is Makefile.linux? Is it in msmc or GSL?
I find Makefile both in msmc and GSL.

Can MSMC deal unphased haplotypes?

Hi,
In README.md, msmc can infer xxxxx from multiple "phased haplotypes", but I change my vcf file from GATK to MSMC and that is not "phased", can I use MSMC to determine the population size?
Thanks!

another Segmentation Fault

Hey Stephan, I've been bugging you a little bit, my data are already haploid, and had to make a script to generate an input file. However, after a few attempts I'm still stuck on the segmentation fault.

sample input data:
LT635612.1 6507 6112 CCCT
LT635612.1 7080 573 CACA
LT635612.1 7912 832 GGAG
LT635612.1 17144 9143 TTAA
LT635612.1 20965 3791 CCAA
LT635612.1 21774 809 TTCC
LT635612.1 23125 1351 CAAC
LT635612.1 32902 9464 TCCC
LT635612.1 35983 3081 GTGT
LT635612.1 41100 4982 AGGG

~/betterSoftware/msmc/build/msmc -o test LT*
read 401 SNPs from file LT635612.1.out
read 396 SNPs from file LT635613.1.out
read 419 SNPs from file LT635614.1.out
read 475 SNPs from file LT635615.1.out
read 750 SNPs from file LT635616.1.out
read 471 SNPs from file LT635617.1.out
read 659 SNPs from file LT635618.1.out
read 741 SNPs from file LT635619.1.out
read 982 SNPs from file LT635620.1.out
read 747 SNPs from file LT635621.1.out
read 896 SNPs from file LT635622.1.out
read 1235 SNPs from file LT635623.1.out
read 917 SNPs from file LT635624.1.out
read 1291 SNPs from file LT635625.1.out
estimating scaled mutation rate: 9.24807e-05
Version: 1.0.1
input files: ["LT635612.1.out", "LT635613.1.out", "LT635614.1.out", "LT635615.1.out", "LT635616.1.out", "LT635617.1.out", "LT635618.1.out", "LT635619.1.out", "LT635620.1.out", "LT635621.1.out", "LT635622.1.out", "LT635623.1.out", "LT635624.1.out", "LT635625.1.out"]
maxIterations: 20
mutationRate: 9.24807e-05
recombinationRate: 2.31202e-05
subpopLabels: [0, 0, 0, 0]
timeSegmentPattern: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads: 16
nrTtotSegments: 40
verbose: false
outFilePrefix: test
naiveImplementation: false
hmmStrideWidth: 1000
fixedPopSize: false
fixedRecombination: false
initialLambdaVec: []
directedEmissions: false
skipAmbiguous: false
indices: [0, 1, 2, 3]
logging information written to test.log
loop information written to test.loop.txt
final results written to test.final.txt
[11/14] estimating total branchlengthsSegmentation fault (core dumped)

I have also been less stringent on my snp calling (I get about 36K instead of 7K) and gotten the same segmentation fault. Any thoughts? Maybe faulty input files?

Mike

Segmentation fault for running barley sample

Hi, Stephan.
I am running msmc for half of the chromosome 1 for barley sample, but got "Segmentation fault".
Since Barley is the selfing species, I merged two bam files from different populations into single one and call SNPs with bamCaller.py, and used generate_multihetsep.py to create the input file for msmc.

The command I used to run msmc is:

/panfs/roc/groups/9/morrellp/llei/software/msmc-1.0.0/build/msmc --timeSegmentPattern=5*4+25*2+5*4 -o /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_input.txt

The error file looks like this:

read 623233 SNPs from file /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_input.txt
estimating scaled mutation rate: 0.00282695
input files:         ["/panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_input.txt"]
maxIterations:       20
mutationRate:        0.00282695
recombinationRate:   0.000706738
subpopLabels:        [0, 0]
timeSegmentPattern:  [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4]
nrThreads:           24
nrTtotSegments:      90
verbose:             false
outFilePrefix:       /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt
naiveImplementation: false
hmmStrideWidth:      1000
fixedPopSize:        false
fixedRecombination:  false
initialLambdaVec:    []
directedEmissions:   false
skipAmbiguous:       false
indices:             [0, 1]
logging information written to /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt.log
loop information written to /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt.loop.txt
final results written to /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt.final.txt
[1/20] Baumwelch iteration
^M  * [1/1] Expectation Step/var/spool/torque/mom_priv/jobs/3836766.mesabim3.msi.umn.edu.SC: line 1: 10951 Segmentation fault      /panfs/roc/groups/9/morrellp/llei/software/msmc-1.0.0/build/msmc --timeSegmentPattern=5*4+25*2+5*4 -o /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_output.txt /panfs/roc/groups/9/morrellp/llei/Inversion/msmc/B1K_WBDC355_msmc_input.txt

Do you have any idea what the problem is?

Thanks.
Best,
Li

Deal unphased vcf

Hi,
I have two populations and each include many individual, I do not know the relationship of each individual in corresponding population, I had called SNP using GATK, but I can not phase the vcf. Can I generate multihetsep files to fed to MSMC (just determine the population size) like this:
./generate_multihetsep.py
--mask=22_chr1_mask.bed.gz (population 1)
--mask=23_chr1_mask.bed.gz (population 1)
--mask=24_chr1_mask.bed.gz (population 1)
--mask=25_chr1_mask.bed.gz (population 1)
--mask=1_chr1_mask.bed.gz (population 2)
--mask=2_chr1_mask.bed.gz (population 2)
--mask=3_chr1_mask.bed.gz (population 2)
--mask=4_chr1_mask.bed.gz (population 2)
--mask=chr1.mask.bed.gz
22_chr1.vcf.gz
23_chr1.vcf.gz
24_chr1.vcf.gz
25_chr1.vcf.gz
1_chr1.vcf.gz
2_chr1.vcf.gz
3_chr1.vcf.gz
4_chr1.vcf.gz >multihetsep.txt,
or any idea?
Thank you!

Unexpected 'g' when converting from type char[] to type ulong

Hi,

I'm trying to run MSMC for the first time. I generated an input file from a bam based on the msmc-tools, but when I run my input I get an error (if I just run MSMC without the input, the program outputs the options as normal). This has happened on multiple machines, with different OS (Ubuntu and CentOS). I haven't found a description of this error in the forum, does anyone have any idea what could be causing this?

The input file looks like this, it's from a single genome (I'm only giving output folder and the input file to the MSMC command):

chr1	3695	91	TC
chr1	9921	10	TC
chr1	13192	382	CT
chr1	13197	5	GA
chr1	21043	3397	TC
chr1	23124	1844	CT
chr1	23733	609	CA
chr1	24820	504	GA
chr1	25394	574	GA
chr1	31690	967	TC
chr1	31823	132	AG
chr1	32364	541	GA
chr1	33041	545	TA
chr1	33479	380	CG

This is the error I get when running the precompiled binary:

error in parsing command line: std.conv.ConvException@/projects1/clusterhomes/schiffels/software/dmd2/src/phobos/std/conv.d(1720): Unexpected 'g' when converting from type char[] to type ulong
----------------
??:? pure @safe ulong std.conv.toImpl!(ulong, char[]).toImpl(char[]) [0x53c0d3]
??:? pure @safe ulong std.conv.to!(ulong).to!(char[]).to(char[]) [0x53bd5f]
??:? model.data.SegSite_t[] model.data.readSegSites(immutable(char)[], bool, ulong[], bool) [0x55258e]
??:? model.data.SegSite_t[][] msmc.readDataFromFiles(immutable(char)[][], bool, ulong[], bool) [0x597b8c]
??:? void msmc.parseCommandLine(immutable(char)[][]) [0x596133]
??:? _Dmain [0x595b43]
??:? _D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv [0x5f0386]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x5f02dc]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() [0x5f0342]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x5f02dc]
??:? _d_run_main [0x5f0239]
??:? main [0x5a16d5]
??:? __libc_start_main [0x70a7cbf6]

When I try building MSMC from source, I get a similar error:

error in parsing command line: std.conv.ConvException@/usr/include/dmd/phobos/std/conv.d(1991): Unexpected 'g' when converting from type char[] to type ulong
----------------
??:? pure @safe ulong std.conv.toImpl!(ulong, char[]).toImpl(char[]) [0x5645f3880002]
??:? pure @safe ulong std.conv.to!(ulong).to!(char[]).to(char[]) [0x5645f387fc13]
??:? model.data.SegSite_t[] model.data.readSegSites(immutable(char)[], bool, ulong[], bool) [0x5645f38ced31]
??:? model.data.SegSite_t[][] msmc.readDataFromFiles(immutable(char)[][], bool, ulong[], bool) [0x5645f38f7982]
??:? void msmc.parseCommandLine(immutable(char)[][]) [0x5645f38f56f2]
??:? _Dmain [0x5645f38f4f47]

Doesn't multithread

I'm running with following command line:

msmc -r 0.67 -t 20 -R -o 100g 100g.msmcin

According to the documentation, that should launch 20 threads, but according to the "top" linux command, msmc is only using 1.

Questions of using MSMC decode on multiple haplotypes

Hello Stephan,

Thank you for writing MSMC! It helps me to solve some important problems of my project. After using the default function of MSMC,

I am trying to use the decode function to estimate tMRCA states for my samples. However, when I use a file that contains four haplotypes as input for MSMC decode, it just output 40 columns instead of 240 columns as described in the manual. I used this file as input for two other functions of MSMC and it worked perfectly.

Here is a snapshot of my input file:
chrXV 711 293 TCCC
chrXV 1950 1063 TCCC
chrXV 2333 215 AGGG
chrXV 2410 31 CGCC
chrXV 2448 38 TCTT
chrXV 2838 267 TATT
chrXV 2863 25 TGTT
chrXV 3057 189 GAAA
chrXV 3159 96 CCTT
chrXV 3245 86 GAGG

Here is the command I used:
local/msmc/build/decode -m 0.0015 -r 0.0012 PAXB07_LITC_UP_356_chrXV.msmc.txt > PAXB07_LITC_UP_356_chrXV.tMRCA.decode.txt

Do I need to specify in the command line there are four haplotypes in the input file? I read the help for decode function, it seems the script would automatically recognize how many haplotypes in the input. Thank you for your help.

Best Regards,
Muhua

msmc can't handle empty .conf files

msmc crashes in the iteration step if any of the .conf files it is fed with are empty. This can be somewhat confusing since generate_multihetsep.py produces empty conf files without issuing any warnings.

running msmc without final result

hi,Stephan, i just started to use msmc . installed the software v1.1.0, found and downloaded the test data from msmc2. but i have been troubled with some problems with simple command at the begining. My questions are as follows:

(1) without final result

cmd:

~/msmc-master/build/msmc -R -o EUR_AFR_msmc_output EUR_AFR.chr1.multihetsep.txt

log:
read 589382 SNPs from file EUR_AFR.chr1.multihetsep.txt
estimating scaled mutation rate: 0.000451928
Version: 1.1.0
input files: ["EUR_AFR.chr1.multihetsep.txt"]
maxIterations: 20
mutationRate: 0.000451928
recombinationRate: 0.000112982
subpopLabels: [0, 0, 0, 0, 0, 0, 0, 0]
timeSegmentPattern: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads: 80
nrTtotSegments: 40
verbose: false
outFilePrefix: EUR_AFR_msmc_output
naiveImplementation: false
hmmStrideWidth: 1000
fixedPopSize: false
fixedRecombination: true
unboundedCrossCoal: false
initialLambdaVec: []
loBoundLambda: 0
hiBoundLambda: inf
directedEmissions: false
skipAmbiguous: false
indices: [0, 1, 2, 3, 4, 5, 6, 7]
logging information written to EUR_AFR_msmc_output.log
loop information written to EUR_AFR_msmc_output.loop.txt
final results written to EUR_AFR_msmc_output.final.txt
[1/1] estimating total branchlengths

but there is no EUR_AFR_msmc_output.final.txt file and the EUR_AFR_msmc_output.loop.txt is blank

I test my own data and come across the same problem

(2) OutOfMemoryError

i found the defaults of nrTreads was too large and there is one chr(maybe small numbers of SNPs lead to the previous problem),I executed the command as follows :

~/msmc-master/build/msmc -p 10*1 -o EUR_AFR_msmc_output EUR_AFR.chr1.multihetsep.txt -t 2 

log:
read 589382 SNPs from file EUR_AFR.chr1.multihetsep.txt
estimating scaled mutation rate: 0.000451928
Version: 1.1.0
input files: ["EUR_AFR.chr1.multihetsep.txt"]
maxIterations: 20
mutationRate: 0.000451928
recombinationRate: 0.000112982
subpopLabels: [0, 0, 0, 0, 0, 0, 0, 0]
timeSegmentPattern: [9220752, 0, 169013428, 1, 140234331517568, 0, 0, 0, 9220752, 0]
nrThreads: 2
nrTtotSegments: 140234518972501
verbose: false
outFilePrefix: EUR_AFR_msmc_output
naiveImplementation: false
hmmStrideWidth: 1000
fixedPopSize: false
fixedRecombination: false
unboundedCrossCoal: false
initialLambdaVec: []
loBoundLambda: 0
hiBoundLambda: 10
directedEmissions: false
skipAmbiguous: false
indices: [0, 1, 2, 3, 4, 5, 6, 7]
logging information written to EUR_AFR_msmc_output.log
loop information written to EUR_AFR_msmc_output.loop.txt
final results written to EUR_AFR_msmc_output.final.txt
core.exception.OutOfMemoryError@src/core/exception.d(700): Memory allocation failed

I also changed my parameters several times, but all return the same erro OutOfMemoryError. Could you give me some suggestions?

Thank you in advance.

Segfault on your test file in the guide

I ran the small 16-snp example file from your guide, and get this on OSX:

msmc -t 2 msmc_test.txt 
read 16 SNPs from file msmc_test.txt
estimating scaled mutation rate: 0.00019315
Version:             1.0.1
input files:         ["msmc_test.txt"]
maxIterations:       20
mutationRate:        0.00019315
recombinationRate:   4.82874e-05
subpopLabels:        [0, 0, 0, 0]
timeSegmentPattern:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads:           2
nrTtotSegments:      40
verbose:             false
outFilePrefix:       
naiveImplementation: false
hmmStrideWidth:      1000
fixedPopSize:        false
fixedRecombination:  false
initialLambdaVec:    []
directedEmissions:   false
skipAmbiguous:       false
indices:             [0, 1, 2, 3]
logging information written to .log
loop information written to .loop.txt
final results written to .final.txt
[1/1] estimating total branchlengthsSegmentation fault: 11

mappability mask file

Dear author,

I am starting to prepare the infiles for the analyses of MSMC and I have encountered a difficulty about the mappability_mask file. I am following the indications in Heng's Li SNPable regions web to generate my mask_35_50.fa file (as a starting point I am using the same parameters as Li). Otherwise I cannot see how to transform this mask_35_50.fa file into your mappability_mask.chromosome.bed.txt.gz file. I wanted to have a look to the maks you made for humans in your ftp site as you indicate ("For humans (GRCh37), you can download these masks from my ftp site, see the readme for msmc") but I do not see where in the readme it is indicated how to access to this data
Could you give me a hint?

Thanks a lot in advance. Best,

Begona

Segmentation fault

Hi, stschiff
When I prepare the input file for MSMC and run. it occurs Segmentation fault. Need I recompile on my computer ?

read 1417 SNPs from file pre_input/preinput
estimating scaled mutation rate: 0.00287284
input files: ["pre_input/preinput"]
maxIterations: 20
mutationRate: 0.00287284
recombinationRate: 0.00071821
subpopLabels: [0, 0, 0, 0]
timeSegmentPattern: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads: 64
nrTtotSegments: 40
verbose: false
outFilePrefix: my_msmc_output
naiveImplementation: false
hmmStrideWidth: 1000
fixedPopSize: false
fixedRecombination: false
initialLambdaVec: []
directedEmissions: false
skipAmbiguous: false
indices: [0, 1, 2, 3]
logging information written to my_msmc_output.log
loop information written to my_msmc_output.loop.txt
final results written to my_msmc_output.final.txt

[1/1] estimating total branchlengthsSegmentation fault

resequencing data of non-model diploid organism

I am working on the domestication of economic animal ,and resequencing lots of samples with low depth(about 5x).I'm wondering whether it is working when I use the function of "Analyzing population separations" using the snp information from such low depth data.
I'm planning to use 2 species data with 5x coverage each to generate inputfile.As I didn't see any information about the depth ,can you give me some suggestions on these?

Thanks!!

Baumwelch iteration Maximization Step cannot complete after 200 rounds causing msmc exits with error

Hi,

Can anyone help me with this issue?

I am running MSMC on chr1 sequencing data of 4 individuals (two individual from one population). Therefore, I have eight haploid sequences in total. I prepared the input file following the pipeline provided by msmc paper.

The command I used to run is:
msmc_linux_64bit --fixedRecombination --skipAmbiguous -P 0,0,0,0,1,1,1,1 -o output input

However, the program exits without giving me a output.final.txt. I checked the log file and found that the Baumwelch iteration' Maximization Step cannot complete after 200 rounds. The log file ends with:

^M * [186/200(max)] Maximization Step^M * [187/200(max)] Maximization Step^M * [188/200(max)] Maximization Step^M * [189/200(max)] Maximization Step^M * [190/200(max)] Maximization Step^M * [191/200(max)] Maximization Step^M * [192/200(max)] Maximization Step^M^M * [193/200(max)] Maximization Step^M * [194/200(max)] Maximization Step^M * [195/200(max)] Maximization Step^M * [196/200(max)] Maximization Step^M * [197/200(max)] Maximization Step^M * [198/200(max)] Maximization Step^M * [199/200(max)] Maximization Step^M^M * [200/200(max)] Maximization Step

Can any one suggest me how to solve this issue?

Thank you very much.
Jin

Won't compile on 32-bit linux.

Yesterday, on a 64-bit mac, I pulled msmc from github and compiled with no difficulty. Also works on a 64-bit linux box. But I can't compile on a 32-bit linux box. Details are below, I'd be grateful for any suggestions.

Thanks, Alan Rogers

msmc release on github: c50cfbe

I edited Makefile as instructed, to reflect the location of my libraries. But this can't be the problem because I'm getting a compilation error, not a linkage error.

dmd version: DMD32 D Compiler v2.067.1

Beginning of error messages:

msmc.d(173): Error: cannot implicitly convert expression (reduce(timeSegmentPattern)) of type ulong to uint
msmc.d(272): Error: function maximization_step.getMaximization (double[] eVec, double[][] eMat, MSMCmodel params, const(uint[]) timeSegmentPattern, bool fixedPopSize, bool fixedRecombination) is not callable using argument types (double[], double[][], MSMCmodel, ulong[], bool, bool)
branchlength.d(129): Error: cannot implicitly convert expression (pos) of type ulong to uint
branchlength.d(154): Error: cannot append type Tuple!(ulong, double) to type Tuple!(uint, double)[]

Segmentation fault with large number of scaffolds

Hi Stephan,

I am running MSMC on a mammal with scaffold-level assembly. Unfortunately, the N50 is quite low and I have ~3500 scaffolds > 100kb, that I want to base my analyses on. On subsets of the data MSMC runs nicely, but on the whole dataset it crashes with Segmentation Fault during one of the first iterations after computing ~1000 scaffolds. This reflects about 1.5 Gb, so I don't think it is the amount of data itself. MSMC was run with default settings and 2 unphased haplotypes.

Many thanks for any help!

Best,
Fritjof

error in tools/generate_multihetsep.py

I got an error in tools/generate_multihetsep.py:
[cxue@sug-esxa-login1 msmc]$ python3 msmc-master/tools/generate_multihetsep.py --mask vcfs/9D4.outmask.2.txt.gz vcfs/final.9D4.2.vcf.gz
generating msmc input file with 2 haplotypes
adding mask: vcfs/9D4.outmask.2.txt.gz
Traceback (most recent call last):
File "msmc-master/tools/generate_multihetsep.py", line 159, in
maskIterators.append(MaskIterator(f))
File "msmc-master/tools/generate_multihetsep.py", line 19, in init
self.readLine()
File "msmc-master/tools/generate_multihetsep.py", line 23, in readLine
line = next(self.file)
File "/hgsc_software/python/python-3.4.0/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Because mask file is a binary file, I don't know how to go on. Thanks.

msmc error

Hi,
@stschiff
I use the msmc2 infer population size history and meet an error.

The command line:
msmc2 -t 5 --fixedRecombination -o msmc_result KB317697.1.multihetsep.txt KB317698.1.multihetsep.txt KB317699.1.multihetsep.txt Scaffold_1.multihetsep.txt

error in parsing command line: object.Exception@model/data.d(57): Enforcement failed

??:? pure @safe void std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, const(char[])) [0x57bb9c]
??:? @safe std.regex.__T10RegexMatchTAxaS453std5regex8internal8thompson15ThompsonMatcherZ.RegexMatch std.exception.enforce!(Exception, std.regex.__T10RegexMatchTAxaS453std5regex8internal8thompson15ThompsonMatcherZ.RegexMatch).enforce(std.regex.__T10RegexMatchTAxaS453std5regex8internal8thompson15ThompsonMatcherZ.RegexMatch, lazy const(char)[], immutable(char)[], ulong) [0x59a7c2]
??:? void model.data.checkDataLine(const(char[])) [0x56a9a1]
??:? ulong model.data.getNrHaplotypesFromFile(immutable(char)[]) [0x56aa92]
??:? void msmc2.parseCommandLine(immutable(char)[][]) [0x5c78a9]
??:? _Dmain [0x5c73bb]
??:? _D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv [0x613553]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])).tryExec(scope void delegate()) [0x61347b]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])
).runAll() [0x6134f8]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x61347b]
??:? _d_run_main [0x6133e7]
??:? main [0x5d4bb3]
??:? __libc_start_main [0x3661ed1c]

the first 5 lines of KB317697.1.multihetsep.txt :
KB317698.1 52355 874 CT
KB317698.1 115578 608 GT
KB317698.1 159001 311 GC
KB317698.1 268192 1276 TA
KB317698.1 468419 627 CT

Hope to get your reply.

MSMC running with the divergent individuals

Hi Stephan,
I have three individuals from one species, but there divergence time is around 1~2 million years ago (based on the results of MCMCtree/BEAST inferring with the fossil calibration).
My MSMC result showed a very large Ne in the recent 20~50 kya, is it real? Or just a confused signal by the different population structure? Is there are some methods to test?
Thanks.

MSMC error

Dear @stschiff
I'm new for msmc2, and when i'm trying i meet an error


Haplotype index exceeds number of haplotypes in datafile

and also:

error in parsing command line: object.Exception@model/data.d(189): chromosomes must all be the same within one file (sorry)
can one file only contain data of one scaffold/chr?

Thanks

Abnormal results of MSMC, and need help ~

Hi Stephan,
In my MSMC analysis, The genome data sets of one individal (means two phased haplotype genomes) from each population was picked out, and i did the relative CCR analysis (population split) between each population pair. But in my results i found, for some population/species pairs, the RCCR value can't be zreo even at very recent time point. Is my results normal? and what does this pattern mean? Thanks in advance !
image

different results between MSMC and MSMC2

Dear @stschiff
I ran MSMC to infer the population split history using MSMC. But the results are strange, all CCR did not reach to one. So I changed to MSMC2, but the split times are different (the same color represents the same comparsion in the two plots). The domestication of studied animals are occurred about 10,000 years ago. From admixture analysis, Many populations are mixed populations, I do not know how the admixture event affect the result and which result is more credible?

msmc

Best Wishes
Zhuqing

precompiled binaries not working on macOS Catalina

When I try running the precompiled binaries on Catalina, I get:

dyld: lazy symbol binding failed: Symbol not found: _dyld_enumerate_tlv_storage
Referenced from: /Users/jsdenton/Downloads/./msmc_1.1.0
Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _dyld_enumerate_tlv_storage
Referenced from: /Users/jsdenton/Downloads/./msmc_1.1.0
Expected in: /usr/lib/libSystem.B.dylib

Abort trap: 6

A similar issue happens with the precompiled binaries for msmc2:

dyld: lazy symbol binding failed: Symbol not found: _dyld_enumerate_tlv_storage
Referenced from: /Users/jsdenton/Downloads/./msmc2_macos64bit
Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: _dyld_enumerate_tlv_storage
Referenced from: /Users/jsdenton/Downloads/./msmc2_macos64bit
Expected in: /usr/lib/libSystem.B.dylib

Abort trap: 6

How to interpret the abormal results using two haplotypes from each population when running MSMC?

Dear @stschiff

According to your previous suggestion, MSMC will have a better performance on population split history than MSMC2. I ran MSMC analysis of two genomes (four haploid genomes) selected from each of the two populations with deeply sequenced and found all my rCCR never reach 1 (Figure 1). When I just ran analysis of one genome (two haploid genomes) for each population, although the rCCR reached to 1, the rCCR drops from 0.5 to 0 instantaneously (Figure 2). I have never seen this phenomenon in other studies, hope you can give us some advice. Thanks for your time.

This analyses only includes the relatively "pure" and deeply individuals (according to ADMIXTURE results), is this right?
image
Figure 1 four haplotypes from each population
image
Figure 2, two haplotypes from each population

Best
Zhuqing

"The sequence "chr1" was not found"

Hello,

I'm trying to follow the msmc tutorial to prepare my data for msmc2. However, I am receiving the following error:
"[E::faidx_adjust_position] The sequence "chr1" was not found"

I am using the following command:

samtools mpileup -q 20 -Q 20 -C 50 -u -r chr1 -f ragtag.scaffolds.fasta s2800_chr.bam | bcftools call -c -V indels | ~/Swallows/msmc2/msmc-tools/bamCaller.py 12 out_mask.bed.gz | gzip -c > out.s2800.vcf.gz

I get the following message:
[warning] samtools mpileup option u is functional, but deprecated. Please switch to using bcftools mpileup in future.
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[mpileup] 1 samples in 1 input files

Then the following repeats over and over again:
"[E::faidx_adjust_position] The sequence "chr1" was not found"

If you could please help me get this running I'd greatly appreciate it!

msmc is too restrictive WRT chromosome IDs

The regexp in line 118 in model/data.d restricts chromosome IDs to regular words (\w+). Many file formats, however, allow essentially any non-whitespace character in chromosome or scaffold IDs. A simple change to \S+ should do the trick.

MSMC2- install error

I want to install MSMC2. I have successfuly installed GSL ( make, make check and make install). However when I execute
"make " in msmc2-2.1.3, It reports the foolowing errors:
dmd -g -O $(pwd)/gsl-2.6/lib/libgsl.a $(pwd)/gsl-2.6/lib/libgslcblas.a -odbuild/test -ofbuild/release/msmc2 model/data.d model/gsl_matrix_vector.d model/propagation_core.d model/psmc_hmm.d model/psmc_model.d model/stateVec.d model/stateVecAllocator.d model/time_intervals.d model/transition_rate.d powell.d brent.d maximization_step.d expectation_step.d msmc2.d logger.d
$(pwd)/gsl-2.6/lib/libgsl.a(lt86-oper.o): In function gsl_vector_complex_long_double_div': oper.c:(.text+0x27a): undefined reference to hypot'
$(pwd)/gsl-2.6/lib/libgsl.a(lt86-oper.o): In function gsl_vector_complex_div': oper.c:(.text+0x7ae): undefined reference to hypot'
$(pwd)/gsl-2.6/lib/libgsl.a(lt86-oper.o): In function gsl_vector_complex_float_div': oper.c:(.text+0xcc2): undefined reference to hypot'
$(pwd)/gsl-2.6/lib/libgslcblas.a(scnrm2.o): In function cblas_scnrm2': scnrm2.c:(.text+0x129): undefined reference to sqrt'
$(pwd)/gsl-2.6/lib/libgslcblas.a(snrm2.o): In function cblas_snrm2': snrm2.c:(.text+0xdb): undefined reference to sqrt'
$(pwd)/gsl-2.6/lib/libgslcblas.a(srotg.o): In function cblas_srotg': srotg.c:(.text+0x15a): undefined reference to sqrt'
$(pwd)/gsl-2.6/lib/libgslcblas.a(dnrm2.o): In function cblas_dnrm2': dnrm2.c:(.text+0xd0): undefined reference to sqrt'
$(pwd)/gsl-2.6/lib/libgslcblas.a(drotg.o): In function cblas_drotg': drotg.c:(.text+0x178): undefined reference to sqrt'
$(pwd)/gsl-2.6/lib/libgslcblas.a(dznrm2.o):dznrm2.c:(.text+0x11a): more undefined references to `sqrt' follow
collect2: error: ld returned 1 exit status
Error: linker exited with status 1

can you give me some suggestions? or can you offer a binary version?

MSMC with genomes in multiple scaffolds, not in chromosomes

Hi,
I want to use MSMC with whole genomes which reference genome is in 1140 scaffolds. From your sentence: "MSMC takes as input several files, one for each chromosome" I wonder whether MSMC could still be used for genomes that are not properly grouped into chromosomes. Is it possible? Should I follow the same procedure as if I had chromosomes?
Thank you very much in advance.

BMartinez

AssertError bootstrap

Hi, I just try to run msmc bootstrap to analyze population separations, but one error occured as follows:

maxIterations: 20
mutationRate: 0.00121796
recombinationRate: 0.000304489
subpopLabels: [0, 0, 0, 0, 1, 1, 1, 1]
timeSegmentPattern: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
nrThreads: 96
nrTtotSegments: 40
verbose: false
outFilePrefix: /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.out/hmh.boot.1.out
naiveImplementation: false
hmmStrideWidth: 1000
fixedPopSize: false
fixedRecombination: true
initialLambdaVec: []
directedEmissions: false
skipAmbiguous: true
indices: [0, 1, 2, 3, 4, 5, 6, 7]
logging information written to /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.out/hmh.boot.1.out.log
loop information written to /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.out/hmh.boot.1.out.loop.txt
final results written to /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.out/hmh.boot.1.out.final.txt
[11/30] estimating total branchlengths
[1/20] Baumwelch iteration

  • [30/30] Expectation Step, log likelihood: -1643.79
  • [4/200(max)] Maximization Step, Q-function before: 220.606, after:220.403
    [2/20] Baumwelch iteration
  • [30/30] Expectation Step, log likelihood: -923.319
  • [2/200(max)] Maximization Step, Q-function before: 2.65402, after:2.6227
    [3/20] Baumwelch iteration
  • [29/30] Expectation Step, log likelihood: -865.246
  • [2/200(max)] Maximization Step, Q-function before: 1.04843, after:1.02013
    [4/20] Baumwelch iteration
  • [30/30] Expectation Step, log likelihood: -818.439
  • [2/200(max)] Maximization Step, Q-function before: 0.360145, after:0.344497
    [5/20] Baumwelch iteration
  • [16/30] Expectation Step, log likelihood: -794.868
  • [3/200(max)] Maximization Step, Q-function before: 0.10929, after:0.103583
    [6/20] Baumwelch iteration
  • [30/30] Expectation Step, log likelihood: -786.659
  • [85/200(max)] Maximization Step, Q-function before: 0.0316963, after:0.0312611
    [7/20] Baumwelch iteration
  • [27/30] Expectation Step, log likelihood: -785.941
  • [33/200(max)] Maximization Stepcore.exception.AssertError@model/time_intervals.d(147): Assertion failure

??:? _d_assert [0x5ebe93]
??:? void model.time_intervals.__assert(int) [0x58c30c]
??:? _D5model14time_intervals13TimeIntervals18meanTimeWithLambdaMxFmdZ8__ensureMFKxdZv [0x58bf25]
??:? const(double function(ulong, double)) model.time_intervals.TimeIntervals.meanTimeWithLambda [0x58be93]
??:? const(double function(double, ulong, ulong, double, ulong, ulong)) model.transition_rate.TransitionRate.q2IntegralSmaller [0x58cdc5]
??:? const(double function(ulong, ulong)) model.transition_rate.TransitionRate.transitionProbabilityOffDiagonal [0x58cad0]
??:? _D5model15transition_rate14TransitionRate35fillTransitionProbabilitiesParallelMFZ14__foreachbody1MFmZi [0x58c8b2]
??:? _D3std11parallelism64__T15ParallelForeachTS3std5range13__T4iotaTmTmZ4iotaFmmZ6ResultZ15ParallelForeach7opApplyMFMDFmZiZ4doItMFZv [0x58e39a]
??:? void std.parallelism.run!(void delegate()).run(void delegate()) [0x5fb49b]
??:? void std.parallelism.__T4TaskS213std11parallelism3runTDFZvZ.Task.impl(void*) [0x5faf9b]
??:? void std.parallelism.AbstractTask.job() [0x62933e]
??:? void std.parallelism.TaskPool.doJob(std.parallelism.AbstractTask*) [0x5f9c37]
??:? void std.parallelism.TaskPool.executeWorkLoop() [0x5f9d65]
??:? void std.parallelism.TaskPool.startWorkLoop() [0x5f9d0d]
??:? void core.thread.Thread.run() [0x630481]
??:? thread_entryPoint [0x6300f1]
??:? [0x6c406d73]
core.exception.AssertError@model/time_intervals.d(147): Assertion failure

The command I used was:
/data/project/tools/msmc-tools-master/msmc_1.0.0_linux64bit --fixedRecombination --skipAmbiguous -P 0,0,0,0,1,1,1,1 -o /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.out/hmh.boot.1.out /data/project/pcj/primates/Macaca_arctoides/msmc/msmc1/YND_YNQ/all.msmc.bootrap.out_1/*.txt

I do not know why this would happen. I am looking forward to getting your help.

Problems compiling on RHEL 6.4

Two of us have tried to compile msmc on RHEL 6.5 and we both received the same result. That is, the software dumps core. Here is our environment:

RHEL 6.4
gcc 4.8.1
dmd 2.068

The pre-compiled binary requires a newer version of glibc so using the pre-compiled version is not an option. I tried msmc2 to see if I would get the same results and I did. It dumps core too.

Should we expect to be able to compile msmc on RHEL 6.4?

Thanks a lot!
-Roger

Synthetic Haplotypes

I am working on African populations which have substantial admixture. Running MSMC on genome sequences tend to give larger than expected times to MRCA. I assume that this is because we are not just looking at migration events but also admixture between previously isolated populations.
I have about fifty genome sequences from each of seven populations so I have plenty of spare data.
I am wondering if I could use PCAdmix to identify parts of the genome in each sample that are from a given founder population and then cobble together a set of virtual genomes which are all derived from a putatively single population and then run MSMC on those samples? Or will this introduce horrible artifacts?

Any thoughts would be most welcome.

powell exceeding maximum iterations

Dear msmc2 developers

I have 22 chromosomes as input for msmc2. The same msmc2 command run in 19 chromosomes with no errors. However I am getting this following error for 3 other chromosomes :
[12/20] Baumwelch iteration

  • [1/1] Expectation Step, log likelihood: -656.508
  • [200/200(max)] Maximization [email protected](119): powell exceeding maximum iterations.

??:? double[] powell.Powell!(maximization_step.MinFunc).Powell.minimize(double[], double[][]) [0x567fb2]
??:? double[] powell.Powell!(maximization_step.MinFunc).Powell.minimize(double[]) [0x567bd0]
??:? model.psmc_model.PSMCmodel maximization_step.getMaximization(double[][], double[][2], model.psmc_model.PSMCmodel, const(ulong[]), bool) [0x569537]
??:? void msmc2.run() [0x56ca91]
??:? _Dmain [0x56abeb]

Please could you help me ?

AssertionError while using generate_multihetsep.py

Hi, I am trying to try following code:

python3 generate_multihetsep.py --mask=final.mask.bed.gz \
  final.vcf.gz > final.msmc

My both files only contain one individual with several chromosomes. And obviously it stops directly when the first chromosome ends. It stops with this error:

Traceback (most recent call last):
  File "/mnt/nvme1n1p1/Arabis_data/VCF/msmc-tools-master/generate_multihetsep.py", line 226, in <module>
    if mergedMask.getVal(snp_pos):
  File "/mnt/nvme1n1p1/Arabis_data/VCF/msmc-tools-master/generate_multihetsep.py", line 51, in getVal
    return all((m.getVal(pos) for m in self.maskIterators))
  File "/mnt/nvme1n1p1/Arabis_data/VCF/msmc-tools-master/generate_multihetsep.py", line 51, in <genexpr>
    return all((m.getVal(pos) for m in self.maskIterators))
  File "/mnt/nvme1n1p1/Arabis_data/VCF/msmc-tools-master/generate_multihetsep.py", line 35, in getVal
    assert pos >= self.lastPos
AssertionError

Do you know what happened here?

Effect of introgression and selfing for estimation of population split

Hi Stephan,

I want to estimate the population split of a selfing plant species using MSMC. Since the species has a very high inbreeding coefficient (0.8-0.9), I filtered out all heterozygous sites which are potentially the result of false mapping. So the data has a very long range of homozygosity, which mostly represent the true case of the species. I am wondering how selfing (or lack of heterozygous) may affect the estimation of historical Ne? From your paper, you mentioned that artifact long range of homozygosity may result in early coalescence. But I am not sure how selfing may tweak the MSMC model. I've also seen people (#28 and PSMC users) tried to generate artificial diploids to make heterozygous sites, does this necessary?

The other question is the effect of recent introgression. Your paper discussed about introgression at the time of population split (clean split or prolonged split). However, in my case, the ancestral population receives extensive gene flow from the split population in continuous and also recent times. Does MSMC have the ability to handle this scenario? In your paper, the masking of European ancestral regions in the MXL population seems helpful for the estimation of historical Ne, this is promising, but did you test how the masking may change the estimation of population split?

Thank you very much for your help!
Shujun

Couple issues understanding decode output

Hello,

Thanks Stephan for making such useful software.

I ran the decode program for tMRCAs of each chromosomes segment for data from an unphased diploid.

My first issue is about the time intervals: I noticed that the output matrix seems to be just the genomic intervals (rows) by time intervals (columns). The guide mentions the first row is the header with time boundaries, but that doesn't appear to be the case looking at the output. Please correct me if I'm wrong. Assuming the time boundaries are not there, do you know where I should get them? Are they the same intervals constructed in the msmc (psmc') run if use the same exact -m and -r values in decode as those reported by msmc?

The second issue is that I just want to verify which time interval is the most recent -- it's the leftmost, correct?

Thanks,
Jason

New Error installing from source

I'm trying to install from source using the version downloaded from github 10 minutes ago. I'm installing on a linux server. I've change the GSL paths to /opt/shared/local/lib/ as that is where the admin installed the GSL library. I've also updated the location of the dmd library. I've also made sure that the new line on the maximization.d library is there (namely import std.string;). Now, I get this error which I've not been able to understand:

sbipao@axon:~/msmc-526d9d3ff68696eb68f7468972694d872cc449c9$ make -f Makefile.linux
/opt/shared/software/dmd2/linux/bin64/dmd -O /opt/shared/local/lib/libgsl.a /opt/shared/local/lib/libgslcblas.a -release -odbuild -ofbuild/msmc model/coalescence_rate.d model/data.d model/emission_rate.d model/gsl_matrix_vector.d model/msmc_hmm.d model/msmc_model.d model/propagation_core.d model/propagation_core_fastImpl.d model/propagation_core_naiveImpl.d model/rate_integrator.d model/stateVecAllocator.d model/stateVec.d model/time_intervals.d model/transition_rate.d model/triple_index.d model/triple_index_marginal.d powell.d brent.d maximization_step.d expectation_step.d msmc.d branchlength.d logger.d
model/data.d(131): Error: undefined identifier file, did you mean struct File?
make: *** [build/msmc] Error 1

any help is most welcome.
thanks!

should I keep unphased sites?

I am using MSMC over genomes sequenced by Complete Genomics. Based on the Schiffels & Durbin 2014 paper, unphased sites would introduce bias for population split analysis. However, when I looked into run_shapeit.sh tool, it seems that phasing was performed only on SNVs present in shapeit2 reference panel. Afterwards, both phased and unphased sites were merged into the same vcf file.

My question is, should I keep the unphased sites (those not present in the reference phasing panel) in my vcf file? If not, should I somehow fix the mask file to reflect the fact that only sites present in the reference panel are callable?

Thanks very much for your suggestions,

Hao Hu

install error

$ make -f Makefile
dmd -debug -O /usr/local/lib/libgsl.a /usr/local/lib/libgslcblas.a -odbuild -ofbuild/msmc model/coalescence_rate.d model/data.d model/emission_rate.d model/gsl_matrix_vector.d model/msmc_hmm.d model/msmc_model.d model/propagation_core.d model/propagation_core_fastImpl.d model/propagation_core_naiveImpl.d model/rate_integrator.d model/stateVec.d model/stateVecAllocator.d model/time_intervals.d model/transition_rate.d model/triple_index.d model/triple_index_marginal.d powell.d brent.d maximization_step.d expectation_step.d msmc.d branchlength.d logger.d
Error: cannot find source code for runtime library file 'object.d'
dmd might not be correctly installed. Run 'dmd -man' for installation instructions.
Specify path to file 'object.d' with -I switch
make: *** [build/msmc] Error 1

Gene flow estimation with 2 haplotypes?

Hi,

Is it possible to run msmc with the -P option where there are two haplotypes, each from a different subpopulation? It only seems to work so far with 4+ haplotypes.

Thanks!

remove repeatmask region

Hi Stephan,

I'm wondering if the results make sense when I remove repeatmask region to reduce the memory consumption and running time. I mainly want to estimate the relative cross coalescence rate. The results changed a little bit the postion of rccr=0.5.

test plot: masked (orange) vs default (green)
image

best
lizhong

error running msmc

running 8 haplotypes for estimating Ne. And what I get is,

read 260864 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr10.txt
read 255964 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr11.txt
read 244425 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr12.txt
read 197484 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr13.txt
read 171440 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr14.txt
read 149298 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr15.txt
read 155058 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr16.txt
read 129607 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr17.txt
read 150660 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr18.txt
read 101668 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr19.txt
read 435286 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr2.txt
read 116107 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr20.txt
read 73866 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr21.txt
read 66201 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr22.txt
read 372724 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr3.txt
read 374496 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr4.txt
read 334388 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr5.txt
read 320806 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr6.txt
read 294493 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr7.txt
read 288326 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr8.txt
read 220013 SNPs from file 02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr9.txt
estimating scaled mutation rate: 0.000342567
error in parsing command line: object.Exception@model/triple_index_marginal.d(72): 5 140079892583040
----------------
??:? pure @safe void std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, const(char[])) [0x57c550]
??:? pure @safe bool std.exception.enforce!(Exception, bool).enforce(bool, lazy const(char)[], immutable(char)[], ulong) [0x63a641]
??:? ulong model.triple_index_marginal.MarginalTripleIndex.computeNrSubpops(const(ulong[])) [0x5d03d9]
??:? void msmc.parseCommandLine(immutable(char)[][]) [0x5e47dc]
??:? _Dmain [0x5e3f67]
??:? _D2rt6dmain211_d_run_mainUiPPaPUAAaZiZ6runAllMFZ9__lambda1MFZv [0x630383]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x6302ab]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() [0x630328]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x6302ab]
??:? _d_run_main [0x630217]
??:? main [0x5f39d7]
??:? __libc_start_main [0xe44e3fa4]

The cmd line was,

msmc -R -o 02.MSMC.Ne.pl.out/03.Out/HAN.ne.8haps \
  -i 50 -t 10 -p '1*2+15*1+1*2' \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr1.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr10.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr11.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr12.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr13.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr14.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr15.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr16.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr17.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr18.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr19.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr2.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr20.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr21.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr22.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr3.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr4.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr5.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr6.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr7.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr8.txt \
  02.MSMC.Ne.pl.out/02.Combine/HAN/HAN.chr9.txt

Thanks in advance.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.