Coder Social home page Coder Social logo

tracts's Introduction

Tracts

Tracts is a set of classes and definitions used to model migration histories based on ancestry tracts in admixed individuals. Time-dependent gene-flow from multiple populations can be modeled.

Changes in Tracts 2

  • Tracts is now a python package rather than a single .py file. Follow the installation instructions below.
  • Tracts now uses the matrix exponential form of the Phase-Type Distribution to calculate the tractlength distribution. This should not have resulted to changes to the interface. If it has, please report it in the issues.
  • Tracts no longer requires writing your own driver script. Instead, details about the simulation are read from a YAML file (examples below).
  • Demographic models also do not have to be handcoded anymore. They are now specified by a Demes-like YAML file (examples below).
  • Minor Patches: Fixed an issue with fixing multiple parameters from ancestry.

Examples

Examples contains sample hapmap data and scripts to analyze them, including two different gene flow models. It also contains a 3-population model for 1000 genomes puerto Rican data

Installation

To install:

  1. Clone this repository
  2. In your local copy, open a terminal.
  3. Run pip install .

You can now import tracts as a python package.

Tracts is currently not distributed on PyPi or Conda.

Setting up a demographic model

Tracts attempts to predict a population's migration history from the distribution of ancestry tracts. The space of all migration matrices is very large: if we have p migrant populations over g generations, there can be n*g different migration rates. In tracts, demographic models are used to describe the migration matrix as a reduced number of migration events with flexible parameters. For example, a model can contain only a founding pulse of migration from two ancestral populations. In that case, the only parameters are the time of the pulse, and the migration rate from each population. The yaml file for such a model would look like:

demes:
  - name: EUR
  - name: AFR
  - name: X
  ancestors: [EUR, AFR]
  proportions: [R, 1-R]
  start_time: tx

Here, tracts deduces the sample population to be X. The parameter for the time of founding is named tx. Since migration at the founding pulse must add to 1, there is only one other parameter for model: R. The founding proportion from EUR is equal to R, and the founding proportion from AFR is 1-R.

To add more migration pulses to the model, add a pulses field to the YAML file:

pulses:
  - sources: [EUR]
    dest: X
    proportions: [P]
    time: t2

This represents a single pulse of migration from EUR to X. It occurs at time t2 with proportion P.

The full model would then look like:

demes:
  - name: EUR
  - name: AFR
  - name: X
  ancestors: [EUR, AFR]
  proportions: [R, 1-R]
  start_time: tx
pulses:
  - sources: [EUR]
    dest: X
    proportions: [P]
    time: t2

The pulses field can also contain more than one pulse:

pulses:
  - sources: [EUR]
    dest: X
    proportions: [P]
    time: t2
  - sources: [EUR]
    dest: X
    proportions: [P]
    time: t3

Here, the proportion of both pulse migrations is the same, but they occur at different times. Tracts allows for the linking of parameters in this way. This model would have 5 parameters: R, tx, P, t2, t3. If the pulses had different rates, the model would have 6 parameters instead.

Similar to pulses, continuous migrations can be specified in the migrations field:

migrations:
  - source: EUR
    dest: X
    rate: K
    start_time: t1
    end_time: t2

Driver File

Tracts is used by passing a driver yaml file to the method tracts.run_tracts(). The first part of the driver file tells tracts how to load the sample data:

samples:
  directory: .\G10\
  filename_format: "{name}_{label}.bed"
  individual_names: [
    "NA19700", "NA19701", "NA19704", "NA19703", "NA19819", "NA19818",
    "NA19835", "NA19834", "NA19901", "NA19900", "NA19909", "NA19908",
    "NA19917", "NA19916", "NA19713", "NA19982", "NA20127", "NA20126",
    "NA20357", "NA20356"
  ]
  labels: [A, B]
  chromosomes: 1-22

In this example, the samples are located in the 'G10' directory. The individual 'NA19700' has sample data in the files 'NA19700_A.bed' and 'NA19700_B.bed'.
The 'chromosomes' field tells tracts to use data from chromosomes 1 to 22. You can also specify a single chromosome or a list of chromosomes.

The details of the model are specified as a different YAML file. The model_filename field is used to tell tracts where to find this model YAML.

model_filename: pp.yaml

Tracts optimizes the parameters of the model to best match the distribution of ancestry tracts. Starting values for the parameters can be specified as numbers or ranges. Multiple repetitions can be run on the same data, and a seed can be used for repeatability.

start_params:
  R: 0.1-0.2
  tx: 10-11
  P:  0.03-0.05
  t2: 5.5
repetitions: 2
seed: 100

Tracts also allows for the time parameter to be scaled, as some optimizers run better when all parameters are on the same scale:

time_scaling_factor: 100

Likewise, tracts below a certain length (in centimorgans) can be excluded from the analysis.

exclude_tracts_below_cM: 10

Ancestry Fixing

Input

Tracts input is a set bed-style file describing the local ancestry of segments along the genome. The file has 2 extra columns for the cM positions of the segments. There are two input files per individuals (for each haploid genome copy).

chrom		begin		end			assignment	cmBegin	cmEnd
chr13		0			18110261	UNKNOWN	0.0			0.19
chr13		18110261	28539742	YRI			0.19		22.193
chr13		28539742	28540421	UNKNOWN	22.193		22.193
chr13		28540421	91255067	CEU		22.193		84.7013

Output

The 3-population exemple files produce 5 output files, e.g.

 boot0_-252.11_bins	boot0_-252.11_liks	boot0_-252.11_ord	boot0_-252.11_pred
 boot0_-252.11_dat	boot0_-252.11_mig	boot0_-252.11_pars

boot0 means that this is bootstrap iteration 0, which in the convention used here means the fit with the real data (in the two-population example, there is no bootstrap, so the output is named "out" and "out2" instead) -252.11 is the likelihood of the best-fit model

  • _bins: the bins used in the discretization
  • _dat: the observed counts in each bins
  • _pred: the predicted counts in each bin, according to the model
  • _mig: the inferred migration matrix, with the most recent generation at the top, and one column per migrant population. Entry i,j in the matrix represent the proportion of individuals in the admixed population who originate from the source population j at generation i in the past.
  • _pars: the optimal parameters. I.e., if these models are passed to the admixture model, it will return the inferred migration matrix.
  • _liks: the likelihoods in the model parameter space in the output format of scipy.optimizes' "brute" function: the first number is the best likelihood, the top matrices define the grid of parameters usedin the search, and the last matrix defines the likelihood at all grid points. see http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brute.html

Contact

See the example files for example usage. If something isn't clear, please let me know by filing an "new issue", or emailing me.

FAQ

The distribution of tract lengths decreases as a function of tract length, but increases at the very last bin. This was not seen in the original paper. What is going on?

In tracts, the last bin represents the number of chromosomes with no ancestry switches. It does not correspond to a specific length value, and for this reason was not plotted in the tracts paper.

When I have a single pulse of admixture, I would expect an exponential distribution of tract length, but the distribution of tract lengths shows steps in the expected length. Why is that?

"Tracts" takes into account the finite length of chromosomes. Since ancestry tracts cannot extend beyond chromosomes, we expect this departure from an exponential distribution

I have migrants from the last generation. "tracts" tells me that migrants in the last two generations are not allowed. Why is that?

Haploid genomes from the last two generations have no ancestry switches and should be easy to identify in well-phased data--they should be removed from the sample before running tracts. If this is impossible (e.g., because of inaccurate phasing across chromosomes), tracts will likely attempt to assign last-generation migrants to two generations ago. This should be observable by an excess of very long tracts in the data compared to the model.

Individuals in my population vary considerably in their ancestry proportion. Is that a problem?

It is not a problem as long as the population was close to random mating. If admixture is recent, random mating is not inconsistent with ancestry variance. If admixture is ancient, however, variation in ancestry proportion may indicate population structure, and the random mating assumption may fail.

I ran the optimization steps many times, and found different optimal likelihoods. Why is that?

Optimizing functions in many dimensions is hard, and sometimes optimizers get stuck in local maxima. If you haven tried already, you can attempt to fix the ancestry proportions a priori (see the _fix examples in the documentation). In most cases, the optimization will converge to the global maximum a substantial proportion of the time: running the optimization a few times from random starting positions and comparing the best values may help control for this.

If you fail to revisit the same minimum after running say, 10 optimizations, then something else might be going on. If the model is not continuous as a function of a parameter, it could make the optimization much harder. Defining a continuous model would help, or you could try the brute-force optimization method if the number of parameters is small.

tracts's People

Contributors

apragsdale avatar general-solution avatar gonzalez-delgado avatar sgravel avatar tsani avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

tracts's Issues

issues running tracts2

Hi all,
Thank you for developing this amazing software
I am trying to run tracts2 but I am running into some issues

after creating a conda env and installing tracts, I tried to run the example
The first problem was :

AttributeError: module 'tracts' has no attribute 'driver

This was solved by including from tracts import driver in the Python script

Same problem with ParametrizedDemography (I need it to change the name of parametrized_demography to ParametrizedDemography)

However, now I have this issue (on Mac) that I don't know how to solve:

ImportError: cannot import name 'PhaseTypeDistribution' from 'tracts'

and on my server (linux), I have the following:

ImportError: cannot import name 'logs' from partially initialized module 'tracts' (most likely due to a circular import) (/local/chib/toconnor_grp/victor/miniconda3/envs/RunningJasmine/envs/tracts/lib/python3.12/site-packages/tracts/init.py)

Please can you help me with this?
Have a great day!

3 migration w/ 3 population Models

Simon

I'm trying to run tracts using a 3-migration pulses model.
Can you provide a driver file to help me with the parameters I should change? I'm having problems with "slices" and "bounds".

Thanks in advance.

func_args is not used

func_arg is a leftover argument from the original dadi code. It was provided as a way to pass additional arguments to the optimization function. It was never implemented, and should probably be removed.

Add demes support

Specifying demographic models is always a bit of a pain in tracts. It would be easier is we could specify demographic models using the demes specification. For this, we will need:
-To port tracts.py to python 3
-To add a function translating a yaml file to a migration matrix. (this could be done in tracts or in demes)
-Find a way to parameterize models in the yaml file (i.e., we want to build a function that takes in parameters and a yaml model and outputs a migration matrix. I think the we can draw inspiration from the demes support in moments.

chrom argument in ancestry_at_pos function

The ancestry_at_pos function is not used in most tracts applications. It is meant to compute the proportion of ancestry at a given position. It uses bad naming conventions. In particular, it seems to redefine the argument chrom in the loop, and the argument chrom itself shadows a global definition.

Method for model selection

After reading the paper, and some source code, it is still not clear to me what the method of model selection is.

From rfmix2 to Tracts

Dear all,
I am sharing a tailored script for converting RFMix2 output into TRACTS input.

# rfmix2tracts.py

import pandas as pd
import sys
import os

def find_header_row(file_path):
    """
    Find the header row in the file. It is the first line that starts with '#'.
    """
    with open(file_path, 'r') as file:
        for i, line in enumerate(file):
            if line.startswith('#'):
                return i
    return None

def process_rfmix_to_tracts(input_file, output_folder):
    """
    Process an RFMix output file and generate TRACTS input files.
    """
    # Find the header row
    header_row = find_header_row(input_file)
    if header_row is None:
        print("Header row could not be found.")
        return

    # Read the input file
    data = pd.read_csv(input_file, sep='\t', header=header_row, skiprows=1)

    # Get the list of samples (excluding the first 6 columns)
    samples = data.columns[6:]

    # Process each sample (in case of a file with multiple samples)
    for sample in samples:
        # Determine the output file name based on the haplotype (.0 or .1)
        file_suffix = '_B.viterbi.bed.cm' if sample.endswith('.1') else '_A.viterbi.bed.cm'
        output_file = os.path.join(output_folder, f"{sample.split('.')[0]}{file_suffix}")

        # Extract the relevant columns for the sample and reorder them
        sample_data = data[['#chm', 'spos', 'epos', sample, 'sgpos', 'egpos']]
        sample_data.columns = ['chrom', 'begin', 'end', 'assignment', 'cmBegin', 'cmEnd']

        # Save to file
        sample_data.to_csv(output_file, sep='\t', index=False)
    
    print("Process completed. The files have been generated.")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python script.py <input_file> <output_folder>")
    else:
        input_file_path = sys.argv[1]
        output_folder_path = sys.argv[2]
        process_rfmix_to_tracts(input_file_path, output_folder_path)

# Contact: [email protected]
# GitHub: marsicoFL

# Example of how to run the script:
# python rfmix2tracts.py /path/to/rfmix2output.msp.tsv /path/to/output/folder/
# Or if you want to run it in the same folder:
# python rfmix2tracts.py toy.msp.tsv .

Best,
Franco

defining model and run tracts

Hello,
This is a basic question about how to run tracts. However, I am a little confused about how to run this by using the example files. Has tracts a command line interface or options (e.g., python tracts.py --input-file or something)? or do I need to modify some of the examples to run my desired model?

I wan to infer migration based on the ancestry tracts that I have inferred from RFMIX, so it is a very basic model for a Latin American population (e.g., PEL Peruvians from Lima from 1000Genomes), in which I want to know at what time European, African and Native American ancestries meet back in time (migration starting back in time), but it is not clear to me how run tracts in python? please, could you help with this by let me know how to run the examples?

4 populations sample models required?

Dear Simon,

I have tried to run my admixed population with 2 and 3-population models in your document. However, structure analyses indicated 4 ancestral populations in my cohort.

Could you suggest me how I can find a sample script for 4-population tracts analyses?

Thankyou for your great software and help.

Cheers,
James

issue w\ input file

Dear all,
thank you for this tool, I would like to use it on my data but unfortunately I am getting the error

ValueError: chromosome pairs of different lengths!

the example is running well, I am thinking that it might be a problem of input file but I am not able to figure out the issue.
Thank you in advance,
Alessandro

Error when attempting to fix multiple parameters using ancestry proportions.

Tracts gives the following error when attempting to fix multiple parameters using ancestry proportions.

    Traceback (most recent call last):
      File "/share/hennlab/projects/tracts2_testing/at2.py", line 14, in <module>
        tracts.run_tracts('test_driver.yaml')
      File "/share/hennlab/projects/tracts2_testing/tracts/tracts/driver.py", line 59, in run_tracts
        params_found, likelihoods = run_model_multi_params(func, bound, pop, population_labels, parse_start_params(driver_spec['start_params'], driver_spec['repetitions'], driver_spec['seed'], model, time_scaling_factor), exclude_tracts_below_cM=exclude_tracts_below_cM)
      File "/share/hennlab/projects/tracts2_testing/tracts/tracts/driver.py", line 157, in run_model_multi_params
        params_found, likelihood_found = run_model(model_func, bound_func, population, population_labels, start_params, exclude_tracts_below_cM=exclude_tracts_below_cM)
      File "/share/hennlab/projects/tracts2_testing/tracts/tracts/driver.py", line 168, in run_model
        xopt = tracts.optimize_cob(startparams, bins, Ls, data, nind, model_func, outofbounds_fun=bound_func, cutoff=cutoff, epsilon=1e-2)
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/core.py", line 1846, in optimize_cob
        outputs = scipy.optimize.fmin_cobyla(
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_cobyla_py.py", line 34, in wrapper
        return func(*args, **kwargs)
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_cobyla_py.py", line 181, in fmin_cobyla
        sol = _minimize_cobyla(func, x0, args, constraints=con,
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_cobyla_py.py", line 34, in wrapper
        return func(*args, **kwargs)
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_cobyla_py.py", line 264, in _minimize_cobyla
        f = c['fun'](x0, *c['args'])
      File "/share/hennlab/projects/tracts2_testing/tracts/tracts/driver.py", line 46, in <lambda>
        bound = lambda params: model.out_of_bounds(scale_select_indices(params, model.is_time_param(), time_scaling_factor))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/parametrized_demography.py", line 387, in out_of_bounds
        return super().out_of_bounds(self.get_full_params(params))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/parametrized_demography.py", line 409, in get_full_params
        solved_params = scipy.optimize.fsolve(lambda params_to_solve: _param_objective_func(self, params_to_solve), (.2,))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_minpack_py.py", line 162, in fsolve
        res = _root_hybr(func, x0, args, jac=fprime, **options)
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_minpack_py.py", line 228, in _root_hybr
        shape, dtype = _check_func('fsolve', 'func', func, x0, args, n, (n,))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/scipy/optimize/_minpack_py.py", line 25, in _check_func
        res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/parametrized_demography.py", line 409, in <lambda>
        solved_params = scipy.optimize.fsolve(lambda params_to_solve: _param_objective_func(self, params_to_solve), (.2,))
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/parametrized_demography.py", line 403, in _param_objective_func
        full_params = self.insert_params(full_params, params_to_solve)
      File "/home/awreynolds/.local/lib/python3.10/site-packages/tracts/parametrized_demography.py", line 440, in insert_params
        raise ValueError('Incorrect number of parameters to be solved')
    ValueError: Incorrect number of parameters to be solved```

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.