apriha / lineage

tools for genetic genealogy and the analysis of consumer DNA test results

License: MIT License

Python 100.00%
dna genes genetics genealogy snps chromosomes genotype bioinformatics ancestry

Contributors: abitrolly, apriha, arvkevi

lineage's Issues

Add tests

Add tests and integrate with Travis CI.

Add summary info

Add capability to get information that summarizes the data (e.g., SNP data source(s), SNP count, assembly name, chromosomes, etc.). Additionally, add summary info to the output file generated by lineage.
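
A minimal sketch of what such a summary could contain, assuming SNPs are available as (rsid, chrom, pos, genotype) tuples. The function name `summarize_snps` and the dictionary keys are hypothetical and not part of lineage.

```python
def summarize_snps(snps, source, assembly):
    """Return summary information for a collection of SNPs.

    `snps` is an iterable of (rsid, chrom, pos, genotype) tuples.
    """
    snps = list(snps)
    chromosomes = []
    for _, chrom, _, _ in snps:
        if chrom not in chromosomes:
            chromosomes.append(chrom)  # keep first-seen order
    return {
        "source": source,
        "assembly": assembly,
        "snp_count": len(snps),
        "chromosomes": chromosomes,
    }


# Made-up example data for illustration only.
example = [
    ("rs1", "1", 100, "AA"),
    ("rs2", "1", 200, "AG"),
    ("rs3", "X", 300, "CC"),
]
summary = summarize_snps(example, source="23andMe", assembly="GRCh37")
```

The same dictionary could be written as a comment header in the output file.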

Productize for v1.0 release

Various minor updates for v1.0:

  • ensure all characters are ASCII
  • update setup.py, including keyword scrub and development status
  • rename int assembly property to build
  • specify list of chromosomes with assembly mapping data vs. querying endpoint

Absence of Y chromosome match between confirmed father and son

Hi,

Using various 23andMe files, I have noticed that the software does not yield any results for the Y chromosome when comparing fathers and sons. The X matches with the mother, but the Y does not match with the father. These are confirmed genetic relationships. The cM thresholds used were 0.75 and 7.

Any idea why this is happening?

Save results in memory as they are computed

For example, if the shared DNA between two individuals has been computed, save that result in memory so that it doesn't need to be computed again. These intermediate results will become increasingly important as more individuals are compared in order to perform capabilities such as phasing (see #2).
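
One way this could look, as a sketch only: results are stored in a dictionary keyed by the unordered pair of individuals. `SharedDnaCache` and `fake_shared_dna` are hypothetical names; the real comparison would be whatever lineage computes for shared DNA.

```python
from itertools import combinations


class SharedDnaCache:
    """Memoize pairwise results so each pair is computed only once."""

    def __init__(self, compute):
        self._compute = compute  # function(name1, name2) -> result
        self._results = {}

    def get(self, name1, name2):
        key = frozenset((name1, name2))  # order of the pair does not matter
        if key not in self._results:
            self._results[key] = self._compute(name1, name2)
        return self._results[key]


calls = []


def fake_shared_dna(name1, name2):
    # Stand-in for an expensive comparison; records each call.
    calls.append((name1, name2))
    return f"shared segments for {name1} and {name2}"


cache = SharedDnaCache(fake_shared_dna)
for a, b in combinations(["ind1", "ind2", "ind3"], 2):
    cache.get(a, b)
cache.get("ind2", "ind1")  # served from memory; no new computation
```

Keying on a `frozenset` means (ind1, ind2) and (ind2, ind1) share one entry, which matters once many individuals are compared in all combinations.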

Add additional location information

For each SNP, add information about the SNP's genomic location:

  • Cytogenetic location
  • Recombination rate at location
  • Whether SNP is located in a UTR, exon, intron, LINE, SINE, etc.

This will enable enhanced filtering for SNPs (e.g., location in given region, recombination rate above a given threshold, etc.).

Related to #29

Handle case when a chromosome cannot be remapped

E.g., due to the chromosome not having mapping information or an issue with the request.

Consider maintaining assembly information relative to each chromosome, or reporting when there is an assembly mismatch for partially remapped SNPs.

Update determination of sex

  • Revisit determination of sex based on X chromosome and Y chromosome data
  • Determine sex when data is loaded
  • Add property for sex to Individual
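
A rough sketch of the X chromosome approach, assuming genotypes are two-character strings and that a high fraction of heterozygous X SNPs indicates Female. The function name is hypothetical; the threshold name mirrors the `heterozygous_x_snps_threshold` parameter quoted in a later issue on this page, and the actual logic in lineage may differ.

```python
def determine_sex_from_x(x_genotypes, heterozygous_x_snps_threshold=0.03):
    """Return "Female", "Male", or "" based on X chromosome heterozygosity."""
    # Ignore missing values and no-calls such as "--".
    called = [g for g in x_genotypes if g and len(g) == 2 and "-" not in g]
    if not called:
        return ""
    heterozygous = sum(1 for g in called if g[0] != g[1])
    if heterozygous / len(called) > heterozygous_x_snps_threshold:
        return "Female"
    return "Male"


likely_female = determine_sex_from_x(["AA", "AG", "CT", "GG"])
likely_male = determine_sex_from_x(["AA", "GG", "CC", "--"])
undetermined = determine_sex_from_x([])
```

Running this once when data is loaded and storing the result would cover the last two bullets.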

Add plotting capabilities

Add the following capabilities to lineage plots / plotting:

  • Display cytobands, genes, regions susceptible to CNVs (copy number variations), and other genetic markers on plots
  • Zoom in / out on plots
  • Create detailed plots for each chromosome
  • Link regions to genome browsers (e.g., http://genome.ucsc.edu)

Status code: 502 Reason: Bad Gateway

I'm using lineage==4.3.1. I use it in my code block as I typically do, but I'm now getting a Bad Gateway error message.

This is my code block:

import csv
import glob
import os

from lineage import Lineage

# https://snps.readthedocs.io/en/stable/
# https://lineage.readthedocs.io/en/stable/

# initialize Lineage object
l = Lineage(
    output_dir = output_dir,
    resources_dir = f"{references_directory}",
    parallelize = True,
    processes = 8
)

# initialize dictionary variables
individuals_dict = {}
sex_determination = {}
# initialize count variable
count = 0

directory_path = os.path.join(data_directory, "opensnp_data")
file_pattern = os.path.join(directory_path, "*.ancestry.txt")
opensnp_files = glob.glob(file_pattern)
len_opensnp_files = len(opensnp_files)

# Path for the sex determination TSV file
sex_determination_file = os.path.join(results_directory, "opensnp_sex_determination.tsv")

# Create a lineage individual object for each Ancestry file
# Loop through file names and create individuals_dict
for file_path in opensnp_files:
    count = count + 1
    filename = os.path.basename(file_path)
    username = filename.split("_")[0]

    print(f"Processing file {count} in {len_opensnp_files}: {username}")
    
    # print(username)
    # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
    # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
    # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
    # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    individuals_dict[username] = l.create_individual(username, 
                                                     file=file_path,
                                                     assign_par_snps=True,
                                                     deduplicate_MT_chrom=True,
                                                     deduplicate_XY_chrom=True)
    
    if individuals_dict[username].build != 38:
        individuals_dict[username].remap(38)
        
    individuals_dict[username].sort()
    individuals_dict[username].to_tsv(os.path.join(output_dir, f"{username}.tsv"))

    # Determine sex
    # heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
    # y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
    # chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
    # Returns ‘Male’ or ‘Female’ if detected, else empty str
    sex_determination[username] = individuals_dict[username].determine_sex(
        heterozygous_x_snps_threshold=0.03, 
        y_snps_not_null_threshold=0.3, 
        chrom='X'
        )
    # print(sex_determination[username])

# Save sex determinations to TSV
with open(sex_determination_file, 'w', newline='') as file:
    writer = csv.writer(file, delimiter='\t')
    writer.writerow(['Username', 'Sex'])
    for username, sex in sex_determination.items():
        writer.writerow([username, sex])

print("All files processed.")

And this is the error message

Processing file 1 in 3: user6579
Request failed for /variation/v0/refsnp/34943879: Status code: 502 Reason: Bad Gateway
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 43
     35 print(f"Processing file {count} in {len_opensnp_files}: {username}")
     37 # print(username)
     38 # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
     39 # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
     40 # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
     41 # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
     42 # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
---> 43 individuals_dict[username] = l.create_individual(username, 
     44                                                  file=file_path,
     45                                                  assign_par_snps=True,
     46                                                  deduplicate_MT_chrom=True,
     47                                                  deduplicate_XY_chrom=True)
     49 if individuals_dict[username].build != 38:
     50     individuals_dict[username].remap(38)

File ~/.venv/lib/python3.10/site-packages/lineage/__init__.py:104, in Lineage.create_individual(self, name, raw_data, **kwargs)
    101 if "resources_dir" not in kwargs:
    102     kwargs["resources_dir"] = self._resources_dir
--> 104 return Individual(name, raw_data, **kwargs)

File ~/.venv/lib/python3.10/site-packages/lineage/individual.py:61, in Individual.__init__(self, name, raw_data, **kwargs)
     58 init_args = self._get_defined_kwargs(SNPs, kwargs)
...
    905     # we'll pick the first one to decide which chromosome this PAR will be assigned to
    906     merged_id = "rs" + response["merged_snapshot_data"]["merged_into"][0]
    907     logger.info(f"SNP id {rsid} has been merged into id {merged_id}")

TypeError: argument of type 'NoneType' is not iterable
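
A possible reading of the traceback, offered as an inference rather than a confirmed diagnosis: Python raises this `TypeError` for a membership test such as `"key" in obj` when `obj` is `None`, so the failed 502 request may have produced a `None` response that later code treated as a dictionary. The sketch below shows a guard for that case. `resolve_merged_rsid` is a hypothetical helper, not a function in snps; only the `merged_snapshot_data` and `merged_into` keys are taken from the traceback.

```python
def resolve_merged_rsid(rsid, response):
    """Return the rsid a SNP was merged into, or the original rsid."""
    if not response:
        # Request failed (e.g., 502 Bad Gateway); nothing to look up.
        return rsid
    merged = response.get("merged_snapshot_data")
    if not merged or not merged.get("merged_into"):
        return rsid
    # As in the traceback, pick the first id the SNP was merged into.
    return "rs" + merged["merged_into"][0]


unchanged = resolve_merged_rsid("rs34943879", None)
merged_id = resolve_merged_rsid(
    "rs1", {"merged_snapshot_data": {"merged_into": ["2"]}}
)
```

Until the upstream request succeeds again, retrying later or loading the file with `assign_par_snps=False` may avoid the lookup, though that is untested here.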

Perform additional analysis for genes / shared genes

General Gene Analysis

  • Group genes by transcripts
  • Link to gene information (e.g., NCBI, OMIM, etc.) (see #12 and #22)
  • Indicate if gene is susceptible to CNV (see #22)
  • Identify genomic regions possibly related to gene regulation and susceptible to methylation (e.g., CpG islands)
  • For SNPs in coding regions, identify effect of SNP on amino acid produced compared to reference sequence (i.e., whether variation results in a change of amino acid)

Shared Genes

  • Identify genes partially shared (i.e., gene overlaps shared DNA segment)
  • Identify percentages of coding / non-coding regions shared

Feature Request: Allow multiple chromosomal comparisons in one interface.

Essentially, replicate the functionality available through 23andMe's comparison function to allow matching and display of multiple individuals at once. This feature would display the shared segment plots for a single source individual and multiple comparator individuals. I suggest that this should also include the ability to output shared plots for single chromosomes, groups of chromosomes, or the entire digital karyotype. The example below shows the output for a single chromosome (1) in a comparison between a single source individual and 4 comparator individuals, all showing various shared DNA sections:

[image: shared DNA segments on chromosome 1 for one source individual and 4 comparators]

The rationale for this feature request is that visualizing the shared sections is an essential component of 'rebuilding' chromosomes using data from a first cousin or similar following the Athey protocol, and we need a way to display this data so it can be used for chromosomal integration analysis work.

Update pandas indexing

Update pandas indexing to fix warning generated in Travis CI Python 3.7 container:

  .ix is deprecated. Please use
  .loc for label based indexing or
  .iloc for positional indexing
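
For reference, a small sketch of the two replacements on a made-up SNP table (the column names and rsids are illustrative, not taken from lineage):

```python
import pandas as pd

snps = pd.DataFrame(
    {
        "chrom": ["1", "1", "2"],
        "pos": [100, 200, 300],
        "genotype": ["AA", "AG", "CC"],
    },
    index=pd.Index(["rs1", "rs2", "rs3"], name="rsid"),
)

# Label-based: row "rs2", column "genotype" (previously snps.ix["rs2", "genotype"]).
genotype_by_label = snps.loc["rs2", "genotype"]

# Position-based: first row, third column (previously snps.ix[0, 2]).
genotype_by_position = snps.iloc[0, 2]

# Boolean masks also go through .loc.
chrom1 = snps.loc[snps["chrom"] == "1"]
```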

Enhance documentation for installation

Double check additional installs on Linux with Agg as the Matplotlib backend (9971abe).

Check if libatlas-base-dev is also a required install.

Add Sphinx docs with instructions for installing Python, setting up a virtualenv, etc.

Update SNP sorting to use a CategoricalDtype

pandas 0.21.1 generates the following warning during SNP sorting (individual.py:_sort_snps):

FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
    self._snps['chrom'].astype('category', categories=sorted_list, ordered=True)

Therefore, update to use a CategoricalDtype.
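
A sketch of the updated sort, assuming a fixed chromosome order instead of the dynamically built `sorted_list` in the warning above. `sort_snps` and `CHROM_ORDER` are illustrative names, not the actual `_sort_snps` implementation.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed natural order: autosomes first, then X, Y, and MT.
CHROM_ORDER = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def sort_snps(snps):
    """Sort SNPs by chromosome (natural order) and then by position."""
    chrom_dtype = CategoricalDtype(categories=CHROM_ORDER, ordered=True)
    out = snps.copy()
    out["chrom"] = out["chrom"].astype(chrom_dtype)
    out = out.sort_values(["chrom", "pos"])
    out["chrom"] = out["chrom"].astype(object)  # back to plain strings
    return out


snps = pd.DataFrame(
    {"chrom": ["X", "10", "2", "2"], "pos": [5, 7, 900, 100]},
    index=pd.Index(["rs4", "rs3", "rs2", "rs1"], name="rsid"),
)
sorted_snps = sort_snps(snps)
```

Passing the dtype object keeps "10" after "2", which a plain string sort would not.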

Track and use discrepant SNPs to improve quality of data

When multiple files are loaded, track discrepant SNPs so that any discrepancies can be used to improve the quality of the data.

For example, if a discrepant SNP is found to be more congruous during processing (e.g., when finding discordant SNPs), make the discrepant SNP the SNP that is used for processing.
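
A minimal sketch of the tracking step, assuming each file has been reduced to a `{rsid: genotype}` dictionary. `merge_genotypes` is a hypothetical helper; it keeps the first file's call and records both calls so that a later step can decide which one is more congruous.

```python
def merge_genotypes(primary, secondary):
    """Merge two {rsid: genotype} dicts and report discrepancies."""
    merged = dict(primary)
    discrepant = {}
    for rsid, genotype in secondary.items():
        if rsid not in merged:
            merged[rsid] = genotype
        elif sorted(merged[rsid]) != sorted(genotype):
            # Same SNP, different call: keep the primary call but remember both.
            discrepant[rsid] = (merged[rsid], genotype)
    return merged, discrepant


# Made-up genotypes for illustration.
file1 = {"rs1": "AA", "rs2": "AG"}
file2 = {"rs2": "GA", "rs1": "AC", "rs3": "TT"}
merged, discrepant = merge_genotypes(file1, file2)
```

Allele order is ignored here, so "AG" and "GA" count as the same call.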

Add ability to reconstruct genomes

Combine techniques identified by Whit Athey in Phasing the Chromosomes of a Family Group When One Parent is Missing and the results of find_shared_dna to reconstruct genomes of maternal and/or paternal ancestors.

This can be approached as a constraint satisfaction problem. For example, the algorithm could be provided several individuals, with the maternal and/or paternal relationships also identified (e.g., siblings = [ind1, ind2]; mother = [ind3]; paternal_relation = [ind4]). Then, shared DNA could be discovered by find_shared_dna between all combinations of individuals. This information - whether the various combinations of individuals share one chromosome, both chromosomes, or no chromosomes for a given SNP position - would serve as the constraints for reconstructing the ancestral genomes.

As a simple example, say two siblings have genotypes of CA and AG at a given SNP. If one knew they shared one chromosome at that location, AN could be attributed to one parent, and CG to the other, where N would be any allele. Additional comparisons between other individuals could further narrow the solution space for the ancestral genomes.
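
The single-SNP example above can be sketched as follows. `split_parental_alleles` is a hypothetical name, and the function only handles the case where exactly one allele is common to both siblings; anything else is treated as ambiguous.

```python
def split_parental_alleles(genotype1, genotype2):
    """Attribute alleles to two parents for siblings sharing one chromosome.

    Returns (parent_a, parent_b), where "N" stands for any allele,
    or None when the shared allele cannot be identified.
    """
    shared = set(genotype1) & set(genotype2)
    if len(shared) != 1:
        return None  # no common allele, or more than one candidate
    allele = shared.pop()
    # Remove one copy of the shared allele from each sibling's genotype.
    rest1 = genotype1.replace(allele, "", 1)
    rest2 = genotype2.replace(allele, "", 1)
    return allele + "N", "".join(sorted(rest1 + rest2))


parents = split_parental_alleles("CA", "AG")
ambiguous = split_parental_alleles("AG", "AG")
```

A full reconstruction would apply constraints like this across many SNPs and many pairs of individuals.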

Remove support for Build 36

Remove support for Build 36 / hg18 and only use Build 37 / hg19 and later for resource files and calculations.

If a Build 36 file is loaded, automatically re-map to Build 37 and use Build 37 for processing.
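
A sketch of the automatic remap, assuming an object that exposes a `build` attribute and a `remap` method, as used in the bug report elsewhere on this page. `FakeIndividual` is a hypothetical stand-in so the example runs without lineage, and `ensure_minimum_build` is not an existing function.

```python
class FakeIndividual:
    """Hypothetical stand-in exposing `build` and `remap`."""

    def __init__(self, build):
        self.build = build

    def remap(self, target_build):
        # A real remap would convert SNP positions; this only records the build.
        self.build = target_build


def ensure_minimum_build(individual, minimum=37):
    """Remap to `minimum` when the loaded data uses an older build."""
    if individual.build < minimum:
        individual.remap(minimum)
    return individual.build


old_build = ensure_minimum_build(FakeIndividual(36))
new_build = ensure_minimum_build(FakeIndividual(38))
```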

Report on allele information at any given SNP.

ALLELE SNP.

Proposed function:

def Allele(Input_Name, Origin='Both'):

Allele(RS2234095) --> 'AA'

Allele(RS2234095, Origin='Maternal') --> 'A'

# Should be smart enough to determine that the origin is paternal if it isn't maternal.
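
A sketch of how such a lookup might behave, assuming phased data is available as `{rsid: (maternal_allele, paternal_allele)}`. The lowercase `allele` function and the `phased` dictionary are hypothetical; reporting a parental origin is only possible once the data has been phased.

```python
def allele(phased, rsid, origin="both"):
    """Return the allele(s) at `rsid` for the requested origin."""
    maternal, paternal = phased[rsid.lower()]
    origin = origin.lower()
    if origin == "maternal":
        return maternal
    if origin == "paternal":
        return paternal
    return maternal + paternal


# Made-up phased data matching the example outputs in the request.
phased = {"rs2234095": ("A", "A")}
both = allele(phased, "RS2234095")
maternal_only = allele(phased, "RS2234095", origin="Maternal")
```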

Ability to reassemble Kit from specified fragments of other kits.

User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following the established protocols documented in Athey et al.'s work, I have used patterns of inheritance to isolate sections of chromosomes from 6 different individuals that I would like to assemble into a single coherent kit that I can upload to GEDmatch.

# Function to clip SNP fragments
def Clip(Input_Name, Chromosome, Start, End)

In this quick example, userPW and userJW have already been phased and are haploid at this point in the analysis.

clip1 = Clip(userPW, 1, 100000, 250000)

# clip2 = Clip(userJW, 1, 5000000, 7500000)

def Assemble(x, y, z)
# haploid_assembled = Assemble(clip1, clip2)
# Makes a single-sided assembly from individual clips.
# Optionally returns a diploid assembly with duplicated haploid data.
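
A runnable sketch of the two proposed operations, assuming each phased kit is a list of (rsid, chromosome, position, allele) tuples with integer chromosomes. The lowercase `clip` and `assemble` functions and the sample data are hypothetical.

```python
def clip(snps, chromosome, start, end):
    """Return haploid SNPs on `chromosome` with start <= position <= end."""
    return [s for s in snps if s[1] == chromosome and start <= s[2] <= end]


def assemble(*clips, diploid=False):
    """Combine clips into one assembly sorted by chromosome and position."""
    combined = {}
    for fragment in clips:
        for rsid, chrom, pos, allele in fragment:
            # If the same SNP appears in several clips, the first clip wins.
            combined.setdefault(rsid, (rsid, chrom, pos, allele))
    result = sorted(combined.values(), key=lambda s: (s[1], s[2]))
    if diploid:
        # Duplicate the haploid allele to produce a diploid genotype.
        result = [(r, c, p, a + a) for r, c, p, a in result]
    return result


# Made-up phased kits for illustration.
user_pw = [("rs1", 1, 150000, "A"), ("rs2", 1, 300000, "C")]
user_jw = [("rs3", 1, 6000000, "G"), ("rs4", 2, 6000000, "T")]

clip1 = clip(user_pw, 1, 100000, 250000)
clip2 = clip(user_jw, 1, 5000000, 7500000)
haploid_assembled = assemble(clip1, clip2)
diploid_assembled = assemble(clip1, clip2, diploid=True)
```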

For an arbitrary group of individuals, show the pattern at a given locus

User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following the established protocols documented in Athey et al.'s work, I would like to identify the pattern of SNPs inherited at a given locus, in either the actual nucleotides or an A/B output style.

def Pattern(Input_Name, […]):

Pattern(RS2234095, [PW, DW, JW, SW]) --> ['TTCT' or 'AABA']
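
A sketch of the requested output, assuming each individual is phased and represented as a `{rsid: allele}` dictionary. The lowercase `pattern` function and its `style` parameter are hypothetical. In the letter style, the first allele seen is labeled A, the next distinct allele B, and so on.

```python
from string import ascii_uppercase


def pattern(rsid, individuals, style="nucleotide"):
    """Return the alleles at `rsid` across individuals as one string."""
    alleles = [individual[rsid] for individual in individuals]
    if style == "nucleotide":
        return "".join(alleles)
    labels = {}
    for a in alleles:
        if a not in labels:
            labels[a] = ascii_uppercase[len(labels)]  # A, B, C, ...
    return "".join(labels[a] for a in alleles)


# Made-up haploid data for four siblings.
pw = {"rs2234095": "T"}
dw = {"rs2234095": "T"}
jw = {"rs2234095": "C"}
sw = {"rs2234095": "T"}

nucleotides = pattern("rs2234095", [pw, dw, jw, sw])
letters = pattern("rs2234095", [pw, dw, jw, sw], style="letter")
```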

Add ability to create groups of arbitrary individuals

A group would contain arbitrary Individuals and would provide capabilities to perform intra-group analysis. For example:

  • identify shared DNA / genes across all or a subset of members of the group (see #15)
  • tune thresholds for finding shared DNA / genes
  • find discordant SNPs between child / parent(s)
  • phase DNA (see #2)
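
A bare-bones sketch of such a group, assuming individuals can be any hashable objects (names are used here). The `Group` class and its methods are hypothetical; `compare` stands in for an analysis such as finding shared DNA between two individuals.

```python
from itertools import combinations


class Group:
    """Hold arbitrary individuals and run pairwise analyses on them."""

    def __init__(self, individuals):
        self.individuals = list(individuals)

    def pairs(self, subset=None):
        """Return all pairs, optionally restricted to a subset of members."""
        members = self.individuals
        if subset is not None:
            members = [i for i in members if i in subset]
        return list(combinations(members, 2))

    def compare_all(self, compare, subset=None):
        """Apply `compare(ind1, ind2)` to every pair and collect the results."""
        return {pair: compare(*pair) for pair in self.pairs(subset)}


group = Group(["ind1", "ind2", "ind3"])
all_pairs = group.pairs()
subset_results = group.compare_all(
    lambda a, b: f"{a} vs {b}", subset={"ind1", "ind3"}
)
```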

Develop a web front-end

Develop a standalone web front-end interface to make it easier to use lineage and view results.
