apriha / lineage

tools for genetic genealogy and the analysis of consumer DNA test results

License: MIT License

Python 100.00%

dna genes genetics genealogy snps chromosomes genotype bioinformatics ancestry

lineage's Issues

zlib.error: Error -3 while decompressing data: invalid stored block lengths

macOS 11.4

I was following the example and encountered the error at this line; everything before this line is identical to the example.
results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100)

Add tests

Add tests and integrate with Travis CI.

Integrate automated code analysis

Integrate a tool for performing automated analysis and review of code.

E.g., Code Climate, Codacy, Hound, etc.

Add summary info

Add capability to get information that summarizes the data (e.g., SNP data source(s), SNP count, assembly name, chromosomes, etc.). Additionally, add summary info to the output file generated by lineage.

Speed-up processing with the multiprocessing module

Implement an option to apply multiprocessing where chromosomes are processed individually so that results can be obtained faster.

Consider limiting processes to number of physical cores instead of logical cores.
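
A minimal sketch of the per-chromosome idea, assuming the standard library's multiprocessing.Pool. The names process_chromosome and process_all are hypothetical and are not part of lineage; the worker body is a placeholder for the real per-chromosome computation.

```python
import multiprocessing
import os


def process_chromosome(chrom):
    """Placeholder for the real per-chromosome work; returns (chrom, result)."""
    return chrom, len(chrom)


def process_all(chromosomes, processes=None):
    """Process each chromosome individually, in parallel when processes > 1."""
    if processes is None:
        # os.cpu_count() reports logical cores and may return None.
        # Counting physical cores needs a third-party package, e.g.
        # psutil.cpu_count(logical=False).
        processes = os.cpu_count() or 1
    if processes == 1:
        # Serial fallback, useful for debugging.
        return dict(map(process_chromosome, chromosomes))
    with multiprocessing.Pool(processes=processes) as pool:
        return dict(pool.map(process_chromosome, chromosomes))
```

When run as a script, the parallel call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.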

Automatically detect assembly of file being loaded

Use the coordinates of common SNPs to identify the assembly / build of a genotype file that is being loaded. Track the assembly of the SNPs as a property of the Individual.

Use logging instead of print statements

Replace print statements with calls to a logger.
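
As a sketch of what the change could look like with the standard logging module (the logger name and the load_file function are hypothetical, chosen only for illustration):

```python
import logging

# Hypothetical logger name; a library would typically use __name__.
logger = logging.getLogger("lineage_example")


def load_file(path):
    # Previously: print("Loading " + path)
    logger.info("Loading %s", path)
    return path
```

An application that wants to see these messages can call logging.basicConfig(level=logging.INFO); a library itself normally leaves handler configuration to the caller.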

Productize for v1.0 release

Various minor updates for v1.0:

ensure all characters are ASCII
update setup.py, including keyword scrub and development status
rename int assembly property to build
specify list of chromosomes with assembly mapping data vs. querying endpoint

Absence of Y chromosome match between confirmed father and son

Hi,

I have noticed, using various 23andMe files, that the software does not yield any results for the Y chromosome when comparing fathers and sons. The X matches with the mother, but the Y does not match with the father. These are confirmed genetic relationships. The cM thresholds used were 0.75 and 7.

Any idea why this is happening?

Save results in memory as they are computed

For example, if the shared DNA between two individuals has been computed, save that result in memory so that it doesn't need to be computed again. These intermediate results will become increasingly important as more individuals are compared in order to perform capabilities such as phasing (see #2).
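
One way this could be sketched is a small in-memory cache keyed by the unordered pair of individuals plus the thresholds. The SharedDnaCache class and the compute callable are hypothetical; the default thresholds simply reuse the values from the find_shared_dna call quoted in the first issue.

```python
class SharedDnaCache:
    """Memoize shared-DNA results for pairs of individuals (illustrative only)."""

    def __init__(self, compute):
        # compute(name1, name2, cM_threshold, snp_threshold) does the real work.
        self._compute = compute
        self._results = {}

    def get(self, name1, name2, cM_threshold=0.75, snp_threshold=1100):
        # frozenset makes the key independent of argument order.
        key = (frozenset((name1, name2)), cM_threshold, snp_threshold)
        if key not in self._results:
            self._results[key] = self._compute(
                name1, name2, cM_threshold, snp_threshold
            )
        return self._results[key]
```

Because the thresholds are part of the key, changing a threshold triggers a fresh computation rather than returning a stale result.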

Add admixture analysis capabilities

Consider integrating https://github.com/stevenliuyi/admix

Add additional location information

For each SNP, add information about the SNP's genomic location:

Cytogenetic location
Recombination rate at location
Whether SNP is located in a UTR, exon, intron, LINE, SINE, etc.

This will enable enhanced filtering for SNPs (e.g., location in given region, recombination rate above a given threshold, etc.).

Related to #29

Handle case when a chromosome cannot be remapped

E.g., due to the chromosome not having mapping information or an issue with the request.

Consider maintaining assembly information relative to each chromosome, or reporting when there is an assembly mismatch for partially remapped SNPs.

Better handle discordant SNPs inherited from mother / father

Consider adding child, mother, and father parameters to find_discordant_snps to present a more meaningful analysis of discordant SNPs (i.e., only compare SNPs that are inherited directly from an individual).
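
A rough sketch of the per-SNP check such parameters might enable, for autosomal genotypes written as two-letter strings. The function is_discordant is hypothetical and is not the signature of find_discordant_snps; it ignores no-calls, genotyping errors, de novo mutations, and the sex chromosomes.

```python
def is_discordant(child, mother=None, father=None):
    """Return True if the child's genotype cannot be explained by the parent(s)."""
    first, second = child

    def could_give(parent, allele):
        # An unknown parent (None) places no constraint on the allele.
        return parent is None or allele in parent

    # The child needs one allele from each parent, in either order.
    consistent = (
        could_give(mother, first) and could_give(father, second)
    ) or (
        could_give(mother, second) and could_give(father, first)
    )
    return not consistent
```

For example, is_discordant("AG", mother="AA", father="GG") is False, while is_discordant("AA", mother="GG", father="AG") is True because the mother cannot supply an A.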

Update determination of sex

Revisit determination of sex based on X chromosome and Y chromosome data
Determine sex when data is loaded
Add property for sex to Individual
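
A simplified sketch of an X-chromosome heuristic: males carry one X, so outside the PAR regions their X genotypes should rarely appear heterozygous. The function name, the input format, and the threshold value are assumptions for illustration, not lineage's implementation.

```python
def determine_sex_from_x(x_genotypes, heterozygous_x_snps_threshold=0.03):
    """Return 'Female', 'Male', or '' when no X genotypes are called."""
    # Drop no-calls (None, empty strings, or genotypes containing '-').
    called = [g for g in x_genotypes if g and "-" not in g]
    if not called:
        return ""
    # Single-letter (hemizygous) calls count as non-heterozygous.
    heterozygous = sum(1 for g in called if len(g) == 2 and g[0] != g[1])
    fraction = heterozygous / len(called)
    return "Female" if fraction > heterozygous_x_snps_threshold else "Male"
```

Computing this once at load time and storing the result would cover the second and third items in the list above.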

Support minimum versions of dependencies

Revisit pinning dependency versions vs. supporting a minimum version of pandas, Matplotlib, etc.

Related to #52

Add plotting capabilities

Add the following capabilities to lineage plots / plotting:

Display cytobands, genes, regions susceptible to CNVs (copy number variations), and other genetic markers on plots
Zoom in / out on plots
Create detailed plots for each chromosome
Link regions to genome browsers (e.g., http://genome.ucsc.edu)

Cache assembly mapping data

Cache assembly mapping data as remapping is performed in order to increase speed of remapping. May obsolete #6.

Status code: 502 Reason: Bad Gateway

I'm using lineage==4.3.1. I use it in my code block as I typically do, but I'm now getting a Bad Gateway error message.

This is my code block:

import csv
import glob
import os

from lineage import Lineage

# https://snps.readthedocs.io/en/stable/
# https://lineage.readthedocs.io/en/stable/

# initialize Lineage object
l = Lineage(
    output_dir = output_dir,
    resources_dir = f"{references_directory}",
    parallelize = True,
    processes = 8
)

# initialize dictionary variables
individuals_dict = {}
sex_determination = {}
# initialize count variable
count = 0

directory_path = os.path.join(data_directory, "opensnp_data")
file_pattern = os.path.join(directory_path, "*.ancestry.txt")
opensnp_files = glob.glob(file_pattern)
len_opensnp_files = len(opensnp_files)

# Path for the sex determination TSV file
sex_determination_file = os.path.join(results_directory, "opensnp_sex_determination.tsv")

# Create a lineage individual object for each Ancestry file
# Loop through file names and create individuals_dict
for file_path in opensnp_files:
    count = count + 1
    filename = os.path.basename(file_path)
    username = filename.split("_")[0]

    print(f"Processing file {count} in {len_opensnp_files}: {username}")
    
    # print(username)
    # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
    # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
    # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
    # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    individuals_dict[username] = l.create_individual(username, 
                                                     file=file_path,
                                                     assign_par_snps=True,
                                                     deduplicate_MT_chrom=True,
                                                     deduplicate_XY_chrom=True)
    
    if individuals_dict[username].build != 38:
        individuals_dict[username].remap(38)
        
    individuals_dict[username].sort()
    individuals_dict[username].to_tsv(os.path.join(output_dir, f"{username}.tsv"))

    # Determine sex
    # heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
    # y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
    # chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
    # Returns ‘Male’ or ‘Female’ if detected, else empty str
    sex_determination[username] = individuals_dict[username].determine_sex(
        heterozygous_x_snps_threshold=0.03, 
        y_snps_not_null_threshold=0.3, 
        chrom='X'
        )
    # print(sex_determination[username])

# Save sex determinations to TSV
with open(sex_determination_file, 'w', newline='') as file:
    writer = csv.writer(file, delimiter='\t')
    writer.writerow(['Username', 'Sex'])
    for username, sex in sex_determination.items():
        writer.writerow([username, sex])

print("All files processed.")

And this is the error message

Processing file 1 in 3: user6579
Request failed for /variation/v0/refsnp/34943879: Status code: 502 Reason: Bad Gateway
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 43
     35 print(f"Processing file {count} in {len_opensnp_files}: {username}")
     37 # print(username)
     38 # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
     39 # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
     40 # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
     41 # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
     42 # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
---> 43 individuals_dict[username] = l.create_individual(username, 
     44                                                  file=file_path,
     45                                                  assign_par_snps=True,
     46                                                  deduplicate_MT_chrom=True,
     47                                                  deduplicate_XY_chrom=True)
     49 if individuals_dict[username].build != 38:
     50     individuals_dict[username].remap(38)

File ~/.venv/lib/python3.10/site-packages/lineage/__init__.py:104, in Lineage.create_individual(self, name, raw_data, **kwargs)
    101 if "resources_dir" not in kwargs:
    102     kwargs["resources_dir"] = self._resources_dir
--> 104 return Individual(name, raw_data, **kwargs)

File ~/.venv/lib/python3.10/site-packages/lineage/individual.py:61, in Individual.__init__(self, name, raw_data, **kwargs)
     58 init_args = self._get_defined_kwargs(SNPs, kwargs)
...
    905     # we'll pick the first one to decide which chromosome this PAR will be assigned to
    906     merged_id = "rs" + response["merged_snapshot_data"]["merged_into"][0]
    907     logger.info(f"SNP id {rsid} has been merged into id {merged_id}")

TypeError: argument of type 'NoneType' is not iterable
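
This TypeError is what Python raises for a membership test against None (for example, "key" in None). One plausible reading, not confirmed by the truncated traceback, is that the failed request returned None and the result was then tested with "in". A defensive sketch under that assumption follows; merged_rsid is a hypothetical helper, and only the dictionary keys are taken from the traceback above.

```python
def merged_rsid(response, rsid):
    """Return the rsid a SNP was merged into, the original rsid, or None."""
    if response is None:
        # Assumed failure mode: the request failed (e.g., 502 Bad Gateway),
        # so there is nothing to inspect.
        return None
    if "merged_snapshot_data" in response:
        return "rs" + response["merged_snapshot_data"]["merged_into"][0]
    return rsid
```

With a guard of this kind the caller can skip or retry the SNP instead of crashing while the remote service is unavailable.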

Perform additional analysis for genes / shared genes

General Gene Analysis

Group genes by transcripts
Link to gene information (e.g., NCBI, OMIM, etc.) (see #12 and #22)
Indicate if gene is susceptible to CNV (see #22)
Identify genomic regions possibly related to gene regulation and susceptible to methylation (e.g., CpG islands)
For SNPs in coding regions, identify effect of SNP on amino acid produced compared to reference sequence (i.e., whether variation results in a change of amino acid)

Shared Genes

Identify genes partially shared (i.e., gene overlaps shared DNA segment)
Identify percentages of coding / non-coding regions shared

Add support for new filetypes

E.g., additional filetypes on openSNP and new FamilyTree DNA files.

Save output files on custom location

Can we change the location of the output files?

Feature Request: Allow multiple chromosomal comparisons in one interface.

Essentially, replicate the functionality available through 23andMe's comparison function to allow matching and display of multiple individuals at once. This feature would display the shared segment plots for a single source individual and multiple comparator individuals. It should also include the ability to output shared plots for single chromosomes, groups of chromosomes, or the entire digital karyotype. Example below shows the output for a single chromosome (1) in a comparison between a single source individual and four comparator individuals, all showing various shared DNA sections:

The rationale for this feature request is that visualizing the shared sections is an essential component of 'rebuilding' chromosomes using data from a first cousin or similar relative following the Athey protocol, and we need a way to display this data so it can be used for chromosomal integration analysis work.

Improve support for PAR / X chromosome

HapMap includes data for PAR regions, so use that data when possible
Revisit doubling of alleles on X chromosome for males

Update pandas indexing

Update pandas indexing to fix warning generated in Travis CI Python 3.7 container:

  .ix is deprecated. Please use
  .loc for label based indexing or
  .iloc for positional indexing

Speed-up re-mapping with the asyncio module

When re-mapping chromosomes, make the requests asynchronous (perhaps with the asyncio module) so that re-mapping can be performed faster.

If possible, combine with the multiprocessing capability (see #5).
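
A minimal sketch of concurrent requests with asyncio, using asyncio.sleep as a stand-in for the real HTTP call. The names fetch_mapping and fetch_all, the returned strings, and the max_concurrent limit are all hypothetical.

```python
import asyncio


async def fetch_mapping(chrom):
    """Stand-in for an HTTP request for one chromosome's mapping data."""
    await asyncio.sleep(0.01)  # simulates network latency
    return chrom, f"mappings-for-{chrom}"


async def fetch_all(chromosomes, max_concurrent=5):
    """Fetch mappings for all chromosomes with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(chrom):
        # The semaphore keeps the number of in-flight requests bounded.
        async with semaphore:
            return await fetch_mapping(chrom)

    pairs = await asyncio.gather(*(limited(c) for c in chromosomes))
    return dict(pairs)
```

A caller would run this with asyncio.run(fetch_all(["1", "2", "X"])). Bounding concurrency is a courtesy to the remote service; the right limit depends on the endpoint's rate limits.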

Enhance documentation for installation

Double check additional installs on Linux with Agg as the Matplotlib backend (9971abe).

Check if libatlas-base-dev is also a required install.

Add Sphinx docs with instructions for installing Python, setting up a virtualenv, etc.

Update SNP sorting to use a CategoricalDtype

pandas 0.21.1 generates the following warning during SNP sorting (individual.py:_sort_snps):

FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
    self._snps['chrom'].astype('category', categories=sorted_list, ordered=True)

Therefore, update to use a CategoricalDtype.
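
A sketch of the replacement pattern on a small made-up DataFrame. The column names chrom and pos and the variable sorted_list follow the warning text above; the data values are invented for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

snps = pd.DataFrame(
    {"chrom": ["X", "10", "2", "MT", "2"], "pos": [5, 7, 9, 1, 3]},
    index=["rs1", "rs2", "rs3", "rs4", "rs5"],
)

# Desired chromosome order (natural order, not alphabetical).
sorted_list = ["2", "10", "X", "MT"]

# Build the dtype once and pass it to astype, instead of passing
# categories= and ordered= to astype directly.
chrom_dtype = CategoricalDtype(categories=sorted_list, ordered=True)
snps["chrom"] = snps["chrom"].astype(chrom_dtype)

# Sorting now respects the category order, then position.
snps = snps.sort_values(["chrom", "pos"])
```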

Track and use discrepant SNPs to improve quality of data

When multiple files are loaded, track discrepant SNPs so that any discrepancies can be used to improve the quality of the data.

For example, if a discrepant SNP is found to be more congruous during processing (e.g., when finding discordant SNPs), make the discrepant SNP the SNP that is used for processing.

Add ability to reconstruct genomes

Combine techniques identified by Whit Athey in Phasing the Chromosomes of a Family Group When One Parent is Missing and the results of find_shared_dna to reconstruct genomes of maternal and/or paternal ancestors.

This can be approached as a constraint satisfaction problem. For example, the algorithm could be provided several individuals, with the maternal and/or paternal relationships also identified (e.g., siblings = [ind1, ind2]; mother = [ind3]; paternal_relation = [ind4]). Then, shared DNA could be discovered by find_shared_dna between all combinations of individuals. This information - whether the various combinations of individuals share one chromosome, both chromosomes, or no chromosomes for a given SNP position - would serve as the constraints for reconstructing the ancestral genomes.

As a simple example, say two siblings have genotypes of CA and AG at a given SNP. If one knew they shared one chromosome at that location, AN could be attributed to one parent, and CG to the other, where N would be any allele. Additional comparisons between other individuals could further narrow the solution space for the ancestral genomes.
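
The single-SNP reasoning above can be sketched as follows. The function split_parental_alleles is hypothetical and handles only the unambiguous case where the two genotypes have exactly one allele in common.

```python
def split_parental_alleles(sib1, sib2):
    """For two sibling genotypes known to share exactly one chromosome at a SNP,
    return (genotype of the parent who gave the shared allele, other parent)."""
    shared = set(sib1) & set(sib2)
    if len(shared) != 1:
        raise ValueError("shared allele is ambiguous or absent at this SNP")
    allele = shared.pop()
    # Remove one copy of the shared allele from each sibling; what is left
    # must have come from the other parent.
    rest1 = sib1.replace(allele, "", 1)
    rest2 = sib2.replace(allele, "", 1)
    # 'N' marks the unknown second allele of the first parent.
    return allele + "N", rest1 + rest2
```

For the genotypes in the example, split_parental_alleles("CA", "AG") returns ("AN", "CG"). Cases such as AG versus AG raise an error here and would need the additional comparisons described above.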

Add support for VCF

Read VCF files
Write VCF files

Return discrepant SNPs

Return discrepant SNPs when adding SNPs to an Individual.

Remove support for Build 36

Remove support for Build 36 / hg18 and only use Build 37 / hg19 and later for resource files and calculations.

If a Build 36 file is loaded, automatically re-map to Build 37 and use Build 37 for processing.

Report on allele information at any given SNP.

ALLELE SNP.

def Allele(Input_Name, Origin='Both'):

Allele('rs2234095') --> 'AA'

Allele('rs2234095', Origin='Maternal') --> 'A'

# Should be smart enough to determine that the origin is paternal if the origin isn't maternal.

Document algorithms

Add documentation for algorithms used throughout lineage.

Ability to reassemble Kit from specified fragments of other kits.

User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following established protocols documented in Athey et al.'s work, I have used patterns of inheritance to isolate sections of chromosomes from six different individuals that I would like to assemble into a single coherent kit that I can upload to GEDmatch.

# function to clip SNP fragments
def Clip(Input_Name, Chromosome, Start, End)

In this quick example, userPW and userJW have already been phased and are haploid at this point in the analysis.

clip1 = Clip(userPW, 1, 100000, 250000)

# clip2 = Clip(userJW, 1, 5000000, 7500000)

def Assemble(x, y, z)
# haploid_assembled = Assemble(clip1, clip2)
# Makes a single-sided assembly from individual Clips
# Optionally returns a diploid assembly with duplicated haploid data.
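
A minimal sketch of how the requested functions might behave, assuming each phased (haploid) kit is a list of dictionaries with rsid, chrom, pos, and allele keys. That data layout, the lower-case function names, and the conflict rule are assumptions for illustration; none of this exists in lineage.

```python
def clip(snps, chromosome, start, end):
    """Return the SNPs on `chromosome` with start <= pos <= end."""
    return [
        s for s in snps
        if s["chrom"] == str(chromosome) and start <= s["pos"] <= end
    ]


def assemble(*clips):
    """Merge clips into one haploid kit; refuse overlapping SNPs."""
    merged = {}
    for fragment in clips:
        for snp in fragment:
            if snp["rsid"] in merged:
                raise ValueError(f"{snp['rsid']} appears in more than one clip")
            merged[snp["rsid"]] = snp
    # Note: chrom sorts as text here, so "10" would sort before "2".
    return sorted(merged.values(), key=lambda s: (s["chrom"], s["pos"]))
```

Raising on overlap is one possible policy; a real implementation might instead let the caller choose which source wins.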

For an arbitrary group of individuals, show the pattern at a given locus

User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following established protocols documented in Athey et al.'s work, I would like to identify the pattern of SNPs inherited at a given locus, either as the actual nucleotides or in an A/B output style.

def Pattern(Input_Name, […]):

Pattern('rs2234095', [PW, DW, JW, SW]) --> 'TTCT' or 'AABA'
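
A sketch of the two output styles, assuming the haploid allele of each individual at the SNP has already been looked up. The function name pattern, its style argument, and the labeling rule (first allele seen becomes A, the second becomes B) are assumptions for illustration.

```python
def pattern(alleles, style="nucleotide"):
    """Summarize the alleles carried by several individuals at one SNP.

    alleles: one haploid allele per individual, in a fixed order.
    """
    if style == "nucleotide":
        return "".join(alleles)
    # A/B style: label alleles in order of first appearance.
    labels = {}
    for allele in alleles:
        if allele not in labels:
            if len(labels) == 2:
                raise ValueError("more than two distinct alleles at this SNP")
            labels[allele] = "AB"[len(labels)]
    return "".join(labels[a] for a in alleles)
```

With alleles T, T, C, T for PW, DW, JW, and SW, the nucleotide style gives 'TTCT' and the A/B style gives 'AABA', matching the example in the request.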

Integrate Read the Docs for building / hosting documentation

Support additional genetic maps

For example, support 1000 Genomes Project genetic maps. Summary here: https://github.com/joepickrell/1000-genomes-genetic-maps

Consider choosing a genetic map based on admixture analysis (see #21).

This may require a means of comparing different genetic maps.

Reference: https://doi.org/10.1086/302011

Add ability to create groups of arbitrary individuals

A group would contain arbitrary Individuals and would provide capabilities to perform intra-group analysis. For example:

identify shared DNA / genes across all or a subset of members of the group (see #15)
tune thresholds for finding shared DNA / genes
find discordant SNPs between child / parent(s)
phase DNA (see #2)