apriha / lineage
tools for genetic genealogy and the analysis of consumer DNA test results
License: MIT License
macOS 11.4
I was following the example and encountered an error at this line; everything before this line is identical to the example.
results = l.find_shared_dna([user662, user663], cM_threshold=0.75, snp_threshold=1100)
Add tests and integrate with Travis CI.
Integrate a tool for performing automated analysis and review of code.
E.g., Code Climate, Codacy, Hound, etc.
Add capability to get information that summarizes the data (e.g., SNP data source(s), SNP count, assembly name, chromosomes, etc.). Additionally, add summary info to the output file generated by lineage.
Implement an option to apply multiprocessing where chromosomes are processed individually so that results can be obtained faster.
Consider limiting processes to number of physical cores instead of logical cores.
Use the coordinates of common SNPs to identify the assembly / build of a genotype file that is being loaded. Track the assembly of the SNPs as a property of the Individual.
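A minimal sketch of the coordinate-based idea, assuming SNPs are available as a plain mapping of rsid to position. The function name `detect_build` and the data layout are hypothetical, and the positions listed for rs3094315 are an assumption that should be verified against dbSNP before being relied on.

```python
# Positions of one commonly genotyped SNP in three assemblies.
# ASSUMPTION: these coordinates are illustrative and should be checked against dbSNP.
REFERENCE_POSITIONS = {
    "rs3094315": {36: 742429, 37: 752566, 38: 817186},
}


def detect_build(positions, reference=REFERENCE_POSITIONS):
    """Return the build whose reference coordinate matches an observed SNP, else None.

    positions: dict mapping rsid -> observed position in the loaded file.
    """
    for rsid, known in reference.items():
        observed = positions.get(rsid)
        if observed is None:
            continue  # this reference SNP is not in the file; try the next one
        for build, pos in known.items():
            if observed == pos:
                return build
    return None


build = detect_build({"rs3094315": 752566})
```

Using several reference SNPs spread across chromosomes would make the check more robust to files that lack any single marker.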
Replace print statements with calls to a logger.
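As a rough illustration of this change, the sketch below swaps a print call for a module-level logger from the standard library. The function `report_progress` is hypothetical and only stands in for whatever code currently prints.

```python
import logging

# Module-level logger; applications decide where the output goes and at what level.
logger = logging.getLogger(__name__)


def report_progress(count, total, name):
    # Previously: print(f"Processing file {count} in {total}: {name}")
    logger.info("Processing file %s in %s: %s", count, total, name)
```

A caller that wants the old console behavior can opt in with `logging.basicConfig(level=logging.INFO)`.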
Various minor updates for v1.0:
setup.py, including keyword scrub and development status
int assembly property to build
Hi,
I have noticed, using various 23andMe files, that the software does not yield any results for the Y chromosome when comparing fathers and sons. The X matches with the mother, but the Y does not match with the father. These are confirmed genetic relationships. The cM thresholds used were 0.75 and 7.
Any idea why this is happening?
For example, if the shared DNA between two individuals has been computed, save that result in memory so that it doesn't need to be computed again. These intermediate results will become increasingly important as more individuals are compared in order to perform capabilities such as phasing (see #2).
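One way this in-memory caching could look is sketched below. The class `SharedDnaCache` and the `compute` callable are hypothetical; the key is an unordered pair of names, on the assumption that the comparison is symmetric.

```python
class SharedDnaCache:
    """Remember the result of each pairwise comparison so it is computed once."""

    def __init__(self, compute):
        self._compute = compute  # callable taking two individual names
        self._results = {}

    def get(self, name_a, name_b):
        # frozenset makes (a, b) and (b, a) share one cache entry
        key = frozenset((name_a, name_b))
        if key not in self._results:
            self._results[key] = self._compute(name_a, name_b)
        return self._results[key]


calls = []


def fake_compare(a, b):
    # Stand-in for an expensive shared-DNA computation
    calls.append((a, b))
    return f"shared({a},{b})"


cache = SharedDnaCache(fake_compare)
first = cache.get("ind1", "ind2")
second = cache.get("ind2", "ind1")  # served from the cache
```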
Consider integrating https://github.com/stevenliuyi/admix
For each SNP, add information about the SNP's genomic location:
This will enable enhanced filtering for SNPs (e.g., location in given region, recombination rate above a given threshold, etc.).
Related to #29
E.g., due to the chromosome not having mapping information or an issue with the request.
Consider maintaining assembly information relative to each chromosome, or reporting when there is an assembly mismatch for partially remapped SNPs.
Consider adding child, mother, and father parameters to find_discordant_snps to present a more meaningful analysis of discordant SNPs (i.e., only compare SNPs that are inherited directly from an individual).
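To illustrate what a child / mother / father comparison might check, here is a minimal sketch for a single autosomal SNP. The function `is_discordant` is hypothetical, is not the library's implementation, and ignores no-calls and sex chromosomes.

```python
def is_discordant(child, mother, father):
    """Return True if the child's genotype cannot be explained by the parents.

    Genotypes are two-character strings such as "AG". The child is concordant
    when one allele can come from the mother and the other from the father.
    """
    a, b = child
    from_mother_then_father = a in mother and b in father
    from_father_then_mother = b in mother and a in father
    return not (from_mother_then_father or from_father_then_mother)


concordant_case = is_discordant("AG", "AA", "GG")
discordant_case = is_discordant("AA", "GG", "AG")
```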
sex to Individual
Revisit pinning dependency versions vs. supporting a minimum version of pandas, Matplotlib, etc.
Related to #52
Add the following capabilities to lineage plots / plotting:
Cache assembly mapping data as remapping is performed in order to increase speed of remapping. May obsolete #6.
I'm using lineage==4.3.1. I use it in my code block as I typically do, but I'm now getting a Bad Gateway error message.
This is my code block:
import csv
import glob
import os

from lineage import Lineage

# https://snps.readthedocs.io/en/stable/
# https://lineage.readthedocs.io/en/stable/
# NOTE: output_dir, references_directory, data_directory, and results_directory
# are not defined in this snippet.

# initialize Lineage object
l = Lineage(
    output_dir=output_dir,
    resources_dir=f"{references_directory}",
    parallelize=True,
    processes=8,
)

# initialize dictionary variables
individuals_dict = {}
sex_determination = {}

# initialize count variable
count = 0

directory_path = os.path.join(data_directory, "opensnp_data")
file_pattern = os.path.join(directory_path, "*.ancestry.txt")
opensnp_files = glob.glob(file_pattern)
len_opensnp_files = len(opensnp_files)

# Path for the sex determination TSV file
sex_determination_file = os.path.join(results_directory, "opensnp_sex_determination.tsv")

# Create a lineage individual object for each Ancestry file
# Loop through file names and create individuals_dict
for file_path in opensnp_files:
    count = count + 1
    filename = os.path.basename(file_path)
    username = filename.split("_")[0]
    print(f"Processing file {count} in {len_opensnp_files}: {username}")
    # print(username)
    # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
    # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
    # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
    # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
    individuals_dict[username] = l.create_individual(username,
                                                     file=file_path,
                                                     assign_par_snps=True,
                                                     deduplicate_MT_chrom=True,
                                                     deduplicate_XY_chrom=True)

    if individuals_dict[username].build != 38:
        individuals_dict[username].remap(38)
    individuals_dict[username].sort()
    individuals_dict[username].to_tsv(os.path.join(output_dir, f"{username}.tsv"))

    # Determine sex
    # heterozygous_x_snps_threshold (float) – percentage heterozygous X SNPs; above this threshold, Female is determined
    # y_snps_not_null_threshold (float) – percentage Y SNPs that are not null; above this threshold, Male is determined
    # chrom ({“X”, “Y”}) – use X or Y chromosome SNPs to determine sex
    # Returns ‘Male’ or ‘Female’ if detected, else empty str
    sex_determination[username] = individuals_dict[username].determine_sex(
        heterozygous_x_snps_threshold=0.03,
        y_snps_not_null_threshold=0.3,
        chrom='X'
    )
    # print(sex_determination[username])

# Save sex determinations to TSV
with open(sex_determination_file, 'w', newline='') as file:
    writer = csv.writer(file, delimiter='\t')
    writer.writerow(['Username', 'Sex'])
    for username, sex in sex_determination.items():
        writer.writerow([username, sex])

print("All files processed.")
And this is the error message:
Processing file 1 in 3: user6579
Request failed for /variation/v0/refsnp/34943879: Status code: 502 Reason: Bad Gateway
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[10], line 43
     35 print(f"Processing file {count} in {len_opensnp_files}: {username}")
     37 # print(username)
     38 # assign_par_snps (bool) – assign PAR SNPs to the X and Y chromosomes
     39 # with = True, error message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
     40 # deduplicate_MT_chrom (bool) – deduplicate alleles on MT; see SNPs.heterozygous_MT
     41 # deduplicate_XY_chrom (bool or str) – deduplicate alleles in the non-PAR regions of X and Y for males
     42 # Why message: Chromosome PAR not remapped; removing chromosome from SNPs for consistency
---> 43 individuals_dict[username] = l.create_individual(username,
     44     file=file_path,
     45     assign_par_snps=True,
     46     deduplicate_MT_chrom=True,
     47     deduplicate_XY_chrom=True)
     49 if individuals_dict[username].build != 38:
     50     individuals_dict[username].remap(38)
File ~/.venv/lib/python3.10/site-packages/lineage/__init__.py:104, in Lineage.create_individual(self, name, raw_data, **kwargs)
    101 if "resources_dir" not in kwargs:
    102     kwargs["resources_dir"] = self._resources_dir
--> 104 return Individual(name, raw_data, **kwargs)
File ~/.venv/lib/python3.10/site-packages/lineage/individual.py:61, in Individual.__init__(self, name, raw_data, **kwargs)
     58 init_args = self._get_defined_kwargs(SNPs, kwargs)
...
    905 # we'll pick the first one to decide which chromosome this PAR will be assigned to
    906 merged_id = "rs" + response["merged_snapshot_data"]["merged_into"][0]
    907 logger.info(f"SNP id {rsid} has been merged into id {merged_id}")
TypeError: argument of type 'NoneType' is not iterable
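The log above suggests the TypeError follows a failed remote request (502 Bad Gateway), so one possible workaround on the caller's side is to retry the call after a short pause. The helper below is a hypothetical sketch, not part of lineage or snps, and a retry only helps if the remote service recovers.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=1.0,
                      exceptions=(Exception,), **kwargs):
    """Call func, retrying when one of the given exceptions is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)


# Stand-in for a call that fails twice before succeeding
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TypeError("argument of type 'NoneType' is not iterable")
    return "ok"


result = call_with_retries(flaky, attempts=3, delay=0, exceptions=(TypeError,))
```

In the notebook above this might be used as `call_with_retries(l.create_individual, username, file=file_path, assign_par_snps=True, exceptions=(TypeError,))`.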
General Gene Analysis
Shared Genes
E.g., additional filetypes on openSNP and new FamilyTree DNA files.
Can we change the location of the output files?
Essentially, replicate the functionality available through 23andMe's comparison function to allow matching and display of multiple individuals at once. This feature would display the shared segment plots for a single source and multiple comparator individuals. It should also include the ability to output shared plots for single chromosomes, groups of chromosomes, or the entire digital karyotype. Example below shows the output for a single chromosome (1) in a comparison between a single source individual and 4 comparator individuals, all showing various shared DNA sections:
The rationale for this feature request is that visualizing the shared sections is an essential component for 'rebuilding' chromosomes using data from a first cousin or similar following the Athey protocol, and we need a way to display this data so it can be used for chromosomal integration analysis work.
Update pandas indexing to fix warning generated in Travis CI Python 3.7 container:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
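For reference, the two replacements named in the warning can be sketched as follows. The DataFrame layout and the rsids are made up for illustration and do not come from lineage.

```python
import pandas as pd

# Illustrative SNP table indexed by (made-up) rsid
snps = pd.DataFrame(
    {"chrom": ["1", "1", "2"], "pos": [100, 200, 300]},
    index=pd.Index(["rs1", "rs2", "rs3"], name="rsid"),
)

# Label-based lookup: row label "rs2", column label "pos"
by_label = snps.loc["rs2", "pos"]

# Position-based lookup: second row, second column
by_position = snps.iloc[1, 1]
```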
Double check additional installs on Linux with Agg as the Matplotlib backend (9971abe).
Check if libatlas-base-dev is also a required install.
Add Sphinx docs with instructions for installing Python, setting up a virtualenv, etc.
pandas 0.21.1 generates the following warning during SNP sorting (individual.py:_sort_snps):
FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead
self._snps['chrom'].astype('category', categories=sorted_list, ordered=True)
Therefore, update to use a CategoricalDtype.
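A minimal sketch of the suggested change, assuming a DataFrame with a `chrom` column and a `sorted_list` of chromosome names in the desired order. The sample data here is invented for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

snps = pd.DataFrame({"chrom": ["X", "10", "2", "1"], "pos": [5, 4, 3, 2]})
sorted_list = ["1", "2", "10", "X"]  # desired chromosome order

# Build the dtype once, then pass it to astype instead of the deprecated keywords
chrom_dtype = CategoricalDtype(categories=sorted_list, ordered=True)
snps["chrom"] = snps["chrom"].astype(chrom_dtype)

# Sorting now follows the category order rather than string order
snps = snps.sort_values(["chrom", "pos"])
```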
When multiple files are loaded, track discrepant SNPs so that any discrepancies can be used to improve the quality of the data.
For example, if a discrepant SNP is found to be more congruous during processing (e.g., when finding discordant SNPs), make the discrepant SNP the SNP that is used for processing.
Combine techniques identified by Whit Athey in Phasing the Chromosomes of a Family Group When One Parent is Missing and the results of find_shared_dna to reconstruct genomes of maternal and/or paternal ancestors.
This can be approached as a constraint satisfaction problem. For example, the algorithm could be provided several individuals, with the maternal and/or paternal relationships also identified (e.g., siblings = [ind1, ind2]; mother = [ind3]; paternal_relation = [ind4]). Then, shared DNA could be discovered by find_shared_dna between all combinations of individuals. This information (whether the various combinations of individuals share one chromosome, both chromosomes, or no chromosomes for a given SNP position) would serve as the constraints for reconstructing the ancestral genomes.
As a simple example, say two siblings have genotypes of CA and AG at a given SNP. If one knew they shared one chromosome at that location, AN could be attributed to one parent, and CG to the other, where N would be any allele. Additional comparisons between other individuals could further narrow the solution space for the ancestral genomes.
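The sibling example above can be sketched in a few lines. The function `attribute_parental_alleles` is hypothetical and handles only the simple case where the siblings are known to share exactly one chromosome and have a single allele in common; anything else is reported as unresolved.

```python
from collections import Counter


def attribute_parental_alleles(sibling1, sibling2):
    """Split two sibling genotypes into per-parent genotypes.

    Assumes the siblings share exactly one chromosome at this SNP.
    Returns (shared_parent, other_parent), or None when the shared
    allele cannot be identified uniquely.
    """
    common = Counter(sibling1) & Counter(sibling2)
    if len(common) != 1:
        return None  # no common allele, or more than one candidate
    shared = next(iter(common))
    # Remove one copy of the shared allele from each sibling
    rest1 = sibling1.replace(shared, "", 1)
    rest2 = sibling2.replace(shared, "", 1)
    # One parent passed `shared` to both siblings; that parent's other allele is unknown (N)
    return shared + "N", "".join(sorted(rest1 + rest2))


parents = attribute_parental_alleles("CA", "AG")
```

Which parent is which is not determined at this step; that would come from the additional comparisons the text describes.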
Return discrepant SNPs when adding SNPs to an Individual.
Remove support for Build 36 / hg18 and only use Build 37 / hg19 and later for resource files and calculations.
If a Build 36 file is loaded, automatically re-map to Build 37 and use Build 37 for processing.
ALLELE SNP.
def Allele(Input_Name, Origin='Both'):
    """Should be smart enough to determine that the origin is paternal if it isn't maternal."""
Add documentation for algorithms used throughout lineage.
User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following established protocols documented in Athey et al.'s work, I have used patterns of inheritance to isolate sections of chromosomes from 6 different individuals that I would like to assemble into a single coherent kit that I can upload to GEDmatch.
def Clip(Input_Name, Chromosome, Start, End):
    """Clip SNP fragments."""

# clip2 = Clip(userJW, 1, 5000000, 7500000)

def Assemble(x, y, z):
    """Make a single-sided assembly from individual Clips.

    Optionally return a diploid assembly with duplicated haploid data.
    """

# haploid_assembled = Assemble(clip1, clip2)
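To make the request above more concrete, here is one way the clip and assemble steps might behave, written in plain Python. The `Snp` record, the lowercase function names, and the "first clip wins" rule for overlapping SNPs are all assumptions for illustration; the sample data is invented.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    genotype: str


def clip(snps, chrom, start, end):
    """Keep the SNPs on one chromosome within [start, end] (inclusive)."""
    return [s for s in snps if s.chrom == chrom and start <= s.pos <= end]


def assemble(*clips):
    """Merge clips into one list; if an rsid repeats, the first clip wins."""
    merged = {}
    for fragment in clips:
        for snp in fragment:
            merged.setdefault(snp.rsid, snp)
    # Note: chromosomes are compared as strings here, so "10" sorts before "2"
    return sorted(merged.values(), key=lambda s: (s.chrom, s.pos))


# Made-up data for two individuals
user_a = [Snp("rs1", "1", 100, "AA"), Snp("rs2", "1", 6000000, "AG")]
user_b = [Snp("rs2", "1", 6000000, "GG"), Snp("rs3", "1", 7000000, "CT")]

clip1 = clip(user_a, "1", 5000000, 7500000)
clip2 = clip(user_b, "1", 6500000, 7500000)
assembled = assemble(clip1, clip2)
```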
User story:
I am a researcher with one of two grandparents and four full siblings in my parents' generation, and I am attempting to recreate the missing grandparent's DNA kit. Following established protocols documented in Athey et al.'s work, I would like to identify the pattern of SNPs inherited at a given locus, either as the actual nucleotides or in an a/B output style.
Def Pattern(Input_Name, […]):
For example, support 1000 Genomes Project genetic maps. Summary here: https://github.com/joepickrell/1000-genomes-genetic-maps
Consider choosing a genetic map based on admixture analysis (see #21).
This may require a means of comparing different genetic maps.
Reference: https://doi.org/10.1086/302011
A group would contain arbitrary Individuals and would provide capabilities to perform intra-group analysis. For example:
Develop a standalone web front-end interface to make it easier to use lineage and view results.
Update find_shared_dna such that it can compute the shared DNA between an arbitrary number of individuals. Related to #2 and #4.