elimuinformatics / vcf2fhir
vcf2fhir: a utility to convert VCF files into HL7 FHIR format for genomics-EHR integration
License: Apache License 2.0
add vcf2fhir sample_position parameter
Currently, when the converter encounters an unknown chromosome, it is unable to determine a corresponding reference sequence, resulting in an exception, as in this case:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chrR 301 . A G . . . GT:PS 0/1:.
We will likely want to add another exclusion criterion for cases where the VCF CHROM is not recognized. These records would then also go into the invalidRecord log.
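The proposed exclusion could be sketched as a check that runs before any reference-sequence lookup. The names `KNOWN_CHROMS`, `normalize_chrom`, and `is_recognized_chrom` are hypothetical and are not taken from the vcf2fhir code base; the set of accepted names is also an assumption.

```python
# Hypothetical sketch, not the actual vcf2fhir implementation.
# Assumed set of recognized chromosome names: autosomes, X, Y and M.
KNOWN_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y", "M"}


def normalize_chrom(chrom):
    """Strip an optional 'chr' prefix, upper-case the name, map 'MT' to 'M'."""
    name = chrom.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    name = name.upper()
    return "M" if name == "MT" else name


def is_recognized_chrom(chrom):
    """Return True when CHROM maps to a name in KNOWN_CHROMS."""
    return normalize_chrom(chrom) in KNOWN_CHROMS


# The record from this issue would be excluded instead of raising:
if not is_recognized_chrom("chrR"):
    print("invalid record: unrecognized CHROM 'chrR'")
```

A record that fails this check would be written to the invalidRecord log and skipped, rather than reaching the code that raises.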
Please answer the following questions for yourself before submitting an issue.
Documentation Issue.
Documentation should be correct.
There were recent changes to the Converter parameters; the documentation should include examples that use the new parameters.
Please provide any relevant information about your setup. This is important in case the issue is not reproducible except for under certain conditions.
Test case is based on a targeted genome study. (vcf_filename='HG00403A.vcf.gz', has_tabix='True', ref_build='GRCh37', conv_region_filename='HG00403A_convert.bed', region_studied_filename='HG00403A_studied.bed') (files are on google drive in vcf2fhir/test_cases folder)
When the requested conversion region hasn't been studied, we are generating a non-conformant diagnostic report. We should decide what we want in this situation - for instance, we probably want to include a regionStudied observation that somehow indicates that the intersection of convert and studied is null.
Example: in fhir_helper.py (line 564), the lines
temp = j['valueCodeableConcept']['coding'][0]["system"]
od_componentvalue_codeable_concept["system"] = temp
can be simplified to
od_componentvalue_codeable_concept["system"] = j['valueCodeableConcept']['coding'][0]["system"]
There are similar lines elsewhere in the project.
The following VCF record is only returning a single FHIR variant, but should result in two FHIR variants, each heterozygous (G>A, G>T), per this picture from the manual:
1 879676 . G A,T . . NS=1;AN=2;AC=2;CGA_XR=dbsnp.116|rs6605067;CGA_FI=148398|NM_152486.2|SAMD11|UTR3|UNKNOWN-INC&26155|NM_015658.3|NOC2L|UTR3|UNKNOWN-INC GT:PS:FT:GQ:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP 1/2:.:PASS:128:128,797:128,797:39,48:-797,-128,0:-48,-39,0:60:20,40:0
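The expected behaviour for this record can be sketched as below: one variant per distinct ALT allele named in the genotype. The function `split_multiallelic` is hypothetical, assumes diploid calls, and works on plain strings rather than on the parser objects vcf2fhir uses.

```python
def split_multiallelic(ref, alts, gt):
    """Return one (ref, alt, zygosity) tuple per distinct ALT allele called in gt.

    Hypothetical helper; assumes a diploid genotype such as '0/1' or '1|2'.
    """
    # Accept both unphased ('/') and phased ('|') separators; drop no-calls.
    indices = [int(a) for a in gt.replace("|", "/").split("/") if a != "."]
    variants = []
    for alt_index in sorted({i for i in indices if i > 0}):
        copies = indices.count(alt_index)
        zygosity = "homozygous" if copies == len(indices) else "heterozygous"
        # ALT indices in GT are 1-based; index 0 is the REF allele.
        variants.append((ref, alts[alt_index - 1], zygosity))
    return variants


# The record above: REF=G, ALT=A,T, GT=1/2
print(split_multiallelic("G", ["A", "T"], "1/2"))
```

For GT=1/2 this yields two heterozygous variants (G>A and G>T), which matches the outcome described in the issue.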
Dear Authors of vcf2fhir,
In the current design of vcf2fhir, Observation-variant includes the fixed content 'LOINC 69548-6'.
However, it is not logical to convert a record this way when GT=0/0.
Since GT=0/0 means that no variant is present, the corresponding value of LOINC 69548-6 should be 'absent'.
When GT=0/0, the value of LOINC 69548-6 should be 'absent' (LA9634-2).
The value of LOINC 69548-6 is fixed: it always shows 'present', whatever the GT value is.
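A minimal sketch of the requested behaviour, assuming the value is chosen from the GT string alone. The function name `variant_assessment` is hypothetical, and the 'Present' answer code (LA9633-4) is an assumption that should be checked against the LOINC answer list; only the 'absent' code LA9634-2 is given in this issue.

```python
# 'Present' code is an assumption to verify; 'Absent' code is from the issue.
PRESENT = {"system": "http://loinc.org", "code": "LA9633-4", "display": "Present"}
ABSENT = {"system": "http://loinc.org", "code": "LA9634-2", "display": "Absent"}


def variant_assessment(gt):
    """Pick the LOINC 69548-6 value for a genotype string such as '0/1'."""
    alleles = gt.replace("|", "/").split("/")
    # Any called non-reference allele means the variant is present.
    if any(a not in (".", "0") for a in alleles):
        return PRESENT
    # Every allele called as reference means the variant is absent.
    if all(a == "0" for a in alleles):
        return ABSENT
    # No-call or partial call: no assessment can be made from GT alone.
    return None


print(variant_assessment("0/0")["display"])
print(variant_assessment("0/1")["display"])
```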
Currently vcf2fhir does not work on Windows.
It should work; if that is not possible, update the documentation to state that Windows is not supported.
Evaluate and change the build process to PEP-517
Use case: If, for instance, a query comes in with Build37 ranges, but the VCF in question is based on Build38, then translate the query into Build38 ranges and continue to perform the query.
From HL7 $find-subject-variants: [where a server is storing variants aligned to a build that differs from the build implied by the query range] it will be necessary for the server to lift over the query region into corresponding regions in other builds. For example, a query for variants in NC_000001.11:145507556-145513536 (build 38 range) will also need to query for variants in NC_000001.10:145921556-145927537 (build 37 range) in order to gather variants expressed against build 38 and build 37, respectively...In the (very uncommon) case of a failed lift over, a server should widen the query region as necessary in order to have a successful lift over. For example, the widened build 38 range NC_000001.11:145923285-145923306 will successfully translate into the build 37 range NC_000001.10:145511787-145511807.
Helpful references: https://github.com/badges/shields
After the anticipated Jan 2021 ballot, we may need to update the output of this service to conform to the latest FHIR Genomics format.
Hi I just had a quick question - I'm trying to run vcf2fhir on large vcfs, and the process is taking many hours - is there a way to scale up? I'm working on GCP so I can use whatever resources I need.
Thanks,
Camille
Consider allowing for inclusion of genomic source class (e.g. somatic vs. germline) via a new optional parameter. If specified, we would include an additional component in the variant observation. This component would have LOINC code 48002-0 'Genomic source class [Type]', with component values drawn from here.
We do not currently include genomic source class.
Currently vcf2fhir converts the data to HL7 FHIR format and exports it to a JSON file. The converted JSON for all records is held in memory until it is exported at the end.
Evaluation Required:
In-memory storage required for the FHIR JSON when converting very large VCF files.
VCF files can be gigabytes in size, so it would be better to write the converted FHIR JSON for each record to the file before moving on to the next record, rather than holding it in memory. The major complexity in doing this is handling phase-relationship JSON blocks, which span multiple records.
Other Options:
No resource warnings while running the tests.
Resource warning due to the unclosed 'vcf file' in PyVCF3.
Please provide detailed steps for reproducing the issue.
Run python -m unittest, and you will receive the following warnings:
Hello, I have been using vcf2fhir on my test VCFs, and I have noticed that no INDEL type variants are included in the JSON product following conversion. There are quite a few entries that are missing beyond INDELs as well, and I am not sure why. Is there a way to set it that all of the variant entries convert, regardless of type? I don't believe any entry is missing information.
I am expecting all of the variant entries to be included in the final JSON output.
Only a subset of the variants are included in the JSON output following the conversion.
The code I am using for the conversion is as follows:
import vcf2fhir
vcf_fhir_converter = vcf2fhir.Converter('/test_1000.vcf', ref_build='GRCh37', genomic_source_class='mixed', patient_id='patient_ID')
vcf_fhir_converter.convert(output_filename='/test_1000.json')
I am unable to attach VCF or JSON files, but I would be more than happy to send them via email if you'd like to see them.
Now that more contributors are joining, we need a mechanism to make sure that newly merged changes do not cause regressions.
Set up GitHub Actions to run our tests against every new PR.
There are a few things in the code that do not align with Python best practices or conventions, especially variable and method names. We should scan the complete code base and fix anything that deviates from the guidelines listed in PEP 8 -- Style Guide for Python Code.
Add support for clinical annotations, which will be supplied as a tab-delimited text file.
Currently, we don't support clinical annotations.
Currently, vcf2fhir is only available on PyPI.
We should deploy vcf2fhir on conda, specifically on the bioconda channel, as it is a conda channel for bioinformatics-related packages. Steps to deploy a Python package (available on PyPI) on the bioconda channel.
Tabix considers VCF POS and length of REF allele when looking for variants that intersect conversion region. We should replicate that approach when converting a VCF that doesn't include a tabix index.
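A sketch of that overlap rule for the non-tabix path, assuming the conversion region is given in BED-style 0-based, half-open coordinates. The function name `record_overlaps_region` is hypothetical.

```python
def record_overlaps_region(pos, ref, region_start, region_end):
    """Return True when a VCF record intersects a BED-style region.

    pos is the 1-based VCF POS; the record spans len(ref) bases.
    region_start/region_end are 0-based, half-open: [start, end).
    """
    rec_start = pos - 1               # 0-based start of the REF allele
    rec_end = rec_start + len(ref)    # 0-based exclusive end of the REF allele
    return rec_start < region_end and rec_end > region_start


# A 4-base REF starting before the region still reaches into it:
print(record_overlaps_region(98, "ACGT", 100, 200))
# A single-base REF at the same POS does not:
print(record_overlaps_region(98, "A", 100, 200))
```

A check on POS alone would miss the first record, because its POS lies outside the region even though its REF allele extends into it.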
Dear authors of vcf2fhir,
We have genetic variants in VCF format. Your software is helping us produce the HL7 format, in which our variants are included as observations under DiagnosticReport. We also need to include the gene description (for genes affected by pathogenic variants) and the description of the disease caused by those affected genes. However, vcf2fhir does not include this kind of information (we are trying to modify vcf2fhir to include it), and we do not know where to put it within the HL7 format. We think that including this information in each observation is wrong, since it would be redundant when there are many variants in the same gene. Do you have any feedback about this?
Thank you very much in advance.
The 'has_tabix' variable in the converter class must be a boolean and 'ratio_ad_dp' must be a floating-point number. But, there is no validation to ensure that they are of the correct data type.
For example, if a user assigns a string to 'ratio_ad_dp', a TypeError will be thrown when the program compares it with a numeric value.
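The missing validation could look roughly like this. The function `validate_converter_args` is hypothetical; accepting plain integers for 'ratio_ad_dp' and restricting it to the range 0 to 1 are assumptions, not documented behaviour.

```python
def validate_converter_args(has_tabix, ratio_ad_dp):
    """Check argument types up front and return normalized values."""
    if not isinstance(has_tabix, bool):
        raise TypeError(
            f"has_tabix must be a bool, got {type(has_tabix).__name__}")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(ratio_ad_dp, bool) or not isinstance(ratio_ad_dp, (int, float)):
        raise TypeError(
            f"ratio_ad_dp must be a number, got {type(ratio_ad_dp).__name__}")
    # Assumption: the value is a ratio of AD to DP, so it lies in [0, 1].
    if not 0.0 <= ratio_ad_dp <= 1.0:
        raise ValueError("ratio_ad_dp must be between 0 and 1")
    return has_tabix, float(ratio_ad_dp)


try:
    validate_converter_args(True, "0.99")
except TypeError as err:
    print(err)
```

Failing early with a clear message is preferable to the TypeError that otherwise surfaces later, at the point where the value is compared with a number.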
GitHub supports automatic dependency update suggestions via Dependabot. We should set up Dependabot for our repository.
Dependency updates are not currently considered.
Hi I'm attempting to get vcf2fhir installed, but keep running into issues. I have installed cython and wheel and checked that they are installed, but when I run pip install vcf2fhir I get persistent error messages.
Are there more dependencies that might be the issue? It seems like it might be an issue with pysam, but I wanted to confirm whether you've seen folks encounter this error in the past and whether there's a different fix. Please see the error message below that I keep getting, thanks.
Package to install.
Failure to install
Collecting vcf2fhir
Using cached vcf2fhir-0.1.1-py3-none-any.whl (26 kB)
Collecting pyranges>=0.0.96
Downloading pyranges-0.0.120.tar.gz (687 kB)
---------------------------------------- 687.9/687.9 kB 14.4 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting pysam
Using cached pysam-0.20.0.tar.gz (4.0 MB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [24 lines of output]
# pysam: cython is available - using cythonize if necessary
# pysam: htslib mode is shared
# pysam: HTSLIB_CONFIGURE_OPTIONS=None
'.' is not recognized as an internal or external command,
operable program or batch file.
'.' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\clake\AppData\Local\Temp\pip-install-y1s5vygs\pysam_f4fe7232d0f142c9b8aae5c716f0a979\setup.py", line 381, in
htslib_make_options = run_make_print_config()
File "C:\Users\clake\AppData\Local\Temp\pip-install-y1s5vygs\pysam_f4fe7232d0f142c9b8aae5c716f0a979\setup.py", line 79, in run_make_print_config
stdout = subprocess.check_output(["make", "-s", "print-config"])
File "C:\Users\clake\Anaconda3\envs\test\lib\subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Users\clake\Anaconda3\envs\test\lib\subprocess.py", line 505, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Users\clake\Anaconda3\envs\test\lib\subprocess.py", line 951, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\clake\Anaconda3\envs\test\lib\subprocess.py", line 1420, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
# pysam: htslib configure options: None
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Implement vcf2fhir for structural variants, as outlined here.
Please describe the behavior you are expecting
What is the current behavior?
C:\SRC\vcf2fhir\vcf2fhir\test>pip install vcf2fhir
Collecting vcf2fhir
Using cached vcf2fhir-0.1.1-py3-none-any.whl (26 kB)
Collecting PyVCF3>=1.0.3
Using cached PyVCF3-1.0.3.tar.gz (977 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: Cython>=0.29.21 in c:\users\XYZUser\appdata\local\programs\python\python310\lib\site-packages (from vcf2fhir) (0.29.34)
Collecting pyranges>=0.0.96
Using cached pyranges-0.0.120.tar.gz (687 kB)
Preparing metadata (setup.py) ... done
Collecting pandas
Using cached pandas-2.0.0-cp310-cp310-win_amd64.whl (11.2 MB)
Collecting pysam
Using cached pysam-0.21.0.tar.gz (4.1 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [34 lines of output]
# pysam: cython is available - using cythonize if necessary
# pysam: htslib mode is shared
# pysam: HTSLIB_CONFIGURE_OPTIONS=None
'.' is not recognized as an internal or external command,
operable program or batch file.
'.' is not recognized as an internal or external command,
operable program or batch file.
# pysam: htslib configure options: None
Traceback (most recent call last):
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 353, in
main()
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\site-packages\pip_vendor\pyproject_hooks_in_process_in_process.py", line 118, in get_requires_for_build_wheel
return hook(config_settings)
File "C:\Users\XYZUser\AppData\Local\Temp\pip-build-env-s85ck20c\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "C:\Users\XYZUser\AppData\Local\Temp\pip-build-env-s85ck20c\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
self.run_setup()
File "C:\Users\XYZUser\AppData\Local\Temp\pip-build-env-s85ck20c\overlay\Lib\site-packages\setuptools\build_meta.py", line 484, in run_setup
super(_BuildMetaLegacyBackend,
File "C:\Users\XYZUser\AppData\Local\Temp\pip-build-env-s85ck20c\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
exec(code, locals())
File "", line 383, in
File "", line 79, in run_make_print_config
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 420, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 966, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\XYZUser\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1435, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
C:\SRC\vcf2fhir\vcf2fhir\test>
As of now, all the tests are in a single file. To scale better, we need to separate unit tests from integration tests. We also need to add more unit tests.
Need to decide if this is something we should support. If, for instance, you are looking for HLA variants, you might need to search across all HLA Alt contigs. If, for instance, you used an alternate-locus aware variant caller, you may not find variants in the alt contig regions using this algorithm, as they may have been aligned instead with the Alt contigs.
Alt contigs are generally created for regions known to be highly polymorphic (e.g. HLA) and/or known to vary considerably across populations. There may be many query liftover challenges (e.g. many Alt contigs are new in b38). We need to understand how Alt contigs are versioned.
We should not be excluding records where FILTER=PASS, such as the record shown here:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
X 60466 . T C . PASS NS=1 GT:PS 1|0:60454
BED files are 0-based, VCF files are 1-based. We need to make sure that we are converting correctly.
In the sequence 'AGCA', a 1-based 1..3 = AGC, whereas a 0-based 1..3 = GC.
If 1-based coordinates are 1..3, the corresponding 0-based coordinates are 0..3.
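The conversion described above, written out as a small sketch. The helper names are hypothetical; the only convention assumed is the usual one, where BED intervals are 0-based and half-open and VCF-style ranges are 1-based and inclusive.

```python
def one_based_to_bed(start, end):
    """1-based inclusive [start, end] -> 0-based half-open [start - 1, end)."""
    return start - 1, end


def bed_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based inclusive [start + 1, end]."""
    return start + 1, end


seq = "AGCA"
bed_start, bed_end = one_based_to_bed(1, 3)
# Python slices are 0-based and half-open, like BED intervals.
print(seq[bed_start:bed_end])   # bases 1..3 in 1-based terms
print(seq[1:3])                 # what 1..3 means if read as 0-based
```

Only the start coordinate shifts; the end coordinate keeps the same number in both systems because the BED end is exclusive.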
According to GitHub here, we still do not meet every item on the checklist for a public GitHub repository. We should complete the remaining items as part of this issue.
A project logo helps define the project's brand. It just takes a few minutes to create one and can be really helpful.
A GitHub Action to build and publish Python package to PyPI. This GitHub Action by Mariam Maarouf or this by Python Packaging Authority are some good candidates for this.
The FORMAT.PS field is optional, but we currently get a fatal error if it is not present. If it is absent, we should simply skip computing phase relationships.
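A sketch of tolerant PS handling, working on the raw FORMAT and sample columns rather than on the parser objects the converter actually uses. The function name `phase_set` is hypothetical.

```python
def phase_set(format_keys, sample_values):
    """Return the FORMAT.PS value, or None when PS is absent or missing ('.')."""
    fields = dict(zip(format_keys.split(":"), sample_values.split(":")))
    ps = fields.get("PS")
    return None if ps in (None, ".", "") else ps


# PS present, PS declared but missing, and PS not declared at all:
for fmt, sample in [("GT:PS", "1|0:60454"), ("GT:PS", "0/1:."), ("GT", "0/1")]:
    ps = phase_set(fmt, sample)
    if ps is None:
        print("no phase set; skip phase relationships")
    else:
        print("phase set", ps)
```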
Cython is missing when dependencies are installed, so it needs to be added to install_requires=[] in setup.py.
Whenever conv_region_filename, region_studied_filename, or nocall_filename files are provided, the result is ResourceWarning: unclosed file.
Those warnings should not appear.
Run python -m unittest in a terminal, or run any single test case.
'Human reference sequence assembly version' can be inferred from RefSeq, and doesn't always apply (e.g. for mitochondrial refSeq's).
Please provide any relevant information about your setup. This is important in case the issue is not reproducible except for under certain conditions.
Python 3.9.12 (main, Mar 26 2022, 15:51:13)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
import should work
$ python3
Python 3.9.12 (main, Mar 26 2022, 15:51:13)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
import vcf2fhir
Traceback (most recent call last):
File "", line 1, in
File "/Users/walsbr/coherent-dataset/venv/lib/python3.9/site-packages/vcf2fhir/init.py", line 1, in
from vcf2fhir.converter import Converter
File "/Users/walsbr/coherent-dataset/venv/lib/python3.9/site-packages/vcf2fhir/converter.py", line 1, in
import vcf
File "/Users/walsbr/coherent-dataset/venv/lib/python3.9/site-packages/vcf/init.py", line 9, in
from vcf.parser import Reader, Writer
File "/Users/walsbr/coherent-dataset/venv/lib/python3.9/site-packages/vcf/parser.py", line 25, in
from model import _Call, _Record, make_calldata_tuple
ModuleNotFoundError: No module named 'model'
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
python3 -m venv venv
source venv/bin/activate
pip3 install setuptools==58
pip3 install cython wheel
pip3 install vcf2fhir
python3 -c "import vcf2fhir"
See above
Use the first sample in a multi-sample VCF. Test that a multi-sample VCF works correctly.
After all the refactoring, cleanup and new tests we should release a new version.
Until now we have been releasing new versions without maintaining a proper CHANGELOG.md file that lists the important changes in each release. We should start doing that, and this release will be the first with a changelog.
Current software is hard-coded to assign homoplasmic vs. heteroplasmic based on: If allelic depth (FORMAT.AD) / read depth (FORMAT.DP) is greater than 99% then allelic state is homoplasmic; else heteroplasmic.
Change the '99%' to a configurable parameter.
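A sketch of the rule with the threshold exposed as a parameter. The function name `mito_allelic_state` is hypothetical; the parameter name `ratio_ad_dp` follows the Converter parameter mentioned in another issue on this page, and the default of 0.99 mirrors the current hard-coded value.

```python
def mito_allelic_state(ad, dp, ratio_ad_dp=0.99):
    """Classify a mitochondrial call from allelic depth (AD) and read depth (DP)."""
    if dp <= 0:
        raise ValueError("read depth (DP) must be positive")
    # Strictly greater than the threshold, as in the rule described above.
    return "homoplasmic" if ad / dp > ratio_ad_dp else "heteroplasmic"


print(mito_allelic_state(100, 100))
print(mito_allelic_state(95, 100))
print(mito_allelic_state(95, 100, ratio_ad_dp=0.9))
```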