Coder Social home page Coder Social logo

pyir's Introduction

PyIR

An IgBLAST wrapper and parser

PyIR is a minimally-dependent high-speed wrapper for the IgBLAST immunoglobulin and T-cell analyzer. This is achieved through chunking the input data set and running IgBLAST single-core in parallel to better utilize modern multi-core and hyperthreaded processors.

PyIR has become an essential part of the Vanderbilt Vaccine Center workflow, and the requirements in the past few years has lead to the development of new features including:

  • Parsing algorithm refactorization
  • AIRR naming compliance
  • Updated IgBlast binary
  • Multiple output formats (including python dictionary)
  • Built-in sequence filtering
  • Simplified command-line interface

PyIR is described at length in the BMC Bioinformatics article: PyIR: a scalable wrapper for processing billions of immunoglobulin and T cell receptor sequences using IgBLAST

๐Ÿ’ ๐Ÿ’  2021-07-09 Update ๐Ÿ’ ๐Ÿ’ 

  • Support for non-human species
  • Package is listed on pip for distribution (under crowelab_pyir repository)
  • Blastp support for human sequences

Requires

  1. Linux
  2. Python 3.6
  3. Pip version >=10.0.1 and the following packages: tqdm
  4. Any requirements for IgBLAST (including glibc >= 2.14)
  5. wget, gawk

Files for testing and Manuscripts

Test files used for the BMC Bioinformatics manuscript can be found at: https://clonomatch.accre.vanderbilt.edu/pyirfiles/

Files pertaining to the manuscript High frequency of shared clonotypes in human B cell receptor repertoires and be found at: https://github.com/crowelab/PyIR/wiki/Files-for-Manuscripts

Standard Installation

PyIR is installed with the pip software packager, but is not currently a part of the PyPI repository index. It can be manually downloaded and installed as followed:

1. Download the repository (optional)

This repository can be downloaded by selecting "Download ZIP" from the "Clone and Download" menu at the top right of this github page or by using git from command line:

git clone https://github.com/crowelab/PyIR

2. Install with pip

Install from pypi repository (Recommended)

pip3 install crowelab_pyir

Local Installation from folder

cd PyIR/
pip3 install --user .

Global Installation from folder

cd PyIR/
sudo pip3 install .

Uninstall PyIR

pip3 uninstall crowelab_pyir

3. Database Setup

PyIR requires a set of BLAST germline databases to assign the VDJ germlines.

A snapshot of the IMGT/GENE-DB human immunome repertoire is included with PyIR, but users are recommended to build their own database to keep up with the newest germline definitions. A link to the full instructions from NCBI can be found here, or you can use PyIR's setup script to build the databases automatically:

#Builds databases in pyir library directory
pyir setup

#Builds databases in specified path
pyir setup -o path/

#Builds databases in global pyir library directory (use if installed with sudo pip3)
sudo pyir setup

Potential Issues:

1. Can't find pyir executable

Locate your local bin folder with PyIR and add it to your PATH variable. ~/.local/bin and /usr/local/bin are good places to start. If using scl or other virtual environments (such as conda) be sure to account for those when searching your directories.

2. Error with IgBLAST executable

Double-check that you've met all prerequisites to install IgBLAST, including GLIBC > 2.14 (which has caused issues with CentOS 6) and libuv (can be installed with "sudo apt install libuv1.dev")

3. Installed correctly but packages are missing

Ensure that the version of pip used to install pyir is associated with the correct version of python you are attempting to run. This can also be an issue with virtual environments.

Install With Virtual Box (Ubuntu)

Instructions for installing PyIR with a VirtualBox container can be found in the wiki

Examples

CLI

#Default PyIR
pyir example.fasta

#PyIR with filtering
pyir example.fasta --enable_filter

#PyIR with custom BLAST database
pyir example.fasta -d [path_to_DB]

API

Example 1: Running PyIR on a fasta and getting filtered sequences back as Python dictionary

## Initialize PyIR and set example file for processing
from crowelab_pyir import PyIR
FILE = 'example.fasta'

pyirfiltered = PyIR(query=FILE, args=['--outfmt', 'dict', '--enable_filter'])
result = pyirfiltered.run()

#Prints size of Python returned dictionary
print(len(result))

Example 2: Count the number of somatic variants per V3J clonotype in the returned results and print the top 10 results

## Initialize PyIR and set example file for processing
from crowelab_pyir import PyIR
FILE = 'example.fasta'

sv = {}
for key, entry in result.items():
    v3j = entry['v_family'] + '_' + entry['j_family'] + '_' + entry['cdr3_aa']
    if v3j not in sv:
        sv[v3j] = 0
    sv[v3j] += 1

for i,item in enumerate(sorted(sv.items(), key=lambda x: x[1], reverse=True)):
    if i > 9:
        break
    v3j = item[0].split('_')
    print('v:', v3j[0], 'j:', v3j[1], 'cdr3:', v3j[2], 'count:', item[1])

Example 3: Process example file and return filepath

## Initialize PyIR and set example file for processing
from crowelab_pyir import PyIR
FILE = 'example.fasta'

pyirfile = PyIR(query=FILE)
result = pyirfile.run()

#Prints the output file
print(result)

Example 4: Process example file in and return filepath to results in MIARR format

## Initialize PyIR and set example file for processing
from crowelab_pyir import PyIR
FILE = 'example.fasta'

pyirfile = PyIR(query=FILE, args=['--outfmt', 'tsv'])
result = pyirfile.run()

#Prints the output file
print(result)

Example 5: Plot CDR3 length distribution histogram

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.pyplot import figure
## Initialize PyIR and set example file for processing
from crowelab_pyir import PyIR
FILE = 'example.fasta'

#create PyIR API instance and return Python dictionary
pyirexample = PyIR(query=FILE, args=['--outfmt', 'dict', '--enable_filter'])
result = pyirexample.run()

cdr3lens = {}
total_reads = 0

#iterate over values returned by PyIR
for key, entry in result.items():
	clen = entry['cdr3_aa_length']
	#if the CDR3 length is not in the dictionary, add it
	if int(clen) not in cdr3lens.keys():
		cdr3lens[int(clen)] = 0
	#increment appropriate dictionary value and total
	cdr3lens[int(clen)] += 1
	total_reads += 1

x = []
y = []

for xval in sorted(cdr3lens.keys()):
	x.append(xval)
	y.append(cdr3lens[xval]/total_reads)

fig, ax = plt.subplots(1 , 1, dpi=600, facecolor='None', edgecolor='None')
plt.bar(x, y, color="#a0814b")
fig.savefig("synth01_cdr3length_distribution.svg", bbox_inches='tight', pad_inches=0)

Further Examples

More examples can be found in the Wiki, such as creating a CDR3 Histogram and Installing PyIR in VirtualBox

Contact

Email [email protected] with any questions or open an issue on Github and we'll get back to you.

License

PyIR is distributed under the Creative Commons Attribution 4.0 International License

pyir's People

Contributors

andrejbranch avatar crowelab avatar jwillis-twist avatar liukai1029 avatar myersml5 avatar reccej avatar tjonesster avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pyir's Issues

issue about downloading database

image image Hello, when I tried to download the database from your website(http://www.imgt.org/download/V-QUEST/IMGT_V-QUEST_reference_directory/Homo_sapiens/IG/IGHV.fasta), there was an error. could you kindly help me to check the website? thanks for your assistance

Adaptive Biotechnologies data set

Hi, I am trying to download Adaptive Biotechnologies data sets (FASTA) from the publication "High frequency of shared clonotypes in human B cell receptor repertoires". But the link does not seem to work.

Example PyIR output for immunarch

Hi, thank you for creating PyIR! Would you be open to provide an example output of PyIR so we can add it as a parsing option to https://github.com/immunomind/immunarch ? I tried to download files from Wiki, but some of them are not available, and I have troubles with the PyIR installation. So it would be great to have some PyIR outputs to quickly implement the parser and help PyIR users easily explore their AIRR data. Thank you, and let me know if you have any questions!

Issue to track the parser: immunomind/immunarch#84

-- Vadim

tempfile.mkdtemp() causing "No space left on device error"

Hi,

I am trying to run PyIR on many files simultaneously. It is a great wrapper for IgBlast and I have been able to parse the json files easily. But, I am running into a space issue with the tmpdir.

Line 26 of factory.py contains this command:
args['tmp_dir'] = tempfile.mkdtemp()

Which appears to be root cause of the "No space left on device" error that I am seeing below. Some of the jobs work but most fail. I am running this job on a cluster.

My current workaround will be to set the environment variable TMPDIR to a directory with sufficient space, but this may be an issue that others may run into as well.

Error output:

  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/PyIR/bin/./pyir", line 14, in <module>
    py_ir.run()
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/pyir/factory.py", line 53, in run
    total_seqs, input_pieces, fastq_input_pieces = self.split_input_file(input_format)
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/pyir/factory.py", line 159, in split_input_file
    Bio.SeqIO.write(seq, current_pieces[proc_index], 'fasta')
  File "/data/omicscore/Easterhoff-Easterhoff-20190501/scripts/conda/envs/py36/lib/python3.6/site-packages/Bio/SeqIO/__init__.py", line 529, in write
    fp.write(format_function(record))
OSError: [Errno 28] No space left on device

Thanks,
Jen

after pyir set up, pyir still can't be used with other species

Hi,

I have been able to run pyir only on human. When I try to set up database for other species by "pyir setup" and then run pyir blast search it does not work. It always default back to human. I noticed in the pyir arg_parse.py, the choices is only for human. I tried uncommented line 175 choices=['human'] and commented the line 176 choices=['human','mouse', 'rabbit', 'rat', 'rhesus_monkey' ] but it pyir still doesn't work for any other species but human. The pyir setup command seems to run fine but it seems like there is some other arguments except the one species choices are preventing it to find the database. Any suggestion?

ABhelix data

Hi, am looking at the ABhelix derived data from your paper "High frequency of shared clonotypes in human B cell receptor repertoires". The reads corresponding to IgA and IgM cover a part of the C gene which is enough for isotype identification, but reads in files corresponding to IgG1-IgG4 cover only last 5 nucleotides of the C gene. I wonder how IgG1-IG4 isotypes were determined?

Issue of using pair-end WES data

Dear developer,

I noticed that pyIR currently supports input in the form of a single FASTA/FASTQ file. I'm working with paired-end Whole Exome Sequencing (WES) data and was wondering about the best approach to use this data with pyIR. Should I preprocess the paired-end data into a single file using tools like pRESTO before inputting it into pyIR?

Thanks for your guidance!

IGBLAST_TSV_HEADER misalignment

There are column names misalignment in TSV output file. The length of TSV header is less than the actual number of output columns and all the column names after the newly added "complete_vdj" are misaligned from their meant column data.

missing FR4 with PyIR 1.3 AIRR parser

Hi PyIR team,

With PyIR 1.3, we have noticed that many queries/alignments are missing FR4. A closer look indicates that many sequences do contain the FR4 but not being reported in PyIR 1.3.

Below is one example annotated by PyIR 1.0 and PyIR 1.3 respectively; only minimal information included to reproduce behavior; PyIR 1.0 results has proper end of FR4 and agrees much better with Igblast web tool.

PyIR 1.0
"Raw Sequence":"CAACATCCGAGCAGGGTTATCTGGTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCTGTGACCGTGAGCCTGGTGCCCGGCCCGAAGTACTGCTCGTAGGATTCCAACCCCCACTCCCATCTAGCACTGCAGATGTAGAAGCTGCTGTCTTCAGGATGGGCACTGGTCACTGTCAGAGTGGACAAGGTCAGGCTTGCATGGTTGATGAGAAACTTGTCCTTCTCGACGCCTTGCTCGTATGTGGCCTTGGAGCCCTCATTGGAAGTTGCCATCAGCATGAGACTCTGTTTCGGGAACTGACGATACCAAAACATAGTTGTGGCCTGAAAGTCCAGGGAACGGCACTCGATCTTCACAGAGGTTCCACTCTTACAGATAACCCTGCTCGGATGTTGAGAGACGACAGCACCAAGCCCGGAGCCTGGCCCCAGAAGCAGCAGAAGCAGCAGCATCTTCCGTGATGGCCTCACACCACCTTCTCTGGGGAGAGTTCAGAGCGCAGAGC",
"CDR1":{
"from":174.0,
"to":191.0,
"length":18.0,
"matches":18.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"DFQATT",
"NT":"GACTTTCAGGCCACAACT"
},
"FR2":{
"from":192.0,
"to":242.0,
"length":51.0,
"matches":51.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"MFWYRQFPKQSLMLMAT",
"NT":"ATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACT"
},
"CDR2":{
"from":243.0,
"to":263.0,
"length":21.0,
"matches":21.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"SNEGSKA",
"NT":"TCCAATGAGGGCTCCAAGGCC"
},
"FR3":{
"from":264.0,
"to":377.0,
"length":114.0,
"matches":114.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"TYEQGVEKDKFLINHASLTLSTLTVTSAHPEDSSFYIC",
"NT":"ACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGC"
},
"CDR3":{
"from":378.0,
"to":386.0,
"length":9.0,
"matches":9.0,
"mismatches":0.0,
"gaps":0.0,
"percent identity":100.0,
"AA":"SARWEWGLESYEQY",
"NT":"AGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTAC"
"FR4":{
"AA":"FGPGTRLTVT",
"NT":"TTCGGGCCGGGCACCAGGCTCACGGTCACAG"

PyIR 1.3 :

"sequence":"GCTCTGCGCTCTGAACTCTCCCCAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTTCTGCTGCTTCTGGGGCCAGGCTCCGGGCTTGGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTGGAACCTCTGTGAAGATCGAGTGCCGTTCCCTGGACTTTCAGGCCACAACTATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACTTCCAATGAGGGCTCCAAGGCCACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTACTTCGGGCCGGGCACCAGGCTCACGGTCACAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGACCAGATAACCCTGCTCGGATGTTG",
"fwr1":"GGTGCTGTCGTCTCTCAACATCCGAGCAGGGTTATCTGTAAGAGTGGAACCTCTGTGAAGATCGAGTGCCGTTCCCTG",
"fwr1_aa":"GAVVSQHPSRVICKSGTSVKIECRSL",
"cdr1":"GACTTTCAGGCCACAACT",
"cdr1_aa":"DFQATT",
"fwr2":"ATGTTTTGGTATCGTCAGTTCCCGAAACAGAGTCTCATGCTGATGGCAACT",
"fwr2_aa":"MFWYRQFPKQSLMLMAT",
"cdr2":"TCCAATGAGGGCTCCAAGGCC",
"cdr2_aa":"SNEGSKA",
"fwr3":"ACATACGAGCAAGGCGTCGAGAAGGACAAGTTTCTCATCAACCATGCAAGCCTGACCTTGTCCACTCTGACAGTGACCAGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGC",
"fwr3_aa":"TYEQGVEKDKFLINHASLTLSTLTVTSAHPEDSSFYIC",
"fwr4":"",
"fwr4_aa":"",
"cdr3":"AGTGCTAGATGGGAGTGGGGGTTGGAATCCTACGAGCAGTAC",
"cdr3_aa":"SARWEWGLESYEQY",

UPDATE: A closer look tells me that this is only relating to igblast's AIRR output parser. The legacy parser is still as good but doesn't support tsv outfmt. It will be nice to have the AIRR parser return FR4.

Result difference between PyIR and IgBLAST server

Dear developer,

I noticed that using inputting the same TCR fasta file (50 sequences, I used a small file so that IgBLAST server could run) into both PyIR and IgBLAST server, it seemed that IgBLAST recognized the CDR3 sequences in all the 50 sequences, while PyIR only recognized 24 of them. I wonder did it happen because of the quality control function of PyIR? Will this issue simply be solved if the fastq/fastq file is in high quality?

Below was my code in PyIR:
PyIR(query=FILE, args=['--outfmt', 'tsv', '-r', 'TCR', '--species', 'human'])

Result from PyIR:
image

Result from IgBLAST server:
image

could we have an selection option to use blastp?

given that we have a list of protein sequences rather than nucleotide sequences in hand, could we use blastp program for immunoglobulin protein sequence blast as it shows in igBlast website?

Support for different species

# choices=['human', 'mouse', 'rabbit', 'rat', 'rhesus_monkey'],

Hello I am looking to run PYir on mice, but as I was going through the code I noticed that mouse is no longer an option that can passed into the tool. Is this something done intentional, or does this tool not work on mice anymore?

Thanks,

  • Kuzirh

empty output

Dear PyIR maintainer,

Thanks for maintaining this tool. I installed it and give a test run on one of my fasta files, but get an empty output from it.

$ pyir myfasta.fa  --outfmt tsv -m 8
4,555,603 sequences successfully split into 4554 pieces
Starting process pool using 8 processors
  0%|                                                                                                | 0/4555603 [04:45<?, ?seq/s]
4,555,603 sequences processed in 297.47 seconds, 15,314 sequences / s
Zipping up final output
Analysis complete, result file: myfasta.tsv.gz

May I ask if you have an idea about why it should happen?

Thanks,

Failure in building docker image

When building the docker image, the Ig tests fails with what looks to be an improperly written regex:

Starting process pool using 4 processors
 22%|โ–ˆโ–ˆโ–       | 108/500 [00:00<00:03, 109.25seq/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/pyir/igblast.py", line 20, in run
    return igblast_run.run_single_process(fasta_input_file, fastq_input_file)
  File "/usr/local/lib/python3.7/site-packages/pyir/igblast.py", line 144, in run_single_process
    total_parsed = parser.parse()
  File "/usr/local/lib/python3.7/site-packages/pyir/parsers.py", line 615, in parse
    self.output = parser.parse(line, self.output, previous_line_whitespace, self.seq_dict)
  File "/usr/local/lib/python3.7/site-packages/pyir/parsers.py", line 92, in parse
    self.hits.append({'gene': matches.group(1), 'bit_score': float(matches.group(2)), 'e_value':float(matches.group(3))})
ValueError: could not convert string to float: 'sapiens|IGHV8|P|V-REGION'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pyir", line 14, in <module>
    py_ir.run()
  File "/usr/local/lib/python3.7/site-packages/pyir/factory.py", line 65, in run
    output_pieces = self.run_pool(input_pieces, fastq_input_pieces, total_seqs)
  File "/usr/local/lib/python3.7/site-packages/pyir/factory.py", line 218, in run_pool
    for x in imap:
  File "/usr/lib64/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
ValueError: could not convert string to float: 'sapiens|IGHV8|P|V-REGION'

Isotype annotation

Hi,

Thanks for creating this software. Is there any functionality for annoting B cell isotypes? If not, are there any plans to add this in future versions?

Thanks

Add license?

For those who find this repository by means other than the BMC Bioinformatics publication, it may be helpful to have a LICENSE file available. I see the publication has the following:

License: Free to academics
Any restrictions to use by non-academics: Yes; non academics should contact the author for permission to use the software or license options for incorporation into software that is being sold for profit.

Annotation for most alpha chains missing, but not for beta

Hi there,

thanks for creating PyIR! I experience a funny issue where TCR beta chains are perfectly annotated out of the box, but alpha chains are not. Alpha chains are always marked as complete_vdj=F, most gene calls are missing and most sequences (fwr1 to junctions) are missing. Is there something obvious I might be doing wrong? All sequences I used came from filtered and complete 10X output.

Many thanks,
Andreas

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.