rabadanlab / arcashla

Fast and accurate in silico inference of HLA genotypes from RNA-seq

License: GNU General Public License v3.0

Shell 2.29% Python 97.40% Dockerfile 0.31%

arcashla's Issues

"arcas reference --version or --update" not working

"arcas reference --version" or "--update" is looking for dat/IMGTHLA, which I could not find in the most recent repo.

Error in ./arcasHLA reference

Hi,

when attempting to switch to IMGT reference 3.24.0 for testing the installation I get the following message:

Traceback (most recent call last):
  File "./scripts/reference.py", line 481, in <module>
    build_fasta()
  File "./scripts/reference.py", line 363, in build_fasta
    utrs, exons, final_exon_length) = process_hla_dat()
  File "./scripts/reference.py", line 127, in process_hla_dat
    lines = file.read().splitlines()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6249748: ordinal not in range(128)

I also tried to grab the reference from the git history, but I get the same error until around version 3.25.0:

./arcasHLA reference --commit 70055402cf42eef5e0d13a1d2ef3b93de0c020f9

Also, when I try to pull the latest version (3.35.0) by commit, I see a different error:

Traceback (most recent call last):
  File "./scripts/reference.py", line 467, in <module>
    build_fasta()
  File "./scripts/reference.py", line 386, in build_fasta
    seq_out, allele_idx, lengths = complete_records(cDNA, other)
  File "./scripts/reference.py", line 308, in complete_records
    offset = i + 1

This occurs until I get to around version 3.33.0. Any ideas what is happening?
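The first traceback above comes from decoding hla.dat with the locale's default codec (ASCII in this environment). A minimal sketch of reading the file with an explicit encoding instead, assuming the file is valid UTF-8; `read_dat_lines` is a hypothetical helper name, not a function from reference.py:

```python
def read_dat_lines(path):
    """Read a text file as UTF-8 regardless of the locale's default codec."""
    # Passing encoding explicitly avoids the ASCII fallback seen under
    # minimal locales (for example LANG=C).
    with open(path, 'r', encoding='UTF-8') as file:
        return file.read().splitlines()
```

Later tracebacks on this page show reference.py opening the file with encoding='UTF-8', which suggests this is the approach that was eventually adopted.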

pseudoalign call to analyze_reads not gathering correct info for kallisto

relevant code:

arcasHLA/scripts/align.py

Line 95 in a138383

num, avg, std = analyze_reads(fqs, paired, reads_file)

Here is the section of the kallisto docs on single-end reads and the -l and -s flags:

In the case of single-end reads, the -l option must be used to specify the average fragment length. Typical Illumina libraries produce fragment lengths ranging from 180–200 bp but it’s best to determine this from a library quantification with an instrument such as an Agilent Bioanalyzer.

Note that analyze_reads is determining the read lengths/sds, not the fragment lengths/sds (which must be determined experimentally, e.g. using an Agilent Bioanalyzer). Read length is tied to the number of cycles on the sequencer, so my expectation is that using the read-length standard deviation implies much less variation than is true (in fact, the current arcasHLA code has to prevent -s 0, which arises because read length is used). Also, read length and fragment length can and often do differ substantially.

I would suggest exposing these arguments to the arcasHLA call, setting the default to -l 200 -s 20, and warning the user if the defaults are used. These defaults or similar are also used in a few other repos (Curse, drugseqr - personal repo, salmon) and similar suggestions have been made elsewhere. I've submitted a PR if you are interested.

One advantage is the time and memory usage is dramatically reduced. Might be nice to redo the analysis you presented in your paper to make sure it doesn't substantially change things. Thank you for the great package!
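The suggestion above could look roughly like the sketch below. It only builds the argument list for a single-end `kallisto pseudo` call and warns when the suggested defaults (-l 200 -s 20) are used. The function name and its parameters are hypothetical and are not taken from align.py or from the PR:

```python
import warnings

DEFAULT_FRAGMENT_LENGTH = 200  # default proposed in the issue text
DEFAULT_FRAGMENT_SD = 20       # default proposed in the issue text


def kallisto_single_end_args(index, out_dir, fastq, length=None, sd=None):
    """Build a `kallisto pseudo` argument list for single-end reads."""
    if length is None or sd is None:
        # Tell the user that the fragment statistics were not measured.
        warnings.warn('fragment length/sd not supplied; using defaults '
                      '-l %d -s %d' % (DEFAULT_FRAGMENT_LENGTH,
                                       DEFAULT_FRAGMENT_SD))
        length = DEFAULT_FRAGMENT_LENGTH if length is None else length
        sd = DEFAULT_FRAGMENT_SD if sd is None else sd
    return ['kallisto', 'pseudo', '-i', index, '-o', out_dir,
            '--single', '-l', str(length), '-s', str(sd), fastq]
```

The returned list can be passed to `subprocess.run`. Values measured on the library should be supplied whenever they are known.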

max() arg is an empty sequence

Hi,

We ran arcasHLA on multiple samples. Only one sample failed. The error is posted below.

Best,
Astrid

[genotype] Pairs by % explained reads:
allele pair explained
C*04:01:81, C*04:339:02 87.94%
C*04:04:02, C*04:339:02 87.94%
Traceback (most recent call last):
  File "./scripts/genotype.py", line 917, in <module>
    args.zygosity_threshold)
  File "./scripts/genotype.py", line 585, in genotype_gene
    zygosity_threshold)
  File "./scripts/genotype.py", line 490, in predict_genotype
    max_prior = max(pair_prior.values())
ValueError: max() arg is an empty sequence
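The ValueError is what `max()` raises for an empty iterable, so `pair_prior` presumably had no entries for this sample. A minimal sketch of a guard, assuming `pair_prior` is a dict of allele pair to prior; `best_prior` is a hypothetical helper, and what genotype.py should do when no prior exists is left to the maintainers:

```python
def best_prior(pair_prior):
    """Return the largest prior, or None when no allele pair has one."""
    # max() raises ValueError on an empty iterable unless default is given.
    return max(pair_prior.values(), default=None)
```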

Error: kallisto index file not found hla_partial.idx

Hi,
I am trying to replicate the test as explained in the README. The genotyping gives the expected results in test.genotype.json, however when I proceed to the partial typing I get the following error:

[alignment] Analyzing read length
[alignment] Pseudoaligning with Kallisto: 

	kallisto pseudo -i arcasHLA-0.2.0/scripts/../dat/ref/hla_partial.idx -t 8 -o /tmp/arcas_c637d791-c0d7-4f56-a472-1f5c0b0a41d0/ test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz

	
	Error: kallisto index file not found arcasHLA-0.2.0/scripts/../dat/ref/hla_partial.idx
	
	
[alignment] Processing pseudoalignment
Traceback (most recent call last):
  File "arcasHLA-0.2.0/scripts/partial.py", line 485, in <module>
    args.threads, True)
  File "arcasHLA-0.2.0/scripts/align.py", line 277, in get_alignment
    exon_combos)
  File "arcasHLA-0.2.0/scripts/align.py", line 171, in process_partial_counts
    with open(count_file,'r', encoding='UTF-8') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/arcas_c637d791-c0d7-4f56-a472-1f5c0b0a41d0/pseudoalignments.tsv'

Indeed the hla_partial.idx file is missing. Should this file have been downloaded with the rest when executing ./arcasHLA reference --version 3.24.0 or is the user expected to create it?

Thanks

Cannot update reference

Hi, I cannot select a reference version using this command:
./arcasHLA reference --version 3.24.0

The error information is here:
Traceback (most recent call last):
  File "/cluster/home/zheyang/biosoft/arcasHLA-master/scripts/reference.py", line 548, in <module>
    build_fasta()
  File "/cluster/home/zheyang/biosoft/arcasHLA-master/scripts/reference.py", line 416, in build_fasta
    utrs, exons, final_exon_length) = process_hla_dat()
  File "/cluster/home/zheyang/biosoft/arcasHLA-master/scripts/reference.py", line 134, in process_hla_dat
    with open(hla_dat, 'r', encoding='UTF-8') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/cluster/home/zheyang/biosoft/arcasHLA-master/scripts/../dat/IMGTHLA/hla.dat'

Updating the reference also fails:
./arcasHLA reference --updata

usage: arcasHLA reference [options]
arcasHLA reference: error: unrecognized arguments: --updata

Create a new release?

Hi, could you create a new release? I'm creating a bioconda package right now and it would be nice to have the latest and greatest version up there...

error: ./arcasHLA reference --rebuild --v

correct command line : ./arcasHLA reference --rebuild -v

Samtools requirement

Encountered a problem using the recommended version of samtools 1.19. Updated to version 1.3.1 and tool ran successfully.

hla.p not found

FileNotFoundError: [Errno 2] No such file or directory: '/arcas/scripts/../dat/ref/hla.p'

I get this error when running arcasHLA genotype.

Reads from alt chromosomes don't seem to be extracted

I've aligned my samples against the hg38 human reference provided by GATK, which includes alt chromosomes such as the HLA loci. Nonetheless, when I run arcasHLA, it only extracts data from chromosome 6, although I saw in the code that it should also extract from those alt chromosomes... for some reason it doesn't seem to be finding them?

Homozygous versus heterozygous with one allele missing

Thanks for the great tool! I ran it on a bunch of RNA-seq samples and so far it seems to be working.

Suppose I receive the following result from arcasHLA:

{
  "A": ["A*02:816", "A*03:01:77"],
  "B": ["B*83:01"],
  "DQB1": ["DQB1*03:03:02", "DQB1*03:94"]
}

I'd like to ask if you might be able to help me understand the answers to a few questions:

Question 1

Is it true that A*02:816 is on the same haplotype as DQB1*03:03:02? In other words, are the alleles of different genes in phase with each other?

Does arcasHLA claim that we have these two chromosomes?

chromosome A:    A*02:816    DQB1*03:03:02

chromosome B:    A*03:01:77  DQB1*03:94

Alternatively, are the genotypes unphased, so arcasHLA does not distinguish the above situation from the following situation?

chromosome A:    A*02:816    DQB1*03:94

chromosome B:    A*03:01:77  DQB1*03:03:02

Question 2

What is the genotype for HLA-B? Is it homozygous with two copies of B*83:01? Or, is it actually heterozygous with one copy of B*83:01 and one copy of unknown genotype?

Is arcasHLA claiming that this is the situation?

chromosome A:    B*83:01

chromosome B:    B*83:01

Or this?

chromosome A:    B*83:01

chromosome B:    Unknown

Question 3

Does arcasHLA offer any way to assess the probability or the confidence for the genotype calls?

I can see this output in the genes.json file:

"A": [7378.0, 1202, 0.10358776417015307]

Which corresponds to this output in the genotype.log file:

[alignment] Observed HLA genes:
                gene          abundance    read count    classes
                HLA-A            10.36%          7378       1202
                HLA-B             6.80%          4807        989
                HLA-C            10.30%          7354       1090

I will try to use these numbers to assess the confidence, but I wonder if you are able to get some calibrated p-value or probability from the kallisto results?

Question 4

Do you have any sense of what the read pileup coverage in a genome browser should look like in order to get accurate genotype calls?

Thank you!

Fastq as input

I understand that the input is an RNA-seq BAM file, from which reads are extracted as .fq and then genotyped. However, is it also possible to use raw (not extracted) FASTQ files?
I tried using this paired end sample: https://www.ncbi.nlm.nih.gov/sra/ERX002711[accn]

and I get this error:
manager@bl8vbox[arcasHLA] ./arcasHLA genotype test/ERR009111_1.fastq.gz test/ERR009111_2.fastq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 test/output/ERR009111
usage: arcasHLA genotype [options] FASTQs or alignment.p file
arcasHLA genotype: error: The format of test/ERR009111_1.fastq.gz is invalid.

Problem with module "genotype"

Dear arcasHLA developer,

I was following the "test" section in the README and did "reference" and "extract". However, when running "genotype" I get warnings and not the expected result. See below:

manager@bl8vbox[arcasHLA] ./arcasHLA reference --version 3.24.0 [ 9:34am]
manager@bl8vbox[arcasHLA] ./arcasHLA extract test/test.bam -o test/output --paired -v

[log] Date: 2019-03-13
[log] Sample: test
[log] Input file: test/test.bam
[log] Read type: paired-end

[extract] Extracting reads from test/test.bam
[extract] Extracting chromosome 6:

samtools view -H -@1 test/test.bam -o /tmp/test.hla.sam

[extract] Extracting chromosome 6:

samtools view -@1 -f 2 test/test.bam 6 >> /tmp/test.hla.sam

[extract] Converting SAM to BAM:

samtools view -Sb -@1 /tmp/test.hla.sam > /tmp/test.hla.bam

[extract] Sorting bam:

samtools sort -n -@1 /tmp/test.hla.bam -o /tmp/test.hla.sorted.bam

[extract] Converting bam to fastq:

bedtools bamtofastq -i /tmp/test.hla.sorted.bam -fq test/output/test.extracted.1.fq -fq2 test/output/test.extracted.2.fq

manager@bl8vbox[arcasHLA] ./arcasHLA genotype test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -o test/output -v

[log] Date: 2019-03-13
[log] Sample: test
[log] Input file(s): test/output/test.extracted.1.fq.gz, test/output/test.extracted.2.fq.gz
[log] Reference: cfb6db3de7f3a7e76d88467271541ff0cc8fbca1

[alignment] Analyzing read length
./scripts/genotype.py:82: UserWarning: genfromtxt: Empty input file: "/tmp/test.reads.txt"
read_lengths = np.genfromtxt(reads_file)
/home/manager/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3118: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/manager/.local/lib/python3.6/site-packages/numpy/core/_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
/home/manager/.local/lib/python3.6/site-packages/numpy/core/_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice
keepdims=keepdims)
/home/manager/.local/lib/python3.6/site-packages/numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide
arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/home/manager/.local/lib/python3.6/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
[alignment] Pseudoaligning with Kallisto:

kallisto pseudo -i dat/ref/hla.idx -t 1 -o /tmp/test/ test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz

[alignment] processing pseudoalignment
[alignment] Processed 0 reads, 0 pseudoaligned to HLA reference
[alignment] 0 reads mapped to a single HLA gene
[alignment] Observed HLA genes:
gene abundance read count classes

[genotype] Genotyping parameters:
population: prior
max iterations: 1000
tolerance: 1e-06
drop iterations: 20
drop threshold: 0.1
zygosity threshold: 0.15

[genotype] Genotyping HLA-A
[genotype] No reads aligned to HLA-A

[genotype] Genotyping HLA-B
[genotype] No reads aligned to HLA-B

[genotype] Genotyping HLA-C
[genotype] No reads aligned to HLA-C

[genotype] Genotyping HLA-DPB1
[genotype] No reads aligned to HLA-DPB1

[genotype] Genotyping HLA-DQA1
[genotype] No reads aligned to HLA-DQA1

[genotype] Genotyping HLA-DQB1
[genotype] No reads aligned to HLA-DQB1

[genotype] Genotyping HLA-DRB1
[genotype] No reads aligned to HLA-DRB1

manager@bl8vbox[arcasHLA] [ 9:36am]

Merge jsons error

./arcasHLA/arcasHLA merge -i genotypeoutputs/ -o mergedoutput --run test

Traceback (most recent call last):
  File "/myhome/arcasHLA/scripts/merge.py", line 173, in <module>
    'genes')
  File "/myhome/arcasHLA/scripts/merge.py", line 92, in process_count
    lines = lines.split('-'*80)[2].split('\n')
IndexError: list index out of range
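The IndexError means the log text split on the 80-dash separator produced fewer than three sections. A defensive sketch that returns an empty list in that case, so a caller can skip the sample rather than crash; `section_lines` is a hypothetical helper, and the real fix in merge.py may differ:

```python
SEPARATOR = '-' * 80  # same delimiter as in the traceback above


def section_lines(log_text, index=2):
    """Return the lines of one separator-delimited section, or [] if absent."""
    sections = log_text.split(SEPARATOR)
    if len(sections) <= index:
        # Fewer sections than expected, e.g. a truncated log or a log
        # written in a different format.
        return []
    return sections[index].split('\n')
```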

docker recipe

The steps to install are quite involved -- may I request a Dockerfile recipe to be hosted on this official repository?

There are a few existing ones already, with some already built and hosted on dockerhub:

But these will eventually lag behind. Some edited version inspired by these could be officially hosted here and verified for the most recent release of arcasHLA; it would be much easier for users to test the tool on their own data and be confident that the latest arcasHLA is being used. You can trigger an autobuild on dockerhub pointing to new updates to this recipe.

Ideally, for academic settings with HPC environments, the container should also run fine after conversion from docker to singularity, but this is more minor. Happy to volunteer to test.

[extract] Error: unable to index bam file.

1607370_log.out.docx

I used two alignment tools, STAR and HISAT2.

The BAM file aligned by STAR worked well, and I got HLA types using arcasHLA.

But I couldn't extract the chr6 reads from the BAM file aligned by HISAT2.

The error message is

[extract] Error: unable to index bam file.

Is this because the HISAT2 BAM file that I made isn't correctly aligned?

The command line is

arcasHLA extract -t 10 --paired --unmapped --log 1607370_log.out --o /home/sunghyepark_lab/test/test_files/arcas /home/sunghyepark_lab/test/test_files/RNA/rawData/HISAT2_aligned/1607370_RT.bam

I attached the log file.

Cannot reproduce test case

I cannot reproduce the test case as described in README.md. Here's how I've run arcas HLA:

# Install requirements
conda create -n arcas-hla-deps coreutils 'bedtools>=2.27.1' biopython git-lfs 'kallisto>=0.44.0' numpy pandas  pigz 'python>=3.6.1' 'samtools>=1.9' scipy
conda activate arcas-hla-deps

# Get latest release
curl -L https://github.com/RabadanLab/arcasHLA/archive/v0.2.0.tar.gz | tar zx

# Obtain reference version required for tests
./arcasHLA-0.2.0/arcasHLA reference --version 3.24.0

# Extract reads
./arcasHLA-0.2.0/arcasHLA extract arcasHLA-0.2.0/test/test.bam -o arcasHLA-0.2.0/test/output --paired -t 8 -v

# Genotyping
./arcasHLA-0.2.0/arcasHLA genotype arcasHLA-0.2.0/test/output/test.extracted.1.fq.gz arcasHLA-0.2.0/test/output/test.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -o arcasHLA-0.2.0/test/output -t 8 -v

This is the output I get:

{
        "A": ["A*03:01:01", "A*01:01:01"],
        "B": ["B*39:39:01", "B*07:02:01"],
        "C": ["C*01:02:01", "C*08:01:01"],
        "DPB1": ["DPB1*02:01:02", "DPB1*14:01:01"],
        "DQA1": ["DQA1*05:03:01", "DQA1*02:01:01"],
        "DQB1": ["DQB1*06:04:01", "DQB1*02:02:01"],
        "DRB1": ["DRB1*03:02:01", "DRB1*10:01:01"]
}

which mismatches on DQB1 and DRB1.

I also tested this on the master branch.

question about flag --unmapped behavior

According to the sample.extract.log, the extract --unmapped --paired appears to be using:

samtools view -f 12 sample.bam 6

Indeed this appears to be the way the command is constructed.

Why require a region of chromosome 6 when both reads are unmapped? Naively, this seems contradictory (an unmapped read shouldn't have a chromosome, right?). I tested this separately on my bam and this exact filter scheme actually causes the output to be empty. In contrast, removing the chr 6 region requirement shows output with both reads unmapped and the chromosome as "*". Would any reads ever be marked as unmapped and assigned to chr6, or is this filter impossible to satisfy?
Why require both reads to be unmapped (the bitwise 12)? What about cases where one read is mapped to chr6 while the other is unmapped?
I have not checked the logic for single-end input.

These may be worth clarifying in the manual/readme so the user is aware for these situations.

time samtools view -@8 -f 12 sample.bam 6 > sample.hla.chr6.bothunmapped.sam &
time samtools view -@8 -f 12 sample.bam > sample.hla.any.bothunmapped.sam &
time samtools view -@8 -f 12 sample.bam '*' > sample.hla.asterisk.bothunmapped.sam &

The first one is relatively fast, but finds nothing.
The second one is slow (probably reads through the whole bam), and outputs a 15GB SAM.
The third one is fast (but assumes unmapped entries are to chr '*'), and the md5sum of the output agrees with the second.
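For readers unfamiliar with the "bitwise 12": the value is the sum of the SAM flag bits 4 (read unmapped) and 8 (mate unmapped), and `samtools view -f` keeps a read only if all requested bits are set. A small sketch that decodes flag values using the bit meanings from the SAM specification; the function names are illustrative only:

```python
# Bit meanings of the FLAG field, per the SAM specification.
SAM_FLAGS = [
    (0x1, 'read paired'),
    (0x2, 'read mapped in proper pair'),
    (0x4, 'read unmapped'),
    (0x8, 'mate unmapped'),
    (0x10, 'read reverse strand'),
    (0x20, 'mate reverse strand'),
    (0x40, 'first in pair'),
    (0x80, 'second in pair'),
    (0x100, 'not primary alignment'),
    (0x200, 'read fails quality checks'),
    (0x400, 'PCR or optical duplicate'),
    (0x800, 'supplementary alignment'),
]


def decode_flag(value):
    """List the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS if value & bit]


def passes_f(record_flag, required):
    """Mimic `samtools view -f`: keep a read only if all required bits are set."""
    return (record_flag & required) == required
```

With these definitions, `decode_flag(12)` lists "read unmapped" and "mate unmapped", and a record with flag 73 (paired, mate unmapped, first in pair) does not pass `-f 12` because the read itself is mapped.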

Sam tools version

Hi, can you please update the samtools version to say 1.9 instead of 1.19. I found out I was having this same issue from reading the closed issue #5. Thanks

Error in reference --update

While I am able to do
./arcasHLA reference --version 3.24.0

running ./arcasHLA reference --update

gives me an error
Traceback (most recent call last):
  File "./scripts/reference.py", line 467, in <module>
    build_fasta()
  File "./scripts/reference.py", line 386, in build_fasta
    seq_out, allele_idx, lengths = complete_records(cDNA, other)
  File "./scripts/reference.py", line 308, in complete_records
    offset = i + 1
UnboundLocalError: local variable 'i' referenced before assignment
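An UnboundLocalError on a line like `offset = i + 1` typically means `i` is a loop variable and the loop body never ran, for instance because the sequence being iterated was empty. Whether that is what happens inside complete_records is an assumption. The sketch below is a hypothetical illustration of the pattern and one way to make it safe, not the code from reference.py:

```python
def offset_after_last(items):
    """Return one past the last index visited, or 0 for an empty input."""
    i = -1  # initialised so the name exists even if the loop never runs
    for i, _ in enumerate(items):
        pass
    return i + 1
```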

Allele specific expression

Dear developers:
Thanks for the excellent tools.
After getting the genotype from arcasHLA, I was wondering whether it is possible to get allele-specific expression using these genotypes.
My actual goal is to quantify HLA allele expression from single-cell Smart-seq2 RNA data.
So far, I have the HLA genotype from bulk RNA-seq.
Is there any way to apply these HLA genotypes to quantify allele-specific HLA expression in scRNA data (assuming the single-cell data are from the same sample)?

Thank you

single cell RNA sequencing

I have been using arcasHLA for bulk RNA and it has been great. Can arcasHLA be applied to single-cell RNA sequencing data too?

Issue related to hardcoded relative path

Hi,

We have installed arcasHLA on our server and it works fine. However, the software uses relative paths to find the scripts in the scripts directory, and likewise for the dat directory. This is a problem because we use environment modules to run software on the server: users load the arcasHLA module to get the arcasHLA executable on their path, but they are not expected to run it from the software directory. We worked around the issue by adding symbolic links to the scripts and dat directories; however, this is only a temporary patch.

It would be highly appreciated if you could replace the hardcoded relative paths so that the software works with a module installation.

On another note, the fact that arcasHLA is calling HLA class II is highly welcomed.

Thanks,
Astrid
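One common way to address the relative-path request above is to derive the data directory from the location of the script file rather than from the current working directory. A minimal sketch, assuming the scripts and dat directories are siblings (as the paths in the tracebacks on this page suggest); `data_dir_for` is a hypothetical name:

```python
from pathlib import Path


def data_dir_for(script_path):
    """Locate the dat directory that sits next to the scripts directory."""
    # resolve() follows symlinks, so this also works when the script is
    # reached through a symlink placed elsewhere on the filesystem.
    script_dir = Path(script_path).resolve().parent
    return script_dir.parent / 'dat'
```

Inside a script this would be called as `data_dir_for(__file__)`, which gives the same answer whatever directory the user runs from.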

Typing of DPA1

Hi there,

Thanks for this tool. Is it possible to use arcasHLA to type DPA1? The documentation contains output for DPB1, but there doesn't seem to be output for the alpha chain.

Kind regards,

Fong

Kallisto with single-end data?

I have a rather large set of single-end data that I'd like to perform HLA typing for and would love to use this tool but I am hung up on one thing. arcasHLA uses Kallisto and for single-end this requires fragment length and std deviation of fragment length. Quickly looking through I see that this is estimated as read length and std deviation of read length in this case. Will this affect results down stream? From what I have found it seems that Kallisto can be rather picky when it comes to these parameters. This is most likely not an issue for a vast majority as it seems paired-end is more common :^(

Typing of MICA

Is this tool able to type MICA? Running

arcasHLA genotype test/output/test.extracted.1.fq.gz test/output/test.extracted.2.fq.gz -g MICA -o test/output -t 8 -v

results in

arcasHLA genotype: error: The gene list MICA is invalid.

Can't install/update references

Hi,

I do not seem to be able to install and run arcasHLA.

I cloned the repository and when I try:

./arcasHLA reference --update

I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/jose.fernandez.navarro/shared/arcasHLA/scripts/../dat/IMGTHLA/wmda/hla_nom_p.txt'

If I git clone IMGTHLA to dat and I try again then I get this error:

[reference] Error: dat/IMGTHLA/hla.dat empty or corrupted.

I get the same error if I try specific versions like 3.24.0 and so on...

SciPy as a dependency

In addition to the listed dependencies, I had to install scipy to get references.py to run. Minor problem but could be worth adding scipy to the list on the readme.

merge.py

I downloaded the new merge.py file and ran arcasHLA.

I think the earlier errors are solved, but a new syntax error appeared.

File "/home/sunghyepark_lab/packages/arcasHLA/scripts/merge.py", line 7
< !DOCTYPE html > <- (I added spaces on both sides to write this issue)
^
SyntaxError: invalid syntax

The merged files were produced as before.

New tagged version please

Hi,

I'm trying out arcasHLA, but apparently the version in bioconda doesn't work due to the git lfs update. Can you please make a new tagged release and update the bioconda recipe? That would help many people a lot.

Also, updating the HLA database versions would be nice. Now it only goes to 3.34.0 (or so) while https://github.com/ANHIG/IMGTHLA is at 3.43.0

Thanks

Missing reference

Hi,

When running genotype I get the following error:

arcasHLA genotype test.extracted.1.fq.gz test.extracted.2.fq.gz

Traceback (most recent call last):
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/genotype.py", line 674, in <module>
    check_ref()
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/reference.py", line 80, in check_ref
    build_convert(False)
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/reference.py", line 462, in build_convert
    p_group = process_hla_nom(hla_nom_p)
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/reference.py", line 232, in process_hla_nom
    for line in open(hla_nom, 'r', encoding='UTF-8'):
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/../dat/IMGTHLA/wmda/hla_nom_p.txt'

The log shows that the error occurs at the git lfs clone command:
git lfs clone https://github.com/ANHIG/IMGTHLA.git /hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/../dat/IMGTHLA/

git: 'lfs' is not a git command. See 'git --help'.

How can I deal with this error?

Thank you!

./arcasHLA reference --update failed

$ ./arcasHLA reference --update       
Traceback (most recent call last):
  File "/.../arcasHLA/scripts/reference.py", line 534, in <module>                                                               
    build_convert(False)
  File "/.../arcasHLA/scripts/reference.py", line 462, in build_convert                                                          
    p_group = process_hla_nom(hla_nom_p)
  File "/.../arcasHLA/scripts/reference.py", line 232, in process_hla_nom                                                        
    for line in open(hla_nom, 'r', encoding='UTF-8'):
FileNotFoundError: [Errno 2] No such file or directory: '/.../arcasHLA/scripts/../dat/IMGTHLA/wmda/hla_nom_p.txt'                

$ ./arcasHLA reference --rebuild
Traceback (most recent call last):
  File "/.../arcasHLA/scripts/reference.py", line 539, in <module>                                                               
    build_convert()
  File "/.../arcasHLA/scripts/reference.py", line 462, in build_convert                                                          
    p_group = process_hla_nom(hla_nom_p)
  File "/.../arcasHLA/scripts/reference.py", line 232, in process_hla_nom                                                        
    for line in open(hla_nom, 'r', encoding='UTF-8'):
FileNotFoundError: [Errno 2] No such file or directory: '/.../arcasHLA/scripts/../dat/IMGTHLA/wmda/hla_nom_p.txt'

interpreting *genes.json output

Hi,

Wondering how to interpret *genes.json output (see below). Thanks!

{"A": [9104.0, 3628, 0.18921475974399068],
 "B": [6499.0, 2754, 0.13618954185635945],
 "C": [6877.0, 2593, 0.14254001726640184],
 "DMA": [184.0, 2, 0.005342202898901628],
 "DMB": [246.0, 4, 0.007088184725790673],
 "DOA": [17.0, 6, 0.0005152037872793143],
 "DPA2": [1.0, 1, 3.113301114046414e-05],
 "DPB1": [290.0, 119, 0.008517302674554006],
 "DQB1": [129.0, 70, 0.0036343754745788493],
 "DRA": [1396.0, 7, 0.04164367848847119],
 "DRB1": [536.0, 55, 0.015270644795199345],
 "DRB5": [216.0, 2, 0.006153841932393766],
 "E": [9707.0, 1359, 0.20435292065495925],
 "F": [1112.0, 25, 0.024376938375165955],
 "HFE": [63.0, 1, 0.0013731531245993252],
 "K": [23.0, 5, 0.0004802117427420722],
 "L": [2298.0, 2, 0.04784808621111001],
 "T": [2586.0, 1, 0.12857038272586738],
 "U": [1.0, 1, 0.00011643110798959293],
 "W": [57.0, 8, 0.0013897097633116797],
 "DRB4": [158.0, 16, 0.004501421413510255],
 "DPA1": [746.0, 158, 0.021742133953775635],
 "DOB": [114.0, 13, 0.003164886468271855],
 "DQA1": [199.0, 68, 0.005913123614617295],
 "DQA2": [1.0, 1, 2.9714189018177362e-05]}
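Judging from the genes.json entry and the matching genotype.log table quoted in the "Homozygous versus heterozygous" issue on this page, each value appears to be [read count, classes, abundance], with the abundance shown in the log as a percentage. A sketch that renders the JSON in that tabular form, under that assumption; `format_genes` is a hypothetical helper, not part of arcasHLA:

```python
import json


def format_genes(genes_json_text):
    """Render genes.json content as tab-separated rows:
    gene, abundance (%), read count, classes."""
    genes = json.loads(genes_json_text)
    rows = []
    for gene, (count, classes, abundance) in sorted(genes.items()):
        rows.append('HLA-%s\t%.2f%%\t%d\t%d'
                    % (gene, abundance * 100, count, classes))
    return rows
```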

Missing File

I was trying to run arcasHLA but received this error:
Traceback (most recent call last):
  File "/home/yzx5896/.conda/envs/arcasHLA/bin/scripts/genotype.py", line 674, in <module>
    check_ref()
  File "/home/yzx5896/.conda/envs/arcasHLA/bin/scripts/reference.py", line 80, in check_ref
    build_convert(False)
  File "/home/yzx5896/.conda/envs/arcasHLA/bin/scripts/reference.py", line 462, in build_convert
    p_group = process_hla_nom(hla_nom_p)
  File "/home/yzx5896/.conda/envs/arcasHLA/bin/scripts/reference.py", line 232, in process_hla_nom
    for line in open(hla_nom, 'r', encoding='UTF-8'):
FileNotFoundError: [Errno 2] No such file or directory: '/home/yzx5896/.conda/envs/arcasHLA/bin/scripts/../dat/IMGTHLA/wmda/hla_nom_p.txt'

Where could I get the IMGTHLA/wmda/hla_nom_p.txt file? I would appreciate your help.

Obtaining RPKM like counts

Hi, I am interested in obtaining RPKM-like data from arcasHLA output. I noticed that one of the output files (*samplename.genes.json) has the following format: HLA-type: [read count, equivalence class, ]

I was wondering if an estimate of expression for a specific allele type can be obtained by:
RPKM = (Read count x equivalence class)/1,000,000

ex: HLA-A:[20000,10000,]
Can HLA-A RPKM = (20000x10000)/1000000 = 200 ?
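For comparison, the standard RPKM definition does not involve the equivalence-class count. It divides the read count by the gene length in kilobases and by the total number of mapped reads in millions. A sketch of that formula follows; note that gene length and total mapped reads are assumed inputs that would have to come from elsewhere, since the genes.json format described above does not list them:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    gene_length_bp and total_mapped_reads are assumed to be supplied by the
    user; they are not fields of genes.json.
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

Whether the arcasHLA read counts are suitable as input for such an estimate is a question for the maintainers.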

Can I use arcasHLA for cancer scRNA-seq data?

Broken: arcasHLA reference --version 3.24.0

$ ./arcasHLA reference --version 3.24.0
[reference] Error: dat/IMGTHLA/hla.dat empty or corrupted.

It appears that the hash for version 3.24.0 in parameters.p is incorrect. Searching the IMGTHLA repo, I cannot find a commit with hash of c5acf7a4342869351b2382b1cc1d1b5763e7e04e: https://github.com/ANHIG/IMGTHLA/search?q=hash%3Ahash%3Ac5acf7a4342869351b2382b1cc1d1b5763e7e04e&type=Commits.

I did find what appears to be a valid commit for 3.24.0 and attempted to use the hash for it, 4a0401af6be02ca688adeef3f63f5e55288d14fe; however, that fails with the same error message.

error when running docker

Hi,

I'm running into this issue when running the docker.

[log] Date: 2020-11-13
[log] Sample: test
[log] Input file(s): test.extracted.1.fq.gz
test.extracted.2.fq.gz
Traceback (most recent call last):
  File "/path/genotype.py", line 697, in <module>
    with open(hla_p, 'rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/path/../dat/ref/hla.p'

I would greatly appreciate any solution to this error.

Thank you so much,

HLA genome types virome data

Hello,

Thank you for making your tool available.

In your manuscript, you state "We established the HLA genotyping ground truth for the Virome dataset", but I could not find your HLA genotype calls anywhere.

Similarly, there is no URL to the manhattan virome data / this data is not in public repos.

I would be grateful if you could please share those with us so we can reproduce / compare to your results

Thank you!

HLA locus relative abundance

Can you provide an additional merge.py file which can generate a tsv table with HLA locus relative abundance?

Error in arcasHLA partial module

Hello,

In my attempt to run 'arcasHLA partial', I got an error because there was no 'database/parameters.p' file as required in line #346 of 'scripts/partial.py'. I tried replacing this with 'dat/info/parameters.p'; the program then runs for a while until it crashes again with the following message:

File "./scripts/partial.py", line 528, in <module>
args.keep_files)
ValueError: not enough values to unpack (expected 3, got 2)

I would appreciate if you could look into this.
Thank you

[extract] Error: unable to index bam file.

Hi, I have a problem similar to an issue that was closed without a resolution, so this is an attempt to reopen it.

The following is my command:
arcasHLA extract --unmapped --log my.log --outdir my_outdir my.bam

I got an error:
[extract] Error: unable to index bam file.

May I know how I can resolve this?
Thank you!

Why is there no install script. This is python.

This is Python. Why can't I install this via pip or conda?

DPA1 typing

Hello,
would it be possible for your tool to type also the DPA1 allele?

Kind regards,
Matthias

corrupted hla.dat?

When I run your tool, I get this error

(py36) [rwarren@hpce704 arcasHLA]$ ./arcasHLA reference --version 3.24.0
[reference] Error: dat/IMGTHLA/hla.dat empty or corrupted.

turns out the file hla.dat isn't empty:

(py36) [arcasHLA]$ cat ./dat/IMGTHLA/hla.dat 
version https://git-lfs.github.com/spec/v1
oid sha256:d6bca31aedfe138f603eb605550d6d2bd5f5206b7cad2646cd191a56d12a2dfc
size 163633161

I tried cloning the IMGTHLA into dat, as suggested in #28

but then, the ./ref/hla.p is missing

thoughts?
Thank you!
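The three lines printed by `cat` above are a Git LFS pointer, not the real hla.dat: the repository was cloned without git-lfs fetching the actual content. A small sketch for detecting that situation before parsing, based only on the pointer header shown above; `is_lfs_pointer` is a hypothetical helper:

```python
LFS_POINTER_PREFIX = b'version https://git-lfs.github.com/spec/v1'


def is_lfs_pointer(path):
    """True if the file is a Git LFS pointer stub rather than real content."""
    with open(path, 'rb') as file:
        # Only the first bytes are needed to recognise the pointer header.
        return file.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If this returns True for dat/IMGTHLA/hla.dat, installing git-lfs and running `git lfs pull` inside dat/IMGTHLA is the usual way to fetch the real file; whether arcasHLA needs further steps afterwards is not something this sketch covers.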

issue with merge

Hi I'm having some trouble merging *genotype.json and *genes.json results via the merge function, can you please point me in the right direction?

[x@y out]$ ll *genotype*json | wc -l 
1640
[x@y out]$ ll *genes*json | wc -l 
1640
[x@y out]$ mkdir -p merged_results; arcasHLA merge --run test -i . -o merged_results
Traceback (most recent call last):
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/merge.py", line 173, in <module>
    'genes')
  File "/hpc/packages/minerva-centos7/arcashla/0.2.0/arcasHLA-0.2.0/scripts/merge.py", line 92, in process_count
    lines = lines.split('-'*80)[2].split('\n')
IndexError: list index out of range
[x@y out]$

Genotype MHC from other species

Hi,
Is it possible to build a reference using MHC alleles from other species? I have the allele sequences, and I wonder which files should I modify in arcasHLA/dat/.

Thanks

determination of single-end vs paired-end data

Note that default behaviour changed with #44 (breaking change).

Old behaviour: set as single-end if one fastq file supplied, paired-end if two supplied.
Change with #44: default is paired-end unless --single flag present.
Rationale: can be two fastq.gz files for single-end and e.g. 4 for paired-end (see kallisto manual)

Let me know if you would like to change it back to the old behaviour. It got added to #44 by mistake when I was updating it to reflect my own preference.
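The new behaviour described above (paired-end unless --single is given, regardless of how many FASTQ files are passed) can be sketched with argparse as follows. The parser here is a hypothetical stand-in, not the actual arcasHLA argument parser:

```python
import argparse


def build_parser():
    """Hypothetical parser illustrating the --single default described above."""
    parser = argparse.ArgumentParser(prog='genotype-sketch')
    parser.add_argument('fastqs', nargs='+', help='one or more FASTQ files')
    parser.add_argument('--single', action='store_true',
                        help='treat input as single-end (default: paired-end)')
    return parser


def is_paired(argv):
    """Paired-end unless --single is present, whatever the file count."""
    args = build_parser().parse_args(argv)
    return not args.single
```

Under the old behaviour the decision would instead have been `len(args.fastqs) == 2`, which cannot express two single-end files.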

Requirements for arcasHLA

Hi,

Thank you for developing arcasHLA. I have a question regarding the requirements for installation of arcasHLA. Will the following dependencies and their versions work (or do they need to be the ones listed in your Github):

kallisto/0.46.2
bedtools2/2.29.2
pigz 2.4.1
samtools 1.10
python/3.8.0

Looking forward to hearing from you.

Thank you.
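When checking versions like these against minimums, the comparison should be numeric per component: as strings, "1.10" sorts before "1.9", but as a version it is newer. A sketch of such a check follows. The minimums used in the usage note are copied from the conda command in the "Cannot reproduce test case" issue on this page, not from an official requirements list, and the helper names are hypothetical:

```python
def version_tuple(text):
    """'1.10' -> (1, 10), so components compare as numbers, not strings."""
    return tuple(int(part) for part in text.split('.'))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)
```

For example, `meets_minimum('1.10', '1.9')` is True for samtools, and `meets_minimum('0.46.2', '0.44.0')` is True for kallisto. This says nothing about whether newer releases behave identically, which only the maintainers or a test run can confirm.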
