luyitian / flames

Full-length transcriptome splicing and mutation analysis

License: GNU General Public License v3.0

Python 65.84% C++ 16.80% C 17.36%

flames's Issues

sc_long_pipeline.py--> ValueError: invalid contig `chr1`

Hi, both my genome.fa and gff3 files use contig chr1. Is there support for this format or parameters I can set to solve this error?

Traceback (most recent call last):
  File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 193, in sc_long_pipeline
    raw_gff3=raw_splice_isoform if config_dict["global_parameters"]["generate_raw_isoform"] else None)
  File "PATH/TO/FLAMES/python/sc_longread.py", line 1123, in group_bam2isoform
    it_region = bamfile.fetch(ch, bl.s, bl.e)
  File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 686, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig `chr1`
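
A possible first check for the error above: pysam typically raises `ValueError: invalid contig` from `fetch` when the requested name is not among the reference sequences listed in the BAM header, so it may be worth confirming that the BAM being read was aligned against the same FASTA that the annotation refers to. The sketch below is not part of FLAMES. It compares the `@SQ` names from a SAM header (for example the text printed by `samtools view -H`) with the first column of the annotation. The function names and the toy inputs are made up for illustration.

```python
def sam_header_contigs(header_text):
    """Collect the SN: names from the @SQ lines of a SAM header."""
    names = set()
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def annotation_contigs(lines):
    """Collect column-1 names from GFF3/GTF lines, skipping comments and blanks."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def contigs_missing_from_bam(header_text, annotation_lines):
    """Return annotation contigs that do not appear in the BAM header."""
    return sorted(annotation_contigs(annotation_lines) - sam_header_contigs(header_text))


# Toy inputs (hypothetical): the BAM header uses "1" and "2", the annotation mixes styles.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:1\tLN:1000\n@SQ\tSN:2\tLN:900\n"
gff = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "1\tsrc\tgene\t1\t50\t.\t+\t.\tID=g2\n",
]
print(contigs_missing_from_bam(header, gff))
```

If the list is non-empty for real data, the alignment and the annotation disagree on contig naming, which would be consistent with the traceback.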

Missing mitochondrial transcripts in isoform_annotated.gff3

Hi,

first, thanks a lot for developing FLAMES!

I have one question about the configuration parameters and a problem regarding some missing genes/transcripts in the final FLAMES output and would really appreciate some help.

i) First, I was wondering whether there is any further explanation of the isoform parameters that can be adjusted in the config file. I have an idea about some of them (MAX_DIST, MAX_TS_DIST, Min_sup_cnt, strand_specific), but I would really appreciate a bit more detail about how the others affect the isoform identification step.

ii) Moreover, I noticed that some of the chromosomes/regions in the gene annotation reference I provided are not part of the final FLAMES output. I'm using slightly adapted GTF and FASTA files that contain not only human genes but also some pathogens. However, even though reads map to those genes, not a single transcript isoform for them is written to isoform_annotated.gff3 or transcript_assembly.fa. Also, no mitochondrial transcripts are detected.
I checked the number of reads mapping to those regions in align2genome.bam with samtools idxstats align2genome.bam and, at least for the mitochondrial genes, a lot of reads are mapping.

However, only those seqnames are included in the isoform_annotated.gff3:
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'GL000191.1', 'GL000192.1', 'GL000194.1', 'GL000195.1', 'GL000218.1', 'GL000219.1', 'GL000223.1', 'X', 'Y']

Are they filtered out due to the parameters specified in the configuration or is something else happening here? It would be great to have information about those genes and transcripts as well.

Thanks a lot!

Best,
Kristin

match_cell_barcode - output cell barcode statistics file

Hi, I am using match_cell_barcode for ONT single cell data. I obtained a "whitelist.csv" and "putative_bc.csv" file from the output of BLAZE, at the same time I also have short-read sequencing data on the same library.

However, I am confused about which file should be used for the 2nd argument of match_cell_barcode, that is "output cell barcode statistics file", or as explained in the README, "a file name/path for the statistics of barcode matching".

Can you please help me understand this file? What should it look like (which headers?) and how can I get it?

Thanks in advance!

Cluster annotation file

Hi there,

Thank you for creating this amazing tool!

I am trying to utilize the DTU analysis script from the FLTseq_data directory, and I am just wondering how I can get the cluster_annotation.csv file?

(Line 80-82)

cluster_barcode_anno <- read.csv(file.path(data_dir,"cluster_annotation.csv"), stringsAsFactors=FALSE)
  rownames(cluster_barcode_anno) = cluster_barcode_anno$barcode_seq
  comm_cells = intersect(colnames(tr_sce),rownames(cluster_barcode_anno))

Thank you

error in transcript quantification step

Hi, I am getting this error in the final counts matrix generation step:

does anyone know how to circumvent this issue?

b'[bam_sort_core] merging from 9 files and 12 in-memory blocks...\n'
b''
### generate transcript count matrix 2023-12-08 17:21:24
Traceback (most recent call last):
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 270, in <module>
    bulk_long_pipeline(args)
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 235, in bulk_long_pipeline
    bc_tr_count_dict, bc_tr_badcov_count_dict, tr_kept = parse_realigned_bam(
  File "/users/sparthib/flames/python/count_tr.py", line 114, in parse_realigned_bam
    bc_dict = make_bc_dict(kwargs["bc_file"])
  File "/users/sparthib/flames/python/count_tr.py", line 57, in make_bc_dict
    with open(bc_anno) as f:
FileNotFoundError: [Errno 2] No such file or directory: ''

thanks!

Sowmya

FSM and FSM-match to ref

Ciao Luyi

Thanks again for the nice work.
Could you please explain the difference between FSM (which, following the SQANTI definition, means isoforms that match a reference transcript at all splice junctions) and the FSM_annotation file in the FLAMES output? This annotation file has a column called "FSM-match to ref". If the file is already the list of all FSM isoforms, what does this column tell us?

Thanks
Iman

empty transcript_assembly.fa file using example SIRV data

Hi,
I tried running the example in the example folder of the python installation of FLAMES which gives an error:
subprocess.CalledProcessError: Command '['samtools faidx FLAMES_output/transcript_assembly.fa']' returned non-zero exit status 1.
I noticed that the transcript_assembly.fa file is empty. In the get_transcript_seq function in gff3_to_fa.py, it exits the first for loop right away as the following statement is false: if ch not in chr_to_gene:. However, it also does not enter the next for loop (for tr_seq in global_seq_dict:) because the dictionary is empty. I'd really appreciate your help.

Applying FLAMES to PacBIO

Hello,

I was trying to use FLAMES in an isoform characterization benchmarking study with a single sample but, since I am new to the long-read world, it is not yet clear to me which key parameters I need to consider in the configuration file. After running FLAMES I found my isoform_annotated.filtered.gff3 file almost empty. This is my output data:

        SIZE           DATE              FILE

2444550950 Jul 13 19:28 align2genome.bam
3252184 Jul 13 19:28 align2genome.bam.bai
16 Jul 13 19:43 isoform_annotated.filtered.gff3
15122175 Jul 13 19:32 isoform_annotated.gff3
61 Jul 13 19:43 isoform_FSM_annotation.csv
3534807677 Jul 13 18:11 merged.fastq.gz
59 Jul 13 18:11 pseudo_barcode_annotation.csv
1505577353 Jul 13 19:41 realign2transcript.bam
3221800 Jul 13 19:41 realign2transcript.bam.bai
98666092 Jul 13 19:33 transcript_assembly.fa
2062401 Jul 13 19:33 transcript_assembly.fa.fai
118564 Jul 13 19:42 transcript_count.bad_coverage.csv.gz
186937 Jul 13 19:42 transcript_count.csv.gz
3617886 Jul 13 19:32 tss_tes.bedgraph

My input parameters and data were:
--gff3 gencode.v40.annotation.gtf (human annotations)
--genomefa GRCh38.primary_assembly.genome.fa (human reference genome)
--outdir FLAMES_output/
--fq_dir fastq/ (path to my directory containing my unique fastq file)

I am not using a configuration file, so FLAMES is applying its default parameters, and I guess this is the main problem for me since it is designed for ONT. So my question is: which parameters are best for running an analysis with PacBio files? What are your recommendations?

Below is the config file I used for ONT data. Could you indicate whether correcting these parameters for PacBio is all I need, or whether there are extra parameters to consider?

{
    "pipeline_parameters":{
        "do_genome_alignment":true,
        "do_isoform_identification":true,
        "do_read_realignment":true,
        "do_transcript_quantification":true
    },
    "global_parameters":{
        "generate_raw_isoform":false,
        "has_UMI":false
    },
    "isoform_parameters":{
        "MAX_DIST":10,
        "MAX_TS_DIST":120,
        "MAX_SPLICE_MATCH_DIST":10,
        "min_fl_exon_len":40,
        "Max_site_per_splice":3,
        "Min_sup_cnt":10,
        "Min_cnt_pct":0.001,
        "Min_sup_pct":0.2,
        "strand_specific":0,
        "remove_incomp_reads":5
    },
    "alignment_parameters":{
        "use_junctions":true,
        "no_flank":false
    },
    "realign_parameters":{
        "use_annotation":true
    },
    "transcript_counting":{
        "min_tr_coverage":0.3,
        "min_read_coverage":0.3
    }
}

Thank you very much for your help in advance and my apologies for such basic question!
Best,
AP
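
Before tuning individual values, it can help to confirm that a hand-edited config still parses as JSON and still contains every section, since a pasted or edited file can easily lose a brace or comma. The following is a minimal sketch, not a FLAMES utility. The section names are the ones that appear in the config quoted in this thread, and `check_config` is a hypothetical helper.

```python
import json

# Section names as they appear in the config quoted in this thread.
EXPECTED_SECTIONS = (
    "pipeline_parameters",
    "global_parameters",
    "isoform_parameters",
    "alignment_parameters",
    "realign_parameters",
    "transcript_counting",
)


def check_config(text):
    """Parse config text and return the expected sections that are missing.

    json.loads raises json.JSONDecodeError if the text is not valid JSON.
    """
    config = json.loads(text)
    return [name for name in EXPECTED_SECTIONS if name not in config]


# Made-up fragment with only two of the six sections present.
snippet = '{"isoform_parameters": {"Min_sup_cnt": 10}, "transcript_counting": {"min_tr_coverage": 0.3}}'
print(check_config(snippet))
```

An empty list means all six sections were found. It says nothing about whether the values are sensible for PacBio data.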

bam_mutations.py

Hi,

Thank you for this tool.

I would like to know whether we can run only the mutation analysis, without the full-length transcriptome splicing analysis. I have mapped BAM and barcode files.

Thanks

Using transcript_count.csv.gz matrix with popular analysis tools

I'm struggling to convert the transcript_count.csv.gz matrix to a Seurat or AnnData object. Any help and advice would be appreciated.
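
One way to approach this, sketched with the standard library only: read the gzipped CSV, split identifier columns from count columns, and transpose to a cells-by-transcripts layout. The layout assumed here (the first two columns are identifiers such as transcript_id and gene_id, and every remaining column is one cell barcode) is an assumption that should be checked against the header of your own file. `load_transcript_counts` is a hypothetical helper, not a FLAMES function.

```python
import csv
import gzip


def load_transcript_counts(path, n_id_columns=2):
    """Read a gzipped count CSV into (barcodes, feature_ids, cells-by-features matrix).

    ASSUMPTION: the first `n_id_columns` columns are identifiers and every
    remaining column is one cell barcode. Verify this on your own file.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        barcodes = header[n_id_columns:]
        feature_ids = []
        per_feature = []
        for row in reader:
            feature_ids.append(tuple(row[:n_id_columns]))
            per_feature.append([float(value) for value in row[n_id_columns:]])
    # Transpose so that rows are cells and columns are transcripts.
    matrix = [list(column) for column in zip(*per_feature)]
    return barcodes, feature_ids, matrix
```

With these three pieces, an AnnData object can be built by passing the matrix as `X` to `anndata.AnnData` (after converting it to a NumPy array) and then setting `obs_names` to the barcodes and `var_names` to the transcript IDs.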

No output from match_cell_barcode

Hi,

Thank you for sharing FLAMES!

I think I have successfully run the match_cell_barcode as I got the information pasted below.

However, I didn't get any output, neither matched fastq nor barcode statistic. Can you please comment on that?

Many thanks!
Yanming

Isoform parameters

Hi there,

Thank you so much for this amazing tool!

I am just wondering if it is possible to get a more in-depth explanation of each parameter for the config file e.g. for isoform parameters?

Thank you

Where is your whitelist?

I cannot find filtered_feature_bc_matrix/barcodes.tsv.gz, which is referenced in your script file.

fsm_splice_comp.csv

Hi ,
I am trying to use the tr_classify analysis script from the FLTseq_data directory, and I am wondering how fsm_splice_comp.csv is created. I have run sc_long_pipeline, but this file is not in the output.

(Line 43)

fsm_splice_comp <- read.csv(file.path(data_dir,"fsm_splice_comp.csv"), header=FALSE, stringsAsFactors=FALSE)
Thank you

Source compilation error

Hi @LuyiTian , I am currently using FLAMES for single cell isoform identification and detection. Now I'm at the barcode assignment steps, where I ran the compilation code g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp as shown in the README, but I get the following error:

What is the problem?

single cell full length RNA-seq mutation detection

hi~

I remember that you used FLAMES to detect mutations and plotted them on a UMAP. I was very impressed by this part. However, I noticed that mutation detection is not included in the sc_long_pipeline.py pipeline or the config file, although there is a Python script named "bam_mutation.py". I do not know how to use this script. Can you provide a tutorial on how you did this?

thanks

garfield
2021 12 29

Flanking sequence match

When FLAMES matches the flanking sequence CTACACGACGCTCTTCCGATCT, does it allow matching within an edit distance, or does it have to be an exact match?
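
To make the two options in the question concrete, here is a small sketch of what exact matching versus edit-distance matching of a flanking sequence means. It is not taken from match_cell_barcode and says nothing about which strategy FLAMES uses. It is the textbook dynamic-programming search for the best approximate occurrence of a pattern inside a longer string. The function name and the toy reads are made up.

```python
FLANK = "CTACACGACGCTCTTCCGATCT"


def best_flank_match(read, flank=FLANK):
    """Return (edit_distance, end_index) of the best approximate occurrence of flank in read.

    The first DP row is all zeros, so an occurrence may start anywhere in the read.
    end_index is the exclusive end of the first best occurrence.
    """
    previous = [0] * (len(read) + 1)
    for i, flank_base in enumerate(flank, start=1):
        current = [i] + [0] * len(read)
        for j, read_base in enumerate(read, start=1):
            cost = 0 if flank_base == read_base else 1
            current[j] = min(
                previous[j - 1] + cost,  # match or substitution
                previous[j] + 1,         # flank base missing from the read
                current[j - 1] + 1,      # extra base in the read
            )
        previous = current
    distance = min(previous)
    return distance, previous.index(distance)


# Toy reads (hypothetical): one with an exact flank, one with a single substitution.
exact_read = "GGGG" + FLANK + "TTTT"
noisy_read = "GGGG" + "CTACACGACGCTCTTCGGATCT" + "TTTT"

print(FLANK in exact_read, best_flank_match(exact_read)[0])
print(FLANK in noisy_read, best_flank_match(noisy_read)[0])
```

An exact-match strategy would accept only the first read, while a strategy that tolerates an edit distance of 1 or more would accept both.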

minimap error using sc_long_pipeline.py

Hi @LuyiTian ,
Could you please comment on my error below?

Running code:

for i in test; do /FLAMES/python/sc_long_pipeline.py --gff3 hg38v99.Cellranger.genes.gtf --infq $i.demultiplexed.fq.gz --outdir FLAMES_Output/$i --genomefa hg38v99.Cellranger.genome.fa --config_file /FLAMES/config_sclr_nanopore_default.json --minimap2_dir /Software/anaconda_py2/bin/  >$i.log 2>&1 & done

Error:

Use config file: config_sclr_nanopore_default.json
Parameters in configuration file:
comment : this is the default config for nanopore single cell long read data using 10X RNA-seq kit. use splice annotation in alignment.
global_parameters
	has_UMI : True
	generate_raw_isoform : False
isoform_parameters
	Min_sup_pct : 0.2
	MAX_SPLICE_MATCH_DIST : 10
	random_seed : 666666
	Min_cnt_pct : 0.001
	MAX_DIST : 10
	Min_sup_cnt : 5
	MAX_TS_DIST : 120
	Max_site_per_splice : 3
	strand_specific : -1
	remove_incomp_reads : 4
	min_fl_exon_len : 40
pipeline_parameters
	do_transcript_quantification : True
	do_read_realignment : True
	do_genome_alignment : True
	do_isoform_identification : True
transcript_counting
	min_tr_coverage : 0.4
	min_read_coverage : 0.4
realign_parameters
	use_annotation : True
alignment_parameters
	no_flank : False
	use_junctions : True
output directory not exist, create one:
FLAMES_Output/test
Input parameters:
	gene annotation: hg38v99.Cellranger.genes.gtf
	genome fasta: hg38v99.Cellranger.genome.fa
	input fastq: test.demultiplexed.fq.gz
	output directory: FLAMES_Output/test
	directory contains minimap2: /Software/anaconda_py2/bin/
### align reads to genome using minimap2 2021-01-30 12:48:05

Traceback (most recent call last):
  File "/FLAMES/python/sc_long_pipeline.py", line 213, in <module>
    sc_long_pipeline(args)
  File "/FLAMES/python/sc_long_pipeline.py", line 159, in sc_long_pipeline
    minimap2_align(args.minimap2_dir, args.genomefa, args.infq, tmp_bam, no_flank=config_dict["alignment_parameters"]["no_flank"], bed12_junc=tmp_bed if config_dict["alignment_parameters"]["use_junctions"] else None)
  File "/FLAMES/python/minimap2_align.py", line 37, in minimap2_align
    print subprocess.check_output([align_cmd], shell=True, stderr=subprocess.STDOUT)
  File "/Software/anaconda_py2/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['/Software/anaconda_py2/bin/minimap2 -ax splice -t 12 --junc-bed FLAMES_Output/test/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no hg38v99.Cellranger.genome.fa test.demultiplexed.fq.gz | samtools view -bS -@ 4 -m 2G -o FLAMES_Output/test/tmp.align.bam -  ']' returned non-zero exit status 1

Error during re-alignment

Hello, FLAMES aligns my reads to the reference genome but during realignment I get this error:

### skip aligning reads to genome 2023-12-07 15:57:35
### read gene annotation 2023-12-07 15:57:35
remove similar transcripts in gene annotation: Counter({'duplicated_transcripts': 765})
### find isoforms 2023-12-07 15:59:27
Traceback (most recent call last):
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 270, in <module>
    bulk_long_pipeline(args)
  File "/users/sparthib/flames/python/bulk_long_pipeline.py", line 202, in bulk_long_pipeline
    group_bam2isoform(genome_bam, isoform_gff3, tss_tes_stat, "", chr_to_blocks, gene_dict, transcript_to_junctions, transcript_dict, args.genomefa,
  File "/users/sparthib/flames/python/sc_longread.py", line 1115, in group_bam2isoform
    for c in get_fa(fa_f):
  File "/users/sparthib/flames/python/sc_longread.py", line 45, in get_fa
    for line in open(fn):
  File "/users/sparthib/.conda/envs/FLAMES/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I cloned the flames package from github and this is my environment info:


#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
c-ares                    1.23.0               hd590300_0    conda-forge
ca-certificates           2023.11.17           hbcca054_0    conda-forge
editdistance              0.6.2           py310hc6cd4ac_2    conda-forge
htslib                    1.18                 h81da01d_0    bioconda
k8                        0.2.5                hdcf5f25_4    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.21.2               h659d440_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.9.0           20_linux64_openblas    conda-forge
libcblas                  3.9.0           20_linux64_openblas    conda-forge
libcurl                   8.4.0                hca28451_0    conda-forge
libdeflate                1.18                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgfortran-ng            13.2.0               h69a702a_3    conda-forge
libgfortran5              13.2.0               ha4646dd_3    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
liblapack                 3.9.0           20_linux64_openblas    conda-forge
libnghttp2                1.58.0               h47da74e_0    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libopenblas               0.3.25          pthreads_h413a1c8_0    conda-forge
libsqlite                 3.44.2               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_3    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
minimap2                  2.26                 he4a0461_2    bioconda
ncurses                   6.4                  h59595ed_2    conda-forge
numpy                     1.26.2          py310hb13e2d6_0    conda-forge
openssl                   3.2.0                hd590300_1    conda-forge
pip                       23.3.1             pyhd8ed1ab_0    conda-forge
pysam                     0.22.0          py310h41dec4a_0    bioconda
python                    3.10.13         hd12c33a_0_cpython    conda-forge
python_abi                3.10                    4_cp310    conda-forge
readline                  8.2                  h8228510_1    conda-forge
samtools                  1.18                 h50ea8bc_1    bioconda
setuptools                68.2.2             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2023c                h71feb2d_0    conda-forge
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

Any pointers would be appreciated, thank you!
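
A note on the traceback above: the byte 0x8b at position 1 is the second byte of the gzip magic number (1f 8b), and the failing call is a plain `open(fn)` inside `get_fa`, so the FASTA passed as the genome is quite possibly gzip-compressed. Decompressing it before running the pipeline is probably the simplest way around this. The sketch below shows how such a file can be detected and read in Python. `open_fasta` and `fasta_names` are hypothetical helpers, not part of FLAMES.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_names(path):
    """Yield the sequence name from each FASTA header line."""
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]
```

Calling `list(fasta_names("genome.fa.gz"))` on a compressed file and on its decompressed copy should give the same names.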

transcript id name

Hi, thanks for developing FLAMES, very nice tool.

One question about the transcript_count.csv.gz output, I got the result like this:

The transcript ID names look quite weird. Do you know how to solve this? Thanks!

How to know how many reads are assigned to the barcodes?

By setting the edit distance and with the barcode list, some of the reads should be removed from consideration. How can I know the number of reads that are assigned to the barcodes in flames output?

Transcript information

Hi,

Thank you for developing this nice tool.
I'm applying BLAZE and FLAMES to my single-cell ONT data.
I've gotten useful output, but I need genomic coordinates for each transcript to compare the transcript structures.
However, some of the transcripts in "transcript_count.csv.gz" do not exist in "isoform_annotated.gff3" or "isoform_annotated.filtered.gff3". How can I find the information for these transcripts?

Thank you!

How to use multiple cores

Hi, is there any way to specify the number of cores for the single-cell run so that it executes faster, for example on a dataset with more than 20 million reads?

UMI deduplication in pipeline output?

Hi, I'm just wondering whether the counts table generated by the pipeline already contains UMI-deduplicated counts. If not, how would I go about generating these from the FLAMES output?

In addition, I found that among the transcript IDs for my mouse samples (from pipeline_output/transcript_count.csv), quite a few start with ENSMUSG instead of ENSMUST. Am I correct in thinking that these are gene IDs instead of transcript IDs, and why would that be the case?

Gene name instead of gene_ID

Hello,

Thanks for developing the tool.
I was wondering if there is a way to get the gene name in the output matrix instead of the transcript_ID or the gene_ID?
It would be more convenient for downstream analysis to have the gene_ID to gene_name correspondence.

Thanks for your help.
Rania

run match_cell_barcode: no error, no result

I ran match_cell_barcode /data_RAGE_seq/data1 cell_barcode_stat.txt split_barcode.fastq flame_3M-february-2018.txt 2 without any error, but split_barcode.fastq is empty and no other file is generated.

Compute resource allocation?

Is there a way to define number of cores, RAM usage, etc. for the pipelines?

match_cell_barcode qnames too long

Minimap2/Samtools is throwing an error for reads with appended cell barcode/UMI (generated by match_cell_barcode).

[E::sam_parse1] query name too long
[W::sam_read1_sam] Parse error at line 8760987
samtools sort: truncated file. Aborting

Here is an example qname:

@CTACGGGAGAGCTTTC_CGATAAGACCCA#ACATCGAGTCAAACGG_GCACATCTTGGC#GTAGAGGAGCGGGTTA_AGGCACCTATGT#AGTACTGAGAGTCAGC_CTCAGCCAGTAA#TGTCCCAGTTACCGTA_ATCGTACCAGTC#AATCGTGTCGACATCA_ACTCAAGGCCAT#CGAGAAGGTTCGGCGT_TACGCCAGTCTG#GCTGCAGCACATGGTT_TGATTATGCCTC#CCGTAGGCAGACTGCC_CTCTCGCATACA#TAAGTCGCAGGAGGTT_TAACTATTTACG#TCGTAGATCACTACGA_AGACGCAAATTT#GTCGAATAGGTTACAA_ACAAATTGTTTC#ACAAGCTCAGGCGTTC_CGTTGCCTATAT#GTGCACGAGGATAATC_CAGGAGTCAGAA#AGGATAAAGGTATCTC_CCAATCGCTTTA#GTCATGAGTCCTCCTA_AGCTCAAACACT#GACTTCCCAAAGTATG_GCCCACTTGCTG#TGTACAGTCAACCGAT_TGAAGCATCCAC#TGAGGTTTCAAGGACG_GGACCAAGTCGG#TTACGCCCAGCCATTA_AATCACCGCTCG#ATATCCTCACAATGAA_AATTATCTCTTT#CCACACTCAATAGGGC_CACCTATTTTTT#TCTCTGGCAAACACGG_GCCCCTGCATAG#ATATCCTGTATTCCGA_AATTATGAACTT#TCCCATGGTTGCGGAA_AAATTACAATCC#AGTAGTCTCGTCTCAC_CCATGATTCACG#CTAACCCGTGGCCTCA_ATTTACAGATGA#32fd44aa-9033-40d6-a233-bf43ece68751

Looks like qname must be equal to or shorter than 254 characters: samtools/samtools#1081
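
To find out how many reads are affected before aligning, the read names in the demultiplexed FASTQ can be checked against the 254-character QNAME limit from the SAM specification. This is a minimal sketch that assumes a four-line-per-record FASTQ and takes the name to be the header text after "@" up to the first whitespace. `long_read_names` is a hypothetical helper, not part of FLAMES.

```python
MAX_QNAME_LENGTH = 254  # upper bound on QNAME length in the SAM specification


def long_read_names(lines, limit=MAX_QNAME_LENGTH):
    """Yield (record_number, name_length) for FASTQ records whose name exceeds limit.

    `lines` is any iterable of FASTQ lines, for example an open file handle.
    Records are assumed to span exactly four lines each.
    """
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:
            continue  # only header lines carry the read name
        fields = line[1:].split()
        name = fields[0] if fields else ""
        if len(name) > limit:
            yield line_number // 4 + 1, len(name)


# Toy FASTQ (made up): the first record has a 300-character name.
records = [
    "@" + "A" * 300 + " extra\n", "ACGT\n", "+\n", "IIII\n",
    "@short_name\n", "AC\n", "+\n", "II\n",
]
print(list(long_read_names(records)))
```

For a real file, pass an open handle instead of the list, using `gzip.open(path, "rt")` if the FASTQ is compressed.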

Typo in parse_realigned_bam?

On line 88, read_dict[r][0] will be assigned with (tr, rec.get_tag("AS"), tr_cov, float(rec.query_alignment_length)/rec.infer_read_length(), rec.mapping_quality), contradicting the comment on line 106 # transcript_id, pct_ref, pct_reads.

hit[1] > 0.8 was used on line 119, which would be evaluating alignment score > 0.8.

0.8 seems to be a very low threshold for alignment scores, did you mean to evaluate pct_ref > 0.8 (i.e. hit[2] > 0.8)?
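
The index question can be illustrated with the tuple layout quoted above. The field order below follows that quoted tuple. The field names and the values are made up for illustration, and this is not the code from count_tr.py.

```python
from collections import namedtuple

# Field order copied from the tuple quoted in this issue; the names are hypothetical labels.
Hit = namedtuple("Hit", ["transcript_id", "alignment_score", "tr_cov", "read_cov", "mapq"])

# Made-up record: high alignment score, but only half of the transcript covered.
hit = Hit("tx1", 1500, 0.5, 0.9, 60)

print(hit[1] > 0.8)      # compares the alignment score (AS tag) with 0.8
print(hit[2] > 0.8)      # compares transcript coverage with 0.8
print(hit.tr_cov > 0.8)  # same check as hit[2], with the intent visible in the name
```

With this layout, a coverage threshold applied to index 1 passes almost any aligned read, because alignment scores are usually far above 1.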

Demultiplexing issues

Hi Luyi,

The pipeline is great! Thanks for the effort and for sharing it.

I have tried FLAMES on your published data and our own in-house data, and have two questions:

1. For match_cell_barcode, the "output cell barcode statistics file" always misses the first barcode in the "whitelist" file. Is this a bug?
2. For single-cell long-read data, when the poly-A tail is in the read, match_cell_barcode should search for the barcode and UMI in the suffix of the read instead of the prefix, right? I found some cases where match_cell_barcode still searched and trimmed the prefix.

Looking forward to your feedback.

Thanks,
Yan

Demultiplexing

Can this pipeline also demultiplex reads from cell barcodes?

What's the difference

hi ~
What's the difference between "barcode hm match" and "barcode match"?

Config file parameters

Hi,
I am currently using FLAMES and a few other assemblers (FLAIR and Bookend) to compare them against each other and find out which is best suited to my data and workflow (Drosophila nanopore sequences). The issue I am facing is that my FLAMES-based transcriptomes are surprisingly small (roughly 2,500 isoforms after correction and filtering, against FLAIR's 16,000), even with the same references and sequencing files. I think this may be due to the config file, which I honestly just copied from GitHub. Which parameters would you recommend changing to address this? Would it be appropriate to be less strict, and how would I set that in the config file?

Best,
Hasan.

Run FLAMES directly from an aligned bam file.

Could we run FLAMES directly from a BAM file generated by another demultiplexing tool (e.g. Nanopore/sockeye)? I have tried once, but it failed. It seems that a FASTQ file is required in the realignment step. Could you please give me some advice on how to use FLAMES for isoform analysis when we have no short-read sequencing data? Thanks so much!

Adapt match_cell_barcode to custom Barcode and UMI Length

Hi,

Thanks for developing FLAMES.

I have a specific requirement that involves adapting match_cell_barcode function to accommodate different barcode and UMI lengths. Currently, the software assumes a standard barcode length of 16 and a UMI length of 10, based on 10X kit.
I would like to request if it would be possible to modify these parameters according to my needs, since I’m using a custom single-cell ONT library with same flanking sequence (CTACACGACGCTCTTCCGATCT) but barcode length of 11 and UMI of 14 bp respectively.

The UMI length can be specified on the command line.
I would like your feedback on what I did:

I modified the source code of match_cell_barcode by substituting all ’16’ occurrences with ’11’.
For the UMI I specified by command line my length (14).
Edit distance allowed : 2 (considering that I have a minimum hamming distance of 3 between my custom barcodes).

Are these modifications correct and sufficient in order to have a proper barcode and UMI assignment, or do I have to change something else in the source code of match_cell_barcode?

Sorry in advance, but I'm not an expert in C++.
Best

GTF format error?

Do I need a specific GFF/GTF format?

I am getting this error:

Traceback (most recent call last):
  File "/FLAMES/python/bulk_long_pipeline.py", line 243, in <module>
    bulk_long_pipeline(args)
  File "/FLAMES/python/bulk_long_pipeline.py", line 171, in bulk_long_pipeline
    gff3_to_bed12(args.minimap2_dir, args.gff3, tmp_bed)
  File "/FLAMES/python/minimap2_align.py", line 17, in gff3_to_bed12
    print subprocess.check_output([cmd], shell=True, stderr=subprocess.STDOUT)
  File "/miniconda/lib/python2.7/subprocess.py", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['paftools.js gff2bed /gnet/is6/p04/data/dnaseq/analysis/led13/genomes/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf > /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6310_q10_l300_flames/tmp.splice_anno.bed12']' returned non-zero exit status 1

here is the head of my GTF file:

#gtf-version 2.2
#!genome-build GRCh38
#!genome-build-accession NCBI_Assembly:GCA_000001405.15
#!annotation-date 01/25/2019
#!annotation-source NCBI Homo sapiens Updated Annotation Release 109.20190125
chr1    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";
chr1    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "1";
chr1    BestRefSeq      exon    12613   12721   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "2";
chr1    BestRefSeq      exon    13221   14409   .       +       .       gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1"; exon_number "3";
chr1    BestRefSeq      gene    14362   29370   .       -       .       gene_id "WASH7P"; db_xref "GeneID:653635"; db_xref "HGNC:HGNC:38034"; description "WAS protein family homolog 7, pseudogene"; gbkey "Gene"; gene "WASH7P"; gene_biotype "transcribed_pseudogene

config file

Hi,
Nice work! Congrats!

Two questions:

1- Can I use the config file from the example ("SIRV_config.json") to run my human datasets?

2- Also, I could not activate the environment after installing your software (see the error below). I guess I can still run it without activating the env. Is that correct? I am not getting any error if I run your software without activating the env.

I tried to export the env path and also ran "conda.sh" before running the command with no luck.
Here is what I get if I try to activate the env:

conda activate FLAMES

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

FLAMES vs FLAIR

I'm evaluating FLAMES and FLAIR for my project. Can you comment on the conceptual or algorithmic differences between the two packages? For example, what aspect of FLAMES leads to its increased accuracy in benchmarking?