
Software pipeline for the analysis of CRISPR-Cas9 genome editing outcomes from sequencing data


crispresso's Introduction

THIS IS AN OLD VERSION OF CRISPRESSO AND IT IS NOW DEPRECATED

PLEASE USE CRISPRESSO2

https://github.com/pinellolab/crispresso2

CRISPResso is a software pipeline for the analysis of targeted CRISPR-Cas9 sequencing data. This algorithm allows for the quantification of both non-homologous end joining (NHEJ) and homology-directed repair (HDR) occurrences.

CRISPResso automates the following steps, summarized in the figure below:

filters low quality reads,
trims adapters,
aligns the reads to a reference amplicon,
quantifies the proportion of HDR and NHEJ outcomes,
quantifies frameshift/inframe mutations (if applicable) and identifies affected splice sites,
produces a graphical report to visualize and quantify the indels distribution and position.

The CRISPResso suite accommodates single or pooled amplicon deep sequencing, WGS datasets and allows the direct comparison of individual experiments. In fact four additional utilities are provided:

CRISPRessoPooled: a tool for the analysis of pooled amplicon experiments
CRISPRessoWGS: a tool for the analysis of WGS data or prealigned reads in .bam format
CRISPRessoCompare: a tool for the comparison of two CRISPResso analyses, useful for example to compare treated and untreated samples or to compare different experimental conditions
CRISPRessoPooledCompare: a tool to compare experiments involving several regions analyzed by either CRISPRessoPooled or CRISPRessoWGS

TRY IT ONLINE!

If you don't like command line tools you can also use CRISPResso online here: http://crispresso.rocks

Installation and Requirements

To install the command line version of CRISPResso, some dependencies must be installed before running the setup:

Python 2.7 Anaconda: http://continuum.io/downloads
Java: http://java.com/download
C compiler / make. For Mac with OSX 10.7 or greater, open the Terminal app and run the command 'make', which will trigger the installation of the OSX developer tools. Windows systems are not officially supported, although CRISPResso may work with Cygwin (https://www.cygwin.com/).

After checking that the required software is installed you can install CRISPResso from the official Python repository following these steps:

Open a terminal window
Type the command:

pip install CRISPResso  --no-use-wheel --verbose

Close the terminal window

Alternatively, if you want to install the package without the pip utility:

Download the setup file: https://github.com/lucapinello/CRISPResso/archive/master.zip and decompress it
Open a terminal window and go to the folder where you have decompressed the zip file
Type the command: python setup.py install
Close the terminal window and open a new one (this is important in order to setup correctly the PATH variable in your system).

The setup will try to install the following software for you:

Trimmomatic (tested with v0.33): http://www.usadellab.org/cms/?page=trimmomatic
FLASH (tested with v1.2.11): http://ccb.jhu.edu/software/FLASH/
Needle from the EMBOSS suite (tested with 6.6.0): ftp://emboss.open-bio.org/pub/EMBOSS/

If the setup fails on your machine, you have to install these tools manually and put their binary files in your PATH!

To check that the installation worked, open a terminal window and execute CRISPResso --help; you should see the help page.
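If the help page does not appear, it can help to check which executables are visible on your PATH. The following is a minimal sketch (run with Python 3) and not part of CRISPResso. The executable names `java`, `flash` and `needle` are assumptions about how the dependencies are named on your system; Trimmomatic is distributed as a Java archive and is therefore not checked here.

```python
import shutil

# Executable names are assumptions for illustration; adjust them to your setup.
EXPECTED_TOOLS = ["CRISPResso", "java", "flash", "needle"]


def find_tools(names=EXPECTED_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = "found at " + path if path else "NOT FOUND on PATH"
        print("{:12s} {}".format(name, status))
```

Any tool reported as not found has to be installed manually or added to the PATH before running CRISPResso.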

The setup will automatically create a folder called CRISPResso_dependencies in your home folder (if this folder is deleted, CRISPResso will not work!). If you want to put the folder in a different location, you need to set the environment variable CRISPRESSO_DEPENDENCIES_FOLDER. For example, to put the folder in /home/lpinello/other_stuff you can write in the terminal BEFORE the installation:

export CRISPRESSO_DEPENDENCIES_FOLDER=/home/lpinello/other_stuff

Docker Image

If you like Docker, we provide a Docker image ready to use, so no installation is required!

https://hub.docker.com/r/lucapinello/crispresso/

To use the image first install Docker: https://docs.docker.com/engine/installation/

Then type the command:

docker pull lucapinello/crispresso

See an example on how to run CRISPResso from a Docker image in the section TESTING CRISPResso below.

OUTPUT

The output of CRISPResso consists of a set of informative graphs that allow for the quantification and visualization of the position and type of outcomes within an amplicon sequence. An example is shown below:

Usage

CRISPResso requires two inputs: (1) paired-end reads (two files) or single-end reads (single file) in .fastq format (fastq.gz files are also accepted) from a deep sequencing experiment and (2) a reference amplicon sequence to assess and quantify the efficiency of the targeted mutagenesis. The amplicon sequence expected after HDR can be provided as an optional input to assess HDR frequency. One or more sgRNA sequences (without PAM sequences) can be provided to compare the predicted cleavage position/s to the position of the observed mutations. Coding sequence/s may be provided to quantify frameshift and potential splice site mutations.

The reads are first filtered based on the quality score (phred33) in order to remove potentially false positive indels. The filtering based on the phred33 quality score can be modulated by adjusting the optional parameters (see additional notes below). Adapters are trimmed from the reads using Trimmomatic (if the --trim_sequences option is set), and then the sequences are merged with FLASH (if using paired-end data). The remaining reads are then aligned with needle from the EMBOSS suite, an optimal global sequence aligner based on the Needleman-Wunsch algorithm that can easily account for gaps. Finally, after analyzing the aligned reads, a set of informative graphs is generated, allowing for the quantification and visualization of the position and type of outcomes within the amplicon sequence.

NHEJ events:

The required inputs are:

Two files for paired-end reads or a single file for single-end reads in fastq format (fastq.gz files are also accepted). The reads are assumed to be already trimmed for adapters. If reads are not trimmed, please use the --trim_sequences option and the --trimmomatic_options_string if you are using an adapter different than Nextera.
The reference amplicon sequence must also be provided.

Example:

CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a AATGTCCCCCAATGGGAAGTTCATCTGGCACTGCCCACAGGTGAGGAGGTCATGATCCCCTTCTGGAGCTCCCAACGGGCCGTGGTCTGGTTCATCATCTGTAAGAATGGCTTCAAGAGGCTCGGCTGTGGTT

HDR events: The required inputs are:

Two files for paired-end reads or a single file for single-end reads in fastq format (fastq.gz files are also accepted). The reads are assumed to be already trimmed for adapters.
The reference amplicon sequence.
The expected amplicon sequence after HDR must also be provided.

Example:

CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a GCTTACACTTGCTTCTGACACAACTGTGTTCACGAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGAATGCCGTCACCACCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGA -e GCTTACACTTGCTTCTGACACAACTGTGTTCACGAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGTGGAAAAAAACGCCGTCACGACGTTATGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGA

IMPORTANT: You must input the entire reference amplicon sequence ('Expected HDR Amplicon sequence' is the reference for the sequenced amplicon, not simply the donor sequence). If only the donor sequence is provided, an error will result.
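When many samples are analyzed, the NHEJ and HDR command lines above can be assembled programmatically. The sketch below only builds the argument list from the flags documented in this README (-r1, -r2, -a, -e, -g); the function name `build_crispresso_command` is hypothetical and not part of CRISPResso.

```python
def build_crispresso_command(amplicon_seq, fastq_r1, fastq_r2=None,
                             expected_hdr_amplicon_seq=None, guide_seq=None):
    """Return the CRISPResso command as a list of arguments.

    Only flags described in this README are used. Omitting fastq_r2
    gives a single-end run; passing expected_hdr_amplicon_seq turns
    the NHEJ-style call into an HDR-style call.
    """
    cmd = ["CRISPResso", "-r1", fastq_r1]
    if fastq_r2:
        cmd += ["-r2", fastq_r2]
    cmd += ["-a", amplicon_seq]
    if expected_hdr_amplicon_seq:
        # Must be the full expected amplicon, not only the donor sequence.
        cmd += ["-e", expected_hdr_amplicon_seq]
    if guide_seq:
        cmd += ["-g", guide_seq]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where CRISPResso is installed.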

Understanding the parameters of CRISPResso

Required parameters: To run CRISPResso, only 2 parameters are required for single end reads, or 3 for paired end reads:

-r1 or --fastq_r1: This parameter allows for the specification of the first fastq file.

-r2 or --fastq_r2: This parameter allows for the specification of the second fastq file for paired end reads.

-a or --amplicon_seq: This parameter allows the user to enter the amplicon sequence used for the experiment.

Optional parameters: In addition to the required parameters explained in the previous section, several optional parameters can be adjusted to tweak your analysis, and to ensure CRISPResso analyzes your data in the best possible way.

-g or --guide_seq: This parameter allows for the specification of the sgRNA sequence. If more than one sequence is included, please separate them by commas. If the guide RNA sequence is entered, then the position of the guide RNA and the cleavage site will be indicated on the output analysis plots. Note that the sgRNA needs to be input as the guide RNA sequence (usually 20 nt) immediately 5' of the PAM sequence (usually NGG for SpCas9). If the PAM is found on the opposite strand with respect to the Amplicon Sequence, ensure the sgRNA sequence is also found on the opposite strand. The CRISPResso convention is to depict the expected cleavage position using the value of the parameter cleavage_offset nt 3' from the end of the guide. In addition, the use of alternate nucleases to SpCas9 is supported. For example, if using the Cpf1 system, enter the sequence (usually 20 nt) immediately 3' of the PAM sequence and explicitly set the cleavage_offset parameter to 1, since the default setting of -3 is suitable only for SpCas9. (default: None)

-e or --expected_hdr_amplicon_seq: This parameter allows for the specification of the amplicon sequence expected after HDR. If the data to be analyzed were derived from an experiment using a donor repair template for homology-directed repair (HDR for short), then you have the option to input the sequence of the expected HDR amplicon. This sequence is necessary for CRISPResso to be able to identify successful HDR events within the sequencing data.

--hdr_perfect_alignment_threshold: Sequence homology percentage for an HDR occurrence (default: 98.0). This parameter allows the user to set a threshold for sequence homology for CRISPResso to count instances of successful HDR. This is useful to improve the analysis by allowing some tolerance for technical artifacts present in the sequencing data, such as sequencing errors or single nucleotide polymorphisms (SNPs) in the cells used in the experiment. Therefore, if you have a read that exhibits successful HDR but has a SNP or sequencing error within the amplicon, you can lower the sequence homology in order to allow CRISPResso to count the read as a successful HDR event. If the data are completely free of sequencing errors or polymorphisms, then consider setting this parameter to 100.

-d or --donor_seq: This parameter allows the user to highlight the critical subsequence of the expected HDR amplicon in plots. This parameter does not have any effect on the quantification of HDR events.

-c or --coding_seq: This parameter allows for the specification of the subsequence/s of the amplicon sequence covering one or more coding sequences for the frameshift analysis. If more than one (for example, split by intron/s), please separate by comma. (default: None)
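The idea behind the frameshift analysis can be illustrated with a small sketch. It assumes the usual rule that a net indel length that is not a multiple of 3 shifts the reading frame, and it ignores where the indel falls relative to the coding sequence; the function `classify_indel` is hypothetical and is not CRISPResso's own implementation.

```python
def classify_indel(inserted_bp, deleted_bp):
    """Classify a read by the net length change within a coding sequence.

    inserted_bp and deleted_bp are the total numbers of inserted and
    deleted bases observed in the coding subsequence of one read.
    """
    if inserted_bp == 0 and deleted_bp == 0:
        return "unmodified"
    net_change = inserted_bp - deleted_bp
    # A net change that is a multiple of 3 preserves the reading frame.
    return "in-frame" if net_change % 3 == 0 else "frameshift"
```

For example, a 3 bp deletion is classified as in-frame, while a 1 bp insertion is classified as a frameshift.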

-q or --min_average_read_quality: This parameter allows for the specification of the minimum average quality score (phred33) to include a read for the analysis (default: 0, minimum: 0, maximum: 40). This parameter is helpful to filter out low quality reads. If filtering based on average base quality is desired, a reasonable value for this parameter is greater than 30.

-s or --min_single_bp_quality: This parameter allows for the specification of the minimum single bp score (phred33) to include a read for the analysis (default: 0, minimum: 0, maximum: 40). This parameter is helpful to filter out low quality reads. This filtering is more aggressive, since any read with a single bp below the threshold will be discarded. If you want to filter your reads based on single base quality to have very high quality reads, a reasonable value for this parameter is greater than 20.
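To make the difference between the two quality filters concrete, here is a minimal sketch of how an average-quality threshold and a single-bp threshold act on a phred33-encoded quality string. The function names are hypothetical; this illustrates the concept and is not CRISPResso's own filtering code.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (phred33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality(quality_string, min_average=0, min_single_bp=0):
    """Return True if the read passes both thresholds.

    min_average mirrors the idea of -q (average score of the read);
    min_single_bp mirrors the idea of -s (every base must reach it).
    """
    scores = phred33_scores(quality_string)
    if not scores:
        return False
    average = sum(scores) / float(len(scores))
    return average >= min_average and min(scores) >= min_single_bp
```

With the quality string `III!` (three bases of score 40 and one of score 0) the average is 30, so the read passes an average threshold of 30 but fails any single-bp threshold above 0. This is why the single-bp filter is the more aggressive of the two.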

--min_identity_score: This parameter allows for the specification of the minimum identity score for the alignment (default: 60.0). In order for a read to be considered properly aligned, it should pass this threshold. We suggest lowering this threshold only if really large insertions or deletions are expected in the experiment (>40% of the amplicon length).
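The link between the 60.0 default and the 40% indel guideline can be seen with a small sketch. It assumes that the identity score is the percentage of alignment columns in which read and reference carry the same base, which is how global aligners commonly report identity; the function `identity_percent` is hypothetical.

```python
def identity_percent(aligned_read, aligned_ref):
    """Percentage of alignment columns where both sequences have the same base.

    Both arguments are rows of a global alignment of equal length,
    with '-' marking gaps.
    """
    if len(aligned_read) != len(aligned_ref) or not aligned_ref:
        raise ValueError("alignment rows must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(aligned_read, aligned_ref)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_ref)
```

Under this assumption, a read that deletes 4 of 10 reference bases (`ACGTAC----` against `ACGTACGTAC`) scores exactly 60.0, so larger deletions would fall below the default threshold.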

-n or --name: This parameter allows for the specification of the output name of the report (default: the name is obtained from the filename of the fastq file/s used as input).

-o or --output_folder: This parameter allows for the specification of the output folder to use for the analysis (default: current folder).

--split_paired_end: Splits a single fastq file containing paired end reads into two files before running CRISPResso (default: False). If you got your data from the MGH sequencing core in Boston (https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/crispr_sequencing_main.jsp), you need this option!

--trim_sequences: This parameter enables the trimming of Illumina adapters with Trimmomatic (default: False)

--trimmomatic_options_string: This parameter allows the user the ability to override options for Trimmomatic (default: ILLUMINACLIP:/Users/luca/anaconda/lib/python2.7/site-packages/CRISPResso-0.8.0-py2.7.egg/CRISPResso/data/NexteraPE-PE.fa:0:90:10:0:true). This parameter is useful to specify different adaptor sequences used in the experiment if you need to trim them.

--min_paired_end_reads_overlap: This parameter allows for the specification of the minimum required overlap length between two reads to provide a confident overlap during the merging step. (default: 4, minimum: 1, max: read length)

--hide_mutations_outside_window_NHEJ: This parameter allows the user to visualize only the mutations overlapping the window around the cleavage site and used to classify a read as NHEJ. This parameter has no effect on the quantification of NHEJ. With the default setting (False), all mutations are visualized including those that do not overlap the window, even though these are not used to classify a read as NHEJ. It may be desirable in certain cases to hide pre-existing and known mutations or sequencing errors outside the window and hence not used for quantification of NHEJ events (default: False).

-w or --window_around_sgrna: This parameter allows for the specification of a window(s) in bp around each sgRNA to quantify the indels. The window is centered on the predicted cleavage site specified by each sgRNA. Any indels not overlapping or substitutions not adjacent to the window are excluded. A value of 0 will disable this filter (default: 1). This parameter is important since sequencing artifacts and/or SNPs can lead to false positives or false negatives in the quantification of indels and HDR occurrences. Therefore, the user can choose to create a window around the predicted double strand break site of the nuclease used in the experiment. This can help limit sequencing or amplification errors or non-editing polymorphisms from being inappropriately quantified in CRISPResso analysis. Note: any indels that fully or partially overlap the window will be quantified.
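The overlap rule can be sketched as follows. This is an illustration only: it assumes that a window of size w covers w bp on each side of the predicted cut, and it describes the cut by the 0-based index of the first base to its right. The function `overlaps_window` and these coordinate conventions are assumptions, not CRISPResso's internal representation.

```python
def overlaps_window(event_start, event_end, cut_index, window=1):
    """Return True if a mutation overlaps the window around the cut.

    The event covers reference positions [event_start, event_end)
    (0-based, half-open). The cut lies between cut_index - 1 and
    cut_index, and the window is assumed to span
    [cut_index - window, cut_index + window).
    """
    if window == 0:
        # A value of 0 disables the filter, so every event is kept.
        return True
    window_start = cut_index - window
    window_end = cut_index + window
    # Partial overlap is enough for the event to be counted.
    return event_start < window_end and event_end > window_start
```

With a cut at index 11 and the default window of 1, a deletion of positions 8 to 10 is counted because it touches the window, whereas a deletion of positions 12 to 14 is not.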

--cleavage_offset: This parameter allows for the specification of the cleavage offset to use with respect to the provided sgRNA sequence. Remember that the sgRNA sequence must be entered without the PAM. The default is -3 and is suitable for the SpCas9 system. For alternate nucleases, other cleavage offsets may be appropriate; for example, if using Cpf1 set this parameter to 1. (default: -3, minimum:1, max: reference amplicon length). Note: any large indel that partially overlaps the window will also be fully quantified.
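As an illustration of the convention described for -g and --cleavage_offset, the sketch below locates a guide on either strand of the amplicon and reports the predicted cut as a boundary index (the cut lies between positions index - 1 and index, 0-based, on the amplicon as given). The function names and the boundary-index convention are assumptions made for this example; CRISPResso may represent the position differently.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def predicted_cut_sites(amplicon, guide, cleavage_offset=-3):
    """List (strand, boundary_index) for every occurrence of the guide.

    The offset is counted from the 3' end of the guide, so -3 places
    the cut 3 nt upstream of the PAM-proximal end (SpCas9 default).
    """
    amplicon, guide = amplicon.upper(), guide.upper()
    sites = []
    start = amplicon.find(guide)
    while start != -1:
        guide_end = start + len(guide)  # index just past the guide's 3' end
        sites.append(("+", guide_end + cleavage_offset))
        start = amplicon.find(guide, start + 1)
    rc_guide = reverse_complement(guide)
    start = amplicon.find(rc_guide)
    while start != -1:
        # On the reverse strand the guide's 3' end maps to the match start.
        sites.append(("-", start - cleavage_offset))
        start = amplicon.find(rc_guide, start + 1)
    return sites
```

An empty list means the guide was not found on either strand, which usually indicates that the PAM was included in the guide or that the wrong strand or amplicon was supplied.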

--exclude_bp_from_left: Exclude bp from the left side of the amplicon sequence for the quantification of the indels (default: 15). This parameter is helpful to avoid artifacts due to imperfect trimming of the reads.

--exclude_bp_from_right: Exclude bp from the right side of the amplicon sequence for the quantification of the indels (default: 15). This parameter is helpful to avoid artifacts due to imperfect trimming of the reads.

--ignore_substitutions: Ignore substitution events for the quantification and visualization (default: False).

--ignore_insertions: Ignore insertion events for the quantification and visualization (default: False).

--ignore_deletions: Ignore deletion events for the quantification and visualization (default: False).

--needle_options_string: This parameter allows the user to override options for the Needle aligner (default: -gapopen=10 -gapextend=0.5 -awidth3=5000). More information on the meaning of these parameters can be found in the needle documentation (http://embossgui.sourceforge.net/demo/manual/needle.html). We suggest that only experienced users modify these values.

--keep_intermediate: This parameter allows the user to keep all the intermediate files (default: False). We suggest keeping this parameter disabled for most applications, since the intermediate files (processed reads and alignments) can be really large.

--dump: This parameter allows the user to dump numpy arrays and pandas dataframes to file for debugging purposes (default: False).

--save_also_png: This parameter allows the user to also save .png images when creating the report, in addition to .pdf files.

-p or --n_processes: Specify the number of processes to use for the quantification. This parameter is useful to speed up the quantification and generation of the mutation profiles when multiple CPUs are available. Please use with caution, since increasing this parameter will significantly increase the memory required to run CRISPResso (default: 1).

Troubleshooting:

It is important to check if your reads are trimmed or not. CRISPResso assumes that the reads are already trimmed! If reads are not trimmed, use the option --trim_sequences. The default adapter file used is the Nextera. If you want to specify a custom adapter use the option --trimmomatic_options_string.
It is possible to use CRISPResso with single end reads. In this case, just omit the option -r2 to specify the second fastq file.
It is possible to filter based on read quality before aligning reads using the option -q. A reasonable value for this parameter (phred33) is 30.
The command line CRISPResso tool for use on Mac computers requires OS 10.7 or greater. It also requires that command line tools are installed on your machine. After the installation of Anaconda, open the Terminal app and type make; this should prompt you to install command line tools (requires internet connection).
Once installed, simply typing CRISPResso into any new terminal should load CRISPResso (you will be greeted by the CRISPResso cup).
Paired end sequencing files are assumed to contain overlapping sequences (at least 1 bp). If they do not, run CRISPResso on each fastq file of the pair separately in single-end mode.
Use the following command to get to your folder (directory) with sequencing files, assuming that is /home/lpinello/Desktop/CRISPResso_Folder/Sequencing_Files_Folder: cd /home/lpinello/Desktop/CRISPResso_Folder/Sequencing_Files_Folder
CRISPResso’s default setting is to output analysis files into your current directory; to change this, use the --output_folder parameter.

TESTING CRISPResso

Download the two fastq files:

Open a terminal and go to the folder where you have stored the files
Type:

CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a AATGTCCCCCAATGGGAAGTTCATCTGGCACTGCCCACAGGTGAGGAGGTCATGATCCCCTTCTGGAGCTCCCAACGGGCCGTGGTCTGGTTCATCATCTGTAAGAATGGCTTCAAGAGGCTCGGCTGTGGTT -g TGAACCAGACCACGGCCCGT

CRISPResso will create a folder with the processed data and the figures.

If you use a Docker image instead run with the following command:

docker run -v ${PWD}:/DATA -w /DATA  -i lucapinello/crispresso CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a AATGTCCCCCAATGGGAAGTTCATCTGGCACTGCCCACAGGTGAGGAGGTCATGATCCCCTTCTGGAGCTCCCAACGGGCCGTGGTCTGGTTCATCATCTGTAAGAATGGCTTCAAGAGGCTCGGCTGTGGTT -g TGAACCAGACCACGGCCCGT

If you run Docker on Windows you have to specify the full path:

docker run -v //c/Users/luca/Downloads:/DATA -w /DATA lucapinello/crispresso CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a AATGTCCCCCAATGGGAAGTTCATCTGGCACTGCCCACAGGTGAGGAGGTCATGATCCCCTTCTGGAGCTCCCAACGGGCCGTGGTCTGGTTCATCATCTGTAAGAATGGCTTCAAGAGGCTCGGCTGTGGTT -g TGAACCAGACCACGGCCCGT

Useful tips

The log of the external utilities called is stored in the file CRISPResso_RUNNING_LOG.txt
You can specify the output folder with the option --output_folder
You can inspect intermediate files with the option --keep_intermediate
All the processed raw data used to generate the figures are available in the following plain text files:
- Mapping_statistics.txt: number of reads in input, reads after preprocessing (merging or quality filtering) and reads properly aligned.
- Quantification_of_editing_frequency.txt: quantification of editing frequency (number of reads aligned, reads with NHEJ, reads with HDR, and reads with mixed HDR-NHEJ). In addition to each of these categories we also provide an overall report summarizing the total numbers of insertions, deletions and substitutions.
- Alleles_frequency_table.txt: number of reads and percentage for each allele discovered in the sequencing data.
- Frameshift_analysis.txt: number of modified reads with frameshift, in-frame and noncoding mutations.
- Splice_sites_analysis.txt: number of reads corresponding to potential affected splicing sites.
- effect_vector_combined.txt: location of mutations (including deletions, insertions, and substitutions) with respect to the reference amplicon.
- effect_vector_deletion.txt: location of deletions.
- effect_vector_insertion.txt: location of insertions.
- effect_vector_substitution.txt: location of substitutions.
- position_dependent_vector_avg_insertion_size.txt: average length of the insertions for each position.
- position_dependent_vector_avg_deletion_size.txt: average length of the deletions for each position.
- indel_histogram.txt: processed data used to generate figure 1 in the output report.
- insertion_histogram.txt: processed data used to generate the insertion histogram in figure 3 in the output report.
- deletion_histogram.txt: processed data used to generate the deletion histogram in figure 3 in the output report.
- substitution_histogram.txt: processed data used to generate the substitution histogram in figure 3 in the output report.

Explore the output of CRISPResso

To help you familiarize yourself with the output of CRISPResso, we provide several precomputed analyses, using the standard settings, for different simulated sequencing datasets with sequencing artifacts modeled after the Illumina MiSeq platform (using the ART simulation tool: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/ ) and with known editing efficiency and mutagenesis profile:

1000 unmodified reads: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_unmodified_amplicon_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with 1 substitution: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_1_substitution_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with 2 substitutions: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_2_substitution_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with 3 substitutions: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_3_substitution_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with an insertion of 5 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_5_ins_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with an insertion of 10 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_10_ins_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with an insertion of 50 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_50_ins_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with a deletion of 5 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_5_del_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with a deletion of 10 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_10_del_MISEQ_ERROR_WINDOW_1bp.zip
1000 unmodified reads, 1000 reads with a deletion of 50 bp: http://crispresso.rocks/static/examples/CRISPResso_on_SIMULATION_amplicon_50_del_MISEQ_ERROR_WINDOW_1bp.zip

Installation and usage of CRISPRessoPooled

CRISPRessoPooled is a utility to analyze and quantify targeted sequencing CRISPR/Cas9 experiments involving sequencing libraries with pooled amplicons. One common experimental strategy is to pool multiple amplicons (e.g. a single on-target site plus a set of potential off-target sites) into a single deep sequencing reaction. Briefly, genomic DNA samples for pooled applications can be prepared by first amplifying the target regions for each gene/target of interest, with regions of 150-400 bp depending on the desired coverage. In a second round of PCR, with minimized cycle numbers, barcodes and adaptors are added. With optimization, these two rounds of PCR can be merged into a single reaction. These reactions are then quantified, normalized, pooled, and quality-controlled before being sequenced. CRISPRessoPooled demultiplexes reads from multiple amplicons and runs the CRISPResso utility with appropriate reads for each amplicon separately.

Installation

CRISPRessoPooled is installed automatically during the installation of CRISPResso, but to use it two additional programs must be installed:

samtools: http://samtools.sourceforge.net/
bowtie2: http://bowtie-bio.sourceforge.net/bowtie2

To install these tools please refer to their documentation.

Usage

This tool can run in 3 different modes:

Amplicons mode: Given a set of amplicon sequences, in this mode the tool demultiplexes the reads, aligning each read to the amplicon with best alignment, and creates separate compressed FASTQ files, one for each amplicon. Reads that do not align to any amplicon are discarded. After this preprocessing, CRISPResso is run for each FASTQ file, and separated reports are generated, one for each amplicon.

To run the tool in this mode the user must provide:

Paired-end reads (two files) or single-end reads (single file) in FASTQ format (http://en.wikipedia.org/wiki/FASTQ_format); fastq.gz files are also accepted.
A description file containing the amplicon sequences used to enrich regions in the genome and some additional information. In particular, this file is a tab delimited text file with up to 5 columns (first 2 columns required):

AMPLICON_NAME: an identifier for the amplicon (must be unique).
AMPLICON_SEQUENCE: amplicon sequence used in the design of the experiment.
sgRNA_SEQUENCE (OPTIONAL): sgRNA sequence used for this amplicon without the PAM sequence. If not available, enter NA.
EXPECTED_AMPLICON_AFTER_HDR (OPTIONAL): expected amplicon sequence in case of HDR. If more than one, separate by commas and not spaces. If not available, enter NA.
CODING_SEQUENCE (OPTIONAL): Subsequence(s) of the amplicon corresponding to coding sequences. If more than one, separate by commas and not spaces. If not available, enter NA.

A file in the right format should look like this:

Site1 CACACTGTGGCCCCTGTGCCCAGCCCTGGGCTCTCTGTACATGAAGCAAC CCCTGTGCCCAGCCC NA NA

Site2 GTCCTGGTTTTTGGTTTGGGAAATATAGTCATC NA GTCCTGGTTTTTGGTTTAAAAAAATATAGTCATC NA

Site3 TTTCTGGTTTTTGGTTTGGGAAATATAGTCATC NA NA GGAAATATA

Note: no column titles should be entered. Also, the colors here are used only for illustrative purposes and will not be present in a saved plain text file.

The user can easily create this file with any text editor or with spreadsheet software like Excel (Microsoft), Numbers (Apple) or Sheets (Google Docs) and then save it as tab delimited file.
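The description file can also be generated from a script, which avoids stray spaces or missing NA placeholders. The following is a minimal sketch using Python's standard `csv` module; the function name `write_amplicon_description` is hypothetical, and the column order follows the five columns listed above.

```python
import csv


def write_amplicon_description(path, rows):
    """Write a tab delimited amplicon description file without a header.

    Each row is (name, amplicon_seq[, sgrna[, expected_hdr[, coding_seq]]]).
    Missing or empty optional fields are written as NA.
    """
    seen_names = set()
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for row in rows:
            name, sequence = row[0], row[1]
            if name in seen_names:
                raise ValueError("amplicon names must be unique: " + name)
            seen_names.add(name)
            optional = list(row[2:5])
            optional += [None] * (3 - len(optional))
            writer.writerow([name, sequence] +
                            [value if value else "NA" for value in optional])
```

For example, passing the row `("Site2", "GTCCTGGTTTTTGGTTTGGGAAATATAGTCATC", None, "GTCCTGGTTTTTGGTTTAAAAAAATATAGTCATC")` writes the second line of the example file shown above, with NA in the sgRNA and coding sequence columns.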

Example:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt --name ONLY_AMPLICONS_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Amplicons mode consists of:

REPORT_READS_ALIGNED_TO_AMPLICONS.txt: this file contains the same information provided in the input description file, plus some additional columns:
1. Demultiplexed_fastq.gz_filename: name of the files containing the raw reads for each amplicon.
2. n_reads: number of reads recovered for each amplicon.
A set of fastq.gz files, one for each amplicon.
A set of folders, one for each amplicon, containing a full CRISPResso report.
SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).
CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

Genome mode: In this mode the tool aligns each read to the best location in the genome. Then potential amplicons are discovered looking for regions with enough reads (the default setting is to have at least 1000 reads, but the parameter can be adjusted with the option --min_reads_to_use_region). If a gene annotation file from UCSC is provided, the tool also reports the overlapping gene/s to the region. In this way it is possible to check if the amplified regions map to expected genomic locations and/or also to pseudogenes or other problematic regions. Finally CRISPResso is run in each region discovered.
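The read-count threshold can be pictured with a simplified sketch. It assumes that each aligned read has already been assigned to a region, given as a (chromosome, start, end) tuple. The real tool has to derive regions from the bowtie2 alignments, which this example does not attempt. The function `regions_with_enough_reads` is hypothetical.

```python
from collections import Counter


def regions_with_enough_reads(read_regions, min_reads=1000):
    """Keep only regions supported by at least min_reads reads.

    read_regions is an iterable with one (chr_id, bpstart, bpend)
    tuple per aligned read; the result maps each retained region
    to its read count.
    """
    counts = Counter(read_regions)
    return {region: n for region, n in counts.items() if n >= min_reads}
```

With the default of 1000, a region covered by 1200 reads is kept while one covered by 10 reads is dropped, mirroring the role of --min_reads_to_use_region.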

To run the tool in this mode the user must provide:

Paired-end reads (two files) or single-end reads (single file) in FASTQ format (http://en.wikipedia.org/wiki/FASTQ_format); fastq.gz files are also accepted.
The full path of the reference genome in bowtie2 format (e.g. /homes/luca/genomes/human_hg19/hg19). Instructions on how to build a custom index, as well as precomputed indexes for the human and mouse genome assemblies, are available from the bowtie2 website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.
Optionally, the full path of a gene annotations file from UCSC. The user can download this file from the UCSC Genome Browser ( http://genome.ucsc.edu/cgi-bin/hgTables?command=start ) by selecting "knownGene" as the table, "all fields from selected table" as the output format, and "gzip compressed" as the file type returned (e.g. /homes/luca/genomes/human_hg19/gencode_v19.gz).

Example:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -x /gcdata/gcproj/Luca/GENOMES/hg19/hg19 --name ONLY_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Genome mode consists of:

REPORT_READS_ALIGNED_TO_GENOME_ONLY.txt: this file contains the list of all the regions discovered, one per line with the following information:

chr_id: chromosome of the region in the reference genome.
bpstart: start coordinate of the region in the reference genome.
bpend: end coordinate of the region in the reference genome.
fastq_file: location of the fastq.gz file containing the reads mapped to the region.
n_reads: number of reads mapped to the region.
sequence: the sequence, on the reference genome for the region.

MAPPED_REGIONS (folder): this folder contains all the fastq.gz files for the discovered regions.
A set of folders with the CRISPResso report on the regions with enough reads.
SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).
CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

This running mode is particularly useful to check if there are mapping artifacts or contaminations in the library. In an optimal experiment, the list of the regions discovered should contain only the regions for which amplicons were designed.

Mixed mode (Amplicons + Genome): in this mode, the tool first aligns reads to the genome and, as in the Genome mode, discovers aligning regions with reads exceeding a tunable threshold. Next it will align the amplicon sequences to the reference genome and will use only the reads that match both the amplicon locations and the discovered genomic locations, excluding spurious reads coming from other regions, or reads not properly trimmed. Finally CRISPResso is run using each of the surviving regions.

To run the tool in this mode the user must provide:

Paired-end reads (two files) or single-end reads (single file) in [FASTQ format](http://en.wikipedia.org/wiki/FASTQ_format) (fastq.gz files are also accepted)
A description file containing the amplicon sequences used to enrich regions in the genome and some additional information (as described in the Amplicons mode section).
The reference genome in bowtie2 format (as described in Genome mode section).
Optionally the gene annotations from UCSC (as described in Genome mode section).

Example:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt -x /gcdata/gcproj/Luca/GENOMES/hg19/hg19 --name AMPLICONS_AND_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Mixed Amplicons + Genome mode consists of these files:

REPORT_READS_ALIGNED_TO_GENOME_AND_AMPLICONS.txt: this file contains the same information provided in the input description file, plus some additional columns:
1. Amplicon_Specific_fastq.gz_filename: name of the file containing the raw reads recovered for the amplicon.
2. n_reads: number of reads recovered for the amplicon.
3. Gene_overlapping: gene/s overlapping the amplicon region.
4. chr_id: chromosome of the amplicon in the reference genome.
5. bpstart: start coordinate of the amplicon in the reference genome.
6. bpend: end coordinate of the amplicon in the reference genome.
7. Reference_Sequence: sequence in the reference genome for the region mapped for the amplicon.
MAPPED_REGIONS (folder): this folder contains all the fastq.gz files for the discovered regions.
A set of folders with the CRISPResso report on the amplicons with enough reads.
SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).
CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

The Mixed mode combines the benefits of the two previous running modes. In this mode it is possible to recover in an unbiased way all the genomic regions contained in the library, and hence discover contaminations or mapping artifacts. In addition, by knowing the location of the amplicon with respect to the reference genome, reads not properly trimmed or mapped to pseudogenes or other problematic regions will be automatically discarded, providing the cleanest set of reads to quantify the mutations in the target regions with CRISPResso.

If the focus of the analysis is to obtain the best quantification of editing efficiency for a set of amplicons, we suggest running the tool in the Mixed mode. The Genome mode is instead suggested to check problematic libraries, since a report is generated for each region discovered, even if the region is not mappable to any amplicon (however, this may be time consuming). Finally, the Amplicons mode is the fastest, although the least reliable in terms of quantification accuracy.

Installation and usage of CRISPRessoWGS

CRISPRessoWGS is a utility for the analysis of genome editing experiments from whole genome sequencing (WGS) data. CRISPRessoWGS allows exploring any region of the genome to quantify targeted editing or potential off-target effects.

Installation

CRISPRessoWGS is installed automatically during the installation of CRISPResso, but to use it two additional programs must be installed:

samtools: http://samtools.sourceforge.net/
bowtie2: http://bowtie-bio.sourceforge.net/bowtie2

To install these tools please refer to their documentation.

To run CRISPRessoWGS you must provide:

A genome aligned BAM file. Many options are available to align reads from a WGS experiment to the genome; we suggest using either Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/) or BWA (http://bio-bwa.sourceforge.net/).
A FASTA file containing the reference sequence used to align the reads and create the BAM file. The reference files for the most common organisms can be downloaded from UCSC: http://hgdownload.soe.ucsc.edu/downloads.html. Download and uncompress only the file ending with .fa.gz; for example, for the latest version of the human genome, download and uncompress the file hg38.fa.gz.
A description file containing the coordinates of the regions to analyze and some additional information. In particular, this file is a tab delimited text file with up to 7 columns (4 required):
- chr_id: chromosome of the region in the reference genome.
- bpstart: start coordinate of the region in the reference genome.
- bpend: end coordinate of the region in the reference genome.
- REGION_NAME: an identifier for the region (must be unique).
- sgRNA_SEQUENCE (OPTIONAL): sgRNA sequence used for this genomic segment without the PAM sequence. If not available, enter NA.
- EXPECTED_SEGMENT_AFTER_HDR (OPTIONAL): expected genomic segment sequence in case of HDR. If more than one, separate by commas and not spaces. If not available, enter NA.
- CODING_SEQUENCE (OPTIONAL): Subsequence(s) of the genomic segment corresponding to coding sequences. If more than one, separate by commas and not spaces. If not available, enter NA.

> A file in the right format should look like this:

chr1 65118211 65118261 R1 CTACAGAGCCCCAGTCCTGG NA NA

chr6 51002798 51002820 R2 NA NA NA

Note: no column titles should be entered. As you may have noticed, this file is just a BED file with extra columns. For this reason a normal BED file with 4 columns is also accepted by this utility.
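Before launching a run it can help to sanity-check the description file against the rules above. The following is a minimal sketch, not part of CRISPResso: the function name and messages are hypothetical, and the requirement that bpstart be smaller than bpend is an assumption of this sketch.

```python
import io

MIN_COLUMNS = 4  # chr_id, bpstart, bpend, REGION_NAME
MAX_COLUMNS = 7  # plus sgRNA_SEQUENCE, EXPECTED_SEGMENT_AFTER_HDR, CODING_SEQUENCE


def check_description_file(handle):
    """Return a list of problems found in a tab-delimited description file."""
    problems = []
    seen_names = set()
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if not MIN_COLUMNS <= len(fields) <= MAX_COLUMNS:
            problems.append(
                f"line {line_number}: found {len(fields)} tab-separated columns, expected 4 to 7"
            )
            continue
        _chrom, start, end, name = fields[:4]
        if not (start.isdigit() and end.isdigit()):
            # This also catches an accidental header row.
            problems.append(f"line {line_number}: coordinates must be integers")
        elif int(start) >= int(end):
            problems.append(f"line {line_number}: bpstart must be smaller than bpend")
        if name in seen_names:
            problems.append(f"line {line_number}: region name {name!r} is not unique")
        seen_names.add(name)
        if any(" " in value for value in fields[4:]):
            problems.append(f"line {line_number}: optional columns must not contain spaces")
    return problems


# The two example rows shown above, written with tab separators.
good_text = (
    "chr1\t65118211\t65118261\tR1\tCTACAGAGCCCCAGTCCTGG\tNA\tNA\n"
    "chr6\t51002798\t51002820\tR2\tNA\tNA\tNA\n"
)
# Made-up rows: swapped coordinates, a duplicate name, spaces instead of tabs.
bad_text = "chr1\t100\t50\tR1\nchr2\t10\t20\tR1\nchr3 10 20 R3\n"

good_problems = check_description_file(io.StringIO(good_text))
bad_problems = check_description_file(io.StringIO(bad_text))
for problem in bad_problems:
    print(problem)
```

A common mistake this catches is a file saved with spaces instead of tabs, which is then read as a single column.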

Optionally the full path of a gene annotations file from UCSC. You can download this file from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) selecting as table "knownGene", as output format "all fields from selected table" and as file returned "gzip compressed" (e.g. /homes/luca/genomes/human_hg19/gencode_v19.gz).

Example:

CRISPRessoWGS -b WGS/50/50_sorted_rmdup_fixed_groups.bam -f WGS_TEST.txt -r /gcdata/gcproj/Luca/GENOMES/mm9/mm9.fa --gene_annotations ensemble_mm9.txt.gz --name CRISPR_WGS_SRR1542350

The output will consist of:

REPORT_READS_ALIGNED_TO_SELECTED_REGIONS_WGS.txt: this file contains the same information provided in the input description file, plus some additional columns:
1. sequence: sequence in the reference genome for the region specified.
2. gene_overlapping: gene/s overlapping the region specified.
3. n_reads: number of reads recovered for the region.
4. bam_file_with_reads_in_region: file containing only the subset of the reads that overlap, also partially, with the region. This file is indexed and can be easily loaded for example on IGV for visualization of single reads or for the comparison of two conditions. For example, in the figure below (fig X) we show reads mapped to a region inside the coding sequence of the gene Crygc subjected to NHEJ (CRISPR_WGS_SRR1542350) vs reads from a control experiment (CONTROL_WGS_SRR1542349).
5. fastq.gz_file_trimmed_reads_in_region: file containing only the subset of reads fully covering the specified regions, and trimmed to match the sequence in that region. These reads are used for the subsequent analysis with CRISPResso.
ANALYZED_REGIONS (folder): this folder contains all the BAM and FASTQ files, one for each region analyzed.
A set of folders with the CRISPResso report on the regions provided in input with enough reads (the default setting is to have at least 10 reads, but the parameter can be adjusted with the option --min_reads_to_use_region).
CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.
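The difference between the two read subsets described in items 4 and 5 (reads that overlap the region, even partially, versus reads that fully cover it and are then trimmed) can be illustrated with simple interval tests. This is only a sketch under simplifying assumptions: inclusive coordinates and reads aligned without insertions or deletions. The function names are hypothetical and do not reflect how CRISPRessoWGS is implemented.

```python
def overlaps(read_start, read_end, region_start, region_end):
    """True if the read shares at least one base with the region."""
    return read_start <= region_end and region_start <= read_end


def fully_covers(read_start, read_end, region_start, region_end):
    """True if the read spans the whole region."""
    return read_start <= region_start and read_end >= region_end


def trim_to_region(read_sequence, read_start, region_start, region_end):
    """Cut a gap-free read down to the bases that fall inside the region."""
    offset = region_start - read_start
    length = region_end - region_start + 1
    return read_sequence[offset:offset + length]


region = (105, 109)  # made-up region coordinates
reads = {
    "spanning": (100, 114),
    "partial": (107, 121),
    "outside": (200, 214),
}

in_bam = [name for name, (s, e) in reads.items() if overlaps(s, e, *region)]
in_fastq = [name for name, (s, e) in reads.items() if fully_covers(s, e, *region)]
trimmed = trim_to_region("ACGTACGTACGTACG", 100, *region)
print(in_bam, in_fastq, trimmed)
```

In this toy case the partially overlapping read would appear in the per-region BAM file but would not contribute to the trimmed FASTQ used for quantification.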

This utility is particularly useful to investigate and quantify mutation frequency in a list of potential target or off-target sites, coming for example from prediction tools or from other orthogonal assays.

Installation and usage of CRISPRessoCompare

CRISPRessoCompare is a utility for the comparison of a pair of CRISPResso analyses. CRISPRessoCompare produces a summary of differences between two conditions, for example a CRISPR treated and an untreated control sample (see figure below). Informative plots are generated showing the differences in editing rates and localization within the reference amplicon.

Installation

CRISPRessoCompare is installed automatically during the installation of CRISPResso.

To run CRISPRessoCompare you must provide:

Two output folders generated with CRISPResso using the same reference amplicon and settings but on different datasets.
Optionally a name for each condition to use for the plots, and the name of the output folder

Example:

CRISPRessoCompare -n1 "VEGFA CRISPR" -n2 "VEGFA CONTROL"  -n VEGFA_Site_1_SRR10467_VS_SRR1046787 CRISPResso_on_VEGFA_Site_1_SRR1046762/ CRISPResso_on_VEGFA_Site_1_SRR1046787/

The output will consist of:

Comparison_Efficiency.pdf: a figure containing a comparison of the edit frequencies for each category (NHEJ, MIXED NHEJ-HDR and HDR), as well as the net effect obtained by subtracting the second sample (second folder in the command line) from the first sample (first folder in the command line).
Comparison_Combined_Insertion_Deletion_Substitution_Locations.pdf: a figure showing the average profile for the mutations for the two samples in the same scale and their difference with the same convention used in the previous figure (first sample – second sample).
CRISPRessoCompare_RUNNING_LOG.txt: detailed execution log.
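The "first sample minus second sample" convention used in both figures amounts to a per-category subtraction. The sketch below illustrates the arithmetic only; the percentages are made up and the function name is hypothetical.

```python
def net_effect(first_sample, second_sample):
    """Per-category difference in editing frequency: first sample minus second sample."""
    categories = sorted(set(first_sample) | set(second_sample))
    return {
        category: first_sample.get(category, 0.0) - second_sample.get(category, 0.0)
        for category in categories
    }


# Hypothetical percentages of reads in each category.
treated = {"NHEJ": 42.5, "MIXED NHEJ-HDR": 1.5, "HDR": 6.0}
control = {"NHEJ": 0.5, "MIXED NHEJ-HDR": 0.0, "HDR": 0.25}

difference = net_effect(treated, control)
for category, value in difference.items():
    print(f"{category}: {value:+.2f}")
```

With the treated sample given first, positive values indicate categories enriched by the treatment; swapping the folder order on the command line flips the sign.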

Installation and usage of CRISPRessoPooledWGSCompare

CRISPRessoPooledWGSCompare is an extension of the CRISPRessoCompare utility allowing the user to run and summarize multiple CRISPRessoCompare analyses where several regions are analyzed in two different conditions, as in the case of the CRISPRessoPooled or CRISPRessoWGS utilities.

Installation

CRISPRessoPooledWGSCompare is installed automatically during the installation of CRISPResso.

To run CRISPRessoPooledWGSCompare you must provide:

1. Two output folders generated with CRISPRessoPooled or CRISPRessoWGS using the same reference amplicon and settings but on different datasets.
2. Optionally a name for each condition to use for the plots, and the name of the output folder.

Example:

CRISPRessoPooledWGSCompare CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046762/ CRISPRessoPooled_on_AMPLICONS_AND_GENOME_SRR1046787/ -n1 SRR1046762 -n2 SRR1046787 -n AMPLICONS_AND_GENOME_SRR1046762_VS_SRR1046787

The output will consist of:

1. COMPARISON_SAMPLES_QUANTIFICATION_SUMMARIES.txt: this file contains a summary of the quantification for each of the two conditions for each region and their difference (read counts and percentages for the various classes: Unmodified, NHEJ, MIXED NHEJ-HDR and HDR).
2. A set of folders with CRISPRessoCompare reports on the common regions with enough reads in both conditions.
3. CRISPRessoPooledWGSCompare_RUNNING_LOG.txt: detailed execution log.

How to cite CRISPResso

If you use CRISPResso in your work please cite:

Pinello L, Canver MC, Hoban MD, Orkin SH, Kohn DB, Bauer DE, Yuan GC. Analyzing CRISPR genome-editing experiments with CRISPResso. Nat Biotechnol. 2016 Jul 12;34(7):695-697. doi: 10.1038/nbt.3583. PubMed PMID: 27404874.

Acknowledgements

We are grateful to Feng Zhang and David Scott for useful feedback and suggestions; the FAS Research Computing Team, in particular Daniel Kelleher, for great support in hosting the web application of CRISPResso; and Sorel Fitz-Gibbon from UCLA for help in sharing data. Finally, we thank all members of the Guo-Cheng Yuan lab for testing the software.

crispresso's Issues

CRISPResso command failed (return value 127) on region #0:

Using CRISPResso 2.0.23
CRISPRessoWGS terminated with the following error message:
Total regions analyzed: 18227
Similar message for all 18227 regions.
Running CRISPResso on region #1/18227: /home/pankum/miniconda3/lib/python2.7/site-packages/CRISPResso.py -r1 /san/ongoing/CRISPER_WGS_Data/CRISpresso/B2M-KO_101_predicted/CRISPRessoWGS_on_B2M-KO_101_predicted/ANALYZED_REGIONS/REGION_R_1.fastq.gz -a catctctctagggcaacgtcggctgcagctgagatggctgctccccggtg -o /san/ongoing/CRISPER_WGS_Data/CRISpresso/B2M-KO_101_predicted/CRISPRessoWGS_on_B2M-KO_101_predicted --name R_1 --needleman_wunsch_gap_extend -2 --max_rows_alleles_around_cut_to_plot 50 --aln_seed_count 5 --needleman_wunsch_aln_matrix_loc EDNAFULL --quantification_window_size 1 --quantification_window_center -3 --trimmomatic_command trimmomatic --conversion_nuc_from C --min_bp_quality_or_N 0 --default_min_aln_score 60 --needleman_wunsch_gap_incentive 1 --plot_window_size 40 --aln_seed_min 2 --needleman_wunsch_gap_open -20 --aln_seed_len 10 --conversion_nuc_to T --min_single_bp_quality 0 --exclude_bp_from_left 15 --min_average_read_quality 0 --min_frequency_alleles_around_cut_to_plot 0.2 --exclude_bp_from_right 15
CRISPResso command failed (return value 127) on region #0: "/home/pankum/miniconda3/lib/python2.7/site-packages/CRISPResso.py -r1 /san/ongoing/CRISPER_WGS_Data/CRISpresso/B2M-KO_101_predicted/CRISPRessoWGS_on_B2M-KO_101_predicted/ANALYZED_REGIONS/REGION_R_1.fastq.gz -a catctctctagggcaacgtcggctgcagctgagatggctgctccccggtg -o /san/ongoing/CRISPER_WGS_Data/CRISpresso/B2M-KO_101_predicted/CRISPRessoWGS_on_B2M-KO_101_predicted --name R_1 --needleman_wunsch_gap_extend -2 --max_rows_alleles_around_cut_to_plot 50 --aln_seed_count 5 --needleman_wunsch_aln_matrix_loc EDNAFULL --quantification_window_size 1 --quantification_window_center -3 --trimmomatic_command trimmomatic --conversion_nuc_from C --min_bp_quality_or_N 0 --default_min_aln_score 60 --needleman_wunsch_gap_incentive 1 --plot_window_size 40 --aln_seed_min 2 --needleman_wunsch_gap_open -20 --aln_seed_len 10 --conversion_nuc_to T --min_single_bp_quality 0 --exclude_bp_from_left 15 --min_average_read_quality 0 --min_frequency_alleles_around_cut_to_plot 0.2 --exclude_bp_from_right 15"

free variable 'df_genes' referenced before assignment in enclosing scope", u'occurred at index Site1'

Hi,
I get the following error when running CRISPResso in Mixed mode (Amplicons + Genome).

ERROR: ("free variable 'df_genes' referenced before assignment in enclosing scope", u'occurred at index Site1')

The genome I am using is not available in UCSC, so I created the gene annotations file by converting a GFF3 annotations file to a genePred file and passed it to the --gene_annotations parameter. What is wrong with it, and how can I solve this problem?
Thanks.

Error with Flash

I get the following error

[Command used]:
CRISPResso /Library/Frameworks/Python.framework/Versions/2.7/bin/CRISPResso -r1 43_S15_L001_R1_001.fastq.gz -r2 43_S15_L001_R2_001.fastq.gz -q 20 -a gagtgctggctctggcctggtgccacccgcctatgcccctccccctgccgtccccggccatcctgccccccagagtgctgaggtgtggggcgggccttctggggcacagcctgggcacagaggtggctgtgcgaagaggggcttgacctcggggttcagaaggggactttacgcgggaaggtactttccctccctccagctcccctcccccgcgtccttccacctctcccggtctctcccactcctcccctggccctccacagcccctcttcttcctcccctggccctctccttcctcccagtccctccccatcccctcccccctacttttcctcctccttccctcccctcctccctgtgcttcttccctgtctctctttcccgccccgctgtacctctccctctgcccctccgctccccgttcactctccctcctcccctgcccctcgacactgtccctcccc -g CGAAGAGGGGCTTGACCTCGGGG -o 43_S15_q20_out

[Execution log]:
Filtering reads with average bp quality < 20 ...
Estimating average read length...
Merging paired sequences with Flash...
[FLASH] ERROR: Maximum overlap (-49) cannot be less than the minimum overlap (4).
Please make sure you have provided the read length and fragment length
correctly.  Or, alternatively, specify the minimum and maximum overlap
manually with the --min-overlap and --max-overlap options.
[FLASH] FLASH did not complete successfully; exiting with failure status (1)
Merging error, please check your input.

ERROR: Flash failed to run, please check the log file.

I cannot seem to specify --max-overlap though. I am working with 150bp PE reads from MiSeq.
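A negative maximum overlap like the one in this log typically arises when the expected fragment (here, the amplicon) is longer than the two reads combined, so the mates cannot overlap at all. The following sketch shows only the underlying geometry for paired reads sequenced inward from both ends of a fragment; it is not FLASH's internal formula, and the fragment lengths are hypothetical.

```python
def expected_overlap(read_length, fragment_length):
    """Bases shared by two mates read inward from both ends of a fragment.

    A negative value is the size of the unsequenced gap between the mates.
    """
    return 2 * read_length - fragment_length


READ_LENGTH = 150  # 150 bp paired-end reads, as in the report above

overlaps_by_fragment = {
    fragment_length: expected_overlap(READ_LENGTH, fragment_length)
    for fragment_length in (250, 300, 400)  # hypothetical amplicon lengths
}
for fragment_length, overlap in overlaps_by_fragment.items():
    status = "mergeable" if overlap > 0 else "cannot be merged"
    print(fragment_length, overlap, status)
```

If the amplicon is longer than twice the read length, no overlap setting can make the pairs mergeable; a shorter amplicon or longer reads would be needed.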

Front End Code

Hello,
I love your tool, it is very well done. Would it be possible to provide the Front End code that is running on http://crispresso.rocks/
Thanks!
Karly

ERROR: 'transform' must be an instance of 'matplotlib.transform.Transform'

Hi Luca!
In the latest version of matplotlib (1.5.1) they changed how Transform is called, and I can no longer get your code to work. Here is the error message:

CRITICAL @ Sun, 14 Aug 2016 20:37:38:
Unexpected error, please check your input.

ERROR: 'transform' must be an instance of 'matplotlib.transform.Transform'

CRISPRessoPooled - query

Hello once again @lucapinello

Sorry to bug you again. I am using CRISPRessoPooled to analyze some pooled amplicon data.

My experimental setup:

Paired-end sequencing, 4 samples, 6 amplicons and 6 guides. I prepared the description file as mentioned in the docs and have a few questions.

Is the pooled analysis limited to only 5 amplicons?
"A description file containing the amplicon sequences used to enrich regions in the genome and some additional information. In particular, this file, is a tab delimited text file with up to 5 columns (first 2 columns required):"
Also, I see this error: "Skipping amplicon [site4] since no reads are aligning to it". All the other amplicons (5 of them) produce results as expected, except site4. Furthermore, when I process this amplicon and guide with CRISPResso.py alone (using the same fastq files), it seems to work fine, so I was wondering what I am missing. Note that I am using default parameters for both analyses.

Thanks a ton once again.

Frank

Runs for a long time

Hi,
I am running CRISPResso on single-end reads with the following command.

CRISPResso -r1 sample_R1_001.fastq.gz -a ATATGACCAGGTCGTACACGATGTGGATCTGCAGAAGCTGCCTGTAAGATTTGCAATGGACAGAGCTGGCCTCGTTGGTGCAGATGGTCCAACACATTGTGGGGCTTTTGATGTCACTTTCATG -g TTGGTGCAGATGGTCCAACACAT --name sample -o ${out} --trim_sequences -p ${cpu} -w 20

But after running for a long time it is still at the step "Calculating alleles frequencies".

INFO  @ Thu, 31 May 2018 14:15:20:
	 Calculating alleles frequencies...

So, does CRISPResso support single-end reads? And what is the best way to analyze single-end sequences?

Flash failed to run flash

Hello,

I was trying to use CRISPRessoPooled on my sequence file but received a FLASH error. The following is the log:

`[Command used]:
CRISPRessoPooled /Library/Frameworks/Python.framework/Versions/2.7/bin/CRISPRessoPooled -r1 FGC1478_s_1_1_AGGCAGAA-ACTGCATA.fastq.gz -r2 FGC1478_s_1_2_AGGCAGAA-ACTGCATA.fastq.gz -f CRISPResso1111.xlsx

[Execution log]:
Merging paired sequences with Flash...
[FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] FGC1478_s_1_1_AGGCAGAA-ACTGCATA.fastq.gz
[FLASH] FGC1478_s_1_2_AGGCAGAA-ACTGCATA.fastq.gz
[FLASH]
[FLASH] Output files:
[FLASH] CRISPRessoPooled_on_FGC1478_s_1_1_AGGCAGAA-ACTGCATA_FGC1478_s_1_2_AGGCAGAA-ACTGCATA/out.extendedFrags.fastq.gz
[FLASH] CRISPRessoPooled_on_FGC1478_s_1_1_AGGCAGAA-ACTGCATA_FGC1478_s_1_2_AGGCAGAA-ACTGCATA/out.notCombined_1.fastq.gz
[FLASH] CRISPRessoPooled_on_FGC1478_s_1_1_AGGCAGAA-ACTGCATA_FGC1478_s_1_2_AGGCAGAA-ACTGCATA/out.notCombined_2.fastq.gz
[FLASH] CRISPRessoPooled_on_FGC1478_s_1_1_AGGCAGAA-ACTGCATA_FGC1478_s_1_2_AGGCAGAA-ACTGCATA/out.hist
[FLASH] CRISPRessoPooled_on_FGC1478_s_1_1_AGGCAGAA-ACTGCATA_FGC1478_s_1_2_AGGCAGAA-ACTGCATA/out.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 4
[FLASH] Max overlap: 100
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 8
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33, gzip
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 8 combiner threads
[FLASH] Processed 25000 read pairs
[FLASH] Processed 50000 read pairs
[FLASH] Processed 75000 read pairs
[FLASH] Processed 100000 read pairs
[FLASH] Processed 125000 read pairs
[FLASH] Processed 150000 read pairs
[FLASH] Processed 175000 read pairs
[FLASH] Processed 200000 read pairs
[FLASH] Processed 225000 read pairs
[FLASH] Processed 250000 read pairs
[FLASH] Processed 275000 read pairs
[FLASH] Processed 300000 read pairs
[FLASH] Processed 325000 read pairs
[FLASH] Processed 350000 read pairs
[FLASH] Processed 375000 read pairs
[FLASH] ERROR: Qual string length (55) not the same as sequence length (250) (file "FGC1478_s_1_1_AGGCAGAA-ACTGCATA.fastq.gz", near line 1502597)
[FLASH] FLASH did not complete successfully; exiting with failure status (1)

ERROR: Flash failed to run, please check the log file.
`

It seems like there's something wrong with my fastq file, because I can run your test example without any problem, although that did not require CRISPRessoPooled.
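A quality string shorter than its sequence often points to a truncated or corrupted FASTQ file. The input can be checked independently of FLASH with a small script such as the sketch below, which assumes standard 4-line FASTQ records (no wrapped sequences). The function names are hypothetical.

```python
import gzip
import io


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def find_bad_records(handle, limit=10):
    """Scan 4-line FASTQ records and return (first_line_number, reason) for bad ones."""
    problems = []
    line_number = 0
    while len(problems) < limit:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # clean end of file
        header, sequence, separator, quality = (part.rstrip("\r\n") for part in record)
        first_line = line_number + 1
        line_number += 4
        if not header.startswith("@"):
            problems.append((first_line, "header does not start with '@'"))
        elif not separator.startswith("+"):
            problems.append((first_line, "separator line does not start with '+'"))
        elif len(sequence) != len(quality):
            problems.append(
                (first_line, f"sequence length {len(sequence)} != quality length {len(quality)}")
            )
    return problems


# Toy input: the second record has a quality string that is too short.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIII\n")
problems = find_bad_records(example)
print(problems)
```

For a real file, pass `open_fastq(path)` instead of the toy handle. Once one record is malformed, later records may be reported as bad only because the 4-line framing has shifted, so the first reported line is the one to inspect.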

Issue with ONLY AMPLICONS running mode

Hello,

I am trying to run CRISPRessoPooled in ONLY AMPLICONS mode. I have a problem with the generation of the reads file, which I think is related to the following information reported in the _RUNNING_LOG file:

No samples; assembling all-inclusive block
Sorting block of length 1074 for bucket 1
(Using difference cover)
Error: reads file does not look like a FASTQ file
Error: Encountered exception: 'Unidentified exception'

I am just starting out in bioinformatic analysis, so I am not able to understand what is happening. Furthermore, I couldn't find an example of this running mode to try to solve the problem by myself. I attached the complete _RUNNING_LOG file and the input files.

AMPLICONS.txt
AV4T6XXXX1.fastq.gz
AV4T6XXXX2.fastq.gz
CRISPRessoPooled_RUNNING_LOG.txt

The program itself must be working, because in CRISPResso mode I am able to reproduce the results obtained from the online version. But the design of our sequencing experiments requires the Pooled version to make the analysis easier and more automatic.

Thank you so much for your attention,

Andrés Marco Giménez
PhD Student
Institute for Bioengineering of Catalonia (IBEC)

ERROR: If using all scalar values, you must pass an index

Hi there,

I am attempting your pipeline with single end reads (since my overlap is over 65bp and I get an error from flash when trying to merge. Is there any way to adjust flash params?). Reads were quality filtered and merged with usearch.

I have attached my merged reads and my reference sequence. They gave reasonable results with GATK/MuTect2...

I get this error, thanks for your help!:

-Analysis of CRISPR/Cas9 outcomes from deep sequencing data-

                      )
                     (
                    __)__
                 C\|     |
                   \     /
                    \___/
             

[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 1.0.8

INFO  @ Tue, 24 Oct 2017 10:05:27:
	 Creating Folder CRISPResso_on_JC04-4-CT13_S4_filter 

INFO  @ Tue, 24 Oct 2017 10:05:27:
	 Done! 

INFO  @ Tue, 24 Oct 2017 10:05:27:
	 Preparing files for the alignment... 

INFO  @ Tue, 24 Oct 2017 10:05:27:
	 Done! 

INFO  @ Tue, 24 Oct 2017 10:05:27:
	 Aligning sequences... 

INFO  @ Tue, 24 Oct 2017 10:09:34:
	 Align sequences to reverse complement of the amplicon... 

[JC04-4-CT13_S4_filter.fastq.gz](https://github.com/lucapinello/CRISPResso/files/1411814/JC04-4-CT13_S4_filter.fastq.gz)
[JC04-4-CT13_reference.txt](https://github.com/lucapinello/CRISPResso/files/1411820/JC04-4-CT13_reference.txt)

INFO  @ Tue, 24 Oct 2017 10:09:34:
	 Done! 

INFO  @ Tue, 24 Oct 2017 10:13:35:
	 Quantifying indels/substitutions... 

CRITICAL @ Tue, 24 Oct 2017 10:13:35:
	 Unexpected error, please check your input.

ERROR: If using all scalar values, you must pass an index

Unexpected error: invalid literal for int() with base 10: '0rc1'

JC04-4-CT13_short.txt

JC04-4-CT13_S4_filter.fastq.gz

Hi there,

I ran the web version and these files generated meaningful output, but when I run the following with the command line version, I get this error:

earnest@biolinux8[CRISPResso-master] python CRISPResso.py -r1 /home/earnest/Jeff.S/CRISPR_test/merged/filter/JC04-4-CT13_S4_filter.fastq -a TGCATGTCATCTCTTTCAGGTGTGGCATTTCAAGGGGGCTTGTGTCTTGAAAACAGCAACTGTGAGGACACTTGATAGTCATTTCCTTCAGTTCTGCTTTTGTCTCCCTAGGTGACTGTGGCCTTCCCCCAGATGTACCTAATGCCCAGCCAGCTTTGGAAGGCCGTACAAGTTTTCCCGAGGATACTGTAATAACGTACAAATGTGAAGAAAGCTTTGTGAAAATTCCTGGCGAGAAGGACTCAGT

-Analysis of CRISPR/Cas9 outcomes from deep sequencing data-

                      )
                     (
                    __)__
                 C\|     |
                   \     /
                    \___/
             

[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 1.0.8

WARNING @ Tue, 14 Nov 2017 16:39:12:
	 Folder CRISPResso_on_JC04-4-CT13_S4_filter already exists. 

INFO  @ Tue, 14 Nov 2017 16:39:12:
	 Preparing files for the alignment... 

INFO  @ Tue, 14 Nov 2017 16:39:12:
	 Done! 

INFO  @ Tue, 14 Nov 2017 16:39:12:
	 Aligning sequences... 

INFO  @ Tue, 14 Nov 2017 16:40:01:
	 Align sequences to reverse complement of the amplicon... 

INFO  @ Tue, 14 Nov 2017 16:40:01:
	 Done! 

INFO  @ Tue, 14 Nov 2017 16:40:23:
	 Quantifying indels/substitutions... 

INFO  @ Tue, 14 Nov 2017 16:43:49:
	 Done! 

INFO  @ Tue, 14 Nov 2017 16:43:49:
	 Calculating indel distribution based on the length of the reads... 

INFO  @ Tue, 14 Nov 2017 16:43:51:
	 Done! 

INFO  @ Tue, 14 Nov 2017 16:43:51:
	 Calculating alleles frequencies... 

CRITICAL @ Tue, 14 Nov 2017 16:43:51:
	 Unexpected error, please check your input.

ERROR: invalid literal for int() with base 10: '0rc1'
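The message suggests that a version string carrying a release-candidate suffix is being split on dots and each piece passed to `int()`. Which version string is involved is not stated in this report, so the value used below is purely hypothetical. The sketch reproduces the error and shows a more tolerant way to parse such strings.

```python
import re


def leading_int(component):
    """Return the leading digits of a version component as an int (0 if none)."""
    match = re.match(r"\d+", component)
    return int(match.group()) if match else 0


def version_tuple(version):
    """Parse '0.21.0rc1' into (0, 21, 0), ignoring pre-release suffixes."""
    return tuple(leading_int(part) for part in version.split("."))


hypothetical_version = "0.21.0rc1"  # assumed value, for illustration only

try:
    [int(part) for part in hypothetical_version.split(".")]
    message = ""
except ValueError as error:
    message = str(error)

parsed = version_tuple(hypothetical_version)
print(message)
print(parsed)
```

If this is indeed the cause, installing a final (non release-candidate) version of the affected dependency may avoid the error.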

Specify the FLASH --max-overlap parameter?

I get the following warning from FLASH about a high proportion of paired end reads overlapping by more than 100bp. This pooled dataset has many short amplicons and 150bp PE reads, so this is probably to be expected. Is it possible to specify the --max-overlap (-M) parameter to fix this?

[FLASH]  
[FLASH] Read combination statistics:
[FLASH]     Total pairs:      2554170
[FLASH]     Combined pairs:   415920
[FLASH]     Uncombined pairs: 2138250
[FLASH]     Percent combined: 16.28%
[FLASH]  
[FLASH] Writing histogram files.
[FLASH] WARNING: An unexpectedly high proportion of combined pairs (10.04%)
overlapped by more than 100 bp, the --max-overlap (-M) parameter.  Consider
increasing this parameter.  (As-is, FLASH is penalizing overlaps longer than
100 bp when considering them for possible combining!)

no of reads from pie chart and alleles do not add up

Hi, I am using CRISPR guided to a target region. We then sequenced the genomic DNA (PCR product) covering this region using MiSeq 2x250 + 10bp Index1 + 8bp Index2 (as defined in my previous question).

Alleles_frequency.txt
alleles.pdf
Quantification_of_editing_frequency.txt
pie.pdf

After reading the paper and the supplementary material we decided to use CRISPResso.py for our analysis:

CRISPResso.py --trim_sequences -r1 sample_R1_001.fastq.gz -r2 sample_R2_001.fastq.gz -a CCTCGCAGACATTAAAGCCCgtgctttgcaggcccgaggggcgagaggttaccactgcaatcgagagacggccaccactgccatcggaggggggggtggcccgggtggaggtggcactcgggccatcgatgagggaggtggcagagacagcagcaGTGGTGATGGTAGTGAGGCC -g grna seq -o sample_out

My question: after a successful run, the number of reads shown in the pie chart and in the alleles frequency table do not add up (the plot shows only sites above 0.2%, but even in the text file the number of reads is confusing). I was wondering whether something is wrong with our command or whether we are interpreting the output incorrectly. I have attached all the figures and text files associated with it.

Thank you.

Trimming adapter sequences that are not Nextera

Hello,
I have a bunch of fastq.gz files to analyze, paired end reads.
However the adapter sequences are not Nextera and I don't know which ones they are exactly.

Just from looking at the fastq.gz files, can I tell which adapter sequences were used, and if so, how can I then trim them with --trimmomatic_options_string?

I guess I have to create a .fa file similar to the "NexteraPE-PE.fa" file; however, I'm not sure how to do this correctly.

I attach the two files (pair ends) of one sequencing.
Would you please help me out and guide me so that I am able to do this by myself in the future?
Thank you so much, I would really appreciate it! Your tool is very useful and a great contribution to the scientific community.
Best,
Alex
Won_Tae_1_S48_L001_R1_001.fastq.gz
Won_Tae_1_S48_L001_R2_001.fastq.gz

RuntimeWarning: invalid value encountered in divide

When running control samples using CRISPResso v1.0.2, I get the following warnings:
/usr/lib/python2.7/site-packages/CRISPResso-1.0.2-py2.7.egg/CRISPResso/CRISPRessoCORE.py:1315: RuntimeWarning: invalid value encountered in divide
avg_vector_ins_all/=(effect_vector_insertion+effect_vector_insertion_hdr+effect_vector_insertion_mixed)
/usr/lib/python2.7/site-packages/CRISPResso-1.0.2-py2.7.egg/CRISPResso/CRISPRessoCORE.py:1316: RuntimeWarning: invalid value encountered in divide
avg_vector_del_all/=(effect_vector_deletion+effect_vector_deletion_hdr+effect_vector_deletion_mixed)

I did not get these warnings when I ran the same sample using v0.9.8. Here is my command:
FQ1=161219_A4_S4_L001_R1_001.fastq.gz
FQ2=161219_A4_S4_L001_R2_001.fastq.gz

AmpliconSequence=TAAGTGAATTACTTTTTTTGTCAATCATTTAACCATCTTTAACCTAAAAGAGTTTTATGTGAAATGGCTTATAATTGCTTAGAGAATATTTGTAGAGAGGCACATTTGCCAGTATTAGATTTAAAAGTGATGTTTTCTTTATCTAAATGA
sgRNA=TGTGAAATGGCTTATAATTGC
SAMPLE_NAME=38343_S_3
OUTDIR=$(pwd)

    CRISPResso \
    -r1 $FQ1 \
    -r2 $FQ2 \
    -a $AmpliconSequence \
    -g $sgRNA \
    -n $SAMPLE_NAME \
    -o $OUTDIR \
    --keep_intermediate \
    --save_also_png \
    --window_around_sgrna 10 \
    --hide_mutations_outside_window_NHEJ

I've attached two example files that generate this error. Would you please take a look?

Thanks for your help!
Matt

161219_A4_S4_L001_R2_001.fastq.gz
161219_A4_S4_L001_R1_001.fastq.gz

ERROR: Flash failed to run, please check the log file.

Hi,

I am trying to install CRISPResso on my computer, but am having trouble using FLASH. Here's the output I get when I try to run the example files:

CRISPResso -r1 reads1.fastq -r2 reads2.fastq -a GCTTACACTTGCTTCTGACACAACTGTGTTCACGAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGAATGCCGTCACCACCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGA -e GCTTACACTTGCTTCTGACACAACTGTGTTCACGAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGTGGAAAAAAACGCCGTCACGACGTTATGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGA

-Analysis of CRISPR/Cas9 outcomes from deep sequencing data-

                      )
                     (
                    __)__
                 C\|     |
                   \     /
                    \___/
             

[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 1.0.13

WARNING @ Mon, 12 Nov 2018 14:47:51:
	 Folder CRISPResso_on_reads1_reads2 already exists. 

INFO  @ Mon, 12 Nov 2018 14:47:51:
	 Estimating average read length... 

INFO  @ Mon, 12 Nov 2018 14:47:53:
	 Merging paired sequences with Flash... 

CRITICAL @ Mon, 12 Nov 2018 14:47:53:
	 Merging error, please check your input.

ERROR: Flash failed to run, please check the log file.

The running log says:
[Execution log]:
Estimating average read length...
Merging paired sequences with Flash...
/bin/sh: /Users/madeleinesitton/CRISPResso_dependencies/bin/flash: cannot execute binary file
Merging error, please check your input.

ERROR: Flash failed to run, please check the log file.

Thank you for the help!!

Make installing of external dependencies optional

Currently it is not possible to install CRISPResso without the external dependencies if these are not yet in the current PATH. This makes it difficult to install CRISPResso in a conda environment, where the external dependencies are installed as part of the environment (which is not yet active and therefore not detectable by the CRISPResso setup.py script).

I would propose to separate the installation of the external tools from the setup.py script, as I think that setup.py should not be responsible for installing the external dependencies to start with. If you would like to facilitate the installation of external tools, I would either provide instructions to do so or provide a separate script for installing the dependencies (or ideally both).
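
One way to decouple the check from installation is to look for the tools at run time rather than in setup.py. The following is a minimal sketch, not CRISPResso's actual code: the tool list and the function name are assumptions for illustration, and `shutil.which` requires Python 3.3 or later.

```python
import shutil

# Hypothetical list of external executables; adjust to what your analysis needs.
EXTERNAL_TOOLS = ["flash", "needle"]


def missing_tools(tools=EXTERNAL_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external dependencies: " + ", ".join(missing))
    else:
        print("All external dependencies were found on PATH.")
```

Because the lookup happens when the program starts, it sees whatever conda environment is active at that moment.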

Running with non-overlapping reads

Is it possible to run CRISPResso with reads that originate from a large amplicon (a few kilobases) that has been fragmented and then sequenced? These reads would not map to only the ends of the target amplicon...

If guide seq is wrong, do not crash instead run without guide seq

Thanks for the software and the recent addition of allowing outies for flash ! TOP 👍

Would you also consider making the following change in the core code?

                     if not cut_points:
                         #CHANGE ADDED here (add warning and default values instead of a crash)
                         #raise SgRNASequenceException('The guide sequence/s provided is(are) not present in the amplicon sequence! \n\nPlease check your input!')
                         warn('The guide sequence/s provided is(are) not present in the amplicon sequence! \n\nPlease check your input!, running now without guide_sequence!!')
                         cut_points=[]
                         sgRNA_intervals=[]
                         offset_plots=[]
                     else:
                         info('Cut Points from guide seq:%s' % cut_points)

This helps for throughput analysis. I know that the downstream calculation values are off, but the resulting allele_frequency_tsv should not be affected at all. What do you think?
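
The behaviour proposed above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code from CRISPRessoCORE.py: the function names are hypothetical, positions are 0-based, and the cut-point arithmetic assumes a cleavage offset of -3 relative to the 3' end of the guide.

```python
import re
import warnings

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_cut_points(amplicon, guides, cleavage_offset=-3):
    """Return assumed cut positions; warn instead of raising when no guide matches."""
    amplicon = amplicon.upper()
    cut_points = []
    for guide in guides:
        guide = guide.upper()
        # Guide found on the forward strand of the amplicon.
        for match in re.finditer(guide, amplicon):
            cut_points.append(match.start() + len(guide) + cleavage_offset - 1)
        # Guide found on the reverse strand of the amplicon.
        for match in re.finditer(reverse_complement(guide), amplicon):
            cut_points.append(match.start() - cleavage_offset - 1)
    if not cut_points:
        warnings.warn("Guide sequence(s) not found in the amplicon; "
                      "continuing without cut points.")
    return cut_points
```

Callers then have to handle an empty list, which is why the downstream window-based statistics are not meaningful in that case.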

mutation frequency plot

hello @lucapinello

Thank you for your help so far; I was able to get CRISPResso working. I have a question about the mutation frequency plot, shown below.

What could be the reason we see peaks at the beginning and end of the sequence? I see this in almost all my sequences. I did use '--trim_sequence' for trimming adapters.

Thank you once again

Appears to be counting reads with deletions as unedited

Hi Luca,
Big fan of crispresso and very much looking forward to crispresso 2.
I was running a batch of CRISPResso analyses and came across a few wells that seem to have an unusually high number of unmodified reads. When I looked at the alleles around the predicted cut site, they appeared to have clear deletions but were counted as unmodified in every statistic.

Aligned_Sequence Reference_Sequence Unedited %Reads #Reads
AGTGGAGGATGCCTTCT--ACGTTGGTGCGTGAGATCCGG AGTGGAGGATGCCTTCTACACGTTGGTGCGTGAGATCCGG True 61.50844322453524 103082
GATGCCTTC-ACATGTCTCACGTTGGTGCGTGAGATCCGG GATGCCTTCTA-------CACGTTGGTGCGTGAGATCCGG True 35.92815800465359 60212
AGTGGAGGATGCCTTCT----------------------- AGTGGAGGATGCCTTCTACACGTTGGTGCGTGAGATCCGG False 0.6533802732859946 1095

I'm not sure why CRISPResso would be counting these reads as unmodified. I think the same thing happens in the "Left-aligning option" issue, where the reads have deletions but are counted as unedited.

Any help would be deeply appreciated.

Thanks,

Alexander Raeside
Oxford Genetics

modify the python scripts for availability to other CRISPR systems

Hi,
This program works best for analyzing sequencing data from conventional Cas9 (20bp+NGG) genome editing, but I also want to use it to analyze Cpf1 (TTTN+23bp) sequencing data. To do that, I need to modify the Python source to replace NGG with TTTN and make other related changes. The script CRISPRessoCORE.py is complicated and I cannot find where to make the changes.
So, can you give some hints?
Thanks.

CRISPResso available on bioconda

I hope this is OK; I made CRISPResso available for installation via bioconda:
https://bioconda.github.io/recipes/crispresso/README.html

(It's now possible to conda install crispresso)

Recipe source on GitHub here:
https://github.com/bioconda/bioconda-recipes/tree/master/recipes/crispresso

I haven't put it through its paces, so it is possible some changes may need to be made to the recipe (especially with respect to dependencies), but in general it is available and should be easy for users to install.

In the future, it would be nice to have tagged releases on GitHub so downstream users can keep track of versions.

Cheers,
Chris

Error with FLASH for CRISPResso

I tried several times running my already split paired-end reads with CRISPResso, unfortunately this is the result I get every time.

[Command used]:
CRISPResso /Users/mtoetzl/anaconda/bin/CRISPResso -r1 3_HeLa_SG1_293817w_CA3_R1.fastq -r2 3_HeLa_SG1_293817w_CA3_R2.fastq -a CAAGGCTGAAATTGAGAATGAAGACTATAGTTATACAAAAGATGGAATAGGACTAGATTTGGAAAATTCTTTTAGTAACATTCTGTTATTTGTTCCTGAGTACTTAGACTTCATGCAGAATGGTAACTACTTTCTGATTTTTGTGAAGTCATGGAGCTTGAACACCTCTGGTCTGCGGATTACCACCTTGAGCTCCAATTTGTACAAAAGAGATATAACATCTGCAAAAGTCATGAATGCCACTGCTGCACTGGAGTTCCTCAAAGACATGAA -g GGTGGTAATCCGCAGACCAGAGG

[Execution log]:
Estimating average read length...
Merging paired sequences with Flash...
[FLASH] ERROR: Maximum overlap (-97) cannot be less than the minimum overlap (4).
Please make sure you have provided the read length and fragment length
correctly. Or, alternatively, specify the minimum and maximum overlap
manually with the --min-overlap and --max-overlap options.
[FLASH] FLASH did not complete successfully; exiting with failure status (1)
Merging error, please check your input.

ERROR: Flash failed to run, please check the log file.

Window around cleavage position seems to be asymmetrical

With various data, I get different quantification results when I flip the reference amplicon sequence between sense and anti-sense strand (i.e. when I reverse-complement it) if I provide a guide RNA that I keep unchanged when I run CRISPRessoPooled. The results should be independent of that.

I think the problem is caused by the fact that the window around the cleavage position is asymmetrical with respect to the cleavage position (which by default is located between the third and fourth base of the guide sequence), i.e. more bases to the left of the cleavage position than to the right of it are taken into account when quantifying indels.

Looking at the code, I would propose
st=max(0,cut_p-half_window+1)
en=min(len(args.amplicon_seq),cut_p+half_window+1)
in lines 1228 and 1229 of CRISPRessoCORE.py. This did make the asymmetry between the results better in my example, but did not fully remove it, so it can't be the full solution.
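
The proposed bounds can be checked in isolation. The sketch below assumes that `cut_p` is the 0-based index of the last base before the cut, so the cut lies between `cut_p` and `cut_p + 1`; the function name is hypothetical. As noted above, this change alone did not remove the asymmetry in the reported example.

```python
def window_bounds(cut_p, half_window, amplicon_len):
    """Return (st, en) so that amplicon[st:en] is centred on the cut."""
    st = max(0, cut_p - half_window + 1)
    en = min(amplicon_len, cut_p + half_window + 1)
    return st, en


# Example with hypothetical values: cut after base 76 of a 133-bp amplicon.
cut_p, half_window = 76, 5
st, en = window_bounds(cut_p, half_window, 133)
bases_left = (cut_p + 1) - st   # bases in the window before the cut
bases_right = en - (cut_p + 1)  # bases in the window after the cut
```

With these bounds `bases_left` and `bases_right` are equal whenever the window is not clipped by either end of the amplicon.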

Last stage error Error with FLASH for CRISPRessoPooled.py

Dear Luca Pinello
I am using the CRISPRessoPooled.py script to analyze paired-end Illumina reads.
The read length is 150 bp.
The amplicon is 180 bp long.
Below is the command that I have used several times.
It ends with an error at the last stage of the analysis.
I am new to the Linux environment. Kindly help me to resolve this issue.
Thanks
ERROR: Flash failed to run, please check the log file.

$ python CRISPRessoPooled.py -r1 /home/bilal/Sir_Qayyum_data/6297_1_1.fastq.gz -r2 /home/bilal/Sir_Qayyum_data/6297_1_2.fastq.gz -f /home/bilal/Sir_Qayyum_data/CRISPResso-master/cat.csv

-Analysis of CRISPR/Cas9 outcomes from POOLED deep sequencing data-

              )                                            )
             (           _______________________          (
            __)__       | __  __  __     __ __  |        __)__
         C\|     \      ||__)/  \/  \|  |_ |  \ |     C\|     \
           \     /      ||   \__/\__/|__|__|__/ |       \     /
            \___/       |_______________________|        \___/
        

[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 1.0.13

INFO  @ Tue, 05 Feb 2019 22:43:54:
	 Checking dependencies... 

INFO  @ Tue, 05 Feb 2019 22:43:54:
	 
 All the required dependencies are present! 

INFO  @ Tue, 05 Feb 2019 22:43:54:
	 Only the Amplicon description file was provided. The analysis will be perfomed using only the provided amplicons sequences. 

INFO  @ Tue, 05 Feb 2019 22:43:54:
	 Creating Folder CRISPRessoPooled_on_6297_1_1_6297_1_2 

WARNING @ Tue, 05 Feb 2019 22:43:54:
	 Folder CRISPRessoPooled_on_6297_1_1_6297_1_2 already exists. 

INFO  @ Tue, 05 Feb 2019 22:43:54:
	 Merging paired sequences with Flash... 

CRITICAL @ Tue, 05 Feb 2019 22:43:54:
	 

ERROR: Flash failed to run, please check the log file.

No reads aligned?

I'm getting this error with my data, trying to align just one of the paired end read files. The amplicon input is a single line of text (I don't think it's terminated by a newline character). Alignment of the same reads to this sequence in CLC has no problems. Also test run of CRISPResso completed successfully, no problem.
[Command used]:
CRISPResso /usr/local/bin/CRISPResso -r1 2_S2_L001_R1_001.fastq.gz -a GATCGGAGAATAAGCATGAGTAGTTATTGAGATCTGGGTCTGACTGCAGGTAGCGTGGTCTTCTAGACGTTTAAGTGGGAGATTTGGAGGGGATGAGGAATGAAGGAACTTCAGGATAG AAAAGGGCTGAAGTCAAGTTCAGCTCCTAAAATGGATGTGGGAGCAAACTTTGAAGATAAACTGAATGACCCAGAGGATGAAACAGCGCAGATCAAAGAGGGGCCTGGAGCTCTGAGAAGAGAAGGAGACTCATCCGTGTTGAGTTTCCACAAGTACTGTCTTGAGTTTTGCAATAAAAGTGGGATAGC AGAGTTGAGTGAGCCGTAGGCTGAGTTCTCTCTTTTGTCTCCTAAGTTTTTATGACTACAAAAATCAGTAGTATGTCCTGAAATAATCATTAAGCTGTTTGAAAGTATGACTGCTTGCCATGTAGATACCATGGCTTGCTGAATAATCAGAAGAGGTGTGACTCTTATTCTAAAATTTGTCACAAAATG TCAAAATGAGAGACTCTGTAGGAACG

[Execution log]:
Preparing files for the alignment...
Done!
Aligning sequences...
Needleman-Wunsch global alignment of two sequences
Align sequences to reverse complement of the amplicon...
Done!
Needleman-Wunsch global alignment of two sequences
Quantifying indels/substitutions...
Alignment error, please check your input.

ERROR: Zero sequences aligned, please check your amplicon sequence

crispressoweb dockerfile

Hi @lucapinello,

Would you be willing to share the dockerfile for the crispressoweb docker image you have listed on docker hub?

Longer Allele Sequences

Hello Luca,

Is there any way to have longer allele sequences in the file Alleles_frequency_table_around_cut_site_for_*.txt?

It would be useful for the analysis of HDR events that are located far away from the gRNA cutting site.

Thank you for your attention,

Andrés

ERROR: The amplicons should be all distinct!

Hi,
I get the following stderr output when running CRISPRessoPooled.

ERROR: The amplicons should be all distinct!
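
A quick way to see which entries collide is to group amplicon names by sequence before running the tool. This is a minimal sketch that assumes the error is triggered by identical sequences in the amplicon description file; the function name and the (name, sequence) input layout are hypothetical.

```python
from collections import defaultdict


def find_duplicate_amplicons(rows):
    """Map each sequence that occurs more than once to the names that use it.

    rows is an iterable of (name, sequence) pairs.
    """
    names_by_sequence = defaultdict(list)
    for name, sequence in rows:
        # Normalise case and surrounding whitespace before comparing.
        names_by_sequence[sequence.strip().upper()].append(name)
    return {seq: names for seq, names in names_by_sequence.items()
            if len(names) > 1}
```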

Python error running locally

Hello,

I tried your software online and it worked perfectly for my sample, so I installed it on my system using pip (as recommended). But when I run the same sample with exactly the same options that you use on the website (command line extracted from the report that your web tool generates), it gives me an error:

[...]
INFO  @ Tue, 25 Apr 2017 22:09:46:
         Quantifying indels/substitutions...

/modules/ogi-mbc/software/CRISPResso/0.7.0/lib/python2.7/site-packages/CRISPResso/CRISPRessoCORE.py:1336: RuntimeWarning: invalid value encountered in divide
  avg_vector_ins_all/=(effect_vector_insertion+effect_vector_insertion_hdr+effect_vector_insertion_mixed)
/modules/ogi-mbc/software/CRISPResso/0.7.0/lib/python2.7/site-packages/CRISPResso/CRISPRessoCORE.py:1337: RuntimeWarning: invalid value encountered in divide
  avg_vector_del_all/=(effect_vector_deletion+effect_vector_deletion_hdr+effect_vector_deletion_mixed)
INFO  @ Tue, 25 Apr 2017 22:18:12:
         Done!

INFO  @ Tue, 25 Apr 2017 22:18:12:
         Calculating indel distribution based on the length of the reads...

INFO  @ Tue, 25 Apr 2017 22:18:21:
         Done!

INFO  @ Tue, 25 Apr 2017 22:18:21:
         Calculating alleles frequencies...

CRITICAL @ Tue, 25 Apr 2017 22:18:21:
         Unexpected error, please check your input.

ERROR: invalid literal for int() with base 10: '0rc1'

Commandline:

CRISPResso -a CGAGAGCCGCAGCCATGAACGGCACAGAGGGCCCCAATTTTTATGTGCCCTTCTCCAACGTCACAGGCGTGGTGCGGAGCCCCTTCGAGCAGCCGCAGTACTACCTGGCGGAACCATGGCAGTTCTCCATGCTGGCAGCGTACATGTTCCTGCTCATCGTGCTGGG -r1 ../split_fastq/2.P23H2_Het_276035w_DB12.fastq.R1.fastq -r2 ../split_fastq/2.P23H2_Het_276035w_DB12.fastq.R2.fastq -q 0 -s 0 --exclude_bp_from_left 15 --exclude_bp_from_right 15 --hdr_perfect_alignment_threshold 98 -w 1 --name TMP --output_folder ./ --save_also_png

I understand that must be something regarding my installation, but I have no clue what is not going well.

Thanks.
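
The message `invalid literal for int() with base 10: '0rc1'` is what Python raises when a string such as the last component of a release-candidate version number is passed directly to `int()`. Which version string is being parsed is not stated in the report, so the following is only a sketch of a more tolerant parser, with hypothetical function names and an example version chosen for illustration.

```python
import re


def leading_int(component):
    """Return the leading digits of a version component as an int (0 if none)."""
    match = re.match(r"\d+", component)
    return int(match.group()) if match else 0


def version_tuple(version):
    """Convert a dotted version string into a tuple of ints, ignoring suffixes."""
    return tuple(leading_int(part) for part in version.split("."))


# A release-candidate suffix no longer breaks the conversion.
parsed = version_tuple("1.12.0rc1")
```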

Tests

Is this thing even working? Who knows) I think it is time to add some tests.

Running failed with the example sequences

Environment: CentOS6.5, Python 2.7.11, CRISPResso installed from source (master.zip downloaded from github), needle, flash and trimmomatic are in the CRISPResso dependencies dir.

Data was downloaded from http://bcb.dfci.harvard.edu/~lpinello/CRISPResso as indicated and run as:

CRISPResso -r1 reads1.fastq.gz -r2 reads2.fastq.gz -a AATGTCCCCCAATGGGAAGTTCATCTGGCACTGCCCACAGGTGAGGAGGTCATGATCCCCTTCTGGAGCTCCCAACGGGCCGTGGTCTGGTTCATCATCTGTAAGAATGGCTTCAAGAGGCTCGGCTGTGGTT -g TGAACCAGACCACGGCCCGT

output as follows:

-Analysis of CRISPR/Cas9 outcomes from deep sequencing data-

                      )
                     (
                    __)__
                 C\|     \
                   \     /
                    \___/


[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 0.9.8

INFO  @ Tue, 02 Aug 2016 17:32:39:
     Cut Points from guide seq:[76]

WARNING @ Tue, 02 Aug 2016 17:32:39:
     Folder CRISPResso_on_reads1_reads2 already exists.

INFO  @ Tue, 02 Aug 2016 17:32:39:
     Estimating average read length...

INFO  @ Tue, 02 Aug 2016 17:32:40:
     Merging paired sequences with Flash...

INFO  @ Tue, 02 Aug 2016 17:32:41:
     Done!

INFO  @ Tue, 02 Aug 2016 17:32:42:
     Preparing files for the alignment...

INFO  @ Tue, 02 Aug 2016 17:32:42:
     Done!

INFO  @ Tue, 02 Aug 2016 17:32:42:
     Aligning sequences...

sed: couldn't write 73 items to stdout: Broken pipe
awk: (FILENAME=- FNR=809) fatal: print to "standard output" failed (Broken pipe)

gzip: stdout: Broken pipe
cat: write error: Broken pipe
INFO  @ Tue, 02 Aug 2016 17:32:42:
     Quantifying indels/substitutions...

CRITICAL @ Tue, 02 Aug 2016 17:32:42:
     Alignment error, please check your input.

ERROR: Zero sequences aligned, please check your amplicon sequence

Left-aligning option

Hi Luca,

First of all thanks for developing this pipeline.

I have a question for you. I'm pasting below two of the alleles from the "Alleles_frequency_table_around_cut_site" file.

Aligned_Sequence Reference_Sequence Unedited %Reads #Reads
GACTGTAAGTGAATTACTTTTTTTGTCAATCA----ACCATCTTTAACCTAAAAGAGTTT GACTGTAAGTGAATTACTTTTTTTGTCAATCATTTAACCATCTTTAACCTAAAAGAGTTT True 0.24111800019683108 49
GACTGTAAGTGAATTACTTTTTTTGTCAATC----AACCATCTTTAACCTAAAAGAGTTT GACTGTAAGTGAATTACTTTTTTTGTCAATCATTTAACCATCTTTAACCTAAAAGAGTTT True 0.16730636748351543 34

As you can see, they are exactly the same except for the A, which can align to either side. This reminds me of the LeftAlignAndTrimVariants module from GATK, which in cases like this simplifies the output by always aligning those bases to the left. Is it possible to improve the alignment in CRISPResso, so they are grouped as the same event?

Thanks
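
The left-aligning idea can be sketched for the simple case shown above: a single contiguous deletion against an ungapped reference. The function names are hypothetical and this is neither GATK's nor CRISPResso's implementation.

```python
def gap_interval(aligned_read):
    """Return (start, end) of the single run of '-' in an aligned read."""
    return aligned_read.index("-"), aligned_read.rindex("-") + 1


def left_align_deletion(reference, start, end):
    """Shift the deleted interval reference[start:end] as far left as possible.

    The shift is allowed while the base entering the interval on the left
    equals the base leaving it on the right, so the resulting sequence
    stays the same.
    """
    while start > 0 and reference[start - 1] == reference[end - 1]:
        start -= 1
        end -= 1
    return start, end


reference = "GACTGTAAGTGAATTACTTTTTTTGTCAATCATTTAACCATCTTTAACCTAAAAGAGTTT"
read_a = "GACTGTAAGTGAATTACTTTTTTTGTCAATCA----ACCATCTTTAACCTAAAAGAGTTT"
read_b = "GACTGTAAGTGAATTACTTTTTTTGTCAATC----AACCATCTTTAACCTAAAAGAGTTT"

# Both alleles normalise to the same deletion interval.
interval_a = left_align_deletion(reference, *gap_interval(read_a))
interval_b = left_align_deletion(reference, *gap_interval(read_b))
```

Grouping alleles by the normalised interval instead of the raw alignment would merge the two rows into one event.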

CRISPRessoPooledWGSCompare - hasnans

Running the CRISPRessoPooledWGSCompare I noticed a syntax bug in the CRISPRessoPooledWGSCompareCORE.py file.

Currently the log file gives the following error:

[Command used]:
CRISPRessoPooledWGSCompare /usr/local/bin/CRISPRessoPooledWGSCompare --name A4_S186_vs_A5_S184 --sample_1_name A4_S186 --sample_2_name A5_S184 --output_folder /home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184 --save_also_png /home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186 /home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184

[Execution log]:


ERROR: 'numpy.bool_' object is not callable

The solution is described here on Stack Overflow: pandas hasnan. I tested the field rather than the method and it worked, giving the following new output:

[Command used]:
CRISPRessoPooledWGSCompare /usr/local/bin/CRISPRessoPooledWGSCompare --name A4_S186_vs_A5_S184 --sample_1_name A4_S186 --sample_2_name A5_S184 --output_folder /home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184 --save_also_png /home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186 /home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184

[Execution log]:
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Fmn1" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Fmn1" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Fmn1" -n2 "A5_S184_Fmn1"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Dntt" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Dntt" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Dntt" -n2 "A5_S184_Dntt"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Ankrd10" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Ankrd10" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Ankrd10" -n2 "A5_S184_Ankrd10"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Mt1" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Mt1" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Mt1" -n2 "A5_S184_Mt1"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Psmd13" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Psmd13" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Psmd13" -n2 "A5_S184_Psmd13"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_Asap1" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_Asap1" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_Asap1" -n2 "A5_S184_Asap1"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_chr10_1" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_chr10_1" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_chr10_1" -n2 "A5_S184_chr10_1"
Skipping sample chr14 since it was not processed in one or both conditions
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_chr13" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_chr13" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_chr13" -n2 "A5_S184_chr13"
Running CRISPRessoCompare:CRISPRessoCompare "/home/aiezza/amplicon_exp/cspresso/A4_S186/CRISPRessoPooled_on_A4_S186/CRISPResso_on_chr10_2" "/home/aiezza/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/CRISPResso_on_chr10_2" -o "/home/aiezza/amplicon_exp/cspresso/A4_S186_vs_A5_S184/CRISPRessoPooledWGSCompare_on_A4_S186_vs_A5_S184" -n1 "A4_S186_chr10_2" -n2 "A5_S184_chr10_2"
All Done!

best,
Alex
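
The `hasnans` fix described in this report comes down to reading an attribute instead of calling it. A minimal sketch, assuming pandas and NumPy are installed; the example values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.5, np.nan, 1.2])

# Correct: hasnans is an attribute and is read without parentheses.
contains_nan = bool(values.hasnans)

# Incorrect: adding parentheses tries to call the boolean result itself,
# which raises a TypeError ("... object is not callable").
try:
    values.hasnans()
    call_failed = False
except TypeError:
    call_failed = True
```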

Running CRISPResso - query

hello Developer,

Thank you for developing this tool.

I am not sure which module I should use for my purpose. We have a CRISPR guide targeted to a region, and we sequenced the genomic DNA (PCR product) covering this region using MiSeq 2x250 + 10bp Index1 + 8bp Index2.
I used two different approaches:

using fastq and
using the bam file (alignment using bwa)

I have attached plots from both approaches, and I am having difficulty understanding them. Could you please help with this?

4b.usingbam.pdf
4b.usingfastq.pdf

Amplicon Sequence

I got this error even though the amplicon sequence looks fine in Excel.

ERROR: The amplicon sequence ??ރh?????m? contains wrong characters: ? ? ? ? ? ! ? " $ ) ? ? . 0 4 H 9 D ? I ? K ? ? S U O ? ? ] ? ? ? ? ? ? ? ? ? ? ?

Thanks,

Steve

Error while generating plots with Cpf1 data

I have been using CRISPResso with Cas9 and Cpf1. So far, all the Cas9 experiments are fine, but when I run the Cpf1 data I get an error when generating the plots. I should say that plots 1a, 1b, 2, 3, 4a, 4b, and 4e are generated. It seems to fail while generating plot 9. BTW I am specifying "--guide_seq" and also "--cleavage_offset 1" when running Cpf1.

Here's the last section of the log file:
....
INFO @ Thu, 28 Jun 2018 14:53:51:
Calculating alleles frequencies...

INFO @ Thu, 28 Jun 2018 14:55:38:
Done!

INFO @ Thu, 28 Jun 2018 14:55:38:
Making Plots...

CRITICAL @ Thu, 28 Jun 2018 14:55:49:
Unexpected error, please check your input.

ERROR: 'N'

FLASH: Low Percent Combined

Greetings!

I have Illumina NextSeq500 150bp, paired end reads, generated from whole shotgun sequencing of environmental samples. These sequences have been quality filtered (Sickle, Phred > 20, default length to keep a read = 20bp). I am now trying to use FLASH to merge the paired end reads. For all of my metagenomes, I get really low numbers (as compared to the FLASH website and paper).

I ran the command as follows for all metagenomes:
./flash -M 150 -o Flash.out Forward.fastq Reverse.fastq | tee Flash.log

The range of Percent combined: Min = 13.04% ; Max = 52.72% ; Ave = 32.11%

I am curious as to why these numbers are so low or if this is considered to be "acceptable."

Many Thanks!!

Deprecation of convert_objects causing fatal error

I'm running CRISPRessoPooled in mixed-mode with the following command:

CRISPRessoPooled \
    --fastq_r1 A5_S184_L001_R1_001.fastq.gz \
    --fastq_r2 A5_S184_L001_R2_001.fastq.gz \
    --amplicons_file amplicons_description.txt \
    --bowtie2_index /data/ref_genome/mouse/musculus \
    --gene_annotations /data/ref_genome_annot/ucsc/mouse/vMM10.annotation.gz \
    --n_processes 4 \
    --name A5_S184 \
    --output_folder cspresso/A5_S184 \
    --save_also_png

This leads to the following output:

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Checking dependencies...

INFO  @ Tue, 12 Jul 2016 19:03:09:

 All the required dependencies are present!

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Amplicon description file and bowtie2 reference genome index files provided. The analysis will be perfomed using the reads that are aligned ony to the amplicons provided and not to other genomic regions.

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Creating Folder /cvri/miano/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Done!

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Merging paired sequences with Flash...

INFO  @ Tue, 12 Jul 2016 19:03:10:
         Done!

INFO  @ Tue, 12 Jul 2016 19:03:10:
         Loading gene coordinates from annotation file: /cvri/data/ref_genome_annot/ucsc/mouse/vMM10.annotation.gz...

INFO  @ Tue, 12 Jul 2016 19:03:11:
         The uncompressed reference fasta file for /cvri/data/ref_genome/mouse/musculus is already present! Skipping generation.

INFO  @ Tue, 12 Jul 2016 19:03:11:
         Aligning reads to the provided genome index...

INFO  @ Tue, 12 Jul 2016 18:48:20:
         Demultiplexing reads by location...

gzip: /cvri/miano/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/MAPPED_REGIONS//*.fastq: No such file or directory
INFO  @ Tue, 12 Jul 2016 18:48:20:
         Reporting problematic regions...

/usr/local/lib/python2.7/dist-packages/CRISPResso/CRISPRessoPooledCORE.py:770: FutureWarning: convert_objects is deprecated.  Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  df_regions=df_regions.convert_objects(convert_numeric=True)
CRITICAL @ Tue, 12 Jul 2016 18:48:20:


ERROR: Cannot set a frame with no defined index and a value that cannot be converted to a Series


~~~CRISPRessoPooled~~~
-Analysis of CRISPR/Cas9 outcomes from POOLED deep sequencing data-
              )                                            )
             (           _______________________          (
            __)__       | __  __  __     __ __  |        __)__
         C\|     \      ||__)/  \/  \|  |_ |  \ |     C\|     \
           \     /      ||   \__/\__/|__|__|__/ |       \     /
            \___/       |_______________________|        \___/


[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 0.9.4

Mapping amplicons to the reference genome...

At this point the program stops executing. I found that if you alter CRISPRessoPooledCORE.py at 771 and 801 to df_regions=df_regions.apply(pd.to_numeric, errors='ignore') this problem goes away yielding these new results:

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Checking dependencies...

INFO  @ Tue, 12 Jul 2016 19:03:09:

 All the required dependencies are present!

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Amplicon description file and bowtie2 reference genome index files provided. The analysis will be perfomed using the reads that are aligned ony to the amplicons provided and not to other genomic regions.

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Creating Folder /cvri/miano/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Done!

INFO  @ Tue, 12 Jul 2016 19:03:09:
         Merging paired sequences with Flash...

INFO  @ Tue, 12 Jul 2016 19:03:10:
         Done!

INFO  @ Tue, 12 Jul 2016 19:03:10:
         Loading gene coordinates from annotation file: /cvri/data/ref_genome_annot/ucsc/mouse/vMM10.annotation.gz...

INFO  @ Tue, 12 Jul 2016 19:03:11:
         The uncompressed reference fasta file for /cvri/data/ref_genome/mouse/musculus is already present! Skipping generation.

INFO  @ Tue, 12 Jul 2016 19:03:11:
         Aligning reads to the provided genome index...

gzip: /cvri/miano/amplicon_exp/cspresso/A5_S184/CRISPRessoPooled_on_A5_S184/MAPPED_REGIONS//*.fastq: No such file or directory
INFO  @ Tue, 12 Jul 2016 19:05:56:
         Reporting problematic regions...

CRITICAL @ Tue, 12 Jul 2016 19:05:56:


ERROR: Cannot set a frame with no defined index and a value that cannot be converted to a Series


~~~CRISPRessoPooled~~~
-Analysis of CRISPR/Cas9 outcomes from POOLED deep sequencing data-

              )                                            )
             (           _______________________          (
            __)__       | __  __  __     __ __  |        __)__
         C\|     \      ||__)/  \/  \|  |_ |  \ |     C\|     \
           \     /      ||   \__/\__/|__|__|__/ |       \     /
            \___/       |_______________________|        \___/


[Luca Pinello 2015, send bugs, suggestions or *green coffee* to lucapinello AT gmail DOT com]

Version 0.9.4

Mapping amplicons to the reference genome...

There is still an error, but it continues to run this time even though all that was fixed was a deprecation. Not sure really if that is a good thing or not...
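
For reference, the replacement of `convert_objects(convert_numeric=True)` discussed above can also be written column by column. This is a sketch with a hypothetical helper name, assuming pandas is installed; it avoids `errors='ignore'`, which newer pandas releases may also flag as deprecated. It only addresses the deprecation and not the remaining "Cannot set a frame" error.

```python
import pandas as pd


def to_numeric_where_possible(df):
    """Return a copy of df with every fully numeric column converted."""
    converted = df.copy()
    for column in converted.columns:
        try:
            converted[column] = pd.to_numeric(converted[column])
        except (ValueError, TypeError):
            # Leave columns that contain non-numeric values unchanged.
            pass
    return converted
```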

Trailing whitespace in the amplicon sequence is not tolerated

Feature request: Trim trailing whitespace in amplicon field (Column 2) of amplicon file.

Background:
I sometimes get the following error when running CRISPRessoPooled in amplicon mode:

ERROR: The amplicon sequence 4-11873390-11873412 contains wrong characters:

It is related to trailing whitespace in the amplicon field of the amplicon file, which seems to happen frequently when Excel is involved in generating these files.
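
Until such trimming exists in the tool, the amplicon field can be cleaned before the file is passed to CRISPRessoPooled. The sketch below is an illustration only: the function name is hypothetical and the set of accepted characters is an assumption.

```python
# Assumed alphabet; adjust if other symbols are acceptable for your data.
VALID_BASES = set("ACGTN")


def clean_amplicon(raw):
    """Strip surrounding whitespace, upper-case, and validate an amplicon."""
    sequence = raw.strip().upper()
    invalid = sorted(set(sequence) - VALID_BASES)
    if invalid:
        raise ValueError("Amplicon contains invalid characters: %r" % invalid)
    return sequence
```

Trailing spaces, tabs and Windows line endings are removed by `strip()`, while characters inside the sequence (including embedded spaces) are still reported.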

Different results from online and command line CRISPResso with same data and parameters

Hi Luca,

I'm running CRISPResso in single read mode normally from the command line, and checked with the online version, and it seems they output considerably different results. Specifically, I'm getting about 50% less NHEJ from the command line version compared to the online version across samples, with the same data and parameters (see example and files attached below). So currently I don't know which to trust.
H6_R2.fastq.tar.gz
H6_R2.fastq.tar.gz

I'm wondering if this is because of the specific installation on my machine, or whether the backends of the online and command-line versions differ, or whether the default parameters between the two differ. Any thoughts?

On another matter, does a CRISPResso forum exist? Sometimes would be better to address the question broadly. Any case, thanks for the great software.

PARAMETERS:
Online:
-Single-end reads
-Seq homology for HDR: 98%
-Window size: 25
-Min average read qual: 20
-Min single bp qual: 20
-Exclude bp from left: 5
-Exclude bp from right: 40
-Amplicon seq: TCGTGCTGCTTCATGTGGTCGGGGTAGCGGCTGAAGCACTGCACGCCGTAGGTCAGGGTGGTCACGAGGGTGGGCCAGGGCACGGGCAGCTTGCCGGTGGTGCAGATGAACTTCAGGGTCAGCTTGCCGTAGGT
-HDR seq: TCGTGCTGCTTCATGTGGTCGGGGTAGCGGCTGAAGCACTGCACGCCGTGGCTCAGGGTGGTCACGAGGGTGGGCCAGGGCACGGGCAGCTTGCCGGTGGTGCAGATGAACTTCAGGGTCAGCTTGCCGTAGGT
-Guide seq: ACCTACGGCGTGCAGTGCTT
-All else default

Command line:
CRISPResso -r1 H6_R2.fastq -g ACCTACGGCGTGCAGTGCTT -a TCGTGCTGCTTCATGTGGTCGGGGTAGCGGCTGAAGCACTGCACGCCGTAGGTCAGGGTGGTCACGAGGGTGGGCCAGGGCACGGGCAGCTTGCCGGTGGTGCAGATGAACTTCAGGGTCAGCTTGCCGTAGGT -e TCGTGCTGCTTCATGTGGTCGGGGTAGCGGCTGAAGCACTGCACGCCGTGGCTCAGGGTGGTCACGAGGGTGGGCCAGGGCACGGGCAGCTTGCCGGTGGTGCAGATGAACTTCAGGGTCAGCTTGCCGTAGGT -o test --exclude_bp_from_left 5 --exclude_bp_from_right 40 --save_also_png -w 25 -q 20 -s 20

[FLASH] ERROR: Maximum overlap (-173) cannot be less than the minimum overlap (4).

Hi:

Thanks for the wonderful tool! But when I used it to evaluate the sgRNA indel with the following script:

CRISPResso -r1 1.fq.gz -r2 2.fq.gz -a amplicon_sequence

there was an error: "[FLASH] ERROR: Maximum overlap (-173) cannot be less than the minimum overlap (4)." It seems that FLASH fails to merge the reads when the paired reads do not cover the amplicon. Is that right? What should I do about this error? Thank you very much!
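
As a rough check of when this error can occur, the sketch below estimates how many bases the two mates of a pair can share, assuming both reads have the same length and the pair spans the amplicon end to end. The function names are hypothetical, and FLASH's own estimate also takes the fragment length standard deviation into account, so the values will not match the log exactly. The minimum overlap of 4 is taken from the error message.

```python
def estimated_overlap(read_length, amplicon_length):
    """Approximate overlap between the two mates of a read pair, in bases."""
    return 2 * read_length - amplicon_length


def can_merge(read_length, amplicon_length, min_overlap=4):
    """Return True when the estimated overlap reaches the minimum overlap."""
    return estimated_overlap(read_length, amplicon_length) >= min_overlap


# 150-bp reads on a 180-bp amplicon overlap comfortably;
# the same reads on a 500-bp amplicon cannot overlap at all.
short_amplicon_ok = can_merge(150, 180)
long_amplicon_ok = can_merge(150, 500)
```

A negative estimate means the amplicon is longer than the two reads combined, so the pairs cannot be merged regardless of the overlap settings.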

Alignment error, please check your input

Dear Luca,

I am encountering the below error when running the command:

CRISPResso
-r1 /work/rpapa/sbelleghem/mutant_miSeq/fastq_trimmed/TL12_R1_paired.fastq.gz
-r2 /work/rpapa/sbelleghem/mutant_miSeq/fastq_trimmed/TL12_R2_paired.fastq.gz
--amplicon_seq ATTGGATCTTAAAAGCTTGGGCTAAGCTCATGTCGACGGTCAGTAATTAGCATTCCGCATATAGTTTACAAAGCATTGCCGTTGTAAATTATTGGAAACTATAATCTTGTGCAAAAACTTGTTTTTTTATAAATATTATAAAATATATTCGTACAGGATTGAAATATAAAAAAAACATATCAGCTGCGAATAAAATTAATAGAGAATAAAAAAATATACTTATATCACAGCGACATATTTATTTTATTCTCTATTTTATTCACATTATATTTTTACTCCATGCCAAATTGATAATAGAATATGAACCTGTAACAACAGTCCTTAAAAATCCAAAACGATTATTAAGTGGTTTAATATTTTTACATAACAACATCAAATAATTTAAATTATATCTATTTCTAGGTAATACAGACAGGTGCTCAACAGGCGGTTGAAGAGTGTCAATACCAATTCCGAAACAGCCGCTGGAACTGCAGCACTGTCGAAAACAGCACTGATATATTTGGAGGAGTACTTAAATTTAGTAAGTAAAAGTTAAATTTTTGATTTAAATTTGTAAATCCTTTTTAATTGACAACCTAAATACTTATTTTTATTTGGATATATTATATAAAAATGTTGGATGAGTTTGGATTCCACTTACTACTTGGCTTCTTGAGCACTAACTTTAAAAATATATAAATTCTATTTGGAAAACGAAAGAAATAAGATTTCAAATGATCTATAACTAACAATTTTTATTATGATAAACCACAAACAACTATACAAAACGATTTACACGTAAAATTAACATATTCTCAACATATTACACAAATAATACTACCGTTAACTCAAAATTGGCATATACATATAAATAAATCTTGAATCATAAAATTCATTTCCGCTCGGATTTCAAGTCAAAGTAAGTTGTAAATTCTCAAATAATTATCGGTTGCATACATCGGCAACTCTTCAAAGGACGTGTTAAGTG
--max_paired_end_reads_overlap 150
--name TL12
--output_folder /work/rpapa/sbelleghem/mutant_miSeq/CRISPResso_out

###############
INFO @ Wed, 05 Jun 2019 13:10:08:
Finished reads; N_TOT_READS: 29195 N_COMPUTED_ALN: 0 N_CACHED_ALN: 0 N_COMPUTED_NOTALN: 6874 N_CACHED_NOTALN: 22321

INFO @ Wed, 05 Jun 2019 13:10:08:
Done!

INFO @ Wed, 05 Jun 2019 13:10:08:
Quantifying indels/substitutions...

INFO @ Wed, 05 Jun 2019 13:10:08:
Done!

CRITICAL @ Wed, 05 Jun 2019 13:10:08:
Alignment error, please check your input.

ERROR: Error: No alignments were found
#############

Would you have any advice on what to check to know what is going wrong? The amplicon should be fine as I can easily find alignable sequences in my fastq files.

Thank you for any help!

Steven
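The counts in the "Finished reads" log line above can be checked with a few lines of Python. This is only a sketch: the helper name `parse_read_counts` is hypothetical, and reading the `CACHED` counters as results reused for repeated read sequences is an assumption, not something stated in the log. It shows that the two `NOTALN` counters add up to the total number of reads, so no read was aligned at all.

```python
import re


def parse_read_counts(log_line):
    """Extract the 'N_NAME: integer' pairs from a 'Finished reads' log line.

    Hypothetical helper for illustration; it is not part of CRISPResso.
    """
    pairs = re.findall(r"(N_[A-Z_]+):\s*(\d+)", log_line)
    return {name: int(value) for name, value in pairs}


# Log line copied from the report above.
line = (
    "Finished reads; N_TOT_READS: 29195 N_COMPUTED_ALN: 0 N_CACHED_ALN: 0 "
    "N_COMPUTED_NOTALN: 6874 N_CACHED_NOTALN: 22321"
)

counts = parse_read_counts(line)
aligned = counts["N_COMPUTED_ALN"] + counts["N_CACHED_ALN"]
not_aligned = counts["N_COMPUTED_NOTALN"] + counts["N_CACHED_NOTALN"]

# Every read falls into the "not aligned" counters.
print(aligned, not_aligned, counts["N_TOT_READS"])
```

Since 6874 + 22321 = 29195, the failure affects every read rather than a low-quality subset, which points to a systematic mismatch between the reads and the amplicon settings.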

Use --min_identity_score for read vs amplicon

Hi,
I am using CRISPResso for a simple amplicon-seq analysis. I initially got an error that no reads aligned, which I traced back to the fact that the amplicon sequence is 500 bp while the reads are only 150 bp. Would it be possible to make min_identity_score apply to the percentage of the read that aligned rather than of the amplicon? Pointers to where I could change this in the code would be appreciated as well. From a quick check, this number is parsed from the needle output, which defines it as the number of identical bases divided by the total alignment length, and that length ends up being approximately the length of the amplicon in the case described above. Maybe we could parse out the identical base count and divide it by the length of the read instead?
Thanks for the help!
-Rahul
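The difference between the two definitions of identity discussed above can be illustrated with a small sketch. It is not CRISPResso's code and it does not parse needle output: the function `identity_scores` is hypothetical and works on two already-aligned strings in which gaps are written as `-`. It assumes, as described in the post, that the alignment spans the whole amplicon.

```python
from typing import Tuple


def identity_scores(aligned_read: str, aligned_ref: str) -> Tuple[float, float]:
    """Return (identity over the alignment length, identity over the read length).

    Both values are percentages. Hypothetical helper for illustration only.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both sequences carry the same base (gaps never match).
    matches = sum(1 for a, b in zip(aligned_read, aligned_ref) if a == b and a != "-")
    # Number of real bases in the read, i.e. alignment columns that are not gaps.
    read_len = sum(1 for a in aligned_read if a != "-")
    if read_len == 0:
        raise ValueError("read contains no bases")

    over_alignment = 100.0 * matches / len(aligned_read)
    over_read = 100.0 * matches / read_len
    return over_alignment, over_read


# A 150 bp read that matches the middle of a 500 bp amplicon perfectly.
amplicon = "ACGT" * 125
read = amplicon[175:325]
aligned_read = "-" * 175 + read + "-" * 175

over_alignment, over_read = identity_scores(aligned_read, amplicon)
print(over_alignment, over_read)  # 30.0 100.0
```

With these lengths a perfect read scores only 30% against the alignment length, so any threshold above that value would reject it, while the same read scores 100% when the match count is divided by the read length.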

Error "'numpy.int64'

Hello @lucapinello ,

I am trying to run CRISPResso locally to genotype clonal cell lines. It stops at two different points depending on the input:

Error 1: "Quantifying indels/substitutions... " reporting:

ERROR: Zero sequences aligned, please check your amplicon sequence

CRISPResso_RUNNING_LOG_Zero_error.txt

or Error 2: "Calculating alleles frequencies... " reporting:

("'numpy.int64' object is not iterable", u'occurred at index 0')

CRISPResso_RUNNING_LOG_numpy.int64_error.txt

Could this be related to a low-quality input fastq file?

I don't think the problem is in the installation, because I have run CRISPResso locally with successful results using higher-quality fastq files.

I would like to send you an example of both kinds of files (those that worked and the current ones) so that you can check whether there is a difference that prevents CRISPResso from running properly. The problem is that I can't attach them because they are too big.

Running machine: Mac (OS X El Capitan)

Thank you so much for your attention,

Andrés Marco
