Coder Social home page Coder Social logo

mosea's Introduction

Motif Scan and Enrichment Analysis (MoSEA)

MoSEA Modules:

(Please refer test_files/run_test_files.sh to run the pipeline and to run MoSEA directly with SUPPA events).

I. getfasta: Extract fasta sequences from given bedfile coordinates.

Required Input: 
		 --bedfile : Bed file in atleast 6 columns format, Default:strandness is TRUE
		 --genome : Genome Fasta File sequence in single line, eg. hg19.fa
Optional Input:
		 --output : Output file name, Default: STDOUT
		 --bedtoolspath : Bedtools bin path, Default: search for 'bedtools' in local path

Output: 1. Sequence Fasta file

Usage: 
python mosea.py getfasta --bedfile sample.bed --genome /path/to/hg19.fa --output sample_output.fa
Usage (with optional parameter):
python mosea.py getfasta --bedfile sample.bed --genome /path/to/hg19.fa --output sample_output.fa --bedtoolspath /path/to/bedtools/bin/bedtools

II. scan: Scans fasta Sequences for given motifs (PFMs)

Required Input: 
		 --scan : Scans for given motif (this option is always true)	

                     --pfm : If selected; Scans for motif using PFMs
                     --pfm_path : Path to folder where PFM matrices of motifs are stored 
                    
                     --kmer : If selected; Scans for motif using k-mers 
                     --kmer_path: Path to k-mer(s) folder, (Note: kmers file should be saved with *.kmer extension, in fasta format or just one kmer in each line. Motif-label will be taken from file_name before ext (eg. for file 'SRSF1.kmer', label will be 'SRSF1').

		 --fasta : Input Fasta file (Sequences in Fasta file format to scan motifs)
		 
		
Optional Input:
		 --count : Also creates count_file in tabular format for all motifs found on each sequences.
		 --out_dir : Output Directory (path to output directory, Default: creates fmo/ directory in current working path)
		 --fmo_path : Path to FIMO binary folder, Default: Search for 'fimo' in local path

Output: 1. fmo output files for each motif in out_dir/
		2. Summary count_file table for all motifs and all sequences in one file (if --count)}

Usage: 
python mosea.py scan --pfm --pfm_path /path/to/motif_pfm_dir/ --fasta sample_output.fa
or
python mosea.py scan --kmer --kmer_path /path/to/motif_kmer_dir/ --fasta sample_output.fa 

Usage (with optional parameter):
python mosea.py scan --scan --pfm --pfm_path /path/to/motif_pfm_dir/  --fasta sample_output.fa  --out_dir /path/my/fmo_output_dir/ --fmo_path /usr/local/bin/fimo --count

III. enrich : Motif enrichment test analysis between given regulated and background(control) files

Required Input: 
		 --reg_fa_file : Input file for regulated sequences in fasta format
		 --reg_count_file : Input file for regulated motif counts in tab seperated format(header=Motif_ids)
		 --bg_fa_file : Input file for background (control) sequences in fasta format
		 --bg_count_file : Input file for background (control) motif counts in tab seperated format(header=Motif_ids)
		
Optional Input:
		 --out_file : Creates output for enrichment test analysis, Default: output_MoSEA_enrichment.txt
		 --rand_count : Number of times randomization performed for each regulated sequence. Default:100
		 --match_len : Match length of background sequences to each regulated sequence(1=True, 0=False). Default: 0 (False)
		 --len_ext : Percent range of length extension on both side boundaries, to match with background sequences.
		 			 This option is valid only when '--match_len 1'. 
					 Default: 15 (15% length extended on both sides)

Output: 1. Enrichment test output file for each motif with zscore 

Usage: 
python mosea.py enrich --reg_fa_file reg_sample_output.fa --reg_count_file reg_sample_output.space
					   --bg_fa_file bg_sample_output.fa --bg_count_file bg_sample_output.space
					 
Usage (with optional parameter):
python mosea.py enrich --reg_fa_file reg_sample_output.fa --reg_count_file reg_sample_output.space
					   --bg_fa_file bg_sample_output.fa --bg_count_file bg_sample_output.space
					   --out_file my_output_file.tab --rand_count 200 --match_len 1 --len_ext 20

Run MoSEA on Suppa events to get enriched Motifs

Requirements:

- 	Python 2.7 (numpy, bio, pandas, StringIO, collections, math)
- 	Bedtools
- 	Fimo (MEME-Suite version >version 4.11 or higher)
- 	Genome fasta sequence (\*.fa and \*.fai files)
- 	Motif PFMs or Kmers file or directory 

Input files:

- 	Suppa event ids for regulated set (or Bedfile cordinates)
- 	Suppa event ids for control set (or Bedfile cordinates)

The analysis is divided into two steps, A) Motif Scan B) Enrichment

For Analysis (A): Motif Scan Steps:


Input: 2 files:

  • Regulated & Control event ids (Suppa events or Bedfile format)

Steps: Input_file -> Extract Sequence -> Scan sequences for Motif -> Create count table Requirement: Genome Fasta file (hg19.fa), Bedtools, FIMO from MEME suite

For Analysis (B): Enrichment


Input: 4 files: (All these files are generated in Analysis A)

  • Regulated fasta sequences file
  • Regulated motif count table
  • Control(Background) fasta sequence file
  • Control motif count file

Steps: For each regulated sequence -> create pools of Control seqs matching GC content and length of seq -> randomize 100 times -> count motifs on reg & control -> calculate z-score by observe(reg) vs expected (distribution from control) ((x-mean)/SD)

To test run call this script from MoSEA directory:

cd MoSEA/
./test_files/run_test_files.sh (Please point the paths to appropriate directories)

Analysis (A): Motif Scan

###Commands:

Step1 : Convert Suppa events to bedfile format

(This step is not necessary if cordinates are already in bedfile format. Go to step 2 directly)

$path_python ./mosealib/suppa_to_bed.py --ifile $reg_infile --event SE --ext 200 --ofile $bedfile_reg

(Ext : Extension is upstream and downstream regions from the center Exon)

Extension img

Step2: Convert Bedfile to Fasta File

$path_python mosea.py getfasta --bedfile $bedfile_reg --genome $path_genome --output $fafile_reg

Step3: Scan fasta sequences for Motifs (Example shown for PFMs)

fmopfm_outdir="fmo_pfm"  #output dir for scanned motifs
$path_python mosea.py scan --pfm --pfm_path $path_pfms --fasta $fafile_reg --out_dir $fmopfm_outdir --fmo_path $path_fimo --count

Repeat above three steps for control files as well. Ideally, number of events in control set must be atleast 100x more than regulated set.

Analysis (B): Motif Enrichment Analysis

Required Input: 4 files (2 regulated files: fasta & count_table, 2 control files: fasta & count_table) generated in Analysis (A).

Important: Only in case of running SUPPA events, please seperate variable region types prior to each enrichment analysis. For eg.

  1. To seperate upstream region of SE event from fasta file (regulated and control):
grep -A1 ';up' regfile.fa >events_SE_up_reg.fa
grep -A1 ';up' controlfile.fa >events_SE_up_control.fa
  1. To seperate upstream region of SE event from count file (regulated and control):
cat reg_file_count | head -1 >up_reg_file_count; grep ';up' reg_file_count >>up_reg_file_count
cat control_file_count | head -1 >up_cont_file_count; grep ";up" control_file_count >>up_cont_file_count

Repeat it for all regions. (Please see run script test_files/run_script_test_all_events.sh for more information)

Step4: Perform Enrichment

Regulated files : Fasta sequences and Motif count table (per region type)
- reg_file_fa
- reg_file_count
	
Control files : Fasta sequences and Motif count table (per region type)
- control_file_fa
- control_file_count

#Zscore Output file name
outfile="zscore_outfile.tab"

#perform enrichment
$path_python mosea.py enrich --reg_fa_file $reg_file_fa --reg_count_file $reg_file_count \
                       --bg_fa_file $control_file_fa --bg_count_file $control_file_count \
		     	--out_file $outfile


To run the pipeline all at once, please run the dummy script in test_files/run_test_files.sh To run extended version of all SUPPA events with variable region, please run script: test_files/run_script_test_all_events.sh

mosea's People

Contributors

babi2305 avatar babisingh avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

mosea's Issues

ctrl_events_ids & reg_events_ids files

Dear Eduardo,

First of all, thank you so much for sharing your expertise with us!

I have a simple question about the two input files used in step A of the anlysis.
these files in this file : MoSEA/test_files/infile/
control_events_chr22.ids
reg_events_chr22.ids

I know that they contain event ids detected by SUPPA, but I just want make sure I correctly understood the difference between them:
Does this file "control_events_chr22.ids" contain all SE events regardless if they are or not significant?
Does the second file "reg_events_chr22.ids" contain regulated SE events found to be significant according to SUPPA pipeline?

if not, could you please clarify what these files are?
Thank you so much in advance!
Respectfully,
Jamal.

strand information for MoSEA not using SUPPA2 events

Dear team,
May this email finds you all fine.

I have a question related to using MoSEA using coordinates from a tool different than SUPPA2.
my question is should I provide the strand information when extracting the sequences ? My bed file looks lik the following:
chr18 63127035 63128759 BCL2_E1
chr18 63123346 63127034 BCL2_E2
chr18 63126835 63127035 BCL2_U1
chr18 63126834 63127034 BCL2_U2

Providing the above file to I got the sequences. When I scan the sequences for the occurrence of the RBP binding motif i got something like:
#pattern name sequence name start stop strand score p-value q-value matched sequence
HNRNPL_00091 BCL2_E1 794 800 + 8.13415 0.000743 ACACAAT
HNRNPL_00091 BCL2_E1 1260 1266 + 10.0671 7.11e-05 ACACGAA
HNRNPL_00091 BCL2_E2 87 93 + 10.0549 0.000159 ACACAAA
HNRNPL_00091 BCL2_E2 1122 1128 + 9.96951 0.000413 ACACAAG
HNRNPL_00091 BCL2_E2 1426 1432 + 8.2378 0.000536 ACACCAC
HNRNPL_00091 BCL2_E2 1877 1883 + 7.56098 0.000996 ACACAGA

Looking at the strand column the tool reports a sequence on the positive strand, while my gene BCL2 is on the reverse strand "ensembl location: Chromosome 18: 63,123,346-63,320,128 reverse strand".
I'm using the hg38 genome assembly to extract sequences.

I will be very thankful if you can help me to fix this issue.
Thank you so much in advance!
Kind regards,
Jamal.

mosea.py scan

Hello I have been trying to run MoSEA/mosea.py scan on the test files and I get this error.
python MoSEA/mosea.py scan --pfm --pfm_path MoSEA/test_files/motifs/pfms/ --fasta fafile_reg --out_dir fmopfm_outdir --count
scanning Motifs on file: fafile_reg
121/121[==================================================] 100%
Scanned 121 motif(s). Output saved in dir: fmopfm_outdir
('fafile_reg', 'MoSEA/test_files/motifs/pfms/', 'fmopfm_outdir')
ERROR:
Counting Motifs on file: fafile_reg
1/38[= ] 2%
Error in parsing: "['sequence name'] not in index"
I understand that the issue must be with parsing pfm files as the error comes from the count_motif function but I don’t understand why.

error trying plot_script.R

Dear all,
Thank you for sharing with us the R script to make perfect visualization of MoSEA outputs.

While trying the script on the test file I got the following error:
Attaching package: ‘reshape2’

The following objects are masked from ‘package:reshape’:

colsplit, melt, recast

cancer RBP_id regulation location robust_zscore exp_log2fold

633 LUSC CELF1 positive down 2.195866 0.3741492
129 LUSC CELF1 positive up 2.194584 0.3741492
634 LUSC CELF2 positive down 2.195866 -2.9128859
130 LUSC CELF2 positive up 2.194584 -2.9128859
635 LUSC CELF3 positive down 2.195866 -0.3538104
131 LUSC CELF3 positive up 2.194584 -0.3538104
cancer RBP_id regulation location robust_zscore exp_log2fold
633 LUSC CELF1 positive down 2.19586557294325 0.3741492
129 LUSC CELF1 positive up 2.19458434304226 0.3741492
634 LUSC CELF2 positive down 2.19586557294325 -2.9128859
130 LUSC CELF2 positive up 2.19458434304226 -2.9128859
635 LUSC CELF3 positive down 2.19586557294325 -0.3538104
131 LUSC CELF3 positive up 2.19458434304226 -0.3538104
group
633 Upregulated
129 Upregulated
634 Downregulated
130 Downregulated
635 Downregulated
131 Downregulated
Error in -(robust_zscore) : invalid argument to unary operator
Calls: with -> with.default -> eval -> eval -> ifelse
Execution halted

I run the following command while specifying the minimum zscore at 1.96
Rscript plot_script.R ./test_file.tab ../plot_script 1 3 0.2 ./test_plot.png test_heatmap

Any help would be much appreciated!
Thank you in advance!

Best regards,
Jamal.

options --match_len 1 --len_ext 20

Dear All,

In the demo of MoSEA is mentioned that we can run the enrich part for detecting enriched motifs using the two options --match_len 1 --len_ext 20.

When I run the tool with the above options I got an error saying they are unrecognized arguments.

Could you please tell me whether they were omitted from the mosea.py script? if so, how can I add them to be able to control for length differences between the bg and regulated sequences?

Thank you so much in advance!
Best regards,
Jamal.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.