Coder Social home page Coder Social logo

hungyok / nuctf_equi_bai Goto Github PK

View Code? Open in Web Editor NEW
0.0 1.0 0.0 394.99 MB

Statistical equilibrium models for computing occupancy of nucleosome and transcription-factors.

MATLAB 100.00%
nucleosome-positioning remodeling-factors statistical-equilibrium-models optimization tf-binding-motifs

nuctf_equi_bai's Introduction

Nucleosome & TF occupancy using SEMs

We present a computational method to compute nucleosome and transcription factor (TF) occupancies using statistical equilibrium models (SEMs). For complete details on the use and execution of this method, please refer to Kharerin & Bai, 2021.

Input data

Genomic sequence

Download the 16 chromosomes (12071326 bp) of Saccharomyces cerevisiae (S288C, genome version R64-3-1) in fasta format from SGD https://www.yeastgenome.org/. The downloaded genomic sequence can be found here.

TF motif database

File "listbai_all.txt" contains the name and motif size of 104 TFs that are found in the budding yeast nuclei in normal YPD growth condition (Yan, Chen & Bai). File "wmsbai_data_all.txt" shows the combined PWMs of these 104 TFs in tandem (same ordering as in listbai_all.txt), with the four columns representing the propensity of finding A/C/G/T at each motif position. The recommended score cutoff is shown in "PWM_TF_cutoff.txt". These files can be found in the folder tf_energy_all.

Nucleosome occupancy data

The original nucleosome occupancy data “analyzed_data_complete_bw20.txt” was provided in Lee et al.’s supplementary data (Lee, et al., 2007). The first four sequences are non-genomic DNA and excluded them from the dataset. We rename the new file as "raw_lee_chr1_16.txt" which has chr1 to chr16. This data, which are in log scale, were further normalized, linearized, and saved as "yy1_lee.mat". The normalization and linearization can be conducted with the following code:

> expY = importdata('yourpath/transf_yy_data/raw_lee_chr1_16.txt');
> [c,γ] = transf_yy_data(expY);
> d_lee=importdata('yourpath/transf_yy_data/skimpped_data.txt');
> d_indx=importdata('yourpath/transf_yy_data/skimpped_data_index.txt');
> [x1_lee,y1_lee] = linear_occup(expY,c,γ,d_lee,d_indx);
> save yy1_lee.mat x1_lee y1_lee;

The MATLAB code transf_yy_data.m and linear_occup.m can be found in the folder transf_yy_data. The final occupancy "yy1_lee.mat" can be found in the folder dataFolder.

Processing binding energy and NDR

Nucleosome energy

The sequence-dependent binding energy for nucleosome is obtained using the software developed by Kaplan & Segal (Kaplan, et al., 2009), which can be downloaded from the following link:

https://github.com/KaplanLab/NucleosomeModel

The link has all the instructions on how to use the software. Briefly, download the code “NucleosomeModel-master.zip” on Linux 64-bit machine. Enter the following commands on the terminal to generate the energy files:

> make install
> perl nucleosome_prediction.pl -raw_binding -t example -s input.fa -p raw_output -tab

Where input.fa is the DNA sequence in fasta format. One sequence per file, e.g., chr1.fa has sequence for chromosome 1 only. To be consistent with our codes, we suggest all the sequences be in caps. The raw_output.tab is the output energy (log-score) file in tab format (can easily be open by any text editor like "notepad"). This energy is stored as E_Em.mat in the folder nuc_energy.

TF energy

The binding energy of a TF is obtained by scanning its PWM along the genomic sequences (both forward and reverse-complement strands) and converting into position-dependent energy. The maximum of the two possible energies at a position is the TF energy.

The code “tf_binding_pot.m” uses the genome sequence, wmsbai_data_all.txt, and listbai_all.txt to obtain the energy profiles. Run the commands below to get all the 104 TF energies:

> path1 = 'yourpath/tf_energy_all/sgd_genome/'; % enter full path
> path2 = 'yourpath/tf_energy_all/Etf_allmat_chr/'; % enter full path
> [Emtf,k] = tf_binding_pot(path1,path2);

The genomic sequence (e.g., sgd_chr1.fa for chromosome 1 and so on) from the folder “sgd_genome” is fed using the input parameter “path1”. The code then spews out TF energy profiles for all the chromosomes (Etf_chr1.mat to Etf_chr16.mat) and the average TF energies (Emtfall.mat), which are stored in the folder “Etf_allmat_chr” indicated by “path2”. Note that we have reported here only three energy files, Etf_chr1.mat to Etf_chr3.mat, due to large file sizes! Each energy file (e.g., Etf_chr1.mat for chromosome 1) is a matrix file where it contains the binding energy for each TF at different chromosomal locations. The code, the subfolders, and the input/output files can be found in the folder tf_energy_all.

NDR annotation

To call and locate nucleosome-depleted-regions (NDRs) from the genome-wide nucleosome occupancy data, we carry out the following two steps:

ndr_cut.m: Using “yy1_lee.mat” as input data, this program identifies regions below certain occupancy threshold. In our current approach, we used eight different thresholds: threshold(i)=0.8-i×SD where i=1,2,…,8, and SD is the standard-deviation calculated from genome-wide nucleosome occupancy (yy1_lee.mat), which yield eight output files: NDR_1.mat, NDR_2.mat, …, NDR_8.mat. Set SD=0.0678 (default) and run the following code with input i=1,2,…,8.

> load yourpath/ndr_call/dataFolder/yy1_lee.mat;
> % step above generates two variables, x1_lee and y1_lee
> NDR_i = ndr_cut(x1_lee, y1_lee, i); 

ndr_pos_cal2.m: This program compares the boundaries in NDR_1.mat, NDR_2.mat, etc., picks the NDRs with steep edges, merges nearby NDRs, and force the nucleosome occupancies in NDRs to be zero. Copy all the NDR_i.mat into the "dataFolder":

> path1 ='yourpath/ndr_call/dataFolder/';
> [yy3A, ndrpos_chrA] = ndr_pos_cal2(path1);

The final output includes the NDR positions (ndrpos_chrA.mat) and the modified nucleosome occupancy (yy3A.mat). The occupancy, yy3A.mat, is renormalized to 80% and is recorded as yy3A_lee.mat. Use this final occupancy for future model optimization. These codes and the input/output data can be found in the folder ndr_call.

SEM and optimization

The SEM partition function is a summation of Boltzmann weight: ct/Net/NEt/N where Et/N stands for the binding energy of nucleosome (N) or TF (t) at a given genomic location. The scaling factors ct/N and γt/N are the free parameters of the model. We use a modified Nelder-Mead simplex algorithm with Simulated Annealing to optimize c and γ by minimizing the root-square-mean deviation (RMSD) of nucleosome occupancy between the modified expiremental data (yy3A_lee.mat) and our model.

NucTF (nucleosome model considering TFs)

To evaluate individual TF contribution to NDRs, we first optimize (c,γ) for an individual TF when only the concerned TF is present in the model.

Open the folder “simplexM_tf1” to find:

  1. Input and output files that are saved in folder “input” and “output” respectively.
  2. A log file “log.txt” to record the optimization process.
  3. simplex_SA_singleTF.m, the main simplex engine that executes the parameter optimization.
  4. occupxfunc.m, a sub-function that computes nucleosome occupancy and RMSD.
  5. sortf.m, a sub-function that sorts RMSDs

Edit directory paths:

> path1 ='yourpath/nuc_energy/'; 
> path2 = 'yourpath/tf_energy_all/';    
> path3 ='yourpath/NucTF/simplexM_tf1/'

Run the optimization code:

> simplex_SA_singleTF(tfx,path1,path2,path3);

Input parameters:

  1. The "tfx" is the index of the TF of interest listed in "lisbai_all.txt". It is an integer from 1 to 104.
  2. The "path1" loads nucleosome energy “E_Em.mat” located in the folder nuc_energy.
  3. The "path2" loads TF energy files Etf_chr1.mat to Etf_chr16.mat, and Emtfall.mat located in the folder Etf_allmat_chr. It also loads listbai_all.txt, a list of 104 TF names and motif sizes, located in the folder tf_energy_all.
  4. The "path3" loads yy3A_lee.mat and rand_genome.mat, where the former is the reference nucleosome occupancy data, and the latter includes a subset of chromosomes that accounts for 70% of the genome (we tune the parameters on 70% of the genome and use the rest 30% as an independent test). Both yy3A_lee.mat and rand_genome.mat are in the folder simplexM_tf1/input. The path3 also loads tf_xhIIa.mat (simplexM_tf1/input/initialp), a vector of eight numbers: an initial guessed values of (cNN;ctt) (the first four numbers) and corresponding small deviations about the latter values (the last four numbers).

Output files:

  1. simplex_xval.txt stores sorted (c,γ).
  2. simplex_fval.txt stores sorted RMSDs.
  3. simplex_tval.txt stores latest values of difference between the best and worst RMSDs, temperature, and RMSD.
  4. simplex_Temp.txt reports the content of “simplex_tval.txt” per simplex step.

The file simplex_xval.txt is a 4x5 matrix of c and γ. The four rows of the matrix are cN, γN, ct, and γt, respectively. The first column of the matrix is our optimized parameters. These files are exported into the folder output.

After carrying out the above one-TF optimization for all the TFs, we run "occupx.m" using the optimized parameters to get the final nucleosome occupancy and the predicted NDR probability (PNDR) (see below for details). We rank the 104 TFs based on their contribution to PNDR and store the ranking in another file called "tfindx.txt".

Below, we describe a model where we incorporated the top 30 TFs in tfindx.txt. Accordingly, we have in total of 62 unknown parameters (31 pairs of (c,γ)'s), of which 60 is for TFs and two for nucleosome.

Open the folder “simplexM_top30” to find:

  1. Input and output parameters that are saved in folder "input" and "output" respectively. The folder "output" contains similar files as in simplexM_tf1. We also put some optimized parameters into "output_optimized" for comparison purpose. The "output" is the running folder where every output is dumped here. The output for real or trial run, abrupt kill, or mistakes by users are dumped here.
  2. A log file “log.txt” to record the optimization process.
  3. simplex_SA_multiTF.m, the main simplex engine that executes the parameter optimization.
  4. occupxfunc.m, a sub-function that computes nucleosome occupancy and RMSD.
  5. sortf.m, a sub-function that sorts RMSDs.

Edit directory paths:

> path1 ='yourpath/nuc_energy/'; 
> path2 = 'yourpath/tf_energy_all/';    
> path3 ='yourpath/NucTF/simplexM_top30/'; 

Run the optimization code:

> simplex_SA_multiTF(tfx,path1,path2,path3) ;

Input parameters:

Here we have four input parameters: tfx, path1, path2, and path3. The functions of these inputs are the same as described in the previous section, except that tfx here is a 1D array of TF indices. For example, to consider top 30 TFs that contribute to NDRs, the tfx should be the first 30 numbers listed in the "tfindx.txt" file. Path3 loads "top30_xhIIa.mat" which is a 1D array containing the initial guessed values of (c,γ)s of nucleosome and the 30 TFs. Here, the first 62 elements are the (c,γ)s and the second 62 are the small displacements needed to build simplex vertices. This file can be found in the folder initialp.

Output files:

  1. simplex_xval.txt stores sorted (c,γ).
  2. simplex_fval.txt stores sorted RMSDs.
  3. simplex_tval.txt stores latest values of difference between the best and worst RMSDs, temperature, and RMSD.
  4. simplex_Temp.txt reports the content of “simplex_tval.txt” per simplex step.

Similar file description as in simplexM_tf1 except that the file simplex_xval.txt is a 62x63 matrix of c and γ with the first and second rows of the matrix are cN and γN respectively, and the rows 3–32 are ct’s and rows 33–62 are γt’s. The first column of the matrix is our optimized parameters and will be used to determine the final genome-wide occupancy profile.

Number of search steps at a given temperature "M" in simplex_SA_multiTF.m can be changed to adjust the optimization quality/time. Currently, at each of the five “temperature” (which controls the parameter search range), we gradually decrease the number of steps to be M=100, 83, 67 ,50, and 33. Larger M may lead to better convergence but cost more computational time. We run simplex_SA_multiTF.m code nine times, each time using the parameters optimized from the last round as the initial parameters. One simplex step takes ∼ 125 seconds. We recommend users to run a cumulative of at least ∼ 3000 simplex steps ∼ 4 days of simulations.

Calculating the final nucleosome occupancy:

Open the folder occup_profile and run the code to compute the occupancy profiles:

> path1 ='yourpath/nuc_energy/'; 
> path2 = 'yourpath/tf_energy_all/';    
> path3 ='yourpath/ndr_call/';
> load yourpath/NucTF/simplexM_top30/output_optimized/simplex_xval.txt;
> xfit = simplex_xval (:,1); TF = 1; Eseq = 1;
> load yourpath/NucTF/tf_energy_all/tfindx.txt;
> tfx = tfindx(1:30);
> [O] = occupx(TF,Eseq,xfit,tfx,path1,path2,path3);
> save O.mat O; 

The logical parameters “TF” and “Eseq” can take on the value of 0 or 1. For example, TF=1 means that TF information is considered in the model otherwise TF=0, and Eseq=1 means that nucleosome sequence-specific energy is considered and otherwise set Eseq=0. The input simplex_xval (:,1) contains the optimized parameters, and tfx represents the TFs taken into consideration. The three paths direct "occupx.m" to load the nucleosome energy, TF energy, TF index lists, and NDR coordinates. The output is an occupancy data, O.mat, comprising of 16x1 cell arrays (one for each chromosome), which includes the occupancies for nucleosome and each TF considered in the model. This program also generates "log.txt" file, which reports the model performance (RMSD, PNDR, AUC, etc.).

The output and figure of two example cases when "TF=0, Eseq=0" and "TF=0, Eseq=1" can be found in the folders “example1” and “example2”, respectively. The output of a model example with TF=1 (top 30 TFs) and Eseq=1 can be found in the folder “example3”. All the codes and input/output files and folders can be found in the folder simplexM_top30.

NucRemod

This version of the model considers both the effect from TF binding and nucleosome remodeling (NR). We assume that the action of NR is to deform the nucleosome energy landscape near the TF-bound sites in a TF occupancy-dependent manner. NR is present only when TF occupancy > tfcut (set tfcut=0.0022 computed as a genome-wide average of occupancy for all TFs in the model NucTF). We assume that an NR modifies the energy landscape as a Gaussian deformation with height “h” in kT/0.212 and width “w” in bp (free parameters). In the presence of multiple TFs adjacent to each other, we add up individual deformation energies.

Open the folder “simplexM_remod” to find:

  1. An “input” folder containing a file “pos_octf.mat” that records positions and occupancy of TFs when the TF occupancy cutoff is tfcut=0.0022. It is generated by the code “occup_tfs.m” by running the following steps:
> path1 ='yourpath\nuc_energy\'; 
> path2 = 'yourpath\tf_energy_all\';   
> tfcutx = 0.0022;
> load yourpath/NucRemod/simplexM_remod/input/avgxval_top30.txt; 
> xfit=avgxval_top30.txt; 
> load yourpath/NucTF/tf_energy_all/tfindx.txt;
> tfx = tfindx(1:30);
> [pos_octf] = occup_tfs(xfit,tfx,tfcutx,path1,path2);
> save pos_octf.mat pos_octf;

The format of pos_octf.mat is a 16x3 cell matrix. Each row in the cell represents a different chromosome (first row is chr1, 2nd row is chr2, and so on). First column represents TF binding position, while the 2nd and 3rd column represents the occupancy and identity of the TF. This file is used by occupR.m, occupRfunc.m, and tf_cluster.m.

  1. Two output folders for fitting parameters: "output" and "output_optimized". Parameters generated during the fitting process will be stored in "output". While "output_optimized" contains some precomputed best fitting parameters.
  2. An output log file “log.txt” that records the optimization process.
  3. simplex_SA_remod.m, the simplex engine with Simulated Annealing.
  4. occupRfunc.m, a sub-function that calls tf_cluster.m and occup_nucs.m to construct remodeling deformation energy, calculate nucleosome occupancy, and compute RMSD. This step is repeated nx = 10 times to allow combinatorial binding of multiple TFs (see step below). The final nucleosome occupancy represents an average between these configurations. Larger nx can be used to achieve better averaging.
  5. tf_cluster.m, a sub-function that gathers the binding configurations of TFs using the information provided by "pos_octf.mat". For isolated TFs, the binding status is determined by comparing their occupancy (from model NucTF) to a random number. When multiple TFs bind adjacent to each other, we allow maximally two TFs to overlap, and their individual binding status are determined by the same random number generator. The output of this function was used to generate an altered energy landscape in occupRfunc.m.
  6. occup_nucs.m, a sub-function that computes nucleosome occupancy (one chromosome at a time) using the altered energy landscape.
  7. sortf.m, a sub-function that sorts a set of RMSDs.

Edit directory paths:

> path1 ='yourpath\nuc_energy\'; 
> path2 = 'yourpath\tf_energy_all\';    
> path3 ='yourpath\NucTF\simplexM_top30\';
> path4 ='yourpath\NucRemod\simplexM_remod\';

Run the optimization code:

> simplex_SA_remod(tfx,path1,path2,path3,path4);

Note that tfx is the same index array as in "simplexM_top30" in model NucTF. The input and output files are same as in NucTF, except here the file simplex_xval.txt is a 4x5 matrix. The first column of the matrix (cN, γN, h, and w) is our optimized parameters and the TF parameters (ct, γt) obtained in NucTF are used to calculate the occupancy profiles. The prescribed set of M used here is M=12,12,10,10,6. We repeated step 3 four times and each time starting from the last output. One simplex step takes ∼ 1 hour. We recommend users to run a cumulative of at least ∼ 200 simplex steps ∼ 8 days of simulations.

Calculating the final nucleosome occupancy:

Open the folder occup_profile and run the code to compute the occupancy profiles:

> path1 ='yourpath\nuc_energy\'; 
> path2 = 'yourpath\tf_energy_all\';    
> path3 ='yourpath\NucRemod\simplexM_remod\';
> path4 ='yourpath\ndr_call\';
> load yourpath/NucRemod/simplexM_remod/output_optimized/simplex_xval.txt;
> remod = simplex_xval(:,1); % load the best-fit remodeling parameters
> load yourpath/NucTF/simplexM_top30/output_optimized/simplex_xval.txt;
> xfit = simplex_xval(:,1); % load the best-fit TF parameters
> load yourpath/NucTF/tf_energy_all/tfindx.txt;
> tfx = tfindx(1:30);
> [O] = occupR(remod,xfit,tfx,path1,path2,path3,path4); 

The output file “O.mat” is a 16x1 cell. Here we report only nucleosome occupancy per cell and can be found in the folder “occup_profile”. The folder also contains a figure showing comparison between a model with TF only and a model with TF+remodeling, and a plotting script. All the codes and input/output files and folders can be found in the folder simplexM_remod.

Output data

The main output files from this model include:

  1. A list of annotated NDRs "ndrpos_chrA.mat".
  2. The ranking list of TFs based on their contribution to NDRs "tfindx.txt".
  3. Best fitting parameters for nucleosome and TF binding energy "simplex_xval.txt".
  4. The output files "O.mat" without remodeling effect (NucTF) can be found in the folders example1, example2, and example3. The output file "O.mat" with remodeling effect (NucRemod) can be found in occup_profile.

nuctf_equi_bai's People

Contributors

hungyok avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.