Linguistic Hawkes Process Model Codebase

This Python codebase implements a discrete Hawkes process model for linguistic pattern analysis. Powered by a Dask backend, this repo is:

  1. Scalable: Can be deployed on a single machine or a cluster, to model short or long processes alike.
  2. Fast: Highly parallelized, with extra speed-oriented fitting options.
  3. Extendable: Can be used in contexts beyond linguistic analysis; only proper input data formatting is required.

Run the code

  • Create your own corpus. The corpus file should be in .json format, where each key is a token string (of, the, ...) and each value is a list of integers recording the time stamps at which that token occurs in a book. A utility that automates corpus creation from a text file is provided in utils/scanner.py, which uses spacy-en_core_web_sm as the default tokenizer. The default token-position counting pipeline is shown below (ignored characters, marked with \, receive no time stamp):
[beginning of file] To  be  or  not  to  be ,  that  is  the  question .
[lower case]        to  be  or  not  to  be ,  that  is  the  question .
[time stamp]        0   1   2   3   4   5   \  6     7   8    9        \
[example result]    {'to': [0, 4], 'be': [1, 5], 'or': [2], ...}
from utils.scanner import Scanner
ignore_list = [' ', '.', ',']   # characters to exclude when counting token positions
scanner = Scanner(ignore_list)
corpus, token_cnt = scanner.count_pos('Hamlet.txt')
# corpus: Dict[str, List[int]], mapping from token string to occurrence positions
# token_cnt: int, total number of tokens (ignore_list excluded)

Feel free to create your own custom corpus with another tokenizer, as sketched below.
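For instance, here is a minimal hand-rolled corpus built with a naive whitespace tokenizer (illustrative only; punctuation handling is skipped, and any tokenizer works as long as the output matches the format above):

import json
from collections import defaultdict

corpus = defaultdict(list)
with open('Hamlet.txt') as f:
    tokens = f.read().lower().split()   # naive whitespace tokenization
for pos, tok in enumerate(tokens):
    corpus[tok].append(pos)             # token -> list of occurrence positions
token_cnt = len(tokens)

with open('corpus.json', 'w') as f:
    json.dump(corpus, f)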

  • Entry Point
$ python main.py -h       # display options help menu
$ python main.py --word <word> --corpus-path <path> --total-word-num <N> \
                 --ckp-save-path <path> \
                 --epoch <num_epochs> \
                 --X-bandwidth <bw> --lag-bandwidth <bw>
  • Args Spec
    • word: target word in the book to consider
    • corpus-path: path to the corpus .json file
    • total-word-num: total number of words/tokens in the book
    • ckp-save-path: path at which to save the fitted checkpoint
    • X-bandwidth: bandwidth used when estimating the distribution of occurrences
    • lag-bandwidth: bandwidth used when estimating the distribution of occurrence lags, i.e., the differences between pairs of occurrences
    • epoch: number of fitting epochs
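A concrete invocation might look like this (all values are illustrative):

$ python main.py --word the --corpus-path corpus.json --total-word-num 32000 \
                 --ckp-save-path ckp/the_hamlet.pkl \
                 --epoch 50 \
                 --X-bandwidth 1000 --lag-bandwidth 10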

Workflow

The conditional intensity function $\lambda(t)$ is defined as the probability that an event occurs at time stamp $t$, conditioned on the previous history:

$$\lambda(t) = P\{X_t = 1 |\mathcal{H}_t\}$$

In our example, we estimate $\lambda(t)$ for the univariate case with the following equation:

$$\lambda(t) = \mu_0 \mu(t) + A \sum_{i: t_i < t} g(t-t_i)$$

The objective log-likelihood is

$$\mathcal{L} = \sum_{t} \Big[ X_t \log \lambda(t) + (1 - X_t) \log\big(1 - \lambda(t)\big) \Big]$$

where $X_t$ is an indicator variable for occurrence: $X_t = 1$ denotes an occurrence at time stamp $t$, $X_t = 0$ denotes no occurrence.
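In code, this objective is simply a Bernoulli log-likelihood over time stamps (a minimal sketch, not the repo's implementation):

import numpy as np

def log_likelihood(X, lam):
    # X: 0/1 occurrence array over time stamps; lam: intensity per time stamp
    lam = np.clip(lam, 1e-12, 1 - 1e-12)   # guard against log(0)
    return np.sum(X * np.log(lam) + (1 - X) * np.log(1 - lam))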

Denote $\phi(t) = \frac{\mu_0\, \mu(t)}{\lambda(t)}$, the probability that an event at $t$ is attributed to the background, and $\rho_{ij} = \frac{A\, g(t_j - t_i)}{\lambda(t_j)}$, the probability that the event at $t_j$ was triggered by the event at $t_i$; for an event at $t_j$, these responsibilities sum to one.

E step: Update $\mu(t)$ and $g(t)$

$$\mu(t) \leftarrow \sum_{t_i \in N} \phi(t_i)\, \mathbb{1}\big[t_i \in [t - \Delta t,\ t + \Delta t]\big] \approx \sum_{t_i \in N} \phi(t_i)\, Z(t - t_i,\ \omega_{bg})$$

$$g(t) \leftarrow \sum_{t_i, t_j \in N;\ t_i < t_j} \rho_{ij}\, \mathbb{1}\big[t_j - t_i \in [t - \Delta t,\ t + \Delta t]\big] \approx \sum_{t_i, t_j \in N;\ t_i < t_j} \rho_{ij}\, Z\big(t - (t_j - t_i),\ \omega_{trigger}\big)$$

where $Z(\cdot, \omega)$ is a smoothing kernel with bandwidth $\omega$, controlled by --X-bandwidth ($\omega_{bg}$) and --lag-bandwidth ($\omega_{trigger}$).

M step: Update $\mu_0$ and $A$

$$\frac{\partial \mathcal{L}}{\partial \mu_0} = \sum_{t_i \in N} \frac{\phi(t_i)}{\mu_0} - \sum_{t \notin N} \frac{\mu(t)}{1 - \lambda(t)}$$

$$\frac{\partial \mathcal{L}}{\partial A} = \sum_{t_i \in N} \frac{1 - \phi(t_i)}{A} - \sum_{t \notin N} \frac{\sum_{j: t_j < t} g(t - t_j)}{1 - \lambda(t)}$$

Setting the gradients to zero yields the iterative optimization approach

$$\mu_0^{k+1} \leftarrow \frac{\sum_{t_i \in N} \phi^k(t_i)}{\sum_{t \notin N} \frac{\mu^k(t)}{1 - \lambda^k(t)}}$$

$$A^{k+1} \leftarrow \frac{\sum_{t_i \in N} \big(1 - \phi^{k+1}(t_i)\big)}{\sum_{t \notin N} \frac{\sum_{t_j: t_j < t} g^{k+1}(t - t_j)}{1 - \lambda^{k+1}(t)}}$$
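To make the updates concrete, here is a minimal single-machine NumPy sketch of one EM iteration. It is illustrative only: the repo's actual implementation is Dask-based and chunked, the function name and bandwidth defaults are hypothetical, and the sketch reuses current-iteration quantities where the updates above interleave $k$ and $k+1$.

import numpy as np

def em_step(X, mu0, A, mu, g, bg_bw=50.0, lag_bw=5.0):
    """One EM iteration. X: 0/1 occurrence array of length T;
    mu: background profile over time stamps (length T);
    g: triggering kernel over lags 0..T-1 (length T)."""
    T = len(X)
    events = np.flatnonzero(X)                    # event time stamps t_i

    # lambda(t) = mu0 * mu(t) + A * sum_{t_i < t} g(t - t_i)
    trig = np.zeros(T)
    for t_i in events:
        trig[t_i + 1:] += g[1:T - t_i]            # add g(t - t_i) for all t > t_i
    lam = np.clip(mu0 * mu + A * trig, 1e-12, 1 - 1e-12)

    phi = mu0 * mu[events] / lam[events]          # phi(t_i), background responsibility

    def Z(x, bw):                                 # Gaussian smoothing kernel
        return np.exp(-0.5 * (x / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

    # E step: kernel-smoothed nonparametric updates of mu(t) and g(t)
    t_grid = np.arange(T)
    mu_new = np.zeros(T)
    for t_i, p in zip(events, phi):
        mu_new += p * Z(t_grid - t_i, bg_bw)
    g_new = np.zeros(T)
    for a, t_i in enumerate(events):
        for t_j in events[a + 1:]:
            rho = A * g[t_j - t_i] / lam[t_j]     # rho_ij, triggering responsibility
            g_new += rho * Z(t_grid - (t_j - t_i), lag_bw)

    # M step: closed-form updates of mu0 and A from the stationarity conditions
    off = np.flatnonzero(X == 0)                  # time stamps with no event
    mu0_new = phi.sum() / np.sum(mu[off] / (1 - lam[off]))
    A_new = (1 - phi).sum() / np.sum(trig[off] / (1 - lam[off]))
    return mu0_new, A_new, mu_new, g_new

Note the double loop over event pairs: this is the $O(N^2)$ cost that the g-window-size option below truncates.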

Advanced Options

Parameter Init

  • init-mu0: initial value of $\mu_0$
  • init-A: initial value of $A$
  • init-exp-lambda: initial value of the rate $\lambda$ of the exponential density $\lambda e^{-\lambda x}$ used to initialize $g(t)$ (this $\lambda$ is a scalar rate, not the intensity function; see the sketch below).
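A minimal sketch of what such an initialization could look like (lam_init and max_lag are illustrative placeholders):

import numpy as np

lam_init, max_lag = 0.1, 500                    # --init-exp-lambda value, lag grid size
lags = np.arange(1, max_lag + 1)
g_init = lam_init * np.exp(-lam_init * lags)    # exponential density over lags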

Fast Computation

  • g-window-size: when computing the self-excitation term, events that occurred beyond this window are truncated. Without truncation, the accumulative trigger calculation costs $O(N^2)$ over event pairs; with a window of size $W$ it drops to roughly $O(N \cdot W)$, as in the sketch below.
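An illustrative sketch of the windowed trigger computation (function name and signature are hypothetical):

import numpy as np

def truncated_trigger(events, g, T, g_window):
    # Each event only contributes to the next g_window time stamps, cutting
    # the accumulative trigger cost from O(N^2) pairs to O(N * g_window).
    trig = np.zeros(T)
    for t_i in events:
        end = min(T, t_i + 1 + g_window)
        trig[t_i + 1:end] += g[1:end - t_i]
    return trig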

Memory Utilization

When running on a single machine, a major problem when fitting a long process is running out of RAM (OOM). The following arguments should be carefully tuned to prevent it:

  • X-chunk-size: chunk size used to split along the entire episode (time axis)
  • kernel-chunk-size: chunk size used to split along occurrences
  • lag-chunk-size: chunk size used to split along occurrence lags

General Guide:

  • Bottlenecks:

    • (#kernel_chunk_size, #occur_lag / #occur): estimation of the distributions of occurrences and occurrence lags
    • (#lag_chunk_size, #g-window-size): calculation of the accumulative self-excitation term; events beyond the g window are truncated

    In the single-machine scenario, the memory used collectively by multiple workers can be approximated by #num-worker $\times$ bottleneck, which should not exceed RAM capacity (see the back-of-envelope check below).
  • Monitoring memory consumption with the Dask dashboard is highly recommended.
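A back-of-envelope check of that rule (all shapes and counts below are illustrative assumptions):

kernel_chunk_size, n_occur, n_workers = 10_000, 50_000, 4
bytes_per_chunk = kernel_chunk_size * n_occur * 8        # float64 intermediate
print(f"{n_workers * bytes_per_chunk / 2**30:.1f} GiB")  # ~14.9 GiB collectively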

Here are some results for $g(t)$ fitted on novels by Mark Twain and Dickens. $g(t)$ models how the triggering effect between words varies with the length of the interval between them. We believe this metric can be used to characterize a writer's style.

[figure: fitted $g(t)$, Dickens]

[figure: fitted $g(t)$, Mark Twain]
