
Sentence Similarity

(mainly based on the Enhanced-RCNN model, plus other baselines)

Getting Started

To clone this project, make sure git-lfs is installed.

Please use the following command to clone it:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/daviddwlee84/SentenceSimilarity.git

This clones the repo without downloading the actual files tracked by Git LFS.
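
If the bandwidth quota allows, specific LFS-tracked files can be fetched later, for example:

# fetch only the LFS files you need (the path pattern is illustrative)
git lfs pull --include="word2vec/*"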

Quick Execute All

# Data preprocessing
./all_data_preprocess.sh
# Train & Evaluate
./train_all_data_at_once.sh [model name]

# Test Ant Submission functionality
bash run.sh raw_data/competition_train.csv ant_test_pred.csv
# pack the Ant Submission files
zip -r AntSubmit.zip . -i \*.py \*.sh -i data/stopwords.txt

Usage

# Data preprocessing
## Ant
python3 ant_preprocess.py [word/char] train
## CCKS
python3 ccks_preprocess.py
## PiPiDai
python3 pipidai_preprocess.py

# Train & Evaluate
## Chinese
python3 run.py --dataset [Ant/CCKS/PiPiDai] --model [model name] --word-segment [word/char]
# to train all models at once, use ./train_all_data_at_once.sh
## English
python3 run.py --dataset Quora --model [model name]

# Use Tensorboard
tensorboard --logdir log/same_as_model_log_dir
## remote connection (forward a local port to the remote port; run on the local machine)
## then you should be able to access TensorBoard at http://localhost:$LOCAL_PORT
ssh -NfL $LOCAL_PORT:localhost:$REMOTE_PORT $REMOTE_USER@$REMOTE_IP > /dev/null 2>&1
### to close the connection, kill the backgrounded ssh command:
ps aux | grep "ssh -NfL" | grep -v grep | awk '{print $2}' | xargs kill

Model

  • ERCNN (default)
  • Transformer
    • ERCNN with the BiRNN replaced by a Transformer
  • Baseline
    • Siamese Series
      • SiameseCNN
        • Convolutional Neural Networks for Sentence Classification
        • Character-level Convolutional Networks for Text Classification
      • SiameseRNN
      • SiameseLSTM
        • Siamese Recurrent Architectures for Learning Sentence Similarity
      • SiameseRCNN
        • Siamese Recurrent Architectures for Learning Sentence Similarity
      • SiameseAttentionRNN
        • Text Classification Research with Attention-based Recurrent Neural Networks
    • Multi-Perspective Series
      • MPCNN
        • Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks
        • essentially SiameseCNN with more sentence-similarity measurements (also uses a Siamese network to encode sentences)
        • TODO: model is too big to run (consumes too much GPU memory) => use a smaller batch size
      • MPLSTM: skip
      • BiMPM
        • Bilateral Multi-Perspective Matching for Natural Language Sentences
    • ESIM

Dataset

  • Ant - Chinese
  • CCKS - Chinese
  • PiPiDai - Chinese (encoded)
  • Quora - English

Mode

  • train
    • uses the 70% training split
    • k-fold cross-validation (k == number of training epochs)
    • evaluates on the validation set at the end of each epoch and saves the model
  • test
    • uses the 30% test split
    • loads the latest model with the same settings
  • both (runs train then test)
  • predict
    • loads the latest model with the same settings
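
For example (illustrative invocations composed from the flags documented below):

# train only, then evaluate the saved checkpoint
python3 run.py --dataset Ant --model ERCNN --mode train
python3 run.py --dataset Ant --model ERCNN --mode test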

Sampling

  • random (original): data is skewed (the ratios are listed below)
  • balance: equal amounts of positive and negative data
    • generate-train: use generated negative samples during training
    • generate-test: use generated negative samples during testing
$ python3 run.py --help
usage: run.py [-h] [--dataset dataset] [--mode mode] [--sampling mode]
              [--generate-train] [--generate-test] [--model model]
              [--word-segment WS] [--batch-size N] [--test-batch-size N]
              [--k-fold N] [--lr N] [--beta1 N] [--beta2 N] [--epsilon N]
              [--no-cuda] [--seed N] [--test-split N] [--log-interval N]
              [--test-interval N] [--not-save-model]

Enhanced RCNN on Sentence Similarity

optional arguments:
  -h, --help             show this help message and exit
  --dataset dataset      Chinese: Ant, CCKS; English: Quora (default: Ant)
  --mode mode            script mode [train/test/both/predict/submit(Ant)]
                         (default: both)
  --sampling mode        sampling mode during training (default: random)
  --generate-train       use generated negative samples when training (used in
                         balance sampling)
  --generate-test        use generated negative samples when testing (used in
                         balance sampling)
  --model model          model to use [ERCNN/Transformer/Siamese(CNN/RNN/LSTM/R
                         CNN/AttentionRNN)] (default: ERCNN)
  --word-segment WS      chinese word split mode [word/char] (default: char)
  --chinese-embed embed  chinese embedding (default: cw2vec)
  --not-train-embed      whether to freeze the embedding parameters
  --batch-size N         input batch size for training (default: 256)
  --test-batch-size N    input batch size for testing (default: 1000)
  --k-fold N             k-fold cross validation i.e. number of epochs to train
                         (default: 10)
  --lr N                 learning rate (default: 0.001)
  --beta1 N              beta 1 for Adam optimizer (default: 0.9)
  --beta2 N              beta 2 for Adam optimizer (default: 0.999)
  --epsilon N            epsilon for Adam optimizer (default: 1e-08)
  --no-cuda              disables CUDA training
  --seed N               random seed (default: 16)
  --test-split N         test data split (default: 0.3)
  --logdir path          set log directory (default: ./log)
  --log-interval N       how many batches to wait before logging training
                         status
  --test-interval N      how many batches to test during training
  --not-save-model       for not saving the current model
  --load-model name      load the specific model checkpoint file
  --submit-path path     submission file path (currently for Ant dataset)

Related Additional Datasets

Data

Original

  • raw_data/competition_train.csv - Ant Financial

  • raw_data/train.csv - Quora Question Pairs

  • word2vec/substoke_char.vec.avg - Ant Financial

  • word2vec/substoke_word.vec.avg - Ant Financial

  • data/stopwords.txt - Ant Financial

  • word2vec/glove.word2vec.txt - Quora Question Pairs

  • raw_data/task3_train.txt - CCKS 2018

  • raw_data/task3_dev.txt - CCKS 2018

    wget http://nlp.stanford.edu/data/glove.840B.300d.zip
    unzip glove.840B.300d.zip
    # convert GloVe format to word2vec format
    python3 -c "from gensim.scripts.glove2word2vec import glove2word2vec; glove2word2vec('glove.840B.300d.txt', 'word2vec/glove.word2vec.txt')"
    rm glove.840B*
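
To sanity-check the converted vectors (an illustrative snippet, not part of the repo's pipeline):

from gensim.models import KeyedVectors

# load the converted text-format vectors produced by the commands above
vectors = KeyedVectors.load_word2vec_format('word2vec/glove.word2vec.txt')
print(vectors['hello'][:5])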

Generated

  • data/sentence_char_train.csv - Ant Financial
  • data/sentence_word_train.csv - Ant Financial
  • word2vec/Ant_char_tokenizer.pickle - Ant Financial
  • word2vec/Ant_char_embed_matrix.pickle - Ant Financial
  • word2vec/Ant_word_tokenizer.pickle - Ant Financial
  • word2vec/Ant_word_embed_matrix.pickle - Ant Financial
  • word2vec/Quora_tokenizer.pickle - Quora Question Pairs
  • word2vec/Quora_embed_matrix.pickle - Quora Question Pairs
  • model/*
  • log/*

Dataset

ANT Financial Competition

Goal: classify whether two question sentences are asking the same thing => predict true or false

Evaluation: f1-score
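
For reference, the metric can be computed with scikit-learn (toy example; not the repo's evaluation code):

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0]  # toy predictions
print(f1_score(y_true, y_pred))  # 0.8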

Data

  • Positive data: 18.23%

Quora Question Pairs

kaggle competitions download -c quora-question-pairs
unzip test.csv -d raw_data
unzip train.csv -d raw_data
rm *.zip

Goal: classify whether question pairs are duplicates or not => predict the probability that the questions are duplicates (a number between 0 and 1)

Evaluation: log loss between the predicted values and the ground truth
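
This metric is also available in scikit-learn (toy example):

from sklearn.metrics import log_loss

y_true = [0, 1, 1]        # toy ground truth
y_prob = [0.1, 0.8, 0.6]  # predicted duplicate probabilities
print(log_loss(y_true, y_prob))  # ~0.28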

Data

  • Positive data: 36.92%
  • 400K rows in train set and about 2.35M rows in test set
  • 6 columns in train set but only 3 of them are in test set
    • train set
      • id - the id of a training set question pair
      • qid1, qid2 - unique ids of each question (only available in train.csv)
      • question1, question2 - the full text of each question
      • is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise
    • test set
      • test_id
      • question1, question2
  • about 63% non-duplicate questions and 37% duplicate questions in the training data set

CCKS 2018

CCKS: China Conference on Knowledge Graph and Semantic Computing

Data

  • Positive data: 50%
  • Data amount: 100000

CHIP 2018

The data is only available by contacting the organizers.

PiPiDai

The original download link is no longer available.

  • Positive data: 52%
  • Data amount: 254386

TODO

  • More evaluation metrics: recall & f1-score
  • Support resuming training from a checkpoint?!
  • Potential multi-class classification
    • num_class input
    • sigmoid => softmax
    • (but how would this work for the Siamese models??)

Notes

Notes on imbalanced data

Balance data generator

See the class BalanceDataHelper in data_prepare.py.
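
The actual implementation lives in the repo; a minimal sketch of the idea (the interface and field names here are assumptions) could look like:

import random

class BalanceDataHelper:
    """Sketch: yield batches with equal positive/negative counts."""
    def __init__(self, pos_pairs, neg_pairs, batch_size=256, seed=16):
        self.pos_pairs = pos_pairs    # (sent1, sent2) pairs labeled 1
        self.neg_pairs = neg_pairs    # (sent1, sent2) pairs labeled 0
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def batches(self):
        half = self.batch_size // 2
        n_batches = max(len(self.pos_pairs), len(self.neg_pairs)) // half
        for _ in range(n_batches):
            # sample with replacement so the minority class is oversampled
            pos = self.rng.choices(self.pos_pairs, k=half)
            neg = self.rng.choices(self.neg_pairs, k=half)
            batch = [(pair, 1) for pair in pos] + [(pair, 0) for pair in neg]
            self.rng.shuffle(batch)
            yield batch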

Use different loss

  • Dice loss (see the sketch after this list)

    • Dice Loss PR · Issue #1249 · pytorch/pytorch

    • another approach:

      # Reconstructed as a self-contained method; the function name, `mode` flag,
      # `self.simple_cross_entry`, and `self.epsilon` are assumptions based on the fragment.
      import numpy as np
      import torch
      import torch.nn as nn

      def weighted_cross_entropy(self, y_pred, golden, seq_mask, weight=None, mode=True):
          # y_pred: (B, T, C) logits; golden: (B, T) labels; seq_mask: (B, T) float mask
          if weight is None:
              weight = torch.ones(
                  y_pred.shape[-1], dtype=torch.float).to(device=y_pred.device)  # (C)
          if not mode:
              return self.simple_cross_entry(y_pred, golden, seq_mask, weight)
          probs = nn.functional.softmax(y_pred, dim=2)  # (B, T, C)
          B, T, C = probs.shape

          golden_index = golden.unsqueeze(dim=2)  # (B, T, 1)
          golden_probs = torch.gather(
              probs, dim=2, index=golden_index)  # (B, T, 1)

          # compare each token's golden probability against the max in its package
          probs_in_package = golden_probs.expand(B, T, T).transpose(1, 2)
          packages = np.array([np.eye(T)] * B)  # (B, T, T)
          probs_in_package = probs_in_package * \
              torch.tensor(packages, dtype=torch.float).to(device=probs.device)
          max_probs_in_package, _ = torch.max(probs_in_package, dim=2)

          golden_probs = golden_probs.squeeze(dim=2)
          golden_weight = golden_probs / max_probs_in_package  # (B, T)
          golden_weight = golden_weight.view(-1).detach()

          y_pred = y_pred.view(-1, C)
          golden = golden.view(-1)
          seq_mask = seq_mask.view(-1)

          negative_label = torch.tensor(
              [0] * (B * T), dtype=torch.long, device=y_pred.device)
          golden_loss = nn.functional.cross_entropy(
              y_pred, golden, weight=weight, reduction='none')
          negative_loss = nn.functional.cross_entropy(
              y_pred, negative_label, weight=weight, reduction='none')

          # blend the golden-label loss with a negative-label loss, weighted by confidence
          loss = golden_weight * golden_loss + \
              (1 - golden_weight) * negative_loss  # (B * T)
          loss = torch.dot(loss, seq_mask) / (torch.sum(seq_mask) + self.epsilon)
          return loss
  • Triplet-Loss

  • N-pair Loss
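
As referenced above, a minimal soft Dice loss for binary labels (a sketch assuming probability inputs; the smoothing constant eps is an arbitrary choice, not the repo's implementation):

import torch

def soft_dice_loss(probs, targets, eps=1.0):
    # probs: predicted probabilities in [0, 1]; targets: 0/1 labels
    probs = probs.view(-1)
    targets = targets.view(-1).float()
    intersection = (probs * targets).sum()
    return 1 - (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)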

Notes about Virtualenv

# this will create an env_name folder in the current directory
virtualenv --python=/path/to/python3.x env_name

# activate the environment
source ./env_name/bin/activate

Add aliases in .bashrc

  • Go to the working directory and activate the environment
    • alias davidlee="cd /home/username/working_dir; source env_name/bin/activate"
  • Use a pip mirror when installing packages
    • alias pipp="pip install -i https://pypi.tuna.tsinghua.edu.cn/simple"

Install Jupyter Notebook using the virtualenv kernel

  1. make sure you have activated the environment
  2. pip3 install jupyterlab
  3. python3 -m ipykernel install --user --name=python3.6virtualenv
  4. run jupyter notebook as usual
  5. go to Kernel > Change kernel > select python3.6virtualenv

Links

Paper

PyTorch

Gensim

Others

Related Project

Summary

Model Source Code


Article

Candidate Set

Baseline

Siamese Models

Siamese-CNN, Siamese-RNN, Siamese-LSTM, Siamese-RCNN, Siamese-Attention-RCNN

Contrastive Loss
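
For reference, the classic formulation (Hadsell et al., 2006) is L = y*d^2 + (1-y)*max(0, m-d)^2, where d is the distance between the two encodings, y marks similar pairs, and m is a margin. A sketch in PyTorch:

import torch

def contrastive_loss(dist, label, margin=1.0):
    # label == 1 for similar pairs, 0 for dissimilar; dist: pairwise distances
    positive = label * dist.pow(2)
    negative = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    return (positive + negative).mean()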

Trouble Shooting

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

nn.Module objects stored in a plain Python list are not registered as submodules, so .to(device) does not move them automatically; wrap them in nn.ModuleList instead.
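
A minimal illustration of the fix (layer sizes are arbitrary):

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # BAD: .to(device) will NOT move layers hidden in a plain list
        # self.convs = [nn.Conv1d(128, 64, k) for k in (1, 2, 3)]
        # GOOD: nn.ModuleList registers each layer as a submodule
        self.convs = nn.ModuleList(nn.Conv1d(128, 64, k) for k in (1, 2, 3))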

Appendix

Due to the Git LFS bandwidth quota limitation, you might run into problems cloning this project. A shallow LFS clone may help:

git lfs clone --depth=1 https://github.com/daviddwlee84/SentenceSimilarity.git

Attention

# https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
# imports needed by the snippets below (MAX_LENGTH and device are assumed
# to be defined globally, as in the linked tutorials)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class Attention(nn.Module):
    def __init__(self,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 attn_dim: int):
        super().__init__()

        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim

        self.attn = nn.Linear(self.attn_in, attn_dim)

    def forward(self,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tensor:

        src_len = encoder_outputs.shape[0]

        repeated_decoder_hidden = decoder_hidden.unsqueeze(
            1).repeat(1, src_len, 1)

        encoder_outputs = encoder_outputs.permute(1, 0, 2)

        energy = torch.tanh(self.attn(torch.cat((
            repeated_decoder_hidden,
            encoder_outputs),
            dim=2)))

        attention = torch.sum(energy, dim=2)

        return F.softmax(attention, dim=1)


# https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


# https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)

        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0

        weight = torch.zeros(feature_dim, 1)
        nn.init.kaiming_uniform_(weight)
        self.weight = nn.Parameter(weight)

        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))

    def forward(self, x, mask=None):
        feature_dim = self.feature_dim
        step_dim = self.step_dim

        eij = torch.mm(
            x.contiguous().view(-1, feature_dim),
            self.weight
        ).view(-1, step_dim)

        if self.bias:
            eij = eij + self.b

        eij = torch.tanh(eij)
        a = torch.exp(eij)

        if mask is not None:
            a = a * mask

        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)
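
A quick usage check for the Kaggle-style Attention above (shapes are illustrative):

batch, steps, feats = 32, 50, 128
layer = Attention(feature_dim=feats, step_dim=steps)
x = torch.randn(batch, steps, feats)
pooled = layer(x)
print(pooled.shape)  # torch.Size([32, 128])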


Issues

[Issue] dataset location

Hi,

Thanks for your contribution of these models and the datasets. However, when I try to fetch the data, the error message below appears:

fetch: Fetching reference refs/heads/master
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/daviddwlee84/SentenceSimilarity.git/info/lfs'

It seems this repo has run out of LFS quota. Could you upload the data to Google Drive or another service and provide a download link?

Tensor dimension mismatch error

There is a problem with the definition of the model structure.
In the RCNN model, the fc_sub/fc_mul definitions should be changed to:

self.fc_sub = FCSubtract((max_len-1)*2*3 + lstm_dim*2*2, fc_out_dims)  # 3 Inception blocks + BiGRU [1056, 192]
self.fc_mul = FCMultiply((max_len-1)*2*3 + lstm_dim*2*2, fc_out_dims)

In addition, the experimental results cannot be reproduced.

Transformer Error

Try running:

./train_all_data_at_once.sh Transformer

Then this error appears:

Traceback (most recent call last):
  File "run.py", line 352, in <module>
    main()
  File "run.py", line 329, in main
    train(args, model, tokenizer, device, optimizer, tbwriter)
  File "/home/wac/SentenceSimilarity/random_train.py", line 55, in train
    output = model(input_1, input_2)
  File "/home/wac/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wac/SentenceSimilarity/models/rcnn_transformer.py", line 82, in forward
    res_sub = self.fc_sub(sentence1, sentence2)
  File "/home/wac/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wac/SentenceSimilarity/models/rcnn_elements.py", line 17, in forward
    out = self.dense(res_sub_mul)
  File "/home/wac/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wac/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/wac/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

Maybe the dimension of sentence1 differs from that of sentence2?

Inception Implementation Problem

class Inception1(nn.Module):
    def __init__(self, input_dim, conv_dim=64):
        super(Inception1, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(input_dim, conv_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=2),
            nn.ReLU()
        )
        self.global_avg_pool = nn.AvgPool1d(input_dim)
        self.global_max_pool = nn.MaxPool1d(input_dim)

    def forward(self, x):
        out = self.cnn(x)
        avg_pool, max_pool = mean_max(x)
        return torch.cat((avg_pool, max_pool), dim=1)

`out` in the forward function seems unused? The same problem also exists in Inception2 and Inception3.
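
One plausible fix, assuming the intent was to pool the convolution output rather than the raw input (an assumption; the intended behavior isn't stated):

    def forward(self, x):
        out = self.cnn(x)
        avg_pool, max_pool = mean_max(out)  # pool the conv features, not the raw input
        return torch.cat((avg_pool, max_pool), dim=1)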
