tBERT

This repository provides code for the paper "tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection" (https://www.aclweb.org/anthology/2020.acl-main.630/).

Setup

Download pretrained BERT

  • Create a cache folder in your home directory:
cd ~
mkdir tf-hub-cache
cd tf-hub-cache
  • Download the pretrained BERT model and unzip it:
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
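
The example output further down shows the model being loaded from a tfhub.dev URL, which suggests tensorflow_hub is used under the hood. As a minimal sketch, you can point the hub library at the cache folder created above. TFHUB_CACHE_DIR is a standard tensorflow_hub environment variable, but whether tBERT itself reads it is an assumption here:
import os
os.environ["TFHUB_CACHE_DIR"] = os.path.expanduser("~/tf-hub-cache")  # cache folder created above

import tensorflow_hub as hub
# TF 1.x style module loading; downloads into TFHUB_CACHE_DIR if not cached yet
bert_module = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")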

Download preprocessed data

  • Go to the tBERT repository:
cd /path/to/tBERT/
  • Download the topic models and cached datasets (data.tar.gz) from Dropbox into the data folder:
wget "https://www.dropbox.com/s/thhym7njtgp8uoh/data.tar.gz"
  • Uncompress data.tar.gz:
tar -zxvf data.tar.gz
  • Your data directory should now have the following content:
data/
├── cache
├── logs
├── models
├── MSRP
├── Quora
├── Semeval
├── STS
└── topic_models
    ├── basic
    │   ├── MSRP_alpha1_80
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Quora_alpha1_90
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Semeval_alpha10_70
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Semeval_alpha10_80
    │   │   └── ldamallet
    │   │       └── predictions
    │   └── Semeval_alpha50_70
    │       └── ldamallet
    │           └── predictions
    └── basic_gsdmm
        ├── MSRP_alpha0.1_80
        │   └── gsdmm
        │       └── predictions
        ├── Quora_alpha0.1_90
        │   └── gsdmm
        │       └── predictions
        ├── Semeval_alpha0.1_70
        │   └── gsdmm
        │       └── predictions
        └── Semeval_alpha0.1_80
            └── gsdmm
                └── predictions
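
As a quick, hypothetical sanity check (not part of the repository's own tooling), the snippet below verifies that a few of the directories from the listing above exist after unpacking; the paths are copied from the tree:
import os

expected = [
    "data/cache", "data/logs", "data/models", "data/MSRP",
    "data/topic_models/basic/MSRP_alpha1_80/ldamallet/predictions",
    "data/topic_models/basic_gsdmm/MSRP_alpha0.1_80/gsdmm/predictions",
]
missing = [d for d in expected if not os.path.isdir(d)]
print("missing directories:", missing or "none")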

Requirements

  • This code has been tested with Python 3.6 and TensorFlow 1.11.
  • Install the required Python packages as defined in requirements.txt:
pip install -r requirements.txt
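
To confirm you are on the tested versions, a minimal check (the expected values are simply the ones stated above):
import sys
import tensorflow as tf
print(sys.version_info[:2])  # expect (3, 6)
print(tf.__version__)        # expect 1.11.x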

Usage

  • You can check that everything works by training a model on a small portion of the data (you can experiment with different model options by changing the opt dictionary; see the sketch after the example output below):
python src/models/base_model_bert.py
  • This should produce output like the following:
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'topic_type': 'ldamallet', 'load_ids': True, 'topic': 'doc', 'minibatch_size': 10, 'seed': 1, 'max_m': 10, 'bert_large': False, 'num_topics': 80, 'num_epochs': 1, 'model': 'bert_simple_topic', 'max_length': 'minimum', 'simple_padding': True, 'padding': False, 'bert_update': True, 'L2': 0, 'dropout': 0.1, 'bert_cased': False, 'speedup_new_layers': False, 'unk_topic': 'zero', 'stopping_criterion': 'F1', 'tasks': ['B'], 'learning_rate': 0.3, 'hidden_layer': 1, 'gpu': -1, 'optimizer': 'Adadelta', 'datapath': 'data/', 'unflat_topics': False, 'sparse_labels': True, 'freeze_thaw_tune': False, 'dataset': 'MSRP', 'topic_alpha': 1, 'predict_every_epoch': False, 'unk_sub': False, 'subsets': ['train', 'dev', 'test'], 'topic_update': True}
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
No file found at data/logs/test.json. Creating new log.
get new id
Model 0
2020-07-03 19:02:31.946114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:02:37.224331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:02:37.224390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
logfile: test.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.75
data/models/model_0/model_epoch1.ckpt
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 1
reading logs...
Finished training after 0.55 min
Dev F1: 0.75
Test F1: 0.75
reading logs...
Wrote predictions for model_0.
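
The shape printout shows how tBERT combines its inputs: the pooled BERT vector (768 dimensions) is concatenated with one 80-dimensional topic vector per sentence (D_T1 and D_T2), giving 768 + 80 + 80 = 928 combined features, which the hidden layer halves to 464. The printed opt dictionary holds the options for this run; as a rough sketch (key names and values taken from the output above, not from a documented API), a quick smoke-test configuration might look like:
opt = {
    'dataset': 'MSRP',             # which dataset to train on
    'model': 'bert_simple_topic',  # tBERT variant
    'topic_type': 'ldamallet',     # topic model; 'gsdmm' is also shipped in data/
    'topic': 'doc',                # document-level topic scope
    'num_topics': 80,
    'topic_alpha': 1,
    'max_m': 10,                   # tiny training subset for a quick test
    'num_epochs': 1,
    'minibatch_size': 10,
    'learning_rate': 0.3,
    'optimizer': 'Adadelta',
    'gpu': -1,                     # -1 runs on CPU
}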
  • The model will be saved under data/models/model_0/ and the training log is written to data/logs/test.json.
  • You can also run an experiment on the complete dataset and set different command-line flags, e.g.:
python src/experiments/tbert.py -dataset MSRP -layers 1 -topic doc -topic_type ldamallet -learning_rate 5e-5 --early_stopping -seed 3 -gpu 0
  • This should give you output like the following:
Starting experiment 1 of 1
tbert_1_seed_early_stopping.json
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'unflat_topics': False, 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'bert_large': False, 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
Running on GPU: 0
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
get new id
Model 1
2020-07-03 19:51:34.629485: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:51:39.501180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:51:39.501233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2020-07-03 19:51:39.769183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:51:39.769242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2020-07-03 19:51:39.769263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2020-07-03 19:51:39.769368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10761 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 6a05:00:00.0, compute capability: 3.7)
logfile: tbert_1_seed_early_stopping.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.8917378783226013
data/models/model_1/model_epoch1.ckpt
Epoch 2
Dev F1 after epoch 2: 0.9058663249015808
data/models/model_1/model_epoch2.ckpt
Epoch 3
Dev F1 after epoch 3: 0.8959999680519104
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 2
reading logs...
Finished training after 10.74 min
Dev F1: 0.9059
Test F1: 0.8841
reading logs...
Wrote predictions for model_1.
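
Both runs append their results to a JSON log under data/logs/. The log schema is not documented here, so the hypothetical snippet below only inspects the top-level structure:
import json

with open("data/logs/tbert_1_seed_early_stopping.json") as f:
    log = json.load(f)
print(type(log).__name__)
print(list(log)[:10] if isinstance(log, dict) else len(log))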
