tBERT

This repository provides code for the paper "tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection" (https://www.aclweb.org/anthology/2020.acl-main.630/).

Setup

Download pretrained BERT

  • Create a cache folder in your home directory:
cd ~
mkdir tf-hub-cache
cd tf-hub-cache
  • Download the pretrained BERT model and unzip it:
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
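
The example output further down shows the model being loaded from a tfhub.dev URL, which suggests tensorflow_hub is used under the hood. As a minimal sketch, you can point the hub library at the cache folder created above. TFHUB_CACHE_DIR is a standard tensorflow_hub environment variable, but whether tBERT itself reads it is an assumption here:
import os
os.environ["TFHUB_CACHE_DIR"] = os.path.expanduser("~/tf-hub-cache")  # cache folder created above

import tensorflow_hub as hub
# TF 1.x style module loading; downloads into TFHUB_CACHE_DIR if not cached yet
bert_module = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")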

Download preprocessed data

  • Go to the tBERT repository:
cd /path/to/tBERT/
  • Download the topic models and cached datasets (data.tar.gz) from Dropbox into the data folder:
wget "https://www.dropbox.com/s/thhym7njtgp8uoh/data.tar.gz"
  • Uncompress data.tar.gz:
tar -zxvf data.tar.gz
  • Your data directory should now have the following content:
data/
├── cache
├── logs
├── models
├── MSRP
├── Quora
├── Semeval
├── STS
└── topic_models
    ├── basic
    │   ├── MSRP_alpha1_80
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Quora_alpha1_90
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Semeval_alpha10_70
    │   │   └── ldamallet
    │   │       └── predictions
    │   ├── Semeval_alpha10_80
    │   │   └── ldamallet
    │   │       └── predictions
    │   └── Semeval_alpha50_70
    │       └── ldamallet
    │           └── predictions
    └── basic_gsdmm
        ├── MSRP_alpha0.1_80
        │   └── gsdmm
        │       └── predictions
        ├── Quora_alpha0.1_90
        │   └── gsdmm
        │       └── predictions
        ├── Semeval_alpha0.1_70
        │   └── gsdmm
        │       └── predictions
        └── Semeval_alpha0.1_80
            └── gsdmm
                └── predictions
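
As a quick, hypothetical sanity check (not part of the repository's own tooling), the snippet below verifies that a few of the directories from the listing above exist after unpacking; the paths are copied from the tree:
import os

expected = [
    "data/cache", "data/logs", "data/models", "data/MSRP",
    "data/topic_models/basic/MSRP_alpha1_80/ldamallet/predictions",
    "data/topic_models/basic_gsdmm/MSRP_alpha0.1_80/gsdmm/predictions",
]
missing = [d for d in expected if not os.path.isdir(d)]
print("missing directories:", missing or "none")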

Requirements

  • This code has been tested with Python 3.6 and TensorFlow 1.11.
  • Install the required Python packages as defined in requirements.txt:
pip install -r requirements.txt
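
To confirm you are on the tested versions, a minimal check (the expected values are simply the ones stated above):
import sys
import tensorflow as tf
print(sys.version_info[:2])  # expect (3, 6)
print(tf.__version__)        # expect 1.11.x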

Usage

  • You can check that everything works by training a model on a small portion of the data (you can experiment with different model options by changing the opt dictionary; see the sketch after the example output below):
python src/models/base_model_bert.py
  • This should produce output like the following:
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'topic_type': 'ldamallet', 'load_ids': True, 'topic': 'doc', 'minibatch_size': 10, 'seed': 1, 'max_m': 10, 'bert_large': False, 'num_topics': 80, 'num_epochs': 1, 'model': 'bert_simple_topic', 'max_length': 'minimum', 'simple_padding': True, 'padding': False, 'bert_update': True, 'L2': 0, 'dropout': 0.1, 'bert_cased': False, 'speedup_new_layers': False, 'unk_topic': 'zero', 'stopping_criterion': 'F1', 'tasks': ['B'], 'learning_rate': 0.3, 'hidden_layer': 1, 'gpu': -1, 'optimizer': 'Adadelta', 'datapath': 'data/', 'unflat_topics': False, 'sparse_labels': True, 'freeze_thaw_tune': False, 'dataset': 'MSRP', 'topic_alpha': 1, 'predict_every_epoch': False, 'unk_sub': False, 'subsets': ['train', 'dev', 'test'], 'topic_update': True}
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
No file found at data/logs/test.json. Creating new log.
get new id
Model 0
2020-07-03 19:02:31.946114: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:02:37.224331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:02:37.224390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
logfile: test.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.75
data/models/model_0/model_epoch1.ckpt
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 1
reading logs...
Finished training after 0.55 min
Dev F1: 0.75
Test F1: 0.75
reading logs...
Wrote predictions for model_0.
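
The shape printout shows how tBERT combines its inputs: the pooled BERT vector (768 dimensions) is concatenated with one 80-dimensional topic vector per sentence (D_T1 and D_T2), giving 768 + 80 + 80 = 928 combined features, which the hidden layer halves to 464. The printed opt dictionary holds the options for this run; as a rough sketch (key names and values taken from the output above, not from a documented API), a quick smoke-test configuration might look like:
opt = {
    'dataset': 'MSRP',             # which dataset to train on
    'model': 'bert_simple_topic',  # tBERT variant
    'topic_type': 'ldamallet',     # topic model; 'gsdmm' is also shipped in data/
    'topic': 'doc',                # document-level topic scope
    'num_topics': 80,
    'topic_alpha': 1,
    'max_m': 10,                   # tiny training subset for a quick test
    'num_epochs': 1,
    'minibatch_size': 10,
    'learning_rate': 0.3,
    'optimizer': 'Adadelta',
    'gpu': -1,                     # -1 runs on CPU
}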
  • The model will be saved under data/models/model_0/ and the training log is written to data/logs/test.json.
  • You can also run an experiment on the complete dataset and set different command-line flags, e.g.:
python src/experiments/tbert.py -dataset MSRP -layers 1 -topic doc -topic_type ldamallet -learning_rate 5e-5 --early_stopping -seed 3 -gpu 0
  • This should give you output like the following:
Starting experiment 1 of 1
tbert_1_seed_early_stopping.json
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
['m_train_B', 'm_dev_B', 'm_test_B']
['data/MSRP/m_train_B.txt', 'data/MSRP/m_dev_B.txt', 'data/MSRP/m_test_B.txt']
data/cache/m_train_B.pickle
Loading cached input for m_train_B
data/cache/m_dev_B.pickle
Loading cached input for m_dev_B
data/cache/m_test_B.pickle
Loading cached input for m_test_B
Mapping words to BERT ids...
Finished word id mapping.
Done.
{'dropout': 0.1, 'model': 'bert_simple_topic', 'bert_cased': False, 'max_m': None, 'tasks': ['B'], 'padding': False, 'dataset': 'MSRP', 'unflat_topics': False, 'L2': 0, 'subsets': ['train', 'dev', 'test'], 'unk_sub': False, 'hidden_layer': 1, 'datapath': 'data/', 'bert_large': False, 'predict_every_epoch': False, 'num_epochs': 3, 'simple_padding': True, 'patience': 2, 'speedup_new_layers': False, 'minibatch_size': 32, 'max_length': 'minimum', 'load_ids': True, 'topic_type': 'ldamallet', 'unk_topic': 'uniform', 'topic_update': False, 'sparse_labels': True, 'num_topics': 80, 'topic_alpha': 1, 'seed': 1, 'gpu': 0, 'stopping_criterion': 'F1', 'bert_update': True, 'learning_rate': 3e-05, 'optimizer': 'Adam', 'topic': 'doc'}
Running on GPU: 0
Topic scope: doc
input ids shape: (?, ?)
Loading pretrained model from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
---
Model: tBERT
---
D_T1 shape: (?, 80)
D_T2 shape: (?, 80)
pooled BERT shape: (?, 768)
combined shape: (?, 928)
hidden 1 shape: (?, 464)
output layer shape: (?, 2)
reading logs...
get new id
Model 1
2020-07-03 19:51:34.629485: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 19:51:39.501180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 6a05:00:00.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2020-07-03 19:51:39.501233: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2020-07-03 19:51:39.769183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-03 19:51:39.769242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2020-07-03 19:51:39.769263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2020-07-03 19:51:39.769368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10761 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 6a05:00:00.0, compute capability: 3.7)
logfile: tbert_1_seed_early_stopping.json
Finetune...
Epoch 1
Dev F1 after epoch 1: 0.8917378783226013
data/models/model_1/model_epoch1.ckpt
Epoch 2
Dev F1 after epoch 2: 0.9058663249015808
data/models/model_1/model_epoch2.ckpt
Epoch 3
Dev F1 after epoch 3: 0.8959999680519104
Maximum number of epochs reached during early stopping.
Finished training.
Load best model from epoch 2
reading logs...
Finished training after 10.74 min
Dev F1: 0.9059
Test F1: 0.8841
reading logs...
Wrote predictions for model_1.
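
Both runs append their results to a JSON log under data/logs/. The log schema is not documented here, so the hypothetical snippet below only inspects the top-level structure:
import json

with open("data/logs/tbert_1_seed_early_stopping.json") as f:
    log = json.load(f)
print(type(log).__name__)
print(list(log)[:10] if isinstance(log, dict) else len(log))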
