kingfener / bertpunc

This project forked from nkrnrnk/bertpunc

State-of-the-art deep learning model for punctuation restoration (e.g. for automatic speech recognition output), based on the pre-trained BERT model

License: Apache License 2.0

Python 6.02% Jupyter Notebook 93.98%

BertPunc

This repository contains the code of BertPunc, a punctuation restoration model based on Google's BERT. The model is fine-tuned from a pretrained reimplementation of BERT in PyTorch.

A punctuation restoration model adds punctuation (e.g. period, comma, question mark) to an unsegmented, unpunctuated text. Automatic Speech Recognition (ASR) systems typically output unsegmented, unpunctuated sequences of words. Punctuation restoration improves the readability of ASR transcripts.
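To make the task concrete, here is a minimal sketch of how a punctuated sentence can be turned into the unpunctuated word sequence a model sees, plus one label per word. The label names (`COMMA`, `PERIOD`, `QUESTION`, `O`), the helper `to_words_and_labels`, and the convention that a label describes the punctuation following its word are assumptions made for illustration, not taken from this repository's `data.py`.

```python
# Hypothetical label names; "O" means "no punctuation after this word".
PUNCTUATION = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def to_words_and_labels(text):
    """Split punctuated text into lowercase words and per-word punctuation labels."""
    words, labels = [], []
    for raw in text.lower().split():
        # The label is decided by the last character of the raw token.
        label = PUNCTUATION.get(raw[-1], "O")
        word = raw.rstrip(",.?")
        if word:  # skip tokens that consist only of punctuation
            words.append(word)
            labels.append(label)
    return words, labels


words, labels = to_words_and_labels("Hello, how are you? I am fine.")
for word, label in zip(words, labels):
    print(f"{word:6s} {label}")
```

A restoration model receives only `words` and has to predict `labels`; writing the predicted marks back after each word yields the punctuated text.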

Results

BertPunc outperforms the previous state-of-the-art results, reported in "Bidirectional Recurrent Neural Network with Attention Mechanism for Punctuation Restoration" by Ottokar Tilk and Tanel Alumäe, by large margins on the IWSLT dataset of TED Talk transcripts:

| Model | Overall | Comma | Period | Question mark |
|---|---|---|---|---|
| T-BRNN-pre (Tilk et al.) | 0.644 | 0.548 | 0.729 | 0.667 |
| BertPunc | 0.752 | 0.712 | 0.819 | 0.723 |
| Improvement | +16% | +30% | +12% | +8% |

(Scores are F1 scores on the test set)
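For readers unfamiliar with the metric, the following sketch shows how a per-class F1 score can be computed from gold and predicted labels. The function name, label names, and toy data are hypothetical; this is not the repository's `evaluate.py`, and it does not show how the "Overall" column is aggregated.

```python
def f1_for_class(gold, predicted, target):
    """F1 score for one punctuation class, from parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == target and p == target for g, p in pairs)
    false_pos = sum(g != target and p == target for g, p in pairs)
    false_neg = sum(g == target and p != target for g, p in pairs)
    if true_pos == 0:
        return 0.0  # also avoids division by zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy example: one comma found, one missed, one predicted in the wrong place.
gold = ["O", "COMMA", "O", "PERIOD", "COMMA", "O"]
pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "O"]
comma_f1 = f1_for_class(gold, pred, "COMMA")
period_f1 = f1_for_class(gold, pred, "PERIOD")
```

In the toy data, precision and recall for commas are both 1/2, so `comma_f1` is 0.5, while the single period is predicted correctly and `period_f1` is 1.0.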

Method

BertPunc adds an extra linear layer on top of the pretrained BERT masked language model (BertForMaskedLM). BertForMaskedLM outputs a logit vector for every (masked) token. Each logit vector has 30522 entries, one per token in the BERT vocabulary. The extra linear layer maps these logits to the possible punctuation classes (4 in the case of IWSLT: comma, period, question mark and no punctuation).
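The architecture described above can be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the code in `model.py`: `TinyMaskedLM` is a hypothetical stand-in for BertForMaskedLM (so the example runs without downloading weights), the vocabulary is shrunk from 30522 to 100, and flattening the logits of all positions before the linear layer is an assumption about how the per-token outputs are combined.

```python
import torch
from torch import nn


class TinyMaskedLM(nn.Module):
    """Hypothetical stand-in for BertForMaskedLM: token ids -> vocabulary logits."""

    def __init__(self, vocab_size, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.to_vocab = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        # (batch, segment) -> (batch, segment, vocab)
        return self.to_vocab(self.embed(token_ids))


class PunctuationHead(nn.Module):
    """Masked language model followed by one extra linear layer."""

    def __init__(self, language_model, segment_size, vocab_size, num_classes=4):
        super().__init__()
        self.language_model = language_model
        # Assumption: the logits of every position in the segment are
        # concatenated and mapped to the punctuation classes in one step.
        self.classifier = nn.Linear(segment_size * vocab_size, num_classes)

    def forward(self, token_ids):
        logits = self.language_model(token_ids)  # (batch, segment, vocab)
        flat = logits.flatten(start_dim=1)       # (batch, segment * vocab)
        return self.classifier(flat)             # (batch, num_classes)


segment_size, vocab_size = 32, 100  # the real BERT vocabulary has 30522 tokens
model = PunctuationHead(TinyMaskedLM(vocab_size), segment_size, vocab_size)
batch = torch.randint(0, vocab_size, (8, segment_size))  # 8 random segments
scores = model(batch)
print(scores.shape)
```

Each row of `scores` holds one unnormalized score per punctuation class for one input segment; training would typically apply a cross-entropy loss to these scores.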

BertPunc is trained by feeding it word sequences of a fixed segment size. The label for a segment is the punctuation for the middle word in the segment. The segment size is a hyperparameter; for the IWSLT dataset, a size of 32 tokens works well.
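One way to build such training examples is a sliding window over the token sequence, sketched below. The helper `make_segments`, the `[PAD]` padding at the edges, and the choice of index `segment_size // 2` as the "middle" are assumptions for illustration; the repository's `data.py` may handle edges and the middle position differently.

```python
def make_segments(tokens, labels, segment_size, pad="[PAD]"):
    """Return one (segment, label) pair per token.

    Every segment has exactly `segment_size` entries, and the label belongs
    to the token at index `segment_size // 2` of that segment.
    """
    middle = segment_size // 2
    # Pad both ends so the first and last tokens can also sit in the middle.
    padded = [pad] * middle + list(tokens) + [pad] * (segment_size - middle - 1)
    return [
        (padded[i:i + segment_size], labels[i])
        for i in range(len(tokens))
    ]


tokens = "how are you i am fine".split()
labels = ["O", "O", "QUESTION", "O", "O", "PERIOD"]
segments = make_segments(tokens, labels, segment_size=4)

for segment, label in segments:
    print(segment, "->", label)
```

With a segment size of 4, the third example is `['how', 'are', 'you', 'i']` with label `QUESTION`, because `you` sits at the middle index 2. The 32-token setting mentioned above works the same way with a wider window.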

Code

  • train.py: training code
  • data.py: helper functions to read and transform data
  • model.py: neural network model
  • evaluate.py: evaluation on IWSLT test sets

Contributors

nickreinerink, nkrnrnk