bert-text-summarizer's Introduction

A BERT-based Text Summarizer

Currently, only extractive summarization is supported.

Using a word limit of 200, this simple model achieves approximately the following ROUGE F1 scores on the CNN/DM validation set:

ROUGE-1: 37.78
ROUGE-2: 15.78

How does it work?

During preprocessing, the input text is divided into chunks of up to 512 tokens. Each sentence is tokenized with the official BERT tokenizer, and a special [CLS] token is placed at the beginning of each sentence. The ROUGE-1 and ROUGE-2 scores of each sentence with respect to the reference summary are calculated. The model outputs a single value for each [CLS] token and is trained to directly predict the mean of that sentence's ROUGE-1 and ROUGE-2 scores.
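As a rough illustration of the training target described above, the following sketch computes the mean of ROUGE-1 and ROUGE-2 for each sentence against a reference summary. It is not the package's own code: the helper names (`ngram_counts`, `rouge_n_f1`, `sentence_targets`) are hypothetical, whitespace tokenization stands in for the BERT tokenizer, and the use of the F1 variant of ROUGE is an assumption.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate_tokens, reference_tokens, n):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap (assumed variant)."""
    cand = ngram_counts(candidate_tokens, n)
    ref = ngram_counts(reference_tokens, n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sentence_targets(sentences, reference_summary):
    """Return one regression target per sentence: mean of ROUGE-1 and ROUGE-2."""
    # Assumption: lowercase whitespace tokens instead of BERT word pieces.
    ref_tokens = reference_summary.lower().split()
    targets = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        r1 = rouge_n_f1(tokens, ref_tokens, 1)
        r2 = rouge_n_f1(tokens, ref_tokens, 2)
        targets.append((r1 + r2) / 2.0)
    return targets


if __name__ == "__main__":
    article_sentences = ["the cat sat", "dogs bark loudly", "the cat"]
    print(sentence_targets(article_sentences, "the cat sat"))
```

Each value in the returned list would be the target for the output at the corresponding sentence's [CLS] position.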

During post-processing, the sentences are ranked by their predicted ROUGE score. The top-ranked sentences are then selected until the word limit is reached and re-sorted according to their positions in the text.
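The selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the package's implementation: `select_summary` is a hypothetical helper, words are counted by splitting on whitespace, and selection stops at the first sentence that would exceed the limit (the actual code may instead skip that sentence and continue).

```python
def select_summary(sentences, scores, max_words):
    """Pick the highest-scoring sentences within a word budget, in document order."""
    # Rank sentence indices by predicted score, best first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    chosen = []
    words_used = 0
    for i in ranked:
        n_words = len(sentences[i].split())
        # Assumption: stop as soon as the next sentence would exceed the limit.
        if words_used + n_words > max_words:
            break
        chosen.append(i)
        words_used += n_words

    # Restore the original order of the selected sentences.
    chosen.sort()
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sents = ["A b c.", "D e.", "F g h i."]
    predicted = [0.2, 0.9, 0.5]
    print(select_summary(sents, predicted, max_words=6))
```

With the sample input, the second and third sentences fit within six words and are returned in their original order.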

Install

pip install -U bert-text-summarizer

Usage

Get training data

bert-text-summarizer get-cnndm-train --max-examples=10000

This writes a TFRecord file, named cnndm_train.tfrec by default.

If --max-examples is omitted, the entire CNN/DM training set is processed, which may take more than an hour.

Train the model

bert-text-summarizer train-ext-summarizer \
  --saved-model-dir=bert_ext_summ_model \
  --train-data-path=cnndm_train.tfrec \
  --epochs=10

Get summary

bert-text-summarizer get-summary \
  --saved-model-dir=bert_ext_summ_model \
  --article-file=article.txt \
  --max-words=150

You can also create a summary programmatically:

import tensorflow_hub as hub
from official.nlp.bert import tokenization

from bert_text_summarizer.extractive.model import ExtractiveSummarizer

# Create the tokenizer (if you have the vocab.txt file you can bypass this tfhub step)
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

# Create the summarizer
predictor = ExtractiveSummarizer(tokenizer=tokenizer, saved_model_dir='bert_ext_summ_model')

# Get the article summary
with open('article.txt', 'r') as f:  # close the file once it has been read
    article = f.read().strip()
summary = predictor.get_summary(text=article, max_words=200)
print(summary)

Evaluate on the CNN/DM validation set

bert-text-summarizer eval-ext-summarizer \
  --saved-model-dir=bert_ext_summ_model

bert-text-summarizer's People

Contributors

david-wb
