nakhunchumpolsathien / thaisum

A Dataset for Thai text summarization from Thairath, ThaiPBS, Prachathai and The Standard with over 350,000 articles. Trained models are provided.

License: Apache License 2.0

ThaiSum

A dataset for Thai text summarization.


For any questions regarding the dataset or the experiment, please feel free to email me at nakhun.chum[at sign]gmail.com


Notes

  • ThaiSum is a dataset for training Thai text summarization systems, collected from the websites Thairath, ThaiPBS, Prachathai, and The Standard.
  • You can also download the trained models and test them on your own test set. We modified the source code of BertSum and ARedSum to support Thai; see ARedSum_for_Thai_text.ipynb and BertSum_for_Thai_text.ipynb for details. Note that you must preprocess your own text to suit each model; we provide examples.
  • We record the source of every news article in this dataset in the url column. If you adapt and redistribute this dataset, please include the source of every article.
  • This dataset is also useful for other Thai language processing tasks such as news classification (both multi-label and single-label), headline generation, and language modelling.

0. Download

0.1 Dataset

Dataset Remark
thaisum.csv contains title, body, summary, type, tags, url columns. (2.9 GB)
test_set.csv contains title, body, summary, type, tags, url columns. (113 MB)
validation_set.csv contains title, body, summary, type, tags, url columns. (113 MB)
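
As a minimal sketch of how these files might be read, the snippet below streams rows with Python's standard csv module so the 2.9 GB file is never loaded into memory at once. It assumes the files are UTF-8 encoded and start with a header row naming the six columns listed above; neither detail is stated here, and `iter_articles` is a hypothetical helper name.

```python
import csv

# Column names as listed in the dataset table.
COLUMNS = ["title", "body", "summary", "type", "tags", "url"]


def iter_articles(path):
    """Yield one dict per article, reading the CSV row by row.

    Assumes a UTF-8 file with a header row (an assumption, not a documented fact).
    """
    # Article bodies can be longer than the csv module's default field limit.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Keep only the documented columns; missing ones become empty strings.
            yield {col: row.get(col, "") for col in COLUMNS}
```

A typical use would be `for article in iter_articles("thaisum.csv"): ...`, reading `article["body"]` and `article["summary"]` for each pair.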

0.2 Trained Models

0.2.1 BertSum

Model Size
BertSumExt 2.1 GB
BertSumAbs 3.6 GB
BertSumExtAbs 3.6 GB

0.2.2 ARedSum

Model Size
ARedSumBase 2.1 GB
ARedSumCTX 737.6 MB
ARedSumSEQ 2.3 GB

0.3 Source Code Credit

Model Original Source Paper
BertSum GitHub aclweb
ARedSum GitHub arXiv

1. Introduction

Sequence-to-sequence (Seq2Seq) models have achieved strong results in text summarization. However, Seq2Seq models often require large-scale training data to be effective. Although text summarization has advanced impressively, most studies focus on resource-rich languages, and progress on Thai text summarization lags far behind. The dearth of large-scale datasets keeps Thai text summarization in its infancy. To the best of our knowledge, no large-scale dataset for Thai text summarization is publicly available. We therefore present ThaiSum, a large-scale corpus for Thai text summarization obtained from several online news websites, namely Thairath, ThaiPBS, Prachathai, and The Standard. The dataset consists of over 350,000 article and summary pairs written by journalists. We evaluate the performance of various existing summarization models on ThaiSum and analyse the characteristics of the dataset to show where its difficulties lie.

2. Dataset Construction

We used a Python library named Scrapy to crawl articles from several news websites, namely Thairath, Prachatai, ThaiPBS, and The Standard. We first collected the news URLs provided in their sitemaps. During crawling, we used the HTML markup and metadata available in each page to identify the article text, summary, headline, tags, and label. The collected articles were published online from 2014 to August 2020.

We then cleaned the data to reduce noise. We filtered out articles whose article text or summary is missing. Articles whose text has fewer than 150 words or whose summary has fewer than 15 words were removed. We also discarded articles that contain at least one of the following tags: ‘ดวง’ (horoscope), ‘นิยาย’ (novel), ‘อินสตราแกรมดารา’ (celebrity Instagram), ‘คลิปสุดฮา’ (funny video), and ‘สรุปข่าว’ (highlight news). Some summaries were completely irrelevant to their article texts. To eliminate them, we calculated an abstractedness score between each summary and its article text. The abstractedness score is written formally as:



Where 𝑆 denotes the set of article tokens, 𝐴 denotes the set of summary tokens, and 𝑟 denotes the total number of summary tokens. We omitted articles whose abstractedness score at 1-grams is higher than 60%.
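
The sketch below shows one way such a score could be computed. It is not the authors' implementation. It follows the wording used in Section 3.2 (unique summary n-grams that do not appear in the article text) and divides by the total number of summary n-grams, which equals 𝑟 for 1-grams; that choice of denominator is an assumption. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return every n-gram in a token sequence as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def abstractedness(article_tokens, summary_tokens, n=1):
    """Percentage of summary n-grams that never occur in the article.

    Numerator: unique summary n-grams absent from the article.
    Denominator: total number of summary n-grams (an assumed reading).
    """
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    article_set = set(ngrams(article_tokens, n))
    novel = set(summary_ngrams) - article_set
    return 100.0 * len(novel) / len(summary_ngrams)


# An article would be dropped under the 60% rule when this is True.
is_irrelevant = abstractedness(["a", "b", "x"], ["a", "b", "c", "d"]) > 60.0
```

In this toy case two of the four summary tokens are missing from the article, so the score is 50% and the pair would be kept.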

It is important to point out that we used PyThaiNLP version 2.2.4 with the newmm tokenizing engine to process Thai text in this study. Tokenizing running Thai text into words or sentences is challenging because the Thai language has no clear word or sentence delimiters. Using a different tokenization engine may therefore produce different word and sentence segmentations.
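
To make the length and tag rules concrete, here is a minimal sketch of a filter that takes the tokenizer as a parameter, since the word counts depend on it. The default whitespace tokenizer is only a stand-in so the example runs without extra packages; for Thai text you would pass something like `lambda t: word_tokenize(t, engine="newmm")` using `word_tokenize` from `pythainlp.tokenize`. The function name, the constants, and the assumption that tags arrive as a list of strings are illustrative. The abstractedness rule is not included here.

```python
# Tags whose presence causes an article to be discarded (from the text above).
EXCLUDED_TAGS = {"ดวง", "นิยาย", "อินสตราแกรมดารา", "คลิปสุดฮา", "สรุปข่าว"}
MIN_BODY_WORDS = 150
MIN_SUMMARY_WORDS = 15


def keep_article(body, summary, tags, tokenize=str.split):
    """Return True if an article passes the missing-text, tag, and length rules.

    `tokenize` maps a string to a list of words; whitespace splitting is a
    placeholder and is not suitable for real Thai text.
    """
    if not body or not summary:
        return False
    if EXCLUDED_TAGS & set(tags):
        return False
    if len(tokenize(body)) < MIN_BODY_WORDS:
        return False
    if len(tokenize(summary)) < MIN_SUMMARY_WORDS:
        return False
    return True
```

Passing the tokenizer in keeps the thresholds tied to whichever segmentation is used, which matters because a different engine changes the word counts.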

3. Dataset Property

After data cleansing, the ThaiSum dataset contains over 358,000 articles. Its size is comparable to that of a well-known English document summarization dataset, the CNN/Daily Mail dataset. We also analyse the characteristics of the dataset by measuring its abstractedness level, compression rate, and content diversity. For more details, see thaisum_exploration.ipynb.

3.1 Dataset Statistics

The ThaiSum dataset consists of 358,868 articles. The average lengths of article texts and summaries are approximately 530 and 37 words respectively. As mentioned earlier, we also collected the headlines, tags, and labels provided with each article. Tags are similar to keywords of the article. An article normally has several tags but only a few labels. Tags can be the names of places or persons that the article is about, while labels indicate the news category (politics, entertainment, etc.). In total, ThaiSum contains 538,059 unique tags and 59 unique labels. Note that not every article has tags or labels.

Dataset Size 358,868 articles
Avg. Article Length 529.5 words
Avg. Summary Length 37.3 words
Avg. Headline Length 12.6 words
Unique Vocabulary Size 407,355 words
Occurring > 10 times 81,761 words
Unique News Tag Size 538,059 tags
Unique News Label Size 59 labels

3.2 Level of Abstractedness

The abstractedness level of a summary is determined by measuring the unique n-grams in the reference summary that do not appear in the article text. Figure 1 reports the distributions of abstractedness scores at N-grams, where N ranges from 1 to 5, and at the sentence level.

Figure 1: Red vertical line represents average abstractedness score at N-gram.

3.3 Content Diversity

See assign_final_label_to_article.py for how we assign the final label to each article. This could be useful for news classification tasks.

Figure 2

4. Experiment and Result

This experiment aims to create benchmarks for the ThaiSum dataset using existing state-of-the-art summarization models in both extractive and abstractive settings.

4.1 Experimental Settings

To train sequence-to-sequence based models, we split the dataset into 336,868/11,000/11,000 documents for training, validation, and testing. Input documents were truncated at 500 words. We used 'bert-base-multilingual-uncased' for fine-tuning. We trained the BertSum and ARedSum models on two GPUs (NVIDIA Titan RTX). Other parameters were set as in the original experiments of the corresponding papers. The summaries produced by extractive models, except ORACLE, were limited to 2 sentences.
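
The split sizes add up to the full dataset (336,868 + 11,000 + 11,000 = 358,868). The sketch below shows one way to produce a split of these sizes and to truncate inputs. How the original split was drawn is not stated, so the seeded random shuffle and the helper names are assumptions.

```python
import random


def split_indices(n_total, n_valid=11_000, n_test=11_000, seed=0):
    """Shuffle article indices and cut them into train/validation/test lists.

    The random shuffle and the seed are illustrative assumptions.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


def truncate(tokens, max_words=500):
    """Keep only the first `max_words` tokens of an input document."""
    return tokens[:max_words]


train, valid, test = split_indices(358_868)
```

With the dataset size from Section 3.1, `train` ends up with the remaining 336,868 indices after the validation and test sets are taken.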

4.2 Performance of Existing Systems

4.2.1 Automatic Metric Evaluation

Model            ROUGE-1  ROUGE-2  ROUGE-L  BertScore-F1
Baselines
Oracle           52.65    28.34    52.25    82.83
Lead2            42.74    25.58    42.69    83.28
Lead2+Trigram    42.23    25.03    42.18    83.24
Extractive
ARedSum-Base     43.86    25.65    43.80    80.66
ARedSum-CTX      40.72    24.17    40.67    79.48
ARedSum-SEQ      43.06    24.48    43.01    81.07
BertSumExt       44.39    26.58    44.34    78.82
Abstractive
BertSumAbs       48.82    29.70    48.75    84.58
BertSumExtAbs    49.52    29.86    49.48    85.85

ROUGE-F1 and BertScore-F1 scores on the test set.
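
For readers unfamiliar with the metric, the following is a simplified sketch of how a ROUGE-N F1 score is formed from n-gram overlap between a reference and a candidate summary. It is not the evaluation tooling used to produce the scores reported here, it omits ROUGE-L, and it will not reproduce those numbers. It returns a fraction in [0, 1] rather than a percentage, and it expects text that has already been tokenized.

```python
from collections import Counter


def rouge_n_f1(reference_tokens, candidate_tokens, n=1):
    """Simplified ROUGE-N F1 from clipped n-gram overlap counts."""
    ref = Counter(
        tuple(reference_tokens[i:i + n])
        for i in range(len(reference_tokens) - n + 1)
    )
    cand = Counter(
        tuple(candidate_tokens[i:i + n])
        for i in range(len(candidate_tokens) - n + 1)
    )
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because the score is computed over tokens, the choice of Thai word tokenizer directly affects it.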

4.2.2 Position of Extracted Sentences

Most extractive models heavily favour the first three sentences of an article as the output summary. This is very common on news datasets (as with the CNN/Daily Mail dataset), where articles are written in the inverted pyramid style. The first few sentences of a news article contain the most important information, which makes Lead-3 baselines perform impressively well.

Figure 3: Proportion of extracted sentences according to their position in the original document.
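
A lead baseline and a position tally of this kind can be sketched in a few lines. The code below assumes documents are already split into sentences and that each model's selections are available as lists of zero-based sentence indices; both helper names are hypothetical.

```python
from collections import Counter


def lead_k(sentences, k=2):
    """Lead-k baseline: use the first k sentences as the summary."""
    return sentences[:k]


def position_proportions(selected_positions_per_doc):
    """Share of extracted sentences found at each position across documents.

    `selected_positions_per_doc` holds one list of zero-based sentence
    indices per document.
    """
    counts = Counter(
        pos for doc in selected_positions_per_doc for pos in doc
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: counts[pos] / total for pos in sorted(counts)}
```

For example, `position_proportions([[0, 1], [0, 2]])` reports that half of the selected sentences came from position 0.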

4.2.3 Influence of Sentence Segmenter

See simple_thai_sentence_segmentation.py for our simple Thai sentence segmenter.

As mentioned earlier, it is difficult (and sometimes a matter of judgment) to pinpoint where sentences end in running Thai text. We investigated how different sentence segmentation engines affect the performance of the summarization models. We found that the sentence segmenter from ThaiNLP sometimes generates unnecessarily long sentences. We therefore created a simple Thai sentence segmenter that considers conjunction words and sentence length. We compared ROUGE-F1 results on the same test set segmented by the different sentence segmenters. Note that, for BertSumExt, the training and validation sets were also segmented by the corresponding segmenter, not just the test set. The table below shows the comparison. In short, the choice of sentence segmenter significantly affects the performance of extractive models.

Model        ThaiNLP                Our Segmenter
             R1     R2     RL       R1     R2     RL
Oracle       52.65  28.34  52.25    63.06  35.93  62.94
Lead-2       42.74  25.58  42.69    52.72  31.13  52.67
BertSumExt   44.39  26.58  44.34    42.70  25.91  42.63
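
To illustrate the general idea of segmenting on conjunction words and sentence length, here is a minimal sketch. It is not the code in simple_thai_sentence_segmentation.py. The conjunction list, the length thresholds, and the function name are all hypothetical, and the input is assumed to be a list of words already produced by a tokenizer.

```python
# Hypothetical conjunction list for illustration; not taken from the repository.
CONJUNCTIONS = {"และ", "แต่", "หรือ", "เพราะ", "ดังนั้น"}


def segment(tokens, min_len=10, max_len=40, conjunctions=CONJUNCTIONS):
    """Split a token list into sentences (each a list of tokens).

    A new sentence starts at a conjunction once the current sentence has at
    least `min_len` tokens, or unconditionally once it reaches `max_len`.
    """
    sentences, current = [], []
    for tok in tokens:
        at_conjunction = tok in conjunctions and len(current) >= min_len
        if current and (at_conjunction or len(current) >= max_len):
            sentences.append(current)
            current = []
        current.append(tok)
    if current:
        sentences.append(current)
    return sentences
```

The `max_len` cap is what prevents the unnecessarily long sentences described above, while `min_len` stops every conjunction from starting a new, very short sentence.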

5. Licence

6. Cite this work

@mastersthesis{chumpolsathien_2020,
    title={Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization},
    author={Chumpolsathien, Nakhun},
    year={2020},
    school={Beijing Institute of Technology}
}

7. Acknowledgment

thaisum's Issues

Put file as train/valid/test on git-lfs or other easily downloaded link for integration with huggingface

I'm trying to integrate ThaiSum to huggingface/datasets. One of the key things to do is to have an easily accessible download link of a compressed file (data.zip) where inside are train, validation and test files.

Example: huggingface/datasets#981

Currently ThaiSum needs to be downloaded via Google Drive, which is not very convenient for this purpose.
I'm wondering if you would consider hosting it on git-lfs (or other more easily accessible links) instead.

How to preprocess my own test set with ThaiSum model?

The ThaiSum README.md says "Note that you must preprocess your own text to suit each model; we provide examples." However, I am not sure where the examples you mention are. Would you mind explaining the steps for preprocessing the data?
