gabrielpondc / oovunderstand

This project uses a contextual Word2Vec model to understand out-of-vocabulary (OOV) words. OOV words are extracted using left-right entropy and point information entropy. Word2Vec builds the word vector space, and CBOW (continuous bag of words) captures the contextual information of each word.

Home Page: https://www.igi-global.com/article/contextual-word2vec-model-for-understanding-chinese-out-of-vocabularies-on-online-social-media/309428

License: GNU General Public License v3.0

Python 100.00%
oov word2vec wordembedding

oovunderstand's Introduction

Chinese OOV recognition and understanding by contextual Word2Vec model


Content

Project Introduction



How to Run

Mine the data for the corpus
$ python weibomining.py
Extract words from the corpus as a word list
$ python oovfinder.py
Compare the word list with the dictionary and extract the OOV words as a list
$ python isoov.py
Filter person names, organization names, and place names out of the OOV list to produce a cleaned OOV list
$ python namefinder.py
$ python placefinder.py
$ python orgfinder.py
Crawl additional corpus from Weibo using the OOV words as keywords
$ python keywordcorpuscrawl.py
Merge the keyword corpus with the original corpus and segment the words with jieba
$ python splitsystem.py
Train the model and calculate the similarity for each OOV word
$ python modeltraining.py
Additional experiment: input an OOV word directly for semantic understanding
$ python modeltraining.py
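
The dictionary comparison and name filtering steps amount to a set difference. The sketch below is a minimal illustration under that assumption; `extract_oov`, its parameters, and the sample words are hypothetical and are not taken from isoov.py or the finder scripts.

```python
def extract_oov(candidates, dictionary, excluded=()):
    """Return candidates found in neither the dictionary nor the exclusion
    lists, preserving first-seen order and dropping duplicates."""
    known = set(dictionary) | set(excluded)
    seen = set()
    oov = []
    for word in candidates:
        if word not in known and word not in seen:
            seen.add(word)
            oov.append(word)
    return oov


# Toy data for illustration only (hypothetical, not from the project corpus).
candidates = ["新冠", "肺炎", "耗子尾汁", "首尔", "新冠"]
dictionary = {"肺炎", "病毒"}
place_names = {"首尔"}

oov_words = extract_oov(candidates, dictionary, excluded=place_names)
print(oov_words)
```

In practice the exclusion set would be the union of the person, place, and organization names found in the filtering step.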

Word Extraction

Mutual information (MI)
The higher the correlation between X and Y, the more likely X and Y are to form a word. The lower the mutual information, the lower the correlation between X and Y, and the more likely there is a word boundary between them.
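
As a rough sketch of this idea, the code below scores adjacent token pairs, assuming the standard pointwise form MI(X, Y) = log2(P(X, Y) / (P(X) P(Y))). The function name `pmi_scores` and the toy character sequence are illustrative assumptions and are not taken from oovfinder.py.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for every adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy character sequence (made up): "新冠" always occurs together,
# while "冠" is followed by a different character each time.
tokens = list("新冠病新冠毒新冠炎病毒炎")
scores = pmi_scores(tokens)

# The pair inside the word scores higher than the pair across its boundary.
print(scores[("新", "冠")] > scores[("冠", "病")])
```

On very small corpora, rare pairs get inflated scores, so a frequency threshold is usually applied before ranking candidates.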
Left and right entropy

W: a candidate word after N-gram segmentation.
A: the set of all words appearing to the left of the candidate.
a: a word appearing on the left.
B: the set of all words appearing to the right of the candidate.
b: a word appearing on the right.
The more varied the words that appear around the candidate word W, the more likely it is that W is a word.
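
A minimal sketch of left and right entropy follows, assuming the usual Shannon entropy over the distribution of left neighbors (a in A) and right neighbors (b in B) of W. The helper names and the toy sequence are hypothetical and are not the project's own code.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy (base 2) of a neighbor-frequency counter."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def left_right_entropy(tokens, candidate):
    """Return (left entropy, right entropy) for a candidate given as a tuple of tokens."""
    n = len(candidate)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == candidate:
            if i > 0:
                left[tokens[i - 1]] += 1       # neighbor a on the left
            if i + n < len(tokens):
                right[tokens[i + n]] += 1      # neighbor b on the right
    return entropy(left), entropy(right)


# Toy character sequence (made up): "新冠" appears with three different
# neighbors on each side, while "冠" alone is always preceded by "新".
tokens = list("吃新冠药打新冠针防新冠病")
word_left, word_right = left_right_entropy(tokens, ("新", "冠"))
part_left, part_right = left_right_entropy(tokens, ("冠",))
print(word_left, word_right, part_left)
```

A candidate with high entropy on both sides appears in varied contexts and is more likely to be a standalone word. A left entropy of zero, as for "冠" here, suggests the candidate is only a fragment of a longer word.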


Some Results

| Class | OOV | Similar Words of OOV |
|-------|-----|----------------------|
| A | 天才病 (Genius Disease) | 阿兹伯格综合症 (Asperger's Syndrome) |
| B | 新冠 (COVID-19) | 感染 (infection), 病毒 (virus), 肺炎 (pneumonia) |
| C | 凤凰网 (media organization) | 应该 (should), 讨论 (discuss), 看法 (view) |

The example shows '凤凰网' (media organization) on the left and '新冠' (COVID-19) on the right. Because '凤凰网' often appears at the end of news posts, its context carries little information and a lot of noise, so its meaning is difficult to predict. In contrast, '新冠' has rich contextual information, so its predicted meaning is relatively accurate.

The next example shows how the CBOW and Skip-gram models understand '耗子尾汁'. Both models find semantically related words, but the similarity between the two words is higher under the CBOW model.

| Model | A | B | C | Accuracy |
|-------|---|---|---|----------|
| CBOW | 21 | 13 | 1 | 97.10% |
| Skip-gram | 17 | 14 | 4 | 88.57% |

Results for the OOV '耗子尾汁'

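| Word | Translation | Similarity |
|------|-------------|------------|
| 好自为之 | Take care of yourself | 0.99997896 |
| | particle (in Chinese) | 0.99997878 |
| | i | 0.99997693 |
| 马保国 | Baoguo Ma | 0.99997658 |
| | Also | 0.99997264 |
| | and | 0.99997222 |
| | particle (in Chinese) | 0.99997193 |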
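
Similarity scores like these are typically cosine similarities between word vectors. Assuming that is the measure used here, the sketch below ranks neighbors of an OOV word; the 3-dimensional vectors are made up for illustration, and `most_similar` is a hypothetical helper, not the function in modeltraining.py.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Made-up vectors for illustration only; real vectors come from the trained model.
vectors = {
    "耗子尾汁": [0.9, 0.1, 0.3],
    "好自为之": [0.8, 0.2, 0.3],
    "马保国": [0.7, 0.1, 0.5],
    "肺炎": [-0.2, 0.9, 0.1],
}

ranking = most_similar("耗子尾汁", vectors)
for word, score in ranking:
    print(word, round(score, 4))
```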

About the Author

JiaKai Gu
E-mail: [email protected]
Jason J. Jung
Department of Computer Engineering, Chung-Ang University 84, Heukseok-ro, Dongjak-gu, Seoul, Republic of Korea 06974
Tel.: +82-2-820-5136
Fax: +82-2-820-5301
E-mail: [email protected]

Cite this project

@article{gu2022contextual,
   author = {Gu, JiaKai and Li, Gen and Vo, Nam D. and Jung, Jason J.},
   title = {Contextual Word2Vec Model for Understanding Chinese Out of Vocabularies on Online Social Media},
   journal = {International Journal on Semantic Web and Information Systems (IJSWIS)},
   volume = {18},
   number = {1},
   pages = {1-14},
   ISSN = {1552-6283},
   DOI = {10.4018/IJSWIS.309428},
   url = { https://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJSWIS.309428 },
   year = {2022},
   type = {Journal Article}
}

Data source

