jieba-zh_TW

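A Taiwan Traditional Chinese version of the jieba word segmenter

How it works

It uses the same algorithm as the original jieba, but replaces the dictionary and the HMM probability tables to produce a jieba segmenter tailored to Taiwan Traditional Chinese.

Usage

  • Compatible with Python 2 and Python 3
  • Place the jieba folder inside your program's folder
  • import jieba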
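
If the original jieba is also installed (for example via pip), it may not be obvious which copy `import jieba` picks up. The following is a minimal sketch, using only the standard library, that checks whether a package name resolves to a location inside a given folder. The helper name `resolves_inside` is hypothetical and is not part of jieba.

```python
import importlib.util
from pathlib import Path


def resolves_inside(package_name: str, project_dir: str) -> bool:
    """Return True if `package_name` would be imported from inside `project_dir`.

    Hypothetical helper for illustration; it only inspects the import
    machinery and does not import the package itself.
    """
    spec = importlib.util.find_spec(package_name)
    # Not found, or a built-in/namespace module without a file location.
    if spec is None or not spec.has_location or spec.origin is None:
        return False
    origin = Path(spec.origin).resolve()
    root = Path(project_dir).resolve()
    return root in origin.parents


if __name__ == "__main__":
    # True only if the jieba folder next to your program is the one found.
    print(resolves_inside("jieba", "."))
```

Run from your program's folder, a result of `False` suggests that another jieba installation (or none) is being found instead of the local folder.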

Code examples

Usage is the same as the original jieba.

Word segmentation

import jieba

# If you need to use two versions of jieba on the same machine, set a custom
# cache file name so that the two caches do not overwrite each other.
# jieba.dt.cache_file = 'jieba.cache.new'

seg_list = jieba.cut("在非洲,每六十秒,就有一分鐘過去") 
print("|".join(seg_list))
# 在|非洲|,|每|六十秒|,|就|有|一分鐘|過去

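Keyword extraction

The probability table has not been replaced yet, so the output is very unreliable.

Part-of-speech tagging

This will most likely raise an error as soon as you run it.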

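Reliability

This code was compared with two other approaches, (1) converting the text to Simplified Chinese and then segmenting with jieba, and (2) segmenting the Traditional Chinese text directly with jieba, on this news article written by a Taiwanese journalist. The output of the Academia Sinica Chinese Word Segmentation System was taken as the reference answer, and the edit distance between each method's output and the Academia Sinica result was computed with the word as the unit.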
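
A word-level edit distance can be sketched as the standard Levenshtein distance applied to lists of tokens instead of characters. The code below is an illustration under that assumption, not the author's actual evaluation script. The function name `word_edit_distance` is hypothetical, the "reference" is simply the sample output from this README (not Academia Sinica output), and the "hypothesis" is made up.

```python
from typing import Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit cost per word)."""
    # prev[j] = distance between the hypothesis prefix seen so far and ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # drop the hypothesis word h
                curr[j - 1] + 1,     # insert the reference word r
                prev[j - 1] + cost,  # keep (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


# Illustrative data only: the reference is this README's sample output,
# and the hypothesis is an invented segmentation that splits one word.
reference = "在|非洲|,|每|六十秒|,|就|有|一分鐘|過去".split("|")
hypothesis = "在|非洲|,|每|六十|秒|,|就|有|一分鐘|過去".split("|")

print(word_edit_distance(hypothesis, reference))  # 2
```

Splitting one reference word into two costs 2 here (one substitution plus one deletion), so boundary disagreements are penalized per affected word.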

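| Edit distance | Paragraph 1 (92) | Paragraph 2 (136) | Paragraph 3 (75) | Paragraph 4 (52) | Paragraph 5 (63) |
| --- | --- | --- | --- | --- | --- |
| jieba zh_TW | 9 | 20 | 12 | 12 | 9 |
| jieba after converting to Simplified Chinese | 44 | 43 | 31 | 23 | 21 |
| jieba directly on Traditional Chinese | 53 | 74 | 43 | 34 | 34 |

(Numbers in parentheses are the word counts produced by the Academia Sinica segmenter.)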
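
To compare the three methods on a single scale, one option is to divide the total edit distance by the total number of reference words. This normalization is an assumption added here for illustration; the original README reports only the raw distances. The numbers below are copied from the table above.

```python
# Reference word counts per paragraph (the numbers in parentheses above).
ref_counts = [92, 136, 75, 52, 63]

# Raw edit distances per paragraph, copied from the table above.
distances = {
    "jieba zh_TW": [9, 20, 12, 12, 9],
    "jieba after converting to Simplified Chinese": [44, 43, 31, 23, 21],
    "jieba directly on Traditional Chinese": [53, 74, 43, 34, 34],
}

total_ref = sum(ref_counts)  # 418 reference words in total

# Overall rate = total edits / total reference words (derived metric).
rates = {name: sum(d) / total_ref for name, d in distances.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
# jieba zh_TW: 14.8%
# jieba after converting to Simplified Chinese: 38.8%
# jieba directly on Traditional Chinese: 56.9%
```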

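Acknowledgements

  • Chinese Word Segmentation online service of the lexicon group (詞庫小組), Institute of Information Science, Academia Sinica

Notes

When using this code, please comply with the provisions on derived data in the terms of service of the Academia Sinica word segmentation service.

Known issues

See this post on my blog for details: 關於結巴(Jieba)斷詞的幾個問題 (Several issues with Jieba word segmentation)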
