jieba's Introduction

Jieba

(See the note below for versions 0.2.0 and earlier.)

A Rustler bridge to jieba-rs, the Rust Jieba implementation.

It lets Elixir code use the jieba-rs segmenter to segment Chinese text.

The API is mostly a direct mapping of the Rust API. The constructors have all been combined under one new/2 API that allows the code to feel less imperative.

The KeywordExtract functionality for both TFIDF and TextRank is also provided. However, the design of jieba-rs makes it hard to project those two Rust structs into the BEAM while respecting Rust lifetime rules and ensuring mutual exclusion across threads, so they are exported as single-use functions that construct and tear down the TFIDF and TextRank instances on every call. This is possibly slow, but making it fast would require modifying the jieba-rs API so that neither TFIDF nor TextRank holds a reference to the underlying Jieba instance on construction and instead takes the wanted instance in the extract_tags() call.

Installation

If available in Hex, the package can be installed by adding jieba to your list of dependencies in mix.exs:

def deps do
  [
    {:jieba, "~> 0.3.1"}
  ]
end

Versions prior to 0.2.0 were written by mjason (lmj on Hex) and released from the mjason/jieba_ex source tree. They exposed a single Jieba.cut(sentence) function that used a single, unsynchronized, static instance of Jieba on the Rust side, loaded with the default dictionary. cut(sentence) was hardcoded to have hmm=false.

In March 2024, this codebase was written to help with the Visual Fonts project, without the author realizing that an existing codebase was available. This codebase exposes more of the Rust API. After talking with mjason, it was decided to switch to this codebase and to increment the version number to signal the API break.

The 0.3.z versions still include the Jieba.cut/1 interface but mark it as deprecated. In 1.0.0, it will be removed in favor of an API that is not based on a global object.

jieba's People

Contributors

awong-dev, kianmeng

Forkers

jkwchui, kianmeng

jieba's Issues

Make APIs consistent

Some APIs take positional args. Some take options.
Some return a result tuple with a ! version that raises exceptions. Others have no ! version.
The pattern for handling default options is confusing as setting one option might unset the other defaults.

Go make this all consistent.

hmm accidentally enabled even when cut/3 has hmm: :false

Doing:

test "empty" do
  empty_jieba = Jieba.new!(use_default: :false, hmm: false)
  assert Jieba.cut(empty_jieba, "鄧小平學生好可憐") == ["鄧","小","平","學","生","好","可","憐"]
end

will fail, with 鄧小平 listed as a segment. This is because the HMM model itself has segmentation data, and it is incorrectly used even with hmm: false.
