A CLI for Rust SRX sentence segmentation rules as a Python package.

License: GNU General Public License v3.0


splitters's Introduction

splitte(.)rs

There's still some work pending to make this usable

A CLI for a Rust SRX implementation, distributed as a Python package.

Installation

Installing from source requires Cargo, the Rust package manager. Install it with your system package manager or via https://rustup.rs/.

Then clone the repo and install it like any other Python package:

git clone https://github.com/ZJaume/splitters
pip install ./splitters

Usage

Example usage

echo "Yes this is a sentence. Another one." | splitters -i /dev/stdin --output /dev/stdout
Yes this is a sentence.
Another one.

Full list of parameters:

splitters 0.1.0

USAGE:
    splitters [OPTIONS] --input <INPUT> --output <OUTPUT>

OPTIONS:
    -h, --help                   Print help information
    -i, --input <INPUT>
    -l, --language <LANGUAGE>    ISO-639-1, 2 char language code [default: en]
    -o, --output <OUTPUT>
    -s, --srxfile <SRXFILE>      [default: ]
    -v, --verbose
    -V, --version                Print version information
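
For calling the CLI from Python, a minimal sketch along these lines may help. It only uses the options listed in the help text above. The function names `build_command` and `split_sentences` are hypothetical and not part of the package, and the sketch assumes `splitters` is on the PATH and that /dev/stdin and /dev/stdout are available, as in the shell example.

```python
import subprocess


def build_command(input_path, output_path, language="en", srxfile=None, verbose=False):
    """Assemble the argument list for the splitters CLI (hypothetical helper)."""
    cmd = [
        "splitters",
        "--input", input_path,
        "--output", output_path,
        "--language", language,
    ]
    if srxfile is not None:
        # Optional: point the CLI at a specific SRX rules file.
        cmd += ["--srxfile", srxfile]
    if verbose:
        cmd.append("--verbose")
    return cmd


def split_sentences(text, language="en"):
    """Pipe text through the CLI and return its output lines.

    Assumes a Unix-like system and an installed `splitters` executable.
    """
    cmd = build_command("/dev/stdin", "/dev/stdout", language=language)
    result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

With the package installed, `split_sentences("Yes this is a sentence. Another one.")` would be expected to return the same two lines as the shell example.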

Compatibility with Rust regex

Some regular expressions might fail to load because of syntax incompatibilities with the Rust regex engine. To minimize this, the SRX rules bundled with this package have been partially fixed. The script scripts/fix_regex.sh applies the following fixes:

  • Escape the whitespace character at the beginning of <afterbreak>. For some reason the Rust XML parser removes the space inside rules such as <afterbreak> +</afterbreak>, so the repetition operator ends up missing its expression.
  • Unescape the 'ظ' character for Farsi. Rust regex does not require it to be escaped.
  • \Q and \E are not supported, so they are removed and everything enclosed between them is escaped.
  • Escape the dash before \d and \p{...}, which otherwise causes an invalid range literal.
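
To illustrate what two of these rewrites do to a pattern, here is a rough Python sketch of the \Q...\E expansion and the dash escaping. It is not the content of scripts/fix_regex.sh, whose actual commands are not shown here. The function names are hypothetical, and Python's `re.escape` is only a stand-in: its escaping rules differ from what the Rust engine accepts (for example, it also escapes whitespace).

```python
import re


def expand_quoted_literals(pattern):
    # Replace each \Q...\E span with its contents escaped as a literal.
    # re.escape follows Python's rules, used here purely for illustration.
    return re.sub(
        r"\\Q(.*?)\\E",
        lambda m: re.escape(m.group(1)),
        pattern,
        flags=re.DOTALL,
    )


def escape_dash_before_class(pattern):
    # Escape an unescaped '-' that is directly followed by \d or \p{...}.
    # Simplification: does not check whether the dash sits inside [...].
    return re.sub(r"(?<!\\)-(?=\\d|\\p\{)", r"\\-", pattern)


print(expand_quoted_literals(r"\Q.)\E\s"))
print(escape_dash_before_class(r"[a-\d]"))
```

Both helpers leave patterns untouched when there is nothing to rewrite, so they can be applied to every rule without first checking which ones are affected.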

To see the loading errors, run splitters with the -v option. To see the errors that were fixed, use -s to provide one of the original SRX files in data_orig.


splitters's Issues

Add missing languages

Some languages are not in the list of "best rules" but do have rules in some of the SRX files. They should be added, for example Catalan.
