
Truly universal encoding detector in pure Rust - port of Python version

Home Page: https://crates.io/crates/charset-normalizer-rs

License: MIT License

Topics: chardet, charset, charset-conversion, charset-detection, charset-detector, charset-normalizer, encoding, encoding-decoding, rust

charset-normalizer-rs's Introduction

Charset Normalizer


A library that helps you read text from an unknown charset encoding.
Motivated by the original Python version of charset-normalizer, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Rust encoding library provides codecs are supported.

This project is a port of the original Python version of Charset Normalizer. The biggest difference between the Python and Rust versions is the number of supported encodings, since each language has its own encoding/decoding library. The Rust version supports only the encodings from the WHATWG standard; the Python version supports more, but many of them are old and almost unused.

⚡ Performance

This package offers better performance than the Python version (4 times faster than the MYPYC-compiled version of charset-normalizer and 8 times faster than the plain Python version). Compared with the chardet and chardetng packages, it has approximately the same speed but is more accurate. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|---------|----------|--------------------|--------------------|
| chardet | 82.6 % | 3 ms | 333 file/sec |
| chardetng | 90.7 % | 1.6 ms | 625 file/sec |
| charset-normalizer-rs | 97.1 % | 1.5 ms | 666 file/sec |
| charset-normalizer (Python + MYPYC version) | 98 % | 8 ms | 125 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
|---------|-----------------|-----------------|-----------------|
| chardet | 8 ms | 2 ms | 0.2 ms |
| chardetng | 14 ms | 5 ms | 0.5 ms |
| charset-normalizer-rs | 12 ms | 5 ms | 0.7 ms |
| charset-normalizer (Python + MYPYC version) | 94 ms | 37 ms | 3 ms |

Stats are generated from 400+ files using default parameters. These results might change at any time. The dataset can be updated to include more files. The actual delays depend heavily on your CPU capabilities, but the relative factors should remain the same. The Rust version's dataset has been reduced because the number of supported encodings is lower than in the Python version.

There is still room to speed up the library, so I'll appreciate any contributions.

✨ Installation

Library installation:

cargo add charset-normalizer-rs

Binary CLI tool installation:

cargo install charset-normalizer-rs

🚀 Basic Usage

CLI

This package comes with a CLI that is meant to be compatible with the Python version's CLI tool.

normalizer -h
Usage: normalizer [OPTIONS] <FILES>...

Arguments:
  <FILES>...  File(s) to be analysed

Options:
  -v, --verbose                Display complementary information about file if any. Stdout will contain logs about the detection process
  -a, --with-alternative       Output complementary possibilities if any. Top-level JSON WILL be a list
  -n, --normalize              Permit to normalize input file. If not set, program does not write anything
  -m, --minimal                Only output the charset detected to STDOUT. Disabling JSON output
  -r, --replace                Replace file when trying to normalize it instead of creating a new one
  -f, --force                  Replace file without asking if you are sure, use this flag with caution
  -t, --threshold <THRESHOLD>  Define a custom maximum amount of chaos allowed in decoded content. 0. <= chaos <= 1 [default: 0.2]
  -h, --help                   Print help
  -V, --version                Print version
normalizer ./data/sample.1.fr.srt

🎉 The CLI produces an easily usable JSON result on stdout (it should be the same as in the Python version).

{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}

Rust

The library offers two main functions. The first is from_bytes, which processes text using bytes as the input parameter:

use charset_normalizer_rs::from_bytes;

fn test_from_bytes() {
    let result = from_bytes(&vec![0x84, 0x31, 0x95, 0x33], None);
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "gb18030");
}

fn main() {
    test_from_bytes();
}

from_path processes text using a filename as the input parameter:

use std::path::PathBuf;
use charset_normalizer_rs::from_path;

fn test_from_path() {
    let result = from_path(&PathBuf::from("src/tests/data/samples/sample-chinese.txt"), None).unwrap();
    let best_guess = result.get_best();
    assert_eq!(best_guess.unwrap().encoding(), "big5");
}

fn main() {
    test_from_path();
}

😇 Why

When I started using Chardet (the Python version), I noticed that it did not meet my expectations, and I wanted to propose a reliable alternative using a completely different method. Also! I never back down from a good challenge!

I don't care about the originating charset encoding, because two different tables can produce two identical rendered strings. What I want is to get readable text, the best I can.

In a way, I'm brute forcing text decoding. How cool is that? 😎

🍰 How

  • Discard all charset encoding tables that could not fit the binary content.
  • Measure the noise, or mess, once the content is opened (in chunks) with a corresponding charset encoding.
  • Extract the matches with the lowest mess detected.
  • Additionally, we measure coherence / probe for a language (see the toy sketch below).
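
To make the steps concrete, here is a toy sketch of the pipeline; it is not the crate's actual code: the stand-in decoder only knows UTF-8 and Latin-1, the mess measure only counts replacement and control characters, and the coherence step is omitted.

// Toy stand-in for a real decoder: returns None when the bytes are not
// valid for the table (step 1: discard), otherwise the decoded text.
fn decode(payload: &[u8], name: &str) -> Option<String> {
    match name {
        "utf-8" => String::from_utf8(payload.to_vec()).ok(),
        // Latin-1 maps every byte to the code point with the same value.
        "latin-1" => Some(payload.iter().map(|&b| b as char).collect()),
        _ => None,
    }
}

// Toy mess measure (step 2): fraction of replacement / control characters.
fn mess_ratio(text: &str) -> f32 {
    let suspicious = text
        .chars()
        .filter(|c| *c == '\u{FFFD}' || (c.is_control() && !c.is_whitespace()))
        .count();
    suspicious as f32 / text.chars().count().max(1) as f32
}

// Step 3: keep the candidate with the lowest mess.
fn detect<'a>(payload: &[u8], candidates: &[&'a str]) -> Option<(&'a str, f32)> {
    candidates
        .iter()
        .filter_map(|&name| decode(payload, name).map(|text| (name, mess_ratio(&text))))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}

fn main() {
    println!("{:?}", detect("café, naïveté".as_bytes(), &["utf-8", "latin-1"]));
}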

Wait a minute, what is noise/mess and coherence according to YOU?

Noise : I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established some ground rules about what is obvious when it seems like a mess. I know that my interpretation of what is noise is probably incomplete; feel free to contribute in order to improve or rewrite it.

Coherence : For each language on earth, we have computed ranked letter-appearance occurrences (the best we can). So I thought that intel is worth something here. I use those records against decoded text to check if I can detect intelligent design.
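
Illustratively, and not with the crate's real tables or scoring, such a probe can rank the letters of the decoded text by frequency and check the overlap with a per-language reference ranking:

use std::collections::HashMap;

// Toy coherence probe: how many of the text's most frequent letters appear
// among the language's most frequent letters? The reference list here is an
// illustrative subset, not the actual tables shipped with the crate.
fn coherence(text: &str, reference_top: &[char]) -> f32 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in text.chars().filter(|c| c.is_alphabetic()) {
        *counts.entry(c.to_ascii_lowercase()).or_insert(0) += 1;
    }
    let mut ranked: Vec<(char, usize)> = counts.into_iter().collect();
    ranked.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // most frequent first
    let hits = ranked
        .iter()
        .take(reference_top.len())
        .filter(|(c, _)| reference_top.contains(c))
        .count();
    hits as f32 / reference_top.len() as f32
}

fn main() {
    let english = ['e', 't', 'a', 'o', 'i', 'n']; // illustrative subset
    println!("{:.2}", coherence("the quick brown fox jumps over the lazy dog", &english));
}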

⚡ Known limitations

  • Language detection is unreliable when the text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
  • Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.

👤 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check the issues page if you want to contribute.

📝 License

Copyright © Nikolay Yarovoy @nickspring - porting to Rust.
Copyright © Ahmed TAHRI @Ousret - original Python version and some parts of this document.
This project is MIT licensed.

Character frequencies used in this project © 2012 Denny Vrandečić

charset-normalizer-rs's People

Contributors

chris-ha458, nickspring

charset-normalizer-rs's Issues

[BUG] unreachable code?

Describe the bug
This part of the code seems to be unreachable

The outer if ensures that coherence_difference is not 0.0.
So that particular if statement should never be reached.

To Reproduce
I don't know the intended case for these statements so I cannot reproduce the "negative case"

Expected behavior
Maybe the outer if needs to be different.
In the original Python code, the comparison chaos_difference == 0.0 and self.coherence == other.coherence is used.
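
For illustration only, with invented names rather than the crate's actual fields, the quoted Python condition translates to something like:

// Hypothetical sketch mirroring the quoted Python condition; the names are
// invented, not the crate's API. The exact f32 comparison is deliberate here
// because it mirrors the Python code under discussion.
fn is_tie(chaos_difference: f32, coherence_a: f32, coherence_b: f32) -> bool {
    // Reachable only if the outer check tests chaos_difference,
    // not coherence_difference.
    chaos_difference == 0.0 && coherence_a == coherence_b
}

fn main() {
    assert!(is_tie(0.0, 97.2, 97.2));
}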

[BUG] correct behavior for “Ё” (U+0401)

Describe the bug
In test_is_accentuated


This case is tested to see if it is false.
“Ё” (U+0401) Cyrillic Capital Letter Io

The code being tested is here

"WITH DIAERESIS",

The problem here is that it is considered to have a diaeresis under current Unicode decomposition rules (both NFKD and NFD):
https://www.compart.com/en/unicode/U+0401
https://graphemica.com/%D0%81

(BTW this is different from the almost identical-looking Unicode character “Ë” (U+00CB) Latin Capital Letter E with Diaeresis.)

To Reproduce
The icu4x crates can be used to decompose characters in Rust:
cargo add icu_normalizer
I am actually trying to reimplement some parts of the code, and that is how I discovered it.

pub(crate) fn is_accentuated(ch: char) -> bool {
    let nfd = icu_normalizer::DecomposingNormalizer::new_nfkd();
    let denormalized_string: String = nfd.normalize(ch.to_string().as_str());
    denormalized_string
        .chars()
        .any(|decomposed| match decomposed {
            '\u{0300}'   // "WITH GRAVE"
            | '\u{0301}' // "WITH ACUTE"
            | '\u{0302}' // "WITH CIRCUMFLEX"
            | '\u{0303}' // "WITH TILDE"
            | '\u{0308}' // "WITH DIAERESIS"
            | '\u{0327}' // "WITH CEDILLA"
            => true,
            _ => false,
        })
}

This new implementation directly decomposes the input character and checks whether any combining characters that indicate accents are present.
Since “Ё” (U+0401) Cyrillic Capital Letter Io decomposes into “Е” Cyrillic Capital Letter Ie + diaeresis '\u{0308}',
the new code returns true, while the old code returns false (since “diaeresis” is not in the character's name).

Expected behavior
“Ё” (U+0401) should return true.
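
A regression test along these lines (against the decomposition-based implementation above; the test name is made up) would pin that expectation down:

#[test]
fn cyrillic_io_is_accentuated() {
    // “Ё” (U+0401) decomposes under NFD/NFKD into “Е” (U+0415) followed by
    // U+0308 COMBINING DIAERESIS, so it should be reported as accentuated.
    assert!(is_accentuated('\u{0401}'));
}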

Additional context
The Unicode standard is fast-moving: a new version is released every year, and new codepoints are added constantly, especially for CJK.
I think it is valuable to have an implementation that is up to date.

BTW, I have almost finished my implementation using various components from https://github.com/unicode-org/icu4x
It is a pure-Rust codebase worked on by both standards bodies and industry supporters such as Google, so I feel it would be a good library to rely upon.

[BUG] cargo audit gives a warning about `encoding` being unmaintained

When running cargo audit I get this warning about encoding being unmaintained:

Crate:     encoding
Version:   0.2.33
Warning:   unmaintained
Title:     `encoding` is unmaintained
Date:      2021-12-05
ID:        RUSTSEC-2021-0153
URL:       https://rustsec.org/advisories/RUSTSEC-2021-0153
Dependency tree:
encoding 0.2.33
└── charset-normalizer-rs 1.0.6 

The crate https://crates.io/crates/encoding_rs is suggested as a possible alternative, so it may be worth seeing whether it could be used instead.
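
For reference, a minimal sketch of label-based decoding with encoding_rs; for_label and decode are its actual API, though whether the crate can switch to it without behavior changes is the open question:

use encoding_rs::Encoding;

// Resolve a WHATWG encoding label, then decode. `decode` returns the decoded
// text, the encoding actually used (after BOM sniffing), and an error flag.
fn decode_with_label(label: &str, payload: &[u8]) -> Option<String> {
    let enc = Encoding::for_label(label.as_bytes())?;
    let (decoded, _actual, had_errors) = enc.decode(payload);
    if had_errors {
        None
    } else {
        Some(decoded.into_owned())
    }
}

fn main() {
    // 0xC3 0xA9 is "é" in UTF-8.
    assert_eq!(decode_with_label("utf-8", &[0xC3, 0xA9]).as_deref(), Some("é"));
}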

Improvements : Idiomatic code

Re : #3

This codebase has been ported from Python, and a lot of its design patterns could be improved into more idiomatic Rust code.
Such a move would make it easier to improve speed and maintainability, and would help ensure correct operation from a Rust point of view.

Some examples would be avoiding for loops, using match instead of if chains, etc.; a small before/after sketch follows below.
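
For instance, with made-up helper names:

// An if-chain on byte ranges becomes a single match with range patterns.
fn classify(byte: u8) -> &'static str {
    match byte {
        0x00..=0x1F => "control",
        0x20..=0x7E => "printable",
        _ => "extended",
    }
}

// A push-inside-for-loop becomes an iterator pipeline that collects once.
fn printable_only(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().copied().filter(|b| (0x20..=0x7E).contains(b)).collect()
}

fn main() {
    assert_eq!(classify(b'A'), "printable");
    assert_eq!(printable_only(b"\x01ab\xff"), b"ab".to_vec());
}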

Many require deeper consideration.

For example, this codebase makes extensive use of f32. Unless intrinsics are involved, f64 is as fast as or faster than f32 in Rust.
Moreover, casting back and forth between f32 and f64 can harm performance and make it difficult to ensure correct code. For instance, there are exact comparisons between f32 and f64 variables, and these are very unlikely to operate in the intended way. If they are intended, it would be valuable to document that and suppress the relevant lints.
However, if there is a need to maintain ABI compatibility or follow a specification, it might be inevitable. Also, on-disk size could be a consideration.
In summary, f32 vs f64 handling could serve both idiomatic code and speed, but only if done right.

I will try to prepare some PRs that change some things. Despite my best efforts, I am sure that many of my changes or views might be based on a flawed understanding of the code, so feel free to explain why things were done the way they were.
In such cases I will help with documentation.

Improvements : Speed

As per our discussion in #2, the following has been suggested for speed improvements:

  • calc coherence & mess in threads
  • or calc mess for plugins in threads (or some async?)
  • or something other...

The paths I had in mind were these:

  • Related to the threads idea: use Rayon
    • Replace HashMap with the concurrent DashMap (the current std HashMap implements rayon traits, so this is not strictly necessary, but it might be useful to look into regardless)
  • Replace the hashing algorithm used in HashMap with FxHash, aHash, or HighwayHash
    • aHash implemented #14
  • Replace sort() with sort_unstable() #6
  • Identify preallocation opportunities. For instance, replace Vec::new() with Vec::with_capacity()
    • It seems most current new() calls cannot really preallocate due to uncertainty. The basic preallocation algorithm is optimized enough that unless we have a strong idea about memory access, premature allocation is not helpful.

Many of these are low-hanging fruit, related to refactoring the code into idiomatic Rust.
For example, there are many for loops in this code. Iterator-based code is more idiomatic, easier to improve with rayon, and interacts better with allocation (pushing items from within a for loop can cause multiple allocations and copies, while collecting an iterator can allow fewer allocations). A sketch follows below.
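
As a sketch combining two of the items above, assuming the rayon crate, with the length of the name standing in for the real mess/coherence work:

use rayon::prelude::*;

// Score all candidate encodings in parallel, then rank with sort_unstable
// (#6), which avoids the extra allocation a stable sort needs.
fn rank_candidates(candidates: Vec<String>) -> Vec<(String, usize)> {
    let mut scored: Vec<(String, usize)> = candidates
        .into_par_iter() // rayon: one task per candidate
        .map(|name| {
            let score = name.len(); // placeholder for the real computation
            (name, score)
        })
        .collect();
    scored.sort_unstable_by_key(|&(_, score)| score);
    scored
}

fn main() {
    println!("{:?}", rank_candidates(vec!["utf-8".into(), "gb18030".into()]));
}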
