joungkyun / libchardet


libchardet - Mozilla's Universal Charset Detector C/C++ API

License: Other

Makefile 10.64% C 14.99% Shell 30.30% M4 0.67% C++ 43.39% Objective-C 0.01%

libchardet's Issues

Japanese UTF-8 encoding detected as TIS-620 (Windows-874 (Thai))

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1831 we ran into poor detection of Japanese UTF-8 text: UCHARDET reports it as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕

These text editors detect it as UTF-8 and display it correctly:

Notepad++, Editpad Lite 7, Editplus, Notepad2, Notepad2e, Notepad2-mod,
Notepad2-zfuliu and VS Code!

Here is the bad detection as "TIS-620":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "ใ��ใ�นใ��ใ€�",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Here is the correct detection as "UTF-8":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "テスト。",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}
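
The mojibake in the "TIS-620" sample can be reproduced without any detector. The following is a minimal Python sketch, assuming Python's `cp874` codec is an acceptable stand-in for the TIS-620 / Windows-874 table the editor applies. The variable names are illustrative and are not part of uchardet or Notepad3.

```python
# The "description" value from the sample, encoded as UTF-8.
utf8_bytes = "テスト。".encode("utf-8")

# Each of these characters is a three-byte UTF-8 sequence. Collect the
# first byte of every sequence to see what the detector is looking at.
lead_bytes = {utf8_bytes[i] for i in range(0, len(utf8_bytes), 3)}

# Decode the same bytes with the Thai code page. Bytes that the code
# page leaves undefined are replaced with U+FFFD.
as_thai = utf8_bytes.decode("cp874", errors="replace")

# Strict UTF-8 decoding succeeds and restores the original text.
as_utf8 = utf8_bytes.decode("utf-8")

print(sorted(hex(b) for b in lead_bytes))
print(as_thai)
print(as_utf8)
```

Every lead byte is 0xE3, which Windows-874 maps to a Thai letter, so the wrong guess yields Thai-looking text mixed with replacement characters, much like the bad sample above.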

Attached is the original sample: Error Detection encoding_utf-8 (issue #1831).zip

Thanks in advance for your attention.
Have a nice day.
hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

autoheader and libtoolize warnings

Revision 55e802a has these warnings:

libtoolize: Consider adding 'AC_CONFIG_MACRO_DIRS([m4])' to configure.ac,
libtoolize: and rerunning libtoolize and aclocal.
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
autoheader-2.69: WARNING: Using auxiliary files such as `acconfig.h', `config.h.bot'
autoheader-2.69: WARNING: and `config.h.top', to define templates for `config.h.in'
autoheader-2.69: WARNING: is deprecated and discouraged.
autoheader-2.69: 
autoheader-2.69: WARNING: Using the third argument of `AC_DEFINE' and
autoheader-2.69: WARNING: `AC_DEFINE_UNQUOTED' allows one to define a template without
autoheader-2.69: WARNING: `acconfig.h':
autoheader-2.69: 
autoheader-2.69: WARNING:   AC_DEFINE([NEED_FUNC_MAIN], 1,
autoheader-2.69:                [Define if a function `main' is needed.])
autoheader-2.69: 
autoheader-2.69: WARNING: More sophisticated templates can also be produced, see the
autoheader-2.69: WARNING: documentation.
configure.ac:34: installing 'tools/compile'
configure.ac:9: installing 'tools/missing'

Cannot detect EUC extended area

EUC-TW : 湾普
EUC-KR : 똠방각하, 뷁

Expected result: EUC-TW or EUC-KR
Actual result: none

update model of Greek, Hungarian and Thai

Hungarian and Greek conflict with other charsets
Thai model support iso-8859-11

Add API option to get all the encodings confidence

Add API option to get all the encodings confidence #96
make code more straightforward

by treating the self.done = True as a real finish point of the analysis
use detect_all instead of detect(.., all=True)
fix corner case of when there is no good prober

This feature has been supported since python chardet 4.x.

Issued Environment

given string : "안녕하세요"
charset : EUC-KR
version: 1.0.5

Expected

detected charset: EUC-KR
expected confidence: greater than 0.5

Actual

detected charset: NULL
returned confidence: 0
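
For anyone reproducing this report, the input bytes can be built from the given string with Python's standard `euc_kr` codec. This sketch only constructs and inspects the bytes. It does not call libchardet, and the variable names are illustrative.

```python
text = "안녕하세요"

# Encode the reported string as EUC-KR, the charset named in the report.
sample = text.encode("euc_kr")

# Each syllable becomes two bytes, so the detector sees very little input.
print(len(sample), sample.hex(" "))

# Check that every byte is in the high range (no ASCII bytes at all).
all_high = all(b >= 0xA1 for b in sample)
print(all_high)
```

The sample is only ten bytes long. Whether that short length is the reason for the NULL result is not established in this report.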

Reported by @jayvdb

A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This is really important because lots of tools re-open the file using the detected encoding, and 'utf-8-sig' will strip the BOM but 'utf-8' will not, and the BOM will cause breakages.

I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
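
The difference between the two codec names can be shown with the Python standard library alone. In this sketch, `python_codec_name` is a hypothetical caller-side helper and `detected` stands in for whatever name a detector reported; the BOM check is done by the caller here, not by libchardet.

```python
import codecs


def python_codec_name(detected, data):
    """Pick the Python codec to reopen `data` with.

    Appends the "-sig" variant only when the detector said UTF-8 and
    the data really starts with the UTF-8 BOM.
    """
    if detected.lower() == "utf-8" and data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return detected


data = codecs.BOM_UTF8 + "key=value\n".encode("utf-8")

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
kept = data.decode("utf-8")

# 'utf-8-sig' strips it.
stripped = data.decode(python_codec_name("utf-8", data))

print(repr(kept))
print(repr(stripped))
```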

Other differences when compared with other libraries:

utf16* are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.

Curly quotes in ascii are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
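
A quick check of why either answer decodes correctly, assuming the curly quotes in question are the single bytes 0x91 to 0x94 (the report does not list the exact bytes):

```python
# ASCII text with curly double and single quotes as single bytes.
curly = b"\x93quoted\x94 and \x91single\x92"

as_1250 = curly.decode("cp1250")
as_1252 = curly.decode("cp1252")

# Both code pages map these four bytes to the same characters, so the
# text alone cannot tell the two apart.
same = as_1250 == as_1252

print(as_1250)
print(same)
```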

UTF-7 problems like those reported at https://github.com/PyYoshi/uchardet/issues/4, and the other BOMs reported there, also occur in this library.

Single UTF-8 character detected as Windows-1258

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1848 we ran into a problem where a file containing a single UTF-8 character is detected by UCHARDET as Windows-1258 with a reliability level of 72%. 😕

Here it is the French "é" character (Précis:)!

In the following sample, it is the character "¶" that is wrongly detected and displayed as "ΒΆ".

I would like to add to this issue a well-known text, build_np3portableapp.cmd, encoded in UTF-8
with ONLY ONE non-ASCII character, "delims=¶", on line 33 of this shortened batch file.

- This text is opened incorrectly as "ISO-8859-7 (Greek)" by Notepad3: "delims=ΒΆ"
- This text is opened correctly as "UTF-8" by Notepad3 if I add an encoding tag ":: encoding: UTF-8"
- This text is opened correctly as "UTF-8" by Notepad++, Editpad Lite 7, Editplus, Notepad2,
  Notepad2e, Notepad2-mod, Notepad2-zfuliu and VS Code!
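
The behaviour in the list above can be illustrated with a few lines of Python. The `decodes_as_utf8` helper is a hypothetical caller-side pre-check, shown only as one possible workaround; it is not how uchardet or Notepad3 decide, and it is an assumption that a UTF-8-first rule would suit every file.

```python
def decodes_as_utf8(data):
    """Return True if `data` is valid UTF-8 with at least one non-ASCII byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


# The only non-ASCII fragment of the batch file, as UTF-8 bytes.
sample = "delims=¶".encode("utf-8")

# The same two bytes are also valid ISO-8859-7, which gives the
# incorrect rendering reported above.
as_greek = sample.decode("iso8859_7")

print(as_greek)
print(decodes_as_utf8(sample))

# A lone high byte, as a real single-byte Greek file would contain,
# is not valid UTF-8, so the pre-check rejects it.
print(decodes_as_utf8(b"delims=\xb6"))
```

With only two non-ASCII bytes in the whole file, both readings are byte-for-byte valid, which leaves a statistical detector very little to go on.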

Attached are the 2 samples: Error Detection Single UTF-8 (issue #1848).zip