joungkyun / libchardet
libchardet - Mozilla's Universal Charset Detector C/C++ API
License: Other
Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.
In issue #1831 we are faced with poor detection of Japanese "UTF-8" text, which UCHARDET reports as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕
These text editors detect it as UTF-8 and display it correctly.
Here is the incorrect detection as "TIS-620":
{
"manifest_version": 2,
"name": "k view",
"version": "0.5",
"description": "ใ��ใ�นใ��ใ€�",
"browser_action": {
"default_icon": { "19": "round-done-button.png" }
},
}
Here is the correct detection as "UTF-8":
{
"manifest_version": 2,
"name": "k view",
"version": "0.5",
"description": "テスト。",
"browser_action": {
"default_icon": { "19": "round-done-button.png" }
},
}
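The two renderings above can be reproduced with a few lines of Python. This is only a sketch of how the garbled text arises, not of how UCHARDET scores the input; it assumes Python's `cp874` codec as a stand-in for Windows-874, and the variable names are illustrative.

```python
# Reproduce the mis-decoding: UTF-8 bytes read as a Thai single-byte code page.
original = "テスト。"
raw = original.encode("utf-8")

# cp874 is Python's codec name for Windows-874. Bytes that the code page
# leaves undefined are replaced with U+FFFD instead of raising an error.
mojibake = raw.decode("cp874", errors="replace")

# Strict UTF-8 decoding recovers the text unchanged.
roundtrip = raw.decode("utf-8")

print(mojibake)
print(roundtrip)
```

Every Japanese character here starts with the byte 0xE3, which the Thai code page maps to a valid Thai letter, so a single-byte Thai reading of the input is plausible to a statistical detector.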
In attachment the original sample: Error Detection encoding_utf-8 (issue #1831).zip
Thanks in advance for your attention.
Have a nice day.
hpwamr
Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.
Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").
Your comments and suggestions are always welcome... 😃
Revision 55e802a has these warnings:
libtoolize: Consider adding 'AC_CONFIG_MACRO_DIRS([m4])' to configure.ac,
libtoolize: and rerunning libtoolize and aclocal.
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
autoheader-2.69: WARNING: Using auxiliary files such as `acconfig.h', `config.h.bot'
autoheader-2.69: WARNING: and `config.h.top', to define templates for `config.h.in'
autoheader-2.69: WARNING: is deprecated and discouraged.
autoheader-2.69:
autoheader-2.69: WARNING: Using the third argument of `AC_DEFINE' and
autoheader-2.69: WARNING: `AC_DEFINE_UNQUOTED' allows one to define a template without
autoheader-2.69: WARNING: `acconfig.h':
autoheader-2.69:
autoheader-2.69: WARNING: AC_DEFINE([NEED_FUNC_MAIN], 1,
autoheader-2.69: [Define if a function `main' is needed.])
autoheader-2.69:
autoheader-2.69: WARNING: More sophisticated templates can also be produced, see the
autoheader-2.69: WARNING: documentation.
configure.ac:34: installing 'tools/compile'
configure.ac:9: installing 'tools/missing'
EUC-TW : 湾普
EUC-KR : 똠방각하, 뷁
Expected result: EUC-TW or EUC-KR
Actual result: none
Add API option to get all the encodings confidence #96
make code more straightforward
by treating the self.done = True as a real finish point of the analysis
use detect_all instead of detect(.., all=True)
fix corner case of when there is no good prober
This feature is supported from python chardet 4.x.
support "make check" with automake
given string : "안녕하세요"
charset : EUC-KR
version: 1.0.5
detected charset: EUC-KR
expected confidence: greater than 0.5
detected charset: NULL
returned confidence: 0
A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This is really important because lots of tools re-open the file using the detected encoding, and 'utf-8-sig' will strip the bom but 'utf-8' will not, and the BOM will cause breakages.
I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
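The difference between the two codec names is easy to see in Python. The sketch below uses only the standard library; `python_codec_name` and its `has_bom` parameter are hypothetical, standing in for whatever flag a detector might expose, and are not part of libchardet or python-chardet.

```python
import codecs

# A UTF-8 file that starts with a BOM (EF BB BF).
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

as_utf8 = data.decode("utf-8")          # the BOM survives as U+FEFF
as_utf8_sig = data.decode("utf-8-sig")  # the BOM is stripped


def python_codec_name(encoding: str, has_bom: bool) -> str:
    """Hypothetical helper: combine a detected name with a BOM flag."""
    if has_bom and encoding.lower() == "utf-8":
        return "utf-8-sig"
    return encoding.lower()


print(repr(as_utf8))
print(repr(as_utf8_sig))
print(python_codec_name("UTF-8", has_bom=True))
```

With a flag like this in the detection result, a caller could choose the BOM-stripping codec without re-inspecting the file.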
Other differences when compared with other libraries:
utf16* are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.
Curly quotes in ascii are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
UTF-7 problems like reported at https://github.com/PyYoshi/uchardet/issues/4 , and the other BOMs reported there, also occur in this library.
Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.
In issue #1848 we are faced with a problem where a single "UTF-8" character is detected as Windows-1258 with a reliability level of 72% by UCHARDET. 😕
Here is the French "é" character (Précis:)!
In the following sample, the character "¶" is incorrectly detected as "ΒΆ".
I would like to add to this issue a well-known text, build_np3portableapp.cmd, encoded in UTF-8
with ONLY ONE non-ASCII character, "delims=¶", on line 33 of this shortened batch file.
- This text is opened incorrectly as "ISO-8859-7 (Greek)" with Notepad3: "delims=ΒΆ"
- This text is opened correctly as "UTF-8" with Notepad3 if I add an encoding tag ":: encoding: UTF-8"
- This text is opened correctly as "UTF-8" with Notepad++, Editpad Lite 7, Editplus, Notepad2,
Notepad2e, Notepad2-mod, Notepad2-zufuliu and VS Code.
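One way an application could guard against this case is to try strict UTF-8 decoding before consulting the statistical detector. The sketch below is an assumption about what such a pre-check might look like, not a description of what UCHARDET or the other editors do; `looks_like_utf8` is an illustrative name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 with at least one non-ASCII byte."""
    if data.isascii():
        return False  # pure ASCII gives no evidence either way
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The line from the batch file, with its single non-ASCII character.
sample = "delims=¶".encode("utf-8")

# The same bytes read as ISO-8859-7 give the two Greek letters from the report.
greek = sample.decode("iso8859_7")

print(looks_like_utf8(sample))
print(greek)
```

A lone byte above 0x7F, as a single-byte encoding would produce for one accented character, is not valid UTF-8, so the check rejects it. Very short inputs can still be valid UTF-8 by coincidence, which is why this is only a heuristic.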
In attachment the 2 samples: Error Detection Single UTF-8 (issue #1848).zip
Thanks in advance for your attention.
Have a nice day.
hpwamr
move language model files to models sub directory
AM_PROG_AR needs automake 1.12.0 or later.
It seems that chardet.h has no include guard. Are you going to add one?
support Arabic, Danish, Esperanto, German, Spanish, Turkish, Vietnamese
text: "an escape character: ^["
_Expected result:_ ascii
_Actual result:_ none
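The expectation is reasonable because ESC (0x1B) lies inside the 7-bit ASCII range. A minimal check, using only the standard library:

```python
# "^[" is caret notation for the ESC control character, byte 0x1B.
text = b"an escape character: \x1b"

# Every byte is below 0x80, so the input is valid ASCII.
is_ascii = all(byte < 0x80 for byte in text)
decoded = text.decode("ascii")

print(is_ascii)
print(repr(decoded))
```

Mozilla's detector treats an ESC byte as a hint that the input may use an escape-sequence encoding such as ISO-2022-JP, which may be why no result is returned when no such encoding matches.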
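Since python chardet 3.x, it supports reporting the name of the detected language.
from urllib.request import urlopen
from chardet.universaldetector import UniversalDetector

# `url` must be set to the address of the page to inspect
usock = urlopen(url)
detector = UniversalDetector()
detector.reset()
for line in usock.readlines():
    detector.feed(line)
    # stop feeding as soon as the detector is confident
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)
The results are as follows: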
{'encoding': 'MacCyrillic', 'confidence': 0.33487987945663855, 'language': 'Russian'}
_Expected result:_ ISO-8859-15
_Actual result:_ ISO-8859-1
merge uchardet's improvements.
I had to apply this patch to build on FreeBSD:
--- configure.ac.orig 2016-08-19 17:27:35 UTC
+++ configure.ac
@@ -7,7 +7,7 @@
AC_PREREQ(2.59)
AC_INIT([libchardet], [1.0.5], [http://oops.org])
AC_CONFIG_AUX_DIR([tools])
-AM_INIT_AUTOMAKE([-Wall -Werror -Wno-override foreign no-dependencies])
+AM_INIT_AUTOMAKE([-Wall -Werror -Wno-override foreign no-dependencies subdir-objects])
AM_MAINTAINER_MODE
AC_CONFIG_SRCDIR([src/nsUniversalDetector.h])
can't detect utf-16 and utf-32
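When a file starts with a byte order mark, UTF-16 and UTF-32 can be recognised without any statistical analysis. The following is a sketch of such a BOM check using the constants from Python's `codecs` module; `sniff_bom` is an illustrative name, and this is not how libchardet is implemented.

```python
import codecs
from typing import Optional

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(sniff_bom(codecs.BOM_UTF32_BE + "hi".encode("utf-32-be")))
print(sniff_bom(b"no bom here"))
```

One ambiguity remains: UTF-16 LE text whose first character after the BOM is U+0000 has the same leading bytes as the UTF-32 LE BOM. Files without a BOM still need content-based detection.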