
libchardet's People

Contributors

gaoxiang-ut, joungkyun


libchardet's Issues

Japanese UTF-8 encoding detected as TIS-620 (Windows-874 (Thai))

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1831 we ran into poor detection of Japanese UTF-8 text: UCHARDET reports it as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕

These text editors detect it as UTF-8 and display it correctly:

  • Notepad++, Editpad Lite 7, Editplus, Notepad2, Notepad2e, Notepad2-mod,
    Notepad2-zfuliu and VS Code.

Here is the text as displayed after the bad detection as "TIS-620":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "ใ��ใ�นใ��ใ€�",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Here is the text as displayed after the correct detection as "UTF-8":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "テスト。",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Attached is the original sample: Error Detection encoding_utf-8 (issue #1831).zip
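
The mis-detection can be reproduced without the attachment. The following is a minimal sketch, assuming Python's standard `cp874` codec as a stand-in for TIS-620 / Windows-874; `looks_like_utf8` is a hypothetical helper and not part of uchardet or libchardet. It shows that the sample's bytes are valid UTF-8, and that reading the same bytes as Thai produces the garbage seen in the first listing.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical pre-check)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The Japanese value from the "description" field, encoded as UTF-8.
sample = "テスト。".encode("utf-8")

# The same bytes read as Windows-874; bytes with no mapping become U+FFFD.
misread = sample.decode("cp874", errors="replace")

print(looks_like_utf8(sample))
print(misread)
```

Running a strict UTF-8 validity check before any statistical guess is one possible workaround on the application side. Note that pure ASCII input also passes this check, so it only narrows the candidates.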

Thanks in advance for your attention.
Have a nice day.
hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

autoheader and libtoolize warnings

Revision 55e802a has these warnings:

libtoolize: Consider adding 'AC_CONFIG_MACRO_DIRS([m4])' to configure.ac,
libtoolize: and rerunning libtoolize and aclocal.
libtoolize: 'AC_PROG_RANLIB' is rendered obsolete by 'LT_INIT'
autoheader-2.69: WARNING: Using auxiliary files such as `acconfig.h', `config.h.bot'
autoheader-2.69: WARNING: and `config.h.top', to define templates for `config.h.in'
autoheader-2.69: WARNING: is deprecated and discouraged.
autoheader-2.69: 
autoheader-2.69: WARNING: Using the third argument of `AC_DEFINE' and
autoheader-2.69: WARNING: `AC_DEFINE_UNQUOTED' allows one to define a template without
autoheader-2.69: WARNING: `acconfig.h':
autoheader-2.69: 
autoheader-2.69: WARNING:   AC_DEFINE([NEED_FUNC_MAIN], 1,
autoheader-2.69:                [Define if a function `main' is needed.])
autoheader-2.69: 
autoheader-2.69: WARNING: More sophisticated templates can also be produced, see the
autoheader-2.69: WARNING: documentation.
configure.ac:34: installing 'tools/compile'
configure.ac:9: installing 'tools/missing'

Add API option to get all the encodings confidence

  • Add API option to get all the encodings confidence #96

  • make code more straightforward

    by treating the self.done = True as a real finish point of the analysis

  • use detect_all instead of detect(.., all=True)

  • fix corner case of when there is no good prober

This feature is supported from python chardet 4.x.
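
For comparison, this is roughly how the feature looks from the Python side. A minimal sketch, assuming python chardet 4.x or later is installed; `all_candidates` is a hypothetical wrapper that returns an empty list when the package or the `detect_all` function is unavailable.

```python
try:
    import chardet
except ImportError:  # chardet is a third-party package and may be missing
    chardet = None


def all_candidates(data: bytes) -> list:
    """Return every candidate reported by chardet.detect_all, or [] if unavailable."""
    if chardet is None or not hasattr(chardet, "detect_all"):
        return []
    return chardet.detect_all(data)


# Each candidate is a dict with 'encoding', 'confidence' and 'language' keys.
for candidate in all_candidates("안녕하세요".encode("euc_kr")):
    print(candidate["encoding"], candidate["confidence"])
```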

can't detect short euc-kr

Environment

given string : "안녕하세요"
charset : EUC-KR
version: 1.0.5

Expected

detected charset: EUC-KR
expected confidence: bigger than 0.5

Actual

detected charset: NULL
returned confidence: 0
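
For reference, the reported input is very small. The sketch below uses Python's standard `euc_kr` codec (not libchardet itself) to show the bytes a detector would receive for this string; that the short length is the cause of the failure is an assumption, not something the report confirms.

```python
text = "안녕하세요"
data = text.encode("euc_kr")

# Each Hangul syllable takes two bytes in EUC-KR, so the whole input is tiny.
print(len(data))
print(data.hex(" "))
```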

Report utf-8-sig

Joungkyun/python-chardet#3

Reported by @jayvdb

A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This is really important because lots of tools re-open the file using the detected encoding: 'utf-8-sig' will strip the BOM but 'utf-8' will not, and the leftover BOM will cause breakages.

I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
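
The difference between the two codecs can be sketched as follows. This uses only Python's standard library; `with_sig` is a hypothetical post-processing helper that illustrates the suggestion above and is not an existing python-chardet or libchardet function.

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# 'utf-8' keeps the BOM as a leading U+FEFF; 'utf-8-sig' strips it.
as_utf8 = raw.decode("utf-8")
as_sig = raw.decode("utf-8-sig")


def with_sig(encoding: str, data: bytes) -> str:
    """Append -sig to a detected 'utf-8' when the data starts with a UTF-8 BOM."""
    if encoding.lower() == "utf-8" and data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return encoding


print(repr(as_utf8), repr(as_sig), with_sig("utf-8", raw))
```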

Other differences when compared with other libraries:

utf16* are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.

Curly quotes in ascii are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
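
The point about the arbitrary choice can be checked with Python's standard codecs. A minimal sketch: bytes 0x93 and 0x94 map to the same curly quotes in both code pages, so either answer decodes this input identically.

```python
data = b"\x93quoted\x94"

as_1250 = data.decode("cp1250")
as_1252 = data.decode("cp1252")

# Both code pages place the curly double quotes at 0x93 and 0x94.
print(as_1250 == as_1252)
```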

UTF-7 problems like reported at https://github.com/PyYoshi/uchardet/issues/4 , and the other BOMs reported there, also occur in this library.

Single UTF-8 character detected as Windows-1258

Hello,
For the development of Notepad3, we use the UCHARDET Charset Detector.

In issue #1848 we ran into a problem where a text with a single UTF-8 character is detected by UCHARDET as Windows-1258 with a reliability level of 72%. 😕

Here it is the French "é" character (in "Précis:")!


In the following sample, it is the character "¶" that is wrongly detected and displayed as "ΒΆ".

I would like to add to this issue a well-known text, build_np3portableapp.cmd, encoded in UTF-8
with ONLY ONE non-ASCII character, "delims=¶", on line 33 of this shortened batch file.

- This text is opened incorrectly as "ISO-8859-7 (Greek)" with Notepad3: "delims=ΒΆ"
- This text is opened correctly as "UTF-8" with Notepad3 if I add an encoding tag ":: encoding: UTF-8"
- This text is opened correctly as "UTF-8" with Notepad++, Editpad Lite 7, Editplus, Notepad2,
  Notepad2e, Notepad2-mod, Notepad2-zfuliu and VS Code.

Attached are the 2 samples: Error Detection Single UTF-8 (issue #1848).zip
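
The "ΒΆ" result follows directly from the bytes. A minimal sketch using Python's standard `iso8859_7` codec (an illustration only, not uchardet's logic): the single character "¶" is two bytes in UTF-8, and each of those bytes is a valid Greek letter in ISO-8859-7.

```python
pilcrow = "¶".encode("utf-8")

# Two bytes in UTF-8, each of which is a printable letter in ISO-8859-7.
print(pilcrow.hex(" "))
print(pilcrow.decode("iso8859_7"))

# Strict UTF-8 decoding still recovers the original character.
print(pilcrow.decode("utf-8"))
```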

Thanks in advance for your attention.
Have a nice day.
hpwamr


support language name detect

Since python chardet 3.x, the result also includes the name of the detected language.

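from urllib.request import urlopen
from chardet.universaldetector import UniversalDetector

url = "http://example.com/"  # placeholder; replace with the page to analyze

usock = urlopen(url)
detector = UniversalDetector()
detector.reset()
# Feed the response line by line until the detector is confident.
for line in usock.readlines():
    detector.feed(line)
    if detector.done:
        break
detector.close()
usock.close()
print(detector.result)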

The result is as follows:

{'encoding': 'MacCyrillic', 'confidence': 0.33487987945663855, 'language': 'Russian'}

configure.ac needs subdir-objects

I had to add this patch to build on FreeBSD:

--- configure.ac.orig   2016-08-19 17:27:35 UTC
+++ configure.ac
@@ -7,7 +7,7 @@
 AC_PREREQ(2.59)
 AC_INIT([libchardet], [1.0.5], [http://oops.org])
 AC_CONFIG_AUX_DIR([tools])
-AM_INIT_AUTOMAKE([-Wall -Werror -Wno-override foreign no-dependencies])
+AM_INIT_AUTOMAKE([-Wall -Werror -Wno-override foreign no-dependencies subdir-objects])
 AM_MAINTAINER_MODE

 AC_CONFIG_SRCDIR([src/nsUniversalDetector.h])
