emax093 / ude

Automatically exported from code.google.com/p/ude

License: Other

Shell 4.30% C# 95.17% C 0.52%

ude's Introduction

DESCRIPTION
===========

    Ude is a C# port of Mozilla Universal Charset Detector.

    The original source code is available at: 

    http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/src/    

    The article "A composite approach to language/encoding detection" describes
    the algorithms of Universal Charset Detector and is available at: 

        http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
        http://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/doc/UniversalCharsetDetection.doc

    Some data structures used in this port have been adapted from the Java port 
    "juniversalchardet", available at:
     
        http://code.google.com/p/juniversalchardet/

    Another port I know of is "chardet" (in Python) available at: 
        
        http://chardet.feedparser.org/

USAGE
=====

    Import the library:

        using Ude;

    and feed a stream or a byte array to the detector. Call DataEnd to signal the end of 
    the input before reading the result:
         
        ICharsetDetector cdet = new CharsetDetector();
        byte[] buff = new byte[1024];
        int read;
        while ((read = stream.Read(buff, 0, buff.Length)) > 0 && !cdet.IsDone()) {
            cdet.Feed(buff, 0, read);
        }
        cdet.DataEnd();
        Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);


    Alternatively, you can feed a Stream to the detector:

        using (FileStream fs = File.OpenRead(filename)) {
            ICharsetDetector cdet = new CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
        }    

    Or you can provide an alternative implementation of the interface - Ude.ICharsetDetector - 
    that wraps the original nsUniversalDetector API. 
 

LICENSE
=======

    This library is subject to the Mozilla Public License Version 1.1 (the
    "License").  Alternatively, the contents of this file may be used under the
    terms of either the GNU General Public License Version 2 or later (the "GPL"),
    or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").

ude's Issues

pureascii detection issue

What steps will reproduce the problem?
1. Create a text file containing just the character "3".
2. Save it and run detection.
3. Notice that detection fails.

What is the expected output? What do you see instead?
I expected the file to be reported as ASCII. (This happens with any file that has the 
digit 3 in it.)

What version of the product are you using? On what operating system?
Latest version, on Windows XP.

Please provide any additional information below.

I noticed that the code checking for escape characters (EscAscii) looks for 0x33 
instead of 0x1B. 0x33 is the digit 3, not the escape character.
I'm not sure whether the same issue exists anywhere else in the code.
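
For reference, the ASCII escape character is 0x1B (decimal 27, octal 033), while 0x33 is the digit "3"; a hex constant written from the octal form may be how the wrong value crept in, though the report does not confirm this. A minimal sketch of the intended check follows. The class and method names are hypothetical and are not taken from the Ude source.

```csharp
public static class EscapeCheck  // hypothetical helper, not part of Ude
{
    // ASCII ESC: 0x1B in hex, 033 in octal, 27 in decimal.
    private const byte Esc = 0x1B;

    // Returns true if the given slice of the buffer contains an ESC byte,
    // which is what introduces ISO-2022 style escape sequences.
    public static bool ContainsEscape(byte[] buf, int offset, int len)
    {
        for (int i = offset; i < offset + len; i++) {
            if (buf[i] == Esc) {
                return true;
            }
        }
        return false;
    }
}
```

With this check, a file holding only the digit "3" (byte 0x33) is not mistaken for escape-sequence input.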

Original issue reported on code.google.com by rbhatt%[email protected] on 2 Dec 2009 at 5:29

Cannot Find .sln for windows usage

What steps will reproduce the problem?
1. Download the tarball
2. Extract
3. Look for .sln

What is the expected output? What do you see instead?
It should be there somewhere... It's not.

What version of the product are you using? On what operating system?
0.1, Windows XP

Please provide any additional information below.
Is there a workaround? Should I just build my own solution from the 
source?

Original issue reported on code.google.com by [email protected] on 13 Jul 2009 at 11:53

Detection fails on particular, simple ANSI file

What steps will reproduce the problem?
1. Save an ANSI file containing the text "CONFIG: main 30000000"
2. Run the library and/or exe on it

What is the expected output? What do you see instead?

I expect ANSI to be detected.

What version of the product are you using? On what operating system?

The library shows null for charset, and the exe shows "detection failed".

Please provide any additional information below.

I don't know if this is how the library is intended to work, but I think it 
would be more useful to detect ANSI if all the characters fit into ANSI. Or at 
least support this behavior optionally.
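
Until the library offers such an option, the fallback can be applied on the caller's side. The sketch below assumes, as this report describes, that the detector's Charset is null when detection fails; the class, the method names and the default charset parameter are hypothetical and not part of Ude.

```csharp
public static class CharsetFallback  // hypothetical helper, not part of Ude
{
    // True if every byte is in the 7-bit ASCII range.
    public static bool IsPureAscii(byte[] data)
    {
        foreach (byte b in data) {
            if (b >= 0x80) {
                return false;
            }
        }
        return true;
    }

    // Keeps the detector's answer when there is one; otherwise reports
    // "ASCII" for 7-bit input and the caller's default for anything else.
    public static string Resolve(string detected, byte[] data, string defaultCharset)
    {
        if (detected != null) {
            return detected;
        }
        return IsPureAscii(data) ? "ASCII" : defaultCharset;
    }
}
```

A caller would pass cdet.Charset as the first argument, and something like "windows-1252" as the default if that is what "ANSI" means on the target system.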

Original issue reported on code.google.com by [email protected] on 14 Sep 2014 at 4:59

Returns UTF-8 for Cyrillic text

What steps will reproduce the problem?
1. Define Cyrillic text: "Это пример кириллического текста".
2. Feed the CharsetDetector a stream over this text.
3. The resulting charset is "UTF-8" with confidence 1.0.

What is the expected output? 
Charset is koi-8

What do you see instead?
UTF-8

What version of the product are you using? 
Ude, C# port 

On what operating system?
Windows 7/8, x64
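
The report does not say how the stream was created. One possible explanation, offered only as an assumption: a .NET string has no byte encoding of its own, so the bytes in the stream depend on the encoding used to produce them. The sketch below shows a hypothetical helper (not part of Ude) that makes that choice explicit.

```csharp
using System.IO;
using System.Text;

public static class CyrillicStreams  // hypothetical helper, not part of Ude
{
    // Encodes the text with the given encoding and exposes the bytes as a stream.
    public static MemoryStream ToStream(string text, Encoding encoding)
    {
        return new MemoryStream(encoding.GetBytes(text));
    }
}
```

A stream built with Encoding.UTF8 holds two bytes per Cyrillic letter, and "UTF-8" is then the correct answer. To exercise KOI8-R detection, the stream would have to be built with a KOI8-R encoding instead, for example one obtained from Encoding.GetEncoding("koi8-r"), which on newer .NET versions may first require registering the code pages encoding provider.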

Original issue reported on code.google.com by [email protected] on 8 Dec 2012 at 8:19

BUG in SBCSGroupProber class in function Reset

http://ude.googlecode.com/svn/trunk/src/Library/Ude.Core/SBCSGroupProber.cs

existing code:

public override void Reset ()
{
    int activeNum = 0;
...

SHOULD be:

public override void Reset ()
{
    activeNum = 0;
...

In many cases this bug makes detection of the right charset fail: the class 
member activeNum always stays 0, because Reset assigns to a local variable of 
the same name instead. See this piece of code:
} else if (st == ProbingState.NotMe) {
   isActive[i] = false;
   activeNum--;
   if (activeNum <= 0) {
      state = ProbingState.NotMe;
      break;
   }
}

I fixed it locally, but I would like to spare other developers the time of 
debugging the same issue.

The attached file reproduces the bug (charset is KOI8-R).
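
The effect of the shadowing can be reproduced in isolation. The class below is a hypothetical stand-in for SBCSGroupProber, reduced to the counter logic; the number of probers and the member names other than activeNum and isActive are assumptions made for illustration.

```csharp
public class GroupProberSketch  // hypothetical stand-in, not the Ude class
{
    private readonly bool[] isActive = new bool[3];
    private int activeNum;  // the class member the report refers to

    public int ActiveNum { get { return activeNum; } }

    // Buggy variant: "int activeNum" declares a local that hides the field,
    // so the increments below never reach the class member.
    public void ResetBuggy()
    {
        int activeNum = 0;
        for (int i = 0; i < isActive.Length; i++) {
            isActive[i] = true;
            activeNum++;
        }
    }

    // Fixed variant: assigns to the field itself.
    public void ResetFixed()
    {
        activeNum = 0;
        for (int i = 0; i < isActive.Length; i++) {
            isActive[i] = true;
            activeNum++;
        }
    }
}
```

After ResetBuggy the member is still 0, so the first prober that reports NotMe drives it to -1 and the whole group is declared NotMe; after ResetFixed it equals the number of active probers.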

Original issue reported on code.google.com by [email protected] on 25 May 2012 at 6:40

Attachments:

eo.csv

Patch for /trunk/src/Library/Ude.Core/SBCSGroupProber.cs

Big problem!
It does not work at line 103.

Original issue reported on code.google.com by [email protected] on 15 Apr 2015 at 8:52

Attachments:

SBCSGroupProber.cs.patch

EUCTW: System.IndexOutOfRangeException

The problem is in the CharDistributionAnalyser.HandleOneChar call during EUCTW detection.

The charToFreqOrder array has 5376 entries, but tableSize is defined as 8102, so
this check is wrong:
if (order < tableSize) <--
 { // order is valid
   if (512 > charToFreqOrder[order])
     freqChars++;
 }

I had a look at the Java code, where this part has been changed to

if (order < charToFreqOrder.Length)
{ // order is valid
  if (512 > charToFreqOrder[order])
    freqChars++;
}

With this change tableSize is no longer needed, and this check can no longer
throw an exception.

Original issue reported on code.google.com by [email protected] on 16 Nov 2009 at 3:12

UTF-16 without BOM not detected correctly

What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Remove the BOM from the file with a hex editor. Yes, this deliberately modifies 
the file to cause a problem, but I have encountered many UTF-16 encoded files 
without a BOM that other applications provided to me, and lacking a BOM does 
not invalidate the file.
3. Test Ude.Example by passing path to this BOM-less UTF-16LE file
4. When UniversalDetector is called, the first check looks for a BOM.
5. Without a BOM, the evaluation passes to the deeper analysis, which returns 
a result of encoding = ANSI 1252, which is wrong.

What is the expected output? 

Expected output is encoding = "UTF-16"

What do you see instead?

"Charset: ASCII, confidence: 1"


What version of the product are you using? On what operating system?

Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit

Please provide any additional information below.

Larger files (1000 KB+) lacking the BOM tend to show a result of 
"Charset: windows-1252, confidence: 0.5"
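
One caller-side workaround is a simple zero-byte heuristic, run before or after the detector. For text made mostly of characters below U+0100, UTF-16LE puts a zero in almost every odd-indexed byte and UTF-16BE in almost every even-indexed byte; every byte of such text is also below 0x80, which would explain an "ASCII" result. The sketch below is not part of Ude: the class name, method name and the 90%/10% thresholds are assumptions, and the heuristic does not work for text dominated by non-Latin scripts.

```csharp
public static class Utf16Heuristic  // hypothetical helper, not part of Ude
{
    // Guesses "UTF-16LE" or "UTF-16BE" from the position of zero bytes,
    // or returns null when the data does not look like BOM-less UTF-16.
    public static string Guess(byte[] data, int len)
    {
        int pairs = len / 2;
        if (pairs < 2) {
            return null;  // too little data to say anything
        }

        int evenZeros = 0;  // zeros at index 0, 2, 4, ...
        int oddZeros = 0;   // zeros at index 1, 3, 5, ...
        for (int i = 0; i + 1 < len; i += 2) {
            if (data[i] == 0) evenZeros++;
            if (data[i + 1] == 0) oddZeros++;
        }

        // Arbitrary thresholds: at least 90% zeros on one side and
        // at most 10% on the other.
        if (oddZeros * 10 >= pairs * 9 && evenZeros * 10 <= pairs) {
            return "UTF-16LE";
        }
        if (evenZeros * 10 >= pairs * 9 && oddZeros * 10 <= pairs) {
            return "UTF-16BE";
        }
        return null;
    }
}
```

A null result means the heuristic has no opinion, and the detector's own answer should be used.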

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 10:52
