emax093 / ude
Automatically exported from code.google.com/p/ude
License: Other
DESCRIPTION
===========

Ude is a C# port of Mozilla Universal Charset Detector.

The original source code is available at:
http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/src/

The article "A composite approach to language/encoding detection" describes
the algorithms of Universal Charset Detector and is available at:
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
http://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/doc/UniversalCharsetDetection.doc

Some data structures used in this port have been adapted from the Java port
"juniversalchardet", available at: http://code.google.com/p/juniversalchardet/

Another port I know of is "chardet" (in Python), available at:
http://chardet.feedparser.org/

USAGE
=======

Import the library:

    using Ude;

and feed a stream or a byte array to the detector. Call DataEnd to notify the
detector that you want the result back:

    // "stream" is any readable System.IO.Stream opened by the caller.
    ICharsetDetector cdet = new CharsetDetector();
    byte[] buff = new byte[1024];
    int read;
    // Stop early once the detector has already reached a decision.
    while ((read = stream.Read(buff, 0, buff.Length)) > 0 && !cdet.IsDone())
    {
        cdet.Feed(buff, 0, read);
    }
    cdet.DataEnd();
    Console.WriteLine("Charset: {0}, confidence: {1}",
        cdet.Charset, cdet.Confidence);

Alternatively, you can feed a Stream to the detector:

    using (FileStream fs = File.OpenRead(filename))
    {
        ICharsetDetector cdet = new CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        Console.WriteLine("Charset: {0}, confidence: {1}",
            cdet.Charset, cdet.Confidence);
    }

Or you can provide an alternative implementation of the interface
Ude.ICharsetDetector that wraps the original nsUniversalDetector API.

LICENSE
=======

This library is subject to the Mozilla Public License Version 1.1 (the
"License"). Alternatively, the contents of this file may be used under the
terms of either the GNU General Public License Version 2 or later (the "GPL"),
or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").
What steps will reproduce the problem?
1. create a text file with just the character "3"
2. save it and run detection.
3. notice that it gives detection failed
What is the expected output? What do you see instead?
Expected it to report the file as ASCII. (This happens with any file that has
the number 3 in it.)
What version of the product are you using? On what operating system?
last updated version on windows xp
Please provide any additional information below.
I noticed that the code that looks for EscAscii characters checks for 0x33
instead of 0x1b. 0x33 is the digit 3, not an escape character.
I am not sure whether the same issue exists anywhere else in the code.
Original issue reported on code.google.com by rbhatt%[email protected]
on 2 Dec 2009 at 5:29
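The byte values mentioned in this report can be checked with a few lines of C#. This is only an illustrative sketch and not Ude's actual code; the class `EscapeCheck` and the helper `ContainsEscape` are hypothetical names introduced here.

```csharp
using System;

static class EscapeCheck
{
    // ESC is byte 0x1B (decimal 27). 0x33 (decimal 51) is the digit '3'.
    const byte Esc = 0x1B;

    // Hypothetical helper: true if the buffer contains at least one ESC byte.
    public static bool ContainsEscape(byte[] buf)
    {
        foreach (byte b in buf)
        {
            if (b == Esc) return true;
        }
        return false;
    }

    static void Main()
    {
        byte[] justThree = { (byte)'3' };
        byte[] withEscape = { 0x1B, (byte)'$', (byte)'B' };

        Console.WriteLine((byte)'3' == 0x33);          // True
        Console.WriteLine(ContainsEscape(justThree));  // False
        Console.WriteLine(ContainsEscape(withEscape)); // True
    }
}
```

A check written against 0x33 would treat every file containing the digit 3 as if it held an escape sequence, which matches the behaviour described above.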
What steps will reproduce the problem?
1. DL the tarball
2. Extract
3. Look for .sln
What is the expected output? What do you see instead?
Should be there somewhere... It's not.
What version of the product are you using? On what operating system?
0.1 windows xp
Please provide any additional information below.
Is there a workaround? Should I just build my own solution from the
source?
Original issue reported on code.google.com by [email protected]
on 13 Jul 2009 at 11:53
What steps will reproduce the problem?
1. Save an ANSI file containing the text "CONFIG: main 30000000"
2. Run the library and/or exe on it
What is the expected output? What do you see instead?
I expect ANSI detected.
What version of the product are you using? On what operating system?
The library shows null for charset, and the exe shows "detection failed".
Please provide any additional information below.
I don't know if this is how the library is intended to work, but I think it
would be more useful to detect ANSI if all the characters fit into ANSI. Or at
least support this behavior optionally.
Original issue reported on code.google.com by [email protected]
on 14 Sep 2014 at 4:59
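The optional behaviour requested in this report could be approximated by a caller-side fallback. The sketch below is an assumption about how such a fallback might look and is not part of Ude; `AsciiFallback`, `IsPureAscii`, and `Resolve` are hypothetical names. It only covers 7-bit ASCII and deliberately ignores escape-sequence encodings such as ISO-2022-JP, which also use only 7-bit bytes.

```csharp
using System;
using System.Text;

static class AsciiFallback
{
    // True when every byte is in the 7-bit range 0x00..0x7F.
    public static bool IsPureAscii(byte[] buf)
    {
        foreach (byte b in buf)
        {
            if (b >= 0x80) return false;
        }
        return true;
    }

    // Hypothetical wrapper: keep the detector's answer when there is one,
    // otherwise report "ASCII" if the data is entirely 7-bit.
    public static string Resolve(string detected, byte[] buf)
    {
        if (detected != null) return detected;
        return IsPureAscii(buf) ? "ASCII" : null;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("CONFIG: main 30000000");
        // Simulate a detector that returned null for the charset.
        Console.WriteLine(Resolve(null, data));    // ASCII
        Console.WriteLine(Resolve("UTF-8", data)); // UTF-8
    }
}
```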
What steps will reproduce the problem?
1. Define Cyrillic text, "Это пример кириллического
текста".
2. Feed the CharsetDetector with stream to this text.
3. Result charset is "UTF-8" with Confidence 1.0
What is the expected output?
Charset is koi-8
What do you see instead?
UTF-8
What version of the product are you using?
Ude, C# port
On what operating system?
Windows 7/8, x64
Original issue reported on code.google.com by [email protected]
on 8 Dec 2012 at 8:19
http://ude.googlecode.com/svn/trunk/src/Library/Ude.Core/SBCSGroupProber.cs
Existing code:

    public override void Reset()
    {
        int activeNum = 0;
        ...

It should be:

    public override void Reset()
    {
        activeNum = 0;
        ...

In many cases this bug causes detection of the correct charset to fail. Because
Reset assigns to a local variable, the class member activeNum is always 0, so
the first prober that reports NotMe drives it below zero and the whole group
gives up. See this piece of code:

    } else if (st == ProbingState.NotMe) {
        isActive[i] = false;
        activeNum--;
        if (activeNum <= 0) {
            state = ProbingState.NotMe;
            break;
        }
    }

I fixed it locally, but I want to save other developers from spending time
debugging the same issue :)
The attached file reproduces the bug (the charset is KOI8-R).
Original issue reported on code.google.com by [email protected]
on 25 May 2012 at 6:40
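The shadowing problem described in this report can be reproduced in isolation. The following is a minimal sketch, not the real SBCSGroupProber; the class `ShadowDemo`, its two Reset variants, and the prober count of 3 are hypothetical and chosen only for illustration.

```csharp
using System;

class ShadowDemo
{
    int activeNum; // class member, defaults to 0

    public int ActiveNum { get { return activeNum; } }

    // Buggy variant: "int activeNum" declares a local that hides the field,
    // so the field itself is never updated.
    public void ResetBuggy(int probers)
    {
        int activeNum = 0;
        for (int i = 0; i < probers; i++) activeNum++;
    }

    // Fixed variant: assigns to the field.
    public void ResetFixed(int probers)
    {
        activeNum = 0;
        for (int i = 0; i < probers; i++) activeNum++;
    }

    static void Main()
    {
        ShadowDemo demo = new ShadowDemo();

        demo.ResetBuggy(3);
        Console.WriteLine(demo.ActiveNum); // 0

        demo.ResetFixed(3);
        Console.WriteLine(demo.ActiveNum); // 3
    }
}
```

With the field stuck at 0, a single `activeNum--` already satisfies `activeNum <= 0`, which is why one rejecting prober is enough to end detection for the whole group.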
big problem!
not worked in line 103
Original issue reported on code.google.com by [email protected]
on 15 Apr 2015 at 8:52
The problem is the CharDistributionAnalyser.HandleOneChar call during EUCTW
detection. The charToFreqOrder array has 5376 entries, but tableSize is defined
as 8102, so this check is wrong:

    if (order < tableSize) // <-- wrong bound
    {   // order is valid
        if (512 > charToFreqOrder[order])
            freqChars++;
    }

I took a look at the Java code, where this part has been changed to:

    if (order < charToFreqOrder.Length)
    {   // order is valid
        if (512 > charToFreqOrder[order])
            freqChars++;
    }

With this change tableSize is no longer needed, and this code can no longer
throw an exception.
Original issue reported on code.google.com by [email protected]
on 16 Nov 2009 at 3:12
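The difference between the two bounds checks can be shown with a small stand-in table. This sketch does not use Ude's real data: `BoundsDemo`, the three-entry array, and the declared size of 5 are hypothetical, and the extra `order >= 0` guard is an addition made here for the sketch.

```csharp
using System;

static class BoundsDemo
{
    // Hypothetical stand-ins: a short table and a declared size that is too large.
    static readonly int[] charToFreqOrder = { 10, 600, 20 };
    const int tableSize = 5;

    // Mirrors the reported check: trusts tableSize, so it can index past the array.
    public static bool IsFrequentUnsafe(int order)
    {
        return order < tableSize && 512 > charToFreqOrder[order];
    }

    // Mirrors the proposed check: bounded by the array's real length.
    public static bool IsFrequentSafe(int order)
    {
        return order >= 0
            && order < charToFreqOrder.Length
            && 512 > charToFreqOrder[order];
    }

    static void Main()
    {
        Console.WriteLine(IsFrequentSafe(0)); // True
        Console.WriteLine(IsFrequentSafe(1)); // False
        Console.WriteLine(IsFrequentSafe(4)); // False

        try
        {
            IsFrequentUnsafe(4);
        }
        catch (IndexOutOfRangeException)
        {
            Console.WriteLine("IndexOutOfRangeException");
        }
    }
}
```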
What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Open the file in a hex editor and remove the BOM. Yes, this purposely
modifies the file to cause a problem, but I have encountered many UTF-16
encoded files lacking a BOM that were provided to me by other applications,
and the lack of a BOM does not invalidate a file.
3. Test Ude.Example by passing the path to this BOM-less UTF-16LE file
4. When UniversalDetector is called the first check is to look for a BOM.
5. Not having a BOM, the evaluation passes to the deeper analysis which returns
a result of encoding = ANSI 1252 which is wrong.
What is the expected output?
Expected output is encoding = "UTF-16"
What do you see instead?
"Charset: ASCII, confidence: 1"
What version of the product are you using? On what operating system?
Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit
Please provide any additional information below.
Larger files (1000 KB+) lacking the BOM tend to show a result of "Charset:
windows-1252, confidence: 0.5"
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 10:52
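One possible caller-side workaround for BOM-less UTF-16 is a zero-byte heuristic applied before or after detection. The sketch below is an assumption and not something Ude provides; `Utf16Guess`, `Guess`, and the 0.4 and 0.1 thresholds are hypothetical choices. It only works for text dominated by characters in the range U+0000 to U+00FF, where every other byte of the UTF-16 form is zero.

```csharp
using System;
using System.Text;

static class Utf16Guess
{
    // Returns "UTF-16LE", "UTF-16BE", or null when the heuristic cannot decide.
    public static string Guess(byte[] buf)
    {
        int pairs = buf.Length / 2;
        if (pairs == 0) return null;

        int zeroEven = 0, zeroOdd = 0;
        for (int i = 0; i + 1 < buf.Length; i += 2)
        {
            if (buf[i] == 0) zeroEven++;     // high byte position in big-endian
            if (buf[i + 1] == 0) zeroOdd++;  // high byte position in little-endian
        }

        // Arbitrary thresholds: many zeros on one side, almost none on the other.
        if (zeroOdd >= pairs * 0.4 && zeroEven <= pairs * 0.1) return "UTF-16LE";
        if (zeroEven >= pairs * 0.4 && zeroOdd <= pairs * 0.1) return "UTF-16BE";
        return null;
    }

    static void Main()
    {
        // Encoding.GetBytes does not emit a BOM, so these buffers are BOM-less.
        byte[] le = Encoding.Unicode.GetBytes("Hello, world");
        byte[] be = Encoding.BigEndianUnicode.GetBytes("Hello, world");
        byte[] ascii = Encoding.ASCII.GetBytes("Hello, world");

        Console.WriteLine(Guess(le) ?? "undecided");    // UTF-16LE
        Console.WriteLine(Guess(be) ?? "undecided");    // UTF-16BE
        Console.WriteLine(Guess(ascii) ?? "undecided"); // undecided
    }
}
```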