Comments (5)
Working as intended.
The inner call of CLD2 returns the total number of bytes of text found, a list of three
languages, their three percentages of the total text bytes, and a reliability Boolean.
Several degenerate cases are possible:
- UNKNOWN_LANGUAGE is a valid language and may show up in the list (some webcruft strings
such as "http" or "jpg" may deliberately match UNKNOWN_LANGUAGE to prevent them from
falsely indicating Somali or somesuch).
- The three percentages in general will total less than 100%, implying that the remainder
of the text is UNKNOWN_LANGUAGE.
- The percentage of the top language might be small but non-zero, meaning that any
other detected languages are a smaller percentage and the rest is unknown.
- The percentage of the top language might be 0%, meaning 100% unknown.
- Several languages are detected, but their scores differ too slightly, or they score
much too low or much too high compared to real text in each language, so the reliability
Boolean is set to false.
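As a rough illustration of the cases above, here is a minimal Python sketch of how a caller might interpret such a result. It does not call CLD2 or its Python wrapper; the `Summary` type, the `summarize` function, and the (name, percent) input shape are hypothetical names chosen for this example only.

```python
from typing import List, NamedTuple, Tuple

UNKNOWN = "UNKNOWN_LANGUAGE"


class Summary(NamedTuple):
    top_language: str     # name of the first language in the list
    unknown_percent: int  # share of text not attributed to a named language
    usable: bool          # True only if the result looks trustworthy


def summarize(languages: List[Tuple[str, int]], is_reliable: bool) -> Summary:
    """Interpret up to three (name, percent) pairs, top language first.

    Hypothetical helper: the input shape is an assumption, not the
    actual CLD2 or wrapper API.
    """
    # Whatever the named languages do not account for is treated as unknown.
    known = sum(pct for name, pct in languages if name != UNKNOWN)
    unknown_percent = max(0, 100 - known)

    top_name, top_pct = languages[0] if languages else (UNKNOWN, 0)

    # Treat the all-unknown and 0% cases as unusable even if the flag is set.
    usable = is_reliable and top_name != UNKNOWN and top_pct > 0
    return Summary(top_name, unknown_percent, usable)
```

With this sketch, a result whose three entries are all UNKNOWN_LANGUAGE at 0% comes back as 100% unknown and not usable, regardless of the reliability flag.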
In your particular example, only four letter groups score:
fogr fik_ _über_ spie
Other letter groups such as
_inte tive_ _info _vide deos
occur in so many different languages that they are ignored. The letter sequence "_über_"
is strongly German, but not much else is, so the German language score is too low for
a normal 49 bytes of German. (And 49 bytes is too low for CLD2 to do well -- two sentences
is a more reasonable amount of input; CLD2's design center is real text from web pages,
not 1-4 word fragments from searches or Twitter or suchlike.)
The letter sequence "spie" occurs about equally in German and Latvian, so the overall
score separation between those two ends up too low. In the end, both languages are
dropped entirely with too few useful table hits, leaving 100% "other". In this case the
reliability bit is essentially computed over an empty set of returned languages.
I haven't looked carefully at the Python wrapper, but Mike may want to expose the percentages
or set the reliability bool to false in more of the degenerate cases above. /dick
Reported by [email protected]
on 2013-08-09 18:17:43
from cld2.
Reported by [email protected]
on 2013-08-09 18:39:39
- Status changed: WontFix
OK, I fixed the Python bindings to always return 3 languages, even when some of them are
UNKNOWN (previously I would skip UNKNOWN), and added a test case.
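The behavior described here could look roughly like the sketch below. This is not the actual binding code; `pad_to_three` and the (name, percent) entry shape are hypothetical and only illustrate keeping UNKNOWN slots instead of skipping them.

```python
UNKNOWN = "UNKNOWN_LANGUAGE"


def pad_to_three(detected):
    """Return exactly three (name, percent) entries.

    Hypothetical helper: UNKNOWN entries in the input are kept as-is,
    and missing slots are filled with (UNKNOWN, 0).
    """
    padded = list(detected[:3])
    while len(padded) < 3:
        padded.append((UNKNOWN, 0))
    return padded
```

A caller can then always index three slots without checking the list length first.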
Reported by [email protected]
on 2013-08-09 20:09:14
I'm off on vacation in upstate Wisconsin for a week, back on the 20th. At
that time, I plan to tweak CLD2 to return unreliable if the top language is
less than 2% of the total text -- this will also cover the all-unknown case.
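The planned rule can be sketched in a few lines of Python. This is only an illustration of the threshold described above, not CLD2's code; `MIN_TOP_PERCENT` and `adjusted_reliability` are hypothetical names.

```python
MIN_TOP_PERCENT = 2  # threshold taken from the comment above


def adjusted_reliability(top_percent: int, is_reliable: bool) -> bool:
    """Report unreliable when the top language covers less than 2% of the text.

    Hypothetical helper: a top percentage of 0 (the all-unknown case)
    also falls under the threshold.
    """
    return is_reliable and top_percent >= MIN_TOP_PERCENT
```

Because 0% is below the threshold, the all-unknown case needs no separate check in this sketch.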
Reported by [email protected]
on 2013-08-10 20:04:12
Updated to return is_reliable=false if top language is UNKNOWN_LANGUAGE.
Reported by [email protected]
on 2013-08-20 21:22:44
- Status changed: Fixed