cld2's Introduction

Compact Language Detector 2

Summary

Dick Sites ([email protected])
2013.07.28

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. Legacy encodings must be converted to valid UTF-8 by the caller. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g. 80% English and 20% French out of 1000 bytes of text means about 800 bytes of English and 200 bytes of French). Optionally, it also returns a vector of text spans with the language of each identified. This may be useful for applying different spelling-correction dictionaries or different machine translation requests to each span. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text, lists of proper names, part numbers, etc.
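
As a rough illustration of the summary output described above, the sketch below turns per-language byte counts into the top three languages with approximate percentages. `LanguageShare` and `TopThree` are hypothetical names for this example and are not part of the CLD2 API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch; not a CLD2 type.
struct LanguageShare {
  std::string language;
  int percent;  // approximate share of the total text bytes
};

// Returns up to three languages with the largest byte counts, each with its
// rounded percentage of the total bytes.
std::vector<LanguageShare> TopThree(
    std::vector<std::pair<std::string, int>> bytes_per_language) {
  long total = 0;
  for (const auto& entry : bytes_per_language) {
    assert(entry.second >= 0);  // byte counts cannot be negative
    total += entry.second;
  }
  std::vector<LanguageShare> result;
  if (total <= 0) return result;

  // Largest byte count first; stable_sort keeps input order for ties.
  std::stable_sort(bytes_per_language.begin(), bytes_per_language.end(),
                   [](const std::pair<std::string, int>& a,
                      const std::pair<std::string, int>& b) {
                     return a.second > b.second;
                   });

  const std::size_t n = std::min<std::size_t>(3, bytes_per_language.size());
  for (std::size_t i = 0; i < n; ++i) {
    const int percent = static_cast<int>(
        (100L * bytes_per_language[i].second + total / 2) / total);
    result.push_back({bytes_per_language[i].first, percent});
  }
  return result;
}
```

With 800 bytes of English and 200 bytes of French, this yields English at 80% and French at 20%, matching the example in the text.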

Supported Languages

These 83 languages are detected:

Afrikaans Albanian Arabic Armenian Azerbaijani Basque Belarusian Bengali Bihari Bulgarian Catalan Cebuano Cherokee Croatian Czech Chinese Chinese_T Danish Dhivehi Dutch English Estonian Finnish French Galician Ganda Georgian German Greek Gujarati Haitian_Creole Hebrew Hindi Hmong Hungarian Icelandic Indonesian Inuktitut Irish Italian Javanese Japanese Kannada Khmer Kinyarwanda Korean Laothian Latvian Limbu Lithuanian Macedonian Malay Malayalam Maltese Marathi Nepali Norwegian Oriya Persian Polish Portuguese Punjabi Romanian Russian Scots_Gaelic Serbian Sinhalese Slovak Slovenian Spanish Swahili Swedish Syriac Tagalog Tamil Telugu Thai Turkish Ukrainian Urdu Vietnamese Welsh Yiddish.

Internals

Classification & Scoring. CLD2 is a Naïve Bayesian classifier, using one of three different token algorithms. For Unicode scripts such as Greek and Thai that map one-to-one to detected languages, the script defines the result. For the 80,000+ character Han script and its CJK combination with Hiragana, Katakana, and Hangul scripts, single letters (unigrams) are scored. For all other scripts, sequences of four letters (quadgrams) are scored.
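
The naive Bayes decision can be sketched as summing a per-token score for each language and taking the largest total. The sketch assumes scores are small non-negative integers where larger means more likely (a stand-in for quantized log probabilities), and that a language not listed for a token gains nothing. The table layout and the names `TokenScores` and `BestLanguage` are made up for illustration; CLD2's real tables are far larger and organized differently.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// token -> (language -> score). Hypothetical layout for this sketch.
using TokenScores = std::map<std::string, std::map<std::string, int>>;

// Sums the scores per language over all tokens found in the table and
// returns the language with the highest total, or "" if nothing matched.
// Ties go to the language that sorts first alphabetically.
std::string BestLanguage(const std::vector<std::string>& tokens,
                         const TokenScores& table) {
  std::map<std::string, int> totals;
  for (const std::string& token : tokens) {
    const auto it = table.find(token);
    if (it == table.end()) continue;  // unknown token contributes nothing
    for (const auto& lang_score : it->second) {
      assert(lang_score.second >= 0);  // assumption: scores are non-negative
      totals[lang_score.first] += lang_score.second;
    }
  }
  std::string best;
  int best_total = 0;
  for (const auto& lang_total : totals) {
    if (best.empty() || lang_total.second > best_total) {
      best = lang_total.first;
      best_total = lang_total.second;
    }
  }
  return best;
}
```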

Scoring is done exclusively on lowercased Unicode letters and marks, after expanding HTML entities &xyz; and after deleting digits, punctuation, and <tags>. Quadgram word beginnings and endings (indicated here by underscore) are explicitly used, so the word _look_ scores differently from the word-beginning _look or the mid-word look. Quadgram single-letter "words" are completely ignored. For each letter sequence, the scoring uses the 3-6 most likely languages and their quantized log probabilities. The training corpus is manually constructed from chosen web pages for each language, then augmented by careful automated scraping of over 100M additional web pages.
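
The boundary-marking idea can be sketched as follows. This is a simplified, ASCII-only illustration, not CLD2's extraction code: it treats every non-letter as a word separator, and emitting a two- or three-letter word as a single gram with both boundaries is an assumption of the sketch. `LowercaseWords` and `Quadgrams` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits ASCII text into lowercase words; digits, punctuation, and any
// non-ASCII bytes act as separators and are dropped.
std::vector<std::string> LowercaseWords(const std::string& text) {
  std::vector<std::string> words;
  std::string current;
  for (unsigned char c : text) {
    if (std::isalpha(c)) {
      current.push_back(static_cast<char>(std::tolower(c)));
    } else if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) words.push_back(current);
  return words;
}

// Emits four-letter windows per word, with '_' marking a window that starts
// at a word beginning or stops at a word ending.
std::vector<std::string> Quadgrams(const std::string& text) {
  std::vector<std::string> grams;
  for (const std::string& word : LowercaseWords(text)) {
    if (word.size() < 2) continue;  // single-letter "words" are ignored
    if (word.size() < 4) {          // assumption: short word is one gram
      grams.push_back("_" + word + "_");
      continue;
    }
    for (std::size_t i = 0; i + 4 <= word.size(); ++i) {
      std::string gram = word.substr(i, 4);
      if (i == 0) gram.insert(gram.begin(), '_');
      if (i + 4 == word.size()) gram.push_back('_');
      grams.push_back(gram);
    }
  }
  return grams;
}
```

For the input "Look, looking a 42!" this produces `_look_`, `_look`, `ooki`, `okin`, and `king_`, so the standalone word and the word-beginning sequence stay distinct.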

Several embellishments improve the basic algorithm:

  • additional scoring of some sequences of two CJK letters or eight other letters
  • scoring some words and word pairs that are distinctive within sets of statistically-close languages such as {Malay, Indonesian} or {Spanish, Portuguese, Galician}
  • removing repetitive sequences/words that would otherwise skew the scoring, such as jpg in foo.jpg bar.jpg baz.jpg
  • removing web-specific words that convey almost no language information such as page, link, click, td, tr, copyright, wikipedia, http.
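
The last two embellishments can be approximated with a simple word filter. This is only a sketch of the idea, not CLD2's actual mechanism: it drops the web-specific words listed above and caps how often any one word may appear. `FilterWords` and `max_repeats` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Drops web-specific words and keeps at most `max_repeats` copies of any
// other word, preserving the original order.
std::vector<std::string> FilterWords(const std::vector<std::string>& words,
                                     int max_repeats) {
  assert(max_repeats >= 1);
  static const std::set<std::string> kWebWords = {
      "page", "link", "click", "td", "tr", "copyright", "wikipedia", "http"};
  std::map<std::string, int> seen;
  std::vector<std::string> kept;
  for (const std::string& word : words) {
    if (kWebWords.count(word) > 0) continue;   // carries no language signal
    if (++seen[word] > max_repeats) continue;  // repeated too often
    kept.push_back(word);
  }
  return kept;
}
```

With `max_repeats` set to 1, the words of "foo.jpg bar.jpg baz.jpg" reduce to foo, jpg, bar, baz, so the repeated "jpg" is scored only once.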

Hints. Several hints can be supplied. Because these can be inaccurate on web pages, they are just hints -- they add a bias but do not force a specific language to be the detection result. The hints include:

  • expected language
  • original document encoding
  • document URL top-level domain name
  • embedded <…lang=xx …> language tags.
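
The difference between a bias and a forced result can be sketched like this. The score values, the size of the bias, and the name `DetectWithHint` are all assumptions for illustration; CLD2's real hint weighting is not shown here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Adds `bias` to the hinted language's score, then returns the language
// with the highest score. An empty hint leaves the scores untouched.
std::string DetectWithHint(std::map<std::string, int> scores,
                           const std::string& hinted_language, int bias) {
  assert(bias >= 0);
  if (!hinted_language.empty()) scores[hinted_language] += bias;
  std::string best;
  int best_score = 0;
  for (const auto& entry : scores) {
    if (best.empty() || entry.second > best_score) {
      best = entry.first;
      best_score = entry.second;
    }
  }
  return best;
}
```

A hint can tip a close call, for example between Indonesian and Malay, but it cannot override a result that the text supports by a wide margin.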

Optimized for space and speed. The table-driven extraction of letter sequences and table-driven scoring is highly optimized for both space and speed, running about 10x faster than other detectors and covering over 70 languages in 1.8MB of x86 code and tables. The main quadgram lookup table consists of 256K four-byte entries, covering about 50 languages. Detection over the average web page of 30KB (half tags/digits/punctuation, half letters) takes roughly 1 msec on a current x86 processor.
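
The figures above can be checked with a little arithmetic. The sketch assumes "256K" means 256 × 1024 entries and "30KB" means 30 × 1024 bytes; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// 256K entries of four bytes each.
constexpr std::int64_t kQuadEntries = 256 * 1024;
constexpr std::int64_t kBytesPerEntry = 4;

constexpr std::int64_t QuadTableBytes() {
  return kQuadEntries * kBytesPerEntry;
}

// Throughput in bytes per second for `bytes` processed in `millis` ms.
constexpr std::int64_t BytesPerSecond(std::int64_t bytes,
                                      std::int64_t millis) {
  return bytes * 1000 / millis;
}

// The main quadgram table alone is 1 MiB of the 1.8MB total.
static_assert(QuadTableBytes() == 1024 * 1024, "quadgram table is 1 MiB");
```

Under those assumptions, a 30KB page scanned in 1 msec corresponds to a throughput of roughly 30 MB per second.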

CLD2 is an update of the original CLD, adding more languages, updating to Unicode 6.2 characters, improving scoring, and adding the optional output vector of labelled language spans.

cld2's People

Contributors

ivgiuliani, jasonriesa

cld2's Issues

compile_libs.sh does not work on Windows 7 x64 with cygwin

Originally reported on Google Code with ID 4

What steps will reproduce the problem?
1. sh compile_libs.sh
2. Observe errors

What is the expected output? What do you see instead?
Expected: None (success)
Actual output:

compact_lang_det_test.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_test.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
cldutil.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil_shared.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
cldutil_shared.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
compact_lang_det.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_hint_code.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_hint_code.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_impl.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_impl.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
debug.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
debug.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
fixunicodevalue.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
fixunicodevalue.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_entities.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_entities.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_language.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_language.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_ulscript.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_ulscript.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
getonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
getonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
lang_script.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
lang_script.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
offsetmap.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
offsetmap.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
scoreonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
scoreonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
tote.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled
by default]
tote.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
utf8statetable.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
utf8statetable.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld_generated_cjk_uni_prop_80.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld_generated_cjk_uni_prop_80.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld2_generated_cjk_compatible.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld2_generated_cjk_compatible.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld_generated_cjk_delta_bi_4.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld_generated_cjk_delta_bi_4.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
generated_distinct_bi_0.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_distinct_bi_0.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld2_generated_quadchrome0715.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld2_generated_quadchrome0715.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld2_generated_deltaoctachrome0614.cc:1:0: warning: -fPIC ignored for target (all code
is position independent) [enabled by default]
cld2_generated_deltaoctachrome0614.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld2_generated_distinctoctachrome0604.cc:1:0: warning: -fPIC ignored for target (all
code is position independent) [enabled by default]
cld2_generated_distinctoctachrome0604.cc:1:0: sorry, unimplemented: 64-bit mode not
compiled in
cld_generated_score_quad_octa_1024_256.cc:1:0: warning: -fPIC ignored for target (all
code is position independent) [enabled by default]
cld_generated_score_quad_octa_1024_256.cc:1:0: sorry, unimplemented: 64-bit mode not
compiled in
compact_lang_det_test.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_test.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
cldutil.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cldutil_shared.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
cldutil_shared.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
compact_lang_det.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_hint_code.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_hint_code.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
compact_lang_det_impl.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
compact_lang_det_impl.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
debug.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
debug.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
fixunicodevalue.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
fixunicodevalue.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_entities.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_entities.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_language.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_language.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
generated_ulscript.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_ulscript.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
getonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
getonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
lang_script.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
lang_script.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
offsetmap.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
offsetmap.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
scoreonescriptspan.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
scoreonescriptspan.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
tote.cc:1:0: warning: -fPIC ignored for target (all code is position independent) [enabled
by default]
tote.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
utf8statetable.cc:1:0: warning: -fPIC ignored for target (all code is position independent)
[enabled by default]
utf8statetable.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld_generated_cjk_uni_prop_80.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld_generated_cjk_uni_prop_80.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld2_generated_cjk_compatible.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld2_generated_cjk_compatible.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld_generated_cjk_delta_bi_32.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld_generated_cjk_delta_bi_32.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
generated_distinct_bi_0.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
generated_distinct_bi_0.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld2_generated_quad0720.cc:1:0: warning: -fPIC ignored for target (all code is position
independent) [enabled by default]
cld2_generated_quad0720.cc:1:0: sorry, unimplemented: 64-bit mode not compiled in
cld2_generated_deltaocta0527.cc:1:0: warning: -fPIC ignored for target (all code is
position independent) [enabled by default]
cld2_generated_deltaocta0527.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld2_generated_distinctocta0527.cc:1:0: warning: -fPIC ignored for target (all code
is position independent) [enabled by default]
cld2_generated_distinctocta0527.cc:1:0: sorry, unimplemented: 64-bit mode not compiled
in
cld_generated_score_quad_octa_1024_256.cc:1:0: warning: -fPIC ignored for target (all
code is position independent) [enabled by default]
cld_generated_score_quad_octa_1024_256.cc:1:0: sorry, unimplemented: 64-bit mode not
compiled in

What version of the product are you using? On what operating system?
Windows 7 x64 SP1
gcc version 4.7.3 (GCC) (i686-pc-cygwin)
GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)

Please provide any additional information below.

I'm trying to build this library on a Windows host to be used in the chromium-compact-language-detector
Python extension.

When I remove the flags -fPIC and -m64 the compilation works (but that is probably
not the right fix). And I can't test it because the Python extension requires *.lib
files but *.so are produced.

Reported by radek.lat on 2013-09-09 14:22:16

Fix array-subscript-is-char warning for Clang on Windows

As seen here:
https://codereview.chromium.org/1426873002/

It looks like CLD2 triggers -Wchar-subscripts warnings, which -Werror turns into build failures, under some Clang configurations on Windows:
FAILED: ninja -t msvc -e environment.x64 --
../../third_party/llvm-build/Release+Asserts/bin/clang-cl.exe /nologo
/showIncludes /FC @obj/third_party/cld_2/cld2_static/offsetmap.obj.rsp
/c
../../third_party/cld_2/src/internal/offsetmap.cc
/Foobj/third_party/cld_2/cld2_static/offsetmap.obj
/Fdobj/third_party/cld_2/cld2_static_cc.pdb
../../third_party/cld_2/src/internal/offsetmap.cc(84,36) : error: array subscript is of type 'char'
[-Werror,-Wchar-subscripts]
fprintf(fout, "%c%02d", "&=+-"[OpPart(diffs_[i])], LenPart(diffs_[i]));
^~~~~~~~~~~~~~~~~~
../../third_party/cld_2/src/internal/offsetmap.cc(210,38) :
error: array subscript is of type 'char' [-Werror,-Wchar-subscripts]
fprintf(stderr, "%c%02d ", "&=+-"[OpPart(diffs_[i])], LenPart(diffs_[i]));
^~~~~~~~~~~~~~~~~~
2 errors generated.
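
The warning exists because plain `char` may be signed, so a `char` subscript can become a negative index. One common way to silence it is to index with an `int`. The sketch below assumes the operation is stored in the top two bits of a byte; `OpIndex` and `OpSymbol` are hypothetical stand-ins, not the functions in offsetmap.cc, and the fix that was actually landed may differ.

```cpp
#include <cassert>

// Hypothetical stand-in for the op extraction: the top two bits of the byte
// select one of four operations. Going through unsigned char keeps the
// shifted value in the range 0..3 even where plain char is signed.
inline int OpIndex(char op_byte) {
  return (static_cast<unsigned char>(op_byte) >> 6) & 3;
}

// Indexing with an int instead of a char avoids -Wchar-subscripts.
inline char OpSymbol(char op_byte) {
  static const char kOpChars[] = "&=+-";
  return kOpChars[OpIndex(op_byte)];
}
```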

Wasted work in cld::GetNormalizedScore() and cld::GetReliability()

Originally reported on Google Code with ID 2

The problem appears in revision 215539. I have attached a simple one-line patch that
fixes it.

In method cld::GetNormalizedScore() in cld/compact_lang_det/cldutil.cc, the loop in
line 818 keeps overriding "expected_score" with "kMeanScore[cur_lang * 4 + i]" when
it is larger than zero. Therefore, only the last written value is visible out of the
loop and all the other writes and iterations are not necessary. The patch iterates
from the end of "i" and breaks the first time when "expected_score" is set.

Similar problem also appears in cld::GetReliability(), at line 846.
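
The pattern described in this report can be sketched as follows. The function names, the table argument, and the initial value of 0 are assumptions for illustration; this is not the code from cldutil.cc or the attached patch.

```cpp
#include <cassert>

// Forward scan: every positive entry overwrites the result, so only the
// last positive entry is visible after the loop.
int LastPositiveForward(const int* scores, int count) {
  assert(count >= 0);
  int expected_score = 0;
  for (int i = 0; i < count; ++i) {
    if (scores[i] > 0) expected_score = scores[i];
  }
  return expected_score;
}

// Reverse scan: same result, but stops at the first positive entry found
// from the end instead of visiting every element.
int LastPositiveReverse(const int* scores, int count) {
  assert(count >= 0);
  for (int i = count - 1; i >= 0; --i) {
    if (scores[i] > 0) return scores[i];
  }
  return 0;
}
```

Both functions return the same value for any input; the reverse version simply skips the iterations whose writes would be overwritten anyway.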


Reported by pochang0403 on 2013-08-06 20:45:26

please use CFLAGS CXXFLAGS CPPFLAGS and LDFLAGS

Originally reported on Google Code with ID 28

patch attached.

Description: Adding CFLAGS CXXFLAGS CPPFLAGS and LDFLAGS to the build
Author: Gianfranco Costamagna <[email protected]>
Origin: debian
Last-Update: <2015-01-10>

--- cld2-0.0.0~svn193.orig/internal/compile.sh
+++ cld2-0.0.0~svn193/internal/compile.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,10 +24,10 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o compact_lang_det_test_chrome_2
+  -o compact_lang_det_test_chrome_2 $LDFLAGS
 echo "  compact_lang_det_test_chrome_2 compiled"

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -37,11 +37,11 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_16.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o compact_lang_det_test_chrome_16
+  -o compact_lang_det_test_chrome_16 $LDFLAGS
 echo "  compact_lang_det_test_chrome_16 compiled"


-g++ -O2 -m64  cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_unittest.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -51,10 +51,10 @@ g++ -O2 -m64  cld2_unittest.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_unittest_chrome_2
+  -o cld2_unittest_chrome_2 $LDFLAGS
 echo "  cld2_unittest_chrome_2 compiled"

-g++ -O2 -m64  -Davoid_utf8_string_constants cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -Davoid_utf8_string_constants cld2_unittest.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -64,7 +64,7 @@ g++ -O2 -m64  -Davoid_utf8_string_consta
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_unittest_avoid_chrome_2
+  -o cld2_unittest_avoid_chrome_2 $LDFLAGS
 echo "  cld2_unittest_avoid_chrome_2 compiled"


--- cld2-0.0.0~svn193.orig/internal/compile_dynamic.sh
+++ cld2-0.0.0~svn193/internal/compile_dynamic.sh
@@ -15,7 +15,7 @@
 #  limitations under the License.

 # The data tool, which can be used to read and write CLD2 dynamic data files
-g++ -O2 -m64 cld2_dynamic_data_tool.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_dynamic_data_tool.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_extractor.h cld2_dynamic_data_extractor.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
@@ -28,11 +28,11 @@ g++ -O2 -m64 cld2_dynamic_data_tool.cc \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o cld2_dynamic_data_tool
+  -o cld2_dynamic_data_tool $LDFLAGS
 echo "  cld2_dynamic_data_tool compiled"

 # Tests for Chromium flavored dynamic CLD2
-g++ -O2 -m64 -D CLD2_DYNAMIC_MODE compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -D CLD2_DYNAMIC_MODE compact_lang_det_test.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_extractor.h cld2_dynamic_data_extractor.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
@@ -41,12 +41,12 @@ g++ -O2 -m64 -D CLD2_DYNAMIC_MODE compac
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o compact_lang_det_dynamic_test_chrome
+  -o compact_lang_det_dynamic_test_chrome $LDFLAGS
 echo "  compact_lang_det_dynamic_test_chrome compiled"


 # Unit tests, in dynamic mode
-g++ -O2 -m64 -g3 -D CLD2_DYNAMIC_MODE cld2_unittest.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -g3 -D CLD2_DYNAMIC_MODE cld2_unittest.cc \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
@@ -54,11 +54,11 @@ g++ -O2 -m64 -g3 -D CLD2_DYNAMIC_MODE cl
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o cld2_dynamic_unittest
+  -o cld2_dynamic_unittest $LDFLAGS
 echo "  cld2_dynamic_unittest compiled"

 # Shared library, in dynamic mode
-g++ -shared -fPIC -O2 -m64 -D CLD2_DYNAMIC_MODE \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC -D CLD2_DYNAMIC_MODE \
   cld2_dynamic_data.h cld2_dynamic_data.cc \
   cld2_dynamic_data_loader.h  cld2_dynamic_data_loader.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
@@ -66,6 +66,6 @@ g++ -shared -fPIC -O2 -m64 -D CLD2_DYNAM
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
   getonescriptspan.cc lang_script.cc offsetmap.cc  scoreonescriptspan.cc \
   tote.cc utf8statetable.cc  \
-  -o libcld2_dynamic.so
+  -o libcld2_dynamic.so $LDFLAGS
 echo "  libcld2_dynamic.so compiled"

--- cld2-0.0.0~svn193.orig/internal/compile_full.sh
+++ cld2-0.0.0~svn193/internal/compile_full.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -O2 -m64  compact_lang_det_test.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS compact_lang_det_test.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,10 +24,10 @@ g++ -O2 -m64  compact_lang_det_test.cc \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o compact_lang_det_test_full
+  -o compact_lang_det_test_full $LDFLAGS
 echo "  compact_lang_det_test_full compiled"

-g++ -O2 -m64  cld2_unittest_full.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS cld2_unittest_full.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -37,10 +37,10 @@ g++ -O2 -m64  cld2_unittest_full.cc \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o cld2_unittest_full
+  -o cld2_unittest_full $LDFLAGS
 echo "  cld2_unittest_full compiled"

-g++ -O2 -m64  -Davoid_utf8_string_constants cld2_unittest_full.cc \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -Davoid_utf8_string_constants cld2_unittest_full.cc \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc  compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -50,6 +50,6 @@ g++ -O2 -m64  -Davoid_utf8_string_consta
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o cld2_unittest_full_avoid
+  -o cld2_unittest_full_avoid $LDFLAGS
 echo "  cld2_unittest_full_avoid compiled"

--- cld2-0.0.0~svn193.orig/internal/compile_libs.sh
+++ cld2-0.0.0~svn193/internal/compile_libs.sh
@@ -14,7 +14,7 @@
 #  See the License for the specific language governing permissions and
 #  limitations under the License.

-g++ -shared -fPIC -O2 -m64 \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -24,9 +24,9 @@ g++ -shared -fPIC -O2 -m64 \
   cld_generated_cjk_delta_bi_4.cc generated_distinct_bi_0.cc  \
   cld2_generated_quadchrome_2.cc cld2_generated_deltaoctachrome.cc \
   cld2_generated_distinctoctachrome.cc  cld_generated_score_quad_octa_2.cc  \
-  -o libcld2.so
+  -o libcld2.so $LDFLAGS

-g++ -shared -fPIC -O2 -m64 \
+g++ $CFLAGS $CPPFLAGS $CXXFLAGS -shared -fPIC \
   cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc \
   compact_lang_det_impl.cc  debug.cc fixunicodevalue.cc \
   generated_entities.cc  generated_language.cc generated_ulscript.cc  \
@@ -36,5 +36,5 @@ g++ -shared -fPIC -O2 -m64 \
   cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc  \
   cld2_generated_quad0122.cc cld2_generated_deltaocta0122.cc \
   cld2_generated_distinctocta0122.cc  cld_generated_score_quad_octa_0122.cc  \
-  -o libcld2_full.so
+  -o libcld2_full.so $LDFLAGS



(there is an ongoing debian effort to package it)

Reported by costamagna.gianfranco on 2015-02-10 15:36:51

CLD2DynamicDataLoader calls delete instead of delete[] on array types

Originally reported on Google Code with ID 14

Upon running some browser tests in Chrome, the following error was encountered when
attempting to call CLD2::loadDataFromRawAddress():

memory allocation/deallocation mismatch at 0x155bb621cb20: allocated with new [] being
deallocated with delete
Received signal 11 SEGV_MAPERR 000000000039
...
#6 0x000002b7b791 MallocBlock::CheckLocked()
#7 0x000002b7b422 MallocBlock::CheckAndClear()
#8 0x000002b7bb4a MallocBlock::Deallocate()
#9 0x000002b79109 DebugDeallocate()
#10 0x000009e02885 operator delete()
#11 0x000006ecd635 CLD2DynamicDataLoader::loadDataInternal()
#12 0x000006ecd325 CLD2DynamicDataLoader::loadDataRaw()
#13 0x000006eba963 CLD2::loadDataFromRawAddress()

I'm not sure why this wasn't caught earlier in testing. It may be a consequence of
toolchain changes in Chromium, but the error seems valid and should be fixed. This
was previously working without issue on both Linux and Android platform builds for
x64 and ARM respectively.

I will review the other uses of delete to see if there are more occurrences. This should
be a trivial fix, but blocks adoption of CLD2 dynamic mode in Chromium.
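
For reference, the rule behind this report is that memory obtained with `new[]` must be released with `delete[]`; releasing it with plain `delete` is undefined behavior. The sketch below is a generic illustration with hypothetical function names, not the code in CLD2DynamicDataLoader.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Manual version: allocate with new[], release with delete[].
unsigned FillAndSum(std::size_t length) {
  unsigned char* raw = new unsigned char[length];
  unsigned sum = 0;
  for (std::size_t i = 0; i < length; ++i) {
    raw[i] = static_cast<unsigned char>(i);
    sum += raw[i];
  }
  delete[] raw;  // matches new[]; `delete raw;` would be a mismatch
  return sum;
}

// Safer alternative: std::unique_ptr<T[]> calls delete[] automatically, so
// the mismatch cannot occur. The trailing () zero-initializes the buffer.
std::unique_ptr<unsigned char[]> MakeBuffer(std::size_t length) {
  assert(length > 0);
  return std::unique_ptr<unsigned char[]>(new unsigned char[length]());
}
```

Debug allocators such as the one in the stack trace above check for exactly this kind of mismatch, which may explain why it only surfaced in some build configurations.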

Reported by [email protected] on 2014-05-15 16:32:28

Check in tools for generating generated_* files

Originally reported on Google Code with ID 33

There are a lot of data files that get generated, such as cld_generated_cjk_uni_prop_80.cc
and its ilk. There have been several problems in the past with the generated files
that have necessitated post-generation fixes, e.g.:

https://code.google.com/p/cld2/source/detail?r=155
https://code.google.com/p/cld2/source/detail?r=156
https://code.google.com/p/cld2/source/detail?r=189
https://code.google.com/p/cld2/source/detail?r=192
https://code.google.com/p/cld2/source/detail?r=193

...

And now we have issue 32, which is more of the same. We don't have the templates (or
whatever is used to generate these source files) checked in; we should. I get that
the actual data is huge and isn't something we'd store in Git, but I'd really like
to see us put the templates/generators into the code base so that we can maintain them
alongside the code that they produce.

High priority because I feel that at this point there is likely drift between the templates
and the code they produce; we should probably get the templates checked in and iterate
on them until they produce exactly the same files that we have today, then proceed
forward with maintenance.

WDYT?

Reported by [email protected] on 2015-05-01 08:34:54

Translation bar shows up for the English website and detects as "Malay"

Originally reported on Google Code with ID 5


What steps will reproduce the problem?
1. Launch chrome with the flag --force-fieldtrials=CLD1VsCLD2/CLD2/
2. Open website <https://play.google.com/store>
3. Go to movie and Romancing Bollywood then click on see more movies.
4. From an India location, the page is detected as "Malay" although it is
in English (refer to the attached screenshot).

What is the expected output? 
No translation bar as the language of website is English.

What do you see instead?
translation bar asking for translation from Malay to English.

What version of the product are you using? On what operating system?
Version: 32.0.1657.2 (Official Build 226144) 
OS: Linux Ubuntu 12.04

Please provide any additional information below.


Reported by [email protected] on 2013-11-07 09:39:22


- _Attachment: CLD2.png
![CLD2.png](https://storage.googleapis.com/google-code-attachments/cld2/issue-5/comment-0/CLD2.png)_

Windows build fails: undeclared identifier 'close'

Originally reported on Google Code with ID 19

There appears to be a weird mix of both open() and fopen() (with corresponding close()
and fclose()) in cld2_dynamic_data_loader.cc, and possibly other places in the code.
We should consistently use one or the other. To use close() we'd also technically need
to depend on unistd.h, I think, which we currently don't. This is causing some problems
for Chromium, though why it has just cropped up now I could not say:

https://code.google.com/p/chromium/issues/detail?id=403222

The fix here should be trivial, and I'll take care of it now.

Reported by [email protected] on 2014-08-13 09:41:53

Dynamic data loading should not use iostream

Originally reported on Google Code with ID 18

Dynamic data loading currently uses iostream for logging.

That would be fine, except that nowhere else in the library is iostream used, meaning
this is bringing in many classes for little gain, and only when dynamic data loading
is turned on.

Reported by [email protected] on 2014-07-15 21:39:27

how to add it to c++ program

I just want to know whether there is a way to add it to a C++ program so that it can detect the language automatically.

Thanks

Consider declaring dynamic data methods unconditionally

Originally reported on Google Code with ID 16

Today, we guard the declaration of the dynamic-data-related functions in compact_lang_det.h
with "#ifdef CLD2_DYNAMIC_MODE":
https://code.google.com/p/cld2/source/browse/trunk/public/compact_lang_det.h

This causes some unfortunate side effects when including CLD2 in another project: unless
building with a single compile pass including all sources, any separate compilation
unit that requires dynamic functionality has to have the same define when it #includes
compact_lang_det.h in order to keep the compiler happy.

For example, Chromium builds CLD2 separately, then links it into the Chromium binary;
but if CLD2_DYNAMIC_MODE isn't defined in Chromium code that includes compact_lang_det.h,
you get compiler errors like the ones below even if CLD2 itself has been built with
the define:

error: 'isDataLoaded' is not a member of 'CLD2'
error: 'loadDataFromRawAddress' is not a member of 'CLD2'

Ideally, the #define guard can be encapsulated entirely within CLD2 so that the dependent
library doesn't need to know about this at all.

The downside is that dependent code might accidentally try to use dynamic mode even
if it isn't available. Throwing exceptions isn't a viable solution, since some projects
disable exceptions when compiling. We'd presumably just have to define the following
behavior if CLD2_DYNAMIC_MODE is not defined:

isDataLoaded: return true
loadDataFromRawAddress: no-op and output a warning to stderr
loadDataFromFile: no-op and output a warning to stderr

This change should be fully backwards compatible, since it doesn't change or remove
any existing function declarations under any circumstances.

Reported by [email protected] on 2014-06-23 09:57:35
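
A minimal sketch of the fallback behavior proposed in this issue, assuming a stand-in namespace `cld2_sketch` so that it compiles without the real headers. The function names and the `loadDataFromRawAddress` parameter list come from the issues on this page; the bodies are illustrative only and are not the library's implementation.

```cpp
#include <cstdio>

namespace cld2_sketch {  // hypothetical stand-in for the CLD2 namespace

// Declared unconditionally, so including code never needs CLD2_DYNAMIC_MODE.
bool isDataLoaded();
void loadDataFromRawAddress(const void* rawAddress, const int length);

#ifndef CLD2_DYNAMIC_MODE
// Static build: the tables are compiled in, so data always counts as loaded.
bool isDataLoaded() { return true; }

// No-op that warns on stderr; no exceptions, since some projects disable them.
void loadDataFromRawAddress(const void*, const int) {
  std::fprintf(stderr,
               "WARNING: Dynamic mode not active, "
               "loadDataFromRawAddress has no effect!\n");
}
#endif  // In a dynamic build the real definitions would live elsewhere.

}  // namespace cld2_sketch
```

With this shape, a caller compiles identically in both modes and only the linked definitions differ.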

Compilation error on Yosemite

./compile_libs.sh Warning: None of CFLAGS, CXXFLAGS or CPPFLAGS is set; you probably should enable some options.
ld: unknown option: -soname=libcld2.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ld: unknown option: -soname=libcld2_full.so
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Is it available for Yosemite? Please let me know.

Thanks in advance,
Canh

Build warning on Windows with clang

Originally reported on Google Code with ID 21

http://build.chromium.org/p/chromium.fyi/builders/Cr%20Win%20Clang/builds/108/steps/compile/logs/stdio

..\..\third_party\cld_2\src\internal\offsetmap.cc(82,43) :  warning(clang): format
specifies type 'long' but the argument has type 'size_type' (aka 'unsigned int') [-Wformat]
  fprintf(fout, "Offsetmap: %ld bytes\n", diffs_.size());
                            ~~~           ^~~~~~~~~~~~~
                            %u

There's no great portable way to printf size_t types. Since this is debugging code,
I suggest this patch:

Nicos-MacBook-Pro:src thakis$ svn diff
Index: internal/offsetmap.cc
===================================================================
--- internal/offsetmap.cc   (revision 165)
+++ internal/offsetmap.cc   (working copy)
@@ -79,7 +79,8 @@
   }

   Flush();    // Make sure any pending entry gets printed
-  fprintf(fout, "Offsetmap: %ld bytes\n", diffs_.size());
+  fprintf(fout, "Offsetmap: %lu bytes\n",
+          static_cast<unsigned long>(diffs_.size()));
   for (int i = 0; i < static_cast<int>(diffs_.size()); ++i) {
     fprintf(fout, "%c%02d ", "&=+-"[OpPart(diffs_[i])], LenPart(diffs_[i]));
     if ((i % 20) == 19) {fprintf(fout, "\n");}

Can you land this, please?

Reported by [email protected] on 2014-08-18 14:24:33
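
The cast used in the patch can be tried in isolation. This is a small sketch, assuming a hypothetical helper `FormatByteCount` and a `std::vector<int>` in place of the real `diffs_` member; it formats into a string so the result is easy to inspect.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: print a container size without depending on the width
// of size_type, by casting to unsigned long so the argument matches %lu.
std::string FormatByteCount(const std::vector<int>& diffs) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "Offsetmap: %lu bytes",
                static_cast<unsigned long>(diffs.size()));
  return std::string(buf);
}
```

On platforms where `unsigned long` is narrower than `size_t` (64-bit Windows, for example) the cast would truncate very large sizes, which seems acceptable for debugging output like this.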

Eliminate redundancy and/or simplify default case for compiling unittest_data.h

Originally reported on Google Code with ID 22

internal/unittest_data.h seems to use a mixture of escape sequences and raw non-ASCII
text. For maximum portability and safety, it would be best for the source code to use
all ASCII characters and escape the non-ASCII characters. This should help compiler
compatibility, though there are no reports of breakage since this Chromium.org issue
back in 2009:

https://code.google.com/p/chromium/issues/detail?id=20033

The change should be simple enough, and a script can be written to perform the transformation.

Reported by [email protected] on 2014-08-26 15:27:43
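
The transformation script mentioned in this issue could look roughly like the following sketch. The helper name `EscapeNonAscii` is hypothetical. It emits three-digit octal escapes rather than `\x` escapes, because a hex escape in a C++ string literal has no length limit and would absorb any hex digit that happens to follow it.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: rewrite a byte string so it can be pasted into an
// all-ASCII C++ string literal. Bytes outside printable ASCII become
// three-digit octal escapes; backslash and double quote are escaped too.
std::string EscapeNonAscii(const std::string& bytes) {
  std::string out;
  for (unsigned char c : bytes) {
    if (c == '\\' || c == '"') {
      out += '\\';
      out += static_cast<char>(c);
    } else if (c >= 0x20 && c < 0x7f) {
      out += static_cast<char>(c);
    } else {
      char buf[5];  // backslash + three octal digits + terminating NUL
      std::snprintf(buf, sizeof(buf), "\\%03o", static_cast<unsigned>(c));
      out += buf;
    }
  }
  return out;
}
```

Running it over the body of each literal in `unittest_data.h` would leave the decoded bytes unchanged while keeping the source file pure ASCII.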

SIGBUS on ARM32 in utf8statetable.cc:517

Originally reported on Google Code with ID 11

I'm trying to get CLD2 working on ARM32 inside of Chromium, cross-compiling from a linux
x64 host to arm32. The library loads properly, but the following crash occurs when
calling DetectLanguageSummary:

Program received signal SIGBUS, Bus error.
#0  CLD2::UTF8GenericScan (st=0x61a82104, str=<optimized out>, bytes_consumed=0x5f00f88c)
    at ../../third_party/cld_2/src/internal/utf8statetable.cc:518

I'll attach the full trace as a file. Well, minus the Chromium bits. Anyhow, the problem
appears to be with this snippet of code in utf8statetable.cc:

  // Do fast for groups of 8 identity bytes.
  // This covers a lot of 7-bit ASCII ~8x faster than the 1-byte loop,
  // including slowing slightly on cr/lf/ht
  //----------------------------
  const uint8* Tbl2 = &st->fast_state[0];
  uint32 losub = st->losub;
  uint32 hiadd = st->hiadd;
  while (src < srclimit8) {
    uint32 s0123 = (reinterpret_cast<const uint32 *>(src))[0];
    uint32 s4567 = (reinterpret_cast<const uint32 *>(src))[1];
    src += 8;


Inspecting the pointers in the debugger during the crash, and looking at the "src"
variable, seems to reveal the problem:
(gdb) p src
$32 = (
    const CLD2::uint8 *) 0x58de4bee "\n\n\n百度一下\n地图贴吧视频图片hao123\n新闻应用音乐文库更多\n小说游戏下载\n把百度放到桌面上,
搜索最方便\n触屏版极速版\nBaidu 京ICP证030173号"

Specifically, src is located at 0x58de4bee. Since this isn't a 4-byte (32-bit) aligned
address, the SIGBUS presumably comes from trying to read it as a uint32*. Many thanks
to [email protected] and [email protected] for the help in diagnosing this, I was
a bit lost in the weeds looking at my dynamic data changes, which turn out to be completely
unrelated (this happens with and without dynamic data mode).

The suggested workaround for this case is to %4 the address and do a one-off scan of
the first 0-3 bytes (as necessary), and then descend into the fast loop; the concern
is that there may be other places in CLD2 that have similar behavior and might be time
bombs. It might be a good idea to add some memory churning code to the unit tests,
and then start running the unit tests themselves on ARM to further diagnose other problems
like this that might arise.

Reported by [email protected] on 2014-03-20 15:11:07
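
One alignment-safe way to write such a fast loop is sketched below. Note that it swaps in a different technique from the `%4` prologue suggested in this issue: each 32-bit word is loaded with `memcpy`, which has no alignment requirement. The function `CountLeadingAscii` is a hypothetical stand-in, not the real `UTF8GenericScan`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load four bytes from any address. memcpy carries no alignment requirement,
// and compilers typically lower it to a plain load where that is safe.
static inline std::uint32_t LoadUnaligned32(const std::uint8_t* p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Hypothetical scanner: returns the number of leading 7-bit ASCII bytes,
// testing eight bytes per iteration and finishing with a one-byte loop.
std::size_t CountLeadingAscii(const std::uint8_t* src, std::size_t len) {
  std::size_t i = 0;
  while (i + 8 <= len) {
    std::uint32_t s0123 = LoadUnaligned32(src + i);
    std::uint32_t s4567 = LoadUnaligned32(src + i + 4);
    // Stop the fast loop as soon as any of the eight bytes has bit 7 set.
    if (((s0123 | s4567) & 0x80808080u) != 0) break;
    i += 8;
  }
  while (i < len && src[i] < 0x80) ++i;
  return i;
}
```

Because the loads never go through a `uint32*`, the same code works for a `src` at any address, including the odd address seen in the debugger session.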


- _Attachment: [crashstack.txt](https://storage.googleapis.com/google-code-attachments/cld2/issue-11/comment-0/crashstack.txt)_

A new language question

Is there a way to add support for detecting a new language to the library by myself?
For instance, I'd like to add Aragonese, Asturian, Occitan, Norwegian Nynorsk.

No languages output despite isReliable=True

Originally reported on Google Code with ID 1

I'm using Mike McCandless' Python binding to cld2. I originally reported this issue
to him, and he suggested I report it here (see https://code.google.com/p/chromium-compact-language-detector/issues/detail?id=15).

The issue is that for a particular input string, cld2 reports that the prediction is
reliable, but the set of languages detected is empty.

What steps will reproduce the problem?
1. import cld2
2. cld2.detect('interaktive infografik \xc3\xbcber videospielkonsolen')

What is the expected output? What do you see instead?
The output is 

(True, 49, ())


What version of the product are you using? On what operating system?
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2

cld2 was built using SVN rev 63, 
cld python module was built using hg changeset b1cad3f04ef4

Reported by saffsd on 2013-08-06 03:08:57

compact_lang_det.h: loadDataFromRawAddress should use types from stdint.h instead of "int"

Originally reported on Google Code with ID 12

The existing signature of loadDataFromRawAddress:
void loadDataFromRawAddress(const void* rawAddress, const int length);

The use of "int" here is dangerous because we don't know what the length will be on
any platform. This is my fault, since I'm the one who introduced this API. Before much
more time elapses, we should use a type from stdint.h instead. In this case I think
uint32_t would make the most sense, as we need more than 16 bits for sure but more
than 32 would be truly insane.

It's a simple patch; any objections?

Reported by [email protected] on 2014-03-26 11:56:54

Can't link "dynamic" and "full"

Originally reported on Google Code with ID 13

What steps will reproduce the problem?

This "g++" command-line is a mix between "full" and "dynamic":

g++ -O2 -m64 cld2_dynamic_data_tool.cc cld2_dynamic_data.cc cld2_dynamic_data_extractor.cc
cld2_dynamic_data_loader.cc cldutil.cc cldutil_shared.cc compact_lang_det.cc compact_lang_det_hint_code.cc
compact_lang_det_impl.cc debug.cc fixunicodevalue.cc generated_entities.cc generated_language.cc
generated_ulscript.cc getonescriptspan.cc lang_script.cc offsetmap.cc scoreonescriptspan.cc
tote.cc utf8statetable.cc cld_generated_cjk_uni_prop_80.cc cld2_generated_cjk_compatible.cc
cld_generated_cjk_delta_bi_32.cc generated_distinct_bi_0.cc cld2_generated_quad0122.cc
cld2_generated_deltaocta0122.cc cld2_generated_distinctocta0122.cc cld_generated_score_quad_octa_0122.cc
-o cld2_dynamic_data_tooandl

What is the expected output? What do you see instead?

cld2_dynamic_data_tool.cc:(.text.startup+0x293): Undefined `CLD2::kQuadChromeIndSize'
cld2_dynamic_data_tool.cc:(.text.startup+0x29d): Undefined `CLD2::kQuadChrome2IndSize'



Reported by doppelbauer on 2014-04-09 09:59:05

Valgrind errors?

Originally reported on Google Code with ID 3

Hi, thanks for the awesome library. I'm seeing a couple memory errors in valgrind when
I use it.

The first:

==7805== Conditional jump or move depends on uninitialised value(s)
==7805==    at 0x4C2CB94: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7805==    by 0x43C412: CLD2::DoTLDLookup(char const*, CLD2::TLDLookup const*, int)
(compact_lang_det_hint_code.cc:1034)
==7805==    by 0x43D705: CLD2::SetCLDTLDHint(char const*, CLD2::CLDLangPriors*) (compact_lang_det_hint_code.cc:1452)
==7805==    by 0x40CEB0: CLD2::ApplyHints(char const*, int, bool, CLD2::CLDHints const*,
CLD2::ScoringContext*) (compact_lang_det_impl.cc:1504)
==7805==    by 0x40DC4F: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints
const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk,
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1644)
==7805==    by 0x409BAE: CLD2::DetectLanguageSummary(char const*, int, bool, char const*,
int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==7805==    by 0x405932: codulus::main(int, char**) (test_language_detection.cc:43)
==7805==    by 0x406341: main (test_language_detection.cc:64)
==7805== 

This one seems reasonable to me, DoTLDLookup is using strcmp, but the value of 'key'
passed to it is not null terminated.
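
If that diagnosis is right, a length-bounded comparison would avoid the read. This is only a sketch, assuming a hypothetical helper `CompareBoundedKey`; it is not the code in `DoTLDLookup`.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: three-way compare of a length-delimited key against a
// NUL-terminated table entry. It never reads key[key_len] or anything past it.
int CompareBoundedKey(const char* key, std::size_t key_len, const char* entry) {
  std::size_t entry_len = std::strlen(entry);
  std::size_t n = key_len < entry_len ? key_len : entry_len;
  int c = std::memcmp(key, entry, n);
  if (c != 0) return c;
  if (key_len == entry_len) return 0;
  return key_len < entry_len ? -1 : 1;  // the shorter string sorts first
}
```

The return value has the same sign convention as `strcmp`, so it could drive the same binary search over a sorted table.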


The other issue I see is an invalid read of one character past the end of my input
in a couple places in the code:

==8337== Invalid read of size 1
==8337==    at 0x415932: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:973)
==8337==    by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*)
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints
const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk,
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*,
int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x414D3C: CLD2::UTF8OneCharLen(char const*) (utf8statetable.h:270)
==8337==    by 0x415A6D: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:991)
==8337==    by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*)
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints
const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk,
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*,
int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x41D1A3: CLD2::UTF8GenericPropertyTwoByte(CLD2::UTF8StateMachineObj_2
const*, unsigned char const**, int*) (utf8statetable.cc:403)
==8337==    by 0x414D24: CLD2::GetUTF8LetterScriptNum(char const*) (getonescriptspan.cc:1098)
==8337==    by 0x415A87: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) (getonescriptspan.cc:992)
==8337==    by 0x415DAE: CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*)
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, CLD2::CLDHints
const*, bool, int, CLD2::Language, CLD2::Language*, int*, double*, std::__1::vector<CLD2::ResultChunk,
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) (compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, char const*,
int, CLD2::Language, CLD2::Language*, int*, int*, bool*) (compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) (test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

For now, I'm working around this by passing (input, size - 1) instead of (input, size)
to cld2. My input is not null terminated, if that makes a difference. It seems to happen
with every input I try (they are all web pages, by the way). Also, I am running this
on x64 linux.

Any ideas?

Reported by chasw3 on 2013-08-21 06:04:23

Enable dynamic data for 20141015 release

Originally reported on Google Code with ID 25

The 20141015 tables don't compile with the dynamic data tool because they are missing
the hand-crafted "agnostic" constants that were put in for the old release. Attached
is a patch that appears to make this work for the dynamic data tool.

Reported by [email protected] on 2014-10-31 19:12:23


- _Attachment: [cld2_patch.txt](https://storage.googleapis.com/google-code-attachments/cld2/issue-25/comment-0/cld2_patch.txt)_

Undefined language on a page that looks normal

Originally reported on Google Code with ID 8

Apparently, CLD2 has some difficulties(*) with http://drugoi.livejournal.com/3971967.html


We are seeing UND (undefined) on chrome://translate-internals

*: or maybe we are mis-using it...

Reported by [email protected] on 2014-03-05 06:59:52

utf8statetable.cc contains a bunch of unused functions

Originally reported on Google Code with ID 38

What steps will reproduce the problem?
1. Compile cld2 with clang's -Wunused-function

What is the expected output? What do you see instead?

Expected: No warnings. Instead, warnings.


This patch fixes it:

diff --git a/internal/utf8statetable.cc b/internal/utf8statetable.cc
index aa9a98e..bd1e161 100644
--- a/internal/utf8statetable.cc
+++ b/internal/utf8statetable.cc
@@ -173,34 +173,6 @@ static inline bool InStateZero_2(const UTF8ReplaceObj_2* st,
   return (static_cast<uint32>(Tbl - Tbl0) < st->state0_size);
 }

-// UTF8PropObj, UTF8ScanObj, UTF8ReplaceObj are all typedefs of
-// UTF8MachineObj.
-
-static bool IsPropObj(const UTF8StateMachineObj& obj) {
-  return obj.fast_state == NULL
-      && obj.max_expand == 0;
-}
-
-static bool IsPropObj_2(const UTF8StateMachineObj_2& obj) {
-  return obj.fast_state == NULL
-      && obj.max_expand == 0;
-}
-
-static bool IsScanObj(const UTF8StateMachineObj& obj) {
-  return obj.fast_state != NULL
-      && obj.max_expand == 0;
-}
-
-static bool IsReplaceObj(const UTF8StateMachineObj& obj) {
-  // Normally, obj.fast_state != NULL, but the handwritten tables
-  // in utf8statetable_unittest don't handle fast_states.
-  return obj.max_expand > 0;
-}
-
-static bool IsReplaceObj_2(const UTF8StateMachineObj_2& obj) {
-  return obj.max_expand > 0;
-}
-
 // Look up property of one UTF-8 character and advance over it
 // Return 0 if input length is zero
 // Return 0 and advance one byte if input is ill-formed

Reported by [email protected] on 2015-07-26 03:27:39

cld2 testsuite failures

Originally reported on Google Code with ID 30

What steps will reproduce the problem?
1. checkout revision 194
2. use the cmake file (probably doesn't change anything)
3. use ubuntu 14.10 x64

build it and run tests
make[1]: Entering directory '/tmp/buildd/cld2-0.0.0~svn194'
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_chrome_2
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 81us (0 MB/sec), (null)
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_chrome_16
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 79us (0 MB/sec), (null)
cd obj-* && ./cld2_unittest_chrome_2 > /dev/null
*** Bad UTF-8 after 40 bytes<br>
Checking that non-dynamic implementations of dynamic data methods are no-ops (ignore
the warnings).
WARNING: Dynamic mode not active, loadDataFromFile has no effect!
WARNING: Dynamic mode not active, loadDataFromRawAddress has no effect!
WARNING: Dynamic mode not active, unloadData has no effect!
Done checking non-dynamic implementations of dynamic data methods, care about warnings
again.
PASS
cd obj-* && ./cld2_unittest_avoid_chrome_2 > /dev/null
*** Bad UTF-8 after 40 bytes<br>
Checking that non-dynamic implementations of dynamic data methods are no-ops (ignore
the warnings).
WARNING: Dynamic mode not active, loadDataFromFile has no effect!
WARNING: Dynamic mode not active, loadDataFromRawAddress has no effect!
WARNING: Dynamic mode not active, unloadData has no effect!
Done checking non-dynamic implementations of dynamic data methods, care about warnings
again.
PASS
cd obj-* && echo "this is some english text" | ./compact_lang_det_test_full
ExtLanguage ENGLISH(96% 1772p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 153us (0 MB/sec), (null)
cd obj-* && ./cld2_unittest_full > /dev/null
PASS
cd obj-* && ./cld2_unittest_full_avoid > /dev/null
PASS
cd obj-* && ./cld2_dynamic_data_tool --dump cld2_data.bin
cd obj-* && ./cld2_dynamic_data_tool --verify cld2_data.bin
cd obj-* && echo "this is some english text" | ./compact_lang_det_dynamic_test_chrome
--data-file cld2_data.bin
Loading data from: cld2_data.bin
Data loaded, test commencing
ExtLanguage ENGLISH(96% 1851p), 27/26 bytes of non-tag letters, Summary: ENGLISH
  SummaryLanguage ENGLISH at 0 of 26 69us (0 MB/sec), --data-file
cd obj-* && ./cld2_dynamic_unittest --data-file cld2_data.bin > /dev/null
*** Bad UTF-8 after 40 bytes<br>
*** Bad UTF-8 after 40 bytes<br>
PASS
make[1]: Leaving directory '/tmp/buildd/cld2-0.0.0~svn194'


don't know, is everything ok?

Reported by costamagna.gianfranco on 2015-02-12 17:24:26

Compilation failure on VS2015 on Windows

Originally reported on Google Code with ID 32

In building chromium, cld2 emits narrowing warnings due to stricter checks in the new
compiler.

Attached is a patch against the svn repo to fix them.

Reported by [email protected] on 2015-04-30 18:18:22


- _Attachment: [patch.diff](https://storage.googleapis.com/google-code-attachments/cld2/issue-32/comment-0/patch.diff)_

Missing include in cld2_dynamic_data_loader.cc

Originally reported on Google Code with ID 23

What steps will reproduce the problem?
1. Try to compile the cld_2_dynamic_data_tool with gcc 4.8 in Ubuntu 12.04

What is the expected output? What do you see instead?

It fails because close() isn't defined. close() is declared in <unistd.h> and adding
that include makes it compile.

I could do a patch but I suspect it is much faster for everyone if a maintainer
just does this manually:

index 7227b8e..06375e18 100644
--- a/third_party/cld_2/src/internal/cld2_dynamic_data_loader.cc
+++ b/third_party/cld_2/src/internal/cld2_dynamic_data_loader.cc
@@ -19,6 +19,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
+#include <unistd.h>

 #include "cld2_dynamic_data.h"
 #include "cld2_dynamic_data_loader.h"


Reported by [email protected] on 2014-08-29 08:22:52

Language Detection with CLD2 with Mixed Inputs in long documents

Internals recap: CLD2 is a Naïve Bayesian classifier, trained on a corpus of 100M scraped and human-expert-selected web pages, with documents of a mean size of 200 characters.

When working on long documents of size

~3000-4000 words,
~40,000-50,000 characters

of mixed input text (at least 2-3 languages in the same document), I see that CLD2 fails to recognize all of the languages and reports only the most common one, as if it were polarized around that language, as in this document excerpt:

Only come and treat me right
And you'll never guilty
Sekarang kamu sudah ada di depanku
Aku pun berdebar menanti kata-katamu
Honey Bunny Sweety
Let's take a chance

This is recognized as English, so I get

{
  "results": [
    {
      "reliable": true,
      "detection": {
        "name": "ENGLISH",
        "code": "en",
        "percent": 54,
        "score": 930
      }
    }
  ]
}

while I would expect at least 2 languages here. Internally, CLD2 uses an n-gram decomposition of the input text, which is known to perform very well for language detection: the feature space is very compact when using bigrams, since with the Latin alphabet you get 26^2 = 676 possible bigram features in the training set. See here for more details.

If I generate n-grams of a given size (in this case N=2) from this document, I get instead

[
  {
    "count": 86,
    "code": "na",
    "name": "na",
    "mean": 0.45989304812834225
  },
  {
    "count": 50,
    "mean": 16.503352692086242,
    "code": "id",
    "name": "INDONESIAN"
  },
  {
    "count": 38,
    "mean": 12.779225483523962,
    "code": "en",
    "name": "ENGLISH"
  },
  {
    "count": 13,
    "mean": 1.5371176291771826,
    "code": "ms",
    "name": "MALAY"
  }
]

i.e. a more detailed detection of the mixed input languages in this document. Of course, this detection depends on N, the size of the n-grams, so for some values of N it may produce false positives (such as detecting a language that is not in the mixed input).
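
The n-gram extraction step described here can be sketched as follows. This is an assumption-laden illustration: the helper `CountNgrams` is hypothetical, it handles ASCII letters only, and it does not reproduce CLD2's own scoring or the per-n-gram language assignment behind the figures in this post.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Count overlapping letter n-grams (n >= 1) in ASCII text. Letters are
// lowercased, and any non-letter ends the current run, so an n-gram never
// spans a word boundary.
std::map<std::string, int> CountNgrams(const std::string& text, std::size_t n) {
  std::map<std::string, int> counts;
  std::string run;  // current run of consecutive letters
  for (unsigned char c : text) {
    if (std::isalpha(c)) {
      run += static_cast<char>(std::tolower(c));
      if (run.size() >= n) ++counts[run.substr(run.size() - n)];
    } else {
      run.clear();
    }
  }
  return counts;
}
```

With `n = 2` the map can hold at most 26^2 = 676 distinct keys for this alphabet, which is the compact feature space the post refers to; `n = 4` gives the quadgram size that the CLD2 README describes for most scripts.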

Assuming that CLD2 uses n-grams internally and that the Bayes classifier was trained on text about 200 characters long (~2-3 sentences), it seems to be polarized in some way on long mixed input, and to give better results when the text is processed in small pieces as above. The question is whether this reasoning holds, and whether there is a different approach that would lead to the same results obtained here.

Originally posted here.

CLD should check result of "new" in all use cases

Originally reported on Google Code with ID 27

There are many uses of the "new" operator in the CLD source code, such as in scoreonescriptspan.cc's
"new ScoringHitBuffer":

https://code.google.com/p/cld2/source/browse/trunk/internal/scoreonescriptspan.cc#1168

There's no check that the "new" operator successfully allocated memory. In low-memory
conditions this can lead to an access violation and subsequent crash.

The code should fail gracefully under low-memory conditions, though it isn't immediately
obvious how to "gracefully" fail or how helpful it would be to the caller to have such
behavior if they are truly out of memory.

Reported by [email protected] on 2015-01-07 12:19:31
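
One point worth keeping in mind for this issue: in standard C++ a plain `new` reports failure by throwing `std::bad_alloc`, so comparing its result with NULL checks nothing; `new (std::nothrow)` is the form that returns a null pointer. A minimal sketch, with `HitBuffer` as a hypothetical stand-in for `ScoringHitBuffer`:

```cpp
#include <new>

struct HitBuffer {  // hypothetical stand-in for ScoringHitBuffer
  int hits[1000];
};

// Returns false instead of crashing when the allocation fails, leaving the
// caller to decide how to degrade (for example, report "unknown language").
bool ScoreWithBuffer(int* result) {
  HitBuffer* buf = new (std::nothrow) HitBuffer();
  if (buf == nullptr) return false;
  buf->hits[0] = 1;  // placeholder for the real scoring work
  *result = buf->hits[0];
  delete buf;
  return true;
}
```

Whether a boolean failure result is useful to callers that are truly out of memory remains the open question raised in this issue.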

c++0x support

Currently, cld2 fails in c++0x builds with errors like c++11-narrowing. This happens, specifically, by default in bazel builds that include cld2.

It'd be nice to have those cleared up. I'm a little afraid to touch files with "generated" in their name instead of editing their sources. Perhaps, y'all have access to those and can fix up the code?

You can repro the problem with CFLAGS=-std=c++0x ./compile.sh and you'll get lots of errors similar to cld_generated_cjk_uni_prop_80.cc:169:9: error: constant expression evaluates to -14 which cannot be narrowed to type 'uint8' (aka 'unsigned char') [-Wc++11-narrowing].

If their sources are somewhere I can get to, I could try to take a whack at this.

Support mmap-ing dynamic data on win32

Originally reported on Google Code with ID 20

As described in issue 19, the current implementation of dynamic data won't work on Windows
because it relies on:
 * from sys/mman.h: mmap(), munmap()
 * from unistd.h: close()

These header files don't exist in vanilla win32 build environments, so compatibility
is broken. The fix for close() is being implemented in issue 19, but the fix for mmap()
is less straightforward.

Reported by [email protected] on 2014-08-13 11:40:10
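
Until a native mapping is available, one possible fallback is to read the data file into memory with stdio only, which needs neither `sys/mman.h` nor `unistd.h`. This is a sketch under that assumption; `ReadFileToBuffer` is a hypothetical helper, and the resulting buffer would presumably be handed to `loadDataFromRawAddress`.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical fallback loader: read a whole file into memory using only
// fopen/fread/fclose. Returns false on any I/O error and leaves `out` empty.
bool ReadFileToBuffer(const char* path, std::vector<unsigned char>* out) {
  out->clear();
  std::FILE* f = std::fopen(path, "rb");
  if (f == nullptr) return false;
  unsigned char chunk[4096];
  std::size_t n;
  while ((n = std::fread(chunk, 1, sizeof(chunk), f)) > 0) {
    out->insert(out->end(), chunk, chunk + n);
  }
  bool ok = (std::ferror(f) == 0);
  std::fclose(f);
  if (!ok) out->clear();
  return ok;
}
```

The trade-off is that the tables are copied onto the heap of each process instead of being shared read-only pages, and the buffer must outlive any use of the loaded data.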

cld2_dynamic_data.cc and cld2_dynamic_data_loader.cc problems on Win32

Originally reported on Google Code with ID 24

Chromium build output from one of the buildbots:

FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma/gomacc "E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe"
/nologo /showIncludes /FC @obj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data.obj.rsp
/c ..\..\third_party\cld_2\src\internal\cld2_dynamic_data.cc /Foobj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data.obj
/Fdobj\third_party\cld_2\cld2_dynamic.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(33)
: error C2039: 'max' : is not a member of 'std'
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(33)
: error C3861: 'max': identifier not found
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data.cc(85)
: warning C4018: '<' : signed/unsigned mismatch
FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma/gomacc "E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe"
/nologo /showIncludes /FC @obj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data_loader.obj.rsp
/c ..\..\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc /Foobj\third_party\cld_2\src\internal\cld2_dynamic.cld2_dynamic_data_loader.obj
/Fdobj\third_party\cld_2\cld2_dynamic.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(99)
: error C2220: warning treated as error - no 'object' file generated
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(99)
: warning C4018: '<' : signed/unsigned mismatch
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_dynamic_data_loader.cc(235)
: warning C4018: '<' : signed/unsigned mismatch

This should be fixed.

Reported by [email protected] on 2014-10-01 14:36:06
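
One plausible reading of this log, not confirmed by it, is that `<algorithm>` was only being included transitively on other toolchains, and that loop indices were `int` while the bounds were unsigned. A sketch of both fixes, using a hypothetical helper `LongestTable`:

```cpp
#include <algorithm>  // declares std::max; do not rely on transitive includes
#include <cstddef>
#include <vector>

// Hypothetical helper: size of the longest table. Index and bound share one
// unsigned type, so there is no signed/unsigned comparison (MSVC C4018).
std::size_t LongestTable(const std::vector<std::vector<int> >& tables) {
  std::size_t longest = 0;
  for (std::size_t i = 0; i < tables.size(); ++i) {
    longest = std::max(longest, tables[i].size());
  }
  return longest;
}
```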

new code location?

Originally reported on Google Code with ID 35

Hi, since google code is closing, where do you plan to move the packaging?

thanks!

Reported by costamagna.gianfranco on 2015-05-06 14:21:31

Missing Apache license header text in several source files

Originally reported on Google Code with ID 10

The following files are affected:
cld2_generated_quadchrome0122_16.cc
cld2_generated_deltaoctachrome0122.cc
cld2_generated_deltaocta0122.cc
cld2_generated_quadchrome0122_19.cc
cld2_generated_distinctoctachrome0122.cc
cld2_generated_distinctocta0122.cc
cld2_generated_quadchrome0122_2.cc

Unfortunately this prevents Chrome from rolling to the latest CLD2. Patch attached.

Reported by [email protected] on 2014-03-14 17:20:37


- _Attachment: [diff.patch](https://storage.googleapis.com/google-code-attachments/cld2/issue-10/comment-0/diff.patch)_

Enable dynamic mode

Originally reported on Google Code with ID 6

As discussed offline, this is a patch to enable CLD2 to run in "dynamic" mode. In dynamic
mode the kScoringtables struct is populated from a file at runtime instead of being
compiled into the program as a read-only section in the binary.

This patch adds a new cld2_dynamic_data_tool and accompanying build instructions, and
patches the unit tests to exercise all dynamic functionality. Data can be loaded, unloaded,
and reloaded - theoretically allowing continuous operations of the program when updated
tables are available.

It still has some hardcoding, but we can fix the underlying issues in the source code
easily in another pass as you've suggested.

Reported by [email protected] on 2014-02-25 18:30:09


- _Attachment: [cld2_dynamic_mode.patch](https://storage.googleapis.com/google-code-attachments/cld2/issue-6/comment-0/cld2_dynamic_mode.patch)_

please provide a SONAME

Originally reported on Google Code with ID 29

Can you please provide a SONAME for the library?

Installing something in usr/lib without a SONAME is so painful.

Reported by costamagna.gianfranco on 2015-02-10 15:37:26

Fails to build from source with upcoming gcc-6

Hi, this is the error:

> [ 29%] Building CXX object CMakeFiles/cld2_full.dir/internal/cld_generated_cjk_delta_bi_32.cc.o
> /usr/bin/c++  -Dcld2_full_EXPORTS  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2  -fPIC  -o CMakeFiles/cld2_full.dir/internal/cld_generated_cjk_delta_bi_32.cc.o -c /<<PKGBUILDDIR>>/internal/cld_generated_cjk_delta_bi_32.cc
> /<<PKGBUILDDIR>>/internal/scoreonescriptspan.cc: In function 'void CLD2::ScoreEntireScriptSpan(const CLD2::LangSpan&, CLD2::ScoringContext*, CLD2::DocTote*, CLD2::ResultChunkVector*)':
> /<<PKGBUILDDIR>>/internal/scoreonescriptspan.cc:1149:5: warning: narrowing conversion of 'score' from 'int' to 'CLD2::uint16 {aka short unsigned int}' inside { } [-Wnarrowing]
>      };
>      ^
>
> /<<PKGBUILDDIR>>/internal/scoreonescriptspan.cc:1149:5: warning: narrowing conversion of 'bytes' from 'int' to 'CLD2::uint16 {aka short unsigned int}' inside { } [-Wnarrowing]
> /<<PKGBUILDDIR>>/internal/scoreonescriptspan.cc:1149:5: warning: narrowing conversion of 'reliability' from 'int' to 'CLD2::uint8 {aka unsigned char}' inside { } [-Wnarrowing]
[snip]
> /<<PKGBUILDDIR>>/internal/scoreonescriptspan.cc:1149:5: warning: narrowing conversion of 'reliability' from 'int' to 'CLD2::uint8 {aka unsigned char}' inside { } [-Wnarrowing]
> /<<PKGBUILDDIR>>/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error: narrowing conversion of '-14' from 'int' to 'CLD2::uint8 {aka unsigned char}' inside { } [-Wnarrowing]
[snip]
>  };
>  ^
>
> narrowing conversion of '-30' from 'int' to 'CLD2::uint8 {aka unsigned char}' inside { } [-Wnarrowing]

Of course, uint8 can't hold negative numbers, and if you change the type to int8 it can't hold values over 127.

How can I fix it?
Thanks
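
For readers hitting the same error, here is a minimal sketch of why the compiler rejects these tables and one way to keep the same byte values. The `uint8` typedef and the `kProp` table are stand-ins invented for illustration, not CLD2's actual definitions, and this is not necessarily how the project resolved the issue.

```cpp
#include <cstdint>

typedef std::uint8_t uint8;  // stand-in for CLD2::uint8 (assumption)

// Ill-formed since C++11: -14 does not fit in an unsigned char, so
// list-initialization would have to narrow it.
//   static const uint8 kBad[] = {3, -14, 200};

// An explicit cast keeps the same bit pattern (-14 wraps to 242) and keeps
// values above 127 representable, which int8 would not.
static const uint8 kProp[] = {3, static_cast<uint8>(-14), 200};

// Number of entries in the illustrative table.
static const int kPropSize = sizeof(kProp) / sizeof(kProp[0]);
```

Having the table generator emit the wrapped value (242 here) directly would presumably have the same effect without any casts in the generated file.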

Compilation issues in Visual Studio

Originally reported on Google Code with ID 31

I am trying to compile the chromium in Visual Studio 2013. I am actually trying to create
a .NET Wrapper for the library so I have added all the source files inside my CLR project.

Now whenever I compile I get these linking errors.

    error LNK2005: "struct CLD2::CLD2TableSummary const CLD2::kCjkDeltaBi_obj" (?kCjkDeltaBi_obj@CLD2@@3UCLD2TableSummary@1@B)
already defined in cld_generated_cjk_delta_bi_32.obj

These all seem to be related, as I can see a relation between the 'generated' files.

The problem is that I have a lot of these and I am not sure which ones I should exclude and
which I should keep and use in my code.

Here is a list of all the generated files that came with the CLD2 code.

    cld_generated_cjk_uni_prop_80.cc
    cld_generated_score_quad_octa_2.cc
    cld_generated_score_quad_octa_0122.cc
    cld_generated_score_quad_octa_0122_2.cc
    cld_generated_score_quad_octa_1024_256.cc
    cld_generated_cjk_delta_bi_4.cc
    cld_generated_cjk_delta_bi_32.cc
    cld2_generated_octa2_dummy.cc
    cld2_generated_quad0122.cc
    cld2_generated_quad0720.cc
    cld2_generated_quadchrome_2.cc
    cld2_generated_quadchrome_16.cc
    cld2_generated_cjk_compatible.cc
    cld2_generated_deltaocta0122.cc
    cld2_generated_deltaocta0527.cc
    cld2_generated_deltaoctachrome.cc
    cld2_generated_distinctocta0122.cc
    cld2_generated_distinctocta0527.cc
    cld2_generated_distinctoctachrome.cc

The naming convention of these suggests that I should only be using one of each group.
At least that's how I think I should use it, as I am not really an expert in encoding
nor in how CLD2 works. And I could not find any references online explaining how to
configure it.

I tried eliminating the linking errors by keeping only one of each generated group:

for example: from `cld_generated_cjk_delta_bi_4` and `cld_generated_cjk_delta_bi_32`
I kept the 32 version. And so on for the rest of the files.

Now this made CLD2 compile, yet when I tried testing it with languages I noticed that
the scores were way off and it was behaving inexplicably badly.

I am not trying to support all languages; I only need to support Latin languages along
with Hebrew, Arabic, Japanese and Chinese.

Can someone please explain how to configure CLD2 to compile and work correctly?


Reported by redserpent7 on 2015-03-30 05:57:39

Private field next_byte_limit_ is unused

Originally reported on Google Code with ID 37

..\..\third_party\cld_2\src\internal/getonescriptspan.h(89,15) :  warning(clang): private
field 'next_byte_limit_' is not used [-Wunused-private-field]
  const char* next_byte_limit_;   // Last byte + 1
              ^

Looks like this warning is correct.

Reported by [email protected] on 2015-07-10 03:06:32

Add armv8-a support

Originally reported on Google Code with ID 34

On behalf of [email protected] (cc'd):

--- snip ---
I would like to send a patch to CLD_2 for adding ARMv8a to the supporting list in internal/port.h.
It’s my first time to send patches to CLD_2 project, and I have no idea how to upload
it.

Can you take a look at the attached file to check if this modification is useful? Is
there any issue related to this modification? Also, can you tell me how to upload the
patch properly?

Thanks for your kindly help.
--- snip ---

Reported by [email protected] on 2015-05-05 09:46:30


- _Attachment: [0001-cld_2-ARMv8-patch.patch](https://storage.googleapis.com/google-code-attachments/cld2/issue-34/comment-0/0001-cld_2-ARMv8-patch.patch)_

Post-dynamic-mode cleanup

Originally reported on Google Code with ID 7

There are a few things to clean up in r151:
* Use the newly-added constants in the table classes to avoid hardcoding sizes
* Ensure cld2_generated_quadchrome0122_16.cc works with both active tables in dynamic
mode
* Add the ability to use an already-extant mmap to load the data from (rather than
managing the mmap directly). This is necessary for systems (such as Chromium) where
the security model forbids direct access to the filesystem in some contexts where CLD2
might be used

Should all be pretty straightforward. Remove all FIXME and TODO comments added by [email protected]
as well.

Reported by [email protected] on 2014-03-03 15:25:27

CLD2 result chunk vector omits portions of input file

Originally reported on Google Code with ID 17

Hello,

I'm trying to extract natural language from a web crawl for use in NLP applications.
Since web pages often have multiple languages on them, I'm using CLD2's ResultChunkVector
API to split each page into chunks of known uniform language. The problem I'm running
into is that fairly often, the ResultChunkVector simply doesn't include parts of the
input text -- I've attached two sample files that demonstrate this. In 32200.utf8,
the first chunk starts at position 8 -- I guess this has something to do with the fact
that the file starts with numbers/punctuation? In 27878255.utf8, the first chunk covers
positions 0-65530, and the second chunk begins at position 199884 (so there's a very
substantial amount of text being skipped! and the text appears to be plain old English,
nothing special) -- I guess this might have something to do with the use of a 2-byte
length field, but the length of the first chunk isn't 2**16. And perhaps there are
other cases that also lead to gaps like this.

My expectation was that the first chunk would always start at position 0, that each
chunk would start where the previous one ended, and that the last chunk would end at
the end of the input file. Or, if this isn't possible, then is there any guidance on
how gaps like this should be interpreted? I could simply pretend they were tagged "unknown",
but this seems like a pretty weird way to handle the 140 kB of English text in 27878255.utf8.

I'm using the "full" detector, but these files trigger the behaviour in both full and
regular modes (slightly differently).

Reported by [email protected] on 2014-07-01 11:04:03


- _Attachment: [27878255.utf8](https://storage.googleapis.com/google-code-attachments/cld2/issue-17/comment-0/27878255.utf8)_
- _Attachment: [32200.utf8](https://storage.googleapis.com/google-code-attachments/cld2/issue-17/comment-0/32200.utf8)_
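
One way to cope with such gaps on the caller's side is to tile the input explicitly, inserting placeholder chunks for any byte range the detector did not report. The sketch below uses a local `Chunk` struct as a stand-in for CLD2's result chunk (the field layout is an assumption based on the 2-byte length field mentioned in the report), and `kGapLang` and `FillGaps` are hypothetical names. It assumes the chunks arrive sorted by offset, and it does not explain why CLD2 leaves the gaps in the first place.

```cpp
#include <vector>

// Local stand-in for a CLD2 result chunk (field layout is an assumption).
struct Chunk {
  int offset;            // byte offset into the input text
  unsigned short bytes;  // chunk length in bytes (16-bit field)
  unsigned short lang;   // language id
};

// Hypothetical marker for "not covered by any reported chunk".
const unsigned short kGapLang = 0xFFFF;

// Returns a copy of `chunks` with placeholder entries added so that the
// result covers [0, total_bytes) without holes. Input must be sorted by offset.
std::vector<Chunk> FillGaps(const std::vector<Chunk>& chunks, int total_bytes) {
  std::vector<Chunk> out;
  int pos = 0;  // first byte not yet covered

  // A gap longer than 65535 bytes is split, since `bytes` is only 16 bits wide.
  auto emit_gap = [&out](int from, int to) {
    while (from < to) {
      int n = to - from;
      if (n > 65535) n = 65535;
      out.push_back({from, static_cast<unsigned short>(n), kGapLang});
      from += n;
    }
  };

  for (const Chunk& c : chunks) {
    if (c.offset > pos) emit_gap(pos, c.offset);
    out.push_back(c);
    int end = c.offset + c.bytes;
    if (end > pos) pos = end;
  }
  if (pos < total_bytes) emit_gap(pos, total_bytes);
  return out;
}
```

Whether a placeholder should be treated as "unknown" or re-submitted to the detector on its own is a separate decision; the sketch only makes the uncovered ranges visible.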

Add possibility to set MinReliableKeepPercent

Originally reported on Google Code with ID 36

What steps will reproduce the problem?
1. try to detect the language of attached input file
2. see the output is "unknown"

What is the expected output? What do you see instead?
I would expect either 'persian' or 'arabic'

What version of the product are you using? On what operating system?
rev195 on centos 7

Please provide any additional information below.

CLD2 returns "unknown" because the reliability is lower than kMinReliableKeepPercent
(in compact_lang_det_impl.cc) :
static const int kMinReliableKeepPercent = 41;  // Remove lang if reli < this

Would adding an additional parameter to the DetectLanguageXXX(...) in order to set
this threshold be acceptable ?

Regards

Reported by William.Tambellini on 2015-06-11 17:07:38


- _Attachment: [input_ara_only.txt](https://storage.googleapis.com/google-code-attachments/cld2/issue-36/comment-0/input_ara_only.txt)_
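
To make the request concrete, the following is a rough sketch of what a caller-supplied threshold could look like. It is not CLD2 code: `LangScore` and `KeepReliable` are hypothetical names, and the only detail taken from the report is the default value of 41 and the rule "remove lang if reliability < this".

```cpp
#include <vector>

// Hypothetical record for illustration; not a CLD2 type.
struct LangScore {
  int lang;         // language id
  int reliability;  // percent, 0..100
};

// Default taken from the constant quoted in the report.
static const int kMinReliableKeepPercent = 41;

// Keeps only the entries whose reliability is at least `min_keep_percent`.
// Passing a lower threshold lets low-reliability guesses survive.
std::vector<LangScore> KeepReliable(
    const std::vector<LangScore>& in,
    int min_keep_percent = kMinReliableKeepPercent) {
  std::vector<LangScore> out;
  for (const LangScore& s : in) {
    if (s.reliability >= min_keep_percent) out.push_back(s);
  }
  return out;
}
```

With a default argument, existing callers would keep today's behaviour, while a caller working with short or noisy text could opt into a lower cutoff.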

Windows compile process for Chromium unhappy with zero-length array declarations

Originally reported on Google Code with ID 9

In these files (and any others, obviously):
cld2_generated_distinctoctachrome0122
cld2_generated_deltaoctachrome0122

The Windows compile chain for Chromium is upset because there is an attempt to declare
a zero-length array. Dick has noted this as a concern when we let the size be zero,
and it seems the concern is valid under the Chromium build chain on Windows.

From Chromium's buildbots, here are the error messages from compilation:

FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma\gomacc.exe "E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe"
/nologo /showIncludes /FC @obj\third_party\cld_2\src\internal\cld_2.cld2_generated_distinctoctachrome0122.obj.rsp
/c ..\..\third_party\cld_2\src\internal\cld2_generated_distinctoctachrome0122.cc /Foobj\third_party\cld_2\src\internal\cld_2.cld2_generated_distinctoctachrome0122.obj
/Fdobj\third_party\cld_2\cld_2.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_distinctoctachrome0122.cc(2184)
: error C2466: cannot allocate an array of constant size 0
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_distinctoctachrome0122.cc(2186)
: error C2466: cannot allocate an array of constant size 0
FAILED: ninja -t msvc -e environment.x86 -- E:\b\build\goma\gomacc.exe "E:\b\depot_tools\win_toolchain\vs2013_files\VC\bin\amd64_x86\cl.exe"
/nologo /showIncludes /FC @obj\third_party\cld_2\src\internal\cld_2.cld2_generated_deltaoctachrome0122.obj.rsp
/c ..\..\third_party\cld_2\src\internal\cld2_generated_deltaoctachrome0122.cc /Foobj\third_party\cld_2\src\internal\cld_2.cld2_generated_deltaoctachrome0122.obj
/Fdobj\third_party\cld_2\cld_2.cc.pdb 
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_deltaoctachrome0122.cc(4577)
: error C2466: cannot allocate an array of constant size 0
e:\b\build\slave\win\build\src\third_party\cld_2\src\internal\cld2_generated_deltaoctachrome0122.cc(4579)
: error C2466: cannot allocate an array of constant size 0
ninja: build stopped: subcommand failed.

The workaround we had in place before was to have the constants for size *say* zero,
i.e. the code will never read anything from the array and the dynamic data tool will
just skip it. We'd then actually allocate an array of size one (however many bytes,
usually 4 for our use cases of uint32). This makes the compiler happy at a cost of
a few bytes of overhead in non-dynamic mode. Seems like we don't really have a choice
here, so I'll prepare the patch.

Reported by [email protected] on 2014-03-12 11:32:40
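
The workaround described above can be sketched as follows. The names `kEmptyTableSize`, `kEmptyTable` and `SumTable` are made up for illustration and do not appear in the generated CLD2 files; the point is only that the logical size stays zero while the physical array has one element.

```cpp
#include <cstddef>
#include <cstdint>

// Logical size: readers of the table consult this and never touch the array.
static const std::size_t kEmptyTableSize = 0;

// Physical storage: one dummy element, because a declaration such as
// `uint32_t t[0]` is rejected by MSVC with error C2466.
static const std::uint32_t kEmptyTable[1] = {0};

// Sums the logical entries of a table; with size 0 the dummy is never read.
std::uint32_t SumTable(const std::uint32_t* table, std::size_t size) {
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) sum += table[i];
  return sum;
}
```

The cost is the few bytes of the dummy element per empty table, as noted in the report.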

Bilingual text not properly divided into chunks

The following test program takes an English/Italian bilingual text and uses CLD2 to divide it into chunks. This should be an easy case for chunk detection, since the language consistently changes at paragraph boundaries. However, only the first chunk boundary is correctly detected; all subsequent chunk boundaries are off by a few words, and (as a consequence, I think) it fails to identify the language of some of the chunks.

#include <vector>
#include <stdio.h>
#include <cld2/public/encodings.h>
#include <cld2/public/compact_lang_det.h>

static const char text[] =
"In my younger and more vulnerable years my father gave me some advice\n"
"that I've been turning over in my mind ever since.\n"
"\n"
"Nei miei anni più giovani e più vulnerabili mio padre mi diede un\n"
"consiglio su cui da allora non ho più smesso di rimuginare.\n"
"\n"
"«Whenever you feel like criticizing any one,» he told me, «just\n"
"remember that all the people in this world haven't had the advantages\n"
"that you've had.»\n"
"\n"
"«Quando ti viene voglia di criticare qualcuno» mi disse «ricordati\n"
"solo che non tutti a questo mondo hanno avuto i vantaggi che hai avuto\n"
"tu».\n"
"\n"
"He didn't say any more, but we've always been unusually communicative\n"
"in a reserved way, and I understood that he meant a great deal more\n"
"than that.\n"
"\n"
"Non disse altro, ma siamo sempre stati straordinariamente comunicativi\n"
"senza tante parole e capii che voleva dire molto di più di questo.\n"
"\n"
"In consequence, I'm inclined to reserve all judgments, a habit that\n"
"has opened up many curious natures to me and also made me the victim\n"
"of not a few veteran bores.\n"
"\n"
"Di conseguenza, sono inclìne a evitare ogni giudizio, un'abitudine che\n"
"mi ha rivelato molti caratteri strani e mi ha anche reso vittima di\n"
"non pochi rompiscatole di lungo corso.\n"
"\n"
"The abnormal mind is quick to detect and attach itself to this quality\n"
"when it appears in a normal person, and so it came about that in\n"
"college I was unjustly accused of being a politician, because I was\n"
"privy to the secret griefs of wild, unknown men.\n"
"\n"
"La mente anormale è lesta a riconoscere e a aggrapparsi a questa\n"
"qualità quand'essa si manifesta in una persona normale, e così accadde\n"
"che all'università fui ingiustamente accusato di essere un\n"
"politicante, perché ero al corrente delle pene segrete di uomini\n"
"sregolati e sconosciuti.\n";

int main(void)
{
  CLD2::CLDHints hints;
  hints.content_language_hint = 0;
  hints.tld_hint = 0;
  hints.encoding_hint = CLD2::UNKNOWN_ENCODING;
  hints.language_hint = CLD2::UNKNOWN_LANGUAGE;

  CLD2::Language top3[3];
  int pct3[3];
  double score3[3];
  CLD2::ResultChunkVector chunks;
  int text_bytes;
  bool reliable;

  CLD2::ExtDetectLanguageSummary(text, sizeof text - 1,
                                 true, &hints, 0,
                                 top3, pct3, score3, &chunks,
                                 &text_bytes, &reliable);

  puts("<!doctype html><meta charset=\"utf-8\"><body>");
  CLD2::DumpResultChunkVector(stdout, text, &chunks);
  puts("</body>");
  return 0;
}

And here's the program output for me:

<!doctype html><meta charset="utf-8"><body>
DumpResultChunkVector[12]<br>
[0]{0 122 en}  <span style="background:#FFFFF4;color:#000000;">
In my younger and more vulnerable years my father gave me some advice that I&apos;ve been turning over in my mind ever since.  </span><br>
[1]{122 117 it}  <span style="background:#E3FFD8;color:#000000;">
Nei miei anni più giovani e più vulnerabili mio padre mi diede un consiglio su cui da allora non ho più smesso di </span><br>
[2]{239 38 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
rimuginare.  «Whenever you feel like </span><br>
[3]{277 144 en}  <span style="background:#FFFFF4;color:#000000;">
criticizing any one,» he told me, «just remember that all the people in this world haven&apos;t had the advantages that you&apos;ve had.»  «Quando ti </span><br>
[4]{421 139 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
viene voglia di criticare qualcuno» mi disse «ricordati solo che non tutti a questo mondo hanno avuto i vantaggi che hai avuto tu».  He </span><br>
[5]{560 147 en}  <span style="background:#FFFFF4;color:#000000;">
didn&apos;t say any more, but we&apos;ve always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that.  </span><br>
[6]{707 131 it}  <span style="background:#E3FFD8;color:#000000;">
Non disse altro, ma siamo sempre stati straordinariamente comunicativi senza tante parole e capii che voleva dire molto di più di </span><br>
[7]{838 178 en}  <span style="background:#FFFFF4;color:#000000;">
questo.  In consequence, I&apos;m inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores.  Di </span><br>
[8]{1016 75 un}  <span style="background:#FFFFFF;color:#B0B0B0;">
conseguenza, sono inclìne a evitare ogni giudizio, un&apos;abitudine che mi ha </span><br>
[9]{1091 102 it}  <span style="background:#E3FFD8;color:#000000;">
rivelato molti caratteri strani e mi ha anche reso vittima di non pochi rompiscatole di lungo corso.  </span><br>
[10]{1193 283 en}  <span style="background:#FFFFF4;color:#000000;">
The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men.  La mente anormale è lesta a </span><br>
[11]{1476 261 it}  <span style="background:#E3FFD8;color:#000000;">
riconoscere e a aggrapparsi a questa qualità quand&apos;essa si manifesta in una persona normale, e così accadde che all&apos;università fui ingiustamente accusato di essere un politicante, perché ero al corrente delle pene segrete di uomini sregolati e sconosciuti. </span><br>
<br>
</body>

New GCC 5.0 hits problem with narrowing in list-initializers

Originally reported on Google Code with ID 26

Following errors are produced by GCC compiler:

c++ -MMD -MF obj/third_party/cld_2/src/internal/cld2_static.cld_generated_cjk_uni_prop_80.o.d
-DV8_DEPRECATION_WARNINGS -D_FILE_OFFSET_BITS=64 -DCHROMIUM_BUILD -DTOOLKIT_VIEWS=1
-DUI_COMPOSITOR_IMAGE_TRANSPORT -DUSE_AURA=1 -DUSE_ASH=1 -DUSE_PANGO=1 -DUSE_CAIRO=1
-DUSE_DEFAULT_RENDER_THEME=1 -DUSE_LIBJPEG_TURBO=1 -DUSE_X11=1 -DUSE_CLIPBOARD_AURAX11=1
-DENABLE_ONE_CLICK_SIGNIN -DENABLE_PRE_SYNC_BACKUP -DENABLE_REMOTING=1 -DENABLE_WEBRTC=1
-DENABLE_PEPPER_CDMS -DENABLE_CONFIGURATION_POLICY -DENABLE_NOTIFICATIONS -DUSE_UDEV
-DDONT_EMBED_BUILD_METADATA -DENABLE_TASK_MANAGER=1 -DENABLE_EXTENSIONS=1 -DENABLE_PLUGINS=1
-DENABLE_SESSION_SERVICE=1 -DENABLE_THEMES=1 -DENABLE_AUTOFILL_DIALOG=1 -DENABLE_BACKGROUND=1
-DENABLE_GOOGLE_NOW=1 -DCLD_VERSION=2 -DENABLE_PRINTING=1 -DENABLE_BASIC_PRINTING=1
-DENABLE_PRINT_PREVIEW=1 -DENABLE_SPELLCHECK=1 -DENABLE_CAPTIVE_PORTAL_DETECTION=1
-DENABLE_APP_LIST=1 -DENABLE_SETTINGS_APP=1 -DENABLE_SUPERVISED_USERS=1 -DENABLE_MDNS=1
-DENABLE_SERVICE_DISCOVERY=1 -DV8_USE_EXTERNAL_STARTUP_DATA -DUSE_LIBPCI=1 -DUSE_GLIB=1
-DUSE_NSS=1 -DNDEBUG -DNVALGRIND -DDYNAMIC_ANNOTATIONS_ENABLED=0 -Igen -I../../third_party/cld_2/src/internal
-I../../third_party/cld_2/src/public -fstack-protector --param=ssp-buffer-size=4  -pthread
-fno-strict-aliasing -Wno-unused-parameter -Wno-missing-field-initializers -fvisibility=hidden
-pipe -fPIC -B/home/marxin/Programming/chromium/src/third_party/binutils/Linux_x64/Release/bin
-Wno-unused-local-typedefs -Wno-format -Wno-unused-result -m64 -march=x86-64 -O2 -fno-ident
-fdata-sections -ffunction-sections -funwind-tables -fno-exceptions -fno-rtti -fno-threadsafe-statics
-fvisibility-inlines-hidden -Wno-deprecated -std=gnu++11 -Wno-narrowing -Wno-literal-suffix
 -c ../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc -o obj/third_party/cld_2/src/internal/cld2_static.cld_generated_cjk_uni_prop_80.o
-Wno-narrowing
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
 };
 ^
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
../../third_party/cld_2/src/internal/cld_generated_cjk_uni_prop_80.cc:7089:1: error:
narrowing conversion of ‘-14’ from ‘int’ to ‘CLD2::uint8 {aka unsigned char}’ inside
{ }
... (and many more)

The problem is discussed further in the following thread: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/D5YxoMmtEmE
I think the fix is quite obvious: the generator should produce only values that fit in uint8.

Thanks,
Martin

Reported by marxin.liska on 2015-01-05 10:34:02

cld2/public/compact_lang_det.h should include .../encodings.h and stdio.h

cld2/public/compact_lang_det.h does not include all the headers it needs. The more obvious problem is failure to include stdio.h, which causes compilation to fail:

$ printf '#include <cld2/public/compact_lang_det.h>\nint foo;\n' | 
  g++ -fsyntax-only -fmessage-length=72 -x c++ -
In file included from <stdin>:1:0:
/usr/include/cld2/public/compact_lang_det.h:379:28: error: variable
   or field ‘DumpResultChunkVector’ declared void
 void DumpResultChunkVector(FILE* f, const char* src,
                            ^
/usr/include/cld2/public/compact_lang_det.h:379:28: error: ‘FILE’
    was not declared in this scope
/usr/include/cld2/public/compact_lang_det.h:379:34: error: ‘f’
    was not declared in this scope
 void DumpResultChunkVector(FILE* f, const char* src,
                                  ^
/usr/include/cld2/public/compact_lang_det.h:379:37: error: expected
   primary-expression before ‘const’
 void DumpResultChunkVector(FILE* f, const char* src,
                                     ^
/usr/include/cld2/public/compact_lang_det.h:380:45: error: expected
   primary-expression before ‘*’ token
                            ResultChunkVector* resultchunkvector);
                                             ^
/usr/include/cld2/public/compact_lang_det.h:380:47: error: ‘resultchunkvector’
    was not declared in this scope
                            ResultChunkVector* resultchunkvector);
                                               ^

The more subtle problem is that it doesn't include cld2/public/encodings.h either; this header is required in order to pass a valid encoding hint. (This would be minor if UNKNOWN_ENCODING had the value zero, but it doesn't.)

Enhancement: expose ISO 639-2 codes for all languages

Please add a function whose contract is the same as the existing CLD2::LanguageCode except that it always produces an ISO 639-2 code if there is one (instead of using -2 codes only for languages that don't have a -1 code). This would, for instance, mean that I don't need to have a separate "what CLD2 calls this language" column in my database along with the ISO codes.
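
Until such a function exists, a caller could approximate it with a small lookup layered on top of the code string that `CLD2::LanguageCode` returns. The sketch below is an assumption-laden illustration: `ToIso639_2` is a hypothetical helper, the table lists only four languages, and it uses the ISO 639-2/T ("terminology") variants, which differ from the 639-2/B variants for some languages (for example `fra` versus `fre`).

```cpp
#include <cstring>
#include <string>

// Hypothetical helper, not part of CLD2: maps a two-letter ISO 639-1 code to
// its ISO 639-2/T code. Only a few entries are listed here for illustration.
std::string ToIso639_2(const char* code) {
  static const struct { const char* iso1; const char* iso2; } kMap[] = {
    {"en", "eng"}, {"fr", "fra"}, {"it", "ita"}, {"de", "deu"},
  };
  // Anything that is not two letters long is passed through unchanged.
  if (std::strlen(code) != 2) return code;
  for (const auto& e : kMap) {
    if (std::strcmp(code, e.iso1) == 0) return e.iso2;
  }
  return "";  // two-letter code missing from this partial table
}
```

A complete version would need one entry per two-letter code CLD2 can emit, plus a decision about non-ISO strings that the pass-through branch currently leaves untouched.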

Cleanup old data sets as was done for Chrome

We should consider cleaning up the old data sets that are kicking around in the main tree, like we did for Chrome a while back in commit b2c2d34. Old data sets should, in my opinion, be stored one of two ways:

  1. In subdirectories whose names indicate versioning (e.g., for temporal versioning YYYYMMDD)
  2. In git history (i.e., only the most recent data set is maintained and kept compatible)

I don't see a strong reason to keep the old data sets around. Folks that want to use the old data sets should be welcome to do so by checking out an older revision. Visibility is an issue; for now we could create branches for the old data sets that exist today, simply pruning the irrelevant data sets in each. After those are done, we can do development of code fixes on master and cherry-pick

I'd suggest branches:
legacy_release_0122 (delete any data files not suffixed with 0122)
legacy_release_0527 (delete any data files not suffixed with 0527)
legacy_release_0720 (delete any data files not suffixed with 0720)

Chromium doesn't need or care about the old data sets, which is why the version suffixes on all the "chrome" files were already deleted. It's not worth putting them back in for the old releases.

The current release is from 2014-10-15, so we would also have a release branch called release_20141015 and we would maintain (at least) this branch with cherry-picks from master.

WDYT?
