josephsefara / language-detection
Automatically exported from code.google.com/p/language-detection
What steps will reproduce the problem?
1. Download the attached Japanese text
2. Execute in CMD.exe: "java -jar langdetect.jar --detectlang -d profiles lang_detect.txt"
What is the expected output? [ja:0.7142823122662098]
What do you see instead? lang_detect.txt:[en:0.7142823122662098, pl:0.14285727552109861, tl:0.14285682309334474]
What version of the product are you using? latest (langdetect-09-13-2011)
On what operating system? Windows 7
Original issue reported on code.google.com by [email protected]
on 20 Apr 2012 at 7:48
Attachments:
What steps will reproduce the problem?
1. Use the library within a Tomcat environment
2. The isVerbose output is written to the console and cannot be captured by loggers (like log4j)
What is the expected output? What do you see instead?
Log entries coming from isVerbose should be directed to a logger (preferably log4j).
What version of the product are you using? On what operating system?
langdetect-09-13-2011.zip On windows 7, server 2008 and Ubuntu 12
Original issue reported on code.google.com by [email protected]
on 8 Jan 2013 at 4:11
What steps will reproduce the problem?
1. java -jar lib/langdetect.jar --detectlang -d profiles cdebconf-km (with the 09-13-2011 version)
What is the expected output? What do you see instead?
Expected: cdebconf-km:[km:0.9999998969777439] (from 11-18-2010 version)
Actual: com.cybozu.labs.langdetect.LangDetectException: no features in text
What version of the product are you using? On what operating system?
09-13-2011 on Ubuntu lucid
Please provide any additional information below.
There has been a regression between the 11-18-2010 and 09-13-2011 versions. A
large number of files that were detected correctly by the earlier version now
produce "no features in text" with the later one. I have attached an example of
such a file.
Original issue reported on code.google.com by [email protected]
on 6 Dec 2011 at 2:33
Attachments:
How can I "reset" the Detector? Shall I just create a new one? Is there a
performance hit?
Thank you very much, Renaud
Original issue reported on code.google.com by [email protected]
on 19 Jan 2011 at 9:44
Another user hijacked Issue 26 with an improvement to the URL pattern matching:
//private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");
I think it's possible to do better because there are a number of issues:
1. The host part doesn't use a separate regular expression. Hosts can't
contain "?", "&", ";" and so forth, so a separate host pattern would let the
regular expression reject non-matches more quickly.
2. There are more URL schemes than just "http" and "https".
3. Some URL schemes are more structured than others. For instance, "mailto"
doesn't actually have any of the slashes (all "opaque" URL schemes are like
this).
4. It might be good to match international URLs too, but this one only matches
ASCII ones.
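Point 1 can be sketched as follows. This is only an illustrative assumption, not a drop-in replacement for the library's constant: the host part is restricted to characters hosts can actually contain, and the query/path characters are only permitted once the path begins, so non-URLs fail faster; the length caps here are arbitrary.

```java
import java.util.regex.Pattern;

public class UrlRegexSketch {
    // Host part limited to letters, digits, '.' and '-'; '?', '&', ';' etc.
    // are only allowed after the path starts, so non-matches are rejected sooner.
    static final Pattern URL_REGEX = Pattern.compile(
            "https?://[-0-9A-Za-z.]{1,253}(?:[/?#][-_.,?&~;+=/#0-9A-Za-z]{0,2048})?");

    public static void main(String[] args) {
        System.out.println(URL_REGEX.matcher("see https://example.com/a?b=c for details").find()); // true
        System.out.println(URL_REGEX.matcher("no url here").find()); // false
    }
}
```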
Original issue reported on code.google.com by trejkaz
on 19 Oct 2011 at 9:50
Hello,
I want to ask whether it is possible to use something other than the Wikipedia
abstracts mentioned in the wiki to generate the language profiles. To be more
precise, I would like to know whether it is possible to use the Europarl parallel
corpus (dedicated to the EU languages but continuously improved and realigned).
Is it useful to generate/regenerate profiles with such a corpus, or are the wiki
extracts sufficient?
This corpus is available at http://www.statmt.org/europarl/
To save you downloading the 1.3 GB of data, let me describe the two kinds of
files it contains:
1) One huge file (for example 65 MB for LV) containing lines of text in the
corresponding language - perhaps the easiest corpus to work with, but what
about Java running out of memory?
2) Several (many) little files containing 'bad' XML (opened tags with no
corresponding closing tags).
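On the memory question for the huge one-line-per-sentence file: reading it with a buffered reader keeps the footprint flat regardless of file size, since only the current line is ever held in memory. A generic sketch, not langdetect-specific:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CorpusReaderSketch {
    // Stream a large corpus file one line at a time; a 65 MB file
    // needs no large heap because the whole file is never loaded.
    public static long countNonEmptyLines(Path corpus) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(corpus)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) count++; // feed the line to the profile builder here
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("corpus", ".txt");
        Files.write(tmp, java.util.Arrays.asList("viens", "", "divi", "trīs"));
        System.out.println(countNonEmptyLines(tmp)); // 3
        Files.delete(tmp);
    }
}
```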
Regards,
Emmanuel
P.S. : sorry to spam you today with my 3 issues but your API is really useful
and fast enough to fit our constraints ;-)
Original issue reported on code.google.com by [email protected]
on 29 Aug 2011 at 3:01
Please expand, in
https://code.google.com/p/language-detection/source/browse/src/com/cybozu/labs/langdetect/util/NGram.java
in the normalize() method, the 'IJ' (U+0132) and 'ij' (U+0133) ligatures to
'I'+'J' and 'i'+'j'.
These ligatures are sometimes used in Dutch, but the convention is to write IJ
and ij as two letters. The reasons are that the ligature is hard to enter on a
keyboard, and that many fonts render the two-letter forms visually identically
to the IJ and ij ligatures.
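The expansion itself is a one-line string replacement, sketched below. Note that normalize() works on single characters, so a one-to-two-character mapping may fit better in an earlier preprocessing pass; that placement is a design choice for the maintainers.

```java
public class LigatureSketch {
    // Expand the Dutch IJ/ij ligatures (U+0132/U+0133) into two letters.
    static String expandDutchLigatures(String text) {
        return text.replace("\u0132", "IJ").replace("\u0133", "ij");
    }

    public static void main(String[] args) {
        System.out.println(expandDutchLigatures("\u0132sselmeer")); // IJsselmeer
    }
}
```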
Original issue reported on code.google.com by [email protected]
on 17 Nov 2012 at 8:39
TEXT="Mam to kino 2,5 roku.Nic dodać.Jest po prostu super."
getProbabilities() for the above text results in:
[hr : 0.9999948455378745]
I am aware of the "short sentence issue"
http://code.google.com/p/language-detection/issues/detail?id=12&q=short
But for me this is a bug. Why? Because it is over 50 characters long and
contains some pretty obvious language features. Take for instance the letter "ć":
it exists only in Polish (pl) and Croatian (hr). If we look at the frequencies
in the profiles we see:
profiles/pl - "ć": 60605
profiles/hr - "ć": 16773
I could understand getProbabilities() returning "hr", but why is there no
Polish at all? Is there any way to train the language detector myself?
Original issue reported on code.google.com by [email protected]
on 16 Jan 2012 at 8:12
What steps will reproduce the problem?
1. Train a language profile using Wikipedia
2. Get the detected language and score for parts of a big corpus
3. Check the wrongly identified languages
Wikipedia is full of foreign subjects and proper names, and is not a good
training source for everyday language.
Better profiles might be generated from well-maintained corpora.
Original issue reported on code.google.com by [email protected]
on 16 Nov 2012 at 5:24
I have tried to create a new instance, but it seems to me that the detector can
only be used once, and I have to clean and build the program.
Could you point me in the right direction as to how to create a new instance
of the detector?
Thanks very much.
Original issue reported on code.google.com by [email protected]
on 5 May 2011 at 4:46
Please offer langdetect.zip or langdetect-latest.zip (without a date in the
name) to download always latest stable.
Original issue reported on code.google.com by [email protected]
on 21 Nov 2012 at 8:10
here's some static jar to install language-detection for use with maven:
https://github.com/renaud/maven_deps/tree/master/language_detection
Original issue reported on code.google.com by [email protected]
on 19 Jan 2011 at 9:19
I would like to bundle the language profiles in a jar file to reduce clutter.
However, the DetectorFactory API is quite restrictive in what it allows you to
pass as the location of the profiles.
If it were possible to pass in a URL, then I think this would be a lot more
convenient.
Original issue reported on code.google.com by [email protected]
on 12 Sep 2011 at 2:22
What steps will reproduce the problem?
1. Use the Prior map ...
What is the expected output? What do you see instead?
I would have expected the prior map to be a generic interface so as not to put
constraints on users of the library.
What version of the product are you using? On what operating system?
lang-detect-09-13-2011.zip
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 18 Dec 2012 at 1:20
What steps will reproduce the problem?
1. try {
       DetectorFactory.loadProfile("profiles");
   } catch (LangDetectException e1) {
       System.out.println("exception: " + e1.getMessage());
       e1.printStackTrace();
   }
2.
3.
What is the expected output? What do you see instead?
I see the following message:
[java] Java Result: 1
What version of the product are you using? On what operating system?
langdetect-09-13-2011.zip on Mac
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 28 Oct 2011 at 11:59
What is the expected output? What do you see instead?
Expected output is the name of the detected language but I am getting
com.cybozu.labs.langdetect.LangDetectException: duplicate the same language
profile
What version of the product are you using? On what operating system?
The latest one on ubuntu
Original issue reported on code.google.com by [email protected]
on 6 Jun 2011 at 8:19
The enumeration (enum) ErrorCode, declared in the
com.cybozu.labs.langdetect.LangDetectException.java file, is not visible
externally. Therefore, even though the exception has a getCode() method, it is
practically useless, because you cannot check the result against a readable
value.
For instance, you cannot write
package mypackage;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
public class TestEnum {
    public static void main(String[] args) {
        try {
            DetectorFactory.loadProfile("./profiles");
            String text = "";
            Detector detector = DetectorFactory.create();
            detector.append(text);
            String lang = detector.detect();
            System.out.println("Language is: " + lang);
        } catch (LangDetectException lde) {
            if (lde.getCode() == ErrorCode.CantDetectError) {
                // ignore
            }
        }
    }
}
because this code does not compile (ErrorCode cannot be imported).
You can still check using .ordinal() on getCode(), i.e.
(lde.getCode().ordinal()); however, using a switch (or multiple ifs) without
knowing what each value represents (e.g. is value 1 a "CantDetectError" or an
"InitParamError"?) is a really *bad* idea.
By the way, the ErrorCode "NoTextError" seems not to be used anywhere.
Suggested solution: unless additional exceptions will be provided (say, one for
each ErrorCode) so as to discriminate between them, I'd suggest declaring
ErrorCode in a separate file as a public enum.
Original issue reported on code.google.com by [email protected]
on 23 Jan 2013 at 6:19
Please support West Frisian language (fy). See
https://fy.wikipedia.org/wiki/Haadside for corpus and
https://en.wikipedia.org/wiki/West_Frisian_language for more information.
Original issue reported on code.google.com by [email protected]
on 12 Nov 2012 at 2:08
In Detector.java the detectBlock() function draws random int values from 0 to
ngrams.size(). This approach can feed the same n-gram to updateLangProb()
multiple times.
For short texts this can produce a large deviation in the resulting probability.
I propose (for short and medium text lengths) drawing only n-gram indices that
have not yet been selected. Perhaps for short texts the better way is to do a
full pass over all n-grams.
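The proposal above can be sketched as drawing from a shuffled index list, so each n-gram index is used exactly once per trial. This is a generic sketch; the real detectBlock() works on its own internal structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SampleWithoutReplacement {
    // Shuffle the indices once, then walk them in order: every n-gram
    // index is visited exactly once, removing the duplicate-draw deviation.
    static List<Integer> drawOrder(int ngramCount, Random rand) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < ngramCount; i++) indices.add(i);
        Collections.shuffle(indices, rand);
        return indices;
    }

    public static void main(String[] args) {
        List<Integer> order = drawOrder(5, new Random(42));
        System.out.println(order.size());                          // 5
        System.out.println(new java.util.HashSet<>(order).size()); // 5: no index repeated
    }
}
```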
Original issue reported on code.google.com by [email protected]
on 28 Feb 2011 at 8:24
Hi all,
I'm new to this library. Can anybody tell me how to create a language
profile?
Thanks a lot
Original issue reported on code.google.com by [email protected]
on 10 May 2011 at 2:56
It would be nice to have bigger models (with more accuracy) and also models for
lowercase text.
Original issue reported on code.google.com by [email protected]
on 26 Jan 2011 at 5:37
What steps will reproduce the problem?
Input: NO PODEÍS PREPARAR A VUESTROS ALUMNOS PARA QUE CONSTRUYAN MAÑANA EL
MUNDO DE SUS SUEÑOS SI VOSOTROS YA NO CREÉIS EN ESOS SUEÑOS NO PODEÍS
PREPARARLOS PARA LA VIDA SINO CREÉIS EN ELLA NO PODRÉIS MOSTRAR EL CAMINO SI
OS HABEÍS SENTADO CANSADOS Y DESALENTADOS EN LA ENCRUCIJADA CELESTIN FREINET
FRANCIA
output: [pt:0.5714263645442876, de:0.428569792470217]
I create the detector through the factory, append, and then detect. I don't set a seed.
What is the expected output? What do you see instead?
expected: spanish
result: [pt:0.5714263645442876, de:0.428569792470217]
What version of the product are you using? On what operating system?
latest
I am a bit surprised it would show German. Is it the upper case that causes the
problem? At times I even see German as the main language; I suppose it depends
on the seed?
thank you!
Original issue reported on code.google.com by [email protected]
on 15 Jun 2012 at 10:55
Hi, I think it would be very nifty if Detector had a method like
getLangsWithProbs that returned a HashMap of languages and their
probabilities, so the developer could decide whether to accept a given
probability.
All that is needed is another sortProbability method that returns a map instead
of a list.
The thing is, if you get a text in a language that has no profile, or some
gibberish, it easily satisfies PROB_THRESHOLD, and the developer using this
library has no chance to see the probability at all. Moreover, PROB_THRESHOLD
is private final and cannot be set.
Kind regards, Jakub
Original issue reported on code.google.com by liska.jakub
on 27 Apr 2011 at 1:09
Email addresses differ somewhat from the pattern which has been used:
private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");
In fact, coming up with a very accurate mail regex requires using a very long
one:
http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
While that regular expression is somewhat ridiculous, there are some things
which could be improved without going that far:
1. Addresses permit a lot more in the local part, for instance "+".
2. Addresses can use a quoted local part like: "any string \"here\""@example.com
3. Hostnames can't actually contain an underscore.
4. International hostnames are possible although rare. International usernames
are not through standards yet, but are coming soon to a server near you. ;)
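Points 1 and 3 can be sketched as below. This is an illustrative assumption, still a rough approximation rather than anything close to RFC 822: '+' is added to the local part, and '_' is dropped from the host part.

```java
import java.util.regex.Pattern;

public class MailRegexSketch {
    // Local part additionally allows '+'; the host part drops '_'
    // since hostnames cannot contain underscores.
    static final Pattern MAIL_REGEX = Pattern.compile(
            "[-+_.0-9A-Za-z]{1,64}@[-0-9A-Za-z]{1,63}(?:\\.[-0-9A-Za-z]{1,63})+");

    public static void main(String[] args) {
        System.out.println(MAIL_REGEX.matcher("user+tag@example.com").matches()); // true
        System.out.println(MAIL_REGEX.matcher("user@bad_host.com").matches());    // false
    }
}
```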
Original issue reported on code.google.com by trejkaz
on 19 Oct 2011 at 9:55
Hey,
after I do
DetectorFactory.loadProfile(new File(LangDetector.class.getClassLoader().getResource("profiles").toURI()));
then langlist is not accessible in either DetectorFactory or Detector.
There is simply no way of checking what languages it supports.
Am I missing something? Or can it really not be done?
Regards, Jakub
Original issue reported on code.google.com by liska.jakub
on 17 Aug 2011 at 9:14
Hello,
I have successfully generated profiles for MT, LV, SL, ET and LT (based on the
steps you mentioned in the wiki, cf. the Tools section). When I test the
language detection with only the profiles MT, LV, SL, ET added, these 4 new
profiles are correctly loaded (no error appears). But when I add the LT profile
I get the error:
GRAVE: Error
java.lang.ArrayIndexOutOfBoundsException: -1
at com.cybozu.labs.langdetect.DetectorFactory.addProfile(DetectorFactory.java:105)
at com.cybozu.labs.langdetect.DetectorFactory.loadProfile(DetectorFactory.java:75)
at Main.qualityCkeck(Main.java:215)
at Main.main(Main.java:62)
Please see enclosed the files (profiles and Wiki abstract only for LT). I use
language-detection API version 05-09-2011.
Regards,
Emmanuel
Original issue reported on code.google.com by [email protected]
on 29 Aug 2011 at 9:13
Attachments:
Thanks for all the good work you have shared.
Enhancement: advice is needed on the best way to implement this in order to
detect two/three (or more) languages in the same document. Any guidelines are
welcome, and I will try to implement and share any results. Thanks
Original issue reported on code.google.com by [email protected]
on 5 Feb 2011 at 6:17
What steps will reproduce the problem?
1. I provide the profiles folder in the project
2. Run
3.
What is the expected output? What do you see instead?
I get the error: duplicate the same language profile
What version of the product are you using? On what operating system?
latest version; Mac OS
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 9 Dec 2011 at 7:53
I would like to propose loadProfile() to be overloaded as follows in order to
make it possible to have the profiles/ directory stored in a JAR file and load
its content via something like this:
DetectorFactory.loadProfile(MyClass.class.getResource("profiles").toURI());
Here is my proposal:
/**
 * Load profiles from the specified directory. This method (or its overloaded
 * companion) must be called once before language detection.
 *
 * @param profileDirectory profile directory path
 * @throws LangDetectException Can't open profiles (error code = {@link ErrorCode#FileLoadError})
 *                             or profile's format is wrong (error code = {@link ErrorCode#FormatError})
 */
public static void loadProfile(String profileDirectory) throws LangDetectException {
    loadProfile(new File(profileDirectory).toURI());
}

/**
 * Load profiles from the specified directory. This method (or its overloaded
 * companion) must be called once before language detection.
 *
 * @param profileDirectory profile directory path as a URI
 * @throws LangDetectException Can't open profiles (error code = {@link ErrorCode#FileLoadError})
 *                             or profile's format is wrong (error code = {@link ErrorCode#FormatError})
 */
public static void loadProfile(URI profileDirectory) throws LangDetectException {
    File dir = new File(profileDirectory);
    File[] listFiles = dir.listFiles();
    if (listFiles == null)
        throw new LangDetectException(ErrorCode.NeedLoadProfileError, "Not found profile directory: " + profileDirectory);
    int langsize = listFiles.length, index = 0;
    for (File file : listFiles) {
        if (file.getName().startsWith(".") || !file.isFile()) continue;
        FileInputStream is = null;
        try {
            is = new FileInputStream(file);
            LangProfile profile = JSON.decode(is, LangProfile.class);
            addProfile(profile, index, langsize);
            ++index;
        } catch (JSONException e) {
            throw new LangDetectException(ErrorCode.FormatError, "profile format error in '" + file.getName() + "'");
        } catch (IOException e) {
            throw new LangDetectException(ErrorCode.FileLoadError, "can't open '" + file.getName() + "'");
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException e) {
            }
        }
    }
}
Original issue reported on code.google.com by [email protected]
on 17 Feb 2011 at 7:16
What steps will reproduce the problem?
1. I am passing the text "viking river cruise" to detect the language
What is the expected output? What do you see instead?
Expected is English, but it displays "af"
What version of the product are you using? On what operating system?
latest version
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 17 Feb 2011 at 4:26
I'm trying to use the language-detection library to distinguish between English
and German for very short snippets of text. I noticed that sometimes when I use
the language detector repeatedly on the same piece of input then I get
different results for each trial (the very first is usually right).
Am I using the API in a wrong way? Is there anything I can do to always get
deterministic results?
I attached a little unit test that reproduces the problem if you set the right
PROFILE_DIR.
What steps will reproduce the problem?
1. Load profiles, create a detector, append input, detect.
2. Create new detector, append same input again, detect.
3. Repeat this a couple of times.
What is the expected output? What do you see instead?
I expect to constantly get the same result for the same input.
What version of the product are you using? On what operating system?
<groupId>org.apache.solr</groupId>
<artifactId>solr-langdetect</artifactId>
<version>3.5.0</version>
on Windows 7
Thanks in advance!
Heike
Original issue reported on code.google.com by [email protected]
on 28 Jan 2013 at 4:28
Attachments:
Hello,
Just an idea to help people who have problems with short sentences (<10 or
15 words). It used to be a problem for me.
Perhaps add a utility method in some class (the Detector class?) to count the
number of words in a String.
For example (it is the one I use - a static method):
public static int wordCount(String line) {
    int idx = 0;
    int cnt = 0;
    while (idx < line.length()) {
        if (!Character.isLetter(line.charAt(idx))) {
            idx++;
            continue;
        }
        cnt++;
        while (idx < line.length() && Character.isLetter(line.charAt(idx))) {
            idx++;
        }
    }
    return cnt;
}
Or, why not throw an exception directly when the String does not contain
enough words (LangDetectException: "Too few words")?
But it is just an idea.
But, indeed, it is a great API, with good response time and the ability to add
supported languages. Thanks a lot.
Regards,
Emmanuel
Original issue reported on code.google.com by [email protected]
on 29 Aug 2011 at 9:25
What steps will reproduce the problem?
Reproduce with following code:
@Test
public void langDetect() {
    final String textToDetect = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [email protected] asdfadasdf";
    try {
        final URL profiles = Resources.getResource(getClass(), "profiles");
        LangDetector.init(new File(profiles.getPath()));
        final Detector detector = DetectorFactory.create();
        detector.append(new StringReader(textToDetect));
    } catch (LangDetectException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What is the expected output? What do you see instead?
I expect anything but an exception.
I get this stacktrace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.<init>(String.java:207)
at com.cybozu.labs.langdetect.Detector.append(Detector.java:154)
What version of the product are you using? On what operating system?
A build from 2011-09-21. JRE 1.6b21
Please provide any additional information below.
I am attempting to use append(Reader) because the URL/Address regex in the
append(String) will occasionally "freeze" as noted in issues 6 and 26
(http://code.google.com/p/language-detection/issues/detail?id=26 and
http://code.google.com/p/language-detection/issues/detail?id=6&q=append). Using
the Reader-based buffer alleviates the regex slowdown, but it is unusable with
this out-of-bounds exception.
Original issue reported on code.google.com by [email protected]
on 17 Jul 2012 at 2:59
I noticed some unnecessary object creation in the Detector and NGram classes,
and made some changes at this branch:
http://code.google.com/r/armintor-language-detection/source/list?name=tuning
My tests show improvements in speed, memory use, and a fairly drastic decrease
in objects created.
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 7:56
First of all, congratulations on your library; it has very good performance
even in very hard settings.
I have implemented a faster version of the library based on your algorithm,
using arrays instead of HashMaps. It runs approx. 5 to 8 times faster.
You can find the sources attached; feel free to add them to the library if you
find them useful.
Original issue reported on code.google.com by [email protected]
on 18 Jan 2011 at 5:07
Attachments:
If you want to use the plugin with the new version of Nutch, the extension
point is missing.
Exception in thread "main" java.lang.RuntimeException: Plugin
(language-detector), extension point: org.apache.nutch.searcher.QueryFilter
does not exist.
at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:84)
at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
at org.apache.nutch.protocol.ProtocolFactory.<init>(ProtocolFactory.java:49)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:78)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:132)
I tried to change the QueryFilter inside the plugin.xml to
indexer.IndexingFilter, which made the exception disappear, but I get the
same result as with the language-identification plug-in ("et" when testing
wikipedia.co.jp). This should not be the hardest challenge, so I expected
the correct language. Is IndexingFilter maybe the wrong extension point?
Original issue reported on code.google.com by [email protected]
on 15 Feb 2012 at 4:39
So how does one set the priorMap to give the library a hint towards a language?
I have user input indicating what the language may be.
If I set a probability, it always goes in that direction.
e.g.
Detector detector = DetectorFactory.create();
HashMap priorMap = new HashMap();
priorMap.put("ja", new Double(0.001));
detector.setPriorMap(priorMap);
detector.append("This is an english sentence.");
I would expect the language to be detected as "en"; instead I get "ja"
with a 0.99999999998567 probability. It seems to be the case for all languages:
if you seed the priorMap table, the library just validates that language as the
right language.
Am I not using the interface correctly?
Original issue reported on code.google.com by [email protected]
on 16 Apr 2011 at 6:32
If you split up the text beforehand, it will work.
E.g. something à la:
public static String cleanText(String text) {
    // break up big words > 40 chars into single ones
    int s = text.length();
    StringBuffer sb = new StringBuffer();
    int sindex = 0;
    for (int i = 0; i < s; i++) {
        char c = text.charAt(i);
        if (c == ' ') {
            sindex = i;
        }
        if (i - sindex > 40) {
            sindex = i;
            sb.append(" ");
        }
        sb.append(c);
    }
    return sb.toString();
}
Original issue reported on code.google.com by [email protected]
on 4 Feb 2011 at 2:37
There is a Frisian Wikipedia, which led to the attached profile.
The Frisian language code might be fy.
Original issue reported on code.google.com by [email protected]
on 10 Nov 2012 at 12:08
Attachments:
What steps will reproduce the problem?
1. Use the phrase "distribution agreement"
I expect it to return English, but it returns French.
I am using version 02-02-2011, on Windows 7.
This seems to happen on certain short phrases: the wrong language is returned.
Is there anything I can do to fix this?
Original issue reported on code.google.com by [email protected]
on 31 Mar 2011 at 8:50
What steps will reproduce the problem?
In our test job we are evaluating your tool. We are testers checking
translated content in software for nearly 30 languages.
1. When we try to identify "Ingen ArcSync?-konto?Log ind Opret en konto",
it considers NO the most probable language (>99.99%) rather than DK.
2. When identifying "Accepar", it determines RO (>99.99%) rather than ES. Could
the detection probability be improved in the next version? Thanks for your work.
What is the expected output? What do you see instead?
It should identify DK/NO and RO/ES more correctly.
What version of the product are you using? On what operating system?
langdetect-09-13-2011.zip
Please provide any additional information below.
We can't get newer profiles from Wikipedia because we don't know where to get
the abstract database files on the Download page. We have googled 'eswiki-
-abstract.xml' and 'Wikipedia abstract database files', but found no results.
Polyglot3000, which is also a good language detection tool, returns correct
judgements, though it only provides a GUI on Windows and no API or source code.
Original issue reported on code.google.com by [email protected]
on 21 Jan 2012 at 2:59
This isn't really a bug. Would it be possible to rename the short-name
profiles with their corresponding long-form names? Or maybe add a way to store
both long and short names, and add methods like "detectLongName" and
"detectShortName"?
This would be better in case more profiles are added, so that no one depends
on translating the short name to a long name in their own code base.
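As a workaround on the caller's side, the JDK can already map the short ISO 639-1 codes the profiles use to long display names. A sketch; it does not cover custom profile names:

```java
import java.util.Locale;

public class LangNameSketch {
    // Translate a detected ISO 639-1 code (e.g. "fr") to its English long name.
    static String longName(String isoCode) {
        return Locale.forLanguageTag(isoCode).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(longName("fr")); // French
        System.out.println(longName("ja")); // Japanese
    }
}
```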
Original issue reported on code.google.com by [email protected]
on 3 Jun 2011 at 8:42
I'm getting the below error even though I have added jsonic.jar and jsonic.swc
Exception in thread "main" java.lang.NoClassDefFoundError:
net/arnx/jsonic/JSONException
at javaapplication31.Main.main(Main.java:29)
Caused by: java.lang.ClassNotFoundException: net.arnx.jsonic.JSONException
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
... 1 more
Original issue reported on code.google.com by [email protected]
on 12 Jan 2011 at 3:51
DetectorFactory.loadProfile is a static method and additionally, if it is
called more than once, errors occur.
This causes problems for applications which are completely componentised - when
you start up a language detection component it wants to initialise
DetectorFactory. When that same component initialises again in the future, it
throws an error.
A better alternative would be that the factory be an instance which can be
configured and then thrown away. The second component would then be using a
different instance, which would avoid the problem completely.
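The suggested design can be sketched with a factory instance that owns its own profile state, so two components never collide. The class and method names below are hypothetical, not the library's API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DetectorFactorySketch {
    // Each factory instance keeps its own language list; configuring one
    // instance cannot interfere with another component's instance, and a
    // re-initialised component simply builds a fresh factory.
    private final List<String> langList = new ArrayList<>();

    void addLanguage(String lang) { langList.add(lang); }

    List<String> getLangList() { return Collections.unmodifiableList(langList); }

    public static void main(String[] args) {
        DetectorFactorySketch componentA = new DetectorFactorySketch();
        DetectorFactorySketch componentB = new DetectorFactorySketch();
        componentA.addLanguage("en");
        componentB.addLanguage("ja");
        System.out.println(componentA.getLangList()); // [en]
        System.out.println(componentB.getLangList()); // [ja]
    }
}
```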
Original issue reported on code.google.com by [email protected]
on 12 Sep 2011 at 2:24
Please offer a way to formally document how the models for a given language
were trained. Retraining, when needed, can then be done automatically as long
as the relevant sources are documented.
For the normal model and the short-message model, a list of URLs, each with a
short description, should be stored in a configuration file. The
wikipedia/media/etc. URLs should also be stored in this file, so that they can
be omitted when needed.
Original issue reported on code.google.com by [email protected]
on 17 Nov 2012 at 6:41
What steps will reproduce the problem?
1. Modify the following test case to match paths on your system (StringUtils
comes from Commons Lang, in case you're not using it. I was reducing the size
of the test.)
@Test
public void test() throws Exception {
    // Generate two strings which are exactly the same length (thus we should expect similar performance.)
    int testSize = 8 * 1024;
    String repetitiveEnglish = StringUtils.repeat("I see what you did there, dude. ", testSize / 32);
    String noSpaces = StringUtils.repeat("abcdefgh", testSize / 8);
    DetectorFactory.loadProfile(new File("dependencies/langdetect/profiles"));
    detect(repetitiveEnglish, 10);
    detect(noSpaces, 10);
    detect(repetitiveEnglish, 10);
    detect(noSpaces, 10);
}

private void detect(String text, int runs) throws Exception {
    long t0 = System.currentTimeMillis();
    for (int i = 0; i < runs; i++) {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        String result = detector.detect();
    }
    long t1 = System.currentTimeMillis();
    System.out.println(String.format("%s ms", t1 - t0));
}
2. Run and watch the output.
I get the following:
219 ms
9066 ms
36 ms
8999 ms
So whereas normal English-like text gets down to about 3.6 ms per detection,
detecting the language of text with no whitespace costs nearly a second per
call. When you're trying to identify millions of documents, that adds up fast.
Original issue reported on code.google.com by [email protected]
on 17 Oct 2011 at 5:55
I just made an API call to loadProfile using a DetectorFactory and supplied
"trunk/profile" as the argument (as suggested in the tutorial). It reports that
the net.arnx.jsonic.JSON class along with the net.arnx.jsonic.JSONException
class are both missing. DetectorFactory seems to import them, so I guess
these classes haven't been bundled in the LangDetect jar file?
I'm using langdetect-05-09-2011.jar with JavaSE 1.6 on Mac OSX (Snow Leopard).
I just downloaded jsonic-1.2.5.zip but am wondering where and how to integrate
it with langdetect before making that API call.
Thanks!
Original issue reported on code.google.com by [email protected]
on 14 May 2011 at 11:56
Hi,
I just want to detect whether a given text is in English or not. For my
problem, I am not really interested in identifying the exact language of the
text. I understand that reducing the number of target languages will increase
the accuracy. So can I infer that just predicting whether the text is English
or not, instead of predicting the exact language, will increase the accuracy?
If so, how can I tweak this package to detect whether a given text is English
or not?
Regards,
Vamsi
Original issue reported on code.google.com by [email protected]
on 22 Mar 2011 at 2:57
First of all: thank you for making this software publicly available! I really
appreciate it ;)
Now that you provide two different types of language profiles (based on
wikipedia and twitter, respectively), I'd like to be able to keep two language
detectors alive simultaneously. One relying on the twitter profiles for
classifying short texts, and one relying on the wikipedia profiles for
classifying long ones (this is somewhat related to issue 25). However, the
current implementation of the DetectorFactory does not allow me to set up more
than one type of Detector.
I'd like to be able to do something similar to the following (where
detectorShort and detectorLong would use language profiles for (approximately)
the same languages, originating from different sources):
// set-up two different detectors
DetectorFactory.loadProfiles(LanguageProfile.SHORT_TEXTS,
"/path/to/shorttextprofiles");
DetectorFactory.loadProfiles(LanguageProfile.LONG_TEXTS,
"/path/to/longtextprofiles");
...
// get the detector for handling short texts
Detector detectorShort = DetectorFactory.create(LanguageProfile.SHORT_TEXTS);
// get the detector for handling longer texts
Detector detectorLong = DetectorFactory.create(LanguageProfile.LONG_TEXTS);
...
Is this behaviour something you would consider in a future release?
Kind regards
Original issue reported on code.google.com by [email protected]
on 15 Mar 2012 at 6:44
What steps will reproduce the problem?
1. Create a Detector with the language profiles bundled with this library
2. call detector.append("")
3. call detector.getProbabilities()
What is the expected output? What do you see instead?
Should return an empty list, instead throws a LangDetectException("no features
in text")
What version of the product are you using? On what operating system?
langdetect-1.1-20120112
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Sep 2012 at 5:23