language-detection's Issues

Japanese language detection problem

What steps will reproduce the problem?
1. Download attached Japanese text
2. Execute in CMD.exe: "java -jar langdetect.jar --detectlang -d profiles lang_detect.txt"

What is the expected output? [ja:0.7142823122662098]
What do you see instead? lang_detect.txt:[en:0.7142823122662098, pl:0.14285727552109861, tl:0.14285682309334474]


What version of the product are you using? latest (langdetect-09-13-2011)
On what operating system? Windows 7


Original issue reported on code.google.com by [email protected] on 20 Apr 2012 at 7:48

Attachments:

isVerbose output uses System.out.println and should be directed to a logger

What steps will reproduce the problem?
1. Use the library within a Tomcat environment
2. The isVerbose output is directed to the console and cannot be captured by loggers (like log4j)

What is the expected output? What do you see instead?
Log entries coming from isVerbose should be directed to a logger (preferably log4j).

What version of the product are you using? On what operating system?
langdetect-09-13-2011.zip On windows 7, server 2008 and Ubuntu 12
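A minimal sketch (not from the original report) of one way to achieve this without changing the library: wrap a java.util.logging Logger in a PrintStream and install it with System.setOut, so the verbose println output lands in the host's logging setup. The class name is hypothetical; a log4j-backed stream could be written the same way.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.Logger;

// A PrintStream that forwards each println(String) to a java.util.logging
// Logger, so verbose output from the library can be captured by the host
// container instead of going straight to the console.
public class LoggerPrintStream extends PrintStream {
    private final Logger logger;

    public LoggerPrintStream(Logger logger) {
        // Discard raw byte writes; we only care about println(String).
        super(new OutputStream() { @Override public void write(int b) {} });
        this.logger = logger;
    }

    @Override
    public void println(String line) {
        logger.log(Level.INFO, line);
    }

    public static void main(String[] args) {
        System.setOut(new LoggerPrintStream(Logger.getLogger("langdetect.verbose")));
        System.out.println("verbose output now goes through the logger");
    }
}
```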


Original issue reported on code.google.com by [email protected] on 8 Jan 2013 at 4:11

regression: "no features in text"

What steps will reproduce the problem?
1. java -jar lib/langdetect.jar --detectlang -d profiles cdebconf-km with the 09-13-2011 version


What is the expected output? What do you see instead?
Expected: cdebconf-km:[km:0.9999998969777439] (from 11-18-2010 version)
Actual: com.cybozu.labs.langdetect.LangDetectException: no features in text

What version of the product are you using? On what operating system?
09-13-2011 on Ubuntu lucid


Please provide any additional information below.

There has been a regression between the 11-18-2010 and 09-13-2011 versions. A large number of files that were detected correctly by the earlier version now show "no features in text" with the later version. I have attached an example of such a file.

Original issue reported on code.google.com by [email protected] on 6 Dec 2011 at 2:33

Attachments:

URL pattern matching doesn't match every URL

Another user hijacked Issue 26 with an improvement to the URL pattern matching:

//private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

I think it's possible to do better because there are a number of issues:

1. The host part doesn't use a separate regular expression. Hosts can't contain "?", "&", ";" and so forth, so a stricter host pattern would let the regular expression determine non-matches more quickly.
2. There are more URL schemes than just "http" and "https".
3. Some URL schemes are more structured than others. For instance, "mailto" doesn't actually have any of the slashes (all "opaque" URL schemes are like this).
4. It might be good to match international URLs too, but this one only matches ASCII ones.
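To make the first two points concrete, here is a hypothetical sketch (not from the issue, and not a full RFC 3986 parser): it accepts any hierarchical scheme and uses a stricter character class for the host so non-matches fail fast.

```java
import java.util.regex.Pattern;

public class UrlPattern {
    // Sketch of a stricter URL pattern: separate classes for scheme, host,
    // optional port, and path/query, instead of one permissive class.
    static final Pattern URL_REGEX = Pattern.compile(
        "[A-Za-z][-+.0-9A-Za-z]*://"            // any scheme, not just http(s)
      + "[-.0-9A-Za-z]{1,253}"                  // host: no '?', '&', ';' etc.
      + "(?::\\d{1,5})?"                        // optional port
      + "(?:/[-_.,?&~;+=/#%0-9A-Za-z]*)?");     // path/query, still permissive

    public static void main(String[] args) {
        System.out.println(URL_REGEX.matcher("https://example.com/a?b=c").matches()); // true
    }
}
```

Opaque schemes like "mailto" (point 3) would still need a separate pattern, since they have no "://" part.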

Original issue reported on code.google.com by trejkaz on 19 Oct 2011 at 9:50

Non Wikipedia corpus for profile generation

Hello,


I want to ask whether it is possible to use something other than the Wikipedia abstracts mentioned in the wiki to generate the language profiles. To be more precise, I would like to know whether it is possible to use the Europarl parallel corpus (dedicated to the EU languages but continuously improved and realigned).

Is it useful to generate/regenerate profiles with such a corpus? Or are the wiki extracts sufficient?

This corpus is available at http://www.statmt.org/europarl/

So that you don't have to download the 1.3 GB of data, let me describe the two kinds of files it contains:
1) One huge file (for example 65 MB for LV) containing lines of text in the corresponding language - perhaps the easiest corpus to work with, but what about Java memory exceptions?
2) Several (many) little files containing 'bad' XML (opened tags with no corresponding closing tags).


Regards,
Emmanuel

P.S.: sorry to spam you today with my 3 issues, but your API is really useful and fast enough to fit our constraints ;-)

Original issue reported on code.google.com by [email protected] on 29 Aug 2011 at 3:01

Expand IJ and ij ligatures

Please expand the 'IJ' (U+0132) and 'ij' (U+0133) ligatures to 'I'+'J' and 'i'+'j' in method normalize() of https://code.google.com/p/language-detection/source/browse/src/com/cybozu/labs/langdetect/util/NGram.java.

These ligatures are sometimes used in Dutch, but the convention is to write IJ and ij. The reasons are that the ligature is hard to enter on a keyboard and that many fonts render 'I'+'J' and 'i'+'j' visually identically to the ligatures.
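The requested change amounts to something like the following sketch (the class and method names are illustrative, not the project's actual normalize() code):

```java
// Sketch of the requested mapping: expand the Unicode ligatures to their
// two-letter forms before n-gram extraction.
public class IjNormalizer {
    static String expandIjLigatures(String s) {
        return s.replace("\u0132", "IJ")   // 'IJ' ligature -> 'I' + 'J'
                .replace("\u0133", "ij");  // 'ij' ligature -> 'i' + 'j'
    }

    public static void main(String[] args) {
        System.out.println(expandIjLigatures("\u0132sselmeer")); // prints "IJsselmeer"
    }
}
```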

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 8:39

Short sentences for Polish are not detected properly

TEXT="Mam to kino 2,5 roku.Nic dodać.Jest po prostu super."

getProbabilities() for the above text results in:
[hr : 0.9999948455378745]

I am aware of the "short sentence issue" 
http://code.google.com/p/language-detection/issues/detail?id=12&q=short

But for me this is a bug. Why? Because the text is over 50 characters long and contains some pretty obvious language features. Take for instance the letter "ć": it exists only in Polish (pl) and Croatian (hr). If we take a look at the frequencies in the profiles we see that:
profiles/pl - "ć":60605
profiles/hr - "ć":16773

I could understand getProbabilities returning "hr", but why is there no Polish at all? Is there any way to train the language detector on my own data?

Original issue reported on code.google.com by [email protected] on 16 Jan 2012 at 8:12

Wikipedia less than optimal training database

What steps will reproduce the problem?
1. Train a language using Wikipedia
2. Get the detected language and score for parts of a big corpus
3. Check the wrongly identified languages

Wikipedia is full of foreign subjects and proper names, and is not a good training source for everyday language.
A better profile might be generated from a well-maintained corpus.

Original issue reported on code.google.com by [email protected] on 16 Nov 2012 at 5:24

How to reinstantiate the Detector object?



I have tried to create a new instance, but it seems the detector can only be used once, and I have to clean and rebuild the program.
Could you point me in the right direction as to how to create a new instance of the detector?

Thanks very much.
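For reference, the usage pattern suggested by the other reports in this tracker (treat this as a sketch, not official documentation): load the profiles exactly once per JVM, then create a fresh single-use Detector from the factory for each text.

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

// Sketch: DetectorFactory.loadProfile() is called once at startup (calling it
// again raises "duplicate the same language profile"), while
// DetectorFactory.create() hands out a fresh, single-use Detector per text.
public class ReuseExample {
    public static void main(String[] args) throws LangDetectException {
        DetectorFactory.loadProfile("profiles"); // once, at startup

        Detector first = DetectorFactory.create();
        first.append("This is an English sentence.");
        System.out.println(first.detect());

        Detector second = DetectorFactory.create(); // new instance, no rebuild
        second.append("Ceci est une phrase française.");
        System.out.println(second.detect());
    }
}
```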

Original issue reported on code.google.com by [email protected] on 5 May 2011 at 4:46

maven integration

here's a static jar to install language-detection for use with Maven:
https://github.com/renaud/maven_deps/tree/master/language_detection

Original issue reported on code.google.com by [email protected] on 19 Jan 2011 at 9:19

Ability to bundle the profiles into a jar file

I would like to bundle the language profiles in a jar file to reduce clutter.  
However, the DetectorFactory API is quite restrictive in what it allows you to 
pass as the location of the profiles.

If it were possible to pass in a URL, then I think this would be a lot more 
convenient.

Original issue reported on code.google.com by [email protected] on 12 Sep 2011 at 2:22

PriorMap should expose the generic Map interface

What steps will reproduce the problem?
1. Use the Prior map ...

What is the expected output? What do you see instead?

I would have expected the prior map to use a generic interface, so as not to put constraints on users of the library.

What version of the product are you using? On what operating system?

lang-detect-09-13-2011.zip

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 18 Dec 2012 at 1:20

exception when calling DetectorFactory.loadProfile()

What steps will reproduce the problem?
1. try {
            DetectorFactory.loadProfile("profiles");
        } catch (LangDetectException e1) {
            System.out.println("exception: " + e1.getMessage());
            e1.printStackTrace();
        }

What is the expected output? What do you see instead?
I see the following message:      

[java] Java Result: 1


What version of the product are you using? On what operating system?
langdetect-09-13-2011.zip on Mac

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 28 Oct 2011 at 11:59

Duplicate the same Language Profile

What is the expected output? What do you see instead?
Expected output is the name of the detected language but I am getting 
com.cybozu.labs.langdetect.LangDetectException: duplicate the same language 
profile

What version of the product are you using? On what operating system?
The latest one on ubuntu

Original issue reported on code.google.com by [email protected] on 6 Jun 2011 at 8:19

ErrorCode is not visible

The enumeration (enum) ErrorCode, declared in the com.cybozu.labs.langdetect.LangDetectException.java file, is not visible externally. Therefore, even though the exception has a getCode() method, it is practically useless, because you cannot check the result against a well-known value.

For instance, you cannot write

package mypackage;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class TestEnum {

     public static void main( String[] args ){

          try{
               DetectorFactory.loadProfile( "./profiles" );

               Detector detector = null;
               String text = "";
               detector = DetectorFactory.create();
               detector.append( text );
               String lang = detector.detect();
               System.out.println( "Language is: " + lang );
          }catch( LangDetectException lde ){
               if ( lde.getCode() == ErrorCode.CantDetectError ){
                    // ignore
               }
          }
     }
}

because this code gives a compilation error.

You can still check using .ordinal() on getCode(), i.e. lde.getCode().ordinal(); however, using a switch (or multiple ifs) without knowing what each value represents (e.g. is value 1 "CantDetectError" or "InitParamError"?) is a really *bad* idea.

By the way, the ErrorCode "NoTextError" seems not to be used anywhere.


Suggested solution: Unless additional exception classes are going to be provided (say, one per ErrorCode) so as to discriminate between them, I'd suggest declaring ErrorCode in a separate file as a public enum.


Original issue reported on code.google.com by [email protected] on 23 Jan 2013 at 6:19

Support West Frisian language (fy)

Please support West Frisian language (fy). See 
https://fy.wikipedia.org/wiki/Haadside for corpus and 
https://en.wikipedia.org/wiki/West_Frisian_language for more information.

Original issue reported on code.google.com by [email protected] on 12 Nov 2012 at 2:08

detectBlock and random updateLangProb

In Detector.java the detectBlock() function draws a random int value from 0 to ngrams.size(). This approach can feed the same n-gram to the updateLangProb function multiple times.
For short text this can produce a large deviation in the resulting probabilities.

I propose (for short and medium text lengths) to draw only n-gram indices that have not yet been selected. Perhaps for short text an even better way is to do a full pass over all n-grams.
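The proposal can be sketched as follows (illustrative only; the class and method names are not from the library): shuffle the n-gram list once, so each n-gram feeds the probability update at most one time instead of being drawn with replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SampleWithoutReplacement {
    // Visit every n-gram exactly once in random order, rather than drawing a
    // random index with replacement as detectBlock() currently does.
    static List<String> randomOrder(List<String> ngrams, Random rand) {
        List<String> order = new ArrayList<>(ngrams);
        Collections.shuffle(order, rand);
        return order;
    }

    public static void main(String[] args) {
        List<String> ngrams = Arrays.asList(" a", "ab", "bc", "c ");
        // Each element appears exactly once, so short texts get no duplicates.
        System.out.println(randomOrder(ngrams, new Random(42)));
    }
}
```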

Original issue reported on code.google.com by [email protected] on 28 Feb 2011 at 8:24

Portuguese detection problem

What steps will reproduce the problem?
Input: NO PODEÍS PREPARAR A VUESTROS ALUMNOS PARA QUE CONSTRUYAN MAÑANA EL 
MUNDO DE SUS SUEÑOS SI VOSOTROS YA NO CREÉIS EN ESOS SUEÑOS NO PODEÍS 
PREPARARLOS PARA LA VIDA SINO CREÉIS EN ELLA NO PODRÉIS MOSTRAR EL CAMINO SI 
OS HABEÍS SENTADO CANSADOS Y DESALENTADOS EN LA ENCRUCIJADA CELESTIN FREINET 
FRANCIA 

output: [pt:0.5714263645442876, de:0.428569792470217]

I create the detector through the factory, append, and then detect. I don't set a seed.

What is the expected output? What do you see instead?

expected: spanish
result: [pt:0.5714263645442876, de:0.428569792470217]

What version of the product are you using? On what operating system?

latest

I am a bit surprised it would show German. Is it the upper case that causes a problem? At times I even see German as the main language; I suppose it depends on the seed?

thank you!

Original issue reported on code.google.com by [email protected] on 15 Jun 2012 at 10:55

Detector Enhancement

Hi, I think it would be very nifty if Detector had a method like getLangsWithProbs that returned a HashMap of languages and their probabilities, so the developer could decide whether to accept a given probability...

All that is needed is another sortProbability method that returns a map instead of a list...


The thing is that if you get a text in a language that has no profile, or some gibberish, it easily satisfies PROB_THRESHOLD, and a developer using this library doesn't get a chance to see the probability at all. Moreover, PROB_THRESHOLD is private final and cannot be set.

Kind Regards, Jakub

Original issue reported on code.google.com by liska.jakub on 27 Apr 2011 at 1:09

Email pattern matching doesn't match every email

Email addresses differ somewhat from the pattern which has been used:

    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");

In fact, coming up with a very accurate mail regex requires using a very long 
one:

  http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

Whereas that regular expression is somewhat ridiculous, there are some things 
which could be improved without going that far:

1. Addresses permit a lot more in the local part, for instance "+".
2. Addresses can use a quoted local part like: "any string \"here\""@example.com
3. Hostnames can't actually contain an underscore.
4. International hostnames are possible although rare. International usernames have not made it through the standards process yet, but are coming soon to a server near you. ;)
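Points 1 and 3 could be addressed with a pattern along these lines (a hedged sketch, not a full RFC 822 matcher, and not the library's actual code):

```java
import java.util.regex.Pattern;

public class MailPattern {
    // Compared with the original: '+' is allowed in the local part, host
    // labels may not contain '_', and the dots are escaped so they match
    // literal dots only. Quoted local parts (point 2) are still unsupported.
    static final Pattern MAIL_REGEX = Pattern.compile(
        "[-_.+0-9A-Za-z]{1,64}"            // local part, now with '+'
      + "@[-0-9A-Za-z]{1,63}"              // first host label, no underscore
      + "(?:\\.[-0-9A-Za-z]{1,63})+");     // at least one more dotted label

    public static void main(String[] args) {
        System.out.println(MAIL_REGEX.matcher("user+tag@example.com").matches()); // true
    }
}
```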

Original issue reported on code.google.com by trejkaz on 19 Oct 2011 at 9:55

No way to get list of languages supported

Hey,

after I do 

DetectorFactory.loadProfile(new 
File(LangDetector.class.getClassLoader().getResource("profiles").toURI()));

then langlist is accessible neither in DetectorFactory nor in Detector.

There is simply no way of checking which languages it supports.

Am I missing something? Or can it really not be done?

Regards, Jakub

Original issue reported on code.google.com by liska.jakub on 17 Aug 2011 at 9:14

Profile generation problem

Hello,


I have successfully generated profiles for MT, LV, SL, ET and LT (based on the steps you mentioned in the wiki, cf. the Tools section). When I test language detection with only the MT, LV, SL and ET profiles added, these 4 new profiles are correctly loaded (no error appears). But when I add the LT profile I get the error:

GRAVE: Error
java.lang.ArrayIndexOutOfBoundsException: -1
    at com.cybozu.labs.langdetect.DetectorFactory.addProfile(DetectorFactory.java:105)
    at com.cybozu.labs.langdetect.DetectorFactory.loadProfile(DetectorFactory.java:75)
    at Main.qualityCkeck(Main.java:215)
    at Main.main(Main.java:62)

Please see enclosed the files (profiles and Wiki abstract only for LT). I use 
language-detection API version 05-09-2011.


Regards,
Emmanuel


Original issue reported on code.google.com by [email protected] on 29 Aug 2011 at 9:13

Attachments:

Detect multi languages in the same doc

Thanks for all the good work you shared

Enhancement advice is needed on the best way to detect two, three, or more languages in the same document. Any guidelines are welcome, and I will try to implement and share any results. Thanks

Original issue reported on code.google.com by [email protected] on 5 Feb 2011 at 6:17

Error: duplicate the same language profile

What steps will reproduce the problem?
1. I provide the profiles folder in the project
2. Run

What is the expected output? What do you see instead?
I get the error: duplicate the same language profile

What version of the product are you using? On what operating system?
latest version; Mac OS

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 9 Dec 2011 at 7:53

Extend DetectorFactory.loadProfile() so that it works with profiles in JAR files

I would like to propose loadProfile() to be overloaded as follows in order to 
make it possible to have the profiles/ directory stored in a JAR file and load 
its content via something like this:

      DetectorFactory.loadProfile(MyClass.class.getResource("profiles").toURI());

Here is my proposal:

  /**
   * Load profiles from specified directory. This method (or its overloaded companion) must be called once before language
   * detection.
   * 
   * @param profileDirectory
   *          profile directory path
   * @throws LangDetectException
   *           Can't open profiles(error code = {@link ErrorCode#FileLoadError}) or profile's format
   *           is wrong (error code = {@link ErrorCode#FormatError})
   */
  public static void loadProfile(String profileDirectory) throws LangDetectException
  {
    loadProfile(new File(profileDirectory).toURI());
  }

  /**
   * Load profiles from specified directory. This method (or its overloaded companion) must be called once before language
   * detection.
   * 
   * @param profileDirectory
   *          profile directory path as a URI
   * @throws LangDetectException
   *           Can't open profiles(error code = {@link ErrorCode#FileLoadError}) or profile's format
   *           is wrong (error code = {@link ErrorCode#FormatError})
   */
  public static void loadProfile(URI profileDirectory) throws LangDetectException
  {
    File dir = new File(profileDirectory);
    File[] listFiles = dir.listFiles();
    if (listFiles == null)
      throw new LangDetectException(ErrorCode.NeedLoadProfileError, "Not found profile directory: " + profileDirectory);

    int langsize = listFiles.length, index = 0;
    for (File file : listFiles)
    {
      if (file.getName().startsWith(".") || !file.isFile()) continue;
      FileInputStream is = null;
      try
      {
        is = new FileInputStream(file);
        LangProfile profile = JSON.decode(is, LangProfile.class);
        addProfile(profile, index, langsize);
        ++index;
      }
      catch (JSONException e)
      {
        throw new LangDetectException(ErrorCode.FormatError, "profile format error in '" + file.getName() + "'");
      }
      catch (IOException e)
      {
        throw new LangDetectException(ErrorCode.FileLoadError, "can't open '" + file.getName() + "'");
      }
      finally
      {
        try
        {
          if (is != null) is.close();
        }
        catch (IOException e)
        {
        }
      }
    }
  }

Original issue reported on code.google.com by [email protected] on 17 Feb 2011 at 7:16

English detected as af

What steps will reproduce the problem?
1. I am passing the text "viking river cruise" to detect the language

What is the expected output? What do you see instead?
Expected is English, but it displays "af"

What version of the product are you using? On what operating system?
latest version

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 17 Feb 2011 at 4:26

Detector returns different results when called multiple times with the same input

I'm trying to use the language-detection library to distinguish between English 
and German for very short snippets of text. I noticed that sometimes when I use 
the language detector repeatedly on the same piece of input then I get 
different results for each trial (the very first is usually right). 

Am I using the API in a wrong way? Is there anything I can do to always get 
deterministic results? 

I attached a little unit test that reproduces the problem if you set the right 
PROFILE_DIR.

What steps will reproduce the problem?
1. Load profiles, create a detector, append input, detect.
2. Create new detector, append same input again, detect.
3. Repeat this a couple of times. 

What is the expected output? What do you see instead?
I expect to constantly get the same result for the same input. 
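The variation comes from the detector's random sampling of n-grams. Later versions of the langdetect library expose DetectorFactory.setSeed (treat its availability in this Solr repackaging as an assumption); fixing the seed once after loading profiles makes runs deterministic:

```java
// Assumption: this build exposes DetectorFactory.setSeed(long).
// PROFILE_DIR is the same constant as in the attached unit test.
DetectorFactory.loadProfile(PROFILE_DIR);
DetectorFactory.setSeed(0);                 // same seed -> same sampling -> same result
Detector detector = DetectorFactory.create();
detector.append("Guten Tag");
System.out.println(detector.detect());
```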

What version of the product are you using? On what operating system?

<groupId>org.apache.solr</groupId>
<artifactId>solr-langdetect</artifactId>
<version>3.5.0</version>

on Windows 7

Thanks in advance!
Heike



Original issue reported on code.google.com by [email protected] on 28 Jan 2013 at 4:28

Attachments:

Word count utility method

Hello,


Just an idea to help people who have problems with short sentences (fewer than 10 or 15 words). It used to be a problem for me.

Perhaps add a utility method in some class (the Detector class?) to count the number of words in a String.

For example (it is the one I use - a static method):

    public static int wordCount(String line) {
        int idx = 0;
        int cnt = 0;
        while (idx < line.length()) {
            if (!Character.isLetter(line.charAt(idx))) {
                idx++;
                continue;
            }
            cnt++;
            while (idx < line.length() && Character.isLetter(line.charAt(idx))) {
                idx++;
            }
        }
        return cnt;
    }

Or, why not throw an exception directly when the String does not contain enough words (LangDetectException: "Too few words")?

But it is just an idea.

Indeed, it is a great API, with good response time and the ability to add supported languages. Thanks a lot.


Regards,
Emmanuel

Original issue reported on code.google.com by [email protected] on 29 Aug 2011 at 9:25

Detector.append(Reader) throws StringIndexOutOfBoundsException

What steps will reproduce the problem?

Reproduce with following code:

@Test
    public void langDetect(){
        final String textToDetect = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa [email protected] asdfadasdf";

        try {
            final URL profiles = Resources.getResource(getClass(), "profiles");
            LangDetector.init(new File(profiles.getPath()));

            final Detector detector = DetectorFactory.create();
            detector.append(new StringReader(textToDetect));


        } catch (LangDetectException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

What is the expected output? What do you see instead?
I expect anything but an exception.
I get this stacktrace: 

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.<init>(String.java:207)
    at com.cybozu.labs.langdetect.Detector.append(Detector.java:154)



What version of the product are you using? On what operating system?
A build from 2011-09-21. JRE 1.6b21

Please provide any additional information below.
I am attempting to use append(Reader) because the URL/Address regex in the 
append(String) will occasionally "freeze" as noted in issues 6 and 26 
(http://code.google.com/p/language-detection/issues/detail?id=26 and 
http://code.google.com/p/language-detection/issues/detail?id=6&q=append). Using 
the reader and buffer alleviates the slowdown by regexes, but is unusable with 
this out of bounds exception.

Original issue reported on code.google.com by [email protected] on 17 Jul 2012 at 2:59

Reducing heap use and object creation

I noticed some unnecessary object creation in the Detector and NGram classes, 
and made some changes at this branch: 
http://code.google.com/r/armintor-language-detection/source/list?name=tuning

My tests show improvements in speed, memory use, and a fairly drastic decrease 
in objects created.

Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 7:56

Faster language detector version

First of all, congratulations for your library it has very good performance 
even in very hard settings. 

I have implemented a faster version of the library based on your algorithm, using arrays instead of HashMaps. It runs approx. 5 to 8 times faster.

You can find the sources attached, feel free to add them to the library if you 
find them useful.


Original issue reported on code.google.com by [email protected] on 18 Jan 2011 at 5:07

Attachments:

nutch 1.4 extension point

If you want to use the plugin with the new version of Nutch, the extension point is missing.

Exception in thread "main" java.lang.RuntimeException: Plugin 
(language-detector), extension point: org.apache.nutch.searcher.QueryFilter 
does not exist.
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:84)
        at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
        at org.apache.nutch.protocol.ProtocolFactory.<init>(ProtocolFactory.java:49)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:78)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:132)

I tried to change the QueryFilter inside plugin.xml to indexer.IndexingFilter, which made the exception disappear, but I get the same result as with the language-identification plug-in ("et" while testing wikipedia.co.jp). This should not be the hardest challenge, so I expected the correct language. Is IndexingFilter maybe the wrong extension point?

Original issue reported on code.google.com by [email protected] on 15 Feb 2012 at 4:39

prior map hint

So how does one set the priorMap to give the library a hint towards a language? I have user input indicating what the language may be.

If I set a probability - it always goes in that direction. 

e.g. 

Detector detector = DetectorFactory.create();
HashMap priorMap = new HashMap();
priorMap.put("ja", new Double(0.001));
detector.setPriorMap(priorMap);
detector.append("This is an english sentence.");


I would expect the language to be detected as "en"; instead I get "ja" with a 0.99999999998567 probability. It seems to be the case for all languages: if you seed the priorMap table, the library just validates that language as the right one.

Am I not using the interface correctly?

Original issue reported on code.google.com by [email protected] on 16 Apr 2011 at 6:32

Long string without whitespace will take forever to detect

If you split up the text beforehand, it will work.

E.g. something à la:


    public static String cleanText(String text) {
        // break up big words > 40 chars into single ones
        int s = text.length();
        StringBuffer sb = new StringBuffer();
        int sindex = 0;
        for (int i = 0; i < s; i++) {
            char c = text.charAt(i);
            if (c == ' ') {
                sindex = i;
            }
            if (i - sindex > 40) {
                sindex = i;
                sb.append(" ");
            }
            sb.append(c);
        }
        return sb.toString();
    }


Original issue reported on code.google.com by [email protected] on 4 Feb 2011 at 2:37

Using short phrases leads to erroneous language detection

What steps will reproduce the problem?
1. Use the phrase "distribution agreement"

I expect it to return English, but it returns French.

I am using version 02-02-2011, on Windows 7

This seems to happen on certain short phrases, the wrong language is returned. 
Is there anything I can do to fix this?


Original issue reported on code.google.com by [email protected] on 31 Mar 2011 at 8:50

Wrong detection on ES/RO, DK/NO

What steps will reproduce the problem?

In our test job we are evaluating your tool. We are testers checking translated content in software for nearly 30 languages.
1. When we try to identify "Ingen ArcSync?-konto?Log ind Opret en konto", it considers NO the most probable language (>99.99%) rather than DK.
2. When identifying "Accepar", it determines RO (>99.99%) rather than ES. Could detection accuracy be improved in the next version? Thanks for your work.


What is the expected output? What do you see instead?
It should identify DK/NO and RO/ES more correctly.

What version of the product are you using? On what operating system?
 langdetect-09-13-2011.zip

Please provide any additional information below.
We can't get a newer profile from Wikipedia because we don't know where to get the abstract database files on the Download page. We have googled 'eswiki- -abstract.xml' OR 'Wikipedia abstract database files', but with no result.
Also, Polyglot3000, which is also a good language detection tool, returns the correct judgement, though it only provides a GUI on Windows and no API or source code.

Original issue reported on code.google.com by [email protected] on 21 Jan 2012 at 2:59

change abbreviated language profile name to longer form.

This isn't really a bug. Would it be possible to rename the short-name profiles to their corresponding long-form names? Or maybe add a way to store both long and short names, and add methods like "detectLongName" or "detectShortName"?

This would be better in case more profiles are added, so that no one depends on translating short names to long names in their own code base.

Original issue reported on code.google.com by [email protected] on 3 Jun 2011 at 8:42

Jsonic bundle problem

I'm getting the error below even though I have added jsonic.jar and jsonic.swc:
Exception in thread "main" java.lang.NoClassDefFoundError: 
net/arnx/jsonic/JSONException
        at javaapplication31.Main.main(Main.java:29)
Caused by: java.lang.ClassNotFoundException: net.arnx.jsonic.JSONException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        ... 1 more


Original issue reported on code.google.com by [email protected] on 12 Jan 2011 at 3:51

DetectorFactory.loadProfile(String) is a static method

DetectorFactory.loadProfile is a static method and, additionally, if it is called more than once, errors occur.

This causes problems for applications which are completely componentised: when you start up a language detection component, it wants to initialise DetectorFactory. When that same component initialises again later, it throws an error.

A better alternative would be that the factory be an instance which can be 
configured and then thrown away.  The second component would then be using a 
different instance, which would avoid the problem completely.


Original issue reported on code.google.com by [email protected] on 12 Sep 2011 at 2:24

Offer way to document training of models for (automated) retraining

Please offer a way to formally document how the models have been trained for a given language. Retraining, when needed, can then be done automatically as long as the relevant sources are documented.

For the normal model and the short-message model, a list of URLs, each with a short description, should be stored in a configuration file. The wikipedia/media/etc. URLs should also be stored in this file, so that they can be omitted when needed.

Original issue reported on code.google.com by [email protected] on 17 Nov 2012 at 6:41

Insanely slow performance on text with no whitespace

What steps will reproduce the problem?
1. Modify the following test case to match paths on your system (StringUtils 
comes from Commons Lang, in case you're not using it.  I was reducing the size 
of the test.)

    import java.io.File;

    import org.apache.commons.lang.StringUtils;  // Commons Lang 2.x; use org.apache.commons.lang3 for 3.x
    import org.junit.Test;

    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    @Test
    public void test() throws Exception {
        // Generate two strings which are exactly the same length (thus we should expect similar performance.)
        int testSize = 8*1024;
        String repetitiveEnglish = StringUtils.repeat("I see what you did there, dude. ", testSize/32);
        String noSpaces = StringUtils.repeat("abcdefgh", testSize/8);

        DetectorFactory.loadProfile(new File("dependencies/langdetect/profiles"));

        detect(repetitiveEnglish, 10);
        detect(noSpaces, 10);
        detect(repetitiveEnglish, 10);
        detect(noSpaces, 10);
    }

    private void detect(String text, int runs) throws Exception
    {
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < runs; i++)
        {
            Detector detector = DetectorFactory.create();
            detector.append(text);
            String result = detector.detect();
        }
        long t1 = System.currentTimeMillis();
        System.out.println(String.format("%s ms", t1-t0));
    }

2. Run and watch the output.

I get the following:

219 ms
9066 ms
36 ms
8999 ms

So whereas normal English-like text gets down to 3.6ms or so per detection, 
detecting the language of text with no whitespace costs nearly a second per go. 
 When you're trying to identify millions of documents, that adds up pretty fast.


Original issue reported on code.google.com by [email protected] on 17 Oct 2011 at 5:55
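Until the n-gram handling itself is fixed, one application-side mitigation is to cap how much text reaches detector.append(). The helper below is a stdlib-only sketch (the CapInput class and the 1024-character limit are illustrative choices, not part of the langdetect API), trading accuracy on very long inputs for bounded cost.

```java
// Sketch: bound detection cost by truncating input before append().
public class CapInput {
    // Returns at most maxChars characters of text, avoiding a copy
    // when the input is already short enough.
    public static String cap(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    public static void main(String[] args) {
        String noSpaces = "abcdefgh".repeat(1024);  // 8192 chars
        String capped = cap(noSpaces, 1024);
        System.out.println(capped.length());  // 1024
        // Then: detector.append(capped); instead of detector.append(noSpaces);
    }
}
```

Since the reported cost grows with input length, capping at 1 KB should cut the ~900 ms per call roughly in proportion, at the price of judging only a prefix of the document.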

net.arnx.jsonic.JSON not found exception

I just made an API call to loadProfile using a DetectorFactory and supplied 
"trunk/profile" as the argument (as suggested in the tutorial). It reports that 
the net.arnx.jsonic.JSON and net.arnx.jsonic.JSONException classes are both 
missing. DetectorFactory imports them, and I guess these classes haven't been 
bundled in the LangDetect jar file?

I'm using langdetect-05-09-2011.jar with JavaSE 1.6 on Mac OS X (Snow Leopard). 
I just downloaded jsonic-1.2.5.zip but am wondering where and how to integrate 
it with langdetect before making the API call.

Thanks!

Original issue reported on code.google.com by [email protected] on 14 May 2011 at 11:56

Reducing the problem to binary classification (identify given text english or not)

Hi,

I just want to detect whether a given text is in English or not; for my 
problem, I am not interested in identifying the exact language of the text. I 
understand that reducing the number of target languages will increase 
accuracy. Can I therefore infer that predicting only whether the text is 
English, instead of predicting the exact language, will increase accuracy? If 
so, how can I tweak this package to detect whether a given text is English?

Regards,
Vamsi

Original issue reported on code.google.com by [email protected] on 22 Mar 2011 at 2:57
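One way to frame this with the package as-is: keep all profiles loaded (so non-English text has other languages to fall into) and collapse the output of detector.getProbabilities() into a yes/no answer. The helper below is a stdlib-only sketch operating on a plain probability map; the EnglishCheck class and the 0.5 threshold are illustrative assumptions you would tune on your own data.

```java
import java.util.Map;

// Sketch: reduce a language-probability distribution to a binary
// English / not-English decision.
public class EnglishCheck {
    // probs maps ISO language codes (e.g. "en", "fr") to probabilities,
    // as obtained from Detector.getProbabilities().
    public static boolean isEnglish(Map<String, Double> probs, double threshold) {
        return probs.getOrDefault("en", 0.0) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(isEnglish(Map.of("en", 0.97, "nl", 0.03), 0.5));  // true
        System.out.println(isEnglish(Map.of("de", 0.99), 0.5));              // false
    }
}
```

Dropping the non-English profiles entirely would instead force every input toward "en", so keeping the full profile set and thresholding afterwards is likely the safer tweak.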

Singleton implementation of DetectorFactory prohibits Detectors using different profiles

First of all: thank you for making this software publicly available! I really 
appreciate it ;)

Now that you provide two different types of language profiles (based on 
wikipedia and twitter, respectively), I'd like to be able to keep two language 
detectors alive simultaneously: one relying on the twitter profiles for 
classifying short texts, and one relying on the wikipedia profiles for 
classifying long ones (this is somewhat related to issue 25). However, the 
current implementation of the DetectorFactory does not allow me to set up more 
than one type of Detector.

I'd like to be able to do something similar to the following (where 
detectorShort and detectorLong would use language profiles for (approximately) 
the same languages, originating from different sources):

// set-up two different detectors
DetectorFactory.loadProfiles(LanguageProfile.SHORT_TEXTS, 
"/path/to/shorttextprofiles");
DetectorFactory.loadProfiles(LanguageProfile.LONG_TEXTS, 
"/path/to/longtextprofiles");

...

// get the detector for handling short texts
Detector detectorShort = DetectorFactory.create(LanguageProfile.SHORT_TEXTS);

// get the detector for handling longer texts
Detector detectorLong = DetectorFactory.create(LanguageProfile.LONG_TEXTS);

...

Is this behaviour something you would consider in a future release?

Kind regards

Original issue reported on code.google.com by [email protected] on 15 Mar 2012 at 6:44

If no NGrams found, return no language or empty list instead of throwing an exception

What steps will reproduce the problem?
1. Create a Detector with the language profiles bundled with this library
2. call detector.append("")
3. call detector.getProbabilities()

What is the expected output? What do you see instead?
Should return an empty list, instead throws a LangDetectException("no features 
in text")

What version of the product are you using? On what operating system?
langdetect-1.1-20120112

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 20 Sep 2012 at 5:23
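Until the library returns an empty list itself, callers can avoid the exception with a pre-check. The sketch below is stdlib-only and approximates "no features" as "contains no letter characters", which covers the empty-string case reported here; the FeatureCheck class is invented for illustration and is a heuristic, not a reimplementation of the library's feature extraction.

```java
// Sketch: skip detection when the input cannot contain any n-gram features.
public class FeatureCheck {
    // True if text contains at least one letter (by Unicode category);
    // null, empty, or punctuation-only strings return false.
    public static boolean hasDetectableText(String text) {
        if (text == null) return false;
        return text.codePoints().anyMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        System.out.println(hasDetectableText(""));        // false
        System.out.println(hasDetectableText("123 !?"));  // false
        System.out.println(hasDetectableText("hello"));   // true
        // Usage: call detector.getProbabilities() only when
        // hasDetectableText(input) is true; otherwise treat the
        // result as an empty list.
    }
}
```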
