russianmorphology's Introduction

Russian Morphology for Apache Lucene

Russian and English morphology for Java and the Apache Lucene 9.3 framework, based on the open-source dictionary from the АОТ site. It uses dictionary-based morphology with some heuristics for unknown words. It supports homonyms: for example, for the Russian word "вина" it returns two variants, "вино" and "вина".
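
The homonym behaviour can be pictured with a toy lookup table. The sketch below is not the library's implementation: the class `ToyDictionary`, its method `normalForms`, and the single hard-coded entry are hypothetical, and returning an unknown word unchanged merely stands in for the real heuristics.

```java
import java.util.List;
import java.util.Map;

// Toy illustration only: a hypothetical in-memory dictionary,
// not the data structure the library actually uses.
final class ToyDictionary {
    // One surface form may map to several base forms (homonyms).
    private static final Map<String, List<String>> FORMS =
            Map.of("вина", List.of("вино", "вина"));

    // Returns every known base form of the word. Unknown words fall back
    // to the word itself (a placeholder for the real heuristics).
    static List<String> normalForms(String word) {
        return FORMS.getOrDefault(word, List.of(word));
    }
}
```

Calling `ToyDictionary.normalForms("вина")` yields both base forms, which is why a search index built this way can match either reading of the word.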

How to use

Build the project by running mvn clean package. This produces the latest version of the artifacts, 1.5; add them to your classpath. You can choose which version to use, Russian or English.

Now you can create a Lucene Analyzer:

  RussianAnalyzer russian = new RussianAnalyzer();
  EnglishAnalyzer english = new EnglishAnalyzer();

You can write your own analyzer using a filter that converts words to their base forms.

  LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
  // "result" is the TokenStream produced by the preceding analysis steps.
  TokenStream tokenStream = new MorphologyFilter(result, luceneMorph);

Because a LuceneMorphology usually holds a lot of data that it needs to function, it is better not to create a new instance for each MorphologyFilter; create one and share it.

If you need a list of the base forms of a word, you can use the following example:

 LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
 // Each returned entry carries a base form plus grammatical info.
 List<String> wordBaseForms = luceneMorph.getMorphInfo(word);
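
Output reported by users looks like `авария|G С жр,ед,им`, that is, a base form, a `|` separator, and grammatical information. Assuming that shape holds, the sketch below pulls out just the distinct base forms. The class `MorphInfoEntries` and the method `baseForms` are hypothetical helpers, not part of the library.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper; assumes entries shaped like "base|info".
final class MorphInfoEntries {
    // Extracts the distinct base forms, preserving first-seen order.
    static List<String> baseForms(List<String> morphInfo) {
        LinkedHashSet<String> forms = new LinkedHashSet<>();
        for (String entry : morphInfo) {
            int bar = entry.indexOf('|');
            // Entries without a separator are kept whole rather than dropped.
            forms.add(bar >= 0 ? entry.substring(0, bar) : entry);
        }
        return new ArrayList<>(forms);
    }
}
```

A `LinkedHashSet` is used because several entries for one word may share the same base form and differ only in their grammatical information.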

Solr

You can use LuceneMorphology as a morphology filter in a Solr schema.xml through MorphologyFilterFactory:

<fieldType name="content" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.morphology.MorphologyFilterFactory" language="Russian"/>
    <filter class="org.apache.lucene.analysis.morphology.MorphologyFilterFactory" language="English"/>
  </analyzer>
</fieldType>

Just add morphology-1.5.jar to one of your Solr lib directories.

Restrictions

  • It works only with UTF-8.
  • It assumes that the letters е and ё are the same.
  • Word forms with prefixes, such as "наибольший", are treated as separate words.
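
Since the library treats е and ё as the same letter, application code that compares its output with other text may want to fold the two letters in the same way. A minimal sketch, assuming plain string replacement is enough for your data; `YoNormalizer` and `foldYo` are hypothetical names, not part of the library.

```java
// Hypothetical helper for application code; not part of the library.
final class YoNormalizer {
    // Folds "ё"/"Ё" into "е"/"Е" so both spellings compare equal.
    static String foldYo(String text) {
        return text.replace("ё", "е").replace("Ё", "Е");
    }
}
```

Applying `foldYo` to both sides of a comparison makes "ёлка" and "елка" equal, for instance.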

License

Apache License, Version 2.0

russianmorphology's People

Contributors

akuznetsov, bannikovilea, grossws, imotov, jlleitschuh, mysterionrise, phront

russianmorphology's Issues

Add licensing information

Could you add some licensing info to the repo (a LICENSE file and a short mention in README.md of which license is used)?

"Аминовен" is converted to "амин"

Is the suffix "овен" stripped from every out-of-dictionary word, or is the word perhaps being interpreted as an accusative plural?

Could the heuristic also return the original word form alongside the transformed variant? Or do something slightly smarter: try to guess when the word form is fairly likely to be in the nominative case already.

Binary releases for Lucene 7.x?

I noticed that the last binary release of this library was based on Lucene v5.1.0. Support for 6.x was added in the master branch, with instructions on how to build the library, but a new version was never released. So I am curious whether there are any plans to provide binary releases of this library in the future.

Question

Where can I find a definition of the response structure?
For example, in '[авария|G С жр,ед,им]', what does G mean?
The rest is more or less clear, but G is not.
More examples:
[авиасообщение|K С ср,ед,им] - why is there a K here?
[авитаминоз|A С мр,ед,им] - why is there an A here?

morph-1.2.jar - 404 error

The file linked from README.md cannot be downloaded; there is no such file on Bintray.

Lucene.NET support?

Hello!
I stumbled upon this project while looking for a better Russian analyzer for Lucene.NET. This might be it, except that it is in Java...
Is there any plan to support a .NET version as well, e.g. the same way Lucene.NET itself appeared, by automatically converting the Java code to C#, or in any other way?
Thanks in advance.

The query "ев" matches "ели".

Hello!

I use the Elasticsearch plugin analysis-morphology, which, as I understand it, uses your library.
Something curious happens with it:
"ели" produces the tokens "ель" and "есть".
"ев" produces the tokens "ева" and "есть".

imotov/elasticsearch-analysis-morphology#19

The masculine form of the surname Аккуратов loses its last two letters

Hello!

The feminine form of the surname, 'аккуратова', is converted to "аккуратов". That is fine.
But the masculine form, 'аккуратов', is converted to "аккурат", which is a different word altogether.

Other surnames seem to work fine. For example, "петров" yields two tokens, "петров" and "петр".

Is this a bug, or a peculiarity of the morphology?
