Spanish support? (clj-fuzzy issue, closed)

yomguithereal commented on June 26, 2024
Spanish support?

from clj-fuzzy.

Comments (13)

Yomguithereal commented on June 26, 2024

Hello @demian85. I guess you mean to ask whether the library has a stemmer for the Spanish language. Unfortunately, it does not have one yet. Using the Schinke stemmer on Spanish text will indeed produce only garbage, since that algorithm targets Latin.

However, I am currently working on Talisman, a much broader library than this one (written in JavaScript, not Clojure), and I can probably implement some kind of Spanish stemmer there soon (one of those used by Lucene, I think). Tell me if this would suit your use case.

The stemmers I found for Spanish are the Martin Porter one in Snowball & the UniNe one used by Lucene.

demian85 commented on June 26, 2024

It turns out that what I'm looking for is an inflector: I just want a way to normalize a string. More specifically, I need to singularize nouns in Spanish.

Yomguithereal commented on June 26, 2024

Ok. The UniNe stemmer might be of some use to you then. It performs very simple stemming and will probably strip most plural forms (though it won't inflect them in a grammatically correct way).

Here is how it works:

  • Deburr the string (strip accents)
  • If the string is less than 5 characters long, leave it unchanged
  • Otherwise, drop a final o, a or e
  • Handle a final s as follows:
if (s[len-2] == 'e' && s[len-3] == 's' && s[len-4] == 'e')
  return len-2;
if (s[len-2] == 'e' && s[len-3] == 'c') {
  s[len-3] = 'z';
  return len - 2;
}
if (s[len-2] == 'o' || s[len-2] == 'a' || s[len-2] == 'e')
  return len - 2;
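
The steps above can be sketched in JavaScript as follows. This is only an illustration of the description, not the Lucene or Talisman code: the function name `lightSpanishStem` is made up, and deburring through Unicode normalization is an assumption about what "deburr" covers (it also turns ñ into n).

```javascript
// Hypothetical sketch of the light stemming steps listed above.
function lightSpanishStem(word) {
  // Deburr: decompose accented characters, then strip the combining marks.
  const s = word.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

  // Words shorter than 5 characters are left unchanged.
  if (s.length < 5) return s;

  const last = s[s.length - 1];

  // Drop a final o, a or e.
  if (last === 'o' || last === 'a' || last === 'e') return s.slice(0, -1);

  if (last === 's') {
    // "-eses": drop the final "es".
    if (s.endsWith('eses')) return s.slice(0, -2);
    // "-ces": replace with "z".
    if (s.endsWith('ces')) return s.slice(0, -3) + 'z';
    // "-os", "-as", "-es": drop both characters.
    if (/[oae]s$/.test(s)) return s.slice(0, -2);
  }

  return s;
}
```

For example, `lightSpanishStem('luces')` gives `luz`, while `lightSpanishStem('casas')` gives `cas`, which shows that the result is a stem rather than a proper singular form.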

Yomguithereal commented on June 26, 2024

Otherwise, here is the code of a Python inflector for the Spanish language.

Yomguithereal commented on June 26, 2024

What are you trying to achieve specifically here? Fuzzy matching? Clustering?

demian85 commented on June 26, 2024

I'm using MongoDB, but its full-text search is not smart enough to cover edge cases. I cannot find a way to match all terms using AND without losing stemming and other features.
I'm just planning to store a normalized string and search for equality.
Thanks for the Python version. Do you know of any JS implementation?

Yomguithereal commented on June 26, 2024

If you tell me the Python inflector works for you and solves your problem, I can implement it in Talisman, but I'll need some time to do so.

Yomguithereal commented on June 26, 2024

Ok, I just implemented both the UniNe stemmer and the Python inflector in Talisman, @demian85. Here is how to use them:

npm install talisman
// The stemmer
const stemmer = require('talisman/stemmers/spanish/unine');

// The inflector
const inflector = require('talisman/inflectors/spanish/noun').singularize;
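
For the use case discussed earlier in this thread (storing a normalized string and matching on equality), a helper along these lines might be enough. This is a minimal sketch: `buildSearchKey` and the stand-in `naive` normalizer are hypothetical names. In practice, the stemmer or `singularize` function required above would be passed in instead, assuming each takes a single word and returns a string.

```javascript
// Hypothetical helper: builds a normalized search key from free text.
// `normalizeWord` is any function mapping one word to its normalized form,
// for instance a stemmer or a singularizer.
function buildSearchKey(text, normalizeWord) {
  return text
    .toLowerCase()
    .split(/\s+/)       // split on runs of whitespace
    .filter(Boolean)    // drop empty tokens from leading/trailing spaces
    .map(normalizeWord)
    .join(' ');
}

// Stand-in normalizer for illustration only: strips a trailing "s".
const naive = (w) => w.replace(/s$/, '');

console.log(buildSearchKey('Casas  Rojas', naive)); // prints "casa roja"
```

The same function would be applied both when storing a document and when building the query, so that both sides are compared in the same normalized form.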

demian85 commented on June 26, 2024

Great! I'll give it a try! Thanks!

Yomguithereal commented on June 26, 2024

I'll close this issue. Open one on talisman if you have any problem.

Yomguithereal commented on June 26, 2024

So did it work for you?

Yomguithereal commented on June 26, 2024

@demian85, you never told me whether this worked for you or whether there were things to fix.

demian85 commented on June 26, 2024
