🤘 Lemmy

Lemmy is a lemmatizer for Danish 🇩🇰 and Swedish 🇸🇪. It comes ready for use. The Danish model is trained on Dansk Sprognævn's (DSN) word list (‘fuldformliste’) and the Danish Universal Dependencies. The Swedish model is trained on SALDO's morphology dataset and the Swedish Universal Dependencies (Talbanken). Lemmy also supports training on your own dataset.

The models included in Lemmy were evaluated on the respective Universal Dependencies dev datasets. The Danish model scored > 99% accuracy, while the Swedish model scored > 97%. All reported scores were obtained when supplying Lemmy with POS tags.

You can use Lemmy as a spaCy extension, more specifically as a spaCy pipeline component. This is highly recommended and makes the lemmas easily accessible from the spaCy tokens. Lemmy uses POS tags to predict the lemmas. When wired into the spaCy pipeline, Lemmy can rely on spaCy’s built-in POS tagger.

Lemmy can also be used without spaCy, as a standalone lemmatizer. In that case, you will have to provide the POS tags yourself. Alternatively, you can use Lemmy without POS tags, though the accuracy will most likely suffer. Currently, only the Danish version of Lemmy includes a model trained for use without POS tags. That is, if you want to use Lemmy on Swedish text without POS tags, you must train your own Lemmy model.

Lemmy is heavily inspired by the CST Lemmatizer for Danish.

Install

pip install lemmy

Basic Usage Without POS tags

import lemmy

# Create an instance of the standalone lemmatizer.
lemmatizer = lemmy.load("da")

# Find lemma for the word 'akvariernes'. First argument is an empty POS tag.
lemmatizer.lemmatize("", "akvariernes")

Basic Usage With POS tags

import lemmy

# Create an instance of the standalone lemmatizer.
# Replace 'da' with 'sv' for the Swedish lemmatizer.
lemmatizer = lemmy.load("da")

# Find lemma for the word 'akvariernes'. First argument is the user-provided POS tag.
lemmatizer.lemmatize("NOUN", "akvariernes")
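
The two standalone calls above differ only in the POS argument, so a small wrapper can handle a mix of tagged and untagged words. The following is a minimal sketch, not part of Lemmy: `lemmatize_tokens` and `demo` are hypothetical helper names, and the only Lemmy calls it relies on are `lemmy.load` and `lemmatizer.lemmatize(pos, word)` as used above. Whatever `lemmatize` returns is passed through unchanged.

```python
from typing import Any, Iterable, List, Optional, Tuple


def lemmatize_tokens(
    lemmatizer: Any, tokens: Iterable[Tuple[str, Optional[str]]]
) -> List[Tuple[str, Any]]:
    """Lemmatize (word, pos) pairs; hypothetical helper, not part of Lemmy.

    `lemmatizer` is any object exposing `lemmatize(pos, word)`.
    A missing POS tag (None) is passed as an empty string.
    """
    results = []
    for word, pos in tokens:
        results.append((word, lemmatizer.lemmatize(pos or "", word)))
    return results


def demo() -> None:
    """Run the helper against the Danish model (requires `pip install lemmy`)."""
    import lemmy

    lemmatizer = lemmy.load("da")
    tokens = [("akvariernes", "NOUN"), ("akvariernes", None)]
    for word, result in lemmatize_tokens(lemmatizer, tokens):
        print(word, result)
```

Call `demo()` once Lemmy is installed. Untagged words should only be passed to the Danish model, since the bundled Swedish model expects POS tags.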

Usage with spaCy Model

import da_custom_model as da # replace da_custom_model with name of your spaCy model
import lemmy.pipe
nlp = da.load()

# Create an instance of Lemmy's pipeline component for spaCy.
# Replace 'da' with 'sv' for the Swedish lemmatizer.
pipe = lemmy.pipe.load('da')

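# Add the component to the spaCy pipeline, right after the POS tagger.
# Passing the component object directly is the spaCy v2 calling convention.
nlp.add_pipe(pipe, after='tagger')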

# Lemmas can now be accessed using the `._.lemmas` attribute on the tokens.
nlp("akvariernes")[0]._.lemmas

Training

The notebooks folder contains examples showing how to train your own model using Lemmy.
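
After training a model, you may want an accuracy figure comparable to the ones reported for the bundled models. The sketch below is one way to compute it and is not taken from the Lemmy code base: `accuracy` is a hypothetical helper, and it assumes that `lemmatize(pos, word)` returns either a single string or a list of candidate lemmas, counting a prediction as correct only when the first candidate equals the gold lemma. The scoring rule used for the reported numbers may differ.

```python
from typing import Any, Iterable, Tuple


def accuracy(lemmatizer: Any, examples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (pos, word, gold_lemma) examples lemmatized correctly.

    Hypothetical helper. Assumption: the first candidate returned by
    `lemmatize(pos, word)` is treated as the prediction.
    """
    total = 0
    correct = 0
    for pos, word, gold in examples:
        result = lemmatizer.lemmatize(pos, word)
        # Accept both a bare string and a list of candidates.
        candidates = [result] if isinstance(result, str) else list(result)
        if candidates and candidates[0] == gold:
            correct += 1
        total += 1
    # Avoid division by zero on an empty evaluation set.
    return correct / total if total else 0.0
```

The examples would typically come from a held-out set, such as the (POS tag, word form, lemma) columns of a Universal Dependencies dev file.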
