This project forked from life4/textdistance

Compute distance between sequences. 30+ algorithms, pure python implementation, common interface.

License: GNU Lesser General Public License v3.0

Shell 0.15% Python 99.85%

TextDistance

TextDistance is a Python library for computing the distance between two or more sequences using many algorithms.

Features:

  • 30+ algorithms
  • Pure Python implementation
  • Simple usage
  • Comparison of more than two sequences
  • Some algorithms have more than one implementation in one class
  • Optional NumPy usage for maximum speed

Algorithms

Edit based

| Algorithm | Class | Functions |
|---|---|---|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |

Token based

| Algorithm | Class | Functions |
|---|---|---|
| Jaccard index | Jaccard | jaccard |
| Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
| Tversky index | Tversky | tversky |
| Overlap coefficient | Overlap | overlap |
| Tanimoto distance | Tanimoto | tanimoto |
| Cosine similarity | Cosine | cosine |
| Monge-Elkan | MongeElkan | monge_elkan |
| Bag distance | Bag | bag |

Sequence based

| Algorithm | Class | Functions |
|---|---|---|
| Longest common subsequence similarity | LCSSeq | lcsseq |
| Longest common substring similarity | LCSStr | lcsstr |
| Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |

Compression based

Work in progress. Currently, all algorithms compare two strings as arrays of bits, not character by character.

NCD stands for normalized compression distance.

Functions:

  1. bz2_ncd
  2. lzma_ncd
  3. arith_ncd
  4. rle_ncd
  5. bwtrle_ncd
  6. zlib_ncd
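
The idea behind NCD can be sketched with the standard library alone. The snippet below illustrates the commonly used formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is not this library's implementation: the helper names `compressed_size` and `ncd_sketch` are made up for the example, and it works on raw bytes with zlib only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (hypothetical helper)."""
    return len(zlib.compress(data))


def ncd_sketch(x: bytes, y: bytes) -> float:
    """Illustrative NCD: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    # zlib output always includes a header, so max(cx, cy) is never zero.
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 20
noise = bytes(range(256)) * 4

print(ncd_sketch(text, text))   # small: the second copy compresses almost for free
print(ncd_sketch(text, noise))  # larger: the inputs share little structure
```

Real compressors are imperfect, so the value for identical inputs is close to zero rather than exactly zero.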

Phonetic

Algorithm Class Functions
MRA MRA mra
Editex Editex editex

Simple

Algorithm Class Functions
Prefix similarity Prefix prefix
Postfix similarity Postfix postfix
Length distance Length length
Identity similarity Identity identity
Matrix similarity Matrix matrix

Installation

Stable:

pip install textdistance

Dev:

pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance

Usage

All algorithms have 2 interfaces:

  1. Class with algorithm-specific params for customizing.
  2. Class instance with default params for quick and simple usage.
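
The two interfaces follow a common pattern: a configurable class plus a ready-made instance built with default parameters. The sketch below shows that pattern in plain Python. `HammingSketch`, `hamming_sketch`, and the `truncate` option are stand-ins chosen for illustration and should not be read as the library's actual code.

```python
from itertools import zip_longest


class HammingSketch:
    """Illustrative stand-in for a configurable algorithm class."""

    def __init__(self, truncate=False):
        # truncate=True compares only up to the shorter sequence (assumed option).
        self.truncate = truncate

    def __call__(self, s1, s2):
        pairs = zip(s1, s2) if self.truncate else zip_longest(s1, s2)
        # Count positions where the two sequences disagree.
        return sum(a != b for a, b in pairs)


# Interface 2: an instance created once with default parameters.
hamming_sketch = HammingSketch()

print(hamming_sketch("test", "text"))               # 1
print(hamming_sketch("test", "tex"))                # 2, the missing char counts
# Interface 1: the class itself, customized per use.
print(HammingSketch(truncate=True)("test", "tex"))  # 1
```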

All algorithms have some common methods:

  1. .distance(*sequences) -- calculate distance between sequences.
  2. .similarity(*sequences) -- calculate similarity for sequences.
  3. .maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
  4. .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
  5. .normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
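
How these five methods relate to each other can be shown with a small stand-alone sketch. It uses prefix similarity (length of the shared prefix) as the underlying measure and accepts any number of sequences. The function names are hypothetical and the code only mirrors the relationships described above; it is not taken from the library.

```python
from itertools import takewhile


def prefix_similarity(*sequences):
    """Length of the prefix shared by all sequences."""
    columns = zip(*sequences)
    return sum(1 for _ in takewhile(lambda col: len(set(col)) == 1, columns))


def maximum(*sequences):
    """Largest possible value: the length of the longest sequence."""
    return max(len(s) for s in sequences)


def prefix_distance(*sequences):
    # distance + similarity == maximum
    return maximum(*sequences) - prefix_similarity(*sequences)


def normalized_distance(*sequences):
    m = maximum(*sequences)
    return prefix_distance(*sequences) / m if m else 0.0


def normalized_similarity(*sequences):
    return 1.0 - normalized_distance(*sequences)


seqs = ("test", "text", "tent")
print(prefix_similarity(*seqs))      # 2 ("te")
print(prefix_distance(*seqs))        # 2
print(maximum(*seqs))                # 4
print(normalized_distance(*seqs))    # 0.5
print(normalized_similarity(*seqs))  # 0.5
```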

Most common init arguments:

  1. qval -- q-value for splitting sequences into q-grams. Possible values:
    • 1 (default) -- compare sequences by chars.
    • 2 or more -- transform sequences to q-grams.
    • None -- split sequences by words.
  2. as_set -- for token-based algorithms:
    • True -- t and ttt are equal.
    • False (default) -- t and ttt are different.
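
To make the effect of these two arguments concrete, here is a minimal sketch of q-gram tokenization and of a Jaccard-style similarity with and without set semantics. `tokenize` and `jaccard_sketch` are hypothetical helpers written for this illustration; the library's own tokenization may differ in details.

```python
from collections import Counter


def tokenize(sequence, qval=1):
    """Split a string into chars (1), q-grams (2 or more), or words (None)."""
    if qval is None:
        return sequence.split()
    if qval == 1:
        return list(sequence)
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def jaccard_sketch(s1, s2, qval=1, as_set=False):
    """Multiset Jaccard similarity; as_set=True ignores repeated tokens."""
    c1, c2 = Counter(tokenize(s1, qval)), Counter(tokenize(s2, qval))
    if as_set:
        # Collapse every token count to 1.
        c1, c2 = Counter(set(c1)), Counter(set(c2))
    union = sum((c1 | c2).values())
    if union == 0:
        return 1.0
    return sum((c1 & c2).values()) / union


print(tokenize("test", qval=2))                   # ['te', 'es', 'st']
print(tokenize("hello big world", qval=None))     # ['hello', 'big', 'world']
print(jaccard_sketch("t", "ttt"))                 # 0.333..., counts differ
print(jaccard_sketch("t", "ttt", as_set=True))    # 1.0, same token set
```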

Example

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms share the same interface.
