TextDistance -- Python library for comparing the distance between two or more sequences using many algorithms.
Features:
- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- Some algorithms have more than one implementation in one class
- Optional numpy usage for maximum speed
Algorithm | Class | Functions |
---|---|---|
Hamming | Hamming | hamming |
MLIPNS | Mlipns | mlipns |
Levenshtein | Levenshtein | levenshtein |
Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
Strcmp95 | StrCmp95 | strcmp95 |
Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
Gotoh | Gotoh | gotoh |
Smith-Waterman | SmithWaterman | smith_waterman |
Algorithm | Class | Functions |
---|---|---|
Jaccard index | Jaccard | jaccard |
Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
Tversky index | Tversky | tversky |
Overlap coefficient | Overlap | overlap |
Tanimoto distance | Tanimoto | tanimoto |
Cosine similarity | Cosine | cosine |
Monge-Elkan | MongeElkan | monge_elkan |
Bag distance | Bag | bag |
Algorithm | Class | Functions |
---|---|---|
Longest common subsequence similarity | LCSSeq | lcsseq |
Longest common substring similarity | LCSStr | lcsstr |
Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |
Compression-based algorithms are a work in progress. Currently, all of them compare two strings as arrays of bits, not character by character.
NCD -- normalized compression distance.
Functions:
- bz2_ncd
- lzma_ncd
- arith_ncd
- rle_ncd
- bwtrle_ncd
- zlib_ncd
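To make the idea behind normalized compression distance concrete, here is a minimal sketch of the commonly cited formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It uses the standard-library zlib module and does not call this library; the helper names `compressed_size` and `ncd` are hypothetical. Since the functions listed above work on bits, their values may differ from this sketch.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Size in bytes of the zlib-compressed input (hypothetical helper).
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    # Textbook NCD formula; an illustration, not this library's implementation.
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = b"abracadabra " * 50
b = b"the quick brown fox " * 50

# A sequence compared with itself should score lower (closer)
# than two unrelated sequences.
print(ncd(a, a) < ncd(a, b))
```

Real compressors add fixed overhead, so comparing a sequence with itself usually gives a small positive value rather than exactly 0.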
Algorithm | Class | Functions |
---|---|---|
MRA | MRA | mra |
Editex | Editex | editex |
Algorithm | Class | Functions |
---|---|---|
Prefix similarity | Prefix | prefix |
Postfix similarity | Postfix | postfix |
Length distance | Length | length |
Identity similarity | Identity | identity |
Matrix similarity | Matrix | matrix |
Stable:
pip install textdistance
Dev:
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
All algorithms have 2 interfaces:
- Class with algorithm-specific params for customizing.
- Class instance with default params for quick and simple usage.
All algorithms have some common methods:
- .distance(*sequences) -- calculate distance between sequences.
- .similarity(*sequences) -- calculate similarity for sequences.
- .maximum(*sequences) -- maximum possible value for distance and similarity. For any sequence: distance + similarity == maximum.
- .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.
- .normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.
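The relationship between these methods can be sketched in plain Python for Hamming distance. This is an illustration written for this README, not the library's code; the function names are hypothetical, and it assumes that positions present in only one sequence count as mismatches.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel that never equals a real element


def hamming_distance(a, b):
    # Count positions where the sequences differ; a position that exists
    # in only one sequence is treated as a mismatch (assumption).
    return sum(x != y for x, y in zip_longest(a, b, fillvalue=_MISSING))


def hamming_maximum(a, b):
    # Largest possible distance: every position differs.
    return max(len(a), len(b))


def hamming_similarity(a, b):
    # Follows from distance + similarity == maximum.
    return hamming_maximum(a, b) - hamming_distance(a, b)


def normalized_distance(a, b):
    maximum = hamming_maximum(a, b)
    return hamming_distance(a, b) / maximum if maximum else 0.0


def normalized_similarity(a, b):
    return 1.0 - normalized_distance(a, b)


print(hamming_distance("test", "text"))       # 1
print(hamming_similarity("test", "text"))     # 3
print(normalized_distance("test", "text"))    # 0.25
print(normalized_similarity("test", "text"))  # 0.75
```

Dividing by the maximum is what maps the raw values into the 0 to 1 range.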
Most common init arguments:
- qval -- q-value for splitting sequences into q-grams. Possible values:
  - 1 (default) -- compare sequences by chars.
  - 2 or more -- transform sequences to q-grams.
  - None -- split sequences by words.
- as_set -- for token-based algorithms:
  - True -- 't' and 'ttt' are equal.
  - False (default) -- 't' and 'ttt' are different.
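As a rough illustration of what these two arguments mean, the sketch below splits a string according to a q-value and computes a Jaccard index on either multisets or sets. It is independent of the library's internals: `split_sequence` and `jaccard` are hypothetical helpers, and overlapping q-grams are an assumption.

```python
from collections import Counter


def split_sequence(text, qval=1):
    # Hypothetical helper mirroring the three qval modes described above.
    if qval is None:
        return text.split()  # split by words
    if qval == 1:
        return list(text)    # compare by chars
    # Overlapping q-grams (assumption): 'test' with qval=2 -> te, es, st
    return [text[i:i + qval] for i in range(len(text) - qval + 1)]


def jaccard(a, b, as_set=False):
    # Jaccard index on multisets (default) or on sets (as_set=True).
    if as_set:
        a, b = set(a), set(b)
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 1.0


print(split_sequence("test", 2))         # ['te', 'es', 'st']
print(split_sequence("text", 2))         # ['te', 'ex', 'xt']
print(split_sequence("a b c", None))     # ['a', 'b', 'c']
print(jaccard("t", "ttt", as_set=True))  # 1.0
print(jaccard("t", "ttt") < 1.0)         # True
```

With qval=2, 'test' and 'text' share only the first of their three bigrams, so a position-by-position comparison of the bigrams finds two mismatches.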
For example, Hamming distance:
import textdistance
textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
All other algorithms have the same interface.