A phonetic matching library. Includes text utilities to do string comparisons on phonemes (the sound of the string), as opposed to characters.

License: MIT License

Python 2.69% C++ 35.30% TypeScript 17.99% C# 44.02%

Introduction

Docs can be found at: https://microsoft.github.io/PhoneticMatching/

Supported API:

  • C++
  • Node.js (>=8.11.2)
  • C# .NET Core (>=2.1)

Supported Languages

  • English

Pre-built binaries are currently offered for the following targets, to save the trouble of compiling the source locally:

  • node-v{72,67,64,59,57}-{win32,linux,darwin}-{x64}

(Run node -p "process.versions.modules" to see which Node ABI is in use.)
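As a minimal sketch, the check against the list above could be scripted as follows. The ABI numbers, platforms, and architecture are copied from that list; the helper name hasPrebuiltBinary is hypothetical and is not part of the library.

```javascript
// ABI versions, platforms and architectures copied from the pre-built list above.
const PREBUILT = {
    abis: ["72", "67", "64", "59", "57"],
    platforms: ["win32", "linux", "darwin"],
    archs: ["x64"],
};

// Hypothetical helper (not part of phoneticmatching): returns true when a
// pre-built binary is listed for the given ABI/platform/arch combination.
function hasPrebuiltBinary(abi, platform, arch) {
    return PREBUILT.abis.includes(String(abi))
        && PREBUILT.platforms.includes(platform)
        && PREBUILT.archs.includes(arch);
}

// Check the Node.js process that is running this script.
const supported = hasPrebuiltBinary(process.versions.modules, process.platform, process.arch);
console.log(supported ? "pre-built binary listed" : "expect a source build");
```

When no binary is listed for the running combination, the install is expected to fall back to compiling from source.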

Getting Started

This repository consists of TypeScript and native dependencies built with node-gyp. See package.json for various scripts for the development process.

When building for the first time, remember to run npm install.

This repository uses git submodules. If paths are outdated or non-existent, run git submodule update --init --recursive.

Install

To install from npm:

npm install phoneticmatching

Usage

See the typings for more details.
Classes prefixed with En make certain assumptions that are specific to the English language.

import { EnPronouncer, EnPhoneticDistance, FuzzyMatcher, AcceleratedFuzzyMatcher, EnHybridDistance, StringDistance } from "phoneticmatching";

Speech The namespace containing the type interfaces of the library objects.

EnPronouncer Pronounces a string, as a General English speaker, into its IPA string or array of Phones format.

matchers module:

  • FuzzyMatcher Main use case for this library. Returns matches against a list of targets for a given query. The comparisons are not remembered and therefore better for one-off use cases.

  • AcceleratedFuzzyMatcher Same interface as FuzzyMatcher, but the list of targets is precomputed, so it is beneficial for multiple queries at the cost of a higher initialization time.

  • EnContactMatcher A domain specialization of the AcceleratedFuzzyMatcher for English speakers searching over a list of names. Does additional preprocessing and sets up the distance function for you.

  • EnPlaceMatcher A domain specialization of the AcceleratedFuzzyMatcher for English speakers searching over a list of places. Does additional preprocessing and sets up the distance function for you.
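To make the one-off versus precomputed trade-off concrete, here is a minimal sketch of a brute-force nearest-match search. It assumes only that a distance function returns a number where smaller means closer. Both nearestByScan and the toy lengthDistance are hypothetical illustrations and do not reflect how the library's matchers are implemented.

```javascript
// Hypothetical brute-force search: compares the query against every target.
// Cost is one distance call per target on every query, with no setup work.
function nearestByScan(targets, distance, query) {
    let best = null;
    for (const element of targets) {
        const d = distance(query, element);
        if (best === null || d < best.distance) {
            best = { element, distance: d };
        }
    }
    return best; // null when targets is empty
}

// Toy distance for illustration only: absolute difference in length.
const lengthDistance = (a, b) => Math.abs(a.length - b.length);

// Prints { element: 'Banana', distance: 0 }
console.log(nearestByScan(["Apple", "Banana", "Blackberry"], lengthDistance, "Cherry"));
```

A scan like this does no work up front but touches every target on each query. A matcher that precomputes its targets pays an indexing cost once so that later queries can be answered with less work.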

distance module:

  • EnPhoneticDistance Returns a metric distance score between two English pronunciations.

  • StringDistance Returns a metric distance score between two strings (edit distance).

  • EnHybridDistance Returns a metric distance score based on a combination of the two above distance metrics (English pronunciations and strings).

  • DistanceInput Input object for EnHybridDistance. Holds the text and the pronunciation of that text.
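For readers unfamiliar with edit distance, the following is a minimal sketch of the classic Levenshtein algorithm, followed by one possible way to blend two scores. It is not the library's code: the function names, the weighted-average formula, and the phoneticWeight parameter are all assumptions made for illustration, and the library's own scoring and normalization may differ.

```javascript
// Classic Levenshtein edit distance using a single rolling row.
// Compares UTF-16 code units, which is sufficient for plain English text.
function editDistance(a, b) {
    const row = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        let diagonal = row[0]; // value of cell (i-1, j-1)
        row[0] = i;
        for (let j = 1; j <= b.length; j++) {
            const above = row[j]; // value of cell (i-1, j)
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            row[j] = Math.min(
                above + 1,       // deletion
                row[j - 1] + 1,  // insertion
                diagonal + cost  // substitution or match
            );
            diagonal = above;
        }
    }
    return row[b.length];
}

// Hypothetical hybrid: a weighted average of a phonetic score and a string
// score. The formula and the default weight are assumptions for illustration.
function hybridDistance(phoneticScore, stringScore, phoneticWeight = 0.5) {
    return phoneticWeight * phoneticScore + (1 - phoneticWeight) * stringScore;
}

console.log(editDistance("kitten", "sitting")); // 3
```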

nlp module:

  • EnPreProcessor English Pre-processor.

  • EnPlacesPreProcessor English Pre-processor with specific rules for places.

  • SplittingTokenizer Tokenizing base-class that will split on the given RegExp.
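As a rough illustration of splitting on a RegExp, the sketch below breaks a string on a separator pattern and records where each token starts. The function name splitTokens and the value/offset token shape are hypothetical and do not describe the library's SplittingTokenizer interface. It relies on String.prototype.matchAll, so it assumes Node.js 12 or later.

```javascript
// Hypothetical sketch: split text on a separator RegExp, keeping each token's
// start offset. Not the library's SplittingTokenizer implementation.
function splitTokens(text, separator) {
    // matchAll requires the global flag, so add it when it is missing.
    const flags = separator.flags.includes("g") ? separator.flags : separator.flags + "g";
    const pattern = new RegExp(separator.source, flags);
    const tokens = [];
    let start = 0;
    for (const match of text.matchAll(pattern)) {
        // Skip empty tokens produced by leading or adjacent separators.
        if (match.index > start) {
            tokens.push({ value: text.slice(start, match.index), offset: start });
        }
        start = match.index + match[0].length;
    }
    // Keep whatever follows the last separator.
    if (start < text.length) {
        tokens.push({ value: text.slice(start), offset: start });
    }
    return tokens;
}

console.log(splitTokens("blue  berry pie", /\s+/));
```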

Here are some examples of how to import modules and classes:

import { EnContactMatcher, EnPlaceMatcher } from "phoneticmatching";
import * as Matchers from "phoneticmatching/lib/matchers";

Example

JavaScript

// Import core functionality from the library.
const { EnPhoneticDistance, FuzzyMatcher } = require("phoneticmatching");

// A distance metric over pronunciations.
const metric = new EnPhoneticDistance();

// The target list to match against.
const targets = [
    "Apple",
    "Banana",
    "Blackberry",
    "Blueberry",
    "Grapefruit",
    "Pineapple",
    "Raspberry",
    "Strawberry",
];

// Create the fuzzy matcher.
const matcher = new FuzzyMatcher(targets, metric);
// Find the nearest match.
const result = matcher.nearest("blu airy");
/* The result should be:
 * {
 *     // The object from the targets list.
 *     element: 'Blueberry',
 *     // The distance score from the distance function.
 *     distance: 0.041666666666666664
 * }
 */
console.log(result);

C#

using System;

// Import core functionality from the library.
using Microsoft.PhoneticMatching.Matchers.FuzzyMatcher.Normalized;

public class Program
{
    public static void Main(string[] args)
    {
        // The target list to match against.
        string[] targets = 
        {
            "Apple",
            "Banana",
            "Blackberry",
            "Blueberry",
            "Grapefruit",
            "Pineapple",
            "Raspberry",
            "Strawberry",
        };

        // Create the fuzzy matcher.
        var matcher = new EnPhoneticFuzzyMatcher<string>(targets);

        // Find the nearest match.
        var result = matcher.FindNearest("blu airy");

        /* The result should be:
         * {
         *     // The object from the targets list.
         *     element: 'Blueberry',
         *     // The distance score from the distance function.
         *     distance: 0.0416666666666667
         * }
         */
        Console.WriteLine("element : [{0}] - distance : [{1}]", result.Element, result.Distance);
    }
}

Build

TypeScript Transpiling

npm run tsc

Native Compiling

# X is the parallelization number, usually set to the number of cores of the machine.
# This cleans and rebuilds everything.
JOBS=X npm run rebuild
# For incremental builds.
JOBS=X npm run build

Test

# Requires native dependencies built, but TypeScript transpiling not required.
npm test

Docs

# Generate the doc files from the docstrings.
npm run build-docs

Release

# Builds everything, TypeScript & native & docs, as a release build.
npm run release

Deployment/Upload

Note that the .js library code and the native dependencies are deployed separately. npm registries are used for the .js code; node-pre-gyp is used for the prebuilt native dependencies, falling back to building on the client when no prebuilt binary is available.

# Pushes the package to npmjs.com, or to a private registry if a .npmrc exists.
npm publish
# Packages a ./build/stage/{version}/maluubaspeech-{node_abi}-{platform}-{arch}.tar.gz.
# See package.json:binary.host on where to put it.
npm run package
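To show how the placeholders in the staged tarball path expand, here is a small sketch. The helper stagedTarballPath is hypothetical, and the assumption that node_abi takes the form "node-v" followed by process.versions.modules is inferred from the pre-built binary names listed earlier, not taken from the build scripts. The version string is only an example value.

```javascript
// Hypothetical helper: fills in the staged tarball path pattern quoted above.
function stagedTarballPath(version, nodeAbi, platform, arch) {
    return `./build/stage/${version}/maluubaspeech-${nodeAbi}-${platform}-${arch}.tar.gz`;
}

// Assumption: node_abi is "node-v" followed by the module ABI version.
const nodeAbi = `node-v${process.versions.modules}`;

// "0.3.5" is an example version; use the version from package.json in practice.
console.log(stagedTarballPath("0.3.5", nodeAbi, process.platform, process.arch));
```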

NuGet Publish

A .NET Core NuGet package is published for this project. Because the package is published by Microsoft, it must follow the guidance at https://aka.ms/nuget, and both the package content and the package itself must be signed with an official Microsoft certificate. To ease the signing and publishing process, ESRP signing is integrated into the Azure DevOps build tasks. To publish a new version of the package, create a release for the latest build (Pipelines->Releases->PublishNuget->Create a release).

Contributors

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reporting Security Issues

Security issues and bugs should be reported privately, via email, to the Microsoft Security Response Center (MSRC) at [email protected]. You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Further information, including the MSRC PGP key, can be found in the Security TechCenter.

License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

See sources for licenses of dependencies.

phoneticmatching's Issues

Build issue on Node 13.9.0 / WSL Debian / ARM64 SurfaceProX

$ npm install -S phoneticmatching

[email protected] install /.../node_modules/phoneticmatching
node-pre-gyp install --fallback-to-build

node-pre-gyp WARN Using request for node-pre-gyp https download
node-pre-gyp WARN Tried to download(404): https://github.com/Microsoft/PhoneticMatching/releases/download/0.3.5/maluubaspeech-node-v79-linux-arm64.tar.gz
node-pre-gyp WARN Pre-built binaries not found for [email protected] and [email protected] (node-v79 ABI, glibc) (falling back to source compile with node-gyp)
make: Entering directory '/.../node_modules/phoneticmatching/build'
CXX(target) Release/obj.target/maluubaspeech-source/src/maluuba/speech/phoneticdistance/metric.o
CXX(target) Release/obj.target/maluubaspeech-source/src/maluuba/speech/phoneticdistance/phoneticdistance.o
CXX(target) Release/obj.target/maluubaspeech-source/src/maluuba/speech/pronouncer/pronouncer.o
../src/maluuba/speech/pronouncer/pronouncer.cpp:5:10: fatal error: flite/lang/cmulex/cmu_lex.h: No such file or directory
#include <flite/lang/cmulex/cmu_lex.h>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Question: EnContactMatcher performance

Good morning,

While trying to use EnContactMatcher.ts with 10k contacts the constructor takes up to 1 min. By changing inner-matchers to non accelerated, it improves considerably (it goes down to 'only' 30 seconds).

I have a few questions:

  1. Are those times reasonable/expected with 10k contacts?
  2. Would it be ok to open a PR adding a new configuration flag for using or not accelerated matchers in EnContactMatcher? Another option would be passing in the EnContactMatcher constructor the fuzzy matcher class to use? Which one would you prefer?
  3. With 10k contacts, there's a very small difference in performance for the find method call (0.1 seconds with accelerated vs 0.3 seconds without). Given that the find time is almost negligible compared to the constructor time for accelerated vs non accelerated, we chose to use non accelerated fuzzy matchers. Does this make sense?

Thanks and congratulations for this amazing library :)

PhoneticMatching Xamarin.Forms

I was trying to use this package in one of our Xamarin.Forms projects, but I get build errors:

On Windows (building Android app):

  1. EnPhoneticFuzzyMatcher not found
  2. The Type or namespace "PhoneticMatching" does not exist

On Mac (building iOS app)

  1. Same as above
  2. Error MSB4236: The SDK 'Microsoft.NET.Sdk' specified could not be found.

The packages are added correctly in all projects, without errors or warnings. I have .NET Core 2.1.700 installed. If I add the packages to a Console app (.NET Core) everything works as expected.

Any ideas?

Using this library on an Azure App Service

Hi,

We are trying to use this library in our Microsoft Bot Framework bot.
When testing it locally in our emulator we have no issues and everything works fine.
The problems start when trying to deploy this solution to our azure environment.
At that point we get the following error:

Exception message: An attempt was made to load a program with an incorrect format. (Exception from HRESULT: 0x8007000B).

This leads me to believe that we are missing some sort of configuration on our App Service.
Does anyone have an idea how to solve this issue?

We have tried building completely in x86 and in x64. We have changed the settings on the app service itself to every possible platform.

Any help would be highly appreciated.

Python API

Hi, is it possible to add Python API?

Target NetStandard?

Any chance to have the nuget library targeting netstandard 2.0 instead of netcoreapp2.1?
