Tesa (text sanitizer)

The library contains a small collection of helper classes to sanitize text or string elements of arbitrary length, with the aim of improving search match confidence during query execution. It was written to meet the requirements of the Semantic MediaWiki project but is deployed independently.

Requirements

  • PHP 5.3 / HHVM 3.5 or later
  • Enabling the ICU extension is recommended

Installation

The recommended installation method for this library is by adding the following dependency to your composer.json.

{
	"require": {
		"onoi/tesa": "~0.1"
	}
}

Usage

use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );

$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

$sanitizer->replace(
	array( "'", "http://", "https://", "mailto:", "tel:" ),
	array( '' )
);

$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );

$sanitizer->applyTransliteration(
	Transliterator::DIACRITICS | Transliterator::GREEK
);

$text = $sanitizer->sanitizeWith(
	$sanitizerFactory->newGenericTokenizer(),
	$sanitizerFactory->newNullStopwordAnalyzer(),
	$sanitizerFactory->newNullSynonymizer()
);

  • SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
  • IcuWordBoundaryTokenizer is the preferred tokenizer when the ICU extension is available
  • NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
  • StopwordAnalyzer, together with a LanguageDetector, is provided as a means to keep frequent "noise" words out of a possible search index and thereby reduce ambiguity
  • Synonymizer currently only provides an interface
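
The NGramTokenizer item above refers to n-gram tokenization, which splits text into overlapping character sequences so that languages without word separators (such as CJK text) can still be matched. As a rough illustration of that idea only, and not the library's actual implementation, a plain-PHP sketch could look like the following. The function name ngramTokens and its default size of 2 are assumptions made for this example, and the code requires the mbstring extension.

```php
<?php

/**
 * Hypothetical helper, NOT part of onoi/tesa: splits a UTF-8 string into
 * overlapping character n-grams of the given size.
 */
function ngramTokens( $text, $size = 2 ) {
	$length = mb_strlen( $text, 'UTF-8' );

	// An empty string yields no tokens
	if ( $length === 0 ) {
		return array();
	}

	// A string no longer than the n-gram size is returned as one token
	if ( $length <= $size ) {
		return array( $text );
	}

	$tokens = array();

	// Slide a window of $size characters over the text, one character at a time
	for ( $i = 0; $i <= $length - $size; $i++ ) {
		$tokens[] = mb_substr( $text, $i, $size, 'UTF-8' );
	}

	return $tokens;
}
```

With these assumptions, ngramTokens( '東京都庁' ) returns the three bigrams 東京, 京都, and 都庁, each of which can then be indexed or matched as a separate token.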

Contribution and support

If you want to contribute work to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.

Tests

The library provides unit tests that cover the core functionality and are normally run by the continuous integration platform. Tests can also be executed manually using the composer phpunit command from the root directory.

Release notes

  • 0.1.0 Initial release (2016-08-07)
      • Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface

Acknowledgments

  • The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except the German diaeresis ä, ü, and ö)
  • The stopwords used by the StopwordAnalyzer have been collected from different sources; each JSON file identifies its origin
  • CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
  • JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
  • TextCatLanguageDetector uses the wikimedia/textcat library to make predictions about a language

License

GNU General Public License 2.0 or later.
