triegex's Introduction

triegex

About

triegex is a library that builds compact, trie-structured regular expressions from a list of words.

Installation

pip install triegex

Alternatively, you can install the latest release directly from git:

pip install git+https://github.com/ZhukovAlexander/[email protected]

Example usage

>>> import triegex
>>>
>>> t = triegex.Triegex('foo', 'bar', 'baz')
>>>
>>> t.to_regex()  # build regular expression
'(?:ba(?:r\\b|z\\b)|foo\\b|~^(?#match nothing))'
>>>
>>> t.add('spam')
>>>
>>> 'spam' in t  # check whether the word is in the trie
True
>>>
>>> import re
>>> re.findall(t.to_regex(), 'spam & eggs')  # ['spam']
['spam']

Why?

The library was inspired by a need to match a list of valid IANA top-level domain names (which is pretty big).

Also, it's fun.

triegex was influenced by these projects: frak, regex-trie and Regexp-Trie

triegex's Issues

Issue when adding c++ as a keyword

Thanks for the library, will definitely try it out:

I am facing a simple issue meanwhile:

import re
import triegex

t = triegex.Triegex('foo', 'bar', 'baz')
t.to_regex()
t.add('c++')
'spam' in t
re.findall(t.to_regex(), 'c++')  # raises the error shown below

The full stack trace is too long, so here is the last part of it:

    639             if item[0][0] in _REPEATCODES:
    640                 raise source.error("multiple repeat",
--> 641                                    source.tell() - here + len(this))
    642             if sourcematch("?"):
    643                 subpattern[-1] = (MIN_REPEAT, (min, max, item))

error: multiple repeat at position 19
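
The "multiple repeat" error is consistent with the `+` characters reaching the `re` parser unescaped, although the thread does not confirm the cause. As a minimal sketch of a workaround, assuming only the standard library (this is not triegex's API), literal words can be passed through `re.escape` before being joined into a pattern:

```python
import re

words = ["foo", "c++"]

# re.escape turns "+" into "\+", so it is matched literally instead of
# being parsed as a repeat operator.
escaped = [re.escape(word) for word in words]
pattern = re.compile("(?:" + "|".join(escaped) + ")")

matches = pattern.findall("foo and c++")
print(matches)  # ['foo', 'c++']

# A trailing \b does not help for words ending in a non-word character:
# there is no word boundary between "+" and the end of the string.
boundary_match = re.search(r"c\+\+\b", "c++")
print(boundary_match)  # None
```

The second check suggests that escaping alone may not be enough for a pattern that appends `\b` after every word, as the `to_regex()` output shown earlier does; a word such as `c++` would probably need a different end-of-word condition.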

Will not import with Python 3.5

import triegex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/share/virtualenvs/project-xy-SPS3Z/lib/python3.5/site-packages/triegex/__init__.py", line 29
    return f'<TriegexNode: \'{self.char}\' end={self.end}>'
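
The line flagged in the traceback is an f-string, a syntax that only exists from Python 3.6 onward, so a 3.5 interpreter cannot parse the module. A sketch of an equivalent `__repr__` written with `str.format`, which 3.5 accepts, could look like this. The class below is a hypothetical stand-in; only the attribute names `char` and `end` and the output format are taken from the traceback.

```python
class TriegexNode:
    """Hypothetical stand-in for the class named in the traceback."""

    def __init__(self, char, end):
        self.char = char
        self.end = end

    def __repr__(self):
        # Same text as the f-string in the traceback, without f-string syntax.
        return "<TriegexNode: '{}' end={}>".format(self.char, self.end)


node = TriegexNode("a", True)
print(repr(node))  # <TriegexNode: 'a' end=True>
```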

ModuleNotFoundError: No module named 'triegex'

I installed the package from pypi using pip:

$ pip install triegex
Collecting triegex
  Downloading triegex-0.0.2.tar.gz
Building wheels for collected packages: triegex
  Running setup.py bdist_wheel for triegex ... done
  Stored in directory: C:\Users\a\AppData\Local\pip\Cache\wheels\d9\c2\27\ac898adb26725eacef0139c486d270c641d86882743e16c2e6
Successfully built triegex
Installing collected packages: triegex
Successfully installed triegex-0.0.2

But I could not import it:

>>> import triegex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'triegex'

My initial guess is that it is related to the packages argument of setuptools.setup, but I have not tested it. Perhaps you should use find_packages or add the packages manually?

Can we build a regex-trie from regexes?

Suppose there is a list of regexes:

a.*
abc.*
acc\d+efg

and a string acc4.

Naively, we would have to loop through all the regexes to find a match.

Can we build a regex-trie such that we don't have to loop over all of the regexes?
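
For reference, here is a sketch of the naive loop next to one common alternative: joining the regexes into a single alternation with one named group per regex. This is not a trie and the thread does not say whether triegex supports anything like it; the function names and the `r0`, `r1`, ... group names are made up for illustration, and the approach assumes the individual regexes do not rely on numbered backreferences or clashing group names.

```python
import re

regexes = [r"a.*", r"abc.*", r"acc\d+efg"]


def first_match_loop(text):
    """Naive approach: try every regex in turn."""
    for index, rx in enumerate(regexes):
        if re.fullmatch(rx, text):
            return index
    return None


# One combined pattern; each original regex gets its own named group.
combined = re.compile(
    "|".join("(?P<r{}>{})".format(i, rx) for i, rx in enumerate(regexes))
)


def first_match_combined(text):
    """Single call into the regex engine; lastgroup names the alternative that matched."""
    match = combined.fullmatch(text)
    return int(match.lastgroup[1:]) if match else None


print(first_match_loop("acc4"))      # 0
print(first_match_combined("acc4"))  # 0
```

Both functions report the first regex that matches (`a.*` for `acc4`). The combined version moves the loop into the regex engine, which still tries the alternatives in order, so it avoids the Python-level loop rather than the work itself.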

Incorrect output regex on certain inputs.

When a word fed into the trie is a prefix of another word, the output regex is not correct.

Example code used to reproduce bug:

import triegex

def RegexFromList(listOfWords):
	t = triegex.Triegex()
	for i in listOfWords:
		t.add(i)
	return t.to_regex()

print(RegexFromList(["TEST", "TE", "TEXT", "BANANA"]))

This code outputs: (?:XT\b|TE\b|~^(?#match nothing))
This regex happens to match TE correctly, but it also matches XT and does not match TEXT, TEST, or BANANA as would be expected.

The expected regex would be: (?:TE(?:\b|XT\b|ST\b)|BANANA\b|~^(?#match nothing)) or an equivalent.
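
To make the expected behaviour concrete, here is a minimal sketch of a trie-to-regex builder that keeps a word's end marker alongside its longer continuations. It is an independent illustration using a nested dict, not triegex's implementation; `build_trie` and `trie_to_regex` are hypothetical names, and it assumes a non-empty word list, so it omits the "match nothing" branch.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = {}  # end marker coexists with longer continuations
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment."""
    alternatives = []
    for char, child in sorted(node.items()):
        if char == "":
            alternatives.append(r"\b")  # a word may end here
        else:
            alternatives.append(re.escape(char) + trie_to_regex(child))
    if len(alternatives) == 1:
        return alternatives[0]
    return "(?:" + "|".join(alternatives) + ")"


regex = trie_to_regex(build_trie(["TEST", "TE", "TEXT", "BANANA"]))
print(regex)  # (?:BANANA\b|TE(?:\b|ST\b|XT\b))

found = re.findall(regex, "TE TEST TEXT BANANA XT TESTS")
print(found)  # ['TE', 'TEST', 'TEXT', 'BANANA']
```

Storing the end marker as just another key of the node is what lets `TE` and `TEST` share the `TE` prefix without one overwriting the other.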
