eriknyquist / librxvm Goto Github PK

View Code? Open in Web Editor NEW

61.0 9.0 1.0 1.61 MB

non-backtracking NFA-based regular expression library, for C and Python

License: Apache License 2.0

Makefile 1.04% C 87.43% Shell 2.62% M4 0.40% Python 3.02% PowerShell 3.66% Batchfile 1.61% Objective-C 0.23%

regex regexes regex-pattern regexp nfa compiler compilers compiler-design parser parsers

librxvm's Introduction

librxvm: non-backtracking NFA-based C library for regular expressions

Linux, OSX :

Windows :

Table of Contents

Introduction
librxvm Regular Expression Syntax
Installation
Example applications
Reference: core features
Reference: optional features
Constants
Test Suite

Introduction

Short one

librxvm is a C library for matching regular expressions on large sets of data.

librxvm is a Regular eXpression Virtual Machine. It compiles a regular expression into an NFA representation consisting of a sequence of primitive opcodes for a "virtual machine" (the "machine" here is theoretical, rather than physical hardware, and is implemented as a simple bytecode interpreter). The virtual machine can then execute the NFA representation, or "program" against some input text to determine whether the input text matches the regular expression.

In addition to the usual string matching & searching functions, librxvm also provides a function (rxvm_fsearch) that takes a FILE pointer for large sets of input data. rxvm_fsearch uses the Boyer-Moore-Horspool algorithm to achieve extremely high throughput for regular expression searches on any data that can be accessed through a standard FILE pointer. You can try it out with the rxvm_fsearch example application.

Speed test: using the `rxvm_fsearch` function to search a 1GB plain-text file

$> ls -lh file
-rw-r--r-- 1 enyquist enyquist 954M Jul 23 21:55 file

$> time examples/rxvm_fsearch "eriknyquist" file

eriknyquistB6dvGZ7wxvd9pqXnep5wABDee7UqaNibTUch/LUAcAjqRPAnQBpMjdG3w7R4wR+1082/ctpmmWk

real    0m0.528s
user    0m0.444s
sys 0m0.084s

$>

528 milliseconds-- pretty fast. In fact, if we search for longer strings, we can go even faster, because of how the Boyer-Moore string searching algorithm works

$> time examples/rxvm_fsearch "this is a long string" file

real    0m0.424s
user    0m0.300s
sys 0m0.120s

$> time examples/rxvm_fsearch "this is a very very very very veeeeeeeeeerrrrrrrrrryyyyyyyyyyyyy long string" file

real    0m0.197s
user    0m0.068s
sys 0m0.128s

Under 200 milliseconds to search a 1GB file, when the search pattern contains a fixed string of ~80 characters or more. That's fast!

What's missing from `librxvm`

Currently, librxvm only works with plain ol' ASCII.
It's not POSIX compliant, or anything compliant as far as I know.
Probably a lot of things.

`librxvm` Regular Expression Syntax

A regular expression consists of ordinary characters and special characters. An ordinary character matches itself exactly (e.g. the expression abc matches only the input string abc).

Full grammar rules can be seen here.

A description of the available special characters follows.

Symbol Name Description

+ one or more matches one or more of the preceding character or parenthesis group, e.g. the expression ab+ matches the input ab, abb, but not a

* zero or more matches zero or more of the preceding character or parenthesis group, e.g. the expression ab* matches the input a, ab and abb

? zero or one matches zero or one of the preceding character or parenthesis group, e.g the expression ab? matches only a or ab

{n} repetition matches n repetitions of the preceding character or parenthesis group.

{n,m} repetition (range) matches n to m repetitions of the preceding character or parenthesis group.

{,m} repetition (less) matches m or fewer repetitions of the preceding character or parenthesis group

{n,} repetition (more) matches n or more repetitions of the preceding character or parenthesis group

| alternation allows either the preceding or the following expression to match, e.g. the expression (c|h)at matches cat and hat

. any matches any character

^ start anchor by default, matches immediately following the beginning of the input string. If the RXVM_MULTILINE flag is set, then it also matches immediately following each newline character

$ end anchor by default, matches immediately preceding the end of the input string or newline character at the end of the input string. If the RXVM_MULTILINE flag is set, then it also matches immediately preceding each newline character

( ) parenthesis group Groups together individual characters or subexpressions, e.g. a(bc)+ matches abcbc or abcbcbcbc, but not a. Parenthesis groups can contain any expression, and can be nested.

[ ] character class matches a single character inside the brackets. Characters can be escaped, (e.g. to match a literal "[" or "]" character), or part of a range. Ranges are valid in both valid in both directions, e.g. Z-A describes the same set of characters as A-Z

[^ ] negated character class matches a single character not inside the brackets. Otherwise, the same character class rules apply

\ escape used to remove special meaning from characters, e.g. to match a literal * character

Symbol	Name	Description
+	one or more	matches one or more of the preceding character or parenthesis group, e.g. the expression `ab+` matches the input `ab`, `abb`, but not `a`
*	zero or more	matches zero or more of the preceding character or parenthesis group, e.g. the expression `ab*` matches the input `a`, `ab` and `abb`
?	zero or one	matches zero or one of the preceding character or parenthesis group, e.g the expression `ab?` matches only `a` or `ab`
{n}	repetition	matches n repetitions of the preceding character or parenthesis group.
{n,m}	repetition (range)	matches n to m repetitions of the preceding character or parenthesis group.
{,m}	repetition (less)	matches m or fewer repetitions of the preceding character or parenthesis group
{n,}	repetition (more)	matches n or more repetitions of the preceding character or parenthesis group
\|	alternation	allows either the preceding or the following expression to match, e.g. the expression `(c\|h)at` matches `cat` and `hat`
.	any	matches any character
^	start anchor	by default, matches immediately following the beginning of the input string. If the RXVM_MULTILINE flag is set, then it also matches immediately following each newline character
$	end anchor	by default, matches immediately preceding the end of the input string or newline character at the end of the input string. If the RXVM_MULTILINE flag is set, then it also matches immediately preceding each newline character
( )	parenthesis group	Groups together individual characters or subexpressions, e.g. `a(bc)+` matches `abcbc` or `abcbcbcbc`, but not `a`. Parenthesis groups can contain any expression, and can be nested.
[ ]	character class	matches a single character inside the brackets. Characters can be escaped, (e.g. to match a literal `"["` or `"]"` character), or part of a range. Ranges are valid in both valid in both directions, e.g. `Z-A` describes the same set of characters as `A-Z`
[^ ]	negated character class	matches a single character not inside the brackets. Otherwise, the same character class rules apply
\	escape	used to remove special meaning from characters, e.g. to match a literal `*` character

Installation

Installing librxvm as a python module

Just use the included setup.py file to compile & install librvm as a python module:

python setup.py build
python setup.py install

Full documentation for the librxvm python API can be found here

Installing librxvm as a C library

Dependencies:

GNU Make
GNU Autotools
A C compiler (GCC, Clang)
Some kind of libc (requires stdio.h, stdlib.h, stdint.h and string.h)

To install, do the usual stuff:

./autogen.sh
./configure
make
sudo make install

The resulting static library librxvm.a will be installed in /usr/local/lib by default. With the default configuration (i.e. everything turned on), the compiled library is about 54K on my 64-bit system. If you want to shave off ~40% of this size (around 20K in my case), you can configure with the --disable-extras flag (see Reference: optional features).

Building C code with librxvm

Once librxvm is installed, you can use it by adding #include <librxvm/rxvm.h> to your program, and then passing -lrxvm when linking. For example:

gcc my_rxvm_program.c -lrxvm

Example applications

See sample code in the examples directory. The examples are simple, and compile into easy-to-use command-line programs. They are automatically built by the top-level Makefile when you run make to build librxvm.

`rxvm_match`

Accepts two arguments, a regular expression and an input string. Prints a message indicating whether the input string matches the expression or not.

$> examples/rxvm_match

  Usage: rxvm_match <regex> <input>

$> examples/rxvm_match "[Rr]x(vm|VM){3,6}" "rxvm"

  No match.

$> examples/rxvm_match "[Rr]x(vm|VM){3,6}" "rxVMvmVM"

  Match!

`rxvm_search`

Accepts two arguments, a regular expression and an input string. Prints any instances of the regular expression that occur inside the input string.

$> examples/rxvm_search

  Usage: rxvm_search <regex> <input>

$> examples/rxvm_search "rx(vm)*" "------------rx---------"

  Found match: rx

$> examples/rxvm_search "rx(vm)*" "------rxvm-------rxvmvm----"

  Found match: rxvm
  Found match: rxvmvm

`rxvm_fsearch`

Accepts two arguments, a regular expression and a filename. Prints any instances of the regular expression that occur inside the file.

$> examples/rxvm_fsearch

  Usage: rxvm_fsearch <regex> <filename>

$> echo "------rxvm-------rxvmvm----" > file.txt
$> examples/rxvm_fsearch "rx(vm)*" file.txt

  Found match: rxvm
  Found match: rxvmvm

`rxvm_gen`

Accepts one argument, a regular expression. Generates a pseudo-random string which matches the expression.

$> examples/rxvm_gen

  Usage: rxvm_gen <regex>

$> examples/rxvm_gen "([Rr]+(xv|XV)mm? ){2,}"

  rRrrRrrxvmm rxvmm rrRrrrRXVm Rrxvm rrRRrXVmm RXVmm

$> examples/rxvm_gen "([Rr]+(xv|XV)mm? ){2,}"

  Rxvm rrrxvmm RXVm RRxvmm

Reference: core features

`rxvm_compile`

int rxvm_compile (rxvm_t *compiled, char *exp);

Compiles the regular expression exp, and places the resulting VM instructions into the rxvm_t type pointed to by compiled.

Return value

0 if compilation succeeded
negative number if an error occured (See rxvm_compile error codes) and General error codes)

`rxvm_match`

int rxvm_match (rxvm_t *compiled, char *input, int flags);

Checks if the string input matches the compiled expression compiled exactly.

Return value

1 if the input matches the expression
0 if the input doesn't match the compiled expression
negative number if an error occured (See General error codes)

`rxvm_search`

int rxvm_search (rxvm_t *compiled, char *input, char **start, char **end, int flags);

Searches the string starting at input for a pattern that matches the compiled regular expresssion compiled, until a match is found or until the string's null termination character is reached. When a match is found, the pointers pointed to by start and end are pointed at the first and last characters of the matching substring. If no match is found, then both start and end are set to NULL.

Return value

1 if a match is found
0 if no match is found
negative number if an error occured (See General error codes)

`rxvm_free`

void rxvm_free (rxvm_t *compiled);

Frees all dynamic memory associated with a compiled rxvm_t type. Always call this function, before exiting, on any compiled rxvm_t types.

Returns nothing.

Reference: optional features

The following functions rxvm_fsearch, rxvm_gen and rxvm_print are compiled in by default. However, if you don't need them and you want the final library to be a bit smaller, you can exlude them by passing the --disable-extras flag to the configure script, e.g.

> ./configure --disable-extras

`rxvm_fsearch`

int rxvm_fsearch (rxvm_t *compiled, FILE *fp, uint64_t *match_size, int flags);

Searches the file at fp (fp must be initialised by the caller, e.g. via fopen) for a pattern that matches the compiled regular expresssion compiled, from the current file position until EOF. If a match is found, the file pointer fp is re-positioned to the first character of the match, and match_size is populated with a positive integer representing the match size (number of characters). If no match is found, then match_end is set to 0, and fp remains positioned at EOF.

This function uses an implementation of the Boyer-Moore-Horspool (BMH) algorithm to search the file for a pattern, and can be extremely fast. Because the BMH algorithm only works with fixed strings, this function uses a special heuristic to identify subtrings of fixed literal characters in your expression, and uses the fast BMH algorithm to search for these smaller substrings. If one is found, the virtual machine is invoked (needed to match a regular expression, but slower).

This means the type of expression you write can significantly affect the speed of the rxvm_search function. Specifically, longer strings means faster matching.

Return value

1 if a match is found
0 if no match is found
negative number if an error occured (See rxvm_fsearch error codes and General error codes)

`rxvm_gen`

char *rxvm_gen (rxvm_t *compiled, rxvm_gencfg_t *cfg);

Generates a string of random characters that matches the compiled expression compiled (compiled must be initialised by the caller first, e.g. via rxvm_compile).

The rxvm_gencfg_t type provides some control over the randomness:

struct rxvm_gencfg {
    uint8_t generosity;
    uint8_t whitespace;
    uint64_t limit;

    uint64_t len;
};

generosity: This value is expected to be between 0-100, and represents the probability out of 100 that a + or * operator will match again ("greedyness" in reverse). Higher means more repeat matches.
whitespace: This value is expected to be between 0-100, and represents the probability that a whitespace character will be used instead of a visible character, when the expression allows it (e.g. when the expression contains a "." metacharacter). Higher means more whitespace.
limit: This value represents the generated input string size at which the generation process should stop. This is not hard limit on the size of the generated string; when the generated string reaches a size of limit, then generosity is effectively set to 0, and generation will stop at the earliest possible opportunity, while also ensuring that the generated string matches the pattern compiled.
len: If rxvm_gen returns a valid (non-null) pointer, then len will contain the number of characters in the generated string (excluding the terminating null-character).

If a null pointer is passed instead of a valid pointer to a rxvm_gencfg_t type, then default values are used.

Return value

A pointer to a heap allocation that contains a null-terminated random matching string. If memory allocation fails, a null pointer is returned.

`rxvm_print`

void rxvm_print (rxvm_t *compiled)

Prints a compiled expression in a human-readable format.

Returns nothing.

Constants

Flags

rxvm_match and rxvm_search take a flags parameter. You can use the masks below to set bit-flags which will change the behaviour of these functions (combine multiple flags by bitwise OR-ing them together):

`RXVM_ICASE`

case insensitive: ignore case when matching alphabet characters. Matching is case-sensitive by default.

`RXVM_NONGREEDY`

non-greedy matching: by default, the operators +, *, and ? will match as many characters as possible, e.g. running rxvm_search with the expression <.*> against the input string <tag>name<tag> will match the entire string. With this flag set, it will match only <tag>.

`RXVM_MULTILINE`

Multiline: By default, ^ matches immediately following the start of input, and $ matches immediately preceding the end of input or the newline before the end of input. With this flag set, ^ will also match immediately following each newline character, and $ will also match immediately preceding each newline character. This flag is ignored and automatically enabled when rxvm_match is used; since rxvm_match effectively requires a matching string to be anchored at both the start and end of input, then ^ and $ are only useful if they can also act as line anchors.

General error codes

The following error codes are returned by all librxvm functions

`RXVM_EMEM`

Indicates that memory allocation failed.

`RXVM_EPARAM`

Indicates that an invalid parameter (e.g. a NULL pointer) was passed to a librxvm library function.

`rxvm_compile` error codes

The following error codes are returned only by the rxvm_compile function

`RXVM_BADOP`

Indicates that an operator (*, +, ?, {}) was used incorrectly in the input expression, i.e. without a preceding literal character.

Example expressions: ab++, {5}.

`RXVM_BADCLASS`

Indicates that an unexpected (and unescaped) character class closing character (]) was encountered in the input expression.

Example expressions: xy], [a-f]]

`RXVM_BADREP`

Indicates that an unexpected (and unescaped) repetition closing character (}) was encountered in the input expression.

Example expressions: a}, bb{4,}}

`RXVM_BADPAREN`

Indicates that an unexpected (and unescaped) closing parenthesis character (``)``) was encountered in the input expression.

Example expressions: qy), q*(ab))

`RXVM_EPAREN`

Indicates that an unterminated parenthesis group (()) was encountered in the input expression.

Example expressions: d+(ab, ((ab)

`RXVM_ECLASS`

Indicates that an unterminated character class ([]) was encountered in the input expression.

Example expressions: [A-Z, [[A-Z]

`RXVM_EREP`

Indicates that an unterminated repetition ({}) was encountered in the input expression.

Example expressions: ab{5, ((ab)

`RXVM_ERANGE`

Indicates that an incomplete character range inside a character class was encountered in the input expression.

Example expressions: [A-], [-z]

`RXVM_ECOMMA`

Indicates that an invalid extra comma inside a repetition was encountered in the input expression.

Example expressions: ab{5,,}, x{6,7,8}

`RXVM_EDIGIT`

Indicates that an invalid character (i.e. not a digit or a comma) inside a repetition was encountered in the input expression.

Example expressions: ab{3,y}, b{8.9}

`RXVM_MREP`

Indicates that an empty repetition ({}) was encountered in the input expression.

Example expressions: ab{}, ab{,}

`RXVM_ETRAIL`

Indicates that a trailing escape character (\\) was encountered in the input expression.

Example expressions: ab\\, \\*\\

`RXVM_EINVAL`

Indicates that an invalid symbol (any character outside the supported character set) was encountered in the input expression.

`rxvm_fsearch` error codes

The following error codes are returned only by the rxvm_fsearch function

`RXVM_IOERR`

Indicates that an error occured while attempting to read from the passed FILE pointer

Test Suite

To run the tests, use the check target in the main Makefile

make check

You can also run the tests through Valgrind (if installed) to check for memory leaks or other issues in librxvm, using the separate Makefile provided specifically for this purpose, memcheck.mk

NOTE: Running the tests through Valgrind can take a very long time to complete

make -f memcheck.mk

librxvm's People

Contributors

Stargazers

Watchers

Forkers

serialhex

librxvm's Issues

Add some speed benchmark figures to README

Add a section with some benchmarks, compare against built-in RE python lib, C regex lib, and google re2 (at least)..

Remove unnecessary null pointer checks

An extra null pointer check is not needed in functions like the following.

TODO: Add support for numbered repetitions i.e. {n,m}, {n,}, {,m}

Should look like this:

a{n,m} : Match n-m (inclusive) repetitions of a.
a{n,} : Match n or more (inclusive) repetitions of a.
a{,m} : Match m or less (inclusive) repetitions of a.

This can be done (I think) by adding just one new VM instruction, "jlt" or "Jump if Less Than".
This instruction takes two parameters; the index of an instruction in the VM program (jump location), and a count value. For example:

jlt 5 6

The first parameter "5" is the jump location. The second parameter "6" is the count value. The VM must keep count of how many times this instruction is executed within the containing program; if that count is less than the jlt instruction's count value, then execution will jump to the specified location, and if both values are equal then the VM will stop keeping count and execution will carry on to the next instruction as normal. The effect, in this example, is that for the first 6 times this instruction is executed it will cause execution to jump to the instruction at "5", and every time after that it will have no effect and execution will proceed normally.

These can be used with branch instructions to build the desired functionality. Some examples:

abc{n,m}

0 char a
1 char b
2 jlt 4 n
3 branch 4 6
4 char c
5 jlt 2 m+1
6 match

abc{n,}

0 char a
1 char b
2 char c
3 jlt 2 n+1
4 branch 2 5
5 match

abc{,m}

0 char a
1 char b
2 branch 3 5
3 char c
4 jlt 2 m+1
5 match

abc{n}

0 char a
1 char b
2 char c
3 jlt 2 n+1
4 match

rxvm_fsearch is only line-based. Make this configurable

Currently, the BMH portion of rxvm_fsearch uses the following heuristic to run a BMH search on a fixed substring from an expression;

Run BMH string search using fixed substring
On match, from the first char. of match, back up to the last newline character
Run vm_execute (full regexp matching) from here

Now, this doesn't cause as many issues as it may sound at first-- rxvm_compile, while compiling the expression initially, is keeping track of any fixed substrings potentially suitable for BMH, and there are several things that will get an expression marked as "unsuitable for BMH" (for more details, you can look at the tests for this behaviour, in tests/src/test_rxvm_lfix_heuristic.c). One of those things is a literal newline character in the expression, since if the expression spans multiple lines, I won't know how far to back up...... it's technically possible, if I track the number of newlines in the expression, but I think that can be a future optimisation.

One issue, however, is that you can't really (very successfully) use rxvm_fsearch in a file that contains no newline characters. Or, if it's a huge file with very few newlines then it can be very slow. So I need to either provide a way to completely disable this line-based-BMH-substring method, or figure out some way to do it without newline characters at all. Or, I guess, provide some other alternative, for example, if the user is willing to specify:

1) the maximum size of any possible matching text, or
2)  an alternate, frequently-occurring character, besides newline, that can instead be "backed up to"

Then we could still use BMH effectively, for a full regexp in a file that is not line-based

reserved identifier violation

I would like to point out that identifiers like “_STRING_BUILDER_H_” and “_VMCODE_H_” do not fit to the expected naming convention of the C language standard.
Would you like to adjust your selection for unique names?

pyrxvm/pyrxvm.c: needs review/audit by someone who knows python C API better than me

I'm pretty sure I did everything correctly, but there may be a reference leak somewhere, or I could have otherwise messed something up.

Make include guards unique

I find that include guards like “LEX_H” and “STACK_H” are too short for the safe reuse of your header files (when they belong to an application programming interface).

TODO: Create test module to generate random (with some degree of control) valid expressions.

Generating random valid expressions will make a nice pair with the existing code to generate valid input for a given expression. This is where the real testing would begin! In order to allow some degree of control of the randomness, the module should accept multiple values between 0-100, each representing the probability of:

Another token being generated (higher == longer expression)
More parenthesis groups being generated (higher == more groups)
More operators (higher == more operators, i.e. ?+*{n,m})
More literal characters (higher == more literals)

TODO: Create vmcode.c for all code generation stuff

Take everything involving VM code generation out of stage1.c, as it's getting a bit crowded in there. Add a new file vmcode.c and stick it in there.

Completion of error handling

Would you like to add more error handling for return values from functions like the following?

fseek ⇒ bmh_partial
ftell ⇒ bmh

TODO: use Boyer-moore (or something similar) to improve text searching performance

Currently the regexvm_search and regexvm_fsearch functions use a naive matching method-- every single character in the 'haystack' is looked at. Naturally this is the slowest method; Come up with way to speed this up using Boyer-moore or some similar algorithm.

Support UTF-8

Currently, librxvm only handles ASCII input, one character == one byte. UTF-8 support would be nice. Need to research the best way to do this.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.