3F / regXwild

⏱ Superfast ^Advanced wildcards++? | Unique algorithms that were implemented in native unmanaged C++ but are easily accessible in .NET via Conari (with caching of 0x29 opcodes + optimizations), etc.

License: MIT License

C++ 84.00% C 2.23% Batchfile 9.05% C# 4.71%
regex regexp wildcards strings text filter search match search-in-text fast-regex

regxwild's Introduction

⏱ Superfast ^Advanced wildcards++? *,|,?,^,$,+,#,>,++??,##??,>c in addition to slow regex engines and more.

✔ regex-like quantifiers, amazing meta symbols, and speed...

Unique algorithms that were implemented in native unmanaged C++ but are easily accessible in .NET through Conari (recommended due to caching of 0x29 opcodes + related optimizations), and from other environments such as Python, etc.

Samples               regXwild filter       n
number = '1271';      number = '????';      0 - 4
year = '2020';        '##'|'####'           2 | 4
year = '20';          = '##??'              2 | 4
number = 888;         number = +??;         1 - 3

Samples               regXwild filter
everything is ok      ^everything*ok$
systems               system?
systems               sys###s
A new 'X1' project    ^A*'+' pro?ect
professional system   pro*system
regXwild in action    pro?ect$|open*source+act|^regXwild

Why regXwild ?

It was designed to be faster than just fast for features that usually go beyond typical wildcards. Seriously, we love regex, I love it, you love it; 2013 is far behind us, but regXwild is still relevant for its speed and powerful wildcard-like features, such as ##?? (which means 2 or 4) ...

🔍 Easy to start

Unmanaged native C++ or managed .NET project. It doesn't matter, just use it:

C++

#include <regXwild.h>
using namespace net::r_eg::regXwild;
...
EssRxW rxw;
if(rxw.match(_T("regXwild"), _T("reg?wild"))) {
    // ...
}

C# (if using Conari)

using dynamic l = new ConariX("regXwild.dll");
...
if(l.match<bool>("regXwild", "reg?wild")) {
    // ...
}

🏄 Amazing meta symbols

ESS version (advanced EXT version)

metasymbol    meaning
*             {0, ~}
|             str1 or str2 or ...
?             {0, 1}, ??? {0, 3}, ...
^             [str... or [str1...
$             ...str] or ...str1]
+             {1, ~}, +++ {3, ~}, ...
#             {1}, ## {2}, ### {3}, ...
>             Legacy > (F_LEGACY_ANYSP = 0x008) as [^/]*str | [^/]*$
>c            1.4+ Modern > as [^c]*str | [^c]*$

EXT version (more simplified than ESS)

metasymbol    meaning
*             {0, ~}
>             as [^/\\]+
|             str1 or str2 or ...
?             {0, 1}, ??? {0, 3}, ...
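
To make the EXT table a little more concrete, here is a minimal sketch of a naive backtracking search for just `*`, `?` and `|`, using only the meanings listed above. It is not regXwild's algorithm and makes no attempt at its speed; the function names `matchHereSketch` and `searchSketch` are hypothetical, the comparison is case-sensitive, and `>` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Tries to match pat[j..] against data starting at position i.
// The end of the data is not anchored: an exhausted pattern means success.
inline bool matchHereSketch(const std::string& data, std::size_t i,
                            const std::string& pat, std::size_t j)
{
    assert(i <= data.size() && j <= pat.size());
    if (j == pat.size()) return true;

    const char c = pat[j];
    if (c == '*') { // {0, ~}: zero or more of any character
        for (std::size_t k = i; k <= data.size(); ++k)
            if (matchHereSketch(data, k, pat, j + 1)) return true;
        return false;
    }
    if (c == '?') { // {0, 1}: nothing, or exactly one character
        if (matchHereSketch(data, i, pat, j + 1)) return true;
        return i < data.size() && matchHereSketch(data, i + 1, pat, j + 1);
    }
    // Anything else is a literal character.
    return i < data.size() && data[i] == c
        && matchHereSketch(data, i + 1, pat, j + 1);
}

// Splits the filter on '|' and looks for any alternative anywhere in data.
inline bool searchSketch(const std::string& data, const std::string& filter)
{
    std::vector<std::string> alternatives(1);
    for (char c : filter) {
        if (c == '|') alternatives.emplace_back();
        else alternatives.back().push_back(c);
    }
    for (const std::string& alt : alternatives) {
        if (alt.empty()) continue; // assumption: empty alternatives are ignored
        for (std::size_t start = 0; start <= data.size(); ++start)
            if (matchHereSketch(data, start, alt, 0)) return true;
    }
    return false;
}
```

With this sketch, `searchSketch("regXwild", "reg?wild")` and `searchSketch("professional system", "pro*system")` succeed, while `searchSketch("regXYwild", "reg?wild")` fails because `?` covers at most one character.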

🧮 Quantifiers

1.3+ ++??; ##??

regex            regXwild    n
.*               *           0+
.+               +           1+
.?               ?           0 | 1
.{1}             #           1
.{2}             ##          2
.{2,}            ++          2+
.{0,2}           ??          0 - 2
.{2,4}           ++??        2 - 4
(?:.{2}|.{4})    ##??        2 | 4
.{3,4}           +++?        3 - 4
(?:.{1}|.{3})    #??         1 | 3

and similar ...
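
The n column can be checked against the regex column with the standard library alone. The sketch below does not call regXwild; it only feeds the regex forms to `std::regex` (written without spaces inside the braces, as `std::regex` expects) and reports which input lengths each one fully matches. The helper name `acceptedLengths` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns every length n in [0, maxLen] for which a string of n characters
// is fully matched by the given regex (taken from the "regex" column).
inline std::vector<std::size_t> acceptedLengths(const std::string& pattern,
                                                std::size_t maxLen)
{
    assert(!pattern.empty());
    const std::regex re(pattern); // ECMAScript grammar by default
    std::vector<std::size_t> lengths;
    for (std::size_t n = 0; n <= maxLen; ++n) {
        const std::string probe(n, 'x');
        if (std::regex_match(probe, re)) lengths.push_back(n);
    }
    return lengths;
}
```

For example, `acceptedLengths(".{2,4}", 6)` yields 2, 3, 4 (the `++??` row), while `acceptedLengths("(?:.{2}|.{4})", 6)` yields only 2 and 4 (the `##??` row), which is the difference between "2 - 4" and "2 | 4".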

Play with our actual Unit-Tests.

🚀 Awesome speed

  • Up to ~2000 times faster than regexp-c++11 (regex_search) in native C++; see the Speed section.
  • For .NET (including modern .NET Core), Conari provides optional caching of 0x29 opcodes (Calli) and more, to get results as close to native C++ as possible.

Match result and Replacements

1.4+

EssRxW::MatchResult m;
rxw.match
(
    _T("number = '8888'; //TODO: up"),
    _T("'+'"),
    EssRxW::EngineOptions::F_MATCH_RESULT,
    &m
);
// m.start = 9  (position of the opening quote)
// m.end = 15   (one past the closing quote)
...
// `input` is assumed to hold the string that was matched above
input.replace(m.start, m.end - m.start, _T("'9777'"));

tstring str = _T("year = 2021; dd = 17;");
...
if(rxw.replace(str, _T(" ##;"), _T(" 00;"))) {
    // year = 2021; dd = 00;
}
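
The offset arithmetic in the first sample can be tried without the library. This is a small sketch assuming that `start` is the first matched position and `end` is one past the last, which is what the values 9 and 15 above imply; `Span` and `replaceSpan` are hypothetical names standing in for the real `MatchResult`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two offsets reported by a match:
// `start` is the first matched position, `end` is one past the last.
struct Span {
    std::size_t start;
    std::size_t end;
};

// Replaces the half-open range [start, end) of `input` with `with`.
inline std::string replaceSpan(std::string input, const Span& m,
                               const std::string& with)
{
    assert(m.start <= m.end && m.end <= input.size());
    input.replace(m.start, m.end - m.start, with); // position, length, text
    return input;
}
```

Calling `replaceSpan("number = '8888'; //TODO: up", Span{9, 15}, "'9777'")` gives `number = '9777'; //TODO: up`.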

🍰 Open and Free

Open Source project; MIT License, Enjoy 🎉

License

The MIT License (MIT)

Copyright (c) 2013-2021  Denis Kuzmin <[email protected]> github/3F

[ ☕ Make a donation ]

regXwild contributors: https://github.com/3F/regXwild/graphs/contributors

We're waiting for your awesome contributions!

Speed

Procedure of testing

  • Use the algo subproject as the tester for the main algorithms (Release cfg - x32 & x64)
  • The calculation is a simple average: each iteration gives i = (t2 - t1), and the result is sum(i) / n, where:
    • i - one iteration of searching by filter; it represents the time delta t2 - t1
    • n - the number of times the matching is repeated to get the average.

e.g.:

{
    Meter meter;
    int results = 0;

    for(int total = 0; total < average; ++total)
    {
        meter.start();
        for(int i = 0; i < iterations; ++i)
        {
            if((alg.*method)(data, filter)) {
                //...
            }
        }
        results += meter.delta();
    }

    TRACE((results / average) << "ms");
}
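
The `Meter` type used above is not shown here. As a rough idea of how such a measurement could be put together, the following sketch builds a stand-in on `std::chrono::steady_clock`; `SketchMeter` and `averageMs` are hypothetical names, and the real tester in algo may work differently.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for Meter, built on a monotonic clock.
class SketchMeter {
public:
    void start() { begin_ = std::chrono::steady_clock::now(); }

    // Milliseconds elapsed since the last start().
    double delta() const {
        const auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - begin_).count();
    }

private:
    std::chrono::steady_clock::time_point begin_{};
};

// Runs `body` `iterations` times per sample, takes `average` samples,
// and returns sum(i) / n in milliseconds.
template <typename Body>
double averageMs(int average, int iterations, Body body)
{
    assert(average > 0 && iterations >= 0);
    SketchMeter meter;
    double results = 0.0;
    for (int total = 0; total < average; ++total) {
        meter.start();
        for (int i = 0; i < iterations; ++i) body();
        results += meter.delta();
    }
    return results / average;
}
```

A monotonic clock is used so that system clock adjustments during a run cannot produce negative deltas.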

For the regex results, it also prepares an additional basic_regex from the filter, but of course only once for all iterations:

meter.start();

auto rfilter = tregex(
    filter,
    regex_constants::icase | regex_constants::optimize
);

results += meter.delta();
...
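
For readers who want to reproduce the regex side in isolation, here is a self-contained sketch of the same idea: the `std::wregex` is constructed once with `icase | optimize` and then reused for every search. The function name `countRegexHits` and the sample pattern in the note below are assumptions for illustration, not the filter conversion used by algo.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Counts the lines in which `pattern` is found, compiling the regex once
// and reusing it for every search.
inline std::size_t countRegexHits(const std::vector<std::wstring>& lines,
                                  const std::wstring& pattern)
{
    assert(!pattern.empty());
    const std::wregex rfilter(
        pattern,
        std::regex_constants::ECMAScript
            | std::regex_constants::icase
            | std::regex_constants::optimize
    );

    std::size_t hits = 0;
    for (const std::wstring& line : lines) {
        if (std::regex_search(line, rfilter)) ++hits;
    }
    return hits;
}
```

For instance, `countRegexHits({L"Anime Haruhi 02 MAGICA", L"no match here"}, L"nime.*haru.*02.*magica")` counts one hit, since `icase` ignores the difference in letter case.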

Please note:

  • +icase means case is ignored when matching the filter (pattern) within the searched string, i.e. ignoreCase = true. Without it, everything is of course much faster; icase always adds complexity.
  • Below, MultiByte can be faster than Unicode (for the same platform and the same way of using the module). This depends on the specific architecture: it can be ~2 times faster in native C++, and ~4 times faster with .NET + Conari and related.
  • The results below may differ between machines. Only the difference (in milliseconds) between algorithms for a specific target matters.
  • To calculate the data, as in the table below, you need to run algo.exe

Sample of speed for Unicode

340 Unicode Symbols and 10^4 iterations (340 x 10000); Filter: L"nime**haru*02*Magica"

algorithms (see impl. from algo)               +icase [x32]    +icase [x64]
Find + Find                                    ~58ms           ~44ms
Iterator + Find                                ~57ms           ~46ms
Getline + Find                                 ~59ms           ~54ms
Iterator + Substr                              ~165ms          ~132ms
Iterator + Iterator                            ~136ms          ~118ms
main :: based on Iterator + Find               ~53ms           ~45ms

Final algorithm - EXT version:                 ~50ms           ~26ms
Final algorithm - ESS version:                 ~50ms           ~27ms

regexp-c++11(regex_search)                     ~59309ms        ~53334ms
regexp-c++11(only as ^match$ like a '==')      ~12ms           ~5ms
regexp-c++11(regex_match with endings .*)      ~59503ms        ~53817ms

ESS vs EXT

350 Unicode Symbols and 10^4 iterations (350 x 10000);

Operation (+icase)    EXT [x32]    ESS [x32]    EXT [x64]    ESS [x64]
ANY                   ~54ms        ~55ms        ~32ms        ~34ms
ANYSP                 ~60ms        ~59ms        ~37ms        ~38ms
ONE                   ~56ms        ~56ms        ~33ms        ~35ms
SPLIT                 ~92ms        ~94ms        ~58ms        ~63ms
BEGIN                 ---          ~38ms        ---          ~19ms
END                   ---          ~39ms        ---          ~21ms
MORE                  ---          ~44ms        ---          ~23ms
SINGLE                ---          ~43ms        ---          ~22ms

For .NET users through Conari engine:

Same test Data & Filter: 10^4 iterations

Release cfg; x32 or x64 regXwild (Unicode)

Attention: For more speed you need to upgrade to Conari 1.3 or higher!

algorithms (see impl. from snet)               +icase [x32]    +icase [x64]
regXwild via Conari v1.2 (Lambda) - ESS        ~1032ms         ~1418ms         x
regXwild via Conari v1.2 (DLR) - ESS           ~1238ms         ~1609ms         x
regXwild via Conari v1.2 (Lambda) - EXT        ~1117ms         ~1457ms         x
regXwild via Conari v1.2 (DLR) - EXT           ~1246ms         ~1601ms         x

regXwild via Conari v1.3 (Lambda) - ESS        ~58ms           ~42ms           <<
regXwild via Conari v1.3 (DLR) - ESS           ~218ms          ~234ms
regXwild via Conari v1.3 (Lambda) - EXT        ~54ms           ~35ms           <<
regXwild via Conari v1.3 (DLR) - EXT           ~214ms          ~226ms

.NET Regex engine [Compiled]                   ~38310ms        ~37242ms
.NET Regex engine [Compiled]{only ^match$}     < 1ms           ~3ms
.NET Regex engine                              ~31565ms        ~30975ms
.NET Regex engine {only ^match$}               < 1ms           ~1ms

How to get regXwild

regXwild v1.1+ can also be installed through NuGet, in the same way for both unmanaged and managed projects.

For .NET it will put x32 & x64 regXwild into $(TargetDir). Use it with your .NET modules through Conari and so on.

x64 + x32 Unicode + MultiByte modules;

Please note: Modern regXwild packages are no longer distributed together with Conari. Please consider using the Conari NuGet packages separately.

regxwild's Issues

Capture data for all quantifiers

The most lightweight way (+Speed): implement a flag that enables capturing data by default for all quantifiers *?+#

  • @c ... enabled ...
  • @c ... enabled ... @c ... disabled ...

example:

  • @c12*5 a?d 34+7
    • 12005 and 3417
      • -> $1 - 00
      • -> $2 - n
      • -> $3 - 1
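
Since this is only a proposal, the expected captures can be illustrated with `std::regex` instead, translating each quantifier to its regex form (`*` to `.*`, `?` to `.?`, `+` to `.+`) and wrapping it in a group. This sketch does not use regXwild, and `captureGroups` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the captured groups ($1, $2, ...) when `pattern` is found in
// `input`, or an empty vector when there is no match.
inline std::vector<std::string> captureGroups(const std::string& input,
                                              const std::string& pattern)
{
    assert(!pattern.empty());
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_search(input, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i) groups.push_back(m[i].str());
    }
    return groups;
}
```

`captureGroups("12005 and 3417", "12(.*)5 a(.?)d 34(.+)7")` returns `00`, `n` and `1`, the same values as $1, $2 and $3 in the example above.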

Alternative

Speed is the most important thing, because a regex engine is much more powerful anyway.

v2:

  • @c ... @
    • 12@c*@7 - > 12457 accessor: $n

v3 (overhead):

  • @c flag - switch to enable () for capturing
    • @c12(*)7 - > 12457 accessor: $1 - 45
    • escape \(\) or (())

Disable Meta-Symbols

v1 - Same char to escape

  • 1024*2 -> 10247412
  • 1024**2 -> 1024*2
  • 1024***2 -> 1024*0xD2
  • one|two -> one or two
  • one||two -> one|two
  • one|||two -> one| or two
  • one||||two -> one||two
  • one|||||two -> one|| or two
  • one|*||two -> one or |two
  • one|*||||two -> one or ||two

... for each available meta-symbol
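
One way to read the v1 examples is a left-to-right scan in which a meta-symbol immediately followed by the same symbol becomes a single literal, and a lone meta-symbol stays active. The sketch below implements only that reading, as an assumption drawn from the examples; `FilterToken` and `tokenizeDoubled` are hypothetical names and nothing here exists in regXwild.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One parsed unit of a filter: an active meta-symbol or a literal character.
struct FilterToken {
    bool meta;
    char ch;
};

inline bool operator==(const FilterToken& a, const FilterToken& b)
{
    return a.meta == b.meta && a.ch == b.ch;
}

// Scans left to right: a meta-symbol immediately followed by the same symbol
// becomes one literal; a lone meta-symbol stays active.
inline std::vector<FilterToken> tokenizeDoubled(const std::string& filter,
                                                const std::string& metas = "*|?")
{
    assert(!metas.empty());
    std::vector<FilterToken> out;
    for (std::size_t i = 0; i < filter.size(); ++i) {
        const char c = filter[i];
        const bool isMeta = metas.find(c) != std::string::npos;
        if (isMeta && i + 1 < filter.size() && filter[i + 1] == c) {
            out.push_back({false, c}); // doubled symbol: one literal
            ++i;                       // skip the second copy
        } else {
            out.push_back({isMeta, c});
        }
    }
    return out;
}
```

Under this reading, `1024***2` becomes the literal `1024*`, an active `*`, and `2`, and `one|||two` becomes the literal `one|`, an active `|`, and `two`, matching the list above.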

upd: no. The ESS version provides #, which means exactly 1 of any symbol, so we cannot use this approach because of filters like track-num:*###7400 and similar.

v2 - markers via flags

We'll add @ + special flag ([a-z0-9] only) + @@ to escape this flag

Then we can use unified combinations of this, for example:

  • @ + d to disable all meta-symbols until the next @(?) -> @d ... @
    • 1024*2 -> 10247412
    • 1024@d*@2 -> 1024*2 or @d1024*2@ -> 1024*2
    • @d1024*@*2 -> 1024*0xD2
    • @@d1024*@*2 -> @@d1024*@0xD2 -> note: @* ok (@ + [a-z0-9])
    • one|two -> one or two
    • @done|two@ -> one|two
    • @done|@|two -> one| or two
    • @done||two@ -> one||two
    • @done||@|two -> one|| or two
    • one|@d|two@ -> one or |two
    • one|@d||two@ -> one or ||two

Well, this is a more powerful way, but probably harder to read o_o and:

  • v2: one|@d||two@ <- ?@done|@@d||two@@@
  • v1: one|@d||two@ <- one||@@d||||two@@

v3 - markers via single tilde '~'

v2 can also be simplified to ~ ... ~, with ~~ to escape:

  • one|@d||two@ <- ~one|@d||two@~
  • one|~d||two~ <- ~one|~~d||two~~~

Special markers & Flags

To use additional syntax, we can add special markers & flags

The entry point like:

  • Single: @x ...
  • Pair: @x ... @

Where x is any supported combination, like a single symbol [0-9a-z] or complex expressions like {1,7}.

This is related to #1 (v2 - markers via flags), but the main idea is to extend the syntax to support additional features like the following.

Sub-Issues

  • #4 - Capture data for all quantifiers
  • #5 - Control of newline symbols
  • #1 - Disable Meta-Symbols

Special symbol-logic like for current '>'

MS_ANYSP = _T('>'), // as [^/]* //TODO: >\>/ i.e. '>' + {symbol}

  • @B> ...

etc.

non-/whitespace character + std. quantifiers

Add some variant for using whitespace & non-whitespace characters

v1 - single meta-symbol + std. quantifier

  • % - whitespace + quantifier *?+#
  • & - non-whitespace + quantifier *?+#

for example:

  • 1%+5 -> 125, 12345, ...
  • 1%##5 -> 1235, 1775, ...
  • 1%?5 -> 15, 125, 145, ...
  • 1%*5 -> 123467895, 15, ...

configurable % + & like:

search(const TCHAR* data, const TCHAR* filter, const TCHAR* whsp, const TCHAR* nonwhsp)
_T("qwerty012345...")
_T("	 ")

or simply:

search(const TCHAR* data, const TCHAR* filter, const TCHAR* whsp)
_T("	 ") - all except this is a non-whitespace character

but the first variant is more flexible, because it allows a special set of characters for both cases, % and &
