andgineer / tregexpr

Regular expressions (regex), pascal.

Home Page: https://regex.sorokin.engineer/en/latest/

License: MIT License

Pascal 100.00%

delphi freepascal pascal regex

tregexpr's Issues

Maybe delete this getter?

function TRegExpr.GetInputString : RegExprString;
 begin
  if fInputString = '' then begin
    Error (reeGetInputStringWithoutInputString);
    EXIT;
   end;
  Result := fInputString;
 end; { of function TRegExpr.GetInputString }

Is it OK to remove this getter, or is it needed?

Don't support Delphi 4 and older in test cases

tests.pas

{$IFDEF VER130} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D5
{$IFDEF VER140} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D6
{$IFDEF VER150} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}

Let's drop support for old Delphi versions here (D2, D3, D4, D5), in the tests only.

For a user character class, ']' at the beginning is not included in the class

https://regex.sorokin.engineer/en/latest/regular_expressions.html#user-character-classes
"If you want ] or [ you may place it at the start"
but in fact we do not do that:
https://github.com/andgineer/TRegExpr/blob/61701cb4f0a53f7001ba9a0867352f962b9ad3ce/src/RegExpr.pas#L1010

Delphi tests in travis CI

Single pass of regex compiler?

@andgineer I want to try making the compiler single-pass. Currently there are two passes: the first computes the size of the program, the second writes the program into the buffer. How to do it in one pass: allocate memory right away and write into the buffer, calling ReallocMem as the buffer grows. Realloc in steps of N characters (I suggest N=100). Many expressions are short and fit in 100-300 characters, which needs only 2 reallocs. OK?
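The growth strategy described above can be sketched roughly as follows. This is only an illustration written for Free Pascal: TCodeBuffer, InitBuffer, EmitChar and FreeBuffer are hypothetical names, not TRegExpr code, and the real compiler emits opcodes rather than plain characters.

```pascal
program GrowBufferSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

const
  GrowStep = 100; // N from the proposal above

type
  // Hypothetical code buffer; not a TRegExpr type.
  TCodeBuffer = record
    Data: PChar;
    Used: Integer;     // characters written so far
    Capacity: Integer; // characters currently allocated
    Reallocs: Integer; // how many times the buffer was (re)allocated
  end;

procedure InitBuffer(var B: TCodeBuffer);
begin
  B.Data := nil;
  B.Used := 0;
  B.Capacity := 0;
  B.Reallocs := 0;
end;

// Append one character, growing the buffer by GrowStep when it is full.
procedure EmitChar(var B: TCodeBuffer; C: Char);
begin
  if B.Used >= B.Capacity then
  begin
    Inc(B.Capacity, GrowStep);
    ReallocMem(B.Data, B.Capacity * SizeOf(Char));
    Inc(B.Reallocs);
  end;
  B.Data[B.Used] := C;
  Inc(B.Used);
end;

procedure FreeBuffer(var B: TCodeBuffer);
begin
  FreeMem(B.Data);
  InitBuffer(B);
end;

var
  B: TCodeBuffer;
  i: Integer;
begin
  InitBuffer(B);
  for i := 1 to 250 do
    EmitChar(B, 'x');
  Writeln('used=', B.Used, ' capacity=', B.Capacity, ' allocations=', B.Reallocs);
  FreeBuffer(B);
end.
```

For a 250-character program the buffer is allocated once and then grown twice, which matches the estimate in the proposal. A doubling strategy would need fewer reallocations for long expressions, at the cost of more unused memory.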

Support \h and \v

Maybe I will implement this.
https://www.regular-expressions.info/shorthand.html

While support for \d, \s, and \w is quite universal, there are some regex flavors that support additional shorthand character classes. Perl 5.10 introduced \h and \v. \h matches horizontal whitespace, which includes the tab and all characters in the "space separator" Unicode category. It is the same as [\t\p{Zs}]. \v matches "vertical whitespace", which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].

PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, Java as of version 8, and the JGsoft engine as of version 2. Boost supports \h starting with version 1.42. No version of Boost supports \v as a shorthand.

In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they were free to give \v a different meaning. Java 4 to 7 and JGsoft V1 did use \v to match only the vertical tab. Java 8 and JGsoft V2 changed the meaning of this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the above paragraph uses \cK to represent the vertical tab.
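As a rough illustration of the Perl-style definitions quoted above, the two classes could be checked with hardcoded tests like these. IsHorizontalSpace and IsVerticalSpace are hypothetical helper names, not TRegExpr functions; the Zs list is the set of "space separator" characters in current Unicode versions, and only the BMP is covered.

```pascal
program HVSpaceSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

// \h: tab plus the Unicode "space separator" (Zs) characters.
function IsHorizontalSpace(C: WideChar): Boolean;
begin
  case Ord(C) of
    $0009,               // tab
    $0020, $00A0, $1680, // space, no-break space, ogham space mark
    $2000..$200A,        // en quad .. hair space
    $202F, $205F, $3000: // narrow no-break, medium mathematical, ideographic
      Result := True;
  else
    Result := False;
  end;
end;

// \v: [\n\cK\f\r\x85\x{2028}\x{2029}]
function IsVerticalSpace(C: WideChar): Boolean;
begin
  case Ord(C) of
    $000A..$000D,        // LF, vertical tab, form feed, CR
    $0085,               // next line
    $2028, $2029:        // line separator, paragraph separator
      Result := True;
  else
    Result := False;
  end;
end;

begin
  Writeln('tab is \h:    ', IsHorizontalSpace(WideChar($0009)));
  Writeln('LF is \h:     ', IsHorizontalSpace(WideChar($000A)));
  Writeln('LF is \v:     ', IsVerticalSpace(WideChar($000A)));
  Writeln('U+2028 is \v: ', IsVerticalSpace(WideChar($2028)));
end.
```

In a non-Unicode build only the code points below $100 would be reachable, so the checks would shrink accordingly.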

Merge with Free Pascal: escape sequences in the replace substitution string

Initial ticket
#2

It changed in Free Pascal: the old behaviour is kept by default, and you can
switch to the new one with the property UseOsLineEndOnReplace
https://github.com/graemeg/freepascal/commit/4f00b7d7fcc92c84bcfa1bd28f6a835af6c510c2

Update Dump() for new opcode

Todo

Give error for [\1\Z]

If meta chars are not allowed in [], this is not handled and no error is shown.
E.g. try the regex [\1] or [\Z]: this gives char '1' in [], char 'Z' in [].
Suggestion: show an error here.

            if regparse^ = EscChar then
            begin
              Inc(regparse);
              if regparse >= fRegexEnd then
              begin
                Error(reeParseAtomTrailingBackSlash);
                Exit;
              end;
              if _IsMetaChar(regparse^) then
              begin
                AddrOfString := nil;
                CanBeRange := False;
                EmitC(OpKind_MetaClass);
                EmitC(regparse^);
              end
              else
              begin
                EmitSimpleRangeC(UnQuoteChar(regparse));
               //!! error
              end;

strange IFDEF

{$IFDEF OverMeth}
function TRegExpr.Replace(const AInputStr: RegExprString;
  AReplaceFunc: TRegExprReplaceFunction): RegExprString;
begin
  {$IFDEF FPC}Result := {$ENDIF}
  ReplaceEx(AInputStr, AReplaceFunc);
end; { of function TRegExpr.Replace
  -------------------------------------------------------------- }
{$ENDIF}

Why is Result set only for FPC?

[\s\S] doesn't seem to work

Using Regex "Test:\s*([\s\S]?)\s;" (without quotes, obviously) with an input of "Test: hello ;" correctly Returns "hello" on other Regex tools (e.g. http://www.regexr.com/) but returns no results using TRegExpr.

Using "Test:\s*(.?)\s;" works for this case in TRegExpr but obviously wouldn't do the same job if you were using a multi-line input string.

Unless I'm mistaken, the below should return "hell\nlo":

Drop support for old Delphi?

{$IFDEF D3} {$DEFINE UseAsserts} {$ENDIF}
{$IFDEF FPC} {$DEFINE UseAsserts} {$ENDIF}
// Define 'use subroutine parameters default values' option (do not edit this definition).
{$IFDEF D4} {$DEFINE DefParam} {$ENDIF}
{$IFDEF FPC} {$DEFINE DefParam} {$ENDIF}
// Define 'OverMeth' options, to use method overloading (do not edit this definitions).
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}

This is not good. Maybe drop D5 and older? Or D4 and older? The code is not nice with all these ifdefs.

Bad result of SubExprMatchCount

The result is 2x bigger than needed.
I will add a test case for it and a fix.

Exec with TryOnce

I will add a TryOnce exec: an exec which tests only at Offset (not in a loop). It's needed for a lexer/parser which must test only one offset; the outer code must then change the offset (sometimes by 1, sometimes by n).

To add ATryOnce, I will change the code near if reganchored <> #0 in MatchPrim.

Is this naming OK?
ExecPos(AOffset: integer; ATryOnce: boolean)

Must rewrite StrScanCI

DELETED.
it is okay.

Failed URL to home server

Failed url
http://regexpr.masterandrey.com/en/latest/
Actual
https://regex.sorokin.engineer/

Directory is missing in restudio

I can't compile restudio due to a missing directory.

The compiler complains about a missing unit 'tynList', which is expected in the directory 'Persistence'.
Unfortunately, this directory does not exist.

Can you please provide the directory (with tynlist, ansoStrings & ansoRTTIHook)?

Kind regards
Andreas

UNHANDLED EXCEPTION!!! TRegExpr(comp): Urecognized Modifier (pos 10)

^(?!([0-9])\1{9})[0-9]{10}$

1111111111 - not ok
1111111112 - ok

Suggest: Fixes to get working on newer versions of Delphi

The new solution for regular expressions in Delphi has some serious drawbacks as well, in particular in e.g. XE2-XE4 when dealing with UCS-2 documents, and it is not maintained by a third party, so people cannot upgrade that library without buying new Delphi versions.
I want to have cross-compile code in Delphi and Lazarus. This is an important goal, I think, and very easy to reach.

I suggest adding these changes to the RegExpr unit:

{$IFDEF VER170} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2005
{$IFDEF VER180} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2006
{$IFDEF VER200} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2009
{$IFDEF VER210} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2010
{$IFDEF VER220} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE
{$IFDEF VER230} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE2
{$IFDEF VER240} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE3
{$IFDEF VER250} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE4
{$IFDEF VER260} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE5
{$IFDEF VER270} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE6
{$IFDEF VER280} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE7

and maybe

{$IFDEF D7} {$DEFINE UseAsserts} {$ENDIF}

Extra:
I would also suggest increasing this to 63 instead of 15 (you hit 15 really fast; in practice 63 holds even when used intensively over 10+ years)

NSUBEXP = 63;

Idea how to do SmallSet optimization with new opcode

With the new opcode (OpKind_Char+len, chars), (OpKind_Meta, 'w'), (OpKind_Range, a, b) it's possible to make a SmallSet optimization (a set of char with fewer than 32 elements).
Plan:
Collect the current opcode (OpKind_...) until the end of the range (OpKind_End).
Then analyze the collected data, from the beginning of the opcode to the end of the range.
If the collected data fits into a 32-char range (e.g. 'a'...'z' plus symbols), then erase the opcode (decrease regcode) and replace it with a SmallSet opcode.

This will need hard testing: new tests with complex ranges [a/,k-xbe-fz] etc.
Ranges with metachars \w \d cannot be changed.
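One possible reading of the "fits into a 32-char range" check is a 32-bit mask relative to the lowest character of the class. The sketch below assumes that reading; TSmallSet, TryBuildSmallSet and InSmallSet are hypothetical names, and the class is passed as an already expanded string of characters rather than as opcodes.

```pascal
program SmallSetSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

type
  // Hypothetical compact class representation; not a TRegExpr type.
  TSmallSet = record
    Base: Integer;  // code of the lowest character in the class
    Bits: Cardinal; // bit i set => Chr(Base + i) belongs to the class
  end;

// Returns False when the characters do not fit into a 32-code window.
function TryBuildSmallSet(const Chars: AnsiString; out S: TSmallSet): Boolean;
var
  i, MinCode, MaxCode: Integer;
begin
  Result := False;
  S.Base := 0;
  S.Bits := 0;
  if Chars = '' then
    Exit;
  MinCode := Ord(Chars[1]);
  MaxCode := MinCode;
  for i := 2 to Length(Chars) do
  begin
    if Ord(Chars[i]) < MinCode then MinCode := Ord(Chars[i]);
    if Ord(Chars[i]) > MaxCode then MaxCode := Ord(Chars[i]);
  end;
  if MaxCode - MinCode >= 32 then
    Exit; // too wide, keep the generic opcode
  S.Base := MinCode;
  for i := 1 to Length(Chars) do
    S.Bits := S.Bits or (Cardinal(1) shl (Ord(Chars[i]) - MinCode));
  Result := True;
end;

function InSmallSet(const S: TSmallSet; C: AnsiChar): Boolean;
var
  Off: Integer;
begin
  Off := Ord(C) - S.Base;
  if (Off < 0) or (Off >= 32) then
    Result := False
  else
    Result := ((S.Bits shr Off) and 1) = 1;
end;

var
  S: TSmallSet;
begin
  if TryBuildSmallSet('abcxyz', S) then
    Writeln('x: ', InSmallSet(S, 'x'), '  m: ', InSmallSet(S, 'm'));
  // ',' (44) and 'z' (122) are more than 32 codes apart
  Writeln('fits: ', TryBuildSmallSet(',az', S));
end.
```

With this representation a membership test is one subtraction, one range check and one bit test, which is why the class must not contain metachars such as \w or \d.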

Better find in char classes

After patch #90 I will also:

write the names of meta classes into the opcode, instead of whole strings like DigitChars, WordChars, SpaceChars
make the inverse classes \W \S \D work inside []
write ranges a-x as two codes, instead of the whole string from start to end

Non capturing groups (?=...), (?!...) support

Is it possible to add an option to use non-capturing groups?
either with ?: or /n

Rename files in test/

I suggest renaming the files in the test/ dir:

name FPC project as test_fpc.*
name Delphi project as test_delphi.*
rename pas unit with tests to "tests.pas"
adjust also CI file "test.sh" and .gitignore

do you agree? @andgineer

Update docs

https://regex.sorokin.engineer/en/latest/regular_expressions.html#predefined-character-classes

\h and \v are missing here.

String containing a null character is not matched properly

The NUL character is not matched properly.

program testemail;

uses
    regexpr;

var
  RegexObj: TRegExpr;

begin
  RegexObj := TRegExpr.Create;

  regexObj.expression := '^(\d+):CONTENT_LENGTH\x00(\d+)\x00';
  if RegexObj.Exec('1065:CONTENT_LENGTH' + #0 + '185364' + #0 + 'SCGI'+ #0 + '1' + #0 + 'CONTENT_') then 
      WriteLn('matched!');
  RegexObj.Free;
end.
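One plausible cause, stated here as an assumption rather than a diagnosis of the TRegExpr source, is that the engine treats #0 as the end of the input, the way C-style string routines do. The sketch below only demonstrates that general effect: StrLen stops at the first #0, while Length and a pointer-bounded loop see the whole string.

```pascal
program NulTruncationSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

uses
  SysUtils;

// Counts NUL characters by walking up to the known end pointer,
// instead of stopping at the first #0.
function CountNuls(const S: AnsiString): Integer;
var
  P, PEnd: PAnsiChar;
begin
  Result := 0;
  if S = '' then
    Exit;
  P := PAnsiChar(S);
  PEnd := P + Length(S);
  while P < PEnd do
  begin
    if P^ = #0 then
      Inc(Result);
    Inc(P);
  end;
end;

var
  S: AnsiString;
begin
  S := 'CONTENT_LENGTH' + #0 + '185364' + #0;
  Writeln('Length: ', Length(S));            // whole string, NULs included
  Writeln('StrLen: ', StrLen(PAnsiChar(S))); // stops at the first #0
  Writeln('NULs:   ', CountNuls(S));
end.
```

Any scanning loop that uses a terminator test of the form `P^ <> #0` would need to be replaced by an end-pointer comparison like the one in CountNuls for inputs such as the SCGI header above to match.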

\w \W - opcode or string?

ParseAtom has this code for \d \D:

          'd': begin // r.e.extension - any digit ('0' .. '9')
             ret := EmitNode (ANYDIGIT);
             flagp := flagp or HASWIDTH or SIMPLE;
            end;
          'D': begin // r.e.extension - not digit ('0' .. '9')
             ret := EmitNode (NOTDIGIT);
             flagp := flagp or HASWIDTH or SIMPLE;
            end;

Here the opcode ANYDIGIT or its inverse is emitted. OK.
For \w \W it is done differently: either an opcode or an Emit of the WordChars string.

          'w': begin // r.e.extension - any english char / digit / '_'
             {$IFDEF UseSetOfChar}
             ret := EmitRange (ANYOF);
             EmitRangeStr (WordChars);
             EmitRangeC (#0);
             {$ELSE}
             ret := EmitNode (ANYLETTER);
             {$ENDIF}
             flagp := flagp or HASWIDTH or SIMPLE;
            end;

Why not always emit an opcode here? It seems better: then UseUnicodeWordDetection would work for this case too (for now it apparently works in some places and not in others). @andgineer

Here is where UseUnicodeWordDetection is used:

function TRegExpr.IsWordChar(AChar: REChar): Boolean;
begin
  Result := Pos(AChar, fWordChars)>0;
  {$IFDEF UnicodeWordDetection}
  If Not Result and UseUnicodeWordDetection then
    Result:=IsUnicodeWordChar(aChar);
  {$ENDIF}
end;

Fix author emails in License, in .pas

subj. @andgineer
and "anso.da.ru"- fix it.

Strange behavior: \w* and backreferences

uses RegExpr;
begin
  WriteLn( ReplaceRegExpr('(\w*)','name.ext','$1.new', True) );
  ReadLn;
end.

Return: name.new.new.ext.new.new. Bug or incorrect use?

In Russian: some details of the problem, from here on down.
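A likely explanation is that \w* can also match the empty string, so besides 'name' and 'ext' there are empty matches (before the dot and at the end of the input), and each match gets '.new' appended. The sketch below lists the matches and shows the \w+ variant; it assumes the RegExpr unit is on the unit path and uses only Exec, ExecNext, Match, MatchPos, MatchLen and ReplaceRegExpr.

```pascal
program EmptyMatchSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

uses
  RegExpr;

// Prints every match of AExpr in AInput with its position and length.
procedure ListMatches(const AExpr, AInput: string);
var
  RE: TRegExpr;
begin
  RE := TRegExpr.Create;
  try
    RE.Expression := AExpr;
    if RE.Exec(AInput) then
      repeat
        Writeln('pos=', RE.MatchPos[0], ' len=', RE.MatchLen[0],
          ' text="', RE.Match[0], '"');
      until not RE.ExecNext;
  finally
    RE.Free;
  end;
end;

begin
  // Zero-length matches show up here with len=0.
  ListMatches('(\w*)', 'name.ext');
  // With + instead of * only non-empty words are matched and replaced.
  Writeln(ReplaceRegExpr('(\w+)', 'name.ext', '$1.new', True));
end.
```

Other regex flavors with global replace behave the same way for a pattern that can match the empty string, so using \w+ is probably the intended form.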

Escape sequences in replace string (substitute template)

are not supported

Bug of FirstCharSet

wrong FirstCharset

sync with freepascal pull request for esc-sequences in replace

https://github.com/graemeg/freepascal/pull/15

Change doc in accordance with new ReplaceRegExpr design

Restore FillFirstCharSet optimization

It significantly improves speed on large input text: we quickly skip all positions where running the full-blown regex engine is not worthwhile.
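The idea can be sketched independently of the engine: precompute the set of characters a match can start with, and call the expensive matcher only at positions whose character is in that set. FullMatchAt below is a hypothetical stand-in for the real engine (it just tests for the literal 'cat'), and FindFirst is not a TRegExpr method.

```pascal
program FirstCharSetSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

type
  TCharSet = set of AnsiChar;

// Stand-in for the full regex engine: matches the literal 'cat' at APos.
function FullMatchAt(const AText: AnsiString; APos: Integer): Boolean;
begin
  Result := Copy(AText, APos, 3) = 'cat';
end;

// Returns the 1-based position of the first match, or 0.
// EngineCalls reports how often the expensive matcher was run.
function FindFirst(const AText: AnsiString; const FirstChars: TCharSet;
  out EngineCalls: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  EngineCalls := 0;
  for i := 1 to Length(AText) do
  begin
    if not (AText[i] in FirstChars) then
      Continue; // cheap rejection, the engine is not started here
    Inc(EngineCalls);
    if FullMatchAt(AText, i) then
    begin
      Result := i;
      Exit;
    end;
  end;
end;

var
  Calls, P: Integer;
begin
  P := FindFirst('a black cat', ['c'], Calls);
  Writeln('match at ', P, ' after ', Calls, ' engine calls');
end.
```

Here the engine runs only at the two 'c' positions instead of at all nine positions up to the match.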

How to refactor here? test fails

There is this part:

              case regparse^ of // r.e.extensions
                'd':
                  EmitRangeStr('0123456789');
                'w':
                  {$IFDEF UseWordChars}
                  EmitRangeStr(WordChars);
                  {$ELSE}
                  EmitNode(OP_ANYLETTER);
                  {$ENDIF}

Here I try to make a replacement, by analogy with \w \s:

                  //EmitRangeStr('0123456789');
                  EmitNode(OP_ANYDIGIT);

But this immediately breaks Test11.

    (
    expression: '[^\d]+';
    inputText: '234578923457823659ARTZU38';
    substitutionText: '';
    expectedResult: 'ARTZU';
    matchStart: 19
    ),

Apparently such a replacement is wrong for handling [^\d]. That is already bad. And I suspect that the code for \w is then not good either: it will fail on [^\w]. This needs checking.
What to do?

todo optimization

          if (PREOp(scan)^ = OP_EXACTLY) and
            (strlen(scan + REOpSz + RENextOffSz) >= PtrInt(Len)) then
          begin
            longest := scan + REOpSz + RENextOffSz;
            Len := strlen(longest);
          end;

Format src

I can format the source code a little, with mass replacements:

delete spaces before ( and [
delete spaces before : and :=
lowercase some keywords: Const Var
title case EXIT

Ok? @andgineer

Wanted features of RegEx in FreePascal

Big thanks for your RegExpr unit in FPC. It's very useful!
A user of the CudaText program wants modern regex features; he listed them on this page:
Alexey-T/CudaText#2279

What do you think?

Better remove macOS from CI

macOS tests run SLOWLY: first the VM installs 100*k packages, then it installs the 100M Lazarus package. It has been running for 9 minutes already, while all Linux tests passed long ago.

Help on code needed

I need a hint. I can't understand the code: I need to change StrScan and StrScanCI, but they are called in two places. Which one should I change in order to change the parsing of a [ ] char class?

The first is in regrepeat:

    OP_ANYOF:
      while (Result < TheMax) and (StrScan(opnd, scan^) <> nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYBUT:
      while (Result < TheMax) and (StrScan(opnd, scan^) = nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYOFCI:
      while (Result < TheMax) and (StrScanCI(opnd, scan^) <> nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYBUTCI:
      while (Result < TheMax) and (StrScanCI(opnd, scan^) = nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;

The second is in MatchPrim:

      OP_ANYOF:
        begin
          if (reginput = fInputEnd) or
            (StrScan(scan + REOpSz + RENextOffSz, reginput^) = nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYBUT:
        begin
          if (reginput = fInputEnd) or
            (StrScan(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYOFCI:
        begin
          if (reginput = fInputEnd) or
            (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) = nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYBUTCI:
        begin
          if (reginput = fInputEnd) or
            (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
            Exit;
          Inc(reginput);
        end;

This is because I want to write PAIRS (kind, data) into the opcode. @andgineer

Remove WordChars, SpaceChars props, add hardcoded Unicode checks for them

I would like to make these changes for my CudaText.
I suggest removing the WordChars and SpaceChars properties and replacing them with hardcoded checks.

Both are simple and reliable to do. Why was WordChars introduced at all? As I understand it, people did not want to do a full analysis of UnicodeData but wanted to detect many letters, so they added WordChars.
But this is clumsy: it is slow, and it is unrealistic to list all letters there, since Unicode has a great many of them. Many people will add only the Latin umlauts, but what about other languages? Someone will miss Greek, someone Russian. And then there are the Asian scripts, which almost everyone will miss (Japanese, Chinese, Korean, Indian, etc.).
Hardcoded checks will be fast, faster than checking 20-40 letters; they are lookups in UnicodeData.

For SpaceChars it is fast too, also a lookup in UnicodeData.
The UnicodeData checks will be under ifdef unicode.

Do you approve the patch?
@andgineer

FirstCharSet is not used when reganchored is tested

bug. tofix.

Why bitpacked modifiers?

function TRegExpr.GetModifier(AIndex: integer): boolean;
var
  Mask: integer;
begin
  Result := False;
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  Result := (fModifiers and Mask) <> 0;
end; { of function TRegExpr.GetModifier
  -------------------------------------------------------------- }

procedure TRegExpr.SetModifier(AIndex: integer; ASet: boolean);
var
  Mask: integer;
begin
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  if ASet then
    fModifiers := fModifiers or Mask
  else
    fModifiers := fModifiers and not Mask;
end; { of procedure TRegExpr.SetModifier
  -------------------------------------------------------------- }

@andgineer I suggest making N boolean properties instead of N bits in an integer.
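The difference between the two representations can be sketched as follows. The mask values and the TModifiers record are hypothetical and only illustrate the suggestion; the real MaskMod* constants and field layout may differ.

```pascal
program ModifierFlagsSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

const
  // Hypothetical mask values, one bit per modifier.
  MaskI = 1;
  MaskS = 4;

type
  // Variant suggested in the issue: one Boolean per modifier.
  TModifiers = record
    I, R, S, G, M, X: Boolean;
  end;

var
  Bits: Integer;
  Mods: TModifiers;
begin
  // Bit-packed variant: set /i and /s, then clear /s again.
  Bits := 0;
  Bits := Bits or MaskI;
  Bits := Bits or MaskS;
  Bits := Bits and not MaskS;
  Writeln('bit-packed: i=', (Bits and MaskI) <> 0, ' s=', (Bits and MaskS) <> 0);

  // Boolean variant: the same operations are plain assignments.
  FillChar(Mods, SizeOf(Mods), 0);
  Mods.I := True;
  Mods.S := True;
  Mods.S := False;
  Writeln('booleans:   i=', Mods.I, ' s=', Mods.S);
end.
```

With separate fields the index-to-mask case statements in GetModifier and SetModifier would no longer be needed, at the cost of a few extra bytes per object.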

Small optimization: replace regdummy to bool flag

I suggest this small optimization: remove regdummy (there is a check if p = @regdummy to detect the first pass, which computes the program size) and replace it with a flag DummyPass: boolean. Slightly better. @andgineer ok?

Version prop is not needed

v. 0.947 2001.10.03
-=- (+) VersionMajor/Minor class method of TRegExpr ;)

After #41 is applied, I suggest removing these two properties. What is the point of them? Only to show something in the REStudio window? For a programmer the code is what matters, and the version is in history.txt. Even REStudio can show that version without this property.

CI for Unicode mode needed

subj.
For this, the test needs a define. fpc has the parameter -dUnicode: the word after -d is defined.
Please add it.

After #74 you will see a test with the function TestUnicode1, which is defined under ifdef Unicode.

[Fatal Error] RegExpr.pas(735): File not found: 'System.Character.dcu' on D7

There seems to be an issue with ifdef statements for D7:

[Fataler Fehler] RegExpr.pas(735): Datei nicht gefunden: 'System.Character.dcu'
[Fatal Error] RegExpr.pas(735): File not found: 'System.Character.dcu'

I had to comment out the following lines:
// uses
// System.Character; // System.Character exists since Delphi 2009

D7 Build 4.453, TRegExpr latest
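One way to guard the uses clause without listing every VERxxx symbol is to test CompilerVersion. The sketch below is an assumption about how such a guard could look, not the fix that was applied in RegExpr.pas; HasSystemCharacter is a hypothetical define name. It relies on {$IF} being available since Delphi 6 and on dotted unit names such as System.Character being available from Delphi XE2 (CompilerVersion 23).

```pascal
program CharacterUnitGuardSketch;
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Define HasSystemCharacter only for Delphi XE2 and newer.
// Delphi 7 reports CompilerVersion 15, so the define stays unset there.
{$IFNDEF FPC}
  {$IFDEF CONDITIONALEXPRESSIONS}
    {$IF CompilerVersion >= 23.0}
      {$DEFINE HasSystemCharacter}
    {$IFEND}
  {$ENDIF}
{$ENDIF}

{$IFDEF HasSystemCharacter}
uses
  System.Character;
{$ENDIF}

begin
  {$IFDEF HasSystemCharacter}
  Writeln('System.Character is used');
  {$ELSE}
  Writeln('System.Character is skipped (older Delphi or FPC)');
  {$ENDIF}
end.
```

Delphi 2009 to XE ship the unit under the unscoped name Character, so supporting those versions would need a second branch.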

Non capturing groups (?:text) support

https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions

This is the classic definition. This feature is needed, please.
In #1 users want another feature: lookahead and lookbehind assertions.

Put FPC 3.0.4 to Travis on this Github

You are using a too old FPC:

Free Pascal Compiler version 2.6.2-8 [2014/01/22] for x86_64

Copyright (c) 1993-2012 by Florian Klaempfl and others

Target OS: Linux for x86-64

Compiling testregexpr.pp

Compiling tcregexp.pas

tcregexp.pas(20,3) Fatal: Can't find unit fpwidestring used by tcregexp

ERROR: failed compiling of project /home/travis/build/andgineer/TRegExpr/test/testregexpr.lpi

Allow NULL chars in string

I can make a patch to allow this. For it I want to add a method InBuffer, which checks whether an offset is inside the buffer. Okay?

Docs

Why is there a second example for \tfoobar here? There is already one for \t.
It would be better to put all the info about \ci on one line instead of 3.