andgineer / tregexpr
Regular expressions (regex), pascal.
Home Page: https://regex.sorokin.engineer/en/latest/
License: MIT License
function TRegExpr.GetInputString : RegExprString;
begin
  if fInputString = '' then begin
    Error (reeGetInputStringWithoutInputString);
    EXIT;
  end;
  Result := fInputString;
end; { of function TRegExpr.GetInputString }
Is it OK to remove this getter, or is it needed?
tests.pas
{$IFDEF VER130} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D5
{$IFDEF VER140} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D6
{$IFDEF VER150} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}
Let's drop support for old Delphi versions here (D2, D3, D4, D5)? Only from the tests.
https://regex.sorokin.engineer/en/latest/regular_expressions.html#user-character-classes
"If you want ] or [ you may place it at the start"
but in fact we do not do that:
https://github.com/andgineer/TRegExpr/blob/61701cb4f0a53f7001ba9a0867352f962b9ad3ce/src/RegExpr.pas#L1010
@andgineer I want to try doing just one pass. Currently there are two: the first computes the size of the program, the second writes the program into the buffer. How can it be done in one? Allocate memory right away and write into the buffer, calling ReallocMem as the buffer grows. We realloc in steps of N characters (I suggest N = 100). Many expressions are short and will fit in 100-300 characters, which requires only 2 reallocs. OK?
Maybe I will do it.
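The single-pass idea can be sketched roughly as follows. This is only an illustration under assumptions: TCodeBuffer, EmitChar and GrowStep are hypothetical names invented for the sketch, not TRegExpr identifiers, and the real compiler emits opcodes rather than plain chars.

```pascal
program GrowBufferSketch;
{$mode objfpc}{$H+}

const
  GrowStep = 100; // N from the proposal: grow in steps of 100 chars

type
  // Hypothetical holder for the compiled program; not a TRegExpr type.
  TCodeBuffer = record
    Data: PChar;       // start of the allocated block
    Used: Integer;     // chars written so far
    Capacity: Integer; // chars currently allocated
    Reallocs: Integer; // number of ReallocMem calls, for illustration
  end;

procedure InitBuffer(out B: TCodeBuffer);
begin
  B.Data := nil;
  B.Used := 0;
  B.Capacity := 0;
  B.Reallocs := 0;
end;

// Append one char, growing the block by GrowStep when it is full.
procedure EmitChar(var B: TCodeBuffer; C: Char);
begin
  if B.Used >= B.Capacity then
  begin
    Inc(B.Capacity, GrowStep);
    ReallocMem(B.Data, B.Capacity * SizeOf(Char));
    Inc(B.Reallocs);
  end;
  B.Data[B.Used] := C;
  Inc(B.Used);
end;

procedure FreeBuffer(var B: TCodeBuffer);
begin
  FreeMem(B.Data);
  B.Data := nil;
end;

var
  Buf: TCodeBuffer;
  i: Integer;
begin
  InitBuffer(Buf);
  for i := 1 to 250 do
    EmitChar(Buf, 'x');
  // 250 chars need the initial allocation plus two growth steps
  WriteLn('used=', Buf.Used, ' capacity=', Buf.Capacity, ' reallocs=', Buf.Reallocs);
  FreeBuffer(Buf);
end.
```

One thing to keep in mind with this approach: ReallocMem may move the block, so any pointer into the buffer that was saved before a growth step has to be stored as an offset or recomputed afterwards.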
https://www.regular-expressions.info/shorthand.html
While support for \d, \s, and \w is quite universal, there are some regex flavors that support additional shorthand character classes. Perl 5.10 introduced \h and \v. \h matches horizontal whitespace, which includes the tab and all characters in the "space separator" Unicode category. It is the same as [\t\p{Zs}]. \v matches "vertical whitespace", which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].
PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, Java as of version 8, and the JGsoft engine as of version 2. Boost supports \h starting with version 1.42. No version of Boost supports \v as a shorthand.
In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they were free to give \v a different meaning. Java 4 to 7 and JGsoft V1 did use \v to match only the vertical tab. Java 8 and JGsoft V2 changed the meaning of this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the above paragraph uses \cK to represent the vertical tab.
Initial ticket
#2
This was changed in Free Pascal: it keeps the old behaviour by default, and you can
switch to the new one with the property UseOsLineEndOnReplace
https://github.com/graemeg/freepascal/commit/4f00b7d7fcc92c84bcfa1bd28f6a835af6c510c2
Todo
If a meta char that is not allowed inside [] appears there, it is not handled and no error is shown.
E.g. try the regex [\1] or [\Z]: this gives the char '1' in [] and the char 'Z' in [].
Suggestion: show an error here.
if regparse^ = EscChar then
begin
  Inc(regparse);
  if regparse >= fRegexEnd then
  begin
    Error(reeParseAtomTrailingBackSlash);
    Exit;
  end;
  if _IsMetaChar(regparse^) then
  begin
    AddrOfString := nil;
    CanBeRange := False;
    EmitC(OpKind_MetaClass);
    EmitC(regparse^);
  end
  else
  begin
    EmitSimpleRangeC(UnQuoteChar(regparse));
    //!! error
  end;
{$IFDEF OverMeth}
function TRegExpr.Replace(const AInputStr: RegExprString;
AReplaceFunc: TRegExprReplaceFunction): RegExprString;
begin
{$IFDEF FPC}Result := {$ENDIF}
ReplaceEx(AInputStr, AReplaceFunc);
end; { of function TRegExpr.Replace
-------------------------------------------------------------- }
{$ENDIF}
Why is Result set only for FPC?
Using the regex "Test:\s*([\s\S]*?)\s*;" (without quotes, obviously) with an input of "Test: hello ;" correctly returns "hello" in other regex tools (e.g. http://www.regexr.com/) but returns no results using TRegExpr.
Using "Test:\s*(.*?)\s*;" works for this case in TRegExpr but obviously wouldn't do the same job if you were using a multi-line input string.
vs
Unless I'm mistaken, the below should return "hell\nlo":
{$IFDEF D3} {$DEFINE UseAsserts} {$ENDIF}
{$IFDEF FPC} {$DEFINE UseAsserts} {$ENDIF}
// Define 'use subroutine parameters default values' option (do not edit this definition).
{$IFDEF D4} {$DEFINE DefParam} {$ENDIF}
{$IFDEF FPC} {$DEFINE DefParam} {$ENDIF}
// Define 'OverMeth' options, to use method overloading (do not edit this definitions).
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}
This is not good. Maybe drop D5 and older? Or D4 and older? The code is not nice with all these ifdefs.
I will add a TryOnce exec: an exec which tests only at Offset (not in a loop). It's needed for a lexer/parser, which must test only one offset; the outer code must then change the offset (sometimes by 1, sometimes by n).
To add ATryOnce, I will change the code near if reganchored <> #0
in MatchPrim.
Is the naming OK?
ExecPos(AOffset: integer; ATryOnce: boolean)
DELETED.
it is okay.
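The lexer use case described above can be illustrated without the regex engine. In this sketch MatchDigitsAt is a hypothetical stand-in for a "try once" match (it is not a TRegExpr method): it reports a match only if one starts exactly at the given offset, and the outer loop decides how far to advance.

```pascal
program TryOnceSketch;
{$mode objfpc}{$H+}

// Hypothetical stand-in for a "try once" match: returns the length of a run
// of digits that starts exactly at Offset (1-based), or 0 if there is none.
// It never scans forward to look for a later match.
function MatchDigitsAt(const S: string; Offset: Integer): Integer;
begin
  Result := 0;
  while (Offset + Result <= Length(S)) and
        (S[Offset + Result] in ['0'..'9']) do
    Inc(Result);
end;

var
  Input: string;
  Offset, Len: Integer;
begin
  Input := 'ab12c345';
  Offset := 1;
  while Offset <= Length(Input) do
  begin
    Len := MatchDigitsAt(Input, Offset);
    if Len > 0 then
    begin
      WriteLn('number at ', Offset, ': ', Copy(Input, Offset, Len));
      Inc(Offset, Len); // token found: advance by n
    end
    else
      Inc(Offset);      // no token at this offset: advance by 1
  end;
end.
```

With a scanning exec, the call at offset 1 would already report the number found at offset 3, and the lexer could not tell that 'ab' was skipped. That is why the outer code needs a variant that tests a single offset.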
Failed url
http://regexpr.masterandrey.com/en/latest/
Actual
https://regex.sorokin.engineer/
I can't compile REStudio due to a missing directory.
The compiler complains about a missing unit 'tynList', which is expected in the directory 'Persistence'.
Unfortunately this directory does not exist.
Can you please provide the directory (with tynlist, ansoStrings & ansoRTTIHook)?
Kind regards
Andreas
^(?!([0-9])\1{9})[0-9]{10}$
1111111111 - not ok
1111111112 - ok
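What the pattern above is meant to accept can be spelled out without lookahead. The sketch below is a hand-written check, not engine code; IsValidNumber is a hypothetical name, and it treats the pattern as "exactly ten digits, but not one digit repeated ten times".

```pascal
program TenDigitsSketch;
{$mode objfpc}{$H+}

// Hand-written reading of ^(?!([0-9])\1{9})[0-9]{10}$ :
// exactly ten digits, but not the same digit repeated ten times.
function IsValidNumber(const S: string): Boolean;
var
  i: Integer;
  AllSame: Boolean;
begin
  Result := False;
  if Length(S) <> 10 then
    Exit;
  AllSame := True;
  for i := 1 to 10 do
  begin
    if not (S[i] in ['0'..'9']) then
      Exit;                 // a non-digit: reject
    if S[i] <> S[1] then
      AllSame := False;     // at least two different digits seen
  end;
  Result := not AllSame;
end;

begin
  WriteLn(IsValidNumber('1111111111')); // FALSE (not ok)
  WriteLn(IsValidNumber('1111111112')); // TRUE  (ok)
end.
```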
I suggest adding these changes to the RegExpr unit:
{$IFDEF VER170} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER180} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER200} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER210} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER220} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER230} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER240} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER250} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER260} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER270} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF VER280} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
and maybe
{$IFDEF D7} {$DEFINE UseAsserts} {$ENDIF}
Extra:
I would also suggest increasing this to 63 instead of 15 (you reach 15 really fast; in practice 63 holds even when used intensively over 10+ years)
NSUBEXP = 63;
With the new opcodes (OpKind_Char+len, chars), (OpKind_Meta, 'w'), (OpKind_Range, a, b) it's possible to make a SmallSet optimization (a set of char with number of elements < 32).
Plan:
Collect the current opcodes (OpKind_...) until the end of the range (OpKind_End).
Then analyze the collected data, from the beginning of the opcode until the end of the range.
If the collected data fits into a 32-char range (e.g. 'a'...'z' plus symbols), erase the opcodes (decrease regcode) and replace them with a SmallSet opcode.
This will need hard testing: new tests with complex ranges such as [a/,k-xbe-fz] etc.
Ranges with metachars \w \d cannot be changed.
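A rough sketch of the data structure such an optimization could use. It rests on assumptions: TSmallSet, TryBuildSmallSet and InSmallSet are hypothetical names, the class is taken as an already expanded list of chars, and nothing here reflects the actual opcode layout.

```pascal
program SmallSetSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical representation: 32 membership bits relative to a base char.
  TSmallSet = record
    Base: Char;         // lowest char in the class
    Bits: set of 0..31; // element i present => Chr(Ord(Base) + i) is in the class
  end;

// Try to pack the chars listed in Chars (ranges already expanded) into a
// TSmallSet. Returns False if they span more than 32 consecutive codes.
function TryBuildSmallSet(const Chars: string; out S: TSmallSet): Boolean;
var
  i: Integer;
  MinCh, MaxCh: Char;
begin
  Result := False;
  S.Base := #0;
  S.Bits := [];
  if Chars = '' then
    Exit;
  MinCh := Chars[1];
  MaxCh := Chars[1];
  for i := 2 to Length(Chars) do
  begin
    if Chars[i] < MinCh then MinCh := Chars[i];
    if Chars[i] > MaxCh then MaxCh := Chars[i];
  end;
  if Ord(MaxCh) - Ord(MinCh) > 31 then
    Exit; // does not fit: keep the ordinary opcodes
  S.Base := MinCh;
  for i := 1 to Length(Chars) do
    Include(S.Bits, Ord(Chars[i]) - Ord(MinCh));
  Result := True;
end;

// Membership test: one subtraction, a bounds check and one bit test.
function InSmallSet(const S: TSmallSet; C: Char): Boolean;
var
  D: Integer;
begin
  D := Ord(C) - Ord(S.Base);
  Result := (D >= 0) and (D <= 31) and (D in S.Bits);
end;

var
  S: TSmallSet;
begin
  if TryBuildSmallSet('abcxyz_', S) then
    WriteLn(InSmallSet(S, 'x'), ' ', InSmallSet(S, 'd')) // TRUE FALSE
  else
    WriteLn('does not fit');
  WriteLn(TryBuildSmallSet('aZ0', S)); // FALSE: '0'..'a' spans more than 32 codes
end.
```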
After patch #90 I will also make
a-x
be emitted as two codes, rather than as the whole string from start to end.
Is it possible to add an option to use non-capturing groups?
either with ?: or /n
I suggest to rename files in test/ dir:
do you agree? @andgineer
https://regex.sorokin.engineer/en/latest/regular_expressions.html#predefined-character-classes
\h and \v are missing here.
The NUL character is not matched properly.
program testemail;
uses
  regexpr;
var
  RegexObj: TRegExpr;
begin
  RegexObj := TRegExpr.Create;
  RegexObj.Expression := '^(\d+):CONTENT_LENGTH\x00(\d+)\x00';
  if RegexObj.Exec('1065:CONTENT_LENGTH' + #0 + '185364' + #0 + 'SCGI'+ #0 + '1' + #0 + 'CONTENT_') then
    WriteLn('matched!');
  RegexObj.Free;
end.
ParseAtom has this code for \d and \D:
'd': begin // r.e.extension - any digit ('0' .. '9')
  ret := EmitNode (ANYDIGIT);
  flagp := flagp or HASWIDTH or SIMPLE;
end;
'D': begin // r.e.extension - not digit ('0' .. '9')
  ret := EmitNode (NOTDIGIT);
  flagp := flagp or HASWIDTH or SIMPLE;
end;
Here the ANYDIGIT opcode (or its inverse) is emitted. OK.
For \w and \W it is done differently: here it is either an opcode or an Emit of the WordChars string.
'w': begin // r.e.extension - any english char / digit / '_'
  {$IFDEF UseSetOfChar}
  ret := EmitRange (ANYOF);
  EmitRangeStr (WordChars);
  EmitRangeC (#0);
  {$ELSE}
  ret := EmitNode (ANYLETTER);
  {$ENDIF}
  flagp := flagp or HASWIDTH or SIMPLE;
end;
Why not always emit the opcode here? That seems better: then UseUnicodeWordDetection would work for this case too (right now it apparently works in some places and not in others). @andgineer
Here is where UseUnicodeWordDetection is used:
function TRegExpr.IsWordChar(AChar: REChar): Boolean;
begin
  Result := Pos(AChar, fWordChars)>0;
  {$IFDEF UnicodeWordDetection}
  If Not Result and UseUnicodeWordDetection then
    Result:=IsUnicodeWordChar(aChar);
  {$ENDIF}
end;
subj. @andgineer
and "anso.da.ru"- fix it.
uses RegExpr;
begin
  WriteLn( ReplaceRegExpr('(\w*)','name.ext','$1.new', True) );
  ReadLn;
end.
Returns: name.new.new.ext.new.new. Is this a bug or incorrect use?
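A likely reading of the result above: \w* is allowed to match the empty string, so besides "name" and "ext" it can also match the empty positions right after each word, and each of those gets ".new" too. A minimal sketch of the usual workaround, requiring at least one word character with \w+ (this assumes the RegExpr unit bundled with Free Pascal and the same ReplaceRegExpr call as in the report):

```pascal
program ReplaceSketch;
{$mode objfpc}{$H+}
uses
  RegExpr;
begin
  // \w* may also match the empty string after "name" and after "ext"
  WriteLn(ReplaceRegExpr('(\w*)', 'name.ext', '$1.new', True));
  // \w+ needs at least one word char, so only "name" and "ext" are replaced
  WriteLn(ReplaceRegExpr('(\w+)', 'name.ext', '$1.new', True)); // name.new.ext.new
end.
```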
In Russian: some details of the problem, from here and below.
are not supported
It significantly improves speed for large input texts: we quickly skip all positions where it is not worth running the full-blown regex engine.
There is this part:
case regparse^ of // r.e.extensions
  'd':
    EmitRangeStr('0123456789');
  'w':
    {$IFDEF UseWordChars}
    EmitRangeStr(WordChars);
    {$ELSE}
    EmitNode(OP_ANYLETTER);
    {$ENDIF}
Here I am trying a replacement, by analogy with \w and \s:
//EmitRangeStr('0123456789');
EmitNode(OP_ANYDIGIT);
But this immediately makes Test11 fail.
(
  expression: '[^\d]+';
  inputText: '234578923457823659ARTZU38';
  substitutionText: '';
  expectedResult: 'ARTZU';
  matchStart: 19
),
Apparently this replacement is wrong for handling [^\d]. That is already bad. And I suspect that the code for \w is then also bad: it will fail on [^\w]. This needs checking.
What to do?
if (PREOp(scan)^ = OP_EXACTLY) and
  (strlen(scan + REOpSz + RENextOffSz) >= PtrInt(Len)) then
begin
  longest := scan + REOpSz + RENextOffSz;
  Len := strlen(longest);
end;
I can format the source code a little, using mass replacements.
OK? @andgineer
Big thanks for your RegExpr unit in FPC. It's very useful!
User of CudaText program wants modern regex features, he listed them in this page:
Alexey-T/CudaText#2279
What do you think?
The macOS tests run SLOWLY: first the VM installs 100*k packages, then it installs the 100M Lazarus package. It has been running for 9 minutes already, while all Linux tests passed long ago.
I need a hint. I can't understand the code: I need to change StrScan and StrScanCI, but there are calls in 2 places. Which of them should I change in order to change the parsing of the [ ] char class?
The first is in regrepeat:
OP_ANYOF:
  while (Result < TheMax) and (StrScan(opnd, scan^) <> nil) do
  begin
    Inc(Result);
    Inc(scan);
  end;
OP_ANYBUT:
  while (Result < TheMax) and (StrScan(opnd, scan^) = nil) do
  begin
    Inc(Result);
    Inc(scan);
  end;
OP_ANYOFCI:
  while (Result < TheMax) and (StrScanCI(opnd, scan^) <> nil) do
  begin
    Inc(Result);
    Inc(scan);
  end;
OP_ANYBUTCI:
  while (Result < TheMax) and (StrScanCI(opnd, scan^) = nil) do
  begin
    Inc(Result);
    Inc(scan);
  end;
The second is in MatchPrim:
OP_ANYOF:
  begin
    if (reginput = fInputEnd) or
      (StrScan(scan + REOpSz + RENextOffSz, reginput^) = nil) then
      Exit;
    Inc(reginput);
  end;
OP_ANYBUT:
  begin
    if (reginput = fInputEnd) or
      (StrScan(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
      Exit;
    Inc(reginput);
  end;
OP_ANYOFCI:
  begin
    if (reginput = fInputEnd) or
      (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) = nil) then
      Exit;
    Inc(reginput);
  end;
OP_ANYBUTCI:
  begin
    if (reginput = fInputEnd) or
      (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
      Exit;
    Inc(reginput);
  end;
This is because I want to write PAIRS of characters (kind, data) into the opcode. @andgineer
For my CudaText I would like to make the following changes.
I suggest removing the WordChars and SpaceChars properties and replacing them with hardcoded checks.
Both are simple and reliable to do. As for why WordChars was introduced at all: as I understand it, people did not want to do a full UnicodeData analysis but did want to detect many letters, so WordChars was bolted on.
But this is clumsy: it is slow, and it is unrealistic to list all letters there, since Unicode has a great many of them. Many people will put only the Latin umlauts there, and what about other languages? Someone will miss Greek, someone Russian. And there are also the Asian scripts, which almost everyone will miss (Japanese, Chinese, Korean, Indian, etc.).
Hardcoded checks will work fast, faster than checking against 20-40 letters; they are lookups in UnicodeData.
For SpaceChars it is fast too, also a lookup in UnicodeData.
The UnicodeData checks will be under ifdef Unicode.
Do you approve the patch?
@andgineer
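For the non-Unicode case, the hardcoded checks proposed above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the exact set of whitespace chars is my choice for illustration, and the UnicodeData lookups for the Unicode build are not shown.

```pascal
program WordCharSketch;
{$mode objfpc}{$H+}

// ASCII-only hardcoded check for \w: letters, digits and underscore.
function IsWordCharAscii(C: Char): Boolean;
begin
  Result := C in ['a'..'z', 'A'..'Z', '0'..'9', '_'];
end;

// ASCII-only hardcoded check for \s. The chosen chars (space, tab, LF, VT,
// FF, CR) are an assumption of this sketch.
function IsSpaceCharAscii(C: Char): Boolean;
begin
  Result := C in [' ', #9, #10, #11, #12, #13];
end;

begin
  WriteLn(IsWordCharAscii('q'), ' ', IsWordCharAscii('-'));   // TRUE FALSE
  WriteLn(IsSpaceCharAscii(#9), ' ', IsSpaceCharAscii('x'));  // TRUE FALSE
end.
```

A set membership test like this compiles to a constant-time check, whereas Pos(AChar, fWordChars) scans the whole string for every input char.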
bug. tofix.
function TRegExpr.GetModifier(AIndex: integer): boolean;
var
  Mask: integer;
begin
  Result := False;
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  Result := (fModifiers and Mask) <> 0;
end; { of function TRegExpr.GetModifier
  -------------------------------------------------------------- }
procedure TRegExpr.SetModifier(AIndex: integer; ASet: boolean);
var
  Mask: integer;
begin
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  if ASet then
    fModifiers := fModifiers or Mask
  else
    fModifiers := fModifiers and not Mask;
end; { of procedure TRegExpr.SetModifier
  -------------------------------------------------------------- }
@andgineer I suggest making N boolean properties instead of N bits in an integer.
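The difference between the two styles can be sketched as follows. The names are hypothetical: TModifiers, SketchMaskI and SketchMaskS exist only in this sketch, and the bit values are made up because the real MaskMod* constants are not shown in the thread.

```pascal
program ModifierFlagsSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical replacement for the bit mask: one Boolean per modifier.
  TModifiers = record
    I, R, S, G, M, X: Boolean;
  end;

const
  // Bit values invented for this sketch only.
  SketchMaskI = 1;
  SketchMaskS = 4;

var
  Bits: Integer;
  Mods: TModifiers;
begin
  // Bit-mask style: set /i and /s, then test /s through a mask
  Bits := 0;
  Bits := Bits or SketchMaskI or SketchMaskS;
  WriteLn((Bits and SketchMaskS) <> 0); // TRUE

  // Boolean-field style: same state, no masks and no index-to-mask case
  FillChar(Mods, SizeOf(Mods), 0);      // all modifiers off
  Mods.I := True;
  Mods.S := True;
  WriteLn(Mods.S);                      // TRUE
end.
```

With plain Boolean fields the index-to-mask case statement in GetModifier and SetModifier is no longer needed, at the cost of a few extra bytes per object.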
I suggest this small optimization: remove regdummy (there is a check if p = @regdummy
to detect that this is the first pass, which computes the program size) and replace it with a flag DummyPass: boolean. It is slightly better. @andgineer OK?
v. 0.947 2001.10.03
-=- (+) VersionMajor/Minor class method of TRegExpr ;)
After #41 is applied, I suggest removing these two properties. What is the point of them? Only to show something in the REStudio window? For a programmer the code is what matters, and the version is there in history.txt. Even REStudio can show the version without this property.
subj.
For this, a define must be set in the test. fpc has the parameter -dUnicode
(the define name is the word after -d).
Please add it.
After #74 you will see a test with the function TestUnicode1, which is defined under ifdef Unicode.
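How such a define gates test code can be sketched like this. The body of TestUnicode1 here is a placeholder invented for the sketch (the real test is in the repository), and the file name in the comment is hypothetical.

```pascal
program DefineSketch;
{$mode objfpc}{$H+}

{$IFDEF Unicode}
// Compiled only when the symbol is defined, e.g.: fpc -dUnicode definesketch.pas
procedure TestUnicode1;
begin
  WriteLn('Unicode test compiled in'); // placeholder body for this sketch
end;
{$ENDIF}

begin
  {$IFDEF Unicode}
  TestUnicode1;
  {$ELSE}
  WriteLn('Unicode test skipped');
  {$ENDIF}
end.
```

Compiled without -dUnicode in this mode, the program takes the ELSE branch, which is why the test run has to pass the define explicitly.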
There seems to be an issue with ifdef statements for D7:
[Fataler Fehler] RegExpr.pas(735): Datei nicht gefunden: 'System.Character.dcu'
[Fatal Error] RegExpr.pas(735): File not found: 'System.Character.dcu'
I had to comment out the following lines:
// uses
// System.Character; // System.Character exists since Delphi 2009
D7 Build 4.453, TRegExpr latest
https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions
This is classic definition, this feature is needed, please.
In #1, users want another feature: lookahead and lookbehind assertions.
You are using a version of FPC that is too old:
Free Pascal Compiler version 2.6.2-8 [2014/01/22] for x86_64
Copyright (c) 1993-2012 by Florian Klaempfl and others
Target OS: Linux for x86-64
Compiling testregexpr.pp
Compiling tcregexp.pas
tcregexp.pas(20,3) Fatal: Can't find unit fpwidestring used by tcregexp
ERROR: failed compiling of project /home/travis/build/andgineer/TRegExpr/test/testregexpr.lpi
I can make a patch to allow subj; for this I want to add a method InBuffer which checks whether an offset is inside the buffer. Okay?