ashinn / irregex
Portable Efficient IrRegular Expressions for Scheme
Extracting a named submatch that doesn't exist gives an error, but
extracting a numbered submatch that doesn't exist gives #f. Either the
former should return #f or the latter should raise a similar error.
#;1> (load "irregex.scm")
; loading irregex.scm ...
#;2> (define r (irregex '(seq "foo")))
#;3> (define m (irregex-match r "foo"))
#;4> (irregex-match-substring m 'x)
Error: unknown match name: x
....
#;5> (irregex-match-substring m 5)
#f
Original issue reported on code.google.com by [email protected]
on 7 Apr 2010 at 6:57
Currently, posix-string sub-regexes are only handled in sre->procedure, which means they are only handled by the backtracking engine, even if the strings are strictly regular expressions.
We could simply call string->sre in sre->nfa and recur on the result, but if the compilation is then aborted and we compile to a backtracking matcher, we'd have to do the same again. Perhaps it would be better to "normalize" the expression beforehand so that we don't do the same thing twice? We kind of already do this in string->irregex and irregex, but only on the "main" input.
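One way to picture the "normalize beforehand" idea is a single pre-pass over the SRE. This is a minimal sketch, not irregex's actual code: expand-posix-strings is a hypothetical name, and the string parser is passed in as an argument (with irregex loaded you would presumably pass string->sre) so the sketch stays independent of irregex internals.

```scheme
;; Hypothetical pre-pass, not part of irregex: expand every
;; (posix-string "<regex>") form once, before a backend is chosen.
;; PARSE is the string parser to use, e.g. string->sre (assumption).
(define (expand-posix-strings sre parse)
  (cond
   ;; (posix-string "...") => parsed SRE, itself expanded recursively
   ((and (pair? sre)
         (eq? (car sre) 'posix-string)
         (pair? (cdr sre))
         (string? (cadr sre)))
    (expand-posix-strings (parse (cadr sre)) parse))
   ;; any other proper list: walk every element
   ((list? sre)
    (map (lambda (x) (expand-posix-strings x parse)) sre))
   ;; atoms and dotted pairs pass through unchanged
   (else sre)))

;; Usage with a stand-in parser that just tags the string:
(expand-posix-strings '(: "a" (posix-string "b"))
                      (lambda (s) (list 'parsed s)))
;; => (: "a" (parsed "b"))
```

Both sre->nfa and sre->procedure would then see an expression with no posix-string forms left, so the string would be parsed only once even if NFA compilation is abandoned.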
(string=
(irregex-replace/all '(: bol (* " ")) "## my bloody valentine" "")
(irregex-replace '(: bol (* " ")) "## my bloody valentine" ""))
I generated the R6RS version with:
make irregex-r6rs.scm
The resulting file includes a call to an undefined function called unicode-range->utf8-pattern. The calling function seems to be unused and so I commented it out.
Also, I see test suites, but I'm not sure how to test this generated R6RS version. Is there a predefined way?
Thanks!
The docs are outdated.
The attached patch is simple, but it's a simple thing anyway :)
Original issue reported on code.google.com by [email protected]
on 7 Apr 2010 at 6:52
I have got the latest revision 0.9.2 directly from the Synthcode site; the home
page of irregex still references revision 0.9.0 and the last tarball available
from Google Code is 0.8.3.
I want to track development of this package (I am including it in the
distribution of Vicare Scheme); being that Google Code is going West, what
about pushing the code to another site? (G*cough*hub*cough*)
Original issue reported on code.google.com by [email protected]
on 6 Sep 2013 at 9:22
Please consider cutting the 0.9.11 release. There are some bug fixes accumulated, so it would be nice.
Thank you
In the following example that uses SRE syntax, a named sub-pattern, and a Kleene star (I think the or might be necessary too), the string returned by irregex-match-substring looks like it goes from the beginning of the first match to the end of the last match (including all chars in between).
pajaro2:/tmp/irregex (git)-[master]- clements> scheme
Chez Scheme Version 9.5.4
Copyright 1984-2020 Cisco Systems, Inc.
> (import (irregex))
> (define subpat2 '(=> subby (: "@" alphanum)))
> (define pat2 `(* (or alphanum ,subpat2)))
> (define match2 (irregex-match pat2 "oeh@2tu@2n342"))
> (irregex-match-names match2)
((subby . 1))
> (irregex-match-num-submatches match2)
1
> (irregex-match-substring match2 1)
"@2tu@2"
Using an empty concatenation as part of an SRE pattern can cause an out-of-range vector reference.
(irregex-search '(or (:) "was") "abc")
pajaro2:/tmp/irregex (git)-[master]- clements> scheme
Chez Scheme Version 9.5.4
Copyright 1984-2020 Cisco Systems, Inc.
> (import (irregex))
> (irregex-search '(or (:) "was") "abc")
Exception in vector-ref: 3 is not a valid index for #(0 0)
Type (debug) to enter the debugger.
>
I'm not so sure about this one anymore, but here goes:
It would be useful to have a procedure to check whether a given submatch
exists at all. For example:
(irregex-submatch-exists? match <index-or-name>) => boolean
The naming could be better and it might be good to have a similar procedure
for irregex objects.
On the other hand, one could also extract the names and check whether the
name occurs in the list (but an existence check could possibly be
implemented more efficiently)
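The name-list approach mentioned above can be sketched in user code. This is only an illustration under assumptions: submatch-exists? is a hypothetical helper, not an irregex procedure, and the import line assumes CHICKEN 5, where irregex is available as (chicken irregex). It relies on irregex-match-names returning an alist of (name . index) pairs and on irregex-match-num-submatches returning the highest valid submatch index.

```scheme
(import (chicken irregex)) ; assumption: CHICKEN 5

;; Hypothetical helper: #t if INDEX-OR-NAME refers to a submatch
;; that is defined for the match object M, #f otherwise.
(define (submatch-exists? m index-or-name)
  (if (integer? index-or-name)
      ;; numbered: 0 is the whole match, 1..n are the submatches
      (<= 0 index-or-name (irregex-match-num-submatches m))
      ;; named: look the symbol up in the alist of names
      (and (assq index-or-name (irregex-match-names m)) #t)))

(define m (irregex-match '(seq "foo") "foo"))
(submatch-exists? m 0)   ; whole match
(submatch-exists? m 5)   ; no such numbered submatch
(submatch-exists? m 'x)  ; no such named submatch
```

Note that this only answers whether the submatch is defined by the pattern, not whether it participated in this particular match.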
Original issue reported on code.google.com by [email protected]
on 7 Apr 2010 at 7:07
(import (chicken irregex))
(eq?
(not (not (irregex-search '(: bol eol) "")))
(equal? "foo" (irregex-replace/all '(: bol eol) "" "foo")))
This is #f on Chicken 5.3.0 but shouldn't it be #t?
The variable num-states in find-reorder-commands-internal is unused. As the initializing call is free from side-effects, the variable could be eliminated.
Hello.
Please add functions that retrieve a regexp's source string and its options:
> (irregex-src (irregex "apple")) => "apple"
> (irregex-opt (irregex "apple" 'i 'm)) => (i m)
Thanks.
Original issue reported on code.google.com by [email protected]
on 5 Sep 2010 at 4:42
This code works as expected:
(irregex-extract '(: bow "foo" eow) "foo foo bar foo") ; => '("foo" "foo" "foo")
This code doesn't:
(irregex-extract '(or (: bow "foo" eow) any) "foo foo bar foo") ; => '("foo" " " "f" "o" "o" " " "b" "a" "r" " " "f" "o" "o")
;; Expected: '("foo" " " "foo" " " "b" "a" "r" " " "foo")
If I can help out with fixing this (if it is a bug, that is) please let me know.
This is either an irregex pro tip or a bug:
(: bol "blablabla" newline) will create overlaps if there are sequential matches. Capturing the newline there borks up the subsequent bol. It only works if the matches are non-sequential.
(: newline "blablabla" eol) doesn't have that problem at all.
(Now, the reason to capture newline instead of just using bol and eol is so that we can replace with "" to remove the line entirely.)
Note that the (: newline "blablabla" eol) can't match the first line of the text. So this pro tip only works when you know you don't need to do it on the first line of text in a multiline string.
Not sure if there is an even better pro tip / best practice for this particular use case.
What steps will reproduce the problem?
See http://synthcode.com/scheme/irregex/#SECTION_2
What is the expected output? What do you see instead?
I see
"If you are using Chicken Scheme, you can just run chicken-setup, or wait for it to show up in the eggs repository."
but I expected something like
"irregex is built in CHICKEN as a core unit, so no need to install it. To use it, you just need (use irregex)."
Original issue reported on code.google.com by [email protected]
on 21 May 2013 at 8:56
This does not seem to be correct:
3> (sre->string '(* (/ "AZ09")))
"[]*"
I would have expected something like "[A-Z0-9]*".
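For reference, the expected rendering of a range string such as "AZ09" is mechanical: each pair of characters becomes one lo-hi range. A minimal sketch in plain Scheme, independent of irregex internals; range-string->char-class is a hypothetical name, and the sketch does not escape characters such as ] or ^ that are special inside a bracket expression.

```scheme
;; Hypothetical helper: turn the argument string of a (/ "...") SRE
;; into a POSIX bracket expression. Characters are consumed in pairs;
;; a trailing unpaired character is ignored.
(define (range-string->char-class s)
  (let loop ((i 0) (acc '()))
    (if (>= (+ i 1) (string-length s))
        (string-append "[" (apply string-append (reverse acc)) "]")
        (loop (+ i 2)
              (cons (string (string-ref s i) #\- (string-ref s (+ i 1)))
                    acc)))))

(range-string->char-class "AZ09") ; => "[A-Z0-9]"
```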
Reported on chicken-users: irregex-replace/all only replaces the first match when lookbehinds are involved.
Example: the following regexp should replace any letter a preceded by x, y, or z.
Problematic code:
(irregex-replace/all "(?<=[xyz])a" "xa ya za" "-")
; or
(irregex-replace/all '(: (look-behind (or "x" "y" "z")) "a") "xa ya za" "-")
should return "x- y- z-"
BUT actually returns "x- ya za"
I have a growing base of 50 complex irregex strings. I precompile all of them, store them in a file, and use the result. This works very nicely, but one of the compiled results contains a procedure, so precompiling cannot work for it.
Which expressions lead to procedures in the irregex result?
Can this behavior be influenced by options?
(The test is run on version 0.9.10 under bigloo; if it helps, I can repeat it under Chicken or Gauche or ...)
Thanks for this excellent tool for the Scheme community!
(irregex-replace/all "^" "42" "A:")
this returns "A:442" when I expected "A:42"
Hi,
I've compiled irregex 0.9.4 with Chez Scheme 9.4. I ran the test suite and all tests pass except the ones in test-irregex-utf8.scm.
I'm just learning SRE, but, if I understand correctly, the following example should match:
(irregex-search (irregex `(: "fede" (~ "Λ") "!") 'utf8) "fedeλ!")
=> #f
Is this wrong? Chez Scheme seems to handle utf8 pretty well.
Thank you for sharing your code!
(irregex-replace/all
(irregex '(: bos (* whitespace)))
"any gosh darn string" "")
⇒ "aany gosh darn string"
With a + instead of a Kleene star it works.
When creating irregex-based parsers, it is *extremely* useful to be able to
use named submatches inside alternatives. Real-life example from a URI parser:
`(or (seq ,scheme ":" ,hostname (submatch-named path ,slashed-segments))
(submatch-named path ,path-noscheme) ... etc)
It would rock if irregex would be smart enough to let the name "path" refer
to the second alternative when the first fails.
You said this would be hard to make consistent in case more than one of
these could match. However, I think it would be justified to simply
document the semantics. The user could then decide for himself whether he
wants to use it or not.
The behaviour as it is right now is simply to return a "random" submatch (I
think it returns the last one defined?), which practically makes it only
usable if you ensure all submatches are uniquely named.
I propose to change the definition to return the first (or last,
whichever is more useful/efficient/consistent) submatch that has a non-#f
value (and, of course, #f if all are #f). That would be enough for these
situations. It would work the same as it already does for the other cases,
but it would improve usability for matches with alternatives.
I'm not sure how to handle numbered submatches. I suppose that would just
stay the same; the first alternative has number 1, the second number 2, in
all cases. In other words, only the semantics of named submatches would change.
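The proposed lookup rule can be approximated in user code. This is a minimal sketch under assumptions, not the library's behaviour: first-named-substring is a hypothetical helper, the import line assumes CHICKEN 5's (chicken irregex), and the sketch assumes irregex-match-names returns one (name . index) pair per occurrence of a duplicated name. "First" here means first in the order of that alist, which may not be the order in which the names appear in the pattern.

```scheme
(import (chicken irregex)) ; assumption: CHICKEN 5

;; Hypothetical helper: return the substring of the first submatch
;; called NAME that actually participated in the match, or #f if
;; none did (or if the name is unknown).
(define (first-named-substring m name)
  (let loop ((names (irregex-match-names m)))
    (cond ((null? names) #f)
          ;; a test-only cond clause returns the test's value when true
          ((and (eq? (caar names) name)
                (irregex-match-substring m (cdar names))))
          (else (loop (cdr names))))))

;; Usage: only the second alternative participates here.
(define m (irregex-match '(or (=> a "x") (=> b "y")) "y"))
(first-named-substring m 'b)
(first-named-substring m 'a)
```

Numbered submatches are untouched by this helper, which matches the suggestion that only the semantics of named submatches would change.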
Original issue reported on code.google.com by [email protected]
on 7 Apr 2010 at 7:18