ashinn / irregex

Portable Efficient IrRegular Expressions for Scheme

Makefile 1.02% Perl 0.26% Scheme 98.18% CSS 0.27% Shell 0.27%

irregex's People

codemac eval-apply chaw aa10000 lemonboy guenchi cvnb c-ler snan zhaopufeng jbclements vkochan dtpeters russpalms ak-1

irregex's Issues

Extracting nonexistent submatches should be consistent

Extracting a named submatch that doesn't exist gives an error, but
extracting a numbered submatch that doesn't exist gives #f. Either the
former should return #f or the latter should raise a similar error.

#;1> (load "irregex.scm")
; loading irregex.scm ...
#;2> (define r (irregex '(seq "foo")))
#;3> (define m (irregex-match r "foo"))
#;4> (irregex-match-substring m 'x)

Error: unknown match name: x
....
#;5> (irregex-match-substring m 5)
#f

Original issue reported on code.google.com by [email protected] on 7 Apr 2010 at 6:57

Convert embedded posix strings in a preprocessing step

Currently, posix-string sub-regexes are only handled in sre->procedure, which means they are only handled by the backtracking engine, even if the strings are strictly regular expressions.

We could simply call string->sre in sre->nfa and recur on the result, but if the compilation is then aborted and we fall back to compiling a backtracking matcher, we'd have to do the same conversion again. Perhaps it would be better to "normalize" the expression beforehand so that we don't do the same thing twice? We already do something like this in string->irregex and irregex, but only on the "main" input.

Weird bol with /all

(string=
 (irregex-replace/all '(: bol (* " ")) "## my bloody valentine" "")
 (irregex-replace '(: bol (* " ")) "## my bloody valentine" ""))

irregex-r6rs refers to undefined function unicode-range->utf8-pattern

I generated the R6RS version with:

make irregex-r6rs.scm

The resulting file includes a call to an undefined function called unicode-range->utf8-pattern. The calling function seems to be unused and so I commented it out.

Also, I see test suites, but I'm not sure how to test this generated R6RS version. Is there a predefined way?

Thanks!

Document irregex-[match-]{names,num-submatches}

The docs are outdated.

The attached patch is simple, but it's a simple thing anyway :)

Original issue reported on code.google.com by [email protected] on 7 Apr 2010 at 6:52

Attachments:

matches-doc.patch

viable way to track development

I have got the latest revision 0.9.2 directly from the Synthcode site; the home 
page of irregex still references revision 0.9.0 and the last tarball available 
from Google Code is 0.8.3.

I want to track development of this package (I am including it in the 
distribution of Vicare Scheme); being that Google Code is going West, what 
about pushing the code to another site?  (G*cough*hub*cough*)

Original issue reported on code.google.com by [email protected] on 6 Sep 2013 at 9:22

Consider doing 0.9.11 release

Please consider cutting the 0.9.11 release. There are some bug fixes accumulated, so it would be nice.

Thank you

irregex-match-substring using subpattern under kleene star can return extra chars

In the following example that uses SRE syntax, a named sub-pattern, and a kleene star (I think the or might be necessary too), the string returned by irregex-match-substring looks like it goes from the beginning of the first match to the end of the last match (including all chars in between).

pajaro2:/tmp/irregex (git)-[master]- clements> scheme
Chez Scheme Version 9.5.4
Copyright 1984-2020 Cisco Systems, Inc.

> (import (irregex))
> (define subpat2 '(=> subby (: "@" alphanum)))
> (define pat2 `(* (or alphanum ,subpat2)))
> (define match2 (irregex-match pat2 "oeh@2tu@2n342"))
> (irregex-match-names match2)
((subby . 1))
> (irregex-match-num-submatches match2)
1
> (irregex-match-substring match2 1)
"@2tu@2"

Empty concat causes vector error

Using an empty concatenation as part of an SRE pattern can cause an out-of-range vector reference.

(irregex-search '(or (:) "was") "abc")

pajaro2:/tmp/irregex (git)-[master]- clements> scheme
Chez Scheme Version 9.5.4
Copyright 1984-2020 Cisco Systems, Inc.

> (import (irregex))
> (irregex-search '(or (:) "was") "abc")
Exception in vector-ref: 3 is not a valid index for #(0 0)
Type (debug) to enter the debugger.
>

Predicate procedure for existence of submatches

I'm not so sure about this one anymore, but here goes:

It would be useful to have a procedure to check whether a given submatch
exists at all. For example:

  (irregex-submatch-exists? match <index-or-name>) => boolean

The naming could be better and it might be good to have a similar procedure
for irregex objects.

On the other hand, one could also extract the names and check whether the
name occurs in the list (but an existence check could possibly be
implemented more efficiently)
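
The names-list approach mentioned above can be sketched as follows. This is only an illustration: submatch-exists? is a hypothetical helper, not part of irregex, and the import line assumes CHICKEN 5 (other implementations would use something like (load "irregex.scm") or (import (irregex)) instead).

```scheme
;; Assumption: CHICKEN 5. Adjust the import for other implementations.
(import (chicken irregex))

;; Hypothetical helper, not part of irregex: #t if the match object has a
;; submatch with the given index or name, #f otherwise.
(define (submatch-exists? m index-or-name)
  (cond ((symbol? index-or-name)
         ;; irregex-match-names returns an alist of (name . index) pairs.
         (and (assq index-or-name (irregex-match-names m)) #t))
        ((and (integer? index-or-name) (exact? index-or-name))
         ;; Index 0 is the whole match; 1..n are the submatches.
         (<= 0 index-or-name (irregex-match-num-submatches m)))
        (else #f)))

(define m (irregex-match '(seq "foo" (=> tail (* any))) "foobar"))
(submatch-exists? m 'tail) ; tail is a named submatch of this pattern
(submatch-exists? m 'x)    ; no submatch is named x
(submatch-exists? m 5)     ; only indices 0 and 1 exist here
```

Note that this only answers whether the submatch is defined by the pattern, not whether it participated in the match.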

Original issue reported on code.google.com by [email protected] on 7 Apr 2010 at 7:07

bol and eol behaving differently on -search vs -replace/all

(import (chicken irregex))
(eq?
 (not (not (irregex-search '(: bol eol) "")))
 (equal? "foo" (irregex-replace/all '(: bol eol) "" "foo")))

This is #f on Chicken 5.3.0 but shouldn't it be #t?

Unused variable

The variable num-states in find-reorder-commands-internal is unused. As the initializing call is free from side-effects, the variable could be eliminated.

Obtain regexp source string from compiled regexp

Hello.

Please add functions that retrieve a regexp's source string and its options:

> (irregex-src (irregex "apple")) => "apple"
> (irregex-opt (irregex "apple" 'i 'm)) => (i m)

Thanks.

Original issue reported on code.google.com by [email protected] on 5 Sep 2010 at 4:42

Issues with or, eow/bow and any

This code works as expected:

(irregex-extract '(: bow "foo" eow) "foo foo bar foo") ; => '("foo" "foo" "foo")

This code doesn't:

(irregex-extract '(or (: bow "foo" eow) any) "foo foo bar foo") ; => '("foo" " " "f" "o" "o" " " "b" "a" "r" " " "f" "o" "o")
;; Expected: '("foo" " " "foo" " " "b" "a" "r" " " "foo")

If I can help out with fixing this (if it is a bug, that is) please let me know.

bol overlaps with newline

This is either an irregex pro tip or a bug:

(: bol "blablabla" newline) will create overlaps if there are sequential matches. Capturing the newline there borks up the subsequent bol. It only works if the matches are non-sequential.

(: newline "blablabla" eol) doesn't have that problem at all.

(Now, the reason to capture newline instead of just using bol and eol is so that we can replace with "" to remove the line entirely.)

Note that the (: newline "blablabla" eol) can't match the first line of the text. So this pro tip only works when you know you don't need to do it on the first line of text in a multiline string.

Not sure if there is an even better pro tip / best practice for this particular use case.

Installation instructions for Chicken

What steps will reproduce the problem?

See http://synthcode.com/scheme/irregex/#SECTION_2


What is the expected output? What do you see instead?

I see

    "If you are using Chicken Scheme, you can just run chicken-setup, or wait for it to show up in the eggs repository."

but I expected something like

   "irregex is built in CHICKEN as a core unit, so no need to install it.  To use it, you just need (use irregex)."

Original issue reported on code.google.com by [email protected] on 21 May 2013 at 8:56

Unexpected result for sre->string

This does not seem to be correct:

3> (sre->string '(* (/ "AZ09")))
"[]*"

I would have expected something like "[A-Z0-9]*".

Replacements of positive lookbehinds only replace first match

Reported on chicken-users:

irregex-replace/all only replaces the first match when positive lookbehinds are involved

example:

the following regexp should replace any letter a preceded by x, y, or z

problematic code

(irregex-replace/all "(?<=[xyz])a"     "xa ya za"  "-")
; or 
(irregex-replace/all '(: (look-behind (or "x" "y" "z")) "a")   "xa ya za"  "-")

should return "x- y- z-" BUT actually returns "x- ya za"
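
Until this is fixed, one possible workaround (a sketch, not a fix for the underlying bug) is to avoid the lookbehind: capture the preceding character as a submatch and write it back through an integer replacement, which irregex-replace/all treats as a reference to that numbered submatch. The helper name replace-a-after-xyz is made up for this example, and the import line assumes CHICKEN 5.

```scheme
;; Assumption: CHICKEN 5. Adjust the import for other implementations.
(import (chicken irregex))

;; Hypothetical helper: replace every "a" that follows x, y or z with "-".
;; Submatch 1 holds the preceding character, so the replacement list
;; (1 "-") puts that character back before the dash.
(define (replace-a-after-xyz str)
  (irregex-replace/all '(: (submatch (or "x" "y" "z")) "a") str 1 "-"))

(replace-a-after-xyz "xa ya za") ; intended result: "x- y- z-"
```

This is equivalent to the lookbehind version only because the captured character can never itself be the start of another match; patterns whose matches may overlap with the lookbehind text need a different approach.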

Some compilation results of `irregex` cannot be cached (write/read)

I have a growing base of 50 complex irregex strings. I precompile all of them, store them in a file, and use the result. This works very nicely, but 1 of the irregex results contains a procedure, so that precompiling cannot work here.

Which expressions lead to procedures in the irregex result?

Can this behavior be influenced by options?
(The test is run on version 0.9.10 under bigloo; if it helps, I can repeat it under Chicken or Gauche or ...)

Thanks for this excellent tool for the Scheme community!

Kind of unexpected result

(irregex-replace/all "^" "42" "A:")

this returns "A:442" when I expected "A:42"

problems with utf8

Hi,

I've compiled irregex 0.9.4 with Chez Scheme 9.4. I ran the test suite and all tests pass except the ones in test-irregex-utf8.scm.

I'm just learning SRE, but, if I understand correctly, the following example should match:
(irregex-search (irregex `(: "fede" (~ "Λ") "!") 'utf8) "fedeλ!")
=> #f

Is this wrong? Chez Scheme seems to handle utf8 pretty well.

Thank you for sharing your code!

Empty matches on replace/all

(irregex-replace/all
 (irregex '(: bos (* whitespace)))
 "any gosh darn string" "")

⇒ "aany gosh darn string"

With a + instead of a Kleene star it works.
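
A pattern anchored with bos can match at most once, so as a stopgap the single-match irregex-replace may be used in place of irregex-replace/all. The sketch below assumes CHICKEN 5 for the import, and strip-leading-whitespace is a hypothetical name chosen for this example.

```scheme
;; Assumption: CHICKEN 5. Adjust the import for other implementations.
(import (chicken irregex))

;; Compile once; the pattern may match the empty string at the start.
(define leading-ws (irregex '(: bos (* whitespace))))

;; Hypothetical helper: drop leading whitespace with a single replacement,
;; sidestepping the empty-match handling of irregex-replace/all.
(define (strip-leading-whitespace str)
  (irregex-replace leading-ws str ""))

(strip-leading-whitespace "any gosh darn string")
(strip-leading-whitespace "   padded string")
```

This does not help for patterns such as bol that can legitimately match more than once in a multi-line string.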

Named submatches should "stack"

When creating irregex-based parsers, it is *extremely* useful to be able to
use named submatches inside alternatives. Real-life example from a URI parser:

    `(or (seq ,scheme ":" ,hostname (submatch-named path ,slashed-segments))
         (submatch-named path ,path-noscheme) ... etc)

It would rock if irregex would be smart enough to let the name "path" refer
to the second alternative when the first fails.

You said this would be hard to make consistent in case more than one of
these could match.  However, I think it would be justified to simply
document the semantics. The user could then decide for himself whether he
wants to use it or not.

The behaviour as it is right now is simply to return a "random" submatch (I
think it returns the last one defined?), which practically makes it only
usable if you ensure all submatches are uniquely named.

I propose to change the definition to return the first (or last,
whichever is more useful/efficient/consistent) submatch that has a non-#f
value (and, of course, #f if all are #f).  That would be enough for these
situations.  It would work the same as it already does for the other cases,
but it would improve usability for matches with alternatives.

I'm not sure how to handle numbered submatches.  I suppose that would just
stay the same; the first alternative has number 1, the second number 2, in
all cases. In other words, only the semantics of named submatches would change.
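
The proposed "first non-#f submatch" semantics can be approximated in user code. The sketch below assumes that irregex-match-names lists one (name . index) pair per named submatch, duplicates included, and that extracting a numbered submatch which did not participate in the match yields #f. The helper name is hypothetical, the alist order decides which alternative counts as "first", and the import line assumes CHICKEN 5.

```scheme
;; Assumption: CHICKEN 5. Adjust the import for other implementations.
(import (chicken irregex))

;; Hypothetical helper, not part of irregex: walk every (name . index)
;; pair carrying the requested name and return the first substring that
;; is not #f, or #f if none of them matched.
(define (first-matching-named-substring m name)
  (let loop ((names (irregex-match-names m)))
    (cond ((null? names) #f)
          ;; A cond clause without a body returns the value of its test.
          ((and (eq? name (caar names))
                (irregex-match-substring m (cdar names))))
          (else (loop (cdr names))))))

;; Two alternatives share the name "path"; only the second can match here.
(define m
  (irregex-match '(or (: "a:" (=> path (+ alphabetic)))
                      (=> path (+ numeric)))
                 "123"))
(first-matching-named-substring m 'path)
```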

Original issue reported on code.google.com by [email protected] on 7 Apr 2010 at 7:18