
_amps Regex (markdownsharp, closed)

ruffin-- commented on August 16, 2024

_amps Regex

from markdownsharp.

Comments (8)

GoogleCodeExporter commented on August 16, 2024

I think there are some misunderstandings here..

> [0-9a-f-A-F] is equivalent to \w+

no, because \w matches _ and A-Za-z

> the group cannot be captured anyways due to RegexOptions.ExplicitCapture

The match still occurs, you just can't get to it by saying $1 or $2 or $3 etc

> w is not a valid html escape sequence

have you tried it? produces "w" in Firefox 3.5, IE8, Chrome3..

I don't think there's any evidence this regex is wrong, as it is part of the core
Markdown.pl library. Can you provide actual test cases where it produces incorrect
results?

Original comment by [email protected] on 31 Dec 2009 at 8:04


GoogleCodeExporter commented on August 16, 2024

I may be stupid but I do not find the line
> [0-9a-f-A-F] is equivalent to \w+
in the issue at all... Rather I am saying that [0-9a-fA-F] is INCLUDED in \w, 
therefore [0-9a-fA-F]+ is included in \w+ as well.

Assuming this is correct, the union of those sets can be computed without even 
knowing the sets, since the union of a set s with a subset of s is 
always s. This allows us the following transform:

([0-9a-fA-F]+|\w+) => (\w+)

According to Microsoft ( http://msdn.microsoft.com/en-us/library/20bw873z.aspx )
the \w character class contains "Letter, Uppercase", "Letter, Lowercase" and
"Number, Decimal Digit" - so I am quite certain that the transform is indeed
equivalent to the original regular expression.
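
The subset argument can be spot-checked mechanically. The following is a minimal sketch using Python's `re` module as a stand-in for the .NET engine (both treat `\w` as letters, digits and underscore); the sample strings are hypothetical and a handful of inputs is of course not a proof.

```python
import re

# Stand-in for the .NET patterns; (?:...) keeps both sides non-capturing.
ALTERNATION = re.compile(r"(?:[0-9a-fA-F]+|\w+);")
SIMPLIFIED = re.compile(r"\w+;")

# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = ["ff;", "ffz;", "copy;", "frac34;", "_x;", "x77;", ";", "a-b;", "#x77;", "ff"]


def same_verdict(text: str) -> bool:
    """True when both patterns accept or reject the whole string alike."""
    return bool(ALTERNATION.fullmatch(text)) == bool(SIMPLIFIED.fullmatch(text))


# Collect every sample on which the two patterns disagree.
mismatches = [s for s in SAMPLES if not same_verdict(s)]
print(mismatches)
```

An empty list means the two forms agreed on every sample, which is what the set-union argument predicts.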

Concerning the group: I never said anything about the match not occurring, I
said "the group cannot be captured anyways", which I would consider
being quite true for a non-capturing group... RegexOptions.ExplicitCapture
allows the usage of unnamed parentheses as non-capturing groups, see:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx
Since we are dealing with an unnamed group, it is non-capturing under this
option. It also does not do any "grouping" with respect to operators in
this case, since it only contains \w+, so it can simply be left out, yielding
the transform:

(\w+) => \w+
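
Python's `re` has no counterpart to `RegexOptions.ExplicitCapture`, so the sketch below writes the non-capturing group explicitly as `(?:...)`. That is assumed to be a fair analogue of what an unnamed group becomes under that option. The surrounding `&` and `;` and the sample text are hypothetical.

```python
import re

# Three spellings of the same match: capturing, non-capturing, and no group.
capturing = re.compile(r"&(\w+);")
non_capturing = re.compile(r"&(?:\w+);")  # analogue of an unnamed group under ExplicitCapture
bare = re.compile(r"&\w+;")

text = "&copy; 2009"

# The overall match is identical in all three cases.
matched = [p.match(text).group(0) for p in (capturing, non_capturing, bare)]
print(matched)

# Only the capturing form exposes the inner text as a numbered group.
print(capturing.match(text).groups())
print(non_capturing.match(text).groups())
```

Both points made in the thread hold at once: the match still occurs, and the group content is not retrievable by number.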



About the escape sequences:
After looking through the HTML spec, it seems I was indeed mistaken and
remembered it exactly the wrong way around:
Section 5.3 of the HTML 4.01 spec (
http://www.w3.org/TR/1999/REC-html401-19991224/charset.html ) allows the
following two representations in addition to named entities:
"The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 
decimal character number D.
The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the 
ISO 10646 hexadecimal character number H. Hexadecimal numbers in 
numeric character references are case-insensitive."

XHTML 1.0 ( http://www.w3.org/TR/2002/REC-xhtml1-20020801/ ) notes in section 
4.12:
"SGML and XML both permit references to characters by using hexadecimal values. 
In SGML these references could be made using either &#Xnn; or 
&#xnn;. In XML documents, you must use the lower-case version (i.e. &#xnn;)"

Therefore, in my opinion, the "correct" way to formulate the inner part of the 
regex is:

(#[xX]?)?\w+;

However, it does not really matter anyway, since this essentially means:

{all strings consisting only of word characters} \ {"#;", "#x;", "#X;"}

More interesting is the following version, which follows the specs rather
closely:

(#([0-9]+|[xX][a-fA-F0-9]+))|([a-zA-Z0-9]+)

Named character entities in the HTML 4.01 (
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/entities.html ) and XHTML 1 (
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#h-A2 ) specs consist of
English letters (the first is oftentimes uppercase) and in some cases numbers
(e.g. <!ENTITY frac34 "¾"> <!-- vulgar fraction three quarters =
fraction three quarters, U+00BE ISOnum --> )
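
The shape of the HTML 4.01 entity names can be checked against a machine-readable list. This sketch assumes Python's `html.entities.name2codepoint`, which the standard library documents as the HTML 4 entity table, is an acceptable proxy for the spec's DTD. The name pattern `[a-zA-Z][a-zA-Z0-9]*` is the assumption being tested.

```python
import re
from html.entities import name2codepoint  # HTML 4 named entities

# Assumed shape: one ASCII letter followed by ASCII letters or digits.
NAME_SHAPE = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

# Names that do NOT fit the assumed shape (expected to be none).
odd_names = sorted(n for n in name2codepoint if not NAME_SHAPE.fullmatch(n))

# Names that contain at least one digit, such as frac34.
with_digits = sorted(n for n in name2codepoint if any(c.isdigit() for c in n))

print(odd_names)
print(with_digits)
```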



"I don't think there's any evidence this regex is wrong, as it is part of the 
core Markdown.pl library."
The first part is not about it being "wrong" in any way, just about doing
equivalent transforms that do not change the meaning, while making it
easier to understand just what it does.
The second part however, is about what the regex is supposed to do: Find 
ampersands that are not part of an html escape sequence. The idea seems to 
be to escape any ampersand that is supposed to be one, while still allowing the 
use of escape sequences. The number of false positives will never 
be zero due to the simple fact that the different specs for (x)html differ on 
this point, but it CAN be reduced. If the design intention is to not 
escape all ampersands that are part of something that might be mistaken for an 
escape sequence: well, that is a design decision...
To ensure that I am not mistaken about html escape sequences again, I have 
tested &x77; in all browsers on my system (chrome, firefox, safari, ie 
and opera) and I DO get &x77; if I use that "escape sequence" - and not a "w". 
Also, with the original regex &x77; is not escaped...
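
The browser check described above can be approximated without a browser. This is a rough sketch that assumes Python's `html.unescape`, which follows HTML5 parsing rules rather than the 2009 browsers named here, decodes these particular samples the same way.

```python
import html

# Numeric references (decimal, lower/upper-case hex), the malformed "&x77;",
# and one named entity for comparison.
SAMPLES = ["&#119;", "&#x77;", "&#X77;", "&x77;", "&frac34;"]

decoded = {s: html.unescape(s) for s in SAMPLES}

for source, result in decoded.items():
    print(f"{source!r} -> {result!r}")
```

Under that assumption, the three well-formed numeric references decode to "w", while `&x77;` comes back unchanged because it lacks the `#`.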

Original comment by [email protected] on 31 Dec 2009 at 9:28


GoogleCodeExporter commented on August 16, 2024

I see your point on ([0-9a-fA-F]+|\w+) and I agree the other form is more 
accurate.
But there is a documentation reason it is written that way..

Two clarifications:

1) You know this is a *negative* lookahead, yes?

&(?!)

meaning, match '&' but ONLY if the rest isn't matched ahead of it.

2) it can also negatively match … not just w and m

I think the |\w+ was an attempt to document the … case. Though as you
pointed out I doubt any named entities have _ in them, or numbers for that
matter.
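
To make the negative-lookahead behaviour from point 1) concrete, here is a minimal sketch in Python. The lookahead body `\w+;` is a deliberately simplified, hypothetical stand-in, not the project's actual pattern, and `escape_amps` is an illustrative helper name.

```python
import re

# Match '&' only when it is NOT followed by something shaped like "name;".
# Hypothetical simplified lookahead, not the real _amps pattern.
AMP = re.compile(r"&(?!\w+;)")


def escape_amps(text: str) -> str:
    """Replace bare ampersands with &amp;, leaving entity-like text alone."""
    return AMP.sub("&amp;", text)


print(escape_amps("AT&T &copy; a & b"))
```

The lookahead consumes nothing: only the `&` itself is replaced, and only where no entity-like text follows it.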

As I've said in other requests to change the regexes, there are only two 
reasons to
do so:

a) performance improves

b) fixes a bug

change for the sake of change is bad because

c) it makes us deviate from the Perl standard reference implementation regexes

which would be OK if a) or b) were true.. but they are not in this case.

Original comment by [email protected] on 1 Jan 2010 at 12:48


GoogleCodeExporter commented on August 16, 2024

Original comment by [email protected] on 1 Jan 2010 at 12:52

Changed state: WontFix


GoogleCodeExporter commented on August 16, 2024

1) Of course.
2) The point being that it DOES negatively match &x77; - although it should
not.
Named entities do have numbers; (sadly) there is e.g. frac34...
The number of false negatives you get is humongous, but I can understand that
implementing a regex like (nbsp|cent|pound|copy|brvbar|......) would be most
annoying (and slow, too). However, there are no entities starting with a number,
nor are there any starting with a hash (except for the two types of numeric ones
of course).

Therefore

@"&(?!((#[0-9]+)|(#[xX][a-fA-F0-9]+)|([a-zA-Z][a-zA-Z0-9]*));)"

drastically reduces the number of false negatives (being false positives of the
lookahead). Maybe most importantly it catches ones like &172;, which might
happen. (&x77; still has the shape of a named entity, so this pattern leaves it
alone as well.)
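
The difference between the looser and the stricter lookahead can be compared side by side. This is a sketch under stated assumptions: `LOOSE` is built from the `(#[xX]?)?\w+;` form quoted earlier in the thread, and `STRICT` is a grouped variant of the proposal, assuming the trailing `;` is meant to apply to every alternative and that hex references take one or more digits. Python's `re` stands in for .NET, and the samples are hypothetical.

```python
import re

# Looser lookahead: any run of word characters, optionally prefixed by # or #x.
LOOSE = re.compile(r"&(?!(#[xX]?)?\w+;)")

# Stricter lookahead: decimal ref, hex ref, or letter-led alphanumeric name,
# each of which must be followed by ';' (assumed grouping).
STRICT = re.compile(r"&(?!(?:#[0-9]+|#[xX][a-fA-F0-9]+|[a-zA-Z][a-zA-Z0-9]*);)")

SAMPLES = ["&172;", "&_foo;", "&x77;", "&#x77;", "&frac34;", "AT&T"]

# Map each sample to its (loose, strict) escaped output.
results = {
    s: (LOOSE.sub("&amp;", s), STRICT.sub("&amp;", s))
    for s in SAMPLES
}

for sample, (loose_out, strict_out) in results.items():
    print(f"{sample:10} loose={loose_out:14} strict={strict_out}")
```

In this sketch the stricter form additionally escapes `&172;` and `&_foo;`. `&x77;` is left alone by both patterns, since a letter followed by alphanumerics is indistinguishable from a named entity without a whitelist of names.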

Therefore I would indeed consider this to be of kind b) - however, if being
close to the Perl implementation is more important than catching these kinds of
errors, well, that is a design decision ;)

Original comment by [email protected] on 1 Jan 2010 at 1:27


GoogleCodeExporter commented on August 16, 2024

I appreciate the contribution.

I will definitely keep this in mind -- the goal at the moment is to match
markdown.pl 1.0.2b8 very closely. If in the future we start deviating (and we
might, since John Gruber seems utterly uninterested in any public work on this
since 2004), then I agree this is a better and more accurate regex.

Still kind of a narrow and extremely minor "bug", though. I'd prefer some help
with a bigger and much more severe bug -- the HTML block matching algorithms in
1.0.2b8 :)

Original comment by [email protected] on 1 Jan 2010 at 2:01


GoogleCodeExporter commented on August 16, 2024

OK, now that we have been forced to absorb parts of PHP Markdown (because it is
better maintained), and are thus deviating from the "canon" of the 1.0.2b8 Perl
version, I am open to making this small change. Reopening.

Original comment by [email protected] on 6 Jan 2010 at 11:47

Changed state: Accepted


GoogleCodeExporter commented on August 16, 2024

checked into r81

Original comment by [email protected] on 7 Jan 2010 at 12:13

Changed state: Fixed
