Ticket #254 (new defect)

Opened 1 year ago

Last modified 8 months ago

RegExp: Character classes should not be free-format even with the /x flag

Reported by: StevenLevithan
Assigned to: anonymous
Type: defect
Priority: major
Milestone:
Component: Proposals
Version: 4
Keywords:
Cc: lth, brendan

Description

The ES4 RegExp extension proposal discussion page at http://wiki.ecmascript.org/doku.php?id=discussion:extend_regexps indicates, in the section about mixing the /x flag, comments, and unescaped /, that the /x flag will apply free-spacing and comments within character classes. Many regular expression users will not expect this, so IMO it is undesirable. The Perl, PCRE, .NET, Python, Ruby, Tcl (ARE), and JGsoft regular expression flavors (among many others) treat character classes as a single token and do not apply the /x modifier (or equivalent) within them. (I believe the only significant regex engine that differs on this, and ignores whitespace and comments within a character class, is java.util.regex.) Hence, it is a common convention to escape whitespace or # by placing it in a character class (e.g. [ ]) when using /x. Treating character classes as single tokens would make the expression /a[d-o # Range (c,o]<newline>/]/x (which is shown as an example of a pattern which cannot be lexed) equivalent to /a[d-o #Ra(c,]/]/ (i.e. a valid regex literal with no flags followed by "]/").
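To illustrate the convention described above, here is a minimal sketch using Python's re.VERBOSE flag, which is Python's counterpart to /x. The variable names are only illustrative, and Python has no regex literals, so this shows the character-class behavior only, not the lexing question.

```python
import re

# Outside a class, VERBOSE drops whitespace and treats "# ..." as a comment.
outside = re.compile(r"a b  # trailing comment", re.VERBOSE)

# Inside a class, the space and "#" are ordinary literal members.
inside = re.compile(r"a[ #]b", re.VERBOSE)

outside_matches_ab = outside.fullmatch("ab") is not None
outside_matches_a_space_b = outside.fullmatch("a b") is not None
inside_matches_space = inside.fullmatch("a b") is not None
inside_matches_hash = inside.fullmatch("a#b") is not None
inside_matches_ab = inside.fullmatch("ab") is not None

# The class from the ticket's example, read as a single token, accepts the
# same characters as the shortened class given in the description.
long_class = re.compile(r"[d-o # Range (c,o]", re.VERBOSE)
short_class = re.compile(r"[d-o #Ra(c,]")

ascii_chars = [chr(code) for code in range(128)]
long_members = {ch for ch in ascii_chars if long_class.fullmatch(ch)}
short_members = {ch for ch in ascii_chars if short_class.fullmatch(ch)}

print(outside_matches_ab, outside_matches_a_space_b)
print(inside_matches_space, inside_matches_hash, inside_matches_ab)
print(long_members == short_members)
```

Under this behavior, [ ] and [#] work as readable escapes for a literal space and a literal # in a free-spacing pattern.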

This approach would also be consistent with ES4's allowing unescaped / within a character class. Why would it be allowed, if not because character classes are treated as individual tokens and hence are an alternative way of escaping characters?

Change History

Changed 1 year ago by lth

  • cc set to lth, brendan
  • summary changed from Character classes should not be free-format even with the /x flag to RegExp: Character classes should not be free-format even with the /x flag

First comment, unescaped / is allowed exclusively for compatibility with MSIE (which implements it and has required other browsers to follow suit). It is not intended as a hack to be used the way you suggest, although it may be used that way.

Second comment, once one's character classes grow beyond a fairly small size, being able to place different subranges on multiple lines, say, will likely be a feature. Your suggestion merely places artificial limitations on this facility.
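For comparison, this is what splitting a class across lines does in an engine that follows the Perl convention. It is a small sketch using Python's re.VERBOSE; the hex-digit class and the name split_class are illustrative and not taken from the proposal.

```python
import re

# A class written across two lines. Because Python does not apply VERBOSE
# inside a class, the newline and the indentation spaces become members.
split_class = re.compile(r"""
    [a-f
     0-9]
""", re.VERBOSE)

matches_letter = split_class.fullmatch("c") is not None
matches_digit = split_class.fullmatch("7") is not None
matches_space = split_class.fullmatch(" ") is not None
matches_newline = split_class.fullmatch("\n") is not None

print(matches_letter, matches_digit, matches_space, matches_newline)
```

With free-format classes as proposed for ES4, the space and newline would instead be dropped, leaving only the two subranges.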

BTW, please cc me on all RegExp bugs you may create. Note also that the proposal period is over and that only small adjustments will be accepted. (The present bug clearly falls into the class of small adjustments, though, as do your previous suggestions.)

Changed 1 year ago by StevenLevithan

My argument against that would fall back to reiterating that ECMAScript regular expressions are based on Perl, and that practically every regex engine which has an /x modifier or equivalent does it the way Perl does (java.util.regex being the exception).

Given that character classes can be negated, contain shorthand character classes and Unicode properties, and use ranges, subtraction, and intersection, I would be quite surprised to see character classes grow very long in the real world.

Ultimately, though, this particular issue isn't a big deal to me, so I don't expect to push the point further. I'm certainly happy to respond to any questions, etc.

Changed 1 year ago by StevenLevithan

See also the semi-related issue of how empty character classes (which are made possible by applying the /x flag within classes) should be handled at ticket #261.

Changed 8 months ago by StevenLevithan

Upon review, I withdraw my own support for this ticket. The issue of compatibility with other regex flavors stands, but ignoring whitespace within character classes is not unprecedented. I've already mentioned java.util.regex, and I believe Apocalypse 5 for Perl 6 regexes also proposes ignoring whitespace in character classes there. I don't think the other issues I've raised here are strong enough to stand on their own.
