Ticket #14 (new defect)

Opened 2 years ago

Last modified 1 year ago

Regex subtraction and intersection

Reported by: lth
Assigned to: lth
Type: defect
Priority: major
Milestone: M1
Component: RefImpl
Version: 4
Keywords:
Cc: brendan, jeffdyer, graydon, ibukanov, chrispi

Description

Russ Cox raised the issue on es4-discuss that the current intersection and subtraction operators \& and \- have two problems:

  • precedence: what does [a-z\-g-j\-z] mean, precisely? The answer depends on the associativity of the operators.
  • breaks tradition: the current regex syntax always allows punctuation in a charset to be escaped, with the guarantee that the escaped punctuation then stands for itself; this works even if the punctuation has no special meaning. Thus code that currently hedges its bets by writing \& in a charset rather than plain & breaks.
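To make the precedence question concrete, here is a minimal sketch that models charsets as plain Java sets and evaluates [a-z\-g-j\-z] under both groupings. It assumes the operands are a-z, g-j and z, each separated by the subtraction operator; the class and method names are hypothetical and are not part of any proposal or of the RefImpl.

```java
import java.util.Set;
import java.util.TreeSet;

final class SubtractionAssociativity {
    /** Returns the set of characters from lo to hi, inclusive. */
    static Set<Character> range(char lo, char hi) {
        Set<Character> s = new TreeSet<>();
        for (char c = lo; c <= hi; c++) {
            s.add(c);
        }
        return s;
    }

    /** Returns a new set holding the members of a that are not in b. */
    static Set<Character> minus(Set<Character> a, Set<Character> b) {
        Set<Character> r = new TreeSet<>(a);
        r.removeAll(b);
        return r;
    }

    /** Left-associative reading: (a-z minus g-j) minus z. */
    static Set<Character> leftAssociative() {
        return minus(minus(range('a', 'z'), range('g', 'j')), range('z', 'z'));
    }

    /** Right-associative reading: a-z minus (g-j minus z). */
    static Set<Character> rightAssociative() {
        return minus(range('a', 'z'), minus(range('g', 'j'), range('z', 'z')));
    }

    public static void main(String[] args) {
        // The two readings disagree on whether 'z' is in the charset.
        System.out.println("left-assoc contains z:  " + leftAssociative().contains('z'));
        System.out.println("right-assoc contains z: " + rightAssociative().contains('z'));
    }
}
```

Under the left-associative reading 'z' is excluded; under the right-associative reading it is included, because subtracting z from g-j removes nothing.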

In addition, there is prior work that's being ignored: Java regexes provide intersection directly, and subtraction by combining complementation with intersection.
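For reference, the Java technique looks roughly like this with java.util.regex. This is only an illustration of the existing Java syntax (the && operator and negated nested classes), not of the ES4 proposal; the class name, pattern names, and example ranges are chosen for illustration.

```java
import java.util.regex.Pattern;

final class JavaCharClassOps {
    // Intersection: characters that are in a-z and also in d-m.
    static final Pattern INTERSECTION = Pattern.compile("[a-z&&[d-m]]");

    // Subtraction expressed as intersection with a complement: a-z except g-j.
    static final Pattern SUBTRACTION = Pattern.compile("[a-z&&[^g-j]]");

    // An escaped ampersand still stands for itself inside a charset.
    static final Pattern LITERAL_AMP = Pattern.compile("[\\&]");

    /** Returns true if the whole input string matches the pattern. */
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(INTERSECTION, "e")); // in both a-z and d-m
        System.out.println(matches(INTERSECTION, "a")); // in a-z only
        System.out.println(matches(SUBTRACTION, "h"));  // removed by [^g-j]
        System.out.println(matches(SUBTRACTION, "z"));  // still in a-z
        System.out.println(matches(LITERAL_AMP, "&"));  // escaped punctuation
    }
}
```

Because the operator is the unescaped sequence &&, an escaped \& keeps its traditional meaning, which sidesteps the second problem listed above.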

History: Discussed during the May 8, 2007 TG1 phone call, at which Lars took the action item of researching prior art other than Java (see his subsequent research notes). Lars advises that we change the proposal and adopt the Java technique.

Wiki locations: proposals:extend_regexps, discussion:extend_regexps

Change History

Changed 2 years ago by lth

  • owner set to lth

Changed 2 years ago by jeffdyer

  • milestone changed from M0 to M1

Changed 2 years ago by lth

  • cc set to brendan, jeffdyer, graydon, ibukanov, chrispi

Going once, going twice, ...

Unless there are any objections to my conclusions recorded in the Wiki and linked above (follow Java on this, as the only credible prior art), this will become a RefImpl bug on Friday morning, after which it will be fixed.

Changed 1 year ago by lth

  • component changed from Proposals to RefImpl

Changed 1 year ago by lth

Brendan, Jeff, Chris:

I've added a big section to the "extend regexps" discussion page on how the lexer ought to cope with the proposed syntax for regex intersection and subtraction. Contrary to what I told Jeff earlier, I have stuck to the split in ES3, where the lexer follows some simple rules and the regex engine does all the hard work later. I don't think it ought to be very controversial, but I'd like to get your opinions before I move it into the proposal and the code.
