Ticket #37 (new defect)

Opened 3 years ago

Last modified 1 year ago

RegExp: interaction of splitting large Unicode chars with quantifiers and charsets

Reported by: lth
Assigned to: lth
Type: defect
Priority: major
Milestone:
Component: Spec
Version: Harmony
Keywords: unicode
Cc:

Description

The "Update Unicode" proposal says that a 16-bit ECMAScript implementation must split a literal like \u{1AFFE} into two 16-bit code units (a surrogate pair). Necessarily, the same split will happen inside regular expressions.
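To make the split concrete, here is a minimal sketch of the standard UTF-16 surrogate arithmetic applied to U+1AFFE. The helper name toSurrogatePair is hypothetical and not part of the proposal; it only illustrates which two code units a 16-bit implementation would end up storing.

```javascript
// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const high = 0xD800 + (offset >> 10);      // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1AFFE);
console.log(high.toString(16), low.toString(16)); // d82b dffe
```

So \u{1AFFE} is stored as the two code units \uD82B \uDFFE, and the rest of this ticket is about how a regular expression should treat that pair.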

Then quantifiers applied to such a character probably need to take that into account: \u{1AFFE}* should perhaps mean (?:\u{1AFFE})*, for example, so that the two code units act as an atomic sequence during matching.
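The difference between the two readings can be sketched by writing the surrogate pair out by hand. This assumes the split has already happened, so the pattern contains \uD82B\uDFFE (the UTF-16 form of U+1AFFE) and matching works on 16-bit code units; the variable names are illustrative only.

```javascript
// U+1AFFE as the surrogate pair a 16-bit implementation would store.
const ch = "\uD82B\uDFFE";

// Naive reading: after the split, "*" binds only to the low surrogate.
const naive = /^\uD82B\uDFFE*$/;

// Reading suggested above: the pair is quantified as one unit.
const grouped = /^(?:\uD82B\uDFFE)*$/;

console.log(naive.test(ch + ch));                // false: second pair is not matched
console.log(naive.test("\uD82B\uDFFE\uDFFE"));   // true: a repeated low surrogate is accepted
console.log(grouped.test(ch + ch));              // true
console.log(grouped.test("\uD82B\uDFFE\uDFFE")); // false
```

Under the naive reading the quantifier both rejects a genuine repetition of the character and accepts a malformed sequence, which is why the grouped interpretation looks like the intended one.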

A harder problem is character sets, [...\u{1AFFE}...], where the splitting probably does not have the desired effect: the set would contain the two surrogates as separate members, each matching on its own. It is possible to imagine the meaning of this to be (?:(?:\u{1AFFE})|[...]).
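A small sketch of the character-set problem, again assuming the pair has already been split into \uD82B\uDFFE and that matching is done per 16-bit code unit. The set [a\u{1AFFE}] is used as a stand-in for [...\u{1AFFE}...]; the names are illustrative only.

```javascript
// U+1AFFE as a surrogate pair.
const astral = "\uD82B\uDFFE";

// Naive reading: the set holds "a" and the two surrogates as separate members.
const naiveSet = /^[a\uD82B\uDFFE]$/;

// Rewrite imagined above: the pair becomes an alternative next to the rest of the set.
const rewritten = /^(?:\uD82B\uDFFE|[a])$/;

console.log(naiveSet.test(astral));    // false: the set matches one code unit, not the pair
console.log(naiveSet.test("\uD82B"));  // true: a lone high surrogate matches
console.log(rewritten.test(astral));   // true
console.log(rewritten.test("\uD82B")); // false
console.log(rewritten.test("a"));      // true
```

This only covers a plain positive set. Negated sets and ranges with a supplementary endpoint would need their own treatment, which the sketch does not attempt.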

We need to investigate this more fully and explicate the meaning of splitting characters in regular expressions.

Change History

Changed 3 years ago by lth

  • milestone set to M1

Changed 3 years ago by lth

  • owner set to lth

Changed 3 years ago by lth

  • milestone changed from M1 to M3

Changed 2 years ago by lth

  • priority changed from minor to major
  • component changed from Proposals to Spec

Changed 2 years ago by lth

Also see #213.

Changed 1 year ago by David-Sarah Hopwood

  • keywords set to unicode
  • version changed from 4 to Harmony
  • milestone deleted