Using UTF-8 with a regex engine that is not Unicode-aware
Monday, November 24th, 2008
The regular expression engine provided by POSIX - which I’m currently using in the C++ library - does not provide any explicit support for Unicode. Fortunately, thanks to the properties of UTF-8, for the most part it doesn’t need to.
All of the characters that have special meanings within a regular expression pattern have Unicode values of less than 128, and therefore have the same representation in both ASCII and UTF-8. Furthermore, a UTF-8 byte stream will not contain these byte values for any other purpose. Every other character is encoded as a sequence of two to four bytes, each in the range 128-255, but so long as the regex engine is 8-bit clean it will be able to match these when they occur literally in the pattern.
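A minimal sketch of literal matching, assuming a POSIX system with `<regex.h>` and a program that never calls `setlocale`, so that the engine runs in the default "C" locale and sees plain bytes. The helper name `posix_search` is hypothetical and not part of the library; `\xc3\xa9` is the UTF-8 encoding of é.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: true if the POSIX extended regex `pattern`
// matches anywhere in `text`. Both are handed over as raw byte strings.
bool posix_search(const std::string& pattern, const std::string& text) {
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0) {
        return false;  // pattern failed to compile
    }
    const bool found = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return found;
}
```

Under those assumptions, `posix_search("caf\xc3\xa9", "un caf\xc3\xa9 noir")` should succeed, because the two bytes of é are compared literally, while `posix_search("caf\xc3\xa9", "un cafe noir")` should not.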
Where a difference can be seen is that patterns which attempt to count characters in some way will actually be counting bytes. For this reason, a bracket expression (a list of possible characters in square brackets) that contains characters above 127 will not have the intended effect: it is treated as a set of individual bytes, and matches only one byte of a multi-byte character. Likewise the wildcard (dot) matches a single byte, so it will misbehave if there is any finite limit on the number of repetitions.
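The byte counting can be made visible by measuring match lengths. This is only a sketch: it assumes a POSIX `<regex.h>` engine running in the default "C" locale (no `setlocale` call), and `match_length` is a hypothetical helper introduced for illustration.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: length in bytes of the first match of the POSIX
// extended regex `pattern` in `text`, or -1 if there is no match.
long match_length(const std::string& pattern, const std::string& text) {
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0) {
        return -1;  // pattern failed to compile
    }
    regmatch_t m;
    long length = -1;
    if (regexec(&re, text.c_str(), 1, &m, 0) == 0) {
        length = static_cast<long>(m.rm_eo - m.rm_so);
    }
    regfree(&re);
    return length;
}
```

With the text `"h\xc3\xa9llo"` (héllo), the pattern `"^.{3}"` should report a match of 3 bytes, which covers only the two characters "hé". Similarly `"^[\xc3\xa9]$"` should fail against `"\xc3\xa9"`, since the bracket expression consumes a single byte and leaves the second byte of é unmatched.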
The workaround for bracket expressions is to use branches instead (pipe characters), thereby presenting the multi-byte sequences as strings instead of characters. This is what I’ve done in the language definition files that are currently being written.
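As an illustration of the workaround, the sketch below compares a bracket expression with the equivalent set of branches for é (`\xc3\xa9`) and è (`\xc3\xa8`). It assumes a POSIX `<regex.h>` engine in the default "C" locale; the helper `matches` and the example patterns are hypothetical and are not taken from the language definition files.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: true if the POSIX extended regex `pattern`
// matches somewhere in `text` (anchor the pattern for a full match).
bool matches(const std::string& pattern, const std::string& text) {
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0) {
        return false;  // pattern failed to compile
    }
    const bool ok = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return ok;
}

// Intended meaning: "exactly one of é or è".
// The bracket form is seen by a byte-oriented engine as a set of bytes.
const char* const kBracketForm = "^[\xc3\xa9\xc3\xa8]$";
// The branch form spells each character out as a literal byte string.
const char* const kBranchForm = "^(\xc3\xa9|\xc3\xa8)$";
```

Here `matches(kBranchForm, "\xc3\xa9")` and `matches(kBranchForm, "\xc3\xa8")` should both succeed, whereas `matches(kBracketForm, "\xc3\xa9")` should fail. The branch form should also reject a stray lead byte such as `"\xc3"`, which the bracket form would accept.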
I definitely don’t intend for this to be a permanent feature of the system. The replacement may be a regex engine that is Unicode-aware, in which case there are obvious benefits to ensuring forward compatibility (which the above workaround does: it will function correctly whether or not the engine supports Unicode).
Alternatively, I am looking seriously at replacing regular expressions with a system known as two-level morphology. This would be entirely incompatible, but has the advantage of being reversible (and therefore equally applicable to text generation and analysis). My main reservations concern its efficiency and readability, but given the considerable effort that will be needed to create the language definitions, anything that would widen their potential use has to be worth considering.