xnedit regex - infoBAG

Regular expressions (regexes) are useful as a way to match inexact sequences of characters.  They can be used in the `Find...' and `Replace...' search dialogs and are at the core of Color Syntax Highlighting patterns.  To specify a regular expression in a search dialog, simply click on the `Regular Expression' radio button in the dialog. 

A regex is a specification of a pattern to be matched in the searched text. This pattern consists of a sequence of tokens, each being able to match a single character or a sequence of characters in the text, or assert that a specific position within the text has been reached (the latter is called an anchor.)  Tokens (also called atoms) can be modified by adding one of a number of special quantifier tokens immediately after the token.  A quantifier token specifies how many times the previous token must be matched (see below.) 

Tokens can be grouped together using one of a number of grouping constructs, the most common being plain parentheses.  Tokens that are grouped in this way are also collectively considered to be a regex atom, since this new larger atom may also be modified by a quantifier. 

A regex can also be organized into a list of alternatives by separating each alternative with pipe characters, `|'.  This is called alternation.  A match will be attempted for each alternative listed, in the order specified, until a match results or the list of alternatives is exhausted (see Alternation section below.) 

The 'Any' Character

If a dot (`.') appears in a regex, it means to match any character exactly once.  By default, dot will not match a newline character, but this behavior can be changed (see the Matching Newlines section below). 

Character Classes

A character class, or range, matches exactly one character of text, but the candidates for matching are limited to those specified by the class.  Classes come in two flavors as described below: 

     [...]   Regular class, match only characters listed.
     [^...]  Negated class, match only characters not listed.

As with the dot token, by default negated character classes do not match newline, but can be made to do so. 

The characters that are considered special within a class specification are different than the rest of regex syntax as follows. If the first character in a class is the `]' character (second character if the first character is `^') it is a literal character and part of the class character set.  This also applies if the first or last character is `-'.  Outside of these rules, two characters separated by `-' form a character range which includes all the characters between the two characters as well.  For example, `[^f-j]' is the same as `[^fghij]' and means to match any character that is not `f', `g', `h', `i', or `j'. 
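
The class rules above are shared by many regex engines.  As a rough illustration, the following sketch uses Python's `re' module, which is a different engine from XNEdit's but treats ranges, a leading `]', and a trailing `-' the same way.  The sample strings are made up for the example. 

```python
import re

# `[^f-j]' and `[^fghij]' should accept and reject the same characters.
range_form = re.compile(r"[^f-j]")
listed_form = re.compile(r"[^fghij]")
same_behavior = all(
    bool(range_form.fullmatch(ch)) == bool(listed_form.fullmatch(ch))
    for ch in "abcdefghijklmnop"
)
print(same_behavior)          # prints True

# A `]' placed first in the class is a literal member of the set.
bracket_hits = re.findall(r"[]a-c]", "x]b[z")
print(bracket_hits)           # prints [']', 'b']

# A `-' placed last in the class is literal as well.
dash_hits = re.findall(r"[a-c-]", "a-z")
print(dash_hits)              # prints ['a', '-']
```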

Anchors

Anchors are assertions that you are at a very specific position within the search text.  XNEdit regular expressions support the following anchor tokens: 

     ^    Beginning of line
     $    End of line
     <    Left word boundary
     >    Right word boundary
     \B   Not a word boundary

Note that the \B token ensures that neither the left nor the right character are delimiters, or that both left and right characters are delimiters. The left word anchor checks whether the previous character is a delimiter and the next character is not. The right word anchor works in a similar way. 

Note that word delimiters are user-settable, and defined by the X resource wordDelimiters, cf. X Resources. 

Quantifiers

Quantifiers specify how many times the previous regular expression atom may be matched in the search text.  Some quantifiers can produce a large performance penalty, and can in some instances completely lock up XNEdit.  To prevent this, avoid nested quantifiers, especially those of the maximal matching type (see below.) 

The following quantifiers are maximal matching, or "greedy", in that they match as much text as possible (but don't exclude shorter matches if that is necessary to achieve an overall match). 

     *   Match zero or more
     +   Match one  or more
     ?   Match zero or one

The following quantifiers are minimal matching, or "lazy", in that they match as little text as possible (but don't exclude longer matches if that is necessary to achieve an overall match). 

     *?   Match zero or more
     +?   Match one  or more
     ??   Match zero or one
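
The difference between the greedy and lazy forms is easiest to see side by side.  The sketch below uses Python's `re' module as a stand-in, on the assumption that it follows the same greedy and lazy conventions for these six quantifiers; the sample text is invented. 

```python
import re

text = "a1b2b3b"

# Greedy `.*' runs to the last `b' it can reach.
greedy = re.search(r"a.*b", text).group()
print(greedy)          # prints a1b2b3b

# Lazy `.*?' stops at the first `b' that completes a match.
lazy = re.search(r"a.*?b", text).group()
print(lazy)            # prints a1b

# A lazy quantifier still grows when the rest of the pattern needs it:
# here `b' must be the final character, so `.*?' extends to the end.
lazy_forced = re.search(r"a.*?b$", text).group()
print(lazy_forced)     # prints a1b2b3b
```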

One final quantifier is the counting quantifier, or brace quantifier. It takes the following basic form: 

     {min,max}  Match from `min' to `max' times the
                previous regular expression atom.

If `min' is omitted, it is assumed to be zero.  If `max' is omitted, it is assumed to be infinity.  Whether specified or assumed, `min' must be less than or equal to `max'.  Note that both `min' and `max' are limited to 65535.  If both are omitted, then the construct is the same as `*'.   Note that `{,}' and `{}' are both valid brace constructs.  A single number appearing without a comma, e.g. `{3}' is short for the `{min,min}' construct, or to match exactly `min' number of times. 

The quantifiers `{1}' and `{1,1}' are accepted by the syntax, but are optimized away since they mean to match exactly once, which is redundant information.  Also, for efficiency, certain combinations of `min' and `max' are converted to either `*', `+', or `?' as follows: 

     {} {,} {0,}    *
     {1,}           +
     {,1} {0,1}     ?

Note that {0} and {0,0} are meaningless and will generate an error message at regular expression compile time. 
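
The brace rules above can be summarized as a small normalizing function.  This is only a sketch of the rules as written in this section, not XNEdit's actual implementation; the name `simplify_brace' and its return convention (an empty string for a quantifier that is optimized away) are assumptions made for the example. 

```python
import re

BRACE = re.compile(r"\{(\d*)(,?)(\d*)\}")
LIMIT = 65535   # stated upper limit for both `min' and `max'

def simplify_brace(quantifier):
    """Return the simplified equivalent of a brace quantifier string."""
    m = BRACE.fullmatch(quantifier)
    if m is None:
        raise ValueError("not a brace quantifier: " + quantifier)
    lo_text, comma, hi_text = m.groups()
    lo = int(lo_text) if lo_text else 0        # omitted min means zero
    if comma:
        hi = int(hi_text) if hi_text else None  # omitted max means infinity
    elif lo_text:
        hi = lo                                 # `{n}' is short for `{n,n}'
    else:
        hi = None                               # `{}' behaves like `{,}'
    if lo > LIMIT or (hi is not None and hi > LIMIT):
        raise ValueError("count exceeds 65535")
    if hi is not None and lo > hi:
        raise ValueError("min must not exceed max")
    if hi == 0:
        raise ValueError("matching zero times is meaningless")
    if (lo, hi) == (0, None):
        return "*"
    if (lo, hi) == (1, None):
        return "+"
    if (lo, hi) == (0, 1):
        return "?"
    if (lo, hi) == (1, 1):
        return ""                               # redundant, optimized away
    return "{%d,%s}" % (lo, "" if hi is None else hi)

for q in ("{}", "{,}", "{0,}", "{1,}", "{,1}", "{0,1}", "{1}", "{3}", "{2,5}"):
    print(q, "->", repr(simplify_brace(q)))
```

Under these assumptions, `{3}' is reported in its expanded form `{3,3}', and `{0}' or `{0,0}' raise an error, mirroring the compile-time error described above. 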

Brace quantifiers can also be "lazy".  For example {2,5}? would try to match 2 times if possible, and will only match 3, 4, or 5 times if that is what is necessary to achieve an overall match. 
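
As an illustration of the lazy brace quantifier, the following sketch uses Python's `re' module, which is assumed here to treat `{2,5}' and `{2,5}?' the same way; the digit strings are invented for the example. 

```python
import re

# Greedy: take as many repetitions as allowed, up to five.
greedy_count = re.match(r"\d{2,5}", "123456").group()
print(greedy_count)    # prints 12345

# Lazy: settle for the minimum of two when nothing forces more.
lazy_count = re.match(r"\d{2,5}?", "123456").group()
print(lazy_count)      # prints 12

# Lazy, but the trailing `7' forces four repetitions before it.
lazy_extended = re.match(r"\d{2,5}?7", "12347").group()
print(lazy_extended)   # prints 12347
```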

Alternation

A series of alternative patterns to match can be specified by separating them with vertical pipes, `|'.  An example of alternation would be `a|be|sea'. This will match `a', or `be', or `sea'. Each alternative can be an arbitrarily complex regular expression. The alternatives are attempted in the order specified.  An empty alternative can be specified if desired, e.g. `a|b|'.  Since an empty alternative can match nothingness (the empty string), this guarantees that the expression will match. 
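
The ordering rule and the empty alternative can be tried out in any backtracking engine.  The sketch below uses Python's `re' module as a stand-in for XNEdit's engine; apart from `a|be|sea' and `a|b|', the patterns and test strings are invented. 

```python
import re

# Each alternative of `a|be|sea' matches its own word and nothing else.
words = ("a", "be", "sea", "bee")
whole_word = [re.fullmatch(r"a|be|sea", w) is not None for w in words]
print(whole_word)             # prints [True, True, True, False]

# Alternatives are tried left to right, so the order changes the result.
first_wins = re.match(r"a|ab", "ab").group()
print(first_wins)             # prints a
reordered = re.match(r"ab|a", "ab").group()
print(reordered)              # prints ab

# A trailing empty alternative matches the empty string, so the
# expression as a whole cannot fail.
empty_match = re.match(r"a|b|", "xyz").group()
print(repr(empty_match))      # prints ''
```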

Comments

Comments are of the form `(?#<comment text>)' and can be inserted anywhere and have no effect on the execution of the regular expression.  They can be handy for documenting very complex regular expressions.  Note that a comment begins with `(?#' and ends at the first occurrence of an ending parenthesis, or the end of the regular expression... period.  Comments do not recognize any escape sequences. 

Escaping Metacharacters

In a regular expression (regex), most ordinary characters match themselves. For example, `ab%' would match anywhere `a' followed by `b' followed by `%' appeared in the text.  Other characters don't match themselves, but are metacharacters. For example, backslash is a special metacharacter which 'escapes' or changes the meaning of the character following it. Thus, matching a literal backslash requires two backslashes in sequence in the regular expression. XNEdit provides the following escape sequences so that metacharacters that are used by the regex syntax can be specified as ordinary characters. 

     \(  \)  \-  \[  \]  \<  \>  \{  \}
     \.  \|  \^  \$  \*  \+  \?  \&  \\

Special Control Characters

There are some special characters that are difficult or impossible to type. Many of these characters can be constructed as a sort of metacharacter or sequence by preceding a literal character with a backslash. XNEdit recognizes the following special character sequences: 

     \a  alert (bell)
     \b  backspace
     \e  ASCII escape character (***)
     \f  form feed (new page)
     \n  newline
     \r  carriage return
     \t  horizontal tab
     \v  vertical tab

     *** For environments that use the EBCDIC character set,
         when compiling XNEdit set the EBCDIC_CHARSET compiler
         symbol to get the EBCDIC equivalent escape
         character.

Octal and Hex Escape Sequences

Any ASCII (or EBCDIC) character, except null, can be specified by using either an octal escape or a hexadecimal escape, each beginning with \0 or \x (or \X), respectively.  For example, \052 and \X2A both specify the `*' character.  Escapes for null (\00 or \x0) are not valid and will generate an error message.  Also, any escape that exceeds \0377 or \xFF will either cause an error or have any additional character(s) interpreted literally. For example, \0777 will be interpreted as \077 (a `?' character) followed by `7' since \0777 is greater than \0377. 

An invalid digit will also end an octal or hexadecimal escape.  For example, \091 will cause an error since `9' is not within an octal escape's range of allowable digits (0-7) and truncation before the `9' yields \0 which is invalid. 
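
The truncation behavior for octal escapes can be sketched as a small decoder.  This is a hypothetical helper, not XNEdit's code: the name `decode_octal_escape' is invented, and the rule "keep reading octal digits while the value stays at or below 0377" is an assumption inferred from the examples above. 

```python
OCTAL_DIGITS = "01234567"
MAX_VALUE = 0o377   # largest value an escape may encode (255)

def decode_octal_escape(text):
    # Decode an octal escape at the start of `text'.
    # Returns (character, remaining_text); the remaining text is what
    # would be interpreted literally.
    if not text.startswith("\\0"):
        raise ValueError("not an octal escape")
    value = 0
    pos = 2
    while pos < len(text) and text[pos] in OCTAL_DIGITS:
        candidate = value * 8 + int(text[pos])
        if candidate > MAX_VALUE:
            break                      # this digit is left as a literal
        value = candidate
        pos += 1
    if value == 0:
        raise ValueError("escape for null is not valid")
    return chr(value), text[pos:]

print(decode_octal_escape(r"\052"))    # prints ('*', '')
print(decode_octal_escape(r"\0777"))   # prints ('?', '7')
```

With this sketch, `\091' raises an error because truncating before the `9' leaves only `\0', which is the invalid null escape. 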

Shortcut Escape Sequences

XNEdit defines some escape sequences that are handy shortcuts for commonly used character classes. 

   \d  digits            0-9
   \l  letters           a-z, A-Z, and locale dependent letters
   \s  whitespace        \t, \r, \v, \f, and space
   \w  word characters   letters, digits, and underscore, `_'

\D, \L, \S, and \W are the same as the lowercase versions except that the resulting character class is negated.  For example, \d is equivalent to `[0-9]', while \D is equivalent to `[^0-9]'. 

These escape sequences can also be used within a character class.  For example, `[\l_]' is the same as `[a-zA-Z_]', extended with possible locale dependent letters. The escape sequences for special characters, and octal and hexadecimal escapes are also valid within a class. 

Word Delimiter Tokens

Although not strictly a character class, the following escape sequences behave similarly to character classes: 

     \y   Word delimiter character
     \Y   Not a word delimiter character

The `\y' token matches any single character that is one of the characters that XNEdit recognizes as a word delimiter character, while the `\Y' token matches any character that is not a word delimiter character.  Word delimiter characters are dynamic in nature, meaning that the user can change them through preference settings.  For this reason, they must be handled differently by the regular expression engine.  As a consequence of this, `\y' and `\Y' cannot be used within a character class specification. 

Capturing Parentheses

Capturing Parentheses are of the form `(<regex>)' and can be used to group arbitrarily complex regular expressions.  Parentheses can be nested, but the total number of parentheses, nested or otherwise, is limited to 50 pairs. The text that is matched by the regular expression between a matched set of parentheses is captured and available for text substitutions and backreferences (see below.)  Capturing parentheses carry a fairly high overhead both in terms of memory used and execution speed, especially if quantified by `*' or `+'. 

Non-Capturing Parentheses

Non-Capturing Parentheses are of the form `(?:<regex>)' and facilitate grouping only and do not incur the overhead of normal capturing parentheses. They should not be counted when determining numbers for capturing parentheses which are used with backreferences and substitutions.  Because of the limit on the number of capturing parentheses allowed in a regex, it is advisable to use non-capturing parentheses when possible. 
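
The effect on group numbering can be seen in the following sketch, which uses Python's `re' module (assumed to share the `(?:...)' syntax and left-to-right numbering); the pattern and input are invented for the example. 

```python
import re

# The middle group is non-capturing, so it is skipped when numbering.
m = re.match(r"(\d+)-(?:\d+)-(\d+)", "10-20-30")

captured = m.groups()
print(captured)        # prints ('10', '30')

second = m.group(2)
print(second)          # prints 30
```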

Positive Look-Ahead

Positive look-ahead constructs are of the form `(?=<regex>)' and implement a zero width assertion of the enclosed regular expression.  In other words, a match of the regular expression contained in the positive look-ahead construct is attempted.  If it succeeds, control is passed to the next regular expression atom, but the text that was consumed by the positive look-ahead is first unmatched (backtracked) to the place in the text where the positive look-ahead was first encountered. 

One application of positive look-ahead is the manual implementation of a first character discrimination optimization.  You can include a positive look-ahead that contains a character class which lists every character that the following (potentially complex) regular expression could possibly start with.  This will quickly filter out match attempts that cannot possibly succeed. 

Negative Look-Ahead

Negative look-ahead takes the form `(?!<regex>)' and is exactly the same as positive look-ahead except that the enclosed regular expression must NOT match.  This can be particularly useful when you have an expression that is general, and you want to exclude some special cases.  Simply precede the general expression with a negative look-ahead that covers the special cases that need to be filtered out. 
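
Both look-ahead forms can be tried with the sketch below.  It uses Python's `re' module, which is assumed to share the `(?=...)' and `(?!...)' syntax; note that Python's `\b' is used where an XNEdit pattern would use the `<' anchor, and the sample text is invented. 

```python
import re

# Positive look-ahead: a name must be followed by `(', but the
# parenthesis itself is not part of the match.
m = re.search(r"\w+(?=\()", "call foo(1)")
name = m.group()
print(name)                    # prints foo
print(m.end())                 # prints 8, the position of `('

# Negative look-ahead: a general pattern (any word) preceded by an
# exclusion for the special case (words starting with `un').
kept = re.findall(r"\b(?!un)\w+", "undo redo unit test")
print(kept)                    # prints ['redo', 'test']
```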

Positive Look-Behind

Positive look-behind constructs are of the form `(?<=<regex>)' and implement a zero width assertion of the enclosed regular expression in front of the current matching position.  It is similar to a positive look-ahead assertion, except for the fact that the match is attempted on the text preceding the current position, possibly even in front of the start of the matching range of the entire regular expression. 

A restriction on look-behind expressions is the fact that the expression must match a string of a bounded size.  In other words, `*', `+', and `{n,}' quantifiers are not allowed inside the look-behind expression. Moreover, matching performance is sensitive to the difference between the upper and lower bound on the matching size.  The smaller the difference, the better the performance.  This is especially important for regular expressions used in highlight patterns. 

Positive look-behind has similar applications as positive look-ahead. 

Negative Look-Behind

Negative look-behind takes the form `(?<!<regex>)' and is exactly the same as positive look-behind except that the enclosed regular expression must not match. The same restrictions apply. 

Note however, that performance is even more sensitive to the distance between the size boundaries: a negative look-behind must not match for any possible size, so the matching engine must check every size. 
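
The following sketch shows both look-behind forms using Python's `re' module.  Python is stricter than the rule described above: it requires the look-behind expression to have a single fixed width, whereas XNEdit only requires a bounded width.  Python's `\b' stands in for XNEdit's `<' anchor, and the sample text is invented. 

```python
import re

text = "cost $42 or 17"

# Positive look-behind: digits that directly follow a dollar sign.
priced = re.findall(r"(?<=\$)\d+", text)
print(priced)      # prints ['42']

# Negative look-behind: whole numbers not preceded by a dollar sign.
unpriced = re.findall(r"(?<!\$)\b\d+", text)
print(unpriced)    # prints ['17']
```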

Case Sensitivity

There are two parenthetical constructs that control case sensitivity: 

     (?i<regex>)   Case insensitive; `AbcD' and `aBCd' are
                   equivalent.

     (?I<regex>)   Case sensitive;   `AbcD' and `aBCd' are
                   different.

Regular expressions are case sensitive by default, that is, `(?I<regex>)' is assumed.  All regular expression token types respond appropriately to case insensitivity including character classes and backreferences.  There is some extra overhead involved when case insensitivity is in effect, but only to the extent of converting each character compared to lower case. 
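
Scoped case insensitivity can be illustrated with Python's `re' module, with one caveat: Python (3.6 and later) spells the construct `(?i:<regex>)', with a colon, rather than XNEdit's `(?i<regex>)'.  The sketch is therefore an analogy, not XNEdit syntax. 

```python
import re

# Case sensitive by default.
default_match = re.fullmatch(r"abcd", "AbCD") is not None
print(default_match)     # prints False

# Case insensitive for the whole pattern.
scoped_match = re.fullmatch(r"(?i:abcd)", "AbCD") is not None
print(scoped_match)      # prints True

# Case insensitive for only part of the pattern.
partial_ok = re.fullmatch(r"(?i:ab)cd", "ABcd") is not None
partial_bad = re.fullmatch(r"(?i:ab)cd", "ABCD") is not None
print(partial_ok, partial_bad)   # prints True False
```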

Matching Newlines

XNEdit regular expressions by default handle the matching of newlines in a way that should seem natural for most editing tasks.  There are situations, however, that require finer control over how newlines are matched by some regular expression tokens. 

By default, XNEdit regular expressions will not match a newline character for the following regex tokens: dot (`.'); a negated character class (`[^...]'); and the following shortcuts for character classes: 

     `\d', `\D', `\l', `\L', `\s', `\S', `\w', `\W', `\Y'

The matching of newlines can be controlled for the `.' token, negated character classes, and the `\s' and `\S' shortcuts by using one of the following parenthetical constructs: 

     (?n<regex>)  `.', `[^...]', `\s', `\S' match newlines

     (?N<regex>)  `.', `[^...]', `\s', `\S' don't match
                                            newlines

`(?N<regex>)' is the default behavior. 

Notes on New Parenthetical Constructs

Except for plain parentheses, none of the parenthetical constructs capture text.  If that is desired, the construct must be wrapped with capturing parentheses, e.g. `((?i<regex>))'. 

All parenthetical constructs can be nested as deeply as desired, except for capturing parentheses which have a limit of 50 sets of parentheses, regardless of nesting level. 

Back References

Backreferences allow you to match text captured by a set of capturing parentheses at some later position in your regular expression.  A backreference is specified using a single backslash followed by a single digit from 1 to 9 (example: \3).  Backreferences have similar syntax to substitutions (see below), but are different from substitutions in that they appear within the regular expression, not the substitution string. The number specified with a backreference identifies which set of text capturing parentheses the backreference is associated with. The text that was most recently captured by these parentheses is used by the backreference to attempt a match.  As with substitutions, open parentheses are counted from left to right beginning with 1.  So the backreference `\3' will try to match another occurrence of the text most recently matched by the third set of capturing parentheses.  As an example, the regular expression `(\d)\1' could match `22', `33', or `00', but wouldn't match `19' or `01'. 

A backreference must be associated with a parenthetical expression that is complete.  The expression `(\w(\1))' contains an invalid backreference since the first set of parentheses is not complete at the point where the backreference appears. 
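
The `(\d)\1' example and the incomplete-group rule can be checked with the sketch below.  It uses Python's `re' module, which is assumed to handle single-digit backreferences the same way and which also rejects a reference to a group that is still open. 

```python
import re

double_digit = re.compile(r"(\d)\1")

results = {s: double_digit.fullmatch(s) is not None
           for s in ("22", "33", "00", "19", "01")}
print(results)
# prints {'22': True, '33': True, '00': True, '19': False, '01': False}

# A backreference to a group that has not been closed yet is rejected
# when the pattern is compiled.
try:
    re.compile(r"(\w(\1))")
    open_group_rejected = False
except re.error:
    open_group_rejected = True
print(open_group_rejected)    # prints True
```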

Substitution

Substitution strings are used to replace text matched by a set of capturing parentheses.  The substitution string is mostly interpreted as ordinary text except as follows. 

The escape sequences described above for special characters, and octal and hexadecimal escapes are treated the same way by a substitution string. When the substitution string contains the `&' character, XNEdit will substitute the entire string that was matched by the `Find...' operation. Any of the first nine sub-expressions of the match string can also be inserted into the replacement string.  This is done by inserting a `\' followed by a digit from 1 to 9 that represents the string matched by a parenthesized expression within the regular expression.  These expressions are numbered left-to-right in order of their opening parentheses. 

The capitalization of text inserted by `&' or `\1', `\2', ... `\9' can be altered by preceding them with `\U', `\u', `\L', or `\l'.  `\u' and `\l' change only the first character of the inserted entity, while `\U' and `\L' change the entire entity to upper or lower case, respectively. 
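
For readers who know Python, the substitution features above map roughly onto `re.sub'.  The mapping is not exact: Python writes the whole match as `\g<0>' rather than `&', and it has no `\U', `\u', `\L', or `\l', so a replacement function is used below to imitate `\u'.  The helper name `capitalize_first' and the sample strings are invented for the example. 

```python
import re

# XNEdit's `&' inserts the whole match; Python spells this \g<0>.
bracketed = re.sub(r"\w+", r"[\g<0>]", "ab cd")
print(bracketed)       # prints [ab] [cd]

# \1 ... \9 insert the text of numbered groups in both dialects.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")
print(swapped)         # prints value=key

# Imitation of the XNEdit replacement `\u\1\2': upper-case only the
# first character of the inserted text.
def capitalize_first(match):
    return match.group(1).upper() + match.group(2)

capitalized = re.sub(r"(\w)(\w*)", capitalize_first, "hello world")
print(capitalized)     # prints Hello World
```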

Substitutions

Regular expression substitution can be used to program automatic editing operations.  For example, the following are search and replace strings to find occurrences of the `C' language subroutine `get_x', reverse the first and second parameters, add a third parameter of NULL, and change the name to `new_get_x': 

     Search string:   `get_x *\( *([^ ,]*), *([^\)]*)\)'
     Replace string:  `new_get_x(\2, \1, NULL)'
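
The same search and replace strings happen to be valid in Python's `re' module, so the operation can be tried outside the editor.  This is only a sketch: Python's negated classes also match newlines, unlike XNEdit's default, and the sample line of C code is invented. 

```python
import re

search = r"get_x *\( *([^ ,]*), *([^\)]*)\)"
replace = r"new_get_x(\2, \1, NULL)"

line = "y = get_x(a, b);"
rewritten = re.sub(search, replace, line)
print(rewritten)    # prints y = new_get_x(b, a, NULL);
```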

Ambiguity

If a regular expression could match two different parts of the text, it will match the one which begins earliest.  If both begin in the same place but match different lengths, or match the same length in different ways, life gets messier, as follows. 

In general, the possibilities in a list of alternatives are considered in left-to-right order.  The possibilities for `*', `+', and `?' are considered longest-first, nested constructs are considered from the outermost in, and concatenated constructs are considered leftmost-first. The match that will be chosen is the one that uses the earliest possibility in the first choice that has to be made.  If there is more than one choice, the next will be made in the same manner (earliest possibility) subject to the decision on the first choice.  And so forth. 

For example, `(ab|a)b*c' could match `abc' in one of two ways.  The first choice is between `ab' and `a'; since `ab' is earlier, and does lead to a successful overall match, it is chosen.  Since the `b' is already spoken for, the `b*' must match its last possibility, the empty string, since it must respect the earlier choice. 

In the particular case where no `|'s are present and there is only one `*', `+', or `?', the net effect is that the longest possible match will be chosen.  So `ab*', presented with `xabbbby', will match `abbbb'.  Note that if `ab*' is tried against `xabyabbbz', it will match `ab' just after `x', due to the begins-earliest rule.  (In effect, the decision on where to start the match is the first choice to be made, hence subsequent choices must respect it even if this leads them to less-preferred alternatives.) 
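
These rules can be observed with the sketch below, which uses Python's `re' module on the assumption that, as a backtracking engine, it resolves these choices the same way.  A second pair of parentheses is added around `b*' only so that its share of the match can be inspected. 

```python
import re

# `ab' is the earlier alternative and leads to a match, so `b*' is
# left with the empty string.
shares = re.match(r"(ab|a)(b*)c", "abc").groups()
print(shares)              # prints ('ab', '')

# With a single `*', the longest possible match is chosen.
longest = re.search(r"ab*", "xabbbby").group()
print(longest)             # prints abbbb

# The begins-earliest rule wins over a longer match further right.
m = re.search(r"ab*", "xabyabbbz")
print(m.group(), m.start())   # prints ab 1
```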

References

An excellent book on the care and feeding of regular expressions is 

          Mastering Regular Expressions, 3rd Edition
          Jeffrey E. F. Friedl
          August 2006, O'Reilly & Associates
          ISBN 0-596-52812-4

The first and second editions of this book are still useful as a basic introduction to regexes and contain many useful tips and tricks. 

Examples

The following are regular expression examples which will match: 

    * An entire line.
        ^.*$

    * Blank lines.
        ^$

    * Whitespace on a line.
        \s+

    * Whitespace across lines.
        (?n\s+)

    * Whitespace that spans at least two lines. Note minimal matching `*?' quantifier.
        (?n\s*?\n\s*)

    * IP address (not robust).
        (?:\d{1,3}(?:\.\d{1,3}){3})

    * Two character US Postal state abbreviations (includes territories).
        [ACDF-IK-PR-W][A-Z]

    * Web addresses.
        (?:http://)?www\.\S+

    * Case insensitive double words across line breaks.
        (?i(?n<(\S+)\s+\1>))

    * Upper case words with possible punctuation.
        <[A-Z][^a-z\s]*>
URL: https://ib.bsb.br/xnedit