Regular expressions

Reference

 

Regular expressions are used to find, find-and-replace, and split strings. QM has two functions that use regular expressions: findrx and str.replacerx. Regular expressions can also be used with commands and functions that take a window name, with window triggers, and in the Find dialog.

 

A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z) and special characters, known as metacharacters. The pattern describes one or more strings to match when searching text. It is like a string containing wildcard characters * and ? that you use when searching for files, but regular expressions are much more powerful. The following table contains some metacharacters and their behavior in the context of regular expressions.

 

Character Description
\ Marks the next character as a special character, a literal (removes the special meaning of a metacharacter), a backreference, an anchor, or an octal or hexadecimal escape. For example, 't' matches the character "t", while '\t' matches a tab character. The sequence '\\' matches "\", and '\(' matches "(".
. Matches any single character except the newline character (\n). If '(?s)' appears earlier in the pattern, . matches all characters, including newline. Note: A Windows line break consists of two characters, carriage return followed by newline (\r\n). Because . matches carriage return (\r), you may sometimes get unexpected results. In that case, use [^\r\n] instead of . .

[xyz] A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain". Note: In QM strings, [digits] is an escape sequence that is replaced with the character that has that code. If you need a literal [digits] in a regular expression, use [91] instead of [. For example, for [135] use [91]135]. Escape sequences are processed before the regular expression is processed.

[^xyz] A negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z] A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase alphabetic character in the range 'a' through 'z'.
[^a-z] A negative range of characters. Matches any character not in the specified range. For example, '[^a-z]' matches any character not in the range 'a' through 'z'.
* Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo".
+ Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z".
? Matches the preceding subexpression zero or one time. For example, "do(es)?" matches the "do" in "do" or "does".
{n} Matches exactly n times (n is a nonnegative integer). For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food".
{n,} Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood".
{n,m} Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". Note that you cannot put a space between the comma and the numbers.
? When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of the searched string as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
^ Matches the position at the beginning of the input string. If the multiline flag is set, ^ also matches the position at the beginning of each line.
$ Matches the position at the end of the input string. If the multiline flag is set, $ also matches the position at the end of each line.
\b Matches a word boundary, that is, the position between a word and a space (or other delimiter, e.g. comma). For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B Matches a nonword boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
(pattern) Matches pattern and captures (remembers) the match ("submatch"). To match parentheses characters ( ), use '\(' or '\)'.
(?:pattern) Matches pattern but does not capture the match, that is, it is a non-capturing match that is not stored for possible later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern) Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.
(?!pattern) Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not captured for possible later use. For example 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but does not match "Windows" in "Windows 2000". Lookaheads do not consume characters, that is, after a match occurs, the search for the next match begins immediately following the last match, not after the characters that comprised the lookahead.
x|y Matches either x or y. For example, 'z|food' matches "z" or "food". 'colo(r|ur)' matches "color" or "colour".
\cx Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range of A-Z or a-z. If not, c is assumed to be a literal 'c' character.
\d Matches a digit character. Equivalent to [0-9].
\D Matches a nondigit character. Equivalent to [^0-9].
\f Matches a form-feed character. Equivalent to \x0c and \cL.
\n Matches a newline character. Equivalent to \x0a and \cJ.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\s Matches any whitespace character including space, tab, form-feed, etc. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab character. Equivalent to \x09 and \cI.
\v Matches a vertical tab character. Equivalent to \x0b and \cK.
\w Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'.
\W Matches any nonword character. Equivalent to '[^A-Za-z0-9_]'.
\xn Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". Allows ASCII codes to be used in regular expressions.
\num Matches num, where num is a positive integer. A reference back to captured matches. For example, '(.)\1' matches two consecutive identical characters.
\n Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).
\nm Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by literal m. If neither of the preceding conditions exists, \nm matches octal escape value nm when n and m are octal digits (0-7).
\nml Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
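
Several of the examples in the table can be tried out with a short script. The sketch below uses Python's `re` module as a stand-in, not QM's findrx or the PCRE library, so it only illustrates the syntax that the two engines share. The variable names are made up for this illustration.

```python
import re

# Greedy vs. non-greedy quantifiers on the string "oooo".
greedy = re.search(r"o+", "oooo").group()    # "oooo"
lazy = re.search(r"o+?", "oooo").group()     # "o"

# Word boundary: 'er\b' matches in "never" but not in "verb".
er_in_never = re.search(r"er\b", "never") is not None    # True
er_in_verb = re.search(r"er\b", "verb") is not None      # False

# Lookaheads test the text that follows without making it part of the match.
look_2000 = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000").group()
look_31 = re.search(r"Windows (?=95|98|NT|2000)", "Windows 3.1")    # None
neg_31 = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1").group()

# Backreference: '(.)\1' matches two consecutive identical characters.
double = re.search(r"(.)\1", "food").group()    # "oo"

# With Windows line breaks, . also matches \r, while [^\r\n] does not.
dot_line = re.match(r".*", "abc\r\ndef").group()           # "abc\r"
safe_line = re.match(r"[^\r\n]*", "abc\r\ndef").group()    # "abc"
```

Note that both lookahead matches are "Windows " including the trailing space, because the space is part of the pattern while the version text is only tested.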

 

Tutorials

 Shows how to use function findrx() to find and extract text.

 findrx() is similar to find().
int i
i=find("Sunday Monday Tuesday" "Monday") ;;find Monday
out i
i=findrx("Sunday Monday Tuesday" "Monday") ;;find Monday
out i

 But findrx() can find more than exact text.
 The second argument (the regular expression) can contain special characters (metacharacters) that match certain characters or conditions.
i=findrx("file578.txt" "\d+") ;;find a number. Here \d matches a digit, and + means "one or more". So, \d+ matches one or more digits.
out i

 How to extract the found text?
str s
i=findrx("file578.txt" "\d+" 0 0 s) ;;pass a str variable as the 5th argument, and it will be populated with the match
out i
out s

 What if not found?
i=findrx("file.txt" "\d+") ;;if not found, returns -1
if(i<0) out "not found"

 When the whole string must match:
i=findrx("578" "^\d+$") ;;here ^ and $ mean "beginning" and "end"
out i
i=findrx("file578.txt" "^\d+$") ;;does not match
out i

 Find a number that is a whole word:
i=findrx("file123 456.txt" "\b\d+\b" 0 0 s) ;;here \b means "word boundary". Word characters are alphanumeric ASCII characters and _.
out i
out s

 Find a date in a filename:
i=findrx("file01-02-2000.txt" "\d{2}-\d{2}-\d{4}" 0 0 s) ;;here \d{2} means "2 digits"
out i
out s

 If you need only the year:
i=findrx("file01-02-2000.txt" "\d{2}-\d{2}-(\d{4})" 0 0 s 1) ;;the 6th argument tells which submatch (part enclosed in ()) to store in s
out i
out s

 If you need the month, day and year:
ARRAY(str) a
i=findrx("file01-02-2000.txt" "(\d{2})-(\d{2})-(\d{4})" 0 0 a)
out a[0] ;;whole match
out a[1] ;;submatch 1 (month)
out a[2] ;;submatch 2 (day)
out a[3] ;;submatch 3 (year)

 Find substring consisting of specified characters:
i=findrx("xxx 0x3ea5 yyy" "0x[\dabcdef]+" 0 0 s) ;;here [] is used to specify characters, \d is digit, + means "1 or more". So it finds 0x followed by 1 or more digits and characters abcdef.
out i
out s

 How to use \^$()[].+? and other metacharacters as ordinary characters? (All metacharacters are listed in the "Regular expression syntax" topic in QM Help.)
i=findrx("file.txt" ".+\.txt" 0 0 s) ;;insert \ before each such character. Here the first . means "any character", but the second . is a literal . because it is preceded by \.
out s
i=findrx("xxx file.txt yyy" "\Qfile.txt\E" 0 0 s) ;;or enclose part of regular expression in \Q and \E
out s
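
For readers who want to try the same patterns outside QM, the findrx() steps above can be approximated in Python. This is only a sketch under stated assumptions: `find_rx` is a hypothetical helper written for this illustration, Python's `re` module stands in for PCRE, and returning the offset of the requested submatch is a choice made here, not a statement about what findrx() returns.

```python
import re

def find_rx(text, pattern, group=0):
    """Hypothetical helper: return (offset, matched text), or (-1, None) if not found."""
    m = re.search(pattern, text)
    if m is None:
        return -1, None
    return m.start(group), m.group(group)

print(find_rx("file578.txt", r"\d+"))            # (4, '578')
print(find_rx("file.txt", r"\d+"))               # (-1, None)
print(find_rx("file123 456.txt", r"\b\d+\b"))    # (8, '456')

# Only the year: ask for submatch 1.
print(find_rx("file01-02-2000.txt", r"\d{2}-\d{2}-(\d{4})", 1))    # (10, '2000')

# Month, day and year as separate submatches.
parts = re.search(r"(\d{2})-(\d{2})-(\d{4})", "file01-02-2000.txt").groups()
print(parts)    # ('01', '02', '2000')

# Python's re module has no \Q...\E; re.escape() plays a similar role.
print(find_rx("xxx file.txt yyy", re.escape("file.txt")))    # (4, 'file.txt')
```

In "file123 456.txt" the whole-word search skips 123 because there is no word boundary between "e" and "1".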

 

 

 Shows how to use function str.replacerx() to find-replace text.

 str.replacerx() is similar to str.findreplace().
str x="Sunday Monday Tuesday"
x.findreplace("Monday" "(here was Monday)")
out x
str y="Sunday Monday Tuesday"
y.replacerx("Monday" "(here was Monday)")
out y

 But it can find more than exact text. It finds text in the same way as findrx().
y="file123 456.txt"
y.replacerx("\d+" "(here was a number)")
out y

 What if not found?
y="file.txt"
int n=y.replacerx("\d+" "(here was a number)") ;;replacerx returns number of found matches
if(n=0) out "not found"

 How to use submatches in the replacement string?
y="file01-02-2000.txt"
y.replacerx("(\d{2})-(\d{2})-(\d{4})" "$3-$1-$2") ;;in the replacement string, $number can be used for a submatch
out y ;;converted date from mm-dd-yyyy to yyyy-mm-dd format
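
The replacement steps above can be approximated in Python in the same way. This is an illustrative sketch, not QM code: `re.subn` stands in for str.replacerx(), and Python writes submatch references in the replacement string as \1, \2, \3 rather than $1, $2, $3.

```python
import re

# Replace every number and get the number of replacements made.
replaced, count = re.subn(r"\d+", "(here was a number)", "file123 456.txt")
print(replaced)    # file(here was a number) (here was a number).txt
print(count)       # 2

# Nothing found: the text is unchanged and the count is 0.
unchanged, none_found = re.subn(r"\d+", "(here was a number)", "file.txt")
if none_found == 0:
    print("not found")

# Reorder submatches: mm-dd-yyyy becomes yyyy-mm-dd.
converted = re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\3-\1-\2", "file01-02-2000.txt")
print(converted)   # file2000-01-02.txt
```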

 

On the web:

Tutorial (VBScript regular expressions)

More info, often used regular expressions, etc

Regular expression tools

PCRE

Regular expression support is provided by the PCRE library, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England.

 

The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. QM uses PCRE version 4.4, which corresponds approximately to Perl 5.8.

UTF-8

When QM is running in Unicode mode, it stores text in UTF-8 format. In UTF-8 text, each non-ASCII character consists of 2-4 bytes.

 

QM does not support UTF-8 mode (option RX_UTF8) in regular expressions. Non-ASCII characters can be used, but each is interpreted as 2-4 separate bytes, not as a single character. For example, . matches a single byte, not a whole character. Also, for non-ASCII characters, word boundaries (\b) and character classes (\w, \s, etc.) are not recognized, case-insensitive matching is not supported, and so on.

 

In most cases this is not a problem. However, when the regular expression or the subject string contains non-ASCII characters, you may have to build the regular expression differently, and possibly use flag 32.
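
The effect of matching bytes instead of characters can be seen in a small experiment. The sketch below is only an analogy: it uses Python's `re` module with bytes patterns to imitate a regular expression engine that is not in UTF-8 mode, and it does not call QM or PCRE. All names in it are made up for this illustration.

```python
import re

word = "caf\u00e9"            # "café": the last character is non-ASCII
data = word.encode("utf-8")   # 5 bytes: the last character takes 2 bytes

# . matches one byte at a time in byte mode, one character in text mode.
byte_hits = len(re.findall(rb".", data))    # 5
char_hits = len(re.findall(r".", word))     # 4

# \w does not recognize the bytes of a non-ASCII character.
bytes_is_word = re.fullmatch(rb"\w+", data) is not None    # False
text_is_word = re.fullmatch(r"\w+", word) is not None      # True

# Case-insensitive matching works only for ASCII letters in byte mode.
upper = "\u00c9".encode("utf-8")    # "É"
lower = "\u00e9".encode("utf-8")    # "é"
bytes_icase = re.search(upper, lower, re.IGNORECASE) is not None    # False
```

One way to "build the regular expression differently" in such cases is to avoid \w and . around non-ASCII text and list the expected characters or byte ranges explicitly.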