Find substring using regular expression

Regular expressions

Regular expression syntax

 

Syntax

int findrx(string pattern [from|rf] [flags] [result] [submatch])

 

Parts

string - string to search in.

pattern - regular expression that matches substring to find. String.

from - zero-based character index, from which to start search. Default 0.

rf - address of variable of type FINDRX.

flags - combination of values listed below. Default: 0.

1 Case insensitive.
2 Whole word. This adds \b to the beginning and end of pattern.
4 Find all. Valid only if result is array.
8 Multiline. If this flag is set (or (?m) is used in pattern), ^ and $ match the beginning and end of line. Default: ^ and $ match the beginning and end of whole string.
16 Don't need submatches. Set this flag if you use result array, and performance is important.
32 QM 2.3.0. Convert pattern from UTF-8 to ANSI. Used when QM is running in Unicode mode (ignored otherwise). Set this flag if pattern contains non-ASCII characters but string is ANSI (not UTF-8). It is needed because in UTF-8 these characters in pattern normally consist of 2 or 3 bytes, whereas in an ANSI string each character is 1 byte.
128 Only compile pattern.
pcre flags  
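As a small illustration of combining flag values, the sketch below passes 3 (1 + 2) so that the search is both case insensitive and whole word. The subject string is invented for the example.

```qm
 Flags 1 and 2 combined (1+2=3): case insensitive, whole word.
 "Macros" is skipped because it is not a whole word; "MACRO" is found.
str subject="Macros and MACRO"
str s
if(findrx(subject "macro" 0 3 s)>=0) out s
```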

result - variable of type str, int, ARRAY(str) or ARRAY(CHARRANGE).

submatch - index of the submatch to find. Integer. If 0 (default), finds the whole match. Not used if result is array.

 

Remarks

Finds a substring in string. The substring is specified by a regular expression (pattern). The function can find a whole match, a submatch, or all matches and submatches. A match is the part of string that matches pattern. A submatch is the part of the match that matches a captured subpattern. A captured subpattern is a part of pattern that is enclosed in parentheses and does not begin with ?.

 

The return value depends on flags and arguments:

default - zero-based index of the first character of the match in string, or -1 if not found.
nonzero submatch - zero-based index of the first character of the submatch in string, or -1 if not found.
flag 4 - the number of matches found, or 0 if not found.
flag 128 - undefined.

 

result can be used to get more information about the found match and submatches. The following table shows what is stored in result, depending on its type. Assume that flag 4 is not used.

str - receives the match, or the submatch if submatch is nonzero. If flag 128 is set, receives the compiled pattern.
int - receives the length of the match or submatch.
ARRAY(str) - receives the match in element 0 and submatches in subsequent elements.
ARRAY(CHARRANGE) - receives start and end offsets of the match and submatches. To extract a matched substring as a separate string, use str function get (see examples).
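A minimal sketch of the int case from the table above: the return value gives the offset of the match and the int variable receives its length. The variable names are chosen for the example only.

```qm
 Get offset and length of the first run of digits (offset=3 length=2)
str subject="abc10 100 def"
int i n
i=findrx(subject "\d+" 0 0 n)
if(i>=0) out "offset=%i length=%i" i n
```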

 

Definition of CHARRANGE (defined by QM):

 

type CHARRANGE cpMin cpMax

 

cpMin - start of substring (match or submatch) in string. It is zero-based index of first character of substring in string.

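cpMax - end of substring. It is the zero-based index of the character that follows the last character of substring, so that cpMax-cpMin is the substring length (see examples).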

 

If flag 4 is set and result is an array, the function finds all matches and creates a two-dimensional array. To access an element, use result[x y], where y is the index of the found match (0 - first match, 1 - second match, ...), and x is 0 or the submatch index (0 - whole match, 1 - first submatch, ...). For example, result[0 0] contains the first match, result[0 1] the second match, and result[1 0] the first submatch of the first match.
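The indexing described above can be sketched with a pattern that has two captured subpatterns. The subject string and pattern are made up for illustration; the return value (with flag 4, the number of matches) is used as the loop limit.

```qm
 Find all name=value pairs; the subject contains 3 of them
str subject="a=1 b=22 c=333"
ARRAY(str) a
int i n
n=findrx(subject "(\w+)=(\d+)" 0 4 a)
out n
for(i 0 n)
	 a[0 i] - whole match, a[1 i] - first submatch, a[2 i] - second submatch
	out "whole=%s name=%s value=%s" a[0 i] a[1 i] a[2 i]
```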

 

If flag 128 (only compile) is set and result is a str variable, the function does not search. It only compiles pattern and stores the compiled data in the result variable. You can later pass that variable as pattern to findrx and str.replacerx. If multiple operations are performed with the same pattern, using a compiled pattern improves performance, because the pattern does not have to be compiled each time. When compiling, only pattern, flags and result are used. You should use the same flags value when compiling and later.
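A rough sketch of compiling once and reusing the compiled pattern. It assumes that string is ignored when flag 128 is set (so an empty string is passed), and that flag 128 itself is omitted on the later calls; the text above does not spell out either detail, so treat both as assumptions.

```qm
 Compile the pattern once; string is not used here (assumption: "" is acceptable)
str rx
findrx("" "\d+" 0 128 rx)
 Search several strings using the compiled pattern in place of the pattern text
str s1="abc10 def"
str s2="x 2000"
out findrx(s1 rx)
out findrx(s2 rx)
```

If other flags (for example 1) are needed, they would be given both when compiling (as 128 plus the flag) and on the later calls, following the rule about using the same flags value.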

 

Examples

 Find digits (10)
str subject="abc10 100 def"
out findrx(subject "\d+")

 Find digits as whole word (100), and store into s
str subject="abc10 100 def"
str s
if(findrx(subject "\d+" 0 2 s)>=0) out s

 Extract HTML tags (simplified; useful only as "find all" example)
str html
IntGetFile("http://www.google.com" html)
str pattern="<(.*)>.*<\/\1>" ;;matches a HTML tag
ARRAY(str) a
findrx(html pattern 0 4 a)
int i
for(i 0 a.len)
	out "submatch=%s, whole=%s" a[1 i] a[0 i]

 Extract URL components
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
int i; ARRAY(str) a
if(findrx(subject pattern 0 0 a)<0) out "does not match"; ret
for i 0 a.len
	out a[i]

 Extract URL components; show offsets and lengths
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
int i; ARRAY(CHARRANGE) a
if(findrx(subject pattern 0 0 a)<0) out "does not match"; ret
for i 0 a.len
	int offset(a[i].cpMin) length(a[i].cpMax-a[i].cpMin)
	str s.get(subject offset length)
	out "offset=%i length=%i %s" offset length s

 Extract only server from URL 
str subject="http://msdn.microsoft.com:80/scripting/default.htm"
str pattern="(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)"
str server
if(findrx(subject pattern 0 0 server 2)>=0) out server