Regular Expressions

Regular expressions provide a powerful way of matching, modifying and extracting information from strings. Using a special syntax, work that would usually take line after line of hand-written matching code can be summarised in a one-line regular expression (from here on referred to as a regex). Regexes are either built into the language, e.g. Perl or ferite, or supplied as an add-on library, e.g. Python, PHP and C. ferite's regexes are provided by means of PCRE (Perl Compatible Regular Expressions, a C library that can be found at http://www.pcre.org) and as a result are almost identical in operation to Perl's. Regexes look like this:

Example:

    s/1(2)3/456/

This one matches the string "123" and swaps it with "456". Only the first occurrence is swapped unless the g option (described below) is given.

    s/W(or(l))d/Ch\1ris\2/

This is more complicated: it matches "World" and swaps it with "Chorlrisl". The replacement is built from the captured strings "orl" and "l" by means of backticks, which are discussed below.
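The two swaps above can be checked outside ferite. The following is a minimal sketch using Python's re module as a stand-in, on the assumption that its behaviour matches PCRE for these simple patterns; the input strings and variable names are made up for illustration, and count=1 is used to mimic a swap without the g option.

```python
import re

# Python analogue of s/1(2)3/456/ : swap the first "123" for "456".
# count=1 mimics a swap without the g option (first occurrence only).
first = re.sub(r"1(2)3", "456", "0123 123", count=1)

# Python analogue of s/W(or(l))d/Ch\1ris\2/ :
# group 1 captures "orl" and group 2 captures "l".
second = re.sub(r"W(or(l))d", r"Ch\1ris\2", "Hello World", count=1)

print(first)   # 0456 123
print(second)  # Hello Chorlrisl
```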

There are three types of regular expression operation: match, swap and capture. They are used as follows:

    m/expression to match/
    s/expression to match/string to replace it with/
    c/expression to match/comma,separated,list,of,variables/

To match, an m is used; to swap, an s is used; to capture, a c is used. It is possible to capture strings within the regular expression using the same method as in Perl, i.e. by using brackets. On each match the captured strings are placed into r<bracket number>, which is equivalent to the $1, $2, ... $n variables in Perl. The maximum number of captured strings is currently 99. The expressions above show captured strings in use: in the first, (2) causes "2" to be placed in r1; in the second, (or(l)) causes "orl" to be placed in r1 and "l" to be placed in r2.
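To see how brackets are numbered, here is a small sketch in Python's re module rather than ferite. The names r1 and r2 are ordinary Python variables chosen to mirror the text; they are not created automatically as they are in ferite. Groups are numbered by the position of their opening bracket, which is why the outer group comes first.

```python
import re

# (2) is the only bracket pair, so it is group 1.
simple = re.search(r"1(2)3", "123")
print(simple.group(1))  # 2

# The outer bracket opens first, so it is group 1; the inner one is group 2.
nested = re.search(r"W(or(l))d", "Hello World")
r1, r2 = nested.group(1), nested.group(2)
print(r1, r2)  # orl l
```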

Options

There are a number of options that modify how a regular expression is executed and processed. These are:

x - This option allows the regular expression to be multi line, and also allows comments using the # character. This is useful for long regular expressions where it is important to remember what each individual part performs.
s - This allows the . (dot) matching character to match newlines (\n's).

m - This gets the ^ and $ meta characters to match at newlines within the source string.

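Example:

    string foo = "Hello\nWorld\nFrom\nChris";
    foo =~ s/^(.*)$/Foo/gm;

After the swap, foo contains "Foo\nFoo\nFoo\nFoo": the m option lets ^ and $ match around each line, and the g option repeats the swap for every line. The s option is left out here, because with it .* would run across the newlines and swallow the whole string in a single match.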

i - This causes the regex engine to match without regard to the case of the characters being processed.
e - This causes the replacement string to be evaluated as if it had been passed to eval(). The return value of the script is used as the replacement text, so it must be a string.

Example:

    string foo = "Hello World";
    foo =~ s/Hello/return "Goodbye";/ge;
    Console.println( foo );

foo will now equal "Goodbye World".

g - This causes every occurrence in the string to be matched. Normally only the first occurrence is matched.
o - This causes the regular expression to be compiled at compile time rather than at runtime. This is useful for regular expressions that do not change but are used a lot within a script.
A - The pattern will only match if it matches at the beginning of the string being searched.
D - This option makes the $ meta character match only at the very end of the string being searched, rather than also just before a trailing newline.
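The effect of several of these options can be tried out with Python's re module, which offers close analogues. This is only a sketch under that assumption: the flag names (re.X, re.S, re.M, re.I), the count argument standing in for g, a function replacement standing in for e, and re.match standing in for A are all Python features, not ferite syntax, and the sample strings are made up.

```python
import re

text = "Hello\nWorld\nFrom\nChris"

# s: the dot also matches newlines, so ".*" can run across lines.
with_s = re.search(r"Hello.*Chris", text, flags=re.S) is not None
without_s = re.search(r"Hello.*Chris", text) is not None

# m: ^ and $ also match at the newlines inside the string.
lines = re.findall(r"^\w+$", text, flags=re.M)

# i: case is ignored while matching.
no_case = re.sub(r"hello", "Goodbye", "Hello World", count=1, flags=re.I)

# x: the pattern may span lines and carry # comments.
verbose = re.compile(r"""
    W          # a literal W
    (or(l))    # group 1 is orl, group 2 is l
    d          # a literal d
""", re.X)
groups = verbose.search("Hello World").groups()

# g: Python swaps every occurrence by default; count=1 swaps only the first.
every = re.sub(r"o", "0", "foo boo")
first_only = re.sub(r"o", "0", "foo boo", count=1)

# e: a function computes the replacement text for each match.
evaluated = re.sub(r"Hello", lambda m: m.group(0).upper(), "Hello World", count=1)

# A: re.match only succeeds at the beginning of the string.
anchored_miss = re.match(r"World", "Hello World") is None

print(with_s, without_s)      # True False
print(lines)                  # ['Hello', 'World', 'From', 'Chris']
print(no_case)                # Goodbye World
print(groups)                 # ('orl', 'l')
print(every, "|", first_only) # f00 b00 | f0o boo
print(evaluated)              # HELLO World
print(anchored_miss)          # True
```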

Backticks

Backticks are used within the swap mode of regular expressions. They allow you to use captured strings within the string that replaces the matched expression. They are used in the second example above. A backtick is written as a '\' (back slash) followed by the number of the captured string that you want to use.

This is only a brief insight into regular expressions. A suggested read is "Mastering Regular Expressions" by Jeffrey E. F. Friedl (published by O'Reilly), which covers the subject in depth. :-) The libpcre documentation at http://www.pcre.org is also worth reading.