
Regular expressions

This document describes the operation of the regular expression option of substitution rules. While this document is not intended as a resource for learning regular expressions, it will provide a brief introduction for the new user.

Regular Expressions in Clonable

If you are already familiar with regular expressions (regex) from other programs, you may find that they do not always behave as you expect within Clonable. Clonable uses a non-backtracking regex engine to avoid the problems that backtracking can cause. As a result, a complicated regex may not work, or may work differently than in a backtracking engine.

Regex engines

Most regex engines are based on Perl Compatible Regular Expressions (PCRE). These engines often use backtracking, which is not suitable within Clonable for security reasons. Therefore, Clonable uses a non-backtracking engine that supports a subset of PCRE. If you have specific questions about this, please contact us.

Introducing regular expressions

If you already have experience with regular expressions (regex), you probably don't need to read this chapter. If you are not yet familiar with regex, this is a small introduction to get you started. At the end of this chapter, further resources will be listed where you can practice more with regex.

What is a regex used for?

A regex is used when you want to capture multiple things that are very similar, but different in some places. For example, sentences that mention stock containing a number, or sentences containing different dates. You could create a separate substitution rule for every possible combination, but this is often impossible or takes a lot of time.

Character groups

Character groups are used in regex to indicate what kind of character may be in a certain place. A character group is always written between two square brackets ([]), with the allowed characters listed inside. This can also be a range of characters: for example, [0-9] matches all digits from 0 to 9 and [A-Z] matches all uppercase letters. Character ranges can also be combined; for example, [A-Za-z] matches all upper and lower case letters.

// Matches both Hallo and hallo
[Hh]allo

// Matches any number from 100 to 499:
[1-4][0-9][0-9]
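
The two patterns above can be tried out with a short script. This is a minimal sketch in Python, whose `re` module is a backtracking engine and not the engine Clonable uses; for simple character groups like these the results should be the same. The sample strings are made up for illustration.

```python
import re

# Illustration only: Python's re module, not Clonable's engine.
greeting = re.compile(r"[Hh]allo")
number = re.compile(r"[1-4][0-9][0-9]")

# fullmatch requires the whole string to match the pattern.
greeting_results = {
    word: bool(greeting.fullmatch(word)) for word in ["Hallo", "hallo", "Jallo"]
}
number_results = {
    text: bool(number.fullmatch(text)) for text in ["100", "499", "500", "99"]
}

print(greeting_results)
print(number_results)
```

Only "Hallo" and "hallo" pass the first pattern, and only "100" and "499" pass the second: "500" starts with a digit outside [1-4], and "99" is one digit too short.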

Quantifiers

As the example above shows, [0-9] had to be repeated. Besides cluttering the regex, this is also inflexible (for example, this approach cannot produce a regex that matches the numbers 1 to 99). To solve this problem, you can use quantifiers. A quantifier indicates how many times the character or character group before it should occur. Quantifiers are written between braces ({}), with the minimum and maximum separated by a comma. If the minimum and maximum are the same, you only need to enter one number. For example, {1,3} means at least 1 and at most 3, and {4,4} and {4} both mean exactly 4.

In addition, there are a number of special quantifiers. ? means 0 or 1, + means 1 or more, and * means 0 or more.

// Matches hoi and hooi
ho{1,2}i

// Matches any number starting with a 1
1[0-9]*

// Matches any number starting with a 1 and that is greater than or equal to 10
1[0-9]+

// Matches 1 and any number from 10 to 19
1[0-9]?

// Matches any word (upper and lower case) 6 letters long
[A-Za-z]{6}
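
The quantifier examples can be checked the same way. The sketch below uses Python's `re` module purely for illustration (it is not Clonable's engine), and the helper name `matches` and the sample strings are assumptions made for this example.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, sample text) pairs; the samples are invented for illustration.
samples = [
    (r"ho{1,2}i", "hooi"),      # two o's: within {1,2}
    (r"ho{1,2}i", "hoooi"),     # three o's: too many
    (r"1[0-9]*", "1"),          # * allows zero extra digits
    (r"1[0-9]+", "1"),          # + needs at least one extra digit
    (r"1[0-9]?", "19"),         # ? allows one extra digit
    (r"1[0-9]?", "190"),        # but not two
    (r"[A-Za-z]{6}", "Banana"), # exactly six letters
]

results = [matches(pattern, text) for pattern, text in samples]
for (pattern, text), ok in zip(samples, results):
    print(f"{pattern!r} on {text!r}: {ok}")
```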

// Try it yourself: create a regex that matches all words that start with an uppercase letter. (Answer: see footnote 1)

Using + and *

If you use the + or * quantifiers in your regex, it is usually a good idea to use +? or *? instead. By default, + and * match as much text as possible. Adding ? makes them match as little as possible, which prevents the regex from matching far more than you intended. Example:

// Between [] is indicated what will be matched
// [<img src="/images/photo.jpg"><a href="/home"]>home</a>
<img src=".*"

// [<img src="/images/photo.jpg"]><a href="/home">home</a>
<img src=".*?"

As you can imagine, the first behavior is problematic in a full web page.
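
The difference between the two patterns can be reproduced with a few lines of Python. This is only a sketch: Python's `re` module is not the engine Clonable uses, but greedy and lazy quantifiers give the same result here.

```python
import re

html = '<img src="/images/photo.jpg"><a href="/home">home</a>'

# Greedy: .* runs on to the LAST double quote in the text.
greedy = re.search(r'<img src=".*"', html).group(0)

# Lazy: .*? stops at the FIRST double quote after src=".
lazy = re.search(r'<img src=".*?"', html).group(0)

print(greedy)  # <img src="/images/photo.jpg"><a href="/home"
print(lazy)    # <img src="/images/photo.jpg"
```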

(Match) groups

There may be times when you want to be more specific than a character group of a certain length. Regex has a solution for that too, in the form of (match) groups. A match group is placed between two parentheses (()). Within this group, you can place a series of characters and/or character groups that together form a whole. A quantifier can be placed after the group.

// Matches nom, nomnom, nomnomnom and nomnomnomnom
(nom){1,4}

// Matches any word containing the letters ou
[A-Za-z]*(ou)[A-Za-z]*
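
Both group examples can be verified with a small script. As before, this is a sketch in Python's `re` module rather than Clonable's engine, and the test words are chosen only for illustration.

```python
import re

nom = re.compile(r"(nom){1,4}")

# Try "nom" repeated 0 to 5 times; only 1 to 4 repetitions should match.
nom_results = [bool(nom.fullmatch("nom" * count)) for count in range(6)]

word_with_ou = re.compile(r"[A-Za-z]*(ou)[A-Za-z]*")
ou_results = {
    word: bool(word_with_ou.fullmatch(word)) for word in ["house", "you", "home"]
}

print(nom_results)
print(ou_results)
```

The quantifier applies to the whole group, so zero or five repetitions of "nom" are rejected, and "home" fails the second pattern because it does not contain the letters ou next to each other.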

// Try it yourself: create a regex that matches all words containing 2 or 4 vowels in a row. (Answer: see footnote 2)

In addition, match groups are very useful for carrying data (e.g. numbers) over into the replacement of a substitution. You can use $[group number] in the replacement to copy the contents of a match group. See the simple example below for moving the euro sign in a money amount from before to after the number:

Original

Replacement

Options
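
The idea of moving the euro sign can be sketched in Python. The pattern `€([0-9]+)` and the sample sentence are assumptions made for this illustration, not values taken from Clonable. Note also that Python writes a group reference as `\1`, where a Clonable replacement would use `$1`.

```python
import re

# Assumed pattern: a euro sign followed by one or more digits,
# with the digits captured as group 1.
pattern = re.compile(r"€([0-9]+)")

text = "Price: €25 or €1299"

# \1 inserts the contents of group 1; the euro sign is placed after it.
moved = pattern.sub(r"\1€", text)

print(moved)  # Price: 25€ or 1299€
```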

Using reserved characters

As you may have noticed, regex uses a number of characters for its syntax that may also occur in ordinary sentences (for example, ? or .). You may then wonder how to match those characters themselves. Fortunately, that is very simple: if you want to use a character without it performing its regex function, you put a backslash (\) in front of it.

// Matches Hello. but also Hellok or Hello@, etc.
Hello.

// Matches only Hello.
Hello\.


// Matches international
(in)ternational

// Matches (in)ternational
\(in\)ternational
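
The effect of the backslash can be checked as follows. This is a sketch using Python's `re` module for illustration only; the dictionary keys are just labels for the printed output.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

escape_results = {
    "unescaped dot, 'Hellok'": matches(r"Hello.", "Hellok"),
    "escaped dot, 'Hellok'": matches(r"Hello\.", "Hellok"),
    "escaped dot, 'Hello.'": matches(r"Hello\.", "Hello."),
    # Unescaped parentheses form a group and do not match literal ( and ).
    "group, 'international'": matches(r"(in)ternational", "international"),
    "group, '(in)ternational'": matches(r"(in)ternational", "(in)ternational"),
    "escaped, '(in)ternational'": matches(r"\(in\)ternational", "(in)ternational"),
}

for label, ok in escape_results.items():
    print(f"{label}: {ok}")
```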

Other tips

  • A . is a wildcard for any character. Thus, with .* you match everything.
  • [\d] is an alias for [0-9].
  • [A-z] is not the same as [A-Za-z]; it also contains a number of punctuation marks.
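
The last tip can be made concrete by listing the ASCII characters that [A-z] accepts but [A-Za-z] does not. This is a sketch in Python's `re` module, used here only for illustration.

```python
import re

# [A-z] covers every character from 'A' (code 65) to 'z' (code 122),
# including the six characters that sit between 'Z' and 'a'.
extra = [
    ch
    for ch in map(chr, range(128))
    if re.fullmatch(r"[A-z]", ch) and not re.fullmatch(r"[A-Za-z]", ch)
]

print(extra)  # ['[', '\\', ']', '^', '_', '`']
```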

Further references

1 [A-Z][a-z]*

2 [A-Za-z]*([aeiou]{2}){1,2}[A-Za-z]*