Character Classes


A character class is used to represent a set of characters. The following combinations are allowed in describing a character class:

Lua patterns come with a selection of "built-in" character classes that are useful in many situations:

Class Description
x (where x is not one of the "magic" characters) represents the character x itself.
%x (where x is any non-alphanumeric character) represents the character x.
. (a dot) represents all characters.
%a represents all letters.
%c represents all control characters.
%d represents all digits.
%l represents all lowercase letters.
%p represents all punctuation characters.
%s represents all space characters.
%u represents all uppercase letters.
%w represents all alphanumeric characters.
%x represents all hexadecimal digits.
%z represents the character with representation 0 (deprecated since Lua 5.2, where a pattern may contain "\0" directly).
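As a minimal sketch of how a few of these classes behave, the following uses the standard string.match and string.find functions. The sample string is invented for illustration, and the results assume the default C locale.

```lua
-- Sample input invented for this illustration.
local s = "Room 42B, floor 3"

-- %d matches a single digit; string.match returns the first match.
print(string.match(s, "%d"))     -- 4
-- Two classes in sequence: an uppercase letter, then a lowercase one.
print(string.match(s, "%u%l"))   -- Ro
-- %p matches punctuation; string.find returns start and end positions.
print(string.find(s, "%p"))      -- 9   9
-- A dot matches any character, including the space.
print(string.match(s, "m.4"))    -- m 4
```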

When the built-in character classes are not sufficient for a task, custom character classes can easily be defined. A custom character class is defined by surrounding the characters that should be included in the class with square brackets.

For example, [set] represents a character class consisting of the letters s, e, and t. When a character class contains a contiguous sequence of letters or digits, that sequence can be written in a shorthand notation consisting of the first and last characters of the sequence, separated by a hyphen (-). For example, the sequence of digits 0123456789 can be shortened to 0-9. As an even shorter alternative, Lua allows built-in character classes to be included in custom class specifications, so [%d_] is a class containing all digits and the underscore.
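The three forms of custom class described above can be sketched as follows. The input strings are made up for illustration.

```lua
-- [set] matches one character that is s, e, or t.
print(string.match("best", "[set]"))                  -- e
-- A range: [0-9] matches any single digit.
print(string.match("id=7", "[0-9]"))                  -- 7
-- A built-in class inside a set: a lowercase letter or an underscore.
-- The class is repeated three times to match three such characters.
print(string.match("x = my_var", "[%l_][%l_][%l_]"))  -- my_
```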

Finally, a character class can also be defined as the "complement" (or inverse) of a specified set. For built-in character classes, the complementary class is written with the uppercase class letter: since %d is the class containing all digits 0-9, %D is the class containing all characters except those. For custom character classes, inversion is achieved by making the first character of the set a ^. As an example, [^set] defines a character class containing all characters except s, e, and t.
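A short sketch of both kinds of complement, again with invented sample strings:

```lua
-- %D is the complement of %d: the first non-digit character.
print(string.match("2024-05", "%D"))     -- -
-- [^set] matches the first character that is not s, e, or t.
print(string.match("test!", "[^set]"))   -- !
-- %S is the complement of %s: find the first non-space character.
print(string.find("   indented", "%S"))  -- 4   4
```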

Here are some examples showing alternative implementations for some of the built-in character classes above:

Class Equivalent
%d [0123456789]
%d [0-9]
%D [^0-9]
%a [a-zA-Z]
%l [a-z]
%u [A-Z]
%a [%l%u]
%w [%a%d]
%x [0-9a-fA-F]

In a few cases we show multiple alternative implementations for the same built-in character class in order to show their flexibility.
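One way to check such equivalences is to compare two patterns character by character. The sketch below is only an illustration: same_class is a hypothetical helper, it tests single-byte characters 1 to 127 only, and it assumes the default C locale (the built-in classes depend on the current locale).

```lua
-- Hypothetical helper: returns true when both patterns agree on every
-- single-byte character from 1 to 127 (ASCII only; byte 0 is skipped).
local function same_class(a, b)
  for code = 1, 127 do
    local ch = string.char(code)
    local in_a = string.find(ch, a) ~= nil
    local in_b = string.find(ch, b) ~= nil
    if in_a ~= in_b then
      return false
    end
  end
  return true
end

print(same_class("%d", "[0-9]"))        -- true
print(same_class("%a", "[%l%u]"))       -- true
print(same_class("%x", "[0-9a-fA-F]"))  -- true
-- Uppercase A-F are hexadecimal digits too, so this one differs.
print(same_class("%x", "[0-9a-f]"))     -- false
```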

Magic Characters

In the preceding discussion we saw a number of characters that had special meaning. For example, % indicates a built-in character class, [ and ] indicate a custom character class, and - indicates a range of characters. These (and a few more) characters are designated "magic characters" because they have special meaning in patterns.

The magic characters are:

^$()%.[]*+-?

This is important to understand when a pattern should interpret one of the magic characters literally. For example, we previously matched a phone number that contained the magic characters +, (, ), and -.
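To see why this matters, here is a small sketch of what happens when a magic character is left unescaped. The inputs are invented, and the escaped forms use the % prefix from the class table above.

```lua
-- An unescaped dot matches any character, so the first one wins.
print(string.find("a.b", "."))     -- 1   1
-- Escaped, it matches only the literal dot.
print(string.find("a.b", "%."))    -- 2   2

-- An unescaped - after "a" means "as few a's as possible", so the
-- pattern "a-b" only finds the final b here.
print(string.find("a-b", "a-b"))   -- 3   3
print(string.find("a-b", "a%-b"))  -- 1   3

-- An unescaped ( opens a capture; left unfinished, it raises an error.
local ok = pcall(string.find, "f(x)", "(")
print(ok)                          -- false
```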

There are two ways to indicate a literal interpretation of magic characters. First, a custom character class consisting of the magic character can be used to force a literal interpretation:

local pattern = "[+]%d[(]%d%d%d[)]%d%d%d[-]%d%d%d%d"

print(string.match("+1(234)567-8910", pattern)) -- +1(234)567-8910
print(string.match("+1(23)4567-8910", pattern)) -- nil
print(string.match("(234)567-8910", pattern)) -- nil

The second option is to escape the magic character with a %:

local pattern = "%+%d%(%d%d%d%)%d%d%d%-%d%d%d%d"

print(string.match("+1(234)567-8910", pattern)) -- +1(234)567-8910
print(string.match("+1(23)4567-8910", pattern)) -- nil
print(string.match("(234)567-8910", pattern)) -- nil

Which to use is primarily a matter of personal preference and readability.
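When the text to match is not known in advance, escaping by hand is not practical. The following is a minimal sketch of two alternatives: escape_pattern is a hypothetical helper that relies on the rule that % followed by any non-alphanumeric character matches that character literally, and the fourth argument of string.find is the standard "plain" flag that disables pattern matching altogether.

```lua
-- Hypothetical helper: prefixes every punctuation character with %,
-- so the result matches the input text literally.
local function escape_pattern(text)
  -- In the replacement, %% is a literal % and %0 is the whole match.
  -- The parentheses drop gsub's second return value (the count).
  return (string.gsub(text, "%p", "%%%0"))
end

local literal = "+1(234)567-8910"
local escaped = escape_pattern(literal)
print(escaped)                                 -- %+1%(234%)567%-8910
print(string.match(literal, escaped))          -- +1(234)567-8910

-- With plain = true, string.find performs a plain substring search.
print(string.find(literal, "(234)", 1, true))  -- 3   7
```

The helper escapes more characters than strictly necessary (for example a comma), which is harmless because an escaped non-alphanumeric character always stands for itself.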