ripgrep - Regular Expression Syntax

We discussed in the previous chapter that each time ripgrep is called it goes through a few basic steps:

Select files to be searched
Apply the specified pattern(s) to each line in the specified input(s)
Format and return each line of output

The previous chapter discussed how to customize which files are used as ripgrep's input. This chapter is the first of two parts on defining the patterns used to select lines from that input; part two covers the command-line options that are available to fine-tune our searches.

This section reviews the basics of working with ripgrep's regular expression syntax, which is mostly similar to other common engines but has a few differences. We focus on the most common aspects of the syntax to help day-to-day usage, and leave some of the more esoteric aspects to the official documentation.

Matching Literal Strings

Matching literal strings is straightforward, and provides a simple review of basic pattern-building:

x: Match a single literal character x
xy: Concatenation
x|y: Alternation

In these examples, the letters x and y each represent an "atomic unit" that is matched against a string, and together they demonstrate a few of the basic operations that are used when building patterns.

The concatenation operation matches strings that contain two atomic units next to each other. In this example, the pattern xy will match any of wxy, xy, and xyz, but not x, y, or xz.

Alternation, on the other hand, allows patterns to match either of several atomic units, which is often considered an "OR operation". Here the pattern x|y will match either an x or a y. Although it doesn't make much sense in this example, for completeness we should point out that when both alternatives can match at the same position in a string, the x alternative will be used since it came first.

We will see in a few moments that other types of atomic units exist. They behave differently than literal strings, but their function within an overall pattern remains essentially the same.
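The two operations above can be tried out in a few lines. This is a minimal sketch that uses Python's built-in re module as a stand-in (an assumption made for illustration; ripgrep itself does not use this engine), since concatenation and alternation behave the same way for these simple patterns. The sample strings are the ones from the text.

```python
import re

# Concatenation: "xy" matches wherever an x is immediately followed by a y.
candidates = ["wxy", "xy", "xyz", "x", "y", "xz"]
concat_hits = [s for s in candidates if re.search(r"xy", s)]
print(concat_hits)  # ['wxy', 'xy', 'xyz']

# Alternation: "x|y" matches either an x or a y.
alt_hits = re.findall(r"x|y", "wxyz")
print(alt_hits)  # ['x', 'y']

# When both alternatives can match at the same position,
# the alternative listed first is the one that is used.
first_wins = re.search(r"x|xy", "xy").group(0)
print(first_wins)  # x
```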

Repetition

Suppose we want to match strings containing 4 x characters in a row. As we learned above we could simply use the pattern xxxx, which is perfectly fine and quite common. However, patterns allow more powerful ways of representing repetition that can be very useful in more advanced usage.

Simple repetition is represented by following an atomic unit with a *, +, or ? character. Note that the regular expression engine understands that these are special characters and doesn't match them directly. Instead, it uses them to define how the previous atomic unit should be matched.

Here are the basics:

x*: zero or more of x (greedy)
x+: one or more of x (greedy)
x?: zero or one of x (greedy)

These are the most important to remember, as they will be used repeatedly (pun intended) while building patterns. Each of these patterns is greedy, meaning that it will match as much of the string as possible. This behavior is not always wanted, and it can be changed by appending a ?:

x*?: zero or more of x (ungreedy / lazy)
x+?: one or more of x (ungreedy / lazy)
x??: zero or one of x (ungreedy / lazy)

In each of these cases the second ? modifies the preceding pattern so that it is no longer greedy (which is often referred to as ungreedy or lazy). These patterns behave like their previous counterparts, except they match as little of the string as possible. We will see some examples where this can be beneficial.
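The difference between greedy and lazy matching is easiest to see side by side. The sketch below again assumes Python's re module as a stand-in for ripgrep's engine; the sample strings are made up for illustration.

```python
import re

# Greedy: x+ consumes as many x characters as it can.
greedy_x = re.search(r"x+", "xxxx").group(0)
# Lazy: x+? stops as soon as the pattern is satisfied.
lazy_x = re.search(r"x+?", "xxxx").group(0)
print(greedy_x, lazy_x)  # xxxx x

# A more practical case: matching a single bracketed tag.
text = "<a><b>"
greedy_tag = re.search(r"<.+>", text).group(0)
lazy_tag = re.search(r"<.+?>", text).group(0)
print(greedy_tag)  # <a><b>
print(lazy_tag)    # <a>
```

The greedy form runs to the last > on the line, while the lazy form stops at the first one, which is usually what is wanted when picking out one delimited item.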

ripgrep's regular expression engine also supports a more expressive repetition operator, which allows the pattern to define specific ranges of repetition that should be matched.

x{n}: exactly n x
x{n,}: at least n x (greedy)
x{n,m}: at least n and at most m x (greedy)

Like the previous repetition operators, these can also be modified to define their greediness, again by using the ? operator:

x{n}?: exactly n x
x{n,}?: at least n x (ungreedy / lazy)
x{n,m}?: at least n x and at most m x (ungreedy / lazy)
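As a quick illustration of the counted forms, here is a small sketch run against a string of five x characters. Python's re module is assumed as a stand-in for ripgrep's engine; the counts 2, 3, and 4 are arbitrary choices for the example.

```python
import re

text = "xxxxx"

exact = re.search(r"x{3}", text).group(0)            # exactly 3
at_least = re.search(r"x{2,}", text).group(0)        # 2 or more, greedy
bounded = re.search(r"x{2,4}", text).group(0)        # 2 to 4, greedy
bounded_lazy = re.search(r"x{2,4}?", text).group(0)  # 2 to 4, lazy

print(exact)         # xxx
print(at_least)      # xxxxx
print(bounded)       # xxxx
print(bounded_lazy)  # xx
```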

Anchors

Another powerful feature of regular expressions is that you can not only define what to match, but also at what locations within a string it should match. The first two are the most common:

^x: match the letter x, but only at the beginning of the string, or at the start of the line in multi-line mode
x$: match the letter x, but only at the end of the string, or at the end of the line in multi-line mode

Note that these behave differently depending on whether we are matching in single-line or multi-line mode. Two additional escapes are available that ensure consistent behavior regardless of mode.

\A: match only the beginning of the string, even in multi-line mode
\z: match only the end of the string, even in multi-line mode
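The effect of multi-line mode on these anchors can be sketched as follows. This assumes Python's re module as a stand-in, where multi-line mode is switched on with the re.MULTILINE flag (a Python detail, not a ripgrep option). The end-of-string escape is left out because Python has traditionally spelled it \Z.

```python
import re

text = "xa\nxb"  # two lines, each starting with x

# Single-line mode: ^ only matches at the start of the whole string.
single = re.findall(r"^x.", text)
# Multi-line mode: ^ also matches at the start of every line.
multi = re.findall(r"^x.", text, flags=re.MULTILINE)
# \A always means the start of the string, whatever the mode.
start_only = re.findall(r"\Ax.", text, flags=re.MULTILINE)

print(single)      # ['xa']
print(multi)       # ['xa', 'xb']
print(start_only)  # ['xa']
```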

Whereas the previous anchors match at the beginning and end of a string, the following perform roughly the same function at "word boundaries" within the string. A word boundary is the transition between a word character and a non-word character (or the start or end of the string), such as that from whitespace to a letter (at the beginning of a word), and that from a letter back to whitespace (at the end of the word).

\b: match at a Unicode word boundary
\B: match at any position that is not a Unicode word boundary

Note in these examples that the behavior of the capitalized B is the inverse of the lower-case b. We will see this repeat in other cases, so recognizing it will be helpful to learning how various components of regular expressions work.

One final note - this section described matching the transition between two types of characters, rather than either of the characters themselves. Sometimes these are called "zero-width matches", to reinforce that while they help define which text to match, they are not part of the matched text itself.
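A short sketch of word boundaries in action, once more assuming Python's re module as a stand-in for ripgrep's engine. The sample sentence is invented for the example.

```python
import re

text = "cat concat catalog"

# \b on both sides: only the standalone word matches.
whole_word = re.findall(r"\bcat\b", text)
# \b on the left only: words that start with "cat".
starts_with = re.findall(r"\bcat\w*", text)
# \B on the left: "cat" must be preceded by another word character.
inner = re.search(r"\Bcat\b", text)

print(whole_word)     # ['cat']
print(starts_with)    # ['cat', 'catalog']
print(inner.start())  # 7, the "cat" at the end of "concat"

# Zero-width: marking each boundary inserts text without consuming any.
marked = re.sub(r"\b", "|", "hi there")
print(marked)  # |hi| |there|
```

The last line shows the zero-width nature of the match: every original character is still present, and the markers only appear at the transitions.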

Defining Character Classes

Most real patterns replace one or both letters in the examples above with a "character class", since we generally want to match any one of a set of characters rather than a single specific character. ripgrep's regex engine supports several types of character classes.

Character classes provide a convenient means of matching characters that belong to a defined group or range of characters. The following examples demonstrate the most common operations:

[0-9]: A character class matching any digit in range 0-9.
[0-35-9]: A character class matching digits from 0 to 9, except for 4 (union).
[^4]: A character class matching any character other than 4 (negation).
[4[^0-9]]: Nested character class (matching 4 or any character that is not a digit)
[0-4&&4-9]: Intersection (matching only 4)
[0-9--4]: Subtraction (matching 0-9 except 4)
[0-4~~4-9]: Symmetric difference (matching 0-9 except 4)
[0-9&&[^4]]: Subtraction using intersection and negation (matching 0-9 except 4)
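The set operators in this list map directly onto ordinary set algebra. Python's re module does not support these operators, so the sketch below models each class as a plain Python set of characters instead. It is meant only to show which characters each expression selects, not how ripgrep evaluates them. The helper name and the ASCII-only universe are assumptions made for the example.

```python
def span(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}

# Assumption: restrict "any character" to ASCII to keep the sets small.
universe = {chr(code) for code in range(128)}

intersection = span("0", "4") & span("4", "9")       # [0-4&&4-9]
subtraction = span("0", "9") - {"4"}                 # [0-9--4]
symmetric = span("0", "4") ^ span("4", "9")          # [0-4~~4-9]
via_negation = span("0", "9") & (universe - {"4"})   # [0-9&&[^4]]

print("".join(sorted(intersection)))  # 4
print("".join(sorted(subtraction)))   # 012356789
print("".join(sorted(symmetric)))     # 012356789
print(via_negation == subtraction)    # True
```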

As with algebra, there are rules that define how character class definitions are interpreted, which is often referred to as defining the precedence of each operation:

Ranges have the highest precedence, so the following two expressions are equivalent: [a-cd] == [[a-c]d].
Unions have the next highest precedence, so [ab&&bc] == [[ab]&&[bc]].
Intersection, difference, and symmetric difference operations all have equivalent precedence, and are evaluated in left-to-right order, so [a-e--d-g&&b-j] == [[a-e--d-g]&&b-j].
Negation has the lowest precedence, so [^a-z&&b] == [^[a-z&&b]].
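To see why the left-to-right rule matters, the mixed expression from the list can be worked through by hand. As before, this sketch models the classes as Python sets rather than running them through a regex engine, and the helper name is hypothetical.

```python
def span(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}

a_e, d_g, b_j = span("a", "e"), span("d", "g"), span("b", "j")

# [a-e--d-g&&b-j] groups left to right: [[a-e--d-g]&&b-j]
left_to_right = (a_e - d_g) & b_j
# The grouping that is NOT used: [a-e--[d-g&&b-j]]
right_to_left = a_e - (d_g & b_j)

print(sorted(left_to_right))  # ['b', 'c']
print(sorted(right_to_left))  # ['a', 'b', 'c']
```

The two groupings give different answers, which is why it helps to add explicit nested brackets whenever several operators are mixed in one class.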

Finally, since brackets [ and ] are used to define character classes, how would one define a character class that contains one or both bracket characters? This can be done by escaping the bracket characters inside the character class, so that they are treated as literal characters:

[\[\]]: matches either [ or ]
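A quick check of the escaped-bracket class, assuming Python's re module as a stand-in (it accepts the same escaping inside a class). The input string is made up for the example.

```python
import re

# Find every literal bracket in a line of code-like text.
brackets = re.findall(r"[\[\]]", "a[0] = b[1]")
print(brackets)  # ['[', ']', '[', ']']
```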

Escape-style Character Classes

Users with experience using other regular expression engines will recognize some standard "escape" style character classes which generally work as expected, with the caveat that they are Unicode aware.

Whether this is a benefit or not depends on the use case, but it is good to be aware of because they might match more characters than expected, which can affect performance. If you are primarily looking to match ASCII text you may want to define the character classes as we saw previously, or use named character classes, which we discuss next.

.: match any character except the new line character
\.: match a literal .
\w: match any Unicode word character
\W: match anything but a Unicode word character
\d: match any Unicode digit
\D: match anything but a Unicode digit
\s: match Unicode whitespace
\S: match anything but Unicode whitespace

Note the convention that the lower-case escape defines the characters to be matched, while the upper-case escape defines the characters that should not be matched.
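The practical consequence of Unicode awareness is that these escapes accept characters outside ASCII. Python's re module happens to behave the same way for text patterns, so it is assumed here as a stand-in to illustrate the point. The sample characters are arbitrary.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE

# The escape-style classes accept non-ASCII digits and letters...
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
unicode_word = re.fullmatch(r"\w", e_acute) is not None
# ...while explicit ASCII ranges do not.
ascii_digit = re.fullmatch(r"[0-9]", arabic_three) is not None
ascii_word = re.fullmatch(r"[0-9A-Za-z_]", e_acute) is not None

print(unicode_digit, unicode_word)  # True True
print(ascii_digit, ascii_word)      # False False
```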

ASCII Named Character Classes

[[:ascii:]]: match anything defined by ASCII ([\x00-\x7F])
[[:alpha:]]: match letters ([A-Za-z])
[[:lower:]]: match lower case letters ([a-z])
[[:upper:]]: match upper case letters ([A-Z])
[[:digit:]]: match digits ([0-9])
[[:alnum:]]: match alphanumerics ([0-9A-Za-z])
[[:word:]]: match word characters ([0-9A-Za-z_])
[[:punct:]]: match punctuation characters ([!-/:-@\[-`{-~])
[[:blank:]]: match blank characters ([\t ])
[[:space:]]: match whitespace characters ([\t\n\v\f\r ])
[[:xdigit:]]: match hex digits ([0-9A-Fa-f])
[[:cntrl:]]: match control characters ([\x00-\x1F\x7F])
[[:graph:]]: match graphical characters ([!-~])
[[:print:]]: match printable characters ([ -~])

Character class operators work for named character classes so, for example, the ^ operator can be used to invert the [[:alpha:]] class like this:

[[:^alpha:]]: match anything but ASCII letters ([^A-Za-z])

Character classes can also be combined by placing them inside brackets, so a character class that matches only digits and upper case letters would be: [[:upper:][:digit:]].
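Python's re module does not understand the [[:name:]] syntax, so the sketch below checks the equivalent explicit ranges instead, comparing them against the constants in Python's string module. It is a way to confirm what the expanded ranges cover, under the assumption that the bracketed equivalents behave the same in both engines. The expand helper is hypothetical.

```python
import re
import string

ascii_chars = "".join(chr(code) for code in range(128))

def expand(char_class):
    """Return every ASCII character matched by a bracketed class."""
    return "".join(re.findall(char_class, ascii_chars))

punct = expand(r"[!-/:-@\[-`{-~]")   # expansion of [[:punct:]]
xdigit = expand(r"[0-9A-Fa-f]")      # expansion of [[:xdigit:]]
upper_or_digit = expand(r"[A-Z0-9]")  # same set as [[:upper:][:digit:]]

print(punct == string.punctuation)                   # True
print(set(xdigit) == set(string.hexdigits))          # True
print(upper_or_digit)  # 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
```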

Capture Groups

After text has been matched to a pattern, we often want to extract useful parts of the matched text. This can be done by defining capture groups within the pattern. Capture groups come in named and numbered types. Numbered capture groups are defined by putting the content to be captured inside parentheses, while named groups do the same but also include a name field. The following patterns match numbers such as 123.456 and 12.3456, but capture only the decimal components (456 and 3456, respectively):

[0-9]+\.([0-9]+): numbered capture group
[0-9]+\.(?P<name>[0-9]+): named capture group
[0-9]+\.(?<name>[0-9]+): named capture group

Numbered capture groups are indexed by their opening parenthesis, counting from left to right, starting from 1, and can be retrieved (such as when using the --replace option) using the pattern $1 for the first capture group, $2 for the second, etc. Index 0 ($0) is a special case, which contains the text of the entire match.

Named capture groups are indexed and can be retrieved just like numbered groups, but also have the option to be retrieved by name. Two syntaxes for naming groups are available: (?P<name>...), which is compatible with other common regular expression engines, and the more compact (?<name>...). They are otherwise equivalent.

Capture group names can include any sequence of alphanumeric characters, in addition to ., _, [ and ], though they must start with either an _ or a letter. So, for the named patterns above, the capture groups can be retrieved by either index or name, and $1 and $name return the same values.
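Retrieval by index and by name can be sketched as follows. This assumes Python's re module as a stand-in, using the (?P<name>...) form with the decimal point escaped so that it matches a literal dot. Note that Python writes replacement references as \1 and \g<name>, whereas ripgrep uses $1 and $name as described above. The input strings are made up for the example.

```python
import re

pattern = re.compile(r"[0-9]+\.(?P<name>[0-9]+)")
match = pattern.search("value: 123.456")

whole = match.group(0)         # index 0: the entire match
by_index = match.group(1)      # first capture group, by number
by_name = match.group("name")  # the same group, by name

print(whole)     # 123.456
print(by_index)  # 456
print(by_name)   # 456

# Replacement using the named group (Python's \g<name> syntax).
replaced = pattern.sub(r"[\g<name>]", "12.3456 and 123.456")
print(replaced)  # [3456] and [456]
```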