Quick reference for regexp syntax

This checklist summarizes the most commonly used and hardest-to-remember parts of the regexp engine available in the edit and conversion functions of calibre. Note that this engine supports more features than the basic regexp engine used in the other parts of calibre.

Character classes

Character classes are useful to represent different groups of characters, succinctly.

Examples:

Representation Class
[а-я] Lowercase Cyrillic letters from а to я. Does not include ё, characters with accent marks, or ligatures.
[а-я0-9] Lowercase letters from а to я or digits from 0 to 9.
[А-Яа-я-] Uppercase or lowercase letters, or a dash. To include the dash in a class, put it at the beginning or at the end so that it is not confused with the hyphen that specifies a range of characters.
[^0-9] Any character except a digit. The caret (^) placed at the beginning of the class excludes the characters of the class (complemented class).
[[а-я]--[аеёиоуыэюя]] The lowercase consonants. A class can be included in a class. The characters -- exclude what follows them.
[\w--[\d_]] All letters (including foreign accented characters). Abbreviated classes can be used inside a class.

Example:

<[^<>]+> to select an HTML tag
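The simple classes above can be tried outside calibre. The following is a minimal sketch that assumes Python's standard-library re module, which is not the engine used by the calibre editor and does not support nested classes or the -- operator, so only plain classes are shown. The sample strings are made up for illustration.

```python
import re

# Sample text invented for this illustration.
html = '<p class="x">Year 2024</p>'

# <[^<>]+> : an opening bracket, one or more non-bracket characters, a closing bracket.
tags = re.findall(r"<[^<>]+>", html)

# [а-я] covers only the range а..я, so ё (which lies outside that range) is skipped.
cyrillic_runs = re.findall(r"[а-я]+", "ёлка и ель")

# [^0-9] is the complemented class: removing it leaves only the digits.
digits_only = re.sub(r"[^0-9]", "", "Year 2024")

print(tags)           # ['<p class="x">', '</p>']
print(cyrillic_runs)  # ['лка', 'и', 'ель']
print(digits_only)    # 2024
```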

Shorthand character classes

Representation Class
\d A digit (same as [0-9])
\D Any non-digit character (same as [^0-9])
\w A word character: letters, digits and the underscore ([a-zA-Z0-9_]), including characters with accent marks and ligatures
\W Any “non-word” character
\s Whitespace: space, non-breaking space, tab, line break
\S Any “non-whitespace” character
. Any character except newline. Use the “dot all” checkbox or the (?s) regexp modifier to include the newline character.
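As a rough illustration of the shorthand classes, here is a sketch that assumes Python's standard-library re module with ordinary (Unicode) string patterns. The calibre editor uses a different engine, so treat this as an approximation of the behavior described in the table.

```python
import re

# \d and \D split a string into digit and non-digit runs.
digit_runs = re.findall(r"\d+", "a1b22c")
non_digit_runs = re.findall(r"\D+", "a1b22c")

# \w matches accented letters, digits and the underscore.
words = re.findall(r"\w+", "café, naïve_1!")

# \s matches the non-breaking space (U+00A0), tab and line break alike.
fields = re.split(r"\s+", "a\u00a0b\tc\nd")

# The dot skips newlines unless "dot all" mode, here (?s), is enabled.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r"(?s).", "a\nb")

print(words)        # ['café', 'naïve_1']
print(fields)       # ['a', 'b', 'c', 'd']
print(dot_default)  # ['a', 'b']
```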

The quantifiers

Quantifier Number of occurrences of the expression preceding the quantifier
? 0 or 1 occurrence of the expression. Same as {0,1}.
+ 1 or more occurrences of the expression. Same as {1,}.
* 0, 1 or more occurrences of the expression. Same as {0,}.
{n} Exactly n occurrences of the expression.
{min,max} At least min and at most max occurrences of the expression.
{min,} At least min occurrences of the expression.
{,max} Between 0 and max occurrences of the expression, inclusive.
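The quantifiers in the table can be checked with a few whole-string matches. This is a minimal sketch assuming Python's standard-library re module; the helper name matches_fully is made up for this example.

```python
import re

def matches_fully(pattern, text):
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# ? : the "u" may appear zero times or once.
print(matches_fully(r"colou?r", "color"), matches_fully(r"colou?r", "colour"))  # True True

# + needs at least one "b"; * accepts none.
print(matches_fully(r"ab+", "a"), matches_fully(r"ab*", "a"))  # False True

# {n} and {min,max} bound the number of repetitions.
print(matches_fully(r"\d{4}", "2024"), matches_fully(r"\d{2,4}", "12345"))  # True False

# {,max} allows anything from zero up to max repetitions.
print(matches_fully(r"a{,2}", ""), matches_fully(r"a{,2}", "aaa"))  # True False
```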

Greed

By default, quantifiers are greedy: the regular expression engine extends the selection as much as possible. This often causes surprises at first. Put ? after a quantifier to make it lazy, so that it matches as little as possible. Avoid putting two lazy quantifiers in the same expression, as the result can be unpredictable.

Beware of nesting quantifiers, for example, the pattern (a*)*, as it exponentially increases processing time.
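The difference between a greedy and a lazy quantifier is easiest to see on a string with two candidate endings. The sketch below assumes Python's standard-library re module and an invented sample string.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b> it can reach.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b> that lets the pattern succeed.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```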

Alternation

The | character in a regular expression is a logical OR: either the preceding or the following expression can match.
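A short sketch of alternation, assuming Python's standard-library re module. The second pattern wraps the alternatives in a non-capturing group so that the OR applies only to one letter rather than to the whole pattern.

```python
import re

# Either whole word can match.
animals = re.findall(r"cat|dog", "cat, dog, bird")

# The group limits the alternation to the middle letter.
spellings = re.findall(r"gr(?:a|e)y", "gray grey groy")

print(animals)    # ['cat', 'dog']
print(spellings)  # ['gray', 'grey']
```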

Exclusion

Method 1

pattern_to_exclude(*SKIP)(*FAIL)|pattern_to_select

Example:

"Blabla"(*SKIP)(*FAIL)|Blabla

selects Blabla in the strings Blabla or "Blabla or Blabla", but not in "Blabla".

Method 2

pattern_to_exclude\K|(pattern_to_select)

"Blabla"\K|(Blabla)

selects Blabla in the strings Blabla or "Blabla or Blabla", but not in "Blabla".
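Python's standard-library re module supports neither (*SKIP)(*FAIL) nor \K, so the two methods above cannot be run there as written. The following sketch only emulates the same idea under that assumption: the excluded form is matched first and left untouched, and only matches where the capturing group took part are replaced. The function name replace_unquoted is hypothetical.

```python
import re

def replace_unquoted(word, replacement, text):
    """Replace `word` except where it is the whole content of a double-quoted string."""
    # First alternative: the form to exclude. Second alternative: the form to select.
    pattern = re.compile('"{0}"|({0})'.format(re.escape(word)))

    def substitute(match):
        # Group 1 is set only when the second alternative matched.
        return replacement if match.group(1) is not None else match.group(0)

    return pattern.sub(substitute, text)

result = replace_unquoted("Blabla", "X", 'Blabla, "Blabla" and "Blabla or Blabla"')
print(result)  # X, "Blabla" and "X or X"
```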

Anchors

An anchor is a way to match a logical location in a string, rather than a character. The most useful anchors for text processing are:

\b
Designates a word boundary, i.e. a transition between a word character (\w) and a non-word character, or the start or end of the text. For example, you can use \bsurd to match the surd but not absurd.
^
Matches the start of a line (in multi-line mode, which is the default)
$
Matches the end of a line (in multi-line mode, which is the default)
\K
Resets the start position of the selection to its position in the pattern. Some regexp engines (but not calibre) do not allow lookbehind of variable length, especially with quantifiers. When you can use \K with these engines, it lets you work around this limit by writing the equivalent of a positive lookbehind of variable length.
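The \b, ^ and $ anchors can be sketched with Python's standard-library re module. Two assumptions apply: multi-line mode is not the default there and has to be requested with re.MULTILINE, and \K is not supported, so it is left out.

```python
import re

# \bsurd matches "surd" as a separate word, but not inside "absurd".
standalone = re.search(r"\bsurd", "the surd") is not None
embedded = re.search(r"\bsurd", "absurd") is not None

text = "one two\nthree four"

# In multi-line mode, ^ and $ anchor at the start and end of every line.
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(standalone, embedded)  # True False
print(first_words)           # ['one', 'three']
print(last_words)            # ['two', 'four']
```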

Groups

(expression)
Capturing group, which stores the selection and can be recalled later in the search or replace patterns with \n, where n is the sequence number of the capturing group (starting at 1 in reading order)
(?:expression)
Group that does not capture the selection
(?>expression)
Atomic Group: As soon as the expression is satisfied, the regexp engine passes, and if the rest of the pattern fails, it will not backtrack to try other combinations with the expression. Atomic groups do not capture.
(?|expression)
Branch reset group: the branches of the alternations included in the expression share the same group numbers
(?<name>expression)
Group named “name”. The selection can be recalled later in the search pattern by (?P=name) and in the replace by \g<name>. Two different groups can use the same name.
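Capturing, non-capturing and named groups can be sketched with Python's standard-library re module. One assumption to note: that module only accepts the (?P<name>...) spelling for named groups, whereas the (?<name>...) form shown above is the one listed for calibre. Atomic and branch reset groups are left out because older versions of the standard module do not support them.

```python
import re

# Numbered group recalled with \1 in the search pattern and in the replacement.
deduplicated = re.sub(r"\b(\w+) \1\b", r"\1", "the the cat")

# Named group recalled with (?P=name) in the search and \g<name> in the replacement.
deduplicated_named = re.sub(r"\b(?P<word>\w+) (?P=word)\b", r"\g<word>", "it is is fine")

# Two named groups swapped in the replacement.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")

# A non-capturing group only groups: findall returns the whole match.
repeats = re.findall(r"(?:ab)+", "ababab ab")

print(deduplicated)        # the cat
print(deduplicated_named)  # it is fine
print(swapped)             # Lovelace, Ada
print(repeats)             # ['ababab', 'ab']
```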

Lookarounds

Lookaround Meaning
?= Positive lookahead (to be placed after the selection)
?! Negative lookahead (to be placed after the selection)
?<= Positive lookbehind (to be placed before the selection)
?<! Negative lookbehind (to be placed before the selection)

Lookaheads and lookbehinds do not consume characters, they are zero length and do not capture. They are atomic groups: as soon as the assertion is satisfied, the regexp engine passes, and if the rest of the pattern fails, it will not backtrack inside the lookaround to try other combinations.

When looking for multiple matches in a string, at the starting position of each match attempt, a lookbehind can inspect the characters before the current position. Therefore, on the string 123, the pattern (?<=\d)\d (a digit preceded by a digit) should, in theory, select 2 and 3. On the other hand, \d\K\d can only select 2, because the starting position after the first selection is immediately before 3, and there are not enough digits for a second match. Similarly, \d(\d) only captures 2. In calibre’s regexp engine practice, the positive lookbehind behaves in the same way, and selects only 2, contrary to theory.

Groups can be placed inside lookarounds, but capture is rarely useful. Nevertheless, if it is useful, it will be necessary to be very careful in the use of a quantifier in a lookbehind: the greed associated with the absence of backtracking can give a surprising capture. For this reason, use \K rather than a positive lookbehind when you have a quantifier (or worse, several) in a capturing group of the positive lookbehind.

Example of negative lookahead:

(?![^<>{}]*[>}])

Placed at the end of a pattern, it prevents selecting text inside a tag or inside a style embedded in the file.

Whenever possible, it is better to "anchor" the lookarounds, to reduce the number of steps necessary to obtain the result.
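The four lookarounds, and the negative lookahead example given above, can be sketched as follows. This assumes Python's standard-library re module, which requires fixed-length lookbehinds and may differ from the calibre editor in how successive matches are found, so only simple cases are shown. The sample strings are invented.

```python
import re

# Positive lookahead: digits followed by "px", without selecting the unit.
pixel_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "$15 and 20")

# Negative lookbehind: whole numbers not preceded by a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "$15 and 20")

# Negative lookahead from the text: skip matches located inside a tag.
html = '<span class="red">red</span>'
recolored = re.sub(r"red(?![^<>{}]*[>}])", "blue", html)

print(pixel_values)   # ['10', '30']
print(prices)         # ['15']
print(plain_numbers)  # ['20']
print(recolored)      # <span class="red">blue</span>
```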

Recursion

Representation Meaning
(?R) Recursion of the entire pattern
(?1) Recursion of only the pattern of the numbered capturing group, here group 1

Recursion means that a pattern calls itself. This is useful for balanced constructs, such as quoted strings that can contain embedded quoted strings. If, while processing a string between double quotation marks, we encounter the beginning of a new string between double quotation marks, we already know how to handle it: we call the same pattern again. This gives a pattern like:

start-pattern(?>atomic sub-pattern|(?R))*end-pattern

To select a string between double quotation marks without stopping on an embedded string:

“((?>[^“”]+|(?R))*[^“”]+)”

This template can also be used to modify pairs of tags that can be embedded, such as <div> tags.

Special characters

Representation Character
\t tabulation
\n line break
\x20 (breakable) space
\xa0 no-break space
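As a small illustration of these escapes, the sketch below assumes Python's standard-library re module and an invented sample string. It replaces tabs and no-break spaces with ordinary spaces.

```python
import re

raw = "10\u00a0km\tnorth"

# \t and \xa0 inside the class match a tab and a no-break space; "\x20" is a plain space.
normalized = re.sub(r"[\t\xa0]", "\x20", raw)

# \n splits the text at line breaks.
lines = re.split(r"\n", "first line\nsecond line")

print(normalized)  # 10 km north
print(lines)       # ['first line', 'second line']
```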

Meta-characters

Meta-characters are those that have a special meaning for the regexp engine. Of these, twelve must be preceded by an escape character, the backslash (\), to lose their special meaning and become a regular character again:

^ . [ ] $ ( ) * + ? | \

Seven other meta-characters do not need to be preceded by a backslash (but can be without any other consequence):

{ } ! < > = :

Special characters lose their status if they are used inside a class (between brackets []). The closing bracket and the dash have a special status in a class. Outside the class, the dash is a simple literal, the closing bracket remains a meta-character.

The slash (/) and the number sign (or hash character) (#) are not meta-characters, they don’t need to be escaped.

In some tools, like regex101.com with the Python engine, double quotes have the special status of separator, and must be escaped, or the options changed. This is not the case in the editor of calibre.
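The effect of escaping can be sketched with Python's standard-library re module, whose re.escape function adds the backslashes automatically. This is an assumption about a scripting workflow; in the calibre editor the backslashes are typed by hand. The sample text is invented.

```python
import re

literal = "1+1=2 (really?)"

# Unescaped, the +, ?, ( and ) act as meta-characters, so the text does not match itself.
unescaped_matches = re.search(literal, literal) is not None

# Escaped, every meta-character becomes a regular character again.
escaped_matches = re.search(re.escape(literal), "proof: 1+1=2 (really?)") is not None

# Inside a class, the dot, star, plus and question mark are plain literals.
in_class = re.findall(r"[.*+?]", "a.b*c")

print(unescaped_matches)  # False
print(escaped_matches)    # True
print(in_class)           # ['.', '*']
```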

Modes

(?s)
Causes the dot (.) to match newline characters as well
(?m)
Makes the ^ and $ anchors match the start and end of lines instead of the start and end of the entire string.
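Both modifiers can be tried in a few lines. The sketch assumes Python's standard-library re module, where neither mode is on by default, so each pattern is shown with and without its modifier.

```python
import re

text = "first\nsecond"

# Without (?s) the dot cannot cross the line break.
without_s = re.findall(r"first.second", text)
with_s = re.findall(r"(?s)first.second", text)

# Without (?m), ^ and $ only anchor at the ends of the whole string.
without_m = re.findall(r"^\w+$", text)
with_m = re.findall(r"(?m)^\w+$", text)

print(without_s, with_s)  # [] ['first\nsecond']
print(without_m, with_m)  # [] ['first', 'second']
```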