Quick reference for regexp syntax

This checklist summarizes the most commonly used/hard to remember parts of the regexp engine available in most parts of calibre.

Character classes

Character classes are a concise way to represent different groups of characters.

Examples:

Representation

Class

[a-z]

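Lowercase letters. Does not include characters with accent marks or ligatures

[a-z0-9]

Lowercase letters from a to z or digits from 0 to 9

[A-Za-z-]

Uppercase or lowercase letters, or a dash. To include the dash in a class, you must put it at the beginning or at the end, so that it is not confused with the hyphen that specifies a range of characters

[^0-9]

Any character except a digit. The caret (^) placed at the beginning of the class excludes the characters of the class (complemented class)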

[[a-z]--[aeiouy]]

Lowercase consonants. A class can contain another class. -- excludes the class that follows it

[\w--[\d_]]

All letters (including foreign accented characters). Shorthand classes can be used inside a class

Example:

<[^<>]+> to select an HTML tag
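A few of the classes above can be tried out with a short sketch. It uses Python's standard re module as a stand-in, on the assumption that these basic classes behave the same way in calibre; the nested classes with -- are left out because the standard re module does not support them. The sample strings are made up for illustration.

```python
import re

# [a-z] covers ASCII lowercase letters only, so the ligature in "cœur" is skipped.
lowercase = re.findall(r"[a-z]", "cœur")

# [^0-9] matches any non-digit; removing those characters leaves only the digits.
digits_only = re.sub(r"[^0-9]", "", "Chapter 12")

# <[^<>]+> selects an HTML tag: "<", one or more non-bracket characters, ">".
tags = re.findall(r"<[^<>]+>", '<p class="x">text</p>')

print(lowercase)    # ['c', 'u', 'r']
print(digits_only)  # 12
print(tags)         # ['<p class="x">', '</p>']
```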

Shorthand character classes

Representation

Class

\d

A digit (same as [0-9])

\D

Any non-digit character (same as [^0-9])

\w

An alphanumeric character or underscore ([a-zA-Z0-9_]), including characters with accent marks and ligatures

\W

Any "non-word" character

\s

Space, non-breaking space, tab, line break

\S

Any "non-whitespace" character

.

Any character except newline. Use the “dot all” checkbox or the (?s) regexp modifier to include the newline character.
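The shorthand classes can be illustrated with a minimal sketch, again using Python's standard re module as a stand-in for calibre's engine (an assumption; only behaviour common to both is shown). The sample strings are hypothetical.

```python
import re

sample = "Tome 2\tpage 41\nfin"

# \d+ : runs of digits.
numbers = re.findall(r"\d+", sample)

# \s+ : runs of whitespace (space, tab, line break) used as separators.
pieces = re.split(r"\s+", sample)

# \s also matches the non-breaking space (\xa0).
nbsp_split = re.split(r"\s+", "a\xa0b")

# \w+ : letters (including ligatures), digits and the underscore.
words = re.findall(r"\w+", "cœur_2 ok!")

# The dot stops at the newline unless "dot all" mode is enabled.
to_end_of_line = re.search(r"page.*", sample).group()

print(numbers)         # ['2', '41']
print(pieces)          # ['Tome', '2', 'page', '41', 'fin']
print(nbsp_split)      # ['a', 'b']
print(words)           # ['cœur_2', 'ok']
print(to_end_of_line)  # page 41
```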

Quantifiers

Quantifier

Number of occurrences of the expression preceding the quantifier

?

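The expression occurs 0 or 1 time. Same as {0,1}

+

The expression occurs 1 or more times. Same as {1,}

*

The expression occurs 0, 1 or more times. Same as {0,}

{n}

The expression occurs exactly n times

{min,max}

Number of occurrences between the minimum and maximum values, inclusive

{min,}

Number of occurrences between the minimum value (inclusive) and infinity

{,max}

Number of occurrences between 0 and the maximum value (inclusive)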
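The quantifiers in the table can be checked with a small sketch. It uses Python's standard re module as a stand-in for calibre's engine (an assumption), and the test strings are invented for the purpose.

```python
import re

# ? : the "u" is optional, so both spellings match.
both_spellings = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]

# * allows zero occurrences of "b"; + requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")

# {n} : exactly four digits between word boundaries.
years = re.findall(r"\b\d{4}\b", "In 1999, 12345 and 2024")

# {min,max} : two or three digits between word boundaries.
two_or_three = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")

# {,max} : from zero up to two occurrences.
at_most_two = [bool(re.fullmatch(r"x{,2}", s)) for s in ("", "xx", "xxx")]

print(both_spellings)  # [True, True]
print(star)            # ['a', 'ab', 'abbb']
print(plus)            # ['ab', 'abbb']
print(years)           # ['1999', '2024']
print(two_or_three)    # ['22', '333']
print(at_most_two)     # [True, True, False]
```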

Greed

By default, quantifiers are greedy: the regular expression engine extends the selection as far as possible, which often causes surprises at first. Add ? after a quantifier to make it lazy. Avoid putting two lazy quantifiers in the same expression, as the result can be unpredictable.

Beware of nesting quantifiers, as in the pattern (a*)*, because this can increase processing time exponentially.
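The difference between a greedy and a lazy quantifier is easiest to see on a concrete string. The following sketch uses Python's standard re module as a stand-in for calibre's engine (an assumption); the HTML fragment is made up.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so both elements end up in a single match.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b> it can reach, giving one match per element.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```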

Alternation

The | character in a regular expression is a logical OR. It means that either the preceding or the following expression can match.
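A short sketch of alternation, using Python's standard re module as a stand-in for calibre's engine (an assumption). It also shows two points worth keeping in mind: a group limits the scope of |, and the alternatives are tried from left to right.

```python
import re

# Either alternative can match.
animals = re.findall(r"cat|dog", "cat, dog, bird")

# A (non-capturing) group restricts the alternation to a single letter.
greys = re.findall(r"gr(?:a|e)y", "gray grey griy")

# Alternatives are tried left to right: "a" succeeds before "ab" is considered.
first_wins = re.match(r"a|ab", "ab").group()

print(animals)     # ['cat', 'dog']
print(greys)       # ['gray', 'grey']
print(first_wins)  # a
```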

Exclusion

Method 1

pattern_to_exclude(*SKIP)(*FAIL)|pattern_to_select

Example:

"Blabla"(*SKIP)(*FAIL)|Blabla

selects Blabla, in the strings Blabla or "Blabla or Blabla", but not in "Blabla".

Method 2

pattern_to_exclude\K|(pattern_to_select)

"Blabla"\K|(Blabla)

selects Blabla, in the strings Blabla or "Blabla or Blabla", but not in "Blabla".
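Both methods rely on features, (*SKIP)(*FAIL) and \K, that Python's standard re module does not provide, so they cannot be shown with it directly. As a rough stand-in for the same idea, the sketch below uses plain alternation: the pattern to exclude is matched first without a group, and only matches where the capturing group took part are kept. This is a swapped-in technique, not either of the two methods themselves, and the helper name select is hypothetical.

```python
import re

# Left branch: the text to exclude. Right branch: the text to select, captured.
pattern = re.compile(r'"Blabla"|(Blabla)')

def select(text):
    """Return the spans of Blabla that are not enclosed as "Blabla"."""
    return [m.span(1) for m in pattern.finditer(text) if m.group(1) is not None]

print(select('Blabla'))              # [(0, 6)]
print(select('"Blabla or Blabla"'))  # [(1, 7), (11, 17)]
print(select('"Blabla"'))            # []
```

The excluded branch consumes "Blabla" including its quotation marks, so the engine never gets the chance to select the Blabla inside it.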

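Anchors

An anchor is a way to match a logical position in a string rather than a character. The most useful anchors for text processing are: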

\b

Designates a word boundary, i.e. a transition between a word character and a non-word character (such as a space or punctuation). For example, you can use \bsurd to match surd but not absurd.

^

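Matches the start of a line (in multi-line mode, which is the default)

$

Matches the end of a line (in multi-line mode, which is the default)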

\K

Resets the start position of the selection to its position in the pattern. Some regexp engines (but not calibre) do not allow lookbehind of variable length, especially with quantifiers. When you can use \K with these engines, it also allows you to get rid of this limit by writing the equivalent of a positive lookbehind of variable length.
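The \b, ^ and $ anchors can be sketched with Python's standard re module, used here as a stand-in for calibre's engine (an assumption). Unlike calibre, Python does not enable multi-line mode by default, so the re.MULTILINE flag is passed explicitly. \K is not shown because the standard re module does not support it.

```python
import re

# \bsurd matches "surd" as a word start, but not the "surd" inside "absurd".
starts = [m.start() for m in re.finditer(r"\bsurd", "the surd is absurd")]

text = "one two\nthree four"

# In multi-line mode, ^ and $ match at the start and end of every line.
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(starts)       # [4]
print(first_words)  # ['one', 'three']
print(last_words)   # ['two', 'four']
```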

Groups

(expression)

Capturing group, which stores the selection and can be recalled later in the search or replace patterns with \n, where n is the sequence number of the capturing group (starting at 1 in reading order)

(?:expression)

Group that does not capture the selection

(?>expression)

Atomic group: as soon as the expression is satisfied, the regexp engine moves on, and if the rest of the pattern fails, it will not backtrack to try other combinations with the expression. Atomic groups do not capture.

(?|expression)

Branch reset group: the branches of the alternations included in the expression share the same group numbers

(?<name>expression)

Group named “name”. The selection can be recalled later in the search pattern by (?P=name) and in the replace by \g<name>. Two different groups can use the same name.
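The capturing, non-capturing and named groups can be tried with the sketch below. It uses Python's standard re module as a stand-in for calibre's engine (an assumption). Note that the standard re module only accepts the (?P<name>expression) spelling for named groups, so that spelling is used here instead of (?<name>expression). Atomic and branch reset groups are left out. The sample strings are invented.

```python
import re

# Capturing groups recalled in the replacement with \1 and \2.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Austen, Jane")

# A capturing group recalled inside the search pattern: doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it was was")

# A non-capturing group only groups, here to repeat "ab" as a unit.
repeats = re.findall(r"(?:ab)+", "ababab ab")

# Named groups recalled in the replacement with \g<name>.
named = re.sub(r"(?P<last>\w+), (?P<first>\w+)", r"\g<first> \g<last>", "Austen, Jane")

# A named group recalled in the search pattern with (?P=name).
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()

print(swapped)  # Jane Austen
print(doubled)  # ['is', 'was']
print(repeats)  # ['ababab', 'ab']
print(named)    # Jane Austen
print(quoted)   # 'hi'
```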

Lookarounds

Lookaround

Meaning

?=

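Positive lookahead (to be placed after the selection)

?!

Negative lookahead (to be placed after the selection)

?<=

Positive lookbehind (to be placed before the selection)

?<!

Negative lookbehind (to be placed before the selection)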

Lookaheads and lookbehinds do not consume characters: they are zero length and do not capture. They are atomic groups: as soon as the assertion is satisfied, the regexp engine moves on, and if the rest of the pattern fails, it will not backtrack inside the lookaround to try other combinations.

When looking for multiple matches in a string, at the starting position of each match attempt, a lookbehind can inspect the characters before the current position. Therefore, on the string 123, the pattern (?<=\d)\d (a digit preceded by a digit) should, in theory, select 2 and 3. On the other hand, \d\K\d can only select 2, because the starting position after the first selection is immediately before 3, and there are not enough digits for a second match. Similarly, \d(\d) only captures 2. In calibre's regexp engine practice, the positive lookbehind behaves in the same way, and selects only 2, contrary to theory.

Groups can be placed inside lookarounds, but capture is rarely useful. Nevertheless, if it is useful, it will be necessary to be very careful in the use of a quantifier in a lookbehind: the greed associated with the absence of backtracking can give a surprising capture. For this reason, use \K rather than a positive lookbehind when you have a quantifier (or worse, several) in a capturing group of the positive lookbehind.

Example of negative lookahead:

(?![^<>{}]*[>}])

Placed at the end of a pattern, it prevents selecting inside a tag or inside a style embedded in the file.

Whenever possible, it is always better to "anchor" the lookarounds, to reduce the number of steps necessary to obtain the result.
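The four lookarounds, including the negative lookahead from the example, can be sketched as follows. Python's standard re module is used as a stand-in for calibre's engine (an assumption), and the strings are invented.

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: digits followed by " USD"; the currency is not selected.
usd = re.findall(r"\d+(?= USD)", prices)

# Positive lookbehind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative lookbehind: whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", "cost $15 or 20")

# Negative lookahead at the end of the pattern: skip "class" inside a tag.
html = '<p class="note">A note on class.</p>'
outside_tags = re.sub(r"class(?![^<>{}]*[>}])", "CLASS", html)

print(usd)           # ['10', '30']
print(after_dollar)  # ['15']
print(no_dollar)     # ['20']
print(outside_tags)  # <p class="note">A note on CLASS.</p>
```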

Recursion

Representation

Meaning

(?R)

Recursion of the entire pattern

(?1)

Recursion of only the pattern of the numbered capturing group, here group 1

Recursion means that a pattern calls itself. This is useful for balanced constructs, such as quoted strings that can contain embedded quoted strings. If, while processing a string between double quotation marks, the engine encounters the start of another string between double quotation marks, it applies the same pattern again to the inner string. This gives a pattern of the form:

start-pattern(?>atomic sub-pattern|(?R))*end-pattern

To select a string between double quotation marks without stopping on an embedded string:

“((?>[^“”]+|(?R))*[^“”]+)”

This template can also be used to modify pairs of tags that can be embedded, such as <div> tags.

Special characters

Representation

Character

\t

tab

\n

line break

\x20

(breakable) space

\xa0

non-breaking space

Meta-characters

Meta-characters are those that have a special meaning for the regexp engine. Of these, twelve must be preceded by an escape character, the backslash (\), to lose their special meaning and become a regular character again:

^ . [ ] $ ( ) * + ? | \

Seven other meta-characters do not need to be preceded by a backslash (but can be, without any other consequence):

{ } ! < > = :

Special characters lose their status if they are used inside a class (between brackets []). The closing bracket and the dash have a special status in a class. Outside the class, the dash is a simple literal, the closing bracket remains a meta-character.

The slash (/) and the number sign (or hash character) (#) are not meta-characters, they don’t need to be escaped.

In some tools, like regex101.com with the Python engine, double quotes have the special status of separator, and must be escaped, or the options changed. This is not the case in the editor of calibre.
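The effect of escaping, and the different status of meta-characters inside a class, can be checked with a small sketch. It uses Python's standard re module as a stand-in for calibre's engine (an assumption). re.escape is a helper of that module, not a calibre feature, and it may escape more characters than strictly necessary.

```python
import re

# Unescaped, the dot matches any character; escaped, only a literal dot.
unescaped_hit = re.search(r"3.50", "3x50") is not None
escaped_hit = re.search(r"3\.50", "3x50") is not None

# Inside a class, ., + and * are ordinary characters and need no backslash.
in_class = re.findall(r"[.+*]", "a.b+c*d")

# A dash placed at the end of a class is a literal dash, not a range.
with_dash = re.findall(r"[a-c-]", "a-d")

# re.escape builds a pattern that matches a literal string exactly.
literal = "3.50 (approx.)"
round_trip = re.fullmatch(re.escape(literal), literal) is not None

print(unescaped_hit)  # True
print(escaped_hit)    # False
print(in_class)       # ['.', '+', '*']
print(with_dash)      # ['a', '-']
print(round_trip)     # True
```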

Modes

(?s)

Causes the dot (.) to match newline characters as well

(?m)

Makes the ^ and $ anchors match the start and end of lines instead of the start and end of the entire string.
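Both modifiers can be written at the start of a pattern. The sketch below shows their effect with Python's standard re module, used as a stand-in for calibre's engine (an assumption); in Python neither mode is on by default, which makes the before and after easy to compare.

```python
import re

text = "first line\nsecond line"

# Without (?m), ^ only matches at the very start of the string.
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"(?m)^\w+", text)

# Without (?s), the dot cannot cross the line break, so there is no match.
without_s = re.search(r"first.*second", text)
with_s = re.search(r"(?s)first.*second", text).group()

print(without_m)     # ['first']
print(with_m)        # ['first', 'second']
print(without_s)     # None
print(repr(with_s))  # 'first line\nsecond'
```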