Function mode for Search & replace in the Editor¶
The Search & replace tool in the editor supports a function mode. In this mode, you can combine regular expressions (see All about using regular expressions in calibre) with arbitrarily powerful Python functions to do all sorts of advanced text processing.
In the standard regexp mode for search and replace, you specify both a regular expression to search for as well as a template that is used to replace all found matches. In function mode, instead of using a fixed template, you specify an arbitrary function, in the Python programming language. This allows you to do lots of things that are not possible with simple templates.
The techniques and syntax of function mode are described by means of examples that show how to accomplish progressively more complex tasks.
Automatically fixing the case of headings in the document¶
Here, we use one of the builtin functions of the editor to automatically change the case of all text inside heading tags to title case:
Find expression: <([Hh][1-6])[^>]*>.+?</\1>
For the function, simply choose the Title-case text (ignore tags) builtin function. It will change titles that look like <h1>some TITLE</h1> into <h1>Some Title</h1>. It works even if there are other HTML tags inside the heading tags.
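To get a feel for what the find expression captures, it can be tried outside the editor with Python's standard re module. The following is only a sketch: title_case_ignoring_tags is a hypothetical stand-in for the builtin function, and it uses a naive capitalize-every-word rule that calibre's own implementation may well not share.

```python
import re

HEADING = re.compile(r'<([Hh][1-6])[^>]*>.+?</\1>')

def title_case_ignoring_tags(html):
    # Hypothetical stand-in for the builtin: capitalize each word of the
    # text between tags and leave the tags themselves untouched.
    parts = re.split(r'(<[^<>]+>)', html)
    return ''.join(p if p.startswith('<') else p.title() for p in parts)

sample = '<h1>some <i>titLE</i></h1><p>body text</p>'
# Only the heading is matched, so the paragraph is left alone
result = HEADING.sub(lambda m: title_case_ignoring_tags(m.group()), sample)
print(result)
```

The back-reference \1 makes sure that the closing tag is the same heading level as the opening tag.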
Your first custom function - smartening hyphens¶
The real power of function mode comes from being able to write your own functions that process text in arbitrary ways. The Smarten punctuation tool in the editor leaves individual hyphens alone, so you can use this function to replace them with em-dashes.
To create a new function, click the Create/edit button and copy in the Python code below.
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('--', '—').replace('-', '—')
Every Search & replace custom function must have a unique name and consist of a Python function named replace that accepts all the arguments shown above. For the moment, we won’t worry about all the different arguments to the replace() function. Just focus on the match argument. It represents a match when running a search and replace. Its full documentation is available in the Python documentation for match objects. match.group() simply returns all the matched text and all we do is replace hyphens in that text with em-dashes, first replacing double hyphens and then single hyphens.
Use this function with the find expression:
>[^<>]+<
It will replace all hyphens with em-dashes, but only in actual text and not inside HTML tag definitions.
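The effect of pairing this function with that find expression can be checked outside the editor. The sketch below uses re.sub from the standard library as a stand-in for the editor's search machinery; the run helper and the placeholder values passed for the unused arguments are assumptions made for illustration.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('--', '—').replace('-', '—')

def run(html):
    # Stand-in for the editor: only `match` matters for this function,
    # the remaining arguments are filled with placeholder values.
    return re.sub(r'>[^<>]+<', lambda m: replace(m, 1, '', None, None, {}, {}), html)

sample = '<p class="pull-quote">wait--what - really?</p>'
fixed = run(sample)
print(fixed)
```

The hyphen in the class attribute survives because the expression only matches text that sits between a closing > and the next opening <.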
The power of function mode - using a spelling dictionary to fix mis-hyphenated words¶
Often, e-books created from scans of printed books contain mis-hyphenated words – words that were split at the end of the line on the printed page. We will write a simple function to automatically find and fix such words.
import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Try to remove the hyphen and replace the words if the resulting
        # hyphen free word is recognized by the dictionary
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    # Search for words split by a hyphen
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities
Use this function with the same find expression as before, namely:
>[^<>]+<
It will, as if by magic, fix all mis-hyphenated words in the text of the book. The main trick is the use of one of the extra arguments to the replace function, dictionaries. This argument refers to the dictionaries that the editor itself uses to spell check the text of the book. The function looks for words separated by a hyphen, removes the hyphen and checks whether the dictionary recognizes the joined word. If it does, the original words are replaced by the hyphen-free joined word.
Note that this technique only works for monolingual books, because by default dictionaries.recognized() uses only the main language of the book.
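The dictionary lookup can be imitated outside calibre to see which words get joined. In this sketch, FakeDictionaries is a hypothetical stand-in for the editor's dictionaries object, and html.unescape, html.escape and the standard re module are substituted for replace_entities, prepare_string_for_xml and the regex module. These substitutes are assumptions chosen to keep the example self-contained, not calibre's actual behavior.

```python
import html
import re

class FakeDictionaries:
    # Hypothetical stand-in for the editor's dictionaries object
    def __init__(self, words):
        self.words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self.words

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    def replace_word(wmatch):
        # Join the two halves only if the result is a known word
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    text = html.unescape(match.group()[1:-1])
    corrected = re.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text)
    return '>%s<' % html.escape(corrected, quote=False)

dictionaries = FakeDictionaries(['wonderful'])
sample = '<p>A won- derful, well-known day &amp; night</p>'
fixed = re.sub(r'>[^<>]+<', lambda m: replace(m, 1, '', None, dictionaries, {}, {}), sample)
print(fixed)
```

Here "won- derful" is joined because "wonderful" is in the word list, while "well-known" is kept because "wellknown" is not.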
Auto numbering sections¶
Now we will see something a little different. Suppose your HTML file has many sections, each with a heading in an <h2> tag that looks like <h2>Some text</h2>. You can create a custom function that will automatically number these headings with consecutive section numbers, so that they look like <h2>1. Some text</h2>.
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    section_number = '%d. ' % number
    return match.group(1) + section_number + match.group(2)

# Ensure that when running over multiple files, the files are processed
# in the order in which they appear in the book
replace.file_order = 'spine'
Use it with the find expression:
(?s)(<h2[^<>]*>)(.+?</h2>)
Place the cursor at the top of the file and click Replace all.
This function uses another of the useful extra arguments to replace(): the number argument. When doing a Replace All, number is automatically incremented for every successive match.
Another new feature is the use of replace.file_order: setting that to 'spine' means that if this search is run on multiple HTML files, the files are processed in the order in which they appear in the book. See Choose file order when running on multiple HTML files for details.
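The behavior of number can be imitated with a small driver built on re.sub. This is a sketch under assumptions: replace_all and the file name 'chapter.html' are hypothetical and only mimic what the editor does during Replace All.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    section_number = '%d. ' % number
    return match.group(1) + section_number + match.group(2)

def replace_all(pattern, text):
    # Hypothetical driver imitating Replace All: number starts at 1 and
    # grows by one for every successive match.
    counter = 0

    def call(m):
        nonlocal counter
        counter += 1
        return replace(m, counter, 'chapter.html', None, None, {}, {})

    return re.sub(pattern, call, text)

sample = '<h2>Intro</h2><p>text</p><h2 class="x">Next</h2>'
numbered = replace_all(r'(?s)(<h2[^<>]*>)(.+?</h2>)', sample)
print(numbered)
```

Because the first capture group holds the whole opening tag, attributes such as class are carried over unchanged.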
Auto create a Table of Contents¶
Finally, let’s try something a little more ambitious. Suppose your book has headings in h1 and h2 tags that look like <h1 id="someid">Some Text</h1>. We will auto-generate an HTML Table of Contents based on these headings. Create the custom function below:
from calibre import replace_entities
from calibre.ebooks.oeb.polish.toc import TOC, toc_to_html
from calibre.gui2.tweak_book import current_container
from calibre.ebooks.oeb.base import xml2str

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # All matches found, output the resulting Table of Contents.
        # The argument metadata is the metadata of the book being edited
        if 'toc' in data:
            toc = data['toc']
            root = TOC()
            for (file_name, tag_name, anchor, text) in toc:
                parent = root.children[-1] if tag_name == 'h2' and root.children else root
                parent.add(text, file_name, anchor)
            toc = toc_to_html(root, current_container(), 'toc.html', 'Table of Contents for ' + metadata.title, metadata.language)
            print(xml2str(toc))
        else:
            print('No headings to build ToC from found')
    else:
        # Add an entry corresponding to this match to the Table of Contents
        if 'toc' not in data:
            # The entries are stored in the data object, which will persist
            # for all invocations of this function during a 'Replace All' operation
            data['toc'] = []
        tag_name, anchor, text = match.group(1), replace_entities(match.group(2)), replace_entities(match.group(3))
        data['toc'].append((file_name, tag_name, anchor, text))
        return match.group()  # We don't want to make any actual changes, so return the original matched text

# Ensure that we are called once after the last match is found so we can
# output the ToC
replace.call_after_last_match = True
# Ensure that when running over multiple files, the files are processed
# in the order in which they appear in the book
replace.file_order = 'spine'
And use it with the find expression:
<(h[12]) (?:[^<>]* )?id=['"]([^'"]+)['"][^<>]*>([^<>]+)
Run the search on All text files and at the end of the search, a window will pop up with "Debug output from your function" which will have the HTML Table of Contents, ready to be pasted into toc.html.
The function above is heavily commented, so it should be easy to follow. The key new feature is the use of another useful extra argument to the replace() function, the data object. The data object is a Python dictionary that persists between all successive invocations of replace() during a single Replace All operation.
Another new feature is the use of call_after_last_match: setting that to True on the replace() function means that the editor will call replace() one extra time after all matches have been found. For this extra call, the match object will be None.
This was just a demonstration to show you the power of function mode. If you really need to generate a Table of Contents from the headings in your book, you are better off using the dedicated Table of Contents tool in Tools → Table of Contents.
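The interplay of data and call_after_last_match can be tried without calibre. In the sketch below, replace_all is a hypothetical driver that imitates the editor, the find expression is a simplified one written for this sample only, and the value of number passed in the final call is a placeholder. The function just collects heading texts instead of building a real Table of Contents.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Extra call after the last match: report what was collected
        print('Collected headings:', ', '.join(data.get('headings', [])))
        return
    data.setdefault('headings', []).append(match.group(3))
    return match.group()  # Leave the matched text unchanged

replace.call_after_last_match = True

def replace_all(pattern, text, func):
    # Hypothetical driver: one shared dict for the whole operation, plus
    # one final call with match=None if the function asks for it.
    data = {}
    number = 0

    def call(m):
        nonlocal number
        number += 1
        return func(m, number, 'index.html', None, None, data, {})

    result = re.sub(pattern, call, text)
    if getattr(func, 'call_after_last_match', False):
        func(None, number, '', None, None, data, {})  # number is a placeholder here
    return result, data

sample = '<h1 id="a">One</h1><h2 id="b">Two</h2>'
result, data = replace_all(r'<(h[12]) id="([^"]+)">([^<>]+)', sample, replace)
```

Since every call receives the same dictionary, the final call sees everything the earlier calls stored.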
The API for the function mode¶
All function mode functions must be Python functions named replace, with the following signature:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return a_string
When a find/replace is run, the replace() function is called for every match that is found, and it must return the replacement string for that match. If no replacements are to be done, it should return match.group(), which is the original string. The various arguments to the replace() function are documented below.
The match argument¶
The match argument represents the currently found match. It is a Python Match object. Its most useful method is group(), which can be used to get the matched text corresponding to individual capture groups in the search regular expression.
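As a quick illustration of group(), the section-numbering expression from earlier can be applied to a sample string with the standard re module. The sample HTML is made up for this sketch.

```python
import re

pattern = re.compile(r'(?s)(<h2[^<>]*>)(.+?</h2>)')
match = pattern.search('<p>intro</p><h2 class="x">Some text</h2>')

whole = match.group()     # the entire matched text (same as group(0))
opening = match.group(1)  # first capture group: the opening tag
rest = match.group(2)     # second capture group: text plus closing tag
print(whole, opening, rest, sep='\n')
```

Calling group() with no argument returns the full match, which is why returning match.group() from replace() leaves the text unchanged.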
The number argument¶
The number argument is the number of the current match. When you run Replace All, every successive match will cause replace() to be called with an increasing number. The first match has number 1.
The file_name argument¶
This is the filename of the file in which the current match was found. When searching inside marked text, the file_name is empty. The file_name is in canonical form, a path relative to the root of the book, using / as the path separator.
The metadata argument¶
This represents the metadata of the current book, such as title, authors, language, etc. It is an object of class calibre.ebooks.metadata.book.base.Metadata. Useful attributes include title, authors (a list of authors) and language (the language code).
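A function that reads these attributes might look like the sketch below. The SimpleNamespace object is a hypothetical stand-in that carries only the three attributes named above, and the BYLINE placeholder in the sample text is invented for this example.

```python
import re
from types import SimpleNamespace

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Build a byline from the book's title, authors and language code
    return '%s by %s (%s)' % (metadata.title, ' and '.join(metadata.authors), metadata.language)

# Hypothetical stand-in for the Metadata object passed in by the editor
metadata = SimpleNamespace(title='My Book', authors=['A. Author', 'B. Writer'], language='eng')
sample = '<p>BYLINE</p>'
filled = re.sub(r'BYLINE', lambda m: replace(m, 1, 'title.html', metadata, None, {}, {}), sample)
print(filled)
```

Note that authors is a list, so it has to be joined into a string before it can be inserted into the text.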
The dictionaries argument¶
This represents the collection of dictionaries used for spell checking the current book. Its most useful method is dictionaries.recognized(word), which will return True if the passed in word is recognized by the dictionary for the current book’s language.
The data argument¶
This is a simple Python dictionary. When you run Replace all, every successive match will cause replace() to be called with the same dictionary as data. You can thus use it to store arbitrary data between invocations of replace() during a Replace all operation.
The functions argument¶
The functions argument gives you access to all other user defined functions. This is useful for code re-use. You can define utility functions in one place and re-use them in all your other functions. For example, suppose you create a function named My Function like this:
def utility():
    # do something
    pass

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...
Then, in another function, you can access the utility() function like this:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    utility = functions['My Function']['utility']
    ...
You can also use the functions object to store persistent data, that can be re-used by other functions. For example, you could have one function that when run with Replace All collects some data and another function that uses it when it is run afterwards. Consider the following two functions:
# Function One
persistent_data = {}

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...
    persistent_data['something'] = 'some data'

# Function Two
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    persistent_data = functions['Function One']['persistent_data']
    ...
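A concrete version of this two-pass pattern is sketched below. It is an illustration under assumptions: in the editor each function would be named replace and saved under its own name, whereas here they are called replace_one and replace_two so that they fit into one file, and the functions dictionary is a hand-built stand-in for the object the editor passes in.

```python
import re

# Function One: collects every id it sees
persistent_data = {}

def replace_one(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    persistent_data.setdefault('ids', []).append(match.group(1))
    return match.group()

# Function Two: flags links whose target was not collected by Function One
def replace_two(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    known_ids = functions['Function One']['persistent_data'].get('ids', [])
    if match.group(1) in known_ids:
        return match.group()
    return match.group() + ' class="broken"'

# Hypothetical stand-in for the functions object
functions = {'Function One': {'persistent_data': persistent_data}}

sample = '<a href="#a">x</a> <a href="#zz">y</a> <p id="a">t</p>'
re.sub(r'id="([^"]+)"', lambda m: replace_one(m, 1, '', None, None, {}, functions), sample)
flagged = re.sub(r'href="#([^"]+)"', lambda m: replace_two(m, 1, '', None, None, {}, functions), sample)
print(flagged)
```

The first pass changes nothing and only fills persistent_data; the second pass reads it through the functions object.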
Debugging your functions¶
You can debug the functions you create by using the standard print() function from Python. The output of print will be displayed in a popup window after the Find/replace has completed. You saw an example of using print() to output an entire table of contents above.
Choose file order when running on multiple HTML files¶
When you run a Replace all on multiple HTML files, the order in which the files are processed depends on what files you have open for editing. You can force the search to process files in the order in which they appear in the book by setting the file_order attribute on your function, like this:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.file_order = 'spine'
file_order accepts two values, spine and spine-reverse, which cause the search to process multiple files in the order they appear in the book, either forwards or backwards, respectively.
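The effect of file_order on a multi-file run can be sketched as follows. Everything about the driver is an assumption made for illustration: replace_all_files, the files dictionary and the spine list are hypothetical, and the sketch assumes that number keeps counting across files.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group(1) + '%d. ' % number + match.group(2)

replace.file_order = 'spine'

def replace_all_files(files, spine, func, pattern):
    # Hypothetical driver: pick the processing order from func.file_order
    order = getattr(func, 'file_order', None)
    names = list(files)  # fallback: whatever order the files happen to be in
    if order == 'spine':
        names = [n for n in spine if n in files]
    elif order == 'spine-reverse':
        names = [n for n in reversed(spine) if n in files]
    number = 0
    out = {}
    for name in names:
        def call(m):
            nonlocal number
            number += 1
            return func(m, number, name, None, None, {}, {})
        out[name] = re.sub(pattern, call, files[name])
    return out

files = {'ch2.html': '<h2>B</h2>', 'ch1.html': '<h2>A</h2>'}
spine = ['ch1.html', 'ch2.html']
numbered = replace_all_files(files, spine, replace, r'(?s)(<h2[^<>]*>)(.+?</h2>)')
print(numbered)
```

Without the file_order attribute, this driver would visit ch2.html first and the section numbers would come out in the wrong order.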
Having your function called an extra time after the last match is found¶
Sometimes, as in the auto generate table of contents example above, it is useful to have your function called an extra time after the last match is found. You can do this by setting the call_after_last_match attribute on your function, like this:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.call_after_last_match = True
Appending the output from the function to marked text¶
When running search and replace on marked text, it is sometimes useful to append some text to the end of the marked text. You can do that by setting the append_final_output_to_marked attribute on your function (note that you also need to set call_after_last_match), like this:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...
    return 'some text to append'

replace.call_after_last_match = True
replace.append_final_output_to_marked = True
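A fuller sketch of how the two attributes might work together is given here. The run_on_marked driver is hypothetical and only imitates the behavior described above; the function counts the paragraphs in the marked text and returns a comment to be appended in the final call.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Final call: the returned string is appended to the marked text
        return '<!-- %d paragraphs checked -->' % data.get('count', 0)
    data['count'] = data.get('count', 0) + 1
    return match.group()

replace.call_after_last_match = True
replace.append_final_output_to_marked = True

def run_on_marked(marked_text, pattern, func):
    # Hypothetical driver imitating a Replace All on marked text;
    # file_name is empty, as it is when searching inside marked text.
    data = {}
    number = 0

    def call(m):
        nonlocal number
        number += 1
        return func(m, number, '', None, None, data, {})

    result = re.sub(pattern, call, marked_text)
    if getattr(func, 'call_after_last_match', False):
        final = func(None, number, '', None, None, data, {})
        if getattr(func, 'append_final_output_to_marked', False) and final:
            result += final
    return result

appended = run_on_marked('<p>a</p><p>b</p>', r'<p>', replace)
print(appended)
```

The match is None check is what lets one function serve both as the per-match handler and as the producer of the appended text.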
Suppressing the result dialog when performing searches on marked text¶
You can also suppress the result dialog (which can slow down the repeated application of a search/replace on many blocks of text) by setting the suppress_result_dialog attribute on your function, like this:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.suppress_result_dialog = True
More examples¶
More useful examples, contributed by calibre users, can be found in the calibre E-book editor forum.