Function mode for Search & replace in the Editor

The Search & replace tool in the editor supports a function mode. In this mode, you can combine regular expressions (see the regexp documentation) with arbitrarily powerful Python functions to do all sorts of advanced text processing.

In the standard regexp mode for search and replace, you specify both a regular expression to search for and a template that is used to replace all found matches. In function mode, instead of a fixed template, you specify an arbitrary function in the Python programming language. This allows you to do many things that are not possible with simple templates.

Techniques for using function mode and its syntax are described by means of examples, which show you how to create functions to perform progressively more complex tasks.

The Function mode

Automatically fixing the case of headings in the document

Here, we will use one of the builtin functions in the editor to automatically change the case of all text inside heading tags to title case:

Find expression: <([Hh][1-6])[^>]*>.+?</\1>

For the function, simply choose the Title-case text (ignore tags) builtin function. This will change titles that look like <h1>some TITLE</h1> to <h1>Some Title</h1>. It will work even if there are other HTML tags inside the heading tags.

Your first custom function - smartening hyphens

The real power of function mode comes from being able to create your own functions to process text in arbitrary ways. The Smarten Punctuation tool in the editor leaves individual hyphens alone, so you can use this function to replace them with em-dashes.

To create a new function, simply click the Create/edit button and copy the Python code from below.

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('--', '—').replace('-', '—')

Every Search & replace custom function must have a unique name and consist of a Python function named replace that accepts all the arguments shown above. For the moment, we won’t worry about all the different arguments to the replace() function. Just focus on the match argument. It represents a match when running a search and replace. Its full documentation is available in the Python documentation for match objects. match.group() simply returns all the matched text, and all we do is replace hyphens in that text with em-dashes, first replacing double hyphens and then single hyphens.

Use this function with the find regular expression:

>[^<>]+<

And it will replace all hyphens with em-dashes, but only in actual text and not inside HTML tag definitions.
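
To see what this function does to each match, it can be exercised outside the editor with Python's standard re module. The following is only a minimal sketch: the sample HTML is made up, the direct call through re.sub() is an assumption standing in for the editor's own machinery, and the extra arguments are filled with placeholder values. The em-dash is written as '\u2014', which is the same character used in the function above.

```python
import re

EM_DASH = '\u2014'  # the em-dash character

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Replace double hyphens first, then any remaining single hyphens
    return match.group().replace('--', EM_DASH).replace('-', EM_DASH)

# Hypothetical sample: the hyphen inside the tag definition must stay untouched
html = '<p class="first-para">Wait--what - really?</p>'

# Stand-in for the editor: call replace() for every match of the find expression
result = re.sub(r'>[^<>]+<', lambda m: replace(m, 1, '', None, None, {}, {}), html)
print(result)
```

Because the find expression only matches text between a > and a <, the class attribute is never passed to replace() at all.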

The power of function mode - using a spelling dictionary to fix mis-hyphenated words

E-books created from scans of printed books often contain mis-hyphenated words, that is, words that were split at the end of a line in the printed text. We will write a simple function to automatically find and fix such words.

import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Try to remove the hyphen and replace the words if the resulting
        # hyphen free word is recognized by the dictionary
        without_hyphen = wmatch.group(1) + wmatch.group(2)
        if dictionaries.recognized(without_hyphen):
            return without_hyphen
        return wmatch.group()

    # Search for words split by a hyphen
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities

Use this function with the same find expression as before, namely:

>[^<>]+<

And it will fix all mis-hyphenated words in the text of the book. The main trick is to use one of the useful extra arguments to the replace function, dictionaries. This refers to the dictionaries the editor itself uses to spell check text in the book. The function looks for words separated by a hyphen, removes the hyphen and checks whether the dictionary recognizes the composite word. If it does, the original words are replaced by the hyphen-free composite word.

Note that one limitation of this technique is it will only work for mono-lingual books, because, by default, dictionaries.recognized() uses the main language of the book.
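
The core of the technique can be tried outside the editor with a stand-in for the dictionaries object. In this sketch, FakeDictionaries and fix_hyphenation are hypothetical names introduced for illustration, the word list is made up, and the standard re module is used in place of regex. Only the recognized() method is imitated.

```python
import re

class FakeDictionaries:
    """Hypothetical stand-in for the editor's dictionaries object."""

    def __init__(self, words):
        self.words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self.words

def fix_hyphenation(text, dictionaries):
    def replace_word(wmatch):
        # Join the two halves and keep the result only if it is a known word
        joined = wmatch.group(1) + wmatch.group(2)
        return joined if dictionaries.recognized(joined) else wmatch.group()
    return re.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text)

dictionaries = FakeDictionaries(['dictionary', 'hyphen'])
fixed = fix_hyphenation('a dic- tionary of well-known hy-phen rules', dictionaries)
print(fixed)
```

Note that well-known is left alone because the joined form is not in the word list, which is exactly the safeguard the dictionary lookup provides.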

Auto numbering sections

Now we will see something a little different. Suppose your HTML file has many sections, each with a heading in an <h2> tag that looks like <h2>Some text</h2>. You can create a custom function that will automatically number these headings with consecutive section numbers, so that they look like <h2>1. Some text</h2>.

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    section_number = '%d. ' % number
    return match.group(1) + section_number + match.group(2)

# Ensure that when running over multiple files, the files are processed
# in the order in which they appear in the book
replace.file_order = 'spine'

Use it with the find expression:

(?s)(<h2[^<>]*>)(.+?</h2>)

Place the cursor at the top of the file and click Replace all.

This function uses another of the useful extra arguments to replace(): the number argument. When doing a Replace All, number is automatically incremented for every successive match.

Another new feature is the use of replace.file_order: setting it to 'spine' means that if this search is run on multiple HTML files, the files are processed in the order in which they appear in the book. See Choose file order when running on multiple HTML files for details.
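
The effect of the number argument can be imitated outside the editor with a simple counter. This is a sketch under assumptions: the sample HTML and the file name index.html are made up, and itertools.count() merely stands in for the numbering that the editor performs during Replace All.

```python
import re
from itertools import count

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    section_number = '%d. ' % number
    return match.group(1) + section_number + match.group(2)

# Hypothetical sample document
html = '<h2>Intro</h2><p>text</p><h2 class="x">Methods</h2>'

counter = count(1)  # the first match gets number 1
numbered = re.sub(
    r'(?s)(<h2[^<>]*>)(.+?</h2>)',
    lambda m: replace(m, next(counter), 'index.html', None, None, {}, {}),
    html,
)
print(numbered)
```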

Auto create a Table of Contents

Finally, let's try something a little more ambitious. Suppose your book has headings in h1 and h2 tags that look like <h1 id="someid">Some Text</h1>. We will auto-generate an HTML Table of Contents based on these headings. Create the custom function below:

from calibre import replace_entities
from calibre.ebooks.oeb.polish.toc import TOC, toc_to_html
from calibre.gui2.tweak_book import current_container
from calibre.ebooks.oeb.base import xml2str

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # All matches found, output the resulting Table of Contents.
        # The argument metadata is the metadata of the book being edited
        if 'toc' in data:
            toc = data['toc']
            root = TOC()
            for (file_name, tag_name, anchor, text) in toc:
                parent = root.children[-1] if tag_name == 'h2' and root.children else root
                parent.add(text, file_name, anchor)
            toc = toc_to_html(root, current_container(), 'toc.html', 'Table of Contents for ' + metadata.title, metadata.language)
            print(xml2str(toc))
        else:
            print('No headings to build ToC from found')
    else:
        # Add an entry corresponding to this match to the Table of Contents
        if 'toc' not in data:
            # The entries are stored in the data object, which will persist
            # for all invocations of this function during a 'Replace All' operation
            data['toc'] = []
        tag_name, anchor, text = match.group(1), replace_entities(match.group(2)), replace_entities(match.group(3))
        data['toc'].append((file_name, tag_name, anchor, text))
        return match.group()  # We don't want to make any actual changes, so return the original matched text

# Ensure that we are called once after the last match is found so we can
# output the ToC
replace.call_after_last_match = True
# Ensure that when running over multiple files, the files are processed
# in the order in which they appear in the book
replace.file_order = 'spine'

And use it with the find expression:

<(h[12]) [^<>]* id=['"]([^'"]+)['"][^<>]*>([^<>]+)

Run the search on All text files. At the end of the search, a window will pop up with “Debug output from your function”, which will contain the HTML Table of Contents, ready to be pasted into toc.html.

The function above is heavily commented, so it should be easy to follow. The key new feature is the use of another useful extra argument to the replace() function, the data object. The data object is a Python dictionary that persists between all successive invocations of replace() during a single Replace All operation.

Another new feature is the use of call_after_last_match: setting that to True on the replace() function means that the editor will call replace() one extra time after all matches have been found. For this extra call, the match object will be None.

This was just a demonstration to show you the power of function mode. If you really need to generate a Table of Contents from headings in your book, you would be better off using the dedicated Table of Contents tool in Tools → Table of Contents.
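
The collect-then-report pattern used above can be tried without the editor or any calibre imports. In the sketch below, run_replace_all is a hypothetical helper that imitates Replace All under stated assumptions: it passes the same data dictionary to every call and, if call_after_last_match is set, makes one extra call with match set to None. How the editor does this internally is not shown here.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Extra call after the last match: report what was collected
        print('Found %d headings' % len(data.get('headings', [])))
        return None
    data.setdefault('headings', []).append(match.group(2))
    return match.group()  # leave the text unchanged

replace.call_after_last_match = True

def run_replace_all(pattern, text, func):
    """Hypothetical stand-in for the editor's Replace All."""
    data = {}
    number = 0

    def call(m):
        nonlocal number
        number += 1
        return func(m, number, 'index.html', None, None, data, {})

    result = re.sub(pattern, call, text)
    if getattr(func, 'call_after_last_match', False):
        func(None, number, 'index.html', None, None, data, {})
    return result, data

text = '<h1 id="a">One</h1><h2 id="b">Two</h2>'
result, data = run_replace_all(r'<(h[12])[^<>]*>([^<>]+)', text, replace)
```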

The API for the function mode

All function mode functions must be Python functions named replace, with the following signature:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return a_string

When a find/replace is run, the replace() function is called for every match that is found. It must return the replacement string for that match. If no replacements are to be done, it should return match.group(), which is the original string. The various arguments to the replace() function are documented below.

The match argument

The match argument represents the currently found match. It is a Python Match object. Its most useful method is group() which can be used to get the matched text corresponding to individual capture groups in the search regular expression.
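
As a short illustration of group(), the following sketch uses the standard re module on a made-up heading. The pattern and sample string are assumptions chosen for the example.

```python
import re

m = re.search(r'<(h[1-6])[^<>]*>([^<>]+)', '<h2 class="title">Some text</h2>')

whole = m.group()   # all the matched text, same as m.group(0)
tag = m.group(1)    # text of the first capture group
text = m.group(2)   # text of the second capture group

print(whole, tag, text, sep=' | ')
```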

The number argument

The number argument is the number of the current match. When you run Replace All, every successive match will cause replace() to be called with an increasing number. The first match has number 1.

The file_name argument

This is the filename of the file in which the current match was found. When searching inside marked text, the file_name is empty. The file_name is in canonical form, a path relative to the root of the book, using / as the path separator.
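
Because file_name always uses / as the separator, Python's posixpath module is a natural fit for taking it apart. The sketch below counts matches per file; the file names, the '(marked text)' label and the per_file key are made up for illustration.

```python
import posixpath
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # file_name is empty when searching inside marked text
    base = posixpath.basename(file_name) if file_name else '(marked text)'
    counts = data.setdefault('per_file', {})
    counts[base] = counts.get(base, 0) + 1
    return match.group()

# Simulate three calls with placeholder arguments
data = {}
m = re.search('x', 'x')
replace(m, 1, 'text/chapter1.html', None, None, data, {})
replace(m, 2, 'text/chapter1.html', None, None, data, {})
replace(m, 3, '', None, None, data, {})
print(data['per_file'])
```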

The metadata argument

This represents the metadata of the current book, such as title, authors, language, etc. It is an object of class calibre.ebooks.metadata.book.base.Metadata. Useful attributes include title, authors (a list of authors) and language (the language code).
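
A possible use of these attributes is filling placeholders in the text. In this sketch, the {{TITLE}} style placeholders are a made-up convention, and a SimpleNamespace with hypothetical values stands in for the real Metadata object. Only the three attributes named above are used.

```python
import re
from html import escape
from types import SimpleNamespace

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    values = {
        'TITLE': metadata.title,
        'AUTHORS': ' and '.join(metadata.authors),
        'LANG': metadata.language,
    }
    key = match.group(1)
    if key not in values:
        return match.group()  # unknown placeholder: leave unchanged
    return escape(values[key])  # escape &, < and > for use in HTML

# Hypothetical stand-in for the book's metadata
metadata = SimpleNamespace(title='My Book', authors=['A. Writer', 'B. Editor'], language='eng')

html = '<p>{{TITLE}} by {{AUTHORS}} ({{LANG}}, {{OTHER}})</p>'
out = re.sub(r'\{\{([A-Z]+)\}\}', lambda m: replace(m, 1, '', metadata, None, {}, {}), html)
print(out)
```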

The dictionaries argument

This represents the collection of dictionaries used for spell checking the current book. Its most useful method is dictionaries.recognized(word) which will return True if the passed in word is recognized by the dictionary for the current book’s language.

The data argument

This is a simple Python dictionary. When you run Replace all, every successive match will cause replace() to be called with the same dictionary as data. You can thus use it to store arbitrary data between invocations of replace() during a Replace all operation.
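
As an illustration of state carried between calls, the following sketch makes duplicate id attributes unique by remembering which ids have already been seen. The sample HTML, the seen key and the -2 suffix scheme are assumptions for the example, and the shared dictionary is created by hand to imitate what the editor does during Replace all.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    seen = data.setdefault('seen', {})
    ident = match.group(1)
    seen[ident] = seen.get(ident, 0) + 1
    if seen[ident] == 1:
        return match.group()  # first occurrence stays as it is
    return 'id="%s-%d"' % (ident, seen[ident])

data = {}  # the same dictionary is passed to every call
html = '<p id="a"></p><p id="b"></p><p id="a"></p>'
out = re.sub(r'id="([^"]+)"', lambda m: replace(m, 1, '', None, None, data, {}), html)
print(out)
```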

The functions argument

The functions argument gives you access to all other user defined functions. This is useful for code re-use. You can define utility functions in one place and re-use them in all your other functions. For example, suppose you create a function named My Function like this:

def utility():
    # do something
    ...

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

Then, in another function, you can access the utility() function like this:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    utility = functions['My Function']['utility']
    ...

You can also use the functions object to store persistent data, that can be re-used by other functions. For example, you could have one function that when run with Replace All collects some data and another function that uses it when it is run afterwards. Consider the following two functions:

# Function One
persistent_data = {}

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...
    persistent_data['something'] = 'some data'

# Function Two
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    persistent_data = functions['Function One']['persistent_data']
    ...
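
The lookup pattern functions['name']['attribute'] can be tried outside the editor with nested dictionaries. This is only a sketch: the plain dict used here is a hypothetical stand-in that supports the same two-level lookup, not the editor's real functions object, and the upper-casing utility is made up.

```python
import re

def utility(text):
    # Hypothetical shared helper
    return text.upper()

# Hypothetical stand-in: function name -> names defined in that function's code
functions = {'My Function': {'utility': utility, 'persistent_data': {}}}

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    shared_utility = functions['My Function']['utility']
    store = functions['My Function']['persistent_data']
    store['last'] = match.group()  # remember something for later use
    return shared_utility(match.group())

out = re.sub(r'\bcalibre\b', lambda m: replace(m, 1, '', None, None, {}, functions), 'about calibre')
print(out)
```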

Debugging your functions

You can debug the functions you create by using the standard print() function from Python. The output of print will be displayed in a popup window after the Find/replace has completed. You saw an example of using print() to output an entire table of contents above.
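
A typical debugging print reports the match number, the file and what was changed. The sketch below runs such a function outside the editor. The sample text and file name are made up, and capturing standard output with redirect_stdout only imitates the popup window, which the editor provides on its own.

```python
import io
import re
from contextlib import redirect_stdout

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Report each change so it shows up in the debug output
    print('match %d in %s: collapsed %d spaces' % (number, file_name, len(match.group())))
    return ' '

buffer = io.StringIO()  # stand-in for the debug output window
numbers = iter(range(1, 100))
with redirect_stdout(buffer):
    out = re.sub(r' {2,}', lambda m: replace(m, next(numbers), 'index.html', None, None, {}, {}), 'a   b  c')

debug_output = buffer.getvalue()
print(debug_output)
```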

Choose file order when running on multiple HTML files

When you run a Replace all on multiple HTML files, the order in which the files are processed depends on what files you have open for editing. You can force the search to process files in the order in which they appear in the book by setting the file_order attribute on your function, like this:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.file_order = 'spine'

file_order accepts two values, spine and spine-reverse which cause the search to process multiple files in the order they appear in the book, either forwards or backwards, respectively.

Having your function called an extra time after the last match is found

Sometimes, as in the auto generate table of contents example above, it is useful to have your function called an extra time after the last match is found. You can do this by setting the call_after_last_match attribute on your function, like this:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.call_after_last_match = True

Appending the output from the function to marked text

When running search and replace on marked text, it is sometimes useful to append some text to the end of the marked text. You can do that by setting the append_final_output_to_marked attribute on your function (note that you also need to set call_after_last_match), like this:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...
    return 'some text to append'

replace.call_after_last_match = True
replace.append_final_output_to_marked = True
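
The following sketch shows one way the two attributes could work together: counting the words in the marked text and appending the total as an HTML comment. run_on_marked_text is a hypothetical helper written for this example. It assumes that the return value of the final call (the one where match is None) is what gets appended, which is consistent with the example above but is not taken from the editor's source.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Final extra call: this return value is the text to append
        return '<!-- %d words -->' % data.get('words', 0)
    data['words'] = data.get('words', 0) + 1
    return match.group()

replace.call_after_last_match = True
replace.append_final_output_to_marked = True

def run_on_marked_text(pattern, marked, func):
    """Hypothetical stand-in for the editor, for illustration only."""
    data = {}
    number = 0

    def call(m):
        nonlocal number
        number += 1
        return func(m, number, '', None, None, data, {})

    result = re.sub(pattern, call, marked)
    if getattr(func, 'call_after_last_match', False):
        final = func(None, number, '', None, None, data, {})
        if final and getattr(func, 'append_final_output_to_marked', False):
            result += final
    return result

out = run_on_marked_text(r'\w+', 'one two three', replace)
print(out)
```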

Suppressing the result dialog when performing searches on marked text

You can also suppress the result dialog (which can slow down the repeated application of a search/replace on many blocks of text) by setting the suppress_result_dialog attribute on your function, like this:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    ...

replace.suppress_result_dialog = True

More examples

More useful examples, contributed by calibre users, can be found in the calibre E-book editor forum.