XPath tutorial

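This tutorial gives a gentle introduction to `XPath <https://en.wikipedia.org/wiki/XPath>`_, a query language that can be used in calibre to select arbitrary parts of `HTML <https://en.wikipedia.org/wiki/HTML>`_ documents. XPath is a widely used standard, and searching for it online will turn up a great deal of information. This tutorial, however, focuses on using XPath for e-book related tasks, such as finding chapter headings in an unstructured HTML document.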

Selecting by tag name

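The simplest form of selection is to select tags by name. For example, suppose you want to select all the ``<h2>`` tags in a document. The XPath query for this is simply::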

//h:h2        (Selects all <h2> tags)

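The prefix ``//`` means *search at any level of the document*. Now suppose you want to search for ``<span>`` tags that are inside ``<a>`` tags. That can be achieved with: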

//h:a/h:span    (Selects <span> tags inside <a> tags)

If you want to search for tags at a particular level in the document, change the prefix::

//h:body/h:div/h:p (Selects <p> tags that are children of <div> tags that are
             children of the <body> tag)

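This will match only ``<p>A very short e-book to demonstrate the use of XPath.</p>`` in the :ref:`sample_ebook`, and none of the other ``<p>`` tags. The ``h:`` prefix in the examples above is needed to match XHTML tags. This is because calibre represents all content internally as XHTML. In XHTML, tags have a *namespace*, and ``h:`` is the namespace prefix for HTML tags.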
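The three queries above can be tried outside calibre. The following is a minimal sketch, not calibre's own machinery: it assumes Python's standard-library ``xml.etree.ElementTree``, which supports only a subset of XPath, and a copy of the sample e-book with the XHTML namespace declared explicitly. The names ``SAMPLE`` and ``NS`` are introduced here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample e-book with the XHTML namespace declared explicitly,
# mirroring the statement that content is represented as XHTML.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>A very short e-book</title></head>
  <body>
    <h1 class="bookTitle">A very short e-book</h1>
    <p style="text-align:right">Written by Kovid Goyal</p>
    <div class="introduction">
      <p>A very short e-book to demonstrate the use of XPath.</p>
    </div>
    <h2 class="chapter">Chapter One</h2>
    <p>This is a truly fascinating chapter.</p>
    <h2 class="chapter">Chapter Two</h2>
    <p>A worthy continuation of a fine tradition.</p>
  </body>
</html>"""

# Map the h: prefix to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(SAMPLE)

# ElementTree paths are relative to the element searched, hence the leading ".".
h2_titles = [el.text for el in root.findall(".//h:h2", NS)]
intro_paras = [el.text for el in root.findall(".//h:body/h:div/h:p", NS)]

# Without the namespace prefix, the same tag name matches nothing.
unprefixed = root.findall(".//h2")

print(h2_titles)
print(intro_paras)
print(unprefixed)
```

The last query illustrates why the ``h:`` prefix matters: the elements live in the XHTML namespace, so a bare ``h2`` does not match them.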

Now suppose you want to select both ``<h1>`` and ``<h2>`` tags. To do that, we need an XPath construct called a *predicate*. A predicate is simply a test that is used to select tags. Tests can be arbitrarily powerful, and as this tutorial progresses you will see more powerful examples. A predicate is created by enclosing the test expression in square brackets:

//*[name()='h1' or name()='h2']

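There are several new features in this XPath expression. The first is the use of the wildcard ``*``. It means *match any tag*. Now look at the test expression ``name()='h1' or name()='h2'``. :term:`name()` is an example of a *built-in function*. It simply evaluates to the name of the tag. So by using it, we can select tags whose names are either `h1` or `h2`. Note that the :term:`name()` function ignores namespaces, so there is no need for the ``h:`` prefix. XPath has several useful built-in functions. A few more are introduced later in this tutorial.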
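To make the effect of this predicate concrete, here is a rough sketch in plain Python. It does not evaluate the XPath expression itself, because the standard-library ``xml.etree.ElementTree`` has no ``name()`` function; instead it imitates the predicate by walking every element and comparing the tag name with the namespace stripped off. The helper ``local_name`` and the shortened sample are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened version of the sample e-book, in the XHTML namespace.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1 class="bookTitle">A very short e-book</h1>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
</body></html>"""


def local_name(element):
    """Return the tag name without its '{namespace}' part (hypothetical helper)."""
    return element.tag.rpartition("}")[2]


root = ET.fromstring(SAMPLE)

# root.iter() visits every element in document order, much like the wildcard //*.
# The membership test plays the role of: name()='h1' or name()='h2'
headings = [el for el in root.iter() if local_name(el) in ("h1", "h2")]

heading_names = [local_name(el) for el in headings]
heading_texts = [el.text for el in headings]

print(heading_names)
print(heading_texts)
```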

Selecting by attributes

To select tags based on their attributes, you need to use predicates:

//*[@style]              (Select all tags that have a style attribute)
//*[@class="chapter"]    (Select all tags that have class="chapter")
//h:h1[@class="bookTitle"] (Select all h1 tags that have class="bookTitle")

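Here, the ``@`` operator refers to the attributes of the tag. You can use some of the `XPath built-in functions`_ to perform more sophisticated matching on attribute values.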
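The attribute predicates above happen to fall within the XPath subset that Python's standard-library ``xml.etree.ElementTree`` understands, so they can be sketched outside calibre as follows. The sample markup with an explicit XHTML namespace and the names ``SAMPLE`` and ``NS`` are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: part of the sample e-book, in the XHTML namespace.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h1 class="bookTitle">A very short e-book</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
</body></html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(SAMPLE)

# [@style] tests only that the attribute exists, whatever its value.
styled = [el.text for el in root.findall(".//*[@style]")]

# [@class='chapter'] tests the attribute's value, on any tag.
chapters = [el.text for el in root.findall(".//*[@class='chapter']")]

# Combining a tag name with an attribute test narrows the match further.
title = [el.text for el in root.findall(".//h:h1[@class='bookTitle']", NS)]

print(styled)
print(chapters)
print(title)
```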

Selecting by tag content

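Using XPath, you can even select tags based on the text they contain. The best way to do this is to use the power of *regular expressions* via the built-in function :term:`re:test()`: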

//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or
                                          section)

Here the ``.`` operator refers to the contents of the tag, just as the ``@`` operator refers to its attributes.
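The idea behind the content test can be approximated in plain Python. This sketch does not call ``re:test()`` itself, which the standard-library ``xml.etree.ElementTree`` does not provide; it selects the ``<h2>`` tags with a path query and then applies Python's ``re`` module to each tag's text content. The helper ``text_matches`` and the extra ``Preface`` heading (not part of the sample e-book) are assumptions added so that the filter has something to reject.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: two headings from the sample e-book plus a hypothetical
# "Preface" heading that should not match.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<h2>Preface</h2>
<h2 class="chapter">Chapter One</h2>
<h2 class="chapter">Chapter Two</h2>
</body></html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}


def text_matches(element, pattern, ignore_case=False):
    """Rough stand-in for re:test(., pattern, 'i') (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # "." refers to the tag's content: all text inside it, nested tags included.
    content = "".join(element.itertext())
    return re.search(pattern, content, flags) is not None


root = ET.fromstring(SAMPLE)
h2_tags = root.findall(".//h:h2", NS)

matched = [el.text for el in h2_tags
           if text_matches(el, "chapter|section", ignore_case=True)]
case_sensitive = [el.text for el in h2_tags
                  if text_matches(el, "chapter|section")]

print(matched)
print(case_sensitive)
```

Without the case-insensitive flag nothing matches here, because the headings spell "Chapter" with a capital letter while the pattern is lowercase.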

Sample e-book

<html>
    <head>
        <title>A very short e-book</title>
        <meta name="charset" value="utf-8" />
    </head>
    <body>
        <h1 class="bookTitle">A very short e-book</h1>
        <p style="text-align:right">Written by Kovid Goyal</p>
        <div class="introduction">
            <p>A very short e-book to demonstrate the use of XPath.</p>
        </div>

        <h2 class="chapter">Chapter One</h2>
        <p>This is a truly fascinating chapter.</p>

        <h2 class="chapter">Chapter Two</h2>
        <p>A worthy continuation of a fine tradition.</p>
    </body>
</html>

XPath built-in functions

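name(): The name of the current tag.
contains(): ``contains(s1, s2)`` returns `true` if `s1` contains `s2`.
re:test(): ``re:test(src, pattern, flags)`` returns `true` if the string `src` matches the regular expression `pattern`. A particularly useful flag is ``i``, which makes matching case insensitive. A good primer on regular expression syntax can be found at `regexp syntax <https://docs.python.org/library/re.html>`_.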
