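E-book conversion¶
calibre has a conversion system that is designed to be easy to use. Normally, you just add a book to calibre, click "Convert books", and calibre tries to generate output that matches the input as closely as possible. However, although calibre accepts a large number of input formats, not all of them convert automatically and cleanly to other e-book formats. For such input formats, or if you simply want more control over the conversion, calibre provides many options to fine-tune the process. Note, however, that calibre's conversion system is not a substitute for a full-blown e-book editor. To edit e-books, I recommend first converting them to EPUB or AZW3 with calibre, then using the Edit book feature to edit them, and finally using the edited e-book as the source for conversion to other formats.
This document refers mainly to the conversion settings found in the conversion dialog, pictured below. All these settings are also available through the command line interface to conversion, documented at generated/zh-CN/ebook-convert. In calibre, you can get help on any individual setting by hovering your mouse over it; a tooltip describing the setting will appear.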
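Introduction¶
The first thing to understand about the conversion system is that it is designed as a pipeline. Schematically, it looks like this:
The input is first converted to XHTML by the appropriate *Input plugin*. This XHTML is then *transformed*. In the last step, the processed XHTML is converted to the specified output format by the appropriate *Output plugin*. The results of the conversion can vary greatly depending on the input format. Some formats convert much better than others. A list of the best source formats for conversion is available `here <best-source-formats>`.
The transforms that act on the XHTML output are where all the work happens. There are various transforms, for example, to insert book metadata as a page at the start of the book, to detect chapter headings and automatically create a Table of Contents, to proportionally adjust font sizes, and so on. It is important to remember that all the transforms act on the XHTML output by the *Input plugin*, not on the input file itself. So, for example, if you ask calibre to convert an RTF file to EPUB, it will first be converted to XHTML internally, the various transforms will be applied to the XHTML, and then the *Output plugin* will create the EPUB file, automatically generating all metadata, the Table of Contents, and so on.
You can see this process in action by using the debug option. Just specify the path to a folder for the debug output. During conversion, calibre will place the XHTML generated by the various stages of the conversion pipeline in different sub-folders. The four sub-folders are: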
Folder | Description |
---|---|
input | This contains the HTML output by the Input plugin. Use this to debug the Input plugin. |
parsed | The result of pre-processing and converting to XHTML the output from the Input plugin. Use to debug structure detection. |
structure | Post structure detection, but before CSS flattening and font size conversion. Use to debug font size conversion and CSS transforms. |
processed | Just before the e-book is passed to the Output plugin. Use to debug the Output plugin. |
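If you want to edit the input document a little before having calibre convert it, the best thing to do is edit the files in the input sub-folder, then zip it up, and use the ZIP file as the input format for subsequent conversions. To do this, use the Edit metadata dialog to add the ZIP file as a format for the book and then, in the top left corner of the conversion dialog, select ZIP as the input format.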
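The zipping step can also be scripted. The following is a minimal sketch using only Python's standard library; the folder name debug_out and the archive name edited_input are hypothetical, and the demo creates a throwaway folder in place of real debug output.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_input_folder(debug_dir: Path, zip_base: Path) -> Path:
    """Zip the edited 'input' sub-folder of a debug output folder.

    debug_dir is whatever folder was given to the debug option;
    zip_base is the archive path without the .zip extension.
    """
    input_dir = debug_dir / "input"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"no 'input' sub-folder in {debug_dir}")
    # make_archive appends '.zip' and returns the final archive path.
    return Path(shutil.make_archive(str(zip_base), "zip", root_dir=input_dir))


# Demo with a throwaway folder standing in for real debug output.
work = Path(tempfile.mkdtemp())
(work / "debug_out" / "input").mkdir(parents=True)
(work / "debug_out" / "input" / "index.html").write_text("<p>edited</p>", encoding="utf-8")

archive = zip_input_folder(work / "debug_out", work / "edited_input")

# List the archive contents to confirm what will be handed to calibre.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
```

The resulting ZIP file keeps the HTML files at the top level of the archive, which is the layout you get by zipping the contents of the input sub-folder by hand.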
This document will deal mainly with the various transforms that operate on the intermediate XHTML and how to control them. At the end are some tips specific to each input/output format.
Look & feel¶
This group of options controls various aspects of the look and feel of the converted e-book.
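Fonts¶
One of the nicest features of the e-reading experience is the ability to easily adjust font sizes to suit individual needs and lighting conditions. calibre has sophisticated algorithms to ensure that all the books it outputs have consistent font sizes, no matter what font sizes are specified in the input document.
The base font size of a document is the most common font size in that document, i.e., the size of the bulk of the text. When you specify a Base font size, calibre automatically rescales all font sizes in the document proportionately, so that the most common font size becomes the specified base font size and other font sizes are rescaled appropriately. By choosing a larger base font size, you can make the fonts in the document larger and vice versa. When you set the base font size, for best results, you should also set the font size key.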
Normally, calibre will automatically choose a base font size appropriate to the output profile you have chosen (see Page setup). However, you can override this here in case the default is not suitable for you.
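The Font size key option lets you control how non-base font sizes are rescaled. The font rescaling algorithm works using a font size key, which is simply a comma-separated list of font sizes. The font size key tells calibre how many "steps" bigger or smaller a given font size should be compared to the base font size. The idea is that there should be a limited number of font sizes in a document. For example, one size for the body text, a couple of sizes for different levels of headings and a couple of sizes for super/sub scripts and footnotes. The font size key allows calibre to sort the font sizes in the input document into separate "bins" corresponding to the different logical font sizes.
Let's illustrate with an example. Suppose the source document we are converting was produced by someone with excellent eyesight and has a base font size of 8pt. This means the bulk of the text in the document is sized at 8pt, while headings are somewhat larger (say 10 and 12pt) and footnotes somewhat smaller at 6pt. Now if we use the following settings: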
Base font size : 12pt
Font size key : 7, 8, 10, 12, 14, 16, 18, 20
The output document will have a base font size of 12pt, headings of 14 and 16pt and footnotes of 8pt. Now suppose we want to make the largest heading size stand out more and make the footnotes a little larger as well. To achieve this, the font key should be changed to:
New font size key : 7, 9, 12, 14, 18, 20, 22
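The largest headings will now become 18pt, while the footnotes will become 9pt. You can experiment with these settings to find what works best for you by using the font rescaling wizard, which can be accessed by clicking the little button next to the Font size key setting.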
All the font size rescaling in the conversion can also be disabled here, if you would like to preserve the font sizes in the input document.
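A related setting is Line height. Line height controls the vertical height of lines. By default (a line height of 0), no manipulation of line heights is performed. If you specify a non-default value, line heights will be set in all locations that don't specify their own line heights. However, this is something of a blunt weapon and should be used sparingly. If you want to adjust the line heights for some section of the input, it's better to use the `Extra CSS <extra-css>`.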
In this section you can also tell calibre to embed any referenced fonts into the book. This will allow the fonts to work on reader devices even if they are not available on the device.
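Text¶
Text can be either justified or not. Justified text has extra spaces between words to give a smooth right margin. Some people prefer justified text, others do not. Normally, calibre will preserve the justification in the original document. If you want to override it, you can use the Text justification option in this section.
You can also tell calibre to Smarten punctuation, which will replace plain quotes, dashes and ellipses with their typographically correct alternatives. Note that this algorithm is not perfect, so it is worth reviewing the results. The reverse, namely Unsmarten punctuation, is also available.
Finally, there is Input character encoding. Older documents sometimes don't specify their character encoding. When converted, this can result in non-English characters or special characters like smart quotes being corrupted. calibre tries to auto-detect the character encoding of the source document, but it does not always succeed. You can force it to assume a particular character encoding by using this setting. `cp1252` is a common encoding for documents produced using Windows software. You should also read char-encoding-faq for more on encoding issues.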
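To see why a wrong encoding guess corrupts smart quotes, here is a small standalone Python illustration. It does not use calibre; the sample text is invented, and it only shows what happens when cp1252 bytes are decoded with a different encoding.

```python
# Text containing "smart" quotes, saved by an older Windows program as cp1252.
original = "\u201cHello\u201d \u2013 caf\u00e9"
raw_bytes = original.encode("cp1252")

# Guessing the wrong single-byte encoding silently corrupts the special characters...
wrong = raw_bytes.decode("latin-1")
# ...while the right one round-trips them intact.
right = raw_bytes.decode("cp1252")

# Strict UTF-8 decoding rejects these bytes outright.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Forcing the Input character encoding setting to `cp1252` corresponds to the `right` case here: the bytes are interpreted with the encoding they were actually written in.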
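Layout¶
Normally, paragraphs in XHTML are rendered with a blank line between them and no leading text indent. calibre has a couple of options to control this. Remove spacing between paragraphs forcefully ensures that all paragraphs have no inter-paragraph spacing. It also sets the text indent to 1.5em (can be changed) to mark the start of every paragraph. Insert blank line does the opposite, guaranteeing that there is exactly one blank line between each pair of paragraphs. Both these options are very comprehensive, removing spacing, or inserting it, for all paragraphs (technically <p> and <div> tags). This is so that you can just set the option and be sure that it performs as advertised, irrespective of how messy the input file is. The one exception is when the input file uses hard line breaks to implement inter-paragraph spacing.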
If you want to remove the spacing between all paragraphs, except a select few, don't use these options. Instead add the following CSS code to Extra CSS:
p, div { margin: 0pt; border: 0pt; text-indent: 1.5em }
.spacious { margin-bottom: 1em; text-indent: 0pt; }
Then, in your source document, mark the paragraphs that need spacing with class="spacious".
If your input document is not in HTML, use the Debug option, described in the Introduction, to get HTML (use the input sub-folder).
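Another useful option is Linearize tables. Some badly designed documents use tables to control the layout of text on the page. When converted, these documents often have text that runs off the page and other artifacts. This option will extract the content from the tables and present it in a linear fashion. Note that this option linearizes *all* tables, so only use it if you are sure the input document does not use tables for legitimate purposes, like presenting tabular information.
Styling¶
The Extra CSS option allows you to specify arbitrary CSS that will be applied to all HTML files in the input. This CSS is applied with very high priority and so should override most CSS present in the **input document** itself. You can use this setting to fine tune the presentation/layout of your document. For example, if you want all paragraphs of class `endnote` to be right aligned, just add: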
.endnote { text-align: right }
or if you want to change the indentation of all paragraphs:
p { text-indent: 5mm; }
Extra CSS is a very powerful option, but you do need an understanding of how CSS works to use it to its full potential. You can use the debug pipeline option described above to see what CSS is present in your input document.
A simpler option is to use Filter style information. This allows you to remove all CSS properties of the specified types from the document. For example, you can use it to remove all colors or fonts.
Transform styles¶
This is the most powerful styling related facility. You can use it to define rules that change styles based on various conditions. For example you can use it to change all green colors to blue, or remove all bold styling from the text or color all headings a certain color, etc.
Transform HTML¶
Similar to transform styles, but allows you to make changes to the HTML content of the book. You can replace one tag with another, add classes or other attributes to tags based on their content, etc.
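Page setup¶
The Page setup options are for controlling screen layout, like margins and screen sizes. There are options to set up page margins, which will be used by the Output plugin if the selected output format supports page margins. In addition, you should choose an Input profile and an Output profile. Both sets of profiles basically deal with how to interpret measurements in the input/output documents, screen sizes and default font rescaling keys.
If you know that the file you are converting was intended to be used on a particular device/software platform, choose the corresponding input profile, otherwise just choose the default input profile. If you know the files you are producing are meant for a particular device type, choose the corresponding output profile. Otherwise, choose one of the Generic output profiles. If you are converting to MOBI or AZW3 then you will almost always want to choose one of the Kindle output profiles. Otherwise, your best bet for modern e-book reading devices is to choose the Generic e-ink HD output profile.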
The output profile also controls the screen size. This will cause, for example, images to be auto-resized to be fit to the screen in some output formats. So choose a profile of a device that has a screen size similar to your device.
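Heuristic processing¶
Heuristic processing provides a variety of functions which can be used to try to detect and correct common problems in poorly formatted input documents. Use these functions if your input document suffers from poor formatting. Because these functions rely on common patterns, be aware that in some cases an option may lead to worse results, so use with care. As an example, several of these options will remove all non-breaking-space entities, or may include false positive matches relating to the function.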
- Enable heuristic processing
This option activates calibre's Heuristic processing stage of the conversion pipeline. This must be enabled in order for various sub-functions to be applied
- Unwrap lines
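Enabling this option will cause calibre to attempt to detect and correct hard line breaks that exist within a document using punctuation clues and line length. calibre will first attempt to detect whether hard line breaks exist; if they do not appear to exist, calibre will not attempt to unwrap lines. The line-unwrap factor can be reduced if you want to "force" calibre to unwrap lines.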
- Line-unwrap factor
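This option controls the algorithm calibre uses to remove hard line breaks. For example, if the value of this option is 0.4, that means calibre will remove hard line breaks from the end of lines whose lengths are less than the length of 40% of all lines in the document. If your document only has a few line breaks which need correction, then this value should be reduced to somewhere between 0.1 and 0.2.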
- Detect and markup unformatted chapter headings and sub headings
If your document does not have chapter headings and titles formatted differently from the rest of the text, calibre can use this option to attempt to detect them and surround them with heading tags. <h2> tags are used for chapter headings; <h3> tags are used for any titles that are detected.
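This function will not create a Table of Contents, but in many cases it will cause calibre's default chapter detection settings to correctly detect chapters and build a Table of Contents. Adjust the XPath under Structure detection if a Table of Contents is not automatically created. If there are no other headings used in the document, then setting "//h:h2" under Structure detection would be the easiest way to create a Table of Contents for the document.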
The inserted headings are not formatted, to apply formatting use the Extra CSS option under the Look and Feel conversion settings. For example, to center heading tags, use the following:
h2, h3 { text-align: center }
- Renumber sequences of <h1> or <h2> tags
Some publishers format chapter headings using multiple <h1> or <h2> tags sequentially. calibre's default conversion settings will cause such titles to be split into two pieces. This option will re-number the heading tags to prevent splitting.
- Delete blank lines between paragraphs
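This option will cause calibre to analyze blank lines included within the document. If every paragraph is interleaved with a blank line, then calibre will remove all those blank paragraphs. Sequences of multiple blank lines will be considered scene breaks and retained as a single paragraph. This option differs from the Remove paragraph spacing option under Look & feel in that it actually modifies the HTML content, while the other option modifies the document styles. This option can also remove paragraphs which were inserted using calibre's Insert blank line option.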
- Ensure scene breaks are consistently formatted
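With this option calibre will attempt to detect common scene-break markers and ensure that they are center aligned. "Soft" scene break markers, i.e. scene breaks only defined by extra white space, are styled to ensure that they will not be displayed in conjunction with page breaks.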
- Replace scene breaks
If this option is configured then calibre will replace scene break markers it finds with the replacement text specified by the user. Please note that some ornamental characters may not be supported across all reading devices.
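In general you should avoid using HTML tags; calibre will discard any tags and use pre-defined markup. <hr /> tags, i.e. horizontal rules, and <img> tags are exceptions. Horizontal rules can optionally be specified with styles; if you choose to add your own style, be sure to include the 'width' setting, otherwise the style information will be discarded. Image tags can be used, but calibre does not provide the ability to add the image during conversion; this must be done after the fact using the Edit book feature.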
- Example image tag (place the image within an 'Images' folder inside the EPUB after conversion):
<img style="width:10%" src="../Images/scenebreak.png" />
- Example horizontal rule with styles:
<hr style="width:20%;padding-top: 1px;border-top: 2px ridge black;border-bottom: 2px groove black;"/>
- Remove unnecessary hyphens
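When this option is enabled, calibre will analyze all hyphenated content in the document. The document itself is used as a dictionary for analysis. This allows calibre to accurately remove hyphens for any words in the document in any language, along with made-up and obscure scientific words. The primary drawback is that words appearing only a single time in the document will not be changed. Analysis happens in two passes. The first pass analyzes line endings. Lines are only unwrapped if the word exists with or without a hyphen in the document. The second pass analyzes all hyphenated words throughout the document; hyphens are removed if the word exists elsewhere in the document without a match.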
- Italicize common words and patterns
When enabled, calibre will look for common words and patterns that denote italics and italicize them. Examples are common text conventions such as ~word~ or phrases that should generally be italicized, e.g. latin phrases like 'etc.' or 'et cetera'.
- Replace entity indents with CSS indents
Some documents use a convention of defining text indents using non-breaking space entities. When this option is enabled calibre will attempt to detect this sort of formatting and convert them to a 3% text indent using CSS.
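Search & replace¶
These options are useful primarily for conversion of PDF documents or OCR conversions, though they can also be used to fix many document-specific problems. As an example, some conversions can leave behind page headers and footers in the text. These options use regular expressions to try to detect headers, footers, or other arbitrary text and remove or replace them. Remember that they operate on the intermediate XHTML produced by the conversion pipeline. There is a wizard to help you customize the regular expressions for your document. Click the magic wand beside the expression box, and click the Test button after composing your search expression. Successful matches will be highlighted in yellow.
The search works by using a Python regular expression. All matched text is simply removed from the document or replaced using the replacement pattern. The replacement pattern is optional; if left blank, text matching the search pattern will be deleted from the document. You can learn more about regular expressions and their syntax at regexptutorial.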
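Because the search uses Python regular expressions, an expression can be tried out in plain Python before pasting it into the dialog. The sketch below uses the standard `re` module on an invented fragment of intermediate XHTML; the header text and page-number pattern are assumptions for illustration, not what any particular book contains.

```python
import re

# Intermediate XHTML with a running header and a page-number footer left over
# from the source (invented sample text).
xhtml = (
    "<p>It was a dark night.</p>\n"
    "<p>MY BOOK TITLE</p>\n"
    "<p>Page 12</p>\n"
    "<p>The storm grew worse.</p>\n"
)

# Search expression: a paragraph holding only the header or a page number.
search = r"<p>(?:MY BOOK TITLE|Page \d+)</p>\n"
# Empty replacement: the matched text is removed from the document.
cleaned = re.sub(search, "", xhtml)

# Non-empty replacement: rewrite instead of deleting, reusing a captured group.
renumbered = re.sub(r"<p>Page (\d+)</p>", r"<p>[p. \1]</p>", xhtml)
```

Here `cleaned` keeps only the two story paragraphs, while `renumbered` turns the footer into `<p>[p. 12]</p>` and leaves everything else untouched.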
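Structure detection¶
Structure detection involves calibre trying its best to detect structural elements in the input document when they are not properly specified. For example, chapters, page breaks, headers, footers, etc. As you can imagine, this process varies widely from book to book. Fortunately, calibre has very powerful options to control this. With power comes complexity, but if you take the time to learn the complexity, you will find it well worth the effort.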
Chapters and page breaks¶
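calibre has two sets of options for chapter detection and inserting page breaks. This can sometimes be slightly confusing because, by default, calibre will insert page breaks before detected chapters as well as at the locations detected by the page breaks option. The reason for this is that there are often locations where page breaks should be inserted that are not chapter boundaries. Also, detected chapters can be optionally inserted into the auto-generated Table of Contents.
calibre uses *XPath*, a powerful language that allows the user to specify chapter boundaries/page breaks. XPath can seem a little daunting at first; fortunately, there is an `XPath tutorial <xpath-tutorial>` in the User Manual. Remember that structure detection operates on the intermediate XHTML produced by the conversion pipeline. Use the debug option described in the Introduction to figure out the appropriate settings for your book. There is also a button for an XPath wizard to help with the generation of simple XPath expressions.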
By default, calibre uses the following expression for detecting chapters:
//*[((name()='h1' or name()='h2') and re:test(., 'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
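This expression is rather complex, because it tries to handle a number of common cases simultaneously. What it means is that calibre will assume chapters start at either <h1> or <h2> tags that contain any of the words chapter, book, section or part, or at any tag that has the attribute class="chapter".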
A related option is Chapter mark, which allows you to control what calibre does when it detects a chapter. By default, it will insert a page break before the chapter. You can have it insert a ruled line instead of, or in addition to the page break. You can also have it do nothing.
The default setting for detecting page breaks is:
//*[name()='h1' or name()='h2']
which means that calibre will insert page breaks before every <h1> and <h2> tag by default.
Note
The default expressions may change depending on the input format you are converting.
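To make the default chapter expression more concrete, the sketch below emulates its logic in plain Python. It is an approximation written with the standard library (whose XPath subset has no `re:test`), not the code calibre runs, and the sample document is invented.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
# Same pattern and case-insensitive flag as in the default expression.
CHAPTER_WORDS = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)


def is_chapter(elem: ET.Element) -> bool:
    """Mirror the default expression: h1/h2 with a keyword, or class='chapter'."""
    local_name = elem.tag.replace(XHTML_NS, "")
    text = "".join(elem.itertext())  # string value of the element
    heading_match = local_name in ("h1", "h2") and CHAPTER_WORDS.search(text) is not None
    return heading_match or elem.get("class") == "chapter"


doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Chapter 1</h1><p>text</p>"
    "<h2>Interlude</h2>"
    '<p class="chapter">Epilogue</p>'
    "<h2>PART  two</h2>"
    "</body></html>"
)
# Collect the text of every element the emulated expression would flag.
chapters = ["".join(e.itertext()) for e in doc.iter() if is_chapter(e)]
```

In this sample the "Interlude" heading is skipped because it contains none of the keywords, while the paragraph with class="chapter" is picked up even though it is not a heading.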
Miscellaneous¶
There are a few more options in this section.
- Insert metadata as page at start of book
One of the great things about calibre is that it allows you to maintain very complete metadata about all of your books, for example, a rating, tags, comments, etc. This option will create a single page with all this metadata and insert it into the converted e-book, typically just after the cover. Think of it as a way to create your own customised book jacket.
- Remove first image
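Sometimes, the source document you are converting includes the cover as part of the book, instead of as a separate cover. If you also specify a cover in calibre, then the converted book will have two covers. This option will simply remove the first image from the source document, thereby ensuring that the converted book has only one cover, the one specified in calibre.
Table of Contents¶
When the input document has a Table of Contents in its metadata, calibre will just use that. However, a number of older formats either do not support a metadata-based Table of Contents, or individual documents do not have one. In these cases, the options in this section can help you automatically generate a Table of Contents in the converted e-book, based on the actual content in the input document.
Note
Using these options can be a little challenging to get exactly right. If you prefer creating/editing the Table of Contents by hand, convert to the EPUB or AZW3 formats and select the checkbox at the bottom of the Table of Contents section of the conversion dialog that says Manually fine-tune the Table of Contents after conversion. This will launch the ToC Editor tool after the conversion. It allows you to create entries in the Table of Contents by simply clicking the place in the book where you want the entry to point. You can also use the ToC Editor by itself, without doing a conversion. Go to Preferences->Interface->Toolbars and add the ToC Editor to the main toolbar. Then just select the book you want to edit and click the ToC Editor button.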
The first option is Force use of auto-generated Table of Contents. By checking this option you can have calibre override any Table of Contents found in the metadata of the input document with the auto generated one.
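The default way that the auto-generated Table of Contents is created is that calibre first tries to add any detected chapters to the generated Table of Contents. You can learn how to customize the detection of chapters in the Structure detection section above. If you do not want to include detected chapters in the generated Table of Contents, check the Do not add detected chapters option.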
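calibre will automatically filter duplicates from the generated Table of Contents. However, if there are some additional undesirable entries, you can filter them using the TOC filter option. This is a regular expression that is matched against the titles of entries in the generated Table of Contents. Whenever a match is found, the entry is removed. For example, to remove all entries titled "Next" or "Previous" use: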
Next|Previous
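The effect of such a filter can be previewed in plain Python. This is a sketch with invented entry titles; it assumes an entry is dropped when the expression matches anywhere in its title, which is an assumption about the matching mode rather than something stated above.

```python
import re

toc_filter = re.compile(r"Next|Previous")
entries = ["Chapter 1", "Next", "Chapter 2", "Previous page", "Appendix"]

# Assumption: a match anywhere in the title is enough to drop the entry.
kept = [title for title in entries if toc_filter.search(title) is None]
```

Under that assumption "Previous page" is removed as well as "Next"; anchor the expression (for example `^(Next|Previous)$`) if only exact titles should be filtered.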
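The Level 1,2,3 TOC options allow you to create a sophisticated multi-level Table of Contents. They are XPath expressions that match tags in the intermediate XHTML produced by the conversion pipeline. See the Introduction for how to get access to this XHTML. Also read the XPath tutorial to learn how to construct XPath expressions. Next to each option is a button that launches a wizard to help with the creation of basic XPath expressions. The following simple example illustrates how to use these options.
Suppose you have an input document that results in XHTML that looks like this: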
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sample document</title>
</head>
<body>
<h1>Chapter 1</h1>
...
<h2>Section 1.1</h2>
...
<h2>Section 1.2</h2>
...
<h1>Chapter 2</h1>
...
<h2>Section 2.1</h2>
...
</body>
</html>
Then, we set the options as:
Level 1 TOC : //h:h1
Level 2 TOC : //h:h2
This will result in an automatically generated two level Table of Contents that looks like:
Chapter 1
    Section 1.1
    Section 1.2
Chapter 2
    Section 2.1
Warning
Not all output formats support a multi level Table of Contents. You should first try with EPUB output. If that works, then try your format of choice.
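The two-level example above can be reproduced outside calibre to check what a pair of expressions will select. The sketch below uses Python's standard library XPath subset; nesting each level 2 match under the most recent level 1 match is an assumption about how the levels combine, chosen because it reproduces the listing shown above.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample document</title></head>
<body>
<h1>Chapter 1</h1>
<h2>Section 1.1</h2>
<h2>Section 1.2</h2>
<h1>Chapter 2</h1>
<h2>Section 2.1</h2>
</body>
</html>"""

root = ET.fromstring(XHTML)
level1 = root.findall(".//h:h1", NS)  # Level 1 TOC : //h:h1
level2 = root.findall(".//h:h2", NS)  # Level 2 TOC : //h:h2

toc = []  # list of (chapter title, [section titles])
for elem in root.iter():  # iter() walks the tree in document order
    if elem in level1:
        toc.append((elem.text, []))
    elif elem in level2 and toc:
        # Attach the section to the chapter that precedes it.
        toc[-1][1].append(elem.text)
```

After running this, `toc` holds two chapters, the first with two sections and the second with one.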
Using images as chapter titles when converting HTML input documents¶
Suppose you want to use an image as your chapter title, but still want calibre to be able to automatically generate a Table of Contents for you from the chapter titles. Use the following HTML markup to achieve this:
<html>
<body>
<h2>Chapter 1</h2>
<p>chapter 1 text...</p>
<h2 title="Chapter 2"><img src="chapter2.jpg" /></h2>
<p>chapter 2 text...</p>
</body>
</html>
Set the Level 1 TOC setting to //h:h2. Then, for chapter two, calibre will take the title from the value of the title attribute on the <h2> tag, since the tag has no text.
Using tag attributes to supply the text for entries in the Table of Contents¶
If you have particularly long chapter titles and want shortened versions in the Table of Contents, you can use the title attribute to achieve this, for example:
<html>
<body>
<h2 title="Chapter 1">Chapter 1: Some very long title</h2>
<p>chapter 1 text...</p>
<h2 title="Chapter 2">Chapter 2: Some other very long title</h2>
<p>chapter 2 text...</p>
</body>
</html>
Set the Level 1 TOC setting to //h:h2/@title. Then calibre will take the title from the value of the title attribute on the <h2> tags, instead of using the text inside the tag. Note the trailing /@title on the XPath expression; you can use this form to tell calibre to get the text from any attribute you like.
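The difference between //h:h2 and //h:h2/@title can be illustrated with a short Python sketch. The standard library's XPath subset cannot select attributes directly, so the attribute lookup is emulated with `get()`; the sample markup combines the two examples above.

```python
import xml.etree.ElementTree as ET

NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h2 title="Chapter 1">Chapter 1: Some very long title</h2>'
    '<h2 title="Chapter 2"><img src="chapter2.jpg" /></h2>'
    "</body></html>"
)
headings = root.findall(".//h:h2", NS)

# //h:h2 : use the tag text, falling back to the title attribute when it is empty.
from_text = ["".join(h.itertext()).strip() or h.get("title") for h in headings]
# //h:h2/@title : always use the attribute value.
from_attr = [h.get("title") for h in headings]
```

With //h:h2 the first entry keeps its long title and only the image-only heading falls back to its attribute; with //h:h2/@title both entries use the short attribute text.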
How options are set/saved for conversion¶
There are two places where conversion options can be set in calibre. The first is in Preferences->Conversion. These settings are the defaults for the conversion options. Whenever you try to convert a new book, the settings set here will be used by default.
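You can also change settings in the conversion dialog for each book conversion. When you convert a book, calibre remembers the settings you used for that book, so that if you convert it again, the saved settings for the individual book will take precedence over the defaults set in Preferences. You can restore the individual settings to defaults by using the Restore defaults button in the individual book conversion dialog. You can remove the saved settings for a group of books by selecting all the books and then clicking the Edit metadata button to bring up the bulk metadata edit dialog; near the bottom of the dialog is an option to remove stored conversion settings.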
When you bulk convert a set of books, settings are taken in the following order (last one wins):
From the defaults set in Preferences->Conversion
From the saved conversion settings for each book being converted (if any). This can be turned off by the option in the top left corner of the Bulk conversion dialog.
From the settings set in the Bulk conversion dialog
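Note that the final settings for each book in a Bulk conversion will be saved and re-used if the book is converted again. Since the highest priority in Bulk conversion is given to the settings in the Bulk conversion dialog, these will override any book-specific settings. So you should only bulk convert books together that need similar settings. The exceptions are metadata and input-format-specific settings. Since the Bulk conversion dialog does not have settings for these two categories, they will be taken from book-specific settings (if any) or the defaults.
Note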
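The "last one wins" order for bulk conversion can be pictured as a series of dictionary updates. The sketch below is only a model of the precedence rules described above; the function and the option names and values in it are hypothetical and do not come from calibre's code.

```python
def effective_settings(defaults, saved_for_book, bulk_dialog, use_saved=True):
    """Merge conversion settings; later sources override earlier ones."""
    merged = dict(defaults)             # 1. defaults from Preferences->Conversion
    if use_saved:
        merged.update(saved_for_book)   # 2. settings saved for this book, if enabled
    merged.update(bulk_dialog)          # 3. settings from the Bulk conversion dialog
    return merged


# Hypothetical option names and values, for illustration only.
defaults = {"base_font_size": 0, "margin_top": 5.0, "smarten_punctuation": False}
saved = {"base_font_size": 12, "smarten_punctuation": True}
bulk = {"base_font_size": 10}

with_saved = effective_settings(defaults, saved, bulk)
without_saved = effective_settings(defaults, saved, bulk, use_saved=False)
```

In `with_saved` the bulk dialog value overrides the saved font size, but the saved punctuation setting survives because the bulk dictionary does not mention it. Turning off the use of saved settings, as in `without_saved`, falls back to the default for that option.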
You can see the actual settings used during any conversion by clicking the rotating icon in the lower right corner and then double clicking the individual conversion job. This will bring up a conversion log that will contain the actual settings used, near the top.
Format specific tips¶
Here you will find tips specific to the conversion of particular formats. Options specific to particular format, whether input or output are available in the conversion dialog under their own section, for example TXT input or EPUB output.
Convert Microsoft Word documents¶
calibre can automatically convert .docx files created by Microsoft Word 2007 and newer. Just add the file to calibre and click convert.
Note
There is a demo .docx file that demonstrates the capabilities of the calibre conversion engine. Just download it and convert it to EPUB or AZW3 to see what calibre can do.
calibre will automatically generate a Table of Contents based on headings if you mark your headings with the Heading 1, Heading 2, etc. styles in Microsoft Word. Open the output e-book in the calibre E-book viewer and click the Table of Contents button to view the generated Table of Contents.
Older .doc files¶
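For older .doc files, you can save the document as HTML with Microsoft Word and then convert the resulting HTML file with calibre. When saving as HTML, be sure to use the "Save as Web Page, Filtered" option, as this will produce clean HTML that converts well. Note that Word produces really messy HTML; converting it can take a long time, so be patient. If you have a newer version of Word available, you can directly save it as .docx as well.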
Another alternative is to use the free LibreOffice. Open your .doc file in LibreOffice and save it as .docx, which can be directly converted in calibre.
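Convert TXT documents¶
TXT documents have no well defined way to specify formatting like bold, italics, etc., or document structure like paragraphs, headings, sections and so on, but there are a variety of conventions commonly used. By default, calibre attempts automatic detection of the correct formatting and markup based on those conventions.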
TXT input supports a number of options to differentiate how paragraphs are detected.
- Paragraph style: Auto
Analyzes the text file and attempts to automatically determine how paragraphs are defined. This option will generally work fine, if you achieve undesirable results try one of the manual options.
- Paragraph style: Block
Assumes one or more blank lines are a paragraph boundary:
This is the first.

This is the
second paragraph.
- Paragraph style: Single
Assumes that every line is a paragraph:
This is the first.
This is the second.
This is the third.
- Paragraph style: Print
Assumes that every paragraph starts with an indent (either a tab or 2+ spaces). Paragraphs end when the next line that starts with an indent is reached:
  This is the first.
  This is the second.
  This is the third.
- Paragraph style: Unformatted
Assumes that the document has no formatting, but does use hard line breaks. Punctuation and median line length are used to attempt to re-create paragraphs.
- Formatting style: Auto
Attempts to detect the type of formatting markup being used. If no markup is used then heuristic formatting will be applied.
- Formatting style: Heuristic
Analyzes the document for common chapter headings, scene breaks, and italicized words and applies the appropriate HTML markup during conversion.
- Formatting style: Markdown
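calibre also supports running TXT input through a transformation preprocessor known as Markdown. Markdown allows basic formatting to be added to TXT documents, such as bold, italics, section headings, tables, lists, a Table of Contents, etc. Marking chapter headings with a leading # and setting the chapter XPath detection expression to "//h:h1" is the easiest way to have a proper Table of Contents generated from a TXT document. You can learn more about the Markdown syntax at `daringfireball <https://daringfireball.net/projects/markdown/syntax>`_.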
- Formatting style: None
Applies no special formatting to the text, the document is converted to HTML with no other changes.
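The Block, Single and Print paragraph styles described above differ only in where they place paragraph boundaries. The following Python sketch implements the three rules as stated; it is an illustration, not calibre's TXT input code, and the function names and sample strings are invented.

```python
import re


def split_block(text: str) -> list[str]:
    """Block style: one or more blank lines separate paragraphs."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse the hard line breaks inside each paragraph into single spaces.
    return [" ".join(p.split()) for p in parts if p.strip()]


def split_single(text: str) -> list[str]:
    """Single style: every non-empty line is its own paragraph."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_print(text: str) -> list[str]:
    """Print style: a line starting with a tab or 2+ spaces opens a paragraph."""
    paragraphs: list[str] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if re.match(r"(\t| {2,})", line) or not paragraphs:
            paragraphs.append(line.strip())
        else:
            # No indent: this line continues the current paragraph.
            paragraphs[-1] += " " + line.strip()
    return paragraphs


block_text = "This is the first.\n\nThis is the\nsecond paragraph.\n"
print_text = "  This is the\nfirst.\n  This is the second.\n"
```

Running `split_block` and `split_single` on the same `block_text` shows why the choice matters: the first yields two paragraphs, the second three.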
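Convert PDF documents¶
PDF documents are one of the worst formats to convert from. They are a fixed page size and text placement format, which means it is very difficult to determine where one paragraph ends and another begins. calibre will try to unwrap paragraphs using a configurable Line un-wrapping factor. This is a scale used to determine the length at which a line should be unwrapped. Valid values are a decimal between 0 and 1. The default is 0.45, just under the median line length. Lower this value to include more text in the unwrapping. Increase it to include less. You can adjust this value in the conversion settings under PDF Input.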
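One way to picture the un-wrapping factor is as a position in the sorted list of line lengths: lines at least as long as the length found there are candidates for joining with the next line. The sketch below is a deliberately simplified model built on that reading plus a punctuation check; it is an assumption-laden illustration, not calibre's actual algorithm, and the function names and sample lines are invented.

```python
def unwrap_threshold(lines: list[str], factor: float) -> int:
    """Return the line length sitting at the given fraction of all line lengths."""
    lengths = sorted(len(line) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]


def unwrap(lines: list[str], factor: float = 0.45) -> list[str]:
    """Join a line to the next one when it is long and lacks closing punctuation."""
    threshold = unwrap_threshold(lines, factor)
    paragraphs: list[str] = []
    joining = False
    for line in lines:
        text = line.strip()
        if not text:
            joining = False
            continue
        if joining:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        # Keep joining only if this line looks like it was cut by a hard break.
        joining = len(text) >= threshold and not text.endswith((".", "!", "?", '"'))
    return paragraphs


sample = [
    "The rain kept falling on the old tin roof and",
    "nobody in the house could sleep that night.",
    "Morning came slowly, grey and quiet, while the",
    "river rose.",
    "THE END",
]
result = unwrap(sample)
```

In this model a lower factor lowers the length threshold, so more lines qualify for joining, and a higher factor raises it, so fewer do, which matches the direction described above.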
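PDF documents also often have headers and footers as part of the document, which end up included in the text. Use the Search and replace panel to remove headers and footers to mitigate this issue. If the headers and footers are not removed from the text, they can throw off the paragraph unwrapping. To learn how to use the header and footer removal options, read regexptutorial.
Some limitations of PDF input are: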
Complex, multi-column, and image based documents are not supported.
Extraction of vector images and tables from within the document is also not supported.
Some PDFs use special glyphs to represent ll or ff or fi, etc. Conversion of these may or may not work depending on just how they are represented internally in the PDF.
Links and Tables of Contents are not supported
PDFs that use embedded non-Unicode fonts to represent non-English characters will result in garbled output for those characters
Some PDFs are made up of photographs of the page with OCRed text behind them. In such cases calibre uses the OCRed text, which can be very different from what you see when you view the PDF file
PDFs that are used to display complex text, like right to left languages and math typesetting will not convert correctly
To re-iterate, PDF is a really, really bad format to use as input. If you absolutely must use PDF, then be prepared for an output ranging anywhere from decent to unusable, depending on the input PDF.
Comic book collections¶
A comic book collection is a .cbc file. A .cbc file is a ZIP file that contains other CBZ/CBR files. In addition the .cbc file must contain a simple text file called comics.txt, encoded in UTF-8. The comics.txt file must contain a list of the comics files inside the .cbc file, in the form filename:title, as shown below:
one.cbz:Chapter One
two.cbz:Chapter Two
three.cbz:Chapter Three
The .cbc file will then contain:
comics.txt
one.cbz
two.cbz
three.cbz
calibre will automatically convert this .cbc file into an e-book with a Table of Contents pointing to each entry in comics.txt.
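A .cbc file with this layout can be assembled with Python's standard `zipfile` module. The sketch below builds the archive in memory and uses placeholder bytes instead of real CBZ files, so the comic data is hypothetical; only the file names and the comics.txt format follow the description above.

```python
import io
import zipfile

# Hypothetical comics; real files would be read from disk instead.
comics = [
    ("one.cbz", "Chapter One", b"placeholder bytes for one.cbz"),
    ("two.cbz", "Chapter Two", b"placeholder bytes for two.cbz"),
    ("three.cbz", "Chapter Three", b"placeholder bytes for three.cbz"),
]

# comics.txt lists each archive as filename:title, one per line, in UTF-8.
index = "\n".join(f"{name}:{title}" for name, title, _ in comics) + "\n"

buffer = io.BytesIO()  # swap for open("collection.cbc", "wb") to write a file
with zipfile.ZipFile(buffer, "w") as cbc:
    cbc.writestr("comics.txt", index.encode("utf-8"))
    for name, _, data in comics:
        cbc.writestr(name, data)

# Re-open the finished archive to check what it contains.
names = zipfile.ZipFile(io.BytesIO(buffer.getvalue())).namelist()
```

The order of the lines in comics.txt is what you would edit to change the chapter titles shown in the resulting Table of Contents.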
EPUB advanced formatting demo¶
Various advanced formatting for EPUB files is demonstrated in this demo file. The file was created from hand coded HTML using calibre and is meant to be used as a template for your own EPUB creation efforts.
The source HTML it was created from is available as demo.zip. The settings used to create the EPUB from the ZIP file are:
ebook-convert demo.zip .epub -vv --authors "Kovid Goyal" --language en --level1-toc '//*[@class="title"]' --disable-font-rescaling --page-breaks-before / --no-default-epub-cover
Note that because this file explores the potential of EPUB, most of the advanced formatting is not going to work on readers less capable than calibre's built-in EPUB viewer.
Convert ODT documents¶
calibre can directly convert ODT (OpenDocument Text) files. You should use styles to format your document and minimize the use of direct formatting. When inserting images into your document you need to anchor them to the paragraph, images anchored to a page will all end up in the front of the conversion.
To enable automatic detection of chapters, you need to mark them with the built-in styles called Heading 1, Heading 2, ..., Heading 6 (Heading 1 equates to the HTML tag <h1>, Heading 2 to <h2>, etc.). When you convert in calibre you can enter which style you used into the 'Detect chapters at' box. Example:
If you mark Chapters with style Heading 2, you have to set the 'Detect chapters at' box to //h:h2
For a nested TOC with Sections marked with Heading 2 and the Chapters marked with Heading 3 you need to enter //h:h2|//h:h3. On the Convert - TOC page set the Level 1 TOC box to //h:h2 and the Level 2 TOC box to //h:h3.
Well-known document properties (Title, Keywords, Description, Creator) are recognized and calibre will use the first image (not too small, and with a good aspect ratio) as the cover image.
There is also an advanced property conversion mode, which is activated by setting the custom property opf.metadata ('Yes or No' type) to Yes in your ODT document (File->Properties->Custom Properties).
If this property is detected by calibre, the following custom properties are recognized (opf.authors overrides document creator):
opf.titlesort
opf.authors
opf.authorsort
opf.publisher
opf.pubdate
opf.isbn
opf.language
opf.series
opf.seriesindex
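In addition to this, you can specify the picture to use as the cover by naming it opf.cover (right click, Picture->Options->Name) in the ODT. If no picture with this name is found, the 'smart' method is used. As the cover detection might result in double covers in certain output formats, the process will remove the paragraph (only if the only content is the cover!) from the document. But this works only with the named picture!
To disable cover detection you can set the custom property opf.nocover ('Yes or No' type) to Yes in advanced mode.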
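Convert to PDF¶
The first, most important, setting to decide on when converting to PDF is the page size. By default, calibre uses a page size of "U.S. Letter". You can change this to another standard page size or a completely custom size in the PDF Output section of the conversion dialog. If you are generating a PDF to be used on a specific device, you can turn on the option to use the page size from the Output profile instead. So if your output profile is set to Kindle, calibre will create a PDF with a page size suitable for viewing on the small Kindle screen.
Printable Table of Contents¶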
You can also insert a printable Table of Contents at the end of the PDF that lists the page numbers for every section. This is very useful if you intend to print out the PDF to paper. If you wish to use the PDF on an electronic device, then the PDF Outline provides this functionality and is generated by default.
You can customize the look of the generated Table of contents by using the Extra CSS conversion setting under the Look & feel part of the conversion dialog. The default CSS used is listed below, simply copy it and make whatever changes you like.
.calibre-pdf-toc table { width: 100% }
.calibre-pdf-toc table tr td:last-of-type { text-align: right }
.calibre-pdf-toc .level-0 {
font-size: larger;
}
.calibre-pdf-toc .level-1 td:first-of-type { padding-left: 1.4em }
.calibre-pdf-toc .level-2 td:first-of-type { padding-left: 2.8em }
Custom page margins for individual HTML files¶
If you are converting an EPUB or AZW3 file with multiple individual HTML files inside it and you want to change the page margins for a particular HTML file you can add the following style block to the HTML file using the calibre E-book editor:
<style>
@page {
margin-left: 10pt;
margin-right: 10pt;
margin-top: 10pt;
margin-bottom: 10pt;
}
</style>
Then, in the PDF output section of the conversion dialog, turn on the option to Use page margins from the document being converted. Now all pages generated from this HTML file will have 10pt margins.