API documentation for recipes¶
The API for writing recipes is defined by the BasicNewsRecipe
- class calibre.web.feeds.news.BasicNewsRecipe(options, log, progress_reporter)[source]¶
Base class that contains the logic needed in all recipes. By progressively overriding more and more of the functionality in this class, you can make progressively more customized/powerful recipes. For a tutorial introduction to creating recipes, see Adding your favorite news website.
- abort_article(msg=None)[source]¶
Call this method inside any of the preprocess methods to abort the download of the current article. Useful to skip articles that contain inappropriate content, such as pure video articles.
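For example, a minimal sketch (the URL test is a hypothetical way of detecting video-only articles):

    def preprocess_raw_html(self, raw_html, url):
        # Hypothetical check: this site serves video-only articles under /video/
        if '/video/' in url:
            self.abort_article('Skipping video-only article')
        return raw_html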
- abort_recipe_processing(msg)[source]¶
Causes the recipe download system to abort the download of this recipe, displaying a simple feedback message to the user.
- add_toc_thumbnail(article, src)[source]¶
Call this from populate_article_metadata with the src attribute of an <img> tag from the article that is appropriate for use as the thumbnail representing the article in the Table of Contents. Whether the thumbnail is actually used is device dependent (currently only used by the Kindles). Note that the referenced image must be one that was successfully downloaded, otherwise it will be ignored.
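A minimal sketch, assuming the first downloaded image on the first page makes a reasonable thumbnail:

    def populate_article_metadata(self, article, soup, first):
        if first:
            # Use the first image in the article, if any, as the TOC thumbnail
            img = soup.find('img', src=True)
            if img is not None:
                self.add_toc_thumbnail(article, img['src'])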
- classmethod adeify_images(soup)[source]¶
If your recipe, when converted to EPUB, has problems with images when viewed in Adobe Digital Editions, call this method from postprocess_html().
- canonicalize_internal_url(url, is_link=True)[source]¶
Return a set of canonical representations of url. The default implementation uses just the server hostname and path of the URL, ignoring any query parameters, fragments, etc. The canonical representations must be unique across all URLs for this news source. If they are not, internal links may be resolved incorrectly.
- Parameters:
is_link – True if the URL is derived from an internal link in an HTML file. False if the URL is the URL used to download an article.
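A sketch of an override, assuming a hypothetical site where articles are identified by an ?id= query parameter rather than by their path:

    def canonicalize_internal_url(self, url, is_link=True):
        from urllib.parse import urlparse, parse_qs
        q = parse_qs(urlparse(url).query)
        if 'id' in q:
            # The article id alone is the canonical representation
            return frozenset(q['id'])
        return super().canonicalize_internal_url(url, is_link=is_link)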
- cleanup()[source]¶
Called after all articles have been downloaded. Use it to do any cleanup, such as logging out of subscription sites, etc.
- clone_browser(br)[source]¶
Clone the browser br. Cloned browsers are used for multi-threaded downloads, since mechanize is not thread safe. The default cloning routines should capture most browser customization, but if you do something exotic in your recipe, you should override this method in your recipe and clone manually.
Cloned browser instances use the same, thread-safe CookieJar by default, unless you have customized cookie handling.
- default_cover(cover_file)[source]¶
Create a generic cover for recipes that do not have a cover.
- download()[source]¶
Download and pre-process all articles from the feeds in this recipe. This method should be called only once on a particular Recipe instance. Calling it more than once will lead to undefined behavior.
- Returns:
Path to index.html
- extract_readable_article(html, url)[source]¶
Extracts main article content from “html”, cleans up and returns as a (article_html, extracted_title) tuple. Based on the original readability algorithm by Arc90.
- get_article_url(article)[source]¶
Override in a subclass to customize extraction of the URL that points to the content for each article. Return the article URL. It is called with article, an object representing a parsed article from a feed. See feedparser. By default it looks for the original link (for feeds syndicated via a service like FeedBurner or Pheedo) and if found, returns that or else returns article.link.
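A minimal sketch that mirrors the documented default behavior (feedparser typically exposes the FeedBurner original link as feedburner_origlink):

    def get_article_url(self, article):
        # Prefer the original link for syndicated feeds, else the regular link
        return article.get('feedburner_origlink', article.get('link'))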
- get_browser(*args, **kwargs)[source]¶
Return a browser instance used to fetch documents from the web. By default it returns a mechanize browser instance that supports cookies, ignores robots.txt, handles refreshes and has a random common user agent.
To customize the browser override this method in your sub-class as:
    def get_browser(self, *a, **kw):
        br = super().get_browser(*a, **kw)
        # Add some headers
        br.addheaders += [
            ('My-Header', 'one'),
            ('My-Header2', 'two'),
        ]
        # Set some cookies
        br.set_cookie('name', 'value')
        br.set_cookie('name2', 'value2', domain='.mydomain.com')
        # Make a POST request with some data
        br.open('https://someurl.com', {'username': 'def', 'password': 'pwd'}).read()
        # Do a login via a simple web form (only supported with mechanize browsers)
        if self.username is not None and self.password is not None:
            br.open('https://www.nytimes.com/auth/login')
            br.select_form(name='login')
            br['USERID'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br
- get_cover_url()[source]¶
Return a URL to the cover image for this issue or None. By default it returns the value of the member self.cover_url which is normally None. If you want your recipe to download a cover for the e-book override this method in your subclass, or set the member variable self.cover_url before this method is called.
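A minimal sketch of an override, using a hypothetical cover URL pattern based on today's date:

    def get_cover_url(self):
        from datetime import date
        # Hypothetical pattern: the site publishes covers at a date-based URL
        return 'https://example.com/covers/%s.jpg' % date.today().isoformat()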
- get_extra_css()[source]¶
By default returns self.extra_css. Override if you want to programmatically generate the extra_css.
- get_feeds()[source]¶
Return a list of RSS feeds to fetch for this profile. Each element of the list must be a 2-element tuple of the form (title, url). If title is None or an empty string, the title from the feed is used. This method is useful if your recipe needs to do some processing to figure out the list of feeds to download. If so, override in your subclass.
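A minimal sketch of an override (the feed URLs are hypothetical):

    def get_feeds(self):
        return [
            ('Top stories', 'https://example.com/rss/top.xml'),
            # With None as the title, the title from the feed itself is used
            (None, 'https://example.com/rss/world.xml'),
        ]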
- get_masthead_title()[source]¶
Override in subclass to use something other than the recipe title
- get_masthead_url()[source]¶
Return a URL to the masthead image for this issue or None. By default it returns the value of the member self.masthead_url which is normally None. If you want your recipe to download a masthead for the e-book override this method in your subclass, or set the member variable self.masthead_url before this method is called. Masthead images are used in Kindle MOBI files.
- get_obfuscated_article(url)[source]¶
If you set articles_are_obfuscated this method is called with every article URL. It should return the path to a file on the filesystem that contains the article HTML. That file is processed by the recursive HTML fetching engine, so it can contain links to pages/images on the web. Alternately, you can return a dictionary of the form: {“data”: <HTML data>, “url”: <the resolved URL of the article>}. This avoids needing to create temporary files. The url key in the dictionary is useful if the effective URL of the article is different from the URL passed into this method, for example, because of redirects. It can be omitted if the URL is unchanged.
This method is typically useful for sites that try to make it difficult to access article content automatically.
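A minimal sketch that returns the dictionary form, fetching the page with the recipe's own browser and avoiding a temporary file:

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        res = br.open(url)
        # geturl() reports the effective URL after any redirects
        return {'data': res.read(), 'url': res.geturl()}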
- get_url_specific_delay(url)[source]¶
Return the delay in seconds before downloading this URL. If you want to programmatically determine the delay for the specified URL, override this method in your subclass, returning self.delay by default for URLs you do not want to affect.
- Returns:
A floating point number, the delay in seconds.
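A minimal sketch of an override (the slow sub-domain is hypothetical):

    def get_url_specific_delay(self, url):
        # Be extra polite to a hypothetical slow media server
        if 'media.example.com' in url:
            return 5.0
        return self.delay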
- classmethod image_url_processor(baseurl, url)[source]¶
Perform some processing on image urls (perhaps removing size restrictions for dynamically generated images, etc.) and return the processed URL. Return None or an empty string to skip fetching the image.
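A sketch, assuming a hypothetical CDN that encodes a size restriction as a /resize/WxH/ path component that can simply be dropped:

    @classmethod
    def image_url_processor(cls, baseurl, url):
        import re
        # Strip the size-limiting path component to get the full-size image
        return re.sub(r'/resize/\d+x\d+/', '/', url)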
- index_to_soup(url_or_raw, raw=False, as_tree=False, save_raw=None)[source]¶
Convenience method that takes a URL to the index page and returns a BeautifulSoup of it.
url_or_raw: Either a URL or the downloaded index page as a string
- is_link_wanted(url, tag)[source]¶
Return True if the link should be followed or False otherwise. By default, raises NotImplementedError which causes the downloader to ignore it.
- Parameters:
url – The URL to be followed
tag – The tag from which the URL was derived
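A minimal sketch of an override (the pagination pattern is a hypothetical example):

    def is_link_wanted(self, url, tag):
        # Follow only pagination links such as ?page=2
        return 'page=' in url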
- parse_feeds()[source]¶
Create a list of articles from the list of feeds returned by BasicNewsRecipe.get_feeds(). Return a list of Feed objects.
- parse_index()[source]¶
This method should be implemented in recipes that parse a website instead of feeds to generate a list of articles. Typical uses are for news sources that have a "Print Edition" webpage that lists all the articles in the current print edition. If this function is implemented, it will be used in preference to BasicNewsRecipe.parse_feeds(). It must return a list. Each element of the list must be a 2-element tuple of the form ('feed title', list of articles). Each list of articles must contain dictionaries of the form:
    {
        'title'       : article title,
        'url'         : URL of print version,
        'date'        : The publication date of the article as a string,
        'description' : A summary of the article,
        'content'     : The full article (can be an empty string). Obsolete, do
                        not use; instead save the content to a temporary file
                        and pass file:///path/to/temp/file.html as the URL.
    }
For an example, see the recipe for downloading The Atlantic. In addition, you can add “author” for the author of the article.
If you want to abort processing for some reason and have calibre show the user a simple message instead of an error, call abort_recipe_processing().
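A minimal sketch of a parse_index() implementation (the index URL and CSS class are hypothetical):

    def parse_index(self):
        soup = self.index_to_soup('https://example.com/print-edition')
        articles = []
        for a in soup.findAll('a', attrs={'class': 'article-link'}):
            articles.append({
                'title': self.tag_to_string(a),
                'url': a['href'],
                'date': '',
                'description': '',
            })
        return [('Print Edition', articles)]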
- populate_article_metadata(article, soup, first)[source]¶
Called when each HTML page belonging to article is downloaded. Intended to be used to get article metadata like author/summary/etc. from the parsed HTML (soup).
- Parameters:
article – An object of class calibre.web.feeds.Article. If you change the summary, remember to also change the text_summary.
soup – Parsed HTML belonging to this article
first – True iff the parsed HTML is the first page of the article.
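A minimal sketch, assuming a hypothetical byline element in the article markup:

    def populate_article_metadata(self, article, soup, first):
        if first:
            byline = soup.find('span', attrs={'class': 'byline'})
            if byline is not None:
                article.author = self.tag_to_string(byline)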
- postprocess_book(oeb, opts, log)[source]¶
Run any needed post processing on the parsed downloaded e-book.
- Parameters:
oeb – An OEBBook object
opts – Conversion options
- postprocess_html(soup, first_fetch)[source]¶
This method is called with the source of each downloaded HTML file, after it is parsed for links and images. It can be used to do arbitrarily powerful post-processing on the HTML. It should return soup after processing it.
- Parameters:
soup – A BeautifulSoup instance containing the downloaded HTML.
first_fetch – True if this is the first page of an article.
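A minimal sketch (the widget class name is hypothetical):

    def postprocess_html(self, soup, first_fetch):
        # Drop share widgets that survived earlier cleanup
        for div in soup.findAll('div', attrs={'class': 'share-bar'}):
            div.decompose()
        return soup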
- preprocess_html(soup)[source]¶
This method is called with the source of each downloaded HTML file, before it is parsed for links and images. It is called after the cleanup as specified by remove_tags etc. It can be used to do arbitrarily powerful pre-processing on the HTML. It should return soup after processing it.
soup: A BeautifulSoup instance containing the downloaded HTML.
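A minimal sketch, assuming a hypothetical site that lazy-loads images via a data-src attribute:

    def preprocess_html(self, soup):
        # Promote data-src to src so the images are actually downloaded
        for img in soup.findAll('img', attrs={'data-src': True}):
            img['src'] = img['data-src']
        return soup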
- preprocess_image(img_data, image_url)[source]¶
Perform some processing on downloaded image data. This is called on the raw data before any resizing is done. Must return the processed raw data. Return None to skip the image.
- preprocess_raw_html(raw_html, url)[source]¶
This method is called with the source of each downloaded HTML file, before it is parsed into an object tree. raw_html is a unicode string representing the raw HTML downloaded from the web. url is the URL from which the HTML was downloaded.
Note that this method acts before preprocess_regexps.
This method must return the processed raw_html as a unicode object.
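A sketch, assuming a hypothetical site that embeds the article as a JSON island in the page source:

    def preprocess_raw_html(self, raw_html, url):
        import json
        import re
        # Hypothetical pattern; real sites will differ
        m = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});', raw_html, re.DOTALL)
        if m is None:
            return raw_html
        data = json.loads(m.group(1))
        # Rebuild a minimal HTML document from the extracted fields
        return '<html><body><h1>%s</h1>%s</body></html>' % (
            data['headline'], data['body'])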
- classmethod print_version(url)[source]¶
Take a url pointing to the webpage with article content and return the URL pointing to the print version of the article. By default does nothing. For example:
    def print_version(self, url):
        return url + '?&pagewanted=print'
- publication_date()[source]¶
Use this method to set the date when this issue was published. Defaults to the moment of download. Must return a datetime.datetime object.
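A minimal sketch, assuming a hypothetical weekly that always appears on the preceding Saturday:

    def publication_date(self):
        import datetime
        today = datetime.datetime.now()
        # Monday is weekday 0, Saturday is 5: step back to the last Saturday
        return today - datetime.timedelta(days=(today.weekday() + 2) % 7)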
- skip_ad_pages(soup)[source]¶
This method is called with the source of each downloaded HTML file, before any of the cleanup attributes like remove_tags, keep_only_tags are applied. Note that preprocess_regexps will have already been applied. It is meant to allow the recipe to skip ad pages. If the soup represents an ad page, return the HTML of the real page. Otherwise return None.
soup: A BeautifulSoup instance containing the downloaded HTML.
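A minimal sketch, assuming a hypothetical interstitial ad page that carries a 'continue to article' link:

    def skip_ad_pages(self, soup):
        a = soup.find('a', attrs={'class': 'skip-ad'}, href=True)
        if a is not None:
            # Return the raw HTML of the real article page
            return self.index_to_soup(a['href'], raw=True)
        return None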
- sort_index_by(index, weights)[source]¶
Convenience method to sort the titles in index according to weights. index is sorted in place. Returns index.
index: A list of titles.
weights: A dictionary that maps weights to titles. If any titles in index are not in weights, they are assumed to have a weight of 0.
- classmethod tag_to_string(tag, use_alt=True, normalize_whitespace=True)[source]¶
Convenience method to take a BeautifulSoup Tag and extract the text from it recursively, including any CDATA sections and alt tag attributes. Return a possibly empty Unicode string.
tag: BeautifulSoup Tag
use_alt: If True try to use the alt attribute for tags that don't have any textual content
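Typical usage inside parse_index() (the headline markup is hypothetical):

    headline = soup.find('h1', attrs={'class': 'headline'})
    if headline is not None:
        title = self.tag_to_string(headline)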
- articles_are_obfuscated = False¶
Set to True and implement get_obfuscated_article() to handle websites that try to make it difficult to scrape content.
- auto_cleanup = False¶
Automatically extract all the text from downloaded article pages. Uses the algorithms from the readability project. Setting this to True means that you do not have to worry about cleaning up the downloaded HTML manually (though manual cleanup will always be superior).
- auto_cleanup_keep = None¶
Specify elements that the auto cleanup algorithm should never remove. The syntax is an XPath expression. For example:
    # Keep all divs with id="article-image"
    auto_cleanup_keep = '//div[@id="article-image"]'
    # Keep all elements with class="important"
    auto_cleanup_keep = '//*[@class="important"]'
    # Keep all divs with id="article-image" and spans with class="important"
    auto_cleanup_keep = '//div[@id="article-image"]|//span[@class="important"]'
- browser_type = 'mechanize'¶
The simulated browser engine to use when downloading from servers. The default is to use the Python mechanize browser engine, which supports logging in. However, if you don't need logging in, consider changing this to either “webengine”, which uses an actual Chromium browser to do the network requests, or “qt”, which uses the Qt Networking backend. Both “webengine” and “qt” support HTTP/2 (which mechanize does not) and are thus harder for bot protection services to fingerprint.
- center_navbar = True¶
If True the navigation bar is center aligned, otherwise it is left aligned
- compress_news_images = False¶
Set this to False to ignore all scaling and compression parameters and pass images through unmodified. If True and the other compression parameters are left at their default values, images will be scaled to fit in the screen dimensions set by the output profile and compressed to size at most (w * h)/16 where w x h are the scaled image dimensions.
- compress_news_images_auto_size = 16¶
The factor used when auto compressing JPEG images. If set to None, auto compression is disabled. Otherwise, the images will be reduced in size to (w * h)/compress_news_images_auto_size bytes if possible by reducing the quality level, where w x h are the image dimensions in pixels. The minimum JPEG quality will be 5/100 so it is possible this constraint will not be met. This parameter can be overridden by the parameter compress_news_images_max_size which provides a fixed maximum size for images. Note that if you enable scale_news_images_to_device then the image will first be scaled and then its quality lowered until its size is less than (w * h)/factor where w and h are now the scaled image dimensions. In other words, this compression happens after scaling.
- compress_news_images_max_size = None¶
Set JPEG quality so images do not exceed the size given (in KBytes). If set, this parameter overrides auto compression via compress_news_images_auto_size. The minimum JPEG quality will be 5/100 so it is possible this constraint will not be met.
- conversion_options = {}¶
Recipe specific options to control the conversion of the downloaded content into an e-book. These will override any user or plugin specified values, so only use if absolutely necessary. For example:
    conversion_options = {
        'base_font_size': 16,
        'linearize_tables': True,
    }
- cover_margins = (0, 0, '#ffffff')¶
By default, the cover image returned by get_cover_url() will be used as the cover for the periodical. Overriding this in your recipe instructs calibre to render the downloaded cover into a frame whose width and height are expressed as a percentage of the downloaded cover. cover_margins = (10, 15, '#ffffff') pads the cover with a white margin 10px on the left and right, 15px on the top and bottom. Color names are defined here. Note that for some reason, white does not always work in Windows. Use #ffffff instead.
- delay = 0¶
The default delay between consecutive downloads in seconds. The argument may be a floating point number to indicate a more precise time. See get_url_specific_delay() to implement per URL delays.
- description = ''¶
A couple of lines that describe the content this recipe downloads. This will be used primarily in a GUI that presents a list of recipes.
- encoding = None¶
Specify an override encoding for sites that have an incorrect charset specification. The most common case is a site specifying latin1 while actually using cp1252. If None, try to detect the encoding. If it is a callable, the callable is called with two arguments: the recipe object and the source to be decoded. It must return the decoded source.
- extra_css = None¶
Specify any extra CSS that should be added to downloaded HTML files. It will be inserted into <style> tags, just before the closing </head> tag, thereby overriding all CSS except that which is declared using the style attribute on individual HTML tags. Note that if you want to programmatically generate the extra_css, override the get_extra_css() method instead. For example:

    extra_css = '.heading { font: serif x-large }'
- feeds = None¶
List of feeds to download. Can be either [url1, url2, ...] or [('title1', url1), ('title2', url2), ...].
- filter_regexps = []¶
List of regular expressions that determine which links to ignore. If empty, it is ignored. Used only if is_link_wanted is not implemented. For example:
filter_regexps = [r'ads\.doubleclick\.net']
will remove all URLs that have ads.doubleclick.net in them.
Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
- handle_gzip = True¶
Set to False if you do not want to use gzipped transfers with the mechanize browser. Note that some old servers flake out with gzip.
- ignore_duplicate_articles = None¶
Ignore duplicates of articles that are present in more than one section. A duplicate article is an article that has the same title and/or URL. To ignore articles with the same title, set this to:
ignore_duplicate_articles = {'title'}
To use URLs instead, set it to:
ignore_duplicate_articles = {'url'}
To match on title or URL, set it to:
ignore_duplicate_articles = {'title', 'url'}
- keep_only_tags = []¶
Keep only the specified tags and their children. For the format for specifying a tag, see BasicNewsRecipe.remove_tags. If this list is not empty, then the <body> tag will be emptied and re-filled with the tags that match the entries in this list. For example:

    keep_only_tags = [dict(id=['content', 'heading'])]

will keep only tags that have an id attribute of "content" or "heading".
- language = 'und'¶
The language that the news is in. Must be an ISO-639 code either two or three characters long
- masthead_url = None¶
By default, calibre will use a default image for the masthead (Kindle only). Override this in your recipe to provide a URL to use as a masthead.
- match_regexps = []¶
List of regular expressions that determine which links to follow. If empty, it is ignored. Used only if is_link_wanted is not implemented. For example:
match_regexps = [r'page=[0-9]+']
will match all URLs that have page=some number in them.
Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
- max_articles_per_feed = 100¶
Maximum number of articles to download from each feed. This is primarily useful for feeds that don't have article dates. For most feeds, you should use BasicNewsRecipe.oldest_article.
- needs_subscription = False¶
If True the GUI will ask the user for a username and password to use while downloading. If set to "optional", the use of a username and password becomes optional.
- no_stylesheets = False¶
Convenient flag to disable loading of stylesheets for websites that have overly complex stylesheets unsuitable for conversion to e-book formats. If True stylesheets are not downloaded and processed
- oldest_article = 7.0¶
Oldest article to download from this news source. In days.
- preprocess_regexps = []¶
List of regexp substitution rules to run on the downloaded HTML. Each element of the list should be a two element tuple. The first element of the tuple should be a compiled regular expression and the second a callable that takes a single match object and returns a string to replace the match. For example:
    preprocess_regexps = [
        (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
         lambda match: '</body>'),
    ]

will remove everything from <!--Article ends here--> to </body>.
- publication_type = 'unknown'¶
Publication type. Set to newspaper, magazine or blog. If set to None, no publication type metadata will be written to the OPF file.
- recipe_disabled = None¶
Set to a non-empty string to disable this recipe. The string will be used as the disabled message.
- recipe_specific_options = None¶
Specify options specific to this recipe. These will be available for the user to customize in the Advanced tab of the Fetch News dialog or at the ebook-convert command line. The options are specified as a dictionary mapping option name to metadata about the option. For example:
    recipe_specific_options = {
        'edition_date': {
            'short': 'The issue date to download',
            'long': 'Specify a date in the format YYYY-mm-dd to download the issue corresponding to that date',
            'default': 'current',
        }
    }
When the recipe is run, self.recipe_specific_options will be a dict mapping option name to the value specified by the user. When the user does not specify a value, the option takes the value given by “default”. If no default is specified and the user does not specify a value, the option will not be in the dict at all.
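A sketch of reading the hypothetical 'edition_date' option defined above from inside any recipe method:

    edition = self.recipe_specific_options.get('edition_date', 'current')
    if edition != 'current':
        # e.g. show the chosen date instead of the download date
        self.timefmt = ' [%s]' % edition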
- recursions = 0¶
Number of levels of links to follow on article webpages
- remove_attributes = []¶
List of attributes to remove from all tags. For example:
remove_attributes = ['style', 'font']
- remove_empty_feeds = False¶
If True empty feeds are removed from the output. This option has no effect if parse_index is overridden in the subclass. It is meant only for recipes that return a list of feeds using feeds or get_feeds(). It is also used if you use the ignore_duplicate_articles option.
- remove_javascript = True¶
Convenient flag to strip all JavaScript tags from the downloaded HTML
- remove_tags = []¶
List of tags to be removed. Specified tags are removed from downloaded HTML. A tag is specified as a dictionary of the form:
    {
        name: 'tag name',    # e.g. 'div'
        attrs: a dictionary, # e.g. {'class': 'advertisment'}
    }
All keys are optional. For a full explanation of the search criteria, see Beautiful Soup. A common example:
remove_tags = [dict(name='div', class_='advert')]
This will remove all <div class="advert"> tags and their children from the downloaded HTML.
- remove_tags_after = None¶
Remove all tags that occur after the specified tag. For the format for specifying a tag, see BasicNewsRecipe.remove_tags. For example:

    remove_tags_after = [dict(id='content')]

will remove all tags after the first element with id="content".
- remove_tags_before = None¶
Remove all tags that occur before the specified tag. For the format for specifying a tag, see BasicNewsRecipe.remove_tags. For example:

    remove_tags_before = dict(id='content')

will remove all tags before the first element with id="content".
- requires_version = (0, 6, 0)¶
Minimum calibre version needed to use this recipe
- resolve_internal_links = False¶
If set to True then links in downloaded articles that point to other downloaded articles are changed to point to the downloaded copy of the article rather than its original web URL. If you set this to True, you might also need to implement canonicalize_internal_url() to work with the URL scheme of your particular website.
- reverse_article_order = False¶
Reverse the order of articles in each feed
- scale_news_images = None¶
Maximum dimensions (w,h) to scale images to. If scale_news_images_to_device is True this is set to the device screen dimensions set by the output profile unless there is no profile set, in which case it is left at whatever value it has been assigned (default None).
- scale_news_images_to_device = True¶
Rescale images to fit in the device screen dimensions set by the output profile. Ignored if no output profile is set.
- simultaneous_downloads = 5¶
Number of simultaneous downloads. Set to 1 if the server is picky. Automatically reduced to 1 if BasicNewsRecipe.delay > 0.
- summary_length = 500¶
Max number of characters in the short description
- template_css = '\n .article_date {\n color: gray; font-family: monospace;\n }\n\n .article_description {\n text-indent: 0pt;\n }\n\n a.article {\n font-weight: bold; text-align:left;\n }\n\n a.feed {\n font-weight: bold;\n }\n\n .calibre_navbar {\n font-family:monospace;\n }\n '¶
The CSS that is used to style the templates, i.e., the navigation bars and the Tables of Contents. Rather than overriding this variable, you should use extra_css in your recipe to customize look and feel.
- timefmt = ' [%a, %d %b %Y]'¶
The format string for the date shown on the first page. By default: Day_Name, Day_Number Month_Name Year
- timeout = 120.0¶
Timeout for fetching files from the server, in seconds.
- title = 'Unknown News Source'¶
The title to use for the e-book
- use_embedded_content = None¶
Normally we try to guess if a feed has full articles embedded in it, based on the length of the embedded content. If None, the default guessing is used. If True we always assume the feed has embedded content and if False we always assume it does not.