dosage/dosagelib/helpers.py

# SPDX-License-Identifier: MIT
# Copyright (C) 2004-2008 Tristan Seligmann and Jonathan Jacobs
# Copyright (C) 2012-2014 Bastian Kleineidam
# Copyright (C) 2015-2020 Tobias Gruetzmacher
# Copyright (C) 2019-2020 Daniel Ring
from .util import getQueryParams


def queryNamer(param, use_page_url=False):
    """Get name from URL query part."""
    def _namer(self, image_url, page_url):
        """Get the first value of the configured URL query parameter."""
        url = page_url if use_page_url else image_url
        return getQueryParams(url)[param][0]
    return _namer


def regexNamer(regex, use_page_url=False):
    """Get name from regular expression."""
    def _namer(self, image_url, page_url):
        """Get first regular expression group."""
        url = page_url if use_page_url else image_url
        mo = regex.search(url)
        if mo:
            return mo.group(1)
        # No match: no name can be derived from this URL
        return None
    return _namer


def joinPathPartsNamer(pageurlparts, imageurlparts=(-1,), joinchar='_'):
    """Get name by joining selected URL path parts with joinchar."""
    def _namer(self, imageurl, pageurl):
        # Split on '/' and drop scheme and host name
        pageurlsplit = pageurl.split('/')[3:]
        imageurlsplit = imageurl.split('/')[3:]
        joinparts = ([pageurlsplit[i] for i in pageurlparts] +
                     [imageurlsplit[i] for i in imageurlparts])
        return joinchar.join(joinparts)
    return _namer
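
# Usage sketch (not part of the original module): how the index tuples passed
# to joinPathPartsNamer select path segments. The URLs are hypothetical, and
# the function body is repeated only so that the snippet runs on its own.

```python
def joinPathPartsNamer(pageurlparts, imageurlparts=(-1,), joinchar='_'):
    def _namer(self, imageurl, pageurl):
        # Everything before index 3 is scheme, empty string and host name
        pageurlsplit = pageurl.split('/')[3:]
        imageurlsplit = imageurl.split('/')[3:]
        joinparts = ([pageurlsplit[i] for i in pageurlparts] +
                     [imageurlsplit[i] for i in imageurlparts])
        return joinchar.join(joinparts)
    return _namer


# Hypothetical URLs, for illustration only
page_url = 'https://example.com/comic/2019/06/30'
image_url = 'https://example.com/images/strip.png'

# Take path parts 1 and 2 of the page URL plus the last part of the image URL
namer = joinPathPartsNamer((1, 2))
name = namer(None, image_url, page_url)
print(name)  # 2019_06_strip.png
```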


def bounceStarter(self):
    """Get start URL by "bouncing" back and forth one time.

    This needs the url, prevSearch and nextSearch properties to be defined
    on the class.
    """
    data = self.getPage(self.url)
    prevurl = self.fetchUrl(self.url, data, self.prevSearch)
    prevurl = self.link_modifier(self.url, prevurl)
    data = self.getPage(prevurl)
    nexturl = self.fetchUrl(prevurl, data, self.nextSearch)
    return self.link_modifier(prevurl, nexturl)


def indirectStarter(self):
    """Get start URL by indirection.

    This is useful for comics where the latest comic can't be reached at a
    stable URL. If the class has an attribute 'startUrl', this page is fetched
    first, otherwise the page at 'url' is fetched. After that, the attribute
    'latestSearch' is used on the page content to find the latest strip."""
    url = self.startUrl if hasattr(self, "startUrl") else self.url
    data = self.getPage(url)
    newurl = self.fetchUrl(url, data, self.latestSearch)
    return self.link_modifier(url, newurl)
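

# Usage sketch (not part of the original module): the "bounce" performed by
# bounceStarter, traced against an in-memory stand-in for a scraper.
# FakeScraper, its page table and its simplified getPage/fetchUrl/link_modifier
# methods are assumptions made for illustration; the real scraper classes
# fetch and parse HTML instead. bounceStarter is repeated so the snippet runs
# on its own.

```python
def bounceStarter(self):
    data = self.getPage(self.url)
    prevurl = self.fetchUrl(self.url, data, self.prevSearch)
    prevurl = self.link_modifier(self.url, prevurl)
    data = self.getPage(prevurl)
    nexturl = self.fetchUrl(prevurl, data, self.nextSearch)
    return self.link_modifier(prevurl, nexturl)


class FakeScraper:
    """Hypothetical scraper whose pages are plain dicts of links."""
    url = 'https://example.com/'
    prevSearch = 'prev'
    nextSearch = 'next'
    pages = {
        'https://example.com/': {'prev': 'https://example.com/41'},
        'https://example.com/41': {'next': 'https://example.com/42'},
    }
    # Assigning the function in the class body makes it a regular method
    starter = bounceStarter

    def getPage(self, url):
        return self.pages[url]

    def fetchUrl(self, url, data, search):
        # Stand-in: look the link up by key instead of searching HTML
        return data[search]

    def link_modifier(self, fromurl, tourl):
        # Stand-in: leave links unchanged
        return tourl


# Front page -> previous strip -> "next" link, i.e. the latest strip's own URL
start = FakeScraper().starter()
print(start)  # https://example.com/42
```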