131 commits

Author SHA1 Message Date
Tobias Gruetzmacher 23125c74d4 Unify XPath NS config over modules 2024-03-17 21:44:46 +01:00
Tobias Gruetzmacher 7b9ca867fb Add some more type annotations 2024-02-18 16:53:17 +01:00
Tobias Gruetzmacher 4f932803a3 Extend scraper API with an extract_image_urls method
This is just a light wrapper around fetchUrls, but frees comic modules
from second-guessing for what purpose fetchUrls was called when they are
overriding that API - And yes, some comic modules already got this
wrong, they are now all fixed.
2023-06-10 15:05:57 +02:00
Tobias Gruetzmacher 67d1ee281b Ignore "usemap" attribute on images 2022-06-06 14:11:07 +02:00
Tobias Gruetzmacher 8e1e398a8d Deprecate underscore-prefixed parent classes
This is trying to strike a balance between updating as many existing
classes as possible and not making the diff too big...
2022-06-06 12:08:32 +02:00
Tobias Gruetzmacher 99b72c90be Remove unused multi-match logic 2022-06-04 10:56:25 +02:00
Tobias Gruetzmacher 9b95171f37 Add some basic type annotations 2022-05-28 19:33:16 +02:00
Tobias Gruetzmacher c43bc0cef4 Fix duplicate module detection 2021-01-19 01:19:07 +01:00
Tobias Gruetzmacher e64635e86b Stricter style checking & related style fixes 2020-10-11 20:15:27 +02:00
Tobias Gruetzmacher 0bdf3dd94b Allow adding external directories to the plugin package 2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher 3256f9fdc2 Hardcode the "plugins" package name 2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher 9237bd62b2 Convert scraper cache to a class
This should make it easier to extend with additional entries.
2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher e98a1601ca Remove workaround for libxml2 older than 2.9.3 (2015)
This workaround was written in 2016 while that version was still found
on many systems. Additionally, this workaround needs to be enabled by the
developer, who might not even be aware that they need to enable it for a
specific module. We still throw a warning to the user if running with
such an old libxml version.
2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher 0fe32e7562 Remove unused f-string
Since we still want to support Python 3.5 for a bit, we should avoid
f-strings until we finally drop support for that.
2020-09-28 22:19:48 +02:00
Tobias Gruetzmacher 7e040086b6 Try to inform the user about geo-blocks
Instead of letting the crawler run into "random" error messages, throw a
specific "geoblocked" exception instead.
2020-09-28 13:11:34 +02:00
Tobias Gruetzmacher 64123eab64 Add an xpath extension to match CSS classes 2020-07-31 20:14:04 +02:00
Tobias Gruetzmacher 27d28b8eef Update file headers
The default encoding for source files is UTF-8 since Python 3, so we can
drop all encoding headers. While we are at it, just replace them with
SPDX headers.
2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher 62c3540c28 Remove (useless) wrapper around html.unescape 2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher 44791439a5 Drop Python 2 support: Obsolete future statements 2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher 9c65c3e05f Drop Python 2 support: six & other imports 2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher 752525c3e9 Fix some old modules using the Internet Archive 2020-01-09 17:38:13 +01:00
Tobias Gruetzmacher a347bebfe3 Add simple host-based throttling 2019-12-04 00:28:27 +01:00
Tobias Gruetzmacher e5e7dfacd6 Move basic HTTP setup into a new module
We now subclass requests' Session to make further extensions of the HTTP
flow possible.
2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher 00d0201c5f Fix a bunch of flake8 issues 2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher e24c0ae557 Simplify voting code
Not sure if I keep this feature, but at least I can now see if anybody
is still using it...
2019-11-03 21:04:34 +01:00
Tobias Gruetzmacher 90685d9b0c Only support modern versions of PyCountry. 2017-11-26 19:29:48 +01:00
sizlo a83911aa67 Favour the first image we found when we're not expecting multiple images 2017-04-18 21:59:04 +01:00
sizlo 8d84361de4 Preserve the order we found images in when removing duplicate images 2017-04-18 21:58:12 +01:00
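The two changes above (keeping first-seen order when dropping duplicate images, and taking the first match when only one image is expected) can be sketched as follows. This is a minimal illustration, not the project's code: `unique_in_order` and `pick_images` are hypothetical names.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping the order they were found in."""
    # dict keys keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(urls))


def pick_images(urls, multiple_expected=False):
    """Return all unique images, or only the first one found."""
    unique = unique_in_order(urls)
    if multiple_expected:
        return unique
    # Not expecting several images: favour the first match
    return unique[:1]


found = ["b.png", "a.png", "b.png"]
print(unique_in_order(found))
print(pick_images(found))
```

A set-based de-duplication would lose the page order, which is why an ordered container is used here.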
Tobias Gruetzmacher 3f9feec041 Allow modules to ignore some HTTP error codes.
This is necessary since it seems some webservers out there are
misconfigured to deliver actual content with an HTTP error code...
2016-11-01 18:25:02 +01:00
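A rough sketch of what such an opt-out might look like, assuming each module lists the status codes it is willing to accept. The names `HTTPStatusError`, `check_status` and `allow_errors` are illustrative assumptions, not taken from the commit.

```python
class HTTPStatusError(Exception):
    """Raised for an HTTP error status the module did not opt out of."""


def check_status(status_code, allow_errors=()):
    """Raise for 4xx/5xx responses unless the code is explicitly allowed."""
    if status_code >= 400 and status_code not in allow_errors:
        raise HTTPStatusError("HTTP error {}".format(status_code))
    return status_code


# A site that serves real content together with a 404 status
print(check_status(404, allow_errors=(404,)))
```

Keeping the allowed codes per module means a misconfigured server does not weaken error handling for every other site.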
Tobias Gruetzmacher bc755d09a3 Apply link modifier to all links.
This was previously only the "previous link modifier", now it can also
modify "next" and "latest" links. Additionally, the modifier is given
the current URL, so those cases can be distinguished.
2016-11-01 01:50:44 +01:00
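A minimal sketch of a modifier hook that receives both the current URL and the link found on it. The class names, the method signature and the example URLs are assumptions made for illustration.

```python
class Scraper:
    def link_modifier(self, from_url, to_url):
        """Default behaviour: leave every link unchanged."""
        return to_url


class ExampleComic(Scraper):
    def link_modifier(self, from_url, to_url):
        # Hypothetical quirk: only the archive page links to the mobile
        # site, so the current URL decides whether to rewrite the link.
        if from_url.endswith("/archive"):
            return to_url.replace("//m.", "//www.")
        return to_url


comic = ExampleComic()
print(comic.link_modifier("http://www.example.com/archive",
                          "http://m.example.com/strip/1"))
```

Because the current URL is passed in, one hook can treat "previous", "next" and "latest" links differently without separate callbacks.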
Tobias Gruetzmacher 9a6a310b76 Fixup copyright years. 2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher 64c8e502ca Ignore case for comic download directories.
Since we already match comics case-insensitively on the command line, this
was a logical step, even if this means changing quite a bit of code that
all tries to resolve the "comic directory" in a slightly different
way...
2016-06-06 00:08:29 +02:00
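One way to resolve a comic directory regardless of case is to scan the base path for an existing entry and fall back to the requested name. This is a sketch under that assumption; `find_comic_dir` is a hypothetical helper, not the function the commit changed.

```python
import os
import tempfile


def find_comic_dir(basepath, name):
    """Return an existing directory matching name case-insensitively,
    or the default path if none exists yet."""
    wanted = name.lower()
    for entry in sorted(os.listdir(basepath)):
        full = os.path.join(basepath, entry)
        if entry.lower() == wanted and os.path.isdir(full):
            return full
    return os.path.join(basepath, name)


with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "XKCD"))
    # Finds the existing "XKCD" directory although "xkcd" was requested
    print(os.path.basename(find_comic_dir(base, "xkcd")))
```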
Tobias Gruetzmacher 215d597573 Remove DrunkDuck for now.
- It's been disabled for ages
- Needs a major rework
- I don't want to add that many comics anyways...
- This also gets rid of make_scraper :)
2016-06-05 22:22:17 +02:00
Tobias Gruetzmacher df2048cb34 Keep track of removed and moved comics (fixes #41).
I plan on keeping this list for at least ~ 2 releases and then purging
older entries...
2016-06-05 21:47:58 +02:00
Tobias Gruetzmacher 295b53a2d3 Fix name overrides (broken by 51008a). 2016-06-05 10:03:29 +02:00
Tobias Gruetzmacher 51008a975b Refactor: Introduce generator methods for scrapers
This allows one comic module class to generate multiple scrapers. This
change is to support a more dynamic module system as described in #42.
2016-05-21 01:29:36 +02:00
Tobias Gruetzmacher efe1308db2 Replace home-grown Python2/3 compat. with six. 2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher 0c1aa9e8bd Move libxml < 2.9.3 workaround to base class. 2016-05-02 23:22:06 +02:00
Tobias Gruetzmacher 8b1ac4eb35 Fix "tagsoup" on SmackJeeves
Unfortunately, browsers render < outside of HTML tags differently than
libXML until recently (libXML 2.9.3), so we need to preprocess pages
before parsing them...

(This was fixed in libXML commit 140c25)
2016-04-26 08:05:38 +02:00
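A preprocessing step of this kind might look like the sketch below. It assumes, as a heuristic that is not confirmed by the commit message, that a `<` followed by a space, `=` or a digit cannot start a tag and can therefore be escaped before parsing. The names `STRAY_LT` and `escape_stray_lt` are hypothetical.

```python
import re

# Assumption: "<" followed by a space, "=" or a digit never opens a tag.
STRAY_LT = re.compile(r'(<+)([ =0-9])')


def escape_stray_lt(text):
    """Replace stray "<" characters with "&lt;" before parsing."""
    return STRAY_LT.sub(
        lambda m: '&lt;' * len(m.group(1)) + m.group(2), text)


print(escape_stray_lt('<p>1 < 2 and x <3</p>'))
```

Real tags such as `<p>` and `</p>` are left alone because the character after `<` is a letter or a slash.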
Tobias Gruetzmacher fd85c8583a Unify similar code in fetchUrl and fetchText 2016-04-22 00:42:46 +02:00
Tobias Gruetzmacher 6574997e01 Refactor: All the other class methods.
Turns out, it would have been better if all methods had been instance
methods and not class methods. This finished a big chunk of the rework
needed for #42.
2016-04-21 23:52:31 +02:00
Tobias Gruetzmacher 0d436b8ca9 Refactor: url modifiers to normal methods.
As before, to implement #42 these might want to access information from
the instance, so they should be normal methods.
2016-04-21 21:39:25 +02:00
Tobias Gruetzmacher c3f32dfef7 Refactor: Make namer a method.
When #42 is realized, the naming of files might differ between comic
modules, so the namer's logical location is the instance, not the class.
2016-04-21 08:20:49 +02:00
Tobias Gruetzmacher 5bd2a49f48 Add debug output on matched XPath/CSS expression. 2016-04-20 23:51:54 +02:00
Tobias Gruetzmacher 190cd3b063 Convert language & getDisabledReasons to methods.
Both are more properties of a webcomic (this is part of the design
changes for #42)
2016-04-19 23:53:46 +02:00
Tobias Gruetzmacher df46907f39 Register EXSLT extensions by default.
This allows comic module authors to use the full power of regular
expressions in XPath expressions, see http://exslt.org/regexp/regexp.html
for usage. Please be aware that these use the prefix re: instead of
regexp: here.
2016-04-19 23:48:14 +02:00
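For illustration, an EXSLT regular-expression test in an lxml XPath query could look like this. The sample markup and the pattern are made up. The namespace is passed explicitly here so the snippet runs on its own, whereas the commit registers it by default.

```python
from lxml import html

# EXSLT regular-expressions namespace, bound to the "re" prefix
NS = {"re": "http://exslt.org/regular-expressions"}

doc = html.fromstring(
    '<div><img src="/comics/001.png"/><img src="/ads/banner.gif"/></div>')

# Select only image sources that look like numbered comic strips
matches = doc.xpath(
    r'//img[re:test(@src, "^/comics/\d+\.png$")]/@src', namespaces=NS)
print(matches)
```

A plain XPath 1.0 `contains()` or `starts-with()` cannot express the "digits only" part, which is where the regular-expression extension helps.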
Tobias Gruetzmacher ee99c087d7 Remove prevUrlMatchesStripUrl.
It was only used for one test.
2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher 92a688457a Remove useless indirection. 2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher 060281e5ff Use concrete scraper objects everywhere.
This is a first step for #42. Since most access to the scraper classes
is through instances, modules can now dynamically override url and name
(name is now a property).
2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher f6e605e146 Fix unicode error in text search. 2016-04-10 13:16:30 +02:00