Tobias Gruetzmacher
23125c74d4
Unify XPath NS config over modules
2024-03-17 21:44:46 +01:00
Tobias Gruetzmacher
7b9ca867fb
Add some more type annotations
2024-02-18 16:53:17 +01:00
Tobias Gruetzmacher
4f932803a3
Extend scraper API with an extract_image_urls method
...
This is just a light wrapper around fetchUrls, but it frees comic modules
from second-guessing the purpose for which fetchUrls was called when they
override that API. And yes, some comic modules already got this wrong;
they are now all fixed.
2023-06-10 15:05:57 +02:00
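As a rough illustration of the commit above, the sketch below shows how a thin wrapper can give comic modules a dedicated override point for images. Only the names fetchUrls and extract_image_urls come from the commit message; the class names, signatures, search attributes, and the dict standing in for a parsed page are assumptions made for this example.

```python
class Scraper:
    """Hypothetical base class; not the project's actual implementation."""

    image_search = "img"    # assumed: expression that locates comic images
    prev_search = "a.prev"  # assumed: expression that locates the "previous" link

    def fetchUrls(self, url, data, search):
        # Stand-in for the generic lookup: 'data' maps a search expression
        # to the URLs it matched on the page.
        return list(data.get(search, []))

    def extract_image_urls(self, url, data):
        # Light wrapper: callers that want images go through this method,
        # so overriding it never affects navigation lookups.
        return self.fetchUrls(url, data, self.image_search)


class ThumbnailComic(Scraper):
    """Hypothetical module that rewrites thumbnail URLs to full-size ones."""

    def extract_image_urls(self, url, data):
        urls = super().extract_image_urls(url, data)
        return [u.replace("/thumb/", "/full/") for u in urls]


page = {
    "img": ["https://example.com/thumb/001.png"],
    "a.prev": ["https://example.com/thumb/archive"],
}
comic = ThumbnailComic()
images = comic.extract_image_urls("https://example.com/1", page)
prev_links = comic.fetchUrls("https://example.com/1", page, comic.prev_search)
```

Because the module overrides only extract_image_urls, the "previous" link lookup through fetchUrls is left untouched.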
Tobias Gruetzmacher
67d1ee281b
Ignore "usemap" attribute on images
2022-06-06 14:11:07 +02:00
Tobias Gruetzmacher
8e1e398a8d
Deprecate underscore-prefixed parent classes
...
This is trying to strike a balance between updating as many existing
classes as possible and not making the diff too big...
2022-06-06 12:08:32 +02:00
Tobias Gruetzmacher
99b72c90be
Remove unused multi-match logic
2022-06-04 10:56:25 +02:00
Tobias Gruetzmacher
9b95171f37
Add some basic type annotations
2022-05-28 19:33:16 +02:00
Tobias Gruetzmacher
c43bc0cef4
Fix duplicate module detection
2021-01-19 01:19:07 +01:00
Tobias Gruetzmacher
e64635e86b
Stricter style checking & related style fixes
2020-10-11 20:15:27 +02:00
Tobias Gruetzmacher
0bdf3dd94b
Allow adding external directories to the plugin package
2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
3256f9fdc2
Hardcode the "plugins" package name
2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
9237bd62b2
Convert scraper cache to a class
...
This should make it easier to extend with additional entries.
2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
e98a1601ca
Remove workaround for libxml2 older than 2.9.3 (2015)
...
This workaround was written in 2016 while that version was still found
on many systems. Additionally, this workaround needs to be enabled by the
developer, who might not even be aware that they need to enable it for a
specific module. We still throw a warning to the user if running with
such an old libxml version.
2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher
0fe32e7562
Remove unused f-string
...
Since we still want to support Python 3.5 for a bit, we should avoid
f-strings until we finally drop support for that.
2020-09-28 22:19:48 +02:00
Tobias Gruetzmacher
7e040086b6
Try to inform the user about geo-blocks
...
Instead of letting the crawler run into "random" error messages, throw a
specific "geoblocked" exception instead.
2020-09-28 13:11:34 +02:00
Tobias Gruetzmacher
64123eab64
Add an xpath extension to match CSS classes
2020-07-31 20:14:04 +02:00
Tobias Gruetzmacher
27d28b8eef
Update file headers
...
The default encoding for source files is UTF-8 since Python 3, so we can
drop all encoding headers. While we are at it, just replace them with
SPDX headers.
2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher
62c3540c28
Remove (useless) wrapper around html.unescape
2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher
44791439a5
Drop Python 2 support: Obsolete future statements
2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher
9c65c3e05f
Drop Python 2 support: six & other imports
2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher
752525c3e9
Fix some old modules using the Internet Archive
2020-01-09 17:38:13 +01:00
Tobias Gruetzmacher
a347bebfe3
Add simple host-based throttling
2019-12-04 00:28:27 +01:00
Tobias Gruetzmacher
e5e7dfacd6
Move basic HTTP setup into a new module
...
We now subclass requests' Session to make further extensions of the HTTP
flow possible.
2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher
00d0201c5f
Fix a bunch of flake8 issues
2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher
e24c0ae557
Simplify voting code
...
Not sure if I keep this feature, but at least I can now see if anybody
is still using it...
2019-11-03 21:04:34 +01:00
Tobias Gruetzmacher
90685d9b0c
Only support modern versions of PyCountry.
2017-11-26 19:29:48 +01:00
sizlo
a83911aa67
Favour the first image we found when we're not expecting multiple images
2017-04-18 21:59:04 +01:00
sizlo
8d84361de4
Preserve the order we found images in when removing duplicate images
2017-04-18 21:58:12 +01:00
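One common way to drop duplicates while keeping first-seen order is sketched below. The helper name unique_in_order is hypothetical, and the commit itself may well have used a different approach, since the project still supported Python 2 at the time.

```python
def unique_in_order(urls):
    # dict keys keep insertion order (guaranteed since Python 3.7), so
    # each URL survives at the position where it was first seen.
    return list(dict.fromkeys(urls))


found = ["b.png", "a.png", "b.png", "c.png", "a.png"]
deduplicated = unique_in_order(found)
```

A plain set() would also remove the duplicates, but it gives no ordering guarantee, which is the problem the commit subject describes.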
Tobias Gruetzmacher
3f9feec041
Allow modules to ignore some HTTP error codes.
...
This is necessary since it seems some webservers out there are
misconfigured to deliver actual content with an HTTP error code...
2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher
bc755d09a3
Apply link modifier to all links.
...
This was previously only the "previous link modifier", now it can also
modify "next" and "latest" links. Additionally, the modifier is given
the current URL, so those cases can be distinguished.
2016-11-01 01:50:44 +01:00
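A minimal sketch of the idea in the commit above: a single modifier hook that sees both the page being processed and the link found on it. The method name link_modifier, its parameter names, and the example URLs are assumptions for illustration, not the project's confirmed API.

```python
class Scraper:
    """Hypothetical base class for this sketch."""

    def link_modifier(self, from_url, to_url):
        # Default: leave every link (previous, next, latest) unchanged.
        return to_url


class MirrorComic(Scraper):
    """Hypothetical module that fixes links only on archive pages."""

    def link_modifier(self, from_url, to_url):
        # The current URL lets the module tell the cases apart.
        if "/archive/" in from_url:
            return to_url.replace("http://", "https://", 1)
        return to_url


comic = MirrorComic()
fixed = comic.link_modifier("https://example.com/archive/5", "http://example.com/archive/4")
untouched = comic.link_modifier("https://example.com/latest", "http://example.com/archive/5")
```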
Tobias Gruetzmacher
9a6a310b76
Fixup copyright years.
2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher
64c8e502ca
Ignore case for comic download directories.
...
Since we already match comics case-insensitively on the command line, this
was a logical step, even if this means changing quite a bit of code that
all tries to resolve the "comic directory" in a slightly different
way...
2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher
215d597573
Remove DrunkDuck for now.
...
- It's been disabled for ages
- Needs a major rework
- I don't want to add that many comics anyways...
- This also gets rid of make_scraper :)
2016-06-05 22:22:17 +02:00
Tobias Gruetzmacher
df2048cb34
Keep track of removed and moved comics (fixes #41).
...
I plan on keeping this list for at least ~ 2 releases and then purging
older entries...
2016-06-05 21:47:58 +02:00
Tobias Gruetzmacher
295b53a2d3
Fix name overrides (broken by 51008a).
2016-06-05 10:03:29 +02:00
Tobias Gruetzmacher
51008a975b
Refactor: Introduce generator methods for scrapers
...
This allows one comic module class to generate multiple scrapers. This
change is to support a more dynamic module system as described in #42.
2016-05-21 01:29:36 +02:00
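The sketch below illustrates how one module class could yield several scraper instances. The classmethod name getmodules, the constructor signature, and the comic names are assumptions chosen for this example and are not taken from the commit.

```python
class Scraper:
    """Hypothetical base class for this sketch."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def getmodules(cls):
        # Default: a module class produces exactly one scraper,
        # named after the class itself.
        return (cls(cls.__name__),)


class SingleComic(Scraper):
    pass


class ExampleHost(Scraper):
    """Hypothetical hosting site with several comics behind one class."""

    comics = ("Alpha", "Beta")

    @classmethod
    def getmodules(cls):
        # One class, many scrapers: one instance per hosted comic.
        return tuple(cls("ExampleHost/" + comic) for comic in cls.comics)


names = [s.name for klass in (SingleComic, ExampleHost) for s in klass.getmodules()]
```

Code that collects scrapers only needs to call the generator method on each class and never has to know how many instances a class produces.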
Tobias Gruetzmacher
efe1308db2
Replace home-grown Python2/3 compat. with six.
2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher
0c1aa9e8bd
Move libxml < 2.9.3 workaround to base class.
2016-05-02 23:22:06 +02:00
Tobias Gruetzmacher
8b1ac4eb35
Fix "tagsoup" on SmackJeeves
...
Unfortunately, browsers render < outside of HTML tags differently than
libXML until recently (libXML 2.9.3), so we need to preprocess pages
before parsing them...
(This was fixed in libXML commit 140c25)
2016-04-26 08:05:38 +02:00
Tobias Gruetzmacher
fd85c8583a
Unify similar code in fetchUrl and fetchText
2016-04-22 00:42:46 +02:00
Tobias Gruetzmacher
6574997e01
Refactor: All the other class methods.
...
Turns out, it would have been better if all methods had been instance
methods and not class methods. This finished a big chunk of the rework
needed for #42.
2016-04-21 23:52:31 +02:00
Tobias Gruetzmacher
0d436b8ca9
Refactor: url modifiers to normal methods.
...
As before, to implement #42 these might want to access information from
the instance, so they should be normal methods.
2016-04-21 21:39:25 +02:00
Tobias Gruetzmacher
c3f32dfef7
Refactor: Make namer a method.
...
When #42 is realized, the naming of files might differ between comic
modules, so the namer's logical location is the instance, not the class.
2016-04-21 08:20:49 +02:00
Tobias Gruetzmacher
5bd2a49f48
Add debug output on matched XPath/CSS expression.
2016-04-20 23:51:54 +02:00
Tobias Gruetzmacher
190cd3b063
Convert language & getDisabledReasons to methods.
...
Both are more properties of a webcomic (this is part of the design
changes for #42).
2016-04-19 23:53:46 +02:00
Tobias Gruetzmacher
df46907f39
Register EXSLT extensions by default.
...
This allows comic module authors to use the full power of regular
expressions in XPath expressions; see http://exslt.org/regexp/regexp.html
for usage. Please be aware that these use the prefix re: instead of
regexp: here.
2016-04-19 23:48:14 +02:00
Tobias Gruetzmacher
ee99c087d7
Remove prevUrlMatchesStripUrl.
...
It was only used for one test.
2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher
92a688457a
Remove useless indirection.
2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher
060281e5ff
Use concrete scraper objects everywhere.
...
This is a first step for #42. Since most access to the scraper classes
is through instances, modules can now dynamically override url and name
(name is now a property).
2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher
f6e605e146
Fix unicode error in text search.
2016-04-10 13:16:30 +02:00