Commit graph

124 commits

Author SHA1 Message Date
Tobias Gruetzmacher
c43bc0cef4 Fix duplicate module detection 2021-01-19 01:19:07 +01:00
Tobias Gruetzmacher
e64635e86b Stricter style checking & related style fixes 2020-10-11 20:15:27 +02:00
Tobias Gruetzmacher
0bdf3dd94b Allow adding external directories to the plugin package 2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
3256f9fdc2 Hardcode the "plugins" package name 2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
9237bd62b2 Convert scraper cache to a class
This should make it easier to extend with additional entries.
2020-10-04 22:28:51 +02:00
Tobias Gruetzmacher
e98a1601ca Remove workaround for libxml2 older 2.9.3 (2015)
This workaround was written in 2016 while that version was still found
on many systems. Addionally, this workaround needs to be enabled by the
developer, who might not even be aware that they need to enable it for a
specific module. We still throw a warning to the user if running with
such an old libxml version.
2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher
0fe32e7562 Remove unused f-string
Since we still want to support Python 3.5 for a bit, we should avoid
f-strings until we finally drop support for that.
2020-09-28 22:19:48 +02:00
Tobias Gruetzmacher
7e040086b6 Try to inform the user about geo-blocks
Instead of letting the crawler run into "random" error messages, throw a
specific "geoblocked" exception instead.
2020-09-28 13:11:34 +02:00
Tobias Gruetzmacher
64123eab64 Add an xpath extension to match CSS classes 2020-07-31 20:14:04 +02:00
Tobias Gruetzmacher
27d28b8eef Update file headers
The default encoding for source files is UTF-8 since Python 3, so we can
drop all encoding headers. While we are at it, just replace them with
SPDX headers.
2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher
62c3540c28 Remove (useless) wrapper around html.unescape 2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher
44791439a5 Drop Python 2 support: Obsolete future statements 2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher
9c65c3e05f Drop Python 2 support: six & other imports 2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher
752525c3e9 Fix some old modules using the Internet Archive 2020-01-09 17:38:13 +01:00
Tobias Gruetzmacher
a347bebfe3 Add simple host-based throttling 2019-12-04 00:28:27 +01:00
Tobias Gruetzmacher
e5e7dfacd6 Move basic HTTP setup into a new module
We now subclass requests' Session to make further extensions of the HTTP
flow possible.
2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher
00d0201c5f Fix a bunch of flake8 issues 2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher
e24c0ae557 Simplify voting code
Not sure if I keep this feature, but at least I can now see if anybody
is still using it...
2019-11-03 21:04:34 +01:00
Tobias Gruetzmacher
90685d9b0c Only support modern versions of PyCountry. 2017-11-26 19:29:48 +01:00
sizlo
a83911aa67 Favour the first image we found when we're not expecting multiple images 2017-04-18 21:59:04 +01:00
sizlo
8d84361de4 Preserve the order we found images in when removing duplicate images 2017-04-18 21:58:12 +01:00
Tobias Gruetzmacher
3f9feec041 Allow modules to ignore some HTTP error codes.
This is neccessary since it seems some webservers out there are
misconfigured to deliver actual content with an HTTP error code...
2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher
bc755d09a3 Apply link modifier to all links.
This was previously only the "previous link modifier", now it can also
modify "next" and "latest" links. Additionally, the modifier is given
the current URL, so those cases can be distinguished.
2016-11-01 01:50:44 +01:00
Tobias Gruetzmacher
9a6a310b76 Fixup copyright years. 2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher
64c8e502ca Ignore case for comic download directories.
Since we already match comics case-insensitive on the command line, this
was a logical step, even if this means changing quite a bit of code that
all tries to resolve the "comic directory" in a slightly different
way...
2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher
215d597573 Remove DrunkDuck for now.
- It's been disabled for ages
- Needs a major rework
- I don't want to add that many comics anyways...
- This also gets rid of make_scraper :)
2016-06-05 22:22:17 +02:00
Tobias Gruetzmacher
df2048cb34 Keep track of removed and moved comics (fixes #41).
I plan on keeping this list for at least ~ 2 releases and then purging
older entries...
2016-06-05 21:47:58 +02:00
Tobias Gruetzmacher
295b53a2d3 Fix name overrides (broken by 51008a). 2016-06-05 10:03:29 +02:00
Tobias Gruetzmacher
51008a975b Refactor: Introduce generator methods for scrapers
This allows one comic module class to generate multiple scrapers. This
change is to support a more dynamic module system as described in #42.
2016-05-21 01:29:36 +02:00
Tobias Gruetzmacher
efe1308db2 Replace home-grown Python2/3 compat. with six. 2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher
0c1aa9e8bd Move libxml < 2.9.3 workaround to base class. 2016-05-02 23:22:06 +02:00
Tobias Gruetzmacher
8b1ac4eb35 Fix "tagsoup" on SmackJeeves
Unfortunatly, browsers render < outside of HTML tags differently then
libXML until recently (libXML 2.9.3), so we need to preprocess pages
before parsing them...

(This was fixed in libXML commit 140c25)
2016-04-26 08:05:38 +02:00
Tobias Gruetzmacher
fd85c8583a Unify similar code in fetchUrl and fetchText 2016-04-22 00:42:46 +02:00
Tobias Gruetzmacher
6574997e01 Refactor: All the other class methods.
Turns out, it would have been better if all methods had been instance
methods and not class methods. This finished a big chunk of the rework
needed for #42.
2016-04-21 23:52:31 +02:00
Tobias Gruetzmacher
0d436b8ca9 Refactor: url modifiers to normal methods.
As before, to implement #42 these might want to access information from
the instance, so they should be normal methods.
2016-04-21 21:39:25 +02:00
Tobias Gruetzmacher
c3f32dfef7 Refactor: Make namer a method.
When #42 is realized, the naming of files might differ between comic
modules, so the namer's logical location is the instance, not the class.
2016-04-21 08:20:49 +02:00
Tobias Gruetzmacher
5bd2a49f48 Add debug output on matched XPath/CSS expression. 2016-04-20 23:51:54 +02:00
Tobias Gruetzmacher
190cd3b063 Convert language & getDisabledReasons to methods.
Both are more properties of a webcomic (this is part of the design
changes for #42)
2016-04-19 23:53:46 +02:00
Tobias Gruetzmacher
df46907f39 Register EXSLT extensions by default.
This allows comic module authors to use the full power of regular
expressions in XPath expression, see http://exslt.org/regexp/regexp.html
for usage. Please be aware that these use the prefix re: instead of
regexp: here.
2016-04-19 23:48:14 +02:00
Tobias Gruetzmacher
ee99c087d7 Remove prevUrlMatchesStripUrl.
It was only used for one test.
2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher
92a688457a Remove useless indirection. 2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher
060281e5ff Use concrete scraper objects everywhere.
This is a first step for #42. Since most access to the scraper classes
is through instances, modules can now dynamically override url and name
(name is now a property).
2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher
f6e605e146 Fix unicode error in text search. 2016-04-10 13:16:30 +02:00
Tobias Gruetzmacher
8768ff07b6 Fix AhoiPolloi, be a bit smarter about encoding.
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I did get it right this time. But I
think, the current behaviour matches best what web browsers try to do:

1. Let Requests figure out the content from the HTTP header. This
   overrides everything else. We need to "trick" LXML to accept our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher
c4fcd985dd Let urllib3 handle all retries. 2016-03-13 21:30:36 +01:00
Tobias Gruetzmacher
78e13962f9 Sort scraper modules (mostly for test stability). 2016-03-13 20:24:21 +01:00
Damjan Košir
fd9c480d9c adding bonus panel to SWBC and multiple images flag to ParserScraper 2015-08-03 22:58:44 +12:00
Tobias Gruetzmacher
303432fc68 Also use css expressions for textSearch. 2015-07-18 01:22:40 +02:00
Tobias Gruetzmacher
808b624e5f Remove hard dependency on pycountry again.
This basically reverts commit 86b31dc12b.

It now works like this: If the use has pycountry installed, it is used.
If not, Dosage falls back to a small internal list generated from
pycountry by scripts/mklanguages.py.

This means additional work if we ever decide to translate Dosage, since
pycountry already has all the translations for language names...

This fixes #23.
2015-07-11 01:27:39 +02:00
Damjan Košir
79d775a8d9 adding comicpress scraper 2015-05-16 00:15:32 +12:00