This is just a light wrapper around fetchUrls, but it frees comic modules
from having to second-guess why fetchUrls was called when they override
that API. And yes, some comic modules already got this wrong; they are
all fixed now.
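A minimal sketch of the idea (class and parameter names are illustrative,
not the exact dosage API):

    # Hypothetical sketch: fetchUrl as a thin single-result wrapper,
    # so comic modules only ever need to override fetchUrls itself.
    class Scraper:
        def fetchUrls(self, url, data, urlSearch):
            """Return all URLs matched by urlSearch on the page."""
            raise NotImplementedError

        def fetchUrl(self, url, data, urlSearch):
            """Return only the first URL found by fetchUrls."""
            return self.fetchUrls(url, data, urlSearch)[0]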
This workaround was written in 2016, while that version was still found
on many systems. Additionally, this workaround needs to be enabled by the
developer, who might not even be aware that they need to enable it for a
specific module. We still show a warning to the user if they are running
with such an old libxml version.
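The check itself can be as simple as this sketch (the actual warning text
in dosage may differ):

    from lxml import etree

    if etree.LIBXML_VERSION < (2, 9, 3):
        print("WARNING: libxml older than 2.9.3 detected; "
              "broken markup may be parsed differently than browsers do.")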
Since Python 3, the default encoding for source files is UTF-8, so we can
drop all encoding headers. While we are at it, just replace them with
SPDX headers.
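For illustration, this is the kind of change at the top of each source
file (the exact SPDX license tag depends on the project):

    # Before (needed on Python 2):
    # -*- coding: utf-8 -*-

    # After (UTF-8 is the default; only the license tag remains):
    # SPDX-License-Identifier: MIT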
This was previously only the "previous link modifier"; now it can also
modify "next" and "latest" links. Additionally, the modifier is given
the current URL, so those cases can be distinguished.
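Roughly, the hook now has a shape like this sketch (names are
illustrative):

    # Hypothetical sketch: one modifier for all navigation links;
    # fromurl tells the hook which page the link was found on.
    class Scraper:
        def link_modifier(self, fromurl, tourl):
            """Return a possibly rewritten link target."""
            return tourl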
Since we already match comics case-insensitively on the command line,
this was a logical step, even though it means changing quite a bit of
code that tries to resolve the "comic directory" in slightly different
ways...
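A sketch of the intended resolution, assuming directories are matched
purely by their lowercased names (the helper name is made up):

    import os

    def find_comic_dir(basepath, name):
        """Find an existing comic directory case-insensitively,
        falling back to the canonical name if none exists yet."""
        for entry in os.listdir(basepath):
            if entry.lower() == name.lower():
                return os.path.join(basepath, entry)
        return os.path.join(basepath, name)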
Unfortunately, browsers render < outside of HTML tags differently than
libxml did until recently (libxml 2.9.3), so we need to preprocess pages
before parsing them...
(This was fixed in libxml commit 140c25)
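A sketch of the kind of preprocessing meant here (the exact pattern used
by dosage may differ): escape any "<" that cannot start a tag before
handing the page to the parser.

    import re

    # "<" not followed by a letter, "/", "!" or "?" cannot open a tag,
    # so treat it as text, like browsers and newer libxml do.
    _BARE_LT = re.compile(r'<(?![a-zA-Z/!?])')

    def preprocess(html):
        return _BARE_LT.sub('&lt;', html)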
Turns out, it would have been better if all methods had been instance
methods and not class methods. This finished a big chunk of the rework
needed for #42.
This allows comic module authors to use the full power of regular
expressions in XPath expressions; see http://exslt.org/regexp/regexp.html
for usage. Please be aware that these use the prefix re: instead of
regexp: here.
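For example, with lxml this kind of expression becomes possible (the
scraper is what maps the re: prefix to the EXSLT namespace):

    from lxml import html

    doc = html.fromstring(
        '<a href="/comics/123">x</a><a href="/about">y</a>')
    links = doc.xpath(
        "//a[re:test(@href, '/comics/[0-9]+')]/@href",
        namespaces={'re': 'http://exslt.org/regular-expressions'})
    print(links)  # ['/comics/123']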
This is a first step for #42. Since most access to the scraper classes
is through instances, modules can now dynamically override url and name
(name is now a property).
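A minimal sketch of the shape this enables (illustrative, not the exact
dosage code):

    # Hypothetical sketch: url and name live on the instance, so a
    # module can compute them dynamically instead of hard-coding
    # class attributes.
    class Scraper:
        def __init__(self, name, url):
            self._name = name
            self.url = url

        @property
        def name(self):
            return self._name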
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I got it right this time. But I
think the current behaviour best matches what web browsers try to do:
1. Let Requests figure out the encoding from the HTTP header. This
overrides everything else. We need to "trick" LXML into accepting our
decision if the document contains an XML declaration which might
disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
encoding and be done with it.
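A sketch of this two-step behaviour using requests and lxml (simplified;
the real code handles more corner cases):

    import lxml.html
    import requests

    def parse_page(url):
        r = requests.get(url)
        if 'charset' in r.headers.get('content-type', '').lower():
            # 1. The HTTP header wins: force lxml to use that
            #    encoding, even if an XML declaration disagrees.
            parser = lxml.html.HTMLParser(encoding=r.encoding)
            return lxml.html.document_fromstring(r.content, parser=parser)
        # 2. No charset in the header: give lxml the raw bytes and
        #    let it guess the encoding itself.
        return lxml.html.document_fromstring(r.content)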