dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	e98a1601ca	Remove workaround for libxml2 older than 2.9.3 (2015) This workaround was written in 2016 while that version was still found on many systems. Additionally, this workaround needs to be enabled by the developer, who might not even be aware that they need to enable it for a specific module. We still throw a warning to the user if running with such an old libxml version.	2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher	0fe32e7562	Remove unused f-string Since we still want to support Python 3.5 for a bit, we should avoid f-strings until we finally drop support for that.	2020-09-28 22:19:48 +02:00
Tobias Gruetzmacher	7e040086b6	Try to inform the user about geo-blocks Instead of letting the crawler run into "random" error messages, throw a specific "geoblocked" exception instead.	2020-09-28 13:11:34 +02:00
Tobias Gruetzmacher	64123eab64	Add an xpath extension to match CSS classes	2020-07-31 20:14:04 +02:00
Tobias Gruetzmacher	27d28b8eef	Update file headers The default encoding for source files is UTF-8 since Python 3, so we can drop all encoding headers. While we are at it, just replace them with SPDX headers.	2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher	62c3540c28	Remove (useless) wrapper around html.unescape	2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher	44791439a5	Drop Python 2 support: Obsolete future statements	2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher	9c65c3e05f	Drop Python 2 support: six & other imports	2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher	752525c3e9	Fix some old modules using the Internet Archive	2020-01-09 17:38:13 +01:00
Tobias Gruetzmacher	a347bebfe3	Add simple host-based throttling	2019-12-04 00:28:27 +01:00
Tobias Gruetzmacher	e5e7dfacd6	Move basic HTTP setup into a new module We now subclass requests' Session to make further extensions of the HTTP flow possible.	2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher	00d0201c5f	Fix a bunch of flake8 issues	2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher	e24c0ae557	Simplify voting code Not sure if I keep this feature, but at least I can now see if anybody is still using it...	2019-11-03 21:04:34 +01:00
Tobias Gruetzmacher	90685d9b0c	Only support modern versions of PyCountry.	2017-11-26 19:29:48 +01:00
sizlo	a83911aa67	Favour the first image we found when we're not expecting multiple images	2017-04-18 21:59:04 +01:00
sizlo	8d84361de4	Preserve the order we found images in when removing duplicate images	2017-04-18 21:58:12 +01:00
Tobias Gruetzmacher	3f9feec041	Allow modules to ignore some HTTP error codes. This is necessary since it seems some webservers out there are misconfigured to deliver actual content with an HTTP error code...	2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher	bc755d09a3	Apply link modifier to all links. This was previously only the "previous link modifier", now it can also modify "next" and "latest" links. Additionally, the modifier is given the current URL, so those cases can be distinguished.	2016-11-01 01:50:44 +01:00
Tobias Gruetzmacher	9a6a310b76	Fixup copyright years.	2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher	64c8e502ca	Ignore case for comic download directories. Since we already match comics case-insensitive on the command line, this was a logical step, even if this means changing quite a bit of code that all tries to resolve the "comic directory" in a slightly different way...	2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher	215d597573	Remove DrunkDuck for now. - It's been disabled for ages - Needs a major rework - I don't want to add that many comics anyways... - This also gets rid of make_scraper :)	2016-06-05 22:22:17 +02:00
Tobias Gruetzmacher	df2048cb34	Keep track of removed and moved comics (fixes #41). I plan on keeping this list for at least ~ 2 releases and then purging older entries...	2016-06-05 21:47:58 +02:00
Tobias Gruetzmacher	295b53a2d3	Fix name overrides (broken by 51008a).	2016-06-05 10:03:29 +02:00
Tobias Gruetzmacher	51008a975b	Refactor: Introduce generator methods for scrapers This allows one comic module class to generate multiple scrapers. This change is to support a more dynamic module system as described in #42.	2016-05-21 01:29:36 +02:00
Tobias Gruetzmacher	efe1308db2	Replace home-grown Python2/3 compat. with six.	2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher	0c1aa9e8bd	Move libxml < 2.9.3 workaround to base class.	2016-05-02 23:22:06 +02:00
Tobias Gruetzmacher	8b1ac4eb35	Fix "tagsoup" on SmackJeeves Unfortunately, browsers render < outside of HTML tags differently than libXML until recently (libXML 2.9.3), so we need to preprocess pages before parsing them... (This was fixed in libXML commit 140c25)	2016-04-26 08:05:38 +02:00
Tobias Gruetzmacher	fd85c8583a	Unify similar code in fetchUrl and fetchText	2016-04-22 00:42:46 +02:00
Tobias Gruetzmacher	6574997e01	Refactor: All the other class methods. Turns out, it would have been better if all methods had been instance methods and not class methods. This finished a big chunk of the rework needed for #42.	2016-04-21 23:52:31 +02:00
Tobias Gruetzmacher	0d436b8ca9	Refactor: url modifiers to normal methods. As before, to implement #42 these might want to access information from the instance, so they should be normal methods.	2016-04-21 21:39:25 +02:00
Tobias Gruetzmacher	c3f32dfef7	Refactor: Make namer a method. When #42 is realized, the naming of files might differ between comic modules, so the namer's logical location is the instance, not the class.	2016-04-21 08:20:49 +02:00
Tobias Gruetzmacher	5bd2a49f48	Add debug output on matched XPath/CSS expression.	2016-04-20 23:51:54 +02:00
Tobias Gruetzmacher	190cd3b063	Convert language & getDisabledReasons to methods. Both are more properties of a webcomic (this is part of the design changes for #42)	2016-04-19 23:53:46 +02:00
Tobias Gruetzmacher	df46907f39	Register EXSLT extensions by default. This allows comic module authors to use the full power of regular expressions in XPath expression, see http://exslt.org/regexp/regexp.html for usage. Please be aware that these use the prefix re: instead of regexp: here.	2016-04-19 23:48:14 +02:00
Tobias Gruetzmacher	ee99c087d7	Remove prevUrlMatchesStripUrl. It was only used for one test.	2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher	92a688457a	Remove useless indirection.	2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher	060281e5ff	Use concrete scraper objects everywhere. This is a first step for #42. Since most access to the scraper classes is through instances, modules can now dynamically override url and name (name is now a property).	2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher	f6e605e146	Fix unicode error in text search.	2016-04-10 13:16:30 +02:00
Tobias Gruetzmacher	8768ff07b6	Fix AhoiPolloi, be a bit smarter about encoding. HTML character encoding in the context of HTTP is quite tricky to get right and honestly, I'm not sure if I did get it right this time. But I think, the current behaviour matches best what web browsers try to do: 1. Let Requests figure out the content from the HTTP header. This overrides everything else. We need to "trick" LXML to accept our decision if the document contains an XML declaration which might disagree with the HTTP header. 2. If the HTTP headers don't specify any encoding, let LXML guess the encoding and be done with it.	2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher	c4fcd985dd	Let urllib3 handle all retries.	2016-03-13 21:30:36 +01:00
Tobias Gruetzmacher	78e13962f9	Sort scraper modules (mostly for test stability).	2016-03-13 20:24:21 +01:00
Damjan Košir	fd9c480d9c	adding bonus panel to SWBC and multiple images flag to ParserScraper	2015-08-03 22:58:44 +12:00
Tobias Gruetzmacher	303432fc68	Also use css expressions for textSearch.	2015-07-18 01:22:40 +02:00
Tobias Gruetzmacher	808b624e5f	Remove hard dependency on pycountry again. This basically reverts commit `86b31dc12b`. It now works like this: If the user has pycountry installed, it is used. If not, Dosage falls back to a small internal list generated from pycountry by scripts/mklanguages.py. This means additional work if we ever decide to translate Dosage, since pycountry already has all the translations for language names... This fixes #23.	2015-07-11 01:27:39 +02:00
Damjan Košir	79d775a8d9	adding comicpress scraper	2015-05-16 00:15:32 +12:00
Tobias Gruetzmacher	ff21df596b	Remove descriptions and genres (closes #9). Maintaining the descriptions creates quite a bit of overhead (finding them, copying them, checking if they are still correct) for a minimal user benefit. PS: Viewing this diff should be easier in a difftool that shows changes in a line, for example kdiff3.	2015-04-20 20:29:09 +02:00
Tobias Gruetzmacher	1d52d6a152	Add support for CSS selectors to HTML parser. Each comic module author can decide if she wants to use CSS or XPath, not a mix of both. Using CSS needs the cssselect python module and the module gets disabled if it is unavailable.	2014-10-13 22:43:06 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	e92a3fb3a1	New feature: Comic modules can be "disabled". This is modeled parallel to the "adult" feature, except the user can't override it via the command line. Each comic module can override the classmethod getDisabledReasons and give the user a reason why this module is disabled. The user can see the reason in the comic list (-l or --singlelist) and the comic module refuses to run, showing the same message. This is currently used to disable modules that use the _ParserScraper if the LXML python module is missing.	2014-10-13 21:43:46 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
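
Commits 8d84361de4 and a83911aa67 above describe keeping images in the order they were found when removing duplicates, and favouring the first image when multiple images are not expected. A minimal sketch of that idea follows. It is not the code from those commits: the function names `unique_in_order` and `pick_images` and the `multiple_expected` parameter are hypothetical, chosen only for illustration.

```python
def unique_in_order(urls):
    """Drop duplicate URLs while keeping the order of first appearance."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def pick_images(urls, multiple_expected=False):
    """Hypothetical helper: return all unique images, or only the first
    one found when multiple images are not expected."""
    unique = unique_in_order(urls)
    if not multiple_expected:
        return unique[:1]
    return unique
```

An explicit `seen` set is used instead of relying on dictionary ordering, so the sketch also behaves the same on older interpreters such as Python 3.5, which the log mentions was still supported in 2020.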

119 commits