dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	752525c3e9	Fix some old modules using the Internet Archive	2020-01-09 17:38:13 +01:00
Tobias Gruetzmacher	a347bebfe3	Add simple host-based throttling	2019-12-04 00:28:27 +01:00
Tobias Gruetzmacher	e5e7dfacd6	Move basic HTTP setup into a new module. We now subclass requests' Session to make further extensions of the HTTP flow possible.	2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher	00d0201c5f	Fix a bunch of flake8 issues	2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher	e24c0ae557	Simplify voting code. Not sure if I keep this feature, but at least I can now see if anybody is still using it...	2019-11-03 21:04:34 +01:00
Tobias Gruetzmacher	90685d9b0c	Only support modern versions of PyCountry.	2017-11-26 19:29:48 +01:00
sizlo	a83911aa67	Favour the first image we found when we're not expecting multiple images	2017-04-18 21:59:04 +01:00
sizlo	8d84361de4	Preserve the order we found images in when removing duplicate images	2017-04-18 21:58:12 +01:00
Tobias Gruetzmacher	3f9feec041	Allow modules to ignore some HTTP error codes. This is necessary since it seems some webservers out there are misconfigured to deliver actual content with an HTTP error code...	2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher	bc755d09a3	Apply link modifier to all links. This was previously only the "previous link modifier", now it can also modify "next" and "latest" links. Additionally, the modifier is given the current URL, so those cases can be distinguished.	2016-11-01 01:50:44 +01:00
Tobias Gruetzmacher	9a6a310b76	Fixup copyright years.	2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher	64c8e502ca	Ignore case for comic download directories. Since we already match comics case-insensitive on the command line, this was a logical step, even if this means changing quite a bit of code that all tries to resolve the "comic directory" in a slightly different way...	2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher	215d597573	Remove DrunkDuck for now. - It's been disabled for ages - Needs a major rework - I don't want to add that many comics anyways... - This also gets rid of make_scraper :)	2016-06-05 22:22:17 +02:00
Tobias Gruetzmacher	df2048cb34	Keep track of removed and moved comics (fixes #41). I plan on keeping this list for at least ~ 2 releases and then purging older entries...	2016-06-05 21:47:58 +02:00
Tobias Gruetzmacher	295b53a2d3	Fix name overrides (broken by 51008a).	2016-06-05 10:03:29 +02:00
Tobias Gruetzmacher	51008a975b	Refactor: Introduce generator methods for scrapers. This allows one comic module class to generate multiple scrapers. This change is to support a more dynamic module system as described in #42.	2016-05-21 01:29:36 +02:00
Tobias Gruetzmacher	efe1308db2	Replace home-grown Python2/3 compat. with six.	2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher	0c1aa9e8bd	Move libxml < 2.9.3 workaround to base class.	2016-05-02 23:22:06 +02:00
Tobias Gruetzmacher	8b1ac4eb35	Fix "tagsoup" on SmackJeeves. Unfortunately, browsers render < outside of HTML tags differently than libXML until recently (libXML 2.9.3), so we need to preprocess pages before parsing them... (This was fixed in libXML commit 140c25)	2016-04-26 08:05:38 +02:00
Tobias Gruetzmacher	fd85c8583a	Unify similar code in fetchUrl and fetchText	2016-04-22 00:42:46 +02:00
Tobias Gruetzmacher	6574997e01	Refactor: All the other class methods. Turns out, it would have been better if all methods had been instance methods and not class methods. This finished a big chunk of the rework needed for #42.	2016-04-21 23:52:31 +02:00
Tobias Gruetzmacher	0d436b8ca9	Refactor: url modifiers to normal methods. As before, to implement #42 these might want to access information from the instance, so they should be normal methods.	2016-04-21 21:39:25 +02:00
Tobias Gruetzmacher	c3f32dfef7	Refactor: Make namer a method. When #42 is realized, the naming of files might differ between comic modules, so the namer's logical location is the instance, not the class.	2016-04-21 08:20:49 +02:00
Tobias Gruetzmacher	5bd2a49f48	Add debug output on matched XPath/CSS expression.	2016-04-20 23:51:54 +02:00
Tobias Gruetzmacher	190cd3b063	Convert language & getDisabledReasons to methods. Both are more properties of a webcomic (this is part of the design changes for #42)	2016-04-19 23:53:46 +02:00
Tobias Gruetzmacher	df46907f39	Register EXSLT extensions by default. This allows comic module authors to use the full power of regular expressions in XPath expressions, see http://exslt.org/regexp/regexp.html for usage. Please be aware that these use the prefix re: instead of regexp: here.	2016-04-19 23:48:14 +02:00
Tobias Gruetzmacher	ee99c087d7	Remove prevUrlMatchesStripUrl. It was only used for one test.	2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher	92a688457a	Remove useless indirection.	2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher	060281e5ff	Use concrete scraper objects everywhere. This is a first step for #42. Since most access to the scraper classes is through instances, modules can now dynamically override url and name (name is now a property).	2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher	f6e605e146	Fix unicode error in text search.	2016-04-10 13:16:30 +02:00
Tobias Gruetzmacher	8768ff07b6	Fix AhoiPolloi, be a bit smarter about encoding. HTML character encoding in the context of HTTP is quite tricky to get right and honestly, I'm not sure if I did get it right this time. But I think, the current behaviour matches best what web browsers try to do: 1. Let Requests figure out the content from the HTTP header. This overrides everything else. We need to "trick" LXML to accept our decision if the document contains an XML declaration which might disagree with the HTTP header. 2. If the HTTP headers don't specify any encoding, let LXML guess the encoding and be done with it.	2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher	c4fcd985dd	Let urllib3 handle all retries.	2016-03-13 21:30:36 +01:00
Tobias Gruetzmacher	78e13962f9	Sort scraper modules (mostly for test stability).	2016-03-13 20:24:21 +01:00
Damjan Košir	fd9c480d9c	adding bonus panel to SWBC and multiple images flag to ParserScraper	2015-08-03 22:58:44 +12:00
Tobias Gruetzmacher	303432fc68	Also use css expressions for textSearch.	2015-07-18 01:22:40 +02:00
Tobias Gruetzmacher	808b624e5f	Remove hard dependency on pycountry again. This basically reverts commit `86b31dc12b`. It now works like this: If the user has pycountry installed, it is used. If not, Dosage falls back to a small internal list generated from pycountry by scripts/mklanguages.py. This means additional work if we ever decide to translate Dosage, since pycountry already has all the translations for language names... This fixes #23.	2015-07-11 01:27:39 +02:00
Damjan Košir	79d775a8d9	adding comicpress scraper	2015-05-16 00:15:32 +12:00
Tobias Gruetzmacher	ff21df596b	Remove descriptions and genres (closes #9). Maintaining the descriptions creates quite a bit of overhead (finding them, copying them, checking if they are still correct) for a minimal user benefit. PS: Viewing this diff should be easier in a difftool that shows changes in a line, for example kdiff3.	2015-04-20 20:29:09 +02:00
Tobias Gruetzmacher	1d52d6a152	Add support for CSS selectors to HTML parser. Each comic module author can decide if she wants to use CSS or XPath, not a mix of both. Using CSS needs the cssselect python module and the module gets disabled if it is unavailable.	2014-10-13 22:43:06 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	e92a3fb3a1	New feature: Comic modules can be "disabled". This is modeled parallel to the "adult" feature, except the user can't override it via the command line. Each comic module can override the classmethod getDisabledReasons and give the user a reason why this module is disabled. The user can see the reason in the comic list (-l or --singlelist) and the comic module refuses to run, showing the same message. This is currently used to disable modules that use the _ParserScraper if the LXML python module is missing.	2014-10-13 21:43:46 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
Tobias Gruetzmacher	f9f0b75d7c	Create new HTML parser based scraper class.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	0e03eca8f0	Move all regular expression operations into the new class. - Move fetchUrls, fetchUrl and fetchText. - Move base URL handling.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	fde1fdced6	Fix some typos.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	4265053846	Refactor: Move regular expression scraping into a new class. - This also makes "<base href>" handling an internal detail of the regular expression scraper, future scrapers might not need that or handle it in another way.	2014-07-26 11:28:43 +02:00
Bastian Kleineidam	3a929ceea6	Allow comic text to be optional. Patch from TobiX	2014-07-24 20:49:57 +02:00
Bastian Kleineidam	875e431edc	Provide page data in shouldSkipUrl() function	2014-02-10 21:58:09 +01:00
Bastian Kleineidam	5fe48d013a	Increase wait interval.	2014-01-05 17:14:19 +01:00
Bastian Kleineidam	4d63920434	Updated copyright.	2014-01-05 16:50:57 +01:00
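
The behaviour described in commits a83911aa67 and 8d84361de4 (drop duplicate images while keeping the order in which they were found, and favour the first image when multiple images are not expected) can be illustrated roughly as follows. This is only a sketch for modern Python 3, not the code from those commits; the function name `select_images` and the `multiple_images` parameter are hypothetical names chosen for illustration.

```python
def select_images(urls, multiple_images=False):
    """Deduplicate image URLs in discovery order (hypothetical helper).

    dict.fromkeys keeps the first occurrence of each key and, on
    Python 3.7+, preserves insertion order, so duplicates are dropped
    without reordering the remaining URLs.
    """
    unique = list(dict.fromkeys(urls))
    if multiple_images:
        return unique
    # Not expecting several images: keep only the first one found.
    return unique[:1]


found = ["page1.png", "bonus.png", "page1.png"]
all_images = select_images(found, multiple_images=True)
first_only = select_images(found)
```

Using a plain `set` for deduplication would lose the discovery order, which is why an ordered structure is used in this sketch.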


111 commits
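
As an illustration of what commit df46907f39 makes possible, here is a small sketch of an EXSLT regular expression inside an XPath expression, using lxml directly rather than Dosage's own scraper classes. The sample HTML, the pattern and the variable names are assumptions made up for this example; in plain lxml the `re:` prefix has to be bound to the EXSLT namespace explicitly, which the commit message suggests Dosage does for module authors.

```python
from lxml import html

# Namespace URI of the EXSLT regular-expressions module, bound to "re:".
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Made-up sample page: one comic image and one unrelated banner.
page = html.fromstring(
    '<div>'
    '<img src="/comics/2016-04-19.png">'
    '<img src="/ads/banner.gif">'
    '</div>'
)

# re:test() keeps only the <img> elements whose src matches the pattern.
comic_images = page.xpath(
    r'//img[re:test(@src, "^/comics/\d{4}-\d{2}-\d{2}\.png$")]/@src',
    namespaces=EXSLT_RE,
)
```

Here `comic_images` contains only the `/comics/...` URL; the banner is filtered out by the regular expression.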