dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	060281e5ff	Use concrete scraper objects everywhere. This is a first step for #42. Since most access to the scraper classes is through instances, modules can now dynamically override url and name (name is now a property).	2016-04-13 22:17:30 +02:00
Tobias Gruetzmacher	f6e605e146	Fix unicode error in text search.	2016-04-10 13:16:30 +02:00
Tobias Gruetzmacher	8768ff07b6	Fix AhoiPolloi, be a bit smarter about encoding. HTML character encoding in the context of HTTP is quite tricky to get right and honestly, I'm not sure if I did get it right this time. But I think, the current behaviour matches best what web browsers try to do: 1. Let Requests figure out the content from the HTTP header. This overrides everything else. We need to "trick" LXML to accept our decision if the document contains an XML declaration which might disagree with the HTTP header. 2. If the HTTP headers don't specify any encoding, let LXML guess the encoding and be done with it.	2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher	c4fcd985dd	Let urllib3 handle all retries.	2016-03-13 21:30:36 +01:00
Tobias Gruetzmacher	78e13962f9	Sort scraper modules (mostly for test stability).	2016-03-13 20:24:21 +01:00
Damjan Košir	fd9c480d9c	adding bonus panel to SWBC and multiple images flag to ParserScraper	2015-08-03 22:58:44 +12:00
Tobias Gruetzmacher	303432fc68	Also use css expressions for textSearch.	2015-07-18 01:22:40 +02:00
Tobias Gruetzmacher	808b624e5f	Remove hard dependency on pycountry again. This basically reverts commit `86b31dc12b`. It now works like this: If the user has pycountry installed, it is used. If not, Dosage falls back to a small internal list generated from pycountry by scripts/mklanguages.py. This means additional work if we ever decide to translate Dosage, since pycountry already has all the translations for language names... This fixes #23.	2015-07-11 01:27:39 +02:00
Damjan Košir	79d775a8d9	adding comicpress scraper	2015-05-16 00:15:32 +12:00
Tobias Gruetzmacher	ff21df596b	Remove descriptions and genres (closes #9). Maintaining the descriptions creates quite a bit of overhead (finding them, copying them, checking if they are still correct) for a minimal user benefit. PS: Viewing this diff should be easier in a difftool that shows changes in a line, for example kdiff3.	2015-04-20 20:29:09 +02:00
Tobias Gruetzmacher	1d52d6a152	Add support for CSS selectors to HTML parser. Each comic module author can decide if she wants to use CSS or XPath, not a mix of both. Using CSS needs the cssselect python module and the module gets disabled if it is unavailable.	2014-10-13 22:43:06 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	e92a3fb3a1	New feature: Comic modules can be "disabled". This is modeled parallel to the "adult" feature, except the user can't override it via the command line. Each comic module can override the classmethod getDisabledReasons and give the user a reason why this module is disabled. The user can see the reason in the comic list (-l or --singlelist) and the comic module refuses to run, showing the same message. This is currently used to disable modules that use the _ParserScraper if the LXML python module is missing.	2014-10-13 21:43:46 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
Tobias Gruetzmacher	f9f0b75d7c	Create new HTML parser based scraper class.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	0e03eca8f0	Move all regular expression operation into the new class. - Move fetchUrls, fetchUrl and fetchText. - Move base URL handling.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	fde1fdced6	Fix some typos.	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	4265053846	Refactor: Move regular expression scraping into a new class. - This also makes "<base href>" handling an internal detail of the regular expression scraper, future scrapers might not need that or handle it in another way.	2014-07-26 11:28:43 +02:00
Bastian Kleineidam	3a929ceea6	Allow comic text to be optional. Patch from TobiX	2014-07-24 20:49:57 +02:00
Bastian Kleineidam	875e431edc	Provide page data in shouldSkipUrl() function	2014-02-10 21:58:09 +01:00
Bastian Kleineidam	5fe48d013a	Increase wait interval.	2014-01-05 17:14:19 +01:00
Bastian Kleineidam	4d63920434	Updated copyright.	2014-01-05 16:50:57 +01:00
Bastian Kleineidam	b6c913e2d5	Wait some time between requests.	2014-01-05 16:23:45 +01:00
Bastian Kleineidam	799d3040f0	Refactoring	2013-12-11 17:54:39 +01:00
Bastian Kleineidam	7343932a5a	Strip whitespace from image text.	2013-12-04 18:07:13 +01:00
Bastian Kleineidam	0e5c59133c	Provide HTML page data for image URL modifier function.	2013-12-04 17:54:55 +01:00
Bastian Kleineidam	0eaf9a3139	Add text search in comic strips.	2013-11-29 20:26:49 +01:00
Bastian Kleineidam	ca17332942	Call self.starter() on indexed comics since it might set cookies.	2013-11-07 20:48:10 +01:00
Bastian Kleineidam	ebdc1e6359	More unicode output fixes.	2013-04-30 06:41:19 +02:00
Bastian Kleineidam	80d7defcd2	Unicode descriptions.	2013-04-29 07:35:56 +02:00
Bastian Kleineidam	05dbc51d3e	Detect completed end-of-life comics.	2013-04-25 22:40:06 +02:00
Bastian Kleineidam	35c031ca81	Fixed some comics.	2013-04-11 18:27:43 +02:00
Bastian Kleineidam	b9dc385ff2	Implemented voting	2013-04-09 19:33:50 +02:00
Bastian Kleineidam	4528281ddd	Voting part 2	2013-04-08 21:20:01 +02:00
Bastian Kleineidam	e762f269b7	First part of voting stuff.	2013-04-08 20:19:10 +02:00
Bastian Kleineidam	97522bc5ae	Use tuples rather than lists.	2013-04-05 18:55:19 +02:00
Bastian Kleineidam	2c0ca04882	Fix warning for scrapers with multiple image patterns.	2013-04-03 20:32:19 +02:00
Bastian Kleineidam	1d7f7a8517	Fix genre list	2013-03-26 19:58:22 +01:00
Bastian Kleineidam	10985ae614	Add genre tags.	2013-03-26 17:33:27 +01:00
Bastian Kleineidam	ec33276fd7	Print stacktrace on image errors.	2013-03-25 19:48:47 +01:00
Tobias Gruetzmacher	0a218c0283	Add event comicPageLink for every previous link. This event allows a listener to build connections between pages.	2013-03-24 16:36:02 +01:00
Bastian Kleineidam	6a2f55ddef	Don't stop on image regex errors.	2013-03-15 07:03:54 +01:00
Bastian Kleineidam	43f20270d0	Allow a list of regular expressions for image and previous link search.	2013-03-12 20:48:26 +01:00
Bastian Kleineidam	88e28f3923	Fix some comics and add language tag.	2013-03-08 22:33:05 +01:00
Bastian Kleineidam	4c344765ff	Add option to wait before downloading.	2013-03-08 06:46:50 +01:00
Bastian Kleineidam	736d9aa8cf	Code cleanup.	2013-03-07 18:22:39 +01:00
Bastian Kleineidam	bae2a96d8b	Added some comic strips and cleanup the scraper code.	2013-03-06 20:00:30 +01:00
Bastian Kleineidam	3712799ee0	Add imageUrlModifier() for scrapers.	2013-03-04 19:10:27 +01:00
Bastian Kleineidam	f36ed46d6a	Fix tests which hit the first URL.	2013-02-21 19:48:21 +01:00
Bastian Kleineidam	ae0e9feea1	Remember skipped URLs.	2013-02-20 20:51:39 +01:00
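
The "disabled" module mechanism described in commit e92a3fb3a1 could look roughly like the sketch below. Only the classmethod name getDisabledReasons comes from the commit message; the class names, the run() method, the required_module attribute, and the message wording are hypothetical and are not taken from the Dosage source.

```python
import importlib.util


class Scraper:
    """Hypothetical base class; only getDisabledReasons is named in the log."""

    name = "Base"

    @classmethod
    def getDisabledReasons(cls):
        """Return a dict mapping a short key to a human-readable reason.

        An empty dict means the module is enabled.
        """
        return {}

    def run(self):
        """Refuse to run when disabled, showing the same reasons as the list view."""
        reasons = self.getDisabledReasons()
        if reasons:
            text = "; ".join(sorted(reasons.values()))
            raise RuntimeError("%s is disabled: %s" % (self.name, text))
        return "running %s" % self.name


class ParserScraper(Scraper):
    """Hypothetical stand-in for a scraper that needs an optional dependency."""

    name = "Parser"
    required_module = "lxml"  # assumption: attribute invented for this sketch

    @classmethod
    def getDisabledReasons(cls):
        reasons = {}
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(cls.required_module) is None:
            reasons[cls.required_module] = (
                "This module needs the %s python module, which is not installed."
                % cls.required_module
            )
        return reasons
```

Because the check lives in a classmethod, a listing command can query the reasons without instantiating the scraper, and run() reuses the same dict so both places show the same message.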

83 commits