Tobias Gruetzmacher
1d52d6a152
Add support for CSS selectors to HTML parser.
...
Each comic module author can decide if she wants to use CSS or XPath,
not a mix of both. Using CSS needs the cssselect python module and the
module gets disabled if it is unavailable.
2014-10-13 22:43:06 +02:00
Tobias Gruetzmacher
17bc454132
Bugfix: Don't assume RE patterns in base class.
2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher
e92a3fb3a1
New feature: Comic modules can be "disabled".
...
This is modeled parallel to the "adult" feature, except the user can't
override it via the command line. Each comic module can override the
classmethod getDisabledReasons and give the user a reason why this
module is disabled. The user can see the reason in the comic list (-l or
--singlelist) and the comic module refuses to run, showing the same
message.
This is currently used to disable modules that use the _ParserScraper if
the LXML python module is missing.
2014-10-13 21:43:46 +02:00
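A minimal sketch of the mechanism described in the commit above, not the project's actual code. Only the name getDisabledReasons and the idea of disabling _ParserScraper modules when LXML is missing come from the commit message. The class names Scraper and ParserScraper, the required_module attribute, the helpers module_available and run, and the dict return shape are all assumptions made for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if the named top-level module can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


class Scraper:
    """Hypothetical stand-in for the comic module base class."""

    @classmethod
    def getDisabledReasons(cls):
        # Assumption: an empty dict means "enabled"; otherwise the dict
        # maps a short key to a human-readable reason.
        return {}


class ParserScraper(Scraper):
    """Hypothetical stand-in for _ParserScraper, which depends on lxml."""

    required_module = "lxml"

    @classmethod
    def getDisabledReasons(cls):
        reasons = {}
        if not module_available(cls.required_module):
            reasons[cls.required_module] = (
                "This module needs the %s python module, which is not installed."
                % cls.required_module
            )
        return reasons


def run(scraperclass):
    """Refuse to run a disabled module, reporting the same reason the list shows."""
    reasons = scraperclass.getDisabledReasons()
    if reasons:
        raise RuntimeError("; ".join(reasons.values()))
    return "running %s" % scraperclass.__name__
```

Because the check lives in a classmethod, a listing command can query every module for its reasons without instantiating it, and the run path can reuse the same message.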
Tobias Gruetzmacher
3235b8b312
Pass unicode strings to lxml.
...
This reverts commit fcde86e9c0 & some more. This lets python-requests do
all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML
to parse.
2014-10-13 19:39:48 +02:00
Tobias Gruetzmacher
f9f0b75d7c
Create new HTML parser based scraper class.
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher
0e03eca8f0
Move all regular expression operations into the new class.
...
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher
fde1fdced6
Fix some typos.
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher
4265053846
Refactor: Move regular expression scraping into a new class.
...
- This also makes "<base href>" handling an internal detail of the regular
expression scraper; future scrapers might not need it or might handle it
in another way.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam
3a929ceea6
Allow comic text to be optional. Patch from TobiX
2014-07-24 20:49:57 +02:00
Bastian Kleineidam
875e431edc
Provide page data in shouldSkipUrl() function
2014-02-10 21:58:09 +01:00
Bastian Kleineidam
5fe48d013a
Increase wait interval.
2014-01-05 17:14:19 +01:00
Bastian Kleineidam
4d63920434
Updated copyright.
2014-01-05 16:50:57 +01:00
Bastian Kleineidam
b6c913e2d5
Wait some time between requests.
2014-01-05 16:23:45 +01:00
Bastian Kleineidam
799d3040f0
Refactoring
2013-12-11 17:54:39 +01:00
Bastian Kleineidam
7343932a5a
Strip whitespace from image text.
2013-12-04 18:07:13 +01:00
Bastian Kleineidam
0e5c59133c
Provide HTML page data for image URL modifier function.
2013-12-04 17:54:55 +01:00
Bastian Kleineidam
0eaf9a3139
Add text search in comic strips.
2013-11-29 20:26:49 +01:00
Bastian Kleineidam
ca17332942
Call self.starter() on indexed comics since it might set cookies.
2013-11-07 20:48:10 +01:00
Bastian Kleineidam
ebdc1e6359
More unicode output fixes.
2013-04-30 06:41:19 +02:00
Bastian Kleineidam
80d7defcd2
Unicode descriptions.
2013-04-29 07:35:56 +02:00
Bastian Kleineidam
05dbc51d3e
Detect completed end-of-life comics.
2013-04-25 22:40:06 +02:00
Bastian Kleineidam
35c031ca81
Fixed some comics.
2013-04-11 18:27:43 +02:00
Bastian Kleineidam
b9dc385ff2
Implemented voting
2013-04-09 19:33:50 +02:00
Bastian Kleineidam
4528281ddd
Voting part 2
2013-04-08 21:20:01 +02:00
Bastian Kleineidam
e762f269b7
First part of voting stuff.
2013-04-08 20:19:10 +02:00
Bastian Kleineidam
97522bc5ae
Use tuples rather than lists.
2013-04-05 18:55:19 +02:00
Bastian Kleineidam
2c0ca04882
Fix warning for scrapers with multiple image patterns.
2013-04-03 20:32:19 +02:00
Bastian Kleineidam
1d7f7a8517
Fix genre list
2013-03-26 19:58:22 +01:00
Bastian Kleineidam
10985ae614
Add genre tags.
2013-03-26 17:33:27 +01:00
Bastian Kleineidam
ec33276fd7
Print stacktrace on image errors.
2013-03-25 19:48:47 +01:00
Tobias Gruetzmacher
0a218c0283
Add event comicPageLink for every previous link.
...
This event allows a listener to build connections between pages.
2013-03-24 16:36:02 +01:00
Bastian Kleineidam
6a2f55ddef
Don't stop on image regex errors.
2013-03-15 07:03:54 +01:00
Bastian Kleineidam
43f20270d0
Allow a list of regular expressions for image and previous link search.
2013-03-12 20:48:26 +01:00
Bastian Kleineidam
88e28f3923
Fix some comics and add language tag.
2013-03-08 22:33:05 +01:00
Bastian Kleineidam
4c344765ff
Add option to wait before downloading.
2013-03-08 06:46:50 +01:00
Bastian Kleineidam
736d9aa8cf
Code cleanup.
2013-03-07 18:22:39 +01:00
Bastian Kleineidam
bae2a96d8b
Added some comic strips and cleaned up the scraper code.
2013-03-06 20:00:30 +01:00
Bastian Kleineidam
3712799ee0
Add imageUrlModifier() for scrapers.
2013-03-04 19:10:27 +01:00
Bastian Kleineidam
f36ed46d6a
Fix tests which hit the first URL.
2013-02-21 19:48:21 +01:00
Bastian Kleineidam
ae0e9feea1
Remember skipped URLs.
2013-02-20 20:51:39 +01:00
Bastian Kleineidam
6155b022a6
Allow selected strips without images.
2013-02-18 20:03:27 +01:00
Bastian Kleineidam
4f03963b9e
Code cleanup.
2013-02-18 20:02:16 +01:00
Bastian Kleineidam
c4191158ec
Sort scrapers only when listing them.
2013-02-18 20:01:50 +01:00
Bastian Kleineidam
dc9334cca9
Fix scraperclass function. Closes issue #7.
2013-02-18 19:59:16 +01:00
Bastian Kleineidam
40de445d8c
Allow multiple comic name matches.
2013-02-13 22:18:05 +01:00
Bastian Kleineidam
93c48fb7e2
Make _BasicScraper hashable.
2013-02-13 20:00:16 +01:00
Bastian Kleineidam
23a1acd398
Add firstStripUrl to scrapers.
2013-02-13 19:59:59 +01:00
Bastian Kleineidam
312d117ff3
Rename get_scrapers to get_scraperclasses
2013-02-13 19:59:13 +01:00
Bastian Kleineidam
6d0fffd825
Always use connection pooling.
2013-02-12 17:55:13 +01:00
Bastian Kleineidam
67836942d8
Simplify the fetchUrl code.
2013-02-11 19:43:46 +01:00