Commit graph

85 commits

Author SHA1 Message Date
Tobias Gruetzmacher ee99c087d7 Remove prevUrlMatchesStripUrl.
It was only used for one test.
2016-04-16 01:14:26 +02:00
Tobias Gruetzmacher 92a688457a Remove useless indirection. 2016-04-15 23:42:24 +02:00
Tobias Gruetzmacher 060281e5ff Use concrete scraper objects everywhere.
This is a first step for #42. Since most access to the scraper classes
is through instances, modules can now dynamically override url and name
(name is now a property).
2016-04-13 22:17:30 +02:00
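The effect of working with instances, where a module can override name and url per object, could look roughly like the sketch below. The class names, the constructor arguments and the example URL are hypothetical and are not taken from the Dosage code base.

```python
class Scraper:
    """Hypothetical base class: url is a plain attribute, name is a property."""
    url = None

    @property
    def name(self):
        # Default behaviour in this sketch: derive the name from the class.
        return type(self).__name__


class MirroredComic(Scraper):
    """Hypothetical module that sets name and url per instance."""

    def __init__(self, name, url):
        self._name = name
        self.url = url  # instance attribute shadows the class attribute

    @property
    def name(self):
        return self._name


scrapers = [
    Scraper(),
    MirroredComic("Example/Mirror", "https://comics.example/mirror/"),
]
names = [s.name for s in scrapers]
```

Because callers only read `s.name` and `s.url` on instances, they do not need to know whether the values come from the class or were computed at construction time.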
Tobias Gruetzmacher f6e605e146 Fix unicode error in text search. 2016-04-10 13:16:30 +02:00
Tobias Gruetzmacher 8768ff07b6 Fix AhoiPolloi, be a bit smarter about encoding.
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I did get it right this time. But I
think the current behaviour matches best what web browsers try to do:

1. Let Requests figure out the encoding from the HTTP header. This
   overrides everything else. We need to "trick" LXML to accept our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
2016-04-06 22:22:22 +02:00
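The two-step decision described in this commit message can be sketched with the standard library alone. This is not the code from the commit: the function names are hypothetical, and removing the XML declaration is only one possible way to keep a parser from disagreeing with the HTTP header (lxml, for instance, rejects unicode strings that still carry an encoding declaration).

```python
import re
from email.message import Message

# Matches a leading XML declaration such as <?xml version="1.0" encoding="..."?>
XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def prepare_for_parser(body, content_type):
    """Hypothetical helper: decide what to hand to the HTML parser.

    1. If the HTTP header names a charset, it wins: decode the bytes and
       drop any XML declaration that might claim a different encoding.
    2. Otherwise return the raw bytes so the parser can guess on its own.
    """
    charset = charset_from_content_type(content_type)
    if charset is None:
        return body
    # An unknown charset name would raise LookupError here; a real
    # implementation would have to handle that case.
    text = body.decode(charset, errors='replace')
    return XML_DECLARATION.sub('', text, count=1)


page = ('<?xml version="1.0" encoding="iso-8859-1"?>'
        '<html><body>\u00e4</body></html>').encode('utf-8')
decoded = prepare_for_parser(page, 'text/html; charset=utf-8')
untouched = prepare_for_parser(page, 'text/html')
```

In this example `decoded` is a string without the misleading declaration, while `untouched` is still the original byte string.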
Tobias Gruetzmacher c4fcd985dd Let urllib3 handle all retries. 2016-03-13 21:30:36 +01:00
Tobias Gruetzmacher 78e13962f9 Sort scraper modules (mostly for test stability). 2016-03-13 20:24:21 +01:00
Damjan Košir fd9c480d9c adding bonus panel to SWBC and multiple images flag to ParserScraper 2015-08-03 22:58:44 +12:00
Tobias Gruetzmacher 303432fc68 Also use css expressions for textSearch. 2015-07-18 01:22:40 +02:00
Tobias Gruetzmacher 808b624e5f Remove hard dependency on pycountry again.
This basically reverts commit 86b31dc12b.

It now works like this: If the user has pycountry installed, it is used.
If not, Dosage falls back to a small internal list generated from
pycountry by scripts/mklanguages.py.

This means additional work if we ever decide to translate Dosage, since
pycountry already has all the translations for language names...

This fixes #23.
2015-07-11 01:27:39 +02:00
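An optional dependency with an internal fallback, as described above, is commonly written along the following lines. This is a sketch, not the code from the commit: FALLBACK_LANGUAGES and language_name are hypothetical names, the three entries stand in for the list generated by scripts/mklanguages.py, and the `alpha_2` keyword is an assumption that matches recent pycountry releases (older ones used a different keyword).

```python
try:
    import pycountry
except ImportError:
    # pycountry is optional; fall back to the internal list below.
    pycountry = None

# Hypothetical stand-in for the small internal list generated from pycountry.
FALLBACK_LANGUAGES = {
    'de': 'German',
    'en': 'English',
    'fr': 'French',
}


def language_name(code):
    """Return the English name for a two-letter language code."""
    if pycountry is not None:
        try:
            lang = pycountry.languages.get(alpha_2=code)
        except LookupError:
            # Some pycountry versions raise instead of returning None.
            lang = None
        if lang is not None:
            return lang.name
    # Unknown codes are returned unchanged in this sketch.
    return FALLBACK_LANGUAGES.get(code, code)
```

The caller sees the same function whether or not pycountry is installed; only the amount of available data differs.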
Damjan Košir 79d775a8d9 adding comicpress scraper 2015-05-16 00:15:32 +12:00
Tobias Gruetzmacher ff21df596b Remove descriptions and genres (closes #9).
Maintaining the descriptions creates quite a bit of overhead (finding
them, copying them, checking if they are still correct) for a minimal
user benefit.

PS: Viewing this diff should be easier in a difftool that shows changes
in a line, for example kdiff3.
2015-04-20 20:29:09 +02:00
Tobias Gruetzmacher 1d52d6a152 Add support for CSS selectors to HTML parser.
Each comic module author can decide if she wants to use CSS or XPath,
not a mix of both. Using CSS needs the cssselect python module and the
module gets disabled if it is unavailable.
2014-10-13 22:43:06 +02:00
Tobias Gruetzmacher 17bc454132 Bugfix: Don't assume RE patterns in base class. 2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher e92a3fb3a1 New feature: Comic modules can be "disabled".
This is modeled parallel to the "adult" feature, except the user can't
override it via the command line. Each comic module can override the
classmethod getDisabledReasons and give the user a reason why this
module is disabled. The user can see the reason in the comic list (-l or
--singlelist) and the comic module refuses to run, showing the same
message.

This is currently used to disable modules that use the _ParserScraper if
the LXML python module is missing.
2014-10-13 21:43:46 +02:00
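The mechanism described in this commit message might be sketched as follows. Only the classmethod name getDisabledReasons comes from the message; the dictionary return type, the run method, the helper module_available and the deliberately non-existent module name are assumptions made for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be imported."""
    return importlib.util.find_spec(name) is not None


class Scraper:
    @classmethod
    def getDisabledReasons(cls):
        # Assumed convention: an empty mapping means "not disabled".
        return {}

    def run(self):
        reasons = self.getDisabledReasons()
        if reasons:
            # Refuse to run and show the same message a listing would show.
            raise RuntimeError('; '.join(reasons.values()))
        return 'running ' + type(self).__name__


class NeedsParser(Scraper):
    # Hypothetical module name, chosen so that it is certainly missing.
    required_module = 'hypothetical_parser_module_that_is_missing'

    @classmethod
    def getDisabledReasons(cls):
        if module_available(cls.required_module):
            return {}
        return {
            'missing-module':
                'This module needs the %s python module.' % cls.required_module,
        }
```

A listing command could call `getDisabledReasons()` on every module class to print the reason next to the comic name, without instantiating or running anything.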
Tobias Gruetzmacher 3235b8b312 Pass unicode strings to lxml.
This reverts commit fcde86e9c0 & some
more. This lets python-requests do all the encoding stuff and leaves
LXML with (hopefully) clean unicode HTML to parse.
2014-10-13 19:39:48 +02:00
Tobias Gruetzmacher f9f0b75d7c Create new HTML parser based scraper class. 2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher 0e03eca8f0 Move all regular expression operations into the new class.
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher fde1fdced6 Fix some typos. 2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher 4265053846 Refactor: Move regular expression scraping into a new class.
- This also makes "<base href>" handling an internal detail of the regular
  expression scraper, future scrapers might not need that or handle it in
  another way.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam 3a929ceea6 Allow comic text to be optional. Patch from TobiX 2014-07-24 20:49:57 +02:00
Bastian Kleineidam 875e431edc Provide page data in shouldSkipUrl() function 2014-02-10 21:58:09 +01:00
Bastian Kleineidam 5fe48d013a Increase wait interval. 2014-01-05 17:14:19 +01:00
Bastian Kleineidam 4d63920434 Updated copyright. 2014-01-05 16:50:57 +01:00
Bastian Kleineidam b6c913e2d5 Wait some time between requests. 2014-01-05 16:23:45 +01:00
Bastian Kleineidam 799d3040f0 Refactoring 2013-12-11 17:54:39 +01:00
Bastian Kleineidam 7343932a5a Strip whitespace from image text. 2013-12-04 18:07:13 +01:00
Bastian Kleineidam 0e5c59133c Provide HTML page data for image URL modifier function. 2013-12-04 17:54:55 +01:00
Bastian Kleineidam 0eaf9a3139 Add text search in comic strips. 2013-11-29 20:26:49 +01:00
Bastian Kleineidam ca17332942 Call self.starter() on indexed comics since it might set cookies. 2013-11-07 20:48:10 +01:00
Bastian Kleineidam ebdc1e6359 More unicode output fixes. 2013-04-30 06:41:19 +02:00
Bastian Kleineidam 80d7defcd2 Unicode descriptions. 2013-04-29 07:35:56 +02:00
Bastian Kleineidam 05dbc51d3e Detect completed end-of-life comics. 2013-04-25 22:40:06 +02:00
Bastian Kleineidam 35c031ca81 Fixed some comics. 2013-04-11 18:27:43 +02:00
Bastian Kleineidam b9dc385ff2 Implemented voting 2013-04-09 19:33:50 +02:00
Bastian Kleineidam 4528281ddd Voting part 2 2013-04-08 21:20:01 +02:00
Bastian Kleineidam e762f269b7 First part of voting stuff. 2013-04-08 20:19:10 +02:00
Bastian Kleineidam 97522bc5ae Use tuples rather than lists. 2013-04-05 18:55:19 +02:00
Bastian Kleineidam 2c0ca04882 Fix warning for scrapers with multiple image patterns. 2013-04-03 20:32:19 +02:00
Bastian Kleineidam 1d7f7a8517 Fix genre list 2013-03-26 19:58:22 +01:00
Bastian Kleineidam 10985ae614 Add genre tags. 2013-03-26 17:33:27 +01:00
Bastian Kleineidam ec33276fd7 Print stacktrace on image errors. 2013-03-25 19:48:47 +01:00
Tobias Gruetzmacher 0a218c0283 Add event comicPageLink for every previous link.
This event allows a listener to build connections between pages.
2013-03-24 16:36:02 +01:00
Bastian Kleineidam 6a2f55ddef Don't stop on image regex errors. 2013-03-15 07:03:54 +01:00
Bastian Kleineidam 43f20270d0 Allow a list of regular expressions for image and previous link search. 2013-03-12 20:48:26 +01:00
Bastian Kleineidam 88e28f3923 Fix some comics and add language tag. 2013-03-08 22:33:05 +01:00
Bastian Kleineidam 4c344765ff Add option to wait before downloading. 2013-03-08 06:46:50 +01:00
Bastian Kleineidam 736d9aa8cf Code cleanup. 2013-03-07 18:22:39 +01:00
Bastian Kleineidam bae2a96d8b Added some comic strips and cleaned up the scraper code. 2013-03-06 20:00:30 +01:00
Bastian Kleineidam 3712799ee0 Add imageUrlModifier() for scrapers. 2013-03-04 19:10:27 +01:00