Commit graph

93 commits

Author SHA1 Message Date
Tobias Gruetzmacher 3f9feec041 Allow modules to ignore some HTTP error codes.
This is neccessary since it seems some webservers out there are
misconfigured to deliver actual content with an HTTP error code...
2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher a02660a7d3 Replace custom @memoized with stdlib @lru_cache. 2016-10-29 00:46:49 +02:00
Tobias Gruetzmacher 9a6a310b76 Fixup copyright years. 2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher 4f80016bf0 Change robotparser import to make PyInstaller happy. 2016-06-06 22:42:01 +02:00
Tobias Gruetzmacher 64c8e502ca Ignore case for comic download directories.
Since we already match comics case-insensitive on the command line, this
was a logical step, even if this means changing quite a bit of code that
all tries to resolve the "comic directory" in a slightly different
way...
2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher efe1308db2 Replace home-grown Python2/3 compat. with six. 2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher 4204f5f1e4 Send "If-Modified-Since" header for images. 2016-04-19 00:36:50 +02:00
Tobias Gruetzmacher 9028724a74 Clean up update helper scripts. 2016-04-13 00:52:16 +02:00
Tobias Gruetzmacher 8768ff07b6 Fix AhoiPolloi, be a bit smarter about encoding.
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I did get it right this time. But I
think, the current behaviour matches best what web browsers try to do:

1. Let Requests figure out the content from the HTTP header. This
   overrides everything else. We need to "trick" LXML to accept our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher 6727e9b559 Use vendored urllib3.
As long as requests ships with urllib3, we can't fall back to the
"system" urllib3, since that breaks class-identity checks.
2016-03-16 23:18:19 +01:00
Tobias Gruetzmacher c4fcd985dd Let urllib3 handle all retries. 2016-03-13 21:30:36 +01:00
Johannes Schöpp 351fa7154e Modified maximum page size
Fixes #36
2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher 10d9eac574 Remove support for very old versions of "requests". 2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher 68d4dd463a Revert robots.txt handling.
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.

Rationale: Dosage is not a "robot" in the classical sense. It's not
designed to spider huge amounts of web sites in search for some content
to index, it's only intended to help users keep a personal archive of
comics he is interested in. We try very hard to never download any image
twice. This fixes #24.

(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots)
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher 7c15ea50d8 Also check robots.txt on image downloads.
We DO want to honour if images are blocked by robots.txt
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher 5affd8af68 More relaxed robots.txt handling.
This is in line with how Perl's LWP::RobotUA and Google handles server
errors when fetching robots.txt: Just assume access is allowed.

See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
2015-07-15 19:11:55 +02:00
Tobias Gruetzmacher 86b31dc12b Depend on pycountry directly. 2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher 5934f03453 Merge branch 'htmlparser' - I think it's ready.
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher 17bc454132 Bugfix: Don't assume RE patterns in base class. 2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher 3235b8b312 Pass unicode strings to lxml.
This reverts commit fcde86e9c0 & some
more. This lets python-requests do all the encoding stuff and leaves
LXML with (hopefully) clean unicode HTML to parse.
2014-10-13 19:39:48 +02:00
Bastian Kleineidam e43694c156 Don't crash on multiple HTML output runs per day. 2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher fcde86e9c0 Change getPageContent to (optionally) return raw text.
This allows LXML to do its own "magic" encoding detection
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher 0e03eca8f0 Move all regular expression operation into the new class.
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam 3a929ceea6 Allow comic text to be optional. Patch from TobiX 2014-07-24 20:49:57 +02:00
Bastian Kleineidam 93fe5d5987 Minor useragent refactoring 2014-07-03 17:12:25 +02:00
Bastian Kleineidam 687d27d534 Stripping should be done in normaliseUrl. 2014-06-08 10:12:33 +02:00
Bastian Kleineidam 4d63920434 Updated copyright. 2014-01-05 16:50:57 +01:00
Bastian Kleineidam df9a381ae4 Document getfp() function. 2013-12-08 11:46:26 +01:00
Bastian Kleineidam 03fff069ee Apply same file checks files as for image files. 2013-12-05 18:29:15 +01:00
Bastian Kleineidam 0eaf9a3139 Add text search in comic strips. 2013-11-29 20:26:49 +01:00
Bastian Kleineidam ebdc1e6359 More unicode output fixes. 2013-04-30 06:41:19 +02:00
Bastian Kleineidam c246b41d64 Code formatting. 2013-04-13 08:00:11 +02:00
Bastian Kleineidam 35c031ca81 Fixed some comics. 2013-04-11 18:27:43 +02:00
Bastian Kleineidam 190ffcd390 Use str() for robotparser. 2013-04-09 19:36:00 +02:00
Bastian Kleineidam b9dc385ff2 Implemented voting 2013-04-09 19:33:50 +02:00
Bastian Kleineidam 4528281ddd Voting part 2 2013-04-08 21:20:01 +02:00
Bastian Kleineidam 781bac0ca2 Feed text content instead of binary to robots.txt parser. 2013-04-07 18:11:29 +02:00
Bastian Kleineidam 0fbc005377 A Python3 fix. 2013-04-05 18:57:44 +02:00
Bastian Kleineidam 97522bc5ae Use tuples rather than lists. 2013-04-05 18:55:19 +02:00
Bastian Kleineidam adb31d84af Use HTMLParser.unescape instead of rolling our own function. 2013-04-05 18:53:19 +02:00
Bastian Kleineidam 6aa588860d Code cleanup 2013-04-05 06:36:05 +02:00
Bastian Kleineidam 460c5be689 Add POST support to urlopen(). 2013-04-04 18:30:02 +02:00
Bastian Kleineidam 0054ebfe0b Some Python3 fixes. 2013-04-03 20:32:43 +02:00
Bastian Kleineidam 2c0ca04882 Fix warning for scrapers with multiple image patterns. 2013-04-03 20:32:19 +02:00
Bastian Kleineidam 110a67cda4 Retry failed page content downloads (eg. timeouts). 2013-03-25 19:49:09 +01:00
Bastian Kleineidam 43f20270d0 Allow a list of regular expressions for image and previous link search. 2013-03-12 20:48:26 +01:00
Bastian Kleineidam 88e28f3923 Fix some comics and add language tag. 2013-03-08 22:33:05 +01:00
Bastian Kleineidam c13aa323d8 Code cleanup [ci skip] 2013-03-04 21:44:26 +01:00
Bastian Kleineidam 41c954b309 Another try on URL quoting. 2013-02-23 09:08:08 +01:00
Bastian Kleineidam d0c3492cc7 Catch robots.txt errors. 2013-02-21 19:48:04 +01:00