81 commits

Author SHA1 Message Date
Tobias Gruetzmacher 10d9eac574 Remove support for very old versions of "requests". 2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher 68d4dd463a Revert robots.txt handling.
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.

Rationale: Dosage is not a "robot" in the classical sense. It's not
designed to spider huge amounts of web sites in search for some content
to index, it's only intended to help users keep a personal archive of
comics they are interested in. We try very hard to never download any image
twice. This fixes #24.

(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots)
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher 7c15ea50d8 Also check robots.txt on image downloads.
We DO want to honour if images are blocked by robots.txt
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher 5affd8af68 More relaxed robots.txt handling.
This is in line with how Perl's LWP::RobotUA and Google handle server
errors when fetching robots.txt: Just assume access is allowed.

See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
2015-07-15 19:11:55 +02:00
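
The relaxed policy described in commit 5affd8af68 can be sketched roughly as follows. This is a minimal illustration built on Python's standard urllib.robotparser, not the code from the commit itself: the function name parse_robots, its parameters, and the example.com URLs are all hypothetical, and the sketch assumes the robots.txt response has already been fetched elsewhere. It also treats every non-200 status as "allowed", which is a simplification of the "server errors" case named in the message.

```python
from urllib import robotparser


def parse_robots(status_code, body, robots_url="https://example.com/robots.txt"):
    """Build a robots.txt parser from an already-fetched response.

    Hypothetical helper: status_code and body stand in for whatever the
    HTTP client returned when it requested robots_url.
    """
    rp = robotparser.RobotFileParser(robots_url)
    if status_code == 200:
        # Normal case: apply the rules the site published.
        rp.parse(body.splitlines())
    else:
        # Relaxed handling (assumption: any non-200 status counts as a
        # failure): if robots.txt could not be retrieved, assume access
        # is allowed instead of blocking every URL.
        rp.allow_all = True
    return rp


# A server error while fetching robots.txt does not block downloads.
failed = parse_robots(500, "")
print(failed.can_fetch("Dosage", "https://example.com/comic/1"))

# A successfully fetched robots.txt is still honoured.
rules = parse_robots(200, "User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("Dosage", "https://example.com/private/page"))
```

Without the allow_all branch, a RobotFileParser that never parsed any rules refuses every URL, so a temporarily broken robots.txt would stop all page downloads for that site.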
Tobias Gruetzmacher 86b31dc12b Depend on pycountry directly. 2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher 5934f03453 Merge branch 'htmlparser' - I think it's ready.
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher 17bc454132 Bugfix: Don't assume RE patterns in base class. 2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher 3235b8b312 Pass unicode strings to lxml.
This reverts commit fcde86e9c0 & some
more. This lets python-requests do all the encoding stuff and leaves
LXML with (hopefully) clean unicode HTML to parse.
2014-10-13 19:39:48 +02:00
Bastian Kleineidam e43694c156 Don't crash on multiple HTML output runs per day. 2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher fcde86e9c0 Change getPageContent to (optionally) return raw text.
This allows LXML to do its own "magic" encoding detection
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher 0e03eca8f0 Move all regular expression operation into the new class.
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam 3a929ceea6 Allow comic text to be optional. Patch from TobiX 2014-07-24 20:49:57 +02:00
Bastian Kleineidam 93fe5d5987 Minor useragent refactoring 2014-07-03 17:12:25 +02:00
Bastian Kleineidam 687d27d534 Stripping should be done in normaliseUrl. 2014-06-08 10:12:33 +02:00
Bastian Kleineidam 4d63920434 Updated copyright. 2014-01-05 16:50:57 +01:00
Bastian Kleineidam df9a381ae4 Document getfp() function. 2013-12-08 11:46:26 +01:00
Bastian Kleineidam 03fff069ee Apply same file checks files as for image files. 2013-12-05 18:29:15 +01:00
Bastian Kleineidam 0eaf9a3139 Add text search in comic strips. 2013-11-29 20:26:49 +01:00
Bastian Kleineidam ebdc1e6359 More unicode output fixes. 2013-04-30 06:41:19 +02:00
Bastian Kleineidam c246b41d64 Code formatting. 2013-04-13 08:00:11 +02:00
Bastian Kleineidam 35c031ca81 Fixed some comics. 2013-04-11 18:27:43 +02:00
Bastian Kleineidam 190ffcd390 Use str() for robotparser. 2013-04-09 19:36:00 +02:00
Bastian Kleineidam b9dc385ff2 Implemented voting 2013-04-09 19:33:50 +02:00
Bastian Kleineidam 4528281ddd Voting part 2 2013-04-08 21:20:01 +02:00
Bastian Kleineidam 781bac0ca2 Feed text content instead of binary to robots.txt parser. 2013-04-07 18:11:29 +02:00
Bastian Kleineidam 0fbc005377 A Python3 fix. 2013-04-05 18:57:44 +02:00
Bastian Kleineidam 97522bc5ae Use tuples rather than lists. 2013-04-05 18:55:19 +02:00
Bastian Kleineidam adb31d84af Use HTMLParser.unescape instead of rolling our own function. 2013-04-05 18:53:19 +02:00
Bastian Kleineidam 6aa588860d Code cleanup 2013-04-05 06:36:05 +02:00
Bastian Kleineidam 460c5be689 Add POST support to urlopen(). 2013-04-04 18:30:02 +02:00
Bastian Kleineidam 0054ebfe0b Some Python3 fixes. 2013-04-03 20:32:43 +02:00
Bastian Kleineidam 2c0ca04882 Fix warning for scrapers with multiple image patterns. 2013-04-03 20:32:19 +02:00
Bastian Kleineidam 110a67cda4 Retry failed page content downloads (e.g. timeouts). 2013-03-25 19:49:09 +01:00
Bastian Kleineidam 43f20270d0 Allow a list of regular expressions for image and previous link search. 2013-03-12 20:48:26 +01:00
Bastian Kleineidam 88e28f3923 Fix some comics and add language tag. 2013-03-08 22:33:05 +01:00
Bastian Kleineidam c13aa323d8 Code cleanup [ci skip] 2013-03-04 21:44:26 +01:00
Bastian Kleineidam 41c954b309 Another try on URL quoting. 2013-02-23 09:08:08 +01:00
Bastian Kleineidam d0c3492cc7 Catch robots.txt errors. 2013-02-21 19:48:04 +01:00
Bastian Kleineidam be1694592e Do not stream page content URLs. 2013-02-18 20:38:59 +01:00
Bastian Kleineidam 96bf9ef523 Recognize internal server errors. 2013-02-13 17:54:10 +01:00
Bastian Kleineidam f16e860f1e Only cache robots.txt URL on memoize. 2013-02-13 17:52:07 +01:00
Bastian Kleineidam 10f6a1caa1 Correct path quoting. 2013-02-12 17:55:33 +01:00
Bastian Kleineidam 6d0fffd825 Always use connection pooling. 2013-02-12 17:55:13 +01:00
Bastian Kleineidam a35c54525d Work around a bug in python requests. 2013-02-11 19:52:59 +01:00
Bastian Kleineidam 14f0a6fe78 Do not prefetch content with requests >= 1.0 2013-02-11 19:45:21 +01:00
Bastian Kleineidam 67836942d8 Simplify the fetchUrl code. 2013-02-11 19:43:46 +01:00
Bastian Kleineidam 1a0cd1ee6b Print HTTP client headers. 2013-02-07 18:28:56 +01:00
Bastian Kleineidam 73700e66f0 Cleanup 2013-01-24 21:42:27 +01:00
Bastian Kleineidam f1356a9ff8 Fix URL norming, See issue #2. 2013-01-23 21:16:22 +01:00
Bastian Kleineidam 5479627d86 Updated copyright. 2013-01-09 22:21:19 +01:00