108 commits

Author SHA1 Message Date
Tobias Gruetzmacher f3b8ebf0be Clean up some minor warnings 2022-05-28 17:52:42 +02:00
Tobias Gruetzmacher e64635e86b Stricter style checking & related style fixes 2020-10-11 20:15:27 +02:00
Tobias Gruetzmacher e98a1601ca Remove workaround for libxml2 older than 2.9.3 (2015)
This workaround was written in 2016 while that version was still found
on many systems. Additionally, this workaround needs to be enabled by the
developer, who might not even be aware that they need to enable it for a
specific module. We still throw a warning to the user if running with
such an old libxml version.
2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher e34a0b539c Don't rethrow RequestException as IOError
Since RequestException already is an IOError, nothing of value is lost.
2020-09-28 12:05:01 +02:00
Tobias Gruetzmacher 27d28b8eef Update file headers
The default encoding for source files is UTF-8 since Python 3, so we can
drop all encoding headers. While we are at it, just replace them with
SPDX headers.
2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher 62c3540c28 Remove (useless) wrapper around html.unescape 2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher 44791439a5 Drop Python 2 support: Obsolete future statements 2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher 9c65c3e05f Drop Python 2 support: six & other imports 2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher 5a92505606 Fix & test query string parsing 2019-12-31 00:43:46 +01:00
Tobias Gruetzmacher e5e7dfacd6 Move basic HTTP setup into a new module
We now subclass requests' Session to make further extensions of the HTTP
flow possible.
2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher 00d0201c5f Fix a bunch of flake8 issues 2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher ac9d8db1e8 Make sure user agent is in all HTTP requests 2019-11-03 20:17:27 +01:00
Tobias Gruetzmacher 1d910a5bbd Remove pbr from runtime 2019-06-19 07:31:34 +02:00
Tobias Gruetzmacher fbb3a18c91 Enable warnings and fix some of them 2018-05-23 00:54:40 +02:00
sizlo 8d84361de4 Preserve the order we found images in when removing duplicate images 2017-04-18 21:58:12 +01:00
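One common way to remove duplicates while keeping first-seen order, as commit 8d84361de4 describes, is sketched below. This is only an illustration, not the project's actual code; the function name `unique_in_order` and the sample file names are hypothetical.

```python
def unique_in_order(urls):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys keep insertion order (guaranteed since Python 3.7),
    # so the first time an item is seen fixes its position.
    return list(dict.fromkeys(urls))


# Hypothetical image list with repeats.
images = ["b.png", "a.png", "b.png", "c.png", "a.png"]
print(unique_in_order(images))
```

A plain `set()` would also remove the duplicates, but it gives no guarantee about the order of the result.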
Tobias Gruetzmacher 3f9feec041 Allow modules to ignore some HTTP error codes.
This is necessary since it seems some webservers out there are
misconfigured to deliver actual content with an HTTP error code...
2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher a02660a7d3 Replace custom @memoized with stdlib @lru_cache. 2016-10-29 00:46:49 +02:00
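For readers unfamiliar with the stdlib decorator named in commit a02660a7d3, here is a minimal sketch of `functools.lru_cache` used as a memoizer. The function `expensive_lookup` and the `calls` list are hypothetical and exist only to make the caching visible.

```python
from functools import lru_cache

calls = []  # records every real (non-cached) invocation


@lru_cache(maxsize=None)  # unbounded cache, i.e. plain memoization
def expensive_lookup(key):
    calls.append(key)
    return key.upper()


expensive_lookup("xkcd")
expensive_lookup("xkcd")  # served from the cache, body does not run again
print(len(calls))
print(expensive_lookup.cache_info())
```

Arguments must be hashable for this to work, which is one practical difference from some hand-written memoizers.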
Tobias Gruetzmacher 9a6a310b76 Fixup copyright years. 2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher 4f80016bf0 Change robotparser import to make PyInstaller happy. 2016-06-06 22:42:01 +02:00
Tobias Gruetzmacher 64c8e502ca Ignore case for comic download directories.
Since we already match comics case-insensitively on the command line, this
was a logical step, even if this means changing quite a bit of code that
all tries to resolve the "comic directory" in a slightly different
way...
2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher efe1308db2 Replace home-grown Python2/3 compat. with six. 2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher 4204f5f1e4 Send "If-Modified-Since" header for images. 2016-04-19 00:36:50 +02:00
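The idea behind commit 4204f5f1e4 can be sketched as follows: the modification time of the local copy is formatted as an HTTP date and sent as "If-Modified-Since", so the server can answer 304 Not Modified instead of resending the image. The helper name `conditional_headers` is hypothetical, and how the project actually obtains the timestamp is not shown in this log.

```python
from email.utils import formatdate


def conditional_headers(local_mtime):
    """Build request headers for a file we may already have.

    local_mtime is a POSIX timestamp (e.g. from os.path.getmtime),
    or None when there is no local copy yet.
    """
    if local_mtime is None:
        return {}
    # usegmt=True yields the "GMT" suffix that HTTP dates require.
    return {"If-Modified-Since": formatdate(local_mtime, usegmt=True)}


print(conditional_headers(0))
```

The resulting dictionary could be passed as the `headers` argument of an HTTP client call; a 304 response then means the local file is still current.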
Tobias Gruetzmacher 9028724a74 Clean up update helper scripts. 2016-04-13 00:52:16 +02:00
Tobias Gruetzmacher 8768ff07b6 Fix AhoiPolloi, be a bit smarter about encoding.
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I did get it right this time. But I
think the current behaviour matches best what web browsers try to do:

1. Let Requests figure out the encoding from the HTTP header. This
   overrides everything else. We need to "trick" LXML to accept our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
2016-04-06 22:22:22 +02:00
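The two-step decision described in commit 8768ff07b6 can be sketched with the standard library alone. This is an assumption-laden illustration, not the project's code: the helper names are hypothetical, the charset is read from a raw Content-Type value rather than through Requests, and the XML-declaration "trick" for LXML is not covered.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def prepare_for_parser(body, content_type):
    """Decide what to hand to the HTML parser.

    Step 1: a charset in the HTTP header wins, so decode to text here.
    Step 2: without one, return the raw bytes and let the parser guess.
    """
    charset = charset_from_content_type(content_type)
    if charset:
        try:
            return body.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall through to parser guessing.
            pass
    return body


print(type(prepare_for_parser(b"<p>ok</p>", "text/html; charset=utf-8")))
print(type(prepare_for_parser(b"<p>ok</p>", "text/html")))
```

Returning `str` in one case and `bytes` in the other mirrors the split in the commit message: decoded text means the decision has been made, raw bytes leave it to the parser.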
Tobias Gruetzmacher 6727e9b559 Use vendored urllib3.
As long as requests ships with urllib3, we can't fall back to the
"system" urllib3, since that breaks class-identity checks.
2016-03-16 23:18:19 +01:00
Tobias Gruetzmacher c4fcd985dd Let urllib3 handle all retries. 2016-03-13 21:30:36 +01:00
Johannes Schöpp 351fa7154e Modified maximum page size
Fixes #36
2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher 10d9eac574 Remove support for very old versions of "requests". 2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher 68d4dd463a Revert robots.txt handling.
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.

Rationale: Dosage is not a "robot" in the classical sense. It's not
designed to spider a huge number of websites in search of some content
to index, it's only intended to help users keep a personal archive of
comics they are interested in. We try very hard to never download any image
twice. This fixes #24.

(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots)
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher 7c15ea50d8 Also check robots.txt on image downloads.
We DO want to honour it if images are blocked by robots.txt.
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher 5affd8af68 More relaxed robots.txt handling.
This is in line with how Perl's LWP::RobotUA and Google handle server
errors when fetching robots.txt: Just assume access is allowed.

See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
2015-07-15 19:11:55 +02:00
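The relaxed behaviour from commit 5affd8af68 might look roughly like the sketch below, built on the stdlib `urllib.robotparser`. The function name `parser_for_response` is hypothetical, and treating every non-200 status as "allowed" is an assumption made for brevity; the commit message itself only mentions server errors.

```python
from urllib.robotparser import RobotFileParser


def parser_for_response(status_code, body_text):
    """Build a robots.txt parser from an already-fetched response."""
    parser = RobotFileParser()
    if status_code == 200:
        parser.parse(body_text.splitlines())
    else:
        # Fetch failed: parse an empty rule set, which permits everything.
        parser.parse([])
    return parser


failed = parser_for_response(503, "")
print(failed.can_fetch("Dosage", "http://example.com/comic"))

rules = parser_for_response(200, "User-agent: *\nDisallow: /private/")
print(rules.can_fetch("Dosage", "http://example.com/private/page"))
```

Parsing an empty list keeps the sketch within the documented `parse()` and `can_fetch()` methods instead of setting parser attributes directly.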
Tobias Gruetzmacher 86b31dc12b Depend on pycountry directly. 2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher 5934f03453 Merge branch 'htmlparser' - I think it's ready.
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher 17bc454132 Bugfix: Don't assume RE patterns in base class. 2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher 3235b8b312 Pass unicode strings to lxml.
This reverts commit fcde86e9c0 & some
more. This lets python-requests do all the encoding stuff and leaves
LXML with (hopefully) clean unicode HTML to parse.
2014-10-13 19:39:48 +02:00
Bastian Kleineidam e43694c156 Don't crash on multiple HTML output runs per day. 2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher fcde86e9c0 Change getPageContent to (optionally) return raw text.
This allows LXML to do its own "magic" encoding detection
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher 0e03eca8f0 Move all regular expression operations into the new class.
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam 3a929ceea6 Allow comic text to be optional. Patch from TobiX 2014-07-24 20:49:57 +02:00
Bastian Kleineidam 93fe5d5987 Minor useragent refactoring 2014-07-03 17:12:25 +02:00
Bastian Kleineidam 687d27d534 Stripping should be done in normaliseUrl. 2014-06-08 10:12:33 +02:00
Bastian Kleineidam 4d63920434 Updated copyright. 2014-01-05 16:50:57 +01:00
Bastian Kleineidam df9a381ae4 Document getfp() function. 2013-12-08 11:46:26 +01:00
Bastian Kleineidam 03fff069ee Apply same file checks as for image files. 2013-12-05 18:29:15 +01:00
Bastian Kleineidam 0eaf9a3139 Add text search in comic strips. 2013-11-29 20:26:49 +01:00
Bastian Kleineidam ebdc1e6359 More unicode output fixes. 2013-04-30 06:41:19 +02:00
Bastian Kleineidam c246b41d64 Code formatting. 2013-04-13 08:00:11 +02:00
Bastian Kleineidam 35c031ca81 Fixed some comics. 2013-04-11 18:27:43 +02:00
Bastian Kleineidam 190ffcd390 Use str() for robotparser. 2013-04-09 19:36:00 +02:00
Bastian Kleineidam b9dc385ff2 Implemented voting 2013-04-09 19:33:50 +02:00