dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	e98a1601ca	Remove workaround for libxml2 older than 2.9.3 (2015) This workaround was written in 2016 while that version was still found on many systems. Additionally, this workaround needs to be enabled by the developer, who might not even be aware that they need to enable it for a specific module. We still throw a warning to the user if running with such an old libxml version.	2020-09-29 21:16:48 +02:00
Tobias Gruetzmacher	e34a0b539c	Don't rethrow RequestException as IOError Since RequestException already is an IOError, nothing of value is lost.	2020-09-28 12:05:01 +02:00
Tobias Gruetzmacher	27d28b8eef	Update file headers The default encoding for source files is UTF-8 since Python 3, so we can drop all encoding headers. While we are at it, just replace them with SPDX headers.	2020-04-18 13:45:44 +02:00
Tobias Gruetzmacher	62c3540c28	Remove (useless) wrapper around html.unescape	2020-04-13 01:53:45 +02:00
Tobias Gruetzmacher	44791439a5	Drop Python 2 support: Obsolete future statements	2020-02-04 01:06:19 +01:00
Tobias Gruetzmacher	9c65c3e05f	Drop Python 2 support: six & other imports	2020-02-03 01:03:31 +01:00
Tobias Gruetzmacher	5a92505606	Fix & test query string parsing	2019-12-31 00:43:46 +01:00
Tobias Gruetzmacher	e5e7dfacd6	Move basic HTTP setup into a new module We now subclass requests' Session to make further extensions of the HTTP flow possible.	2019-12-03 23:58:20 +01:00
Tobias Gruetzmacher	00d0201c5f	Fix a bunch of flake8 issues	2019-11-04 00:16:25 +01:00
Tobias Gruetzmacher	ac9d8db1e8	Make sure user agent is in all HTTP requests	2019-11-03 20:17:27 +01:00
Tobias Gruetzmacher	1d910a5bbd	Remove pbr from runtime	2019-06-19 07:31:34 +02:00
Tobias Gruetzmacher	fbb3a18c91	Enable warnings and fix some of them	2018-05-23 00:54:40 +02:00
sizlo	8d84361de4	Preserve the order we found images in when removing duplicate images	2017-04-18 21:58:12 +01:00
Tobias Gruetzmacher	3f9feec041	Allow modules to ignore some HTTP error codes. This is necessary since it seems some webservers out there are misconfigured to deliver actual content with an HTTP error code...	2016-11-01 18:25:02 +01:00
Tobias Gruetzmacher	a02660a7d3	Replace custom @memoized with stdlib @lru_cache.	2016-10-29 00:46:49 +02:00
Tobias Gruetzmacher	9a6a310b76	Fixup copyright years.	2016-10-29 00:21:41 +02:00
Tobias Gruetzmacher	4f80016bf0	Change robotparser import to make PyInstaller happy.	2016-06-06 22:42:01 +02:00
Tobias Gruetzmacher	64c8e502ca	Ignore case for comic download directories. Since we already match comics case-insensitive on the command line, this was a logical step, even if this means changing quite a bit of code that all tries to resolve the "comic directory" in a slightly different way...	2016-06-06 00:08:29 +02:00
Tobias Gruetzmacher	efe1308db2	Replace home-grown Python2/3 compat. with six.	2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher	4204f5f1e4	Send "If-Modified-Since" header for images.	2016-04-19 00:36:50 +02:00
Tobias Gruetzmacher	9028724a74	Clean up update helper scripts.	2016-04-13 00:52:16 +02:00
Tobias Gruetzmacher	8768ff07b6	Fix AhoiPolloi, be a bit smarter about encoding. HTML character encoding in the context of HTTP is quite tricky to get right and honestly, I'm not sure if I did get it right this time. But I think, the current behaviour matches best what web browsers try to do: 1. Let Requests figure out the content from the HTTP header. This overrides everything else. We need to "trick" LXML to accept our decision if the document contains an XML declaration which might disagree with the HTTP header. 2. If the HTTP headers don't specify any encoding, let LXML guess the encoding and be done with it.	2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher	6727e9b559	Use vendored urllib3. As long as requests ships with urllib3, we can't fall back to the "system" urllib3, since that breaks class-identity checks.	2016-03-16 23:18:19 +01:00
Tobias Gruetzmacher	c4fcd985dd	Let urllib3 handle all retries.	2016-03-13 21:30:36 +01:00
Johannes Schöpp	351fa7154e	Modified maximum page size Fixes #36	2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher	10d9eac574	Remove support for very old versions of "requests".	2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher	68d4dd463a	Revert robots.txt handling. This brings us back to only honouring robots.txt on page downloads, not on image downloads. Rationale: Dosage is not a "robot" in the classical sense. It's not designed to spider huge amounts of web sites in search of some content to index, it's only intended to help users keep a personal archive of comics they are interested in. We try very hard to never download any image twice. This fixes #24. (Precedent for this rationale: Google Feedfetcher: https://support.google.com/webmasters/answer/178852?hl=en#robots)	2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher	7c15ea50d8	Also check robots.txt on image downloads. We DO want to honour if images are blocked by robots.txt	2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher	5affd8af68	More relaxed robots.txt handling. This is in line with how Perl's LWP::RobotUA and Google handle server errors when fetching robots.txt: Just assume access is allowed. See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt	2015-07-15 19:11:55 +02:00
Tobias Gruetzmacher	86b31dc12b	Depend on pycountry directly.	2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher	5934f03453	Merge branch 'htmlparser' - I think it's ready. This closes pull request #70.	2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
Bastian Kleineidam	e43694c156	Don't crash on multiple HTML output runs per day.	2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher	fcde86e9c0	Change getPageContent to (optionally) return raw text. This allows LXML to do its own "magic" encoding detection	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	0e03eca8f0	Move all regular expression operation into the new class. - Move fetchUrls, fetchUrl and fetchText. - Move base URL handling.	2014-07-26 11:28:43 +02:00
Bastian Kleineidam	3a929ceea6	Allow comic text to be optional. Patch from TobiX	2014-07-24 20:49:57 +02:00
Bastian Kleineidam	93fe5d5987	Minor useragent refactoring	2014-07-03 17:12:25 +02:00
Bastian Kleineidam	687d27d534	Stripping should be done in normaliseUrl.	2014-06-08 10:12:33 +02:00
Bastian Kleineidam	4d63920434	Updated copyright.	2014-01-05 16:50:57 +01:00
Bastian Kleineidam	df9a381ae4	Document getfp() function.	2013-12-08 11:46:26 +01:00
Bastian Kleineidam	03fff069ee	Apply same file checks files as for image files.	2013-12-05 18:29:15 +01:00
Bastian Kleineidam	0eaf9a3139	Add text search in comic strips.	2013-11-29 20:26:49 +01:00
Bastian Kleineidam	ebdc1e6359	More unicode output fixes.	2013-04-30 06:41:19 +02:00
Bastian Kleineidam	c246b41d64	Code formatting.	2013-04-13 08:00:11 +02:00
Bastian Kleineidam	35c031ca81	Fixed some comics.	2013-04-11 18:27:43 +02:00
Bastian Kleineidam	190ffcd390	Use str() for robotparser.	2013-04-09 19:36:00 +02:00
Bastian Kleineidam	b9dc385ff2	Implemented voting	2013-04-09 19:33:50 +02:00
Bastian Kleineidam	4528281ddd	Voting part 2	2013-04-08 21:20:01 +02:00
Bastian Kleineidam	781bac0ca2	Feed text content instead of binary to robots.txt parser.	2013-04-07 18:11:29 +02:00
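
The encoding rule described in commit 8768ff07b6 (an encoding named in the HTTP header wins; otherwise the HTML parser guesses) can be sketched roughly as below. This is a standard-library illustration only, not the code from that commit: the function names charset_from_header and prepare_for_parser are hypothetical, and the extra step of making LXML accept the decision when an XML declaration disagrees is not covered.

```python
from email.message import Message
from typing import Optional, Union


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type header value, or None.

    Hypothetical helper; it reuses the stdlib MIME header parser.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def prepare_for_parser(body: bytes, content_type: Optional[str]) -> Union[str, bytes]:
    """Decide what to hand to the HTML parser.

    1. If the HTTP header names an encoding, decode with it (header wins).
    2. Otherwise return the raw bytes so the parser can guess the encoding.
    """
    charset = charset_from_header(content_type)
    if charset is None:
        return body
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown codec name in the header: fall back to parser guessing.
        return body
```

Returning bytes in the second case is a deliberate choice in this sketch: a parser that receives bytes can still inspect meta tags or an XML declaration, which it cannot usefully do once the text has been decoded with a wrong guess.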


106 commits