Commit graph

88 commits

Author SHA1 Message Date
Tobias Gruetzmacher
efe1308db2 Replace home-grown Python2/3 compat. with six. 2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher
4204f5f1e4 Send "If-Modified-Since" header for images. 2016-04-19 00:36:50 +02:00
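
    A minimal sketch of the conditional download this enables, assuming the
    Last-Modified value from the previous fetch was stored somewhere
    (illustrative names, not the actual Dosage code):

    import requests

    def fetch_image(url, last_modified=None):
        """Download an image, skipping the body if the server says it is unchanged."""
        headers = {}
        if last_modified:
            # Echo back the Last-Modified value from the previous successful download.
            headers['If-Modified-Since'] = last_modified
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            # Not Modified: keep the locally stored copy.
            return None, last_modified
        resp.raise_for_status()
        return resp.content, resp.headers.get('Last-Modified')
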
Tobias Gruetzmacher
9028724a74 Clean up update helper scripts. 2016-04-13 00:52:16 +02:00
Tobias Gruetzmacher
8768ff07b6 Fix AhoiPolloi, be a bit smarter about encoding.
HTML character encoding in the context of HTTP is quite tricky to get
right and honestly, I'm not sure if I did get it right this time. But I
think the current behaviour best matches what web browsers try to do:

1. Let Requests figure out the encoding from the HTTP header. This
   overrides everything else. We need to "trick" LXML to accept our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
2016-04-06 22:22:22 +02:00
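
    A hedged sketch of the two-step decision above, using requests and lxml;
    the function name is illustrative and the parser trick is only one way to
    implement the override mentioned in step 1, not necessarily Dosage's:

    import lxml.html
    import requests

    def parse_page(url):
        resp = requests.get(url)
        if 'charset' in resp.headers.get('Content-Type', '').lower():
            # 1. The HTTP header names an encoding: fix the parser to it so a
            #    conflicting XML declaration inside the document is ignored.
            parser = lxml.html.HTMLParser(encoding=resp.encoding)
            return lxml.html.document_fromstring(resp.content, parser=parser)
        # 2. No encoding in the HTTP headers: hand lxml the raw bytes and let
        #    it guess from the document itself.
        return lxml.html.document_fromstring(resp.content)
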
Tobias Gruetzmacher
6727e9b559 Use vendored urllib3.
As long as requests ships with urllib3, we can't fall back to the
"system" urllib3, since that breaks class-identity checks.
2016-03-16 23:18:19 +01:00
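
    For illustration, "vendored" here means importing the copy bundled inside
    requests rather than a separately installed urllib3, so that isinstance()
    checks on exceptions raised from within requests keep working:

    # Use the urllib3 that ships inside requests, not the system-wide package.
    from requests.packages import urllib3

    # A separately imported "system" urllib3 would define different class
    # objects, so e.g. isinstance(exc, urllib3.exceptions.HTTPError) could
    # fail for exceptions raised inside requests.
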
Tobias Gruetzmacher
c4fcd985dd Let urllib3 handle all retries. 2016-03-13 21:30:36 +01:00
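
    A minimal sketch of delegating retries to urllib3 through a mounted
    adapter; the retry parameters are illustrative, not Dosage's actual values:

    import requests
    from requests.adapters import HTTPAdapter
    from requests.packages.urllib3.util.retry import Retry

    session = requests.Session()
    # urllib3 retries connection problems and selected status codes itself,
    # so the calling code needs no retry loop of its own.
    retries = Retry(total=3, backoff_factor=0.5,
                    status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
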
Johannes Schöpp
351fa7154e Modified maximum page size
Fixes #36
2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher
10d9eac574 Remove support for very old versions of "requests". 2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher
68d4dd463a Revert robots.txt handling.
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.

Rationale: Dosage is not a "robot" in the classical sense. It's not
designed to spider huge numbers of web sites in search of content
to index; it's only intended to help users keep a personal archive of
the comics they are interested in. We try very hard to never download any image
twice. This fixes #24.

(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots)
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher
7c15ea50d8 Also check robots.txt on image downloads.
We DO want to honour robots.txt when it blocks image downloads.
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher
5affd8af68 More relaxed robots.txt handling.
This is in line with how Perl's LWP::RobotUA and Google handle server
errors when fetching robots.txt: Just assume access is allowed.

See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
2015-07-15 19:11:55 +02:00
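
    A sketch of this "assume allowed on errors" policy with the standard
    library robot parser (the helper name is illustrative):

    import requests
    from urllib import robotparser  # "robotparser" on Python 2

    def allowed_by_robots(robots_url, useragent, url):
        """Consult robots.txt, treating any fetch failure as permission granted."""
        try:
            resp = requests.get(robots_url)
        except requests.RequestException:
            return True  # could not reach the server at all: assume allowed
        if resp.status_code >= 400:
            # Server errors (5xx) and missing files (4xx) give us no usable
            # robots.txt, so assume access is allowed.
            return True
        parser = robotparser.RobotFileParser()
        parser.parse(resp.text.splitlines())
        return parser.can_fetch(useragent, url)
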
Tobias Gruetzmacher
86b31dc12b Depend on pycountry directly. 2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher
5934f03453 Merge branch 'htmlparser' - I think it's ready.
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher
17bc454132 Bugfix: Don't assume RE patterns in base class. 2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher
3235b8b312 Pass unicode strings to lxml.
This reverts commit fcde86e9c0 & some
more. This lets python-requests do all the encoding stuff and leaves
LXML with (hopefully) clean unicode HTML to parse.
2014-10-13 19:39:48 +02:00
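
    Sketched, the division of labour after this change: requests does all of
    the decoding, lxml only parses already-decoded text (the URL is just an
    example):

    import lxml.html
    import requests

    resp = requests.get('http://example.com/comic/')
    # resp.text is a unicode string; requests picked the encoding.
    doc = lxml.html.document_fromstring(resp.text)
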
Bastian Kleineidam
e43694c156 Don't crash on multiple HTML output runs per day. 2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher
fcde86e9c0 Change getPageContent to (optionally) return raw text.
This allows LXML to do its own "magic" encoding detection.
2014-07-26 11:28:43 +02:00
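
    For contrast, the (later reverted) approach of this commit, sketched:
    pass the undecoded bytes so lxml runs its own encoding detection on
    <meta> tags and byte-order marks:

    import lxml.html
    import requests

    resp = requests.get('http://example.com/comic/')
    # resp.content is raw bytes; lxml decides how to decode them.
    doc = lxml.html.document_fromstring(resp.content)
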
Tobias Gruetzmacher
0e03eca8f0 Move all regular expression operation into the new class.
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam
3a929ceea6 Allow comic text to be optional. Patch from TobiX. 2014-07-24 20:49:57 +02:00
Bastian Kleineidam
93fe5d5987 Minor useragent refactoring 2014-07-03 17:12:25 +02:00
Bastian Kleineidam
687d27d534 Stripping should be done in normaliseUrl. 2014-06-08 10:12:33 +02:00
Bastian Kleineidam
4d63920434 Updated copyright. 2014-01-05 16:50:57 +01:00
Bastian Kleineidam
df9a381ae4 Document getfp() function. 2013-12-08 11:46:26 +01:00
Bastian Kleineidam
03fff069ee Apply the same file checks as for image files. 2013-12-05 18:29:15 +01:00
Bastian Kleineidam
0eaf9a3139 Add text search in comic strips. 2013-11-29 20:26:49 +01:00
Bastian Kleineidam
ebdc1e6359 More unicode output fixes. 2013-04-30 06:41:19 +02:00
Bastian Kleineidam
c246b41d64 Code formatting. 2013-04-13 08:00:11 +02:00
Bastian Kleineidam
35c031ca81 Fixed some comics. 2013-04-11 18:27:43 +02:00
Bastian Kleineidam
190ffcd390 Use str() for robotparser. 2013-04-09 19:36:00 +02:00
Bastian Kleineidam
b9dc385ff2 Implemented voting 2013-04-09 19:33:50 +02:00
Bastian Kleineidam
4528281ddd Voting part 2 2013-04-08 21:20:01 +02:00
Bastian Kleineidam
781bac0ca2 Feed text content instead of binary to robots.txt parser. 2013-04-07 18:11:29 +02:00
Bastian Kleineidam
0fbc005377 A Python3 fix. 2013-04-05 18:57:44 +02:00
Bastian Kleineidam
97522bc5ae Use tuples rather than lists. 2013-04-05 18:55:19 +02:00
Bastian Kleineidam
adb31d84af Use HTMLParser.unescape instead of rolling our own function. 2013-04-05 18:53:19 +02:00
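
    For reference, a sketch of the standard-library call this switches to;
    the sample string is made up, and on Python 3 the method lives on
    html.parser.HTMLParser (removed in 3.9):

    try:
        from HTMLParser import HTMLParser   # Python 2
    except ImportError:
        from html.parser import HTMLParser  # Python 3

    # Resolves named and numeric character references in one call.
    text = HTMLParser().unescape('Fish &amp; Chips &#8211; &quot;daily&quot;')
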
Bastian Kleineidam
6aa588860d Code cleanup 2013-04-05 06:36:05 +02:00
Bastian Kleineidam
460c5be689 Add POST support to urlopen(). 2013-04-04 18:30:02 +02:00
Bastian Kleineidam
0054ebfe0b Some Python3 fixes. 2013-04-03 20:32:43 +02:00
Bastian Kleineidam
2c0ca04882 Fix warning for scrapers with multiple image patterns. 2013-04-03 20:32:19 +02:00
Bastian Kleineidam
110a67cda4 Retry failed page content downloads (e.g. timeouts). 2013-03-25 19:49:09 +01:00
Bastian Kleineidam
43f20270d0 Allow a list of regular expressions for image and previous link search. 2013-03-12 20:48:26 +01:00
Bastian Kleineidam
88e28f3923 Fix some comics and add language tag. 2013-03-08 22:33:05 +01:00
Bastian Kleineidam
c13aa323d8 Code cleanup [ci skip] 2013-03-04 21:44:26 +01:00
Bastian Kleineidam
41c954b309 Another attempt at URL quoting. 2013-02-23 09:08:08 +01:00
Bastian Kleineidam
d0c3492cc7 Catch robots.txt errors. 2013-02-21 19:48:04 +01:00
Bastian Kleineidam
be1694592e Do not stream page content URLs. 2013-02-18 20:38:59 +01:00
Bastian Kleineidam
96bf9ef523 Recognize internal server errors. 2013-02-13 17:54:10 +01:00
Bastian Kleineidam
f16e860f1e Only cache robots.txt URL on memoize. 2013-02-13 17:52:07 +01:00
Bastian Kleineidam
10f6a1caa1 Correct path quoting. 2013-02-12 17:55:33 +01:00
Bastian Kleineidam
6d0fffd825 Always use connection pooling. 2013-02-12 17:55:13 +01:00