dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	10d9eac574	Remove support for very old versions of "requests".	2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher	68d4dd463a	Revert robots.txt handling. This brings us back to only honouring robots.txt on page downloads, not on image downloads. Rationale: Dosage is not a "robot" in the classical sense. It's not designed to spider huge amounts of web sites in search of some content to index, it's only intended to help users keep a personal archive of comics they are interested in. We try very hard to never download any image twice. This fixes #24. (Precedent for this rationale: Google Feedfetcher: https://support.google.com/webmasters/answer/178852?hl=en#robots)	2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher	7c15ea50d8	Also check robots.txt on image downloads. We DO want to honour if images are blocked by robots.txt	2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher	5affd8af68	More relaxed robots.txt handling. This is in line with how Perl's LWP::RobotUA and Google handle server errors when fetching robots.txt: Just assume access is allowed. See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt	2015-07-15 19:11:55 +02:00
Tobias Gruetzmacher	86b31dc12b	Depend on pycountry directly.	2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher	5934f03453	Merge branch 'htmlparser' - I think it's ready. This closes pull request #70.	2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
Bastian Kleineidam	e43694c156	Don't crash on multiple HTML output runs per day.	2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher	fcde86e9c0	Change getPageContent to (optionally) return raw text. This allows LXML to do its own "magic" encoding detection	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	0e03eca8f0	Move all regular expression operation into the new class. - Move fetchUrls, fetchUrl and fetchText. - Move base URL handling.	2014-07-26 11:28:43 +02:00
Bastian Kleineidam	3a929ceea6	Allow comic text to be optional. Patch from TobiX	2014-07-24 20:49:57 +02:00
Bastian Kleineidam	93fe5d5987	Minor useragent refactoring	2014-07-03 17:12:25 +02:00
Bastian Kleineidam	687d27d534	Stripping should be done in normaliseUrl.	2014-06-08 10:12:33 +02:00
Bastian Kleineidam	4d63920434	Updated copyright.	2014-01-05 16:50:57 +01:00
Bastian Kleineidam	df9a381ae4	Document getfp() function.	2013-12-08 11:46:26 +01:00
Bastian Kleineidam	03fff069ee	Apply same file checks as for image files.	2013-12-05 18:29:15 +01:00
Bastian Kleineidam	0eaf9a3139	Add text search in comic strips.	2013-11-29 20:26:49 +01:00
Bastian Kleineidam	ebdc1e6359	More unicode output fixes.	2013-04-30 06:41:19 +02:00
Bastian Kleineidam	c246b41d64	Code formatting.	2013-04-13 08:00:11 +02:00
Bastian Kleineidam	35c031ca81	Fixed some comics.	2013-04-11 18:27:43 +02:00
Bastian Kleineidam	190ffcd390	Use str() for robotparser.	2013-04-09 19:36:00 +02:00
Bastian Kleineidam	b9dc385ff2	Implemented voting	2013-04-09 19:33:50 +02:00
Bastian Kleineidam	4528281ddd	Voting part 2	2013-04-08 21:20:01 +02:00
Bastian Kleineidam	781bac0ca2	Feed text content instead of binary to robots.txt parser.	2013-04-07 18:11:29 +02:00
Bastian Kleineidam	0fbc005377	A Python3 fix.	2013-04-05 18:57:44 +02:00
Bastian Kleineidam	97522bc5ae	Use tuples rather than lists.	2013-04-05 18:55:19 +02:00
Bastian Kleineidam	adb31d84af	Use HTMLParser.unescape instead of rolling our own function.	2013-04-05 18:53:19 +02:00
Bastian Kleineidam	6aa588860d	Code cleanup	2013-04-05 06:36:05 +02:00
Bastian Kleineidam	460c5be689	Add POST support to urlopen().	2013-04-04 18:30:02 +02:00
Bastian Kleineidam	0054ebfe0b	Some Python3 fixes.	2013-04-03 20:32:43 +02:00
Bastian Kleineidam	2c0ca04882	Fix warning for scrapers with multiple image patterns.	2013-04-03 20:32:19 +02:00
Bastian Kleineidam	110a67cda4	Retry failed page content downloads (eg. timeouts).	2013-03-25 19:49:09 +01:00
Bastian Kleineidam	43f20270d0	Allow a list of regular expressions for image and previous link search.	2013-03-12 20:48:26 +01:00
Bastian Kleineidam	88e28f3923	Fix some comics and add language tag.	2013-03-08 22:33:05 +01:00
Bastian Kleineidam	c13aa323d8	Code cleanup [ci skip]	2013-03-04 21:44:26 +01:00
Bastian Kleineidam	41c954b309	Another try on URL quoting.	2013-02-23 09:08:08 +01:00
Bastian Kleineidam	d0c3492cc7	Catch robots.txt errors.	2013-02-21 19:48:04 +01:00
Bastian Kleineidam	be1694592e	Do not stream page content URLs.	2013-02-18 20:38:59 +01:00
Bastian Kleineidam	96bf9ef523	Recognize internal server errors.	2013-02-13 17:54:10 +01:00
Bastian Kleineidam	f16e860f1e	Only cache robots.txt URL on memoize.	2013-02-13 17:52:07 +01:00
Bastian Kleineidam	10f6a1caa1	Correct path quoting.	2013-02-12 17:55:33 +01:00
Bastian Kleineidam	6d0fffd825	Always use connection pooling.	2013-02-12 17:55:13 +01:00
Bastian Kleineidam	a35c54525d	Work around a bug in python requests.	2013-02-11 19:52:59 +01:00
Bastian Kleineidam	14f0a6fe78	Do not prefetch content with requests >= 1.0	2013-02-11 19:45:21 +01:00
Bastian Kleineidam	67836942d8	Simplify the fetchUrl code.	2013-02-11 19:43:46 +01:00
Bastian Kleineidam	1a0cd1ee6b	Print HTTP client headers.	2013-02-07 18:28:56 +01:00
Bastian Kleineidam	73700e66f0	Cleanup	2013-01-24 21:42:27 +01:00
Bastian Kleineidam	f1356a9ff8	Fix URL norming, see issue #2.	2013-01-23 21:16:22 +01:00
Bastian Kleineidam	5479627d86	Updated copyright.	2013-01-09 22:21:19 +01:00

81 commits