dosage

Author	SHA1	Message	Date
Tobias Gruetzmacher	5934f03453	Merge branch 'htmlparser' - I think it's ready. This closes pull request #70.	2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher	17bc454132	Bugfix: Don't assume RE patterns in base class.	2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher	3235b8b312	Pass unicode strings to lxml. This reverts commit `fcde86e9c0` & some more. This lets python-requests do all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML to parse.	2014-10-13 19:39:48 +02:00
Bastian Kleineidam	e43694c156	Don't crash on multiple HTML output runs per day.	2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher	fcde86e9c0	Change getPageContent to (optionally) return raw text. This allows LXML to do its own "magic" encoding detection	2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher	0e03eca8f0	Move all regular expression operation into the new class. - Move fetchUrls, fetchUrl and fetchText. - Move base URL handling.	2014-07-26 11:28:43 +02:00
Bastian Kleineidam	3a929ceea6	Allow comic text to be optional. Patch from TobiX	2014-07-24 20:49:57 +02:00
Bastian Kleineidam	93fe5d5987	Minor useragent refactoring	2014-07-03 17:12:25 +02:00
Bastian Kleineidam	687d27d534	Stripping should be done in normaliseUrl.	2014-06-08 10:12:33 +02:00
Bastian Kleineidam	4d63920434	Updated copyright.	2014-01-05 16:50:57 +01:00
Bastian Kleineidam	df9a381ae4	Document getfp() function.	2013-12-08 11:46:26 +01:00
Bastian Kleineidam	03fff069ee	Apply same file checks files as for image files.	2013-12-05 18:29:15 +01:00
Bastian Kleineidam	0eaf9a3139	Add text search in comic strips.	2013-11-29 20:26:49 +01:00
Bastian Kleineidam	ebdc1e6359	More unicode output fixes.	2013-04-30 06:41:19 +02:00
Bastian Kleineidam	c246b41d64	Code formatting.	2013-04-13 08:00:11 +02:00
Bastian Kleineidam	35c031ca81	Fixed some comics.	2013-04-11 18:27:43 +02:00
Bastian Kleineidam	190ffcd390	Use str() for robotparser.	2013-04-09 19:36:00 +02:00
Bastian Kleineidam	b9dc385ff2	Implemented voting	2013-04-09 19:33:50 +02:00
Bastian Kleineidam	4528281ddd	Voting part 2	2013-04-08 21:20:01 +02:00
Bastian Kleineidam	781bac0ca2	Feed text content instead of binary to robots.txt parser.	2013-04-07 18:11:29 +02:00
Bastian Kleineidam	0fbc005377	A Python3 fix.	2013-04-05 18:57:44 +02:00
Bastian Kleineidam	97522bc5ae	Use tuples rather than lists.	2013-04-05 18:55:19 +02:00
Bastian Kleineidam	adb31d84af	Use HTMLParser.unescape instead of rolling our own function.	2013-04-05 18:53:19 +02:00
Bastian Kleineidam	6aa588860d	Code cleanup	2013-04-05 06:36:05 +02:00
Bastian Kleineidam	460c5be689	Add POST support to urlopen().	2013-04-04 18:30:02 +02:00
Bastian Kleineidam	0054ebfe0b	Some Python3 fixes.	2013-04-03 20:32:43 +02:00
Bastian Kleineidam	2c0ca04882	Fix warning for scrapers with multiple image patterns.	2013-04-03 20:32:19 +02:00
Bastian Kleineidam	110a67cda4	Retry failed page content downloads (eg. timeouts).	2013-03-25 19:49:09 +01:00
Bastian Kleineidam	43f20270d0	Allow a list of regular expressions for image and previous link search.	2013-03-12 20:48:26 +01:00
Bastian Kleineidam	88e28f3923	Fix some comics and add language tag.	2013-03-08 22:33:05 +01:00
Bastian Kleineidam	c13aa323d8	Code cleanup [ci skip]	2013-03-04 21:44:26 +01:00
Bastian Kleineidam	41c954b309	Another try on URL quoting.	2013-02-23 09:08:08 +01:00
Bastian Kleineidam	d0c3492cc7	Catch robots.txt errors.	2013-02-21 19:48:04 +01:00
Bastian Kleineidam	be1694592e	Do not stream page content URLs.	2013-02-18 20:38:59 +01:00
Bastian Kleineidam	96bf9ef523	Recognize internal server errors.	2013-02-13 17:54:10 +01:00
Bastian Kleineidam	f16e860f1e	Only cache robots.txt URL on memoize.	2013-02-13 17:52:07 +01:00
Bastian Kleineidam	10f6a1caa1	Correct path quoting.	2013-02-12 17:55:33 +01:00
Bastian Kleineidam	6d0fffd825	Always use connection pooling.	2013-02-12 17:55:13 +01:00
Bastian Kleineidam	a35c54525d	Work around a bug in python requests.	2013-02-11 19:52:59 +01:00
Bastian Kleineidam	14f0a6fe78	Do not prefetch content with requests >= 1.0	2013-02-11 19:45:21 +01:00
Bastian Kleineidam	67836942d8	Simplify the fetchUrl code.	2013-02-11 19:43:46 +01:00
Bastian Kleineidam	1a0cd1ee6b	Print HTTP client headers.	2013-02-07 18:28:56 +01:00
Bastian Kleineidam	73700e66f0	Cleanup	2013-01-24 21:42:27 +01:00
Bastian Kleineidam	f1356a9ff8	Fix URL norming, see issue #2.	2013-01-23 21:16:22 +01:00
Bastian Kleineidam	5479627d86	Updated copyright.	2013-01-09 22:21:19 +01:00
Bastian Kleineidam	6a2f57b132	Support requests module >= 1.0	2012-12-19 20:43:18 +01:00
Bastian Kleineidam	e5a04931d3	Various fixes and additions.	2012-12-12 17:41:29 +01:00
Bastian Kleineidam	4def4b81bd	Add cookie feature.	2012-12-08 21:30:23 +01:00
Bastian Kleineidam	faba7b0bca	Fix more comics.	2012-12-08 00:45:18 +01:00
Bastian Kleineidam	e5d9002f09	Fix more comics.	2012-12-05 21:52:52 +01:00

76 commits