Tobias Gruetzmacher
efe1308db2
Replace home-grown Python2/3 compat. with six.
2016-05-05 23:33:48 +02:00
Tobias Gruetzmacher
4204f5f1e4
Send "If-Modified-Since" header for images.
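A minimal sketch of how such a conditional request header could be built, assuming the timestamp is taken from the modification time of an image file that is already on disk. The helper name `if_modified_since_headers` and the file-based approach are illustrative assumptions, not taken from this commit.

```python
import os
from email.utils import formatdate


def if_modified_since_headers(path):
    """Return headers for a conditional GET (hypothetical helper).

    If no local copy exists, an empty dict is returned so that the
    image is fetched unconditionally.
    """
    if not os.path.exists(path):
        return {}
    mtime = os.path.getmtime(path)
    # HTTP dates are always expressed in GMT.
    return {'If-Modified-Since': formatdate(mtime, usegmt=True)}
```

A server that answers such a request with "304 Not Modified" sends no body, so the local copy can simply be kept.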
2016-04-19 00:36:50 +02:00
Tobias Gruetzmacher
9028724a74
Clean up update helper scripts.
2016-04-13 00:52:16 +02:00
Tobias Gruetzmacher
8768ff07b6
Fix AhoiPolloi, be a bit smarter about encoding.
...
HTML character encoding in the context of HTTP is quite tricky to get
right and, honestly, I'm not sure I got it right this time. But I
think the current behaviour best matches what web browsers try to do:
1. Let Requests figure out the encoding from the HTTP header. This
   overrides everything else. We need to "trick" LXML into accepting our
   decision if the document contains an XML declaration which might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
   encoding and be done with it.
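The two steps above can be sketched roughly as follows, using only the standard library. This is an illustration under assumptions, not the code of this commit: the names `charset_from_header` and `prepare_for_parser` are hypothetical, the header is parsed by hand instead of by Requests, and the "trick" shown here (stripping the XML declaration before handing over a unicode string) is only one possible way to keep the parser from seeing a conflicting declaration.

```python
import re
from email.message import Message

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?>
XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def prepare_for_parser(body, content_type):
    """Decide what to hand to the HTML parser (hypothetical helper).

    Step 1: if the HTTP header names an encoding, decode with it and
    drop any XML declaration that might disagree with the header.
    Step 2: otherwise return the raw bytes so the parser can guess.
    """
    charset = charset_from_header(content_type)
    if charset is None:
        return body
    try:
        text = body.decode(charset, errors='replace')
    except LookupError:
        # Unknown charset name: fall back to letting the parser guess.
        return body
    return XML_DECL.sub('', text, count=1)
```

The return value is either a unicode string (header wins) or the untouched bytes (parser guesses), so the caller can pass it on to the parser without further checks.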
2016-04-06 22:22:22 +02:00
Tobias Gruetzmacher
6727e9b559
Use vendored urllib3.
...
As long as requests ships with urllib3, we can't fall back to the
"system" urllib3, since that breaks class-identity checks.
2016-03-16 23:18:19 +01:00
Tobias Gruetzmacher
c4fcd985dd
Let urllib3 handle all retries.
2016-03-13 21:30:36 +01:00
Johannes Schöpp
351fa7154e
Modified maximum page size
...
Fixes #36
2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher
10d9eac574
Remove support for very old versions of "requests".
2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher
68d4dd463a
Revert robots.txt handling.
...
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.
Rationale: Dosage is not a "robot" in the classical sense. It's not
designed to spider huge numbers of web sites in search of some content
to index; it's only intended to help users keep a personal archive of
comics they are interested in. We try very hard to never download any
image twice. This fixes #24.
(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots )
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher
7c15ea50d8
Also check robots.txt on image downloads.
...
We DO want to honour robots.txt when it blocks images.
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher
5affd8af68
More relaxed robots.txt handling.
...
This is in line with how Perl's LWP::RobotUA and Google handle server
errors when fetching robots.txt: Just assume access is allowed.
See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
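A minimal sketch of this relaxed behaviour, built on the standard library's `urllib.robotparser`. The function name `build_robot_parser` is hypothetical, and treating every HTTP status of 400 or above as an "error" is an assumption made for the example, not something this commit states.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(status_code, text):
    """Create a robots.txt parser from a fetch result (hypothetical helper).

    If robots.txt could not be fetched cleanly, assume that access is
    allowed instead of blocking everything.
    """
    parser = RobotFileParser()
    if status_code >= 400:
        # Assumption: any error status means "no usable robots.txt".
        parser.allow_all = True
    else:
        parser.parse(text.splitlines())
    return parser
```

With this, `build_robot_parser(503, '').can_fetch('Dosage', url)` allows any URL, while a successfully fetched robots.txt is honoured as usual.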
2015-07-15 19:11:55 +02:00
Tobias Gruetzmacher
86b31dc12b
Depend on pycountry directly.
2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher
5934f03453
Merge branch 'htmlparser' - I think it's ready.
...
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher
17bc454132
Bugfix: Don't assume RE patterns in base class.
2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher
3235b8b312
Pass unicode strings to lxml.
...
This reverts commit fcde86e9c0 & some more. This lets python-requests
do all the encoding stuff and leaves LXML with (hopefully) clean
unicode HTML to parse.
2014-10-13 19:39:48 +02:00
Bastian Kleineidam
e43694c156
Don't crash on multiple HTML output runs per day.
2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher
fcde86e9c0
Change getPageContent to (optionally) return raw text.
...
This allows LXML to do its own "magic" encoding detection
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher
0e03eca8f0
Move all regular expression operation into the new class.
...
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam
3a929ceea6
Allow comic text to be optional. Patch from TobiX
2014-07-24 20:49:57 +02:00
Bastian Kleineidam
93fe5d5987
Minor useragent refactoring
2014-07-03 17:12:25 +02:00
Bastian Kleineidam
687d27d534
Stripping should be done in normaliseUrl.
2014-06-08 10:12:33 +02:00
Bastian Kleineidam
4d63920434
Updated copyright.
2014-01-05 16:50:57 +01:00
Bastian Kleineidam
df9a381ae4
Document getfp() function.
2013-12-08 11:46:26 +01:00
Bastian Kleineidam
03fff069ee
Apply same file checks as for image files.
2013-12-05 18:29:15 +01:00
Bastian Kleineidam
0eaf9a3139
Add text search in comic strips.
2013-11-29 20:26:49 +01:00
Bastian Kleineidam
ebdc1e6359
More unicode output fixes.
2013-04-30 06:41:19 +02:00
Bastian Kleineidam
c246b41d64
Code formatting.
2013-04-13 08:00:11 +02:00
Bastian Kleineidam
35c031ca81
Fixed some comics.
2013-04-11 18:27:43 +02:00
Bastian Kleineidam
190ffcd390
Use str() for robotparser.
2013-04-09 19:36:00 +02:00
Bastian Kleineidam
b9dc385ff2
Implemented voting
2013-04-09 19:33:50 +02:00
Bastian Kleineidam
4528281ddd
Voting part 2
2013-04-08 21:20:01 +02:00
Bastian Kleineidam
781bac0ca2
Feed text content instead of binary to robots.txt parser.
2013-04-07 18:11:29 +02:00
Bastian Kleineidam
0fbc005377
A Python3 fix.
2013-04-05 18:57:44 +02:00
Bastian Kleineidam
97522bc5ae
Use tuples rather than lists.
2013-04-05 18:55:19 +02:00
Bastian Kleineidam
adb31d84af
Use HTMLParser.unescape instead of rolling our own function.
2013-04-05 18:53:19 +02:00
Bastian Kleineidam
6aa588860d
Code cleanup
2013-04-05 06:36:05 +02:00
Bastian Kleineidam
460c5be689
Add POST support to urlopen().
2013-04-04 18:30:02 +02:00
Bastian Kleineidam
0054ebfe0b
Some Python3 fixes.
2013-04-03 20:32:43 +02:00
Bastian Kleineidam
2c0ca04882
Fix warning for scrapers with multiple image patterns.
2013-04-03 20:32:19 +02:00
Bastian Kleineidam
110a67cda4
Retry failed page content downloads (e.g. timeouts).
2013-03-25 19:49:09 +01:00
Bastian Kleineidam
43f20270d0
Allow a list of regular expressions for image and previous link search.
2013-03-12 20:48:26 +01:00
Bastian Kleineidam
88e28f3923
Fix some comics and add language tag.
2013-03-08 22:33:05 +01:00
Bastian Kleineidam
c13aa323d8
Code cleanup [ci skip]
2013-03-04 21:44:26 +01:00
Bastian Kleineidam
41c954b309
Another try on URL quoting.
2013-02-23 09:08:08 +01:00
Bastian Kleineidam
d0c3492cc7
Catch robots.txt errors.
2013-02-21 19:48:04 +01:00
Bastian Kleineidam
be1694592e
Do not stream page content URLs.
2013-02-18 20:38:59 +01:00
Bastian Kleineidam
96bf9ef523
Recognize internal server errors.
2013-02-13 17:54:10 +01:00
Bastian Kleineidam
f16e860f1e
Only cache robots.txt URL on memoize.
2013-02-13 17:52:07 +01:00
Bastian Kleineidam
10f6a1caa1
Correct path quoting.
2013-02-12 17:55:33 +01:00
Bastian Kleineidam
6d0fffd825
Always use connection pooling.
2013-02-12 17:55:13 +01:00