Tobias Gruetzmacher
6727e9b559
Use vendored urllib3.
...
As long as requests ships with urllib3, we can't fall back to the
"system" urllib3, since that breaks class-identity checks.
2016-03-16 23:18:19 +01:00
Tobias Gruetzmacher
c4fcd985dd
Let urllib3 handle all retries.
2016-03-13 21:30:36 +01:00
Johannes Schöpp
351fa7154e
Modified maximum page size
...
Fixes #36
2016-03-01 22:19:44 +01:00
Tobias Gruetzmacher
10d9eac574
Remove support for very old versions of "requests".
2015-11-02 23:24:01 +01:00
Tobias Gruetzmacher
68d4dd463a
Revert robots.txt handling.
...
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.
Rationale: Dosage is not a "robot" in the classical sense. It is not
designed to spider large numbers of websites in search of content
to index; it is only intended to help users keep a personal archive of
the comics they are interested in. We try very hard to never download any
image twice. This fixes #24.
(Precedent for this rationale: Google Feedfetcher,
https://support.google.com/webmasters/answer/178852?hl=en#robots )
2015-07-17 20:46:56 +02:00
Tobias Gruetzmacher
7c15ea50d8
Also check robots.txt on image downloads.
...
We DO want to honour robots.txt when it blocks images.
2015-07-15 23:50:57 +02:00
Tobias Gruetzmacher
5affd8af68
More relaxed robots.txt handling.
...
This is in line with how Perl's LWP::RobotUA and Google handle server
errors when fetching robots.txt: Just assume access is allowed.
See https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
2015-07-15 19:11:55 +02:00
Tobias Gruetzmacher
86b31dc12b
Depend on pycountry directly.
2015-04-21 21:56:54 +02:00
Tobias Gruetzmacher
5934f03453
Merge branch 'htmlparser' - I think it's ready.
...
This closes pull request #70.
2015-04-01 22:13:55 +02:00
Tobias Gruetzmacher
17bc454132
Bugfix: Don't assume RE patterns in base class.
2014-10-13 22:29:47 +02:00
Tobias Gruetzmacher
3235b8b312
Pass unicode strings to lxml.
...
This reverts commit fcde86e9c0 & some more. This lets python-requests do
all the encoding stuff and leaves LXML with (hopefully) clean unicode HTML
to parse.
2014-10-13 19:39:48 +02:00
Bastian Kleineidam
e43694c156
Don't crash on multiple HTML output runs per day.
2014-09-22 22:00:16 +02:00
Tobias Gruetzmacher
fcde86e9c0
Change getPageContent to (optionally) return raw text.
...
This allows LXML to do its own "magic" encoding detection
2014-07-26 11:28:43 +02:00
Tobias Gruetzmacher
0e03eca8f0
Move all regular expression operations into the new class.
...
- Move fetchUrls, fetchUrl and fetchText.
- Move base URL handling.
2014-07-26 11:28:43 +02:00
Bastian Kleineidam
3a929ceea6
Allow comic text to be optional. Patch from TobiX
2014-07-24 20:49:57 +02:00
Bastian Kleineidam
93fe5d5987
Minor useragent refactoring
2014-07-03 17:12:25 +02:00
Bastian Kleineidam
687d27d534
Stripping should be done in normaliseUrl.
2014-06-08 10:12:33 +02:00
Bastian Kleineidam
4d63920434
Updated copyright.
2014-01-05 16:50:57 +01:00
Bastian Kleineidam
df9a381ae4
Document getfp() function.
2013-12-08 11:46:26 +01:00
Bastian Kleineidam
03fff069ee
Apply same file checks as for image files.
2013-12-05 18:29:15 +01:00
Bastian Kleineidam
0eaf9a3139
Add text search in comic strips.
2013-11-29 20:26:49 +01:00
Bastian Kleineidam
ebdc1e6359
More unicode output fixes.
2013-04-30 06:41:19 +02:00
Bastian Kleineidam
c246b41d64
Code formatting.
2013-04-13 08:00:11 +02:00
Bastian Kleineidam
35c031ca81
Fixed some comics.
2013-04-11 18:27:43 +02:00
Bastian Kleineidam
190ffcd390
Use str() for robotparser.
2013-04-09 19:36:00 +02:00
Bastian Kleineidam
b9dc385ff2
Implemented voting
2013-04-09 19:33:50 +02:00
Bastian Kleineidam
4528281ddd
Voting part 2
2013-04-08 21:20:01 +02:00
Bastian Kleineidam
781bac0ca2
Feed text content instead of binary to robots.txt parser.
2013-04-07 18:11:29 +02:00
Bastian Kleineidam
0fbc005377
A Python3 fix.
2013-04-05 18:57:44 +02:00
Bastian Kleineidam
97522bc5ae
Use tuples rather than lists.
2013-04-05 18:55:19 +02:00
Bastian Kleineidam
adb31d84af
Use HTMLParser.unescape instead of rolling our own function.
2013-04-05 18:53:19 +02:00
Bastian Kleineidam
6aa588860d
Code cleanup
2013-04-05 06:36:05 +02:00
Bastian Kleineidam
460c5be689
Add POST support to urlopen().
2013-04-04 18:30:02 +02:00
Bastian Kleineidam
0054ebfe0b
Some Python3 fixes.
2013-04-03 20:32:43 +02:00
Bastian Kleineidam
2c0ca04882
Fix warning for scrapers with multiple image patterns.
2013-04-03 20:32:19 +02:00
Bastian Kleineidam
110a67cda4
Retry failed page content downloads (e.g. timeouts).
2013-03-25 19:49:09 +01:00
Bastian Kleineidam
43f20270d0
Allow a list of regular expressions for image and previous link search.
2013-03-12 20:48:26 +01:00
Bastian Kleineidam
88e28f3923
Fix some comics and add language tag.
2013-03-08 22:33:05 +01:00
Bastian Kleineidam
c13aa323d8
Code cleanup [ci skip]
2013-03-04 21:44:26 +01:00
Bastian Kleineidam
41c954b309
Another try on URL quoting.
2013-02-23 09:08:08 +01:00
Bastian Kleineidam
d0c3492cc7
Catch robots.txt errors.
2013-02-21 19:48:04 +01:00
Bastian Kleineidam
be1694592e
Do not stream page content URLs.
2013-02-18 20:38:59 +01:00
Bastian Kleineidam
96bf9ef523
Recognize internal server errors.
2013-02-13 17:54:10 +01:00
Bastian Kleineidam
f16e860f1e
Only cache robots.txt URL on memoize.
2013-02-13 17:52:07 +01:00
Bastian Kleineidam
10f6a1caa1
Correct path quoting.
2013-02-12 17:55:33 +01:00
Bastian Kleineidam
6d0fffd825
Always use connection pooling.
2013-02-12 17:55:13 +01:00
Bastian Kleineidam
a35c54525d
Work around a bug in python requests.
2013-02-11 19:52:59 +01:00
Bastian Kleineidam
14f0a6fe78
Do not prefetch content with requests >= 1.0
2013-02-11 19:45:21 +01:00
Bastian Kleineidam
67836942d8
Simplify the fetchUrl code.
2013-02-11 19:43:46 +01:00
Bastian Kleineidam
1a0cd1ee6b
Print HTTP client headers.
2013-02-07 18:28:56 +01:00