This workaround was written in 2016, while that version was still found
on many systems. Additionally, this workaround needs to be enabled by
the developer, who might not even be aware that they need to enable it
for a specific module. We still warn the user when running with such an
old libxml version.
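A minimal sketch of such a startup warning, using lxml's
etree.LIBXML_VERSION tuple; the cutoff version here is a hypothetical
stand-in, since the exact version is not named above:

    from lxml import etree

    # Hypothetical cutoff; the real value depends on the workaround.
    MIN_LIBXML = (2, 9, 3)

    def warn_old_libxml():
        # LIBXML_VERSION is the libxml2 version lxml is running against.
        if etree.LIBXML_VERSION < MIN_LIBXML:
            print('WARNING: libxml2 %s is very old, enabling workaround.'
                  % '.'.join(map(str, etree.LIBXML_VERSION)))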
The default encoding for source files has been UTF-8 since Python 3, so
we can drop all encoding headers. While we are at it, replace them with
SPDX headers.
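For example, a typical module header before and after (the concrete
license identifier is an assumption, not taken from the text above):

    # Before (Python 2 era):
    # -*- coding: utf-8 -*-

    # After:
    # SPDX-License-Identifier: MIT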
Since we already match comics case-insensitively on the command line,
this was a logical step, even though it means changing quite a bit of
code, all of which tries to resolve the "comic directory" in a slightly
different way...
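A minimal sketch of one such resolution step, with hypothetical names:

    import os

    def find_comic_dir(basepath, name):
        # Resolve an existing comic directory ignoring case.
        wanted = name.lower()
        for entry in os.listdir(basepath):
            if entry.lower() == wanted:
                return os.path.join(basepath, entry)
        # No existing match: use the lower-case name for a new directory.
        return os.path.join(basepath, wanted)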
HTML character encoding in the context of HTTP is quite tricky to get
right, and honestly, I'm not sure I got it right this time. But I think
the current behaviour best matches what web browsers try to do (see the
sketch after the list):
1. Let Requests figure out the encoding from the HTTP header. This
   overrides everything else. We need to "trick" LXML into accepting our
   decision if the document contains an XML declaration that might
   disagree with the HTTP header.
2. If the HTTP headers don't specify any encoding, let LXML guess the
encoding and be done with it.
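A sketch of both steps, assuming a plain Requests call (names are
illustrative, not the actual Dosage code):

    import lxml.html
    import requests

    def parse_page(url):
        resp = requests.get(url)
        if 'charset' in resp.headers.get('content-type', '').lower():
            # 1. The HTTP header names an encoding, so trust Requests'
            #    decoding. Re-encoding to UTF-8 and forcing the parser
            #    encoding overrides a conflicting XML declaration.
            parser = lxml.html.HTMLParser(encoding='utf-8')
            return lxml.html.document_fromstring(
                resp.text.encode('utf-8'), parser=parser)
        # 2. No encoding in the headers: hand LXML the raw bytes and let
        #    it guess from the document itself.
        return lxml.html.document_fromstring(resp.content)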
This brings us back to only honouring robots.txt on page downloads, not
on image downloads.
Rationale: Dosage is not a "robot" in the classical sense. It is not
designed to spider huge numbers of web sites in search of content to
index; it is only intended to help users keep a personal archive of
comics they are interested in. We try very hard to never download any
image twice. This fixes #24.
(Precedent for this rationale: Google Feedfetcher:
https://support.google.com/webmasters/answer/178852?hl=en#robots)
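A sketch of the resulting policy, using the standard library's
robotparser (helper names are hypothetical):

    from urllib.parse import urlsplit, urlunsplit
    from urllib.robotparser import RobotFileParser

    def page_allowed(url, user_agent='Dosage'):
        # Consulted for page downloads only; image downloads skip this.
        scheme, netloc = urlsplit(url)[:2]
        rp = RobotFileParser(
            urlunsplit((scheme, netloc, '/robots.txt', '', '')))
        rp.read()
        return rp.can_fetch(user_agent, url)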
This reverts commit fcde86e9c0 and some
more. This lets python-requests do all the encoding work and leaves
LXML with (hopefully) clean unicode HTML to parse.
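In sketch form, the pipeline after the revert (illustrative, not the
exact code):

    import lxml.html
    import requests

    url = 'https://example.com/'
    resp = requests.get(url)
    # Requests picks the encoding (HTTP header or its own detection)...
    tree = lxml.html.document_fromstring(resp.text)
    # ...so LXML only ever sees already-decoded unicode HTML.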