yarn.stigatle.no @<lyse https://lyse.isobeef.org/twtxt.txt> "(#qlgshhq) @movq Non-ASCII characters were broken. Like U+2028, degrees (°), etc. Turns out I used a silly library to detect the encoding and ..."

lyse.isobeef.org

@movq@www.uninformativ.de Non-ASCII characters were broken. Like U+2028, degrees (°), etc.

Turns out I used a silly library to detect the encoding and transform to UTF-8 if needed. When there is no Content-Type header, like for local files, it looks at the first 1024 bytes. Since it only saw ASCII in that region, the damn thing assumed the data to be in Windows-1252 (which for web pages kinda makes sense):

// TODO: change default depending on user's locale?
return charmap.Windows1252, "windows-1252", false

https://cs.opensource.google/go/x/net/+/master:html/charset/charset.go;l=102

This default is hardcoded and cannot be changed.

Trying to be smart and adding automatic support for other encodings turned out to be a bad move on my end. At least I can reduce my dependency list again. :-)

I now just reject everything that explicitly specifies something different than text/plain and an optional charset other than utf-8 (ignoring casing). Otherwise I assume it’s in UTF-8 (just like the twtxt file format specification mandates) and hope for the best.

⤋ Read More

Participate