WIP: Guess encoding if default does not work #114

Closed
mslw wants to merge 5 commits from encoding into main

5 commits

Author SHA1 Message Date
f0c44c1818 Make encoding a property of TabbyLoader
Because load functions are used recursively (when load statements are
found in a tabby file), passing the encoding parameter around would be
too much hassle; it is simpler to store it as `self._encoding`.
2023-11-21 13:50:43 +01:00
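
To illustrate the design choice in f0c44c1818, here is a minimal sketch of a recursive loader that keeps the encoding on the instance. Everything in it is an assumption made for illustration: `SketchLoader`, its `load` method, and the `@load` directive are hypothetical and are not the real `TabbyLoader` API or tabby syntax. Only the idea of reading `self._encoding` in nested calls comes from the commit message.

```python
from pathlib import Path
from typing import List, Optional


class SketchLoader:
    """Hypothetical stand-in for a recursive loader (not the real TabbyLoader)."""

    def __init__(self, encoding: Optional[str] = None):
        # Stored once on the instance, so nested loads need no extra parameter.
        self._encoding = encoding

    def load(self, path: Path) -> List[str]:
        rows: List[str] = []
        with path.open(encoding=self._encoding) as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("@load "):
                    # Made-up directive: include another file, resolved
                    # relative to the current one. The recursive call picks
                    # up the same self._encoding automatically.
                    rows.extend(self.load(path.parent / line[len("@load "):]))
                else:
                    rows.append(line)
        return rows
```

With this shape, a caller sets the encoding once, e.g. `SketchLoader(encoding="cp1250").load(Path("main.tsv"))`, and every file reached through a nested load is read the same way.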
070937a7c2 Add an encoding argument to tabby loader
When an encoding is explicitly specified, it will be used.

Otherwise, the default encoding used by `Path.open` will be tried first,
and charset_normalizer will be used to guess the encoding if that fails.
2023-11-21 12:35:50 +01:00
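
The order of attempts described in 070937a7c2 can be sketched roughly as follows. This is not the code from the PR: `read_with_fallback` is a hypothetical helper name, and returning the encoding alongside the text is an assumption made so the outcome is easy to inspect. The sketch relies on `charset_normalizer.from_path(...).best()`, which returns the most likely match or `None`.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_with_fallback(
    path: Path, encoding: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (text, encoding used); None means the platform default was used."""
    if encoding is not None:
        # An explicit encoding is used as-is; decoding errors propagate.
        return path.read_text(encoding=encoding), encoding
    try:
        # No encoding given: try the platform default first.
        return path.read_text(), None
    except UnicodeDecodeError:
        # Imported lazily so the dependency is only needed when the
        # default encoding actually fails.
        from charset_normalizer import from_path

        match = from_path(path).best()
        if match is None:
            # Nothing plausible was detected: re-raise the original error.
            raise
        return path.read_text(encoding=match.encoding), match.encoding
```

Guessing only after the default fails keeps the common case cheap and deterministic. Note that a guess made on a short file can still be wrong, which is why an explicit `encoding` argument takes precedence.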
8d4b6e1aba Fix a type annotation
2023-11-21 12:20:21 +01:00
ef7d778311 Narrow down the try/except
This narrows down the try/except to wrap the loader only, and not the
extend/append. It is clearer what is being tried.
2023-11-13 19:05:37 +01:00
71676da64f Guess encoding if default does not work
If reading a TSV file with the default encoding fails, roll out a
cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` uses `locale.getencoding()` when reading
a file (which means that we implicitly use UTF-8, at least on
Linux). This fails when reading files with non-ASCII characters that
were prepared on Windows with not-uncommon settings. There is no perfect
way to learn the encoding from a plain text file, but existing tools
seem to do a good job.

This commit refactors the tabby loader, makes it use the guessed encoding
(but only after the default fails), and closes #112.

https://charset-normalizer.readthedocs.io
2023-11-13 18:33:52 +01:00