WIP: Guess encoding if default does not work #114
If reading a tsv file with default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.
By default, Path.open() uses locale.getencoding() when reading a file (which means that we implicitly use utf-8, at least on Linux). This would fail when reading files with non-ASCII characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job.

This PR refactors the tabby loader (introducing _parse_tsv_[single, many] functions that take an optional encoding argument), which allows us to use a guessed encoding (but only after the default fails).

Closes #112

Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.
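For readers following along, here is a minimal sketch of the fallback discussed above, with the try/except wrapped around the loader call only and the extend kept outside it. It is not the PR's actual code: parse_tsv, load_many, guess_encoding, and the default and guesser parameters are hypothetical names introduced for illustration. The only third-party call used is charset-normalizer's from_path(...).best().

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def guess_encoding(path: Path) -> Optional[str]:
    """Guess a file's encoding with charset-normalizer (None if no guess)."""
    # Imported lazily so the rest of the sketch works without the dependency.
    from charset_normalizer import from_path

    best = from_path(path).best()
    return best.encoding if best is not None else None


def parse_tsv(path: Path, encoding: Optional[str] = None) -> List[List[str]]:
    """Read one TSV file; encoding=None means the locale default."""
    with path.open(encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


def load_many(
    paths: Iterable[Path],
    default: Optional[str] = None,  # hypothetical knob, handy for testing
    guesser: Callable[[Path], Optional[str]] = guess_encoding,
) -> List[List[str]]:
    """Concatenate rows from several files, guessing only after a failure."""
    rows: List[List[str]] = []
    for path in paths:
        try:
            # Only the loader call is guarded.
            parsed = parse_tsv(path, encoding=default)
        except UnicodeDecodeError:
            guessed = guesser(path)
            if guessed is None:
                raise  # no guess available: surface the original error
            parsed = parse_tsv(path, encoding=guessed)
        rows.extend(parsed)  # stays outside the try/except
    return rows
```

Passing the guesser as a parameter is a design choice made for this sketch only; it keeps the guessing step replaceable and lets the logic be exercised without the optional dependency installed.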
Updated the try/except, agree that it looks cleaner.
Unfortunately, I just got a reminder that a guess is only a guess, and that the dataset authors table is a hard case, as there may be only one non-ASCII character in the entire file. Of the 2 files I had (both with German encoding), one was misclassified and an (intended) "ü" was misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...
I've added the possibility to specify encoding as an argument to load_tabby. However, building on top of the previously proposed logic, this led to "if encoding given, use it; else try default then guess", and the code seems fairly ugly to me.

Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.
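The precedence described above ("if encoding given, use it; else try default then guess") can be sketched roughly as below. This is not the load_tabby implementation, whose signature is not shown in this thread. decode_tsv_bytes, its parameters, and the "utf-8" default are hypothetical and stand in for the real loader and the locale default.

```python
from typing import Callable, Optional


def decode_tsv_bytes(
    data: bytes,
    encoding: Optional[str] = None,
    default: str = "utf-8",  # assumption: stands in for the locale default
    guesser: Optional[Callable[[bytes], Optional[str]]] = None,
) -> str:
    """Decode raw TSV bytes using explicit > default > guessed precedence."""
    if encoding is not None:
        # An explicit declaration wins; decoding errors propagate unchanged.
        return data.decode(encoding)
    try:
        return data.decode(default)
    except UnicodeDecodeError:
        if guesser is None:
            raise
        guessed = guesser(data)
        if guessed is None:
            raise
        return data.decode(guessed)
```

Dropping the guessing branch leaves only the first two lines of the body plus the default decode, which is roughly what "just the encoding argument" would amount to.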
Pull request closed