WIP: Guess encoding if default does not work #114

Closed
mslw wants to merge 5 commits from encoding into main
mslw commented 2023-11-13 17:31:26 +00:00 (Migrated from github.com)

If reading a tsv file with default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, Path.open() uses locale.getencoding() when reading a file (which means that we implicitly use utf-8, at least on linux). This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job.

This PR refactors tabby loader (introducing _parse_tsv_[single, many]) functions that take an optional encoding argument), which allow us to use guessed encoding (but only after the default fails). Closes #112

If reading a tsv file with default encoding fails, roll out a cannon ([charset-normalizer](https://charset-normalizer.readthedocs.io)) and try to guess the encoding to use. By default, `Path.open()` uses `locale.getencoding()` when reading a file (which means that we implicitly use utf-8, at least on linux). This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job. This PR refactors tabby loader (introducing `_parse_tsv_[single, many]`) functions that take an optional encoding argument), which allow us to use guessed encoding (but only after the default fails). Closes #112
mih (Migrated from github.com) reviewed 2023-11-13 17:40:12 +00:00
mih (Migrated from github.com) left a comment

Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.

Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.
mslw commented 2023-11-13 18:39:21 +00:00 (Migrated from github.com)

Updated the try/except, agree that it looks cleaner.

Unfortunately, I just got reminder that a guess is only a guess, and dataset authors table is hard, as there may be only one non-ascii character per entire file. Out of 2 files I had (both with German encoding), I got one misclassified and an (intended) "ü" misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...

Updated the try/except, agree that it looks cleaner. Unfortunately, I just got reminder that a guess is only a guess, and dataset authors table is hard, as there may be only one non-ascii character per entire file. Out of 2 files I had (both with German encoding), I got one misclassified and an (intended) "ü" misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...
mslw commented 2023-11-21 12:59:14 +00:00 (Migrated from github.com)

I've added the possibility to specify encoding as an argument to load_tabby. However, building on top of previously proposed logic, this led to "if encoding given, use it; else try default then guess", and the code seems fairly ugly to me.

Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.

I've added the possibility to specify encoding as an argument to `load_tabby`. However, building on top of previously proposed logic, this led to "if encoding given, use it; else try default then guess", and the code seems fairly ugly to me. Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.

Pull request closed

Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sfb1451/datalad-tabby!114
No description provided.