WIP: Guess encoding if default does not work #114

Closed

mslw wants to merge 5 commits from encoding into main

mslw commented

2023-11-13 17:31:26 +00:00

(Migrated from github.com)

If reading a tsv file with default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, Path.open() uses locale.getencoding() when reading a file (which means that we implicitly use utf-8, at least on linux). This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job.

This PR refactors tabby loader (introducing _parse_tsv_[single, many]) functions that take an optional encoding argument), which allow us to use guessed encoding (but only after the default fails). Closes #112

If reading a tsv file with default encoding fails, roll out a cannon ([charset-normalizer](https://charset-normalizer.readthedocs.io)) and try to guess the encoding to use. By default, `Path.open()` uses `locale.getencoding()` when reading a file (which means that we implicitly use utf-8, at least on linux). This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job. This PR refactors tabby loader (introducing `_parse_tsv_[single, many]`) functions that take an optional encoding argument), which allow us to use guessed encoding (but only after the default fails). Closes #112

mih (Migrated from github.com) reviewed

2023-11-13 17:40:12 +00:00

mih (Migrated from github.com) left a comment

Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.

mslw commented

2023-11-13 18:39:21 +00:00

(Migrated from github.com)

Updated the try/except, agree that it looks cleaner.

Unfortunately, I just got reminder that a guess is only a guess, and dataset authors table is hard, as there may be only one non-ascii character per entire file. Out of 2 files I had (both with German encoding), I got one misclassified and an (intended) "ü" misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...

Updated the try/except, agree that it looks cleaner. Unfortunately, I just got reminder that a guess is only a guess, and dataset authors table is hard, as there may be only one non-ascii character per entire file. Out of 2 files I had (both with German encoding), I got one misclassified and an (intended) "ü" misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...

mslw commented

2023-11-21 12:59:14 +00:00

(Migrated from github.com)

I've added the possibility to specify encoding as an argument to load_tabby. However, building on top of previously proposed logic, this led to "if encoding given, use it; else try default then guess", and the code seems fairly ugly to me.

Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.

I've added the possibility to specify encoding as an argument to `load_tabby`. However, building on top of previously proposed logic, this led to "if encoding given, use it; else try default then guess", and the code seems fairly ugly to me. Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.

Pull request closed

This pull request cannot be reopened because the branch was deleted.

No reviewers

No labels

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

sfb1451/datalad-tabby!114

No description provided.

Rows
Columns