Have source dataset with archivable file format #8

Open
opened 2024-03-06 07:44:50 +00:00 by mih · 0 comments
mih commented 2024-03-06 07:44:50 +00:00 (Migrated from github.com)

Here is a log. This could have been done together with #7, but we do not yet know whether we can actually switch to this format.

# another dedicated dataset, again no annex because data
# are small and public
❯ datalad create --no-annex joe_and_lili
❯ cd joe_and_lili
# link dataset with data files in pickle format
❯ datalad clone -d . https://github.com/psychoinformatics-de/joe_and_lili_pickle.git source/joe_and_lili_pickle
❯ mkdir code
# invent little conversion helper
❯ cat << EOT > code/convert.py
from numpy import savez
import pandas as pd
import sys

# barf whenever there are not exactly two arguments
infile, outfile = sys.argv[1:]

# read the pickle
df = pd.read_pickle(infile)
# write Numpy's stable/simple npz format
savez(outfile, **df)
EOT
# save script in dataset so we know exactly what ran
❯ datalad save -m "Script to convert from pickle to Numpy's NPZ format"
# run conversion, capture provenance
❯ datalad run -m "Convert data to NPZ format" -i code -i source/joe_and_lili_pickle/data -o data sh -c 'mkdir -p data && for f in source/joe_and_lili_pickle/data/*.pickle; do echo $f; python code/convert.py "$f" "data/$(basename ${{f%.pickle}}).npz"; done'
# run conversion on the `toc` file which uses a different
# naming schema, but otherwise is the exact same thing
# we need to adjust slightly nevertheless
❯ cat << EOT > code/convert_toc.py
from numpy import savez
import pandas as pd
import sys

# barf whenever there are not exactly two arguments
infile, outfile = sys.argv[1:]

# read the pickle
df = pd.read_pickle(infile)
# we need to rename the 'file' key for compatibility with savez()
# we also recode the filenames to match the new format, and
# convert to UTF8 strings
df['files'] = [f'{f[:-7].decode()}.npz' for f in df.pop('file')]
# write Numpy's stable/simple npz format
savez(outfile, **df)
EOT
❯ datalad save -m "Script to convert TOC from pickle to Numpy's NPZ format" code
❯ datalad run -m "Convert 'toc' to NPZ format" -i code -i source/joe_and_lili_pickle/data/toc -o data/toc.npz sh -c 'mkdir -p data && python code/convert_toc.py "{inputs[1]}" "{outputs}"'
❯ datalad run -i source/joe_and_lili_pickle/LICENSE -o LICENSE 'sh -c "cp -LRv {inputs} {outputs}"'

❯ git remote add origin git@github.com:psychoinformatics-de/joe_and_lili.git
❯ git branch -M main
❯ git push -u origin main
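
A note on why `code/convert_toc.py` renames the `'file'` key: the first parameter of `numpy.savez()` is itself called `file`, so unpacking a mapping that contains a `'file'` key as keyword arguments collides with it. The sketch below reproduces this with an in-memory buffer. The `toc` dict and its filenames are made up for illustration; the real TOC content may look different.

```python
import io

import numpy as np

# Hypothetical stand-in for the unpickled TOC; real keys and values may differ.
toc = {"file": [b"sub-01.pickle", b"sub-02.pickle"]}


def try_savez(mapping):
    """Return None if savez() succeeds, else the TypeError it raised."""
    try:
        # Write to an in-memory buffer so nothing touches the disk.
        np.savez(io.BytesIO(), **mapping)
    except TypeError as exc:
        return exc
    return None


# The 'file' key clashes with savez()'s own `file` parameter.
collision = try_savez(toc)

# Same fix as in code/convert_toc.py: rename the key, strip the
# 7-character '.pickle' suffix, and decode bytes to str.
renamed = dict(toc)
renamed["files"] = [f"{f[:-7].decode()}.npz" for f in renamed.pop("file")]
ok = try_savez(renamed)
```

Any key other than `file` would do; `files` is simply what the script above uses.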

Outcome is at https://github.com/psychoinformatics-de/joe_and_lili
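
To check that a converted file is readable without pickle support, something like the following round trip could be used. It is only a sketch: the dict below is invented example data standing in for one unpickled source file, and the file name `example.npz` is hypothetical. `numpy.load()` defaults to `allow_pickle=False`, so a successful load suggests the arrays were stored in plain NPY form.

```python
import tempfile
from pathlib import Path

import numpy as np

# Invented example data; stands in for the content of one source file.
data = {"time": [0.0, 0.5, 1.0], "rate": [3, 1, 4]}

with tempfile.TemporaryDirectory() as tmp:
    outfile = Path(tmp) / "example.npz"  # hypothetical file name
    # Same call pattern as in code/convert.py
    np.savez(outfile, **data)
    # allow_pickle is False by default, so object arrays would fail here
    with np.load(outfile) as npz:
        restored = {key: npz[key] for key in npz.files}

# Compare every array against the values that went in.
roundtrip_ok = all(np.array_equal(restored[k], np.asarray(v)) for k, v in data.items())
```

The same loop over `npz.files` should work on the files under `data/` in the outcome dataset, assuming they contain no object arrays.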

Reference
sfb1451/a06-inf-clustered-network-pub#8