datalad-course/casts/2hourintro

say 'Datasets are DataLad'"'"'s core data type. We will explore the concept of datasets by creating one with datalad create, using an optional configuration template (and optionally a description)'
run '### Code snippet 1
datalad create -c text2git DataLad-101'
say 'DataLad reports what it is doing while a command runs. At the end there is a summary, in this case "ok". What is inside a newly created dataset? We list the contents with ls.'
run '### Code snippet 2
cd DataLad-101
ls # ls does not show any output, because the dataset is empty.'
say 'GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means that edits to a file are recorded together with information about the change, the author, and the time, plus the ability to restore previous states of the dataset. Let'"'"'s take a look at the history, even though it is still small at this point'
run '### Code snippet 3
git log'
say 'DataLad, git-annex, and Git create hidden files and directories in your dataset. Make sure not to delete them!'
run '### Code snippet 4
ls -a # show also hidden files'
say 'The dataset is empty, so let'"'"'s put some PDFs inside. First, create a directory to store them in:'
run '### Code snippet 5
mkdir books'
say 'The tree command shows us the directory structure of the dataset. Apart from this new directory, it'"'"'s empty.'
run '### Code snippet 6
tree'
say 'We use wget to download a few books from the web. Caution: a longish real command ahead!'
run '### Code snippet 7
cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf -O byte-of-python.pdf && cd ../'
say 'Here they are:'
run '### Code snippet 8
tree'
say 'What has happened to our dataset now with this new content? We can use datalad status to find out:'
run '### Code snippet 9
datalad status'
say 'At the moment the files are untracked and thus unknown to any version control system. In order to version control the PDFs, we need to save them. We attach a meaningful summary of this change with the -m option:'
run '### Code snippet 10
datalad save -m "add books on Python and Unix to read later"'
say 'The save command reports what has been added to the dataset. Now we can see what this action looks like in our dataset'"'"'s history:'
run '### Code snippet 11
git log -p -n 1'
say 'It'"'"'s inconvenient that we saved the two books together - we should have saved them as independent modifications of the dataset. To see how single modifications can be saved, let'"'"'s download another book'
run '### Code snippet 12
cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../'
say 'It is good practice to check the dataset'"'"'s state frequently with the status command'
run '### Code snippet 13
datalad status'
say 'To save a single modification, provide a path to it!'
run '### Code snippet 14
datalad save -m "add reference book about git" books/progit.pdf'
say 'Let'"'"'s view the growing history (concise with the --oneline option):'
run '### Code snippet 15
# make the output a bit more concise with the --oneline option
git log --oneline'
say 'Finally, there is datalad download-url: it downloads a file and records where it came from in a single step'
run '### Code snippet 16
datalad download-url http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
--dataset . \
-m "add beginners guide on bash" \
-O books/bash_guide.pdf'
run '### Code snippet 17
ls books'
run '### Code snippet 18
datalad status'
say 'Let'"'"'s find out how we can modify files in a dataset. We create a text file with notes about the DataLad commands we have learned, using a here document'
run '### Code snippet 19
cat << EOT > notes.txt
One can create a new dataset with '"'"'datalad create [--description] PATH'"'"'.
The dataset is created empty.
EOT'
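The cat << EOT construct in the snippet above is a here document: the shell feeds everything between the delimiter lines to the command's stdin. A minimal standalone sketch (the file names are made up for this demo; note that an unquoted delimiter lets the shell expand variables, while a quoted one keeps the text literal):

```shell
# Here documents: the lines between the delimiters become stdin.
# An unquoted delimiter (EOT) lets the shell expand variables;
# a quoted delimiter ('EOT') keeps the text literal.
name="DataLad"
cat << EOT > expanded.txt
Hello $name
EOT
cat << 'EOT' > literal.txt
Hello $name
EOT
```

After running this, expanded.txt contains the expanded variable, while literal.txt contains the literal string $name.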
say 'As expected, there is a new file in the dataset. At first the file is untracked. We can save it without a path specification because it is the only existing modification'
run '### Code snippet 20
datalad status'
run '### Code snippet 21
datalad save -m "Add notes on datalad create"'
say 'Now let'"'"'s start to modify this text file by adding more notes to it. Think about this being a code file that you add functions to:'
run '### Code snippet 22
cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file
(modifications) to history. Note to self:
Always use informative, concise commit messages.
EOT'
run '### Code snippet 23
datalad status'
say 'The modification can be saved as well'
run '### Code snippet 24
datalad save -m "add note on datalad save"'
say 'And the history gives an accurate record of what happened to this file'
run '### Code snippet 25
git log -p -n 2'
say 'The next challenge is to clone an existing dataset from the web as a subdataset. First, we create a location for it'
run '### Code snippet 26
# we are in the root of DataLad-101
mkdir recordings'
say 'We need to clone the dataset as a subdataset. For this, we use the datalad clone command with a --dataset option and a path. Otherwise, the dataset would not be registered as a subdataset!'
run '### Code snippet 27
datalad clone --dataset . \
https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow'
say 'Let'"'"'s take a look at the directory structure after cloning'
run '### Code snippet 28
tree -d # we limit the output to directories'
say 'And now let'"'"'s look into these seminar series folders: there are hundreds of mp3 files, yet the download only took a few seconds! How can that be?'
run '### Code snippet 29
cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
ls'
say 'Upon cloning a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size. The large files are not yet functional at this point (try opening one)'
run '### Code snippet 30
cd ../ # in longnow/
du -sh # Unix command to show size of contents'
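The reason the files are non-functional is that git-annex replaces each annexed file in the worktree with a symlink pointing under .git/annex; before datalad get, the link target does not exist. A minimal plain-shell sketch of such a dangling symlink (no DataLad needed; demo_dir and the target path are made up for illustration):

```shell
# Simulate an annexed-but-not-retrieved file: a symlink whose
# target does not exist yet. (git-annex points into
# .git/annex/objects; the path here is made up for illustration.)
mkdir -p demo_dir
ln -sf ../.git/annex/objects/placeholder demo_dir/track.mp3
[ -L demo_dir/track.mp3 ] && echo "is a symlink"       # the link itself exists
[ ! -e demo_dir/track.mp3 ] && echo "content missing"  # but it dangles
```

This is why tools like du report almost no disk usage until the content has been obtained with datalad get.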
say 'But how large would the dataset be if we had all the content?'
run '### Code snippet 31
datalad status --annex'
say 'Now let'"'"'s finally get some content in this dataset. This is done with the datalad get command'
run '### Code snippet 32
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3'
say 'Datalad status can also summarize how much of the content is already present locally:'
run '### Code snippet 33
datalad status --annex all'
say 'Let'"'"'s get a few more files. Note how already obtained files are not downloaded again:'
run '### Code snippet 34
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3'
say 'On dataset nesting: you have seen the history of DataLad-101, but the subdataset has a standalone history as well! We can find out who created it!'
run '### Code snippet 35
git log --reverse'
say 'We can make a note about this:'
run '### Code snippet 36
# in the root of DataLad-101:
cd ../../
cat << EOT >> notes.txt
The command '"'"'datalad clone URL/PATH [PATH]'"'"'
installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing
dataset (as a subdataset), remember to specify the
root of the superdataset with the '"'"'-d'"'"' option.
EOT
datalad save -m "Add note on datalad clone"'
say 'The superdataset only stores the version of the subdataset. Let'"'"'s take a look at what the superdataset'"'"'s history looks like'
run '### Code snippet 37
git log -p'
say 'We can find this shasum in the subdataset'"'"'s history: it'"'"'s the most recent change'
run '### Code snippet 38
cd recordings/longnow
git log --oneline'
run '### Code snippet 39
cd ../../'
say 'Let'"'"'s create a data analysis project with the yoda procedure'
run '### Code snippet 126
# inside of DataLad-101
datalad create -c yoda --dataset . midterm_project'
say 'Now clone input data as a subdataset'
run '### Code snippet 127
cd midterm_project
# we are in midterm_project, thus -d . points to the root of it.
datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/'
say 'Here is what the directory structure looks like'
run '### Code snippet 128
cd ../
tree -d
cd midterm_project'
say 'Let'"'"'s create code for an analysis'
run '### Code snippet 129
cat << EOT > code/script.py
import pandas as pd
import seaborn as sns
import datalad.api as dl
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
data = "input/iris.csv"
# make sure that the data are obtained (get will also install linked sub-ds!):
dl.get(data)
# prepare the data as a pandas dataframe
df = pd.read_csv(data)
attributes = ["sepal_length", "sepal_width", "petal_length","petal_width", "class"]
df.columns = attributes
# create a pairplot to plot pairwise relationships in the dataset
plot = sns.pairplot(df, hue='"'"'class'"'"', palette='"'"'muted'"'"')
plot.savefig('"'"'pairwise_relationships.png'"'"')
# perform a K-nearest-neighbours classification with scikit-learn
# Step 1: split data in test and training dataset (20:80)
array = df.values
X = array[:,0:4]
Y = array[:,4]
test_size = 0.20
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y,
test_size=test_size,
random_state=seed)
# Step 2: Fit the model and make predictions on the test dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)
# Step 3: Save the classification report
report = classification_report(Y_test, predictions, output_dict=True)
pd.DataFrame(report).transpose().to_csv('"'"'prediction_report.csv'"'"')
EOT'
say 'datalad status will show a new file'
run '### Code snippet 130
datalad status'
say 'Save the analysis to the history'
run '### Code snippet 131
datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py'
say 'The datalad run command can execute a command reproducibly'
run '### Code snippet 132
datalad run -m "analyze iris data with classification analysis" \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"'
say 'Let'"'"'s take a look at the history'
run '### Code snippet 133
git log --oneline'
say 'Create human-readable information for your project'
run '### Code snippet 134
# with the >| redirection we are replacing existing contents in the file
cat << EOT >| README.md
# Midterm YODA Data Analysis Project
## Dataset structure
- All inputs (i.e. building blocks from other sources) are located in input/.
- All custom code is located in code/.
- All results (i.e., generated files) are located in the root of the dataset:
- "prediction_report.csv" contains the main classification metrics.
- "output/pairwise_relationships.png" is a plot of the relations between features.
EOT'
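The >| operator used above deserves a note: with the shell's noclobber option (set -C) enabled, a plain > refuses to overwrite an existing file, while >| forces the overwrite. A small standalone demo (clobber_demo.txt is a made-up name):

```shell
# noclobber (set -C) makes '>' refuse to truncate existing files;
# '>|' overrides that protection.
rm -f clobber_demo.txt
set -C
echo "first" > clobber_demo.txt
( echo "second" > clobber_demo.txt ) 2>/dev/null || echo "plain > refused"
echo "second" >| clobber_demo.txt    # forced overwrite succeeds
set +C
```

Without noclobber, > and >| behave identically; using >| in the cast makes the intent to replace existing content explicit either way.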
say 'The README file is now modified'
run '### Code snippet 135
datalad status'
say 'Let'"'"'s save this change'
run '### Code snippet 136
datalad save -m "Provide project description" README.md'
say 'Computational reproducibility: add a software container'
run '### Code snippet 137
# we are in the midterm_project subdataset
datalad containers-add midterm-software --url shub://adswa/resources:1'
say 'The software container got added to your dataset'"'"'s history'
run '### Code snippet 138
git log -n 1 -p'
say 'The analysis can be rerun in a software container'
run '### Code snippet 139
datalad containers-run -m "rerun analysis in container" \
--container-name midterm-software \
--input "input/iris.csv" \
--output "prediction_report.csv" \
--output "pairwise_relationships.png" \
"python3 code/script.py"'
say 'Here is what that looks like in the history:'
run '### Code snippet 140
git log -p -n 1'
say 'Back in the superdataset, the new subdataset state shows up as a modification'
run '### Code snippet 141
cd ../
datalad status'
say 'Save the change in the superdataset'
run '### Code snippet 142
datalad save -d . -m "add container and execute analysis within container" midterm_project'