Notes while describing the palmerpenguins dataset #17
Some links:
- datalad-concepts YAML encoding of the dataset: https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/sdd/unreleased/examples/Distribution-penguins.yaml
- datalad-concepts YAML example: https://docs.google.com/spreadsheets/d/1YNZV5_kSa9HS8iB8bfSBQf9_sMr4d3cl/edit?usp=sharing&ouid=106984577182142381313&rtpof=true&sd=true

Looking at the YAML and Excel encodings, there are links to files that no longer resolve. As far as I can see, the actual raw data in the GitHub repo lives here: https://github.com/allisonhorst/palmerpenguins/tree/main/inst/extdata. The truer sources, though, would probably be the ones cited in the README's references section: https://github.com/allisonhorst/palmerpenguins?tab=readme-ov-file#references
Missing properties of Dataset

The following are either missing, in that the flat schema of Dataset does not include a specific property for them, or it is not immediately clear to me how they would be annotated in a hierarchical/linking sense:

- author(s): definitely necessary
- homepage: could be convenient, but could also be achieved by exactMatch, or a "see-also" annotation
- doi: could be convenient, but could also be achieved by adding an issuedIdentifier
- license: not sure if this should rather be on the Distribution class though
- funding: I am not sure where exactly this should be represented, perhaps on a different related class (like Study?)
- publications: I am not sure where exactly this should be represented...
Structure
Some thoughts.
We could make a Dataset each out of the three "data packages" linked above, and then link them to a single palmerpenguins dataset. Each of them currently has its own set of metadata (even though these mostly overlap). Each data package corresponds to the same type of data collected from a different penguin species (Adelie, Gentoo, Chinstrap).

On the other hand, each of these three packages is also technically a distribution...

The palmerpenguins repository on GitHub (and its distribution at the Zenodo DOI) is an entity with more content and meaning than the three data packages. It also contains R code to preprocess the raw data, as well as illustrations and other content.

tbc...
The Study context

I've been going back and forth on how to describe the dataset, and I think the issue is that I'm trying to solve it from the content perspective (there is a dataset in the git repo, with many files, some of which contain raw data with study measures; there are also three data packages, each a table, at different sources), and it's not clear to me yet how to bring these different sources together in the flat-data schema.

Perhaps looking at it from a Study perspective is a simpler approach, and then the data content/distribution can be tied into it later down the line.

Penguin metadata
Important and useful: the penguin data packages have extensive metadata attached to them, e.g. here: https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-pal.219.5
This includes e.g. under "Data Entities" a complete description of each column in a data table, including value data types and more.
Each line in a data table is what they describe as a "sample".
Here's a summary of a data table's columns:
("column name", "description", "storage type", "measure type", followed by my comments):
- studyName: Sampling expedition from which data were collected, generated, etc.
  - string
  - nominal
  - SH: there are 3 unique values of this across all samples. It feels like a StudyActivity, but looking at the description, these 3 unique studyNames all have the same column headings and all include many samples, i.e. they are exactly the same with respect to "under the influence of certain factors, with one or more instruments, following a set of protocols". I guess the difference is likely just timing (aka study visits/sessions); does this warrant three different StudyActivity records? If we keep in mind that a DataItem is generated by a StudyActivity, does it make sense to have the three sessions as the originators of all DataItems, or is it better to have other qualifiers for this process?
- Sample Number:
- Species: SubjectType (all three species will be a subtype of the penguin genus)
- Region:
- Island: Factor
- Stage: DataItem
- Individual ID: Subject
- Clutch Completion: DataItem
- Date Egg: DataItem
- Culmen Length: DataItem
- Culmen Depth: DataItem
- Flipper Length: DataItem
- Body Mass: DataItem
- Sex: DataItem
- Delta 15 N: DataItem
- Delta 13 C: DataItem
- Comments: DataItem, but not sure, since this comment is sample-based. It could just be fed into the generic comment slot of the DataItem that it pertains to...

What we haven't done yet is investigate how Protocols and Instruments would feed into the mapping of the above items to the flat schema...

Mappings
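As a rough sketch, the column assignments listed under "Penguin metadata" can be written down as a lookup table. The class names are taken from the notes; the names COLUMN_TO_CLASS and columns_by_class are hypothetical. Columns for which the notes give no assignment (Sample Number, Region) are left as None rather than guessed, and studyName is marked as StudyActivity only tentatively.

```python
from collections import defaultdict

# Column name -> flat-schema class, as listed in the notes.
# None marks columns for which the notes give no assignment yet.
COLUMN_TO_CLASS = {
    "studyName": "StudyActivity",  # tentative, see the SH comment
    "Sample Number": None,
    "Species": "SubjectType",
    "Region": None,
    "Island": "Factor",
    "Stage": "DataItem",
    "Individual ID": "Subject",
    "Clutch Completion": "DataItem",
    "Date Egg": "DataItem",
    "Culmen Length": "DataItem",
    "Culmen Depth": "DataItem",
    "Flipper Length": "DataItem",
    "Body Mass": "DataItem",
    "Sex": "DataItem",
    "Delta 15 N": "DataItem",
    "Delta 13 C": "DataItem",
    "Comments": "DataItem",  # uncertain: the comment is sample-based
}


def columns_by_class(mapping):
    """Group column names by assigned class (hypothetical helper)."""
    grouped = defaultdict(list)
    for column, cls in mapping.items():
        grouped[cls or "unassigned"].append(column)
    return dict(grouped)


if __name__ == "__main__":
    for cls, columns in columns_by_class(COLUMN_TO_CLASS).items():
        print(f"{cls}: {', '.join(columns)}")
```

Grouping by class makes it easy to see which columns still lack a target class before any records are generated.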
The following needs a rework:
Study
There isn't an intuitively and uniquely described "study" that I can find that relates to this dataset. There are multiple study-related things though:
Protocols and Instruments)

What makes sense to me is to create a container "study" record to describe the general study that led to the creation/publication of the three data packages. This study would be a container for the StudyActivitys described below.

StudyActivity
The raw CSV data (from the three files coming from the data packages) contain a column studyName with the description "Sampling expedition from which data were collected, generated, etc.". This maps neatly onto StudyActivity.

On second thought, I am not sure that this maps neatly...
After some internal discussion about the StudyActivity class, some notes/thoughts/questions:

- StudyActivity is a time-dependent activity "in the context of a study, where one or more subjects are studied under the influence of certain factors, with one or more instruments, following a set of protocols"
- StudyActivity links to a Study (and not the other way around, which I think is intentional based on @mih's description of "Design schemas to reduce churn" at https://concepts.inm7.de/about/)
- StudyActivity can have multiple Implemented protocols, Used instruments, and Studied subjects.

I am wondering now: how should I determine the scope of a StudyActivity during annotation?

AFAICT there would not be a physical aspect motivating separate scopes of multiple StudyActivitys, e.g. a site/location where the study activity was executed, because that would be a Factor.

The scope could be intentionally broad and include all protocols, instruments and subjects; then there would be a single StudyActivity in the context of a Study. Or it could be guided by its related records, e.g. the type of protocol that was applied, or by the types of DataItems that it generates.

In a neuroimaging study, gathering survey answers from participants might be one StudyActivity, and collection of MRI data might be another.

Or is time the only discerning factor?
So far this is how I've been approaching it.
But the thing that bothers me now is how one would map a DataItem to a Subject. We discussed that this is done via a StudyActivity, but what if a StudyActivity has links to multiple Subjects and it generates multiple DataItems? I don't currently see a way to link the correct values with the correct subjects. Does this imply that a StudyActivity should be defined as something that generates a single DataItem?

Regarding factors, protocols and instruments... (I'm editing this comment as I go along, so don't read the current state as my final assessment)
I will post a few excerpts from the journal article that first published/analysed the data, and try to map that to concepts here, also taking the columns in the actual data into account.
1
Starting with the first paragraph of the Field methods section:

Here we have study locations, which will be Factors. All islands are within the Anvers Island region of the Palmer Archipelago:

They mention study seasons being "austral summers of 2007/08, 2008/09, and 2009/10". These correspond to the values in the studyName column of the data (PAL0708, PAL0809, PAL0910). These would be encoded into StudyActivitys, although specific study activities that generate single DataItems would have a much narrower scope. This could be an argument (IIRC mentioned somewhere before by @mih) for introducing a new relationship on the StudyActivity class that would make it belong to a parent StudyActivity.

The rest of that paragraph gives summary information about where nests are located, and which species are on which islands. I think all of these would be encoded in the lower level StudyActivitys.

2
Then we have the second paragraph:
This is where Protocols and Instruments first come into play...

So:

Protocol:

- Penguin monitoring and selection: "study nests, where pairs of adults were present, were individually marked and chosen before the onset of egg laying, and consistently monitored. When study nests were found at the one-egg stage, both adults were captured to obtain blood samples used for molecular sexing and SI analyses, and measurements of structural size and body mass... After handling, individuals at study nests were further monitored to ensure the pair reached clutch completion, i.e., two eggs."
- Penguin blood sample collection: "At the time of capture, each adult penguin was quickly blood sampled (∼1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle. Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool. In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing."
- Penguin structural size and mass measurement: "Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation."

Instruments:

- sterile 3 ml syringe
- heparinized infusion needle
- 1.5 ml micro-centrifuge tube
- dial calipers ±0.1 mm
- ruler ±1 mm
- 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag

Generated DataItems:

Not sure if the interim blood sample should be encoded as a DataItem, or just seen as part of the protocol to get to the intended DataItem, which is the subject's sex and also some "delta C/N" values. As will be seen in the next section, there are even more steps involved in the process to get to the intended measurements. I guess it depends on the initial question: to which level of granularity might users want to query this metadata? Would they want to find all types of (interim) blood samples generated by using a specific protocol? Or would they only want to find the sex?

The above might change when connecting them with StudyActivitys...

3
Then we have the "Laboratory methods" section:
First, I'll map out the chronological process for getting to the Delta 15 N and Delta 13 C values from whole blood tubes (collected in the field):

Then, I'll map out the chronological process for getting to the Sex values from blood smears:

This is the structure of the flat data schema:

updated to account for the part_of relationship between two StudyActivitys and for the derived_from relationship between a DataItem and a Subject and between two Subjects (which was previously denoted as specimen_of)

Suggested structure for measuring culmen length:
Suggested structure for determining sex:
(Note: not all used Instruments are included in this diagram)

The Penguin Sex Determination Protocol would then have the complete description:

This large paragraph is an example of what I meant previously when I said "it makes annotation easier". Technically, this whole process can be broken down into separate Protocols, StudyActivitys and DataItems. Each sentence in the paragraph is essentially a different study activity that uses a different protocol and generates a different interim data item. If we annotated all of them, annotation would be more involved. But throwing it all into a single process that generates the single DataItem that we care about, i.e. "Sex", makes it both easier and specific to what we want to be able to query.

Regarding the island, I think that should rather be coded as a Dimension than a Factor. From scanning the paper, it does not look like the site is intended to be part of the study design as an independent variable. Rather, there are three islands that are part of the data collection site, and some penguins happen to be on some islands.

Latest. While mapping data from the penguin tables to the
flat-data schema, I'm encountering some uncertainties that I'm listing here for awareness/input:

How could the levels of a specific Dimension be encoded? A DataItem (e.g. SexOfPenguinX) has an outcome variable (aka Dimension) named Sex, which according to the source metadata has two possible values: MALE and FEMALE. The relevant value for the specific penguin will be entered into the value field of the specific DataItem, but is it also a relevant use case to encode the possible options into the specific Dimension? Factor deals with a similar use case by means of the factor_level_of field, which allows stating that a given Factor is a level of another Factor.

How to decide if a column in a CSV table is a Factor or Dimension? Specifically the Region ("Nominal region of Palmer LTER sampling grid") and Island ("Island near Palmer Station where samples were collected") columns of the penguin tables. There is only one unique value for Region across all table data: Anvers. There are three unique values for Island across all table data: Dream, Biscoe, Torgersen. Structurally, the model is:

Also, the study provides additional info about the islands (such as GPS locations) that can possibly be encoded somewhere.
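To make the questions that follow concrete, here is a minimal sketch of one possible reading, in which Region and Island are both treated as Factors and their observed values are encoded as levels via factor_level_of. Only the field name factor_level_of and the column values come from the notes; the Factor dataclass, the pid strings, and the helpers make_levels and levels_of are hypothetical and are not the flat-data schema's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Factor:
    """Hypothetical stand-in for a flat-data Factor record."""
    pid: str
    name: str
    # PID of the parent Factor that this record is a level of, if any.
    factor_level_of: Optional[str] = None


def make_levels(parent, values):
    """Create one level Factor per observed column value."""
    return [
        Factor(pid=f"{parent.pid}/{value.lower()}", name=value,
               factor_level_of=parent.pid)
        for value in values
    ]


def levels_of(parent, records):
    """Return the names of all Factors declared as levels of `parent`."""
    return [f.name for f in records if f.factor_level_of == parent.pid]


region = Factor(pid="factor/region", name="Region")
island = Factor(pid="factor/island", name="Island")

# Unique values reported for the two columns across all table data.
factors = [region, island]
factors += make_levels(region, ["Anvers"])
factors += make_levels(island, ["Dream", "Biscoe", "Torgersen"])

print(levels_of(island, factors))
```

Under this reading, a StudyActivity would presumably link to the level records rather than to the parent Factors. If the columns are treated as Dimensions instead, an analogous "level of" field would be needed there, which is the gap raised in the first question.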
Should these entities all be Factors? Should the three islands be levels of Island? Should Anvers be a level of Region? And should each StudyActivity for this whole study therefore have two related Factors: Anvers and the specific island that the DataItem generated by the StudyActivity was collected from?

How to map the provenance of a derived Subject? I am busy mapping the processes of doing molecular tests on penguin blood samples in order to get several outputs. In short, penguin blood is collected in the field and stored in a small tube as well as smeared on a piece of test paper; the test paper blood is used for molecular sexing, while the blood in the tube is used for calculating Delta15N and Delta13C values that are important for later analyses; the resulting sex and delta values are all columns in the CSV tables. So in the end these table values are all DataItems. I could approach this in a simple or less simple way:

Simple:
- Subject: an individual PenguinX
- StudyActivity: DeterminingTheSexofPenguinX, DeterminingDelta15NofPenguinX, DeterminingDelta13CofPenguinX
- Protocol: 2 partially overlapping protocols, one for collecting tube blood and determining the Delta values, and one for collecting test paper blood and determining the sex
- DataItem: SexofPenguinX, Delta15NofPenguinX, Delta13CofPenguinX

Not so simple: this involves splitting up the StudyActivitys into the processes of (a) collecting blood samples, and then (b) determining the outcome variables from those blood samples. Starting off, the process would be similar to above:

- Subject: an individual PenguinX
- StudyActivity: CollectingTubeBloodOfPenguinX, CollectingPaperBloodOfPenguinX
- Protocol: 2 protocols, one for collecting tube blood and one for collecting test paper blood
- DataItem: TubeBloodOfPenguinX, PaperBloodOfPenguinX

These DataItems become the new Subjects for the next steps, e.g. calculating the delta values:

- Subject: TubeBloodOfPenguinX
- StudyActivity: DeterminingDelta15NofPenguinX, DeterminingDelta13CofPenguinX
- Protocol: one protocol for determining the Delta values from tube blood
- DataItem: Delta15NofPenguinX, Delta13CofPenguinX

But now, for TubeBloodOfPenguinX, how is the DataItem connected to the subsequent Subject? Subject has derived_from with the range of Subject, i.e. TubeBloodOfPenguinX is derived from PenguinX. But is that sufficient for multi-step queries starting from PenguinX and its related StudyActivitys?

One-to-one or one-to-many relationship for StudyActivity that generates DataItem(s)? Assuming the StudyActivity is linked to a single Subject, which allows linking specific DataItem(s) to a specific Subject, is there any particular benefit to creating multiple granular StudyActivity objects such that they would be linked one-to-one to generated DataItems? This could be related to the associated Protocol. E.g. I have a protocol that spans several measurements ("Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation") and the associated StudyActivity would generate four DataItems. This would be one-to-many. But the Protocol could also be split up into its parts, and multiple associated StudyActivitys can be created, leading to a one-to-one relationship of StudyActivity and DataItem. But are there other querying-related benefits to either the one-to-one or one-to-many relationship?

Below is a hand-curated JSON object to help automate the process of creating all metadata objects. It is basically a set of tables, where the keys of the main object correspond to classes in the flat-data schema. Not every single key, though; some exist purely as helpers (e.g.
StudyActivityPerSubjectBasis), and some of their child key-value pairs are also helpers (e.g. _generatesDataItemWithDim). The keys of internal objects are all unique within a class-corresponding namespace (e.g. gentoo, palmerpenguins, or adelie in the Dataset namespace), and they will be used to generate PIDs.

With the above, some accompanying code can be used to generate independent objects (such as Studys, Factors, Dimensions), and then further code will be added to generate dependent objects from the CSV tables and helper data.

flat-data#22
datalad-catalog entities not covered by the flat schemas #24
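A minimal sketch of the generation step for independent objects, assuming a much smaller curated object than the real one. The contents of CURATED, the PID prefix, the schema_type field, and the helper names are all hypothetical. Only the overall idea is taken from the notes: top-level keys correspond to classes, some keys are pure helpers, and inner keys are unique per class and feed into PIDs. Treating every underscore-prefixed key as a helper is an assumption based on the single example _generatesDataItemWithDim.

```python
import json

# Hypothetical, heavily reduced stand-in for the hand-curated JSON object.
CURATED = json.loads("""
{
  "Dataset": {
    "palmerpenguins": {"name": "palmerpenguins"},
    "adelie": {"name": "Adelie data package"}
  },
  "Dimension": {
    "sex": {"name": "Sex"}
  },
  "StudyActivityPerSubjectBasis": {
    "determine_sex": {"_generatesDataItemWithDim": "sex"}
  }
}
""")

# Top-level keys that exist purely as helpers and map to no schema class.
HELPER_TABLES = {"StudyActivityPerSubjectBasis"}
PID_PREFIX = "https://example.org/penguins"  # assumption, not a real namespace


def make_pid(class_name, key):
    """Build a PID from the class namespace and the unique inner key."""
    return f"{PID_PREFIX}/{class_name.lower()}/{key}"


def independent_objects(curated):
    """Yield one record per entry of every class-corresponding table."""
    for class_name, table in curated.items():
        if class_name in HELPER_TABLES:
            continue
        for key, fields in table.items():
            # Drop helper key-value pairs (assumed to be underscore-prefixed).
            record = {k: v for k, v in fields.items() if not k.startswith("_")}
            yield {"pid": make_pid(class_name, key),
                   "schema_type": class_name, **record}


if __name__ == "__main__":
    for obj in independent_objects(CURATED):
        print(obj)
```

Dependent objects (per-penguin Subjects, StudyActivitys and DataItems) would then be produced in a second pass over the CSV rows, using the helper tables to decide which records each row gives rise to.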