inm7/annotate.inm7.de-data

Notes while describing the `palmerpenguins` dataset #17

Open

opened 2025-06-04 09:29:03 +00:00 by jsheunis · 11 comments

jsheunis commented

2025-06-04 09:29:03 +00:00

Owner

Some links:

github: https://github.com/allisonhorst/palmerpenguins
zenodo doi: https://doi.org/10.5281/zenodo.3960217
An old datalad-concepts YAML encoding of the dataset: https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/sdd/unreleased/examples/Distribution-penguins.yaml
Tabby sheet summary, derived from the datalad-concepts YAML example: https://docs.google.com/spreadsheets/d/1YNZV5_kSa9HS8iB8bfSBQf9_sMr4d3cl/edit?usp=sharing&ouid=106984577182142381313&rtpof=true&sd=true

Looking at the YAML and Excel encodings, there are links to files that no longer resolve. As far as I can see, the actual raw data in the GitHub repo lives here: https://github.com/allisonhorst/palmerpenguins/tree/main/inst/extdata. However, the truer sources would probably be the ones cited here: https://github.com/allisonhorst/palmerpenguins?tab=readme-ov-file#references:

https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.219.5
https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-pal&identifier=220&revision=7
https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-pal&identifier=221&revision=8

Missing properties of `Dataset`

The following are either missing, in that the flat schema of `Dataset` does not include a specific property for them, or it is not immediately clear to me how they would be annotated in a hierarchical/linking sense:

author(s) - definitely necessary
homepage - could be convenient, but could also be achieved by exactMatch, or a "see-also" annotation
doi - could be convenient, but could also be achieved by adding an issuedIdentifier
license - not sure if this should rather be on the Distribution class though
funding - I am not sure where exactly this should be represented, perhaps on a different related class (like Study?)
publications - I am not sure where exactly this should be represented

...

Structure

Some thoughts.

We could make a Dataset out of each of the three "data packages" linked above, and then link them to a single palmerpenguins dataset. Each of them currently has its own set of metadata (even though these mostly overlap). Each data package corresponds to the same type of data collected from a different penguin species (Adelie, Gentoo, Chinstrap).

On the other hand, each of these three packages is also technically a distribution...

The palmerpenguins repository on GitHub (and its distribution at the Zenodo DOI) is an entity with more content and meaning than the three data packages. It also has R code to preprocess the raw data, as well as illustrations and more content.

tbc...

jsheunis commented

2025-06-04 14:27:28 +00:00

Author

Owner

The `Study` context

I've been going back and forth on how to describe the dataset, and I think the issue is that I'm trying to solve it from the content perspective (there is a dataset in the git repo, with many files, some of which contain raw data with study measures; there are also three data packages, each a table, at different sources), and it's not clear to me yet how to bring these different sources together in the flat-data schema.

Perhaps looking at it from a Study perspective is a simpler approach, and then the data content/distribution can be tied into it later down the line.

Penguin metadata

Important and useful: the penguin data packages have extensive metadata attached to them, e.g. here: https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-pal.219.5

This includes e.g. under "Data Entities" a complete description of each column in a data table, including value data types and more.

Each line in a data table is what they describe as a "sample".

Here's a summary of a data table's columns:
("column name", "description", "storage type", "measure type", followed by my comments):

studyName:

Sampling expedition from which data were collected, generated, etc.
string
nominal
SH: there are 3 unique values of this across all samples. It feels like a StudyActivity, but looking at the description:

"activity in the context of a study, where one or more subjects are studied under the influence of certain factors, with one or more instruments, following a set of protocols"

These 3 unique studyNames all have the same column headings and all include many samples, i.e. they share exactly the same values for "under the influence of certain factors, with one or more instruments, following a set of protocols". I guess the difference is likely just timing (i.e. study visits/sessions). Does this warrant three different StudyActivity records? If we keep in mind that a DataItem is generated by a StudyActivity, does it make sense to have the three sessions as the originators of all DataItems, or is it better to have other qualifiers for this process?

Sample Number:

continuous numbering sequence for each sample
integer
interval
SH: this is largely arbitrary, I don't think this has to be encoded somewhere.

Species:

a string representing the species of an organism
string
nominal
SH: SubjectType (all three species will be a subtype of the penguin genus)

Region:

Nominal region of Palmer LTER sampling grid
string
nominal
SH: this is the same value for all samples, so IMO it is not that important to map to any particular thing in the flat schema.

Island:

Island near Palmer Station where samples were collected
string
nominal
SH: an independent variable, i.e. Factor

Stage:

Reproductive stage at sampling
string
nominal
SH: DataItem

Individual ID:

A unique ID for each individual in dataset
string
nominal
SH: Subject

Clutch Completion:

Was the study nest observed with a full clutch, i.e., 2 eggs
string
nominal
SH: DataItem

Date Egg:

Date study nest observed with 1 egg (sampled)
date
dateTime
SH: DataItem

Culmen Length:

length of the dorsal ridge of a bird's bill
decimal
ratio
SH: DataItem

Culmen Depth:

depth of the dorsal ridge of a bird's bill
decimal
ratio
SH: DataItem

Flipper Length:

Length of flipper
integer
ratio
SH: DataItem

Body Mass:

Mass of body
integer
ratio
SH: DataItem

Sex:

code for the sex of an animal
string
nominal
SH: DataItem

Delta 15 N:

a measure of the ratio of stable isotopes 15N:14N
decimal
ratio
SH: DataItem

Delta 13 C:

a measure of the ratio of stable isotopes 13C:12C
decimal
ratio
SH: DataItem

Comments:

Text field to provide additional relevant information for data
string
nominal
SH: DataItem, but not sure, since this comment is sample-based. could just be fed into the generic comment slot of the DataItem that it pertains to...

What we haven't done yet is investigate how Protocols and Instruments would feed into the mapping of the above items to the flat schema....
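The tentative assignments in the "SH:" comments above can be collected into a single lookup table, which makes it easier to see what is still unmapped. This is only a sketch of the current state of these notes: the column names are the short names used above (not necessarily the exact CSV headers), `COLUMN_TO_CLASS` and `columns_for` are hypothetical helper names, and `None` marks columns that the notes suggest leaving unmapped.

```python
# Tentative column-to-class mapping, copied from the "SH:" comments above.
# None means "probably no need to map this column to the flat schema".
COLUMN_TO_CLASS = {
    "studyName": "StudyActivity",      # still under discussion
    "Sample Number": None,             # largely arbitrary
    "Species": "SubjectType",
    "Region": None,                    # same value for all samples
    "Island": "Factor",
    "Stage": "DataItem",
    "Individual ID": "Subject",
    "Clutch Completion": "DataItem",
    "Date Egg": "DataItem",
    "Culmen Length": "DataItem",
    "Culmen Depth": "DataItem",
    "Flipper Length": "DataItem",
    "Body Mass": "DataItem",
    "Sex": "DataItem",
    "Delta 15 N": "DataItem",
    "Delta 13 C": "DataItem",
    "Comments": "DataItem",            # or the comment slot of a related DataItem
}


def columns_for(schema_class):
    """Return the columns currently assigned to the given schema class."""
    return [col for col, cls in COLUMN_TO_CLASS.items() if cls == schema_class]


if __name__ == "__main__":
    for cls in ("StudyActivity", "SubjectType", "Subject", "Factor", "DataItem"):
        print(cls, "<-", columns_for(cls))
```

`Protocol` and `Instrument` do not appear as values yet, which matches the open point above.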

Mappings

The following needs a rework:

Study

There isn't an intuitively and uniquely described "study" that I can find that relates to this dataset. There are multiple study-related things though:

There's this paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081#s2 (contains extended descriptions of methods and protocols and measures etc -> could feed into Protocols and Instruments)
There is a common abstract of all three data packages

What makes sense to me is to create a container "study" record to describe the general study, that led to the creation/publication of the three data packages. This study would be a container for the StudyActivitys described below.

StudyActivity

The raw CSV data (from the three files coming from the data packages) contain a column studyName with the description "Sampling expedition from which data were collected, generated, etc.". This maps neatly onto StudyActivity.

On second thought, I am not sure that this maps neatly...


jsheunis commented

2025-06-05 09:36:13 +00:00

Author

Owner

After some internal discussion about the StudyActivity class, some notes/thoughts/questions:

A StudyActivity is a time-dependent activity "in the context of a study, where one or more subjects are studied under the influence of certain factors, with one or more instruments, following a set of protocols"
A StudyActivity links to a Study (and not the other way around, which I think is intentional based on @mih's description of "Design schemas to reduce churn" at https://concepts.inm7.de/about/)
A StudyActivity can have multiple Implemented protocols, Used instruments, and Studied subjects.

I am wondering now, how should I determine the scope of a StudyActivity during annotation?

AFAICT there would not be a physical aspect motivating separate scopes of multiple StudyActivitys, e.g. a site/location where the study activity was executed, because that would be a Factor.

The scope could be intentionally broad and include all protocols and instruments and subjects, then there would be a single StudyActivity in the context of a Study. Or it could be guided by its related records, e.g. the type of protocol that was applied, or by the types of DataItems that it generates.

In a neuroimaging study, gathering survey answers from participants might be one StudyActivity, and collection of MRI data might be another.

Or is time the only discerning factor?


loj commented

2025-06-05 09:46:26 +00:00

Owner

> In a neuroimaging study, gathering survey answers from participants might be one StudyActivity, and collection of MRI data might be another.

So far this is how I've been approaching it.


jsheunis commented

2025-06-05 09:54:55 +00:00

Author

Owner

But the thing that bothers me now is: how would one map a DataItem to a Subject? We discussed that this is done via a StudyActivity, but what if a StudyActivity has links to multiple Subjects and generates multiple DataItems? I don't currently see a way to link the correct values with the correct subjects. Does this imply that a StudyActivity should be defined as something that generates a single DataItem?
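To make the ambiguity concrete, here is a minimal sketch with plain Python dataclasses. It is not the actual schema: the `generated_items` list is a hypothetical inverse of the "DataItem is generated by a StudyActivity" link, and the subject and item identifiers are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StudyActivity:
    name: str
    studied_subjects: list = field(default_factory=list)
    # Hypothetical inverse link, only for illustration.
    generated_items: list = field(default_factory=list)


def candidate_subjects(activity, item):
    """Subjects an item could belong to, using only activity-level links."""
    if item not in activity.generated_items:
        return []
    return list(activity.studied_subjects)


# Broad scope: one activity per season, many subjects and many items.
broad = StudyActivity("PAL0708", ["subject-1", "subject-2"], ["mass-1", "mass-2"])
# Narrow scope: one activity per subject and measurement.
narrow = StudyActivity("weigh subject-1", ["subject-1"], ["mass-1"])

if __name__ == "__main__":
    print(candidate_subjects(broad, "mass-1"))   # two candidates: ambiguous
    print(candidate_subjects(narrow, "mass-1"))  # one candidate: resolved
```

With the broad activity the item cannot be attributed to one subject; it only resolves once the activity is narrowed to a single subject, or if some additional direct link between item and subject were available.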


jsheunis commented

2025-06-10 09:42:47 +00:00

Author

Owner

Regarding factors, protocols and instruments... (I'm editing this comment as I go along, so don't read the current state as my final assessment)

I will post a few excerpts from the journal article that first published/analysed the data, and try to map that to concepts here, also taking the columns in the actual data into account.

1

Starting with the first paragraph of the Field methods section:

Field research was conducted on Pygoscelis penguins nesting on several islands within the Palmer Archipelago west of the AP near Anvers Island (64°46′S, 64°03′W, Fig. 1a-c), during the austral summers of 2007/08, 2008/09, and 2009/10. Specifically, study nests were located on Biscoe (64°48′S, 63°46′W), Torgersen (64°46′S, 64°04′W), and Dream (64°43′S, 64°13′W) Islands (Fig. 1c). Each study season, Adélie penguin study nests (n = 30) were distributed equally between the three study islands, with 10 nests located on each island. Gentoo penguin study nests (n = 30) were all located on Biscoe Island, while chinstrap penguin study nests (n = 15) were all located on Dream Island (Fig. 1c). The reduced sample size for chinstraps was due to the overall smaller number of individuals breeding at rookeries on Dream Island.

Here we have study locations, which will be Factors. All islands are within the Anvers Island region of the Palmer Archipelago:

Dream (64°43′S, 64°13′W)
Biscoe (64°48′S, 63°46′W)
Torgersen (64°46′S, 64°04′W)
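If the island coordinates end up being stored alongside the `Factor` records, they would probably be more useful as decimal degrees. A small sketch of that conversion follows, assuming the degree/minute notation quoted from the paper; `to_decimal` and `ISLANDS` are hypothetical names, not part of any schema or tool.

```python
import re

# Matches e.g. "64°43′S": degrees, minutes (prime sign U+2032), hemisphere.
_DEG_MIN = re.compile(r"(\d+)°(\d+)′([NSEW])")

ISLANDS = {
    "Dream": "64°43′S, 64°13′W",
    "Biscoe": "64°48′S, 63°46′W",
    "Torgersen": "64°46′S, 64°04′W",
}


def to_decimal(coord):
    """Convert 'DD°MM′H, DD°MM′H' into a (latitude, longitude) tuple."""
    values = []
    for degrees, minutes, hemisphere in _DEG_MIN.findall(coord):
        value = int(degrees) + int(minutes) / 60
        # South and west are negative by convention.
        values.append(-value if hemisphere in "SW" else value)
    if len(values) != 2:
        raise ValueError(f"expected a latitude/longitude pair, got {coord!r}")
    return tuple(values)


if __name__ == "__main__":
    for name, coord in ISLANDS.items():
        lat, lon = to_decimal(coord)
        print(f"{name}: {lat:.4f}, {lon:.4f}")
```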

They mention study seasons being "austral summers of 2007/08, 2008/09, and 2009/10". These correspond to the values in the studyName column of the data (PAL0708, PAL0809, PAL0910). These would be encoded into StudyActivitys, although specific study activities that generate single DataItems would have a much narrower scope. This could be an argument (IIRC mentioned somewhere before by @mih ) for introducing a new relationship on the StudyActivity class that would make it belong to a parent StudyActivity.

The rest of that paragraph gives summary information about where nests are located, and which species are on which islands. I think all of these would be encoded in the lower level StudyActivitys.
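A rough sketch of both ideas: deriving the season from a `studyName` value, and a parent link between activities. The `part_of` attribute is a hypothetical name for the proposed parent relationship (nothing like it exists in the schema as far as these notes go), and the parsing assumes the `PALyyYY` pattern seen in the three values, with years in the 2000s.

```python
import re
from dataclasses import dataclass
from typing import Optional


def season_from_study_name(study_name):
    """Turn e.g. 'PAL0708' into '2007/08' (assumes years in the 2000s)."""
    match = re.fullmatch(r"PAL(\d{2})(\d{2})", study_name)
    if match is None:
        raise ValueError(f"unexpected studyName: {study_name!r}")
    start, end = match.groups()
    return f"20{start}/{end}"


@dataclass
class StudyActivity:
    name: str
    # Hypothetical parent relationship, as discussed above.
    part_of: Optional["StudyActivity"] = None


def root_activity(activity):
    """Follow part_of links up to the top-level activity."""
    while activity.part_of is not None:
        activity = activity.part_of
    return activity


season = StudyActivity("PAL0708")
weighing = StudyActivity("body mass measurement", part_of=season)

if __name__ == "__main__":
    print(season_from_study_name(root_activity(weighing).name))
```

With such a link, a narrow activity that generates a single `DataItem` could still be traced back to the season it belongs to.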

2

Then we have the second paragraph:

Each season, study nests, where pairs of adults were present, were individually marked and chosen before the onset of egg laying, and consistently monitored. When study nests were found at the one-egg stage, both adults were captured to obtain blood samples used for molecular sexing and SI analyses, and measurements of structural size and body mass. At the time of capture, each adult penguin was quickly blood sampled (∼1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle. Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool. In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing. Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation. After handling, individuals at study nests were further monitored to ensure the pair reached clutch completion, i.e., two eggs.

This is where Protocols and Instruments first come into play...

So:

penguins are selected => people go to the sites and monitor nests, specifically looking for "one-egg stage" nests
data samples are taken => people capture both adults, obtain blood samples, take structural and mass measurements

Protocol:

Penguin monitoring and selection: "study nests, where pairs of adults were present, were individually marked and chosen before the onset of egg laying, and consistently monitored. When study nests were found at the one-egg stage, both adults were captured to obtain blood samples used for molecular sexing and SI analyses, and measurements of structural size and body mass... After handling, individuals at study nests were further monitored to ensure the pair reached clutch completion, i.e., two eggs."
Penguin blood sample collection: "At the time of capture, each adult penguin was quickly blood sampled (∼1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle. Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool. In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing."
Penguin structural size and mass measurement: "Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation."

Instruments:

sterile 3 ml syringe
heparinized infusion needle
1.5 ml micro-centrifuge tube
dial calipers ±0.1 mm
ruler ±1 mm
5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag

Generated DataItems:

Culmen Length
Culmen Depth
Flipper Length
Body Mass

Not sure if the interim blood sample should be encoded as a DataItem, or just seen as part of the protocol to get to the intended DataItem, which is the subject's sex and also some "delta C/N" values. As will be seen in the next section, there are even more steps involved in the process to get to the intended measurements. I guess it depends on the initial question: to which level of granularity might users want to query this metadata. Would they want to find all types of (interim) blood samples generated by using a specific protocol? Or would they only want to find the sex?

The above might change when connecting them with StudyActivitys...
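As an interim step, the field measurements can be tied to the instruments and protocol named in the excerpt. The sketch below only restates what the paragraph says; the dictionary layout and the `items_using` helper are hypothetical and not a proposal for the schema itself.

```python
PROTOCOL = "Penguin structural size and mass measurement"

# DataItem -> instrument, as stated in the field methods excerpt above.
MEASUREMENTS = {
    "Culmen Length": "dial calipers ±0.1 mm",
    "Culmen Depth": "dial calipers ±0.1 mm",
    "Flipper Length": "ruler ±1 mm",
    "Body Mass": "5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag",
}


def items_using(instrument):
    """Return the DataItems that were generated with the given instrument."""
    return [item for item, used in MEASUREMENTS.items() if used == instrument]


if __name__ == "__main__":
    for instrument in sorted(set(MEASUREMENTS.values())):
        print(f"{PROTOCOL} / {instrument}: {items_using(instrument)}")
```

The syringe, needle, and micro-centrifuge tubes are left out here because they belong to the blood sample collection protocol, whose resulting `DataItem`s are still an open question.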

3

Then we have the "Laboratory methods" section:

Within 12 hours (hrs) of field collection, tubes containing whole blood were centrifuged to separate plasma and red blood cell (RBC) fractions, which were stored separately and frozen at −80 degrees Celsius (°C). Tubes containing whole blood smears on filter paper were allowed to dry in a desiccator. After drying, tubes were sealed and frozen at −80°C.

Tubes containing RBCs were first allowed to dry to a consistent mass in a drying oven at 60°C. Using a mortar and pestle lined with clean weighing paper, dried RBC pellets were homogenized into a powder. Each mortar and pestle was washed and dried in between sample processing. Aliquots of powdered samples were transferred to 8×5 mm pressed tin capsules (Elemental Microanalysis) and weighed (∼2 mg) using an analytical balance. Samples were organized in 96-microwell plates and analyzed for δ13C and δ15N SI signatures using an elemental analyzer interfaced with an isotope ratio mass spectrometer at the Stable Isotope Facility, University of California (UC) - Davis. Data expressed as δ13C or δ15N were calculated using the following equation: δ13C or δ15N = ([Rsample/Rstandard]-1)×1000, where Rsample is the ratio of the heavy to light isotope for either 13C/12C or 15N/14N, and Rstandard is the heavy to light isotope ratios for international standards - Vienna PeeDee Belemnite for carbon, and atmospheric N2 (Air) for nitrogen.

Whole blood smears were allowed to dry a second time in a desiccator for at least 24 hrs prior to analysis. Sex of adult Pygoscelis penguins was determined molecularly using PCR amplification as outlined by Griffiths et al. [53], as well as Fridolfsson and Ellegren [54]. See Supporting Information Text S1 for specific details regarding PCR methods including extraction, amplification, and gel electrophoresis.

First, I'll map out the chronological process for getting to the Delta 15 N and Delta 13 C values from whole blood tubes (collected in the field):

whole blood tubes centrifuged to separate plasma and red blood cell (RBC) fractions
RBC tubes frozen at −80 degrees Celsius (°C)
allowed to dry to a consistent mass in a drying oven at 60°C => dried RBC pellets
dried RBC pellets homogenized into a powder using a mortar and pestle lined with clean weighing paper (Each mortar and pestle was washed and dried in between sample processing)
Aliquots of powdered samples were transferred to 8×5 mm pressed tin capsules (Elemental Microanalysis) and weighed (∼2 mg) using an analytical balance
Samples were organized in 96-microwell plates and analyzed for δ13C and δ15N SI signatures using an elemental analyzer interfaced with an isotope ratio mass spectrometer at the Stable Isotope Facility, University of California (UC) - Davis
Data expressed as δ13C or δ15N were calculated using the following equation: δ13C or δ15N = ([Rsample/Rstandard]-1)×1000, where Rsample is the ratio of the heavy to light isotope for either 13C/12C or 15N/14N, and Rstandard is the heavy to light isotope ratios for international standards - Vienna PeeDee Belemnite for carbon, and atmospheric N2 (Air) for nitrogen.
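The delta equation in the last step is simple enough to write down directly. A minimal sketch, with a hypothetical function name and arbitrary input ratios (not the real values of the international standards):

```python
def delta_per_mil(r_sample, r_standard):
    """delta = (Rsample / Rstandard - 1) * 1000, as quoted from the paper.

    r_sample and r_standard are heavy-to-light isotope ratios,
    e.g. 13C/12C or 15N/14N.
    """
    if r_standard == 0:
        raise ValueError("r_standard must be non-zero")
    return (r_sample / r_standard - 1) * 1000


if __name__ == "__main__":
    # Arbitrary illustrative ratios: a sample 1% richer in the heavy isotope
    # than the standard gives a delta of about 10.
    print(delta_per_mil(0.0101, 0.01))
```

A sample with the same ratio as the standard gives 0, and negative values mean the sample is depleted in the heavy isotope relative to the standard.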

Then, I'll map out the chronological process for getting to the Sex values from blood smears:

Tubes containing whole blood smears on filter paper were allowed to dry in a desiccator.
After drying, tubes were sealed and frozen at −80°C.
Whole blood smears were allowed to dry a second time in a desiccator for at least 24 hrs prior to analysis.
Sex of adult Pygoscelis penguins was determined molecularly using PCR amplification as outlined by Griffiths et al. [53], as well as Fridolfsson and Ellegren [54]. See Supporting Information Text S1 for specific details regarding PCR methods including extraction, amplification, and gel electrophoresis.

First, I'll map out the chronological process for getting to the `Delta 15 N` and `Delta 13 C` values from whole blood tubes (collected in the field): - whole blood tubes centrifuged to separate plasma and red blood cell (RBC) fractions - RBC tubes frozen at −80 degrees Celsius (°C) - allowed to dry to a consistent mass in a drying oven at 60°C => dried RBC pellets - dried RBC pellets homogenized into a powder using a mortar and pestle lined with clean weighing paper (Each mortar and pestle was washed and dried in between sample processing) - Aliquots of powdered samples were transferred to 8×5 mm pressed tin capsules (Elemental Microanalysis) and weighed (∼2 mg) using an analytical balance - Samples were organized in 96-microwell plates and analyzed for δ13C and δ15N SI signatures using an elemental analyzer interfaced with an isotope ratio mass spectrometer at the Stable Isotope Facility, University of California (UC) - Davis - Data expressed as δ13C or δ15N were calculated using the following equation: δ13C or δ15N = ([Rsample/Rstandard]-1)×1000, where Rsample is the ratio of the heavy to light isotope for either 13C/12C or 15N/14N, and Rstandard is the heavy to light isotope ratios for international standards - Vienna PeeDee Belemnite for carbon, and atmospheric N2 (Air) for nitrogen. Then, I'll map out the chronological process for getting to the `Sex` values from blood smears: - Tubes containing whole blood smears on filter paper were allowed to dry in a desiccator. - After drying, tubes were sealed and frozen at −80°C. - Whole blood smears were allowed to dry a second time in a desiccator for at least 24 hrs prior to analysis. - Sex of adult Pygoscelis penguins was determined molecularly using PCR amplification as outlined by Griffiths et al. [53], as well as Fridolfsson and Ellegren [54]. See Supporting Information Text S1 for specific details regarding PCR methods including extraction, amplification, and gel electrophoresis.
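
For reference, the delta equation quoted from the paper can be written as a small function. This is only a sketch of the formula itself; the function name `delta_permil` and the example ratios are made up for illustration and are not values from the study.

```python
def delta_permil(r_sample: float, r_standard: float) -> float:
    """Return a delta value in per mil: ((R_sample / R_standard) - 1) * 1000.

    r_sample and r_standard are heavy-to-light isotope ratios
    (13C/12C or 15N/14N) for the sample and the reference standard.
    """
    if r_standard <= 0:
        raise ValueError("r_standard must be positive")
    return (r_sample / r_standard - 1.0) * 1000.0


# Purely illustrative ratios (assumptions, not measurements from the paper).
example_standard = 0.0100
example_sample = 0.0099

# A sample with a lower heavy-to-light ratio than the standard gives a negative delta.
print(delta_permil(example_sample, example_standard))
```

A sample whose ratio equals the standard's ratio yields a delta of zero, which is a quick sanity check when mapping the `Delta 15 N` and `Delta 13 C` columns.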

jsheunis commented

2025-06-10 12:39:04 +00:00

Author

Owner

This is the structure of the flat data schema:

(Updated to account for the part_of relationship between two StudyActivitys, and for the derived_from relationship between a DataItem and a Subject and between two Subjects, which was previously denoted as specimen_of.)


graph TD
    %% Core Entities
    Study[Study]
    StudyActivity[StudyActivity]
    Subject[Subject]
    SubjectType[SubjectType]
    Instrument[Instrument]
    Protocol[Protocol]
    DataItem[DataItem]
    Dataset[Dataset]
    Unit[Unit]
    Dimension[Dimension]
    Factor[Factor]
    Distribution[Distribution]
    
    %% Core Relationships
    Subject -->|derived from| Subject
    Subject -->|has type| SubjectType
    Subject -->|has context| Study
    DataItem -->|generated by| StudyActivity
    
    StudyActivity -->|has context| Study
    StudyActivity -->|studied subjects| Subject
    StudyActivity -->|used instruments| Instrument
    StudyActivity -->|part of| StudyActivity
    StudyActivity -->|implemented protocols| Protocol
    StudyActivity -->|influencing factors| Factor
    
    DataItem -->|part of| Dataset
    DataItem -->|unit| Unit
    DataItem -->|outcome variables| Dimension
    DataItem -->|derived from| Subject
 
    Dataset -->|primary source| Study
    Dataset -->|outcome variables| Dimension
    
    Distribution -->|distribution of| DataItem
    Distribution -->|distribution of| Dataset

    %% Styling
    %% classDef coreEntity fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    %% classDef dataItem fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    %% class Study,StudyActivity,Subject,SubjectType,Instrument,Protocol,Dataset,Unit,Dimension coreEntity
    %% class DataItem dataItem
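
To make the graph above easier to check against the schema, here is a minimal sketch that restates the same relationships as a plain edge list and looks up what goes out of or into a class. The names `EDGES`, `outgoing` and `incoming` are hypothetical helpers, not part of the flat data schema or any tooling.

```python
# (source class, relationship label, target class), copied from the diagram above.
EDGES = [
    ("Subject", "derived from", "Subject"),
    ("Subject", "has type", "SubjectType"),
    ("Subject", "has context", "Study"),
    ("DataItem", "generated by", "StudyActivity"),
    ("StudyActivity", "has context", "Study"),
    ("StudyActivity", "studied subjects", "Subject"),
    ("StudyActivity", "used instruments", "Instrument"),
    ("StudyActivity", "part of", "StudyActivity"),
    ("StudyActivity", "implemented protocols", "Protocol"),
    ("StudyActivity", "influencing factors", "Factor"),
    ("DataItem", "part of", "Dataset"),
    ("DataItem", "unit", "Unit"),
    ("DataItem", "outcome variables", "Dimension"),
    ("DataItem", "derived from", "Subject"),
    ("Dataset", "primary source", "Study"),
    ("Dataset", "outcome variables", "Dimension"),
    ("Distribution", "distribution of", "DataItem"),
    ("Distribution", "distribution of", "Dataset"),
]


def outgoing(cls: str) -> list[tuple[str, str]]:
    """Return sorted (label, target) pairs for relationships starting at cls."""
    return sorted((label, target) for source, label, target in EDGES if source == cls)


def incoming(cls: str) -> list[tuple[str, str]]:
    """Return sorted (source, label) pairs for relationships pointing at cls."""
    return sorted((source, label) for source, label, target in EDGES if target == cls)


for label, target in outgoing("StudyActivity"):
    print(f"StudyActivity --{label}--> {target}")
```

Calling `incoming("Subject")` lists the three ways a `Subject` can be reached (from another `Subject`, from a `StudyActivity`, and from a `DataItem`), which is the part of the graph that the derived_from update touches.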

jsheunis commented

2025-06-10 21:40:10 +00:00

Author

Owner

Suggested structure for measuring culmen length:


classDiagram
    class Palmer Penguin Study {
        is_a: Study
    }
    class Measuring The Culmen Length Of PenguinX {
        is_a: StudyActivity
        start_date: ""
        end_date: ""
    }
    class Culmen Length Of PenguinX {
        is_a: DataItem
        value: "77"
    }
    class Millimeters {
        is_a: Unit
        short_name: "mm"
        description: "..."
    }
    class Culmen Length {
        is_a: Dimension
    }
    class Dial Calipers {
        is_a: Instrument
        description: "+-0.1mm"
    }
    class Culmen Length Measurement Protocol {
        is_a: Protocol
        description: "..."
    }
    class PenguinX {
        is_a: Subject
    }
    class Dream Island {
        is_a: Factor
    }
    class Site {
        is_a: Factor
    }
    class Adelie {
        is_a: SubjectType
    }

    class Palmer Penguin Dataset {
        is_a: Dataset
    }


    Measuring The Culmen Length Of PenguinX <.. Culmen Length Of PenguinX : generated_by
    Palmer Penguin Dataset <.. Culmen Length Of PenguinX : part_of

    Culmen Length Of PenguinX ..> Millimeters : has_unit
    Culmen Length Of PenguinX ..> Culmen Length : outcome_variable

    Measuring The Culmen Length Of PenguinX ..> Culmen Length Measurement Protocol : implemented_protocol
    Measuring The Culmen Length Of PenguinX ..> Dial Calipers : used_instruments
    Measuring The Culmen Length Of PenguinX ..> PenguinX : studied_subjects
    Measuring The Culmen Length Of PenguinX ..> Dream Island : influencing_factors
    
    Dream Island ..> Site : level_of

    PenguinX ..> Adelie : is_of_type

    Palmer Penguin Study <.. Measuring The Culmen Length Of PenguinX : has_study_context
    Palmer Penguin Study <.. PenguinX : has_study_context
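
As a rough illustration of how the culmen length structure above could look as linked records, the sketch below writes each class from the diagram as a dictionary entry and checks that every link points at an existing record. The snake_case identifiers, the `RECORDS` layout and the `dangling_references` helper are assumptions for illustration only; the field names are taken from the diagram labels.

```python
# Hypothetical record identifiers; field names follow the diagram labels.
RECORDS = {
    "palmer_penguin_study": {"is_a": "Study"},
    "palmer_penguin_dataset": {"is_a": "Dataset"},
    "measuring_culmen_length_of_penguin_x": {
        "is_a": "StudyActivity",
        "has_study_context": "palmer_penguin_study",
        "implemented_protocol": "culmen_length_measurement_protocol",
        "used_instruments": ["dial_calipers"],
        "studied_subjects": ["penguin_x"],
        "influencing_factors": ["dream_island"],
    },
    "culmen_length_of_penguin_x": {
        "is_a": "DataItem",
        "value": "77",
        "generated_by": "measuring_culmen_length_of_penguin_x",
        "part_of": "palmer_penguin_dataset",
        "has_unit": "millimeters",
        "outcome_variable": "culmen_length",
    },
    "millimeters": {"is_a": "Unit", "short_name": "mm"},
    "culmen_length": {"is_a": "Dimension"},
    "dial_calipers": {"is_a": "Instrument", "description": "+-0.1mm"},
    "culmen_length_measurement_protocol": {"is_a": "Protocol"},
    "penguin_x": {
        "is_a": "Subject",
        "is_of_type": "adelie",
        "has_study_context": "palmer_penguin_study",
    },
    "dream_island": {"is_a": "Factor", "level_of": "site"},
    "site": {"is_a": "Factor"},
    "adelie": {"is_a": "SubjectType"},
}

# Fields that hold literal values rather than links to other records.
LITERAL_FIELDS = {"is_a", "value", "short_name", "description"}


def dangling_references(records: dict) -> list[tuple[str, str, str]]:
    """Return (record id, field, target) for every link to a missing record."""
    missing = []
    for record_id, record in records.items():
        for field, target in record.items():
            if field in LITERAL_FIELDS:
                continue
            targets = target if isinstance(target, list) else [target]
            for target_id in targets:
                if target_id not in records:
                    missing.append((record_id, field, target_id))
    return missing


print(dangling_references(RECORDS))
```

A check like this is cheap to run after hand-editing records, since a mistyped identifier in a link otherwise goes unnoticed until query time.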


jsheunis commented

2025-06-11 09:48:37 +00:00

Author

Owner

Suggested structure for determining sex:


classDiagram
    class Palmer Penguin Study {
        is_a: Study
    }
    class Determining the Sex Of PenguinX {
        is_a: StudyActivity
        start_date: ""
        end_date: ""
    }
    class Sex Of PenguinX {
        is_a: DataItem
        value: "FEMALE"
    }
    class Sex {
        is_a: Dimension
    }
    class 1.5 ml micro-centrifuge tube {
        is_a: Instrument
    }
    class heparinized infusion needle {
        is_a: Instrument
    }
    class sterile 3 ml syringe {
        is_a: Instrument
    }

    class Penguin Sex Determination Protocol {
        is_a: Protocol
        description: Very long paragraph describing all the steps.
    }
    class PenguinX {
        is_a: Subject
    }
    class Dream Island {
        is_a: Factor
    }
    class Site {
        is_a: Factor
    }
    class Adelie {
        is_a: SubjectType
    }

    class Palmer Penguin Dataset {
        is_a: Dataset
    }

    Determining the Sex Of PenguinX <.. Sex Of PenguinX : generated_by
    Palmer Penguin Dataset <.. Sex Of PenguinX : part_of

    Sex Of PenguinX ..> Sex : outcome_variable

    Determining the Sex Of PenguinX ..> Penguin Sex Determination Protocol : implemented_protocol
    Determining the Sex Of PenguinX ..> PenguinX : studied_subjects
    Determining the Sex Of PenguinX ..> Dream Island : influencing_factors

    Determining the Sex Of PenguinX ..> 1.5 ml micro-centrifuge tube : used_instruments
    Determining the Sex Of PenguinX ..> heparinized infusion needle : used_instruments
    Determining the Sex Of PenguinX ..> sterile 3 ml syringe : used_instruments
    
    Dream Island ..> Site : level_of

    PenguinX ..> Adelie : is_of_type

    Palmer Penguin Study <.. Determining the Sex Of PenguinX : has_study_context
    Palmer Penguin Study <.. PenguinX : has_study_context

(Note: not all used Instruments are included in this diagram)

The Penguin Sex Determination Protocol would then have the complete description:

At the time of capture, each adult penguin was quickly blood sampled (∼1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle. Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool. In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing.

Tubes containing whole blood smears on filter paper were allowed to dry in a desiccator. After drying, tubes were sealed and frozen at −80°C. Whole blood smears were allowed to dry a second time in a desiccator for at least 24 hrs prior to analysis.

Sex of adult Pygoscelis penguins was determined molecularly using PCR amplification as outlined by Griffiths et al. [53], as well as Fridolfsson and Ellegren [54]. See Supporting Information Text S1 for specific details regarding PCR methods including extraction, amplification, and gel electrophoresis.

This large paragraph is an example of what I meant previously when I said "it makes annotation easier". Because, technically, this whole process can be broken down into separate Protocols and StudyActivitys and DataItems. Each sentence in the paragraph is essentially a different study activity that uses a different protocol and generates a different interim data item. If we annotated all of them, annotation would be more involved. But throwing it all into a single process that generates the single DataItem that we care about, i.e. "Sex", makes it both easier and also specific to what we want to be able to query.


jsheunis commented

2025-06-11 09:58:23 +00:00

Author

Owner

Regarding the island, I think that should rather be coded as a Dimension than a Factor. From scanning the paper, it does not look like the site is intended to be part of the study design as an independent variable. It is rather the case that there are three islands that are part of the data collection site, and some penguins happen to be on some islands.


jsheunis commented

2025-06-18 08:30:54 +00:00

Author

Owner

Latest. While mapping data from the penguin tables to the flat-data schema, I'm encountering some uncertainties that I'm listing here for awareness/input:

- ***How could the levels of a specific `Dimension` be encoded?*** A `DataItem` (e.g. `SexOfPenguinX`) has an outcome variable (aka `Dimension`) named `Sex`, which according to the source metadata has two possible values: `MALE` and `FEMALE`. The relevant value for the specific penguin will be entered into the `value` field of the specific `DataItem`, but is it also a relevant use case to encode the possible options into the specific `Dimension`? The `Factor` class deals with a similar use case by means of the `factor_level_of` field, which allows stating that a given `Factor` is a level of another `Factor`.
- ***How to decide if a column in a CSV table is a `Factor` or `Dimension`?*** Specifically the `Region` ("Nominal region of Palmer LTER sampling grid") and `Island` ("Island near Palmer Station where samples were collected") columns of the penguin tables. There is only one unique value for `Region` across all table data: `Anvers`. There are three unique values for `Island` across all table data: `Dream`, `Biscoe`, `Torgersen`. Structurally, the model is:
  - a region can have many islands
  - Anvers is a region
  - Anvers has three islands relevant to the study
  - Dream, Biscoe, and Torgersen are these islands

  Also, the study provides additional info about the islands (such as GPS locations) that can possibly be encoded somewhere.

  Should these entities all be `Factor`s? Should the three islands be levels of `Island`? Should `Anvers` be a level of `Region`? And should each `StudyActivity` for this whole study therefore have two related `Factor`s: `Anvers` and the specific island that the `DataItem` generated by the `StudyActivity` was collected from?
- ***How to map the provenance of a derived `Subject`?*** I am busy mapping the processes of doing molecular tests on penguin blood samples in order to get several outputs. Briefly, penguin blood is collected in the field and stored in a small tube as well as smeared on a piece of test paper; the test paper blood is used for molecular sexing, while the blood in the tube is used for calculating Delta15N and Delta13C values that are important for later analyses; the resulting sex and delta values are all columns in the CSV tables. So in the end these table values are all `DataItem`s. I could approach this in a simple or less simple way:
  - *Simple*:
    - `Subject`: an individual `PenguinX`
    - `StudyActivity`: `DeterminingTheSexofPenguinX`, `DeterminingDelta15NofPenguinX`, `DeterminingDelta13CofPenguinX`
    - `Protocol`: 2 partially overlapping protocols, one for collecting tube blood and determining the Delta values, and one for collecting test paper blood and determining the sex
    - `DataItem`: `SexofPenguinX`, `Delta15NofPenguinX`, `Delta13CofPenguinX`
  - *Not so simple*: this involves splitting up the `StudyActivity`s into the processes of (a) collecting blood samples, and then (b) determining the outcome variables from those blood samples. Starting off, the process would be similar to above:
    - `Subject`: an individual `PenguinX`
    - `StudyActivity`: `CollectingTubeBloodOfPenguinX`, `CollectingPaperBloodOfPenguinX`
    - `Protocol`: 2 protocols, one for collecting tube blood and one for collecting test paper blood
    - `DataItem`: `TubeBloodOfPenguinX`, `PaperBloodOfPenguinX`

    These `DataItem`s become the new `Subject`s for the next steps, e.g. calculating the delta values:
    - `Subject`: `TubeBloodOfPenguinX`
    - `StudyActivity`: `DeterminingDelta15NofPenguinX`, `DeterminingDelta13CofPenguinX`
    - `Protocol`: one protocol for determining the Delta values from tube blood
    - `DataItem`: `Delta15NofPenguinX`, `Delta13CofPenguinX`

    But now, for `TubeBloodOfPenguinX`, how is the `DataItem` connected to the subsequent `Subject`? `Subject` has `derived_from` with the range of `Subject`, i.e. `TubeBloodOfPenguinX` is derived from `PenguinX`. But is that sufficient for multi-step queries starting from `PenguinX` and its related `StudyActivity`s?
- ***One-to-one or one-to-many relationship for `StudyActivity` that generates `DataItem`(s)?*** Assuming the `StudyActivity` is linked to a single `Subject`, which allows linking specific `DataItem`(s) to a specific `Subject`, is there any particular benefit to creating multiple granular `StudyActivity` objects such that they would be linked one-to-one to generated `DataItem`s? This could be related to the associated `Protocol`. E.g. I have a protocol that spans several measurements ("Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation") and the associated `StudyActivity` would generate four `DataItem`s (culmen length, culmen depth, flipper length, body mass). This would be one-to-many. But the `Protocol` could also be split up into its parts, and multiple associated `StudyActivity`s can be created, leading to a one-to-one relationship of `StudyActivity` and `DataItem`. But are there other querying-related benefits to either the one-to-one or one-to-many relationship?
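
To make the provenance question more concrete, here is a small sketch of the multi-step query under one possible reading of the "not so simple" variant: the blood samples are recorded as `Subject`s with `derived_from` pointing at the penguin, and each `DataItem` has `derived_from` pointing at the sample it was measured on. The dictionaries and the helpers `root_subject` and `data_items_for` are hypothetical, and this does not settle whether the sample should also exist as a `DataItem`.

```python
# Hypothetical records: samples are Subjects derived from the penguin.
SUBJECTS = {
    "PenguinX": {"derived_from": None},
    "TubeBloodOfPenguinX": {"derived_from": "PenguinX"},
    "PaperBloodOfPenguinX": {"derived_from": "PenguinX"},
}

# Each DataItem points at the Subject it was derived from.
DATA_ITEMS = {
    "Delta15NofPenguinX": {
        "derived_from": "TubeBloodOfPenguinX",
        "generated_by": "DeterminingDelta15NofPenguinX",
    },
    "Delta13CofPenguinX": {
        "derived_from": "TubeBloodOfPenguinX",
        "generated_by": "DeterminingDelta13CofPenguinX",
    },
    "SexofPenguinX": {
        "derived_from": "PaperBloodOfPenguinX",
        "generated_by": "DeterminingTheSexofPenguinX",
    },
}


def root_subject(subject_id: str) -> str:
    """Follow derived_from links until a Subject with no parent is reached."""
    seen = set()
    while SUBJECTS[subject_id]["derived_from"] is not None:
        if subject_id in seen:
            raise ValueError(f"derived_from cycle at {subject_id}")
        seen.add(subject_id)
        subject_id = SUBJECTS[subject_id]["derived_from"]
    return subject_id


def data_items_for(root_id: str) -> list[str]:
    """Return all DataItems whose derived_from chain ends at root_id."""
    return sorted(
        name
        for name, item in DATA_ITEMS.items()
        if root_subject(item["derived_from"]) == root_id
    )


print(data_items_for("PenguinX"))
```

Under these assumptions the `derived_from` chain alone is enough to get from `PenguinX` to the sex and delta values; what it does not capture is which `StudyActivity` produced the sample itself, which is the gap the question above points at.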


jsheunis commented

2025-06-18 10:09:55 +00:00

Author

Owner

{
    "Dataset": {
        "adelie":{
            "name": "Adélie Penguin Dataset",
            "short_name": "Adélie Penguins",
            "description": "Structural size measurements and isotopic signatures of foraging among adult male and female Adélie penguins (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009",
            "display_label": "Adélie Penguins"
        },
        "gentoo":{
            "name": "Gentoo Penguin Dataset",
            "short_name": "Gentoo Penguins",
            "description": "Structural size measurements and isotopic signatures of foraging among adult male and female gentoo penguins (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009",
            "display_label": "Gentoo Penguins"
        },
        "chinstrap":{
            "name": "Chinstrap Penguin Dataset",
            "short_name": "Chinstrap Penguins",
            "description": "Structural size measurements and isotopic signatures of foraging among adult male and female Chinstrap penguins (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009",
            "display_label": "Chinstrap Penguins"
        },
        "palmerpenguins":{
            "name": "Palmer Penguin Dataset",
            "short_name": "Palmer Penguins",
            "description": "Data collected from three species of Pygoscelis penguins nesting on several islands within the Palmer Archipelago west of the AP near Anvers Island, during the austral summers of 2007/08, 2008/09, and 2009/10.",
            "display_label": "Palmer Penguins"
        }
    },
    "DataItem": {
        "ex": {
            "part_of": "",
            "generated_by": "",
            "derived_from": "",
            "value": "",
            "unit": "",
            "dimensions": [],
            "description": "",
            "display_label": ""
        }
    },
    "Dimension": {
        "stage": {
            "name": "Stage",
            "description": "Reproductive stage at sampling",
            "display_label": "Stage"
        },
        "clutch_completion": {
            "name": "Clutch Completion",
            "description": "Was the study nest observed with a full clutch, i.e., 2 eggs",
            "display_label": "Clutch Completion"
        },
        "date_egg": {
            "name": "Date Egg",
            "description": "Date study nest observed with 1 egg (sampled); string formatted as YYYY-MM-DD",
            "display_label": "Date Egg"
        },
        "culmen_length": {
            "name": "Culmen Length (mm)",
            "description": "length of the dorsal ridge of a bird's bill",
            "display_label": "Culmen Length (mm)"
        },
        "culmen_depth": {
            "name": "Culmen Depth (mm)",
            "description": "depth of the dorsal ridge of a bird's bill",
            "display_label": "Culmen Depth (mm)"
        },
        "flipper_length": {
            "name": "Flipper Length (mm)",
            "description": "Length of flipper",
            "display_label": "Flipper Length (mm)"
        },
        "body_mass": {
            "name": "Body Mass (g)",
            "description": "Mass of body",
            "display_label": "Body Mass (g)"
        },
        "sex": {
            "name": "Sex",
            "description": "code for the sex of an animal",
            "display_label": "Sex"
        },
        "delta15N": {
            "name": "Delta 15 N (o/oo)",
            "description": "a measure of the ratio of stable isotopes 15N:14N",
            "display_label": "Delta 15 N (o/oo)"
        },
        "delta13C": {
            "name": "Delta 13 C (o/oo)",
            "description": "a measure of the ratio of stable isotopes 13C:12C",
            "display_label": "Delta 13 C (o/oo)"
        },
        "comments": {
            "name": "Comments",
            "description": "Text field to provide additional relevant information for data",
            "display_label": "Comments"
        }
    },
    "Factor": {
        "anvers": {
            "name": "Anvers",
            "description": "Island in the Palmer Archipelago on which the Palmer Station is located, 64°46'S, 64°03'W. This is a nominal region of the Palmer LTER sampling grid",
            "display_label": "Anvers"
        },
        "dream": {
            "name": "Dream",
            "description": "Island in the Palmer Archipelago in the region of Anvers island, located at 64°43'S, 64°13'W",
            "display_label": "Dream island"
        },
        "biscoe": {
            "name": "Biscoe",
            "description": "Island in the Palmer Archipelago in the region of Anvers island, located at 64°48'S, 63°46'W.",
            "display_label": "Biscoe island"
        },
        "torgersen": {
            "name": "Torgersen",
            "description": "Island in the Palmer Archipelago in the region of Anvers island, located at 64°46'S, 64°04'W.",
            "display_label": "Torgersen island"
        }
    },
    "Instrument": {
        "sterile_syringe": {
            "name": "sterile 3 ml syringe",
            "display_label": "sterile 3 ml syringe"
        },
        "heparinized_infusion_needle": {
            "name": "heparinized infusion needle",
            "display_label": "heparinized infusion needle"
        },
        "micro_centrifuge_tube": {
            "name": "1.5 ml micro-centrifuge tube",
            "display_label": "1.5 ml micro-centrifuge tube"
        },
        "dial_calipers": {
            "name": "dial calipers ±0.1 mm",
            "display_label": "dial calipers ±0.1 mm"
        },
        "ruler": {
            "name": "ruler ±1 mm",
            "display_label": "ruler ±1 mm"
        },
        "pesola_spring_scales_and_weigh_bag": {
            "name": "5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag",
            "display_label": "5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag"
        }
    },
    "Protocol": {
        "penguin_monitoring_and_selection": {
            "name": "Penguin monitoring and selection",
            "description": [
                "Study nests, where pairs of adults were present, were individually marked and chosen before the onset of egg laying, and consistently monitored.",
                "When study nests were found at the one-egg stage, both adults were captured to obtain blood samples used for molecular sexing and SI analyses, and measurements of structural size and body mass",
                "After handling, individuals at study nests were further monitored to ensure the pair reached clutch completion, i.e., two eggs."
            ],
            "display_label": "Penguin monitoring and selection"
        },
        "penguin_structural_size_and_mass_measurement": {
            "name": "Penguin structural size and mass measurement",
            "description": [
                "Measurements of culmen length and depth (using dial calipers ±0.1 mm), right flipper (using a ruler ±1 mm), and body mass (using 5 kg±25 g or 10 kg±50 g Pesola spring scales and a weigh bag) were obtained to quantify body size variation."
            ],
            "display_label": "Penguin structural size and mass measurement"
        },
        "penguin_blood_sample_collection": {
            "name": "Penguin blood sample collection",
            "description": [
                "At the time of capture, each adult penguin was quickly blood sampled (~1 ml) from the brachial vein using a sterile 3 ml syringe and heparinized infusion needle.",
                "Collected blood was stored in 1.5 ml micro-centrifuge tubes that were kept cool.",
                "In the field, a small amount of whole blood was smeared on clean filter paper stored in a 1.5 ml micro-centrifuge tube for molecular sexing."
            ],
            "display_label": "Penguin blood sample collection"
        },
        "calculating_delta15N_and_delta13C_values_from_whole_blood_in_tubes": {
            "name": "Calculating Delta 15 N and Delta 13 C values from whole blood in tubes",
            "description": [
                "whole blood tubes centrifuged to separate plasma and red blood cell (RBC) fractions",
                "RBC tubes frozen at -80 degrees Celsius (°C)",
                "allowed to dry to a consistent mass in a drying oven at 60°C => dried RBC pellets",
                "dried RBC pellets homogenized into a powder using a mortar and pestle lined with clean weighing paper (Each mortar and pestle was washed and dried in between sample processing)",
                "Aliquots of powdered samples were transferred to 8x5 mm pressed tin capsules (Elemental Microanalysis) and weighed (~2 mg) using an analytical balance",
                "Samples were organized in 96-microwell plates and analyzed for δ13C and δ15N SI signatures using an elemental analyzer interfaced with an isotope ratio mass spectrometer at the Stable Isotope Facility, University of California (UC) - Davis",
                "Data expressed as δ13C or δ15N were calculated using the following equation: δ13C or δ15N=([Rsample/Rstandard]-1)x1000, where Rsample is the ratio of the heavy to light isotope for either 13C/12C or 15N/14N, and Rstandard is the heavy to light isotope ratios for international standards - Vienna PeeDee Belemnite for carbon, and atmospheric N2 (Air) for nitrogen."
            ],
            "display_label": "Calculating Delta 15 N and Delta 13 C values from whole blood in tubes"
        },
        "calculating_penguin_sex_from_blood_smears": {
            "name": "Calculating penguin sex from blood smears",
            "description": [
                "Tubes containing whole blood smears on filter paper were allowed to dry in a desiccator.",
                "After drying, tubes were sealed and frozen at -80°C.",
                "Whole blood smears were allowed to dry a second time in a desiccator for at least 24 hrs prior to analysis.",
                "Sex of adult Pygoscelis penguins was determined molecularly using PCR amplification as outlined by Griffiths et al. [53], as well as Fridolfsson and Ellegren [54]. See Supporting Information Text S1 for specific details regarding PCR methods including extraction, amplification, and gel electrophoresis."
            ],
            "display_label": "Calculating penguin sex from blood smears"
        }
    },
    "Study": {
        "palmerpenguins": {
            "name": "Palmer Penguin Study",
            "short_name": "Palmer Penguin Study",
            "description": "Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis)",
            "display_label": "Palmer Penguin Study"
        }
    },
    "StudyActivity": {
        "PAL0708": {
            "study": "palmerpenguins",
            "description": "Collection of data from three species of Pygoscelis penguins during the austral summer of 2007/08.",
            "display_label": "PAL0708"
        },
        "PAL0809": {
            "study": "palmerpenguins",
            "description": "Collection of data from three species of Pygoscelis penguins during the austral summer of 2008/09.",
            "display_label": "PAL0809"
        },
        "PAL0910": {
            "study": "palmerpenguins",
            "description": "Collection of data from three species of Pygoscelis penguins during the austral summer of 2009/10.",
            "display_label": "PAL0910"
        },
        "ex": {
            "study": "",
            "subjects": "",
            "implements": "",
            "factors": "",
            "instruments": "",
            "part_of": "",
            "description": "",
            "display_label": ""
        }

    },
    "StudyActivityPerSubjectBasis": {
        "MonitoringAndSelectionOfPenguinX": {
            "study": "palmerpenguins",
            "subjects": ["X"],
            "implements": ["penguin_monitoring_and_selection"],
            "factors": ["X"],
            "part_of": "X",
            "description": "The act of monitoring and selecting a specific penguin for further studying",
            "display_label": "X",
            "_generatesDataItemWithDim": [
                "stage",
                "clutch_completion",
                "date_egg"
            ]
        },
        "StructuralSizeMeasurementsOfPenguinX": {
            "study": "palmerpenguins",
            "subjects": ["X"],
            "implements": ["penguin_structural_size_and_mass_measurement"],
            "factors": ["X"],
            "instruments": ["dial_calipers", "ruler", "pesola_spring_scales_and_weigh_bag"],
            "part_of": "X",
            "description": "The act of taking structural size measurements from a specific studied penguin",
            "display_label": "X",
            "_generatesDataItemWithDim": [
                "culmen_length",
                "culmen_depth",
                "flipper_length",
                "body_mass"
            ]
        },
        "DeterminingTheSexOfPenguinX": {
            "study": "palmerpenguins",
            "subjects": ["X"],
            "implements": ["penguin_blood_sample_collection", "calculating_penguin_sex_from_blood_smears" ],
            "factors": ["X"],
            "instruments": ["sterile_syringe", "heparinized_infusion_needle", "micro_centrifuge_tube"],
            "part_of": "X",
            "description": "Molecular sexing from blood smears",
            "display_label": "X",
            "_generatesDataItemWithDim": ["sex"]
        },
        "DeterminingDeltaValuesofPenguinX": {
            "study": "palmerpenguins",
            "subjects": ["X"],
            "implements": ["penguin_blood_sample_collection", "calculating_delta15N_and_delta13C_values_from_whole_blood_in_tubes" ],
            "factors": ["X"],
            "instruments": ["sterile_syringe", "heparinized_infusion_needle", "micro_centrifuge_tube"],
            "part_of": "X",
            "description": "Isotope calculations",
            "display_label": "X",
            "_generatesDataItemWithDim": ["delta15N", "delta13C"]
        },
        "AssessingDataQualityOfPenguinX": {
            "study": "palmerpenguins",
            "subjects": ["X"],
            "factors": ["X"],
            "part_of": "X",
            "description": "Determining data quality and adding explanatory comments where applicable",
            "display_label": "X",
            "_generatesDataItemWithDim": ["comments"]
        }
    },
    "Subject": {
        "ex": {
            "study": "",
            "name": "",
            "subject_type": "",
            "short_name": "",
            "description": "",
            "display_label": ""
        }
    },
    "SubjectType": {
        "adelie":{
            "name": "Adélie Penguin",
            "short_name": "Adélie",
            "description": "Pygoscelis adeliae",
            "exact_mappings": ["http://purl.obolibrary.org/obo/NCBITaxon_9238"],
            "display_label": "Adélie Penguin"
        },
        "gentoo":{
            "name": "Gentoo Penguin",
            "short_name": "Gentoo",
            "description": "Pygoscelis papua",
            "exact_mappings": ["http://purl.obolibrary.org/obo/NCBITaxon_30457"],
            "display_label": "Gentoo Penguin"
        },
        "chinstrap":{
            "name": "Chinstrap Penguin",
            "short_name": "Chinstrap",
            "description": "Pygoscelis antarcticus",
            "exact_mappings": ["http://purl.obolibrary.org/obo/NCBITaxon_79643"],
            "display_label": "Chinstrap Penguin"
        }
    },
    "Unit": {
        "gram": {
            "name": "gram",
            "short_name": "g",
            "description": "a thousandth of one kilogram, the SI unit of mass",
            "exact_mappings": ["http://purl.obolibrary.org/obo/NCIT_C48155"],
            "display_label": "gram"
        },
        "millimeter": {
            "name": "millimeter",
            "short_name": "mm",
            "description": "a thousandth of one meter, the SI unit of length",
            "exact_mappings": ["http://purl.obolibrary.org/obo/NCIT_C28251"],
            "display_label": "millimeter"
        },
        "parts_per_thousand": {
            "name": "parts per thousand",
            "short_name": "o/oo",
            "description": "parts per thousand, relative to a standard. for isotopes. Isotope data uses LC-delta=(Rx/Rs-1)*1000",
            "narrow_mappings": ["http://purl.obolibrary.org/obo/UO_0000168"],
            "display_label": "parts per thousand"
        }
    }
}

With the above, accompanying code can generate the independent objects (such as `Study`s, `Factor`s, and `Dimension`s); further code will be added to generate the dependent objects from the CSV tables and the helper data.
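As a rough illustration of that first step, here is a minimal Python sketch of how the independent objects could be derived from the JSON tables. It is not the actual accompanying code. The PID prefix, the `pid` and `schema_type` field names, and the function name `build_independent_objects` are all assumptions made for the example, as is the reading of the `ex` entries as empty templates that should be skipped.

```python
import json

# Hypothetical namespace for generated PIDs; the real prefix is not given here.
PID_PREFIX = "https://example.org/penguins"
# Top-level keys that are helpers rather than schema classes.
HELPER_CLASSES = {"StudyActivityPerSubjectBasis"}
# Assumption: "ex" entries are empty templates, not real objects.
TEMPLATE_KEY = "ex"


def build_independent_objects(tables):
    """Turn each entry of each class table into a flat record with a PID."""
    records = []
    for cls, entries in tables.items():
        if cls in HELPER_CLASSES:
            continue
        for key, props in entries.items():
            if key == TEMPLATE_KEY:
                continue
            # Drop helper properties such as `_generatesDataItemWithDim`.
            clean = {k: v for k, v in props.items() if not k.startswith("_")}
            records.append(
                {"pid": f"{PID_PREFIX}/{cls}/{key}", "schema_type": cls, **clean}
            )
    return records


# A small excerpt of the JSON object above, enough to exercise each branch.
excerpt = json.loads("""
{
  "Factor": {"dream": {"name": "Dream", "display_label": "Dream island"}},
  "Unit": {"gram": {"name": "gram", "short_name": "g", "display_label": "gram"}},
  "Subject": {"ex": {"study": "", "name": ""}},
  "StudyActivityPerSubjectBasis": {
    "DeterminingTheSexOfPenguinX": {
      "study": "palmerpenguins",
      "_generatesDataItemWithDim": ["sex"]
    }
  }
}
""")

records = build_independent_objects(excerpt)
for record in records:
    print(record["pid"], "->", record["display_label"])
```

Because the keys are unique within each class namespace, combining the class name and the key is enough to keep the generated PIDs distinct across classes (e.g. `adelie` in `Dataset` versus `adelie` in `SubjectType`).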
