
Dataset Metadata Model

A Dataset is a collection of data gathered by a project using a single sampling protocol (data collection method). A project may have one or more datasets. In the context of PPSR Core, datasets represent the observations collected by the community of contributors.

The Dataset Metadata Model (DMM) describes a collection of observations. Dataset-level metadata provides context for a collection of observational records and expresses information that is common to all records within a dataset. It allows datasets to be discovered and accessed by a range of factors, which helps users, especially third-party users, make informed decisions about whether a dataset suits their particular requirements. It helps researchers understand a group of observations by describing:

  • Title and description of the dataset
  • Graphical elements associated with the dataset
  • Method/survey protocol used
  • Temporal range of the dataset
  • Licence and ownership of the dataset
  • Quality assurance methods applied to the dataset (pre, during and post recording)
  • Data access methods
  • Constraints and biases affecting the usage of the data
  • Data management plan

Information Sharing Example

People need to know what datasets exist, where, and about which topics. They need to search efficiently for all available datasets on a topic, discover the relevant ones, access them, and judge whether they can use them for their decision making or research. For example, imagine a policy maker in Australia looking for all available data on invasive species in order to decide how to prioritize funding to combat the worst invasions. Imagine the delight when this policy maker finds that a single search (perhaps on a search engine such as the citizen science cloud) returns datasets from active citizen science projects addressing invasive species, hosted on the Atlas of Living Australia Biocollect platform, as well as from historic citizen science projects in which Earthwatch volunteers mapped invasive species populations in Australia several years earlier.

Entity Relationship Diagram

[current approved version: 2020.0]

The Dataset ERD describes the relationships between class entities in the Dataset Metadata Model. Each dataset contains a set of Core Attributes which represent the core terms associated with a project. The Extension Attributes are optional terms associated with a project.

Figure 1: Dataset Entity Relationship Diagram

Core Attributes

[current approved version: 2020.0]

Core attributes are the main fields associated with a dataset. The table below lists all Core Attributes, with each field name and a description of how it is used.

Many of the core terms are mandatory: every dataset instance is required to have an entry in each mandatory field.

| Entity | Attribute or Entity Name | Description | Data or Entity Type | Obligation | Multiplicity | Synonym term in other standards |
| --- | --- | --- | --- | --- | --- | --- |
| activity | | An activity is analogous to a survey and comprises 2 components: a metadata schema; and an observational data model (i.e. the data schema into which observational records are created). The data schema definition will represent a specific data collection protocol. In the context of an activity/survey, these exist as a singular pair of objects. Usage of an activity is always made in the context of an event, i.e. a non-persistent time-based usage of an observational data schema. Observational data schemas are domain and protocol specific, and may be published in other repositories. | Class | Optional | 0:n | dcmitype:Event |
| activity | activityId | A globally unique identifier for an activity. | text | Mandatory | 1:1 | |
| activity | datasetMetadataSchema | The datasetMetadataSchema (DMM) describes the metadata pertaining to the specific observationalDataSchema selected and its associated data. There is a 1:1 relationship between the datasetMetadataSchema and the observationalDataSchema. The datasetMetadataSchema is consistent for all classes of observationalDataDomains. This is a class object. | Class | Mandatory | 1:1 | |
| datasetMetadataSchema | dmmCoreTerms | The set of core terms which comprise the PPSR-Core Dataset Metadata Model (DMM). These are the minimum set of attributes required to adequately describe a dataset and enable exchange of dataset metadata between data catalogues. | Class | Optional | 1:1 | dcat:CatalogRecord |
| dmmCoreTerms | dcterms:identifier | Persistent identifier of a dataset (associated with the project). Should equate to the datasetExternalId if data is stored in an external repository. | text | Mandatory | 1:1 | dwcterms:datasetID; cosi:hasIdentifierValue |
| dmmCoreTerms | dcterms:dateSubmitted | The date a dataset submission was published into a receiving system. Uses the ISO 8601:2004 (E) dateTime standard. | date time | Mandatory | 1:1 | prov:generatedAtTime; CI_Citation date |
| dmmCoreTerms | dcterms:modified | The most recent dateTime at which the resource was changed. Uses the ISO 8601:2004 (E) dateTime standard. | date time | Mandatory | 1:1 | prov:generatedAtTime |
| dmmCoreTerms | datasetStatus | Indicator of the current status of a dataset (e.g. whether it is already published). | vocabulary | Mandatory | 1:1 | cosi:hasStatus |
| dmmCoreTerms | dcterms:title | The name of the dataset for discovery and citation purposes. | text | Mandatory | 1:1 | dwcterms:datasetName; CI_Citation.title; cosi:hasTitle |
| dmmCoreTerms | dcterms:abstract | Abstract or description of the dataset. | text | Mandatory | 1:1 | cosi:hasDescription |
| dmmCoreTerms | dcterms:accessRights | Category of rights to use IP contained in the dataset or a type of use applied to the dataset. | vocabulary | Mandatory | 1:1 | dct:rights |
| dmmCoreTerms | dcterms:bibliographicCitation | The attribution text string to be cited by people who use the dataset. Format to be structured as follows: 'Author/Rightsholder. (Year). Title of data set (Version number) [Description of form]. Retrieved from https://<website url>'. | text | Optional | 0:1 | |
| dmmCoreTerms | dcterms:rightsHolder | The name of the organisation which is the legal custodian of the dataset. | text | Mandatory | 1:1 | prov:wasAttributedTo |
| dmmCoreTerms | dcterms:license | License applied to the dataset. | vocabulary | Mandatory | 1:1 | cosi:hasLicenceInformation |
| dmmCoreTerms | dcterms:language | The machine language the dataset and associated metadata is encoded in. Uses Unicode Standard UTF-8 (ISO/IEC 10646:2014 plus Amendment 1). | text | Optional | 0:n | MD_DataIdentification.characterSet |
| dmmCoreTerms | datesetStartDate | The date on which the dataset collection survey commences. This may reflect the earliest record in the dataset or when a survey is open to begin data recording. This date may be on or after (>=) the projectStartDate. Uses the ISO 8601:2004 (E) dateTime standard. | date time | Mandatory | 1:1 | |
| dmmCoreTerms | datasetEndDate | The date on which the dataset collection survey concluded. Uses the ISO 8601:2004 (E) dateTime standard. | date time | Optional | 0:1 | |
| dmmCoreTerms | methodType | The type of methodology or sampling protocol used to collect the dataset. | vocabulary | Mandatory | 1:1 | |
| dmmCoreTerms | dataAccessMethod | A list of available methods for people to access the dataset. | vocabulary | Mandatory | 1:n | |
| datasetMetadataSchema | methodSpecification | Details of the methodology or sampling protocol used to collect the dataset. | Class | Mandatory | 1:1 | cosi:hasRelatedMaterial |
| methodSpecification | samplingProtocolDomain | The name of the methodology or sampling protocol used to collect the dataset. | vocabulary | Optional | 0:1 | |
| methodSpecification | samplingProtocolMethod | The sampling protocol method used for a given survey. | vocabulary | Optional | 0:1 | dcterms:samplingProtocol; dwcterms:samplingProtocol; cosi:hasProcedure |
| methodSpecification | methodAbstract | Description of the methodology or sampling protocol used to collect the dataset. | text | Optional | 0:1 | |
| methodSpecification | methodUrl | URL address of an officially published article which describes the methodology or sampling protocol used to collect the dataset. | url | Optional | 0:1 | |
| methodSpecification | methodDocUrl | URL link to an uploaded document artefact which describes the methodology or sampling protocol used to collect the dataset. | url | Optional | 0:1 | |
| datasetMetadataSchema | observationalDataModel | The observationalDataDomain contains an array of different domain schemas (e.g. biodiversity, water, atmosphere, ecology, geology, geomorphology, astronomy, etc.). Each domain will contain an array of standard protocols which apply in that domain context. The domains listed are not a comprehensive list and are expected to be appended to over time as new domains are specified and appropriate samplingProtocol standards are defined for them. This class object serves only to structurally differentiate and describe the different domains and is not a structural element of the observationalDataModel (ODM). | Class | Mandatory | 1:1 | dcat:Dataset |
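
The Obligation column above can be checked mechanically. The following is a minimal sketch, assuming the dmmCoreTerms are held as a plain Python mapping keyed by attribute name; the function name `validate_core_terms`, the record layout and the example values are illustrative assumptions, not part of PPSR Core. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601, so it stands in for a full dateTime check.

```python
from datetime import datetime

# Mandatory dmmCoreTerms, copied from the Obligation column of the table above.
MANDATORY_CORE_TERMS = (
    "dcterms:identifier",
    "dcterms:dateSubmitted",
    "dcterms:modified",
    "datasetStatus",
    "dcterms:title",
    "dcterms:abstract",
    "dcterms:accessRights",
    "dcterms:rightsHolder",
    "dcterms:license",
    "datesetStartDate",  # spelled as in the table
    "methodType",
    "dataAccessMethod",
)

# Terms the table types as "date time".
DATE_TIME_TERMS = (
    "dcterms:dateSubmitted",
    "dcterms:modified",
    "datesetStartDate",
    "datasetEndDate",
)


def validate_core_terms(record: dict) -> list:
    """Return a list of problems found in a dmmCoreTerms mapping (empty if none)."""
    problems = []
    for term in MANDATORY_CORE_TERMS:
        if record.get(term) in (None, "", []):
            problems.append(f"missing mandatory term: {term}")
    for term in DATE_TIME_TERMS:
        value = record.get(term)
        if value is None:
            continue  # absence is already reported above if the term is mandatory
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"not an ISO 8601 dateTime: {term}={value!r}")
    # Assumption: multiplicity 1:n is modelled as a list of vocabulary terms.
    if not isinstance(record.get("dataAccessMethod", []), list):
        problems.append("dataAccessMethod should be a list (multiplicity 1:n)")
    return problems


# Hypothetical record; identifier, title and rights holder are made up.
example_record = {
    "dcterms:identifier": "example-dataset-001",
    "dcterms:dateSubmitted": "2020-03-01T09:30:00+00:00",
    "dcterms:modified": "2020-03-15T14:00:00+00:00",
    "datasetStatus": "Active - published - unverified",
    "dcterms:title": "Example invasive species observations",
    "dcterms:abstract": "Illustrative record only.",
    "dcterms:accessRights": "Open access",
    "dcterms:rightsHolder": "Example Organisation",
    "dcterms:license": "Creative Commons zero (CC 0)",
    "datesetStartDate": "2019-07-01",
    "methodType": "Systematic method-based survey",
    "dataAccessMethod": ["Application Programming Interface (API)"],
}

print(validate_core_terms(example_record))
```

Optional terms such as datasetEndDate are only checked when present, which mirrors their 0:1 multiplicity.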

Extension attributes

[current approved version: 2020.0]

Extension attributes are fields whose inclusion is not mandatory for systems that are compliant with PPSR Core. The table below lists all Extension Attributes, with each field name and a description of how it is used. Every system is encouraged to include these fields to ensure greater interoperability between systems.

| Entity | Attribute or Entity Name | Description | Data or Entity Type | Obligation | Multiplicity | Synonym term in other standards |
| --- | --- | --- | --- | --- | --- | --- |
| Project | projectId | Globally unique identifier (GUID) for the project. System generated. | Text | Optional | 0:1 | dcterms:identifier; cosi:hasIdentifier |
| datasetMetadataSchema | dmmExtensionTerms | The set of extension terms which comprise the PPSR-Core Dataset Metadata Model (DMM). These terms enhance the description of a dataset and improve the ability of users of the dataset to understand or interpret fitness for use. | Class | Optional | 0:1 | dcat:CatalogRecord |
| dmmExtensionTerms | datasetUpdateFrequency | How often the dataset is updated. | vocabulary | Optional | 0:1 | dcterms:accrualPeriodicity |
| dmmExtensionTerms | datasetExternalUrl | Web location where the dataset will be published. | text | Optional | 0:n | |
| dmmExtensionTerms | dcat:downloadURL | A URL from which dataset observation records can be accessed and downloaded. | url | Optional | 0:1 | |
| dmmExtensionTerms | datasetGeographicCoverage | Geographic/spatial scope of coverage of the collection sites of data records within the dataset. Uses OGC GeoAPI (09-083r3) standard. | geoObject | Optional | 0:n | |
| dmmExtensionTerms | cosi:hasHypothesis | The experimental hypothesis underpinning the experimental design for which the dataset was collected. | text | Optional | 0:1 | |
| dmmExtensionTerms | cosi:hasInstrument | Details of instrumentation used in the data recording. | text | Optional | 0:n | |
| dmmExtensionTerms | dataQualityAssuranceMethod | Description of the types of data quality assurance methods that were applied in capturing, curating and managing the dataset. | vocabulary | Optional | 0:n | |
| dmmExtensionTerms | dataQualityAssuranceDescription | Detailed description of the methods used to quality assure the dataset both during capture and post processing. This is important for data users to understand the processes applied to the data to verify or enhance its quality for use. | text | Optional | 0:1 | |
| dmmExtensionTerms | usageGuide | Description of any constraints and biases in the dataset which are associated with how the data collection methodology was applied, e.g. concentration of data points along access networks, targeted/non-random approaches causing bias towards certain factors at the expense of other factors, etc. | text | Optional | 0:1 | |
| dmmExtensionTerms | activityCount | Number of data recording events in the dataset. | integer | Optional | 0:1 | |
| dmmExtensionTerms | datasetAssociatedMedia | Image(s) and/or other media used to graphically enhance or represent the dataset. This is a class object. | Class | Optional | 0:n | |
| datasetAssociatedMedia | datasetAssociatedMediaType | The category of media type representing the type of dataset media item chosen. | vocabulary | Mandatory | 1:1 | foaf:img |
| datasetAssociatedMedia | datasetAssociatedMediaFile | Media file upload representing the type of dataset media chosen. | mediaFile | Mandatory | 1:1 | foaf:img |
| datasetAssociatedMedia | datasetAssociatedMediaCredit | Attribution credit for the logo image or other media. | text | Mandatory | 1:1 | dcterms:bibliographicCitation |
| dmmExtensionTerms | dataAccuracyDeclarations | Generalised categories that best reflect the accuracy of records in the dataset. | Class | Optional | 0:4 | |
| dataAccuracyDeclarations | spatialAccuracy | A generalised category that best reflects the least spatially accurate record in the dataset. | vocabulary | Optional | 0:1 | |
| dataAccuracyDeclarations | temporalAccuracy | A generalised category that best reflects the least accurate record in the dataset in respect to date of the observation. | vocabulary | Optional | 0:1 | |
| dataAccuracyDeclarations | speciesIdentificationAccuracy | A generalised category that best reflects the least accurate record in the dataset for species identification. Choose 'Not applicable' if species fields are not included in the dataset. | vocabulary | Optional | 0:1 | |
| dataAccuracyDeclarations | nonTaxonomicAccuracy | A generalised category that best reflects the least accurate record in the dataset in respect to non-biodiversity attributes. | vocabulary | Optional | 0:1 | |
| dmmExtensionTerms | dataManagementPlan | Details of a data management plan associated with the dataset. | Class | Optional | 0:1 | cosi:hasRelatedMaterial |
| dataManagementPlan | isDataManagementPolicyDocumented | Indicator of whether a data management plan has been prepared for the dataset. | boolean | Mandatory | 1:1 | |
| dataManagementPlan | dataManagementPolicyDescription | Description of data management policy. | text | Optional | 0:1 | |
| dataManagementPlan | dataManagementPolicyURL | Link to data management policy description. | url | Optional | 0:1 | |
| dataManagementPlan | dataManagementPolicyDocument | Document describing data management policy. | url | Optional | 0:1 | |
| dataManagementPlan | dataManagementPrinciplesConformance | Assessment of the conformance of the data management principles applied to the dataset with standard GEOlabels. | text | Optional | 0:1 | |
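
The dataAccuracyDeclarations attributes are each defined in terms of the least accurate record in the dataset. As a minimal sketch of that rule, assume every record carries its own High/Medium/Low category per attribute. That per-record field is hypothetical: the model only defines the dataset-level declaration, and the helper names below are illustrative.

```python
# Ordered from least to most accurate, using the High/Medium/Low vocabulary.
ACCURACY_ORDER = ("Low", "Medium", "High")

ACCURACY_ATTRIBUTES = (
    "spatialAccuracy",
    "temporalAccuracy",
    "speciesIdentificationAccuracy",
    "nonTaxonomicAccuracy",
)


def least_accurate(categories):
    """Return the lowest category present, or None for an empty input.

    Raises ValueError for a category outside ACCURACY_ORDER.
    """
    ranked = [ACCURACY_ORDER.index(category) for category in categories]
    return ACCURACY_ORDER[min(ranked)] if ranked else None


def declare_accuracy(records):
    """Build a dataAccuracyDeclarations mapping from per-record categories."""
    declaration = {}
    for attribute in ACCURACY_ATTRIBUTES:
        values = [record[attribute] for record in records if attribute in record]
        worst = least_accurate(values)
        if worst is not None:
            # Attributes with no per-record values are simply omitted (0:1).
            declaration[attribute] = worst
    return declaration


# Hypothetical per-record categories.
records = [
    {"spatialAccuracy": "High", "temporalAccuracy": "Medium"},
    {"spatialAccuracy": "Low", "temporalAccuracy": "High"},
]
print(declare_accuracy(records))
```

A single low-accuracy record is enough to lower the declaration for the whole dataset, which is the conservative reading of "least accurate record".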

Vocabulary

[current approved version: 2020.0]

The Vocabulary for Dataset defines enumerations for the attributes above. These are controlled lists of defined terms, which may be used either in full or as a reduced subset relevant to the purpose at hand. They should not be modified or augmented with additional terms, as this would prevent shareability and effective aggregation.
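
The "reduced subset, no additions" rule can be expressed as a set comparison. The sketch below uses the dcterms:accessRights terms from the vocabulary table; the function name `unknown_terms` and the idea of a system-specific local list are assumptions made for illustration.

```python
# Controlled terms for dcterms:accessRights, as listed in the vocabulary table.
ACCESS_RIGHTS_TERMS = frozenset({
    "Open access",
    "Embargoed access",
    "Restricted access",
    "Pending public release",
    "Metadata only access",
})


def unknown_terms(local_terms, controlled_terms=ACCESS_RIGHTS_TERMS):
    """Return, sorted, any local terms that are not in the controlled list.

    An empty result means the local list is a valid (possibly reduced) subset.
    """
    return sorted(set(local_terms) - set(controlled_terms))


# A reduced subset is acceptable; an invented term is flagged.
print(unknown_terms(["Open access", "Restricted access"]))
print(unknown_terms(["Open access", "Members only"]))
```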

Provisional

The vocabulary is part of the published standard. Be warned that it is subject to larger changes than the core terms and attributes. Reaching consensus with the larger scientific community is important to us. If you are interested in helping with this work, please see the contribute page.

Entity | Attribute Name | Vocabulary terms | Comments
dmmCoreTerms | dcterms:accessRights
Open access
Embargoed access
Restricted access
Pending public release
Metadata only access
Comment: Need to validate these terms and adjust as necessary. Must ensure mutual exclusivity and comprehensive coverage.
dmmCoreTerms | dcterms:license
Creative Commons zero (CC 0)
Creative Commons Attribution (4.0) international (CC-BY 4.0)
Creative Commons Attribution Non-commercial (CC-BY-NC)
dmmCoreTerms | datasetStatus
Active - unpublished - unverified
Active - unpublished - partially verified
Active - unpublished - fully verified
Active - published - unverified
Active - published - partially verified
Active - published - fully verified
Complete - unpublished - unverified
Complete - unpublished - partially verified
Complete - unpublished - fully verified
Complete - published - unverified
Complete - published - partially verified
Complete - published - fully verified
Archived - unpublished - unverified
Archived - unpublished - partially verified
Archived - unpublished - fully verified
Archived - published - unverified
Archived - published - partially verified
Archived - published - fully verified
Comment: Need to validate these terms and adjust as necessary. Must ensure mutual exclusivity and comprehensive coverage.
dmmCoreTerms | methodType
Opportunistic/ad-hoc observation
Systematic method-based survey
samplingProtocolDomain | samplingProtocolMethod
Ecology
Air quality - Fixed sensor
Air quality - Mobile sensor
Bat survey - Echolocation recorder
Bat survey - Harp trapping
Beach profile survey - Emery method
Beach profile survey - Optical method
Bird survey - Distance sample (along transect)
Bird survey - Fixed-area
Bird survey - Fixed-time
Bird survey - Fixed-time & Fixed-area
Bird survey - Mist netting
Fauna survey - 2-Ha track plot method
Fauna survey - Active search
Fauna survey - Aerial distance sampler method
Fauna survey - Cage trapping
Fauna survey - Call playback
Fauna survey - Camera trapping
Fauna survey - Elliot trapping
Fauna survey - Funnel trapping
Fauna survey - Hair tubes
Fauna survey - Nest box monitoring
Fauna survey - Pitfall trapping
Fauna survey - Scat survey
Fauna survey - Spotlight search
Fauna survey - Strip transect aerial survey
Fauna survey - Turtle trapping
Fish survey - Electrofishing
Fish survey - Set net/trap
Fish survey - Sweep netting
Insect survey - Black light
Insect survey - Malaise trap
Insect survey - Baited trap
Insect survey - Glue trap
Insect survey - Sweep netting
Riparian condition assessment - Rapid Appraisal of Riparian Condition (RARC)
Vegetation condition assessment
Vegetation survey - General transect & plot
Vegetation survey - Intensive inventory
Vegetation survey - Step point method
Water quality - Standardised physical/chemical attribute measurements
Water quality - Macroinvertebrate survey
Comment: Possible additional sampling protocols may include:
samplingProtocolWater
samplingProtocolMarine
samplingProtocolLimnology
samplingProtocolClimate
samplingProtocolAtmosphere
samplingProtocolSoils
samplingProtocolGeology
samplingProtocolChemistry
samplingProtocolPhysics
etc.

Methods should be unique within a vocabulary, but may occur in more than one vocabulary.

Domain-based protocol vocabularies may already exist for other domains, but they have not been identified as part of this current activity.
dmmCoreTerms | dataAccessMethod
Open access structured raw data download from this system
Open access opaque raw data file attached in this system
Limited structured raw data access in this system - via request (subject to embargo)
Opaque raw data file attached in this system - via request
Open access structured raw data download from external source
Closed access structured raw data download from external source
Application Programming Interface (API)
Raw data not available
Only derived/interpreted data products available
dmmExtensionTerms | datasetUpdateFrequency
Triennial
Biennial
Annual
Semi-annual
Three times a year
Quarterly
Bi-monthly
Monthly
Semi-monthly
Bi-weekly
Three times a month
Weekly
Semi-weekly
Three times a week
Daily
Continuous
Irregular
No further updates
Comment: Need to validate these terms and adjust as necessary. Must ensure mutual exclusivity and comprehensive coverage.
dmmExtensionTerms | dataQualityAssuranceMethod
Data owner curated
Subject matter expert record verification
Crowd-sourced record verification
Record annotation
System supported data attribute configuration
No DQ methods used
Not applicable
Comment: Need to validate these terms and adjust as necessary. Must ensure mutual exclusivity and comprehensive coverage.
dataAccuracyDeclarations | spatialAccuracy
High
Medium
Low
dataAccuracyDeclarations | temporalAccuracy
High
Medium
Low
dataAccuracyDeclarations | speciesIdentificationAccuracy
High
Medium
Low
dataAccuracyDeclarations | nonTaxonomicAccuracy
High
Medium
Low
datasetAssociatedMedia | datasetAssociatedMediaType
Image file
Image URL
Audio file
Audio URL
Video file
Video URL
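
The 18 datasetStatus terms listed above appear to combine three independent facets: a lifecycle stage, a publication state and a verification level. As a hedged sketch (the facet names `LIFECYCLE`, `PUBLICATION` and `VERIFICATION` and the parser are illustrative labels, not terms defined by the standard), the list can be generated and parsed as follows:

```python
from itertools import product

# Facet values read off the datasetStatus vocabulary; the facet names are assumptions.
LIFECYCLE = ("Active", "Complete", "Archived")
PUBLICATION = ("unpublished", "published")
VERIFICATION = ("unverified", "partially verified", "fully verified")

# Every combination, in the same order as the vocabulary list (3 x 2 x 3 = 18 terms).
DATASET_STATUS_TERMS = [
    " - ".join(parts) for parts in product(LIFECYCLE, PUBLICATION, VERIFICATION)
]


def parse_dataset_status(term):
    """Split a datasetStatus term into its three facets, or raise ValueError."""
    parts = term.split(" - ")
    if len(parts) != 3:
        raise ValueError(f"not a datasetStatus term: {term!r}")
    lifecycle, publication, verification = parts
    if (lifecycle not in LIFECYCLE
            or publication not in PUBLICATION
            or verification not in VERIFICATION):
        raise ValueError(f"not a datasetStatus term: {term!r}")
    return {
        "lifecycle": lifecycle,
        "published": publication == "published",
        "verification": verification,
    }


print(len(DATASET_STATUS_TERMS))
print(parse_dataset_status("Complete - published - partially verified"))
```

Treating the facets separately makes it easier to filter datasets, for example to select only published ones, while still exchanging the single combined term.
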
Last updated on by imitton