PPSR Core

Approach to addressing Data Quality and Quality Assurance Processes

July 2, 2020 · 10 min read

Peter Brenton

Atlas of Living Australia

The citizen science domain is often criticized for generating data which has limited scientific value. In some cases such criticism may be fair, but in many cases citizen science is unfairly labelled as producing "poor quality" data by data consumers who have not properly considered factors such as "fitness for use".

In general, data users are interested in the following questions when judging whether data can be trusted as fit for use:

Where has the data come from?
Who collected it?
Is the accuracy and precision of the data adequate for my intended use?
Were the collection, treatment methods and equipment suitable for my intended use?
Is the data comprehensive & complete?
Were the collection and treatment methods applied consistently for the whole dataset?
What quality assurance, curation, validation & management processes have been applied to the data?
What are the known and implied biases in the data?
What conditions apply to using the data?
How should the dataset be cited or referenced?
How can the data be accessed and in what format(s) is it available?
Is a data management plan available for the data?

The PPSR-Core project is attempting to address many of the causal factors and underlying issues, in both data and metadata, that affect trust in citizen science data. Whilst many of these issues are not unique to citizen science, the citizen science domain is frequently criticised for failing to meet expectations in these areas.

Data Quality and Fitness for Use - some general principles

All datasets have limitations.
Data quality is largely a relative concept that relates to “fitness for the intended purpose”.
Fitness for use should be assessed from the integrity of the data rather than from the type of source, personalities or reputations.
Metadata that describes the processes and protocols by which the data was created and treated is a critical piece of information to assist data consumers in making informed choices about fitness for use.
Accuracy & precision are important for data interpretation and analysis.
The more accurate a record is, the higher its “quality”, as it has more potential for re-use across a wider range of situations.
Data precision can be reduced for a particular use, but it cannot be enhanced. Therefore the more precise one can be at the point of making a record, the more utility the record has.
Lack of trust and provenance prejudice often lie at the heart of the criticisms being levelled at citizen science in respect to perceptions of poor quality data.
Software design is an important factor in improving DQ/QA processes and outcomes, but it is only part of the solution. Project and data owners also need to take responsibility for these and employ a range of off-system solutions.
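
The principle above that precision can be reduced but not enhanced can be illustrated with a minimal sketch. The function name and the coordinate value are hypothetical and are used purely for illustration; rounding is only one of several ways a dataset might be generalised.

```python
def generalise_coordinate(value: float, decimal_places: int) -> float:
    """Reduce the precision of a coordinate by rounding it."""
    return round(value, decimal_places)


# Hypothetical latitude recorded to five decimal places.
recorded_lat = -35.27634

# Precision can be reduced for a particular use...
coarse_lat = generalise_coordinate(recorded_lat, 1)

# ...but asking for more decimal places later cannot bring back
# the detail that was discarded.
restored_lat = generalise_coordinate(coarse_lat, 5)
```

Here `restored_lat` is still the coarse value, which is why it pays to capture the most precise value available at the point of making the record.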

Factors Affecting Trust in Citizen Science Data

Accuracy

How closely a data value for a property represents the real/true value of the property being measured.

Problem: Data may not contain any metadata at the dataset level or meta-attribute information at the record level to describe accuracy for spatial, temporal, taxonomic and/or other aspects of the data. This makes it difficult for data users to interpret whether the data is an accurate representation of what it is purported to be.

Recommended Best Practices: Application developers should incorporate as many measures as possible into the design of their systems, in both the metadata and observational data, to minimize opportunities for users to enter inaccurate data. These include but are not limited to:

Pre-population of fields where possible and appropriate
Locking of pre-populated values where only one value is possible (e.g. single species surveys)
Pre-population of date, time and co-ordinate fields derived from image metadata
Data type controls for consistent field behavior, such as date and time pickers; categorical single and multi-select lists, preferably based on controlled and/or standardized vocabularies; maximum and/or minimum acceptable range validation when appropriate; and decimal and integer numeric options used appropriately
Mandatory and optional field controls
Validation and value constraints on fields
Cross-attribute validation and/or options filtering when appropriate
For species-based attributes, limiting choices to locally applicable species, linking to identification support tools, etc.
Auto-population of calculated values with auto-calc algorithm support
Making fields that should not be edited non-editable
Mapping behaviors to minimize spatial errors (e.g. spatial out of range warnings, northern/southern hemisphere constraints, longitude range constraints, etc.)
Field level help/tool tips
Accuracy and precision tagging support on key attributes (e.g. species, spatial)
Attribute mapping to industry standards at the database level
Data visibility constraint configuration/management
Data verification/moderation/curation features

Accuracy measures should be included within the data and metadata itself.
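
Several of the entry-time controls listed under Accuracy (species lists, range validation, hemisphere constraints and value constraints) can be sketched as a single validation function. This is an illustrative sketch only: the `Observation` fields, the species list and the southern-hemisphere rule are all assumptions made for the example and are not part of any PPSR-Core specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical list of locally applicable species for a project.
ALLOWED_SPECIES = {"Dacelo novaeguineae", "Vombatus ursinus"}


@dataclass
class Observation:
    species: str
    latitude: float
    longitude: float
    observed_on: date
    count: int


def validate(obs: Observation, today: date, southern_only: bool = True) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if obs.species not in ALLOWED_SPECIES:
        errors.append("species not in the project's species list")
    if not -90.0 <= obs.latitude <= 90.0:
        errors.append("latitude out of range")
    elif southern_only and obs.latitude > 0:
        errors.append("latitude is in the northern hemisphere")
    if not -180.0 <= obs.longitude <= 180.0:
        errors.append("longitude out of range")
    if obs.observed_on > today:
        errors.append("observation date is in the future")
    if obs.count < 1:
        errors.append("count must be a positive integer")
    return errors


good = Observation("Vombatus ursinus", -35.28, 149.13, date(2020, 6, 30), 2)
bad = Observation("Unknown sp.", 35.28, 200.0, date(2020, 6, 30), 0)
good_errors = validate(good, today=date(2020, 7, 2))
bad_errors = validate(bad, today=date(2020, 7, 2))
```

Checks like these only catch values that are implausible. They cannot confirm that a plausible value is correct, which is why verification and moderation features are still needed.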

Precision

The range of uncertainty in the measured resolution of a property. Measures of the degree of precision in respect to the attribute types (spatial, temporal, taxonomic and/or other aspects of the data) can be important when analyzing the data.

Problem: Including precision measures in citizen science projects can seem like an unnecessary complication for public contributors and therefore they are often omitted.

Recommended Best Practices:

Precision measures should be included within the data for all key attributes within the observational data to indicate the range of uncertainty in each observational measurement.
Units of measure should always be included within the database name as well as in the field display label or suffix.
Dataset metadata should indicate the maximum and minimum range values for precision of each attribute type in the dataset. This should ideally be dynamically derived from the observational data itself and stored as metadata properties that can be conveyed to other data repositories when sharing metadata.

Precision measures should be included within the observational data at individual record level itself.
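
As a rough sketch of deriving the dataset-level precision range dynamically from the records, the following assumes each record is a dictionary carrying a per-record uncertainty value. The field name `coordinate_uncertainty_m` is hypothetical; it carries its unit (metres) in the name, in line with the recommendation above.

```python
from typing import Dict, Iterable, Optional


def precision_range(records: Iterable[dict], field: str) -> Optional[Dict[str, float]]:
    """Return the min and max precision values found in the records, or None if there are none."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return None
    return {"min": min(values), "max": max(values)}


# Hypothetical records; the third has no precision measure recorded.
records = [
    {"id": 1, "coordinate_uncertainty_m": 10.0},
    {"id": 2, "coordinate_uncertainty_m": 250.0},
    {"id": 3, "coordinate_uncertainty_m": None},
]

summary = precision_range(records, "coordinate_uncertainty_m")
```

The resulting summary could then be stored as a dataset metadata property and recalculated whenever records are added or edited.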

Timeliness

Whether records are temporally relevant to the intended use (e.g. previously present, now absent). Observational data provides information about an observed situation at a specific time in which the subject is either present, absent or in a particular state at that time. This temporal dimension of observational data is very important for many reasons.

Problem: Things change over time and therefore old observations may not be representative of the current state of the subject when the data are interpreted. As data accumulates and/or is aggregated over time, this temporal-state dimension needs to be given profile in the dataset metadata so that it can be effectively accounted for and the data context correctly interpreted.

Recommended Best Practices:

Include start and end dates in the dataset metadata to indicate the earliest and latest record dates for observational records.
Include last update date in the dataset metadata as an empirical indicator of data currency and creation/curation activity.
Include a dataset status in the dataset metadata as a categorical indicator of data currency and creation/curation activity. This should use a standardized controlled vocabulary of terms.
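
The start, end and last update dates recommended under Timeliness could be computed from the records rather than typed in by hand. The sketch below assumes observation dates are available as Python `date` objects; the metadata key names are hypothetical and not taken from any standard.

```python
from datetime import date
from typing import Dict, Iterable, Optional


def temporal_coverage(record_dates: Iterable[date], last_updated: date) -> Dict[str, Optional[str]]:
    """Summarise the earliest and latest record dates plus the last update date."""
    dates = sorted(record_dates)
    return {
        "start_date": dates[0].isoformat() if dates else None,
        "end_date": dates[-1].isoformat() if dates else None,
        "last_updated": last_updated.isoformat(),
    }


# Hypothetical observation dates, deliberately out of order.
observed = [date(2019, 11, 3), date(2018, 4, 21), date(2020, 6, 30)]
coverage = temporal_coverage(observed, last_updated=date(2020, 7, 2))
```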

Completeness

Whether all of the required data is provided for all records in the dataset. It is very common for datasets to contain incomplete records, particularly for attributes which are not mandatory.

Problem: This can particularly be an issue in citizen science projects which try to keep the barrier to public entry as low as possible by minimizing both the skills and effort required of public participants and the complexity of the tasks they are expected to perform. This can result in omission of the meta-information required for interpretation of the data.

Recommended Best Practices:

Database and system design should ensure that all key data attributes are set as mandatory, in particular event attributes such as location and date, as well as any attributes on which calculated values, downstream processes, etc. are dependent.
If appropriate, use pre-population and multi-layered and cross validation approaches to improve completeness and minimize expectations on users.
Dataset metadata should include a statement on completeness of the dataset, as text and/or categorical vocabulary-based selection.
Apply standardized mechanisms to consistently describe the provenance, treatment, DQ & QA curation, as well as data management, access, use and constraints of datasets.

Data and software based solutions to this issue are limited and project owners should look mainly to off-system solutions. As a general rule, the more complex the required sampling protocol is, the more effort needs to be put into participant training and support in respect to data acquisition and to data verification, moderation and curation mechanisms, post-acquisition.
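
One possible way to back a completeness statement with numbers is to count how many records have all key attributes populated. The mandatory field names below are assumptions chosen for illustration; a real project would use whichever attributes it has designated as mandatory.

```python
from typing import Dict, List, Sequence

# Hypothetical set of key attributes that every record should carry.
MANDATORY_FIELDS = ("latitude", "longitude", "event_date")


def completeness(records: List[dict], mandatory: Sequence[str] = MANDATORY_FIELDS) -> Dict[str, float]:
    """Count records in which every mandatory field has a non-empty value."""
    total = len(records)
    complete = sum(
        1 for r in records
        if all(r.get(field) not in (None, "") for field in mandatory)
    )
    percent = round(100 * complete / total, 1) if total else 0.0
    return {"total_records": total, "complete_records": complete, "percent_complete": percent}


records = [
    {"latitude": -35.28, "longitude": 149.13, "event_date": "2020-06-30"},
    {"latitude": -33.87, "longitude": 151.21, "event_date": ""},
    {"latitude": -37.81, "longitude": 144.96, "event_date": "2020-05-12"},
]

report = completeness(records)
```

A figure like this only describes missing values. It says nothing about whether the sampling itself was comprehensive, which remains a matter for the textual completeness statement.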

Consistency

Whether methods have been applied in the same way for all records. Consistent application of sampling protocols in data collection requires knowledge of applicable methodologies and consistency and discipline in their use.

Problem: As more individuals participate in a sampling activity, the risk of getting variability in the application of sampling protocols increases significantly, especially when participating individuals have limited or no training in the protocols. When sampling protocols involve instruments or sensors, regular calibration is important to ensure temporal comparability of data at a location and spatial comparability across locations using different equipment.

Recommended Best Practices:

Dataset metadata should include detailed statements on the sampling protocol(s) used, including any variations from published protocols. Standardized controlled categorical vocabularies and ontologies should be used wherever possible instead of free-text based descriptors.
Dataset metadata should include a statement of how consistency issues have been addressed.

Data and software based solutions to this issue are limited and project owners should look mainly to off-system solutions. As a general rule, the more complex the required sampling protocol is, the more effort needs to be put into participant training and support in respect to data acquisition and to data verification, moderation and curation mechanisms, post-acquisition.

Validity

Whether records are factually correct. Data validity is an observational record level issue which is a function of the observer's interpretation of the thing being observed. For sensor-based observational records, data validity is more a function of the serviceability and calibration of the sensor than of human interpretation of the thing being observed.

Problem: Human observers frequently make mistakes in their interpretations of what they observe and how they record their observations. Sensors and instruments may arguably be more robust and reliable than most human observers, but they can break down or lose their calibrated accuracy. Systems can't know whether something is factually correct or not unless there is additional information and rules which they can use to cross-reference and validate recorded observations, for example, using artificial intelligence and machine learning (AI/ML) algorithms.

Recommended Best Practices:

Include database level controls on different data types such as those described under "Accuracy" above.
Data validation, verification, moderation and curation processes and capabilities need to be built into the data collection and management pipeline, including system design support for such capabilities.

Data and software based solutions to this issue are somewhat limited. Project owners should therefore look to augment these with appropriate off-system solutions. Depending on the subject matter and complexity, effort needs to be put into participant training and support in respect to data acquisition and to data verification, moderation and curation mechanisms, post-acquisition.

References

The following references were used to inform the above content:

A. Alabri and J. Hunter, "Enhancing the Quality and Trust of Citizen Science Data," 2010 IEEE Sixth International Conference on e-Science, Brisbane, QLD, 2010, pp. 81-88, doi: 10.1109/eScience.2010.33.

A. Chapman, "Principles of Data Quality", version 1.0, Report for the Global Biodiversity Information Facility, Copenhagen, 2005.

E. Aceves‐Bueno, A. S. Adeleye, M. Feraud, Y. Huang, M. Tao, Y. Yang and S. E. Anderson, "The Accuracy of Citizen Science Data: A Quantitative Review", The Bulletin of the Ecological Society of America, first published 29 September 2017, https://doi.org/10.1002/bes2.1336

J. Hunter, A. Alabri and C. van Ingen, "Assessing the quality and trustworthiness of citizen science data", Special Issue Paper, Wiley Online Library, first published 13 September 2012, https://doi.org/10.1002/cpe.2923
