Guidance

The Government Data Quality Framework: guidance

Published 3 December 2020

This guidance document supports the main Data Quality Framework. It provides a set of practical tools and techniques which can be used to assess, communicate and improve data quality.

Data quality action planning

This advice is aimed at data owners and data managers who are responsible for actively managing data sets over time.

What is a data quality action plan?

A data quality action plan will help you to identify and understand the strengths and limitations of your critical data. It will help you to demonstrate whether your data is fit for purpose, understand where to put resource to improve its quality and set out goals to consistently improve your data.

How often should I create a data quality action plan?

You should review your data quality regularly, but the frequency of this will be specific to your organisation and data.

Regularly benchmarking data quality levels allows you to assess changes and measure quality over time.

How to create your data quality action plan

1. Identify your critical data

Identify the data that is most important for the work you are doing. Not all of your data is critical, so it is important to target your resources effectively at the data that is.

Critical data is data on which business or operational success depends, and which is the most important for organisational decision making. Setting out your critical data within your process, during the planning stage, will help you to decide what fit for purpose looks like.

When identifying your critical data, you should consider:

  • whether business or operational success depends on this data
  • whether the data is vital for decision-making in your organisation or team
  • whether there is a high impact on your operations if this data is of low quality

Within your critical data, identify the fields or tables that have the greatest impact on users. For example, very few business processes will fail if ‘fax number’ is incomplete. The importance of fields or tables will vary depending on how the data is used. Assessing data quality for different uses will therefore give different results.

2. Identify your data quality rules

Set out data quality rules which priority fields in your data must follow to be fit for purpose. These rules set out the quality requirements for your data and should align with your user needs and business objectives, as well as being realistic and achievable. These data quality rules are different to business rules that might be applied as validation or standardisation routines during data processing.

Data quality rules allow for the measurement of data quality dimensions: measurable characteristics of data quality that can act as a guide for your rules. If you are using data quality dimensions, prioritise them according to your user needs.

For example, you may have a data quality rule for timeliness which states that data must be entered into your database within three days of collection. Data must conform to this rule to be considered ‘timely’, and the timeliness of the data can be measured against this rule.

For each check, decide what fit for purpose looks like. These are not rules that every value must meet; rather, they describe what typical good quality looks like. For example, an animal’s date of birth will typically be within the last 50 years, but the data set would not suddenly become unfit for purpose if a tortoise born in 1900 was entered. We might therefore set a target of 98% conformance for this check.
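As a rough illustration, here is a minimal sketch of how rules like these might be checked in code. The record structure and field names are assumptions for the example, not part of the framework:

```python
from datetime import date, timedelta

# Hypothetical records; field names are assumptions for the example.
records = [
    {"collected": date(2020, 11, 2), "entered": date(2020, 11, 4),
     "date_of_birth": date(2015, 6, 1)},
    {"collected": date(2020, 11, 2), "entered": date(2020, 11, 9),
     "date_of_birth": date(1900, 1, 1)},
]

# Rule 1 (timeliness): data must be entered within three days of collection.
timely = [r for r in records if r["entered"] - r["collected"] <= timedelta(days=3)]

# Rule 2 (typical good quality): a date of birth is usually within the last
# 50 years. A tortoise born in 1900 is still valid, so conformance is
# measured against a target (say 98%) rather than enforced on every value.
cutoff = date.today() - timedelta(days=50 * 365)
plausible = [r for r in records if r["date_of_birth"] >= cutoff]

print(f"timeliness: {len(timely) / len(records):.0%}")
print(f"date of birth within 50 years: {len(plausible) / len(records):.0%} (target 98%)")
```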

3. Perform an initial data quality assessment

Measure your data against your data quality rules to get a baseline understanding of its level of data quality. You should only be measuring things that have been identified as critical and that are tied to a specific data use.

Metrics

Use metrics to measure your data’s compliance with your quality rules. Use the metric that is most appropriate for the check you are measuring; you may use different metrics for different fields and checks within the same data set. Quantitative metrics will allow you to see clear trends over time.

Here are some examples of metrics that you may wish to use:

  • percentages: measuring the whole data set, or a part of it - percentages can indicate the scale of a problem
  • count: measuring errors, particularly where this can’t be considered as a percentage of the data set - typically counts are used to measure incorrect data
  • true or false: things that will compromise the entire data set if they are wrong
  • ratio: the ratio of errors or problems to data without errors or problems

When you have decided on the checks that form part of your data quality assessment it is recommended that you automate them. Automating your assessments can save time and resources and can help to ensure your measurements are consistent.
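For example, here is a sketch of how these kinds of metrics might be computed automatically, using hypothetical fields (`customer_id`, `postcode`) and an illustrative, simplified postcode format check:

```python
import pandas as pd

# Hypothetical data set; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "postcode": ["SW1A 1AA", None, "EC1A 1BB", "INVALID"],
})

# Illustrative format check for UK postcodes (simplified pattern).
valid = df["postcode"].str.match(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}", na=False)
invalid = df["postcode"].notna() & ~valid

metrics = {
    # Percentage: indicates the scale of missing values in a field.
    "postcode_complete_pct": round(df["postcode"].notna().mean() * 100, 1),
    # Count: number of entered values failing the format check.
    "invalid_postcode_count": int(invalid.sum()),
    # True/false: a failure here compromises the entire data set.
    "ids_unique": bool(df["customer_id"].is_unique),
    # Ratio: records with errors to records without.
    "error_ratio": f"{int(invalid.sum())}:{int(valid.sum())}",
}
print(metrics)
```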

4. Document your findings

Log your data quality over time by documenting results of each data quality assessment. This means that problems can be identified and fixed, and these documented findings can be used as a benchmark to compare against future data quality assessments.

Logging data quality also reveals trends, which can help identify points of failure in the data or systems and prevent future problems.

Documentation helps future users:

  • understand previous data quality problems
  • know where improvements may need to be made in the future
  • get information about where data quality may limit the use of the data
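A minimal sketch of such a log, assuming results are appended as dated rows to a CSV file (the file name and metric names are illustrative):

```python
import csv
from datetime import date
from pathlib import Path

def log_assessment(metrics: dict, log_path: str = "data_quality_log.csv") -> None:
    """Append one dated row of assessment results to a running log."""
    path = Path(log_path)
    is_new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", *metrics])
        if is_new_file:
            writer.writeheader()
        writer.writerow({"date": date.today().isoformat(), **metrics})

# Each assessment appends a dated row, giving a benchmark for future
# assessments and a record of trends over time.
log_assessment({"postcode_complete_pct": 75.0, "invalid_postcode_count": 1})
```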

5. Identify and prioritise potential improvements

When you have assessed your data, identify which areas require improvements and prioritise the most pressing ones.

6. Define goals for data quality improvement

Once you have prioritised the areas to focus on, set goals to improve the data quality. The goals should set out the specific fixes you aim to implement, while considering the return on investment of fixing data quality problems.

When creating your goals, consider:

  • the importance of the data affected by data quality problems
  • how much data is affected
  • the risk that the data quality problem creates
  • the cost of making improvements

7. Identify the root cause and take action to address this

Root cause analysis involves understanding your data’s fundamental quality problem and solving it at source.

For checks that do not meet your data quality rules, conduct a root cause analysis to identify how the errors arose. Identify whether the cause of the errors is a systemic problem or the result of a one-off event, such as an incorrect data migration or import. Remember that there may be more than one cause.

Only when you understand the root cause of an error can you take effective action to address the problem. Root cause analysis ensures that you are treating the cause of poor data quality, rather than the symptoms.

Decide on actions to take

There are several different ways to improve data quality. The actions you take will depend on the root cause of the problems you have identified, along with the specific needs of your organisation and users.

Data quality remediation actions include:

  • ensuring that you have an organisation-wide data management strategy, and that teams understand and implement the principles and policies within it
  • introducing data quality checks for data entry, such as data validation
  • improving data storage and data architecture
  • improving training and guidance for those involved in data entry
  • introducing automation, such as validation on data entry, automated quality checks or using specialist coding tools rather than spreadsheets
  • addressing team culture and behaviours, such as creating a clear culture of accountability for data
  • correcting low quality data directly (but this can be risky and cause more problems if done incorrectly, so prioritise other fixes before attempting this)
  • accepting the risk and revisiting the issue in the future, though this action should only be taken after weighing up the trade-offs between the cost of correcting the root cause and the value of high quality data

To choose which actions to take, you should assess the costs and benefits of different options and put together a plan to implement the solution.

8. Report on your data quality

There are different ways to report data quality. Choose a method of reporting that communicates your data quality most clearly and presents data quality in the context of being fit for purpose. You may need to communicate quality differently with different audiences. Stakeholders and data users must be equally clear on the strengths and limitations of your data.

Informing users about your data quality is important for them to understand the data and use it appropriately. If the data quality action plan highlights any problem areas in the data, caveat these when telling users about the data quality. If you have implemented improvements following the outcomes of the action plan, it is important to discuss these with users, so they know what has changed and how it might affect the data.

9. Repeat measurements of data quality over time

Return to your data set and assess it using the same assessment methods as before. This will allow you to compare the data quality over time. Record the data quality and take actions as necessary.

Case study

The following case study provides examples of actions NHS Digital has taken to improve data quality:

NHS Digital: Developing the Data Quality Maturity Index

Data quality root cause analysis

Why is root cause analysis important?

Root cause analysis involves finding and fixing the cause of problems, rather than applying superficial fixes to problems as they occur.

Data quality issues often lie underneath a large proportion of technical and system problems. The root cause is frequently not established, and even when it is, sticking plaster fixes are applied to the symptoms rather than to the cause.

Problems can become accepted and repeat fixes become part of processes, consuming time and resource. Completing root cause analysis means that problems can be dealt with at source, avoiding these costly and inefficient fixes.

How can root cause analysis be used?

Root cause analysis can be used to identify data quality issues that have caused problems in work projects and processes. The data quality dimensions can help to diagnose specific data quality problems and action plans can be used to address them.

1. Log data quality problems

The first step to effective root cause analysis is to have an effective process for reporting and managing data quality risks and issues. Risks should be categorised by severity and the most serious should be reported to senior managers. Data quality action plans can be used to identify and report problems in data quality and to set out steps for improvement.
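One illustrative way to structure an issue log with severity-based escalation is sketched below; the severity categories and field names are assumptions, not a prescribed format:

```python
from dataclasses import dataclass, field
from datetime import date

# Severity categories are illustrative; use your organisation's own scale.
SEVERITIES = ("low", "medium", "high", "critical")

@dataclass
class DataQualityIssue:
    dataset: str
    description: str
    severity: str
    raised: date = field(default_factory=date.today)

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"severity must be one of {SEVERITIES}")

    def needs_senior_escalation(self) -> bool:
        # The most serious risks should be reported to senior managers.
        return self.severity in ("high", "critical")

issue = DataQualityIssue("payroll", "duplicate employee IDs after migration", "high")
print(issue.needs_senior_escalation())  # True
```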

2. Understand the data journey

The system landscape in most organisations is complex. It can be difficult to understand where data comes from and where errors originate. A clear understanding of the data lifecycle makes it easier to find the likely source of a problem.

3. Estimate the cost of fixing and not fixing

Often fixes are not made because the cost of fixing is seen as too high. The cost of not fixing should therefore also be estimated and understood; it is easy to underestimate once individuals start to accept multiple data quality problems.

4. Fix as close to source as possible

Always fix problems in data quality as close to the source as possible. Sometimes a short-term fix will need to be applied, but try to push the fix closer to the source over time.

5. Is it correct for its original purpose?

As data moves through the system, its use can change. You should assess whether the data is suitable for both its original purpose and its future use. This will affect where the proposed fix is applied and whether data at source can and should change.

6. Continue to monitor your data

Data quality is never perfect, and it degrades at different rates in different applications and areas. Monitor the degradation of data quality, consider how to improve it, and understand where errors creep into the system.

Case study

The following case study describes how the Government Digital Service addressed the root cause of their data quality issues:

Government Digital Service: Improving pipeline data quality

Data quality and metadata

This guide gives advice on how metadata can support your data quality work. It is for anyone managing data, creating metadata or using data managed by others.

What is metadata and why is it important?

Metadata is information about a data set. The structure of metadata will vary between data sets, but it typically contains information such as:

  • the name of the data set
  • a description of the content
  • the area or time period the data covers
  • the frequency of updates
  • any other important information

Good metadata will help people working with the data to understand the context in which data was collected, helping them to use and interpret it properly. Metadata can also include information about the quality of data.

All well-managed data has metadata. Properly describing your data will enable it to be better managed, understood and used. Metadata helps reduce duplication, saves time spent searching for data, and improves the usability of data. Good quality, regularly updated metadata is also a route for communicating data quality to users.

Metadata comes in different forms, including written documentation, automatically generated audit information and data catalogued by data set owners.
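As a simple illustration, a metadata record covering the typical fields listed above might look like the following sketch. The structure and field names are illustrative rather than any particular standard:

```python
import json

# Illustrative record only; fields loosely follow the list above and do
# not represent any particular metadata standard.
metadata = {
    "title": "Road traffic counts",
    "description": "Annual average daily traffic counts by road link.",
    "coverage": {"area": "England", "time_period": "2015 to 2019"},
    "update_frequency": "annual",
    # Metadata can also carry data quality information for users.
    "data_quality": "Counts for 3% of links in 2018 are modelled due to sensor outages.",
}
print(json.dumps(metadata, indent=2))
```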

Metadata for data managers

Metadata standards

Different standards exist for metadata; some are set by legislation for specific types of data. The Data Standards Authority has published metadata standards for sharing and publishing data, which recommend the Dublin Core standard. INSPIRE is recommended for creating and managing geographic data.

You should ensure that metadata exists for all data sets you manage and that these records are kept up to date.

Metadata and data quality

Good metadata supports data quality work. If you have a known quality problem, your metadata is a good way of communicating this to your users. This can be included in the description, along with any caveats about using the data. Alternatively, it can be included in a specific data quality metadata section if available. This should not be your only way of communicating the quality of the data. See communicating data quality to users for more information.

Metadata should be updated whenever a data set is updated. It is important that your metadata does not give false information about the quality of your data by including out-of-date information.

If your metadata is updated less frequently than the data set, you should only include information on known quality problems that will not change in the short term. Examples of long-term quality problems include:

  • data for a given year is missing
  • definitions of a classification changed at a certain point
  • dummy codes have been used in certain fields
  • certain geographical areas are missing from the data set

Some metadata standards may include quantitative information about a data set such as completeness checks and error counts. This metadata can only be trusted if it is always updated at the same time as the data set. It may be possible to automate the update of quantitative quality checks in metadata records as part of the processes used to produce a data set.

If you cannot automate population of metadata with the same frequency as the data set production, carefully consider the risks and benefits of including quantitative quality measurements in your metadata record. If you choose to automate quality checks, these should be regularly reviewed to ensure the measurements used continue to be meaningful. The results should be regularly reviewed to investigate any emerging or unusual quality issues.
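A minimal sketch of what such automation might look like, assuming a hypothetical metadata JSON file that is refreshed whenever the data set is produced:

```python
import json
from datetime import date
from pathlib import Path

import pandas as pd

def publish_with_quality_metadata(df: pd.DataFrame, metadata_path: str) -> None:
    """Refresh the quantitative quality checks in a metadata record at the
    same time as the data set is produced, so the two cannot drift apart."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text()) if path.exists() else {}
    metadata["quality_checks"] = {
        "measured_on": date.today().isoformat(),
        "row_count": len(df),
        "completeness_pct": {
            col: round(df[col].notna().mean() * 100, 1) for col in df.columns
        },
    }
    path.write_text(json.dumps(metadata, indent=2))

publish_with_quality_metadata(
    pd.DataFrame({"postcode": ["SW1A 1AA", None]}), "metadata.json"
)
```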

Achieving good quality metadata

Your organisation’s metadata is like any other data set: it is a valuable asset that must be properly maintained and stored consistently. Those responsible for metadata should keep their records up to date and accurate and the corporate owner of the metadata catalogue should actively monitor the quality of metadata.

As with all data sets, it is important to frame quality in terms of fitness for purpose. Typical purposes for metadata include:

  • finding data within an organisation
  • making appropriate use of data sets
  • sharing metadata for integration with other organisations, for example data.gov.uk

Your organisation’s metadata standard will support, but not replace, your assessment of metadata quality. The quality of your metadata should be monitored frequently.

You may also choose to identify critical data sets where users have a higher expectation of, and dependency upon, quality metadata. These data sets may require more stringent monitoring of their metadata to reduce the risk of misuse. These may include data sets that underpin important outcomes elsewhere in government or data sets that are widely shared. As with all data quality monitoring, it is important that the quality of this metadata is measured regularly and that any problems are tackled following root cause analysis.

Metadata for data users

Understanding quality through metadata

Metadata can help data users to understand quality at different stages of data collection, analysis and dissemination. It is not an alternative to proper data quality measurement.

We can use the stages of the data lifecycle to identify aspects of quality that may appear in metadata. For each lifecycle phase below, there are examples of areas where quality may be compromised and the types of metadata that may provide this information.

Plan

Quality considerations:

  • What is the structure of the data set?
  • Is the structure defined and consistent?
  • What user needs were considered?
  • Does it contain unique identifiers?
  • What purpose(s) was the data set designed for?

Metadata examples:

  • Subject information
  • Data set description
  • Structure information

Collect, acquire, ingest

Quality considerations:

  • Have consistent definitions been used across time/location?
  • Has data validation been used?
  • Has data standardisation been included in any data processing?
  • Have any other data sets been incorporated into the data?

Metadata examples:

  • Data set description
  • Controlled lists
  • Third party data rights statement

Prepare, store and maintain

Quality considerations:

  • Has the data been combined with data from another source? If so, how was this done and is the process repeatable?
  • What format is the data held in, and have there been any changes to this?
  • What is the size of the data set?
  • How frequently does the data set update?
  • What is the geographical coverage of the data set?
  • Has it been geocoded? And if so, how?
  • Has data been anonymised in any way? If so, how?
  • What quality monitoring is in place for the data set?
  • Who is responsible for maintaining the data?

Metadata examples:

  • Data set description
  • Process information
  • Technical metadata
  • Ownership metadata
  • Update frequencies
  • Geographical extent

Use and process

Quality considerations:

  • Are there any legal, regulatory or administrative restrictions on access?
  • How were adjustments to the data decided?

Metadata examples:

  • Data set description
  • Data set licence
  • Third party data rights statement
  • Documentation of processes and quality

Share and publish

Quality considerations:

  • Has any data been removed from publication due to legal, regulatory or administrative restrictions?
  • Has any data been anonymised for publication?
  • What format is the published data set supplied in?
  • What quality information do users need to know? For example, methods used to collect data, changes made to the data, contributors involved or known quality issues.

Metadata examples:

  • Data set description
  • National security, personal data or commercial confidentiality flags
  • Documentation of processes and quality

Archive or destroy

Quality considerations:

  • Is there comprehensive information required to supplement the archived data?
  • What is the quality of the data at the point of archiving?
  • What are the known quality issues in the data set?
  • What processing or aggregation was done to prepare for archiving?

Metadata examples:

  • Data management history
  • Data set documentation
  • Data quality measurements over time

Metadata recommendations

  1. Organisations should have a centrally managed metadata system and owner responsible for managing the quality of the organisation’s metadata.
  2. Organisations should have a data ownership policy that includes responsibility for maintaining metadata records for data sets.
  3. Metadata records must be comprehensive throughout a data set’s lifecycle.
  4. Metadata records must be updated in line with data set updates.
  5. Use metadata when reporting the quality of data to users.

Communicating data quality to users

This advice is for anyone working with data who may need to inform others about the quality of the data. It is also for anyone who wants to understand the quality of incoming data and what to expect from data suppliers.

Why communicate data quality?

Understanding data quality is essential to be able to use data effectively. Users need context to decide on appropriate uses of the data, which in turn reduces the risk of misuse. Data quality information can also help users to determine whether the data meets their needs.

Explaining data quality caveats and the reasons behind them gives users a more complete picture of the data and can prevent queries and confusion in the long run.

Users should be kept informed about data quality regularly, and the information should be included with any data sets, reports or other data products delivered to them.

Who needs to know about data quality?

Data quality action planning discusses how to understand your users’ quality needs and priorities. This is important for knowing who needs quality information and how to pitch that information. For example, if you have some data that will go to both an analysis area and to another area for archiving, you may give more technical quality information to the analysts, and more detail about the data lifecycle processes to the archiving area.

What do users need to know?

What users need to know will depend on their needs and any agreements in place between parties.

Is the quality as expected from the terms of agreements in place?

If you have a service level agreement (SLA) or any other documentation outlining the data that will be provided to users, this can be used to report on quality metrics. For example:

  • does the data set contain the expected number of records?
  • does it have all the required variables?
  • has the data been quality assured as expected?
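For illustration, here is a sketch of reporting a delivery against such an agreement, with hypothetical SLA terms and variable names:

```python
import pandas as pd

# Hypothetical terms agreed with the data supplier.
sla = {
    "min_records": 10_000,
    "required_variables": {"nhs_number", "date_of_birth", "postcode"},
}

def report_against_sla(df: pd.DataFrame) -> dict:
    """Compare a delivered data set against the agreed terms."""
    return {
        "expected_record_count_met": len(df) >= sla["min_records"],
        "missing_variables": sorted(sla["required_variables"] - set(df.columns)),
    }

delivery = pd.DataFrame(columns=["nhs_number", "postcode"])
print(report_against_sla(delivery))
# {'expected_record_count_met': False, 'missing_variables': ['date_of_birth']}
```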

Are there any changes to data that users need to know about?

If you are reporting on data trends or changes to data, provide users with context for this change. Are figures naturally volatile, or has there been a change to policy or process that may have contributed to the changes?

It is important to communicate anything that has changed since the last time the data was provided as this may affect interpretation of the data. For example, the data may have been collected in a different way.

Are there any caveats to the data?

Important information about the data lifecycle should be highlighted to users. For example, what changes were made to the data during processing and how were these decisions made? Giving users information about the process will help them make assessments about whether the data is fit for their purposes.

You can use the data quality dimensions to describe data quality trade-offs. For example, if there is a quick turnaround, there may not be time to complete some quality assurance processes which may affect data accuracy.

Does the data quality meet the data quality dimensions?

You can use the data quality dimensions to report on data quality.

Completeness

Tell users if the data set includes all the expected records and if any records are missing important data. It is useful to report on missing data and the reasons for this. If data is missing for a systematic reason, users may need to account for this in any analysis they do. If data is meant to be missing (for example variables that don’t apply to all cases), informing users about this upfront will prevent confusion and queries later in the process.

Uniqueness

Inform users if some records in the data set are not unique. Describe de-duplication processes where appropriate.

Consistency

Data cleaning checks should pick up on any inconsistencies in the data. Describe to users any data cleaning done to resolve inconsistent data and any reasons for inconsistencies that remain.

Timeliness

Timeliness depends on the intended use of the data. Telling users when the data was collected will inform them of the period the data reflects, which will help them decide if the data is appropriate for their needs. Timeliness sometimes trades off against accuracy or the amount of data available to users: timely data may have been collected and processed more quickly, limiting the amount of data collected and the quality assurance done. You can communicate this trade-off when informing users about data quality and provide any justification for it.

Validity

Validity means data is in the expected range and format. Inform users about any data that does not conform to the expected validity rules.

Accuracy

Accuracy is the degree to which data matches reality. Report on any biases in the data that may affect its quality.
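To make this concrete, here is a sketch of a simple per-dimension summary that could accompany a data set. The fields and the validity rule are assumptions for the example; consistency and accuracy usually need more context than simple checks like these:

```python
import pandas as pd

# Hypothetical data set; fields and the age rule are assumptions.
df = pd.DataFrame({
    "record_id": [1, 2, 2],
    "age": [34, -5, 41],  # validity rule: ages should be between 0 and 120
    "collected": pd.to_datetime(["2020-11-01", "2020-11-01", "2020-10-02"]),
})

summary = {
    "completeness": f"{df.notna().all(axis=1).mean():.0%} of records fully populated",
    "uniqueness": f"{int(df['record_id'].duplicated().sum())} duplicate record IDs",
    "validity": f"{df['age'].between(0, 120).mean():.0%} of ages within expected range",
    "timeliness": f"collected between {df['collected'].min().date()} "
                  f"and {df['collected'].max().date()}",
}
for dimension, finding in summary.items():
    print(f"{dimension}: {finding}")
```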

Background to the data

The data lifecycle can be complex, with several parties and processes involved. You can use the stages of the data lifecycle to inform users of any other background information they may need to know, such as how the data was collected and by whom.

Example: why communicating data quality can be important

Imagine you are sending some pay data to another area of the office for analysis. This year some changes were made to the variables to include bonuses in the main total pay variable. There has been a delay getting data from one of the larger business areas, meaning any total organisation figures would need to be treated as provisional and would not reflect the whole organisation. The data set was linked with last year’s data; however, there were some issues with updating leavers and new staff records in the data set.

The potential consequences of the analysis area not receiving these pieces of information are:

  • the incorrect data are used to make organisation-wide pay estimates
  • the figures are inappropriately used to compare pay across years
  • potential linkage error may not be accounted for by analysts

These areas need to be described to data users clearly and in detail so they can make appropriate decisions when using the data.

How should data quality be presented?

There are several ways to present data quality information to users. Whichever way data quality is presented, it is important that this information is easy for users to access and easy to understand, without jargon. Present the most important details early on in quality documentation so users understand key quality issues and are less likely to misinterpret the data.

A lot of the information users need comes from metadata, including information about processes and about the data set’s structure and content. You can report against stages of the data lifecycle, using metadata to describe the journey the data has taken. You can also report against the data quality dimensions.

If producing statistics, it is important to present quality information clearly in line with the Code of Practice for Statistics. For more information, see the Government Statistical Service’s quality statistics in government guidance.

Example: Department for Transport

The Department for Transport (DfT) used a timeline (PDF, 801KB) to communicate methodological changes to the National Travel Survey and outline changes planned for the future. This gave users background to the data and information on how the quality has improved over time. Their quality report is published alongside the main publication.

Example: Office for National Statistics

The Office for National Statistics told users what data was appropriate to use for measuring migration flows:

The number of migrants working in the UK is not a measure of how many migrate to work. While the Labour Force Survey (LFS) data appear consistent with the International Passenger Survey (IPS) migration flows, users should not use the LFS employment trends as a measure of migration flows. The best measure of total migration flows into and out of the UK is the Long-Term International Migration (LTIM) estimates. Not all data sources are directly comparable and users should be aware of these differences before drawing conclusions.

These caveats are included in the publication alongside the data. Providing this information upfront to users reduced the risk of data misuse.

Data maturity models and quality

This section of the framework provides an overview of data maturity models. It is aimed at those who want to take a holistic approach to assessing and improving data quality. It illustrates how maturity models can be useful tools to apply in this context and highlights existing cross-government work in this area.

What is a data maturity model?

Data maturity models are tools used to assess an organisation’s level of data capability and to highlight areas where progress can be made. Data capability is broken down into a set of themes which are assessed against a series of maturity levels. These themes can differ slightly between models, but data quality is typically covered under one of them.

At each level, characteristics are defined which indicate an organisation has reached that stage of maturity. The fictional example below uses five possible maturity levels: level 1 (limited), level 2 (reactive), level 3 (stable), level 4 (proactive) and level 5 (exemplar). For each theme, it shows the level the organisation is at currently and the level it aspires to:

  • Leadership and culture: now at level 2, aiming for level 3
  • Skills: now at level 1, aiming for level 3
  • Tools and architecture: now at level 3, aiming for level 4
  • Data governance: now at level 2, aiming for level 4
  • Quality and standards: now at level 1, aiming for level 5
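An assessment like this can be captured as simple data to help focus resource. A minimal sketch, using the fictional themes and levels above (the structure is illustrative, not part of any published model):

```python
# Fictional assessment from the example above: (current level, aspired level).
assessment = {
    "Leadership and culture": (2, 3),
    "Skills": (1, 3),
    "Tools and architecture": (3, 4),
    "Data governance": (2, 4),
    "Quality and standards": (1, 5),
}

# Rank themes by the size of the maturity gap, largest first,
# to highlight where progress is needed most.
for theme, (now, future) in sorted(assessment.items(), key=lambda kv: kv[1][0] - kv[1][1]):
    print(f"{theme}: level {now} -> level {future} (gap {future - now})")
```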

A maturity model approach can help leaders to set strategic direction for data capability, as well as helping to benchmark progress over time. To drive change, make regular maturity assessments, allowing consistent comparisons over time and across business areas. This promotes a culture of continuous improvement and enables leaders to focus resource on the areas where it is needed most.

Many organisations across the private, public and third sectors have adopted a data maturity model approach in recent years. Examples include maturity models developed by the Local Government Association, the Higher Education Statistics Agency and Data Orchard. Maturity model approaches are also being trialled and implemented in some central government departments and agencies including the Department for Transport, the Home Office and the Environment Agency.

Applying maturity models in a data quality context

There are typically several themes within a data maturity model. Data quality is often incorporated as one of these. In a quality context, maturity models can provide a holistic view of data quality management and highlight areas that need improvement. They can support an organisation to move from ad-hoc or reactive approaches to data quality, towards a culture where managing and communicating quality proactively is considered business as usual.

Many of the tools described in this framework, such as data quality action plans and root cause analysis, can help an organisation to progress and improve its data quality maturity.

Example: Environment Agency

The Environment Agency data integrity maturity model defines five maturity levels, ranging from unacceptable to optimised. The Environment Agency has applied this to its organisational functions and has seen a notable improvement in maturity over recent years.

The model includes a data quality and confidence theme:

Maturity Level 1: Unacceptable
  • We do not consider data quality when creating data or commissioning new IT.
  • We do not measure or report on data quality.
  • We have no plan to improve data quality.
  • Key staff do not have the right skills to monitor and improve data quality.
  • We do not consider the confidence that we have in our data and information.
Maturity Level 2: Improvement Required
  • We have reactive processes in place after data creation to control data input.
  • Our data quality monitoring and reporting is largely ad-hoc or reactive.
  • We cleanse data rather than focus on root causes of data quality issues.
  • We have trained staff to reactively manage data quality.
  • We can define the confidence we have in our data and information qualitatively.
Maturity Level 3: Acceptable
  • There are measures in place at IT/data creation to control data quality.
  • We have data quality action plans for our priority data and use them to improve our data.
  • We have trained staff who proactively monitor and improve data quality.
  • We quantitatively define the confidence we have in our data and information.
Maturity Level 4: Good
  • When we create data or new IT systems, we define the level of data quality and confidence needed for our purposes.
  • We have ongoing data quality monitoring and improvement programmes for our priority data sets.
  • We give our staff the time to ensure data is fit for purpose.
  • We help others understand what we have confidence in using the data for.
  • We are taking actions to improve confidence in the data.
Maturity Level 5: Optimised
  • Monitoring data quality is an integral part of our data set and systems design.
  • Evidence based risks and opportunities from data quality form a key part of our data management and decision making.
  • All data quality issues are dealt with at source.
  • We build data quality and confidence best practice into new systems, data sets and reports.
  • Defining, improving and communicating confidence in our data and information is business as usual.
  • We share data quality and confidence best practice within the business area and with others.

Towards a single data maturity model for government

As maturity models become more common, there is a risk that they are applied inconsistently across government. The Government Data Quality Hub is therefore leading a piece of work to develop a single data maturity model for use across government, with implementation expected in 2021. This single, consistently applied maturity model will cover all aspects of government data. It will develop a cross-government picture of data capability, facilitate more effective targeting of interventions, and promote the sharing of best practice between departments.

The work will build on existing models, standards and frameworks. It is not intended to replace maturity models which have been developed with specific organisational priorities in mind, but rather to work in tandem and allow for consistent assessments across government.

When released, the single data maturity model for government will include reference to data quality and will be accompanied by detailed implementation guidance. Until then, the Environment Agency model may assist you in identifying areas for improvement in your organisation.

For more information about the project, or if you wish to contribute to developing the model, please email [email protected].