NHS England: NHS.UK Reviews Automoderation Tool
The NHS.UK website receives thousands of reviews each year about NHS services. This tool automates a previously manual process by using natural language processing (NLP) techniques to improve efficiency, scalability, and user satisfaction.
Tier 1 Information
Name
NHS.UK Reviews Automoderation Tool
Description
The NHS.UK website receives thousands of reviews each year about NHS services - these reviews need to meet our guidelines about personal information, abuse, discrimination, and other policies before we can publish them. This tool automates a previously manual process by using natural language processing (NLP) techniques to improve efficiency, scalability, and user satisfaction, at a fraction of the cost of manual moderation.
Website URL
https://nhsengland.github.io/datascience/our_work/ratings-and-reviews/
Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
NHS England
1.2 - Team
NHS.UK
1.3 - Senior responsible owner
NHS.UK SRO (Deputy Director of Delivery - NHS.UK & app)
1.4 - External supplier involvement
No
Tier 2 - Description and Rationale
2.1 - Detailed description
NHS users can leave a review following their experience at an NHS health and social care service in England. When a user submits a review, it gets sent to a Flask App which runs the title and body of the review against our moderation rules. This involves both code running locally on the Flask App and external calls to models hosted on Azure endpoints. The outcome of each moderation rule is then returned to the NHS.UK Leave a Review site where the reviewer is informed if their review is in breach of any of the rules. For some moderation rules, users have the option to disagree with the outcome, in which case the review is sent to a human moderator to be checked. Reviews that are flagged by the algorithm as containing safeguarding concerns are currently automatically sent to a human for review.
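To make this flow concrete, the sketch below shows a minimal Flask route that aggregates rule outcomes; the endpoint URLs, rule names, helper functions and payload shapes are assumptions for illustration and are not taken from the production service.

```python
# Illustrative sketch of the moderation flow described above.
# Endpoint URLs, rule names and payload shapes are assumptions, not the production API.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Models hosted on Azure Machine Learning endpoints (placeholder URLs)
AZURE_MODEL_ENDPOINTS = {
    "complaint": "https://example.azureml.net/complaint/score",
    "safeguarding": "https://example.azureml.net/safeguarding/score",
}

def run_local_rules(title: str, body: str) -> dict:
    """Deterministic checks that run on the Flask app compute itself."""
    text = f"{title} {body}"
    return {
        "contains_url": "http://" in text or "https://" in text,
        "contains_email": "@" in text and "." in text.rsplit("@", 1)[-1],
    }

def query_model(url: str, title: str, body: str) -> bool:
    """Call a model hosted on an external Azure endpoint."""
    response = requests.post(url, json={"title": title, "body": body}, timeout=10)
    response.raise_for_status()
    return bool(response.json().get("prediction"))

@app.route("/moderate", methods=["POST"])
def moderate():
    review = request.get_json()
    title, body = review["title"], review["body"]

    outcomes = run_local_rules(title, body)              # rule-based checks, run locally
    for rule, url in AZURE_MODEL_ENDPOINTS.items():      # ML-based checks, run remotely
        outcomes[rule] = query_model(url, title, body)

    # Per-rule outcomes are returned to the Leave a Review site, which tells the
    # reviewer whether any rule has been breached.
    return jsonify(outcomes)
```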
2.2 - Scope
This tool is specifically designed to automate the moderation of NHS.UK reviews in line with the existing moderation policy, which includes nine distinct rules that were each handled individually. To detect rule violations, a rule-based approach is employed for some rules, while machine learning (ML) models have been trained for others.
Certain rule-based solutions could potentially be used in other contexts, such as identifying emails or detecting profanity. However, others are tailored to the specific definitions used by NHS.UK (Complaints, Safeguarding, Descriptions, Names and Not an Experience). The performance of the ML models on data other than that specifically used in the training and deployment (reviews on NHS.UK) has not been evaluated and is not recommended.
2.3 - Benefit
The benefits of this tool are best described in comparison to the previous manual process. The previous manual process required full human moderation and reviews were often published several days after being submitted. In 2021, 9 out of 10 reviews that were rejected by moderators were not resubmitted for publication. Utilising automatic moderation means:
- in the majority of cases, users receive instant feedback about their review (the exception being when the review is flagged for safeguarding)
- in most cases, users therefore have the opportunity to adjust their review immediately to meet our policy resulting in a higher rate of published reviews
- the third-party moderation contract could be terminated, drastically reducing cost
- the service is far more scalable
- moderation decisions are more consistent
2.4 - Previous process
The previous process for moderating NHS.UK reviews involved a third-party contractor manually reviewing each review as they were received. The user would then receive the decision of the moderation (published or rejected) by email.
2.5 - Alternatives considered
One alternative was to keep the current service as-is by continuing the third-party moderation contract and not promoting the service to keep incoming reviews at a manageable level. This was discounted, as it does not align with the long-term goal of increasing volumes of user feedback to inform better patient experiences.
Another option was to pay for more moderation staff to increase capacity and enable the ratings and reviews team to promote the service. This option was also discounted due to the increased costs involved.
Given the above, an algorithmic and NLP approach was taken to enable scalability whilst also reducing costs.
Tier 2 - Decision making Process
3.1 - Process integration
The algorithmic tool primarily supports decision-making in the moderation process. However, there are specific cases where it makes automatic decisions as to whether the review can be published or whether it needs editing to meet the review policy.
For example, the tool automatically rejects content containing fully capitalised text, URLs, email addresses, and confirmed profanity words.
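As an illustration only, such deterministic checks might look like the following; the regular expressions and the profanity list are placeholders rather than the production rules.

```python
# Illustrative deterministic checks for the auto-reject rules described above.
# The patterns and the profanity list are placeholders, not the production rules.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PROFANITY = {"exampleswearword"}  # stand-in for the confirmed-profanity word list

def is_fully_capitalised(text: str) -> bool:
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def breaks_auto_reject_rules(text: str) -> bool:
    words = {w.strip(".,!?").lower() for w in text.split()}
    return (
        is_fully_capitalised(text)
        or bool(URL_RE.search(text))
        or bool(EMAIL_RE.search(text))
        or bool(words & PROFANITY)
    )
```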
When the tool identifies reviews as complaints, users are re-directed to a specific page appropriate for raising complaints.
If reviews contain names, non-experiential content, or ambiguous language that could be considered as offensive, the tool provides users the option to accept the moderation decision or have their review sent to a human moderator if they contest it.
Currently, the tool flags all safeguarding content, and these reviews are automatically sent to human moderators to review.
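Putting these cases together, the routing can be sketched roughly as follows; the rule names and the precedence order are assumptions made for illustration, not the production logic.

```python
# Illustrative mapping of per-rule outcomes to the actions described above.
# Rule names and precedence are assumptions, not the production logic.
from enum import Enum

class Action(Enum):
    HUMAN_REVIEW = "human_review"            # safeguarding content
    AUTO_REJECT = "auto_reject"              # caps, URLs, emails, confirmed profanity
    REDIRECT_TO_COMPLAINTS = "redirect"      # identified as a complaint
    CONTESTABLE = "contestable"              # names, not-an-experience, ambiguous offence
    PUBLISHABLE = "publishable"              # no rule breached

def route(outcomes: dict[str, bool]) -> Action:
    if outcomes.get("safeguarding"):
        return Action.HUMAN_REVIEW
    if outcomes.get("auto_reject"):
        return Action.AUTO_REJECT
    if outcomes.get("complaint"):
        return Action.REDIRECT_TO_COMPLAINTS
    if any(outcomes.get(r) for r in ("names", "not_an_experience", "ambiguous_offence")):
        return Action.CONTESTABLE
    return Action.PUBLISHABLE
```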
3.2 - Provided information
When the tool detects that a review contains fully capitalised text, URLs, email addresses, or confirmed profanity words, the review is automatically rejected, and the user is informed and prompted to edit and resubmit it. A human decision maker is not involved.
In cases where the user can contest auto-moderation, the review may be sent to a human moderator. The human receives the review content (review title and comment) and the algorithm’s outcome for each moderation rule (pass/fail) including extracts of words or phrases indicating what has caused the moderation rule to be broken. The human can choose to override or uphold the auto-moderation outcome.
3.3 - Frequency and scale of usage
In 2023, there were between 14,000 and 24,000 reviews left on the NHS.UK website each month.
3.4 - Human decisions and review
Human intervention is required in two key scenarios:
- When a user disagrees with the tool’s outcome concerning the presence of names, non-experiential content, or ambiguous language that could be considered as offensive, they can choose to contest it. Their review is then escalated to a human moderator for further assessment. The moderator receives the review content, along with the tool’s decisions for each moderation rule. This information aids the moderator in deciding whether to publish or reject the review.
- When a review is flagged for safeguarding concerns, it is automatically sent to a human moderator. The moderator then decides whether to approve or reject the review based on their own assessment. This ensures that sensitive issues are handled with the appropriate care and judgment that only a human can provide.
3.5 - Required training
Human moderators are trained to understand:
- That the algorithm can get decisions wrong, how these cases are identified and how they can be resolved.
- The ratings and reviews policy before reviewing contested decisions.
- The escalation routes available (e.g. clinical input for safeguarding concerns).
Since the users of the webpage are not directly operating or maintaining the algorithmic tool but merely interacting with it by submitting reviews, formal training is not required. Instead, a clear and concise notice on the webpage informs users that their reviews are being scanned by AI technology.
3.6 - Appeals and review
For moderation rules that allow users to disagree with the algorithm’s decision, they can send the review to a human moderator for review. The user will then receive an email stating that their review has been published or explaining how to change it to meet the review policy.
If a review breaks policy but does not get caught by the moderation rules, service providers can report the comment directly to a human moderator for review.
Tier 2 - Tool Specification
4.1.1 - System architecture
The main logic of the moderation rules takes place within a Flask app. Deterministic rules, such as those used for checking URLs and email addresses, are applied directly on the Flask app compute. For more complicated rules (such as safeguarding), models are hosted on Azure Machine Learning endpoints which are queried by the Flask app. Once the review has been checked against all the moderation rules, the results are sent back to the NHS.UK website, where the appropriate page is displayed to the user - either a page explaining which rule was broken or the next page in the leave-a-review process. For a high-level diagram and description of the tool's architecture, please see the NHS England Data Science website.
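For illustration, a call from the Flask app to an Azure Machine Learning online endpoint typically looks like the sketch below; the scoring URI, key handling and response shape are assumptions rather than the production configuration.

```python
# Illustrative call to a model hosted on an Azure Machine Learning online endpoint.
# The scoring URI, API key handling and response shape are placeholders.
import os
import requests

SCORING_URI = os.environ.get("SCORING_URI", "https://example.azureml.net/score")
API_KEY = os.environ.get("SCORING_API_KEY", "")

def score_review(title: str, body: str) -> dict:
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",   # endpoint key or token
    }
    payload = {"title": title, "body": body}
    response = requests.post(SCORING_URI, json=payload, headers=headers, timeout=10)
    response.raise_for_status()
    return response.json()   # e.g. {"prediction": 1} - shape is an assumption
```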
4.1.2 - Phase
Production
4.1.3 - Maintenance
We track key metrics such as CPU usage, requests per minute, and latency. We also have alerts set up to notify the responsible team when metrics exceed a certain threshold. This provides real-time insights into resource utilisation, enabling us to identify and address performance bottlenecks. By tracking requests per minute, we can understand the traffic patterns and scale our resources accordingly to maintain optimal performance. Monitoring latency helps to ensure user experience, as it directly impacts the responsiveness of our models.
The solution has also been engineered to allow for quick redeployment of both the Flask app and each model.
To ensure the ongoing accuracy and efficiency of our models, we will be evaluating their performance at approximately three-month intervals. This assessment will specifically involve analysing data recorded by our human moderators. There are also plans in place to continuously monitor the absolute and proportional rates of reviews breaking policy in the long-term.
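As a sketch of what this periodic evaluation could involve, assuming a hypothetical export of moderator decisions with `rule`, `auto_outcome` and `human_outcome` columns (the file and column names are not taken from the production system):

```python
# Illustrative evaluation of automoderation outcomes against human-moderator decisions.
# The CSV export and its column names are hypothetical.
import pandas as pd

log = pd.read_csv("moderator_decisions.csv")   # one row per escalated/contested review

# Agreement between the automoderation outcome and the final human decision, per rule
agreement = (
    log.assign(agrees=log["auto_outcome"] == log["human_outcome"])
       .groupby("rule")["agrees"]
       .mean()
)

# Absolute and proportional rates of reviews confirmed as breaking each policy rule
breaches = log[log["human_outcome"] == "breach"].groupby("rule").size()
rates = breaches / log.groupby("rule").size()

print(agreement)
print(breaches)
print(rates)
```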
4.1.4 - Models
Name detection (rule-based and ML model):
- BERT NER model
Complaint detection (ML model):
- all-mpnet-base-v2 embeddings model
- SVM classifier model
Not-an-experience detection (ML model):
- BOW text vectoriser
- Logistic regression classifier
Safeguarding detection (ML model):
- BERT (embedding text and classifier)
Tier 2 - Model Specification: Complaint Detection (1/4)
4.2.1 - Model name
Complaint detection model
4.2.2 - Model version
1
4.2.3 - Model task
This model was trained to identify written reviews that violate the NHS complaints policy by detecting content that should be escalated as a formal complaint rather than addressed through the review service.
4.2.4 - Model input
Short text strings - the comment title and the comment body
4.2.5 - Model output
A classification of 0 or 1 indicating whether the review violates the NHS complaints policy
4.2.6 - Model architecture
The model is a supervised learning classification model. It uses an all-mpnet-base-v2 model to generate the text embeddings. A scikit-learn SVM is then used to classify those embeddings.
Hyperparameter tuning was used to optimise the model for an asymmetric goal, prioritising the reduction of false positives.
To assist in selecting the best model, a pipeline was used to test six different embeddings models with multiple classifiers; their parameters were optimised and the best combination selected.
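A hedged sketch of this embeddings-plus-SVM approach follows; the toy data, parameter grid and scoring choice are illustrative and not the tuned production configuration.

```python
# Illustrative all-mpnet-base-v2 embeddings + SVM classifier for complaint detection.
# The toy data, parameter grid and scorer are placeholders, not the production setup.
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder examples standing in for labelled review title + body text
texts = [
    "I want to make a formal complaint about how my referral was handled",
    "Nobody has responded to my complaint for weeks and I want this investigated",
    "I was given the wrong medication and expect an explanation",
    "My appointment was cancelled three times with no apology",
    "I am extremely unhappy with the way I was spoken to and want action taken",
    "I intend to escalate this to the practice manager as a complaint",
    "The nurses were kind and reassuring throughout my visit",
    "Very clean surgery and friendly reception staff",
    "I was seen quickly and the doctor listened carefully",
    "Great experience at the pharmacy today",
    "The physiotherapist explained everything clearly",
    "Parking was easy and the whole visit went smoothly",
]
labels = [1] * 6 + [0] * 6   # 1 = breaches the complaints rule (placeholder labels)

embedder = SentenceTransformer("all-mpnet-base-v2")
X = embedder.encode(texts)

# Asymmetric goal: prioritise a low false-positive rate, so score on precision here.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    scoring="precision",
    cv=3,
)
search.fit(X, labels)
print(search.best_params_)
```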
4.2.7 - Model performance
The model was tested on a holdout (validation) dataset, which is a separate set of data not used during the training phase. Validation of the model was completed at the time of the model's creation; however, the results presented here come from data collected entirely after the model's creation and deployment.
- False positive (FP) rate: 3.0%
- False negative (FN) rate: 15.1%
- Recall: 0.85
- Precision: 0.87
- F1: 0.86
The model’s performance was not specifically broken down by demographic or other characteristics due to the unavailability of such data in the review content. No additional characteristics were identified as relevant that warranted separate performance tracking.
4.2.8 - Datasets
The model uses text data primarily sourced from user reviews about healthcare providers submitted on the NHS.UK website.
A sample from the large source dataset was taken which included a mix of reviews: some that breached the complaints rule and those published without issue.
Additionally, the training set was augmented to expand the training range by introducing more positive records. This used:
- A paraphraser method based on the PEGASUS paraphraser model, which takes individual sentences and re-phrases them into something semantically similar. This is applied to records sentence by sentence to generate new text (see the sketch after this list).
- A word replacement method, which probabilistically replaces individual words with other words that are semantically similar, based on their embeddings.
- A sentence shuffling method, which shuffles sentences relative to one another.
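A minimal sketch of the paraphrasing and sentence-shuffling steps is shown below; the PEGASUS checkpoint named here (`tuner007/pegasus_paraphrase`) is an assumed stand-in, as the exact checkpoint and augmentation code are not published, and the embedding-based word replacement step is omitted.

```python
# Illustrative augmentation: sentence-level paraphrasing with a PEGASUS paraphrase
# model and sentence shuffling. The checkpoint is an assumed stand-in; the
# embedding-based word replacement method is not shown.
import random
import re

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

CHECKPOINT = "tuner007/pegasus_paraphrase"   # assumed stand-in checkpoint
tokenizer = PegasusTokenizer.from_pretrained(CHECKPOINT)
model = PegasusForConditionalGeneration.from_pretrained(CHECKPOINT)

def split_sentences(text: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def paraphrase_sentence(sentence: str) -> str:
    batch = tokenizer([sentence], truncation=True, padding="longest", return_tensors="pt")
    generated = model.generate(**batch, max_length=60, num_beams=5)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

def paraphrase_review(text: str) -> str:
    """Re-phrase a record sentence by sentence to generate a new, similar record."""
    return " ".join(paraphrase_sentence(s) for s in split_sentences(text))

def shuffle_sentences(text: str) -> str:
    """Shuffle sentences relative to one another."""
    sentences = split_sentences(text)
    random.shuffle(sentences)
    return " ".join(sentences)
```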
More information about the size of the dataset is available in section 4.3.4 (Data quantities).
4.2.9 - Dataset purposes
12.5% of the dataset was used for the validation set and the remaining data was divided into train and test sets using a 75/25 split. No augmented data was used in validation.
Tier 2 - Model Specification: NAE Detection (2/4)
4.2.1 - Model name
Not an experience (NAE) detection model
4.2.2 - Model version
1
4.2.3 - Model task
This model was trained to identify written reviews that violate the NHS review policy by detecting content that does not describe a specific experience.
4.2.4 - Model input
Short text strings - the comment body
4.2.5 - Model output
A classification of 0 or 1 indicating whether the review does or does not describe an experience
4.2.6 - Model architecture
The model is a supervised learning classification model. It uses a bag-of-words (BOW) approach to vectorise the text, listing the word counts for the 10,000 most common words in the training data. A logistic regression classifier model is then used to output the classification and the corresponding probability.
To assist in selection of the best model, a pipeline was utilised to allow testing of multiple different embeddings models with multiple classifiers. The results of this were compared to the simpler BOW and logistic regression approach and no significant improvement was found.
Cross-validation on the training data, using five folds, was used to select the best parameters for the logistic regression model, optimising the F1 score.
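A hedged sketch of this bag-of-words and logistic regression setup follows; the parameter grid is illustrative, and `train_texts`/`train_labels` stand in for the labelled review data, which is not shown.

```python
# Illustrative bag-of-words + logistic regression pipeline for NAE detection.
# The vocabulary size follows the description above; the parameter grid is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("bow", CountVectorizer(max_features=10_000)),   # counts of the 10,000 most common words
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation on the training data, optimising the F1 score.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
# search.fit(train_texts, train_labels)          # labelled reviews (not shown here)
# probs = search.predict_proba(new_texts)[:, 1]  # probability the review is NAE
```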
4.2.7 - Model performance
The model was evaluated by measuring its performance on the labelled validation data. A high F1 score was prioritised, but where there was a trade-off between the false positive rate and false negative rate, preference was shown towards a slightly lower FP rate, to try and avoid overburdening the human moderation team.
- False positive (FP) rate: 8.4%
- False negative (FN) rate: 16.4%
- Recall: 0.836
- Precision: 0.877
- F1: 0.856
The model’s performance was not specifically broken down by demographic or other characteristics due to the unavailability of such data in the review content. No additional characteristics were identified as relevant that warranted separate performance tracking.
4.2.8 - Datasets
The model uses text data primarily sourced from user reviews about healthcare providers submitted on the NHS.UK website.
A sample from the large source dataset was taken which included a mix of reviews: some that breached the NAE rule and those published without issue. Additional published reviews were sourced by running ~80,000 published reviews through an early iteration of the NAE model and selecting those with a high NAE probability.
More information about the size of the dataset is available in section 4.3.4 (Data quantities).
4.2.9 - Dataset purposes
Each dataset was split into train and validation sets using a 75/25 split. The additional published records that were run through an early iteration of the NAE model were used for training only. Cross-validation with five folds was used during the training process.
Tier 2 - Model Specification: Suicide and Self-harm Ideation Content Detection (3/4)
4.2.1 - Model name
Suicide and self-harm ideation content detection model
4.2.2 - Model version
1
4.2.3 - Model task
This model was trained to identify content related to suicidal thoughts and self-harm in user reviews on the NHS.UK website.
4.2.4 - Model input
Short text strings - the comment title and the comment body
4.2.5 - Model output
A classification of 0, 1 or 2 indicating whether the review contains no safeguarding concerns, low-risk concerns or high-risk concerns respectively. In practice, it is used as a binary classifier, grouping responses 1 and 2 into a single class.
4.2.6 - Model architecture
The model is a supervised learning classification model. It uses a BERT (Bidirectional Encoder Representations from Transformers) model, specifically bert-base-uncased: a pre-trained 12-layer neural network designed for natural language understanding. BERT serves as both the feature extractor and the classifier: it transforms the text into contextualised embeddings (sentence embedding vectors), which are then used by a classification head trained on top of the same network.
The model is fine-tuned on records specifically labelled for safeguarding concerns, allowing it to learn the particular characteristics of this type of text and adjust the neural network weights accordingly. A classification layer is added on top of the pre-trained BERT model to categorise text into the relevant classes.
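A minimal sketch of fine-tuning bert-base-uncased with a three-class classification head using the Hugging Face transformers library is shown below; the training arguments and dataset handling are placeholders, not the production recipe.

```python
# Illustrative fine-tuning of bert-base-uncased for 3-class safeguarding detection.
# Training arguments and dataset handling are placeholders, not the production recipe.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,   # 0 = no concern, 1 = low-risk concern, 2 = high-risk concern
)

def tokenize(batch):
    # The review title and body are concatenated into a single short text string.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

args = TrainingArguments(
    output_dir="safeguarding-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# train_ds is a labelled datasets.Dataset with "text" and "label" columns (not shown)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds.map(tokenize, batched=True))
# trainer.train()
```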
4.2.7 - Model performance
The model was evaluated by measuring its performance on the labelled validation data. Following advice from clinicians, minimising false negatives was prioritised due to the sensitive nature of the content and the potential harm of not detecting a case of suicidal or self-harm content.
Although this model produces a three-way classification (0,1,2), for simplicity the metrics here group both positive classes (1 and 2) together.
- False positive (FP) rate: 2.3%
- False negative (FN) rate: 0.6%
- Recall: 0.99
- Precision: 0.85
- F1: 0.92
Due to the small number of real reviews containing safeguarding concerns, the validation dataset was necessarily small and thus the certainty of these results is slightly lower than for our other models. (More information about the size of the dataset is available in section 4.3.4, Data quantities.)
The model’s performance was not specifically broken down by demographic or other characteristics due to the unavailability of such data in the review content. No additional characteristics were identified as relevant that warranted separate performance tracking.
4.2.8 - Datasets
The model uses text data primarily sourced from user reviews about healthcare providers submitted on the NHS.UK website.
A sample from the large source dataset was taken to train the model which included a mix of publishable comments and comments containing safeguarding concerns. Due to the relatively low number of real reviews containing high-risk safeguarding concerns, additional data was created manually by in-house human moderators using reviews data mixed with generic open-source datasets modified to convey a high-risk message. More information about the size of the dataset is available in section 4.3.4 (Data quantities).
Additionally, the training set was augmented to expand the training range by introducing more positive records. This used:
- A word replacement method, which probabilistically replaces individual words with other words which are semantically similar, based on their embeddings.
- A sentence shuffling method, which shuffles sentences relative to one another.
4.2.9 - Dataset purposes
The dataset was split into train and validation sets using a 70/30 split. No augmented data was used in validation, though some manually generated data was.
Tier 2 - Model Specification: Names identification (4/4)
4.2.1 - Model name
Names identification algorithm
4.2.2 - Model version
1
4.2.3 - Model task
The model used in this algorithm, together with post-processing steps, is used to identify reviews that contain people's names.
4.2.4 - Model input
Short text strings - the comment title and the comment body
4.2.5 - Model output
A classification of 0 or 1 indicating whether there is a name within the review along with a list of names that the model has identified.
4.2.6 - Model architecture
The model is an open-source pre-trained BERT (Bidirectional Encoder Representations from Transformers) model which is embedded within a series of post-processing steps to improve the algorithm output. More details on the BERT-base-NER model can be found on the Hugging Face website.
After the model identifies any names in the review, the following post-processing takes place (a simplified sketch follows this list):
- Remove non-names - discard names that are in a pre-defined allow list
- Allow reference to the organisation name
- Allow users to sign off a review with their name
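The sketch below illustrates this flow using the pre-trained NER model via the Hugging Face pipeline; the checkpoint name (`dslim/bert-base-NER`), the allow list, the organisation-name handling and the sign-off rule are simplified stand-ins for the real post-processing steps.

```python
# Illustrative names detection: pre-trained BERT NER model plus simplified post-processing.
# The checkpoint, allow list, organisation-name handling and sign-off rule are stand-ins.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

ALLOW_LIST = {"example-allowed-term"}   # placeholder for the pre-defined allow list

def find_names(review_text: str, organisation_name: str) -> list[str]:
    entities = ner(review_text)
    names = [e["word"] for e in entities if e["entity_group"] == "PER"]

    # Remove non-names: discard anything on the pre-defined allow list.
    names = [n for n in names if n.lower() not in ALLOW_LIST]

    # Allow references to the organisation name.
    names = [n for n in names if n.lower() not in organisation_name.lower()]

    # Allow users to sign off with their own name (simplified: ignore a name that
    # ends the review).
    if names and review_text.rstrip(" .!?\n").endswith(names[-1]):
        names = names[:-1]

    return names   # classification is 1 if this list is non-empty, else 0
```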
4.2.7 - Model performance
The names algorithm was evaluated as a whole, including the model and the post-processing steps applied to review submissions and lists of names. The BERT model was not evaluated individually.
The model was tested with a balanced dataset of reviews containing names and reviews not containing names. When choosing a BERT model and post-processing steps, we focused on obtaining a low number of false negatives. Once the model was chosen, our evaluation focused on the effectiveness of our post-processing steps and reducing false positives.
- False positive (FP) rate: 5.9%
- False negative (FN) rate: 6.6%
- Recall: 0.934
- Precision: 0.941
- F1: 0.937
The model’s performance was not specifically broken down by demographic or other characteristics due to the unavailability of such data in the review content. No additional characteristics were identified as relevant that warranted separate performance tracking.
4.2.8 - Datasets
The model uses text data primarily sourced from user reviews about healthcare providers submitted on the NHS.UK website.
A sample from the large source dataset was taken to evaluate the algorithm which included a mix of publishable comments and comments containing names.
More information about the size of the dataset is available in section 4.3.4 (Data quantities).
4.2.9 - Dataset purposes
Since the model was pre-trained, the dataset was used only to evaluate the whole names algorithm (i.e. the model alongside the post-processing steps).
Tier 2 - Data Specification
4.3.1 - Source data name
NHS.UK Ratings and Reviews Data
4.3.2 - Data modality
Text
4.3.3 - Data description
The datasets captured the review title and comment along with the care provider details such as organisation type and name. All comments had labels indicating which (if any) moderation rules had been broken, assigned either by human labelling, augmentation or generation.
4.3.4 - Data quantities
Here ‘moderators’ refers to our in-house moderation team.
- Complaints model: A total of 20,625 records were used; of these, 19,179 were actual records labelled by moderators (798 positive; 18,381 negative). An additional 1,446 positive records were generated using the augmentation methods detailed in section 4.2.8 (Datasets) of the Complaints model specification.
- NAE model: A total of 8,215 records were used; of these, 4,969 were records labelled by moderators (2,083 positive; 2,886 negative). 3,246 negative records were identified by running published records through an early iteration of the NAE model and using only those classified with a high degree of probability as negative.
- Safeguarding model: A total of 4,752 records were used; of these, 3,050 were actual records labelled by moderators (549 positive; 2,393 negative). 399 positive records were generated and labelled by moderators using reviews data mixed with generic open-source datasets modified to convey a high-risk message. 1,303 positive records were generated using the augmentation methods detailed in section 4.2.8 (Datasets) of the Safeguarding model specification, which were subject to a mixture of moderator, model-based and manual reviews.
4.3.5 - Sensitive attributes
The source dataset contains email addresses and IP addresses; however, these are filtered out when sampling a smaller set for the purposes of model training and evaluation. There may be some cases where the comment title or text contains sensitive details (such as people’s names) where this information has been volunteered by the user leaving the review.
4.3.6 - Data completeness and representativeness
The dataset consisted of up to 2 years of real-world historical data that was expected to cover a range of scenarios across all of the NHS.UK moderation rules.
However, there were insufficient quantities of reviews that constituted complaints or contained safeguarding concerns, as real instances were infrequent.
To address these gaps, data augmentation techniques, manual creation, and synthetic data generation were employed to increase the size and variety of the training set, and ultimately improve the robustness of the models. Details of this can be found in section 4.2.8 (Datasets) for each model, as different techniques were used for different tasks.
No augmented or synthetically generated data was used for validation to maintain the integrity and reliability of the validation results.
4.3.7 - Source data URL
N/A
4.3.8 - Data collection
The data is collected in real-time through end user submissions on the NHS website. It is sent through Azure web apps and to MS Dynamics, where it is then viewable and can be sampled for model training and testing purposes.
4.3.9 - Data cleaning
Before the data is sampled for model training and testing purposes, sensitive fields such as email and IP address are removed.
Most data used for training was vetted by in-house human moderators to ensure accuracy and consistency of labelling.
4.3.10 - Data sharing agreements
The specific data used in training, test and validation for this project is not shared outside of NHS England. The automoderation tool does not share information outside of NHS England. Reviews are published on the NHS.UK website and are available online in the service search. Several fields can also be accessed via the API, for which access can be requested.
4.3.11 - Data access and storage
The data is stored in a secure environment in MS Dynamics, and a deletion policy is in place to redact any personal information after 2 years for published comments, 3 months for rejected comments, and 10 days for unverified submissions.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
A DPIA was conducted for the NHS.UK reviews service and was updated in February 2024 to reflect the recent change from manual human moderation to the automoderation solution.
5.2 - Risks and mitigations
Risk of false negatives:
- While the safeguarding detection model has a high level of accuracy, there is a residual risk of false negatives, which here means users who may be in need of help do not get correctly flagged. Failing to intervene in a timely manner in these cases has potentially harmful consequences.
- There is a small risk of false negatives from the names model which could result in someone being identified without their consent; this is more likely to occur for reviews with names that are also words, e.g. Destiny.
- The names and safeguarding detection models can generate false negatives in long reviews due to the BERT token limit. Any names or safeguarding triggers in text after this token limit are not picked up by the model.
- False negatives from the complaints model could mean that complaints are not properly escalated through the appropriate route. Note that a false negative here simply means not signposting the reviewer towards the complaints process, and does not prevent a user from submitting a complaint independently of the reviews process.
- Mitigations: The names and safeguarding models are tuned to minimise false negatives. In the event they occur, care providers reading the reviews are able to manually flag comments that they think break policy, which can then be dealt with accordingly. There are also plans in place to continuously monitor the absolute and proportional rates of reviews breaking policy to ensure assumptions continue to hold (including the appearance of names that are also words being sufficiently rare). The monitoring dashboard for the NHS.UK Reviews service allows the development team to easily monitor activity. Given that the BERT token limit is already more generous than the length of the majority of reviews, the risk of false negatives in long reviews is deemed low; however, plans are in place to align the front-end character limit with the BERT limit.
Risk of false positives:
- False positives from any of the models have the potential to frustrate users if the tool does not publish their review and could undermine their trust in the reviews service. Users are likely to contest false positive cases, which could increase the workload of the human moderators.
- False positives from the safeguarding model risk directing reviewers to inappropriate content regarding self-harm and suicidality. This could be confusing and frustrating to users, and possibly triggering in some cases.
- False positives in the complaints model would direct reviewers to raise formal complaints where there was no real need to do so, resulting in unnecessary costs for the NHS.
- Mitigations: Users do have a facility to contest the auto-moderation decision in most cases, which should lead to a resolution in a timely manner. There are plans in place to monitor the performance of the models over time, with the potential to retrain with more data or new methods to reduce false negatives or false positives. The complaints model was tuned such that the false positive rate was lowered as far as possible (at the price of other performance metrics).
Technology stability risk:
- If the Flask app or models break then users may find it difficult or impossible to leave a review, depending on the situation.
- Mitigations: The automoderation tool can be turned off and full control returned to in-house human moderators whilst any issue is resolved. Our alerting and one-click redeploy scripts should minimise this downtime.
Malicious user risk:
- Malicious users could exploit the tool’s limits to publish reviews that break review policy.
- Mitigations: The risk of this is deemed tolerably low, given mitigations in place where care providers can still flag reviews that break policy.