DBT: Find Exporters
Find Exporters data uses the Export Propensity Scores to aid users in establishing the export potential of companies on the Companies House register. The algorithm predicts the probability that a company exports goods. This can then be used to identify companies to work with.
Tier 1 Information
Name
Find Exporters
Description
Find Exporters data uses the Export Propensity Scores, produced by the Export Propensity Algorithm, to aid users in establishing the export potential of companies on the Companies House register. The algorithm predicts the probability that a company exports goods. This can then be used by staff within DBT to identify companies to work with.
Website URL
N/A
Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Department for Business and Trade (DBT)
1.2 - Team
Digital, Data and Technology Data Science Team
1.3 - Senior responsible owner
Chief Data Officer
1.4 - External supplier involvement
No
2.1 - Detailed description
This algorithm uses supervised classification: the binary target variable (marking a “company that exports goods”) is associated with a probability which indicates the confidence level in the predicted variable.
Framing this as a binary problem simplifies the implementation and interpretation of the algorithm, as we want to compute an associated propensity score between 0 and 1. This was agreed with the Export and Investment portfolio when the model was first designed and deployed.
The target variable is whether a company has exported in the ~6 months (180 days) following a prediction cut-off (defined below).
All of these variables were agreed by the Export and Investment portfolio when the algorithm was first designed and deployed. Other choices could be sensible; however this is linked to the current expected semantics of the model and should not be changed without proper communications with our users. We will continue to review this in the future.
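As an illustration only, the target definition above can be sketched in a few lines of Python. The function name, the shape of the input (a list of export dates per company) and the treatment of the window endpoints are assumptions made for this sketch and are not taken from the production pipeline.

```python
from datetime import date, timedelta

# Window length taken from the description: ~6 months, i.e. 180 days.
TARGET_WINDOW_DAYS = 180


def exported_after_cutoff(export_dates, cutoff, window_days=TARGET_WINDOW_DAYS):
    """Return 1 if any export date falls in (cutoff, cutoff + window_days], else 0.

    Which endpoints of the window are inclusive is an assumption of this sketch.
    """
    window_end = cutoff + timedelta(days=window_days)
    return int(any(cutoff < d <= window_end for d in export_dates))


cutoff = date(2023, 1, 1)
# Exported before and after the cut-off: positive example.
label_recent = exported_after_cutoff([date(2022, 11, 5), date(2023, 3, 14)], cutoff)
# Exported only before the cut-off: negative example.
label_stale = exported_after_cutoff([date(2022, 11, 5)], cutoff)
# Exported only after the 180-day window has closed: negative example.
label_late = exported_after_cutoff([date(2023, 9, 1)], cutoff)
```

Exports recorded before the cut-off do not count towards the target in this sketch; only exports inside the 180-day window do.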
2.2 - Scope
The scope of the tool is to review all UK companies so that the DBT team can make targeted decisions on which organisations to engage first. The tool does not link the propensity to export with the government support given to companies (if any) and cannot be used alone to measure the success of government support. The tool does not account for company export strategies (i.e. a low propensity score may reflect a business strategy rather than obstacles to exporting, so the score should not be used alone to infer such obstacles).
2.3 - Benefit
The benefit of this tool is to help inform operational decisions and identify leads for DBT staff. This tool can help staff to decide which organisation to engage next but does not restrict the user from choosing themselves.
2.4 - Previous process
Prior to this tool, the propensity of companies to export was estimated using human expertise, by manually inspecting and sifting through data. The tool provides an accurate, well-defined, data-driven probability to replace the manual process and improve its outcome.
2.5 - Alternatives considered
The initial version of the model was implemented using XGBoost. However, we now use LightGBM.
LightGBM is a very efficient and flexible implementation of gradient boosting (faster than, for example, XGBoost). Gradient boosting models are typically accurate because they use ensemble techniques. In particular, they combine decision trees, which capture non-linear relationships between variables and accommodate both categorical and numerical variables well. LightGBM makes no assumption about the distribution or processes determining the input data.
Tier 2 - Decision making Process
3.1 - Process integration
The tool provides users with data-driven insights on a company’s propensity to export, which provides the user with future prospects to engage. This information complements and integrates with many further insights from other sources available to the users to make the best-informed decisions on which companies to engage next.
3.2 - Provided information
The output of the tool (export predictions) is accessible both as raw data as well as through a dashboard. The dashboard has quick filtering functionality that allows the user to quickly retrieve the desired output for any specific use case they are searching for. The tool calculates an Export Propensity score, which tries to estimate the export potential of a company, as a real number between 0 and 1. This score is then used to assign an Export Propensity Label:
- Very high: the top 7% of companies with the highest propensity scores
- High: the next 8% of companies
- Medium: the next 15% of companies
- Low: the next 20% of companies
- Very low: the 50% of companies with the lowest propensity scores
For example, companies in the top 7% bracket in terms of their Export Propensity Score will be assigned the label “Very high”. However, their score might be low, for instance 0.1, corresponding to a 10% estimated probability of exporting within the next 6 months. This is because, in general, not many companies are exporters, and there is a lot of inherent variability among companies even when they appear similar on paper, which makes it difficult to obtain scores close to 1.
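The bracket assignment described above can be sketched as follows. This is a minimal illustration and not the production code: the function name, the dictionary input and the handling of tied scores (ties keep their input order) are assumptions. The toy scores are synthetic and chosen so that the highest score is only 0.1.

```python
# Cumulative upper bounds of each bracket, in percent of all companies
# ranked from highest to lowest score (7%, +8%, +15%, +20%, +50%).
BRACKETS = [
    (7, "Very high"),
    (15, "High"),
    (30, "Medium"),
    (50, "Low"),
    (100, "Very low"),
]


def assign_labels(scores):
    """Map company id -> label, given a dict of company id -> score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    labels = {}
    for position, company_id in enumerate(ranked):
        for upper_pct, label in BRACKETS:
            # Integer comparison of (rank / n) <= (upper_pct / 100).
            if (position + 1) * 100 <= upper_pct * n:
                labels[company_id] = label
                break
    return labels


# Synthetic example: 100 companies with scores from 0.001 to 0.1.
scores = {f"company_{i}": i / 1000 for i in range(1, 101)}
labels = assign_labels(scores)
```

In this toy data, company_100 is labelled “Very high” even though its score is only 0.1, because the label depends on the company's rank and not on the absolute value of the score.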
The score is updated daily.
3.3 - Frequency and scale of usage
There are on average 0.143 daily users of the Find Exporters tool. Every time a user engages with the tool, it is understood that they use this information to make an informed decision as to which company they will reach out to next.
3.4 - Human decisions and review
The tool provides information that may help humans to take decisions. The tool does not provide decisions or hints for decisions, only additional data-driven information. Humans may or may not consider it in their decision process. The users do not question, check or review the accuracy of the output of the tool (predictions), as this accuracy is in fact measured with data.
3.5 - Required training
Users of the tool do not require any specific training; they are required to read the user documentation that is provided with the tool (webpage). For the development, maintenance and operation of the tool, Data Scientists would be able to operate any part of it after a few days of hand-over sessions and documentation.
Tier 2 - Tool Specification
4.1.1 - System architecture
4.1.2 - Phase
Production
4.1.3 - Maintenance
Company data is refreshed on a continual basis via a daily upload feed. The algorithm is retrained every six months, or more often if it becomes apparent that the model needs to be retrained.
4.1.4 - Models
LightGBM classification model
Tier 2 - Model Specification
4.2.1 - Model name
LightGBM
4.2.2 - Model version
4.1.0
4.2.3 - Model task
Classification Model
4.2.4 - Model input
Company related features, accounting information, export information (target variable)
4.2.5 - Model output
Export propensity probability
4.2.6 - Model architecture
Gradient Boosting classification model.
Hyper-parameters of LightGBM are chosen using a randomised search and cross-validation (5 folds) based on the negative log loss metric. Chosen parameters are hard-coded in the training pipeline and also logged upon training.
4.2.7 - Model performance
The model's outputs vary across geographies and categories. There are extensive notebooks containing a range of breakdowns.
At a high level the model is evaluated on a test split, using a Brier score. Brier score can be slightly biased against the rare positive class. Other evaluation tools are used for quality assurance:
- The calibration curve.
- The Brier skill, defined as 1 minus the ratio between the Brier score of the trained predictor and that of the no-skill predictor that outputs for all samples the overall positive class probability.
The justification for this method is that the Brier score is a strictly proper score function that measures the accuracy of probabilistic predictions. As such, it is able to give us an indication of how good the probability score output by the model is, not just the ranking of predictions or other metrics that only consider the output label for some confidence threshold. The calibration curve gives us greater insight into how good the scoring is for different strata. The Brier skill allows us to contextualise the Brier score in terms of a baseline.
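The three evaluation tools named above can be illustrated with a small standard-library sketch. The function names, the equal-width binning used for the calibration curve and the toy data are assumptions made for this example; the evaluation notebooks may well rely on library implementations.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def brier_skill(y_true, y_prob):
    """1 - BS(model) / BS(no-skill), where the no-skill predictor outputs the
    overall positive class rate for every sample. Undefined (division by
    zero) if y_true contains a single class."""
    base_rate = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [base_rate] * len(y_true))
    return 1 - brier_score(y_true, y_prob) / reference


def calibration_curve(y_true, y_prob, n_bins=5):
    """For each non-empty equal-width bin, return
    (mean predicted probability, observed positive rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    return [
        (sum(p for _, p in b) / len(b), sum(y for y, _ in b) / len(b))
        for b in bins
        if b
    ]


# Synthetic toy data: 2 exporters out of 10 companies.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1] * 8 + [0.7] * 2

score = brier_score(y_true, y_prob)
skill = brier_skill(y_true, y_prob)
curve = calibration_curve(y_true, y_prob)
```

A Brier skill of 0 means the model is no better than always predicting the overall exporter rate, and values closer to 1 indicate a larger improvement over that baseline.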
4.2.8 - Datasets
- Companies House data.
- DBT data: Data Hub, Export Wins.
- Export Data from HMRC.
Each dataset has an entry on our departmental data platform which includes a data dictionary and code snippets. The data science team do not maintain or own these datasets.
4.2.9 - Dataset purposes
The Companies House dataset is used as a canonical list of companies for training and prediction, and to extract company metadata. It is also used for extracting all accounting information. It is not loaded via dataflow, but using the Data Store service, which existed before dataflow was created.
An extract of HMRC export data is used to obtain the date of last export for a company. The last date for which we have export information for any company is assumed to be the earlier of the current date and the last day of the last month-year available in this table.
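The rule for the last date with export information can be sketched as below. The function name and its arguments are hypothetical and are used here only to illustrate the rule; they do not describe the actual table schema.

```python
import calendar
from datetime import date


def last_export_information_date(latest_year, latest_month, today):
    """Earlier of `today` and the last day of the latest month-year available."""
    last_day = calendar.monthrange(latest_year, latest_month)[1]
    return min(today, date(latest_year, latest_month, last_day))


# Latest month in the table is February 2024, checked on 10 April 2024.
lagging = last_export_information_date(2024, 2, today=date(2024, 4, 10))
# Latest month in the table is the current, still incomplete month.
current = last_export_information_date(2024, 4, today=date(2024, 4, 10))
```

In the first case the result is the end of February 2024; in the second case the current date is used, since the end of the month lies in the future.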
Tier 2 - Data Specification
4.3.1 - Source data name
- Companies House data snapshot of information for live companies on the public register.
- HMRC exports data, which comprises data on the export of goods from the UK, combining the previously separate non-EU exports and EU dispatches.
4.3.2 - Data modality
Tabular
4.3.3 - Data description
Administrative data which gives information in relation to companies.
4.3.4 - Data quantities
The final model output is for all companies in the UK (i.e. a few million rows). The dataset for model development has about 20 columns.
The test split corresponds to the data for 20% of the companies fetched at training time. The training split is the remaining 80% of companies. No separate validation set is used, in favour of cross-validation.
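A company-level 80/20 split of this kind can be sketched as follows. The function name, the fixed random seed and the shuffling approach are assumptions for illustration; the training pipeline may perform the split differently.

```python
import random


def split_companies(company_ids, test_fraction=0.2, seed=0):
    """Shuffle the unique company ids and hold out `test_fraction` for testing.

    Returns (train_ids, test_ids). Splitting on company id keeps every
    company entirely in one split.
    """
    ids = sorted(set(company_ids))
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Synthetic company ids, for illustration only.
train_ids, test_ids = split_companies([f"company_{i}" for i in range(1000)])
```

Cross-validation for hyper-parameter selection would then operate on the training ids only, leaving the test ids untouched until the final evaluation.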
4.3.5 - Sensitive attributes
The data all relates to companies, so should not contain personal data other than where this has been used, for example, as a company name. International trade advisors (ITAs), the users of the tool, have their own separate data set with contacts and means to reach companies.
4.3.6 - Data completeness and representativeness
This is administrative data which has been enriched with additional DBT data to improve the quality of the data set, for example through company ID matching. There are some issues with the underlying accuracy of some of the Companies House data (e.g. some dormant companies, incorrect industrial classifications).
4.3.7 - Source data URL
4.3.8 - Data collection
We receive the full raw administrative data as an API feed; it is not changed from how it was entered on HMRC and Companies House systems.
4.3.9 - Data cleaning
Not applicable
4.3.10 - Data sharing agreements
The HMRC and Companies House data sets are open data sets.
4.3.11 - Data access and storage
The outputs of Find Exporters are only available to DBT staff who have a login and credentials for this system, which is managed and monitored. Data is loaded via the DBT Data Workspace platform, which handles security controls.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
There has been no need for a Data Protection Impact Assessment (DPIA) to be conducted, as the tool only contains non-personal company data. There have been no external impact assessments of the model. Internal model evaluation in terms of fairness has been undertaken, using standard data science metrics used for model building.
5.2 - Risks and mitigations
The key risk here is that use of the tool could reinforce existing DBT operational bias towards certain types of companies if data are not objectively interpreted. However, operational staff are not required to use Find Exporters; it is simply a tool they can use in order to identify potential leads. The fact that operational staff are responsible for identifying companies and that Find Exporters is not mandatory will mitigate any risk of biasing the direction of operational work. The tool team will continue to explore risk mitigations, particularly as some teams are keen to automate aspects of lead generation and casework.