4 Data Quality Tool Features for Improving Open Datasets

The rise in popularity of open datasets has seen an increase in light of the Covid-19 Pandemic. With more people thirsty for data, there exists a need for technology solutions to ensure open datasets are maintained and reliable.

Open datasets consist of structured data that is machine readable and typically used by individuals, financial services and public sector bodies. The Open Data Handbook provides the following definition:

“Open data is data that can be freely used, reused and redistributed by anyone – subject only, at most, to the requirement to attribute and sharealike.”

The uses for open datasets are innumerable- from economic indicators, business information and public health statistics. Similarly, the advantages of open datasets are widely acknowledged, including greater transparency and accountability from our public bodies.

However, open datasets come with caution around data quality. Due to the scale of the datasets, and the lack of sensitive information included, there is unlikely to be a great deal of maintenance and data quality control taking place before they are released. In order to tackle the issues around accuracy, usability and quality of open datasets, advancing AI technologies are taking the opportunity to provide their solutions to the problem.

4 Data Quality Tool Features for Improving Open Datasets

In order to extract valuable insights from open datasets (i.e. data that is accessible by the public) sufficient tooling is needed to ensure the data is of high quality. In this blog we look at 4 features of a data quality tool. We have used Companies House – the UKs open register of companies as a Use Case to illustrate the impact of standardising the data quality and preparation process for datasets.

Companies House is an open register sponsored by the UK Government’s Department for Business, Energy & Industrial Strategy. Here, company information is registered and made available to the public. Details can include company addresses, persons of significant control and disqualified directors, as well as insolvency information which are often entered as free form text. As such, entities using these data often undertake data quality and preparation work before using this information for downstream tasks.

It’s estimated that millions of professionals rely on Companies House data every day, with real-world applications including KYC onboarding and AML processes. In these applications, public data becomes critical as Companies House data is matched against financial crime and terrorism lists. Therefore, data needs to be of high quality in order to prevent fraudulent and potentially dangerous activity.  

We summarise below in the Companies House Use Case how Datactics developed systematic data quality and matching pipelines which were run against these data. By using automated pipelines, users save time on designing and maintaining the programmes needed to run against the open data. Moreover, the pipelines have been developed by data experts in order to achieve high quality data without needing tech expertise.  

What is data profiling and why do we need it? 

Data profiling is the process of analysing, inspecting and creating helpful summaries of big data. Through running a diagnosis of the data, including its sources and metadata, bad data can be detected and amended before becoming actionable intelligence. 

Data profiling is an important first step in improving data sets. For organisations like financial services or public sector bodies with vast citizen datasets, profiling can detect where there are spelling mistakes or a lack of standardisation. The downstream benefits of this is that organisations are left with accurate citizen data. Without profiling, data quality issues may creep into downstream analysis and result in poor business decision (and worse, potential fines).

Application of Data Quality Profiling through Datactics Platform 

Data profiling is implemented by the Datactics tool using different techniques including frequency analysis through to co-occurance metrics. This information can be used to identify potential outliers in a dataset through to understanding the underling data and data quality rules that can be applied. In the Companies House Use Case, these open data sets were initially profiled to identify data quality issues before being remediated. This including profiling on names and country columns. Profiling was performed on forenames and surnames using string length (the length of a set of characters), unexpected characters and numerals to identify outliers, or data elements which didn’t conform to any common pattern within the data set.

‘Regular expressions’, a method used for matching patterns in text, were run against the column to pick up on non-name words such as ‘to’, ‘the’ and ‘a’ and output an accurate list of names.

For ‘country’ columns, Datactics engineers looked at identifying outliers such as misspellings and city names being included in country lists. This data profiling work resulted in an accurate list of countries ready to be cleansed and standardised.  

2. What is data cleansing?  

Data Cleansing refers to the process of removing and remediating inaccurate or incomplete data from a dataset. Data cleansing helps produce accurate data which are fed into downstream processes. For the Companies House Use Case, having accurate data plays a key role in financial services processes, including KYC and detection of financial fraud.

Significant effort is required to cleanse and standardise datasets. Using the Datactics platform, this process can be automated and re-run resulting in efficiencies, transparency and auditability in the process. 

How is it used in Datactics data quality tool? 

The Datactics platform provides a self service approach in building data quality rules along with out of the box data cleansing rules which can be executed against datasets. For the Companies House Use Case, information in the Companies and Persons of Significant Control datasets, including company name and addresses, are typically input as free text. As such, this can result in mistakes such as misspelling which requires remediation before it can be used downstream.

Datactics data engineers developed data quality cleansing pipelines applied to the data to achieve high quality, standardised and updated data sets which could be used for downstream analysis. This included the substitution of special characters with safe alternatives to address potential issues such as SQL injections (where fraudsters or bad actors use SQL phrases such as maliciously delete records), along with the validation of countries and addresses.

Data quality checks against the ‘identification.country’ registered column in the dataset identified over a quarter of a million instances (nearly half of the total dataset) where the name entered was not found within the standard list of the United Nations Geoscheme.  

3. What is data matching? 

Data matching is the process of comparing two data sets to identify similar matches and detect outliers. This is an integral data quality tool feature that calculates the probability that two similar entities are in fact the same, helping organisations achieve better data quality and identify potentially fraudulent activity.

Common types of data matching include ‘exact’ and ‘fuzzy’ matching. Exact matching refers to the process of finding all of the entries in a dataset that are clearly identical. Fuzzy matching works in slightly less clear circumstances, where it may have to account for errors e.g. typos or abbreviations to determine whether entities are close enough to be a match.  

How is it used in Datactics data quality tool?  

For the Companies House Use Case, matching was undertaken using built in functionality within the Datactics platform. The example in this Use Case was using matching to verifiy countries entered in the country column and match them to the United Nations Geoscheme of countries and subregions. These matched countries were standardised to the list of the Geoscheme country names. Entries in the country column were compared against this list using exact and fuzzy matching.

Exact matching highlighted when a country name was completely identical to one from the Geoscheme list. In comparison, fuzzy matching identified similarities which could be considered a match for example, identifying countries that may have a spelling mistake. Data matching is useful for facilitating and merging diverse datasets at scale. As a result, users are left with a standardised dataset which makes downstream activities much easier.

4. What is low-code?  

Low-code applications are easy to use and reproducible and can conduct data quality tasks such as cleansing and matching. They alleviate users from having to write coding scripts, such as Python, in order to achieve their business goals and manage tasks. Platforms typically feature a user-friendly GUI (Graphical User Interface), where non-technical users can select the data management components they need. Low-code platforms help users satisfy business needs and solve problems in a management workflow. 

How is a low-code interface used in Datactics data quality tool? 

Having an easy-to-use tool makes data quality projects simple, efficient, and reliable. Datactics developed data quality and matching pipelines which are automated, reproducible, and auditable. A systematic approach enables the consistent re-running of the data quality projects without the need for development, tuning or support. Developing an in-house solution may require the resources to update and maintain scripts weekly.

Datactics developed a data quality tool feature with a low-code interface, to produce transparent data quality projects. As part of the Companies House project, 67 data quality and standardisation rules were developed and 16 were presented in a PowerBI dashboard. In this instance it was presented in a Power BI dashboard, however, templates can be built in most data visualisation tools. 

As the data quality pipelines are reproducible and auditable, the trend of data quality can also be tracked over time, with visualisations of Data Quality metrics by data source and by dimensionality. Moreover, it highlights where most of the DQ breaks occur, in terms of dimensionality and the total number of failing records.

Summary 

The functions associated with a data quality tool offer significant benefits to the end user, whether that is a KYC officer wrangling with messy data during onboarding, or an AML professional needing accurate, up to date information to prevent financial crime.

This report highlights a real-life, relevant use-case in dealing with open datasets namely the ‘Companies House’ and ‘Persons of Significant Control’ data. For these particular datasets, having these 4 features are interoperable so that end-users don’t have to manually extract, transform and load data (ETL) in order to be able to use it. 

In short, better data quality means better decision making and business insights. 

If you would like to explore this solution for open data, please reach out through our contact page. 

Roisin Floyd