At Datactics, we develop and maintain a number of internal tools for use within the AI and Software Development teams. One of these is a synthetic data generation tool that can be used to create large datasets of placeholder information. It was initially built to generate benchmarking datasets as another way of evaluating software performance, but has also been used to produce sample datasets for software demonstrations and for building proof-of-concept solutions. The project has been hugely beneficial, providing tailor-made, customisable datasets for each specific use case, with control over dataset size, column datatypes, duplicate entries, and even the insertion of simulated errors to mimic the uncleanliness of some real-world datasets. As this tool has seen increased usage, we’ve discussed and considered additional areas within data science that could benefit from the application of synthetic data.
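To make the idea concrete, here is a minimal, hypothetical sketch of that kind of placeholder-data generation – not our internal tool, just an illustration of the same controls (dataset size, column datatypes, duplicates, and simulated errors) using pandas and NumPy; all names, rates and column choices are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N_ROWS = 1_000          # dataset size is configurable
DUPLICATE_RATE = 0.05   # fraction of rows repeated verbatim
ERROR_RATE = 0.02       # fraction of cells given a simulated error

first_names = ["Aoife", "Ben", "Chen", "Dara", "Elena"]
cities = ["Belfast", "Dublin", "London", "New York"]

# Generate placeholder columns with mixed datatypes.
df = pd.DataFrame({
    "first_name": rng.choice(first_names, N_ROWS),
    "city": rng.choice(cities, N_ROWS),
    "account_balance": rng.normal(5_000, 1_500, N_ROWS).round(2),
    "joined": pd.to_datetime("2020-01-01")
              + pd.to_timedelta(rng.integers(0, 1_000, N_ROWS), unit="D"),
})

# Insert duplicate entries to mimic messy real-world data.
dupes = df.sample(frac=DUPLICATE_RATE, random_state=1)
df = pd.concat([df, dupes], ignore_index=True)

# Simulate errors: blank out a small fraction of name cells.
error_idx = df.sample(frac=ERROR_RATE, random_state=2).index
df.loc[error_idx, "first_name"] = ""

print(df.sample(5))
```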
One such area is the training of machine learning models. Synthetic data, and synthetic data generation tools such as the Synthetic Data Vault, have already seen widespread usage in the AI/ML space. A report from Gartner has gone as far as to predict that synthetic data will comprise the majority of data used in ML model training by 2030 – and understandably so.
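For readers who haven’t used it, the typical workflow with a tool like the Synthetic Data Vault is to fit a model on a real table and then sample new rows from it. The sketch below follows SDV’s 1.x single-table API (class and method names may differ in other versions), and customers.csv is a hypothetical input file used purely for illustration.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical table of real records, used only to fit the synthesizer.
real_df = pd.read_csv("customers.csv")

# Learn the schema and statistical shape of the real data...
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)

# ...then sample as many synthetic rows as the training task needs.
synthetic_df = synthesizer.sample(num_rows=10_000)
```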
Implementing technologies such as deep learning can require massive datasets, and sourcing large, comprehensive, clean, well-structured datasets for training models is a lengthy, expensive process – one of the main barriers to entry in the space today. Generating synthetic data to use in place of real-world data opens the door to AI/ML for many teams and researchers that would otherwise have been unable to explore it. This can lead to accelerated innovation in the space, and faster implementation of AI/ML technologies in the real world.
The use of synthetic data can clearly reduce the impact of many of the struggles faced during ML model building. However, some potential flaws in ML models cannot simply be solved by replacing real-world data with synthetic data – bias being a prime example.
There’s no doubt that using raw real-world data in certain use cases can create heavily biased models, as this training data can reflect existing biases in our world. For example, a number of years ago, Amazon began an internal project to build a natural language processing model to parse through the CVs of job applicants to suggest which candidates to hire. Thousands of real CVs submitted by prior applicants were used as training data, labelled by whether or not they were hired.
The model trained using this data began to reflect inherent biases within our world, within the tech industry, and within Amazon as a company, resulting in the model favouring male candidates over others, and failing to close the gender gap in recruitment at the company. A candidate would be less likely to be recommended for hiring by the model if their CV contained the word “women’s”, or mention of having studied at either of two specific all-women’s colleges. This model’s training data was not fit for purpose, as the model it produced reflected the failures of our society, and would have only perpetuated these failures had it been integrated into the company’s hiring process.
It’s important to note where the issue lies here: Natural Language Processing as a technology was not at fault in this scenario – it simply generated a model that reflected the patterns in the data it was provided. A mirror isn’t broken just because we don’t like what we see in it.
For a case such as this, generating synthetic training data initially seems like an obvious improvement over using real data, eliminating the concern over bias in the model entirely. However, synthetic training data must still be defined, generated, analysed, and ultimately signed off for use by someone, or some group of people. The people that make these decisions are humans, born and raised in a biased world, as we all are. We unfortunately all carry unconscious biases, formed from a lifetime of conditioning by the world we live in. If we’re not careful, synthetic data can reflect the biases of the engineer(s) and decision maker(s) specifically, rather than the world at large. This raises the question – which is more problematic?
As a simple starting point for analysis, let’s look at a common exercise used for teaching the basics of building an ML model – creating a salary estimator. In a standard exercise, we might use features like qualification level, years of experience, and location. This doesn’t include names, gender, religion, or any other protected characteristic, and with the features we use, you can’t directly determine any of this information. Can this data still reflect biases in our world?
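As a hypothetical sketch of that classroom exercise, assume a synthetic dataset with exactly those three features; the feature names, formulas and numbers below are invented for illustration, and the key point is that whoever generates the data decides how salary relates to each feature – including whether location should matter at all.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical synthetic training data: no names, gender or other
# protected characteristics, just the classic salary-estimator features.
data = pd.DataFrame({
    "qualification_level": rng.integers(0, 5, n),      # 0 = none .. 4 = PhD
    "years_experience": rng.integers(0, 30, n),
    "location_cost_index": rng.uniform(0.6, 1.4, n),   # local cost of living
})

# The data generator's assumption: salaries scale with local cost of living.
data["salary"] = (
    22_000
    + 4_000 * data["qualification_level"]
    + 1_200 * data["years_experience"]
) * data["location_cost_index"] + rng.normal(0, 2_000, n)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns="salary"), data["salary"], random_state=0
)
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.2f}")
```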
A synthetic training dataset can still reflect the imperfect world we live in, because the presumptions and beliefs of those that ultimately sign off on the data can be embedded into it. Take, for instance, the beliefs of the executive team at a company like Airbnb. They’ve recently abolished location-based pay grades within the company, as they believe that an employee’s work shouldn’t be valued any differently based on their location – if they’re willing to pay an employee based in San Francisco or New York a certain wage for their given contribution to the team, an employee with a similar or greater level of output based in Iowa, Ireland or India shouldn’t be paid less, simply because average income or cost of living where they live happens to be lower.
If synthetic training data for a salary estimation model were to be analysed and approved by someone who had never considered this point of view, or who disagreed with it, the resulting model could be biased against those that don’t live in areas with high average income and cost of living: their predicted salary would likely be lower than that of someone with identical details who lived in a different area.
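Continuing the hypothetical sketch above (and reusing the `model` fitted there), a simple check makes the embedded assumption visible: ask the model to predict salaries for two candidates who are identical in every respect except where they live.

```python
import pandas as pd

# Two identical candidates who differ only in location.
candidates = pd.DataFrame({
    "qualification_level": [3, 3],
    "years_experience": [10, 10],
    "location_cost_index": [1.35, 0.70],  # e.g. a high-cost city vs. a cheaper region
})

predicted = model.predict(candidates)
print(f"High cost-of-living area: {predicted[0]:,.0f}")
print(f"Low cost-of-living area:  {predicted[1]:,.0f}")

# Because the synthetic data encoded a location multiplier, the model
# predicts a lower salary for the second candidate - the data generator's
# assumption has become the model's "belief".
```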
Similarly, returning to the example of Amazon’s biased CV-scanning model, if we were to generate a diverse and robust synthetic dataset to eliminate gender bias in a model, there is still a danger of ML algorithms favouring candidates based on the “prestige” of their universities, for example. As seen with the news of wealthy families paying Ivy League universities to admit their children, this could be biased in favour of people from affluent backgrounds – people more likely to benefit from generational wealth – which can continue to reinforce many of the socioeconomic biases that exist within our world.
Additionally, industries such as tech have a noteworthy proportion of the workforce that, despite having a high level of experience and expertise in their respective field, may not have an official qualification from a university or college, having learned from real-world industry experience. A model that fails to take this into account is one with an inherent bias against such workers.
As these examples show, eliminating bias isn’t as simple as removing protected characteristics, or ensuring an equal balance of the possible values for these features. The trends and systems in our world can reflect its imperfections and biases without showing them explicitly, and beliefs about how these systems should fundamentally operate can vary wildly from person to person. This presents us with an interesting issue moving forward: if, instead of using real-world data so that models mirror the world we live in, we use synthetic data representative of a world in which we wish to live, how do we ensure that the hypothetical future world this data represents is one that works for all of us?
Centuries ago, the rules and boundaries of society were decided on and codified by the literate – those fortunate enough to have the resources and access to education that allowed them to learn to read and write. The rules that governed the masses and defined our way of life were written into law, and, intentionally or otherwise, these rules tended to benefit those with the resources and power to be in the position to write them. As technological advancement saw literacy rates increase, “legalese” – technical jargon used to obfuscate the meaning of legal documents – was used to construct a linguistic barrier once again, this time against those without the resources to attain a qualification in law.
We’re now firmly in the technological age. As computers and software become ever more deeply ingrained into the fabric of society, it’s important that we as developers are aware that, if we’re not careful with where and how we develop and integrate our technological solutions, we could be complicit in allowing existing systems of inequality and exploitation to be solidified into the building blocks of our future society. Technologies like AI and ML can help us tackle systemic issues in our world for the benefit of us all, not just those fortunate enough to sit behind the keyboard, or their CEOs.
However, to achieve this, we must move forward with care, with caution, and with consideration for those outside the tech space. We’re not the only ones influenced by what we create. At a time when the boot of oppression can be destroyed, it’s important that it doesn’t just end up on a different foot.
This is absolutely not to say that AI and ML should be abandoned because of the theoretical dangers of careless usage – it means these tools should be utilised and explored in the right places, and in the right way. The benefits that well-implemented AI/ML can provide, and the fundamental improvements to our way of life and our collective human prosperity that this technology can bring, could change the world for the better, forever.
Technologies such as active learning and deep learning have the capabilities to help automate, streamline and simplify tasks that would otherwise rely on vast amounts of manual human effort.
Reducing the manual human effort and attention required for tasks that can safely and reliably be handled by AI/ML, along with the insights gained from its implementation, can lead to further advancements in science, exploration and innovation in art, and greater work-life balance – giving us back time for leisure and opportunities for shared community experiences, and creating a more connected, understanding society.
That being said, there is just as much opportunity for these tools to be misused to create a more imbalanced, divided and exploitative world, and it’s our job as developers and decision-makers to steer clear of this, pushing the technology and its implementations in the right direction.
I believe that if synthetic data is going to comprise the large majority of data in use in the near future, it is vitally important that we stay aware of its potential pitfalls and utilise it only where it makes the most sense. The difficulty for each individual ML project is in determining whether synthetic or real-world data is the right choice for that specific use case. The Turing Institute’s FAST Track Principles for AI Ethics (Fairness, Accountability, Sustainability/Safety and Transparency) provide a strong framework for ethical decision-making and implementation of AI and ML technology – the spirit of these principles must be applied to all forms of development in the AI/ML space, including the use of synthetic data.
There’s no room for complacency. With great power, comes great responsibility.
To learn more about an AI-Driven Approach to Data Quality, download our AI whitepaper by Dr. Browne.
And for more from Datactics, find us on LinkedIn or Twitter.