Synthetic data and its importance to AI-first startups

At Hillfarrance, we have an unhealthy obsession with data. Here are some of our musings on the future of synthetic data and digital twins, and how they can help advance the frontiers of machine learning.


We are firmly in the age of data. According to the International Data Corporation (IDC), in 2020 we created or replicated 64.2ZB of data (a zettabyte is roughly one billion terabytes). The IDC predicts that “the amount of data created over the next five years will be greater than twice the amount of data made since the advent of digital storage.”

One must wonder if the age-old minimalist philosophy of quality over quantity will apply with this proliferation of data. Perhaps more pertinent is the question of how much of this data we will actually be able to access and use.

Today, the lion's share of data is not publicly available and is protected by a wide range of measures, from privacy regulations to security controls. It is here that synthetic data comes into play, and according to Will Haringa, Founder & CEO of Segna, “synthetic data will emerge as a key resource on the toolbelt of machine learning engineers.”



Setting the scene


As discussed in our blog post last year, the higher the quality of the data you analyse, the deeper the defensive moat you build.

Digging this moat is easier said than done. To illustrate this, let's take a look at one of the tech sectors most affected by regulations and protocols - fintech:

  • Getting hold of financial data is problematic as most consumers are not keen on parting with their personal information. Publicly available financial data sets tend to be less granular and can be expensive to purchase.

  • Most data is either partially or entirely anonymised to avoid costly and lengthy litigation. Anonymising the information typically strips the original data of much of its utility and value.

  • It is often legally forbidden for data to be shared with third parties or even across internal departments.

  • Effective machine learning needs large amounts of clean data. Sadly, most data is prone to significant bias at the source, especially when it comes from a niche financial services provider.

The outcome of these limitations is that training effective and valuable machine learning models is difficult if you are an early-stage fintech platform.



Using synthetic data to fill in the blanks


Synthetic data does not contain any real data. Instead, it aims to mimic the subtleties and complexities of real-world datasets: how they are distributed, the relationships and connections they contain and the outcomes they generate. This mitigates, to a degree, some of the issues mentioned earlier in this post. The Alan Turing Institute notes that “Synthetic data can be used to train data-hungry algorithms in small-data environments, or for data sets with severe imbalances. It can also be used to train and validate machine learning models under adversarial scenarios.”

Perhaps one of the most exciting features of synthetic data is that it can provide early-stage startups with the chance to achieve minimal algorithmic performance without needing to spend the considerable resources it typically takes.

Synthetic data can also create entirely new data points while still preserving the original data’s statistical features, allowing data scientists and machine learning engineers the chance to slice, dice and create new scenarios altogether as a result.
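To make "new data points that preserve the original data's statistical features" a little more concrete, here is a minimal sketch in Python. It assumes the simplest possible generator: fit the means, spreads and correlation of two numeric columns, then sample fresh rows from a bivariate Gaussian with those parameters. The function names and the income/spend columns in the usage note are our own illustrations, not any particular vendor's method, and real tools model far richer structure than this.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n brand-new rows from the fitted distribution (no real row is copied)."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mixing z1 into y reproduces the fitted correlation between the columns.
        y = my + sy * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows
```

For example, given a list of (income, monthly spend) pairs called real_rows, sample_synthetic(fit_bivariate_gaussian(real_rows), 1000, random.Random(0)) returns 1,000 new pairs with roughly the same averages and the same income-to-spend relationship, none of which is a real customer.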



Synthetic data and machine learning


Machine learning closed loops are hungry for highly dimensional data to create a robust and reliable model. Our experience from previous investments has shown that, depending on the process you are trying to replicate, this could require millions of data points to achieve minimal algorithmic performance (MAP). One such example is computer vision and image processing. If you have ever seen one of Google’s or Uber’s autonomous vehicles driving around, you may be surprised to hear that a large portion of their algorithmic training is based on synthetic data.

A notable breakthrough in image generation is the development of Generative Adversarial Networks (GANs). A GAN consists of two neural networks: a generator and a discriminator. The generator creates synthetic images that aim to look like real-world images, whilst the discriminator tries to tell the real images from the synthetic ones. The two networks are trained against each other, each adjusting its weights in response to the other, so the generated images become progressively harder to distinguish from real ones.
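The generator-versus-discriminator loop can be sketched in a few lines of plain Python. This is a deliberately tiny toy, not an image model: it assumes one-dimensional "real" data drawn from a Gaussian, a generator of the form x = a * z + b, and a logistic discriminator D(x) = sigmoid(w * x + c), with gradients written out by hand. All names and hyperparameters are our own assumptions. With a discriminator this simple, the generator can at best learn where the real data sits, not its full shape.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_mean=4.0, real_std=1.0, steps=2000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: x_fake = a * z + b
    w, c = 0.1, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, real_std) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        d_real = [sigmoid(w * x + c) for x in real]
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_w = (sum((d - 1.0) * x for d, x in zip(d_real, real))
                  + sum(d * x for d, x in zip(d_fake, fake))) / batch
        grad_c = (sum(d - 1.0 for d in d_real) + sum(d_fake)) / batch
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimise -log D(fake), i.e. try to fool the discriminator.
        d_fake = [sigmoid(w * x + c) for x in fake]
        grad_a = sum((d - 1.0) * w * z for d, z in zip(d_fake, noise)) / batch
        grad_b = sum((d - 1.0) * w for d in d_fake) / batch
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, w, c
```

Calling train_toy_gan() returns the learned parameters; the generator's offset b is the thing to watch, since it shows whether the fake samples have drifted towards the real data. Production GANs replace both linear functions with deep networks and rely on a framework to compute the gradients.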



Harnessing the power of video game engines for synthetic data collection


In addition to GANs, synthetic data can originate from video game engines, especially for models looking to simulate physics and crowd dynamics. The current capabilities of game engines are limited and narrow in focus, mainly because no existing engine is indistinguishable from reality. That being said, using engines such as Unity or Unreal Engine 5 to understand how people react to hazards on the road in a driving game is very valuable for autonomous car software platforms. Equally, MMORPGs have hundreds of players playing concurrently, which may shine a light on how groups of people behave in certain situations. As with all synthetic data, this isn’t enough to create fully formed predictions, but it is a valuable data point to augment information from the real world. Platforms like are doing exciting things in this space.

If VR ever reaches its true potential and achieves mass consumer adoption, the potential for data collection is truly enormous. We believe that one of the most exciting outcomes from the eventual launch of the metaverse is not how many NFTs you can sell but rather how much behavioural data its architects can collect. 

We are deeply interested in this area and would love to talk to entrepreneurs or academics exploring it.



Types of synthetic data

Choosing wisely and objectively which type of synthetic data you use is vital.

Kajal Singh, the author of “Applications of Reinforcement Learning to Real-World Data” (2021), puts synthetic data into three broad categories:

  1. Fully synthetic data. This is entirely synthetic data that contains no real-world records. This category has strong privacy protection at the expense of the truthfulness of the data.

  2. Partially synthetic data. This data replaces some sensitive elements of the data set with artificial values. The real-world values are synthesised if they contain a high risk of disclosure and, in turn, preserve the privacy in the freshly generated data. An example of this is American Express generating statistically accurate synthetic data from financial transactions to perform fraud detection.

  3. Hybrid synthetic data. Probably the most commonly used form, hybrid synthetic data is generated using both real and synthetic data. For each randomly chosen record of real-world data, a similar record in the synthetic data is selected, and the two are combined to create a hybrid record. Hybrid synthetic data offers the advantages of both fully and partially synthetic data and, as a result, provides good privacy preservation with a high output value.
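The three categories above can be sketched side by side. This is a rough illustration under our own assumptions rather than how Singh or American Express implement them: records are small dictionaries with hypothetical name, account and amount fields, the fully synthetic generator is a single Gaussian fitted to the amounts, and "similar" in the hybrid step simply means the closest amount.

```python
import random
import statistics


def _fake_account(rng):
    # Hypothetical 8-digit account number.
    return "".join(rng.choice("0123456789") for _ in range(8))


def fully_synthetic(real, n, rng):
    """Generate n records from a fitted distribution; no real record is reused."""
    amounts = [r["amount"] for r in real]
    mu, sigma = statistics.mean(amounts), statistics.pstdev(amounts)
    return [
        {"name": f"synthetic_{i:04d}",
         "account": _fake_account(rng),
         "amount": round(rng.gauss(mu, sigma), 2)}
        for i in range(n)
    ]


def partially_synthetic(real, rng):
    """Keep real amounts but replace the high-disclosure-risk fields."""
    out = []
    for i, rec in enumerate(real):
        masked = dict(rec)  # copy, so the real record is left untouched
        masked["name"] = f"customer_{i:04d}"
        masked["account"] = _fake_account(rng)
        out.append(masked)
    return out


def hybrid(real, synthetic, key="amount"):
    """Pair each real record with its closest synthetic record, then merge the two."""
    merged = []
    for rec in real:
        nearest = min(synthetic, key=lambda s: abs(s[key] - rec[key]))
        combined = dict(nearest)  # identifying fields come from the synthetic record
        combined[key] = rec[key]  # the useful value comes from the real record
        merged.append(combined)
    return merged
```

The trade-off described in the list shows up directly: fully_synthetic never touches a real amount, partially_synthetic keeps every real amount, and hybrid sits in between by matching the two sets.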



Case study: A synthetic data-focused startup

We have covered a fair chunk of theory. Let's talk about synthetic data in the wild and a real-life example of how a startup has built a business around it.

I recently read an article on the growth of and its platform, which claims to offer a stack for synthetic data including a developer environment, a content management system, scenario building, compute orchestration, post-processing tools, and more.

The CEO and founder, Nathan Kundtz, was inspired to create after building and exiting his previous startup, Kymeta (Kymeta is a developer of hybrid satellite-cellular networks). During this time, Nathan kept hearing about the challenges people in the satellite industry were having with data. He decided to publish his thoughts in a whitepaper and launched in 2019.

Starting with satellites and other camera-based imagery from third-party sources such as Orbital Insight, set out to improve object-detection performance outcomes through synthetic data. In particular, helped them modify synthetic images, so the trained AI model can generalise them to real images. They also helped use both a large set of synthetic images and a small set of real examples to train a model jointly. raised $6m in seed capital in 2021 and has just released an open-source platform to help onboard users who want to use it to train their models.



Challenges with synthetic data and the opportunities they create for startups in this field


If you have gotten this far in the blog, I suspect you may have already clocked some of the opportunities that synthetic data provides and some of the challenges. As ever, those challenges are also opportunities for the entrepreneurially minded:

  • It is still difficult to generate quality synthetic data. There are still significant challenges and inconsistencies when replicating the behavioural attributes of real-world data.

  • The flexible nature of how synthetic data is created makes it easy for human-aided bias to appear at the source.

  • If you are modelling outcomes that can significantly impact users (healthcare-related models, etc.), you will still need actual data for validation purposes.

  • Synthetic data has limited applications for natural language processing. Speech and text are highly nuanced, hard to synthesise and require a significant amount of human-aided tagging.

  • Your users or customers may not accept the validity of synthetic data and reject your business model.
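One practical response to the quality and validation challenges above is to measure how far a synthetic column drifts from the real one before trusting it. The sketch below assumes a single numeric column and uses the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions (0 means identical, 1 means no overlap at all). It is only one of many possible fidelity checks, and the function name is our own.

```python
import bisect


def ks_statistic(real, synthetic):
    """Largest gap between the empirical CDFs of two numeric samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap can only change at an observed value, so checking those is enough.
    for x in real_sorted + synth_sorted:
        f_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        f_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(f_real - f_synth))
    return gap
```

A score close to 0 says the synthetic column is distributed like the real one; it says nothing about relationships between columns or about bias inherited from the source, so it complements validation on actual data rather than replacing it.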



Synthetic data and digital twins


In addition to synthetic data whetting our appetite, we also have quite the hunger for digital twin technology. A digital twin is, in essence, a virtual model replicating and reflecting a physical object or process. This might be a logistics company building a digital twin of its fulfilment process or an agritech company building a digital clone of the sensor output from a wind turbine.

If I have managed to keep your attention through this piece, you will hopefully have noticed a glaring similarity between the nature of synthetic data and digital twins: they both create scenarios for real and virtual events and objects.

This symbiotic relationship opens up the possibility of extending and expanding the value of synthetic data models. Equally, synthetic data could help digital twins simulate different scenarios more efficiently and effectively.
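As a rough illustration of a digital twin producing synthetic data, here is a toy model of the fulfilment process mentioned earlier. Everything in it is an assumption made for the sketch: orders arrive at random intervals, each is handled by the first free worker, and handling times are random. Changing n_workers lets you generate data for scenarios that may never have happened in the real warehouse.

```python
import heapq
import random


def simulate_fulfilment(n_orders, n_workers, mean_gap=1.0, mean_service=2.5, seed=0):
    """Toy twin of a fulfilment line; returns one synthetic record per order."""
    rng = random.Random(seed)
    free_at = [0.0] * n_workers  # time at which each worker next becomes free
    heapq.heapify(free_at)
    now = 0.0
    records = []
    for order_id in range(n_orders):
        now += rng.expovariate(1.0 / mean_gap)      # next order arrives
        start = max(now, heapq.heappop(free_at))    # wait for the earliest free worker
        finish = start + rng.expovariate(1.0 / mean_service)
        heapq.heappush(free_at, finish)
        records.append({"order": order_id, "arrived": now,
                        "wait": start - now, "finished": finish})
    return records


def mean_wait(records):
    return sum(r["wait"] for r in records) / len(records)
```

Comparing mean_wait(simulate_fulfilment(500, 2)) with mean_wait(simulate_fulfilment(500, 4)) gives a feel for how staffing changes queueing, and the per-order records can be used as synthetic training data alongside whatever the real process has logged.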

If you have a hypothesis or business idea seeking to combine synthetic data and digital twins, we would love to hear from you.

If you would like to read more about digital twins, we recommend checking out this article by Steffen Merten.





As you can probably tell, we are incredibly bullish on the future of machine learning and the range of tools and methodologies that can push its capabilities into unknown realms.

To reach these new extremes, we will need to democratise and speed up the pathway to achieving minimal algorithmic performance within AI-first companies. Despite its limitations, synthetic data forms a vital injection into machine learning closed loops. Combine this with advancements in digital twin technology and video game engines, and we are looking at a wildly impressive toolset.

If you are still on the fence, I will leave you with the results from a recently published survey by Datagen. They discovered that 96% of teams report using synthetic data to train computer vision models. 99% of respondents reported having had an ML project wholly cancelled due to insufficient training data, and 100% of respondents reported experiencing project delays due to inadequate training data.

As George Bernard Shaw once said, “Imitation is not just the sincerest form of flattery - it’s the sincerest form of learning”.

P.S. I asked an AI platform that creates artwork by analysing words to develop a set of unique pieces for this article. Here is what it interpreted from “land+of+the+long+white+cloud”, "new+zealand" and "synthetic+data". Check out Dream in the iOS App Store if you want to try it yourself.


[Figures: six AI-generated artworks]