
We operate in a world of
increasing data privacy regulation, growing fragmentation across channels and ever-more complex media ecosystems. Against that backdrop, the demand for richer, more granular and more complete datasets
to power robust marketing mix modeling (MMM) has never been higher.
MMM has always been powered by the quality of its data. If you get that right, everything else follows. The
impending convergence of generative AI, large language models (LLMs) and synthetic data is going to be genuinely transformative. Not just for measurement, but for the entire data infrastructure that
underpins it.
Filling the gaps with artificial intelligence
It’s estimated that between 60% and 80% of analytical work involves data
wrangling: pulling data together, reconciling sources, resolving gaps and so on. For MMM, that figure feels, if anything, conservative.
Clients routinely contend with missing variables.
Media channels may be reported only at a total level, without platform-level splits. Competitor activity can be difficult to source. Hyper-local indicators such as weather patterns, local economic
signals or foot traffic data are rarely available in a consistent format. Historically, these limitations have forced compromises in model design and reduced confidence in outputs.
This is where synthetic data will begin to play an important role. Synthetic data is artificially generated data that reproduces the statistical characteristics and structure of real
datasets. It is not random dummy data. When generated responsibly, by models trained on robust benchmarks, proprietary datasets or anonymized patterns across industries, it can help fill gaps where
information is incomplete or unavailable.
Although we’re not quite there yet, synthetic augmentation will increasingly allow us to construct statistically sound datasets that
mirror real-world behavior and enable more comprehensive modeling than would otherwise be possible. Over time, responsibly developed norms databases built from these techniques could provide stronger
foundations for measurement across industries.
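To make the idea concrete, here is a minimal sketch in Python of the simplest version of the technique: estimate the statistical structure of a real dataset, then sample fresh rows from that estimate. All the variables and figures below are invented for illustration, and production generators are far more sophisticated (copulas, deep generative models), but the principle is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented stand-in for "real" weekly marketing data: TV spend,
# search spend and sales over three years of weekly observations.
real = rng.multivariate_normal(
    mean=[100.0, 40.0, 500.0],
    cov=[[400.0, 60.0, 300.0],
         [60.0, 100.0, 150.0],
         [300.0, 150.0, 900.0]],
    size=156,
)

# Fit the generator: estimate the mean vector and covariance matrix
# from the real data, then sample synthetic rows from that fit.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, sigma, size=156)

# The synthetic rows mirror the statistical structure of the real
# ones (means, variances, correlations) without reproducing any
# individual real observation.
print(np.round(np.corrcoef(real, rowvar=False), 2))
print(np.round(np.corrcoef(synthetic, rowvar=False), 2))
```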
Scenario planning at real scale
Since the recent wave of privacy regulation, the industry has lost
access to some of its most informative data sources. Sharing customer transaction data, CRM records and granular audience segments with external partners has become legally and reputationally complex,
and rightly so.
Synthetic data offers a compelling solution. Organizations can generate statistically representative synthetic versions of their own proprietary data and share those
with measurement partners, preserving privacy while restoring the analytical depth that has been lost. The ability to examine customer transaction patterns or niche audience segments
without exposing underlying personal data is, frankly, exciting. It opens doors that regulation has, understandably, closed.
Scenario planning has always been a strength of
well-executed MMM. However, building hypothetical datasets to stress-test model robustness or explore future media strategy has historically been time-consuming and constrained by human bandwidth.
With synthetic data, we will be able to generate thousands of plausible future market conditions. That means testing the resilience of our models, pressure-testing a client’s data
strategy and exploring the implications of different media investment scenarios at a depth and speed that simply were not achievable before. This capability will only sharpen as the tools mature, and
I expect it to become a defining feature of MMM practice within the next two to three years.
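As a sketch of what that exercise could look like, the snippet below assumes a simple diminishing-returns response curve has already been fitted by an MMM; the coefficients, budget ranges and demand assumptions are all invented. The point is the shape of the workflow: simulate thousands of plausible futures and read off a distribution of outcomes rather than a single point forecast.

```python
import numpy as np

rng = np.random.default_rng(7)
n_scenarios = 10_000

# Invented response curve standing in for a fitted MMM: a base level
# scaled by macro demand plus diminishing returns on media spend.
def predicted_sales(tv_spend, search_spend, demand_index):
    return (200.0 * demand_index
            + 80.0 * np.log1p(tv_spend)
            + 45.0 * np.log1p(search_spend))

# Thousands of plausible future market conditions: candidate budgets
# under consideration and uncertain demand around today's level.
tv = rng.uniform(50, 200, n_scenarios)
search = rng.uniform(20, 100, n_scenarios)
demand = rng.normal(1.0, 0.1, n_scenarios)

sales = predicted_sales(tv, search, demand)

# Report the range of outcomes, not a single number.
print(f"median sales: {np.median(sales):.0f}")
print(f"5th-95th percentile: {np.percentile(sales, 5):.0f}"
      f" to {np.percentile(sales, 95):.0f}")
```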
From unstructured information to modeling signals
Beyond synthetic data, LLMs are unlocking a different kind of opportunity. As deep learning models trained on massive text datasets, they can make the vast universe of
unstructured data legible to MMM.
MMM has always relied predominantly on structured data that’s clean, labeled and tabular. But enormous quantities of contextually rich
information exist in text documents, news sentiment, industry reports, internal survey results and qualitative research outputs that have never been systematically incorporated into models.
LLMs can increasingly help convert that unstructured information into structured signals that models can incorporate. Being able to systematically account for sentiment shifts, news cycles
or broader market narratives alongside traditional variables could add valuable context and explanatory power to measurement models.
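A minimal sketch of that conversion might look like the following. A crude keyword rule stands in for the LLM call so the example stays self-contained and runnable; the headlines are invented, and a real pipeline would prompt a hosted model for the score instead.

```python
from statistics import mean

def llm_sentiment(text: str) -> float:
    # Hypothetical stand-in for an LLM call that returns a sentiment
    # score in [-1, 1]. A real version would ask a hosted model to
    # rate the text and parse the numeric reply; here a crude keyword
    # rule keeps the sketch self-contained.
    positive = ("growth", "record", "cuts prices", "strong")
    negative = ("recall", "strike", "shortage", "raises prices")
    score = sum(w in text.lower() for w in positive)
    score -= sum(w in text.lower() for w in negative)
    return max(-1.0, min(1.0, score / 2.0))

# Unstructured inputs: news headlines grouped by the week they ran.
headlines_by_week = {
    "2024-W01": ["Retailer cuts prices ahead of holidays",
                 "Supply shortage hits electronics category"],
    "2024-W02": ["Strong demand drives record quarter"],
}

# Convert free text into one structured weekly signal that an MMM
# can use alongside spend, pricing and seasonality variables.
sentiment_index = {
    week: mean(llm_sentiment(h) for h in texts)
    for week, texts in headlines_by_week.items()
}
print(sentiment_index)  # e.g. {'2024-W01': 0.0, '2024-W02': 1.0}
```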
Agentic AI workflows, which go beyond static
predictions to execute multi-step plans and make decisions, will also reduce the time it takes to get from messy, fragmented data streams to model-ready inputs. They should help accelerate data
collection, validation and cleaning. Longer term, they are expected to deliver self-governing, self-healing data pipelines able to identify anomalies, predict breakages and harmonize disparate
sources automatically, ultimately becoming standard infrastructure.
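The simplest building block of such a pipeline is an automated sanity check on incoming feeds. The sketch below flags a new observation that sits far outside a feed’s historical range, a common symptom of a broken tracker or a changed report format; the threshold and feed values are invented, and an agentic system would layer planning and remediation on top of checks like this.

```python
import numpy as np

def check_feed(history: np.ndarray, latest: float, z_threshold: float = 4.0) -> str:
    # Flag a new observation far outside the feed's historical range.
    mu, sd = history.mean(), history.std()
    if sd == 0:
        return "flat history - cannot score"
    z = abs(latest - mu) / sd
    return "anomaly - hold for review" if z > z_threshold else "ok - load"

rng = np.random.default_rng(1)
weekly_impressions = rng.normal(1_000_000, 50_000, size=104)  # invented feed

print(check_feed(weekly_impressions, 1_020_000))  # ok - load
print(check_feed(weekly_impressions, 0.0))        # anomaly - hold for review
```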
Speed, trust and the human element
None of this, however, changes the
fundamental requirement for human expertise and authentic partnership. Synthetic data is an augmentation of, not a replacement for, real data. Models built on biased or poorly governed synthetic inputs
will produce poor outputs, and no amount of computational sophistication changes that. The human role remains: identifying the right model, interpreting outputs in context and ensuring that the
assumptions embedded in any AI-assisted process are sound. That role becomes more important, not less, as these AI-powered tools proliferate.
Filling data gaps intelligently,
unlocking first-party data safely, sharpening scenario planning and transforming unstructured information into modeling-ready insight will soon be realities for marketers. Yet
authenticity, transparency and trust will remain the non-negotiables of good marketing prediction and measurement practice.
The tools are evolving quickly, but the
fundamentals have not changed: synthetic data augments, it does not replace; AI accelerates, it does not decide. The organizations that lead in this space will combine genuine technical
capability with the econometric and measurement expertise to deploy it well, alongside partners they can trust to tell them the difference.