Foundation AI Models for Product Analytics

Alex Chin

and

Sean J. Taylor

February 13, 2024

Introduction

How can AI help us understand our products and our users better? Code generation and automatic data analysis tools are good starts, but we believe this indirect approach still leaves a lot on the table. The LLMs powering code generation are trained on billions or even trillions of tokens in order to develop a deep understanding of the semantics of sequences of tokens. Product usage data also consists of large collections of sequences of user event tokens, making it a good fit for deep learning architectures similar to those used for natural language and code.

Consider the workflows of three analysts.

Analyst 1 is doing analytics the traditional way, writing code to interact with the data warehouse.
Analyst 2 has access to an AI-powered chatbot and copilot tool that can write SQL for her and compile basic summary tables into charts and graphs. She can use the tool to improve her productivity and get faster answers to her questions.
Analyst 3, on the other hand, is working with Motif’s AI models that have read and understood her underlying event sequence data. She can directly explore the resulting insights with interactive visualizations, providing a faster feedback loop and surfacing more useful insights.

At Motif, we are applying breakthroughs in language modeling to train foundation models of event sequences in order to surface more useful and important insights for decision makers. In this post we’ll explain how we do it, why we are doing it, and how we are applying these foundation models in practice.

Generalized Large Event Analytics Models (GLEAM)

Product event sequences are rich data sets generated by websites, apps, and backend infrastructure as they operate to serve user-facing requests. Product instrumentation is organized around capturing a log of immutable events, each typically carrying a user identifier, a timestamp, and a variety of additional properties about the event.

Event log data sets trace all the steps that users take, as well as the specific processes that backend and frontend software execute in order to facilitate those user steps. They are often generated by several different systems and need to be merged together. When systems are instrumented well, raw event data is an attractive data asset to work with because it contains the most complete possible representation of a user’s behavior and experience.
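As a rough illustration of the data shape described above, the sketch below defines a minimal event record and merges logs from two systems into a single time-ordered stream. The `Event` class, its field names, and the sample events are hypothetical and chosen only for illustration; real instrumentation will differ.

```python
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class Event:
    """One immutable log entry (hypothetical schema for illustration)."""
    user_id: str
    timestamp: float          # seconds since some reference time
    name: str                 # e.g. "page_view"
    properties: tuple = ()    # extra (key, value) pairs logged with the event


def merge_logs(*logs):
    """Merge per-system logs, each already sorted by time, into one stream."""
    return list(merge(*logs, key=lambda e: e.timestamp))


# Made-up logs from two systems for the same user.
frontend = [
    Event("u1", 100.0, "page_view", (("page", "/pricing"),)),
    Event("u1", 160.0, "click", (("button", "signup"),)),
]
backend = [
    Event("u1", 161.5, "account_created", (("plan", "free"),)),
]

timeline = merge_logs(frontend, backend)
```

`heapq.merge` assumes each input is already sorted, which is usually true of append-only logs; unsorted sources would need a sort first.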

To model event sequences with minimal changes to the underlying data, our approach is to mimic the training of LLMs on text data but with two key modifications:

Event tokenization: We developed tokenizers for events that richly capture their timing and various properties that have been logged in a flexible way.
Multi-time-scale loss function: We modified the training objective to make sequence models less myopic, encouraging them to learn how sequences will evolve over the next few minutes, hours, and days rather than simply predicting the next event.

These architectures comprise what we are calling Generalized Large Event Analytics models (✨GLEAM✨).

Tokenization

In NLP, tokenization refers to the mapping of short strings of text characters into integer IDs that can be parsed by the AI model. For event log datasets, the events comprise a more complex nested data structure with useful information that we want to make sure is not discarded during the tokenization procedure. We create a vocabulary for the set of things that can happen in the system, and that vocabulary needs to trade off size and fidelity.

A key part of that vocabulary is the relative timing of events, which conveys rich information about the underlying user behaviors and experiences. It matters a great deal whether a minute or a day has passed between two user actions, since these indicate quite different levels of engagement. To capture this, we developed a duration binning scheme that encodes this information at multiple time scales while keeping the vocabulary size efficient.

Durations are binned on a human-interpretable, roughly logarithmic scale. Each boundary is roughly 2-5x larger than the previous one.

The values of event properties are assigned tokens, so events are represented with multiple tokens, and take on different lengths depending on which properties are captured. This is a flexible framework that extends to rich event data while allowing us to use off-the-shelf sequence models designed for text data.
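To make the multi-token representation concrete, here is a small sketch in which each event becomes one token for its name plus one token per logged property, and a vocabulary maps token strings to integer IDs. The token format (`event:...`, `key=value`), the `Vocabulary` class, and the sample events are hypothetical, not Motif's tokenizer.

```python
class Vocabulary:
    """Assigns a stable integer ID to each distinct token string."""

    def __init__(self):
        self.ids = {"<unk>": 0}

    def add(self, token: str) -> int:
        # Returns the existing ID, or assigns the next free one.
        return self.ids.setdefault(token, len(self.ids))


def event_to_tokens(name: str, properties: dict) -> list:
    """One token for the event name, then one per logged property."""
    tokens = [f"event:{name}"]
    for key in sorted(properties):          # sorted for a deterministic order
        tokens.append(f"{key}={properties[key]}")
    return tokens


vocab = Vocabulary()
events = [
    ("page_view", {"page": "/pricing"}),
    ("purchase", {"plan": "pro", "currency": "USD"}),
    ("page_view", {"page": "/pricing"}),
]
# Flatten all events into a single integer sequence for a text-style model.
sequence = [vocab.add(t) for name, props in events
            for t in event_to_tokens(name, props)]
```

Events with more properties produce more tokens, so sequence length varies per event. In a scheme like this, high-cardinality property values would inflate the vocabulary and would likely need bucketing or hashing.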

Long-term objective

The self-supervised objective for decoder (auto-regressive) models is to predict the next token in a sequence. While this is undoubtedly useful for generative models used in text-generation tasks, generating sequences is not our primary goal, and we have found that this myopic prediction task does not generalize as well to analytical AI tasks. Product analytics data has a relatively constrained grammar and information content compared to language, and sequence models can do well at predicting the next event. However, achieving good performance on predicting user behaviors such as churn, retention, and purchasing patterns involves predicting tokens that are typically observed far into the future of the sequence.

To give the model feedback about the longer-term trajectories of event sequences, we generate future-to-past convolutions of the input data to provide additional supervision. The model learns a representation to forecast what is likely to happen to the user over standard time-horizons for analytics tasks, in contrast to predicting the next event to generate realistic event sequences. We have found these representations to be quite different in practice and that strictly auto-regressive models do not generalize well to longer-term estimation without this kind of pre-training.

The user-specified value function can be quite flexible and can include non-parametric objects of interest such as the survival function and cumulative incidence.
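One way to picture this extra supervision is as a set of per-event labels computed by sweeping the sequence from the future back to the past. The sketch below is an interpretation under stated assumptions, not the actual GLEAM objective: the horizons, the function name `horizon_targets`, and the choice of a binary "does the target event occur within the horizon" label are all hypothetical.

```python
import math

# Hypothetical forecast horizons, in seconds.
HORIZONS = {"10m": 600, "1h": 3600, "1d": 86400}


def horizon_targets(times, names, target):
    """For each event, flag whether `target` occurs within each horizon after it.

    A single backward (future-to-past) pass tracks the time of the next
    target event, so the whole sequence is labeled in one sweep.
    """
    next_target_time = math.inf
    labels = [None] * len(times)
    for i in range(len(times) - 1, -1, -1):
        wait = next_target_time - times[i]   # inf if no later target event
        labels[i] = {h: int(wait <= s) for h, s in HORIZONS.items()}
        if names[i] == target:
            next_target_time = times[i]
    return labels


# Made-up sequence: event times in seconds and event names.
times = [0, 3500, 4000, 90000]
names = ["signup", "page_view", "purchase", "page_view"]
labels = horizon_targets(times, names, "purchase")
```

Each position now carries targets at several time scales rather than only the identity of the next token. This sketch ignores censoring: events near the end of the observation window have an unknown future, which a survival-style value function would need to handle.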

GLEAM eliminates an information bottleneck

Because event sequence data is so rich and complex, it is usually heavily summarized before it is analyzed by data scientists, analysts, and product managers. It is common to segment users based on easily measurable properties such as the cohort when they signed up, their country, their device, or level of recent activity. These user segments are undoubtedly useful but they tend to make restrictive assumptions:

They should be straightforward to compute in SQL: Due to convenience, analysts often stick to properties that are already available in source tables rather than ones which are challenging or expensive to compute.
They should be simple: A small number of segments provides a relatively coarse decomposition of user behavior and product experiences. Typically, user segments must be mutually exclusive and collectively exhaustive (MECE) and relatively low-dimensional to be useful in practice.
They (mostly) should not vary over time: Time-varying segments are more challenging to work with analytically. As users drift between segments, the analyses and breakdowns grow more complex and dimensional modeling breaks down.

When data transformations are used to create pre-modeled datasets for scientists and analysts, they create an information bottleneck in downstream analyses, discarding potentially valuable information in the pre-transformation event data. The set of hypotheses that can be tested using this approach is constrained by how well the transformations have captured the important properties of the data. As datasets grow ever larger and richer while products and use cases grow ever more complex, standard modeling techniques create blind spots: hypotheses that are difficult to test without changing how data is processed, which burdens data engineering teams.

The workaround we have observed in practice is for data scientists to develop machine learning models for predicting user behaviors. Churn, retention, and LTV models can be trained on complex, human-designed features based on lagged counts and rates of user actions. These task-specific models are obviously valuable but they have some deficiencies of their own:

Complexity and time cost: For purely analytical use-cases, it is difficult to justify building and maintaining complex machine learning models.
Low reliability: Behavioral models depend on the upstream data quality being maintained, and may not be resilient to data pipeline changes and outages.
Feature engineering bottlenecks: Developing features which can be computed, are predictive, and provide valuable insights is an ongoing challenge. Even with diligent work, it can still result in many hypotheses being difficult to test without re-training.

Our foundation model approach addresses these deficiencies in a convenient way. Once pre-trained, GLEAM is a rolling feature generator, obviating the need for feature engineering pipelines and reducing the creation of new models to a simple fine-tuning step. Representations computed by GLEAM from an event sequence have some important advantages:

they provide size-efficient features for any downstream task;
they are resilient to messy or missing data; and
they automatically adapt to and encode information from logging improvements over time.

Through training and inference procedures we’re able to produce a useful latent space for user trajectories, and the user’s (time-varying) position in this latent space is a valuable and simple-to-use source of information about their past and future behaviors. However, interpreting this latent space requires us to decode shifts within it and understand what they mean, which is a data exploration task.

How we use GLEAM in practice

So we’ve trained a GLEAM architecture on event sequence data. So what? These models have a superpower: they create a fixed-length, information-rich representation of a user’s entire event history for every event in the sequence. These are more than user embeddings; they are user-history embeddings capturing time variation in the user’s state. Meaningful shifts in the latent space over time tend to be interesting: they capture users learning about the product, making it through certain onboarding milestones, using different sets of features, and becoming more or less likely to buy, churn, retain, or convert.

The GLEAM latent space can be projected into two-dimensional space using a standard dimensionality reduction technique such as UMAP or t-SNE, giving a user-history space that is useful for exploratory visualizations. These latent space visualizations depict:

States: clusters of states that users can be in, indicated by areas of density in the latent space.
Flows: subsequences of events that transition users between states according to the model.

Generally, visualizations of high-dimensional spaces can be tricky to get right, and we have found this to be true in this application. But the points in the visualization can be directly linked to the original event data: the mapping between them is preserved and can provide the user with a clear interpretation of what the states and paths mean by focusing on example users.

Fine-tuning to outcomes and finding examples

The latent space is a compressed form of the raw data, from which we can project into what we call the “outcome space,” which captures the expected value of some important outcome of interest for the analyst. The outcome space is a low-dimensional state space that we can directly interpret as business value, allowing analysts to interpret latent space in meaningful terms.
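The post does not specify how the projection into outcome space is computed. As one plausible sketch, assume a fine-tuned logistic head that maps each latent vector to the probability of an outcome such as retention. The function name, the 3-dimensional latent vectors, and the weights below are all made up for illustration.

```python
import math


def outcome_probability(latent, weights, bias):
    """Logistic head: map a latent vector to an outcome probability."""
    score = bias + sum(w * x for w, x in zip(weights, latent))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical 3-dimensional latent states for one user at successive events.
latents = [[0.1, -0.2, 0.0], [0.4, 0.1, 0.3], [1.2, 0.8, 0.9]]
weights, bias = [1.5, 0.5, 1.0], -1.0   # made-up parameters, not fitted values

# One outcome estimate per event: the user's path through outcome space.
trajectory = [outcome_probability(z, weights, bias) for z in latents]
```

Because there is one latent vector per event, the head yields one estimate per event, so the user traces a path through outcome space that can be read directly in business terms.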

Large shifts in the outcome space are typically interesting for analysts to investigate. They point to events which meaningfully change the estimate of some outcome for the user once observed by the model. These shifts can be immediate, from observing a single event, or gradual, from the accumulation of events over a contiguous time period. From a causal inference perspective we can think of shifts in the outcome space as either:

Revealing information about the user: the user has an unobserved probability of the outcome that we learn more about through their behaviors and experiences.
Causing the user’s outcome to change: in a counterfactual sense, observing an alternative event would result in a different outcome for the user. For example, if the event were an exposure to random assignment of an A/B test, then the change in the outcome space will capture an estimate of the individual treatment effect.
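A naive way to surface both kinds of shift, immediate and gradual, from a sequence of per-event outcome estimates is sketched below. The threshold, the window length, and the `find_shifts` helper are assumptions for illustration; the rule used in practice may be quite different.

```python
def find_shifts(estimates, threshold=0.2, window=3):
    """Flag large moves in a per-event sequence of outcome estimates.

    "immediate": a single event moves the estimate by at least `threshold`.
    "gradual":   no single step is that large, but the total change over the
                 last `window` steps is.
    Returns (kind, start_index, end_index, delta) tuples.
    """
    shifts = []
    for end in range(1, len(estimates)):
        step = estimates[end] - estimates[end - 1]
        if abs(step) >= threshold:
            shifts.append(("immediate", end - 1, end, round(step, 3)))
            continue
        start = max(end - window, 0)
        steps = [estimates[i + 1] - estimates[i] for i in range(start, end)]
        total = estimates[end] - estimates[start]
        if abs(total) >= threshold and all(abs(s) < threshold for s in steps):
            shifts.append(("gradual", start, end, round(total, 3)))
    return shifts


# Made-up outcome estimates for one user, one value per event.
estimates = [0.10, 0.12, 0.45, 0.50, 0.58, 0.66, 0.30]
shifts = find_shifts(estimates)
```

On this made-up trajectory the function reports a jump at the third event, a slow climb across the next three, and a sharp drop at the end. A detected shift says only that the estimate moved; whether it reveals information or reflects a causal effect still requires the reasoning above.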

As the analyst sifts through shifts in the outcome space, she can determine the most important events in the business process and generate hypotheses about which ones might be suitable for interventions and product changes. This theory building can be accomplished by:

Inspecting interesting examples: viewing the event sequence in the data associated with the shift.
Retrieving nearest neighbors in the latent space: determining how common the context for this shift is by using the latent space to find sequences with similar user contexts.
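The nearest-neighbor step can be sketched with a brute-force cosine-similarity search. The index keyed by `(user_id, event_position)`, the 3-dimensional vectors, and the function names are hypothetical; the choice of cosine similarity is also an assumption, since the post does not name a distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, index, k=2):
    """Return the k keys in `index` whose vectors are most similar to `query`."""
    ranked = sorted(index, key=lambda key: cosine_similarity(query, index[key]),
                    reverse=True)
    return ranked[:k]


# Hypothetical latent vectors keyed by (user_id, event_position).
index = {
    ("u1", 12): [0.9, 0.1, 0.0],
    ("u2", 40): [0.8, 0.2, 0.1],
    ("u3", 7): [0.0, 1.0, 0.2],
    ("u4", 3): [-0.7, 0.1, 0.9],
}
neighbors = nearest_neighbors([1.0, 0.1, 0.0], index, k=2)
```

Each returned key points back to a specific position in a specific user's event sequence, so the analyst can open the raw events behind every neighbor. A linear scan like this is fine for a sketch; large collections typically call for an approximate nearest-neighbor index.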

Generalizing from specific observations

The path from raw event data to the outcome space is bi-directional, and this enables powerful new workflows we have built into Motif. If the analyst observes a large shift in outcome space, it means that the model has some evidence that something important has happened to this user. The natural step is to ascertain why this may be happening, which is a model interpretation task. The approach enabled by the latent space is to use nearest-neighbor search to find users in similar states with similar shifts, much like retrieving related documents in a retrieval-augmented generation (RAG) context. Roughly, we capture three sets of user histories:

Similar state users: this set contains all users who have similar historical sequences to the user of interest. Two subsets of this set are interesting:
Treated paths: users for whom the shift we are interested in has occurred.
Counterfactual paths: users who were in the similar state, but for whom the shift did not occur.

We can analyze this retrieved data set as an observational experiment. Obviously the differences are not always caused by the path the user took (there are many sources of confounding), but by inspecting the underlying events using Motif’s visualizations, we can determine via domain knowledge if the difference between the two groups yields potential for intervention. Based on the framework introduced in our earlier post, this workflow supports the tasks of finding previously unknown effects (inspecting the outcome space) and determining unknown causes (by inspecting sets of examples where the effect appears to be present).
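A minimal sketch of this comparison follows: split the retrieved similar-state users by whether the shift event followed, then compare their eventual outcomes. The record layout (`next_events`, `outcome`), the event names, and the `compare_paths` helper are hypothetical, and the resulting difference is a naive observational contrast, not a causal estimate.

```python
from statistics import mean


def compare_paths(neighbors, shift_event):
    """Split similar-state users by whether `shift_event` followed, then compare outcomes."""
    treated = [n["outcome"] for n in neighbors if shift_event in n["next_events"]]
    control = [n["outcome"] for n in neighbors if shift_event not in n["next_events"]]
    if not treated or not control:
        return None   # one group is empty, so there is nothing to compare
    return {
        "n_treated": len(treated),
        "n_counterfactual": len(control),
        "naive_difference": mean(treated) - mean(control),
    }


# Hypothetical retrieved neighbors: what happened next and the eventual outcome (1 = retained).
neighbors = [
    {"user": "u1", "next_events": ["invite_teammate", "page_view"], "outcome": 1},
    {"user": "u2", "next_events": ["invite_teammate"], "outcome": 1},
    {"user": "u3", "next_events": ["page_view"], "outcome": 0},
    {"user": "u4", "next_events": ["settings_open"], "outcome": 1},
]
result = compare_paths(neighbors, "invite_teammate")
```

Matching on latent state reduces some differences between the two groups, but confounding can remain, which is why the difference is labeled naive and why inspecting the underlying events with domain knowledge still matters.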

Unlocking event sequence data

Motif’s mission is to help our customers deeply understand their products and their users. In interviews with hundreds of data organizations, we have observed that they collect far richer data than they are able to use effectively. We believe the bottleneck to accessing these insights is not in writing better SQL queries, but in creating a more expressive system for interacting with event sequence data. That means we need to innovate on:

Representations of event sequence data: Transformations which uncover valuable and important structure in the data, capturing the rich variation in behaviors between users and across time. Foundation models trained for specific data sets help us organize and index large data sets.
Affordances for querying event sequence data: the questions people really care about are not well served by SQL (too restrictive) or by writing arbitrary code (too non-specific). Motif’s Sequence Operations Language was developed for this task and has dramatically accelerated our analyses.
Visualizations that help users uncover patterns: we have developed a variety of innovative interactive visualizations that spur curiosity and investigation.

If this sounds like something that would be useful for your organization, we’d love to collaborate with you!
