Introducing SOL: Sequence Operations Language

Sean J. Taylor
August 31, 2023

Shortly after we started Motif last year, my co-founder Misha asked me to think about what core set of sequence operations we should support. Thus began our quest for a small, yet expressive set of operations that allowed our users to do almost anything they could imagine with their event sequence data. We’ve made a lot of progress since then, culminating in a carefully crafted domain-specific language we call SOL: Sequence Operations Language. I’m excited to share more about how we designed SOL — it will help you understand how Motif is attacking the problem of making it joyful and exciting to work with event sequence data we collect from our product instrumentation.

SOL is pronounced “Soul,” as in the title of Tracy Kidder’s 1981 book The Soul of a New Machine.

Before we begin, it’s worth clarifying what we mean by an event sequence. An event is a structured record of some specific thing that occurred, often created and recorded by a logging framework (e.g. a JSON object or Python dictionary). An event must have a timestamp associated with it, the moment it was observed or recorded. Events must also have a unique identifier, like the user associated with the event. Sorting events in order by timestamp, partitioned by these unique identifiers, we end up with a set of event sequences, each capturing everything that’s been observed about one particular user over time.

Partitioning and sorting event data yields a data structure with one user sequence per “row.”
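
To make this concrete, here is a minimal Python sketch of that reshaping. The field names (user_id, ts, name) are illustrative only, not any particular schema:

```python
from collections import defaultdict

# A flat list of event records, e.g. parsed from JSON logs.
# Field names ("user_id", "ts", "name") are illustrative only.
events = [
    {"user_id": "u1", "ts": "2023-08-01T10:00:00", "name": "signup"},
    {"user_id": "u2", "ts": "2023-08-01T10:05:00", "name": "signup"},
    {"user_id": "u1", "ts": "2023-08-01T10:02:00", "name": "purchase"},
]

# Partition events by user, then sort each partition by timestamp.
sequences = defaultdict(list)
for event in events:
    sequences[event["user_id"]].append(event)
for user_id in sequences:
    sequences[user_id].sort(key=lambda e: e["ts"])

# Each value is now one event sequence: everything observed about one user, in order.
print(sequences["u1"])
```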

This is a very common shape for the data companies collect in practice, and it can easily be stored in tables (with events represented as rows). But answering even simple sequential questions in SQL is a challenge that analysts struggle with — there is an obvious mismatch between the language and the transformations you may want to perform on event sequence data. Complicated joins, window functions, and UDTFs can work in some cases, but it felt like they were not really intended for the tasks we had. So we took a step back and thought about what we were trying to do from first principles.
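
To illustrate the mismatch, here is a small sketch of how even a simple sequential question (“did a purchase ever follow a signup?”) ends up looking in a row-oriented tool. Pandas stands in for SQL here, and the table and field names are made up for the example:

```python
import pandas as pd

# Hypothetical event table; pandas stands in for a SQL engine here.
df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "ts": pd.to_datetime(["2023-08-01 10:00", "2023-08-01 10:02", "2023-08-01 10:05"]),
    "name": ["signup", "purchase", "signup"],
})

def signup_then_purchase(group):
    """Did a purchase ever occur after a signup? Requires an ordered scan per user."""
    seen_signup = False
    for name in group.sort_values("ts")["name"]:
        if name == "signup":
            seen_signup = True
        elif name == "purchase" and seen_signup:
            return True
    return False

# Even this simple sequential question needs a hand-written per-user function.
print(df.groupby("user_id").apply(signup_then_purchase))
```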

I’ve been a huge fan of Hadley Wickham’s dplyr (which he calls a “grammar of data manipulation”) since it was released almost ten years ago. The beauty of dplyr is that it “unbundles” SQL into a small set of operations that share a similar interface, and can be composed into a wide variety of procedures for processing tabular data. The dplyr operations are applied to a table in an order that makes intuitive sense to the analyst, and after each operation the result is an intermediate table that can be inspected for correctness. It’s an elegant way to specify a procedure for transforming data from its raw format into something suitable for visualization and analysis, and it totally changed the way I thought about working with data by giving me better language to describe what I am doing. We decided early on that we should emulate the best parts of this design:

  • A small and orthogonal set of operations that were expressive in our domain.
  • An intuitive way to compose these operations (chaining) to construct useful transformations of the data.
  • A syntax similar to existing tools (SQL and Python) that users would find familiar.

In addition, we aligned on an important new affordance: all SOL queries are automatically linked to intuitive visualizations in Motif so changes to SOL queries always result in visible changes in our user interface. This makes it easy to check that your query is doing what you intend and to evaluate how modifications affect your results.

Prior Art

It’s helpful to draw an analogy to some existing data processing problems: dplyr operations that operate on tables and unix commands that operate on streams.

Comparison of different composable systems for processing streams of items.

You can see in our analogy that event sequences are comparable to a line in a file or a row in a table. They are not fixed-length so they are more like lines of characters than rows (which are tuples and have fixed length in the relational model). But they are more like rows than lines in that they contain structured information.

At this point one may ask (perhaps again): why not just use SQL which can naturally operate on events stored as rows? The fundamental issue is that SQL is not nearly as useful for expressing operations across rows as it is for operations across columns. There are three reasonable approaches to working “across” rows in SQL and each has drawbacks in terms of verbosity, performance, and expressivity:

Different strategies for operating “across rows” in SQL. Sequences can be projected into columns, referenced as windows, or as arrays of primitive types.
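
As a hedged illustration of the first strategy, here is what projecting sequences into columns might look like in pandas (again standing in for SQL, reusing the hypothetical table from above). Note what is lost: ordering beyond “first,” repeated events, and anything that does not fit a predeclared column.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "ts": pd.to_datetime(["2023-08-01 10:00", "2023-08-01 10:02", "2023-08-01 10:05"]),
    "name": ["signup", "purchase", "signup"],
})

# Project each user's sequence into fixed columns: the first occurrence time of
# each event type. Ordering beyond "first" and repeated events are discarded,
# and every event type of interest must be declared up front.
wide = df.groupby(["user_id", "name"])["ts"].min().unstack()
print(wide)  # one row per user, one column per event type
```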

I’d also add MATCH_RECOGNIZE to the list of SQL-based approaches here — we were inspired by how powerful it is but found barely anybody uses it because it is challenging to learn, has limited support, and is slow to execute.

The SOL Runtime

SOL is made up of a handful of operations which “map” over the set of event sequences in parallel. At present, the user does not typically need to specify reduce operations; we perform data aggregation in our backend, customized for the visualizations and analyses we provide. Aggregation of sequence data is a complex topic worthy of a future blog post.

A SOL program is an ordered list of operations, each usually consuming one sequence and outputting zero or more sequences. SOL syntax is implicitly chained, with the first operation always operating on the input sequences (provided by the user, which are partitioned and sorted) and the last operation always passing its output to the Motif user interface.
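
A minimal sketch of that execution model, using our own names and data structures rather than Motif’s actual implementation, might look like this: each operation maps one sequence to zero or more sequences, and a program is an ordered list of operations applied in turn.

```python
from typing import Callable, Iterable

Event = dict
EventSequence = list          # one user's events, sorted by timestamp
Operation = Callable[[EventSequence], Iterable[EventSequence]]

def run_program(operations: list[Operation], sequences: Iterable[EventSequence]) -> list[EventSequence]:
    """Apply each operation to every sequence in order, collecting the outputs."""
    current = list(sequences)
    for op in operations:
        next_round = []
        for seq in current:
            # Each operation consumes one sequence and emits zero or more sequences.
            next_round.extend(op(seq))
        current = next_round
    return current

# Example operation: keep only sequences that contain a "purchase" event.
keep_purchasers = lambda seq: [seq] if any(e.get("name") == "purchase" for e in seq) else []
```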

Core SOL Operations

Match

The most fundamental challenge in working with event sequences is creating references to events that you’d like to use in subsequent operations. For this purpose, we designed the match operation, which is inspired by regular expressions for finding patterns in strings, and named groups, which make it easy to reference parts of a pattern you’ve matched. We also borrowed syntax ideas from the Cypher language for querying graphs in neo4j.

The match operation in SOL is a generalization of regular expressions (and is similarly implemented as a state machine): it includes the same quantifiers, but you can use arbitrary predicates about individual events (and groups of events) to define a pattern. The result of running a match operation is that the sequence gains labels referencing what we call “fragments” (our term for a referenced array of events), for use in subsequent operations.

Matching an “a” event followed by a “b” event and naming them A and B.

Though not shown here, match also has an if-clause, which adds constraints as boolean expressions, so you can specify that matches require events to have specific joint properties or relative timing (e.g. B must be within 10 minutes of A).
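
To make the semantics concrete, here is a rough Python sketch (our own illustration, not SOL syntax or Motif’s implementation) of matching an “a” event followed by a “b” event, labeling them A and B, with an optional timing constraint in the spirit of the if-clause:

```python
from datetime import datetime, timedelta

def match_a_then_b(seq, max_gap=timedelta(minutes=10)):
    """Label the first "a" event followed by a "b" event as fragments A and B.

    Returns a dict mapping fragment labels to lists of events, or None if there
    is no match. The max_gap check mimics an if-clause on relative timing.
    """
    a_event = None
    for event in seq:
        if a_event is None:
            if event["name"] == "a":
                a_event = event
        elif event["name"] == "b":
            gap = datetime.fromisoformat(event["ts"]) - datetime.fromisoformat(a_event["ts"])
            if gap <= max_gap:
                return {"A": [a_event], "B": [event]}
    return None

seq = [
    {"name": "a", "ts": "2023-08-01T10:00:00"},
    {"name": "c", "ts": "2023-08-01T10:03:00"},
    {"name": "b", "ts": "2023-08-01T10:05:00"},
]
print(match_a_then_b(seq))  # fragments A and B, each a list with one event
```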

For many data analysts, match is the most unusual operation to learn in SOL. Regular expressions are powerful but can be difficult to learn and debug, and match inherits some of these challenges; we find, though, that example recipes and instant feedback from our development environment (as in regex testers) get users up to speed quickly. And once users understand how to match, they can specify any event pattern with very little code!

Match Split + Combine

When a pattern matches a sequence in multiple locations, SOL provides match split, which finds all occurrences of the specified pattern and splits the sequence at the start of each match. This operation disaggregates sequences into subsequences, allowing for very flexible sessionization that goes well beyond simple inactivity conditions. Splitting can be done multiple times, creating a nested hierarchy of subsequences.

Splitting a sequence before every “a” event, which is labeled A.

A common pattern in SOL is to split sequences into parts, perform some operation, and then merge them back together: a variation of the split-apply-combine pattern that provides iteration without for-loops. The combine operation does this while allowing the user to specify aggregates that should be propagated upward to the original level of granularity. Iterating between match split and combine lets the user seamlessly move between different units of analysis in SOL and avoid explicit iteration.
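
As a rough sketch of the idea in Python (illustrative only, not SOL syntax): split a sequence before every “a” event, compute an aggregate for each subsequence, and carry those aggregates back when the subsequences are combined.

```python
def split_before(seq, name="a"):
    """Split a sequence into subsequences, starting a new one before each `name` event."""
    subsequences, current = [], []
    for event in seq:
        if event["name"] == name and current:
            subsequences.append(current)
            current = []
        current.append(event)
    if current:
        subsequences.append(current)
    return subsequences

def combine(subsequences, aggregate):
    """Recombine subsequences and compute an aggregate for each one."""
    combined = [event for sub in subsequences for event in sub]
    return combined, [aggregate(sub) for sub in subsequences]

seq = [{"name": n} for n in ["a", "x", "a", "y", "z"]]
subsequences = split_before(seq)            # [[a, x], [a, y, z]]
full, counts = combine(subsequences, len)   # counts per subsequence: [2, 3]
```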

Splitting and un-splitting using the combine command.

Replace

We recognized early on that we could learn a lot from how string manipulation works in unix tools (like grep) and scripting languages (like Perl and Python). One of the most powerful operations they provide is the ability to replace part of a string with a different string (or an empty string for deletion!). This operation allows you to clean up event sequences by removing unnecessary parts and replacing many events with a single event to simplify sequences.

Replacing an event labeled B with a new event “d” which is labeled D.

Replace is more general than you may expect, and can be used for prepending, appending, inserting, and removing events, plus completely rewriting the sequence. It also provides lightweight syntax for event literals that are added to the sequence, including using expressions and any previously labeled events to compute new dimensions.
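
A rough Python sketch of the idea (the function and field names are ours, not SOL’s): replace a previously labeled fragment with a single new event whose dimensions can be computed from the events it replaces.

```python
def replace_fragment(seq, fragment, make_event):
    """Replace the events in `fragment` with one new event built by `make_event`.

    If `make_event` returns None, the fragment is simply deleted.
    """
    new_seq, replaced = [], False
    for event in seq:
        if event in fragment:
            if not replaced:
                new_event = make_event(fragment)
                if new_event is not None:
                    new_seq.append(new_event)
                replaced = True
            continue  # drop the original events in the fragment
        new_seq.append(event)
    return new_seq

seq = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
fragment_b = [seq[1]]  # e.g. the fragment labeled B by a previous match
# Replace B with a new "d" event whose dimension is computed from the replaced events.
print(replace_fragment(seq, fragment_b, lambda frag: {"name": "d", "replaced": len(frag)}))
```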

Set and filter

In addition to more sequence-focused functionality, SOL provides more standard analytical operations. The set operation allows users to construct expressions and assign new dimensions to events, fragments, or the entire sequence. This mirrors dplyr’s mutate operation or assign in Pandas, but with some flexibility to broadcast assignment to different parts of the sequence.

We use fragments created by match to assign a new dimension to all the events in that fragment.

The filter operation functions analogously to filter in dplyr (and to query in Pandas). The user can either keep or discard sequences that meet specified conditions, and the Motif user interface reports how many sequences were kept. Very commonly, filter is applied directly after match in order to retain only sequences where a match was found.
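
Here is a rough Python analogue of the two operations (illustrative only): set broadcasts a computed dimension onto every event in a fragment, and filter keeps the sequences that satisfy a condition, such as having matched at all.

```python
def set_dimension(fragment, name, value_fn):
    """Assign a new dimension to every event in a fragment (broadcast assignment)."""
    for event in fragment:
        event[name] = value_fn(event)
    return fragment

def filter_sequences(sequences, keep):
    """Keep only the sequences for which `keep` returns True."""
    kept = [seq for seq in sequences if keep(seq)]
    print(f"kept {len(kept)} of {len(sequences)} sequences")  # Motif reports this count in its UI
    return kept

sequences = [
    [{"name": "a"}, {"name": "b"}],
    [{"name": "c"}],
]
# Keep only sequences where an "a" event occurs, mirroring filter applied after match.
matched = filter_sequences(sequences, lambda s: any(e["name"] == "a" for e in s))
# Broadcast a new dimension onto the events of the first matched sequence.
set_dimension(matched[0], "converted", lambda e: True)
```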

Right now SOL is very compact, and we have omitted only one operation in this post (top, which finds the most extreme sequences by an expression the user defines). We may add other operations to SOL over time, much as an entire ecosystem of packages has been folded into the tidyverse to augment dplyr’s core capabilities. As we encounter new problems and challenges, we’re open to growing SOL by adding simple, self-contained operations that compose with the core operations and extend its capabilities in useful ways.

Let our powers combine!

Our main design goal for SOL was expressive power — you should be able to accomplish almost any sequence analysis task you can think of with SOL with just a small number of operations. To stress test this, we started a fun team activity called “SOL Question of the Week” where we take turns trying to come up with questions that might be difficult or impossible to answer in SOL. We’ve been doing this for almost a year now and haven’t found too many gaps left to plug. SOL is a highly useful and usable addition to the analyst toolkit and it may just change the way you think about how to answer your next product question.

The bar for a new analytics language is very high, so we’ve been very careful to design one we’re proud of that addresses real gaps in existing tools. We think you’ll find that SOL is worth learning and that it streamlines your work by providing a faster path to specifying what you’re trying to achieve. The most interesting and impactful questions are often most natural to answer with sequence analytics, and SOL makes them straightforward to express. The equivalent SQL for some SOL queries is difficult to write correctly (even for LLMs) and even harder to read and debug. And SOL pairs well with SQL because it operates directly on columnar data that is easily queryable in popular databases like DuckDB. We hope you give it a try by signing up to test Motif!