Sequences: they’re all around us. More specifically, they’re in your data warehouse with timestamps, payloads, and mysterious columns from Segment. Many of the real, course-changing insights that data teams dream are hidden deep inside these elusive event streams. This post will help you find them, using your favorite neighborhood query language.
For the purposes of this journey, imagine you’re a Data Scientist at Netflix. You and your team want to better understand what your watch funnel looks like – what does a successful session entail? – as well as understand how users interact with important features like search.
Getting started: event prep work
Before we can dig into what’s going on in our sequence data, we’ll need to clean up and organize a bit. Most event data I’ve worked with needs a good deal of prep to get it ready for analysis: timestamps are funky, data is missing, some events mean the same thing, etc. We’ll walk through a few common prep steps and sample queries to get them done.
Grouping events that mean the same thing together
In our dataset, we have 4 different events that relate to interacting with the search feature:
- search_bar_clicked – the user clicked on the search icon
- search_bar_typed – the user typed something into the search bar
- search_results_page – the user submitted what they typed in the bar and got some results
- search_result_clicked – the user clicked on a result in the search results
If this project was all about search and understanding how users engage with it, each one of these events would carry individual meaning. But for our purposes, it’s more than enough to just know that the user engaged with the search feature; we don’t need to know every detail about the search experience. So it makes sense to collapse any search related event into a single event type. Here’s how to do that in SQL:
This simple statement filters the dataset for search related events, and renames them all the same thing. We’ll string it together with subsequent cleaning ops in a chain of CTEs. You could also do it inline with a CASE WHEN:
Double checking how long events have existed for
Given how iteratively most companies implement event tracking, it’s likely that some events started being captured earlier or later than others. Before diving into analysis, it’s important to get a sense of how far back tracking reliability goes for what you think the important events in your sequence might be. Grouping events by their minimum and maximum timestamp is a good place to start:
In our case, it looks like the favorites_add event has a min_timestamp of only a few weeks ago. After some back and forth with the product team, it turns out that the event tracking library was under maintenance when they built the “add to favorites” feature, so tracking for that event was only added in a recent sprint. Bummer, but either way, we now know not to include this event in our analysis. This is a much better situation than spending 4-5 hours wondering why conversion to favorites is so low in your funnel, only to realize that the events are missing.
Renaming events for the purposes of this analysis
Events get named with many different stakeholders in mind, and sometimes those names are less than ideal for the analysis you’re trying to do. In our data set, there are some events with names like user_watch_time_start that can be simplified to watch_start with the same logic as above:
Defining a session window
Perhaps the most important part of prep! To analyze a set of events or a funnel, there needs to be some sort of definition of when it “starts” and when it “stops.” For many use cases – like e-commerce, or a web app with clear “presence” like Netflix – this is pretty straightforward. But for others, like when I was working at DigitalOcean and analyzing how people spun up and shut down servers, session definition can require some creativity.
Sessions are most simply defined as any sequence of events with at least some time frame of inactivity as the “end,” usually an hour or so. It’s not a perfect definition, but it probably means that your user at least stepped away from the computer.
The goal is to end up with some sort of session_id or session_number column (or both) attached to each event. If you’re working as part of a data team, there’s a good chance you already have this done for you in the data warehouse. In the past I’ve worked off of this sample SQL snippet from dbt made for Segment events, tweaked for my use case.
It’s important to note that this sample snippet is helpful, but just a start. Sessionizing your events is not a trivial task, and can require double digit hours of work and follow up tweaks as your team grows.
Creating an intermediate event dataset for analysis
For funnel analysis, most of the time you can ignore the order that events happened in: later steps (like added to cart) naturally cannot happen before earlier steps (visited website). But for more general analysis, you need your events to be in a roughly standardized format with order attached. We need to get to something like this:
Each one of these columns will be important for queries we’ll want to write later:
- num_session – starting at 1, which number session this is for the user. For analyzing how users onboard, we might filter this for 1, or for analyzing post onboarding activity, >1.
- num_event – starting at 1, which number event this is in the current session for the user.
- prev_event – the event the user did directly before this one
- next_event – the event the user did directly after this one
- payload – any data associated with the event (e.g. if it’s movie_watch, the name of the movie)
I also will sometimes add things like total number of events in the session, whether the session ended with a conversion event or not, etc. I should mention that this is the schema I like when I’m analyzing sequences, but you may find some of the columns superfluous, or think I’m missing some. Everyone’s data and project is different!
Before getting started with some sample queries, you’re going to need to make a decision about how you want to build this intermediate data set. I’ve used all of these methods for different projects, it really just depends on your constraints and how long you intend this data set to live for.
For this post, I’ll write each SQL query as a standalone statement for simplicity’s sake. Let’s walk through how to derive the data set we need using some basic (and a little intermediate) window function action.
Calculating event order with ROW_NUMBER
The ROW_NUMBER function tells you what order a given row is in a given window given an ordering criteria. In English, this will help us know which event the current event is in a given session. Because we want the event number to reset every time there’s a new session, we’ll partition by session_num and order by timestamp.
Note that for window functions like ROW_NUMBER, it’s absolutely critical that your event timestamps are in good working order. If there’s a slight delay on when the event gets emitted, or two events are given the same timestamp when they really happened one after another, this SQL logic will not work. Which is also why we’re using ROW_NUMBER here instead of RANK or DENSE_RANK, because we’re assuming that there will be 0 events with identical timestamps for a given user.
Calculating previous and next events
This one is easy:
Depending on what you’re trying to do, you can also increase the LEAD or LAG distance to more than one and get two events ago, 3 events in the future, etc.
Another thing I’ll sometimes use these functions for is calculating time distance between events. This is necessary for session logic (you need to break up a session if a previous event is more than N minutes in the past), but also useful for looking at which events take a long time to get to (and thus perhaps indicate areas of improvement for your product).
First and last event in the sequence
Another common question in analysis is “what event did this session start with?” In our data set, it’s going to be useful to know if a watch session started with search, as opposed to starting with the favorites tab or a different source. And for that we can use the lesser known FIRST_VALUE and LAST_VALUE functions. They grab the first event in the given partition by your order clause.
By now you’ve probably realized that this post could have been titled “an intermediate guide to window functions” because analyzing sequences in SQL is basically window function con.
Analyzing our nice, clean dataset
By now, you’ve already done the hard work. Now it’s time to ask questions of our pristine, ordered, information-full event sequence data set. Here are some common ones that I’ve done in the past.
Building a basic funnel
By far the most common use case for event data like this is building a funnel for your business. Our data is product related (i.e. there are no marketing-specific events like CTA clicked) so we can design a (very) simple one:
Our team is mostly concerned with how many users accomplished each of these funnel stages over the course of a discrete time frame (in the future, compared to a previous time frame). Here’s a basic query for it:
And indeed our results do look like a funnel:
Most startups I’ve worked at use a simple query barely more complex than this one to power their dashboard funnel. But the astute reader will note several limitations of this overly simplistic funnel design: it’s missing events users can take, finishing a movie isn’t the only desirable user path, events can happen out of order, etc. Building funnels can get arbitrarily complex; it’s a topic that deserves its own blog post.
How many users took a specific path?
Our data team is curious about how many watch sessions start with visits to the home page (and not search or the favorites tab). Expressed in a sequence, that’s:
The key is that these events happened after one another, it doesn’t matter where in the sequence they did (e.g. if the home_page event’s event_num is 1 or 12).There is theoretically a way to do this with our existing intermediate dataset but it’s going to be much easier to update the data set and add more LAG functions. We’re looking for a 4 event sequence, and we already have one LAG, so we just need to add 3 more:
We can then write a simple query to see how many users took a specific path:
Astute readers will notice that this method is crude, because it doesn’t account for any deviations from the prescribed path. A user might be perusing the home page, click on a movie, and then go back to the home page before finding the movie they actually want to watch. This lack of flexibility in expression can make SQL a difficult language to use for sequence analysis.
How many users did X events at any point in their path?
If order doesn’t matter (in contrast to the above query), this one is pretty simple: we can do a distinct count of users who have at least one event. Our team is curious to know how many users actually end up finishing a full trailer:
If we want to filter for users who have done multiple events in a sequence, but order doesn’t matter, you can do something like this:
The first CTE aggregates all of the users who have at least one home_page event and one home_page_click event, and the second one counts how many there are. Note that we need to use MAX here instead of summing the two columns, otherwise you could have a user with two home page visits and no home page clicks who makes it through the filter.
How many users did NOT do a specific event in their path?
A hacky way to do this is to get the total number of users who did the specific event, and subtract it from the total number of users in the data set. If we wanted to aggregate the number of users in our data set who have never watched a trailer:
This is a little trick I picked up a few years ago to avoid needing a filter or multiple queries to get a user count based on row data (I’m using DuckDB and there’s no COUNT_IF function). You set up a CASE WHEN statement that returns the user_id if true, and NULL if false.
By the way, in many cases churn is actually a special case of this query, where no event has happened in that past X timeframe.
Which events tend to be followed by other events?
This one requires a bit of string manipulation. What we’ll do is create string event pairs using the event_name and prev_event columns, and group and count them. Here’s a simple version:
Here’s a version that removes identical event pairs, and events with no previous event. The watch_heartbeat event gets sent every few seconds while a user is watching something, so we can remove that too.
Interesting…it seems like the home_page to home_page_click is the most popular event pair in our data set.
How much time is there between different event pairs?
Here we’ll build off of the previous query, but instead of counting occurrences of the pairs, we’ll measure the time between them. I’ve added median, average, max, and min functions, and included the filters from the last section.
A final disclaimer: this was a bit too easy
Lest the reader get the sense from this post that sequence analytics in SQL is a straightforward, simple task, this section is meant to disabuse you of that notion. The sample queries – and situations – in this post are great simplifications of what you’ll most likely encounter in the wild. Here are a few examples:
1. Sessionization is a major pain point
Sessionizing your events is a lot more complex than copying a few lines of SQL, and will require continuous tweaking as you learn more about your downstream needs.
2. Consolidating events requires care and iteration
We renamed all search events to “search” which roughly works for this analysis, but is likely too simplistic for more complex ones. Often, the dimensions you need will be spread across different types of these events and you’ll need to figure out a way to get what you need.
3. Single queries are easy, but bigger ones are hard
Each one of the sample queries in the last section is simple enough on their own, but where things get hairy is when you need to combine and iterate on them. As with our 4-lag example, you can see that some analysis requires you to “destroy” the data set in a sense, and having to do this over and over again is hard.
4. Chaining larger queries is computationally intensive
Once you gather these queries together and they get bigger, they will start to take a long time to run and get expensive.In other words, take caution when approaching this branch of analytics – it’s not something that’s as simple as “I learned how to do it in a blog post.” Having said that, I do hope this was helpful 😉
Other questions we missed? Let us know at firstname.lastname@example.org.