A Beginner's Guide to Sequence Analytics in SQL
Written by Misha Panko
Sequences: they’re all around us. More specifically, they’re in your data warehouse with timestamps, payloads, and mysterious columns from Segment. Many of the real, course-changing insights that data teams dream of are hidden deep inside these elusive event streams. This post will help you find them, using your favorite neighborhood query language.
For the purposes of this journey, imagine you’re a Data Scientist at Netflix. You and your team want to better understand what your watch funnel looks like – what does a successful session entail? – as well as understand how users interact with important features like search.
Getting started: event prep work
Before we can dig into what’s going on in our sequence data, we’ll need to clean up and organize a bit. Most event data I’ve worked with needs a good deal of prep to get it ready for analysis: timestamps are funky, data is missing, some events mean the same thing, etc. We’ll walk through a few common prep steps and sample queries to get them done.
Grouping events that mean the same thing together
In our dataset, we have 4 different events that relate to interacting with the search feature:
`search_bar_clicked` – the user clicked on the search icon
`search_bar_typed` – the user typed something into the search bar
`search_results_page` – the user submitted what they typed in the bar and got some results
`search_result_clicked` – the user clicked on a result in the search results
If this project were all about search and understanding how users engage with it, each one of these events would carry individual meaning. But for our purposes, it’s more than enough to just know that the user engaged with the search feature; we don’t need to know every detail about the search experience. So it makes sense to collapse any search-related event into a single event type. Here’s how to do that in SQL:
```sql
SELECT
    user_id,
    'search' AS event_name,
    timestamp,
    movies_number, browser, app, country, language, position, id, title,
    category, rating, duration, video_quality, volume, subtitles, search_results
FROM "netflix.csv"
WHERE event_name IN (
    'search_bar_clicked',
    'search_bar_typed',
    'search_results_page',
    'search_result_clicked'
)
```
This simple statement filters the dataset for search-related events, and renames them all the same thing. We’ll string it together with subsequent cleaning ops in a chain of CTEs. You could also do it inline with a `CASE WHEN` statement:

```sql
SELECT
    user_id,
    CASE
        WHEN event_name IN (
            'search_bar_clicked',
            'search_bar_typed',
            'search_results_page',
            'search_result_clicked'
        ) THEN 'search'
        ELSE event_name
    END AS event_name,
    timestamp,
    movies_number, browser, app, country, language, position, id, title,
    category, rating, duration, video_quality, volume, subtitles, search_results
FROM event_data
```
Double checking how long events have existed for
Given how iteratively most companies implement event tracking, it’s likely that some events started being captured earlier or later than others. Before diving into analysis, it’s important to get a sense of how far back tracking reliability goes for what you think the important events in your sequence might be. Grouping events by their minimum and maximum timestamp is a good place to start:
```sql
SELECT
    event_name,
    MIN(timestamp) AS min_timestamp,
    MAX(timestamp) AS max_timestamp
FROM event_data
GROUP BY 1
ORDER BY 2 DESC
```
In our case, it looks like the `favorites_add` event has a `min_timestamp` of only a few weeks ago. After some back and forth with the product team, it turns out that the event tracking library was under maintenance when they built the “add to favorites” feature, so tracking for that event was only added in a recent sprint. Bummer, but either way, we now know not to include this event in our analysis. This is a much better situation than spending 4-5 hours wondering why conversion to favorites is so low in your funnel, only to realize that the events are missing.
Renaming events for the purposes of this analysis
Events get named with many different stakeholders in mind, and sometimes those names are less than ideal for the analysis you’re trying to do. In our data set, there are some events with names like `user_watch_time_start` that can be simplified to `watch_start` with the same logic as above:
```sql
SELECT
    user_id,
    CASE
        WHEN event_name = 'user_watch_time_start' THEN 'watch_start'
        ELSE event_name
    END AS event_name,
    ...
FROM event_data
```
Defining a session window
Perhaps the most important part of prep! To analyze a set of events or a funnel, there needs to be some sort of definition of when it “starts” and when it “stops.” For many use cases – like e-commerce, or a web app with clear “presence” like Netflix – this is pretty straightforward. But for others, like when I was working at DigitalOcean and analyzing how people spun up and shut down servers, session definition can require some creativity.
Sessions are most simply defined as any sequence of events with at least some time frame of inactivity as the “end,” usually an hour or so. It’s not a perfect definition, but it probably means that your user at least stepped away from the computer.
The goal is to end up with some sort of `session_id` or `session_number` column (or both) attached to each event. If you’re working as part of a data team, there’s a good chance you already have this done for you in the data warehouse. In the past I’ve worked off of this sample SQL snippet from dbt made for Segment events, tweaked for my use case.
It’s important to note that this sample snippet is helpful, but just a start. Sessionizing your events is not a trivial task, and can require double digit hours of work and follow up tweaks as your team grows.
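To make the inactivity-gap idea concrete, here’s a minimal sessionization sketch. It uses Python’s built-in sqlite3 (any SQL engine with window functions works the same way); the 30-minute threshold, table name, and toy rows are all assumptions for the illustration, not the dbt snippet itself. The trick: flag every event that starts a new session, then take a running sum of the flags.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_data (user_id TEXT, event_name TEXT, ts INTEGER)")
# ts = seconds since epoch (toy values); a gap of more than 1800s (30 min) ends a session
con.executemany("INSERT INTO event_data VALUES (?, ?, ?)", [
    ("u1", "home_page",  0),
    ("u1", "movie_page", 60),
    ("u1", "home_page",  3660),  # over 30 minutes later: new session
    ("u2", "search",     10),
])

# Flag session-starting events, then running-sum the flags per user
sessionized = con.execute("""
    WITH flagged AS (
        SELECT user_id, event_name, ts,
               CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800
                      OR LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL
                    THEN 1 ELSE 0 END AS new_session
        FROM event_data
    )
    SELECT user_id, event_name, ts,
           SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_number
    FROM flagged
    ORDER BY user_id, ts
""").fetchall()

for row in sessionized:
    print(row)
```

The running `SUM` works because the window frame defaults to “start of partition through current row,” so each new-session flag bumps the session number for every event after it.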
Creating an intermediate event dataset for analysis
For funnel analysis, most of the time you can ignore the order that events happened in: later steps (like added to cart) naturally cannot happen before earlier steps (visited website). But for more general analysis, you need your events to be in a roughly standardized format with order attached. We need to get to something like this:
Each one of these columns will be important for queries we’ll want to write later:
`num_session` – starting at 1, which number session this is for the user. For analyzing how users onboard, we might filter this for 1, or for analyzing post-onboarding activity, >1.
`num_event` – starting at 1, which number event this is in the current session for the user.
`prev_event` – the event the user did directly before this one
`next_event` – the event the user did directly after this one
`payload` – any data associated with the event (e.g. if it’s `movie_watch`, the name of the movie)
I also will sometimes add things like total number of events in the session, whether the session ended with a conversion event or not, etc. I should mention that this is the schema I like when I’m analyzing sequences, but you may find some of the columns superfluous, or think I’m missing some. Everyone’s data and project is different!
Before getting started with some sample queries, you’re going to need to make a decision about how you want to build this intermediate data set. I’ve used all of these methods for different projects, it really just depends on your constraints and how long you intend this data set to live for.
For this post, I’ll write each SQL query as a standalone statement for simplicity’s sake. Let’s walk through how to derive the data set we need using some basic (and a little intermediate) window function action.
Calculating event order with `ROW_NUMBER`
The `ROW_NUMBER` function tells you where a given row falls within a given window, according to an ordering criteria. In English, this will help us know which number event the current event is in a given session. Because we want the event number to reset every time there’s a new session, we’ll partition by `user_id` and `session_id` and order by `timestamp`:

```sql
SELECT
    user_id,
    event_name,
    timestamp,
    ROW_NUMBER() OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS event_sequence_number
FROM event_data
```
Note that for window functions like `ROW_NUMBER`, it’s absolutely critical that your event timestamps are in good working order. If there’s a slight delay in when an event gets emitted, or two events are given the same timestamp when they really happened one after another, this SQL logic will not work. That’s also why we’re using `ROW_NUMBER` here instead of `DENSE_RANK`: we’re assuming that there will be zero events with identical timestamps for a given user.
Calculating previous and next events
This one is easy:
```sql
SELECT
    user_id,
    event_name,
    timestamp,
    LAG(event_name)  OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_event,
    LEAD(event_name) OVER (PARTITION BY user_id ORDER BY timestamp) AS next_event
FROM event_data
```
Depending on what you’re trying to do, you can also increase the `LEAD` or `LAG` distance to more than one and get two events ago, three events in the future, etc.

```sql
SELECT
    user_id,
    event_name,
    timestamp,
    LAG(event_name, 3)  OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS three_events_prev,
    LEAD(event_name, 2) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS two_events_next
FROM event_data
```
Another thing I’ll sometimes use these functions for is calculating time distance between events. This is necessary for session logic (you need to break up a session if a previous event is more than N minutes in the past), but also useful for looking at which events take a long time to get to (and thus perhaps indicate areas of improvement for your product).
```sql
SELECT
    user_id,
    event_name,
    timestamp,
    timestamp - LAG(timestamp) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS time_since_last_event,
    LEAD(timestamp) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) - timestamp AS time_to_next_event
FROM event_data
```
First and last event in the sequence
Another common question in analysis is “what event did this session start with?” In our data set, it’s going to be useful to know if a watch session started with search, as opposed to starting with the favorites tab or a different source. And for that we can use the lesser known `FIRST_VALUE` and `LAST_VALUE` functions. They grab the first (or last) value in the given partition, according to your order clause. One gotcha: with an `ORDER BY`, the default window frame only extends to the current row, so `LAST_VALUE` needs an explicit frame covering the whole partition to actually return the session’s last event.

```sql
SELECT
    user_id,
    event_name,
    timestamp,
    FIRST_VALUE(event_name) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS first_event,
    LAST_VALUE(event_name) OVER (
        PARTITION BY user_id, session_id
        ORDER BY timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_event
FROM event_data
```
By now you’ve probably realized that this post could have been titled “an intermediate guide to window functions” because analyzing sequences in SQL is basically window function con.
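Putting the pieces together, here’s a minimal end-to-end sketch of the intermediate dataset, using Python’s built-in sqlite3 (the table, columns, and toy events are made up for the example). A named `WINDOW` clause keeps the repeated partition-and-order spec tidy:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_data (user_id TEXT, session_id INTEGER, event_name TEXT, ts INTEGER)")
con.executemany("INSERT INTO event_data VALUES (?, ?, ?, ?)", [
    ("u1", 1, "home_page",   0),
    ("u1", 1, "movie_page",  30),
    ("u1", 1, "watch_start", 90),
])

# One pass over a shared window: event number, previous/next event, session's first event
intermediate = con.execute("""
    SELECT user_id, session_id, event_name,
           ROW_NUMBER() OVER w AS num_event,
           LAG(event_name)  OVER w AS prev_event,
           LEAD(event_name) OVER w AS next_event,
           FIRST_VALUE(event_name) OVER w AS first_event
    FROM event_data
    WINDOW w AS (PARTITION BY user_id, session_id ORDER BY ts)
    ORDER BY user_id, session_id, ts
""").fetchall()

for row in intermediate:
    print(row)
```

Each output row carries the event itself plus its position and neighbors, which is exactly the shape the analysis queries below expect.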
Analyzing our nice, clean dataset
By now, you’ve already done the hard work. Now it’s time to ask questions of our pristine, ordered, information-full event sequence data set. Here are some common ones that I’ve done in the past.
Building a basic funnel
By far the most common use case for event data like this is building a funnel for your business. Our data is product related (i.e. there are no marketing-specific events like CTA clicked) so we can design a (very) simple one:
1. Looked around: `home_page` OR `search`
2. Found a movie: `movie_page`
3. Started a movie: `watch_start`
4. Finished a movie: `watch_end`
Our team is mostly concerned with how many users accomplished each of these funnel stages over the course of a discrete time frame (in the future, compared to a previous time frame). Here’s a basic query for it:
```sql
SELECT
    COUNT(DISTINCT CASE WHEN event_name = 'search' OR event_name = 'home_page' THEN user_id ELSE NULL END) AS count_users_stage_one,
    COUNT(DISTINCT CASE WHEN event_name = 'movie_page' THEN user_id ELSE NULL END) AS count_users_stage_two,
    COUNT(DISTINCT CASE WHEN event_name = 'watch_start' THEN user_id ELSE NULL END) AS count_users_stage_three,
    COUNT(DISTINCT CASE WHEN event_name = 'watch_end' THEN user_id ELSE NULL END) AS count_users_stage_four
FROM cleaned_data
```
And indeed our results do look like a funnel:
Most startups I’ve worked at use a simple query barely more complex than this one to power their dashboard funnel. But the astute reader will note several limitations of this overly simplistic funnel design: it’s missing events users can take, finishing a movie isn’t the only desirable user path, events can happen out of order, etc. Building funnels can get arbitrarily complex; it’s a topic that deserves its own blog post.
How many users took a specific path?
Our data team is curious about how many watch sessions start with visits to the home page (and not search or the favorites tab). Expressed in a sequence, that’s:
`home_page` → `home_page_click` → `movie_page` → `watch_start`
The key is that these events happened directly after one another; it doesn’t matter where in the sequence they occurred (e.g. whether the `home_page` event’s `event_num` is 1 or 12).
There is theoretically a way to do this with our existing intermediate dataset, but it’s going to be much easier to update the data set and add more `LAG` functions. We’re looking for a four-event sequence, and we already have one `LAG`, so we just need to add two more:

```sql
SELECT
    user_id,
    event_name,
    timestamp,
    LAG(event_name)    OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_event,
    LAG(event_name, 2) OVER (PARTITION BY user_id ORDER BY timestamp) AS two_prev_event,
    LAG(event_name, 3) OVER (PARTITION BY user_id ORDER BY timestamp) AS three_prev_event
FROM event_data
```
We can then write a simple query to see how many users took a specific path:
```sql
SELECT COUNT(DISTINCT user_id)
FROM lag_data
WHERE event_name = 'watch_start'
  AND prev_event = 'movie_page'
  AND two_prev_event = 'home_page_click'
  AND three_prev_event = 'home_page'
```
Astute readers will notice that this method is crude, because it doesn’t account for any deviations from the prescribed path. A user might be perusing the home page, click on a movie, and then go back to the home page before finding the movie they actually want to watch. This lack of flexibility in expression can make SQL a difficult language to use for sequence analysis.
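For contrast, outside of SQL this kind of flexible matching is cheap. Here’s a hypothetical Python helper (not from the post) that checks whether a path occurs in order within a session, detours allowed:

```python
def took_path(events, path):
    """True if `path` occurs in order within `events`, allowing detours in between."""
    it = iter(events)
    # `step in it` scans forward, consuming the iterator, so each step must
    # be found strictly after the previous one
    return all(step in it for step in path)

session = ["home_page", "home_page_click", "movie_page", "home_page",
           "home_page_click", "movie_page", "watch_start"]

print(took_path(session, ["home_page", "movie_page", "watch_start"]))  # True: detours allowed
print(took_path(session, ["watch_start", "home_page"]))                # False: wrong order
```

The same “in-order subsequence” check in pure SQL typically requires recursive CTEs or pattern-matching extensions like `MATCH_RECOGNIZE`, which is part of why sequence analysis in SQL gets awkward.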
How many users did X events at any point in their path?
If order doesn’t matter (in contrast to the above query), this one is pretty simple: we can do a distinct count of users who have at least one event. Our team is curious to know how many users actually end up finishing a full trailer:
```sql
SELECT COUNT(DISTINCT user_id)
FROM event_data
WHERE event_name = 'trailer_end'
```
If we want to filter for users who have done multiple events in a sequence, but order doesn’t matter, you can do something like this:

```sql
WITH sequence_data AS (
    SELECT user_id
    FROM event_data
    GROUP BY user_id
    HAVING (
        MAX(CASE WHEN event_name = 'home_page' THEN 1 ELSE 0 END) +
        MAX(CASE WHEN event_name = 'home_page_click' THEN 1 ELSE 0 END)
    ) > 1
)
SELECT COUNT(user_id) FROM sequence_data
```
The first CTE aggregates all of the users who have at least one `home_page` event and one `home_page_click` event, and the second one counts how many there are. Note that we need to use `MAX` here instead of `SUM`; otherwise you could have a user with two home page visits and no home page clicks who makes it through the filter.
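Here’s a tiny runnable check of that `MAX`-vs-`SUM` point, using Python’s built-in sqlite3 with made-up rows: a user with two home page visits and no clicks would slip through a `SUM`-based filter, but not the `MAX`-based one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_data (user_id TEXT, event_name TEXT)")
con.executemany("INSERT INTO event_data VALUES (?, ?)", [
    ("u1", "home_page"), ("u1", "home_page"),        # two visits, zero clicks
    ("u2", "home_page"), ("u2", "home_page_click"),  # did both event types
])

# MAX caps each event type's contribution at 1, so > 1 means "did both"
both = [r[0] for r in con.execute("""
    SELECT user_id
    FROM event_data
    GROUP BY user_id
    HAVING (MAX(CASE WHEN event_name = 'home_page' THEN 1 ELSE 0 END) +
            MAX(CASE WHEN event_name = 'home_page_click' THEN 1 ELSE 0 END)) > 1
""")]
print(both)  # only u2 passes; swapping MAX for SUM would wrongly include u1
```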
How many users did NOT do a specific event in their path?
A hacky way to do this is to get the total number of users who did the specific event, and subtract it from the total number of users in the data set. If we wanted to aggregate the number of users in our data set who have never watched a trailer:
```sql
SELECT
    COUNT(DISTINCT user_id) AS count_users,
    COUNT(DISTINCT CASE WHEN event_name = 'trailer_start' THEN user_id ELSE NULL END) AS count_users_watched_trailer
FROM event_data
```
This is a little trick I picked up a few years ago to avoid needing a filter or multiple queries to get a user count based on row data (I’m using DuckDB and there’s no `COUNT_IF` function). You set up a `CASE WHEN` statement that returns the `user_id` if true, and `NULL` if false.
By the way, in many cases churn is actually a special case of this query, where no event has happened in the past X timeframe.
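As a sketch of that churn framing (sqlite3 again; the cutoff and rows are made up): a user is “churned” if their most recent event is older than the cutoff.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_data (user_id TEXT, event_name TEXT, ts INTEGER)")
con.executemany("INSERT INTO event_data VALUES (?, ?, ?)", [
    ("u1", "watch_start", 100),   # nothing since ts 100
    ("u2", "watch_start", 9000),  # active recently
])

cutoff = 1000  # "churned" = latest event predates this timestamp
churned = [r[0] for r in con.execute(
    "SELECT user_id FROM event_data GROUP BY user_id HAVING MAX(ts) < ?", (cutoff,)
)]
print(churned)
```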
Which events tend to be followed by other events?
This one requires a bit of string manipulation. What we’ll do is create string event pairs using the `prev_event` and `event_name` columns, and group and count them. Here’s a simple version:

```sql
SELECT
    CONCAT(prev_event, '-->', event_name) AS event_pair,
    COUNT(*) AS count,
    COUNT(DISTINCT user_id) AS count_users
FROM event_data
GROUP BY 1
ORDER BY 2 DESC
```
Here’s a version that removes identical event pairs, and events with no previous event. The `watch_heartbeat` event gets sent every few seconds while a user is watching something, so we can remove that too.

```sql
SELECT
    CONCAT(prev_event, '-->', event_name) AS event_pair,
    COUNT(*) AS count,
    COUNT(DISTINCT user_id) AS count_users
FROM cleaned_data
WHERE event_name != prev_event
  AND event_name != 'watch_heartbeat'
  AND prev_event IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
```
Interesting… it seems like `home_page` to `home_page_click` is the most popular event pair in our data set.
How much time is there between different event pairs?
Here we’ll build off of the previous query, but instead of counting occurrences of the pairs, we’ll measure the time between them. I’ve added median, average, max, and min functions, and included the filters from the last section.
```sql
SELECT
    CONCAT(prev_event, '-->', event_name) AS event_pair,
    MEDIAN(DATE_PART('millisecond', time_since_last_event)) AS median_ms_since_last_event,
    AVG(DATE_PART('millisecond', time_since_last_event)) AS avg_ms_since_last_event,
    MAX(DATE_PART('millisecond', time_since_last_event)) AS max_ms_since_last_event,
    MIN(DATE_PART('millisecond', time_since_last_event)) AS min_ms_since_last_event
FROM cleaned_data
WHERE event_name != prev_event
  AND event_name != 'watch_heartbeat'
  AND prev_event IS NOT NULL
GROUP BY 1
```
A final disclaimer: this was a bit too easy
Lest the reader get the sense from this post that sequence analytics in SQL is a straightforward, simple task, this section is meant to disabuse you of that notion. The sample queries – and situations – in this post are great simplifications of what you’ll most likely encounter in the wild. Here are a few examples:
1. Sessionization is a major pain point
Sessionizing your events is a lot more complex than copying a few lines of SQL, and will require continuous tweaking as you learn more about your downstream needs.
2. Consolidating events requires care and iteration
We renamed all search events to “search” which roughly works for this analysis, but is likely too simplistic for more complex ones. Often, the dimensions you need will be spread across different types of these events and you’ll need to figure out a way to get what you need.
3. Single queries are easy, but bigger ones are hard
Each one of the sample queries in the last section is simple enough on its own, but where things get hairy is when you need to combine and iterate on them. As with our multi-`LAG` example, you can see that some analysis requires you to “destroy” the data set in a sense, and having to do this over and over again is hard.
4. Chaining larger queries is computationally intensive
Once you gather these queries together and they get bigger, they will start to take a long time to run and get expensive.
In other words, take caution when approaching this branch of analytics – it’s not something that’s as simple as “I learned how to do it in a blog post.” Having said that, I do hope this was helpful 😉
Other questions we missed? Let us know at email@example.com.