This is part 4 of a 4-part series on some of my most valued SQL ‘hacks’.
Check out the others in the series:
- Part 1: Common Table Expressions
- Part 2: All about those dates
- Part 3: The other JOINs
- Part 4: Window Functions
This is the last one I've planned for, but if you want to see more, drop me a note in the comments!
But, now on to the #content.
Window functions are some of the most beloved functionality for SQL users. They are very powerful, and quickly eliminate the need for nested subqueries and annoying cross joining.
Yet, due to their insanely verbose implementation and unhelpful keywords, many people are afraid to dive into the window function waters. This article will give a brief intro to the workings of window functions before diving into some of my favorite window function use cases.
All of the examples in this notebook were built using BigQuery syntax. You can view the code and results in full here.
Window Functions Explained
"In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row."
The key thing to understand about window functions is that they compute a value for each row of the specified table. This contrasts with aggregation functions which will summarize values across specified groups.
We can see this play out as we look at some fictionalized Legends of the Hidden Temple data. To find each of the team's best event, we can see how using a simple aggregate function, MAX, will return the top score for each Team, but it doesn't tell us which competition was their best.
To do that with just aggregate functions we'd have to join our initial results back to our original table to find for which competition the Red Jaguars had 98 points. Kind of a pain ☠️.
In contrast, by using a window function we can quickly find the best event for each team. Let's break down how that worked.
Anatomy of a Window Function
Window functions are complex at first glance-they have unintuitive keywords, and require a good grasp of your data to use them effectively.
But when you break them down, they can quickly become a go-to tool to speed up and improve your analysis.
Essentially window functions are composed of two parts:
- Define what is returned
- Define your window
Define what's returned ↪
In general, window functions can be grouped into 3 types:
- Navigation functions: Return the value given a specific location criteria (e.g. first_value, lag, lead)
- Numbering functions: Assign a number (e.g. rank, row_number) to each row based on their position in the specified window
- Analytic functions: Perform a calculation on a set of values (e.g. sum, avg, max)
The specific functions available will depend on your database. Use the guides in the References section to see what's available in your database:
Define your window 🪟
In your window function, everything after the OVER keyword represents your window definition.
The OVER clause has the following components :
- PARTITION BY: what makes up each window (e.g. restart every...)
- ORDER BY : how to order the rows of each window
- ROWS BETWEEN X PRECEEDING AND Y FOLLOWING : which rows to use in each window
Window Function Cheat Codes 🕹
1. Ranking Results
Questions like 'What are the top 5 items purchased' are simple, but add 'by region' to that question and your query can explode in complexity.
Being able to rank results over a specified grouping of data unlocks all kinds of new analytical possibilities like finding the second item your users purchased, or comparing usage patterns between a user's first and second visits to your website.
Example: If I wanted to find the top tracks for three popular artists: Drake, T Swift, and Post Malone, I could use the rank function, partitioned by artist, and ordered by streams to calculate the rank of each track.
Then we can easily compare the top 3 tracks for each artist and see that Drake and Post Malone's 3rd highest ranked songs still top Taylor's most streamed track.
2. Calculating Time Between 2 Events
It's common for data to be structured such that each row is an event. That can make it tricky to find the time between events, since it involves finding the time delta between 2 rows, not 2 columns.
But, here's where Window functions come in! Particularly, lag and lead.
Example: I wanted to find the biggest droughts for Spotify artists between their #1 hits. To do that I:
- Found the number one track each day
- Found the previous day the artist was #1 for each time they were #1 using lag.
- Then I found the difference between the current day they were #1 and the most recent day they were #1.
3. "Moving" calculations
When doing any kind of time-series analytics, you're often looking at a series of events happening across a long period of time. This data tends to be 'noisy' since it's full of day-to-day changes that make it difficult to see the forest through the trees.
Moving calculations, like moving averages, are a good way to summarize time-series data to a level that lets you determine some high-level patterns without losing touch with the underlying detail.
Example: In this case, I want to see the daily Spotify streams by day with a 30 day moving average. This will let me quickly see if streams are increasing over time, and how many noticeable outliers are there, amongst other things.
Window Function Cautions & Best Practices
⚠️ Window functions cannot be in the WHERE or HAVING parts of a query, so you'll need to use CTEs or sub-queries to filter based on window function results. See the diagram in Part 3 of this series for a reminder why you can't add window functions to WHERE and HAVING parts of a query.
⚠️ Combining Window and typical Aggregate functions in the same query is just very messy if not downright impossible. Just avoid if it if you can.
💡 If you want to "group" the results of a window function query, use DISTINCT.
⚠️ Always check your results with window functions because if the data isn't formatted as you expect you are much more likely to get erroneous results.