I’ve been using both
GROUP BY & Window functions since I started writing SQL, but never really together in the same query. Not long ago I encountered a Stackoverflow answer that used a window function & Group by in the same query, and more recently I saw
Ram Kedem use it to solve a question from a
set of questions he posted online for practice. I really liked his use of it there because
I initially solved it in a more cumbersome way (see question 5).
Since then I started playing with the two (Windows & GROUP BY) more, so much that I also decided to go ahead and buy a book about it — “T-SQL Window Functions: For data analysis and beyond”. It’s mainly about Window Functions but there’s a small section specifically about the relationship of the two.
This post is about some experimentation I did with both concepts and what it taught me about the order of operations in a SQL query, specifically in the SELECT. We’ll start from a basic use case, move to a little more complex example and end with what seems to me as an unintended use case, but helpful nonetheless (and valid SQL code!).
Alright then, let’s begin!
Similar to many recent posts of mine, we’ll be looking at users and their payments information. Here’s a brief glimpse of our data:
Setting up a SQL connection in R
Since I write my blog posts using R, we need to setup a SQL connection to our SSMS localhost. We can do it with the following, and then moving forward just write SQL syntax in a regular code chunk.
library(odbc) library(DBI) sqlconn <- dbConnect(odbc(), Driver = "SQL Server", Server = "localhost\\SQLEXPRESS", Database = "window_groupby")
SELECT TOP 5 * FROM USERS U LEFT JOIN PAYMENTS P ON P.USER_ID = u.user_id;
Table: Table 1: 5 records
We’ve got users, their country, payment ids, dates and amounts. Great, let’s go ahead and get started.
Our next two examples will focus on the
SELECT statement and the relationship between Window Functions and
GROUP BY there. One question you might have is why can’t I use a window function when the data is still grouped? Let’s look at a quick example when we try to collect the total funds by user, but also aggregate the total amount for all users together as a new column (read as a window function of
SUM() OVER ()).1
Here’s what you’ll probably get:
SELECT u.user_id, country, SUM(amount) AS total_funds, SUM(amount) OVER() FROM USERS U LEFT JOIN PAYMENTS P ON P.USER_ID = u.user_id GROUP BY u.user_id, Country
## Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Column 'PAYMENTS.amount' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. ## <SQL> 'SELECT u.user_id, ## country, ## SUM(amount) AS total_funds, ## SUM(amount) OVER() ## FROM USERS U ## LEFT JOIN PAYMENTS P ON P.USER_ID = u.user_id ## GROUP BY u.user_id, Country'
Ha, that is weird! The error states that the “Payments.amount column is invalid in the select because it’s not in either an aggregate function or
GROUP BY clause.” But we can see it right there in the row above our window function, aggregated as
What’s going on here? Well hopefully in the next two examples you’ll better understand this error, how you can actually do what we’re trying here and how do these two concepts relate to one another when used in the same query.
As previously noted, I encountered the following example in Ram Kedem’s online stream and thought it’ll be a good opening example to explain the execution logic.
Let’s assume you want to aggregate the data in some way, and then take the top 1 row of some group. So in our case, we want to return users receiving the most funds within a given country:
WITH users_sum as ( SELECT u.user_id, u.Country, SUM(amount) AS total_funds, DENSE_RANK() OVER(PARTITION BY COUNTRY ORDER BY SUM(amount) DESC) as rnk FROM USERS U LEFT JOIN PAYMENTS P ON P.USER_ID = u.user_id GROUP BY u.user_id, u.country ) SELECT user_id, Country, total_funds FROM USERS_SUM where rnk = 1
Table: Table 2: 3 records
The relevant part for our discussion (and the purpose of this blog post) is the following line to create the
DENSE_RANK() OVER(PARTITION BY COUNTRY ORDER BY SUM(amount) DESC) as rnk
What’s interesting here is that we use the aggregate value of the group by,
SUM(amount), within our new column that’s a window function.
Now I’m not particularly SQL savvy when it comes to things occurring behind the scenes, but here’s my understanding of it from reading Itzik Ben-Gan’s book2. As a start, let’s remind ourselves the order of operations for a given SQL query:
FROM and others, Window functions are evaluated in the
SELECT statement but separate from the
GROUP BY aggregations. Itzik Ben-Gan in his book
“T-SQL Window Functions: For data analysis and beyond” describes this very well, where we can think of the two — when the
GROUP BY aggregation happens and the evaluation of window functions — occurring in two different contexts:
“Grouped aggregates operate on groups of rows defined by the
GROUP BYclause and return one value per group. Window aggregates operate on windows of rows and return one value for each row in the underlying query”
“A window function is evaluated after aggregation.[…] Because aggregate functions are evaluated before window functions, aggregate functions can be used as input operands to window functions.”
GROUP BY occurs, and then we
SELECT the relevant columns we want in our aggregation, and any window function in the
SELECT is evaluated on the post-aggregated data. In a sense we reference the already-aggregated column
SUM(amount) as a column to sort our ranking window function we’re creating!
This also explains the error we got earlier: At the time of evaluating the Window Function the
amount column doesn’t exist anymore. We can only use columns we grouped by or newly created columns in our aggregation, in this case the
The previous example opens up an opportunity to use our aggregated results within a window function. There we used it as an ordering clause in a window function, but can we use other forms of aggregation within the window function that reference the previous aggregation? For example, aggregate what we just aggregated in the
Let’s take an example Itzik’s also shares in his book that I used a few times at work to better explore this: How would you aggregate the total amount each user received, and the proportion of that amount out of the total funds for all users? And more specifically, can you do this in one query?
Well you can, let’s have a look:
SELECT u.user_id, u.Country, SUM(amount) AS total_funds, SUM(amount) / SUM(SUM(amount)) OVER() * 100.0 AS pct_funds FROM USERS U LEFT JOIN PAYMENTS P ON P.USER_ID = u.user_id GROUP BY u.user_id, u.country ORDER BY pct_funds DESC
Table: Table 3: 7 records
Though this seems like disjointed syntax, it’s valid SQL nonetheless.
Let’s look into our important code line:
SUM(amount) / SUM(SUM(amount)) OVER() * 100.0 AS pct_funds
What’s happening here? The first
SUM(amount) is an aggregation from our initial
GROUP BY clause we have; that’s the total_funds for each user. The second part
SUM(SUM(amount)) OVER() might take a few readings to grasp but is something as follows: The most inner
SUM(amount) references our
GROUP BY aggregation as before; this is then wrapped within an outer
SUM that invokes a window function with the
OVER() command to get the total sum of the window passed in.
Basically we’re dividing each user’s funds that we just aggregated by the total funds of all users, the latter calculated as a window function of summing all users aggregated funds. Lastly, we finalize it by multiplying by 100 to get a percentage value.
Personally, this was mind blowing when I played around with it and realized this worked. Only then did I fully grasp the order of operations for a given query with Window functions and Grouping.
I will say, however, that the above syntax might not be easily understood to someone first encountering it SQL is a declarative language that usually feels intuitive to read; and this usage somewhat contradicts it. Though valid SQL code, and pretty cool nonetheless, maybe breaking it up into several parts (e.g. with CTE) would make it more readable for others if you’re working on something collaboratively or readability is important. I’m also not sure about performance issues here so definitely compare the approaches if that’s crucial for you.
First off, I highly recommend Itzik Ben-Gan’s book. I bought it when I learned what I shared in this blog post in order to learn more advanced SQL. I realized that while I know how to use window functions and construct them, I still have much more to discover. The book is fantastic and addresses a lot of topics ranging from basics of a window function to optimization. What I address in this post - the relationship between
GROUP BY and Window Functions - is a very small section of the book, barely two pages; he covers a lot more.
The second thing I’d like to share is about my discovery of this and why it was exciting for me. I love learning new things (which I imagine most of us do). As to SQL I’ve learned
correlated sub-queries a while back which was cool,
recursive CTE’s, using things such as
CROSS/OUTER APPLY and more. But they’ve all been new concepts altogether, so it was a different kind of excitement compared to what I shared here.
I’ve been using the
GROUP BY clause since I started writing SQL a few years ago and window functions not much after. I became better at both, learned how to maximize the use of them and general tips and tricks. However what I was hopefully able to convey in this post was a different kind of learning: It was taking what seemed as two concepts I continuously used and learned more about how they each operate separately and in relation to one another. Definitely an exciting thing to discover.
Hope you enjoyed it too!
We could use
WITH ROLLUPto get the total here in the bottom row, but for the sake of the tutorial let’s ignore it for now (plus we want it as a column). ↩︎
Do reach out and share insights about this process if you feel that I got it wrong here (or in the post in general). ↩︎
I think an example of this would have made it much more memorable when reading over the docs! ↩︎
- Posted on:
- July 8, 2023
- 10 minute read, 1927 words