Why Window Functions?
There is no simple way in SQL to calculate running aggregate on data e.g. for a monthly sales data, we may need to calculate the cumulative sales.
Let's take a simple table of sales data. How will I get -
- Example#1 - Monthly cumulative sales, expense of the Org
We can join the table with self on two condition i.e. department = department and Month >= Month
Basically, it will add the rows needed e.g. for Jan, no rows and for March, two rows i.e. Feb and Jan. This is what we need for a cumulative sum
SELECT t1.Month, SUM(t2.Sales) AS Cumulative_Sales, SUM(t2.Expense) AS Cumulative_Expense FROM learn.sales_expense t1 INNER JOIN learn.sales_expense t2 ON t1.department = t2.department and t1.Month >= t2.Month GROUP BY t1.Month ORDER BY t1.Month;
- Example#2 - Monthly cumulative sales, expense of each department
I leave this as an exercise.
What Windows functions facilitate
The SQL in its current form lacks elegance and look more of a work-around. This is where Window Functions come into their own, offering a powerful toolset for elegant and expressive SQL capability.
What it gives and How it works
For each row, we can define a window—essentially a set of rows related to it—over which the Window Functions can operate, with these tool
Window Range
Window Ordering
Window Partition
Let's learn each one by one.
Window Range
Using Range, we create a visible window of range to work up-on e.g. in the sample query we needed a range from start of the data to the current row
Following images demonstrate different cases(Not exhaustive).
Images are self-explanatory, for our case we need the 2nd scenario
How to do it with SQL
High level syntax - agg_function or window_function OVER ()
So, the power comes from over()
. When we say over()
, it gives us the window for each row,
SUM(Sales) OVER () AS Cumulative_Sales
Why the above query will work
The important question is, how the range is decided out of the 3 scenarios we discussed and many other possible scenarios.
This is achieved by the optional frame_clause
that can be one of [Check pgsql docs]
UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING
So, when we left it blank, it fall back to the default value. In PostgreSQL, when using window functions, if the frame_clause
is not specified, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
for the OVER()
clause.
Window Ordering
This must be obvious by now to define some sense of order before blocking the windows range otherwise we may not achieve the desire result e.g. in our example, the ordering on month is must before we define window.
How to do it with SQL
High level syntax - agg_function or window_function OVER ( order by column_name)
SUM(Sales) OVER (order by month) AS Cumulative_Sales
Window Partition
Partitioning is key when we’re working with window functions. It splits the data into clear sections or groups, allowing us to apply RANGE
and ORDER
within those boundaries. It’s like when you're organizing a big event and you have different activities in separate areas.
For our data, without partitioning, the RANGE
would extend across the entire set. The image below clearly shows how partitioning organizes the data, making it easier to focus and calculate values for each department on its own.
How to do it with SQL
High level syntax - agg_function or window_function OVER ( partition by column_name_1 order by column_name_2)
SUM(Sales) OVER (partition by department order by month) AS Cumulative_Sales
Analytics functions and multiple over() clause
In addition to the core window function tools we've discussed, SQL provides a variety of auxiliary features to streamline our database work
Analytics function - This is like we have a superpower to do more than just group data with aggregate functions e.g.
SUM, AVG
. There are some special functions made just for this that help us identify and organize our data in really useful ways. Here's a quick rundown of some of them in this category:row_number()
: Assigns a unique number to each row.rank()
: Numbers rows with tied ranks skipping subsequent numbers.dense_rank()
: Numbers rows sequentially, no skips for ties.
Below is an image depiction for better understanding.
Check the official doc for an exhaustive list esp. lead, lag [ Windows function ]
Multiple over clauses - We can have multiple over clauses. Few rule to keep in mind [ Official doc ]
7.2.5. Window Function Processing
When multiple window functions are used, all the window functions having syntactically equivalent
PARTITION BY
andORDER BY
clauses in their window definitions are guaranteed to be evaluated in a single pass over the data. Therefore they will see the same sort ordering, even if theORDER BY
does not uniquely determine an ordering. However, no guarantees are made about the evaluation of functions having differentPARTITION BY
orORDER BY
specifications. (In such cases a sort step is typically required between the passes of window function evaluations, and the sort is not guaranteed to preserve ordering of rows that itsORDER BY
sees as equivalent.)
Practical examples
Let’s delve into a few practical examples that incorporate the various aspects of window functions such as PARTITION BY
, ordering with ORDER BY
, the frame specification with RANGE
, and the use of specific ranking functions like ROW_NUMBER()
, RANK()
, and DENSE_RANK()
. These examples will provide a clearer understanding of how these features can be combined to address complex data questions.
Let's look at the 2nd problem that we left as an exercise. Its exactly the first problem and partitioning on department
SELECT department, Month, SUM(Sales) OVER (PARTITION BY department ORDER BY Month) AS Cumulative_Sales, SUM(Expense) OVER (PARTITION BY department ORDER BY Month) AS Cumulative_Expense FROM sales_expense ORDER BY department, Month;
Sequential Month-over-Month Growth Rate by Department
How to think
By department =>
Partition by department
sequential =>
order by month
Growth rate => its
(current_sales - last_sales)*100/last sales
. Now, to get last sales, we can useLAG()
windows function.SELECT department, month, sales, (sales - LAG(sales,1) OVER (PARTITION BY department ORDER BY month) )*100/(LAG(sales,1) OVER (PARTITION BY department ORDER BY month)) AS growth_rate FROM learn.sales_expense;
Year-to-Date (YTD) Sales and Expense Ratio by Department
How to think
By department =>
partition by department
sequential =>
order by month
Range => Default will work since YTD will need
UNBOUNDED PRECEDING AND CURRENT ROW
YTD =>
SUM
till current row. Do for both sales and expenseSELECT department, month, (SUM(sales) OVER (PARTITION BY department ORDER BY month) /SUM(expense) OVER (PARTITION BY department ORDER BY month)) AS ytd_sales_expense_ratio FROM learn.sales_expense;
We may use
nullif
to safeguard againstdivide-by-zero
.Rolling Three-Month Average Sales Across All Departments
Its simple just use the right
RANGE
, no partitioning as its across department
SELECT department, sales, month, AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3_month_avg_sales FROM learn.sales_expense;
- Identify the 2nd best sales month for each department
How to think
By department =>
partition
by departmentsequential =>
order by month
Range => Default will work since we need ranking
RANK => We can use
dense_rank
here. Why ?
SELECT department, month, sales FROM ( SELECT department, month, sales, DENSE_RANK() OVER (PARTITION BY department ORDER BY sales DESC) as rank FROM learn.sales_expense ) as ranked_data WHERE rank = 2;
If asked for 2nd worst case, just change the order by to ASC.
Miscellaneous and left-over
Window Function Processing
If the query contains any window functions, these functions are evaluated after any grouping, aggregation, and HAVING
filtering is performed. That is, if the query uses any aggregates, GROUP BY
, or HAVING
, then the rows seen by the window functions are the group rows instead of the original table rows from FROM
/WHERE
.
Rank vs Dense Rank
This is commonly asked question. Technically, its easy to answer that rank
simply follow the row_number
after tied values while dense_rank
continues on next rank Though the questions remains, what is the practical use case. So, here is one way to think of it,
Whether the questions is what is the company's 2nd best sales month/figure
Or Which is the top 2 performing sales person ?
Assume the tie is at 1st place, for the former, rank
is useful i.e. the 3rd value is required while for the later dense_rank
would be useful i.e. the tied names are sufficient.