Learning SQL Window Functions - The Right Way

·

8 min read

Learning SQL Window Functions - The Right Way

Why Window Functions?

There is no simple way in SQL to calculate running aggregate on data e.g. for a monthly sales data, we may need to calculate the cumulative sales.

Let's take a simple table of sales data. How will I get -

  • Example#1 - Monthly cumulative sales, expense of the Org

We can join the table with self on two condition i.e. department = department and Month >= Month

Basically, it will add the rows needed e.g. for Jan, no rows and for March, two rows i.e. Feb and Jan. This is what we need for a cumulative sum

SELECT t1.Month, SUM(t2.Sales) AS Cumulative_Sales, SUM(t2.Expense) AS Cumulative_Expense
FROM   learn.sales_expense t1
INNER JOIN learn.sales_expense t2 ON t1.department = t2.department and t1.Month >= t2.Month
GROUP BY t1.Month
ORDER BY t1.Month;
  • Example#2 - Monthly cumulative sales, expense of each department

I leave this as an exercise.

What Windows functions facilitate

The SQL in its current form lacks elegance and look more of a work-around. This is where Window Functions come into their own, offering a powerful toolset for elegant and expressive SQL capability.

What it gives and How it works

For each row, we can define a window—essentially a set of rows related to it—over which the Window Functions can operate, with these tool

  1. Window Range

  2. Window Ordering

  3. Window Partition

Let's learn each one by one.

Window Range

Using Range, we create a visible window of range to work up-on e.g. in the sample query we needed a range from start of the data to the current row

Following images demonstrate different cases(Not exhaustive).

Images are self-explanatory, for our case we need the 2nd scenario

How to do it with SQL

High level syntax - agg_function or window_function OVER ()

So, the power comes from over(). When we say over(), it gives us the window for each row,

SUM(Sales) OVER () AS Cumulative_Sales

Why the above query will work

The important question is, how the range is decided out of the 3 scenarios we discussed and many other possible scenarios.

This is achieved by the optional frame_clause that can be one of [Check pgsql docs]

  • UNBOUNDED PRECEDING

  • offset PRECEDING

  • CURRENT ROW

  • offset FOLLOWING

  • UNBOUNDED FOLLOWING

So, when we left it blank, it fall back to the default value. In PostgreSQL, when using window functions, if the frame_clause is not specified, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW for the OVER() clause.

Window Ordering

This must be obvious by now to define some sense of order before blocking the windows range otherwise we may not achieve the desire result e.g. in our example, the ordering on month is must before we define window.

How to do it with SQL

High level syntax - agg_function or window_function OVER ( order by column_name)

SUM(Sales) OVER (order by month) AS Cumulative_Sales

Window Partition

Partitioning is key when we’re working with window functions. It splits the data into clear sections or groups, allowing us to apply RANGE and ORDER within those boundaries. It’s like when you're organizing a big event and you have different activities in separate areas.

For our data, without partitioning, the RANGE would extend across the entire set. The image below clearly shows how partitioning organizes the data, making it easier to focus and calculate values for each department on its own.

How to do it with SQL

High level syntax - agg_function or window_function OVER ( partition by column_name_1 order by column_name_2)

SUM(Sales) OVER (partition by department order by month) AS Cumulative_Sales

Analytics functions and multiple over() clause

In addition to the core window function tools we've discussed, SQL provides a variety of auxiliary features to streamline our database work

  • Analytics function - This is like we have a superpower to do more than just group data with aggregate functions e.g. SUM, AVG. There are some special functions made just for this that help us identify and organize our data in really useful ways. Here's a quick rundown of some of them in this category:

    • row_number(): Assigns a unique number to each row.

    • rank(): Numbers rows with tied ranks skipping subsequent numbers.

    • dense_rank(): Numbers rows sequentially, no skips for ties.

Below is an image depiction for better understanding.

Check the official doc for an exhaustive list esp. lead, lag [ Windows function ]

  • Multiple over clauses - We can have multiple over clauses. Few rule to keep in mind [ Official doc ]

    • 7.2.5. Window Function Processing

      When multiple window functions are used, all the window functions having syntactically equivalent PARTITION BY and ORDER BY clauses in their window definitions are guaranteed to be evaluated in a single pass over the data. Therefore they will see the same sort ordering, even if the ORDER BY does not uniquely determine an ordering. However, no guarantees are made about the evaluation of functions having different PARTITION BY or ORDER BY specifications. (In such cases a sort step is typically required between the passes of window function evaluations, and the sort is not guaranteed to preserve ordering of rows that its ORDER BY sees as equivalent.)

Practical examples

Let’s delve into a few practical examples that incorporate the various aspects of window functions such as PARTITION BY, ordering with ORDER BY, the frame specification with RANGE, and the use of specific ranking functions like ROW_NUMBER(), RANK(), and DENSE_RANK(). These examples will provide a clearer understanding of how these features can be combined to address complex data questions.

  1. Let's look at the 2nd problem that we left as an exercise. Its exactly the first problem and partitioning on department

     SELECT 
       department,
       Month,
       SUM(Sales) OVER (PARTITION BY department ORDER BY Month) AS Cumulative_Sales,
       SUM(Expense) OVER (PARTITION BY department ORDER BY Month) AS Cumulative_Expense
     FROM sales_expense
     ORDER BY department, Month;
    
  2. Sequential Month-over-Month Growth Rate by Department

    How to think

    By department => Partition by department

    sequential => order by month

    Growth rate => its (current_sales - last_sales)*100/last sales. Now, to get last sales, we can use LAG() windows function.

    SELECT department, month, sales,
        (sales - LAG(sales,1) OVER (PARTITION BY department ORDER BY month) )*100/(LAG(sales,1) OVER (PARTITION BY department ORDER BY month)) AS growth_rate
    FROM learn.sales_expense;
    
  3. Year-to-Date (YTD) Sales and Expense Ratio by Department

    How to think

    By department => partition by department

    sequential => order by month

    Range => Default will work since YTD will need UNBOUNDED PRECEDING AND CURRENT ROW

    YTD => SUM till current row. Do for both sales and expense

    SELECT 
        department, month,
        (SUM(sales) OVER (PARTITION BY department ORDER BY month)
         /SUM(expense) OVER (PARTITION BY department ORDER BY month)) AS ytd_sales_expense_ratio
    FROM learn.sales_expense;
    

    We may use nullif to safeguard against divide-by-zero.

  4. Rolling Three-Month Average Sales Across All Departments

Its simple just use the right RANGE, no partitioning as its across department

SELECT 
    department, sales, month,
    AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3_month_avg_sales
FROM learn.sales_expense;
  1. Identify the 2nd best sales month for each department

How to think

By department => partition by department

sequential => order by month

Range => Default will work since we need ranking

RANK => We can use dense_rank here. Why ?

SELECT department, month, sales
FROM ( SELECT department, month, sales,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY sales DESC) as rank
    FROM learn.sales_expense
) as ranked_data
WHERE rank = 2;

If asked for 2nd worst case, just change the order by to ASC.

Miscellaneous and left-over

Window Function Processing

If the query contains any window functions, these functions are evaluated after any grouping, aggregation, and HAVING filtering is performed. That is, if the query uses any aggregates, GROUP BY, or HAVING, then the rows seen by the window functions are the group rows instead of the original table rows from FROM/WHERE.

Rank vs Dense Rank

This is commonly asked question. Technically, its easy to answer that rank simply follow the row_number after tied values while dense_rank continues on next rank Though the questions remains, what is the practical use case. So, here is one way to think of it,

Whether the questions is what is the company's 2nd best sales month/figure Or Which is the top 2 performing sales person ?

Assume the tie is at 1st place, for the former, rank is useful i.e. the 3rd value is required while for the later dense_rank would be useful i.e. the tied names are sufficient.