Window functions in PostgreSQL - Part 1

Window functions in PostgreSQL - Part 1


Hi Readers,

In this post, we’ll look at how to use PostgreSQL’s window functions instead of GROUP BY to handle calculations directly in your SQL queries. Window functions give you more control and can make your code simpler by reducing the need to do extra processing in your server-side code. This is the first part of a two-part series, with more examples coming in the next post.

What is window functions:

Window functions in PostgreSQL allow you to perform calculations across a set of rows related to the current row.

It is useful when you want to analyze data across a partition (subset) of your data or apply ranking, cumulative sums, or moving averages.

In window functions, the keywords OVER and PARTITION BY are essential for defining how the window function operates on rows.

OVER Clause

The OVER clause defines the "window" or the set of rows on which the window function will perform its calculation. It specifies how rows are grouped and ordered for the window function.

  • Without OVER, the function behaves as a normal aggregate function, reducing rows like SUM(), COUNT(), etc.
  • With OVER, you can control the scope of rows on which the function operates, without collapsing the result set.

Syntax

<window_function> OVER (PARTITION BY <expression> [ORDER BY <expression>])

The over clause can have:

  • PARTITION BY (optional): Divides the result set into partitions or groups.
  • ORDER BY (optional): Orders rows within each partition.

PARTITION BY Clause

The PARTITION BY clause, used within the OVER clause, divides the rows into partitions or subsets.

The window function is applied independently within each partition. It's similar to a GROUP BY, but unlike GROUP BY, it doesn't collapse the rows into a single result.

  • When you use PARTITION BY, the window function is reset for each partition.
  • If you don’t specify PARTITION BY, the function considers the entire result set as one partition.

Key Differences Between PARTITION BY and GROUP BY:

  • PARTITION BY (in Window Functions):
    • Keeps individual rows intact, showing each row while still calculating results over partitions.
    • It calculates values within each partition but doesn’t collapse the rows into one summary row.
  • GROUP BY:
    • Groups rows into one for each group, collapsing the dataset.
    • You lose the row-level detail, as only one result per group is returned.

Types of window functions:

Here's a breakdown of the main types of window functions available

  1. Ranking Functions
  2. Aggregate Functions
  3. Value Functions
  4. Frame Clauses

Ranking Functions:

ROW_NUMBER()

Assigns a unique number to each row starting from 1

SELECT ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank, employee_name, salary
FROM employees;

RANK()

This ranks the rows within their partition.

Similar to ROW_NUMBER(), but if there are ties (rows with the same value), it assigns the same rank and leaves a gap for the next rank.

SELECT RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank, employee_name, salary
FROM employees;

DENSE_RANK()

Similar to RANK(), but no gaps in the ranking sequence if there are ties.

SELECT DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank, employee_name, salary
FROM employees;


NTILE(n)

Divides the rows into n buckets or groups and assigns a bucket number to each row.

This example divides employees into 4 salary groups (quartiles).

SELECT NTILE(4) OVER (ORDER BY salary) AS bucket, employee_name, salary
FROM employees;

Aggregate Functions:

These functions perform calculations on a group of rows but keep the results in each row instead of collapsing them like regular GROUP BY.

SUM()

It sums a set of rows.

This calculates the total salary in each department while showing individual employee salaries.

SELECT employee_name, salary, SUM(salary) OVER (PARTITION BY department) AS total_salary
FROM employees;

AVG()

It calculates the average of a set of rows.

SELECT employee_name, salary, AVG(salary) OVER (PARTITION BY department) AS avg_salary
FROM employees;

COUNT()

It counts the number of rows within a partition

SELECT employee_name, salary, AVG(salary) OVER (PARTITION BY department) AS avg_salary
FROM employees;

MAX() and MIN()

Finds the maximum and minimum values within a partition.

SELECT employee_name, salary, MAX(salary) OVER (PARTITION BY department) AS max_salary
FROM employees;

Value Functions:

LAG()

Returns the value from the previous row within the partition. You can use this to compare current and past rows.

SELECT employee_name, salary, LAG(salary) OVER (ORDER BY hire_date) AS prev_salary
FROM employees;

LEAD()

Returns the value from the next row in the partition. This is useful for future comparisons.

SELECT employee_name, salary, LEAD(salary) OVER (ORDER BY hire_date) AS next_salary
FROM employees;

FIRST_VALUE() and LAST_VALUE()

Return the first or last value in the window.

SELECT employee_name, salary, FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY hire_date) AS first_salary
FROM employees;

Frame Clauses:

Frame clauses in window functions define a specific subset (or "window") of rows relative to the current row for performing calculations. This allows you to compute things like moving averages, running totals, and other calculations that depend on a set of neighboring rows.

There are two types of frame clauses in PostgreSQL:

  1. ROWS: Specifies a frame based on a specific number of physical rows before or after the current row.
  2. RANGE: Specifies a frame based on the values of the rows, typically focusing on a range of values in the ORDER BY clause (e.g., all rows within a certain value range).

Syntax:

[ ROWS | RANGE ] BETWEEN <start> AND <end>

You can use:

  1. UNBOUND PRECEDING: Refers to all rows before the current one.
  2. UNBOUND FOLLOWING: Refers to all rows after the current one.
  3. CURRENT ROW: Refers to the current row itself.
  4. N PRECEDING or N FOLLOWING: Refers to a specific number of rows before or after the current one.

ROWS

This clause works with the physical number of rows. It considers a fixed number of preceding or following rows, irrespective of the values in the ORDER BY column.

Example:

We calculate the sum of the current row's value and the previous row's value (1 PRECEDING) for a running total.

SELECT
    date,
    sales,
    SUM(sales) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS rolling_sum
FROM sales_data;

ROWS BETWEEN 1 PRECEDING AND CURRENT ROW: This sums the current row and the immediately preceding row. If you're at the first row, the frame will only include the current row (no preceding row).

RANGE

This clause works based on the values in the ORDER BY column. It includes all rows that fall within the defined range of values relative to the current row, rather than a fixed number of rows.

Example:

We calculate the cumulative sum of sales for the current row and all preceding rows up to the current row in terms of sales amounts.

SELECT
    date,
    sales,
    SUM(sales) OVER (ORDER BY date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum
FROM sales_data;

RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: This sums all sales from the first row (UNBOUNDED PRECEDING) to the current row

Differences between ROWS and RANGE

  • ROWS: Works with a specific number of rows, regardless of the values in the ORDER BY column. It’s good for calculations like a fixed moving window (e.g., sum of 3 rows before and after).
  • RANGE: Works with a range of values based on the ORDER BY column. It’s useful when you need calculations that depend on a range of values, like cumulative totals or percentile-based calculations.

Key Takeaways 📚

  1. PARTITION BY: Divides the rows into subsets (e.g., by department)
  2. ORDER BY: Defines the order in which rows are processed within each partition.
  3. Frame clauses (like ROWS and RANGE) let you specify a sliding window of rows for calculations.


In the next part, we'll explore the advantages and disadvantages of using window functions, as well as how they compare across other relational databases.

Thank you for reading!