Select row with highest value in group in SQL

David Hamrick

Yesterday, while investigating how to get the most recent transaction from each customer, I realized I had stumbled on a problem that many people have. This Stack Overflow post even says:

It is, actually, so common that Stack Overflow community has created a single tag just to deal with questions like that: greatest-n-per-group.

The approach I took was a correlated subquery. I noticed it was pretty slow, but since I was just running an ad-hoc query it didn’t bother me that much. When I realized I needed to write a similar query that would run more often, I wanted to research the best approaches. I found this article by Ian at Database Guide that discusses 5 ways of selecting rows with the maximum value for their group, and I profiled each of the techniques he tried.

But first I needed to generate some test data so that I could benchmark each of the algorithms.

import datetime
import sqlite3
from random import randrange

# One entry per day across 2021.
start = datetime.datetime.strptime("2021-01-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2021-12-31", "%Y-%m-%d")
date_range = [start + datetime.timedelta(days=x) for x in range((end - start).days)]
sku_choices = ('Basic Edition', 'Standard Edition', 'Professional Edition')

con = sqlite3.connect('test.db')
con.execute('''DROP TABLE IF EXISTS transactions''')
con.execute('''CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, purchase_date text, email text, sku text)''')

for purchase_date in date_range:
    # 400-499 transactions per day, drawn from a pool of 100,000 possible customers.
    num_customers = randrange(400, 500)
    for _ in range(num_customers):
        email = f"customer{randrange(0, 100000):09}@test.com"
        sku = sku_choices[randrange(0, len(sku_choices))]
        # transaction_id is auto-assigned via the INTEGER PRIMARY KEY.
        con.execute("INSERT INTO transactions(purchase_date, email, sku) VALUES (?,?,?)",
                    (purchase_date.date().isoformat(), email, sku))
con.commit()
con.close()

This generates a roughly 10MB SQLite database (163,357 rows, 80,593 distinct email addresses), which I think is a good amount of data to test on.
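If you rerun the generator, the exact counts will differ slightly since the data is random; a quick sanity check:

-- Row and distinct-customer counts for the generated database.
SELECT COUNT(*) AS num_rows, COUNT(DISTINCT email) AS num_customers
FROM transactions;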

Approach 1: Group By

SELECT
    email,
    MAX(purchase_date) AS max_purchase_date
FROM transactions
GROUP BY email
ORDER BY max_purchase_date, email;

Note: The first approach Ian uses is simply a GROUP BY with a MAX aggregate. It’s a pretty simple approach, but the downside is that you don’t get the whole row back, just the max value of one column for each group.
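As an aside, SQLite has a documented quirk that papers over this limitation: when a query’s only aggregate is MIN() or MAX(), any bare columns in the SELECT list take their values from the row that holds the min or max. So the following works in SQLite, but don’t rely on it elsewhere; most other engines will reject the bare column or return an arbitrary row:

-- SQLite-specific: sku comes from the row holding MAX(purchase_date).
-- Not portable to most other databases.
SELECT email, sku, MAX(purchase_date) AS max_purchase_date
FROM transactions
GROUP BY email;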

Approach 2: Correlated Subquery

SELECT
    purchase_date, email, sku
FROM transactions t
WHERE
    transaction_id=(SELECT transaction_id FROM transactions WHERE email=t.email ORDER BY purchase_date DESC LIMIT 1)
ORDER BY purchase_date, email;
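This is the approach I had originally reached for. It’s easy to see why it struggles: the subquery re-sorts that customer’s transactions for every single row of the outer table, and there’s no index on email or purchase_date to help. If you want to see what SQLite plans to do (the exact output varies by version), EXPLAIN QUERY PLAN will show the correlated scan:

-- Ask SQLite for its plan; expect a scan of the outer table plus a
-- correlated subquery scan with a temp sort, repeated per row.
EXPLAIN QUERY PLAN
SELECT purchase_date, email, sku
FROM transactions t
WHERE transaction_id=(SELECT transaction_id FROM transactions WHERE email=t.email ORDER BY purchase_date DESC LIMIT 1);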

Approach 3: Uncorrelated Subquery

SELECT
    t1.purchase_date,
    t1.email,
    t1.sku
FROM transactions t1
JOIN (
    SELECT email, MAX(purchase_date) AS max_purchase_date
    FROM transactions
    GROUP BY email
) AS t2 ON t1.email = t2.email AND t1.purchase_date = t2.max_purchase_date
ORDER BY t1.purchase_date, t1.email;
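One caveat with this join: if a customer has two transactions on their max purchase_date, both rows come back. If you need exactly one row per customer, you have to break the tie yourself. A sketch, assuming a higher transaction_id means a later insert (which holds for this generated data, since rows are inserted in date order):

-- Tie-break variant: among a customer's rows on their max date,
-- keep only the one with the highest transaction_id.
SELECT purchase_date, email, sku
FROM transactions
WHERE transaction_id IN (
    SELECT MAX(t1.transaction_id)
    FROM transactions t1
    JOIN (
        SELECT email, MAX(purchase_date) AS max_purchase_date
        FROM transactions
        GROUP BY email
    ) AS t2 ON t1.email = t2.email AND t1.purchase_date = t2.max_purchase_date
    GROUP BY t1.email
)
ORDER BY purchase_date, email;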

Approach 4: Left Join

SELECT
    t1.purchase_date,
    t1.email,
    t1.sku
FROM transactions t1
-- Anti-join: pair each transaction with any later one for the same
-- customer; rows with no later match are each customer's most recent.
LEFT JOIN transactions t2 ON t1.email = t2.email AND t1.purchase_date < t2.purchase_date
WHERE t2.email IS NULL
ORDER BY t1.purchase_date, t1.email;

Approach 5: Common Table Expression (CTE) with Window Function

WITH cte AS (
    SELECT
        purchase_date, email, sku,
        RANK() OVER ( PARTITION BY email ORDER BY purchase_date DESC) AS r
    FROM transactions
)
SELECT purchase_date, email, sku
FROM cte
WHERE r = 1
ORDER BY purchase_date, email;
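The same ties caveat from Approach 3 applies here: RANK() hands out rank 1 to every transaction tied on a customer’s max date. Swapping in ROW_NUMBER() (available since SQLite 3.25, like RANK()) forces exactly one row per customer, and adding transaction_id to the ORDER BY makes the choice of winner deterministic rather than arbitrary:

WITH cte AS (
    SELECT
        purchase_date, email, sku,
        -- ROW_NUMBER() never ties; transaction_id breaks date ties.
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY purchase_date DESC, transaction_id DESC) AS r
    FROM transactions
)
SELECT purchase_date, email, sku
FROM cte
WHERE r = 1
ORDER BY purchase_date, email;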

Now that I had each of the SQL approaches, I wanted to benchmark them, so I ran each one three times in bash and took the median:

time sqlite3 test.db ".read query1.sql">/dev/null
time sqlite3 test.db ".read query2.sql">/dev/null
time sqlite3 test.db ".read query3.sql">/dev/null
time sqlite3 test.db ".read query4.sql">/dev/null
time sqlite3 test.db ".read query5.sql">/dev/null
Approach                       Time
1. Group By*                   139ms
2. Correlated Subquery         1,546,320ms
3. Uncorrelated Subquery       287ms
4. Left Join                   387ms
5. CTE with Window Function    276ms

*Returns only the max value per group, not the whole row (see the note under Approach 1).

After yesterday’s work I knew that Approach 2 was slow, but I had no idea how slow until I ran it head-to-head: the correlated subquery is thousands of times slower than any of the alternatives. Even though that approach came to me intuitively when I was working on the original query, it’s a good reminder that it’s always worth benchmarking.

In the future I will probably reach for Approach 5 when I come across this problem. It was the fastest of the options that return the whole row, and to me it’s the most intuitive and memorable.