Select row with highest value in group in SQL
Yesterday, when I was investigating how to get the most recent transaction from each customer, I realized I had stumbled on a problem that many people have. This Stack Overflow post even says:
> It is, actually, so common that Stack Overflow community has created a single tag just to deal with questions like that: greatest-n-per-group.
The approach that I took was to use a correlated subquery. I noticed that it was pretty slow, but I was just running an ad-hoc query, so it didn’t bother me that much. When I realized that I needed to write a similar query that would run more often, though, I wanted to research the best approaches. I found this article by Ian at Database Guide that discusses 5 ways of selecting rows with the maximum value for their group, and I profiled each of the techniques he tried.
But first I needed to generate some test data so that I could benchmark each of the algorithms.
import datetime
import sqlite3
from random import randrange

# One entry per day covering 2021 (end date exclusive)
start = datetime.datetime.strptime("2021-01-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2021-12-31", "%Y-%m-%d")
date_range = [start + datetime.timedelta(days=x) for x in range(0, (end - start).days)]

sku_choices = ('Basic Edition', 'Standard Edition', 'Professional Edition')

con = sqlite3.connect('test.db')
con.execute('''DROP TABLE IF EXISTS transactions''')
con.execute('''CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, purchase_date text, email text, sku text)''')

# 400-500 purchases per day, drawn from a pool of 100,000 possible customers
for purchase_date in date_range:
    num_customers = randrange(400, 500, 1)
    for _ in range(num_customers):
        email = f"customer{randrange(0, 100000):09}@test.com"
        sku = sku_choices[randrange(0, len(sku_choices))]
        con.execute(
            "INSERT INTO transactions(purchase_date, email, sku) VALUES (?,?,?)",
            (purchase_date.date().isoformat(), email, sku),
        )
con.commit()
This generates a sqlite database of about 10MB (163,357 rows, 80,593 distinct email addresses), which I think is a good amount of data to test on.
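To double-check those counts after generating the data, a quick query like this works (the exact numbers will differ from run to run, since the data is random):

SELECT COUNT(*) AS total_rows, COUNT(DISTINCT email) AS distinct_customers
FROM transactions;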
Approach 1: Group By
SELECT
    email,
    MAX(purchase_date) AS max_purchase_date
FROM transactions
GROUP BY email
ORDER BY max_purchase_date, email;
Note: The first approach that Ian uses is just a GROUP BY with a MAX. This is a pretty simple approach, but the downside is that you don’t get the whole row, just the max value of one of the columns for each group.
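For example, it’s tempting to just add sku to the select list, but standard SQL doesn’t define which row a non-aggregated, non-grouped column comes from. (SQLite is a documented exception: with a single MAX() aggregate, bare columns come from the max row, but most engines either reject the query or return an arbitrary row.)

-- Unreliable across engines: sku is neither grouped nor aggregated
SELECT email, sku, MAX(purchase_date) AS max_purchase_date
FROM transactions
GROUP BY email;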
Approach 2: Correlated Subquery
SELECT
    purchase_date, email, sku
FROM transactions t
WHERE transaction_id = (
    SELECT transaction_id
    FROM transactions
    WHERE email = t.email
    ORDER BY purchase_date DESC
    LIMIT 1
)
ORDER BY purchase_date, email;
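To see why this approach struggles, you can prefix the query with EXPLAIN QUERY PLAN to have SQLite describe its strategy. With no index on email, I’d expect it to report a correlated scalar subquery that scans the table for every outer row (the exact output varies by SQLite version):

EXPLAIN QUERY PLAN
SELECT purchase_date, email, sku
FROM transactions t
WHERE transaction_id = (
    SELECT transaction_id
    FROM transactions
    WHERE email = t.email
    ORDER BY purchase_date DESC
    LIMIT 1
);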
Approach 3: Uncorrelated Subquery
SELECT
    t1.purchase_date,
    t1.email,
    t1.sku
FROM transactions t1
JOIN (
    SELECT email, MAX(purchase_date) AS max_purchase_date
    FROM transactions
    GROUP BY email
) AS t2 ON t1.email = t2.email AND t1.purchase_date = t2.max_purchase_date
ORDER BY t1.purchase_date, t1.email;
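One caveat: the join brings back every row matching the max date, so a customer with two transactions on their latest date appears twice. A quick diagnostic (a sketch that wraps the query above) counts how often that happens:

-- Customers that Approach 3 returns more than once
SELECT email, COUNT(*) AS rows_returned
FROM (
    SELECT t1.email
    FROM transactions t1
    JOIN (
        SELECT email, MAX(purchase_date) AS max_purchase_date
        FROM transactions
        GROUP BY email
    ) AS t2 ON t1.email = t2.email AND t1.purchase_date = t2.max_purchase_date
)
GROUP BY email
HAVING COUNT(*) > 1;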
Approach 4: Left Join
SELECT
    t1.purchase_date,
    t1.email,
    t1.sku
FROM transactions t1
LEFT JOIN transactions t2
    ON t1.email = t2.email AND t1.purchase_date < t2.purchase_date
WHERE t2.email IS NULL
ORDER BY t1.purchase_date, t1.email;
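The IS NULL check is the classic anti-join trick: keep only the rows for which no later purchase by the same customer exists. If that reads as cryptic, the same logic can be spelled out with NOT EXISTS (an equivalent formulation, not one of the five approaches from the article):

SELECT purchase_date, email, sku
FROM transactions t1
WHERE NOT EXISTS (
    SELECT 1
    FROM transactions t2
    WHERE t2.email = t1.email
      AND t2.purchase_date > t1.purchase_date
)
ORDER BY purchase_date, email;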
Approach 5: Common Table Expression (CTE) with Window Function
WITH cte AS (
    SELECT
        purchase_date, email, sku,
        RANK() OVER (PARTITION BY email ORDER BY purchase_date DESC) AS r
    FROM transactions
)
SELECT purchase_date, email, sku
FROM cte
WHERE r = 1
ORDER BY purchase_date, email;
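This has the same tie behavior mentioned under Approach 3: RANK() assigns 1 to every transaction on a customer’s latest date. If exactly one row per customer is needed, swapping in ROW_NUMBER() with a tiebreaker should guarantee it (a sketch, using transaction_id to break ties):

WITH cte AS (
    SELECT
        purchase_date, email, sku,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY purchase_date DESC, transaction_id DESC
        ) AS r
    FROM transactions
)
SELECT purchase_date, email, sku
FROM cte
WHERE r = 1
ORDER BY purchase_date, email;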
Now that I had each of the SQL approaches, I wanted to benchmark them. So I ran each of them 3 times and took the median value using the following in bash:
time sqlite3 test.db ".read query1.sql">/dev/null
time sqlite3 test.db ".read query2.sql">/dev/null
time sqlite3 test.db ".read query3.sql">/dev/null
time sqlite3 test.db ".read query4.sql">/dev/null
time sqlite3 test.db ".read query5.sql">/dev/null
The results:

| Approach | Time |
|---|---|
| 1. Group By* | 139ms |
| 2. Correlated Subquery | 1,546,320ms |
| 3. Uncorrelated Subquery | 287ms |
| 4. Left Join | 387ms |
| 5. CTE with Window Function | 276ms |

\* Doesn’t return the whole row; see the note under Approach 1.
After yesterday’s work I knew that Approach 2 was slow, but I had no idea how slow until I actually ran it against the others. Even though that approach came to me intuitively when I was working on that query, I realize now that it’s always helpful to benchmark.
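In hindsight, I suspect the correlated subquery would have fared far better with an index, since the inner query then becomes an index seek instead of a full scan per outer row. I haven’t re-run the benchmark to confirm, so treat this as a sketch:

-- Hypothetical index (the name is arbitrary); should let the inner
-- SELECT in Approach 2 seek on (email, purchase_date)
CREATE INDEX idx_transactions_email_date
ON transactions (email, purchase_date);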
I think in the future I will probably go with Approach 5 when I come across this problem. To me, it seems the most intuitive and memorable.