Multi-Column Index
👨💼 As we've accumulated more and more users who have more and more notes, we've
found that our query for the user list is getting slower and slower and using
more and more memory and CPU. We need to optimize this query.
I need you to do some analysis on the query and determine what indexes we need
to add to optimize it.
🐨 Before you go straight to making the one line code change
in , please read through these
instructions to do some analysis.
If you'd like to observe the performance issue, you can update the seed script
to generate more users and more notes per user. This could make the seed
script take a very very long time (several minutes). You should notice
major slow downs by having 15000 users each with 200-300 notes. You can also
comment out the images bit if you want it to run faster since images shouldn't
affect this. If you do this, a query with no search term should take ~8
seconds.
Identify the problem query
If you run the query by opening , you should see the
query that was executed in the terminal output. It should log something like
this:
SELECT user.id, user.username, user.name, image.id AS imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE ?
OR user.name LIKE ?
ORDER BY (
SELECT updatedAt
FROM Note
WHERE ownerId = user.id
ORDER BY updatedAt DESC
LIMIT 1
) DESC
LIMIT 50;
If you seeded your database with lots of data, you should also notice it took
a long time for your data to come back. Yikes!
Analyze the problem query
The
?
symbols in that query represent interpolated values. To use this query
in the sqlite EXPLAIN QUERY PLAN
command, we need to replace those with real
values. So, let's run this query with some values:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id AS imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE "%kody%"
OR user.name LIKE "%kody%"
ORDER BY (
SELECT updatedAt
FROM Note
WHERE ownerId = user.id
ORDER BY updatedAt DESC
LIMIT 1
) DESC
LIMIT 50;
That gives us this output:
QUERY PLAN
|--SCAN user
|--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
|--CORRELATED SCALAR SUBQUERY 1
| |--SEARCH Note USING INDEX Note_ownerId_idx (ownerId=?)
| `--USE TEMP B-TREE FOR ORDER BY
`--USE TEMP B-TREE FOR ORDER BY
Huh... Yeah, there are a couple things going on in there. Let's build up the
query a bit at a time so we can talk about each bit.
🐨 Let's start with this:
EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name
FROM User AS user
LIMIT 50;
That should give you this:
QUERY PLAN
`--SCAN user
Remember what a SCAN with no index means? It says "read every record." But it's
only happening because there's no
WHERE
or ORDER BY
clause.🐨 Add an
ORDER BY
clause:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name
FROM User AS user
ORDER BY user.username
LIMIT 50;
QUERY PLAN
`--SCAN user USING INDEX User_username_key
Great, we're using an index, so that'll be much more efficient.
🐨 Let's try it out with the
name
column (which is not indexed):EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name
FROM User AS user
ORDER BY user.name
LIMIT 50;
This gives us:
QUERY PLAN
|--SCAN user
`--USE TEMP B-TREE FOR ORDER BY
So here's something new. Because the
user.name
column is not indexed, the
database has to scan the user without an index, but in order to sort the data by
user.name
, it has to store all the user's names in a temporary "B-TREE" data
structure so it can do the sorting for the order by. This is going to eat up
some memory for the space for the data structure, and CPU for performing the
comparison.But we're getting side-tracked. We're not ordering by the user's name. Let's
get back on track.
🐨 We'll add back the
ORDER BY user.username
for a second so we can see what
the LEFT JOIN
does to our query:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
ORDER BY user.username
LIMIT 50;
And that gives us:
QUERY PLAN
`--SCAN user USING INDEX User_username_key
At first this may surprise you, but the optimizer has determined that it can
skip the
LEFT JOIN
entirely because we're not referencing it anywhere else
in the query.🐨 Add
image.id
in the selectEXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id as imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
ORDER BY user.username
LIMIT 50;
That will give us:
QUERY PLAN
|--SCAN user USING INDEX User_username_key
`--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
Super, with that we get a "SEARCH" using an index on the foreign key, so that
should be quick enough. This is because the
ON
predicate is on the indexed
column.🐨 Alrighty, let's add the
WHERE
clause with the LIKE
in:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id as imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE '%kody%'
ORDER BY user.username
LIMIT 50;
QUERY PLAN
|--SCAN user USING INDEX User_username_key
`--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
Whoops! It is using an index for the
ORDER BY
, but it's not telling us that
it's likely not using an index for the LIKE
(according to
the rules).🐨 Let's add the
OR
for the user.name
:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id as imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE '%kody%'
OR user.name LIKE '%kody%'
ORDER BY user.username
LIMIT 50;
QUERY PLAN
|--SCAN user USING INDEX User_username_key
`--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
It's still leveraging the index for the
ORDER BY
, but it's not telling us that
it's likely not using an index for the LIKE
.🐨 Before we put the subquery in the
ORDER BY
, let's look at the subquery on
its own:EXPLAIN QUERY PLAN
SELECT updatedAt
FROM Note
WHERE ownerId = "some_id"
ORDER BY updatedAt DESC
LIMIT 1;
QUERY PLAN
|--SEARCH Note USING INDEX Note_ownerId_idx (ownerId=?)
`--USE TEMP B-TREE FOR ORDER BY
Here we are again with the
TEMP B-TREE
for the ORDER BY
. This is because
the updatedAt
column is not indexed. So for every note for the user, it has
to store the updatedAt
in a temporary data structure so it can sort it to find
the one that was most recently updated. Definitely something fishy here...🐨 Now, let's put that query as a subquery in the
ORDER BY
:EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id as imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE '%kody%'
OR user.name LIKE '%kody%'
ORDER BY (
SELECT updatedAt
FROM Note
WHERE ownerId = user.id
ORDER BY updatedAt DESC
LIMIT 1
) DESC
LIMIT 50;
QUERY PLAN
|--SCAN user
|--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
|--CORRELATED SCALAR SUBQUERY 1
| |--SEARCH Note USING INDEX Note_ownerId_idx (ownerId=?)
| `--USE TEMP B-TREE FOR ORDER BY
`--USE TEMP B-TREE FOR ORDER BY
Now
EXPLAIN
is (finally) showing that we're scanning because there's no index in
use on the user table (it was scanning the whole time, silly). But, it's now also
showing a TEMP B-TREE
for the ORDER BY
, and because that's a subquery, that
will happen for every user 😱. Very very bad.We need to optimize the
ORDER BY
sub query, and then the whole query should
run much faster. Because that query is against the Note table, the index we need
will go on the Note
model.Identify the index
Remember the photo album metaphor? In that example, the first folder you want is
the one where you're trying to filter things. That's often the column most
frequently used in your queries. While individual indexes can certainly be
beneficial for corresponding single-column queries, for multi-column queries, a
well-designed composite index is often more efficient.
Generally, in a composite (multi-column) index, start with the column that is
most frequently used in your
WHERE
clause and can be most utilized, then add
frequently ORDER BY
columns to the end of the index.This often means you start with "bigger buckets" and get more specific as you
go.
While these rules are not hard and fast, they are a good starting point. You
should always test your queries to see if they're using the indexes you expect
them to use and if they're performing well.
So in our case, we look at the
WHERE
and then the ORDER BY
to determine our
indexes. We're referencing the updatedAt
in the ORDER BY
and the ownerId
in
the WHERE
. We need to combine these columns in a single index to optimize this
query.🐨 So, let's sort first by the
ownerId
and then by the updatedAt
.🐨 Now run:
npx prisma db push
Remember, this does not update your migration file. You'll want to do that
when you're ready to commit to this. We're just testing things out for now.
If you'd like, you can execute this command in SQLite and it will show you all
the indexes active in the database:
.indexes
That should give you something like this:
NoteImage_noteId_idx sqlite_autoindex_NoteImage_1
Note_ownerId_idx sqlite_autoindex_Note_1
Note_ownerId_updatedAt_idx sqlite_autoindex_UserImage_1
UserImage_userId_key sqlite_autoindex_User_1
User_email_key sqlite_autoindex__prisma_migrations_1
User_username_key
Our new one is the one called
Note_ownerId_updatedAt_idx
!🐨 Now, let's run the query plan again:
EXPLAIN QUERY PLAN
SELECT user.id, user.username, user.name, image.id as imageId
FROM User AS user
LEFT JOIN UserImage AS image ON user.id = image.userId
WHERE user.username LIKE '%kody%'
OR user.name LIKE '%kody%'
ORDER BY (
SELECT updatedAt
FROM Note
WHERE ownerId = user.id
ORDER BY updatedAt DESC
LIMIT 1
) DESC
LIMIT 50;
QUERY PLAN
|--SCAN user
|--SEARCH image USING INDEX UserImage_userId_key (userId=?) LEFT-JOIN
|--CORRELATED SCALAR SUBQUERY 1
| `--SEARCH Note USING COVERING INDEX Note_ownerId_updatedAt_idx (ownerId=?)
`--USE TEMP B-TREE FOR ORDER BY
Sweet! We've just eliminated the
TEMP B-TREE
on the Note table! And now our
Note Search is using our new index (learn more about what "covering index
means" below). This is a huge win because the query doesn't have to read every
note for every user!Note we do still have a non-indexed scan for the user, but based on the
requirements of a wildcard prefix in the
LIKE
, an index won't help anyway.Incidental Covering Index
If you're curious, we just happen to have what's known as a "covering index"
here, which means SQLite doesn't even have to read any records from the Note
table. It can just read the index and get the data it needs. This is because all
the columns used by the query are covered by the index and is highly
efficient. You can learn more about this from
the SQLite Query Planner docs
🐨 Now go ahead and try searching users again:
Checking our logs, we're getting about
30ms
per query when there's no search
term, and just a couple milliseconds when there is a search term. Going from 8
seconds to 30ms is over 250x faster! Nice!