Follow me on Twitter Join me on Facebook
Richard Wallman - a MySQL® DBA (database administrator)

What are 'index prefixes'?

Posted on 2008-03-02T00:00:00 @ 13:45:17

Without wishing to sound pedantic, they're indexes build upon a prefix of the column's data. Examples might be more useful.

Normally, the entire value of a column is used to build the index - this is fine for short data types (integers and the like) but can result in a lot of data in the index for longer data types (CHAR and VARCHAR, for example). Using an index prefix allows you to make a trade off between the space required for the index and the cardinality of the index.

For small data types column prefixes don't make a lot of sense - there's usually very little to gain by using a prefix, and so the server won't let you do it. The advantages are mainly seen in longer data types, and are actually essential in the (potentially) longest types (TEXT and BLOB).

So, why are they useful? Let's consider a pretty standard feature for most of us - a 'users' table, holding authentication data for the users of a system. One common task may be to look up a user's ID based on their username, a short text string.

With no indexes at all, the MySQL database server would have to scan the entire table to do the lookup - while fine for small tables and/or infrequent lookups, this will soon become a major performance hit as things start to grow.

To solve our performance problem, we create a new index! However, as everyone has (or at least, they should have!) a unique username this results in an index with very high cardinality. While this is perfect for this particular use (it will quickly return the ID for a given username) it can result in the storage requirements growing quite large as the number of users increases (the full username is stored in both the data table and the index file). It also makes more work when dealing with ranges of data (i.e. "all usernames starting with the letters 'a'-'c'") as there are more index records to go through, and when dealing with long pieces of data (more data needs to be loaded and compared).

It's important to note that this is all in the query optimiser stage - if the query optimiser decides another index is more appropriate (a date based one, for example, if you're doing some kind of temporal search on your users) then it may totally ignore your full-column index. Also, if there are a significant number of usernames within the requested range (more than about 30% of the total number, according to the MySQL documentation) then the query optimiser may decide that a table scan would be just as quick. In all cases, the index is only used to decide which records will be sent to be tested against the query conditions (unless you do some really clever index creation).

The key to effective index prefix use is to know how your users will access the data and to calculate the cardinality of prefix lengths before creating the index. If only exact username lookups are done then your complete column index is probably the best. If people are searching on the starting letter(s) then a column prefix might be better, the length of which can be decided by calculating the different cardinalities of different prefix lengths.

These type of indexes also need to be monitored to provide the most efficiency - as the data in the table changes, so does the cardinality, and so your once-efficient index prefix may suddenly be dropped in favour of a full table scan. A good example of this is the BT/Yahoo! deal - people signing up for BT broadband were automatically given a Yahoo! email address, with a username starting with the string 'bty'. If you had created an index prefix on the first three characters of the username, as more of these people joined the cardinality of the index starts to change, until eventually the query optimiser decides that searches on any username starting with 'b' (as this also includes the string 'bty') would be just as quick to do a full table scan than use the index.