Follow me on Twitter Join me on Facebook
Richard Wallman - a MySQL® DBA (database administrator)

Using Master-Master replication for hot-spare database servers

Posted on 2008-03-25T00:00:00 @ 16:43:32

Although the most common use of the MySQL database server replication facility is master-slave replication, master-master replication (where each is both a master, and a slave to the other) can be extremely useful in maintaining a hot-spare database server for emergency use.

First of all, a big warning:

Do not try doing updates to both masters - this is not a true multi-master system. You can break things horribly by doing so - and you have been warned. Only one master at a time should be used for data modification, although the other can be safely used as a 'slave' server.

With that out of the way, let's begin.

Setting up the master-master replication is fairly straightforward - both servers need to have binary logging turned on, and both servers need to be set up as a replicated slave to the other server. The MySQL database server is smart enough not to infinitely loop updates between the two servers, so this configuration is safe.

We now have two servers, running at the same time with their own copies of the database. We'll use one of them as a 'live' master (we'll call this one 'alice') and one 'backup' server either idle or being used as a slave (we'll call this one 'bob').

Your applications quite happily send their requests to alice, and any data modification performed on alice will be replicated across to bob. Provided replication doesn't break down, you can consider alice and bob to have identical copies of the data. If bob was to be taken offline (for whatever reason) then when it's brought back up it will connect back to alice and update it's own data (standard replication catchup).

If your nerves can stand it, the 'backup' server can be stopped to obtain consistent backups. When started again, it'll catch up and be ready for emergencies.

Disaster strikes! Alice is off-line and your applications are failing! Time for your fail-over solution to kick in. Depending on your fail-over solution, your downtime should be fairly small.

Queries being sent to alice need to be redirected to bob, whatever way you're most comfortable in doing so. IP address takeover is a popular solution (bob is given the IP address of alice) as it means there's no changes to be made to applications - they have no idea they're now talking to bob instead of alice (and it really doesn't matter anyway). As bob was maintaining a consistent copy of the data by means of replication, there should be little or no data loss in the handover. Provided that the hardware can support the load, you can (and should) carry on running with bob as your 'live' server indefinitely.

However, you really want to get alice back up and running, at least to take over the role of 'backup' server. Depending on what happened to take alice off-line, you may or may not be willing to trust the data being held by alice - this is where your backups come in! (you do have backups, don't you?) Once alice is brought back online, it will connect to bob and start applying whatever data modification statements have been run on bob (once again, standard replication catchup). Once alice has caught up, you're now safe again - you have one 'live' and one 'backup' server with identical copies of the data.

If you need to switch 'live' and 'backup' servers around (because one has better hardware, for example) then it should just be a case of shutting down the MySQL database server on the 'live' machine and waiting for the fail-over the kick in again. Once applications are talking to the other server, start the MySQL database server again and it should quickly catch up. If you're going to do this, pick a quiet time so that there's not too much impact and the replication catch-up is quick.

It is vital that a server is completely up-to-date with it's replication before it can be switched back to being the 'live' server, otherwise you may end up with overlapping data and potential replication failure. Also, be extremely careful with IP address takeover - if both alice and bob are online with the same IP address, chaos will ensue and your entire database (on both servers) will start becoming corrupted. Use a separate IP address that needs to be manually set up (either via the command line or by your fail-over solution) to prevent this from happening.

Using more than two servers for this is not recommended - MySQL database servers can only have one master, and so there's the potential for the whole system to fail. Consider a three server replication set-up: A -> B -> C -> A, with 'A' running as the live server and 'B' and 'C' as backups. This is fine provided 'B' remains online - if 'B' fails, then the data on 'C' is no longer current, so if 'A' fails as well you're in trouble! It's better to have an additional 'cold-spare' server that's not part of the system, that can be put into action in the event of hardware failure.

This solution is not suitable for transaction or data critical applications - there is a small lag introduced with replication, and so there is the potential at least for data to be lost when the 'live' server goes down. If your system can handle losing a couple of seconds of data, then this solution can help you avoid downtime. If it can't, you need a more robust fail-over solution.