I have been using MySQL replication for years, specifically multi-master replication, where you can read from and write to either server.
To make my situation even worse, I do it across the global Internet. The added latency and occasional outages between sites can cause ‘issues’ that you might not see if your servers were connected locally over high-speed Ethernet.
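For anyone setting up something similar, the usual trick for letting both masters accept writes without colliding on auto-increment keys is a pair of offset settings. Here is a minimal sketch of the relevant my.cnf section, with illustrative values rather than my actual config:

[mysqld]
server-id                = 1                         # the other master uses a different ID, e.g. 2
log_bin                  = /var/log/mysql/mysql-bin
auto_increment_increment = 2                         # two masters in the pair
auto_increment_offset    = 1                         # the other master uses offset 2

With server-id = 2 and auto_increment_offset = 2 on the second box, the two servers never hand out the same auto-increment value.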
This weekend one of my servers lost a hard drive. One of the new WD Red hard drives I wrote about just a few weeks ago. The drives have been in production for less than two months and one has failed already.
Thanks to Linux software RAID, the data is OK and the machine is still humming along while we wait for a replacement drive to be installed.
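If you find yourself in the same degraded-RAID spot, it only takes a minute to confirm the array is still serving data. The device name below is just an example from a typical setup:

cat /proc/mdstat                  # a [U_] style status means the array is up but degraded
sudo mdadm --detail /dev/md0      # lists each member disk and shows which one has failed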
One thing that did not survive the hard drive crash is the MySQL replication. The other server (the one with healthy disks) started showing this error after replication stopped:
Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.006259' at 2608901, the last event read from '/var/log/mysql/mysql-bin.006259' at 2608901, the last byte read from '/var/log/mysql/mysql-bin.006259' at 2609152.'
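If the error does not land right in front of you, the slave records it where you would expect: in SHOW SLAVE STATUS. The fields to check are Slave_IO_Running (it will say No), Last_IO_Errno and Last_IO_Error.

-- Look for Slave_IO_Running: No, plus the 1236 message in Last_IO_Error.
SHOW SLAVE STATUS\G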
The message suggests running out of disk space on the master, but that was not the case here. The problem most likely came from the server being restarted without a proper shutdown (a big NO NO).
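As a side note, if a box can crash or lose power like this, the binlog durability settings are worth reviewing. These are standard MySQL options rather than anything specific to my setup, and turning them up costs some write performance:

[mysqld]
sync_binlog                    = 1   # sync the binlog to disk on every write so a crash is far less likely to leave a half-written event
innodb_flush_log_at_trx_commit = 1   # flush the InnoDB redo log at every commit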
So if you get this error, or any other replication error that references a binlog file and a position, how do you find out what the problem is?
If you have been running MySQL replication for any length of time, you have hit moments where replication stops and you need to know why.
To see the transaction that is killing your replication, head over to the master server and use the mysqlbinlog utility to inspect what is going on.
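The error message above already tells you which file and offset to look at, so a command along these lines (paths are from my servers, adjust for yours) dumps the events starting right where the slave choked:

mysqlbinlog --start-position=2608901 /var/log/mysql/mysql-bin.006259 | less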
In my case I was greeted with this message:
# Warning: this binlog is either in use or was not closed properly.
Essentially it is saying the file is messed up; with no valid content left in that file, replication is stuck. To get it started again you will need to give your slave(s) a new instruction, telling them to move on to the next binlog. I advanced to the next binlog like this:
STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE = 'mysql-bin.006260', MASTER_LOG_POS = 4;
START SLAVE;
With that, the slave can start up again, reading the next binlog from position 4 (the first event after the binlog file header), which contains valid content.
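If you want to be sure everything is really flowing again, a quick check on each side does it:

-- On the slave: Slave_IO_Running and Slave_SQL_Running should both say Yes,
-- and Seconds_Behind_Master should head back toward 0.
SHOW SLAVE STATUS\G

-- On the master: lists the binlog files it still has on disk,
-- so you can confirm the file you pointed the slave at actually exists.
SHOW BINARY LOGS;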
Crisis averted; all we need now is a new hard drive and the server should be a happy camper once more.