Monday, April 28, 2014

When Redis failover fails

Given:
- 2 Redis instances running in replication mode (master and slave). The master node went down, so only the slave is running.
- 2 Redis Sentinels that are trying to start the failover process
- An error in the Sentinel log:
-failover-abort-no-good-slave master primary [some_ip] [some_port]

What does it mean? 

It means (obviously) that Sentinel cannot find a suitable slave to promote to master. But wait! We have a slave node! It is running! What's wrong, then? The trick is that our running slave has incomplete data (it never finished synchronizing), so it doesn't meet the requirements for promotion.
A possible scenario: the master went down while the slave was still synchronizing data. The more data you have to synchronize, the higher the probability of this happening.
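
One way to see what state the slave is in is to look at the replication section of its INFO output. The sketch below is a minimal illustration in Python: it parses text shaped like that output and lists reasons the slave may not be safe to promote. The field names (master_link_status, master_sync_in_progress, master_link_down_since_seconds) are the ones Redis reports for a slave; the sample values and the function names are made up for illustration, and this check is only a manual sanity test, not Sentinel's own selection logic.

```python
def parse_info(text):
    """Parse the key:value lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replication_problems(info):
    """Return reasons this instance looks unsafe to promote (empty list = none found)."""
    if info.get("role") != "slave":
        return ["not a slave"]
    problems = []
    if info.get("master_sync_in_progress") == "1":
        problems.append("initial sync still in progress")
    if info.get("master_link_status") != "up":
        down = info.get("master_link_down_since_seconds", "unknown")
        problems.append("link to master is down (for %s s)" % down)
    return problems


# Text shaped like the replication section of INFO on a slave; the values are made up.
SAMPLE = """\
# Replication
role:slave
master_host:10.0.0.1
master_port:6379
master_link_status:down
master_sync_in_progress:1
master_link_down_since_seconds:312
"""

if __name__ == "__main__":
    for problem in replication_problems(parse_info(SAMPLE)):
        print(problem)
```

In practice the text would come from running INFO replication against the slave (for example through redis-cli) instead of the hard-coded sample.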

What to do?

Check your slave's redis.log. If it is constantly screaming something like:
[17319] 22 Apr 10:44:29.689 * Connecting to MASTER [address]
[17319] 22 Apr 10:44:29.689 * MASTER <-> SLAVE sync started
[17319] 22 Apr 10:44:29.689 # Error condition on socket for SYNC: Connection refused

that means the slave was interrupted during synchronization. It cannot be promoted to master, and Sentinel cannot help you with that. You have to restart the master by your own means (a custom script, your own hands, etc.).
That's what I found. There may be another solution built into Redis. Please share if you know more about it.
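
The log check described above can be automated. This is a rough Python sketch that counts sync attempts and failures in a slave's log. The patterns are assumptions based only on the three log lines quoted in this post, so other Redis versions may word them differently; the function name and the sample list are hypothetical.

```python
import re

# Patterns taken from the log excerpt in this post (an assumption, not a stable format).
SYNC_START = "MASTER <-> SLAVE sync started"
SYNC_ERROR = re.compile(r"# Error condition on socket for SYNC: (?P<reason>.+)$")


def summarize_sync_attempts(lines):
    """Count sync attempts and failures, and collect the distinct failure reasons."""
    started = 0
    failed = 0
    reasons = set()
    for line in lines:
        if SYNC_START in line:
            started += 1
        match = SYNC_ERROR.search(line.rstrip())
        if match:
            failed += 1
            reasons.add(match.group("reason"))
    return {"started": started, "failed": failed, "reasons": sorted(reasons)}


# The excerpt from the post, repeated twice to imitate a slave that keeps retrying.
SAMPLE_LOG = [
    "[17319] 22 Apr 10:44:29.689 * Connecting to MASTER [address]",
    "[17319] 22 Apr 10:44:29.689 * MASTER <-> SLAVE sync started",
    "[17319] 22 Apr 10:44:29.689 # Error condition on socket for SYNC: Connection refused",
] * 2

if __name__ == "__main__":
    print(summarize_sync_attempts(SAMPLE_LOG))
```

To run it against a real file, pass an open file object instead of SAMPLE_LOG. If every started sync ends in "Connection refused", the slave is stuck in the retry loop described above.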

How to reproduce.

Given:
- 2 Redis instances configured to run in replication mode (let's call them R1, currently the master, and R2, the slave).
- 2 Redis Sentinels that are monitoring R1
- ~500 MB of data to keep

What happens when we start our system:
1. R1 loads data into memory
2. R2 starts synchronization

Let's say the synchronization was successful, and let's continue our experiment :)

3. R1 goes down. Let's say the redis process was killed by some cruel IT guy :). The in-memory data is lost, and so is the process ID (PID). But R2 got all the data, so the Sentinels promote it to master. Everything is OK.
4. R1 comes back up, and the Sentinels command it to switch to slave.
5. R1 is a slave now, so it starts synchronizing with R2. R1 performs a full synchronization (instead of a partial one, which would be faster) because it lost its in-memory data and restarted as a new process with a different PID (http://redis.io/topics/replication).
6. Since we have a relatively big dataset (~500 MB), it takes some time to transfer from master to slave (let's say 5 minutes).
7. Now R2 goes down...
8. R1 cannot complete the synchronization. It tries to get data from R2 again and again, but it's hopeless: R2 was killed. NOOOOOOOOOOOOOOOOOOOOOOOOOoooooo..... (looks like a tragic moment in a Hollywood movie :))
9. The Sentinels cannot promote R1 to master because it is still trying to complete the synchronization... Our system has no master anymore... Saaad.
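
The timeline above can be replayed as a toy model. This Python sketch is only an illustration of the steps in this post: the Node class, the failover function, and the rule "promote the first live, fully synced slave" are simplifying assumptions, not Sentinel's real selection algorithm.

```python
class Node:
    """A Redis instance reduced to the three facts the scenario depends on."""

    def __init__(self, name, role):
        self.name = name
        self.role = role                  # "master" or "slave"
        self.alive = True
        self.synced = (role == "master")  # a master trivially holds the dataset


def failover(nodes):
    """Promote the first live, fully synced slave; return it, or None if there is none."""
    for node in nodes:
        if node.alive and node.role == "slave" and node.synced:
            node.role = "master"
            return node
    return None


def run_scenario():
    """Replay steps 1-9 and return the name promoted by each failover (None = aborted)."""
    r1, r2 = Node("R1", "master"), Node("R2", "slave")
    nodes = [r1, r2]
    results = []

    r2.synced = True      # steps 1-2: R2 completes its synchronization
    r1.alive = False      # step 3: R1 is killed
    promoted = failover(nodes)
    results.append(promoted.name if promoted else None)

    # Steps 4-6: R1 comes back as a slave and starts a full sync that has not finished.
    r1.alive, r1.role, r1.synced = True, "slave", False
    r2.alive = False      # step 7: R2 is killed in the middle of the transfer
    promoted = failover(nodes)  # steps 8-9: no slave is good enough to promote
    results.append(promoted.name if promoted else None)
    return results


if __name__ == "__main__":
    print(run_scenario())
```

The first failover promotes R2. The second finds no candidate, which corresponds to the -failover-abort-no-good-slave message in the Sentinel log.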

Please leave a comment if you have any idea how to overcome this situation.

1 comment:

  1. Check your failover delay. It's usually 5 minutes, so if the slave is out of sync, the failover happens after that delay.
