Keeping two tables in MySQL in sync

We had a simple requirement. We needed a few tables part of a remote database A to be synced within our database B.

Looks pretty simple, so we set out to find all the options to make this happen.

Airbyte

The first option that we checked out was Airbyte.  At that point, for the MySql version in source, it did not support incremental sync but only incremental appends and updates but it sufficed for our use case. In order to learn the difference between these modes of syncing you may check the sections below.

Airbyte was the fastest option to try out and setup. It performed a fast initial sync but beyond that the updates started to fail a lot. Sometimes they were due to lost connection, and sometimes for unknown reasons. Also all the failed updates caused some residual data to be left in the destination database which quickly caused the disk to fill up. I believe Airbyte was still in a beta phase and all these processes were not very clean at that point. The team was supportive and I tried various methods but eventually, we moved to other solutions.

pt-table-sync

It's a tool in Percona toolkit which is very highly regarded. I learnt that for replication it needs the binlog format to be STATEMENT, otherwise it will try to set it via SUPER privilege for the session.  Unfortunately, we did not have the right permissions for the setup and therefore this was not an option.

ZongJi

Finally, we moved on to cooking up our own solution using ZongJi, which is a simple binlog reader for MySql.  It requires binlog format to be ROW, which was suited for us and this along with mysql-events gave us binlog events in the master which we could easily apply to our slave tables for reliable replication.


Initial Sync

All sync methods usually start with a full copy of the initial table to the destination. The changes beyond that point are usually copied to the destination via below methods.

Incremental Sync

Incremental sync means updating all the changes from the source to the destination including inserts, updates as well as deletes. This is the best option for syncing two sources as it ensures an exact match along with the least amount of data overhead.

Usually, to achieve this, a log of all transactions on the table is needed like the binlog in MySql and oplog in MongoDB.

Incremental Appends and Updates

This ensures that all the inserts and updates are synced with the destination but deletes are not. So all the data in the destination table is correct, except that it may have some extra rows.  This is also a good option for many use cases.

To achieve this you just need a cursor field in the source table. An example of a cursor field would be say an update_time field that updates for a row when any other field in the row is changed. If this is available, we can just check for records that have update_time newer than last update and copy just those rows to the destination. Since deleted rows cannot be discovered with it, this method fails to sync deletes.

The extra deletes can then be synced by an occasional full copy.