Parallel Extractor for Provisioning

Coming up as a new feature in Tungsten Replicator (and written by our replicator expert Stephane Giron) is the ability to provision a new database by using data from an existing database. This new feature comes in the form of a tool called the Parallel Extractor.
The principles are very simple. On the master side:

  • Start the master replicator offline.
  • Switch the replicator to the online provision state.
  • The master replicator pulls the data out of the existing database and writes that information into the Transaction History Log (THL). At this point, the normal replicator thread is not extracting events from the source database.
  • Once the parallel replication has completed, the replicator switches over to normal extraction mode, and starts writing change data into the THL.

On the slave side, the THL events are read as usual from the master and applied to the slave, but because the provisioned data is inserted into the start of the THL before the main THL thread, the slave reads the provisioned data first, then the data changes that occurred since the provisioning started.
In fact, it’s best to think of it like the diagram below:
Parallel Extractor Blog THL
The parallel extraction happens in a very specific fashion:
A chunking thread identifies all the tables, and also identifies the keys and chunks that can be extracted from each table. It then coordinates the multiple threads:

  • Multiple chunks from the source tables are extracted in parallel.
  • Multiple tables are extracted in parallel.

Because both of these operations happen at the same time, the parallel extractor can pull from multiple tables and multiple chunks, meaning that the actual extraction of the data happens very quickly. In fact, tests are running at a rate of about 80 million rows/15 mins. That was from a single table.
Parallel Extractor Blog Figure
Obviously the number of parallel threads can be controlled, and in fact, the chunking is controlled further by use of a configuration file to determine the chunking configuration.
Currently, the parallel extractor is designed to work for Oracle to MySQL provisioning with Tungsten Replicator, but the same principles can be applied to MySQL-to-MySQL setups. Using the parallel extractor is deceptively simple, and you can check out the current, Oracle-related, instructions here.
What this provides is a very simple way to take an entire existing database full of data and seed your target database with that information by using the replicator. This means the Parallel Extractor could be used to provision new slaves when expanding an existing cluster, to convert a single-machine installation to use replication by seeding the slave with the existing data without needing a backup, or, as currently designed, to seed a heterogeneous replication installation with new data without having to use a complex dump, massage and reload process.