Replicating into ElasticSearch


So here at Continuent we are working on multiple new targets for applying data using Tungsten Replicator. There are so many potential targets out there where people want to replicate data directly into a specific system, sometimes just for a specific data set, table or database, or to meet a particular requirement.
Yesterday afternoon, I started working on ElasticSearch – this morning I have it finished!
As with all solutions, the same basic principles apply – want to pull out of MySQL or Oracle and into something else? That’s fine. Want to replicate to HDFS and ElasticSearch? We do that too!
So what does it look like?
Installation works just like our normal appliers – you just specify the datasource type (ElasticSearch) and the ElasticSearch host name and port:

tools/tpm configure alpha \
--datasource-type=elasticsearch \
--install-directory=/opt/continuent \
--master=ubuntuheterosrc \
--members=elasticsearch \
--replication-host=localhost \
--replication-password=root \
--replication-port=9200 \
--replication-user=root
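
As a quick sanity check, before or after installation, it’s worth confirming that ElasticSearch is actually answering on the host and port given to tpm. A plain request against the REST root will do (add credentials if your ElasticSearch instance requires them); something like:

# Confirm ElasticSearch is reachable on the configured host/port
curl http://localhost:9200/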

There are some configurable options, but I’ll get to those later. For right now, let’s just see what happens when you insert some data. Here’s a simple table in MySQL:

mysql> describe mg;
+-------+----------+------+-----+---------+----------------+
| Field | Type     | Null | Key | Default | Extra          |
+-------+----------+------+-----+---------+----------------+
| id    | int(11)  | NO   | PRI | NULL    | auto_increment |
| msg   | char(80) | YES  |     | NULL    |                |
+-------+----------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

And let’s insert some data:

mysql> insert into mg values (99999,"Hello ElasticSearch");
Query OK, 1 row affected (0.10 sec)

Now let’s have a look at what happens to that when it gets into ElasticSearch:

{
  "_id" : "99999",
  "_type" : "mg",
  "found" : true,
  "_version" : 2,
  "_index" : "msg",
  "_source" : {
    "msg" : "Hello ElasticSearch",
    "id" : "99999"
  }
}

Yay! A nice clean record in ElasticSearch, so we can now search on the data it contains.
Incidentally, the information was written in using a Document ID made up of the primary key (more on that in a minute), and written into an index and type based on the schema and table.
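So in this case the document ends up at /msg/mg/99999, and if you want to check it for yourself it’s just a straight document GET against the REST API (host and port as configured earlier), something like:

curl -XGET 'http://localhost:9200/msg/mg/99999?pretty'
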
Obviously we’re writing in a full record here – but keep in mind that this is the replicator, and we could have filtered out columns or even tables from the generated content. We’re trying to keep with the operational perspective of writing everything over to the target that we’ve been asked to.
Also be aware that we do this on a per-row basis. That is, every single row updated/inserted is written as a single entry into the ElasticSearch index.
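For example, loading ten rows in a single transaction on the MySQL side results in ten separate documents in the index; a quick count against the type is an easy way to confirm it, something like:

curl -XGET 'http://localhost:9200/msg/mg/_count?pretty'
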
That said, there are quite a few things that we can control:

  • By default, we treat the incoming schema name as the ElasticSearch ‘index’ and the incoming table name as the ElasticSearch ‘type’. So for example, with the schema ‘blog’ and the table ‘posts’ you are going to get data written into /blog/posts/ID. You can change this behaviour by setting an explicit index and/or type name – this obviously writes everything into the target with those specific values, regardless of the incoming schema or table name, but maybe you just want one big index of all the data. So, by setting an explicit index of ‘allmybigdata’ and a type ‘rawtext’, everything gets written to /allmybigdata/rawtext/ID.
  • The difficulty with the above approach is that it limits your ability to search based on other values. Maybe the incoming data is from multiple blogs, but you still want to be able to tell them apart when searching; for that, there’s also an option to embed the schema name and table name into the data too (see the search example after this list):
    {
      "_source" : {
        "id" : "9",
        "source_table" : "mgg",
        "msg" : "Barneyrubble",
        "source_schema" : "msg",
        "idb" : "5",
        "committime" : "2017-05-11 11:30:04.0"
      },
      "_id" : "95",
      "_version" : 1,
      "found" : true,
      "_index" : "msg",
      "_type" : "mgg"
    }
  • You can also see in the above that we embed a ‘committime’ if asked to, in case you want to search on that as well.
  • Incidentally, one other thing about the above record: it actually has a compound key on the MySQL side – you can see that there are two ID fields, ‘id’ and ‘idb’, and the ElasticSearch _id is ’95’.
  • The format of the document ID is configurable, so you can use:
    • The primary key (including compound ones), with everything combined into a single string, i.e. key (9,5) becomes 95.
    • The primary key using underscores, i.e. (9,5) becomes 9_5.
    • The schema, table and primary key, i.e. (9,5) in msg.mg becomes msgmg95.
    • The schema, table and primary key with underscores, i.e. (9,5) in msg.mg becomes msg_mg_9_5.
  • Updates work exactly as you expect – they update the record directly, as we do a *proper* update, so the _version is updated appropriately
  • Deletes work as expected too
  • Document IDs can be configured so that an ElasticSearch auto-generated value is used in place of an incoming primary key. However, be aware that if you use this, then we are unable to do deletes or updates, because we cannot track the generated ID and lookups would be expensive.
  • To stop that being a problem, you can configure the applier to ignore errors when performing a delete or update.
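
To give a concrete example of why the embedded schema and table names are useful: once source_schema and source_table are in each document, pulling back only the rows that came from a particular source table is a simple search on those fields. Using the URI search syntax against the same host and port as before (field names as in the sample record above), something like:

curl -XGET 'http://localhost:9200/_search?q=source_table:mgg&pretty'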

These are all configured through the usual properties, and the defaults look like this:

replicator.applier.dbms.ignoreDeleteErrors=false
replicator.applier.dbms.ignoreUpdateErrors=false
replicator.applier.dbms.docIdFormat=pkey
replicator.applier.dbms.selfGeneratedId=false
replicator.applier.dbms.useSchemaAsIndex=true
replicator.applier.dbms.indexName=
replicator.applier.dbms.useTableAsType=true
replicator.applier.dbms.typeName=
replicator.applier.dbms.embedSchemaTable=true
replicator.applier.dbms.embedCommitTime=true
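
So, for example, to get the single big index described earlier rather than the per-schema/per-table naming, you would set the explicit index and type names (I’ve also switched off the schema/table mapping here for clarity, though setting the names alone may be enough). With tpm, one way to do that is to pass the properties when configuring the service; the names below are just the illustrative ones used above:

tools/tpm configure alpha \
--property=replicator.applier.dbms.useSchemaAsIndex=false \
--property=replicator.applier.dbms.useTableAsType=false \
--property=replicator.applier.dbms.indexName=allmybigdata \
--property=replicator.applier.dbms.typeName=rawtext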

Currently, all of these are global settings – I’m toying with the idea of using these as defaults, and then having a separate JSON configuration file that would be able to set these values on a per schema/table basis. I’d be interested to hear if anybody would find this useful. While I like this approach, it would add some processing overhead we might want to avoid. In reality, the better way to do this would be to configure separate services in the replicator to handle that process.
Some things that I am still checking and investigating:

  • Performance – Currently I’m seeing about 125 rows per second into ElasticSearch. This is in a VM with just 2 CPUs and 2GB RAM. I suspect we could increase this.
    I also have not in any way tested a more random workload, like Sysbench, or checked the compatibility with our own multi-threaded/parallel apply.
  • Latency – Latency right now is down in the µs, about where you’d expect. Obviously, it depends on the incoming data, but worth looking at.
  • Start/Stop/Restart – This first version contains *complete* restart ability as you would expect with the replicator. However, I haven’t added support to some of our other tools, like dsctl. I’ll address that in a future release.
  • Datatype Support – I’ve only tried a few tables so far, and nothing substantial like textual or logging data.
  • Currently, we send individual rows as individual REST requests; I don’t use an open channel with regular submissions (which might improve performance), or any kind of batching. These are only going to improve large data loads and dumps, rather than more traditional streaming replication.

So there’s still some work to do, but the basic process is currently perfectly serviceable.
More important, as far as I’m concerned, is that this basic applier is done and ready to be released to the public in our upcoming Tungsten Replicator 5.2.0, which is due at the end of June. That gives us about a month to complete testing and address some of the above issues.
If you would like to test out the new applier for ElasticSearch, please email me (mc.brown@continuent.com). I’m interested to get as much input and testing as possible.