MySQL to Hadoop Step-By-Step


We had a great webinar on Thursday about replicating from MySQL to Hadoop (watch the whole thing). One of the questions at the end was ‘is there an easy way to test this?’
Sadly we can’t give out convenient ready-to-run downloads of these things because of licensing and other complexities, so I want to make this as simple and straightforward as possible by giving you the directions to complete it yourself. I’ll point to the Continuent documentation every now and then so that this post doesn’t get too crowded, but we should get through it pretty easily.
Major Decisions
For this to work: 

  • We’ll set up two VMs: one the master (running MySQL), the other the slave (running Cloudera)
  • The two VMs must be able to reach each other on the network. It doesn’t matter whether they use internal, NAT, or bridged networking; they just need to be able to ping and SSH to each other. Switch off firewalls to prevent port weirdness.
  • For convenience, update your /etc/hosts to include a host1 (the master) and host2 (the slave); see the example below
  • The master must have followed the prereqs; for the slave they are optional, but highly recommended
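
For example, the /etc/hosts entries on both machines might look like this (the addresses are illustrative; use whatever your VMs were assigned):

192.168.0.2 host1
192.168.0.3 host2

A quick ping host2 from the master and ping host1 from the slave confirms the two can see each other.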

With that in mind, let’s get started.
Step 1: Setup your Master Host
There are a number of ways you can do this. If you want to simplify things and have VirtualBox, try downloading this VM. It’s a 1.5GB download containing an OVF VM; it’s an Ubuntu host with our prerequisites already applied. To use it:

  1. Uncompress the package.
  2. Import the VM into your VirtualBox.
  3. If you want, change the network type from Internal to a bridged or NAT environment.

Using internal networking, you can log in to it using:

shell> ssh -p2222 tungsten@localhost

The password is ‘password’ for both the tungsten and root users.
If you would rather build your own VM:

  1. Create a new VM with 1-2GB of RAM and 8GB or more of disk space.
  2. Install your OS of choice, either Ubuntu or CentOS.
  3. Follow our prerequisite instructions.
  4. Make sure MySQL is set up and running, and that binary logging is enabled (see the example configuration after this list).
  5. Make sure it has a valid IP address.
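
If you built your own VM, the relevant my.cnf settings for binary logging look something like this (a minimal sketch; the file location varies by distribution, and row-based logging is assumed here since it suits a heterogeneous target like Hadoop):

[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-format = ROW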

Step 2: Setup Tungsten Replicator
Download the latest Tungsten Replicator binary from this page.
Unpack the file:

shell> tar zxf tungsten-replicator-3.0.tar.gz

Change into the directory:

shell> cd tungsten-replicator-3.0

Create a new replicator installation; this will read from the MySQL binary log into THL (the Transaction History Log):

shell> ./tools/tpm install alpha \
--install-directory=/opt/continuent \
--master=host1 \
--members=host1 \
--java-file-encoding=UTF8 \
--java-user-timezone=GMT \
--mysql-enable-enumtostring=true \
--mysql-enable-settostring=true \
--mysql-use-bytes-for-string=false \
--svc-extractor-filters=colnames,pkey \
--property=replicator.filter.pkey.addColumnsToDeletes=true \
--property=replicator.filter.pkey.addPkeyToInserts=true \
--replication-password=password \
--replication-user=tungsten \
--skip-validation-check=HostsFileCheck \
--skip-validation-check=ReplicationServicePipelines \
--start-and-report=true

For a full description of what’s going on here, see this page and click on the magnifying glass; you’ll get a full description of each option.
To make sure everything is OK, check the status with trepctl, as shown below. If the replicator is running and shows its state as ONLINE, we’re ready.
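
shell> trepctl status

The state field should read ONLINE, and pendingError should show NONE.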
Step 3: Get your Cloudera Host Ready
There are lots of ways to get Cloudera’s Hadoop solution installed. The ready-to-run VM is the simplest by far.

  1. Download the Cloudera QuickStart VM from here; there are versions for VirtualBox, VMware, and KVM.
  2. Set the networking type to match the master.
  3. Start the host.
  4. Set the hostname to host2.
  5. Update the networking to an IP address that can talk to the master.
  6. Update /etc/hosts to add the IP addresses of host1 and host2, e.g.:

192.168.0.2 host1
192.168.0.3 host2
Finally, add a ‘tungsten’ user, which we will use to install Tungsten Replicator; a minimal sketch follows.
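Something like the following works on the Cloudera VM (a sketch; pick your own password, and adjust if your distribution differs):

shell> sudo useradd -m tungsten
shell> sudo passwd tungsten
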
Step 4: Install your Hadoop Slave
Download the latest Tungsten Replicator binary from this page.
Unpack the file:

shell> tar zxf tungsten-replicator-3.0.tar.gz

Change into the directory:

shell> cd tungsten-replicator-3.0

Create a new replicator installation; this will read the information from the master (host1) and apply it to this host (host2):

shell> ./tools/tpm install alpha \
--batch-enabled=true \
--batch-load-language=js \
--batch-load-template=hadoop \
--datasource-type=file \
--install-directory=/opt/continuent \
--java-file-encoding=UTF8 \
--java-user-timezone=GMT \
--master=host1 \
--members=host2 \
'--property=replicator.datasource.applier.csv.fieldSeparator=\\u0001' \
--property=replicator.datasource.applier.csv.useQuotes=false \
--property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
--property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
--replication-password=secret \
--replication-user=tungsten \
--skip-validation-check=DatasourceDBPort \
--skip-validation-check=DirectDatasourceDBPort \
--skip-validation-check=HostsFileCheck \
--skip-validation-check=InstallerMasterSlaveCheck \
--skip-validation-check=ReplicationServicePipelines \
--start-and-report=true

For a description of each option, visit this page and click on the second magnifying glass.
As before, we want everything to be running and the replicator to be online. To check, run:

shell> trepctl status

This should tell you everything is running. If you get an error about the command not being found, source the environment file to populate your PATH correctly:

shell> source /opt/continuent/share/env.sh

We want everything to be online and running. If it isn’t, use the docs to help determine the reason, or use our discussion group to ask questions.
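A quick way to narrow down problems is to filter the status output for the state and any pending error details (the grep pattern is just a convenience):

shell> trepctl status | grep -E 'state|pendingError'
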
Step 5: Generating DDL
For your chosen MySQL database schema, you need to generate the staging and live table definitions for Hive.
A tool, ddlscan, is provided for this. You run it with the JDBC connection string for your database, plus a user and password; if you followed the prereqs, use the credentials for the tungsten user.
First create the live table DDL:

shell> ddlscan -user tungsten -url 'jdbc:mysql://host1:3306/test' -pass password -template ddl-mysql-hive-0.10.vm -db test > schema.sql

Now apply it to Hive:

shell> cat schema.sql | hive

To create Hive tables that read the staging files loaded by the replicator, use the ddl-mysql-hive-0.10-staging.vm template on the same database:

shell> ddlscan -user tungsten -url 'jdbc:mysql://host1:3306/test' -pass password -template ddl-mysql-hive-0.10-staging.vm -db test > schema-staging.sql

Now apply it to Hive again:

shell> cat schema-staging.sql | hive
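
To confirm the definitions made it in, you can list the tables from Hive (assuming the generated DDL created them in a database named after the MySQL schema, test in this example):

shell> echo 'USE test; SHOW TABLES;' | hive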

Step 6: Start Writing Data
Hopefully by this point you’ve got two VMs: one running MySQL with the master replicator extracting information from the MySQL binary log; the other running a basic Cloudera instance with a slave replicator writing changes. Both replicators should be online (use ‘trepctl status’ to check).
All you need to do is start writing data into the tables you selected when creating the DDL. That should be it: you should see data start to stream into Hadoop.
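
For example, with a hypothetical table in the test schema that was included when you generated the DDL:

shell> mysql -h host1 -utungsten -ppassword test
mysql> INSERT INTO sales (id, amount) VALUES (1, 9.99);

On the slave, trepctl status should show appliedLastSeqno advancing as the change is extracted, converted to CSV, and loaded into Hadoop.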


5 responses to “MySQL to Hadoop Step-By-Step”

  1. Excellent! Thank you for the write-up, as I was the one who asked for a way to try/test out the setup. Besides the creation of the two Tungsten setups, nothing seems too complicated, and I am sure that after a read-through of each setting things will become clearer.

    • Hi Keith
      Glad it got to the right person (I was kinda hoping you would see this).
      Let me know if you have issues, and I’ll see what I can do. Meanwhile, I’m trying to go one step further and build some VMs you can download. They should be ready this week, travel permitting 🙂

  2. Hi,
    First of all, thanks a lot for providing the VMs and detailed steps to set up the test environment. It was really helpful for me.
    I was able to set up Tungsten Replicator on both the MySQL master side and the Hadoop slave side by following the steps you have provided. I used the “trdemo1” virtual machine as the MySQL master, and the Hortonworks Sandbox as the Hadoop slave. However, the data at the MySQL side does not automatically start streaming to the Hadoop side upon the completion of ddlscan. I have attached the status logs for both master and slave. It would be great if you could let me know whether I’m missing any necessary steps in between. Thanks.
    Status in MySQL master side:
    NAME VALUE
    ---- -----
    appliedLastEventId : mysql-bin.000010:0000000000006179;-1
    appliedLastSeqno : 40
    appliedLatency : 0.724
    channels : 1
    clusterName : alpha
    currentEventId : mysql-bin.000010:0000000000006179
    currentTimeMillis : 1394539165891
    dataServerHost : trdemo1
    extensions :
    host : trdemo1
    latestEpochNumber : 27
    masterConnectUri : thl://localhost:/
    masterListenUri : thl://trdemo1:2112/
    maximumStoredSeqNo : 40
    minimumStoredSeqNo : 0
    offlineRequests : NONE
    pendingError : NONE
    pendingErrorCode : NONE
    pendingErrorEventId : NONE
    pendingErrorSeqno : -1
    pendingExceptionMessage: NONE
    pipelineSource : jdbc:mysql:thin://trdemo1:13306/
    relativeLatency : 835.891
    resourcePrecedence : 99
    rmiPort : 10000
    role : master
    seqnoType : java.lang.Long
    serviceName : alpha
    serviceType : local
    simpleServiceName : alpha
    siteName : default
    sourceId : trdemo1
    state : ONLINE
    timeInStateSeconds : 1550.317
    transitioningTo :
    uptimeSeconds : 1551.466
    useSSLConnection : false
    version : Tungsten Replicator 3.0.0 build 23
    Status in Hadoop slave side:
    NAME VALUE
    ---- -----
    appliedLastEventId : mysql-bin.000010:0000000000006179;-1
    appliedLastSeqno : 40
    appliedLatency : 0.0
    channels : 1
    clusterName : alpha
    currentEventId : NONE
    currentTimeMillis : 1394539242914
    dataServerHost : sandbox.hortonworks.com
    extensions :
    host : sandbox.hortonworks.com
    latestEpochNumber : 27
    masterConnectUri : thl://trdemo1:2112/
    masterListenUri : null
    maximumStoredSeqNo : 40
    minimumStoredSeqNo : 0
    offlineRequests : NONE
    pendingError : NONE
    pendingErrorCode : NONE
    pendingErrorEventId : NONE
    pendingErrorSeqno : -1
    pendingExceptionMessage: NONE
    pipelineSource : thl://trdemo1:2112/
    relativeLatency : 912.914
    resourcePrecedence : 99
    rmiPort : 10000
    role : slave
    seqnoType : java.lang.Long
    serviceName : alpha
    serviceType : local
    simpleServiceName : alpha
    siteName : default
    sourceId : sandbox.hortonworks.com
    state : ONLINE
    timeInStateSeconds : 315.832
    transitioningTo :
    uptimeSeconds : 322.142
    useSSLConnection : false
    version : Tungsten Replicator 3.0.0 build 23

    • Hi,
      Glad you got it to work!
      The replicator works by transferring information from the binary log; it doesn’t extract all of the existing data from MySQL, only the new content that gets written into the binary log. That initial pull of existing data is called provisioning.
      We have some solutions that will be available soon to move that initial data, for example using Sqoop to provision the existing rows, after which the replicator streams the additional changes; a rough sketch of such an import is at the end of this reply.
      If you try creating data on the MySQL server, you should see the changes and CSV files make it over into Hadoop.
      There’s an upcoming blog on this topic which should come out this week.
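      As a very rough sketch (the table name and target directory are illustrative, and you would repeat the import per table), such a Sqoop run might look like:
      shell> sqoop import --connect jdbc:mysql://host1:3306/test \
        --username tungsten --password password \
        --table sales --target-dir /user/tungsten/staging/sales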