Data Migration: Understanding the Challenges


Data migration – that is, the practice of sharing and distributing information between databases – requires some very careful consideration. Are you moving the data permanently or temporarily, or sharing it between applications? Do you want to share all of it, or only some of it? Are you changing databases, or moving some data so that you can access or use it in a more efficient system?
Let’s start by looking at what we mean by a database, and at the myriad of different databases that are out there.


 
Walk up to any person at an IT conference or gathering twenty-five years ago and ask them to name a database, and most would probably have picked one of the handful of tools available at the time. All of those databases would have been the same type: some kind of fixed-record database management system, along the lines of dBase III+ or Oracle.
These had some very specific layouts and formats: the record would have had a fixed size, based on fixed fields, often with fixed widths. The reasons for this were largely technical, because the way to store data efficiently was in records of a fixed size. Each record was made up of fields, each with a fixed size. To read a record, you needed the definition and then just extracted the bytes, as shown in Figure 1-1.
Figure 1-1.png
Figure 1-1: Fixed Record and Field Sizes
To access a different record, you could ‘seek’ ahead in the file according to the size of the records and the number of the record you wanted. For example, to read record number 15 you would skip forward to byte offset 14 × RECORDSIZE, read RECORDSIZE bytes, and then extract the field data using the known record structure. This meant that records were treated as one big, long block of bytes, as shown here in Figure 1-2.
Figure 1-2.png
Figure 1-2: Fixed Records as a stream of data
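As a rough illustration of that seek-and-extract approach, here is a minimal Python sketch. The record layout (a 20-byte name, a 30-byte title and a 4-byte year) and the sample data are assumptions made up for the example, not the format of dBase or any other real product.

```python
import io
import struct

# Hypothetical record layout: 20-byte name, 30-byte title, 4-byte unsigned year.
RECORD = struct.Struct("<20s30sI")
RECORDSIZE = RECORD.size  # every record occupies exactly this many bytes


def pack_record(name: str, title: str, year: int) -> bytes:
    """Encode one record; struct pads short strings with NUL bytes."""
    return RECORD.pack(name.encode("ascii"), title.encode("ascii"), year)


def read_record(f, number: int) -> tuple:
    """Read record `number` (1-based) by seeking straight to its offset."""
    f.seek((number - 1) * RECORDSIZE)
    name, title, year = RECORD.unpack(f.read(RECORDSIZE))
    return (name.rstrip(b"\0").decode("ascii"),
            title.rstrip(b"\0").decode("ascii"),
            year)


# An in-memory "file" holding 20 records back to back.
data = io.BytesIO(b"".join(
    pack_record(f"Author {i}", f"Book {i}", 1980 + i) for i in range(1, 21)
))

# Record 15 starts at byte offset 14 * RECORDSIZE.
record_15 = read_record(data, 15)
```

Because every record has the same size, no earlier record has to be parsed to find the fifteenth one; the offset is pure arithmetic.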
In fact, this was a very simple data model that was (and still is) thoroughly practical – many young developers and programmers may well have created a database using this very model. It even works if you use indexes – you can point directly to a record using the same system.
It may surprise you to know that for some databases this is still the fundamental model at the lower levels, although there may be some additional complexities and features. But over those same 25 years things have changed in two different directions: data formats and data diversity. Those two changes have led to a new level of complexity in the database systems that manage the data.
Although it can be useful to understand how the data is physically stored by the database, the focus of this series is one level higher. We want to consider how the data is structured (fields, records, documents), how it is formatted and encoded, and finally how the entire database appears and is usable within your chosen database system. More importantly, we want to know how to move it all elsewhere. Before we get there, let’s look at the top level: database types.

Database Types

My earliest database – at age eight – was one that I built to catalogue my book collection using my Sinclair ZX81, with the software written entirely in BASIC. By the time I was 13 I had started to build custom applications using dBase III+ to manage my father’s accounts. When I left college, my first job was to move data, first from an old Digital Unix system to the new Sun Solaris 2 using the same database engine, BRS/Search, and then from BRS/Search to Oracle. BRS/Search was a completely free-form database.
The aim of this process was to move that free-form store into a structured format (Oracle, an RDBMS) and to access it using a front-end built using a Macintosh-specific RDBMS engine called 4th Dimension. In the background, we also started putting different classes of data into the then-brand-new Macintosh-specific database called FileMaker.
Since those early days I’ve worked with (and on) PostgreSQL, MySQL, Oracle, Microsoft SQL Server, Microsoft Access, CouchDB, Berkeley DB, SQLite, Couchbase, MongoDB, Cassandra, DB2, and most recently Hadoop, to name just a few. They all have different characteristics (this is the primary reason they exist at all, in fact) and capturing the essence of each group of databases is our first step on the road to understanding how to move data between them.
The point here is not to list what I have experience of (although hopefully that helps explain the reasoning and experience behind the content here), but to demonstrate that there is a huge array of choice out there today. These databases all have different parameters, different methods of storing data, different supported formats, and a huge array of methods for reading, querying and extracting the information.
But what exactly turns a collection of data from just that, a string of bytes, into a database? And how does that affect how we move data between them? Let’s look at some basic database principles. This will not be new information, but these are vital concepts to understand so that we can translate and refer to them through the rest of the series.

Database Principles

What is a database?
That is not an innocent question, and the answer depends entirely on the database system, its type and the individual solution.
However, it can be summed up in two sentences:

A database enables individual, addressable blocks of information to be stored efficiently. These blocks can also be retrieved, and potentially searched and indexed, so that the information can be found effectively.

Whenever you look at a database and how to store, retrieve and update the information, you need to consider how the information within the database is accessed.
All databases share the same basic principles when it comes to working with the information itself. They must all provide the following functionality, referred to as CRUD (Create, Read, Update, Delete):

  • Create – data must be able to be created within the database, and this can be done on a record or block basis, or in a batch mode where data is created in bulk.
  • Read – data must be able to be read back out. By their very nature, all databases must be able to do this on a selective basis, either by record, or by a group of records. More complex databases enable you to achieve this more selectively, for example, by selecting all the cars that are blue, or all the invoices raised for Acme Inc.
  • Update – data must be able to be updated. Again, as with reading, this must be possible on a record by record basis. Updates may also involve bulk modification of multiple records and even multiple fields simultaneously.
  • Delete – data must be deletable or removable on a record by record basis, involving either single or multiple records simultaneously.
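The four operations above can be sketched with Python’s built-in sqlite3 module. SQLite is used here only because it needs no server; the cars table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT, colour TEXT)")

# Create: a single record, then a batch of records.
conn.execute("INSERT INTO cars (model, colour) VALUES (?, ?)", ("Mini", "red"))
conn.executemany(
    "INSERT INTO cars (model, colour) VALUES (?, ?)",
    [("Golf", "blue"), ("Fiesta", "blue"), ("Panda", "green")],
)

# Read: a selective read of all the cars that are blue.
blue_cars = [row[0] for row in conn.execute(
    "SELECT model FROM cars WHERE colour = ? ORDER BY id", ("blue",))]

# Update: one statement modifying several records at once.
conn.execute("UPDATE cars SET colour = ? WHERE colour = ?", ("navy", "blue"))

# Delete: remove a single record.
conn.execute("DELETE FROM cars WHERE model = ?", ("Panda",))

remaining = conn.execute("SELECT model, colour FROM cars ORDER BY id").fetchall()
```

When planning a migration, it is worth checking which of these four statements has a direct equivalent in the destination system and which needs a workaround.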

Understanding the significance of these different operations within different databases is important to getting the movement and migration of information correct. Some databases can, by design, only support certain levels of these operations. Some provide implicit and explicit deletion of records, and others may deliberately not support update operations.
To further complicate matters, performance should always be a consideration for certain types of data migration. Most analytical and data warehouse platforms benefit from large, batched, or combined updates. Hadoop, for example, works badly with a large number of small files, because every file carries its own metadata and processing overhead however little data it holds. Hadoop is also, by design, an append-only system, which means updates are more complex to handle.
Contrast this with Memcached, where bulk writes or updates are supported, but where for reasons of cache efficiency you do not want large batches of data to be updated simultaneously as it would invalidate large portions of the cache.

Data Formats

Different databases store and structure information differently. Some use records, some use fields, some use documents. Some expect data to be highly structured, where a single ‘database’ may consist of tens, hundreds or even thousands of different tables for different pieces and types of information. At the opposite end of the scale, some just have a record with no further classification or identification.
How to migrate between these structures will be discussed throughout the series, and the general principles will be examined in closer detail in a future post, when we look at Data Mapping and Transformations.

Datatypes

Depending on the database in use, specific datatypes may be used or enforced for the data that is stored. For example, there may be both character (string) and numeric datatypes. Although it is possible to store numeric information in a string column, there are often benefits to a numeric type, including more efficient storage (and therefore faster operation), and the ability to perform specific operations, such as a SUM() or AVG() function on a numeric column, without having to translate each individual string into an integer or floating-point value.
Datatypes and their identification and translation are a major focus of a future post on Data Mapping and Transformations.
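A small Python sketch shows why the distinction matters in practice. The values are invented; the point is only that the same three amounts behave differently depending on whether they are held as text or as numbers.

```python
amounts_as_text = ["10", "9", "100"]
amounts_as_numbers = [10, 9, 100]

# Numeric values can be aggregated directly.
total = sum(amounts_as_numbers)
average = total / len(amounts_as_numbers)

# Text values must be translated one by one before the same sum works.
total_from_text = sum(int(value) for value in amounts_as_text)

# Ordering also differs: text sorts character by character.
sorted_text = sorted(amounts_as_text)
sorted_numbers = sorted(amounts_as_numbers)
```

Here sorted_text comes out as "10", "100", "9", which is rarely what anyone wants from a column of amounts, so a migration that silently turns numbers into strings can change query results as well as storage size.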

Indexes

All databases are predicated on the need to access the information within them very quickly. Consider a simple contact database with just 20 records in it. To look for the record with the name ‘MC Brown’ in it requires us to look at every record until we find the matching one. Of course, there may be more than one such record, so even if we find that the first record matches, we still have to iterate over 20 records to find all the matching entries.
With 20 records this isn’t a problem; with 20,000,000 records it is inefficient. Indexes bridge the gap by allowing the database to be addressed more efficiently. There are different algorithms for creating indexes that are beyond the scope of this text, but in all cases, the role of the index is to provide quicker access to information than could be achieved through a sequential scan.
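The difference between a scan and an index lookup can be sketched in a few lines of Python. The contact records are made up, and the dictionary is only a stand-in for the B-tree or hash structures a real database would typically maintain.

```python
from collections import defaultdict

# Hypothetical contact records.
contacts = [
    {"id": 1, "name": "MC Brown"},
    {"id": 2, "name": "A Smith"},
    {"id": 3, "name": "MC Brown"},
    {"id": 4, "name": "B Jones"},
]


def scan(records, name):
    """Sequential scan: every record is inspected, even after a match."""
    return [r["id"] for r in records if r["name"] == name]


def build_index(records):
    """Build the index once: each name maps to the ids of matching records."""
    index = defaultdict(list)
    for r in records:
        index[r["name"]].append(r["id"])
    return index


name_index = build_index(contacts)

via_scan = scan(contacts, "MC Brown")
via_index = name_index.get("MC Brown", [])  # one lookup, no iteration
```

The index has to be built and kept up to date as records change, which is the cost paid for the faster reads.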

Database Types

There are myriad ways in which you can identify and classify different databases, and the classification depends on what aspect of the database you are looking at. For example, SQL was for a long time associated exclusively with structured RDBMS engines, but has now become a data interface standard of its own and is used in both RDBMS and non-RDBMS environments. For the purposes of our understanding, we’ll examine them according to how they organise and classify their data.
Through the rest of this series, we concentrate on three major types: RDBMS, NoSQL and Big Data.

Structured and Relational Database Management Systems (RDBMS)

Examples: Oracle, MySQL, PostgreSQL, Microsoft SQL Server, Microsoft Access, FileMaker Pro

Most structured database systems tend to have a relational database core (RDBMS), and are most often, but not always, accessed through the Structured Query Language (SQL). When talking to people about databases, an RDBMS with SQL is what they will think of first, because it matches the idea of a database with strict structure and types. The highly structured nature requires a rigid method of storing and retrieving information. It also places limitations on your data types and structure. A simple layout is shown in Figure 1-3.
Figure 1-3.png
Figure 1-3: A structured RDBMS table diagram
Structured databases have a few specific characteristics:

  • Strict data structure – data is stored within fixed named silos (databases), within named tables, and with each table having a fixed number of named columns. Every single record within each table has the same number of fields (columns), and each column is used for a specific purpose or piece of information.
  • Strict data types – for example, an RDBMS will store integers and floats differently, and may have additional data types designed to provide fast access to specific information, for example, the SET and ENUM types within MySQL.
  • Data Definition Language (DDL) – related to the elements above, the DDL within any database is important because it provides a reference structure which can be used to replicate that structure in other databases. Depending on the database system, the DDL may be implicit in the way the data is accessed or stored, or in the API and interfaces provided, or it may be more explicit, as in the dialects of SQL and similar statement-based interfaces.
  • Data Manipulation Language (DML) – typically, but not always, SQL. The DML enables you to perform the CRUD operations through which the information is managed. Like DDL, the exact interface is very database specific. Some databases and systems rely entirely on a statement-based language like SQL, which has its own dialects and structures for performing the updates. Others rely entirely on the API that interfaces between client applications and the database storage.
  • Relational capability – because the data is in a fixed format and with fixed types, it is possible to create specific relations between a field in one table and a field in another table. This enables the data to be JOINed together to provide a unified output. For example, if you have orders and invoices, it’s possible to link the order and the invoice by a unique ID, and the database can either use or explicitly enforce the relationship. Relationships are further characterised by their type: many-to-one (for example, multiple invoices relating to one client), one-to-many (one invoice number referring to multiple invoice lines) and one-to-one (invoice to payment received).
  • Constraints and Indexes – constraints enable data to be created within a limited subset, or to identify rows uniquely. For example, a primary key constraint can force the table to create new records only with a new unique identifier. Indexes are used to create efficient methods for looking up and identifying data according to criteria. Within an RDBMS indexes are generally used to speed up access on a specific column, or multiple columns, to improve the speed of access during specific queries. Without an index, the RDBMS will default to performing a full table scan.
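Several of these characteristics can be seen together in a short sketch using Python’s built-in sqlite3 module. The clients and invoices tables are invented for the example, and the SQL is SQLite’s dialect, so other engines will differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# DDL: strict structure, strict types, a relation, constraints and an index.
conn.executescript("""
    CREATE TABLE clients (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE invoices (
        id        INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES clients(id),
        total     REAL NOT NULL
    );
    CREATE INDEX idx_invoices_client ON invoices (client_id);
""")

# DML: create the records.
conn.execute("INSERT INTO clients (id, name) VALUES (1, 'Acme Inc')")
conn.executemany(
    "INSERT INTO invoices (id, client_id, total) VALUES (?, ?, ?)",
    [(100, 1, 250.0), (101, 1, 99.5)],
)

# Relational capability: a many-to-one JOIN from invoices to their client.
rows = conn.execute("""
    SELECT clients.name, invoices.id, invoices.total
    FROM invoices JOIN clients ON clients.id = invoices.client_id
    ORDER BY invoices.id
""").fetchall()

# Constraint: the primary key rejects a second invoice with id 100.
try:
    conn.execute("INSERT INTO invoices (id, client_id, total) VALUES (100, 1, 10.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The CREATE statements are exactly the kind of reference structure that can be adapted to recreate the schema in another database before any data is moved.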

Structured/RDBMS solutions provide some of the easiest methods for exchanging data: it is generally easier to move data from a structured store to elsewhere. However, most destination databases do not support the same range of indexes. Conversely, moving data from unstructured databases of any kind into a Structured/RDBMS is harder, because you have to decide what goes where.

NewSQL Databases

Examples: Clustrix, VoltDB, InfiniDB, TokuDB

Traditional RDBMS and SQL databases are designed to run on a single machine. This imposes performance and hardware limits. There is only so much memory and hard disk space that can be installed in a single machine, and if your database or performance requirements are high enough, a single server is not the solution. There are strategies for dividing the database across machines, such as sharding (splitting it up by an identifiable key, such as ID, name or geographical location), but these place a different load on your application layer, and are beyond the scope of this series.
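To give a feel for the load that sharding places on the application layer, here is a minimal Python sketch of key-based routing. The server names are hypothetical, and hash-modulo routing is only one of several possible schemes.

```python
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2"]  # hypothetical server names


def shard_for(key: str) -> str:
    """Pick a shard from a stable hash of the key, so the same key always
    routes to the same server."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# The application must perform this routing before every read or write.
placement = {user: shard_for(user) for user in ("alice", "bob", "carol")}
```

With this simple scheme, changing the number of shards remaps most keys, which is one reason sharding tends to become a data migration problem of its own.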
NewSQL databases are a modification of the Structured/RDBMS that use multiple machines in a cluster to support the database requirements. Unlike sharding and similar methods, NewSQL solutions automatically distribute the load across the machines and handle the interface, indexing and querying required to access the data.
The main elements of the database and structure, such as databases, records and fields, and all other data migration considerations are the same as for traditional RDBMS environments.

NoSQL/Document Databases

Examples: Couchbase, CouchDB, MongoDB, Cassandra, HBase

NoSQL databases actually span a wide range of different databases, originally classified by their rejection of SQL as the DDL and DML of choice, more usually relying on a direct API for accessing information. There was a resurgence of these solutions in the 2000s as people sought alternatives that were faster and simpler than the transactional RDBMS for web applications and websites.
Most NoSQL databases rely on simpler methods for accessing the information, for example by using a single document ID to retrieve a record of information. This document ID could be derived from the user’s email address, so when a user logs in or registers on a website, the document associated with that email address is accessed directly, rather than ‘looking up’ the record in a larger table of user records.
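A minimal sketch of that access pattern in Python follows. A plain dictionary stands in for the document store, and the "user::" key prefix is an assumed naming convention rather than a requirement of any particular product.

```python
documents = {}  # stand-in for the document store


def doc_id_for(email: str) -> str:
    """Derive the document ID from the email address (normalised first)."""
    return "user::" + email.strip().lower()


def register(email: str, profile: dict) -> None:
    documents[doc_id_for(email)] = profile


def login(email: str):
    """Fetch the user's document directly by ID; no search is involved."""
    return documents.get(doc_id_for(email))


register("MC@example.com", {"name": "MC Brown"})
profile = login("mc@example.com ")
missing = login("nobody@example.com")
```

The rule used to derive the ID matters during a migration: if the destination derives its keys differently, every document has to be re-keyed on the way across.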
NoSQL databases of this type can be roughly split into two groups: the columnar/tabular databases and the document databases. The columnar/tabular type includes Cassandra, Apache HBase (part of the Hadoop ecosystem), and Google’s Bigtable. Data is organised through an identifiable row ID, and a collection of associated column IDs that classify the data structure. They can look, and even act and operate, in a similar fashion to the structured RDBMS table/row/column structure. A sample column style database (in this case Cassandra) looks roughly like that in Figure 1-4.
Figure 1-4.png
Figure 1-4: A columnar (Cassandra) database structure
Document databases are completely different. Data is instead organised into a document, usually using JSON or a JSON-like structure. Unlike a table row, a document often combines different fragments of information together: for example, a contact record may store all the phone numbers, email addresses and other components within the single document for a given person. Documents, especially JSON-based documents, are also very flexible and can contain nested fields, such as an array of phone numbers, or even entire nested structures, such as the individual rows (qty, product id, description, price) for an invoice or order, all encapsulated into a single document. A simple document database structure can be seen in Figure 1-5.
Figure 1-5.png
Figure 1-5: Document Databases
Perhaps most importantly, documents in a document database do not need to be identical. In a structured RDBMS environment, every record contains every field, even if the field is not actually used for that record. In a document database, different documents, even within the same database or group, may have only one field, or may have 20. This variability is what makes them appealing, but it also adds complexity when migrating information.
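One way to see that complexity is to survey which fields actually occur before choosing a destination structure. The following Python sketch uses three invented contact documents; flattening them into rows is just one possible target shape.

```python
from collections import Counter

# Hypothetical documents from the same collection, each with different fields.
documents = [
    {"name": "MC Brown", "email": "mc@example.com"},
    {"name": "A Smith", "phones": ["555-0100", "555-0101"], "city": "Leeds"},
    {"name": "B Jones"},
]

# How often does each field appear across the collection?
field_counts = Counter(field for doc in documents for field in doc)

# The union of all fields becomes the column list of a fixed-width target;
# documents lacking a field get None in that column.
columns = sorted(field_counts)
rows = [tuple(doc.get(col) for col in columns) for doc in documents]
```

Fields that appear in only a few documents, or that hold nested values such as the phones list, are the ones that need an explicit decision when the destination expects a fixed structure.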
Most NoSQL systems have no idea of an explicit relation or join – this is often one of the aspects that makes the system faster. However, the lack of this element means that different techniques are required to store and interact with complex data.
Depending on the NoSQL solution, you may or may not have access to an index or quicker method of accessing the data. In CouchDB and Couchbase, for example, the fields of a document can be used to generate an index that provides quick searching and retrieval of information.
NoSQL databases can be easy to interact with and to migrate data to and from, depending on whether or not a strict schema is involved. For example, moving from an RDBMS to a document-based NoSQL database can be a case of converting the table records into documents identified by the primary key. It can also pay off in the long term to perform a more concerted conversion and translation of the source tables into unified documents.
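Both approaches can be sketched in Python. The orders and order lines below are invented, lists of dictionaries stand in for the source tables, and the "order::" key prefix is an assumed convention.

```python
import json

# Rows as they might come out of two related source tables (hypothetical data).
orders = [
    {"order_id": 1, "customer": "Acme Inc"},
    {"order_id": 2, "customer": "Initech"},
]
order_lines = [
    {"order_id": 1, "product_id": "P-10", "qty": 2, "price": 4.50},
    {"order_id": 1, "product_id": "P-22", "qty": 1, "price": 12.00},
    {"order_id": 2, "product_id": "P-10", "qty": 5, "price": 4.50},
]

# Simple conversion: one document per row, identified by the primary key.
simple_docs = {f"order::{row['order_id']}": row for row in orders}

# Unified conversion: fold the child rows into their parent document.
unified_docs = {}
for row in orders:
    unified_docs[f"order::{row['order_id']}"] = dict(row, lines=[])
for line in order_lines:
    child = {k: v for k, v in line.items() if k != "order_id"}
    unified_docs[f"order::{line['order_id']}"]["lines"].append(child)

# The JSON text that would be written to the document store for order 1.
order_1_json = json.dumps(unified_docs["order::1"], sort_keys=True)
```

The simple form keeps the relational shape and still needs two lookups to show an order, while the unified form returns the whole order in a single read.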

Key/value (KV) Stores

Examples: Memcached, Redis, Riak

In most broad classifications, key/value stores are treated as NoSQL, but I’ve split them out here because they have some interesting attributes that affect data exchange. A key/value store is exactly what it sounds like. A single blob of data (the value) is stored against a given key identifier. You store the information by giving the key, and retrieve the information by giving the same key. In most cases, the information can only be retrieved if you know the key. Iteration over the stored data and indexes are generally not available.
The roots of the key/value store go back to the attempt to speed up access to data where a given identifier is known, such as user id or email address. The best known key/value store is probably Memcached, which was originally developed to make use of the spare RAM of machines supporting a website (LiveJournal, a blogging platform) and enable fast access to blog entries. Since the ID of the blog could be derived from the URL being accessed, the entry could easily be looked up in Memcached. If it didn’t exist, it was looked up from a MySQL database, and the formatted/retrieved version placed into the cache with the identifying URL.
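That look-in-the-cache-first flow can be sketched in Python as follows. Plain dictionaries stand in for both Memcached and the MySQL database, and the URLs and entries are invented for the example.

```python
cache = {}  # stand-in for the key/value cache
database = {"/blog/1": "First post", "/blog/2": "Second post"}  # stand-in for the database
db_reads = 0  # counts how often the slower store is consulted


def fetch_from_database(url):
    global db_reads
    db_reads += 1
    return database.get(url)


def get_entry(url):
    """Try the cache first; on a miss, read the database and cache the result."""
    entry = cache.get(url)
    if entry is None:
        entry = fetch_from_database(url)
        if entry is not None:
            cache[url] = entry
    return entry


first = get_entry("/blog/1")   # miss: read from the database, then cached
second = get_entry("/blog/1")  # hit: served from the cache
```

After both calls db_reads is still 1, which is the whole point of the cache, and it also shows why the cached copy is derived data that can be rebuilt rather than migrated.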
Most document databases are really a modification of the key/value store. The value portion can be any data you like, from a simple string, through to a serialised object from C, Java or other languages, or a JSON document. In fact, some databases actually support both, and the only distinction between a key/value store and a document database is whether the database engine itself can identify and interact with the embedded structure. MongoDB and Couchbase, for example, have this distinction; MongoDB enables the database engine to update fields within the BSON (JSON-like) values, while Couchbase supports indexing of the JSON fields.
Key/Value stores are some of the harder databases to migrate and move data between. The lack of a structure, or the custom nature (for example a serialised language object), and the requirement to identify the record by a specific ID make exchanging data more complex.

Big Data (aka Unstructured, Semi-structured and Implied Structure Databases)

Examples: Hadoop, Apache Solr, ElasticSearch, Lucene

BRS/Search was, for the time and technology, relatively groundbreaking in that it was a full-text retrieval system. Today we would probably classify this as a ‘document’ based database, that is, one that has a structured format, although the power behind BRS/Search was the ability to perform a free-text search across an entire collection.
Today, we generally refer to these types of database as unstructured, that is, there is no discernible format or structure to the information. Although there are many different examples of this, probably the best known today is Hadoop. Without getting into the functionality or history of Hadoop, its power comes from its ability to distribute the raw data, and also to process the unstructured data and extract usable information from it.
Within Hadoop, the normal workflow is to load Hadoop with raw data, for example, the text from tweets, or web-pages, and then use that information to build an index or data structure around the information so that it can be analysed or searched. Solutions such as Solr, Lucene and ElasticSearch work in similar ways, accessing the raw text and either indexing it so that the data can be searched, or using the structure that is available to provide searching and indexing by a more specific area.
This is an example where ‘semi-structured’ data applies. Twitter data, for example, consists of the Twitter name, the tweet itself, and any tags or Twitter users the tweet was directed to. The fixed fields and the tweet go together to make it semi-structured, as it consists of both structured and free-form information.
Implied structure databases are those where the structure of the data is implied by the database, even though the underlying data may only be partially structured and described. Apache Hive, part of Hadoop, is an example of this. Hive can natively read text files and interpret them with a specific structure, converting CSV files into columns so that they can be queried by HiveQL, a simplified form of SQL. Hive can also parse more complex data, including CSV that embeds JSON and serialised data structures, all so they can be queried through a familiar interface.
However, unlike a true RDBMS, Hive only interprets the underlying format, and it performs this interpretation every time the data is accessed. At no time does the data have to be translated into Hive format (nor, really, is there one), and no indexes are created to enable quick access to the data.
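The idea of interpreting the structure on every read can be sketched in Python. This is not how Hive is implemented; it is only an illustration of the principle, with an invented column definition and two invented lines of CSV.

```python
import csv
import io

RAW = "1,MC Brown,42\n2,A Smith,17\n"  # raw text exactly as it sits in storage
SCHEMA = [("id", int), ("name", str), ("score", int)]  # hypothetical column definition


def read_table(raw_text, schema):
    """Apply the schema while reading; the stored text is never rewritten."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Every query re-parses the raw text from the start; nothing is indexed.
high_scores = [row["name"] for row in read_table(RAW, SCHEMA) if row["score"] > 20]
```

For migration purposes, the consequence is that the structure lives in the table definition rather than in the data, so both have to be moved or recreated together.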
All of these individual types are wrapped up into what I’ve classed as ‘Big Data’. This is not to say that the data needs to be of a specific size or complexity, only that it may consist of structured data, unstructured data, or any variant in between.
Moving data to and from unstructured, semi-structured, and implied structure databases entirely depends on what the information is, what structure is available, and how that structure can be used (or ignored) accordingly.