In the previous post we looked at a number of different database types and solutions, and it should be clear that there are a huge range of different terms for the different entities that make up the database structure. All the different entities fit into one of four categories, and they have significance because when moving and migrating data you need to know the source and destination type and whether you should be creating a database for every document (bad) or a document for every record (good). The components can be described as shown in Figure 1-6.
Figure 1-6: Database Terms and Structures
Most databases support the notion of four different components:
- Field – generally the smallest piece of addressable data within any database. However, not all databases identify information down to the field level. Others don’t even recognise fields at all.
- Record – a group of fields, or, a single block of identifiable information. For example, your contact information is a record made of the fields that define your name, your address, and your email address. Some databases only support the notion of a block of information and don’t care what it contains, whether that is fields or a binary string of data. Records may also involve either a fixed set of fields, or a variable group.
- Table – a group of records. Some databases assign a specific group of fields to a specific table. Others just use a table to hold or identify a collection of records with largely similar information. Some database types, such as NoSQL, do not support a table, but immediately jump from record to database.
- Database – a group of tables. Not all databases support this additional level of organisation, and in fact it tends to be those that have a significant structure at the lower levels (field, record). The database is usually used in the role of multi-tenancy, that is, the ability to store a collection of data related to a single application.
Of course, the problem is that different databases apply and support these terms differently, many use different terms, and some may blur the lines between each term to such an extent that it is impossible to tell where the different elements exist.
Let’s explain this a little further by providing some explicit examples:
- MySQL, Oracle database, IBM DB2, Microsoft SQL Server, Microsoft Access, and other relational databases tend to support all four levels with a very rigid structure in place, as you would expect from a structured RDBMS.
- Memcached knows only records (values) identified by a supplied key, and those records have no fields.
- CouchDB, MongoDB and Couchbase support different databases, and within those databases you have documents, which are logically similar to records. These documents have fields, but there is no requirement for the fields within each document to be the same from document to document. MongoDB also supports collections, which are akin to tables.
- Hadoop in it’s bare Highly Distributed File System (HDFS) native structure doesn’t understand anything, although you can place files into different directories to mimic a structure. If you use a system on top of HDFS, such as Hive, HBase or Impala, you are normally implying a typical 4-level data architecture.
In general, the ability to identify different components within the database depends on the database type, and a summary of these is provided in the table below.
Database | Fields | Records | Tables | Databases |
RDBMS | Yes | Yes | Yes | Yes |
NewSQL | Yes | Yes | Yes | Yes |
NoSQL | Mostly | Documents/Rows | Maybe | Yes |
Key/Value Stores | No | Yes, by ID | No | Maybe |
Unstructured | No | No | No | Maybe |
Now let’s have a look at the specific example database solutions, including the term used for the corresponding value:
Database Type | Database | Fields | Records | Tables | Databases |
RDBMS | Oracle | Yes | Yes | Yes | Yes |
MySQL | Yes | Yes | Yes | Yes | |
PostgreSQL | Yes | Yes | Yes | Yes | |
NewSQL | InfiniDB | Yes | Yes | Yes | Yes |
TokuDB | Yes | Yes | Yes | Yes | |
NoSQL | CouchDB | Yes, embedded in JSON | Documents | No | Yes |
Couchbase | Yes, embedded in JSON | Documents | No | Buckets | |
MongoDB | Yes, embedded in BSON | Documents | Collections | Yes | |
Cassandra | Implied in column family | Yes, implied by key ID | Implied in Column Family | No | |
HBase | Implied in columns | Yes | Implied in Column Family | No | |
Key/Value | Memcached | No | Yes, by key ID | No | Maybe |
Redis | Yes | Yes, key/value pair | No | No | |
Riak | Yes | Yes | Schema | No | |
Unstructured | Hadoop/HDFS | No | No | No | By HDFS directory |
Hive | Yes, if implied | Yes, if implied | Yes | Yes |
Although it wont be covered in this series to any significant degree, these different levels also tend to support one further distinction, and that is security. Different database solutions provide security at a variety of levels and some allow you to restrict access down to the record level. For all database systems where different databases are supported and their is some level of security or protection between them, these databases are called multi tenant databases.
As we start moving the data between databases, understanding the importance of these elements is critical. For example, when moving data from an RDBMS to Hadoop, the distinction of table or database may disappear, and the significance of individual records may be deliberately removed entirely to enable the information to be processed effectively.
In contrast, moving data from MongoDB into MySQL is easier because we can identify specific elements such as a database and a table. Where we start to become unstuck is that although documents contain a collection of fields, they may not contain the same fields across each document.
Homogeneous vs. Heterogeneous
The primary issue with exchanging information is whether you are moving data between homogeneous or heterogeneous databases. Homogeneous databases are those that are of the same type, for example, moving data from Oracle to MySQL; both are RDBMSs, both have databases, tables, records and fields, and therefore the complexity of moving data between the database is straightforward from a structural perspective. But the datatypes supported are not the same. What do you do about CLOB or RAW datatypes in Oracle when migrated to MySQL?
In a similar vein, the actual procedural process of moving data between database types is similarly affected. MongoDB and Couchbase, for example, support the same structure; JSON and BSON are largely identical, and although there are some differences, reading the data from MongoDB and writing it to Couchbase can be achieved with functions that are almost identical – get the document by it’s ID on MongoDB and set the document on Couchbase with the same ID.
Most RDBMSs can be accessed through SQL and front-ends like JDBC or ODBC, opening two connections and reading/writing are easy to do. Most support the SELECT INTO and LOAD DATA INFILE style SQL to export and import data in larger chunks. But in heterogeneous deployments the same tools are not always available. A quick, but not always accurate, description of these elements across different databases is shown in this table.
Issue | Homogeneous | Heterogeneous |
Data structure | No | Yes |
Data types | Yes | Yes |
Data Loading | No | Yes |
Data Usability | Yes | Yes |
Defining the Problem
Now that we have a good grasp of the different databases, their abilities, and their differences, it is time to take a closer look at what we mean by moving and migrating data and the problems associated with this kind of operation. Now we can finally start to define the problem of exchanging data between different databases and how that process can be tackled and resolved.
All of the following aspects must be considered in entirety before you start to exchange data, but think about it logically and holistically – you have to decide how data will be formatted, how the data is going to look (structure), how the data physically going to be transferred, and finally how it is going to be used.
Altering the Format
All data is not created the same, or in the same format, and furthermore, not all data is supported or acknowledged. Within NoSQL, for example, there may be no datatypes other than string, so you need to consider how you are going to move the data to the right type and the right format without (unnecessarily) losing data. The main considerations are:
Differences in supported types – you may have to choose between migrating to the next nearest, or most appropriate type. NoSQL and all Big Data targets tend not to have strong datatypes, whereas RDBMS database have very strong typing. You must choose a type that is able to handle the data in the way you want, and be able to hold the size of the information being inserted. Large text data, for example, may be too long to fit in a CHAR or VARCHAR column, and may need to be inserted into a BLOB or RAW column.
Differences in type definitions – databases have different definitions of different types. For example, Amazon RedShift supports only 19 digits of precision for floating-point values, while MySQL supports up to 53. Dates and times are also typically represented different, with some only supporting an explicit date type, or supporting a combined date time, or supporting a time with heavily restricted precision. All these differences mean that you may wish to store values outside the given range as a different type; for example, storing dates or timestamps-point values and dates as strings so as not to lose data.
Differences in type interpretation – generally a difficult problem to resolve without extensive testing, some datatypes can be interpreted incorrectly when the data is moved into a target database. String encoding – for example ASCII and Unicode, or bit-specific fields can cause issues. Also timestamps which may be interpreted during import as being subject to time differences; for example, if you exported on a server using Pacific Standard Time (PST) but imported on a different database using Central European Standard Time (CEST).
These issues must be considered in entirety before you exchange data; getting it wrong could lead to incorrect, invalid, and even completely corrupt information.
Altering the Structure
It should be clear right now that there are differences in the structure of the different database types. What may not be clear is that there are more options available to you than a simple direct association from one type to another. Instead you must make sure that the data is exchanged in an effective manner appropriate the information that is being exchanged.
For certain combinations the structure may appear obvious, but there is always the possibility that you the structure and information can be more effectively organised. For example, when moving from an RDBMS to a document store, the first intention is simply to place the different tables and structure them as different documents within the target database. This is fine, but adds complications you may want to avoid when you come to use it. Instead, merging the different tables into one larger document with nested components may simplify the use of the data in the target application.
The same can be true in reverse, exploding a single document into multiple, related, tables. Alternatively, you may want to take advantage of specific functionality in the RDBMS, such as XML fields, sets, enums or even convert the information to embedded JSON or serialised language variables if that makes sense to your application.
Loading the Information
Physically transferring the information seems like the most mundane of the processes in the entire scheme of exchanging data between systems, but in actual fact, it is is less clear than you might think. We’ll look at this in more detail when examining specific examples and database exchange projects, but some upfront issues to consider:
Does the solution include a native bulk loading system. Some databases specifically support a method of importing data, whether larger or small. For example, in MySQL the LOAD DATA INFILE SQL statement can do this for you. Cassandra supports a COPY command in CQL, and various Hadoop interfaces such as HBase and Hive enable you to access CSV files directly without explicitly importing them.
Custom loading may be required if no built-in solution exists. This can take many forms, including writing your own, or if they are available using specialised tools like Tungsten Replicator or Sqoop. The exact method is going to depend on the data exchange type, data size, and complexity of the load process.
Application loading can be used in those situations where the application is running and a different version or format of the information is used. For example, when caching with a NoSQL engine on top of an RDBMS, you might adapt your application to automatically generate the NoSQL record. Similarly, during a migration, you might configure your application to look in the new database, and if it doesn’t exist, load it from the old database and generate the new record.
Data sizes must be a consideration. It seems ridiculous in this age when disk sizes are so large, but database sizes can be huge too. A recent project I was involved in required migrating just under 150TB of information. Storing all of that data in one go would have a required a mammoth sized disk array before the data was loaded into a Hadoop/Hive database. There are solutions for moving and migrating such large volumes of data without it ever touching the disk and using up all that space.
Depending on your data exchange requirements, any, or all of these may be an issue you have to contend with.
Making the Data Usable
Exchanging data between systems is only any good if once there the data is usable. Nobody would consider releasing a physical book in the USA, and a digital book in France, and not translating it. The same is true of data. Exchanging the data between databases requires you to take these issues into account during the movement of the data; it’s no good just blindly copying the data over and hoping it will be usable.
To make the data usable the following aspects must be considered:
- Data accessibility – we’ve already talked about the key structural translation that needs to take place, but you also need to think about the effect on elements such as searching and indexing. Certain indexing methods are more complex (and therefore computationally expensive) than others. Some are more efficient. Some database environments support a limited number, quantity or complexity of indexing and querying that can only be addressed if the format and structure of the data is correct to begin with.
- Data validity – if you change the structure of the data, does that change the ability to validate or otherwise ensure the quality of the information? For example, moving from RDBMS to NoSQL you may lose the ability to single out duplicate entries for certain types and fragments of the dataset. Relational constraints are not enforced within non-relational databases. Data format differences may also present problems; in a NoSQL database, for example, the same strict database types, such as dates, times or numbers do not exist. How do you prevent an invalid date being inserted into a date column, or worse, a non-date value into a date column that would have been identified during a database write?
- Application usability – if the data is moved, can you still access and update it in the same way? RDBMSs tend to be transactional, providing stability and support, NoSQL databases do not as a rule, particularly across multiple silos. If an invoice is updated, how do I guarantee that the customers account is also updated, especially if one operation, or the database itself, fails during the process?
These are some, but not all, of the issues you need to be aware of. Regardless of the actual method though, you want to actually use the data at the end, so don’t forget how you might query or index the data once it’s moved. Keep in mind that not all data moves require heavy consideration. If you are exporting the data to be loaded for a mail merge for example, the usability aspects may be minor compared to the format and quality of the information.