...
IV. Selection of EDW - Using Apache Hive. Apache Hive is a strong candidate for storing data in Hadoop because it organizes data into tables and offers a SQL-like query language, HiveQL.
What Is Hadoop, Exactly?
- A distributed file system
- A method of processing massive quantities of data in parallel
- The Cutting family's stuffed elephant
- All of the above
Traditional Hadoop Deployments
- Data Analytics
- Single databases
- Collective databases
- Databases and external information
- Non-structured data
- Long term datastores and archiving
MySQL to Hive ETL
ETL via Apache Hive is implemented by connecting to the MySQL OpenMRS database, reading the table data, and writing it into files in HDFS.
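As a minimal sketch of this table-to-file mapping (using sample rows in place of a live OpenMRS connection, and a local file standing in for an HDFS path), rows read from a MySQL table can be serialized into Hive's default delimited text format, where fields are separated by the Ctrl-A character (`\x01`) and NULLs are written as `\N`:

```python
# Minimal sketch: serialize table rows into Hive's default delimited
# text format (field terminator \x01, one row per line, NULL as \N).
# Sample rows stand in for a live MySQL OpenMRS query, and a local
# file stands in for an HDFS destination.

HIVE_FIELD_DELIM = "\x01"  # Hive's default field terminator

def rows_to_hive_text(rows):
    """Render rows as Hive-default delimited text, NULLs as \\N."""
    lines = []
    for row in rows:
        fields = ["\\N" if v is None else str(v) for v in row]
        lines.append(HIVE_FIELD_DELIM.join(fields))
    return "\n".join(lines) + "\n"

# Illustrative rows as they might come from a MySQL `person` table.
sample_rows = [
    (1, "M", "1980-04-02"),
    (2, "F", None),  # NULL birthdate becomes \N
]

if __name__ == "__main__":
    text = rows_to_hive_text(sample_rows)
    with open("person.txt", "w") as f:  # stand-in for an HDFS file
        f.write(text)
```

A file written this way can be read by a Hive table declared with the default SerDe, without any extra `ROW FORMAT` clause.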
...
Figure 2: Mapping between MySQL and HDFS Schema
Exporting Tables From MySQL To Hive
- Manual loading - export the MySQL table as CSV data and import that CSV into Hive tables.
- Sqoop
- Tungsten Replicator
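For the manual CSV route, the Hive side needs a matching table definition plus a `LOAD DATA` statement. A small sketch that generates the required HiveQL for a given table (the column names and types here are illustrative, not taken from the OpenMRS schema):

```python
# Sketch: generate HiveQL to create a CSV-backed table and load a
# file into it. Column names and types below are illustrative
# examples, not the actual OpenMRS schema.

def hive_ddl(table, columns, csv_path):
    """Return (CREATE TABLE, LOAD DATA) HiveQL statements for CSV input."""
    cols = ",\n  ".join(f"{name} {htype}" for name, htype in columns)
    create = (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE;"
    )
    load = f"LOAD DATA INPATH '{csv_path}' INTO TABLE {table};"
    return create, load

if __name__ == "__main__":
    create, load = hive_ddl(
        "person",
        [("person_id", "INT"), ("gender", "STRING"), ("birthdate", "STRING")],
        "/user/etl/person.csv",  # hypothetical HDFS path
    )
    print(create)
    print(load)
```

Sqoop automates both steps: its `--hive-import` mode infers the Hive schema from the MySQL table and performs the load itself.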
| | Manual via CSV | Sqoop | Tungsten Replicator |
|---|---|---|---|
| Process | Manual/scripted | Manual/scripted | Fully automated |
| Incremental loading | Possible with DDL changes | Requires DDL changes | Fully supported |
| Latency | Full load | Intermittent | Real-time |
| Extraction requirements | Full table scan | Full and partial table scans | Low-impact binlog scan |
MockUp UI
First, we perform a sample ETL from one MySQL database to another MySQL database, and then export the result to Hive.
...