...
IV. Selection of EDW - Using Apache Hive. Apache Hive is a strong candidate for storing data in Hadoop because it organizes data into tables and offers a SQL-like query language, HiveQL.
What Is Hadoop, Exactly?
- A distributed file system
- A method of processing massive quantities of data in parallel
- The Cutting family's stuffed elephant
- All of the above
Traditional Hadoop Deployments
- Data Analytics
- Single databases
- Collective databases
- Databases and external information
- Non-structured data
- Long term datastores and archiving
MySQL to Hive ETL
ETL via Apache Hive is implemented by connecting to the MySQL OpenMRS database, reading the table data, and writing it into files in HDFS.
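As a minimal sketch of this table-to-file mapping (using sample rows in place of a live OpenMRS connection, and a local file standing in for an HDFS path), rows read from a MySQL table can be serialized into Hive's default delimited text format, where fields are separated by the Ctrl-A character (`\x01`) and NULLs are written as `\N`:

```python
# Minimal sketch: serialize table rows into Hive's default delimited
# text format (field terminator \x01, one row per line, NULL as \N).
# Sample rows stand in for a live MySQL OpenMRS query, and a local
# file stands in for an HDFS destination.

HIVE_FIELD_DELIM = "\x01"  # Hive's default field terminator

def rows_to_hive_text(rows):
    """Render rows as Hive-default delimited text, NULLs as \\N."""
    lines = []
    for row in rows:
        fields = ["\\N" if v is None else str(v) for v in row]
        lines.append(HIVE_FIELD_DELIM.join(fields))
    return "\n".join(lines) + "\n"

# Illustrative rows as they might come from a MySQL `person` table.
sample_rows = [
    (1, "M", "1980-04-02"),
    (2, "F", None),  # NULL birthdate becomes \N
]

if __name__ == "__main__":
    text = rows_to_hive_text(sample_rows)
    with open("person.txt", "w") as f:  # stand-in for an HDFS file
        f.write(text)
```

A file written this way can be read by a Hive table declared with the default SerDe, without any extra `ROW FORMAT` clause.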
...
Figure 2: Mapping between MySQL and HDFS Schema
Exporting Tables From MySQL To Hive
- Manual loading - export the MySQL table as CSV data and import that CSV into Hive tables.
- Sqoop
- Tungsten Replicator
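For the manual CSV route, the Hive side needs a matching table definition plus a `LOAD DATA` statement. A small sketch that generates the required HiveQL for a given table (the column names and types here are illustrative, not taken from the OpenMRS schema):

```python
# Sketch: generate HiveQL to create a CSV-backed table and load a
# file into it. Column names and types below are illustrative
# examples, not the actual OpenMRS schema.

def hive_ddl(table, columns, csv_path):
    """Return (CREATE TABLE, LOAD DATA) HiveQL statements for CSV input."""
    cols = ",\n  ".join(f"{name} {htype}" for name, htype in columns)
    create = (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE;"
    )
    load = f"LOAD DATA INPATH '{csv_path}' INTO TABLE {table};"
    return create, load

if __name__ == "__main__":
    create, load = hive_ddl(
        "person",
        [("person_id", "INT"), ("gender", "STRING"), ("birthdate", "STRING")],
        "/user/etl/person.csv",  # hypothetical HDFS path
    )
    print(create)
    print(load)
```

Sqoop automates both steps: its `--hive-import` mode infers the Hive schema from the MySQL table and performs the load itself.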
| | Manual via CSV | Sqoop | Tungsten Replicator |
|---|---|---|---|
| Process | Manual/scripted | Manual/scripted | Fully automated |
| Incremental loading | Possible with DDL changes | Requires DDL changes | Fully supported |
| Latency | Full load | Intermittent | Real-time |
| Extraction requirements | Full table scan | Full and partial table scans | Low-impact binlog scan |
MockUp UI
First, we perform a sample ETL from one MySQL database to another MySQL database, and then export the result to Hive.
...