...
IV. Selection of EDW - Using Apache Hive. Apache Hive is arguably the best fit for storing this data in Hadoop, since it organizes data into tables and provides a SQL-like query language, HiveQL.
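To give a flavor of how close HiveQL is to SQL, the aggregation below is valid HiveQL exactly as written (the encounter table and column names are used for illustration, not as the exact OpenMRS schema):

    -- Count visits per patient; standard SQL syntax, executed by Hive
    SELECT patient_id, COUNT(*) AS visits
    FROM encounter
    GROUP BY patient_id;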
What Is Hadoop, Exactly?
- A distributed file system
- A method of processing massive quantities of data in parallel
- The Cutting family's stuffed elephant
- All of the above
Traditional Hadoop Deployments
- Data Analytics
- Single databases
- Collective databases
- Databases and external information
- Non-structured data
- Long-term datastores and archiving
MySQL to Hive ETL
ETL into Apache Hive is implemented by connecting to the MySQL OpenMRS database, reading the table data, and writing it out to files in HDFS, as sketched below.
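A minimal sketch of this step, assuming a table, column set, and HDFS path chosen purely for illustration (not taken from the proposal), could read a MySQL table over JDBC and stream it into an HDFS file as CSV:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MySqlToHdfs {
        public static void main(String[] args) throws Exception {
            // Connect to the OpenMRS MySQL database (URL and credentials are placeholders)
            Class.forName("com.mysql.jdbc.Driver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/openmrs", "user", "password");

            // Open an output file in HDFS; the Hadoop configuration on the
            // classpath determines which cluster this resolves to
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/warehouse/openmrs/patient.csv"));

            // Stream the table row by row, writing comma-separated values
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT patient_id, gender, birthdate FROM patient");
            while (rs.next()) {
                out.writeBytes(rs.getInt(1) + "," + rs.getString(2) + "," + rs.getDate(3) + "\n");
            }

            out.close();
            conn.close();
        }
    }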
...
Figure 2: Mapping between MySQL and HDFS Schema
Exporting Tables From MySQL To Hive
- Manual loading - export a MySQL table as CSV data and import that CSV into Hive tables.
- Sqoop (a sample import command follows the comparison table below)
- Tungsten Replicator
| | Manual via CSV | Sqoop | Tungsten Replicator |
| --- | --- | --- | --- |
| Process | Manual/scripted | Manual/scripted | Fully automated |
| Incremental loading | Possible with DDL changes | Requires DDL changes | Fully supported |
| Latency | Full load | Intermittent | Real-time |
| Extraction requirements | Full table scan | Full and partial table scans | Low-impact binlog scan |
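For reference, a Sqoop import of a single table into Hive might look roughly like the following (the connection string, credentials, and table name are placeholders):

    sqoop import \
      --connect jdbc:mysql://localhost:3306/openmrs \
      --username user --password password \
      --table patient \
      --hive-import

The --hive-import flag makes Sqoop generate the Hive table definition and load the imported files in a single step.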
Mockup UI
First, we try a sample ETL from one MySQL DB to another MySQL DB, and then export the result to Hive.
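The Hive side of this export could be as simple as the following HiveQL, assuming the CSV produced earlier has already been copied to HDFS (the path and column layout are assumptions matching the sketch above):

    -- Staging table matching the exported CSV layout
    CREATE TABLE patient_staging (
      patient_id INT,
      gender STRING,
      birthdate STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- Move the exported file from HDFS into the table
    LOAD DATA INPATH '/warehouse/openmrs/patient.csv' INTO TABLE patient_staging;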
...
I. After the ETL is performed, the second stage requires two components:
1. EDW
2. OpenMRS Predictive Analysis Module.
II. Performing Predictive Analysis - we can use Apache Mahout for the implementation; a sketch follows.
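As a minimal sketch of what a Mahout-based analysis could look like, the snippet below uses Mahout's Taste collaborative-filtering API over a preferences file (the file name and its userID,itemID,preference layout are assumptions for illustration, not the module's actual design):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class MahoutExample {
        public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,preference triples exported from the warehouse
            DataModel model = new FileDataModel(new File("preferences.csv"));

            // User-based collaborative filtering: similarity, then a neighborhood
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 predicted items for user 1
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }

In practice the preference file would be generated from the warehouse tables; Mahout also ships distributed implementations of many of its algorithms that run as MapReduce jobs over HDFS.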
...
Mockup UI
Mockup: Query Editor
Mockup: TabularView
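Behind the Query Editor mockup, the module could submit user queries over hive-jdbc, roughly as below. The driver class and URL assume HiveServer2; host, port, and the query itself are placeholders (for HiveServer1 the driver would be org.apache.hadoop.hive.jdbc.HiveDriver with a jdbc:hive:// URL instead):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryRunner {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; connection details are placeholders
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");

            // Run a HiveQL query exactly as the user typed it in the editor
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery(
                    "SELECT patient_id, COUNT(*) FROM encounter GROUP BY patient_id");
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }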
Project Timeline
- 19 May - 22 May: Study MySQL DB Transformation using JSP
- 22 May - 26 May: Study Direct Web Remoting and Spring Controller
- 26 May - 4 June: DWR implementation and backend design using JavaScript and Java.
- 5 June - 8 June: Build a web wizard similar to the mockup UI above.
- 9 June - 12 June: jQuery drag-and-drop support for table columns.
- 12 June - 14 June: Hadoop, Hive, Thrift Study, Setup, Resource Collection.
- 14 June - 15 June: Test and implement data warehouse login (backend and frontend).
- 16 June - 22 June: Transform and load tables from MySQL into the Hive data warehouse; implement various joins.
- 23 June - 30 June: Add more features, such as multiple table or database selection and Hive table editing.
- 29 June: Load data into Hive using ssh and Thrift.
- 30 June - 6 July: Test and execute Hive queries using ssh, hive-jdbc, and Thrift in the OpenMRS module.
- 7 July - 13 July: Display query results on the page in an interactive manner.
- 14 July - 28 July: Study Apache Mahout; implementation and web interface.
- 28 July - 18 August: Code fixes, bug solving, and final polish.
Demo Video
Resources
- Data Warehousing: http://en.wikipedia.org/wiki/Data_warehouse
- ETL (Extraction - Transformation - Loading): http://en.wikipedia.org/wiki/Extract,_transform,_load
- Apache Hive: http://hive.apache.org/
- Apache Mahout: https://mahout.apache.org/