
...

IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop, as it uses a table concept and has a SQL-like language, HiveQL.

 

What Is Hadoop, Exactly?

  • A distributed file system
  • A method of processing massive quantities of data in parallel
  • The Cutting family's stuffed elephant
  • All of the above

 

Traditional Hadoop Deployments

  • Data Analytics
    • Single databases
    • Collective databases
    • Databases and external information
    • Non-structured data
    • Long-term datastores and archiving

MySQL to Hive ETL

ETL into Apache Hive is implemented by connecting to the MySQL OpenMRS database, reading the table data, and writing it to files in HDFS.
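
A minimal sketch of this extract step, assuming placeholder connection details and the standard OpenMRS person table as a stand-in (the MySQL Connector/J driver and the Hadoop client libraries are assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MySqlToHdfsExtract {
        public static void main(String[] args) throws Exception {
            // Read rows from the OpenMRS MySQL database (URL and credentials are placeholders).
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/openmrs", "openmrs_user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT person_id, gender, birthdate FROM person");

            // Write the rows as a delimited file in HDFS; '\001' (Ctrl-A) is Hive's default field delimiter.
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/user/hive/warehouse/person/person.txt"));
            while (rs.next()) {
                out.writeBytes(rs.getInt("person_id") + "\001"
                        + rs.getString("gender") + "\001"
                        + rs.getDate("birthdate") + "\n");
            }
            out.close();
            rs.close();
            stmt.close();
            conn.close();
        }
    }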

...

Figure 2: Mapping between MySQL and HDFS Schema


Exporting Tables From MySQL To Hive

  1. Manual Loading - export the MySQL table as CSV data and import that CSV into Hive tables (a sketch follows this list).
  2. Sqoop
  3. Tungsten Replicator
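
For the manual route (option 1), a CSV exported from MySQL (for example with SELECT ... INTO OUTFILE) and copied into HDFS can be loaded through hive-jdbc, which the project timeline below also mentions. A hedged sketch, assuming a HiveServer2 instance on localhost and a hypothetical /tmp/person.csv already staged in HDFS:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ManualCsvLoad {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // Create a Hive table matching the exported CSV layout.
            stmt.execute("CREATE TABLE IF NOT EXISTS person "
                    + "(person_id INT, gender STRING, birthdate STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Move the exported CSV from its HDFS staging path into the table.
            stmt.execute("LOAD DATA INPATH '/tmp/person.csv' INTO TABLE person");

            stmt.close();
            conn.close();
        }
    }

The table below compares the three approaches.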

 

 

                          | Manual via CSV            | Sqoop                        | Tungsten Replicator
Process                   | Manual/Scripted           | Manual/Scripted              | Fully automated
Incremental Loading       | Possible with DDL changes | Requires DDL changes         | Fully supported
Latency                   | Full-load                 | Intermittent                 | Real-time
Extraction Requirements   | Full table scan           | Full and partial table scans | Low-impact binlog scan

Mockup UI

First, we try to perform a sample ETL from one MySQL DB to another MySQL DB, and then export it to Hive.
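
A minimal sketch of the MySQL-to-MySQL copy step, assuming a hypothetical staging database and the person table as a stand-in; the staging copy can then be exported to Hive as described above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MySqlToMySqlCopy {
        public static void main(String[] args) throws Exception {
            // Source and target connections (URLs and credentials are placeholders).
            Connection src = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/openmrs", "user", "password");
            Connection dst = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/staging", "user", "password");

            Statement read = src.createStatement();
            ResultSet rs = read.executeQuery("SELECT person_id, gender, birthdate FROM person");

            PreparedStatement write = dst.prepareStatement(
                    "INSERT INTO person (person_id, gender, birthdate) VALUES (?, ?, ?)");
            while (rs.next()) {
                write.setInt(1, rs.getInt("person_id"));
                write.setString(2, rs.getString("gender"));
                write.setDate(3, rs.getDate("birthdate"));
                write.addBatch();
            }
            write.executeBatch(); // push all copied rows in one round trip

            write.close();
            rs.close();
            read.close();
            dst.close();
            src.close();
        }
    }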

...

I. After the ETL is performed, the second stage requires two components:
     1. EDW
     2. OpenMRS Predictive Analysis Module
II. Performing Predictive Analysis - we can use Apache Mahout for the implementation (a sketch follows below).
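
The proposal only names Mahout, so as one possible illustration, here is a sketch using Mahout's Taste collaborative-filtering API (a user-based recommender); the preferences.csv file of "userID,itemID,value" rows is hypothetical and would be produced from the warehouse:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class PredictiveAnalysisSketch {
        public static void main(String[] args) throws Exception {
            // "userID,itemID,value" rows exported from the warehouse (path is a placeholder).
            DataModel model = new FileDataModel(new File("preferences.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Print the top 3 predictions for user 1.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }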

...

Mockup UI

[Mockup: Query Editor]

[Mockup: TabularView]
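
To connect the Query Editor mockup to the warehouse, the query typed by the user could be executed over hive-jdbc (as planned in the timeline below) and the result set rendered in the TabularView. A sketch, assuming a HiveServer2 endpoint and a hypothetical query:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class QueryEditorBackend {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // Run the user's query and emit rows for the tabular view.
            ResultSet rs = stmt.executeQuery(
                    "SELECT gender, COUNT(*) FROM person GROUP BY gender");
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.append(rs.getString(i)).append('\t');
                }
                System.out.println(row);
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }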

Project Timeline

  • 19 May - 22 May: Study MySQL DB transformation using JSP
  • 22 May - 26 May: Study Direct Web Remoting and the Spring controller
  • 26 May - 4 June: DWR implementation and back-end design using JavaScript and Java
  • 5 June - 8 June: Make a web wizard similar to the mockup UI above
  • 9 June - 12 June: jQuery drag-and-drop support for columns in the table
  • 12 June - 14 June: Hadoop, Hive, and Thrift study, setup, and resource collection
  • 14 June - 15 June: Testing and implementing data warehouse login (backend and frontend)
  • 16 June - 22 June: Transform and load tables from MySQL to the Hive data warehouse; implement various joins
  • 23 June - 30 June: Add more features, like multiple table or database selection and Hive table editing; 29 June: loading data to Hive using ssh and Thrift
  • 30 June - 6 July: Test and execute Hive queries using ssh, hive-jdbc, and Thrift in the OpenMRS module
  • 7 July - 13 July: Get query results on the page in an interactive manner
  • 14 July - 28 July: Study Apache Mahout; implementation and web interface
  • 28 July - 18 August: Code fixing, bug solving, and final polishing

Demo Video

[Demo video]

Resources

  1. Data Warehousing: http://en.wikipedia.org/wiki/Data_warehouse
  2. ETL (Extraction - Transformation - Loading): http://en.wikipedia.org/wiki/Extract,_transform,_load
  3. Apache Hive: http://hive.apache.org/
  4. Apache Mahout: https://mahout.apache.org/