
...

IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop, as it uses a table concept and has a SQL-like language, HiveQL.

 

What Is Hadoop, Exactly?

  • A distributed file system
  • A method of processing massive quantities of data in parallel
  • The Cutting family's stuffed elephant
  • All of the above

 

Traditional Hadoop Deployments

  • Data Analytics
    • Single databases
    • Collective databases
    • Databases and external information
    • Non-structured data
    • Long-term datastores and archiving

MySQL to Hive ETL

ETL into Apache Hive is implemented by connecting to the MySQL OpenMRS database, reading the table data, and writing it to files in HDFS.
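
A minimal sketch of this extract step, assuming placeholder connection details and the standard OpenMRS person table as a stand-in (the MySQL Connector/J driver and the Hadoop client libraries are assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MySqlToHdfsExtract {
        public static void main(String[] args) throws Exception {
            // Read rows from the OpenMRS MySQL database (URL and credentials are placeholders).
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/openmrs", "openmrs_user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT person_id, gender, birthdate FROM person");

            // Write the rows as a delimited file in HDFS; '\001' (Ctrl-A) is Hive's default field delimiter.
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataOutputStream out = fs.create(new Path("/user/hive/warehouse/person/person.txt"));
            while (rs.next()) {
                out.writeBytes(rs.getInt("person_id") + "\001"
                        + rs.getString("gender") + "\001"
                        + rs.getDate("birthdate") + "\n");
            }
            out.close();
            rs.close();
            stmt.close();
            conn.close();
        }
    }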

...

Figure 2: Mapping between MySQL and HDFS Schema


Exporting Tables From MySQL To Hive

  1. Manual Loading - export the MySQL table as CSV data and import that CSV into Hive tables (a sketch follows this list).
  2. Sqoop
  3. Tungsten Replicator
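
For the manual route (option 1), a CSV exported from MySQL (for example with SELECT ... INTO OUTFILE) and copied into HDFS can be loaded through hive-jdbc, which the project timeline below also mentions. A hedged sketch, assuming a HiveServer2 instance on localhost and a hypothetical /tmp/person.csv already staged in HDFS:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ManualCsvLoad {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // Create a Hive table matching the exported CSV layout.
            stmt.execute("CREATE TABLE IF NOT EXISTS person "
                    + "(person_id INT, gender STRING, birthdate STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // Move the exported CSV from its HDFS staging path into the table.
            stmt.execute("LOAD DATA INPATH '/tmp/person.csv' INTO TABLE person");

            stmt.close();
            conn.close();
        }
    }

The table below compares the three approaches.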

 

 

                          | Manual via CSV            | Sqoop                        | Tungsten Replicator
Process                   | Manual/Scripted           | Manual/Scripted              | Fully automated
Incremental Loading       | Possible with DDL changes | Requires DDL changes         | Fully supported
Latency                   | Full-load                 | Intermittent                 | Real-time
Extraction Requirements   | Full table scan           | Full and partial table scans | Low-impact binlog scan

Mockup UI

First, we try to perform a sample ETL from one MySQL DB to another MySQL DB, and then export it to Hive.
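
A minimal sketch of the MySQL-to-MySQL copy step, assuming a hypothetical staging database and the person table as a stand-in; the staging copy can then be exported to Hive as described above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MySqlToMySqlCopy {
        public static void main(String[] args) throws Exception {
            // Source and target connections (URLs and credentials are placeholders).
            Connection src = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/openmrs", "user", "password");
            Connection dst = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/staging", "user", "password");

            Statement read = src.createStatement();
            ResultSet rs = read.executeQuery("SELECT person_id, gender, birthdate FROM person");

            PreparedStatement write = dst.prepareStatement(
                    "INSERT INTO person (person_id, gender, birthdate) VALUES (?, ?, ?)");
            while (rs.next()) {
                write.setInt(1, rs.getInt("person_id"));
                write.setString(2, rs.getString("gender"));
                write.setDate(3, rs.getDate("birthdate"));
                write.addBatch();
            }
            write.executeBatch(); // push all copied rows in one round trip

            write.close();
            rs.close();
            read.close();
            dst.close();
            src.close();
        }
    }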

...

I. After the ETL is performed, the second stage requires two components:
     1. EDW
     2. OpenMRS Predictive Analysis Module
II. Performing Predictive Analysis - we can use Apache Mahout for the implementation (a sketch follows below).
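
The proposal only names Mahout, so as one possible illustration, here is a sketch using Mahout's Taste collaborative-filtering API (a user-based recommender); the preferences.csv file of "userID,itemID,value" rows is hypothetical and would be produced from the warehouse:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class PredictiveAnalysisSketch {
        public static void main(String[] args) throws Exception {
            // "userID,itemID,value" rows exported from the warehouse (path is a placeholder).
            DataModel model = new FileDataModel(new File("preferences.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Print the top 3 predictions for user 1.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }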

...

Mockup UI

[Mockup: Query Editor]

[Mockup: TabularView]
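
To connect the Query Editor mockup to the warehouse, the query typed by the user could be executed over hive-jdbc (as planned in the timeline below) and the result set rendered in the TabularView. A sketch, assuming a HiveServer2 endpoint and a hypothetical query:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class QueryEditorBackend {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
            Statement stmt = conn.createStatement();

            // Run the user's query and emit rows for the tabular view.
            ResultSet rs = stmt.executeQuery(
                    "SELECT gender, COUNT(*) FROM person GROUP BY gender");
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.append(rs.getString(i)).append('\t');
                }
                System.out.println(row);
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }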

Project Timeline

  • 19 May - 22 May: Study MySQL DB transformation using JSP
  • 22 May - 26 May: Study Direct Web Remoting and the Spring controller
  • 26 May - 4 June: DWR implementation and back-end design using JavaScript and Java
  • 5 June - 8 June: Make a web wizard similar to the mockup UI above
  • 9 June - 12 June: jQuery drag-and-drop support for columns in the table
  • 12 June - 14 June: Hadoop, Hive, and Thrift study, setup, and resource collection
  • 14 June - 15 June: Testing and implementing data warehouse login (backend and frontend)
  • 16 June - 22 June: Transform and load tables from MySQL to the Hive data warehouse; implement various joins
  • 23 June - 30 June: Add more features, like multiple table or database selection and Hive table editing; 29 June: loading data to Hive using ssh and Thrift
  • 30 June - 6 July: Test and execute Hive queries using ssh, hive-jdbc, and Thrift in the OpenMRS module
  • 7 July - 13 July: Get query results on the page in an interactive manner
  • 14 July - 28 July: Study Apache Mahout; implementation and web interface
  • 28 July - 18 August: Code fixing, bug solving, and final polishing

Demo Video

[Demo video]

Resources

  1. Data Warehousing: http://en.wikipedia.org/wiki/Data_warehouse
  2. ETL (Extraction - Transformation - Loading): http://en.wikipedia.org/wiki/Extract,_transform,_load
  3. Apache Hive: http://hive.apache.org/
  4. Apache Mahout: https://mahout.apache.org/