Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Demo Video Added

...

Primary mentor

Gaurav Paliwal

Backup mentor

TBDSuranga Kasthurirathne

Assigned to

Vineet Kumar

Abstract

Excerpt

The amount of data generated is getting increased day by day and so as the appetite for finding the information from data as well. Growing appetite for data analysis can't be achieved by transactional databases. With organization thriving towards separating their data warehouse compliance from transactional databases so, that they can track the historical data better, the need for ETL tool is getting increased everyday. The intention of this project is to have a ETL module to interact with multiple DW compliance over which predictive modeling code could run. So, that healthcare provider can check upon the predictive modeling result based on historical data they are having/loading.

During this summer we need to have at-least a basic module which will allow doing ETL to MySQL & Hadoop (Hive) and have a way to inject the predictive modeling code into the warehouse compliance directly from the OpenMRS UI and to fetch back the result.

Project Champions

 

Objectives

Peform ETL in web app from MySQL to Hadoop Hive and perform predictive analysis over it.

ApproachTheory

A. First Stage Process -

I. ETL can performed in following ways -

    • Non realtime: Sqoop
    • Realtime: Hadoop Applier for MySQL

     Which is performed over: OpenMRS Database  to Enterprise Data-warehouse (EDW)

II. Open Source Tools for ETL - Petaho, Talend, CloverETL, JasperSoft etc.

Where Pentaho and Talend are in my first preferences because Pentaho is faster and easier to use because of its GUI while Talend is more a tool for people who are making already a Java program

...

and want to save lots and lots of time with a tool that generates code for them.


III. Three way to integrate it in OpenMRS -

          1. Integration in the Web Application (Dynamic) -  

We can perform ETL in OpenMRS web application at some regular interval of time. Where a

...

jQuery library is required which is integrated in JSP pages.

          2. Integration in the Server i.e. OpenMRS Standalone (Static) -

When any one start the server he/she runs the OpenMRS Standalone. We can perform ETL in standalone at some regular interval of time. Where we have some menu items in Standalone like :
     

ETL Tool Menu-

          Force Update - Escaping whatever the scheduled time and performs ETL
          Set Interval - Set interval of time after which a ETL is performed.

3. OpenMRS ETL Tool-kit

...

 

This is separate application that doesn't require the running of server. It will have the same functionality as first way integration with some extending capability. The benefit of this method is we can also use other cross platform implementation other than java like in C# for performing ETL. This gives the independency of language selection.

IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop as it uses a table concept and have a SQL like language, HiveQL

 

What Is Hadoop, Exactly?

  • A distributed file system
  • A method of processing massive quantities of data in parallel
  • The Cutting family's stuffed elephant
  • All of the above

 

Traditional Hadoop Deployments

  • Data Analytics
    • Single databases
    • Collective databases
    • Databases and external information
    • Non-structured data
    • Long term datastores and archiving

MySQL to

...

Hive ETL

ETL via the Apache Hive is implemented by connecting to the MySQL OpenMRS database and reading tables information, and writing them into a file in HDFS.

 

MySQL to HDFS IntegrationImage Modified

 

Figure 1: MySQL to HDFS Integration

...

Databases are mapped as separate directories, with their tables mapped as sub-directories with a Hive data warehouse directory. Data inserted into each table is written into text files (named as datafile1.txt) in Hive / HDFS. Data can be in comma separated format; or any other, that is configurable by command line arguments.

Mapping between MySQL and HDFS SchemaImage Modified

Figure 2: Mapping between MySQL and HDFS Schema


Exporting Tables From MySQL To Hive

  1. Manual Loading - Export MySQL Table as csv data and Import that csv data to Hive tables.
  2. Sqoop
  3. Tungsten Replicator

 

 

Manual via CSV

Sqoop

Tungsten Replicator

Process

Manual/ Scripted

Manual/ Scripted

Fully automated

Incremental Loading

Possible with DDL changes

Requires DDL changes

Fully supported

Latency

Full-load

Intermittent

Real-time

Extraction Requirements

Full table scan

Full and partial table scans

Low-impact binlog scan

 

 

 

 

 

 

 


MockUp UI

First we try to perform a sample ETL from a MySQL DB to another MySQL DB and Export it to Hive.

Mockup
Version3
Width600
NameLogin - Source Database

Mockup
Version2
Width600
NameSelecting DB

Mockup
Version1
Width600
NameSelect DB Table

 

Mockup
Version4
Width600
NameSelecting Data - Null Select

Mockup
Version8
Width600
NameSelect Columns

Mockup
Version1
Width600
NameJoinning Condition

Mockup
Version2
Width600
NameDatawarehouse Login

Mockup
Version2
Width600
NameFinal Progress

B. Second Stage Process -

I. After a ETL is performed, now for Second Stage we require two component
     1. EDW
     2. OpenMRS Predictive Analysis Module.
II. Performing Predictive Analysis - we can use Apache Mayhout for implementation.

...

Extra Credit

...

Mockup UI

Mockup
Version4
Width600
NameQuery Editor

Mockup
Version1
Width600
NameTabularView

Project Timeline

  • 19 May -  22 May: Study MySQL DB Transformation using JSP
  • 22 May - 26 May: Study  Direct Web Remoting and Spring Controller
  • 26 May - 4 June. DWR implementation and back end designing using JavaScript and Java.
  • 5 June -  8 June:  Make a Web Wizard similar to mentioned Mockup UI.
  • 9 June - 12 June: jQuery Drag and Drop support for columns in table.
  • 12 June - 14 June: Hadoop, Hive, Thrift Study, Setup, Resource Collection.
  • 14 June - 15 June: Testing and Implementing Datawarehouse Login (Backend and Frontend).
  • 16 June - 22 June: Transform and Load Table from MySQL to  Datawarehouse. Implementing various Joins.
  • 23 June - 29 June: Loading data to Hive using ssh and thrift
  • 30 June - 6 July: Test and execute Hive queries using ssh, hive-jdbc and thrift in OpenMRS Module.
  • 7 July -  13 July: Get results of queries on page in interactive manner.
  • 14 July - 28 July: Studying Apache Mahout, Implementaion and Web Interface
  • 28 July - 18 August: Code Fixing, Bug Solving and Later Finishing

Demo Video

Image Added

Resources

  1. Data Warehousing : http://en.wikipedia.org/wiki/Data_warehouse
  2. ETL ( Extraction - Transformation - Loading) : http://en.wikipedia.org/wiki/Extract,_transform,_load
  3. Apache Hive : http://hive.apache.org/
  4. Apache Mahout :https://mahout.apache.org/