ETL/Predictive Modeling Across Multiple Platforms
2014 Internship Project
This project is being considered as a potential project for 2014 Internships. If you are a potential intern and are interested in working on this project, please discuss it in detail with the mentor(s) listed here before submitting your internship proposal.
Primary mentor | |
Backup mentor | @Suranga Kasthurirathne |
Assigned to | @Vineet Kumar |
Abstract
The amount of data generated is getting increased day by day and so as the appetite for finding the information from data as well. Growing appetite for data analysis can't be achieved by transactional databases. With organization thriving towards separating their data warehouse compliance from transactional databases so, that they can track the historical data better, the need for ETL tool is getting increased everyday. The intention of this project is to have a ETL module to interact with multiple DW compliance over which predictive modeling code could run. So, that healthcare provider can check upon the predictive modeling result based on historical data they are having/loading.
During this summer we need to have at-least a basic module which will allow doing ETL to MySQL & Hadoop (Hive) and have a way to inject the predictive modeling code into the warehouse compliance directly from the OpenMRS UI and to fetch back the result.
Objectives
Peform ETL in web app from MySQL to Hive and perform predictive analysis over it.
Theory
A. First Stage Process -
I. ETL can performed in following ways -
Non realtime: Sqoop
Realtime: Hadoop Applier for MySQL
Which is performed over: OpenMRS Database to Enterprise Data-warehouse (EDW)
II. Open Source Tools for ETL - Petaho, Talend, CloverETL, JasperSoft etc.
Where Pentaho and Talend are in my first preferences because Pentaho is faster and easier to use because of its GUI while Talend is more a tool for people who are making already a Java program and want to save lots and lots of time with a tool that generates code for them.
III. Three way to integrate it in OpenMRS -
1. Integration in the Web Application (Dynamic)
We can perform ETL in OpenMRS web application at some regular interval of time. Where a jQuery library is required which is integrated in JSP pages.
2. Integration in the Server i.e. OpenMRS Standalone (Static)
When any one start the server he/she runs the OpenMRS Standalone. We can perform ETL in standalone at some regular interval of time. Where we have some menu items in Standalone like :
ETL Tool Menu-
Force Update - Escaping whatever the scheduled time and performs ETL
Set Interval - Set interval of time after which a ETL is performed.
3. OpenMRS ETL Tool-kit
This is separate application that doesn't require the running of server. It will have the same functionality as first way integration with some extending capability. The benefit of this method is we can also use other cross platform implementation other than java like in C# for performing ETL. This gives the independency of language selection.
IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop as it uses a table concept and have a SQL like language, HiveQL
What Is Hadoop, Exactly?
A distributed file system
A method of processing massive quantities of data in parallel
The Cutting family's stuffed elephant
All of the above
Traditional Hadoop Deployments
Data Analytics
Single databases
Collective databases
Databases and external information
Non-structured data
Long term datastores and archiving
MySQL to Hive ETL
ETL via the Apache Hive is implemented by connecting to the MySQL OpenMRS database and reading tables information, and writing them into a file in HDFS.
Figure 1: MySQL to HDFS Integration
This integration uses a JS Library where user drag and drop objects somehow similar to what one can do in informatica and then generate a sql which will be executed at backend.
It connects to the MySQL OpenMRS database to read the table information and then:
Fetches the row insert events occurring on the database
Decodes these events, extracts data inserted into each field of the row, and uses content handlers to get it in the format required
Appends it to a text file in HDFS.
Databases are mapped as separate directories, with their tables mapped as sub-directories with a Hive data warehouse directory. Data inserted into each table is written into text files (named as datafile1.txt) in Hive / HDFS. Data can be in comma separated format; or any other, that is configurable by command line arguments.
Figure 2: Mapping between MySQL and HDFS Schema
Exporting Tables From MySQL To Hive
Manual Loading - Export MySQL Table as csv data and Import that csv data to Hive tables.
Sqoop
Tungsten Replicator
| Manual via CSV | Sqoop | Tungsten Replicator |
Process | Manual/ Scripted | Manual/ Scripted | Fully automated |
Incremental Loading | Possible with DDL changes | Requires DDL changes | Fully supported |
Latency | Full-load | Intermittent | Real-time |
Extraction Requirements | Full table scan | Full and partial table scans | Low-impact binlog scan |
MockUp UI
First we try to perform a sample ETL from a MySQL DB to another MySQL DB and Export it to Hive.