Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Project Champions

 

Objectives

Peform ETL in web app from MySQL to Hadoop and perform predictive analysis over it.

Approach

I am dividing the whole project in 2 stage Process

A. First Stage Process -

I. ETL can performed in following ways -

  • Non realtime: Sqoop
  • Realtime: Hadoop Applier for MySQL


Which is performed over: OpenMRS Database  to Enterprise Data-warehouse (EDW)

II. Open Source Tools for ETL - Petaho, Talend, CloverETL, JasperSoft etc.
Where Pentaho and Talend are in my first preferences because Pentaho is faster and easier to use because of its GUI while Talend is more a tool for people who are making already a Java program and want to save lots and lots of time with a tool that generates code for them.

III. Three way to integrate it in OpenMRS -

 

 

 

1. Integration in the Web Application (Dynamic) - We can perform ETL in OpenMRS web application at some regular interval of time. Where a JS library is required which is integrated in JSP pages.

 

 

 

2. Integration in the Server i.e. OpenMRS Standalone (Static) - When any one start the server he/she runs the OpenMRS Standalone. We can perform ETL in standalone at some regular interval of time. Where we have some menu items in Standalone like :
ETL Tool Menu-
Force Update - Escaping whatever the scheduled time and performs ETL
Set Interval - Set interval of time after which a ETL is performed.

 

 

 

3. OpenMRS ETL Tool-kit - This is separate application that doesn't require the running of server. It will have the same functionality as first way integration with some extending capability. The benefit of this method is we can also use other cross platform implementation other than java like in C# for performing ETL. This gives the independency of language selection.

 

 

 

IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop as it uses a table concept and have a SQL like language, HiveQL

 

MySQL to Hadoop ETL

 

 

 

ETL via the Apache Hive is implemented by connecting to the MySQL OpenMRS database and reading tables information, and writing them into a file in HDFS.

 

MySQL to HDFS IntegrationImage Added

 

Figure 1: MySQL to HDFS Integration

 

 

 

This integration uses a JS Library where user drag and drop objects somehow similar to what one can do in informatica and then generate a sql which will be executed at backend.

 

 

 

It connects to the MySQL OpenMRS database to read the table information and then:

 

 

 

  • Fetches the row insert events occurring on the database
  • Decodes these events, extracts data inserted into each field of the row, and uses content handlers to get it in the format required
  • Appends it to a text file in HDFS.

 

 

 

Databases are mapped as separate directories, with their tables mapped as sub-directories with a Hive data warehouse directory. Data inserted into each table is written into text files (named as datafile1.txt) in Hive / HDFS. Data can be in comma separated format; or any other, that is configurable by command line arguments.

 

Mapping between MySQL and HDFS SchemaImage Added

 

Figure 2: Mapping between MySQL and HDFS Schema


B. Second Stage Process -
I. After a ETL is performed, now for Second Stage we require two component
1. EDW
2. OpenMRS Predictive Analysis Module.
II. Performing Predictive Analysis - we can use Apache Mayhout for implementation.

Extra Credit

1. Adding a UI interface for doing ETL similar to what Informatica has.
2. Providing a interactive UI to analyze the predictive modeling results coming out from the DW compliance.

...

  1. Data Warehousing : http://en.wikipedia.org/wiki/Data_warehouse
  2. ETL ( Extraction - Transformation - Loading) : http://en.wikipedia.org/wiki/Extract,_transform,_load
  3. Apache Hive : http://hive.apache.org/
  4. Apache Mahout : https://mahout.apache.org/

I am dividing the whole project in 2 stage Process

A. First Stage Process -

I. ETL can performed in following ways -
1. Non realtime: Sqoop
2. Realtime: Hadoop Applier for MySQL

Which is performed over: OpenMRS Database  to Enterprise Data-warehouse (EDW)

II. Open Source Tools for ETL - Petaho, Talend, CloverETL, JasperSoft etc.
Where Pentaho and Talend are in my first preferences because Pentaho is faster and easier to use because of its GUI while Talend is more a tool for people who are making already a Java program and want to save lots and lots of time with a tool that generates code for them.

III. Three way to integrate it in OpenMRS -

 

1. Integration in the Web Application (Dynamic) - We can perform ETL in OpenMRS web application at some regular interval of time. Where a JS library is required which is integrated in JSP pages.

 

2. Integration in the Server i.e. OpenMRS Standalone (Static) - When any one start the server he/she runs the OpenMRS Standalone. We can perform ETL in standalone at some regular interval of time. Where we have some menu items in Standalone like :
ETL Tool Menu-
Force Update - Escaping whatever the scheduled time and performs ETL
Set Interval - Set interval of time after which a ETL is performed.

 

3. OpenMRS ETL Tool-kit - This is separate application that doesn't require the running of server. It will have the same functionality as first way integration with some extending capability. The benefit of this method is we can also use other cross platform implementation other than java like in C# for performing ETL. This gives the independency of language selection.

 

IV. Selection of EDW - Using Apache Hive. Apache Hive is probably the best way to store data in Hadoop as it uses a table concept and have a SQL like language, HiveQL.