Agreed Requirements
Task 1 - Incorporate a process to validate de-duplication strategies.
Task Description
Configuring a de-duplication strategy to find potential duplicates is a moderately complex task. If configured incorrectly, the de-duplication process may fail, or the linkage results may be inaccurate. To ensure a properly configured linkage approach, we will incorporate a validation process that highlights errors in the linkage configuration. Examples of invalid matching configurations include undefined blocking field(s), undefined matching fields, etc.
Solution
Validate each criterion and help the user fix any issues before the matching criteria are saved.
Approach
For new strategies
A checklist will be added to the left of the "Save" button at the bottom of the page. The checklist will display a list of errors and warnings; errors will be shown in red and warnings in green. The list will be updated dynamically after each user action. The "Save" button will be disabled initially and will remain disabled as long as the strategy contains errors. Once the user fixes all the errors, the "Save" button will be enabled and the user will be allowed to proceed.
When the user proceeds to save, a confirmation window will be shown. It will include any warnings about the strategy that remain in the checklist, along with other details (related to other tasks). If the user is happy with the strategy, he/she can save it (even if there are warnings); otherwise he/she can go back and modify the strategy further.
Mock UI for the task:
For existing strategies
Existing strategies will be checked when the module is updated, and the user will be prompted to correct any strategy that fails validation. (Needs a feasibility study.)
Additional notes
The space used by the checklist will also be used to display details related to other tasks (e.g. the number of pairs the strategy would create, as in Task 2) as necessary.
Not specifying at least one "Must match" field and at least one "Should match" field will be considered an error. The user needs to specify at least one of each to save a strategy.
The warning situations will be identified by Task 4.
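The error rule in the notes above can be sketched as a small validation routine. This is only an illustration under stated assumptions: the role constants, `ValidationResult`, and `validate_strategy` are hypothetical names, not part of the module, and the warnings list is left empty because the warning rules are deferred to Task 4.

```python
from dataclasses import dataclass, field

# Hypothetical role constants; they mirror the "Must match" / "Should match"
# options named in the notes, but the identifiers themselves are assumptions.
MUST_MATCH = "MUST_MATCH"
SHOULD_MATCH = "SHOULD_MATCH"
IGNORE = "IGNORE"


@dataclass
class ValidationResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)  # to be filled by Task 4 rules

    @property
    def can_save(self):
        # "Save" stays disabled while any error remains; warnings do not block.
        return not self.errors


def validate_strategy(field_roles):
    """Validate a strategy given as a {field_name: role} mapping."""
    result = ValidationResult()
    roles = set(field_roles.values())
    if MUST_MATCH not in roles:
        result.errors.append('At least one "Must match" field is required.')
    if SHOULD_MATCH not in roles:
        result.errors.append('At least one "Should match" field is required.')
    return result
```

Re-running `validate_strategy` after every user action and binding the "Save" button to `can_save` would give the dynamic checklist behaviour described in the approach.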
Task 2 - Incorporate a process to calculate total number of potential pairs formed by particular blocking strategy.
Task Description
The de-duplication module searches through "record pairs" that have a high likelihood of being duplicates. "Record pairs" are formed using "blocking strategies", which are simple approaches to finding similar records by requiring that one or more corresponding fields agree exactly between two records. Occasionally, however, the user may choose a blocking strategy that results in a very large or very small number of record pairs. Widely varying numbers of record pairs can cause unexpected behaviour, including out-of-memory errors, excessively long runtimes, and confusing or inaccurate results.
Solution
Give the user an estimate of how many potential pairs the strategy would produce before he/she uses it.
Approach
Estimate how many record pairs the strategy would produce. The estimate will be calculated when the user saves the new strategy, and the user will be shown how many pairs the new strategy would produce. If the number of record pairs is more than 10 times (this is configurable) the original number of record pairs, the user will be warned to consider changing the strategy.
The total number of records (at the time of saving the strategy) is also saved with the strategy. When the user runs the de-duplication process, a strategy will be re-estimated if the total number of records has changed by more than 10% (configurable) of the number it had when the strategy was created.
If a newly created strategy has the same blocking fields as an existing strategy, the new strategy will reuse the existing estimate instead of recalculating it.
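One way to compute the pair count is to group records by their blocking-field values and sum n(n-1)/2 over the groups. The sketch below is an assumption about how the estimate could be computed, not the module's actual method; `count_record_pairs` and `exceeds_pair_limit` are hypothetical names, records are modelled as plain dictionaries, and records missing a blocking value are skipped as a simplification.

```python
from collections import Counter


def count_record_pairs(records, blocking_fields):
    """Count the pairs formed when records must agree on every blocking field.

    Records with the same blocking values fall into one block, and a block
    of n records yields n * (n - 1) / 2 pairs.
    """
    block_sizes = Counter()
    for record in records:
        key = tuple(record.get(name) for name in blocking_fields)
        if None in key:
            continue  # assumption: records lacking a blocking value form no pairs
        block_sizes[key] += 1
    return sum(n * (n - 1) // 2 for n in block_sizes.values())


def exceeds_pair_limit(pair_count, baseline, factor=10):
    """Return True when the user should be warned.

    `baseline` is whatever count the comparison is made against; `factor`
    corresponds to the configurable "10 times" threshold.
    """
    return pair_count > factor * baseline
```

For example, three records sharing the same family name and birth year form one block of three and therefore three pairs, while a record alone in its block contributes none.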
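The re-estimation and reuse rules can be sketched as follows. The names `needs_reestimate` and `EstimateCache` are hypothetical, and keying the cache on the unordered set of blocking fields is an assumption about what "same blocking fields" means.

```python
def needs_reestimate(saved_total, current_total, tolerance=0.10):
    """Return True if the record count drifted by more than `tolerance`
    (the configurable 10%) relative to the count saved with the strategy."""
    if saved_total == 0:
        return current_total != 0
    return abs(current_total - saved_total) > tolerance * saved_total


class EstimateCache:
    """Reuse one estimate for all strategies with the same blocking fields."""

    def __init__(self):
        self._by_fields = {}

    def get_or_compute(self, blocking_fields, compute):
        # Field order is ignored (an assumption): {a, b} equals {b, a}.
        key = frozenset(blocking_fields)
        if key not in self._by_fields:
            self._by_fields[key] = compute()
        return self._by_fields[key]
```

Here `compute` is any zero-argument callable that performs the expensive estimation; it runs only the first time a given set of blocking fields is seen.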
Mock UI:
Task 3 - Upgrade the de-duplication reports from flat files to database persistence.
Task Description
The de-duplication module creates reports listing potentially duplicate records, which end-users can manually review and merge when necessary. Until recently, these "de-duplication reports" were stored as flat files. Unfortunately, flat files limit the ability to manage the data and hinder new creative ways to display the data. Therefore, upgrading from flat files to persisting the data in a relational database will help users and developers more meaningfully use this data.
Solution
Implement functionality that persists reports to a relational database instead of flat files.
Approach
ER Diagram
Database Schema
patientmatching_report{
id
name : varchar(50)
createdBy refers user.id
createdOn : datetime
}
Implemented by now
patientmatching_matchingset{
reportId refers patientmatching_report.id
groupId
patientId refers patient.id
state
}
The "patientmatching_matchingset" table will store the sets that are identified as duplicates. The groupId identifies the duplicate groups within a report. The state column records whether the match is "IGNORED", "MERGED", or "PENDING".
Partially implemented
patientmatching_report_configurations{
reportId refers patientmatching_report.id
configurationId refers patientmatching_configuration.id
}
This table will store the configurations (strategies) that are used for the reports.
Implemented
patientmatching_report_process_time{
reportId refers patientmatching_report.id
processSequenceNo
process : varchar(50)
timetaken
}
Partially implemented
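To make the schema concrete, here is a minimal sketch that creates the four tables in an in-memory SQLite database and reads back the pending duplicate groups of a report. Only the column types given in the schema above (`varchar(50)`, `datetime`) come from this page; every other type, the primary key, the `CHECK` constraint, the sample rows, and the helper `pending_groups` are assumptions for illustration. References to the `user`, `patient`, and `patientmatching_configuration` tables are left as plain integer columns because those tables are not defined here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE patientmatching_report (
    id        INTEGER PRIMARY KEY,
    name      VARCHAR(50),
    createdBy INTEGER,             -- refers user.id
    createdOn DATETIME
);
CREATE TABLE patientmatching_matchingset (
    reportId  INTEGER REFERENCES patientmatching_report(id),
    groupId   INTEGER,
    patientId INTEGER,             -- refers patient.id
    state     VARCHAR(10) CHECK (state IN ('IGNORED', 'MERGED', 'PENDING'))
);
CREATE TABLE patientmatching_report_configurations (
    reportId        INTEGER REFERENCES patientmatching_report(id),
    configurationId INTEGER        -- refers patientmatching_configuration.id
);
CREATE TABLE patientmatching_report_process_time (
    reportId          INTEGER REFERENCES patientmatching_report(id),
    processSequenceNo INTEGER,
    process           VARCHAR(50),
    timetaken         INTEGER      -- unit is an assumption (e.g. milliseconds)
);
"""


def pending_groups(conn, report_id):
    """Return {groupId: [patientId, ...]} for matches still marked PENDING."""
    rows = conn.execute(
        "SELECT groupId, patientId FROM patientmatching_matchingset "
        "WHERE reportId = ? AND state = 'PENDING' "
        "ORDER BY groupId, patientId",
        (report_id,),
    )
    groups = {}
    for group_id, patient_id in rows:
        groups.setdefault(group_id, []).append(patient_id)
    return groups


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample data, purely for demonstration.
conn.execute(
    "INSERT INTO patientmatching_report (id, name, createdBy, createdOn) "
    "VALUES (1, 'demo-report', 1, '2000-01-01 00:00:00')"
)
conn.executemany(
    "INSERT INTO patientmatching_matchingset VALUES (?, ?, ?, ?)",
    [(1, 1, 101, "PENDING"), (1, 1, 102, "PENDING"),
     (1, 2, 201, "MERGED"), (1, 2, 202, "MERGED")],
)
```

Queries like `pending_groups(conn, 1)` are the kind of data management that is awkward with flat files and straightforward once the reports live in a relational database.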
Project Timeline
Task | Projected completion time | Comments |
---|---|---|
Task Number 1 | One week (Deadline 28th May) | This is a basic task that is ideal to start with. We allocated one week to it considering that this is the very first task, and that it lays the initial groundwork for tasks 2 and 4. Furthermore, it will be an ideal opportunity to get familiar with OpenMRS coding conventions |
Task Number 2 | Two weeks (Deadline 11th June) | Completing task one should prepare the student for task two, which is more complicated. We have allocated two weeks because this task will also include a) extensive testing for data accuracy b) the first work involving Hibernate |
Task Number 3 | Two weeks (25th June) | Requirements for this task have not yet been assessed. Therefore, we can only present a 'projected' deadline. However, considering that this task is already half completed, and that task 2 allowed the student to get familiar with Hibernate, we assume that a two-week period will be satisfactory for now |
Task Number 4 | (Projected - three weeks) | To be finalized |
Task Number 5 | | To be finalized |
Task Number 6 | | To be finalized |