Data Synchronization Project

This page refers to a Summer Of Code 2007 project. Documentation and code have been placed into the Sync Module for general use.

Background

Intern: Anders Gjendem
Mentor: Maros Cunderlik
Other developers: Christian Allen, Darius Jazayeri, Julie

Abstract: As described under the remote data entry project and mobile data collection projects, we have a need to synchronize local data storage with the central servers. Local data storage may entail a fully functioning OpenMRS instance or perhaps a scaled down version of the server. While a complete end-to-end synchronization solution featuring master-master replication, conflict resolution, contextual data replication, automatic schema updates is clearly beyond the scale of a summer project, significant contributions toward this critical feature could be made.

In the initial phase of this effort, we would like to implement data synchronization under the assumption that the remote data entry site would be able to add new patients and encounters but not modify existing records. As such, it is assumed that there will be only a single 'master' copy of data at any given point and thus conflict detection and resolution are not needed at this point. Consequently, the synchronization in this context can be viewed as export, transfer, and import (analogous to ETL techniques used in data warehousing). Unlike ETL (which generally deals with the problem domain of the one directional transform between the different E-R schemas and data semantics) however, it is assumed and expected that the advanced synchronization features will be added to OpenMRS in the near future. Thus the broader consideration and awareness of the challenges in bi-directional replication and contextual replication are desired and expected to be reflected in the proposed solution.

Project News

The latest source in SVN, branch: data_synchronization_bidirectional. We are preparing to move the branch to trunk however it will not happen until after 1.4 release. The branch code is running in Rwanda in production use. Pls. see the following presentation of the feature overview:

Unable to render embedded object: File (RITA_&_Sync.pptx) not found.

Task Backlog

Tasks to complete by 1.0 release:

  1. Darius 'one-click' process of creating a child database that’s going to sync to a given parent.
    1. status: in-progress: initial code committed, needs work
    2. This should also support ability to 'dump and refresh' client DB from parent (i.e. after exporting sync journal, rebuild client DB from parent)
    3. I assume we duplicate the parent database, give the child a new GUID, and that’s all?
    4. idea from Christian: it would automatically adjust global properties for the new instance, or at least prompt you for a minimal amount of changes to make.
  2. TBD Handle deletes in interceptor
  3. Maros Fix/replace the activateTransactionSerialization() behavior (doesn't work in case of session.merge() as in updateSyncImportRecord())
  4. TBD Documentation
    1. online user help: done
    2. TBDwiki detailed user docs
    3. TBD wiki detailed tech docs
  5. all Code/pkg clean-up.
    1. Maros Split add_sync.sql to DDL and DML; add server guid generation to it.
    2. Move several common files (i.e constants) from org.openmrs.synchronization.engine to org.openmrs.synchronization
    3. Clean-up/consolidate the state handling of sync transmission/sync record
    4. Comment the code
    5. Collectively look at/review bidi code (i.e. sending data in both direction): is it ready to be used?
    6. look into bug that Ben saw Maros will do
    7. Maros there seems to be a bug with syncTx serialization when exporting/processing form edits; i.e. I edited Form properties and corresponding syncTx failed to create valid XML file – some serialization issues most likely
  6. TBD LastRecordGuid/OriginalGuid - needs to be cleaned up/finished; as of now it is just a dup of SyncRecord.guid
  1. ALL Testing.
    1. Minimally, create a list of scenarios that need to be tested and passed to release sync
    2. Automated test suite. First, using latest in-mem db testing facilities, begin creating automated 'integration'/user scenario tests (i.e. add patient, sync, verify data came across). This will probably not get fully done by 1.0 release, but we should start

Tasks with unassigned priorities

  1. compression: SyncTx are getting large in a hurry
  2. journal/data archive
  3. Error handling/reporting to user in scheduled task
    1. If error occurs in scheduled sync, where should it be reported? Currently, it will throw an exception on the background thread.
  4. Consolidation of the different hardcoded domain object instantiation schemes and metadata.
    1. For example:
      1. in SyncUtil.updateOpenmrsObject, .getOpenmrsObj()
      2. in SynchronizationHistoryListController.referenceData()
      3. SyncClass
      4. HibernateSynchronizationDAO.createDatabaseForChild
    2. Priority: post 1.0 release most likely?

Post 1.0 release:

  1. Conflict resolution.
  2. Code re-factor.
    1. TBD Consolidate some of the processing code from controllers to sync domain classes.
  3. Sync transmission tracking/control.
  4. Ability to edit sync journal entries for UI.

'Done' List:

  1. DONE Maros Rename ISynchronizable to Synchronizable.
  2. DONE All Settle of the approach to avoiding redundantly sending changes that just came from child back out to the same child (i.e. preparing for bi-directional sync)
    1. we will retain same guid for a sync record on parent when sync record is processed; having always the same quid (on all nodes) for a change that was generated on a given node will allow us easily detect duplication
  3. DONE Review the current approach to maintaining Tx boundaries and turning 'off' the change recording: see current use of threadlocal per Anders
  4. DONE Handle prepare stmts in interceptor (or weed them out of the DAO/service code)
  5. DONE Merge from Alpha
    1. to be done by Dec 2nd
  6. DONE Add ability to run sync in 'strict' mode where on error exception is thrown and any pending op is aborted
    1. in-progress
    2. add error handling in place of log.error/warn

Weekly conference call

Weekly Friday calls, starting again Dec 7th
We have decided to (at least for now) conduct weekly conf. calls over Skype. It is proven to be useful to be able to weekly touch base in group setting; weekly scrum so to speak. Usual attendees: archive:user__maros, archive:user__Anders, archive:user__Djazayeri, archive:user__callen.

Regular schedule:

  • Friday, 10:30AM EDT

Notes for Friday calls:

Dec 2007 7 | 14 | 21 |
July 2007 6 | 13 | 20 | 27
June 2007 8 | 15 | 22 | 29

Design Specifics

This section speaks to the design specifics of data synchronization for OpenMRS.

Roadmap

We're tackling this roughly in the following order:

  1. Implement Uni-directional children->parent sync with foundation towards bi-directional sync. Release to field as v1.0.
  2. Bi-directional sync between children<->parent and basic conflict resolution framework.
  3. Bi-directional sync in tree of parent nodes.

Future considerations:

  • Sync across compatible but not equivalent DB schema(s); DB schema synchronization
  • Sync for modules and other ad-hoc data structures.
  • Contextual replication (i.e. based on location hierarchy)
  • Advanced conflict detection and resolution (i.e. patient matching algorithm as part of conflict resolution during the sync)
  • Data sync protocol in tree/graph network topology of loosely connected systems: bayou (http://www2.parc.com/csl/projects/bayou/)

Version 1.0: One-way sync

Use Need:
The goal of this release is to provide a 'one way' sync of information from a 'child' node/installation to a parent server. Both child and parent are 'normal' functioning instances of openmrs.

Use Scenario - Setup:

  1. Child clinic begins with exact copy of parent's database.
  2. Sync is enabled on child and parent: update script is run to create necessary sync data structures, unique server IDs are assigned.
  3. Decide on the method of sending data to parent: if via web, setup info about the parent server in child's sync configuration.
  4. Once upgrade scripts are run and instances configured, sync is 'enabled' on child. From that point on, child will keep track of all changes destined to be sent to server.

Use Scenario - Send data from child parent:

  1. Send.
    1. At child: Accumulated sync changes are ready to be 'sent' to parent server. Display records that are to be send: select all not 'already' committed to parent change records for send.
      1. Client changes are to be tracked, and applied to server according to the established transaction boundaries.
      2. Client changes for the following 'domains' are not be sync-ed: any synchronization data structures, global properties, scheduler.
    1. At child: If send method is file, generate a sync transmission file and save it locally.
    2. At child: If send method is web: if parent server info not configured, report error
    3. At child: If send method is web and parent server info configured, open 'live' connection to parent server and post the sync transmission content to parent server for processing
  1. Apply sync transmission at parent.
    1. At parent: If sync transmission mode was file, upload the sync transmission file received from the child and prepare for processing.
    2. At parent: If sync transmission mode was web, accept connection from client and receive the contents of the sync transmission, prepare for processing.
    3. At parent: Process received data: Read through all records send to the parent and apply them to the parent database in order they were received.
  2. Send & process sync transmission confirmation:
  3. At parent: Upon processing the sync transmission from child, for every record received and applied generate a confirmation record indicating the record commit state (committed, aborted).
  4. At parent: Based on the apply sync method (i.e. file vs. web) create appropriate sync transmission response container (i.e. file vs. http response, respectively).
  5. At child: Receive the sync transmission response container, for each record in the response update the status of the source sync record accordingly.

Assumptions/limitations:

  • before sync is enabled, child and server are in 'sync'; they both start with exact same copy of the data
  • the relationship between child and parent, once established, is static and immutable: one cannot associate child with the alternate/another server

Change tracking mechanism: Change journal and hibernate interceptor.

Network transfer implementation:org.openmrs.synchronization.server

Apply changes @ Parent:org.openmrs.synchronization.ingest

Synchronization algorithm, conflict resolution: org.openmrs.synchronization.engine

Scenarios, Use Cases needs update

The primary scenario we are trying to address:

  • Bidirectional synchronization of data between one parent and several children/satellite OpenMRS installations. Under this scenario, there is exactly one parent installation with several children. While children installations are fully functional OpenMRS server installations logically they are children of the parent node. Specifically, depending on the scope of implementation, we may choose to restrict some operations (such as create new clinic) to parent node only. Every child node is associated with exactly one and always the same parent. Furthermore, the parents can be connected to other parents forming tree topology where children are leaf nodes and parents are internal tree nodes. Other desired characteristics and limitations:
  1. parent and children can create, update, delete data
  2. there is a 'defined' set of conflict resolution rules and process (i.e. merge procs or policies) for resolving them (what that is is TBD.)
  3. in scope of initial release, conflict resolution mechanisms will support detection of logical duplicates: i.e. if patient with the same patient identifier (not the same DB PK) is entered at a parent and child; the sync process would detect such conflict
  4. All nodes are are assumed to have same DB schema
  5. Replication is triggered by an event (user intervention or scheduled event) and isn't expected to be real-time or near real-time
  6. While replication is in progress, system, or portions of the system should be operational
  7. Transactional consistency: in other words, atomic 'write' operations performed on satellite in scope of DB transaction should be applied at parent also in scope of the transaction (note this isn't the same as supporting DB isolation levels across parent and children)
  8. Performance considerations: we will set specific performance expectations with respect to network latency, and impact to parent/child OpenMRS installations. In general, at this point, we are targeting scenarios where parent may have 10s (not 100s) of children with total patient population of max 100,000s (not 1,000,000s)
  9. Additional desired functionality which however may not be initially included:
    1. Synchronization of modules data: We would like to expose sync facilities to module developers.

In addition to this primary scenario, the other urgent (alas less complex) use case is Remote Data Entry. Here disconnected (or loosely connected) client application synchronizes its local data store with a parent OpenMRS installation. Desired characteristics and limitations:

  1. Local client app is 'skeleton' OpenMRS install that allows entry of forms and their submittal to the parent server.
  2. Minimal subset of the data model is replicated to the client (i.e. clinic/patient list) to support form entry.

Data conflict use cases

The key to the making design choices is the understanding of the data 'conflict' scenarios and expected system behavior. This is often described as data consistency or write stability guarantees. Follow the section link to find out more.

Open Questions and Issues needs update

*Serialization format.*We have been having initial discussion about what serialization format to use to capture the changes to data. Naturally, XML would be nice but several questions come to mind right away:

  • Performance: will we be efficient enough on the wire?
  • Serializer choice: from onset, it seems the common methods (i.e. XMLEncoder or JAXB) may not work out of box (i.e. circular references in object model)
    The following summarizes the current thinking on this topic (email snippet by archive:user__maros): As far as the exact mechanism of XML serialization, I think it is quite OK for us to roll our own if there are good reasons why we can't use std. mechanisms. However more generally, I think it is important to perform serialization at the core object model level. Do we agree that since we do need some data abstraction to represent the 'payload' of the write operations being sync-ed between two nodes, using org.openmrs package is the preferred choice? From my standpoint the main motivation being:
    a) ideally the sync code would manipulate the same domain objects as the rest of the codebase so that the code is fairly familiar to anyone familiar with openmrs core API
    b) serialization of the core domain model is something useful beyond just sync
    c) we use OR mapper for DB access therefore we, at this point, don't have any other convenient data abstraction or update mechanisms (i.e. JDBC and not that we couldn't create one) for representing and saving 'location record' without referencing object model: DAOs work in terms of org.openmrs.

That said, I don't want to get hang up on XML serialization as defined in XMLEncoder. As far as I am concerned, we can override Serializable and in read/Write object methods call XMLEncoder if appropriate, if not hand-code our own xml-based format. I agree that the problem of XML serialization of the domain model per se is not central to getting sync done thus there is a potential of it becoming a distraction: chasing down the object graph serialization issues can be time consuming. Similarly, as far as the data packet size there are more efficient alternatives.
– email snippet ends here –

Global data record/identity tracking(aka GUID or not to GUID). We have been pondering the problem of synchronizing primary keys. Given that we don't have GUIDs in the data model today, we wondered if we do need them and what are the consequences of not having them. During that discussion, this thought came up: what if we just sync data in tables based on equivalency irrespective of PKs? For example, if two location records has the same address then they are logical equivalent and thus should be sync-ed. This seems straightforward enough in simple case, however at this point we are not certain of this would work for all data. For example, having two patient records with different patient.patient_ids same patient_identifier.identifier seemingly wouldn't work since patient_identifier has patient_id as FK thus we could physically make this work.

Global Ordered record history. In disconnected system of nodes, each with individual physical clock, how do we know the global record ordering? In ideal world, all children and servers would have perfect clock synchrony – this of course is not realistic assumption.

Implementation Plan OBSOLETE

archive:work in progress - needs to be rewritten

High-level project plan and tasks bellow. The general idea is to begin with the basic framework for change detection and serialization that can be refined in future.

Phase 1: Sync 'Lite' implementation

Goal

Skeleton framework for detecting changesets, serialization, transfer, deserialization, and hooks for conflict
resolution.

Primary use cases

  • Server to server data transfer (bidirectional). In two installation of OpenMRS one is acting as 'parent' and the other one as 'satellite'; see design section for scenario description.
  • Remote patient data entry. Form data entry is performed at the remote site with occasional connectivity to the main OpenMRS 'server'. The remote site is running a 'skeleton' version of the OpenMRS software locally to allow entry and storage of the forms locally while the site is disconnected from its corresponding server. When online, the data entered locally is to be transfered to the parent server and in return local site receives minimal updates to common data (i.e. concepts, clinic and patient list).

Design

Key elements of solution:

  • Changeset tracking and detection at the satellite and parent
  • Changeset transfer and transfer protocol; call it a transfer packet
  • Parent endpoint logic: receive transfer packet and transfer packet decode to satellite changeset
  • Apply Changeset @ parent while maintaining transactional boundaries
  • Conflict detection and resolution, merge procedures, status/failure/rollback notifications
  • Apply parent changeset at satellite

Consequently, several key elements of design:

  • changeset detection mechanism
  • changeset data serialization (for example XML serialization of org.openmrs)
  • transactional consistency of the serialized entities (what is the proper sequence of 'write' operations)
  • data transfer protocol
  • parent endpoint: web service and parent endpoint implementation: mechanism for parsing the entities and re-applying them to the server via hibernate
  • conflict detection and resolution

Proposed Data Model Changes

The changes we'd like to make the database can be divided into 2 phases, or performed all at once as part of Sync Lite Phase I. They are detailed below.

Phase I - Addition of History Tables

The idea is that every table in the database will have a twin "history" table. This table will contain a row for every version of a single row in the original table.

Visual example:

Advantages:

  • Full auditable history of every table in database
  • Origin tables are maintained - allowing existing keys and relationships between entities to remain unaltered.
  • History tables have an algorithm
    • At DB level, a history table is the same as its origin table + 2-4 extra columns
    • At the code level, a history object can extend an origin object, and then implement the History interface - simple to understand
  • Allows easier synchronization between different databases, since all changes to objects are timestamped

    Phase II - Remove Extraneous Attributes from All Tables

    Once we have history tables, the current auditing information becomes obsolete, as we are now collecting much more auditing data. In fact, history tables contain a superset of the current auditing information from the auditing attributes in the various tables. One consideration, though, is that many tables contain a "void_reason" attribute. We may need to include equivalent information or perhaps an "action_reason" attribute in the history tables so as not to lose this valuable information.

Visual example:

Advantages:

  • More efficient (we think) - the origin tables, the tables most often used for queries, will have less attributes and less keyed relationships - leading to better performance
  • Easier to read for new developers / the development community at large
    Less code and less confusing code - before this second step, we will have to set all old auditing attributes when a change is made, *and add information to the auditing table - essentially duplicating and confusing the process.

Revised, Proposed Data Model Changes

The changes we'd like to make the database have been revised, and are depicted below.

Concept:

Encounter type:

Obs:

Person: