Support for Clustering

Background

In order to scale OpenMRS for large implementations and maintain high reliability, we need to support clustering of the OpenMRS API and web application.

See TRUNK-314

Design Considerations

We found some interesting lessons by looking at Atlassian's notes on clustering (overview, checklist, technical overview, cluster safety, developers guide to clustering). For example:

  • They use a replicated cache (Oracle Coherence cache, formerly known as Tangosol Coherence) – see image on this page

  • Clustering yields higher reliability, not higher availability (e.g., all nodes must come down if they get disconnected from each other or for upgrades)

  • All nodes must have the exact same version of OS, application server, Java, etc.

  • Database is on a separate server

  • Data typically stored in the file system, including large binary files, is stored in the database instead. For us, this means that features whose data has been moved to the file system (HL7 archiving, FormEntry queues, etc.) must be able to optionally keep that data in the database again. Other metadata (like config files and resources) may also need to be movable into the database. Any changing data that is not moved into the database will need special code to keep it synchronized across nodes.

  • Extra work is needed to communicate between nodes (e.g., see the Indexing section on this page)

  • Queues (like HL7 queue) need to support locking

  • We'll need a way to issue a cluster panic – i.e., shut things down immediately if split-brain occurs.

  • We'll need a caching API (not only for core code, but for plugins to use)

  • Scheduled tasks need to be coordinated across nodes

  • We cannot rely on typical Java locking (e.g., synchronized blocks), because it only coordinates threads within a single JVM; we need a cluster-aware approach instead (see Cluster-wide locks section on this page)

  • Event handling must work across nodes, which means that AOP must support clustering or we will have to use another mechanism for capturing API events.
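
To make the points about queue locking and cluster-wide locks more concrete, here is a minimal sketch of a lease-based lock that nodes could claim through a shared store. This is not an existing OpenMRS API: SharedLockStore, InMemoryLockStore, and the "hl7-queue" lock name are hypothetical names chosen for illustration. The in-memory map only stands in for whatever all nodes can actually see (for example, a database table or a replicated cache). The caller passes in the current time so the example is deterministic; a real cluster would need a shared time source such as the database clock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ClusterLockSketch {

    /** Hypothetical abstraction over a store that every node in the cluster can reach. */
    public interface SharedLockStore {
        /** Atomically claims the lock if it is free, expired, or already held by nodeId. */
        boolean tryClaim(String lockName, String nodeId, long nowMillis, long leaseMillis);

        /** Releases the lock only if nodeId is the current owner. */
        boolean release(String lockName, String nodeId);
    }

    /** In-memory stand-in for the shared store (assumption: real nodes would share a database or cache). */
    public static class InMemoryLockStore implements SharedLockStore {

        private static final class Lease {
            final String owner;
            final long expiresAt;

            Lease(String owner, long expiresAt) {
                this.owner = owner;
                this.expiresAt = expiresAt;
            }
        }

        private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

        @Override
        public boolean tryClaim(String lockName, String nodeId, long nowMillis, long leaseMillis) {
            Lease candidate = new Lease(nodeId, nowMillis + leaseMillis);
            // compute() runs atomically per key, so only one claimant can install its lease.
            Lease winner = leases.compute(lockName, (name, current) -> {
                boolean free = current == null || current.expiresAt <= nowMillis;
                boolean renewal = current != null && current.owner.equals(nodeId);
                return (free || renewal) ? candidate : current;
            });
            return winner == candidate;
        }

        @Override
        public boolean release(String lockName, String nodeId) {
            Lease current = leases.get(lockName);
            // remove(key, value) succeeds only if the lease is still the one we inspected.
            return current != null && current.owner.equals(nodeId) && leases.remove(lockName, current);
        }
    }

    public static void main(String[] args) {
        SharedLockStore store = new InMemoryLockStore();
        System.out.println(store.tryClaim("hl7-queue", "node-a", 0L, 1000L));    // true: lock was free
        System.out.println(store.tryClaim("hl7-queue", "node-b", 500L, 1000L));  // false: node-a's lease is live
        System.out.println(store.tryClaim("hl7-queue", "node-b", 1500L, 1000L)); // true: node-a's lease expired
        System.out.println(store.release("hl7-queue", "node-a"));                // false: node-a is no longer owner
        System.out.println(store.release("hl7-queue", "node-b"));                // true: owner releases
    }
}
```

The lease expiry is what lets another node take over a queue if the owning node crashes without releasing the lock. It does not by itself address split-brain, which is why a separate cluster panic mechanism is still listed above.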