High Availability: a mandate for success
As CEP has moved into
the
mainstream of mission critical business applications it brings rise to
a whole set of new challenges for the purveyors of CEP
technologies. There is a mandate for software platforms that
want
to play in the enterprise class big
leagues. That mandate encompasses many capabilities over and
above the core functionality of a platform, be it CEP or any other
software technology. Witness the maturity of common infrastructure
platforms today, whether it's messaging, databases, application servers
or other, they all share a set of capabilities or ilities
necessary to be considered in the enterprise
class.
A software platform typically starts its life with an initial appeal of
either providing greater business value and/or improved developer
productivity. As it moves from being the cool new toy
on the bleeding edge, employed for new and emerging use cases, it
starts to get more widely deployed in critical business functions. As
this maturity occurs, the IT user community wants... demands
that the core set of its capabilities be broadened to better
support the deployed production environment. Two
primary areas
are Management and Monitoring
and High Availability.
As a brief description, a highly
available platform is one that is tolerant of
faults. Faults or failures can happen to either software or
hardware components. A system that can detect failures and recover from
them is considered to be highly available. There is a tremendous amount
of academic research and technology focused on fault
tolerant
systems. Availability is the tolerance to failure achieved through
redundancy and/or persistence. My intent here is to simply provide a
pragmatic perspective to
the challenges and solutions for high availability in CEP.
Lost Opportunity vs.
Loss-Less
As a broad categorization one can divide CEP applications into two
groups. The first group is rather broad
itself and encompasses use-cases such as sensor monitoring,
business or
system activity monitoring and in Capital Markets certain types of algo
trading.
From the perspective of high availability I've lumped all of
these (and similar) uses cases together because of their failure
context. If for any reason there is a failure that brings
down
the CEP application (or a key component thereof), the down time
represents lost
opportunity. While the system is down one loses the opportunity for...
an Algo to make a trade, for a fraudulent trade to be detected, for a
threat condition to be recognized and so on and so forth.
When the system is once again operational, the lost data is largely
irrelevant as in the algo trading case. Market data is very time
sensitive and as time advances the value of that data diminishes
eventually reaching zero. In other cases, lost data could be
recovered from persistent sources (more on this later). The
overall point of this category of CEP use cases is that failures do not
represent catastrophic business failure but a moment lost.
The second categorization tips the scale in the other direction. The
classic examples in Capital Markets are pricing engines, crossing
engines, Dark Pools and Smart Order Routers. In these examples
the CEP application is, in effect the trading destination
for client Order flow. Institutional investors (the clients)
place orders to buy or sell large quantities of an asset
(stock,
currency, options, etc.) with these trading systems. There are
both legal and regulatory
requirements to ensure the transaction is
completed accurately, in the best possible manner (i.e best
execution)
and without loss of information. This type of CEP application
mandates a loss-less model. It is paramount incoming streams of data
(i.e. client
orders) are not lost for any reason. Failures of CEP systems deployed for these purposes do
represent catastrophic business failure and all the ramifications that
can lead to. As such highly available architectures become a
mandate for deployment into the production arena.
High Availability Variants
High Availability architectures fall on a continuum, starting from the
very basic ability to recover application state to continuous operation
with seamless fail over. Choosing the right architecture for any
particular
deployment is based on the categorization I outline above (Lost Opportunity and Loss-Less)
and the dollar-value cost of up time.
A basic
premise of
all high availability architectures is the notion of redundancy. Having
redundant components provides that safeguard against failure. If a
component (software or hardware) fails, the secondary or backup
instance assumes a primary role. In the variants I outline
below
(cold, warm, continuous) the chief
difference between them is the fail over time and consequently downtime duration.
Persistence is another component in a high availability deployment.
Persistence for recovery of application state. This may sound somewhat
obvious, but it's typically not transparent to CEP applications nor is
disk persistence the only means of recovery. Many of the external
services CEP applications connect to include a measure of persistence and therefore provide a level of recovery
themselves. Reliable message systems, distributed cache engines and in
Capital Markets FIX
servers provide a means of recovering application
state.
Cold Standby
A cold
standby deployment provides availability via complete
replication of the primary system (both hardware and software). The
replica or backup
server(s) is considered cold
because they do not actively or passively participate in
the business
function of the primary system. In fact, they could actually
be
powered off. In the event of a failure of the primary system,
the
standby machine assumes control. The process by which that
fail over occurs could be one of many. In the simplest case, human
intervention manually performs the fail over. This could
entail switching network and storage cables, and the like. In
a more automated case, the use of clustering
software would be employed. These products provide automated
detection of faults and automatic fail over of network, SAN
and application components (i.e. CEP engines) to backup servers.
A few noteworthy points
about cold standby.
- Fail over time
can be quite lengthy, measurable in the tens of minutes at best.
If the fail over mechanism is manual the first order of business
is a means to detect a
failure in the first place. Obviously the application stops working,
but having a means to monitor runtime health is always a first-line of
defense against failure. I'll cover runtime Management and Monitoring in
another blog. Even in clustered environments, the failure
detection and fail over sequence can cause downtime of such
duration that a data recovery scheme is necessary.
- Loss of data is
very likely. In the streaming data paradigm of CEP, failure of the
application does not turn off the spigot. A cold standby model
inherently incurs a downtime period between failure detection and
startup of the standby system. If the application environment is in the
Lost Opportunity category then loss of data is largely insignificant.
- Data
persistence and recovery become an inherent part of the CEP
application design. CEP applications must be designed to
persistent state at appropriate
intervals. In the event of a failure, the standby system can recover
the application state from the persistent store. As part of the
recovery process, the application must be able reconcile the last known
persisted state with reality. For example, in the algo trading
environment a trading application keeps track of Orders in-market
and current position.
To safeguard against losing this information in the event of
a failure, as the state of Orders and position changes it is
persisted to disk (or a SAN device). During the recovery phase, the CEP
application must reconcile the Order state read from the
persistent store with the state as known by the execution venue(s). A
non-trivial task to say the least and one that is very dependent on, and
unique to specific trading destinations. If connectivity is via FIX, that protocol provides a Order Status Request feature that can be used to facilitate this reconciliation
Warm Standby
Similar to a cold standby deployment, availability is achieved in a
warm standby system through redundancy. However the similarity
ends there, in a warm standby environment the standby or secondary
system (or systems) are generally on line with their own
private connection to external data sources and networks. However, they
operate is a strictly passive mode, leaving all the actual business functions to the active primary
system. The primary and secondary systems, acting in a
master-slave relationship keep tabs on one another via a heartbeat
scheme. A fail over is initiated when the primary system fails to
respond to a heartbeat request from the standby.
A few noteworthy points
about warm standby.
- Minimal fail over
time (as compared to cold standby). Since the secondary system
(or systems) are up and running typically with their own connection to
external systems, they can quickly assume control in the event of
failure on the primary system.
- Loss of
data is minimized. Application
state, while actively maintained by the primary system is also
monitored and
replicated by the secondary (passive) systems. So instead of persisting
application state (i.e Orders and positions) to disk, it is
replicated in-memory on the secondary systems. This of course
implies
reliable inter-connectivity between the systems. The
most common means of accomplishing this is via a message bus. Similar
to cold standby's data persistence, the propagation of state between
the master and slave CEP engines becomes part of the application design.
- It's important to minimize or eliminate false-positives in the design of the heartbeat scheme. In the ideal case, heartbeating is done on an out-of-band
connection between primary and secondary systems, out of the
application processing code path and any input/output data
queues. This ensures heartbeating is not (or minimally) effected
by fluctuations in processing load or congestion within the application itself.
- There can be multiple secondary (slave) systems providing multiple levels of redundancy to ensure the absolute minimal downtime.
Continuous Operation
A high availability architecture that provides continuous operation or continuous availability implies a seamless fail over
model. Like the standby designs, redundancy is the key
ingredient. However, unlike either cold or warm standby where the
redundant systems are passive, in the continuous environment multiple
redundant systems are peers and execute in parallel.
Each system runs a full complement of the application code and
it's supporting infrastructure (i.e. CEP engine). To make this model
function effectively, external
connectivity is not uniquely wired to each instance but
multiplexed by an independent component that manages both the inbound
and outbound
data streams to and from all redundant systems. Inbound the data is
multiplexed
to each instance, outbound this multiplexor manages the duplicate
streams using a first out wins model. Since each peer instance is
executing in tandem, they are producing duplicate output. In the design
of a continuous operation high availability architecture duplicates are
a natural by-product. Managing the duplicate output is vitally
important.
A few noteworthy points
about continuous operation.
- Fail over is
truly seamless or in fact non-existent. If one of the participant
instances drops out (i.e. fails) its an inconsequential event since the
remaining systems are fully active and continue to operate.
- Implies strict
determinism in the CEP engine. Since all redundant systems are actively
engaged at all times, they must produce the exact same output
streams given that they receive the same input streams.
- Failed instances that are brought back on line must have a means to catch up. Since processing does not stop, a cleanly restarted system must
have a means to load state from one of the currently running instances.
Opposing Forces
High
availability is fast becoming a key constituent for CEP. The three
basic variants I've outlined are just the starting points when
considering options for a highly available deployment. There are cost
considerations to take into account, both from a dollars and cents
viewpoint but also manageability, complexity and performance. These are
factors
to weigh in evaluating or designing a high availability solution.
Throughput and low-latency are typically paramount objectives in any
CEP
application. Wrapping that application in any variant of high
availability can impact
the overall performance. There is overhead in persisting state to disk
or replicating that state across a message bus to
secondary systems. To minimize the impact, there are disk and message
subsystem solutions designed for extreme low-latency. RamSAN devices leverage solid state disk technology for improved storage performance. RTI and 29West
offer high-speed low-latency reliable messaging. These technologies and others can
be employed within a high availability architecture to keep the
performance demons at bay.
CEP technology is clearly reaching a level of critical mass as a
platform for mainstream business, especially in Capital Markets.
Most if not all CEP vendors provide some high availability
options within their platform. However CEP is not an island within
itself, it lives within a larger infrastructure and the degree of high
availability is only as good as the weakest link. Availability
requirements are not just something to add-on after the fact, but
should be part of the overall design of the infrastructure,
connectivity and the deployed applications.
Thanks for reading...
Louie