High Availability a mandate for success
High Availability: a mandate for success
As CEP has moved into
the
mainstream of mission critical business applications it brings rise to
a whole set of new challenges for the purveyors of CEP
technologies. There is a mandate for software platforms that
want
to play in the enterprise class big
leagues. That mandate encompasses many capabilities over and
above the core functionality of a platform, be it CEP or any other
software technology. Witness the maturity of common infrastructure
platforms today, whether it's messaging, databases, application servers
or other, they all share a set of capabilities or ilities
necessary to be considered in the enterprise
class.
A software platform typically starts its life with an initial appeal of
either providing greater business value and/or improved developer
productivity. As it moves from being the cool new toy
on the bleeding edge, employed for new and emerging use cases, it
starts to get more widely deployed in critical business functions. As
this maturity occurs, the IT user community wants... demands
that the core set of its capabilities be broadened to better
support the deployed production environment. Two
primary areas
are Management and Monitoring
and High Availability.
As a brief description, a highly
available platform is one that is tolerant of
faults. Faults or failures can happen to either software or
hardware components. A system that can detect failures and recover from
them is considered to be highly available. There is a tremendous amount
of academic research and technology focused on fault
tolerant
systems. Availability is the tolerance to failure achieved through
redundancy and/or persistence. My intent here is to simply provide a
pragmatic perspective to
the challenges and solutions for high availability in CEP.
Lost Opportunity vs.
Loss-Less
As a broad categorization one can divide CEP applications into two
groups. The first group is rather broad
itself and encompasses use-cases such as sensor monitoring,
business or
system activity monitoring and in Capital Markets certain types of algo
trading.
From the perspective of high availability I've lumped all of
these (and similar) uses cases together because of their failure
context. If for any reason there is a failure that brings
down
the CEP application (or a key component thereof), the down time
represents lost
opportunity. While the system is down one loses the opportunity for...
an Algo to make a trade, for a fraudulent trade to be detected, for a
threat condition to be recognized and so on and so forth.
When the system is once again operational, the lost data is largely
irrelevant as in the algo trading case. Market data is very time
sensitive and as time advances the value of that data diminishes
eventually reaching zero. In other cases, lost data could be
recovered from persistent sources (more on this later). The
overall point of this category of CEP use cases is that failures do not
represent catastrophic business failure but a moment lost.
The second categorization tips the scale in the other direction. The
classic examples in Capital Markets are pricing engines, crossing
engines, Dark Pools and Smart Order Routers. In these examples
the CEP application is, in effect the trading destination
for client Order flow. Institutional investors (the clients)
place orders to buy or sell large quantities of an asset
(stock,
currency, options, etc.) with these trading systems. There are
both legal and regulatory
requirements to ensure the transaction is
completed accurately, in the best possible manner (i.e best
execution)
and without loss of information. This type of CEP application
mandates a loss-less model. It is paramount incoming streams of data
(i.e. client
orders) are not lost for any reason. Failures of CEP systems deployed for these purposes do
represent catastrophic business failure and all the ramifications that
can lead to. As such highly available architectures become a
mandate for deployment into the production arena.
High Availability Variants
High Availability architectures fall on a continuum, starting from the
very basic ability to recover application state to continuous operation
with seamless fail over. Choosing the right architecture for any
particular
deployment is based on the categorization I outline above (Lost Opportunity and Loss-Less)
and the dollar-value cost of up time.
A basic
premise of
all high availability architectures is the notion of redundancy. Having
redundant components provides that safeguard against failure. If a
component (software or hardware) fails, the secondary or backup
instance assumes a primary role. In the variants I outline
below
(cold, warm, continuous) the chief
difference between them is the fail over time and consequently downtime duration.
Persistence is another component in a high availability deployment.
Persistence for recovery of application state. This may sound somewhat
obvious, but it's typically not transparent to CEP applications nor is
disk persistence the only means of recovery. Many of the external
services CEP applications connect to include a measure of persistence and therefore provide a level of recovery
themselves. Reliable message systems, distributed cache engines and in
Capital Markets FIX
servers provide a means of recovering application
state.
Cold Standby
A cold
standby deployment provides availability via complete
replication of the primary system (both hardware and software). The
replica or backup
server(s) is considered cold
because they do not actively or passively participate in
the business
function of the primary system. In fact, they could actually
be
powered off. In the event of a failure of the primary system,
the
standby machine assumes control. The process by which that
fail over occurs could be one of many. In the simplest case, human
intervention manually performs the fail over. This could
entail switching network and storage cables, and the like. In
a more automated case, the use of clustering
software would be employed. These products provide automated
detection of faults and automatic fail over of network, SAN
and application components (i.e. CEP engines) to backup servers.
A few noteworthy points
about cold standby.
- Fail over time can be quite lengthy, measurable in the tens of minutes at best. If the fail over mechanism is manual the first order of business is a means to detect a failure in the first place. Obviously the application stops working, but having a means to monitor runtime health is always a first-line of defense against failure. I'll cover runtime Management and Monitoring in another blog. Even in clustered environments, the failure detection and fail over sequence can cause downtime of such duration that a data recovery scheme is necessary.
- Loss of data is very likely. In the streaming data paradigm of CEP, failure of the application does not turn off the spigot. A cold standby model inherently incurs a downtime period between failure detection and startup of the standby system. If the application environment is in the Lost Opportunity category then loss of data is largely insignificant.
- Data persistence and recovery become an inherent part of the CEP application design. CEP applications must be designed to persistent state at appropriate intervals. In the event of a failure, the standby system can recover the application state from the persistent store. As part of the recovery process, the application must be able reconcile the last known persisted state with reality. For example, in the algo trading environment a trading application keeps track of Orders in-market and current position. To safeguard against losing this information in the event of a failure, as the state of Orders and position changes it is persisted to disk (or a SAN device). During the recovery phase, the CEP application must reconcile the Order state read from the persistent store with the state as known by the execution venue(s). A non-trivial task to say the least and one that is very dependent on, and unique to specific trading destinations. If connectivity is via FIX, that protocol provides a Order Status Request feature that can be used to facilitate this reconciliation
Warm Standby
Similar to a cold standby deployment, availability is achieved in a
warm standby system through redundancy. However the similarity
ends there, in a warm standby environment the standby or secondary
system (or systems) are generally on line with their own
private connection to external data sources and networks. However, they
operate is a strictly passive mode, leaving all the actual business functions to the active primary
system. The primary and secondary systems, acting in a
master-slave relationship keep tabs on one another via a heartbeat
scheme. A fail over is initiated when the primary system fails to
respond to a heartbeat request from the standby.
A few noteworthy points
about warm standby.
- Minimal fail over time (as compared to cold standby). Since the secondary system (or systems) are up and running typically with their own connection to external systems, they can quickly assume control in the event of failure on the primary system.
- Loss of data is minimized. Application state, while actively maintained by the primary system is also monitored and replicated by the secondary (passive) systems. So instead of persisting application state (i.e Orders and positions) to disk, it is replicated in-memory on the secondary systems. This of course implies reliable inter-connectivity between the systems. The most common means of accomplishing this is via a message bus. Similar to cold standby's data persistence, the propagation of state between the master and slave CEP engines becomes part of the application design.
- It's important to minimize or eliminate false-positives in the design of the heartbeat scheme. In the ideal case, heartbeating is done on an out-of-band connection between primary and secondary systems, out of the application processing code path and any input/output data queues. This ensures heartbeating is not (or minimally) effected by fluctuations in processing load or congestion within the application itself.
- There can be multiple secondary (slave) systems providing multiple levels of redundancy to ensure the absolute minimal downtime.
Continuous Operation
A high availability architecture that provides continuous operation or continuous availability implies a seamless fail over
model. Like the standby designs, redundancy is the key
ingredient. However, unlike either cold or warm standby where the
redundant systems are passive, in the continuous environment multiple
redundant systems are peers and execute in parallel.
Each system runs a full complement of the application code and
it's supporting infrastructure (i.e. CEP engine). To make this model
function effectively, external
connectivity is not uniquely wired to each instance but
multiplexed by an independent component that manages both the inbound
and outbound
data streams to and from all redundant systems. Inbound the data is
multiplexed
to each instance, outbound this multiplexor manages the duplicate
streams using a first out wins model. Since each peer instance is
executing in tandem, they are producing duplicate output. In the design
of a continuous operation high availability architecture duplicates are
a natural by-product. Managing the duplicate output is vitally
important.
A few noteworthy points
about continuous operation.
- Fail over is truly seamless or in fact non-existent. If one of the participant instances drops out (i.e. fails) its an inconsequential event since the remaining systems are fully active and continue to operate.
- Implies strict determinism in the CEP engine. Since all redundant systems are actively engaged at all times, they must produce the exact same output streams given that they receive the same input streams.
- Failed instances that are brought back on line must have a means to catch up. Since processing does not stop, a cleanly restarted system must have a means to load state from one of the currently running instances.
Opposing Forces
High
availability is fast becoming a key constituent for CEP. The three
basic variants I've outlined are just the starting points when
considering options for a highly available deployment. There are cost
considerations to take into account, both from a dollars and cents
viewpoint but also manageability, complexity and performance. These are
factors
to weigh in evaluating or designing a high availability solution.
Throughput and low-latency are typically paramount objectives in any
CEP
application. Wrapping that application in any variant of high
availability can impact
the overall performance. There is overhead in persisting state to disk
or replicating that state across a message bus to
secondary systems. To minimize the impact, there are disk and message
subsystem solutions designed for extreme low-latency. RamSAN devices leverage solid state disk technology for improved storage performance. RTI and 29West
offer high-speed low-latency reliable messaging. These technologies and others can
be employed within a high availability architecture to keep the
performance demons at bay.
CEP technology is clearly reaching a level of critical mass as a
platform for mainstream business, especially in Capital Markets.
Most if not all CEP vendors provide some high availability
options within their platform. However CEP is not an island within
itself, it lives within a larger infrastructure and the degree of high
availability is only as good as the weakest link. Availability
requirements are not just something to add-on after the fact, but
should be part of the overall design of the infrastructure,
connectivity and the deployed applications.
Thanks for reading...
Louie
Nice writeup, it does capture the use case for the RamSan in a many environments (Thanks for the reference!). The simplicity of the recovery model of writing to disk is the draw for the RamSan in this environment. O-Sync performance is a big part of our market, and our engineering keeps it in mind when designing our system. If you are going to be at the SIFMA conference in a couple weeks swing by our booth.
-Jamon Bowen
Texas Memory Systems
Posted by: Jamon Bowen | Friday, May 30, 2008 at 11:45 AM