Wednesday, May 14, 2008

High Availability a mandate for success

                        <p>High Availability: a mandate for success</p>                                                                        


As CEP has moved into the mainstream of mission critical business applications it brings rise to a whole set of new challenges for the purveyors of CEP technologies.  There is a mandate for software platforms that want to play in the enterprise class big leagues.  That mandate encompasses many capabilities over and above the core functionality of a platform, be it CEP or any other software technology. Witness the maturity of common infrastructure platforms today, whether it's messaging, databases, application servers or other, they all share a set of capabilities or ilities necessary to be considered in the enterprise class.

A software platform typically starts its life with an initial appeal of either providing greater business value and/or improved developer productivity. As it moves from being the cool new toy on the bleeding edge, employed for new and emerging use cases, it starts to get more widely deployed in critical business functions. As this maturity occurs, the IT user community wants... demands that the core set of its capabilities be broadened to better support the deployed production environment.  Two primary areas are Management and Monitoring and High Availability.

As a brief description, a highly available platform is one that is tolerant of faults.  Faults or failures can happen to either software or hardware components. A system that can detect failures and recover from them is considered to be highly available. There is a tremendous amount of academic research and technology focused on fault tolerant systems. Availability is the tolerance to failure achieved through redundancy and/or persistence. My intent here is to simply provide a pragmatic perspective to the challenges and solutions for high availability in CEP.

Lost Opportunity vs. Loss-Less

As a broad categorization one can divide CEP applications into two groups.  The first group is rather broad itself and encompasses use-cases such as sensor monitoring, business or system activity monitoring and in Capital Markets certain types of algo trading. From the perspective of high availability I've lumped all of these (and similar) uses cases together because of their failure context.  If for any reason there is a failure that brings down the CEP application (or a key component thereof), the down time represents lost opportunity. While the system is down one loses the opportunity for... an Algo to make a trade, for a fraudulent trade to be detected, for a threat condition to be recognized and so on and so forth.  When the system is once again operational, the lost data is largely irrelevant as in the algo trading case. Market data is very time sensitive and as time advances the value of that data diminishes eventually reaching zero.  In other cases, lost data could be recovered from persistent sources (more on this later).  The overall point of this category of CEP use cases is that failures do not represent catastrophic business failure but a moment lost.

The second categorization tips the scale in the other direction. The classic examples in Capital Markets are pricing engines, crossing engines, Dark Pools and Smart Order Routers. In these examples the CEP application is, in effect the trading destination for client Order flow.  Institutional investors (the clients) place orders to buy or sell large quantities of an asset (stock, currency, options, etc.) with these trading systems. There are both legal and regulatory requirements to ensure the transaction is completed accurately, in the best possible manner (i.e best execution) and without loss of information.  This type of CEP application mandates a loss-less model. It is paramount incoming streams of data (i.e. client orders) are not lost for any reason. Failures of CEP systems deployed for these purposes do represent catastrophic business failure and all the ramifications that can lead to.  As such highly available architectures become a mandate for deployment into the production arena.

High Availability Variants

High Availability architectures fall on a continuum, starting from the very basic ability to recover application state to continuous operation with seamless fail over. Choosing the right architecture for any particular deployment is based on the categorization I outline above (Lost Opportunity and Loss-Less) and the dollar-value cost of up time.

A basic premise of all high availability architectures is the notion of redundancy. Having redundant components provides that safeguard against failure. If a component (software or hardware) fails, the secondary or backup instance assumes a primary role.  In the variants I outline below (cold, warm, continuous) the chief difference between them is the fail over time and consequently downtime duration.

Persistence is another component in a high availability deployment. Persistence for recovery of application state. This may sound somewhat obvious, but it's typically not transparent to CEP applications nor is disk persistence the only means of recovery. Many of the external services CEP applications connect to include a measure of persistence and therefore provide a level of recovery themselves. Reliable message systems, distributed cache engines and in Capital Markets FIX servers provide a means of recovering application state.


Cold Standby

A cold standby deployment provides availability via complete replication of the primary system (both hardware and software). The replica or backup server(s) is considered cold because they do not actively or passively participate in the business function of the primary system. In fact, they could actually be powered off.  In the event of a failure of the primary system, the standby machine assumes control. The process by which that fail over occurs could be one of many. In the simplest case, human intervention manually performs the fail over. This could entail switching network and storage cables, and the like. In a more automated case, the use of clustering software would be employed. These products provide automated detection of faults and automatic fail over of network, SAN and application components (i.e. CEP engines) to backup servers.

A few noteworthy points about cold standby.

     
  1. Fail over time can be quite lengthy, measurable in the tens of minutes at best. If the fail over mechanism is manual the first order of business is a means to detect a failure in the first place. Obviously the application stops working, but having a means to monitor runtime health is always a first-line of defense against failure. I'll cover runtime Management and Monitoring in another blog.  Even in clustered environments, the failure detection and fail over sequence can cause downtime of such duration that a data recovery scheme is necessary.
  2.  
  3. Loss of data is very likely. In the streaming data paradigm of CEP, failure of the application does not turn off the spigot. A cold standby model inherently incurs a downtime period between failure detection and startup of the standby system. If the application environment is in the     Lost Opportunity category then loss of data is largely insignificant.
  4.  
  5. Data persistence and recovery become an inherent part of the CEP application design. CEP applications must be designed to persistent state at appropriate intervals. In the event of a failure, the standby system can recover the application state from the persistent store. As part of the recovery process, the application must be able reconcile the last known persisted state with reality. For example, in the algo trading environment a trading application keeps track of Orders in-market and current position. To safeguard against losing this information in the event of a failure, as the state of Orders and position changes it is persisted to disk (or a SAN device). During the recovery phase, the CEP application must reconcile the Order state read from the persistent store with the state as known by the execution venue(s). A non-trivial task to say the least and one that is very dependent on, and unique to specific trading destinations. If connectivity is via FIX, that protocol provides a Order Status Request feature that can be used to facilitate this reconciliation



Warm Standby

Similar to a cold standby deployment, availability is achieved in a warm standby system through redundancy.  However the similarity ends there, in a warm standby environment the standby or secondary system (or systems) are generally on line with their own private connection to external data sources and networks. However, they operate is a strictly passive mode, leaving all the actual business functions to the active primary system.  The primary and secondary systems, acting in a master-slave relationship keep tabs on one another via a heartbeat scheme. A fail over is initiated when the primary system fails to respond to a heartbeat request from the standby.

A few noteworthy points about warm standby.

     
  1. Minimal fail over time (as compared to cold standby).  Since the secondary system (or systems) are up and running typically with their own connection to external systems, they can quickly assume control in the event of failure on the primary system.
  2.  
  3. Loss of data is minimized. Application state, while actively maintained by the primary system is also monitored and replicated by the secondary (passive) systems. So instead of persisting application state (i.e Orders and positions) to disk, it is replicated in-memory on the secondary systems.  This of course implies reliable inter-connectivity between the systems. The most common means of accomplishing this is via a message bus. Similar to cold standby's data persistence, the propagation of state between the master and slave CEP engines becomes part of the application design.
  4.  
  5. It's important to minimize or eliminate false-positives in the design of the heartbeat scheme. In the ideal case, heartbeating is done on an out-of-band connection between primary and secondary systems, out of the application processing code path and any input/output data queues.  This ensures heartbeating is not (or minimally) effected by fluctuations in processing load or congestion within the application itself.
  6.  
  7. There can be multiple secondary (slave) systems providing multiple levels of redundancy to ensure the absolute minimal downtime.
       



Continuous Operation

A high availability architecture that provides continuous operation or continuous availability implies a seamless fail over model. Like the standby designs, redundancy is the key ingredient. However, unlike either cold or warm standby where the redundant systems are passive, in the continuous environment multiple redundant systems are peers and execute in parallel. Each system runs a full complement of the application code and it's supporting infrastructure (i.e. CEP engine). To make this model function effectively, external connectivity is not uniquely wired to each instance but multiplexed by an independent component that manages both the inbound and outbound data streams to and from all redundant systems. Inbound the data is multiplexed to each instance, outbound this multiplexor manages the duplicate streams using a first out wins model. Since each peer instance is executing in tandem, they are producing duplicate output. In the design of a continuous operation high availability architecture duplicates are a natural by-product. Managing the duplicate output is vitally important.

A few noteworthy points about continuous operation.

     
  1. Fail over is truly seamless or in fact non-existent. If one of the participant instances drops out (i.e. fails) its an inconsequential event since the remaining systems are fully active and continue to operate.
  2.  
  3. Implies strict determinism in the CEP engine. Since all redundant systems are actively engaged at all times, they must produce the exact same output streams given that they receive the same input streams.
  4.  
  5. Failed instances that are brought back on line must have a means to catch up.  Since processing does not stop, a cleanly restarted system must have a means to load state from one of the currently running instances.


Opposing Forces
 
High availability is fast becoming a key constituent for CEP. The three basic variants I've outlined are just the starting points when considering options for a highly available deployment. There are cost considerations to take into account, both from a dollars and cents viewpoint but also manageability, complexity and performance. These are factors to weigh in evaluating or designing a high availability solution. Throughput and low-latency are typically paramount objectives in any CEP application. Wrapping that application in any variant of high availability can impact the overall performance. There is overhead in persisting state to disk or replicating that state across a message bus to secondary systems. To minimize the impact, there are disk and message subsystem solutions designed for extreme low-latency. RamSAN devices leverage solid state disk  technology for improved storage performance. RTI and 29West offer high-speed low-latency reliable messaging. These technologies and others can be employed within a high availability architecture to keep the performance demons at bay.

CEP technology is clearly reaching a level of critical mass as a platform for mainstream business, especially in Capital Markets. Most if not all CEP vendors provide some high availability options within their platform. However CEP is not an island within itself, it lives within a larger infrastructure and the degree of high availability is only as good as the weakest link.  Availability requirements are not just something to add-on after the fact, but should be part of the overall design of the infrastructure, connectivity and the deployed applications.

Thanks for reading...
Louie

 

Thursday, May 01, 2008

An Apama Hat Trick

Last week proved to be a busy one for Apama on the marketing front as we issued three separate announcements in conjunction with our presence at the TradeTech show in Paris.  Two of the announcements focused on customers, while the third focused on work that a partner is jointly doing with Apama in the area of market surveillance.

  • ING Wholesale Banking announced that it is expanding its use of Apama, previously focused on algorithms for Benelux Small and Mid-Caps.  ING has re-engineered those algorithms to address markets in Hungary and Poland with a variety of features that include hybrid cross asset algorithms that leverage ING’s direct access within those emerging markets.
  • SEB, the Scandinavian financial group, announced it will expand its use of Apama to deliver advanced order flow monitoring services within a compliance application.  This was a second SEB announcement, following one last year regarding the SEB deployment of Apama to support client trading in Exchange-traded equities and futures.
  • And lastly, but certainly not least, together with Detica, a key partner, we jointly announced a Market Surveillance Accelerator.  Accelerators are extensions to the core Apama platform that help our customers jumpstart their deployments, incorporating business logic and other components like sample dashboards and adapters for connectivity.  In this instance, we are combining the technical know-how and experience of Detica and Apama – both of which are now supporting projects at the FSA and Turquoise - to address the growing demand for real-time market surveillance capabilities.  We’ve previously announced Accelerators for Market Aggregation and Smart Order Routing.   And there'll be more to come.

These three announcements collectively illustrate that a key part of the value of the Apama CEP platform is its versatility.  Apama may initially be deployed in support of a specific application like algorithmic trading or a specific asset class, but upon experience with the product, many of our customers expand their use to different asset classes, different geographic markets and/or entirely different applications – like compliance or risk or market surveillance.

Saturday, April 26, 2008

Judgement Day

Earlier this week I was told by a client (a proprietary trading shop) that they were switching brokers - the bank to which they submit their (cash equities) orders for execution across a range of European and US Equity markets. Now this was intriguing to me - as I happen to know that their new broker implements their client-facing Direct Strategy Access (DSA) offering using Apama. So Apama will be generating orders and sending them to ... Apama. (Via FIX protocol as it happens)

Brought a smile ... but then they told me the rest. They were planning to go via this broker to a range of markets to run some new cross-market arb strategies - apparently their new broker provides access to more European markets with very low latency. One of their primary markets will be the new Turquoise exchange, which will launch in September. At this point you might see where this is going; at Turquoise their orders will be subject to surveillance by the Turquoise Market Abuse Detection system, built on top of ... Apama.

I am smiling no longer. Didn't Skynet start like this?

Wednesday, April 23, 2008

On (Complex) Event Processing

I'm sure this has been said before, but I had cause to think again today on what is actual "complex" about CEP. I've never been particularly comfortable with the term CEP (of course, my personal comfort is not required ...) as it suggests that CEP is in some ways "hard". Well, those of us in the business of building CEP technologies know that the challenges of processing 10s of 000s of events per second with sub-millis latency against 000s of event patterns with temporal and logical constraints introduces a shed-load of complexity under the hood - but that's the point really - it is (or should be) all under the hood; we don't call databases "tricky databases" just because they have some fancy background indexing and schema evolution capabilities inside them.

Anyway, that ship has sailed - and I digress. The thing that got me thinking about this (again) was the recent press release regarding our launch - with our partner Detica - of a Solution Accelerator for Market Surveillance. That press release gives some examples of the kinds of surveillance strategies - "front running", "washing" ec. - that the Solution Accelerator provides (along with a handy definition of what this jargon actually means). The point is that the event processing logic required for these types of applications is not in any way "complex" - e.g. "detect a spike in trading volume or rapid price move within 10 minutes prior to a news release regarding a particular instrument" - despite the machinations under the hood that might be required to do this in real-time for all trades and news articles for all listed instruments on an exchange.

No, what is key in supporting these types of applications or strategies is the *lack* of complexity - and the provision of tools allowing strategy builders - here, operational teams at the exchange - to quickly express the business logic, generate a Dashboard to visualise the alerts it generates, and get all this into production before it becomes irrelevant. As markets and traders become ever more sophisticated it is the ease with which strategies can be modified and new strategies deployed that determines whether CEP is the right technology or not, not how clever it might be under the hood.

The term CEP seems to be here to stay, and I'm certainly not volunteering to be the flag carrier for a terminology battle ("Event Processing"? Anyone?). But let's be clear that neither the use case nor the hoops that need to be jumped through to deploy it need be complex for "CEP" to be an effective technology solution.

The key to effective CEP technology is to keep as much of the complexity as possible away from the people who have to use it.

Tuesday, April 22, 2008

Asia Report: Fighting White Collar Crime

Titchy_johnHello from Hong Kong. As always it is fascinating to see how CEP is evolving in Asia. One trend I am observing is the huge interest in Hong Kong in rogue traders and white collar crime – and how CEP can be used to detect and prevent this – before it moves the market. Obviously the original rogue trader, Nick Leeson, is well known here. But there has been a great deal of interest in more recent goings-on, at firms such as SocGen. Amazingly, until a couple of years ago, insider trading was not illegal in Hong Kong! Now we have a highly volatile market, with a lot of uncertainty, huge event volumes and a real problem of seeking out and preventing rogue trading activities, as well as managing risk exposure proactively.

Of course CEP provides a compelling approach. In market surveillance - the ability to monitor-analyze and act on complex patterns that indicate potential market abuse or potential dangerous risk exposure can allow a regulator, trading venue or bank to act instantly. Banks want the reassurance that they are policing their own systems. Regulators need to protect the public. The media and public here find this fascinating.

On the topic of a different kind of white collar crime – consider using CEP to detect abuse in the gaming industry. The gambling phenomenon that has propelled Macau to overtake Las Vegas as the world’s biggest gambling hub is also an exciting opportunity for CEP. We have customers using CEP to monitor and detect various forms of potential abuse in casinos. Events that are analyzed to find these patterns include gamblers and dealers signing on at tables, wins and losses, cards being dealt etc. It is possible to detect a range of potential illegal activities, ranging from dealer-gambler collusion to card counting.

As a final thought - having met with some of our customers that operate both in Hong Kong and mainland China, it is clear that China is a massive market opportunity for CEP. Exciting times ahead for CEP in Asia.