
November 2007

Wednesday, November 28, 2007

Design Patterns in CEP – The instance Life Cycle

Posted by Louis Lovas

Much has been written on CEP design patterns. You can find two very good whitepapers on the subject on the Complex Event Processing website. These whitepapers and various blogs clearly and succinctly describe a number of use-cases showing how CEP engines, with their complex pattern detection and temporal awareness, can be used to derive real business value from streaming data sources. However, reading these documents and contrasting them with my own experience building Apama CEP applications, I find they miss a very real and fundamental design pattern: instance life cycle management. I see Complex Event Processing as much more than pattern detection; it is pattern detection and the expression of business logic executing within a managed life cycle context.

What I mean by the instance life cycle is the need to invoke multiple application instances upon receipt of specific control events or the detection of event patterns. The execution context and the application instance that runs within it define a unit of work. The lifetime of an instance is driven directly by the application's semantics; instances might have a long or a short life span. There are three basic traits of the instance life cycle: creation, modification and deletion. Creation is simply the need to establish a new object instance and a context for its execution. Modification is the need to support dynamic changes to the operational state (i.e. runtime parameters). Lastly, there is support for terminating a running instance, either abruptly or gracefully. The instance life cycle can be managed by various means: from other parts of an application or from a user's dashboard. Interfaces to create new instances, modify existing ones or delete them should be present, given adequate security privileges. The runtime environment of the CEP engine should provide a means to establish and manage these application instances.

To be a bit more definitive, instances can represent many differing aspects of an application. In the Algo Trading world, 'instances' generally refer to running trading strategies. The basic template of a strategy is to take a set of input parameters during creation. For example, an Arbitrage strategy might take two instrument names and a Bollinger Band range as input. Modifying the running strategy might mean increasing or decreasing the Band range. Deleting the strategy would cause it to either complete or cancel any outstanding orders it is managing and then terminate. Given this example, executing multiple concurrent instances of the Arbitrage strategy should be straightforward, and the CEP engine should provide the corresponding language semantics to make it a simple task.
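To make the three traits a little more concrete, here is a minimal Java sketch of that life cycle under my own assumptions; the class and method names (ArbitrageStrategy, StrategyManager and so on) are hypothetical and not drawn from any particular product.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the create/modify/delete life cycle for strategy instances.
class ArbitrageStrategy {
    private final String instrumentA;
    private final String instrumentB;
    private volatile double bandRange;       // runtime parameter that may be modified
    private volatile boolean running = true;

    ArbitrageStrategy(String instrumentA, String instrumentB, double bandRange) {
        this.instrumentA = instrumentA;
        this.instrumentB = instrumentB;
        this.bandRange = bandRange;
    }

    void setBandRange(double bandRange) { this.bandRange = bandRange; }  // modification

    void shutdown() {                                                    // deletion
        running = false;
        // complete or cancel any outstanding orders before terminating
    }
}

class StrategyManager {
    private final Map<String, ArbitrageStrategy> instances = new ConcurrentHashMap<>();

    String create(String a, String b, double band) {                    // creation
        String id = UUID.randomUUID().toString();
        instances.put(id, new ArbitrageStrategy(a, b, band));
        return id;
    }

    void modify(String id, double newBand) { instances.get(id).setBandRange(newBand); }

    void delete(String id) {
        ArbitrageStrategy s = instances.remove(id);
        if (s != null) s.shutdown();
    }
}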

Instances are not limited to this top-level construct of a strategy. They could represent many things in a CEP engine. Using another finance example, a CEP application designed for best execution of client orders would represent each client or parent order as an instance. Each subsequent child order also represents a managed instance with its own life cycle. For a CEP application monitoring a manufacturing environment, instances might represent parts moving through a production process from raw material to finished goods. A managed instance could represent a mortgage application as it flows through qualification in a BAM application within a bank.

In some respects I'm not identifying a revelation or an idea that is all that new. Commercial applications of all sorts incorporate the notion of instance life cycle management. It's a classic design pattern that is provided in some form in both development languages and application frameworks (i.e. AppServers). The point I'm attempting to make is that one should be wary of getting enamored with the uniqueness of CEP. It's not just about design patterns for filtering and enrichment but also about incorporating classic and traditional constructs for application design and implementation. In traditional languages such as Java and C++ the instance design pattern can take numerous forms. It can be as simple as the new operator, so that an instance is represented as a new object instance of a class. An instance could take the form of a thread via CreateThread or, in the old-school Unix mentality, a forked process. Granted, these are not all equivalent; there are pros and cons to using each for life cycle management. Furthermore, frameworks such as J2EE application servers provide their own instance management schemes to make the application developer's task of instance management easier. CEP engines that are implemented using an embedded model, specifically those that run within the context of a host language (i.e. Java) and/or infrastructure (i.e. AppServer), can leverage these inherent life cycle instance management capabilities.
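To illustrate the simpler end of that spectrum, the new operator and the thread, here is a small Java fragment showing an instance as a plain object and an instance that owns its own thread of execution. It is only a sketch of the general idea; the OrderMonitor class is an invented example.

// Two of the simplest instance forms in a traditional language:
// an object created with 'new', and an instance that runs on its own thread.
public class InstanceForms {
    static class OrderMonitor implements Runnable {
        private final String orderId;
        OrderMonitor(String orderId) { this.orderId = orderId; }
        @Override public void run() {
            System.out.println("monitoring order " + orderId);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // instance as a plain object, executed on the caller's thread
        OrderMonitor asObject = new OrderMonitor("ORD-1");
        asObject.run();

        // instance as a thread: each instance gets its own unit of execution
        Thread asThread = new Thread(new OrderMonitor("ORD-2"));
        asThread.start();
        asThread.join();
    }
}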

Other languages have managed to leverage the best of these options to create safer, more scalable instance management. Apama's MonitorScript event processing language (EPL) supports an application threading model via a spawn operator; each instance is referred to as an mThread. From this core capability in the EPL, the life cycle design pattern is built. The spawn operator allows Apama CEP applications to be designed and built to leverage the best of thread-based concurrent programming (a simpler coding model, code isolation and encapsulation, etc.) without the worries that surround the use of system threads (thrashing, non-deterministic scheduling, poor portability, poor scalability). mThreads in MonitorScript guarantee predictable behavior and are fully deterministic, fully portable and massively scalable. This capability is not an adjunct to the Apama platform; it's a core capability inherent in the language and heavily leveraged by many of the supporting development and runtime tools.

CEP platforms that narrowly define the paradigm of event stream processing as a language for filtering and enrichment leave many very important aspects of application management undefined and, as such, challenging to implement. The uniqueness and benefits of CEP are clearly evident in the many documented design patterns. However, CEP cannot ignore the mature aspects of traditional application design, such as instance life cycle management.

Tuesday, November 13, 2007

Taking Aim

Posted by Louis Lovas

I am both humbled and appreciative of all the accolades, constructive comments (hey, fix that misspelled word) and, yes, criticism of my latest blog about using SQL for Complex Event Processing. I was expecting some measure of response, as shown by this rebuttal, given the somewhat polarizing nature of using the SQL language for CEP applications. For every viewpoint there is always an opposing, yet arguably valid, outlook. I welcome any and all commentary. Again, thanks to all for reading and commenting.

There were two main themes of the criticism that I received. One was on the viability of the SQL language, the other on my commentary on the use of Java and C++ for CEP applications. I would like to clarify and reinforce a few points that I made.

I chose the aggregation use-case as an example to highlight the limitations of SQL because it's one that I have recent experience with. I have been both directly and indirectly involved in six aggregation projects for our financial services customers over the past year, and in those endeavors I've both learned much and leveraged much. As I tried to describe in a condensed narrative, aggregation is a challenging problem, one that is best solved by complex nested data structures and associated program logic. Trying to represent this in SQL is a tall order given the flat, two-dimensional nature of SQL tables and their limited ability for semantic expression. To explain this further, I have taken a snippet of a MonitorScript-based implementation and present it as a bit of pseudo code that describes the nested structure I am referring to. I've intentionally avoided any specific language syntax and condensed the structures to just the most relevant elements. But suffice it to say, defining these structures is clearly possible in Java, C++ and Apama's MonitorScript. I would also like to give credit where it's due and acknowledge many of my colleagues for the (abridged) definition I'm using below, taken from one of our customer implementations.

structure Provider {
    string symbol;                         // symbol (unique to this provider)
    string marketId;                       // the market identifier of the price point
    integer quantity;                      // quantity the provider is offering
    float timestamp;                       // time of the point
    float cost;                            // transaction cost for this provider
    hashmap<string,string> extraParams;    // used for storing extra information on the point
}

structure PricePoint {
    float price;                           // the price (either a bid or ask)
    array<Provider> providers;             // array of providers at this price
    integer totalQty;                      // total quantity across all providers at this price
}

structure AggregatedOrderBook {
    integer sequenceId;                    // counter incremented each time the book is updated
    integer totalBidQuantity;              // total volume available on the bid side
    integer totalAskQuantity;              // total volume available on the ask side
    integer totalProviders;                // total number of providers
    array<PricePoint> bids;                // list of bids, sorted by price
    array<PricePoint> asks;                // list of asks, sorted by price
}

An aggregation engine would create an instance of AggregatedOrderBook for each symbol, tracking prices per market data Provider. As market data quotes arrive they are decomposed and inserted (sorted/merged) into the appropriate PricePoint, and the total values are recalculated. This is an oversimplification of what transpires per incoming quote, but the aim here is to provide a simplified yet representative example of the complexity of representing an Aggregated Order Book.
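A rough Java sketch of that per-quote processing, loosely mirroring the pseudo code above, might look something like the following. It is a simplified illustration only; the applyQuote method, the Quote type and the way quantities and providers are merged are assumptions of mine, not taken from the actual customer implementation.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: simplified versions of the structures defined above.
class Quote {
    String provider; double price; int quantity; boolean isBid;
    Quote(String provider, double price, int quantity, boolean isBid) {
        this.provider = provider; this.price = price; this.quantity = quantity; this.isBid = isBid;
    }
}

class PricePoint {
    double price;
    int totalQty;
    List<String> providers = new ArrayList<>();
    PricePoint(double price) { this.price = price; }
}

class AggregatedOrderBook {
    long sequenceId;
    List<PricePoint> bids = new ArrayList<>();   // sorted best (highest) bid first
    List<PricePoint> asks = new ArrayList<>();   // sorted best (lowest) ask first

    // Sort/merge a single incoming quote into the book and update the totals.
    void applyQuote(Quote q) {
        List<PricePoint> side = q.isBid ? bids : asks;
        int i = 0;
        while (i < side.size() && better(side.get(i).price, q.price, q.isBid)) i++;
        PricePoint pp;
        if (i < side.size() && side.get(i).price == q.price) {
            pp = side.get(i);                    // existing price level
        } else {
            pp = new PricePoint(q.price);        // new price level at its sorted position
            side.add(i, pp);
        }
        pp.totalQty += q.quantity;
        pp.providers.add(q.provider);
        sequenceId++;                            // book updated
    }

    private static boolean better(double existing, double incoming, boolean isBid) {
        return isBid ? existing > incoming : existing < incoming;
    }
}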

Furthermore, after each quote is processed and the aggregate order book is updated, it's imperative that it be made available to trading strategies expeditiously. Minimizing the signal-to-trade latency is a key measure of success in algorithmic trading. Aggregation is a heavyweight, compute-intensive operation; it takes a lot of processing power to aggregate 1,000 symbols across 5 Exchanges. As such, it is one (of many) opposing forces to the goal of minimizing latency. So this presents yet another critical aspect of aggregation: how best to design it so that it can deliver its content to eagerly awaiting strategies. One means of minimizing that latency is to have the aggregation component and the trading strategies co-resident within the CEP runtime engine. Passing (or otherwise providing) the aggregated order book to the strategies then becomes a simple 'tap-on-the-shoulder' coding construct. But it does imply that the CEP language has the semantic expressiveness to design and implement both aggregation and trading strategies, and the ability to load and run them side-by-side within the CEP engine. Any other model implies not only multiple languages (i.e. Java and streamSQL) but likely some sort of distributed, networked model. Separating aggregation from its consumers, the trading strategies, will likely incur enough overhead to impact that all-important signal-to-trade latency measure.

I do realize that the CEP vendors using a streaming SQL variant have begun to add imperative syntax to support complex procedural logic and "loop" constructs, something I'm quite glad to see happening. It only validates the claim I've been making all along: the SQL language at its core is unsuitable for full-fledged CEP-style applications. The unfortunate side effect of these vendor-specific additions is that they will fracture attempts at standardization.
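Returning to the co-resident design for a moment, the 'tap-on-the-shoulder' handoff can be sketched as nothing more than an in-process callback: because the aggregation component and the strategies share a runtime, delivering an updated book is a direct method call rather than a network hop. The interface and class names below are illustrative assumptions, not part of any product.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: when aggregation and strategies are co-resident,
// publishing an updated book is a direct in-process call, not a network hop.
interface BookListener {
    void onBookUpdate(String symbol, Object aggregatedBook);   // book type elided for brevity
}

class Aggregator {
    private final List<BookListener> strategies = new ArrayList<>();

    void register(BookListener strategy) { strategies.add(strategy); }

    // Called after each incoming quote has been merged into the book.
    void publish(String symbol, Object aggregatedBook) {
        for (BookListener s : strategies) {
            s.onBookUpdate(symbol, aggregatedBook);            // the 'tap on the shoulder'
        }
    }
}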

In my previous blog, I wanted to point out the challenges the SQL language faces in both implementing logic and managing application state. To that end, I provided a small snippet of a streamSQL variant. A criticism leveled against it states that it's an unnecessarily inefficient bit of code. I won't argue that point, and I won't take credit for writing it either; I simply borrowed it from a sample application provided with another SQL-based CEP product. The sample code a vendor includes with their product is all too often taken as gospel, since a customer's expectation is that it represents best-practice usage. Vendors should take great care in providing samples, portions of which inevitably end up in production code.

The second criticism I received was of a few unintentionally scathing comments I made against Java and C++. I stated that using C++ and/or Java "means you start an application's implementation at the bottom rung of the ladder". My intent was to draw a comparison with CEP, with its language and surrounding infrastructure. All CEP engines provide much more than just a language: they provide a runtime engine or virtual machine, connectivity components, visualization tools and management/deployment tools. CEP vendors, like all infrastructure vendors, live and die by the features, performance and quality of their products. All too often I've witnessed customers take a "not invented here" attitude; they survey the (infrastructure) landscape and decide "we can do better". For a business's IT group chartered with servicing the business to think it can implement infrastructure itself is a naïve viewpoint. Granted, on occasion requirements might be so unique that the only choice is to start slinging C++ code, but weighing the merits of commercial (and open source) infrastructure should not be overlooked.

My goal in this and past blogs is to provide concrete use-cases and opinions on CEP drawn from my own experience designing, building and deploying Apama CEP applications. In doing so I was quite aware that I was drawing a big red bulls-eye on my back, making me an easy target for detractors to take aim at. Surprisingly, I have received much more positive commentary than I ever expected, and the criticism has been entirely professional. I thank all who have taken the time to read my editorials; I am quite flattered.

Monday, November 05, 2007

Bending the Nail

Posted by Louis Lovas

From my recent blogs (When all you have is a hammer everything looks like a nail and Hitting the nail on the head) one could conclude that I've been overly inflammatory toward SQL-based CEP products. I really have no intention of being seditious, just simply factual. I've been building and designing software far too long to have an emotional response to newly hyped products or platforms. Witnessing both my own handiwork and many compelling technologies fade from glory all too soon has steeled me against any fervent reactions. I've always thought the software business is for the young. As with most professions, one gets hardened by years of experience, some good and some only painful lessons. Nonetheless, over time that skill, knowledge and experience condenses. The desire to share that (hard-won) wisdom is all too often futile; the incognizant young are too busy repeating the same mistakes to take notice. Funny thing is … wisdom is an affliction that inevitably strikes us all.

Well, enough of the philosophical meanderings; just the facts, please …

In a recent blog, I explored the need for CEP-based applications to manage state. As a representative example, I used the algo-trading example of managing Orders-in-Market. The need to process incoming market data while taking care of the orders already placed is paramount to the goals of algorithmic trading. Here I'll delve a bit deeper into the state management requirement, this time focusing on the management of complex market data, the input, if you will, to the algorithmic engine.

Aggregation of market data is a trend emerging across all asset classes in Capital Markets. Simply put, aggregation is the process of collecting and ordering quote data (bids & asks) from multiple sources of liquidity into a consolidated Order Book. In the end, this is a classic sort/merge problem. Incoming quotes are dissected and inserted into a cache organized by symbol, exchange and/or market maker, and sorted by Bid and Ask prices. Aggregation of market data is applicable to many asset classes (i.e. Equities, Foreign Exchange and Fixed Income). The providers of liquidity in any asset class share a number of common constructs but an equal number of unique oddities. For the aggregation engine, there are likewise common requirements (i.e. sorting/merging) and a few unique nuances. It's the role of the aggregation engine to understand each provider's subtleties and normalize them for the consuming audience. For example, different Equities Exchanges (or banks providing FX liquidity) can use slightly different symbol naming conventions. Likewise, transaction costs can (or should) have an influence on the quote prices. Many FX providers put a time-to-live (TTL) on their streaming quotes, which implies the aggregation engine has to handle price expirations (and subsequently eject expired prices from its cache). In the event of a network (or other) disconnection, the cache must be cleansed of that provider's (now stale) prices. The aggregation engine must account for these (and a host of other needs) since its role is to provide a single unified view of an Order Book to trading applications.

The trading applications can be on both sides of the equation. A typical Buy-side consumer is a Best Execution algo: client orders or Prop desk orders are filled by sweeping the aggregate book from the top. For Market Makers, aggregation can be the basis for a Request For Quote (RFQ) system.
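To make a couple of those nuances concrete, here is a minimal Java sketch, purely illustrative, of how an aggregation cache might expire quotes whose time-to-live has elapsed and purge a provider's prices after a disconnection. The class and method names (CachedQuote, QuoteCache, expire, purgeProvider) are my own invention, not from any product.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of two housekeeping duties of an aggregation engine:
// expiring quotes whose TTL has elapsed, and purging a disconnected provider.
class CachedQuote {
    final String provider;
    final double price;
    final long expiresAtMillis;   // absolute expiry time derived from the provider's TTL
    CachedQuote(String provider, double price, long expiresAtMillis) {
        this.provider = provider; this.price = price; this.expiresAtMillis = expiresAtMillis;
    }
}

class QuoteCache {
    private final List<CachedQuote> quotes = new CopyOnWriteArrayList<>();

    void add(CachedQuote q) { quotes.add(q); }

    // Eject quotes whose time-to-live has passed.
    void expire(long nowMillis) {
        quotes.removeIf(q -> q.expiresAtMillis <= nowMillis);
    }

    // On a provider disconnect, cleanse the cache of that provider's now-stale prices.
    void purgeProvider(String provider) {
        quotes.removeIf(q -> q.provider.equals(provider));
    }
}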

At first glance, one would expect SQL-based CEP engines to be able to handle this use-case effectively. After all, sorting and merging (joining) is a common usage of SQL in the database world, and streaming SQL does provide Join and Gather type operators. However, the complexities of an aggregation model quickly outstrip the use of SQL as an efficient means of implementation. The model requires managing/caching a complex multi-dimensional data structure. For each symbol, multiple arrays of a price structure are necessary, one for the bid side and another for the ask side. Each element in the price structure would include the total quantity available at that price and a list of providers. Each provider entry, in turn, ends up being a complex structure in itself, since it would include any symbol mapping, transaction costing, expiration and connectivity information. At the top level of the aggregate book would be a summation of the total volume available (per symbol, of course). Algos more interested in complete order fulfillment (i.e. fill-or-kill) would want this summary view.

Using stream SQL to attempt to accomplish this would mean flattening this logical multi-dimensional object into the row/column format of a SQL table. SQL tables can contain only scalar values; multi-dimensionality can only be achieved by employing multiple tables. I don't mean to imply this is undesirable or illogical; initially it seems like a natural fit. However, an Aggregated Book is more than just its structure; as I mentioned above, it also involves a wealth of processing logic. In the end one would be bending the SQL language to perform unnatural acts in any attempt to implement this complex use-case.

To illustrate an unnatural act, here's a very simple streamSQL example. The purpose of this bit of code is to increment an integer counter (TradeID = TradeID + 1) on receipt of every tick (TradesIn) event and produce a new output stream of ticks (Trades_with_ID) that now includes that integer counter, a trade identifier of sorts.

 

CREATE INPUT STREAM TradesIn (
    Symbol string(5),
    Volume int,
    Price double
);

CREATE MEMORY TABLE TradeIDTable (
    TradeID int,
    RowPointer int,
    PRIMARY KEY(RowPointer) USING btree
);

CREATE STREAM Trades_with_ID;

INSERT INTO TradeIDTable (RowPointer, TradeID)
    SELECT 1 AS RowPointer, 0 AS TradeID
    FROM TradesIn
    ON DUPLICATE KEY UPDATE
        TradeID = TradeID + 1
    RETURNING
        TradesIn.Symbol      AS Symbol,
        TradesIn.Volume      AS Volume,
        TradesIn.Price       AS Price,
        TradeIDTable.TradeID AS TradeID
    INTO Trades_with_ID;

 

The state to manage and the processing logic in this small stream SQL snippet amount to no more than incrementing an integer counter (i.e. i = i + 1). In order to accomplish this very simple task, a memory table (TradeIDTable) is used to INSERT and then SELECT a single row (1 AS RowPointer) that contains the incrementing integer (ON DUPLICATE KEY UPDATE TradeID = TradeID + 1) each time a new TradesIn event is received. In a way, a rather creative use of SQL, don't you think? However, simply extrapolate the state requirements beyond TradeID int and the processing logic beyond TradeID = TradeID + 1, and you quickly realize you would be bending the language to the point of breaking.
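For contrast, the equivalent task in an imperative language really is just i = i + 1. A rough Java sketch of the same thing (the class and type names here are mine, purely for illustration) needs no memory table at all:

import java.util.concurrent.atomic.AtomicLong;

// Rough Java equivalent of the streamSQL snippet above: tag each incoming
// tick with a monotonically increasing trade identifier. No table required.
class TradeIdAssigner {
    private final AtomicLong tradeId = new AtomicLong(0);

    static class Trade {
        final String symbol; final int volume; final double price; final long id;
        Trade(String symbol, int volume, double price, long id) {
            this.symbol = symbol; this.volume = volume; this.price = price; this.id = id;
        }
    }

    Trade onTick(String symbol, int volume, double price) {
        return new Trade(symbol, volume, price, tradeId.incrementAndGet());   // i = i + 1
    }
}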

In the commercial application world, relational databases are an entrenched and integral component, and SQL is the language applications use to interact with those databases. As applications have grown in complexity, their data needs have grown in complexity as well. One outgrowth of this is a breed of application service known as Object-Relational (O/R) mapping. O/R mapping technologies have emerged to bridge the impedance mismatch between an application's object view of data and SQL's flat, two-dimensional view. A wealth of O/R products is available today, so the need for such technologies clearly exists.

Why am I mentioning O/R technologies in a CEP blog? Simply to emphasize the point that the SQL language, as validated by the very existence of O/R technologies in the commercial space, is a poor choice for CEP applications. As I've mentioned in previous blogs, programming languages that provide the vernacular to express both complex structures (objects) and complex semantics (programming logic) are as necessary for aggregation as they are for Orders-in-Market or any CEP application.

So what sort of language is appropriate for CEP? Well, there is always the choice of Java or C++. Traditional languages such as Java and C++ clearly provide this expressiveness and can be used to build applications in any domain. However, trailing along behind that expressiveness is also risk: using these languages means you start an application's implementation at the bottom rung of the ladder, and the risk associated with this is evident in many a failed project. A step up is domain-specific languages. For the domain of streaming data, Event Programming Languages (EPLs) are the clear winners. Like C++ and Java they contain syntax for defining complex objects (like an Aggregated Order Book) and imperative execution, but they also include a number of purpose-built declarative constructs specifically designed to process streaming data efficiently. Apama's MonitorScript is one such EPL.