
Tuesday, November 13, 2007

Taking Aim

Posted by Louis Lovas

I am both humbled by and appreciative of all the accolades, constructive comments (hey, fix that misspelled word) and, yes, criticism of my latest blog on using SQL for Complex Event Processing. I was expecting some measure of response, as this rebuttal shows, given the somewhat polarizing nature of using the SQL language for CEP applications. For every viewpoint there is always an opposing, yet arguably valid, outlook. I welcome any and all commentary. Again, thanks to all for reading and commenting.

The criticism I received had two main themes: one concerned the viability of the SQL language; the other, my commentary on the use of Java and C++ for CEP applications. I would like to clarify and reinforce a few of the points I made.

I chose the aggregation use case as an example to highlight the limitations of SQL because it's one I have recent experience with. I have been both directly and indirectly involved in six aggregation projects for our financial services customers over the past year, and in those endeavors I've both learned much and leveraged much. As I tried to describe in a condensed narrative, aggregation is a challenging problem, one best solved with complex nested data structures and associated program logic. Representing this in SQL is a tall order given the flat, two-dimensional nature of SQL tables and the language's limited capacity for semantic expression. To explain this further, I have taken a snippet of a MonitorScript-based implementation and present it as a bit of pseudocode that describes the nested structure I am referring to. I've intentionally avoided any specific language syntax and condensed the structures to just the most relevant elements, but suffice it to say, defining these structures is clearly possible in Java, C++, and Apama's MonitorScript. I would also like to give credit where it's due and acknowledge many of my colleagues for the (abridged) definition below, taken from one of our customer implementations.

structure Provider {
    string symbol;                       // symbol (unique to this provider)
    string marketId;                     // the market identifier of the price point
    integer quantity;                    // quantity the provider is offering
    float timestamp;                     // time of the point
    float cost;                          // transaction cost for this provider
    hashmap<string,string> extraParams;  // used for storing extra information on the point
}

structure PricePoint {
    float price;                // the price (either a bid or ask)
    array<Provider> providers;  // array of providers at this price
    integer totalQty;           // total quantity across all providers at this price
}

structure AggregatedOrderBook {
    integer sequenceId;        // counter incremented each time the book is updated
    integer totalBidQuantity;  // total volume available on the bid side
    integer totalAskQuantity;  // total volume available on the ask side
    integer totalProviders;    // total number of providers
    array<PricePoint> bids;    // list of bids, sorted by price
    array<PricePoint> asks;    // list of asks, sorted by price
}

An aggregation engine would create an instance of AggregatedOrderBook for each symbol, tracking prices per market data Provider. As market data quotes arrive, they are decomposed and inserted (sort/merged) into the appropriate PricePoint, and the total values are recalculated. This is an oversimplification of what transpires per incoming quote, but the aim here is to provide a simplified yet representative example of the complexity of representing an aggregated order book.
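To make that sort/merge step concrete, below is a minimal Java sketch of inserting a single bid quote into a sorted book. It is an illustration, not production code: the Quote class and its fields are hypothetical stand-ins for a real feed handler's events, the structures are trimmed versions of the pseudocode above, and a real engine would also remove the provider's previous price point before inserting the new one.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the sort/merge step described above. Quote is a
// hypothetical stand-in for a real market-data event.
public class BookBuilder {

    static class Quote {
        String provider;    // market data provider id
        double bidPrice;    // quoted bid price
        int bidQuantity;    // quantity available at that price
    }

    static class Provider {
        final String id;
        final int quantity;
        Provider(String id, int quantity) { this.id = id; this.quantity = quantity; }
    }

    static class PricePoint {
        final double price;
        final List<Provider> providers = new ArrayList<>();
        int totalQty;
        PricePoint(double price) { this.price = price; }
    }

    // Bid side, sorted best (highest) price first.
    private final List<PricePoint> bids = new ArrayList<>();

    // Merge one quote into the bid side: find or create the PricePoint for
    // this price, then add the provider's size and update the running total.
    void onBidQuote(Quote q) {
        int i = 0;
        while (i < bids.size() && bids.get(i).price > q.bidPrice) {
            i++;
        }
        PricePoint pp;
        if (i < bids.size() && bids.get(i).price == q.bidPrice) {
            pp = bids.get(i);               // existing price level
        } else {
            pp = new PricePoint(q.bidPrice);
            bids.add(i, pp);                // insert, keeping the list sorted
        }
        pp.providers.add(new Provider(q.provider, q.bidQuantity));
        pp.totalQty += q.bidQuantity;
    }
}

Even this stripped-down version hints at why a flat table model struggles here: each update is a positional insert into a nested, sorted structure, not a relational join.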

Furthermore, after each quote is processed and the aggregated order book is updated, it is imperative that the book be made available to trading strategies expeditiously. Minimizing signal-to-trade latency is a key measure of success in algorithmic trading. Aggregation is a heavyweight, compute-intensive operation; it takes a lot of processing power to aggregate 1,000 symbols across 5 exchanges. As such, it is one (of many) opposing forces to the goal of minimizing latency. So this presents yet another critical aspect of aggregation: how best to design it so that it can deliver its content to eagerly awaiting strategies. One means of minimizing that latency is to have the aggregation component and the trading strategies co-resident within the CEP runtime engine. Passing (or otherwise providing) the aggregated order book to the strategies then becomes a simple 'tap-on-the-shoulder' coding construct, as the sketch below illustrates. But it does imply that the CEP language has the semantic expressiveness to design and implement both aggregation and trading strategies, and that the engine can load and run them side by side. Any other model implies not only multiple languages (e.g., Java and StreamSQL) but likely some sort of distributed, networked model. Separating aggregation from its consumers, the trading strategies, will likely incur enough overhead to impact that all-important signal-to-trade latency measure.

I do realize that the CEP vendors using a streaming SQL variant have begun to add imperative syntax to support complex procedural logic and "loop" constructs, something I'm quite glad to see happening. It only validates the claim I've been making all along: the SQL language at its core is unsuitable for full-fledged CEP-style applications. The unfortunate side effect of these vendor-specific additions is that they will fracture attempts at standardization.
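Here is a minimal sketch of that 'tap-on-the-shoulder' construct, assuming a hypothetical BookListener interface (this is illustrative Java, not an Apama API): when the aggregator and the strategies are co-resident, publishing an updated book is a direct in-process callback.

import java.util.ArrayList;
import java.util.List;

// Placeholder for the nested book structure sketched earlier.
class AggregatedOrderBook { }

// Hypothetical callback interface implemented by each co-resident strategy.
interface BookListener {
    void onBookUpdate(String symbol, AggregatedOrderBook book);
}

class Aggregator {
    private final List<BookListener> strategies = new ArrayList<>();

    void register(BookListener strategy) {
        strategies.add(strategy);
    }

    // Called after each quote has been merged into the book; every
    // co-resident strategy sees the update immediately, with no
    // serialization or network hop on the signal-to-trade path.
    void publish(String symbol, AggregatedOrderBook book) {
        for (BookListener strategy : strategies) {
            strategy.onBookUpdate(symbol, book);
        }
    }
}

The alternative, pushing each update over a network to an external strategy process, adds encode/decode and transport costs to every single quote.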

In my previous blog, I wanted to point out the challenges of using the SQL language to both implement logic and manage application state. To that end, I provided a small snippet of a StreamSQL variant. A criticism leveled against it states that it's an unnecessarily inefficient bit of code. I won't argue that point, and I won't take credit for writing it either; I simply borrowed it from a sample application provided with another SQL-based CEP product. The sample code a vendor includes with its product is all too often taken as gospel; a customer's expectation is that it represents best-practice usage. Vendors should take great care in providing samples, portions of which inevitably end up in production code.

The second criticism I received concerned a few unintentionally scathing comments I made about Java and C++. I stated that using C++ and/or Java "means you start an application's implementation at the bottom rung of the ladder". My intent was to draw an analogy to CEP, with its language and surrounding infrastructure. All CEP engines provide much more than just a language: a runtime engine or virtual machine, connectivity components, visualization tools, and management/deployment tools. CEP vendors, like all infrastructure vendors, live and die by the features, performance, and quality of their products. All too often I've witnessed customers take a "not invented here" attitude: they survey the (infrastructure) landscape and decide "we can do better". For a business's IT group, chartered with servicing the business, to think it can implement infrastructure itself is a naïve viewpoint. Granted, on occasion requirements might be so unique that the only choice is to start slinging C++ code, but weighing the merits of commercial (and open source) infrastructure should not be overlooked.

My goal in this and past blogs is to provide concrete use cases and opinions on CEP drawn from my own experience designing, building, and deploying Apama CEP applications. In doing so, I was quite aware that I was painting a big red bulls-eye on my back, making me an easy target for detractors to take aim at. Surprisingly, I have received far more positive commentary than I ever expected, and the criticisms have been fully professional. I thank all who have taken the time to read my editorials; I am quite flattered.

