Taking Aim
Posted by Louis Lovas
I am both humbled and appreciative
of all the accolades, constructive comments (hey, fix that misspelled word) and, yes, criticism of my latest blog
about using SQL for Complex Event Processing. Given the somewhat polarizing nature of using the SQL language for CEP
applications, I was expecting some measure of response, as shown by this rebuttal.
For every viewpoint there is always an opposing, yet arguably valid, outlook. I welcome any and all
commentary. Again, thanks to all for reading and commenting.
There were two main themes of the
criticism that I received. One was on the viability of the SQL language, the
other on my commentary on the use of Java and C++ for CEP applications. I would
like to clarify and reinforce a few points that I made.
I chose the aggregation use-case
as an example to highlight limitations of SQL because it's one that I have
recent experience with. I have been both directly and indirectly involved in
six aggregation projects for our financial services customers over the past
year. In those endeavors I've both learned much and leveraged much. As I tried to describe in a condensed
narrative, aggregation is a challenging problem, one that is best solved by the use
of complex nested data structures and associated program logic. Trying to represent this in SQL is a tall
order given the flat, two-dimensional nature of SQL tables and their limited ability
for semantic expression. To explain this further, I have taken a snippet
of a MonitorScript-based implementation, presenting it as a bit of pseudo code
that describes the nested structure I am referring to. I've intentionally
avoided any specific language syntax and I've condensed the structures to just
the most relevant elements. But suffice it to say, defining these structures is
clearly possible in Java, C++ and Apama's MonitorScript. I would also like to give credit where it's
due and acknowledge many of my colleagues for the (abridged) definition I'm
using below, from one of our customer implementations.
structure Provider {
    string  symbol;                       // symbol (unique to this provider)
    string  marketId;                     // the market identifier of the price point
    integer quantity;                     // quantity the provider is offering
    float   timestamp;                    // time of the point
    float   cost;                         // transaction cost for this provider
    hashmap<string,string> extraParams;   // used for storing extra information on the point
}

structure PricePoint {
    float   price;                        // the price (either a bid or ask)
    array<Provider> providers;            // array of providers at this price
    integer totalQty;                     // total quantity across all providers at this price
}

structure AggregatedOrderBook {
    integer sequenceId;                   // counter incremented each time the book is updated
    integer totalBidQuantity;             // total volume available on the bid side
    integer totalAskQuantity;             // total volume available on the ask side
    integer totalProviders;               // total number of providers
    array<PricePoint> bids;               // list of bids, sorted by price
    array<PricePoint> asks;               // list of asks, sorted by price
}
An aggregation engine would create
an instance of AggregatedOrderBook for each symbol, tracking prices per market data Provider. As market data quotes arrive
they are decomposed and inserted (sorted and merged) into the appropriate PricePoint, and the total values are recalculated.
This is an oversimplification of what transpires per incoming quote, but the
aim here is to provide a simplified yet representative example of the
complexities in representing an Aggregated Order Book.
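To make the insertion step above a little more concrete, here is a minimal Java sketch of merging an incoming quote into a bid list kept sorted by price. The class and method names are my own invention for illustration, not drawn from any Apama or customer implementation, and it handles only the bid side of the book.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a bid-side book kept sorted by price, highest first.
class PricePoint {
    final double price;
    int totalQty;
    final List<String> providers = new ArrayList<>(); // provider ids quoting at this price

    PricePoint(double price) { this.price = price; }
}

class BidBook {
    final List<PricePoint> bids = new ArrayList<>(); // sorted descending by price

    // Merge a quote into the book: locate the matching price level (or splice
    // in a new one at the sorted position), then update that level's totals.
    void insertBid(String providerId, double price, int qty) {
        int i = 0;
        while (i < bids.size() && bids.get(i).price > price) i++;
        PricePoint level;
        if (i < bids.size() && bids.get(i).price == price) {
            level = bids.get(i);            // existing price level
        } else {
            level = new PricePoint(price);  // new level, inserted in sorted order
            bids.add(i, level);
        }
        level.providers.add(providerId);
        level.totalQty += qty;
    }
}
```

Even this toy version hints at the bookkeeping a real aggregator must do on every quote: a sorted merge, per-level provider tracking and running totals, all before the ask side, replacements, and removals are even considered.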
Furthermore, after each quote is
processed and the aggregated order book is updated, it's imperative that the book be
made available to trading strategies expeditiously. Minimizing signal-to-trade latency is a key measure
of success in algorithmic trading. Aggregation is a heavyweight, compute-intensive operation; it takes a
lot of processing power to aggregate 1,000 symbols across 5 exchanges. As such,
it is one (of many) opposing forces to the goal of minimizing latency. So this presents yet another critical aspect
of aggregation: how best to design it so that it can deliver its content to
eagerly awaiting strategies. One means
of minimizing that latency is to have the aggregation component and trading
strategies co-resident within the CEP runtime engine. Passing (or otherwise
providing) the aggregated order book to the strategies then becomes a simple 'tap-on-the-shoulder' coding
construct. But it does imply that the CEP
language has the semantic expressiveness to design and implement both
aggregation and trading strategies, and then the ability to load and run them
side-by-side within the CEP engine. Any
other model implies not only multiple languages (e.g. Java and StreamSQL) but
likely some sort of distributed, networked model. Separating aggregation from
its consumers, the trading strategies, will likely incur enough overhead to
impact that all-important signal-to-trade
latency measure. I do realize that the
CEP vendors using a streaming SQL variant have begun to add imperative syntax to
support complex procedural logic and "loop" constructs, something
I'm quite glad to see happening. It only validates the claim I've been making
all along: the SQL language at its core is unsuitable for full-fledged CEP-style
applications. The unfortunate side effect of these vendor-specific additions is
that they will fracture attempts at standardization.
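The 'tap-on-the-shoulder' construct can be pictured as a plain in-process callback: the aggregator invokes each co-resident strategy directly, with no serialization or network hop in between. A hypothetical Java sketch (the interface and class names are mine, not Apama's):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-process dispatch: the aggregator "taps" each co-resident
// strategy with a direct method call after every book update. No network,
// no serialization -- the cost is essentially a function call.
interface Strategy {
    void onBookUpdate(String symbol, long sequenceId);
}

class Aggregator {
    private final List<Strategy> strategies = new ArrayList<>();
    private long sequenceId = 0;

    void register(Strategy s) { strategies.add(s); }

    // Called after each incoming quote has been merged into the book.
    void publish(String symbol) {
        sequenceId++;
        for (Strategy s : strategies) {
            s.onBookUpdate(symbol, sequenceId); // the tap on the shoulder
        }
    }
}
```

Contrast that with any split model: the moment the aggregator and the strategies live in separate processes, every one of those taps becomes a serialized message over a socket, and the latency budget pays for it on each quote.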
In my previous blog,
I wanted to point out the challenges of using the SQL language both to implement
logic and to manage application state. To that end, I provided a small snippet of
a streamSQL variant. A criticism leveled
against it states that it's an unnecessarily inefficient bit of code. I won't
argue that point, and I won't take credit for writing it either. I simply
borrowed it from a sample application provided with another SQL-based CEP
product. The sample code a vendor
includes with their product is all too often taken as gospel. A customer's expectation is that it
represents best practice usage. Vendors should take great care in providing
samples, portions of which inevitably end up in production code.
The second criticism I received
was on a few unintentionally scathing comments I made against Java and
C++. I stated that using C++ and/or Java
"means you start
an application's implementation at the bottom rung of the ladder". My
intent was to draw an analogy to CEP with its language and surrounding
infrastructure. All CEP engines provide much more than just a language. They provide a runtime engine
or virtual machine, connectivity components, visualization tools and
management/deployment tools. CEP vendors, like all infrastructure vendors, live
and die by the features, performance and quality of their products. All too
often I've witnessed customers take a "not
invented here" attitude. They may survey the (infrastructure)
landscape and decide "we can do
better". For a business' IT group, chartered with servicing the business, to think it can implement infrastructure
itself is a naïve viewpoint. Granted, on occasion requirements might be so
unique that the only choice is to start slinging C++ code, but weighing the
merits of commercial (and open source) infrastructure should not be overlooked.
My goal in this and past blogs is
to provide concrete use-cases and opinions on CEP drawn from my own experiences
with designing, building and deploying Apama CEP applications. In doing so I
was quite aware that I was drawing a big red bulls-eye on my back, making me an
easy target for detractors to take aim at. Surprisingly, I have received much more
positive commentary than I ever expected, and the criticisms have been fully professional. I
thank all who have taken the time to read my editorials; I am quite flattered.