Bending the Nail
Posted by Louis Lovas
In my recent blogs (When all you have is a hammer everything looks like a nail and Hitting the nail on the head) one could conclude that I've been overly inflammatory toward SQL-based CEP products. I really have no intention to be seditious, just simply factual. I've been building and designing software far too long to have an emotional response to newly hyped products or platforms. Witnessing both my own handiwork and many compelling technologies fade from glory all too soon has steeled me against any fervent reactions. I've always thought the software business is for the young. As with most professions, one gets hardened by years of experience, some good, and some only painful lessons. Nonetheless, over time that skill, knowledge and experience condenses. The desire to share that (hard-won) wisdom is all too often futile. The incognizant young are too busy repeating the same mistakes to take notice. Funny thing is … wisdom is an affliction that inevitably strikes us all.
Well, enough of the philosophical meanderings: just the facts, please …
In a recent blog, I explored the need for CEP-based applications to manage state. As a representative example, I used the algo-trading case of managing Orders-in-Market. The need to process incoming market data while taking care of Orders placed is paramount to the goals of algorithmic trading. I'll delve a bit deeper into the state management requirement, this time focusing on the management of complex market data, the input, if you will, to the algorithmic engine. Aggregation of market data is a trend emerging across all asset classes in Capital Markets. Simply put, aggregation is the process of collecting and ordering quote data (bids & asks) from multiple sources of liquidity into a consolidated Order Book. In the end, this is a classic sort/merge problem. Incoming quotes are dissected and inserted into a cache organized by symbol, exchange and/or market maker, with Bid and Ask prices held in sorted order. Aggregation of market data is applicable to many asset classes (e.g. Equities, Foreign Exchange and Fixed Income). The providers of liquidity in any asset class share a number of common constructs but an equal number of unique oddities. For the aggregation engine, there are likewise common requirements (sorting/merging) and a few unique nuances. It's the role of the aggregation engine to understand each provider's subtleties and normalize them for the consuming audience. For example, different Equities Exchanges (or banks providing FX liquidity) can use slightly different symbol naming conventions. Likewise, transaction costs can (or should) influence the quoted prices. Many FX providers put a time-to-live (TTL) on their streaming quotes, which means the aggregation engine has to handle price expirations (and subsequently eject expired quotes from its cache). In the event of a network (or other) disconnection, the cache must be cleansed of that provider's now-stale prices.
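To make the sort/merge concrete, here is a minimal sketch in Python of the mechanics described above: per-symbol books that merge provider quotes, keep each side sorted, expire TTL-bound prices, and purge a disconnected provider. All class, field and provider names are illustrative assumptions, not any product's actual API.

```python
import time
from collections import defaultdict

class AggregatedBook:
    """Hypothetical consolidated order book: symbol -> side -> sorted quotes."""

    def __init__(self):
        # each quote is a tuple (price, qty, provider, expiry_or_None)
        self.books = defaultdict(lambda: {"bid": [], "ask": []})

    def on_quote(self, symbol, side, price, qty, provider, ttl=None):
        expiry = time.time() + ttl if ttl else None
        levels = self.books[symbol][side]
        # replace this provider's previous quote, then re-sort the side:
        # bids best-first means highest price first, asks lowest first
        levels[:] = [q for q in levels if q[2] != provider]
        levels.append((price, qty, provider, expiry))
        levels.sort(key=lambda q: q[0], reverse=(side == "bid"))

    def drop_provider(self, provider):
        # on disconnect, cleanse the cache of that provider's stale prices
        for book in self.books.values():
            for side in book:
                book[side][:] = [q for q in book[side] if q[2] != provider]

    def expire(self, now=None):
        # eject quotes whose TTL has elapsed
        now = now or time.time()
        for book in self.books.values():
            for side in book:
                book[side][:] = [q for q in book[side]
                                 if q[3] is None or q[3] > now]
```

Even this toy version shows the shape of the problem: the state is a nested, per-symbol structure with provider-specific bookkeeping, and the logic is imperative cache maintenance rather than a single query.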
The aggregation engine must account for these (and a host of other needs) since its role is to provide a single unified view of an Order Book to trading applications. The trading applications can be on both sides of the equation. A typical Buy-side consumer is a Best Execution algo. Client orders or Prop desk orders are filled by sweeping the aggregate book from the top. For Market Makers, aggregation can be the basis for a Request For Quote (RFQ) system.
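Sweeping the aggregate book from the top is, at its core, a walk down the sorted price levels. A hedged sketch of that consumption pattern (field layout and names are illustrative, not any vendor's API):

```python
def sweep_book(levels, qty_wanted):
    """Sweep sorted (price, qty, provider) levels from the top of the book,
    accumulating fills until qty_wanted is satisfied (best-execution sketch)."""
    fills, remaining = [], qty_wanted
    for price, qty, provider in levels:
        if remaining <= 0:
            break
        take = min(qty, remaining)
        fills.append((provider, price, take))
        remaining -= take
    # remaining > 0 signals a partial fill; a fill-or-kill algo would
    # check the book's total volume first and not sweep at all
    return fills, remaining
```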
At first glance, one would expect SQL-based CEP engines to handle this use case effectively. After all, sorting and merging (joining) is a common use of SQL in the database world, and streaming SQL does provide Join and Gather type operators. However, the complexities of an aggregation model quickly outstrip the use of SQL as an efficient means of implementation. The model requires managing/caching a complex multi-dimensional data structure. For each symbol, multiple arrays of a price structure are necessary: one for the bid side, another for the ask side. Each element in the price structure would include the total quantity available at that price and a list of providers. Each provider entry, in turn, ends up being a complex structure in itself, since it would include any symbol mapping, transaction costing, expiration and connectivity information. At the top level of the aggregate book would be a summation of the total volume available (per symbol, of course). Algos more interested in complete order fulfillment (i.e. fill-or-kill) would want this summary view.
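The nested structure just described can be sketched directly as object types. This is a minimal, assumed layout in Python for illustration only; the field names are mine, not any product's schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ProviderEntry:
    provider: str
    quantity: float
    symbol_alias: str = ""            # provider-specific symbol mapping
    transaction_cost: float = 0.0     # cost adjustment applied to the quote
    expiry: Optional[float] = None    # TTL-driven expiration, if any

@dataclass
class PriceLevel:
    price: float
    total_quantity: float = 0.0       # total available at this price
    providers: List[ProviderEntry] = field(default_factory=list)

@dataclass
class SymbolBook:
    bids: List[PriceLevel] = field(default_factory=list)  # sorted best-first
    asks: List[PriceLevel] = field(default_factory=list)

    @property
    def total_bid_volume(self) -> float:
        # the top-level summary a fill-or-kill algo would consult
        return sum(level.total_quantity for level in self.bids)

# the aggregate book itself: one SymbolBook per symbol
aggregate_book: Dict[str, SymbolBook] = {}
```

Expressing this in an object-capable language is a few declarations; the point of the next paragraph is what it costs to say the same thing in flat tables.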
Using stream SQL to accomplish this would mean flattening this logical multi-dimensional object into the row/column format of SQL tables. SQL tables can contain only scalar values; multidimensionality can only be achieved by employing multiple tables. I don't mean to imply this is undesirable or illogical; initially it seems like a natural fit. However, an Aggregated Book is more than just its structure: as I mentioned above, it also embodies a wealth of processing logic. In the end one would be bending the SQL language to perform unnatural acts in any attempt to implement this complex use case.
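To see what the flattening looks like, here is a sketch using lists of tuples as stand-ins for scalar-only SQL rows. The table and column names are invented for illustration; the point is that one nested object becomes several flat tables, and reassembling even one price level requires a join:

```python
# price_levels(symbol, side, price, total_qty) -- one row per level
price_levels = [
    ("EURUSD", "bid", 1.1003, 500_000),
    ("EURUSD", "bid", 1.1001, 1_000_000),
]

# provider_quotes(symbol, side, price, provider, qty, expiry) -- one row
# per provider contribution; the nesting is now encoded in the join key
provider_quotes = [
    ("EURUSD", "bid", 1.1003, "BankB", 500_000, None),
    ("EURUSD", "bid", 1.1001, "BankA", 1_000_000, None),
]

def providers_at(symbol, side, price):
    """Reassemble one price level's provider list: a join on (symbol, side, price)."""
    return [row for row in provider_quotes
            if row[0] == symbol and row[1] == side and row[2] == price]
```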
To illustrate an unnatural act, here's a very simple StreamSQL example. The purpose of this bit of code is to increment an integer counter (TradeID = TradeID + 1) on receipt of every tick (TradesIn) event and produce a new output stream of ticks (Trades_with_ID) that includes that integer counter - a trade identifier of sorts.
CREATE INPUT STREAM TradesIn (
    Symbol STRING, Volume INTEGER, Price FLOAT);
CREATE MEMORY TABLE TradeIDTable (
    RowPointer INTEGER, TradeID INTEGER,
    PRIMARY KEY(RowPointer) USING btree);
CREATE OUTPUT STREAM Trades_with_ID;
INSERT INTO TradeIDTable (RowPointer, TradeID)
SELECT 1 AS RowPointer, 0 AS TradeID
FROM TradesIn
ON DUPLICATE KEY UPDATE
    TradeID = TradeID + 1
RETURNING TradesIn.Symbol AS Symbol,
    TradesIn.Volume AS Volume,
    TradesIn.Price AS Price,
    TradeIDTable.TradeID AS TradeID
INTO Trades_with_ID;
The state to manage and the processing logic in this small StreamSQL snippet amount to no more than incrementing an integer counter (i.e. i = i + 1). To accomplish this very simple task, a memory table (TradeIDTable) is used to INSERT and then repeatedly UPDATE a single row (1 AS RowPointer) that holds the incrementing integer (ON DUPLICATE KEY UPDATE TradeID = TradeID + 1) each time a new TradesIn event is received. A rather creative use of SQL, don't you think? However, simply extrapolate the state requirements beyond a single TradeID integer and the processing logic beyond TradeID = TradeID + 1 and you quickly realize you would be bending the language to the point of breaking.
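For contrast, the same "attach an incrementing trade ID" logic in a general-purpose language is ordinary mutable state with no table gymnastics. A sketch (the stream is just an iterable here; names mirror the snippet above):

```python
def with_trade_ids(trades_in):
    """Yield each incoming (symbol, volume, price) tick with an appended
    incrementing TradeID -- the i = i + 1 the SQL table was simulating."""
    trade_id = 0
    for trade in trades_in:
        trade_id += 1
        yield (*trade, trade_id)
```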
In the commercial application world, relational databases are an entrenched and integral component, and SQL is the language applications use to interact with them. As applications have grown in complexity, their data needs have grown in complexity too. One outgrowth of this is a breed of application service known as Object-Relational (O/R) mapping. O/R mapping technologies have emerged to bridge the impedance mismatch between an application's object view of data and SQL's flat, two-dimensional view. A wealth of O/R products are available today, so the need for such technologies clearly exists.
Why am I mentioning O/R technologies in a CEP blog? Simply to emphasize the point that the SQL language, as validated by the very existence of O/R technologies in the commercial space, is a poor choice for CEP applications. As I've mentioned in previous blogs, programming languages that provide the vernacular to express both complex structures (objects) and complex semantics (programming logic) are as necessary for aggregation as they are for Orders-in-Market or any CEP application.
So what sort of language is appropriate for CEP? Well, there is always the choice of Java or C++. Traditional languages such as these clearly provide this expressiveness and can be used to build applications in any domain. However, trailing along behind that expressiveness is also risk: using these languages means starting an application's implementation at the bottom rung of the ladder, a risk evident in many a failed project. A step up is domain-specific languages. For the domain of streaming data, Event Programming Languages (EPLs) are clear winners. Like C++ and Java they contain syntax for defining complex objects (like an Aggregated Order Book) and imperative execution, but they also include a number of purpose-built declarative constructs specifically designed to process streaming data efficiently. Apama's MonitorScript is one such EPL.