Bending the Nail
Posted by Louis Lovas
In my recent blogs ("When all you have is a hammer everything looks like a nail"
and "Hitting the nail on the head"), one could conclude that I've been overly
inflammatory toward SQL-based CEP products. I really have no intention of being
seditious, just
factual. I've been building and designing software far too long to have an
emotional response to newly hyped products or platforms. Witnessing both my own
handiwork and many compelling technologies fade from glory all too soon has
steeled me against any fervent reactions. I've always thought the
software business is for the young. As in most professions, one gets hardened
by years of experience, some of it good and some of it only painful lessons. Nonetheless,
over time that skill, knowledge and experience condenses. The desire to share
that (hard-won) wisdom is all too often futile. The incognizant young are too
busy repeating the same mistakes to take notice. Funny thing is … wisdom is an
affliction that inevitably strikes us all.
Well, enough of the philosophical meanderings; just the facts, please …
In a recent blog, I explored the need for CEP-based applications to manage state.
As a representative example, I used the algo-trading case of managing
Orders-in-Market. Processing incoming market data while also taking care of the
orders already placed is paramount to the goals of algorithmic trading. This time
I'll delve a bit deeper into the state management requirement by focusing on the
management of complex market data: the input, if you will, to the algorithmic
engine. Aggregation of market data is a trend
emerging across all asset classes in Capital Markets. Simply put, aggregation is the process of
collecting and ordering quote data (bids & asks) from multiple sources of
liquidity into a consolidated Order Book. In the end, this is a classic
sort/merge problem. Incoming quotes are dissected and inserted into a cache
organized by symbol and by exchange and/or market maker, and sorted by Bid and Ask
prices. Aggregation of market data is
applicable to many asset classes (e.g. Equities, Foreign Exchange and Fixed
Income). The providers of liquidity in any asset class share a number of common
constructs but an equal number of unique oddities. For the aggregation engine,
there are also common requirements (such as sorting/merging) and a few unique
nuances. It's the role of the aggregation engine to understand each provider's
subtleties and normalize them for the consuming audience. For example,
different Equities Exchanges (or banks providing FX liquidity) can use slightly
different symbol naming conventions. Likewise, transaction costs can (or
should) have an influence on the quote prices. Many FX providers put a time-to-live (TTL) on their streaming
quotes, which implies the aggregation engine has to handle price expirations
(and subsequently eject them from its cache). In the event of a network (or
other) disconnection, the cache must be cleansed of that provider's (now stale)
prices. The aggregation engine must account for these needs (and a host of
others), since its role is to provide a single unified view of an Order Book to
trading applications. Those trading applications can sit on either side of the
market. A typical Buy-side consumer is a Best Execution algo: client orders or
Prop desk orders are filled by sweeping the aggregate book from the top. For
Market Makers, aggregation can be the basis for a Request For Quote (RFQ) system.
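
To make the sort/merge, expiration and disconnect handling a little more tangible, here is a minimal Python sketch of such a cache. It is an illustration under assumptions, not any product's implementation: the Quote and QuoteAggregator names, the provider names and the symbol spellings are all hypothetical, and transaction costing is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    provider: str      # liquidity source (hypothetical names in the usage below)
    symbol: str        # the provider's own spelling of the symbol
    side: str          # "bid" or "ask"
    price: float
    quantity: int
    expires_at: float  # absolute expiry time in seconds (provider TTL applied)


class QuoteAggregator:
    """Hypothetical cache holding one live quote per (symbol, side, provider)."""

    def __init__(self, symbol_map):
        # (provider, provider symbol) -> normalized symbol
        self.symbol_map = symbol_map
        # (normalized symbol, side, provider) -> Quote
        self.quotes = {}

    def on_quote(self, quote):
        # Normalize the provider's symbol, then insert or replace its quote.
        symbol = self.symbol_map.get((quote.provider, quote.symbol), quote.symbol)
        self.quotes[(symbol, quote.side, quote.provider)] = quote

    def expire(self, now):
        # Eject quotes whose time-to-live has passed.
        self.quotes = {k: q for k, q in self.quotes.items() if q.expires_at > now}

    def on_disconnect(self, provider):
        # Cleanse the cache of a disconnected provider's (now stale) prices.
        self.quotes = {k: q for k, q in self.quotes.items() if k[2] != provider}

    def book_side(self, symbol, side):
        # Merge all providers' quotes for one side, best price first:
        # highest bid first, lowest ask first.
        live = [q for (s, sd, _), q in self.quotes.items() if s == symbol and sd == side]
        return sorted(live, key=lambda q: q.price, reverse=(side == "bid"))


agg = QuoteAggregator({("BankA", "EUR/USD"): "EURUSD", ("BankB", "EURUSD.FX"): "EURUSD"})
agg.on_quote(Quote("BankA", "EUR/USD", "ask", 1.1002, 1_000_000, expires_at=10.0))
agg.on_quote(Quote("BankB", "EURUSD.FX", "ask", 1.1001, 2_000_000, expires_at=5.0))
best_ask = agg.book_side("EURUSD", "ask")[0]  # BankB's 1.1001 sits at the top
agg.expire(now=6.0)                           # BankB's quote has now timed out
```

Even this toy version needs keyed state, per-provider rules and time-driven cleanup, which is the point of the paragraph above.
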
At first glance, one would expect that SQL-based CEP engines would be able to
handle this use-case effectively. After all, sorting and merging (joining) is a
common usage of SQL in the database world and streaming SQL does provide Join
and Gather type operators. However, the complexities of an aggregation model
quickly outstrip the use of SQL as an efficient means of implementation. The
model requires managing/caching a complex multi-dimensional data structure. For
each symbol, multiple arrays of a price structure are necessary: one for the
bid side and another for the ask side. Each element in the price structure would
include the total quantity available at that price and a list of providers. Each
provider entry, in turn, ends up being a complex structure in itself, since it
would include any symbol mapping, transaction costing, expiration and
connectivity information. At the top level of the aggregate book would be a
summation of the total volume available (per symbol, of course). Algos more
interested in complete order fulfillment (e.g. fill-or-kill) would want this
summary view.
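
The nested shape described above can be sketched in a few lines of Python. This is only an illustration: the ProviderEntry, PriceLevel and SymbolBook names, their fields and the sample values are assumptions, and matching price levels by exact float equality is a simplification a real engine would avoid.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderEntry:
    provider: str           # liquidity source
    native_symbol: str      # symbol mapping back to the provider's convention
    quantity: int
    cost_per_unit: float    # transaction costing
    expires_at: float       # expiration as an absolute time
    connected: bool = True  # connectivity information


@dataclass
class PriceLevel:
    price: float
    providers: list = field(default_factory=list)

    @property
    def total_quantity(self):
        # Total quantity available at this price across all providers.
        return sum(p.quantity for p in self.providers)


@dataclass
class SymbolBook:
    symbol: str
    bids: list = field(default_factory=list)  # PriceLevels, highest price first
    asks: list = field(default_factory=list)  # PriceLevels, lowest price first

    def add(self, side, price, entry):
        levels = self.bids if side == "bid" else self.asks
        for level in levels:
            if level.price == price:  # simplification: exact float match
                level.providers.append(entry)
                return
        levels.append(PriceLevel(price, [entry]))
        levels.sort(key=lambda lv: lv.price, reverse=(side == "bid"))

    def total_volume(self, side):
        # The top-level summary: all quantity available on one side.
        levels = self.bids if side == "bid" else self.asks
        return sum(lv.total_quantity for lv in levels)

    def can_fill(self, side, quantity):
        # A fill-or-kill style check against the summary view.
        return self.total_volume(side) >= quantity


book = SymbolBook("EURUSD")
book.add("ask", 1.1002, ProviderEntry("BankA", "EUR/USD", 1_000_000, 0.00001, 10.0))
book.add("ask", 1.1001, ProviderEntry("BankB", "EURUSD.FX", 2_000_000, 0.00002, 5.0))
book.add("ask", 1.1002, ProviderEntry("BankC", "EURUSD", 500_000, 0.0, 8.0))
print(book.total_volume("ask"))         # 3500000
print(book.can_fill("ask", 4_000_000))  # False
```

Note that each level holds a list, and each list element is itself a record: three levels of nesting before any processing logic is added.
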
Using stream SQL to attempt to accomplish this would mean flattening this logical
multi-dimensional object into the row/column format of a SQL table. SQL tables
can contain only scalar values; multidimensionality can only be achieved by
employing multiple tables. I don't mean to imply this is undesirable or
illogical. Initially it seems like a natural fit. However, an Aggregated Book is
more than just its structure; as I mentioned above, it also carries a wealth of
processing logic. In the end, one would be bending the SQL language to perform
unnatural acts in any attempt to implement this complex use-case.
To illustrate an unnatural act, here's a very simple streamSQL example. The
purpose of this bit of code is to increment an integer counter
(TradeID = TradeID + 1) on receipt of every tick (TradesIn) event and produce a
new output stream of ticks (Trades_with_ID) that now includes that integer
counter, a trade identifier of sorts.
    -- Incoming ticks
    CREATE INPUT STREAM TradesIn (
        Symbol string(5),
        Volume int,
        Price double
    );

    -- A one-row table whose only job is to hold the counter
    CREATE MEMORY TABLE TradeIDTable (
        TradeID int,
        RowPointer int,
        PRIMARY KEY(RowPointer) USING btree
    );

    CREATE STREAM Trades_with_ID;

    -- Every tick re-inserts row 1; the key collision increments TradeID
    INSERT INTO TradeIDTable (RowPointer, TradeID)
        SELECT 1 AS RowPointer, 0 AS TradeID
        FROM TradesIn
        ON DUPLICATE KEY UPDATE
            TradeID = TradeID+1
        RETURNING TradesIn.Symbol AS Symbol,
                  TradesIn.Volume AS Volume,
                  TradesIn.Price AS Price,
                  TradeIDTable.TradeID AS TradeID
        INTO Trades_with_ID;
The state to manage and the processing logic in this small stream SQL snippet
amount to no more than incrementing an integer counter (i.e. i = i + 1). To
accomplish this very simple task, a memory table (TradeIDTable) is used to hold a
single row (1 AS RowPointer) containing that integer. Each new TradesIn event
attempts to INSERT the same row again, the key collision triggers the increment
(ON DUPLICATE KEY UPDATE TradeID = TradeID + 1), and the updated value is
returned into the output stream. In a way, a rather creative use of SQL, don't
you think? However, simply extrapolate the state requirements beyond TradeID int
and the processing logic beyond TradeID = TradeID + 1, and you quickly realize
you would be bending the language to the point of breaking.
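
For contrast, here is roughly what the same counter looks like when the language lets you hold state directly. This is a Python sketch rather than an EPL, the tag_trades name is hypothetical, and starting the counter at 0 assumes the first tick simply receives the initial row's TradeID of 0.

```python
from itertools import count


def tag_trades(trades_in):
    """Yield each (symbol, volume, price) tick with a running trade ID appended."""
    trade_ids = count(0)  # the entire state: one integer, incremented per event
    for symbol, volume, price in trades_in:
        yield symbol, volume, price, next(trade_ids)


ticks = [("IBM", 100, 101.5), ("MSFT", 200, 27.3)]
print(list(tag_trades(ticks)))
# [('IBM', 100, 101.5, 0), ('MSFT', 200, 27.3, 1)]
```

No table, no key and no duplicate-key trick: the counter is just a variable.
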
In the commercial application world, relational databases are an entrenched and
integral component. SQL is the language applications use to interact with those
databases. As applications have grown in complexity, their data needs have grown
in complexity too. One outgrowth of this is a new breed of application service
known as Object-Relational (O/R) mapping. O/R mapping technologies have emerged
to bridge the impedance mismatch between an application's object view of data and
SQL's flat two-dimensional view. A wealth of O/R products is available today, so
the need for such technologies clearly exists.
Why am I mentioning O/R technologies in a CEP blog? Simply to emphasize the point that
the SQL language, as validated by the very existence of O/R technologies in the
commercial space, is a poor choice for CEP applications. As I've mentioned in
previous blogs, programming languages that provide the vernacular to express
both complex structures (objects) and complex semantics (programming logic) are
as necessary for aggregation as they are for Orders-in-Market or any CEP
application.
So what sort of language is appropriate for CEP? Well, there is always the choice
of Java or C++. Traditional languages such as these clearly provide the needed
expressiveness and can be used to build applications in any domain. However,
trailing along behind that expressiveness is risk. Using these languages means
you start an application's implementation at the bottom rung of the ladder. The
risk associated with this is evident in many a failed project. A step up is
domain-specific languages. For the domain of streaming data, Event Programming
Languages (EPLs) are clear winners. Like C++ and Java, they contain syntax for
defining complex objects (like an Aggregated Order Book) and for imperative
execution, but they also include a number of purpose-built declarative
constructs specifically designed to process streaming data efficiently. Apama's
MonitorScript is one such EPL.
