“Too good to be true.” That’s what Tom Foremski initially wrote about Algebraix Data, a brand new startup in the lucrative database business. The company claims to use something called “Extended Set Algebra” to revolutionize the way database queries are performed. I was immediately skeptical, having never heard of this in all my years studying math. Sure enough, a Google search for “Extended Set Algebra” (with quotes) turned up only two results, both Algebraix. I wasn’t expecting a flood of results, but it’s pretty shocking to see that few! Fortunately, Tom pointed me toward a keyword with a lot more information: “Extended Set Theory”.

There’s a whole mini-community built around Extended Set Theory (XST). The concept was invented by David Childs a few decades ago. One blogger claims that Childs, using a personal laptop and XST, managed to outperform a highly optimized Oracle database with proprietary hardware. That’s a pretty bold claim.

For a while, my hopes were really high. There are a lot of people interested in set theory (I’m talking here about the abstract, rigorously axiomatized version), and there are a lot of people interested in databases, but not so many are interested in both, and I’m one of the few. My imagination ran away and I pictured a young new subfield of logic blossoming as Algebraix Data shines a spotlight on it, and here I am in a prime position to get in on the ground floor.

Unfortunately, the link between abstract set theory and Algebraix Data is not as strong as one is initially led to believe through reading Childs’s writings. Childs defines a new kind of set theory where set inclusion is a ternary relation instead of a binary relation, and I’ll talk more about the specific details below. But the rigorous upheaval of classic set theory is totally unnecessary. It’s a case of “killing a fly with a sledgehammer”. Indeed, from what little details AlgebraixData provides in its mathematics whitepaper(pdf), we can tell they use Childs’s basic idea, but they implement it with the traditional Zermelo-Fraenkel Set Theory which has supported mathematics for over a century.

The following line from that whitepaper is rather bizarre:

A couplet d.v is the set {{d},{d,v}}

This is the Kuratowski definition of an ordered pair. It’s usually only seen in very abstract contexts. In everyday use, instead of writing {{d},{d,v}}, we write (d,v). What the author has done is give the very formal definition of an ordered pair and assign it an unusual notation, d.v instead of (d,v). Is this company intentionally trying to obfuscate things?

Historically, this isn’t Kuratowski’s first surprise appearance. The writings of Mr. Childs are riddled with Kuratowski. Childs points out some well-known anomalies which arise if you try to use the non-ordered set inclusion relation on ordered tuples. For example, if I ask, “Is 1 an element of (1,2,3)?”, the answer is: “Probably not, but maybe. Depends on the background definitions. Not enough info!” This seems scandalous and Childs does his best to wring as much scandal from it as he can. But the truth is, formal set-theoretical inclusion isn’t relevant here. It’s like asking “Is 3 an element of the Pythagorean Theorem?” (again, the answer is “maybe”). If we want an intuitive relation which “acts like you’d think” on ordered tuples, we can define one: namely, say that x is order-contained in (x1,x2,…,xn) if x=x1 or x=x2 or … or x=xn. This is easily doable in classic set theory. The same goes for every other Kuratowski-related anomaly Childs exhibits.

Apparently Algebraix Data agrees, since they don’t actually use Extended Set Theory, just the basic idea behind it– which is a very neat and interesting idea indeed. It’s an idea which impresses me more as a computer scientist than as a mathematician (so much for an easy dissertation topic!)

It seems that the power of Mr. Childs’s idea isn’t so much about optimization as it is about organization. Optimization, however, is often a free bonus of organization– or rather, inefficiency is a consequence of disorganization. I don’t know whether Algebraix Data really is “the next big thing”. If they are, it’s not because XST is a magic wand of optimization– the only way Algebraix might be the next big disruptive innovator is if all the established players like Oracle are just really disorganized. But then, we are talking about mammoth corporations here, so that is a very good possibility ;)

In short, anything you can do with XST, you can do without it– but the question is will you, and how messy will it be?

The Idea Behind XSA/XST

The idea is simple. Classically a set is written like {x,y,z}. An extended set is written more like {xa,yb,zc}. Here x,y,z are the elements and a,b,c are their respective scopes (the mathematical logician will instantly be reminded of type theory). Nesting is allowed, so you can have things like {{ap,bq}x,{cr,ds}y}.

Currently, data is stored in three different ways. First there is raw data, blocks of 1′s and 0′s with no deeper structure (example: a bitmap). Next there is relational data, which is a fancy way of saying tables with rows and columns (example: an Excel spreadsheet). Then there is hierarchical data, where everything is tagged like HTML code, with the proviso that every tag must be closed, and tags must be closed in the opposite order they’re opened; in short, XML.

I recently did a project which involved all three data types: the Romaji Dictionary. The open-source dictionary files were published by Jim Breen in hierarchical (XML) form. In order to do any parsing on them, because the RAM in my computer is basically raw data, I had to convert from hierarchical to raw (I parsed the dictionary into RAM). Having the data in RAM format, I was able to efficiently do the manipulations I wanted. But to get it to its final destination– your computer– I had to load it into a database. That means it became relational data.

Extended sets provide a universal language which can store data in any of the three forms:

  • Raw data example:

    • The raw data “1101″ can be stored as {11,12,03,14}
  • Relational data example:
    • Imagine a spreadsheet with columns: “Name” and “Age”. Row 1 is “Sam, 26″. Row 2 is “Bob, 30″. This table stores as:
      {Sam(Name,1),26(Age,1),Bob(Name,2),30(Age,2)}.
  • Hierarchical data example:
    • The XHTML data “<html><head><title>Hi</title></head><body><p>Hi!</p></body></html>” stores as:
      {{{Hi(title,1)}(head,1), {Hi!(p,1)}(body,1)}(html,1)}

Hence the crucial point of it all: if you’re willing to manipulate extended sets, rather than requiring one specific form (raw, relational or hierarchical), then you never have to worry about converting data from one form to another. It’s a universal language of data. That said, I really don’t see how it can be used to “magically” speed up database queries– unless the big players are suffering from major disorganization. I lied about the three data types. Pry your computer open, you won’t find one single spreadsheet or HTML tag or set or extended set inside. It’s all raw!

FURTHER READING

Connections Between Different Branches of Math
Guessability
Collatz Recursion
“Basic Mathematics”, by Serge Lang