The other day I attended a talk about the Netflix competition by UCSD Professor Charles Elkan, who was incidentally one of its two external judges. It was really insightful and got me interested in Data Mining. So, after thinking about the storage requirements for my pet ‘Web 4.0 monolithic web application’, I decided to dig deeper into DBMS technology vs. filesystems. I learned about OODBMS along the way, so we’ll have to cover those briefly as well.
Let’s start with DBMS. They were created to handle large sets of data efficiently in the face of many concurrent reads and writes – read: users – and to allow swift querying of that data. They evolved from a mathematically sound theory – first-order predicate logic – into what is known as relational algebra. To summarize, DBMS provide the following:
- Correctness under concurrent access – known as the ACID properties,
- complete indices over the data, which together with
- a query language based on a mathematically closed model allow for efficient searching – in other words, a powerful query execution engine.
They can search efficiently because DBMS work with a very limited set of types and have type information about every object, since they strictly validate all data against a schema. Furthermore, most of the stored information is small – short strings or single numbers. With that information it is easy to build indices over the complete data set that are small enough to be kept in memory for fast access. The query execution engines in modern DBMS are powerful enough to query millions of rows in a matter of seconds.
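As a toy sketch of that idea (nothing like a real query engine, and the table contents are made up): because every column has a small, known type, an index over the whole data set fits comfortably in memory, and an equality query becomes a lookup instead of a scan.

```python
# Toy illustration: small, schema-typed rows make in-memory indexing cheap.
rows = [
    (1, "alice", 3200),
    (2, "bob",   2800),
    (3, "carol", 3200),
]

# Index: column value -> list of row positions. For millions of small rows
# this still fits in RAM, which is what makes lookups fast.
salary_index = {}
for pos, (rid, name, salary) in enumerate(rows):
    salary_index.setdefault(salary, []).append(pos)

# The query "salary = 3200" becomes a dictionary lookup, not a full scan.
matches = [rows[p] for p in salary_index.get(3200, [])]
print(matches)  # [(1, 'alice', 3200), (3, 'carol', 3200)]
```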
If you think about the data that needed that kind of storage and/or retrieval back in those days, and that would also gain the most from expensive computing (and whose owners had the money to pay for it), it was mostly that of banks and insurance companies. Most of the data in those domains is rather 2-dimensional: large tables with, e.g., customer or account info. Once you start trying to store data that is not 2-dimensional but n-dimensional – like object instances in an OO language – you run into what is known as the object-relational impedance mismatch. This is, roughly speaking, the problem of mapping n-dimensional data into 2-dimensional tables. The object-relational mapping (ORM) approach maps classes to tables and members to columns, but it has the disadvantage of needing quite a number of tables, which can be cumbersome, and for each n x m relationship it needs an extra mapping table. There are more issues related to the object-relational impedance mismatch which I won’t go into here.
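To make the n x m mapping-table point concrete, here is a minimal sketch using a hypothetical Student/Course schema (the names are invented for the example; any ORM-style mapping looks essentially like this):

```python
import sqlite3

# Two classes become two tables; their n x m relationship needs a third
# "mapping" table that exists only to hold the relation itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
    -- The extra table ORM needs for every n x m relationship:
    CREATE TABLE student_course (
        student_id INTEGER REFERENCES student(id),
        course_id  INTEGER REFERENCES course(id),
        PRIMARY KEY (student_id, course_id)
    );
""")
conn.execute("INSERT INTO student VALUES (1, 'alice')")
conn.execute("INSERT INTO course  VALUES (1, 'databases')")
conn.execute("INSERT INTO student_course VALUES (1, 1)")

# Reconstructing even one object's relations already requires a join:
titles = conn.execute("""
    SELECT c.title FROM course c
    JOIN student_course sc ON sc.course_id = c.id
    WHERE sc.student_id = 1
""").fetchall()
print(titles)  # [('databases',)]
```

Three tables and a join for what is, in the object world, a single list attribute – that is the impedance mismatch in miniature.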
OODBMS, which are quite new in the DBMS landscape, have recently seen a spike in interest because of this exact problem and because OO languages are now the most widely used programming languages. OO languages themselves are a fairly recent addition to the programming-language landscape; in the days when the relational model was invented, it was mostly procedural languages like Fortran or COBOL. Some relational DBMS in those companies are still programmed in COBOL and measure their uptime in decades.
Instances in an OO language can be modeled as a directed graph (DG), which even allows cycles (because of back-references), or as a tree without cycles (if there are no back-references). Trees and multi-dimensional arrays can be mapped onto each other.
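A minimal sketch of why back-references turn the tree into a cyclic graph (class and field names here are made up for the example): a child pointing back at its parent closes a cycle, so anything walking the graph – a serializer or an OODBMS – has to track objects it has already visited.

```python
# A parent/child structure with a back-reference from child to parent.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # back-reference -> possible cycle
        self.children = []
        if parent is not None:
            parent.children.append(self)

root = Node("root")
leaf = Node("leaf", parent=root)

# Following children and then parent brings us back where we started:
assert root.children[0].parent is root

# A naive tree walk would loop forever here, so we remember visited objects:
def reachable(start):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.children)
        if node.parent is not None:
            stack.append(node.parent)
    return len(seen)

print(reachable(leaf))  # 2
```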
In short, OODBMS allow a direct mapping from an instance hierarchy (think tree) into the DBMS, without loss of expressiveness and without the complicated translation that ORM requires. So why doesn’t everybody just switch to OODBMS if they are so much superior? Well, first, some applications do not really use multi-dimensional data, and others have (as already mentioned) been running well for decades – and no one should ever touch a running system. Secondly, not everybody uses OO languages. A third reason is that, because traditional DBMS have been around for so long, there are literally hundreds of tools available to deal with everything in standard relational format – proofing, storing, or backing up data, to name just a few – and almost every program that deals with data in some format can at least export it into some relational format, CSV for example.
So what about filesystems, why did I mention them in the subject? Good question.
What do FS do, and what are they designed for, compared to DBMS? First of all, they follow the structure of a tree – see? multi-dimensional data – and, with the help of links, even of directed (and cyclic) graphs, by letting you store data under any n-dimensional prefix – the path. FS are meant to allow fast access to large, unstructured data without imposing any artificial schema on it. DBMS, on the other hand, were built to allow fast, concurrent access to small, structured data and efficient searching. For that they use transactions, to make sure that every user only ever sees a complete picture of the data (never something intermediate), as well as indices for fast searching – hence their need for as much meta-data as they can get.
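The “complete picture” guarantee of transactions can be sketched in a few lines with stdlib sqlite3 (a deliberately tiny example, not a claim about any particular DBMS’s internals): a transaction either commits entirely or rolls back entirely, so a crash mid-way leaves no half-done transfer behind.

```python
import sqlite3

# Two accounts; we attempt a transfer that "crashes" between the two writes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # the with-block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE account SET balance = balance - 100 "
                     "WHERE name = 'alice'")
        raise RuntimeError("crash between the two writes")
        conn.execute("UPDATE account SET balance = balance + 100 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass

# The partial debit was rolled back; no reader ever saw an intermediate state.
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'alice': 100, 'bob': 0}
```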
FS keep a limited amount of meta-data about their data – things like creation date, owner and ACLs – whereas DBMS require a lot more meta-data to ensure data integrity, concurrent access and more.
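You can see just how small that per-file meta-data is from a quick stdlib sketch: a handful of fixed attributes, with no notion of what the bytes inside the file mean.

```python
import os, stat, tempfile

# Create a throwaway file with 5 bytes of content.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

info = os.stat(path)
print(info.st_size)                # 5 -- size in bytes
print(stat.S_ISREG(info.st_mode))  # True -- it's a regular file
print(info.st_mtime > 0)           # True -- modification timestamp exists
# Owner (st_uid), permissions (st_mode) and a few timestamps -- that is
# essentially everything a classic FS knows, versus a DBMS's full schema.
os.remove(path)
```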
Thanks to SSDs and lots of cache, FS have become faster and faster and can now almost rival DBMS in random access time, even though DBMS keep their indices in RAM for faster access.
So, to compare and summarize: the only things FS are missing are transactions and a better way to get complete meta-data about their data – i.e. parsers for all file types – so that they could build complete indices and allow fast searching by more than the traditional file attributes.
With transaction support now in Windows’ NTFS (Transactional NTFS), as well as the Spotlight indexer in OS X’s HFS+ filesystem, which parses most file types and builds a complete index over all available meta-data (as well as most content), this difference is rapidly shrinking.
So what do you think, are filesystems the DBMS’s of the future?