Why ORM Divides Us

The old ORM chestnut is back and we're seeing the usual mixture of defenders and aggressors. But why is there such a divide?

The argument is fundamentally about choosing what's in charge of your system, the system being composed of your databases, your applications, and your supporting infrastructure (your scripts, your migrations, etc.). To relational database folk such as myself, the central authority is the database, and our principal interests are what ACID exists to provide: concurrent, isolated, atomic transactions that cannot be lost, on top of a well-defined schema with strong data validity guarantees. To us, what's most important is the data, so everything else must serve that end: the data must always be valid, meaningful, and flexible to query.

The side that argues for ORM has chosen the application, the codebase, to be in charge. The central authority is the code because all the data must ultimately enter or exit through the code, and the code has more flexible abstractions and better reuse characteristics.

It comes down to a disagreement about what a database is for. To the OO programmer, strong validation is part of the behavior of the objects in a system: the objects are data and behavior, so they should know what makes them valid. So the OO perspective is that the objects are reality and the database is just the persistence mechanism. It doesn't matter much to the programmer how the data is stored, only that it is stored, and it just happens that nowadays we use relational databases. This is the perspective that sees SQL as an annoying middle layer between the storage and the objects.
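As a rough illustration of that view, here is a minimal sketch in which the object itself carries the validity rules. The `Account` class, its fields, and its rules are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical domain object that owns its own validity rules."""

    owner: str
    balance_cents: int

    def __post_init__(self) -> None:
        # The object, not the storage layer, decides what "valid" means.
        if not self.owner:
            raise ValueError("owner must not be empty")
        if self.balance_cents < 0:
            raise ValueError("balance must not be negative")

    def withdraw(self, amount_cents: int) -> None:
        # Behavior and validation live together on the object.
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if amount_cents > self.balance_cents:
            raise ValueError("insufficient funds")
        self.balance_cents -= amount_cents
```

Nothing here says how or where an `Account` is stored. From this perspective that is the point: persistence is a detail handled elsewhere.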

To the relational database person, the database is what is real, and the objects are mostly irrelevant. We want the database to enforce validity because there will always be tools outside the OO library that need to access the database and we don't want those tools to screw up the data. To us, screwing up the data is far worse than making development a little less convenient. We see SQL not primarily as a transport between the reality of the code and some kind of meaningless storage mechanism, but rather as a general purpose data restructuring tool. Almost any page on most websites can be generated with just a small handful of queries if you know how to write them to properly filter, summarize and restructure the data. We see SQL as a tremendously powerful tool for everyday tasks, not as a burdensome way of inserting and retrieving records, and not as some kind of vehicle reserved for performance optimization.
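A small sketch of both halves of that argument, using Python's standard `sqlite3` module and an in-memory database. The `orders` table, its columns, and the sample rows are hypothetical. The constraint lives in the schema, so a client that bypasses the application's objects is still stopped, and a single query does the summarizing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema itself states what valid data looks like (hypothetical table).
conn.execute(
    """
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer    TEXT    NOT NULL,
        total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
    [("alice", 1200), ("alice", 800), ("bob", 500)],
)

# Any client, such as an ad hoc script, hits the same constraint.
try:
    conn.execute(
        "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
        ("carol", -100),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# One query filters, summarizes and restructures the data for a page.
summary = conn.execute(
    """
    SELECT customer, COUNT(*) AS order_count, SUM(total_cents) AS spent_cents
    FROM orders
    GROUP BY customer
    ORDER BY spent_cents DESC
    """
).fetchall()
```

The same constraint and the same query are available to a reporting script in any language that can talk to the database, which is the property the relational side cares about.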

At the end of the day, we need both perspectives. If the code is tedious and unpleasant to write, it won't be written correctly. The code must be written; the database absolutely should not be running a web server and servicing clients directly. OOP is still the dominant programming methodology, and for good reasons, even though data encapsulation stands at odds with proper database design. Yet people who ignore data validity are eventually bitten by consistency problems. OODBs have failed to take off for a variety of reasons, but one that can't be easily discounted is that they are almost always tied to one or two languages, which makes it very hard to do the kind of scripting and reporting that invariably crops up with long-lived data. What starts out as application-specific data almost invariably becomes central to the organization, with many clients written in many different languages and with many different frameworks.

ORM is destined to be hated, because the people who love databases don't love the practical albeit leaky abstraction offered by ORM, and people who hate databases will resent how much effort the better ones require to be used properly.

Some thought experiments

Consider this hypothetical scenario. You run Facebook, and you have all the software and all the data. A catastrophe occurs and you lose everything, and due to a mistake in the way backups were made, you can choose to restore the software or the data, but not both. (Somehow, restoring the software will destroy the data, and restoring the data will destroy the software.) Which one do you choose?

Of course, this scenario is unlikely, but it should serve to demonstrate that to the rest of the organization (the non-programmers), the code is secondary to the data it manages. You can rewrite the code, but you can't always recreate the data. When you look at our database backup, migration and refactoring tools and compare them to what we have for source code management, it's clear that we spend more time worrying about the code than the data. That's not inappropriate (we work on code all day) but it can lead to a myopic view of the importance of the data.

Another thought experiment, posed by jshen on HN, points out that data validity is secondary to monetization, and that if the business finds a way to increase monetization while causing a low rate of data corruption, it may be worth sacrificing validity. This is a fair point, and I think it illustrates why NoSQL is a winning proposition for many companies. If scalability and speed are more valuable than strict validity, then NoSQL can be the better choice, especially if the data is cheap or can be recreated without too much trouble (or on demand).

(This article was derived from a comment I left on HN.)