Classifying Your Books

If you're like me, you're always in search of new ways to fetishize your books. I only recently started keeping track of my books at all: a few too many of my expensive computer science tomes have walked away without my really knowing who has them. This has to stop, but solving this problem is boring and easy. You just go install Delicious Library or Bookpedia or something.

If we want to make the task harder, and thus, more fun, we have to go further than that. We have to assign books call numbers. This is the only way to convert your freeloading friends into legitimate library patrons.

What Librarians Do

What do librarians do? I never really asked myself this question. I assumed they had things to do other than “shh!”ing people, but I never really probed further than that. It turns out one of their major problems is classifying the books. Perhaps you thought, as I did, that every book has a magic code on the inside flap telling you where it should be placed! Well, many do, but many do not, and aside from that, this just gets you the class of the book, not whatever additional information you might want to incorporate into the call number.

You may be surprised to discover, as I was, that every library (potentially!) uses different call numbers to identify books. The call number usually incorporates some auxiliary information about the book, like the author, title, or year of publication. There are three purposes for the call number: to make it possible to uniquely identify a book, to give your patrons a way to look for a book, and to give you a unique relative placement for the book on the shelves. The last of these implies sortability.
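To make the sortability point concrete, here is a minimal Python sketch of one way a home library might build a call number from a class, an author, and a year. The CallNumber type, its fields, the three-letter author mark, and the sample Dewey-style class numbers are all assumptions for illustration, not any real library's scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallNumber:
    """Hypothetical call number: class + author mark + year."""
    class_number: str  # assumed zero-padded, e.g. "005.133"
    author: str        # author's surname
    year: int          # year of publication

    def label(self) -> str:
        """Text to print on the spine label."""
        return f"{self.class_number} {self.author[:3].upper()} {self.year}"

    def sort_key(self) -> tuple:
        """Shelf order: class first, then author, then year."""
        return (self.class_number, self.author.lower(), self.year)


# Illustrative entries only; the class numbers are not looked up anywhere.
books = [
    CallNumber("005.133", "Stroustrup", 2013),
    CallNumber("005.1", "Knuth", 1997),
    CallNumber("005.133", "Kernighan", 1988),
]

shelf = sorted(books, key=CallNumber.sort_key)
labels = [book.label() for book in shelf]
for text in labels:
    print(text)
```

Comparing the class as a plain string only works here because the part before the decimal point is always padded to three digits; a scheme without that property would need a smarter key.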

Methods

There are four ways to classify books, and they each have strengths and weaknesses. The Library of Congress classification (LoC) has a pleasant hideous formality, but the tables are quite enormous and it has a tendency to scatter related things into different corners. Dewey (DDC) has that childish air to it, but can be delivered in a single abbreviated volume and produces short and tidy call numbers. Universal Decimal Classification (UDC) adds some syntactic horror and can be had in a real electronic format. It is, unfortunately, almost totally unheard of here in America. And then there is the Colon Classification, which deserves a special, lengthy introduction.

Colon Classification

The primary problem as perceived by LoC and DDC is: where does this book go on the shelves? And that question raises design concerns: one wants books to be together with other books on the same topic rather than sorted by color or author's favorite animal. So both spend most of their effort trying to figure out what the real subject of the book is, so they can produce a call number that puts things together helpfully.

This isn't the only perspective though. Colon instead seems to be saying: let's create a precise, formal statement about the subject of a book. That statement can then be encoded into a terse syntax, and we'll use that syntax as the class portion of the call number.

All of the work done in other systems to produce the total classification is just part of the “personality” facet of the book under CC. A further four facets can be used to narrow the book down by time, space, noun and verb (called “matter” and “energy” but I prefer my words). Then, each class or subclass can define its own facets. History, for instance, uses additional time and space facets to discuss the subject of the book, freeing up the other two for the origin of the book itself. Literature makes the author a facet. And then there are a great number of additional “devices” that can be used to go further: books can be promoted to “classic” status and made into their own category, and anteriorizing and posteriorizing facets can describe the structure of the book. The language and form, whether it is a commentary on some other book, etc. can all be encoded. Further, the “personality” can be combined with other personalities with special codes to indicate, for instance, that the book contains both, that the book is in one subject biased for people who know another, or that the book compares or discusses the similarities or differences between two subjects. Facets can be connected with each other. And this whole process can be nested in a recursive fashion, leading to terms like “second round energy.”

While shockingly powerful as a way of encoding a statement about a book, it suffers greatly from its huge complexity. The sorting rules are not at all intuitive: some sections are to be treated as integers, others as decimals. There is an “octavizing digit” that changes the sorting order, so you count 1, 2, 3, ... 8, 91, 92, 93, ... 991, 992, etc., and each class of character has its own (so, z for lower-case alphabetic characters). Additionally, much complexity is spent trying to save a character here or there: instead of writing 1950, you would write N5, or N54 for 1954. It's not clear to me that the syntax would be all that parseable to a lay person after any amount of use.
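The counting sequence is easier to see in code. This is a minimal Python sketch of just the numeric sequence described above (1 to 8, then 91 to 98, then 991 and so on), treating 9 as a pure "keep going" marker. The function names octave_encode and octave_decode are made up for this example, and the sketch ignores letters and everything else in the real rules.

```python
def octave_encode(n: int) -> str:
    """Write the n-th position (1-based) using digits 1-8 with 9 as extender."""
    if n < 1:
        raise ValueError("positions start at 1")
    nines, digit = divmod(n - 1, 8)
    return "9" * nines + str(digit + 1)


def octave_decode(text: str) -> int:
    """Recover the 1-based position from a string such as '8', '91' or '991'."""
    nines = len(text) - len(text.lstrip("9"))
    rest = text[nines:]
    if len(rest) != 1 or rest not in "12345678":
        raise ValueError(f"not a simple octave number: {text!r}")
    return nines * 8 + int(rest)


# First nineteen positions: 1..8, 91..98, 991..993
sequence = [octave_encode(n) for n in range(1, 20)]
print(" ".join(sequence))
```

For this particular sequence, plain string comparison already yields the intended order, which seems to be the point of the device: the sequence can be extended without disturbing a decimal-style sort.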

APL

The whole thing reminds me a lot of APL. Reading the book A Programming Language is a fairly uncomfortable affair because of the strange order and missing information. One gets the sense that the language doesn't really make a distinction between syntax and semantics. It jumps around strangely from low-level to high-level concerns and back. It's kind of a poor spec.

Colon Classification, 6th Edition is a lot better, but still has that sort of feel to it. One of the early definitions specifies the character set for the call number as containing “some Greek letters.” Which ones? Read two more pages, and you'll see them listed in their sort order. Similarly, what appears in what order in the call number? It says the collection number can go “above” the class and the book number can go “below” the class when printed vertically (two different rules) but for the normal left-to-right direction nothing is said explicitly. I infer that it would go collection-class-book from the order it would appear vertically.

In other ways, it reminds me of SGML, prior to XML. The “chronological device” that saves you a whole digit or two, for instance, is very reminiscent of not needing closing tags. With a class number like L,45;421:6;253:f.44'N5, how much worse, really, is L,45;421:6;253:f.44'1950, which is only two characters longer?
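Undoing the abbreviation is mechanical, which is part of why it feels unnecessary. Here is a small Python sketch that expands the time facet of the class number above. It assumes only what the example shows: that N stands for the 1900s and that the digits after it are the leading digits of the year within that century. The CENTURY_START table and both function names are mine, and any other century letters would have to come from the actual schedules.

```python
# Only the letter supported by the example in the text; others are omitted.
CENTURY_START = {"N": 1900}


def expand_time_facet(code: str) -> str:
    """Turn an abbreviated time facet such as 'N5' into '1950'."""
    letter, digits = code[0], code[1:]
    if letter not in CENTURY_START:
        raise ValueError(f"unknown century letter: {letter!r}")
    if digits and (not digits.isdigit() or len(digits) > 2):
        raise ValueError(f"unexpected digits: {digits!r}")
    # Missing trailing digits are read as zeros: 'N5' -> 50, 'N' -> 00.
    offset = int(digits.ljust(2, "0")) if digits else 0
    return str(CENTURY_START[letter] + offset)


def spell_out_year(class_number: str) -> str:
    """Replace the part after the last apostrophe with a four-digit year."""
    head, sep, time_code = class_number.rpartition("'")
    if not sep:
        return class_number  # no time facet present
    return head + sep + expand_time_facet(time_code)


print(spell_out_year("L,45;421:6;253:f.44'N5"))
```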

Faceted Search

The benefit of CC is the facets. But on your bookshelf, you don't benefit much from the facets. Either your collection is so small, every category has one book, or your collection is so huge, only the first part matters anyway. The facets might help if you needed to do faceted search, but you're either looking for one of your own books, or you're doing an electronic search, where the facets don't need to be called out explicitly.

It would probably be beneficial for searching to have a comprehensive statement of the subject matter of a book. And in that case, having a complex of facets like CC provides would probably be helpful. But if you aren't using it as the call number, there's no need for the syntax to be so murdersome. Choosing a restricted vocabulary and a sensible metadata format would be enough; whatever you chose, if it has an electronic representation that can be parsed by a machine, you're done—there's no need to create horrors like the above. Even a basic full-text search on any of the words in the sentence "Research in the cure of the tuberculosis of lungs by x-ray conducted in India in 1950s" would get you meaningful results.
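As a rough idea of what "a restricted vocabulary and a sensible metadata format" could look like, here is a Python sketch that stores the tuberculosis example as a plain record and serializes it to JSON. The SubjectStatement type and its field names are hypothetical choices for this example, not CC's own terminology and not any standard metadata schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SubjectStatement:
    """Hypothetical machine-readable statement of a book's subject."""
    discipline: str
    topic: str
    problem: str
    action: str
    method: str
    form: str
    place: str
    period: str


record = SubjectStatement(
    discipline="medicine",
    topic="lungs",
    problem="tuberculosis",
    action="treatment",
    method="x-ray",
    form="research",
    place="India",
    period="1950s",
)


def matches(statement: SubjectStatement, **wanted: str) -> bool:
    """True when every requested field equals the stored value, ignoring case."""
    fields = asdict(statement)
    return all(
        fields.get(name, "").lower() == value.lower()
        for name, value in wanted.items()
    )


as_json = json.dumps(asdict(record), indent=2)
print(as_json)
print(matches(record, place="india", problem="Tuberculosis"))
```

Nothing here needs a special syntax: the facets are just named fields, and a search is a comparison on whichever fields you care about.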

Planes

Ranganathan structured CC with a separate “language” plane for the human expression and a “number” plane for constructing the call number. I'm not sure his syntax is worth saving (Medicine,Lungs;Tuberculosis:Treatment;X-ray:Research.India'1950) over something else, but it raises the question (for me) of whether it would be meaningful or helpful to use CC to develop statements about the subject of books, even if one were using DDC to file them on the shelf.
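If one did keep the language-plane syntax, pulling it apart is not hard. The Python sketch below splits the statement above on its punctuation and labels each piece. The mapping from punctuation to facet name (comma for personality, semicolon for matter, colon for energy, full stop for space, apostrophe for time) is my reading of the connecting symbols and should be treated as an assumption; split_facets is a made-up helper, and it would not cope with number-plane strings where a full stop can appear inside a facet.

```python
import re

# Assumed mapping of connecting symbols to facet names.
FACET_NAMES = {
    ",": "personality",
    ";": "matter",
    ":": "energy",
    ".": "space",
    "'": "time",
}


def split_facets(statement: str) -> list[tuple[str, str]]:
    """Split a language-plane statement into (facet name, term) pairs."""
    # The capturing group keeps each symbol in the result, so the list
    # alternates: term, symbol, term, symbol, ..., term.
    parts = re.split(r"([,;:.'])", statement)
    facets = [("main class", parts[0])]
    for symbol, term in zip(parts[1::2], parts[2::2]):
        facets.append((FACET_NAMES[symbol], term))
    return facets


statement = "Medicine,Lungs;Tuberculosis:Treatment;X-ray:Research.India'1950"
facets = split_facets(statement)
for name, term in facets:
    print(f"{name:12} {term}")
```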

Improvements?

The detour through CC was intellectually refreshing. The system is quite interesting, but ultimately it seems more like two half-solutions to two problems than a complete solution to either the metadata problem or the call number problem. I could see someone else applying the system, particularly if they have a large collection of Indian religious texts, as that's the only fully elaborated example in the book. As a computer science-y guy with a fondness for formality and a large collection of CS-related books, I'm not convinced I'd be able to implement the system faithfully anyway. I think I'd wind up having to create so much from scratch (and fail to properly apply so many of the many and complex rules) that I'd be no better off than if I had come up with something myself out of whole cloth. For instance, these ideas occurred to me while reading Colon Classification, 6th Edition:

  • Maybe I should co-opt another Greek letter for the Main Class for computer science. Perhaps Λ?
  • I could devise a real collation around an actual character encoding (such as Unicode) if only I…
    • removed the Greek letters or sorted them to the end instead of “after the nearest similar Latin letter”
    • removed the “octave device”
  • I should get rid of the “chronological device,” because it doesn't save enough
  • I should write a formal grammar for the syntax of this thing

And so forth.

Conclusion

Ultimately, I have decided to go forward with Dewey Decimal Classification. The abbreviated guide, one edition out of date, can be had for $15 on Amazon. For a small collection like mine, this will be totally sufficient to the task of putting my books in order. DDC is also widely used in sub-collegiate libraries. My wife won't find it nearly as objectionable, and my son may benefit from being exposed to it. Also, you can find code like mine to query OCLC to look up DDC classes for books, and many books come with a DDC classification printed in them as part of the CIP data.

I would still like to think about the metadata problem. CC is very interesting and powerful, and might provide some inspiration here, but I think ultimately other modern systems are probably just as powerful and without the drawbacks.

For a better-written take on the same problem, please see this article by David Mundie.