The Conservation of Hairiness
Recently, or perhaps for as long as there have been technical decisions to be made, I've been hearing endless arguments about this versus that which lead nowhere at all. Emacs vs Vi, PL X vs PL Y, NoSQL vs RDBMS, Django ORM vs SQLAlchemy, etc. I hope I can shed some light on what's fundamentally going on here. I haven't blogged in a long time. This is the first of what I hope will be a long line of semi-regular posts documenting my thoughts, philosophy and learnings from programming, and anything barely related.
Back when I was a grad student at Tufts University, I attended a guest lecture by the then Project Darkstar tech lead Jim Waldo (now Harvard's CTO). At the end of the lecture, when someone asked him why he had made the technical decisions he did in Jini, he introduced the term "The Conservation of Hairiness". Lightning bolts struck. He didn't spend much time explaining what he meant, but here's how I came to understand the term later (adjusted by my own interpretation, so please don't attribute all of it to Dr. Waldo):
For any given problem, there's usually more than one subproblem
Suppose you are to design a templating library. There are a number of smaller problems you need to solve: you need to load the template files from somewhere, parse them, let users do some manipulation to fill in the contents, and return the results in some representation. You can obviously break these subproblems into sub-subproblems. For example, you may want to be able to load the templates from the file system, or from memory. You get my point. I will postulate that any given problem that takes some input and produces some output has at least one subproblem: either the problem itself, or a set of smaller problems.
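To make the decomposition concrete, here is a minimal sketch of the templating example with one function per subproblem. All of the names (load_from_memory, load_from_file, parse, fill, render) are made up for illustration, and the parsing step simply borrows string.Template from the Python standard library rather than implementing a real template grammar.

```python
import string
from typing import Callable, Mapping

# Subproblem 1: load the template text from somewhere.
# Two interchangeable loaders (sub-subproblems): memory and file system.
def load_from_memory(store: Mapping[str, str]) -> Callable[[str], str]:
    def load(name: str) -> str:
        return store[name]
    return load

def load_from_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

# Subproblem 2: parse the text into something that can be filled in.
def parse(source: str) -> string.Template:
    return string.Template(source)

# Subproblem 3: fill in the contents and return a representation (a str here).
def fill(template: string.Template, values: Mapping[str, str]) -> str:
    return template.substitute(values)

# The whole problem is the composition of the subproblem solutions.
def render(load: Callable[[str], str], name: str, values: Mapping[str, str]) -> str:
    return fill(parse(load(name)), values)

templates = {"greeting": "Hello, $name!"}
result = render(load_from_memory(templates), "greeting", {"name": "world"})
print(result)  # Hello, world!
```

Swapping load_from_memory(templates) for load_from_file changes where templates come from without touching the parsing or filling steps.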
The solution of a problem is the sum of the solutions of its subproblems
Naturally, if you manage to solve all the subproblems, you have a complete solution to the problem as a whole. I assume you have heard of the Single Responsibility Principle, "Do one thing only and do it well", and the Rule of Composition, so I'm not going to dwell on this. What I am going to tell you, though, is that there are solutions out there that attempt to solve more than one problem at the same time, and there are solutions out there where the solutions for the subproblems are pretty much the same, but arranged in different orders.
If you have a different set of subproblems, you have a different problem, and it requires a different solution
This is probably the most subtle point in the entire post, so I'm going to explain a bit more. Let's say you are to build your startup company's website. You have limited time and money, and you don't really know the constraints, so you should probably go with the most popular technologies out there, right? Wrong! The correct approach is always going to be to understand the problem first, even if you have to guesstimate some numbers. How many users are going to be visiting your site? What's an acceptable response time? Does your data have a schema or is it free-form? Is the site going to have lots of dynamic interactions, or is it mostly static with some dynamic content? These are all proper subproblems. Once you understand the constraints, you can go out and choose a technology stack that enables that solution.
If your requirement is simply to serve 1000 requests at a time with a response time under 3 seconds, with a predefined data model and pages that are mostly static with some dynamic content, then spending days or weeks evaluating Django vs Pyramid or Django ORM vs SQLAlchemy is probably going to make very little difference. They all solve the same problem, with slightly different arrangements of complexity spread across basically the same layers.
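As a rough illustration of turning such a requirement into a number, here is a back-of-the-envelope sketch. It assumes that "1000 requests at a time" means 1000 requests in flight at once, and it uses Little's law (requests in flight = throughput × time per request), which only describes steady-state averages. The variable names are made up for this example.

```python
# Assumed reading of the requirement: 1000 requests in flight at once,
# each allowed to take up to 3 seconds.
concurrent_requests = 1000
max_response_seconds = 3.0

# Little's law rearranged: throughput = in-flight requests / time per request.
# If every request used the full 3 seconds, the stack would still have to
# complete about this many requests per second to keep up.
required_throughput = concurrent_requests / max_response_seconds

print(f"{required_throughput:.0f} requests/second")  # 333 requests/second
```

A figure like this is a constraint you can hold any candidate stack against, which is more useful than comparing the candidates with each other in the abstract.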
However, if your data is free-form, using a schema-less datastore is probably a more attractive solution. Picking a framework that will enable you to use a different data access layer easily is probably a better idea. In this case, you may opt for Flask + MongoDB. By the same token, evaluating whether CherryPy + MongoDB or Bottle + Riak is better is going to make very little difference unless you have more nuanced requirements.
If you don't have such requirements, or don't know whether you do, it's probably better to choose a stack that doesn't prevent you from solving other subproblems. Fortunately, most tools out there, especially those that adhere to the "Do one thing only and do it well" principle, fit this description.
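One way to picture a stack that "doesn't prevent you from solving other subproblems" is application code that depends only on a small data access interface. The sketch below is hypothetical throughout: DocumentStore, InMemoryStore, RequiredFieldsStore and register_user are invented names, and neither store talks to a real database. They only stand in for a schema-less backend and a schema-enforcing one.

```python
from typing import Any, Dict, Protocol, Set

class DocumentStore(Protocol):
    """Hypothetical data access interface the application depends on."""
    def save(self, key: str, doc: Dict[str, Any]) -> None: ...
    def load(self, key: str) -> Dict[str, Any]: ...

class InMemoryStore:
    """Stand-in for a schema-less store: accepts any document shape."""
    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def save(self, key: str, doc: Dict[str, Any]) -> None:
        self._docs[key] = dict(doc)

    def load(self, key: str) -> Dict[str, Any]:
        return dict(self._docs[key])

class RequiredFieldsStore:
    """Stand-in for a schema-enforcing store: rejects incomplete documents."""
    def __init__(self, inner: DocumentStore, required: Set[str]) -> None:
        self._inner = inner
        self._required = required

    def save(self, key: str, doc: Dict[str, Any]) -> None:
        missing = self._required - doc.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self._inner.save(key, doc)

    def load(self, key: str) -> Dict[str, Any]:
        return self._inner.load(key)

# Application code sees only the interface, so the store can be swapped later.
def register_user(store: DocumentStore, user_id: str, profile: Dict[str, Any]) -> Dict[str, Any]:
    store.save(user_id, profile)
    return store.load(user_id)

free_form = InMemoryStore()
saved = register_user(free_form, "u1", {"name": "Ada", "likes": ["tea"]})
print(saved["name"])  # Ada
```

If the data later turns out to need a fixed shape, wrapping the same store in RequiredFieldsStore changes the behaviour without changing register_user.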
The complexity of any given problem is a constant
I would argue that how you choose your technology stack depends largely on how well you understand the problem at hand. In reality, most of us most of the time simply don't know enough about the problem to completely enumerate all the subproblems, so sitting here arguing over the same nuanced points day after day about two different tools that pretty much solve the same problems with different tradeoffs is going to make very little difference. Lastly, you still have to take your comfort zone into account. If you've found the perfect solution but it requires you to spend months learning the pieces, is it worth it?
This post has gone on long enough. I hope it gives you some perspective on how to pick the right tool for the right job. Now please feel free to comment!