Several people have asked online lately, “Can you really write production code in F# and mono?”, “How are you doing architecture now that you’ve switched completely to F#”, and “Aren’t you doing some kind of weird thing now that doesn’t work with interactive websites?”
I usually don’t do coding blogs, because, frankly, after a while it all gets pretty boring. Same old bytes, just new bells and whistles on top. But my approach to architecture has changed in a major way over the last five years. Might as well document it.
Description of Problem Domain. For our sample project, let’s take Newspaper23. It’s a production system with a small number of users that’s been running for a few years and has evolved through my change in philosophy.
Newspaper 23 exists for one reason: I have a very short attention span. Websites are always trying to get me to hang around, click on related stories, sign-up for things, and so on. I find this very distracting. It’s very easy for me to spend a lot of time online that’s unproductive. So a few years ago I had an idea: since the news I consume is all from a dozen or two sites, why don’t I just harvest the titles and links from those sites and put them in a big list? Then I’m choosing what content to consume based on the site brand and the title of the article, not social cues or any other nonsense.
So I got out my credit card, went over to Amazon AWS, and fired up the site.
The Old, Old Way of doing things. The first time through, I did this in a “classic” manner. I sketched out a domain model, I elaborated a bit on a database model, I built a database model in MySQL, and I created a new C# app in Visual Studio and wrote a program to do everything that was needed — get some links, save the data about them, create a page with the links, do some “other stuff”.
The problem was, I was never really clear what the program should do. Should I collect the votes the article had from reddit? Don’t know, so I’ll throw that in there and save it. Might need it later. Should I provide the ability to comment, or vote? Don’t know. I’ll add space — after all, space is cheap — and if I need it, it’s there. Should there be some kind of authentication? Beats me. (Decided to skip that one)
Don’t get me wrong; none of this was a heavyweight or Big Design Up Front process. I’m only talking about an hour or two of doodling around with things. And the beauty of modern tools is that it really doesn’t matter whether your table has 5 fields or 10. Unless you’re writing the code to do something with the data, it’s just “out there”.
It did make the “link sucking” process a little more complicated, because with different sites I had to get different pieces of data. Fair enough.
What happened. After a couple of weeks of off-and-on coding, I got the program up and running, along with a unique system for identifying data on a web page to harvest. (It was a recursive RegEx language. Please, no “but you can’t use Regex on html!” It looked good.) The code base consisted of a data tier, some controller logic on top, and it seemed to work fine.
For a while.
But over a period of just a few days or weeks, it seemed like there were always these weird edge cases, both in the sites, and more disturbingly, in the tool version, setup, and configuration. Site X would change their format, and I’d be in the code changing the way I received links. Yes, I had hard-coded in the parameters for each site, but how difficult was it to change some hard-coded values once or twice a year? Except the more source code there was, and the more I touched the code, the more errors I got. And it seemed like for one reason or the other with this way of programming there was a boatload of code to wade through and I was always horsing around with it. [insert discussion here about TDD]
By using a MVC paradigm, I had written one piece of software that I could easily lay a bunch of different controllers down on. In fact, as we all know, I could develop a “controller library” over time that would take the data and do all sorts of wondrous and amazing things. That’s the beauty of a good model — it facilitates many different solutions.
I still think this might be the way to go, assuming 1) you use good testing practices, and 2) you’re going to constantly be under the hood in the code for a long time. But it introduced all kinds of problems related to having everything pass through one executable and code base. Did I really want to horse around with the code that grabbed links, for instance, simply because I was fixing the code that wrote the web page? Why was I putting all of my eggs in one basket? OOP is a great way of doing things, but it introduces a freaking huge amount of hidden interdependencies. Some times I wonder if 90% of the problems development teams have with programming is based on the fact that in the developer’s mind there is separation of model and controller, but that in the wiring itself, things can be all hooked together in completely insane ways, usually for what seemed like good reasons at the time. The black box of OOP has a lot of very sharp edges.
The Old Way of doing things. The first thing I did was ditch C#. It’s a great language, a better Java than Java, but it was the wrong paradigm for me. The second thing I did was start to build the app from the ground-up. Programming functionally required me to write functions that did things. This sounds blazingly obvious, but it’s actually a much different way of looking at things from creating a model and adding controllers on top. I still used the MySQL datastore — had to put stuff somewhere — but my code was a LOT smaller.
It also needed much less tweaking. Remember: this is personal programming. Instead of sitting on top of a model that could do many things and having to choose what to implement, which might cover calls all over the place, my code did a few things, and there was a clear path to the data for the things it did.
What happened. There was still that problem with everything running through one spot. If I wanted to tweak anything, anywhere, I’m opening up the entire codebase to make a change.
Interestingly, since this was revision 2, the code ran for much longer without needing tweaking. This meant that when I did get around to tweaking it, I had forgotten the build setup! So I’d have a 2-line change that would require screwing around with re-remembering the build-deploy environment. That might take several hours. Plus the sites were still hard-coded. The output was hard-coded. As much as I loved data-driven programming, why was everything so coupled?
Along this time I built HN Books, a social book site for hackers. It was all static. I found I could do a lot with static web pages and JSON files. Much more than most web programmers would believe.
Hmm. Fuctional programming. Static pages. Hmmm.
The way I do things now.
So now I have a new philosophy:
- Third time through, I decided I write small programs. As small as I can. Instead of one big honking system, my system has four programs: GetLinks, UpdateArticleLibrary, CreateSelectedArticleList, and FormatSelectedArticleList. There’s a bonus utility, CheckXPath, which lets me check out XPath strings against websites to make sure they work (in case the site layout changes). I use as few external libraries, tools, or frameworks as possible. Over the years, I’ve found that every tool I picked up had 40 tons of features, of which I only needed 3. That meant that whenever I needed to do something simple that I had never done before, I had to wade through all sorts of forums, hearing about all sorts of arcane switches and other stuff. Screw that. I do not need to buy and operate a nuclear attack sub to go fishing from the pier. So — no Visual Studio, no MySQL, no complex dependencies at all. (Having said that, I actually dumped my custom link-sucking system and went with HtmlAgilityPack. I mean heck, once it’s installed, it’s just XPath. Time spent learning XPath is not as completely sunk as time spent learning WhizBang 7). Simplify.
- My files read and write data on the local storage. I’m not doing transactions, therefore I don’t need a transactional datastore. It’s functional coding. Things come in, things go out. The O/S already has a wonderful tool for storing things. It’s called a file system, and unless I’m going to be moving around tens of thousands of these things, it’ll work just fine, thank you. Simplify.
- I use the O/S to schedule when the programs run and to move things around. Instead of having one program that runs once a day, split the work up into a “pipeline” and have the O/S manage the pipeline. It’s already good at managing processes — it has all kinds of tools for it. Use them. Simplify.
- I don’t “interact”. I process things at certain times. In the old days, I’d have an instance of my program spun up in Apache, waiting around in fastCGI for a client to come and connect. Then the program would do all sorts of things depending on which client it was, what the request was, and so on.
After many, many years of coding like this, I had to ask myself: why? Why the hell am I doing it this way? 9 times out of ten I’m delivering the same content. 9 times out of ten the client is just reading stuff. 9 times out of 10 I’m creating database connections, cursors, and all sorts of other cruft — just to send the same stuff back down the wire as I sent 2 seconds ago. For the huge majority of the use cases I can imagine, even for interactive sites, there’s no difference at all to the user between a program that sits waiting for connections and a series of programs that updates data every few seconds. This is stupid. Don’t do this any more. Decouple. Simplify.
- I write simple queries to monitor how things are moving through the pipeline. Quite frankly, the system is running so well I don’t need to monitor it, but if I did, I’d simply monitor how things flow through the pipeline. I could even make a nice html page with graphs. Don’t need to, but it’s an easy option. In fact, I’d argue that the only things interesting to me as a owner/maintainer would be things visible at a system level, not a programming level. Huge win here: no programming skill required to look at CLR innards, just O/S skills. Simplify.
- Nothing exists as hard-coded data. It’s all either config files or command-line parameters. Right now as I bring up the page, I can see that Hacker News isn’t returning any data. If I wanted, I could probably figure out what the problem was and fix it (assuming it was fixable on my end) in about 10 minutes. No programming required. All I need is a shell. I AM re-coding the system, kinda. I decided that since I have 10-15 functions that are the same across each executable, it would make sense to create one Visual Studio solution and share a couple of library source files. Welcome back to the days of shared C header files! I’m also making the logging more robust. Right now I log everything. (I don’t read the logs, but they are there.) This is using up too much disc space. Every few months I have to clean it out. So a more fine-tuned logging system would be nice. Maybe. Maybe not. Simplify.
- TDD? TDD? We don’t need no stinking TDD. With purely functional programming, there are only 3 mistakes I can make: Failure to break down transforms into understandable atomic units, failure to describe the transform correctly, and failure to validate the data. If I do all three of those correctly? There’s nothing to test. It’s like trying to test an SQL select statement. The idea doesn’t make sense. Simplify.
It’s very nice. Here’s the main function for the first program in the chain, GetLinks, that gets links from sites (duh):
This is 50 lines of code. But it’s really only 4, lines 204-207. It gets the configs passed in. It creates a new container to hold the output data, it processes the link, and then it saves the data (which happens in the ripLink function. Should be down in this one. Ugh.) The rest of it is logging and help system stuff.
I could show you ripLinks, but it’s the same deal: 50 lines of code which are really about 10. It gets the links from the page using the “ripLinksOnAPage” function (clever naming, eh?), then it processes the links according to the config file.
Let’s look at how it breaks up.
As you can see, there are really only 5 functions, ripLinks, http, loadHtml, recBuildLinksUp, writeLinksOut, and ripLinksOnAPage. The rest is either library calls or a few small helper functions. Take out the logging and some of the other broilerplate, and there’s maybe 50 lines of “real code” here.
We’ve done something simple, easily describable, and concrete. We took stuff from the file system, read a webpage, wrote stuff to the file system. Logged as we went along. That’s it. No need to solve world hunger. It has value.
This code has been running for more than a year with no modifications. I expect it to continue running forever. I’m done. Isn’t that nice?
It’s re-inventing the wheel. In most cases, I don’t need a wheel. I need a lug nut. Yes, I know what the wheel looks like, but the cognitive load of buying a big stack of wheels and carrying them around when I just need a couple of lug nuts? Tell you what. If I start thinking about round things to fit on cars, I’ll get a wheel. Otherwise I’m fine.
It’ll never scale. The funny part about this objection is the opposite is true: the more complex your stack, the more difficult and nuanced it is to scale. I can take this app and scale it out as far as I like. Heck, copy it to a CDN. We’re done. And for those of you thinking interactivity, think long and hard whether you need immediate feedback for the user or just something that changes every few seconds or once a minute. You could be spending a huge number of processor cycles worrying about concurrent clients when you really don’t need it.
It’s poorly-thought-out. My datastore? Line-delimited text files. Code complexity? Nothing more than a few hundred lines of code. Is there anything it can’t do? Not really. But (and this would be the objection I would have made several years ago) what if you want it to do something that required the data to change? Wouldn’t you have to re-jigger every piece of code in the pipeline?
Not really. First, because I’m using name-value pairs, I can always add data and not use it. So really, what would happen if I wanted some cool new feature would be 1) I’d change the data representation where it was created, 2) I’d add the code to store it, and 3) I’d add the code to use it. If you’ve worked in a functional environment, this is the way you make changes anyway. Nothing new here.
It’s not as cool as X. Probably not.
The deployment process is brittle. Actually, although I’m not continuously deploying what I write — there’s no need to automate it when it’s just me — I’m integrating DevOps directly into the solution. There’s no one part of this solution that’s “programming” and another part that’s “deployment”. It’s all integrated together. Instead of good coupling and cohesion at the function level, I have good coupling and cohesion at both the function and the executable level. Very cool stuff.
Continuing to add features. The very next thing I’m going to do is add/adjust the logging. Having to clean out the logs once every 3 or 4 months is a chore. Next up in my quest to eliminate distractions, I’ll probably go to the target site and rip the plain text and store it here. I’ve thought about adding voting and commenting, but it’s a personal site.
None of this will require a major change or a re-think of how the architecture works. Mostly the system just works and I don’t have to mess with it. O/S updates handle updating security and scaling problems. I worked a bit to make the solution, and now the solution works for me. And isn’t that the entire idea?If you've read this far and you're interested in Agile, you should take my No-frills Agile Tune-up Email Course, and follow me on Twitter.