Thursday, November 15, 2007

Testable code: it's the structure, stupid!

The topic of writing testable code is a quagmire of opinions. See, for example, Learning From Sudoku Solvers, which gives the argument that Test-Driven Design is not always great... and it's a fantastic example.

I have found, however, that test-driven design helps to generate code that is not necessarily clearer, but better factored. I ran across a blog post from Phil Haack:

Writing Testable Code Is About Managing Complexity: "the real benefit of testable code is how it helps handle the software development’s biggest problem since time immemorial, managing complexity."

This post really cuts to the heart of why Test-Driven Design works, and why it can also utterly fail. Phil's main arguments are that generating tests gives you a canary in the coal mine for bugs, and that it indirectly helps manage complexity. The bug finding is absolutely true; it is the main attraction for most people.

But Phil doesn't quite get the second point right. Test-driven design absolutely helps to directly manage complexity. Why? The answer is factorization. Jeff uses the term "separating concerns" to talk about how writing tests helps to simplify code. That's a start, but not quite the whole story.

Unit tests work best on atoms of a program: a single algorithm that completes a specified task. Firefox is not an algorithm. The MySQL package is not an algorithm. GCC is not an algorithm. And none of them have a single, unified unit test. This is pretty logical, seeing that each program is built of parts.

But the extension here is that in order to write unit tests, you must understand what the parts of the program are. If a part of a program is very well unit-testable, then it is an atom, or a factor, or whatever you want to call it. It cannot be broken into any further multiple parts without destroying the semantics of what it does. This goes all the way back to undergraduate-level algorithms, using Hoare semantics: Precondition, Instructions, Postcondition. The test very simply has to set the precondition, call the function, and test the postcondition.
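As a minimal sketch of that pattern in Python (the `insertion_sort` function and the test name are illustrative assumptions, not taken from any particular project), a unit test can be read directly as a Hoare triple:

```python
def insertion_sort(items):
    """Return a new list with the items in ascending order."""
    result = []
    for item in items:
        # Walk back from the end to find where the item belongs.
        pos = len(result)
        while pos > 0 and result[pos - 1] > item:
            pos -= 1
        result.insert(pos, item)
    return result


def test_insertion_sort():
    # Precondition: an arbitrary, unsorted list of comparable items.
    data = [3, 1, 2, 1]
    # Instructions: call the unit under test.
    output = insertion_sort(data)
    # Postcondition: the output is ordered and is a permutation of the input.
    assert all(a <= b for a, b in zip(output, output[1:]))
    assert sorted(data) == sorted(output)
    # The input itself is left untouched.
    assert data == [3, 1, 2, 1]


test_insertion_sort()
```

The test has nothing to say about how the sort works internally; it only pins down the state before and after the call.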

Let's say you are writing a program to parse a record, see if it is in an existing data set (say, a database), and then insert it or update it to the known data set. How do you write unit tests? To figure out the unit tests to write, you must first understand the factors.

I trivially see three factors: testing the existence in the data set, the insert operation, and the update operation. The "upsert" operation, that is, the decision whether to insert or update and then doing the appropriate action, is not an atom in this program. It should be covered in an integration test.

Let's think about that for a minute: why should upsert be an integration test? Consider a trivial set of two tests:

1. Upsert a record which does not exist (the record should be inserted)
2. Upsert a record which already exists in the set (the record should be updated)

These tests can be broken apart without changing the semantics of the upsert. We know that the upsert operation follows two code paths, and we can run these two tests against those code paths to ensure their respective operation.
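Here is one way the split might look in Python. Everything in it is a hypothetical stand-in: `RecordStore` is an in-memory dictionary pretending to be the database, and the function and test names are made up for illustration.

```python
class RecordStore:
    """Hypothetical in-memory stand-in for the data set (say, a database table)."""

    def __init__(self):
        self._rows = {}

    def exists(self, key):
        # Factor 1: existence check.
        return key in self._rows

    def insert(self, key, value):
        # Factor 2: insert; refuses to overwrite.
        if key in self._rows:
            raise KeyError(f"{key!r} already present")
        self._rows[key] = value

    def update(self, key, value):
        # Factor 3: update; refuses to create.
        if key not in self._rows:
            raise KeyError(f"{key!r} not present")
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]


def upsert(store, key, value):
    """Composite operation: decide, then delegate to one of the atoms."""
    if store.exists(key):
        store.update(key, value)
        return "updated"
    store.insert(key, value)
    return "inserted"


# Unit tests: one per atom.
def test_exists():
    store = RecordStore()
    assert not store.exists("a")
    store.insert("a", 1)
    assert store.exists("a")


def test_insert():
    store = RecordStore()
    store.insert("a", 1)
    assert store.get("a") == 1


def test_update():
    store = RecordStore()
    store.insert("a", 1)
    store.update("a", 2)
    assert store.get("a") == 2


# Integration tests: one per code path through upsert.
def test_upsert_missing_record_is_inserted():
    store = RecordStore()
    assert upsert(store, "a", 1) == "inserted"
    assert store.get("a") == 1


def test_upsert_existing_record_is_updated():
    store = RecordStore()
    store.insert("a", 1)
    assert upsert(store, "a", 2) == "updated"
    assert store.get("a") == 2


for test in (test_exists, test_insert, test_update,
             test_upsert_missing_record_is_inserted,
             test_upsert_existing_record_is_updated):
    test()
```

Notice that the tests fall out along the same borders as the functions: three small ones for the atoms, and two slightly larger ones that exercise the decision in `upsert`.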

The fact that the upsert is then covered as a small integration test signifies that it is also a (larger) factor (maybe a molecule instead of an atom?). The point is that the unit tests (and integration tests) fall clearly along borders within the design of the code. Functions are partitions of functionality of a program, and tests are written along the borders of those partitions.

The question about which comes first, however, still leaves a lot of room for argument. Should you write the tests first as a way of understanding the partitions in the code, or should you write the code first, making smart decisions about how to split functionality and write tests later?

I don't think there is an answer. Writing the tests beforehand might give you some artificial lock-in to a design that is not great, but at the same time simply starting to code head first might prevent you from seeing what the overall design of the code should be, in terms of keeping logical functions separate.

Saturday, July 28, 2007

Wireless Shenanigans

I was in Borders about a week ago, and thought I would pick up the latest issue of 2600. I started reading 2600 regularly when I was a freshman in college, although I had read the occasional blurb on its website or in the newsgroup. The great thing about the print 2600 is that it's very well geared towards hands-on people. It's like Make Magazine, only older, and black hat.

Anyway, fast forward to this week. I'm in Michigan where two friends from college are getting married, and I'm staying in the same hotel as the reception. I plopped down my MacBook Pro yesterday to check my email, and I was greeted with the ubiquitous Terms of Service page. I was about to click through when a light popped on in my head.

I typed "" into the Firefox address bar, and, lo and behold, the hotel's wireless uses the same equipment that was given a step-by-step "how-to hack" in the latest 2600. Yikes.

IP address space abuse aside, I was not at all shocked at this discovery. Hotels don't subscribe to 2600, and from the hotel's perspective, I'm sure they are happy to pay someone else to take care of their wireless. Indeed, part of the TOS for wireless usage specified that a different company handled the wireless service, but hey, if they are in the business, then they should be aware of the (massive) security breach.

Long story short, I decided not to follow the instructions in the article, turned off Airport, and used the (probably) more secure wired connection. But I will definitely think twice before hooking up to another hotel wireless network.

Wednesday, July 4, 2007

Happy July 4th, please treat the flag correctly

I noticed an article in the Washington Post about laws requiring US flags to be "born" in the USA. These are state-level laws, not a federal law, and as such they differ somewhat in what each one requires. But the result is the same: people are astounded, the typical anti-global and anti-xenocentrism arguments come out, etc.

But wait, what about the Flag Code? That's right, US Code, Title 4, Chapter 1. This was the law for which Abbie Hoffman was arrested (and later acquitted) for wearing a shirt that looked like a US flag. The Flag Code does not stipulate anything specifically related to the country of origin of the flag, but Abbie's acquittal set a precedent which might apply to these states' laws. The state laws were deemed unconstitutional, since they dealt with the mutilation of the flag (mutilation is also covered by the Flag Code). The First Amendment protects political speech, and mutilation of the flag falls under that category.

The Flag Code is quite interesting in what it permits and forbids. After reviewing it, I realize that nearly every single instance I see of the flag would be illegal under the Flag Code. For example, the flag cannot be used as part of an advertisement. In fact, just about the only legal displays are flags on a pole or mounted to a building, retired at dusk or with an overnight light.

The protectionism that is thinly veiled in the "born in the USA" laws is rather detestable, but so is gross commercial misuse of the flag. Why are states so concerned with the country of manufacture, but not doing a single thing to enforce already-existing laws?

Saturday, May 5, 2007

What really drives development?

There is a lot of buzz in the software community about Test-Driven Development. I like to boil TDD down to this:
  1. Write tests first
  2. The tests are the de jure standard
  3. Correct software passes the tests
This is actually a pretty good engineering approach to software. It enforces a well-documented and agreed-upon standard, and provides clear, measurable standards for success. In fact, if the tests are written correctly, they will also verify the range of operation for varying input parameters.

I should expand a bit on number 2: the de jure standard, as opposed to de facto, is the agreed-upon, explicit standard. A de facto standard is one which may not be fully codified in writing, or one which is adopted as a matter of use and not planning. Stating that tests are the de jure standard does not imply that they supplant a written software spec; rather, they should simply mirror the software spec (or the spec should match the tests). In the end, every point in the spec should have a test written for it.

At any rate, this is a pretty good systems-level approach to software engineering. I was somewhat disappointed, then, to read a piece by Ravi Mohan called Learning From Sudoku Solvers. Ravi basically points to five failed attempts by Ron Jeffries to write a Sudoku solver using Agile (Test-Driven Development) practices, and one, far more successful, attempt by Peter Norvig. The tagline from reddit was Test Driven Development versus Thought Driven Development.

My disappointment is basically that nobody along the way identified the problem with using Test-Driven Development here: not Ron, not Ravi, nor the Reddit submitter.

The problem lies in shades of gray in software engineering: an algorithm is more or less atomic, systems are atoms, and tests operate on one atom at a time. The problem with the test-driven approach to solving the Sudoku puzzles is that it tries to break the algorithm apart, and in doing so, makes several incomplete, fragmented almost-atoms. The only tests that should be used with the Sudoku algorithm are to input an unsolved puzzle and see if the right thing comes out. (NB: the "right thing" may not be one solution; some puzzles have multiple solutions. Of course, this has to be considered in the test.)
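A rough sketch of such a whole-algorithm test, in Python. It does not compare against one fixed answer; it checks that whatever comes out is a legal completion of the puzzle that went in, which covers the multiple-solution case. The names `is_valid_solution` and `check_solver` are my own, and the solver being tested is assumed to be any function that takes a 9x9 grid (0 for blanks) and returns a filled 9x9 grid.

```python
def is_valid_solution(puzzle, solution):
    """Check that `solution` is a correct completion of `puzzle`.

    Both are 9x9 lists of ints; 0 marks an empty cell in the puzzle.
    """
    digits = set(range(1, 10))
    # Every given clue must be preserved.
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != solution[r][c]:
                return False
    # Every row, column, and 3x3 box must contain 1..9 exactly once.
    for i in range(9):
        row = set(solution[i])
        col = {solution[r][i] for r in range(9)}
        br, bc = 3 * (i // 3), 3 * (i % 3)
        box = {solution[br + dr][bc + dc] for dr in range(3) for dc in range(3)}
        if row != digits or col != digits or box != digits:
            return False
    return True


def check_solver(solver, puzzle):
    """The one test that matters: puzzle in, legal completion out."""
    return is_valid_solution(puzzle, solver(puzzle))


# A known-valid solved grid built from a standard shifting pattern.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
          for r in range(9)]
# Blank the first four cells of each row to make a puzzle.
puzzle = [[0 if c < 4 else v for c, v in enumerate(row)] for row in solved]

assert is_valid_solution(puzzle, solved)

# Swapping two cells keeps the row legal but breaks two columns.
broken = [row[:] for row in solved]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]
assert not is_valid_solution(puzzle, broken)
```

Nothing here reaches inside the solver, so the same check works whether the solver uses constraint propagation, backtracking, or anything else.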

If it is applied correctly, Test-Driven Development will guide a programmer to a well-factored solution. It is not, however, autopilot for good programming any more than Caps Lock is cruise control for awesome. A lot of care has to be taken to ensure that tests are applied at the correct level.

Sunday, April 22, 2007

The best news is like Ecto Cooler

(I originally published this on Tumblr. Enjoy!)

Jason Goldman:

"Eventually, I believe, everyone will be using the web as a medium of self-expression. Just as ~everyone has an email address, so too will ~everyone have a place on the web that they can point to as being theirs (even if it's not fully public or shared with everyone). ... But both from a philosophical and professional standpoint, I want to see as many as people as possible use the web to express themselves. Moreover, I want to build the tools that enable them to do so."

(on Goldtoe Lemon.nut)

Jason's post is a fantastic overview of the could-be about the blogosphere and web-in-general. He touched on one bit that I think is critical in evolving the blog landscape from "bloggers" to "~everyone:" the argument against the everyman publisher — namely, that not everyone thinks or realizes they have something interesting to say, and the similar-yet-different argument that "maybe not everyone should be blogging."

But there is another side to this story that I haven't heard or read exercised: how will people (and I mean a critical mass of people, across the tipping point) learn to digest this much information and interpret it for themselves?

I'll break down these two arguments as "production vs. consumption."

  • Production:
    • people are not interested in blogging
    • people are boring
    • people think they are boring
    • people think others are boring
    • people are bad writers/designers/people (see: MySpace, "people are boring" bullet point)
  • Consumption:
    • people don't know how to think for themselves
    • people really, really like Kool-Aid
    • people are more easily trained to write than to think

So the production argument is pretty well understood. I agree that everyone does have something important/creative/interesting/inflammatory/marginally coherent to say, and I acknowledge that there are challenges ahead in having everyone publish.

But the consumption side — and this is the argument I haven't heard — what happens when every average Joe has access to (or an unstoppable deluge of) diverse, biased, subjective, disparate, and sometimes flat-out wrong opinions? What if everybody only read the Opinions section? What if everyone gets a bottomless cup of Kool-Aid?

Fanboys are a great example — look at digg and see how the "Apple Cult" articles are moderated versus the "Fuck-Microsoft Bandwagon" articles. Who needs "six words uttered by an innocent man" to convict him when you have six thousand or six million blogs that say he is guilty?

There is an insidious side to this, too: hard, technical barriers prevent people from writing. Nothing prevents people from nonthink or groupthink. Once ~everyone has a blog, and they will, no hard barrier will stop them from becoming an inexhaustible fount of stupid.

So change is coming. At least I hope so. But so is the need for people to become smarter and subjective. Anyone care to head this up?

Saturday, April 21, 2007

Dave Winer gets it.

Dave Winer, the self-proclaimed creator of RSS, podcasting, etc., has an interesting piece on his blog which is tangentially related to the ongoing struggle of mainstream media, newspapers, et al. to compete with online media outlets. In Trouble at the Chronicle, Dave makes the case that the ivory tower of journalism is crumbling, and perhaps the best thing for journalism is to make everyone into a qualified journalist:
"[R]eform journalism school. It's too late to be training new journalists in the classic mode. Instead, journalism should become a required course, one or two semesters for every graduate. Why? Because journalism like everything else that used to be centralized is in the process of being distributed. In the future, every educated person will be a journalist, as today we are all travel agents and stock brokers."

Yes. Fantastic idea, but we need a name for it. Hmm, people are composing ideas, and, of course, we want them to take this class as soon as possible to immediately contribute to society... freshman year should be right. I know! We could call it Freshman Composition!

Don't take my sarcasm as a criticism of Mr. Winer. He is absolutely correct. He has a fantastic vision for the populous press. Hell, he kind of invented it. If blogging, podcasting, The Writable Web, WikiEverything, and all of the other Web-2.0-ish things are going to succeed in enriching people's lives instead of reducing us to a crumbling mass of idiots, then we had better make sure that folks produce a good product.

The problem is that even college-level composition is not taken seriously. My degree is in Engineering. I, and many of my peers, complained about having to take Liberal Arts classes, also called the "General Education" requirement for our degree. This can be taken one of two ways: either we didn't think those subjects were important, or we didn't think the classes were worth the time. One of those points indicates a problem with the students, one indicates a problem with the school, and there are definitely people in both camps.

I am not a particularly gifted writer (you have probably figured that out if you made it this far), but I still did not feel like I enriched my writing ability in college. Granted, when teaching a large and diverse crowd, it is better to go slowly and make material accessible to most people than to go quickly and leave some students behind. But seriously, there were issues. If students complained, the teachers were too afraid of a bad teaching review to stand up to the students in a real way. Writing assignments were large and infrequent, meaning they were a temporary burden to be alleviated as quickly as possible. "Calorie-free" is the term I use to describe such things devoid of substance (see also: 300, xXx, and just about anything on Cable TV right now). It leaves a flavor behind, but doesn't ultimately affect you.

(I should note that the last paragraph is not entirely true. I took a senior-level Technical Writing composition class, and while we did have the infrequent, monolithic, completely-possible-to-do-the-night-before-so-why-even-try-type assignments, we also had to compose one blog entry per class. Our teacher, Cat, was well-focused on making us into better writers and probably did so with a majority of the class. But even she was unable to overcome the aforementioned shortcomings which are very carefully tended to and grown by modern academia.)

At any rate, school, even college, has been relegated to a minor inconvenience for students. The same for high school. If we want to build a smart population of bloggers, or if we want people in general to contribute on the Web, we need to teach better writing and, more importantly, better critical thinking. I've blogged before about the need for critical thinking in reading blogs, but it is equally needed in writing.

Tuesday, March 20, 2007

The Semantic Web 0.01 alpha

There is an interesting video on Google Video by Ted Nelson. (Bio from iBiblio, Wikipedia) Ted invented Hypertext and is head of the Xanadu Project.

Reading about him is absolutely fascinating. His view of usability is something we should all strive for ("A user interface should be so simple that a beginner in an emergency can understand it within ten seconds," taken from Wikipedia). He also posits four maxims: "most people are fools, most authority is malignant, God does not exist, and everything is wrong." (Quoted on iBiblio from somewhere else)

Ted gave a talk at Google about the Xanadu project and his vision for the future of online publishing. Here is the talk abstract, blatantly ripped from the Google Video page:

Everyone wants to improve on Web structure, but few see how stuck it is-- the browser limits what can be seen, and the one-way embedded links limit connectivity. I still want to implement the original hypertext concept from the sixties and seventies. Politics and paradigms, not possibility, have held it back.

Transclusion-based hypertext has great promise, fulfilling (I believe) all the things people want that the Web cannot do.

But to build a clean system around transclusion, we do not embed, since that brings inappropriate markup and links to new contexts.

Most importantly, we must have editability with persistent addresses-- which means non-breaking stable addresses for every element. Each new version is distributed as pointers to stabilized content. We do our canonical editing and distribution via EDL (Edit Decision List, a Hollywood concept); thus content addresses never change, and links need not break.

This is highly general, not just for text. It directly gives us a universal format for all media and their combinations, including multitrack texts, movies and audio.

Naturally, Google can play a key part in all this. As transclusive formats start deploying (including browser-based transclusive formats), a Google listing of a document can point also to a document's content sources. (To say nothing of other possible roles for Google in transdelivery and brokering.)

People accuse me of wanting "perfection." No, I want the other 90% of hypertext that the Web in its present form cannot deliver.

I am showing prototypes of a client-based viewer and editor in 3D.

Ted brings up a few interesting points, which I don't really agree with. First of all, and this is the big one, is the concept of the EDL. Be forewarned... I am a bit hazy on the exact details of using EDLs, but I see some problems. The EDL model does not provide the means for publishing original content. Ted's big point was that most publishing consists of inspiration and quoting from old sources. But where does synthesis come into play?

Another problem with the EDL is that it requires persistent references to information sources that do not change, and does not include a copy of quotes or parts that it references. This evolves into more problems: the strength of the Web is in its distributed nature. If one site goes down or runs out of money and shuts down, and they do, the rest of the Web doesn't completely break. Sure, you might not be able to follow a link from page A to the defunct page B, but this is not a failure of page A; you can still read page A and think about the bits it talks about without even finding out that page B is broken. But if every document depends on these persistent links, we expose many gigantic single points of failure. So here's my first conclusion: the Web's strength is in its lack of an EDL structure. (More on this later.)
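To make the failure mode concrete, here is a toy model in Python. It is my own simplification, not Xanadu's actual format: an "EDL" here is a list mixing literal text with (source, start, length) pointers, and the "web" is a dictionary of source addresses. All names are made up.

```python
# Hypothetical "web": source address -> content.
sources = {
    "page-b": "The quick brown fox jumps over the lazy dog.",
}

# Conventional page: the quotation is copied into the page itself.
page_a_copied = 'As page B says, "quick brown fox". I disagree.'

# Transclusive page: literal text plus (source, start, length) pointers.
page_a_edl = [
    'As page B says, "',
    ("page-b", 4, 15),
    '". I disagree.',
]


def render(edl, sources):
    """Assemble a document from its EDL; fail if any source is unreachable."""
    parts = []
    for entry in edl:
        if isinstance(entry, str):
            parts.append(entry)
        else:
            address, start, length = entry
            if address not in sources:
                raise LookupError(f"source {address!r} is unavailable")
            parts.append(sources[address][start:start + length])
    return "".join(parts)


# While page B is up, both versions read the same.
assert render(page_a_edl, sources) == page_a_copied

# Page B's host shuts down.
del sources["page-b"]
try:
    render(page_a_edl, sources)
except LookupError as err:
    print("transclusive page broke:", err)
print("copied page still reads:", page_a_copied)
```

The copied page loses a link; the transclusive page loses part of its own text.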

As a side note... if this actually were to be implemented, there would probably be emergent "leaders" or "hubs" of information. That is, there would be a handful of publishers (say 1%) who, if they were to go down, would take nearly the entire remaining 99% with them. If any of the other 99% of sites crashed, they might take 0.01% with them. Or some such; basically, the idea is that most people are not cited. Albert-Laszlo Barabasi wrote about groups like this in Linked, where he looked at the structure of the entire Web: sure, pages might be an average of 19 clicks apart, but that's by virtue of the fact that most pages link to one or two "important" pages (hubs), which in turn are some finite number of hops apart. Or they link to some semi-important hub, which links to a more important hub, and upward, and back downward. But this is neither here nor there; if one of the big hubs in a Transclusive Web crashed, so would everything else.

Also, if each document is an EDL of other documents, what happens if there is a cycle? This shouldn't happen, since documents couldn't ostensibly reference anything but older documents, so a cycle could not be introduced. But what if somebody (gasp) changed their data? And what about infinite (or near-infinite, or just a really, really big amount of) recursion?
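For what it's worth, detecting the cycle is the easy part. A sketch in Python, assuming (my assumption, not Ted's design) that the transclusion references can be read out as a plain mapping from each document to the documents it pulls from:

```python
def find_cycle(references, start):
    """Depth-first search for a reference cycle reachable from `start`.

    `references` maps a document name to the names of documents it
    transcludes. Returns the cycle as a list of names, or None.
    """
    path = []        # current chain of documents being resolved
    on_path = set()  # same chain, for fast membership tests
    done = set()     # documents already known to be cycle-free

    def visit(doc):
        if doc in on_path:
            return path[path.index(doc):] + [doc]
        if doc in done:
            return None
        path.append(doc)
        on_path.add(doc)
        for ref in references.get(doc, ()):
            cycle = visit(ref)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(doc)
        done.add(doc)
        return None

    return visit(start)


# Documents only reference older documents: no cycle.
web = {"essay": ["article"], "article": ["book"], "book": []}
assert find_cycle(web, "essay") is None

# Somebody (gasp) changes the book to quote the essay.
web["book"] = ["essay"]
print(find_cycle(web, "essay"))
```

The sketch is itself recursive, so a long enough chain of references would run into Python's recursion limit, which is a small-scale version of the really-big-recursion worry.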

But even if sites don't go down, they still will change. The implication is that a contract exists for all published data with all references. Namely, the publisher promises never to change the information. The problem is that they most certainly do, and as long as we're talking about pushing around bits, there is no realistic way to implement strong enforcement of these contracts. Even if the server software somehow prevents modification of content (which is actually a pretty ridiculous thought), what prevents someone from just logging into the machine as root? And if this content lockdown were tied to, say, a TPM, why not just make a slightly different copy of the server and replace it? Or pull the plug altogether? You either successfully make the change, or break everything that references the data you want to change; you either lie or kill the people talking about you.

Here is another conclusion: duplicating content when quoting or citing it provides a way to keep the publisher honest. This is less important for civilized publishing communities that are built on keeping themselves honest, like many peer-reviewed journals. But what about a corporation that makes false claims about its books? Or information that is published by the government? The MLA and APA rules for citations require some sort of an "Accessed on" clause for websites; they understand that the web changes.
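A small Python sketch of the idea, using a made-up `Citation` record and a made-up URL: the citing page keeps its own copy of the quoted words plus the date it saw them, and can later check whether the publisher's page still says the same thing.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    """A quotation stored by the citing page, not by the source."""
    url: str
    quoted_text: str
    accessed_on: date


def still_says(citation, current_page_text):
    """True if the source still contains the words that were quoted."""
    return citation.quoted_text in current_page_text


# Hypothetical example data.
cite = Citation(
    url="https://example.com/annual-report",
    quoted_text="profits rose 10%",
    accessed_on=date(2007, 3, 20),
)

print(still_says(cite, "This year profits rose 10% overall."))  # True
print(still_says(cite, "This year profits rose 2% overall."))   # False
```

With a pure pointer, the second case would silently rewrite the citing document; with a copy, the discrepancy is on record.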

I also have a technical nit to pick, and that relates to the format of the EDL. Mr. Nelson focuses on using byte-offsets (character counts, really, in order to satisfy our Unicode-ian friends who don't use ASCII-printable characters) to reference source material. This is an arcane measure in any modern computing field. Ted criticized the idea of computers emulating "paper under glass," and in the next breath says we should reference content the same way that we have as long as we've used MLA, APA, or other citations. Sure, the character count does away with the notion of page number, paragraph number, line number, etc., which are all artifacts of printing words in a book. None of those measures say anything about the information. But neither does character count; the only reason we don't use character counts in an APA citation, for example, is that nobody can take the time to count 56,345 characters to find the start of a quotation.
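The byte-versus-character distinction is easy to see in Python. The sample sentence is made up, and UTF-8 is my choice of encoding for the illustration:

```python
# Written with escapes so the exact code points are unambiguous.
text = "na\u00efve caf\u00e9: \u201cquote\u201d here"
quote = "\u201cquote\u201d"

char_offset = text.index(quote)
byte_offset = text.encode("utf-8").index(quote.encode("utf-8"))

print(char_offset)  # 12
print(byte_offset)  # 14: the two accented letters take two bytes each

# A character offset addresses characters...
assert text[char_offset:char_offset + len(quote)] == quote

# ...and lands in the wrong place if applied to the encoded bytes.
raw = text.encode("utf-8")
assert raw[char_offset:char_offset + len(quote)] != quote.encode("utf-8")

# Either kind of offset goes stale as soon as the source is edited.
edited = "Note: " + text
assert edited[char_offset:char_offset + len(quote)] != quote
```

Both numbers locate the quotation only for one exact version of the source; neither says anything about what is being quoted.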

I don't really have any new criticisms about Mr. Nelson's software demonstration; it was a neat concept and I think there is an absolute dearth of usable, creative, and unique ways to look at information and compose and synthesize it. But it's not the final answer.

So Ted's work is important. There are a lot of fantastic ideas, but ultimately it's a proof of concept. Herein lies another criticism: Ted is obviously disappointed that his ideas were molded into something else with the invention of the web. But just like the EDL would have to be more flexible to work, Ted has to be flexible to interpretations, syntheses, and other changes to his original ideas; otherwise they will never be corrected or perfected.

I promised "more later" about web EDLs. There is a lot of promise in the Semantic Web (a term which is a true bullshit magnet), Web 2.0/3.0 (ditto), and microformats. So if you can identify semantically which parts of a page are quotes or references to other works, you can bring in something like RDF to represent semantic meaning in the background; you actually create logical arcs between resources and say how they are related. It's a very slippery slope towards overkill; we don't need a four-line blog comment to have three dozen attributes and a megabyte of metadata associated with it, but a little overhead in the background is needed and probably pays for itself. This is the information age, for gosh sakes, and we're not exactly paying by the byte. (If you're a computer engineer or computer scientist, like me, you'll probably be able to come up with some magic number: anything between 1 and n bytes is the same, where n is 576, 1500, 2K, 4K, 16K, 80 columns, or just about anything else for various reasons.)

There are reasons (which I cannot hope to enumerate) why the Web has succeeded so far, despite its one-way links, embedded markup (oh, how you are required and yet so ugly!), and various other myriad shortcomings. Hell, look at gopher. It was faster AND semantic. Transclusion and Xanadu are great ideas, but they won't replace the Web as we know it. Hopefully we (as the general Web community) will grow to understand, use, and love some of the aspects provided by Mr. Nelson's work, but it's not something we can adopt today.