Tuesday, March 20, 2007

The Semantic Web 0.01 alpha

There is an interesting video on Google Video by Ted Nelson (bio from iBiblio, Wikipedia). Ted coined the term "hypertext" and heads the Xanadu Project.

Reading about him is absolutely fascinating. His view of usability is something we should all strive for ("A user interface should be so simple that a beginner in an emergency can understand it within ten seconds," taken from Wikipedia). He also posits four maxims: "most people are fools, most authority is malignant, God does not exist, and everything is wrong." (Quoted on iBiblio from somewhere else)

Ted gave a talk at Google about the Xanadu project and his vision for the future of online publishing. Here is the talk abstract, blatantly ripped from the Google Video page:
ABSTRACT

Everyone wants to improve on Web structure, but few see how stuck it is-- the browser limits what can be seen, and the one-way embedded links limit connectivity. I still want to implement the original hypertext concept from the sixties and seventies. Politics and paradigms, not possibility, have held it back.

Transclusion-based hypertext has great promise, fulfilling (I believe) all the things people want that the Web cannot do.

But to build a clean system around transclusion, we do not embed, since that brings inappropriate markup and links to new contexts.

Most importantly, we must have editability with persistent addresses-- which means non-breaking stable addresses for every element. Each new version is distributed as pointers to stabilized content. We do our canonical editing and distribution via EDL (Edit Decision List, a Hollywood concept); thus content addresses never change, and links need not break.

This is highly general, not just for text. It directly gives us a universal format for all media and their combinations, including multitrack texts, movies and audio.

Naturally, Google can play a key part in all this. As transclusive formats start deploying (including browser-based transclusive formats), a Google listing of a document can point also to a document's content sources. (To say nothing of other possible roles for Google in transdelivery and brokering.)

People accuse me of wanting "perfection." No, I want the other 90% of hypertext that the Web in its present form cannot deliver.

I am showing prototypes of a client-based viewer and editor in 3D.
Link to Video

Ted brings up a few interesting points that I don't really agree with. The first, and this is the big one, is the concept of the EDL. Be forewarned... I am a bit hazy on the exact details of using EDLs, but I see some problems. The EDL model does not provide a means for publishing original content. Ted's big point was that most publishing consists of inspiration and quoting from old sources. But where does synthesis come into play?
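
To make that concrete, here is a minimal sketch (in Python, with names and a span format I invented for illustration; this is not Xanadu's actual format) of what I understand an EDL-style document to be: it owns no prose of its own, just an ordered list of pointers into supposedly stable sources.

    # A toy "stable" store standing in for remote, never-changing servers.
    STORE = {
        "xu://stable.example/doc-a": "The Web is a shallow copy of the hypertext dream.",
        "xu://stable.example/doc-b": "Most publishing is quotation and rearrangement.",
    }

    # A hypothetical EDL: an ordered list of (source address, start, length) spans.
    # Notice there is no obvious slot for newly written text (the synthesis problem).
    EDL = [
        ("xu://stable.example/doc-a", 0, 7),    # "The Web"
        ("xu://stable.example/doc-b", 5, 37),   # "publishing is quotation and rearrange"
    ]

    def resolve(edl, fetch=lambda address: STORE[address]):
        """Assemble readable text by fetching every referenced span."""
        parts = []
        for address, start, length in edl:
            source = fetch(address)              # a dead source kills the whole document
            parts.append(source[start:start + length])
        return " ".join(parts)

    print(resolve(EDL))   # -> "The Web publishing is quotation and rearrange"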

Another problem with the EDL is that it requires persistent references to information sources that never change, and it does not include copies of the quotes or excerpts it references. This leads to more problems: the strength of the Web is in its distributed nature. If one site goes down or runs out of money and shuts down, and they do, the rest of the Web doesn't completely break. Sure, you might not be able to follow a link from page A to the defunct page B, but this is not a failure of page A; you can still read page A and think about the bits it talks about without ever finding out that page B is broken. But if every document depends on these persistent links, we expose many gigantic single points of failure. So here's my first conclusion: the Web's strength is in its lack of an EDL structure. (More on this later.)

As a side note... if this actually were to be implemented, there would probably be emergent "leaders" or "hubs" of information. That is, there would be a handful of publishers (say 1%) who, if they were to go down, would take nearly the entire remaining 99% with them, while any single site outside that handful might take only 0.01% of the Web down with it. Or some such; basically, the idea is that most people are not cited. Albert-Laszlo Barabasi wrote about groups like this in Linked, where he looked at the structure of the entire Web: sure, pages might average about 19 clicks apart, but that's by virtue of the fact that most pages link to one or two "important" pages (hubs), which in turn are some finite number of hops apart. Or they link to some semi-important hub, which links to a more important hub, and upward, and back downward. But this is neither here nor there; if one of the big hubs in a Transclusive Web crashed, so would everything else.

Also, if each document is an EDL of other documents, what happens if there is a cycle? Ostensibly this shouldn't happen, since documents could only reference older documents, so a cycle could never be introduced. But what if somebody (gasp) changed their data? And what about infinite (or near-infinite, or just really, really deep) recursion?
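
If nested EDLs are allowed, any resolver has to defend itself against exactly this. A rough sketch of the guard I have in mind, again with invented names, assuming a document is either plain stabilized text or another EDL:

    # Hypothetical nested EDLs: a document is either plain text or a list of
    # (target, start, length) spans pointing at other documents.
    DOCS = {
        "xu://a": [("xu://b", 0, 10)],
        "xu://b": [("xu://a", 0, 10)],   # someone changed their data and pointed b back at a
        "xu://c": "Plain, stable text with no further references.",
    }

    def resolve(address, docs=DOCS, seen=frozenset(), depth=0, max_depth=32):
        """Resolve a document, refusing cycles and runaway recursion."""
        if address in seen:
            raise ValueError("cycle detected at " + address)
        if depth > max_depth:
            raise ValueError("recursion too deep; refusing to resolve")
        doc = docs[address]
        if isinstance(doc, str):             # plain stabilized content
            return doc
        parts = []
        for target, start, length in doc:
            text = resolve(target, docs, seen | {address}, depth + 1, max_depth)
            parts.append(text[start:start + length])
        return "".join(parts)

    print(resolve("xu://c"))                 # fine
    # resolve("xu://a")                      # raises: cycle detected at xu://a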

But even if sites don't go down, they will still change. The implication is that a contract exists between every publisher and everyone who references its data: namely, the publisher promises never to change the information. The problem is that publishers most certainly do change things, and as long as we're talking about pushing around bits, there is no realistic way to strongly enforce these contracts. Even if the server software somehow prevented modification of content (which is actually a pretty ridiculous thought), what stops someone from just logging into the machine as root? And if this content lockdown were tied to, say, a TPM module, why not just stand up a slightly different copy of the server and replace it? Or pull the plug altogether? You either successfully make the change or break everything that references the data you want to change; you either lie or kill the people talking about you.

Here is another conclusion: duplicating content when quoting or citing it provides a way to keep the publisher honest. This matters less in civilized publishing communities that are built on keeping themselves honest, like many peer-reviewed journals. But what about a corporation that makes false claims about its books? Or information published by the government? The MLA and APA citation rules require some sort of "Accessed on" clause for websites; they understand that the web changes.
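
One cheap way to do this, sketched below with made-up function names (my own illustration, not anything MLA, APA, or Xanadu prescribes), is to keep the copy along with an "accessed on" date and a digest of the exact passage quoted; if the passage later disappears from the source, the citation can say so.

    import hashlib
    from datetime import date

    def cite(source_url, quoted_text):
        """Record a quotation: the copy itself, when it was accessed, and a digest."""
        return {
            "source": source_url,
            "accessed": date.today().isoformat(),
            "quote": quoted_text,
            "sha256": hashlib.sha256(quoted_text.encode("utf-8")).hexdigest(),
        }

    def source_unchanged(citation, current_source_text):
        """True if the passage we quoted still appears verbatim in the source."""
        return citation["quote"] in current_source_text

    record = cite("http://example.gov/report", "Revenue rose 2% last quarter.")
    print(source_unchanged(record, "Correction: revenue fell 1% last quarter."))  # False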

I also have a technical nit to pick, and it relates to the format of the EDL. Mr. Nelson focuses on using byte offsets (character counts, really, to satisfy our Unicode-ian friends who don't use ASCII-printable characters) to reference source material. This is an arcane measure in any modern computing field. Ted criticized the idea of computers emulating "paper under glass," yet in the next breath says we should reference content the same way we have for as long as we've used MLA, APA, or other citation styles. Sure, the character count does away with the notions of page number, paragraph number, line number, and so on, which are all artifacts of printing words in a book. None of those measures say anything about the information. But neither does the character count; the only reason we don't use character counts in an APA citation, for example, is that nobody can take the time to count 56,345 characters to find the start of a quotation.
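
To make the nit concrete: a character-offset reference is only as stable as every character in front of it. A toy example (my own, not Nelson's actual format) of how one small upstream edit silently shifts every downstream reference:

    original = "Chapter 1. The Web is a pale shadow of the hypertext we imagined."
    reference = (24, 35)           # points at "pale shadow" purely by character offset

    start, end = reference
    print(original[start:end])     # -> "pale shadow"

    # The publisher inserts a single word near the top of the document...
    revised = "Chapter 1. Today the Web is a pale shadow of the hypertext we imagined."

    # ...and the very same offsets now land mid-phrase.
    print(revised[start:end])      # -> " is a pale "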

I don't really have any new criticisms of Mr. Nelson's software demonstration; it was a neat concept, and I think there is an absolute dearth of usable, creative, and unique ways to look at information and to compose and synthesize it. But it's not the final answer.

So Ted's work is important. There are a lot of fantastic ideas here, but ultimately it's a proof of concept. Herein lies another criticism: Ted is obviously disappointed that his ideas were molded into something else with the invention of the Web. But just as the EDL would have to become more flexible to work, Ted has to be open to interpretations, syntheses, and other changes to his original ideas; otherwise they will never be corrected or perfected.

I promised "more later" about web EDLs. There is a lot of promise in the Semantic Web (a term which is a true bullshit magnet), Web 2.0/3.0 (ditto), and microformats. So if you can identify semantically which parts of a page are quotes or references to other works, you can bring in something like RDF to represent semantic meaning in the background; you actually create logical arcs between resources and say how they are related. It's a very slippery slope towards overkill; we don't need a four-line blog comment to have three dozen attributes and a megabyte of metadata associated with it, but a little overhead in the background is needed and probably pays for itself. This is the information age, for gosh sakes, and we're not exactly paying by the byte. (If you're a computer engineer or computer scientist, like me, you'll probably be able to come up with some magic number: anything between 1 and n bytes is the same, where n is 576, 1500, 2K, 4K, 16K, 80 columns, or just about anything else for various reasons.)

There are reasons (which I cannot hope to enumerate) why the Web has succeeded so far, despite its one-way links, its embedded markup (oh, how you are required and yet so ugly!), and myriad other shortcomings. Hell, look at Gopher: it was faster AND semantic. Transclusion and Xanadu are great ideas, but they won't replace the Web as we know it. Hopefully we (as the general Web community) will grow to understand, use, and love some of the aspects of Mr. Nelson's work, but it's not something we can adopt today.