Sound advice - blog

Tales from the homeworld

Sun, 2007-Jun-10

Lessons of the Web

Many people have tried to come up with a definitive list of lessons from the Web. In this article I present my own list, which is firmly slanted towards the role of the software architect in managing competing demands over a large architecture.

One of the problems software architects face is how to scale their architectures up. I don't mean scaling a server array to handle a large number of simultaneous users. I don't mean scaling a network up to handle terabytes of data in constant motion. I mean creating a network of communicating machines that serves its users' needs at a reasonable price. The World-Wide Web is easy to overlook when scouting around for examples of big architectures that are effective in this way. At first, it hardly seems like a distributed software architecture. It transports pages for human consumption, rather than being a serious machine communication system. However, it is the most successful distributed object system today. I believe it is useful to examine its success and the reasons for that success. Here are my lessons:

You can't upgrade the whole Web

When your architecture reaches a large scale, you will no longer be able to upgrade the whole architecture at once. The number of machines you can upgrade will be dwarfed by the overall population of the architecture. As an architect of a large system it is imperative you have the tools to deal with this problem. These tools are evident in the Web as separate lessons.

Protocols must evolve

The demands on a large architecture are constantly evolving. With that evolution comes a constant cycling of parts, but as we have already said: You can't upgrade the whole Web. New parts must work with old parts, and old parts must work with new. The old object-oriented approaches to protocol evolution don't stack up at this scale. It isn't sufficient to just keep adding new methods to your base-classes whenever you want to add an address line to your purchase order. A different approach to evolution is required.

Protocols must be decoupled to evolve

A key feature of the Web is that it decouples protocol into three separately-evolving facets. The first facet is identification through the Uniform Resource Identifier/Locator. The second facet is what we might traditionally view as protocol: HTTP. The definition of HTTP is focused on transfer of data from one place to another through standard interactions. The third facet is the actual data content that is transferred, such as HTML.

Decoupling these facets ensures that it is possible to add new kinds of interactions to the messaging system while leveraging existing identification and content types. Likewise, new content types can be deployed, or existing ones upgraded, without compromising the integrity of software built to engage in existing HTTP interactions.

In a traditional Object-Oriented definition of the protocol these facets are not decoupled. This means that the base-class for the protocol has to keep expanding when new content types are added or entire new base-classes must be added. The configuration management of this kind of protocol as new components are added to the architecture over time is a potential nightmare. In contrast, the Web's approach would mean that the base-class that defines the protocol would include an "Any" slot for data. The actual set of data types can be defined separately.
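
As a rough sketch of what such an "Any" slot might look like, here is a small Python example. The names `Message`, `HANDLERS` and `decode` are hypothetical and invented for illustration only. The envelope carries a content type label and an opaque body, while the set of understood content types lives in a separate registry that can grow without the envelope changing.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Message:
    """Hypothetical protocol envelope: fixed fields plus an opaque data slot."""
    method: str          # the interaction, e.g. "GET" or "PUT"
    target: str          # the identifier of the object being addressed
    content_type: str    # label that says how to interpret the body
    body: bytes          # the "Any" slot: opaque to the envelope itself


# Content types are registered separately from the envelope definition.
HANDLERS: Dict[str, Callable[[bytes], object]] = {
    "text/plain": lambda raw: raw.decode("utf-8"),
}


def decode(message: Message) -> object:
    """Interpret the body using whichever handler is registered for its type."""
    handler = HANDLERS.get(message.content_type)
    if handler is None:
        raise LookupError(f"unsupported content type: {message.content_type}")
    return handler(message.body)


# Later, a new content type is deployed without changing Message at all.
HANDLERS["application/json"] = lambda raw: json.loads(raw.decode("utf-8"))
```

Adding the address line to the purchase order then becomes a change to one content type, not to the class that defines the protocol.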

Object identification must be free to evolve

Object identification evolves on the Web primarily through redirection, allowing services to restructure their object space as needed. It is an important principle that this be allowed to occur occasionally, though obviously it is best to keep it to a minimum.
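
To make the mechanism concrete, here is a minimal Python sketch of a client chasing redirects. It does not talk to a real server: `MOVED` and `RESOURCES` are hypothetical in-memory tables standing in for a service's redirect rules and its current object space, and the paths are made up.

```python
from typing import Dict

# Hypothetical stand-ins for a service's state: where old identifiers now
# point, and which identifiers currently name a real object.
MOVED: Dict[str, str] = {
    "/blog/2007/lessons": "/articles/lessons-of-the-web",
    "/articles/lessons-of-the-web": "/writing/lessons-of-the-web",
}
RESOURCES: Dict[str, str] = {
    "/writing/lessons-of-the-web": "Lessons of the Web",
}


def resolve(path: str, max_hops: int = 5) -> str:
    """Follow redirects from an old identifier to the current one."""
    for _ in range(max_hops + 1):
        if path in RESOURCES:
            return path
        if path not in MOVED:
            raise KeyError(f"no such resource: {path}")
        path = MOVED[path]   # the equivalent of following a Location header
    raise RuntimeError("too many redirects")
```

The hop limit is there because every restructuring adds another link to the chain, which is one practical reason to keep such moves to a minimum.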

New object interactions must be able to be added over time

The HTTP protocol allows for new methods to be added, as well as new headers to communicate specific interaction semantics. This can be used to add new ways to transfer data over time. For example, it allows for subscription mechanisms or other special kinds of interactions to be added.

New architecture components can't assume new interactions are supported by all components.
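
One way a new component might cope is sketched below in Python. The toy server and the `SUBSCRIBE` method are hypothetical, invented for this example. The status codes are real HTTP: 405 means Method Not Allowed and 501 means Not Implemented. The client tries the newer interaction first and falls back to plain GET when the other side does not support it.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body)


def make_server(supported: set) -> Callable[[str, str], Response]:
    """Build a toy server that answers only the methods it knows about."""
    def handle(method: str, path: str) -> Response:
        if method not in supported:
            return 501, ""            # 501 Not Implemented
        return 200, f"{method} {path} ok"
    return handle


def watch(server: Callable[[str, str], Response], path: str) -> str:
    """Prefer the newer interaction, fall back to the one every server has."""
    status, _ = server("SUBSCRIBE", path)   # hypothetical newer method
    if status in (405, 501):                # not allowed / not implemented
        status, _ = server("GET", path)     # old components still speak GET
        return "polling" if status == 200 else "failed"
    return "subscribed" if status == 200 else "failed"
```

With `make_server({"GET"})` the client ends up polling. With `make_server({"GET", "SUBSCRIBE"})` it subscribes. Neither side needed to be upgraded in step with the other.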

Prefer low-semantic-precision document types over newly-invented document types

I think this is one of the most interesting lessons of the Web. The reason for the success of the Web is that a host of applications can be added to the network and add value to the network using a single basic content type. HTML is used for every purpose under the sun. If each industry or service on the Web defined its own content types for communicating with its clients we would have a much more fragmented and less valuable World-Wide-Web.

Consider this: If you needed a separate browser application or special browser code to access your banking details and your shopping, or your movie tickets and your city's traffic reports... would you really install all of those applications? Would Google really bother to index all of that content?

Contrary to received wisdom, the Web has thrived exactly because of the low semantic value of its content. Adding special content types would actually work against its success. Would you rather define a machine-to-machine interface with special content types out to a supplier, or just hyperlink to their portal page? With a web browser in hand, a user can often integrate data much more effectively than you can behind the scenes with more structured documents.

On the other hand, machines are not as good as humans at interpreting the kinds of free-form data that appear on the Web. Where humans and machines share a common subset of the information they need, the answer appears to be microformats: use a low-semantic file format, but dress up the high-semantic-value parts so that machines can read them too. In pure machine-to-machine environments XML formats are the obvious way to go.
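
As an illustration of the microformat idea, here is a small Python sketch using only the standard library's `html.parser`. The page fragment, the person and the phone number are made up. The class names `vcard`, `fn` and `tel` follow the hCard microformat. A human reads the sentence as a whole, while the machine picks out only the marked-up fields.

```python
from html.parser import HTMLParser

PAGE = """
<div class="vcard">
  <p>Call <span class="fn">Jane Example</span> any time on
  <span class="tel">+61 7 5550 0000</span>, or just drop by.</p>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field.

    Assumes well-nested markup in which every start tag has an end tag.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._stack = []   # one entry per open element: field name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in self.wanted), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element that is a wanted field.
        for field in reversed(self._stack):
            if field is not None:
                self.found[field] = self.found.get(field, "") + data
                break


def extract(html, wanted=("fn", "tel")):
    """Return the wanted fields with whitespace normalised."""
    parser = ClassTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.found.items()}
```

Calling `extract(PAGE)` yields the name and the phone number and nothing else. The rest of the page stays free-form for the human reader.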

In either the microformat or XML approaches it is important to attack a very specific and well-understood problem in order to future-proof your special document type.

Ignore parts of content that are not understood

The must-ignore semantics of Web content types allows them to evolve. As new components include special or new information in their documents, old components must know to filter that information out. Likewise, new components must be clear that new information will not always be understood.
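
A minimal sketch of must-ignore processing in Python follows. The `PurchaseOrder` type and its field names are hypothetical, and JSON is used purely to keep the example short. The same rule applies to HTML or XML documents. The older component keeps the fields it understands and silently drops everything else.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PurchaseOrder:
    """The fields this (older) component understands."""
    item: str
    quantity: int = 1


def parse_order(raw: str) -> PurchaseOrder:
    """Read an order document, silently dropping any fields we don't know."""
    document = json.loads(raw)
    known = {f.name for f in fields(PurchaseOrder)}
    return PurchaseOrder(**{k: v for k, v in document.items() if k in known})
```

A newer sender can add, say, an `address_line_2` field, and this parser carries on working. The sender in turn must not rely on that field having been understood.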

If it is essential that a particular piece of new information is included and understood in a particular document type, it is time to define a new document type that includes that information. If you find yourself inventing document type after document type to support the evolution of your data model, chances are you are not attacking the right problem in the right way.

Be cautious about the use of namespaces in documents

I take Mark Nottingham's observation about Microsoft, Netscape, and HTML very seriously:

What I found interesting about HTML extensibility was that namespaces weren’t necessary; Netscape added blink, MSFT added marquee, and so forth.

I’d put forth that having namespaces in HTML from the start would have had the effect of legitimising and institutionalising the differences between different browsers, instead of (eventually) converging on the same solution, as we (mostly) see today, at least at the element/attribute level.

Be careful about how you use namespaces in documents. Consider only using them in the context of a true sub-document with a separately-controlled definition. For example, an Atom document that includes some HTML content should identify the HTML as such. However, an extension to the Atom document schema should not use a separate namespace. Even better: Make this sub-document a real external link and let the architecture's main evolution mechanisms work to keep things decoupled. Content-type definition is deeply community-driven. What we think of as an extension may one day be part of the main specification. Perhaps the worst thing we can do is to try to force in things that shouldn't be part of the main specification. Removing a feature is always hard.

New content types must be able to be added over time

HTTP includes the concept of an "Accept" header that allows a client to indicate which kinds of document it supports. This is sometimes seen as a way to return different information to different kinds of clients, but should more correctly be seen as an evolution mechanism. It is a way of concurrently supporting clients that only understand a superseded document type and clients that understand the current one. This is an important feature of any architecture which still has an evolving content-type pool.
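
Here is a simplified Python sketch of how a server might use the header this way. The function names and the `application/x-order-v1` and `application/x-order-v2` media types are hypothetical. The parser handles q-values but, as a stated simplification, matches media types exactly and ignores wildcards such as `*/*`.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0   # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's own order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose(header: str, available: List[str]) -> Optional[str]:
    """Pick the client's most preferred type that the server can produce.

    Simplification: exact matches only; wildcards are not handled.
    """
    for media_type, q in parse_accept(header):
        if q > 0 and media_type in available:
            return media_type
    return None
```

An old client that sends only `application/x-order-v1` keeps receiving the superseded type, while a client that lists both and prefers v2 gets the current one from the same URL.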

Keep It Simple

This is the common-sense end of my list. Keep it simple. What you are trying to do is produce the simplest evolving uniform messaging system you possibly can. Each architecture and sub-architecture can probably support half a dozen content types and fewer interactions through its information transport protocol. You aren't setting out to create thousands of classes interacting in crinkly, neat, orderly patterns. You are trying to keep the fundamental communication patterns in the architecture working.


The Web is already an always-on architecture. I suspect that always-on architectures will increasingly become the norm for architects out there. There will simply come a point where your system is connected to six or seven other systems out there that you have to keep working with. The architecture is no longer completely in your hands. It is the property of the departments of your organisation, partner organisations, and even competitors. You need to understand the role you play in this universal architecture.

The Web is already encroaching. Give it ten more years. "Distributed Software Architecture" and "Web Architecture" will soon be synonyms. Just keep your head through the changes and keep breathing. You'll get through. Just keep asking yourself: "What would HTML do?", "What would HTTP do?".