[Up: Distributed Systems]
[Previous: Distributed Systems] [Next: Low-Level Distribution]


The reasons for developing distributed systems are about as manifold as there are definitions for the term ``distributed system'' in the first place. The most basic one is that of a collaboration of individual components that communicate to achieve a common goal, which could cover everything from Unix shell scripts (where various programs communicate using pipes) to the loose collaborations on the Internet where hundreds or thousands of users contribute computing power to work on crypto challenges. Both extremes do not seem so far-fetched when looking at the most popular design strategies used in distributed systems:

The first item mentioned here was the historically first, too: when distributing a single task over a number of processors, parts of the overall task can be computed in parallel to speed up the total process. Fault tolerance is relevant in fail-safe systems (e.g. airplanes), and achieved using redundancy: the same computation is performed multiple times to detect and to safely recover from the failure of individual nodes. Delegation is the basis of Client/Server systems, where a server performs a particular service on behalf of the client, like a database retrieval. Resource sharing is a different kind of delegation, where systems offer their special information or hardware resources, for example a graphical X11 display, the operation of a robot or numerical computation capabilities.

Parallel processing, fault tolerance and resource sharing specialize the above definition of a ``distributed system'' in that they mandate a distribution over interconnected nodes in a network. This is obvious in the case of parallel processing, which only makes sense when distributing the job over a number of distinct processors; and redundancy also can only be achieved with physically separate systems, because the failure of any part of a node makes any of its results suspect. However, delegation is just as useful on a single computer, as in the Unix shell example.

One central keyword in all definitions of distributed systems is communication. Communication is necessary to distribute information among the components that make up a distributed system, and to collect the partial results of each component in order to compute a total answer to the common goal mentioned above.

While the term communication can be stretched to the extreme (global variables in a program can be said to be a means of communication just as well as common disk files or shared memory), the usual case of distribution over networked nodes requires some sort of network-aware transport.

The need of communication, potentially over a shared resource like a network, introduces a number of potential problems that a distributed system must be aware of:

Communication Overhead:
Transferring data is costly and takes time. Distribution does not make sense where the time for communication exceeds the amount spent performing computations. The bandwidth available for communication may be limited and not constant.
Points of failure:
Components can fail individually, or communication links may fail.
Communication over a public network could be intercepted, manipulated or interferred with by a third party.
It may be necessary for a distributed system to wait for a single slow component before computation can continue.
Just as not all of the advantages apply to a given distributed system, the impact of these potential problems also depends on the particular system which is to be implemented: while the interception of data by a hostile attacker might be ignored in a local network, it must be dealt with when private data needs to be transmitted over the Internet. Depending on the application, the manifold failure points in communication must be handled specifically. For non-critical systems, or for small systems in a local network where failure is rare, it may be acceptable for an application to simply crash in the case of a failure: it will then be the task of the administrator to discover and correct the cause of the failure, and to restart the application. In other cases, the software must be prepared to recover reliably from a partial system failure and perhaps use redundant components instead.

The term distribution platform shall be used to describe a generic means of distribution provided by an operating system, library or language. A distribution platform at least hast to provide primitives for communication and coordination for use by the programmer to build the distributed system on top of the platform. It should help a developer to utilize the advantages of a distributed system as easily as possible while minimizing or at least controlling the potential problems.

A distribution platform has to provide at least three primitives so that implementing a distributed system becomes feasible: addressing, synchronization and encoding.

In order for components of a distributed system to find each other, each must have an address. Servers must be able to publish their address, and clients must be able to acquire a server's address, either by administrative intervention, guessing, or looking it up in a naming service.
All participating parties must agree on a certain protocol when data may be sent or received. Common data transfer protocols are messages, request-response or data streams.
Defines the format used for transferring data. This could be uninterpreted chunks of octets or structured data.
A distribution platform can be expected to use abstractions of these primitives to ease development, for example by automated encoding of complex data. In a synchronized system, the developer is not involved in issues of timing, nor has to be concerned with the ordering of incoming data. Abstracting all three issues provides transparency, meaning that the developer does not need to be aware of how components are located and how data is communicated between them.

Apart from transparency, other key features of distribution platforms are:

Components can be fixed to one location, move in an inactive state, or even move while being active. Mobility can be coarse-grained (servers) or fine-grained (objects). Location selection could be client-controlled, server-controlled or automated, with an appropriate location selected by load-balancing, network usage or hardware requirements.
Quality of Service:
Addresses the non-functional aspects of distribution, e.g. a distributed system's control over different data transports (bandwidth limitations, encryption), timing (real-time requirements), or protocols to handle redundancy or error recovery (automated recovery vs. notification vs. abortion).
Closed systems can live without interoperability, but in a heterogeneous world, it is useful if the distribution platform is not limited to a particular hardware, operating system or programming language. It is unlikely that independent clients and servers can agree on identical hardware and software.
A number of distribution platforms have been developed, some with a specific use like parallel processing in mind, others with the intention of a generic platform for distributed applications. A couple of examples are explored in the following sections, grouped by the level of abstraction provided for the issues above: low, middle or high.

Figure 2.1: The OSI model applied to Distribution Platforms

This categorization can be done by applying the popular Open Systems Interconnection (OSI) Reference Model [70] to distribution platforms. The OSI model takes a layered approach to networking, each introducing a level of abstraction. The full model has seven layers, but the lower four, physical, data link, network and transport, deal with low-level issues such as failure detection and routing and are of less interest to the application programmer, who is content with secure communication on top of layer four.

A low-level distribution platform does not abstract any further and allows no more than data transport. Figure 2.1 shows the upper three layers of the OSI model on top of the transport layer, with the distributed system built on the top application layer.

Slightly deviating from OSI terminology, a high-level distribution platform can be said to provide mechanisms naturally built into the programming language, so that locality or remoteness is entirely transparent.

The two layers in between, again in line with OSI, provide session control, the handling of connection between distributed components, and the presentation layer encodes complex application-dependent data structures like method parameters into a common format, probably adding data compression or encryption - and this is just what can be expected from middleware.

[Previous: Distributed Systems] [Next: Low-Level Distribution]
[Up: Distributed Systems]

Frank Pilhofer