Memorandum No: 125
Date: November 27, 1974
Author: John F. Sowa
Subject: Analysis of Selected Aspects of GRAD
[GRAD was the code name for IBM's Future Systems project of the 1970s. The code names for its components were taken from various colleges and universities: RIPON was the hardware architecture, COLBY was the operating system, HOFSTRA/TULANE were the system programming language and library, and VANDERBILT was the largest of the three planned implementations. The smallest of the three, with considerable simplification, was eventually released as System/38, which evolved into the AS/400. This document is a scanned version of a memo that was written during the final days of the FS project. It was widely circulated throughout the IBM Corporation and had a strong impact on the developers' attitudes and perception of FS. This memo was not the primary reason for the termination of the FS project, but it promoted discussions that probably hastened the decision to redirect the development efforts. The major reason why FS was canceled was that the System/370 emulator on VANDERBILT ran many times faster than FS native mode.
In 1973, I had attended a presentation on the design of the VANDERBILT hardware. After showing a diagram of the CPU frame, which contained sixteen circuit boards, the presenter proudly said "Each of these boards contains as much circuitry as a System/370, Model 168." At that point, I asked the obvious question: "But the expected performance is only about three times faster than a Model 168. Why don't you just build sixteen 168s?" Instead of answering, he just glared at me. But when the FS project was canceled in 1975, IBM did just that. According to the IBM archives, "Whereas the Model 168 had required 40 months to evolve from development to initial shipment, the 3033 was shipped to its first customer after only 28 months in development." IBM achieved that feat by remapping the Model 168 design into the circuitry intended for VANDERBILT and making a number of tweaks in the design to improve performance. Of course, IBM could have delivered a machine with similar or better performance in 1975 instead of 1977, if they hadn't killed all the System/370 design projects to avoid competition with the FS fantasy.
The 370 emulator minus the FS microcode was eventually sold in 1980 as the IBM 3081. The ratio of the amount of circuitry in the 3081 to its performance was significantly worse than that of other IBM systems of the time; its price/performance ratio wasn't quite so bad because IBM had to cut the price to be competitive. The major competition at the time was from the Amdahl Corporation — a company founded by Gene Amdahl, who left IBM shortly before the FS project began, when his plans for the Advanced Computer System (ACS) were killed. The Amdahl machine was indeed superior to the 3081 in price/performance and spectacularly superior in terms of performance compared to the amount of circuitry.]
GRAD was chartered to be technically ambitious. In most instances, GRAD has approached system problems with great monolithic advances that make a complete break with current systems. Yet monolithic systems can only succeed if the designer is omniscient. The most successful systems designed by humans are modular ones that can accommodate incremental advances within a highly general, but basically simple framework.
This memo analyzes some of the GRAD approaches and recommends alternatives that provide at least as much functional capability, but are simpler, cheaper, and easier for users and implementers to understand.
(signed)
John F. Sowa
Designing an operating system is fundamentally different from designing a payroll program. The external interface of an operating system is not something tangible like a pay stub, but rather a set of interfaces that form the innermost level of other programs. The topmost level of an operating system is visible to the system programmer, but not directly to the casual user.
Although an operating system should have no "externals," other programs provided by IBM, such as editors, compilers, and query facilities, must have external interfaces with good human factors. These programs, however, are less fundamental than the operating system. They run at the level of application programs and may be changed or replaced without disrupting the entire system. The design of these facilities may be called the secondary architecture to distinguish it from the heart of the operating system, which constitutes the primary architecture.
To be specific, the primary architecture of an operating system comprises the following seven areas:
By contrast, the secondary architecture is an open-ended list that evolves throughout the lifetime of the system. Whereas the primary architecture is a single whole that cannot be subdivided, the secondary architecture can come in an endless variety of sizes and capabilities. Some examples of the secondary facilities include:
One example of primary architecture in terms of System/360 is the set of conventions for text cards produced by compilers for input to the loaders and linkage editors. These conventions were defined long before System/360 was ever announced and remain among the few conventions that are common to all operating systems that run on System/360 and System/370 machines. Because these conventions are so firm, compilers can be written in Hursley or San Jose and can produce text that is accepted by a linkage editor written in Poughkeepsie. A less felicitous example of primary architecture is the data control block: although the OS and DOS versions require similar information, the fields and formats are different, and programs cannot run on both systems without conversions that are often prohibitively expensive.
The primary architecture appears in the definition of control blocks, system macros, and library routines. Control blocks are not very sexy, and they're hard for a salesman to sell. But until the primary architecture has been mapped into a set of control blocks, there is no firm basis for all the secondary components. The principal virtue of control blocks is their inflexibility: a word like "transaction" can mean all things to all men; but until you define something like a transaction control block, you cannot begin to analyze what interactions a transaction may have with other aspects of the system.
COLBY, in its present state, has guidelines for the human factors of using displays, but it lacks a primary architecture that determines what kinds of operations are possible. As a result, the designers of the various components have made independent and usually inconsistent assumptions about the underlying system. The Design Review Board headed by Al Scherr is now faced with the problem of retrofitting a primary architecture onto a mass of secondary components. By themselves, most of these components are useful, desirable facilities, but a collection of components is no substitute for an architecture.
A slogan is a catchy phrase that can often be useful in motivating people. But it cannot be incorporated in a system design unless it is translated into a precisely defined principle. Some slogans can even be dangerous if they distract attention from essential questions. Two such slogans that have probably caused immeasurable harm to COLBY are the terms "transaction oriented" and "display oriented."
I don't mean to be so heretical as to imply that displays are not good — even the limited capabilities of the 3270 are adequate to make me forever dissatisfied with the 2741. However, the slogan "display-oriented" is dangerous because it focuses on an external device, whose support is part of the secondary architecture, and overlooks the question of what properties the primary architecture must have to support a truly interactive system. In terms of the primary architecture, an interactive system may be defined as one that has an on-line symbol table, dynamic resource allocation, and a modifiable run-time environment. Displays are windows into a system: if the system has appropriate characteristics, it can be turned into a display-oriented system merely by hooking on some displays and writing some programs; if it doesn't, it can never be more than a batch system with displays attached.
The slogan "transaction oriented" also points to secondary architecture instead of primary architecture. In its simplest terms, a transaction is nothing but a sequence of input, processing, output by some application program. In this sense, it is a meaningless term. Usually, however, it is intended to imply something about resource allocation, logging of inputs, automatic checkpoints, and recovery techniques. These latter questions are fundamental and unfortunately unsolvable in a completely general way. The basic implications for the primary architecture are for dynamic resource allocation with a suitable method of deadlock prevention and for a set of conventions for applications that would allow recovery under certain conditions. These are the fundamental issues that COLBY has not addressed.
Basic principles seldom come wrapped in catchy phrases, and any slogan that is too catchy is probably an oversimplification. The only way to discover the basic principles is through long, hard years of studying the literature in computer science and of gaining as wide an experience as possible with competitive systems as well as the various IBM systems. Even then, no magic slogans will emerge that the salesman can tout to the customer; what will emerge is a system that is easy to use, easy to implement, and capable of supporting a wide range of function.
For optimum performance, implement a function in hardware; for optimum generality, implement it in software; and for optimum cost/performance, make judicious hardware/software trade-offs. The problem with the old FSM [FS Machine — an abandoned attempt to define a high-level instruction set] is that it didn't have well-defined goals indicating exactly which areas it had to optimize. As a result, it was too complex for efficient hardware implementation; it wasn't general enough to handle all the data types required for all high-level languages; and it prevented optimizing compilers from making their own trade-offs in generating code. Consequently, its performance didn't meet requirements, and the set of computational instructions was scrapped.
Although the old FSM has been replaced, its vaguely defined goals are largely responsible for COLBY's failure to develop a primary architecture. Since the FSM had been advertised as a high level system, it gave the appearance of being the primary architecture — all seven areas were partially covered by some feature of the FSM. Therefore, the COLBY designers concentrated their attention on other issues and assumed that something in the FSM would serve as the primary architecture. The overwhelming complexity of the FSM maintained the illusion because few people understood it well enough to know what was missing.
Yet the partial solutions provided by the FSM were useless. It had descriptors for certain types of data, but it did not have a fully general schema for all data in the system. APL, PL/I, and the Data Management Component (DMC) would still have to maintain their own data descriptors for all aspects not explicitly covered by the FSM. Without a set of conventions for data descriptors, every component was bound to pick a slightly different format.
The FSM solved the easy problems, but it did not provide a basis for a completely general solution for all aspects of the primary architecture. By giving the appearance of being a primary architecture, the FSM merely distracted other people from addressing the essential problems.
The set of system ops [operations] represents a substantial amount of programming effort. They form the nucleus of an operating system. Yet the current strategy is to implement them three times over for each of the three different machines.
The simpler approach is to recognize that only a fraction of the system ops will be executed frequently enough to have a measurable effect on system performance. The 80-20 rule is a rough guide: 20% of the ops will account for 80% of the execution time. Only the high frequency ops need be implemented in microcode to provide good performance.
The modular approach would be to design a common set of algorithms, interfaces, and control blocks for all three machines. Whenever possible, interfaces should be optimized for efficient microcoding — boundaries should be aligned, bits tested together should be in the proper positions within a byte, and the various routines should be independent enough that the decision to microcode could be made independently for each routine or group of routines.
Having designed such an interface, the implementers could code the entire set of system ops in the basic instruction set. Only after the system was debugged and running would microprogrammers begin to write the frequently used routines in microcode. This approach follows the fundamental rule for optimization: Make it run before you make it faster.
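To make the modular approach concrete, the following sketch (my own illustration in PL/I, not a GRAD interface) shows each system op as an independent routine reached through a common table; any single entry could later be replaced by a microcoded version without disturbing the callers or the other routines:

   SYSOPS: PROCEDURE OPTIONS(MAIN);
      /* One slot per system op; release 1 fills every slot with a     */
      /* routine coded in the basic instruction set.                   */
      DECLARE OPTAB(3) ENTRY VARIABLE;
      DECLARE OPCODE FIXED BINARY(15);

      OPTAB(1) = OPEN_FILE;
      OPTAB(2) = CLOSE_FILE;
      OPTAB(3) = SEND_MSG;

      OPCODE = 2;
      CALL OPTAB(OPCODE);             /* common calling convention     */

      OPEN_FILE:  PROCEDURE; PUT SKIP LIST('OPEN');  END OPEN_FILE;
      CLOSE_FILE: PROCEDURE; PUT SKIP LIST('CLOSE'); END CLOSE_FILE;
      SEND_MSG:   PROCEDURE; PUT SKIP LIST('SEND');  END SEND_MSG;
   END SYSOPS;

The point is only that the rest of the system depends on the calling convention and the control blocks, not on whether a particular routine happens to be microcoded.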
The modular design has the following advantages:
One aspect of the primary architecture that does not yet exist for COLBY is a consistent set of conventions for all data descriptors in the system. In the absence of such conventions, all of the secondary components are going to adopt their own.
To illustrate the importance of such a scheme of descriptors, consider what happens when one program calls another on current systems, and there happens to be a mismatch in the data types. For example, someone might define a factorial function in PL/I starting with the following two statements:
   FACTORL: PROCEDURE(N) RETURNS (FIXED BINARY(31));
      DECLARE N FIXED BINARY(31);

An example of a common error is to call that function in an expression like FACTORL(5), under the impression that 5 will be passed to the function in the correct format. Unfortunately, the default format for integer constants is FIXED DECIMAL. Instead of returning the value 120, the function will loop until an overflow occurs. Other common bugs involve getting the operands in the wrong order, passing the wrong number of operands, or making incorrect assumptions about the order and types of variables in external control sections.
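For contrast, here is a minimal sketch of a calling program that avoids the error; the ENTRY declaration is not part of the memo's fragment and is added here purely for illustration. It tells the compiler the parameter type, so the constant 5 is converted to FIXED BINARY(31) before the separately compiled FACTORL is invoked:

   CALLER: PROCEDURE OPTIONS(MAIN);
      /* Declaring the external entry with its parameter type forces   */
      /* conversion of the decimal constant before the call.           */
      DECLARE FACTORL ENTRY(FIXED BINARY(31)) RETURNS(FIXED BINARY(31));
      DECLARE RESULT FIXED BINARY(31);
      RESULT = FACTORL(5);
      PUT SKIP LIST(RESULT);          /* prints 120                    */
   END CALLER;

Nothing in the current linkage conventions, however, checks whether such a declaration is present or whether it agrees with the procedure that is actually linked.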
A system with run-time checking of data descriptors, such as APL, can detect many of these errors. Such checking, however, will generate an error message when incorrect data is used, but it will not, in general, explain that the error was caused by incorrect assumptions about the parameters of the subroutine being called. Run-time checking also incurs a great deal of overhead; whenever possible, the validity checks should be carried out by compilers and link editors.
For current systems, the loaders and linkage editors accept object decks generated by compilers, resolve references between separately compiled control sections, and relocate the address constants. In resolving external references, however, they never check to see whether the particular combination is valid. Within a given compilation, the compiler normally checks to see whether the use of a variable is consistent with its declaration; but between compilations, the linking programs never check whether the assumptions made by one program are the same as the declarations made by another program.
With suitable extensions to current compilers and linking programs, even System/370 could provide a level of validity checking that has not yet been defined for COLBY. All that need be done is for the language groups to adopt a common external form for data descriptors (each compiler might still use its own internal conventions, but it would have to define a mapping into the common external form). Then the linkage editors and loaders would accept two new types of cards: a Symbol Description Record (SDR) and a Symbol Assumption Record (SAR). For each control section, the compilers would generate an SDR for every entry point to indicate whether the entry represented data or program. If it was data, the SDR would describe the type; and if it was program, it would describe the expected parameters. Whenever a compiler generated a reference to an external symbol, it would also generate an SAR stating its assumptions. At link time or load time, the SAR's and SDR's would be compared. If assumptions and descriptions were consistent, the programs would be linked; otherwise, the linker would generate a diagnostic indicating which programs made incorrect assumptions.
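To show how small the check itself is, here is a sketch of the comparison the linker would perform; the record layouts are invented for illustration, since nothing in this memo specifies them:

   LINKCHK: PROCEDURE OPTIONS(MAIN);
      /* Hypothetical layouts for the two proposed record types.       */
      DECLARE 1 SDR,                         /* Symbol Description Record      */
                2 SYMBOL     CHARACTER(8),
                2 DESCRIPTOR CHARACTER(32);  /* encoded type or parameter list */
      DECLARE 1 SAR LIKE SDR;                /* Symbol Assumption Record       */

      /* Description generated by the compiler of FACTORL.             */
      SDR.SYMBOL = 'FACTORL';  SDR.DESCRIPTOR = 'FIXED BINARY(31)';
      /* Assumption generated by the compiler of the calling program.  */
      SAR.SYMBOL = 'FACTORL';  SAR.DESCRIPTOR = 'FIXED DECIMAL(5)';

      IF SAR.SYMBOL = SDR.SYMBOL & SAR.DESCRIPTOR ^= SDR.DESCRIPTOR THEN
         PUT SKIP LIST('MISMATCH OF ARGUMENTS AND PARAMETERS FOR',
                       SDR.SYMBOL);
   END LINKCHK;

At link time the records would of course come from the object decks rather than from assignment statements, but the comparison is no more complicated than this.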
Such a scheme of validity checking has the following advantages over a system of run-time checking:
   MISMATCH OF ARGUMENTS AND PARAMETERS:
   XYZ CALLS FACTORL WITH ARGUMENT FIXED DECIMAL(5).
   FACTORL EXPECTS FIXED BINARY(31).

This message happens to use the PL/I terminology, but it is probably readable to programmers who use the other languages. As a further extension, the SDR might name the programming language from which the control section was generated; then the linker could even phrase the messages in the same terminology as the source. A still further extension would be for the linker to check whether the calling and called routines were written in the same language; if not, it could insert the appropriate interface routines for setting up the PL/I environment or the COBOL environment.
A good scheme of descriptors can enhance function by more than simple validity checking. For example, suppose one wished to link programs written in another programming language into an APL workspace. If those programs had SDR's and SAR's, it would be possible to have an APL loader that would load programs into an APL workspace and generate an interface function to communicate between APL and the other language. The interface functions would be locked functions with the same name as the program to be loaded, and their primary purpose would be to convert input arguments into the form described by the SDR and to convert results into the form expected by APL. Although there are other questions to be addressed such as PL/I storage management, FORTRAN COMMON, and I/O in general, the problem can be solved with certain restrictions; a good scheme of data descriptors would be the basis for combining the efficiencies of compiled subroutines with the highly interactive environment of APL.
The purpose of this discussion is to show the possibility of adding important function to a system by simple extensions. If the primary architecture is well designed, such benefits come easily — a good design will kill many birds with a single, elegant stone.
COLBY commands introduce an endless variety of different words and different syntax for functions that have an underlying commonality. The English language, for example, uses the simple word "send" for talking about sending messages, letters, packages, and people. Yet COLBY invented separate words for every type of missive. COLBY has a general ON-condition for detecting events, but it uses a special command for detecting incoming messages. Other computer systems reserve a file named MAILBOX for each user, and then use ordinary file handling commands for erasing, saving, or displaying messages; COLBY has separate commands with their own peculiarities.
The command system group is not responsible for all the terms contributed by other components, but they participated in making a simple task complicated. The subroutine call is the most general possible command interface; every high level language makes provision for subroutine calls; and some operating systems, such as CMS, use the call as the only command format. Yet the current command proposal combines commands with the HLL's in a way that forces people to learn new syntax, kludges up the HLL compilers, proposes a separate mechanism for creating commands, and freezes that syntax into HLL programs so that migration to other systems would become more difficult.
The study group recognizes that treating the command system as a library of routines to be called like any other routines would be the simplest, cheapest, most efficient, and least risky approach. However, they found the following reasons for rejecting it:
The question of accessing system variables involves a basic question of the run-time environment. PL/I external variables or named COMMON in FORTRAN would be adequate to access system variables if the primary architecture of COLBY had been properly defined. This is another example of how the lack of a primary architecture makes the secondary components more complex than they should be.
The question of a human factored interface is more complex. For commands with two or three parameters, a CALL statement with a positional notation is adequate. Commands with more parameters are usually disguised ways of defining control blocks. For such commands, it might be more appropriate to ask what is the most convenient way of defining control blocks in the various high level languages. Many general purpose solutions to this question are possible, including ways of using the editing facilities. Yet another special-purpose approach should be avoided.
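As a rough illustration of that point (the routine name and the fields are invented), a wordy command with many parameters reduces to an ordinary structure passed to a library routine by a CALL:

   CMDDEMO: PROCEDURE OPTIONS(MAIN);
      /* A "command" with many parameters is really a control block.   */
      DECLARE 1 PRINT_REQUEST,
                2 FILE_NAME CHARACTER(44),
                2 COPIES    FIXED BINARY(15),
                2 HOLD      BIT(1);

      PRINT_REQUEST.FILE_NAME = 'MASTER.PAYROLL';
      PRINT_REQUEST.COPIES    = 2;
      PRINT_REQUEST.HOLD      = '0'B;
      CALL PRINT_SERVICE(PRINT_REQUEST);

      /* Stands in for a routine from the command library.             */
      PRINT_SERVICE: PROCEDURE(REQ);
         DECLARE 1 REQ,
                   2 FILE_NAME CHARACTER(44),
                   2 COPIES    FIXED BINARY(15),
                   2 HOLD      BIT(1);
         PUT SKIP LIST('PRINT', REQ.FILE_NAME, REQ.COPIES);
      END PRINT_SERVICE;
   END CMDDEMO;

Editing or generating such a structure is a question for the individual languages and tools; no new command syntax needs to be invented.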
Although the study group used human factors as an argument in favor of the approach they were suggesting, a better case can be made against their approach:
   // DISPLAY USAGE PROCESSOR(I) TO (CPU(I))::

where the slashes and the double colons are the command delimiters. The syntax of that command is easy enough to read, but it would be a nuisance to have to memorize hundreds of such commands, each with its own conventions; i.e., should the preposition be "TO," "ON," "AT," or "INTO," and why should "CPU(I)" be enclosed in parentheses when "PROCESSOR(I)" is not? Even the keyword "DISPLAY" is misleading because that statement actually assigns a value to CPU(I) instead of printing something. If the command library contained a function named USAGE, that long, wordy command could be replaced by an ordinary assignment statement in PL/I:
   CPU(I)=USAGE(I);

This form is short, easy to remember, and consistent with PL/I syntax. I cannot imagine anyone who would prefer the special command form.
It may not be fair to beat such an innocent example to death, but if there are so many questionable human factors in this simple statement, there is no hope of designing an acceptable syntax for all possible commands. Furthermore, any syntax that is adopted may need to be revised from time to time. Yet if the syntax is buried in HLL programs, then it would be frozen forever because of the expense of rewriting programs to change the syntax. The use of subroutine calls would be far more flexible: if one of the commands were changed, a programmer could still run the old programs by writing an interface routine that would perform an equivalent function, but be called by the old name.
The widely held view that commands are somehow special is probably a result of certain peculiarities of OS/360, where job initiation was intimately linked to the JCL interpreter, which in turn was linked to an input file. These linkages are logically unnecessary and create obstacles to more flexible use of the system. Attempts to circumvent these obstacles have led to weird programming practices: since OS does not have dynamic resource allocation, a program that had to acquire a new set of files might lead to its own reincarnation by spooling out a copy of itself with different JCL cards and resubmitting itself through HASP.
OS system macros are a better example of what a command system should be than OS JCL. System macros are a specialized type of subroutine call, and they can call programs, link programs, or define data control blocks. The only thing they cannot do is acquire resources because of OS restrictions, which will, hopefully, be removed in GRAD. The system macros, of course, are not suitable for non-programmers (but neither is JCL). A good system should have a command library to be called from programs as well as interactive facilities that form an easy to use interface to both system commands and application programs.
Defining a new programming language takes more time than implementing a new compiler for an old programming language. When Conway was deciding to implement PL/C (a fast compiler for PL/I written at Cornell), he was strongly tempted to make changes or "improvements" in the language. He finally chose to implement the language as defined, because he realized that the implementers could argue about changes for years and never get down to implementing.
The strategy of designing a special systems language for implementing COLBY is inadvisable for several reasons:
I won't repeat Memo No. 121 here, but will summarize it in the following points:
The current practice of writing program products for System/370 in PL/S should be discontinued. Even if the HOFSTRA/TULANE language resembles PL/S, it cannot be compatible with it because of the different pointer philosophy of GRAD. PL/I, with its more restrictive rules for pointers, will run on both System/370 and GRAD. To simplify the migration to GRAD, future program products should be written in PL/I. If the current version of the PL/I optimizer is deemed inefficient, it should be enhanced.
The most revolutionary new feature of RIPON is the one-level store, which makes all data in the system accessible by a uniform addressing scheme. The storage hierarchy, which holds the data and moves it around, is logically independent of the addressing mechanism and should not be confused with it. With a totally different addressing scheme, System/370 could use the same storage devices, as in fact it already uses the 3850 Mass Storage System. The basic question that should be asked is whether COLBY is taking advantage of the one-level store or whether it is merely using the new storage devices, which might just as well be attached to System/370.
The answer to that question is that none of the components of the secondary architecture, such as compilers, editors, and so forth, have shown that they are planning to do anything that could not have been implemented on a System/370 in approximately the same way. The Data Management Component (DMC) is planning to use the various devices of the storage hierarchy and will probably use the automatic paging and locating mechanisms implicit in the use of the 16-byte pointers; but there is no function of the DMC that could not be implemented on a System/370 machine having the same devices. The DMC on System/370 would probably have to maintain tables of track locations that would be handled automatically by GRAD, but it might actually access the data more efficiently than GRAD, especially when it has to perform sequential searches.
The fact that COLBY is not yet using the one-level store does not mean that the concept is not valuable. It is just another symptom of the lack of a primary architecture. Without such an architecture, the designers of the secondary facilities are not doing anything they would not have done on the old system.
In an ideal world, programs would rise above the mundane questions of efficient use of a particular hardware configuration. With infinite storage of infinite speed, no one would care where the data happened to be located or how the system happened to be moving it around. In the real world, the one-level store is a fiction that is almost true for many simple programs, but is hardly ever true for complex systems such as database programs, large sort routines, or airlines systems with thousands of terminals.
For complex systems, programmers must have some feedback that can guide their design choices: should they process a file sequentially or use a binary search; should they link a given set of programs into a single module or access them independently; should they process all the fields on one record before passing to the next record or run a given program on all records before passing to the next program? Such choices shouldn't have to be faced in an ideal world, but with the GRAD hardware, they can make a difference of an order of magnitude in system performance.
The casual user should never need to know what the system is doing in order to run his job. In order to write a logically correct program, the systems programmer should not need to know how the system is working, but he certainly must know what the hardware is doing to his job if he is writing a major application that takes up most of the resources of the system. If the system doesn't provide feedback, programmers will make guesses about what is happening, and almost invariably, their guesses will be wrong.
With the one-level store, we have swung from the old extreme where the casual user had to specify detailed volume and device information to the opposite extreme where the installation manager or the systems programmer has no control over what is going on inside the machine. For certain kinds of programs, the one-level store will perform beautifully. But for sequential access to a large file, it cannot compete with high density tapes. When current systems go down, the installation can move a disk pack or tape to another machine; but when GRAD goes down, everything is inaccessible. Of course, the GRAD requirements say that it won't go down, but the architecture has not been demonstrated to meet the requirements. The store is supposed to keep two copies of all data for backup, but both copies are kept within the same room. For better security, some installations will want to checkpoint the entire store at frequent intervals, yet the data rates for transferring all that data are not adequate. The gist of these arguments is that the full implications of the one-level store are not well understood.
Tight security prevents a potential thief from stealing a system design. But good communication is necessary to design a system that is worth stealing. The recent version of the RIPON architecture has been issued in 15 separate, registered confidential documents. A programmer who gets authorization to learn about the addressing structure has to demonstrate a separate need to know to learn the instruction set. The avowed aim of all this red tape is to prevent anyone from understanding the whole system; this goal has certainly been achieved.
There is more than one way to keep something from being stolen: you can lock it up so tight that no one can get to it, or you can make it so big that no one can carry it away. A bicycle left on the corner of Fifth Avenue and 34th Street in New York wouldn't remain there for 5 minutes, but the Empire State Building has been standing there for forty years. In terms of computer systems, there are certain things that can be stolen, such as the specifications for a disk drive or an electrical interface. A complete operating system, however, could not be stolen without far more detailed documentation than a programming manual. With something the size of COLBY, it is far from certain that IBM can implement the system even with full access to the documentation. For an outside group to copy it is humanly impossible.
Groups that have tried to copy a machine architecture from a programming manual have come to grief. Honeywell designed a 1401 emulator that did what the manual said it should, but they didn't copy the glitches that programmers knew it had. When IBM implemented a 1401 emulator on the Model 30, they copied the machine, not the manual, and it worked. RIPON is so much more complex than a 1401 and it is still so subject to unpredictable change that any competitor who tried to copy it would be committing suicide.
Designers of secondary components must have a total understanding of the primary architecture. And programmers who are implementing those components must have a basic five-foot shelf of manuals at their fingertips. Recent programming experience in IBM has been with machines and operating systems that have been publicly documented, and the security problems have been much less complex. If IBM seriously intends to implement COLBY, new security procedures must be adopted that will allow every programmer and designer to have his own copies of the manual for RIPON and the primary COLBY architecture.
Being omniscient, God wisely chose the evolutionary approach to systems design. Instead of constructing a full set of lungs, kidneys, liver, and brain and then trying to integrate them, He started with simple organisms and added one feature at a time to a working model. A worm, for example, has muscles, a digestive system, a rudimentary circulatory system, primitive sense organs, and a fat ganglion for a brain. At no stage in the evolutionary process was there a period of "system integration" when two or more sophisticated components were brought together for the first time.
Advocates of structured programming are now preaching the lesson that God discovered billions of years ago: a traumatic phase of system integration should never occur. Instead, the system should start with a simple framework, and pieces should be added or modified one at a time. The initial system should contain slots for all of the primary facilities, even though some of the slots may only be filled with a rudimentary version (release 1.0 of the heart may be nothing more than a pulsating region in one of the arteries). Once the initial version is running, incremental improvements can be added to one component at a time, and each component can be tested within a realistic environment.
Recent revisions of the COLBY schedule have included estimates of the time required for system integration that were extrapolated from experience with VS2. Yet a linear extrapolation of integration time vs. number of lines of code grossly underestimates the rate of increase in complexity — the number of possible interactions increases as the factorial of the number of components. It is humanly impossible to implement a system of the size of COLBY by putting together separately developed components. That does not mean that COLBY could never be built, but rather that it could never be built with the current organization and implementation plan.
Last Spring, a group from the Cambridge Scientific Center in cooperation with consultants from MIT reviewed the entire GRAD project. They recommended that COLBY be implemented by starting with a basic system and adding new features in an evolutionary process. That recommendation was rejected because the schedules and organization could not accommodate it. If the schedules and organization cannot accommodate an evolutionary approach, then the current plan is unrealistic. I can predict with absolute confidence that COLBY will be implemented in an evolutionary process. The only question is whether the evolution will take place, as with OS/360, after the system has been inflicted upon the customers.
An organization geared up for evolutionary growth must start out small and expand only after the primary architecture has been completed. Fred Brooks admitted that the most costly mistake he made was in allowing a 150-man group instead of a 12-man group to design the architecture of OS/360. The arguments in favor of the large group were the familiar ones that the 150 people shouldn't be left with nothing to do and that the larger group could do a faster, more thorough job of addressing all the problems. In fact, the larger group took just as much time, and the resulting design was incoherent, more costly to implement, and more time-consuming to debug than the original design would have been.
The number of people assigned to the GRAD project would have been appropriate to a system whose primary architecture had already been completed. Then the various groups at widely scattered locations could design the secondary components on a well defined base. But a project of this magnitude without a primary architecture is like a movie with a cast of thousands, but no script.
Further progress on COLBY requires a temporary reduction in manpower. All groups responsible for secondary components should be assigned to interim products, and the primary architecture should be done by a group of ten of the best system designers in the corporation all working in the same location. When the architecture is complete, each of the designers should become the chief programmer of a team to implement part of the basic system. By the beginning of 1977, the interim product should be nearing completion, and the groups can be transferred back to COLBY. By then the primary architecture should be done, and the secondary components can be implemented directly on COLBY itself without having to go through DUKE.
IBM's largest customers, who provide a major share of the revenue, have no room to grow. They already have Model 195's and duplexed 168's and have no prospect of getting a replacement until the 1980's. The machine they would get under the current GRAD strategy is VANDERBILT, which has no more CPU power than their current systems. It is unreasonable to expect them to wait that long to get a system that is no faster than what they now have, yet involves an enormous conversion expense to an incompatible operating system.
We must assume that DPD, which is acutely sensitive to customer complaints, will demand a new high-end system well before 1980. Since GRAD will not be available in time, that system will have to be a souped up model of System/370 (perhaps with minor extensions) that runs some version or extension of the current operating systems. Such a system could be built by taking a suitable design, like a 370/168, and implementing it in faster, cheaper technology. Given about 8 MIPS of processing power per CPU, large amounts of L3 storage, and a competitive price, such a system would be a real winner.
The first question that arises is how that system would affect GRAD: would any customer who bought an 8-MIP machine in 1977 migrate to an incompatible 4-MIP machine in 1980? The answer is that nobody would. IBM must, therefore, choose among the following alternatives:
The second alternative can also be ruled out. As currently conceived, GRAD cannot be sold as a follow-on to a reasonably priced, 8-MIP System/370 machine. The SORC committee has concluded that there is nothing in feature group A or B of COLBY that would induce anyone to migrate to an incompatible system. The CPU power of the GRAD machines is too little, and feature group C is too late.
The third alternative is the only one left. The interim products and the GRAD system must be planned together as part of a coherent strategy. A smooth migration path must have top priority in the product plan: the interim products must be upward compatible with current hardware and operating systems, and the transition to GRAD must follow with as little disruption as possible.
Software enhancements for System/370 are also essential, particularly in the database area. IMS, for example, should be enhanced both for better performance and for improved function. On multiprocessing systems, IMS does not take advantage of multiple simultaneous processes and performs no better with two CPU's than with one; it should be rewritten to give better performance on a single CPU and to take advantage of multiple CPU's. The lack of a good database product is a serious exposure for IBM because that is an obvious gap that software houses are ready to fill. Although a non-IBM database system may temporarily increase the utilization of IBM hardware, non-IBM software will lock customers into a system that will be difficult if not impossible to convert to GRAD.
These considerations imply the need for some interim hardware and software products as well as basic reorganization of the entire GRAD project. If reasonably priced, VANDERBILT could have good cost/performance, but a project to build a faster CPU should be started. Without a primary architecture for COLBY, the groups that are working on secondary components have nothing to do. The work accomplished by the external groups so far represents a good statement of requirements, but it is independent of GRAD. Those groups should, therefore, be redirected towards implementing similar facilities on the interim hardware. Meanwhile, the primary architecture should be designed (and implemented) by a small group, or preferably by two small groups at opposite ends of the country working in a design competition.
Every time that IBM introduces a new system with better cost/performance than the old systems, there will be some customers who trade down to save money. To offset a potential dip in revenue, the system should be accompanied by new software to encourage applications that will absorb the additional capability. But adding enhancements to System/370 means that IBM will have to add even more to GRAD when it finally comes.
At the current rate of technological innovation, there is no danger that anyone will run out of ideas for new enhancements. If we implement the best system we know how to build today, there will be plenty of new ideas five years from now for an even better system. And if IBM puts the best ideas into current systems, we will gain valuable experience that will show us how to make further improvements. IBM's greatest exposure lies in not enhancing System/370 software: competitive software companies will develop incompatible facilities that will be impossible to convert, and IBM programmers will not gain the experience of working with the latest technology.
The first enhancement must be some new hardware — two or three new CPU's that could be updated versions of existing CPU's together with large storage using the L2 and L3 technology. I will call the new line System/375 to indicate that it should contain half as many architectural changes over 370 as 370 has over 360.
Any architectural changes for System/375 should be ones that smooth the way towards GRAD. Putting in 16-byte pointers is too drastic a step and should not be contemplated. But introducing contexts in the sense of GRAD could be done by providing hardware support for virtual machines. Running OS/VS under VM/370 incurs enormous overhead because of the double level of address translation, channel translation, and interpretation of SVC's. But with certain architectural changes, it should be possible to run VS2 under VM/370 with no more overhead than running VS2 on a real machine. Additional support for rapid communication between contexts (virtual machines) should be added in a way that could be extended to communication between GRAD contexts.
Hardware support for virtual machines will simplify migration both for current systems and for GRAD. A DOS user who wishes to migrate to OS/VS could run his old system and his new system concurrently on the same machine. An airline could run PARS and VS on the same machine (with a virtual=real option for the PARS system). And IBM could develop new software products for System/375 that run under, say, VS2, but let them communicate with a customer's old system. When GRAD finally comes, the entire structure of intercommunicating virtual machines on System/375 would be mapped into a system of intercommunicating GRAD contexts: a DOS program that communicates with IMS in another virtual machine could use identical protocols to communicate with the DMC in a GRAD context.
The mechanism for communicating between virtual machines or GRAD contexts should be designed to run as fast as a subroutine call. Then new software products that IBM supplies such as database facilities or application development facilities should be designed to run in a separate context (virtual machine) from the customer's programs. The intercontext communication mechanism would then provide a standard protocol that would not change with the advent of GRAD. When GRAD comes, the IBM software could be moved to a GRAD context without affecting any code written by the customer. This strategy would allow customers to begin taking advantage of GRAD without noticing any more disruption than normally occurs with the passage from one release to another of an operating system.
There are other software enhancements that should be provided. The HLL compilers for System/370 should be revised to produce re-entrant code. This revision is especially important for supporting interactive applications; many CICS users write in assembler instead of COBOL primarily because COBOL programs are not re-entrant. The following summary lists a few of the most important enhancements.