Great Microprocessors of the Past and Present (V 11.2.0)

last update: September 1998
More CPU info (Including WWW sites of some companies mentioned here) can be found at:
The CPU Info Center

More detailed documentation of microprocessor instructions sets can be found at:
Microprocessor instruction set cards

A very detailed (and much more accurate) chronology of microcomputer history can be found at:
Chronology of Events in the History of Microcomputers

A list of architects, and some architecture descriptions which are more detailed (and probably accurate) than those found here is available at Mark Smotherman's list of:
Recent Computer Architects.

An online dictionary of computing terms you might find on this page can be found at:
the Free On-line Dictionary of Computing

Feel free to send me comments at:
bayko@cs.uregina.ca

Laugh at my own amateur attempt at designing a processor architecture at: http://www.cs.uregina.ca/~bayko/design/design.html


Introduction: What's a "Great CPU"?

This list is not intended to be an exhaustive compilation of microprocessors, but rather a description of designs that are either unique (such as the RCA 1802, Acorn ARM, or INMOS Transputer), or representative designs typical of the period (such as the 6502 or 8080, 68000, and R2000). Not necessarily the first of their kind, or the best.

A microprocessor generally means a CPU on a single silicon chip, but exceptions have been made (and are documented) when the CPU includes particularly interesting design ideas, and is generally the result of the microprocessor design philosophy. However, towards the more modern designs, design from other fields overlap, and this criterion becomes rather fuzzy. In addition, parts that used to be separate (FPU, MMU) are now usually considered part of the CPU design.

Another note on terminology - because of the muddling of the term "RISC" by marketroids, I've avoided using those terms here to refer to architectures. And anyway, there are in fact four architecture families, not two. So I use "memory-data" and "load-store" to refer to CISC and RISC architectures. This file is not intended as a reference work, though all attempts (well, many attempts) have been made to ensure its accuracy. It includes material from text books, magazine articles and papers, authoritative descriptions and half remembered folklore from obscure sources (and net.people who I'd like to thank for their many helpful comments). As such, it has no bibliography or list of references.

In other words, "For entertainment use only".

Enjoy, criticize, distribute and quote from this list freely.

By: John Bayko (Tau).
Internet: bayko@cs.uregina.ca

An explanation of the version numbers:

    ##.##.##
     |  |  |
     |  |  +-- small, usually 2 sentences or less.
     |  +--- changes a paragraph or more, or several descriptions
     +---- CPU added or deleted.

Quick Index (in no particular order):

Processors:

  • Intel 4004, 4040
  • Intel 8008, 8080, 8085
  • Intel 8048, 8051, 8052
  • Intel 80x86, Pentium, AMD K5/K6, Cyrix M1, Nx586, IA-64
  • Intel 80960
  • Intel 80860
  • Intel i432
  • Motorola MC14500B
  • Motorola 680x, 6809, Hitachi 6309
  • Motorola 680x0
  • Motorola 88000
  • Motorola DSP96002/DSP56000
  • Motorola MCore
  • AMD 2901, 2903 (and 2910)
  • AMD 9511 math processor
  • AMD 29000
  • Zilog Z-80, Z-280
  • Zilog Z-8000, Z80000
  • Fairchild F8
  • Fairchild 9440
  • Fairchild/Intergraph Clipper
  • National Semiconductor SC/MP (and COP)
  • National Semiconductor 320xx, Swordfish
  • TI TMS1000 4-bit
  • TI 9900 16-bit
  • TI TMS320Cx0 DSP
  • MIPS/SGI CPUs
  • MOS Technologies 650x, Western Design Center 65816
  • Microchip Technology PIC 16x
  • RCA 1802
  • Ferranti F100-L
  • Western Digital MCP-1600
  • Signetics 2650
  • Hitachi 6301
  • Signetics 8x300
  • Siemens 80C166
  • MISC M17
  • Rekursiv
  • AT&T CRISP/Hobbit
  • INMOS Transputer T-212, T-414, T-800, T-9000
  • Architectures:
  • PDP-8/Intersil 6100
  • PDP-11
  • Data General NOVA/MN601, Eclipse
  • MIL-STD-1750
  • IBM/Motorola POWER/PowerPC
  • IBM 801, ROMP
  • IBM System/360/370/390
  • TRON
  • Hitachi SuperH
  • SPARC
  • HP PA-RISC
  • ARM
  • Patriot Scientific ShBoom
  • DEC VAX
  • DEC Alpha
  • CDC 6600/7600
  • Berkeley RISC
  • Virtual Machines:
  • Forth
  • UCSD p-System Pascal
  • Java Virtual Machine
  • Definitions And Explanations


    Section One: Before the Great Dark Cloud.

    Part I: The Intel 4004, the first (Nov 1971) . .

    The first single chip CPU was the Intel 4004, a 4-bit processor meant for a calculator. It processed data in 4 bits, but its instructions were 8 bits long. Program and Data memory were separate, 1K data memory and a 12-bit PC for 4K program memory (in the form of a 4 level stack, used for CALL and RET instructions). There were also sixteen 4-bit (or eight 8-bit) general purpose registers.

    The 4004 had 46 instructions, using only 2,300 transistors in a 16-pin DIP. It ran at a clock rate of 740kHz (eight clock cycles per CPU cycle of 10.8 microseconds) - the original goal was 1MHz, to allow it to compute BCD arithmetic as fast (per digit) as a 1960's era IBM 1620.

    The 4040 (1972) was an enhanced version of the 4004, adding 14 instructions, larger (8 level) stack, 8K program space, and interrupt abilities (including shadows of the first 8 registers).

    [for additional information, see Appendix E]


    Intel Corporation:
    http://www.intel.com/
    Intel 25th Anniversary of the Microprocessor:
    http://www.intel.com/intel/museum/25anniv/index.htm


    Part II: TMS 1000, First microcontroller (1972) .

    Texax Instruments followed the Intel 4004/4040 closely with the 4-bit TMS 1000, which was the first microprocessor to include enough RAM, and space for a program ROM, to allow it to operate without multiple external support chips. It also featured an innovative feature to add custom instructions to the CPU.

    It included a 4-bit accumulator, 4-bit Y register and 2 or 3-bit X register, which combined to create a 6 or 7 bit index register for the 64 or 128 nybbles of on chip RAM. A 1-bit status register was used for various purposes in different contexts. The 6-bit PC combined with a 4 bit page register and an optional 1 bit bank ('chapter') register to produce 10 or 11 address bits to 1KB or 2KB of on-chip program ROM. There was also a 6-bit subroutine return register and 4-bit page buffer, used as the destination on a branch, or exchanged with the PC and page registers for a subroutine (amounting to a 1-element stack, branches could not be performed within a subroutine).

    An interesting feature of the PC is it was incremented using a feedback shift register, not a counter, so instructions were not consecutive in memory, but since all memory was internal, this was not a problem. Instructions were 8 bits with twelve hardwired, and with a 31X16 element PLA allowing 31 custom microprogrammed instructions. All hardwired instructions were single cycle, and no interrupts were allowed.


    Texas Instruments:
    http://www.ti.com/
    TMS 1000 One-Chip Microcomputers:
    http://www.ti.com/corp/docs/history/tms.htm

    Part III: The Intel 8080 (April 1974) . . .

    The 8080 was the successor to the 8008 (April 1972, intended as a terminal controller, and similar to the 4040). While the 8008 had 14 bit PC and addressing, the 8080 had a 16 bit address bus and an 8 bit data bus. Internally it had seven 8 bit registers (A-E, H, L - pairs BC, DE and HL could be combined as 16 bit registers), a 16 bit stack pointer to memory which replaced the 8 level internal stack of the 8008, and a 16 bit program counter. It also had several I/O ports - 256 of them, so I/O devices could be hooked up without taking away or interfering with the addressing space, and a signal pin that allowed the stack to occupy a separate bank of memory.

    The 8080 was used in the Altair 8800, the first widely-known personal computer (though the definition of 'first PC' is fuzzy. Some claim that the 12-bit LINC (Laboratory INstruments Computer) was the first 'personal computer'. Developed at MIT (Lincoln Labs) in 1963 using DEC components, it inspired DEC to design its own PDP-8 in 1965, also considered an early 'personal computer'). 'Home computer' would probably be a better term here, though).

    Intel updated the design with the 8085 (1976), which added two instructions to enable/disable three added interrupt pins (and the serial I/O pins), and simplified hardware by only using +5V power, and adding clock generator and bus controller circuits on-chip.


    Intel Corporation:
    http://www.intel.com/
    Intel 25th Anniversary of the Microprocessor:
    http://www.intel.com/intel/museum/25anniv/index.htm


    Part IV: The Zilog Z-80 - End of an 8-bit line (July 1976) . . . .

    The Z-80 was intended to be an improved 8080 (designed by ex-Intel engineers), and it was - vastly improved. It also used 8 bit data and 16 bit addressing, and could execute all of the 8080 (but not 8085) op codes, but included 80 more, instructions (1, 4, 8 and 16 bit operations and even block move and block I/O). The register set was doubled, with two banks of data registers (including A and F) that could be switched between. This allowed fast operating system or interrupt context switches. The Z-80 also added two index registers (IX and IY) and 2 types of relocatable vectored interrupts (direct or via the 8-bit I register).

    Clock speeds ranged from the original Z-80 2.5MHz to the Z80-H (later called Z80-C) at 8MHz, and later a CMOS version at 10MHz.

    Like many processors (including the 8085), the Z-80 featured many undocumented instructions. In some cases, they were a by-product of early designs (which did not trap invalid op codes, but tried to interpret them as best they could), and in other cases chip area near the edge was used for added instructions, but fabrication made the failure rate high. Instructions that often failed were just not documented, increasing chip yield. Later fabrication made these more reliable.

    But the thing that really made the Z-80 popular in designs was the memory interface - the CPU generated its own RAM refresh signals, which meant easier design and lower system cost, the deciding factor in its selection for the TRS-80 Model 1. That and its 8080 compatibility, and CP/M, the first standard microprocessor operating system, made it the first choice of many systems.

    Embedded varients of the Z-80 were also produced. Hitachi produced the 64180 (1984) with added components (two 16 bit timers, two DMA controllers, three serial ports, and a segmented MMU mapping a 20 bit (1M) address space to any three variable sized segments in the 16 bit (64K) Z-80 memory map), a design Zilog and Hitachi later refined to produce the Z-180 and HD64180Z (1987?) which were compatible with Z-80 peripheral chips, plus variants (Z-181, Z-182). The Z-280 was a 16 bit version introduced about July, 1987 (loosely based on the ill-fated Z-800), with a paged (like Z-180) 24 bit (16M) MMU (8 or 16 bit bus resizing), user/supervisor modes and features for multitasking, a 256 byte (4-way) cache, 4 channel DMA, and a huge number of new op codes tacked on (total of almost 3,500, including previously undocumented Z-80 instructions), though the size made some very slow. Internal clock could be run at twice the external clock (ex. 16MHz CPU with a 8MHz bus), and additional on-chip components were available. A 16/32 bit Z-380 version also exists (1994) with added 32-bit linear addressing mode (not Z-80 compatible).

    The Z-8 (1979) was an embedded processor with on-chip RAM (actually a set of 124 general and 20 special purpose registers) and ROM (often a BASIC interpreter), and is available in a variety of custom configurations up to 20MHz. Not actually related to the Z-80.


    Zilog Corporation:
    http://www.zilog.com/
    20th Anniversary of the TRS-80:
    http://www.radioshack.com/trs_80/

    Part V: The 650x, Another Direction (1975) . . .

    Shortly after Intel's 8080, Motorola introduced the 6800. Some of the designers left to start MOS Technologies (later bought by Commodore), which introduced the 650x series which included the 6501 (pin compatible with the 6800, taken off the market almost immediately for legal reasons) and the 6502 (used in early Commodores, Apples and Ataris). Like the 6800 series, varients were produced which added features like I/O ports (6510 in the Commodore 64) or reduced costs with smaller address buses (6507 13-bit 8K address bus in the Atari 2600). The 650x was little endian (lower address byte could be added to an index register while higher byte was fetched) and had a completely different instruction set from the big endian 6800. Apple designer Steve Wozniak described it as the first chip you could get for less than a hundred dollars (actually a quarter of the 6800 price) - it became the CPU of choice for many early home computers (8 bit Commodore and Atari products).

    Unlike the 8080 and its kind, the 6502 (and 6800 had very few registers. It was an 8 bit processor, with 16 bit address bus. Inside was one 8 bit data register, two 8 bit index registers, and an 8 bit stack pointer (stack was preset from address 256 ($100 hex) to 511 ($1FF)). It used these index and stack registers effectively, with more addressing modes, including a fast zero-page mode that accessed memory addresses from address 0 to 255 ($FF) with an 8-bit address that speeded operations (it didn't have to fetch a second byte for the address).

    Back when the 6502 was introduced, RAM was actually faster than microprocessors, so it made sense to optimize for RAM access rather than increase the number of registers on a chip. It also had a lower gate count (and cost) than its competitors.

    The 650x also had undocumented instructions.

    The CMOS 65C02/65C02S fixed some original 6502 design flaws, and the 65816 (officially W65C816S, both designed by Bill Mensch of Western Design Center Inc.) extended the 650x to 16 bits internally, including index and stack registers, with a 16-bit direct page register (similar to the 6809), and 24-bit address bus (16 bit registers plus 8 bit data/program bank registers). It included an 8-bit emulation mode. Microcontroller versions of both exist, and a 32-bit version (the 65832) is planned. Various licensed versions are supplied by GTE (16 bit G65SC802 (pin compatible with 6502), and G65SC816 (support for VM, I/D cache, and multiprocessing)) and Rockwell (R65C40), and Mitsubishi has a redesigned compatible version. The 6502 remains surprisingly popular largely because of the variety of sources and support for it.

    The 6502-based Apple II line (not backwards compatible with the Apple I) was among the first microcomputers introduced and became the longest running PC line, eventually including the 65816-based Apple IIgs The 6502 was also used in the Nintendo entertainment system (NES), and the 65816 is in the 16-bit successor, the Super NES.


    The Western Design Center, Inc.:
    http://www.wdesignc.com

    Part VI: The 6809, extending the 680x (1977) . . . . . . . .

    Like the 6502, the 6809 was based on the Motorola 6800 (August 1974), though the 6809 expanded the design significantly. The 6809 had two 8 bit accumulators (A & B) and could combine them into a single 16 bit register (D). It also featured two index registers (X & Y) and two stack pointers (S & U), which allowed for some very advanced addressing modes (The 6800 had A & B (and D) accumulators, one index register and one stack register). The 6809 was source compatible with the 6800, even though the 6800 had 78 instructions and the 6809 only had around 59. Some instructions were replaced by more general ones which the assembler would translate, and some were even replaced by addressing modes. While the 6800 and 6502 both had a fast 8 bit mode to address the first 256 bytes of RAM, the 6809 had an 8 bit Direct Page register to locate this fast address page anywhere in the 64K address space.

    Other features were one of the first multiplication instructions of the time, 16 bit arithmetic, and a special fast interrupt. But it was also highly optimized, gaining up to five times the speed of the 6800 series CPU. Like the 6800, it included the undocumented HCF (Halt Catch Fire) instruction to incrementally strobe the address lines for bus testing ("jump to accumulator (A or B)" in the 6800, implemented and documented as $00 in the 68HC11 which is described below).

    The 6800 and 6809, like the 6502 series, used a single clock cycle (the base cycle, plus a cycle rotated 90 degrees out of phase) to generate the timing for four internal execution stages, so that there were instructions which executed in one external 'cycle' (this is different from clock-doubling, which uses a phase-locked-loop to generate a faster internal clock which is synchronised with an external clock). Most CPUs, such as the 8080, used the external clock directly, so an equivalent instruction would take four cycles, meaning a 2MHz 6809 would be roughly equivalent to a 8MHz 8080. The 680x and 650x only accessed memory every other cycle, allowing a peripheral (such as video, or even a second cpu) to access the same memory without conflict. Motorola later produced CPUs in this line with a standard four-cycle clock.

    The 6800 lived on as well, becoming the 6801/3, which included ROM, some RAM, a serial I/O port, and other goodies on the chip (as an embedded controller, minimizing part counts - but expensive at 35,000 transistors. The 6805 was a cheaper 6801/3, dropping seldom used instructions and features). Later the 68HC11 version (two 8 bit/one 16 bit data register, two 16 bit index, and one 16 bit stack register, and an expanded instruction set with 16 bit multiply operations) was extended to 16 bits as the 68HC16 (and a lower cost 16 bit 68HC12 (May 1996)). It remains a popular embedded processor (with over 2 billion 6800 varients sold), and radiation hardened versions of the 68HC11 have been used in communications satellites. But the 6809 was a faster and more flexible chip, particularly with the addition of the OS-9 operating system.

    Of course, I'm a 6809 fan myself...


    As a note, Hitachi produced a version called the 6309. Compatible with the 6809, it added 2 new 8-bit registers (E and F) that could be combined to form a second 16 bit register (W), and all four 8-bit registers could form a 32 bit register (Q). It also featured hardware division, and some 32 bit arithmetic, a zero register (always 0 on read), block move, and was generally 30% faster in native mode. ALso, unlike the 6809, the 6309 could trap on an illegal instruction. These enhancements, surprisingly, never appeared in official Hitachi documentation.

    Motorola:
    http://www.mot.com/
    Motorola Microcontrollers:
    http://www.mcu.motsps.com/mc.html
    TRS-80 Color Computer Homepage (has 6809/6309 links):
    http://www.sfn.saskatoon.sk.ca/~ab594/coco.html

    Part VII: Advanced Micro Devices Am2901, a few bits at a time . .

    Bit slice processors were modular processors. Mostly, they consisted of an ALU of 1, 2, 4, or 8 bits, and control lines (including carry or overflow signals usually internal to the CPU). Two 4-bit ALUs could be arranged side by side, with control lines between them, to form an ALU of 8-bits, for example. A sequencer would execute a program to provide data and control signals.

    The Am2901, from Advanced Micro Devices, was a popular 4-bit-slice processor. It featured sixteen 4-bit registers and a 4-bit ALU, and operation signals to allow carry/borrow or shift operations and such to operate across any number of other 2901s. An address sequencer (such as the 2910) could provide control signals with the use of custom microcode in ROM.

    The Am2903 featured hardware multiply.

    Legend holds that some Soviet clones of the PDP-11 were assembled from Soviet clones of the Am2901.


    Since it doesn't fit anywhere else in this list, I'll mention it here...

    AMD also produced what is probably the first floating point "coprocessor" for microprocessors, the AMD 9511 "arithmetic circuit" (1979), which performed 32 bit (23 + 7 bit floating point) RPN-style operations (4 element stack) under CPU control - the 64-bit 9512 (1980) lacked the transcendental functions. It was based on a 16-bit ALU, performed add, subtract, multiply, and divide (plus sine and cosine), and while faster than software on microprocessors of the time (about 4X speedup over a 4MHz Z-80), it was much slower (at 200+ cycles for 32*32->32 bit multiply) than more modern math coprocessors are.

    It was used in some CP/M (Z-80) systems, and on a S-100 bus math card for NorthStar systems. Calculator circuits (such as the National Semiconductor MM57109 (1980), actually a 4-bit NS COP400 processor with floating point routines in ROM) were also sometimes used, with emulated keypresses sent to it and results read back, to simplify programming rather than for speed.


    Part VIII: Intel 8051, Descendant of the 8048. . . .

    Initially similar to the Fairchild F8, the Intel 8048 was also designed as a microcontroller rather than a microprocessor - low cost and small size was the main goal. For this reason, data is stored on-chip, while program code is external (a true Harvard architecture). The 8048 was eventually replaced by the very popular but bizarre 8051 and 8052.

    While the 8048 used 1-byte instructions, the 8051 has a more flexible 2-byte instruction set. It has eight 8-bit registers, plus an accumulator A. Data space is 128 bytes accessed directly or indirectly by a register, plus another 128 above that in the 8052 which can only be accessed indirectly (usually for a stack). External memory occupies the same address space, and can be accessed directly (in a 256 byte page via I/O ports) or through the 16 bit DPTR address register much like in the RCA 1802. Direct data above location 32 is bit-addressable. Although complicated, these memory models allow flexibility in embeded designs, making the 8051 very popular (over 1 billion sold since 1988).

    The Siemens 80C517 adds a math coprocessor to the CPU which provides 16 and 32 bit integer support plus basic floating point assistance (32 bit normalise and shift), reminiscent of the old AMD 9511. The Texas Instruments TMS370 is similar to the 8051, Adding a B accumulator and some 16 bit support.


    As a side note, the 4-bit Texas Instruments TMS1000 was the first CPU to integrate RAM (32 bytes), ROM (1K), a clock, and I/O support on a single chip, making it the first microcontroller.

    Intel Corporation:
    http://www.intel.com/
    Intel MCS (R) 51 Family:
    http://support.intel.com/oem_developer/mcs51.htm


    Part IX: Microchip Technology PIC 16x/17x, call it RISC (1975) . . .

    The roots of the PIC originated at Harvard university (see Harvard Architecture) for a Defense Department project, but was beaten by a simpler (and more reliable at the time) single memory design from Princeton. Harvard Architecture was first used in the Signetics 8x300, and was adapted by General Instruments for use as a peripheral interface controller (PIC) which was designed to compensate for poor I/O in its 16 bit CP1600 CPU. The microelectronics division was eventually spun off into Arizona Microchip Technology (around 1985), with the PIC as its main product.

    The PIC has a large register set (from 25 to 192 8-bit registers, compared to the Z-8's 144). There are up to 31 direct registers, plus an accumulator W, though R1 to R8 also have special functions - R2 is the PC (with implicit stack (2 to 16 level)), and R5 to R8 control I/O ports. R0 is mapped to the register R4 (FSR) points to (similar to the ISAR in the F8, it's the only way to access R32 or above).

    The 16x is very simple and RISC-like (but less so than the RCA 1802 or the more recent 8-bit Atmel AVR microcontroller which is a canonical simple load-store design - 16-bit instructions, 2-stage pipeline, thirty-two 8-bit data registers (six usable as three 16-bit X, Y, and Z address registers), load/store architecture (plus data/subroutine stack)). It has only 33 fixed length 12-bit instructions, including several with a skip-on-condition flag to skip the next instruction (for loops and conditional branches), producing tight code important in embedded applications. It's marginally pipelined (2 stages - fetch and execute) - combined with single cycle execution (except for branches - 2 cycles), performance is very good for its processor catagory.

    The 17x has more addressing modes (direct, indirect, and relative - indirect mode instructions take 2 execution cycles), more instructions (58 16-bit), more registers (232 to 454), plus up to 64K-word program space (2K to 8K on chip). The high end versions also have single cycle 8-bit unsigned multiply instructions.

    The PIC 16x is an interesting look at an 8 bit design made with slightly newer design techniques than other 8 bit CPUs in this list - around 1978 by General Instruments (the 1650, a successor to the more general 1600). It lost out to more popular CPUs and was later sold to Microchip Technology, which still sells it for small embedded applications. An example of this microprocessor is a small PC board called the BASIC Stamp, consisting of 2 ICs - an 18-pin PIC 16C56 CPU (with a BASIC interpreter in 512 word ROM (yes, 512)) and 8-pin 256 byte serial EEPROM (also made by Microchip) on an I/O port where user programs (about 80 tokenized lines of BASIC) are stored.


    Microchip Technology:
    http://www.microchip.com/

    Section Two: Forgotten/Innovative Designs before the Great Dark Cloud

    Part I: RCA 1802, weirdness at its best (1974) .

    The RCA 1802 was an odd beast, extremely simple and fabricated in CMOS, which allowed it to run at 6.4 MHz (at 10V, but very fast for 1974) or suspended with the clock stopped. It was a single chip version of the previous two-chip 1801, an 8 bit processor, with 16 bit addressing, but the major features were its extreme simplicity, and the flexibility of its large register set. Simplicity was the primary design goal, and in that sense it was one of the first "RISC" chips.

    It had sixteen 16-bit registers, which could be accessed as thirty-two 8 bit registers, and an accumulator D used for arithmetic and memory access - memory to D, then D to registers, and vice versa, using one 16-bit register as an address. This led to one person describing the 1802 as having 32 bytes of RAM and 65535 I/O ports. A 4-bit control register P selected any one general register as the program counter, while control registers X and N selected registers for I/O Index, and the operand for current instruction. All instructions were 8 bits - a 4-bit op code (total of 16 operations) and 4-bit operand register stored in N.

    There was no real conditional branching (there were conditional skips which could implement it, though), no subroutine support, and no actual stack, but clever use of the register set allowed these to be implemented - for example, changing P to another register allowed jump to a subroutine. Similarly, on an interrupt P and X were saved, then R1 and R2 were selected for P and X until an RTI restored them.

    A later version, the 1805, was enhanced, adding several Forth language primitives (Forth is commonly used in control applications).


    Apart from the COSMAC microcomputer kit, the 1802 saw action in some video games from RCA and Radio Shack, and the chip is the heart of the Voyager, Viking and Galileo (along with some AMD 2900 bit slice processors) probes. One reason for this is that a version of the 1802 used silicon on sapphire (SOS) technology, which leads to radiation and static resistance, ideal for space operation. It is still available from Harris Semiconductors.

    Harris Semiconductors:
    http://www.semi.harris.com/
    Microcontroller Primer FAQ:
    http://www.hitex.com/FAQ/primer/

    Part II: Fairchild F8, Register windows .

    The F8 was an 8 bit processor. The processor itself didn't have an address bus - program and data memory access were contained in separate units, which reduced the number of pins, and the associated cost. It also featured 64 registers, accessed by the ISAR register in cells (windows) of eight, which meant external RAM wasn't always needed for small applications. In addition, the 2-chip processor didn't need support chips, unlike others which needed seven or more. The F8 inspired other similar CPUs, such as the Intel 8048.

    The use of the ISAR register allowed a subroutine to be entered without saving a bunch of registers, speeding execution - the ISAR would just be changed. Special purpose registers were stored in the second cell (regs 9-15), and the first eight registers were accessed directly (globally).

    The windowing concept was useful, but only the register pointed to by the ISAR could be accessed - to access other registers the ISAR was incremented or decremented through the window.

    Fairchild ended up as part of National Semiconductor, before being spun off again in 1997.


    Fairchild Semiconductor:
    http://www.national.com/fairchild/

    Part III: SC/MP, early advanced multiprocessing (April 1976) . . . .

    The National Semiconductor SC/MP (Single Chip/Micro Processor, nicknamed "Scamp") was a typical 8 bit processor intended for control applications (a simple BASIC 2.5K ROM was added to one version). It featured 16 bit addressing, with 12 address lines and 4 lines borrowed from the data bus (it was common to borrow lines (sometimes all of them) from the data bus for addressing - however only the lower 12 index register/PC bits were incremented (4K pages), special instructions modified upper 4 bits). Internally, it included four index registers (P1 to P3, plus the PC/P0) and two 8 bit registers. It had no stack pointer or subroutine instructions (though they could be emulated with index registers). During interrupts, the PC and P3 were swapped. It was meant for embedded control, and many features were omitted for cost reasons. It was also bit serial internally to keep it cheap.

    The unique feature was the ability to completely share a system bus with other processors. Most processors of the time assumed they were the only ones accessing memory or I/O devices. Multiple SC/MPs (as well as other intelligent devices, such as DMA controllers) could be hooked up to the bus. A control line (ENOUT (Enable Out) to ENIN) could be chained along the processors to allow cooperative processing. This was very advanced for the time, compared to other CPUs, but the bit-serial CPU was slow (even simple instruction took 5-7 cycles, while memory access was 2 cycles, which allowed them to share a memory bus without saturating it, as opposed to a 6502 which could share memory with at most one other CPU, and only then because of the way the CPU clock was used). However this feature was almost never used for multiprocessing.

    In addition to I/O ports like the 8080, the SC/MP also had instructions and one pin for serial input and one for output.

    National Semiconductor eventually replaced the SCMP with the COP4 (4 bit) and COP8 (8 bit) embedded controllers, with only two index registers, but adding stack support.


    National Semiconductor:
    http://www.national.com/
    National Semiconductor Microcontroller Technology:
    http://www.national.com/appinfo/mcu/

    Part IV: F100-L, a self expanding design .

    The Ferranti F100-L was designed by a British company for the British Military. It was an 8 bit processor, with 16 bit addressing, but it could only access 32K of memory (1 bit for indirection).

    The unique feature of the F100-L was that it had a complete control bus available for a coprocessor that could be added on. Any instruction the F100-L couldn't decode was sent directly to the coprocessor for processing. Applications for coprocessors at the time were limited, but the design is still used in some modern processors, such as the National Semiconductor 320xx series (the predecessor of the Swordfish processor, described later), which included FPU, MMU, and other coprocessors that could just be added to the CPU's coprocessor bus in a chain. Other units not foreseen could be added later.

    Ferranti, which built the Ferranti Mark 1 (Britain's first commercial electronic computer), no longer makes microprocessors.


    Part V: The Western Digital 3-chip CPU (June 1976) .

    The Western Digital MCP-1600 was probably the most flexible processor available. It consisted of at least four separate chips, including the control circuitry unit, the ALU, two or four ROM chips with customisable microcode (like the old 4-bit Texas Instruments TMS 1000), and timing circuitry. It doesn't really count as a microprocessor, but neither do bit-slice processors (AMD 2901).

    The ALU chip contained twenty six 8 bit registers and an 8 bit ALU, while the control unit supervised the moving of data, memory access, and other control functions. The ROM allowed the chip to function as either an 8 bit chip or 16 bit, with clever use of the 8 bit ALU. Even more, microcode allowed the addition of floating point routines (40 + 8 bit format), simplifying programming (and possibly producing a Floating Point Coprocessor).

    Two standard microcode ROMS were available. This flexibility was one reason it was also used to implement the DEC LSI-11 processor as well as the WD Pascal Microengine.


    Part VI: Intersil 6100, old design in a new package . . .

    The IMS 6100 was a single chip design of the PDP-8 minicomputer (1965) from DEC (low cost successor to the PDP-5 (1963)). The old PDP-8 design was very strange, and if it hadn't been so popular, an awkward CPU like the 6100 would have never had a reason to exist.

    The 6100 was a 12 bit processor, which had exactly three registers - the PC, AC (an accumulator), and MQ. All 2 operand instructions read AC and MQ, and wrote back to AC. It had a 12 bit address bus, limiting RAM to only 4K. Memory references were 7 bit (128 word) offset either from address 0, or the PC.

    It had no stack. Subroutines stored the PC in the first word of the subroutine code itself, so recursion wasn't possible without fancy programming.

    4K RAM was pretty much hopeless for general purpose use. The 6102 support chip (included on chip in the 6120) added 3 address lines, expanding memory to 32K the same way that the PDP-8/E expanded the PDP-8. Two registers, IFR and DFR, held the page for instructions and data respectively (IFR always used until a data address was detected). At the top of the 4K page, the PC wrapped back to 0, so the last instruction on a page had to load a new value into the IFR if execution was to continue.

    The PDP-8 itself was succeeded by the PDP-11 (though a version called the PDP-12 was produced, it was part of the PDP-8 series, not a replacement). The IMS 6120 was used in the DECmate (1980), DEC's original competition for the IBM PC, but lacked the processor and RAM capacity (a Z-80 or 8086 card could be added (reducing the 6120 to an I/O coprocessor) but lacked IBM PC compatability). DEC also tried competing with the 8086 based Rainbow, and the PDP-11 based PRO-325 personal computers, but none caught on.

    Intersil was eventually bought by Harris Semiconductors, which produces versions of the 8088 and 8086, 1802, and 68HC05.


    PDP-8 Models and Options:
    http://www.cis.ohio-state.edu/hypertext/faq/usenet/dec-faq/pdp8-models/faq.html
    Harris Semiconductors:
    http://www.semi.harris.com/

    Part VII: NOVA, another popular adaptation . . . .

    Like the PDP-8, the Data General Nova was also copied, not just in one, but two implementations - the Data General MN601 (MicroNova), and Fairchild 9440. However, the NOVA (1969) was a more mature design (by PDP-8 designer Edson DeCastro, who came to Data General from DEC).

    The NOVA had four 16-bit accumulators, AC0 to AC3. There were also 15-bit system registers - Program Counter, Stack pointer, and Stack Frame pointer (the last two were only on the MicroNova and Nova 3, not the original Nova or Fairchild CPU). AC2 and AC3 could be used for indexed addresses. Apart from the small register set, the NOVA was an ordinary CPU design.

    Another CPU, the National Semiconductor PACE, was based on the NOVA design, but featured 16 bit addressing, more addressing modes, and a 10 level stack (like the 8008), but lacked hardware multiply and divide.

    The 32 bit ECLIPSE (pre 1983) was Data General's successor to the 16 bit Nova. Like the Nova, the ECLIPSE had four 32 bit integer accumulators, added four stack registers, and four 64 bit floating point registers (in the MV series). There are twelve special purpose registers. The ECLIPSE was eventually implemented in a microprocessor form as well.


    Data General later switched architectures and became an early supporter of the Motorola 88K series load-store microprocessor in the AViiON Unix based systems (designers originally wanted to call it the Nova II, but that idea was rejected, so instead they reversed the name and inserted the II in the middle, switching upper and lower case). Unfortunately, Motorola didn't keep up with competing CPUs (eventually switching its main support to the PowerPC), forcing Data General to invest heavily in multiprocessing to boost performance, until the company gave up on Motorola and switched to Intel Pentium CPUs (as Intergraph did).

    This has nothing to actually do with the Nova CPU, but is a little bit interesting anyway.


    Data General:
    http://www.dg.com/
    Data General Nova:
    http://www.dg.com/about/html/data_general_nova.html

    Part VIII: Signetics 2650, enhanced accumulator based (1978?) .

    Superficially similar to the PDP-8 (and IMS 6100), the Signetics 2650 was based around a set of 8 bit registers with R0 used as an accumulator, and six other registers arranged in two sets (R1A-R3A and R1B-R3B) - a status bit determined which register bank was active. The other registers were generally used for address calculations (ex. offsets) within the 15 bit address range. This kept the instruction set simple - all loads/stores to registers went through R0.

    It also had a subroutine stack of eight 15 bit elements, with no provision for spilling over into memory.

    Signetics was bought by Valvo, which was later bought by Phillips.


    Part IX: Signetics 8x300, Early cambrian DSP ancestor (1978) . .

    Originally developed by a company called SMS, the 8x300 was bought and became a product of Signetics. Presented as a microcontroller, it had many DSP-like features (plus a bipolar fabrication) that made if very fast at the time, for some applications at least, but lacked many standard features and was slighly out of step with some conventions of the time (for example, bits were numbered in reverse, bit 0 as MSB and 7 as LSB).

    The 8x300 could address sixteen registers, but some register addresses were used to specify non-register operations, so only eight 8-bit general purpose registers were available - R0 (AUX, the auxiliary register), R1 to R6, and R9. Register R8 was a single carry bit. In addition, an 8-bit I/O buffer register IVB was available, and is the only way to access data memory (similar to the D register in the RCA 1802) - all data was through 8-bit I/O ports, organised as two banks (left and right) of 256 each plus an address indicating which port of that bank I/O operations would use. The exact operation is specified by the source or destination register field - if it's not an actual register, then it signifies an operation. Ports could be attached directly to memory, or two ports could be used to generate an address, with another as a data buffer if more storage was needed.

    The CPU consisted of multiple units strung together in a pipeline (one instruction at a time, no overlapping stages as in modern CPUs). The first operand could be taken from the IVB (as an I/O operation from an I/O port in the left or right bank) or any of the general registers, the second operand (if any) came from the AUX register. The first operand would be processed through a rotate unit, then a mask unit, to the ALU which performed ony four operations - MOV, ADD, AND and XOR. The result would be returned either to a general register, or could would be processed through a shifter and merge unit to the IVB register (and output to the appropriate left or right I/O port) - this would allow a subfield of the IVB to be replaced by bits from the result, instead of the whole register.

    The design was also limited with no interrupt support, no stack or index registers (though the port addresses could function as such), and no subroutine support (the XEC instruction would execute an instruction without incremeting the PC, and could be used to implement subroutines). Data values couldn't be accessed from program memory.


    IDaSS: ASIC lay-out of a 'Peripheral Control Cell' processor core:
    http://www.eb.ele.tue.nl/proj/idass8x3.html

    Part X: Hitachi 6301 - Small and microcoded (1983) .

    The HD6301 was an 8-bit CPU designed using microcode to bring the simpler design techniques of 16 and 32 bit CPUs at the time down to 8-bit designs. Inspired by the Motorola 6800, the 6301 featured A and B accumulators, one stack and one index register. These, along with the PC, were mapped to a bank of sixteen 8-bit registers (R0L, R0H, R1L etc. up to R7H), which along with two data buffer registers (DBR and DBL), and memory address registers (MARL and MARH), were accessed by the microcode to execute the CPU instructions using one 8-bit ALU and a simpler 8-bit arithmatic unit. A simple 2-stage pipeline was used.


    Part XI: Motorola MC14500B ICU, one bit at a time .

    Probably the limit in small processors was the 1 bit 14500B from Motorola. It had a 4 bit instruction, and controlled a single data read/write line, used for application control. It had no address bus - that was an external unit that was added on. Another CPU could be used to feed control instructions to the 14500B in an application.

    It had only 16 pins, less than a typical RAM chip, and ran at 1 MHz.


    Section Three: The Great Dark Cloud Falls: IBM's Choice.

    Part I: DEC PDP-11, benchmark for the first 16/32 bit generation. (1970) . . . .

    The DEC PDP-11 was the most popular in the PDP (Programmed Data Processors) line of minicomputers, a successor to the previously popular PDP-8 (the PDP-10 (1967) was a higher capacity 36-bit mainframe-like version of the PDP-8, much adored and rumoured to have souls), and remained in production until the decision to discontinue the line as of September 30, 1997 (over 25 years - see note on the DEC Alpha intended lifetime).

    The PDP-11 had eight general purpose 16-bit registers (R0 to R7 - R6 was also the SP and R7 was the PC). It featured powerful register oriented (little-endian, byte addressable) addressing modes. Since the PC was treated as a general purpose register, constants were loaded using an indirect mode on R7 which had the effect of loading the 16 bit word following the current instruction, then incrementing the PC to the next instruction before fetching. The SP could be accessed the same way (and any register could be used for a user stack (useful for FORTH)). A CC (or PSW) register held results from every instruction that executed.

    Adjascent registers could be implicitly grouped into a 32 bit register for multiply and divide results (Multiply result stored in two registers if destination is an even register, not if it's odd. Divide source must be grouped - quotient is stored in high order (low number) register, remainder in low order).

    A floating point unit could be added which contains six 64 bit accumulators (AC0 to AC5, can also be used as six 32-bit registers - values can only be loaded or stored using the first four registers).

    PDP-11 addresses were 16 bits, limiting program space to 64K, though an MMU could be used to expand total address space (18-bits and 22-bits in different PDP-11 versions).

    The LSI-11 (1975-ish) was a popular microprocessor implementation of the PDP-11 using the Western Digital MCP1600 microprogrammable CPU, and the architecture influenced the Motorola 68000, NS 320xx, and Zilog Z-8000 microprocessors in particular. There was also a 32-bit PDP-11 plan as far back as its 1969 introduction. The PDP-11 was finally replaced by the VAX architecture, (early versions included a PDP-11 emulation mode, and were called VAX-11).


    PDP-11 FAQ:
    http://www.village.org/pdp11/faq.pages/faq.html

    Part II: TMS 9900, first of the 16 bits (June 1976) . .

    One of the first true 16 bit microprocessors was the TMS 9900, by Texas Instruments (the first are probably National Semiconductor IMP-16 or AMD 2901 bit slice processors in 16 bit configuration). It was designed as a single chip version of the TI 990 minicomputer series, much like the Intersil 6100 was a single chip PDP-8, and the Fairchild 9440 and Data General mN601 were both one chip versions of Data General's Nova. Unlike the IMS 6100, however, the TMS 9900 had a mature, well thought out design.

    It had a 15 bit address space and two internal 16 bit registers. One unique feature, though, was that all user registers were actually kept in memory - this included stack pointers and the program counter. A single workspace register pointed to the 16 register set in RAM, so when a subroutine was entered or an interrupt was processed, only the single workspace register had to be changed - unlike some CPUs which required a dozen or more register saves before acknowledging a context switch.

    This was feasible at the time because RAM was often faster than the CPUs. A few modern designs, such as the INMOS Transputers, use this same design using caches or rotating buffers, for the same reason of improved context switches. Other chips of the time, such as the 650x series had a similar philosophy, using index registers, but the TMS 9900 went the farthest in this direction. Later versions added a write-through register buffer/cache.

    That wasn't the only positive feature of the chip. It had good interrupt handling features and very good instruction set. Serial I/O was available through address lines. In typical comparisons with the Intel 8086, the TMS9900 had smaller and faster programs. The only disadvantage was the small address space and need for fast RAM.

    Despite the very poor support from Texas Instruments, the TMS 9900 had the potential at one point to surpass the 8086 in popularity. TI also produced an embedded version, the TMS 9940.


    TMS9900 Information Sheet:
    http://www.nashscene.com/~jmoses/contemplate/technico/9900_datasheet.html

    Part III: Zilog Z-8000, another direct competitor . . . .

    The Z-8000 was introduced not long after the 8086, but had superior features. It was basically a 16 bit processor, but could address up to 23 bits in some versions by using segment registers (to supply the upper 7 bits). There was also an unsegmented version, but both could be extended further with an additional MMU that used 64 segment registers. The Z-8070 was a memory mapped FPU.

    Internally, the Z-8000 had sixteen 16 bit registers, but register size and use were exceedingly flexible - the first eight Z-8000 registers could be used as sixteen 8 bit subregisters (identified RH0, RL0, RH1 ...), or all sixteen could be grouped into eight 32 bit registers (RR0, RR2, RR4 ...), or four 64 bit registers. They were all general purpose registers - the stack pointer was typically register 15, with register 14 holding the stack segment (both accessed as one 32 bit register (RR14) for painless address calculations). The instruction set included 32-bit multiply (into 64 bits) and divide.

    The Z-8000 was one of the first to feature two modes, one for the operating system and one for user programs. The user mode prevented the user from messing about with interrupt handling and other potentially dangerous stuff (each mode had its own stack register).

    Finally, like the Z-80, the Z-8000 featured automatic RAM refresh circuitry. Unfortunately the processor was somewhat slow, but the features generally made up for that.

    A later version, the Z-80000, was introduced about at the beginning of 1986, at about the same time as the 32 bit MC68020 and Intel 80386 CPUs, though the Z-80000 was quite a bit more advanced. It was fully expanded to 32 bits internally, giving it sixteen 32 bit physical registers (the 16 bit registers became subregisters), doubling the number of 32 bit and 64 bit registers (sixteen 8-bit and 16-bit subregisters, 32-bit physical registers, eight 64-bit double registers). The system stack remained in RR14.

    In addition to the addressing modes of the Z-8000, larger 24 bit (16Mb) segment addressing was added, as well as an integrated MMU (absent in the 68020 but added later in the 68030) which included an on chip 16 line 256-byte fully associated write-through cache (which could be set to cache only data, instructions, or both, and could also be frozen by software once 'primed' - also found on later versions of the AMD 29K). It also featured multiprocessor support by defining some memory pages to be exclusive and others to be shared (and non-cacheable), with separate memory signals for each (including GREQ (Global memory REQuest) and GACK lines). There was also support for coprocessors, which would monitor the data bus and identify instructions meant for them (the CPU had two coprocessor control lines (one in, one out), and would produce any needed bus transactions).

    Finally, the Z-80000 was fully pipelined (six stages), while the fully pipelined 80486 and 68040 weren't introduced until 1991.

    But despite being technically advanced, the Z-8000 and Z-80000 series never met mainstream acceptance, due to initial bugs in the Z-8000 (the complex design did not use microcode - it used only 17,500 transistors) and to delays in the Z-80000. There was a radiation resistant military version, and a CMOS version of the Z-80000 (the Z-320). Zilog eventually gave up and became a second source for the AT&T WE32000 32-bit (1986) CPU instead (a VAX-like microprocessor derived from the Bellmac 32A minicomputer, which also became obsolete).

    The Z-8001 was used for Commodore's CBM 900 prototype, but the Unix based machine was never released - instead, Commodore bought Amiga, and released the 68000 based machine it was designing. A few companies did produce Z-8000 based computers, with Olivetti being the most famous, and the Plexus P40 being the last - the 68000 quickly became the processor of choice.


    Part IV: Motorola 68000, a refined 16/32 bit CPU (September 1979) . . . . . . . . .

    The initial 8MHz 68000 was actually a 32 bit architecture internally, but had only a 16 bit data bus and 24 bit address bus to fit in a 64 pin package (address and data shared a bus in the 40 pin packages of the 8086 and Z-8000). Later the 68008 reduced the data bus to 8 bits and address to 20 bits, and the 68020 was fully 32 bit externally. Addresses were computed as 32 bits (without using segment registers) - unused upper bits in the 68000 or 68008 bits were ignored, but some programmers stored type tags in the upper 8 bits, causing compatibility problems with the 68020's 32 bit addresses. Lack of forced segments made programming the 68000 easier than some competing processors, without the 64K size limit on directly accessed arrays or data structures.

    Looking back it was a logical design decision, since most 8 bit processors featured direct 16 bit addressing without segments.

    The 68000 had sixteen 32-bit registers, split into eight data and address registers. One address register was reserved for the Stack Pointer. Data registers could be used for any operation, including offset from an address register, but not as the source of an address itself. Operations on address registers were limited to move, add/subtract, or load effective address.

    Like the Z-8000, the 68000 featured a supervisor and user mode (each with its own Stack Pointer). The Z-8000 and 68000 were similar in capabilities, but the 68000 was 32 bit units internally (16 bit ALUs, two in parallel for 32-bit data, one for addresses), making it faster and eliminating forced segments. It was designed for expansion, including specifications for floating point and string operations (floating point was added in the 68040 (1991), with eight 80 bit floating point registers compatible with the 68881/2 coprocessors). Like many other CPUs of the time, the 68000 could fetch the next instruction during execution (a 2 stage pipeline).

    The 68010 (1982) added virtual memory support (the 68000 couldn't restart interrupted instructions) and a special loop mode - small decrement-and-branch loops could be executed from the instruction fetch buffer. The 68020 (1984) expanded external data and address bus to 32 bits,simple 3-stage pipeline, and added a 256 byte cache, while the 68030 (1987) brought the MMU onto the chip (it supported two level pages (logical, physical) rather than the segment/page mapping of the Intel 80386 and IBM S/360 mainframe). The 68040 (January 1991) added fully cached Harvard busses (4K each for data and instructions), 6 stage pipeline, and on chip FPU.

    The 68060 (April 1994) expanded the design to a superscalar version, like the Intel Pentium and NS320xx (Swordfish) series before it. Like the National Semiconductor Swordfish, and later the Nx586, AMD K5, and Intel's "Pentium Pro", the the third stage of the 10-stage 68060 pipeline translates the 680x0 instructions to a decoded RISC-like form (stored in a 16 entry buffer in stage four), and uses resource renaming (with fourty rename registers) to reorder instructions. There is also a branch cache, and branches are folded into the decoded instruction stream like the AT&T Hobbit and other more recent processors, then dispatched to two pipelines (three stages: Decode, addr gen, operand fetch) and finally to two of three execution units - 2 integer, 1 floating point) before reaching two 'writeback' stages. Cache sizes are doubled over the 68040.

    The 68060 also also includes many innovative power-saving features (3.3V operation, execution unit pipelines could actually be shut down, reducing power consumption at the expense of slower execution, and the clock could be reduced to zero) so power use is lower than the 68040 (4-6 watts vs. 3.9-4.9). Another innovation is that simple register-register instructions which don't generate addresses may use the the address stage ALU to execute 2 cycles early.

    The embedded market became the main market for the 680x0 series after workstation venders (and the Apple Macintosh) turned to faster load-store processors, so a variety of embedded versions were introduced. Later, Motorola designed a successor called Coldfire (early 1995), in which complex instructions and addressing modes (added to the 68020) were removed and the instruction set was recoded, simplifying it at the expense of compatibility (source only, not binary) with the 680x0 line.

    The Coldfire 52xx architecture resmbles a stripped (single pipeline) 68060, The 5 stage pipeline is literally folded over itself - after two fetch stages and a 12-byte buffer, instructions pass through the decode and address generate stages, then loop back so the decode becomes the operand fetch stage, and the address generate becomes the execute stage (so only one ALU is required for address and execution calculations). Simple (non-memory) instructions don't need to loop back. There is no translator stage as in the 68060 because Coldfire instructions are already in RISC-like form. The earlier 51xx version has a straight 68040-based 6 stage pipeline and includes 680x0 instructions (in user mode).

    At a quarter the physical size and a fraction of the power consumption, Coldfire is about as fast as a 68040 at the same clock rate, but the smaller design allows a faster clock rate to be acheived.


    Few people wonder why Apple chose the Motorola 68000 for the Macintosh, while IBM's decision to use Intel's 8088 for the IBM PC has baffled many. It wasn't a straightforward decision though. The Apple Lisa was the predecessor to the Macintosh, and also used a 68000 (eventually - 8086 and slower bitslice CPUs (which Steve Wozniak thought were neat) were initially considered before the 68000 was available). It also included a fully multitasking, GUI based operating system, highly integrated software, high capacity (but incompatible) 'twiggy' 5 1/4" disk drives, and a large workstation-like monitor. It was better than the Macintosh in almost every way, but was correspondingly more expensive.

    The Macintosh was to include the best features of the Lisa, but at an affordable price - in fact the original Macintosh came with only 128K of RAM and no expansion slots. Cost was such a factor that the 8 bit Motorola 6809 was the original design choice, and some prototypes were built, but they quickly realised that it didn't have the power for a GUI based OS, and they used the Lisa's 68000, borrowing some of the Lisa low level functions (such as graphics toolkit routines) for the Macintosh.

    Competing personal computers such as the Amiga and Atari ST, and early workstations by Sun, Apollo, NeXT and most others also used 680x0 CPUs (including one of the earliest workstations, the Tandy TRS-80 Model 16, which used a 68000 CPU and Z-80 for I/O and VM support).


    Motorola:
    http://www.mot.com/
    Motorola Microprocessors:
    http://www.mot.com/SPS/MMTG/mp.html
    Absolute Mac US History:
    http://www.absolutemac.com/US/HISTOIRE/history.html
    Amiga International:
    http://www.amiga.de/
    Atari compatible Medusa computers:
    http://www.stud.ee.ethz.ch/~caschwan/medusa.html
    Atari compatible C-Lab computers:
    http://www.ataricentral.com/st/c-lab/mkx.html

    Part V: National Semiconductor 32032, similar but different . . . .

    Like the 68000, the 320xx family consisted of a CPU which was 32-bit internally, and either 32 or 16 (and later 8) bits externally, as indicated by the last two digits. It appeared a little later than the others here, and so was not really a choice for the IBM PC, but is still representative of the era.

    It was similar to the 68000 in basic features, such as byte addressing, 24-bit address bus in the first version, memory to memory instructions, and so on (The 320xx also includes a string and array instruction). Unlike the 68000, the 320xx had eight instead of sixteen 32-bit registers, and they were all general purpose, not split into data and address registers. There was also a useful scaled-index addressing mode, and unlike other CPUs of the time, only a few operations affected the condition codes (as in more modern CPUs).

    Also different, the PC and stack registers were separate from the general register set - they were special purpose registers, along with the interrupt stack, and several "base registers" to provide multitasking support - the base data register pointed to the working memory of the current module (or process), the interrupt base register pointed to a table of interrupt handling procedures anywhere in memory (rather than a fixed location), and the module register pointed to a table of active modules.

    The 320xx also had a coprocessor bus, similar to the 8-bit Ferranti F100-L CPU, and coprocessor instructions. Coprocessors included an MMU, and a Floating Point unit which included eight 32-bit registers, which could be used as four 64-bit registers.

    The series found use mainly in embedded applications, and was expanded to that end, with timers, graphics enhancements, and even a Digital Signal Processor unit in the Swordfish version (1991, also known as 32732 and 32764). THe Swordfish was among the first truely supserscalar microprocessors, with two 5-stage pipelines (integer A, and B, which consisted of an integer and floating point pipeline - an instruction dispatched to B would execute in the appropriate pipe, leaving the other with an empty slot. The integer pipe could cycle twice in the memory stage to synchronise with the result of the floating point pipe, to ensure in-order completion when floating point operations could trap. B could also execute branches). This strategy was invluenced by the Multiflow VLIW design. Instructions were always fetched two at a time from the instruction cache which partially decoded the instruction pairs and set a bit to indicate whether they were dependent or could be issued simultaneously (effectively generating two-word VLIWs in the cache from an external stream of instructions). The cache decoder also generated branch target addresses to reduce branch latency as in the AT&T CRISP/Hobbit CPU.

    The Swordfish implemented the NS32K instruction set using a reduced instruction core - NS32K instructions were translated by the cache decoder into either: one internal instruction, a pair of internal instructions in the cache, or a partially decoded NS32K instruction which would be fully decoded into internal instructions after being fetched by the CPU. The Swordfish also had dynamic bus resizing (8, 16, 32, or 64 bits, allowing 2 instructions to be fetched at once) and clock doubling, 2 DMA channels, and in circuit emulation (ICE) support for debugging.

    The Swordfish was later simplified into a load-store design and used to implement an instruction set called CompactRISC (also known as Pirhana, an implementation independent instruction set supporting designs from 8 to 64 bits).


    It seems interesting to note that in the case of the NS320xx and Z-80000, non mainstream processors gained many advanced design features well ahead of the more mainstream processors, which presumably had more development resources available. One possible reason for this is the greater importance of compatibility in processors used for computers and workstations, which limits the freedom of the designers. Or perhaps the non-mainstream processors were just more flexible designs to begin with. Or some might not have made it to the mainstream because the more ambitious designs resulted in more implementation bugs than competitors.

    National Semiconductor - CompactRISC:
    http://www.national.com/appinfo/compactrisc/

    Part VI: MIL-STD-1750 - Military artificial intelligence (February 1979) .

    The USAF created a draft standard for a 16-bit microprocessor meant to be used in all airborn computers and weapons systems, allowing software developed for one such system to be portable to other similar applications, similar to the intent behind the creation of Ada as the standard high level programming language for the U.S Department of Defense (MIL-STD-1815 accepted October 1979 - 1815 was the year Ada Augusta, Countess of Lovelace and the world's first programmer, was born).

    Like other 16 bit designs of the time, 1750 was inspired by the PDP-11, but differs significantly. Sixteen 16-bit registers were specified, and any adjascent pairs (such as R0+R1 or R1+R2) could be used as 32-bit registers (the Z-8000 and PDP-11 could only use even pairs, and the PDP-11 only for specific uses) for integer or floating point (FP) values (no separate FP registers), or triples for 48-bit extended precision FP (with the mantissa concatenated after the exponent - eg. 32-bit FP was [1s][23mantissa][8exp], 48-bit was [1s][23mantissa][8exp][16ext], meaning any 48-bit FP was also a valid 32-bit FP, only losing the extra precision). Also, only the upper four registers (R12 to R15) could be used as an address base (2 instruction bits instead of 4), and R0 can't be used as an index (using R0 implies no indexing, similar to the PowerPC. R15 is used as an implicit stack pointer, the program counter is not user accessible.

    Address space is 16 bit word addressed (not bytes), but the design allows for an MMU to extend this to 20 bits. In addition, program and data memory can be separated using the MMU. A 4-bit Address State field in the processor status word (PSW) selects one of sixteen page groups, each containing sixteen registers for data memory and another sixteen for program memory (16x16x2 = 512 total). The top 4 bits of an address selects a register from the current AS group, which provides the upper 8 bits of a 20 bit address.

    Each page register also has a 4-bit access key. While other CPUs at the time provided user and supervisor modes, the 1750 provided for sixteen modes, from supervisor (mode 0, could access all pages), fourteen user modes (1 to 14 can only access page with same key, or key 15), and an unprivledged mode (mode 15 can only access page with key 15). Program memory can occupy the same logical address space as data, but will select from the program page registers. Pages can also be write or execute protected.

    Several I/O instructions are also included, and are used to access processor state registers.

    The 1750 is a very practical 16 bit design, and is still being produced, mainly in expensive radiation resistant forms. It did not achieve widespread acceptance, likely because of the rapid advance of technology and the rise of the RISC paradigm.


    CPU Tech:
    http://www.cputech.com/
    CPU Tech MIL-STD-1750A Cores:
    http://www.cputech.com/milstd.htm
    Public Ada Library (PAL):
    http://wuarchive.wustl.edu/languages/ada/
    Origin of Ada:
    http://wuarchive.wustl.edu/languages/ada/ajpo/pol-hist/history/holwg-93/1.htm


    Part VII: Intel 8086, IBM's choice (1978)
    . . . . . . . . . . . . . . . . . . .

    The Intel 8086 was based on the design of the 8080/8085 (source compatible with the 8080) with a similar register set, but was expanded to 16 bits. The Bus Interface Unit fed the instruction stream to the Execution Unit through a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining (8086 instructions varied from 1 to 4 bytes).

    It featured four 16 bit general registers, which could also be accessed as eight 8 bit registers, and four 16 bit index registers (including the stack pointer). The data registers were often used implicitly by instructions, complicating register allocation for temporary values. It featured 64K 8-bit I/O (or 32K 16-bit) ports and fixed vectored interrupts. There were also four segment registers that could be set from index registers.

    The segment registers allowed the CPU to access 1 meg of memory through an odd process. Rather than just supplying missing bytes, as most segmented processors, the 8086 actually added the segment registers ( X 16, or shifted left 4 bits) to the address. As a strange result of this unsuccessful attempt at extending the address space without adding address bits, it was possible to have two pointers with the same value point to two different memory locations, or two pointers with different values pointing to the same location, and limited typical data structures to less than 64K. Most people consider this a brain damaged design (a better method might have been that developed for the MIL-STD-1750 MMU).

    Although this was largely acceptable for assembly language, where control of the segments was complete (it could even be useful then), in higher level languages it caused constant confusion (ex. near/far pointers). Even worse, this made expanding the address space to more than 1 meg difficult. The 80286 (1982?) expanded the design to 32 bits only by adding a new mode (switching from 'Real' to 'Protected' mode was supported, but switching back required using a bug in the original 80286, which then had to be preserved) which greatly increased the number of segments by using a 16 bit selector for a 'segment descriptor', which contained the location within a 24 bit address space, size (still less than 64K), and attributes (for Virtual Memory support) of a segment.

    But all memory access was still restricted to 64K segments until the 80386 (1985), which included much improved addressing: base reg + index reg * scale (1, 2, 4 or 8 bits) + displacement (8 or 32 bit constant = 32 bit address) in the form of paged segments (using six 16-bit segment registers), like the IBM S/360 series, and unlike the Motorola 68030). It also had several processor modes (including separate paged and segmented modes) for compatibility with the previous awkward design. In fact, with the right assembler, code written for the 8008 can still be run on the most recent Pentium Pro. The 80386 also added an MMU, security modes (called "rings" of privledge - kernal, system services, application services, applications) and new op codes in a fashion similar to the Z-80 (and Z-280).

    The 80486 (1989) added full pipelines, single on chip 8K cache, integrated FPU (based on the eight element 80-bit stack-oriented FPU in the 80387 FPU), and clock doubling versions (like the Z-280). The Pentium (late 1993) was superscalar (up to two instructions at once in dual integer units and single FPU) with separate 8K I/D caches.

    The Pentium was the name Intel gave the 80586 version because it could not legally protect the name "586" to prevent other companies from using it - and in fact, the Pentium compatible CPU from NexGen is called the Nx586 (early 1995). Due to its popularity, the 80x86 line has been the most widely cloned processors, from the NEC V20/V30 (slightly faster clones of the 8088/8086 (could also run 8085 code)), AMD and Cyrix clones of the 80386 and 80486, to versions of the Pentium within less than two years of its introduction.

    MMX (initially reported as MultiMedia eXtension, but later said by Intel to mean Matrix Math eXtension) is very similar to the earlier SPARC VIS or HP-PA MAX, or later MIPS MDMX instructions - they perform integer operations on vectors of 8, 16, or 32 bit words, using the 80 bit FPU stack elements as eight 64 bit registers (switching between FPU and MMX modes as needed - it's very difficult to use them as a stack and as MMX registers at the same time). The P55C Pentium version (January 1997) is the first Intel CPU to include MMX instructions, followed by the AMD K6, and Pentium II. Cyrix also added these instructions in its M2 CPU (6x86MX, June 1997), as well as IDT with its C6.

    Interestingly, the old architecture is such a barrier to improvements that most of the Pentium compatible CPUs (NexGen Nx586/Nx686, AMD K5, IDT-C6), and even the "Pentium Pro" (Pentium's successor, late 1995) don't clone the Pentium, but emulate it with specialized hardware decoders like those introduced in the National Semiconductor Swordfish, which convert Pentium instructions to RISC-like instructions which are executed on specially designed superscalar RISC-style cores faster than the Pentium itself. Intel also used BiCMOS in the Pentium and Pentium Pro to achieve clock rates competitive with CMOS load-store processors (the Pentium P55C (early 1997) version is a pure CMOS design).

    Persistant rumours that IBM was developing hardware to translate Pentium instructions for the PowerPC in a similar manner as part of the PowerPC 615 CPU eventually died.

    The Cyrix 6x86 (early 1996), initially manufactured by IBM before Cyrix merged with National Semiconductor, still directly executes 80x86 instructions in two pipelines, but partly out of order, making it faster than a Pentium. Cyrix also sells an integrated version with graphics and audio on-chip called the MediaGX. MMX instructions were added to the 6x86MX, and 3D graphics instructions to the 6x86MXi.

    The Pentium Pro (code named "P6") is a 1 or 2-chip (CPU plus 256K or 512K L2 cache - I/D L1 cache (8K each) is on the CPU), 14-stage superpipelined processor. It uses extensive multiple branch prediction and speculative execution via register renaming. Three decoders (one for complex instructions, two for simpler ones (four or fewer micro-ops)) each decode one 80x86 instruction into micro-ops (one per simple decoder + up to four from the complex decoder = three to six per cycle). Up to five (usually three) micro-ops can be issued in parallel and out of order (six units - FPU, 2 integer, 2 address, 1 load/store), but are held and retired (results written to registers or memory) as a group to prevent an inconsistant state (equivalent to half an instruction being executed when an interrupt occurs, for example). 80x86 instructions may produce several micro-ops in CPUs like this (and the Nx586 and AMD K5), so the actual instruction rate is lower. In fact, due to problems handling instruction alignment in the Pentium Pro, emulated 16-bit instructions execute slower than on a Pentium. The Pentium II (April 1997) version of the Pentium Pro added MMX instructions, doubled cache to 32K, and was packaged in a processor card instead of an IC package.

    The AMD K5 translates 80x86 code to ROPs (RISC OPerations), which execute on a RISC-style core based on the superscalar AMD 29K. Up to four ROPs can be dispatched to six units (two integer, one FPU, two load/store, one branch unit), and five can be retired at a time. The complexity led to low clock speeds for the K5, prompting AMD to buy NexGen and integrate its designs for the next generation K6.

    The NexGen/AMD Nx586 (early 1995) is unique by being able to execute its micro-ops (called RISC86 code) directly, allowing optimised RISC86 programs to be written which are faster than an equivalent x86 program would be, but this feature is seldom used. It also features two 16K I/D L1 caches, a dedicated L2 cache bus (like that in the Pentium Pro 2-chip module) and an off-chip FPU (either separate chip, or later as in 2-chip module).

    The Nx586 sucessor, the K6 (April 1997) actually has three caches - 32K each for data and instructions, and a half-size 16K cache containing instruction decode information. It also brings the FPU on-chip and eliminates the dedicated cache bus of the Nx586, allowing it to be pin-compatible with the P54C model Pentium. Another decoder is added (two complex decoders, compared to the Pentium Pro's one complex and two simple decoders) producing up to four micro-ops and issuing up to six (to seven units - load, store, complex/simple integer, FPU, branch, multimedia) and retiring four per cycle. It includes MMX instructions, licensed from Intel, and AMD has designed and added 3DNow! graphics extensions without waiting for Intel's 3D MMX extensions.

    Centaur, a subsidiary of Integrated Device Technology, will produce the IDT-C6 (May 1997), which uses a much simpler (6-stage, 2 way integer/simple-FPU execution) desgn than Intel and AMD translation-based designs by using micro-ops more closely resembling 80x86 than RISC code, which allows for a higher clock rate and larger L1 (32K each I/D) and TLB caches in a lower cost, lower power consumption design. Simplifications include replacing branch prediction (less important with a short pipeline) with an eight entry call/return stack, depending more on caches. The FPU unit includes MMX support (the C6+ version added a second FPU/MMX unit and 3D graphics enhancements).

    Intel, with partner Hewlett-Packard, has begun development of a next generation 64-bit processor (code named Merced or IA-64). It is expected to be a variable length instruction group (VLIG or what HP/Intel call EPIC (Explicit Parallel Instruction Computing)) with instruction dependencies grouped from 1 to 9+, in 128 bit bundles of three instructions plus three "template bits" which indicate dependancies. Most instructions are predicated, a design very similar to the TI 320C6x DSP, but with 128 general 64-bit and 128 floating point registers, and 64 predicate bits (a type of condition code). To reduce page faults, speculative load instructions execute a load, but does not trap if there is an exception until a second instruction completes it - if the second instruction is predicated and never executes, then a page fault is avoided, and loads can be rescheduled more flexibly. It's expectged to be compatible in some way with both the PA-RISC and 80x86 - it will include 80x86 data and segment registers, with additional instructions to switch between instruction/register sets and transfer data between 80x86 and IA-64 registers. It is expected to translate 80x86 instructions into VLIW instructions (or directly to decoded instructions) the same way that Pentium Pro and AMD K5/K6 CPUs do, but with a larger number of instructions issued using the VLIW design, it should be faster. However, native IA-64 code should be even faster, and this may finally produce the incentive to let the 80x86 architecture finally fade away.


    So why did IBM chose the 8-bit 8088 (1979) version of the 8086 for the IBM 5150 PC (1981) when most of the alternatives were so much better? Apparently IBM's own engineers wanted to use the 68000, and it was used later in the forgotten IBM Instruments 9000 Laboratory Computer, but IBM already had rights to manufacture the 8086, in exchange for giving Intel the rights to its bubble memory designs. Apparently IBM was using 8086s in the IBM Displaywriter word processor.

    Other factors were the fact that the the 8-bit 8088 could use existing low cost 8085-type components, and allowed the computer to be based on a modified 8085 design. 68000 components were not widely available, though it could use 6800 components to an extent. After the failure and expense of the IBM 5100 (1974/5/6? - their first attempt at a peronal computer - discrete random logic CPU with no bus, built in BASIC and APL as the OS), cost was a large factor in the design of the PC.

    The availability of CP/M-86 is also likely a factor, since CP/M was the operating system standard for the computer industry at the time. However Digital Research founder Gary Kildall was unhappy with the legal demands of IBM, so Microsoft, a programming language company, was hired instead to provide the operating system (initially known at varying times as QDOS, SCP-DOS, and finally 86-DOS, it was purchased by Microsoft from Seattle Computer Products and renamed MS-DOS).

    Digital Research did eventually produce CP/M 68K for the 68000 series, making the operating system choice less relevant than other factors.

    Intel bubble memory was on the market for a while, but faded away as better and cheaper memory technologies arrived.


    Intel Corporation:
    http://www.intel.com/
    Intel Product Info:
    http://www.intel.com/intel/product/index.htm
    AMD, Inc.:
    http://www.amd.com/
    AMD PC Processors Plus:
    http://www.amd.com/products/cpg/cpg.html
    Cyrix:
    http://www.cyrix.com/
    Centaur:
    http://www.centtech.com/
    Hewlett-Packard:
    http://hpcc920.external.hp.com/
    HP Microprocessors:
    http://hpcc920.external.hp.com/computing/framed/technology/micropro/
    Interview with Bill Gates:
    http://innovate.si.edu/history/gates/gatestoc.htm

    Section Four: Unix and RISC, a New Hope

    Part I: TRON, between the ages (1987) . .

    TRON stands for The Real-time Operating Nucleus, and was a grand scheme devised by conceived by Prof. Takeshi Sakamura of the University of Tokyo around 1984 to design a unified architecture for computer systems from the CPU, to operating systems and languages, to large scale networks. The TRON CPU was designed just as load-store architectures were set to rise, but retained the memory-data design philosophies - it could be considered a last gasp, though that doesn't do justice to the intent behind the design and its part in the TRON architecture.

    The basic design is scalable, from 32 to 48 and 64 bit designs, with 16 general purpose registers. It is a memory-data instruction set, but an elegant one. One early design was the Mitsubishi M32 (mid 1987), which optimised the simple and often used TRON instructions, much like the 80486 and 68040 did. It featured a 5 stage pipeline, dynamic branch prediction with a target branch buffer similar to that in the AMD 29K. It also featured an instruction prefetch queue, but being a prototype, had no MMU support or FPU.

    Commercial versions such as the Gmicro/200 (1988) and other Gmicro/ from Fujitsu/Hitachi/Mitsubishi, and the Toshiba Tx1 were also introduced, and a 64 bit version (CHIP64) began development, but they didn't catch on in the non-Japanese market (definitive specifications or descriptions of the OS's actual operation were hard to come by, while research systems like Mach of BSD Unix were widely available for experimentation). In addition, newer techniques (such as load-store designs) overshadowed the TRON standard. Companies such as Hitachi switched to load-store designs, and many American companies (Sun, MIPS) licensed their (faster) designs openly to Japanese companies. TRON's promise of a unified architecture (when complete) was less important to companies than raw performance and immediate compatibility (Unix, MS-DOS/MS Windows, Macintosh), and has not become significant in the industry, though TRON operating system development continued as an embedded and distributed operating system (such as the Intelligent House project, or more recently the TiPO handheld digital assistant from Seiko (February 1997)) implemented on non-TRON CPUs.

    NEC produced a similar memory-data design around the same time, the V60/V70 series, using thirty two registers, a seven stage pipeline, and preprocessed branches. NEC later developed the 32-bit load-store V800 series, and became a source of 64-bit MIPS load-store processors.


    TRON Project Information
    http://tron.um.u-tokyo.ac.jp/TRON/
    Kahaner's report on TRON
    http://www.baker.com/grand-unification-theory/archive/199103/04.html

    Part II: SPARC, an extreme windowed RISC (1987) . .

    SPARC, or the Scalable (originally Sun) Processor ARChitecture was designed by Sun Microsystems for their own use. Sun was a maker of workstations, and used standard 68000-based CPUs and a standard operating system, Unix. Research versions of load-store processors had promised a major step forward in speed [See Appendix A], but existing manufacturers were slow to introduce a RISC processor, so Sun went ahead and developed its own (based on Berkeley's design). In keeping with their open philosophy, they licensed it to other companies, rather than manufacture it themselves.

    SPARC was not the first RISC processor. The AMD 29000 (see below) came before it, as did the MIPS R2000 (based on Stanford's experimental design) and Hewlett-Packard PA-RISC CPU, among others. The SPARC design was radical at the time, even omitting multiple cycle multiply and divide instructions (added in later versions), using single-cycle "step" instructions instead, while most RISC CPUs were more conventional.

    SPARC usually contains about 128 or 144 registers, (memory-data designs typically had 16 or less). At each time 32 registers are available - 8 are global, the rest are allocated in a 'window' from a stack of registers. The window is moved 16 registers down the stack during a function call, so that the upper and lower 8 registers are shared between functions, to pass and return values, and 8 are local. The window is moved up on return, so registers are loaded or saved only at the top or bottom of the register stack. This allows functions to be called in as little as 1 cycle. Like most RISC processors, global register zero is wired to zero to simplify instructions, and SPARC is pipelined for performance (a new instruction can start execution before a previous one has finished), but not as deeply as others - like the MIPS CPUs, it has branch delay slots. Also like previous processors, a dedicated CCR holds comparison results.

    SPARC is 'scalable' mainly because the register stack can be expanded (up to 512, or 32 windows), to reduce loads and saves between functions, or scaled down to reduce interrupt or context switch time, when the entire register set has to be saved. Function calls are usually much more frequent than interrupts, so the large register set is usually a plus, but compilers now can usually produce code which uses a fixed register set as efficiently as a windowed register set across function calls.

    SPARC is not a chip, but a specification, and so there are various designs of it. It has undergone revisions, and now has multiply and divide instructions. Original versions were 32 bits, but 64 bit and superscalar versions were designed and implemented (beginning with the Texas Instruments SuperSparc in late 1992), but performance lagged behind other load-store and even Intel 80x86 processors until the UltraSPARC (late 1995) from Texas Instruments and Sun, and superscalar HAL/Fuji SPARC64 multichip CPU.

    The UltraSPARC is a 64-bit superscalar processor series which can issue up to four instructions at once (but not out of order) to any of nine units: two integer units, two of the five floating point/graphics units, the branch and load/store unit. The UltraSparc also added a block move instruction which bypasses the caches (2-way 16K instr, 16K direct mapped data), to avoid disrupting it, and specialized pixel operations (VIS - the Visual Instruction Set) which can operate in parallel on 8, 16, or 32-bit integer values packed in a 64-bit floating point register (for example, four 8 X 16 -> 16 bit multiplications in a 64 bit word, a sort of simple SIMD/vector operation. More extensive than the Intel MMX instructions, or earlier HP PA-RISC MAX and Motorola 88110 graphics extensions, VIS also includes some 3D to 2D conversion, edge processing and pixes distance (for MPEG, pattern-matching support).

    The HAL/Fuji SPARC64 can issue up to four in order instructions simultaneously to four buffers, then to four integer, two floating point, two load/store, and the branch unit, and may complete out of order (an instruction completes when it finishes without error, is committed when all instructions ahead of it have completed, and is retired when its resources are freed - these are 'invisible' stages in the SPARC64 pipeline). A combination of register renaming, a branch history table, and processor state storage (like in the Motorola 88K) allow for speculative execution while maintaining precise exceptions/interrupts (renamed integer, floating, and CC registers - trap levels are also renamed and can be entered speculatively).


    Sun Microsystems:
    http://www.sun.com/
    Sun Microprocessor Products:
    http://www.sun.com/microelectronics/products/microproc.html
    HAL Computer Systems, Inc.:
    http://www.hal.com/

    Part III: AMD 29000, a flexible register set (1987) . .

    The AMD 29000 is another load-store CPU descended from the Berkeley RISC design (and the IBM 801 project), as a modern successor to the earlier 2900 bitslice series (beginning around 1981). Like the SPARC design that was introduced shortly later, the 29000 has a large set of registers split into local and global sets. But though it was introduced before the SPARC, it has a more elegant method of register management.

    The 29000 has 64 global registers, in comparison to the SPARC's eight. In addition, the 29000 allows variable sized windows allocated from the 128 register stack cache. The current window or stack frame is indicated by a stack pointer (a modern version of the ISAR register in the Fairchild F8 CPU), a pointer to the caller's frame is stored in the current frame, like in an ordinary stack (directly supporting stack languages like C, a CISC-like philosophy). Spills and fills occur only at the ends of the cache, and registers are saved/loaded from the memory stack. This allows variable window sizes, from 1 to 128 registers. This flexibility, plus the large set of global registers, makes register allocation easier than in SPARC (optimised stack operations also make it ideal for a stack-oriented interpreted languages such as PostScript, making it popular as a laser printer controller).

    There is no special condition code register - any general register is used instead, allowing several condition codes to be retained, though this sometimes makes code more complex. An instruction prefetch buffer (using burst mode) ensures a steady instruction stream. Branches to another stream can cause a delay, so the first four new instructions are cached - next time a cached branch (up to sixteen) is taken, the cache supplies instructions during the initial memory access delay.

    Registers aren't saved during interrupts, allowing the interrupt routine to determine whether the overhead is worthwhile. In addition, a form of register access control is provided. All registers can be protected, in blocks of 4, from access. These features make the 29000 useful for embedded applications, which is where most of these processors are used, allowing it at one point to claim the title of 'the most popular RISC processor'. The 29000 also includes an MMU and support for the 29027 FPU. The superscalar 29050 version in 1990 integrated a redesigned FPU (4 instructions could be dispatched to execute out of order and speculatively).

    In late 1995 Advanced Micro Devices dropped development of the 29K in favour of its more profitable clones of Intel 80x86 processors, although much of the development of the superscalar core for a new AMD 29000 (including FPU designs from the 29050) was shared with the 'K5' (1995) Pentium compatible processor (the 'K5' translates 80x86 instructions to RISC-like instructions, and dispatches up to five at once to two integer units, one FPU, a branch and a load/store unit).


    AMD, Inc.:
    http://www.amd.com/
    AMD 29K Family:
    http://www.amd.com/products/lpd/29k/29kindex.html

    Part IV:Siemens 80C166, Embedded load-store with register windows. . .

    The Siemens 80C166 was designed as a very low-cost embedded 8/16-bit load-store processor, with RAM (1 to 2K) kept on-chip for lower cost. This leads to some unusual versions of normal RISC features.

    The 80C166 has sixteen 16 bit registers, with the lower eight usable as sixteen 8 bit registers, which are stored in overlapping windows (like in the SPARC) in the on-chip RAM (or register bank), pointed to by the Context Pointer (CP) (similar to the SP in the AMD 29K). Unlike the SPARC, register windows can overlap by a variable amount (controlled by the CP), and the there are no spills or fills because the registers are considered part of the RAM address space (like in the TMS 9900), and could even extend to off chip RAM. This eliminates wasted registers of SPARC style windows.

    Address space (18 to 24 bits) is segmented (64K code segments with a separate code segment register, 16K data segments with upper two bits of 16 bit address selecting one of four data segment registers).

    The 80C166 has 32 bit instructions, while it's a 16 bit processor (compared to the Hitachi SH, which is a 32 bit CPU with 16 bit instructions). It uses a four stage pipeline, with a limited (one instruction) branch cache.


    Siemens Components Inc.:
    http://www.sci.siemens.com/
    Siements 8-bit and 16-bit Microcontrollers:
    http://www.sci.siemens.com/htdocs/catalog/ICs/02.html

    Part V: MIPS R2000, the other approach. (June 1986) . . . . . . . . .

    The R2000 design came from the Stanford MIPS project, which stood for Microprocessor without Interlocked Pipeline Stages [See Appendix A], and was arguably the first commercial RISC processor (other candidates are the ARM and IBM ROMP used in the IBM PC/RT workstation, which was designed around 1981 but delayed until 1986). It was intended to simplify processor design by eliminating hardware interlocks between the five pipeline stages. This means that only single execution cycle instructions can access the thirty two 32 bit general registers, so that the compiler can schedule them to avoid conflicts. This also means that LOAD/STORE and branch instructions have a 1 cycle delay to account for. However, because of the importance of multiply and divide instructions, a special HI/LO pair of multiply/divide registers exist which do have hardware interlocks, since these take several cycles to execute and produce scheduling difficulties.

    Like the AMD 29000 and DEC Alpha, the R2000 has no condition code register considering it a potential bottleneck. The PC is user readable. The CPU includes an MMU unit that can also control a cache, and the CPU was one of the first which could operate as a big or little endian processor. An FPU, the R2010, is also specified for the processor.

    Newer versions included the R3000 (1988), with improved cache control, and the R4000 (1991) (expanded to 64 bits and is superpipelined (twice as many pipeline stages do less work at each stage, allowing a higher clock rate and twice as many instructions in the pipeline at once, at the expense of increased latency when the pipeline can't be filled, such as during a branch, (and requiring interlocks added between stages for compatibility, making the original "I" in the "MIPS" acronym meaningless))). The R4400 and above integrated the FPU with on-chip caches. The R4600 and later versions abandoned superpipelines.

    The superscalar R8000 (1994) was optimised for floating point operation, issuing two integer or load/store operations (from four integer and two load/store units) and two floating point operations simultaneously (FP instructions sent to the independent R8010 floating point coprocessor (with its own set of thirty-two 64-bit registers and load/store queues)).

    The R10000 and R12000 versions (early 1996 and May 1997) added multiple FPU units, as well as almost every advanced modern CPU feature, including separate 2-way I/D caches (32K each) plus on-chip secondary controller (and high speed 8-way split transaction bus (up to 8 transactions can be issued before the first completes)), superscalar execution (load four, dispatch five instructions (may be out of order) to any of two integer, two floating point, and one load/store units), dynamic register renaming (integer and floating point rename registers (thirty two in the R10K, fourty eight in the R12K)), and an instruction cache where instructions are partially decoded when loaded into the cache, simplifying the processor decode (and register rename/issue) stage. This technique was first implemented in the AT&T CRISP/Hobbit CPU, described later. Branch prediction and target caches are also included.

    The 2-way (int/float) superscalar R5000 (January, 1996) was added to fill the gap between R4600 and R10000, without any fancy features (out of order or branch prediction buffers). For embedded applications, MIPS and LSI Logic added a compact 16 bit instruction set which can be mixed with the 32 bit set (same as the ARM Thumb 16 bit extension), implemented in a CPU called TinyRISC (October 1996), as well as MIPS V and MDMX (MIPS Digital Multimedia Extensions, announced October 1996)). MIPS V adds parallel floating point (two 32 bit fields in 64 bit registers) operations (compared to similar HP MAX integer or Sun VIS and Intel MMX floating point unit extensions), MDMX adds integer 8 or 16 bit subwords in 64 bit FPU registers and a 24 and 48 bit subwords in a 192 bit accumulator for multimedia instructions (a MAC instruction on an 8-bit value can produce a 24-bit result, hence the large accumulator). Vector-scalar operations (ex: multiply all subwords in a register by subword 3 from another register) are also supported. These extensive instructions are partly derived from Cray vector instructions (Cray is owned by SGI, the parent company of MIPS), and are much more extensive than the earlier multimedia extensions of other CPUs. Future versions are expected to add Java virtual machine support.


    Silicon Graphics MIPS Group:
    http://www.sgi.com/MIPS/

    Part VI: Hewlett-Packard PA-RISC, a conservative RISC (Oct 1986) . . . . . .

    A design typical of many load-store processors, the PA-RISC (Precision Architecture, originally code-named Spectrum) was designed to replace older processors in HP-3000 MPE minicomputers, and Motorola 680x0 processors in the HP-9000 HP/UX Unix minicomputers and workstations. It has an unusually large instruction set for a RISC processor (including a conditional (predicated) skip instruction similar to those in the ARM processor), partly because initial design took place before RISC philosophy was popular, and partly because careful analysis showed that performance benefited from the instructions chosen - in fact, version 1.1 added new multiple operation instructions combined from frequent instruction sequences, and HP was among the first to add multimedia instructions (the MAX-1 and MAX-2 instructions, similar to Sun VIS or Intel MMX). Despite this, it's a simple design - the entire original CPU had only 115,000 transistors, less than twice the much older 68000.

    Much of the RISC philosophy was independently invented at HP from lessons learned from FOCUS (pre 1984), HP's (and the world's) first fully 32 bit microprocessor. It was a huge (at the time) 450,000 transistor chip with a stack based instruction set, described as "essentially a gigantic microcode ROM with a simple 32 bit data path bolted to its side". Performance wasn't spectacular, but it was used in a pre-Unix workstation from HP.

    It's almost the cannonical load-store design, similar except in details to most other mainstream load-store processors like the Fairchild/Intergraph Clipper (1986), and the Motorola 88K in particular. It has a 5 stage pipeline, which (unlike early MIPS (R2000) processors) had hardware interlocks from the beginning for instructions which take more than one cycle, as well as result forwarding (a result can be used by a previous instruction without waiting for it to be stored in a register first).

    It is a load/story architecture, originally with a single instruction/data bus, later expanded to a Harvard architecture (separate instruction and data buses). It has thirty-two 32-bit integer registers (GR0 wired to constant 0, GR31 used as a link register for procedure calls), with seven 'shadow registers' which preserve the contents of a subset of the GR set during fast interrupts (also like ARM), and thirty-two 64-bit floating point registers (also as sixty-four 32-bit and sixteen 128-bit), in an FPU (which could execute a floating point instruction simultaneously, from the Apollo-designed Prism architecture (1988?) after Hewlett-Packard acquired the company). Later versions (the PA-RISC 7200 in 1994) added a second integer unit (still dispatching only two instructions at a time to any of the three units). Addressing originally was 48 bits, and expanded to 64 bits, using a segmented addressing scheme.

    The PA-RISC 7200 also included a tightly integrated cache and MMU, a high speed 64-bit 'Runway' bus, and a fast but complex fully associative 2KB on-chip assist cache, between the simpler direct-mapped data cache and main memory, which reduces thrashing (repeatedly loading the same cache line) when two memory addresses are aliased (mapped to the same cache line). Instructions are predecoded into a separate instruction cache (like the AT&T CRISP/Hobbit).

    The PA-RISC 8000 (April 1996), intended to compete with the R10000, UltraSparc, and others) expands the registers and architecture to 64 bits (eliminating the need for segments), and adds aggressive superscalar design - up to 5 instructions out of order, using fifty six rename registers, to ten units (five pairs of: ALU, shift/merge, FPU mult/add, divide/sqrt, load/store). The CPU is split in two, with load/store (high latency) instructions dispatched from a separate queue from operations (except for branch or read/modify/write instructions, which are copied to both queues). It also has a deep pipeline and speculative execution of branches (many of the same features as the R10000, in a very elegant implementation).

    The PA-RISC 8500 breaks with HP tradition (in a big way) and adds on-chip cache - 1.5Mb L1 cache.

    Although typically sporting fewer of the advanced (and promised) features of competing CPUs designs, a simple elegant design and effective instruction set has kept PA-RISC performance among the best in its class (of those actually available at the same time) since its introduction.

    HP pioneered the addition of multimedia instructions with the MAX-1 (Multimedia Acceleration eXtension) extensions in the PA-7100LC (pre-1994) and 64-bit (version 2.0) MAX-2 extensions in the PA-8000, which allowed vector operations on two or four 16-bit subwords in 32-bit or 64-bit integer registers (this only required circuitry to slice the integer ALU (similar to bit-slice processors, such as the AMD 2901), adding only 0.1 percent to the PA-8000 CPU area - using the FPU registers like Sun's VIS and Intels MMX do would have required duplicating ALU functions. 8 and 32-bit support, multiplication, and complex instructions were also left out in favour of powerful 'mix' and 'permute' packing/unpacking operations).

    In the future Hewlett-Packard plans to pursue a "post-VLIW" (Very Long Instruction Word) design in conjunction with Intel code named "Merced" or IA-64, possibly expanding on the idea of MAX or MMX operations. Some of the newer CPUs which execute Intel 80x86 instructions (The AMD 'K5' and NexGen Nx586, for example) treat 80x86 instructions as VLIW instructions, decoding them into RISC-like instructions and executing several concurrently. However, it would more likely follow the VLIW design of the TI 320C6x DSP.


    Hewlett-Packard:
    http://hpcc920.external.hp.com/
    HP Microprocessors:
    http://hpcc920.external.hp.com/computing/framed/technology/micropro/

    Part VII: Motorola 88000, Late but elegant (mid 1988) . . . .

    The Motorola 88000 (originally named the 78000) is a 32 bit processor, one of the first load-store CPUs based on a Harvard architecture (the same as the Fairchild/Intergraph Clipper C100 (1986) beat it by 2 years). Each bus has a separate cache, so simultaneous data and instruction access doesn't conflict. Except for this, it is similar to the Hewlett Packard Precision Architecture (HP/PA) in design (including many control/status registers only visible in supervisor mode), though the 88000 is more modular, has a small and elegant instruction set, no special status register (condition codes may be stored in any general register), and lacks segmented addressing (limiting addressing to 32 bits, vs. 64 bits). The 88200 MMU unit also provides dual caches (including multiprocessor support) and MMU functions for the 88100 CPU (like the Clipper). The 88110 includes caches and MMU on-chip.

    The 88000 has thirty-two 32 bit user registers, with up to 8 distinct internal function units - an ALU and a floating point unit (sharing the single register set) in the 88100 version, multiple ALU and FPU units (with thirty-two 80-bit FPU registers) and two-issue instuctions were added to the 88110 to produce one of the first superscalar designs (following the National Semiconductor Swordfish). Other units could be designed and added to produce custom designs for customers, and the 88110 added a graphics/bit unit which pack or unpack 4, 8 or 16-bit integers (pixels) within 32-bit words, and multiply packed bytes by an 8-bit value. But it was introduced late and never became as popular in major systems as the MIPS or HP processors. Development (and performance) has lagged as Motorola favoured the PowerPC CPU, coproduced with IBM.

    Like the most modern processors, the 88000 is pipelined (with interlocks), and has result forwarding (in the 88110 one ALU can feed a result directly into another for the next cycle). Loads and saves in the 88110 are buffered so the processor doesn't have to wait, except when loading from a memory location still waiting for a save to complete. The 88110 also has a history buffer for speculatively executing branches and to make interrupts 'precise' (they're imprecise in the 88100). The history buffer is used to 'undo' the results of speculative execution or to restore the processor to 'state' when the interrupt occurred - a 1 cycle penalty, as opposed to 'register renaming' which buffers results in another register and either discards or saves it as needed, without penalty.


    Part VIII: Fairchild/Intergraph Clipper, An also-ran (1986) . .

    The Clipper C100 was developed by Fairchild, later sold to workstation maker Intergraph, which took over chip development (produced the C300 in 1988) until it decided it couldn't compete in processor technology, and switched to Intel 80x86-based processors (Fairchild itself was bought by National Semiconductor).

    The C100 was a three-chip set like the Motorola 88000 (but predating it by two years), with a Harvard architecture CPU and separate MMU/cache chips for instruction and data. It differed from the 88K and HP PA-RISC in having sixteen 32-bit user registers and eight 64-bit FPU registers, rather than the more common thirty-two, and 16 and 32 bit instruction lengths.

    The only other distinguishing features of the Clipper are a bank of sixteen supervisor registers which completely replace the user registers, (the ARM replaces half the user registers on an FIRQ interrupt) and the addition of some microcode instructions like in the Intel i960.


    Part IX: Acorn ARM, RISC for the masses (1986) . . . .

    ARM (Advanced RISC Machine, originally Acorn RISC Machine) is often praised as one of the most elegant modern processors in existence. It was meant to be "MIPs for the masses", and designed as part of a family of chips (ARM - CPU, MEMC - MMU and DRAM/ROM controller, VIDC - video and DAC, IOC - I/O, timing, interrupts, etc), for the Archimedes home computer (multitasking OS, windows, etc). It's made by VLSI Technologies Inc, and based partly on the Berkeley experimental load-store design. It is simple, with a short 3-stage pipeline, and it can operate in big- or little-endian mode.

    The original ARM (ARM1, 2 and 3) was a 32 bit CPU, but used 26 bit addressing. The newer ARM6xx spec is completely 32 bits. It has user, supervisor, and various interrupt modes (including 26 bit modes for ARM2 compatibility). The ARM architecture has sixteen registers (including user visible PC as R15) with a multiple load/save instruction, though many registers are shadowed in interrupt modes (2 in supervisor and IRQ, 7 in FIRQ) so need not be saved, for fast response. The instruction set is reminiscent of the 6502, used in Acorns earlier computers.

    A feature introduced by the ARM is that every instruction is predicated, using a 4 bit condition code (including 'never execute', not officially recommended), an idea later used in some HP PA-RISC instructions and the TI 320C6x DSP. Another bit indicates whether the instruction should set condition codes, so intervening instructions don't change them. This easily eliminates many branches and can speed execution. Another unique and useful feature is a barrel shifter which operates on the second operand of most ALU operations, allowing shifts to be combined with most operations (and index registers for addressing), effectively combining two or more instructions into one (similar to the earlier design of the funky Signetics 8x300).

    These features make ARM code both dense (unlike most load-store processors) and efficient, despite the relatively low clock rate and short pipeline - it is roughly equivalent to a much more complex 80486 in speed. And like the Motorola Coldfire, ARM has developed a low cost 16-bit version called Thumb, which recodes a subset of ARM CPU instructions into 16 bits (decoded to native 32-bit ARM instructions without penalty - similar to the CISC decoders in the newest 80x86 compatible and 68060 processors, except they decode native instructions into a newer one, while Thumb does the reverse). Thumb programs can be 30-40% smaller than already dense ARM programs. Native ARM code can be mixed with Thumb code when the full instruction set is needed.

    The ARM series consists of the ARM6 CPU core (35,000 transistors, which can be used as the basis for a custom CPU) the ARM60 base CPU, and the ARM600 which also includes 4K 64-way set-associative cache, MMU, write buffer, and coprocessor interface (for FPU). A newer version, the ARM7 series (Dec 1994), increases performance by optimising the multiplier, and adding DSP-like extensions including 32 bit and 64 bit multiply and multiply/accumulate instructions (operand data paths lead from registers through the multiplier, then the shifter (one operand), and then to the integer ALU for up to three independent operations). It also doubles cache size to 8K, includes embedded In Circuit Emulator (ICE) support, and raises the clock rate significantly.

    A full DSP coprocessor (codenamed Piccolo, expected second half 1997) adds an independent set of sixteen 32-bit registers (also accessable as thirty two 16 bit registers), four which can be used as 48 bit registers, and a complete DSP instruction set (including four level zero-overhead loop operations), using a load-store model similar to the ARM itself. The coprocessor has its own program counter, interacting with the CPU which performs data load/store through input/output buffers connected to the coprocessor bus (similar but more intelligent than the address unit in a typical DSP (such as the Motorola 56K) supporting the data unit). The coprocessor shares the main ARM bus, but uses a separate instruction buffer to reduce conflict. Two 16 bit values packed in 32 bit registers can be computed in parallel, similar to the HP PA-RISC MAX-1 multimedia instructions.

    The ARM CPU was chosen for the Apple Newton handheld system because of its speed, combined with the low power consumption, low cost and customizable design (the ARM610 version used by Apple includes a custom MMU supporting object oriented protection and access to memory for the Newton's NewtOS). DEC has also licensed the architecture, and has developed the SA-110 (StrongARM) (February 1996), running a 5-stage pipeline at 100 to 233MHz (using only 1 watt of power), with 5-port register file, faster multiplier, single cycle shift-add, and Harvard architecture (16K each 32-way I/D caches). To fill the gap between ARM7 and DEC StrongARM, ARM also developed the ARM8/800 which includes many StrongARM features, and the ARM9 with Harvard busses, write buffers, and flexible memory protection mapping.

    An experimental asynchronous version of the ARM6 (operates without an external or internal clock signal) called AMULET has been produced by Steve Furber's research group at Manchester university. The first version (AMULET1, early 1993) is about 70% the speed of a 20MHz ARM6 on average (using the same fabrication process), but simple operations (multiplication is a big win at up to 3 times the speed) are faster (since they don't need to wait for a clock signal to complete). AMULET2e (October 1996, 93K transistor AMULET2 core plus four 1K fully associative cache blocks) is 30% faster (40 MIPS, 1/2 the performance of a 75MHz ARM810 using same fabrication), uses less power, and includes features such as branch prediction. AMULET3 is expected to be a commercial product in 1999.


    Advanced RISC Machines:
    http://www.arm.com/
    Digital Equipment Corporation:
    http://www.digital.com/
    Newton, Inc.:
    http://www.newton.apple.com/

    Part X: Hitachi SuperH series, Embedded, small, economical (1992) . . . . . .

    Although the TRON project produced processors competitive in performance (Fujitsu's(?) Gmicro/500 memory-data CPU (1993) was faster and used less power than a Pentium), the idea of a single standard processor never caught on, and newer concepts (such as RISC features) overtook the TRON design. Hitachi itself has supplied a wide variety of microprocessors, from Motorola and Zilog compatible designs to IBM System/360/370/390 compatible mainframes, but has also designed several of its own series of processors.

    The Hitachi SH series was meant to replace the 8-bit and 16-bit H8 microcontrollers, a series of PDP-11-like (or National Semiconductor 32032/32016-like) memory-data CPUs with sixteen 16-bit registers (eight in the H8/300), usable as sixteen 8-bit or combined as eight 32-bit registers (for addressing, except H8/300), with many memory-oriented addressing modes. The SH is also designed for the embedded marked, and is similar to the ARM architecture in many ways. It's a 32 bit processor, but with a 16 bit instruction format (different than Thumb, which is a 16 bit encoding of a subset of ARM 32 bit instructions, or the NEC V800 load-store series, which mixes 16 and 32 bit instruction formats), and has sixteen general purpose registers and a load/store architecture (again, like ARM). This results in a very high code density, similar to the 680x0 and 80x86 CPUs, and about half that of the PowerPC. Because of the small instruction size, there are no immediate load instruction, but a PC-relative addressing mode is supported to load 32 bit values (unlike ARM or PDP-11, the PC is not otherwise visible). The SH also has a Multiply ACcumulate (MAC) instruction, and MACH/L (high/low word) result registers - 42 bit results (32 low, 10 high) in the SH1, 64 bit results (both 32 bit) in the SH2 and later. The SH3 includes an MMU and 2K to 8K of unified cache.

    The SH4 (mid-1998) is a superscalar version with extensions for 3-D graphics support. It can issue two instructions at a time to any of four units: integer, floating point, load/store, branch (except for certain non-superscalar instructions, such as modifying control registers). Certain instructions, such as register-register move, can be executed by either the integer or load/store unit, two can be issued at the same time. Each unit has a separate pipeline, five stages for integer and load/store, five or six for floating point, and three for branch.

    Hitachi designers chose to add 3-D support to the SH4 instead of parallel integer subword operations like HP