2x2=4, mathematics

Emotional stories about first processors for computers: part 3 (Motorola 68k)

Motorola: from 68000 to 68040

For some time Motorola was the only company that could successfully compete with Intel in producing processors for personal computers. In 1980 Motorola actually put Intel into a crisis, from which Intel escaped only by mobilizing all its forces and organizing a "crush" group against its competitors, whose actions somewhat violated US antitrust laws.

The 68000 was released in 1979 and at first sight looked much better than the 8086. It had 16 32-bit registers (more precisely, even 17), a separate program counter and a status register. It could address 16 MB of memory directly, which removed any restrictions on, for example, large arrays. However, careful analysis of the 68000's features shows that not everything was as good as it seemed. In those years more than 1 MB of memory was an unattainable luxury even for medium-sized organizations. The 68000's code density was worse than the 8086's, which means that 68000 code with the same functionality occupied more space. This is partly because any 68k instruction must be a multiple of 2 bytes long, whereas x86 instructions are multiples of 1 byte. The information about code density is controversial, though, as there is evidence that in some cases the 68000 could have the better code density. Of the 16 registers of the 68k, 8 are address registers, which in some respects are slightly more advanced analogues of the x86 segment registers. The ALU and data bus are 16-bit, so operations on 32-bit data are slower than one might expect. Moreover, because of the contrived big-endian byte order, addition and subtraction of 32-bit numbers incur an additional slowdown. A register-to-register operation takes 4 cycles on the 68000 but only 2 on the 8086. The interrupt latency of the 68000 may reach 378 cycles, which is quite a lot. Until the mid-80's, computers based on the 68000 were much more expensive than those based on the Intel 8088, yet the 68000 could not work with virtual memory and had no hardware support for real numbers, which made it unsuitable for the most advanced systems. Supporting virtual memory required another processor, and usually a second 68000 was used for this.
In 1982, as Bill Joy noted, Sun began developing their processor because the 68000 did not meet customer needs for performance, and especially for working with real numbers. Interestingly, because of the large size of the 68000, Motorola people called it the "Texas cockroach".

However, the larger size allowed more functionality to be squeezed in. For example, the 68000 can handle up to 8 external interrupt sources without an external controller. The x86 architecture, on the contrary, requires an external interrupt controller (the only, very special exception is the 80186). The data and address buses of the 68000 are not multiplexed, unlike those of the 8086. Intel abandoned such multiplexing, which slows computer systems down, only with the 80286. The 68000 also has privileged instructions, which are necessary for multitasking support. The x86 likewise got such instructions only with the 80286. On the other hand, privileged instructions alone are not enough to support multitasking completely, and therefore their presence in the 68000, in contrast to the 80286, looks somewhat far-fetched.

As always with products from Motorola, the architecture of the 68000 shows some clumsiness and contrived oddities. For example, there are two stacks (the Coldfire processor, the most popular 68k processor today, has only one) and two carry flags (one for condition checks and another for operations). Despite the two carry flags, the add-with-carry and subtract-with-carry instructions support only two addressing modes, which makes them less convenient than the corresponding x86 operations. The 68000 thus retains some of the clumsiness of such operations inherent in the IBM/370 and, in part, the PDP-11. The oddities with the flags do not end there. For some reason many instructions, including even MOVE, zero the carry and overflow flags. Another oddity is that the instruction for saving the state of the arithmetic flags, which worked normally on the 68000, was made privileged in all processors starting with the 68010. In particular, this made it impossible to use the same operation to save flags on the 68000 and on the later 68k processors. Thus on the 68k you cannot save flags in one uniform way, as you can on the x86 with the PUSHF or LAHF instructions! Perhaps Motorola should not have made the MOVE from SR instruction privileged, but should instead have changed it so that in user mode it always returns fixed information about the system flags. Some operations are irritatingly unoptimized: for example, the CLR instruction, which writes zero to memory, is slower than writing a constant 0 to memory with MOVE, and a shift to the left is slower than adding an operand to itself. Even the address registers, while seemingly superior to the 8086 segment registers, have a number of annoying disadvantages. For example, loading one takes as many as 4 bytes instead of 2 for the 8086, and of these four, one is superfluous.
The 68000 instruction set reveals many similarities with the PDP-11 instruction set developed back in the 60's, although some addressing methods and the byte order are almost certainly taken from the IBM/370. Addressing via base and index registers on the 68000 is at the level of the 8-bit 6800 or Z80, with a single-byte offset, which is rather impractical. On the 8086 offsets are 16 bits wide, on the ARM or IBM/370 12 bits. Even the 6502 can use a 16-bit offset along with indexing. A normal offset size has been supported only since the 68020. The 68k can also surprise with its unusual loop instruction, to which we must pass not the number of repetitions but a number one less. The 68k still lacks an instruction like the 8086's TEST. Despite Motorola's great capabilities, the 68000 was originally made using a relatively old technology, inferior even to the one used in the production of the 6502.
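The behaviour of the 68k loop instruction can be illustrated with a minimal sketch. It assumes the usual DBRA semantics (decrement the low 16 bits of the counter register and leave the loop once the counter becomes -1); the function name dbra_iterations is hypothetical and exists only for this illustration.

```python
def dbra_iterations(initial_count: int) -> int:
    """Count how many times a loop body runs when it is closed by DBRA Dn,label."""
    counter = initial_count & 0xFFFF          # DBRA uses only the low 16 bits of Dn
    iterations = 0
    while True:
        iterations += 1                       # the loop body executes
        counter = (counter - 1) & 0xFFFF      # DBRA decrements the counter...
        if counter == 0xFFFF:                 # ...and falls through once it reaches -1
            break
    return iterations


if __name__ == "__main__":
    for n in (0, 1, 9):
        print(f"Dn = {n}: the body runs {dbra_iterations(n)} times")
```

So to repeat a body ten times the register has to be loaded with 9, and a counter of 0 still executes the body once.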

One cannot help but be surprised that the 68000 instruction set has two different ways to call subroutines, a unique oddity of the 68k architecture. The BSR.W addr instruction is absolutely identical in functionality, size and timing to the JSR addr(PC) instruction. Likewise, the instruction set has two ways of making an unconditional jump, which are also absolutely identical: BRA.W addr and JMP addr(PC). Only the short versions of BSR and BRA give these instructions some sense. However, BSR.S can be used relatively rarely, for example for small recursive subroutines. And in any case, why support the completely useless long versions of these instructions?! There are other almost unnecessary instructions: for example, there are both arithmetic and logical left shifts, which actually do the same thing. By the way, shifts and rotations of memory operands can shift by only one position and only on 16-bit data; bytes and 32-bit values are not supported even on the 68020 and later processors!

Code for the 68k typically looks somewhat more cumbersome and clumsy than code for the x86 or ARM. This is largely due to the abundance of the S, B, W and L size suffixes in 68k assembler instructions. For example, you can write the strange and useless instruction MOVE.L D0,(A0,D0.W), which writes the 32 bits of register D0 to the address obtained by adding the 32-bit contents of register A0 and the lower 16 bits of register D0.
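How the destination address of an instruction such as MOVE.L D0,(A0,D0.W) is formed can be sketched as follows. This is a simplified model, assuming the index word and the 8-bit displacement are sign-extended before the addition; the helper names sign_extend and indexed_address are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def indexed_address(a0: int, d0: int, disp8: int = 0) -> int:
    """Effective address of (d8,A0,D0.W): base + 16-bit index + 8-bit displacement."""
    index = sign_extend(d0, 16)               # only the low word of D0 takes part
    return (a0 + index + sign_extend(disp8, 8)) & 0xFFFFFFFF


if __name__ == "__main__":
    # The upper half of D0 is ignored when the index is D0.W.
    print(hex(indexed_address(0x1000, 0x00010010)))
    # A low word of 0xFFFE acts as an index of -2.
    print(hex(indexed_address(0x1000, 0xFFFE)))
```

The single signed byte of displacement limits the constant part of the address to the range -128..127, which is the restriction the text complains about.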

On the other hand the 68000 is generally faster than the 8086, according to my estimates by about 10%. And with intensive use of 32-bit data or large arrays, the 68000 can even outperform the 8086 by several times. 680x0 code also has its own special beauty and elegance, and it is less mechanical than x86 code. Additionally, as shown by eab.abime.net experts, the code density of the 68k is often better than that of the x86. The 68000, like the ARM or VAX, can use the PC as a base, which is very convenient. The x86 and even the IBM/370 cannot do this; support for such addressing appeared only in the x86-64. It is worth noting, though, that PC-relative addressing on the 68k is available only for the source operand; it does not work for the destination operand or even for one-operand instructions (like TST or NOT), which makes it noticeably less useful. Having more registers is also a significant advantage of the 68000 over the 8086, although this advantage shows only when processing 16- and 32-bit data, because the 68000 cannot quickly use the separate bytes of a 16-bit word. The increment and decrement operations of the 68k are very good: they allow a step from 1 to 8, whereas the step is always 1 on the x86 and most other known architectures. The 68k has a very flexible and convenient MOVEM instruction that saves or restores any set of registers. The ARM has a similar instruction, but on the x86 such operations require many instructions that save or restore individual registers. However, MOVEM occupies 4 bytes, so when no more than three registers need to be saved or restored, x86 code will be more compact. In addition, the x86 (since the 80286) also has an instruction for saving and restoring all registers at once, so in the general case the advantage the 68k gains from MOVEM is not very significant.
The almost complete orthogonality of the 68k's MOVE instruction is also a pleasant feature: data can be transferred between different memory locations without using registers. But this instruction is an exception; other instructions, for instance CMP, are not orthogonal. Another attractive feature of the 68k architecture is the addressing modes with auto-increment and auto-decrement, which are not available on the x86. The user stack's independence from interrupts allows data to be used above the top of the stack, which is unthinkable on the 8086. This very odd and not recommended way of working with the stack is also available on the x86 in multitasking environments, where each task and the system have their own stacks.

Overall the 68000 is a good processor with a large instruction set. It was originally planned for use in minicomputers, not personal computers. It is somewhat ironic, therefore, that the last mass application of this processor, in the second half of the 90's, was in calculators and pocket computers. However, it was with the 68000 that the development of workstations by Sun, Apollo, HP, Silicon Graphics and later NeXT began. Apple, which made the workstation-class Lisa computer, could also be added to this list. The 68000 was used in many now legendary personal computers: the first Apple Macintosh computers, which were produced until the early 90's, the first Commodore Amiga multimedia computers, and the relatively inexpensive and high-quality Atari ST computers. The 68000 was also used in relatively inexpensive computers running Unix variants, in particular in the rather popular Tandy 16B. It is also worth mentioning the fast and inexpensive Sage computers, which for some time were the fastest personal computers in the world; their development was very dramatic. Interestingly, IBM developed the PC and the 68000-based System 9000 computer simultaneously, and the latter was released less than a year after the PC.

The Apple Lisa – it's strange that the first 68000-based Apple computers (the Lisa and Macintosh) had black-and-white graphics, whereas the eight-bit Apple II computers had colors

This is a famous demo for the Amiga 1000, such graphics in 1985 seemed incredible fantasy. This is an image in GIF format, which allows you to show only 256 colors out of 4096 displayed by the real Amiga – other formats for full-color animated graphics have still not been well supported

The 68010 appeared clearly belatedly, only in 1982, at the same time as Intel released the 80286, which put personal computers on the same level as minicomputers. The 68010 is pin-compatible with the 68000, but its instruction set is slightly different, so replacing the 68000 with the 68010 never became popular. This incompatibility was caused by a contrived reason: bringing the 68000 more into line with the ideal theory of virtualization. Another almost useless innovation was the ability to relocate the interrupt vector table.
The 68010 is only slightly faster than the 68000, by no more than 10%. In the 68010 the bug that prevented the use of virtual memory was finally fixed. Obviously the 68010 lost badly to the 80286 and was even weaker than the 80186, which appeared in the same year. Like the 80186, the 68010 almost never found a use in personal computers.

The 68008 was also released in 1982, probably in the hope of repeating the success of the 8088. It is a 68000 with an 8-bit data bus, which allowed it to be used in cheaper systems. But the 68008, like the 68000, has no instruction queue, which makes it about 50% slower than the 68000. Thus the 68008 may even be a little slower than the 8088, which thanks to its instruction queue is only about 20% slower than the 8086. IBM proposed that Motorola make the 68008 by 1980 but was refused, although according to Motorola employees it would have cost the work of one employee for less than a month. Had the refusal not occurred, IBM might have chosen the 68008 for the IBM PC.

Based on the 68008, Sir Clive Sinclair made the Sinclair QL, a very interesting computer that, thanks to its lower price, could compete with the Atari ST and similar computers. But in parallel, and clearly prematurely, Clive began to invest heavily in the development of electric vehicles, leaving the QL (Quantum Leap) as a rather secondary task. Together with some unsuccessful design decisions, this led the computer and the whole company to premature closure. The company became part of Amstrad, which declined to produce the QL.

It would be interesting to calculate the bit index for the 68000, which seems to me clearly higher than 16 although maybe it is not higher than 24.

Appearing in 1984, the 68020 again returned Motorola to first position. Many very interesting and promising innovations were realized in this processor. The strongest effect certainly comes from the instruction pipeline, which sometimes allows up to three instructions to be executed at once! The 32-bit address bus looked a little premature in those years, and therefore a cheaper version of the processor (the 68EC020) with a 24-bit bus was available, but the 32-bit data bus looked quite appropriate and sped the processor up significantly. The built-in cache was an innovation even though its capacity was a small 256 bytes; it could sometimes improve performance significantly because the main dynamic memory could not keep up with the processor, although in the general case such a small cache affected performance only slightly. Reasonably quick division (64/32 = 32,32) and multiplication (32*32 = 64) operations were added, taking approximately 80 and up to 45 cycles respectively. Instruction timings were generally improved: for example, division (32/16 = 16,16) now took approximately 45 cycles (more than 140 cycles on the 68000). In the most favorable cases some instructions can execute without occupying any clock cycles at all! New addressing modes were added, in particular with scaling; on the x86 this mode appeared only the next year with the 80386. Other new addressing modes allow double indirect addressing with several offsets, so the PDP-11 has been remarkably outdone here.
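The notation 64/32 = 32,32 and 32*32 = 64 can be unpacked with a small sketch of the unsigned forms of these operations. The mnemonics MULU.L and DIVU.L and the function names are not taken from the text above and are used here as an assumption; the sketch models only the register results, not the timings or the full flag behaviour.

```python
MASK32 = 0xFFFFFFFF


def mulu_l(a: int, b: int) -> tuple[int, int]:
    """32*32 = 64: return the product as a (high word, low word) register pair."""
    product = (a & MASK32) * (b & MASK32)
    return product >> 32, product & MASK32


def divu_l(high: int, low: int, divisor: int):
    """64/32 = 32,32: return (quotient, remainder), or None when the quotient overflows."""
    if divisor == 0:
        raise ZeroDivisionError("a real processor would take a divide-by-zero trap")
    dividend = ((high & MASK32) << 32) | (low & MASK32)
    quotient, remainder = divmod(dividend, divisor & MASK32)
    if quotient > MASK32:
        return None                           # overflow: the quotient does not fit in 32 bits
    return quotient, remainder


if __name__ == "__main__":
    hi, lo = mulu_l(0xFFFFFFFF, 0xFFFFFFFF)
    print(hex(hi), hex(lo))
    print(divu_l(hi, lo, 0xFFFFFFFF))
```

Dividing a 64-bit value by a small divisor is the case in which the quotient no longer fits into 32 bits, which is why the overflow check matters.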

Some new instructions, for example the bulky operations on bit fields or the new operations on decimal numbers, which were hardly needed once fast division and multiplication were available, looked more like a fifth wheel than something essentially useful. Addressing modes with double indirect addressing look interesting in theory but in practice are needed quite rarely and execute very slowly. The ability to use 32-bit offsets in addressing was a rather premature innovation, since such large offsets were almost never required for the memory sizes of systems before the mid-90s. Here again, as in the case of the 68000, Motorola asked users to pay for the ability to work with memory sizes that hardware could not actually provide yet. Unlike the 80286, the 68020 takes time to compute the address of an operand, the so-called effective address. Division on the 68020 is still almost twice as slow as the fantastic division of the 80286. Multiplication and some other operations are also slower. Overall, the 68020 is noticeably slower than the 80286 on byte operations. On operations with 16-bit data the 68020 is only slightly slower, and only on operations with 32-bit data is the 68020 clearly superior to the 80286. The 68020 has no built-in memory management unit, and the rather exotic ability to connect up to eight coprocessors could not fix this. The chief architect of the 68000 himself admitted that too many addressing modes were put into the 68020 and that the result was therefore some kind of monster. They focused on the VAX and the ease of assembly programming, but the future came with RISC, higher speeds and powerful compilers. In addition, here's another quote from Bill Joy: "It became clear that Motorola was doing with their microprocessor line roughly the same mistakes that DEC had done with their microprocessor line, in other words, 68010 68020 68040, were getting more and more complicated.
And they were slipping and they weren't getting faster anywhere near the rate that the underlying transistors were getting faster". It is also worth adding that a third stack (specifically for interrupts) was added to the 68020!

It is not surprising therefore that in the modern development of the 68k architecture almost all new instructions of the 68020 have been abandoned. This applies in particular to the Coldfire and 68070 processors used in embedded systems.

The 68020 was widely used in mass-market computers such as the Apple Macintosh II, Macintosh LC and Commodore Amiga 1200; it was also used in several Unix systems.

The appearance of the 80386, with a built-in and very well-made MMU and 32-bit buses and registers, again put Motorola in position number 2. The 68030, appearing in 1987, briefly returned the leadership to Motorola for the last time. The 68030 has a built-in memory management unit and a doubled cache, divided into separate caches for instructions and data, which was a very promising innovation. The MMU of the 68030 does not slow the processor down, as the external MMU of the 68020 did. In addition, the 68030 could use a faster memory access interface, which can speed up memory operations by almost a third. However, in general, working with memory remained slow: 4 clock cycles per access, i.e. the same number of clock cycles as on the 68000. People even joked about "Motorola's standard memory cycle". For comparison, the 80286 took 2 clock cycles, while the ARM or 6502 took 1. To be fair, it should be added that officially a memory access on the 68020 and 68030 takes 3 cycles, but in many instructions it actually turns out to be closer to 4. Despite all the innovations, the 68030 turned out to be somewhat slower than the 80386 at the same frequency. However, the 68030 was available at frequencies up to 50 MHz and the 80386 only up to 40 MHz, which made top systems based on the 68030 slightly faster. It may be surprising that the 68030 does not support several instructions of the 68020, for example CALLM and RTM! Shortcomings in the architecture of the 68k processors forced the major manufacturers of computers based on them to look for a replacement. Sun started producing its own SPARC processors, Silicon Graphics switched to MIPS processors, Apollo developed its own PRISM processor, HP started using its own PA-RISC processors, Atari started working with custom RISC chips, and Apple was forced to switch to PowerPC processors.
Interestingly, Apple was going to switch to the SPARC in the second half of the 80's, but negotiations with Sun failed. One can only wonder how poorly the management of Motorola was working, as if they themselves did not believe in the future of their processors. Here we can also add that Motorola made a variant of the 68030 processor without an MMU! This variant was used in the cheapest models of the Commodore Amiga 4000. Intel did not release such products, although the MMU was not needed for DOS, the most popular operating system of the time.

The 68030 was used in computers of the Apple Macintosh II series, Commodore Amiga 3000, Atari TT, Atari Falcon and some others.

With the 68040 Motorola once again tried to outperform Intel; this processor appeared a year after the 80486. However, the 68040's set of useful qualities never managed to surpass the 80486's. In fact the Motorola 68k, having a more overloaded instruction set, was not able to sustain it and in a sense dropped out of the race. In addition, Motorola also took part in the development of the PowerPC, which was planned to replace the 68k, and this could not but affect the quality of the 68040's development. Only a very truncated coprocessor for real numbers could be fitted into the 68040, and the chip itself ran significantly hotter than the 80486. According to the results on lowendmac.com/benchmarks, the 68040 is only about 2.1 times faster than the 68030, which means that the 68040 is slightly slower than the 80486 at the same frequency. The 68040 found almost no application in popular computers. Only its cheaper version, the 68LC040, which has no built-in coprocessor, saw some noticeable use. However, the first versions of this chip had a serious hardware defect that made it impossible to use even software emulation of the coprocessor!

Motorola always had problems with mathematical coprocessors. As mentioned above, Motorola never released such a coprocessor for the 68000/68010, while Intel had been selling its very successful 8087 since 1980. For the 68020/68030, two coprocessors were produced: the 68881 and its improved pin-compatible version, the 68882.

It is appropriate to say that the Intel x86 still has problems with the mathematical coprocessor. The accuracy of some functions, for example the sine of some arguments, is very low, sometimes no more than 4 digits. Therefore modern compilers often calculate such functions without using the coprocessor.
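One plausible source of such a loss of accuracy can be sketched with exact rational arithmetic. The sketch rests on an assumption that is not stated in the text above: that the coprocessor reduces the argument using a value of pi rounded to 66 significant bits. For an argument x very close to pi, sin(x) is practically equal to pi - x, so any error in the stored pi goes directly into the result. All names in the sketch are hypothetical.

```python
from fractions import Fraction
import math

# pi to 50 decimal places, hard-coded so that only the standard library is needed.
PI = Fraction("3.14159265358979323846264338327950288419716939937510")

# Assumption: argument reduction uses pi rounded to 66 significant bits.
# pi lies in [2, 4), so 66 significant bits leave 64 bits after the binary point.
PI_66 = Fraction(round(PI * 2**64), 2**64)


def sine_near_pi(x: Fraction, pi_value: Fraction) -> Fraction:
    """sin(x) for x extremely close to pi, using sin(x) = sin(pi - x) ~ pi - x."""
    return pi_value - x


x = Fraction(math.pi)                       # the double nearest to pi, taken exactly
accurate = sine_near_pi(x, PI)              # reference result
reduced = sine_near_pi(x, PI_66)            # result with the shortened pi
relative_error = abs(reduced - accurate) / accurate

if __name__ == "__main__":
    print(f"reference  : {float(accurate):.16e}")
    print(f"66-bit pi  : {float(reduced):.16e}")
    print(f"rel. error : {float(relative_error):.2e}")
```

Under this assumption only the first few significant digits of the two results agree, which is consistent with the figure quoted above, although it does not prove that this is the only cause.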

Surprisingly, Motorola was still able to release a Pentium-class 68k processor, the 68060, in 1994. This processor also had problems with floating point arithmetic. And most importantly, no popular system remained where the 68060 could find application except the Commodore Amiga, but Commodore went bankrupt in that same year, 1994. According to some conspiracy theories, Commodore went bankrupt in part because the 68060 could have competed with the PowerPC architecture that the Apple Macintosh computers had begun to use.

Motorola processors up to and including 1994 were generally quite comparable to the Intel x86, and in some important aspects they were always better. However, Intel, unlike Motorola, spent a lot of effort on retaining its customers and attracting new ones. Moreover, in the fight against its main competitor Intel sometimes acted rather unkindly. For instance, it is hard to believe that a big review article in Byte magazine from 9/1985, which states without proof about the 68000 that "compared to the 8086/8088, it required a massive software effort to get it to do anything", could have appeared outside the context of this struggle. On the other hand, Motorola did everything later and at a higher price than Intel. In addition, Motorola processors clearly lacked originality; too much was copied from DEC and IBM technologies. Of course, the failure of the 68k had complex causes, combining both weak strategic marketing and some architectural shortcomings.

For some reason, no one in the USSR even tried to clone the processors of the 68k architecture, although the Besta computer was developed on the basis of the 68020.

Edited by Richard BN


Emotional stories about first processors for computers: part 2 (DEC PDP-11)

Processors of DEC PDP-11

The early 70's saw the beginning of a 10-year era of worldwide domination by DEC. DEC computers were significantly cheaper than those produced by IBM and therefore attracted the attention of small organizations for which IBM systems were unaffordable. The era of mass professional programming also begins with these computers. The PDP-11 computer series was very successful. Various PDP-11 models were produced from the early 70's to the early 90's. They were successfully cloned in the USSR and became the first mass-market computer systems there. Some of the Soviet-made PDP-11 compatible computers have several unique traits. For example, several models like the DVK are personal computers rather than minicomputers, and several models like the UKNC and BK are pure personal computers. The BK became, in 1985, the first PC that ordinary people in the USSR could buy. By the way, it is very likely that more PDP-11 computers were produced in the USSR than in the rest of the world combined!

DEC also promoted the more expensive and complex computers of the VAX-11 family, the situation around which was somewhat politicized. From the second half of the 70's DEC practically stopped development of the PDP-11 series. In particular, support for hexadecimal numbers was never added to the assembler. The Oracle DBMS was originally created for the PDP-11 in 1979, but its next version, in 1983, was not released for these computers; MS-DOS systems were preferred. The performance of PDP-11 systems also remained virtually unchanged from the mid-70's. All this is very surprising, since DEC achieved its major success precisely with the relatively inexpensive PDP-11 computers, and abandoning their development in favor of the expensive, almost mainframe-class VAX amounted to a voluntary rejection of further successes. Most likely, though, the abandonment was not voluntary but happened under pressure from IBM, which in such an elegant way broke its dangerous competitor by forcing it to "compete" on IBM's own field.

This is the LCM's PDP-11/70 (Miss Piggy), it still works and is freely available through the network

The PDP-11 used various processors compatible with the main instruction set, for example the LSI-11, F-11 and J-11. In the late 70's DEC made the cheap T-11 processor for microcomputers. However, for unclear reasons, and despite the seemingly large body of high-quality software that could eventually have been ported to a system using it, no manufacturer of computer systems took it up. The only exception was one model of Atari gaming console. The T-11 found mass application only in the world of embedded equipment, although in terms of capabilities it was slightly above the Z80. The USSR produced the K1801VM1, K1801VM2, K1801VM3 and other processors similar to DEC processors, as well as exact copies of some DEC processors. The latter began to be produced only at the beginning of the 90s.

The PDP-11 instruction set is almost completely orthogonal, which is a pleasant quality, but taken to the extreme it can produce ridiculous instructions. The PDP-11 instruction set has influenced many architectures, in particular the Motorola 68000.

The PDP-11 instruction set is strictly 16-bit. All 8 general purpose registers (the program counter in this architecture is the ordinary register R7) are 16 bits wide, the processor status word (which contains the typical flags) is 16 bits wide too, and instructions are from 1 to 3 16-bit words long. Any operand of an instruction can be of any type (although there are exceptions, for example the XOR instruction); this is orthogonality. The operand types include registers and memory locations. Soviet programmers in the 80s sometimes did not understand why Intel's x86 instruction set lacks memory-to-memory instructions. This was the influence of the PDP-11 school, where you can easily write the full address of each operand. Such instructions are indeed slow, and especially slow on systems with the typical relatively slow RAM that has been in use since the early 90's. A memory address can be formed using a register, a register with an offset, or a register with autoincrement or autodecrement. The PDP-11 instruction set also makes double indirect access to memory through a register possible: for example, MOV @(R0)+,@-(R1) means the same as the statement **--r1 = **r0++; in the C/C++ programming languages, where r0 and r1 are declared as signed short **r0, **r1;.

As another example, the instruction MOVB @11(R2),@-20(R3) corresponds to **(r3-20) = **(r2+11);, where r2 and r3 are declared as char **r2, **r3; and the offsets are understood as byte offsets rather than being scaled by the pointer size.
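The double indirect MOV @(R0)+,@-(R1) described above can be traced step by step in a toy model. The sketch below is an illustration only: the Pdp11 class, the function names and the memory layout in demo are hypothetical, flags are ignored, and words are assumed to be stored low byte first.

```python
class Pdp11:
    """A toy model: 64 KB of byte-addressed memory and eight 16-bit registers."""

    def __init__(self):
        self.mem = bytearray(0x10000)
        self.r = [0] * 8

    def read_word(self, addr: int) -> int:
        addr &= 0xFFFF
        return self.mem[addr] | (self.mem[(addr + 1) & 0xFFFF] << 8)

    def write_word(self, addr: int, value: int) -> None:
        addr &= 0xFFFF
        self.mem[addr] = value & 0xFF
        self.mem[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


def mov_autoinc_def_to_autodec_def(cpu: Pdp11, src: int, dst: int) -> None:
    """MOV @(Rsrc)+,@-(Rdst) with the flags ignored."""
    pointer = cpu.read_word(cpu.r[src])       # the register points to a pointer
    cpu.r[src] = (cpu.r[src] + 2) & 0xFFFF    # autoincrement after use
    value = cpu.read_word(pointer)            # second level of indirection
    cpu.r[dst] = (cpu.r[dst] - 2) & 0xFFFF    # autodecrement before use
    pointer = cpu.read_word(cpu.r[dst])       # the register points to a pointer
    cpu.write_word(pointer, value)            # store through that pointer


def demo() -> Pdp11:
    cpu = Pdp11()
    cpu.write_word(0x0100, 0x0200)            # source pointer
    cpu.write_word(0x0200, 0x1234)            # source value
    cpu.write_word(0x0300, 0x0400)            # destination pointer
    cpu.r[0] = 0x0100
    cpu.r[1] = 0x0302
    mov_autoinc_def_to_autodec_def(cpu, 0, 1)
    return cpu


if __name__ == "__main__":
    cpu = demo()
    print(hex(cpu.read_word(0x0400)), hex(cpu.r[0]), hex(cpu.r[1]))
```

A single instruction thus performs four memory reads or writes of data and updates two registers, which helps explain why such instructions are slow.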

In the modern popular architectures a single instruction can be insufficient for such cases; at least 10 instructions may be required. It is also possible to form an address relative to the current value of the program counter. I will give another example with simpler addressing. The x86 instruction ADD [BX+11],16 corresponds to ADD #16,11(R4). In DEC assemblers operands are written from left to right, unlike Intel's, where they are written from right to left. There is reason to believe that the GNU assembler for the x86 was made under the influence of the PDP-11 assembler. The PDP-11 assembler has a strange exception, however, for the CMP instruction, whose operands are placed as on the Intel x86.

Division and multiplication instructions are signed only and are not available on all processors. The division instruction, as on the 68k but unlike on the x86, sets the flags correctly, which allows overflow cases to be handled normally. However, not all PDP-11 models that support division handle the flags correctly on such overflow. Decimal arithmetic is optional too; it is part of the so-called commercial instruction set in DEC terminology. As an oddity of full orthogonality I will give the example of the instruction MOV #11,#22, which after execution turns into MOV #11,#11; it is an example of using an immediate constant as a destination operand. Another curious instruction is the unique MARK instruction, whose code has to be placed on the stack and which may never be used explicitly. Calling subroutines in the PDP-11 architecture is also somewhat peculiar. The call instruction first saves the designated register (which can be any register) on the stack, then saves the program counter in this register, and only then writes a new value to the program counter. The return instruction must do the reverse and must know which register was used when the subroutine was called. As another example of an unusual instruction, one can point to multiplication, where depending on the number of the register used for the result you get either the full 32-bit product or only its lower 16 bits. The presence in the instruction set of the absolutely useless instructions CLV, SEV, CLZ, SEZ, CLN and SEN demonstrates that some details of this set are ill-conceived. Work with the carry in the ADC and SBC instructions is also implemented somewhat awkwardly: for example, adding two words with carry takes two instructions on the PDP-11, while one is enough on the x86 or 68k. On the IBM/370, though, such an operation requires as many as three instructions.
It may come as some surprise that some typical instructions are executed differently on different processors, for example, MOV R0,(R0)+ или MOV SP,-(SP) – such instructions are considered by the standard DEC assembler as erroneous. Strange effects can be sometimes obtained using the program counter as a normal register, although this may only apply to certain processor models.
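
The subroutine linkage described above can be sketched in a few lines of Python. This is only an illustrative model, not DEC's definition: the dictionary-based CPU state, the `jsr` and `rts` helper names and the octal addresses are all assumptions made for the example.

```python
# Minimal model of PDP-11 subroutine linkage (JSR reg,dst and RTS reg).
# The dict-based state and the helper names are illustrative assumptions.

def jsr(cpu, mem, reg, target):
    """JSR reg,target: push reg, copy PC into reg, then jump."""
    cpu["SP"] = (cpu["SP"] - 2) & 0xFFFF  # the stack grows down by one word
    mem[cpu["SP"]] = cpu[reg]             # save the linkage register
    cpu[reg] = cpu["PC"]                  # the return address goes into it
    cpu["PC"] = target                    # only now is PC overwritten

def rts(cpu, mem, reg):
    """RTS reg: restore PC from reg, then pop the saved reg value."""
    cpu["PC"] = cpu[reg]
    cpu[reg] = mem[cpu["SP"]]
    cpu["SP"] = (cpu["SP"] + 2) & 0xFFFF

# PC is assumed to already point past the JSR instruction.
cpu = {"R5": 0o1234, "SP": 0o1000, "PC": 0o2004}
before = dict(cpu)
mem = {}
jsr(cpu, mem, "R5", 0o3000)
inside = dict(cpu)   # state as seen by the subroutine
rts(cpu, mem, "R5")
```

In this model, passing "PC" as the linkage register degenerates into the ordinary call and return through the stack, which is why the return instruction has to name the same register as the call.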

It is interesting that among the programmers for the PDP-11 there is a culture of working directly with machine codes. Programmers could for example work without a disassembler when debugging. Or even write small programs directly into memory, without assembling!

It is also interesting that assembler mnemonics for the PDP-11 became the basis for assemblers of the popular processors 680x, 6502, 68k, ARM.

Indeed, instruction timings were not too fast. It was surprising to find out that on a BK home computer a register-to-register move takes as much as 12 clocks (10 clocks when the code runs from ROM), and two-operand instructions with double indirect addressing take more than 100 clocks. The Z80 does a 16-bit register transfer in 8 clocks. However, the slowness of the BK is caused not so much by the processor as by the poor quality of the Soviet-made RAM, to whose features the BK had to be adapted. With fast enough memory, the BK would also move 16 register bits in 8 clock cycles. There was once a lot of controversy over which is faster, the BK or the Sinclair ZX Spectrum. I must say that the Spectrum is one of the fastest mass-produced 8-bit personal computers when using the top 32 KB of memory. Therefore it is not surprising that the Spectrum is faster than the BK, though not by much. And if the BK worked with fast enough memory, it could even be a bit faster.

The code density is also rather a weak point in the PDP-11 architecture. Instruction codes must be multiples of the machine word length – 2 bytes, which is especially frustrating when working with byte arguments or simple commands like setting or resetting a flag. But when compared with other architectures, the PDP-11 sometimes shows even better code density in practice!

There were interesting attempts to make a personal computer based on the PDP-11 architecture. One of the first PCs in the world, which appeared only a bit later than the Apple ][ and Commodore PET and a bit earlier than the Tandy TRS-80, was the Terak 8510/a, which had black-and-white graphics and could load an incomplete variant of Unix. This computer was quite expensive and, as far as I know, was used only in the higher education system in the USA. The Heathkit H11, a computer in kit form, was produced from 1978. DEC itself also tried to make its own PC, but very inconsistently. For example, DEC produced PCs based on the Z80 and 8088, explicitly playing against its own main developments. The PDP-11-based DEC PRO-325/350/380 PCs have some rather contrived incompatibilities with the underlying architecture that impeded the use of some software. The personalization of minicomputer technology turned out best in the USSR, which produced the BK, DVK, UKNC, ... By the way, the Electronica-85 was a quite accurate clone of the DEC PRO-350. In addition, the CP1600 processor, akin to the PDP-11 architecture, was used in the Intellivision game consoles, which were popular in the early 80's.

Made in USSR 16-bit home computer (model of 1987) – it is almost PDP-11 compatible

The K1801VM2 processor which was used in the DVK should theoretically be about two times faster than the K1801VM1 but in practice it was only slightly faster. The K1801VM3, in turn, is somewhat faster than the K1801VM2, supports memory management and largely corresponds to the best PDP-11 processors, its performance is close to the Intel 8086.

Processors of the top PDP-11 computers can address up to 4 MB of memory, but a single program can usually get no more than 64 KB. However, using special capabilities of the hardware and system software, it was possible to create larger programs. The best DEC processors, for example the J-11, can use separate address spaces for instructions and data, which doubles the size of the address space and does not slow down execution. The original processors made in the USSR could not do such separation. It is also worth noting here that even the first x86 processors were able to do this directly "out of the box" and were even more advanced, allocating up to 256 KB with similar separation; if you count the port space, a total of 320 KB could be allocated. The size of executable code on the PDP-11 can also be increased through memory-based overlays. Working with PDP-11 overlays requires very thorough support from the OS and compilers and is significantly slower than far calls on the first x86. Virtual arrays can be used to work with big data on the PDP-11, which again requires good support from the OS and compilers and is also slower than working with large arrays on the first x86. In terms of operations per megahertz, the best PDP-11 processors are close to the 8086. However, thanks to higher clock frequencies, the best PDP-11 was faster than the first IBM PC, while the 8 MHz IBM PC AT became almost equal to it.

Edited by Richard BN

Emotional stories about first processors for computers: part 1 (Intel x86)

Intel: from 8086 to 80486

One of the best processors made in the 70's is definitely the 8086, together with its cheaper near-analogue, the 8088. Interestingly, the 8088 and the 8086 look identical on the outside: their chips have the same number of pins, and almost all of the pins have the same function. The architecture of these processors is pleasantly distinguished by the absence of any notable copying from other processors developed and in use at the time. It is also distinguished by adherence to abstract theories, by the thoughtfulness and balance of the architecture, and by its steadiness and focus on further development. As for drawbacks, the x86 architecture can be called a bit cumbersome and prone to an extensive increase in the number of instructions.

One of the brilliant constructive solutions of the 8086 was the invention of segment registers. This, as it were, simultaneously achieved two goals – the "free" ability to relocate codes of programs, up to 64 KB in size (this was even a decent amount for computer memory for one program up to the mid-80's), and accessibility up to 1 MB of address space. You can also see that the 8086, like the 8080 or z80, also has a special address space for 64 KB I/O ports (this is 256 bytes for the 8080 and 8085). There are only four segment registers: one for code, one for stack, and two for data. Thus, 64*4 = 256 KB of memory is available for quick use and it was a lot even in the mid-80's. In fact, there is no problem with the size of code, since it is possible to use long subroutine calls while loading and storing a full address from two registers. There is only a limit of 64 KB for the size of one subroutine – this is enough even for many modern applications. Some problem is created only by the impossibility of fast addressing of data arrays larger than 64 KB – when using such arrays, it is necessary to load a segment register and an address itself on each access, which reduces the speed of work with such large arrays several times.
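
As a minimal sketch of the arithmetic behind this paragraph, the following Python models how the 8086 forms a 20-bit address from a segment register and a 16-bit offset. The function and constant names are illustrative only.

```python
def physical(segment, offset):
    """8086 real-mode address: segment * 16 + offset, truncated to 20 bits."""
    return ((segment << 4) + (offset & 0xFFFF)) & 0xFFFFF

SEGMENT_SIZE = 0x10000            # 64 KB reachable through one segment register
quick_window = 4 * SEGMENT_SIZE   # CS, SS, DS and ES pointing at disjoint segments

# "Free" relocation: the same offset works wherever the loader puts the
# segment, because only the segment register value changes.
a = physical(0x1000, 0x0123)
b = physical(0x2000, 0x0123)

# On the 8086 the sum wraps around at 1 MB, since there are 20 address lines.
wrapped = physical(0xFFFF, 0x0010)
```

Code inside the segment never sees the segment value, which is what makes relocation free; an array larger than 64 KB, on the other hand, forces the program to recompute and reload the segment part on access.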

The segment registers are implemented in such a way that their presence is almost invisible in the machine code, so, when time had come, it was easy to abandon them.

Quite often, you can find criticism of memory segmentation, i.e. such an organization that in general you need to use two pointers to address a memory location. However, this is a strange criticism, rather contrived. Segmentation itself is a completely natural way to organize virtualization and memory protection. In fact, it was not the segmentation itself that was criticized, but only the maximum segment size of 64 KB. However, this limitation is a direct consequence of the desire to have large amounts of memory when using 16-bit registers. Therefore, all the criticism of segmentation is actually a disguised requirement to switch to a 32-bit architecture. The situation was complicated by the fact that segmentation in the first x86 only partially had the functionality of a normal memory management unit, in particular, usage of segment registers was available to application programs. The 80286 made complete segmentation support available, but this made previous applications for the 8086 incompatible with the mode when this full support was activated. Only with the introduction of the 80386 were all the problems resolved and the criticism stopped, although the 80386 still used segmentation!

It is surprising that for some reason it is almost impossible to find such criticism in relation to the popular PDP-11, where the restrictions on memory usage are much more stringent. The cheapest PDP-11s were significantly more expensive than the best personal computers, and the best PDP-11s until the mid-80s were faster than the best IBM PC compatible machines. The PDP-11s were higher-end computers before the advent of the 80486-based PC and used segmentation...

Using a single pointer to hold a complete memory address was natural in the architecture of the IBM mainframes, the VAX, and the 68000 processor. It is easy to notice that this list does not include personal computers, since even the 68000 was originally developed for relatively expensive, non-personal systems. The 8086 retained much in common with the primitive 8080, which was used more as a controller. Therefore, it is quite strange to compare systems based on the 8088 with, for example, the VAX or even the Sun workstations: these are completely different classes of machines. But, perhaps thanks to Bill Gates, the IBM PCs were from the start compared with much more expensive systems. The first IBM PC had only 16 KB of memory, and 64 KB was more of a luxury for an individual customer in 1981. By the mid-80's, typical memory sizes for IBM PC compatible systems reached 512 KB, and with that much memory segmentation could hardly ever create difficulties. When the typical memory size for IBM PC compatible machines exceeded 512 KB, the 80386 appeared. It is worth recalling that even in 1985 most systems were 8-bit, and to work with more than 64 KB of memory you had to use bank switching, which is one or even two orders of magnitude more difficult and slower than using large arrays on the 8086. The first IBM PCs were quite comparable with 8-bit systems, but not with the VAX. By the way, an alternative design of the IBM PC used the Z80 processor. Therefore, we can only admire the Intel engineers who have been able to develop the x86 processors for more than 40 years so that they have remained relatively inexpensive and technically among the best, all while maintaining binary compatibility with every previous model, starting with the 8086! This is not a record, though: IBM has maintained compatibility with the System/360 architecture for almost 60 years.

As noted, the architecture of the 8086 stayed close to that of the 8080, which made it possible to port programs from the 8080 (or even from the Z80) to the 8086 with relatively little effort, especially if the source code was available.

The 8086's instructions are not very fast, but they are comparable to those of competitors, for example Motorola's 68000, which appeared a year later. One of the innovations that somewhat accelerated the rather slow 8086 was an instruction queue.

The 8086 uses eight 16-bit general purpose registers, some of which can be used as two one-byte registers, and some as index registers. Thus, the 8086 registers are somewhat heterogeneous, but the set is well balanced and the registers are very convenient to use. This heterogeneity, by the way, allows denser code. The 8086 uses the same flags as the 8080, plus a few new ones. For example, a flag typical of the PDP-11 architecture appeared: single-step execution. Compared with the PDP-11, the logic describing the flags for operations on signed numbers was improved. Consider the table, which shows the correspondence between the values of the flags and the relationship between signed numbers.

How differently the same relationships are described by different companies

From this table, it is probably natural to conclude that Intel's people understood logical operations, DEC's people understood them somewhat less, and Motorola's people could only copy the DNF out of Boolean algebra textbooks.
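
For the x86 side, the documented signed conditions are short enough to check mechanically. The sketch below models a 16-bit CMP and the four signed conditional jumps (JL, JGE, JG, JLE); the helper names are illustrative, and only the Intel formulations are shown, not the DEC or Motorola ones.

```python
def cmp16(a, b):
    """Flags (ZF, SF, OF) as a 16-bit x86 CMP a,b would set them."""
    a &= 0xFFFF
    b &= 0xFFFF
    r = (a - b) & 0xFFFF
    zf = r == 0
    sf = bool(r & 0x8000)
    # Signed overflow: operand signs differ and the result sign differs from a.
    of = bool((a ^ b) & (a ^ r) & 0x8000)
    return zf, sf, of

def signed(x):
    """Interpret a 16-bit pattern as a two's complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

# Intel's conditions for signed comparisons.
def jl(zf, sf, of):  return sf != of               # less
def jge(zf, sf, of): return sf == of               # greater or equal
def jg(zf, sf, of):  return not zf and sf == of    # greater
def jle(zf, sf, of): return zf or sf != of         # less or equal

samples = [0, 1, 0x7FFF, 0x8000, 0xFFFF, 0x1234, 0xFEDC]
all_agree = all(
    jl(*cmp16(a, b)) == (signed(a) < signed(b))
    and jge(*cmp16(a, b)) == (signed(a) >= signed(b))
    and jg(*cmp16(a, b)) == (signed(a) > signed(b))
    and jle(*cmp16(a, b)) == (signed(a) <= signed(b))
    for a in samples for b in samples
)
```

Each condition is a single comparison of SF with OF, optionally combined with ZF.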

The 8086 allows you to use very interesting addressing modes: for example, the address can be made up of the sum of two registers and a constant 16-bit offset, on which the value of one of the segment registers is superimposed. Of the three terms that make up the address, you can keep only two or even one. This is not possible on a PDP-11 or 68k with a single command. Most commands in the 8086 do not allow both operands to be in memory; one of the operands must be a register. This approach is completely analogous to the one used on the best systems of the time, the IBM/370. The 8086 also has string commands, which do know how to work with two memory locations. The string commands allow quick block copying (20 cycles per byte or word), searching, filling, loading and comparing. In addition, string commands can be used when working with I/O ports. The idea of instruction prefixes in the 8086 is very interesting: it often provides very useful additional functionality without significantly complicating the encoding scheme of CPU instructions.

The 8086 has one of the best designs to work with the stack among all computer systems. Using only two registers (BP and SP), the 8086 allows the solving of all problems when organizing subroutine calls with parameters.

Among the commands there are signed and unsigned multiplication and division. There are even unique commands for decimal corrections for multiplication and division instructions. It's hard to say that in the 8086 command system, something is clearly missing. Quite the contrary. The division of a 32-bit dividend by a 16-bit divisor to obtain a 32-bit quotient and 16-bit remainder may require up to 300 clock cycles – not particularly fast, but several times faster than such a division on any 8-bit processors (except the 6309) and is comparable in speed with the 68000. The division in the x86 has one unexpected and rather unpleasant feature – it corrupts all arithmetic flags.
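
A single 8086 DIV with a 16-bit divisor returns only a 16-bit quotient (in AX) and raises a divide error if the quotient does not fit. A common way to get a full 32-bit quotient is to chain two DIVs, feeding the remainder of the first into the high half of the second. The Python below is a sketch of that technique; whether the timing quoted above refers to exactly this sequence is an assumption, and the function names are illustrative.

```python
def div16(dx, ax, divisor):
    """Model of 8086 DIV r16: DX:AX / divisor -> (quotient in AX, remainder in DX)."""
    if divisor == 0:
        raise ZeroDivisionError("divide error")
    quotient, remainder = divmod((dx << 16) | ax, divisor)
    if quotient > 0xFFFF:
        raise OverflowError("divide error: quotient does not fit in 16 bits")
    return quotient, remainder

def div32_by_16(dividend, divisor):
    """32-bit / 16-bit -> (32-bit quotient, 16-bit remainder) using two DIVs."""
    hi, lo = (dividend >> 16) & 0xFFFF, dividend & 0xFFFF
    q_hi, r = div16(0, hi, divisor)   # DX = 0, so this step cannot overflow
    q_lo, r = div16(r, lo, divisor)   # r < divisor, so the quotient fits
    return (q_hi << 16) | q_lo, r

result = div32_by_16(1_000_000, 7)
```

The second step is safe because the remainder left in DX is always smaller than the divisor, which is exactly the condition for a 16-bit quotient.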

It's worth adding that in the x86 architecture, the XCHG command inherited from the 8080 has been improved. Interestingly, the instruction XCHG AX,AX is used as the NOP command in the x86 architecture. Because of this, NOP turned out to be relatively slow, 3 clocks, and this persisted until the 80486! One may wonder why the 2-clock MOV AX,AX was not chosen for NOP instead. The 8086 has 16 such useless move operations in total, which is more than the Z80 has. The count of useless instructions for XCHG is even larger, 71, because, for example, the equivalent instructions XCHG BX,CX and XCHG CX,BX are encoded differently. XCHG is a rare case where AX is usually not encoded as a general purpose register: XCHG with AX is one byte shorter and one cycle faster than the general case, and, due to the command queue, XCHG with AX is usually faster than MOV. Nevertheless, the 7 longer and slower XCHG instructions in which AX is encoded as a GPR are a particularly ugly part of the aforementioned useless instructions. Later processors added the instructions XADD, CMPXCHG and CMPXCHG8B, which can also perform an atomic exchange of arguments. Such instructions are one of the distinctive features of the x86; they are hard to find on processors of other architectures.

In summary, the 8086 is a very good processor that combines ease of programming with a design tied to the memory limits of its time. The 8086 was used comparatively rarely, giving way to the cheaper 8088, which became the first processor of the mainstream personal computer architecture, the IBM PC compatibles. The 8088 used an 8-bit data bus, which made it somewhat slower but made systems built on it more affordable for customers.

The IBM 5150 or the first IBM PC

Interestingly, Intel fundamentally refused to make improvements to its processors, preferring instead to develop their next generations. One of Intel's largest second sources, the Japanese corporation NEC, which was much larger than Intel in the early 80s, decided to upgrade the 8088 and 8086, launching the V20 and V30 processors, which were pin-compatible with them and about 30% faster. NEC even offered Intel the chance to become its second source! Intel instead launched a lawsuit against NEC, which, however, it could not win. For some reason this big clash between Intel and NEC is still completely ignored by Wikipedia.

The 80186 and 80286 appeared in 1982. Thus, Intel had two almost independent development teams. At the same time, the 80188 appeared, which differed from the 80186 only in its narrower data bus: Intel never forgot about inexpensive solutions for embedded systems. The 80186 was an 8086 improved by several new commands and shorter timings, with several support circuits typical of x86 systems integrated into the chip: a clock generator, timers, DMA, an interrupt controller, a delay generator, etc. Such a processor, it would seem, could greatly simplify the production of computers based on it, but because the embedded interrupt controller was for some reason not compatible with the IBM PC, it was almost never used in PCs. The author knows only of the BBC Master 512, based on the BBC Micro computer, which did not use the built-in circuits or even the timer, but there were several other systems using the 80186. The addressable memory of the 80186 remained the same as for the 8086, 1 MB. The Japanese corporation NEC produced analogues of the 80186 which were compatible with the IBM PC.

Consider new instructions for the 80186:

  • single-byte instructions PUSHA and POPA, allowing to save or restore all 8 registers at once;
  • three-operand signed multiplication, unique in the x86 architecture, it is more like an instruction for the ARM;
  • bit shifts and rotations with an immediate count; in the 8086, only the count 1 or the CL register can be used. For a count of 1 there are two forms: the fast and short one inherited from the 8086, or the generalized, longer and slower one for any immediate count, which is rather useless in this case;
  • string commands for working with i/o ports, they are somewhat more powerful than similar ones available on the Z80;
  • the ENTER and LEAVE instructions – support for working with subroutines in high-level languages. They know how to work with syntactic nesting of subroutines up to 32 levels – the use of this type of nesting is typical for Pascal language. However, for Pascal, you probably cannot find a single program where the nesting would be more than 3. And Pascal itself has been used less and less since then. Here you can see that Motorola also added Pascal support to the 68020, which was later regretted;
  • the BOUND command to check whether the array index is valid.

The 80286 had even better timings than the 80186, among which stands out just a fantastic division (32/16=16,16) for 22 clock cycles – since then they have not learned how to do the division any faster! The 80286 supports working with all new instructions of the 80186 plus many instructions for working in a new, protected mode. The 80286 became the first processor with built-in support for protected mode, which allowed it to organize memory protection, proper use of privileged instructions and access to virtual memory. Although the new mode was relatively rarely used, it was a big breakthrough. In this new mode, segment registers have acquired a new quality, allowing up to 16 MB of addressable memory and up to 1 GB of virtual memory per task. The main problem with the 80286 was the inability to switch from protected mode to real mode, in which most programs worked. Using the "secret" undocumented instruction LOADALL, it was possible to use 16 MB of memory being in the real mode.
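
The two memory figures for the 80286 follow from simple arithmetic. The sketch below assumes the usual derivation: 24 physical address lines, and a 16-bit selector whose 13-bit index and one table-indicator bit pick a descriptor from the global or the local table, each segment being at most 64 KB. The variable names are illustrative.

```python
ADDRESS_LINES = 24                    # physical address pins of the 80286
physical_bytes = 2 ** ADDRESS_LINES   # 16 MB of addressable memory

INDEX_BITS = 13                       # descriptor index inside a selector
TABLES = 2                            # global (GDT) and local (LDT) tables
MAX_SEGMENT = 2 ** 16                 # a segment is at most 64 KB

segments_per_task = TABLES * 2 ** INDEX_BITS
virtual_bytes = segments_per_task * MAX_SEGMENT   # 1 GB per task
```

The 1 GB figure is the conventional upper bound; the first entry of the global table is reserved as a null descriptor, so the usable total is marginally smaller.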

In the 80286, the calculation of an address in an instruction operand became a separate scheme and stopped slowing down the execution of commands. This added interesting features, for example, with a command LEA AX,[BX+SI+4000] in just 3 cycles it became possible to perform two additions and transfer the result to the AX register!
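
What that LEA computes can be written down directly. In this sketch the helper name `lea16` is illustrative, and the constant 4000 is assumed to be decimal, as it would be in an assembler without a hex suffix.

```python
def lea16(bx, si, disp):
    """Value that LEA AX,[BX+SI+disp] leaves in AX: two additions, wrapped to 16 bits."""
    return (bx + si + disp) & 0xFFFF

ax = lea16(0x1000, 0x0020, 4000)
wrapped = lea16(0xFFFF, 0x0001, 0)
```

LEA only evaluates the address expression: it does not read memory, does not involve a segment register and leaves the flags untouched, which is what makes it usable as a cheap three-operand addition.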

The segment registers in protected mode became part of a complete memory management unit. As it was already mentioned, in real mode these registers only partially provided the functionality of the MMU.

The number of manufacturers and specific systems using the 80286 is huge, but indeed the first computers were IBM PC AT's with almost fantastic personal computer performance indicators for speed. With these computers memory began to lag behind the speed of the processor, wait states appeared, but it still seemed something temporary.

In the early versions of the 80286, as in the 8086/8088, interrupt handling was not implemented 100% correctly, which in very rare cases could lead to very unpleasant consequences. For example, the POPF command on the 80286 always allowed interrupts during its execution, and when a command with two prefixes (for example, REP ES:MOVSB) was interrupted on the 8086/8088, one of the prefixes was lost after the interrupt call. The POPF error was present only in early releases of the 80286.

Protected mode of the 80286 (segmented) was rather inconvenient, it divided all memory into segments of no more than 64 KB and required complicated software support for working with virtual memory. The segmented method of working with memory was clearly inferior to the paged method in almost all its characteristics.

The 80386, which appeared in 1985, made work in protected mode quite comfortable: it allowed the use of up to 4 GB of addressable memory and easy switching between modes. In addition, to support multitasking of programs written for the 8086, the virtual 8086 mode was introduced. To manage memory, it became possible to use both large segments up to 4 GB in size and the convenient paged mode. For all its innovations, the 80386 remained fully compatible with programs written for the 80286. Among the innovations of the 80386 are also the extension of the registers to 32 bits and the addition of two new segment registers. In addition, all registers became equal in memory address calculation, and it became possible to use scaling. However, this register equality added a lot of useless, ugly duplicate instructions. The timings changed, but ambiguously. A barrel shifter was added, which allowed multiple-bit shifts with the timing of a single shift. However, this innovation for some reason considerably slowed down the rotate commands. Multiplication became slightly slower than on the 80286. Working with memory, on the contrary, became a little faster, but this does not apply to string commands, which remained faster on the 80286. The author of this material has often come across the view that in real mode with 16-bit code the 80286 is, in the end, still a little faster than the 80386 at the same frequency.

Several new instructions were added to the 80386, most of which just gave new ways for work with data, actually duplicating with optimization some already present instructions. For example, the following commands were added:

  • to check, set and reset a bit by number, similar to those that were made for the z80;
  • bit-scan BSF and BSR;
  • copy a value with a signed or zero bit extension, MOVSX and MOVZX;
  • setting a value depending on the values of operation flags by SETxx;
  • shifts of double values by SHLD, SHRD – similar commands are available on the IBM mainframes.
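
The semantics of the instructions in this list can be modelled in a few lines. The sketch below covers 32-bit register operands only and uses illustrative Python function names; flag effects other than the tested bit are left out.

```python
MASK32 = 0xFFFFFFFF

def bt(value, bit):
    """BT: return the selected bit (the CPU places it in CF)."""
    return (value >> (bit & 31)) & 1

def bts(value, bit):
    """BTS: return (old bit, value with that bit set)."""
    return bt(value, bit), (value | (1 << (bit & 31))) & MASK32

def btr(value, bit):
    """BTR: return (old bit, value with that bit cleared)."""
    return bt(value, bit), value & ~(1 << (bit & 31)) & MASK32

def bsf(value):
    """BSF: index of the lowest set bit (destination undefined on hardware for 0)."""
    if value == 0:
        raise ValueError("source is zero")
    return (value & -value).bit_length() - 1

def bsr(value):
    """BSR: index of the highest set bit."""
    if value == 0:
        raise ValueError("source is zero")
    return value.bit_length() - 1

def movsx8(byte):
    """MOVSX r32, r/m8: sign-extend a byte to 32 bits."""
    byte &= 0xFF
    return byte | 0xFFFFFF00 if byte & 0x80 else byte

def movzx8(byte):
    """MOVZX r32, r/m8: zero-extend a byte to 32 bits."""
    return byte & 0xFF

def shld(dest, src, count):
    """SHLD: shift dest left, filling the freed low bits from the top of src."""
    count &= 31
    if count == 0:
        return dest & MASK32
    return ((dest << count) | ((src & MASK32) >> (32 - count))) & MASK32
```

Most of these could already be composed from shifts, masks and loops on the 80286, which is why the text calls them optimized duplicates rather than new capabilities.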

The x86 processors before the appearance of the 80386 could use only short, with an offset of one-byte, conditional jumps – this was often not enough. With the 80386 it became possible to use offset of two or four bytes, and despite the fact that the code of new jumps became two or three times longer, the time of its execution remained the same as in previous, short jumps. However, not everything was done perfectly: perhaps for protected mode it was worth using 16-bit offsets instead of the almost useless 8-bit ones.

The support for debugging was radically improved by the introduction of 4 hardware breakpoints, using them it became possible to stop programs even on memory addresses that may not be changed.

Due to the fact that the main protected mode became much easier to manage than in the 80286, a number of inherited instructions became unnecessary elements. In the main protected so-called flat-mode, segments up to 4 GB in size are used, which turns all segment registers into an unobtrusive formality. The semi-documented unreal mode even allowed the use of all the memory as in flat-mode, using real mode which is easy to setup and control.

Since the 80386, Intel has refused to share its technology, becoming in fact the monopoly processor manufacturer for the IBM PC architecture and, with the weakening of Motorola's position, for other personal computer architectures as well. Systems based on the 80386 were very expensive until the early 90's, when they finally became available to mass consumers at frequencies from 25 to 40 MHz. With the 80386, IBM began to lose its position as the leading manufacturer of IBM PC compatible computers. This showed, in particular, in the fact that the first PC based on the 80386 was a computer made by Compaq in 1986.

It's hard to hold back admiration for the volume of work done by the creators of the 80386 and for its results. I would even dare to suggest that the 80386 contains more achievements than all the technological achievements of mankind before 1970, and maybe even before 1980. Interestingly, the 80386 development team was distinguished by a peculiar and overt religiosity.

The topic of errors in the 80386 is quite interesting. I will write about two. The first chips had some instructions which then disappeared from the manuals for this processor and stopped executing on later chips: the instructions IBTS and XBTS. All 80386DX/SX chips produced by both AMD and Intel (which reveals their curious internal identity) have a very strange and unpleasant bug: the value of the EAX register is destroyed when POPAD or PUSHAD, which restore all registers from the stack or save them there, is followed by a command that uses an address with the BX register. In some situations the processor could even hang. It is a nightmare of a bug and a very widespread one, yet Wikipedia still does not even mention it. There were other bugs as well.

The emergence of the ARM changed the situation in the world of computer technology. Despite the problems, the ARM processors continued their development. Intel's answer was the 80486. In the struggle for speed and for first place in the world of advanced technologies, Intel even decided to use a cooling fan, which has spoiled the look of the PC to this day.

In the 80486, timings for most instructions were improved, and some of them began to execute in one clock cycle, as on the ARM processors. Multiplication and division, however, for some reason became slightly slower. What was especially strange was that a single binary shift or rotation of a register began to run even slower than on the 8088! There was a built-in cache of 8 KB, quite big for those years. There were also new instructions, for example CMPXCHG, which took the place of the quietly removed IBTS and XBTS (interestingly, this instruction was secretly available already on late 80386 chips). There were very few new instructions, only six, of which one is worth mentioning: the very useful BSWAP, which changes the order of bytes in a 32-bit word. A big and useful innovation was the built-in arithmetic coprocessor; no other producer had made anything similar.
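
The effect of BSWAP is easy to state in code. This minimal Python sketch, with an illustrative function name, reverses the four bytes of a 32-bit word, which is the conversion between little-endian and big-endian representations.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit word, as BSWAP does for a register."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "little"), "big")

swapped = bswap32(0x12345678)
restored = bswap32(swapped)
```

Applying the operation twice restores the original value. Before the 80486 the same result took several exchange and rotate instructions.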

The first systems based on the 80486 were incredibly expensive. Quite unusually, the first computers based on the 80486, the VX FT models, were made by the English firm Apricot; their price in 1989 was from 18 to 40 thousand dollars, and the system unit weighed over 60 kg! This earliest appearance of systems based on the newest Intel processor in the UK could, however, have been caused by competition with Acorn and the ARM. IBM released its first computer based on the 80486 in 1990; it was the PS/2 Model 90, which cost $17,000.

It's hard to imagine the Intel processors without secret, officially undocumented features. Some of these features have been hidden from users since the very first 8086. For example, the admittedly useless fact that the second byte of the decimal correction instructions AAD and AAM matters and can take various values, not necessarily 10, was documented only for the Pentium, 15 years later! More unpleasant is the silence about the shortened AND/OR/XOR instructions with a byte constant operand, for example AND BX,7 with a three-byte opcode (83 E3 07). These commands, which make code more compact (especially important on the first PCs), were quietly inserted into the documentation only for the 80386. Interestingly, Intel's manuals for the 8086 and 80286 hint at these commands but give no specific opcodes for them, unlike the similar instructions ADD/ADC/SBB/SUB, for which full information was provided. This, in particular, led to many assemblers (all?) being unable to produce the shorter codes. Another group of secrets is a strange one: a number of instructions have two operation codes. An example is the instructions SAL and SHL (opcodes D0 E0, D0 F0 or D1 E0, D1 F0). Usually, and maybe always, only the first opcode is used. The second, secret one is almost never used. One can only wonder why Intel so carefully preserves these superfluous, unofficial, duplicate instructions that clutter the opcode space. The SALC instruction waited for its official documentation until 1995, almost 20 years! The debugging instruction ICEBP officially did not exist for 10 years, from 1985 to 1995. The most has been written about the secret instructions LOADALL and LOADALLD, although they will remain secret forever, as they could be used for easy access to large amounts of memory only on the 80286 and 80386 respectively.
Until recently, there was intrigue around the UD1 (0F B9) instruction, which was unofficially an example of an invalid opcode. The unofficial has recently become official.
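
The role of the second byte in AAM and AAD, mentioned above, can be illustrated with a small model. In the standard encodings D4 0A and D5 0A that byte is the base 10; the sketch below treats it as a parameter. The function names are illustrative and flag effects are omitted.

```python
def aam(al, base=10):
    """AAM imm8: AH = AL // base, AL = AL % base. Returns (AH, AL)."""
    if base == 0:
        raise ZeroDivisionError("divide error")
    al &= 0xFF
    return al // base, al % base

def aad(ah, al, base=10):
    """AAD imm8: AL = (AL + AH * base) mod 256, AH = 0. Returns (AH, AL)."""
    return 0, (al + ah * base) & 0xFF

decimal_digits = aam(79)         # the documented use: split into decimal digits
nibbles = aam(0xAB, 16)          # other base: split a byte into hex digits
recombined = aad(0xA, 0xB, 16)   # and put the digits back together
```

With a base other than 10, AAM becomes a compact byte division by a constant and AAD a small multiply-and-add.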

In the USSR the production of clones of the processors 8088 and 8086 was mastered, but they were unable to fully reproduce the 80286. Only the extended 80186 instruction system and a separate memory management chip were implemented, which should have allowed running of programs for the 80286.

Edited by Jim Tickner, BigEd and Richard BN.