Proposal for a 64-bit NOSC Forth CPU

Andy Valencia
vandys@vsta.org

0. Introduction

Charles Moore and others have designed and even constructed a range of
CPUs based on a Forth-ish machine architecture characterised by no
operands and a minimal instruction set.  As a part of the author's
experimental software work, a body of code has been created which is
designed to capitalize on the unique capabilities of both a 64-bit
address space and 64-bit data types, while still being coded in the
traditional Forth style.

This document describes a mapping of the NOSC machine architecture to
64 bits, adapting it where appropriate to the unique properties of a
64-bit address space, but preserving a basic instruction set
compatible with Forth.

1. Basic machine instruction format

    0         1         2         3         4         5         6
    0123456789012345678901234567890123456789012345678901234567890123
    /---\/---\/---\/---\/---\/---\/\/------------------------------\
     op1  op2  op3  op4  op5  op6 o            Literal
                                  p
                                  7

op1 through op7 are each conceptually 5-bit opcodes; op7 physically
holds only 2 bits, which are padded with 3 trailing 0's.  As in
previous NOSC architectures, opcodes are grouped in numeric ranges,
with the most common opcodes placed at the beginnings of the ranges
(multiples of 8) so that the 2 bits of op7 can still encode them.

2. Basic machine model of execution

Memory is addressed to byte granularity, and c@/c! are included in
this architecture due to the extensive use of character processing in
most information servers.  However, any data unit must be addressed on
its natural, aligned boundary, so that the address for a @ must be an
aligned 64-bit value.  The exception to this uniform 64-bit addressing
is execution addresses as encoded in instructions; branches and calls
using the "Literal" field shift this 32-bit field's value left by 3
bits, permitting an addressable code space of a full 2^32 64-bit
instructions (34,359,738,368 bytes, or 32 gigabytes).

At any given time, the CPU is executing the current 64 bits of
instruction, and also records the address of the next instruction to
execute.  When a break in the memory access pipeline is detected, the
CPU issues a fetch for the contents of this address, recording the
fetched word alongside the address itself.  In the degenerate case,
when the current instruction consumes the memory access pipeline until
its completion, the CPU will stall until the fetch of the next
instruction word is completed.

This behavior has several notable attributes.  On initial execution of
the current instruction at PC, the "next address" register is PC+1,
and as opportunity occurs, this word will be fetched.  As the
instruction word is executed, a branch, for instance, may occur,
overwriting the "next address" register and discarding any fetch
related to its previous value (and initiating the CPU's wait for a
memory cycle during which to fetch the contents of this new value).
Thus, control flow words only take effect at the end of the
instruction; after a branch or call is chosen, the CPU may very well
have useful work which can be achieved in the subsequent slots of the
current instruction word.
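As an illustration of the field layout and the branch-target scaling
described above, here is a minimal sketch in ANS Forth.  It assumes a
64-bit cell and that bit 0 in the diagram is the most significant bit
of the instruction word; the word names are illustrative only and are
not part of the proposed instruction set.

    hex
    : 5bits   ( insn u -- op )   rshift 1F and ;
    : op1     ( insn -- op )   3B 5bits ;    \ diagram bits 0-4
    : op2     ( insn -- op )   36 5bits ;
    : op3     ( insn -- op )   31 5bits ;
    : op4     ( insn -- op )   2C 5bits ;
    : op5     ( insn -- op )   27 5bits ;
    : op6     ( insn -- op )   22 5bits ;
    : op7     ( insn -- op )   20 rshift 3 and 3 lshift ;
                                 \ 2 bits, low 3 bits forced to 0
    : lit32   ( insn -- lit )  FFFFFFFF and ;
    : target  ( insn -- addr ) lit32 3 lshift ;
                                 \ instruction index -> byte address
    decimal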
3. Instruction groups

3.1 Branch and execution

branch      Set "next instruction" address from Literal

execute     Push "next instruction" address to the return stack;
            set "next instruction" address from Literal

?branch     Set "next instruction" address from Literal if the top
            of the operand stack is 0; pop the operand stack

3.2 Stack operations

>r  r>  dup  2dup  swap  rot  drop  -rot

literal     Push Literal to the operand stack, with its top bit
            sign-extended through the top 32 bits

literalHi   Replace the top 32 bits of the top of the operand stack
            with Literal

3.3 Memory Access

@           Fetch from address

@+          Fetch from address, but leave the address adjusted by
            cell+ under the fetched data

!           Store to address

!+          Store to address, popping the value but leaving the
            address adjusted by cell+

c@  c@+  c!  c!+
            Byte-granularity forms of the above
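The adjusting forms can be given reference semantics in ANS Forth.
The following is only a sketch of the intended stack effects; on the
proposed CPU each of these is a single opcode, not a colon definition.

    : @+   ( a -- a' x )   dup cell+ swap @ ;
    : !+   ( x a -- a' )   tuck ! cell+ ;
    : c@+  ( a -- a' c )   dup char+ swap c@ ;
    : c!+  ( c a -- a' )   tuck c! char+ ;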
3.4 Arithmetic Operations

+  not  1+  2/  2*  +*  or  and

(1+ is TBD: since it can be done with literal 1 followed by +, is a
dedicated opcode justified?  Synthesizing it costs the use of the
"Literal" slot to get that 1....)

4. Memory reference architecture

4.1 Background

In a 32-bit architecture, system designers still have some hope of a
cache hit rate in the upper 90th percentile, due to the ratio of cache
RAM to system RAM afforded by the state of the art.  This ratio is
almost impossible to achieve when the size of a data set requires 64
bits of addressing.  In the degenerate case, every reference into the
64-bit data set is a cache miss, and thus a memory hierarchy and cache
burst fills can actually do more harm than good.

On the other hand, most 64-bit applications have a partitioning of
memory references which maps well to a NUMA-like organization of
memory.  Code space, for instance, will generally be sized along the
lines of a 32-bit application, as will system-level data structures
such as user areas, code dictionaries, and the user heap.  This
suggests (although this suggestion should only be the basis for
verification through experimentation) that the physical memory space
should be partitioned, with, say, the low 4 gigabytes being inherently
cacheable, and all upper memory non-cacheable.  Or perhaps even that
any memory in the low 4 gigabytes simply be high speed SRAM.  Thus, a
system might have 4 megabytes of SRAM based at 0, and then further
gigabytes of DRAM located above 4 gigabytes.

In a 64-bit physical memory addressing space, the state of the art
suggests that a memory configuration of some significant fraction of a
terabyte (if your memory is not sized around here, why are you
bothering with 64 bits in the first place?) would be built up from
parts sized somewhere between 1 and 4 gigabytes.  The fan-out from CPU
to this many parts suggests a multi-level hierarchy of decoding and
bus transceivers, which further implies latency.  Despite this
latency, the actual parts are capable of significant bandwidth, and
the combination of latency and bandwidth strongly suggests that the
memory reference architecture support pipelining.

4.2 A Proposed Memory Reference Architecture

One approach would be to have explicit instructions for interacting
with the large, high latency portion of the memory space.  Such a lack
of orthogonality has costs which suggest that the behavior of an
orthogonal memory reference architecture be explored first.

In this proposed architecture, Charles Moore's A (and, in some
designs, B) memory access register has been omitted.  While this
approach was a nice fit for the P21, once latency climbs at the same
time that significant memory bandwidth utilisation is expected, the
single memory register becomes an insurmountable bottleneck.

Thus, the return to operand stack-based memory accesses is proposed
not simply for Forth orthodoxy, but because it permits memory access
transactions to be tied to distinct operand stack cells.
Conceptually, a memory operation can be turned into a split
transaction to the memory controller, with a transaction ID being
associated with that operand stack cell.  This ID, being tied to the
cell, permits "swap" to move the pending memory access "out of the
way" so that other calculations can be undertaken in the window of
this latency.  By permitting multiple outstanding memory transactions,
a pipeline of memory references can be built, with the CPU staying
busy rather than stalling on each reference.  The CPU stalls only when
a calculation references an operand stack cell which has not yet
received its value.

With this style of access, it's even possible to make speculative
memory references, "drop"'ing the result of the speculative reference
if calculations so indicate.  For instance, in a binary tree the tree
walker could reference the "value", and then both the "left" and
"right" pointers, after which it could compare the "value" with its
search key, and then drop whichever of "left" or "right" was not
needed after all--but benefiting from the concurrency of the memory
fetches and the key/value comparison.  (A sketch of such a walker
appears at the end of this section.)

To capitalize on this type of memory reference architecture, it is
necessary that stack operations not block on a pending memory
reference, but rather move the pending state from one stack position
to another.  This applies not only to "dup" and "rot", but even to
">r" and "r>".  This probably implies that all the stack cells are
connected to a small memory transaction ID bus, and that all cells
compare their pending memory transaction ID (if any) to the bus
content each time a memory completion is strobed onto the bus from the
memory controller.  While this comparison will block the CPU from
executing an instruction during that interval, the improved efficiency
of memory accesses will more than repay the cost of this interference.
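Here is a sketch of one step of that speculative tree walk, in ANS
Forth, assuming a node layout of value, left, and right pointers in
three consecutive cells; the word name and layout are illustrative.
On the proposed CPU, each @ below issues a split transaction: the
comparison stalls only until "value" arrives, while the "left" and
"right" fetches remain in flight and the unneeded one is simply
dropped.

    : walk-step  ( key node -- key node' )
       dup @                   \ issue fetch of value
       over cell+ @            \ issue fetch of left
       2 pick 2 cells + @      \ issue fetch of right
       4 pick 3 pick < if      \ key < value?
          drop                 \ take left; drop right, still in flight
       else
          nip                  \ take right; drop left, still in flight
       then
       nip nip ;               \ discard value and the old node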
5. Interlock Architecture

In this architecture, it is proposed that all dependencies (ripple
carry, memory references, etc.) be enforced with interlocks where
needed in order to preserve correct operation.  Many CPU designs have
been proposed over the years predicated upon software being crafted
for the unique CPU, replacing CPU enforcement with software techniques
which avoid the windows where CPU results are not yet available.  In
the author's experience, such designs are barely feasible even for the
first instance of the design.  Software, not having access to the full
micro-state of the CPU, inevitably has to make pessimistic
assumptions, discarding much of the performance actually available.
By the second revision of the chip, pessimism intersecting with the
moving target of the (sometimes poorly understood) CPU requirements
makes this approach almost unlivable.

6. Exceptional execution flow

This section discusses mechanisms in traditional CPUs where the flow
of execution is changed due to factors other than explicit instruction
control.

6.1 Interrupts

No allowance is made in this design for interrupts.  The expectation
is that I/O controllers will buffer or DMA data, and the CPU will poll
status registers at a suitable rate.  If a particular external
interface is too demanding for this approach, a front end processor
can presumably be thrown at the problem (the F21 immediately comes to
mind!).

6.2 Exceptions

If traps are needed for errors such as divide by 0, the division
implementation will provide this explicitly in software (see the
sketch at the end of this document).

6.3 Page faults

This 64-bit design does not have any memory translation, thus all
memory addresses are legal to generate.  While memory range
enforcement and traps are useful in some modes of software debugging,
their presence has such an overarching effect that they are not
included in this architecture.  Some aspects of this debugging support
can still be achieved by adding additional logic to the system board.
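As an illustration of the software-provided trap suggested in section
6.2, here is a sketch of a checked division in ANS Forth; -10 is the
standard THROW code for "division by zero".  The name /checked is
illustrative only.

    : /checked  ( n1 n2 -- quot )
       dup 0= if -10 throw then  / ;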