Proposal for a 64-bit NOSC Forth CPU

Andy Valencia
vandys@vsta.org

0. Introduction

Charles Moore and others have designed and even constructed a range of
CPUs based on a Forth-ish machine architecture characterised by no
operands and a minimal instruction set.  As a part of the author's
experimental software work, a body of code has been created which is
designed to capitalize on the unique capabilities of both a 64-bit
address space and 64-bit data types, while still being coded in the
traditional Forth style.

This document describes a mapping of the NOSC machine architecture to
64 bits, adapting it where appropriate to the unique properties of a
64-bit address space, but preserving a basic instruction set
compatible with Forth.

1. Basic machine instruction format

    0         1         2         3         4         5         6
    0123456789012345678901234567890123456789012345678901234567890123
    /---\/---\/---\/---\/---\/---\/\/------------------------------\
     op1  op2  op3  op4  op5  op6 o            Literal
                                  p
                                  7

op1 through op7 are each conceptually 5-bit opcodes; op7 physically
holds only 2 bits, which are padded with 3 trailing 0's.  As in
previous NOSC architectures, opcodes are grouped in numeric ranges,
with the most common opcodes placed at the beginnings of the ranges
(multiples of 8) so that the 2 bits of op7 can still encode them.

2. Basic machine model of execution

Memory is addressed to byte granularity, and c@/c! are included in
this architecture due to the extensive use of character processing in
most information servers.  However, any data unit must be addressed on
its natural, aligned boundary, so that the address for a @ must be an
aligned 64-bit value.  The exception to this uniform 64-bit addressing
is execution addresses as encoded in instructions; branches and calls
using the "Literal" field shift this 32-bit field's value left by 3
bits, permitting an addressable code space of a full 2^32 64-bit
instructions (34,359,738,368 bytes, or 32 gigabytes).

At any given time, the CPU is executing the current 64 bits of
instruction, and also records the address of the next instruction to
execute.  When a break in the memory access pipeline is detected, the
CPU issues a fetch for the contents of this address, recording the
fetched word alongside the address itself.  In the degenerate case,
when the current instruction consumes the memory access pipeline until
its completion, the CPU will stall until the fetch of the next
instruction word is completed.

This behavior has several notable attributes.  On initial execution of
the current instruction at PC, the "next address" register is PC+1,
and as opportunity occurs, this word will be fetched.  As the
instruction word is executed, a branch, for instance, may occur,
overwriting the "next address" register and discarding any fetch
related to its previous value (and initiating the CPU's wait for a
memory cycle during which to fetch the contents of this new value).
Thus, control flow words only take effect at the end of the
instruction; after a branch or call is chosen, the CPU may very well
have useful work which can be achieved in the subsequent slots of the
current instruction word.
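As an illustration of the field layout and the branch-target scaling
described above, here is a minimal sketch in ANS Forth.  It assumes a
64-bit cell and that bit 0 in the diagram is the most significant bit
of the instruction word; the word names are illustrative only and are
not part of the proposed instruction set.

    hex
    : 5bits   ( insn u -- op )   rshift 1F and ;
    : op1     ( insn -- op )   3B 5bits ;    \ diagram bits 0-4
    : op2     ( insn -- op )   36 5bits ;
    : op3     ( insn -- op )   31 5bits ;
    : op4     ( insn -- op )   2C 5bits ;
    : op5     ( insn -- op )   27 5bits ;
    : op6     ( insn -- op )   22 5bits ;
    : op7     ( insn -- op )   20 rshift 3 and 3 lshift ;
                                 \ 2 bits, low 3 bits forced to 0
    : lit32   ( insn -- lit )  FFFFFFFF and ;
    : target  ( insn -- addr ) lit32 3 lshift ;
                                 \ instruction index -> byte address
    decimal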
3. Instruction groups

3.1 Branch and execution

branch      Set "next instruction" address from Literal

execute     Push "next instruction" address to the return stack;
            set "next instruction" address from Literal

?branch     Set "next instruction" address from Literal if the top
            of the operand stack is 0; pop the operand stack

3.2 Stack operations

>r  r>  dup  2dup  swap  rot  drop  -rot

literal     Push Literal to the operand stack, with its top bit
            sign-extended through the top 32 bits

literalHi   Replace the top 32 bits of the top of the operand stack
            with Literal

3.3 Memory Access

@           Fetch from address

@+          Fetch from address, but leave the address adjusted by
            cell+ under the fetched data

!           Store to address

!+          Store to address, popping the value but leaving the
            address adjusted by cell+

c@  c@+  c!  c!+
            Byte-granularity forms of the above
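The adjusting forms can be given reference semantics in ANS Forth.
The following is only a sketch of the intended stack effects; on the
proposed CPU each of these is a single opcode, not a colon definition.

    : @+   ( a -- a' x )   dup cell+ swap @ ;
    : !+   ( x a -- a' )   tuck ! cell+ ;
    : c@+  ( a -- a' c )   dup char+ swap c@ ;
    : c!+  ( c a -- a' )   tuck c! char+ ;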
3.4 Arithmetic Operations

+  not  1+  2/  2*  +*  or  and

(1+ is TBD: since it can be done with literal 1 followed by +, is a
dedicated opcode justified?  Synthesizing it costs the use of the
"Literal" slot to get that 1....)

4. Memory reference architecture

4.1 Background

In a 32-bit architecture, system designers still have some hope of a
cache hit rate in the upper 90th percentile, due to the ratio of cache
RAM to system RAM afforded by the state of the art.  This ratio is
almost impossible to achieve when the size of a data set requires 64
bits of addressing.  In the degenerate case, every reference into the
64-bit data set is a cache miss, and thus a memory hierarchy and cache
burst fills can actually do more harm than good.

On the other hand, most 64-bit applications have a partitioning of
memory references which maps well to a NUMA-like organization of
memory.  Code space, for instance, will generally be sized along the
lines of a 32-bit application, as will system-level data structures
such as user areas, code dictionaries, and the user heap.  This
suggests (although this suggestion should only be the basis for
verification through experimentation) that the physical memory space
should be partitioned, with, say, the low 4 gigabytes being inherently
cacheable, and all upper memory non-cacheable.  Or perhaps even that
any memory in the low 4 gigabytes simply be high speed SRAM.  Thus, a
system might have 4 megabytes of SRAM based at 0, and then further
gigabytes of DRAM located above 4 gigabytes.

In a 64-bit physical memory addressing space, the state of the art
suggests that a memory configuration of some significant fraction of a
terabyte (if your memory is not sized around here, why are you
bothering with 64 bits in the first place?) would be built up from
parts sized somewhere between 1 and 4 gigabytes.  The fan-out from CPU
to this many parts suggests a multi-level hierarchy of decoding and
bus transceivers, which further implies latency.  Despite this
latency, the actual parts are capable of significant bandwidth, and
the combination of latency and bandwidth strongly suggests that the
memory reference architecture support pipelining.

4.2 A Proposed Memory Reference Architecture

One approach would be to have explicit instructions for interacting
with the large, high latency portion of the memory space.  Such a lack
of orthogonality has costs which suggest that the behavior of an
orthogonal memory reference architecture be explored first.

In this proposed architecture, Charles Moore's A (and, in some
designs, B) memory access register has been omitted.  While this
approach was a nice fit for the P21, once latency climbs at the same
time that significant memory bandwidth utilisation is expected, the
single memory register becomes an insurmountable bottleneck.

Thus, the return to operand stack-based memory accesses is proposed
not simply for Forth orthodoxy, but because it permits memory access
transactions to be tied to distinct operand stack cells.
Conceptually, a memory operation can be turned into a split
transaction to the memory controller, with a transaction ID being
associated with that operand stack cell.  This ID, being tied to the
cell, permits "swap" to move the pending memory access "out of the
way" so that other calculations can be undertaken in the window of
this latency.  By permitting multiple outstanding memory transactions,
a pipeline of memory references can be built, with the CPU staying
busy rather than stalling on each reference.  The CPU stalls only when
a calculation references an operand stack cell which has not yet
received its value.

With this style of access, it's even possible to make speculative
memory references, "drop"'ing the result of the speculative reference
if calculations so indicate.  For instance, in a binary tree the tree
walker could reference the "value", and then both the "left" and
"right" pointers, after which it could compare the "value" with its
search key, and then drop whichever of "left" or "right" was not
needed after all--but benefiting from the concurrency of the memory
fetches and the key/value comparison.  (A sketch of such a walker
appears at the end of this section.)

To capitalize on this type of memory reference architecture, it is
necessary that stack operations not block on a pending memory
reference, but rather move the pending state from one stack position
to another.  This applies not only to "dup" and "rot", but even to
">r" and "r>".  This probably implies that all the stack cells are
connected to a small memory transaction ID bus, and that all cells
compare their pending memory transaction ID (if any) to the bus
content each time a memory completion is strobed onto the bus from the
memory controller.  While this comparison will block the CPU from
executing an instruction during that interval, the improved efficiency
of memory accesses will more than repay the cost of this interference.
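Here is a sketch of one step of that speculative tree walk, in ANS
Forth, assuming a node layout of value, left, and right pointers in
three consecutive cells; the word name and layout are illustrative.
On the proposed CPU, each @ below issues a split transaction: the
comparison stalls only until "value" arrives, while the "left" and
"right" fetches remain in flight and the unneeded one is simply
dropped.

    : walk-step  ( key node -- key node' )
       dup @                   \ issue fetch of value
       over cell+ @            \ issue fetch of left
       2 pick 2 cells + @      \ issue fetch of right
       4 pick 3 pick < if      \ key < value?
          drop                 \ take left; drop right, still in flight
       else
          nip                  \ take right; drop left, still in flight
       then
       nip nip ;               \ discard value and the old node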
5. Interlock Architecture

In this architecture, it is proposed that all dependencies (ripple
carry, memory references, etc.) be enforced with interlocks where
needed in order to preserve correct operation.  Many CPU designs have
been proposed over the years predicated upon software being crafted
for the unique CPU, replacing CPU enforcement with software techniques
which avoid the windows where CPU results are not yet available.  In
the author's experience, such designs are barely feasible even for the
first instance of the design.  Software, not having access to the full
micro-state of the CPU, inevitably has to make pessimistic
assumptions, discarding much of the performance actually available.
By the second revision of the chip, pessimism intersecting with the
moving target of the (sometimes poorly understood) CPU requirements
makes this approach almost unlivable.

6. Exceptional execution flow

This section discusses mechanisms in traditional CPUs where the flow
of execution is changed due to factors other than explicit instruction
control.

6.1 Interrupts

No allowance is made in this design for interrupts.  The expectation
is that I/O controllers will buffer or DMA data, and the CPU will poll
status registers at a suitable rate.  If a particular external
interface is too demanding for this approach, a front end processor
can presumably be thrown at the problem (the F21 immediately comes to
mind!).

6.2 Exceptions

If traps are needed for errors such as divide by 0, the division
implementation will provide this explicitly in software (see the
sketch at the end of this document).

6.3 Page faults

This 64-bit design does not have any memory translation, thus all
memory addresses are legal to generate.  While memory range
enforcement and traps are useful in some modes of software debugging,
their presence has such an overarching effect that they are not
included in this architecture.  Some aspects of this debugging support
can still be achieved by adding additional logic to the system board.
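As an illustration of the software-provided trap suggested in section
6.2, here is a sketch of a checked division in ANS Forth; -10 is the
standard THROW code for "division by zero".  The name /checked is
illustrative only.

    : /checked  ( n1 n2 -- quot )
       dup 0= if -10 throw then  / ;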