Erik Eidt

Assembly Language, Machine Code, Processor Architecture

Logic Block Diagrams

Published Jun 22, 2019Last updated Jul 26, 2019

Introduction

In computer architecture, we draw diagrams of boxes and lines to illustrate the internal operations of a CPU.

A logic block is diagrammed as a box. A logic block has inputs & outputs, which are done with wires. The inputs & outputs are data, often each input or output consists of a group of data bits, such as:

1 - for a control signal, or,
5 – for a register name, or,
16 – for an immediate value, or,
64 – for integer and pointer data

Wires for integers and pointers will bundles of 64 for 64-bit architectures, and of 32 for 32-bit architectures. For simplicity, this article will focus on RV64, a 64-bit RISC V achitecture.

Wires are diagrammed as lines connecting the boxes. Logic blocks are interconnected by wires. The output of one logic block connects to the input of another logic block by a line (which can represent a single wire or a bundle of more than one wire though drawn only as a singe line). Wires are made of highly conductive metal. The speed of wires is practically instant compared to logic blocks. (While hardware engineers are concerned with the signal levels and speed of these wires on long runs, in reading logic block diagrams, we should consider the wires as instant.)

Logic blocks consist of any number of gates (some operating in parallel and some in series); in a logic block diagram, the actual gates inside a logic block ar abstracted (not shown). The flow of electrons through silicon gates if far slower than the speed of light in a vacuum. So, when stable inputs are presented to a logic block, it takes some time before valid and stable outputs appear. The speed of a logic block is relative to the number of gates operating in series (and the speed of a single gate, of course). A logic block may be associated with a duration time, e.g. in picoseconds. A given such time is how long it takes for all the outputs to be valid and stable for use as inputs to the next logic block.

Some outputs may be ready sooner, but these are typically not broken out in reporting — to do so, instead of reporting timings for each output, we would decompose the logic block into smaller blocks and still report one time for each logic block.

Wires interconnect logic blocks, flowing from an output of one to an input of another. Some of the wires represent datapaths. Other wires represent control signals. Control signals are mostly 1 bit of data, a single boolean. These typically enable or disable some output, telling a Mux to select between one input or another using that boolean control signal (a multiplexer chooses between two values, given a third value — a control signal).

All the hardware is always present. Unless advanced power saving techniques are applied, the logic blocks are all operating all the time. Some logic blocks are not applicable on some cycles, and so, control signals choose what outputs of what logic blocks to listen to, and which to ignore — rather than telling these momentarily unused blocks to do nothing.

Logic block diagrams generally flow left to right. Occasionally, there will be reverse flow, right to left. For example, MIPS and RISC V diagrams typically show two reverse flows: one for updating the PC and one for updating one of the regular registers. These reverse flows go into registers, and their timing happens both at the end of one cycle and also the beginning of the next cycle, since these two events are simultaneous and synonymous.

A cycle lasts from the rising edge of one clock signal to the rising edge of the next clock. When the rising edge happens, register inputs are captured and stored. For register capture to work, register inputs have to be valid and stable for a short duration before the rising clock edge: this is called register-write setup time.

This refers to clocked logic. There is also an unclocked logic, called asynchronous logic, though this is more complicated to work with and not a subject of this post.

Computing Timing of a Logic Block Diagram

To compute the cycle timing, we need to know the (start-time and) end-time for each block. To do that, first we identify the starting blocks and give them a start-time of zero. To identify the starting blocks, we locate any blocks that have no current-cycle inputs.

NB: the starting block will typically be a register, like the PC. The PC's input is the previous cycle's pc-next computation, so it has no current-cycle inputs. The reverse flow (right to left) is a big clue that the input comes from the previous cycle; the other big clue is that the PC is a register.

Next, we compute the end-time for the starting blocks via the simple formula endTime = startTime + blockTime. For the purposes of exercises of timing computations using these diagrams, block times will be given, most likely hypothetically.

The only way to truely compute a logic block's time is to know the composition of the block in terms of individual gates (and to know the timing of a single gate); however, that information is abstracted out of these diagrams, so needs to be given. The timing of a single gate goes to the particular process technology (nm sizes and other factors of construction) being used to make the chip.

Once we have done this with the starting block(s), the remaining blocks will all have current-cycle inputs from a preceeding logic block.

So, next we iterate as follows: We transfer the end-time of any logic blocks we know of, to the start-time of an input to another logic block (taking wire time as zero). Once all the start times of all the inputs for some logic block are known, then we can compute the start time of the logic block itself, as the latest (max) of the start-times of its inputs.

The same formula applies to compute the end time for the block (add the given logic block's duration to the block's start time).

And repeat untill all logic blocks and individual gates have (a start and) an end time.

To compute the timing of the CPU, we take the largest end-time.

At the end of the last cycle, which is also the beginning of the current (aka next) cycle, registers capture updated state that represents execution results for the last program instruction, and that updated state is used as the new statet at the start of the current program instruction cycle. Here setup time is added to the largest end-time on the reverse flow (right to left) path back to the register, as it represents the register capture time cost that happens during the virtually simulatneous ending of one cycle and beginning of the next.

Timing of specific Instructions

To compute the timing of an individual instruction though the CPU, we need to identify which logic blocks in particular, are used by the given instruction.

The following discusses the logic blocks for a relatively uncomplicated RISC V CPU, and whether they are applicable to various instructions (if at all).

All instruction use the PC (program counter) to identify the memory location to read (fetch) an instruction using the Instruction Memory. All instructions use the Control block to send control signals around the chip.

One quality that diff;erentiates instructions is whether a given instruction is capable of non-sequential flow or not. For instructions only capable of sequential flow, the PC + 4 Adder datapath is applicable &#8212 this is the case forall computational instructions such as add, addi, as well as loads and stores.

For instructions only capable of non-sequential flow, the alternate PC-next datapaths are applicable — for RISC V, the instructions with this quality are jal and jalr, the unconditional branch instructions.

For instructions capable of dynamically choosing between sequential flow and alternate flow — called conditional branches — both both datapaths (the one for sequential flow and the one for non-sequential flow) are applicable, and one or the other is dynamically chosen based on a comparison done within the main ALU.

Loads and stores use the Data memory unit — on MIPS and RISC V no other instructions use the Data Memory unit, and this is typical of RISC processors (but not necessarily of other kinds of processors). (In terms of instruction set architecture, this is called a load/store architecture.)

Immediate instructions use the sign extension facility (I-Types).

Most instructions read at least one register; some read two registers; some instructions write a register (though all instructions update the PC).

Let's imagine that we're given logic block duration times as follows, in picoseconds:

Logic Operation	Time in Picoseconds
read the PC	25
Adder	100
Instrution Memory (fetch)	180
Control block (lookup)	50
Register file read registers	140
Extract and sign extend immediate	60
Mux	15
ALU (max)	200
1 logic gate	5
Data Memory (read or write)	210
Register-write setup	20

We'll use the following diagram (see footnote at end for attribution of the unannotated original version of this datapath diagram):

Logic Block Diagrams Figure 4.17.jpg

#1: This wire holds pc-next, the value that feeds the PC register. Note that this input to this register uses reverse flow, coming from the right of the diagram, and going to its left. This input is computed by the previous cycle, and captured in the PC at the start of the current cycle. The cycle time for the whole processor can be no faster than: the time for to compute the input plus setup time. This wire is 64-bits wide (for RV64).

#2: This logic block is the PC, or program Counter. It is a register, and on RISC V it is 64-bits wide (for RV64). The PC has no current-cycle inputs because #1 is computed in the previous cycle. Therefore, having no inputs, it is the starting logic block. It's start time is therefore 0, and end-time is 0+25 (where the 25 is given in the above table).

#3: This constant, 4, also has no inputs, so its start time is zero. It is hardwired so its duration time is also zero, and hence end time is zero.

#4: This is the main adder that computes sequential program control flow. It is used by all instructions that do not branch, plus conditional branches, which may or may not branch (as if they don't branch, then they use sequential flow). Its start time is the maximium time of its two inputs, so, max(0,25), i.e., 25.

#5: PC value wire, as input to Adder that computes pc-relative offset as branch target for unconditional and conditional branchs.

#6. Instruction Memory logic block, has as input memory address, provides as output the instruction, which 32-bits (for both RV64 and RV32). Starts at 25, ends at 25+180.

#7. Control logic block, has as input the major opcode of a RISC V instruction, which is located in the least significant 7 bits of the instruction word. Outputs a number of 1 bit control signals, and a bundle of signals called ALUOp, which tell the ALU control to use a paricular ALU op (mostly for I-Type instructions) -or- whether to source the ALU op from another part of the instruction (like funct3) (for R-Type instructions).

#8. Registers (aka Register File) holds 31 64-bit registers. Let's note that since all of the inputs to this block come from the same source (the Instruction Memory #6), the start time for this is 205.

#9. Immediate Generation, which asembles a 12- to 20-bit immediate from various locations in the instruction, and sign extends to 64 bits immediate. Used by I-, S-, SB-, U-, and UJ-Type, and instruction, such as "addi" and "beq".&nbs; (This unit takes all 32-bits of the instruction: it needs the major opocde from bit 6-0 in order to determine what kind of immediate to make, and the other bits hold the actual immediate data, which will need to be assembled.) Its start time is also 205.

#10. Shifter for branch target immediate computation. RISC V allows variable length instructions: the base instruction set is 32-bits per instruction; if the Compact Extension is supported, it allows 16-bit instructions. Further extensibility allows 48- and 64-bit instructions as well. Given this, the PC always stays on at least a 16-bit boundary. As a result, the low bit of the PC and immediate are always zero. Since that is the case, the low always-zero bit is not stored in the immediate for branch instructions, and, this is used to gain one extra bit of range in the values of branch immediates. That then requires a shift by 1 here; however, this is a very simple shift operation, as it is a constant shift amount. I believe that in hardware this shift operation is done simply via wiring rather than using a shifter at all. Wire for bit 0 of the immediate is wired to bit 1 of the add input, (bit 1 to bit 2, etc...) while bit 0 of the add input is hardwired to zero. So, the shift is accomplished by wiring rather than by a logic block. As a result, no timing is generally given for this logic block and we take it to be zero.

#11. PC-relative immediate adder, computes the address of taken branch target. Take two inputs, one is ready at time 25, and the other at time 265, therefore, start time is 265.

#12. Mux in front of ALU. This Mux is used to switch operand 2 going into the ALU from taking a register data value — as necessary for R-Type instructions and conditional branches, vs. taking an immediate value — as necessary for immediate instructions, like "addi".

#13. Auxillary ALU control. This determines whether to pass the value of the ALUOp received from control through unchanged, or compute a new ALUOp based on the subopcode, in particular, for R-Type instructions, since the major opcode (6-0) for all R-Type instructions is the same value (two bits saying 32-bit instruction and the rest are zero) simply indicating R-Type instruction. The real ALUOp for R-Type is generated from funct3 and other bits in the R-Type instruction. I never see this unit's time being reported, so assume its time is taken into account by the other Control block #7. If it were broken out, then that could be accounted for here, but even modest time here would not increase the minimum cycle time: the time for ALU start is dominated by the Registers (#8) and pre ALU Mux (#12).

#14. ALU, computes an output from two sources and a control signal that indicates what kind of ALU operation to perform on the sources.

#15. This wire is the value of pc-next for taken branches, which alter program control flow (from the default of sequential program control flow). For taken branches, the next Mux #16 will select this value.

#16. This is an AND gate. It helps compute whether the Mux (#17) should take a branch (pc+immediate) or use sequential program flow control (pc+4). We compute the time for the AND gate from the timing of the latest arriving input, which is 560, the "Zero" control signal from the ALU for conditional branches. So we give this gate a start time of 560, and end time of 565 (5 is the given time for a single gate).

#17: This Mux takes the top input (labeled logic 0) to engage sequential program control flow, and the bottom input (labeled logic 1) to engage the alternate flow control path (i.e. the taken branch next-pc value from #15). It takes three inputs: the top input (pc-next = pc+4) is ready at 125ps, whereas the bottom input is ready at 365ps. However, the control input is ready at 565, so that is the start time for this block.

#18. Data Memory, given an address and length can read a value from memory, or given an address, length, and value, can write a value into memory. In the case of write, the control signals are available at time 255, the Address at time 560, and the Write Data at time 345. Therefore its start time is 560. For reads, it provides an output, a value read from memory, and this output is ready at start + 210.

#19. MemToReg Mux, chooses whether the ALU value or the Read Data value is placed on its output that feeds back into Write Data at the Registers (#8). For load instructions, this selects the Read Data, and for other instructions like add or addi, selects the ALU result that bypasses the Data Memory. For some instructions (those that don't target a register to write, i.e. RegWrite is 0), the control signal value MemtoReg doesn't matter.

So, the datapath back to the PC is 580ps, while the datapath back to the Registers for register update is 785ps. Add register-write setup time of 20 and we get 805 as the picoseconds for the minimum time for one cycle, and 1 cycle / 805 ps = 1 cycle / (805*10^-12) = 1 cycle / (8.05*10^-10) = 1.24*10^9, or this processor with the above timings can run at up to a maximum of 1.24 gigahertz.

If we like we can expand the timing table a bit, going from kinds of logic blocks to specific logic block in the diagram; here that means replicating a Mux entry for each of the three Mux's on the datapath diagram. Same for Adders as there are two of them. This expanded table would be then easy to mark an "x" for each logic block that is used by an instruction.

To compute the timing of a particular instruction inpendent of the rest, ignore the logic blocks that the instruction doesn't use, e.g. for "add", an R-Type instruction, compute the above as if Data Memory didn't exist.

Note: Thank you for "The single cycle datapath diagram" to the authors and publisher of:

Computer Organization and Design RISC-V Edition:
The Hardware Software Interface
(The Morgan Kaufmann Series in Computer Architecture and Design)
Figure 4.17 The simple datapath with control unit. p 508
ISBN-13: 978-0128122754
See the book at Elsevier

Computer architecture Risc V Logic block diagrams

Report

Enjoy this post? Give Erik Eidt a like if it's helpful.

Erik Eidt

Assembly Language, Machine Code, Processor Architecture

I am no longer taking new students for mentoring sessions. However, you are welcome to ask me questions of any depth in chat and I'll answer as I can. I am an assembly language expert, having mentored recently in MIPS, LC-3&4, M...

Discover and read more posts from Erik Eidt

get started