
Pipelining and Hazards

Published Jul 04, 2019 · Last updated Jul 05, 2019

Pipelining is an approach that allows a faster cycle time.  We'll examine a classic 5-stage pipeline in the style of MIPS and RISC-V.

Pipelining works by splitting the work for a single instruction into multiple smaller pieces, each of which runs more quickly because there is less to do per piece than when all the work for an instruction happens in a single cycle.  That means we can run the clock faster, at the cost of having an instruction take multiple cycles.
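To make the cycle-time benefit concrete, here's a small sketch comparing a single-cycle design to a pipelined one.  The stage latencies are made-up illustrative numbers, not from any real design:

```python
# Hypothetical stage latencies in nanoseconds (illustrative only).
stage_latency_ns = {"IF": 1.0, "ID": 0.8, "EX": 1.0, "MEM": 1.2, "WB": 0.5}
latch_overhead_ns = 0.1  # pipeline register between stages

# Single-cycle design: the clock must cover all the work at once.
single_cycle_ns = sum(stage_latency_ns.values())

# Pipelined design: the clock only needs to cover the slowest stage,
# plus the pipeline-register overhead.
pipelined_cycle_ns = max(stage_latency_ns.values()) + latch_overhead_ns

print(round(single_cycle_ns, 2))     # 4.5
print(round(pipelined_cycle_ns, 2))  # 1.3 -- roughly 3.5x faster clock
```

Note that the pipelined clock is limited by the slowest stage, which is why designers try to balance the stages.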

The pieces of an instruction are broken into stages that operate sequentially for a single instruction, so that a 5 deep pipeline will take 5 cycles to complete an instruction.  However, execution of one instruction — and the next, and the next — overlap, so that (ideally) one instruction enters the pipeline every cycle, and one exits as well. 

Due to this overlap, one instruction completes (ideally) every cycle, and so the processor's cycles per instruction (CPI) is closer to 1 than the 5 we might expect.  Execution of a single instruction is spread over 5 cycles, but the executions of about 5 instructions overlap at any one time.
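The ideal CPI arithmetic can be sketched as follows, assuming no stalls at all:

```python
def effective_cpi(num_instructions, pipeline_depth=5):
    # With no stalls, the first instruction takes `pipeline_depth` cycles
    # to complete, and each subsequent instruction completes one cycle later.
    total_cycles = pipeline_depth + (num_instructions - 1)
    return total_cycles / num_instructions

print(effective_cpi(5))     # 1.8
print(effective_cpi(1000))  # 1.004 -- approaches 1 as the run gets longer
```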

However, because the execution of a single instruction is stretched out over time, this introduces the concept of a hazard, which occurs when the end result value of a preceding instruction isn't yet ready for a succeeding instruction to use.  As all the instructions are stretched out in time, their overlap can cause issues when there are dependencies.


Before we get into a specific pipeline diagram example, let's discuss different approaches to diagramming or illustrating pipelines.  We have 3 fundamental components we're trying to show:

  • time — which increases,
  • instructions — which also increase, and,
  • pipeline stages — which are fixed.

One popular approach to pipeline diagramming showing the relationship between these three concepts is the following:

Pipeline 2.jpg

The primary strength of this approach is that it follows a familiar horizontal timeline approach, where time increases as we go left to right.

However, this approach has the significant drawback that both the x-axis and the y-axis expand, as indicated by the arrows on both axes.  If we take this out to even 50 instructions, the diagram extends past column 50 and row 50!  This approach is not scalable to hundreds (let alone thousands) of instructions.

In doing a research project, we might capture traces of thousands to millions of instructions, and we would want to employ best practices in data modeling, e.g. using a database.  With a database, we would gravitate toward fixed columns and let the rows expand to capture the variable information (length of time and/or number of instructions).

For these reasons, I use the following approach in these diagrams:

Pipeline 3.jpg

The rows expand as needed to account for time, whereas the columns are fixed to represent the fixed hardware pipeline stages.

(I have chosen this as the best format; however, there are still yet other approaches, of course.  For example, we can use fixed columns for the pipeline stages, and have instructions for the row dimension.  The cells then hold time information.  All the diagrams I've discussed are equivalent from the point of view of the information that is conveyed.)
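As a sketch of the row-per-cycle format, the following prints fixed stage columns (using the classic IF/ID/EX/MEM/WB stage names) with one row per clock cycle; the cell holds whichever instruction occupies that stage at that time:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def diagram(num_instructions):
    # Instruction i occupies stage s at cycle i + s (no stalls).
    rows = []
    num_cycles = len(STAGES) + num_instructions - 1
    for t in range(num_cycles):
        row = []
        for s in range(len(STAGES)):
            i = t - s
            row.append(f"i{i}" if 0 <= i < num_instructions else "  ")
        rows.append(row)
    return rows

for t, row in enumerate(diagram(3)):
    print(f"t+{t}: " + " | ".join(row))
```

The rows grow with the trace length while the columns stay fixed, which is exactly the database-friendly shape discussed above.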


Let's get into a specific example.  A traditional MIPS-style 5 stage pipeline has the following stages:

| Short Name | Expansion | Inputs | Blocks | Outputs |
|---|---|---|---|---|
| IF | Instruction Fetch | PC | PC, Instruction Memory | Instruction word |
| ID | Instruction Decode | Instruction word | Control, Registers | Control signals, Register values |
| EX | Execution | Register values | ALU | ALU result |
| MEM | Memory | Memory address | Data Memory | Memory value |
| WB | Write Back | Value to write | Registers | n/a |
  • IF — fetches an instruction from Instruction Memory using the PC for the address of the instruction word to read
  • ID — decodes the instruction word into control signals, and uses register names in the instruction word to look up register values
  • EX — performs some arithmetic operation, like ADD or AND
  • MEM — accesses data memory for LOAD or STORE instructions
  • WB — writes output values into registers

The WB stage completes the execution of an instruction.  There is no other output, since there is no next stage.  Successive program instructions communicate with each other via the (values in) registers and memory.  The WB stage is usually fairly simple in terms of its internal logic.

You will have to study the specific architecture of the processor used in your course to get correct answers.  For example, in the example below I have assumed that the register unit is internally capable of its own forwarding, so that it forwards any write-back value to its outputs in the same cycle.  If this is not the case, then an additional hazard exists.  See the bubble diagram for more on this.

Here's an example:
Pipeline 1.jpg

**1  At the end of this cycle (the one that starts at t+3, where instr 2: is in EX), the value of a2+64 comes out of the ALU, but it isn't in register a0 yet!  Also, at the end of this same cycle, instr 3: in ID has read stale/bad values for a0.

**2  Let's observe that the register write (write back) of a0 for instr 2: doesn't happen until the WB stage at t+5.  Thus, the register is finally up to date at the start of t+5 (with internal register forwarding, or t+6 without), 2 cycles too late for instr 3:'s ID stage that starts at t+3, where it reads register a0.

Without a bypass the processor would have to stall to avoid a hazard — this stall is also called a bubble in the pipeline.  The exact nature of the bubble depends on how the processor detects the hazard.

(Some processors don't detect them at all, requiring assembly language programmers and compilers to insert NOP instructions in between instructions as needed to avoid hazards.  This means that such machine code is highly dependent upon the internal architecture of the processor, and will need adjustments to run on processors with different pipeline architectures.)

Here we see that the normal register read for instr 3: would happen in ID at t+3.  To avoid the hazard (without some form of bypass), the instruction would repeat the ID stage until the correct a0 value is finally ready, which would be at t+5.  At t+5, instr 2: finally writes its value back to the register file, so by the start of t+5, a register read of a0 will get the up-to-date value (again presuming internal register file forwarding).  This kind of hazard avoidance would cost 2 cycles.  (See the picture of bubbles & notes below.)
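The 2-cycle figure can be checked with a little arithmetic, using the cycle numbers from the example above (an instruction in stage s of a 5-stage pipe is at its issue cycle plus s):

```python
# instr 2: enters IF at t+1, so its stages fall at
# IF=t+1, ID=t+2, EX=t+3, MEM=t+4, WB=t+5.
producer_wb = 1 + 4   # t+5: instr 2: writes a0 back

# instr 3: enters IF at t+2, so its register read in ID is at t+3.
consumer_id = 2 + 1   # t+3

# With internal register-file forwarding, the read succeeds once
# consumer_id >= producer_wb, so ID must repeat this many times:
stall_cycles = producer_wb - consumer_id
print(stall_cycles)   # 2
```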

A bypass is a complex circuit whose job, ultimately, is to choose, under exactly the right conditions, between using the register-read values from the ID stage (only if they are not stale) and overriding them with a value from somewhere else (when a register is out of date).

Fundamentally, the bypass has to compare register names used in adjacent instructions: compare the target register of the former instruction against both sources (if present) of the latter instruction.  If the register names match, then a bypass is indicated, because the ID stage of the latter instruction has fetched a stale register value.
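A minimal sketch of that comparison (the dict field names here are illustrative, not from any particular textbook or hardware description):

```python
def needs_alu_bypass(older, younger):
    """Return (forward_to_src1, forward_to_src2) for two adjacent instructions.

    older/younger are dicts like {"dest": "a0", "src1": "a2", "src2": None}.
    """
    if older["dest"] is None:  # e.g. a store or branch writes no register
        return (False, False)
    return (older["dest"] == younger["src1"],
            older["dest"] == younger["src2"])

add  = {"dest": "a0", "src1": "a2", "src2": None}  # addi a0, a2, 64
load = {"dest": "a1", "src1": "a0", "src2": None}  # lw   a1, 24(a0)

print(needs_alu_bypass(add, load))  # (True, False): forward into src1
```

Real hardware does these comparisons in parallel combinational logic, of course, but the condition being evaluated is the same.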

An ALU bypass feeds the output of the ALU back into the input of the ALU so that the ALU operation just starting with the new cycle can use the value it just computed in the cycle just ending without waiting for the value to go through two more pipeline stages to get back into its proper register.

On the diagram we indicate with a red arrow where an ALU bypass operation should occur here.

Let's note that in this example the actual ALU value is ready after EX of instr 2: at t+3, so the value of a2+64 is first seen in the processor by time t+4.  The ALU would naturally want to compute the add for the addressing mode of the load instruction (here a0+24) starting at time t+4.

If we have an ALU bypass (red arrow), we can feed the ALU output from t+3's ALU operation into the ALU input at t+4.  This mechanism avoids the hazard without costing any stalls or bubbles!

Let's look at the next hazard: a LOAD instruction also writes its register value at the end of the WB stage, so it also has a 3 cycle latency for the immediately succeeding instruction.

A LOAD bypass will also be helpful for performance.  The LOAD bypass will feed into the ALU input for the next cycle, and the value comes for this bypass from the MEM stage (whereas the ALU bypass value comes from the EX stage).

Since the LOAD operation's result is not available until the end of the MEM stage, a bypass here will only be able to eliminate 2 of the 3 bubble cycles.  In addition to the load bypass (purple arrow), the processor will have to insert one stall cycle, shown as <bubble>.
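One way to count the remaining stall cycles for both bypasses is sketched below; the stage numbering and the end-of-stage/start-of-stage convention are my assumptions for this sketch, not a fixed standard:

```python
# Stage indices: IF=0, ID=1, EX=2, MEM=3, WB=4.
EX, MEM = 2, 3

def stalls_needed(ready_stage, needed_stage, distance):
    """distance = how many instructions apart producer and consumer are (1 = adjacent).

    The producer has the value at the end of `ready_stage`; with forwarding,
    the consumer can use it at the start of its `needed_stage`.
    """
    return max(0, (ready_stage - needed_stage + 1) - distance)

# ALU result forwarded into the next instruction's EX stage: no stall.
print(stalls_needed(EX, EX, distance=1))   # 0

# Load result is ready at end of MEM but needed at start of EX: one bubble.
print(stalls_needed(MEM, EX, distance=1))  # 1
```

With one independent instruction scheduled between the load and its use (distance 2), even the load bubble disappears, which is why compilers try to fill load delay slots.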

NB: These are probably the two most important bypasses for illustrative purposes, but by no means the only ones.


Branches also cause issues for pipelines.  Whenever the processor starts an instruction sequence anew, it takes (in our 5 stage pipeline example) up to 5 cycles before we see the results of the first instruction.  After 5 cycles, the pipe is now full, and the processor is efficiently overlapping execution of the multiple instructions in the pipe.  So, this is to say that there is a startup overhead for switching instruction sequences.

Such startup overhead is incurred when the program changes the flow of control from what the processor was assuming.  Programs change the flow of control with loops and if-then-else statements, using branch instructions.  Conditional branch instructions in particular are difficult to predict, as the condition they test is dynamic.

For example, when the processor assumes a conditional branch will not be taken, and it turns out it is taken, by the time it knows this it will already have started working on several other instructions that are no longer relevant (they are on the wrong control-flow path of the program) and have to be discarded.  This is because a pipelined processor always tries to keep the pipe full rather than do nothing while waiting for a branch to resolve as taken or not taken.  The resolution for conditional branches typically occurs in the EX stage, as a compare operation uses the ALU's subtraction circuitry.  This means that when a conditional branch turns out to be taken where the processor assumed it wouldn't be, there are instructions in the earlier stages, IF and ID, that have to be discarded.  This situation costs a bubble of 2 cycles, as the processor restarts with a new instruction sequence.
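As a rough sketch of how this misprediction cost shows up in average CPI (the branch frequency and misprediction rate below are illustrative numbers, not from the article):

```python
def cpi_with_branches(base_cpi, branch_freq, mispredict_rate, penalty):
    # Each mispredicted branch adds `penalty` flush cycles; amortize
    # that over all instructions to get the average CPI.
    return base_cpi + branch_freq * mispredict_rate * penalty

# Illustrative: 20% of instructions are branches, 30% of those are
# mispredicted, and each misprediction flushes IF and ID (2 cycles).
print(round(cpi_with_branches(1.0, 0.20, 0.30, 2), 3))  # 1.12
```

This is why deeper pipelines, which resolve branches later and flush more stages, lean so heavily on branch prediction.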


The following diagram illustrates NOPs being inserted to avoid the ALU hazard (only; we're not avoiding the LOAD hazard here).  Done this way, this costs 2 cycles.

Pipeline 2.jpg

Again we are assuming that the register unit has internal forwarding so that it can output the write back value in the same cycle (here t+5) when used by the next instruction.  This is plausible because unlike all the other stages, the WB stage does not have to compute any values — it only has to store/copy an existing one, and those operations are much faster than arithmetic or memory accesses.

If the register unit did not have internal forwarding, then a third NOP would be necessary to fully avoid the ALU hazard shown here.

Discover and read more posts from Erik Eidt