Replay system

A replay system is a method for allowing pipelined processors to make use of a bypass network even when the latency of an instruction is unknown but predictable. This method reduces latency when the prediction is correct but uses more execution resources if the prediction is incorrect. Load instructions are a common target of replay since they generally have a small set of known latencies, depending on which level of cache (if any) holds the requested data.

The first documented instance of a replay system was in Intel's Pentium 4 processor^[1]^: 1, and replay has continued to be a feature of Intel's subsequent processors^[2]. Some variation of this system may exist in other superscalar processors, but because it is an implementation detail that only affects performance, it is rarely documented.

Overview

A traditional pipelined CPU has a register fetch stage near the beginning of the pipeline, which reads the instruction's register operands from the physical register file, and a write-back stage near the end, where the instruction's results are written back to the physical register file. Since there are multiple clock cycles between those stages, an instruction cannot immediately follow another instruction that produces a register value it needs. Processors that take pipeline hazards into account can automatically insert bubbles in the pipeline to hold the dependent instruction at the register fetch stage until all of its inputs have been written back.

In a pipeline design with one execute stage, a bypass bus can be added to allow data produced by the execute stage in one cycle to be used directly as an input to the execute stage on the next cycle, thus eliminating the latency penalty for back-to-back dependent instructions. More complex bypass buses can be designed for more complex pipelines, and superscalar processors can make use of forwarding networks to allow data to be routed between execution units^[3].
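The effect of a bypass bus on a chain of dependent instructions can be illustrated with a minimal sketch. The function name and its parameters (`exec_latency`, `writeback_delay`) are assumptions made for this example and do not describe any particular processor; `writeback_delay` stands for the hypothetical number of extra cycles a dependent instruction must wait for a result to reach the register file when no bypass exists.

```python
def chain_cycles(n_instructions, exec_latency=1, writeback_delay=2, bypass=True):
    """Cycle at which the last instruction of a fully dependent chain
    finishes executing, in a simplified (hypothetical) pipeline model.

    With a bypass bus, each instruction can start as soon as its
    predecessor's result leaves the execute stage. Without one, it must
    also wait `writeback_delay` cycles for the result to be written back
    and read again from the register file.
    """
    gap = exec_latency if bypass else exec_latency + writeback_delay
    return exec_latency + (n_instructions - 1) * gap


with_bypass = chain_cycles(4, bypass=True)
without_bypass = chain_cycles(4, bypass=False)
print(with_bypass, without_bypass)
```

Under these assumed values, a chain of four dependent single-cycle instructions finishes in 4 cycles with the bypass and 10 cycles without it; the gap grows with the distance between the execute and write-back stages.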

Not all instructions have latencies that are known at the time that instructions are scheduled^[1]^: 2. Examples include some mathematical operations, such as division and square root, whose latency depends on their operands, and instructions that depend on state external to the execution units, such as memory^[1]^: 1 and I/O accesses and random number generation. For these instructions, the CPU would not be able to make use of the bypass bus since it wouldn't know on which cycle the data would be ready^[1]^: 2.

In the case of memory reads, any hits to the L1 cache (that also hit the TLB) have a known latency. Hits to the L2 cache may also have a known latency but generally hits to the L3 cache and full misses can't be easily predicted. Data reads are often on the critical path of execution so reducing their latency can have a large impact on the execution time of a program. Since an effective caching system will have most memory reads hit the L1 cache, the processor can schedule instructions that depend on a data read with the assumption that the read will hit the L1 cache^[4]. If the prediction is incorrect and the read is a miss, the results of the dependent instruction will be discarded and the instruction will be rescheduled after the read is complete. If the L2 cache latency is known, the instruction could be rescheduled to attempt to use the bypass bus again at the L2 latency^[5].
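The scheduling policy described above can be sketched as a small model. The latency values, the `no_bypass_penalty` parameter, and the function name are all hypothetical and chosen only for illustration; the model also simplifies by assuming that the final replay after a full miss can still pick up the data as soon as the load completes.

```python
# Hypothetical load-to-use latencies in cycles; not taken from any real CPU.
LATENCY = {"L1": 4, "L2": 12, "MEM": 100}


def dependent_finish(actual_level, speculate=True, no_bypass_penalty=2):
    """Return (finish_cycle, execution_slots_used) for one single-cycle
    instruction that depends on a load served from `actual_level`."""
    load_done = LATENCY[actual_level]
    if not speculate:
        # Conservative scheduling: wait for the load to complete and pay
        # the assumed extra cycles for not using the bypass bus.
        return load_done + no_bypass_penalty + 1, 1

    slots = 0
    # Try the predictable latencies in order: an L1 hit first, then an L2 hit.
    for guess in ("L1", "L2"):
        slots += 1  # the dependent instruction occupies an execution slot
        if actual_level == guess:
            return LATENCY[guess] + 1, slots
        # Wrong guess: the result is discarded and the instruction is replayed.
    slots += 1  # final replay once the load actually completes
    return load_done + 1, slots


for level in ("L1", "L2", "MEM"):
    print(level, dependent_finish(level), dependent_finish(level, speculate=False))
```

In this model a correct L1 prediction finishes two cycles earlier than conservative scheduling, while a full miss costs three execution slots instead of one, which is the tradeoff between latency and execution resources that replay makes.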

Processors that decode instructions into multiple micro-operations and schedule them separately can replay only the μops that are dependent on the mispredicted instruction.

History

The first Intel processor to make use of a replay system was the Pentium 4^[1]^: 1, which is a superscalar processor with an unusually long pipeline. Having such a long pipeline means that the latency cost of not using the bypass network is much greater than in processors with shorter pipelines, but the number of cycles wasted by an inaccurate prediction is also large. Despite having shorter pipelines, later Intel processors have also included replay systems since memory latency has become a critical factor in application performance^[6]. The cost of a misprediction has also decreased since there are far more execution units than the two of the Pentium 4.

The replay system implemented by the Pentium 4 was simplistic. Instructions went through both the regular pipeline and a replay queue with the same number of stages as the pipeline, so that if a memory read failed, a signal could be sent to the scheduler to stop it from issuing instructions from further up the pipeline and to read from the replay queue instead. Rather than waiting for the instruction that initially caused the replay to complete, the replayed instructions cycled around the execution pipeline until the memory access succeeded.^[1]^: 3
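The looping behaviour can be illustrated with a rough sketch. The cycle counts (`first_attempt_cycle`, `loop_length`) are invented for the example and are not the Pentium 4's actual figures; the model only captures the idea that a dependent instruction re-executes once per trip around the loop until the load data is available.

```python
def replay_loop(load_ready_cycle, first_attempt_cycle=4, loop_length=7):
    """Return (cycle_of_successful_execution, number_of_executions) for an
    instruction that circulates in a replay loop until its load data is ready.

    All parameters are hypothetical cycle counts used for illustration.
    """
    attempts = 1
    cycle = first_attempt_cycle
    while cycle < load_ready_cycle:
        # Data not ready yet: the instruction goes around the loop and
        # executes again `loop_length` cycles later.
        cycle += loop_length
        attempts += 1
    return cycle, attempts


print(replay_loop(4))    # data ready at the predicted cycle
print(replay_loop(12))   # data arrives a little late
print(replay_loop(100))  # long miss
```

With these assumed numbers, data that arrives at cycle 12 is not consumed until cycle 18 after three executions, and a 100-cycle miss causes 15 executions. The sketch shows why tracking structures that avoid rescheduling too early reduce wasted execution slots.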

Later processors keep track of μops that haven't executed in a reorder buffer and use a microcode scoreboard to track dependencies between them^[7]. Using these tracking structures, μops can avoid being rescheduled too early.

Tradeoffs

When instructions must be replayed, they have to execute again, which consumes power and generates heat. In the case of the Pentium 4, replayed instructions could take twice as many execution slots as other instructions. Replaying an instruction also takes more processor resources in general, which reduces how many other instructions can be speculatively executed. In processors with simultaneous multithreading (such as Intel's hyper-threading), the resources taken by replayed instructions can't be used by the sibling threads that share the physical core either.

References

[1] Kartunov, Victor; Yury, Malich; Keruchenko, Jan; Levchenko, Vadim (2005-06-06). "Replay: Unknown Features of the NetBurst Core". X-bit labs. Archived from the original on 2014-04-08. Retrieved 2014-04-07.
[2] Peter Cordes (May 17, 2019). "About the RIDL vulnerabilities and the "replaying" of loads". Stack Overflow. Retrieved February 14, 2026.
[3] Peter Cordes (August 24, 2017). "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths". Stack Overflow. Retrieved February 14, 2026. Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.
[4] Peter Cordes (August 23, 2018). "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths". Stack Overflow. Retrieved February 14, 2026. The RS can replay uops in a few cases, e.g. [...] if it was dispatched in anticipation of load data arriving, but in fact it didn't.
[5] Peter Cordes (2021-06-03). "Comment on "Are load ops deallocated from the RS when they dispatch, complete or some other time?"". Stack Overflow. Retrieved 2026-02-14. The RS dispatches in the cycle when the load result will be on the bypass-forwarding bus, if the load hit in L2 cache (after it already missed in L1d). If the data doesn't arrive then, that uop will have to get replayed again later, when the load eventually does complete.
[6] Drepper, Ulrich (November 21, 2007). What Every Programmer Should Know About Memory (PDF) (Technical report). Red Hat, Inc. Retrieved February 14, 2026.
[7] US 11829767, Tran, Thang Minh, "Register scoreboard for a microprocessor with a time counter for statically dispatching instructions", published 2023-11-28, issued 2023-11-28, assigned to Simplex Micro Inc
