All about FIFOs

Wednesday, October 22, 2025, 12:30 AM

Traditional RTL and Bluespec perspectives on skid buffers and friends

I’ll suppose you’re already familiar with the basic pipeline with stalling. Say, you have a complicated function f you want to compute in hardware. You can break it up into k stages with a register between each pair of consecutive stages.

The upstream producer might not have any data to feed into the pipeline, so the pipeline takes an input valid bit to indicate whether the data is available. The downstream consumer might not be ready to take the result from the pipeline, so the pipeline also requires an input ready bit to indicate whether it is being stalled.

The pipeline has an output valid bit as part of its result. It also generates an output ready bit which is sent to the upstream producer.

For any given stage, we can also think of its predecessor and successor as the upstream producer and downstream consumer, respectively. If the stage has valid data but the input ready bit is low, then it needs to keep the data for another cycle. This means it must disable its register write and output a low ready bit to stall the upstream. Put another way, the stage only outputs ready when the downstream is ready or there is no valid data, that is, when there is no reason to stall.

As a result, whenever the ultimate consumer stalls, only the last few stages end up being stalled.

The forward data path is nicely broken up by registers, keeping the clock frequency high. However, the backward control path forms a long combinational chain.
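
To make that chain concrete, here is a minimal SystemVerilog sketch of two such stages. The stage functions f1 and f2 are placeholders standing in for pieces of f, and the module and internal signal names are my own choices, not anything from the original design.

  module two_stage_pipeline #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,   // from the ultimate consumer
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out   // to the ultimate producer
  );
    // Placeholder stage logic: any combinational functions would do.
    function automatic logic [WIDTH-1:0] f1(input logic [WIDTH-1:0] x);
      return x + 1;
    endfunction
    function automatic logic [WIDTH-1:0] f2(input logic [WIDTH-1:0] x);
      return x << 1;
    endfunction

    logic [WIDTH-1:0] r1, r2;           // pipeline registers
    logic             v1, v2;           // per-stage valid bits
    logic             ready1, ready2;

    // Backward control path: a combinational chain through every stage.
    assign ready2    = !v2 || ready_in;
    assign ready1    = !v1 || ready2;
    assign ready_out = ready1;

    assign data_out  = r2;
    assign valid_out = v2;

    always_ff @(posedge clk) begin
      if (rst) begin
        v1 <= 1'b0;
        v2 <= 1'b0;
      end else begin
        if (ready1) begin r1 <= f1(data_in); v1 <= valid_in; end
        if (ready2) begin r2 <= f2(r1);      v2 <= v1;       end
      end
    end
  endmodule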

This motivates treating the register between two stages, together with its control logic, as a module in its own right, so we can design it better and more easily.

Specifically, we are interested in building a data storage module, usually called a buffer. Using Bluespec terminology, we say we are building a FIFO (of size 1!) supporting two methods:

  • enq (enqueue): takes a piece of data with a valid bit.
  • deq (dequeue): outputs a piece of data with a valid bit.

The success of enq is determined by the input valid bit and the output ready bit: when they are both high, enq is considered “executed.” Similarly, deq is executed when the input ready bit and the output valid bit are both high.

Pipeline buffer

We have already seen an example of such a module in the earlier diagram. This is called a pipeline buffer or pipeline FIFO:

In the diagram, the registers are called data and valid. The inputs are named data_in, valid_in, ready_in. The outputs are data_out, valid_out, ready_out. Note that ready_in refers to an input bit coming from the output side! It’s confusing. I know. Whine about it or choose a different naming convention.
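
In SystemVerilog terms, the circuit is roughly the following sketch. The signal names match the diagram; the module name and the reset handling are my own choices.

  module pipeline_buffer #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,   // ready bit arriving from the consumer
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out   // ready bit sent to the producer
  );
    logic [WIDTH-1:0] data;
    logic             valid;

    assign data_out  = data;
    assign valid_out = valid;
    // The backward combinational path: ready_out depends on ready_in.
    assign ready_out = !valid || ready_in;

    // enq fires when the producer has data and we can accept it;
    // deq fires when we hold data and the consumer takes it.
    wire enq = valid_in && ready_out;
    wire deq = valid    && ready_in;

    always_ff @(posedge clk) begin
      if (rst) begin
        valid <= 1'b0;
      end else begin
        if (enq)      data  <= data_in;
        if (enq)      valid <= 1'b1;
        else if (deq) valid <= 1'b0;
      end
    end
  endmodule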

Pipeline buffer as a Bluespec module with guarded interfaces

Here’s one way to understand the circuit using Bluespec concepts.

Think of the producer and consumer as individual rules that may or may not execute the methods enq and deq, respectively.

When the buffer is empty, only enq can execute. When the buffer is full, deq can take data from the register, and enq can replace it with new data in the same cycle.

Whether enq can execute depends on whether deq has executed: deq must occur first to free up space, allowing enq to proceed. Even though they execute in the same cycle, it’s as if deq happens a tiny bit before enq.

This gives us a causal relationship where deq precedes enq. In Bluespec terms, we say the method deq is “scheduled before” enq, written deq < enq.

Some designs also include a separate first method for accessing the value. The only sensible schedule is first < deq < enq.

Of course, these methods and rules are just abstractions. In the circuit itself, there’s no notion of “rules”—only a combinational path from the consumer side back to the producer side.

Pipeline buffer as a finite state machine

We can also take a more traditional view. Think of this as a state machine with two states:

  • Empty (!valid)
  • Full (valid)

The inputs valid_in and ready_in correspond to whether the producer has data and whether the consumer is ready.

Here is the state transition diagram (outputs omitted):

I’ve labeled key transitions fill, flow, and flush, which should all be self-explanatory.

The outputs can be written explicitly (read from the earlier circuit diagram):

  • valid_out = valid
  • ready_out = !valid || ready_in

Because ready_out depends not only on the state valid but also on the input ready_in, this is a Mealy machine. Again, we note that ready_out is a function of ready_in, consistent with the Bluespec schedule deq < enq, since ready_out ultimately depends on the downstream’s readiness.

Whew, I said the same things a hundred different ways, but I really want to emphasize their connections!

A pipeline buffer is a natural choice for most pipelines, but some systems may require a different behavior, so now we’re going to look at…

Bypass buffer

Suppose we have two modules where the first module’s output is the second module’s input. In the ideal case, we want to do all computations in one cycle. However, the second module may occasionally stall, so we need to store the data temporarily. When the buffer holds data, we must stall the first module.

Can we design a circuit for this? Pause and try for yourself if you’d like.

Finite state machine and Bluespec

We can again think of this as a finite-state machine, but using some Bluespec intuition helps.

We still have a buffer (with a valid bit) to store data when needed. The buffer can be either empty or full.

When the buffer is empty:

  • If the producer has data, that’s enq, and enq should succeed.
  • We attempt to pass the data to the consumer via first in the same clock cycle.
  • If the consumer accepts it (deq), the data isn’t stored. If not, the data is stored in the buffer.

The causality chain gives us the schedule enq < first < deq. That makes sense: we want data to flow directly from upstream to downstream in the same cycle.

When the buffer is full, enq can’t execute. We must first consume the stored data (first and deq).

The resulting state transition diagram looks like this:

Compare it with the pipeline buffer FSM—you’ll see a nice symmetry.

Circuit diagram for a bypass buffer

Let’s reason out what the circuit must look like.

  • There must be a combinational path from input to output, since data can flow directly.
  • The outputs must pass through MUXes, as they can come either from the input or from the buffer.
  • Finally, depending on whether the consumer is ready (ready_in), we update the buffer accordingly.

Whenever ready_in is true, we clear the buffer—so ready_in effectively acts as a local reset. Once the buffer takes a value, it holds it until reset.

This leads to the following circuit diagram, where data_in/valid_in and the internal data/valid registers are shown together:

Just to be clear, these diagrams aren’t the only possible implementations. I’m relying on enable/reset ports for simplicity; synthesis tools may produce different circuits.

Observe that this is the opposite of the pipeline buffer: here, there’s a combinational path from upstream to downstream, but the ready signal from downstream is registered before reaching upstream.
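
Here is a matching SystemVerilog sketch, with the same conventions and caveats as before.

  module bypass_buffer #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out
  );
    logic [WIDTH-1:0] data;
    logic             valid;

    // Forward combinational path: pass the input straight through when empty.
    assign data_out  = valid ? data : data_in;
    assign valid_out = valid || valid_in;
    // No backward combinational path: ready_out depends only on state.
    assign ready_out = !valid;

    always_ff @(posedge clk) begin
      if (rst) begin
        valid <= 1'b0;
      end else begin
        // Capture incoming data only while empty; hold it until drained.
        if (!valid && valid_in) data <= data_in;
        // ready_in acts as a local reset; otherwise remember any data
        // that could not flow through this cycle.
        valid <= (valid || valid_in) && !ready_in;
      end
    end
  endmodule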

Magic buffer

In a pipeline buffer, enq and deq can execute concurrently only when the buffer is non-empty.
In a bypass buffer, they can execute concurrently only when the buffer isn’t full.

Can we have the best of both worlds: maximum concurrency?

Here’s an idea:

  • When the buffer is empty, act like a bypass buffer.
  • When full, act like a pipeline buffer.

This way, enq and deq could always execute concurrently!

But there are two major problems.

First, you can’t do this in Bluespec using guarded interfaces. Scheduling must be static, and this behavior has an obvious scheduling contradiction. You could work around it using low-level interfaces like CommitIfc, but you’d have to ensure that producer and consumer rules decide independently, risking combinational cycles and losing the benefits of guarded interfaces.

Second, if there’s a combinational path in both directions, what purpose does the buffer serve? I don’t know—there might be a legitimate use case for this; let me know if you have any ideas.

The term “magic buffer” is my own. It is magical—maximum concurrency with one slot—but it’s ultimately an illusion. There is no free lunch.

Half buffer

Here’s another way to interpret “the best of both worlds”: no combinational paths in either direction.

The idea is to have ready_out and valid_out depend solely on internal state:

  • ready_out = !valid
  • valid_out = valid

Here’s the state machine:

The circuit (not shown) can be described as:

  • data <= valid ? data : data_in
  • valid <= valid ? !ready_in : valid_in
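
Spelled out as a SystemVerilog sketch (same caveats as the earlier ones):

  module half_buffer #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out
  );
    logic [WIDTH-1:0] data;
    logic             valid;

    // Both outputs depend only on internal state: no combinational path
    // in either direction.
    assign data_out  = data;
    assign valid_out = valid;
    assign ready_out = !valid;

    always_ff @(posedge clk) begin
      if (rst) begin
        valid <= 1'b0;
      end else begin
        data  <= valid ? data : data_in;
        valid <= valid ? !ready_in : valid_in;
      end
    end
  endmodule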

We call this a half buffer because it achieves only half the ideal throughput: even if the producer always has valid data, at most one piece of data gets transferred every two cycles. In contrast, pipeline and bypass buffers have a throughput of 1. There is no free lunch.

Skid buffer

For the third time, can we have the best of both worlds? That is, we want maximum concurrency and no combinational paths in either direction.

The answer is yes. And there is no free lunch—we pay with extra storage.

Skidding to a stop

Recall the pipeline buffer. Its forward data path already uses registers, so that’s fine. It’s the backward control path that forms a long combinational chain.

What if, instead of setting ready_out = !valid || ready_in combinationally, we register it: ready <= !valid || ready_in; ready_out = ready?

This removes the combinational path backward but introduces a problem.

If data is flowing continuously and downstream suddenly stalls (ready_in becomes false), the producer still sees ready_out = true for one extra cycle. The buffer will accept new data it can’t store, losing it, just like me accepting more work when my plate is full.

To fix this, we add an extra data slot. When the main slot is full and we’re about to stall, the new data goes into the extra slot instead. Until that slot is drained, we deassert ready_out. The data effectively skids to a stop—hence the name skid buffer.

Here’s the state machine:

Outputs can be defined simply as:

  • ready_out = (state != FULL)
  • valid_out = (state != EMPTY)

This gives us a Moore machine—no dependence on valid_in or ready_in!

Equivalence to a bypass buffer and a pipeline buffer

The hard part is, of course, updating the registers. You can try to do it yourself. Alternatively, see that the behavior is equivalent to a bypass buffer followed by a pipeline buffer.

Think about it:

  • When both slots are empty, new data passes through the bypass buffer and fills the pipeline buffer.
  • When only the main slot is full, the behavior depends on whether downstream consumes or stalls.
  • When both are full, the downstream consumes first, and data shifts from the bypass buffer into the pipeline buffer.

Thus, the behavior matches exactly what we want.

Here’s the combined diagram:

This combination gives us the best of both worlds:

  • The pipeline buffer breaks the forward combinational path.
  • The bypass buffer breaks the backward path by providing slack when stalls arrive late.

We could also combine them in the reverse order (pipeline then bypass); it’s less intuitive but equivalent.
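
Using the hypothetical modules sketched earlier (bypass_buffer and pipeline_buffer), the bypass-then-pipeline combination is just two instances wired back to back:

  module skid_buffer_composed #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out
  );
    logic [WIDTH-1:0] mid_data;
    logic             mid_valid, mid_ready;

    // Bypass buffer in front: breaks the backward path, provides slack.
    bypass_buffer #(.WIDTH(WIDTH)) front (
      .clk(clk), .rst(rst),
      .data_in(data_in),   .valid_in(valid_in),   .ready_out(ready_out),
      .data_out(mid_data), .valid_out(mid_valid), .ready_in(mid_ready)
    );

    // Pipeline buffer behind: breaks the forward path.
    pipeline_buffer #(.WIDTH(WIDTH)) back (
      .clk(clk), .rst(rst),
      .data_in(mid_data),  .valid_in(mid_valid),  .ready_out(mid_ready),
      .data_out(data_out), .valid_out(valid_out), .ready_in(ready_in)
    );
  endmodule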

Of course, the tradeoff is twice the storage elements.
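
If you’d rather spell out the register updates directly instead of composing two buffers, here is one possible sketch of the three-state machine above; the state and slot names are my own.

  module skid_buffer #(parameter WIDTH = 8) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out
  );
    typedef enum logic [1:0] {EMPTY, HALF, FULL} state_t;
    state_t           state;
    logic [WIDTH-1:0] main_data;   // slot presented to the consumer
    logic [WIDTH-1:0] skid_data;   // extra slot that absorbs a late stall

    // Moore outputs: no dependence on valid_in or ready_in.
    assign data_out  = main_data;
    assign valid_out = (state != EMPTY);
    assign ready_out = (state != FULL);

    wire enq = valid_in  && ready_out;
    wire deq = valid_out && ready_in;

    always_ff @(posedge clk) begin
      if (rst) begin
        state <= EMPTY;
      end else begin
        case (state)
          EMPTY: if (enq) begin
                   main_data <= data_in;
                   state     <= HALF;
                 end
          HALF:  case ({enq, deq})
                   2'b10: begin skid_data <= data_in; state <= FULL; end  // fill the extra slot
                   2'b01: state <= EMPTY;                                 // drain
                   2'b11: main_data <= data_in;                           // flow: replace in place
                   default: ;                                             // hold
                 endcase
          FULL:  if (deq) begin             // enq cannot fire here
                   main_data <= skid_data;  // the stalled data catches up
                   state     <= HALF;
                 end
          default: state <= EMPTY;
        endcase
      end
    end
  endmodule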

Conflict-free FIFO

In Bluespec, this structure is called a conflict-free FIFO, typically of size 2 (since size 1 makes no sense). It’s “conflict-free” because enq and deq are independent. The schedule is first < {enq, deq} and enq CF deq (CF stands for conflict-free).


Resources

References I used while writing this:

Special thanks to Thomas Bourgeat, who taught 6.1920 (6.175) Constructive Computer Architecture in Spring 2023, when I could fortunately take it, and to the late Arvind, whose work underpins much of Bluespec.


Appendix

Rants and notes that didn’t fit in the main post.

“Skid” buffer

Sometimes, people refer to our bypass buffer as a skid buffer. There’s no standard name in this messy field.

Larger FIFOs

In the Bluespec world, for larger FIFOs, we expect enq and deq to execute concurrently whenever the FIFO is neither empty nor full.

  • A pipeline FIFO also allows concurrent operations when full (first < deq < enq).
  • A bypass FIFO allows concurrency when empty (enq < first < deq).
  • A conflict-free FIFO satisfies first < {enq, deq} and enq CF deq, but doesn’t require concurrency in the full/empty cases.

The implication is:

  • Pipeline FIFO -> long backward combinational path.
  • Bypass FIFO -> long forward combinational path.
  • Conflict-free FIFO -> fully isolated, but no concurrency in full/empty cases.

Larger but Still Small FIFOs

You can’t simply chain n bypass buffers to make a size-n FIFO. When a stall occurs, data ripples backward through the chain, breaking concurrency guarantees. Similarly, chaining pipeline buffers creates a delay line, not a flexible FIFO.

Skid buffers chained together are even more wasteful.

To maintain concurrency, you need a proper parameterized FIFO with a shift register and plenty of extra logic—or you can just leverage magic buffers from earlier.

This shows why building large FIFOs purely from registers quickly becomes impractical. There is no free lunch.

Actually Large FIFOs

Realistic designs resemble software circular buffers with read and write pointers. Achieving pipeline, bypass, or conflict-free semantics requires clever implementation details (see references for examples).
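
As a baseline, here is a sketch of the plain circular-buffer version, with no pipeline, bypass, or conflict-free guarantees; those are exactly where the clever details come in. The names and the one-extra-pointer-bit trick are my own choices.

  module circular_fifo #(parameter WIDTH = 8, parameter LOG_DEPTH = 3) (
    input  logic             clk,
    input  logic             rst,
    input  logic [WIDTH-1:0] data_in,
    input  logic             valid_in,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out,
    output logic             valid_out,
    output logic             ready_out
  );
    localparam DEPTH = 1 << LOG_DEPTH;
    logic [WIDTH-1:0]   mem [DEPTH];
    // One extra pointer bit distinguishes full from empty.
    logic [LOG_DEPTH:0] wr_ptr, rd_ptr;

    wire empty = (wr_ptr == rd_ptr);
    wire full  = (wr_ptr[LOG_DEPTH-1:0] == rd_ptr[LOG_DEPTH-1:0]) &&
                 (wr_ptr[LOG_DEPTH]     != rd_ptr[LOG_DEPTH]);

    assign data_out  = mem[rd_ptr[LOG_DEPTH-1:0]];
    assign valid_out = !empty;
    assign ready_out = !full;

    wire enq = valid_in  && ready_out;
    wire deq = valid_out && ready_in;

    always_ff @(posedge clk) begin
      if (rst) begin
        wr_ptr <= '0;
        rd_ptr <= '0;
      end else begin
        if (enq) begin
          mem[wr_ptr[LOG_DEPTH-1:0]] <= data_in;
          wr_ptr <= wr_ptr + 1'b1;
        end
        if (deq) rd_ptr <= rd_ptr + 1'b1;
      end
    end
  endmodule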

Ephemeral History Registers (EHRs)

With those primitive FIFOs, there are few legitimate reasons to use EHRs. They’re difficult to get right, and most designs can be expressed more cleanly using rules and FIFOs.