Synchronous FIFO

I do a lot of logic design for Xilinx FPGAs, and I really like their tools. Ironically, I hate their GUIs, but since I always run their tools from the command line using GNU Make, that’s not a big drawback– except when it comes to their CoreGen tool. I really hate that tool. It’s virtually impossible to run it from a command line, because it reads all its input from an XML file. The only way to make the XML file is from their GUI. You can modify the XML and run the tool again, but they don’t document the XML format, and the format can change from release to release. Another problem with CoreGen is that the modules it generates are not parameterized. Say you make a FIFO which is 31 bits wide. If a parameter changes, you need to fire up the GUI and point and click your way through building another FIFO. I’m sorry, but this is just wrong.

Now I suppose this just can’t be helped in some cases (like an ethernet MAC), but for many of the modules that CoreGen builds, you’re better off writing the module from scratch. You get a clean and portable design, and you don’t need to mess around with a separate simulation model. In general, I avoid CoreGen like the plague.

Declaration


`timescale 1ns/1ns
module sync_fifo
  #(
    parameter depth = 32,
    parameter width = 32,
    parameter log2_depth = log2(depth),
    parameter log2_depthp1 = log2(depth+1)
    )
  (
   input clk,
   input reset,
   input wr_enable,
   input rd_enable,
   output reg empty,
   output reg full,
   output [width-1:0] rd_data,
   input [width-1:0] wr_data,
   output reg [log2_depthp1-1:0] count
   );

The module essentially takes two parameters: the depth and the width of the FIFO. The other two parameters exist to work around two different Xilinx XST bugs. The first has to do with constant functions. XST doesn’t treat expressions involving parameters and constant functions as constant expressions unless the expression is on the right-hand side of a parameter declaration. The second bug is that XST does not support local parameters, so they can’t be declared with localparam. The ports contain all the usual suspects, and also include a count of the number of entries in the FIFO. Note that the width of the count may be larger than the width of the memory address. For example, with the default depth of 32, the memory address is 5 bits wide because it must span the range 0..31, but the count is 6 bits wide because it needs to span the range 0..32, since there can be anywhere from 0 to 32 entries in the FIFO.
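
As a usage sketch (the instance and signal names here are hypothetical), only depth and width need to be overridden when instantiating the FIFO; the two log2_* parameters then pick up their default, derived values:

  // Hypothetical instantiation: override only depth and width and let
  // log2_depth and log2_depthp1 default to the derived values.
  sync_fifo #(.depth(64), .width(8)) u_fifo
    (.clk(clk), .reset(reset),
     .wr_enable(wr_en), .rd_enable(rd_en),
     .empty(fifo_empty), .full(fifo_full),
     .rd_data(fifo_rd_data), .wr_data(fifo_wr_data),
     .count(fifo_count));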

Local functions

These two functions are used in the design, so I list them here. The first one returns the ceiling of the log base two of its input.


  function integer log2;
    input [31:0] value;
    begin
      value = value-1;
      for (log2=0; value>0; log2=log2+1)
	value = value>>1;
    end
  endfunction

This next one is used to increment the read and write pointers. Note that we can’t just rely on the pointer wrapping, because we don’t assume the FIFO is a power of two deep. If the FIFO is a power of two in depth, then the synthesis tool will optimize out the additional logic. Therefore, with this design you get the best of both worlds.
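
A minimal sketch of such an increment function, assuming it simply wraps back to zero when the pointer reaches depth-1, might look like this:

  function [log2_depth-1:0] increment;
    input [log2_depth-1:0] value;
    begin
      // Wrap at depth-1 so the design works for non-power-of-two depths.
      if (value == depth-1)
        increment = 0;
      else
        increment = value+1;
    end
  endfunction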

The read and write pointers

Next, we’ll update the read and write pointers. First I declare some signals which indicate whether the FIFO is actually reading or writing. It’s not just a matter of looking at the enables!


  wire writing = wr_enable && (rd_enable || !full);
  wire reading = rd_enable && !empty;

Here is the code for the read pointer. I don’t usually use a combinational next signal, but in this case we need one so that the synthesis tool can infer a read-first style RAM. If you want to use a Xilinx block RAM, this is a necessity.


  reg [log2_depth-1:0] rd_ptr;
  reg [log2_depth-1:0] next_rd_ptr;
  always @(*)
    if (reset)
      next_rd_ptr = 0;
    else if (reading)
      next_rd_ptr = increment(rd_ptr);
    else
      next_rd_ptr = rd_ptr;

The write pointer doesn’t need the next variable, but I use the same style for symmetry.


  reg [log2_depth-1:0] wr_ptr;
  reg [log2_depth-1:0] next_wr_ptr;
  always @(*)
    if (reset)
      next_wr_ptr = 0;
    else if (writing)
      next_wr_ptr = increment(wr_ptr);
    else
      next_wr_ptr = wr_ptr;

  always @(posedge clk)
    wr_ptr <= next_wr_ptr;

The status and count outputs

I just use a counter to keep track of the reads and writes.


  always @(posedge clk)
    if (reset)
      count <= 0;
    else if (writing && !reading)
      count <= count+1;
    else if (reading && !writing)
      count <= count-1;

Empty and full are pretty simple. There are a couple of things to note, though. First, they don’t depend on count. In many applications, you don’t need to know how many entries are in the FIFO. In those cases, you simply leave the count output disconnected and the synthesis tool will remove the counter. If full or empty used the counter, that logic could not be removed. Also, I experimented with using unregistered values, and the design came out larger.


  always @(posedge clk)
    if (reset)
      empty <= 1;
    else if (reading && next_wr_ptr == next_rd_ptr && !full)
      empty <= 1;
    else
      if (writing && !reading)
	empty <= 0;
  
  always @(posedge clk)
    if (reset)
      full <= 0;
    else if (writing && next_wr_ptr == next_rd_ptr)
      full <= 1;
    else if (reading && !writing)
      full <= 0;

The memory

Finally, let’s get to the memory. Again, if you use CoreGen for your memories, you would need to generate another module for this and instantiate it here. But why would you? It takes less code to write the memory than to instantiate what CoreGen would have produced. Note that we need to infer a read-first memory. This is where the next_rd_ptr signal is needed.


  reg [width-1:0] mem [depth-1:0];
  always @(posedge clk)
    begin
      if (writing)
	mem[wr_ptr] <= wr_data;
      rd_ptr <= next_rd_ptr;
    end

  assign rd_data = mem[rd_ptr];

Wrap-up

So that’s about it. We now have a simple RTL synchronous FIFO that synthesizes to a nice compact implementation, is portable across vendors, and simulates nicely. And best of all? No GUI!

I’ve attached the whole design here in case you want to use it. Just keep the copyright header intact, and of course, there is no warranty.

Timescale

I don’t understand how so many people have so many weird ideas about timescale. I think it stems from some simulators not allowing modules to have a timescale declaration if they have already parsed modules without one. So, for example, if you have module1 without a timescale and module2 with a timescale, and you then compile module1 followed by module2, you get an error. If you compile module2 followed by module1, then all is well, and module1 just uses the timescale from module2. Now, I agree that this is weird. There should just be a default timescale that applies at the beginning of the simulation. Let’s look at some approaches I have seen that attempt to overcome this problem.
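
To make that concrete, here is a hypothetical pair of files (the file and module names are made up) that exhibits the problem:

// module1.v -- no timescale directive
module module1;
  initial #10 $display("module1: delay elapsed");
endmodule

// module2.v -- declares a timescale
`timescale 1ns/1ns
module module2;
  initial #10 $display("module2: delay elapsed");
endmodule

With the simulators described above, compiling module1.v before module2.v produces an error, while compiling module2.v first lets module1 silently pick up the 1ns/1ns timescale.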

The Special-Timescale-File-Goes-First Method

In this solution, everyone agrees that the very first file the compiler ever sees is a special timescale file containing only a default timescale directive. I suppose that this solution is an attempt to enforce the way all the tools should probably work. The problem with this approach is that it relies on everyone using the same simulation tool flow. In some environments this is possible, but I’ve seen this approach backfire when bringing up a new tool or flow, or even when just changing the rules in a makefile.
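
In other words, the first file handed to the compiler contains nothing but a line like this (the exact timescale is whatever the project agrees on):

`timescale 1ns/1ns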

The Include-a-Special-File Method

In this scheme, everyone agrees to include a special timescale file with a default timescale– unless they need their own custom timescale declaration. This approach removes the requirement that any particular file go first. However, it now bakes a requirement of your flow into the RTL itself, which leads to portability issues if you move the RTL into another flow.
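
That is, every RTL file starts with something like the following, where timescale.vh is a hypothetical project-wide file containing just the directive:

`include "timescale.vh"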

The Correct Method– always specify a timescale

I think that the correct approach is to specify a timescale before the first module in every source file which contains a module. Use this method even (and I want to make this absolutely clear) if you don’t make any reference to time in the module. Using this technique, your file will always be correct. If it happens to be compiled first, it will declare a timescale for everyone else; if not, no harm done. And if some other module references time without specifying a timescale, the fault lies with that module.
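
A minimal sketch of what this looks like (the module is made up), even for a module that never references time:

`timescale 1ns/1ns

module pass_through (input clk, input d, output reg q);
  // No delays or time references here, but the timescale is declared anyway.
  always @(posedge clk)
    q <= d;
endmodule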

Of course, it’s possible to use this method in combination with the first if you have some code that can’t be changed, and thus can’t have a timescale added. The include file approach is just wrong-headed. I don’t know why it would ever be a good idea– especially if your module actually cares about what the timescale is.

Recursive Modules

Many designers think that recursive techniques cannot be applied to hardware design. I’m not really sure where this misconception comes from. While it is true that in Verilog ’95 you could not use recursion because the language did not support it, with Verilog 2001, recursion is possible. Recursion is really useful for describing hardware that has a tree structure. Let’s use a priority encoder as an example.

A priority encoder takes as input a bit vector and produces the bit number of the highest (or lowest) bit set and also an indication that some input bit was set. For example, with an 8-bit encoder and an input of 'b00101110, the output would be 5 because bit number 5 is the highest bit set.

The base case

For a 2-bit input, a priority encoder is pretty simple. The valid output is just the OR of the two input bits. The bit number output is just the higher input bit. In schematics it looks something like this.

[Figure: base implementation of a 2-bit priority encoder]

The recursive case

For an encoder which is 2^n bits wide, we can join two 2^(n-1)-bit wide priority encoders together like this.

[Figure: recursive implementation of a priority encoder]

Verilog code

So what does the code for this look like? It’s pretty simple. If the input vector is two bits wide, then we just implement the base case. Otherwise, if the width is not a power of two, then we pad out the input with zero bits to the next power of two width and then instantiate a power of two width encoder. If the width is a power of two, then we build the encoder out of two encoders half the size.


module priencr #(parameter width = 64)
  (
   input [width-1:0] decode,
   output [log2(width)-1:0] encode,
   output valid
   );

  function [31:0] log2;
    input reg [31:0] value;
    begin
      value = value-1;
      for (log2=0; value>0; log2=log2+1)
	value = value>>1;
    end
  endfunction

  generate
    // Base case: a 2-bit encoder.
    if (width == 2)
      begin
	assign valid = |decode;
	assign encode = decode[1];
      end
    // Width is not a power of two: zero-pad the input up to the next
    // power of two and instantiate an encoder of that width.
    else if (width & (width-1))
      priencr #(1<<log2(width)) priencr ({1<<log2(width) {1'b0}} | decode,
                                         encode,valid);
    // Power-of-two width: combine two half-width encoders.
    else
      begin
	wire [log2(width)-2:0] encode_low;
	wire [log2(width)-2:0] encode_high;
	wire valid_low, valid_high;
	priencr #(width>>1) low(decode[(width>>1)-1:0],encode_low,valid_low);
	priencr #(width>>1) high(decode[width-1:width>>1],encode_high,valid_high);
	assign valid = valid_low | valid_high;
	assign encode = valid_high ? {1'b1,encode_high} : {1'b0,encode_low};
      end
  endgenerate
endmodule
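
As a usage sketch (the signal names are made up), a non-power-of-two request vector can be fed straight in, since the module pads it internally:

  wire [4:0] req;          // five requesters -- not a power of two
  wire [2:0] grant_id;     // 3 bits, since log2(5) rounds up to 3
  wire       grant_valid;

  priencr #(.width(5)) u_priencr
    (.decode(req), .encode(grant_id), .valid(grant_valid));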

Wrap up

The only gotcha with recursion is vendor support. I have come across vendors who don’t think recursion is possible in Verilog 2001 because the standard doesn’t specifically say it is supported. The Verilog 2005 standard is very clear that recursion is allowed, which helps with this problem.

Recursion can be a very powerful way to express tree-like structures in hardware.

Constant Functions in Verilog 2001

Constant functions are a great feature introduced in Verilog 2001. In a nutshell, constant functions allow you to write functions that are evaluated at elaboration time, that is, when parameters are evaluated and generate statements are expanded. One great example of a constant function is the log2 function.

function integer log2;
  input integer value;
  begin
    value = value-1;
    for (log2=0; value>0; log2=log2+1)
      value = value>>1;
  end
endfunction

The rules for constant functions are pretty simple. They can’t have any side effects; that is, they can’t modify any global variables, and they can only call other constant functions. Also, they can only have constants or parameters as inputs.

So what are they good for? Consider a priority encoder. You pass in an n-bit input vector and produce a log2(n)-bit encoded result. Without constant functions, you need to specify both the input and output widths when instantiating the module. With constant functions, you just need to specify the input width. Something like this…

module prioenc
  #(parameter width = 16) (
     input [width-1:0] d,
     output [log2(width)-1:0] enc
);

See my post on using recursion to implement a priority encoder to see the whole body of the module. A couple of things to note, however: the function can be called before it is declared, and the function does not need to obey the RTL rules for synthesizable code. Many older tools have problems with constant functions, although most recent releases have fixed this.

One notable problem is with Xilinx XST (as of 10.1 SP3), which has a very strange bug with constant functions. The code above will not synthesize with XST. However, if we change it to this, then everything works as expected.

module prioenc
  #(parameter width = 16,
     parameter log_width = log2(width)) (
  input [width-1:0] d,
  output [log_width-1:0] enc);

XST errors out if you use a constant function in a port declaration, but it is fine if you use it on the right-hand side of a parameter declaration.

At any rate, constant functions are a huge and underutilized feature in Verilog 2001.