Verilog functions in Xilinx XST

A nice idea

The XST documentation says that Verilog functions are fully supported. I was frustrated today to discover that this is not the case. I was trying to make a library for handling fixed point arithmetic. Module instantiation overhead in Verilog is quite high in terms of lines of code and excess verbiage. All the code I need to use is combinational, so the natural thing to do is have one module with all my functions and parameters which control the representation of the fixed point numbers. After this, I can make instances of the library module with the various fixed point representations I need and use the functions.

So I whip out the following:


module SignedInteger;
    parameter out_int = 1;
    parameter out_frac = 1;
    parameter in_a_int = out_int;
    parameter in_a_frac = out_frac;
    parameter in_b_int = in_a_int;
    parameter in_b_frac = in_a_frac;
    parameter out_width = out_int+out_frac;
    parameter in_a_width = in_a_int+in_a_frac;
    parameter in_b_width = in_b_int+in_b_frac;

  function signed [out_width-1:0] sum;
    input signed [in_a_width-1:0] a;
    input signed [in_b_width-1:0] b;
    begin
      if (in_a_frac > in_b_frac)
        sum = a + (b<<(in_a_frac-in_b_frac));
      else
        sum = (a<<(in_b_frac-in_a_frac)) + b;
    end
  endfunction

endmodule
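
To make the alignment inside sum concrete, here is a small simulation-only sketch with the formats hard-coded. The module name align_example and the operand values are made up for illustration; a has no fraction bits and b has four.

```verilog
`timescale 1ns/1ns
// Simulation-only sketch; align_example and its constants are hypothetical.
module align_example;
  localparam a_frac = 0;   // a is a plain integer
  localparam b_frac = 4;   // b carries 4 fraction bits (units of 1/16)

  reg signed [15:0] a;
  reg signed [15:0] b;
  reg signed [19:0] s;     // result keeps 4 fraction bits

  initial begin
    a = 3;                 // 3.0
    b = 40;                // 40/16 = 2.5
    // Shift the operand with fewer fraction bits so the binary points line up.
    s = (a << (b_frac - a_frac)) + b;
    $display("s = %0d in units of 1/16", s);
    $finish;
  end
endmodule
```

The raw result is 48 + 40 = 88, which is 5.5 once you account for the four fraction bits.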

Now I can instantiate SignedInteger in another module and call the sum function. I need the function in another module because I may have multiple instances of SignedInteger in the calling module with different parameter values– sort of a poor man’s class mechanism. Everything simulates just peachy. Now, I synthesize a test case in XST.

Prepare to crash and burn

I’ll spare you the details, but XST has a number of issues with doing things like this, though the first few are not insurmountable. XST first errored out because my original sum function used concatenation rather than the shift operators shown above. XST didn’t like the fact that one of the concatenation multipliers (the replication counts) was negative. I got boxed into a corner here by Verilog too, because you can’t use a generate inside a function, and if I use a generate outside the function, it makes it really hard to call the function from outside the module. So it was back to the drawing board to use the shift operator instead. Next, XST believes that modules need at least one port. My guess is that the people who wrote XST never thought about just using a module as a container for its tasks and functions. Hmm… it’s not great, but I can add a port that I won’t use.

At this point, I’m faced with an interesting error message:


Analyzing top module .
ERROR:Xst:917 - Undeclared signal <out_width>.
ERROR:Xst:2083 - "test.v" line 29: Unsupported range for function.

What do you mean, undeclared signal out_width? Firstly, it’s not a signal and secondly, it is so declared. The second message gives me pause. Crap– it really thinks that out_width is a signal. Why can’t it see it?

An experiment

Time for a test case. I try to synthesize this module, and guess what? It compiles without errors.


module test2
  (
   input signed [15:0] a,
   input signed [15:0] b,
   output signed [19:0] sum
   );

  localparam out_width = 20;
  localparam in_a_width = 16;
  localparam in_a_frac = 0;
  localparam in_b_width = 16;
  localparam in_b_frac = 4;

  SignedInteger #(16,4,16,0,12,4) si_16_4_16_0_12_4(.value());

  assign sum = si_16_4_16_0_12_4.sum(a,b);
endmodule

If I give it the parameters it is looking for, the errors go away. I do get some strange warnings, however…


WARNING:Xst:616 - Invalid property "out_width 00000014": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_frac 00000000": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_int 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_width 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_frac 00000004": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_int 0000000C": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_width 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "out_frac 00000004": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "out_int 00000010": Did not attach to si_16_4_16_0_12_4.

Not sure what that means exactly, but it sounds like something is rotten with the parameters.

Incorrect logic

Now this could cause incorrect logic to be synthesized. So I tried this test case.


module sub(empty);
  output empty;
  parameter paramvalue = 1;

  assign empty = 0;
  function [3:0] myparam_plus;
    input [3:0] in;
    begin
      myparam_plus = paramvalue + in;
    end
  endfunction
endmodule // sub

module test3(value,empty);
  output [3:0] value;
  output empty;
  parameter paramvalue = 10;
  sub #(2) mysub(.empty(empty));
  assign value = mysub.myparam_plus(3);
endmodule

The value output should be a constant 5. That is, the sub module has a paramvalue of 2 and we add 3 to it. XST synthesizes this with no errors, warnings, or infos. The module is totally clean. Even so, the netlist it produces gives value a constant output of 13: XST has used the paramvalue of 10 from the calling module, test3, instead of the 2 passed to mysub. Let’s see Xilinx support try to squirm their way out of this one. This seems like a bug to me.
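
A quick way to pin down the expected behavior is a self-checking simulation. The testbench below is only a sketch: the name test3_tb is hypothetical, and it assumes sub and test3 from the listing above are compiled alongside it.

```verilog
`timescale 1ns/1ns
// Hypothetical testbench; compile it together with sub and test3 from above.
module test3_tb;
  wire [3:0] value;
  wire empty;

  test3 dut(.value(value), .empty(empty));

  initial begin
    #1;  // let the continuous assignments settle
    if (value === 4'd5)
      $display("PASS: value = %0d", value);
    else
      $display("FAIL: value = %0d, expected 5", value);
    $finish;
  end
endmodule
```

Running the same check against a post-synthesis netlist is one way to show the simulation and synthesis mismatch in a bug report.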

At any rate, I have to give up on a pure Verilog solution to a fixed point library. Time to use Perl.

LIFO

Recently, I had an application that needed a LIFO rather than a FIFO. Where FIFO stands for “first in first out”, LIFO stands for “last in first out”. LIFOs are not as common as FIFOs, but they are used whenever you need to remember something you are currently working on and start on something new. When the new thing is done, you can go back to what it was you were doing before you were interrupted. Due to the sequential nature of software, LIFOs are used a lot in processors as parameter or return “stacks”, where the write operation is called a “push” and the read operation is a “pop”.

A simple implementation

At first blush, implementing a LIFO should be pretty much the same as implementing a FIFO. You have a read pointer and a write pointer and a dual port RAM. Here is a LIFO figure with two entries:

[Figure: A LIFO with two entries.]

You can see that we will read from Location 2 if necessary, and we will write to Location 3. However, unlike a FIFO, a pop operation will affect the write pointer, and worse yet, it will affect the write pointer before the write takes place. In this example, if we were to perform a pop and push operation simultaneously, then we would need to write to Location 2 after doing the read. Afterwards, the read and write pointers would have the same values as before.

Digging deeper

One other thing to think about is that it seems wasteful to use a dual port RAM here. We could use a single port RAM if it weren’t for the pesky case of simultaneous pushes and pops. However, when we do a simultaneous push and pop, the pointers don’t change. We only read and write the location pointed to by the read pointer, taking care to do the read operation before the write operation. Also, if you think about it, we don’t even need two pointers. The pointers always need to move in lock step with each other. The write pointer is always the read pointer if you are pushing and popping, and the read pointer plus one if you are just pushing.

A better approach

The “aha!” moment comes when you realize that the critical memory location is the top of the stack. We need to read and write to this location if we are pushing and popping at the same time. Otherwise, we’re only going to need to read or write the RAM, and never both. If we keep the top of the stack in a register instead of the RAM, then we can do a simultaneous push and pop without using the memory at all. If we do just a push, then we need to write the top of the stack register to the RAM and store the pushed value in the top of the stack register. For a pop without a push, we just need to read the RAM and store the result in the top of the stack register, while simultaneously providing the current top-of-stack register value as the pop result. The figure below shows a stack with two entries using a top of stack register:

[Figure: A stack with two entries using a top of stack register.]

The code

Ports for the LIFO are the usual clock and reset. We have status outputs empty and full, as well as a count of the number of data items in the LIFO. In addition, we have a signal indicating a push with its associated data, and another indicating a pop. A top-of-stack value rounds out the ports. Just a note, you can use the top-of-stack value all you want without popping the stack.


`timescale 1ns/1ns
module lifo
  #(
    parameter depth = 32,
    parameter width = 32,
    parameter log2_depth = log2(depth),
    parameter log2_depthp1 = log2(depth+1)
    )
  (
   input clk,
   input reset,
   output reg empty,
   output reg full,
   output reg [log2_depthp1-1:0] count,
   input push,
   input [width-1:0] push_data,
   input pop,
   output [width-1:0] tos
   );

You can read my other posts about the synchronous FIFO or constant functions to find out more about the log2 parameters. Here’s the log2 function for completeness, though:


   function integer log2;
      input [31:0] value;
      begin
	 value = value-1;
	 for (log2=0; value>0; log2=log2+1)
	   value = value>>1;
      end
   endfunction

Next, we’re going to use a flag to tell us when we’re writing. That is, when we are asked to push and we have room for the data. This means that either we are not full, or if we are full, we are also popping. Reading is similar, but a little simpler. We read if we’re asked to pop and there is data available.


  wire writing = push && (count < depth || pop);
  wire reading = pop && count > 0;

Now we count the data. We compute a combinational next_count value because that allows us to register the empty and full outputs. The logic is really quite simple. The count goes up when we write and don’t read, and it goes down if we read and don’t write. Otherwise, it just stays the same.


  reg [log2_depthp1-1:0] next_count;
  always @(*)
    if (reset)
      next_count = 0;
    else if (writing && !reading)
      next_count = count+1;
    else if (reading && !writing)
      next_count = count-1;
    else
      next_count = count;

  always @(posedge clk)
    count <= next_count;

Full and empty are pretty self explanatory:


  always @(posedge clk)
    full <= next_count == depth;

  always @(posedge clk)
    empty <= next_count == 0;

For the memory pointer, if we are writing, then we use the location pointed to by the count, and if we are reading, we use the location one before.


  wire [log2_depth-1:0] ptr = writing ? count [log2_depth-1:0] : (count [log2_depth-1:0])-1;

Writing to the memory is simple enough. If we are writing and not reading, we write the current top-of-stack into the RAM. The top-of-stack will be replaced with what we are pushing onto the stack. If we are writing and reading, we don’t want to write anything. The top of stack register will do all the work in this case.


  reg [width-1:0] mem [depth-1:0];

  always @(posedge clk)
    if (writing && !reading)
      mem[ptr] <= tos;

Reading is a little more tricky. It would be great if we could just say


 always @(posedge clk)
   if (reading && !writing)
     tos <= mem[ptr];
   else if (writing)
     tos <= push_data;

However, this won’t work. You can’t map that into the read output register of a BRAM. Synthesis tools aren’t smart enough to map part of this onto the RAM and the rest onto outside logic. We’re going to have to give it a little hint. First we can just use a simple register that can map onto the read output register:


  reg [width-1:0] mem_rd;
  always @(posedge clk)
    if (reading)
      mem_rd <= mem[ptr];

Now, we’ll create a top-of-stack “shadow” register. This will hold the value being written to the top of the stack, since we can’t store it in the RAM output register. Only the RAM can do that.


  reg [width-1:0] tos_shadow;
  always @(posedge clk)
    if (writing)
      tos_shadow <= push_data;

Finally, we need a MUX to select who really holds the top-of-stack value.


  reg use_mem_rd;
  always @(posedge clk)
    if (reset)
      use_mem_rd <= 0;
    else if (writing)
      use_mem_rd <= 0;
    else if (reading)
      use_mem_rd <= 1;

  assign tos = use_mem_rd ? mem_rd : tos_shadow;
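
To see the top-of-stack handling in action, here is a small testbench sketch. The name lifo_tb and the stimulus are hypothetical, and it assumes the fragments above (without the rejected "tos <=" version) are assembled into module lifo and closed with endmodule.

```verilog
`timescale 1ns/1ns
// Hypothetical testbench for the lifo module described above.
module lifo_tb;
  reg clk = 0;
  reg reset = 1;
  reg push = 0;
  reg pop = 0;
  reg [7:0] push_data = 0;
  wire empty, full;
  wire [3:0] count;            // log2(8+1) = 4 bits for depth 8
  wire [7:0] tos;

  lifo #(.depth(8), .width(8)) dut
    (.clk(clk), .reset(reset), .empty(empty), .full(full), .count(count),
     .push(push), .push_data(push_data), .pop(pop), .tos(tos));

  always #5 clk = ~clk;

  // Drive push/pop for one clock edge, then compare tos and count.
  task step;
    input do_push;
    input do_pop;
    input [7:0] data;
    input [7:0] exp_tos;
    input [3:0] exp_count;
    begin
      @(negedge clk);
      push = do_push; pop = do_pop; push_data = data;
      @(negedge clk);
      push = 0; pop = 0;
      if (tos !== exp_tos || count !== exp_count)
        $display("FAIL: tos=%h count=%0d", tos, count);
      else
        $display("ok:   tos=%h count=%0d", tos, count);
    end
  endtask

  initial begin
    repeat (2) @(negedge clk);
    reset = 0;
    step(1, 0, 8'h11, 8'h11, 1);   // push 0x11
    step(1, 0, 8'h22, 8'h22, 2);   // push 0x22, 0x11 moves into the RAM
    step(0, 1, 8'h00, 8'h11, 1);   // pop: 0x11 comes back via mem_rd
    step(1, 1, 8'h33, 8'h33, 1);   // push and pop together: RAM not written
    // Final pop empties the LIFO; tos is no longer meaningful, so check empty.
    @(negedge clk); pop = 1;
    @(negedge clk); pop = 0;
    if (empty !== 1'b1) $display("FAIL: expected empty");
    else                $display("ok:   empty");
    $finish;
  end
endmodule
```

Signals are changed on the falling clock edge so that they are stable when the design samples them on the rising edge.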

Wrapup

With this design, you’re going to get an efficient LIFO using a single port RAM. Why is it important to do this in a single port? I’ll answer that question in a future post. For now, you can download the entire LIFO module here.

Synchronous FIFO

I do a lot of logic design for Xilinx FPGAs, and I really like the tools that they use. Ironically, I hate their GUIs, but since I always run their tools from the command line using GNU Make, that’s not a big drawback, except when it comes to their CoreGen tool. I really hate that tool. It’s virtually impossible to run it from a command line, because it reads all its input from an XML file. The only way to make the XML file is from their GUI. You can modify the XML and run the tool again, but Xilinx doesn’t document the XML format, and the format can change from release to release. Another problem with CoreGen is that the modules it generates are not parameterized. Say you make a FIFO which is 31 bits wide. If a parameter changes, you need to fire up a GUI and point your mouse and click to build another FIFO. I’m sorry, but this is just wrong.

Now I suppose this just can’t be helped in some cases (like an ethernet MAC), but for many of the modules that CoreGen builds, you’re better off writing the module from scratch. You get a clean and portable design, and you don’t need to mess around with a separate simulation model. In general, I avoid CoreGen like the plague.

Declaration


`timescale 1ns/1ns
module sync_fifo
  #(
    parameter depth = 32,
    parameter width = 32,
    parameter log2_depth = log2(depth),
    parameter log2_depthp1 = log2(depth+1)
    )
  (
   input clk,
   input reset,
   input wr_enable,
   input rd_enable,
   output reg empty,
   output reg full,
   output [width-1:0] rd_data,
   input [width-1:0] wr_data,
   output reg [log2_depthp1-1:0] count
   );

The module essentially takes two parameters: the depth and the width of the FIFO. The other two parameters are there because of two different Xilinx XST bugs. The first has to do with constant functions. XST doesn’t treat expressions involving parameters and constant functions as constant expressions unless the expression is on the right-hand side of a parameter declaration. The second bug is that XST does not support local parameters, so they can’t be declared with localparam. The ports contain all the usual suspects, and also include a count of the number of entries in the FIFO. Note that the width of the count may be larger than the width of the memory address. For example, with the default depth of 32 entries, the memory address is 5 bits wide because it must span the range 0..31, but the count will be 6 bits wide because it needs to span the range 0..32, since there can be 32 entries in the FIFO.

Local functions

These two functions are used in the design, so I list them here. The first one returns the ceiling of the log base two of its input.


  function integer log2;
    input [31:0] value;
    begin
      value = value-1;
      for (log2=0; value>0; log2=log2+1)
	value = value>>1;
    end
  endfunction
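
As a quick sanity check on the widths, this simulation-only sketch evaluates the function for the default depth. The wrapper module log2_demo is hypothetical; the function body is copied from the listing above.

```verilog
`timescale 1ns/1ns
// Simulation-only sketch; log2_demo is a hypothetical wrapper.
module log2_demo;
  function integer log2;
    input [31:0] value;
    begin
      value = value-1;
      for (log2=0; value>0; log2=log2+1)
        value = value>>1;
    end
  endfunction

  initial begin
    $display("log2(32) = %0d", log2(32));   // address width for depth 32
    $display("log2(33) = %0d", log2(33));   // count width for depth 32
    $finish;
  end
endmodule
```

This prints 5 and 6, the address and count widths for a depth of 32.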

This next one is used to increment the read and write pointers. Note that we can’t just rely on the pointer wrapping, because we don’t assume the FIFO is a power of two deep. If the FIFO is a power of two in depth, then the synthesis tool will optimize out the additional logic. Therefore, with this design you get the best of both worlds.
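
A minimal sketch of such an increment function, assuming it simply wraps to zero after the last location. The name increment matches the calls further down, but the body here is an assumption rather than the original listing. It is a fragment meant to sit inside the sync_fifo module, since it relies on the depth and log2_depth parameters.

```verilog
  // Assumed implementation: step a pointer, wrapping at depth-1 so that
  // depths which are not a power of two still work.
  function [log2_depth-1:0] increment;
    input [log2_depth-1:0] value;
    begin
      if (value == depth-1)
        increment = 0;
      else
        increment = value+1;
    end
  endfunction
```

When depth is a power of two, depth-1 is all ones, so the comparison is redundant with the natural wrap of the addition.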

The read and write pointers

Next, we’ll update the read and write pointers. First I declare some signals which indicate whether the FIFO is reading or writing. It’s not just a matter of looking at the enable!


  wire writing = wr_enable && (rd_enable || !full);
  wire reading = rd_enable && !empty;

Here is the code for the read pointer. I don’t usually use a combinational next signal, but in this case we need one so that the synthesis tool can infer a read-first style RAM. If you want to use a Xilinx block RAM, this is a necessity.


  reg [log2_depth-1:0] rd_ptr;
  reg [log2_depth-1:0] next_rd_ptr;
  always @(*)
    if (reset)
      next_rd_ptr = 0;
    else if (reading)
      next_rd_ptr = increment(rd_ptr);
    else
      next_rd_ptr = rd_ptr;

The write pointer doesn’t need the next variable, but I use the same style for symmetry.


  reg [log2_depth-1:0] wr_ptr;
  reg [log2_depth-1:0] next_wr_ptr;
  always @(*)
    if (reset)
      next_wr_ptr = 0;
    else if (writing)
      next_wr_ptr = increment(wr_ptr);
    else
      next_wr_ptr = wr_ptr;

  always @(posedge clk)
    wr_ptr <= next_wr_ptr;

The status and count outputs

I just use a counter to keep track of the reads and writes.


  always @(posedge clk)
    if (reset)
      count <= 0;
    else if (writing && !reading)
      count <= count+1;
    else if (reading && !writing)
      count <= count-1;

Empty and full are pretty simple. We need to know a couple of things, though. First, they don’t depend on count. In many applications, you don’t need to know how many entries there are in the FIFO. In these cases, you simply leave the count output disconnected and the synthesis tool will remove the counter. If full or empty used the counter, then that logic could not be removed. Also, I experimented with using unregistered values and the design was larger.


  always @(posedge clk)
    if (reset)
      empty <= 1;
    else if (reading && next_wr_ptr == next_rd_ptr && !full)
      empty <= 1;
    else
      if (writing && !reading)
	empty <= 0;
  
  always @(posedge clk)
    if (reset)
      full <= 0;
    else if (writing && next_wr_ptr == next_rd_ptr)
      full <= 1;
    else if (reading && !writing)
      full <= 0;

The memory

Finally, let’s get to the memory. Again, if you use CoreGen for your memories, you would need to generate another module for this and instantiate it here. But why would you? It takes less code to write the memory than to instantiate what CoreGen would have produced. Note that we need to infer a read-first memory. This is where the next_rd_ptr signal is needed.


  reg [width-1:0] mem [depth-1:0];
  always @(posedge clk)
    begin
      if (writing)
	mem[wr_ptr] <= wr_data;
      rd_ptr <= next_rd_ptr;
    end

  assign rd_data = mem[rd_ptr];
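
Here is a short testbench sketch for checking the basic behavior. The name sync_fifo_tb and the stimulus are hypothetical, and it assumes the fragments above are assembled into module sync_fifo, with rd_ptr declared and an increment function present.

```verilog
`timescale 1ns/1ns
// Hypothetical testbench for the sync_fifo module described above.
module sync_fifo_tb;
  reg clk = 0;
  reg reset = 1;
  reg wr_enable = 0;
  reg rd_enable = 0;
  reg [7:0] wr_data = 0;
  wire empty, full;
  wire [7:0] rd_data;
  wire [2:0] count;            // log2(4+1) = 3 bits for depth 4

  sync_fifo #(.depth(4), .width(8)) dut
    (.clk(clk), .reset(reset), .wr_enable(wr_enable), .rd_enable(rd_enable),
     .empty(empty), .full(full), .rd_data(rd_data), .wr_data(wr_data),
     .count(count));

  always #5 clk = ~clk;

  // Hold the enables for exactly one rising clock edge.
  task cycle;
    input wr;
    input rd;
    input [7:0] data;
    begin
      @(negedge clk);
      wr_enable = wr; rd_enable = rd; wr_data = data;
      @(negedge clk);
      wr_enable = 0; rd_enable = 0;
    end
  endtask

  initial begin
    repeat (2) @(negedge clk);
    reset = 0;
    cycle(1, 0, 8'hA1);
    cycle(1, 0, 8'hB2);
    // The oldest entry is visible on rd_data before any read strobe.
    if (rd_data !== 8'hA1 || count !== 2) $display("FAIL after writes");
    cycle(0, 1, 8'h00);
    if (rd_data !== 8'hB2 || count !== 1) $display("FAIL after first read");
    cycle(0, 1, 8'h00);
    if (empty !== 1'b1 || count !== 0)    $display("FAIL: expected empty");
    $display("done");
    $finish;
  end
endmodule
```

A non-power-of-two depth, such as 5, is worth trying as well, since that is the case the pointer wrap logic exists for.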

Wrap-up

So that’s about it. We now have a simple RTL synchronous FIFO that synthesizes to a nice compact implementation, is portable across vendors, and simulates nicely. And best of all? No GUI!

I’ve attached the whole design here in case you want to use it. Just keep the copyright header intact, and of course, there is no warranty.

Timescale

I don’t understand how so many people have so many weird ideas about timescale. I think it stems from some simulators not allowing modules to have a timescale declaration if it already parsed modules without one. So, for example, if you have module1 without a timescale and module2 with a timescale, and you then compile module1 followed by module2, you get an error. If you compile module2 followed by module1, then all is well, and module1 just uses the timescale from module2. Now, I agree that this is weird. There should just be a default timescale that applies at the beginning of the simulation. Let’s look at some examples that I have seen that attempt to overcome this problem.

The Special-Timescale-File-Goes-First Method

In this solution, everyone agrees that the very first file the compiler ever sees is a special timescale file with only a default timescale directive. I suppose that this solution is an attempt to enforce the way all the tools should probably work. The problem with this approach is that it relies on everyone to use the same simulation tool flow. In some environments this can be possible, but I’ve seen this approach backfire when bringing up a new tool or flow, or even when just changing the rules in a makefile.

The Include-a-Special-File Method

In this scheme, everyone agrees to include a special timescale file with a default timescale– unless they need their own custom timescale declaration. This approach removes the requirement that any particular file needs to go first. However, it now builds a special requirement of your flow into the RTL. This leads to portability issues if you move the RTL into another flow.

The Correct Method– always specify a timescale

I think that the correct approach is to specify a timescale before the first module in every source file which contains a module. Use this method even (and I want to make this absolutely clear) if you don’t make any reference to time in the module. Using this technique, your file will always be correct. If it is compiled first, it will declare some timescale for everyone else. If not, then no harm done. And if some later module specifies time without specifying a timescale, then the fault lies with that module.
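
As a concrete illustration, a source file following this rule might look like the sketch below. The module passthrough is a made-up example with no notion of time at all, and 1ns/1ns is just the value used elsewhere in these posts.

```verilog
`timescale 1ns/1ns   // stated up front, even though this file uses no delays

// passthrough is a hypothetical module used only for illustration.
module passthrough
  (
   input a,
   output b
   );
  assign b = a;
endmodule
```

Because the directive comes before the module, the file compiles the same way whether the tool sees it first, last, or somewhere in between.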

Of course, it’s possible to use this method in combination with the first if you have some code that can’t be changed, and thus can’t have a timescale added. The include file approach is just wrong-headed. I don’t know why it would ever be a good idea– especially if your module actually cares about what the timescale is.