Remote debugging with the Arrow SoCkit board


My FPGA builds are done on a server in the datacenter, which is helpful for a number of reasons, not the least of which is that it takes a lot of CPU, and the server has plenty of CPU power. However, it’s a pain to transfer the design files over to a local machine for programming and debugging. Here are instructions that you can use to run just a JTAG server on the lab machine, and connect to from the server machine.

Lab Machine


You should install the Quartus II tools on the lab machine. There’s got to be a download for only the programming tools, but I haven’t found it yet. Download and install only the minimal amount. The machine needs to run the port mapper and xinetd. You may need to install those with yum or whatever other package manager your distribution requires.

Starting the Server

In the following, I assume that the environment variable $qII points to your Quartus II install directory.

You need to make sure that a file /etc/jtagd/jtagd.conf exists and has the following line in it:

Password = “mysecret”

Without that, the jtag daemon will not listen for remote machines.

First check that the JTAG software can see the board. Run the following command and verify that you get the correct output…

[pete@localhost ~]$ $qII/quartus/bin/jtagconfig
 1) CV SoCKit [1-1]
 02D020DD 5CSEBA6(.|ES)/5CSEMA6/..

Then, run the following to start the server. You need to do this as root:

[pete@localhost ~]$ sudo $qII/quartus/bin/jtagd
[pete@localhost ~]$ sudo $qII/quartus/bin/jtagconfig --enableremote mysecret

Remote Machine

On your remote machine you should start up Quartus II, and:

Select Tools->Programmer. This will start the programmer window.
You should then click the Hardware Setup… button.
Click on the JTAG Settings tab.
Click the Add Server… button.
Enter the IP address of your server. Enter the password, greywolf.
The Hardware Setup window should indicate the server IP address and the connection status should be OK.
Click the Close button.
You should then see the IP address of the lab machine to the right of the Hardware Setup… button.

You should now be able to use the JTAG device from the remote machine.

Pro Tips

If your lab machine doesn’t have a public IP address– for instance, because it is a VM using NAT running on a host with a VPN connection to your server– you will need to set up an SSH tunnel to the lab machine.

The JTAG server runs on port 1309. You can use ssh to tunnel the connection.

From your VPN machine ssh to the remote machine with the following command:

[pete@localhost ~]$ ssh -L 1309:

In this example, your guest machine has an IP address of, and the remote server is

Then, you would use your VPN machine IP address as the address of the lab machine when connecting to the JTAG server.


Timing Diagrams with WordPress

I saw a post yesterday on Hackaday about using a program called Wavedrom to edit timing diagrams on web pages. It looks like a great tool, but I have a couple of issues with it.

First, I’m confused about the command line version of the tool. It exists, but there seems to be some issue with using it on Mac OS. I haven’t tried it on Linux yet, but most of my work is done on both, especially documentation, so I need to get that figured out.

The second really big gotcha is that it doesn’t work with Internet Explorer 9 and earlier. That is a big one. Even though I personally never use IE, a lot of my readers do.

Anyway, with those caveats in mind, I did some work to try to integrate Wavedrom with WordPress. Here is an example waveform:

Apologies if you use IE since you can’t see the waveform. You can check it out in Firefox or Chrome, though.

You can see the full tutorial for WaveDrom here for more details. But for the example above, the code embedded in this page looks like this:

<script type="WaveDrom">
{ signal : [
{ name: "clk", wave: "p......" },
{ name: "bus", wave: "x.34.5x", data: "head body tail" },
{ name: "wire", wave: "0.1..0." },

I created a very simple plugin for WordPress which adds the right stuff to the HTML header for these pages. I will probably work on it a bit further if I can figure out how to support IE. If I can’t do that, maybe I can get the command line version working, and use it for PNG or SVG images at least.

Putting New Files in the Right Place: The Vivado Edition

Vivado has the ability to create and manage your own IP, which is a good thing. But there are some real gotchas when doing something like this, the first being that it doesn’t seem to place HDL files in the correct locations.

I have been running some FPGA workshops using the IP Packager in Vivado and having been having a lot of problems when creating new files in the IP component. Here’s an example:

From a Vivado project select Tools→Create and Package IP… Click Next, select Create a new AXI4 peripheral, click Next, just use the defaults for the next page, click Next again, use the defaults for the AXI information and click Next again, select Edit IP, then click Finish.

This will put in an IP Packager project. Edit the top level file myip_v1_0. I’m using Verilog but it works the same for VHDL. Instantiate a new dummy module at the bottom of the file:

// Add user logic here
dummy dummy (.clk(s00_axi_aclk));
// User logic ends

files_fig1Save the file and look at the design sources. You will see that there is now a new undefined module in the design called dummy. It should look like the figure to the right:

Now, we would like to create the new module. So we go to File→Add Sources… and create the new source file. Select Add or create design sources and click Next. Click Create File …, type the file name dummy and click OK. Click Finish. When the Define Module dialog pops up, just click OK. Click Yes to the dialog saying you haven’t changed anything. You now have a file defined for your dummy module.

files_fig2Ready to package that IP? Go to the Package IP tab in the Project Manager window. Then, select File Groups in the Packaging Steps window. You should see a window like the figure on the right.

Now click on the Merge changes from File Groups Wizard link. You will then get a link saying 2 warnings. Click on that, and you will see two warnings about your dummy.v file being on a path which is not relative to your IP root directory. What has happened is that Vivado placed the new file outside the IP component. If you go ahead and package your IP, you cannot use the packaged IP because it will be missing this file. Seems like a Vivado (2014.4) bug to me.

So how should we fix this? I don’t know how to move files in Vivado, but we can create the file in a different location when we make the new file. To do that, first delete the dummy file and remove it from disk. Just select the file, then right-click and select Remove File from Project. Select the checkbox to also delete it from disk and click OK.

files_fig3Now go to Add Sources again, and after clicking Create File…, specify the file name dummy.v, then select a new file location. files_fig4Navigate to your ip_repo area and into myip_1_0. Select the hdl directory, and click Select. Then click OK and then Finish to create the file. This will place the file in the proper location in the IP repository.

When you merge your changes in the File Groups step, you should not get any warnings. You can then package your IP and use it in your designs.

Hopefully this will be fixed in the next release of Vivado.

Getting Xilinx Document Navigator working with CentOS 6.5

There are a number of packages you need to install beyond the default to get Document Navigator working on CentOS 6.5. What’s especially strange is that they are 32-bit libraries, which are required even on 64-bit machines. Here is the yum command I used to get everything to work. We’ll have a couple of frustrations, because the latest CentOS 6 repositories are for CentOS 6.6, which is not compatible with the latest Vivado release. So you’ll also need to update a few things on the x86_64 side before doing the install for the i686 libraries:

% sudo yum update libXrender glib2
% sudo yum install fontconfig.i686 libXext.i686 libXrender.i686 glib2.i686 libpng.i686 libSM.i686

Why is Xilinx 2014.4 SDK so buggy on Linux?

I am running CentOS 6.6 with Vivado and SDK, but SDK has been crashing like crazy on me. It dies with the following message.

java: cairo-misc.c:380: _cairo_operator_bounded_by_source: Assertion `NOT_REACHED' failed.

Turns out it is a problem with versioning between gtk2 and cairo. Here is a link with the details.

There are some posts on the Xilinx forum about installing updated RPMs to fix this, but it is not a good idea to install RPMs outside of the YUM framework. Another alternative is to disable cairo in eclipse. So edit your eclipse configuration file in install_dir/SDK/2014.4/eclipse/lnx64.o/configuration/config.ini (and the 32-bit version if that is appropriate) to include the following line.


This seems to have fixed things at least for now.

Remote JTAG with Vivado

My FPGA development computer is a server in a datacenter. It stores lots of memory and CPU, but it’s impossible to connect it directly to my development boards. ISE has always supported remote JTAG servers, but the implementation has always been spotty. In some releases it has difficulty with some versions of Linux. With Vivado, things are much improved.

If your lab machine is on the same network as your development machine, simply start the hw_server program in the Vivado bin directory and open a target from the development machine Vivado window. Just specify a remote server host name and use the default port 3121. It couldn’t be simpler.

If your lab machine is behind a firewall, you may need to use SSH to tunnel the traffic through. First, make sure the hardware manager is not running on your development machine. From the lab machine, ssh into the development machine with the following command:

ssh -R 3121:localhost:3121

This will create an SSH tunnel from your lab machine to your development machine. If you get a warning saying that SSH can’t bind to the required port it means you are still running a local JTAG server on your development machine. Assuming all went well you should be logged in to your development machine from the lab machine. You can then open the target in the hardware manager in Vivado but tell it to connect to a local machine. Vivido will see a server already running on port 3121 (through your SSH tunnel), so it won’t try to start a new one.

If your lab machine is running Windows then you can use the open source PuTTY program to make the tunnel. If your development machine is running Windows, then you need to install an SSH server on the machine. That’s beyond the scope of this post.


The lifo2 module implements a dual Last In First Out stack data structure. The two stack structures share a single block RAM, or may optionally be constructed from shift registers. This module forms the heart of a stack based micro-controller. One of the stacks is used as a data stack while the other is used as a subroutine call stack. By implementing these in a single BRAM we can save a lot of space.

1. Interface

The module interface consists of a clock and reset signal, and a control interface for each stack.

1.1. Module declaration

§1.1.1: §7.1

module lifo2
  §1.3.1. Parameters
   input clk,
   input reset,
   § Stack port A,
   § Stack port B

1.2. Stack ports

The two ports use the suffix a and b to distinguish interface signals. A push input indicates that data from the push_data input should be pushed onto the stack. A pop input indicates that data should be popped from the stack. Popped data should be read from the tos output during the same cycle that pop is valid. It is completely normal to push and pop data on the same clock cycle.

Status signals empty and full indicate if the stack is empty or full. The count output provides the number of valid entries currently on the stack.

1.2.1. Stack port A

§ §1.1.1

   input push_a,
   input pop_a,
   output reg empty_a,
   output reg full_a,
   output reg `CNT_T count_a,
   output `DATA_T tos_a,
   input `DATA_T push_data_a

1.2.2. Stack port B

§ §1.1.1

   input push_b,
   input pop_b,
   output reg empty_b,
   output reg full_b,
   output reg `CNT_T count_b,
   output `DATA_T tos_b,
   input `DATA_T push_data_b

1.3. Parameters

A depth parameter specifies the depth for each of the stacks. For the BRAM implementation, the size of the RAM is just the sum of the two depth parameters. For the shift register implementation each stack uses its depth parameter independently. The width parameter specifies the data width. This is the same for both stacks.

The implementation parameter may be set to "BRAM" or "SRL" to select between BRAM and shift register implementations. Note that the quotes are required.

The full_checking specifies whether the logic that generates the full outputs should be generated. In the case of the shift register implementation the concept of full is easy to implement and sensible. With this implementation, there really are two independent stacks. However, with the BRAM implementation the concept of full is a little harder to pin down. Is the stack full when the number of elements is equal to its depth parameter? This seems overly restrictive since there may still be plenty of room for it to grow. But if it does grow then it does so at the expense of the other stack. There is a further difficulty if both stacks are almost full. If you push to both stacks then you will overflow one of the stacks. Which one? The one that is already over its posted limit? The logic to implement this seems needlessly complex. So the compromise here is to implement the strict full checking for the BRAM implementation when the full_checking parameter is true and to not generate any full checking logic at all if it is false.

§1.3.1: §1.1.1

    parameter depth_a = 512,
    parameter depth_b = 512,
    parameter width = 32,
    parameter implementation = "SRL",
    parameter full_checking = 0,
    parameter depth = depth_a+depth_b

1.4. Definitions

Here are some definitions that we can make use of later. DATA_T is used for data values. PTR_T is used for pointers like the top of stack pointer. The CNT_T definition is suitable for holding a count. Note that this value needs to hold a value of 0 as well as a value of depth so it may be larger than the PTR_T value.

§1.4.1: §7.1

`define DATA_T [width-1:0]
`define PTR_T [log2(depth)-1:0]
`define CNT_T [log2(depth+1)-1:0]

2. Functions

The log2 function computes the log base 2 of its input value. This function is used to computer widths of pointer and count values.

§2.1: §7.1

   function integer log2;
      input [31:0] value;
	 value = value-1;
	 for (log2=0; value>0; log2=log2+1)
	   value = value>>1;

3. Logic common to all implementations

3.1. Determine read and write signals

The writing signals indicate that we are writing to the corresponding stack. This occurs when the push signal is asserted and there is space on the stack or if the push signal is asserted and a pop is asserted as well. In this case we can perform the operation even though the stack is full.

The reading signal indicates that we are reading from the corresponding stack. It is just the pop signal qualified with not empty.

§3.1.1: §7.1

  wire writing_a = push_a && (!full_a || pop_a);
  wire writing_b = push_b && (!full_b || pop_b);
  wire reading_a = pop_a && !empty_a;
  wire reading_b = pop_b && ! empty_b;

3.2. Determine count value for stack A

We first generate the next_count_a value. Incrementing if we are writing but not reading, decrementing if we are reading and not writing and leaving the count the same if we are both reading and writing or neither reading or writing. We use a separate combinational always block to generate a next value signal next_count_a. This signal is used by later code to determine empty and full conditions.

§3.2.1: §7.1

  reg `CNT_T next_count_a;
  always @(*)
    if (writing_a && !reading_a)
      next_count_a = count_a + 1;
    else if (reading_a && !writing_a)
      next_count_a = count_a - 1;
      next_count_a = count_a;

  always @(posedge clk)
    if (reset)
      count_a <= 0;
      count_a <= next_count_a;

3.3. Determine count value for stack B

Here we generate the count values for stack B.

§3.3.1: §7.1

  reg `CNT_T next_count_b;
  always @(*)
    if (writing_b && !reading_b)
      next_count_b = count_b + 1;
    else if (reading_b && !writing_b)
      next_count_b = count_b - 1;
      next_count_b = count_b;
  always @(posedge clk)
    if (reset)
      count_a <= 0;
      count_b <= next_count_b;

3.4. Generate empty

Here we generate the logic for determining if either stack is empty. The stack is empty if on the next clock its count will be zero.

§3.4.1: §7.1

  always @(posedge clk)
    if (reset)
      empty_a <= 1;
      empty_a <= next_count_a == 0;

  always @(posedge clk)
    if (reset)
      empty_b <= 1;
      empty_b <= next_count_b == 0;

3.5. Generate full

Full is pretty simple as well. We go full when the next count value will be equal to the depth parameter of the stack. Writes will not occur when the stack is full so there can never be a case when the next count value is greater than the depth.

§3.5.1: §3.6.1

  always @(posedge clk)
    if (reset)
      full_a <= 0;
      full_a <= next_count_a == depth_a;

  always @(posedge clk)
    if (reset)
      full_b <= 0;
      full_b <= next_count_b == depth_b;

3.6. Conditionally generate the full logic

As discussed in Section 1.3, “Parameters”, based on the value of the full_checking parameter we either create the checking logic or just stub the full signals out to always false.

§3.6.1: §7.1

    if (full_checking)
        §3.5.1. Generate full
	initial full_a = 0;
	initial full_b = 0;

4. BRAM implementation

This section describes the BRAM implementation of the dual LIFO.

4.1. The data memory

§4.1.1: §4.6.1

  reg `DATA_T mem [depth-1:0];

4.2. Maintain address pointers

Figure 1. The pointer to the top of the A stack grows down while the pointer to the top of the B stack grows up

Here we maintain the address pointers into the RAM. The pointer for stack A starts at address zero and works up, while the pointer for stack B starts at the last address and works down.

§4.2.1: §4.6.1

  wire `PTR_T ptr_a = writing_a ? count_a `PTR_T : (count_a `PTR_T)-1;
  wire `PTR_T ptr_b = (depth-1)-(writing_b ? count_b `PTR_T
                                           : (count_b `PTR_T)-1);

4.3. Top-of-stack

One limitation of the BRAM implementation is that we only have two ports. One port must be used for each stack since both stacks can operate simultaneously. The solution to this problem is to recognize that if we are both pushing and popping the stack then we are really only operating on the top element. This element can be stored in a register and no actual memory operation needs to occur.

4.3.1. Top-of-stack shadow registers

The tos_shadow registers hold the latest value pushed onto the stack.

§ §4.6.1

  reg `DATA_T tos_shadow_a;
  always @(posedge clk)
    if (reset)
      tos_shadow_a <= 0;
    else if (writing_a)
      tos_shadow_a <= push_data_a;

  reg `DATA_T tos_shadow_b;
  always @(posedge clk)
    if (reset)
      tos_shadow_b <= 0;
    else if (writing_b)
      tos_shadow_b <= push_data_b;

4.3.2. Select between top-of-stack shadow registers and memory read values

Here we determine if we should use the memory read value or the top-of-stack shadow register to provide the top-of-stack value. If we are popping the stack then the new top-of-stack value should come from the memory read. Otherwise we should just use the value in the top-of-stack shadow register.

§ §4.6.1

  reg use_mem_rd_a;
  always @(posedge clk)
    if (reset)
      use_mem_rd_a <= 0;
    else if (writing_a)
      use_mem_rd_a <= 0;
    else if (reading_a)
      use_mem_rd_a <= 1;

  reg use_mem_rd_b;
  always @(posedge clk)
    if (reset)
      use_mem_rd_b <= 0;
    else if (writing_b)
      use_mem_rd_b <= 0;
    else if (reading_b)
      use_mem_rd_b <= 1;

  assign tos_a = use_mem_rd_a ? mem_rd_a : tos_shadow_a;

  assign tos_b = use_mem_rd_b ? mem_rd_b : tos_shadow_b;

4.4. Perform memory reads

Here we read from the stack. The values read will be the top-of-stack outputs if we are reading from the stack but not writing to it.

§4.4.1: §4.6.1

  reg `DATA_T mem_rd_a;
  always @(posedge clk)
    if (reading_a)
      mem_rd_a <= mem[ptr_a];

  reg `DATA_T mem_rd_b;
  always @(posedge clk)
    if (reading_b)
      mem_rd_b <= mem[ptr_b];

4.5. Perform memory writes

Conversely if we are writing to the stack but not reading then we perform a memory write.

§4.5.1: §4.6.1

  always @(posedge clk)
    if (writing_a && !reading_a)
      mem[ptr_a] <= tos_a;

  always @(posedge clk)
    if (writing_b && !reading_b)
      mem[ptr_b] <= tos_b;

5. Shift register implementation

5.1. Shift register implementation of stack A

There may be situations where two stacks are needed but we don't want to use a block RAM, or if both stacks don't need to be very large and we want to save the overhead of the RAM implementation. This alternate shift register implementation may be used instead.

The basic idea of the shift register implementation is to use a shift register for each bit of the data word. We then concatenate the head of the shift registers together to form the top of the stack.

5.1.1. Pop data off the shift register

Here we shift the register left, shifting in a zero value.

§ §5.1.1

srl <= {srl[depth-2:0],1'b0}

5.1.2. Push data onto the shift register

Here we shift the push data onto the shift register. This will only happen when the shift register is not full.

§ §5.1.1

srl <= {push_data_a[i],srl[depth_a-1:1]}

5.1.3. Swap data at the top of the shift register

Here we just replace the leftmost value in the shift register with the new pushed value.

§ §5.1.1

srl <= {push_data_a[i],srl[depth_a-2:0]}

We use a generate for loop to implement each shift register and a case statement to update it based on the push and pop conditions.

§5.1.1: §5.1

for (i=0; i<width; i=i+1)
  begin : srl_a
    reg [depth_a-1:0] srl;
    always @(posedge clk)
      case ({writing_a,reading_a})
	2'b01: § Pop data off the shift register;
	2'b10: § Push data onto the shift register;
	2'b11: § Swap data at the top of the shift register;

      assign tos_a[i] = srl[depth_a-1];

5.2. Shift register implementation of stack B

The implementation for stack B is the same as for stack A but with the appropriate name changes.

§5.2.1: §5.1

for (i=0; i<width; i=i+1)
  begin : srl_b
    reg [depth_b-1:0] srl;
    always @(posedge clk)
      case ({writing_b,reading_b})
	2'b01: srl <= {srl[depth_b-2:0],1'b0};
	2'b10: srl <= {push_data_b[i],srl[depth_b-1:1]};
	2'b11: srl <= {push_data_b[i],srl[depth_b-2:0]};
      assign tos_b[i] = srl[depth_b-1];

For the shift register implementation we just declare the generate loop variable and then instantiate the two stacks.

6. Undefined implementation

§6.1: §7.1

	    $display("***ERROR: %m: Undefined implementation \"%0s\".",implementation);

7. The main module

This is where we tie everything together.


§1.1. Header
§2.1. Copyright

`timescale 1ns/1ns
§1.4.1. Definitions

§1.1.1. Module declaration

§2.1. Functions

§3.1.1. Determine read and write signals

§3.2.1. Determine count value for stack A
§3.3.1. Determine count value for stack B

§3.4.1. Generate empty
§3.6.1. Conditionally generate the full logic

     $display("***INFO: %m: lifo2.v implementation=\"%0s\".",implementation);
    if (implementation == "BRAM")
      begin : bram_implementation
        §4.6.1. BRAM implementation body
    else if (implementation == "SRL")
      begin : srl_implementation
        §5.1. Shift register implementation
        §6.1. Undefined implementation

A. Front matter

1. Header

§1.1: §7.1

// Title         : Dual LIFO stack
// Project       : Common
// File          : lifo2.v
// Description :
// Two Synchronous LIFO stacks using a single BRAM.
// The stacks start at each end and grow to the middle.

2. Copyright

§2.1: §7.1

// Copyright 2009 Beyond Circuits. All rights reserved.
// Redistribution and use in source and binary forms, with or without modification,
// are permitted provided that the following conditions are met:
//   1. Redistributions of source code must retain the above copyright notice, 
//      this list of conditions and the following disclaimer.
//   2. Redistributions in binary form must reproduce the above copyright notice, 
//      this list of conditions and the following disclaimer in the documentation 
//      and/or other materials provided with the distribution.

B. Downloads

The Verilog code for lifo2.v can be downloaded here. Since this post is so long I have also included a link the article in PDF form here.

A Verilog Preprocessor

A long time ago…

When I used to work at Silicon Engineering in the nineties, I wrote a Verilog preprocessor using Perl. Verilog ’95 is pretty limited in what it can do, so we used this preprocessor on our projects to enhance our Verilog code. Back in the Verilog ’95 days, we used to write every RTL file as a preprocessor file and used GNU Make to convert it into pure Verilog. What with Verilog 2005 and System Verilog, these days I only use this preprocessor when necessary. But that’s still pretty often.

The trick

The cool thing about this preprocessor is that it does very little of the work on its own. It can be used in place of the standard Verilog preprocessor– which is well and good if you need that type of thing, but it’s still pretty limited. After all, you already have something that does that. This preprocessor’s neat trick is that you can embed Perl code inside your Verilog code. What the preprocessor does, essentially, is it converts the input code into a perl program which, when run, prints the input code. Now, by itself that is kind of boring. However the preprocessor allows you to use special comments to drop out of this mode and embed Perl code.

A simple example

One simple example of what you can do is to build a table to use in your Verilog module. In my CORDIC module, I needed a table of arctan values. This table is static and doesn’t ever need to change. All I needed was one entry in the table for each pipeline stage in the CORDIC. The table takes the pipeline stage number as input and returns a 64-bit value as the result. I implemented this as a function which is basically just a big case statement. You can see the special comment is wrapped in a /*@ and @*/ pair. I chose these because they are real Verilog comments so they won’t mess up editors that know how to edit Verilog (like Emacs). Once you drop into Perl mode, anything that your Perl code writes to standard output will appear in the preprocessor output. Here is the input to the preprocessor:

   function [63:0] k_table;
      input [5:0] index;
  $k = 1;
  for my $i (0..63)
      $k *= cos(atan2(1,2**$i));
      printf "%8d: k_table = 64'd%s; // %.20e\n",$i,int($k*2**64+.5),$k;
	   default: k_table = 0;


Code enclosed within the /*@ and @*/ pairs, or prefixed with the //@ comment, is just copied into the output Perl program. Other lines are just printed by the Perl program, which means you can use Perl string interpolation on those lines. Here’s a rather contrived example:

//@ my $reset_style = " or negedge clk";
always @(posedge clk $reset_style)
  q <= d;

In this example, the Perl reset_style variable is used in the verilog code without needing to drop into the more cumbersome comment syntax. This can help make your code a lot more readable. I have a library that I commonly use which defines a bunch of variables. These variables match the Verilog system tasks and functions that begin with a dollar sign ($). This way, you can say $display and the preprocessor will expand the Perl variable display (which, by the way, conveniently has the value $display.)

Other Possibilities

I also write a lot of structured documentation for my designs using XML. By using the XML::LibXML library I can read the XML documentation from my Verilog code and parse that for information that drives the RTL code generation. So I can document registers, instruction sets, interrupts, or other types of information, and then that do generate my verily code.

Useful Links

Look for more posts using this technique soon.

Verilog functions in Xilinx XST

A nice idea

The XST documentation says that Verilog functions are fully supported. I was frustrated today to discover that this is not the case. I was trying to make a library for handling fixed point arithmetic. Module instantiation overhead in Verilog is quite high in terms of lines of code and excess verbiage. All the code I need to use is combinational, so the natural thing to do is have one module with all my functions and parameters which control the representation of the fixed point numbers. After this, I can make instances of the library module with the various fixed point representations I need and use the functions.

So I whip out the following:

module SignedInteger
    parameter out_int = 1;
    parameter out_frac = 1;
    parameter in_a_int = out_int;
    parameter in_a_frac = out_frac;
    parameter in_b_int = in_a_int;
    parameter in_b_frac = in_a_frac;
    parameter out_width = out_int+out_frac;
    parameter in_a_width = in_a_int+in_a_frac;
    parameter in_b_width = in_b_int+in_b_frac;

    function signed [out_width-1:0] sum;
    input signed [in_a_width-1:0] a;
    input signed [in_b_width-1:0] b;
      if (in_a_frac > in_b_frac)
	sum = a + (b<<(in_a_frac-in_b_frac));
	sum = (a<<(in_b_frac-in_a_frac)) + b;


Now I can instantiate SignedInteger in another module and call the sum function. I need the function in another module because I may have multiple instances of SignedInteger in the calling module with different parameter values– sort of a poor man’s class mechanism. Everything simulates just peachy. Now, I synthesize a test case in XST.

Prepare to crash and burn

I’ll spare you the details, but XST has a number of issues with doing things like this, though the first few are not insurmountable. XST will first error out because I used concatenation instead of shift operators in the sum function. XST didn’t like the fact that one of the concatenation multipliers was negative. I got boxed into a corner here by Verilog too, because you can’t use a generate inside a function, and if I use a generate outside the function, it makes it really hard to call the function from outside the module. So, we’ll go back to the drawing board and use the shift operator. Next, XST believes that modules need at least one port. My guess is that the people who wrote XST never thought about just using a module as a container for its tasks and functions. Hmm… it’s not great, but I can add a port that I won’t use.

At this point, I’m faced with an interesting error message:

Analyzing top module .
ERROR:Xst:917 - Undeclared signal .
ERROR:Xst:2083 - "test.v" line 29: Unsupported range for function.

What do you mean, undeclared signal out_width? Firstly, it’s not a signal and secondly, it is so declared. The second message gives me pause. Crap– it really thinks that out_width is a signal. Why can’t it see it?

An experiment

Time for a test case. I try to synthesize this module, and guess what? It compiles without errors.

module test2
   input signed [15:0] a,
   input signed [15:0] b,
   output signed [19:0] sum

  localparam out_width = 20;
  localparam in_a_width = 16;
  localparam in_a_frac = 0;
  localparam in_b_width = 16;
  localparam in_b_frac = 4;

  SignedInteger #(16,4,16,0,12,4) si_16_4_16_0_12_4(.value());

  assign sum = si_16_4_16_0_12_4.sum(a,b);

If I give it the parameters it is looking for, the errors go away. I do get some strange warnings, however…

WARNING:Xst:616 - Invalid property "out_width 00000014": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_frac 00000000": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_int 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_a_width 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_frac 00000004": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_int 0000000C": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "in_b_width 00000010": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "out_frac 00000004": Did not attach to si_16_4_16_0_12_4.
WARNING:Xst:616 - Invalid property "out_int 00000010": Did not attach to si_16_4_16_0_12_4.

Not sure what that means exactly, but it sounds like something is rotten with the parameters.

Incorrect logic

Now this could cause incorrect logic to be synthesized. So I tried this test case.

module sub(empty);
  output empty;
  parameter paramvalue = 1;

  assign empty = 0;
  function [3:0] myparam_plus;
    input [3:0] in;
      myparam_plus = paramvalue + in;
endmodule // sub

module test3(value,empty);
  output [3:0] value;
  output empty;
  parameter paramvalue = 10;
  sub #(2) mysub(.empty(empty));
  assign value = mysub.myparam_plus(3);

The value output is a constant 5. That is, the sub module has a paramvalue of 2 and we add 3 to it. XST synthesizes this with no errors, warnings, or infos. The module is totally clean. Even so, the netlist it produces gives value a constant output of 13. Let’s see Xilinx support try to squirm their way out of this one. This seems like a bug to me.

At any rate, I have to give up on a pure Verilog solution to a fixed point library. Time to use Perl.


Recently, I had an application that needed a LIFO rather than a FIFO. Where FIFO stands for “first in first out”, LIFO stands for “”last in first out”. LIFOs are not as common as FIFOs but, they are used whenever you need to remember something you are currently working on and start on something new. When the new thing is done, you can go back to what it was you were doing before you were interrupted. Due to the sequential nature of software, LIFOs are used a lot in processors as parameter or return “stacks”, and the write operation is called a “push” and the read operation is a “pop”.

A simple implementation

At first blush, implementing a LIFO should be pretty much the same as implementing a FIFO. You have a read pointer and a write pointer and a dual port RAM. Here is a LIFO figure with two entries:


A LIFO with two entries.

You can see that we will read from Location 2 if necessary, and we will write to Location 3. However, unlike a FIFO, a pop operation will affect the write pointer, and worse yet, it will affect the write pointer before the write takes place. In this example, if we were to perform a pop and push operation simultaneously, then we would need to write to Location 2 after doing the read. Afterwards, the read and write pointers would have the same values as before.

Digging deeper

One other thing to think about is that it seems wasteful to use a dual port RAM here. We could use a single port RAM if it weren’t for the pesky case of simultaneous pushes and pops. However, when we do a simultaneous push and pop, the pointers don’t change. We only read and write the location pointed to by the read pointer, taking care to do the read operation before the write operation. Also, if you think about it, we don’t even need two pointers. The pointers always need to move in lock step with each other. The write pointer is always the read pointer if you are pushing and popping, and the read pointer plus one if you are just pushing.

A better approach

The “aha!” moment comes when you realize that the critical memory location is the top of the stack. We need to read and write to this location if we are pushing and popping at the same time. Otherwise, we’re only going to need to read or write the RAM, and never both. If we keep the top of the stack in a register instead of the RAM, then we can do a simultaneous push and pop without using the memory at all. If we do just a push, then we need to write the top of the stack register to the RAM and store the pushed value in the top of the stack register. For a pop without a push, we just need to read the RAM and store the result in the top of the stack register, while simultaneously providing the current top-of-stack register value as the pop result. The figure below shows a stack with two entries using a top of stack register:

A stack with two entries using a top of stack register

A stack with two entries using a top of stack register

The code

Ports for the LIFO are the usual clock and reset. We have status outputs empty and full, as well as a count of the number of data items in the LIFO. In addition, we have a signal indicating a push with its associated data, and another indicating a pop. A top-of-stack value rounds out the ports. Just a note, you can use the top-of-stack value all you want without popping the stack.

`timescale 1ns/1ns
module lifo
    parameter depth = 32,
    parameter width = 32,
    parameter log2_depth = log2(depth),
    parameter log2_depthp1 = log2(depth+1)
   input clk,
   input reset,
   output reg empty,
   output reg full,
   output reg [log2_depthp1-1:0] count,
   input push,
   input [width-1:0] push_data,
   input pop,
   output [width-1:0] tos

You can read my other posts about the synchronous FIFO or constant functions to find out more about the log2 parameters. Here’s the log2 function for completeness, though:

   function integer log2;
      input [31:0] value;
	 value = value-1;
	 for (log2=0; value>0; log2=log2+1)
	   value = value>>1;

Next, we’re going to use a flag to tell us when we’re writing. That is, when we are asked to push and we have room for the data. This means that either we are not full, or if we are full, we are also popping. Reading is similar, but a little simpler. We read if we’re asked to pop and there is data available.

  wire writing = push && (count < depth || pop);
  wire reading = pop && count > 0;

Now we count the data. We compute a combinational next_count value because that allows us to register the empty and full outputs. The logic is really quite simple. The count goes up when we write and don’t read, and it goes down if we read and don’t write. Otherwise, it just stays the same.

  reg [log2_depthp1-1:0] next_count;
  always @(*)
    if (reset)
      next_count = 0;
    else if (writing && !reading)
      next_count = count+1;
    else if (reading && !writing)
      next_count = count-1;
      next_count = count;

  always @(posedge clk)
    count <= next_count;

Full and empty are pretty self explanatory:

  always @(posedge clk)
    full <= next_count == depth;

  always @(posedge clk)
    empty <= next_count == 0;

For the memory pointer, if we are writing, then we use the location pointed to by the count, and if we are reading, we use the location one before.

  wire [log2_depth-1:0] ptr = writing ? count [log2_depth-1:0] : (count [log2_depth-1:0])-1;

Writing to the memory is simple enough. If we are writing and not reading, we write the current top-of-stack into the RAM. The top-of-stack will be replaced with what we are pushing onto the stack. If we are writing and reading, we don’t want to write anything. The top of stack register will do all the work in this case.

  reg [width-1:0] mem [depth-1:0];

  always @(posedge clk)
    if (writing && !reading)
      mem[ptr] <= tos;

Reading is a little more tricky. It would be great if we could just say

 always @(posedge clk)
   if (reading && !writing)
     tos <= mem[ptr];
   else if (writing)
     tos <= push_data;

However, this won’t work. You can’t map that into the read output register of a BRAM. Synthesis tools aren’t smart enough to map part of this onto the RAM and the rest onto outside logic. We’re going to have to give it a little hint. First we can just use a simple register that can map onto the read output register:

  reg [width-1:0] mem_rd;
  always @(posedge clk)
    if (reading)
      mem_rd <= mem[ptr];

Now, we’ll create a top-of-stack “shadow” register. This will hold the value being written to the top of the stack, since we can’t store it in the RAM output register. Only the RAM can do that.

  reg [width-1:0] tos_shadow;
  always @(posedge clk)
    if (writing)
      tos_shadow <= push_data;

Finally, we need a MUX to select who really holds the top-of-stack value.

  reg use_mem_rd;
  always @(posedge clk)
    if (reset)
      use_mem_rd <= 0;
    else if (writing)
      use_mem_rd <= 0;
    else if (reading)
      use_mem_rd <= 1;

  assign tos = use_mem_rd ? mem_rd : tos_shadow;


With this design, you’re going to get an efficient LIFO using a single port RAM. Why is it important to do this in a single port? I’ll answer that question in a future post. For now, you can download the entire LIFO module here.