# Introduction

My post yesterday got me digging into some issues regarding Vivado and BRAM usage. Vivado has a synthesis bug where it generates incorrect logic when you use the so-called “asynchronous” style of reading from a block RAM. You can read my previous post about placing bypass logic to emulate the asynchronous read style in a block RAM.

A quick note though. Xilinx calls this style of RAM read asynchronous. But it is only asynchronous if the read address is combinational. If you register the read address then the access is really synchronous.

Also, note that Xilinx will surely say that this is not a bug. The Vivado synthesis user’s guide says that in this case simulation won’t match synthesized results. I would argue that that is precisely the definition of a synthesis bug. Vivado should raise an error or at least a warning in this case. In fact, if you try to do a truly asynchronous read on a RAM, it will indicate that it cannot use a block RAM and it will instead use a distributed RAM.
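
To make the terminology concrete, here is a minimal simulation-only sketch of the three read styles mentioned above. The module and signal names (read_styles, rd_sync, rd_addr_reg, rd_comb) are made up for illustration and do not come from Xilinx documentation, and hanging three read ports off one memory is not something you would synthesize.

```verilog
// Illustrative only: three ways to read the same memory.
module read_styles
(
  input             clk,
  input             wr_en,
  input      [8:0]  wr_addr,
  input      [31:0] wr_data,
  input      [8:0]  rd_addr,
  output reg [31:0] rd_sync,      // "synchronous": registered read data
  output     [31:0] rd_addr_reg,  // Xilinx "asynchronous": registered address
  output     [31:0] rd_comb       // truly asynchronous: combinational address
);

  reg [31:0] mem [511:0];
  reg [8:0]  rd_addr_q;

  always @(posedge clk)
    begin
      if (wr_en)
        mem[wr_addr] <= wr_data;
      rd_sync   <= mem[rd_addr];  // samples the contents from before this edge
      rd_addr_q <= rd_addr;
    end

  // Read through a registered address. In RTL simulation this sees a write
  // made on the same clock edge that captured the address.
  assign rd_addr_reg = mem[rd_addr_q];

  // Read through the raw address. This is the style that ends up in
  // distributed RAM rather than block RAM.
  assign rd_comb = mem[rd_addr];

endmodule
```

In RTL simulation, when a write and a read hit the same address on the same clock edge, rd_sync returns the old contents while rd_addr_reg returns the newly written data. That difference is what the bypass logic has to reproduce.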

# An example

Here is a rather contrived example of a module with two RAMs: one using the Xilinx “synchronous” read and one using the “asynchronous” read.

## Example Module

`timescale 1ns/1ns
module read_first_example
(
input clk,
input reset,
input [31:0] sync_wr_data,
input [31:0] async_wr_data,
input [31:0] sync_compare_value,
input [31:0] async_compare_value,
output sync_compare,
output async_compare
);

reg [31:0] ram_sync_out;

always @(posedge clk)

always @(posedge clk)

(* ram_style = "block" *) reg [31:0] ram_sync [511:0];
always @(posedge clk)

always @(posedge clk)

assign sync_compare = ram_sync_out == sync_compare_value;

(* ram_style = "block" *) reg [31:0] ram_async [511:0];
always @(posedge clk)

wire [31:0] ram_async_out;
assign ram_async_out = ram_async[async_rd_addr_q];

assign async_compare = ram_async_out == async_compare_value;

endmodule


## Example Test Driver

And here is some test code that uses the module.

`timescale 1ns/1ns

reg clk = 1;
always #5 clk = ~clk;

reg reset = 1;
initial begin repeat(10) @(posedge clk); reset <= 0; end

wire sync_compare, async_compare;

reg [31:0] counter, counter1, counter2;

always @(posedge clk)
begin
counter1 <= counter;
counter2 <= counter1;
if (reset)
counter <= 0;
else
counter <= counter+1;
end

(
.clk(clk),
.reset(reset),
.sync_wr_data(counter),
.async_wr_data(counter),
.sync_compare_value(counter2),
.async_compare_value(counter1),
.sync_compare(sync_compare),
.async_compare(async_compare)
);

initial
begin
wait(!reset);
repeat (1000) @(posedge clk);
$finish;
end
endmodule

## Operation

I use a counter here to generate the RAM write and read addresses, and I also use it for RAM write data and output comparisons. You can see that both RAMs write and read from the same address at the same time. You can also see that the compare value for the synchronous read is delayed by one cycle relative to the compare value for the asynchronous read.

# Discussion

If you simulate the design you will see that the two comparison outputs are true throughout the simulation. However, if you perform a gate-level simulation you will discover that the asynchronous compare output will be false. Looking at the synthesis log file you will discover that there is no error or warning. You do however get a message

    INFO: [Synth 8-6430] The Block RAM "read_first_example/ram_async_reg" may get memory collision error if read and write address collide. Use attribute (* rw_addr_collision= "yes" *) to avoid collision.

But to my mind, we don't have that situation because the read address is the write address delayed by one clock cycle.

Now, here's the strange part. You can add the rw_addr_collision attribute to the ram register declaration like this.

    (* rw_addr_collision = "yes" *) (* ram_style = "block" *) reg [31:0] ram_async [511:0];

Synthesize again and the design will work. Vivado will now add bypass logic to the design much like in the synchronous FIFO design in my previous post. Even more curious is that you can add the attribute to the synchronous RAM declaration and it will still behave correctly. So it appears that Vivado is capable of generating correct logic; you just need to use the magic rw_addr_collision attribute.

# Conclusion

I certainly think this is a Vivado synthesis bug. It should never generate logic that does not match the RTL without some kind of complaint. And further, it should generate bypass logic on the asynchronous style of RAM read unless you tell it not to.

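
For comparison, the explicit alternative to the attribute is to write the bypass logic by hand. Below is a rough sketch of a RAM wrapper that does a synchronous read and muxes in the write data on an address collision, so that in simulation it behaves like the registered-address read. The module name bypass_ram, its ports, and the addr_bits parameter are assumptions made for illustration. Treat it as a simulation-level sketch; it has not been verified in Vivado.

```verilog
// Sketch: synchronous-read RAM plus bypass, emulating rd_data = mem[rd_addr_q].
module bypass_ram
#(
  parameter width = 32,
  parameter addr_bits = 9
)
(
  input                  clk,
  input                  wr_en,
  input  [addr_bits-1:0] wr_addr,
  input  [width-1:0]     wr_data,
  input  [addr_bits-1:0] rd_addr,
  output [width-1:0]     rd_data
);

  (* ram_style = "block" *) reg [width-1:0] mem [(1<<addr_bits)-1:0];

  reg [width-1:0] rd_mem;      // registered RAM output (old data on a collision)
  reg [width-1:0] bypass_reg;  // copy of the data being written
  reg             use_bypass;  // set when the read and write addresses collide

  always @(posedge clk)
    begin
      if (wr_en)
        mem[wr_addr] <= wr_data;
      rd_mem     <= mem[rd_addr];
      bypass_reg <= wr_data;
      use_bypass <= wr_en && wr_addr == rd_addr;
    end

  // On a collision the RAM output is stale, so select the bypass register.
  assign rd_data = use_bypass ? bypass_reg : rd_mem;

endmodule
```

The only case where the registered RAM output and the registered-address read disagree is a write to the address being read on the same clock edge, which is exactly the case use_bypass detects.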
# Synchronous FIFO Redux

So, almost ten years ago to the day, I posted an article on implementing a synchronous FIFO. Well, take the read portion of that FIFO implementation with a grain of salt.

# Asynchronous Read

To summarize, here is the portion of the FIFO which implements the memory.

    reg [width-1:0] mem [depth-1:0];
    always @(posedge clk)
      begin
        if (writing)
          mem[wr_ptr] <= wr_data;
        rd_ptr <= next_rd_ptr;
      end
    assign rd_data = mem[rd_ptr];

Here is a timing diagram of writing one piece of data to the FIFO and then immediately reading it out.

Looks great, doesn’t it?

# The Problem

This code, however, does not work on Xilinx FPGAs from 6 series and above. Both XST and Vivado will happily implement the equivalent of this for the read logic. And by happily, I mean without error or warning.

    always @(posedge clk)
      rd_data <= mem[rd_ptr];

As you and I can tell, that is not the same thing. And it causes the design in my old post to behave in subtly incorrect ways. You can see this when the FIFO is leaving the empty state.

# The Solution

To solve this problem we need to implement a bypass register to hold the write data along with an output mux to select the bypass register when we are reading immediately following a write. Here is the code which does the synchronous read from the RAM.

    reg [width-1:0] rd_mem;
    always @(posedge clk)
      rd_mem <= mem[next_rd_ptr];

Here is the bypass register, and the code to determine when the bypass register should be used instead of the RAM output.

    reg [width-1:0] bypass_reg;
    always @(posedge clk)
      bypass_reg <= wr_data;

    reg use_bypass;
    always @(posedge clk)
      use_bypass <= writing && wr_ptr == next_rd_ptr;

And here is the output MUX.

    assign rd_data = use_bypass ? bypass_reg : rd_mem;

# An Alternate Approach

In looking at the Vivado Synthesis Guide (ug901), you can see in the section called RW_ADDR_COLLISION on page 63, a description of an attribute which allows the write data to take priority over the read data.

If you synthesize the sync_fifo design with the synchronous_read parameter set to zero or one you will see that the same muxing logic is created either way. With synchronous_read set to one, the muxing is explicit. With it set to zero, the mux logic is implicit and Vivado will add it for you. It looks like it is safe to use either way. Unfortunately, the attribute is not supported in XST, so there you will need to use the explicit bypass logic with the synchronous read style. Here is an example of how to set the attribute when you declare the RAM in Verilog.

    (* rw_addr_collision = "yes" *) reg [width-1:0] mem [depth-1:0];

# Conclusion

I actually find it pretty shocking that both XST and Vivado generate incorrect code without any error or warning. Clearly, this is a bug and makes me wonder what other constructs it is implementing incorrectly.

You can download the complete synchronous FIFO design here.

# ZYNQ FSBL – The Saga Continues

# Building an FSBL for the ZC706 using Petalinux

Well, another blog post on how to build a modified FSBL for ZYNQ. Using the patch which I demonstrated how to make in the previous post, and a modified version of the fsbl_%.bbappend file which I received from the Xilinx Forum post regarding this, I was able to make a working FSBL with my patch. The modified fsbl_%.bbappend file is shown below.

    # Force to use embeddedsw repository
    EXTERNALXSCTSRC = ""
    EXTERNALXSCTSRC_BUILD = ""

    #Enable FSBL debug flags
    YAML_COMPILER_FLAGS_append = " -DFSBL_DEBUG"

    # Patch FSBL
    SRC_URI_append += "file://0001-fsbl.patch"
    FILESEXTRAPATHS_prepend := "${THISDIR}/files:"


Here are the steps used to create the FSBL.

First run petalinux-create to build a petalinux project. Point it at the appropriate BSP.

    $ petalinux-create --type project --source /tools/xilinx/bsp/xilinx-zc706-v2017.4-final.bsp

Then copy the fsbl_%.bbappend file into the project.

    $ cp fsbl_%.bbappend xilinx-zc706-2017.4/project-spec/meta-user/recipes-bsp/fsbl
    $ cp 0001-fsbl.patch xilinx-zc706-2017.4/project-spec/meta-user/recipes-bsp/fsbl/files

Next run a petalinux-build to make the bootloader.

    $ petalinux-build --project xilinx-zc706-2017.4 -c bootloader

When all is said and done you will get a zynq_fsbl file in xilinx-zc706-2017.4/images/linux/zynq_fsbl.elf.

This procedure works for 2017.4 and 2016.4.

# Building from the git checkout

Alternatively you can build from the git checkout where you made the patch. This seems much simpler but it doesn’t work for me with petalinux 2016.4.

From the embeddedsw checkout directory cd to lib/sw_apps/zynq_fsbl_src and run the following command.

    $ make BOARD=zc706

This should build the file fsbl.elf in the src directory.

When I try to build this for 2016.4 I get a bunch of errors. They differ depending on how I set the CC variable on the make command line. If I don’t set it I get errors about arm-xilinx-eabi-gcc: Command not found. If I set CC=arm-none-eabi-gcc then I get this

    arm-none-eabi-gcc -c pcap.c -o pcap.o -I../misc//ps7_cortexa9_0/include -I.
    In file included from pcap.c:96:0:
    pcap.h:65:21: fatal error: xdevcfg.h: No such file or directory
    compilation terminated.
    Makefile:97: recipe for target 'pcap.o' failed

If anyone knows what to do about this I’m all ears.

# Getting the Log Base 2 Algorithm to Synthesize

My last post introduced an algorithm for finding the log base 2 of a fixed point number. However, it had a gotcha. It had to use some floating point functions to initialize a table, and even though it is not synthesizing floating point, ISE, Vivado, and Quartus II all refuse to synthesize the design. What should we do?

# Perl Preprocessor to the Rescue

In an older blog post, I discuss a Verilog Preprocessor that I wrote years ago. In the old days of Verilog ’95, preprocessors like this were practically required. The language was missing lots of features, which made external solutions necessary, but these gaps were largely fixed with Verilog 2001 and Verilog 2005. Now with SystemVerilog, things are even better.

However, sometimes tools don’t support all of the language features. In the last blog post, we discovered that the FPGA synthesis tools can’t handle the floating point functions used in the ROM initialization code.

# Computing the lookup table in Perl

What we’ll do is write the ROM initialization code in Perl and use the preprocessor to generate the table before the synthesis tool sees it.

The code which gives the synthesis tools fits is this:

    function real rlog2;
      input real x;
      begin
        rlog2 = $ln(x)/$ln(2);
      end
    endfunction

    reg [out_frac-1:0] lut[(1<<lut_precision)-1:0];
    integer i;
    initial
      begin : init_lut
        lut[0] = 0;
        for (i=1; i<1<<lut_precision; i=i+1)
          lut[i] = $rtoi(rlog2(1.0+$itor(i)/$itor(1<<lut_precision))*$itor(1<<out_frac)+0.5);
      end

We can essentially turn this code into Perl and embed it, so that the preprocessor will be able to generate it for us. Remember, the Perl code goes inside special Verilog comments.

First, we’re going to need to define some parameter values in Perl and their counterparts in Verilog. These define the maximum allowable depth and width of the lookup table.

    //@ my $max_lut_precision = 12;
    localparam max_lut_precision = $max_lut_precision;
    //@ my $max_lut_frac = 27;
    localparam max_lut_frac = $max_lut_frac;

Here is the Perl code to compute the lookup table values:

    /*@
    sub compute_lut_value {
      my $i = shift;
      return log(1.0+$i/(1<<$max_lut_precision))*(1<<$max_lut_frac)/log(2.0);
    }
    @*/

Then, we’ll embed the lookup table inside a Verilog function.

    function [max_lut_frac-1:0] full_lut_table;
      input integer i;
      begin
        case (i)
    //@ for my $i (0..(1<<$max_lut_precision)-1) {
    //@ my $h = sprintf("${max_lut_frac}'h%x",int(compute_lut_value($i)+0.5));
          $i: full_lut_table = $h;
    //@ }
        endcase
      end
    endfunction


We’re also going to parallel the Perl code with a pure Verilog function, just like our local parameters.

  function [out_frac-1:0] compute_lut_value;
input integer i;
begin
compute_lut_value = $rtoi(rlog2(1.0+$itor(i)/$itor(1<<lut_precision))*$itor(1<<out_frac)+0.5);
end
endfunction


# But how do you parameterize it?

There happens to be one gotcha when working with preprocessors: you really don’t want to use them. Back in the Verilog ’95 days, my colleagues and I just used the preprocessor for every Verilog file. All of our parameterization was done there. We used a GNU Make flow, and it built all the Verilog files for us automatically. But with the advent of Verilog 2001 and SystemVerilog, a lot of things that we used the preprocessor for could be done within the language– and much better, too. One of the crufty things about using the preprocessor was that you needed to embed the parameter values in the module names. Otherwise, you would have module name conflicts for different modules with different parameter values. In this case, we actually want the preprocessor to generate a parameterized Verilog module. We still want to use the normal Verilog parameter mechanism to control the width and depth of the lookup table.

To do this, we must generate a maximal table in the preprocessor, and then cut from that table, using Verilog, a subtable that has the desired width and depth based on our Verilog parameter values.

Here is the code to do that. If the values of the depth and width (precision and fractional width) exceed the maximal Perl values, then we just use the pure Verilog implementation and the code will not be synthesizable, but it will still work in simulation, at least. If the parameter values are “in bounds” for the preprocessor-computed lookup table, then we’re going to go ahead and cut our actual table from the Perl generated lookup table, and the design will be synthesizable.

  reg [out_frac-1:0] lut[(1<<lut_precision)-1:0];
integer i;
generate
// If the parameters are outside the bounds of the static lookup table then
// compute the lookup table dynamically. This will not be synthesizable however
// by most tools.
if (lut_precision > max_lut_precision || out_frac > max_lut_frac)
initial
begin : init_lut_non_synth
lut[0] = 0;
for (i=1; i<1<<lut_precision; i=i+1)
begin
lut[i] = compute_lut_value(i);
end
end
else
// The parameters are within bounds so we can use the precomputed table
// and synthesize the design.
initial
begin : init_lut_synth
for (i=0; i<1<<lut_precision; i=i+1)
begin : loop
reg [max_lut_frac-1:0] table_value;
table_value = full_lut_table(i<<(max_lut_precision-lut_precision));
lut[i] = table_value[max_lut_frac-1:max_lut_frac-out_frac];
end
end
endgenerate


We can then finish off the design, just like in Computing the Logarithm Base 2.
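
The index shift used when cutting the table can be checked with a little algebra. Writing $P$ for max_lut_precision and $p$ for lut_precision (shorthand introduced here, not names from the code), entry $j$ of the full table holds the log of $1 + j/2^P$, so entry $i$ of the cut-down table needs

$1 + \frac{i}{2^p} = 1 + \frac{i \cdot 2^{P-p}}{2^P} \quad\Rightarrow\quad j = i \cdot 2^{P-p}$

which is the i<<(max_lut_precision-lut_precision) in the code. One caveat, as far as I can tell: keeping only the top out_frac bits of an entry truncates the value rather than rounding it, so a cut entry can differ by one least-significant bit from what the pure Verilog compute_lut_value would produce.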

# Conclusion

This may be a lot to take in. But the gist is that you build a table using the Perl preprocessor, which is large enough to use for all parameter values. Then in Verilog, you use the actual parameter values to cut out the portion of the pre-computed table that you need. This cutting out can be done during the elaboration or initialization stages of the synthesis. Of course, our job would be much easier if the synthesis tool developers got it into their heads that using floating point and math during elaboration or initialization does not necessitate synthesizing floating point logic.

If anyone knows of a software floating point library written purely in Verilog, please let me know. We could then use that to trick the synthesis tools into doing what we want.

Oh, and here’s a link to the complete file.

# Detecting the rising edge of a short pulse

A reader is going through my ZedBoard tutorial and had some questions about detecting the rising edge of a pulse. The tutorial in question is using a ZedBoard to make a stopwatch. Kind of overkill in terms of hardware, but you have to start somewhere when you’re learning to code.

TRAN MINHHAI’s question asked: what do you do when the rising edge might just be a pulse, and the pulse might last less than a single clock cycle? The answer is to use a flip-flop with the input signal going to an asynchronous set input. The data input is just zero, and the clock signal is the one we are synchronizing to.

Here is the code:


`timescale 1ns/1ns
module edge_detect
(
input clk,
input btnl,
output btnl_rise
);

reg btnl_held = 0;
always @(posedge clk or posedge btnl)
if (btnl)
btnl_held <= 1;
else
btnl_held <= 0;

reg [1:0] btnl_shift = 0;
always @(posedge clk)
btnl_shift <= {btnl_shift,btnl_held};

assign btnl_rise = btnl_shift == 2'b01;
endmodule


I also wrote a little test to go with it. Notice how short little pulses can come anywhere with respect to the clock edge:


`timescale 1ns/1ns
module edge_detect_test;

reg clk = 0;
always #10 clk = ~clk;

reg btnl = 0;

wire btnl_rise;

edge_detect edge_detect(clk,btnl,btnl_rise);

initial
begin
$dumpvars(0);
@(posedge clk) btnl <= 1;
repeat (10) @(posedge clk);
btnl <= 0;
repeat (4) @(posedge clk);
#10 btnl <= 1;
#1 btnl <= 0;
repeat (4) @(posedge clk);
#19 btnl <= 1;
#1 btnl <= 0;
repeat (4) @(posedge clk);
#20 btnl <= 1;
#1 btnl <= 0;
repeat (4) @(posedge clk);
$finish;
end
endmodule


## The long pulse case

In the case of a long pulse, the design works just like it would without the asynchronous set flip-flop. Here’s a timing diagram:

## The short pulse case

But if there is a short pulse, the asynchronous set flip-flop holds the input value until there is a clock edge.

If the pulse appears in the middle of the clock cycle, then the timing diagram looks like this:

If the pulse appears right before the clock edge, then the timing diagram looks like this:

Now you know how to synchronize the rising edge of a pulse even if you have a slow clock.

# Algorithm

Computing the log base 2 of a whole number is easy: just use a priority encoder. But what about a fixed point number? That sounds a lot harder. But, like many things, there’s a trick to make things easier.

Consider the following equation:

$\log_2{x} = \log_2 (2^n \cdot 2^{-n} \cdot x) = n + \log_2(2^{-n} \cdot x)$

If we choose $n$ such that $2^{-n}\cdot x\in [1,2)$ , then $n$ is just the whole number portion of the log, and we can compute that with a priority encoder. Furthermore, we can compute the $2^{-n}\cdot x$ portion by barrel-shifting the original number. Finally, we can use a lookup table to compute the fractional bits of the result. The nice thing about this is that the lookup table can be much smaller, since we only need to store the logarithm for values between 1 and 2.

Another way to think about this is that the logarithm curve from [2,4) is the same as the one from [1,2), just shifted up by 1 and scaled horizontally by 2. Likewise the curve from [4,8) is shifted and scaled the same way relative to [2,4). The algorithm is just taking advantage of this symmetry.

Oh, and this algorithm only works on input values greater than or equal to one, so the input is always positive and the output is never negative.
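
As a quick sanity check of the decomposition, take $x = 10$ (an example value picked here for illustration). In binary that is $1010_2$, whose leading one is at bit 3, so $n = 3$ and the barrel shift leaves $2^{-3}\cdot 10 = 1.25$:

$\log_2 10 = 3 + \log_2 (2^{-3} \cdot 10) = 3 + \log_2 1.25 \approx 3 + 0.3219 = 3.3219$

The priority encoder supplies the 3, and the lookup table, addressed by the bits just below the leading one, supplies the 0.3219.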

# Implementation

Now, let’s code up our Verilog module. We’ll start with the module declaration:

    module log2
    #(
      parameter in_int = 16,
      parameter in_frac = 8,
      parameter out_int = $clog2(in_int),
      parameter out_frac = in_int+in_frac-out_int,
      parameter lut_precision = 6,
      parameter register_output_stage = 0
    )

Since the module is dealing with a fixed-point input and output value, we need to specify how many integer and fractional bits there are in the input and output values. The lut_precision parameter specifies the log base 2 of the number of entries in our lookup table, and the default setting will be 64 entries in the table. There is also a parameter which allows for an optional final output register stage. Next come the module ports:

    (
      input clk,
      input [in_int+in_frac-1:0] din,
      input din_valid, // data in is valid
      output din_ready, // ready to receive data
      output reg [out_int+out_frac-1:0] dout,
      output reg dout_valid, // data out is ready (valid)
      output reg dout_error // data out is incorrect - input data was less than 1.0
    );

We have the clock clk and the input value din. There is a handshake input, din_valid, indicating that the din value is valid. There is an output signal din_ready indicating that the module is ready to accept an input value. There is, of course, the output value dout, and a valid signal dout_valid. Since it is possible to provide an illegal input between 0 and 1, there is also a dout_error signal indicating the output value is not valid due to an inadmissible input value.

Now, let’s look at the body of the module:

    assign din_ready = 1;

Ready? Please. We’re always ready.

## Pipeline Stage 1

Next, we instantiate the recursive priority encoder from my previous blog post. This takes the integer portion of the input value and produces the integer portion of the output value. There is also a prienc_error output which indicates that none of the input bits were set. This means we were given an input value that was less than one.

    wire [$clog2(in_int)-1:0] prienc_out;
    wire prienc_error;

    priencr #(in_int)
    priencr
    (
      .decode(din[in_frac+:in_int]),
      .encode(prienc_out),
      .valid(prienc_error)
    );


Next, we have the stage one pipeline logic:

    reg [$clog2(in_int)-1:0] stage1_prienc_out;
    reg stage1_error;
    reg stage1_valid;
    reg [in_int+in_frac-1:0] stage1_din;
    always @(posedge clk)
      begin
        stage1_din <= din;
        stage1_prienc_out <= prienc_out;
        stage1_error <= prienc_error;
        stage1_valid <= din_valid;
      end

## Pipeline Stage 2

Stage two of the pipeline is the barrel shift logic. This shifts the input value to the left, based on the priority encoder output. Things are flipped, though, so lower priority encoder output values cause a larger shift. We’ll also pipeline the other signals:

    reg [in_int+in_frac-1:0] stage2_barrel_out;
    reg [out_int-1:0] stage2_barrel_out_int;
    reg stage2_error;
    reg stage2_valid;
    always @(posedge clk)
      begin
        stage2_barrel_out <= stage1_din << (in_int-stage1_prienc_out-1);
        stage2_barrel_out_int <= stage1_prienc_out;
        stage2_error <= stage1_error;
        stage2_valid <= stage1_valid;
      end

## Pipeline Stage 3

The third pipeline stage is the lookup table, which computes the fractional part of the output. We will use an initial block to fill the table. First, we define a function to take the floating point log base-2 of an input value, since Verilog does not have one built in. Remember this rule from high school algebra?

$\log_2 x = \frac{\ln x}{\ln 2}$

    function real rlog2;
      input real x;
      begin
        rlog2 = $ln(x)/$ln(2);
      end
    endfunction

Next we declare the table and fill it in an initial block:

    reg [out_frac-1:0] lut[(1<<lut_precision)-1:0];
    integer i;
    initial
      begin : init_lut
        lut[0] = 0;
        for (i=1; i<1<<lut_precision; i=i+1)
          lut[i] = $rtoi(rlog2(1.0+$itor(i)/$itor(1<<lut_precision))*$itor(1<<out_frac)+0.5);
      end


Now, we need to use the barrel shift output as an address to the lookup table. We also carry along some of the stage 2 results:

  reg [out_frac-1:0] stage3_lut_out;
reg [out_int-1:0] stage3_out_int;
reg stage3_error;
reg stage3_valid;
always @(posedge clk)
begin
stage3_out_int <= stage2_barrel_out_int;
stage3_lut_out <= lut[stage2_barrel_out[in_int+in_frac-2-:lut_precision]];
stage3_error <= stage2_error;
stage3_valid <= stage2_valid;
end


Finally we have the code for the optional stage 4 pipeline. This is controlled by the register_output_stage parameter:

  generate
if (register_output_stage)
begin
always @(posedge clk) dout <= {stage3_out_int,stage3_lut_out};
always @(posedge clk) dout_error <= stage3_error;
always @(posedge clk) dout_valid <= stage3_valid;
end
else
begin
always @(*) dout = {stage3_out_int,stage3_lut_out};
always @(*) dout_error = stage3_error;
always @(*) dout_valid = stage3_valid;
end
endgenerate


# Caveats

So there you have it: a module that computes log base 2 for fixed point inputs. However, there’s still one tiny problem: the design is not synthesizable by XST or Vivado. I haven’t tried other tools, but they may have issues with it as well.

The issue is that the tools don’t seem to implement the built-in Verilog floating point functions during elaboration. Essentially, the synthesis tool needs to run the initial block in order to know how to populate the lookup table. In general, the tools can do this. However, if they catch wind of a float value, they tuck their tail between their legs and run.

My workaround for this is to use the Perl preprocessor I describe in a previous blog post. But that’s a topic for another time.

# Getting Xilinx Document Navigator working with CentOS 6.5

There are a number of packages you need to install beyond the default to get Document Navigator working on CentOS 6.5. What’s especially strange is that they are 32-bit libraries, which are required even on 64-bit machines. Here are the yum commands I used to get everything to work. There are a couple of frustrations, because the latest CentOS 6 repositories are for CentOS 6.6, which is not compatible with the latest Vivado release. So you’ll also need to update a few things on the x86_64 side before doing the install for the i686 libraries:


% sudo yum update libXrender glib2
% sudo yum install fontconfig.i686 libXext.i686 libXrender.i686 glib2.i686 libpng.i686 libSM.i686


# Why is Xilinx 2014.4 SDK so buggy on Linux?

I am running CentOS 6.6 with Vivado and SDK, but SDK has been crashing like crazy on me. It dies with the following message.

java: cairo-misc.c:380: _cairo_operator_bounded_by_source: Assertion `NOT_REACHED' failed.

Turns out it is a problem with versioning between gtk2 and cairo. Here is a link with the details.

There are some posts on the Xilinx forum about installing updated RPMs to fix this, but it is not a good idea to install RPMs outside of the YUM framework. Another alternative is to disable cairo in eclipse. So edit your eclipse configuration file in install_dir/SDK/2014.4/eclipse/lnx64.o/configuration/config.ini (and the 32-bit version if that is appropriate) to include the following line.

org.eclipse.swt.internal.gtk.cairoGraphics=false

This seems to have fixed things at least for now.