From: "Kevin Neilson" <kevin_neilson@yahoo.com>
Newsgroups: comp.arch.fpga
Subject: BlockRAM outputs and the Placer
Message-ID: <ferG6.496$dH5.56041@newsread2.prod.itd.earthlink.net>
Date: Sat, 28 Apr 2001 03:59:07 GMT
Organization: EarthLink Inc. -- http://www.EarthLink.net

I had a problem with the BRAM on my last project.  The Xilinx BRAM
clock->out time is about 3ns.  This is fine.  The problem I had was that the
nets connecting the BRAM data outputs to other logic seemed to have
excessively long delays, many over 3ns.  Even when I handplaced flops right
next to the BRAMs the net delays still seemed excessive.  This was in a part
that was only 30% utilized, so there were no real routing issues.

Have others had this problem?  I registered the BRAM outputs, but in many
cases it was still over 7ns to get the data out, across the net, and meet
the flop setup time.
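
To put rough numbers on that budget, here is a minimal Python sketch. The
3 ns clock-to-out is the figure quoted above; the net delay and setup time
are assumed values chosen only so the total lands a little over 7 ns.

```python
# Rough register-to-register timing budget for a BRAM output path.
# Only the ~3 ns clock-to-out comes from the post; the net delay and
# setup time below are assumptions for illustration.

def path_delay_ns(clock_to_out, net_delay, setup):
    """Total delay from the BRAM clock edge to the capturing flop."""
    return clock_to_out + net_delay + setup

def max_clock_mhz(total_delay_ns):
    """Highest clock frequency at which the path still meets timing."""
    return 1000.0 / total_delay_ns

total = path_delay_ns(clock_to_out=3.0, net_delay=3.5, setup=0.7)
print(f"path delay: {total:.1f} ns, max clock: {max_clock_mhz(total):.0f} MHz")
```

With these assumed numbers the net delay alone is roughly half the budget,
which is why the routing matters more than the BRAM itself.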

This raises an additional question about the placer.  The placer seems to
make some really poor decisions.  All of my critical nets had huge delays
because parts that should have been placed together were inexplicably placed
on opposite sides of the part.  Place and route seem to be completely
independent processes.  Isn't there a way to place, route, and then place
again based on the route information, and then route again?  This seems like
a good idea, because on the second placement round, the placer could look at
the critical nets from the previous route and say, "Whoa, it takes 4ns to get
from flop A to flop B; maybe I should place those closer together and try
routing again."  But from what I can tell, placement is done first, and then
locked down, with no regard to the post-route net delays.
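
The loop I have in mind could be sketched roughly like this in Python. It
is a toy, not how any real tool works: the grid, the delay model (a fixed
0.5 ns plus 0.5 ns per grid step), and every name are hypothetical.

```python
# Toy place -> route -> re-place loop. The delay model and all names
# here are hypothetical, purely to illustrate the feedback idea.

def net_delay(placement, net):
    """Estimated delay of a two-pin net from Manhattan distance."""
    (ax, ay), (bx, by) = placement[net[0]], placement[net[1]]
    return 0.5 + 0.5 * (abs(ax - bx) + abs(ay - by))

def free_site_near(placement, target, size):
    """Closest unoccupied grid site to `target`, by Manhattan distance."""
    used = set(placement.values())
    sites = [(x, y) for x in range(size) for y in range(size)
             if (x, y) not in used]
    return min(sites, key=lambda s: abs(s[0] - target[0]) + abs(s[1] - target[1]))

def place_route_iterate(placement, nets, budget_ns, size, max_rounds=10):
    placement = dict(placement)  # work on a copy
    for _ in range(max_rounds):
        # "Route": estimate every net delay from the current placement.
        worst = max(nets, key=lambda n: net_delay(placement, n))
        if net_delay(placement, worst) <= budget_ns:
            break  # every net meets the budget
        # "Re-place": pull one end of the worst net next to the other end.
        src, dst = worst
        placement[dst] = free_site_near(placement, placement[src], size)
    return placement

start = {"A": (0, 0), "B": (7, 7), "C": (1, 0)}
nets = [("A", "B"), ("A", "C")]
result = place_route_iterate(start, nets, budget_ns=2.0, size=8)
print(result["B"], net_delay(result, ("A", "B")))
```

Moving one cell can lengthen its other nets, so a real implementation
would need a cost function over all nets rather than this greedy move.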

######

Message-ID: <3AEA63A5.1A4F1C8@nowhere.com>
From: Anonymous Idiot <anonymous_idiot@nowhere.com>
Newsgroups: comp.arch.fpga
Subject: Re: BlockRAM outputs and the Placer
References: <ferG6.496$dH5.56041@newsread2.prod.itd.earthlink.net>
Date: Fri, 27 Apr 2001 23:31:01 -0700
Organization: SBC Internet Services

> This raises an additional question about the placer.  The placer seems to
> make some really poor decisions.  All of my critical nets had huge delays
> because parts that should have been placed together were inexplicably placed
> on opposite sides of the part.  Place and route seem to be completely
> independent processes.  Isn't there a way to place, route, and then place
> again based on the route information, and then route again?  This seems like
> a good idea, because on the second placement round, the placer could look at
> the critical nets from the previous route and say, "Whoa, it takes 4ns to get
> from flop A to flop B; maybe I should place those closer together and try
> routing again."  But from what I can tell, placement is done first, and then
> locked down, with no regard to the post-route net delays.

You've pretty much described "physically knowledgeable synthesis"... a
concept Synopsys, Cadence, and everyone else are pushing toward.

Those software tools don't come cheap...

It's just a matter of time before FPGA vendors incorporate the same
ideas into their development tool suite.

######

From: Allan Herriman <allan_herriman.hates.spam@agilent.com>
Newsgroups: comp.arch.fpga
Subject: Re: BlockRAM outputs and the Placer
Date: Sat, 28 Apr 2001 23:34:30 +1000
Organization: Agilent Technologies
Message-ID: <3AEAC6E5.3924F651@agilent.com>
References: <ferG6.496$dH5.56041@newsread2.prod.itd.earthlink.net>

Kevin Neilson wrote:
> 
> I had a problem with the BRAM on my last project.  The Xilinx BRAM
> clock->out time is about 3ns.  This is fine.  The problem I had was that the
> nets connecting the BRAM data outputs to other logic seemed to have
> excessively long delays, many over 3ns.  Even when I handplaced flops right
> next to the BRAMs the net delays still seemed excessive.  This was in a part
> that was only 30% utilized, so there were no real routing issues.
> 
> Have others had this problem?  I registered the BRAM outputs, but in many
> cases it was still over 7ns to get the data out, across the net, and meet
> the flop setup time.
> 
> This raises an additional question about the placer.  The placer seems to
> make some really poor decisions.  All of my critical nets had huge delays
> because parts that should have been placed together were inexplicably placed
> on opposite sides of the part.  Place and route seem to be completely
> independent processes.  Isn't there a way to place, route, and then place
> again based on the route information, and then route again?  This seems like
> a good idea, because on the second placement round, the placer could look at
> the critical nets from the previous route and say, "Whoa, it takes 4ns to get
> from flop A to flop B; maybe I should place those closer together and try
> routing again."  But from what I can tell, placement is done first, and then
> locked down, with no regard to the post-route net delays.

Hi Kevin,

I've had the same problem.  Even with a mostly empty chip, it's still
possible to have route congestion around the block rams.

My fix was to hand place the flip flops close to the block rams. 
(Actually, I wrote a Perl script to generate the UCF.)
This helped a lot, but I was still getting congestion.  The block ram
signals were so congested, PAR was using route-throughs.  (A
route-through is when PAR passes the signal through a CLB to get around
the lack of routing resources.)
You can check your timing report (or FPGA Editor) to find route-throughs.
Unfortunately, there is no way to tell PAR not to use route-throughs on
some nets.
Xilinx Support suggested setting the timespec priority, but that seems
to only be used to resolve which timespec gets used when there are
multiple timespecs covering the same path.
It's also possible to set the priority in the PCF, but that made the
tools crash, so I wasn't able to find out whether it would have helped.
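
For anyone who wants to script the hand placement, here is a rough Python
sketch of the kind of UCF generator described above (the original was
Perl). The instance names, the RAMB4 and CLB site strings, and the
row/column arithmetic are all illustrative assumptions; check the real
site names for your device in FPGA Editor before using anything like this.

```python
# Sketch of a UCF generator that pins block RAM output registers to CLB
# sites next to the RAM. Instance names, site strings and the row/column
# arithmetic are assumptions, not taken from a real design.

def bram_output_locs(bram_inst, bram_site, ff_prefix, width, first_row, col):
    """Return UCF text locating one block RAM and its output flops."""
    lines = [f'INST "{bram_inst}" LOC = {bram_site};']
    for bit in range(width):
        row = first_row + bit // 4   # assume 4 flops per CLB (2 slices x 2)
        slice_idx = (bit // 2) % 2   # two flops in S0, then two in S1
        lines.append(
            f'INST "{ff_prefix}<{bit}>" LOC = CLB_R{row}C{col}.S{slice_idx};'
        )
    return "\n".join(lines)

ucf = bram_output_locs("ram0", "RAMB4_R0C0", "ram0_dout_reg", 8,
                       first_row=1, col=1)
print(ucf)
```

Generating the file this way makes it cheap to try a different column or
starting row when the first guess still shows route-throughs.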

In the end, I made it work by

1.  Using a timespec that was faster than it needed to be on the paths
that were giving problems.  Try a few different values.

2.  Hand placing everything.  (That Perl script again...)
Automatic blockram placement is poor.  Don't rely on it.  You may be
able to get away with automatic placement (perhaps aided by area
constraints) for the flip flops.

3.  MPPR.  Occasionally, I would get a cost table that caused the tools
to actually give acceptable results.  The best cost table would change
every time the design changed.

4.  Making the other parts of the design easier to route, mostly through
source code changes (pipelining, etc.).  It seems that PAR only expends
so much effort before it gives up, so don't waste any of that effort on
something that isn't so important.
Similarly, don't overconstrain parts of the design that don't need it.

5.  (The most powerful technique of them all :) Make small changes to
unrelated parts of the design, then run it through the tools again. 
This may perturb the placement enough to make a big difference to the
critical path.

6.  Move other, unrelated parts of the design further away, so that they
don't interfere with the critical stuff.
Similarly, move the related parts of the design closer, so they don't
"pull" on your critical paths as much.

7.  Don't trust Map.  It will sometimes put resources that really should
be on opposite sides of the chip into the same CLB.  There is nothing
PAR can do about this, and there is no feedback from placement to Map,
so Map doesn't know it's not doing the right thing.  Work around this by
using placement constraints on one or more of the problem resources, so
that Map won't group them.
This effect is really easy to see in the floorplanner.
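
The cost table sweep in item 3 is easy to automate. A minimal Python
sketch, assuming a hypothetical `run_par` callable that performs one PAR
run for a given cost table and returns the worst slack in ns (the slack
figures below are made-up stand-ins, not results from a real design):

```python
# Sketch of choosing the best run from a multi-pass place and route
# sweep. `run_par` is a hypothetical callable standing in for one PAR
# run; it takes a cost table number and returns the worst slack in ns.

def best_cost_table(run_par, cost_tables):
    """Run every cost table and return (best table, all results)."""
    results = {table: run_par(table) for table in cost_tables}
    best = max(results, key=results.get)  # largest slack wins
    return best, results

# Stand-in for real tool runs: made-up slack values per cost table.
fake_slack = {1: -0.8, 2: -0.3, 3: 0.2, 4: -1.1}
best, results = best_cost_table(fake_slack.get, range(1, 5))
print(f"best cost table: {best}, worst slack {results[best]:+.1f} ns")
```

Since the best table changes whenever the design changes, the sweep has
to be repeated after every edit rather than cached.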

While I'm here...
Did anyone else notice how the block ram timing tBCKO *more than
doubled* with the 1.56 speed files (Virtex-E -8 speed grade)?  That was
after the parts had been available for almost a year.  Was this change
yield related?

Good luck,
Allan.

######

Message-ID: <3AEAD134.A7C36184@home.com>
From: Phil Hays <spampostmaster@home.com>
Organization: "Real email is phil-hays aATt homeDOTcom"
Newsgroups: comp.arch.fpga
Subject: Re: BlockRAM outputs and the Placer
References: <ferG6.496$dH5.56041@newsread2.prod.itd.earthlink.net>
Date: Sat, 28 Apr 2001 14:20:53 GMT

Kevin Neilson wrote:

> This raises an additional question about the placer.  The placer seems to
> make some really poor decisions.  All of my critical nets had huge delays
> because parts that should have been placed together were inexplicably placed
> on opposite sides of the part.  Place and route seem to be completely
> independent processes.  Isn't there a way to place, route, and then place
> again based on the route information, and then route again?  This seems like
> a good idea, because on the second placement round, the placer could look at
> the critical nets from the previous route and say, "Whoa, it takes 4ns to get
> from flop A to flop B; maybe I should place those closer together and try
> routing again."  But from what I can tell, placement is done first, and then
> locked down, with no regard to the post-route net delays.

Welcome to FPGA design.  The only comment you missed is that synthesis and
mapping are also independent processes.  Sooner or later you will get two
flops that need to be in opposite corners mapped into the same CLB.  Place and
route then have no chance at all of making timing.  And synthesis will build
logic that just can't work, as the really long, slow net ends up going through
the most LUTs.

As another person replied, "physically knowledgeable synthesis" is one solution
that does work.  Pricey, yes.  Has some important limitations, yes.  I've used
Amplify by Synplicity, and it's a good tool.  Another solution is putting
placement information on key elements of the design using VHDL attributes or the
.ucf file.


-- 
Phil Hays