From: Neil Franklin
Newsgroups: comp.arch.fpga
Subject: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: 19 Mar 2001 00:02:36 +0100
Organization: My own Private Self
Message-ID: <6ulmq2y92r.fsf@chonsp.franklin.ch>

In the earlier Xilinx chips (3000, 4000, 5200) there are always 2 TBUFs per CLB of 2 LUT+FF, and both TBUF lines can be read.

In Virtex and Spartan-II it is down to 2 TBUFs for the 4 LUT+FF of a CLB, so only one slice can be routed to them, and there is only 1 line for reading back from the TBUF lines.

In Virtex-II there are only 2 TBUFs for the 8 LUT+FF per CLB, which makes it impossible even to connect the output of the 4-wide 2 slices of a single carry chain to a bus. The data sheet does not give the number of readbacks.

From this I get the impression that Xilinx regards TBUF buses as going out of fashion. After all, the TBUFs' cost in chip space is next to nothing relative to the many PIPs (about 900 per CLB in Virtex).

In the Jan Gray RISC processors TBUFs are used to implement processor-internal data buses in no space. I have the same type of situation, with many data-producing elements to be selected from. TBUFs seem to be ideal _horizontal_ wide AND-ORs (vertical is being used for the bits, because of the carry chains).

So I have a question: What is the Xilinx-suggested replacement for TBUFs? Is one supposed to use MUXes implemented in the CLBs? Is there another trick I have not yet stumbled over?

Note that I need to use Spartan-II, as Spartan is too small and JBits only runs on Virtex and Spartan-II anyway.
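The "horizontal wide AND-OR" view of a TBUF bus can be sketched in plain Java (not JBits). This is only a toy model, assuming that at most one enable is active at a time; the class and method names are made up for illustration and do not correspond to any Xilinx or JBits API.

```java
/** Toy model: a one-hot-enabled shared bus expressed as a wide AND-OR. */
class AndOrBus {
    /**
     * Each source contributes to the bus only when its enable is set,
     * which is the job one TBUF per source and bit would do. With at
     * most one enable active, ORing the gated sources gives the value
     * of the selected source.
     */
    static int andOrBus(int[] sources, boolean[] enables) {
        int bus = 0;
        for (int i = 0; i < sources.length; i++) {
            if (enables[i]) {
                bus |= sources[i];
            }
        }
        return bus;
    }

    /** One LUT's share of such a mux, for one bit: (in0 & en0) | (in1 & en1). */
    static boolean lut2to1(boolean in0, boolean en0, boolean in1, boolean en1) {
        return (in0 && en0) || (in1 && en1);
    }

    public static void main(String[] args) {
        int[] sources = {0x12, 0x34, 0x56, 0x78};
        boolean[] enables = {false, false, true, false};
        System.out.printf("bus = 0x%02X%n", andOrBus(sources, enables));
    }
}
```

A mux built this way spends one LUT per pair of sources per bit, which is the LUT and interconnect cost the question is about.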
-- 
Neil Franklin, neil@franklin.ch.remove http://neil.franklin.ch/
Hacker, Unix Guru, El Eng HTL/FH/BSc, Sysadmin, Roleplayer, LARPer

######

From: "Jan Gray"
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Mon, 19 Mar 2001 06:28:08 -0800
Organization: Gray Research LLC
Message-ID: <995523$paj$1@slb6.atl.mindspring.net>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch>

Indeed, the halving of TBUFs/LUT in Virtex, and again in V-II, makes my datapaths larger/less functional per LUT, compared with XC4000. (Consider what happens to the result mux in the xr16 CPU datapath schematic S3 on p. 9 of www.fpgacpu.org/papers/xsoc-series-drafts.pdf, for example. Not to mention the "zero cost" <<2, <<4, <<8, >>2, >>4, >>8 shifters and bus byte/word/longword resizers you can build with spare columns of TBUFs.)

Reoptimizing for Virtex has been a chore. (Another setback in Virtex vs. 4000 was the loss of independent clock inversion on LUT RAM WCLKs and LUT FF CLKs, but that's another story.)

But for V-II there seems to be no practical alternative but to
a) use (waste) LUTs and their interconnect to build these horizontal muxes, and/or
b) recode your design to help your technology mappers merge some of the muxes into other logic.

Regarding (b), using the Virtex-style carry logic (including MULT_AND), it seems possible to build these "free mux" structures:
1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
2) o[i] = add ? (a[i] + b[i]) : c[i]
3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i] : a[i]^b[i])

See http://www.fpgacpu.org/log/nov00.html#001112 for details.

Synthesis tools get (1) (usually) but (as far as I know) miss the others.

Consider case (2). An add followed by a mux would seem to be a pretty common circuit structure, and therefore important to optimize. Surely using the single-LUT-per-bit construction is a no-brainer, right? Not so fast! There are some tools issues.

If you inefficiently implement this as two LUTs:
  t = a[i] + b[i];
  o[i] = add ? t : c[i]
then trce will "see" that the latency from c[i] to o[i] is Tilo. Good.

But if you implement it in one LUT as
  o[i] = add ? (a[i] + b[i]) : c[i]
e.g.
  o[i] = add&(a[i]^b[i]) + ~add&c[i]
along with the appropriate configuration of MULT_AND, MUXCY, and XORCY, then (if I recall correctly) trce will also find false ripple-carry paths from c[i], e.g. from c[0] to o[31], which would therefore interfere with correct static timing analysis and with timing-driven placement and routing. Oops!

Therefore, I would like to see two tools enhancements to enable correct inference of add/mux in one LUT per bit:

a) Xilinx should enhance trce to do a more precise analysis around ripple-carry structures, e.g. to rule out the false path from c[i] through the carry chain to o[i+1]...o[n]. Here with 'add' feeding MULT_AND, there is no carry-out if 'add' is false, and also, c[i] does not influence the carry-out when 'add' is true, and thus the MUXCY carry-out does not depend upon c[i].

b) Xilinx should lobby its synthesis partners to infer add/mux structures like (2)-(4) when possible. Or encourage a user-directive to force it.

If both (a) and (b) were done, then Xilinx customers (synthesis users and RPM builders alike) would probably enjoy somewhat smaller and faster results in the devices they're already using.

This add/mux inference digression aside, abundant TBUFs were useful and will be missed.
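Case (2) can be checked with a small bit-level model. The sketch below is plain Java, not JBits or a synthesis flow, and all names in it are hypothetical. It assumes the usual Virtex slice arrangement: the LUT output selects the MUXCY and feeds the XORCY, MULT_AND drives the MUXCY data input, and the carry-in of bit 0 is tied to 0.

```java
/** Bit-level model of "add or mux" in one LUT per bit (case 2). */
class AddMuxChain {
    /**
     * Computes add ? (a + b) : c over width bits (width at most 31) using, per bit:
     *   LUT      = add ? (a[i] ^ b[i]) : c[i]
     *   MULT_AND = add & a[i]               (data input of MUXCY)
     *   XORCY    = LUT ^ carryIn            (the output bit o[i])
     *   MUXCY    = LUT ? carryIn : MULT_AND (carry to the next bit)
     */
    static int addOrMux(boolean add, int a, int b, int c, int width) {
        int result = 0;
        boolean carry = false; // assumption: carry-in of bit 0 is tied to 0
        for (int i = 0; i < width; i++) {
            boolean ai = ((a >> i) & 1) != 0;
            boolean bi = ((b >> i) & 1) != 0;
            boolean ci = ((c >> i) & 1) != 0;
            boolean lut = add ? (ai ^ bi) : ci;
            boolean multAnd = add && ai;
            boolean sum = lut ^ carry;     // XORCY
            carry = lut ? carry : multAnd; // MUXCY
            if (sum) {
                result |= 1 << i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Exhaustive check over all 4-bit operands, both modes.
        for (int a = 0; a < 16; a++) {
            for (int b = 0; b < 16; b++) {
                for (int c = 0; c < 16; c++) {
                    if (addOrMux(true, a, b, c, 4) != ((a + b) & 0xF)
                            || addOrMux(false, a, b, c, 4) != c) {
                        throw new AssertionError("mismatch at " + a + "," + b + "," + c);
                    }
                }
            }
        }
        System.out.println("add/mux model agrees with add ? a+b : c");
    }
}
```

In this model the carry stays 0 whenever add is false, which is the reason given in (a) for calling the c[i]-to-carry paths false.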
But I suppose that any FPGA feature that HDL synthesis users and tools do not take good advantage of, is not long for this world.

Jan Gray, Gray Research LLC

######

From: Ray Andraka
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Mon, 19 Mar 2001 15:08:37 GMT
Organization: Andraka Consulting Group, Inc
Message-ID: <3AB62247.372419BF@andraka.com>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <995523$paj$1@slb6.atl.mindspring.net>

Jan Gray wrote:
> Reoptimizing for Virtex has been a chore. (Another setback in Virtex vs.
> 4000 was the loss of independent clock inversion on LUT RAM WCLKs and LUT FF
> CLKs, but that's another story.)

Another drawback to the Virtex is that you no longer get the carry chain for free for functions where you are only interested in the carry out. As a result, something like a saturating limiter that could be implemented in one column of CLBs in 4K now takes two columns of slices, with the LUTs in the first used as pass-throughs to the carry chain :-(

> But for V-II there seems to be no practical alternative but to a) use
> (waste) LUTs and their interconnect to build these horizontal muxes, and/or
> b) recode your design to help your technology mappers merge some of the
> muxes into other logic.

I haven't looked at it closely, but it seems to me that you might be able to use the horizontal OR chains for this. Have you investigated it?

> Regarding (b), using the Virtex-style carry logic (including MULT_AND), it
> seems possible to build these "free mux" structures:
> 1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
> 2) o[i] = add ? (a[i] + b[i]) : c[i]
> 3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
> 4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i]
> : a[i]^b[i])
>
> Synthesis tools get (1) (usually) but (as far as I know) miss the others.
>
> Consider case (2). An add followed by a mux would seem to be a pretty common
> circuit structure, and therefore important to optimize. Surely using the
> single-LUT-per-bit construction is a no-brainer, right? Not so fast! There
> are some tools issues.

Jan, you are correct. The tools do not properly infer this (as well as certain adds/counters with resets if they are anything but a dirt simple adder/increment). This, and the ability to direct placement, are some of the reasons I often use instantiated circuits within a generate instead of the more readable inferred logic.

> If you inefficiently implement this as two LUTs:
> t = a[i] + b[i];
> o[i] = add ? t : c[i]
> then trce will "see" that the latency from c[i] to o[i] is Tilo. Good.
>
> But if you implement it in one LUT as
> o[i] = add ? (a[i] + b[i]) : c[i]
> e.g.
> o[i] = add&(a[i]^b[i]) + ~add&c[i]
> along with the appropriate configuration of MULT_AND, MUXCY, and XORCY, then
> (if I recall correctly) trce will also find false ripple-carry paths from
> c[i], e.g. from c[0] to o[31], which would therefore interfere with correct
> static timing analysis and with timing driven placement and routing. Oops!

Yep. TRCE doesn't do anything in the way of analyzing the logic in the circuit. It just adds delays between FFs. If you are careful with the constraints, you can block the false path, but it usually doesn't warrant the effort or the added potential for accidentally ignoring a valid path.

-- 
-Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

######

From: "Austin Franklin"
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Mon, 19 Mar 2001 10:37:37 -0500
Organization: MindSpring Enterprises
Message-ID: <995905$661$1@slb3.atl.mindspring.net>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch>

> From this I get the impression, that Xilinx regards TBUF buses as going
> out of fashion. After all, the TBUFs cost in chip space is next to nothing
> relative to them many PIPs (about 900 per CLB in Virtex).

I believe a lot of this has to do with HDLs. Most of the people I know who use HDLs for Xilinx design don't even know what a TBUF is, or how they would use one. I also think the tools, tutorials, classes etc. support them poorly.
######

From: Neil Franklin
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: 19 Mar 2001 22:19:01 +0100
Organization: My own Private Self
Message-ID: <6u4rwpfoe2.fsf@chonsp.franklin.ch>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <995523$paj$1@slb6.atl.mindspring.net> <3AB62247.372419BF@andraka.com>

Ray Andraka writes:

> Jan Gray wrote:
> >
> > But for V-II there seems to be no practical alternative but to a) use
> > (waste) LUTs and their interconnect to build these horizontal muxes, and/or
> > b) recode your design to help your technology mappers merge some of the
> > muxes into other logic.
>
> I haven't looked at it closely, but it seems to me that you might be able to use
> the horizontal OR chains for this. Have you investigated it?

I went quickly to the data sheet. If I get it correctly, there is only one ORCY per slice (after the G MUXCY). So if you are using 4 bits per CLB row (given by carry logic) and want one horizontal mux per bit, you have only 1/2 the needed ORCYs. Dammit.

Using LUTs as (F1-in&F2-enable)|(F3-in&F4-enable) and then ORCYing them would have solved the problem. And LUTs with 3 inputs for logic and one enable and ORCY would be ideal.

Xilinx: more ORCYs in Virtex-III, please.

> > Regarding (b), using the Virtex-style carry logic (including MULT_AND), it
> > seems possible to build these "free mux" structures:
> > 1) o[i] = addsub ? (a[i] + b[i]) : (a[i] - b[i])
> > 2) o[i] = add ? (a[i] + b[i]) : c[i]
> > 3) o[i] = addb ? (a[i] + b[i]) : (a[i] + c[i])
> > 4) o[i] = addsub ? (addand ? a[i]+b[i] : a[i]-b[i]) : (addand ? a[i]&b[i]
> > : a[i]^b[i])

Sorry, I do not speak Verilog. Here is my MUXCY based layout:

    2input  4input  8input
      7       67     4567    each digit is one LUT (F1&F2)|(F3&F4) + MUXCY OR
      6       67     4567    digits number data path bit that is processed
      5       45     4567    2x2 digits is a Virtex CLB
      4       45     4567
      3       23     0123
      2       23     0123
      1       01     0123
      0       01     0123

Java/JBits code 4-enables type 4:1 Mux:

    for (int Col = MuxCol; Col < MuxCol+2; Col++) {
      for (int Row = MuxRow; Row < MuxRow+MuxBits/2; Row++) {

        /* LUT AND OR AND, 1&2|3&4, 8888|F000 = F888 */
        int Mux2i[] = Util.IntToIntArray(0xF888, 16);
        Fpga.set(Row, Col, LUT.SLICE0_F, Util.InvertIntArray(Mux2i));
        Fpga.set(Row, Col, LUT.SLICE0_G, Util.InvertIntArray(Mux2i));

        /* wide OR with LUT=0 -> !BX (=0) and LUT=1 -> 1 */
        Fpga.set(Row, Col, S0Control.XCarrySelect.XCarrySelect,
                 S0Control.XCarrySelect.LUT_CONTROL);
        Fpga.set(Row, Col, S0Control.YCarrySelect.YCarrySelect,
                 S0Control.YCarrySelect.LUT_CONTROL);
        Fpga.set(Row, Col, S0Control.AndMux.AndMux, S0Control.AndMux.ONE);
        Fpga.set(Row, Col, S0Control.Cin.Cin, S0Control.Cin.BX);
        Fpga.set(Row, Col, S0Control.BxInvert, S0Control.ON);
      }
    }

> > See http://www.fpgacpu.org/log/nov00.html#001112 for details.

I have already employed the jump/increment program counter trick with the carry logic increment controlled by MULT_AND. Actually read about it in November, forgot it, reinvented it, and now re-recognized it.

> > This add/mux inference digression aside, abundant TBUFs were useful and will
> > be missed. But I suppose that any FPGA feature that HDL synthesis users and
> > tools do not take good advantage of, is not long for this world.

So I suppose the line "Consequently, the Virtex routing architecture and its place-and-route software were defined in a single optimization process" can be translated as: we the chip designers do not support Assembler^WJBits fossils^Wprogrammers.
:-(

-- 
Neil Franklin, neil@franklin.ch.remove http://neil.franklin.ch/
Hacker, Unix Guru, El Eng HTL/FH/BSc, Sysadmin, Roleplayer, LARPer

######

From: Juri Kanevski
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Tue, 20 Mar 2001 12:37:22 +0200
Organization: NTUU "KPI"
Message-ID: <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch>

Neil Franklin wrote:
> From this I get the impression, that Xilinx regards TBUF buses as going
> out of fashion. After all, the TBUFs cost in chip space is next to nothing
> relative to them many PIPs (about 900 per CLB in Virtex).
>
> So I have a question: What is the Xilinx-suggested replacement for TBUFs?
> Is one supposed to use MUXes implemented in the CLBs? Is there an other
> trick I have not yet stumbled over?

I think that Xilinx found a compromise between functionality and cost. Because of these TBUFs and tristate wires, Xilinx's devices were, I suppose, about twice as costly as Altera's, owing to their heavy technology.

And therefore the only solution is to try to do without large multiplexers and shared buses. Doing without large shared buses in the chip is a stable trend now and for the future, because such buses do not support increases in clock frequency and they cost a lot of switching energy.

A.Ser.

######

From: Austin Lesea
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Tue, 20 Mar 2001 08:10:28 -0800
Organization: Xilinx
Message-ID: <3AB780F4.2DDEE49D@xilinx.com>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua>

All,

As the chips scale, driving a 1" long wire became slower and slower (relative to the whole picture). Recently we removed TBUFs and replaced them with muxes in a few thousand designs, and even without re-placing and rerouting, the designs were all faster. They also took up more area (the muxes). By removing TBUFs we gain that area back (TBUFs have to be huge to drive the long wires), so that we are now still more area efficient, and higher speed, than before. With PAR, the designs are all "better" than before.

As always, removing something is a tough decision, and we made sure we had a way of succeeding before we asked everyone to give them up forever (which may happen in the future).

They were great while they lasted,

Austin

TBUF: RIP 200?

######

From: Rick Collins
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Tue, 20 Mar 2001 11:46:25 -0500
Organization: Arius, Inc
Message-ID: <3AB78961.BA26E8B5@yahoo.com>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com>

Austin Lesea wrote:
> As the chips scale, driving a 1" long wire became slower and slower (relative to
> the whole picture). Recently we removed TBUF's and replaced them with mux's in
> a few thousands of designs, and without replacing and rerouting, the designs
> were all faster. They also took up more area (the mux's).

When you say that the muxes were added, are you saying that there are "special" muxes available? Or are you referring to the LUTs in the CLBs?

I guess I need to read the VII data sheet in detail and learn the architecture. It seems to be the chip to beat these days.

-- 
Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design      URL http://www.arius.com
4 King Ave                               301-682-7772 Voice
Frederick, MD 21701-3110                 301-682-7666 FAX

######

From: Kolja Sulimma
Reply-To: kolja@bnl.gov
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Tue, 20 Mar 2001 18:54:39 +0100
Message-ID: <3AB7995F.11EFA624@bnl.gov>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB78961.BA26E8B5@yahoo.com>

As I understand it, Austin is stating that the area that was saved by removing the TBUFs is now used for extra CLBs (or for reducing the ASIC cost per CLB).

With 90% of the FPGA area being in the interconnect (Andre DeHon's estimate), I doubt that you gain much of a CLB (including its interconnect) from removing a TBUF.

Someone stated in this group that the TBUFs are not really TBUFs anymore internally but are realized in some other patented manner. Maybe that's the reason. (The TBUFs' worst-case timing got MUCH faster with Virtex when compared to the carry chain, for example.)

But of course, the longlines get faster when fewer TBUFs are connected to them.

But remember: marketing arguments are often more important than technical aspects. Because of the huge amount of area dedicated to routing in an FPGA, it is better to waste CLBs than to waste routing, and therefore poorly routable FPGAs are much more area efficient than highly routable ones. But they got bad press for bad routability, and therefore routability was greatly improved. The new trade-off may be more expensive to manufacture, but it's easier to use and fewer people complain.

But wait: everybody feels happier if he has a reason to complain, so they discounted the equivalent gate count metric, and now we have a reason to complain again.

CU,
Kolja

######

From: Austin Lesea
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Tue, 20 Mar 2001 15:32:03 -0800
Organization: Xilinx
Message-ID: <3AB7E873.31C30F7E@xilinx.com>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB78961.BA26E8B5@yahoo.com>

All Mux's are special :)

The replacement was with CLBs (LUTs and special CLB logic), which may implement the F5, F6, F7 or even F8 in Virtex-II, so they can be very efficient.

The fact that you could retarget the synthesis and use even more powerful structures (e.g. horizontal cascade carry) makes things even faster with less area.

The TBUFs occupy a large area in interconnect (metal wires) and buffers, and mux trees to get in and out. They are these strange warts that hang off the CLB and run really slowly now that everything has scaled so well into the ultra deep sub-micron world.

Austin

Rick Collins wrote:
> When you say that the muxes were added, are you saying that there are
> "special" muxes available? Or are you referring to the LUTs in the CLBs?

######

From: Ray Andraka
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Wed, 21 Mar 2001 02:23:19 GMT
Organization: Andraka Consulting Group, Inc
Message-ID: <3AB811ED.3883949B@andraka.com>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB78961.BA26E8B5@yahoo.com> <3AB7E873.31C30F7E@xilinx.com>

The problem with the CLB muxes is that they don't match the pitch of the carry chain, so if I have a set of counters, for instance, that I want to address to read onto a common bus, it is a royal pain in the patoot without the TBUFs. It's a good thing there is more routing in these chips!

For example, in Virtex, if I want to use the F6 mux, which would be the one I need to mux between two slices, I need to use the F5 muxes to feed the F6. Since the F5 selects between the F and G LUTs, this structure is broken for connecting to carry chain logic.
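
The pitch mismatch can be illustrated with a toy model in plain Java. This is a sketch, not a description of the silicon: it assumes that a carry-chain slice keeps bit 2k of a counter in its F LUT and bit 2k+1 in its G LUT, and that both F5 muxes share one select. All names are hypothetical.

```java
/** Toy model of the F5/F6 mux tree sitting on two carry-chain slices. */
class MuxPitch {
    /** F5: picks between the F and G LUT outputs of ONE slice. */
    static boolean muxF5(boolean f, boolean g, boolean sel) {
        return sel ? g : f;
    }

    /** F6: picks between the F5 outputs of two slices. */
    static boolean muxF6(boolean f5a, boolean f5b, boolean sel) {
        return sel ? f5b : f5a;
    }

    /** Assumed mapping: F holds bit 2k and G holds bit 2k+1 of the same counter. */
    static boolean sliceF5(int counter, int k, boolean pickG) {
        boolean f = ((counter >> (2 * k)) & 1) != 0;
        boolean g = ((counter >> (2 * k + 1)) & 1) != 0;
        return muxF5(f, g, pickG);
    }

    /** The whole tree yields ONE output bit for the TWO bit positions 2k and 2k+1. */
    static boolean f6Tree(int counterA, int counterB, int k, boolean pickG, boolean pickB) {
        boolean f5a = sliceF5(counterA, k, pickG);
        boolean f5b = sliceF5(counterB, k, pickG);
        return muxF6(f5a, f5b, pickB);
    }

    public static void main(String[] args) {
        int counterA = 0b10; // bit 1 set
        int counterB = 0b01; // bit 0 set
        for (boolean pickB : new boolean[] {false, true}) {
            for (boolean pickG : new boolean[] {false, true}) {
                System.out.println("counter " + (pickB ? "B" : "A")
                        + " bit " + (pickG ? 1 : 0)
                        + " -> " + f6Tree(counterA, counterB, 0, pickG, pickB));
            }
        }
    }
}
```

Under these assumptions the F5 select chooses between neighbouring bits of one counter, not between counters, so two slices carrying two result bits get only one muxed output, where a TBUF per bit would have delivered both.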
Austin Lesea wrote: > > All Mux's are special :) > > The replacement was with CLB's (LUT's and special clb logic), which may implement the > F5, F6, F7 or even F8 in Virtex II, so they can be very efficient. > > The fast that you could retarget the synthesis, and use even more powerful structures > (e.g. horizontal cascade carry) makes things even faster with less area. > > The TBUF's occupy a large area in interconnect (metal wires) and buffers, and mux > trees to get in and out. There are these strange warts that hang off the CLB and run > really slowly now that everything has scaled so well into the ultra deep sub-micron > world. > > Austin > > Rick Collins wrote: > > > Austin Lesea wrote: > > > > > > All, > > > > > > As the chips scale, driving a 1" long wire became slower and slower (relative to > > > the whole picture). Recently we removed TBUF's and replaced them with mux's in > > > a few thousands of designs, and without replacing and rerouting, the designs > > > were all faster. They also took up more area (the mux's). By removing TBUF's > > > we gain that area back (TBUF's have to be huge to drive the long wires), so that > > > now are still more area efficient, and higher speed than before. With PAR, the > > > designs are all "better" than before. > > > > > > As always, removing something is always a tough decision, and we made sure we > > > had a way of succeeding before we asked everyone to give them up forever (which > > > may happen in the future). > > > > > > They were great while they lasted, > > > > > > Austin > > > > > > TBUF: RIP 200? > > > > When you say that the muxes were added, are you saying that there are > > "special" muxes available? Or are you referring to the LUTs in the CLBs? > > > > I guess I need to read the VII data sheet in detail and learn the > > architecture. It seems to be the chip to beat these days. > > > > -- > > > > Rick "rickman" Collins > > > > rick.collins@XYarius.com > > Ignore the reply address. 
To email me use the above address with the XY > > removed. > > > > Arius - A Signal Processing Solutions Company > > Specializing in DSP and FPGA design URL http://www.arius.com > > 4 King Ave 301-682-7772 Voice > > Frederick, MD 21701-3110 301-682-7666 FAX -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com ###### From: "Jan Gray" Newsgroups: comp.arch.fpga Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead Date: Tue, 20 Mar 2001 22:40:20 -0800 Organization: Gray Research LLC Lines: 47 Message-ID: <999iij$qf7$1@slb7.atl.mindspring.net> References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB78961.BA26E8B5@yahoo.com> <3AB7E873.31C30F7E@xilinx.com> <3AB811ED.3883949B@andraka.com> NNTP-Posting-Host: 04.21.a0.5e X-Server-Date: 21 Mar 2001 06:43:31 GMT X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 5.50.4133.2400 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400 Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!news.mindspring.net!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:5278 "Ray Andraka" wrote in message news:3AB811ED.3883949B@andraka.com... > The problem with the CLB muxes is that they don't match the pitch of the carry > chain, so If I have a set of counters for instance that I want to address to > read onto a common bus, it is a royal pain in the patoot without the Tbufs. > It's a good thing there is more routing in these chips! Right. The various MUXFx's are of little or no use in my adder-heavy datapaths (e.g. a <96-slice, >80 TBUF, 16-bit RISC CPU for Virtex). 
For such designs, which extensively use TBUFs for wide/bus-oriented muxes, it is hard for me to believe that the area, delay, and power of a *dedicated* *full-custom* TBUF are inferior to the area, delay, and power of the same topology using inter-CLB interconnect, plus CLB-local interconnect, plus 4-LUTs.

Instead I like to believe that the *weighted proportion* of CLBs that use TBUFs in the mix of test designs (that I assume Xilinx uses to validate and optimize its next generations of tools and silicon) is low, so that across the spectrum of designs, TBUF resources are almost always wasted silicon. This is especially plausible if most such test designs use HDL synthesis, which does not infer many TBUFs. If so, then a good architecture design system might indeed determine that the optimal number of TBUFs per CLB is ... epsilon (arbitrarily close to zero).

Anyway, V-II is here, so how shall we make the most of it? As I wrote last time, one important family of circuits (particularly add followed by mux) can be done properly in just one LUT per bit (in Virtex and Virtex-II), if Xilinx enhances TRCE (and timing-driven PAR) to address the stated false path situation. THIS IS IMPORTANT, because if TRCE is NOT so enhanced, this important circuit trick is almost unusable, because TRCE (and presumably timing-driven PAR) think each such mux delay is a whole adder delay.

Austin or Peter, what do you think?

[By the way, thanks: it is great to hear the Xilinx perspective. Imagine the deafening silence otherwise.
See also the cluetrain manifesto (cluetrain.com) thesis #84 :-)] Jan Gray, Gray Research LLC ###### From: Rick Collins Newsgroups: comp.arch.fpga Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead Date: Wed, 21 Mar 2001 03:10:29 -0500 Organization: Arius, Inc Lines: 21 Message-ID: <3AB861F5.EF0B59E8@yahoo.com> References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB78961.BA26E8B5@yahoo.com> <3AB7995F.11EFA624@bnl.gov> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Trace: UmFuZG9tSVZBhfIhDhlgekgbTbz7TP3y4JNLV6ZQUujB53xrBUHWYSUnj70BcEd/ X-Complaints-To: abuse@rcn.com NNTP-Posting-Date: 21 Mar 2001 08:11:01 GMT X-Mailer: Mozilla 4.73 [en] (Win98; U) X-Accept-Language: en Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!feed2.news.rcn.net!feed1.news.rcn.net!rcn!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:5238 Kolja Sulimma wrote: > But wait: Everybody feels happier if he has a reason to complain so they discounted > the equivalent gate count metric, and now we have a reason to complain again. You may be on to something! I know that I hate having to back calculate the gate counts to figure out how to compare today's chips to yesterday's chip (that I am designing in tomorrow!). -- Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. 
Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

######
Path: chonsp.franklin.ch!not-for-mail
From: Neil Franklin
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: 21 Mar 2001 19:43:36 +0100
Organization: My own Private Self
Lines: 61
Message-ID: <6uhf0nklnr.fsf@chonsp.franklin.ch>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com>
NNTP-Posting-Host: chonsp.franklin.ch
X-Trace: chonsp.franklin.ch 985200218 700 10.0.3.2 (21 Mar 2001 18:43:38 GMT)
X-Complaints-To: news@chonsp.franklin.ch
NNTP-Posting-Date: 21 Mar 2001 18:43:38 GMT
X-Newsreader: Gnus v5.7/Emacs 20.4
Xref: chonsp.franklin.ch comp.arch.fpga:5302

Austin Lesea writes:

> Juri Kanevski wrote:
>
> > Neil Franklin wrote:
> > >
> > > In Virtex and Spartan-II it is down to 2 TBUFS for the 4 LUT+FF of an
> > > CLB, so only one slice can be routed to them, and even only 1 line for
> > > reading back from TBUF lines.
> > >
> > > In Virtex-II there is even only 2 TBUFs for 8 LUT+FF per CLB. Which
> > >
> > > So I have a question: What is the Xilinx-suggested replacement for TBUFs?
> > > Is one supposed to use MUXes implemented in the CLBs? Is there an other
> > > trick I have not yet stumbled over?
> >
> > And therefore the only solution is
> > to try to do without large multiplexors and shared busses.

Unless I have got many outputs (different logic units) that need to be selected to go into one input (register file). Then I need a Mux. Can't avoid it.

> > The way to do without large shared busses in the chip
> > is the stable tendency now and in the future

That is sensible anyway, simply from a contention/parallelism viewpoint. But that can be had by partitioning the TBUF lines (on Virtex and later chips, not on 3000/4000/5200).

[rearranged to here]

> As the chips scale, driving a 1" long wire became slower and slower (relative to
> the whole picture).

Assuming a chip-wide bus. With a bus just 5 to 10 CLBs wide (in XC2S200) the wires would be quite a bit less than 1" (1/8 to 1/4").

> Recently we removed TBUF's and replaced them with mux's in
> a few thousands of designs, and without replacing and rerouting, the designs
> were all faster.

This surprises me, given more "wire" (to the Mux and away). OTOH if the Mux is in between the target and its nearest source, then the length may not be much more. And TBUF section PIPs may not be much faster than GRM PIPs.

Were these comparisons on Virtex-II or also on plain Virtex and Spartan-II (which is what I am using)?

> TBUF: RIP 200?

That was what I was suspecting. Thanks for getting it from the horse's mouth.

--
Neil Franklin, neil@franklin.ch.remove http://neil.franklin.ch/
Hacker, Unix Guru, El Eng HTL/FH/BSc, Sysadmin, Roleplayer, LARPer

######
From: hmurray-nospam@megapathdsl.net (Hal Murray)
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Thu, 22 Mar 2001 07:33:00 -0000
Message-ID:
X-Newsreader: xrn 9.02
Sender: murray@hgm (Hal Murray)
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com>
X-Complaints-To: newsabuse@supernews.com
Lines: 39
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!freenix!sn-xit-01!sn-post-01!supernews.com!corp.supernews.com!hgm!hmurray-nospam
Xref: chonsp.franklin.ch comp.arch.fpga:5337

>As the chips scale, driving a 1" long wire became slower and slower (relative to
>the whole picture). Recently we removed TBUF's and replaced them with mux's in
>a few thousands of designs, and without replacing and rerouting, the designs
>were all faster. They also took up more area (the mux's). By removing TBUF's
>we gain that area back (TBUF's have to be huge to drive the long wires), so that
>now are still more area efficient, and higher speed than before. With PAR, the
>designs are all "better" than before.

I'm missing something. Why are muxes better/faster?

I see why driving a 1" long wire is tough, but I don't see why the driver after a mux is different from a TBUF.

In one case you have to turn the driver on. In the other case you have to get half way across the chip, then through a mux. I'm assuming that getting the go signal to the TBUF is about as hard as getting the select signals to a mux.

----

In either case, a critical parameter is how well the TBUF/mux density lines up with a counter. Assume I have a counter that uses the "obvious" fast carry logic. I need to get that on the bus and to load other counters and registers from that bus.

How many bits per CLB does that good counter use?

If I'm using TBUFs, I need that many TBUFs per CLB.

If I'm using muxes, I need enough routing to get all the registers that drive the bus into the mux. I haven't tried to build anything like that. It might be simpler if the mux is distributed or some trick like that.

--
These are my opinions, not necessarily my employer's. I hate spam.

######
Message-ID: <3AB9FD8D.5E7F0E4C@andraka.com>
From: Ray Andraka
Organization: Andraka Consulting Group, Inc
X-Mailer: Mozilla 4.73 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Lines: 49
Date: Thu, 22 Mar 2001 13:20:51 GMT
NNTP-Posting-Host: 24.13.238.93
X-Complaints-To: abuse@home.net
X-Trace: news1.wwck1.ri.home.com 985267251 24.13.238.93 (Thu, 22 Mar 2001 05:20:51 PST)
NNTP-Posting-Date: Thu, 22 Mar 2001 05:20:51 PST
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!newshub2.home.com!news.home.com!news1.wwck1.ri.home.com.POSTED!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:5326

If Xilinx extended the horizontal OR chains in V2 so that there were at least 4 per CLB, we'd be all set.

Hal Murray wrote:
>
> >As the chips scale, driving a 1" long wire became slower and slower (relative to
> >the whole picture). Recently we removed TBUF's and replaced them with mux's in
> >a few thousands of designs, and without replacing and rerouting, the designs
> >were all faster. They also took up more area (the mux's). By removing TBUF's
> >we gain that area back (TBUF's have to be huge to drive the long wires), so that
> >now are still more area efficient, and higher speed than before. With PAR, the
> >designs are all "better" than before.
>
> I'm missing something. Why are muxes better/faster?
>
> I see why driving a 1" long wire is tough, but I don't see why
> the driver after a mux is different from a TBUF.
>
> In one case you have to turn the driver on. In the other case
> you have to get half way across the chip, then through a mux.
> I'm assuming that getting the go signal to the tbuf is about > as hard as getting the select signals to a mux. > > ---- > > In either case, a critical parameter is how well the TBUF/mux > density lines up with a counter. Assume I have a counter > that uses the "obvious" fast carry logic. I need to get > that on the bus and to load other counters and registers > from that bus. > > How many bits per CLB does that good counter use? > > If I'm using TBUFs, I need that many TBUFs per CLB. > > If I'm using muxes, I need enough routing to get all the > registers that drive the bus into the mux. I haven't tried > to build anything like that. It might be simpler if the mux > is distributed or some trick like that. > > -- > These are my opinions, not necessarily my employeers. I hate spam. -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com ###### From: Austin Lesea Newsgroups: comp.arch.fpga Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead Date: Thu, 22 Mar 2001 08:22:14 -0800 Organization: Xilinx Lines: 62 Message-ID: <3ABA26B6.520A1AC7@xilinx.com> References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> NNTP-Posting-Host: 149.199.7.181 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: Mozilla 4.76 [en] (WinNT; U) X-Accept-Language: en,pdf Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news.stealth.net!24.30.200.2.MISMATCH!news-east.rr.com!news.rr.com!wn2feed!worldnet.att.net!4.24.21.153!chcgil2-snh1.gtei.net!cambridge1-snf1.gtei.net!news.gtei.net!bos-service1.ext.raytheon.com!attla1!ip.att.net!newsgate.xilinx.com!cliff.xsj.xilinx.com!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:5321 Hal, The mux's don't have to drive across the whole chip. 
Maybe they need to drive to one of the CLB's that may be reached by the directs, doubles (or hexes), so the amount of logic that can be reached quickly is large.

This discussion reminds me of the days when people still argued that assembly language programming was "better" than any C compiler ever could be.

Just like what happened to compiled languages, we are already there with compiled HDL's: if it isn't fast enough after the best the tools can do, you go in and "code" by hand the pieces that were not implemented with the best possible efficiency (if there are any). Or, you realize that you didn't take advantage of the parallelism, or some feature of the part, and you recode your HDL.

With > 95% of all FPGA designs being done in HDL's, we think we are spending time and silicon on what makes that flow the fastest and best possible.

Austin

Hal Murray wrote:

> >As the chips scale, driving a 1" long wire became slower and slower (relative to
> >the whole picture). Recently we removed TBUF's and replaced them with mux's in
> >a few thousands of designs, and without replacing and rerouting, the designs
> >were all faster. They also took up more area (the mux's). By removing TBUF's
> >we gain that area back (TBUF's have to be huge to drive the long wires), so that
> >now are still more area efficient, and higher speed than before. With PAR, the
> >designs are all "better" than before.
>
> I'm missing something. Why are muxes better/faster?
>
> I see why driving a 1" long wire is tough, but I don't see why
> the driver after a mux is different from a TBUF.
>
> In one case you have to turn the driver on. In the other case
> you have to get half way across the chip, then through a mux.
> I'm assuming that getting the go signal to the tbuf is about
> as hard as getting the select signals to a mux.
>
> ----
>
> In either case, a critical parameter is how well the TBUF/mux
> density lines up with a counter.
Assume I have a counter > that uses the "obvious" fast carry logic. I need to get > that on the bus and to load other counters and registers > from that bus. > > How many bits per CLB does that good counter use? > > If I'm using TBUFs, I need that many TBUFs per CLB. > > If I'm using muxes, I need enough routing to get all the > registers that drive the bus into the mux. I haven't tried > to build anything like that. It might be simpler if the mux > is distributed or some trick like that. > > -- > These are my opinions, not necessarily my employeers. I hate spam. ###### From: Austin Lesea Newsgroups: comp.arch.fpga Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead Date: Thu, 22 Mar 2001 08:24:41 -0800 Organization: Xilinx Lines: 60 Message-ID: <3ABA2749.E35F51C1@xilinx.com> References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3AB9FD8D.5E7F0E4C@andraka.com> NNTP-Posting-Host: 149.199.7.181 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Mailer: Mozilla 4.76 [en] (WinNT; U) X-Accept-Language: en,pdf Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!news.tele.dk!144.212.100.101!newsfeed.mathworks.com!cyclone.swbell.net!bos-service1.ext.raytheon.com!attla1!ip.att.net!newsgate.xilinx.com!cliff.xsj.xilinx.com!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:5320 Ray, Horizontal Cascade Carry. Virtex II has so many new features, some tend to get lost. Austin Austin Ray Andraka wrote: > If xilinx extended the horizontal or chains in V2 so that there was at least 4 > per CLB, we'd be all set. > > Hal Murray wrote: > > > > >As the chips scale, driving a 1" long wire became slower and slower (relative to > > >the whole picture). 
Recently we removed TBUF's and replaced them with mux's in > > >a few thousands of designs, and without replacing and rerouting, the designs > > >were all faster. They also took up more area (the mux's). By removing TBUF's > > >we gain that area back (TBUF's have to be huge to drive the long wires), so that > > >now are still more area efficient, and higher speed than before. With PAR, the > > >designs are all "better" than before. > > > > I'm missing something. Why are muxes better/faster? > > > > I see why driving a 1" long wire is tough, but I don't see why > > the driver after a mux is different from a TBUF. > > > > In one case you have to turn the driver on. In the other case > > you have to get half way across the chip, then through a mux. > > I'm assuming that getting the go signal to the tbuf is about > > as hard as getting the select signals to a mux. > > > > ---- > > > > In either case, a critical parameter is how well the TBUF/mux > > density lines up with a counter. Assume I have a counter > > that uses the "obvious" fast carry logic. I need to get > > that on the bus and to load other counters and registers > > from that bus. > > > > How many bits per CLB does that good counter use? > > > > If I'm using TBUFs, I need that many TBUFs per CLB. > > > > If I'm using muxes, I need enough routing to get all the > > registers that drive the bus into the mux. I haven't tried > > to build anything like that. It might be simpler if the mux > > is distributed or some trick like that. > > > > -- > > These are my opinions, not necessarily my employeers. I hate spam. > > -- > -Ray Andraka, P.E. > President, the Andraka Consulting Group, Inc. 
> 401/884-7930 Fax 401/884-7950
> email ray@andraka.com
> http://www.andraka.com

######
From: "Simon Bacon"
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Thu, 22 Mar 2001 22:19:05 -0000
Message-ID: <985308839.15549.0.nnrp-09.9e9832fa@news.demon.co.uk>
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3ABA26B6.520A1AC7@xilinx.com>
NNTP-Posting-Host: tile.demon.co.uk
X-NNTP-Posting-Host: tile.demon.co.uk:158.152.50.250
X-Trace: news.demon.co.uk 985308839 nnrp-09:15549 NO-IDENT tile.demon.co.uk:158.152.50.250
X-Complaints-To: abuse@demon.net
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.00.2919.6600
x-mimeole: Produced By Microsoft MimeOLE V5.00.2919.6600
Lines: 24
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!news.maxwell.syr.edu!dispose.news.demon.net!news.demon.co.uk!demon!tile.demon.co.uk!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:5380

"Austin Lesea" wrote in message news:3ABA26B6.520A1AC7@xilinx.com...
>
> With > 95% of all FPGA designs being done in HDL's, we think we are spending time and
> silicon on what makes that flow the fastest and best possible.

Well, picking an HDL:

    DataOut <= Reg0Val when Addr=REG0 else (others=>'Z');
    DataOut <= Reg1Val when Addr=REG1 else (others=>'Z');
    DataOut <= Reg2Val when Addr=REG2 else (others=>'Z');

and the Verilog metaphor is similar. This is HDL, easy to understand, and a good mapping to TBUFs.

Of course, ASICs do not usually support T/S lines. Even the Xilinx implementation is an emulation in Virtex, though the 3K and 4K had the real thing.

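For comparison, the same read bus can be written with a single driver, which is the form a synthesizer would map onto LUTs and CLB muxes instead of TBUFs. This is only a minimal sketch: the entity name, the 2-bit Addr, the 16-bit data width, and the REG0..REG2 encodings are all assumptions, since the post defines none of them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity; name, widths and address encodings are assumptions.
entity reg_read_mux is
  port (
    Addr    : in  std_logic_vector(1 downto 0);
    Reg0Val : in  std_logic_vector(15 downto 0);
    Reg1Val : in  std_logic_vector(15 downto 0);
    Reg2Val : in  std_logic_vector(15 downto 0);
    DataOut : out std_logic_vector(15 downto 0)
  );
end entity reg_read_mux;

architecture rtl of reg_read_mux is
  -- Assumed encodings; the original snippet does not define REG0..REG2.
  constant REG0 : std_logic_vector(1 downto 0) := "00";
  constant REG1 : std_logic_vector(1 downto 0) := "01";
  constant REG2 : std_logic_vector(1 downto 0) := "10";
begin
  -- One driver for DataOut: a mux replaces the three tri-state drivers.
  DataOut <= Reg0Val when Addr = REG0 else
             Reg1Val when Addr = REG1 else
             Reg2Val when Addr = REG2 else
             (others => '0');  -- no source selected: drive zeros, not 'Z'
end architecture rtl;
```

The visible difference from the tri-state version is the unselected case: here the bus reads as zeros instead of floating. All sources must also be routed to wherever the mux is placed, which is the routing cost discussed elsewhere in the thread.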
###### From: muzaffer@dspia.com Newsgroups: comp.arch.fpga Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead Date: 23 Mar 2001 01:13:19 GMT Organization: Concentric Internet Services Lines: 39 Message-ID: <93ckbts3sb10kffa472geio8uq5igt9vnc@4ax.com> References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3ABA26B6.520A1AC7@xilinx.com> <985308839.15549.0.nnrp-09.9e9832fa@news.demon.co.uk> NNTP-Posting-Host: 209.31.228.232 Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Newsreader: Forte Agent 1.8/32.548 Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!dc1.nntp.concentric.net!sjc1.nntp.concentric.net!newsfeed.concentric.net!global-news-master Xref: chonsp.franklin.ch comp.arch.fpga:5379 "Simon Bacon" wrote: > >"Austin Lesea" wrote in message >news:3ABA26B6.520A1AC7@xilinx.com... >> >> With > 95% of all FPGA designs being done in HDL's, we think we are spending >time and >> silicon on what makes that flow the fastest and best possible. > >Well, picking an HDL: > > DataOut <= Reg0Val when Addr=REG0 else (others=>'z'); > DataOut <= Reg1Val when Addr=REG1 else (others=>'z'); > DataOut <= Reg2Val when Addr=REG2 else (others=>'z'); > >and the Verilog metaphor is similar. This is HDL, easy to >understand, and a good mapping to TBUFs. > >Of course, ASICs do not usually support T/S lines. By ASIC I take it you mean Standard cell based ASIC; if that's the case, this has not been my experience. Almost all SC libraries have T/S buffers and inverters. Some even have T/S DFFs and T/S bus holders. Of course with a full custom ASIC, there is no limit on what you can do. > Even the >Xilinx implementation is an emulation in Virtex, though the >3K and 4K had the real thing. 
I think the problems with T/S busses are well known: difficult to time, potential for electrical conflicts, potential for very high power consumption etc. Being more of an analog feature, one has to spend extra effort with RTL verification. You might even have to spice some paths god forbid ;-)

Muzaffer

######
From: hmurray-nospam@megapathdsl.net (Hal Murray)
Newsgroups: comp.arch.fpga
Subject: Re: TBUFs in Virtex and later chips, going out of fashion, what instead
Date: Fri, 23 Mar 2001 05:39:11 -0000
Message-ID:
X-Newsreader: xrn 9.02
Sender: murray@hgm (Hal Murray)
References: <6ulmq2y92r.fsf@chonsp.franklin.ch> <3AB732E2.13AE7950@comsys.ntu-kpi.kiev.ua> <3AB780F4.2DDEE49D@xilinx.com> <3ABA26B6.520A1AC7@xilinx.com>
X-Complaints-To: newsabuse@supernews.com
Lines: 66
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!grolier!sn-xit-01!sn-post-01!supernews.com!corp.supernews.com!hgm!hmurray-nospam
Xref: chonsp.franklin.ch comp.arch.fpga:5373

>The mux's don't have to drive across the whole chip. Maybe they need to drive
> to one of the CLB's that may be reached by the directs, doubles (or hexes),
> making the amount of logic that can be reached quickly, large.

I'm still not getting it.

My picture is that I have a register on the left side of the chip and another register on the right side. Sprinkled in the middle are various registers and counters. I want to be able to load the register on the left from any of the others, including the one on the right. Same in the other direction.

This is easy to understand with TBUFs. I might not like the answers, but it's easy to floor plan things so all the routing is as good as it will get. The pattern is clock, enable, TBUF drives longline, setup, clock.

This fits well with a style that thinks of the problem as a specialized CPU-like gizmo that gets driven by a microcoded engine. Most of the changes go into the microcode. The instruction format includes a field to specify which register drives the bus and another field for which register gets loaded. (Occasional hacks are handy to load several registers on the same cycle.)

Are you using a mux in front of each register you want to load, or a shared mux feeding all the registers?

If I assume a mux in the middle, then the nasty case is clock, routing-to-mux, delay through mux, delay through long-line-like routing to get all over the chip, setup, clock.

If I assume a mux in front of each register, the pattern is the same but some things have been pushed around. I'd guess the timing will be similar.

Perhaps your test cases go faster because you didn't put registers that far apart so you don't have to route across the whole chip. If so, then the problem isn't TBUFs, it's the long line across the whole chip that they are driving. I could easily see that FPGAs are now big enough that having a bus span the whole chip isn't useful very often. Did you try your test cases on a smaller chip?

Perhaps TBUFs feeding not-so-long lines that only go part way across the chip would be interesting (aka faster). I'm assuming there would be a pair of good TBUFs at the segment junction that I could use to connect them and/or put a register there to pipeline things.

My straw man (not much thought) is that the right length would be to match the register transfer timing to what it takes to get through a (pipelined) RAM or multiply stage.

Or maybe short segments collapse to something mux-like if you have lots of local/medium routing resources.

--
These are my opinions, not necessarily my employer's. I hate spam.
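
The "shared mux feeding all the registers" variant of the register-transfer style described above might look roughly like the following. It is a sketch under stated assumptions, not anyone's actual design: the entity name, the three 16-bit registers, the external input on source code "11", and the 2-bit source and destination fields are all hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical register-transfer block. A microcode source field selects
-- which register drives the shared "bus" (a mux here, not a TBUF line),
-- and a destination field selects which register loads from it.
entity reg_xfer is
  port (
    clk     : in  std_logic;
    src_sel : in  std_logic_vector(1 downto 0);   -- assumed source field
    dst_sel : in  std_logic_vector(1 downto 0);   -- assumed destination field
    load    : in  std_logic;                      -- enables the transfer
    ext_in  : in  std_logic_vector(15 downto 0);  -- external data, source "11"
    bus_out : out std_logic_vector(15 downto 0)
  );
end entity reg_xfer;

architecture rtl of reg_xfer is
  signal r0, r1, r2 : std_logic_vector(15 downto 0) := (others => '0');
  signal bus_i      : std_logic_vector(15 downto 0);
begin
  -- Shared mux standing in for the long line driven by TBUFs.
  with src_sel select
    bus_i <= r0     when "00",
             r1     when "01",
             r2     when "10",
             ext_in when others;

  bus_out <= bus_i;

  -- The addressed register loads the mux output on the clock edge.
  process (clk)
  begin
    if rising_edge(clk) then
      if load = '1' then
        case dst_sel is
          when "00"   => r0 <= bus_i;
          when "01"   => r1 <= bus_i;
          when "10"   => r2 <= bus_i;
          when others => null;  -- destination "11" is unused in this sketch
        end case;
      end if;
    end if;
  end process;
end architecture rtl;
```

In this form the timing pattern is the one listed above for a mux in the middle: clock, routing to the mux, mux delay, routing from the mux to every destination, setup, clock. Putting a copy of the mux in front of each register instead moves the fan-out from the mux output to the source registers.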