Message-ID: <3DB46645.63543E05@mail.com>
From: John_H
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
Date: Mon, 21 Oct 2002 20:40:36 GMT
Organization: Xerox Officeprinting NewsReader Service

Addressing item 1 only: synthesizers will better understand how to efficiently
implement a single arithmetic chain than work backwards from a mux of
two arithmetic values. Try

    assign {carry_out,sum,x} = {a,carry_in} + ( fn ? -{ b,1'b1 } : { b,1'b1 });

though the synth might still be confused by the 2's complement. To go "whole
hog" to the minimum solution, you may have to explicitly perform the negation:

    assign {carry_out,sum,x} = {a,carry_in} + ( fn ? {~b,1'b1 } : { b,1'b1 });

( -n == ~n + 1 . . . -{n,1} == ~{n,1} + 1 == {~n,0} + 1 == {~n,1} )

It's all much smaller now!
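The identity in parentheses can be checked exhaustively for a small operand width. The following is only a Python model of the bit arithmetic, not synthesizable code; the width `W` and the helper names `concat_one`, `neg_concat` and `not_concat` are assumptions introduced for this check. It compares -{n,1} with {~n,1}, both taken as (W+1)-bit values.

```python
W = 8                          # operand width in bits (arbitrary choice for the check)
MASK_W = (1 << W) - 1          # mask for a W-bit value
MASK_W1 = (1 << (W + 1)) - 1   # mask for a (W+1)-bit value

def concat_one(n):
    """Model of Verilog {n, 1'b1}: n shifted left with a 1 in the LSB."""
    return ((n & MASK_W) << 1) | 1

def neg_concat(n):
    """Model of -{n, 1'b1}, truncated to W+1 bits."""
    return (-concat_one(n)) & MASK_W1

def not_concat(n):
    """Model of {~n, 1'b1}: complement n within W bits, then append a 1."""
    return concat_one(~n & MASK_W)

# Collect every n for which the two forms disagree.
mismatches = [n for n in range(1 << W) if neg_concat(n) != not_concat(n)]
print("mismatches:", mismatches)   # prints "mismatches: []"
```

The agreement holds modulo 2^(W+1) only. If the expression is evaluated in a wider context, the negated form carries extra high-order 1 bits that the {~n,1} form does not, so bits above position W can differ.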
######
From: "Jan Gray"
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
Date: Mon, 21 Oct 2002 16:10:59 -0700
Organization: Gray Research LLC

Ralph Mason wrote
> 1. As a learning process I have copped his adder, he says that it turns
> into 17 LUTs although mine ends up at 52 - Are there some optimizations I am
> missing here? Are there any pragmas you can use in these situations to say
> what you want? Can you get right down to the actual LUT level and connect
> them up yourself? Using the free Xilinx tools can you actually see the way
> it has connected the actual LUTs?

First, congratulations on going to the trouble of actually looking under the
hood and *inspecting* what came out of your tools. That is a very good
practice.

Welcome to "pushing on a rope". (fpgacpu.org/usenet/rope_pushing.html: "You
know exactly what you want -- a particular optimal, hand-mapped, hand-placed
layout for your datapath -- but the tools get in the way, and you spend
hours trying to discover an incantation that persuades the tools to emit the
desired result.")

Here the goal was to generate a+b or a-b with cin and cout at a cost of one
LUT per bit. The hardware is willing, but you will need to perform a bit of
experimentation to discover an expression that the synthesis tool will
repeatably synthesize into the desired hardware idiom.

The construction you 'copped' did work for one version of some synthesis
tool once. Of course that does not mean it works reliably across all tools
and all versions of all tools!

I like John_H's suggestion -- it may succeed more often than the mux(+,-)
that I used.

BTW IIRC in this specific instance it was also necessary to move the single
assignment into its own module -- the synthesis tool created the inefficient
mux(+, -) form if the assignment was inline in the parent module.

> 2. Floorplanning - Is there any info on this? It seems that you don't need
> to do it because the tools do it automatically. I am guessing sometimes you
> can do it better yourself to give a faster device? It looks like a
> laborious process - is it?

It need not be laborious, but you have to learn the ropes. I think we old
timers are fortunate because we grew up doing hierarchical schematics, and
so writing the same hierarchical structural HDL is familiar and natural.

It is long past time I investigated how to do this with XST Verilog and
wrote it up for my site...

> 3. General - The CPU I want to design is a RISC / CISC design where a pre
> processor reads CISC instructions from RAM and places RISC instructions (or
> jumps to RISC subs) to do the same thing in an internal pipeline for a RISC
> internal CPU (or perhaps many instances of the same RISC CPU and a single
> pre processor to provide a kind of hardware multithreading) - Are there any
> notes on existing designs like this that one could read?

Interesting. If the "preprocessor" is hardware, this was done in e.g. Intel
P6, in the AMD K6, in some VAX implementations, and so forth. If the
"preprocessor" is software, this was done in e.g. SoftWindows, Digital's
FX!32 and Transmeta. There's lots in the library (see e.g. IEEE Micro) and
in Google! (Also time to update my resources/book section of my site.)
Jan Gray, Gray Research LLC

######
Message-ID: <3DB49703.D82087E7@andraka.com>
From: Ray Andraka
Organization: Andraka Consulting Group, Inc
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
Date: Tue, 22 Oct 2002 00:09:03 GMT

Jan, Ralph,

A long time ago we got tired of pushing on the rope for every design we did, so
we went ahead and created a library of parameterized, structurally instantiated
things like D registers, various flavors of adders/subtractors, delay queues,
etc. Easy to do in VHDL with the generate statement, a real PITA in Verilog
without the v2000 extensions. Because of things like the generate, I have
become something of a VHDL bigot, so you'll have to excuse that. Usually I
could get the tools to behave with inferred stuff if I structured the
inference to look like the hardware, something like this:

    neg_b <= not b when sub='1' else b;
    cin   <= 1 when sub='1' else 0;
    sum_d <= a + neg_b + cin;

The results vary by synth and version, but this structure seems to get there
most of the time.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

######
From: "Ralph Mason"
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
Message-ID: <_R0t9.9727$Os6.1362098@news.xtra.co.nz>
Date: Tue, 22 Oct 2002 13:21:19 +1300
Organization: Xtra

"Jan Gray" wrote in message news:ap21sj$qs3$1@slb5.atl.mindspring.net...
> Welcome to "pushing on a rope". (fpgacpu.org/usenet/rope_pushing.html: "You
> know exactly what you want -- a particular optimal, hand-mapped, hand-placed
> layout for your datapath -- but the tools get in the way, and you spend
> hours trying to discover an incantation that persuades the tools to emit the
> desired result.")

Coming from a mostly SW background, it sounds the same as trying to get
the compiler to produce just that perfect assembly for you - about 50% of the
time you just give up and write it yourself.

> > 3. General - The CPU I want to design is a RISC / CISC design where a pre
> > processor reads CISC instructions from RAM and places RISC instructions (or
> > jumps to RISC subs) to do the same thing in an internal pipeline for a RISC
> > internal CPU (or perhaps many instances of the same RISC CPU and a single
> > pre processor to provide a kind of hardware multithreading) - Are there any
> > notes on existing designs like this that one could read?
>
> Interesting. If the "preprocessor" is hardware, this was done in e.g. Intel
> P6, in the AMD K6, in some VAX implementations, and so forth. If the
> "preprocessor" is software, this was done in e.g. SoftWindows, Digital's
> FX!32 and Transmeta. There's lots in the library (see e.g. IEEE Micro) and
> in Google! (Also time to update my resources/book section of my site.)

I haven't decided yet whether the pre processor will be SW-loaded microcode
(preferred) or hard coded. I just want to minimise external bus activity, and
also be able to feed lots of internal processors (kind of a cross between the
Transmeta and these multithreading Pentiums). The multiple internal
processors could also share some of the bigger units (say a multiplier).

I am going to play with a software emulation first to see how it turns out:
what boosts performance the most, what works and what doesn't.

Thanks
Ralph

######
Message-ID: <3DB518F4.93760BF0@algor.co.uk>
From: Rick Filipkiewicz
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
Organization: MIPS Technologies (UK) Ltd
Date: Tue, 22 Oct 2002 10:23:00 +0100

Jan Gray wrote:
> Ralph Mason wrote
>
> > 1. As a learning process I have copped his adder, he says that it turns
> > into 17 LUTs although mine ends up at 52 - Are there some optimizations I am
> > missing here? Are there any pragmas you can use in these situations to say
> > what you want? Can you get right down to the actual LUT level and connect
> > them up yourself? Using the free Xilinx tools can you actually see the way
> > it has connected the actual LUTs?
>
> First, congratulations on going to the trouble of actually looking under the
> hood and *inspecting* what came out of your tools. That is a very good
> practice.
>
> Welcome to "pushing on a rope". (fpgacpu.org/usenet/rope_pushing.html: "You
> know exactly what you want -- a particular optimal, hand-mapped, hand-placed
> layout for your datapath -- but the tools get in the way, and you spend
> hours trying to discover an incantation that persuades the tools to emit the
> desired result.")

Another example of `rope pushing', possibly simpler. I have this abstract
structure for a LUT RAM based FIFO read address, where ra is the registered
address, incremented on every read:

    fifo_ra = (fifo_rd_a | fifo_rd_b) ? ra + 1 : ra;

but what I wanted to do was absorb the `or' into the LSB LUT of the adder.
Clearly I could have taken Ray's approach of instantiating the structure I
wanted but, lacking a generate capability (& given my inbuilt resistance to
VHDL), it would be difficult to parametrise [in fact Synplify now handles
Verilog's "arrays of instances", a step towards a full generate, which would
do the trick].

First problem: Synplify doesn't infer a carry chain adder for widths < 6 (for
width = 4 I can understand this, and I think somewhere deep in the inference
engine there's a dumb coding error where xxx < 5 became xxx <= 5).
Fix = do a module with a min width of 6 *and* put a syn_hier = hard on it.

2nd problem: Getting the absorption to work. Following a suggestion of, IIRC,
John_H, recoding to the equivalent, but less readable, rope_push works [for
Syn v7.0.2 at least]:

    fifo_ra[N:0] = {ra[N:1], (fifo_rd_a | fifo_rd_b)} + ra[0];

The moral being that it takes a fair amount of effort to both go fast and keep
the code "pure RTL".

Just a thought: This, and the OP, makes a potentially interesting test case for
the XST optimiser/mapper. Who knows, it might handle the original selector based
RTL better than Syn?

######
Message-ID: <3DB545CD.B5450E3C@andraka.com>
From: Ray Andraka
Organization: Andraka Consulting Group, Inc
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
References: <3DB518F4.93760BF0@algor.co.uk>
Date: Tue, 22 Oct 2002 12:34:47 GMT

I'm not averse to the extra work to get it to do the right thing with RTL.
I've been burned several times, though, by new versions doing something
different, so the extra work turns into a continuing maintenance program as
well. XST does seem to do pretty well on optimizing carry chain structures.

Rick Filipkiewicz wrote:
> The moral being that it takes a fair amount of effort to both go fast and keep
> the code "pure RTL".
>
> Just a thought: This, and the OP, makes a potentially interesting test case for
> the XST optimiser/mapper. Who knows, it might handle the original selector based
> RTL better than Syn?

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

######
Message-ID: <3DB57678.9C326F8F@andraka.com>
From: Ray Andraka
Organization: Andraka Consulting Group, Inc
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
References: <3DB49703.D82087E7@andraka.com> <3DB56A72.3619EA3B@Xilinx.com>
Date: Tue, 22 Oct 2002 16:02:28 GMT

Ours has a high degree of instantiation in stuff that gets used a lot: adders,
D registers, delay queues etc. are all 100%. For less often used stuff, I'll
often use the D register, then use RTL to describe the LUTs in front of it and
let the mapper worry about placing them, and for some of the control logic I
just go to all RTL, taking care to keep the inputs to the logic in front of
each register to 4 or 5. The motivation of course is design turn time, which
includes making the design hit the timing goals. Once you have a library of
some of the basic stuff, putting together pretty complicated data paths goes
quite quickly, and in the end you have a placed component that you know has a
good shot of making timing and should be left alone by the tools.

Goran Bilski wrote:

> Hi,
>
> I can only agree with Ray.
> My coding is mostly instantiation of Xilinx primitives surrounded with generate
> statements and generics.
>
> I have also found out that it takes me roughly 5-10 minutes to write the above
> combined adder/mux using LUT, MUXCY, XORCY and generate statements.
> It can take me almost a day trying to fool one synthesis tool into doing the
> same, and I might have to redo that day when I change synthesis tool or get a
> new version.
>
> I just want the most result from the effort I put into my work, and that's why
> roughly 90% of MicroBlaze is directly instantiated Xilinx primitives.
> What I also get by instantiating the primitives is the possibility of
> controlling the placement, which is much tougher if you stick to RTL.
> It's extremely easy to floorplan the above module and get maximum performance.
>
> Why only 90% and not 100% in MicroBlaze?
> Sometimes I know that even the stupidest synthesis tool can't go wrong and it's
> not in a critical section.
> Writing large boolean expressions is faster in RTL, and most synthesis tools
> can handle that.
>
> Göran

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759 ###### From: "Ralph Mason" Newsgroups: comp.arch.fpga References: <3DB49703.D82087E7@andraka.com> <3DB56A72.3619EA3B@Xilinx.com> <3DB57678.9C326F8F@andraka.com> Subject: Re: Newbie Questions - Jan Gray XSOC Lines: 236 X-Priority: 3 X-MSMail-Priority: Normal X-Newsreader: Microsoft Outlook Express 6.00.2800.1106 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106 Message-ID: Date: Wed, 23 Oct 2002 09:15:21 +1300 NNTP-Posting-Host: 210.86.61.88 X-Complaints-To: newsadmin@xtra.co.nz X-Trace: news.xtra.co.nz 1035317721 210.86.61.88 (Wed, 23 Oct 2002 09:15:21 NZDT) NNTP-Posting-Date: Wed, 23 Oct 2002 09:15:21 NZDT Organization: Xtra Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!cyclone2.usenetserver.com!news.webusenet.com!news02.tsnz.net!newsfeed01.tsnz.net!news.xtra.co.nz!53ab2750!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:22361 It seems that the optimisers are a few year behind software ones (or perhaps far more complex beasts) Given that, shouldn't the vendors provide libraries of pro build items that are parameterised (down to the target part) in implemented in the most efficient way? This way upgrades / part changes shouldn't be a problem. Simple things like adders muxes ripple counters / with compare etc etc Or am I misunderstanding / over simplifying the problem. Ralph "Ray Andraka" wrote in message news:3DB57678.9C326F8F@andraka.com... > Ours has a high degree of instantiation in stuff that gets used alot...adders, d > registers, delay queues etc are all 100%. 
For less often used, I'll often use the D > register then use RTL to describe the LUTs in front of it and let the mapper worry about > placing them, and for some of the control logic I just go to an all RTL taking care to > keep the inputs to the logic in front each register to 4 or 5. The motivation of course > is design turn time, which includes making the design hit the timing goals. Once you > have a library of some of the basic stuff, putting together pretty complicated data > paths goes quite quickly, and in the end you have a placed component that you know has a > good shot of making timing and should be left alone by the tools. > > Goran Bilski wrote: > > > Hi, > > > > I can only agree with Ray. > > My coding is mostly instanciation of Xilinx primitive surrounded with generate > > statement and generics. > > > > I have also found out that it takes me roughly 5-10 minutes to write the above > > combined adder/mux using LUT,MUXCY,XORCY and generate statements. > > It can take me almost a day trying to foul one synthesis tool to do the same and I > > might have to redo that day when I change synthesis tool or get a new version. > > > > I just want most result of the effort I put in my work and that's why roughly 90% of > > MicroBlaze is directly instanciated Xilinx primitives. > > What I also get by instanciating the primitives is a possibility of controlling the > > placement, which is much tougher if you stick to RTL. > > It's extremely easy to floorplan the above module and getting maximum performance. > > > > Why only 90% not 100% in MicroBlaze? > > Sometimes I know that even the stupidest synthesis tool can't go wrong and it's not > > in a critical section. > > Doing large boolean expression is faster in RTL and most synthesis tool can handle > > that. 
> >
> > Göran
> >
> > Ray Andraka wrote:
> >
> > > Jan, Ralph,
> > >
> > > Long time ago we got tired of pushing on the rope for every design we did, so we
> > > went ahead and created a library of parameterized structurally instantiated
> > > things like d registers, various flavors of adders/subtractors, delay queues,
> > > etc. Easy to do in VHDL with the generate statement, a real PITA in verilog
> > > without the v2000 extensions. Because of some of the things like the generate,
> > > I have become something of a VHDL bigot so you'll have to excuse that. Usually
> > > I could get the tools to behave with inferred stuff if I structured the
> > > inference to look like the hardware, something like this:
> > >
> > > neg_b <= not b when sub='1' else b;
> > > cin <= 1 when sub='1' else 0;
> > > sum_d <= a + neg_b + cin;
> > >
> > > The results vary by synth and version, but this structure seems to get there
> > > most of the time.
> > >
> > > Jan Gray wrote:
> > >
> > > > Ralph Mason wrote
> > > >
> > > > > 1. As a learning process I have copped his adder, he says that it turns
> > > > > into 17 LUT's although mine ends up at 52 - Is there some optimizations I am
> > > > > missing here? are there any pragmas you can use in these situations to say
> > > > > what you want? Can you get right down to the actual LUT level and connect
> > > > > them up yourself? Using the free Xilinx tools can you actually see the way
> > > > > it has connected the actual LUTs?
> > > >
> > > > First, congratulations on going to the trouble of actually looking under the
> > > > hood and *inspecting* what came out of your tools. That is a very good
> > > > practice.
> > > >
> > > > Welcome to "pushing on a rope". (fpgacpu.org/usenet/rope_pushing.html: "You
> > > > know exactly what you want -- a particular optimal, hand-mapped, hand-placed
> > > > layout for your datapath -- but the tools get in the way, and you spend
> > > > hours trying to discover an incantation that persuades the tools to emit the
> > > > desired result.")
> > > >
> > > > Here the goal was to generate a+b or a-b with cin and cout at a cost of one
> > > > LUT per bit. The hardware is willing, but you will need to perform a bit of
> > > > experimentation to discover an expression that the synthesis tool will
> > > > repeatably synthesize into the desired hardware idiom.
> > > >
> > > > The construction you 'copped' did work for one version of some synthesis
> > > > tool once. Of course that does not mean it works reliably across all tools
> > > > and all versions of all tools!
> > > >
> > > > I like John_H's suggestion -- it may succeed more often than the mux(+,-)
> > > > that I used.
> > > >
> > > > BTW IIRC in this specific instance it was also necessary to move the single
> > > > assignment into its own module -- the synthesis tool created the inefficient
> > > > mux(+, -) form if the assignment was inline in the parent module.
> > > >
> > > > > 2. Floorplanning - Is there any info on this, it seems that you don't need
> > > > > to do it because the tools do it automatically, I am guessing sometimes you
> > > > > can do it better yourself to give a faster device? It looks like a
> > > > > laborious process - is it?
> > > >
> > > > It need not be laborious, but you have to learn the ropes. I think we old
> > > > timers are fortunate because we grew up doing hierarchical schematics, and
> > > > so writing the same hierarchical structural HDL is familiar and natural.
> > > >
> > > > It is long past time I investigated how to do this with XST Verilog and
> > > > wrote it up for my site...
> > > >
> > > > > 3. General - The CPU I want to design is a RISC / CISC design where a pre
> > > > > processor reads CISC instructions from ram and places RISC instructions (or
> > > > > jumps to RISC subs) to do the same thing in an internal pipeline for a RISC
> > > > > internal cpu ( or perhaps many instances of the same RISC cpu and a single
> > > > > pre processor to provide a kind of hardware multithreading) - Is there any
> > > > > notes on existing designs like this that one could read.
> > > >
> > > > Interesting. If the "preprocessor" is hardware, this was done in e.g. Intel
> > > > P6, in the AMD K6, in some VAX implementations, and so forth. If the
> > > > "preprocessor" is software, this was done in e.g. SoftWindows, Digital's
> > > > FX!32 and Transmeta. There's lots in the library (see e.g. IEEE Micro) and
> > > > in Google! (Also time to update my resources/book section of my site.)
> > > >
> > > > Jan Gray, Gray Research LLC
> > >
> > > --
> > > --Ray Andraka, P.E.
> > > President, the Andraka Consulting Group, Inc.
> > > 401/884-7930 Fax 401/884-7950
> > > email ray@andraka.com
> > > http://www.andraka.com
> > >
> > > "They that give up essential liberty to obtain a little
> > > temporary safety deserve neither liberty nor safety."
> > > -Benjamin Franklin, 1759
>
> --
> --Ray Andraka, P.E.
> President, the Andraka Consulting Group, Inc.
> 401/884-7930 Fax 401/884-7950
> email ray@andraka.com
> http://www.andraka.com
>
> "They that give up essential liberty to obtain a little
> temporary safety deserve neither liberty nor safety."
> -Benjamin Franklin, 1759
>
>

######
From: "Ralph Mason"
Newsgroups: comp.arch.fpga
References: <3DB46645.63543E05@mail.com>
Subject: Re: Newbie Questions - Jan Gray XSOC
Lines: 51
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1106
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Message-ID:
Date: Wed, 23 Oct 2002 10:16:56 +1300
NNTP-Posting-Host: 210.86.61.88
X-Complaints-To: newsadmin@xtra.co.nz
X-Trace: news.xtra.co.nz 1035321416 210.86.61.88 (Wed, 23 Oct 2002 10:16:56 NZDT)
NNTP-Posting-Date: Wed, 23 Oct 2002 10:16:56 NZDT
Organization: Xtra
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!newsfeed.icl.net!newsfeed.fjserv.net!newsfeed1.cidera.com!Cidera!telocity-west!TELOCITY!news-out.spamkiller.net!propagator2-maxim!news-in.spamkiller.net!news02.tsnz.net!newsfeed01.tsnz.net!news.xtra.co.nz!53ab2750!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:22358

Knowing about 10E-6% of what I need to know about this stuff, there is probably
something stupid I am doing here.

But it seems like these

assign {carry_out,sum,x} = fn ? {a,carry_in} - { b,1'b1 } : {a,carry_in} + { b,1'b1 };

assign {carry_out,sum,x} = {a,carry_in} + ( fn ? { ~b,1'b1 } : { b,1'b1 });

generate a different carry output on subtraction.

If I input

carry_in = 1
a=32
b=32
fn=1

On the first I get a carry_out of 0 (expected) and on the second I get a
carry_out of 1 (unexpected).

Seems like writing the same thing a few ways can help find bugs.

Ralph

"John_H" wrote in message news:3DB46645.63543E05@mail.com...
> Addressing item 1 only, synthesizers will better understand how to efficiently
> implement a single arithmetic chain rather than working backwards from a mux of
> two arithmetic values. Try
>
> assign {carry_out,sum,x} = {a,carry_in} + ( fn ? -{ b,1'b1 }: { b,1'b1 });
>
> though the synth might still be confused by the 2's complement. To go "whole
> hog" to the minimum solution, you may have to explicitly perform the negative:
>
> assign {carry_out,sum,x} = {a,carry_in} + ( fn ? {~b,1'b1 }: { b,1'b1 });
>
> ( -n == ~n + 1 . . . -{n,1} == ~{n,1} + 1 == {~n,0} + 1 == {~n,1} )
>
> It's all much smaller now!
>

######
Message-ID: <3DB5CEA1.17ADE82@mail.com>
From: John_H
X-Mailer: Mozilla 4.75 [en]C-CCK-MCD (Win95; U)
X-Accept-Language: en
MIME-Version: 1.0
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
References: <3DB46645.63543E05@mail.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Lines: 32
Date: Tue, 22 Oct 2002 22:18:08 GMT
NNTP-Posting-Host: 192.65.17.17
X-Complaints-To: postmaster@opbu.xerox.com
X-Trace: news-west.eli.net 1035325088 192.65.17.17 (Tue, 22 Oct 2002 16:18:08 MDT)
NNTP-Posting-Date: Tue, 22 Oct 2002 16:18:08 MDT
Organization: Xerox Officeprinting NewsReader Service
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!newscore.univie.ac.at!uninett.no!uio.no!logbridge.uoregon.edu!news-west.eli.net!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:22409

Indeed... I was thinking of unsigned arithmetic where the result vector is the
same size as the input vector. To get the carry out, you extend this one bit.
To get the full implementation, extend the input side another bit on top.

After all, 6-6 is the same as 6+10 when you only look at the four LSbits. To get
a carry out, you need 6-6 equal to 6+26 where the extra 16 is due to the sign.
Look only at the 5 bits and that carry out is zero as expected.

So...

Original:

assign {carry_out,sum,x} = fn ? {a,carry_in} - { b,1'b1 } : {a,carry_in} + { b,1'b1 };

which, if extended so the right side vector is the same size as the left side
vector, is

assign {carry_out,sum,x} = fn ? {1'b0,a,carry_in} - {1'b0, b,1'b1 } : {1'b0,a,carry_in} + {1'b0, b,1'b1 };

What I showed you:

assign {carry_out,sum,x} = {a,carry_in} + ( fn ? {~b,1'b1 }: { b,1'b1 });

What I should've shown you:

assign {carry_out,sum,x} = {1'b0,a,carry_in} + ( fn ? {1'b1,~b,1'b1 }: { 1'b0,b,1'b1 });

( -n == ~n + 1 . . . -{0,n,1} == ~{0,n,1} + 1 == {1,~n,0} + 1 == {1,~n,1} )

This extends the sign out the extra bit needed for a proper carry out.

######
Message-ID: <3DB5F0E7.39B8C5A7@andraka.com>
From: Ray Andraka
Organization: Andraka Consulting Group, Inc
X-Mailer: Mozilla 4.77 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
Newsgroups: comp.arch.fpga
Subject: Re: Newbie Questions - Jan Gray XSOC
References: <3DB49703.D82087E7@andraka.com> <3DB56A72.3619EA3B@Xilinx.com> <3DB57678.9C326F8F@andraka.com>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Lines: 265
Date: Wed, 23 Oct 2002 00:44:50 GMT
NNTP-Posting-Host: 68.15.41.165
X-Complaints-To: abuse@cox.net
X-Trace: news1.east.cox.net 1035333890 68.15.41.165 (Tue, 22 Oct 2002 20:44:50 EDT)
NNTP-Posting-Date: Tue, 22 Oct 2002 20:44:50 EDT
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!newsfeed.stueberl.de!cox.net!p01!news1.east.cox.net.POSTED!53ab2750!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:22336

They probably should, but on the other hand if they do then those need to be
open source to convince the masses to use them. These do exist in closed source
form with the core generator (xilinx) and LPMs (altera), but there is no access
to the source. After dissecting several of them, I came away with the impression
that at least some of these cores are not very carefully optimized, and many
times don't do exactly what you want. In the end, it was easier for us to write
our own library for these types of functions so that we had control over it.

Several years back, Xilinx had a library of macros in schematic form. Those
were easier to get inside and modify if needed. It turned out that many of
those macros could be improved significantly with a little more attention to
the architecture of the FPGA.
Nowadays, most folks are not using these simple macros because it is easier
(and more portable) to just leave it to the synthesis tools even if the results
are not optimal.

Ralph Mason wrote:

> It seems that the optimisers are a few years behind software ones (or perhaps
> far more complex beasts)
>
> Given that, shouldn't the vendors provide libraries of pre-built items that
> are parameterised (down to the target part) and implemented in the most
> efficient way? This way upgrades / part changes shouldn't be a problem.
>
> Simple things like
>
> adders
>
> muxes
>
> ripple counters / with compare
>
> etc
>
> etc
>
> Or am I misunderstanding / over simplifying the problem.
>
> Ralph
>
> "Ray Andraka" wrote in message
> news:3DB57678.9C326F8F@andraka.com...
> > Ours has a high degree of instantiation in stuff that gets used a lot...adders, d
> > registers, delay queues etc are all 100%. For less often used, I'll often use the D
> > register then use RTL to describe the LUTs in front of it and let the mapper worry about
> > placing them, and for some of the control logic I just go to an all RTL taking care to
> > keep the inputs to the logic in front of each register to 4 or 5. The motivation of course
> > is design turn time, which includes making the design hit the timing goals. Once you
> > have a library of some of the basic stuff, putting together pretty complicated data
> > paths goes quite quickly, and in the end you have a placed component that you know has a
> > good shot of making timing and should be left alone by the tools.
> >
> > Goran Bilski wrote:
> >
> > > Hi,
> > >
> > > I can only agree with Ray.
> > > My coding is mostly instantiation of Xilinx primitive surrounded with generate
> > > statement and generics.
> > >
> > > I have also found out that it takes me roughly 5-10 minutes to write the above
> > > combined adder/mux using LUT,MUXCY,XORCY and generate statements.
> > > It can take me almost a day trying to fool one synthesis tool to do the same and I
> > > might have to redo that day when I change synthesis tool or get a new version.
> > >
> > > I just want most result of the effort I put in my work and that's why roughly 90% of
> > > MicroBlaze is directly instantiated Xilinx primitives.
> > > What I also get by instantiating the primitives is a possibility of controlling the
> > > placement, which is much tougher if you stick to RTL.
> > > It's extremely easy to floorplan the above module and getting maximum performance.
> > >
> > > Why only 90% not 100% in MicroBlaze?
> > > Sometimes I know that even the stupidest synthesis tool can't go wrong and it's not
> > > in a critical section.
> > > Doing large boolean expression is faster in RTL and most synthesis tool can handle
> > > that.
> > >
> > > Göran
> > >
> > > Ray Andraka wrote:
> > >
> > > > Jan, Ralph,
> > > >
> > > > Long time ago we got tired of pushing on the rope for every design we did, so we
> > > > went ahead and created a library of parameterized structurally instantiated
> > > > things like d registers, various flavors of adders/subtractors, delay queues,
> > > > etc. Easy to do in VHDL with the generate statement, a real PITA in verilog
> > > > without the v2000 extensions. Because of some of the things like the generate,
> > > > I have become something of a VHDL bigot so you'll have to excuse that. Usually
> > > > I could get the tools to behave with inferred stuff if I structured the
> > > > inference to look like the hardware, something like this:
> > > >
> > > > neg_b <= not b when sub='1' else b;
> > > > cin <= 1 when sub='1' else 0;
> > > > sum_d <= a + neg_b + cin;
> > > >
> > > > The results vary by synth and version, but this structure seems to get there
> > > > most of the time.
> > > >
> > > > Jan Gray wrote:
> > > >
> > > > > Ralph Mason wrote
> > > > >
> > > > > > 1. As a learning process I have copped his adder, he says that it turns
> > > > > > into 17 LUT's although mine ends up at 52 - Is there some optimizations I am
> > > > > > missing here? are there any pragmas you can use in these situations to say
> > > > > > what you want? Can you get right down to the actual LUT level and connect
> > > > > > them up yourself? Using the free Xilinx tools can you actually see the way
> > > > > > it has connected the actual LUTs?
> > > > >
> > > > > First, congratulations on going to the trouble of actually looking under the
> > > > > hood and *inspecting* what came out of your tools. That is a very good
> > > > > practice.
> > > > >
> > > > > Welcome to "pushing on a rope". (fpgacpu.org/usenet/rope_pushing.html: "You
> > > > > know exactly what you want -- a particular optimal, hand-mapped, hand-placed
> > > > > layout for your datapath -- but the tools get in the way, and you spend
> > > > > hours trying to discover an incantation that persuades the tools to emit the
> > > > > desired result.")
> > > > >
> > > > > Here the goal was to generate a+b or a-b with cin and cout at a cost of one
> > > > > LUT per bit. The hardware is willing, but you will need to perform a bit of
> > > > > experimentation to discover an expression that the synthesis tool will
> > > > > repeatably synthesize into the desired hardware idiom.
> > > > >
> > > > > The construction you 'copped' did work for one version of some synthesis
> > > > > tool once. Of course that does not mean it works reliably across all tools
> > > > > and all versions of all tools!
> > > > >
> > > > > I like John_H's suggestion -- it may succeed more often than the mux(+,-)
> > > > > that I used.
> > > > >
> > > > > BTW IIRC in this specific instance it was also necessary to move the single
> > > > > assignment into its own module -- the synthesis tool created the inefficient
> > > > > mux(+, -) form if the assignment was inline in the parent module.
> > > > >
> > > > > > 2. Floorplanning - Is there any info on this, it seems that you don't need
> > > > > > to do it because the tools do it automatically, I am guessing sometimes you
> > > > > > can do it better yourself to give a faster device? It looks like a
> > > > > > laborious process - is it?
> > > > >
> > > > > It need not be laborious, but you have to learn the ropes. I think we old
> > > > > timers are fortunate because we grew up doing hierarchical schematics, and
> > > > > so writing the same hierarchical structural HDL is familiar and natural.
> > > > >
> > > > > It is long past time I investigated how to do this with XST Verilog and
> > > > > wrote it up for my site...
> > > > >
> > > > > > 3. General - The CPU I want to design is a RISC / CISC design where a pre
> > > > > > processor reads CISC instructions from ram and places RISC instructions (or
> > > > > > jumps to RISC subs) to do the same thing in an internal pipeline for a RISC
> > > > > > internal cpu ( or perhaps many instances of the same RISC cpu and a single
> > > > > > pre processor to provide a kind of hardware multithreading) - Is there any
> > > > > > notes on existing designs like this that one could read.
> > > > >
> > > > > Interesting. If the "preprocessor" is hardware, this was done in e.g. Intel
> > > > > P6, in the AMD K6, in some VAX implementations, and so forth. If the
> > > > > "preprocessor" is software, this was done in e.g. SoftWindows, Digital's
> > > > > FX!32 and Transmeta. There's lots in the library (see e.g. IEEE Micro) and
> > > > > in Google! (Also time to update my resources/book section of my site.)
> > > > >
> > > > > Jan Gray, Gray Research LLC
> > > >
> > > > --
> > > > --Ray Andraka, P.E.
> > > > President, the Andraka Consulting Group, Inc.
> > > > 401/884-7930 Fax 401/884-7950
> > > > email ray@andraka.com
> > > > http://www.andraka.com
> > > >
> > > > "They that give up essential liberty to obtain a little
> > > > temporary safety deserve neither liberty nor safety."
> > > > -Benjamin Franklin, 1759
> >
> > --
> > --Ray Andraka, P.E.
> > President, the Andraka Consulting Group, Inc.
> > 401/884-7930 Fax 401/884-7950
> > email ray@andraka.com
> > http://www.andraka.com
> >
> > "They that give up essential liberty to obtain a little
> > temporary safety deserve neither liberty nor safety."
> > -Benjamin Franklin, 1759
> >
> >

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
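
A minimal Python sketch of the carry-out discrepancy discussed earlier in this
thread (Ralph Mason's test case and John_H's follow-up). It models the three
Verilog assignments with plain integers so the bit widths can be checked by
hand. The 16-bit operand width, the function names and the split() helper are
assumptions made for illustration only; the thread does not state a width, and
this is a model of the arithmetic, not synthesizable code.

```python
WIDTH = 16  # assumed operand width; the thread does not state one


def split(value, width=WIDTH):
    """Truncate to width+2 bits and unpack as {carry_out, sum, x}."""
    value &= (1 << (width + 2)) - 1
    x = value & 1
    total = (value >> 1) & ((1 << width) - 1)
    carry_out = value >> (width + 1)
    return carry_out, total, x


def mux_form(a, b, carry_in, fn, width=WIDTH):
    """fn ? {a,carry_in} - {b,1'b1} : {a,carry_in} + {b,1'b1}"""
    lhs = (a << 1) | carry_in
    rhs = (b << 1) | 1
    return split(lhs - rhs if fn else lhs + rhs, width)


def short_form(a, b, carry_in, fn, width=WIDTH):
    """{a,carry_in} + (fn ? {~b,1'b1} : {b,1'b1})"""
    mask = (1 << width) - 1
    lhs = (a << 1) | carry_in
    # ~b is only `width` bits wide inside the concatenation, so the
    # top bit of the zero-extended operand stays 0.
    rhs = (((~b & mask) if fn else b) << 1) | 1
    return split(lhs + rhs, width)


def extended_form(a, b, carry_in, fn, width=WIDTH):
    """{1'b0,a,carry_in} + (fn ? {1'b1,~b,1'b1} : {1'b0,b,1'b1})"""
    mask = (1 << width) - 1
    lhs = (a << 1) | carry_in
    if fn:
        # Explicit leading 1 extends the sign of the negated operand.
        rhs = (1 << (width + 1)) | ((~b & mask) << 1) | 1
    else:
        rhs = (b << 1) | 1
    return split(lhs + rhs, width)


if __name__ == "__main__":
    # Ralph's test case: carry_in = 1, a = 32, b = 32, fn = 1
    for form in (mux_form, short_form, extended_form):
        carry_out, total, x = form(a=32, b=32, carry_in=1, fn=1)
        print(form.__name__, "carry_out =", carry_out, "sum =", total)
```

Under these assumptions the model gives carry_out = 0 for the mux form, 1 for
the short form and 0 for the extended form on that test case, which agrees with
what Ralph reported and with the result John_H's corrected expression is meant
to restore.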