From: "Carl Brannen"
Newsgroups: comp.arch.fpga
Subject: 16x5 multiplier uses new multiply algorithm
Date: Fri, 21 Dec 2001 09:18:17 +0000 (UTC)
Message-ID: <5b02deb9bd2704d125bcb07ae453d40d.51709@mygate.mailgate.org>
References: <4172938cdfe1a23d2d483ee9c9d0460e.51709@mygate.mailgate.org>

This 16x5 unsigned multiplier uses the algorithm listed above. It has
a single register stage, at the end. I'm fairly sure it works, having
simulated a lot of numbers through it. If it were to be pipelined, it
would be most efficient to put the first register stage at the M input,
but at the "MODE" stage for the X input.

It uses 33 slices, with a total of 65 FGs (function generators, i.e.
4-input LUTs) used. Two of the FGs are programmed to be zero so that
two carry-outs can be made visible. I don't immediately see how to
avoid this.

Despite being fall-through logic with no internal registers (though
note that the final register is necessary in order to get
M x zero = zero), and without being floorplanned (I haven't even gone
back through the code to see if I can improve it), it still gets
131 MHz in the xcv50e -8.

Were it to be pipelined, it would be natural to bring the "X" input
into the logic a clock early. This would allow the mode inputs for the
partial products to be registered.
This could be done without unbalancing the multiplier by registering
the "M" inputs right at the input, but registering the "X" inputs only
after a clock. I haven't taken a look at how to minimize the logic in
this case...

I typically over-comment my VHDL. I've removed the comments here to
save bandwidth on the internet. If anyone is interested, I can add
them back in.

Thanks to Frédéric Rivoallon at Xilinx for e-mailing me a link to
instructions on how to instantiate LUT4s inside generate statements
without a lot of grief. For those interested, the link is here:

http://tech-www.informatik.uni-hamburg.de/vhdl/doc/faq/FAQ1.html#attributes

This is the first in a series of multipliers. The next, a 16x8, will
use 3 partial products, which is not a particularly natural number for
this algorithm. But the one after that, 16x11, will use 4 and will be
quite sweet.

Total LUT usage with this algorithm will increase by 3 per additional
bit beyond 16. That is, the number of LUTs for an Nx5 multiplier will
be about

    LUT( multiplier Nx5) = 17 + 3N.

(For N = 16 this gives the 65 FGs quoted above.)

The 3 adders are hooked up as follows:

LUT#1 \          -- PP0V creates { 1M, 2M, 3M, 4M}
       + LUT#3   -- PS0V creates final result between 0M and 31M
LUT#2 /          -- PP3V creates { 8M,16M,24M,32M}

The usual algorithm for multiplying by 5 bits on a Virtex will require
4 LUTs per bit. The adder tree will look like this (maybe a slightly
different topology will be better):

LUT#1 \                   -- creates { 0M, 1M, 2M, 3M}
       + LUT#3 \          -- creates { 0M ... 15M}
LUT#2 /         \         -- creates { 0M, 4M, 8M,12M}
                 + LUT#4  -- creates final result between 0M and 31M
(M)-------------/         -- creates { 0M,16M} (AND gate absorbed into LUT#4)

For extremely wide multiplies, the savings of the new algorithm
approach 25% over the old technique.

-- Multiplier code, 16x5 multiplier
-- Design by Carl Brannen.
-- Uses 3 + 2 bit coding.
-- Multiplier code, 16x5 multiplier
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

entity MUL16x5S is
  port (
    CLK:  in  STD_LOGIC;
    M:    in  STD_LOGIC_VECTOR(15 downto 0);
    X:    in  STD_LOGIC_VECTOR( 4 downto 0);
    Y:    out STD_LOGIC_VECTOR(20 downto 0);
    TEST: in  STD_LOGIC
  );
end MUL16x5S;

architecture MUL16x5S_arch of MUL16x5S is

  component LUT4
    port (
      I0: in  STD_LOGIC;
      I1: in  STD_LOGIC;
      I2: in  STD_LOGIC;
      I3: in  STD_LOGIC;
      O:  out STD_LOGIC);
  end component;
  attribute INIT: string;

  component XORCY
    port (
      CI: in  STD_LOGIC;
      LI: in  STD_LOGIC;
      O:  out STD_LOGIC);
  end component;

  component MUXCY
    port (
      DI: in  STD_LOGIC;
      CI: in  STD_LOGIC;
      S:  in  STD_LOGIC;
      O:  out STD_LOGIC);
  end component;

  component MULT_AND
    port (
      I0: in  STD_LOGIC;
      I1: in  STD_LOGIC;
      LO: out STD_LOGIC);
  end component;

  component FDR
    port (
      D: in  STD_LOGIC;
      C: in  STD_LOGIC;
      R: in  STD_LOGIC;
      Q: out STD_LOGIC);
  end component;

  signal EM0V:    STD_LOGIC_VECTOR(18 downto 0); -- Extended M[]
  signal PP0PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PP0.P & PP0.N
  signal PP0V:    STD_LOGIC_VECTOR(20 downto 0); -- PP0.V
  signal PP0M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PP0_LUT: STD_LOGIC_VECTOR(17 downto 1); -- LUT
  signal PP0_MA:  STD_LOGIC_VECTOR(17 downto 1); -- MULT_AND
  signal PP0_XC:  STD_LOGIC_VECTOR(17 downto 1); -- XORCY
  signal PP0_CRY: STD_LOGIC_VECTOR(18 downto 1); -- Carry
  signal PP0_SUM: STD_LOGIC_VECTOR(17 downto 1); -- Sum output

  signal EM3V:    STD_LOGIC_VECTOR(20 downto 0); -- Extended M[]
  signal PP3PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PP3.P & PP3.N
  signal PP3V:    STD_LOGIC_VECTOR(20 downto 2); -- PP3.V
  signal PP3M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PP3_LUT: STD_LOGIC_VECTOR(20 downto 3); -- LUT
  signal PP3_MA:  STD_LOGIC_VECTOR(20 downto 3); -- MULT_AND
  signal PP3_XC:  STD_LOGIC_VECTOR(20 downto 3); -- XORCY
  signal PP3_CRY: STD_LOGIC_VECTOR(21 downto 3); -- Carry
  signal PP3_SUM: STD_LOGIC_VECTOR(20 downto 3); -- Sum output

  signal PS2SEL:  STD_LOGIC_VECTOR( 3 downto 0); -- Select bit
  signal PS2PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PS2.P & PS2.N
  signal PS2V:    STD_LOGIC_VECTOR(20 downto 0); -- PS2.V
  signal PS2M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PS2_LUT: STD_LOGIC_VECTOR(20 downto 2); -- LUT
  signal PS2_MA:  STD_LOGIC_VECTOR(20 downto 2); -- MULT_AND
  signal PS2_XC:  STD_LOGIC_VECTOR(20 downto 2); -- XORCY
  signal PS2_CRY: STD_LOGIC_VECTOR(21 downto 2); -- Carry
  signal PS2_SUM: STD_LOGIC_VECTOR(20 downto 2); -- Sum output

  signal YRES:    STD_LOGIC;                     -- Reset final FF
  signal YQ:      STD_LOGIC_VECTOR(20 downto 0); -- Final FF

begin

  EM0V(18 downto 0)  <= "000" & M(15 downto 0);
  PP0V(0)            <= (EM0V(0) and X(0));
  PP0V(1)            <= (EM0V(0) and X(1)) xor (EM0V(1) and X(0));
  PP0V(17 downto 2)  <= PP0_SUM(17 downto 2);
  PP0V(20 downto 18) <= "000";

  PP0PN(0) <= X(2);                            -- Negative bit
  with X(2 downto 0) select                    -- Positive bit
    PP0PN(1) <= '1' when "001" | "010" | "011",
                '0' when others;

  with X(2 downto 0) select
    PP0M(1 downto 0) <= "01" when "111" | "001", -- PP0V <= 1M
                        "00" when "110" | "010", -- PP0V <= 2M
                        "11" when "101" | "011", -- PP0V <= 3M
                        "10" when others;        -- PP0V <= 4M

  PP0_CRY(1) <= '0';

  A0: for I in 1 to 17 generate
    B: block
      attribute INIT of L0: label is "7484";
    begin
      L0: LUT4 port map(
        I0 => PP0M(1), I1 => EM0V(I-1),
        I2 => PP0M(0), I3 => EM0V(I),
        O  => PP0_LUT(I));
      MA: MULT_AND port map (
        I0 => PP0M(1), I1 => EM0V(I-1),
        LO => PP0_MA(I));
      MC: MUXCY port map (
        DI => PP0_MA(I), CI => PP0_CRY(I),
        S  => PP0_LUT(I), O => PP0_CRY(I+1));
      XC: XORCY port map (
        CI => PP0_CRY(I), LI => PP0_LUT(I),
        O  => PP0_SUM(I));
    end block b;
  end generate;

  EM3V(20 downto 0) <= "00" & M(15 downto 0) & TEST & TEST & TEST;
  PP3V(2)           <= '0';
  PP3V(20 downto 3) <= PP3_SUM(20 downto 3);

  PP3PN(0) <= '0';                             -- Negative bit (never negative)
  with X(4 downto 2) select                    -- Positive bit
    PP3PN(1) <= '0' when "000",
                '1' when others;               -- Usually positive

  with X(4 downto 2) select
    PP3M(1 downto 0) <= "01" when "001" | "010", -- PP3V <= 1M
                        "00" when "011" | "100", -- PP3V <= 2M
                        "11" when "101" | "110", -- PP3V <= 3M
                        "10" when others;        -- PP3V <= 4M

  PP3_CRY(3) <= '0';

  A3: for I in 3 to 20 generate
    B: block
      attribute INIT of L3: label is "7484";   -- See PP0V
    begin
      L3: LUT4 port map(
        I0 => PP3M(1), I1 => EM3V(I-1),
        I2 => PP3M(0), I3 => EM3V(I),
        O  => PP3_LUT(I));
      MA: MULT_AND port map (
        I0 => PP3M(1), I1 => EM3V(I-1),
        LO => PP3_MA(I));
      MC: MUXCY port map (
        DI => PP3_MA(I), CI => PP3_CRY(I),
        S  => PP3_LUT(I), O => PP3_CRY(I+1));
      XC: XORCY port map (
        CI => PP3_CRY(I), LI => PP3_LUT(I),
        O  => PP3_SUM(I));
    end block b;
  end generate;

  PS2V(1 downto 0)  <= PP0V(1 downto 0);
  PS2V(20 downto 2) <= PS2_SUM(20 downto 2);

  PS2SEL <= PP3PN(1 downto 0) & PP0PN(1 downto 0);

  with PS2SEL select
    PS2PN(1 downto 0) <=                                   -- Result of sum:
      "01" when "0101" | "0100" | "0110" | "0001",         -- Negative
      "10" when "1001" | "1000" | "1010" | "0010",         -- Positive
      "00" when others;

  with PS2SEL select
    PS2M(1 downto 0) <=                                    -- Mode:
      "00" when "0100" | "1000",                           -- A
      "01" when "0001" | "0010",                           -- B
      "10" when "0110" | "1001",                           -- A-B
      "11" when others;                                    -- A+B

  with PS2M(1 downto 0) select
    PS2_CRY(2) <= ( '0' )                 when "00",       -- CIN = 0
                  ( '0' )                 when "01",       -- CIN = 0
                  (PP0V(1) nor PP0V(0))   when "10",       -- CIN = 1
                  ( '0' )                 when others;     -- CIN = 0

  S2: for I in 2 to 20 generate
    B: block
      attribute INIT of L2: label is "7C86";
    begin
      L2: LUT4 port map(
        I0 => PS2M(1), I1 => PP3V(I),
        I2 => PS2M(0), I3 => PP0V(I),
        O  => PS2_LUT(I));
      MA: MULT_AND port map (
        I0 => PS2M(1), I1 => PP3V(I),
        LO => PS2_MA(I));
      MC: MUXCY port map (
        DI => PS2_MA(I), CI => PS2_CRY(I),
        S  => PS2_LUT(I), O => PS2_CRY(I+1));
      XC: XORCY port map (
        CI => PS2_CRY(I), LI => PS2_LUT(I),
        O  => PS2_SUM(I));
    end block b;
  end generate;

  YRES <= not PS2PN(1);

  F0: for I in 0 to 20 generate
    FR: FDR port map (
      D => PS2V(I), C => CLK, R => YRES, Q => YQ(I));
  end generate;

  Y <= YQ(20 downto 0);

end MUL16x5S_arch;

Carl