From: "Carl Brannen" <carl.brannen@terabeam.com>
Newsgroups: comp.arch.fpga
Subject: 16x5 multiplier uses new multiply algorithm
Date: Fri, 21 Dec 2001 09:18:17 +0000 (UTC)
Organization: Mailgate.ORG Server - http://www.Mailgate.ORG
Lines: 279
Message-ID: <5b02deb9bd2704d125bcb07ae453d40d.51709@mygate.mailgate.org>
References: <4172938cdfe1a23d2d483ee9c9d0460e.51709@mygate.mailgate.org>
NNTP-Posting-Host: firewall.terabeam.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.mailgate.org 1008914312 18038 216.137.15.2 (Fri Dec 21 10:18:16 2001)
X-Complaints-To: abuse@mailgate.org
NNTP-Posting-Date: Fri, 21 Dec 2001 09:18:17 +0000 (UTC)
Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008914312
User-Agent: Mailgate Web Server
X-URL: http://www.Mailgate.ORG
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12613

This 16x5 unsigned multiplier uses the algorithm listed above.  It has a
single register stage, at the end.  I'm fairly sure it works, having simulated
a lot of numbers through it.

If it were pipelined, the most efficient place for the first register stage
would be directly at the M input, but after the "MODE" decode on the X input.

It uses 33 slices, with a total of 65 FGs (function generators, i.e. LUTs).
Two of the FGs are programmed to output constant zero so that two carry-outs
can be made visible.  I don't immediately see how to avoid this.

It is a pure fall-through design with no internal registers (the final
register is still needed, because its reset is what gives M x zero = zero),
and it has not been floorplanned; I haven't even gone back through the code
to see if I can improve it.  Even so, it gets 131MHz in the xcv50e -8.

Were it to be pipelined, it would be natural to bring the "X" input into the
logic a clock early.  This would allow the mode inputs for the partial
products to be registered.  This could be done without unbalancing the
multiplier by registering "M" right at its inputs, but registering the "X"
side only after the mode decode, one clock in.  I haven't taken a look at how
to minimize the logic in this case...

I typically over-comment my VHDL. I've removed the comments here to save
bandwidth on the internet.  If anyone is interested, I can add them back in.

Thanks to Frédéric Rivoallon at Xilinx for e-mailing me a link to instructions
on how to instantiate LUT4s inside generate statements without a lot of grief.
For those interested, the link is here:
http://tech-www.informatik.uni-hamburg.de/vhdl/doc/faq/FAQ1.html#attributes

This is the first in a series of multipliers.  The next, a 16x8, will use
3 partial products, which is not a particularly natural number for this
algorithm.  But the one after that, 16x11, will use 4 and will be quite
sweet.

Total LUT usage with this algorithm will increase by 3 per additional bit of
M beyond 16.  That is, the number of LUTs for an Nx5 multiplier will be about

    LUT(Nx5 multiplier) = 17 + 3N,

which gives 65 for N = 16, matching the FG count above.

The 3 adders are hooked up as follows:

LUT#1 \           -- PP0V creates { 1M, 2M, 3M, 4M}
      + LUT#3     -- PS2V creates final result between 0M and 31M
LUT#2 /           -- PP3V creates { 8M,16M,24M,32M}
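
The split into a low and a high partial product can be sketched behaviourally.
This is only my reading of the select tables in the code below, not something
stated in the post: the low digit takes X(1:0) with X(2) weighted -4, and the
high digit is X(4:3) plus that same X(2), in units of 8M.  The names
RECODE_PKG, LOW_DIGIT, HIGH_DIGIT and RECODE_TB are made up for this example.

```vhdl
-- Behavioural sketch only; none of these names appear in the original design.
package RECODE_PKG is
  function LOW_DIGIT (X: natural) return integer;  -- in -4 .. 3, times M
  function HIGH_DIGIT(X: natural) return integer;  -- in  0 .. 4, times 8M
end RECODE_PKG;

package body RECODE_PKG is
  -- X(1:0) minus 4 when X(2) is set.
  function LOW_DIGIT(X: natural) return integer is
  begin
    return (X mod 4) - 4 * ((X / 4) mod 2);
  end LOW_DIGIT;

  -- X(4:3) plus the X(2) that the low digit counted as -4.
  function HIGH_DIGIT(X: natural) return integer is
  begin
    return (X / 8) + ((X / 4) mod 2);
  end HIGH_DIGIT;
end RECODE_PKG;

use work.RECODE_PKG.all;
entity RECODE_TB is
end RECODE_TB;

architecture SIM of RECODE_TB is
begin
  process
  begin
    for X in 0 to 31 loop
      -- The two digits must always recombine to X.
      assert X = 8 * HIGH_DIGIT(X) + LOW_DIGIT(X)
        report "recoding fails for X=" & integer'image(X)
        severity failure;
      -- |low digit| never exceeds 4 and the high digit never exceeds 4,
      -- which is why each partial product only needs 1M, 2M, 3M or 4M.
      assert abs(LOW_DIGIT(X)) <= 4 and HIGH_DIGIT(X) <= 4
        report "digit out of range for X=" & integer'image(X)
        severity failure;
    end loop;
    report "recoding checked for all 32 values of X";
    wait;
  end process;
end SIM;
```

For example, X = 31 becomes 4 x 8M - 1M, which is the case where both partial
products are non-zero and the final adder has to subtract.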

The usual algorithm for multiplying by 5 bits on a Virtex will require 4 LUTs
per bit.  The adder tree will look like this (maybe a slightly different
topology will be better):

LUT#1 \                  -- creates { 0M, 1M, 2M, 3M}
       + LUT#3 \         -- creates { 0M ... 15M}
LUT#2 /         \        -- creates { 0M, 4M, 8M,12M}
                 + LUT#4 -- creates final result between 0M and 31M
(M)-------------/        -- creates { 0M,16M} (AND gate absorbed into LUT#4)

For extremely wide multiplies, the savings of the new algorithm approach
25% over the old technique, since it needs about 3 LUTs per bit of M
instead of 4.
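
A rough way to see where the 25% comes from is to put the two estimates side
by side.  The sketch below takes 17 + 3N for the new algorithm from the text
and, as a simplifying assumption of mine, exactly 4N for the usual adder tree
(ignoring any fixed overhead it may have).  LUT_EST_PKG and its function names
are hypothetical.

```vhdl
-- Sketch only: LUT count estimates for an Nx5 multiplier.
package LUT_EST_PKG is
  function NEW_LUTS   (N: positive) return integer;  -- 17 + 3N, from the post
  function OLD_LUTS   (N: positive) return integer;  -- 4N, overhead ignored
  function SAVINGS_PCT(N: positive) return integer;  -- whole percent, truncated
end LUT_EST_PKG;

package body LUT_EST_PKG is
  function NEW_LUTS(N: positive) return integer is
  begin
    return 17 + 3 * N;
  end NEW_LUTS;

  function OLD_LUTS(N: positive) return integer is
  begin
    return 4 * N;
  end OLD_LUTS;

  function SAVINGS_PCT(N: positive) return integer is
  begin
    return (100 * (OLD_LUTS(N) - NEW_LUTS(N))) / OLD_LUTS(N);
  end SAVINGS_PCT;
end LUT_EST_PKG;

use work.LUT_EST_PKG.all;
entity LUT_EST_TB is
end LUT_EST_TB;

architecture SIM of LUT_EST_TB is
begin
  process
    type N_ARRAY is array (natural range <>) of positive;
    constant WIDTHS: N_ARRAY := (16, 32, 64, 256, 1024);
  begin
    -- The 16-bit case must reproduce the 65 FGs reported for this design.
    assert NEW_LUTS(16) = 65
      report "estimate does not match the 16x5 design" severity failure;
    for K in WIDTHS'range loop
      report "N=" & integer'image(WIDTHS(K)) &
             "  new=" & integer'image(NEW_LUTS(WIDTHS(K))) &
             "  old=" & integer'image(OLD_LUTS(WIDTHS(K))) &
             "  savings%=" & integer'image(SAVINGS_PCT(WIDTHS(K)));
    end loop;
    wait;
  end process;
end SIM;
```

Under these assumptions the fixed 17-LUT overhead dominates at N = 16, so the
advantage only shows up as M gets wider.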


-- Multiplier code, 16x5 multiplier
-- Design by Carl Brannen.
-- Uses 3 + 2 bit coding.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

entity MUL16x5S is
    port (
        CLK:  in  STD_LOGIC;
        M:    in  STD_LOGIC_VECTOR(15 downto 0);
        X:    in  STD_LOGIC_VECTOR( 4 downto 0);
        Y:    out STD_LOGIC_VECTOR(20 downto 0);
        TEST: in  STD_LOGIC
    );
end MUL16x5S;

architecture MUL16x5S_arch of MUL16x5S is
component LUT4 port (
  I0: in  STD_LOGIC;
  I1: in  STD_LOGIC;
  I2: in  STD_LOGIC;
  I3: in  STD_LOGIC;
  O:  out STD_LOGIC);
end component;
attribute INIT: string;
component XORCY port (
  CI: in  STD_LOGIC;
  LI: in  STD_LOGIC;
  O:  out STD_LOGIC);
end component;
component MUXCY port (
  DI: in  STD_LOGIC;
  CI: in  STD_LOGIC;
  S:  in  STD_LOGIC;
  O:  out STD_LOGIC);
end component;
component MULT_AND port (
  I0: in  STD_LOGIC;
  I1: in  STD_LOGIC;
  LO: out STD_LOGIC);
end component;
component FDR port (
  D:  in  STD_LOGIC;
  C:  in  STD_LOGIC;
  R:  in  STD_LOGIC;
  Q:  out STD_LOGIC);
end component;
signal EM0V:      STD_LOGIC_VECTOR(18 downto 0);   -- Extended M[]
signal PP0PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PP0.P & PP0.N
signal PP0V:      STD_LOGIC_VECTOR(20 downto 0);   -- PP0.V
signal PP0M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PP0_LUT:   STD_LOGIC_VECTOR(17 downto 1);   -- LUT      
signal PP0_MA:    STD_LOGIC_VECTOR(17 downto 1);   -- MULT_AND
signal PP0_XC:    STD_LOGIC_VECTOR(17 downto 1);   -- XORCY
signal PP0_CRY:   STD_LOGIC_VECTOR(18 downto 1);   -- Carry
signal PP0_SUM:   STD_LOGIC_VECTOR(17 downto 1);   -- Sum output
signal EM3V:      STD_LOGIC_VECTOR(20 downto 0);   -- Extended M[]
signal PP3PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PP3.P & PP3.N
signal PP3V:      STD_LOGIC_VECTOR(20 downto 2);   -- PP3.V
signal PP3M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PP3_LUT:   STD_LOGIC_VECTOR(20 downto 3);   -- LUT      
signal PP3_MA:    STD_LOGIC_VECTOR(20 downto 3);   -- MULT_AND
signal PP3_XC:    STD_LOGIC_VECTOR(20 downto 3);   -- XORCY
signal PP3_CRY:   STD_LOGIC_VECTOR(21 downto 3);   -- Carry
signal PP3_SUM:   STD_LOGIC_VECTOR(20 downto 3);   -- Sum output
signal PS2SEL:    STD_LOGIC_VECTOR( 3 downto 0);   -- Select bit
signal PS2PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PS2.P & PS2.N
signal PS2V:      STD_LOGIC_VECTOR(20 downto 0);   -- PS2.V
signal PS2M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PS2_LUT:   STD_LOGIC_VECTOR(20 downto 2);   -- LUT      
signal PS2_MA:    STD_LOGIC_VECTOR(20 downto 2);   -- MULT_AND
signal PS2_XC:    STD_LOGIC_VECTOR(20 downto 2);   -- XORCY
signal PS2_CRY:   STD_LOGIC_VECTOR(21 downto 2);   -- Carry
signal PS2_SUM:   STD_LOGIC_VECTOR(20 downto 2);   -- Sum output
signal YRES:      STD_LOGIC;                       -- Reset final FF
signal YQ:        STD_LOGIC_VECTOR(20 downto 0);   -- Final FF

begin
EM0V(18 downto 0) <= "000" & M(15 downto 0);
PP0V(0) <= (EM0V(0) and X(0));
PP0V(1) <= (EM0V(0) and X(1)) xor (EM0V(1) and X(0));
PP0V(17 downto 2) <= PP0_SUM(17 downto 2);
PP0V(20 downto 18) <= "000";
PP0PN(0) <= X(2);      -- Negative bit
with X(2 downto 0) select  -- Positive bit
  PP0PN(1) <=
    '1' when "001" | "010" | "011",
    '0' when others;
with X(2 downto 0) select
  PP0M(1 downto 0) <=
    "01" when "111" | "001",  -- PP0V <= 1M
    "00" when "110" | "010",  -- PP0V <= 2M
    "11" when "101" | "011",  -- PP0V <= 3M
    "10" when others;         -- PP0V <= 4M
PP0_CRY(1) <= '0';
A0: for I in 1 to 17 generate B: block
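  -- INIT "7484" by mode PP0M: "01" -> EM0V(I) (1M), "00" -> EM0V(I-1) (2M),
  -- "11" -> EM0V(I) xor EM0V(I-1) (3M, carries via MULT_AND/MUXCY),
  -- "10" -> 0 (4M: MUXCY passes EM0V(I-1), so the sum bit is EM0V(I-2)).
  attribute INIT of L0: label is "7484";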
begin
L0: LUT4 port map(
  I0 => PP0M(1),
  I1 => EM0V(I-1),
  I2 => PP0M(0),
  I3 => EM0V(I),
  O  => PP0_LUT(I));
  MA: MULT_AND port map (
    I0  => PP0M(1),
    I1  => EM0V(I-1),
    LO  => PP0_MA(I));
  MC: MUXCY port map (
    DI  => PP0_MA(I),
    CI  => PP0_CRY(I),
    S   => PP0_LUT(I),
    O   => PP0_CRY(I+1));
  XC: XORCY port map (
    CI  => PP0_CRY(I),
    LI  => PP0_LUT(I),
    O   => PP0_SUM(I));
end block b; end generate;
-- The three low bits are padding; TEST must be '0' for a correct product.
EM3V(20 downto 0) <= "00" & M(15 downto 0) & TEST & TEST & TEST;
PP3V(2) <= '0';
PP3V(20 downto 3) <= PP3_SUM(20 downto 3);
PP3PN(0) <= '0';           -- Negative bit (never negative)
with X(4 downto 2) select  -- Positive bit
  PP3PN(1) <=
    '0' when "000",
    '1' when others;       -- Usually positive
with X(4 downto 2) select
  PP3M(1 downto 0) <=
    "01" when "001" | "010",  -- PP3V <= 1M
    "00" when "011" | "100",  -- PP3V <= 2M
    "11" when "101" | "110",  -- PP3V <= 3M
    "10" when others;         -- PP3V <= 4M
PP3_CRY(3) <= '0';
A3: for I in 3 to 20 generate B: block
  attribute INIT of L3: label is "7484"; -- See PP0V
begin
L3: LUT4 port map(
    I0 => PP3M(1),
    I1 => EM3V(I-1),
    I2 => PP3M(0),
    I3 => EM3V(I),
    O  => PP3_LUT(I));
  MA: MULT_AND port map (
    I0  => PP3M(1),
    I1  => EM3V(I-1),
    LO  => PP3_MA(I));
  MC: MUXCY port map (
    DI  => PP3_MA(I),
    CI  => PP3_CRY(I),
    S   => PP3_LUT(I),
    O   => PP3_CRY(I+1));
  XC: XORCY port map (
    CI  => PP3_CRY(I),
    LI  => PP3_LUT(I),
    O   => PP3_SUM(I));
end block b; end generate;
PS2V(1 downto 0) <= PP0V(1 downto 0);
PS2V(20 downto 2) <= PS2_SUM(20 downto 2);
PS2SEL <= PP3PN(1 downto 0) & PP0PN(1 downto 0);
with PS2SEL select
  PS2PN(1 downto 0) <=                            -- Result of sum:
    "01" when "0101" | "0100" | "0110" | "0001",  --  Negative
    "10" when "1001" | "1000" | "1010" | "0010",  --  Positive
    "00" when others;
with PS2SEL select
  PS2M(1 downto 0) <=           -- Mode:
    "00" when "0100" | "1000",  --  A
    "01" when "0001" | "0010",  --  B
    "10" when "0110" | "1001",  -- A-B
    "11" when others;           -- A+B
with PS2M(1 downto 0) select
  PS2_CRY(2) <=
    (        '0'        ) when "00",      -- A:   CIN = 0
    (        '0'        ) when "01",      -- B:   CIN = 0
    (PP0V(1) nor PP0V(0)) when "10",      -- A-B: CIN = 1 only if PP0V(1:0) = 0
    (        '0'        ) when others;    -- A+B: CIN = 0
S2: for I in 2 to 20 generate B: block
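  -- INIT "7C86" by mode PS2M, with A = PP3V(I) and B = PP0V(I):
  -- "00" -> A, "01" -> B, "10" -> A xnor B (A-B), "11" -> A xor B (A+B).
  attribute INIT of L2: label is "7C86";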
begin
L2: LUT4 port map(
    I0 => PS2M(1),
    I1 => PP3V(I),
    I2 => PS2M(0),
    I3 => PP0V(I),
    O  => PS2_LUT(I));
  MA: MULT_AND port map (
    I0  => PS2M(1),
    I1  => PP3V(I),
    LO  => PS2_MA(I));
  MC: MUXCY port map (
    DI  => PS2_MA(I),
    CI  => PS2_CRY(I),
    S   => PS2_LUT(I),
    O   => PS2_CRY(I+1));
  XC: XORCY port map (
    CI  => PS2_CRY(I),
    LI  => PS2_LUT(I),
    O   => PS2_SUM(I));
end block b; end generate;
YRES <= not PS2PN(1);  -- Clear Y when the sum is not flagged positive (X = 0)
F0: for I in 0 to 20 generate
FR: FDR port map (
  D => PS2V(I),
  C => CLK,
  R => YRES,
  Q => YQ(I));
end generate;
Y <= YQ(20 downto 0);
end MUL16x5S_arch;
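
For anyone who wants to repeat the simulation, a self-checking testbench
along the following lines should do.  It is a sketch, not the bench used for
the results above: it assumes the Xilinx UNISIM simulation models for LUT4,
MUXCY, XORCY, MULT_AND and FDR are compiled and bound, that TEST is tied to
'0', and that Y is valid one clock after M and X are applied.  The names
MUL16x5S_TB, M_VALUES and TEST0 are made up, and the M values are just an
arbitrary sample.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;

entity MUL16x5S_TB is
end MUL16x5S_TB;

architecture SIM of MUL16x5S_TB is
  component MUL16x5S port (
    CLK:  in  STD_LOGIC;
    M:    in  STD_LOGIC_VECTOR(15 downto 0);
    X:    in  STD_LOGIC_VECTOR( 4 downto 0);
    Y:    out STD_LOGIC_VECTOR(20 downto 0);
    TEST: in  STD_LOGIC);
  end component;
  signal CLK:   STD_LOGIC := '0';
  signal TEST0: STD_LOGIC := '0';   -- assumed: TEST held low
  signal M:     STD_LOGIC_VECTOR(15 downto 0) := (others => '0');
  signal X:     STD_LOGIC_VECTOR( 4 downto 0) := (others => '0');
  signal Y:     STD_LOGIC_VECTOR(20 downto 0);
  type M_ARRAY is array (natural range <>) of natural;
  -- Arbitrary sample of M: corners plus alternating bit patterns.
  constant M_VALUES: M_ARRAY := (0, 1, 2, 3, 21845, 43690, 32768, 65535);
begin
  UUT: MUL16x5S port map (
    CLK => CLK, M => M, X => X, Y => Y, TEST => TEST0);

  process
  begin
    for K in M_VALUES'range loop
      for XV in 0 to 31 loop
        -- Apply operands while the clock is low, then clock them through.
        M   <= std_logic_vector(to_unsigned(M_VALUES(K), 16));
        X   <= std_logic_vector(to_unsigned(XV, 5));
        CLK <= '0';
        wait for 10 ns;
        CLK <= '1';
        wait for 10 ns;
        -- Compare the registered output against a plain integer multiply.
        assert to_integer(unsigned(Y)) = M_VALUES(K) * XV
          report "mismatch at M=" & integer'image(M_VALUES(K)) &
                 " X=" & integer'image(XV)
          severity error;
      end loop;
    end loop;
    report "MUL16x5S_TB finished";
    wait;
  end process;
end SIM;
```

Looping X over all 32 values for each M exercises every combination of the
mode selects, including X = 0, which relies on the reset of the output
register.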

Carl

