From: "Carl Brannen" <carl.brannen@terabeam.com>
Newsgroups: comp.arch.fpga
Subject: Efficient new multiplier for Spartan2, Virtex &c.
Date: Wed, 19 Dec 2001 18:20:06 +0000 (UTC)
Organization: Mailgate.ORG Server - http://www.Mailgate.ORG
Lines: 174
Message-ID: <4172938cdfe1a23d2d483ee9c9d0460e.51709@mygate.mailgate.org>
NNTP-Posting-Host: firewall.terabeam.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.mailgate.org 1008777918 24653 216.137.15.2 (Wed Dec 19 19:20:06 2001)
X-Complaints-To: abuse@mailgate.org
NNTP-Posting-Date: Wed, 19 Dec 2001 18:20:06 +0000 (UTC)
Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008777918
User-Agent: Mailgate Web Server
X-URL: http://www.Mailgate.ORG
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12510

I haven't seen this algorithm published.  If it is original, I'm
throwing it into the public domain. 

-- Efficient fall through multiplier in Xilinx Spartan2, Virtex,
-- VirtexE.  (Will work in Virtex2, but they have dedicated
-- multipliers so they probably don't need this.)
--
-- This fall through multiplier gets 3 bits per initial adder
-- rather than the usual 2 bits.  This is accomplished by taking
-- advantage of under utilized adders in a standard multiplier.
--
-- This is an efficient multiplier when the multiplier has
-- 3n - 1 bits, with n>1.  Where it really rocks is when the
-- the multiplier has (3n-1)*2^m bits.
--
-- This is in contrast to the usual Xilinx Virtex multiplier
-- which is relatively efficient when the multiplier has
-- 2*2^m bits.
--
-- For large enough multiplies, this algorithm gets more and
-- more efficient, compared to the usual Xilinx multiplier.
--
-- While a standard multiplier uses the MULT_AND logic well, the
-- stages that add up the partial product results are simple adders
-- built from LUT2s.  Any time you see a big array of LUTs used with
-- less than all four inputs needed you have to wonder if there's
-- a more efficient way of packing the logic in.
--
-- A note on the usual Xilinx multiplier.  (Those skilled in the art
-- should skip this section.)
--
-- The usual multiplier uses two bits per partial product.  The least
-- significant partial product produces one of {0M, 1M, 2M, 3M}, while the
-- next least significant produces one of {0M, 4M, 8M, 12M}.  Adding
-- together these two results produces any multiple of M from 0M to 15M:
--
--  0M + 0M =  0M
--  0M + 1M =  1M
--  0M + 2M =  2M
--  0M + 3M =  3M
--  4M + 0M =  4M
--  4M + 1M =  5M
--  4M + 2M =  6M
--  4M + 3M =  7M
-- ...
-- 12M + 2M = 14M
-- 12M + 3M = 15M
--
--
-- Instead of keeping all my numbers as positive (or 2's complement
-- negative) numbers, I save bits if I allow negative numbers and
-- keep a single bit that indicates that the result is to be interpreted
-- as a negative number.  This is how Seymour Cray did his arithmetic,
-- but I don't know if he used this internally to his multipliers.
-- But if I assume that my single column of slices is only going to
-- give me one of four possible multiples of M, I have a problem.  If
-- I choose the four values to be {0M, 1M, 2M, 3M}, then I only get
-- seven values when I negate them as -0M = 0M.  But for a 3-bit
-- coded multiplier I'm going to need 2^3 = 8 values from each slice.
-- If I use {1M, 2M, 3M, 4M}, then I miss zero.
--
-- The solution is to use two sign bits, one for positive numbers, the
-- other for negative.  Let M = 5, so the nine possible values are as
-- follows:  (Note that I only need 8 of these nine values.)
--
--  P  N  Mult | Value
--  -  -  ---- + -----
--  0  1   4   | -20   (i.e. -4*M)
--  0  1   3   | -15
--  0  1   2   | -10
--  0  1   1   | - 5
--  0  0   X   |   0
--  1  0   1   |   5
--  1  0   2   |  10
--  1  0   3   |  15
--  1  0   4   |  20   (i.e.  4*M)
--
-- This would be exactly what the doctor ordered if the multiplier
-- were in a base slightly different from the octal.  With octal, the
-- eight values that a digit can take are the familiar {0,1, ... 7}.
-- With this unusual base, the 8 values that a digit can take are
-- instead {-4,-3,-2,-1,0,1,2,3}.  (I choose to keep these rather than
-- -3 through 4 because it saves a LUT somewhere.)
--
-- So I need a base conversion between base 8 and base, well, I'll
-- call it base 8-4.  Here's how numbers are interpreted in the
-- two bases:
--
-- Base 8:    A = (A0  ) + (A1  )*8 + (A2  )*8^2 + ...
-- Base 8-4:  B = (B0-4) + (B1-4)*8 + (B2-4)*8^2 + ...
--
-- From this it is obvious that to convert the number "A" in base 8
-- to base 8-4, I need merely add the octal constant o444444... to "A".
-- This perfectly converts it to the corresponding (i.e. carrying the
-- same numerical value) number in base 8-4.
--
-- This conversion is very convenient when the multiplier has a lot of
-- bits, but it isn't needed for relatively short multipliers.  In
-- particular, a multiplier of n x 5 would do well to avoid performing
-- the base conversion explicitly.
--
--
-- After performing the base conversion, I take each digit from B
-- (where B = A + o44...44. = A + "100100100...100100") and use
-- it to create a single partial product.  For an n x (3m-1) multiply,
-- I'll end up with m partial products.
--
-- I can't just add up the partial products because they're not in
-- the usual format for 2's complement arithmetic.  I'll have to add
-- extra logic to the adder stages in order to handle the signed
-- values being added.  It turns out that there is exactly enough
-- freedom in a Xilinx slice in order to handle these explicitly
-- signed numbers.
--
-- With two mode bits, I can choose 4 functions in a Xilinx arithmetic
-- slice.  The table to add two signed numbers (each with "N" and "P"
-- bits, and each with a partial sum in the set 1M to 4M) is fairly
-- complicated.  Let "S" be the higher precision partial product, and
-- "T" be the lower precision number.
--
-- Rather than be confusing, I'll denote the "N" bit of "S" by "S.N",
-- same with the "P" bit, and I'll denote the unsigned vector part
-- of "S" by "S.V".  Same with "T".  Then the function to add "S" and
-- "T" is as follows:  (Note that since "S" is higher in signficance
-- than "T", it follows that "S.V" is larger, as an unsigned number,
-- than "T.V", except if "S" is zero, in which case "S.V" is a don't
-- care.  For this reason, all the subtractions in the following
-- table result in unsigned integers.)
--
--     "S"         "T"            "S+T"
--  ---------   ---------   ------------------
--   S  P N V    T  P N V    S+T      P N   V
--  --  - - -   --  - - -   ------    - -  ---
--  -A  0 1 A   -B  0 1 B   -(A+B)    0 1  A+B
--  -A  0 1 A    0  0 0 X   -(A  )    0 1   A
--  -A  0 1 A    B  1 0 B   -(A-B)    0 1  A-B
--   0  0 0 A   -B  0 1 B   -(  B)    0 1   B
--   0  0 0 A    0  0 0 X      0      0 0   X
--   0  0 0 A    B  1 0 B    (  B)    1 0   B
--   A  1 0 A   -B  0 1 B    (A-B)    1 0  A-B
--   A  1 0 A    0  0 0 X    (A  )    1 0   A
--   A  1 0 A    B  1 0 B    (A_B)    1 0  A+B
--
-- From this, it is clear that "(S+T).N" and "(S+T).P" are
-- simple (4-LUT) functions of "S.N", "S.P", "T.N" and "T.P".
-- It's also clear that "(S+T).V" is computed by one of the
-- four functions {A+B, A-B, A, B}.  It turns out that these
-- four arithmetic functions exactly fit in a single slice.
--
-- It's all very well that I keep the internal partial sums
-- in this odd notation, but it won't do to deliver that to
-- the customer.  So how do I compute the final result?
--
-- First of all, the final result can't have the "N" bit set,
-- because this is an unsigned multiply.  If it has the "P" bit
-- set, then the correct product shows up on "(S+T).V" and I'm
-- done.  The only hairy case is when the "P" bit is low.  In
-- that case, the final result will be supposed to be zero, but
-- "(S+T).V" will be an "X".  For this reason, I have to connect
-- up the final "P" result to the synchronous reset pin of the
-- final register in such a way that when "P" is zero, the
-- final result register is held to zero.
--
-- This comment has gone on long enough that I'm including it
-- as a separate note rather than with the VHDL.  I'll add VHDL
-- code for a sample multipliers as replies to this note, but
-- these aren't easy to build, so give me some time.

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG