From: "Carl Brannen" <carl.brannen@terabeam.com>
Newsgroups: comp.arch.fpga
Subject: Multiplying by squaring using Block RAM.
Date: Sun, 16 Dec 2001 13:29:17 +0000 (UTC)
Organization: Mailgate.ORG Server - http://www.Mailgate.ORG
Lines: 54
Message-ID: <8483fe9606f88fc24f7444a2f6b29094.51709@mygate.mailgate.org>
NNTP-Posting-Host: firewall.terabeam.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.mailgate.org 1008488909 25897 216.137.15.2 (Sun Dec 16 14:29:17 2001)
X-Complaints-To: abuse@mailgate.org
NNTP-Posting-Date: Sun, 16 Dec 2001 13:29:17 +0000 (UTC)
Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008488909
User-Agent: Mailgate Web Server
X-URL: http://www.Mailgate.ORG
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12427

I'm not sure if people are doing this already, but I couldn't find a reference
on the Xilinx web site.

Block RAMs make more efficient squaring circuits than they do multipliers.  And
you can get multipliers out of squarers.

An explanation for the arithmetic.  Let A and B be the numbers to be
multiplied.

Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
Then C - D = 4AB.

This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2
architectures because the block RAM is dual port.  That means you can use one
side for the (A-B)**2 calculation and the other side for the (A+B)**2
calculation.

With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16.  It has 8
inputs and 16 outputs in two sections.  Each section can conveniently compute
the square of an 8-bit number.  Note that the lowest two bits of the two
squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match
two bits), so you don't have to subtract bits 1 and 0 of the two squares.

If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you
can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B",
and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one
block RAM.  Maybe there's a way to get the bit back, and let A and B be 8-bit
numbers; I haven't looked at it long enough to conclude there isn't.

The circuit uses about half the LUTs required by the standard algorithm, at an
expense of one block RAM.

To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39
slices:
http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf

Using RAMB4s alone to implement even a 7x7 multiply would require a huge number
of them, as multiplies require twice as many address inputs as squares.

You can iterate on the calculation of the square.  That is, if A is too big to
square in a single operation, then break A into two parts.  With A broken into
two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and
compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2.

Breaking A and B into more than 3 parts may be worth exploring, for certain bit
sizes.

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

######

From: Christopher.Saunter@durham.ac.uk (Christopher Saunter)
Newsgroups: comp.arch.fpga
Subject: Re: Multiplying by squaring using Block RAM.
Date: Mon, 17 Dec 2001 14:22:43 +0000 (UTC)
Organization: University of Durham, Durham, UK
Lines: 74
Message-ID: <9vkv3j$i2e$1@sirius.dur.ac.uk>
References: <8483fe9606f88fc24f7444a2f6b29094.51709@mygate.mailgate.org>
NNTP-Posting-Host: deneb.dur.ac.uk
X-Trace: sirius.dur.ac.uk 1008598963 18510 129.234.4.80 (17 Dec 2001 14:22:43 GMT)
X-Complaints-To: usenet@durham.ac.uk
NNTP-Posting-Date: Mon, 17 Dec 2001 14:22:43 +0000 (UTC)
X-Newsreader: TIN [version 1.2 PL2]
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!newsfeed01.sul.t-online.de!newsfeed00.sul.t-online.de!t-online.de!diablo.theplanet.net!easynet-monga!easynet.net!server5.netnews.ja.net!nntphost.dur.ac.uk!deneb.dur.ac.uk!dph1cds
Xref: chonsp.franklin.ch comp.arch.fpga:12444

Hi Carl,
	You da man!

I was thinkng about using a BlockRAM to do squaring in a project of mine,
more for the latency issues than device usage in my particular case, and
was left with one 'regular' multiply.  I like your sugestion.  I'll try it
out in the new year and get back to the group with any issues it raises,
or sucesses.

It seems so simple...  I guess I don't need to feel so bad about only
having Virtex hardware to play with, and no hardware multipliers ;-)

As an aside, thanks to the posters who answered my previous article about
modifying blockram contents in a bitstream, I haven't got round to trying
anything yet, but I will do...

Regards,
	Chris Saunter

 Carl Brannen (carl.brannen@terabeam.com) wrote:
: I'm not sure if people are doing this already, but I couldn't find a reference
: on the Xilinx web site.

: Block RAMs make more efficient squaring circuits than they do multipliers.  And
: you can get multipliers out of squarers.

: An explanation for the arithmetic.  Let A and B be the numbers to be
: multiplied.

: Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
: Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
: Then C - D = 4AB.

: This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2
: architectures because the block RAM is dual port.  That means you can use one
: side for the (A-B)**2 calculation and the other side for the (A+B)**2
: calculation.

: With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16.  It has 8
: inputs and 16 outputs in two sections.  Each section can conveniently compute
: the square of an 8-bit number.  Note that the lowest two bits of the two
: squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match
: two bits), so you don't have to subtract bits 1 and 0 of the two squares.

: If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you
: can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B",
: and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one
: block RAM.  Maybe there's a way to get the bit back, and let A and B be 8-bit
: numbers; I haven't looked at it long enough to conclude there isn't.

: The circuit uses about half the LUTs required by the standard algorithm, at an
: expense of one block RAM.

: To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39
: slices:
: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf

: Using RAMB4s alone to implement even a 7x7 multiply would require a huge number
: of them, as multiplies require twice as many address inputs as squares.

: You can iterate on the calculation of the square.  That is, if A is too big to
: square in a single operation, then break A into two parts.  With A broken into
: two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and
: compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2.

: Breaking A and B into more than 3 parts may be worth exploring, for certain bit
: sizes.

: Carl


: -- 
: Posted from firewall.terabeam.com [216.137.15.2] 
: via Mailgate.ORG Server - http://www.Mailgate.ORG