From: "Carl Brannen" Newsgroups: comp.arch.fpga Subject: Multiplying by squaring using Block RAM. Date: Sun, 16 Dec 2001 13:29:17 +0000 (UTC) Organization: Mailgate.ORG Server - http://www.Mailgate.ORG Lines: 54 Message-ID: <8483fe9606f88fc24f7444a2f6b29094.51709@mygate.mailgate.org> NNTP-Posting-Host: firewall.terabeam.com Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit X-Trace: news.mailgate.org 1008488909 25897 216.137.15.2 (Sun Dec 16 14:29:17 2001) X-Complaints-To: abuse@mailgate.org NNTP-Posting-Date: Sun, 16 Dec 2001 13:29:17 +0000 (UTC) Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008488909 User-Agent: Mailgate Web Server X-URL: http://www.Mailgate.ORG Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail Xref: chonsp.franklin.ch comp.arch.fpga:12427 I'm not sure if people are doing this already, but I couldn't find a reference on the Xilinx web site. Block RAMs make more efficient squaring circuits than they do multipliers. And you can get multipliers out of squarers. An explanation for the arithmetic. Let A and B be the numbers to be multiplied. Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2 Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2 Then C - D = 4AB. This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 architectures because the block RAM is dual port. That means you can use one side for the (A-B)**2 calculation and the other side for the (A+B)**2 calculation. With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 inputs and 16 outputs in two sections. Each section can conveniently compute the square of an 8-bit number. Note that the lowest two bits of the two squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match two bits), so you don't have to subtract bits 1 and 0 of the two squares. If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one block RAM. Maybe there's a way to get the bit back, and let A and B be 8-bit numbers; I haven't looked at it long enough to conclude there isn't. The circuit uses about half the LUTs required by the standard algorithm, at an expense of one block RAM. To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 slices: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf Using RAMB4s alone to implement even a 7x7 multiply would require a huge number of them, as multiplies require twice as many address inputs as squares. You can iterate on the calculation of the square. That is, if A is too big to square in a single operation, then break A into two parts. With A broken into two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2. Breaking A and B into more than 3 parts may be worth exploring, for certain bit sizes. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORG ###### From: Christopher.Saunter@durham.ac.uk (Christopher Saunter) Newsgroups: comp.arch.fpga Subject: Re: Multiplying by squaring using Block RAM. Date: Mon, 17 Dec 2001 14:22:43 +0000 (UTC) Organization: University of Durham, Durham, UK Lines: 74 Message-ID: <9vkv3j$i2e$1@sirius.dur.ac.uk> References: <8483fe9606f88fc24f7444a2f6b29094.51709@mygate.mailgate.org> NNTP-Posting-Host: deneb.dur.ac.uk X-Trace: sirius.dur.ac.uk 1008598963 18510 129.234.4.80 (17 Dec 2001 14:22:43 GMT) X-Complaints-To: usenet@durham.ac.uk NNTP-Posting-Date: Mon, 17 Dec 2001 14:22:43 +0000 (UTC) X-Newsreader: TIN [version 1.2 PL2] Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!newsfeed01.sul.t-online.de!newsfeed00.sul.t-online.de!t-online.de!diablo.theplanet.net!easynet-monga!easynet.net!server5.netnews.ja.net!nntphost.dur.ac.uk!deneb.dur.ac.uk!dph1cds Xref: chonsp.franklin.ch comp.arch.fpga:12444 Hi Carl, You da man! I was thinkng about using a BlockRAM to do squaring in a project of mine, more for the latency issues than device usage in my particular case, and was left with one 'regular' multiply. I like your sugestion. I'll try it out in the new year and get back to the group with any issues it raises, or sucesses. It seems so simple... I guess I don't need to feel so bad about only having Virtex hardware to play with, and no hardware multipliers ;-) As an aside, thanks to the posters who answered my previous article about modifying blockram contents in a bitstream, I haven't got round to trying anything yet, but I will do... Regards, Chris Saunter Carl Brannen (carl.brannen@terabeam.com) wrote: : I'm not sure if people are doing this already, but I couldn't find a reference : on the Xilinx web site. : Block RAMs make more efficient squaring circuits than they do multipliers. And : you can get multipliers out of squarers. : An explanation for the arithmetic. Let A and B be the numbers to be : multiplied. : Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2 : Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2 : Then C - D = 4AB. : This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 : architectures because the block RAM is dual port. That means you can use one : side for the (A-B)**2 calculation and the other side for the (A+B)**2 : calculation. : With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 : inputs and 16 outputs in two sections. Each section can conveniently compute : the square of an 8-bit number. Note that the lowest two bits of the two : squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match : two bits), so you don't have to subtract bits 1 and 0 of the two squares. : If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you : can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", : and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one : block RAM. Maybe there's a way to get the bit back, and let A and B be 8-bit : numbers; I haven't looked at it long enough to conclude there isn't. : The circuit uses about half the LUTs required by the standard algorithm, at an : expense of one block RAM. : To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 : slices: : http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf : Using RAMB4s alone to implement even a 7x7 multiply would require a huge number : of them, as multiplies require twice as many address inputs as squares. : You can iterate on the calculation of the square. That is, if A is too big to : square in a single operation, then break A into two parts. With A broken into : two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and : compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2. : Breaking A and B into more than 3 parts may be worth exploring, for certain bit : sizes. : Carl : -- : Posted from firewall.terabeam.com [216.137.15.2] : via Mailgate.ORG Server - http://www.Mailgate.ORG