From: "Carl Brannen" <carl.brannen@terabeam.com>
Newsgroups: comp.arch.fpga
Subject: Low area barrel shift puts 3 to 1 mux in a Xilinx LUT:
Date: Wed, 19 Dec 2001 08:42:45 +0000 (UTC)
Organization: Mailgate.ORG Server - http://www.Mailgate.ORG
Lines: 768
Message-ID: <d59d30d03eadb65299bb04ac2507f24d.51709@mygate.mailgate.org>
NNTP-Posting-Host: firewall.terabeam.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.mailgate.org 1008744250 13480 216.137.15.2 (Wed Dec 19 09:42:45 2001)
X-Complaints-To: abuse@mailgate.org
NNTP-Posting-Date: Wed, 19 Dec 2001 08:42:45 +0000 (UTC)
Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008744250
User-Agent: Mailgate Web Server
X-URL: http://www.Mailgate.ORG
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12512

This 16x9 barrel shifter uses only 38 LUTs.  There are
two stages of 3 to 1 muxes, giving a total mux of 9 to 1.

The 3 to 1 muxes are implemented using a "trick" that
involves the use of the carry logic.

This dimension barrel shifter is rather inconvenient for
conventional barrel shifters, which will require 4 stages
of logic.  This one does it in only two stages, but I have
to add 2 LUTs for the two mode pins required by each stage,
and 1 LUT for the carry input to each stage.  Thus the
total LUT count is 2 * (16 + 3) = 38.

This is considerably below the conventional barrel shifter
requirement for this size barrel shifter of 4 * 16 = 64.

I'm putting this up for people to admire and critique.
If you're going to use it, I'd note that getting stuff like
this to synthesize correctly is not particularly easy.

I've tried to include design notes to illustrate the pitfalls
that arise when the design is modified.

Also, I'd note that it's been a few days since I simulated
this logic, and it's possible that in the process of forcing
it into its designed number of LUTs I messed something up.
This was written for fun, I am not using it in any work.  So
beware.

library IEEE;
use IEEE.std_logic_1164.all;

-- Efficient Barrel Shifter, fully pipelined.
--
-- 16-input x 9 barrel shifter.  Uses two stages of logic.
-- Requires only 39FGs, 30CYs, and 34DFFs.
--
-- Designer:  Carl Brannen
--
-- Feel free to modify this circuit and use it in
-- your own designs. I am aware of no patents that
-- it infringes on, but you will have to make your
-- own determination of this.  My only request is
-- that you leave a comment to the effect that your
-- knowledge of the algorithm is through me.  Of course
-- this is freeware and I make no guarantees that it
-- will be at all helpful to you.
--
-- Synthesize with optimize set for "low", and 
-- "area".  This circuit is already optimized,
-- the computer will only be waste its time (and likely
-- increase the size and delay of the result) if it
-- tries to optimize further.

entity BRL16_8 is
   port (
      CLK:    in  STD_LOGIC;
      DIN:    in  STD_LOGIC_VECTOR(15 downto 0);
      SHIFT:  in  STD_LOGIC_VECTOR( 3 downto 0);
      Y:      out STD_LOGIC_VECTOR(15 downto 0)
   );
end BRL16_8;

architecture BRL16_8_arch of BRL16_8 is

--
-- The standard version of a 16x8 barrel shifter uses
-- 48FGs and 51DFFs, and a 16x9 barrel shifter would use
-- 64FGs.
--
-- The usual barrel shifter uses stages that shift by either 0
-- or 2^n bits.  This barrel shifter uses stages that shift
-- by three different amounts instead of two.
--
-- The usual barrel shifter stage consists of a vector of 2 to 1
-- muxes.  A stage in this barrel shifter is instead, effectively,
-- a 3 to 1 mux.  It nevertheless takes up about the same space that
-- the 2 to 1 stage in a regular barrel shifter takes.
--
-- The 3 to 1 muxes can be used in pairs to give a 9 to 1 mux,
-- which is equivalent to three stages of the usual barrel shifter.
--
-- The area reduction is therefore up to 33% of the area used by a
-- regular 8^n size barrel shifter, but it can be more or less for
-- other sizes.
--
-- This barrel shifter type is particularly efficient when the
-- number of bits to be shifted includes a power of 3.  For instance,
-- a 9-bit barrel shifter could be accomplished in only 2 stages,
-- while a regular barrel shifter of that length would require
-- a full 4 stages.  The example here, however, only does shifts
-- by 8, not by 9.  The reason for so restricting it is to get
-- the control circuitry simpler.
--
-- The usual barrel shifter consists of 2 to 1 muxes.  These are
-- implemented in LUT3 (i.e. two data inputs and one select input).
-- Since there is a free control pin, it's clear that more functionality
-- can be packed into the LUT.  In addition, none of the arithmetic
-- functionality of the slice is being used.
--
-- A 4 to 1 mux would allow 2 "bits" of barrel shifting to happen
-- in a single stage, but it's pretty obvious that that goal is
-- beyond us.  There would be 6 inputs required for such a mux, but
-- there are only 4 LUT inputs, and one CIN input.
--
-- But 4 LUT and 1 CIN does give 5 inputs per bit, and this is
-- enough to implement a 3 to 1 mux.  So we need to look at designs
-- that use 3 to 1 muxes.
--
-- I'll combine consecutive 3 to 1 mux stages so that together they
-- implement three bits of a standard barrel shifter.  The overall
-- functionality will be as follows:
--
-- SHFT[2..0]  0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- This functionality can be implemented in two stages using
-- 3 to 1 muxes as follows:
--
-- SHFT[2:0]   0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- SHFA[1:0]   0 1 2 0 1 2 0 1 2 (Shift by 0, 1, or 2)
-- SHFB[2:1]   0 0 0 3 3 3 6 6 6 (Shift by 0, 3, or 6)
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- Into each LUT I'll bring 2 mode control bits and two data
-- bits.  Two of the shifts will correspond to selecting the
-- two data bits.  I'll call these the "mux" shifts.  The 3rd
-- shift is passed through the CIN / COUT lines, so it's called
-- the "arithmetic" shift.
--
-- Before continuing, I need to comment on some mathematics.
-- Barrel shifting, as implemented by stages of shifts, for
-- a barrel shifter of width "n", is equivalent to Z_n.  That
-- is, it is equivalent to addition in the integers modulo n.
-- Another way of putting this is to say that shifting is additive.
-- If you shift by "S" and then shift by "T", the result is
-- equivalent to a shift by "S+T".  The usual barrel shifter
-- uses the binary representation of the shift amount, with
-- stage "m" shifting by either 0 or 2^m.  But the same shift
-- result could be accomplished by shifting by other amounts.
-- The thing to remember is that it is additive.
--
-- Because of this, it's useful to know a little bit about Z_n.
-- I'm going to have to use the CIN to bring in one of the
-- shift quantities, so these CIN to COUT chains are going to
-- depend on the structure of the "orbits" of the shift amounts.
-- "Orbit" has a specific meaning to the pure mathematicians,
-- in this case I'm going to redefine it to mean something
-- similar.  Don't show this to any mathematicians, they'll
-- undoubtedly be disgusted with my abuse of the language.
--
-- A given shift amount creates an orbit out of a start bit
-- by the sequence of places it shifts that bit to as the shift
-- is repeated.  For example, with a barrel shifter width of 8,
-- if the shift amount is 3 bits, the "0" bit is taken through
-- the following sequence: (0, 3, 6, 1, 4, 7, 2, 5), and then
-- back to "0".  This "orbit" has 8 elements.  The orbit of the
-- other bits, under a shift of 3 bits, is going to look very
-- much the same (i.e. be "isomorphic".)  Because of this
-- similarity, I'm only going to look at the orbits of "0".
--
-- The orbits due to other shift amounts may or may not be
-- isomorphic.  For example, the orbit of "0" with a shift
-- amount of 1 is (0, 1, 2, 3, 4, 5, 6, 7), which is indeed
-- isomorphic to the orbit of "3".  But the orbit of "2" is
-- shorter: (0, 2, 4, 6).  Mathematically, the orbits of a
-- shift by n bits in a barrel shift of width w is of maximum
-- length of n and w have no common divisors.
--
-- The isomorphisms of the orbits is not of significance for
-- the shifts that are performed with the usual multiplexers,
-- but are very important for shifts performed using the
-- arithmetic CIN / COUT wires.  The carry logic for that
-- shift forms a chain.  A shift of 1 bit always has the same
-- orbit, the one with maximum length, no matter what the
-- size of the barrel shifter: (0, 1, 2, ... WIDTH-1), where
-- WIDTH is the width of the barrel shifter.  This orbit is
-- preferred to other possible orbits because I have to make
-- sure that CIN = 0 in order to avoid an unwanted arithmetic
-- operation (i.e. an increment) when performing the regular
-- mux type shifts.  This means that I have to control the CIN
-- by forcing it to zero for the two mux shifts, and connecting
-- it to COUT for the one arithmetic shift.  If I select an
-- orbit of less than the full Barrel Shift width for the
-- arithmetic shift, I'll have to build more than one copy of
-- the CIN control logic.
--
-- There's another reason for analyzing the orbit structure of
-- barrel shifters.  The two mode bits routed to each LUT have
-- some freedom.  It's possible that I might be able to arrange
-- for those mode bits to be one of the 3 select inputs, thereby
-- reducing the number of LUTs needed to compute the control
-- lines.  I suppose there is also a chance that I might be
-- able to share a control line between the SHFA and SHFB shifts,
-- though it surely seems unlikely.
--
-- The arithmetic shift is accomplished through the CIN input.
-- This means I'll have to use the XORCY output, and therefore
-- I will have to make sure that COUT = 0 (and therefore CIN = 0
-- for the next bit in the orbit) during the mux shift modes.
-- Now one of the data inputs will have to connect to I0 in
-- order for it to be sent out the COUT.  This will be the
-- arithmetic shift.  When that same data input is to be used
-- as a mux shift (to the current bit), the COUT will automatically
-- be cleared by the arithmetic logic.  (If "I0" is the mux /
-- arithmetic data input, the LUT4 will be programmed to output
-- I0 when selecting I0 as the shift amount.  The MUXCY, which
-- has the LUT4 output as its selector, will then select I0
-- when the LUT4 output is 0, and CIN when the LUT4 output is
-- 1.  Since the LUT4 output in this case is I0, the I0 input
-- will only be selected when it is zero.)  But I have to make
-- sure that I0 input to the MUXCY is zero when selecting the
-- other mux shift. The only way to do this for free is to use
-- the MULT_AND.
-- 
-- This gives the structure of one stage of a general purpose
-- 3-way shifter as the following.  XA and XB are shifts that
-- use the multiplexers, and C is the arithmetic shift.  "M1"
-- and "M0" are the two mode inputs to the LUT, while "XA" and
-- "XB" are the two data inputs.  The mode values "s" need to
-- be selected later, hopefully in a manner that reduces the
-- need for decoding logic.
--
-- SHFA  M1 M0 LUT4  CIN COUT   Result
-- ----  -- -- ----  --- ----   ------
--  XA    s  s  XA     0   0      XA
--  XB    s  0  XB     0   0      XB
--   C    s  1   0     C  XA       C
--
-- There's one remaining bit of mathematical gibberish.
-- In addition to performing isomorphisms between shifts,
-- I can also add a constant shift to a stage for free.
-- That is, I can renumber the outputs of a stage by shifting
-- them around.  This is just a wire change, and it means
-- that given a stage that, for instance, shifts by 2,3, or
-- 4 bits, I can rearrange the output bits on the same stage
-- and make it shift by 4,5, or 6 bits.  (Or 6,7, or 0 bits.)
-- This gives me some freedom in how I assign mode pins,
-- and freedom is very important for minimizing logic functions.
--
-- The effect of the above mathematical note is simple.  I
-- can add an arbitrary shift amount to each stage, in terms
-- of what shift is associated with a particular pattern on
-- the SHFT inputs.  Then I can remove that shift by doing
-- a "wire" barrel shift for free.
--
-- The XB shift is completely arbitrary.  By that I mean
-- it could connect up any way without respect to how the
-- XA and C shifts are done.  But the XA and C shifts are
-- related.  In order for the carry structure to have only
-- one overall CIN, I need to have that XA and C shifts
-- differ (in terms of how many bits they shift) by an amount
-- that corresponds to a full length orbit.
--
-- For the example given, with a barrel shifter width of 8,
-- this means that I have to have the XA and C shifters
-- differ by {1,3,5, or 7} bits.  (Note all arithmetic is to
-- be done modulo the width of the barrel shifter.)  This
-- is rather liberal, as it means that I need only ensure
-- that not all the shift amounts be even or odd for either
-- SHFA or SHFB.
--
-- The other thing to notice is that I only need 3 different
-- shifts but with two mode pins I'm going to have 4 codes.
-- This means that I can map two codes to the same mode.  This
-- may help reduce the amount of control logic. But I'll use
-- the extra degree of freedom to define two states for SHFB == 3.
-- This way, if SHIFT[3] is tied low (and the design reduced
-- from a shift by "0 to 8" to a shift by "0 to 7") an FG will
-- be saved.  The resulting truth table is:
--
--
--                 A0_MODE  A1_MODE
-- SHIFT SHFA SHFB   1  0     1  0
-- ----- ---- ----   -  -     -  -
-- 0000    0    0    0  0     0  0
-- 0001    1    0    1  0     0  0
-- 0010    2    0    0  1     0  0
-- 0011    0    3    0  0     1  0
-- 0100    1    3    1  0     1  1
-- 0101    2    3    0  1     1  1
-- 0110    0    6    0  0     0  1
-- 0111    1    6    1  0     0  1
-- 1000    2    6    0  1     0  1


-- Also note that the original version of this code got the arithmetic
-- mux by connecting the Carry-out back around to the Carry-input.  This
-- generates an apparent "cycle", and causes a "post layout timing report"
-- warning something like the following:
--
--  ----------------------------------------------------------------------
-- ! Warning: The following connections close cycles, and some paths      !
-- !          through these connections may not be analyzed.              !
-- !                                                                      !
-- ! Signal                            Driver            Load             !
-- ! --------------------------------  ----------------  ---------------- !
-- ! U3/A0_CRY2                        LB_R13C8.S1.COUT  CLB_R12C8.S1.CIN !
-- ! U3/A1_CRY8                        CLB_R9C4.S0.COUT  CLB_R8C4.S0.CIN  !
--  ---------------------------------------------------------------------- 
--
-- First of all, this warning may be ignored for this particular circuit.
-- The reason is that the circuit is operated in two modes.  In the non
-- arithmetic shifts, the "0th" carry is forced to be zero by the CIN selector.
-- This propagates through the rest of the circuit, so there is no cycle.
-- In the arithmetic shift mode the carry is always equal to the applied A[]
-- input, and no carries propagate through the circuit at all.
--
-- But it's best to avoid warnings, so the circuitry shown here simply ignores
-- the final carry-out and consequently is manifestly free of cycles.
--
-- There are some useful arithmetic circuits where the COUT has to be connected
-- back to the CIN input but this is not one of them.  A great example of a
-- an arithmetic circuit where cycles have to be dealt with is in the addition
-- circuitry of an ALU designed to add floating point numbers in CRAY notation.
-- Ah, for the days when I was a CPU designer!

component XORCY port (
      CI:     in  STD_LOGIC;
      LI:     in  STD_LOGIC;
      O:      out STD_LOGIC);
end component;

component MUXCY port (
      DI:     in  STD_LOGIC;
      CI:     in  STD_LOGIC;
      S:      in  STD_LOGIC;
      O:      out STD_LOGIC);
end component;

component MULT_AND port (
      I0:      in  STD_LOGIC;
      I1:      in  STD_LOGIC;
      LO:      out STD_LOGIC);
end component;

component LUT4 port (
      I0:      in  STD_LOGIC;
      I1:      in  STD_LOGIC;
      I2:      in  STD_LOGIC;
      I3:      in  STD_LOGIC;
      O:       out STD_LOGIC);
end component;

-- Define LUT4s to the correct function.  For some reason the synthesizer
-- couldn't figure out that this was what I wanted.
attribute INIT: string;
attribute INIT of L00: label is "0E04";
attribute INIT of L01: label is "0E04";
attribute INIT of L02: label is "0E04";
attribute INIT of L03: label is "0E04";
attribute INIT of L04: label is "0E04";
attribute INIT of L05: label is "0E04";
attribute INIT of L06: label is "0E04";
attribute INIT of L07: label is "0E04";
attribute INIT of L08: label is "0E04";
attribute INIT of L09: label is "0E04";
attribute INIT of L10: label is "0E04";
attribute INIT of L11: label is "0E04";
attribute INIT of L12: label is "0E04";
attribute INIT of L13: label is "0E04";
attribute INIT of L14: label is "0E04";
attribute INIT of L15: label is "0E04";

-- Stage 0 declarations
signal A0_MODE:  STD_LOGIC_VECTOR( 1 downto 0);     -- SHFA
-- Arithmetic inputs
signal A0_A:     STD_LOGIC_VECTOR(15 downto 0);     -- A[] signal input
signal A0_B:     STD_LOGIC_VECTOR(15 downto 0);     -- B[] signal input
-- Arithmetic internal signals
signal A0_LUT:   STD_LOGIC_VECTOR(15 downto 0);     -- LUT      (internal use)
signal A0_MA:    STD_LOGIC_VECTOR(15 downto 0);     -- MULT_AND (internal use)
signal A0_XC:    STD_LOGIC_VECTOR(15 downto 0);     -- XORCY    (internal use)
signal A0_CRY:   STD_LOGIC_VECTOR(16 downto 0);     -- Carry    (internal use)
-- Arithmetic outputs
signal A0_COUT:  STD_LOGIC;                         -- Arithmetic Carry output
signal A0_SUMQ:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output
signal A0_SUMD:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output

-- Stage 1 declarations
signal A1_MODEQ: STD_LOGIC_VECTOR( 1 downto 0);     -- SHFB
signal A1_MODED: STD_LOGIC_VECTOR( 1 downto 0);     -- SHFB
-- Arithmetic inputs
signal A1_A:     STD_LOGIC_VECTOR(15 downto 0);     -- A[] signal input
signal A1_B:     STD_LOGIC_VECTOR(15 downto 0);     -- B[] signal input
-- Arithmetic internal signals
signal A1_LUT:   STD_LOGIC_VECTOR(15 downto 0);     -- LUT      (internal use)
signal A1_MA:    STD_LOGIC_VECTOR(15 downto 0);     -- MULT_AND (internal use)
signal A1_XC:    STD_LOGIC_VECTOR(15 downto 0);     -- XORCY    (internal use)
signal A1_CRY:   STD_LOGIC_VECTOR(16 downto 0);     -- Carry    (internal use)
-- Arithmetic outputs
signal A1_COUT:  STD_LOGIC;                         -- Arithmetic Carry output
signal A1_SUMQ:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output
signal A1_SUMD:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output


begin


-- The selector needed is a type of arithmetic circuit with 
-- two mode pins.  In addition, I have to be able to force the "A"
-- input to be ignored completely for at least one mode.  That means
-- that the mode will require a MULT_AND, so the template required is
-- the MODE3-0 MULT_AND template (which see).
--
-- The three operations required will be A[], B[], and A[]+A[]+COUT.
-- B[] will have to correspond to AR_MODE(1) "low", and A[]+A[] will
-- need to have AR_MODE(1) "high".  Also, I want a shift by "0" to
-- correspond to the A[] version, while a shift by "1" or "3" to correspond
-- to the arithmetic shift (i.e. A[]+A[]+COUT).  In addition, it's
-- possible to save a LUT by fiddling with N0 so that two different codes
-- correspond to the arithmetic shift:
--
--                 A0_MODE  A1_MODE
-- SHIFT SHFA SHFB   1  0     1  0
-- ----- ---- ----   -  -     -  -
-- 0000    0    0    0  0     0  0
-- 0001    1    0    1  0     0  0
-- 0010    2    0    0  1     0  0
-- 0011    0    3    0  0     1  0
-- 0100    1    3    1  0     1  1
-- 0101    2    3    0  1     1  1
-- 0110    0    6    0  0     0  1
-- 0111    1    6    1  1     0  1
-- 1000    2    6    0  1     0  1

-- The bizarre modes for "10" and "11" is to possibly save the A1_MODE(0) LUT.

-- SHFA modes
with SHIFT(3 downto 0) select
    A0_MODE(1 downto 0) <=
      "00" when "0000" | "0011" | "0110",      -- Shift by 0   A {0,3,6}
      "10" when "0001"                  ,      -- Shift by 1   0 {1,
      "11" when          "0100" | "0111",      -- Shift by 1        4,7}
      "01" when others;                        -- Shift by 2   B {2,5,8}

-- SHFB modes
with SHIFT(3 downto 0) select
    A1_MODED(1 downto 0) <=
      "00" when "0000" | "0001" | "0010",      -- Shift by 0
      "10" when "0011"                  ,      -- Shift by 3
      "11" when          "0100" | "0101",      -- Shift by 3
      "01" when others;                        -- Shift by 6


------------------------------------------------------
-- SHFA   Shifts by 0,1, or 2 bits
------------------------------------------------------

-- Arithmetic functions required:
--
-- A0_MODE  Arithmetic
--   1 0     Function      Shift
--   - -    ------------   --------
--   0 0    A[]              0
--   0 1    B[]              2
--   1 0    A[]+A[]+COUT     1
--   1 1    A[]+A[]+COUT     1
--
-- This is all I need to know to define the carry logic. Because
-- this is sort of a complicated substitution, I'll do the bit substitution
-- explicitly, using assignments, rather than try to substitute the text
-- in the arithmetic.  The other substitutions are as follows:
--
-- Substitutions:
-- AR_MAXBIT <= 15
-- AR_MODE   <= A0_MODE(1 downto 0)
-- AR_xxx    <= A0_xxx

-- Bit assignments, stage 0
--
-- Note that the bit assignments for stage 0 are trivial.  This is because
-- the orbit of a bit in this length barrel shifter, when shifted by 1
-- is the sequence (0,1,2,3,4 ... 15), and this is a trivial sequence.
-- the bit assignment for stage 1 is more complicated.

A0_A(15 downto 0) <= DIN(15 downto 0);      -- Shift by 0

-- A0_B is the same as A0_A, but shifted by two places:
A0_B(15 downto 0) <= A0_A(13 downto 0) & A0_A(15 downto 14);      -- Shift by 2

-- Carry-in control
-- Note that when the circuit is in A[]+A[]+COUT mode, the COUT will
-- be precisely equal to A0_A(15), so I choose that as the CIN instead
-- of the carry-out A0_CRY(16).  This uses no more gates and is kinder
-- and gentler to the tools.
with A0_MODE(1 downto 0) select
    A0_CRY(0) <=
    (  '0'   ) when "00",     -- CIN = 0      A[]
    (  '0'   ) when "01",     -- CIN = 0      B[]
    (A0_A(15)) when "10",     -- CIN = COUT   A[]+A[]+COUT
    (A0_A(15)) when others;   -- CIN = COUT   A[]+A[]+COUT

-- Unfortunately, I couldn't get the synthesizer to clue in that
-- A0_LUT would fit into a LUT. I hate to instantiate these things,
-- but here goes:
--
-- with A0_MODE(1 downto 0) select
--   A0_LUT(I) <=
--     (             A0_A(I)) when "00",      --   A[]        A[]+1
A[]+CIN
--     (             A0_B(I)) when "01",      --   B[]        B[]+1
B[]+CIN       (Note 1.)
--     (A0_A(I)  xor A0_A(I)) when "10",      --   A[]+A[]    A[]+A[]+1
A[]+A[]+CIN
--     (A0_A(I)  xor A0_A(I)) when others;    --   A[]+A[]    A[]+A[]+1
A[]+A[]+CIN
--
-- INIT calculation:
--
-- 3 1111111100000000  B
-- 2 1111000011110000  MODE(0)
-- 1 1100110011001100  A
-- 0 1010101010101010  MODE(1)
-- - ----------------
--        1 0     1 0  A0_A when "00",
--       1 1     0 0   A0_B when "01",
--   0000    0000       '0' when others (note A xor A == '0')
--   ----------------
--   0000111000000100 = 0x0E04  (assigned as an attribute near signals)
--
-- LUT4 ugliness...
L00: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 0),
    I2  => A0_MODE(0),
    I3  => A0_B( 0),
    O   => A0_LUT( 0));
L01: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 1),
    I2  => A0_MODE(0),
    I3  => A0_B( 1),
    O   => A0_LUT( 1));
L02: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 2),
    I2  => A0_MODE(0),
    I3  => A0_B( 2),
    O   => A0_LUT( 2));
L03: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 3),
    I2  => A0_MODE(0),
    I3  => A0_B( 3),
    O   => A0_LUT( 3));
L04: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 4),
    I2  => A0_MODE(0),
    I3  => A0_B( 4),
    O   => A0_LUT( 4));
L05: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 5),
    I2  => A0_MODE(0),
    I3  => A0_B( 5),
    O   => A0_LUT( 5));
L06: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 6),
    I2  => A0_MODE(0),
    I3  => A0_B( 6),
    O   => A0_LUT( 6));
L07: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 7),
    I2  => A0_MODE(0),
    I3  => A0_B( 7),
    O   => A0_LUT( 7));
L08: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 8),
    I2  => A0_MODE(0),
    I3  => A0_B( 8),
    O   => A0_LUT( 8));
L09: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A( 9),
    I2  => A0_MODE(0),
    I3  => A0_B( 9),
    O   => A0_LUT( 9));
L10: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(10),
    I2  => A0_MODE(0),
    I3  => A0_B(10),
    O   => A0_LUT(10));
L11: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(11),
    I2  => A0_MODE(0),
    I3  => A0_B(11),
    O   => A0_LUT(11));
L12: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(12),
    I2  => A0_MODE(0),
    I3  => A0_B(12),
    O   => A0_LUT(12));
L13: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(13),
    I2  => A0_MODE(0),
    I3  => A0_B(13),
    O   => A0_LUT(13));
L14: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(14),
    I2  => A0_MODE(0),
    I3  => A0_B(14),
    O   => A0_LUT(14));
L15: LUT4 port map (
    I0  => A0_MODE(1),
    I1  => A0_A(15),
    I2  => A0_MODE(0),
    I3  => A0_B(15),
    O   => A0_LUT(15));

-- Generate command
A0: for I in 0 to 15 generate
-- Carry chain instantiation
  MA: MULT_AND port map (
    I0  => A0_A(I),
    I1  => A0_MODE(1),
    LO  => A0_MA(I));
  MC: MUXCY port map (
    DI  => A0_MA(I),
    CI  => A0_CRY(I),
    S   => A0_LUT(I),
    O   => A0_CRY(I+1));
  XC: XORCY port map (
    CI  => A0_CRY(I),
    LI  => A0_LUT(I),
    O   => A0_SUMD(I));
end generate;


------------------------------------------------------
-- SHFB   Shifts by 0,3, or 6 bits
------------------------------------------------------
--
--
-- The logic for the second level shifts is similar to the logic for
-- the first level, but the shifts are by amounts 3x as much.  This
-- means that I have to scramble bits to get the right results.
--
-- A0_MODE  Arithmetic
--   1 0     Function      Shift
--   - -    ------------   --------
--   0 0    A[]              0
--   0 1    B[]              6
--   1 0    A[]+A[]+COUT     3
--   1 1    A[]+A[]+COUT     3
--
--
-- Substitutions:
-- AR_MAXBIT <= 15
-- AR_MODE   <= A1_MODE(1 downto 0)
-- AR_xxx    <= A1_xxx

-- Bit assignments, stage 1
--
-- The results from the previous stage is A0_SUM(15 downto 0), and
-- the bits are positioned just as they appear.  But I'm going to have
-- to juggle the ordering for this stage.
--
-- I'm going to choose to start the selector with the selected
-- bit 0.  That is, A1_SUM(0) will have a position of "0" in the
-- (15 downto 0) set of bits.
--
-- As soon as I make that choice, I know that A1_A(0) will be given
-- A0_SUM(0) in order to make the B[] choice correspond to a shift by
-- 0.  Then A1_B(0) will have to be A0_SUM(6) to get the shift by 6.
--
-- The carry-out of from this stage (during the arithmetic mux) will have
-- a value of A0_SUM(0), and since the arithmetic shift is supposed to
-- have a shift amount of "3", it must be that the next bit higher
-- than bit "0" must be bit "3".  This rule continues through all
-- 16 bits of input bits, and determines the bits for A1_A:

A1_A(15 downto 0)
      <= A0_SUMQ(13) & A0_SUMQ(10) & A0_SUMQ( 7) & A0_SUMQ( 4)
      &  A0_SUMQ( 1) & A0_SUMQ(14) & A0_SUMQ(11) & A0_SUMQ( 8)
      &  A0_SUMQ( 5) & A0_SUMQ( 2) & A0_SUMQ(15) & A0_SUMQ(12)
      &  A0_SUMQ( 9) & A0_SUMQ( 6) & A0_SUMQ( 3) & A0_SUMQ( 0);

-- A1_B is the same as A1_A, but shifted by two places:
A1_B(15 downto 0) <= A1_A(13 downto 0) & A1_A(15 downto 14);      -- Shift by 6

-- Note: The following carry chain logic may seem mysterious, but
-- with the use of a template it's easy to implement.  It's my
-- intention to publish the templates I use as freeware, but I haven't
-- got good enough comments written into them.  Half the problem is
-- that after you've used the templates a few times you just remember
-- how to get the arithmetic functions you want, so I don't bother
-- with them much.
--
-- The template that covers this case allows for a very large number
-- of 2-mode bit functions of 2 arithmetic vectors (and constants and
-- stuff) to be implemented in a single slice.

-- Carry-in control
with A1_MODEQ(1 downto 0) select
    A1_CRY(0) <=
    (  '0'   ) when "00",     -- CIN = 0      A[]
    (  '0'   ) when "01",     -- CIN = 0      B[]
    (A1_A(15)) when "10",     -- CIN = COUT   A[]+A[]+COUT
    (A1_A(15)) when others;   -- CIN = COUT   A[]+A[]+COUT

-- Generate command
A1: for I in 0 to 15 generate
with A1_MODEQ(1 downto 0) select
    A1_LUT(I) <=
    (             A1_A(I)) when "00",      --   A[]        A[]+1      A[]+CIN
    (             A1_B(I)) when "01",      --   B[]        B[]+1      B[]+CIN
(Note 1.)
    (A1_A(I)  xor A1_A(I)) when "10",      --   A[]+A[]    A[]+A[]+1
A[]+A[]+CIN
    (A1_A(I)  xor A1_A(I)) when others;    --   A[]+A[]    A[]+A[]+1
A[]+A[]+CIN

-- Carry chain instantiation
  MA: MULT_AND port map (
    I0  => A1_A(I),
    I1  => A1_MODEQ(1),
    LO  => A1_MA(I));
  MC: MUXCY port map (
    DI  => A1_MA(I),
    CI  => A1_CRY(I),
    S   => A1_LUT(I),
    O   => A1_CRY(I+1));
  XC: XORCY port map (
    CI  => A1_CRY(I),
    LI  => A1_LUT(I),
    O   => A1_SUMD(I));
end generate;


process (CLK)
begin
    if CLK'event and CLK='1' then  --CLK rising edge
      A0_SUMQ  <= A0_SUMD(15 downto 0);
      A1_MODEQ <= A1_MODED(1 downto 0);
      A1_SUMQ  <= A1_SUMD(15 downto 0);
    end if;
end process;


-- The result is A1_SUM, but the bits are not in the correct order.
-- To get the bits in correct order, I simply reverse the operation
-- that gave A1_A:

Y(15 downto 0)
      <= A1_SUMQ( 5) & A1_SUMQ(10) & A1_SUMQ(15) & A1_SUMQ( 4)
      &  A1_SUMQ( 9) & A1_SUMQ(14) & A1_SUMQ( 3) & A1_SUMQ( 8)
      &  A1_SUMQ(13) & A1_SUMQ( 2) & A1_SUMQ( 7) & A1_SUMQ(12)
      &  A1_SUMQ( 1) & A1_SUMQ( 6) & A1_SUMQ(11) & A1_SUMQ( 0);

end BRL16_8_arch;


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

######

From: "Carl Brannen" <carl.brannen@terabeam.com>
Newsgroups: comp.arch.fpga
Subject: Re: Low area barrel shift puts 3 to 1 mux in a Xilinx LUT:
Date: Wed, 19 Dec 2001 08:57:42 +0000 (UTC)
Organization: Mailgate.ORG Server - http://www.Mailgate.ORG
Lines: 137
Message-ID: <e25c1aa95551aa23abaa3d771f051885.51709@mygate.mailgate.org>
References: <d59d30d03eadb65299bb04ac2507f24d.51709@mygate.mailgate.org>
NNTP-Posting-Host: firewall.terabeam.com
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: news.mailgate.org 1008744250 13480 216.137.15.2 (Wed Dec 19 09:57:42 2001)
X-Complaints-To: abuse@mailgate.org
NNTP-Posting-Date: Wed, 19 Dec 2001 08:57:42 +0000 (UTC)
Injector-Info: news.mailgate.org; posting-host=firewall.terabeam.com; posting-account=51709; posting-date=1008744250
User-Agent: Mailgate Web Server
X-URL: http://www.Mailgate.ORG
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!web2news!firewall.terabeam.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12508

Design reports:

<<<
Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                 38 out of    768    4%
   Number of Slices containing
      unrelated logic:                0 out of     38    0%
   Number of Slice Flip Flops:       70 out of  1,536    4%
   Number of 4 input LUTs:           38 out of  1,536    2%
   Number of bonded IOBs:            36 out of     94   38%
   Number of GCLKs:                   1 out of      4   25%
   Number of GCLKIOBs:                1 out of      4   25%
Total equivalent gate count for design:  1,034
Additional JTAG gate count for IOBs:  1,776
>>>

Note that the 38 slices mentioned above include slices that contain
only DFFs. The number of slices that contain FGs or carry logic is
20.  There are extra DFFs included on the inputs and outputs in order
to "free" the design from the IO region.

<<<

   The Average Connection Delay for this design is:        0.923 ns
   The Maximum Pin Delay is:                               1.982 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.430 ns
...

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic 
                                            |            |            | Levels

--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 5.272ns    | 11   

--------------------------------------------------------------------------------

<<<
Constraints cover 2038 paths, 0 nets, and 234 connections (93.6% coverage)

Design statistics:
   Minimum period:   5.272ns (Maximum frequency: 189.681MHz)
>>>

I should here mention that the worst case timing does
involve the long carry chain, but the carry logic itself
only increases by .097ns per stage (i.e. half that per bit).

Consequently, other than the additional routing congestion,
the carry chain, per se, is not really much of a limit to it:

<<<

================================================================================
Timing constraint: NET "CLK" PERIOD =  5 nS   LOW 50.000 % ;
 2038 items analyzed, 6 timing errors detected.
 Minimum period is   5.272ns.

--------------------------------------------------------------------------------
Slack:    -0.272ns path B<3> to U14/A1_A<10> relative to
           5.000ns delay constraint

Path B<3> to U14/A1_A<10> contains 11 levels of logic:
Path starting from Comp: CLB_R14C5.S1.CLK (from CLK)
To                   Delay type         Delay(ns)  Physical Resource
                                                   Logical Resource(s)
-------------------------------------------------  --------
CLB_R14C5.S1.XQ      Tcko                  0.772R  B<3>
                                                   U2
CLB_R15C6.S1.G2      net (fanout=4)        0.613R  B<3>
CLB_R15C6.S1.Y       Tilo                  0.398R  U14/C1/N5
                                                   U14/C586
CLB_R15C6.S1.F3      net (fanout=17)       0.476R  U14/C17/N16
CLB_R15C6.S1.X       Tilo                  0.398R  U14/C1/N5
                                                   U14/C582
CLB_R16C6.S0.BX      net (fanout=1)        0.680R  U14/C1/N5
CLB_R16C6.S0.COUT    Tbxcy                 0.357R  U14/A1_A<0>
                                                   U14/MC_0
                                                   U14/MC_1
CLB_R15C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<2>
CLB_R15C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<6>
                                                   U14/MC_2
                                                   U14/MC_3
CLB_R14C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<4>
CLB_R14C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<12>
                                                   U14/MC_4
                                                   U14/MC_5
CLB_R13C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<6>
CLB_R13C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<2>
                                                   U14/MC_6
                                                   U14/MC_7
CLB_R12C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<8>
CLB_R12C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<8>
                                                   U14/MC_8
                                                   U14/MC_9
CLB_R11C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<10>
CLB_R11C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<14>
                                                   U14/MC_10
                                                   U14/MC_11
CLB_R10C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<12>
CLB_R10C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<4>
                                                   U14/MC_12
                                                   U14/MC_13
CLB_R9C6.S0.CIN      net (fanout=1)        0.000R  U14/A0_CRY<14>
CLB_R9C6.S0.CLK      Tcckx                 0.996R  U14/A1_A<10>
                                                   U14/XC_14
                                                   U14/A0_SUMQ_reg<14>
-------------------------------------------------
Total (3.503ns logic, 1.769ns route)       5.272ns (to CLK)
      (66.4% logic, 33.6% route)
>>>

Even though the carry is only used in two modes, one where it is
zero throughout the carry chain, and the other where each carry
bit propagates only to the next stage, it is nevertheless the case
that the worst case speed of the circuit includes the full timing
chain.

It's not easy to get a delay as long as the whole carry chain.
To do it, you'll have to apply '1's to the DIN[] input, and
tool around with the SHIFT[] input.  When the carry chain is
in the "pass data" mode, it will all be '1's.  When the carry
chain is switched to the all '0' mode, the '0' will be supplied
at the least significant bit, and will propagate through the
chain.  (Note that the chain does not propagate evenly from least
significant to most significant because of the remapping that
makes the barrel shifter work.)

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

######

From: Mike Treseler <mike.treseler@flukenetworks.com>
Newsgroups: comp.arch.fpga
Subject: Re: Low area barrel shift puts 3 to 1 mux in a Xilinx LUT:
Date: Wed, 19 Dec 2001 10:21:19 -0800
Organization: Fluke Networks
Lines: 7
Message-ID: <3C20DA9F.E5159376@flukenetworks.com>
References: <d59d30d03eadb65299bb04ac2507f24d.51709@mygate.mailgate.org> <e25c1aa95551aa23abaa3d771f051885.51709@mygate.mailgate.org>
NNTP-Posting-Host: slick.tc.fluke.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 4.77 [en] (X11; U; Linux 2.4.7-4GB i686)
X-Accept-Language: en
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!cpk-news-hub1.bbnplanet.com!evrtwa1-snf1.gtei.net!news.gtei.net!fluke!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:12520

Carl Brannen wrote:
> 
> Design reports:

Leo gives 55 LCs on altera 20k -- 193 MHz !!

 -- Mike Treseler