LUT+routing based Open FPGA Specification
author Neil Franklin, last modification 2004.05.23


Background:

Want to be able to design an open PC architecture, not dependent on the lousy
"industry standard" PC components. And do so using open source software tools.
This has created a need to make our own chips.

ASICs are limited by large NRE, large "compile" time (bad for the iterative
design methods preferred by hackers), inflexible (single function), minimal
batch sizes (from wafer sizes and mask costs). They also demand HDL (ugly) or
schematic (unportable non-ASCII) programming to generate EDIF, not allowing
74((L)S)(X)xx style "select and wire blocks" design.

FPGAs solve all these problems, but their vendors do not publish the bitstream
formats. This makes writing open source tools a large job (reverse engineering
bitstream format) or even impossible (needed data only available for one chip
family, Xilinx Virtex), with limited duration value (single vendor/source chips
go out of production, making tools useless, see XC6200 disaster, and no future
documented family to look for). So I am investigating alternatives to using
vendor FPGAs.


Logfile:

2003.09.22 Mon
Discussion on c.a.f about making own open FPGA so open bitstream
  links to a few FPGA-in-FPGA (slow) and a few SoG FPGA-in-ASIC projects

2003.09.26 Fri
Discussion with attila about FPGA-in-ASIC as student project
  he thinks this is more DA project, but pointer to Kaeslin VLSI 1 script

2003.09.29 Mon
multiple hours musing on details of spec for an FPGA-in-ASIC
  started this file, jotted down the whole stuff

2003.10.01 Wed
added this Timeline section, tidied up, added a few details
expanded Xilinx chip feature evolution and patent situation

2003.10.15 Wed
estimates about minimal chip order and how to distribute costs
renamed this section Timeline->Logfile, added Todo section
added Altera non-Xilinx chip features and patent situation

2003.10.26 Sun
Added a few more details, "mux switch in data path" routing analog to mux
based control signal propagation, more Xilinx details, Atmel in patents

2003.11.04 Tue
Added remarks about FF-RAMs addresses from LSB side neighbor mux

2003.11.14 Fri
block of 4x4, staggered routing muxes, both dimensions wide logic for Muxes
RAMs with 4-reg 3-LUT DP mode, name "E" clock network, fixed data bus pins
config loading modes, slave 1/8bit, master from 8bit, addr gen and conf select
reloadable at LB level, no entire width, self reloadable, voltages started
SoG variant with 4-LUT and wide output muxes w 2 or 3 cell/bit and 2 cell/IOB

2003.11.17 Mon
definitely 4x4 block for routing not 4x1, l<->r 1/4:3/4 routing
wide-OR also in data bus direction not just data path width direction
reset FF and non reset SRAM in LC, SRAM addr always from control can >4bits
SRAM direct I0 input, leaving 3-LUT for logic (such as muxing SRAM outputs)

2003.11.19 Wed
studying 74(LS)(x)xx TTL databook, this evening 7483 and first look ahead carry
  carry = 0 if all [a+b]<3 until [a+b]=0 or Cin = 0 found

2003.11.30 Sun
massive text tidy up, better formulations, wide-OR->OR-chaining, I0..3 -> A..D
studying 74(LS)(x)xx TTL databook, further with 74181 2*16 function ALU
  carry chain Cn->C(n+4) is same as in 7483, add external 74182 lookahead

2004.05.23 Sun
added "wire-OR" style stuff in interconnect

Todo:
ASCII art of layout (attila request), features in it


Aim:

Design an open source FPGA, implemented in an ASIC. This enables using
multi-vendor (will not disappear) ASIC technology (or perhaps even one day full
custom design), while at same time still giving FPGA flexibility for iterative
design methods.

Both open HDL source for producing the FPGA chips (ensuring their continued
existence), and open bitstream format (to enable open source compilers for
them).

Disadvantage is work of designing/manufacturing chips and most likely less
speed. Advantage is no dependency on FPGA vendors who have a monopoly on any
bitstream format and tools to generate it.

These conditions make this chip attractive to hackers and students newly
entering FPGA usage, and uninteresting for existing FPGA users who do not
mind the tool situation. So design and document preferably for newbie
friendliness, not industry compatibility.

Design for typical hobby applications, SoftCPUs, PC IO devices,
mp3/video decode/encode, 3D video accel.


Target and Priorities:

Be usable for FPGA PC, consisting of soft CPU, IO devices and chipset/glue:
- Crucial is enabling wide and fast data paths, in particular fast carry for
  ALUs and incrementers and fast control signal propagation to all bits in one
  data path stage.
- Second priority is fast instruction decoder, which uses many ANDs and then
  few wide ORs just next to the data path elements.
- Third priority is space efficient support for many-register sets.
- Fourth priority is fast data signal propagation, once it can be processed
  efficiently. And also enough data transfer volume for multiple parallel data
  buses. This can be restricted to inside each bit's data path.


Basic design:

For data path both fast carry and fast control signal propagation require
minimal amounts of cells to traverse. In addition fast carry requires dedicated
carry logic. Worst case acceptable is 1 stage per cell/bit (like Atmel AT4000),
but better group multiple bits of one element in one cell (like Xilinx
XC[2345]xxx/Virtex or Altera or Lucent/Lattice devices). I prefer at present 4
bits per cell (like 7483 and 74181+74182 do).

In addition fast data signal propagation requires also minimising cells to
cross and multiple buses require multiple signals crossing one cell, so this
excludes SoG style designs.

Also having large register sets requires RAMs in LCs, also requiring the 1
cell/bit formula of 4LUT based FPGAs.

This makes SoG (sea of gates, also known as fine grain) style FPGAs (such as
Algotronix CAL1024, Xilinx XC62xx or Atmel AT6000), which require 2..4 cells
per bit, useless for this target, because they are 4*(2..4)=8..16 times slower.
So no use going to the existing open source projects, of which all I know are
SoG based (all aimed at / optimised for random logic?).

SoG would be just about usable if with hardwired 2 or 3 gate/bit formula with
carry and RAMs provided every 2 or 3 cells. In this case use a 4-LUT with NEWS
as direct inputs and in neighbor cells 4 wide (4bit 16:1 or even larger) output
muxes, with 4 dirs, LUT, FF and 10 or more far inputs.


Logic Cell (LC):

Should be able to emulate as many 74((L)S)(X)xx series parts as possible. So
that existing designs and thinking port as easily as possible.

Standard 4LUT + D-FF. Other additional features only if they do not add
significant costs (= cell % larger than the % cells they save). Even then when
in doubt leave away (less work, less documentation, simpler tools).

LUT inputs numbered A..D (like in 74181 A..B), not I1..I4 as traditional FPGAs
do. LUT outputs F (function, as in 74181) and FF outputs Q (as in 7474).

FF inputs just data from LUT and separate clock+reset.

Separate F and Q outputs only if no significant cost increase in routing, else
use an output O with Mux F/Q like in PALs or CPLDs. Together with no separate
carry outputs only 1/3 of outputs of Virtex devices.


Wide Logic

No F5/F6 Muxes (require 5th Bx input Mux per LUT, +20% inputs). Use simpler
OR-chaining of LUTs, previous LUT AND controlbit into OR of this LUT's 16:1 Mux.

Due to only one bit and simple AND and extended OR have OR-chaining in both
data path width (for control logic) and data bus length (for large processing
elements) directions.

This can do:

- n*4OR directly, usually done in data path width direction, but can be (mxn)
  for less chaining links delay

- n*4NAND (with inverting inputs), and so wide-AND if "receivers" invert also

- (m*4)x(n*4) OR-of-ANDs by doing NAND-of-NANDs, this can be done with AND and
  OR in 2 directions, reducing chaining links

- n*(2OR-of-2-ANDs, A&B|C&D), "select line Muxes", replacing internal tristate
  logic, this is best done using data path direction OR-chaining

- 2*(2:1-Muxes) with AB data, C Sel0, D Sel1|/Sel1. First level 2:1 in LUTs
  (ABC) and second level 2:1 ANDs in LUTs (first level + D) and OR from 1 stage
  OR-chaining. Above 4:1 need to use "select line Muxes" or cascade 4:1 Muxes.

No merging of 2 or 4 4LUTs to make 5LUTs or 6LUTs (is seldom used anyway).
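
The OR-chaining above can be sketched as a small software model. This is only
an illustration, not a netlist: the names lut4, make_pattern and or_chain, and
the choice of A as least significant LUT select bit, are assumptions made for
this example. Each LC outputs its LUT result ORed with (previous output AND
chain config bit).

```python
def lut4(pattern, a, b, c, d):
    """Read one bit of a 16 bit LUT pattern (assumed: A is the LSB select)."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (pattern >> index) & 1


def make_pattern(func):
    """Build the 16 bit LUT pattern for a function of (a, b, c, d)."""
    pattern = 0
    for index in range(16):
        a, b = index & 1, (index >> 1) & 1
        c, d = (index >> 2) & 1, (index >> 3) & 1
        if func(a, b, c, d):
            pattern |= 1 << index
    return pattern


def or_chain(cells, inputs):
    """Evaluate chained LCs; each cell is (lut_pattern, chain_config_bit)."""
    out = 0
    for (pattern, chain_bit), (a, b, c, d) in zip(cells, inputs):
        # previous LUT AND controlbit, ORed into this LUT's output
        out = lut4(pattern, a, b, c, d) | (out & chain_bit)
    return out


OR4 = make_pattern(lambda a, b, c, d: a | b | c | d)
AO22 = make_pattern(lambda a, b, c, d: (a & b) | (c & d))

# 2*4OR: two chained LCs give an 8 input OR
wide_or = [(OR4, 0), (OR4, 1)]
print(or_chain(wide_or, [(0, 0, 0, 0), (0, 0, 1, 0)]))

# "select line Mux": each LUT does A&B|C&D with (data, select) pairs
select_mux = [(AO22, 0), (AO22, 1)]
print(or_chain(select_mux, [(1, 0, 0, 0), (0, 0, 1, 1)]))
```

With the chain config bit at 0 an LC is a plain 4LUT, so the feature costs
nothing in logic terms when unused.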


Logic Block (LB):

4x1 LCs, 4 in data path width direction and 1 in data bus direction, same
arrangement as the XC5200 used. This 4x1 is "half way" between Virtex 2x1 slice
and Altera 8x1 LAB, and near to Virtex-II pair of 2x1 slices.

Common FF (Clk and Reset) inputs for all 4 FFs in LB. If taken from routing
this gives 4*4+1+1=18 inputs, 3/4 of the 24 inputs of equivalent 2 Virtex
slices. Possibly no separate reset input, as only needed in FSM FFs and PC
registers, so input Mux is wasted everywhere else.

Alternative take Clk and Reset from outputs of data path width direction LSB
side LCs, requires no extra input Muxes for Clock or FF. And is fitting for
resetting registers, and small waste for FSM Bits.


Making register sets:

LUT-RAMs are surely patent encumbered, so avoid them.

Instead replace the 4 FFs with a RAM of 4*16FFs (like 74(x)89), call these
FF-RAMs. In "normal" operation Address=0 reduces them to single FFs.

For RAM addressing !=0:

- SP-RAM all 4 Dn(n=0..3) to address, leaving 3LUTs for data path.

- DP-RAM use all 4 Cn(n=0..3) as write address, leaving 2LUTs. This even allows
  DP-RAM without using 2 4LUTs per bit.

- Possibly also a DP-RAM mode with 3-LUTs and 2 Ds for each address, but only
  4 addresses/registers.

So read addr can be 0/D, write addr 0/D/C/D(3-n), with 1 address bits config
bits per LC.
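
A minimal behavioural sketch of the FF-RAM idea, assuming one 16x1 bit store
per LC and 4 LCs per LB. The class name FFRam and the helpers write_word and
read_word are invented for this example. It only shows that address 0 gives
the plain single FF case and that separate read and write addresses give the
DP-RAM case.

```python
class FFRam:
    """Storage of one LC: 16 flip-flops, addressed as 16 x 1 bit."""

    def __init__(self):
        self.bits = [0] * 16

    def clock(self, data, write_addr=0):
        """Clock edge: store data at write_addr (0 = plain single FF)."""
        self.bits[write_addr & 0xF] = data & 1

    def read(self, read_addr=0):
        """Q output, read address independent of the write address."""
        return self.bits[read_addr & 0xF]


def write_word(lb, value, addr=0):
    """Write a 4 bit value into register addr of one LB (list of 4 LCs)."""
    for n, lc in enumerate(lb):
        lc.clock((value >> n) & 1, addr)


def read_word(lb, addr=0):
    """Read the 4 bit value held in register addr of one LB."""
    return sum(lc.read(addr) << n for n, lc in enumerate(lb))


lb = [FFRam() for _ in range(4)]
write_word(lb, 0b1010, addr=5)   # register 5 of a 16 entry register set
write_word(lb, 0b0110)           # address 0, behaves like the normal 4 FFs
print(read_word(lb, 5), read_word(lb))
```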

Alternative take address(es) from data path width direction LSB side LCs. At
each LB mux amplifying like for control lines routing. Feed in from control
section address generator logic, before LSB. This also allows more than 4
address lines, and so 4*32 or 4*64 bit RAMs. Needs 8 to 12 "driver" LUTs, but
register sets have small driving logic, no FSMs, little decode.

Clock as usual. Reset allowed to act on only 1 of 16 bits. On row/column of 4
or all 16 bits only if no extra cost. Alternative single resettable FF and non
resettable RAM in each LC.

This also gives a clean split of config bits (including LUT patterns) and
debug data bits access. Allows debug reads without stopping chip for
"reconfig". Even debug writes if proper semaphores are used.

No Support for 4*16 -> 2*32 or even 1*64. Better use LUTs and OR-chaining.
With 4*32 or 4*64 no need for this anyway.


BRAMs:

Just one/two large SRAM(s) at end(s) of data path is inferior, when having
multiple independent logic sections (CPU, peripherals) on the chip. Multiple
small size BRAM stripes is better. Ideal RAM at LC granularity, so that it is
available "at any time".

BRAMs are most likely patent encumbered, having appeared in the middle of the
XvsA patent war. So better many large FF-RAM tiles. This needs 4*32bit (for
Virtex-E density) or 4*64bit (for Virtex-EM or Virtex-II RAM density).

Use the LCs LUTs and data direction OR-chaining to select from multiple FF-RAM
outputs, to get the sizes for caches or video memories. For this LUT usage
FF-RAM input can not be via LUT, so FF-RAM input with Mux for A or LUTout.


Arithmetic and Carry:

F(A,B,Carryin) = A XOR B XOR D fully computed by the LUTs, not just 2nd XOR
like in XC4000 or first XOR like in Virtex.

Carryout(An,Bn,Carryin:n=0..3) provide for each LB an 4bit lookahead carry,
like used in 7483/74181+74182, but these then serial like in all FPGAs, no
74182 equivalent. This is main reason for the 4x1 LB format, the FF(-RAM)
Clock/Reset(/Addresses) being only secondary beneficiaries.
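
As a rough sketch of this scheme: 4bit lookahead inside each LB, serial from
LB to LB, sum bit as full 3 input XOR in the LUT. The function names lb_add
and add are made up for this example, and the generate/propagate equations are
the textbook lookahead form, not taken from any particular chip.

```python
def lb_add(a, b, carry_in):
    """Add two 4 bit values in one LB, return (sum, carry_out)."""
    g = [(a >> n) & (b >> n) & 1 for n in range(4)]      # generate
    p = [((a >> n) ^ (b >> n)) & 1 for n in range(4)]    # propagate
    c0 = carry_in & 1
    # Lookahead: every carry directly from g, p and the LB's Carryin
    c = [
        c0,
        g[0] | (p[0] & c0),
        g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0),
        g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
        | (p[2] & p[1] & p[0] & c0),
    ]
    carry_out = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                 | (p[3] & p[2] & p[1] & g[0])
                 | (p[3] & p[2] & p[1] & p[0] & c0))
    total = 0
    for n in range(4):
        total |= (p[n] ^ c[n]) << n   # F = A XOR B XOR D, D carries the carry
    return total, carry_out


def add(a, b, width_lbs, carry_in=0):
    """Chain LBs serially: Carryout of one LB is Carryin of the next."""
    total, carry = 0, carry_in
    for lb in range(width_lbs):
        s, carry = lb_add((a >> (4 * lb)) & 0xF, (b >> (4 * lb)) & 0xF, carry)
        total |= s << (4 * lb)
    return total, carry


print(add(0x1234, 0x0FCD, width_lbs=4))
```

A 16bit add crosses only 4 carry stages instead of 16, which is the point of
grouping 4 bits per LB.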

For this wire:

- CarryAn and CarryBn hardwired from An and Bn of this LB. Increment mode with
  only one An (or Bn?) used for data, leave other input free for logic (such as
  Mux after increment for PC registers)

- Carryin hardwired from Carryout of previous LB. Set first Carryin by an LUT
  with A and B = 0 or 1 or from carry flag

- Carryout to least-used input D(n+1) Mux (so no slowing non-arithmetic use
  with extra Mux in D line(s) as XC4000 has, no slowing with LUT/Fx/XOR Mux as 
  Virtex has). This also allows conditional adding, using D only in add case
  (also for Mux case of PC registers)

- For subtraction invert CarryB using C (add/sub select). So in lookahead input
  from B an XOR driven by C if enable config bit (1 for all 4 bits) is set.
  Alternative have a dedicated add/sub selector, also coming from control
  section LUTs.
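
The add/sub select can be sketched the same way. Assumptions in this example:
the name add_sub, a simple bit serial carry for brevity, and the first Carryin
set to 1 when subtracting (usual two's complement trick, a + NOT b + 1).

```python
def add_sub(a, b, sub, width=16):
    """sub=0 gives a+b, sub=1 gives a-b, modulo 2**width.

    Returns (result, carry_out).
    """
    carry = sub & 1                      # assumed: first Carryin = 1 for sub
    result = 0
    for n in range(width):
        an = (a >> n) & 1
        bn = ((b >> n) & 1) ^ (sub & 1)  # XOR on the B side, driven by C
        result |= (an ^ bn ^ carry) << n
        carry = (an & bn) | (carry & (an ^ bn))
    return result, carry


print(add_sub(5, 3, sub=0))
print(add_sub(5, 3, sub=1))
```

Only the B input is touched, so the A input and the rest of the carry logic
stay the same for both operations.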

For speed control signal and carry ripple should run in the same direction.
So control logic at LSB side of data path width. Carry at MSB hits side of
chip. As FFs are direct after logic, routing carry back to control is in
critical time path, so put an additional FF for Cout in the left IOBs.

For patent avoidance carry logic possibly not in LCs, but separate carry
stripes (like BRAMs), only connected to data bus direction routing. One carry
chain per 2 or 4 LCs.


General Routing:

Possibly Data path width horizontal, MSB left, LSB right, as most hackers are
used to this arrangement from register diagrams in CPU data sheets. Data path
depth growing top->bottom same as text, code, memory diagrams etc do. Also in
vv-like tool this gives tags at left (like comments). And allocation (and vd
dump) order is same as vv column before row order. This familiarity is newbie
friendlier than the FPGA traditional MSB top and LSB bottom.

If not like above, then traditional MSB at top arrangement.

Width (control) and depth (data) are nearly separate and have different uses
and requirements. So consider them as 2 separate topics.

Data comes preferred to A|B seldom C, never D. Control to D|C, seldom B, never
A. Use this to reduce A and D input Mux size. Because of this Carryin is
in D Mux.

Routing 4x4 logic groups (LG), to get same 4 times faster propagation in data
bus depth direction routing as in data path width. Also symmetrical for
routing software. Gives LG of 4 LBs of each 4 LCs. Also avoids possible patent
problem with differentiated length routing. If that no problem, possibly also
16x16 "supergroup" fast transit routes.

Routing Mux based, as this avoids chip damage with bugs in routing software.
Traditional Mux has less config bits/selection, but more complex logic to
decode the selection. So use OR of multiple (signal AND confbit). Is safe like
Muxes, and allows multi-ANDs being auto-ORed, like open collector wire-OR. PIP
signals through transistor gate, vs Mux through 2 transistor stages, which is
slower. But Mux amplifies and solves fanout problems and no T/C low pass filter
effect. And transistors faster than interconnect delay.
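
A tiny model of the "OR of multiple (signal AND confbit)" routing Mux. The
function name and_or_mux is invented for this example. It shows that a
routing tool bug which sets two config bits only ORs the two signals together,
instead of shorting two drivers as with PIPs.

```python
def and_or_mux(signals, conf_bits):
    """Routing Mux as OR of (signal AND confbit), one config bit per input."""
    out = 0
    for signal, conf in zip(signals, conf_bits):
        out |= signal & conf
    return out


inputs = [0, 1, 0, 0]

# Normal use: exactly one config bit set, behaves like a 4:1 Mux
print(and_or_mux(inputs, [0, 1, 0, 0]))

# Two config bits set: result is the OR of both inputs, like wire-OR
print(and_or_mux(inputs, [1, 1, 0, 0]))
```

Price for this is one config bit per input instead of log2(inputs) bits for an
encoded Mux, but no decoder is needed.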

Inputs to LUTs also with such wire-ORs, also short-safe, and can save final ORs
in control signal part, use less space. Suggests also an FPGA with only NOT and
AND instead of LUTs, no LUT patent collisions..

Scatter repeater Muxes by breaking 1/4 of lines in each LC's vicinity, so all
broken once inside each LG. Have outgoing drivable from incoming of 3
directions, and by that LC. Have each LC input come from any line originating
in or crossing through that LC.

Presently 8 or 12 or 16 wires per LC (2 or 3 or 4 per switch) routing density.
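
One possible way to stagger the repeater Muxes, as a sketch only. The rule
"wire w is broken at the LC with position w mod 4" and the names
break_position and wires_broken_at are assumptions for this example, the spec
does not fix which wire breaks where. It checks that each LC breaks 1/4 of the
wires and every wire is broken exactly once per LG.

```python
def break_position(wire, lg_size=4):
    """LC position inside the LG (0..3) where this wire has its repeater."""
    return wire % lg_size


def wires_broken_at(lc, wires_per_lc, lg_size=4):
    """List of wires that get a repeater Mux at the given LC."""
    return [w for w in range(wires_per_lc)
            if break_position(w, lg_size) == lc % lg_size]


for wires_per_lc in (8, 12, 16):
    per_lc = [wires_broken_at(lc, wires_per_lc) for lc in range(4)]
    print(wires_per_lc, per_lc)
```

With 8, 12 or 16 wires this gives 2, 3 or 4 repeater Muxes per LC.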


Width/Control Routing:

Takes mainly control signals to data path LUTs and clock/reset to data path
FFs (unless dedicated for this). Every LUT's 3 control inputs and the common
clock/resets should be drivable from these. These are all long distance, delay
critical and many loads / high fanout. So run these in LB-wide (4 LC wide)
sections, so delay is hidden behind identical carry delay. These should be
drivable at each LB stage by corresponding LCs outputs or the previous
corresponding segment. Mux for this is also amplifier, buffering the many
loads. All run right->left, no left->right. This is nearer to XC[234]000 or
even more XC6200 routing than to XC5200|Virtex|later.

But also wiring between LCs in control section, short distance, both
directions, so only neighbor LUT and single segments. But this can also use
long distance stuff, without extension segments, as that is not in use here.

Also Cout/MSB to control or Cin or LSB input, or input data from left IOBs. May
want speed, only one load, so single global width line, no amplifiers needed.
Or simply smaller set of segmented routing.

Also data exchange for shift registers or bit swaps, may want speed, and are
multiple, variable distance, usually one load. In both right->left and
left->right directions. But seldom and can so be scattered over multiple
depth stages. May be most difficult to define. Use segmented also for this
direction, no global. So have 1/4 routing MSB->LSB, 3/4 routing LSB->MSB.


Depth/Data Routing:

Drives mainly data and instructions between function units, registers and
memory. Medium distance, registers to multiple function units have multiple
loads, so may require amplified multi-LB segments top->bottom. Back up usually
single target medium distance with large Mux somewhere, no amplification.

No need for jumping bits, apart from swap operations. No 2-LUT+F5 functions
with 2 bits in same bit of 2 different rows, so no need for 45 degree line to
output Mux, to un-zigzag bits so that data appears as all from one row. This
also avoids zigzagging bits and needing un-zigzag routing.

Both directions symmetrical, 1/2 and 1/2 routing for up and down.

No TBUFs as real ones are slow and need global lines, and simulated TBUFs with
muxes are patented. Instead use explicit "select muxes" from LCs, 2 data lines
and 2 enables per LUT. Or for this in data bus direction use Mux "wire-OR" for
chaining outputs. 2 wire-ORs in both directions and ORs in LUT inputs allow
emulating the "simulated TBUF" in software, without hardwired features. Patent
trouble only for feature compiler maker, not chip maker.


Clock distribution network:

Modern dedicated clock network is surely patented (unless not patentable
because PALs/CPLDs already did it, which is unlikely), so avoid it. Can use
long lines or segmented routing, as XC[23]000 already had that. Using normal
routing for clock increases skew, but this is not so critical so long as it is
done predictably.

The LSB is computed faster anyway (carry delay, and control signal propagation)
so have clock following carry at same speed, hides delay like for control
signals. So use control LSB->MSB signals for clock. No need for clock (or
reset) from data bus depth lines. Actually better, as one can clock faster, as
no need to wait for the MSB to compute. Need to look out with MSB->LSB running
signals, such as carry->LSB, as 2 directions propagation time added to logic,
but usually not much logic in using them, not critical.

Drive state machine FFs in control section from source even further "LSB", or
possibly drive from between data path and control, delaying state machine. So
vertical clock in LUT column next to data path, or at most LSB (possibly even
in the right IOBs, using "from LSB segment" input to routing). Gives "E" clock
network, unlike the normal FPGA "H" network.

Data bus direction takes time from registers to function units, in both
directions, so no clock from top. Drive from middle of LSB side, to halve skew.
Or even drive clock with zigzag lines/segments, to parallel data propagation
time.

This gets rid of need for separate clock and CE inputs, in case we can not have
one. Just gate clock with an AND up in control section, when distributing from
vertical to horizontal. Note that this requires clock enable signals to be
active (1) before clock hits its second (1) phase, assuming FFs that act on
falling edge. Or use some latch in the LUT to keep clock up when enable goes
down.


IOBs:

1 pin per LC format.

Data bus end sides ideal for connecting memory data and wide bus peripherals.
So put in wide tri-states, control fast like data path controls and FF clocks,
can for this accept only 1 tri-state control for group of 4, like only 1 FF
clock. No FF in tristate needed, but possibly nice for pipelining.

At least top data bus end (at least from MSB side first n*32/36) pinout all of
them, with "dropped" IOBs all outside this area. Drive data bus routing
directly into IOB, losing need for IOB ring routing (if patent problematic).

LSB side is nearest to control, so good for clock in. So possibly/likely
additional clock distribution lines in them. Also most likely control signal
outputs, as near to state machine and decoders.

And MSB side is nearest data path, so good for thin peripherals, such as RS232
or ethernet shift registers. So these may need single bit tri-states, but don't
need speedy direction switching, control by normal logic.

MSB has the additional Carryout FF+Mux in it, only for feedback to logic. LSB
has Carryin source in it, only from logic.

SoGs have one pad/cell, our cells are 3x3 SoG cells, so perhaps double or even
3 times as many pads as cells (but is this useful?). Possibly needed if
multiple small chips added up to a larger cell space.

If making an SoG, use one IOB per 2 cells, 1 cell out data, 1 cell out
tristate, in data goes to both cells inputs. Or one IOB per cell, and tristate
from neighbor IOB, with pass through mechanism.

Voltages: for IO it is sufficient to support TTL or LVTTL as most IOs will
require bus buffers or analog stuff anyway. LVTTL without buffers is more
important than exotic voltages. As there are many chip producers there is no
need for many IO standards, so no problem with 5V because of that. If LVTTL not
achievable and buffers are needed, then just core voltage.


Programming:

Bitstream format open documented, is central feature. Is not even a chip
feature, but just a policy/attitude feature of the developers. Damn vendors.

SRAM based config, as this does not require special PROM or EPROM or Flash
processes to make the chip, so cheaper. If many small chips then possibly go
for Flash or for OTP.

SRAM booting is annoying and config PROMs/Flashes/CF add cost and boot delay,
so if possible make SRAM "constant memory" like XC[23]000 had (but with
separate VCCCONF) and lithium cell to keep configuration alive (like for
Virtex-II 3DES key).

Loading config either from devel PC or "pc independence" module with Flash or
bat-SRAM and uC or CPLD. So standard is slave mode, either pin-saving serial,
or fast parport/uC friendly 8bit. Optionally a master mode, serial from
standard SEEPROM/ser-Flash parts or using MSB data bus pins from standard 8bit
SRAM or 29F Flash chip (addr from MSB side, config select from IOBs with
boot=1, reboot=x).

Config frames run horizontally, so one can replace sections of logic without
touching others. Between 64 and 256 config bits per LC, so 256 to 1024 per LB
(4 LCs) and 1024 to 4096 per LG (16 LCs).

If routing in IOB ring, then separate bits for LBs and IOBs, so IO stays under
control when reconfiguring logic. Possibly even offer to just replace a part
of a row, best even individual LGs. This requires an LG to be 2^n bits wide,
ideal n>=5.

Strict separation of config bits (incl LUT patterns) and register access for
debug. No FF, FF-RAM or IO-FF bits in config bit space. Only set these per
reset logic or data inputs or debug circuit.

Debug output should not have to stop operation. Even reconfiguring should not
stop, so long as the "missing" part is not in actual use (simply gives random
data). This suggests reloadable at LB level, not just entire width.

Possibly keep a subset of the FPGA running while reconfig of rest. For
"joining" of constant and variable parts use some standard of using routing. Or
possibly some "interconnect points" (in XC4000 and Virtex often TBUFs were
used).

Possibly access to debug or config circuits from running FPGAs logic (by unused
IOBs driving config circuit, or better external config pins also having IOB).


Sizes:

Square array of LUTs. Logarithmic (2^n)x(2^n) sizes. Possibly plus intermediate
(1.5*2^n)x(1.5*2^n) sizes.

Minimal useful is 4x4=16 LCs (about 1 22V10 PAL worth) has problem of also
4*4=16 IOBs, so 50% overhead (assuming IOBs are LC size), so better use a PAL
(smaller ones are documented). Even a PAL-like "LUT-PAL" FPGA with just 2
columns of 8 LCs and everyone-to-everyone non segmented routing would be better
than the full 4x4 thing. Or alternatively an open SoG FPGA for this. A
"LUT-PAL" may be cheaper up to 48 FFs and useful up to 96 FFs, with then
"LUT-CPLD" or SoG up to 256..384 FFs.

8x8=64 with 4*8=32 IOBs is 33% overhead and still in CPLD/SoG range. 16x16=256
with 4*16=64 IOBs is still 20% overhead. 32x32=1024 with 4*32=128 is only 11%.
64x64=4096 with 4*64=256 is down to 6%. So start with smallest sensible 16x16
(which will just fit simple 8bit or 12bit designs). But also look at ASIC
vendors size/cost curve and mask cost vs chipcount per minimal wafer count (so
that not mask too dominant, and not too many small chips).
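
The overhead figures can be recomputed with a few lines. This assumes, as
above, that an IOB takes the area of one LC and that there is one IOB per LC
along each of the 4 edges. The name iob_overhead is made up for this example.

```python
def iob_overhead(n):
    """Fraction of total cells spent on IOBs for an n x n array of LCs."""
    lcs = n * n
    iobs = 4 * n          # one IOB per LC on each of the 4 sides
    return iobs / (lcs + iobs)


for n in (4, 8, 16, 32, 64):
    print(f"{n}x{n}: {n * n} LCs, {4 * n} IOBs, "
          f"{100 * iob_overhead(n):.0f}% overhead")
```

The overhead is 4/(n+4), so it falls from one half at 4x4 through roughly a
third at 8x8 to about 6% at 64x64.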

Packages can be what type the production organiser wants, so far as the ASIC
vendor chosen offers selection. This can include 2.54mm (100mil) friendly
packages. 8x8 LCs (4*8=32 IOs) can fit PLCC44 or even DIL40. 16x16 (4*16=64)
will fit PLCC84 or PGA84. 32x32 (4*32=128) and 48x48 (4*48=192) will stand
either TQFP or PGA (or BGA if someone wants it). 64x64 (4*64=256) is PGA and
BGA only, but still smaller than the PIII/Celeron PGA370, even 96x96 (4*96=384)
fits nearly into PGA370. 128x128 (4*128=512) and above will require a large
case, most likely only sensible in BGA, but that is going near top range of
Xilinx stuff (312x208 LCs (4*312=1248 IOs) for XCV3200E), so most likely not
possible for FPGA in ASIC anyway.

Same also process/voltage/IOvoltage can be traded off by users, so no need for
most likely patented multiple IO standards.

In the beginning most likely just single "by request" sizes, someone makes a
batch and offers not self used ones for sale. As users ordering a batch of
chips have a wide size/package/voltage selection.

This suggests an website for offering such surplus to people who do not want
to arrange their own production run. Also may offer bids so producers know what
sort of surplus will sell well. Also trade experiences/featuresets knowledge
on ASIC vendors versions.

For first run single size. Because min chip count order, possibly smallish size
with a 2x2 or 4x4 array of chips for large circuit. Larger cost per board but
cheaper chip order, and less boards to spread it over. And sellable in finer
granularity, to low-cost users. But this requires large amount of IO, else chip
borders are large problem.


Patents:

XC2000 sets up basic FPGA technology, grid/array of LUTs for logic (replacing
PAL/CPLD product terms), FFs and Muxes, routing (direct, same size segments
switched, long lines, IO direct, IO clocks), PIPs, IOB input FF+Mux and output
tristate, clock from routing segmented/vert-long (replacing PAL/CPLD dedicated
clock networks).

Important patents, according to c.a.f, are the 2 "Freeman" patents. Altera did
not risk segmented routing until they had an agreement, so that must be one of
them. But were sued even without that, over some other patent (LUTs?). Both of
these are said to run out in 2006, so in 3 years. This project will surely need
that long to get up to commercial sales. Anything before that (no manufacturing
or sales) is outside of the scope of patent law. So we can ignore these.

Also the 2006 date means registered in 1986, 1 year after XC2000 went on sale
(and so was documented in its data sheet). This suggests no European patent
possible, as this would need patent registration/issue before publication (see
how RSA lost theirs). So either rejected or not even attempted is to be
expected. How much of the later stuff has also European patents?

XC3000 seems to be similar to XC2000, with added double size LUTs (16+16
instead of 8+8). Direct D input to FFs (not via LUT), Clock Enable (Mux with Q
feedback to D), separate reset input from routing, TBUFs internal tri-state on
longlines for buses, IOB output FF and slew rate and passive keepers.

None of these are essential for us. Separate reset is nice, we only need reset
in FSM state FFs, so can sacrifice an LUT input. CE is nice, but is not crucial
either.

XC3100 just adds size and speed, as far as I can see.

XC4000 in about 1992 adds lots of stuff, 2*4 independent input Muxes, H 3LUT,
independent X|XQ|Y|YQ outputs, LUT-RAMs, carry-before-LUTs stuff, "interleaved"
dual length routing, wide edge decoders, Clock Enable in IOBs, routing ring in
IOBs, dedicated clock networks.

Patents on it will only run to about 2012, so another 9 years. So this one is
the hard object to avoid collision with. Independent inputs are most likely
not patented. We do common output, no H, RAMs different, carry different (I do
not even comprehend part of Xilinx's complicated circuit!), no tri-state, wide
stuff in logic OR-chaining, no clock networks. So only IOBs or even more the
IOB routing may be a problem.

XC4000X adds FF latch mode, more routing, quad length "interleaved" routing,
octal length "interleaved" IOB routing ring, more clocks, more size and speed.

XC5200 is a bit later, ca 1995, adding 3 years to get 12. But added only 4x1
format, carry-after-LUTs stuff (using 2nd XOR in separate 2nd CLB),
carry-wide-AND|OR, and GSM style routing (but with "ring road" routing),
dropping LUT-RAMs and IO FFs.

Of these none should be a problem, 4x1 is too generic, we have carry before
LUTs. OR-chaining instead of F5/F6, no "ring road" style GSM.

XC6200 was a totally different design, SoG instead of LUT based, derived from
Algotronix CAL1024. Main addition is multi-level local+fast routing, with
routing switches.

Virtex is even a bit later, ca 1997, adding 2 years to get 14. Added
slices, F6, SRL16, BRAMs (which were already used by Altera), DLLs, out
tri-state FF, IO standards, direct GSM routing, pseudo-tri-state, readded
LUT-RAMs.

We have no slices, no F6, no SRL16, no BRAMs, no DLLs, only TTL or LVTTL, no
tri-state, so only GSM or out tri-state FFs may be relevant. This can be
avoided with XC[234]000 or even 6200 style routing if necessary, and non-FF
out tri-state. So no large problem.

Virtex-II is newest, after 2000, so giving near 20. But added only (2*2)x2,
separate write address, MuxCY, multipliers, DCMs, DDR IO, DCI, active
interconnect.

We have none of these, apart from routing muxes being active, but they are far
more than just amplifiers, so irrelevant.

Virtex-IIpro adds PPC hard cores and RocketIO. None of interest here.

Outside of Xilinx (earliest and most dominant, and DSP/datapath oriented) there
are also Altera (main early vendor) and possibly others (who?).

Altera (at least earlier Flex* chips) uses different carry logic than Xilinx,
before LUTs like XC4000, no complex circuit, so quite similar to intended here.
But no subtraction inverter and no look ahead (Xilinx patent too broad?).

Altera wide logic is AND based and after LUT, not OR merged into LUT 16-OR. So
different design.

Altera clock network from beginning, coming from CPLDs. But PALs also have this
so possibly not patentable.

Altera BRAMs were before Xilinx. But no interest in BRAMs here for beginning.

Atmel small "BRAMs" separate from LUTs. But also separate from FFs, so
different from what intended here.

Atmel routing with switches between blocks of LCs. May collide here, unless
XC6200 was earlier.