LUT+routing based Open FPGA Specification

author Neil Franklin, last modification 2004.05.23

Background:

Want to be able to design an open PC architecture, not dependent on the lousy "industry standard" PC components. And do so using open source software tools. This has created a need to make own chips.

ASICs are limited by large NRE, large "compile" time (bad for the iterative design methods preferred by hackers), inflexibility (single function), and minimal batch sizes (from wafer sizes and mask costs). They also demand HDL (ugly) or schematic (unportable non-ASCII) programming to generate EDIF, not allowing 74((L)S)(X)xx style "select and wire blocks" design.

FPGAs solve all these problems, but their vendors do not publish the bitstream formats. This makes writing open source tools a large job (reverse engineering bitstream format) or even impossible (needed data only available for one chip family, Xilinx Virtex), with limited duration value (single vendor/source chips go out of production, making tools useless, see XC6200 disaster, and no future documented family to look for).

So I am investigating alternatives to using vendors' FPGAs.

Logfile:

2003.09.22 Mon
  Discussion on c.a.f about making own open FPGA so open bitstream
  links to a few FPGA-in-FPGA (slow) and a few SoG FPGA-in-ASIC projects
2003.09.26 Fri
  Discussion with attila about FPGA-in-ASIC as student project
  he thinks this is more DA project, but pointer to Kaeslin VLSI 1 script
2003.09.29 Mon
  multiple hours musing on details of spec for an FPGA-in-ASIC
  started this file, jotted down the whole stuff
2003.10.01 Wed
  added this Timeline section, tidied up, added a few details
  expanded Xilinx chip feature evolution and patent situation
2003.10.15 Wed
  estimates about minimal chip order and how to distribute costs
  renamed this section Timeline->Logfile, added Todo section
  added Altera non-Xilinx chip features and patent situation
2003.10.26 Sun
  Added a few more details, "mux switch in data path" routing
  analogous to mux based control signal propagation, more Xilinx details, Atmel in patents
2003.11.04 Tue
  Added remarks about FF-RAMs addresses from LSB side neighbor mux
2003.11.14 Fri
  block of 4x4, staggered routing muxes, both dimensions wide logic for Muxes
  RAMs with 4-reg 3-LUT DP mode, name "E" clock network, fixed data bus pins
  config loading modes, slave 1/8bit, master from 8bit, addr gen and conf select
  reloadable at LB level, not entire width, self reloadable, voltages
  started SoG variant with 4-LUT and wide output muxes with 2 or 3 cell/bit and 2 cell/IOB
2003.11.17 Mon
  definitely 4x4 block for routing not 4x1, l<->r 1/4:3/4 routing
  wide-OR also in data bus direction not just data path width direction
  reset FF and non reset SRAM in LC, SRAM addr always from control, can be >4bits
  SRAM direct I0 input, leaving 3-LUT for logic (such as muxing SRAM outputs)
2003.11.19 Wed
  studying 74(LS)(x)xx TTL databook, this evening 7483 and first look ahead carry
  carry = 0 if all [a+b]<3 until [a+b]=0 or Cin = 0 found
2003.11.30 Sun
  massive text tidy up, better formulations, wide-OR->OR-chaining, I0..3 -> A..D
  studying 74(LS)(x)xx TTL databook, further with 74181
  2*16 function ALU, carry chain Cn->C(n+4) is same as in 7483, add external 74182 lookahead
2004.05.23 Sun
  added "wire-OR" style stuff in interconnect

Todo:

ASCII art of layout (attila request), features in it

Aim:

Design an open source FPGA, implemented in an ASIC. This enables using multi-vendor (will not disappear) ASIC technology (or perhaps even one day full custom design), while at the same time still giving FPGA flexibility for iterative design methods. Both open HDL source for producing the FPGA chips (ensuring their continued existence), and open bitstream format (to enable open source compilers for them).

Disadvantage is the work of designing/manufacturing chips and most likely less speed. Advantage is no dependency on FPGA vendors who have a monopoly on any bitstream format and the tools to generate it.

These conditions make this chip attractive to hackers and students newly entering FPGA usage, and uninteresting for existing FPGA users who do not mind the tool situation. So design and document preferably for newbie friendliness, not industry compatibility. Design for typical hobby applications: SoftCPUs, PC IO devices, mp3/video decode/encode, 3D video accel.

Target and Priorities:

Be usable for FPGA PC, consisting of soft CPU, IO devices and chipset/glue:

- Crucial is enabling wide and fast data paths, in particular fast carry for ALUs and incrementers and fast control signal propagation to all bits in one data path stage.
- Second priority is fast instruction decoder, which uses many ANDs and then few wide ORs just next to the data path elements.
- Third priority is space efficient support for many-register sets.
- Fourth priority is fast data signal propagation, once it can be processed efficiently. And also enough data transfer volume for multiple parallel data buses. This can be restricted to inside each bit's data path.

Basic design:

For data path both fast carry and fast control signal propagation require minimal amounts of cells to traverse.
In addition fast carry requires dedicated carry logic. Worst case acceptable is 1 stage per cell/bit (like Atmel AT4000), but better group multiple bits of one element in one cell (like Xilinx XC[2345]xxx/Virtex or Altera or Lucent/Lattice devices). I prefer at present 4 bits per cell (like 7483 and 74181+74182 do).

In addition fast data signal propagation also requires minimising cells to cross, and multiple buses require multiple signals crossing one cell, so this excludes SoG style designs. Also having large register sets requires RAMs in LCs, also requiring the 1 cell/bit formula of 4LUT based FPGAs.

This makes SoG (sea of gates, also known as fine grain) style FPGAs (such as Algotronix CAL1024, Xilinx XC62xx or Atmel AT6000), which require 2..4 cells per bit, useless for this target, because they are 4*(2..4)=8..16 times slower. So no use going to the existing open source projects, of which all I know are SoG based (all aimed at / optimised for random logic?).

SoG would be just about usable with a hardwired 2 or 3 gate/bit formula with carry and RAMs provided every 2 or 3 cells. In this case use a 4-LUT with NEWS as direct inputs and in neighbor cells 4 wide (4bit 16:1 or even larger) output muxes, with 4 dirs, LUT, FF and 10 or more far inputs.

Logic Cell (LC):

Should be able to emulate as many 74((L)S)(X)xx series parts as possible. So that existing designs and thinking port as easily as possible.

Standard 4LUT + D-FF. Other additional features only if they do not add significant costs (= cell % larger than the % cells they save). Even then when in doubt leave away (less work, less documentation, simpler tools).

LUT inputs numbered A..D (like in 74181 A..B), not I1..I4 as traditional FPGAs do. LUT outputs F (function, as in 74181) and FF outputs Q (as in 7474). FF inputs just data from LUT and separate clock+reset.

Separate F and Q outputs only if no significant cost increase in routing, else use an output O with Mux F/Q like in PALs or CPLDs.
Together with no separate carry outputs this is only 1/3 of the outputs of Virtex devices.

Wide Logic:

No F5/F6 Muxes (require 5th Bx input Mux per LUT, +20% inputs). Use simpler OR-chaining of LUTs: previous LUT AND controlbit goes into the OR of this LUT's 16:1 Mux. Due to only one bit and simple AND and extended OR, have OR-chaining in both data path width (for control logic) and data bus length (for large processing elements) directions.

This can do:

- n*4OR directly, usually done in data path width direction, but can be (mxn) for less chaining links delay
- n*4NAND (with inverting inputs), and so wide-AND if "receivers" invert also
- (m*4)x(n*4) OR-of-ANDs by doing NAND-of-NANDs, this can be done with AND and OR in 2 directions, reducing chaining links
- n*(2OR-of-2-ANDs, A&B|C&D), "select line Muxes", replacing internal tristate logic, this is best done using data path direction OR-chaining
- 2*(2:1-Muxes) with AB data, C Sel0, D Sel1|/Sel1. First level 2:1 in LUTs (ABC) and second level 2:1 ANDs in LUTs (first level + D) and OR from 1 stage OR-chaining. Above 4:1 need to use "select line Muxes" or cascade 4:1 Muxes.

No merging of 2 or 4 4LUTs to make 5LUTs or 6LUTs (is seldom used anyway).

Logic Block (LB):

4x1 LCs, 4 in data path width direction and 1 in data bus direction, same arrangement as the XC5200 used. This 4x1 is "half way" between Virtex 2x1 slice and Altera 8x1 LAB, and near to Virtex-II pair of 2x1 slices.

Common FF (Clk and Reset) inputs for all 4 FFs in the LB. If taken from routing this gives 4*4+1+1=18 inputs, 3/4 of the 24 inputs of the equivalent 2 Virtex slices. Possibly no separate reset input, as only needed in FSM FFs and PC registers, so input Mux is wasted everywhere else.

Alternative: take Clk and Reset from outputs of data path width direction LSB side LCs, which requires no extra input Muxes for Clock or FF. This fits resetting registers, and is a small waste for FSM bits.

Making register sets:

LUT-RAMs are surely patent encumbered, so avoid them.
Instead replace the 4 FFs with a RAM of 4*16FFs (like 74(x)89), call these FF-RAMs. In "normal" operation Address=0 reduces them to single FFs.

For RAM addressing !=0:

- SP-RAM: all 4 Dn(n=0..3) to address, leaving 3LUTs for data path.
- DP-RAM: use all 4 Cn(n=0..3) as write address, leaving 2LUTs. This even allows DP-RAM without using 2 4LUTs per bit.
- Possibly also a DP-RAM mode with 3-LUTs and 2 Ds for each address, but only 4 addresses/registers. So read addr can be 0/D, write addr 0/D/C/D(3-n), with 1 address bits config bits per LC.

Alternative: take address(es) from data path width direction LSB side LCs. At each LB mux amplifying like for control lines routing. Feed in from control section address generator logic, before LSB. This also allows more than 4 address lines, and so 4*32 or 4*64 bit RAMs. Needs 8 to 12 "driver" LUTs, but register sets have small driving logic, no FSMs, little decode.

Clock as usual. Reset allowed to act on only 1 of 16 bits. On row/column of 4 or all 16 bits only if no extra cost.

Alternative: single resettable FF and non resettable RAM in each LC.

This also gives a clean split of config bits (including LUT patterns) and debug data bits access. Allows debug reads without stopping chip for "reconfig". Even debug writes if proper semaphores are used.

No support for 4*16 -> 2*32 or even 1*64. Better use LUTs and OR-chaining. With 4*32 or 4*64 no need for this anyway.

BRAMs:

Just one/two large SRAM(s) at end(s) of data path is inferior when having multiple independent logic sections (CPU, peripherals) on the chip. Multiple small size BRAM stripes are better. Ideal is RAM at LC granularity, so that it is available "at any time".

BRAMs are most likely patent encumbered, having appeared in the middle of the XvsA patent war. So better many large FF-RAM tiles. This needs 4*32bit (for Virtex-E density) or 4*64bit (for Virtex-EM or Virtex-II RAM density).
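
The FF-RAM tiles that this section relies on can be sketched behaviourally as follows. This is only a minimal Python model under stated assumptions: the class name FFRam, its method names, and the policy that reset clears only word 0 are made up for illustration and are not fixed by this spec.

```python
class FFRam:
    """Behavioural model of one LB's FF-RAM: 16 words of 4 bits (like a 7489).

    Hypothetical sketch: names and the reset policy are assumptions.
    """

    def __init__(self, words=16, width=4):
        self.mask = (1 << width) - 1
        self.cells = [0] * words

    def clock(self, data, write_addr=0):
        """On a clock edge store the 4 LUT outputs at write_addr."""
        self.cells[write_addr] = data & self.mask

    def read(self, read_addr=0):
        """Q outputs; a read address separate from the write address
        corresponds to the DP-RAM mode."""
        return self.cells[read_addr]

    def reset(self):
        """Assumed policy: reset acts on word 0 only (1 of 16 bits per LC)."""
        self.cells[0] = 0


ram = FFRam()
ram.clock(0b1010)                # address 0: behaves like 4 plain FFs
ram.clock(0b0110, write_addr=5)  # register 5 of a 16 entry register set
```

With both addresses left at 0 the tile is indistinguishable from 4 ordinary FFs, which is the "normal" operation case.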
Use the LCs' LUTs and data direction OR-chaining to select from multiple FF-RAM outputs, to get the sizes for caches or video memories. For this LUT usage FF-RAM input can not be via LUT, so FF-RAM input with Mux for A or LUTout.

Arithmetic and Carry:

F(A,B,Carryin) = A XOR B XOR D fully computed by the LUTs, not just 2nd XOR like in XC4000 or first XOR like in Virtex.

Carryout(An,Bn,Carryin:n=0..3): provide for each LB a 4bit lookahead carry, like used in 7483/74181+74182, but these then serial like in all FPGAs, no 74182 equivalent. This is the main reason for the 4x1 LB format, the FF(-RAM) Clock/Reset(/Addresses) being only a secondary beneficiary.

For this wire:

- CarryAn and CarryBn hardwired from An and Bn of this LB. Increment mode with only one An (or Bn?) used for data, leave other input free for logic (such as Mux after increment for PC registers)
- Carryin hardwired from Carryout of previous LB. Set first Carryin by an LUT with A and B = 0 or 1 or from carry flag
- Carryout to least-used input D(n+1) Mux (so no slowing non-arithmetic use with extra Mux in D line(s) as XC4000 has, no slowing with LUT/Fx/XOR Mux as Virtex has). This also allows conditional adding, using D only in add case (also for Mux case of PC registers)
- For subtraction invert CarryB using C (add/sub select). So in lookahead input from B an XOR driven by C if enable config bit (1 for all 4 bits) is set.

Alternative: have a dedicated add/sub selector, also coming from control section LUTs.

For speed, control signal and carry ripple should run in the same direction. So control logic at LSB side of data path width. Carry at MSB hits side of chip. As FFs are directly after logic, routing carry back to control is in the critical time path, so put an additional FF for Cout in the left IOBs.

For patent avoidance carry logic possibly not in LCs, but in separate carry stripes (like BRAMs), only connected to data bus direction routing. One carry chain per 2 or 4 LCs.
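
The carry scheme above (4bit lookahead inside each LB, serial between LBs, sum bit as A XOR B XOR D in the LUT, subtraction by inverting B) can be checked with a small Python model. This is a sketch only: the function names lookahead4 and add, the 16bit default width, and applying the add/sub inversion to both the lookahead and the LUT input are assumptions for illustration.

```python
def lookahead4(a, b, cin):
    """4bit lookahead carry for one LB, 7483/74181 style.

    Returns ([carry into bit 0..3], carry out of the LB).
    """
    g = [(a >> n) & (b >> n) & 1 for n in range(4)]    # generate
    p = [((a >> n) ^ (b >> n)) & 1 for n in range(4)]  # propagate
    c1 = g[0] | (p[0] & cin)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & cin))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & cin))
    return [cin, c1, c2, c3], c4


def add(a, b, lbs=4, subtract=False):
    """Chain of LBs, Carryout of one LB hardwired to Carryin of the next."""
    carry = 1 if subtract else 0      # first Carryin set by an LUT
    result = 0
    for i in range(lbs):
        na = (a >> (4 * i)) & 0xF
        nb = (b >> (4 * i)) & 0xF
        if subtract:
            nb ^= 0xF                 # XOR on B driven by add/sub select
        carries, carry = lookahead4(na, nb, carry)
        for n in range(4):
            # LUT function F = A XOR B XOR D, with D fed from the carry
            bit = ((na >> n) ^ (nb >> n) ^ carries[n]) & 1
            result |= bit << (4 * i + n)
    return result, carry


total, cout = add(0x1234, 0x0FCD)
diff, no_borrow = add(0x1234, 0x0FCD, subtract=True)
```

Only the carry out of each LB travels between blocks, so a 16bit add crosses 4 carry stages instead of 16.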

General Routing:

Possibly data path width horizontal, MSB left, LSB right, as most hackers are used to this arrangement from register diagrams in CPU data sheets. Data path depth growing top->bottom, same as text, code, memory diagrams etc do. Also in a vv-like tool this gives tags at left (like comments). And allocation (and vd dump) order is same as vv column before row order. This familiarity is newbie friendlier than the FPGA traditional MSB top and LSB bottom.

If not like above, then traditional MSB at top arrangement.

Width (control) and depth (data) are nearly separate and have different uses and requirements. So consider them as 2 separate topics.

Data comes preferably to A|B, seldom C, never D. Control to D|C, seldom B, never A. Use this to reduce A and D input Mux size. Because of this Carryin is in D Mux.

Routing in 4x4 logic groups (LG), to get the same 4 times faster propagation in data bus depth direction routing as in data path width. Also symmetrical for routing software. Gives LG of 4 LBs of 4 LCs each. Also avoids possible patent problem with differentiated length routing. If that is no problem, possibly also 16x16 "supergroup" fast transit routes.

Routing Mux based, as this avoids chip damage with bugs in routing software. Traditional Mux has fewer config bits/selection, but more complex logic to decode the selection. So use OR of multiple (signal AND confbit). Is safe like Muxes, and allows multi-ANDs being auto-ORed, like open collector wire-OR.

PIP signals go through one transistor gate, vs Mux through 2 transistor stages, which is slower. But Mux amplifies and solves fanout problems and has no T/C low pass filter effect. And transistors are faster than interconnect delay.

Inputs to LUTs also with such wire-ORs, also short-safe, and can save final ORs in control signal part, use less space. Suggests also an FPGA with only NOT and AND instead of LUTs, no LUT patent collisions.
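
The "OR of multiple (signal AND confbit)" routing mux described above can be modelled in a few lines of Python. This is a sketch assuming one config bit per mux input; the function name route_mux and the example values are hypothetical.

```python
def route_mux(signals, conf_bits):
    """OR of (signal AND confbit) over all inputs of one routing mux.

    signals and conf_bits are equal-length sequences of 0/1 values.
    """
    out = 0
    for signal, conf in zip(signals, conf_bits):
        out |= signal & conf
    return out


inputs = [1, 0, 1, 0]

# Exactly one config bit set: acts as a plain 4:1 mux.
selected = route_mux(inputs, [0, 1, 0, 0])

# Two config bits set: inputs are ORed, like open collector wire-OR.
wired_or = route_mux(inputs, [0, 1, 1, 0])

# No config bit set (e.g. buggy routing software): output is a defined 0.
unrouted = route_mux(inputs, [0, 0, 0, 0])
```

Whatever pattern of config bits is loaded, the output is always a defined logic level, which is the short-safety argument made for this style over PIPs.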
Scatter repeater Muxes by breaking 1/4 of lines in each LC's vicinity, so all are broken once inside each LG. Have outgoing drivable from incoming of 3 directions, and by that LC. Have each LC input come from any line originating in or crossing through that LC.

Presently 8 or 12 or 16 wires per LC (2 or 3 or 4 per switch) routing density.

Width/Control Routing:

Takes mainly control signals to data path LUTs and clock/reset to data path FFs (unless dedicated for this). Every LUT's 3 control inputs and the common clock/resets should be drivable from these.

These are all long distance, delay critical and many loads / high fanout. So run these in LB-wide (4 LC wide) sections, so delay is hidden behind identical carry delay. These should be drivable at each LB stage by corresponding LCs' outputs or the previous corresponding segment. Mux for this is also amplifier, buffering the many loads. All run right->left, no left->right. This is nearer to XC[234]000 or even more XC6200 routing than to XC5200|Virtex|later.

But also wiring between LCs in control section, short distance, both directions, so only neighbor LUT and single segments. But this can also use long distance stuff, without extension segments, as that is not in use here.

Also Cout/MSB to control or Cin or LSB input, or input data from left IOBs. May want speed, only one load, so single global width line, no amplifiers needed. Or simply smaller set of segmented routing.

Also data exchange for shift registers or bit swaps, may want speed, and are multiple, variable distance, usually one load. In both right->left and left->right directions. But seldom and can so be scattered over multiple depth stages. May be most difficult to define. Use segmented also for this direction, no global.

So have 1/4 routing MSB->LSB, 3/4 routing LSB->MSB.

Depth/Data Routing:

Drives mainly data and instructions between function units, registers and memory.
Medium distance, registers to multiple function units have multiple loads, so may require amplified multi-LB segments top->bottom. Back up usually single target medium distance with large Mux somewhere, no amplification.

No need for jumping bits, apart from swap operations. No 2-LUT+F5 functions with 2 bits in same bit of 2 different rows, so no need for 45 degree line to output Mux, to un-zigzag bits so that data appears as all from one row. This also avoids zigzagging bits and needing un-zigzag routing.

Both directions symmetrical, 1/2 and 1/2 routing for up and down.

No TBUFs as real ones are slow and need global lines, and simulated TBUFs with muxes are patented. Instead use explicit "select muxes" from LCs, 2 data lines and 2 enables per LUT. Or for this in data bus direction use Mux "wire-OR" for chaining outputs.

2 wire-ORs in both directions and ORs in LUT inputs allow emulating the "simulated TBUF" in software, without hardwired features. Patent trouble only for feature compiler maker, not chip maker.

Clock distribution network:

Modern dedicated clock network is surely patented (unless not patentable because PALs/CPLDs already did it, which is unlikely), so avoid it. Can use long lines or segmented routing, as XC[23]000 already had that.

Using normal routing for clock increases skew, but this is not so critical so long as it is done predictably. The LSB is computed faster anyway (carry delay, and control signal propagation) so have clock following carry at same speed, hides delay like for control signals. So use control LSB->MSB signals for clock. No need for clock (or reset) from data bus depth lines.

Actually better, as one can clock faster, as no need to wait for the MSB to compute. Need to look out with MSB->LSB running signals, such as carry->LSB, as 2 directions' propagation time is added to logic, but usually not much logic in using them, not critical.
Drive state machine FFs in control section from source even further "LSB", or possibly drive from between data path and control, delaying state machine.

So vertical clock in LUT column next to data path, or at most LSB (possibly even in the right IOBs, using "from LSB segment" input to routing). Gives "E" clock network, unlike the normal FPGA "H" network.

Data bus direction takes time from registers to function units, in both directions, so no clock from top. Drive from middle of LSB side, to halve skew. Or even drive clock with zigzag lines/segments, to parallel data propagation time.

This gets rid of need for separate clock and CE inputs, in case we can not have one. Just gate clock with an AND up in control section, when distributing from vertical to horizontal.

Note that this requires clock enable signals to be active (1) before clock hits its second (1) phase, assuming FFs that act on falling edge. Or use some latch in the LUT to keep clock up when enable goes down.

IOBs:

1 pin per LC format.

Data bus end sides are ideal for connecting memory data and wide bus peripherals. So put in wide tri-states, controlled fast like data path controls and FF clocks, can for this accept only 1 tri-state control for group of 4, like only 1 FF clock. No FF in tristate needed, but possibly nice for pipelining.

At least top data bus end (at least from MSB side first n*32/36) pinout all of them, with "dropped" IOBs all outside this area. Drive data bus routing directly into IOB, losing need for IOB ring routing (if patent problematic).

LSB side is nearest to control, so good for clock in. So possibly/likely additional clock distribution lines in them. Also most likely control signal outputs, as near to state machine and decoders.

And MSB side is nearest data path, so good for thin peripherals, such as RS232 or ethernet shift registers. So these may need single bit tri-states, but don't need speedy direction switching, control by normal logic.
MSB has the additional Carryout FF+Mux in it, only for feedback to logic. LSB has Carryin source in it, only from logic.

SoGs have one pad/cell, our cells are 3x3 SoG cells, so perhaps double or even 3 times as many pads as cells (but is this useful?). Possibly needed if multiple small chips are added up to a larger cell space.

If making an SoG, use one IOB per 2 cells, 1 cell out data, 1 cell out tristate, in data goes to both cells' inputs. Or one IOB per cell, and tristate from neighbor IOB, with pass through mechanism.

Voltages: for IO it is sufficient to support TTL or LVTTL, as most IOs will require bus buffers or analog stuff anyway. LVTTL without buffers is more important than exotic voltages. As there are many chip producers there is no need for many IO standards, so no problem with 5V because of that. If LVTTL is not achievable and buffers are needed, then just core voltage.

Programming:

Bitstream format open documented, is central feature. Is not even a chip feature, but just a policy/attitude feature of the developers. Damn vendors.

SRAM based config, as this does not require special PROM or EPROM or Flash processes to make the chip, so cheaper. If many small chips then possibly go for Flash or for OTP.

SRAM booting is annoying and config PROMs/Flashes/CF add cost and boot delay, so if possible make SRAM "constant memory" like XC[23]000 had (but with separate VCCCONF) and lithium cell to keep configuration alive (like for Virtex-II 3DES key).

Loading config either from devel PC or "PC independence" module with Flash or bat-SRAM and uC or CPLD. So standard is slave mode, either pin-saving serial, or fast parport/uC friendly 8bit. Optionally a master mode, serial from standard SEEPROM/ser-Flash parts or using MSB data bus pins from standard 8bit SRAM or 29F Flash chip (addr from MSB side, config select from IOBs with boot=1, reboot=x).

Config frames run horizontally, so one can replace sections of logic without touching others.
Between 64 and 256 config bits per LC, so 256 to 1024 per LG. If routing in IOB ring, then separate bits for LBs and IOBs, so IO stays under control when reconfiguring logic.

Possibly even offer to just replace a part of a row, best even individual LGs. This requires an LG to be 2^n bits wide, ideal n>=5.

Strict separation of config bits (incl LUT patterns) and register access for debug. No FF, FF-RAM or IO-FF bits in config bit space. Only set these per reset logic or data inputs or debug circuit.

Debug output should not have to stop operation. Even reconfiguring should not stop, so long as the "missing" part is not in actual use (simply gives random data). This suggests reloadable at LB level, not just entire width.

Possibly keep a subset of the FPGA running while reconfiguring the rest. For "joining" of constant and variable parts use some standard of using routing. Or possibly some "interconnect points" (in XC4000 and Virtex often TBUFs were used).

Possibly access to debug or config circuits from running FPGA's logic (by unused IOBs driving config circuit, or better external config pins also having IOB).

Sizes:

Square array of LUTs. Logarithmic (2^n)x(2^n) sizes. Possibly plus intermediate (1.5*2^n)x(1.5*2^n) sizes.

Minimal useful is 4x4=16 LCs (about 1 22V10 PAL worth), which has the problem of also 4*4=16 IOBs, so 50% overhead (assuming IOBs are LC size), so better use a PAL (smaller ones are documented).

Even a PAL-like "LUT-PAL" FPGA with just 2 columns of 8 LCs and everyone-to-everyone non segmented routing would be better than the full 4x4 thing. Or alternatively an open SoG FPGA for this. A "LUT-PAL" may be cheaper up to 48 FFs and useful up to 96 FFs, with then "LUT-CPLD" or SoG up to 256..384 FFs.

8x8=64 with 4*8=32 IOBs is 33% overhead and still in CPLD/SoG range. 16x16=256 with 4*16=64 IOBs is still 20% overhead. 32x32=1024 with 4*32=128 is only 11%. 64x64=4096 with 4*64=256 is down to 6%.
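
The overhead figures above can be recomputed with a few lines of Python. This sketch assumes, as the text does, that an IOB takes the same area as an LC, and it reads "overhead" as the IOB share of all cells (LCs plus IOBs); the function name iob_overhead is made up for illustration.

```python
def iob_overhead(n):
    """IOB share of total cells for an n x n LC array with 4*n IOBs.

    Assumes IOBs are LC size, as stated in the text.
    """
    lcs = n * n
    iobs = 4 * n
    return iobs / (lcs + iobs)


for n in (4, 8, 16, 32, 64):
    print(f"{n}x{n}: {n * n} LCs, {4 * n} IOBs, "
          f"{iob_overhead(n) * 100:.0f}% overhead")
```

Since the LC count grows with n*n and the IOB count only with n, the share simplifies to 4/(n+4), so each doubling of the edge length roughly halves the overhead once n is large.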
So start with smallest sensible 16x16 (which will just fit simple 8bit or 12bit designs). But also look at ASIC vendors' size/cost curve and mask cost vs chip count per minimal wafer count (so that mask is not too dominant, and not too many small chips).

Packages can be what type the production organiser wants, so far as the chosen ASIC vendor offers a selection. This can include 2.54mm (100mil) friendly packages.

8x8 LCs (4*8=32 IOs) can fit PLCC44 or even DIL40. 16x16 (4*16=64) will fit PLCC84 or PGA84. 32x32 (4*32=128) and 48x48 (4*48=192) will stand either TQFP or PGA (or BGA if someone wants it). 64x64 (4*64=256) is PGA and BGA only, but still smaller than the PIII/Celeron PGA370, even 96x96 (4*96=384) fits nearly into PGA370.

128x128 (4*128=512) and above will require a large case, most likely only sensible in BGA, but that is going near top range of Xilinx stuff (312x208 LCs (4*312=1248 IOs) for XCV3200E), so most likely not possible for FPGA in ASIC anyway.

In the same way process/voltage/IO voltage can be traded off by users, so no need for most likely patented multiple IO standards.

In the beginning most likely just single "by request" sizes, someone makes a batch and offers the ones not used by themselves for sale.

As users ordering a batch of chips have a wide size/package/voltage selection, this suggests a website for offering such surplus to people who do not want to arrange their own production run. Also may offer bids so producers know what sort of surplus will sell well. Also trade experiences/featuresets knowledge on ASIC vendors' versions.

For first run single size. Because of min chip count order, possibly smallish size with a 2x2 or 4x4 array of chips for large circuit. Larger cost per board but cheaper chip order, and fewer boards to spread it over. And sellable in finer granularity, to low-cost users. But this requires large amount of IO, else chip borders are a large problem.

Patents:

XC2000 sets up basic FPGA technology, grid/array of LUTs for logic (replacing PAL/CPLD product terms), FFs and Muxes, routing (direct, same size segments switched, long lines, IO direct, IO clocks), PIPs, IOB input FF+Mux and output tristate, clock from routing segmented/vert-long (replacing PAL/CPLD dedicated clock networks).

Important patents, according to c.a.f, are the 2 "Freeman" patents. Altera did not risk segmented routing until they had an agreement, so that must be one of them. But they were sued even without that, over some other patent (LUTs?).

Both of these are said to run out in 2006, so in 3 years. This project will surely need that long to get up to commercial sales. Anything before that (no manufacturing or sales) is outside of the scope of patent law. So we can ignore these.

Also the 2006 date means registered in 1986, 1 year after XC2000 went on sale (and so was documented in its data sheet). This suggests no European patent possible, as this would need patent registration/issue before publication (see how RSA lost theirs). So either rejected or not even attempted is to be expected. How much of the later stuff also has European patents?

XC3000 seems to be similar to XC2000, with added double size LUTs (16+16 instead of 8+8). Direct D input to FFs (not via LUT), Clock Enable (Mux with Q feedback to D), separate reset input from routing, TBUFs (internal tri-state on longlines for buses), IOB output FF and slew rate and passive keepers.

None of these are essential for us. Separate reset is nice, we only need reset in FSM state FFs, so can sacrifice an LUT input. CE is nice, but is not crucial either.

XC3100 just adds size and speed, as far as I can see.

XC4000 in about 1992 adds lots of stuff, 2*4 independent input Muxes, H 3LUT, independent X|XQ|Y|YQ outputs, LUT-RAMs, carry-before-LUTs stuff, "interleaved" dual length routing, wide edge decoders, Clock Enable in IOBs, routing ring in IOBs, dedicated clock networks.
Patents on it will only run to about 2012, so another 9 years. So this one is the hard object to avoid collision with. Independent inputs are most likely not patented. We do common output, no H, RAMs different, carry different (I do not even comprehend part of Xilinx's complicated circuit!), no tri-state, wide stuff in logic OR-chaining, no clock networks. So only IOBs or even more the IOB routing may be a problem.

XC4000X adds FF latch mode, more routing, quad length "interleaved" routing, octal length "interleaved" IOB routing ring, more clocks, more size and speed.

XC5200 is a bit later, ca 1995, adding 3 years to get 12. But added only 4x1 format, carry-after-LUTs stuff (using 2nd XOR in separate 2nd CLB), carry-wide-AND|OR, and GSM style routing (but with "ring road" routing), dropping LUT-RAMs and IO FFs.

Of these none should be a problem, 4x1 is too generic, we have carry before LUTs, OR-chaining instead of F5/F6, no "ring road" style GSM.

XC6200 was a totally different design, SoG instead of LUT based, derived from Algotronix CAL1024. Main addition is multi-level local+fast routing, with routing switches.

Virtex is a bit later still, ca 1997, adding 2 years to get 14. Added slices, F6, SRL16, BRAMs (which were already used by Altera), DLLs, out tri-state FF, IO standards, direct GSM routing, pseudo-tri-state, readded LUT-RAMs.

We have no slices, no F6, no SRL16, no BRAMs, no DLLs, only TTL or LVTTL, no tri-state, so only GSM or out tri-state FFs may be relevant. This can be avoided with XC[234]000 or even 6200 style routing if necessary, and non-FF out tri-state. So no large problem.

Virtex-II is newest, after 2000, so giving near 20. But added only (2*2)x2, separate write address, MuxCY, multipliers, DCMs, DDR IO, DCI, active interconnect.

We have none of these, apart from routing muxes being active, but they are far more than just amplifiers, so irrelevant.

Virtex-IIpro adds PPC hard cores and RocketIO. None of interest here.

Outside of Xilinx (earliest and most dominant, and DSP/datapath oriented) there are also Altera (main early vendor) and possibly others (who?).

Altera (at least earlier Flex* chips) uses different carry logic than Xilinx, before LUTs like XC4000, no complex circuit, so quite similar to what is intended here. But no subtraction inverter and no look ahead (Xilinx patent too broad?).

Altera wide logic is AND based and after LUT, not OR merged into LUT 16-OR. So different design.

Altera had a clock network from the beginning, coming from CPLDs. But PALs also have this so possibly not patentable.

Altera BRAMs were before Xilinx. But no interest in BRAMs here for the beginning.

Atmel small "BRAMs" are separate from LUTs. But also separate from FFs, so different from what is intended here.

Atmel routing with switches between blocks of LCs. May collide here, unless XC6200 was earlier.