Message-ID: <3C5605B1.1523@designtools.co.nz>
From: Jim Granville
Reply-To: jim.granville@designtools.co.nz
Organization: Mandeno Granville elect
X-Mailer: Mozilla 3.0C-XTRA (Win95; I)
MIME-Version: 1.0
Newsgroups: comp.arch.fpga
Subject: Soft errors climb in 0,13u SRAM
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Lines: 23
Date: Tue, 29 Jan 2002 15:15:13 +1300
NNTP-Posting-Host: 203.79.98.26
X-Complaints-To: abuse@tsnz.net
X-Trace: news02.tsnz.net 1012273560 203.79.98.26 (Tue, 29 Jan 2002 16:06:00 NZDT)
NNTP-Posting-Date: Tue, 29 Jan 2002 16:06:00 NZDT
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!out.nntp.be!propagator-SanJose!in.nntp.be!newsfeed01.tsnz.net!news02.tsnz.net!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13709

This looks interesting :

http://www.ebnonline.com/story/OEG20020128S0079

 It has many spins (as expected :) on the problem, but it
seems very relevant to fast-shrink-path SRAM FPGAs.

 Is there any information on the 'disturbance energy' for the
various SRAM components of FPGAs :-

 - The Config Cells ( Slower/larger, but not zero FIT ? )
 - The Fast SRAM Blocks
 - The sea of registers
 - The LUT array store

 The SRAM notes do not mention register errors, but since a SRAM
cell is effectively a strobed latch, is it possible to disturb
the BIT state in a latch, just as in a SRAM cell ?

 Error recovery in data blocks is possible with correction, but
how does a system detect config or latch errors ?

-jg

######

From: Michael Boehnel
Newsgroups: comp.arch.fpga
Subject: Re: Soft errors climb in 0,13u SRAM
Date: Tue, 29 Jan 2002 21:58:59 +0100
Organization: ITI
Lines: 16
Message-ID: <3C570D13.C4C5471@iti.tu-graz.ac.at>
References: <3C5605B1.1523@designtools.co.nz>
NNTP-Posting-Host: 129.27.146.74
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: fstgss02.tu-graz.ac.at 1012337861 11825 129.27.146.74 (29 Jan 2002 20:57:41 GMT)
X-Complaints-To: news@zid.tu-graz.ac.at
NNTP-Posting-Date: 29 Jan 2002 20:57:41 GMT
X-Mailer: Mozilla 4.71 [en] (WinNT; I)
X-Accept-Language: en
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!newsfeed00.sul.t-online.de!newsfeed01.sul.t-online.de!t-online.de!newsfeed.stueberl.de!newsfeed.online.be!newscore.univie.ac.at!aconews-feed.univie.ac.at!news.tu-graz.ac.at!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13719

Jim Granville wrote:

> Error recovery in data blocks is possible with correction, but
> how does a system detect config or latch errors ?

I can imagine at least 2 possibilities:

i) By continuously reading back the configuration and comparing it
with a reference bitstream

ii) Built-in self-tests and many other fault-tolerant system design
concepts (TMR, ..).

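As a rough illustration of possibility i), here is a minimal Python sketch of a readback-and-compare check. It is only a model of the idea, not any vendor's readback interface: the function name find_upset_frames, the 4-byte frame size and the byte-string images are all assumptions made up for the example. A real device would probably also need a mask for bits that legitimately change at run time (register and LUT RAM contents).

```python
def find_upset_frames(readback: bytes, golden: bytes, frame_bytes: int = 4) -> list[int]:
    """Return the indices of frames whose readback differs from the reference.

    frame_bytes is a hypothetical frame size chosen for the example only.
    """
    if len(readback) != len(golden):
        raise ValueError("readback and golden images must be the same length")
    upset = []
    for start in range(0, len(golden), frame_bytes):
        end = start + frame_bytes
        if readback[start:end] != golden[start:end]:
            upset.append(start // frame_bytes)
    return upset


# Toy reference image and a readback copy with one flipped bit in byte 9.
golden = bytes(range(16))
readback = bytearray(golden)
readback[9] ^= 0x04

upset_frames = find_upset_frames(bytes(readback), golden)
print(upset_frames)  # byte 9 lies in frame 9 // 4 == 2
```

Reporting frame indices rather than a single pass/fail flag means only the damaged frames would need rewriting, assuming the device allows partial reconfiguration.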
Michael

######

From: Austin Lesea
Newsgroups: comp.arch.fpga
Subject: Re: Soft errors climb in 0,13u SRAM
Date: Tue, 29 Jan 2002 17:07:58 -0800
Organization: Xilinx
Lines: 146
Message-ID: <3C57476E.323A4B64@xilinx.com>
References: <3C5605B1.1523@designtools.co.nz>
NNTP-Posting-Host: 149.199.9.10
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="------------E55D00EB7E1675EE47A9A030"
X-Mailer: Mozilla 4.77 [en]C-CCK-MCD (WinNT; U)
X-Accept-Language: en,pdf
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!newsfeeds.belnet.be!news.belnet.be!newsfeed.stanford.edu!canoe.uoregon.edu!arclight.uoregon.edu!wn4feed!wn3feed!worldnet.att.net!206.191.82.230!prairie.attcanada.net!newsfeed.attcanada.net!12.127.17.144!attbt1!attbt2!ip.att.net!newsgate.xilinx.com!cliff.xsj.xilinx.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13587

--------------E55D00EB7E1675EE47A9A030
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Jim,

Detection of errors in the configuration is by readback and checking
checksums. Many hi-rel (military) users have found that continually
reprogramming is easier and accomplishes the same thing (in high
altitude or space applications, SEUs are 10,000X to 100,000X more
frequent, and often occur daily).

Detection of transient SEU errors in the logic requires the logic to
have its own checks. Triple Modular Redundancy (TMR) with redundant
voting is used in critical cases. Block RAMs are used with ECC logic:
errors found on readback are corrected, and the corrected word is
written back in through the other port.

To take this to another domain, when I designed systems for the telcos,
I had to anticipate failures, both soft and hard, and provide recovery
for both (error free for both....). At the system level, I had
redundant elements, and check and correct circuits.

Now that the systems are on a chip, it is a natural progression to use
the same techniques inside the FPGA that used to be used with
collections of FPGAs.

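To make the two logic-level techniques above concrete, here is a small Python model of a TMR majority voter and of an ECC read, correct and write-back pass over a memory. It is a behavioural sketch only, not how any shipping core works: the Hamming(7,4) code, the function names (tmr_vote, hamming74_encode, hamming74_correct, hamming74_decode, scrub) and the list-of-bits memory are assumptions chosen to keep the example short.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a value."""
    return (a & b) | (b & c) | (a & c)


def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: list[int]) -> tuple[list[int], int]:
    """Return (corrected codeword, syndrome); syndrome 0 means no error seen."""
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos  # XOR of set-bit positions locates a single flip
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


def hamming74_decode(code: list[int]) -> int:
    """Extract the 4 data bits from a (corrected) codeword."""
    return code[2] | (code[4] << 1) | (code[5] << 2) | (code[6] << 3)


def scrub(memory: list[list[int]]) -> int:
    """Read every word, correct single-bit errors, write the fix back."""
    corrected = 0
    for addr, word in enumerate(memory):
        fixed, syndrome = hamming74_correct(word)
        if syndrome:
            memory[addr] = fixed
            corrected += 1
    return corrected


# One of three register copies takes a hit; the vote still returns the good value.
voted = tmr_vote(0b1010, 0b1010, 0b0010)

# A toy "block RAM" of three words; flip one stored bit, then scrub it.
memory = [hamming74_encode(n) for n in (0x3, 0xA, 0xF)]
memory[1][5] ^= 1
fixed_words = scrub(memory)
print(voted == 0b1010, fixed_words, hex(hamming74_decode(memory[1])))
```

The voter masks a single upset copy but does not repair it, while the scrub pass repairs the stored word, which is why the two are usually described as complementary. Note that this code only handles one flipped bit per word; two flips in the same word would be miscorrected.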
.22->.18->.15->.13->.10->.07 is Moore's 'Law', and the effects of SEUs
must be taken into account by the IC designers. Designs can be
completely hardened at the expense of area, but how much larger area,
and hence larger cost, will a customer want to pay for a "feature" that
may not apply to their market?

As it is all programmable logic, if SEUs can be obviated by programming
more gates, then those who require that level of availability can 'pay'
for it through their usage of gates. Those with non-critical
applications, who don't care, can benefit from the increased density
and reduced costs.

Many DSP processes (e.g. voice phone calls) are able to tolerate a huge
level of bad bits, as human speech is incredibly robust. Banking and
aerodynamic controls are applications where TMR is the only solution,
even if SEUs were not an issue! Telecom switching is in between: if the
level of errors is below the noise floor of all of the interconnected
elements, then no one can tell, or even know what is going on (all
service objectives and tariffs are met).

We by no means have all of the answers, and are out there looking at
all options, and providing TMR, ECC, and other programmable cores today
to our customers for their critical applications.

Austin

Jim Granville wrote:

> This looks interesting :
>
> http://www.ebnonline.com/story/OEG20020128S0079
>
>  It has many spins (as expected :) on the problem, but it
> seems very relevant to fast-shrink-path SRAM FPGAs.
>
>  Is there any information on the 'disturbance energy' for the
> various SRAM components of FPGAs :-
>
>  - The Config Cells ( Slower/larger, but not zero FIT ? )
>  - The Fast SRAM Blocks
>  - The sea of registers
>  - The LUT array store
>
>  The SRAM notes do not mention register errors, but since a SRAM
> cell is effectively a strobed latch, is it possible to disturb
> the BIT state in a latch, just as in a SRAM cell ?
>
>  Error recovery in data blocks is possible with correction, but
> how does a system detect config or latch errors ?
>
> -jg

--------------E55D00EB7E1675EE47A9A030--