Message-ID: <3C5605B1.1523@designtools.co.nz>
From: Jim Granville <jim.granville@designtools.co.nz>
Reply-To: jim.granville@designtools.co.nz
Organization: Mandeno Granville elect
X-Mailer: Mozilla 3.0C-XTRA  (Win95; I)
MIME-Version: 1.0
Newsgroups: comp.arch.fpga
Subject: Soft errors climb in 0,13u SRAM
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Lines: 23
Date: Tue, 29 Jan 2002 15:15:13 +1300
NNTP-Posting-Host: 203.79.98.26
X-Complaints-To: abuse@tsnz.net
X-Trace: news02.tsnz.net 1012273560 203.79.98.26 (Tue, 29 Jan 2002 16:06:00 NZDT)
NNTP-Posting-Date: Tue, 29 Jan 2002 16:06:00 NZDT
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.mailgate.org!news-spur1.maxwell.syr.edu!news.maxwell.syr.edu!out.nntp.be!propagator-SanJose!in.nntp.be!newsfeed01.tsnz.net!news02.tsnz.net!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13709

This looks interesting :

http://www.ebnonline.com/story/OEG20020128S0079

 It has many spins (as expected :) on the problem, but it 
seems very  relevant to fast-shrink-path SRAM FPGAs.

 Is there any information on the 'disturbance energy' for the
various SRAM components of FPGAs :-

 - The Config Cells ( Slower/larger, but not zero FIT ? ) 
 - The Fast SRAM Blocks
 - The sea of registers
 - The LUT array store

 The SRAM notes do not mention register errors, but since a SRAM
cell is effectively a strobed latch, is it possible to disturb 
the BIT state in a latch, just as in a SRAM cell ?

 Error recovery in data blocks is possible with correction, but
how does a system detect config or latch errors ?

-jg

######

From: Michael Boehnel <boehnel@iti.tu-graz.ac.at>
Newsgroups: comp.arch.fpga
Subject: Re: Soft errors climb in 0,13u SRAM
Date: Tue, 29 Jan 2002 21:58:59 +0100
Organization: ITI
Lines: 16
Message-ID: <3C570D13.C4C5471@iti.tu-graz.ac.at>
References: <3C5605B1.1523@designtools.co.nz>
NNTP-Posting-Host: 129.27.146.74
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: fstgss02.tu-graz.ac.at 1012337861 11825 129.27.146.74 (29 Jan 2002 20:57:41 GMT)
X-Complaints-To: news@zid.tu-graz.ac.at
NNTP-Posting-Date: 29 Jan 2002 20:57:41 GMT
X-Mailer: Mozilla 4.71 [en] (WinNT; I)
X-Accept-Language: en
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!newsfeed00.sul.t-online.de!newsfeed01.sul.t-online.de!t-online.de!newsfeed.stueberl.de!newsfeed.online.be!newscore.univie.ac.at!aconews-feed.univie.ac.at!news.tu-graz.ac.at!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13719

Jim Granville wrote:

>  Error recovery in data blocks is possible with correction, but
> how does a system detect config or latch errors ?

I can imagine at least 2 possibilities:

i) By continuously reading back the configuration and comparing it with a
reference bitstream
ii) Built in self-tests

and many other fault-tolerant system design concepts (TMR, ..).
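
As a rough illustration of option (i), here is a minimal Python sketch
that compares a readback image against a reference bitstream, frame by
frame.  The function name, the tiny frame size, and the idea that
readback arrives as plain bytes are all assumptions made for
illustration, not any vendor's readback interface.  The optional mask
is also hypothetical: it stands in for bits that legitimately change
during operation and should not be compared.

```python
def find_upset_frames(readback, reference, frame_bytes=4, mask=None):
    """Return indices of frames where readback differs from reference.

    mask (optional, hypothetical) has a 1 bit for every position that
    should be compared; positions with a 0 bit are ignored.
    """
    if len(readback) != len(reference):
        raise ValueError("readback and reference must be the same length")
    if mask is not None and len(mask) != len(reference):
        raise ValueError("mask must match the reference length")

    bad_frames = []
    for start in range(0, len(reference), frame_bytes):
        end = min(start + frame_bytes, len(reference))
        for i in range(start, end):
            care = 0xFF if mask is None else mask[i]
            if (readback[i] ^ reference[i]) & care:
                bad_frames.append(start // frame_bytes)
                break  # one mismatch is enough to flag the frame
    return bad_frames


reference = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
upset = bytearray(reference)
upset[5] ^= 0x08  # simulate a single flipped bit in the second frame
print(find_upset_frames(bytes(upset), reference))
```

A system would run such a comparison in a loop and rewrite (or at
least report) any flagged frame.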

Michael

######

From: Austin Lesea <austin.lesea@xilinx.com>
Newsgroups: comp.arch.fpga
Subject: Re: Soft errors climb in 0,13u SRAM
Date: Tue, 29 Jan 2002 17:07:58 -0800
Organization: Xilinx
Lines: 146
Message-ID: <3C57476E.323A4B64@xilinx.com>
References: <3C5605B1.1523@designtools.co.nz>
NNTP-Posting-Host: 149.199.9.10
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="------------E55D00EB7E1675EE47A9A030"
X-Mailer: Mozilla 4.77 [en]C-CCK-MCD   (WinNT; U)
X-Accept-Language: en,pdf
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!newsfeeds.belnet.be!news.belnet.be!newsfeed.stanford.edu!canoe.uoregon.edu!arclight.uoregon.edu!wn4feed!wn3feed!worldnet.att.net!206.191.82.230!prairie.attcanada.net!newsfeed.attcanada.net!12.127.17.144!attbt1!attbt2!ip.att.net!newsgate.xilinx.com!cliff.xsj.xilinx.com!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:13587


--------------E55D00EB7E1675EE47A9A030
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Jim,

Detection of errors in the configuration is by readback, and checking
checksums.  Many hi-rel (military) users have found that continually
reprogramming is easier and accomplishes the same thing (as in high
altitude or space applications, SEUs are 10,000X to 100,000X more
frequent, and often occur daily).
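
A minimal sketch of the readback-and-checksum idea, in Python.  The
frame size, the function names, and the use of CRC-32 are assumptions
for illustration only; the actual readback format and checksum are
device specific.  Frames whose checksum no longer matches are simply
rewritten from a known-good image, which is the "reprogram" half of
the approach.

```python
import zlib

FRAME_BYTES = 4  # hypothetical frame size, far smaller than a real device


def frame_crcs(image, frame_bytes=FRAME_BYTES):
    """CRC-32 of each fixed-size frame of a configuration image."""
    return [zlib.crc32(image[i:i + frame_bytes])
            for i in range(0, len(image), frame_bytes)]


def scrub(readback, golden, golden_crcs, frame_bytes=FRAME_BYTES):
    """Rewrite every frame whose CRC mismatches.

    Returns (repaired image, list of frame indices that were rewritten).
    """
    if len(readback) != len(golden):
        raise ValueError("readback and golden image must be the same length")
    repaired = bytearray(readback)
    bad = []
    for n, crc in enumerate(frame_crcs(readback, frame_bytes)):
        if crc != golden_crcs[n]:
            start = n * frame_bytes
            repaired[start:start + frame_bytes] = golden[start:start + frame_bytes]
            bad.append(n)
    return bytes(repaired), bad


golden = bytes(range(16))
crcs = frame_crcs(golden)
hit = bytearray(golden)
hit[9] ^= 0x01  # simulated upset in frame 2
fixed, bad = scrub(bytes(hit), golden, crcs)
print(bad, fixed == golden)
```

Blind periodic reprogramming skips the checksum step entirely and
rewrites every frame on a timer, trading detection for simplicity.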

Detection of transient SEU errors in the logic requires the logic to
have its own checks.  Triple Modular Redundancy (TMR) with redundant
voting is used in critical cases.  Block RAMs are used with ECC logic:
errors found on readback are corrected, and the corrected word is
written back in through the other port.
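
Both techniques can be sketched in a few lines of Python.  This is a
software model under stated assumptions, not a description of any
particular core: the voter is a plain bitwise 2-of-3 majority, and the
ECC is a textbook Hamming(7,4) single-error-correcting code standing in
for whatever wider code a real block RAM design would use.  All
function names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit word (bit positions 1..7)."""
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return sum(code[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_correct(word):
    """Return (corrected word, syndrome); syndrome 0 means no error seen."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    if syndrome:
        word ^= 1 << (syndrome - 1)  # syndrome is the flipped position
    return word, syndrome


def scrub_memory(mem):
    """Read every word, correct it, and write it back if it was wrong."""
    fixed = 0
    for addr, word in enumerate(mem):
        corrected, syndrome = hamming74_correct(word)
        if syndrome:
            mem[addr] = corrected  # the write-back through the second port
            fixed += 1
    return fixed


print(bin(tmr_vote(0b1010, 0b1010, 0b0010)))  # one copy upset in bit 3

ram = [hamming74_encode(n) for n in range(16)]
good = list(ram)
ram[5] ^= 1 << 2  # simulated single-bit upset in one stored word
print(scrub_memory(ram), ram == good)
```

Hamming(7,4) only corrects one flipped bit per word; two upsets in the
same word between scrub passes would be miscorrected, which is one
reason the scrub has to run often enough.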

To take this to another domain, when I designed systems for the telcos,
I had to anticipate failures, both soft and hard, and provide recovery
for both (error free for both....).  At the system level, I had
redundant elements, and check and correct circuits.

Now that the systems are on a chip, it is a natural progression to use
the same techniques inside the FPGA that used to be used with
collections of FPGAs.

.22->.18->.15->.13->.10->.07 is Moore's 'Law', and the effects of SEUs
must be taken into account by the IC designers.  Designs can be
completely hardened at the expense of area, but how much larger area,
and hence larger cost, will a customer want to pay for a "feature" that
may not apply to their market?

As it is all programmable logic, if SEUs can be obviated by
programming more gates, then those who require that level of
availability can 'pay' for it through their own gate usage.  Those
with non-critical applications, who don't care, can benefit
from the increased density and reduced costs.

Many DSP processes (e.g. voice phone calls) are able to tolerate a huge
level of bad bits, as human speech is incredibly robust.  Banking and
aerodynamic controls are applications where TMR is the only solution,
even if SEUs were not an issue!  Telecom switching is in-between:  if
the level of errors is below the noise floor of all of the
interconnected elements, then no one can tell, or even know what is
going on (all service objectives and tariffs are met).

We by no means have all of the answers, and are out there looking at all
options, and providing TMR, ECC, and other programmable cores today to
our customers for their critical applications.

Austin

Jim Granville wrote:

> This looks interesting :
>
> http://www.ebnonline.com/story/OEG20020128S0079
>
>  It has many spins (as expected :) on the problem, but it
> seems very  relevant to fast-shrink-path SRAM FPGAs.
>
>  Is there any information on the 'disturbance energy' for the
> various SRAM components of FPGAs :-
>
>  - The Config Cells ( Slower/larger, but not zero FIT ? )
>  - The Fast SRAM Blocks
>  - The sea of registers
>  - The LUT array store
>
>  The SRAM notes do not mention register errors, but since a SRAM
> cell is effectively a strobed latch, is it possible to disturb
> the BIT state in a latch, just as in a SRAM cell ?
>
>  Error recovery in data blocks is possible with correction, but
> how does a system detect config or latch errors ?
>
> -jg

--------------E55D00EB7E1675EE47A9A030--