From: Steven Derrien <sderrien@irisa.fr>
Newsgroups: comp.arch.fpga
Subject: Data cache for fpga-cpu using Xilinx BlockRam
Date: Mon, 10 Sep 2001 19:29:02 +0200
Organization: INRIA  - RENNES
Lines: 10
Message-ID: <3B9CF85E.1BE69072@irisa.fr>
NNTP-Posting-Host: spyder.irisa.fr
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Trace: news.irisa.fr 1000142941 16875 131.254.51.10 (10 Sep 2001 17:29:01 GMT)
X-Complaints-To: usenet@irisa.fr
NNTP-Posting-Date: 10 Sep 2001 17:29:01 GMT
X-Mailer: Mozilla 4.75 [en] (WinNT; U)
X-Accept-Language: en, fr
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!newsfeed00.sul.t-online.de!t-online.de!grolier!fr.usenet-edu.net!usenet-edu.net!ciril.fr!loria.fr!news-rocq!irisa.fr!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:9855

Hello,

I was wondering whether anyone has tried to implement a data cache using
block RAMs on a Virtex/Spartan-II class FPGA (for a soft-core CPU).

Any hints, HDL code, or advice?

Thanks,

Steven

######

Reply-To: "Rob Finch" <robfinch@sympatico.ca>
From: "Rob Finch" <robfinch@sympatico.ca>
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Lines: 50
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Message-ID: <nsfn7.12796$2r.1150073@news20.bellglobal.com>
Date: Mon, 10 Sep 2001 23:26:47 -0400
NNTP-Posting-Host: 64.229.12.64
X-Complaints-To: abuse@sympatico.ca
X-Trace: news20.bellglobal.com 1000178579 64.229.12.64 (Mon, 10 Sep 2001 23:22:59 EDT)
NNTP-Posting-Date: Mon, 10 Sep 2001 23:22:59 EDT
Organization: Bell Sympatico
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news.ifi.unizh.ch!news.imp.ch!news.imp.ch!fr.clara.net!heighliner.fr.clara.net!newsfeed.rt.ru!news.rosnet.ru!newsfeed.sovam.com!nf1.bellglobal.com!nf2.bellglobal.com!news20.bellglobal.com.POSTED!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:9849

Yes, I'm attempting this. Actually, I'm implementing a unified instruction /
data cache using the dual-port feature of the block rams: one port for data
and one for code. Accessing both ports simultaneously allows for a true
Harvard architecture for the cpu. A unified cache will give you a slightly
lower miss rate than two separate caches of the same total size.
One of the tricky bits is that reads are registered on the clock edge, so the
read address has to be available before the clock edge, assuming you're going
to clock read data into the soft core on the next clock edge. On a read or
write miss you have to keep track of the address that was used so the cache
can be updated.
One problem with a cache is invalidating it, which is required on a process
switch or reset. There is no dedicated hardware to perform a block reset of
the valid bit, which means you must supply logic to do this (a counter +
mux). It would be nice if one of the block rams had a reset signal that
could set all the bits to zero.
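
A minimal sketch of the counter + mux idea, written as a cycle-level Python
model rather than HDL. The line count (NUM_LINES), the class, and all method
names are illustrative assumptions, not taken from Rob's design. It shows why
invalidation costs one clock per cache line when there is no bulk reset.

```python
# Cycle-level sketch of sweeping a valid-bit RAM with a counter.
# NUM_LINES and every name below are assumptions made for illustration.
NUM_LINES = 256  # assumed number of cache lines


class ValidBitRam:
    """One valid bit per line, a single write port, no bulk reset."""

    def __init__(self, num_lines):
        self.bits = [1] * num_lines  # start with every line marked valid
        self.counter = 0
        self.invalidating = False

    def start_invalidate(self):
        """Begin a sweep, e.g. on reset or a process switch."""
        self.counter = 0
        self.invalidating = True

    def clock(self, cpu_index=0, cpu_write=False, cpu_valid=1):
        """One clock edge. The mux hands the write port to the counter
        while a sweep is in progress, otherwise to the cache controller."""
        if self.invalidating:
            self.bits[self.counter] = 0
            if self.counter == len(self.bits) - 1:
                self.invalidating = False  # last line cleared
            else:
                self.counter += 1
        elif cpu_write:
            self.bits[cpu_index] = cpu_valid


def invalidate_all(ram):
    """Run a complete sweep and return the number of clocks it took."""
    ram.start_invalidate()
    cycles = 0
    while ram.invalidating:
        ram.clock()
        cycles += 1
    return cycles
```

With the assumed 256 lines, invalidate_all() takes 256 clocks, during which
the cache cannot accept normal valid-bit updates.
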
It's tricky to get the cache to work fast, and pipelining is tricky as well.
Consider that a ready signal to the cpu can't be valid until the result of
the tag match is ready, which is after the read is registered (clocked).
Also the cpu ready signal has to drive a lot of logic (high fanout). I have
about 6 or 7 logic levels between starting a cache access and data getting to
the cpu. There is also a lot of routing (routing is about 80% of the delay).
Right now I'm stuck at about 40MHz through the cache, while the soft cpu
itself can run at 55MHz+. It might be possible to improve this with
additional pipelining but it gets to be complex.
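
To put those figures in terms of clock period, here is a small worked
calculation. It assumes the 80% routing share applies to the critical path
through the cache, which the post does not state explicitly, and it treats
"55MHz+" as exactly 55 MHz.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


cache_period = period_ns(40.0)          # period of the cache path
cpu_period = period_ns(55.0)            # period the cpu core could run at
routing_delay = 0.80 * cache_period     # assumed routing share of that path
logic_delay = cache_period - routing_delay
to_trim = cache_period - cpu_period     # delay to remove to match the cpu

print(f"cache path period: {cache_period:.1f} ns")
print(f"cpu core period:   {cpu_period:.1f} ns")
print(f"routing / logic:   {routing_delay:.1f} ns / {logic_delay:.1f} ns")
print(f"needs to shrink by {to_trim:.1f} ns")
```

Under these assumptions the cache path is 25 ns, of which roughly 20 ns is
routing, and it would have to lose nearly 7 ns to keep up with the cpu core.
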
If you're using fast sram for memory, it might not be worth implementing a
cache as the sram will probably run upwards of 40MHz. (Unless you want to
implement the cache as an academic exercise).
A faster alternative to a cache would be to use the block ram as high speed
memory. Stuff all the data and code that you want available at high speed in
this memory. For a simple system you often know which routines and data are
time critical.
I'm developing using a SpartanII -5 and WebPack software.

Rob
http://www.birdcomputer.ca

"Steven Derrien" <sderrien@irisa.fr> wrote in message
news:3B9CF85E.1BE69072@irisa.fr...
> Hello,
>
> I was wondering if someone had ever tried to implement a data cache using
> blockrams on a Virtex/SpartanII like FPGAs. (for a soft-core cpu) ?
>
> Any hints/HDL code/advice ?
>
> Thanks,
>
> Steven

######

From: "Erik Widding" <widding@birger.com>
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr> <nsfn7.12796$2r.1150073@news20.bellglobal.com>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Lines: 32
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.00.2919.6600
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6600
Message-ID: <ihnn7.15659$Uf1.1369777@bgtnsc06-news.ops.worldnet.att.net>
Date: Tue, 11 Sep 2001 12:17:18 GMT
NNTP-Posting-Host: 12.91.10.171
X-Complaints-To: abuse@worldnet.att.net
X-Trace: bgtnsc06-news.ops.worldnet.att.net 1000210638 12.91.10.171 (Tue, 11 Sep 2001 12:17:18 GMT)
NNTP-Posting-Date: Tue, 11 Sep 2001 12:17:18 GMT
Organization: AT&T Worldnet
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!news.maxwell.syr.edu!howland.erols.net!news-out.worldnet.att.net.MISMATCH!wn3feed!worldnet.att.net!135.173.83.71!wnfilter1!worldnet-localpost!bgtnsc06-news.ops.worldnet.att.net.POSTED!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:9845

"Rob Finch" <robfinch@sympatico.ca> wrote in message
news:nsfn7.12796$2r.1150073@news20.bellglobal.com...
> Yes, I'm attempting this. Actually, I'm implementing a unified instruction /
> data cache using the dual-port feature of the block rams. One port for data
> and one for code. Simultaneously accessed ports allows for a true Harvard
> architecture for the cpu.
> [...]
> Right now I'm stuck at about 40MHz through the cache, while the soft cpu
> itself can run at 55MHz+. It might be possible to improve this with
> additional pipelining but it gets to be complex.

If you are having a speed problem, you should probably be using two dual-port
RAMs, one for instructions and one for data.  Then use the second port
on each of the memories as the interface to your external memory.  If you
have a cache miss, you will have to insert one wait state, so that you can
read the memory location that was just fetched.  Further, if you fetch four
or eight data/instruction words every time you have a miss, then it is
likely that you will suffer fewer misses.
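
The effect of fetching several words per miss can be sketched with a toy
direct-mapped cache model in Python. The function name, the 16-line cache,
and the purely sequential access trace are all assumptions chosen for
illustration. Real code and data streams have less regular locality, and
each multi-word fill costs more memory cycles than a single-word fill.

```python
def count_misses(addresses, num_lines, words_per_line):
    """Count misses in a direct-mapped cache that fills one whole line
    (words_per_line consecutive words) on every miss."""
    tags = [None] * num_lines  # None marks an invalid line
    misses = 0
    for addr in addresses:
        line_addr = addr // words_per_line  # which memory line holds addr
        index = line_addr % num_lines       # cache line it maps to
        tag = line_addr // num_lines        # remaining upper address bits
        if tags[index] != tag:
            misses += 1
            tags[index] = tag               # line fill replaces the old tag
    return misses


trace = list(range(64))  # 64 sequential word addresses (assumed workload)
for words in (1, 4, 8):
    print(words, count_misses(trace, num_lines=16, words_per_line=words))
```

For this sequential trace the model reports 64, 16, and 8 misses for
1, 4, and 8 words per line. Note that holding num_lines constant also makes
the cache larger as the line grows, so part of the gain on less regular
traces would come from capacity rather than from the burst fill itself.
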


Regards,
Erik Widding.

--
Birger Engineering, Inc.  --------------------------------  781.481.9233
38 Montvale Ave #260; Stoneham, MA 02180  -------  http://www.birger.com

######

Reply-To: "Rob Finch" <robfinch@sympatico.ca>
From: "Rob Finch" <robfinch@sympatico.ca>
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr> <nsfn7.12796$2r.1150073@news20.bellglobal.com> <ihnn7.15659$Uf1.1369777@bgtnsc06-news.ops.worldnet.att.net>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Lines: 70
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 5.50.4133.2400
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
Message-ID: <NdCn7.29168$%N2.1361848@news20.bellglobal.com>
Date: Wed, 12 Sep 2001 00:21:18 -0400
NNTP-Posting-Host: 64.229.12.64
X-Complaints-To: abuse@sympatico.ca
X-Trace: news20.bellglobal.com 1000271853 64.229.12.64 (Wed, 12 Sep 2001 01:17:33 EDT)
NNTP-Posting-Date: Wed, 12 Sep 2001 01:17:33 EDT
Organization: Bell Sympatico
Path: chonsp.franklin.ch!pfaff.ethz.ch!news-zh.switch.ch!news-ge.switch.ch!news.tele.dk!small.news.tele.dk!195.54.122.107!newsfeed1.bredband.com!bredband!newsfeed.sovam.com!nf1.bellglobal.com!nf2.bellglobal.com!news20.bellglobal.com.POSTED!not-for-mail
Xref: chonsp.franklin.ch comp.arch.fpga:9866


"Erik Widding" <widding@birger.com> wrote in message
news:ihnn7.15659$Uf1.1369777@bgtnsc06-news.ops.worldnet.att.net...
>
> If you are having a speed problem, you should probably be using two dual
> port rams, one for instruction and one for data.  Then use the second port
> on each of the memories as the interface to your external memory.

Thanks for the suggestion, but duh, I can't see how using the second port to
interface to external memory would improve performance (talking about
reducing clock cycle time). What does it give me? I know it's difficult to
illustrate over the net, but could you give some examples? Code?

As I see it there are only din, dout, addr, and wr to worry about with the
cache. din is always connected to the data bus, so no mux is required there.
dout is connected back to the data bus with an unavoidable mux (the cpu might
be trying to read i/o, not memory). I can't avoid a mux on the addr inputs
because of the way block ram works. Because the memory transactions are
pipelined, and the block rams register the read address on the clock edge,
the address for the next read needs to be present before the clock edge.
Once the clock edge occurs (provided the pipe is enabled) the read address at
the input changes to the next read address. Since we can't tell whether
there was a cache miss or hit until after the clock, and the read address
has already advanced to the next address on the clock edge, the address has
to be muxed back to the previous address in order to re-read on a cache
miss. I sure would like to avoid that mux on the address. Controlling wr
isn't that hard: it's activated when the memory system is ready and is a
registered output on a clock edge.
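
The replay behaviour described above can be sketched as a small Python
model. This is a behavioural illustration only: the class, the run()
function, the memory contents, and the assumption that a missed line is
refilled in zero extra cycles are all hypothetical simplifications.

```python
class SyncReadRam:
    """RAM whose read address is registered on the clock edge,
    so data for an address appears only after that edge."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.addr_reg = 0

    def clock(self, addr):
        self.addr_reg = addr

    @property
    def dout(self):
        return self.mem[self.addr_reg]


def run(addresses, miss_on):
    """Return (addresses presented on each edge, data delivered to the cpu).
    miss_on holds the addresses that miss the first time they are read."""
    ram = SyncReadRam(range(100, 116))
    pending = set(miss_on)
    presented, data = [], []
    next_i = 0         # index of the cpu's next address
    saved_addr = None  # copy of the address registered on the last edge
    replay = False
    while replay or next_i < len(addresses):
        # Address mux: the saved address on a replay, else the next address.
        addr = saved_addr if replay else addresses[next_i]
        if not replay:
            next_i += 1            # the cpu's address advances on the edge
        ram.clock(addr)
        saved_addr = addr
        presented.append(addr)
        # Hit or miss is only known after the edge.
        if addr in pending:
            pending.discard(addr)  # assume the line is filled before retry
            replay = True
        else:
            data.append(ram.dout)
            replay = False
    return presented, data


print(run([0, 1, 2, 3], miss_on={2}))
```

Address 2 is presented twice because the cpu's address had already moved on
to 3 by the time the miss was detected, so the mux has to select the saved
copy for the re-read.
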

When I started I had only an instruction cache using one port to interface
to memory, and the second port for cpu cache access. However, after looking
at the code and some timing experimentation I found using the second port
for memory access didn't provide any benefit; but I could be missing
something.
I then decided to add a data cache, so I had two independent caches, one for
instructions and one for data. However, when I looked at them I realized I
was only using one port on each and hence could combine the two. By
combining the two I made a larger cache and was able to use some block rams
for cache data that were previously used as tag rams, thus making the cache
even larger. So I can have either a 4kb unified cache or two 1kb independent
caches.

>  If you
> have a cache miss, you will have to insert one wait state, so that you can
> read the memory location that was just fetched.

On a miss I have to insert at least two wait states because I have one or
more for memory arbitration as well as an additional cycle for external ram
read. :(
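
As a rough way to compare one wait state against two, the average number of
clocks per access can be estimated as below. The miss rates are hypothetical
placeholders, not measurements from either design, and a hit is assumed to
take a single clock.

```python
def average_cycles(miss_rate, wait_states, hit_cycles=1):
    """Average clocks per access when every miss adds a fixed number
    of wait states on top of the normal hit time."""
    return hit_cycles + miss_rate * wait_states


for miss_rate in (0.02, 0.05, 0.10):  # hypothetical miss rates
    one = average_cycles(miss_rate, wait_states=1)
    two = average_cycles(miss_rate, wait_states=2)
    print(f"miss rate {miss_rate:.2f}: {one:.2f} vs {two:.2f} clocks/access")
```
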

> Further, if every time you have a miss, you fetch four or
> eight data/instruction words, then it is likely that you will suffer fewer
> misses.
>

Right now I have a fairly simple system. Without adding fifos to various i/o
devices I can't really read more than one word at a time. For instance the
vga controller could require about one out of every three or four memory
cycles.


Thanks
Rob
http://www.birdcomputer.ca