From: Steven Derrien
Newsgroups: comp.arch.fpga
Subject: Data cache for fpga-cpu using Xilinx BlockRam
Date: Mon, 10 Sep 2001 19:29:02 +0200
Organization: INRIA - RENNES
Message-ID: <3B9CF85E.1BE69072@irisa.fr>

Hello,

I was wondering whether anyone has tried to implement a data cache using
blockrams on a Virtex/SpartanII-like FPGA (for a soft-core cpu)?

Any hints, HDL code, or advice?

Thanks,

Steven

######

From: "Rob Finch"
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Date: Mon, 10 Sep 2001 23:26:47 -0400
Organization: Bell Sympatico

Yes, I'm attempting this.
Actually, I'm implementing a unified instruction / data cache using the dual-port feature of the block rams: one port for data and one for code. Simultaneously accessed ports allow for a true Harvard architecture for the cpu. A unified cache will give you a slightly lower miss rate.

One of the tricky bits is that reads are registered on the clock edge, so the read address has to be available before the clock edge, assuming you're going to clock read data into the soft core on the next clock edge. On a read or write miss you have to keep track of the address that was used so the cache can be updated.

One problem with a cache is invalidating it, which is required on a process switch or reset. There is no dedicated hardware to perform a block reset of the valid bits, which means you must supply logic to do this (a counter + mux). It would be nice if one of the block rams had a reset signal that could set all the bits to zero.

It's tricky to get the cache to work fast, and pipelining is tricky as well. Consider that a ready signal to the cpu can't be valid until the result of the tag match is ready, which is after the read is registered (clocked). Also, the cpu ready signal has to drive a lot of logic (high fanout). I have about 6 or 7 logic levels between starting a cache access and data getting to the cpu. There is also a lot of routing (routing is about 80% of the time used). Right now I'm stuck at about 40MHz through the cache, while the soft cpu itself can run at 55MHz+. It might be possible to improve this with additional pipelining, but it gets to be complex.

If you're using fast sram for memory, it might not be worth implementing a cache, as the sram will probably run upwards of 40MHz (unless you want to implement the cache as an academic exercise).

A faster alternative to a cache would be to use the block ram as high speed memory. Stuff all the data and code that you want available at high speed in this memory.
For a simple system you often know which routines and data are time critical.

I'm developing using a SpartanII -5 and WebPack software.

Rob
http://www.birdcomputer.ca

"Steven Derrien" wrote in message news:3B9CF85E.1BE69072@irisa.fr...
> Hello,
>
> I was wondering whether anyone has tried to implement a data cache using
> blockrams on a Virtex/SpartanII-like FPGA (for a soft-core cpu)?
>
> Any hints, HDL code, or advice?
>
> Thanks,
>
> Steven

######

From: "Erik Widding"
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Date: Tue, 11 Sep 2001 12:17:18 GMT
Organization: AT&T Worldnet

"Rob Finch" wrote in message news:nsfn7.12796$2r.1150073@news20.bellglobal.com...
> Yes, I'm attempting this. Actually, I'm implementing a unified instruction /
> data cache using the dual-port feature of the block rams: one port for data
> and one for code. Simultaneously accessed ports allow for a true Harvard
> architecture for the cpu.
> [...]
> Right now I'm stuck at about 40MHz through the cache, while the soft cpu
> itself can run at 55MHz+. It might be possible to improve this with
> additional pipelining, but it gets to be complex.
If you are having a speed problem, you should probably be using two dual-port rams, one for instructions and one for data. Then use the second port on each of the memories as the interface to your external memory. If you have a cache miss, you will have to insert one wait state, so that you can read the memory location that was just fetched. Further, if you fetch four or eight data/instruction words every time you have a miss, then it is likely that you will suffer fewer misses.

Regards,
Erik Widding.

--
Birger Engineering, Inc. -------------------------------- 781.481.9233
38 Montvale Ave #260; Stoneham, MA 02180 ------- http://www.birger.com

######

From: "Rob Finch"
Newsgroups: comp.arch.fpga
References: <3B9CF85E.1BE69072@irisa.fr>
Subject: Re: Data cache for fpga-cpu using Xilinx BlockRam
Date: Wed, 12 Sep 2001 00:21:18 -0400
Organization: Bell Sympatico

"Erik Widding" wrote in message news:ihnn7.15659$Uf1.1369777@bgtnsc06-news.ops.worldnet.att.net...
>
> If you are having a speed problem, you should probably be using two dual-port
> rams, one for instructions and one for data. Then use the second port
> on each of the memories as the interface to your external memory.

Thanks for the suggestion, but duh, I can't see how using the second port to interface to external memory would improve performance (talking about reducing clock cycle time). What does it give me? I know it's difficult to illuminate over the net, but could you give some examples? Code?

As I see it, there are only din, dout, addr, and wr to worry about with the cache. din is always connected to the data bus, so no mux is required there. dout is connected back to the data bus with an unavoidable mux (the cpu might be trying to read i/o, not memory). I can't avoid a mux on the addr inputs because of the way block ram works. Because the memory transactions are pipelined, and the block rams register the read address on the clock edge, the address for the next read needs to be present before the clock edge. Once the clock edge occurs (provided the pipe is enabled), the read address at the input will change to the next read address. Since we can't tell whether there was a cache miss or hit until after the clock, and the read address has already advanced to the next address on the clock edge, the address has to be muxed back to the previous address in order to re-read on a cache miss. I sure would like to avoid that mux on the address. Controlling wr isn't that hard: it's activated when the memory system is ready and is a registered output on a clock edge.

When I started I had only an instruction cache, using one port to interface to memory and the second port for cpu cache access. However, after looking at the code and some timing experimentation, I found that using the second port for memory access didn't provide any benefit; but I could be missing something. I then decided to add a data cache, so I had two independent caches, one for instructions and one for data. However, when I looked at them I realized I was only using one port on each and hence could combine the two.
By combining the two I made a larger cache and was able to use some block rams for cache data that were previously used as tag rams, thus making the cache even larger. So I can have either a 4kb unified cache or two 1kb independent caches.

> If you
> have a cache miss, you will have to insert one wait state, so that you can
> read the memory location that was just fetched.

On a miss I have to insert at least two wait states, because I have one or more for memory arbitration as well as an additional cycle for the external ram read. :(

> Further, if you fetch four or eight data/instruction words every time you
> have a miss, then it is likely that you will suffer fewer misses.
>

Right now I have a fairly simple system. Without adding fifos to various i/o devices I can't really read more than one word at a time. For instance, the vga controller could require about one out of every three or four memory cycles.

Thanks
Rob
http://www.birdcomputer.ca
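
The multi-word fetch suggestion and the counter-based invalidate discussed in this thread can be illustrated with a small behavioural model. What follows is only a rough Python sketch, not HDL and not code from any of the posters. The class name `DirectMappedCache`, the helper `count_misses`, the 256-word capacity, and the sequential address trace are all assumptions made for illustration. The model counts misses only; it says nothing about wait states, memory arbitration, or achievable clock rate.

```python
class DirectMappedCache:
    """Behavioural model of a direct-mapped cache using word addresses.

    Hypothetical illustration only; sizes and names are assumptions.
    """

    def __init__(self, num_lines, words_per_line):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.valid = [False] * num_lines   # one valid bit per line
        self.tags = [0] * num_lines        # one tag per line
        self.misses = 0

    def access(self, word_addr):
        """Return True on a hit; on a miss, fill the whole line and return False."""
        line_addr = word_addr // self.words_per_line
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: model fetching every word of the line from external memory.
        self.valid[index] = True
        self.tags[index] = tag
        self.misses += 1
        return False

    def invalidate_all(self):
        """Clear the valid bits one entry per step, the way a counter would
        sweep a RAM that has no block reset. Returns the number of steps."""
        cycles = 0
        for counter in range(self.num_lines):
            self.valid[counter] = False
            cycles += 1
        return cycles


def count_misses(words_per_line, addresses, total_words=256):
    """Run an address trace through a cache of fixed total capacity."""
    cache = DirectMappedCache(total_words // words_per_line, words_per_line)
    for addr in addresses:
        cache.access(addr)
    return cache.misses


if __name__ == "__main__":
    sequential = list(range(64))  # assumed trace: 64 consecutive words
    for wpl in (1, 4, 8):
        print(wpl, "words per line:", count_misses(wpl, sequential), "misses")
```

For a purely sequential trace, the miss count falls in proportion to the line length, which is the effect the four-or-eight-word fetch is after. Whether that is a net win depends on how long each line fill holds the memory bus, which is the concern raised about sharing memory cycles with the vga controller.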