AVR Microcoded CPUs
author Neil Franklin, last modification 2012.05.09


Processor

microcontroller for implementing arbitrary custom CPUs in microcode
  further development of the ideas in the Microcontroller_CPU_Emulator file
  enables an arbitrary 1-chip processor, just add memory and IO
  completely universal, memory only SRAM for "RAM" and "ROM", ROM "booted"
    addressable ROM at bottom or top, and variable amount of RAM/ROM split
  already published idea, as part of SoftHardware SoftVGA lecture

because of 8bit controllers most likely 8/16bit CPUs, unlikely 32bit
  round up other "odd" widths to next n*8bit, 12bit as 4+8bit
    18bit as 2+2*8bit, unlikely 36bit, but all of these unlikely

because of 8bit controllers usually 16bit or 24bit addresses
  produce them as direct 2*8bit or 3*8bit addresses
    large: simply a CPU with 24bit/3byte addresses, analogous to eZ80 or 65816
  or for 20bit or 24bit addresses with external MMU
    more likely only up to 20bit because of table SRAM size/width
  or for 20bit or 24bit addresses with internal segmenting
    more likely 24bit because faster to compute and output
    as FAR CALL, and thus allow OS+DLL+program far larger than 64k
    or only "NEAR" CALL with A4..19 as address, full 20bit address generation
    these variants require less external work, no SRAM and loading it
    fit far better to a user/microcode defined system

potential CPUs to implement
  existing micros (not needed, but for completeness): 8080, Z80, 6800, 6809, 6502
    even larger stuff if slower is acceptable, 8086/8088, 68000, Z8000, 32032
  no longer existing micros: 9900, 1802, 2650
  historic minis and mainframes: PDP-1, PDP-8, PDP-11, Alto, 360/370
    possibly only user level subset of their instruction sets
  own designs: DK's MyCPU, my own CPU designs
  language specific CPUs: Basic, Forth, Lisp/Logo
  virtual machines: perl, python, JavaVM


Microcontroller Type Choice

for having all target CPU registers and status in controller registers
  it requires a lot of them to be fast, so use AVR because of its 32 registers
  possibly less important registers
    will have to spill into internal SRAM


Resource Usage

microcontroller ROM for processor microcode
  regs for microcode state and often used processor registers
  RAM only for processor state, seldom used processor registers (such as FP)
    and also for internal stacks or buffers or caches or tags

for user code+data only controller-external memory, normal ROMs and RAMs
  wired to the controller's PIO pins, as to normal CPU Data/Addr/Ctrl pins
  address decoding therefore also external, as on normal CPUs
  possibly "service" mode, data in RAM outside but code from the controller's ROM
    this allows monitor or bootloader in target CPU code without external ROM
    alternatively these two could also be done directly in AVR code
    or better ROM-less, with code copied into external RAM during startup

also external user IO and decoding, and user interrupts by ext INT pins
  the EEPROM, timers and other IOs of the microcontroller are not used
    or only as far as they can map to the user's instruction set and/or IO architecture


External Memory

Pins Towards External Code/Data Memory

needs D0..7, A0..15, R/W and mem-/CS (or /RD and /WR), IO-/CS, INT, possibly WAIT
  possibly A16..19|23 for 20|24bit addresses, only on AVRs larger than the DIP40 case
  /Reset is direct AVR /RESET, likewise Clk direct from the AVR crystal

standard memory today for 64k retro CPUs is half-half split 32k+32k ROM+RAM
  so need to drive 2 256kBit/32kByte memory chips, A0..14, decode A15
  simplest 1-chip decoder is a 138 or half a 139, allows 8 or 4 memory chips
    this suggests decoding A15..17 or A15..16, needs partial 3rd byte of address
  alternatively only 2 memory chips if it can be done with a 4*2NAND or 4*2NOR

for driving use AVR external Data Memory mode vs generic PIO mode
  ATmega8515|162 with external D-Mem mode, comfortable
    /ALE and /CS and R/W all-in-one, no coding, signals fast, and with latch saving PIO pins
    pins given by AVR, PA (AD0..7) + PC (A8..15) + PD6+7 (WR+RD) + PE1 (ALE)
    but requires external auxiliary 74AHC373 for A/D demultiplexing
    and only available in 2
    small-ROM/1k-RAM types, and only with 16MHz
    because AVR regs+IO are in the SRAM address space, only use 8bit ext addr xx00..xxFF
      top 8bit or 12bit or 16bit ((0)0)00xx..((F)F)FFxx must be separate, by PIO
      possibly ext addr mode 9|10|11bit, bits for Code/Data or C/D/Stack/Extra/IO
        and/or bit for Sys/User mode, settable by different XH|YH|ZH constants
  ATmega32|64|324|644 with large ROMs, faster with 16|20MHz, no 373 needed
    and Reg=Reg+1 can be done between old-reg out and data read, use it instead of a wait
    but only PIOs, no auto-signals, read is fast with permanent +R/W and /CS
      in particular code read, possibly with only a PIO low-addr write of the new PC=PC+1
    write with +/CS R/W /CS +/CS +R/W /CS is quite a bit slower, but seldom
    because no 373, 16+8+n PIOs are used
      port usage PA (A0..7) + PC (A8..15), PB (D0..7), PD2..n..7 (control)
      leave PD0..1..n free for USART + handshake, communication with IO devices
      can still use PB* for SPI (SEEPROM), just that its /SS must come from a PD*
    possibly not enough pins in 32, possibly multiplex because of that
      for pins 273/374/373/374, but software /ALE is slow for A0..7
      prefer A8..15 before A0..7, and D0..7 on separate pins
      or A8..15 over D0..7 pins, if A0..7 increment gives carry
      better as extension for slow auxiliary stuff, such as A16..19 or PID or MMU
      or for recognizing interrupts (only wide OR gate and 373?)
      or 164 SI-PO shift registers, only 14pin, only 1 pin SI to all and 1 pin CLK each
      or separate ATmega8 for PIO extension, connected by SPI or UART
  ATmega64|645|1281 with external D-Mem mode and large ROMs, but TQFP64 0.8mm
    also enough PIOs, for full 20|24 addr, 8 data, control, and direct interrupts
    even if without 373 and with accesses by software, and separate A0..7 and D0..7


External MMU

driven by top part of address bus, and more AVR pins
  possibly PID0..3|7, possibly Code/Data (SID), possibly System/User Mode
  these would then have to be processed by an external SRAM MMU table
    connected to it as address lines, data out is then the new address lines
  needs also pins for writing into the table, and possibly tristate AVR D -> MMU D/Aout


ROM/RAM Address Layout

must be variable depending on the chosen CPU type's requirements
  some require ROM bottom + RAM top (8080/Z80/8051-like)
  others require ROM top + RAM bottom (6800/6809/6502-like and 8086-like)
  and mem-map IO deducible from either, more likely ROM, so both must be possible
  best is automatic switching when new/other microcode gets loaded
    allows switching between different CPU type microcodes without a source of errors
    but for small system size possibly only a jumper, else possibly too much stuff

better no ROM at all, only all-RAM, fewer sockets and chips, no switching
  CPU specific ROM section of RAM loaded from controller with ROM image at boot
  allows flexible split of RAM and ROM, not limited to fixed 32k+32k
    possible extension to 4 or 8 32k with 139 or 138 as decoder, if addr pins
      need pins for 2*8+n addr, 8 data, n control, fit all in 4*8=32 PIOs
  no need for an EPROM/EEPROM/Flash burner or EPROM eraser
  also changes automatically when new microcode is loaded, which is desirable
  also possible to fill "memoryless" sections with "empty bus" 0x00 or 0xFF or NOP

read operations run at full speed, always reading from RAM
  this includes all the many code/instruction read operations
write operations with read-only test to simulate ROM/empty no-change behavior
  this allows code fetch and read at full speed,
    only slows writes
    and these are usually at the end of an instruction, after which the next instr is executed
  easiest variant: simply the top 8bit of the address index a table of write-protect bits
    allows resolution of 256 bytes, far more than ever needed
  or jump with address high AND 0xF0|0xF8 as index into 16|8instr routines
    each of these writes or not, RAM vs ROM/empty
    resolution of 4k|2k addr
  possibly extend also to reads with such routines
    although this would more often be a call instead of a jump
    but this is only needed for a CPU with memory-mapped IO reads
    code fetch never from IO device, fetch from "empty" RAM, filled with NOPs
  possible to simulate ROM modules by controller detecting an SPI S-Flash
    and then loading an image from that instead of the internal one
    or the image from that plus part or all of the standard internal image
  this also allows microcode driven switchable ROMs
    IO accesses to switch registers result in ROM images being copied
    the code for this bypasses the normal instruction write tests
    allows the first ROM image being a menu, then multiple actual images to use
  possible to simulate video RAM by trapping on writes, and sending to display
    still write to RAM, so that read can be fast without trap

alternatively data table on 0xFF boundaries, separate for code, data-read and data-write
  3*256Byte=0.75kByte tables, even possibly 2Byte entries
  this would implement a simple internal MMU, to >16bit addresses
  also allows switching memory, full 64k RAM and superimposed ROMs
  ROM writes here "prevented" by writing to a page not readable anywhere
  but does not give any form of trapping for mem-map IO or video RAM simulation
    it is no good for a C64 like system, but OK for a CP/M or Apple 1 like system
  this variant has less work and offers flexible RAM/ROM splitting and modules
    fits a lot better to the idea of a user/microcode defined system

or alternatively only RAM outside and ROM directly read from AVR Flash
  but Flash vs ExtRAM test slows all memory accesses, including code fetch
  so better full external RAM, may be needed anyway
    for flat 64k RAM design

or unlikely extreme case even RAM only from internal SRAM
  ATmega1284P has nearly 16k free, more than Apple 1 or VIC-20 with 8k expansion
    either use only SRAM as memory, loaded from Flash as above
      and it also has enough Flash for 128k/8k=16 switchable half-RAM size ROMs
    or here more likely with direct read from Flash, to conserve space
      and still has 128k/32k=4 switchable half-64k size ROMs
    enough for a minimal in-controller-only system, with all pins only for IO
      can even run on a normal prototyping board, as nothing external is required
      enough for conceptual tests, CPU development, AVR IO usage development
  even 28pin ATmega168 would give a small 1k/2=0.5k RAM, possibly plus 16k/2=8k ROM
    this is still about the size of a typical BasicStamp-like system
    and 0.5k is enough for a small machine code hex training system
    and 8k is comfortable for a line assembler training system
      or even for Forth or minimal Basic or minimal Logo
  apart from that, this is not a desirable system, low on RAM, slow with ROM


Software

Instruction Decoder

convert fetched instruction codes to jump addresses, token-threaded
  either jump tables, or computed jumps (faster but costs more space)
    OUT abus-low; ADIW; IN dbus; MOV; ANDI F0; ANDI 0F; ADD begin-high; IJMP

table buildup
  8bit instructions are straightforward table indexes, into 256 n-instr blocks
  16bit instructions first 8bit table, or possibly first 5..10bit
    and rest coded out, or small aux tables, or anyway only parameters
  1+5+5+5bit combined instructions with multilevel tables 8bit/4*3bit/5bit


Memory Accesses

normal case always reading program code, I=PC++
  for data read or even write switch over IO, do it, restore IO state

to reduce amount of switches do multiple operations at once, blockwise
  blocks/strings copying in 2^n packets all-in/all-out from/to SRAM
    unrolled, (de)activate /CS and switch data bus only every 16 bytes
    OUT abus-low; IN dbus and OUT abus-low; OUT dbus; CBI r/w; SBI r/w
    allows with non-wait SRAM over 35ns to do 3MByte/s
    copying
    because of address overlap 2 2^n sized buffers are needed
  stack TOS in internal SRAM, top 2^n elements, ideally instruction offset size
    here also 2 2^n buffers, so an arbitrary section over both buffers is possible
  forget caching normal random RAM accesses, the administrative time is too large


Speed

for each memory fetch and decode or fetch and process multiple clocks
  assuming 10-20 for this, so with 20MHz about 1-2 simulated MHz
  both fetch 3-10 and decode or process 3-10

choice of either variable clocks per instruction memory cycle
  as fast as possible, each instruction minimal cycles
vs constant clocks per instruction memory cycle
  predictable timing, for each instruction n cycle blocks of m clocks
  the latter also for clock compatible emulation of historic CPUs
same issue of resources vs predictability as with CBR vs VBR music compression
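
rough sketch of the speed arithmetic above, in host C rather than AVR code
  simulated_hz and padding_clocks are hypothetical helper names, not from these notes
  assumes one emulated memory cycle costs fetch clocks plus decode/process clocks
  padding_clocks is one possible way to get the constant clocks variant,
    by idling until a fixed slot of m clocks is full

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (assumption, not part of the original notes):
 * emulated memory cycles per second for a given AVR clock, when each
 * emulated cycle costs fetch_clocks + process_clocks AVR clocks. */
static uint32_t simulated_hz(uint32_t avr_hz, uint32_t fetch_clocks,
                             uint32_t process_clocks)
{
    uint32_t clocks = fetch_clocks + process_clocks;
    assert(clocks > 0);              /* guard against division by zero */
    return avr_hz / clocks;          /* integer division, rounds down */
}

/* Constant-timing variant (assumed scheme): number of idle AVR clocks
 * needed to pad one emulated memory cycle up to a fixed slot length.
 * Returns 0 when the cycle already uses the whole slot or more. */
static uint32_t padding_clocks(uint32_t used_clocks, uint32_t slot_clocks)
{
    return used_clocks < slot_clocks ? slot_clocks - used_clocks : 0;
}
```

with 5+5 clocks a 20MHz AVR gives 2 simulated MHz, with 10+10 clocks 1 simulated MHz
  a cycle that used 6 clocks in a 20 clock slot needs 14 idle clocks of padding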