Technical background

  Usual digital audio/video storage and playback is based on high-ratio data compression. With various algorithms, typical compression ratios range from about 6 (MJPEG) to a couple of hundred (H.264, for instance). So it is possible to pack a 2-hour movie into a file of about 1GB, at a resolution of 1280x580 or similar. But such sophisticated packing methods require a lot of CPU power during playback: something in the range of a 2GHz Pentium 4, for instance.
  MPEG-1, the first really good commercial packing method, used by VCD, required something in the range of a Pentium 1 at 100MHz. Before it we could see various less efficient and less CPU-hungry codecs (a codec is actually a packing/depacking system), good for 30MHz and similarly fast CPUs.
  But we have only an 8MHz, 16-bit 68000 in the STE. If we want 320x200 px playback at 12.5 fps or more, we can immediately forget any depacking during playback. Even with some very primitive algorithm, 320x160px at 12.5 fps needs a depacking rate of over 500KB/sec. That is just not possible. And if it is a hi-color show, we have even less CPU time, so it is completely impossible.
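
  The 500KB/sec figure can be roughly reproduced with a little arithmetic. This is only a sketch under my own assumptions, which are not stated at this point in the text: 4 bitplanes (4 bits per pixel), 48 palette entries per scanline as in the PCS format, and 2 bytes per palette entry. The function names are made up for illustration.

```python
# Rough per-frame data budget for 320x160 playback at 12.5 fps.
# Assumed (not stated at this point in the text): 4 bits per pixel,
# 48 palette entries per scanline, 2 bytes per palette entry.

def frame_bytes(width, height, bits_per_pixel=4, colors_per_line=48):
    """Bytes for one frame: bitmap plus per-scanline palette data."""
    bitmap = width * height * bits_per_pixel // 8
    palette = height * colors_per_line * 2
    return bitmap + palette

def stream_rate(width, height, fps):
    """Bytes per second that would have to be produced, audio excluded."""
    return frame_bytes(width, height) * fps

rate = stream_rate(320, 160, 12.5)
print(f"{rate / 1024:.0f} KB/sec before audio")  # prints "500 KB/sec before audio"
```

  With sound data added on top, this lands just over 500KB/sec.
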
  The only thing we can do is simple loading of already depacked video and audio data straight to their end destinations, avoiding any intermediate data copying or processing. And even that alone is almost too demanding for the old Atari STE. Usual disk transfer rates with modern drives and flash cards are around 1MB/sec. Someone could say: that is much more than 500KB/sec, so let me use my UltraSatan, which can do 1170KB/sec. It is possible to use almost the whole mentioned data rate, but not with hi-color playback - only with 16-color playback, which is already solved, with a nice 25 fps at 320x200 px. Hi-color display on the Atari STE keeps the CPU almost fully occupied during active scanlines, and that is some 65% of the time at a resolution of 320x200. OK, then let's use DMA for loading the data, someone will say. Unfortunately, that does not work, and it is the main reason why UltraSatan is not really good for this: when DMA is active, it stops the CPU for some cycles, and because the whole hi-color display is based on accurate CPU timing, the timing will be destroyed and you will see garbage instead of nice video. The only way with UltraSatan is to load during the border periods - top and bottom border - which means only some 33% of the time at normal resolution. This is why I reduced the vertical resolution to 158px: then less data needs to be loaded, while more time is available - about 48%. Problem solved - actually, only one problem. There are many more things to take care of, and we need some unusual solutions.

   If you try usual data loading with GEMDOS Trap #1 calls, it will fail miserably, even with the fastest driver SW. The reason is that we need short data loads in the little blank time periods available, which are about 9.5 ms long. In that time we need to load 21 sectors from the drive/flash card. With a speed above 1100KB/sec, it is possible to load 21 sectors in 9.5 ms. But not via GEMDOS, because data on AHDI partitions is organised differently. We have so-called logical sectors, whose size is a multiple of the normal 512-byte hard disk sector. Atari introduced big sectors for so-called BGM (Big GEM) partitions - those with sizes over 32MB, up to 512MB. On a 512MB partition, one logical sector is 8KB, or 16 normal sectors, long. And that is the minimal size with which the hard disk driver SW operates. Now, what happens when we tell GEMDOS to load 21x512=10752 bytes? It will calculate how many logical sectors that takes, and will load as many as are needed to cover the requested data. In the case of a 512MB partition that is 2 logical sectors, or 16384 bytes. Hey! It can overwrite some user data, because we asked for less. Right - the TOS programmers knew about that, and therefore we have disk buffers. So, in reality, data goes first from disk into a buffer, and then only the proper number of bytes is copied to the end destination, to avoid damaging user data. And that of course means a slower load. With 16-color playback, I got around the problem by always loading 16KB blocks - then TOS is smart enough to load straight to the destination.
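
  The numbers above can be checked with a short sketch. It assumes the request starts on a logical sector boundary (otherwise it could span one logical sector more), and the function names are only illustrative, not part of any TOS or driver API.

```python
SECTOR = 512  # physical hard disk sector size in bytes

def logical_sectors_needed(request_bytes, logical_sector_bytes):
    """Whole logical sectors that must be read to cover the request
    (assumes the request starts on a logical sector boundary)."""
    return -(-request_bytes // logical_sector_bytes)  # ceiling division

def bytes_read_from_disk(request_bytes, logical_sector_bytes):
    """Bytes actually transferred when only whole logical sectors can be read."""
    return logical_sectors_needed(request_bytes, logical_sector_bytes) * logical_sector_bytes

def required_rate(sectors, window_seconds):
    """Transfer rate in bytes/sec needed to fit the sectors into the time window."""
    return sectors * SECTOR / window_seconds

request = 21 * SECTOR   # 10752 bytes wanted per blank period
logical = 16 * SECTOR   # 8KB logical sector of a 512MB partition

print(bytes_read_from_disk(request, logical))    # 16384
print(round(required_rate(21, 0.0095) / 1024))   # KB/sec needed in a 9.5 ms window
```

  So 5632 bytes more than requested come from the disk, and they have to go through a buffer instead of straight to the destination.
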

   Here we cannot go with 16KB blocks, because we are limited by the time available for loading. So the only way is to use direct disk access, bypassing GEMDOS calls. With my experience in writing hard disk drivers, that was not a problem for me. Then we can always load exactly as many sectors as we need, straight to the destination. However, this creates a new problem for us: how to locate the position of a large AV file on the drive? Then there is fragmentation. Fragmentation is not an option here - it must be avoided, otherwise playback will be bad, because the required loading speed cannot be achieved. So defragmenting is a must - which is good overall for work with the computer too. After some thinking, and some initial overcomplicated and slow ideas, I solved finding the file location on the drive in a pretty fast and relatively simple way - the code is short in any case. Whoever is interested may look at it in the source of the player SW. So much about hi-color playback with UltraSatan. I don't expect any improvements here - more fps and/or resolution is just not possible. Actually, for many people even this will not work well.
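
  To illustrate what "not fragmented" means for direct sector access, here is a minimal sketch of a contiguity check on a FAT16 cluster chain. It is not the method used in the player source; it assumes the FAT is already in memory as a plain list of next-cluster numbers, and all names are hypothetical.

```python
END_OF_CHAIN = 0xFFF8  # FAT16: values from 0xFFF8 upward mark the last cluster

def chain_is_contiguous(fat, start_cluster):
    """True if every cluster of the file directly follows the previous one."""
    cluster = start_cluster
    while True:
        nxt = fat[cluster]
        if nxt >= END_OF_CHAIN:
            return True          # reached the end without any jump
        if nxt != cluster + 1:
            return False         # jump found (or a free/bad entry): fragmented
        cluster = nxt

def first_sector_of_cluster(cluster, data_start_sector, sectors_per_cluster):
    """Standard FAT mapping: the data area begins at cluster number 2."""
    return data_start_sector + (cluster - 2) * sectors_per_cluster

# Tiny made-up FATs: index = cluster number, value = next cluster.
contiguous_fat = [0, 0, 3, 4, 0xFFFF]   # file occupies clusters 2, 3, 4
fragmented_fat = [0, 0, 4, 0xFFFF, 3]   # file occupies clusters 2, 4, 3

print(chain_is_contiguous(contiguous_fat, 2))  # True
print(chain_is_contiguous(fragmented_fat, 2))  # False
```

  With a contiguous file, only the first sector and the length are needed, and all later reads are plain sequential sector reads.
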


  Hi-color playback using the cartridge port IDE adapter CATA:

  The ST(E) cartridge port is 16-bit, like IDE (ATA) hard disks or Compact Flash cards. So the idea of using it for IDE hard disk adapters is a natural one, and one was already manufactured: Paskud. I made something faster, using a very special way of writing to disk - which always needs a tricky solution with the cart. port, because it is a read-only design. Without going too deep into details, I'll focus only on things related to speed. If we want really good quality AV playback, we need pretty high loading rates. 320x200px with 80 colors/line and 30K colors needs a 2.5MB/sec loading rate. Or Overscan 416x228px with 48 colors/line: 2.4MB/sec. Such speeds are higher than any existing hard disk adapter can do on the STE. The max is about 1800KB/sec with the ICD Link2. My special ACSI-CF adapter can do some 1900KB/sec - and that is really the top with the ACSI port. Then how to load 2.5MB/sec? To understand the solution, we first need to know a little about Atari ST(E) RAM and bus speeds. RAM in the ST(E) has a 250 ns cycle and is 16 bits wide. That means the max data rate is 8MB/sec. Looks promising... But RAM holds the video data too, which is constantly read for displaying the screen. That uses exactly half of the RAM bandwidth - 4MB/sec (it needs less, but the logic is made so that even in blanks you still cannot use the whole RAM bandwidth). Anyway, still 4MB/sec. So why then only less than 2MB/sec? The CPU has constant access to RAM. Its RAM access cycle is 500 ns, so the CPU can access RAM at max 4MB/sec (in peak). The DMA chip is designed for a max speed of 2MB/sec - it then slows down the CPU by about 50%. If DMA went at 4MB/sec, the CPU would have to be stopped completely. Anyway, as we said earlier, DMA is not good for hi-color playback. And even with 4MB/sec, we could not go much over 1MB/sec at some bigger resolution. The CPU can do 4MB/sec, as was said. Yeah, but the usual way of data transfer is: first read data from the source address into the CPU, and then write it to the destination address. That means the max transfer rate is actually only 2MB/sec, in peak.
  It would be good if the CPU had an instruction to write from some external port directly to some RAM address. I call it semi-DMA. But the 68000 has no such instruction, and likely neither do other CPUs. Still, we can achieve it, with a little hack in the machine, proper logic in the cart. port adapter and special SW. The machine needs to be set in a special state: all interrupts disabled, no DMA activity (but audio DMA can go on, luckily). Then the logic of the cart. port adapter inverts the R/W line from the CPU to the MMU/Shifter on any read from RAM or the Shifter - meaning here, whenever such a command is given to the CPU. Because of the inversion, data is not read but written there - which is what we want when reading from hard disk. And where will the data come from? From the IDE port, which is activated in parallel with inverting the R/W line. Each such cycle advances the IDE internal counter by 2 bytes, so data is loaded sequentially. This is the essence. And we can achieve a peak speed of some 3.4MB/sec. Not exactly 4MB/sec, and one of the reasons is the CPU bug with movem.
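
  The bandwidth figures above follow from the cycle times alone. A small sketch of that arithmetic, with MB meaning 10^6 bytes; the 3.4MB/sec value is the peak quoted in the text, not something this code derives, and the names are only illustrative.

```python
BUS_BYTES = 2        # 16-bit wide RAM and CPU data bus
RAM_CYCLE_NS = 250   # ST(E) RAM cycle time
CPU_ACCESS_NS = 500  # time of one 68000 RAM access

def rate_mb_per_sec(bytes_per_access, ns_per_access):
    """Peak rate in MB/sec (10^6 bytes) for one access every ns_per_access."""
    return bytes_per_access * 1000 / ns_per_access

ram_total = rate_mb_per_sec(BUS_BYTES, RAM_CYCLE_NS)   # whole RAM bandwidth
cpu_share = ram_total / 2                              # other half goes to video
cpu_peak = rate_mb_per_sec(BUS_BYTES, CPU_ACCESS_NS)   # one CPU access per 500 ns

# Ordinary copy: each word costs a read access plus a write access.
copy_peak = cpu_peak / 2
# Reversed R/W ("semi-DMA"): each word costs a single access.
semi_dma_ceiling = cpu_peak

QUOTED_PEAK = 3.4  # MB/sec, figure taken from the text
print(ram_total, cpu_share, cpu_peak, copy_peak, semi_dma_ceiling)
print(f"{QUOTED_PEAK / semi_dma_ceiling:.0%} of the theoretical ceiling")
```
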

  The following is for people who know a little about 68000 coding: for reversed R/W loading from disk, we may use movem as the fastest way. Then something like: movem.w (a6),d0-d7 will load 18 bytes into RAM at the address in a6. Pardon? 8x2 is 16, not 18, someone says. Right - but there is a bug in the CPU, and it always performs one bus cycle more than needed. Normally, when it is a read and not a write, those 2 bytes are just lost, but in reversed mode they are written into RAM, and luckily to the correct place. And the bug is the reason why we cannot use movem.l ... to transfer a multiple of 4 bytes - it will always be 2 bytes more than the command says. The SW for the transfer must be executed from cart. port ROM - then the logic can simply determine whether an access is a code fetch or a RAM access. Parameters should never be read from RAM - that triggers the IDE port. And interestingly, something like dbf (dbra) causes bad data transfer. I think the reason is that dbf (and bra instructions) have cycle counts not divisible by 4, which confuses the MMU logic. So, branches only with jmp - but there is no real need for them - we need only a few routines to be placed in cart. ROM.
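
  The byte counts above can be written down as a tiny model. It only counts bus transfers of a memory-to-register movem, including the one extra word access described above; the function name is made up for illustration.

```python
def movem_bus_bytes(register_count, operand_bytes):
    """Bytes moved over the bus by a memory-to-register movem:
    the requested data plus one extra 16-bit access at the end."""
    return register_count * operand_bytes + 2

# movem.w (a6),d0-d7 : 8 registers, word size
print(movem_bus_bytes(8, 2))   # 18

# movem.l with any register count never gives a multiple of 4 bytes
print([movem_bus_bytes(n, 4) % 4 for n in range(1, 9)])   # all 2
```

  With word size the total is always even, so transfers can still be planned to hit exact sector boundaries, while with long size the total is always 2 bytes off from a multiple of 4.
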

  Still not fast enough! Over 3MB/sec, and not enough? Yes - calculate a little: for 320x200px and 25fps we need to load a 32KB bitmap + about 20KB of color data + sound data in 1/25 sec. And in 33% of the time. 50x25=1250KB/sec, x3 is more than our tricky cart. port reading can do. The solution is to read the color data not into RAM, but straight into the Shifter - the whole concept is based on loading straight to the end destination. And it is possible, with carefully written code and newer, fast Compact Flash cards, to have a synchronized load right from the IDE port. And not just the color load of the usual PCS format, 48 colors/line, but more: even 80 colors/line. It looks better. Furthermore, because we show the same bitmap twice - 25 fps and 50 vertical scans - we can use 2 slightly different sets of color data to achieve more perceived color nuances - about 30000. This needs 2 different sets of color data, and it makes a 2.5MB/sec rate. With Overscan, we must even load part of the bitmap data interleaved with the color data, because there is even less time in the border periods. Then we cannot have more than 48 colors/line, but it may still look good.
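
  The rates mentioned here can be reproduced approximately. The sketch below is a reconstruction under my own assumptions, not a stated derivation: 4 bits per pixel, 2 bytes per palette entry, the bitmap loaded once per frame (25 per second) and a fresh set of color data on every vertical scan (50 per second). Sound data is left out, which is why the results come out a little under the quoted 2.5 and 2.4MB/sec.

```python
def bitmap_bytes(width, height, bits_per_pixel=4):
    """Bitmap size of one frame (assumed 4 bitplanes)."""
    return width * height * bits_per_pixel // 8

def color_bytes(height, colors_per_line):
    """One set of color data: 2 bytes per palette entry, per scanline."""
    return height * colors_per_line * 2

def stream_rate(width, height, colors_per_line, fps=25, scans=50):
    """Bytes/sec: bitmap once per frame, a new color set on every scan."""
    return (bitmap_bytes(width, height) * fps
            + color_bytes(height, colors_per_line) * scans)

print(stream_rate(320, 200, 80))   # 2400000, plus sound: about 2.5MB/sec
print(stream_rate(416, 228, 48))   # 2280000, plus sound: about 2.4MB/sec

# The simpler case from the text: about 50KB per frame at 25 fps,
# loaded only during roughly one third of the time.
average_kb = 50 * 25           # 1250 KB/sec averaged over the whole frame
in_window_kb = average_kb * 3  # rate needed while loading is allowed
print(in_window_kb)            # 3750, above the 3.4MB/sec peak
```
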

Dec. 13 2012. P. Putnik