The speedup is ~150 with 8 SPUs compared with running the code on the PPU only.