BBC BASIC for Windows: OPCODE timings

http://bb4w.conforums.com/index.cgi?board=assembler&action=display&num=1225719987

OPCODE timings
Post by Michael Hutton on Nov 3rd, 2008, 12:46pm

(Here are some results of the opcode timing.)

I have used cpuid and rdtsc to time some opcodes, as discussed in the GFXLIB thread.

The original question was looking at the number of cycles taken to execute

Code:

fld dword [mem]

The instruction itself, apparently, takes 1 cycle but the memory overhead must increase this.

Using the code below:

Code:

      REM Using the Time Stamp Instruction
      
      DIM code 2200
      FOR pass=0 TO 2 STEP 2
        P%=code
        [
        OPT pass
        
        
        .subtime
        dd 0
        
        .time
        dd 0
        
        .var
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        finit
        ;now test the instruction
        cpuid
        rdtsc
        mov [time],eax
        
        ;REM the instruction being tested
        fld dword [time]
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT
      END

and running it gives me 8 cycles on my AMD Athlon x2 laptop.

However, I noticed a few anomalies when I changed a few things (try them one at a time):

1. Delete

Code:


.var
dd 0

and then run it. I found that the times may vary, especially the first one. Is this because the code is just on the 2048-byte boundary?

2. Change

Code:

        fld dword [time]

to

Code:

        fld dword [var]

gives (me) 20 cycles to complete the instruction. I presume this is for the same reason as in 1.

3. Delete

Code:


finit

I think this is fairly straightforward: the FPU needs to be initialised before use.

4. Instead of

Code:

        ;now test the instruction

try

Code:

        ;REM now test the instruction

Bizarrely, I got varying results at first, but after running it a few (perhaps ten) times I got the 'correct' results. I presume this is due to the L1 cache: the code becomes resident there the more times it is used. However, I find it very hard to replicate the results I was getting! They all seem to be OK now, even after I restarted Windows and ran it again. I am quite sure the only thing I changed was the REM line, and for a while it was predictable that adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!

You may find you don't get the same variations I did, and I hope they are reproducible so that you all don't think I'm nuts.

I will be doing more investigating....

Michael

Re: OPCODE timings
Post by admin on Nov 3rd, 2008, 1:19pm

Quote:

adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!

It's likely to be an alignment issue. You've made no attempt to align the dds, as you need to for best performance, so adding 'REM ' to your code could easily change their addresses from being a multiple-of-four (fast) to not a multiple-of-four (slow), or vice versa.

To get (hopefully) more consistent results change your code as follows:

Code:

      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4
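
As a minimal sketch of what that rounding expression does, the loop below applies it to a few made-up addresses (addr% and aligned% are illustrative names, not taken from the program above). Adding 3 and then clearing the bottom two bits with AND -4 should round each address up to the next multiple of four, leaving already-aligned addresses unchanged:

```bbcbasic
      REM Show how (addr% + 3) AND -4 rounds an address up to a multiple of four
      REM addr% and aligned% are illustrative names only
      FOR addr% = 1000 TO 1004
        aligned% = (addr% + 3) AND -4
        REM Third column is the remainder after dividing by four (expected 0)
        PRINT addr%, aligned%, aligned% MOD 4
      NEXT addr%
```

At most three bytes are skipped, so the reserved area only needs to be slightly larger than the code it holds.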

Incidentally you might want to get into the habit of setting L% and using FOR pass=8 TO 10 STEP 2 to avoid overstepping the bounds of your code area.
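
A minimal sketch of that habit follows, assuming (as the suggestion implies) that OPT values with bit 3 set make the assembler check P% against L%. The names code%, pass% and entry, and the trivial routine itself, are placeholders for illustration:

```bbcbasic
      REM Assemble a trivial routine with limit checking enabled
      REM DIM L% -1 reserves nothing: it just records the first address after code%
      DIM code% 63, L% -1
      FOR pass% = 8 TO 10 STEP 2
        P% = code%
        [
        OPT pass%
        .entry
        mov eax,42
        ret
        ]
      NEXT pass%
      REM USR returns the value left in eax by the routine
      PRINT USR(entry)
```

If the routine grew beyond the reserved area, the assembler would be expected to report an error rather than silently overwrite whatever follows on the heap.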

Richard.

Re: OPCODE timings
Post by admin on Nov 3rd, 2008, 3:53pm

Quote:

I think this is fairly straightforward: the FPU needs to be initialised before use.

There may be more to it than that. You don't attempt to balance the FPU stack, so without the finit you'll get a stack overflow (with consequent exception) every so often, no doubt adding to the inconsistency.

You might be better off timing a pair of instructions (such as fld and fstp) so that the stack remains balanced. You may then find the finit isn't essential.

Richard.

Re: OPCODE timings
Post by mohsen on Nov 4th, 2008, 09:11am

Code:

      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4

These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.

I suggest that this be added to the Manual and/or the Wiki.

Thanks Richard ;)

Re: OPCODE timings
Post by admin on Nov 4th, 2008, 12:42pm

Quote:

These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.

There's no way that changing the alignment should affect whether code works or crashes. All x86 (IA-32) processors will happily access instructions or data at any address, whether aligned or not; only speed is affected.
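
The point about data can be checked from BASIC alone. This is a minimal sketch, with buf% and addr% as illustrative names, that deliberately picks an address one byte past a multiple of four and then stores and reads back a 32-bit value there with the ! indirection operator:

```bbcbasic
      REM Store and read back a 32-bit value at a deliberately unaligned address
      REM buf% and addr% are illustrative names only
      DIM buf% 11
      addr% = ((buf% + 3) AND -4) + 1
      !addr% = &12345678
      REM Print the address and the value read back, both in hexadecimal
      PRINT ~addr%, ~!addr%
```

The value read back should match the value stored; only the speed of the access differs from the aligned case.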

If you are finding that alignment has a more drastic effect, then it is a side-effect of a bug elsewhere in your program. You need to find the bug rather than fiddle with the alignment!

Of course alignment can matter in calls to the Windows API, but that's a rather different issue.

Quote:

I suggest that this be added to the Manual and/or the Wiki.

I'm not convinced it is appropriate for the manual; 'tuning' assembler code is a rather specialised subject, of which alignment is only a small part.

As far as the Wiki is concerned, it could form part of an article on how to get the best speed out of assembler code. Perhaps David Williams might consider writing one.

Richard.

Re: OPCODE timings
Post by mohsen on Nov 4th, 2008, 5:39pm

I agree, Richard.
It was a silly bug.

Re: OPCODE timings
Post by Michael Hutton on Nov 24th, 2008, 01:13am

Hmm. I have fiddled a bit more with trying to get consistent results for the opcode timings, but I am still getting some pretty wild results.

Code:

      REM Using the Time Stamp Instruction for opcode timing
      
      DIM code 4000, L%-1
      
      FOR pass=8 TO 10 STEP 2
        P%=(code+3) AND -4
        S%=P%
        [
        OPT pass
        
        .subtime
        dd 0
        dd 0
        
        .time
        dd 0
        dd 0
        
        .var
        dd 0
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        .beforerem
        ;REM kljashdfaldskf
        .afterrem
        
        cpuid
        rdtsc
        mov [time],eax
        
        ;fld dword [time]
        ;fstp dword [var]
        fld1
        fstp st0
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      PRINT"Aligning to dword boundary adds: ";S%-code;" bytes."
      PRINT"REM statement adds:"afterrem-beforerem
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT

1. I have aligned the code to a dword boundary.

2. You mentioned "adding 'REM ' to your code could easily change their addresses" but I can't see that happening where I've added a REM statement. Surely the assembler ignores anything after a semi-colon anyway?

3. I've balanced the FPU stack using a pair of instructions rather than just one. It still seems slightly more consistent with a FINIT instruction but still varies somewhat.

4. I was reading that it is probably better to use 'QueryPerformanceCounter' on multicore systems to time code, because a thread cannot be guaranteed to execute on one core exclusively. I presume this could be what is happening here. I noticed that I sometimes got negative results, which would be consistent with that. I will experiment some more.
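
A minimal sketch of the QueryPerformanceCounter approach from point 4 might look like the following. The structure layout, the variable names, the helper FNu32 and the placeholder loop are all assumptions for illustration; only the two Windows API calls are real. Each 64-bit counter value is read as two signed 32-bit halves, and the difference is rebuilt from them:

```bbcbasic
      REM Time a section of code with QueryPerformanceCounter
      REM freq{}, t0{}, t1{} and FNu32 are illustrative names only
      DIM freq{lo%, hi%}, t0{lo%, hi%}, t1{lo%, hi%}

      SYS "QueryPerformanceFrequency", freq{} TO ok%
      IF ok% = 0 THEN PRINT "No high-resolution counter available" : END

      SYS "QueryPerformanceCounter", t0{}
      FOR I% = 1 TO 1000000 : NEXT : REM placeholder for the code being timed
      SYS "QueryPerformanceCounter", t1{}

      REM Rebuild the 64-bit difference from the signed 32-bit halves
      ticks = (t1.hi% - t0.hi%) * 2^32 + (FNu32(t1.lo%) - FNu32(t0.lo%))
      PRINT "Elapsed ticks:   "; ticks
      PRINT "Elapsed seconds: "; ticks / (freq.hi% * 2^32 + FNu32(freq.lo%))
      END

      REM Treat a signed 32-bit integer as an unsigned value
      DEF FNu32(n%) IF n% < 0 THEN = n% + 2^32 ELSE = n%
```

The overhead of the two API calls is likely to be far larger than a single instruction, so this approach probably suits timing a loop that repeats the instruction many times rather than one execution of it.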

Michael

Re: OPCODE timings
Post by Michael Hutton on Nov 24th, 2008, 01:40am

Here's the link about rdtsc and multicore systems.

http://msdn.microsoft.com/en-us/library/bb173458.aspx

Michael