BBC BASIC for Windows
« OPCODE timings »

Michael Hutton
Developer

« Thread started on: Nov 3rd, 2008, 12:46pm »

(Here are some results of the opcode timing.)

I have used cpuid and rdtsc to measure some opcode timings, as discussed in the GFXLIB thread.

The original question was looking at the number of cycles taken to execute

Code:
fld dword [mem]
 


The instruction itself, apparently, takes 1 cycle but the memory overhead must increase this.

Using the code below:

Code:
      REM Using the Time Stamp Instruction
      
      DIM code 2200
      FOR pass=0 TO 2 STEP 2
        P%=code
        [
        OPT pass
        
        
        .subtime
        dd 0
        
        .time
        dd 0
        
        .var
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        finit
        ;now test the instruction
        cpuid
        rdtsc
        mov [time],eax
        
        ;REM the instruction being tested
        fld dword [time]
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT
      END

 


and running it gives me 8 cycles on my AMD Athlon X2 laptop.

However, I noticed a few anomalies if I change a few things (do them individually..)

1. Delete

Code:

.var
dd 0 

 


and then run it. I found that the times may vary, especially the first one. Is this because the code is just on the 2048 byte boundary?

2. Change

Code:
        fld dword [time]
 


to

Code:
        fld dword [var]
 


gives (me) 20 cycles to complete the instruction. I presume this is for the same reason as in 1.

3. Delete

Code:

finit

 


I think this is fairly straightforward: the FPU needs to be initialised before use.

4. Instead of

Code:
        ;now test the instruction
 


try

Code:
        ;REM now test the instruction
 


Bizarrely, I got varying results at first, but after running it a few (about ten) times I got the 'correct' results. I presume this is due to the L1 cache: the code becomes resident there the more times it is run. However, I find it very hard to replicate the results I was getting! They all seem to be OK now, even after I restarted Windows and ran it again... I am quite sure the only thing I changed was the REM line, and for a while it was predictable that adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!

You may find you don't get the same variations I did, and I hope they are reproducible so that you all don't think I'm nuts.

I will be doing more investigating....

Michael



admin
Administrator
« Reply #1 on: Nov 3rd, 2008, 1:19pm »

Quote:
adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!

It's likely to be an alignment issue. You've made no attempt to align the dds, as you need to for best performance, so adding 'REM ' to your code could easily change their addresses from being a multiple-of-four (fast) to not a multiple-of-four (slow), or vice versa.

To get (hopefully) more consistent results change your code as follows:

Code:
      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4 

Incidentally you might want to get into the habit of setting L% and using FOR pass=8 TO 10 STEP 2 to avoid overstepping the bounds of your code area.
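
As a rough illustration of what `(code+3) AND -4` does, here is a minimal sketch in Python (used instead of BBC BASIC only so that it can be run anywhere). The function name `align_up` and the sample addresses are invented for the example. Adding 3 and then clearing the low two bits rounds an address up to the next multiple of four, and leaves an already aligned address unchanged.

```python
def align_up(addr: int, boundary: int = 4) -> int:
    """Round addr up to the next multiple of boundary (a power of two).

    With boundary = 4 this matches the BBC BASIC expression (addr + 3) AND -4.
    """
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    # -boundary has all bits set except the low log2(boundary) bits.
    return (addr + boundary - 1) & -boundary


# Hypothetical addresses, chosen only to show each case.
for addr in (0x1000, 0x1001, 0x1002, 0x1003, 0x1004):
    print(hex(addr), "->", hex(align_up(addr)))
```

At most three bytes are skipped by the rounding, so the DIM has to reserve that much slack.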

Richard.

admin
Administrator
« Reply #2 on: Nov 3rd, 2008, 3:53pm »

Quote:
I think this is fairly straight forward and the FPU needs to be initialised before use.

There may be more to it than that. You don't attempt to balance the FPU stack, so without the finit you'll get a stack overflow (with consequent exception) every so often, no doubt adding to the inconsistency.

You might be better off timing a pair of instructions (such as fld and fstp) so that the stack remains balanced. You may then find the finit isn't essential.
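
The stack-balance point can be sketched with a toy model, written here in Python. It is not an emulation of the FPU: it only counts how many of the eight x87 stack registers are in use, it assumes the stack is empty before the first call, and it treats an overflow as an error (on real hardware, whether an overflow traps depends on the exception mask). The class and function names are invented for the example.

```python
class X87StackModel:
    """Toy model of the x87 register stack: eight slots and a depth counter."""

    SLOTS = 8

    def __init__(self) -> None:
        self.depth = 0

    def finit(self) -> None:
        self.depth = 0  # finit marks all eight registers as empty

    def fld(self) -> None:
        if self.depth == self.SLOTS:
            raise OverflowError("x87 stack overflow")
        self.depth += 1

    def fstp(self) -> None:
        if self.depth == 0:
            raise IndexError("x87 stack underflow")
        self.depth -= 1


def first_overflowing_call(use_finit: bool, balanced: bool, calls: int = 20):
    """Return the 1-based call on which fld overflows, or None if none does."""
    fpu = X87StackModel()
    for call in range(1, calls + 1):
        if use_finit:
            fpu.finit()
        try:
            fpu.fld()
        except OverflowError:
            return call
        if balanced:
            fpu.fstp()
    return None


print(first_overflowing_call(use_finit=False, balanced=False))
print(first_overflowing_call(use_finit=True, balanced=False))
print(first_overflowing_call(use_finit=False, balanced=True))
```

In this model a lone fld without finit runs out of registers on the ninth call, while either the finit or a matching fstp keeps all twenty calls clean.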

Richard.

mohsen
New Member
« Reply #3 on: Nov 4th, 2008, 09:11am »

Code:
      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4 


These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.

I suggest that this be added to the Manual and/or the Wiki.

Thanks Richard ;)

admin
Administrator
« Reply #4 on: Nov 4th, 2008, 12:42pm »

Quote:
These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.

There's no way that changing the alignment should affect whether code works or crashes. All x86 (IA-32) processors will happily access instructions or data at any address, whether aligned or not; only speed is affected.

If you are finding that alignment has a more drastic effect, then it is a side-effect of a bug elsewhere in your program. You need to find the bug rather than fiddle with the alignment!

Of course alignment can matter in calls to the Windows API, but that's a rather different issue.

Quote:
I suggest that this be added to the Manual and/or the Wiki.

I'm not convinced it is appropriate for the manual; 'tuning' assembler code is a rather specialised subject, of which alignment is only a small part.

As far as the Wiki is concerned, it could form part of an article on how to get the best speed out of assembler code. Perhaps David Williams might consider writing one.

Richard.

mohsen
New Member
« Reply #5 on: Nov 4th, 2008, 5:39pm »

I agree Richard.
It was a silly bug :)

Michael Hutton
Developer

« Reply #6 on: Nov 24th, 2008, 01:13am »

Hmm. I have fiddled a bit more with trying to get consistent results for the opcode timings, but I am still getting some pretty wild results.

Code:
      REM Using the Time Stamp Instruction for opcode timing
      
      DIM code 4000, L%-1
      
      FOR pass=8 TO 10 STEP 2
        P%=(code+3) AND -4
        S%=P%
        [
        OPT pass
        
        .subtime
        dd 0
        dd 0
        
        .time
        dd 0
        dd 0
        
        .var
        dd 0
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        .beforerem
        ;REM kljashdfaldskf
        .afterrem
        
        cpuid
        rdtsc
        mov [time],eax
        
        ;fld dword [time]
        ;fstp dword [var]
        fld1
        fstp st0
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      PRINT"Aligning to dword boundary adds: ";S%-code;" bytes."
      PRINT"REM statement adds:"afterrem-beforerem
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT
 


1. I have aligned the code to a dword boundary.

2. You mentioned "adding 'REM ' to your code could easily change their addresses" but I can't see that happening where I've added a REM statement. Surely the assembler ignores anything after a semi-colon anyway?

3. I've balanced the FPU stack using a pair of instructions rather than just one. It still seems slightly more consistent with a FINIT instruction but still varies somewhat.

4. I was reading that it is probably better to use 'QueryPerformanceCounter' on multicore systems to time code, because apparently a thread cannot be guaranteed to execute on one core exclusively. I presume this could be what is happening. I noticed that I sometimes got negative results, which this could explain. I will experiment some more.
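
To see how the final two subtractions in the routine can yield a negative T%, here is a small Python sketch of the 32-bit arithmetic. It assumes that USR returns eax as a signed 32-bit integer. The function names and the sample counter values are invented for illustration and are not measurements.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value


def net_cycles(start: int, end: int, overhead: int) -> int:
    """Mimic 'sub eax,[time]' followed by 'sub eax,[subtime]' on 32-bit values."""
    elapsed = (end - start) & MASK32  # low-word difference, modulo 2^32
    return to_signed32(elapsed - overhead)


# Normal case: 100 cycles measured, 92 of them timing overhead.
print(net_cycles(1000, 1100, 92))
# The low word wrapping past 2^32 is harmless, because the subtraction is modular.
print(net_cycles(0xFFFFFFF0, 0x00000054, 92))
# A run that happens to be quicker than the calibrated overhead goes negative.
print(net_cycles(1000, 1080, 92))
# A second reading that is behind the first (for example taken from another
# core's counter) also goes negative.
print(net_cycles(1000, 900, 92))
```

So a negative figure does not by itself prove core migration: an overhead estimate that is larger than a given run produces the same symptom.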

Michael

Michael Hutton
Developer

« Reply #7 on: Nov 24th, 2008, 01:40am »

Here's the link about rdtsc and multicore systems.

http://msdn.microsoft.com/en-us/library/bb173458.aspx

Michael