Topic: OPCODE timings

Michael Hutton
 Developer
OPCODE timings 
« Thread started on: Nov 3rd, 2008, 12:46pm »

(here are some results of the opcode timing..)
  I have used cpuid and rdtsc to time some opcodes, as discussed in the GFXLIB thread.
  The original question was about the number of cycles taken to execute
   Code:
  The instruction itself, apparently, takes 1 cycle but the memory overhead must increase this.
  Using the code below;
   Code:
      REM Using the Time Stamp Instruction
      
      DIM code 2200
      FOR pass=0 TO 2 STEP 2
        P%=code
        [
        OPT pass
        
        
        .subtime
        dd 0
        
        .time
        dd 0
        
        .var
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        finit
        ;now test the instruction
        cpuid
        rdtsc
        mov [time],eax
        
        ;REM the instruction being tested
        fld dword [time]
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT
      END
  
  and running it gives me 8 cycles on my AMD Athlon X2 laptop.
  However, I noticed a few anomalies if I change a few things (do them individually..)
  1. Delete
   Code:
  and then run it. I found that the times may vary, especially the first one. Is this because the code is just on the 2048-byte boundary?
  2. Change
   Code:
  to
   Code:
  gives (me) 20 cycles to complete the instruction. I presume this is for the same reason as in 1.
  3. Delete
   Code:
  I think this is fairly straightforward and the FPU needs to be initialised before use.
  4. Instead of
   Code:
        ;now test the instruction
  
  try
   Code:
        ;REM now test the instruction
  
  Bizarrely, I got varying results, but after running it a few times (ten or so) I got the 'correct' results. I presume this is due to the L1 cache: the code becomes resident there the more times it is used. However, I find it very hard to replicate the results I was getting! They all seem to be OK now, even when I restarted Windows and ran it again... I am quite sure the only thing I changed was the REM line, and for a while it was predictable that adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!
  You may find you don't get the same variations I did, and I hope they are reproducible so that you all don't think I'm nuts.
  I will be doing more investigating....
  Michael
 
  
admin
 Administrator
 
Re: OPCODE timings 
« Reply #1 on: Nov 3rd, 2008, 1:19pm »

 Quote: adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!
  It's likely to be an alignment issue.  You've made no attempt to align the dds, as you need to for best performance, so adding 'REM ' to your code could easily change their addresses from being a multiple-of-four (fast) to not a multiple-of-four (slow), or vice versa.
  To get (hopefully) more consistent results change your code as follows:
   Code:
      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4
  Incidentally you might want to get into the habit of setting L% and using FOR pass=8 TO 10 STEP 2 to avoid overstepping the bounds of your code area.
  Richard.
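
  For anyone who wants to check what the rounding expression does, here is a minimal sketch of the same arithmetic. It is written in Python rather than BB4W purely so the numbers are easy to try; the helper name align_up and the sample addresses are made up for illustration and are not part of anyone's posted code.

```python
def align_up(addr: int, boundary: int = 4) -> int:
    """Round addr up to the next multiple of boundary (a power of two)."""
    # With boundary = 4 this is the same arithmetic as (code+3) AND -4:
    # adding boundary-1 and then clearing the low bits rounds upwards.
    return (addr + boundary - 1) & -boundary


# Hypothetical addresses, chosen only to cover each remainder modulo 4.
for addr in (0x1000, 0x1001, 0x1002, 0x1003):
    aligned = align_up(addr)
    print(hex(addr), "->", hex(aligned), "padding:", aligned - addr)
```

  The padding is never more than three bytes, so the cost in code space is negligible.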
admin
 Administrator
 
Re: OPCODE timings 
« Reply #2 on: Nov 3rd, 2008, 3:53pm »

 Quote: I think this is fairly straightforward and the FPU needs to be initialised before use.
  There may be more to it than that.  You don't attempt to balance the FPU stack, so without the finit you'll get a stack overflow (with consequent exception) every so often, no doubt adding to the inconsistency.
  You might be better off timing a pair of instructions (such as fld and fstp) so that the stack remains balanced.  You may then find the finit isn't essential.
  Richard. 
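
  The stack-balance point can be pictured with a toy model. The sketch below is Python, not assembler, and it only counts occupied registers; the class X87StackModel and the function run are hypothetical names invented for this illustration. It assumes the stack starts empty, which may not be true of whatever state the interpreter leaves behind.

```python
class X87StackModel:
    """Toy model of the x87 register stack: eight slots and nothing else."""

    DEPTH = 8  # the x87 FPU has eight data registers, st0 to st7

    def __init__(self):
        self.used = 0       # number of occupied registers
        self.overflows = 0  # pushes attempted with no free register

    def finit(self):
        self.used = 0       # finit marks every register as empty

    def fld(self):
        if self.used == self.DEPTH:
            self.overflows += 1  # real hardware signals a stack fault here
        else:
            self.used += 1

    def fstp(self):
        if self.used > 0:
            self.used -= 1


def run(calls: int, balanced: bool) -> int:
    """Call the 'timed' instruction(s) repeatedly and count stack overflows."""
    fpu = X87StackModel()  # assumption: stack is empty at the start
    for _ in range(calls):
        fpu.fld()
        if balanced:
            fpu.fstp()  # pop what was pushed, as with fld followed by fstp
    return fpu.overflows


print("fld only, 20 calls:  ", run(20, balanced=False), "overflows")
print("fld + fstp, 20 calls:", run(20, balanced=True), "overflows")
```

  With a lone fld the model runs out of registers after eight calls, while the fld/fstp pair never does, which is the reason for timing a balanced pair.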
mohsen
 New Member
 
Re: OPCODE timings 
« Reply #3 on: Nov 4th, 2008, 09:11am »

 Code:
      FOR pass=0 TO 2 STEP 2
        P% = (code+3) AND -4  
  These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.
  I suggest that this be added to the Manual and/or the Wiki.
  Thanks Richard    ;) 
admin
 Administrator
 
Re: OPCODE timings 
« Reply #4 on: Nov 4th, 2008, 12:42pm »

 Quote: These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.
  There's no way that changing the alignment should affect whether code works or crashes.  All x86 (IA-32) processors will happily access instructions or data at any address, whether aligned or not; only speed is affected.
  If you are finding that alignment has a more drastic effect, then it is a side-effect of a bug elsewhere in your program.  You need to find the bug rather than fiddle with the alignment!
  Of course alignment can matter in calls to the Windows API, but that's a rather different issue.
   Quote: I suggest that this be added to the Manual and/or the Wiki.
  I'm not convinced it is appropriate for the manual; 'tuning' assembler code is a rather specialised subject, of which alignment is only a small part.
  As far as the Wiki is concerned, it could form part of an article on how to get the best speed out of assembler code.  Perhaps David Williams might consider writing one.
  Richard. 
mohsen
 New Member
 
Re: OPCODE timings 
« Reply #5 on: Nov 4th, 2008, 5:39pm »

I agree, Richard. It was a silly bug.
Michael Hutton
 Developer
Re: OPCODE timings 
« Reply #6 on: Nov 24th, 2008, 01:13am »

Hmm. I have fiddled a bit more with trying to get consistent results for the opcode timings, but I am still getting some pretty wild results.
   Code:
      REM Using the Time Stamp Instruction for opcode timing
      
      DIM code 4000, L%-1
      
      FOR pass=8 TO 10 STEP 2
        P%=(code+3) AND -4
        S%=P%
        [
        OPT pass
        
        .subtime
        dd 0
        dd 0
        
        .time
        dd 0
        dd 0
        
        .var
        dd 0
        dd 0
        
        ]
        P%+=2048
        [
        OPT pass
        
        
        .start
        
        ;L1 cache warming
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        cpuid
        rdtsc
        mov [subtime],eax
        cpuid
        rdtsc
        sub eax,[subtime]
        mov [subtime],eax
        
        .beforerem
        ;REM kljashdfaldskf
        .afterrem
        
        cpuid
        rdtsc
        mov [time],eax
        
        ;fld dword [time]
        ;fstp dword [var]
        fld1
        fstp st0
        
        cpuid
        rdtsc
        sub eax,[time]
        mov [time],eax
        sub eax,[subtime]
        
        ret
        ]
      NEXT
      PRINT"Code Length = ";P%-code
      PRINT"Aligning to dword boundary adds: ";S%-code;" bytes."
      PRINT"REM statement adds: ";afterrem-beforerem;" bytes."
      FOR I%=1 TO 20
        T%=USR(start)
        PRINT T%
      NEXT
  
  1. I have aligned the code to a dword boundary.
  2. You mentioned "adding 'REM ' to your code could easily change their addresses" but I can't see that happening where I've added a REM statement. Surely the assembler ignores anything after a semicolon anyway?
  3. I've balanced the FPU stack using a pair of instructions rather than just one. It seems slightly more consistent with a FINIT instruction but still varies somewhat.
  4. I was reading that it is probably better to use 'QueryPerformanceCounter' on multicore systems to time code, because apparently a thread cannot be guaranteed to execute on one core exclusively. I presume this could be what is happening. I noticed that I sometimes got negative results, which this could explain. I will experiment some more..
  Michael
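
  On the negative results: the routine keeps only eax (the low 32 bits of the time stamp) and finishes with two subtractions, so the arithmetic can be checked on paper. The sketch below is Python, with hypothetical helper names (sub32, as_signed32, net_cycles) and made-up counter values. It suggests that a wrap of the low word is harmless by itself, while a negative value appears whenever the measured overhead exceeds the measured interval or the second reading is lower than the first (as could happen if the two readings came from cores whose counters disagree).

```python
MASK32 = 0xFFFFFFFF


def sub32(a: int, b: int) -> int:
    """32-bit subtraction as 'sub eax,[mem]' performs it (wraps modulo 2**32)."""
    return (a - b) & MASK32


def as_signed32(x: int) -> int:
    """Read a 32-bit pattern as a signed integer, as T% would hold it."""
    return x - (1 << 32) if x & 0x80000000 else x


def net_cycles(t_start: int, t_end: int, overhead: int) -> int:
    """Mirror the tail of the routine: (end - start) - overhead, all in eax."""
    elapsed = sub32(t_end & MASK32, t_start & MASK32)
    return as_signed32(sub32(elapsed, overhead))


# All numbers below are invented for illustration.
print(net_cycles(0xFFFFFFF0, 0x100000050, 88))  # low word wraps mid-measurement
print(net_cycles(1000, 1080, 88))               # overhead larger than interval
print(net_cycles(5000, 4000, 88))               # second reading below the first
```

  The first case still gives a small positive number, whereas the other two come out negative, so a negative T% does not on its own prove that the thread changed cores.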