Topic: OPCODE timings
Michael Hutton
Developer
OPCODE timings
« Thread started on: Nov 3rd, 2008, 12:46pm »
(Here are some results of the opcode timing.)
I have used cpuid and rdtsc to time some opcodes, as discussed in the GFXLIB thread.
The original question was looking at the number of cycles taken to execute
Code:
The instruction itself, apparently, takes 1 cycle but the memory overhead must increase this.
Using the code below:
Code:
REM Using the Time Stamp Instruction
DIM code 2200
FOR pass=0 TO 2 STEP 2
P%=code
[
OPT pass
.subtime
dd 0
.time
dd 0
.var
dd 0
]
P%+=2048
[
OPT pass
.start
;L1 cache warming
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
finit
;now test the instruction
cpuid
rdtsc
mov [time],eax
;REM the instruction being tested
fld dword [time]
cpuid
rdtsc
sub eax,[time]
mov [time],eax
sub eax,[subtime]
ret
]
NEXT
PRINT"Code Length = ";P%-code
FOR I%=1 TO 20
T%=USR(start)
PRINT T%
NEXT
END
Running it gives me 8 cycles on my AMD Athlon X2 laptop.
However, I noticed a few anomalies if I change a few things (try them one at a time):
1. Delete
Code:
and then run it. I found that the times may vary, especially the first one. Is this because the code is just on the 2048-byte boundary?
2. Change
Code:
to
Code:
gives (me) 20 cycles to complete the instruction. I presume this is for the same reason as in 1.
3. Delete
Code:
finit
I think this is fairly straight forward and the FPU needs to be initialised before use.
4. Instead of
Code:
;now test the instruction
try
Code:
;REM now test the instruction
Bizarrely, I got varying results at first, but after running it a few (up to ten) times I got the 'correct' results. I presume this is because the code becomes resident in the L1 cache the more often it is used. However, I find it very hard to replicate the results I was getting! They all seem to be OK now, even after I restarted Windows and ran it again. I am positive the only thing I changed was the REM line, and for a while it was predictable that adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!
You may find you don't get the same variations I did, and I hope they are reproducible so that you all don't think I'm nuts.
I will be doing more investigating....
Michael
admin
Administrator
Re: OPCODE timings
« Reply #1 on: Nov 3rd, 2008, 1:19pm »
Quote: adding the REM caused the times to either vary wildly or be about 20 cycles (20 cycles because the variables are too close to the code). I really can't explain that one!
It's likely to be an alignment issue. You've made no attempt to align the dds, as you need to for best performance, so adding 'REM ' to your code could easily change their addresses from being a multiple-of-four (fast) to not a multiple-of-four (slow), or vice versa.
To get (hopefully) more consistent results change your code as follows:
Code:
FOR pass=0 TO 2 STEP 2
P% = (code+3) AND -4
Incidentally, you might want to get into the habit of setting L% and using FOR pass=8 TO 10 STEP 2 to avoid overstepping the bounds of your code area.
Richard.
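The two suggestions above (dword alignment and limit checking via L%) can be sketched together as follows. This is only an illustrative sketch, not Richard's own listing: the buffer size of 100 bytes is an arbitrary assumption, and the DIM form with L% follows the one used later in this thread. The labels subtime and time are taken from the original program.
Code:
REM Sketch only: align the data to a dword boundary and let OPT check the limit
DIM code 100, L% -1 : REM assumed size; L% marks the end of the code area
FOR pass=8 TO 10 STEP 2 : REM bit 3 of OPT enables the limit check against L%
P% = (code+3) AND -4 : REM round P% up to the next multiple of four
[
OPT pass
.subtime
dd 0
.time
dd 0
]
NEXT
REM Both addresses should now be multiples of four
PRINT "subtime MOD 4 = "; subtime MOD 4
PRINT "time MOD 4 = "; time MOD 4
Because P% starts on a multiple of four and each dd occupies four bytes, every dd that follows stays aligned as long as nothing of odd length is assembled in between.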
admin
Administrator
Re: OPCODE timings
« Reply #2 on: Nov 3rd, 2008, 3:53pm »
Quote: I think this is fairly straight forward and the FPU needs to be initialised before use.
There may be more to it than that. You don't attempt to balance the FPU stack, so without the finit you'll get a stack overflow (with consequent exception) every so often, no doubt adding to the inconsistency.
You might be better off timing a pair of instructions (such as fld and fstp) so that the stack remains balanced. You may then find the finit isn't essential.
Richard.
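A minimal sketch of the balanced-pair idea, for illustration only: the x87 register stack has eight entries, so each fld needs a matching pop if the routine is to be called repeatedly without a finit. The label pair, the buffer size and the repeat count are assumptions made for this example; var is the dword from the original listing.
Code:
REM Sketch only: a load/store pair that leaves the FPU stack as it found it
DIM code 200, L% -1 : REM assumed size
FOR pass=8 TO 10 STEP 2
P% = (code+3) AND -4
[
OPT pass
.var
dd 0
.pair
fld dword [var] ; push one value onto the FPU stack
fstp dword [var] ; store it back and pop, so the stack depth is unchanged
ret
]
NEXT
REM Repeated calls no longer accumulate values on the FPU stack
FOR I%=1 TO 100
dummy%=USR(pair)
NEXT
In a timing run the fld/fstp pair would sit between the two cpuid/rdtsc sequences, and the measured figure would then cover both instructions rather than the fld alone.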
mohsen
New Member
Re: OPCODE timings
« Reply #3 on: Nov 4th, 2008, 09:11am »
Code:
FOR pass=0 TO 2 STEP 2
P% = (code+3) AND -4
These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.
I suggest that this be added to the Manual and/or the Wiki.
Thanks Richard ;)
admin
Administrator
Re: OPCODE timings
« Reply #4 on: Nov 4th, 2008, 12:42pm »
Quote: These two (2) lines solve all of my problems with getting assembler code to work and stopping BB4W crashing.
There's no way that changing the alignment should affect whether code works or crashes. All x86 (IA-32) processors will happily access instructions or data at any address, whether aligned or not; only speed is affected.
If you are finding that alignment has a more drastic effect, then it is a side-effect of a bug elsewhere in your program. You need to find the bug rather than fiddle with the alignment!
Of course alignment can matter in calls to the Windows API, but that's a rather different issue.
Quote: I suggest that this be added to the Manual and/or the Wiki.
I'm not convinced it is appropriate for the manual; 'tuning' assembler code is a rather specialised subject, of which alignment is only a small part.
As far as the Wiki is concerned, it could form part of an article on how to get the best speed out of assembler code. Perhaps David Williams might consider writing one.
Richard.
mohsen
New Member
Re: OPCODE timings
« Reply #5 on: Nov 4th, 2008, 5:39pm »
I agree, Richard. It was a silly bug.
Michael Hutton
Developer
Re: OPCODE timings
« Reply #6 on: Nov 24th, 2008, 01:13am »
Hmm. I have fiddled a bit more with trying to get consistent results for the opcode timings, but I am still getting some pretty wild results.
Code:
REM Using the Time Stamp Instruction for opcode timing
DIM code 4000, L%-1
FOR pass=8 TO 10 STEP 2
P%=(code+3) AND -4
S%=P%
[
OPT pass
.subtime
dd 0
dd 0
.time
dd 0
dd 0
.var
dd 0
dd 0
]
P%+=2048
[
OPT pass
.start
;L1 cache warming
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
cpuid
rdtsc
mov [subtime],eax
cpuid
rdtsc
sub eax,[subtime]
mov [subtime],eax
.beforerem
;REM kljashdfaldskf
.afterrem
cpuid
rdtsc
mov [time],eax
;fld dword [time]
;fstp dword [var]
fld1
fstp st0
cpuid
rdtsc
sub eax,[time]
mov [time],eax
sub eax,[subtime]
ret
]
NEXT
PRINT"Code Length = ";P%-code
PRINT"Aligning to dword boundary adds: ";S%-code;" bytes."
PRINT"REM statement adds:"afterrem-beforerem
FOR I%=1 TO 20
T%=USR(start)
PRINT T%
NEXT
1. I have aligned the code to a dword boundary.
2. You mentioned "adding 'REM ' to your code could easily change their addresses " but I can't see that happening where I've added a REM statement. Surely the assembler ignores anything after a semi-colon anyway?
3. I've balanced the FPU stack using a pair of instructions rather than just one. It still seems slightly more consistent with a FINIT instruction, but it still varies somewhat.
4. I was reading that it is probably better to use 'QueryPerformanceCounter' on multicore systems to time code, because apparently a thread cannot be guaranteed to execute on one core exclusively. I presume this could be what is happening here. I noticed that I sometimes got negative results, which that could explain. I will experiment some more.
Michael
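For point 4, one possible approach is sketched below, under stated assumptions: pin the thread to a single core with the Windows API call SetThreadAffinityMask, then time with QueryPerformanceCounter. The structure names freq{}, t0{} and t1{}, the choice of core 0 (mask 1) and the empty loop used as a workload are all hypothetical. QueryPerformanceCounter ticks are generally much coarser than a single instruction, so this suits timing a loop of many repetitions rather than one opcode.
Code:
REM Sketch only: pin to one core, then time a workload with QueryPerformanceCounter
DIM freq{lo%, hi%}
DIM t0{lo%, hi%}
DIM t1{lo%, hi%}
SYS "GetCurrentThread" TO hthread%
SYS "SetThreadAffinityMask", hthread%, 1 TO oldmask% : REM mask 1 = first core (assumption)
SYS "QueryPerformanceFrequency", freq{}
SYS "QueryPerformanceCounter", t0{}
FOR I%=1 TO 1000000 : NEXT : REM placeholder workload; substitute the code under test
SYS "QueryPerformanceCounter", t1{}
REM Use the low dwords only; assumes the interval is well under 2^32 ticks
ticks = t1.lo% - t0.lo%
IF ticks < 0 THEN ticks += 2^32
f = freq.lo%
IF f < 0 THEN f += 2^32 : REM treat the low dword as unsigned
PRINT "Elapsed: "; ticks / f * 1000; " ms"
REM Restore the previous affinity mask if the call succeeded
IF oldmask% THEN SYS "SetThreadAffinityMask", hthread%, oldmask%
Setting the affinity mask on its own could also be tried with the rdtsc listing above, since it removes the possibility of the two readings coming from different cores.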