Re: x86 osl/interlck.h performance


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
Hi,

Between SRC680 m164 and SRC680 m170 some important performance
improvements have been integrated; most notably, the "empty" string is
no longer reference counted. This has significantly reduced the number
of reference counter calls. I redid the measurement to see whether the
"lock" prefix still has a significant impact on overall performance.
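
To illustrate the idea (a minimal sketch with made-up names; the actual
rtl_uString change may look different): the shared empty string becomes
immortal, so acquiring and releasing it skips the interlocked counter,
and with it the "lock" prefix, entirely:

  /* sketch only -- hypothetical names, not the actual rtl_uString code */
  #include <stdlib.h>
  #include <osl/interlck.h>

  typedef struct {
      oslInterlockedCount refCount;
      int length;
      char buffer[1];
  } String;

  /* one statically allocated empty string, shared by all users;
     its refCount is never touched */
  static String EMPTY_STRING = { 0, 0, { '\0' } };

  static void string_acquire(String * s) {
      /* the shared empty string is immortal: skip the interlocked
         increment (and its "lock" prefix) */
      if (s != &EMPTY_STRING)
          osl_incrementInterlockedCount(&s->refCount);
  }

  static void string_release(String * s) {
      if (s != &EMPTY_STRING
          && osl_decrementInterlockedCount(&s->refCount) == 0)
          free(s);
  }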


$ time ./soffice numbers_large.ods

With "lock", w/o lock,  w/o lock but with check for SMP
      31.566      31.142      30.762
      32.515      30.909      30.807
      32.247      30.515      31.413
      31.695      30.594      30.812
      32.008      30.449      30.924
      ------      ------      ------
Mean 32.006      30.722      30.944
Std   0.349       0.263       0.241

The gain for old machines is now about 3.3% (columns 1 vs. 3), and the
penalty for new machines caused by the additional check (columns 2 vs. 3)
can be estimated at around 0.7%. I no longer think that the gain on
older machines warrants the penalty on modern systems.
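
For reference, the "check for SMP" variant of column 3 amounts to
something like the following sketch (hypothetical code, not the actual
patch; incLocked/incUnlocked are from Stephan's test program quoted
below). The extra branch per call is what costs the estimated 0.7%:

  /* sketch only -- run-time dispatch between the two variants */
  #include <unistd.h>

  int incLocked(int * p);    /* LOCK XADD, see Stephan's program below */
  int incUnlocked(int * p);  /* plain XADD, sufficient on uniprocessors */

  static int g_smp = -1;     /* -1: not yet determined */

  int incChecked(int * p) {
      if (g_smp < 0)
          g_smp = sysconf(_SC_NPROCESSORS_ONLN) > 1;
      return g_smp ? incLocked(p) : incUnlocked(p);
  }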

BTW, columns 1 and 2 are directly comparable to the m164 numbers quoted
below: a 23% improvement from m164 to m170, wow!

On another note: inlining on Solaris SPARC machines saves only about 10%
per call to the reference counter, so the overall influence of inlining
on performance is probably not measurable on this platform.
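
Presumably the call and return overhead is small compared to the atomic
operation itself. Schematically (a sketch using the GCC __sync builtin
as a stand-in for the hand-written SPARC CAS loop; the actual OOo code
differs):

  /* sketch only -- inline vs. out-of-line increment */
  static inline int incInline(int * p) {
      return __sync_add_and_fetch(p, 1);  /* just the atomic operation */
  }

  int incOutOfLine(int * p);  /* same atomic operation plus a function
                                 call into the shared library */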

Heiner

Jens-Heiner Rechtien wrote:

> Hi,
>
> I did some measurements with a copy of SRC680 m164 and one of the more
> pathological Calc documents, and found that the "lock" prefix indeed
> imposes a significant overhead of about 8% on a non-HT 1.8 GHz Pentium 4.
>
> (The tests included starting StarOffice, loading the document and
> closing the application as soon as the document is loaded).
>
> $ time ./soffice numbers_large.ods
> With "lock":          w/o "lock"
> user time: 41.474s    38.379s
> user time: 41.611s    38.676s
> user time: 41.796s    38.397s
> user time: 41.623s    38.412s
> user time: 41.696s    38.742s
>
> mean:      41.64s     38.52s
>
> Comparing the wall clock times showed essentially the same value of 8%
> overhead for the "lock" case.
>
> Heiner
>
>
> Stephan Bergmann wrote:
>> Hi all,
>>
>> Someone recently mentioned that
>> osl_increment/decrementInterlockedCount would show up as top scorers
>> with certain profiling tools (vtune?). That got me thinking.  On both
>> Linux x86 and Windows x86, those functions are implemented in
>> assembler, effectively consisting of a LOCK-prefixed XADD.  Now, I
>> thought that, at least on a uniprocessor machine, the LOCK would
>> probably not be that expensive, but that the profiling tool in
>> question might be confused by it and present bogus results.
>>
>> However, the following little program on Linux x86 (where incLocked is
>> a copy of osl_incrementInterlockedCount, and incUnlocked is the same,
>> without the LOCK prefix) told a different story:
>>
>>   // lock.c
>>   #include <stdio.h>
>>   int incLocked(int * p) {
>>     int n;
>>     __asm__ __volatile__ (
>>       "movl $1, %0\n\t"
>>       "lock\n\t"
>>       "xaddl %0, %2\n\t"
>>       "incl %0" :
>>       "=&r" (n), "=m" (*p) :
>>       "m" (*p) :
>>       "memory");
>>     return n;
>>   }
>>   int incUnlocked(int * p) {
>>     int n;
>>     __asm__ __volatile__ (
>>       "movl $1, %0\n\t"
>>       "xaddl %0, %2\n\t"
>>       "incl %0" :
>>       "=&r" (n), "=m" (*p) :
>>       "m" (*p) :
>>       "memory");
>>     return n;
>>   }
>>   int main(int argc, char ** argv) {
>>     int i;
>>     int n = 0;
>>     if (argv[1][0] == 'l') {
>>       puts("locked version");
>>       for (i = 0; i < 100000000; ++i) {
>>         incLocked(&n);
>>       }
>>     } else {
>>       puts("unlocked version");
>>       for (i = 0; i < 100000000; ++i) {
>>         incUnlocked(&n);
>>       }
>>     }
>>     return 0;
>>   }
>>
>> m1> cat /proc/cpuinfo
>>   processor : 0
>>   model name: Intel(R) Pentium(R) 4 CPU 1.80GHz
>>   ...
>> m1> time ./lock l
>>   locked version
>>   11.868u 0.000s 0:12.19 97.2%  0+0k 0+0io 0pf+0w
>> m1> time ./lock u
>>   unlocked version
>>   1.516u 0.000s 0:01.57 96.1%  0+0k 0+0io 0pf+0w
>>
>> m2> cat /proc/cpuinfo
>>   processor : 0
>>   model name: AMD Opteron(tm) Processor 242
>>   processor : 1
>>   model name: AMD Opteron(tm) Processor 242
>>   ...
>> m2> time ./lock l
>>   locked version
>>   1.863u 0.000s 0:01.86 100.0%  0+0k 0+0io 0pf+0w
>> m2> time ./lock u
>>   unlocked version
>>   0.886u 0.000s 0:00.89 98.8%  0+0k 0+0io 0pf+0w
>>
>> So, depending on CPU type, the version with LOCK is 2--8 times slower
>> than the version without LOCK.  Would be interesting to see whether
>> this has any actual impact on overall OOo performance.  (But first,
>> I'm off on vacation...)
>>
>> -Stephan


--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Niklas Nebel
Jens-Heiner Rechtien wrote:
> BTW, columns 1 and 2 are directly comparable to the m164 numbers quoted
> below: a 23% improvement from m164 to m170, wow!

A large part of that might be due to issue 64109, which was introduced
in m162 and fixed in m167.

Niklas
