x86 osl/interlck.h performance

stephan.bergmann
Hi all,

Someone recently mentioned that osl_increment/decrementInterlockedCount
would show up as top scorers with certain profiling tools (vtune?).
That got me thinking.  On both Linux x86 and Windows x86, those
functions are implemented in assembler, effectively consisting of a
LOCK-prefixed XADD.  Now, I thought that, at least on a uniprocessor
machine, the LOCK would probably not be that expensive, but that the
profiling tool in question might be confused by it and present bogus
results.

However, the following little program on Linux x86 (where incLocked is a
copy of osl_incrementInterlockedCount, and incUnlocked is the same,
without the LOCK prefix) told a different story:

   // lock.c
   #include <stdio.h>
   int incLocked(int * p) {
     int n;
     __asm__ __volatile__ (
       "movl $1, %0\n\t"
       "lock\n\t"
       "xaddl %0, %2\n\t"
       "incl %0" :
       "=&r" (n), "=m" (*p) :
       "m" (*p) :
       "memory");
     return n;
   }
   int incUnlocked(int * p) {
     int n;
     __asm__ __volatile__ (
       "movl $1, %0\n\t"
       "xaddl %0, %2\n\t"
       "incl %0" :
       "=&r" (n), "=m" (*p) :
       "m" (*p) :
       "memory");
     return n;
   }
   int main(int argc, char ** argv) {
     int i;
     int n = 0;
     if (argv[1][0] == 'l') {
       puts("locked version");
       for (i = 0; i < 100000000; ++i) {
         incLocked(&n);
       }
     } else {
       puts("unlocked version");
       for (i = 0; i < 100000000; ++i) {
         incUnlocked(&n);
       }
     }
     return 0;
   }

m1> cat /proc/cpuinfo
   processor : 0
   model name: Intel(R) Pentium(R) 4 CPU 1.80GHz
   ...
m1> time ./lock l
   locked version
   11.868u 0.000s 0:12.19 97.2%  0+0k 0+0io 0pf+0w
m1> time ./lock u
   unlocked version
   1.516u 0.000s 0:01.57 96.1%  0+0k 0+0io 0pf+0w

m2> cat /proc/cpuinfo
   processor : 0
   model name: AMD Opteron(tm) Processor 242
   processor : 1
   model name: AMD Opteron(tm) Processor 242
   ...
m2> time ./lock l
   locked version
   1.863u 0.000s 0:01.86 100.0%  0+0k 0+0io 0pf+0w
m2> time ./lock u
   unlocked version
   0.886u 0.000s 0:00.89 98.8%  0+0k 0+0io 0pf+0w

So, depending on CPU type, the version with LOCK is 2--8 times slower
than the version without LOCK.  Would be interesting to see whether this
has any actual impact on overall OOo performance.  (But first, I'm off
on vacation...)

-Stephan


Re: x86 osl/interlck.h performance

Thorsten Behrens
Stephan Bergmann <[hidden email]> writes:

> So, depending on CPU type, the version with LOCK is 2--8 times slower
> than the version without LOCK.  Would be interesting to see whether
> this has any actual impact on overall OOo performance.
>
Hm. First off, I'd try to inline those asm snippets (except for the
sparc ones, which use a function ptr to dynamically distinguish
between v8 and v9 instruction sets). Function call overhead for a
simple increment appears somewhat disproportionate to me...
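
To make that concrete, here is a minimal sketch of an inlined x86 variant
(GCC syntax; the function name and header placement are illustrative, not
the actual osl/interlck.h change):

   /* sketch: same asm as osl_incrementInterlockedCount, exposed as a
      static inline so the compiler can drop the function call */
   static inline int osl_incrementInterlockedCount_inline(int * p) {
     int n;
     __asm__ __volatile__ (
       "movl $1, %0\n\t"
       "lock\n\t"
       "xaddl %0, %2\n\t"
       "incl %0" :
       "=&r" (n), "=m" (*p) :
       "m" (*p) :
       "memory");
     return n;
   }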

Apart from that, with the integration of the UNO threading framework,
there's opportunity to bin thread-safe implementations at tons of
places, isn't there?

Cheers,

--

Thorsten

If you're not failing some of the time, you're not trying hard enough.


Re: x86 osl/interlck.h performance

Ross Johnson-2
In reply to this post by stephan.bergmann
On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:

> Hi all,
>
> Someone recently mentioned that osl_increment/decrementInterlockedCount
> would show up as top scorers with certain profiling tools (vtune?).
> That got me thinking.  On both Linux x86 and Windows x86, those
> functions are implemented in assembler, effectively consisting of a
> LOCK-prefixed XADD.  Now, I thought that, at least on a uniprocessor
> machine, the LOCK would probably not be that expensive, but that the
> profiling tool in question might be confused by it and present bogus
> results.
>
> However, the following little program on Linux x86 (where incLocked is a
> copy of osl_incrementInterlockedCount, and incUnlocked is the same,
> without the LOCK prefix) told a different story:

From a completely different project (pthreads-win32) I have seen the
same thing and was surprised.

I had read that the LOCK prefix has no effect for uni-processors.
However, on a single CPU system, the LOCK prefix slowed the interlocked
instructions down considerably. In this case, it was the xchg and
cmpxchg instructions - the same ones that are at the centre of the Win32
API InterlockedExchange routines.

I also found from timing tests using hand-optimised assembler that calls
to the Win32 API Interlocked routines appeared to be optimised when the
code is compiled by MSVC, but not GCC (say). It was as though MSVC was
emitting optimised assembler on the fly instead of calling the routines
in Kernel32.dll. My timings showed that the standard Interlocked routine
calls compiled with MSVC were as fast or faster than my inlined
assembler without the LOCK prefix. The interlocked routines are used as
the basis for the mutex operations in pthreads-win32, and using the
assembler versions, I was able to cut the time for some of the pthreads-
win32 test applications involving saturated POSIX reader-writer lock
calls to nearly 1/3 for the gcc compiled versions, and match the times
produced by the MSVC compiled code.

And I agree with the figure mentioned below, that the LOCK prefix slows
the x* instructions down by up to 8 times, or maybe even a bit more.

Ross Johnson

> [...]
>
> So, depending on CPU type, the version with LOCK is 2--8 times slower
> than the version without LOCK.  Would be interesting to see whether this
> has any actual impact on overall OOo performance.  (But first, I'm off
> on vacation...)
>
> -Stephan


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Thorsten Behrens wrote:
> Stephan Bergmann <[hidden email]> writes:
>
>> So, depending on CPU type, the version with LOCK is 2--8 times slower
>> than the version without LOCK.  Would be interesting to see whether
>> this has any actual impact on overall OOo performance.
>>
> Hm. First off, I'd try to inline those asm snippets (except for the
> sparc ones, which use a function ptr to dynamically distinguish
> between v8 and v9 instruction sets).

We can inline the sparc ones, too. Since the last compiler change we
support only v8plus (V9 architecture, but in 32-bit mode). The distinction
between the spin-lock implementation and the (faster) "cas" based
implementation is no longer necessary, maybe with the exception of
NetBSD/Sparc (I have to ask sparcmoz if he still supports the v7/v8
architectures).
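
For illustration, a v8plus/v9 "cas" based counter could look roughly like
the following (GCC inline asm, built for v8plus; the names and constraints
are a sketch, not the actual sal code):

   /* sketch: 32-bit compare-and-swap via the v9 "cas" instruction;
      returns the value that was found in *addr */
   static int cas32(volatile int * addr, int expected, int newval) {
     __asm__ __volatile__ (
       "cas [%2], %3, %0" :
       "+r" (newval), "+m" (*addr) :
       "r" (addr), "r" (expected) :
       "memory");
     return newval;
   }
   int incrementCas(volatile int * p) {
     int old;
     do {
       old = *p;
     } while (cas32(p, old, old + 1) != old);
     return old + 1;
   }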

If there is consensus that inlining osl_[in|de]crementInterlockedCount()
is worthwhile, I volunteer to do the change.

I can't see what we could do about the costs of the "lock" instruction
on x86. I mean, if we need an atomic increment/decrement for our
reference counter we can't work with non-atomic instructions here,
especially now when multi-processor/multi-core PCs are entering the
mainstream.

Heiner




--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Ross Johnson wrote:

> On Fri, 2006-04-21 at 15:09 +0200, Stephan Bergmann wrote:
>> [...]
>
> From a completely different project (pthreads-win32) I have seen the
> same thing and was surprised.
>
> I had read that the LOCK prefix has no effect for uni-processors.
> However, on a single CPU system, the LOCK prefix slowed the interlocked
> instructions down considerably. In this case, it was the xchg and
> cmpxchg instructions - the same ones that are at the centre of the Win32
> API InterlockedExchange routines.
>
> I also found from timing tests using hand-optimised assembler that calls
> to the Win32 API Interlocked routines appeared to be optimised when the
> code is compiled by MSVC, but not GCC (say). It was as though MSVC was
> emitting optimised assembler on the fly instead of calling the routines
> in Kernel32.dll. My timings showed that the standard Interlocked routine
> calls compiled with MSVC were as fast or faster than my inlined
> assembler without the LOCK prefix. The interlocked routines are used as
> the basis for the mutex operations in pthreads-win32, and using the
> assembler versions, I was able to cut the time for some of the pthreads-
> win32 test applications involving saturated POSIX reader-writer lock
> calls to nearly 1/3 for the gcc compiled versions, and match the times
> produced by the MSVC compiled code.

Now that's interesting! Did you disassemble what MSVC emits instead of
calling the interlocked routines? How do they achieve atomic operations
without the lock prefix on the xadd, xchg or cmpxchg instructions?

>
> And I agree with the figure mentioned below, that the LOCK prefix slows
> the x* instructions down by up to 8 times, or maybe even a bit more.

Heiner

--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Daniel Boelzle
In reply to this post by stephan.bergmann
>> [...]
>
> Now that's interesting! Did you disassemble what MSVC emits instead of
> calling the interlocked routines. How do they achieve atomic operations
> without the lock prefix to xadd, xchg or cmpxchg instructions?

However, compiling on my Pentium-M, VS8 always calls the kernel32
routines (I tried all /O? options, taking the declarations from
winbase.h): no intrinsics [1].
Including the intrinsic versions directly does work, though; the code is
inlined:

#include <intrin.h>

void foo(long * p)
{
    _InterlockedIncrement(p);
}

[d:\]cl -c /Ox t.cxx ^ dumpbin /DISASM t.obj
[...]
  00000004: B9 01 00 00 00     mov         ecx,1
  00000009: F0 0F C1 08        lock xadd   dword ptr [eax],ecx

regards,
-Daniel

[1] http://msdn2.microsoft.com/en-us/library/2ddez55b(VS.80).aspx


Re: x86 osl/interlck.h performance

Jim Watson
In reply to this post by Jens-Heiner Rechtien
On Fri, Apr 21, 2006 at 06:22:52PM +0200, Jens-Heiner Rechtien wrote:

> Thorsten Behrens wrote:
> >Stephan Bergmann <[hidden email]> writes:
> >
> >>So, depending on CPU type, the version with LOCK is 2--8 times slower
> >>than the version without LOCK.  Would be interesting to see whether
> >>this has any actual impact on overall OOo performance.
> >>
> >Hm. First off, I'd try to inline those asm snippets (except for the
> >sparc ones, which use a function ptr to dynamically distinguish
> >between v8 and v9 instruction sets).

It is documented somewhere that the function is called only once, during
startup, but a simple trace I did ages ago showed that util.c is traversed
twice during startup.

>
> We can inline the sparc ones, too. Since the last compiler change we
> support only v8plus (V9 architecture, but in 32 mode). The distinction
> between the spin lock implementation and the (faster) "cas" based
> implementation is no longer necessary, maybe with the exception of
> NetBSD/Sparc (have to ask sparcmoz if he still supports v7/v8 architectures)

I would like to support v8 because that is what the GNU/Linux SPARC effort
generally does; much of the user base uses surplus hardware. I believe
packagers (Debian) also like to support v8. This is the _only_ place in the
entire code base where there is a difference between v8 and v8+.

However, it may be sufficient to ship a separate v8 version of the sal
library, as the (very few) v8 users who also use OOo can handle that.

jim
[hidden email]


Re: x86 osl/interlck.h performance

Ross Johnson-2
In reply to this post by Jens-Heiner Rechtien
On Fri, 2006-04-21 at 18:32 +0200, Jens-Heiner Rechtien wrote:

> Ross Johnson wrote:
> > [...]
>
> Now that's interesting! Did you disassemble what MSVC emits instead of
> calling the interlocked routines. How do they achieve atomic operations
> without the lock prefix to xadd, xchg or cmpxchg instructions?

No - but I did check the emitted assembler for both compilers to make
sure that the inlining and lock prefixing was as I expected.

The timings for the Interlocked routine calling and for the inlined non-
locked asm using MSVC 6 were almost identical, whereas the inlined
locked asm was much slower. The same tests using GCC showed the
Interlocked calls to be similar to the slower locked asm from either GCC
or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
GCC may have been a little faster, reflecting that gcc can optimise the
inlined asm by substituting registers.

AFAIK, the xchg etc instructions are atomic without the lock prefix on
the single (non-hyperthreaded (TM)) processor system that I'm still
using.

Ross



Re: x86 osl/interlck.h performance

Ross Johnson-2
In reply to this post by Jens-Heiner Rechtien
On Fri, 2006-04-21 at 18:22 +0200, Jens-Heiner Rechtien wrote:
> I can't see what we could do about the costs of the "lock" instruction
> on x86. I mean, if we need an atomic increment/decrement for our
> reference counter we can't work with non-atomic instructions here,
> especially now when multi-processor/multi-core PCs are entering the
> mainstream.

There are still a lot of people using older x86 machines.

Can't you check the system processor mask on startup and have your
inlined asm conditionally run locked or non-locked instructions
accordingly? The overhead of the test is still much, much less than the
lock overhead on uni-processor systems.

This is what I've done in pthreads-win32 and it seems to work in
practice. Nobody has complained or suggested that the concept is
invalid.
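
A minimal sketch of that idea (Win32 plus GCC inline asm, e.g. MinGW;
GetSystemInfo and the flag name are just one way the check might be done,
not the pthreads-win32 code):

   /* sketch: decide once at startup whether the lock prefix is needed */
   #include <windows.h>

   static int g_uniprocessor = 0;

   void initAtomics(void) {
     SYSTEM_INFO si;
     GetSystemInfo(&si);
     g_uniprocessor = (si.dwNumberOfProcessors == 1);
   }

   int incrementAdaptive(int * p) {
     int n;
     if (g_uniprocessor) {
       __asm__ __volatile__ (
         "movl $1, %0\n\t"
         "xaddl %0, %2\n\t"
         "incl %0" :
         "=&r" (n), "=m" (*p) : "m" (*p) : "memory");
     } else {
       __asm__ __volatile__ (
         "movl $1, %0\n\t"
         "lock\n\t"
         "xaddl %0, %2\n\t"
         "incl %0" :
         "=&r" (n), "=m" (*p) : "m" (*p) : "memory");
     }
     return n;
   }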

Ross



Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Ross Johnson wrote:

> On Fri, 2006-04-21 at 18:32 +0200, Jens-Heiner Rechtien wrote:
>> Ross Johnson wrote:
>>> [...]
>> Now that's interesting! Did you disassemble what MSVC emits instead of
>> calling the interlocked routines. How do they achieve atomic operations
>> without the lock prefix to xadd, xchg or cmpxchg instructions?
>
> No - but I did check the emitted assembler for both compilers to make
> sure that the inlining and lock prefixing was as I expected.
>
> The timings for the Interlocked routine calling and for the inlined non-
> locked asm using MSVC 6 were almost identical, whereas the inlined
> locked asm was much slower. The same tests using GCC showed the
> Interlocked calls to be similar to the slower locked asm from either GCC
> or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
> GCC may have been a little faster, reflecting that gcc can optimise the
> inlined asm by substituting registers.

Your measurements seem to suggest that Microsoft uses a conditional
approach in the non-inlined version of the Interlocked[In|De]crement
routines, without the lock prefix for older processors/single-processor
systems. The additional check penalizes newer HT/multicore/multiprocessor
systems; whether that matters at all needs to be measured.

>
> AFAIK, the xchg etc instructions are atomic without the lock prefix on
> the single (non-hyperthreaded (TM)) processor system that I'm still
> using.

The Intel manuals state that xchg implicitly behaves as if it had a
lock prefix.

BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be
that expensive, because if the target memory of the instruction is
cacheable the CPU will not assert the Lock# signal (which locks the bus)
but only lock the affected cache line.

Heiner

--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Ross Johnson wrote:

> On Fri, 2006-04-21 at 18:22 +0200, Jens-Heiner Rechtien wrote:
>> I can't see what we could do about the costs of the "lock" instruction
>> on x86. I mean, if we need an atomic increment/decrement for our
>> reference counter we can't work with non-atomic instructions here,
>> especially now when multi-processor/multi-core PCs are entering the
>> mainstream.
>
> There are still a lot of people using older x86 machines.
>
> Can't you check the system processor mask on startup and have your
> inlined asm conditionally run locked or non-locked instructions
> accordingly. The overhead of the test is still much much less than the
> lock overhead on uni-processor systems.

That's a possible solution which we should keep in mind if we change the
implementation. We would still need some performance measurements
regarding the overhead for the additional test.

Heiner

--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Ross Johnson-2
In reply to this post by Jens-Heiner Rechtien
On Mon, 2006-04-24 at 16:19 +0200, Jens-Heiner Rechtien wrote:

> Ross Johnson wrote:
> >
> > The timings for the Interlocked routine calling and for the inlined non-
> > locked asm using MSVC 6 were almost identical, whereas the inlined
> > locked asm was much slower. The same tests using GCC showed the
> > Interlocked calls to be similar to the slower locked asm from either GCC
> > or MSVC. The inlined non-locked asm for MSVC and GCC were very similar.
> > GCC may have been a little faster, reflecting that gcc can optimise the
> > inlined asm by substituting registers.
>
> Your measurements seem to suggest that Microsoft uses a conditional
> approach in the non-inlined version of the  Interlocked[In|De]crement
> routines, without lock prefix for older processors/single processor
> systems. The additional check penalizes newer
> HT/multicore/multiprocessor systems, if it matters at all needs to be
> measured.

I checked my notes and there isn't actually much difference between MSVC
and GCC when calling the Interlocked routines - they are both faster
than the results using the lock prefix. So it isn't the compiler but the
DLL itself that appears to conditionally switch.

I had also completely forgotten that I emulate xchg using cmpxchg to
avoid the mandatory bus lock on that instruction.

The following results are for an application that performs saturated
pthreads reader-writer lock operations using pthreads-win32, which uses
xchg and cmpxchg inline assembler in the underlying mutex and semaphore
routines.

It's not important, but these times are total milliseconds for 5 threads
to perform 1,000,000 access operations each (about 50,000 writes and
950,000 reads each). On a 2.4GHz i686 single-core, single processor.

-------------------------------------------------------------------
                inlined         inlined         call
                PTW32 xchg      XCHG            InterlockedExchange
...................................................................
GCC             641             1687            765

VC6.0           844             1750            891
-------------------------------------------------------------------

As you say below, the XCHG instruction always locks the bus, so I
emulate the XCHG instruction using the CMPXCHG instruction. The times
for the emulated xchg (inlined PTW32 xchg in the table) and the real
XCHG instruction (inlined XCHG) are shown above. The InterlockedExchange
call timings suggest that it also uses cmpxchg instead of xchg, and that
it doesn't use the lock prefix.
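
For illustration, the emulation amounts to something like this (a GCC
inline asm sketch, not the pthreads-win32 source; without the lock prefix
it is only safe on a uniprocessor):

   /* sketch: exchange built from cmpxchg, avoiding xchg's implicit lock */
   static int exchangeViaCmpxchg(int * p, int newval) {
     int old = *p;
     int prev;
     do {
       prev = old;
       __asm__ __volatile__ (
         /* if (*p == eax) *p = newval; eax ends up with the old *p */
         "cmpxchgl %2, %1" :
         "=a" (old), "+m" (*p) :
         "r" (newval), "0" (prev) :
         "memory", "cc");
     } while (old != prev); /* retry if *p changed in between */
     return old;
   }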

Even though the time spent in the xchg operation is a small proportion
of the whole application, the difference in overall run time with and
without the lock prefix (first two result columns) is 2 to 2.5 times.

> >
> > AFAIK, the xchg etc instructions are atomic without the lock prefix on
> > the single (non-hyperthreaded (TM)) processor system that I'm still
> > using.
>
> The Intel manuals states that xchg implicitly behaves as if it had a
> lock prefix.

Yes. See above.

> BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be
> that expensive, because if the target memory of the instruction is
> cacheable the CPU will not assert the Lock# signal (which locks the bus)
> but only lock the affected cache line.

Otherwise, for some very specific multithreaded applications, a single
processor can still beat two processors working together. :o)

Ross



Re: x86 osl/interlck.h performance

Thorsten Behrens
In reply to this post by stephan.bergmann
Jens-Heiner Rechtien <[hidden email]> writes:

> BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be
> that expensive, because if the target memory of the instruction is
> cacheable the CPU will not assert the Lock# signal (which locks the
> bus) but only lock the affected cache line.
>
Wasn't Stephan's original measurement based on a P4 ("model name:
Intel(R) Pentium(R) 4 CPU 1.80GHz"), with a factor-eight slowdown?

So, how to go forward from here? Is anybody trying out/profiling the
proposed changes (switch on multi-coreness, inlining)?

Cheers,

--

Thorsten

If you're not failing some of the time, you're not trying hard enough.


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Thorsten Behrens wrote:
> Jens-Heiner Rechtien <[hidden email]> writes:
>
>> BTW, on newer processors (P4, Xeon etc) the "lock" prefix shouldn't be
>> that expensive, because if the target memory of the instruction is
>> cacheable the CPU will not assert the Lock# signal (which locks the
>> bus) but only lock the affected cache line.
>>
> Wasn't Stephan's original measurement based on a P4 ("model name:
> Intel(R) Pentium(R) 4 CPU 1.80GHz"), with a factor-eight slowdown?

I think that Intel changed the behavior with the introduction of HT in
the Pentium 4, so Stephan's machine is still one which locks the whole
bus. Could be wrong here, though.

>
> So, how to go forward from here? Is anybody trying out/profiling the
> proposed changes (switch on multi-coreness, inlining)?

I volunteer for that, but I have two projects to finish first, so it
could take a little time.

Heiner


--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Hi Ross,

thanks for your numbers. So it looks like the lock prefix inside the
reference counters has an impact on older processors (exactly where it's
not needed at all) which dwarfs even the cost of not inlining the
reference counter. I'll have a look at it.

Heiner

Ross Johnson wrote:

> [...]


--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Kay Ramme - Sun Germany - Hamburg
In reply to this post by stephan.bergmann
Thorsten,

Thorsten Behrens wrote:
> Apart from that, with the integration of the UNO threading framework,
> there's opportunity to bin thread-safe implementations at tons of
> places, isn't there?
Yes and no. With the introduction of the UNO threading framework (see

http://wiki.services.openoffice.org/wiki/Uno/Effort/Creating_the_Uno_Threading_Framework

and

http://wiki.services.openoffice.org/wiki/Effort/Multi_Threading_Clean_up

for details), we can basically get rid of many interlock usages.
Unfortunately, many of these in-/decrements are generated on behalf of
the rtl strings, which are not yet planned to be replaced with a
thread-unsafe version (despite the fact that this is certainly possible).

Note: With CWS perform07 we (Malte and I) fixed the unnecessary
refcounting of empty rtl and tools strings; this might ease the
situation a little bit.
>
> Cheers,
>
Kay


Re: x86 osl/interlck.h performance

Kay Ramme - Sun Germany - Hamburg
In reply to this post by stephan.bergmann
Kay Ramme - Sun Germany - Hamburg wrote:
> Unfortunately, many of these in-/decrements are generated on behalf of
> the rtl strings, which are not yet planned to be replaced with a thread
> unsafe version (despite the fact that this is certainly possible).
Regarding the (rtl, tools) strings, I think many copy and interlock
inc-/decrement operations could be avoided if constant strings,
e.g. as in

   myFun(rtl::OUString(RTL_CONSTASCII_USTRINGPARAM("aStringLiteral")))

could be emitted as constants initialized at compile time. That would
mean that
- no charset conversion (UNIX),
- no memory de-/allocation and
- no interlock operations
would ever be executed on behalf of these constant strings, which all
seems superfluous anyway.
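
As a purely illustrative sketch (the field layout mirrors rtl_uString,
but the initialization and the "never freed" marker value are
hypothetical, not existing sal code), such a compile-time constant might
look like:

   /* hypothetical: a string object laid out like rtl_uString and fully
      initialized at compile time, so no conversion or allocation is
      needed; acquire/release would additionally have to recognize such
      objects (e.g. via the special refcount value) to skip the
      interlocked traffic */
   typedef struct {
     int            refCount;   /* oslInterlockedCount */
     int            length;     /* sal_Int32 */
     unsigned short buffer[15]; /* sal_Unicode, NUL-terminated */
   } StaticUString;

   static StaticUString aLiteral = {
     0x40000000, /* hypothetical "static, never freed" marker */
     14,
     { 'a','S','t','r','i','n','g','L','i','t','e','r','a','l', 0 }
   };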


Kay


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
In reply to this post by stephan.bergmann
Hi,

I did some measurements with a copy of SRC680 m164 and one of the more
pathological Calc documents, and found that the "lock" prefix indeed
imposes a significant overhead of about 8% on a non-HT 1.8 GHz Pentium IV.

(The tests included starting StarOffice, loading the document and
closing the application as soon as the document is loaded).

$ time ./soffice numbers_large.ods
With "lock":          w/o "lock"
user time: 41.474s    38.379s
user time: 41.611s    38.676s
user time: 41.796s    38.397s
user time: 41.623s    38.412s
user time: 41.696s    38.742s

mean:      41.64s     38.52s

Comparing the wall clock times showed essentially the same value of 8%
overhead for the "lock" case.

Heiner


Stephan Bergmann wrote:

> [...]


--
Jens-Heiner Rechtien
[hidden email]


Re: x86 osl/interlck.h performance

Kay Ramme - Sun Germany - Hamburg
In reply to this post by stephan.bergmann
Heiner,

Jens-Heiner Rechtien wrote:
> Hi,
>
> I did some measurements with a copy of SRC680 m164 and one of the more
> pathological calc documents, and found that the "lock" prefix indeed
> imposes a significant overhead of about 8% on a non HT 1.8 GHz Pentium IV.
>

8% would really be a great improvement, especially compared to the
amount of work we need to invest. I expect this relative number to be
even higher after the integration of Malte's perform07 CWS.
So we should find a way to integrate this change in a compatible (SMP,
ABI) way, in particular because performance improvements are even more
interesting for people with older machines.

>
> Heiner
Kay


Re: x86 osl/interlck.h performance

Cor Nouws
Kay Ramme - Sun Germany - Hamburg wrote:

> 8% would really be a great improvement,

Indeed! The tests by Jens-Heiner recorded the total time:
starting SO - loading the ods - closing SO.

So when SO is already running and the user only loads the document .... :-)

Good news - thanks!

Cor


--
Cor Nouws
www.bsooo.nl - www.nouenoff.nl
Free your files
