Re: x86 osl/interlck.h performance


Re: x86 osl/interlck.h performance

Jens-Heiner Rechtien
Hi,

I've done some additional very simple-minded measurements to estimate
the effects of inlining the reference counters and the potential overhead
for checking if we are on an SMP system. I got the following numbers:

I:      inlining
NOI:    no-inlining
SMPC:   SMP check
NOSMPC: no SMP check

Times are in seconds.

                     NOI/NOSMPC, I/NOSMPC, NOI/SMPC, I/SMPC
P-IV 1800 (single)    7.634       6.892     1.796   0.784
Xeon 3.06GHz (multi)  6.50        4.07      6.67    4.11

Conclusions: Checking for SMP costs about 1% extra (4.11 s vs. 4.07 s)
on multi-processor machines, and yields roughly an 8.8x speedup
(6.892 s vs. 0.784 s with inlining) on older non-HT/non-multiprocessor
systems. Inlining is significant, too; its effect dwarfs the penalty of
checking for SMP on modern multi-processor systems.
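
The attachments below declare the is_smp flag but don't show how it gets
set. One plausible way to initialize it at startup (a sketch only, not
necessarily what the real code does) is to ask the OS how many processors
are online:

/* Hypothetical SMP detection; not part of the posted code. On Linux/glibc,
   sysconf(_SC_NPROCESSORS_ONLN) returns the number of processors that are
   currently online (a Hyper-Threading CPU also reports more than one). */
#include <unistd.h>

static int detect_smp(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    /* On error (-1), err on the safe side and keep the lock prefix. */
    return ncpus != 1;
}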

The measurements were done with the simple benchmark attached; it is of
course no substitute for real profiling with the office code.

Heiner

--
Jens-Heiner Rechtien
[hidden email]

# Pick exactly one CFLAGS line: -DINLINE and -DCHECKSMP select the
# I/NOI and SMPC/NOSMPC configurations from the table above.
# Run the resulting binary with something like: LD_LIBRARY_PATH=. ./intrlock
CFLAGS= -I. -fPIC -O2 -Wall -DINLINE -DCHECKSMP
#CFLAGS= -I. -fPIC -O2 -Wall -DINLINE
#CFLAGS= -I. -fPIC -O2 -Wall -DCHECKSMP
#CFLAGS= -I. -fPIC -O2 -Wall

intrlock: intrlock.o libsal.so
        $(CC) $(CFLAGS) -o intrlock $< -L. -lsal

libsal.so: sal.o
        $(CC) -shared -o libsal.so $<


clean:
        rm -f *.o libsal.so intrlock

all: intrlock libsal.so

/* Non-zero when running on a multi-processor (or Hyper-Threading) machine;
   only then is the lock prefix actually needed. Defined elsewhere, e.g. in
   libsal.so. */
extern int is_smp;

#if defined(INLINE)
#if defined(CHECKSMP)
/* Atomically increments *p and returns the new value. On SMP the bus is
   locked around the xadd; on a single processor a plain xadd is already
   atomic with respect to preemption, so the expensive lock is skipped. */
inline int incrementInterlockedCount(int *p) {
    int n;
    if ( is_smp ) {
        __asm__ __volatile__ (
            "movl $1, %0\n\t"
            "lock\n\t"
            "xaddl %0, %2\n\t"
            "incl %0" :
            "=&r" (n), "=m" (*p) :
            "m" (*p) :
            "memory");
    }
    else {
        __asm__ __volatile__ (
            "movl $1, %0\n\t"
            "xaddl %0, %2\n\t"
            "incl %0" :
            "=&r" (n), "=m" (*p) :
            "m" (*p) :
            "memory");
    }
    return n;
}
#else /* !CHECKSMP */
/* Unconditionally locked variant: always pays for the bus lock. */
inline int incrementInterlockedCount(int *p) {
    int n;
    __asm__ __volatile__ (
        "movl $1, %0\n\t"
        "lock\n\t"
        "xaddl %0, %2\n\t"
        "incl %0" :
        "=&r" (n), "=m" (*p) :
        "m" (*p) :
        "memory");
    return n;
}
#endif /* !CHECKSMP */
#else  /* INLINE */
/* Out-of-line version, provided by libsal.so. */
int incrementInterlockedCount(int *p);
#endif  /* INLINE */
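
The sal.c behind the Makefile's sal.o / libsal.so target was not posted.
A guess at what it might contain (the is_smp default and the exact
preprocessor layout are assumptions): it defines the flag and, for the
non-inlined builds, the same xadd sequence as the header, just out of line.

/* sal.c (reconstruction; the actual file is not in this thread). Built
   into libsal.so; provides is_smp and, for the NOI configurations, the
   out-of-line incrementInterlockedCount(). */

int is_smp = 1;   /* assumed default; presumably set at startup, e.g. via
                     sysconf(_SC_NPROCESSORS_ONLN) as sketched above */

#if !defined(INLINE)
int incrementInterlockedCount(int *p)
{
    int n;
#if defined(CHECKSMP)
    if ( !is_smp ) {
        /* single processor: the unlocked xadd is sufficient */
        __asm__ __volatile__ (
            "movl $1, %0\n\t"
            "xaddl %0, %2\n\t"
            "incl %0" :
            "=&r" (n), "=m" (*p) :
            "m" (*p) :
            "memory");
        return n;
    }
#endif /* CHECKSMP */
    /* SMP, or no SMP check configured: lock the bus around the xadd */
    __asm__ __volatile__ (
        "movl $1, %0\n\t"
        "lock\n\t"
        "xaddl %0, %2\n\t"
        "incl %0" :
        "=&r" (n), "=m" (*p) :
        "m" (*p) :
        "memory");
    return n;
}
#endif /* !INLINE */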



Re: x86 osl/interlck.h performance

Ross Johnson-2
Jens-Heiner Rechtien wrote:

> Hi,
>
> I've done some additional very simple-minded measurements to estimate
> the effects of inlining the reference counters and the potential
> overhead for checking if we are on an SMP system. I got the following
> numbers:
>
> I:      inlining
> NOI:    no-inlining
> SMPC:   SMP check
> NOSMPC: no SMP check
>
> Times are in seconds.
>
>                     NOI/NOSMPC, I/NOSMPC, NOI/SMPC, I/SMPC
> P-IV 1800 (single)    7.634       6.892     1.796   0.784
> Xeon 3.06GHz (multi)  6.50        4.07      6.67    4.11
>
> Conclusions: Checking for SMP costs about 1% extra (4.11 s vs. 4.07 s)
> on multi-processor machines, and yields roughly an 8.8x speedup
> (6.892 s vs. 0.784 s with inlining) on older non-HT/non-multiprocessor
> systems. Inlining is significant, too; its effect dwarfs the penalty of
> checking for SMP on modern multi-processor systems.

Great result for older machines, which is, I assume, where any
improvement is needed most. I'm curious as to why the call overhead is
such a large proportion of the Xeon result (37%). Was the total number
of calls to incrementInterlockedCount() the same for both P-IV and Xeon?
It looks as though the Xeon either doesn't lock the bus in this test,
or it's a lot more efficient with it. I think you mentioned earlier that
this was possible.

Ross




Re: x86 osl/interlck.h performance

Kay Ramme - Sun Germany - Hamburg
Hi Ross,

Ross Johnson wrote:
> Jens-Heiner Rechtien wrote:
Heiner is on vacation this week, so I'm jumping in ... ;-)
>
>
> Great result for older machines, which is, I assume, where any
> improvement is needed most. I'm curious as to why the call overhead is
Yep, this is obviously more important for older machines.

> such a large proportion of the Xeon result (37%). Was the total number
You are right, the call overhead seems to be much more expensive for the
Xeons, at least compared to the ordinary P IV. Any processor gurus out
there with some explanations?

> of calls to incrementInterlockedCount() the same for both P-IV and Xeon?
It seems that Heiner forgot to attach the "main" function (a rough
stand-in is sketched below). The call count was the same for all samples
(if I remember correctly it was about 10,000,000).

> It looks as though the Xeon either doesn't lock the bus in this test,
> or it's a lot more efficient with it. I think you mentioned earlier that
It actually locks the bus in all samples, because it _is_ SMP.

> this was possible.
>
> Ross
Kay
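
Since the driver itself never made it into the thread, here is a minimal
stand-in that matches the description above (roughly 10,000,000
single-threaded calls, wall-clock time in seconds). The file name
intrlock.c and the header name sal.h are guesses based on the Makefile;
this is an illustration, not Heiner's actual benchmark:

/* intrlock.c: hypothetical stand-in for the missing benchmark driver.
   Performs ITERATIONS calls to incrementInterlockedCount() and reports
   the elapsed wall-clock time. */
#include <stdio.h>
#include <sys/time.h>

#include "sal.h"            /* assumed name of the attached header */

#define ITERATIONS 10000000L

int main(void)
{
    struct timeval start, end;
    double elapsed;
    int counter = 0;
    long i;

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; ++i)
        incrementInterlockedCount(&counter);
    gettimeofday(&end, NULL);

    elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d increments in %.3f s\n", counter, elapsed);
    return 0;
}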
