#import <ppc_intrinsics.h> for Universal Binary?

Member
Posts: 110
Joined: 2002.07
Post: #1
How can I replace ppc_intrinsics.h?

I only need to replace __fres and __frsqrte.

Any ideas?
Moderator
Posts: 1,140
Joined: 2005.07
Post: #2
You could just use 1.0f/x for __fres (reciprocal) and 1.0f/sqrtf(x) for __frsqrte (reciprocal square root). If you are using them and it's putting them in there for optimization purposes, then are you linking to the 10.4 Universal SDK rather than the 10.3.9, 10.4, or "Current OS" SDKs? If not, you must link to the 10.4 Universal SDK. (I'm assuming you're in Xcode; I don't know how to do that from the command line.)
Luminary
Posts: 5,143
Joined: 2002.04
Post: #3
sqrt() calls on Intel will translate to a single machine instruction, so you may not need this optimization any more...

completely untested:

Code:
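#include <xmmintrin.h>

/* GCC statement expressions; _mm_cvtss_f32 extracts the low float from the __m128 result. */
#define __frsqrte(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(_f))); })
#define __fres(f) ({ float _f = (f); _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(_f))); })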
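For a Universal Binary, one way to keep a single source file is to pick the implementation per architecture at compile time. This is only a sketch: the wrapper names approx_recip and approx_rsqrt are made up for illustration, and it assumes the compiler defines __ppc__ on PowerPC Macs and __SSE__ when SSE is enabled.

```c
#if defined(__ppc__) || defined(__ppc64__)
#include <ppc_intrinsics.h>   /* Apple header providing __fres and __frsqrte */
static inline float approx_recip(float x) { return __fres(x); }
static inline float approx_rsqrt(float x) { return (float)__frsqrte(x); }
#elif defined(__SSE__)
#include <xmmintrin.h>        /* SSE scalar estimate intrinsics */
static inline float approx_recip(float x)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}
static inline float approx_rsqrt(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#else
#include <math.h>             /* no estimate instruction assumed: compute exactly */
static inline float approx_recip(float x) { return 1.0f / x; }
static inline float approx_rsqrt(float x) { return 1.0f / sqrtf(x); }
#endif
```

The estimates are not bit-identical across architectures, so results that depend on the exact estimate value will differ between the PowerPC and Intel slices.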
Member
Posts: 110
Joined: 2002.07
Post: #4
thanks going to try that :-)
Member
Posts: 41
Joined: 2006.01
Post: #5
Those are a single instruction on PPC as well, but not single-cycle.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #6
the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction. All Intel Macs have such a thing:

Code:
iMacCoreDuo:~ keith$ cat > test.c
#include <math.h>

float square_root(float f) { return sqrtf(f); }
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000        b       0x20    ; symbol stub for: _sqrtf
iMacCoreDuo:~ keith$ gcc -c -O2 -arch ppc -mcpu=G5 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000        fsqrts  f1,f1
00000004        blr
iMacCoreDuo:~ keith$ gcc -c -O2 -arch i386 test.c
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
_square_root:
00000000        pushl   %ebp
00000001        movl    %esp,%ebp
00000003        subl    $0x04,%esp
00000006        sqrtss  0x08(%ebp),%xmm0
0000000b        movss   %xmm0,0xfffffffc(%ebp)
00000010        fldsl   0xfffffffc(%ebp)
00000013        leave
00000014        ret
Moderator
Posts: 771
Joined: 2003.04
Post: #7
From TN2087:
Quote:The G5 has a full-precision hardware square root implementation. If your code executes square root, check for the availability of the hardware square root in the G5 and execute code calling the instruction directly (e.g. __fsqrt()) instead of calling the sqrt() routine. (Use __fsqrts() for single-precision.) You can use the GCC compiler flags -mpowerpc-gpopt and -mpowerpc64 to transform sqrt() function calls directly into the PPC sqrt instruction.

Edit: Grr...OSC beat me to it...
Luminary
Posts: 5,143
Joined: 2002.04
Post: #8
For completeness' sake:

Code:
iMacCoreDuo:~ keith$ icc -c test.c -O2
iMacCoreDuo:~ keith$ otool -tV test.o
test.o:
(__TEXT,__text) section
__text:
00000000        subl    $0x0c,%esp
00000003        sqrtss  0x10(%esp,1),%xmm0
00000009        movss   %xmm0,(%esp,1)
0000000e        fldsl   (%esp,1)
00000011        addl    $0x0c,%esp
00000014        ret
00000015        nop
00000016        nop
00000017        nop

Looks like GCC 4 for Intel is producing rather suboptimal code for this simple case...
Member
Posts: 41
Joined: 2006.01
Post: #9
OneSadCookie Wrote: the PowerPC has reciprocal square root estimate and reciprocal estimate, yes, but only the G5 has a full precision non-reciprocal square root instruction.

This is not so, what about fsqrt t,b and fsqrts t,b? Not to go on about it.

Also, don't forget that cycles count. Just because an instruction exists to do something doesn't mean it's faster than several instructions. The 5-bit frsqrte and 8-bit fres are done instantly with parallel logic. It's unlikely (though I don't know) that either the PPC or Intel does a division in a single cycle. It may take dozens.
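For context on why such low-precision estimates are still useful: they are commonly refined with Newton-Raphson steps, each of which roughly doubles the number of correct bits. A minimal sketch follows; the function names are hypothetical, and the starting estimate is passed in as a parameter rather than taken from frsqrte or fres.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):
 *   y' = y * (1.5 - 0.5 * x * y * y)
 * Uses only multiplies and a subtract, no divide or square root. */
static inline float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}

/* One Newton-Raphson step for r ~ 1/x:
 *   r' = r * (2 - x * r) */
static inline float refine_recip(float x, float r)
{
    return r * (2.0f - x * r);
}
```

Starting from a rough guess of 0.48 for 1/sqrt(4), one step gives about 0.4988 and a second step about 0.499996, so a cheap estimate plus a few multiplies can stand in for a long-latency divide or square root when full precision is not required.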
Luminary
Posts: 5,143
Joined: 2002.04
Post: #10
fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|

And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.
Member
Posts: 41
Joined: 2006.01
Post: #11
OneSadCookie Wrote: fsqrt and fsqrts are not available on the G3 or G4. Do some research before randomly arguing with me :|

OneSadCookie, that is certainly wrong. I have 2 data books and a pdf right in front of me, and they are all about a decade old, and they all have those instructions in them. I have been programming a wide variety of processors in assembler professionally for half my life and I don't like your tone at all. And I don't randomly argue with anyone.

I checked my FACTS in TWO PLACES before posting, please check yours!

I don't know when I've been so angry.

Quote:And yes, I am well aware that just because it's a single instruction doesn't mean that it's "fast". It does, however, mean that you don't pay the (significant) overhead of a function call, particularly one into a dynamic library. I don't have a G5 handy to test with, but my recollection is that a fsqrt is about as expensive as a divide (30+ cycles latency). I'd also be very surprised if frsqrte and fres are "instant"; chances are they have at least a four-cycle latency. Again, no G5 to test with. Perhaps somebody who has one can post the numbers as reported by Shark.

frsqrte is 5-bits. My common sense tells me it is surely single-cycle. 50 gates should be enough to budget that. fres too. For that matter I don't see how an extra cycle could be used to reduce the gate count in either case, the instructions are just too weak.
DoG
Moderator
Posts: 869
Joined: 2003.01
Post: #12
The PPC ISA contains fsqrt and fsqrts, and lists them as optional. This is straight from the Apple headers:

/*
* __fsqrt - Floating-Point Square Root (Double-Precision)
*
* WARNING: Illegal instruction for PowerPC 603, 604, 750, 7400, 7410,
* 7450, and 7455
*/

Latency for frsqrte is 3 or 4 cycles, according to my sources.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #13
The instructions are *optional*. They're defined in the original PowerPC spec, yes, but the range of processors that implement them is limited.

PowerPC compiler-writers guide Wrote:The PowerPC architecture includes a set of optional instructions:

General-Purpose Group—fsqrt and fsqrts.
Graphics Group—stfiwx, fres, frsqrte, and fsel.
If an implementation supports any instruction in a group, it must support all of the instructions in the group. Check the documentation for a specific implementation to determine which, if any, of the groups are supported

Section 1.5.2 of this document: http://www.freescale.com/files/32bit/doc...pdf?srch=1 states that G3s implement the "graphics group" but not the "general-purpose group".

Section 1.3.2.3 of this document: http://www.freescale.com/files/32bit/doc...7410UM.pdf states that G4s implement the "graphics group" but not the "general-purpose group".

G5s implement both, as described in section 2.2.4 of this document: http://www-306.ibm.com/chips/techlib/tec...6FEB09.pdf

Why else would GCC only generate the fsqrt instruction when explicitly told it's generating code for G5 only?!

The documents I referenced have a detailed discussion of the latency of various instructions, but long story short, for a G3 or G4, the minimum latency of a floating-point arithmetic instruction is 3 cycles, and frsqrte executes within that. On the G5, frsqrte has a latency of 6 cycles, and fsqrt has a latency of 40.

That's nearly 30 minutes of my time wasted proving stuff that I already knew.
Member
Posts: 41
Joined: 2006.01
Post: #14
DoG Wrote: Latency for frsqrte is 3 or 4 cycles, according to my sources.

It's hard to find timing data for the PPC, this is the closest I could get from google:
http://www.google.com/search?q=frsqrte&h...rt=10&sa=N

Quote: Optimization and Tuning on POWER4 ...
Single cycle fres and frsqrte. Good for MASS intrinsics ...
http://www.spscicomp.org/ScicomP5/Presentations/ Tutorial/Daresbury.POWER4.Tuning.tut.pdf

This link appears to go to a presentation by IBM.

frsqrte and fres are also marked as 'optional'; they have been present on every implementation since the 601 (they are absent on the 601 itself). The older of my databooks marks fsqrts as optional, and the more recent one, published by IBM Microelectronics © 1994, does not.

I seem to remember using fsqrts, gentlemen.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #15
Make sure you read my post immediately above this one (you might have missed it, since you've only responded to DoG's above that). I'd hate to have spent all that time for nothing.