Share your optimization tips!

Member
Posts: 749
Joined: 2003.01
Post: #1
Reading another thread I discovered that changing the optimization level from -O to -O3 in my target options actually improved the performance of my game significantly.

I was wondering what other flags you gurus use: I read there's a -fast flag, but I didn't get whether it also works on Intel Macs, and that there's a -ffast-math one, but it may be unsafe, etc.

Also, are there other things I should check or uncheck before releasing? Strip debug symbols during copy? Generate position-dependent code?

©h€ck øut µy stuƒƒ åt ragdollsoft.com
New game in development Rubber Ninjas - Mac Games Downloads
Luminary
Posts: 5,143
Joined: 2002.04
Post: #2
First of all, Shark, and fix what it tells you to!

Inline small functions.

On PowerPC in particular, make sure you're not constantly loading from memory (e.g. avoid multiple accesses to a global within a function -- shadow it with a local; avoid x->y->z; x->y->w; -- cache x->y in a local l and use l->z; l->w;).

Shark and fix what it tells you!

-mdynamic-no-pic (should be default in Xcode)
-O2 (should be default for a release build in Xcode; you can consider -O3 if testing shows it really is faster. Xcode supports per-file flags so you could conceivably compile some files with -O3 and most with -O2)
-ffast-math

Shark and fix what it tells you!

Shark and fix what it tells you!!
Sage
Posts: 1,482
Joined: 2002.09
Post: #3
Though, as you said in the other thread OSC, when using -ffast-math be aware of what it does.
Check the man page if you are unsure.

I use it in Chipmunk as it is able to produce very notable speedups without breaking any functionality on PPC or x86. According to the man page, it does not produce code to handle infinite values, among other things that Chipmunk relies on. I assume that many of those operations are handled by the hardware without special code being needed, but this is not true of all processors. Someone compiling Chipmunk on the ARM instruction set was having issues with collisions being ignored, and it was because they were using -ffast-math. (The same is not true of the iPhone's ARM CPU, however.)

Scott Lembcke - Howling Moon Software
Author of Chipmunk Physics - A fast and simple rigid body physics library in C.
Moderator
Posts: 1,140
Joined: 2005.07
Post: #4
If you're using -O3, it should inline functions automatically for you if they are small and visible from where you need them. It is also pretty good at cutting out dead code, loop unrolling, pre-fetching, etc. Other kinds of optimizations depend on what language you're using. For example, if you're using C++, try to use as many objects on the stack as possible rather than newing everything, since allocations and deallocations on the heap are more expensive. If you're using Objective-C, you will have fewer options for both allocating on the stack vs. the heap and inlining functions. You can, however, provide inlined C functions and use C structs allocated on the stack. As mentioned, though, Shark will let you know where your problem areas are, so you don't spend a lot of time optimizing something that will speed up your application by 0.0001%.
Member
Posts: 53
Joined: 2007.08
Post: #5
Bookmarked Wink

Very educational tips. Thanks guys.
Member
Posts: 254
Joined: 2005.10
Post: #6
Maybe someone should sticky this thread?
Sage
Posts: 1,234
Joined: 2002.10
Post: #7
GIYF.
Oldtimer
Posts: 834
Joined: 2002.09
Post: #8
As for Shark, I have a war story I always share to teach the "you don't know where the bottleneck is" lesson.

I was asked to finish a port of a physics-heavy FPS engine to the Mac. The demo includes shooting a branch off of a tree and watching it tumble down past the trunk, so you can imagine the number-crunching going on.

The Mac version ran at 10-15 FPS, where the PC version trundled along happily at a solid 60 FPS. So I started looking through collision code, physics, matrix inversion... Then I hit Shark.

Shark told me that 35% of the CPU time was spent in NSString. Clearly wrong? No. It turns out the engine loaded assets by name, so whenever a texture was referenced, its name was sent down to the asset loader, where it was looked up and a handle to a texture or the like was returned. However, the Mac version did something peculiar to the string.

It converted it from a regular ASCII string to a UTF-8 string, then allocated an NSString with that string, created a copy of that NSString, and converted it to a system-native representation. It extracted that into a C++ std::wstring, and then munged that back into a regular std::string, which was finally looked up. No [release] messages sent anywhere.

Not only was this done on load, but on every darn asset access, every frame. This little string circus was performed a couple of hundred times per frame, wrecking the CPU cache and filling the RAM...

A quick three-line rewrite put it back up there with the PC version.

Lesson: Shark it, then Shark it.
Member
Posts: 749
Joined: 2003.01
Post: #9
Wow funny story ivan Wink

I was actually using Shark before changing the optimization level (still -O), and it was telling me that the STL vector's [] operator (to index elements) was taking 10% of the time! Changing to -O3 took that down to less than 1%.

©h€ck øut µy stuƒƒ åt ragdollsoft.com
New game in development Rubber Ninjas - Mac Games Downloads
Member
Posts: 59
Joined: 2007.12
Post: #10
I think most of the STL stuff gets inlined with -O2, so that's what I normally use. I found that -O3 has little benefit in most cases, and it actually slows my programs down most of the time.
Member
Posts: 45
Joined: 2006.07
Post: #11
Shark, shark shark!

I've been playing around recently with signed distance fields for collision detection, and sampling their gradients through finite differencing. Shark told me that I was spending lots of time numerically computing the gradient.

I instead just computed the gradient of the entire SDF once and cached the result -- it resulted in something like a 2x speedup for collision handling in my code.
Sage
Posts: 1,199
Joined: 2004.10
Post: #12
Najdorf Wrote:Wow funny story ivan Wink

I was actually using Shark before changing the optimization level (still -O), and it was telling me that the STL vector's [] operator (to index elements) was taking 10% of the time! Changing to -O3 took that down to less than 1%.

I hit that once, a while back with shadow volume extrusion -- even with O3 I still had a heavy hit ( would never have known without Shark! ). So what I did was something like so:

Code:
// _vertices is std::vector< vec3 >
vec3 *vertices = &(_vertices.front());

for ( size_t i = 0; i < _vertices.size(); ++i )
{
    // act on vertices[i] directly, bypassing operator[]
}

This was a long time ago, 10.3 days, but the hit on std::vector::operator[] went to zero and the direct access of the local array was practically unmeasurable. Performance went up from 10fps to 60.
Member
Posts: 749
Joined: 2003.01
Post: #13
TomorrowPlusX Wrote:I hit that once, a while back with shadow volume extrusion -- even with O3 I still had a heavy hit ( would never have known without Shark! ). So what I did was something like so:

Code:
// _vertices is std::vector< vec3 >
vec3 *vertices = &(_vertices.front());

for ( size_t i = 0; i < _vertices.size(); ++i )
{
    // act on vertices[i] directly, bypassing operator[]
}

This was a long time ago, 10.3 days, but the hit on std::vector::operator[] went to zero and the direct access of the local array was practically unmeasurable. Performance went up from 10fps to 60.

Hm, good trick.

Actually, using -O1, -O2 or -O3 fixes the problem for me, but I'm probably not using [] as much as you were.

©h€ck øut µy stuƒƒ åt ragdollsoft.com
New game in development Rubber Ninjas - Mac Games Downloads
Sage
Posts: 1,199
Joined: 2004.10
Post: #14
Najdorf Wrote:Hm, good trick.

Actually, using -O1, -O2 or -O3 fixes the problem for me, but I'm probably not using [] as much as you were.

GCC's a lot better now than it was back then.

That being said, I don't use shadow volume extrusion any more, and the places where I do act heavily on std::vector::operator[] are not at run time, so optimization there is less of an issue for me now.