Random Optimization Questions

Moderator
Posts: 373
Joined: 2006.08
Post: #1
So, I was bored, and wondering about optimizing things... here are some questions that I'm curious about; some are more theoretical, some aren't :)

When the C language was first being created, it had to be written in assembler, right? As far as I know, assembler code is tied to the specific processor it's written for, so how did the good people who made C get it to not be processor-specific?
Why does C enforce concrete variable types like 'int', 'double', and 'char', while assembler does not?
Does C/C++ optimize switch statements more than if/else if/else statements? A friend of mine told me that it does, but I've never heard of that, so I thought that I'd ask... would I get any (however small) optimization out of something like this:
Code:
int i = 20;
switch (i)
{
    case 5:
        cout << "is five";
        break;
    case 20:
        cout << "is twenty";
        break;
    default:
        cout << "is not five or twenty";
        break;
}
over something like this:
Code:
int i = 20;
if (i == 5)
{
    cout << "is five";
}
else if (i == 20)
{
    cout << "is twenty";
}
else
{
    cout << "is not five or twenty";
}
?

In C++, does setting something to 'protected' or 'private' instead of 'public' have any optimization benefit over setting it to 'public', or are those purely for information hiding?

Is there any difference between these two declarations:
Code:
void whatever();
void whatever(void);
?


Ok, so, in C++, I have some code like this:
Code:
bool hasTexture;
bool hasMaterial;
// several other booleans

void draw()
{
    if (hasTexture)
    {
        // bind texture
    }
    if (hasMaterial)
    {
        // attach material
    }
    // several more 'if's that do things before drawing
}

so, when I instantiate my class, I set all of the boolean variables. When I draw my object, my code does something (binds the texture, etc.) based on whether the booleans are true or false. This 'something' never changes. Now, couldn't you (theoretically) have something that, when you instantiated the class and set the boolean variables, went through, took all of the code that would execute for the booleans that were 'true', and compiled it into binary? Then, when you called your drawing code, you would just execute that binary and save yourself the time of evaluating however many boolean if statements you had. Obviously this would be a really small optimization, but it still seems like a cool thought to me ;)
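(I guess you could fake part of that in plain C++, without actually generating binary, by deciding the steps once when the object is constructed and storing them as a list of member function pointers, so draw() just runs the list with no ifs. Something like this; it's only a sketch and the names are made up:)
Code:
#include <cstddef>
#include <vector>

class Model
{
public:
    Model(bool hasTexture, bool hasMaterial)
    {
        // decide once, at construction time, which steps apply
        if (hasTexture)  steps.push_back(&Model::bindTexture);
        if (hasMaterial) steps.push_back(&Model::attachMaterial);
    }

    void draw()
    {
        // drawing just runs the stored steps; no boolean tests here
        for (std::size_t i = 0; i < steps.size(); ++i)
            (this->*steps[i])();
        // ...actual drawing goes here
    }

private:
    void bindTexture()    { /* bind texture */ }
    void attachMaterial() { /* attach material */ }

    std::vector<void (Model::*)()> steps;
};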

Just curious...
Thanks :)
-wyrmmage

Worlds at War (Current Project) - http://www.awkward-games.com/forum/
Moderator
Posts: 3,577
Joined: 2003.06
Post: #2
Sorry, I don't have any specific answers for you here; I gave up worrying about little performance details many years ago in favor of easier-to-write code. I am certain others will chime in with more specifics.

Premature optimization is indeed the root of all evil. Trying not to do really dumb things can help, but whatever. Write it, and if it works great, then it's written correctly ;) If it's a little slow, then profile and see if you can tighten up a few spots.

As for assemblers and optimizations: every assembler has to target a specific processor, and every processor has certain ways it works best in a given situation, so optimizations that hold across the whole processor spectrum are largely impossible. C is not assembly language; it's a higher-level language that sits on top of the assembler/machine-code level, and that's how the same source can work across platforms.

C enforces data types because it can. It's one of the features of the language that helps you write better, less buggy code. Assemblers vary in their features, but they usually assume you know what you mean, and don't try to stop you from shooting yourself in the foot if you so desire.

As I understand it, switch statements are often faster than if-else, but I don't usually care unless it makes the code prettier ;)
Moderator
Posts: 373
Joined: 2006.08
Post: #3
hmmm... thanks for the notes :)
I'm not actually trying to implement any of this stuff at the moment, I was just bored and thought I'd ask. I know that assembler code is tied to the specific processor; I'm just wondering how the people who made C managed to avoid that.
Thanks for the input :)
-wyrmmage

Worlds at War (Current Project) - http://www.awkward-games.com/forum/
Moderator
Posts: 3,577
Joined: 2003.06
Post: #4
wyrmmage Wrote:I know that assembler code is tied to the specific processor; I'm just wondering how the people who made C managed to avoid that.
They didn't. You can't. Not across different processor architectures, anyway. Oftentimes optimizations for one processor in a *family* of processors, like the POWER architecture or the x86 architecture, can affect performance across the family, but not across different architectures.
Moderator
Posts: 373
Joined: 2006.08
Post: #5
But... a program written in C that is compiled for an Intel processor seems to run at the same speed on an AMD processor (well, not exactly, since different processors run at different clock speeds, etc., but what I'm saying is that C code doesn't seem to be faster or slower on different processors in the way that assembler is; why?)
-wyrmmage

Worlds at War (Current Project) - http://www.awkward-games.com/forum/
Moderator
Posts: 3,577
Joined: 2003.06
Post: #6
Oh, I see what you're asking!

The C compiler can take slight differences between the Intel processor and the AMD processor into consideration and generate [an instruction stream] which optimizes for those conditions. Writing assembler by hand to work on both of them wouldn't do that unless you specifically made the adjustments yourself. Nowadays though, you'd be pretty hard-pressed to out-geek the compilers!

[edit] Compilers always try to generate optimal instructions for the processor they are targeting. Some compilers are better at it for any given processor than others, but they are almost always better at it than anybody hand-writing assembly nowadays. [/edit]
Moderator
Posts: 373
Joined: 2006.08
Post: #7
OK, thanks, that answers a lot of my questions :)
So... does anyone know the answer to any of my other questions?
Thanks again :)
-wyrmmage

Worlds at War (Current Project) - http://www.awkward-games.com/forum/
Moderator
Posts: 1,140
Joined: 2005.07
Post: #8
Private/protected/public only control visibility; they have no effect on performance.

Having void inside the parentheses is the same as having nothing at all.

Your last idea isn't really possible without having separate functions, or separate versions of the class, to handle each case without the if statements.
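For example, in C++ you could use templates to get the compiler to generate a separate version of the class for each combination of flags, with the ifs resolved at compile time. The catch is that the flags then have to be known at compile time rather than set when you instantiate the object. A rough sketch (the names are made up):
Code:
template <bool HasTexture, bool HasMaterial>
class Drawable
{
public:
    void draw()
    {
        // with constant template parameters, the optimizer can throw away
        // whichever branches are false for this particular version
        if (HasTexture)
        {
            // bind texture
        }
        if (HasMaterial)
        {
            // attach material
        }
        // draw geometry
    }
};

// usage: one concrete version per flag combination
// Drawable<true, false> texturedOnly;
// texturedOnly.draw();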
Luminary
Posts: 5,143
Joined: 2002.04
Post: #9
akb825 Wrote:Having void inside the parentheses is the same as having nothing at all.

In C++ that's true; in C it's not (in C, empty parentheses mean the parameters are unspecified, while (void) explicitly means no parameters).

Quote:Your last idea isn't really possible without having separate functions, or separate versions of the class, to handle each case without the if statements.

Or doing something like just-in-time compilation.

Switch statements can occasionally be optimized into a computed jump; however, modern processors don't like computed jumps, so I don't think I've ever seen the compiler do anything other than a bunch of if/else comparisons.
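For the curious, here's roughly what a computed jump means, written out by hand with function pointers. This is only an illustration; whether a given switch actually becomes a jump table is entirely up to the compiler:
Code:
#include <cstdio>

static void caseZero(void)  { std::printf("zero\n"); }
static void caseOne(void)   { std::printf("one\n"); }
static void caseTwo(void)   { std::printf("two\n"); }
static void caseOther(void) { std::printf("something else\n"); }

void describe(unsigned int i)
{
    // dense case values (0, 1, 2, ...) are the kind that can become a
    // jump table; sparse ones like 5 and 20 usually stay as compares
    static void (*const table[])(void) = { caseZero, caseOne, caseTwo };
    if (i < 3)
        table[i]();     // one bounds check plus one indirect (computed) jump
    else
        caseOther();    // the default case
}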

Basically, none of what you've asked about is at all relevant to optimization :P

The first thing you should optimize for is algorithms and data structures -- those usually make the biggest difference.

Once you've got the best algorithms and data structures you can think of, function inlining is usually the biggest win. You can often get away with going only this far.

I'd put vectorization at the next kind of level, since it's a lot of work for a fairly unknown performance benefit, but I can see arguments for it being either higher (since it's often an algorithmic modification) or lower (since it's harder than many of the things below).

Then you get into micro-optimizations like
* avoiding expensive math operations (sqrt, /, etc); see the sketch after this list
* shadowing globals with locals
* unrolling loops and interleaving iterations
* avoiding branches
* issuing cache prefetch instructions
* etc
There's quite a lot of speed you can gain at this level if you know what you're doing, but it's usually not at all worthwhile to bother trying :)
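To give one concrete (and classic) instance of the expensive-math bullet above: if you only need to compare a distance against a threshold, compare the squared values and skip the sqrt entirely. Just a sketch:
Code:
#include <cmath>

bool closerThan(float dx, float dy, float dz, float radius)
{
    // naive version: return std::sqrt(dx*dx + dy*dy + dz*dz) < radius;
    // same result without the sqrt call (assuming radius is non-negative):
    return dx*dx + dy*dy + dz*dz < radius*radius;
}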
Moderator
Posts: 3,577
Joined: 2003.06
Post: #10
OneSadCookie Wrote:The first thing you should optimize for is algorithms and data structures -- those usually make the biggest difference.
As an example: I recently changed a duplicate vertex removal routine to use a hash lookup (just an NSMutableDictionary) instead of a brute-force linear search through an array in straight C. Even with the overhead of Obj-C messaging to use the NSDictionary, the gain was huge! For a relatively large number of vertices, on the order of 75,000 triangles, it went from taking around seven minutes to taking around seven seconds. I could have optimized the linear search until the computer bled bits and it wouldn't have come remotely close to simply changing the approach. I am certain I could shave another five or six seconds off of that by moving the whole thing to straight C, using a nice balanced tree. But seven seconds to load a 75k poly model [and removing duplicate vertices [and tessellation!] at the same time] seems more than adequate to me, considering that's about as much as I'd ever want to push per frame in one scene on modern hardware anyway.
Moderator
Posts: 608
Joined: 2002.04
Post: #11
AnotherJake Wrote:As an example: I recently changed a duplicate vertex removal routine to use a hash lookup (just an NSMutableDictionary) instead of a brute-force linear search through an array in straight C. [...] For a relatively large number of vertices, on the order of 75,000 triangles, it went from taking around seven minutes to taking around seven seconds. [...]
I may show my own ignorance here but... I imagine an NSMutableDictionary is O(1) or O(n) depending on how much it sucks, right? So I'm curious what your linear search was. If it was really O(n)... them's some big constant factors.
Moderator
Posts: 3,577
Joined: 2003.06
Post: #12
Ignorance? Ha! I fart in your general direction! I know NOTHING about big-O notation because I suck at math that bad. As far as I know, it went from O(alot) to O(notsomuch)...

And maybe even 'linear search' isn't the proper terminology here (I'm just a dumb hobbyist with no CS training), but it went something like this:

- search for existing vertex in final vertex array by starting at index 0 and iterating through the whole thing until you find one matching the vertex, normal and texCoord
- if you find a match then stop searching and add that index to the index array
- otherwise, after you've searched the whole friggen array and didn't find a match, add the new unique vertex to the vertex array and add the new index to the index array

[for the fast search] I generate the key for the NSMutableDictionary by simply formatting an NSString with %0.3f for each value of the vertex (v3, n3, t2). I tried it with %0.6f and it worked as expected: didn't weld as many vertices because the tolerance was lower. So far I haven't seen the key hash fail to make a unique value for each vertex like this, but I haven't tested it thoroughly (maybe only a half-dozen test models). Sure would be nice if it stays solid!

I was originally planning to go the tree route but figured dictionaries are so easy to work with that it was at least worth a try. So far so good!

[edit] this is the actual [fast] search code:
Code:
- (void)addVertexDataToCurrRenderingGroup:(VertexData *)vertexData
{
    NSString    *key;
    NSData      *data;
    NSNumber    *index;

    totalVertsBeforeDupeRemoval++;

    // build the dictionary key from the vertex, normal, and texCoord values
    key = [NSString stringWithFormat:@"%0.3f%0.3f%0.3f%0.3f%0.3f%0.3f%0.3f%0.3f",
                vertexData->vertex[0], vertexData->vertex[1], vertexData->vertex[2],
                vertexData->normal[0], vertexData->normal[1], vertexData->normal[2],
                vertexData->texCoord[0], vertexData->texCoord[1]];
    index = [vertexDataDictionary objectForKey:key];
    if (index == nil)
    {
        // first time we've seen this vertex: add it and remember its index
        data = [NSData dataWithBytesNoCopy:vertexData length:sizeof(VertexData) freeWhenDone:NO];
        [vertexDataArray addObject:data];
        index = [NSNumber numberWithUnsignedInt:[vertexDataArray count] - 1];
        [vertexDataDictionary setObject:index forKey:key];
        [vertexDataIndexArray addObject:index];
    }
    else
    {
        // duplicate vertex: just reuse the existing index
        [vertexDataIndexArray addObject:index];
    }
}
Moderator
Posts: 1,140
Joined: 2005.07
Post: #13
For vertex removal, I believe it's usually O(n^2), since for every vertex you need to look at every other one for duplicates. For that I've used a std::map in C++, which is O(lg(n)) per lookup since it's using a tree, which improves the overall algorithm to O(n*lg(n)). When doing this, I would get improvements from several minutes to a few seconds with large meshes having hundreds of thousands of vertices. I suppose I also could have used a hash_map (which should be more or less equivalent to NSDictionary), but I would have had to come up with a hashing function (or could have just created a string), and it would also have the cost of rehashing constantly, which I figured would hurt it in the end. (Though it would be interesting to actually test the two.)
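For what it's worth, the std::map version looks roughly like this; the types and names are made up for the sketch, and the map goes from vertex data to its index in the output array, so each lookup is O(lg(n)) instead of a linear scan:
Code:
#include <map>
#include <vector>

struct Vertex
{
    float v[3], n[3], t[2];
};

// std::map needs a strict ordering on the key
struct VertexLess
{
    bool operator()(const Vertex &a, const Vertex &b) const
    {
        for (int i = 0; i < 3; ++i) { if (a.v[i] != b.v[i]) return a.v[i] < b.v[i]; }
        for (int i = 0; i < 3; ++i) { if (a.n[i] != b.n[i]) return a.n[i] < b.n[i]; }
        for (int i = 0; i < 2; ++i) { if (a.t[i] != b.t[i]) return a.t[i] < b.t[i]; }
        return false;
    }
};

void addVertex(const Vertex &vert,
               std::vector<Vertex> &uniqueVerts,
               std::vector<unsigned> &indices,
               std::map<Vertex, unsigned, VertexLess> &seen)
{
    std::map<Vertex, unsigned, VertexLess>::iterator it = seen.find(vert);
    if (it == seen.end())
    {
        // first time this vertex has been seen: append it and record its index
        unsigned newIndex = (unsigned)uniqueVerts.size();
        uniqueVerts.push_back(vert);
        seen[vert] = newIndex;
        indices.push_back(newIndex);
    }
    else
    {
        // duplicate: reuse the index of the vertex we already stored
        indices.push_back(it->second);
    }
}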

There are often several choices like this that you need to consider when choosing the data structure at the heart of your algorithm. If you have a tight loop and/or recursion with just some math and then another call, you may need to worry about things like expensive math, branches, etc. If instead a data structure ends up dominating your algorithm, those details aren't nearly as important.

Just as an FYI: using GCC, if you choose optimization level 3, it ends up doing a lot of optimizations for you. Things like unrolling loops, rearranging instructions, inlining, etc. (In fact, I had one case where manually unrolling a loop made something noticeably slower with that level of optimization enabled.)
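For reference, this is the kind of transformation I mean by unrolling: the plain loop versus a 4x hand-unrolled version. With -O3 (or -funroll-loops) GCC will often do this itself, which is why doing it by hand can backfire. Purely an illustration:
Code:
float sumPlain(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

float sumUnrolled(const float *a, int n)
{
    float sum = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4)            // four elements per iteration
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)                   // leftover elements
        sum += a[i];
    return sum;
}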
Moderator
Posts: 3,577
Joined: 2003.06
Post: #14
akb825 Wrote:For vertex removal, I believe it's usually O(n^2), since for every vertex you need to look at every other for duplicates.
That sounds about right in my situation -- still don't know what big-O notation means, but little models seemed to load instantaneously and it just kept getting slower and slowwer and sssllloooowwwwwerr the bigger the models got in polycount, using the `linear search' I was talking about.

Again, it just shows that higher-level optimizations in algorithms and data structures can save truckloads of time. Screw cycles, we're talking minutes in some cases!
Moderator
Posts: 1,140
Joined: 2005.07
Post: #15
Big-O notation is a rough estimate of the number of operations you need to perform based on how many items you have. In this case, the items are vertices, so n is the number of vertices. Walking through an array is O(n), since you must visit every element of the array. In this case, you walk through the array, and then for each item you walk through it again (to find the duplicates). So you have O(n) for the first level, and for each of those n steps you also have O(n) for the second level. Together they are O(n*n), or O(n^2).

In my case with a tree, a lookup takes O(lg(n)), where lg(n) is log base 2 of n. This is because at each level of the tree you can cut out half of the remaining subtree by choosing to go left or right. You still have to visit each vertex once when you're searching for duplicates of that vertex, so at the top level you still have O(n). However, instead of a linear search for each of those n steps, you have the tree lookup, so you have O(lg(n)) on each step. This means you have O(n*lg(n)) total.
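To put rough numbers on that: if n is around 75,000 (about the size of the model mentioned earlier), then n^2 is roughly 5.6 billion steps, while n*lg(n) is roughly 75,000 * 17, or about 1.3 million. That's a few thousand times less work, which is why the load time drops from minutes to seconds.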

Generally, if you do a series of actions, and then afterwards you do another series of actions, you add the two time costs together (though for the estimate you only take the bigger of the two, since that's the one that dominates your runtime). If you do an action within another action, such as nested loops, you multiply the time costs. As you may have noticed, this isn't an exact representation of the runtime, since constants are ignored and so are smaller terms that are added: it's meant to be an estimate of how the time grows when n grows large.

I hope that clears some things up.