optimizing dynamic geometry

Sage
Posts: 1,232
Joined: 2002.10
Post: #1
Ok, the time has come to speed things up. Looking for tips, affirmations that I'm on the right path.

My app generates a small pile (~45k) of vertices dynamically each frame, which need to be drawn in various modes (lines, polys, with/without texturing, etc.)

Currently this is just a state machine with all the vertices and other data passed in immediate mode. It's not too slow (blazing fast compared to the same results in Quartz, actually) but, faster is better.

My target hardware is my 15" TiBook. My app requires rectangle textures, so minimum of Radeon/GF2. No vertex/pixel shaders.

So, my initial thought is to save the computed vertices into a temp array, and after computation is completed for each frame, use glDrawElements to draw the whole shmear in one go. And probably GL_VERTEX_ARRAY_RANGE_APPLE to eliminate the extra copy.

This seems like it will work well for the basic case of a list of triangles, but:

1) my current state machine reuses a computed vertex in more than one way. For instance, drawing 3 points as a LINE_STRIP, and then the same 3 points plus 6 offset points and texture coordinates as TRIANGLES for texture mapped points (particle system).

If I stuff all the extra vertex data into the same array, is there any penalty for using funky stride sizes with glDrawElements?

What about interleaved arrays? I need color, texture, and vertex data.

2) I also need glLineWidth for line drawing. I don't see any way to specify that info as an array? This kinda blows my whole approach away if it can't be done. Unless it's still a win to drawElements in groups of three vertices? Doesn't seem likely...

3) Is there any more efficient way to draw sized points? Given that antialiased points are totally broken on ATI cards, I mean. I currently draw a textured circle in the corner fourth of one triangle, so I'm passing 12 floats for a 2D point, when it ought to be 3 (x, y, size.)

I see there's an NV extension for textured points with attenuation, that doesn't do me any good...


thanks.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #2
You're going to have to profile a few solutions to find out what works best...

Use APPLE_var and APPLE_vao to keep the vertices on the card.

Interleaved arrays are generally marginally slower than non-interleaved, but probably not so's you'll notice much.

Don't have a funky stride in the vertex array, use the indices to pick out the vertices you want.
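A minimal sketch of the index approach, assuming the case from the first post: each computed point is stored once in a tightly packed array, followed by its two offset vertices. The layout and the names `fill_indices`, `line_idx`, and `tri_idx` are hypothetical, and the GL calls appear only in a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: each computed point i is stored at slot 3*i,
   followed by its two offset vertices at 3*i+1 and 3*i+2. */
enum { VERTS_PER_POINT = 3 };

/* Fill the two index lists for n_points points.
   line_idx needs n_points entries, tri_idx needs 3*n_points entries. */
static void fill_indices(size_t n_points,
                         unsigned short *line_idx,
                         unsigned short *tri_idx)
{
    assert(n_points * VERTS_PER_POINT <= 65536); /* must fit 16-bit indices */
    for (size_t i = 0; i < n_points; i++) {
        unsigned short base = (unsigned short)(i * VERTS_PER_POINT);
        line_idx[i] = base;              /* line strip visits only the points */
        tri_idx[3 * i + 0] = base;       /* triangle uses point + 2 offsets   */
        tri_idx[3 * i + 1] = (unsigned short)(base + 1);
        tri_idx[3 * i + 2] = (unsigned short)(base + 2);
    }
}

/* With a GL context, the same packed vertex array serves both draws:
     glDrawElements(GL_LINE_STRIP, n_points,     GL_UNSIGNED_SHORT, line_idx);
     glDrawElements(GL_TRIANGLES,  3 * n_points, GL_UNSIGNED_SHORT, tri_idx); */
```

The vertex array keeps a stride of 0 in both cases; only the index list decides which vertices each primitive touches.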

Use the same calls as Quake III ;)

Pass colors as 4ub, vertices as 4f and texture coordinates as 2f. If you're using interleaved arrays, make sure that the stride is cache-friendly.
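As a rough illustration of a cache-friendly stride: 4f + 2f + 4ub comes to 28 bytes, so one option is to pad each vertex to 32. The struct below is a hypothetical layout, and the 32-byte cache line is an assumption about the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: 16 + 8 + 4 = 28 bytes of data, padded
   to 32 so that no vertex straddles a 32-byte cache line when the array
   itself is 32-byte aligned. */
typedef struct {
    float         pos[4];   /* for glVertexPointer(4, GL_FLOAT, ...)        */
    float         tex[2];   /* for glTexCoordPointer(2, GL_FLOAT, ...)      */
    unsigned char color[4]; /* for glColorPointer(4, GL_UNSIGNED_BYTE, ...) */
    unsigned char pad[4];   /* unused, rounds the stride up to 32 bytes     */
} Vertex;

/* Byte stride to pass to each gl*Pointer call. */
static size_t vertex_stride(void)
{
    assert(sizeof(Vertex) % 32 == 0);
    return sizeof(Vertex);
}

/* With a GL context and an array `Vertex *v`:
     glVertexPointer(4, GL_FLOAT, 32, (char *)v + offsetof(Vertex, pos));
     glTexCoordPointer(2, GL_FLOAT, 32, (char *)v + offsetof(Vertex, tex));
     glColorPointer(4, GL_UNSIGNED_BYTE, 32, (char *)v + offsetof(Vertex, color)); */
```

The padding costs 4 bytes per vertex, so whether it pays off is something to profile.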

AFAIK, there's no way to use glLineWidth in an array. My advice is to find another way to achieve the same effect ;) Calling glDrawElements with 3 vertices is probably more efficient than doing the same calls in immediate mode.

If you want to stay with immediate mode, you can use aglMacro / CGLMacro to about double the performance of calls like glVertex3f. Reply if you need more info.
Sage
Posts: 1,232
Joined: 2002.10
Post: #3
Quote:Originally posted by OneSadCookie
You're going to have to profile a few solutions to find out what works best...

Naturally. Just looking for a heads-up on what to avoid. :)

Quote:Use APPLE_var and APPLE_vao to keep the vertices on the card.

OK, question. I'm not sure I've done the var flush correctly-- if I do:

Code:
glTexCoordPointer(2, GL_FLOAT, 0, glt);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(2, GL_FLOAT, 0, glv);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexArrayRangeAPPLE(sizeof(glv), glv);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_APPLE);
glFlushVertexArrayRangeAPPLE(sizeof(glv), glv);
glDrawArrays(GL_QUADS, 0, drawn*4);

the vertex data ought to be shared in AGP and not getting copied, but what about the texture data? Does APPLE_var apply to each type of array? The extension spec is totally unhelpful.

Quote:Interleaved arrays are generally marginally slower than non-interleaved, but probably not so's you'll notice much.

Don't have a funky stride in the vertex array, use the indices to pick out the vertices you want.

Noted. Using non-interleaved arrays for now.

Quote:AFAIK, there's no way to use glLineWidth in an array. My advice is to find another way to achieve the same effect ;) Calling glDrawElements with 3 vertices is probably more efficient than doing the same calls in immediate mode.

Arrgh. I'll just call glDrawElements a whole lot then. It's not worth creating antialiased quads to simulate lines.

Off topic: does anyone else feel that the GL API is full of huge, gaping holes?

Back on topic: So, after using glDrawArrays and APPLE_var, there's practically no speedup. (Yes, I'm positive I'm GPU-bound.) This is on a Radeon 7500 Mobility which ought to do hardware TCL. In fact, it looks like I'm fill rate limited, which seems silly. Drawing around 13k 64x64 blended quads (that's only 54 megapixels), with a lot of overlap, drops to 5 fps. GL profiler says 92% of my time is in glDrawArrays. Changing the quad size has immediately noticeable effects.
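For reference, the arithmetic behind that figure can be sketched as below, assuming exactly 13,000 quads and no clipping. The helper names are hypothetical.

```c
#include <assert.h>

/* Pixels touched per frame by n_quads screen-aligned quads of side x side
   pixels, ignoring clipping. Overlap does not reduce the cost: with
   blending on, every covered pixel is processed once per quad. */
static unsigned long pixels_per_frame(unsigned long n_quads, unsigned long side)
{
    assert(side > 0);
    return n_quads * side * side;
}

/* Fill rate in pixels per second at a given frame rate. */
static unsigned long pixels_per_second(unsigned long per_frame, unsigned long fps)
{
    return per_frame * fps;
}
```

With 13,000 quads of 64x64 this gives 53,248,000 pixels per frame, or about 266 million blended pixels per second at 5 fps. Halving the quad side cuts the per-frame count to a quarter, which fits the observation that quad size has an immediate effect.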

This is for a particle system, so, again, is there any faster way to draw sized points? I'm already doing alpha testing with glAlphaFunc(GL_GREATER, 0.0), which did speed it up slightly.
Luminary
Posts: 5,143
Joined: 2002.04
Post: #4
Sounds like you're fill limited, which obviously transform and CPU optimizations won't help with.

Try increasing the threshold on the alpha test. Find the highest number that doesn't look crap ;)

Look at texture modes -- are you trilinear filtering? do you need to?

Are you multitexturing? Do you need to?

Are you using OpenGL lighting? Do you need to?

Can you run at a lower screen resolution? A lower bit depth?

Are you using destination alpha? Do you need to?

Do you need to use antialiased lines? These are pretty slow on consumer hardware...

Just a few questions. Ultimately, there may be nothing you can do but buy a new graphics card :P
Sage
Posts: 1,232
Joined: 2002.10
Post: #5
lessee:

0.0, nope, nope, nope, not yet, nope, nope, nope, nope, nope, nope, yes, and they're much faster than Quartz.

And... "new graphics card" means "new laptop" for me. Which might not be a bad thing, if there's a 970/Radeon 9600 model coming out anytime soon...

Anyway. Thanks for the tips. If anyone has a faster way to texture particles, please chime in.
Sage
Posts: 1,232
Joined: 2002.10
Post: #6
Followup on a possible reason why I'm not seeing any speedup with APPLE_var:

This message from an ATI engineer:

http://lists.apple.com/archives/mac-open...mancew.txt

says that GL_POLYGON is not accelerated through var on chips earlier than the Radeon 8500.

Well, that's the primitive I need, so it's just going to be slow on my TiBook. :(
Luminary
Posts: 5,143
Joined: 2002.04
Post: #7
You can usually replace GL_POLYGON with GL_TRIANGLE_FAN for solid geometry and GL_LINE_LOOP for lines, both of which probably are accelerated. Might be worth a shot.
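A convex GL_POLYGON can be drawn as a GL_TRIANGLE_FAN from the same vertices in the same order. Going one step further (an assumption, not something stated in this thread), the fan can be expanded into a plain triangle list so that many polygons share one glDrawElements call. The helper `polygon_to_triangles` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Expand one convex polygon, whose n vertices start at index `first` in a
   shared vertex array, into a triangle list in fan order.
   Writes 3*(n-2) indices to out and returns that count. */
static size_t polygon_to_triangles(unsigned short first, size_t n,
                                   unsigned short *out)
{
    assert(n >= 3);
    size_t k = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        out[k++] = first;                           /* fan centre */
        out[k++] = (unsigned short)(first + i);
        out[k++] = (unsigned short)(first + i + 1);
    }
    return k;
}

/* With a GL context, after appending every polygon's indices to one list:
     glDrawElements(GL_TRIANGLES, total_indices, GL_UNSIGNED_SHORT, indices); */
```

This only holds for convex polygons, which is also all that GL_POLYGON guarantees to render correctly.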
Sage
Posts: 1,232
Joined: 2002.10
Post: #8
Revisiting this issue:

let's say hypothetically you want to optimize this:
Code:
glBegin(GL_LINES);
    for (i=0; i<10000; i++){
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
    }
    glEnd();

In other words, a bunch of geometry where the coordinates and colors (and etc) all change dynamically every frame.

So, in this case you could calculate the colors/vertices and store them in a big vertex array, and then submit the array to GL all at once, avoiding the function call overhead and possibly getting some AGP-happiness if the hardware can accelerate VAR.
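A minimal sketch of that idea, with the per-frame data collected on the CPU and the submission shown only as a comment. `LineBatch` and `batch_vertex` are hypothetical names, and the fixed capacity matches the 10,000 lines in the loop above.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_LINES = 10000 };

/* Hypothetical per-frame batch: two vertices per line, filled on the CPU. */
typedef struct {
    float  xy[MAX_LINES * 2][2];    /* 2D positions      */
    float  rgba[MAX_LINES * 2][4];  /* per-vertex colors */
    size_t n_verts;
} LineBatch;

/* Append one vertex; returns 0 when the batch is full. */
static int batch_vertex(LineBatch *b, float x, float y,
                        float r, float g, float bl, float a)
{
    assert(b != NULL);
    if (b->n_verts >= MAX_LINES * 2) return 0;
    float *p = b->xy[b->n_verts];
    float *c = b->rgba[b->n_verts];
    p[0] = x; p[1] = y;
    c[0] = r; c[1] = g; c[2] = bl; c[3] = a;
    b->n_verts++;
    return 1;
}

/* Once per frame, with a GL context:
     glEnableClientState(GL_VERTEX_ARRAY);
     glEnableClientState(GL_COLOR_ARRAY);
     glVertexPointer(2, GL_FLOAT, 0, b->xy);
     glColorPointer(4, GL_FLOAT, 0, b->rgba);
     glDrawArrays(GL_LINES, 0, (GLsizei)b->n_verts);
   then reset b->n_verts to 0 for the next frame. */
```

The batch is about 480 KB, so it belongs in static or heap storage rather than on the stack.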

BUT

now suppose you need to do this instead:

Code:
for (i=0; i<10000; i++){
        glLineWidth(frand(10));
        glColorMask(0,0,0,1);
        glBegin(GL_LINES);
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
        glEnd();
        glColorMask(1,1,1,0);
        glBegin(GL_LINES);
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
            glColor4f(frand(1), frand(1), frand(1), frand(1));
            glVertex2f(frand(100), frand(100));
        glEnd();
   }

In other words, there is now a state change in the middle of drawing each line. (Actually, a lot of state change-- I need to create a mask in the destination alpha buffer and then draw over it in the color buffer.)

Is there any way to accelerate this sort of sequence of GL submissions?
Moderator
Posts: 916
Joined: 2002.10
Post: #9
just out of curiosity, how fast is frand versus, say, ranrot scaled?
Luminary
Posts: 5,143
Joined: 2002.04
Post: #10
frand is a fictional function AFAIK... For performance comparisons of real PRNGs, see the page in the FAQ (once the FAQ gets FIQsed)

There is nothing you can do. If you have to change state, you have to change state, and that requires you to make more than one drawing call. It's going to perform miserably, and you should probably investigate other techniques for achieving the same effect :D

(for glLineWidth, you might draw multiple thin lines to emulate a thicker line; the glColorMask in your code seems to me to be a red herring since you could do two passes, one with each of the two masks)
Sage
Posts: 1,232
Joined: 2002.10
Post: #11
Darn it.

Ignore the glLineWidth. I actually have to calculate the projected vertices myself.

But I don't think there is any getting around the state change, I have to create a mask in the destination alpha in order to draw properly antialiased lines.

skyhawk-- yes frand() is a macro:
lib_frand[++lib_fseed%16384]*(X)

Basically just the cost of a table lookup.
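Spelled out, a table-lookup macro along those lines could look like the sketch below. Only the macro body comes from the line above; the table declaration, `frand_init`, and the use of `rand()` to fill the table are assumptions.

```c
#include <assert.h>
#include <stdlib.h>

#define FRAND_TABLE_SIZE 16384

static float    lib_frand[FRAND_TABLE_SIZE]; /* precomputed values in [0, 1] */
static unsigned lib_fseed;                   /* running table position       */

/* Table lookup scaled by X, as in the macro quoted above. */
#define frand(X) (lib_frand[++lib_fseed % FRAND_TABLE_SIZE] * (X))

/* Fill the table once at startup. rand() is an assumption here; the thread
   does not say how the real table is generated. */
static void frand_init(unsigned seed)
{
    srand(seed);
    for (int i = 0; i < FRAND_TABLE_SIZE; i++)
        lib_frand[i] = (float)rand() / (float)RAND_MAX;
    lib_fseed = 0;
    assert(lib_frand[0] >= 0.0f && lib_frand[0] <= 1.0f);
}
```

The sequence repeats every 16384 lookups, which is fine for visual jitter but not for anything that needs statistical quality.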
Luminary
Posts: 5,143
Joined: 2002.04
Post: #12
If the linewidth isn't there, then the chances are that the code can be simplified to remove the state change...
Sage
Posts: 1,232
Joined: 2002.10
Post: #13
Quote:Originally posted by OneSadCookie
If the linewidth isn't there, then the chances are that the code can be simplified to remove the state change...

If you have any suggestions re: my earlier post I'm all ears.

Goal: draw AA'd primitives. Constraint: hardware doesn't support AA.

For example, here's what I'm currently doing to draw an AA line (still in progress so there are some bugs regarding line length < 1.0):

texture0 is bound to a texture containing a set of stripes of increasing length, e.g.

Code:
* * * * ...
  * * *
    * *
      *

global state vars:
Code:
float glAA_perp_ab_x, glAA_perp_ab_y;
float glAA_x2, glAA_y2;
float glAA_width;
int   glAA_width_tex;
float glAA_color[4];


Code:
void glAALineA(float x1, float y1, float x2, float y2) {
    glAA_x2 = x2; glAA_y2 = y2;
    glAA_perp_ab_y = x1-x2;
    glAA_perp_ab_x = y2-y1;
    float factor = glAA_perp_ab_y*glAA_perp_ab_y+glAA_perp_ab_x*glAA_perp_ab_x;
    float perpd = __frsqrte(factor);                
    perpd *= (1.5f - (0.5f*factor * perpd * perpd));
    if (glAA_width < 1.0) {
        glAA_width_tex = 1;
        glGetFloatv(GL_CURRENT_COLOR, &glAA_color[0]);
        glColor4f(glAA_color[0], glAA_color[1], glAA_color[2], glAA_color[3]*glAA_width);
    } else {
        glAA_width_tex = glAA_width;
    }
    float prllx = -glAA_perp_ab_y * perpd;                    // 1 px
    float prlly =  glAA_perp_ab_x * perpd;
    perpd *= (glAA_width+2)*0.5f;
    glAA_perp_ab_y *= perpd;
    glAA_perp_ab_x *= perpd;

    glColorMask(0,0,0,1);
    glBlendFunc(GL_ONE, GL_ZERO);
    glBegin(GL_QUADS);
        glTexCoord2f(glAA_width_tex*anis+3,  psz);
        glVertex2f(x1+glAA_perp_ab_x-prllx, y1+glAA_perp_ab_y-prlly);
        glTexCoord2f(glAA_width_tex*anis+3,  psz+glAA_width_tex+2);
        glVertex2f(x1-glAA_perp_ab_x-prllx, y1-glAA_perp_ab_y-prlly);
        glVertex2f(x2-glAA_perp_ab_x, y2-glAA_perp_ab_y);
        glTexCoord2f(glAA_width_tex*anis+3,  psz);
        glVertex2f(x2+glAA_perp_ab_x, y2+glAA_perp_ab_y);
        glTexCoord2f(glAA_width_tex*anis+3.5,  psz);
        glVertex2f(x1+glAA_perp_ab_x+prllx, y1+glAA_perp_ab_y+prlly);
        glTexCoord2f(glAA_width_tex*anis+3.5,  psz+glAA_width_tex+2);
        glVertex2f(x1-glAA_perp_ab_x+prllx, y1-glAA_perp_ab_y+prlly);
        glTexCoord2f(glAA_width_tex*anis+1.5,  psz+glAA_width_tex+2);
        glVertex2f(x1-glAA_perp_ab_x-prllx, y1-glAA_perp_ab_y-prlly);
        glTexCoord2f(glAA_width_tex*anis+1.5,  psz);
        glVertex2f(x1+glAA_perp_ab_x-prllx, y1+glAA_perp_ab_y-prlly);
        glTexCoord2f(glAA_width_tex*anis+1.5,  psz);
        glVertex2f(x2+glAA_perp_ab_x, y2+glAA_perp_ab_y);
        glTexCoord2f(glAA_width_tex*anis+1.5,  psz+glAA_width_tex+2);
        glVertex2f(x2-glAA_perp_ab_x, y2-glAA_perp_ab_y);
        glTexCoord2f(glAA_width_tex*anis+3.5,  psz+glAA_width_tex+2);
        glVertex2f(x2-glAA_perp_ab_x-prllx*2, y2-glAA_perp_ab_y-prlly*2);
        glTexCoord2f(glAA_width_tex*anis+3.5,  psz);
        glVertex2f(x2+glAA_perp_ab_x-prllx*2, y2+glAA_perp_ab_y-prlly*2);
    glEnd();
    glColorMask(1,1,1,0);
    glDisable(GL_TEXTURE_RECTANGLE_EXT);
    glBlendFunc(GL_DST_ALPHA, GL_ONE_MINUS_DST_ALPHA);
    glBegin(GL_QUADS);
        glTexCoord2f(glAA_width_tex*anis+2,  psz);
        glVertex2f(x1+glAA_perp_ab_x-prllx, y1+glAA_perp_ab_y-prlly);
        glTexCoord2f(glAA_width_tex*anis+2,  psz+glAA_width_tex+2);
        glVertex2f(x1-glAA_perp_ab_x-prllx, y1-glAA_perp_ab_y-prlly);
}

void glAALineB() {
        glVertex2f(glAA_x2-glAA_perp_ab_x, glAA_y2-glAA_perp_ab_y);
        glTexCoord2f(glAA_width_tex*anis+2,  psz);
        glVertex2f(glAA_x2+glAA_perp_ab_x, (glAA_y2)+glAA_perp_ab_y);
    glEnd();
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glEnable(GL_TEXTURE_RECTANGLE_EXT);
    glColorMask(1,1,1,1);
    if (glAA_width < 1.0) {
        glColor4f(glAA_color[0], glAA_color[1], glAA_color[2], glAA_color[3]);
    }
}

In between the A and B subs the client can insert e.g. glColor() to get gradient shading.

So, you can see the state changes involved are glColorMask, glBlendFunc, and glEnable.

I haven't been able to think of any way to do this without destination alpha...
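For readers without a PowerPC compiler, the perpendicular computation at the top of glAALineA can be sketched portably as below. The bit-trick initial estimate is a swapped-in stand-in for `__frsqrte`, not what the code above uses, and it assumes a 32-bit `unsigned int` and IEEE 754 floats. The refinement line is the same Newton-Raphson step. `rsqrt_approx` and `line_perp` are hypothetical names, and the sketch leaves out the 2-pixel padding added to the width above.

```c
#include <assert.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0. The integer estimate stands in for the
   PowerPC __frsqrte instruction (an assumption made for portability). */
static float rsqrt_approx(float x)
{
    assert(x > 0.0f);
    unsigned int bits;
    float y;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);    /* rough initial estimate       */
    memcpy(&y, &bits, sizeof y);
    y *= 1.5f - 0.5f * x * y * y;        /* one Newton-Raphson step      */
    return y;
}

/* Compute the perpendicular half-extent (px, py) of a quad of the given
   width around the segment (x1,y1)-(x2,y2). The quad corners are then
   (x1+px, y1+py), (x1-px, y1-py), (x2-px, y2-py), (x2+px, y2+py). */
static void line_perp(float x1, float y1, float x2, float y2, float width,
                      float *px, float *py)
{
    float perp_x = y2 - y1;              /* segment direction rotated 90 deg */
    float perp_y = x1 - x2;
    float scale = rsqrt_approx(perp_x * perp_x + perp_y * perp_y)
                  * width * 0.5f;
    *px = perp_x * scale;
    *py = perp_y * scale;
}
```

One refinement step leaves an error of a fraction of a percent, which is well under a pixel for lines of ordinary length. A zero-length segment trips the assertion, so callers need to filter those out first.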
Luminary
Posts: 5,143
Joined: 2002.04
Post: #14
Are you sure that GL is your bottleneck? With that many globals, that's going to produce some pretty hideous code... Try simply making local copies of all the globals at the beginning of the function, then referring to them rather than the globals.

You've got a bunch of integer and double constants being used in float expressions. Make sure you specify float constants as 1.0f, for example. GCC is very stupid in this regard. Also make sure to explicitly load constants into locals if you use the same one more than once.

I'm sure glDrawArrays (or better glDrawElements if you're reusing vertices) will give you better performance than the huge section of immediate mode calls. Don't worry too much about optimizations (though if gl*Pointer calls are disallowed between glAALineA and glAALineB you should probably use CVA).

If these functions are not being inlined, definitely use the CGLMacro interfaces for this many immediate mode calls. It approximately halves the function call overhead.

As for the case where these functions are being called in a tight loop, you can probably improve performance by writing a dedicated function for that purpose. It's not clear to me if you necessarily need to draw the color buffer immediately after the destination alpha, or whether it would be acceptable to draw all the destination alpha first and then all the color.

Even if that's not acceptable, you may be able to avoid some state changes by drawing only non-overlapping line segments in each pass -- keep unioning the bounding rectangles until you can't add any more segments, then draw that set of segments and move on.
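One way to read that suggestion, as a sketch: walk the segments in draw order and start a new batch whenever the next segment's bounding rectangle touches one already in the current batch. Testing against each rectangle rather than a single unioned rectangle is an assumption here, and the type and function names are hypothetical. Real code would also need to grow each rectangle by the line width.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x1, y1, x2, y2; } Segment;
typedef struct { float minx, miny, maxx, maxy; } Rect;

static Rect segment_bounds(Segment s)
{
    Rect r;
    r.minx = s.x1 < s.x2 ? s.x1 : s.x2;
    r.maxx = s.x1 < s.x2 ? s.x2 : s.x1;
    r.miny = s.y1 < s.y2 ? s.y1 : s.y2;
    r.maxy = s.y1 < s.y2 ? s.y2 : s.y1;
    return r;
}

static int rects_overlap(Rect a, Rect b)
{
    return a.minx <= b.maxx && b.minx <= a.maxx &&
           a.miny <= b.maxy && b.miny <= a.maxy;
}

/* Assign each segment a batch number, keeping the original draw order.
   A new batch starts as soon as a segment's bounds touch the bounds of any
   segment already in the current batch. Returns the number of batches. */
static size_t assign_batches(const Segment *segs, size_t n, size_t *batch_of)
{
    assert(segs != NULL && batch_of != NULL);
    size_t batch = 0, start = 0;         /* start = first segment of batch */
    for (size_t i = 0; i < n; i++) {
        Rect r = segment_bounds(segs[i]);
        for (size_t j = start; j < i; j++) {
            if (rects_overlap(r, segment_bounds(segs[j]))) {
                batch++;                 /* flush: begin a new batch at i */
                start = i;
                break;
            }
        }
        batch_of[i] = batch;
    }
    return n ? batch + 1 : 0;
}
```

Within one batch, all the destination-alpha masks can be drawn in one pass and all the color in a second, so the state changes happen once per batch instead of once per segment. With heavily self-intersecting geometry the batches may end up very short.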

Of course, what's fast will depend on the characteristics of the data being drawn, which you don't necessarily know beforehand -- take comfort in the fact that CopyBits and BlockMoveData have to deal with precisely those situations, and they manage.

So there y'go, that's my brain-dump for the evening. Are you sure FSAA isn't good enough? ;)
Sage
Posts: 1,232
Joined: 2002.10
Post: #15
Quote:Originally posted by OneSadCookie
Are you sure that GL is your bottleneck?
Well, perceptually, right now it is a battle between CPU and fill rate; it depends on the size of the primitives. Since I'm still getting the algorithm right I haven't spent a lot of time profiling, but FWIW Shark says 50% of my time is spent in gldGetString and gldUpdateDispatch. glProfiler says 75% is glBegin/glEnd, since everything is immediate...

Quote:You've got a bunch of integer and double constants being used in float expressions...
Also make sure to explicitly load constants into locals...
I know about these, I just haven't gotten to the Shark stage yet.

Quote:With that many globals, that's going to produce some pretty hideous code... Try simply making local copies of all the globals at the beginning of the function, then referring to them rather than the globals.
This one I'm not familiar with (still PPC newbie.) The only reason I've got it split up into separate functions (should be inlined) like this is so the client can still change color/texture/probably linewidth between vertices. Globals avoid resending all the vertex1 info & temp values in the later vertex call(s).

What if I specify globals as 'register float foo_bar asm( "f12")' or whatever? Not that I trust GCC to do anything smart, anyway...

Quote:definitely use the CGLMacro interfaces...
I've looked at these (in ADC VertexPerformanceTest for example) but not tried them yet. Looks like they avoid getting the ctx before each state call? I already manage my contexts explicitly so the macros are probably a no-brainer win...

...but I'd really like to get out of immediate mode...

Quote:I'm sure glDrawArrays ... will give you better performance...
It's not clear to me if you necessarily need to draw the color buffer immediately after the destination alpha, or whether it would be acceptable to draw all the destination alpha first and then all the color.
This is really the problem. In order for overlapping shapes to appear correctly the color of each primitive needs to be drawn immediately after its alpha mask is drawn. Otherwise, I could just do two glDrawArrays passes.

You're right that I could futz around building non-intersecting arrays to draw... but I know that my geometry is very likely to self-intersect (fractally & randomly generated, it is...)

Quote:So there y'go, that's my brain-dump for the evening.
Very much appreciated!

Quote:Are you sure FSAA isn't good enough? ;)
Yes, quite sure. Take a look at how crappy it is:
[Image: AA_compare.png]