Regarding GL profiling

Sage
Posts: 1,199
Joined: 2004.10
Post: #1
I'm having a severe performance problem, and am hoping you guys can give me some pointers.

I've implemented foliage, yet again. For the first time I've implemented a system which can generate TONs of foliage very, VERY quickly. The generation of foliage hauls *ss. But the rendering performance is terrible.

Scenes which ran at 30fps without foliage, becoming a slideshow with it! Gah.

Now, I noticed that when the foliage quads were very small, my performance remained at 30, but as soon as the quads got bigger it slowed down considerably.

Now, here's a screenshot so you can see it's really not pushing that many pixels.
[Image: Terrain-2009-06-25-03.png]
(also, please pardon the poor quality texture, since I'm still experimenting)

So, I assume this means that fill rate is the problem.

To accommodate this, I've done a few things:

1) I'm using a high alpha test level, so I can render the quads order independant. I'm using a trick from this article (http://http.developer.nvidia.com/GPUGems...ter01.html) to fade foliage out at distance
2) I've simplified the shadow mapping algorithms used by foliage to be faster and lighter, since high quality isn't really important there.
3) I billboard on the GPU
4) I'm using interlieved arrays, but not VBOs, since individual foliage patches come and go pretty quickly, but I don't think that the vertex throughput is the problem ( though it may be )

I don't know what else to do. I can post code for you to pore over, since I have to assume my GLSL is too expensive.

Any ideas?
Quote this message in a reply
Oldtimer
Posts: 834
Joined: 2002.09
Post: #2
What does the GL profiler say? I'm seeing on the order of 10^3 quads here, and not a lot more than perhaps 10^6 pixels in total - say a megapixel of fillrate for the foilage. That shouldn't be a problem unless you were fill-bound already, and in which case you shouldn't drop half of your frames... My gut feeling is saying that you're somehow falling back to software, or incurring state changes that stall your pipeline. Still, increasing foilage size does imply fill-rate as your culprit. Then again, that drastic falloff in framerate can't well be caused by such a small amount of extra fill, and so on, back and forth.

My point here is that your pipe is behaving oddly, and you're not likely to crack the nut without asking the driver for some answers.
Quote this message in a reply
Member
Posts: 81
Joined: 2007.05
Post: #3
Wow! That looks awesome! Love seeing those screenshots.

If you turn glsl off when rendering the grass is it much faster? Ya, I know the fixed pipeline really is programmed but its still the fast baseline case.

Also, I've noticed that sending a glsl script uniform values is a real slowdown. Figure it has to travel from cpu to card. ( for each plant )I try to hardcode that in the glsl as much as possible and that sped things up for me. And I try to order what I draw such that it minimizes any material setting and glsl program loading. So, lump all the grass together and set the material once at the beginning. I vaugley recall I use to set values into the texture matrix because all the scripts could access that .... but I think that method is deprecated in newer versions of opengl.

Mipmapping might help because there are so many small ones drawn in the background. Probably doing that...

Looks great!
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #4
Fenris Wrote:What does the GL profiler say? I'm seeing on the order of 10^3 quads here, and not a lot more than perhaps 10^6 pixels in total - say a megapixel of fillrate for the foilage. That shouldn't be a problem unless you were fill-bound already, and in which case you shouldn't drop half of your frames... My gut feeling is saying that you're somehow falling back to software, or incurring state changes that stall your pipeline. Still, increasing foilage size does imply fill-rate as your culprit. Then again, that drastic falloff in framerate can't well be caused by such a small amount of extra fill, and so on, back and forth.

My point here is that your pipe is behaving oddly, and you're not likely to crack the nut without asking the driver for some answers.

I just ran it under GL profiler with "Break on SW Fallback" and "Breakon Error" checked, but no dice. Apparently, it's all running in HW, and without complaint.

I'm really boggled -- I've seen much more foliage being rendered before without this kind of hit.

I'm going to give a stab at better batching of the foliage, and to use VBOs just in case. But I'm reasonably confident this is a fillrate problem, I just don't understand why the fill is so expensive.

@macnib -- I can't drop GLSL. I have a pretty thorough pipeline for fog, shadowing, etc.

EDIT:
It's pretty clearly a matter of fillrate...
With large textures:
[Image: Terrain-2009-06-25-09.png]
And with small:
[Image: Terrain-2009-06-25-10.png]

I think I will have to put a lot of love into the GLSL to try to speed it up.
Quote this message in a reply
Member
Posts: 268
Joined: 2005.04
Post: #5
Off the top of my head, some sanity checks:

1) Make sure the invisible grass at distance isn't being sent to the card. Just a sanity check to make sure you aren't still sending the verts and having the shader run on 0.0 alpha pixels.

2) There seems to be a couple thousand grass quads in view. I'd try combining multiple grass clumps into a single, larger quad.

3) I'd have to guess that fillrate is the issue, so I'd just try fiddling with anything that can possibly affect that. Try without alpha-test/blending turned on. Turn off depth-writes to save a tiny bit of fillrate? Smaller texture map / make sure you've got mip-maps. Just to track down where the bottleneck might be.
Quote this message in a reply
Member
Posts: 45
Joined: 2008.04
Post: #6
Would you care to produce those screenshots in wireframe mode so we can get a better idea of the sizes and complexity of the grass and terrain (and of course some numbers in terms of vertices, etc would help).

[edited] If you think its the GLSL performance then replace the grass shader with one that does nothin except draw the quad, and see how it changes...

We'd love to see your code :-)
Quote this message in a reply
Member
Posts: 87
Joined: 2006.08
Post: #7
I generally agree with the rest of the advice:

a) Do the standard 'change the framebuffer' dimensions trick to verify a fillrate issue
b) Try disabling alpha test (without bothering to sort)
c) Try disabling blending (without bothering to sort)
d) If it is a fragment shader, post the GLSL itself?

The wireframe test will also be very interesting. How many pixels are being processed, only to be discarded?

Also, I notice that shadows are being cast on the grass. Can you describe what you're drawing to handle that? What if you remove the grass from the shadow function, and only render it into the final color buffer (ie, if there is a fill limit, is it happening during shadow map creation?).
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #8
Turning off all blending, I get the following rather surreal image:
[Image: Terrain-2009-06-25-12.png]

But, you'll notice I'm getting better FPS. Which is interesting.

Bachus Wrote:Off the top of my head, some sanity checks:

1) Make sure the invisible grass at distance isn't being sent to the card. Just a sanity check to make sure you aren't still sending the verts and having the shader run on 0.0 alpha pixels.

I did -- I'm only drawing those patches which are within the visibility radius -- it was one of the first sanity checks I did. Sad

Quote:2) There seems to be a couple thousand grass quads in view. I'd try combining multiple grass clumps into a single, larger quad.

I will consider this, but I'm not certain how I could pull it off -- seems like it would be very hard to do in a manner which doesn't look wonky. Plus, as far as I can tell this isn't limited by # vertices, but rather fill.

Quote:3) I'd have to guess that fillrate is the issue, so I'd just try fiddling with anything that can possibly affect that. Try without alpha-test/blending turned on. Turn off depth-writes to save a tiny bit of fillrate? Smaller texture map / make sure you've got mip-maps. Just to track down where the bottleneck might be.

I can't do this without depth writes, since I don't want the overhead of depth sorting. Also, the texture's pretty small -- it's a small texture, atlas 1024x128
( 4 * 256x128 ) textures. I do have mipmaps too.

This is wigging me out! Is there any way to use Instruments' OpenGL tool to get an idea of what's going on? I can't figure out how to use Instruments effectively, and as such I spend all my optimizing time in Shark.

The good news: The code that generates points for the foliage is very fast. I can produce a comical density in real time and it's awesome! It's also being used to place boulders and other related greebling.

EDIT: I'll post GLSL presently -- I've got to walk the dogs.
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #9
SOLVED

[Image: Terrain-2009-06-25-16.png]

The problem seems to stem from my having used a very wide texture atlas of 1024x128 -- when I switched to a single square texture at 128x128, I get the expected performance.

I assume this has to do with the mipmaps pyramid being balanced along width & height. So, I have learned something today!

( now, I just need to come up with a non-awful texture atlas! )
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #10
Well, while the egregious problem is solved, I'm still having performance problems.

I've just run the app under gl profiler's "Statistics" mode, and here's the results:
Code:
GL Function;# of Calls;Total Time (µsec);Avg Time (µsec);% GL Time;% App Time
CGLClearDrawable;1;0;0.00;0.00;0.00
CGLFlushDrawable;635;3068140;4831.72;6.86;3.27
CGLGetParameter;635;0;0.00;0.00;0.00
glActiveTexture;705,468;55670;0.08;0.12;0.06
glAlphaFunc;271,296;25595;0.09;0.06;0.03
glBegin;3,810;47998;12.60;0.11;0.05
glBindBuffer;1,137,892;324924;0.29;0.73;0.35
glBindFramebufferEXT;5,072;22807;4.50;0.05;0.02
glBindTexture;380,320;110980;0.29;0.25;0.12
glBlendFunc;118,940;12709;0.11;0.03;0.01
glBufferData;305;10406;34.12;0.02;0.01
glClear;1,268;34240;27.00;0.08;0.04
glClearColor;634;99;0.16;0.00;0.00
glClearDepth;1,268;478;0.38;0.00;0.00
glClearStencil;634;165;0.26;0.00;0.00
glColor3f;635;146;0.23;0.00;0.00
glColor4fv;66,275;8248;0.12;0.02;0.01
glColorMask;634;225;0.36;0.00;0.00
glColorMaterial;61,196;13720;0.22;0.03;0.01
glCopyTexSubImage2D;635;21767;34.28;0.05;0.02
glCullFace;158,915;31968;0.20;0.07;0.03
glDeleteBuffers;913;55272;60.54;0.12;0.06
glDeleteFramebuffersEXT;1;4;4.81;0.00;0.00
glDeleteObjectARB;8;2297;287.20;0.01;0.00
glDeleteTextures;3;1723;574.38;0.00;0.00
glDepthFunc;196,710;26085;0.13;0.06;0.03
glDepthMask;41,687;20055;0.48;0.04;0.02
glDisable;445,855;80054;0.18;0.18;0.09
glDisableClientState;432,947;108740;0.25;0.24;0.12
glDrawArrays;289,682;16778926;57.92;37.51;17.88
glDrawElements;213,090;3180477;14.93;7.11;3.39
glEnable;472,578;65668;0.14;0.15;0.07
glEnableClientState;425,363;75465;0.18;0.17;0.08
glEnd;3,810;1662;0.44;0.00;0.00
glFlush;2,536;18527663;7305.86;41.42;19.74
glFogfv;2,540;1613;0.64;0.00;0.00
glFrontFace;161,453;23756;0.15;0.05;0.03
glGenBuffers;305;61;0.20;0.00;0.00
glGetUniformLocationARB;408;565;1.39;0.00;0.00
glInterleavedArrays;216,727;156642;0.72;0.35;0.17
glLightModelfv;3,809;1439;0.38;0.00;0.00
glLightfv;6,349;3507;0.55;0.01;0.00
glLoadIdentity;191,747;22460;0.12;0.05;0.02
glLoadMatrixf;191,102;21389;0.11;0.05;0.02
glMaterialf;122,390;19072;0.16;0.04;0.02
glMaterialfv;489,560;76754;0.16;0.17;0.08
glMatrixMode;755,550;74073;0.10;0.17;0.08
glMultMatrixf;284,775;65144;0.23;0.15;0.07
glMultiTexCoord2i;2,540;347;0.14;0.00;0.00
glNormalPointer;139,319;34174;0.25;0.08;0.04
glOrtho;1,270;327;0.26;0.00;0.00
glPolygonMode;635;146;0.23;0.00;0.00
glPolygonOffset;634;219;0.35;0.00;0.00
glPopMatrix;291,759;49980;0.17;0.11;0.05
glPushMatrix;291,759;110199;0.38;0.25;0.12
glShadeModel;97,085;15288;0.16;0.03;0.02
glTexCoord2f;10,160;686;0.07;0.00;0.00
glTexCoord2i;208,820;13248;0.06;0.03;0.01
glTexEnvf;1,270;462;0.36;0.00;0.00
glTexEnvi;144,830;19504;0.13;0.04;0.02
glTexParameterf;125,008;24545;0.20;0.05;0.03
glTexParameteri;593,111;126650;0.21;0.28;0.13
glTranslatef;3,809;1310;0.34;0.00;0.00
glUniform1fARB;284,797;175831;0.62;0.39;0.19
glUniform1iARB;185,397;68900;0.37;0.15;0.07
glUniform2fARB;2,527;2085;0.83;0.00;0.00
glUniform3fARB;142,672;130005;0.91;0.29;0.14
glUniform4fARB;277,592;248558;0.90;0.56;0.26
glUniformMatrix4fvARB;225,143;204978;0.91;0.46;0.22
glUseProgramObjectARB;163,569;237090;1.45;0.53;0.25
glVertex2fv;17,780;2549;0.14;0.01;0.00
glVertex2i;208,820;20700;0.10;0.05;0.02
glVertex4fv;2,540;1588;0.63;0.00;0.00
glVertexPointer;286,044;54972;0.19;0.12;0.06
glViewport;5,073;3054;0.60;0.01;0.00

This doesn't tell you a lot but the hotspots as shown in the app are:
CGLFlushDrawable which takes 3.27 % of app time
glDrawArrays which takes 17.88%,
and glFlush which takes 19.74%

The only place I'm using glFlush is in FBOs, and I'm going to see if they still work without it.

Also, Shark tells me I'm spending 32% of CPU time drawing foliage, most of which is in glDrawArrays
Code:
0.2%    32.6%    Terrain                         sre::modules::Foliage::Patch::display(sre::FrameRenderState const&)    
0.0%    28.4%    libGL.dylib                          glDrawArrays    
0.0%    1.9%    libGL.dylib                          glInterleavedArrays    
0.0%    0.9%    SRE                          sre::MaterialShader::setupForEntity(sre::FrameRenderState const&, boost::shared_ptr<sre::Entity> const&)    
0.2%    0.2%    Terrain                          boost::detail::sp_counted_base::release()    
0.0%    0.2%    libstdc++.6.dylib                          std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)    
0.0%    0.1%    libstdc++.6.dylib                          std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_destroy(std::allocator<char> const&)    
0.0%    0.1%    SGF                          sgf::glsl::ShaderProgram::setUniform(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, sgf::vec3_<float> const&)    
0.0%    0.1%    SGF                          sgf::glsl::ShaderProgram::setUniform(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, float)    
0.0%    0.1%    libstdc++.6.dylib                          std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&)    
0.0%    0.1%    SRE                          sre::MaterialShader::shaderForPass(sre::FrameRenderState const&)    
0.0%    0.0%    SGF                          sgf::glsl::Uniform::set()    
0.0%    0.0%    libGL.dylib                          glAlphaFunc    
0.0%    0.0%    libstdc++.6.dylib                          std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::_Alloc_hider(char*, std::allocator<char> const&)    
0.0%    0.0%    SGF                          sgf::glsl::ShaderProgram::setUniform(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, sgf::mat4_<float> const&)    
0.0%    0.0%    Terrain                          boost::enable_shared_from_this<sre::Entity>::shared_from_this()    
0.0%    0.0%    libstdc++.6.dylib                          std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)    
0.0%    0.0%    GLEngine                          glInterleavedArrays_ExecThread    
0.0%    0.0%    GLEngine                          glAlphaFunc_Comp    
0.0%    0.0%    SRE                          sre::Material::setupForEntity(sre::FrameRenderState const&, boost::shared_ptr<sre::Entity> const&)    
0.0%    0.0%    SRE                          sre::Entity::transform() const    
0.0%    0.0%    GLEngine                          glDrawArrays_PackThread    
0.0%    0.0%    libstdc++.6.dylib                          std::allocator<char>::allocator(std::allocator<char> const&)    
0.0%    0.0%    Terrain                          boost::detail::shared_count::shared_count(boost::detail::weak_count const&)    
0.0%    0.0%    libstdc++.6.dylib                          __gnu_cxx::__exchange_and_add(int volatile*, int)
Quote this message in a reply
Oldtimer
Posts: 834
Joined: 2002.09
Post: #11
Right, let's drop lower into your GL for a bit. Hit the GL profiler, and from the Views menu, hit the Driver Monitor. A new app will launch, without a window. From the menu bar, select Monitors->Driver monitors and select whichever GPU you're running off. (Cmd-1 should be the right one).

Now, you get a graph window that tells you, well, nothing. Press the Parameters button in the lower left corner. From here, you can drag and drop stuff you want to keep an eye on into the empty space below the graph.

I suggest you monitor the following:
  • Current Video Memory in Use (to see if you're exhausting your GPU's RAM)
  • CPU Wait for GPU (to see if you're glFlushing inappropriately, or waste time with a de-parallelized pipeline)
  • All of the Surface|Texture Page On|Off Data (to see if you're choking the upload paths with non-FBO stuff)
  • OpenGL context switches (to see if you're somehow hopping between contexts)

There is a ton of parameters to monitor, but these are the ones I drop in when stuff is acting weird. Run your app and see if any values look insane. (Also, this is a system-wide monitor, so kill all other apps before running).
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #12
Fenris,

When I get home from work I'll give all this a stab. Can I post the results here for interpretation?
Quote this message in a reply
Oldtimer
Posts: 834
Joined: 2002.09
Post: #13
Please do, I'd be intrigued to see what you find. Smile
Quote this message in a reply
Oldtimer
Posts: 834
Joined: 2002.09
Post: #14
I just now noticed that your Statistics profile shows one thing as pretty borked:

Quote:glActiveTexture; 705,468; 55670; 0.08;0.12;0.06
glAlphaFunc; 271,296; 25595; 0.09; 0.06; 0.03
[...]
glBindBuffer; 1,137,892; 324924; 0.29; 0.73; 0.35

Judging by the #calls value to CGLFlushDrawable (635) you've rendered roughly 600 frames. During those frames, you've performed 700,000 calls to glActiveTexture, changed the alpha function state 271,000 times and bound buffers over a million times. (I'm not sure if this is as strange as it sounds, perhaps Frogblast or Arekkusu could chime in - but it sure looks funny to me.)

Now, these calls usually don't take a lot of time themselves, which is why they don't top the profile; but AFAIK they stall the pipeline since they muck up GLs state machine. (I'd defer to anyone who disagrees on this one, I'm not 100% on it). If I were you, I'd take a long hard look at whence you call glActiveTexture, glAlphaFunc and glBindBuffer.

If you can't spot anything fishy, run the GL Profiler with "Include Backtraces" checked, open the Trace panel, and start the trace by clicking the Play/Pause button. Trace for a couple of seconds. Now, you can examine the call flow of those seconds. If you see any of these functions crop up too darn often, you have the call stack for each call to examine, to see who's knocking the door.

Still, I'd be intrigued to see what your code is doing, so take these hints for a spin and let us know what you come up with. Smile Cheers!
Quote this message in a reply
Sage
Posts: 1,199
Joined: 2004.10
Post: #15
OK, first, the text export, which seems voluminous and cryptic.
http://shamyl.zakariya.net/etc/gl/Terrain-DM-0.txt

And a screenshot of GL Driver Monitor
[Image: Terrain-DM-0.png]

Any bells ring for you here?
Quote this message in a reply
Post Reply