Best vertex throughput on MacBooks?

Member
Posts: 26
Joined: 2006.09
Post: #1
hi,

I have been trying to work out the best way to get maximum vertex throughput on MacBooks. The problem with this machine is that you are pretty much always vertex bound, since there is no vertex processing hardware. According to the call stacks I profiled in Shark, everything from transforming and lighting to clipping is done in software.

I went through the Apple dev site, but a lot of the articles were a bit old. According to a work-in-progress post I saw some time ago, VBOs would be the most optimized path.

I'm currently testing a solution where I have one rendering thread and use VBOs, so that I can use the second processor to do all the heavy OpenGL work. Most of my data is dynamic, so the situation is a bit tricky too.

So my question is: does anyone have experience with what kind of setup would be the fastest on this particular hardware? How should the data be described to OpenGL, should it be single- or multithreaded, etc.? Any links to articles or benchmarking sources are very welcome too.
Member
Posts: 87
Joined: 2006.08
Post: #2
There isn't much information posted on this, but the best way to submit streaming vertex data is to use VBOs. Today, VBOs will probably not provide much performance benefit on systems that do software vertex processing (and similarly, will not hurt either). However it is the most future-proof API.

Uploading dynamic data can easily be done efficiently using VBO, so this should not be an impediment.

Also, what kind of 'multithreaded' model were you talking about? You imply that you are accessing OpenGL from multiple threads, which you cannot safely do.

Other general MacBook performance tips:
- Draw fewer vertices (user selectable options are a good thing)
- I believe that fixed function vertex processing is currently faster
- If doing programmable processing, try to prefer instructions that map easily to SSE. Avoid DP4s, use MULs and MADs instead, etc.
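To illustrate the last tip, here is a minimal CPU-side sketch in C of the same 4x4 transform written two ways. It is not taken from any driver; the function names are made up for the example. The DP4 form needs a horizontal sum across four lanes for every output component, which early SSE handles awkwardly, while the MUL/MAD form works on whole columns and keeps every operation vertical.

```c
#include <assert.h>
#include <stddef.h>

/* m is column-major, as OpenGL stores matrices: element (row r, column c)
   lives at m[c * 4 + r]. Both functions are illustrative names only. */

/* DP4 style: one 4-wide dot product per output component. */
void transform_dp4(const float m[16], const float v[4], float out[4])
{
    assert(m != NULL && v != NULL && out != NULL);
    float tmp[4];
    for (int r = 0; r < 4; ++r) {
        tmp[r] = m[0 + r] * v[0] + m[4 + r] * v[1]
               + m[8 + r] * v[2] + m[12 + r] * v[3];
    }
    for (int r = 0; r < 4; ++r) {
        out[r] = tmp[r];
    }
}

/* MUL/MAD style: scale column 0 by v.x, then accumulate the other three
   columns. Each step touches all four lanes at once. */
void transform_mad(const float m[16], const float v[4], float out[4])
{
    assert(m != NULL && v != NULL && out != NULL);
    float acc[4];
    for (int r = 0; r < 4; ++r) {
        acc[r] = m[r] * v[0];                /* MUL acc, col0, v.xxxx */
    }
    for (int c = 1; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            acc[r] += m[c * 4 + r] * v[c];   /* MAD acc, colN, v.n, acc */
        }
    }
    for (int r = 0; r < 4; ++r) {
        out[r] = acc[r];
    }
}
```

Both functions compute the same result; the difference is only in how the arithmetic is grouped, which is what matters when a vertex program is translated to SSE.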
Moderator
Posts: 3,579
Joined: 2003.06
Post: #3
Frogblast Wrote: Also, what kind of 'multithreaded' model were you talking about? You imply that you are accessing OpenGL from multiple threads, which you cannot safely do.
OpenGL is thread safe, but the contexts are not. In 10.4 you can use CGLLockContext and CGLUnlockContext to lock an OpenGL context so that it can be called safely from any thread (not simultaneously, of course). Be forewarned though: multi-threading can be tricky unless you absolutely understand what you're doing. I use it successfully, but there are many pitfalls to watch out for initially. Once you understand the rules, it is pretty easy though. Since all the new Macs are dual core, and current computer trends are towards multi-processing, it certainly can't hurt to try it if you think you can benefit from multi-threading.

Also, dynamic data isn't suited to it, but if you have any static data and are willing to mix techniques, display lists are almost always the fastest on the Mac. VAR is another option on the Mac for DMA vertex uploads, which may offer better performance on specific implementations, but I am in the dark as to the differences. arekkusu or OSC will definitely have better info on that. VBOs are perfectly fine though, IMHO.
Member
Posts: 87
Joined: 2006.08
Post: #4
AnotherJake Wrote: Also, dynamic data isn't well suited to it, but if you can swing it, display lists are almost always the fastest on the Mac. VAR is another option on the Mac for DMA vertex uploads.

VAR does not have any advantage over VBO (especially on SW TCL renderers). VARs are also more difficult to use properly, while not portable to other OpenGL implementations.

For a specific example on the best use of VBOs for streaming vertex data, see listing 8-3 in the OpenGL Programming Guide. http://developer.apple.com/documentation...-CH406-SW9
Moderator
Posts: 3,579
Joined: 2003.06
Post: #5
I've used VARs without any difficulties. VAR is also available on other OpenGL implementations, but from what I've heard it is slower than VBO everywhere except on Apple's implementation. How portable Apple's implementation is, I do not know, so I'll take your word for it. [edit] Not to sound like I'm pumping up VAR though. Just trying to point out that it exists. VBO is definitely the preferred path if you plan on going cross-platform! [/edit]
Moderator
Posts: 1,140
Joined: 2005.07
Post: #6
Since you always have to stream the data to the graphics chip regardless (because it doesn't have any on-board VRAM), I would imagine that VBOs and vertex arrays would be pretty much the same. Though I don't have any tests to prove it, I would think that display lists would be a little slower, since they tend to be larger, and since the data needs to be streamed anyway they don't have any advantage there. I would put my vote on VBOs or vertex arrays. (You would probably want to use VBOs anyway, or at least have the option to use them, to benefit computers with discrete graphics cards.)

Edit: now that I think about it, VBOs might be a little bit faster, since you're guaranteed that the memory is consecutive.
Moderator
Posts: 3,579
Joined: 2003.06
Post: #7
The display list implementation is unknown to all except Apple engineers as far as I know. I was under the impression that they were able to leave some data resident on the video card without having to stream it all the time, which was one of the reasons it was faster, but I could be full of it. Whether it is still faster on the GMA950 is indeed a good question that I hadn't really thought about in depth. I would guess that VBO would be faster in this instance too, after further consideration. I would guess that it would be faster than vertex arrays too still, because vertex arrays would still be sent through the processor, where VBO data could still be DMA'd by the GMA950 (in theory, I think). Strange to imagine DMA from RAM to RAM, but I don't see why that can't happen, or some variation of it. But now we're in speculation land and need some raw data... Or someone who knows better.
Sage
Posts: 1,232
Joined: 2002.10
Post: #8
Frogblast Wrote: VAR does not have any advantage over VBO

VAR does have one advantage: the ability to explicitly flush a subrange of a mapped array. With VBO, when you map a buffer to modify the data, there is no mechanism to specify which portion you modified, so on unmap, the entire thing has to be considered dirty.
Member
Posts: 26
Joined: 2006.09
Post: #9
Frogblast, currently I have a higher-level API that draws antialiased 2D primitives. I was planning to pass that simple data to the other thread and let it process and draw the stuff. There are some calculations that need to be done before sending the data to OpenGL too, so I think it would be quite beneficial overall.

Later I need to expand that to include 3D rendering too, but here again I think I could manage to have a higher-level API, so that I can send minimal info to the rendering thread.

I'm not really confident on multithreading, so let's see if I can make it work.
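One way the hand-off to a rendering thread could be structured is a single-producer, single-consumer ring buffer: the application thread pushes small primitive descriptions, and the rendering thread (the only one touching the OpenGL context) pops them and does the heavy work. The sketch below is only an illustration using C11 atomics; `LineCmd`, `CmdQueue`, and the function names are hypothetical and not part of any API mentioned in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 256u   /* must be a power of two */
static_assert((QUEUE_CAPACITY & (QUEUE_CAPACITY - 1u)) == 0u,
              "QUEUE_CAPACITY must be a power of two");

/* Hypothetical high-level primitive: an antialiased 2D line. */
typedef struct {
    float x0, y0, x1, y1;
    float width;
} LineCmd;

typedef struct {
    LineCmd slots[QUEUE_CAPACITY];
    atomic_size_t head;   /* next slot to write; advanced by the producer only */
    atomic_size_t tail;   /* next slot to read; advanced by the consumer only */
} CmdQueue;

void queue_init(CmdQueue *q)
{
    assert(q != NULL);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Called from the application thread. Returns false when the queue is full. */
bool queue_push(CmdQueue *q, LineCmd cmd)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QUEUE_CAPACITY) {
        return false;
    }
    q->slots[head & (QUEUE_CAPACITY - 1u)] = cmd;
    /* Release: the slot contents become visible before the new head does. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Called from the rendering thread. Returns false when the queue is empty. */
bool queue_pop(CmdQueue *q, LineCmd *out)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = q->slots[tail & (QUEUE_CAPACITY - 1u)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

Because only one thread ever issues OpenGL calls in this arrangement, the context itself does not need to be shared between threads; only the queue does.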

My initial preference for VBOs came from this post on apple lists:
http://lists.apple.com/archives/mac-open...00010.html

According to it the VBO seems to be the most optimised path on GMA950s.

I guess I'll go and do some testing now :)
Member
Posts: 26
Joined: 2006.09
Post: #10
OK, here are some initial results. I had the following test setup:
- Variations: static and dynamic data; immediate mode, vertex arrays, and vertex buffer objects; indexed triangles, indexed triangle strips, and non-indexed triangle strips
- The object was a procedurally generated PQ torus knot with 15k vertices and 30k triangles
- The torus was lit with one point light, with color material enabled
- No clipping occurred
- The dynamic results include generating the vertex data


Here are the results (time per frame):
Code:
IM_INDEXED            16.3696ms
IM_INDEXED_STRIP       7.2489ms
IM_STRIP               7.2248ms
VA_INDEXED            12.2359ms
VA_INDEXED_STRIP       6.0711ms
VA_STRIP              19.3035ms
VBO_INDEXED           12.2815ms
VBO_INDEXED_STRIP      6.1633ms
VBO_STRIP              5.9710ms

IM_INDEXED dyn        18.3346ms
IM_INDEXED_STRIP dyn   9.7929ms
IM_STRIP dyn           9.7505ms
VA_INDEXED dyn        14.6873ms
VA_INDEXED_STRIP dyn   8.6191ms
VA_STRIP dyn          27.2527ms
VBO_INDEXED dyn       14.7448ms
VBO_INDEXED_STRIP dyn  8.6341ms
VBO_STRIP dyn          8.5206ms

The torus generation seems to add about 2.5ms of overhead; according to Shark, 75% of that time was spent in sin/cos.

The conclusion could be: avoid glDrawArrays on pure vertex arrays, VBO with glDrawArrays is the fastest... and use tri-strips :)
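To compare the timings in throughput terms, the frame times can be converted to triangles per second. The helper below is a small sketch with made-up function names; the inputs in the usage note are the 30k-triangle mesh and the timings from the table above.

```c
#include <assert.h>

/* Triangles drawn per second, given the triangle count of one frame and
   the measured frame time in milliseconds. */
double triangles_per_second(unsigned triangles, double frame_ms)
{
    assert(frame_ms > 0.0);
    return (double)triangles / (frame_ms / 1000.0);
}

/* How many times faster the candidate path is than the baseline path. */
double relative_speedup(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

For example, `triangles_per_second(30000, 5.9710)` for VBO_STRIP comes out at roughly 5 million triangles per second, and `relative_speedup(16.3696, 5.9710)` shows VBO_STRIP running about 2.7 times faster than IM_INDEXED.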

I will continue to test with interleaved vertexarrays, textures coord gen, different amount of lights, etc. According to Shark about 60% was spent on lighting calculations.

[edit] There was a typo in my test case for VA_STRIP; it actually performs similarly to the indexed version.

[edit2] A few more test runs here: http://www.moppiproductions.net/memon/st...t_test.txt
I picked the tests pretty randomly, using whatever simple options I could think of. The times seem to be accurate to about 0.5ms, and are probably more consistent within one set of tests.
Member
Posts: 204
Joined: 2002.09
Post: #11
arekkusu Wrote: With VBO, when you map a buffer to modify the data, there is no mechanism to specify which portion you modified, so on unmap, the entire thing has to be considered dirty.

What about glBufferSubData()? Granted, there's an extra copy involved over glMapBuffer(), but it does allow updating only a portion of the data.
Member
Posts: 87
Joined: 2006.08
Post: #12
KittyMac Wrote: What about glBufferSubData()? Granted, there's an extra copy involved over glMapBuffer(), but it does allow updating only a portion of the data.

BufferSubData would avoid the partial flush. BufferSubData is preferred if you are only modifying a very small portion of the overall buffer.
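For the small-update case, the application has to know which bytes it touched. A minimal sketch of that bookkeeping is below: it accumulates one byte range covering every modification, and the resulting offset and size are what would then be handed to glBufferSubData (or to a range flush). `DirtyRange` and its helpers are hypothetical names for this example, not part of OpenGL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [begin, end) covering every modification so far.
   The range is empty when begin >= end. */
typedef struct {
    size_t begin;
    size_t end;
} DirtyRange;

void dirty_reset(DirtyRange *d)
{
    assert(d != NULL);
    d->begin = SIZE_MAX;
    d->end = 0;
}

/* Record that `size` bytes starting at `offset` were modified. */
void dirty_mark(DirtyRange *d, size_t offset, size_t size)
{
    assert(d != NULL && size > 0);
    if (offset < d->begin) {
        d->begin = offset;
    }
    if (offset + size > d->end) {
        d->end = offset + size;
    }
}

/* Number of bytes that need uploading; 0 when nothing was touched. */
size_t dirty_size(const DirtyRange *d)
{
    return (d->begin >= d->end) ? 0 : d->end - d->begin;
}
```

After a frame's modifications, `d.begin` and `dirty_size(&d)` would serve as the `offset` and `size` arguments of glBufferSubData, followed by `dirty_reset` for the next frame. Note that a single merged range can cover untouched bytes between two distant edits.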

I stand by my assertion that VAR no longer has any advantage over VBOs, although arekkusu has forced me to point out one caveat: You CAN get the partial-flush behavior with VBOs by using the APPLE_flush_buffer_range extension. This extension is not portable to other platforms, but neither is VAR.

Anyway, if you find yourself modifying only part of the contents of a buffer object repeatedly, you have probably organized your data wrong.

Ideally, you should structure your buffer objects such that one buffer object contains a set of data that will be used and discarded together, instead of partially modifying a buffer. i.e, rather than using BufferSubData or Map/Unmap, you can call BufferData(NULL) to void the entire contents. You may need to split a single object's vertex data across multiple buffer objects to do this (which is OK, even for an app concerned with performance). At this point, Map/Unmap can be used to re-fill the entire range of the buffer object cheaply.

Refer to the example that I linked to in the OpenGL programming guide. This pattern will perform well on all major OpenGL implementations, including ATI and NV on windows. If you stick to this pattern, then the partial-flush behavior isn't something you'll need to worry about.

For streaming data (that you will draw only once), the example in the OpenGL programming guide is actually very efficient.

For specific performance info on the GMA 950, memon's benchmarks are accurate and make a good guide to follow. I suggest using glDrawRangeElements(GL_TRIANGLE_STRIP), and sourcing vertex data from VBOs. It isn't the absolute fastest option on the GMA 950, but it is pretty close, and will make the higher-end hardware much happier (the higher-end hardware very much prefers DrawRangeElements over DrawArrays).
Sage
Posts: 1,232
Joined: 2002.10
Post: #13
Yes, VAR has no advantage over VBO + APPLE_flush_buffer_range.
Member
Posts: 26
Joined: 2006.09
Post: #14
There is one bit in my tests that made me a bit disappointed. In the index+strip, there were only N vertices, and N*2 indices, but it performed the same as N*2 vertices. There are many reasons I can think of why it is that way.
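To make the vertex and index counts concrete, here is a sketch of how an indexed triangle strip could be built for a closed rings x sides grid such as a torus. This is not the code used in the test; it assumes vertices are stored ring by ring and joins the per-ring strips with degenerate triangles so the whole mesh is one strip. The function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of indices torus_strip_indices() writes. */
size_t torus_strip_index_count(unsigned rings, unsigned sides)
{
    return (size_t)rings * 2u * (sides + 1u) + 2u * (rings - 1u);
}

/* Fill `out` with one triangle strip covering a closed grid.
   Vertex (ring r, side s) is assumed to live at index r * sides + s. */
size_t torus_strip_indices(unsigned rings, unsigned sides, unsigned *out)
{
    assert(rings >= 3 && sides >= 3 && out != NULL);
    size_t n = 0;
    for (unsigned r = 0; r < rings; ++r) {
        unsigned next = (r + 1u) % rings;   /* wrap around the last ring */
        if (r > 0) {
            /* Degenerate join: repeat the previous strip's last index and
               this strip's first index. */
            out[n] = out[n - 1];
            ++n;
            out[n++] = r * sides;
        }
        for (unsigned s = 0; s <= sides; ++s) {
            unsigned ss = s % sides;        /* wrap around within the ring */
            out[n++] = r * sides + ss;
            out[n++] = next * sides + ss;
        }
    }
    return n;
}
```

With this layout there are rings * sides unique vertices, and the index count comes out slightly above twice that, which matches the N vertices and N*2 indices ratio described above.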

I've yet to test interleaved vertex buffers, just in case I could get some advantage from cache there. Currently busy strippifying my drawing algorithms and trying to figure out the SSE support. I was thinking of using texture projection to apply my gradient textures, but I think I can do it faster myself, since I can optimise quite a few cases there.
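For reference, an interleaved layout keeps all attributes of one vertex next to each other in memory, so fetching a vertex touches one contiguous run of bytes. The struct below is a hypothetical layout (position, normal, one texture coordinate set); the stride and offsets it yields are the values that would be passed as the `stride` and `pointer` arguments of glVertexPointer, glNormalPointer and glTexCoordPointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: 8 floats, 32 bytes. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} InterleavedVertex;

static_assert(sizeof(InterleavedVertex) == 8 * sizeof(float),
              "layout is expected to be tightly packed");

/* Byte distance between consecutive vertices. */
size_t vertex_stride(void)
{
    return sizeof(InterleavedVertex);
}

/* Byte offsets of each attribute within one vertex. */
size_t position_offset(void) { return offsetof(InterleavedVertex, position); }
size_t normal_offset(void)   { return offsetof(InterleavedVertex, normal); }
size_t texcoord_offset(void) { return offsetof(InterleavedVertex, texcoord); }
```

When the data is sourced from a bound VBO, the offsets are passed as byte offsets into the buffer rather than as real pointers.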