Accelerate and working with Integer Arrays SIMD

Member
Posts: 95
Joined: 2009.09
Post: #1
Hey guys,

I recently watched the WWDC session on the Accelerate framework and how to use it to speed up vector arithmetic on the iPhone.
To fill my interleaved vertex arrays I have to do some vector calculations, and I thought I could improve my fill rates that way.

Sadly, BLAS only seems to support floating-point operations.

But for my element array, which tells OpenGL which triangles to connect, I need vector operations on integers.
Basically, I want to add a single unsigned int componentwise to a vector of unsigned ints (in MATLAB-like pseudocode):

bigArray[n:n+83] = smallArray[0:83] + v * ones(1, 84);
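A minimal C sketch of that operation, for comparison. It uses the GCC/Clang vector extension (vector_size) to add the scalar to four indices at a time, which is a swapped-in technique rather than an Accelerate or BLAS routine. It assumes a 32-bit unsigned int, and the name add_offset_u32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four 32-bit unsigned lanes in one 16-byte vector (GCC/Clang extension).
   Assumes unsigned int is 32 bits wide. */
typedef unsigned int u32x4 __attribute__((vector_size(16)));

/* Hypothetical helper: dst[i] = src[i] + v for i in [0, n),
   i.e. "bigArray[n:n+83] = smallArray[0:83] + v". */
static void add_offset_u32(unsigned int *dst, const unsigned int *src,
                           unsigned int v, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);
    const u32x4 vv = {v, v, v, v};   /* broadcast the scalar to all lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        u32x4 chunk;
        memcpy(&chunk, src + i, sizeof chunk);  /* alignment-safe load */
        chunk += vv;                            /* four adds in one vector op */
        memcpy(dst + i, &chunk, sizeof chunk);  /* alignment-safe store */
    }
    for (; i < n; ++i)   /* scalar tail for the last n % 4 elements */
        dst[i] = src[i] + v;
}
```

For the element array above this would be called as add_offset_u32(bigArray + n, smallArray, v, 84).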

EDIT: For the general interleaved vertex array of type:

Code:
```
typedef struct {
    float x;
    float y;
    float z;
} Vertex3;

typedef struct _iVertex3D {
    unsigned int color; // RGBA, 4 x GL_UNSIGNED_BYTE packed into 32 bits
    Vertex3 v;          // position
    Vertex3 n;          // normal
    float uv[2];        // texture coordinates
} iVertex3D;
```
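For the strided BLAS approach it helps to know the struct's size and field offsets. The sketch below reports them with sizeof and offsetof. On common targets with 4-byte int and float and no padding this gives 36 bytes per vertex, i.e. a stride of 9 floats, but that is an assumption about the platform, not something the C standard guarantees. float_stride and print_layout are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { float x; float y; float z; } Vertex3;
typedef struct _iVertex3D {
    unsigned int color;
    Vertex3 v;
    Vertex3 n;
    float uv[2];
} iVertex3D;

/* Stride between consecutive vertices counted in floats, which is how
   CBLAS incX/incY arguments are expressed (elements, not bytes). */
static size_t float_stride(void)
{
    assert(sizeof(iVertex3D) % sizeof(float) == 0);
    return sizeof(iVertex3D) / sizeof(float);
}

/* Print the layout so the strides and start offsets can be checked. */
static void print_layout(void)
{
    printf("sizeof(iVertex3D) = %zu bytes, stride = %zu floats\n",
           sizeof(iVertex3D), float_stride());
    printf("color @ %zu, v @ %zu, n @ %zu, uv @ %zu\n",
           offsetof(iVertex3D, color), offsetof(iVertex3D, v),
           offsetof(iVertex3D, n), offsetof(iVertex3D, uv));
}
```

With a stride of 9, a call such as cblas_scopy(N, ys, 1, &verts[0].v.y, 9) would write N floats into every vertex's y slot (verts and ys being hypothetical arrays).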

Using BLAS vector operations and some strides, I can overwrite all the float fields in this struct, but I'm having a hard time with the integer color component.
If I have to use a for-loop again to set the color, I lose all the SIMD benefit I was aiming for in the first place.
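One way around a per-vertex color loop, assuming the colors (and normals and texture coordinates) stay the same between fills, is to keep a prebuilt template array, copy it in bulk, and then overwrite only the fields that change. A minimal sketch; fill_from_template is a hypothetical helper, and the remaining position loop is the part a strided BLAS copy could replace.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { float x; float y; float z; } Vertex3;
typedef struct _iVertex3D {
    unsigned int color;
    Vertex3 v;
    Vertex3 n;
    float uv[2];
} iVertex3D;

/* Hypothetical helper: one memcpy brings over the template's colors,
   normals and uvs; only the positions are then overwritten per vertex. */
static void fill_from_template(iVertex3D *dst, const iVertex3D *tmpl,
                               const Vertex3 *positions, size_t count)
{
    assert((dst != NULL && tmpl != NULL && positions != NULL) || count == 0);
    memcpy(dst, tmpl, count * sizeof *dst);  /* colors come along for free */
    for (size_t i = 0; i < count; ++i)
        dst[i].v = positions[i];             /* only positions change */
}
```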
Member
Posts: 95
Joined: 2009.09
Post: #2
Basically, I'm looking for the integer equivalent of the BLAS function:

DAXPY
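BLAS has no integer AXPY, but the operation itself is simple. The sketch below is a hypothetical iaxpy with BLAS-style element strides. It is a plain loop, which an optimizing compiler can often auto-vectorize when both strides are 1, though that depends on the compiler and flags. Letting incX == 0 broadcast x[0] is a choice made in this sketch, not standard BLAS behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical integer analogue of ?AXPY: y[i*incY] += a * x[i*incX].
   Strides are in elements, as in BLAS; negative strides are not supported.
   incX == 0 reuses x[0] for every element (a choice of this sketch).
   Unsigned arithmetic wraps around on overflow. */
static void iaxpy(size_t n, unsigned int a,
                  const unsigned int *x, size_t incX,
                  unsigned int *y, size_t incY)
{
    assert(incY > 0);
    for (size_t i = 0; i < n; ++i)
        y[i * incY] += a * x[i * incX];
}
```

With incX == 0 it covers the offset case from the first post in place: iaxpy(84, v, &one, 0, idx, 1) adds v to 84 indices, where one is an unsigned int set to 1.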
Sage
Posts: 1,231
Joined: 2002.10
Post: #3
Side note: unsigned int != GLubyte[4]. If you're going to pack colors like that, please be aware of endianness.
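To illustrate the endianness point: with 4 x GL_UNSIGNED_BYTE colors, OpenGL reads the four bytes in memory order, so what matters is the byte layout, not the integer value. A minimal sketch with hypothetical helper names: copying a byte array into the unsigned int gives R, G, B, A in memory on any host, while the shift-based version only does so on little-endian hosts.

```c
#include <assert.h>
#include <string.h>

/* Byte-copy packing: memory order is always R, G, B, A,
   independent of host endianness. */
static unsigned int pack_rgba(unsigned char r, unsigned char g,
                              unsigned char b, unsigned char a)
{
    const unsigned char bytes[4] = {r, g, b, a};
    unsigned int packed;
    assert(sizeof packed == sizeof bytes);  /* assumes 32-bit unsigned int */
    memcpy(&packed, bytes, sizeof packed);
    return packed;
}

/* Shift-based packing: the integer value is fixed, so the bytes land as
   R, G, B, A in memory only on little-endian hosts. */
static unsigned int pack_rgba_shift_le(unsigned char r, unsigned char g,
                                       unsigned char b, unsigned char a)
{
    return (unsigned int)r | ((unsigned int)g << 8) |
           ((unsigned int)b << 16) | ((unsigned int)a << 24);
}
```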
Member
Posts: 95
Joined: 2009.09
Post: #4
arekkusu wrote: Side note: unsigned int != GLubyte[4]. If you're going to pack colors like that, please be aware of endianness.

Well, I'm using GL_UNSIGNED_BYTE and that works quite well.
The whole thing has been running for quite a while now; I just thought I could save some battery power and speed it up a little by using vector operations to fill my interleaved array.

Is anybody here not using for-loops to add vertices to an interleaved array?

It might be possible to save quite a lot of floating-point operations here, if done correctly. I can't be the only one wondering about this!
Member
Posts: 95
Joined: 2009.09
Post: #5
Managed to get this working and posted a new blog entry with a detailed speed comparison of OpenGL interleaved array fill rates:

Member
Posts: 226
Joined: 2008.08
Post: #6
I'm not sure whether compilers optimize memcpy, but either way it would be nice to see the SIMD example compared against memcpy and a for-loop like this:

Code:
```
//Assuming you're using triangles here (scrubVertexCount is a multiple of 3)
iVertex3D* dest = _interleavedVerts + _vertexCount;
memcpy(dest, basicScrub, scrubVertexCount * sizeof(iVertex3D));
//versus
for (int k = 0; k < scrubVertexCount; k += 3) {
    dest[k+0] = basicScrub[k+0];
    dest[k+1] = basicScrub[k+1];
    dest[k+2] = basicScrub[k+2];
}
```

It's just that I've only ever seen memcpy implemented as a straight byte-by-byte loop.
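A rough harness for the comparison suggested above might look like this. copy_memcpy, copy_loop and time_copy are hypothetical names; clock() measures processor time, and the numbers will depend heavily on compiler flags and hardware (an optimizer may even merge or drop repeated identical copies), so treat any result as indicative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef struct { float x; float y; float z; } Vertex3;
typedef struct _iVertex3D {
    unsigned int color;
    Vertex3 v;
    Vertex3 n;
    float uv[2];
} iVertex3D;

/* Variant 1: one bulk copy. */
static void copy_memcpy(iVertex3D *dst, const iVertex3D *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
}

/* Variant 2: element-wise struct assignment, one vertex per iteration. */
static void copy_loop(iVertex3D *dst, const iVertex3D *src, size_t count)
{
    for (size_t k = 0; k < count; ++k)
        dst[k] = src[k];
}

/* Processor time in seconds for `reps` calls of the given copy routine. */
static double time_copy(void (*copy)(iVertex3D *, const iVertex3D *, size_t),
                        iVertex3D *dst, const iVertex3D *src,
                        size_t count, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        copy(dst, src, count);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For example, compare time_copy(copy_loop, dst, src, count, 1000) with time_copy(copy_memcpy, dst, src, count, 1000) on the same buffers.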

Member
Posts: 95
Joined: 2009.09
Post: #7
"iVertex3D" is the data structure for one vertex, that is a point in 3D.
So I tried the following sequential code:
Code:
```
iVertex3D* dest = _interleavedVerts + _vertexCount;
for (int k = 0; k < scrubVertexCount; k++) {
    dest[k] = basicScrub[k];
}
```

Since I don't have my iPod cable with me, I could only test on the Simulator, where the loop gives about 10% worse results than memcpy.
But since I'm timing the whole drawing process for one scrub, 10% of the total seems like quite a big chunk for memcpy alone. I'll recheck on the iPod once I get home.

Update: It's roughly the same on the iPod; the loop also takes about 10% longer.