glTF in Unity optimization - 5. New Mesh API - The Refactor
This is part 5 of a mini-series.
TL;DR: The speed improvements from the advanced Mesh API for glTFast are marginal and needed to be earned via a time-consuming refactor.
Goal
In the last post I concluded that in order to use the advanced Mesh API to its fullest, I'd have to refactor a lot of things. In this post I want to show you what I did and compare the results.
Problem
A mesh's vertices come with a number of attributes. At minimum they have positions (in 3D space), but most often also normals, texture coordinates and tangents, and sometimes colors, bone weights or even additional texture coordinate sets. The advanced Mesh API only supports up to 4 vertex streams.
Solution
Once you have more than 4 vertex attributes, you have to combine some of them into one interleaved vertex stream. From a memory layout perspective this means instead of having one array per vertex attribute you now have to create an array of structs (AoS), where the struct contains multiple vertex attributes.
So instead of this…
var vertexCount = 100;
var positions = new Vector3[vertexCount];
var normals = new Vector3[vertexCount];
var tangents = new Vector4[vertexCount];
…you create something like this:
[StructLayout(LayoutKind.Sequential)]
struct Vertex {
    public Vector3 position;
    public Vector3 normal;
    public Vector4 tangent;
}

var vertexCount = 100;
var vertexData = new NativeArray<Vertex>(vertexCount, Allocator.TempJob);
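For the advanced Mesh API to consume such an interleaved array, the vertex buffer layout has to be declared up front and must mirror the struct field-for-field. A minimal sketch of what that declaration could look like (the mesh variable `msh` is a placeholder, not glTFast's actual code):

```csharp
using UnityEngine;
using UnityEngine.Rendering;

var vertexCount = 100;
var msh = new Mesh();
// The descriptors must match the Vertex struct above: same order,
// same formats, all packed into a single stream (stream 0).
msh.SetVertexBufferParams(
    vertexCount,
    new VertexAttributeDescriptor(VertexAttribute.Position, VertexAttributeFormat.Float32, 3, 0),
    new VertexAttributeDescriptor(VertexAttribute.Normal, VertexAttributeFormat.Float32, 3, 0),
    new VertexAttributeDescriptor(VertexAttribute.Tangent, VertexAttributeFormat.Float32, 4, 0)
);
```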
The next step is to retrieve the data from the glTF buffers into the NativeArray. This was done via C# Jobs before and had to be changed so that the interleaved layout of the output array is respected. I introduced an outputByteStride parameter for that reason. A new C# Job looks something like this:
public unsafe struct GetVector3sInterleavedJob : IJobParallelFor {

    public int inputByteStride;
    public byte* input;
    public int outputByteStride;
    public Vector3* result;

    public void Execute(int i) {
        // Advance by the byte stride instead of the element size, so the
        // output lands in the correct slot of the interleaved array.
        float* resultV = (float*) (((byte*)result) + (i * outputByteStride));
        byte* off = input + i * inputByteStride;
        // Copy x and y unchanged…
        *((Vector2*)resultV) = *((Vector2*)off);
        // …and negate z to convert from glTF's right-handed to
        // Unity's left-handed coordinate system.
        *(resultV + 2) = -*(((float*)off) + 2);
    }
}
I had to change all of the existing 50+ Jobs to support output byte strides. On top of that, I had to create new data structures/classes and determine how vertex data will be clustered before being able to schedule these new Jobs.
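Scheduling one of these Jobs then looks roughly like this. The names (`accessorData`, `positionOffset`, the batch size of 128) are illustrative placeholders, not glTFast's actual values; the important part is that outputByteStride equals the size of the interleaved struct:

```csharp
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

var positionOffset = 0;              // byte offset of the position field within Vertex
var vertexSize = sizeof(float) * 10; // 3 + 3 + 4 floats: stride of one Vertex

var job = new GetVector3sInterleavedJob {
    input = (byte*)accessorData,     // pointer into the raw glTF buffer
    inputByteStride = 12,            // tightly packed Vector3 positions
    outputByteStride = vertexSize,   // jump a whole Vertex per element
    result = (Vector3*)((byte*)vertexData.GetUnsafePtr() + positionOffset)
};
var handle = job.Schedule(vertexCount, 128); // batch size chosen arbitrarily
```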
This better be worth it 😃
Benchmarks
High polygon scene
The first test scene is a high resolution mesh (4 million triangles) with normals, tangents and texture coordinates. These are the most important timings when loading it with glTFast 1.0.0:
Phase | Duration |
---|---|
Prepare | 12 ms |
Retrieve Data | 19 ms |
Set Positions | 21 ms |
Set Indices | 65 ms |
Set UVs | 34 ms |
Set Normals | 59 ms |
Set Tangents | 101 ms |
Initial tests were not very promising. Overall loading times were identical or worse, not better. A little investigation showed that setting the data was taking too long. This led to a first learning: the advanced Mesh API is faster because it makes certain sanity checks (like index out of bounds) optional. But to benefit from that, you have to actually disable those checks via MeshUpdateFlags:
MeshUpdateFlags flags =
    MeshUpdateFlags.DontNotifyMeshUsers
    | MeshUpdateFlags.DontRecalculateBounds
    | MeshUpdateFlags.DontResetBoneBounds
    | MeshUpdateFlags.DontValidateIndices;

msh.SetVertexBufferParams(…);
msh.SetVertexBufferData(vertexData, 0, 0, vertexData.Length, stream, flags);
msh.SetIndexBufferData(indices, 0, 0, indices.Length, flags);
msh.SetSubMesh(0, new SubMeshDescriptor(0, indices.Length, topology), flags);
After this change, the timings improved:
Phase | Duration |
---|---|
Alloc VB array | 55 ms |
Alloc UV array | 8 ms |
Alloc Index array | 11 ms |
Retrieve Data | 30 ms |
Set VB Params | 47 ms |
Set VB Data | 14 ms |
Apply UV | 3 ms |
Set Index Data | 31 ms |
Set SubMesh | 0.004 ms |
Recalculate Bounds | 38 ms |
Since so many things changed at once here (the structure of the code base and the loading procedure in general), we have totally different load phases, and to some extent this is like comparing apples and oranges. I'll try to interpret the results as fairly as possible and figure out the actual causes.
The former Prepare phase consisted mostly of allocating C# arrays for mesh data (like Vector3[] for positions) and was surprisingly fast compared to the newer allocations of NativeArrays (Alloc VB = vertex buffer, UV = texture coordinates and Index arrays). Since in the past I got errors when holding NativeArrays for more than 4 frames (which occasionally happens when bulk-loading many scenes), I used the slower persistent allocator. But even changing the Allocator type to TempJob didn't bring much relief.
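The allocator trade-off in a nutshell, plus one knob I should mention: NativeArrays are zero-initialized by default, and passing NativeArrayOptions.UninitializedMemory skips that clear, which may shave off some of the allocation cost when the array is fully overwritten by Jobs anyway (a sketch, not necessarily what glTFast ends up doing):

```csharp
using Unity.Collections;

// Persistent: slower to allocate, but may live arbitrarily long.
// TempJob: faster allocator, but must be disposed within 4 frames,
// which bulk-loading across many frames can violate.
var vb = new NativeArray<Vertex>(
    vertexCount,
    Allocator.Persistent,
    NativeArrayOptions.UninitializedMemory // skip memset; Jobs overwrite it all
);
// Either way, disposal is manual:
vb.Dispose();
```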
Retrieving data (via C# Jobs) became up to 50% slower. I assume the addition of the output byte stride is the reason. On the positive side, the vertex buffer data jobs now start in parallel before the index buffers are allocated (5 to 10 ms overlap). In some test runs, data retrieval was not slower overall because of that. Makes me wonder how much potential lies in restructuring the Coroutines.
Let's compare setting the data. The new Set VB Params and Set VB Data phases (61 ms combined) replace the former Set Positions, Set Normals and Set Tangents (181 ms combined). It's hard to tell why it's so much faster: maybe the promised benefits of the advanced Mesh API, maybe the old interface triggered additional, dispensable calculations.
Set UVs was replaced by Apply UV and is 11 times faster!
Set Indices (65 ms) was replaced by Set Index Data and Set SubMesh (31 ms combined). I assume the difference exists because the old interface calculates the mesh bounds automatically. I added an explicit bounds recalculation (after getting errors about invalid bounds), which takes 38 ms. So overall this is comparable or slightly slower.
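For completeness, the explicit bounds step after opting out of automatic recalculation. Since glTF position accessors carry min/max values, the bounds could in principle also be set directly without iterating the vertex data at all — a possible future optimization, sketched here with placeholder `bMin`/`bMax` values (and ignoring the coordinate-system conversion for brevity):

```csharp
using UnityEngine;

// Option A: brute force, iterates all vertices (the ~38 ms measured above).
msh.RecalculateBounds();

// Option B (hypothetical): derive bounds from the glTF accessor's min/max.
var bMin = new Vector3(minX, minY, minZ); // from the position accessor's "min"
var bMax = new Vector3(maxX, maxY, maxZ); // from the position accessor's "max"
msh.bounds = new Bounds((bMin + bMax) * 0.5f, bMax - bMin);
```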
Most important are the overall load times, so I made 10 runs each, and loading got ~14% faster.
Scene | 1.0.0 | new Mesh API |
---|---|---|
High Polygon Scene | 716 ms | 618 ms |
Test sample set
Even with a specific test scene it is hard to pinpoint where exactly things got better or worse without knowing the inner workings. So I tried to get a feeling for the overall impact on a wide variety of less complex scenes. I took the glTF Sample Models set, removed irrelevant scenes (embedded buffers, Draco compression) and loaded the remaining 114 files all at once.
Scene | 1.0.0 | new Mesh API |
---|---|---|
glTF Sample Models | 10.6 sec | 10.1 sec |
That's a ~4-5% improvement.
Conclusion
The way I implemented usage of the advanced Mesh API is an ever so slight improvement, no doubt. It is not the silver bullet I hoped it would be, though.
I still think it is a good foundation to build upon and to try the next couple of things. Maybe I missed something and use it in a non-optimal way. I'd be delighted to be corrected in that case.
This work is not in a release yet, as I think it needs some more work and polishing before going live.
Where to go from here
With this refactor work being out of the way, I can move on to try other ideas for improvement. Some candidates that popped up as a result of this post:
- Maybe it is better to set a mesh's vertex parameters before/in parallel to data retrieval.
- Take a look at the now slower Jobs:
  - Schedule them as soon as possible for better parallelization
  - Use the Burst compiler and Unity.Mathematics
  - Optimized special Jobs for the most common cases
- Draco meshes were not involved thus far, so those should use the advanced Mesh API as well.
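Of these, Burst is probably the cheapest to try: in principle it is a one-attribute change per Job, assuming the com.unity.burst package is installed (a sketch, not yet part of glTFast):

```csharp
using Unity.Burst;
using Unity.Jobs;

[BurstCompile] // Burst compiles Execute to vectorized native code
public unsafe struct GetVector3sInterleavedJob : IJobParallelFor {
    // … same fields and Execute body as shown earlier;
    // the pointer arithmetic is already Burst-friendly.
    public void Execute(int i) { }
}
```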
Looking at the profiling data, more specifically the worker threads, it seems that the C# Jobs are not the bottleneck. So looking into smarter flow of the async loading process seems to be the most promising endeavour.
Next up
I haven't decided yet. It'll be a surprise 😃
Follow me on twitter or subscribe to the feed to not miss updates on this topic.
Next: Asynchronous Programming