glTF in Unity optimization - 5. New Mesh API - The Refactor
This is part 5 of a mini-series.
TL;DR: The speed improvements from the advanced Mesh API for glTFast are marginal and needed to be earned via a time-consuming refactor.
Goal
In the last post I concluded that in order to use the advanced Mesh API to its fullest, I'd have to refactor a lot of things. In this post I want to show you what I did and compare the results.
Problem
A mesh's vertices come with a number of attributes. At minimum they have positions (in 3D space), but most often also normals, texture coordinates and tangents, and sometimes colors, bone weights or even additional texture coordinate sets. The advanced Mesh API only supports up to 4 vertex streams.
Solution
Once you have more than 4 vertex attributes, you have to combine some of them into one interleaved vertex stream. From a memory layout perspective this means instead of having one array per vertex attribute you now have to create an array of structs (AoS), where the struct contains multiple vertex attributes.
So instead of this…
var vertexCount = 100;
var positions = new Vector3[vertexCount];
var normals = new Vector3[vertexCount];
var tangents = new Vector4[vertexCount];
…you create something like this:
[StructLayout(LayoutKind.Sequential)]
struct Vertex {
    public Vector3 position;
    public Vector3 normal;
    public Vector4 tangent;
}

var vertexCount = 100;
var vertexData = new NativeArray<Vertex>(vertexCount, Allocator.TempJob);
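For the advanced Mesh API to consume such an interleaved array, the vertex buffer layout has to be declared up front and must mirror the struct field-for-field. A minimal sketch of what that declaration could look like (the mesh variable `msh` is a placeholder, not glTFast's actual code):

```csharp
using UnityEngine;
using UnityEngine.Rendering;

var vertexCount = 100;
var msh = new Mesh();
// The descriptors must match the Vertex struct above: same order,
// same formats, all packed into a single stream (stream 0).
msh.SetVertexBufferParams(
    vertexCount,
    new VertexAttributeDescriptor(VertexAttribute.Position, VertexAttributeFormat.Float32, 3, 0),
    new VertexAttributeDescriptor(VertexAttribute.Normal, VertexAttributeFormat.Float32, 3, 0),
    new VertexAttributeDescriptor(VertexAttribute.Tangent, VertexAttributeFormat.Float32, 4, 0)
);
```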
The next step is to retrieve the data from the glTF buffers into the NativeArray. This was done via C# Jobs before and had to be changed so that the interleaved layout of the output array is respected. I introduced an outputByteStride parameter for that reason. A new C# Job looks something like this:
public unsafe struct GetVector3sInterleavedJob : IJobParallelFor {

    public int inputByteStride;
    public byte* input;
    public int outputByteStride;
    public Vector3* result;

    public void Execute(int i) {
        // Advance by the byte stride instead of the element size, so the
        // output lands in the correct slot of the interleaved array.
        float* resultV = (float*) (((byte*)result) + (i * outputByteStride));
        byte* off = input + i * inputByteStride;
        // Copy x and y unchanged…
        *((Vector2*)resultV) = *((Vector2*)off);
        // …and negate z to convert from glTF's right-handed to
        // Unity's left-handed coordinate system.
        *(resultV + 2) = -*(((float*)off) + 2);
    }
}
I had to change all of the existing 50+ Jobs to support output byte strides. On top of that, I had to create new data structures/classes and determine how vertex data will be clustered before being able to schedule these new Jobs.
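Scheduling one of these Jobs then looks roughly like this. The names (`accessorData`, `positionOffset`, the batch size of 128) are illustrative placeholders, not glTFast's actual values; the important part is that outputByteStride equals the size of the interleaved struct:

```csharp
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

var positionOffset = 0;              // byte offset of the position field within Vertex
var vertexSize = sizeof(float) * 10; // 3 + 3 + 4 floats: stride of one Vertex

var job = new GetVector3sInterleavedJob {
    input = (byte*)accessorData,     // pointer into the raw glTF buffer
    inputByteStride = 12,            // tightly packed Vector3 positions
    outputByteStride = vertexSize,   // jump a whole Vertex per element
    result = (Vector3*)((byte*)vertexData.GetUnsafePtr() + positionOffset)
};
var handle = job.Schedule(vertexCount, 128); // batch size chosen arbitrarily
```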
This better be worth it 😃
Benchmarks
High polygon scene
The first test scene is a high resolution mesh (4 million triangles) with normals, tangents and texture coordinates. These are the most important timings when loading it with glTFast 1.0.0:
Phase | Duration |
---|---|
Prepare | 12 ms |
Retrieve Data | 19 ms |
Set Positions | 21 ms |
Set Indices | 65 ms |
Set UVs | 34 ms |
Set Normals | 59 ms |
Set Tangents | 101 ms |
Initial tests were not very promising. Overall loading times were identical or worse, not better. A little investigation showed that setting the data was taking too long. This led to a first learning: the advanced Mesh API is faster because it makes certain sanity checks (like index out of bounds) optional. But to benefit from that, you have to actually disable those checks via MeshUpdateFlags:
MeshUpdateFlags flags =
    MeshUpdateFlags.DontNotifyMeshUsers
    | MeshUpdateFlags.DontRecalculateBounds
    | MeshUpdateFlags.DontResetBoneBounds
    | MeshUpdateFlags.DontValidateIndices;

msh.SetVertexBufferParams(…);
msh.SetVertexBufferData(vertexData, 0, 0, vertexData.Length, stream, flags);
msh.SetIndexBufferData(indices, 0, 0, indices.Length, flags);
msh.SetSubMesh(0, new SubMeshDescriptor(0, indices.Length, topology), flags);
After this change, the timings improved:
Phase | Duration |
---|---|
Alloc VB array | 55 ms |
Alloc UV array | 8 ms |
Alloc Index array | 11 ms |
Retrieve Data | 30 ms |
Set VB Params | 47 ms |
Set VB Data | 14 ms |
Apply UV | 3 ms |
Set Index Data | 31 ms |
Set SubMesh | 0.004 ms |
Recalculate Bounds | 38 ms |
Since so many things changed at once here (the structure of the code base and the loading procedure in general), we have totally different load phases, and to some extent this is like comparing apples and oranges. I'll try to interpret the results as fairly as possible and figure out the actual causes.
The former Prepare phase consisted mostly of allocating C# arrays for mesh data (like Vector3[] for positions) and was surprisingly fast compared to the newer allocations of NativeArrays (Alloc VB = vertex buffer, UV = texture coordinates and Index arrays). Since in the past I got errors when holding NativeArrays for more than 4 frames (which occasionally happens when bulk-loading many scenes), I used the slower persistent allocator. But even changing the Allocator type to TempJob didn't bring much relief.
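The allocator trade-off in a nutshell, plus one knob I should mention: NativeArrays are zero-initialized by default, and passing NativeArrayOptions.UninitializedMemory skips that clear, which may shave off some of the allocation cost when the array is fully overwritten by Jobs anyway (a sketch, not necessarily what glTFast ends up doing):

```csharp
using Unity.Collections;

// Persistent: slower to allocate, but may live arbitrarily long.
// TempJob: faster allocator, but must be disposed within 4 frames,
// which bulk-loading across many frames can violate.
var vb = new NativeArray<Vertex>(
    vertexCount,
    Allocator.Persistent,
    NativeArrayOptions.UninitializedMemory // skip memset; Jobs overwrite it all
);
// Either way, disposal is manual:
vb.Dispose();
```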
Retrieving data (via C# Jobs) became up to 50% slower. I assume the addition of the output byte stride is the reason. On the positive side, the vertex buffer data jobs now start in parallel before the index buffers are allocated (5 to 10 ms overlap). In some test runs, data retrieval was not slower overall because of that. Makes me wonder how much potential lies in restructuring the Coroutines.
Let's compare setting the data. The new Set VB Params and Set VB Data phases (61 ms combined) replace the former Set Positions, Set Normals and Set Tangents (181 ms combined). It's hard to tell why it's so much faster: maybe the promised benefits of the advanced Mesh API, maybe the old interface triggered additional, dispensable calculations.
Set UVs was replaced by Apply UV and is 11 times faster!
Set Indices (65 ms) was replaced by Set Index Data and Set SubMesh (31 ms combined). I assume the difference exists because the old interface calculates the mesh bounds automatically. I added an explicit bounds recalculation (after getting errors about invalid bounds), which takes 38 ms. So overall this is comparable or slightly slower.
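For completeness, the explicit bounds step after opting out of automatic recalculation. Since glTF position accessors carry min/max values, the bounds could in principle also be set directly without iterating the vertex data at all — a possible future optimization, sketched here with placeholder `bMin`/`bMax` values (and ignoring the coordinate-system conversion for brevity):

```csharp
using UnityEngine;

// Option A: brute force, iterates all vertices (the ~38 ms measured above).
msh.RecalculateBounds();

// Option B (hypothetical): derive bounds from the glTF accessor's min/max.
var bMin = new Vector3(minX, minY, minZ); // from the position accessor's "min"
var bMax = new Vector3(maxX, maxY, maxZ); // from the position accessor's "max"
msh.bounds = new Bounds((bMin + bMax) * 0.5f, bMax - bMin);
```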
Most important are the overall load times, so I made 10 runs each, and loading got ~14% faster.
Scene | 1.0.0 | new Mesh API |
---|---|---|
High Polygon Scene | 716 ms | 618 ms |
Test sample set
Even with a specific test scene it is hard to pinpoint where exactly things got better or worse without knowing the inner workings. So I tried to get a feeling for the overall impact on a wide variety of less complex scenes. I took the glTF Sample Models set, removed irrelevant scenes (embedded buffers, Draco compression) and loaded the remaining 114 files all at once.
Scene | 1.0.0 | new Mesh API |
---|---|---|
glTF Sample Models | 10.6 sec | 10.1 sec |
That's a ~4-5% improvement.
Conclusion
The way I implemented usage of the advanced Mesh API is an ever so slight improvement, no doubt. It is not the silver bullet I hoped it would be, though.
I still think it is a good foundation to build upon and to try the next couple of things. Maybe I missed something and use it in a non-optimal way. I'd be delighted to be corrected in that case.
This work is not in a release yet, as I think it needs some more work and polishing before going live.
Where to go from here
With this refactor work being out of the way, I can move on to try other ideas for improvement. Some candidates that popped up as a result of this post:
- Maybe it is better to set a mesh's vertex parameters before/in parallel to data retrieval.
- Take a look at the now slower Jobs:
  - Schedule them as soon as possible for better parallelization
  - Use the Burst compiler and Unity.Mathematics
  - Optimized special Jobs for the most common cases
- Draco meshes were not involved thus far, so those should use the advanced Mesh API as well.
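Of these, Burst is probably the cheapest to try: in principle it is a one-attribute change per Job, assuming the com.unity.burst package is installed (a sketch, not yet part of glTFast):

```csharp
using Unity.Burst;
using Unity.Jobs;

[BurstCompile] // Burst compiles Execute to vectorized native code
public unsafe struct GetVector3sInterleavedJob : IJobParallelFor {
    // … same fields and Execute body as shown earlier;
    // the pointer arithmetic is already Burst-friendly.
    public void Execute(int i) { }
}
```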
Looking at the profiling data, more specifically the worker threads, it seems that the C# Jobs are not the bottleneck. So looking into smarter flow of the async loading process seems to be the most promising endeavour.
Next up
I haven't decided yet. It'll be a surprise 😃
Follow me on twitter or subscribe to the feed to not miss updates on this topic.
Next: Asynchronous Programming