HLSL Templates Part 4: Lazy AST Generation

Discovery

Like many gnarly bugs, this issue was discovered while debugging a different user reported issue: #4263. When I attempted to compile the test case attached to issue 4263 I noticed that there were two separable bugs occurring. The one described by the title of that issue and one which I captured in issue #4315.

This post discusses the later, which has a fix merged in PR 4317.

Debugging

Starting with the test case provided in issue #4263, I removed the second Load override to provide a reduced test case for this issue. The reduced test case results in the following error:

InstantiateObjectMethods.hlsl:5:61: error: no member named 'Load' in 'Texture2D<vector<float, 4> >'
  template <typename Arg0> T Load(Arg0 arg0) { return Get().Load(arg0); }
                                                      ~~~~~ ^

Searching the *.td files in the codebase for “no member named” revealed the error err_no_member, which (thankfully) is only issued in a few places. I quickly narrowed the issue location down to Sema::BuildMemberReferenceExpr (note: there are two functions by that name with different arguments). Walking up the call stack into the other Sema::BuildMemberReferenceExpr, it seemed the only way to get to the error was for member function lookup to fail.

This reminded me that some built-in HLSL functions are added when resolving overloads, so I went to look at where Texture2D::Load comes from. It is not added during overload resolution, that mechanism is only used for built-in functions, not members of built-in types. Built-in data types, like Texture2D are added by a HLSLExternalSource which is an implementation of ExternalSemaSource, and their methods are added through a different custom code path.

ExternalSemaSource and its base class ExternalASTSource are abstractions that allow inserting hooks to generate AST and Semantic resolutions in Clang. They are used by Clang’s IDE tooling libraries and LLDB to create valid ASTs without full underlying source (not dissimilar from how DXC uses them). In DXC we use some of the hooks, but not all of them.

My first realization was that the code to generate object methods AddObjectMethods as never getting called in my test case. I started trying to read the code to figure out where it was supposed to be called, which identified the problem.

Object methods are added in the Sema::DiagnoseHLSLDecl method, which is called in three places Sema::HandleDeclarator, Sema::ActOnParamDeclarator, and Sema::HandleField, none of which are called in the template instantiation code path.

Breaking Down the Problem

Template instantiation is a weird chunk of code, but it solves a very difficult problem. I wrote a bit about how clang breaks down templates in my (last post)[2023-01-09-Templates-Postmortem-3.md], but I’ll recap here. Clang breaks templates down into two pieces, a template declaration, and a template specialization. The declaration is the template without any of the template parameters specified, and the specialization is the template with the template parameters partially or completely specified.

Template declarations can be broken apart into dependent and non-dependent contexts. As an example:

template <typename T>
struct Doggo {
  T Legs() {
    return 4 * (T)Tails();
  }

  int Tails() {
    return 1;
  }
};

constexper int MaxTails = 9;

In this example, Legs is a dependent context because it changes depending on the value of T. Tails is odd because it is an non-dependent declaration contained in a dependent context (since Doggo is dependent). MaxTails is not dependent because it is independent of T.

In C++, dependent template contexts have very different rules. Most C++ compilers validate syntax of dependent contexts during parsing, but don’t validate the semantics until instantiation which is lazy and late. There is one slight catch to that, which is dependent contexts like Tails that don’t change as a result of the template. They can be validated during parsing (although the compiler may choose not to).

Further, a compiler may choose to selectively instantiate individual declarations in a template. This would allow instantiating Doggo for values of T where Legs isn’t semantically valid, as long as Legs is never called.

The differences between dependent and non-dependent contexts is why our other test cases passed. When builtin types are specified under templates, but not template derived the parser handles the declarations the same way non-template code would be handled.

Clang postpones many template specializations until the end of the translation unit, and performs the instantiation in a work loop, where each instantiation can cause other instantiations to be processed. This is not required by the language, but it is a common implementation.

These distinctions are important to understand because the indicate the root of the issue at hand. In the test case provided Texture2D’s template parameter is T. That makes all uses of Texture2D dependent on T. This means they are not evaluated in the parser, which is the only place we generate method declarations.

Potential Solution #1 - The Minimalist Approach

My first stab at a solution was to try and populate the methods during template instantiation, by calling the existing method to lazily add methods if they weren’t already defined. This didn’t work, and led to a lot of frustration. Clang method lookups are based on declaration contexts (DeclContext), which build maps of declarations to resolve against. I could not figure out a way to append the member declarations using the existing methods and have them appear in the DeclContext::lookup results. I believe (but never quite figured out) that this is because the StoredDeclsMap is cached, and not invalidated when members are added to the class.

After spending too much time on this I realized I should move on and try something different.

Potential Solution #2 - The Last Resort

I knew the problem was that the member declarations were not being added, and that there was a lazy system for adding them. Why not disable the lazy initialization of members?

60%

That’s how much slower DXC got for compiling a pass through vertex shader (similar to the ones we have in our execution tests).

I spent an hour trying to figure out if I could speed up adding the member functions to make this less terrible before realizing that was a lost cause.

Next option.

Actual Solution

After realizing how bad the performance was if we disabled the lazy initialization I went back to my original idea, but started thinking bigger. Rather than thinking about how to insert the member declarations during template instantiation, I started thinking about inserting them during member function resolution. That would cover all the cases.

This line of thought got me poking more around how member lookups worked. Eventually, I stumbled across Sema::RequireCompleteType, and things started to click into place.

In C and C++, you can forward declare types making the name valid, but there are limitations as to what you can do with an “incomplete” type. Specifically, you can pass around pointers and references to incomplete types, you can use an incomplete typename in a typedef, alias or friend declaration, and that’s about it. If you need to pass it by value or call of an object’s methods: you need a complete type. This aligns pretty closely to what we need for HLSL.

Clang is littered with checks all over the place with calls to Sema::RequireCompleteType, and Sema::RequireCompleteTypeImpl calls into ExternalSemaSource::CompleteType.

Once I put all this together the solution became simple. In ASTContextHLSL.cpp where we create the stub objects, instead of creating them as complete definitions we can leave them incomplete. Then, when they are required to be complete Clang’s existing code will call the HLSLExternalSource::CompleteType method to perform the completion. Turning HLSLExternalSource::AddHLSLObjectMethodsIfNotReady into CompleteType rounds out a working implementation, and stripping out the bitfield used to track which objects have been completed is an added benefit.

The end result: lazy initialization of built in types driven by Clang’s ExternalSemaSource, and no extra overhead for HLSL. Better yet, in my anecdotal testing this approach improves compile time (I see the test suite run consistently 1-2% faster with this change). The improvement is likely because IsCompleteDecl() is header-implemented (and tiny) so it can be inlined everywhere avoiding calling into the ExternalSemaSource unless absolutely needed. The old implementation always called in because the bitfield tracking which objects were initialized was stored in the ExternalSemaSource, which (1) is a polymorphic object and (2) is declared and implemented in a cpp file so it is private.

Abolish \r\n

Removing Vestigial Technology