Yesterday, I outlined how the compiler and linker cooperate to support TLS. However, I didn’t mention just what exactly goes on under the hood when one declares a __declspec(thread) variable and accesses it.
Before the inner workings of a __declspec(thread) variable access can be explained, however, it is necessary to discuss several more special variables in tlssup.c. These special variables are referenced by _tls_used to create the TLS directory for the image.
The first variable of interest is _tls_index, which is implicitly referenced by the compiler in the per-thread storage resolution mechanism any time a thread local variable is referenced (well, almost every time; there’s an exception to this, which I’ll mention later on). _tls_index is also the only variable declared in tlssup.c that uses the default allocation storage class. Internally, it represents the current module’s TLS index. The per-module TLS index is, in principle, similar to a TLS index returned by TlsAlloc. However, the two are not compatible, and there is significantly more work behind the per-module TLS index and its supporting code. I’ll cover all of that later as well; for now, just bear with me.
The next two variables of interest are _tls_start and _tls_end. Their definitions appear as follows in tlssup.c:
#pragma data_seg(".tls")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")
#endif
char _tls_start = 0;

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;
This code creates the two variables and places them at the start and end of the “.tls” section. The compiler and linker will automatically assume a default allocation section of “.tls” for all __declspec(thread) variables, such that they will be placed between _tls_start and _tls_end in the final image. The two variables are used to tell the linker the bounds of the TLS storage template section, via the image’s TLS directory (_tls_used).
Now that we know how __declspec(thread) works from a language level, it is necessary to understand the supporting code the compiler generates for an access to a __declspec(thread) variable. This supporting code is, fortunately, fairly straightforward. Consider the following test program:
__declspec(thread) int threadedint = 0;

int __cdecl wmain(int ac, wchar_t **av)
{
    threadedint = 42;

    return 0;
}
For x64, the compiler generated the following code:
mov ecx, DWORD PTR _tls_index
mov rax, QWORD PTR gs:88
mov edx, OFFSET FLAT:threadedint
mov rax, QWORD PTR [rax+rcx*8]
mov DWORD PTR [rdx+rax], 42
Recall that the gs segment register refers to the base address of the TEB on x64. 88 (0x58) is the offset in the TEB for the ThreadLocalStoragePointer member on x64 (more on that later):
+0x058 ThreadLocalStoragePointer : Ptr64 Void
If we examine the code after the linker has run, however, we’ll notice something strange:
mov ecx, cs:_tls_index
mov rax, gs:58h
mov edx, 4
mov rax, [rax+rcx*8]
mov dword ptr [rdx+rax], 2Ah ; 42
xor eax, eax
If you haven’t noticed it already, the offset of the “threadedint” variable was resolved to a small value (4). Recall that in the pre-link disassembly, the “mov edx, 4” instruction was “mov edx, OFFSET FLAT:threadedint”.
Now, 4 isn’t a very flat address (one would expect an address within the confines of the executable image to be used). What happened?
Well, it turns out that the linker has some tricks up its sleeve that were put into play here. The “offset” of a __declspec(thread) variable is assumed to be relative to the base of the “.tls” section by the linker when it is resolving address references. If one examines the “.tls” section of the image, things begin to make a bit more sense:
0000000001007000 _tls segment para public 'DATA' use64
0000000001007000 assume cs:_tls
0000000001007000 ;org 1007000h
0000000001007000 _tls_start dd 0
0000000001007004 ; int threadedint
0000000001007004 ?threadedint@@3HA dd 0
0000000001007008 _tls_end dd 0
The offset of “threadedint” from the start of the “.tls” section is indeed 4 bytes. But all of this still doesn’t explain how the instructions the compiler generated access a variable that is instanced per thread.
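As an aside, the subtraction the linker performs here (the variable’s address minus the base of the “.tls” section) can be mimicked in portable C. This is only a sketch under stated assumptions: TLS_SECTION_MODEL and section_offset are names I made up for illustration, and the 4-byte fields mirror the dd entries in the listing above rather than the char declarations in tlssup.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the ".tls" section layout from the listing above.
   Each field is 4 bytes wide, matching the "dd" entries. */
typedef struct TLS_SECTION_MODEL {
    int32_t tls_start;    /* 0x1007000 in the listing */
    int32_t threadedint;  /* 0x1007004 in the listing */
    int32_t tls_end;      /* 0x1007008 in the listing */
} TLS_SECTION_MODEL;

/* Returns the offset of a variable relative to the start of the modeled
   section. This is the kind of value that replaces
   "OFFSET FLAT:threadedint" in the linked image. */
size_t section_offset(const TLS_SECTION_MODEL *section, const int32_t *variable)
{
    const char *base = (const char *)section;
    const char *addr = (const char *)variable;

    /* The variable must actually live inside the modeled section. */
    assert(addr >= base && addr < base + sizeof *section);

    return (size_t)(addr - base);
}
```

Calling section_offset with the address of the threadedint field gives the same small value that ended up in the “mov edx, 4” instruction.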
The “secret sauce” here lies in the following three instructions:
mov ecx, cs:_tls_index
mov rax, gs:58h
mov rax, [rax+rcx*8]
These instructions fetch ThreadLocalStoragePointer out of the TEB and index it by _tls_index. The resulting pointer is then indexed again with the offset of threadedint from the start of the “.tls” section to form a complete pointer to this thread’s instance of the threadedint variable.
In C, the code that the compiler generated could be visualized as follows:
// This represents the ".tls" section
typedef struct _MODULE_TLS_DATA
{
    int tls_start;
    int threadedint;
    int tls_end;
} MODULE_TLS_DATA, * PMODULE_TLS_DATA;

PTEB             Teb;
PMODULE_TLS_DATA TlsData;

Teb     = NtCurrentTeb();
// ThreadLocalStoragePointer is treated as an array of per-module TLS block pointers
TlsData = ((PMODULE_TLS_DATA *)Teb->ThreadLocalStoragePointer)[ _tls_index ];

TlsData->threadedint = 42;
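If you don’t have a Windows toolchain handy, the same two-step lookup can be modeled in portable C. Everything in this sketch is a stand-in: FAKE_TEB, fake_tls_index, tls_address, and store_threadedint are hypothetical names, the “TEB” is an ordinary struct rather than the real thread environment block, and the index value 1 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MODULES 4   /* arbitrary slot count for this toy model */

/* Hypothetical stand-in for the one TEB field that matters here. */
typedef struct FAKE_TEB {
    void **ThreadLocalStoragePointer;  /* one slot per module TLS index */
} FAKE_TEB;

/* Stand-in for the module's _tls_index variable (value chosen arbitrarily). */
static uint32_t fake_tls_index = 1;

/* Mirrors the generated x64 code:
     mov ecx, cs:_tls_index   -> index
     mov rax, gs:58h          -> teb->ThreadLocalStoragePointer
     mov rax, [rax+rcx*8]     -> this thread's copy of the module's TLS data
     ... [rdx+rax]            -> add the variable's section-relative offset */
void *tls_address(const FAKE_TEB *teb, uint32_t index, size_t offset)
{
    assert(index < MAX_MODULES);

    char *block = (char *)teb->ThreadLocalStoragePointer[index];
    return block + offset;
}

/* Performs the equivalent of "threadedint = 42" for one simulated thread. */
void store_threadedint(const FAKE_TEB *teb, int32_t value)
{
    /* 4 is the offset of threadedint from the start of ".tls". */
    *(int32_t *)tls_address(teb, fake_tls_index, 4) = value;
}
```

Giving two FAKE_TEB instances their own slot arrays and data blocks, then storing through only one of them, leaves the other thread’s copy untouched, which is the whole point of the per-thread indirection.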
This should look familiar if you’ve used explicit TLS before. The typical paradigm for explicit TLS is to place a structure pointer in a TLS slot. To access your thread local state, you retrieve the per-thread instance of the structure from the slot and then reference the appropriate variable off of the structure pointer. The difference here is that the compiler and linker (and loader, more on that later) cooperated to save you (the programmer) from having to do all of that explicitly; all you had to do was declare a __declspec(thread) variable and all of this happens magically behind the scenes.
There’s actually an additional curve that the compiler will sometimes throw with respect to how implicit TLS variables work from a code generation perspective. You may have noticed that I showed the x64 version of an access to a __declspec(thread) variable; this is because, by default, x86 builds of a .exe involve a special optimization, /GA (“Optimize for Windows Application”, quite possibly the worst name for a compiler flag ever), that eliminates the step of referencing the special _tls_index variable by assuming that it is zero.
This optimization is only possible with a .exe that will run as the main process image. The assumption works in this case because the loader assigns per-module TLS index values on a sequential basis (based on the loaded module list), and the main process image should be the second thing in the loaded module list, after NTDLL (which, now that this optimization is being used, can never have any __declspec(thread) variables, or it would get TLS index zero instead of the main process image). It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread): its code still assumes index zero, so its accesses go through whatever TLS block occupies slot zero rather than its own.
For reference, with /GA enabled, the x86 build of the above code results in the following instructions:
mov eax, large fs:2Ch
mov ecx, [eax]
mov dword ptr [ecx+4], 2Ah ; 42
Remember that on x86, fs points to the base address of the TEB, and that ThreadLocalStoragePointer is at offset +0x2C from the base of the x86 TEB.
Notice that there is no reference to _tls_index; the compiler assumes that it will take on the value zero. If one examines a .dll built with the x86 compiler, the /GA optimization is always disabled, and _tls_index is used as expected.
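The difference between the two code generation strategies, and the hazard of the zero-index assumption, can be shown with a toy model in portable C. This is a sketch, not how the compiler or loader is implemented: indexed_store and ga_store are hypothetical helpers, and the slot array is a plain array standing in for ThreadLocalStoragePointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the way a build without /GA does: honor the module's real TLS index. */
void indexed_store(void **tls_slots, uint32_t tls_index, size_t offset, int32_t value)
{
    assert(tls_slots[tls_index] != NULL);

    *(int32_t *)((char *)tls_slots[tls_index] + offset) = value;
}

/* Store the way a /GA build does: the index is assumed to be zero, so
   only slot 0 is ever consulted (the "mov ecx, [eax]" step). */
void ga_store(void **tls_slots, size_t offset, int32_t value)
{
    indexed_store(tls_slots, 0, offset, value);
}
```

Suppose slot 0 holds the main image’s data and slot 1 holds the data of a second module. If that second module’s code goes through ga_store, its write lands in the main image’s block at the same offset, which is the kind of silent corruption the zero-index assumption invites; indexed_store with index 1 writes to the right place.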
The magic behind __declspec(thread) extends beyond just the compiler and linker, however. Something still has to set up the storage for each module’s per-thread state, and that something is the loader. More on how the loader plays a part in this complex process next time.