Welcome to the 5th installment of the series I’m doing on a C/C++ Low-Level Curriculum. This is the 3rd post about the Stack. The fundamentals were covered a couple of posts ago; the previous post and this one are really just extra information to round out the picture of the ways the Stack is used in win32 x86 function calls. After this we can move on to other low level aspects of the C/C++ languages.

The last two (win32 x86) function calling conventions we’re going to look at are thiscall which is used for calling non-static member functions of classes, and fastcall which emphasises register use over stack use for parameters. As with the previous posts about the Stack, the point of this isn’t so much the specific calling conventions that we’re examining, but rather to see the different ways that the Stack and registers are used to pass information around when functions are called.

 

Previously on #AltDevBlogADay…

If you missed the previous C/C++ Low Level Curriculum posts, here are some backlinks:

  1. http://altdevblogaday.com/2011/11/09/a-low-level-curriculum-for-c-and-c/
  2. http://altdevblogaday.com/2011/11/24/c-c-low-level-curriculum-part-2-data-types/
  3. http://altdevblogaday.com/2011/12/14/c-c-low-level-curriculum-part-3-the-stack/
  4. http://altdevblogaday.com/2011/12/24/c-c-low-level-curriculum-part-4-more-stack/

Generally I will try to avoid too much assumed knowledge, but this post does assume that you have read the posts linked above as 3 and 4 (or have a working knowledge of how the Stack works in vanilla x86 assembler, in which case why are you reading this!?).

 

Compiling and running code from this article

I assume that you are familiar with the VS2010 IDE, and comfortable writing, running, and debugging C++ programs.

As with the previous posts in this series, I’m using a win32 console application made by the “new project” wizard in VS2010 with the default options (express edition is fine).

The only change I make from the default project setup is to turn off “Basic Runtime Checks” to make the generated assembler more legible (and significantly faster…); see this previous post for details on how to do this.

To run code from this article in a VS2010 project created this way, open the .cpp file that isn’t stdafx.cpp and replace everything below the line #include "stdafx.h" with text copied and pasted from the code box.

The disassembly we look at is from the debug build configuration, which will generate “vanilla” unoptimised win32 x86 code.

 

The “thiscall” calling convention

As I’m sure you’re aware, in any non-static class member function it is possible to access a pointer to the instance of the class that the function was called on via the C++ keyword this.

The presence of the this pointer is often explained away by saying that it is an invisible “0th parameter to member functions”, which isn’t necessarily incorrect but is the same kind of truth that Obi-Wan Kenobi might have dealt in if he had been a computer science professor rather than a retired Jedi Knight; that is to say “true, from a certain point of view”.

The thiscall calling convention is more or less exactly the same as the stdcall calling convention we have already looked at in some detail in the last two posts (this->pPrevious->pPrevious, this->pPrevious). Though it is the default calling convention used by the VS2010 compiler for non-static member functions, it’s worth noting that there are situations where the compiler won’t use it (e.g. if your function uses the ellipsis operator to take a variable number of arguments).

As we have seen in the last two posts, the unoptimised win32 x86 stdcall calling convention passes its parameters on the Stack. The thiscall convention obviously must somehow pass the this pointer to member functions, but rather than storing an extra parameter on the Stack, it uses a register (ecx) to pass it to the called function.

The code below demonstrates this…

class CSumOf
{
public:
    int m_iSumOf;

    void SumOf( int iParamOne, int iParamTwo )
    {
        m_iSumOf = iParamOne + iParamTwo;
    }
};

int main( int argc, char** argv )
{
    int iValOne        = 1;
    int iValTwo        = 2;
    CSumOf cMySumOf;
    cMySumOf.SumOf( iValOne, iValTwo );
    return 0;
}

Paste this into VS2010, and put a breakpoint on the line

cMySumOf.SumOf( iValOne, iValTwo );

Run the debug build configuration; when the breakpoint is hit, right click and choose “Go To Disassembly”, and you should see something like this (n.b. the addresses in the leftmost column of the disassembly will almost certainly differ):

Make sure that the check boxes in your right-click context menu match those shown in this screenshot, or your disassembly will not match mine!

The block of assembler that we’re interested in for the purposes of illustrating how the thiscall convention works is shown below:

    14:     int iValOne        = 1;
00EE1259  mov         dword ptr [iValOne],1
    15:     int iValTwo        = 2;
00EE1260  mov         dword ptr [iValTwo],2
    16:     CSumOf cMySumOf;
    17:     cMySumOf.SumOf( iValOne, iValTwo );
00EE1267  mov         eax,dword ptr [iValTwo]
00EE126A  push        eax
00EE126B  mov         ecx,dword ptr [iValOne]
00EE126E  push        ecx
00EE126F  lea         ecx,[cMySumOf]
00EE1272  call        CSumOf::SumOf (0EE112Ch)

The assembler involved with calling CSumOf::SumOf() starts at line 7 and goes to line 12.

Lines 7 to 10 are pushing the parameters to the function onto the stack in reverse order of declaration, exactly as with the stdcall convention we looked at in the previous article.

Line 11 is storing the address of cMySumOf in ecx using the instruction lea. If you right click and un-check “Show Symbol Names” you can see that lea is computing the address of cMySumOf given its offset from the ebp register.

Line 12 is obviously calling the function.

Stepping into the function call in the disassembly you should see the following: (not forgetting that we have to step through an additional jmp instruction before we get there because of VS2010 incremental linking – see approx. half way through this post for the details)

     6:     void SumOf( int iParamOne, int iParamTwo )
     7:     {
00EE1280  push        ebp
00EE1281  mov         ebp,esp
00EE1283  sub         esp,44h
00EE1286  push        ebx
00EE1287  push        esi
00EE1288  push        edi
00EE1289  mov         dword ptr [ebp-4],ecx
     8:         m_iSumOf = iParamOne + iParamTwo;
00EE128C  mov         eax,dword ptr [iParamOne]
00EE128F  add         eax,dword ptr [iParamTwo]
00EE1292  mov         ecx,dword ptr [this]
00EE1295  mov         dword ptr [ecx],eax
     9:     }

The calling code stored the address of the local variable cMySumOf (the instance the function is being called on) in the ecx register before calling this function. If we examine line 9 in the code box above, we can see that – compared to the stdcall assembler – the function prologue has an extra step: it moves the value in ecx into a memory address within the function’s stack frame (i.e. ebp-4). The upshot of this is that after line 9 [ebp-4] stores the function’s this pointer.

The function then proceeds exactly as you might expect from the disassembly we’ve examined in previous articles up until line 13.

Line 13 moves the this pointer (previously stored in the function’s stack frame) into ecx, then line 14 stores the value of eax into the address specified by ecx (remember: in the VS2010 disassembly view, values in [square brackets] are memory accesses, taking the address to access from the value in the brackets). If you right click in the disassembly window and un-check “Show Symbol Names” you will see that the symbol this corresponds to ebp-4, which is where the value of ecx was stored at the end of the function prologue.

The astute amongst you will have noticed that the assembler is storing the this pointer from ecx into the Stack only to re-load it into ecx later, without having used the register in the intervening time. This is exactly the kind of odd thing that un-optimised compiler generated assembler will do; try not to let it bother you :)

So the sum of the two parameters is stored using the this pointer, and then we hit the function epilogue and the function returns; end of story – or is it?

 

Nothing to see here. Move along.

This is not what you might expect because – based on what we’ve seen so far – the assembler that sets CSumOf::m_iSumOf in the member function doesn’t obviously match the C++ code we wrote.

What we’re seeing looks like it might have been generated by the code

*((int*) this) = iParamOne + iParamTwo;

And in fact if you substitute that line it will generate exactly the same assembler – so how does that work?!?

// Here's what we wrote. Since m_iSumOf is a class member the language syntax allows us to
// "access it directly" (another Professor Kenobiism) in the member function
m_iSumOf = iParamOne + iParamTwo;

// in fact, what happens is that the compiler evaluates the code as if it was written like this
this->m_iSumOf = iParamOne + iParamTwo;

Ok, so there’s invisible pointer access in the C++ code, but that still doesn’t explain what we’re seeing – exactly how is

*((int*) this)

equivalent to

this->m_iSumOf

The answer has to do with memory layout of C++ classes (and structs), which is a topic for another entire article (probably several).

For now we’ll keep the explanation simple whilst trying not to channel our friend Professor Kenobi more than absolutely necessary…

First let’s take it as read that the member data for an instance of a class must be stored somewhere in memory, and take a high level look at how the “pointing to” operator (->) works with another code snippet:

this->m_iSumOf = 0;

This basically tells the compiler to generate assembler that:

  • gets the value of this (a memory address)
  • looks up the offset of m_iSumOf relative to the start of the data making up an instance of CSumOf (which is known at compile time, so it’s constant at run time)
  • adds the offset to the address of this to get the memory address storing m_iSumOf and then sets the value at the resulting memory address to 0

The this pointer holds the address of the first byte of the data in an instance of CSumOf.

The first (and only) member variable in CSumOf is m_iSumOf, which puts it at an offset of 0 relative to the this pointer – and clearly even a debug build knows better than to add an offset of 0, so it accesses the memory at the address this.

So, again, we can see that even in seemingly innocuous everyday C++ code there is hidden stuff going on – which is a big part of why I’m doing this series :)

Incidentally, I have recently been made aware of an unbelievably useful (and undocumented!) feature of the VS2010 compiler which prints the memory layout of classes to the build output during compilation: here’s the link I was sent, I hope you find it useful: http://thetweaker.wordpress.com/2010/11/07/d1reportallclasslayout-dumping-object-memory-layout/

 

fastcall (last one, I promise)

At last we come to the win32 x86 calling convention excitingly named fastcall, so named because in theory it makes function calls faster (than the more common stdcall or cdecl conventions).

So why is it faster than the other calling conventions that we’ve looked at? To answer this, we’ll need to examine the assembler generated by a function call that uses the fastcall convention.

To demonstrate this we’ll use the code below:

int __fastcall SumOf( int iParamOne, int iParamTwo, int iParamThree )
{
    int iLocal = iParamOne + iParamTwo + iParamThree;
    return iLocal;
}

int main( int argc, char** argv )
{
    int iValOne   = 1;
    int iValTwo   = 2;
    int iValThree = 4;
    int iResult   = SumOf( iValOne, iValTwo, iValThree );
    return 0;
}

This is basically the same as the code used in the previous post in the series to show how the stdcall calling convention stores multiple parameters on the stack, except the function SumOf has got an extra keyword between the return type and the name of the function.

The __fastcall keyword is a not-quite Microsoft specific C++ extension that changes the calling convention used to call the function it is applied to (http://en.wikipedia.org/wiki/X86_calling_conventions#fastcall).

If you follow the usual drill to make a runnable project from this snippet, put a breakpoint on line 12, then compile and run the debug configuration, wait for the breakpoint to get hit, and go to disassembly you should see something like this:

     8: int main( int argc, char** argv )
     9: {
010F1280  push        ebp
010F1281  mov         ebp,esp
010F1283  sub         esp,50h
010F1286  push        ebx
010F1287  push        esi
010F1288  push        edi
    10:     int iValOne   = 1;
010F1289  mov         dword ptr [iValOne],1
    11:     int iValTwo   = 2;
010F1290  mov         dword ptr [iValTwo],2
    12:     int iValThree = 4;
010F1297  mov         dword ptr [iValThree],4
    13:     int iResult   = SumOf( iValOne, iValTwo, iValThree );
010F129E  mov         eax,dword ptr [iValThree]
010F12A1  push        eax
010F12A2  mov         edx,dword ptr [iValTwo]
010F12A5  mov         ecx,dword ptr [iValOne]
010F12A8  call        SumOf (10F1136h)
010F12AD  mov         dword ptr [iResult],eax
    14:     return 0;
010F12B0  xor         eax,eax
    15: }

You should by this point be pretty familiar with function prologues, and the assembler that precedes a function call in the other conventions we’ve examined, so we’ll just look at the differences with __fastcall.

Looking at lines 16 to 20, we can see that of the three parameters passed to SumOf():

  • the 3rd (iValThree) is being pushed onto the stack,
  • the 2nd (iValTwo) is being moved into the edx register, and
  • the 1st (iValOne) is being moved into the ecx register

Stepping into the disassembly of SumOf() you should see something like this (N.B. I unchecked “Show Symbol Names” before grabbing this text from the disassembly view so the addresses were all visible):

     2: int __fastcall SumOf( int iParamOne, int iParamTwo, int iParamThree )
     3: {
010F1250  push        ebp
010F1251  mov         ebp,esp
010F1253  sub         esp,4Ch
010F1256  push        ebx
010F1257  push        esi
010F1258  push        edi
010F1259  mov         dword ptr [ebp-8],edx
010F125C  mov         dword ptr [ebp-4],ecx
     4:     int iLocal = iParamOne + iParamTwo + iParamThree;
010F125F  mov         eax,dword ptr [ebp-4]
010F1262  add         eax,dword ptr [ebp-8]
010F1265  add         eax,dword ptr [ebp+8]
010F1268  mov         dword ptr [ebp-0Ch],eax
     5:     return iLocal;
010F126B  mov         eax,dword ptr [ebp-0Ch]
     6: }

The assembly making up the function prologue is doing extra work compared to a stdcall function; taking the values of ecx and edx and storing them into the function’s Stack frame (lines 9 & 10).

Lines 12 to 14 then add the three values passed to it using eax – iParamOne (passed via ecx now in [ebp-4]), iParamTwo (passed via edx now in [ebp-8]), and iParamThree (passed via the Stack in [ebp+8]).

Line 15 sets iLocal from the sum calculated in eax, and then line 17 moves the return value of the function into eax where the calling code will expect to find it (as previously established in this post).

That’s all well and good, but how is fastcall faster than the alternative calling conventions?

In theory, passing the arguments via registers should save two operations per parameter:

  1. not writing the value into the Stack (i.e. memory access) before the function is called, and
  2. not reading it from the Stack (i.e. memory access) when it is needed inside the function.

As a rule of thumb, performing fewer operations and avoiding those that involve accessing memory should result in faster code, but this is not always the case. I don’t want to get into discussing why this is, because on its own it is a subject for many posts, and one better explained by someone more qualified than myself (e.g. Bruce Dawson, Mike Acton, Tony Albrecht, Jaymin Kessler, or John McCutchan).

In all honesty I would be extremely surprised if the unoptimised code we’ve looked at runs any faster at all when using fastcall. As you can see by examining the disassembly above, the first of these potentially saved operations is being un-done by storing the content of ecx and edx into the function’s Stack frame in the function prologue (lines 9 & 10), and the second is being un-done by reading the parameter values back from the Stack in lines 12 & 13.

I assume that, like the other instances of unoptimised compiler generated assembler performing redundant operations we have come across, these unnecessary instructions would happily optimise away in a release build; however the sad fact is that it is pretty hard to test the disassembly of trivial programs like the one we’ve been looking at meaningfully in a release build configuration.

Why? Because the optimising compiler is so good that any simple program (like this one) which uses compile time constants for input, and does no output, will pretty much compile to “return 0;”

I leave it as an exercise for you, dear reader, to work out the smallest number of changes to this code that will result in disassembly that actually calls SumOf() :)

 

Summary

So, we have now seen how thiscall and fastcall differ from the other x86 calling conventions we’ve looked at, and we have seen yet again that even in simple code there is black magic going on behind the scenes of the language syntax.

Also, I want to point out that – whilst non-x86 platforms will do things slightly differently – this information is more generally useful than it may appear. The more different ways you’ve seen assembler doing similar tasks (like calling functions using the Stack), the more likely you are to be able to make sense of some new assembly language that you’ve never seen before (e.g. PowerPC assembler). Sure, the mnemonics may be very different, but you should be able to guess at a lot of them, and the documentation is out there to allow you to put the rest of the picture together given time.

No doubt we will revisit the Stack from time to time as this (Potentially neverending! Help!) series of articles continues, but I’ve now covered it in as much detail as I feel is appropriate until we’ve covered some other aspects of the Low Level view of C/C++ (for example; we will definitely be coming back to the Stack when we examine structs & classes and their memory layout to discuss pass by value).

Next time we’ll be looking at the disassembly from common C / C++ language constructs like loops and control statements, which are very useful things to be familiar with if you find yourself staring at a bunch of disassembly as a result of a crash in code you don’t have symbols for.

In case you missed it whilst reading the main body of the post, here’s that link again concerning the undocumented VS2010 compiler feature that dumps memory layouts of classes to the build output: http://thetweaker.wordpress.com/2010/11/07/d1reportallclasslayout-dumping-object-memory-layout/

Also, thanks to Fabian and Bruce for their help reviewing this post.