<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>#AltDevBlogADay &#187; Steven Tovey</title>
	<atom:link href="http://www.altdevblogaday.com/author/steven-tovey/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.altdevblogaday.com</link>
	<description>Each day a little more #gamedev love</description>
	<lastBuildDate>Thu, 17 May 2012 03:06:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Absolutely Crackers!</title>
		<link>http://www.altdevblogaday.com/2011/12/12/absolutely-crackers/</link>
		<comments>http://www.altdevblogaday.com/2011/12/12/absolutely-crackers/#comments</comments>
		<pubDate>Mon, 12 Dec 2011 23:56:21 +0000</pubDate>
		<dc:creator>Steven Tovey</dc:creator>
				<category><![CDATA[General Interest]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[codebreaking]]></category>
		<category><![CDATA[cracking]]></category>
		<category><![CDATA[cryptography]]></category>
		<category><![CDATA[GCHQ]]></category>
		<category><![CDATA[Hex Workshop]]></category>
		<category><![CDATA[IDA]]></category>
		<category><![CDATA[IDA Freeware]]></category>
		<category><![CDATA[John the Ripper]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[puzzles]]></category>

		<guid isPermaLink="false">http://altdevblogaday.com/?p=21201</guid>
		<description><![CDATA[<p>I have held off posting this solution until the deadline on GCHQ&#8217;s website had expired; my hope is that this will help to ensure that those who are &#8220;unworthy&#8221; (in their view) do not abuse the solutions I give here. Before we jump right in, my special thanks go to my awesome colleague <a href="http://www.hikey.org">Lionel Lemarié</a>, who was also working on this at the same time. Bouncing ideas and thoughts off such a knowledgeable chap was super useful and really does go to prove that two heads are better than one! This post is also reproduced on my <a href="http://www.spuify.co.uk">personal blog</a>.</p>
<p><a href="http://www.altdevblogaday.com/2011/12/12/absolutely-crackers/" class="more-link">Read more on Absolutely Crackers!&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<p>I have held off posting this solution until the deadline on GCHQ&#8217;s website had expired; my hope is that this will help to ensure that those who are &#8220;unworthy&#8221; (in their view) do not abuse the solutions I give here. Before we jump right in, my special thanks go to my awesome colleague <a href="http://www.hikey.org">Lionel Lemarié</a>, who was also working on this at the same time. Bouncing ideas and thoughts off such a knowledgeable chap was super useful and really does go to prove that two heads are better than one! This post is also reproduced on my <a href="http://www.spuify.co.uk">personal blog</a>.</p>
<p>On the 1st of December 2011 I saw the tag GCHQ trending on <a href="http://www.twitter.com" target="_blank">Twitter</a>. Apparently they had set a challenge to British Nationals to try and crack a code they released on the website <a href="http://www.canyoucrackit.co.uk" target="_blank">http://www.canyoucrackit.co.uk</a>. As someone who enjoys puzzles I thought it&#8217;d be fun to have a go at trying to crack the code they put up. This blog post details my efforts in solving the code.</p>
<h2>Stage 1:</h2>
<p>If you visit <a href="http://www.canyoucrackit.co.uk" target="_blank">http://www.canyoucrackit.co.uk</a> you will be greeted with something that looks like a really <a href="http://chriswoodill.blogspot.com/2008/12/top-10-list-of-bad-movie-computer.html" target="_blank">bad IT system in a Hollywood Blockbuster</a>. Most prominently there is an image with 160 bytes of hexadecimal, split into 10 rows of 16 entries. The first stage of solving this puzzle was to get this data into binary format. There was no really good way of doing this except to painstakingly transcribe the bytes by hand into my favourite Hex Editor. I double checked the data and also had a friend check it to make sure there were no errors at this point. My initial feeling was that perhaps it was some &#8220;cryptographic in-joke&#8221;, where the data hashed to make a readable bit of plain text. I ran the data through a bunch of hashing functions, a bunch of SHA variants, CRCs, the works. Nothing! It wasn&#8217;t until I noticed the string of bytes <code>0xef 0xbe 0xad 0xde</code> (<code>0xdeadbeef</code> stored in little-endian order) that alarm bells started ringing in my head. This was an executable payload! Once this realisation set in it only took a cursory glance at the first byte of the data, <code>0xeb</code> (the opcode for a short <code>jmp</code>), to confirm that it was likely to be x86. <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/gchq.txt">Here is the binary file I created</a> for those who are interested; please excuse the .txt extension (WordPress flags my .bin files and even .c files as security risks so a lot of downloads from this post are .txt).</p>
<p><a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/canyoucrackit_site.png"><img class="alignnone size-medium wp-image-649" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/canyoucrackit_site.png" alt="" width="300" height="168" /></a></p>
<p>It actually turns out that the payload in the image on the site is only half the story. There is a second part to it, which is inside cyber.png itself. So download the main image from the site and fire up your favourite hex editor. You&#8217;re looking for an <code>iTXt</code> chunk, which carries some textual metadata.</p>
<blockquote><p><code>QkJCQjIAAACR2PFtcCA6q2eaC8SR+8dmD/zNzLQC+td3tFQ4qx8O447TDeuZw5P+0SsbEcYR<br />
78jKLw==<br />
</code></p></blockquote>
<p>The presence of the == at the end was a hint that the data was actually <a href="http://en.wikipedia.org/wiki/Base64">base64 encoded</a>. So you&#8217;ll need to decode this; I used <a href="http://www.opinionatedgeek.com/dotnet/tools/base64decode/">this site</a>. Once you have that, your payload is complete.</p>
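<p>If you would rather not paste the string into a website, the decode is easy enough to do locally. The following is a minimal sketch in C, not the tool used for this post; <code>b64_value</code>, <code>b64_decode</code> and <code>GCHQ_ITXT_B64</code> are names made up for illustration. Decoding the string above should give 58 bytes that start with <code>BBBB</code> (the <code>42424242h</code> marker checked in the disassembly below), followed by a little-endian length.</p>

```c
#include <stddef.h>
#include <string.h>

/* The base64 text from the iTXt chunk, copied from the post. */
static const char GCHQ_ITXT_B64[] =
    "QkJCQjIAAACR2PFtcCA6q2eaC8SR+8dmD/zNzLQC+td3tFQ4qx8O447TDeuZw5P+0SsbEcYR"
    "78jKLw==";

/* Map one base64 character to its 6-bit value, or -1 if it is not in the alphabet. */
static int b64_value(char c)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p;
    if (c == '\0') return -1;
    p = strchr(alphabet, c);
    return p ? (int)(p - alphabet) : -1;
}

/* Decode base64 text into out and return the number of bytes written.
   Stops at the first '=' and skips characters outside the alphabet (e.g. newlines). */
size_t b64_decode(const char *text, unsigned char *out, size_t out_cap)
{
    unsigned int acc = 0; /* bit accumulator, only the low bits matter */
    int bits = 0;         /* number of not-yet-emitted bits held in acc */
    size_t n = 0;
    for (; *text != '\0' && *text != '='; ++text) {
        const int v = b64_value(*text);
        if (v < 0) continue;
        acc = (acc << 6) | (unsigned int)v;
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            if (n < out_cap) out[n++] = (unsigned char)((acc >> bits) & 0xffu);
        }
    }
    return n;
}
```

<p>Calling <code>b64_decode(GCHQ_ITXT_B64, buf, sizeof buf)</code> with a 64-byte <code>buf</code> is enough for this payload.</p>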
<blockquote><p><code><br />
001C13AE jmp main+24h (1C13B4h)<br />
001C13B0 scas dword ptr es:[edi]<br />
001C13B1 ret 0A3BFh<br />
001C13B4 sub esp,100h<br />
001C13BA xor ecx,ecx<br />
001C13BC mov byte ptr [esp+ecx],cl<br />
001C13BF inc cl<br />
001C13C1 jne main+2Ch (1C13BCh)<br />
001C13C3 xor eax,eax<br />
001C13C5 mov edx,0DEADBEEFh<br />
001C13CA add al,byte ptr [esp+ecx]<br />
001C13CD add al,dl<br />
001C13CF ror edx,8<br />
001C13D2 mov bl,byte ptr [esp+ecx]<br />
001C13D5 mov bh,byte ptr [esp+eax]<br />
001C13D8 mov byte ptr [esp+eax],bl<br />
001C13DB mov byte ptr [esp+ecx],bh<br />
001C13DE inc cl<br />
001C13E0 jne main+3Ah (1C13CAh)<br />
001C13E2 jmp main+0B3h (1C1443h)<br />
001C13E7 mov ebx,esp<br />
001C13E9 add ebx,4<br />
001C13EF pop esp<br />
001C13F0 pop eax<br />
001C13F1 cmp eax,41414141h<br />
001C13F6 jne main+0ABh (1C143Bh)<br />
001C13F8 pop eax<br />
001C13F9 cmp eax,42424242h<br />
001C13FE jne main+0ABh (1C143Bh)<br />
001C1400 pop edx<br />
001C1401 mov ecx,edx<br />
001C1403 mov esi,esp<br />
001C1405 mov edi,ebx<br />
001C1407 sub edi,ecx<br />
001C1409 rep movs byte ptr es:[edi],byte ptr [esi]<br />
001C140B mov esi,ebx<br />
001C140D mov ecx,edx<br />
001C140F mov edi,ebx<br />
001C1411 sub edi,ecx<br />
001C1413 xor eax,eax<br />
001C1415 xor ebx,ebx<br />
001C1417 xor edx,edx<br />
001C1419 inc al<br />
001C141B add bl,byte ptr [esi+eax]<br />
001C141E mov dl,byte ptr [esi+eax]<br />
001C1421 mov dh,byte ptr [esi+ebx]<br />
001C1424 mov byte ptr [esi+eax],dh<br />
001C1427 mov byte ptr [esi+ebx],dl<br />
001C142A add dl,dh<br />
001C142C xor dh,dh<br />
001C142E mov bl,byte ptr [esi+edx]<br />
001C1431 mov dl,byte ptr [edi]<br />
001C1433 xor dl,bl<br />
001C1435 mov byte ptr [edi],dl<br />
001C1437 inc edi<br />
001C1438 dec ecx<br />
001C1439 jne main+89h (1C1419h)<br />
001C143B xor ebx,ebx<br />
001C143D mov eax,ebx<br />
001C143F inc al<br />
001C1441 nop<br />
001C1442 nop<br />
001C1443 nop<br />
001C1444 nop<br />
001C1445 call main+57h (1C13E7h)<br />
001C144A inc ecx<br />
001C144B inc ecx<br />
001C144C inc ecx<br />
001C144D inc ecx<br />
001C144E inc edx<br />
001C144F inc edx<br />
001C1450 inc edx<br />
001C1451 inc edx<br />
001C1452 xor al,byte ptr [eax]<br />
001C1454 add byte ptr [eax],al<br />
001C1456 xchg eax,ecx<br />
001C1457 fdiv st,st(1)<br />
001C1459 ins dword ptr es:[edi],dx<br />
001C145A jo main+0ECh (1C147Ch)<br />
001C145C cmp ch,byte ptr [ebx-3BF46599h]<br />
001C1462 xchg eax,ecx<br />
001C1463 sti<br />
001C1464 db c7h<br />
001C1465 paddb xmm1,xmm5<br />
001C1469 int 3<br />
001C146A mov ah,2<br />
001C146C cli<br />
001C146D xlat byte ptr [ebx]<br />
001C146E ja main+94h (1C1424h)<br />
001C1470 push esp<br />
001C1471 cmp byte ptr [ebx-711CF1E1h],ch<br />
001C1477 ror dword ptr ds:[93C399EBh],cl<br />
001C147D db feh<br />
001C147E shr dword ptr [ebx],1<br />
001C1480 sbb edx,dword ptr [ecx]<br />
001C1482 db c6h<br />
001C1483 adc edi,ebp<br />
001C1485 enter 2FCAh,0Dh<br />
</code></p></blockquote>
<p>As it would be less than smart to simply execute a chunk of unknown x86 binary on my machine (regardless of its source), I disassembled the binaries to make sure they weren&#8217;t harmful prior to running them. There are two ways to run the payload. The first is a little C program: you dump the data as an array of unsigned chars and declare a function pointer type such as <code>typedef void(*executePayload)();</code>. A lovely typecast from the unsigned char array to a function pointer will do the trick. However, most systems these days come with various security measures to stop malicious code chewing through your system. One such feature is DEP (Data Execution Prevention). In order to execute a payload in this way you need to make sure the memory that contains the payload is mapped with the executable flag set. <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/stage1.txt">Here is the source file</a> that will execute the payload. You can compile it and execute it with the following commands in your Linux terminal and then use GDB to debug it:</p>
<blockquote><p><code>gcc -g -O0 payload.c -o payload<br />
./payload </code></p></blockquote>
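<p>For reference, the mapping step mentioned above might look something like the sketch below. It is not the linked source file, just a minimal illustration assuming a POSIX system with <code>mmap</code> and <code>mprotect</code>; <code>run_machine_code</code>, <code>payload_fn</code> and <code>demo_stub</code> are hypothetical names, and the function pointer returns <code>int</code> (rather than <code>void</code>) purely so the result can be checked. The demo stub is a harmless x86 <code>mov eax, 42; ret</code>, not the GCHQ payload.</p>

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*payload_fn)(void);

/* Copy code into a fresh mapping, flip it from writable to executable, call it,
   and return whatever the code leaves in the return register.
   Returns -1 if the mapping or the permission change fails. */
int run_machine_code(const unsigned char *code, size_t size)
{
    int result;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;

    memcpy(mem, code, size);

    /* DEP-style protections refuse to run code from pages without the exec flag. */
    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, size);
        return -1;
    }

    result = ((payload_fn)mem)();
    munmap(mem, size);
    return result;
}

/* Runs a tiny x86 stub (mov eax, 42; ret). Returns -1 on other architectures. */
int demo_stub(void)
{
#if defined(__i386__) || defined(__x86_64__)
    static const unsigned char stub[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
    return run_machine_code(stub, sizeof stub);
#else
    return -1;
#endif
}
```

<p>Only ever feed something like this bytes you have already disassembled and understood, for the reasons given above.</p>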
<p>Alternatively, in Visual Studio, you can just use <code>__asm _emit</code> for each of the bytes after you byte swap them. You can <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/asm_emit.txt">download my source file for this here</a>.</p>
<p>The first thing the payload does is jump over the next 4 bytes (this will be important later on). It then allocates a buffer on the stack by moving <code>esp</code> and does a bunch of ops to build strings in this buffer. It&#8217;s not really important or interesting what it&#8217;s actually doing; to get hold of what you want, just run until the <code>int 0x80</code> call, and then check out the stack frame in the memory window. It&#8217;ll contain a URL to a .js file.</p>
<p><img class="alignnone size-full wp-image-836" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/stage1buffer.png" alt="" width="604" height="284" /></p>
<h2>Stage 2:</h2>
<p>If we download the file we are presented with some JavaScript source code that describes a virtual machine. It&#8217;s just a case of implementing the virtual machine, running it until it hits a halt instruction and checking the memory buffer for the URL.</p>
<p>The virtual machine as described had a handful of 8-bit registers, 8 instructions and a 16-byte segmented memory model.</p>
<h3>Instruction Fetch &amp; Decode</h3>
<p>The op codes that are described in the JS file take the format shown in the image below, it is trivial to knock up a bit of C code to fetch and decode each op.</p>
<p><img class="alignnone size-full wp-image-858" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/opcodeformat.png" alt="" width="603" height="96" /></p>
<blockquote><p><code><br />
/* fetch &amp; decode: */<br />
const uint8_t byte = g_mem[calculateIndex(g_cpu.cs, g_cpu.ip)];<br />
const uint8_t opCode = (byte &amp; 0xe0) &gt;&gt; 5;<br />
const uint8_t mod = (byte &amp; 0x10) &gt;&gt; 4;<br />
const uint8_t op1 = (byte &amp; 0xf);<br />
const uint8_t op2 = g_mem[calculateIndex(g_cpu.cs, g_cpu.ip)+1]; /* optional */<br />
</code></p></blockquote>
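<p>To make the bit layout concrete, the same decode can be written as a standalone function and tried on a single byte. This is only a sketch; <code>DecodedOp</code> and <code>decode_op</code> are hypothetical names that do not appear in the linked VM source. For example the byte <code>0xb2</code> (binary 1011 0010) splits into opcode 5, mod 1 and operand 2.</p>

```c
#include <stdint.h>

/* The fields packed into the first byte of an instruction. */
typedef struct {
    uint8_t opcode;   /* bits 7..5 */
    uint8_t mod;      /* bit 4 */
    uint8_t operand1; /* bits 3..0 */
} DecodedOp;

/* Split one instruction byte into its three fields. */
DecodedOp decode_op(uint8_t byte)
{
    DecodedOp d;
    d.opcode   = (uint8_t)((byte & 0xe0u) >> 5);
    d.mod      = (uint8_t)((byte & 0x10u) >> 4);
    d.operand1 = (uint8_t)(byte & 0x0fu);
    return d;
}
```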
<h3>Addressing</h3>
<p>Segmented addressing means that in order to address something you use a segment index (which in this case selects which 16-byte location of memory you are in), and then an offset from that segment&#8217;s base. The offset is not necessarily constrained to [0..15], check out the C code below. <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/main.txt">Here is the source file for my VM written in C</a>. If that wasn&#8217;t enough Lionel has very kindly made available his JS source code for stage 2, which you <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/stage2_js.txt">can get here</a>. Alternatively just <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/stage2.html">click here</a> if you just want to run Lionel&#8217;s VM in your browser and see his cool debug output.</p>
<blockquote><p><code><br />
/* segment selects a 16-byte block; offset is added to that block's base */<br />
uint16_t calculateIndex(const uint8_t segment, const uint8_t offset) {<br />
const uint16_t segment16 = (uint16_t)segment;<br />
const uint16_t offset16 = (uint16_t)offset;<br />
const uint16_t index = (uint16_t)((segment16 &lt;&lt; 4) + offset16);<br />
return index;<br />
}<br />
</code></p></blockquote>
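<p>Because the offset is a full 8-bit value, different segment and offset pairs can land on the same byte. A small worked sketch (the name <code>segmented_index</code> is made up here, and the arithmetic is the same as above): segment <code>0x10</code> with offset <code>0x00</code> and segment <code>0x0f</code> with offset <code>0x10</code> both resolve to index <code>0x100</code>.</p>

```c
#include <stdint.h>

/* Linear index = segment * 16 + offset. The largest possible result is
   255 * 16 + 255 = 4335, which fits comfortably in 16 bits. */
uint16_t segmented_index(uint8_t segment, uint8_t offset)
{
    return (uint16_t)(((uint16_t)segment << 4) + offset);
}
```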
<h3>jmp</h3>
<p>A simple unconditional jump. Just sets <code>cs</code> and <code>ip</code> to whatever the jump target is. If mod bit is not set, then the <code>cs</code> register is left unaffected.</p>
<h3>movr &amp; movm</h3>
<p>These instructions just move data between registers and also to and from memory.</p>
<h3>add, xor and cmp</h3>
<p>These three instructions have similar characteristics from the mod flag perspective, and basically do exactly what they say on the tin. However, in the case of <code>cmp</code> there are some additional changes required to the flags register. Comparisons are usually implemented as a subtraction: if the result is 0 then the values are equal, and hence the flags register is set to 0. If the subtraction yields a positive value, then the first operand is larger than the second, and the specification for this virtual machine requires the flags register to be set to <code>0x1</code>. If the sign bit is set (i.e. the subtraction yields a negative result) then the flags register should be set to <code>0xff</code>. This is shown in the code snippet below.</p>
<blockquote><p><code><br />
const int8_t result = getRegisterValue(op1) - getRegisterValue(op2);<br />
if(result == 0)<br />
setRegisterValue(REGISTER_FLAG, 0);<br />
else if(result &lt; 0)<br />
setRegisterValue(REGISTER_FLAG, 0xff);<br />
else<br />
setRegisterValue(REGISTER_FLAG, 1);<br />
</code></p></blockquote>
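<p>One detail worth watching with the subtract-and-test approach: the registers are 8 bits wide, so if they are treated as unsigned values (an assumption here) the difference can fall outside the range of a signed 8-bit integer. For example 200 - 10 = 190, which reads as negative once narrowed to <code>int8_t</code>. A sketch that does the subtraction in a wider type avoids this; <code>cmp_flags</code> is a hypothetical helper, not part of the linked source.</p>

```c
#include <stdint.h>

/* Flag values follow the description above: 0 when equal, 0x01 when the first
   operand is larger, 0xff when it is smaller. Operands are assumed unsigned. */
uint8_t cmp_flags(uint8_t lhs, uint8_t rhs)
{
    /* Subtract in int so the sign of the difference is never lost. */
    const int diff = (int)lhs - (int)rhs;
    if (diff == 0) return 0x00;
    return (diff < 0) ? 0xff : 0x01;
}
```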
<h3>jmpe</h3>
<p>This instruction will jump when the flag register is set to <code>0x0</code>, which indicates that the result of a preceding <code>cmp</code> instruction was 0.</p>
<blockquote><p><code>if(g_cpu.fl == 0x0) {<br />
if(!mod) {<br />
g_cpu.ip = getRegisterValue(op1);<br />
printf("jmpe r%d\n", op1);<br />
} else {<br />
g_cpu.cs = op2;<br />
g_cpu.ip = getRegisterValue(op1);<br />
printf("jmpe #%x:r%d\n", op2, op1);<br />
}<br />
}<br />
</code></p></blockquote>
<h3>hlt</h3>
<p>The halt instruction (as the name suggests) simply brings the virtual machine to a halt.</p>
<h3>Wrapping it up&#8230;</h3>
<p>On examining the source code an astute reader may notice that the firmware is not used at all. I tried disassembling this stuff with the disassembler I wrote before I wrote the VM (also included in my little C file, just change the argument to <code>cpuInit</code> to <code>true</code>). It was a load of rubbish, so it didn&#8217;t make sense to run it. Initially why it was there was a mystery (but it became obvious later)!</p>
<p>Another thing to note is that it turns out that their specifications aren&#8217;t the entire picture. If you take a look at the <code>CS</code> register you will notice that it is never non-zero. It seems odd to have a completely redundant register, but it turns out it <em>is</em> required. The <code>CS</code> register I assume stands for code-segment (matching the <code>DS</code> register which I guess is data-segment). This should have been my first clue.</p>
<p><img class="alignnone size-full wp-image-793" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/memoryBuffer1.png" alt="" width="604" height="306" /></p>
<p>Looks like another URL, this time to a .exe file.</p>
<h2>Stage 3:</h2>
<p>So after downloading the executable file my first stop was to disassemble it. My tool of choice here was IDA Freeware. IDA is an awesome tool which I would highly recommend, you can <a href="http://www.hex-rays.com/products/ida/support/download_freeware.shtml">grab it here</a>. I also used a Hex Editor, my tool of choice is Hex Workshop, which you can <a href="http://www.hexworkshop.com/">grab here</a>. Once you disassemble it under IDA you need to run the exe and step it (you will need to install Cygwin for this). You will notice that it attempts to load a file called license.txt, so I created a txt file with this name and continued to step it. Eventually I hit the following bit of code. It is checking the first 4 bytes of the buffer it reads from license.txt to see if it matches some special magic number (GCHQ).</p>
<blockquote><p><code><br />
.text:00401160 mov [ebp+var_4C], 0<br />
.text:00401167 cmp [ebp+gchqHeaderBytes], 71686367h<br />
...<br />
.text:0040117F mov [esp+78h+salt?], eax<br />
.text:00401182 call crypt<br />
.text:00401187 mov edx, eax<br />
.text:00401189 mov eax, DEShash<br />
.text:0040118E mov [esp+78h+var_74], eax<br />
.text:00401192 mov [esp+78h+salt?], edx<br />
.text:00401195 call strcmp<br />
.text:0040119A test eax, eax<br />
</code></p></blockquote>
<p>So the first 4 bytes of license.txt are easy: you just replicate this value in there. Be careful to byte swap it. The next bit of the code is a bit more tricky. It calculates a hash value for the next few bytes in license.txt and compares the result to some other magic value (again stored in the data segment of the executable). The first 2 bytes of the result buffer from <code>crypt()</code> will be the salt for the hashing function (to try and guard against dictionary-based attacks). To break this password I fired up John the Ripper. This is a free program you <a href="http://www.openwall.com/john/">can get from here</a> (just make sure you get the jumbo version that supports the Markov mode). To build John the Ripper Jumbo, fire up your cygwin bash terminal and navigate to the src directory. You will then need to do the following:</p>
<blockquote><p><code>make clean (system)</code></p></blockquote>
<p>Where (system) is replaced with your target system, for me it was win32-cygwin-x86-any. Once it&#8217;s built, make a new file in the run directory called &#8220;password.txt&#8221; and type the following in it:</p>
<blockquote><p><code>user:hqDTK7b8K2rvw</code></p></blockquote>
<p>Then use the following command line to try and break it:</p>
<blockquote><p><code>./john -markov:220 password.txt</code></p></blockquote>
<p>Now for me 220 was enough to break the password, but other individuals I&#8217;ve spoken to about this said that they had to go up to 240. I&#8217;m not exactly sure on how the Markov mode works in John the Ripper, but potentially you might need to run a few times at 220 to break the password, or up the level. On my laptop which has an Intel Core i5 in it, it took about 5 minutes to break the password.</p>
<blockquote><p><code><br />
$ ./john -markov:220 password.txt<br />
Loaded 1 password hash (Traditional DES [24/32 4K])<br />
Warning: MaxLen = 12 is too large for the current hash type, reduced to 8<br />
MKV start (lvl=220 len=8 pwd=1601142515)<br />
cyberwin (user)<br />
guesses: 1 time: 0:00:04:39 DONE (Sat Dec 3 17:29:12 2011) c/s: 302851 trying: cybervam - cyberwo<br />
Use the "--show" option to display all of the cracked passwords reliably</code></p></blockquote>
<p>At this point it&#8217;s probably worth mentioning that actually breaking the password for license.txt is a waste of time. You can crack the .exe without having to generate the password. Later on in the execution flow the .exe will try to set up the stage 1 and stage 2 license keys. All this involves is a copy (or two) from the license.txt buffer that was read in via the standard C file I/O stuff. From examining the code you can work out the exact offset into this buffer at which the copy is performed. Once you have this, you can subtract the size in bytes of the magic number (which was 4 bytes as you will recall). Congratulations, you now know the password&#8217;s length without needing to actually crack the password itself. You can pad license.txt with whatever you like and then just hex edit the conditional jump that follows the <code>strcmp</code> on the output buffer from the call to <code>crypt</code>, making it unconditional.</p>
<p>So what are the final pieces of the puzzle? It took me a while to work it out, but mentions of &#8220;Stage 1&#8221; and &#8220;Stage 2&#8221; in the <code>printf</code> are actually clues about using values from stage 1 and stage 2 (duh!). You need a single 4-byte value from stage 1 and two 4-byte values from stage 2. Stage 2 to me was obvious: the unused firmware values from the VM (just remember to byte swap them!). Stage 1 you can get by looking at the disassembly again: the first instruction jumps over the next 4 bytes. :) This data is already byte swapped, as shown by the presence of <code>0xdeadbeef</code> in the data.</p>
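<p>Since byte swapping comes up several times here, a quick sketch of what it means on little-endian x86 may help. The helper names <code>store_le32</code> and <code>load_le32</code> are made up for illustration. Storing the magic number <code>71686367h</code> least significant byte first gives the ASCII bytes <code>gchq</code>, and reading the four bytes skipped by the first <code>jmp</code> in stage 1 (<code>af c2 bf a3</code>, going by the disassembly above) as a little-endian dword gives <code>0xa3bfc2af</code>.</p>

```c
#include <stdint.h>

/* Store a 32-bit value as four bytes, least significant byte first (x86 order). */
void store_le32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value & 0xffu);
    out[1] = (unsigned char)((value >> 8) & 0xffu);
    out[2] = (unsigned char)((value >> 16) & 0xffu);
    out[3] = (unsigned char)((value >> 24) & 0xffu);
}

/* Read four bytes laid out least significant byte first back into a 32-bit value. */
uint32_t load_le32(const unsigned char in[4])
{
    return (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
           ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
}
```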
<p>So append these to your license.txt file and you&#8217;re done! <a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/license.txt">Here is mine</a>. If you step the code again you will eventually see keygen.exe will try to make a BSD socket and connect to the host you passed in as the argument (as shown below).</p>
<blockquote><p><code><br />
.text:004012C4 call connect<br />
.text:004012C9 test eax, eax<br />
.text:004012CB jns short loc_4012EF<br />
.text:004012CD mov eax, [ebp+arg_0]<br />
.text:004012D0 mov [esp+148h+var_144], eax<br />
.text:004012D4 mov [esp+148h+var_148], offset aErrorConnectSF ; "error: connect(\"%s\") failed\n"<br />
.text:004012DB call printf<br />
.text:004012E0 mov [ebp+var_144], 0FFFFFFFFh<br />
.text:004012EA jmp loc_401423<br />
</code></p></blockquote>
<p>I actually stepped over this stuff until I got to the call to <code>sprintf</code>, where it builds the path to request from the HTTP server.</p>
<blockquote><p><code><br />
.text:00401315 mov [esp+148h+pathPart1], eax<br />
.text:00401319 mov [esp+148h+var_144], offset aGetSXXXKey_txt ; "GET /%s/%x/%x/%x/key.txt HTTP/1.0\r\n\r\n"<br />
.text:00401321 lea eax, [ebp+removeFilePath]<br />
.text:00401327 mov [esp+148h+var_148], eax<br />
.text:0040132A call sprintf<br />
.text:0040132F lea eax, [ebp+removeFilePath]<br />
</code></p></blockquote>
<p>The value at <code>removeFilePath</code> in the disasm is the file you want. Fire up your favourite browser and bash that path in. It turns out to be <a href="http://www.canyoucrackit.co.uk/hqDTK7b8K2rvw/a3bfc2af/d2ab1f05/da13f110/key.txt">http://www.canyoucrackit.co.uk/hqDTK7b8K2rvw/a3bfc2af/d2ab1f05/da13f110/key.txt</a></p>
<p><img class="alignnone size-full wp-image-798" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/buffer.png" alt="" width="742" height="63" /><br />
The file contains a single line of text:</p>
<blockquote><p><code>Pr0t3ct!on#cyber_security@12*12.2011+<br />
</code></p></blockquote>
<p><a href="http://www.spuify.co.uk/wp-content/uploads/2011/12/well_done.png"><img class="alignnone size-medium wp-image-649" src="http://www.spuify.co.uk/wp-content/uploads/2011/12/well_done.png" alt="" width="300" height="168" /></a></p>
<p>And that&#8217;s that&#8230; :) I hope you enjoyed this little detour into how I spent a few hours of my life. It was good fun, maybe the games industry should do recruitment tests more like this! :) Until next time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.altdevblogaday.com/2011/12/12/absolutely-crackers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FPStress</title>
		<link>http://www.altdevblogaday.com/2011/02/28/fpstress/</link>
		<comments>http://www.altdevblogaday.com/2011/02/28/fpstress/#comments</comments>
		<pubDate>Mon, 28 Feb 2011 16:45:57 +0000</pubDate>
		<dc:creator>Steven Tovey</dc:creator>
		
		<guid isPermaLink="false">http://altdevblogaday.org/?p=1404</guid>
		<description><![CDATA[<p>This post is going to be much shorter as I’ve sadly been very pressed for time this last couple of weeks and as a result been unable to finish the other (longer) post I had originally planned to the high standard that you all deserve. This post is about measuring stuff in frames per second and why you shouldn’t do it.</p>
<p><a href="http://www.altdevblogaday.com/2011/02/28/fpstress/" class="more-link">Read more on FPStress&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<p>This post is going to be much shorter as I’ve sadly been very pressed for time this last couple of weeks and as a result been unable to finish the other (longer) post I had originally planned to the high standard that you all deserve. This post is about measuring stuff in frames per second and why you shouldn’t do it.</p>
<p>Many people who haven’t got any experience in profiling and optimising a game will approach performance measurements in a frankly scary way. I’m talking of course about measuring performance in terms of <em>frame rate</em>. Of course it’s unfair to say the games industry never mentions ‘frame rate’; we do! We have target frame rates for our games which are usually 60Hz or 30Hz (increasingly the latter), but for anything beyond this I think it’s fair to say that most sane people take the leap into a different, more logical and intuitive space. It’s called <em>frame time</em> and it’s measured in, you guessed it, units of time! Typical units of time used include ms, µs or even ns. The two have a relationship which allows you to convert between them, given by <em>t</em>=1/<em>r</em>, where <em>t</em> is the frame time in seconds and <em>r</em> is the frame rate in frames per second.</p>
<p><img class="size-full wp-image-570 aligncenter" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/fps_graph1.png" alt="" width="551" height="331" /></p>
<p>So why am I concerned about this if it’s trivial to convert between them? Well, the reason is that measuring anything in terms of the frame rate suffers from some big flaws which measuring in frame time does not and can lead to problems with your view of the situation:</p>
<p>1. Traditional frame rate measurements are often dependent on synchronisation with a hardware-driven event such as the vertical reset on your display device. If you’ve locked your application to this vertical sync to avoid frame tearing this can mean no matter how fast your application is going it will always appear to run at, say, 60fps. If your application is straddling the border of 60fps and locked to vsync this can give rise to occasions where small fluctuations in the amount of time something takes can make the game appear to go either two times quicker or slower!</p>
<p>2. Frame rate can be very deceptive! To show what I mean let’s use an example that crops up time and time again among beginners, it goes as follows:</p>
<blockquote><p>Coder A: “Check this shit out, my game is running at 2000fps! I’m such a ninja!”<br />
Coder B: “Nice one chief, but you’re accidentally frustum culling the player there&#8230;”<br />
Coder A: “Ah shit, okay&#8230; Let me fix that&#8230; Sorted!”<br />
<em>Coder A hits F5 and looks on truly horrified at the frame rate counter.</em><br />
Coder B: “Heh! Not so ninja now are you? Running at 300fps!”<br />
<em>Coder A slopes off looking dejected.</em></p></blockquote>
<p>What happened here? Apparently performance has dropped drastically by a whopping 1700 frames per second to a mere three hundred! However, if this pair of idiots worked in a set of sensible units they’d instantly see that it’s only taking about 2.8ms to render that player mesh (0.5ms per frame at 2000fps versus roughly 3.3ms at 300fps). Small drops in frame rate in an application with an already low frame rate can point to big performance problems, whereas large drops in an application with a very high frame rate are usually not much to worry about.</p>
<p>3. And perhaps the most compelling reason, which should give any remaining doubters a firm nudge into reality: all the wonderful profiling tools out there work in time, not rate.</p>
<p>At <a href="http://en.wikipedia.org/wiki/Bizarre_Creations">Bizarre Creations</a> (R.I.P) we had a large cardboard cut out of the <a href="http://en.wikipedia.org/wiki/The_Stig">Stig</a> (among others, such as Hulk Hogan and Austin Powers) for anyone that broke the build. In honour of Bizarre, which sadly closed last Friday, I offer these final Stig-inspired words to bring this post to a close: next time you even contemplate telling your colleague that your code runs at a particular frame rate, remember that what you’re saying is more backwards than Jeremy Clarkson telling you that the Stig got the Ariel Atom 500 round the Top Gear track at 37.54993342 metres per second!</p>
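<p>The conversion behind the second point is simple enough to sketch in a few lines of C. The function names <code>frame_time_ms</code> and <code>frame_cost_ms</code> are made up for illustration. Going from 2000fps to 300fps costs roughly 2.83ms per frame, whereas going from 30fps to 25fps, a drop of only 5fps, costs about 6.67ms.</p>

```c
/* Convert a frame rate in frames per second to a frame time in milliseconds. */
double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Extra milliseconds per frame when the rate moves from fps_before to fps_after. */
double frame_cost_ms(double fps_before, double fps_after)
{
    return frame_time_ms(fps_after) - frame_time_ms(fps_before);
}
```

<p>Seen this way the 1700fps drop is the cheaper of the two changes, which is exactly why frame time is the more honest unit.</p>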
<p><img class="size-full wp-image-567 aligncenter" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/fps_dave.png" alt="" width="625" height="441" /></p>
<p>This post is reproduced over at my personal blog, which you can find <a href="http://www.spuify.co.uk">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.altdevblogaday.com/2011/02/28/fpstress/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Alternatives to malloc and new</title>
		<link>http://www.altdevblogaday.com/2011/02/12/alternatives-to-malloc-and-new/</link>
		<comments>http://www.altdevblogaday.com/2011/02/12/alternatives-to-malloc-and-new/#comments</comments>
		<pubDate>Sat, 12 Feb 2011 02:29:29 +0000</pubDate>
		<dc:creator>Steven Tovey</dc:creator>
		
		<guid isPermaLink="false">http://altdevblogaday.org/2011/02/12/alternatives-to-malloc-and-new/</guid>
		<description><![CDATA[<h2>Obligatory Introductory Parable</h2>
<p>I really like Sushi, it’s tasty and convenient. I like the immediacy of being able to go into a Sushi restaurant complete with conveyor belt and being able to take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a Sushi restaurant, especially if it was my job to show the diners to their seats and here’s why&#8230;</p>
<p><a href="http://www.altdevblogaday.com/2011/02/12/alternatives-to-malloc-and-new/" class="more-link">Read more on Alternatives to malloc and new&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<h2>Obligatory Introductory Parable</h2>
<p>I really like Sushi, it’s tasty and convenient. I like the immediacy of being able to go into a Sushi restaurant complete with conveyor belt and being able to take a seat and grab something fresh and delicious from the belt without blowing my entire lunch hour. Having said that, what I really wouldn’t like is to be a member of staff in a Sushi restaurant, especially if it was my job to show the diners to their seats and here’s why&#8230;</p>
<h2 style="text-align: center;"><img title="mem01" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem01.png" alt="" width="463" height="273" /></h2>
<p>A group of three diners walk in and ask to be seated. Delighted at having some custom, you kindly oblige showing the diners to their seats. No sooner have they sat down and helped themselves to some tasty looking ‘Tekka Maki’, when the door opens again and four more diners walk in! Wow, you guys are on a roll (see what I did there?). Your restaurant now looks like this&#8230;</p>
<p style="text-align: center;"><img class="size-full wp-image-472 aligncenter" title="mem02" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem02.png" alt="" width="464" height="292" /></p>
<p>Since it’s lunch time, the restaurant quickly fills up to capacity. Finally, after eating all they could and settling the bill, the first party to arrive (the party of three) leave, and a group of two walk in and you offer the newly vacant seats to your new customers. This happens a few more times until your restaurant looks like this&#8230;</p>
<p style="text-align: center;"><img class="size-full wp-image-473 aligncenter" title="mem03" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem03.png" alt="" width="463" height="277" /></p>
<p>Finally a group of four diners walk in and ask to be seated. Ever the pragmatist, you’ve been carefully keeping track of how many empty seats you have left and it’s your lucky day, you’ve got four spare seats! There is one snag though: these diners are the social type and would like to be seated together. You look around frantically, but while you have four empty seats you can’t seat the diners together! It would be very rude to ask your existing customers to move mid-meal, which sadly leaves you no option but to turn your new customers away, probably never to return again. This makes everyone very sad.</p>
<p>If you’re sat there wondering how this little tale relates in the slightest bit to games development then read on. In programming we have to manage memory. Memory is a precious resource which must be carefully managed, much like the seats in our metaphorical sushi restaurant. Every time we allocate memory dynamically we’re reserving memory from something called the “heap”. In C and C++ this is typically done through the use of the <em>malloc</em> function and the <em>new</em> operator respectively. To continue the somewhat fishy analogy (last one I promise!), this is like our intrepid groups of diners asking to be seated in our sushi restaurant. The real shame though is that what happened in our hypothetical scenario also happens in the context of memory, but the results are much worse than a couple of empty tummies. It is called <em>fragmentation</em> and it is a nightmare!</p>
<h2>What’s wrong with malloc and new?</h2>
<p>Sadly, the rest of the discussion won’t have such a fishy flavour to it as this post is going to talk about malloc and new and why they have a very limited place in the context of embedded systems (such as games consoles). While fragmentation is just one facet of problems caused by dynamic memory allocation, it is perhaps the most serious, but before we can come up with some alternatives we should take a look at <em>all</em> of the problems:</p>
<p><strong>1. malloc and new try to be all things to all programmers&#8230;</strong><br />
They will as soon allocate you a few bytes as they will a few megabytes. They have no concept of what the data is that they&#8217;re allocating for you and what its lifetime is likely to be. Put another way, they don&#8217;t have the bigger picture that we have as programmers.</p>
<p><strong>2. Run-time performance is relatively bad&#8230;</strong><br />
Allocations from the standard library functions or operators can require descending into the kernel to service the allocation requests (this can involve all sorts of nasty side effects to your application&#8217;s performance, including flushing of translation lookaside buffers, copying blocks of memory, etc). For this reason alone using dynamic memory allocation can be very expensive in terms of performance. The cost of the free or delete operations in some allocation schemes can also be expensive, as in many cases a lot of extra work is done to try to improve the state of the heap for subsequent allocations. &#8220;Extra work&#8221; is a fairly vague term, but can mean the merging of memory blocks and in some cases can mean walking an entire list of allocations your application has made! Certainly not something you&#8217;d want to be wasting valuable processor cycles on if you can avoid it!</p>
<p><strong>3. They cause fragmentation of your heap&#8230;</strong><br />
If you&#8217;ve never been on a project that has suffered from the problems associated with fragmentation then count yourself very lucky, but the rest of us know that heap fragmentation can be a complete and utter nightmare to address.</p>
<p><strong>4. Tracking dynamic allocations can be tricky&#8230;</strong><br />
Dynamic allocation comes with the inherent risk of memory leaks. I’m sure we all know what memory leaks are, but if not, <a href="http://en.wikipedia.org/wiki/Memory_leak">have a read of this</a>. Most studios try to build some tracking infrastructure on top of their dynamic allocations to try and track what memory is in play and where.</p>
<p><strong>5. Poor locality of reference&#8230;</strong><br />
There is essentially no way of knowing where the memory you get back from malloc or new will be in relation to any of the other memory in your application. This can lead to us suffering more expensive cache misses than we need to, as we end up dancing through memory like Billy Elliot on Ecstasy.</p>
<p>So what is the alternative? The idea behind this post is to provide you with the details (and some code!) for a few different alternatives that you can use in place of malloc and new to help combat the problems we&#8217;ve just discussed.</p>
<h2>The Humble Linear Allocator</h2>
<p>As the name of this section suggests, this allocator is certainly the simplest of all those I&#8217;m going to present, although, truth be told, they&#8217;re all simple (and that&#8217;s part of the beauty). The linear allocator essentially assumes that there is no fine-grained de-allocation of allocated memory resources and proceeds to make allocations from a fixed pool of memory one after the other in a linear fashion. Check out the diagram below.</p>
<p style="text-align: center;"><img class="size-full wp-image-474 aligncenter" title="mem04" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem04.png" alt="" height="237" /></p>
<p>A good example of where a linear allocator is exceedingly useful is found in nearly all SPU programming. The persistence of data in the local store is not important beyond the scope of the currently executing job, and in many cases the amount of data one brings into the local store (at least data that needs some degree of variance in its size) tends to be fairly restricted. However, don’t be fooled, that’s far from its only application. Here&#8217;s an example of how one might go about implementing a simple, linear allocator in C.</p>
<p><code>#include &lt;stddef.h&gt; /* NULL */<br />
#include &lt;stdint.h&gt; /* uint8_t, uint32_t */</code></p>
<p><code>/** Header structure for linear buffer. */<br />
typedef struct LinearBuffer {<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint8_t *mem; /*!&lt; Pointer to buffer memory. */<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint32_t totalSize; /*!&lt; Total size in bytes. */<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint32_t offset; /*!&lt; Offset of the next free byte. */<br />
} LinearBuffer;</code></p>
<p><code>/* non-aligned allocation from linear buffer. */<br />
void* linearBufferAlloc(LinearBuffer* buf, uint32_t size) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf || !size)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return NULL;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;/* compare against the space left, so a huge size cannot wrap the offset */<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(size &lt;= buf-&gt;totalSize - buf-&gt;offset) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;void* ptr = buf-&gt;mem + buf-&gt;offset;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;offset += size;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return ptr;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;return NULL; /* out of memory */<br />
}</code></p>
<p>It is of course possible to support aligned allocations by applying the required bitwise operations to the offset during the course of the allocation. This can be incredibly useful, but be aware that depending on the size of the data you’re allocating from your buffer (and in some cases the order in which those allocations are made) you may find that you get some wasted space in the buffer between allocations. This wasted space is typically okay for alignments of a reasonable size, but can get prohibitively wasteful if you are allocating memory which requires a much larger alignment, for example 1MB. The ordering of your allocations from linear allocators can have a drastic effect on the amount of wasted memory in these types of situations.</p>
<p>To reset the allocator (perhaps at the end of a level), all we need to do is set the value of offset to zero. Just as with any other de-allocation, clients of the allocator need to ensure they’re not hanging on to any pointers that you’ve effectively de-allocated by doing this, otherwise they risk reading or overwriting memory that has since been handed out to someone else. Any C++ objects you’ve allocated from the buffer will also need their destructors called manually.</p>
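<p>To make the alignment and reset ideas concrete, here is a minimal sketch. The names <code>linearBufferAllocAligned</code> and <code>linearBufferReset</code> are hypothetical (they are not part of the code above), and the sketch assumes the alignment is a power of two and that the header has the same three fields as the linear buffer.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the linear buffer header described in the post. */
typedef struct LinearBuffer {
    uint8_t *mem;       /* pointer to buffer memory */
    uint32_t totalSize; /* total size in bytes */
    uint32_t offset;    /* offset of the next free byte */
} LinearBuffer;

/* Hypothetical helper: allocate `size` bytes at an address that is a
 * multiple of `align`. `align` must be a non-zero power of two. */
void *linearBufferAllocAligned(LinearBuffer *buf, uint32_t size, uint32_t align) {
    if (!buf || !size || !align || (align & (align - 1)) != 0)
        return NULL;

    /* Round the address of the next free byte up to the alignment. */
    const uintptr_t mask = (uintptr_t)align - 1;
    const uintptr_t curr = (uintptr_t)(buf->mem + buf->offset);
    const uintptr_t padding = ((curr + mask) & ~mask) - curr; /* wasted bytes */

    const uint32_t remaining = buf->totalSize - buf->offset;
    if (padding > remaining || size > remaining - (uint32_t)padding)
        return NULL; /* out of memory */

    void *ptr = buf->mem + buf->offset + padding;
    buf->offset += (uint32_t)padding + size;
    return ptr;
}

/* Hypothetical helper: free everything in one go, e.g. at the end of a level. */
void linearBufferReset(LinearBuffer *buf) {
    if (buf)
        buf->offset = 0;
}
```

<p>The <code>padding</code> value is exactly the wasted space discussed above: a small allocation followed by one with a large alignment can skip nearly a whole alignment's worth of bytes, which is why ordering your allocations matters.</p>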
<h2>The Stack</h2>
<p>Let’s get something out of the way before we start into this allocator. When I talk about a “stack allocator” in this particular case, I’m not talking about the <em>call stack</em>; however, <em>that</em> stack does have an important part to play in avoiding run-time heap allocations, as we shall see later on. So what am I talking about then? Well, the linear allocator I just described above is an excellent solution to many allocation problems, but what if you want slightly more control over how you free up your resources? The stack allocator will give you this.</p>
<p>Towards the end of my description of the linear allocator, I mentioned that to reset the allocator you can simply set the offset to zero in order to free up all the resources in the allocator. The principle of setting the offset to a particular value is the principle that guides the implementation of the stack allocator. If you are not familiar with the concept of the stack data structure then now is probably a good time to get acquainted; you can do so <a href="http://en.wikipedia.org/wiki/Stack_%28data_structure%29">here</a>.</p>
<p>Back? Okay, each allocation from our stack allocator will optionally be able to get a handle which represents the state of the stack allocator just <em>before</em> that allocation was made. This means that if we restore the allocator to this state (by changing its offset) we are effectively ‘freeing’ up the memory to be reused again. This is shown in the diagram below.</p>
<p style="text-align: left;"><img class="size-full wp-image-475 aligncenter" style="display: block; margin-left: auto; margin-right: auto;" title="mem05" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem05.png" alt="" height="472" /></p>
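<p>This can be very useful if you want some memory allocated temporarily for data which has a limited scope, for example the lifetime of a specific function or sub-system. This strategy can also be useful for things like level resources, which have a well-defined order in which objects need to be freed (that is, the reverse of the order in which they were allocated). Here is some example C code to illustrate what I’ve just explained:</p>
<p><code>/* The stack buffer header is assumed to have the same layout as LinearBuffer. */<br />
typedef LinearBuffer StackBuffer;<br />
typedef uint32_t StackHandle;</code></p>
<p><code>void* stackBufferAlloc(StackBuffer* buf, uint32_t size, StackHandle* handle) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf || !size)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return NULL;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;const uint32_t currOffset = buf-&gt;offset;<br />
&nbsp;&nbsp;&nbsp;&nbsp;/* compare against the space left, so a huge size cannot wrap the offset */<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(size &lt;= buf-&gt;totalSize - currOffset) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;uint8_t* ptr = buf-&gt;mem + currOffset;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;offset += size;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(handle)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*handle = currOffset; /* set the handle to old offset */<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return (void*)ptr;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;return NULL; /* out of memory */<br />
}</code></p>
<p><code>/* restore the allocator to the state recorded in handle. */<br />
void stackBufferSet(StackBuffer* buf, StackHandle handle) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(buf &amp;&amp; handle &lt;= buf-&gt;offset) /* a handle can only release memory, never claim it */<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;offset = handle;<br />
}</code></p>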
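<p>As a usage illustration, the sketch below keeps one long-lived allocation and releases two temporary ones by restoring a handle. It repeats the allocator so that it compiles on its own; the <code>StackBuffer</code> layout and the <code>loadLevelExample</code> function are assumptions made for the example.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed header layout: the same three fields as the linear buffer. */
typedef struct StackBuffer {
    uint8_t *mem;
    uint32_t totalSize;
    uint32_t offset;
} StackBuffer;

typedef uint32_t StackHandle;

void *stackBufferAlloc(StackBuffer *buf, uint32_t size, StackHandle *handle) {
    if (!buf || !size)
        return NULL;
    const uint32_t currOffset = buf->offset;
    if (size > buf->totalSize - currOffset)
        return NULL; /* out of memory */
    buf->offset = currOffset + size;
    if (handle)
        *handle = currOffset; /* state just before this allocation */
    return buf->mem + currOffset;
}

void stackBufferSet(StackBuffer *buf, StackHandle handle) {
    if (buf && handle <= buf->offset)
        buf->offset = handle;
}

/* Hypothetical subsystem: the level data stays allocated, the scratch
 * memory is released. Returns the number of bytes still in use. */
uint32_t loadLevelExample(StackBuffer *stack) {
    void *levelData = stackBufferAlloc(stack, 64, NULL);

    StackHandle scratchMark;
    void *scratch = stackBufferAlloc(stack, 128, &scratchMark);
    void *moreScratch = stackBufferAlloc(stack, 32, NULL);
    (void)levelData; (void)scratch; (void)moreScratch;

    /* ... temporary work would happen here ... */

    /* Frees scratch and everything allocated after it in one step. */
    stackBufferSet(stack, scratchMark);
    return stack->offset;
}
```

<p>Note that restoring a handle frees <em>everything</em> allocated after it, not just the allocation the handle came from, which is why the order of allocation matters with this scheme.</p>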
<h2>The Double Stack</h2>
<p>If you’re comfortable with the stack concept above, we can now move on to the double-ended stack. This is similar to the stack allocator we just described except that there are two stacks, one which grows from the bottom of the buffer upward and one which grows from the top of the buffer downward. This is shown in the diagram below.</p>
<p style="text-align: center;"><img class="size-full wp-image-476 aligncenter" title="mem06" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem06.png" alt="" height="191" /></p>
<p>Where would this be useful? A good example would be any situation where you have data of a similar type but with distinctly different lifetimes, or perhaps if you had data that was related and should be allocated from the same static memory buffer (i.e. part of the same subsystem), but had different size properties. It should be noted that it is not mandated where the offsets of the two stacks meet; they don’t have to be exactly half the buffer. In one case the bottom stack can grow very large and the top stack smaller, and vice versa. This added flexibility can sometimes be required to make the best use of your memory buffers.</p>
<p>I don’t think I really need to insult your intelligence by including a code sample for the double stack allocator due to its inherent resemblance to the single stack allocator discussed previously.</p>
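<p>For anyone who would like to see it written down anyway, here is one way the double-ended stack might look. This is only a sketch: the <code>DoubleStackBuffer</code> header and all of the function names are hypothetical, and handles for releasing memory are left out for brevity (they would work as in the single stack, one per end).</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header for a double-ended stack. */
typedef struct DoubleStackBuffer {
    uint8_t *mem;
    uint32_t totalSize;
    uint32_t bottom; /* offset of the first free byte above the lower stack */
    uint32_t top;    /* offset one past the last free byte below the upper stack */
} DoubleStackBuffer;

void doubleStackInit(DoubleStackBuffer *buf, uint8_t *mem, uint32_t size) {
    buf->mem = mem;
    buf->totalSize = size;
    buf->bottom = 0;
    buf->top = size;
}

/* Allocate from the lower stack, which grows upward. */
void *doubleStackAllocBottom(DoubleStackBuffer *buf, uint32_t size) {
    if (!buf || !size || size > buf->top - buf->bottom)
        return NULL; /* the two stacks would collide */
    void *ptr = buf->mem + buf->bottom;
    buf->bottom += size;
    return ptr;
}

/* Allocate from the upper stack, which grows downward. */
void *doubleStackAllocTop(DoubleStackBuffer *buf, uint32_t size) {
    if (!buf || !size || size > buf->top - buf->bottom)
        return NULL; /* the two stacks would collide */
    buf->top -= size;
    return buf->mem + buf->top;
}
```

<p>Both ends share a single free region (<code>top - bottom</code>), so neither stack has a fixed share of the buffer; an allocation only fails when the two would meet.</p>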
<h2>The Pool</h2>
<p>We’re going to shift gears a little now from the family of allocators described above that are based on linearly advancing pointers or offsets and move to something a little different. The pool allocator I’m about to describe is designed to work with data types that are of the same kind or size. It splits up the memory buffer it is managing into equally sized <em>chunks</em>, and then allows the client to allocate and free these chunks at will (see the diagram below). To do this, it must keep track of which chunks are free, and I’ve seen several ways of implementing this. I personally shy away from implementations such as those using a stack of indices (or pointers) to available chunks, due to the extra storage required which can often be prohibitive, but I’ve seen them around. The implementation I shall describe here uses no additional storage to manage which chunks in the pool are free. In fact the header structure for this allocator contains only two member variables, making it the smallest of all the allocators described in this post.</p>
<p style="text-align: center;"><img class="size-full wp-image-477 aligncenter" title="mem07" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem07.png" alt="" height="158" /></p>
<p>So how does it work internally? To manage which chunks are free we’re going to use a data structure known as a <em>linked list</em>. Again, if you’re not acquainted with this data structure then try reading <a href="http://en.wikipedia.org/wiki/Linked_list">this</a>. Coming from a PlayStation3 and Xbox360 background, where memory access is expensive, I generally find node-based structures (such as the linked list) leave a rather sour taste, but I think this is perhaps one of the applications I may approve of. Essentially the header structure for the allocator will contain a pointer to a linked list. The linked list itself is spread throughout the pool, occupying the same space as the free chunks in the memory buffer. When we initialise the allocator, we move through the memory buffer’s chunks and write a pointer in the first four (or eight) bytes of each chunk, with the address (or index) of the next free chunk in the buffer. The header then points to the first element in that linked list. A limitation of storing pointers in the pool’s free chunks in this way is that chunks must be at least the same size as a pointer on your target hardware. See the diagram below.</p>
<p style="text-align: center;"><img class="size-full wp-image-478 aligncenter" title="mem08" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem08.png" alt="" height="180" /></p>
<p>When allocations are made from the pool we simply need to make the linked list pointer in the header structure point at the second element in the linked list and then return the pointer we originally had to the first element. It’s very simple: we always return the first element in the linked list when allocating. Similarly, to free a chunk and return it to the pool, we simply insert it at the front of the linked list. Inserting chunks we want to free at the front of the list as opposed to the back has a couple of benefits: firstly, we don’t need to traverse the linked list (or store an extraneous tail pointer in the header structure) and secondly (and more importantly) we stand a better chance of the node we just freed up being in the cache for subsequent allocations from the pool. After a few allocations and de-allocations your pool might look a little like the diagram below.</p>
<p style="text-align: left;"><img class="size-full wp-image-479 aligncenter" style="display: block; margin-left: auto; margin-right: auto;" title="mem09" src="http://www.spuify.co.uk/wp-content/uploads/2011/02/mem09.png" alt="" height="180" /></p>
<p>Some C code for initialising the allocator and making allocations and de-allocations from it is provided below.</p>
<p><code>/* pool header: the two members described above. */<br />
typedef struct Pool {<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint8_t *mem; /* start of the memory buffer */<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint8_t *head; /* first free chunk, or NULL when the pool is full */<br />
} Pool;</code></p>
<p><code>/* allocate a chunk from the pool. */<br />
void* poolAlloc(Pool* buf) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return NULL;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf-&gt;head)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return NULL; /* out of memory */<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;uint8_t* currPtr = buf-&gt;head;<br />
&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;head = (*((uint8_t**)(buf-&gt;head))); /* follow the link to the next free chunk */<br />
&nbsp;&nbsp;&nbsp;&nbsp;return currPtr;<br />
}</code></p>
<p><code>/* return a chunk to the pool. */<br />
void poolFree(Pool* buf, void* ptr) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf || !ptr)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;*((uint8_t**)ptr) = buf-&gt;head; /* freed chunk becomes the new front of the list */<br />
&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;head = (uint8_t*)ptr;<br />
}</code></p>
<p><code>/* initialise the pool header structure, and set all chunks in the pool as empty.<br />
&nbsp;&nbsp;&nbsp;mem must be suitably aligned for storing a pointer, as must every chunk. */<br />
void poolInit(Pool* buf, uint8_t* mem, uint32_t size, uint32_t chunkSize) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;/* each chunk must be able to hold a pointer, and we need at least one chunk */<br />
&nbsp;&nbsp;&nbsp;&nbsp;if(!buf || !mem || chunkSize &lt; sizeof(uint8_t*) || size &lt; chunkSize)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return;<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;const uint32_t lastChunk = (size / chunkSize) - 1; /* index of the final chunk */<br />
&nbsp;&nbsp;&nbsp;&nbsp;for(uint32_t chunkIndex=0; chunkIndex&lt;lastChunk; ++chunkIndex) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;uint8_t* currChunk = mem + (chunkIndex * chunkSize);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*((uint8_t**)currChunk) = currChunk + chunkSize;<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;*((uint8_t**)&amp;mem[lastChunk * chunkSize]) = NULL; /* terminating NULL */<br />
&nbsp;&nbsp;&nbsp;&nbsp;buf-&gt;mem = buf-&gt;head = mem;<br />
}</code></p>
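<p>To see the free list in action, the sketch below sets up a small pool and shows that the most recently freed chunk is the next one handed out. It repeats the allocator so that it compiles on its own; the <code>Pool</code> layout is inferred from the two-member header described above and <code>poolExample</code> is a hypothetical function. The backing buffer comes from a single up-front <code>malloc</code> purely for brevity.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed header: the two members described in the post. */
typedef struct Pool {
    uint8_t *mem;  /* start of the memory buffer */
    uint8_t *head; /* first free chunk, or NULL when the pool is full */
} Pool;

void poolInit(Pool *buf, uint8_t *mem, uint32_t size, uint32_t chunkSize) {
    if (!buf)
        return;
    buf->mem = buf->head = NULL;
    /* Each chunk must be able to hold a pointer; we need at least one chunk. */
    if (!mem || chunkSize < sizeof(uint8_t *) || size < chunkSize)
        return;
    const uint32_t lastChunk = (size / chunkSize) - 1;
    for (uint32_t i = 0; i < lastChunk; ++i) {
        uint8_t *curr = mem + (i * chunkSize);
        *((uint8_t **)curr) = curr + chunkSize; /* link to the next chunk */
    }
    *((uint8_t **)(mem + (lastChunk * chunkSize))) = NULL; /* end of list */
    buf->mem = buf->head = mem;
}

void *poolAlloc(Pool *buf) {
    if (!buf || !buf->head)
        return NULL;
    uint8_t *curr = buf->head;
    buf->head = *((uint8_t **)curr);
    return curr;
}

void poolFree(Pool *buf, void *ptr) {
    if (!buf || !ptr)
        return;
    *((uint8_t **)ptr) = buf->head;
    buf->head = (uint8_t *)ptr;
}

/* Hypothetical demo: returns 1 if the last freed chunk is reused first. */
int poolExample(void) {
    uint8_t *mem = malloc(64); /* malloc memory is aligned for pointers */
    if (!mem)
        return 0;
    Pool pool;
    poolInit(&pool, mem, 64, 16); /* four 16-byte chunks */

    void *a = poolAlloc(&pool);
    void *b = poolAlloc(&pool);
    poolFree(&pool, a);
    void *c = poolAlloc(&pool); /* comes from the front of the free list */

    const int ok = (a == (void *)mem) && (b == (void *)(mem + 16)) && (c == a);
    free(mem);
    return ok;
}
```

<p>One thing the sketch makes visible: the pool has no way of checking that a pointer passed to <code>poolFree</code> actually belongs to it, or that it hasn’t already been freed, so a debug build may want to add those checks.</p>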
<h2>A Note on Stack-based Allocation (alloca is your friend)</h2>
<p>Earlier on, you may recall that I said there’d be a mention of stack based allocations in the context of the call stack. I’m sure you’ve seen code which conceptually looks something like this:</p>
<blockquote class="posterous_medium_quote"><p>void myFunction(size_t myMemorySize) {</p>
<p>void* myTemporaryMemoryBuffer = malloc(myMemorySize);<br />
/* do processing limited to this function. */<br />
free(myTemporaryMemoryBuffer);<br />
}</p></blockquote>
<p>There is a function which comes with most C compilers which should mean (depending on the size of your allocation) that you won’t have to resort to heap allocations for temporary buffers of this ilk. That function is <em><code>alloca</code></em>. How <code>alloca</code> works internally is architecture dependent, but essentially it performs allocations by adjusting the stack frame of your function to give you an area to write data to; this can be as simple as moving the stack pointer (just like the linear allocator we mentioned right at the start). The memory returned to you by <code>alloca</code> is then freed up when the function returns.</p>
<p>There are a few caveats with using <code>alloca</code> that you should be aware of, however. Be sure to check the size of your allocation to make sure you’re not requesting an unreasonable amount from the stack, as this can cause nasty crashes later on if your stack gets so large that it overflows. For this reason it is also best to know all the places where your function will be called in the context of the program’s overall flow if you’re contemplating allocating a large chunk with <code>alloca</code>. Use of <code>alloca</code> can affect portability in some limited circumstances (it is not part of the C standard), but I’ve yet to come across a compiler that doesn’t support it.</p>
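<p>As an illustration, here is a small sketch of a function that takes its temporary buffer from the call stack rather than the heap. The function <code>reverseBytes</code> and the <code>MAX_STACK_SCRATCH</code> cap are hypothetical, and the sketch assumes a toolchain that provides <code>alloca</code> through <code>&lt;alloca.h&gt;</code> (as glibc does; other compilers expose it elsewhere, for example as <code>_alloca</code> in <code>&lt;malloc.h&gt;</code> with MSVC).</p>

```c
#include <alloca.h> /* assumption: glibc-style header; differs on some toolchains */
#include <stddef.h>
#include <string.h>

/* Hypothetical cap on how much we are willing to take from the call stack. */
#define MAX_STACK_SCRATCH 1024u

/* Reverses `len` bytes from src into dst (src and dst may be the same
 * buffer). Returns 0 on success, -1 if the arguments are invalid or the
 * request is too large to service from the stack. */
int reverseBytes(const unsigned char *src, unsigned char *dst, size_t len) {
    if (!src || !dst)
        return -1;
    if (len == 0)
        return 0;
    if (len > MAX_STACK_SCRATCH)
        return -1; /* too big for the stack: use another allocator instead */

    /* Temporary copy lives in this function's stack frame. */
    unsigned char *scratch = alloca(len);
    memcpy(scratch, src, len);
    for (size_t i = 0; i < len; ++i)
        dst[i] = scratch[len - 1 - i];

    return 0; /* scratch is released automatically when we return */
}
```

<p>The size check comes before the call to <code>alloca</code> because <code>alloca</code> itself gives no indication of failure; if the request doesn’t fit, the stack simply overflows.</p>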
<p>For more details you can consult your favourite search engine.</p>
<h2>A Final Thought&#8230;</h2>
<p>Often the memory one allocates during the course of a processing task is temporary and persists only for the lifetime of a single frame. Taking advantage of this type of knowledge and moving allocations of this type to separate memory buffers is essential to combating the adverse effects of fragmentation in a limited memory system. Any of the above allocation schemes will work, but I would probably suggest going with one of the linear ones, as they are much easier to reset than the pool implementation I’ve described here. Noel Llopis talks about this topic in more detail in this <a href="http://gamesfromwithin.com/start-pre-allocating-and-stop-worrying">excellent blog post</a>.</p>
<p>The right allocator for you depends on many factors and what the problem you’re trying to solve demands. The advice I would offer is to think carefully about the patterns of allocations and de-allocations you wish to perform with your system, think about the sizes and lifetimes of these allocations and try to manage them with allocators that make sense in that context. It can sometimes help to draw memory layouts on paper (yeah, with a pen and everything) or to graph roughly how you expect your memory usage will look over time. Believe it or not, this can really help you to understand how a system produces and consumes data. Doing these things can help you make calls about how to separate your allocations to make memory management as easy, quick and fragmentation-free as possible.</p>
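<p>A per-frame scratch buffer of the kind mentioned above might be sketched as follows. The <code>FrameScratch</code> type and its functions are hypothetical names; it is simply a linear allocator that is reset once per frame, with a high-water mark added so you can see how much of the buffer you really use over time.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-frame scratch buffer built on the linear allocator idea. */
typedef struct FrameScratch {
    uint8_t *mem;
    uint32_t totalSize;
    uint32_t offset;    /* bytes handed out this frame */
    uint32_t highWater; /* largest offset seen in any frame so far */
} FrameScratch;

/* Allocate memory that is only valid until the next call to frameEnd. */
void *frameAlloc(FrameScratch *fs, uint32_t size) {
    if (!fs || !size || size > fs->totalSize - fs->offset)
        return NULL; /* out of scratch memory for this frame */
    void *ptr = fs->mem + fs->offset;
    fs->offset += size;
    if (fs->offset > fs->highWater)
        fs->highWater = fs->offset;
    return ptr;
}

/* Call once at the end of every frame: releases all scratch allocations. */
void frameEnd(FrameScratch *fs) {
    if (fs)
        fs->offset = 0;
}
```

<p>Logging <code>highWater</code> at the end of a play session gives you a quick, cheap data point for sizing the buffer, which ties in with graphing how your memory usage looks over time.</p>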
<p>Remember, what I’ve talked about here is just a small selection of some of the simplest strategies that I tend to favour when writing code for console games. It is my hope that if you’ve made it this far and you weren’t already doing this stuff, your brain is alive with ideas about code you’ve written in the past which could have taken advantage of these techniques to mitigate the substantial drawbacks associated with dynamic memory allocations, or of other strategies you can exploit to solve your memory management problems without dynamic allocation in limited memory systems.</p>
<p>In closing, I believe that programmers should be mindful of the impact of making dynamic allocations (especially in console games) and think twice about using the malloc function or the new operator. It is easy to have the attitude that you’re not doing many allocations so it doesn’t really matter, but this type of thinking spread across a whole team quickly snowballs and results in a death by a thousand paper cuts. If not nipped in the bud, fragmentation and the performance penalties arising from the use of dynamic memory allocations can have catastrophic consequences later on in your development lifecycle which are not easy to solve. Projects where memory allocation and management are not thought through and managed properly often suffer from random crashes after prolonged play sessions due to out-of-memory conditions (which, by the way, are near impossible to reproduce) and cost hundreds of programmer hours spent trying to free up memory and reorganise allocations.</p>
<p><em>Remember: You can never start thinking about memory early enough in your project and you’ll always wish you had started earlier!</em></p>
<h2>More Information</h2>
<p>Here are some links to related topics, or to topics that I didn’t have time to cover:</p>
<p><a href="http://gamesfromwithin.com/start-pre-allocating-and-stop-worrying">http://gamesfromwithin.com/start-pre-allocating-and-stop-worrying</a></p>
<p><a href="http://en.wikipedia.org/wiki/Circular_buffer">http://en.wikipedia.org/wiki/Circular_buffer</a></p>
<p><a href="http://en.wikipedia.org/wiki/Buddy_memory_allocation">http://en.wikipedia.org/wiki/Buddy_memory_allocation</a></p>
<p><a href="http://www.memorymanagement.org/articles/alloc.html">http://www.memorymanagement.org/articles/alloc.html</a></p>
<p>Thanks to Sarah, Jaymin, Richard and Dat for proof reading this drivel.<br />
This post is reproduced over at my personal blog, which you can find <a title="SPUify.co.uk" href="http://www.spuify.co.uk">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.altdevblogaday.com/2011/02/12/alternatives-to-malloc-and-new/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Radix Sort for Humans</title>
		<link>http://www.altdevblogaday.com/2011/01/28/radix-sort-for-humans/</link>
		<comments>http://www.altdevblogaday.com/2011/01/28/radix-sort-for-humans/#comments</comments>
		<pubDate>Fri, 28 Jan 2011 10:51:00 +0000</pubDate>
		<dc:creator>Steven Tovey</dc:creator>
		
		<guid isPermaLink="false">http://altdevblogaday.org/2011/01/28/radix-sort-for-humans/</guid>
		<description><![CDATA[<h2><strong>Introduction</strong></h2>
<p><strong>&#160;</strong>Welcome to my first contribution to the thus far excellent  <a href="http://www.altdevblogaday.com">altdevblogaday.com</a> blogging effort! This post is also reproduced on my  <a href="http://www.spuify.co.uk" target="_blank">personal blog</a>.</p>
<p>On my travels around the games industry I have noticed that many people know about the existence of the Radix Sort, and most know that it&#8217;s typically quick (some even know it&#8217;s non-comparison based, and linear time). What a great number of individuals I&#8217;ve met don&#8217;t seem to know, however, is the nuts and bolts of how it actually works! Given this, I thought I&#8217;d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you&#8217;re scratching your head at the notion of a sort that doesn&#8217;t do any comparisons or wondering how one is able to break free from the shackles of that <em>O(n log n)</em> thing that your Comp Sci. professor told you about, then this post will help you through it.</p>
<p><a href="http://www.altdevblogaday.com/2011/01/28/radix-sort-for-humans/" class="more-link">Read more on Radix Sort for Humans&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<h2><strong>Introduction</strong></h2>
<p><strong>&nbsp;</strong>Welcome to my first contribution to the thus far excellent  <a href="http://www.altdevblogaday.com">altdevblogaday.com</a> blogging effort! This post is also reproduced on my  <a href="http://www.spuify.co.uk" target="_blank">personal blog</a>.</p>
<p>On my travels around the games industry I have noticed that many people know about the existence of the Radix Sort, and most know that it&rsquo;s typically quick (some even know it&rsquo;s non-comparison based, and linear time). What a great number of individuals I&rsquo;ve met don&rsquo;t seem to know, however, is the nuts and bolts of how it actually works! Given this, I thought I&rsquo;d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you&rsquo;re scratching your head at the notion of a sort that doesn&rsquo;t do any comparisons or wondering how one is able to break free from the shackles of that <em>O(n log n)</em> thing that your Comp Sci. professor told you about, then this post will help you through it.</p>
<p>The high-level concept of Radix Sort can be imagined by thinking  about what you could do if you had an array of, say 128, unique 16bit  integers that you wanted to sort. What is perhaps the most obvious way  to do it, if you didn&rsquo;t care about memory utilisation or locality of  your results? A simple way would be to simply use each value in the  array as an index into an array. Something akin to the following:</p>
<p><code>for(uint32_t currentValue=0; currentValue&lt;maxCount; ++currentValue) {<br />
&nbsp;&nbsp;&nbsp;&nbsp; uint32_t value = unsortedArray[currentValue];<br />
&nbsp;&nbsp;&nbsp;&nbsp; sortedArray[value] = value;<br />
}</code></p>
<p>Obviously this is pretty wasteful in this form, but this extremely basic concept gives rise to the foundation of how Radix Sort works. Put another way, we simply aim to put the values into the correct places straight away without any need to compare with any other values. Obviously in the above example there are numerous points of failure and problems that we need to deal with, but we&rsquo;ll get to those shortly.</p>
<p>Okay, so we&rsquo;ve got the concept of roughly what a Radix Sort aims to do; now how do we go about solving the problems with our naive first-pass implementation? Like all good programmers, we&rsquo;ll begin by defining the problems we&rsquo;re trying to solve. Perhaps the most fatal blow to the code above is that the larger the values you have in the list of unsorted values, the greater the amount of memory you will need to accommodate their placement directly into the correct spot in the results array. This is because of the way we&rsquo;re placing the values: we just use the value itself to determine where to store it in the corresponding results array. Another problem is collisions. The word &ldquo;collision&rdquo; here simply means, &ldquo;When two items map to the same location in memory in the results array,&rdquo; for example, when we have the same value twice (or more) in the input array that we&rsquo;re attempting to sort. For the acquainted, the concept is analogous to that of a collision in the context of a hash map. Another problem that we actually wouldn&rsquo;t have with the above (due to it working on 16bit unsigned integers) is what to do with negative values or values of a different type, such as floating point numbers. It is not a stretch of the imagination to contrive use-cases in which our sort should be robust enough to deal with this type of data.</p>
<p>&nbsp;</p>
<h2><strong>Analysing Data Using Elementary Statistical Tools</strong></h2>
<p><strong>&nbsp;</strong>One of the reasons I like Radix Sort is that the solutions to these  seemingly death-dealing problems come in the form of some mathematical  tools you were likely taught at primary school (for those of you that  only speak American, read &ldquo;elementary school&rdquo;). By performing some  rudimentary analysis on the data, we&rsquo;re able to sort it robustly and  efficiently. The analysis that I speak of is something called a <a href="http://en.wikipedia.org/wiki/Histogram"><em>histogram</em></a>,  a simple mathematical tool used to depict the distribution of a  dataset. In its traditional form a histogram is just a bar chart where  the height of the column represents the number of times a given value is  present in a set. Check out the example below:</p>
<p style="text-align: center;"><img class="size-full wp-image-421 aligncenter" title="histogram" src="http://www.spuify.co.uk/wp-content/uploads/2011/01/histogram.png" height="340" alt="" /></p>
<p>To calculate a histogram for an arbitrary set of values we can simply  iterate over all the values in our set. For each value in our array we  maintain a running total of the number of times it has been encountered&hellip;  That&rsquo;s it! (My apologies if you were expecting something more complex).  We have a histogram for our array, and we&rsquo;re well on our way to  conquering Radix Sort. <em>Pause for a moment here, and actually take a  look at the histogram for the dataset you&rsquo;re sorting. It can be  interesting and very illuminating to see just how your data is spread  out and you may gain some interesting insight.</em></p>
<p>An astute reader may have spotted that I actually glossed over an  important implementation detail here. How exactly do you keep track of a running total  for each unique value? Ideally we&rsquo;d like this to be nice and quick, so  we don&rsquo;t want to use some nefarious mapping data structure to keep track  of the histogram. Besides, I <em>did</em> promise that I&rsquo;d keep things  simple so how about a nice, simple, flat array? The problem is if we&rsquo;re  sorting, for example, 32bit integers and want to use the values  themselves as array indices denoting the counter for a specific value,  we&rsquo;re left with the rather inconvenient problem of requiring 2<sup>32</sup> elements just to store the histogram! Not to fear though, as this is where the concept of the radix enters the fray.</p>
<p>If we have a smaller radix, of say 8 bits, and <em>multiple histograms</em> (one for each byte of the 32bit integers we&rsquo;re sorting, a total of four), then the histograms need only a relatively small amount of memory, proportional to the number of distinct values the radix can take (256 for an 8bit radix). All of the histograms can be calculated in parallel (or in the same loop), no matter how many you need for the dataset: you just shift and mask off the bits you&rsquo;re interested in for each histogram. Here&rsquo;s a quick example of what I mean. For an 8bit radix, you&rsquo;d simply do: <code>(x &amp; 0xff)</code>, <code>((x&gt;&gt;8)&amp;0xff)</code>, <code>((x&gt;&gt;16)&amp;0xff)</code> and <code>(x&gt;&gt;24)</code> to access each byte of the 32bit value individually. This type of bit shifting should be immediately familiar to any graphics coders out there, as it is often used to access the individual channels of a 32bit colour value. The code to calculate four histograms (one for each byte) from our 32bit integers ends up looking a little something like this:</p>
<p><code><br /> // Assumes histogram0..histogram3 are arrays of 256 counters, zeroed beforehand.<br /> for(uint32_t currentItem=0; currentItem&lt;maxCount; ++currentItem) {<br />&nbsp;&nbsp;&nbsp;&nbsp; const uint32_t value = unsortedArray[currentItem];<br />&nbsp;&nbsp;&nbsp;&nbsp; // Split the 32bit value into its four bytes, least significant first.<br />&nbsp;&nbsp;&nbsp;&nbsp; const uint32_t value0 = value &amp; 0xff;<br />&nbsp;&nbsp;&nbsp;&nbsp; const uint32_t value1 = (value&gt;&gt;0x8) &amp; 0xff;<br />&nbsp;&nbsp;&nbsp;&nbsp; const uint32_t value2 = (value&gt;&gt;0x10) &amp; 0xff;<br />&nbsp;&nbsp;&nbsp;&nbsp; const uint32_t value3 = (value&gt;&gt;0x18);<br />&nbsp;&nbsp;&nbsp;&nbsp; // Bump the counter for each byte in its own histogram.<br />&nbsp;&nbsp;&nbsp;&nbsp; histogram0[value0]++;<br />&nbsp;&nbsp;&nbsp;&nbsp; histogram1[value1]++;<br />&nbsp;&nbsp;&nbsp;&nbsp; histogram2[value2]++;<br />&nbsp;&nbsp;&nbsp;&nbsp; histogram3[value3]++;<br /> }<br /> </code></p>
<p>At this point I&rsquo;d like to stress that <em>there is absolutely no reason why you must have an 8bit radix</em>. It is common, yes, but 11bit, 16bit, or whatever you want will also work. When you&rsquo;re actually implementing this algorithm you should probably try out a few different radix sizes to see which gives you the best results on your target hardware. It&rsquo;s essentially a trade-off between cache performance when accessing the supporting data structures and doing fewer passes over the input array. In my experience, the more cache-optimal solution (i.e. the smaller radix size and hence the smaller histogram/offset table) increasingly tends to perform best as memory latency grows relative to CPU performance. It&rsquo;s also worth noting that histograms for different subsets of a full dataset can be summed to produce the histogram for the entire set. This trick can be useful when doing this type of thing in parallel, but we&rsquo;ll get to that in due course.</p>
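<p>To make the two points above a little more concrete, here is a rough sketch of digit extraction for an arbitrary radix width, plus the summing of partial histograms. All three function names are hypothetical, and the cap of 16 bits on the radix is an arbitrary assumption of mine to keep the tables a sensible size.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Extracts digit number `pass` (0 = least significant) of `value` for a radix
// of `radixBits` bits. With radixBits = 8 this is the familiar
// (x >> (8 * pass)) & 0xff pattern.
inline uint32_t extractDigit(uint32_t value, uint32_t radixBits, uint32_t pass) {
    assert(radixBits >= 1 && radixBits <= 16);  // arbitrary cap for this sketch
    const uint32_t shift = radixBits * pass;
    assert(shift < 32);  // shifting a 32bit value by 32 or more is undefined
    const uint32_t mask = (1u << radixBits) - 1u;
    return (value >> shift) & mask;
}

// Number of passes needed to cover all 32 bits: 4 for 8 bits, 3 for 11 bits.
inline uint32_t passCount(uint32_t radixBits) {
    return (32u + radixBits - 1u) / radixBits;
}

// Adds the counters of `partial` into `total`. Both histograms must have been
// built for the same digit and the same radix width.
inline void accumulateHistogram(std::vector<uint32_t>& total,
                                const std::vector<uint32_t>& partial) {
    assert(total.size() == partial.size());
    for (std::size_t i = 0; i < total.size(); ++i) {
        total[i] += partial[i];
    }
}
```

<p>With an 11bit radix the third pass only has 10 bits left to look at, which is harmless: the mask simply never sees the missing top bit.</p>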
<p>&nbsp;</p>
<h2><strong>Offset Calculation</strong></h2>
<p>If you cast your mind back to the beginning of this post, I stated that the guiding principle of Radix Sort was to place values at the correct places in the output array immediately without performing any comparisons. To do this we need to know the offset into our output array at which we should start writing each unique value in our input array. The histogram was actually an intermediate step in calculating this list of offsets. To illustrate this I will use a worked example. Consider the following list of unsorted numbers:</p>
<p>1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5</p>
<p>If we wanted to place the value &lsquo;2&rsquo; directly into the output array at  its correct place we would actually need to place it at index 4 (i.e.:  the 5th slot). So the table we&rsquo;re computing will simply tell us that for  any values of &lsquo;2&rsquo; begin placing them at index 4. We then increment the  offset for the value we just placed as we go, so that any subsequently  occurring values which are the same (assuming there are any) go just  after the one we&rsquo;ve placed. The offset table we want to calculate for  the above example would look something like this:</p>
<p>0 &ndash; N/A (There are no 0s in the set, so it doesn&rsquo;t matter!)<br />1 &ndash; 0<br />2 &ndash; 4<br />3 &ndash; 5<br />4 &ndash; 7<br />5 &ndash; 8<br />6 &ndash; 9<br />7 &ndash; 10</p>
<p>So how do we calculate this offset table from a histogram? That&rsquo;s easy: each entry in the table is just the running total of the histogram counts for all the smaller values. So in this example, the offset for &lsquo;2&rsquo; is 4, because we have no &lsquo;0&rsquo;s but four &lsquo;1&rsquo;s, which unsurprisingly is a total of 4! The next offset, for &lsquo;3&rsquo;, is 5 because in our data set we only have one &lsquo;2&rsquo;, and 0+4+1 is 5. The technique of increasing the offset for a given value as you place it in the output array gives rise to a very subtle but important property of Radix Sort, one that is vital for implementations which begin at the least significant byte (LSB Radix Sort): the sort is stable. In the context of sorting algorithms, a sort is said to be stable if the ordering of like values in the unsorted list is preserved in the sorted one. We shall see why this is so important for LSB Radix Sort later on. Incidentally, don&rsquo;t worry about what LSB means just yet, we&rsquo;ll get to that!</p>
<p><img class="size-full wp-image-422 aligncenter" title="cumulativefreq" src="http://www.spuify.co.uk/wp-content/uploads/2011/01/cumulativefreq.png" height="340" alt="" style="display: block; margin-left: auto; margin-right: auto;" /></p>
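<p>The running total described above is what is often called an exclusive prefix sum. Here is a small sketch of it, together with the placement step that bumps each offset as it goes. The names <code>buildOffsetTable</code> and <code>placeValues</code> are made up for illustration, and the code assumes every input value is smaller than <code>valueCount</code>, so it only suits small values like those in the worked example.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts a histogram into a table of starting offsets (an exclusive prefix
// sum): offsets[v] is the number of elements whose value is smaller than v.
std::vector<uint32_t> buildOffsetTable(const std::vector<uint32_t>& histogram) {
    std::vector<uint32_t> offsets(histogram.size());
    uint32_t runningTotal = 0;
    for (std::size_t value = 0; value < histogram.size(); ++value) {
        offsets[value] = runningTotal;     // where the first `value` goes
        runningTotal += histogram[value];  // skip past every occurrence of it
    }
    return offsets;
}

// Places each input value directly at its final position in the output.
// Assumes every value is smaller than `valueCount`.
std::vector<uint32_t> placeValues(const std::vector<uint32_t>& input, uint32_t valueCount) {
    std::vector<uint32_t> histogram(valueCount, 0);
    for (const uint32_t value : input) {
        assert(value < valueCount);
        ++histogram[value];
    }
    std::vector<uint32_t> offsets = buildOffsetTable(histogram);
    std::vector<uint32_t> output(input.size());
    for (const uint32_t value : input) {
        // Write, then bump the offset so an equal value lands in the next slot.
        output[offsets[value]++] = value;
    }
    return output;
}
```

<p>Feeding the histogram of the worked example (0, 4, 1, 2, 1, 1, 1, 1) through <code>buildOffsetTable</code> gives 0, 0, 4, 5, 7, 8, 9, 10, which matches the table above except that the &lsquo;0&rsquo; entry gets a harmless offset of 0 rather than N/A.</p>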
<p>A quick note at this point: these types of analysis tricks can also be applied offline to help you design a good solution around the data you&rsquo;re trying to process. I&rsquo;m hoping that a certain Mr. Acton might be kind enough to share some of the increasingly important statistical tools that have made their way into his bag of tricks at some point on this blog in the future :)</p>
<p>&nbsp;</p>
<h2><strong>What Have We Got So Far?</strong></h2>
<p>The data flow chart below shows what we&rsquo;ve got so far. We&rsquo;ve successfully applied some elementary data analysis to our data, and in the process computed the only supporting data structure we need: a table of offsets for each radix which tells us where to begin inserting each datum in our output array. I find a fun and useful way to visualise what we&rsquo;ve just done is to imagine taking the bars of the histogram for each number in our data set, rotating them 90 degrees, and then laying them end to end. You can imagine these bars coming together to form a contiguous block of memory, with each bar being big enough to hold all the occurrences of the particular value it represents. The offset table is just the indices along this array where each bucket begins. Now comes the fun part, actually moving the data.</p>
<p style="text-align: center;"><img class="size-full wp-image-423 aligncenter" title="flow" src="http://www.spuify.co.uk/wp-content/uploads/2011/01/flow.png" height="677" alt="" width="385" /></p>
<p>Things are going to get a little more complicated here, but only very, very slightly I promise. There are actually two ways to approach data movement in Radix Sort. We can either start with the least significant bits or with the most significant bits (LSB or MSB). Most people with a serial programming mindset tend to go with LSB, so we&rsquo;ll start there and then cover MSB later on, as that has some advantages (predominantly for parallel programming).</p>
<p>&nbsp;</p>
<h2><strong>Serial Offender!</strong></h2>
<p>Radix Sort is typically not an <em>in-place</em> sort, that is to say implementations typically require some auxiliary storage, proportional to the number of elements in the original unsorted list, in order to operate. To move the data around we need to do one pass through our unsorted array for each radix (the number of passes will change depending on the size of each datum being sorted and the size of your radix, of course). Each time we perform a pass, we are actually sorting the data with respect to the particular radix we are considering. I&rsquo;ll begin by discussing this for LSB and then discuss MSB.</p>
<p>The first pass will typically sort by the first byte, the second by the second (but respecting the positioning of those elements with respect to the first) and so on. Hopefully at this point you are able to see both why the stable property of LSB Radix Sort is so important and where it comes from. The data movement pass is just a single pass over the input array for each radix, as mentioned. For each pass, you just read the value in the input array, mask off the bits that are relevant to the particular radix you are considering and then look up the offset in the table of indices we computed earlier. This offset tells you where in the corresponding output array we should move the value to. When you do this, you need to be sure to increment the entry in the offset table so that the next value from the input array with the same binary signature will be written to the next element in the array (not on top of the one you just wrote!). That&rsquo;s all there is to it really! You just do this once for each radix, with the output of one pass becoming the input of the next, something like this:</p>
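<p><code><br /> // First pass: move every value to its slot according to its least significant byte.<br /> // Later passes shift the value first and use their own offset table.<br /> for(uint32_t currentValue=0; currentValue&lt;maxCount; ++currentValue) {<br />&nbsp;&nbsp;&nbsp; const uint32_t value = unsortedArray[currentValue];<br />&nbsp;&nbsp;&nbsp; const uint32_t offsetTableIndex = value &amp; 0xff;<br />&nbsp;&nbsp;&nbsp; const uint32_t writePosition = offsetTable0[offsetTableIndex];<br />&nbsp;&nbsp;&nbsp; // Bump the offset so the next value with this byte lands in the next slot.<br />&nbsp;&nbsp;&nbsp; offsetTable0[offsetTableIndex]++;<br />&nbsp;&nbsp;&nbsp; auxiliaryArray[writePosition] = value;<br /> }<br /> </code></p>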
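<p>Putting the histogram, offset and data movement steps together, a complete LSB Radix Sort for 32bit unsigned integers might look roughly like the sketch below. This is my own stitching together of the pieces, not a tuned implementation: the function name <code>radixSortLsb</code>, the use of <code>std::vector</code> and the fixed 8bit radix are all assumptions made for illustration.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts `values` into ascending order with a four-pass, 8bit LSB Radix Sort.
void radixSortLsb(std::vector<uint32_t>& values) {
    constexpr uint32_t kRadixBits = 8;
    constexpr uint32_t kBucketCount = 1u << kRadixBits;
    constexpr uint32_t kPassCount = 32 / kRadixBits;

    // One histogram per byte, all built in a single pass over the input.
    std::array<std::array<std::size_t, kBucketCount>, kPassCount> tables{};
    for (const uint32_t value : values) {
        for (uint32_t pass = 0; pass < kPassCount; ++pass) {
            ++tables[pass][(value >> (pass * kRadixBits)) & (kBucketCount - 1)];
        }
    }

    // Turn each histogram into a table of starting offsets, in place.
    for (auto& table : tables) {
        std::size_t runningTotal = 0;
        for (std::size_t& entry : table) {
            const std::size_t count = entry;
            entry = runningTotal;
            runningTotal += count;
        }
        assert(runningTotal == values.size());
    }

    // One stable data movement pass per byte, least significant byte first.
    std::vector<uint32_t> auxiliary(values.size());
    for (uint32_t pass = 0; pass < kPassCount; ++pass) {
        auto& offsets = tables[pass];
        for (const uint32_t value : values) {
            const uint32_t digit = (value >> (pass * kRadixBits)) & (kBucketCount - 1);
            auxiliary[offsets[digit]++] = value;
        }
        values.swap(auxiliary);  // this pass's output is the next pass's input
    }
}
```

<p>All four histograms can be computed up front because moving the data around never changes how many times each byte value occurs. Swapping the two buffers after every pass means the latest result always sits in <code>values</code>.</p>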
<p>So that&rsquo;s our LSB Radix Sort for integers, but what if you want to sort floating point numbers, or indeed negative numbers? I&rsquo;m sure if you&rsquo;re reading this you&rsquo;re probably aware of the common storage formats for floating point values (the most common being IEEE 754) and of two&rsquo;s complement for signed integers. How does Radix Sort deal with these things? The answer is: perfectly well, if you apply a simple transformation to the input data. In time-honoured fashion I&rsquo;ll leave this as an exercise for the reader, as I don&rsquo;t have time to delve into it now. If you&rsquo;re really struggling then please leave a comment or something and I&rsquo;ll try and find time to write it up for you :)</p>
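<p>For anyone who would like to check their answer to that exercise, here is a sketch of one widely used transformation. It is not necessarily the one the post has in mind. The idea is to map each value to an unsigned key whose ordering matches the ordering of the original values, sort the keys, and then map them back. The function names are hypothetical, and the code assumes 32bit IEEE 754 floats and no NaNs in the input.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Two's complement integers: flipping the sign bit moves negatives below
// positives while keeping the order within each group.
inline uint32_t keyFromInt(int32_t value) {
    return static_cast<uint32_t>(value) ^ 0x80000000u;
}

inline int32_t intFromKey(uint32_t key) {
    const uint32_t bits = key ^ 0x80000000u;
    int32_t value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// IEEE 754 floats: negatives get every bit flipped (which reverses their
// order and puts them below positives), positives get only the sign bit flipped.
inline uint32_t keyFromFloat(float value) {
    static_assert(sizeof(float) == sizeof(uint32_t), "assumes 32bit floats");
    assert(value == value && "NaN has no meaningful sort position");
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    const uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
    return bits ^ mask;
}

// Inverse of keyFromFloat: a set top bit means the key came from a positive.
inline float floatFromKey(uint32_t key) {
    const uint32_t mask = (key & 0x80000000u) ? 0x80000000u : 0xFFFFFFFFu;
    const uint32_t bits = key ^ mask;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

<p>In practice the transformation can be folded into the histogram pass and undone in the final data movement pass, so it need not cost an extra trip over the data.</p>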
<p>&nbsp;</p>
<h2><strong>Making the Best of a Parallel Situation</strong></h2>
<p>So we&rsquo;ve covered how Radix Sort works when starting from the LSB, and it seems to work nicely, so why would we want to complicate things any further? This part will talk about how sorting data beginning with the most significant digits changes the nature of the problem and how we can use this to our advantage in this increasingly parallel world.</p>
<p>If you think about what you actually have at each stage of the data movement passes, you will see that using MSB Radix Sort gives rise to a particularly interesting property: the first pass over the data actually produces <em>n</em> <em>independent</em> sorting problems, where <em>n</em> is the number of distinct values the radix can take. If you can&rsquo;t see why the parallel programmer in you is now rubbing his/her hands together, don&rsquo;t worry, here&rsquo;s why!</p>
<p>The first pass over the data in the context of an MSB Radix Sort actually has the effect of roughly bucketing the data according to its most significant byte (or whatever size radix you&rsquo;re working with). To put it another way, you know that after each pass of MSB Radix Sort, none of the values in the buckets will ever need to be moved into other buckets (which is of course not the case with LSB Radix Sort). This important property actually opens up the potential for us to distribute the sorting of the different buckets over multiple processing elements, crucially with each element working on independent data, making synchronisation much simpler. To perform MSB in serial you just re-apply Radix Sort to each bucket independently, using the next most significant byte each time. Another interesting side effect of doing MSB Radix Sort is that each application of Radix Sort makes the data more and more cache friendly for subsequent passes, because each bucket is a smaller contiguous block of memory.</p>
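<p>The serial version of that idea can be sketched as a short recursive function: bucket a range by its current byte, then re-apply the same step to each bucket using the next byte down. The names <code>radixSortMsbRange</code> and <code>radixSortMsb</code> are hypothetical and the 8bit radix is again just an assumption. Since the buckets never overlap, each recursive call could in principle be handed to a different processing element, but that part is not shown here.</p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts values[begin, end) by the byte found at `shift`, then recursively
// sorts each resulting bucket by the next less significant byte.
// `scratch` must be at least as large as `values`.
void radixSortMsbRange(std::vector<uint32_t>& values, std::vector<uint32_t>& scratch,
                       std::size_t begin, std::size_t end, int shift) {
    assert(begin <= end && end <= values.size() && scratch.size() >= values.size());
    if (end - begin < 2) {
        return;  // zero or one element is already sorted
    }

    // Histogram stored one slot to the right so the prefix sum below turns
    // it into bucket boundaries: bucket d spans [bucketStart[d], bucketStart[d + 1]).
    std::array<std::size_t, 257> bucketStart{};
    for (std::size_t i = begin; i < end; ++i) {
        ++bucketStart[((values[i] >> shift) & 0xffu) + 1];
    }
    bucketStart[0] = begin;
    for (std::size_t d = 0; d < 256; ++d) {
        bucketStart[d + 1] += bucketStart[d];
    }

    // Move every value into its bucket via the scratch buffer, then copy back.
    std::array<std::size_t, 256> writePosition{};
    for (std::size_t d = 0; d < 256; ++d) {
        writePosition[d] = bucketStart[d];
    }
    for (std::size_t i = begin; i < end; ++i) {
        const uint32_t digit = (values[i] >> shift) & 0xffu;
        scratch[writePosition[digit]++] = values[i];
    }
    for (std::size_t i = begin; i < end; ++i) {
        values[i] = scratch[i];
    }

    if (shift == 0) {
        return;  // no less significant byte left to sort by
    }
    // Each bucket is now an independent sorting problem.
    for (std::size_t d = 0; d < 256; ++d) {
        radixSortMsbRange(values, scratch, bucketStart[d], bucketStart[d + 1], shift - 8);
    }
}

// Sorts `values` into ascending order, starting from the most significant byte.
void radixSortMsb(std::vector<uint32_t>& values) {
    std::vector<uint32_t> scratch(values.size());
    radixSortMsbRange(values, scratch, 0, values.size(), 24);
}
```

<p>A real implementation would probably switch to a simpler sort once a bucket gets small, since building a 256-entry table for a handful of values is wasteful, but that is a tuning decision outside the scope of this sketch.</p>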
<p>&nbsp;</p>
<h2><strong>Wrapping it up&#8230;</strong></h2>
<p>So, there you have it. Radix Sort explained (hopefully well enough) for humans! I hope after reading this you have a strong mental model of how Radix Sort works in practice and moves data around in memory, and importantly how it is able to break free of the O(n log n) lower bound on comparison-based sorting algorithms. Just a quick disclaimer for some of my more performance-minded readers: I make no guarantees as to the optimality of my implementation; the purpose here was to explain the concepts behind the algorithm, not to write an optimal implementation.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.altdevblogaday.com/2011/01/28/radix-sort-for-humans/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

