Sunday, December 17, 2006

Implicit memcpy(3) calls

Peter Hosey was kind enough to document the speed difference between calling calloc() and calling malloc() followed by bzero(). They are two different ways to allocate and zero a block of memory, but it turns out that calloc() is much faster.
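For reference, here is a minimal sketch of the two approaches being compared. The function names are made up for illustration, and memset() stands in for bzero() purely for portability; this is not Peter's benchmark code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate n bytes, then zero them in a second step.
   memset(p, 0, n) is used here as the portable equivalent of bzero(p, n). */
void *alloc_then_zero(size_t n)
{
    assert(n > 0);
    void *p = malloc(n);
    if (p != NULL)
        memset(p, 0, n);
    return p;
}

/* Hypothetical helper: ask calloc() for n zeroed bytes in a single call. */
void *alloc_zeroed(size_t n)
{
    assert(n > 0);
    return calloc(1, n);
}
```

Both return a block whose bytes all read as zero; the difference is only in how the zeroing gets done.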

I was wondering how much of that speed difference is the result of having one fewer function call in your code, and that reminded me of a lesson I learned a few months ago...

Some time ago, I picked up the practice of initializing all local variables at their declarations, like:

double aDouble = 0.0;

In many cases, it's not really necessary, and the coders who disagreed with the practice correctly pointed out that I was simply adding extra code and extending the execution time. My reasoning was that not having garbage data in a local variable is worth the extra few cycles it takes to zero an integer or even a double.

At the time, I was a PowerPlant junkie, and would initialize LStrings with Str_Empty or whatever it was called. And all was good.

Since I've switched to Xcode, I've been using C strings more than ever before. NSMutableString is great when it's possible to use it, but creating, modifying and releasing a few hundred million of them gets annoying. So I was using lots of C strings and initializing them to zeroed bytes like so:

char aString[1000] = {0};

I knew that only the first byte needed to be zero to qualify the local variable as a null string, but I didn't like garbage data. In addition, if any of my subsequent string handling code accidentally left out the null terminator, it would still be there. It was a win/win situation, and it worked flawlessly.

As it turned out, my adoption of the initialize-blindly-at-declaration practice slowed things down considerably when dealing with C strings. After optimizing all I could, Shark told me that about 55% of my code's running time was spent in memcpy(). At first I thought that was a good sign. After all, I was not calling memcpy() explicitly in my code, so it must have been invoked by printf, scanf, and the like. And I figured that when more than half the time is spent in low-level system functions, I've done all I can.

Out of curiosity, I disassembled the app I was working on with otool and searched for "memcpy". Holy shit, it's everywhere. It's not only being invoked from printf/scanf (which Shark would have told me if I had asked). This is when the lesson sank in: zeroing an entire C string of arbitrary length is a Bad Idea™. gcc translated my harmless {0} into a memcpy call. Combine that with 10 or 20 C strings per method, with each method being called a few million times, and I began to see why memcpy was the latest culprit according to Shark. Although I was not calling memcpy() explicitly, gcc was. A lot.

So I still initialize atomic variables in their declarations, but I'm much more careful when using composite data types. Before typing this post, I just used my instinct to decide which way to go. But since Peter was generous enough to provide some empirical data, I may as well follow suit.

Not surprisingly, PowerPC and Intel code behave quite differently. Here's what I found when using the initialize-at-declaration approach for C strings. The following disassembly is modified output from otx.

On PowerPC, if the char array is 32 bytes or smaller, gcc produces inline load/store instructions. For example, a 16-byte string:

3c400004  lis   r2,0x4
3842ef54  addi  r2,r2,0xef54
80020000  lwz   r0,0x0(r2)
81220004  lwz   r9,0x4(r2)
81620008  lwz   r11,0x8(r2)
8042000c  lwz   r2,0xc(r2)
901e0018  stw   r0,0x18(r30)
913e001c  stw   r9,0x1c(r30)
917e0020  stw   r11,0x20(r30)
905e0024  stw   r2,0x24(r30)

where the data pointed to by 0x3ef54 is a bunch of zeroes, and r30 is a copy of the stack pointer.

If the char array is greater than 32 bytes, gcc inserts a call to memcpy().

On Intel, if the char array is 64 bytes or smaller, gcc produces inline move instructions. For example, the same 16-byte string:

c745e800000000  movl  $0x00000000,0xffffffe8(%ebp)
c745ec00000000  movl  $0x00000000,0xffffffec(%ebp)
c745f000000000  movl  $0x00000000,0xfffffff0(%ebp)
c745f400000000  movl  $0x00000000,0xfffffff4(%ebp)

If the char array is greater than 64 bytes, gcc inserts a call to either memset() or memcpy().

So apparently, once zeroing the array would take more than about 16 inline instructions (eight load/store pairs for 32 bytes on PowerPC, sixteen stores for 64 bytes on Intel), gcc inserts a call to memcpy() or memset() instead. Moral of the story: zero only the first byte of a C string, unless zeroing the rest is absolutely necessary or speed doesn't matter.
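To make the two options concrete, here is a rough sketch that builds the same string both ways. The function names, BUF_SIZE, and the "value=%d" format are invented for illustration; only the two initialization styles come from the discussion above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 1000  /* illustrative size, matching the 1000-byte example */

/* Zeroes all BUF_SIZE bytes at the declaration, then formats into the buffer. */
size_t format_full_zero(char *out, size_t out_size, int value)
{
    char buf[BUF_SIZE] = {0};  /* every byte is zeroed here */
    assert(out != NULL && out_size > 0);
    snprintf(buf, sizeof buf, "value=%d", value);
    snprintf(out, out_size, "%s", buf);
    return strlen(out);
}

/* Zeroes only the first byte, which is enough to make buf an empty string. */
size_t format_first_byte(char *out, size_t out_size, int value)
{
    char buf[BUF_SIZE];
    buf[0] = '\0';             /* the other 999 bytes stay uninitialized */
    assert(out != NULL && out_size > 0);
    snprintf(buf, sizeof buf, "value=%d", value);
    snprintf(out, out_size, "%s", buf);
    return strlen(out);
}
```

Both functions hand back the same string; the second one simply skips the work of clearing bytes that are about to be overwritten or never read.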


ken said...

It seems possible that the problem is not the call to memcpy, the problem is the amount of work being done.

I'd be curious to see the result of writing the assembly out by hand to zero the string without calling memcpy. Would it be any faster, or just harder to diagnose? It'd definitely increase binary size if you have a lot of those 1000 character strings around.

Blake said...

"It seems possible that the problem is not the call to memcpy, the problem is the amount of work being done."

I would say the bigger problem is indeed the amount of work being done, but in my case, the unnecessary function call also added lots of overhead.

"I'd be curious to see the result of writing the assembly out by hand to zero the string without calling memcpy."

Doing that would be slightly faster, due to the lack of function call and return code. And it would definitely increase the binary size, as you mentioned. The lesson I took from this was that it's Bad Design™ to zero every byte in a C string. It allows for sloppy code that doesn't properly apply the null terminator, and it greatly increases the execution time, regardless of the method you choose to zero the bytes. The better way seems to be to simply zero the first byte.
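As a sketch of what "properly applying the null terminator" can look like, here is a hypothetical copy helper (the name and signature are invented) that writes the terminator itself instead of relying on a pre-zeroed buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies at most dst_size - 1 bytes of src into dst
   and always writes the terminator explicitly. Returns the bytes copied. */
size_t copy_terminated(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    size_t n = strlen(src);
    if (n > dst_size - 1)
        n = dst_size - 1;      /* truncate so the terminator still fits */
    memcpy(dst, src, n);
    dst[n] = '\0';             /* terminator is applied on every call */
    return n;
}
```

With a helper like this, the destination can start out full of garbage and still end up as a valid C string.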

manish said...

nice post, i see that u have used a tool named "Shark". i m not able to find it on google. could u please send me a link on this tool?

Blake C. said...

Sorry for the late reply, Manish. Shark is a performance tool that is part of the Mac OS X 10.4 developer tools (and perhaps even earlier). If you're writing code on another platform, you probably won't see it. Search for more info.