Fortunately Apple chose to install nasm by default (well, when you install the developer tools), so we just need to understand how a Mach-O binary is laid out. There is official documentation about the file format (and the header files it references are installed with the dev tools). The otool utility also gives a good idea of what is in a real binary (try running it with the -l option on the previous static binary to see what you get).
Anyway, as with last time, I still want this executable to not require any dirty tricks, although looking at the file format there aren't many obvious ones we could employ (at least compared to some of the stunts you can pull with ELFs).
So what needs to be in a valid Mach-O? The header, obviously, is required, followed by a number of load commands. It turns out (through a bit of reading the source) that you only actually need an LC_UNIXTHREAD (or LC_THREAD) command and the executable will load. Unlike most other executable formats, it seems the entry point is not a field in the headers but is inferred from the initial thread context.
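To make the "entry point comes from the thread context" idea concrete, here is a rough Python sketch (Python is my choice for illustration, not something the original toolchain uses) that walks the load commands of a 32-bit x86 Mach-O image and pulls EIP out of the LC_UNIXTHREAD command. The names entry_point and EIP_INDEX are made up for this example, and it assumes a little-endian image with the i386 register order used in the listing further down.

```python
import struct

MH_MAGIC = 0xFEEDFACE    # 32-bit Mach-O magic, as in the listing
LC_UNIXTHREAD = 0x5
I386_THREAD_STATE = 1    # flavor value used in the listing
EIP_INDEX = 10           # __eip is the 11th dword of the i386 thread state


def entry_point(data):
    """Return the initial EIP of a 32-bit x86 Mach-O image, or None."""
    fields = struct.unpack_from("<7I", data, 0)
    magic, ncmds = fields[0], fields[4]
    if magic != MH_MAGIC:
        raise ValueError("not a 32-bit little-endian Mach-O image")
    offset = 28  # the mach_header is seven dwords
    for _ in range(ncmds):
        cmd, cmdsize = struct.unpack_from("<2I", data, offset)
        if cmdsize < 8:
            raise ValueError("corrupt load command")
        if cmd == LC_UNIXTHREAD:
            flavor, count = struct.unpack_from("<2I", data, offset + 8)
            if flavor == I386_THREAD_STATE and count > EIP_INDEX:
                regs = struct.unpack_from("<%dI" % count, data, offset + 16)
                return regs[EIP_INDEX]
        offset += cmdsize  # every load command states its own size
    return None
```

Feeding it the bytes of a file (for example open(path, "rb").read()) should give back whatever address the thread state stores in __eip. It does not handle fat binaries, 64-bit images or byte-swapped files.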
Of course, without any code in memory this isn't exactly useful (well, not immediately), so we also need to specify an LC_SEGMENT load command. This will map some of our binary into memory, and then we are ready to go. As a short aside, if you look at the output of otool -l, most segments also contain sections; as far as I can tell these are unnecessary and are more metadata to make linking more consistent.
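As a quick illustration of how little a section-less segment command needs, the following Python sketch packs one with the struct module. The helper name pack_segment and the format string are my own; the field order simply mirrors the LC_SEGMENT part of the listing, so treat it as a model of the layout rather than anything official.

```python
import struct

LC_SEGMENT = 0x1
# cmd, cmdsize, segname[16], then vmaddr, vmsize, fileoff, filesize,
# maxprot, initprot, nsects, flags
SEGMENT_FMT = "<2I16s8I"


def pack_segment(name, vmaddr, vmsize, fileoff, filesize,
                 maxprot, initprot, flags=0):
    """Pack a 32-bit LC_SEGMENT command that carries no section headers."""
    cmdsize = struct.calcsize(SEGMENT_FMT)  # 56 bytes when nsects is 0
    return struct.pack(
        SEGMENT_FMT,
        LC_SEGMENT, cmdsize,
        name.encode("ascii"),  # struct pads the name to 16 bytes with NULs
        vmaddr, vmsize, fileoff, filesize,
        maxprot, initprot,
        0,                     # nsects: no section headers follow
        flags,
    )
```

For example, pack_segment("__TEXT", 0x1000, 0x1000, 0, 172, 7, 5, 4) produces the same 56 bytes that the segment part of the listing assembles to, assuming the final file is 172 bytes long.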
; A basic Mach-O executable
; (c) Tyranid 2010
BITS 32
ORG 0x1000
_program_start:
; mach_header
dd 0xfeedface ; MH_MAGIC
dd 7 ; cputype
dd 3 ; cpusubtype
dd 2 ; filetype
dd 2 ; ncmds
dd _cmd_end-_cmd_start ; sizeofcmds
dd 0x2001 ; flags
_cmd_start:
_segment_cmd:
dd 1 ; LC_SEGMENT
dd _segment_cmd_end-_segment_cmd ; cmdsize
_segment_name: ; segname
db "__TEXT"
times 16-$+_segment_name db 0
dd _program_start ; vmaddr
dd ((_program_end-_program_start)+4095)&~4095 ; vmsize
dd 0 ; fileoff
dd _program_end-_program_start ; filesize
dd 7 ; maxprot
dd 5 ; initprot
dd 0 ; nsects
dd 4 ; flags
_segment_cmd_end:
_thread_cmd_start:
dd 5 ; LC_UNIXTHREAD
dd _thread_cmd_end-_thread_cmd_start ; cmdsize
dd 1 ; flavor (i386_THREAD_STATE)
dd (_registers_end-_registers_start)/4 ; count
_registers_start:
dd 0 ; unsigned int __eax;
dd 0 ; unsigned int __ebx;
dd 0 ; unsigned int __ecx;
dd 0 ; unsigned int __edx;
dd 0 ; unsigned int __edi;
dd 0 ; unsigned int __esi;
dd 0 ; unsigned int __ebp;
dd 0 ; unsigned int __esp;
dd 0x1F ; unsigned int __ss;
dd 0 ; unsigned int __eflags;
dd _start ; unsigned int __eip;
dd 0x17 ; unsigned int __cs;
dd 0x1F ; unsigned int __ds;
dd 0x1F ; unsigned int __es;
dd 0 ; unsigned int __fs;
dd 0 ; unsigned int __gs;
_registers_end:
_thread_cmd_end:
_cmd_end:
_start:
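; Call exit(42) via int 0x80 (BSD convention: arguments go on the stack)
push byte 42 ; exit status
push byte 1
pop eax ; eax = 1 (SYS_exit)
push eax ; dummy slot where the kernel expects a return address
int 0x80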
_program_end:
Throw it through nasm in binary mode and what do we get? 172 bytes, far smaller. There are some further tricks you could play with this, such as embedding the code inside the thread context (as only EIP and probably the segment registers are important), or storing a few of the necessary values in the context to slightly reduce the number of pushes. Still, 172 is alright for now. Can it go any lower?
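If you want to see where the 172 bytes go, here is a small Python tally (again just a sketch; the format strings are my own shorthand for the structures in the listing, and the instruction sizes are the usual 32-bit encodings). It also repeats the page rounding used for the vmsize field.

```python
import struct

PAGE = 4096

header = struct.calcsize("<7I")         # mach_header: seven dwords
segment = struct.calcsize("<2I16s8I")   # LC_SEGMENT with nsects = 0
thread = struct.calcsize("<4I16I")      # LC_UNIXTHREAD header + 16 registers
code = 2 + 2 + 1 + 1 + 2                # push 42; push 1; pop eax; push eax; int 0x80

total = header + segment + thread + code
vmsize = (total + PAGE - 1) & ~(PAGE - 1)  # same rounding as the vmsize field

print(header, segment, thread, code, total, vmsize)
```

That works out to 28 + 56 + 80 + 8 = 172 bytes on disk, rounded up to a single 4096-byte page in memory. The thread command is the largest piece, which is why the tricks involving the thread context look tempting.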