This document summarizes memory management in JX, which lacks an MMU. It discusses two levels of memory management - a global level to avoid fragmentation and a domain-local level managed by garbage collection. It describes stack overflow detection without an MMU and different garbage collection algorithms implemented, including copying and compacting collectors. The copying collector uses two semi-spaces to compactly allocate live objects, while the compacting collector calculates new object positions without changing their order.
Besides the CPU, memory is the other fundamental primary resource. This chapter describes the memory management architecture of JX. The first part of the chapter explains the problems of the global memory management, the second part explains the memory object abstraction.

1 Global and domain-local memory management

Memory protection in JX is based on the use of a type-safe instruction set. No memory management hardware (MMU) is necessary. The whole system, including all applications, runs in one physical address space. This makes the system ideally suited for small devices that lack an MMU. But it also leads to several problems. In a traditional system fragmentation is not an issue for the user-level memory allocator, because memory that is allocated but not actively used is paged to disk. In JX unused memory is wasted main memory. So we face a similar problem as kernel memory allocators in UNIX, where kernel memory usually also is not paged and therefore is a scarce resource. In UNIX a kernel memory allocator is used for vnodes, proc structures, and other small objects. In contrast to this, the JX kernel does not create many small objects. It allocates memory for a domain's heap, and the small objects live in the heap. The heap is managed by a garbage collector. In other words, the JX memory management has two levels: a global management, which must cope with large objects and avoid fragmentation, and a domain-local garbage-collected memory.

Global memory management. The global memory is managed by using a bitmap allocator [187]. This allocator was easy to implement, it automatically joins free areas, and it has a very low memory footprint. On the other hand, there is nothing in the system's design or implementation that prevents us from using another allocator.

Domain-local memory management. A domain has two memory areas: an area where objects may be moved by the garbage collector and an area where they are fixed.
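A minimal sketch of such a bitmap allocator (an illustration under simplifying assumptions, not the JX code; one byte per block stands in for one bit, and all names are hypothetical):

```c
#define NBLOCKS 64
static unsigned char bitmap[NBLOCKS];   /* 1 = block allocated; one byte per block for clarity */

/* Allocate n contiguous blocks; returns the first block index or -1. */
int bm_alloc(int n) {
    for (int i = 0; i + n <= NBLOCKS; i++) {
        int j = 0;
        while (j < n && !bitmap[i + j]) j++;   /* measure the free run starting at i */
        if (j == n) {
            for (j = 0; j < n; j++) bitmap[i + j] = 1;
            return i;
        }
        i += j;                                /* skip past the allocated block */
    }
    return -1;
}

/* Freeing simply clears the bits; adjacent free areas join automatically. */
void bm_free(int first, int n) {
    for (int j = 0; j < n; j++) bitmap[first + j] = 0;
}
```

The automatic joining of free areas that the text mentions falls out of the representation: two adjacent freed runs are indistinguishable from one large free run.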
In the future, a single area may suffice, but then all data structures that are used by a domain must be movable. Currently, the fixed area contains the code and class information. Moving these objects requires an extension of the system: all pointers to these objects must be known to the GC and updated; for example, when moving a code component the return addresses on all stack frames must be adjusted.

Stack overflow detection and null pointer check. A system design without an MMU means that several of its responsibilities (besides protection) must be implemented in software. One example is stack overflow detection, another one null pointer detection.

Stack overflow detection is implemented in JX by inserting a stack size check at the beginning of each method. This is feasible, because the required size of a stack frame is known before the method is executed. The size check includes a reserve, in case the Java method must trap to a runtime function in DomainZero, such as checkcast. A stack size check must be performed whenever a method is entered. To store its local and temporary variables the method needs a stack frame of a known size. The check code must test whether the stack has enough space for the frame and otherwise throw an exception or enlarge the stack. Enlarging the stack is only possible with a more sophisticated stack management that currently is not implemented. The check code must be fast because it is executed very often. We implemented two versions of the check. Version 1 (STACK0) aligns all stacks at an address that is a multiple of the stack size, which must be a power of two. The check code adds the frame size to the stack pointer and sets the lower bits of the result to zero. If the result is larger than the original stack pointer, the frame would overflow the stack and an exception is thrown. Version 2 (STACK1) uses the current thread pointer to access the stack boundary that is stored in the TCB.
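The STACK1 variant reduces to a comparison against the boundary kept in the TCB. Sketched in C (an illustration only; in JX the translator emits this as inline machine code at every method entry, and the downward-growing stack and the field name are assumptions):

```c
/* Sketch of a STACK1-style check: the stack grows downward and the TCB
 * stores the lowest usable stack address. */
typedef struct {
    unsigned long stackBoundary;   /* lowest address of the stack */
} TCB;

/* Returns 1 if a frame of frameSize bytes fits, 0 if it would overflow. */
int stack1_check(const TCB *tcb, unsigned long sp, unsigned long frameSize) {
    return sp - frameSize >= tcb->stackBoundary;
}
```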
It compares the stack pointer plus the frame size to the stack boundary and throws an exception if the boundary is exceeded.

The null pointer check currently is implemented using the debug registers of the Pentium processor. They can be programmed to raise an exception when data or code at address zero is accessed. On architectures that do not provide such a feature, the compiler inserts a null pointer check before a reference is used.

2 Heap

Objects in Java and other type-safe runtime systems, such as Inferno and ML, are stored on a garbage-collected heap [100]. The garbage collector usually is able to move the objects to compact the heap. Moving low-level data structures, such as thread control blocks and stacks, was a special challenge that we had to cope with.

Several different garbage collection algorithms can be used. No single GC algorithm will fit all applications. The Inferno VM, for example, uses reference counting (together with an additional mechanism to periodically collect cycles) for predictable response times. A reference counting GC has disadvantages, such as high runtime overhead, bad multiprocessor scalability, and the need for an additional cycle collecting mechanism, that make it unsuitable for many applications. Therefore we decided not to use a single GC but to define an interface between the GC and the rest of the runtime system to allow different GCs to be used. Currently three collectors are implemented: a copying collector with fixed heap size (COPY), a copying collector with dynamically changing heap size (CHUNKED), and a compacting collector (COMPACTING).

COPY. This collector is an exact, copying, non-generational GC. An exact GC always knows whether a word is a reference or a data value. A copying collector starts with a root set of object references, copies the object graph that is spanned by these references to a new heap, and deallocates the old heap.
This is in contrast to conservative GCs, which do not have this information and must assume that a word may be either a reference or a data value. They can be used for unsafe languages, such as C and C++. The shortcoming of these collectors is the inability to move objects, because they cannot update references to objects.

The JX implementation of this algorithm does not need any additional data structures and is not recursive (see Algorithm 8.1). Two so-called semi-spaces are used as a heap. Objects are allocated in one semi-space. When this semi-space runs full, the garbage collector copies all live objects into the other semi-space, which then is also used to allocate objects. This collector has some interesting properties: After a GC run the heap is in compact form. There is no need to search for free space during an allocation. Objects are always allocated at the top of the heap by simply advancing a pointer, which is a very fast operation. Deallocation is done by discarding the original semi-space. Therefore the time of one GC run is independent of the number of dead objects; it is proportional to the number of live objects. Because objects are moved, pointers to objects must be updated. It is necessary to be able to locate all pointers to objects. This requires special attention within the microkernel (see Section 2.2).

CHUNKED. The heap consists of small linked chunks instead of one large block. This allows the heap to grow and shrink according to the memory needs of the domain. A group of domains can use a certain amount of memory cooperatively. A limit can be specified for the sum of the heap sizes of such a group. If this limit is reached and a domain needs to allocate additional memory, a garbage collection is started in this domain; if the collection does not release the necessary amount of memory, the domain is blocked, waiting for the other domains in the group to release memory.
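The cooperative heap limit of CHUNKED can be sketched as a shared budget; the names and the granularity (bytes rather than chunks) are assumptions of this illustration:

```c
/* Sketch: a group of domains shares one heap budget. Growing a heap
 * beyond the budget fails, and the caller must run a GC and, if that
 * does not help, block until another domain releases memory. */
typedef struct {
    long limit;   /* maximum sum of all heap sizes in the group */
    long used;    /* current sum of all heap sizes */
} HeapGroup;

int group_grow_heap(HeapGroup *g, long nbytes) {
    if (g->used + nbytes > g->limit)
        return 0;                  /* over budget: GC first, then block */
    g->used += nbytes;
    return 1;
}

void group_shrink_heap(HeapGroup *g, long nbytes) {
    g->used -= nbytes;
}
```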
Figure 8.1: Copying garbage collection

Initial setup:
- There are two semi-spaces S1 and S2. S1 contains the original objects; S2 is empty.
- There is a root set of references consisting of all stacks, all static variables, and all object references that are stored by the microkernel.

Algorithm:
- Shallow-copy all objects that are directly referenced by the root set to semi-space S2. When copying an object (the old object), mark the object in semi-space S1 as copied and write a forwarding pointer to the object in semi-space S2 into the old object.
- Scan all objects of semi-space S2:
  - If the object contains a reference to an object that has already been copied, replace the original object reference by the value of the forwarding pointer.
  - If the object contains a reference to an object that has not been copied, copy that object and create a forwarding pointer.

COMPACTING. The compacting collector is also an exact collector that moves objects. But in contrast to a copying GC it does not use two semi-spaces and does not change the order of the objects. The collector operates in four phases. In the first phase it marks all reachable (live) objects, in the second phase it calculates the new address of each live object, in the third phase it corrects all references to live objects, and in the final phase it copies the contents of the objects to their new locations. The second phase uses two pointers into the heap: top and current. Top marks the end of the processed live objects, current marks the start of the unprocessed objects (the current position of the heap scan). Initially top and current are set to the start of the heap. If the object at the current address is marked, its new address is set to the top address and the top address is advanced past the object. If it is not marked, it keeps no new address. The current pointer is then set to the next object. At the end of this phase the current pointer points to the original end of the heap and the top pointer to the new end of the heap.
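The second phase of the compacting collector can be sketched as follows (a simplified illustration: array indices stand in for heap addresses, and the struct layout is an assumption; the real scan walks byte offsets over object headers):

```c
/* Sketch of the address-calculation phase: "current" scans every object
 * in heap order, "top" advances only over marked (live) objects, so new
 * addresses are compact and preserve the original object order. */
typedef struct {
    int size;       /* object size */
    int marked;     /* set by the preceding mark phase */
    int newAddr;    /* output of this phase (-1 = dead/unset) */
} Obj;

void compute_new_addresses(Obj *heap, int nobjs) {
    int top = 0;                                  /* end of processed live data */
    for (int current = 0; current < nobjs; current++) {
        if (heap[current].marked) {
            heap[current].newAddr = top;          /* object will move here */
            top += heap[current].size;
        }
    }
    /* top is now the new end of the heap */
}
```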
2.1 GC interface

GC runs of JX domains are completely independent from each other. This independence allows for different GC implementations that even use different object layouts. Some GCs need additional information per object, for example a mark bit or a generation number; some GCs will reorder the fields of an object to better use CPU caches; some GCs will even compress objects or store a group of objects, such as an array of booleans, in a space-efficient manner. For this to work the garbage collector implementation must be hidden behind an implementation-independent interface. The interface is realized as a domain-specific function table (see Figure 8.2).

2.2 Moving special objects

TCB. The GC must know all references to thread control blocks (TCBs) and update them. Threads are linked to threads in other domains. During a portal call, for example, the sender thread is appended to a wait queue in the service control block. When the TCB is moved, the external references to it must be updated as well. For this purpose the TCB has a link to the SCB it waits for.

Figure 8.2: Garbage collector interface

ObjectHandle allocDataInDomain(struct DomainDesc_s *domain, int objsize, int flags)
    This function is called to allocate a new object on the domain's heap.

void done(struct DomainDesc_s *domain)
    This function is called when the domain terminates. The garbage collector should release all resources, especially the memory allocated for the heap.

void gc(struct DomainDesc_s *domain)
    A garbage collection should be performed.

boolean isInHeap(struct DomainDesc_s *domain, ObjectDesc *obj)
    The garbage collector tests whether the given object pointer is inside its heap.

void walkHeap(struct DomainDesc_s *domain, HandleObject_t handleObject)
    The garbage collector applies the handler function to all objects on its heap. This function can be used by garbage-collector-independent code to operate on all objects on the heap without knowing the organization of the heap.
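Rendered as a C struct, the function table of Figure 8.2 might look like the following; the entry points follow the figure, but the surrounding typedefs and the struct name are assumptions of this sketch:

```c
/* Hypothetical shape of the per-domain GC function table. */
struct DomainDesc_s;
typedef struct ObjectDesc_s ObjectDesc;
typedef void *ObjectHandle;
typedef void (*HandleObject_t)(ObjectDesc *obj);

struct GCInterface {
    ObjectHandle (*allocDataInDomain)(struct DomainDesc_s *domain, int objsize, int flags);
    void (*done)(struct DomainDesc_s *domain);
    void (*gc)(struct DomainDesc_s *domain);
    int  (*isInHeap)(struct DomainDesc_s *domain, ObjectDesc *obj);
    void (*walkHeap)(struct DomainDesc_s *domain, HandleObject_t handleObject);
};

/* A trivial stub, e.g. for a collector that has nothing to release. */
static void noop_done(struct DomainDesc_s *domain) { (void)domain; }
```

Each domain carries a pointer to one such table, so garbage-collector-independent kernel code always calls through the table and never depends on a particular heap organization.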
A TCB object cannot simply be copied to another domain. When a TCB object crosses a domain border (for example from domain D1 to domain D2 in Figure 8.4), a proxy for the TCB is created on the target heap. The proxy, which has the type ForeignCPUState, references the domain of the real TCB (domain D1) by using the domain pointer / domain ID mechanism (see Section 4 of Chapter 4). To reference the TCB it is not possible to simply use a pointer to the TCB object, because this object could be moved during a GC in domain D1. Therefore the proxy uses the unique thread ID to reference the TCB. Because it is very expensive to find the TCB given the thread ID, we use the following optimization: the proxy additionally contains a direct pointer to the TCB and the number of the GC epoch of domain D1. The GC epoch of a domain is a strictly monotonically increasing number. It is incremented whenever a GC is performed in the domain. It is guaranteed that if the GC epoch number has not changed, objects have not been moved.

Every pointer to a TCB must be known to the garbage collector. The GC must be able to find and update these pointers when moving the TCB. For scalability reasons these pointers must be found without scanning the heaps of other domains. An example of such a pointer is the link between a service thread and its client, which is created during a portal invocation. To return the result of a portal call the service thread needs a reference to the client thread. There are two ways to implement such a reference. In the first alternative, the server's TCB contains the thread ID of the client thread. On return from a portal call the corresponding client TCB must be found. As this may involve a linear search over all threads of the client domain, the mapping from thread ID to TCB could be very slow. Therefore an optimization must be used: together with the thread ID, a direct TCB pointer, a GC epoch number, and a pointer to the client's DCB are stored in the server's TCB.
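This epoch-validated cache can be sketched as follows; the struct and function names are illustrative, not the JX identifiers, and the linear-search slow path is a stand-in for whatever lookup the kernel actually uses:

```c
/* Sketch: a cached TCB pointer is trusted only while the client domain's
 * GC epoch is unchanged; otherwise the slow thread-ID lookup refreshes it. */
typedef struct { int threadID; } TCB;
typedef struct {
    unsigned long gcEpoch;   /* incremented on every GC of this domain */
    TCB *threads;            /* the domain's threads (simplified) */
    int nthreads;
} DCB;
typedef struct {
    int threadID;            /* always valid */
    TCB *cachedTCB;          /* fast path */
    unsigned long epoch;     /* client epoch when cachedTCB was stored */
    DCB *clientDCB;
} TCBRef;

/* Slow path: linear search over the client domain's threads. */
static TCB *lookupTCB(DCB *d, int id) {
    for (int i = 0; i < d->nthreads; i++)
        if (d->threads[i].threadID == id) return &d->threads[i];
    return 0;
}

TCB *resolveTCB(TCBRef *r) {
    if (r->epoch != r->clientDCB->gcEpoch) {      /* a GC may have moved it */
        r->cachedTCB = lookupTCB(r->clientDCB, r->threadID);
        r->epoch = r->clientDCB->gcEpoch;
    }
    return r->cachedTCB;
}
```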
The direct TCB pointer can be used as long as the GC epoch number is identical to the client domain's GC epoch. The current GC epoch is stored in the client's DCB. If the epoch numbers differ, a GC in the client domain could have moved the client TCB. In this case the thread ID is used to find the TCB, and the TCB pointer and GC epoch number can be updated. In the second alternative, the reference to the client TCB is stored in the server's TCB as a direct pointer called mostRecentlyCalledBy (see Figure 8.3). This pointer must be updated when the client TCB is moved during a garbage collection in the client domain or when the client thread or the client domain is terminated. To detect whether a TCB is referenced via a mostRecentlyCalledBy pointer, the client TCB has a pointer to the server's TCB (blockedInServiceThread). This pointer must be updated similarly to the mostRecentlyCalledBy pointer. The current implementation uses the second alternative, because it uses less space in the TCB and is faster in the common case that no GC occurred during a portal invocation.

A similar problem is caused by fast portals that represent a thread of another domain. These portals, called ForeignCPUState, also need a reference to the TCB of a thread in another domain. The second alternative cannot be used, because potentially there are many ForeignCPUState portals referencing the same TCB. Therefore ForeignCPUState portals are implemented using the first alternative (see Figure 8.4).

Stack. A previous implementation of the stack overflow check required that stacks are aligned at a multiple of their size (see Section 1). This alignment requirement prohibited allocating them on the heap because of heap fragmentation. The current implementation (called STACK1 in Section 1) does not need aligned stacks and allocates all stacks on the heap.
When the stacks are moved, the frame pointers (which link the stack frames), the CPU context in the TCB (stack pointer and frame pointer registers), and the stack pointers in the TCB must be corrected. A complication occurs when the collector thread moves its own stack. It must detect this and switch to the new stack before releasing the old heap. As the stack was copied in the shallow-copy phase (see Figure 8.1), it must copy the stack again to reflect the actual execution state. The current implementation switches to the new stack immediately after correcting the frame pointers of this stack and runs the rest of the collection on the new stack.

Figure 8.3: Management of Thread Control Blocks during a portal communication
During a portal communication the TCBs of the client and the server thread are connected in both directions. The connection from the server TCB to the client TCB (mostRecentlyCalledBy) is required for delivering the results of the invocation. The connection from client to server TCB (blockedInServiceThread) is required to update the mostRecentlyCalledBy link during a garbage collection.

Figure 8.4: Relation between TCBs, inter-domain TCB references, and stacks
Thread Control Blocks and stacks live on the heap of a domain. A stack consists of linked stack frames. Another domain can hold a reference to a thread control block by using a ForeignCPUState portal. Because TCBs as well as stacks can be moved during a garbage collection, the ForeignCPUState portal contains the thread ID of the foreign TCB, the GC epoch, and a direct pointer to the TCB.

SCB.
Service Control Blocks (SCBs) and service pools are also allocated on the heap. They contain references to TCBs, which must be updated.

Domain. Domain Control Blocks could be allocated on the heap of DomainZero. To allow direct pointers to DCBs we use a dedicated memory area for DCBs, and domain portals that contain a DCB pointer and a domain ID. When the ID in the DCB equals the ID in the domain portal, the DCB pointer is valid; otherwise the DCB has been reused.

2.3 Garbage collection and interrupt handling

All interrupt handlers are written in Java. All of the current GC implementations operate non-incrementally, i.e., no mutator thread¹ is allowed to run during a collection, because the heap is not consistent during this time. Therefore all interrupts that are serviced by the domain must be blocked. On a PC architecture the programmable interrupt controller (PIC) is used to disable dedicated interrupts. Blocking in software, by remembering the interrupt (setting a flag) and returning from the core IRQ handler, is not possible with level-triggered interrupts, which must be acknowledged at the interrupting device to be deactivated.

To avoid that a garbage collection becomes necessary during interrupt handling, the heap contains a reserve that is only available for object allocation in interrupt handler threads. As a first-level interrupt handler should be very short and should not allocate much memory, this reserve should be sufficient. If it is not, an exception is thrown.

2.4 Garbage collection and timeslicing

Scheduling threads of one domain preemptively means that a thread may be at an arbitrary instruction when another thread requires a collection. A technique that is also used in JX is to advance all threads that are located in Java code to a safe point [3]. A safe point is a point where the execution state of the thread is known: the types of the saved registers and the types of all stack positions are known.
It is difficult to advance a thread that currently executes a core function, because obtaining a stack map, which is necessary for an exact collector, would require modifying the C compiler to generate such maps. Therefore C code must either disable interrupts or register its references. A collection may also be necessary in kernel code when the kernel allocates objects. The code must be carefully written to check whether a GC occurred and reload variables (see Figure 8.5).

¹ All threads, besides the GC thread, that run in the domain and modify the heap are called mutator threads.

2.5 Garbage collection of portals and memory objects

Portals and memory objects are shared between domains. Both refer to a central data structure: portals have a reference to a service control block (SCB) in another domain, and memory objects have a reference to a memory control block (MCB) in DomainZero. Reference counters are used to reclaim the SCB and MCB. When a domain terminates or portals or memory objects become garbage, the respective reference counter must be decremented during a finalization cycle. How this is actually realized depends on the installed garbage collector. The copying GC moves live objects to a second heap and sets a flag in the original object's header to indicate that the object was copied. At the end of the GC cycle all objects that have not set this flag are dead. During the following finalization phase the heap is scanned using the walkHeap() function of Figure 8.2, and the reference counters of all dead portals and memory objects are decremented. When the domain is in the process of being terminated, all objects are considered dead.

2.6 Future work

Similar to scheduling, garbage collection should be removed from the microkernel. This is much more difficult, because it must be guaranteed that the untrusted GC preserves the type of objects while copying them.
Although there are recent advances [183], several open problems have to be solved before the technology can be used for a real system.

Most commercial JVMs use generational garbage collection. Generational GC assumes that most objects die young and that an object that survives a GC lives very long. The GC algorithm tries to separate young and old objects in two heaps and garbage collects the heap of the young objects more often than the heap of the old objects. The heap of the young objects often is called incubator or nursery. All objects are allocated in the nursery. When the nursery is garbage collected, all live objects are copied to the old objects' heap. Because old objects are not traversed, GC time is reduced. To detect references from old objects to new objects without scanning the old heap, write barriers can be used.

Figure 8.5: GC in kernel functions
Kernel code must be programmed very carefully to not hold object references across function calls. It should assume that a garbage collection can invalidate a reference and should reload all references after a function call. An exception to this rule are functions that are specified not to cause a garbage collection.

2.7 Summary

This chapter discussed issues that are related to the management of main memory in the JX system. The memory hierarchy was described. This hierarchy consists of a global memory management at the lowest level and, on top of this, a domain-local memory management and a heap management. The heap memory is managed by a garbage collector, which can be selected on a per-domain basis.

3 Memory Objects

Efficiently managing buffers is one of the challenging tasks of an operating system. This section describes the memory management facilities of the JX operating system. In a pure JVM there are only objects and arrays.
A JVM that must serve as a foundation for an efficient operating system must provide a richer interface for buffer management. For this purpose JX provides memory objects. They have four non-orthogonal properties: sharing, revocation, splitting, and mapping. Memory objects can be shared between domains. There is a revoke() method that atomically returns a new memory object representing the same range of memory and revokes access to the old one. A memory object can be split in two memory objects that together represent the original memory range; access to the original object is thereby revoked. And a memory object can be mapped to a class structure and accessed like an object.

3.1 Lifecycle of memory objects

3.1.1 Creation

Memory objects are created by the microkernel as fast portals. The MemoryManager portal allows allocating a memory object of a specified size, a ReadOnlyMemory object of a specified size, or a DeviceMemory object at a specified address with a specified size.

3.1.2 Destruction

Explicit deallocation was not an option, because it would contradict the whole philosophy of the JX system. We need an automatic mechanism: a garbage collector for memory objects. Memory objects are shared between domains, and the memory GC is not allowed to stop all domains and should not possess any global knowledge. This makes the memory GC similar to a distributed GC. One of the simplest distributed GC techniques is reference counting. Reference counting can be used, because there are no cycles in memory references (a memory object only contains a reference to its parent memory object).

3.2 Implementation

This section describes one possible implementation of memory objects. As the data structures used to maintain memory objects are only accessed by the memory object implementation, different implementations are possible. We experimented with implementations that improve efficiency by providing only a subset of the four features sharing, revocation, splitting, and mapping.
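The acyclic reference counting of Section 3.1.2 can be sketched as follows; the names are illustrative, and a flag stands in for returning memory to the allocator so the sketch stays self-contained:

```c
/* Sketch: reference-counted memory control blocks. A child created by
 * splitting references only its parent, so the graph is acyclic and
 * plain reference counting suffices. */
typedef struct MCB {
    int refcount;
    int freed;             /* stands in for freeing the block */
    struct MCB *parent;    /* 0 for a root allocation */
} MCB;

void mcb_release(MCB *m) {
    while (m && --m->refcount == 0) {
        MCB *p = m->parent;
        m->freed = 1;      /* a real implementation frees the block here */
        m = p;             /* dropping the child drops its parent reference */
    }
}
```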
The results of these experiments are presented in the performance evaluation section at the end of this chapter.

Sharing. Because memory objects can be shared between domains, a global data structure (Memory Control Block, MCB) is required to maintain the state of the memory object (see Figure 8.6). This shared data structure is referenced by memory portals that are located on the heaps of different domains. The state, or meta information, of the memory object consists of a flag used for revocation and a reference count used to garbage collect memory regions.

Figure 8.6: Data structures for the management of memory objects
Domains can access "plain" memory by using a memory portal. The memory portal is a fast portal that contains a pointer to a memory control block (MCB), size information, and a pointer to the memory area. This information is stored in the "additional data" part of the fast portal (see Figure 5.12 of Chapter 5). MCBs are maintained in a doubly-linked list. Each MCB is responsible for an area of memory that does not overlap with the area of any other MCB.

Revocation. If a memory object supports revocation, the revocation check and the memory access must be performed as one atomic operation. Systems that do not perform these operations atomically are said to have the time-of-check-to-time-of-use (TOCTTOU) flaw. An example of this kind of flaw is the STOP-PROCESS-ERROR of Multics described in [23]. Operations on memory objects are short (with the exception of memory copy) and do not block. To realize atomicity efficiently (that is, without using mutual exclusion locks) on a uniprocessor, interrupts can be disabled or atomic code (see Section 4.3 of Chapter 7) can be used.
On a multiprocessor a spinlock is used. The spinlock is implemented using an atomic bus transaction to read and write a data word. Several CPU instructions are available for this purpose, such as test-and-set or compare-and-swap (CAS). When using CAS, the revocation flag and the lock flag of the spinlock can be arranged in the same word and checked with a single CAS instruction.

Splitting. Often it is desirable to pass only a subrange of a certain memory to another domain. Each memory type supports the creation of subranges. Splitting and revocation are not orthogonal. A natural semantics for revocation says that the contents of the memory can only be changed by the revoker after a revocation. If a memory object is split, there are still references to the unsplit memory. A split must therefore revoke access to the original memory.

Mapping. A memory object is a data container without data structure. Many memory objects contain data that is structured. For example, a memory object holding a disk block of inodes has the structure of an array of inodes. A memory object that holds a network packet can be structured into a header and a payload. To allow structured access to a memory object, JX supports the abstraction of mapping an object to a memory object. The memory object contains the state of the object. An access to a field of the object accesses the underlying memory object. The object is an "instance" of a class that is marked using the marker interface MappedObject. The bytecode verifier ensures that no instances of such a class can be created by using the regular object creation mechanism (the new operation). To specify the byte order that should be used when mapping a sequence of bytes to an integer or short data type, the mapping class uses a subtype of MappedObject: either MappedLittleEndian or MappedBigEndian. The bytecode-to-nativecode translator is then able to generate the necessary machine code instructions to access the memory in the correct byte order.
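The combined lock/revocation word described under Revocation above can be sketched with C11 atomics; the bit layout and names are assumptions of this illustration, not the JX encoding:

```c
#include <stdatomic.h>

/* Sketch: the revocation flag and the spinlock's lock flag share one
 * word, so a single CAS both acquires the lock and verifies that the
 * memory object has not been revoked. */
#define MEM_REVOKED 0x1u
#define MEM_LOCKED  0x2u

/* Returns 1 if the lock was taken (object not revoked), 0 if the object
 * is revoked or currently locked; a real spinlock retries on the latter. */
int mem_lock_unless_revoked(atomic_uint *state) {
    unsigned int expected = 0;               /* neither revoked nor locked */
    return atomic_compare_exchange_strong(state, &expected, MEM_LOCKED);
}

void mem_unlock(atomic_uint *state) {
    atomic_fetch_and(state, ~MEM_LOCKED);
}
```

Because the CAS only succeeds when both flags are clear, a revoked memory object can never be locked again, which closes the TOCTTOU window between check and access.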
Figure 8.7: Mapping a memory range to a class
(a) The mapping function describes how a field access is translated into a memory access. (b) A mapped memory is represented by a fast portal. The portal object contains a pointer to the begin of the memory area, a reference to the memory control block (MCB), and a reference to a parent mapping if this object is part of a larger aggregation (otherwise the parent link is null).

The mapping function currently only supports primitive integer types (byte, short, char, int, long). The usefulness of the mapping mechanism would be improved considerably if aggregated data types were supported, i.e., an object that contains not only primitive fields but also other objects. The way Java handles the aggregation (also called the parts/whole relation) makes this very difficult: the whole object does not contain the data of the part object but only a reference to it. This reference has no pointer to its (logically) enclosing object and can be used as an independent reference. This problem can be solved by introducing hierarchical mappings: each mapped object can be part of a larger aggregation, and the portal object contains a reference to the enclosing mapped object.

3.3 Problems

Efficiency. As already mentioned in Section 3.2, not all features of a memory object are needed by all applications. Several applications do not pass memory objects to other domains but require the revocation feature within their own domain. We implemented a memory object that creates the central data structure only when the memory object crosses a domain boundary. The revocation flag is part of the heap-managed memory proxy until a global MCB object is created.

Global resources. Memory objects require a central data structure to allow garbage collection and revocation of shared memory.
This contradicts our design principle of not allowing a domain to allocate global resources. We alleviate this problem by using per-domain quotas that limit the number of MCB objects and the amount of memory a domain is allowed to allocate.

Hardware references. Memory objects are also used in device drivers. A device driver may use the memory for DMA transfers to and from a device. This means that the hardware has an implicit reference to the memory object, and the memory should not be garbage collected even if there are no other references to it. There are two solutions to this problem. First, the programmer of the device driver can ensure that the memory object stays alive by making it reachable from the root set of live references. This will usually be the case, because the memory object involved in the DMA transfer will be part of a data structure that manages the currently executing device operation. A more robust solution is to support the programmer in keeping the memory object alive. An implicit hardware reference can only be created by asking the memory object for the start of its physical memory area; this start address is then written into a device register or into a DMA table. The system can therefore support the programmer by incrementing the reference count in the MCB object whenever the physical start address of the memory is queried. This requires that the programmer invoke a method to decrement the reference count once the memory is no longer used by the hardware.

3.4 Buffer management

Memory objects are used as buffers, for example in the network stack. Network packet processing requires efficient buffer management: allocating a buffer and passing a buffer to another domain must be fast operations. Some kind of flow control is needed to prevent a domain from running out of buffers. A domain that wants to receive a packet passes a buffer to the network domain. In exchange it gets a (different) buffer that contains the received data.
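This exchange discipline, in which a buffer is never handed over without receiving one in return, can be sketched as follows. The sketch is illustrative Java with hypothetical names, not the JX network-stack code; it models the processing and free queues of a single stage from Figure 8.8.

```java
import java.util.ArrayDeque;

// Sketch of one stage of the buffer-exchange protocol: every send swaps a
// full buffer for an empty one, so the number of buffers in circulation is
// constant and no domain can drain another domain's buffer pool.
public class BufferExchange {
    private final ArrayDeque<byte[]> processing = new ArrayDeque<>();
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    public BufferExchange(int nBuffers, int size) {
        for (int i = 0; i < nBuffers; i++) free.add(new byte[size]);
    }

    /** Enqueue a full buffer; receive an empty one in exchange. */
    public byte[] send(byte[] full) {
        processing.add(full);
        return free.poll();          // null only if the pool is exhausted
    }

    /** Worker side: consume a buffer, then return it to the free queue. */
    public byte[] process() {
        byte[] b = processing.poll();
        if (b != null) free.add(b);
        return b;
    }

    public int inFlight() { return processing.size(); }
}
```

A real implementation would guard both queues with the spinlock described earlier; the exchange itself stays a constant-time queue operation.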
The send operation returns an empty buffer in exchange for the buffer that contains the data. The principle of this operation is illustrated in Figure 8.8.

Figure 8.8: Principle of buffer management. (a) Multi-domain configuration: the network protocol stack is distributed across multiple domains. Domain D1 contains the application, domain D5 the lowest layer of the network stack. During the send operation a memory object is passed from D1 to D2 to D3, where it is enqueued in a processing queue. The service thread of D3 dequeues a memory object from the free queue and passes it as return value to domain D2, which passes it to D1. Another thread (worker thread) asynchronously dequeues the memory object from the processing queue and passes it on to D4, which processes it and passes it to D5. The service thread of D5 makes the memory available to the DMA hardware and inserts it in a processing queue. The device asserts an interrupt when it completes the DMA transfer, and the memory object can then be moved from the processing queue to the free queue. (b) Single-domain configuration: the same as (a), but all components run in a single domain. The operation is identical, only there are no service threads. (Legend: portal invocation, portal return, cycle of memory objects, queue.)

4 Performance evaluation

This section evaluates the performance of the global memory manager, memory objects, and stack size checks.

4.1 Global memory management

The bitmap-based global memory manager has a low memory footprint.
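A bitmap allocator of this kind can be sketched as follows: one bit per block, so managing n blocks costs about n/8 bytes of bookkeeping. This is an illustrative first-fit sketch, not the JX implementation.

```java
import java.util.BitSet;

// Sketch of a bitmap-based block allocator: one bit per fixed-size block.
public class BitmapAllocator {
    public static final int BLOCK_SIZE = 1024;   // block size used in the text
    private final BitSet used;                   // set bit = block allocated
    private final int nBlocks;

    public BitmapAllocator(int nBlocks) {
        this.nBlocks = nBlocks;
        this.used = new BitSet(nBlocks);
    }

    /** First-fit search for a run of free blocks; returns start block or -1. */
    public int alloc(int blocks) {
        int start = 0;
        while (start + blocks <= nBlocks) {
            int nextUsed = used.nextSetBit(start);
            if (nextUsed == -1 || nextUsed - start >= blocks) {
                used.set(start, start + blocks); // mark the run allocated
                return start;
            }
            start = nextUsed + 1;                // skip past the obstacle
        }
        return -1;                               // no sufficiently large run
    }

    public void free(int start, int blocks) {
        used.clear(start, start + blocks);
    }
}
```

With 1024-byte blocks the bitmap for one block costs a single bit, which matches the footprint figures reported below.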
Using 1024-byte blocks and managing about 128 MBytes, or 116977 blocks, the overhead is only 14622 bytes (15 blocks, or about 0.01 percent).

4.2 Stack size check

We used a microbenchmark to measure the overhead of the stack size check. A method is called in a loop of 100,000 iterations. There is no measurable difference between the two check variants in this microbenchmark.

Tab. 8.1 Virtual method invocation with different stack size checks

  Operation       Time (ns)
  no check           22
  check STACK0       40
  check STACK1       40

4.3 Memory objects

Table 8.2 shows the performance of the memory operations create, revoke, and split.

Tab. 8.2 Performance of memory operations

  Operation             Time (ns)
  Create Memory             2,496
  Create DeviceMemory         750
  Revoke Memory               900
  Split Memory              1,780
  set32, 1 KB block        18,400
  set32                        50
  get32                        45
  Create mapping              570

Figure 8.9 displays the time in nanoseconds between event log points in the memory allocation code.

Figure 8.9: Time between events during the creation of a memory object. Time is given in nanoseconds with standard deviation; the number in parentheses is the number of transitions between two events.

  IN -> MALLOC        552 +/- 66  (1004)
  MALLOC -> MEMSET    542 +/- 217 (1004)
  MEMSET -> PROXYIN   220 +/- 44  (1004)
  PROXYIN -> MCBIN    464 +/- 81  (2005)
  MCBIN -> OUT        578 +/- 54  (1004)

The time between IN and MALLOC is spent allocating the memory from the global memory management. The time between MALLOC and MEMSET is spent filling the memory range with zero. The time between PROXYIN and MCBIN is spent creating the proxy object (memory portal). The time between MCBIN and OUT is spent creating the MCB.

4.3.1 Delayed MCB creation

When a memory object is used only within one domain (no sharing feature), no central data structures must be created for this memory object. Table 8.3 shows the effect of creating MCBs lazily. The saved time of about 500 nanoseconds is consistent with the event trace of Figure 8.9.
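The delayed-MCB optimization can be sketched as follows. Names and structure are illustrative, not the JX implementation: the proxy keeps the revocation flag on the domain-local heap and allocates the global MCB only when the memory first crosses a domain boundary.

```java
// Sketch of delayed MCB creation: the global control block is allocated
// lazily, so purely domain-local memory objects never consume a global
// resource.
public class MemoryProxy {
    private boolean revoked;   // heap-managed flag while domain-local
    private MCB mcb;           // global control block, created lazily

    static final class MCB {
        volatile boolean revoked;
        volatile int refCount = 1;
    }

    /** Called when the memory object is passed to another domain. */
    public MCB shareAcrossDomains() {
        if (mcb == null) {
            mcb = new MCB();
            mcb.revoked = revoked;   // migrate the local flag into the MCB
        }
        mcb.refCount++;
        return mcb;
    }

    public void revoke() {
        if (mcb != null) mcb.revoked = true;
        else revoked = true;         // cheap path: no global resource touched
    }

    public boolean isRevoked() {
        return mcb != null ? mcb.revoked : revoked;
    }
}
```

The roughly 500 ns saved by this optimization corresponds to the PROXYIN-to-MCBIN interval in the event trace, which is exactly the work the lazy path skips.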
Tab. 8.3 Memory creation time with and without the optimization of delayed MCB creation

  Operation                     Time (ns)
  w/o delayed MCB creation         2,496
  with delayed MCB creation        1,904

4.3.2 MappedMemory

Figure 8.10 shows an example of the use of MappedMemory; Figure 8.11 shows an equivalent C program. The C program is not compiled with an optimization option: by inspecting the assembler code we recognized that with the -O2 optimization option the gcc compiler moves the data accesses out of the loop. The mapped access and the get/set test were compiled with inlined memory access. The results of the benchmark are presented in Table 8.4. The table shows the improvement of mapping compared to the get/set interface. This improvement is due to removed range checks. It also shows how the get/set interface and the mapped interface compare to a C program running on Linux. While access to a mapped object is still slower than access to a C struct, the difference is rather small and is due to the additional indirection of MappedMemory compared to a C struct (see Section 4.3.2).

Tab. 8.4 Memory access time of mapped memory vs. get/set
This table shows the performance of different memory access mechanisms. The column titled "get-set/mapped" contains the ratio between the access time when using the get-set interface and the access time when using mapped memory. The last two columns compare these interfaces with the time of the Linux benchmark.

  Operation         get-set (µs)   mapped (µs)   get-set/mapped   C prog. on Linux (µs)   get-set/C-Linux   mapped/C-Linux
  4,000,000 write      73474          61010           1.20             42212                   1.74              1.45
  4,000,000 read       48000          32687           1.47             24664                   1.95              1.33
  1 write                  0.018          0.015       1.20                 0.010               1.74              1.45
  1 read                   0.012          0.008       1.47                 0.006               1.95              1.33

We also measured the time required to create a mapping. This time is important because mapping is performed on the performance-critical path, for example when receiving or sending network packets. The time to create 1000 mappings was 552 microseconds (or 552 nanoseconds per mapping). This means that creating a mapping is slightly slower than creating an object (see Chapter 3, Section 4). This is reasonable because a mapping is represented by a proxy object that is created during the mapping process. These numbers show that mapping pays off only when the data is accessed very often. A mapped read saves 6 nanoseconds, a mapped write saves 8 nanoseconds; the time to map an object is therefore equivalent to 92 read accesses or 69 write accesses. A network packet header is usually not accessed that often, so mapping will not improve performance but will lead to more readable programs. Only an order-of-magnitude performance improvement of object allocation will make mapping practical.

5 Summary

This chapter described the memory management of the JX system. Similar to two-level scheduling, memory management is separated into a global management and a domain-local management. The global management allocates the available main memory to domains. It uses a bitmap allocator with a low memory footprint. The domain-local management uses almost all its memory for a garbage-collected heap. All data is allocated on this heap [1], including thread control blocks and stacks. A domain can select from a number of garbage collectors to manage this heap.
A defined interface between the kernel and the garbage collectors allows the use of different collectors. To cope with large amounts of memory and to share memory between domains, memory objects have been introduced. They allow sharing, revocation, splitting, and mapping. The performance of several operations of memory objects and the performance impact of delayed memory control block creation were measured.

Figure 8.10: Measured code sequence for JX MappedMemory access

  class MyMap implements MappedLittleEndianObject {
      int a; char b; short c; int d;
  }

  Memory mem = memoryManager.alloc(12);
  VMClass cl = componentManager.getClass("test/memobj/MyMap");
  MyMap m = (MyMap) mem.map(cl);
  // -- START TIME --
  for (int i = 0; i < ntries; i++) {
      m.a = 42; m.b = 43; m.c = 44; m.d = 45;
  }
  // -- END TIME --

Figure 8.11: Measured code sequence for C struct access

  struct MyMap {
      signed long a;
      unsigned short b;
      signed short c;
      signed long d;
  };

  char *mem = malloc(12);
  struct MyMap *m = (struct MyMap *) mem;
  // -- START TIME --
  for (i = 0; i < n; i++) {
      m->a = 42; m->b = 43; m->c = 44; m->d = 45;
  }
  // -- END TIME --

[1] As already mentioned, the current prototype uses a so-called fixed memory area to store data structures that are not yet prepared to be moved by a garbage collector. This includes the domain-local part of a component.