This document summarizes memory management in JX, which lacks an MMU. It discusses two levels of memory management - a global level to avoid fragmentation and a domain-local level managed by garbage collection. It describes stack overflow detection without an MMU and different garbage collection algorithms implemented, including copying and compacting collectors. The copying collector uses two semi-spaces to compactly allocate live objects, while the compacting collector calculates new object positions without changing their order.
Besides the CPU, memory is the other fundamental primary resource. This chapter describes the memory management architecture of JX. The first part of the chapter explains the problems of the global memory management, the second part explains the memory object abstraction.

1 Global and domain-local memory management

Memory protection in JX is based on the use of a type-safe instruction set. No memory management hardware (MMU) is necessary. The whole system, including all applications, runs in one physical address space. This makes the system ideally suited for small devices that lack an MMU. But it also leads to several problems. In a traditional system fragmentation is not an issue for the user-level memory allocator, because memory that is allocated but not actively used is paged to disk. In JX unused memory is wasted main memory. So we face a similar problem as kernel memory allocators in UNIX, where kernel memory usually also is not paged and therefore is a scarce resource. In UNIX a kernel memory allocator is used for vnodes, proc structures, and other small objects. In contrast to this, the JX kernel does not create many small objects. It allocates memory for a domain's heap, and the small objects live in the heap. The heap is managed by a garbage collector. In other words, the JX memory management has two levels: a global management, which must cope with large objects and avoid fragmentation, and a domain-local garbage-collected memory.

Global memory management. The global memory is managed by using a bitmap allocator [187]. This allocator was easy to implement, it automatically joins free areas, and it has a very low memory footprint. On the other hand, there is nothing in the system's design or implementation that prevents us from using another allocator.

Domain-local memory management. A domain has two memory areas: an area where objects may be moved by the garbage collector and an area where they are fixed.
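A minimal sketch of such a bitmap allocator (an illustration under simplifying assumptions, not the JX code; one byte per block stands in for one bit, and all names are hypothetical):

```c
#define NBLOCKS 64
static unsigned char bitmap[NBLOCKS];   /* 1 = block allocated; one byte per block for clarity */

/* Allocate n contiguous blocks; returns the first block index or -1. */
int bm_alloc(int n) {
    for (int i = 0; i + n <= NBLOCKS; i++) {
        int j = 0;
        while (j < n && !bitmap[i + j]) j++;   /* measure the free run starting at i */
        if (j == n) {
            for (j = 0; j < n; j++) bitmap[i + j] = 1;
            return i;
        }
        i += j;                                /* skip past the allocated block */
    }
    return -1;
}

/* Freeing simply clears the bits; adjacent free areas join automatically. */
void bm_free(int first, int n) {
    for (int j = 0; j < n; j++) bitmap[first + j] = 0;
}
```

The automatic joining of free areas that the text mentions falls out of the representation: two adjacent freed runs are indistinguishable from one large free run.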
In the future, a single area may suffice, but then all data structures that are used by a domain must be movable. Currently, the fixed area contains the code and class information. Moving these objects requires an extension of the system: all pointers to these objects must be known to the GC and updated; for example, when moving a code component the return addresses on all stack frames must be adjusted.

Stack overflow detection and null pointer check. A system design without an MMU means that several of its responsibilities (besides protection) must be implemented in software. One example is stack overflow detection, another one null pointer detection.

Stack overflow detection is implemented in JX by inserting a stack size check at the beginning of each method. This is feasible, because the required size of a stack frame is known before the method is executed. The size check includes a reserve, in case the Java method must trap to a runtime function in DomainZero, such as checkcast. A stack size check must be performed whenever a method is entered. To store its local and temporary variables the method needs a stack frame of a known size. The check code must test whether the stack has enough space for the frame and otherwise throw an exception or enlarge the stack. Enlarging the stack is only possible with a more sophisticated stack management that currently is not implemented. The check code must be fast because it is executed very often. We implemented two versions of the check. Version 1 (STACK0) aligns all stacks at an address that is a multiple of the stack size, which must be a power of two. The check code adds the frame size to the stack pointer and sets the lower bits of the result to zero. If the result is larger than the original stack pointer, the frame would overflow the stack and an exception is thrown. Version 2 (STACK1) uses the current thread pointer to access the stack boundary that is stored in the TCB.
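The STACK1 variant reduces to a comparison against the boundary kept in the TCB. Sketched in C (an illustration only; in JX the translator emits this as inline machine code at every method entry, and the downward-growing stack and the field name are assumptions):

```c
/* Sketch of a STACK1-style check: the stack grows downward and the TCB
 * stores the lowest usable stack address. */
typedef struct {
    unsigned long stackBoundary;   /* lowest address of the stack */
} TCB;

/* Returns 1 if a frame of frameSize bytes fits, 0 if it would overflow. */
int stack1_check(const TCB *tcb, unsigned long sp, unsigned long frameSize) {
    return sp - frameSize >= tcb->stackBoundary;
}
```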
It compares the stack pointer plus the frame size to the stack boundary and throws an exception if the boundary is exceeded.

The null pointer check currently is implemented using the debug registers of the Pentium processor. They can be programmed to raise an exception when data or code at address zero is accessed. On architectures that do not provide such a feature, the compiler inserts a null pointer check before a reference is used.

2 Heap

Objects in Java and other type-safe runtime systems, such as Inferno and ML, are stored on a garbage-collected heap [100]. The garbage collector usually is able to move the objects to compact the heap. Moving low-level data structures, such as thread control blocks and stacks, was a special challenge that we had to cope with.

Several different garbage collection algorithms can be used. No single GC algorithm will fit all applications. The Inferno VM, for example, uses reference counting (together with an additional mechanism to periodically collect cycles) for predictable response times. A reference counting GC has disadvantages, such as high runtime overhead, bad multiprocessor scalability, and the need for an additional cycle collecting mechanism, that make it unsuitable for many applications. Therefore we decided not to use a single GC but to define an interface between the GC and the rest of the runtime system to allow different GCs to be used. Currently three collectors are implemented: a copying collector with fixed heap size (COPY), a copying collector with dynamically changing heap size (CHUNKED), and a compacting collector (COMPACTING).

COPY. This collector is an exact, copying, non-generational GC. An exact GC always knows whether a word is a reference or a data value. A copying collector starts with a root set of object references, copies the object graph that is spanned by these references to a new heap, and deallocates the old heap.
This is in contrast to conservative GCs, which do not have this information and must assume that a word may be either a reference or a data value. They can be used for unsafe languages, such as C and C++. The shortcoming of these collectors is the inability to move objects, because they cannot update references to objects.

The JX implementation of this algorithm does not need any additional data structures and is not recursive (see Algorithm 8.1). Two so-called semi-spaces are used as a heap. Objects are allocated in one semi-space. When this semi-space runs full, the garbage collector copies all live objects into the other semi-space, which then is also used to allocate objects. This collector has some interesting properties: After a GC run the heap is in compact form. There is no need to search for free space during an allocation. Objects are always allocated at the top of the heap by simply advancing a pointer, which is a very fast operation. Deallocation is done by discarding the original semi-space. Therefore the time of one GC run is independent of the number of dead objects; it is proportional to the number of live objects. Because objects are moved, pointers to objects must be updated. It is necessary to be able to locate all pointers to objects. This requires special attention within the microkernel (see Section 2.2).

CHUNKED. The heap consists of small linked chunks instead of one large block. This allows the heap to grow and shrink according to the memory needs of the domain. A group of domains can use a certain amount of memory cooperatively. A limit can be specified for the sum of the heap sizes of such a group. If this limit is reached and a domain needs to allocate additional memory, a garbage collection is started in this domain; if the collection does not release the necessary amount of memory, the domain is blocked, waiting for the other domains in the group to release memory.
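The cooperative heap limit of CHUNKED can be sketched as a shared budget; the names and the granularity (bytes rather than chunks) are assumptions of this illustration:

```c
/* Sketch: a group of domains shares one heap budget. Growing a heap
 * beyond the budget fails, and the caller must run a GC and, if that
 * does not help, block until another domain releases memory. */
typedef struct {
    long limit;   /* maximum sum of all heap sizes in the group */
    long used;    /* current sum of all heap sizes */
} HeapGroup;

int group_grow_heap(HeapGroup *g, long nbytes) {
    if (g->used + nbytes > g->limit)
        return 0;                  /* over budget: GC first, then block */
    g->used += nbytes;
    return 1;
}

void group_shrink_heap(HeapGroup *g, long nbytes) {
    g->used -= nbytes;
}
```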
Figure 8.1: Copying garbage collection

Initial setup:
- There are two semi-spaces S1 and S2. S1 contains the original objects; S2 is empty.
- There is a root set of references consisting of all stacks, all static variables, and all object references that are stored by the microkernel.

Algorithm:
- Shallow-copy all objects that are directly referenced by the root set to semi-space S2. When copying an object (the old object), mark the object in semi-space S1 as copied and write a forwarding pointer to the object in semi-space S2 into the old object.
- Scan all objects of semi-space S2:
  - If the object contains a reference to an object that has already been copied, replace the original object reference by the value of the forwarding pointer.
  - If the object contains a reference to an object that has not been copied, copy that object and create a forwarding pointer.

COMPACTING. The compacting collector is also an exact collector that moves objects. But in contrast to a copying GC it does not use two semi-spaces and does not change the order of the objects. The collector operates in four phases. In the first phase it marks all reachable (live) objects, in the second phase it calculates the new address of each live object, in the third phase it corrects all references to live objects, and in the final phase it copies the contents of the objects to their new locations. The second phase uses two pointers into the heap: top and current. Top marks the end of the processed live objects, current marks the start of the unprocessed objects (the current position of the heap scan). Initially top and current are set to the start of the heap. If the object at the current address is marked, its new address is set to the top address and the top address is advanced past the object. If it is not marked, it keeps no new address. The current pointer is then set to the next object. At the end of this phase the current pointer points to the original end of the heap and the top pointer to the new end of the heap.
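The second phase of the compacting collector can be sketched as follows (a simplified illustration: array indices stand in for heap addresses, and the struct layout is an assumption; the real scan walks byte offsets over object headers):

```c
/* Sketch of the address-calculation phase: "current" scans every object
 * in heap order, "top" advances only over marked (live) objects, so new
 * addresses are compact and preserve the original object order. */
typedef struct {
    int size;       /* object size */
    int marked;     /* set by the preceding mark phase */
    int newAddr;    /* output of this phase (-1 = dead/unset) */
} Obj;

void compute_new_addresses(Obj *heap, int nobjs) {
    int top = 0;                                  /* end of processed live data */
    for (int current = 0; current < nobjs; current++) {
        if (heap[current].marked) {
            heap[current].newAddr = top;          /* object will move here */
            top += heap[current].size;
        }
    }
    /* top is now the new end of the heap */
}
```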
2.1 GC interface

GC runs of JX domains are completely independent from each other. This independence allows for different GC implementations that even use different object layouts. Some GCs need additional information per object, for example a mark bit or a generation number; some GCs will reorder the fields of an object to better use CPU caches; some GCs will even compress objects or store a group of objects, such as an array of booleans, in a space-efficient manner. For this to work the garbage collector implementation must be hidden behind an implementation-independent interface. The interface is realized as a domain-specific function table (see Figure 8.2).

2.2 Moving special objects

TCB. The GC must know all references to thread control blocks (TCBs) and update them. Threads are linked to threads in other domains. During a portal call, for example, the sender thread is appended to a wait queue in the service control block. When the TCB is moved, the external references to it must be updated as well. For this purpose the TCB has a link to the SCB it waits for.

Figure 8.2: Garbage collector interface

ObjectHandle allocDataInDomain(struct DomainDesc_s *domain, int objsize, int flags)
    This function is called to allocate a new object on the domain's heap.

void done(struct DomainDesc_s *domain)
    This function is called when the domain terminates. The garbage collector should release all resources, especially the memory allocated for the heap.

void gc(struct DomainDesc_s *domain)
    A garbage collection should be performed.

boolean isInHeap(struct DomainDesc_s *domain, ObjectDesc *obj)
    The garbage collector tests whether the given object pointer is inside its heap.

void walkHeap(struct DomainDesc_s *domain, HandleObject_t handleObject)
    The garbage collector applies the handler function to all objects on its heap. This function can be used by garbage-collector-independent code to operate on all objects on the heap without knowing the organization of the heap.
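Rendered as a C struct, the function table of Figure 8.2 might look like the following; the entry points follow the figure, but the surrounding typedefs and the struct name are assumptions of this sketch:

```c
/* Hypothetical shape of the per-domain GC function table. */
struct DomainDesc_s;
typedef struct ObjectDesc_s ObjectDesc;
typedef void *ObjectHandle;
typedef void (*HandleObject_t)(ObjectDesc *obj);

struct GCInterface {
    ObjectHandle (*allocDataInDomain)(struct DomainDesc_s *domain, int objsize, int flags);
    void (*done)(struct DomainDesc_s *domain);
    void (*gc)(struct DomainDesc_s *domain);
    int  (*isInHeap)(struct DomainDesc_s *domain, ObjectDesc *obj);
    void (*walkHeap)(struct DomainDesc_s *domain, HandleObject_t handleObject);
};

/* A trivial stub, e.g. for a collector that has nothing to release. */
static void noop_done(struct DomainDesc_s *domain) { (void)domain; }
```

Each domain carries a pointer to one such table, so garbage-collector-independent kernel code always calls through the table and never depends on a particular heap organization.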
A TCB object cannot simply be copied to another domain. When a TCB object crosses a domain border (for example from domain D1 to domain D2 in Figure 8.4), a proxy for the TCB is created on the target heap. The proxy, which has the type ForeignCPUState, references the domain of the real TCB (domain D1) by using the domain pointer / domain ID mechanism (see Section 4 of Chapter 4). To reference the TCB it is not possible to simply use a pointer to the TCB object, because this object could be moved during a GC in domain D1. Therefore the proxy uses the unique thread ID to reference the TCB. Because it is very expensive to find the TCB given the thread ID, we use the following optimization: the proxy additionally contains a direct pointer to the TCB and the number of the GC epoch of domain D1. The GC epoch of a domain is a strictly monotonically increasing number. It is incremented whenever a GC is performed in the domain. It is guaranteed that if the GC epoch number has not changed, objects have not been moved.

Every pointer to a TCB must be known to the garbage collector. The GC must be able to find and update these pointers when moving the TCB. For scalability reasons these pointers must be found without scanning the heaps of other domains. An example of such a pointer is the link between a service thread and its client, which is created during a portal invocation. To return the result of a portal call the service thread needs a reference to the client thread. There are two ways to implement such a reference. In the first alternative, the server's TCB contains the thread ID of the client thread. On return from a portal call the corresponding client TCB must be found. As this may involve a linear search over all threads of the client domain, the mapping from thread ID to TCB could be very slow. Therefore an optimization must be used: together with the thread ID, a direct TCB pointer, a GC epoch number, and a pointer to the client's DCB are stored in the server's TCB.
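This epoch-validated cache can be sketched as follows; the struct and function names are illustrative, not the JX identifiers, and the linear-search slow path is a stand-in for whatever lookup the kernel actually uses:

```c
/* Sketch: a cached TCB pointer is trusted only while the client domain's
 * GC epoch is unchanged; otherwise the slow thread-ID lookup refreshes it. */
typedef struct { int threadID; } TCB;
typedef struct {
    unsigned long gcEpoch;   /* incremented on every GC of this domain */
    TCB *threads;            /* the domain's threads (simplified) */
    int nthreads;
} DCB;
typedef struct {
    int threadID;            /* always valid */
    TCB *cachedTCB;          /* fast path */
    unsigned long epoch;     /* client epoch when cachedTCB was stored */
    DCB *clientDCB;
} TCBRef;

/* Slow path: linear search over the client domain's threads. */
static TCB *lookupTCB(DCB *d, int id) {
    for (int i = 0; i < d->nthreads; i++)
        if (d->threads[i].threadID == id) return &d->threads[i];
    return 0;
}

TCB *resolveTCB(TCBRef *r) {
    if (r->epoch != r->clientDCB->gcEpoch) {      /* a GC may have moved it */
        r->cachedTCB = lookupTCB(r->clientDCB, r->threadID);
        r->epoch = r->clientDCB->gcEpoch;
    }
    return r->cachedTCB;
}
```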
The direct TCB pointer can be used as long as the GC epoch number is identical to the client domain's GC epoch. The current GC epoch is stored in the client's DCB. If the epoch numbers differ, a GC in the client domain could have moved the client TCB. In this case the thread ID is used to find the TCB, and the TCB pointer and GC epoch number can be updated. In the second alternative, the reference to the client TCB is stored in the server's TCB as a direct pointer called mostRecentlyCalledBy (see Figure 8.3). This pointer must be updated when the client TCB is moved during a garbage collection in the client domain or when the client thread or the client domain is terminated. To detect whether a TCB is referenced via a mostRecentlyCalledBy pointer, the client TCB has a pointer to the server's TCB (blockedInServiceThread). This pointer must be updated similarly to the mostRecentlyCalledBy pointer. The current implementation uses the second alternative, because it uses less space in the TCB and is faster in the common case that no GC occurred during a portal invocation.

A similar problem is caused by fast portals that represent a thread of another domain. These portals, called ForeignCPUState, also need a reference to the TCB of a thread in another domain. The second alternative cannot be used, because potentially there are many ForeignCPUState portals referencing the same TCB. Therefore ForeignCPUState portals are implemented using the first alternative (see Figure 8.4).

Stack. A previous implementation of the stack overflow check required that stacks are aligned at a multiple of their size (see Section 1). This alignment requirement prohibited allocating them on the heap because of heap fragmentation. The current implementation (called STACK1 in Section 1) does not need aligned stacks and allocates all stacks on the heap.
When the stacks are moved, the frame pointers (which link the stack frames), the CPU context in the TCB (stack pointer and frame pointer registers), and the stack pointers in the TCB must be corrected. A complication occurs when the collector thread moves its own stack. It must detect this and switch to the new stack before releasing the old heap. As the stack was copied in the shallow-copy phase (see Figure 8.1), it must copy the stack again to reflect the actual execution state. The current implementation switches to the new stack immediately after correcting the frame pointers of this stack and runs the rest of the collection on the new stack.

Figure 8.3: Management of Thread Control Blocks during a portal communication
During a portal communication the TCBs of the client and the server thread are connected in both directions. The connection from the server TCB to the client TCB (mostRecentlyCalledBy) is required for delivering the results of the invocation. The connection from client to server TCB (blockedInServiceThread) is required to update the mostRecentlyCalledBy link during a garbage collection.

Figure 8.4: Relation between TCBs, inter-domain TCB references, and stacks
Thread Control Blocks and stacks live on the heap of a domain. A stack consists of linked stack frames. Another domain can hold a reference to a thread control block by using a ForeignCPUState portal. Because TCBs as well as stacks can be moved during a garbage collection, the ForeignCPUState portal contains the thread ID of the foreign TCB, the GC epoch, and a direct pointer to the TCB.

SCB.
Service Control Blocks (SCBs) and service pools are also allocated on the heap. They contain references to TCBs, which must be updated.

Domain. Domain Control Blocks could be allocated on the heap of DomainZero. To allow direct pointers to DCBs we use a dedicated memory area for DCBs, and domain portals that contain a DCB pointer and a domain ID. When the ID in the DCB equals the ID in the domain portal, the DCB pointer is valid; otherwise the DCB has been reused.

2.3 Garbage collection and interrupt handling

All interrupt handlers are written in Java. All of the current GC implementations operate non-incrementally, i.e., no mutator thread¹ is allowed to run during a collection, because the heap is not consistent during this time. Therefore all interrupts that are serviced by the domain must be blocked. On a PC architecture the programmable interrupt controller (PIC) is used to disable dedicated interrupts. Blocking in software, by remembering the interrupt (setting a flag) and returning from the core IRQ handler, is not possible with level-triggered interrupts, which must be acknowledged at the interrupting device to be deactivated.

To avoid that a garbage collection becomes necessary during interrupt handling, the heap contains a reserve that is only available for object allocation in interrupt handler threads. As a first-level interrupt handler should be very short and should not allocate much memory, this reserve should be sufficient. If it is not, an exception is thrown.

2.4 Garbage collection and timeslicing

Scheduling threads of one domain preemptively means that a thread may be at an arbitrary instruction when another thread requires a collection. A technique that is also used in JX is to advance all threads that are located in Java code to a safe point [3]. A safe point is a point where the execution state of the thread is known: the types of the saved registers and the types of all stack positions are known.
It is difficult to advance a thread that currently executes a core function, because obtaining a stack map, which is necessary for an exact collector, would require modifying the C compiler to generate such maps. Therefore C code must either disable interrupts or register its references. A collection may also be necessary in kernel code when the kernel allocates objects. The code must be carefully written to check whether a GC occurred and reload variables (see Figure 8.5).

¹ All threads, besides the GC thread, that run in the domain and modify the heap are called mutator threads.

2.5 Garbage collection of portals and memory objects

Portals and memory objects are shared between domains. Both refer to a central data structure: portals have a reference to a service control block (SCB) in another domain, and memory objects have a reference to a memory control block (MCB) in DomainZero. Reference counters are used to reclaim the SCB and MCB. When a domain terminates or portals or memory objects become garbage, the respective reference counter must be decremented during a finalization cycle. How this is actually realized depends on the installed garbage collector. The copying GC moves live objects to a second heap and sets a flag in the original object's header to indicate that the object was copied. At the end of the GC cycle all objects that have not set this flag are dead. During the following finalization phase the heap is scanned using the walkHeap() function of Figure 8.2, and the reference counters of all dead portals and memory objects are decremented. When the domain is in the process of being terminated, all objects are considered dead.

2.6 Future work

Similar to scheduling, garbage collection should be removed from the microkernel. This is much more difficult, because it must be guaranteed that the untrusted GC preserves the type of objects while copying them.
Although there are recent advances [183], several open problems have to be solved before the technology can be used for a real system.

Most commercial JVMs use generational garbage collection. Generational GC assumes that most objects die young and that an object that survives a GC lives very long. The GC algorithm tries to separate young and old objects in two heaps and garbage collects the heap of the young objects more often than the heap of the old objects. The heap of the young objects often is called incubator or nursery. All objects are allocated in the nursery. When the nursery is garbage collected, all live objects are copied to the old objects' heap. Because old objects are not traversed, GC time is reduced. To detect references from old objects to new objects without scanning the old heap, write barriers can be used.

Figure 8.5: GC in kernel functions
Kernel code must be programmed very carefully to not hold object references across function calls. It should assume that a garbage collection can invalidate a reference and should reload all references after a function call. An exception to this rule are functions that are specified not to cause a garbage collection.

2.7 Summary

This chapter discussed issues that are related to the management of main memory in the JX system. The memory hierarchy was described. This hierarchy consists of a global memory management at the lowest level and, on top of this, a domain-local memory management and a heap management. The heap memory is managed by a garbage collector, which can be selected on a per-domain basis.

3 Memory Objects

Efficiently managing buffers is one of the challenging tasks of an operating system. This section describes the memory management facilities of the JX operating system. In a pure JVM there are only objects and arrays.
A JVM that must serve as a foundation for an efficient operating system must provide a richer interface for buffer management. For this purpose JX provides memory objects. They have four non-orthogonal properties: sharing, revocation, splitting, and mapping. Memory objects can be shared between domains. There is a revoke() method that atomically returns a new memory object representing the same range of memory and revokes access to the old one. A memory object can be split in two memory objects that together represent the original memory range; access to the original object is thereby revoked. And a memory object can be mapped to a class structure and accessed like an object.

3.1 Lifecycle of memory objects

3.1.1 Creation

Memory objects are created by the microkernel as fast portals. The MemoryManager portal allows allocating a memory object of a specified size, a ReadOnlyMemory object of a specified size, or a DeviceMemory object at a specified address with a specified size.

3.1.2 Destruction

Explicit deallocation was not an option, because it would contradict the whole philosophy of the JX system. We need an automatic mechanism: a garbage collector for memory objects. Memory objects are shared between domains, and the memory GC is not allowed to stop all domains and should not possess any global knowledge. This makes the memory GC similar to a distributed GC. One of the simplest distributed GC techniques is reference counting. Reference counting can be used, because there are no cycles in memory references (a memory object only contains a reference to its parent memory object).

3.2 Implementation

This section describes one possible implementation of memory objects. As the data structures used to maintain memory objects are only accessed by the memory object implementation, different implementations are possible. We experimented with implementations that improve efficiency by providing only a subset of the four features sharing, revocation, splitting, and mapping.
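The acyclic reference counting of Section 3.1.2 can be sketched as follows; the names are illustrative, and a flag stands in for returning memory to the allocator so the sketch stays self-contained:

```c
/* Sketch: reference-counted memory control blocks. A child created by
 * splitting references only its parent, so the graph is acyclic and
 * plain reference counting suffices. */
typedef struct MCB {
    int refcount;
    int freed;             /* stands in for freeing the block */
    struct MCB *parent;    /* 0 for a root allocation */
} MCB;

void mcb_release(MCB *m) {
    while (m && --m->refcount == 0) {
        MCB *p = m->parent;
        m->freed = 1;      /* a real implementation frees the block here */
        m = p;             /* dropping the child drops its parent reference */
    }
}
```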
The results of these experiments are presented in the performance evaluation section at the end of this chapter.

Sharing. Because memory objects can be shared between domains, a global data structure (Memory Control Block, MCB) is required to maintain the state of the memory object (see Figure 8.6). This shared data structure is referenced by memory portals that are located on the heaps of different domains. The state, or meta information, of the memory object consists of a flag used for revocation and a reference count used to garbage collect memory regions.

Figure 8.6: Data structures for the management of memory objects
Domains can access "plain" memory by using a memory portal. The memory portal is a fast portal that contains a pointer to a memory control block (MCB), size information, and a pointer to the memory area. This information is stored in the "additional data" part of the fast portal (see Figure 5.12 of Chapter 5). MCBs are maintained in a doubly-linked list. Each MCB is responsible for an area of memory that does not overlap with the area of any other MCB.

Revocation. If a memory object supports revocation, the revocation check and the memory access must be performed as one atomic operation. Systems that do not perform these operations atomically are said to have the time-of-check-to-time-of-use (TOCTTOU) flaw. An example of this kind of flaw is the STOP-PROCESS-ERROR of Multics described in [23]. Operations on memory objects are short (with the exception of memory copy) and do not block. To realize atomicity efficiently (that is, without using mutual exclusion locks) on a uniprocessor, interrupts can be disabled or atomic code (see Section 4.3 of Chapter 7) can be used.
On a multiprocessor a spinlock is used. The spinlock is implemented using an atomic bus transaction to read and write a data word. Several CPU instructions are available for this purpose, such as test-and-set or compare-and-swap (CAS). When using CAS, the revocation flag and the lock flag of the spinlock can be arranged in the same word and checked with a single CAS instruction.

Splitting. Often it is desirable to pass only a subrange of a certain memory to another domain. Each memory type supports the creation of subranges. Splitting and revocation are not orthogonal. A natural semantics for revocation says that the contents of the memory can only be changed by the revoker after a revocation. If a memory object is split, there are still references to the unsplit memory. A split must therefore revoke access to the original memory.

Mapping. A memory object is a data container without data structure. Many memory objects contain data that is structured. For example, a memory object holding a disk block of inodes has the structure of an array of inodes. A memory object that holds a network packet can be structured into a header and a payload. To allow structured access to a memory object, JX supports the abstraction of mapping an object to a memory object. The memory object contains the state of the object. An access to a field of the object accesses the underlying memory object. The object is an "instance" of a class that is marked using the marker interface MappedObject. The bytecode verifier ensures that no instances of such a class can be created by using the regular object creation mechanism (the new operation). To specify the byte order that should be used when mapping a sequence of bytes to an integer or short data type, the mapping class uses a subtype of MappedObject: either MappedLittleEndian or MappedBigEndian. The bytecode-to-nativecode translator is then able to generate the necessary machine code instructions to access the memory in the correct byte order.
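The combined lock/revocation word described under Revocation above can be sketched with C11 atomics; the bit layout and names are assumptions of this illustration, not the JX encoding:

```c
#include <stdatomic.h>

/* Sketch: the revocation flag and the spinlock's lock flag share one
 * word, so a single CAS both acquires the lock and verifies that the
 * memory object has not been revoked. */
#define MEM_REVOKED 0x1u
#define MEM_LOCKED  0x2u

/* Returns 1 if the lock was taken (object not revoked), 0 if the object
 * is revoked or currently locked; a real spinlock retries on the latter. */
int mem_lock_unless_revoked(atomic_uint *state) {
    unsigned int expected = 0;               /* neither revoked nor locked */
    return atomic_compare_exchange_strong(state, &expected, MEM_LOCKED);
}

void mem_unlock(atomic_uint *state) {
    atomic_fetch_and(state, ~MEM_LOCKED);
}
```

Because the CAS only succeeds when both flags are clear, a revoked memory object can never be locked again, which closes the TOCTTOU window between check and access.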
Figure 8.7: Mapping a memory range to a class
(a) The mapping function describes how a field access is translated into a memory access. (b) A mapped memory is represented by a fast portal. The portal object contains a pointer to the begin of the memory area, a reference to the memory control block (MCB), and a reference to a parent mapping if this object is part of a larger aggregation (otherwise the parent link is null).

The mapping function currently only supports primitive integer types (byte, short, char, int, long). The usefulness of the mapping mechanism would be improved considerably if aggregated data types were supported, i.e., an object that contains not only primitive fields but also other objects. The way Java handles the aggregation (also called the parts/whole relation) makes this very difficult: the whole object does not contain the data of the part object but only a reference to it. This reference has no pointer to its (logically) enclosing object and can be used as an independent reference. This problem can be solved by introducing hierarchical mappings: each mapped object can be part of a larger aggregation, and the portal object contains a reference to the enclosing mapped object.

3.3 Problems

Efficiency. As already mentioned in Section 3.2, not all features of a memory object are needed by all applications. Several applications do not pass memory objects to other domains but require the revocation feature within their own domain. We implemented a memory object that creates the central data structure only when the memory object crosses a domain boundary. The revocation flag is part of the heap-managed memory proxy until a global MCB object is created.

Global resources. Memory objects require a central data structure to allow garbage collection and revocation of shared memory.
This contradicts our design principle of not allowing a domain to allocate global resources. We alleviate this problem by using per-domain quotas that limit the number of MCB objects and the amount of memory a domain is allowed to allocate.

Hardware references. Memory objects are also used in device drivers. A device driver may use the memory for DMA transfers to and from a device. This means that the hardware has an implicit reference to the memory object, and the memory should not be garbage collected even if there are no other references to it. There are two solutions to this problem. First, the programmer of the device driver can ensure that the memory object stays alive by making it reachable from the root set of live references. This will usually be the case, because the memory object involved in the DMA transfer will be part of a data structure that manages the currently executing device operation. A more robust solution is to support the programmer in keeping the memory object alive. An implicit hardware reference can only be created by asking the memory object for the start of its physical memory area; this start address is then written into a device register or into a DMA table. The system can therefore support the programmer by incrementing the reference count in the MCB object whenever the physical start address of the memory is queried. This requires that the programmer invoke a method to decrement the reference count once the memory is no longer used by the hardware.

3.4 Buffer management

Memory objects are used as buffers, for example in the network stack. Network packet processing requires efficient buffer management: allocating a buffer and passing a buffer to another domain must be fast operations. Some kind of flow control is needed to prevent a domain from running out of buffers. A domain that wants to receive a packet passes a buffer to the network domain. In exchange it gets a (different) buffer that contains the received data.
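This exchange discipline, in which a buffer is never handed over without receiving one in return, can be sketched as follows. The sketch is illustrative Java with hypothetical names, not the JX network-stack code; it models the processing and free queues of a single stage from Figure 8.8.

```java
import java.util.ArrayDeque;

// Sketch of one stage of the buffer-exchange protocol: every send swaps a
// full buffer for an empty one, so the number of buffers in circulation is
// constant and no domain can drain another domain's buffer pool.
public class BufferExchange {
    private final ArrayDeque<byte[]> processing = new ArrayDeque<>();
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    public BufferExchange(int nBuffers, int size) {
        for (int i = 0; i < nBuffers; i++) free.add(new byte[size]);
    }

    /** Enqueue a full buffer; receive an empty one in exchange. */
    public byte[] send(byte[] full) {
        processing.add(full);
        return free.poll();          // null only if the pool is exhausted
    }

    /** Worker side: consume a buffer, then return it to the free queue. */
    public byte[] process() {
        byte[] b = processing.poll();
        if (b != null) free.add(b);
        return b;
    }

    public int inFlight() { return processing.size(); }
}
```

A real implementation would guard both queues with the spinlock described earlier; the exchange itself stays a constant-time queue operation.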
The send operation returns an empty buffer in exchange for the buffer that contains the data. The principle of this operation is illustrated in Figure 8.8.

Figure 8.8: Principle of buffer management. (a) Multi-domain configuration: the network protocol stack is distributed across multiple domains. Domain D1 contains the application, domain D5 the lowest layer of the network stack. During the send operation a memory object is passed from D1 to D2 to D3, where it is enqueued in a processing queue. The service thread of D3 dequeues a memory object from the free queue and passes it as return value to domain D2, which passes it to D1. Another thread (worker thread) asynchronously dequeues the memory object from the processing queue and passes it on to D4, which processes it and passes it to D5. The service thread of D5 makes the memory available to the DMA hardware and inserts it in a processing queue. The device asserts an interrupt when it completes the DMA transfer, and the memory object can then be moved from the processing queue to the free queue. (b) Single-domain configuration: the same as (a), but all components run in a single domain. The operation is identical, only there are no service threads. (Legend: portal invocation, portal return, cycle of memory objects, queue.)

4 Performance evaluation

This section evaluates the performance of the global memory manager, memory objects, and stack size checks.

4.1 Global memory management

The bitmap-based global memory manager has a low memory footprint.
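A bitmap allocator of this kind can be sketched as follows: one bit per block, so managing n blocks costs about n/8 bytes of bookkeeping. This is an illustrative first-fit sketch, not the JX implementation.

```java
import java.util.BitSet;

// Sketch of a bitmap-based block allocator: one bit per fixed-size block.
public class BitmapAllocator {
    public static final int BLOCK_SIZE = 1024;   // block size used in the text
    private final BitSet used;                   // set bit = block allocated
    private final int nBlocks;

    public BitmapAllocator(int nBlocks) {
        this.nBlocks = nBlocks;
        this.used = new BitSet(nBlocks);
    }

    /** First-fit search for a run of free blocks; returns start block or -1. */
    public int alloc(int blocks) {
        int start = 0;
        while (start + blocks <= nBlocks) {
            int nextUsed = used.nextSetBit(start);
            if (nextUsed == -1 || nextUsed - start >= blocks) {
                used.set(start, start + blocks); // mark the run allocated
                return start;
            }
            start = nextUsed + 1;                // skip past the obstacle
        }
        return -1;                               // no sufficiently large run
    }

    public void free(int start, int blocks) {
        used.clear(start, start + blocks);
    }
}
```

With 1024-byte blocks the bitmap for one block costs a single bit, which matches the footprint figures reported below.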
Using 1024-byte blocks and managing about 128 MBytes, or 116977 blocks, the overhead is only 14622 bytes (15 blocks, or about 0.01 percent).

4.2 Stack size check

We used a microbenchmark to measure the overhead of the stack size check. A method is called in a loop of 100,000 iterations. There is no measurable difference between the two check variants in this microbenchmark.

Tab. 8.1 Virtual method invocation with different stack size checks

  Operation       Time (ns)
  no check           22
  check STACK0       40
  check STACK1       40

4.3 Memory objects

Table 8.2 shows the performance of the memory operations create, revoke, and split.

Tab. 8.2 Performance of memory operations

  Operation             Time (ns)
  Create Memory             2,496
  Create DeviceMemory         750
  Revoke Memory               900
  Split Memory              1,780
  set32, 1 KB block        18,400
  set32                        50
  get32                        45
  Create mapping              570

Figure 8.9 displays the time in nanoseconds between event log points in the memory allocation code.

Figure 8.9: Time between events during the creation of a memory object. Time is given in nanoseconds with standard deviation; the number in parentheses is the number of transitions between two events.

  IN -> MALLOC        552 +/- 66  (1004)
  MALLOC -> MEMSET    542 +/- 217 (1004)
  MEMSET -> PROXYIN   220 +/- 44  (1004)
  PROXYIN -> MCBIN    464 +/- 81  (2005)
  MCBIN -> OUT        578 +/- 54  (1004)

The time between IN and MALLOC is spent allocating the memory from the global memory management. The time between MALLOC and MEMSET is spent filling the memory range with zero. The time between PROXYIN and MCBIN is spent creating the proxy object (memory portal). The time between MCBIN and OUT is spent creating the MCB.

4.3.1 Delayed MCB creation

When a memory object is used only within one domain (no sharing feature), no central data structures must be created for this memory object. Table 8.3 shows the effect of creating MCBs lazily. The saved time of about 500 nanoseconds is consistent with the event trace of Figure 8.9.
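The delayed-MCB optimization can be sketched as follows. Names and structure are illustrative, not the JX implementation: the proxy keeps the revocation flag on the domain-local heap and allocates the global MCB only when the memory first crosses a domain boundary.

```java
// Sketch of delayed MCB creation: the global control block is allocated
// lazily, so purely domain-local memory objects never consume a global
// resource.
public class MemoryProxy {
    private boolean revoked;   // heap-managed flag while domain-local
    private MCB mcb;           // global control block, created lazily

    static final class MCB {
        volatile boolean revoked;
        volatile int refCount = 1;
    }

    /** Called when the memory object is passed to another domain. */
    public MCB shareAcrossDomains() {
        if (mcb == null) {
            mcb = new MCB();
            mcb.revoked = revoked;   // migrate the local flag into the MCB
        }
        mcb.refCount++;
        return mcb;
    }

    public void revoke() {
        if (mcb != null) mcb.revoked = true;
        else revoked = true;         // cheap path: no global resource touched
    }

    public boolean isRevoked() {
        return mcb != null ? mcb.revoked : revoked;
    }
}
```

The roughly 500 ns saved by this optimization corresponds to the PROXYIN-to-MCBIN interval in the event trace, which is exactly the work the lazy path skips.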
Tab. 8.3 Memory creation time with and without the optimization of delayed MCB creation

  Operation                     Time (ns)
  w/o delayed MCB creation         2,496
  with delayed MCB creation        1,904

4.3.2 MappedMemory

Figure 8.10 shows an example of the use of MappedMemory; Figure 8.11 shows an equivalent C program. The C program is not compiled with an optimization option: by inspecting the assembler code we recognized that with the -O2 optimization option the gcc compiler moves the data accesses out of the loop. The mapped access and the get/set test were compiled with inlined memory access. The results of the benchmark are presented in Table 8.4. The table shows the improvement of mapping compared to the get/set interface. This improvement is due to removed range checks. It also shows how the get/set interface and the mapped interface compare to a C program running on Linux. While access to a mapped object is still slower than access to a C struct, the difference is rather small and is due to the additional indirection of MappedMemory compared to a C struct (see Section 4.3.2).

Tab. 8.4 Memory access time of mapped memory vs. get/set
This table shows the performance of different memory access mechanisms. The column titled "get-set/mapped" contains the ratio between the access time when using the get-set interface and the access time when using mapped memory. The last two columns compare these interfaces with the time of the Linux benchmark.

  Operation         get-set (µs)   mapped (µs)   get-set/mapped   C prog. on Linux (µs)   get-set/C-Linux   mapped/C-Linux
  4,000,000 write      73474          61010           1.20             42212                   1.74              1.45
  4,000,000 read       48000          32687           1.47             24664                   1.95              1.33
  1 write                  0.018          0.015       1.20                 0.010               1.74              1.45
  1 read                   0.012          0.008       1.47                 0.006               1.95              1.33

We also measured the time required to create a mapping. This time is important because mapping is performed on the performance-critical path, for example when receiving or sending network packets. The time to create 1000 mappings was 552 microseconds (or 552 nanoseconds per mapping). This means that creating a mapping is slightly slower than creating an object (see Chapter 3, Section 4). This is reasonable because a mapping is represented by a proxy object that is created during the mapping process. These numbers show that mapping pays off only when the data is accessed very often. A mapped read saves 6 nanoseconds, a mapped write saves 8 nanoseconds; the time to map an object is therefore equivalent to 92 read accesses or 69 write accesses. A network packet header is usually not accessed that often, so mapping will not improve performance but will lead to more readable programs. Only an order-of-magnitude performance improvement of object allocation will make mapping practical.

5 Summary

This chapter described the memory management of the JX system. Similar to two-level scheduling, memory management is separated into a global management and a domain-local management. The global management allocates the available main memory to domains. It uses a bitmap allocator with a low memory footprint. The domain-local management uses almost all its memory for a garbage-collected heap. All data is allocated on this heap [1], including thread control blocks and stacks. A domain can select from a number of garbage collectors to manage this heap.
A defined interface between the kernel and the garbage collectors allows the use of different collectors. To cope with large amounts of memory and to share memory between domains, memory objects have been introduced. They allow sharing, revocation, splitting, and mapping. The performance of several operations of memory objects and the performance impact of delayed memory control block creation were measured.

Figure 8.10: Measured code sequence for JX MappedMemory access

  class MyMap implements MappedLittleEndianObject {
      int a; char b; short c; int d;
  }

  Memory mem = memoryManager.alloc(12);
  VMClass cl = componentManager.getClass("test/memobj/MyMap");
  MyMap m = (MyMap) mem.map(cl);
  // -- START TIME --
  for (int i = 0; i < ntries; i++) {
      m.a = 42; m.b = 43; m.c = 44; m.d = 45;
  }
  // -- END TIME --

Figure 8.11: Measured code sequence for C struct access

  struct MyMap {
      signed long a;
      unsigned short b;
      signed short c;
      signed long d;
  };

  char *mem = malloc(12);
  struct MyMap *m = (struct MyMap *) mem;
  // -- START TIME --
  for (i = 0; i < n; i++) {
      m->a = 42; m->b = 43; m->c = 44; m->d = 45;
  }
  // -- END TIME --

[1] As already mentioned, the current prototype uses a so-called fixed memory area to store data structures that are not yet prepared to be moved by a garbage collector. This includes the domain-local part of a component.