
Distributed Java Virtual Machine
Document information
Author | Ken H. Lee |
Instructor | Prof. Thomas R. Gross |
School | ETH Zurich |
Major | Computer Science |
Document type | Master's Thesis |
Place | Zurich |
Language | English |
Size | 733.75 KB |
Summary
I. Distributed Java Virtual Machine (DJVM) Design
This research presents a novel distributed Java Virtual Machine (DJVM) design that enhances Java performance in cluster computing environments. Unlike existing page-based Distributed Shared Memory (DSM) systems like JESSICA and Java/DSM, this DJVM embeds a Shared Object Space (SOS) – a virtual global object heap – directly into the Jikes Research Virtual Machine (JikesRVM). This architecture provides access to rich runtime information, enabling significant optimizations. The system utilizes a Home-based Lazy Release Consistency (HLRC) model for cache coherence, managing the interactions between shared and local objects. A key feature is the lazy detection of shared objects, only moving objects into the SOS when accessed by multiple threads across different nodes, thereby minimizing communication overhead. I/O redirection to a single master node ensures a Single System Image (SSI), enhancing programming transparency. The system employs a distributed scheduler with load balancing to distribute threads effectively across the cluster.
1. Shared Object Space (SOS) and JikesRVM Integration
The core innovation of this distributed Java Virtual Machine (DJVM) lies in embedding a Shared Object Space (SOS), a virtual global object heap, directly within the Jikes Research Virtual Machine (JikesRVM). This contrasts sharply with approaches such as JESSICA and Java/DSM, which build on page-based distributed shared memory. Because the SOS is part of the JVM itself, the system has access to extensive runtime information, opening avenues for optimization that are unavailable to systems relying on external DSM solutions. The research identifies this close integration as the primary distinguishing factor from existing distributed JVM implementations and the source of several efficiency gains and unique optimization opportunities.
2. Cache Coherence and Distributed Synchronization
A crucial aspect of the DJVM design is a robust cache coherence protocol governing how objects move between the Shared Object Space (SOS) and each node's local object heap. A custom protocol was developed to manage concurrent access to shared objects by threads residing on different machines in the cluster. Together with efficient distributed object synchronization, this protocol ensures data consistency across nodes while supporting the distributed synchronization of threads.
3. Distributed Class Loading, Scheduling, and I/O Redirection
The DJVM incorporates distributed class loading so that every node in the cluster has a consistent view of the same class definitions, which is essential for correct execution. A distributed scheduler assigns threads to cluster nodes for load balancing, improving resource utilization. Finally, all I/O operations are redirected to a single node, simplifying the programming model and masking the complexity of the distributed architecture from the application.
4. Communication Substrate and Message Handling
The communication infrastructure of the DJVM is built on both raw sockets and TCP/IP sockets. Raw sockets provide high-speed communication by bypassing the TCP/IP stack, but offer no delivery guarantees; TCP/IP sockets are used where reliable communication is required, and the choice between the two depends on the speed and reliability needs of each communication scenario. A buffer pool supports efficient message encoding and transmission by avoiding frequent memory allocations, and custom serialization methods keep message handling lightweight. Each node runs a MessageDispatcher thread that receives incoming messages and dispatches them to a process() method.
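The buffer-pool idea described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; the class and method names (`BufferPool`, `acquire`, `release`) are hypothetical.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of a reusable message buffer pool: buffers are
// recycled between transmissions to avoid allocating a fresh byte[]
// for every message.
public class BufferPool {
    private final ConcurrentLinkedQueue<byte[]> free = new ConcurrentLinkedQueue<>();
    private final int bufferSize;

    public BufferPool(int bufferSize, int initialCount) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < initialCount; i++) free.add(new byte[bufferSize]);
    }

    // Hand out a pooled buffer, or allocate a new one if the pool is empty.
    public byte[] acquire() {
        byte[] buf = free.poll();
        return buf != null ? buf : new byte[bufferSize];
    }

    // Return a buffer to the pool once the message has been sent.
    public void release(byte[] buf) {
        if (buf.length == bufferSize) free.add(buf);
    }

    public int available() { return free.size(); }
}
```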
5. Thread Distribution and Object Sharing
When a thread is distributed to another node, only the objects directly reachable from its stack are marked as shared, minimizing the amount of data transferred. Each shared object is assigned a Globally Unique Identifier (GUID), and the GUIDs of the referenced objects are transmitted along with the thread. The receiving node reconstructs the thread object and uses the GUIDs to locate the referenced objects in its local memory, or to initiate a fault-in request to the object's home node if a copy is not yet available.
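The lookup-then-fault-in step on the receiver node can be sketched as below. This is an illustrative model only; the names (`GuidResolver`, `fetchFromHome`) and the fetch callback are assumptions, not the thesis's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of GUID-based object resolution on a receiver node:
// consult the local cache first, and fault in a copy from the home node
// only on a miss.
public class GuidResolver {
    public interface HomeNodeFetcher { Object fetchFromHome(long guid); }

    private final Map<Long, Object> localCache = new HashMap<>();
    private final HomeNodeFetcher fetcher;
    public int faultIns = 0; // counts round-trips to the home node

    public GuidResolver(HomeNodeFetcher fetcher) { this.fetcher = fetcher; }

    public void cache(long guid, Object obj) { localCache.put(guid, obj); }

    // Resolve a GUID received with a distributed thread.
    public Object resolve(long guid) {
        Object obj = localCache.get(guid);
        if (obj == null) {
            faultIns++;                      // miss: request from home node
            obj = fetcher.fetchFromHome(guid);
            localCache.put(guid, obj);       // cache for later accesses
        }
        return obj;
    }
}
```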
II. Memory Consistency and Object Management
The DJVM employs an object-based DSM with a lazy detection scheme for shared objects. The system uses the HLRC model to manage cache coherence, improving upon the scalability limitations of the standard Lazy Release Consistency (LRC) model. A faulting scheme allows efficient retrieval of shared objects from their home nodes when necessary. The Java Memory Model (JMM) is adhered to, guaranteeing correct synchronization through the implementation of an efficient cache coherence protocol. The system separates objects into shared and node-local objects, leveraging the local garbage collector for node-local objects. This separation reduces the overhead associated with distributed cache flushes.
1. Object-Based Distributed Shared Memory (DSM)
The research implements an object-based Distributed Shared Memory (DSM) system, in contrast to the page-based DSM used in JESSICA and Java/DSM; this choice of granularity is a critical design decision. A lazy detection scheme moves objects into the Shared Object Space (SOS) only when they are accessed by multiple threads on different nodes, reducing unnecessary data movement and communication overhead. A complementary faulting scheme requests a copy of a shared object from its home node only when the object is not already available locally, avoiding redundant transfers. The management of shared objects is central to the system's performance and scalability.
2. Home-Based Lazy Release Consistency (HLRC)
The implemented system uses a Home-based Lazy Release Consistency (HLRC) model for memory consistency. HLRC improves on standard Lazy Release Consistency (LRC) by addressing LRC's high memory consumption for protocol overhead, which limits scalability. In HLRC, each unit of shared data is assigned a home node; updates are propagated to the home, where they are applied. This minimizes both the number of messages exchanged between nodes and the amount of data transmitted. The reduced protocol overhead, which is especially beneficial in the context of garbage collection, justifies the choice of HLRC over alternatives such as plain LRC.
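The home-based writeback idea behind HLRC can be sketched with a diff-and-apply pair: a writer keeps a pristine "twin" of the data and, at release time, sends only the changed positions to the home node, which applies them to its master copy. This is a generic HLRC illustration at byte granularity, not the thesis's object-level implementation; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of HLRC-style home writeback: compute a diff
// against a pristine twin and ship only the changes to the home node.
public class HlrcDiff {
    // Encode the diff as (index, newValue) pairs.
    public static List<int[]> diff(byte[] twin, byte[] working) {
        List<int[]> changes = new ArrayList<>();
        for (int i = 0; i < twin.length; i++) {
            if (twin[i] != working[i]) changes.add(new int[]{i, working[i]});
        }
        return changes;
    }

    // The home node applies an incoming diff to its master copy.
    public static void applyDiff(byte[] homeCopy, List<int[]> changes) {
        for (int[] c : changes) homeCopy[c[0]] = (byte) c[1];
    }
}
```

Shipping diffs rather than whole copies is what keeps the message count and data volume low, as the text notes.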
3. Cache Coherence Protocol and Java Memory Model (JMM)
A custom cache coherence protocol ensures that synchronization on objects adheres to the Java Memory Model (JMM): the most recent data of a shared object is fetched during an acquire operation, and writebacks to the home nodes occur after release operations. The protocol's efficiency directly affects overall system performance. By separating objects into shared and node-local categories, the system lets the local garbage collector handle node-local objects while the distributed cache coherence protocol manages shared ones, reducing the overhead of distributed cache flushes. This separation is crucial for efficient resource management, and consistency with the JMM is essential for program correctness and predictability.
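The acquire/release behavior described above can be modeled as a small per-object state machine. This is a simplified sketch of the idea (invalidate on release so the next acquire re-fetches), not the thesis's actual protocol; state names and counters are hypothetical.

```java
// Hypothetical sketch of per-object coherence states: an acquire must see
// fresh data (stale copies are re-fetched from the home node), and a
// release writes dirty copies back to the home node.
public class CachedObject {
    public enum State { INVALID, VALID, DIRTY }

    public State state = State.INVALID;
    public int fetches = 0;    // round-trips to the home node
    public int writebacks = 0; // updates pushed to the home node

    // Monitor entry: fetch the latest copy if ours is stale.
    public void acquire() {
        if (state == State.INVALID) {
            fetches++;               // fault in from the home node
            state = State.VALID;
        }
    }

    public void write() { state = State.DIRTY; }

    // Monitor exit: publish local modifications, then conservatively
    // self-invalidate so the next acquire observes remote updates.
    public void release() {
        if (state == State.DIRTY) writebacks++;
        state = State.INVALID;
    }
}
```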
4. Faulting Scheme and Handling Dangling Pointers
Because the DSM is object-based and cannot rely on the hardware support of a Memory Management Unit (MMU), the faulting scheme is implemented as software checks inserted by the JikesRVM's Baseline compiler. These checks detect access to invalid or unavailable shared objects and trigger requests to their home nodes. The strategies for handling dangling pointers are also carefully considered, comparing the use of invalid addresses against allocating dummy objects.
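The shape of such a compiler-inserted software check can be sketched as below: a fast path when the local copy is valid, and a slow path that faults the object in from its home node. The names (`AccessBarrier`, `SharedRef`, `checkedRead`) are illustrative assumptions, not the thesis's code.

```java
// Hypothetical sketch of the software check emitted before accesses to
// possibly-shared objects: without MMU support, each access tests a
// validity flag and triggers a fault-in on demand.
public class AccessBarrier {
    public interface FaultHandler { void faultIn(SharedRef ref); }

    public static class SharedRef {
        public boolean valid;      // false until a fresh copy is cached
        public Object payload;
        public SharedRef(boolean valid, Object payload) {
            this.valid = valid; this.payload = payload;
        }
    }

    // The check before a field read: fast path when valid, slow path
    // requesting the object from its home node otherwise.
    public static Object checkedRead(SharedRef ref, FaultHandler handler) {
        if (!ref.valid) {          // software equivalent of a page fault
            handler.faultIn(ref);
        }
        return ref.payload;
    }
}
```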
III. Distributed Mechanisms and Optimizations
Several critical distributed mechanisms were implemented: distributed class loading using a centralized class loader, a distributed scheduler with load balancing for efficient thread allocation, and I/O redirection to a master node to maintain a Single System Image (SSI). A key optimization is the lazy detection of shared objects, migrating objects to the Shared Object Space (SOS) only when necessary. The system further uses raw sockets and TCP/IP sockets for efficient inter-node communication, offering both speed and reliability, and implements fast object transfer using the direct memory access provided by JikesRVM.
1. Distributed Class Loading
A key mechanism implemented in the DJVM is distributed class loading. This ensures that all nodes in the cluster maintain a consistent view of the classes used by the Java application. This consistent view is essential for correct program execution in a distributed environment. The research does not detail the specific implementation but emphasizes that this centralized approach eliminates inconsistencies that could arise from independent class loading on each node. The centralized nature of class loading contributes to the system's overall Single System Image (SSI) design goal, presenting a unified view of the application to the programmer, regardless of the underlying distributed nature of the JVM.
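Since the thesis does not detail the implementation, the following is only one plausible sketch of a centralized scheme: worker nodes ask a single authority for class bytes, so every node defines identical classes. Here the master is stood in for by a local `byte[]` map; the class name and structure are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of centralized class loading: all nodes obtain
// class bytes from one source, guaranteeing a consistent view of class
// definitions across the cluster.
public class ClusterClassLoader extends ClassLoader {
    private final Map<String, byte[]> masterBytecode;

    public ClusterClassLoader(Map<String, byte[]> masterBytecode) {
        super(ClusterClassLoader.class.getClassLoader());
        this.masterBytecode = masterBytecode;
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // In the real system this would be a request to the master node.
        byte[] bytes = masterBytecode.get(name);
        if (bytes == null) throw new ClassNotFoundException(name);
        return defineClass(name, bytes, 0, bytes.length);
    }
}
```

Standard delegation still applies: bootstrap and platform classes resolve through the parent chain, so only application classes go through the centralized path.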
2. Distributed Scheduling and Load Balancing
The DJVM incorporates a distributed scheduler with a load balancing function. This component is responsible for determining where to allocate and start Java application threads. The goal is to distribute the workload evenly across the cluster to maximize resource utilization and minimize execution time. The load balancing strategy employed in this work uses a static approach, allocating threads to the least loaded node before they begin execution. This contrasts with more sophisticated dynamic load balancing techniques used in other distributed JVMs. The static load-balancing approach was chosen for its simplicity and ease of implementation. However, the research acknowledges the limitations of this approach compared to dynamic load balancing.
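The static placement policy described above can be sketched in a few lines: each new thread goes to the node reporting the lowest load at creation time and is never migrated afterwards. The class and method names are hypothetical.

```java
// Hypothetical sketch of static load balancing: assign each new thread
// to the currently least loaded node, with no later migration.
public class StaticLoadBalancer {
    private final int[] threadsPerNode;

    public StaticLoadBalancer(int nodeCount) {
        threadsPerNode = new int[nodeCount];
    }

    // Pick the least loaded node and account for the new thread there.
    public int placeThread() {
        int target = 0;
        for (int n = 1; n < threadsPerNode.length; n++) {
            if (threadsPerNode[n] < threadsPerNode[target]) target = n;
        }
        threadsPerNode[target]++;
        return target;
    }
}
```

The simplicity is the point: no runtime load monitoring or thread migration is needed, which is exactly the trade-off against dynamic schemes that the text notes.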
3. I/O Redirection for a Single System Image (SSI)
To achieve a Single System Image (SSI), all I/O operations are redirected to a designated master node. This means that file operations, such as opening, reading, and writing files, are all handled by the master node, even if the request originates from a worker node. The master node also handles standard output, ensuring consistent behavior across nodes. This design choice simplifies the application programming model, as the programmer does not need to account for the distributed nature of the system when performing I/O operations. By centralizing I/O, the system avoids the need for a shared filesystem like NFS, reducing complexity and potential performance bottlenecks. The consistent I/O handling from the master node contributes significantly to the system's Single System Image (SSI) design.
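The redirection pattern can be sketched as a worker-side stub that forwards every file operation to the master, which performs it and returns the result. This is an illustrative model only: the master's filesystem is stood in for by an in-memory map, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of worker-side I/O redirection: a worker never
// touches a local filesystem; each operation becomes a request to the
// master node, so all nodes observe the same files (the SSI property).
public class RedirectedIo {
    // Plays the role of the master node servicing forwarded requests.
    public static class MasterNode {
        private final Map<String, String> files = new HashMap<>();
        public void write(String path, String data) { files.put(path, data); }
        public String read(String path) { return files.get(path); }
    }

    // Worker-side stub: every call is forwarded to the master.
    public static class WorkerFile {
        private final MasterNode master;
        private final String path;
        public WorkerFile(MasterNode master, String path) {
            this.master = master; this.path = path;
        }
        public void write(String data) { master.write(path, data); }
        public String read() { return master.read(path); }
    }
}
```

Because both workers talk to the same master, a write on one node is immediately visible to reads on another, without a shared filesystem such as NFS.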
4. Communication Model: Raw Sockets and TCP/IP
The DJVM utilizes both raw sockets and TCP/IP sockets to handle inter-node communication. Raw sockets, which encapsulate messages directly into Ethernet frames, are selected for speed in a local area network, avoiding the overhead of the TCP/IP stack. However, the system also incorporates TCP/IP sockets to ensure reliability. A buffer pool is implemented to efficiently manage message encoding and transmission. This reduces the computational overhead associated with repeated buffer allocation during message transmission, resulting in improved efficiency. The choice to use both communication methods reflects a balance between the desire for fast communication and the need for reliable message delivery. This dual approach provides flexibility and robustness to the system’s communication infrastructure.
5. Fast Object Transfer Mechanism
The system leverages the JikesRVM's direct memory access capabilities, combined with stream-oriented TCP sockets, to create a fast object transfer mechanism. Direct memory access allows objects to be copied efficiently into byte buffers for transmission, reducing the latency of transfers between nodes when shared objects are requested or updated. The benefit is most pronounced for large objects, and the combination of direct memory access and stream-oriented communication is presented as an effective strategy for minimizing the overhead of object transfer.
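The encode/decode shape of such a transfer can be sketched with a plain `ByteBuffer` standing in for both the raw memory copy (which JikesRVM can perform directly) and the TCP stream; the object's fields are modeled as an `int[]`. All names here are illustrative.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the transfer path: pack an object's data into
// a wire buffer on the sender, unpack it on the receiver.
public class ObjectTransfer {
    // Sender side: length-prefixed encoding of the object's fields.
    public static byte[] encode(int[] fields) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * fields.length);
        buf.putInt(fields.length);
        for (int f : fields) buf.putInt(f);
        return buf.array();
    }

    // Receiver side: rebuild the object's data from the wire buffer.
    public static int[] decode(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire);
        int[] fields = new int[buf.getInt()];
        for (int i = 0; i < fields.length; i++) fields[i] = buf.getInt();
        return fields;
    }
}
```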
IV. Challenges and Future Work
Challenges encountered during development included deadlocks caused by blocking operations and difficulties in separating VM objects from application objects. Future work focuses on object home migration to further optimize performance, fully implementing the semantics of the volatile keyword for complete thread synchronization consistency, addressing limitations in debugging support, and extending support for annotation processing and custom class loaders. Thread migration for dynamic load balancing was also considered.
1. Deadlock Issues
During development, deadlocks were encountered in situations where a thread, after acquiring a lock, had to wait for a response from another node (for example, during class loading). Because the waiting thread held the lock, a classic deadlock scenario arose. The lack of comprehensive debugging support in the JikesRVM at the time made these deadlocks difficult to identify and resolve, underscoring the importance of robust debugging tools when working with complex distributed, multithreaded systems.
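The hazardous pattern, and the shape of a fix, can be sketched as below: the problem is blocking on a remote reply while still holding a lock that the thread delivering the reply also needs; the fix is to release the lock before waiting. This is an illustrative reconstruction of the pattern, not code from the thesis.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the deadlock pattern: waiting for a remote
// reply while holding a lock the reply handler needs.
public class RemoteWaitDeadlock {
    private final ReentrantLock classTableLock = new ReentrantLock();

    // Hazardous shape: the lock is still held at the point where the
    // thread would block waiting for the remote node's response.
    public boolean holdsLockWhileWaiting() {
        classTableLock.lock();
        try {
            return classTableLock.isHeldByCurrentThread(); // wait point
        } finally {
            classTableLock.unlock();
        }
    }

    // Safer shape: update state under the lock, then wait lock-free,
    // so the thread processing the reply can acquire the lock.
    public boolean releasesLockBeforeWaiting() {
        classTableLock.lock();
        try {
            // ... record that a load request is in flight ...
        } finally {
            classTableLock.unlock();
        }
        return classTableLock.isHeldByCurrentThread(); // wait point: lock free
    }
}
```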
2. VM and Application Object Separation
Another challenge involved separating VM objects from application objects. The system needed to prevent VM objects from becoming shared objects, as this would lead to performance problems. This was particularly relevant because several static methods in the Java class library are synchronized, potentially causing expensive cache flushes even for internal VM operations. The difficulty in cleanly separating VM and application objects highlights a design limitation that requires further attention. The solution involved introducing exception cases to handle VM objects differently. Future work should aim to develop a more elegant and efficient solution for handling this separation, perhaps through more sophisticated class loading or object management techniques.
3. Future Work: Object Home Migration, Volatile Keywords, and Thread Synchronization
Future work focuses on several areas. Object home migration, which moves a shared object to the node that accesses it most frequently, is proposed as a significant performance optimization. The semantics of the volatile keyword, essential for thread synchronization in Java, still need to be fully integrated into the compiler's handling of object access. Additionally, the java.lang.Thread wrapper class needs adjustments for better delegation of method calls to the VM's internal representations. These items are identified as critical steps toward resolving the remaining thread synchronization issues.
4. Future Work: Enhanced Debugging Support and Class Loading
The limitations of the JikesRVM's debugging capabilities are noted: the lack of proper debugger support hampered development. Future integration of the Java Debug Wire Protocol (JDWP) and the Java Virtual Machine Tool Interface (JVMTI) is proposed to improve this situation. Regarding class loading, support for annotations and for custom class loaders is identified as a future improvement. The initial implementation omitted annotation support because the meta-objects generated for annotations were handled differently, and the case of an application using its own custom class loader was not fully addressed. In future work, the master node should allocate and manage custom class loaders, with worker nodes using proxies for redirection. These improvements to the development tools and class loading capabilities would make the system easier to develop and maintain.
V. Performance Evaluation
Performance evaluation was conducted on a two-node cluster, comprising a master node (Intel Core Duo 3GHz, 2GB RAM, Ubuntu 8.04) and a worker node (Intel Core Duo 2.5GHz, 3GB RAM, Ubuntu 8.04), interconnected via a 1Gbit Ethernet network. Benchmarks focused on object access times, highlighting the overhead of accessing shared objects compared to local objects. The impact of thread synchronization was also examined, demonstrating the cost of cache flushes in the distributed setting. The evaluation was performed using TCP sockets given the optimizations made within the system.
1. Test Environment and Methodology
Performance evaluation was carried out on a two-node cluster. The master node was an Intel Core Duo 3GHz workstation with 2GB of memory; the worker node was an Intel Core Duo 2.5GHz laptop with 3GB of memory. Both ran Ubuntu 8.04 with Linux kernel 2.6.24 and were connected via 1Gbit Ethernet on the same subnet. Because the TCP/IP communication path had been optimized (data-less acknowledgements are omitted, relying on TCP's own reliability), the benchmarks were run using TCP sockets. The setup reflects a typical small-scale cluster environment, allowing a focused evaluation of the DJVM's performance characteristics in a practical context.
2. Object Access Time Benchmarks
Benchmarks measured access times for 100, 1000, and 10000 objects. Access to node-local objects in the DJVM was slower than in the unmodified JikesRVM because of the software checks added by the compiler, and access to shared objects was slower still due to an additional invalid-state check. Faulting in a shared object was relatively expensive (around 2ms per object), but once cached, access times became comparable to those of shared objects located on their home nodes. These results illustrate the performance trade-offs of the design, particularly the cost of accessing shared versus local objects in a distributed environment.
3. Thread Synchronization Benchmarks and Cache Flush Issues
Experiments assessed the impact of thread synchronization. Synchronization times were shortest (under 20ms) when both threads and the synchronization object resided on the master node, and increased significantly when one or both threads ran on the worker node, because each acquire operation then required invalidating and updating the synchronization object over the network. The results show that a distributed cache flush in the DSM is far more expensive than its local counterpart, highlighting a key performance challenge in distributed environments.
4. Java Grande Benchmark Suite and Volatile Variables
The Java Grande benchmark suite, which synchronizes threads using volatile variables, was used for additional testing. Because the JMM semantics of volatile variables were not completely implemented in the Baseline compiler, the benchmark results must be interpreted with the caveat that a fully compliant implementation would need additional software checks to handle volatile fields correctly in the distributed environment. This points to a limitation of the current implementation and an area for further development.