Cisco IT Case Study Oracle On VM On UCS
How Cisco IT Virtualized Oracle RAC Databases on Unified Computing System with VMware
Nearly 300 nonproduction and production databases are virtualized, increasing resiliency while lowering costs.
EXECUTIVE SUMMARY
CHALLENGE
Increase business agility by accelerating provisioning of new Oracle and Oracle RAC databases
Increase resiliency
Lower infrastructure costs for Oracle databases
SOLUTION
Virtualized production and nonproduction Oracle databases on Cisco Unified Computing System with VMware
Automated backup processes
RESULTS
Accelerated provisioning from 12 weeks to 3 days, and as little as 3 hours
Lowered TCO by approximately 45-50 percent
Improved recovery time from several hours to a few minutes
NEXT STEPS
Virtualize 95 percent of Oracle databases
Challenge
The Cisco workforce accesses approximately 1300 Oracle databases for business processes ranging from order processing to customer care. "While Oracle database servers are a small fraction of the total server footprint at Cisco, they are among the largest and most critical, and they incur significant infrastructure and maintenance costs," says Malathi Pinnamaneni, Cisco IT architect. Cisco IT has been systematically virtualizing applications and had already virtualized the Oracle web and application tiers. The next step in the journey was to virtualize the Oracle databases. "Originally we focused on standalone, nonproduction databases," says Paul Wiltsey, Cisco IT engineer. "When backup solutions became more sophisticated and we gained more experience with virtualization, we were ready to virtualize our Oracle RAC databases."
Goals for virtualizing the Cisco IT Oracle environment included:
Accelerating provisioning: Requests for new Oracle databases on standalone servers often took 12 weeks or longer. To increase business agility, Cisco IT wanted the ability to fulfill urgent requests in a few days.
Increasing resiliency: Cisco uses Oracle databases for revenue-generating activities, including customer support. "Previously, if a server failed, restoring the database on another server generally took several hours, and people were concerned that virtualizing production databases would increase downtime," says Todd Glenn, Cisco IT service manager. "Today, database failover is actually faster in a virtualized environment, improving the user experience and reducing lost productivity."
Lowering costs: The fewer the servers, the lower the space, power, cooling, cabling, and management costs. When Cisco hosted its Oracle databases on bare-metal servers, utilization was only about 25 percent. Other sources of high costs included high availability requirements and the need for specialized servers requiring scarce management skills. These servers also required expensive, dedicated network interconnects for clustered database hosts.
2012 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.
Solution
After carefully evaluating the risks, Cisco IT management decided to virtualize Oracle applications and databases on VMware even though standard industry practice was to use physical servers. "Without the vision of senior management, virtualization of the Oracle environment would not have proceeded as quickly as it did," says Jag Kahlon, Cisco IT architect. "Their leadership accelerated benefits such as faster provisioning, lower TCO, and improved disaster recovery." Virtualizing the Oracle environment simplified infrastructure and processes. "Because VMs are relatively independent of hardware, we can dynamically adjust compute and bandwidth resources without physically visiting the data center," Wiltsey says.
Design Principles
A guiding principle for Cisco IT was to use the existing shared infrastructure, Cisco Unified Computing System (Cisco UCS) and Cisco Nexus switches, instead of building special clusters that would require separate support and disaster recovery processes. Cisco UCS service profiles further simplify the environment by enabling IT staff to accurately configure server identity with a few clicks. The other guiding principle was to avoid introducing new technology. Therefore, Cisco IT decided to continue using VMware datastores. The other option considered, raw device mapping (RDM), would increase throughput by mapping logical unit numbers (LUNs) directly to virtual machines. "We decided to not use RDM, because it would prevent us from freely moving data on the backend," says Nagarajan R, Cisco IT architect. By using VMware datastores instead, Cisco IT was able to treat the virtualized Oracle database environment as part of the same Cisco UCS cluster used for other enterprise applications. "We used CPU and memory reservations to control oversubscription," says Nagarajan. "The more reservation, the less contention in the environment."
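The relationship between reservations and contention can be illustrated with a small calculation. The following Python sketch is not Cisco IT's tooling; the `VM` class, the `memory_summary` function, and the example sizes (a 96 GB host running 16 GB virtual machines) are assumptions chosen only for illustration. It shows that as more of each VM's memory is reserved, less configured memory is left to compete for the host's physical RAM.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical description of a virtual machine's memory settings."""
    name: str
    ram_gb: int           # memory configured for the VM
    reserved_ram_gb: int  # memory guaranteed to the VM by reservation


def memory_summary(host_ram_gb: int, vms: list) -> dict:
    """Summarize memory oversubscription on one host (illustrative only)."""
    configured = sum(vm.ram_gb for vm in vms)
    reserved = sum(vm.reserved_ram_gb for vm in vms)
    return {
        # Above 1.0 means more memory is configured than physically exists.
        "oversubscription_ratio": configured / host_ram_gb,
        # Configured memory without a guarantee; this is what can contend.
        "unreserved_gb": configured - reserved,
        # Reservations can only be honored if they fit in physical RAM.
        "reservations_fit": reserved <= host_ram_gb,
    }


if __name__ == "__main__":
    # Assumed example: eight 16 GB VMs, each with half its memory reserved.
    vms = [VM(f"db{i}", ram_gb=16, reserved_ram_gb=8) for i in range(8)]
    for key, value in memory_summary(96, vms).items():
        print(key, value)
```

Raising `reserved_ram_gb` toward `ram_gb` drives `unreserved_gb` toward zero, at the cost of fitting fewer virtual machines on the host.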
Virtualization Policy
Cisco IT is virtualizing all standalone Oracle databases in all environments: development, stage/test, load/test performance, disaster recovery, and production. Oracle RAC databases are virtualized if the VM sizing and I/O characteristics in Table 1 can support the workload. If not, the databases are hosted on bare-metal servers in the same Cisco UCS. In nonproduction environments, multiple databases are consolidated onto a single virtual machine when possible. The methodology used to arrive at this table is described in the section entitled "Workload Testing Methodology and Results" later in this case study.
Table 1. Standard VM Profiles: Criteria for Determining Whether a Database Qualifies for Virtualization
Database | Number of Nodes | CPU Cores and RAM | Server Performance (TPS) | Bandwidth (Mbps) | IOPS
Oracle 11g | Standalone VM | 4 x 16 | | |
Oracle 11g | 2-node RAC VMs | 4 x 16 | | |
Oracle 11g | 4-node RAC VMs | 4 x 16 | | |
Oracle RAC 11g D-NFS | Standalone VM | 4 x 16 | | |
Oracle RAC 11g D-NFS | 2-node RAC VMs | 4 x 16 | | |
Oracle RAC 11g D-NFS | 4-node RAC VMs | 4 x 16 | | |
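The virtualization policy described above amounts to a simple placement rule, sketched here in Python. This is an illustration rather than Cisco IT's actual process: the `Workload` and `ProfileLimits` classes, the `placement` function, and every numeric limit in the usage example are hypothetical placeholders, not values taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical measured characteristics of one database workload."""
    is_rac: bool
    tps: float
    bandwidth_mbps: float
    iops: float


@dataclass
class ProfileLimits:
    """Hypothetical ceilings that a standard VM profile can support."""
    max_tps: float
    max_bandwidth_mbps: float
    max_iops: float


def placement(workload: Workload, limits: ProfileLimits) -> str:
    """Apply the policy: standalone databases are always virtualized;
    RAC databases are virtualized only if the VM profile supports them."""
    if not workload.is_rac:
        return "virtual machine"
    fits = (
        workload.tps <= limits.max_tps
        and workload.bandwidth_mbps <= limits.max_bandwidth_mbps
        and workload.iops <= limits.max_iops
    )
    return "virtual machine" if fits else "bare-metal blade"


if __name__ == "__main__":
    # Placeholder limits for illustration only.
    limits = ProfileLimits(max_tps=1000, max_bandwidth_mbps=500, max_iops=5000)
    print(placement(Workload(True, 800, 400, 4000), limits))
    print(placement(Workload(True, 1200, 400, 4000), limits))
```

Keeping the limits in a separate structure means the same rule can be reused as larger standard VM profiles become available.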
Figure 1 illustrates the deployment topology at Cisco, and Table 2 lists solution components.
Figure 1. Virtualized Oracle RAC Environment
Table 2. Oracle RAC Databases Are Hosted on Standard Cisco Unified Data Center Environment
Server: 40 Cisco B200 M2 servers with 12 cores and 96 GB RAM, connected to Cisco Nexus 7000 switches
Virtual Switch: Cisco Nexus 1000V with three interfaces: public, private, and monitoring (heartbeat)
VMware Environment: VMware vSphere versions 4 and 5; 20-node clusters; 10 TB NFS datastores; largest configuration: 8 cores with 64 GB RAM
Storage: NetApp FAS 6280; ONTAP version 8; 10 Gb Link Aggregation Control Protocol (LACP) port channel to network
Cisco IT configures Cisco UCS blade servers to support both the Oracle database and the underlying Oracle Grid Infrastructure, which is the Oracle software that provides volume management, file system, and automatic restart capabilities. The standard VM configuration includes at least 2.5 GB of RAM for Oracle Grid Infrastructure plus swap space, which depends on available RAM (Table 3). Cisco IT also uses resource reservations to guarantee resources to specific service tiers.
Table 3. Swap Space Calculation
Available RAM | Swap Space
2-8 GB |
8-32 GB |
32+ GB |
An Oracle RAC cluster can contain a combination of virtual and physical servers. Cisco IT currently does not use a combination, but might in the future. "If a cluster provides sufficient capacity for all periods other than year-end or month-end, we could quickly add virtual machines, or temporarily add another Cisco UCS blade server," says Nagarajan.
Each Oracle RAC cluster contains from two to eight nodes. If one node is down, Cisco users automatically connect to one of the other nodes without having to take any action. Each Cisco UCS has multiple blade servers. If one server develops issues, Cisco IT can quickly move its virtual machines to any other available blade server, with a few clicks.
Cisco IT enabled the Oracle Data Guard Fast-Start Failover feature. If all the nodes in an Oracle RAC cluster fail, services automatically fail over to another location in the metro virtual data center (MVDC). In the event of a regional disaster that takes down production, Cisco IT can use VMware Site Recovery Manager (SRM) in conjunction with Oracle Data Guard for push-button failover to the disaster recovery data center in Research Triangle Park, North Carolina.
Workload Testing Methodology and Results
Normal CPU utilization over 90 percent and run queues with more than 30 processes
High disk I/O and network I/O stress, but not high enough to interfere with CPU
I/O wait not exceeding 30-40 percent
Recovery from destructive failure tests
Note that the following results are specific to Cisco IT's environment, and other organizations may experience different results.
Test Scenario 1: CPU Utilization and TPS
Cisco IT executed 300 sessions of the Swingbench Calling Circle benchmark tool for 30 minutes, comparing TPS results using NFS and D-NFS. TPS performance for NFS and D-NFS was similar, and scaled linearly with the
Test Scenario 2: Maximum I/O Throughput
To test I/O utilization, Cisco IT ran 10 sessions of the internal storage tool together with random select statements from 16 users for 30 minutes. I/O throughput was high for NFS and even higher for D-NFS (Figure 3).
Figure 3. I/O Results for NFS (Left) and D-NFS (Right)
Test Scenario 3: Combined CPU and I/O Utilization Testing
Finally, to simulate real-world workload patterns, Cisco IT executed both tests in parallel for 30 minutes. Again, D-NFS delivered higher transaction processing and bandwidth performance than NFS (Figure 4).
Figure 4. D-NFS Performed Better than NFS in Cisco Environment
Results
"When we used physical servers, provisioning a new database was time-consuming, complex, and expensive," says Adwait Samant, a senior manager with Cisco IT who oversees virtualization for Oracle databases. "Virtualizing our Oracle databases enabled Cisco IT to quickly create test and development environments with the desired service-level agreements, helping us respond quickly to changing business needs."
Increased Resiliency
"Virtualizing the Oracle environment on the Cisco UCS means that failure of a single node no longer disrupts critical business processes that depend on access to Oracle databases," says Anil Nileshwar, director of global infrastructure services for Cisco IT. If a server fails, Cisco IT can use Cisco UCS Manager service profiles to provision any other available Cisco UCS blade server in any chassis in a few minutes. During the proof of concept, Cisco IT measured the time to recover from various destructive events with vMotion. The tests simulated host failure, loss of votedisk on a node, deletion of a voting disk, and reboot of one node with a full load. "Test results for all events met expectations for Oracle RAC databases," says Rajesh.
No Compromise to Performance
The Oracle D-NFS client introduced with Oracle 11g provides better scalability and performance than traditional NFS v3 clients. "For most workloads, we are experiencing near-native performance, similar to the performance we would expect from a physical server," Pinnamaneni says.
Next Steps
Cisco IT is taking advantage of continual improvements in the Cisco Unified Data Center solution to virtualize ever-larger databases. "Larger virtual machines become feasible as the Cisco UCS and Nexus 1000V continue to scale," Wiltsey says. "The next standard will be 20 virtual cores and up to 256 GB RAM." Cisco IT expects to virtualize 95 percent of its Oracle databases, including business-critical databases, by the end of the 2013 calendar year. "This is a critical milestone in Cisco IT's platform-as-a-service strategy," Pinnamaneni says.
Lessons Learned
Cisco IT offers the following suggestions for organizations planning to virtualize their Oracle database environment.
Disable the hang-check timer for standalone instances. The hang-check timer is required only for Oracle RAC instances, to monitor the Linux kernel for extended operating system hangs that could affect the reliability of the RAC node and corrupt the database. Before Cisco IT disabled the hang-check timer, the VMware ESX host kept rebooting because the vMotion load-balancing process introduced just enough temporary latency to trigger the timer.
Be prepared to change the Oracle System Global Area (SGA) size. "On bare-metal servers, all RAM is available for a single host," says Rajesh. "But when we migrated to virtual machines, high SGA caused host hangs." Cisco IT overcame the issue by resizing the SGA to match the VM reservation. In the Cisco IT environment, the optimal settings were shmmax at 50 percent of RAM and shmall at 75 percent of RAM.
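The shared memory percentages in the last suggestion can be turned into concrete kernel values with a short calculation. The Python sketch below is an illustration under stated assumptions, not Cisco IT's procedure: it assumes "RAM" means the memory assigned to the virtual machine and a 4096-byte memory page, and the function name `shared_memory_settings` is hypothetical. On Linux, `kernel.shmmax` is expressed in bytes and `kernel.shmall` in pages, so the two values use different units.

```python
GIB = 1024 ** 3


def shared_memory_settings(ram_bytes: int,
                           page_size: int = 4096,
                           shmmax_fraction: float = 0.50,
                           shmall_fraction: float = 0.75) -> dict:
    """Derive Linux shared memory limits from the RAM assigned to a VM.

    The default fractions are the ones reported for the Cisco IT
    environment; the page size is an assumption (typical for x86-64 Linux).
    """
    # kernel.shmmax: largest single shared memory segment, in bytes.
    shmmax = int(ram_bytes * shmmax_fraction)
    # kernel.shmall: total shared memory allowed system-wide, in pages.
    shmall = int(ram_bytes * shmall_fraction) // page_size
    return {"kernel.shmmax": shmmax, "kernel.shmall": shmall}


if __name__ == "__main__":
    # Assumed example: a virtual machine with 16 GiB of RAM.
    for name, value in shared_memory_settings(16 * GIB).items():
        print(f"{name} = {value}")
```

The resulting values would typically be reviewed against the VM memory reservation before being applied, since other environments may need different fractions.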
Note
This publication describes how Cisco has benefited from the deployment of its own products. Many factors may have contributed to the results and benefits described; Cisco does not guarantee comparable results elsewhere. CISCO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties, therefore this disclaimer may not apply to you.