Architectural Design Challenges + Elasticity
In this section, we will identify six open challenges in cloud architecture development. Some of
these topics have been observed to act as both obstacles and opportunities. Plausible solutions to meet
these challenges are briefly discussed.
The management of a cloud service by a single company is often the source of single points of
failure. To achieve HA, one can consider using multiple cloud providers. Even if a company has
multiple data centers located in different geographic regions, it may have common software
infrastructure and accounting systems. Therefore, using multiple cloud providers may provide
more protection from failures. Another availability obstacle is distributed denial of service
(DDoS) attacks. Criminals threaten to cut off the incomes of SaaS providers by making their
services unavailable. Some utility computing services offer SaaS providers the opportunity to
defend against DDoS attacks by using quick scale-ups. Software stacks have improved
interoperability among different cloud platforms, but the APIs themselves are still proprietary. Thus,
customers cannot easily extract their data and programs from one site to run on another. The
obvious solution is to standardize the APIs so that a SaaS developer can deploy services and data
across multiple cloud providers. This would protect against the loss of all data due to the failure of a
single company. In addition to mitigating data lock-in concerns, standardization of APIs enables a new
usage model in which the same software infrastructure can be used in both public and private
clouds. Such an option could enable “surge computing,” in which the public cloud is used to
capture the extra tasks that cannot be easily run in the data center of a private cloud.
Current cloud offerings are essentially public (rather than private) networks, exposing the system
to more attacks. Many obstacles can be overcome immediately with well-understood
technologies such as encrypted storage, virtual LANs, and network middleboxes (e.g., firewalls,
packet filters). For example, you could encrypt your data before placing it in a cloud. Many
nations have laws requiring SaaS providers to keep customer data and copyrighted material
within national boundaries. Traditional network attacks include buffer overflows, DoS attacks,
spyware, malware, rootkits, Trojan horses, and worms. In a cloud environment, newer attacks
may result from hypervisor malware, guest hopping and hijacking, or VM rootkits. Another type
of attack is the man-in-the-middle attack for VM migrations. In general, passive attacks steal
sensitive data or passwords. Active attacks may manipulate kernel data structures, which can cause
major damage to cloud servers.
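To illustrate the client-side encryption mentioned above, the following minimal Python sketch encrypts data locally before uploading it, so the provider only ever stores ciphertext. It assumes the third-party cryptography package is installed; upload_to_cloud is a hypothetical stand-in for any provider's storage API.

```python
# Minimal sketch: encrypt locally, store only ciphertext in the cloud.
# upload_to_cloud() is a hypothetical placeholder, not a real provider API.
from cryptography.fernet import Fernet

def upload_to_cloud(name: str, blob: bytes) -> None:
    """Placeholder for an actual object-storage upload call."""
    print(f"uploading {len(blob)} encrypted bytes as {name!r}")

key = Fernet.generate_key()          # keep this key outside the cloud
cipher = Fernet(key)

plaintext = b"customer records that must not leave our control"
ciphertext = cipher.encrypt(plaintext)
upload_to_cloud("records.bin", ciphertext)

# Later, after downloading the object again:
assert cipher.decrypt(ciphertext) == plaintext
```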
Multiple VMs can share CPUs and main memory in cloud computing, but I/O sharing is
problematic. For example, to run 75 EC2 instances with the STREAM benchmark requires a
mean bandwidth of 1,355 MB/second. However, for each of the 75 EC2 instances to write 1 GB
files to the local disk requires a mean disk write bandwidth of only 55 MB/second. This
demonstrates the problem of I/O interference between VMs. One solution is to improve I/O
architectures and operating systems to efficiently virtualize interrupts and I/O channels. Internet
applications continue to become more data-intensive. If we assume applications to be “pulled
apart” across the boundaries of clouds, this may complicate data placement and transport. Cloud
users and providers have to think about the implications of placement and traffic at every level of
the system, if they want to minimize costs. This kind of reasoning can be seen in Amazon’s
development of its new CloudFront service. Therefore, data transfer bottlenecks must be
removed, bottleneck links must be widened, and weak servers should be removed.
The database is always growing in cloud applications. The opportunity is to create a storage
system that will not only meet this growth, but also combine it with the cloud advantage of
scaling arbitrarily up and down on demand. This demands the design of efficient distributed
SANs. Data centers must meet programmers’ expectations in terms of scalability, data durability,
and HA. Data consistency checking in SAN-connected data centers is a major challenge in cloud
computing. Large-scale distributed bugs cannot be reproduced at small scale, so debugging must occur
at scale in the production data centers, a convenience no data center readily provides. One
solution may be a reliance on using VMs in cloud computing. The level of virtualization may
make it possible to capture valuable information in ways that are impossible without using VMs.
Debugging over simulators is another approach to attacking the problem, if the simulator is well
designed.
The pay-as-you-go model applies to storage and network bandwidth; both are counted in terms
of the number of bytes used. Charging for computation differs with the level of virtualization. GAE
automatically scales in response to load increases and decreases; users are charged by the cycles
used. AWS charges by the hour for the number of VM instances used, even if the machine is
idle. The opportunity here is to scale quickly up and down in response to load variation, in order
to save money, but without violating SLAs. Open Virtualization Format (OVF) describes an
open, secure, portable, efficient, and extensible format for the packaging and distribution of
VMs. It also defines a format for distributing software to be deployed in VMs. This VM format
does not rely on the use of a specific host platform, virtualization platform, or guest operating
system. The approach is to address virtual platform-agnostic packaging, with certification and
integrity of the packaged software. The package also allows a virtual appliance to span more than one
VM.
OVF also defines a transport mechanism for VM templates, and can apply to different
virtualization platforms with different levels of virtualization. In terms of cloud standardization,
we suggest the ability for virtual appliances to run on any virtual platform. We also need to
enable VMs to run on heterogeneous hardware platform hypervisors. This requires hypervisor-
agnostic VMs. We also need to realize cross-platform live migration between x86 Intel and
AMD technologies, and to support legacy hardware for load balancing. All these issues are wide
open for further research.
Many cloud computing providers originally relied on open source software because the licensing
model for commercial software is not ideal for utility computing. The primary opportunity is
either for open source to remain popular or simply for commercial software companies to change
their licensing structure to better fit cloud computing. One can consider using both pay-for-use
and bulk-use licensing schemes to widen the business coverage. One customer’s bad behavior
can affect the reputation of the entire cloud. For instance, blacklisting of EC2 IP addresses by
spam-prevention services may limit smooth VM installation. An opportunity would be to create
reputation-guarding services similar to the “trusted e-mail” services currently offered (for a fee)
to services hosted on smaller ISPs. Another legal issue concerns the transfer of legal liability.
Cloud providers want legal liability to remain with the customer, and vice versa. This problem
must be solved at the SLA level. We will study reputation systems for protecting data centers in
the next section.
Resources in the cloud need not only be provisioned rapidly but also accessed and managed
universally, using standard Internet protocols, typically via RESTful web services. This enables
users to access their cloud resources from any type of device, provided they have an
Internet connection. Universal access is a key feature behind the cloud’s widespread adoption,
not only by professional actors but also by the general public that is nowadays familiar with
cloud-based solutions such as cloud storage or media streaming.
Cloud computing enables users to enhance the reliability of their applications. Reliability is
already built into many cloud solutions via storage redundancy. Cloud providers usually have more
than one data center and further reliability can be achieved by backing data up in different
locations. This can also be used to ensure service availability, in the case of routine maintenance
operations or the rarer case of a natural disaster. The user can achieve further reliability using the
services of different cloud providers.
Cloud computing generally refers to paid services. Customers are entitled to a certain quality
of service, guaranteed by the Service Level Agreement (SLA), which they should be able to monitor.
Therefore, cloud providers offer monitoring tools, either through a graphical interface or via an
API. These tools also help the providers themselves for billing and management purposes.
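As a small illustration of such API-based monitoring, the sketch below polls a resource-usage endpoint over HTTPS. The URL, token, and response fields are hypothetical; real providers expose comparable but provider-specific interfaces.

```python
# Hypothetical sketch of polling a provider's monitoring API over REST.
# The URL, token, and JSON fields are illustrative, not a real provider's API.
import requests

API = "https://cloud.example.com/v1/monitoring"
HEADERS = {"Authorization": "Bearer <api-token>"}  # placeholder credential

def cpu_utilization(instance_id: str) -> float:
    """Return the latest CPU utilization (percent) reported for an instance."""
    resp = requests.get(f"{API}/instances/{instance_id}/metrics",
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()["cpu_percent"]

print(cpu_utilization("vm-42"))
```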
2.2.5 Multitenancy
As with the grid before it, the cloud's resources are shared by different simultaneous users. Grid
users, however, had to reserve in advance a fixed number of physical machines for a fixed amount of time.
Thanks to the cloud’s virtualized data centers, a user’s provisioned resources no longer
correspond to the physical infrastructure and can be dispatched over multiple physical machines.
They can also run alongside other users' provisioned resources, thus requiring a smaller amount
of physical resources. Consequently, significant energy savings can be made by shutting down
unused resources or putting them into energy-saving mode.
ELASTICITY
4.1 Beyond Static Scalability
Since the emergence of parallel and distributed systems, a great effort has been put into making
applications benefit efficiently from multiple computing resources. Scalability characterizes this
ability. It is measured by speedup, i.e., the performance gain due to additional resources, and by
efficiency, which shows to what extent the resources are usefully employed and is given by the ratio
of the speedup to the amount of resources used. These metrics can be used to scale a given application,
that is, to provision its resources, based on the expected workload.
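Stated compactly, writing $T(n)$ for the execution time of the application on $n$ resources (the notation is ours, introduced only for this restatement):

\[
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n},
\]

where $S(n)$ is the speedup and $E(n)$ the efficiency; an efficiency close to 1 means the $n$ resources are almost fully exploited.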
As workloads tend to vary significantly over time, scaling needs to be adapted
automatically, in order to use just enough resources. Thanks to autonomic computing techniques,
elasticity does just that. “Elasticity is the degree to which a system is able to adapt to workload
changes by provisioning and deprovisioning resources in an autonomic manner, such that at each
point in time the available resources match the current demand as closely as possible.” From this
definition, we can see that efficiency, i.e., precision in matching the demand, is still
a top concern, as it avoids any waste of resources. Elasticity also introduces an important new
factor: speed. Rapid provisioning
and deprovisioning are key to maintaining an acceptable performance and are all the more
important in the context of cloud computing where quality of service is subjected to a service
level agreement.
In this chapter, we present the different approaches through which elasticity can be
achieved as well as a variety of existing elastic solutions, particularly those related to our
contribution.
4.2 Classification
Elasticity solutions can be arranged in different classes with regard to their scope, policy,
purpose, and method. Figure 4.1 presents this classification, and the remainder of this section is
dedicated to discussing it.
4.2.1 By Scope
With regard to scope, elasticity can be implemented on any of the cloud layers. Most
commonly, elasticity is achieved on the IaaS level, where the resources to be provisioned are
virtual machine instances. Other infrastructure services, such as networks, can also be scaled. On
the PaaS level, elasticity consists in scaling containers or databases for instance. Finally, both
PaaS and IaaS elasticity can be used to implement elastic applications, be it for private use or in
order to be provided as a SaaS. Most cloud providers offer elasticity mechanisms as part of their
services, although these mechanisms alone tend to be generic and fail to provide an efficient
framework for applications other than web servers.
4.2.2 By Policy
With regard to policy, elastic solutions can be either manual or automatic. A manual elastic
solution provides its users with tools to monitor their systems and to add or remove
resources, but leaves the scaling decision to them. We believe this cannot qualify as elasticity,
since the latter has to be carried out automatically. Hence, automatic elastic solutions are either reactive or
predictive. An elastic solution is reactive when it scales a posteriori, based on a monitored
change in the system. These are generally implemented by a set of Event-Condition-Action rules.
A predictive, or proactive, elasticity solution uses its knowledge of either recent history or load
patterns inferred from longer periods of time in order to predict the upcoming load of the system
and scale according to it.
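The reactive, rule-based behaviour described above can be summarized by the following sketch of a threshold-driven control loop. The current_load, add_instance, and remove_instance functions are hypothetical hooks into a monitoring system and a provisioning API, and the thresholds are arbitrary illustrative values.

```python
# Sketch of a reactive Event-Condition-Action loop (illustrative only).
# current_load(), add_instance() and remove_instance() are hypothetical hooks
# into a monitoring system and a provisioning API.
import time

UPPER, LOWER = 0.80, 0.30            # illustrative utilization thresholds
MIN_INSTANCES, MAX_INSTANCES = 1, 20

def autoscale(current_load, add_instance, remove_instance, instances=1):
    while True:
        load = current_load()                             # Event: new sample
        if load > UPPER and instances < MAX_INSTANCES:    # Condition
            add_instance()                                # Action: scale out
            instances += 1
        elif load < LOWER and instances > MIN_INSTANCES:  # Condition
            remove_instance()                             # Action: scale in
            instances -= 1
        time.sleep(60)                                    # evaluation period
```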
4.2.3 By Purpose
An elastic solution can have many purposes. The first one to come to mind is naturally
performance, in which case the focus should be on the speed of the scaling actions. Another purpose for elasticity
is energy efficiency, where using the minimum amount of resources is the dominating
factor. Other solutions intend to reduce the cost by multiplexing either resource providers or
elasticity methods. Mixed strategies that take into account different purposes and try to optimize
a corresponding utility function have also been put in place.
4.2.4 By Method
Finally, with regard to method, elasticity can be carried out either horizontally or vertically.
Horizontal elasticity consists in replication, i.e., the addition (or removal) of virtual machine
instances. In this case, a load balancer with an appropriate load balancing strategy needs to be
involved. This is the most commonly used method as on-demand provisioning is supported by all
cloud providers. Vertical elasticity, on the other hand, changes the amount of resources linked to
existing instances on the fly. This can be done in two ways. The first consists in
explicitly resizing a virtual machine instance, i.e., changing the quota of physical
resources allocated to it. This is however poorly supported by common operating systems as they
fail to take into account changes in CPU or memory without rebooting, thus resulting in service
interruption. The second vertical scaling method involves VM migration: moving a virtual
machine instance to another physical machine with a different overall load changes its available
resources de facto. Once again, migrating a virtual machine on the fly, a.k.a. live migration, is
technically limited, as it can only be achieved in environments with network-shared disks.
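To make the distinction concrete, the sketch below contrasts the three mechanisms through a deliberately hypothetical cloud client object; no real provider SDK is implied.

```python
# Illustrative contrast of elasticity methods; "cloud" is a hypothetical
# client object, not a real provider SDK.

def scale_horizontally(cloud, group):
    # Replication: add a VM instance and register it with the load balancer.
    vm = cloud.start_instance(image=group.image, size=group.size)
    group.load_balancer.register(vm)

def scale_vertically_resize(cloud, vm):
    # Resizing: raise the CPU/memory quota of an existing instance
    # (often requires a reboot on common guest operating systems).
    cloud.resize_instance(vm, cpus=vm.cpus * 2, memory_gb=vm.memory_gb * 2)

def scale_vertically_migrate(cloud, vm):
    # Migration: move the VM to a less loaded host; its available resources
    # change de facto. Live migration typically needs shared network storage.
    target = cloud.least_loaded_host()
    cloud.live_migrate(vm, target)
```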
While all cloud infrastructures offer APIs that allow both monitoring and on-demand
provisioning of resources, most do not offer an automated elasticity service. Amazon, however,
has integrated an elasticity mechanism into its Elastic Compute Cloud (EC2). An EC2 user can
set up an Auto Scaling Group (ASG) containing an initial number of instances and then define
Event-Condition-Action (ECA) rules for scaling, i.e., adding or removing instances from the
ASG, based on the system metrics provided by Amazon CloudWatch, Amazon's monitoring
tool. Amazon also allows the user to specify elastic load balancers that forward requests to any
of the ASG's instances. This makes Amazon Auto Scaling a reactive, replication-based elasticity
solution. Reservoir, an open-source IaaS manager, allows the specification of these same ECA
elasticity rules as an extension of the Open Virtualization Format (OVF) standard. As for clouds with no
elasticity capabilities, cloud management services such as RightScale and Scalr offer the same
reactive replication-based mechanisms and present the advantage of being able to stretch over
different cloud providers.
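A minimal sketch of this reactive setup using the boto3 SDK is shown below, assuming valid AWS credentials and an existing launch template; the group name, template ID, and thresholds are placeholders, and the snippet is not a complete production configuration.

```python
# Minimal sketch (placeholders, not a tested deployment script) of an ASG
# plus a CloudWatch-driven scaling rule, using boto3.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# 1. Create the group with an initial capacity range.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-asg",
    MinSize=2,
    MaxSize=10,
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0"},  # placeholder
    AvailabilityZones=["us-east-1a"],
)

# 2. The "Action": add one instance when the policy fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="scale-out-by-one",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
)

# 3. The "Event-Condition": average CPU above 70% for five minutes.
cloudwatch.put_metric_alarm(
    AlarmName="my-asg-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
    AlarmActions=[policy["PolicyARN"]],
)
```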
The use of a single metric goal can lead to oscillation, i.e., repetitive addition and
removal of resources, which can be overcome by the use of two thresholds in order to specify a
range goal. However, the relative capacity gain from adding a second instance, roughly 100%, is not
comparable to that of adding the 101st, which is only about 1%. This has to be taken into
account when scaling, hence the need for goals, i.e., thresholds, that adapt to the current size
and performance of the cluster to be scaled.
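The diminishing-return argument can be made numerically: going from n to n+1 instances multiplies capacity by (n+1)/n, a relative gain of 1/n. The short sketch below uses this to derive a size-dependent scale-out threshold; the baseline value and formula are purely illustrative.

```python
# Illustrative only: the relative capacity gain of one extra instance is ~1/n,
# so the scale-out threshold can be tightened as the cluster grows.
def relative_gain(n: int) -> float:
    return 1.0 / n                      # 2nd instance: 100%, 101st: ~1%

def scale_out_threshold(n: int, base: float = 0.70) -> float:
    # Hypothetical rule: demand a higher utilization before paying for an
    # instance whose marginal contribution is small.
    return min(0.95, base + (1 - relative_gain(n)) * 0.2)

for n in (1, 10, 100):
    print(n, f"gain={relative_gain(n):.0%}", f"threshold={scale_out_threshold(n):.2f}")
```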
PRESS is an elasticity controller that analyzes the workload history using the Fast Fourier
Transform to detect any repeating pattern that would allow future workload prediction. If no
pattern can be found, PRESS uses a discrete-time Markov chain to predict the near future
workload. In order to avoid resource underestimation, which may lead to under-provisioning,
PRESS pads the predicted values by a small amount (5-10%). Roy et al. [33] use another
statistical model, autoregressive-moving-average, in order to predict future load based on
previously witnessed loads.
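In the spirit of PRESS (a simplified sketch, not the published algorithm), one can use NumPy's FFT to look for a dominant period in the workload history and, if one is found, replay the last period padded by a safety margin:

```python
# Simplified sketch in the spirit of PRESS: detect a dominant period with the
# FFT and predict the next window by replaying it with a safety padding.
import numpy as np

def predict_next(history, padding=0.10):
    history = np.asarray(history, dtype=float)
    spectrum = np.abs(np.fft.rfft(history - history.mean()))
    k = spectrum[1:].argmax() + 1            # dominant non-zero frequency
    period = len(history) // k               # corresponding period length
    last_cycle = history[-period:]           # replay the last observed cycle
    return last_cycle * (1 + padding)        # pad to avoid under-provisioning

# Example: a noisy daily-like pattern sampled 24 times per "day" for a week.
t = np.arange(24 * 7)
load = 100 + 40 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 5, t.size)
print(predict_next(load)[:5])
```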
DejaVu is an elasticity framework based on a predictive policy. The idea behind DejaVu
is to organize the different workloads that have been encountered into discrete classes using k-
means clustering, remember the preferred allocation setup for each class, i.e., the different virtual
machine instances used along with their sizes, and assign each class a signature. Thus, when the
system runs, each time it detects a workload change, it computes the workload's signature, looks it
up in its cache, and fetches the preferred allocation, which it applies at once, thus accelerating the
allocation of the needed resources.
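A toy version of this idea (again a sketch, not DejaVu's actual implementation) can be written with scikit-learn's k-means: cluster past workload feature vectors, remember a preferred allocation per cluster, and map any new workload to the allocation of its nearest cluster. The feature vectors and allocations below are made up for illustration.

```python
# Toy sketch of DejaVu-style workload classification (not the real system).
# Requires NumPy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical workload features, e.g. (request rate, average request size).
history = np.array([[100, 2], [120, 2], [900, 8], [950, 7], [400, 4], [420, 5]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(history)

# Preferred allocation remembered for each class (purely illustrative values).
preferred = {0: "2 small VMs", 1: "8 large VMs", 2: "4 medium VMs"}

def allocation_for(workload):
    signature = int(kmeans.predict(np.array([workload]))[0])
    return preferred[signature]

print(allocation_for([110, 2]))   # reuses the cached setup of the closest class
```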
Tide is a tool that offers elasticity for IaaS management itself. The main idea behind
Tide is to use the cloud's resources themselves to handle the cloud's management workload.
Its authors implemented a custom predictive model that focuses on provisioning the needed resources
all at once and as soon as possible.
While the previous solutions mostly focus on performance, Kingfisher is a cost-driven
elasticity tool. Kingfisher takes into account the costs of different virtual machine
instance configurations and tries to optimize the overall cost for a given workload. It also takes
into account the cost of the transition from one VM instance configuration to another upon
a workload change.
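The cost-driven idea can be illustrated by a small exhaustive search over candidate configurations (a sketch under assumed prices and capacities, not Kingfisher's optimizer): pick the cheapest configuration that sustains the expected workload, while charging a penalty for changing the current configuration.

```python
# Sketch of cost-driven configuration selection (illustrative prices and
# capacities, not Kingfisher's actual optimization model).
from itertools import product

# Hypothetical catalog: instance type -> (hourly price, requests/s capacity).
CATALOG = {"small": (0.05, 100), "medium": (0.10, 220), "large": (0.20, 500)}

def cheapest_config(target_rps, current=None, transition_cost=0.02, max_count=10):
    best, best_cost = None, float("inf")
    for vm_type, count in product(CATALOG, range(1, max_count + 1)):
        price, capacity = CATALOG[vm_type]
        if count * capacity < target_rps:
            continue                                # cannot sustain the load
        cost = count * price
        if current is not None and current != (vm_type, count):
            cost += transition_cost                 # penalize reconfiguration
        if cost < best_cost:
            best, best_cost = (vm_type, count), cost
    return best, best_cost

print(cheapest_config(800, current=("small", 8)))
```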
The above works propose different solutions to the elasticity problem. However, as they
are meant to be generic, they fail to fit the specific needs of particular applications.
4.4 Conclusion
In this chapter, we have presented the different approaches used to achieve elasticity in the
literature. We have also discussed a representative range of elastic solutions, with focus on those
related to our contributions. As we have seen, generic elasticity frameworks do not necessarily fit
specialized applications. In the next chapters, we detail our contribution, which consists in the
development of different application-specific elastic solutions based on open-source tools from
various backgrounds.