AWS Certified Developer–Associate (DVA-C01) Cert Guide
MARKO SLUGA
Pearson IT Certification
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any information storage and retrieval
system, without written permission from the publisher, except for the inclusion of brief quotations in a
review.
Library of Congress Control Number: 2019956930
ISBN-13: 978-0-13-585329-0
ISBN-10: 0-13-585329-X
Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been appropri-
ately capitalized. Pearson IT Certification cannot attest to the accuracy of this information. Use of a term
in this book should not be regarded as affecting the validity of any trademark or service mark.
Figure Credits
Cover: Mirtmirt/Shutterstock
Figure 1-1, 1-2: National Institute of Standards and Technology special publication © NIST
Figure 1-9: AWS Global infrastructure © 2020 Amazon Web Services, Inc
Figure 1-10: Amazon S3 + Amazon CloudFront: A Match Made in the Cloud, 27 JUN 2018,
© 2020 Amazon Web Services, Inc
Figure 1-11, 1-12, 1-13: Screenshot of AWS © 2020 Amazon Web Services, Inc
Figure 1-14 through 1-19: Screenshot of AWS Management © 2020 Amazon Web Services, Inc
Figure 1-20, 1-21: Screenshot of AWS CLI © 2020 Amazon Web Services, Inc
Figure 2-7: How to Use SAML to Automatically Direct Federated Users to a Specific AWS Management
Console Page by Alessandro Martini © Amazon Web Services, Inc
Figure 2-8: AWS IAM Now Supports Amazon, Facebook, and Google Identity Federation by Jeff Barr
© Amazon Web Services, Inc
Figure 3-1: Difference between Public IP and Private IP address © 2020, Difference Between
Figure 3-3: Latency numbers © 2020 GitHub, Inc, https://gist.github.com/2841832
Figure 3-5: Multiple Data Center HA Network Connectivity © 2019, Amazon Web Services, Inc
Figure 3-6: Unsupported VPC Peering Configurations © 2020, Amazon Web Services, Inc
Figure 3-7: Amazon EBS Snapshots © 2020, Amazon Web Services, Inc
Figure 3-9: AWS Elastic Load Balancer © Aman Sardana
Figure 3-10: Scaling Cooldowns for Amazon EC2 Auto Scaling © 2020, Amazon Web Services, Inc
Figure 3-11: High Availability with Route53 DNS Failover © Randika Rathugamage
Figure 3-12 through 3-15: Screenshot of Amazon EC2 © Amazon Web Services, Inc
Figure 3-16: Screenshot of CLI command © Amazon Web Services, Inc
Figure 4-1: Screenshot of Amazon S3 © Amazon Web Services, Inc
Figure 4-2: Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python
© Amazon Web Services, Inc
Figure 4-5: Amazon RDS Multi-AZ Deployments and Read Replicas © 2006-2020 Percona LLC
Figure 4-6: Amazon Aurora DB Clusters © 2020, Amazon Web Services, Inc
Figure 4-9: Choosing the Right DynamoDB Partition Key © 2019, Amazon Web Services, Inc
Figure 4-10: How to use Amazon DynamoDB global tables to power multiregion architectures
© 2019, Amazon Web Services, Inc
Figure 4-11: Jeff Barr, 200 Amazon CloudFront Points of Presence © 2020, Amazon Web Services, Inc.
Figure 4-12: CloudFront Events That Can Trigger a Lambda Function ©2020, Amazon Web Services, Inc.
Figure 4-13: Using Field-Level Encryption to Help Protect Sensitive Data © 2020, Amazon Web Services,
Inc.
Figure 5-1 through 5-8: Screenshot of AWS Lambda © Amazon Web Services, Inc
Figure 5-10, 5-11: Basic Amazon SQS Architecture © 2020, Amazon Web Services, Inc
Figure 6-2: DevOps in 3 Sentences © DEV Community 2016 - 2020
Figure 6-3: Difference between agile, CI/CD, and DevOps © 2020 Synopsys, Inc
Figure 6-4 through 6-22: Screenshot of AWS Cloud9 © 2019, Amazon Web Services, Inc
Figure 6-23: Screenshot of CodePipeline © 2019, Amazon Web Services, Inc
Figure 7-1 through 7-7: Screenshot of Amazon RDS © 2019, Amazon Web Services, Inc
Figure 7-8 through 7-23: Screenshot of AWS DMS © 2019, Amazon Web Services, Inc
Figure 8-1 through 8-21: Screenshot of Amazon CloudWatch © 2019, Amazon Web Services, Inc
About the Technical Reviewer
Anthony joined Mastering Computers in 1996 and lectured to massive audiences around
the world about the latest in computer technologies. Mastering Computers became the
revolutionary online training company KnowledgeNet, and Anthony trained there for
many years.
Anthony is currently pursuing his second CCIE in the area of Cisco Data Center.
Anthony is a full-time instructor at CBT Nuggets.
Dedication
I would like to dedicate this book to my mother, Marta Sluga, who has always emphasized that learning is the most important aspect of success.
Acknowledgments
This manuscript was made truly great by the incredible technical review of Anthony
Sequeira.
I would also like to express my gratitude to Chris Cleveland, development editor of this
book. His dedication made this book several cuts above the rest.
Finally, thank you so much to Paul Carlstroem for giving me the benefit of the doubt
and being patient and understanding with all my ups and downs during the writing
process.
Contents at a Glance
Introduction xv
Index 308
Contents
Introduction xv
AWS SDKs 32
Accessing AWS Through APIs 33
Summary 34
Exam Preparation Tasks 34
Review All Key Topics 35
Define Key Terms 35
Q&A 35
Index 308
Introduction
I would like to welcome you and extend my gratitude for choosing this publication as
your guide on the journey to becoming an AWS Certified Developer. The main purpose
of the book is to guide you through the process of learning about Amazon Web Services
from the point of view of a developer. The book covers the topics that are listed as
required knowledge when preparing for the AWS Certified Developer–Associate exam.
This book also provides examples and code snippets to help you learn how to perform
the tasks being described in the book and also gives you the knowledge and tools
required to develop applications on the AWS cloud computing environment.
No exam should be taken lightly. Many experts who rate IT industry exams place the AWS exams at the top of the scale as far as difficulty is concerned. Some have gone so far as to claim that AWS sets the bar for the entire industry much higher. But don't worry: by reading through this book and following the examples,
you should gain valuable knowledge that you can put to use when you decide to take the
AWS Certified Developer–Associate exam.
But a book can only go so far, and throughout the book I stress that having hands-on
experience with AWS services, tools, and platforms is crucial to being prepared to pass
the exam. Think of the learning process as having two parts:
■■ Gaining theoretical knowledge and practicing (which is what this book is designed
to do)
■■ Getting real-world hands-on experience (which will be helpful as you use AWS on
a daily basis)
Each AWS certification exam conforms to an exam blueprint. You can use the blueprint
as a reference tool to get an overview of which areas of knowledge the exam is designed
to test. The AWS Certified Developer–Associate exam blueprint also states that taking
and passing the exam will prove:
■■ Understanding of core AWS services, uses, and basic AWS architecture best practices
■■ Ability to use the AWS service APIs, AWS CLI, and SDKs to write applications
■■ Ability to write code using AWS security best practices (such as not using secret and
access keys in code but instead using IAM roles)
As you see, the list of recommended AWS knowledge is quite extensive and mainly
covers real-world experience, which is invaluable in being able to develop on AWS. Of
course, this list of requirements is intended for your own assessment. AWS does not
require you to prove your experience and does allow you to take the exam even if you
do not possess all the knowledge recommendations. The basic rule is that the more
recommendations you meet, the more likely you are to pass the exam. This book is
designed to provide the theoretical part of the recommendations and to allow you to read
and study the concepts at your own pace. But I highly encourage you to gain the required
hands-on experience with all of the above before you attempt the exam.
Domain 1: Deployment
The Deployment domain, as the name indicates, focuses on testing the understanding
and knowledge of how to deploy applications on AWS. As a developer, you should have a
good understanding of how to deploy applications on AWS using the CLI, the SDK,
CI/CD pipelines, and AWS deployment processes and patterns.
The exam will test your ability to implement deployment and provisioning best
practices and gauge whether you are able to determine the right solution to use for
deploying an application. The exam also focuses on testing your understanding of tools
and approaches that allow developers to integrate the deployment of applications into
their code.
■■ The CLI and the SDKs: The exam will evaluate your understanding of what can be
achieved with the CLI and SDKs and how these tools are applicable when deploying
infrastructure services in AWS.
■■ Elastic Beanstalk: The exam will test your ability to understand the benefits and
advantages of Elastic Beanstalk as well as the limitations of the solution. You should
understand the capabilities of Elastic Beanstalk and the most common use cases and
should have a good understanding of the Elastic Beanstalk update process and the
processes of customizing a deployment.
■■ CodeDeploy and CodePipeline: The exam will ask questions focused on your understanding of the deployment stage of a typical CI/CD pipeline, as per the best practices outlined by AWS.
■■ AWS Lambda: The exam will evaluate whether you understand the AWS Lambda
deployment procedure and how it can integrate with other AWS services to provide
a supporting role during a deployment.
■■ Static websites on S3: In some cases, the exam will evaluate whether you understand
when a static website on S3 is the right type of deployment option for the outlined
case.
Domain 2: Security
Possibly the most important aspect of any application is security. The Security domain is
designed to make sure you understand how to design, develop, and deploy applications
on AWS with security in mind. Among the most important aspects tested on the exam is
the understanding of authentication and authorization, with a focus on calls to the AWS
infrastructure as well as the security of the application running on top of AWS.
The exam will test you on the following AWS topics:
■■ The CLI and the SDKs: The exam will evaluate your understanding of authentication
practices that should be observed when using the AWS CLI and the AWS SDKs.
■■ IAM: The exam will test your understanding of the practices associated with
managing users, groups, and roles; assigning policies; and granting access via the
least privilege approach.
■■ IAM federation: The exam will give special focus to evaluating your understanding
of how to federate authentication and authorization with external directories and
identity providers.
■■ Security groups and NACLs: The exam will focus on understanding how to secure an application over the network; thus, a basic understanding of the way security groups and NACLs operate in a VPC is required.
Domain 3: Development with AWS Services
■■ AWS CLI and the SDKs: The exam will test your ability to use the SDKs and the
CLI to interact with the AWS services and deliver application components straight
out of the code. Some focus is given to the ability to understand the command
structure and identify the correct command. The exam will also include questions
that test your general understanding of the capabilities of the CLI and SDKs.
■■ DevOps and Code* tools: The exam will focus on your ability to understand
the DevOps approach to development and identify the functionality of the
CodeCommit, CodeDeploy, CodeBuild, and CodePipeline tools.
Domain 4: Refactoring
Many enterprises are in the midst of a cloud adoption process, and therefore, the exam
will test your ability to understand which AWS services and features will best suit your
application and how to migrate existing applications and application code to AWS. The
exam will test you on the following AWS topics:
■■ AWS migration tools: The exam will test your basic understanding of what AWS
migration tools can be used to transfer (VPN, DirectConnect), transport (Snowball/
Snowmobile), or transform (AWS DMS) the data from on-premises systems to AWS.
■■ Managed AWS services: The exam will test your understanding of which managed
services can be used to refactor an application that is being implemented on or
migrated to AWS.
Domain 5: Monitoring and Troubleshooting
■■ CloudWatch: The exam will evaluate your ability to capture performance data and
logs to CloudWatch. Further, it will test your ability to use and analyze the captured
data to perform troubleshooting, scaling, and optimization on the application being
monitored. You should also have a clear understanding of the features and limitations
of CloudWatch, CloudWatch Logs, and CloudWatch Alarms.
■■ CloudTrail: The exam will test your ability to trace the actions in the environment
and provide an audit-compliant log of events and actions in the AWS account.
The registration fee for the exam is US$150. If you would like to try your knowledge before you take the actual exam, you can take an online practice exam consisting of 20 questions at any time. The registration fee for the practice exam is US$20.
Taking the practice exam is a good idea if you would like to get a feel for the exam with
sample questions that come out of the same pool of questions as the actual exam. The
practice exam can be a great tool to help you gauge your knowledge and determine
whether you are ready to pass the exam. However, passing the practice exam does not
guarantee that you will pass the real exam.
Exam Questions
The questions that test your knowledge on the exam carry different weights. Each question has a certain score assigned to it, and the scores of all the questions together add up to 1,000. AWS scores you as a percentage of 1,000 points rather than on the number of questions you answer correctly. Every question is always scored in full, which means that an incorrect or incomplete answer to a multiple-response question results in a score of 0 for that question. Make sure to take your time and carefully read each question, as some questions are lengthy and may hide crucial information that can help you determine the right answer.
Passing Score
The passing score for each exam is not fixed, and the score to shoot for is not released publicly; however, the typical passing score for this exam is 720 points. AWS uses statistical analysis of multiple metrics to determine the passing score, which means that you should be as prepared as possible to pass the exam. Because that can be difficult to gauge, I recommend setting a certain “confidence level” for yourself. This confidence
level can be determined by looking at the requirements and the content of this book and
taking practice exams like the ones provided in the Pearson Test Prep software for this
book. I like to set the confidence level of the content at 90%, meaning you should be able
to answer most of the questions you encounter on a certain topic.
Keep in mind that AWS sources a lot of the exam question content from the FAQs for
each service, so another way to prepare for the exam is to read the FAQs and try to
answer them yourself. If you can get to this level, then you should be able to pass the
exam. The idea behind this is that an AWS certified developer should be able to do any task outlined in the FAQs by heart, even though in practice a certified developer will still need to consult the documentation and contact AWS Support when needed.
ease. Make sure you have enough time to get to the exam location. I usually plan to be at the testing center about 30 minutes early, which helps me deal with any kind of delays on the way. Try to clear your calendar before you take the exam; you've been studying for quite a while, so don't try to cram the exam into an already packed day. Any additional stress might prevent you from relaxing when taking the exam, and being relaxed during the exam is very important.
Book Features
To help you customize your study time using this book, the core chapters have several
features that help you make the best use of your time:
■■ Foundation Topics: These are the core sections of each chapter. They explain the
concepts for the topics in that chapter.
■■ Exam Preparation Tasks: This section provides a series of study activities that you
should do at the end of each chapter:
■■ Review All Key Topics: The Key Topic icon appears next to the most important
items in the “Foundation Topics” section of the chapter. The “Review All Key
Topics” activity lists the key topics from the chapter, along with their page
numbers. Although the contents of the entire chapter could be on the exam, you
should definitely know the information listed in each key topic, so you should
review these.
■■ Define Key Terms: Although the AWS Certified Developer–Associate exam may
be unlikely to ask a question such as “Define this term,” the exam does require
that you learn and know a lot of AWS-related terminology. This section lists the
most important terms from the chapter and asks you to write a short definition
and compare your answer to the glossary at the end of the book.
■■ Q&A: Confirm that you understand the content just covered by answering these
questions and reading the answer explanations.
■■ Web-based practice exam: The companion website includes the Pearson Test Prep
application, which allows you to take practice exam questions. Use it to prepare with
a sample exam and to pinpoint topics where you need more study.
■■ Print book: Look in the cardboard sleeve in the back of the book for a piece of
paper with your book’s unique PTP code.
■■ Premium Edition: If you purchase the Premium Edition eBook and Practice Test
directly from the Pearson IT Certification website, the code will be populated on
your account page after purchase. Just log in at www.pearsonITcertification.com,
click account to see details of your account, and click the digital purchases tab.
■■ Amazon Kindle: For those who purchase a Kindle edition from Amazon, the access
code will be supplied directly from Amazon.
■■ Other bookseller e-books: Note that if you purchase an e-book version from any
other source, the practice test is not included because other vendors to date have not
chosen to vend the required unique access code.
NOTE Do not lose the activation code because it is the only means by which you can
access the online content for the book.
NOTE Amazon eBook (Kindle) customers: It is easy to miss Amazon’s email that lists
your PTP access code. Soon after you purchase the Kindle eBook, Amazon should send an
email. However, the email uses very generic text and makes no specific mention of PTP or
practice exams. To find your code, read every email from Amazon after you purchase the
book. Also do the usual checks for ensuring that your email arrives, such as checking your
spam folder. If you have trouble getting an access code from Amazon, contact Pearson’s
tech support at http://pearsonitp.echelp.org.
NOTE Other eBook customers: As of the time of publication, only the publisher and
Amazon supply PTP access codes when you purchase their eBook editions of this book.
■■ Study mode: Allows you to fully customize your exams and review answers as
you are taking the exam. This is typically the mode you use first to assess your
knowledge and identify information gaps.
■■ Flash Card mode: Strips out the answers and presents you with only the question
stem. This mode is great for late-stage preparation when you really want to challenge
yourself to provide answers without the benefit of seeing multiple-choice options.
This mode does not provide the detailed score reports that the other two modes
provide, so it is not the best mode for helping you identify knowledge gaps.
In addition to these three modes, you will be able to select the source of your questions.
You can choose to take exams that cover all of the chapters, or you can narrow your
selection to just a single chapter or the chapters that make up specific parts in the book.
All chapters are selected by default. If you want to narrow your focus to individual
chapters, simply deselect all the chapters and then select only those on which you wish
to focus in the Objectives area.
You can also select the exam banks on which to focus. Each exam bank comes complete
with a full exam of questions that cover topics in every chapter. The two exams included
online with the purchase of this book are available to you, as are two additional exams of
unique questions available with the Premium Edition. You can have the test engine serve
up exams from all four banks or just from one individual bank by selecting the desired
banks in the exam bank area.
You can make several other customizations to your exam from the exam settings screen,
such as the time allotted to take the exam, the number of questions served up, whether
to randomize questions and answers, whether to show the number of correct answers for
multiple-answer questions, and whether to serve up only specific types of questions. You
can also create custom test banks by selecting only questions that you have marked or
questions on which you have added notes.
Sometimes, due to a number of factors, the exam data may not fully download when
you activate your exam. If you find that figures or exhibits are missing, you may need to
manually update your exams. To update a particular exam you have already activated and
downloaded, simply select the Tools tab and click the Update Products button. Again,
this is only an issue with the desktop Windows application.
If you want to check for updates to the Windows desktop version of the Pearson Test
Prep exam engine software, simply select the Tools tab and click the Update Application
button. Doing so allows you to ensure that you are running the latest version of the
software engine.
CHAPTER 4
Storing Data in AWS
Relational Versus Nonrelational Databases: To prepare you for the next two chapters,
this section provides a short overview of the differences between relational and
nonrelational databases and data types suitable for each database type.
Handling Nonrelational Data in AWS: Some datasets are just not suitable for relational
databases. When sustained and predictable performance for relatively simple datasets
is required, you can use the DynamoDB service in AWS. This section examines the
characteristics of DynamoDB and shows how to use DynamoDB in your applications.
Caching Data in AWS: The last part of this chapter covers the different options for
caching data and accelerating the delivery of content from the storage systems covered in
this chapter.
This chapter covers content important to the following exam domains:
■■ 3.4 Write code that interacts with AWS services by using APIs, SDKs, and AWS CLI.
■■ Domain 4: Refactoring
Table 4-1 “Do I Know This Already?” Foundation Topics Section-to-Question Mapping
Foundation Topics Section Questions
Storing Static Data in AWS 1, 2, 5, 10, 11
Deploying Relational Databases in AWS 3, 6, 7
Handling Nonrelational Data in AWS 4, 8, 12
Caching Data in AWS 9, 13
CAUTION The goal of self-assessment is to gauge your mastery of the topics in this
chapter. If you do not know the answer to a question or are only partially sure of the answer,
you should mark that question as wrong for purposes of the self-assessment. Giving yourself
credit for an answer you correctly guess skews your self-assessment results and might
provide you with a false sense of security.
1. You are asked to provide an HTTP-addressable data store that will have the ability to
serve a static website. Which data back end would be the most suitable to complete
this task?
a. DynamoDB
b. EBS
c. Glacier
d. S3
2. Complete this sentence: The S3 service allows for storing an unlimited amount of data
as long as individual files are not larger than _____ and any individual PUT commands
do not exceed _____.
a. 5 GB; 5 MB
b. 5 GB; 5 GB
c. 5 TB; 5 GB
d. 5 TB; 5 MB
3. Which of these databases is not supported by RDS?
a. Cassandra
b. Microsoft SQL
c. Oracle
d. MariaDB
4. To determine the number of read capacity units required for your data, what do you need to consider?
a. Whether reads are performed in the correct sequence
b. Whether reads are strongly or eventually consistent
c. Whether reads are coming from one or multiple sources
d. All of these answers are correct.
5. Which of the following is not an S3 service tier?
a. S3 Standard
b. S3 Accelerated Access
c. S3 Infrequent Access
d. S3 Reduced Redundancy Store
6. RDS has the ability to deliver a synchronous replica in another availability zone in
which mode?
a. Multi-AZ mode
b. High-availability mode
c. Cross-AZ mode
d. Master-slave mode
7. Your company is implementing a business intelligence (BI) platform that needs to
retain end-of-month datasets for analytical purposes. You have been asked to create
a script that will be able to create a monthly record of your complete database that
can be used for analytics purposes only if required. What would be the easiest way of
doing this?
a. In RDS, choose to create an automated backup procedure that will create a data-
base snapshot every month. The snapshot can be restored to a working database
if required by the BI software.
b. Write a script that will run on a predetermined day and hour of the month and
snapshot the RDS database. The snapshot can be restored to a working database
if required by the BI software.
c. Write a script that will offload all the monthly data from the database into S3.
The data in S3 can be imported into a working database if required by the BI
software.
d. In RDS, choose to create an automated export procedure that will offload all the
monthly data from the database into S3. The data in S3 can be imported into a
working database if required by the BI software.
8. If your application has unknown and very spiky read and write performance character-
istics, which of the following should you consider choosing?
a. Using a NoSQL solution such as Memcached
b. Auto-scaling the DynamoDB capacity
c. Distributing data across multiple DynamoDB tables
d. Using the on-demand model for DynamoDB
9. Which service would you select to accelerate the delivery of video files?
a. S3 Accelerated Access
b. ElastiCache
c. CloudCache
d. CloudFront
10. When uploading files to S3, it is recommended to do which of the following?
(Choose all that apply.)
a. Split files 100 MB in size to multipart upload them to increase performance
b. Use a WAN accelerator to increase performance
c. Add metadata when initiating the upload
d. Use a VPN connection to increase security
e. Use the S3 HTTPS front end to increase security
f. Add metadata after the upload has completed
11. Which of these data stores would be the least expensive way to store millions of
log files that are kept for retention purposes?
a. DynamoDB
b. EBS
c. Glacier
d. S3
12. DynamoDB reads are performed via:
a. HTTP NoSQL requests to the DynamoDB API.
b. HTTP HEAD requests to the DynamoDB API.
c. HTTP PUT requests to the DynamoDB API.
d. HTTP GET requests to the DynamoDB API.
13. Which ElastiCache engine can support Multi-AZ deployments?
a. Redis
b. Memcached
c. DAX
d. All of these answers are correct.
Foundation Topics
Depending on the way you deliver content, you can classify your data into three major
categories:
■■ Static assets: Any type of content that cannot be opened from the storage environment
directly (via block access) but that must rather be transferred locally (downloaded) and
only then opened can be considered a static asset. These assets can be any types of files,
such as text, videos, images, archives, packages, and other data blobs that reside on a
web server and are accessed only via the web service. Because static assets are delivered
across the network, the access times for data range from tens of milliseconds to seconds to minutes and even hours (for very large files over slow links).
■■ Dynamic assets: These assets are any type of content that is opened and used from
the storage environment directly via block-level access. These can be any types of files
that are consumed by services to maintain data records or state, such as databases, log
files, and executable files. Usually dynamic assets are opened by a certain process and
start writing the records. A dynamic asset is not accessible to other processes on the
file system and is accessible only through a service such as a database service or an
API. Because dynamic assets can be accessed directly through block access, you usually see the latencies being measured in a few milliseconds to a few seconds.
■■ In-memory assets: An in-memory asset is any type of content that is loaded into
memory and used by one or multiple processes within a server directly via access
to the memory. Most commonly these assets are any kind of block-level data that
is cached for performance, in-memory databases, and other data that needs to be
accessed with the lowest possible latency. As the cost of memory per gigabyte is
much higher than the cost of disks, it is common to store only “hot data” or caches
in memory to increase the performance of traditional disk-based systems. Because
in-memory assets are ready to serve, the latencies for delivering data can be decreased
down to microseconds.
Amazon S3
Amazon S3 is essentially a serverless storage back end that is accessible via HTTP/HTTPS.
The service is fully managed and is designed to provide 99.99% availability per region and 99.999999999% (eleven nines) durability of data. The 99.99% availability means you can expect less than about 4.5 minutes of service outage per region during a monthly billing cycle, and the 99.999999999% durability means that if you store 10,000,000 objects, you can on average expect to lose a single object once every 10,000 years.
S3 delivers all content through the use of content containers called buckets. Each bucket
serves as a unique endpoint where files and objects can be aggregated (see Figure 4-1).
Each file you upload to S3 is called a key; this is the unique identifier of the file within the
S3 bucket. A key can be composed of the filename and prefixes. Prefixes can be used to
structure the files even further and to provide a directory-like view of the files, as S3 has no
concept of directories.
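For illustration (the bucket and file names here are just examples), uploading a local file under the prefix images/ with the AWS CLI creates an object whose key is images/logo.png:

aws s3 cp logo.png s3://everyonelovesaws/images/logo.png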
■■ Use an access control list (ACL) or a bucket ACL: With the ACL approach, you can
control the permissions on a broader spectrum than by using the bucket policy. This
approach is designed to quickly allow access to a large group of users, such as another
account or everyone with a specific type of access to all the keys in the bucket; for
example, an ACL can easily be used to define that everyone can list the contents of
the bucket.
coming onto S3 and perform transformations, record metadata, and so on so that the static
website functionality can be greatly enhanced. Figure 4-2 illustrates how a file being stored
on S3 can trigger a dynamic action on AWS Lambda.
S3 Events
When accessing content within a bucket on S3, there are three different URLs that you can
use. The first (default) URL is structured as follows:
http{s}://s3.{region-id}.amazonaws.com/{bucket-name}/{optional key prefix}/{key-name}
As you can see, the default naming schema makes it easy to understand: First you see the
region the bucket resides in (from the region ID in the URL). Then you see the structure
defined in the bucket/key-prefix/key combination.
Here are some examples of files in S3 buckets:
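For instance, assuming a bucket named everyonelovesaws in the us-east-1 region, a key named index.html stored under the prefix images/ would be reachable at https://s3.us-east-1.amazonaws.com/everyonelovesaws/images/index.html, and the same key stored at the root of the bucket would be reachable at https://s3.us-east-1.amazonaws.com/everyonelovesaws/index.html.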
However, the default format might not be the most desirable, especially if you want to
represent the S3 data as being part of your website. For example, suppose you want to host
all your images on your S3 website, and you would like to redirect the subdomain images.
mywebsite.com to an S3 bucket. The first thing to do would be to create a bucket with that exact name, images.mywebsite.com, so that you can create a CNAME in your domain and not break the S3 request.
To create a CNAME, you can use the second type of FQDN in your URL that is provided
for each bucket, with the following format:
{bucket-name}.s3.{optional region-id}.amazonaws.com
As you can see, the regional ID is optional, and the bucket name is a subdomain of
s3.amazonaws.com, so it is easy to create a CNAME in your DNS service to redirect a
subdomain to the S3 bucket. For the image redirection, based on the preceding syntax, you
would simply create a record like this:
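A sketch of such an entry (the exact syntax depends on your DNS provider) would look like this:

images.mywebsite.com.    CNAME    images.mywebsite.com.s3.amazonaws.com.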
NOTE Bucket names are globally unique. Because every bucket name is essentially a sub-
domain of .s3.amazonaws.com, there is no way to make two buckets with the same name in
all of AWS.
■■ FQDN with the region ID is serving the index.html key this way:
http://images.markocloud.com.s3.us-east-1.amazonaws.com/index.html
■■ FQDN without the region ID is serving the index.html key this way:
http://images.markocloud.com.s3.amazonaws.com/index.html
■■ FQDN with the CNAME on the markocloud.com domain is serving the index.html
key this way: http://images.markocloud.com/index.html
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PublicReadGetObject",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::everyonelovesaws/*"
}
]
}
As you can see, the policy allows access from anywhere to perform the s3:GetObject action, which means everyone can read the content of the bucket but cannot list the bucket or read the file metadata.
You can save this bucket policy as everyonelovesaws.json and apply it to the bucket with the following command:
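A command along the following lines (assuming the bucket is named everyonelovesaws) applies the policy:

aws s3api put-bucket-policy --bucket everyonelovesaws --policy file://everyonelovesaws.json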
NOTE You have seen two different S3 CLI commands: aws s3 and aws s3api. The s3
command is designed to make it easier to work with files and objects in S3, whereas s3api
follows the API model precisely and is purely JSON driven. You can more granularly control
making the bucket public by using the s3api put-bucket-website command, but this command
requires a JSON configuration file as input, so instead you can use the s3 command. The s3
command provides a bit more abstraction and achieves the same result with a simpler, shorter
command.
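As a sketch, enabling static website hosting on the bucket with the simpler s3 command could look like this (the index and error documents are assumptions):

aws s3 website s3://everyonelovesaws/ --index-document index.html --error-document error.html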
When the static website is enabled, you are provided with a URL that looks like this:
http://everyonelovesaws.s3-website.us-east-2.amazonaws.com/
Note in this example, as well as in the example of the CNAMEd images bucket, that the HTTP URL is not secure. This is due to a limitation on bucket names containing dots when using HTTPS. The default S3 certificate used for signing is *.s3.
amazonaws.com. This certificate can only sign the first subdomain of .s3.amazonaws.com.
Any dot in the name will be represented as a further subdomain, which would break the
certificate chain. Therefore, going to the following site will show an insecure warning:
https://images.markocloud.com.s3.amazonaws.com/index.html
This is due to the fact that the *.s3.amazonaws.com certificate only signs the “com.s3.amazonaws.com” portion of that domain name and does not cover the “images.markocloud.” part of the domain. In contrast, because the following bucket name contains no dots, going to this site will not show an insecure warning:
https://everyonelovesaws.s3.amazonaws.com/index.html
For hosted websites, you can, of course, have dots in the name of the bucket. However,
if you tried to add an HTTPS CloudFront distribution and point it to such a bucket, you
would break the certificate functionality by introducing a domain-like structure to the name.
Nonetheless, all static websites on S3 would still be available on HTTP directly even if there
were dots in the name. The final part of this chapter discusses securing a static website
through HTTPS with a free certificate attached to a CloudFront distribution.
Versioning
S3 provides the ability to create a new version of an object if it is uploaded more than once.
For each key, a separate entry is created, and a separate copy of the file exists on S3. This
means you can always access each version of the file and also prevent the file from being permanently deleted, because a deletion will only mark the latest version as deleted and will retain all previous versions.
To enable versioning on your bucket, you can use the following command:
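A minimal sketch of that command (the bucket name is illustrative) is:

aws s3api put-bucket-versioning --bucket everyonelovesaws --versioning-configuration Status=Enabled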
S3 Storage Tiers
When creating an object in a bucket, you can also select the storage class to which the
object will belong. This can also be done automatically through data life cycling. S3 has six
storage classes:
■■ S3 One Zone-Infrequent Access: A cheaper data tier in only one availability zone
that can deliver an additional 25% savings over S3 Infrequent Access. It has the same
durability, with 99.5% availability.
■■ S3 Glacier: Less than one-fifth the price of S3 Standard, designed for archiving and
long-term storage.
■■ S3 Glacier Deep Archive: About one-quarter the price of S3 Glacier and the cheapest storage solution, at about $1 per terabyte per month. This solution is intended for very long-term storage.
NOTE Due to the reduction in the price of S3 Standard over time and low interest in using RRS, RRS is now more expensive than S3 Standard. However, at press time, there is no official plan to sunset
the RRS tier, which is still being used to temporarily restore data from Glacier and Glacier
Deep Archive.
S3 Security
When storing data in the S3 service, you need to consider the security of the data. First,
you need to ensure proper access control to the buckets themselves. There are three ways to
grant access to an S3 bucket:
■■ IAM policy: You can attach IAM policies to users, groups, or roles to allow granular
control over different levels of access (such as types of S3 API actions, like GET, PUT,
or LIST) for one or more S3 buckets.
■■ Bucket policy: Attached to the bucket itself as an inline policy, a bucket policy can
allow granular control over different levels of access (such as types of S3 API actions,
like GET, PUT, or LIST) for the bucket itself.
■■ Bucket ACL: Attached to the bucket, an access control list (ACL) allows coarse-
grained control over bucket access. ACLs are designed to easily share a bucket with a
large group or anonymously when a need for read, write, or full control permissions over the bucket arises.
Both policy types allow for much better control over access to a bucket than does using
an ACL.
Example 4-2 demonstrates a policy that allows all S3 actions over the bucket called
everyonelovesaws from the 192.168.100.0/24 CIDR range.
Example 4-2 S3 Policy with a Source IP Condition
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": "*",
"Action": "s3:*",
"Resource": "arn:aws:s3:::everyonelovesaws/*",
"Condition": {
"IpAddress": {"aws:SourceIp": "192.168.100.0/24"},
}
}
]
}
On top of access control to the data, you also need to consider the security of data during
transit and at rest by applying encryption. To encrypt data being sent to the S3 bucket, you
can either use client-side encryption or make sure to use the TLS S3 endpoint. (Chapter 1
covers encryption in transit in detail.) To encrypt data at rest, you have three options in S3:
■■ S3 SSE-C: With the SSE-C option, S3 is configured with server-side encryption that
uses a customer-provided encryption key. The encryption key is provided by the
client within the request, and so each blob of data that is delivered to S3 is seamlessly
encrypted with the customer-provided key. When the encryption is complete, S3
discards the encryption key so the only way to decrypt the data is to provide the same
key when retrieving the object.
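As a rough sketch (the key material and file names are hypothetical), an SSE-C upload with the AWS CLI could look like the following; the same key and algorithm must be supplied again when the object is retrieved:

# Generate a random 256-bit key and use it to encrypt the object on upload
openssl rand 32 > sse-key.bin
aws s3 cp confidential.txt s3://everyonelovesaws/confidential.txt --sse-c AES256 --sse-c-key fileb://sse-key.bin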
■■ Graph
■■ Document
■■ Analytical (OLAP)
Choosing which type of database to use is essentially governed by the data model. A typical
relational database model is strictly structured by row and column, as illustrated in Table 4-2.
The data must fit fully within one row and is required to be structured to fit the categories
defined in the columns.
Different columns of a traditional database are indexed to expedite the retrieval of the data
from the database. The index is usually loaded into memory and allows for very fast retrieval
of specific pieces of data. Traditional databases are usually also ACID compliant, where
ACID stands for
■■ Atomicity: Every transaction against an SQL database is atomic and (usually) cannot
be broken down into smaller pieces. If a transaction fails, the whole operation needs to
be restarted.
■■ Consistency: Data must be consistent at all times, even if replicated across a cluster.
This means that data will be made unavailable until the replication has completed and
the data is confirmed to be consistent.
■■ Isolation: Concurrent transactions can never interfere with each other. In SQL, for
example, you always put a lock on a table or an index that you are modifying so no
other transaction can interfere with it.
■■ Durability: Data must be stored durably and must also have the ability to be recovered
in case of a failure. You should keep a transaction log that can be replayed and one or
more backups of a database to maintain durability.
With a NoSQL database, you can represent the whole dataset of one row as a set of key/value
pairs that are stored and retrieved as a document. This document needs to be encoded in a for-
mat from which the application can build the rows and columns represented in the document.
Example 4-3 demonstrates a JSON-formatted document that represents the same data as the
first row of your SQL table (refer to Table 4-2).
Example 4-3 JSON-Formatted Data with Key/Value Pairs Matching the First Row of
Table 4-2
{
  "Index": "0000",
  "Name": "Anthony",
  "Surname": "Soprano",
  "Occupation": "Waste Management Consultant",
  "Active": "Y"
}
To speed up retrieval of the data, you need to select a key that can appear in all documents and
allow for the prompt retrieval of the complete dataset. The benefit of this type of format is that
only a certain part of the data—not the complete dataset—defines the structure. So you can
essentially shorten or extend the dataset with any number of additional key/value pairs on the
fly. For example, if you want to add the date of last activity for a user, you can simply add an
additional key/value pair to the document denoting the date, as demonstrated in Example 4-4.
Example 4-4 Adding the Last Active Attribute to the Data
{
  "Index": "0001",
  "Name": "Christopher",
  "Surname": "Moltisanti",
  "Occupation": "Disposal Operator",
  "Active": "N",
  "Last active": 13052007
}
You could even structure the day, month, and year as their own nested key/value pairs in the
Last active key, as demonstrated in Example 4-5.
Example 4-5 Adding an Entry as Nested Key/Value Pairs
{
  "Index": "0001",
  "Name": "Christopher",
  "Surname": "Moltisanti",
  "Occupation": "Disposal Operator",
  "Active": "N",
  "Last active": [
    { "Day": 13 },
    { "Month": "05" },
    { "Year": "2007" }
  ]
}
The ability to nest keys in your database adds a lot more flexibility to the way you store and
access the data in the NoSQL database. Just think of the impact of the schema modifications
required to fit the new type of data into an existing SQL table: Not only would the process
be disruptive to ongoing operations, but rolling back changes to a schema is sometimes
impossible. With NoSQL, you can change the data model on the fly by adding and removing
key/value pairs to items with ease.
NoSQL databases are designed with linear scalability in mind as all data is distributed across
multiple nodes, which become authoritative for a certain subset of indexing keys. To retrieve
the data, you usually address a common front end that then delivers the data by contact-
ing multiple back ends and delivers documents from all of them in parallel. With a SQL
database, that design is very hard to implement as the transaction usually cannot be easily
distributed across multiple back ends. Unlike SQL databases, NoSQL databases usually
conform to the BASE database ideology, where BASE stands for
■■ Basic availability: Availability of the database is the main requirement. The database
must seem to be up all the time, and reads from/writes to the database must succeed
as much as possible.
■■ Soft state: The state of the system is allowed to change over time: The database
is allowed to be repartitioned (by adding or removing nodes), and the data can be
expired, deleted, or offloaded. The replication system ensures that the data is repli-
cated as soon as possible, but the availability must not be affected by any state changes.
■■ Eventual consistency: The system will eventually (after a period of time) achieve con-
sistency of data across all the nodes of a cluster. The data will be available even during
replication, and a client requesting data could access a node with a stale piece of data.
To mitigate eventual consistency, strongly consistent reads can be utilized to read data
from multiple nodes to ensure that the data is always in a consistent state when being
read. The read consistency must be handled by the application.
■■ You can deploy an EC2 instance with a database server application installed.
Amazon RDS
The choice between a standalone EC2 instance with a database on top and RDS is essentially the choice between an unmanaged environment, where you have to manage everything yourself, and a managed service, where most of the management tasks are automated and complete control over deployment, backups, snapshots, restores, sizing, high availability, and replicas is as simple as making an API call. When developing in AWS, it always makes sense to lean
toward using a managed service as the benefits of reducing the management overhead can be
numerous. Aside from simplifying the management, another business driver can be increased
flexibility and automation, which can be achieved by using the AWS CLI, the SDKs, and
CloudFormation to deploy the database back end with very little effort or through an auto-
mated CI/CD system. Managed services essentially empower developers to take control of
the infrastructure and design services that can be easily deployed and replicated and that can
have auto-healing characteristics built into them.
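For example, a sketch of deploying a small MySQL RDS instance from the AWS CLI (the identifier, sizing, and credentials are illustrative; in practice, prefer IAM database authentication or a secrets manager over inline passwords) might look like this, with a second call retrieving the endpoint for the application connection string:

aws rds create-db-instance \
    --db-instance-identifier mydevdb \
    --db-instance-class db.t3.small \
    --engine mysql \
    --allocated-storage 20 \
    --master-username dbadmin \
    --master-user-password 'replace-with-a-generated-secret'
aws rds describe-db-instances \
    --db-instance-identifier mydevdb \
    --query "DBInstances[0].Endpoint.Address"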
Example 4-6 shows how the deployment of an RDS database can be integrated in a Java
application by using the AWS Java SDK, giving you the ability to deploy the database and
use the database string returned to connect to the newly created database.
Example 4-6 Java Script That Can Be Used to Build an RDS Database
NOTE This example breaks the rules by storing credentials in the code. This can be
avoided by running the code on an EC2 instance and using a role with the permissions to
create the RDS database. You need to be aware that you also have the ability to enable data-
base authentication via IAM and generate an IAM token within your Java code to authenti-
cate to the database without having any passwords baked in the code.
Once the script is created, you can list all your instances with the DescribeDBInstancesResult class. You will want to get the instance identifier and the endpoint, which is the SQL
endpoint URL that you can later use to connect to the database. You can do this by includ-
ing the snippet shown in Example 4-7 in your Java code.
Example 4-7 Using the Java DescribeDBInstancesResult Class
■■ MariaDB
■■ PostgreSQL
■■ Amazon Aurora
■■ Oracle
■■ All PostgreSQL versions (though version 9.3.5 or newer is required for Multi-AZ and read replicas)
Figure 4-5 illustrates an RDS Multi-AZ deployment with synchronous replication within an AWS Region.
Amazon Aurora
Amazon Aurora is a next-generation, cloud-native database engine that is currently compatible with the open-source MySQL and PostgreSQL database engines. The benefit of Aurora is that it decouples the processing
from the storage. All the data is stored on a synchronously replicated volume in three avail-
ability zones, and the processing of SQL requests is performed on the cluster instances. The
instances have no local storage, and they all access the cluster volume at the same time, so
the performance of the cluster can be linearly scaled by adding nodes.
The write node in an Aurora cluster, also called the primary instance, is used to process
all write requests. The primary instance type needs to be scaled to the write performance
requirements of your application and can be easily resized by promoting a larger read replica
to the primary role. All other members of the cluster are called replica instances, and they
can respond to read requests. The primary and the replicas have different DNS names to
which you send requests, which means you can simply configure your application with two
FQDN targets—one for the writes and another for the reads—and do not need to handle the
read/write distribution on your own.
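If you need to look up the two endpoints of an existing cluster, a describe call such as the following (the cluster identifier is hypothetical) returns the writer and reader DNS names:

aws rds describe-db-clusters \
    --db-cluster-identifier my-aurora-cluster \
    --query "DBClusters[0].[Endpoint,ReaderEndpoint]"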
Because the primary and replica instances have access to the same synchronously replicated
cluster volume, you can also instantly promote any read replica into the primary role if the
primary instance fails or if the availability zone where the primary instance is running experi-
ences difficulties. Figure 4-6 illustrates how the Aurora design ensures synchronous writes
and decouples storage from the compute layer.
Scaling Databases
There are four general ways to scale database performance:
■■ Vertical scaling: You can give a single database engine more power by adding more
CPU and RAM.
■■ Horizontal scaling: You can give a database cluster more power by adding more
instances.
■■ Read offloading: You can add read replicas and redirect read traffic to them.
■■ Sharding: You can distribute the data across multiple database engines, with each one
holding one section, or shard, of data.
With relational databases, vertical scaling always works, but it has a maximum limit. In AWS,
the maximum limit is the largest instance size that can be deployed in the service. An alter-
native is horizontal scaling, but generally relational databases are not the best at being able
to scale horizontally. The nature of the atomicity of the SQL transactions usually means that
the whole transaction must be processed by one server—or sometimes even in one thread
on a single CPU.
If an RDS database is deployed in a Multi-AZ configuration, the resizing can be done trans-
parently because the slave database is resized first, the data is synchronized, the connection
fails over, and the slave becomes the master while the previous master instance is resized.
When the resizing is complete, data is again synchronized, and a failover is performed to the
previous master instance.
Example 4-8 uses the boto3 Python SDK to increase the instance size from db.t3.small to db.t3.medium for the instance created in the previous example.
Example 4-8 Python SDK (boto3) Script That Can Be Used to Resize an RDS Instance
Another way of scaling is to distribute the read and write transactions on multiple nodes. A typi-
cal relational database is more read intensive than write intensive, with a typical read-to-write
ratio being 80:20 or even 90:10. By introducing one or more read replicas, you can offload 80%
or even 90% of the traffic off your write node. Aurora excels at read replica scaling, whereas the
other services that support read replicas support only asynchronous replication, which means
the read data is not as easily distributed across the cluster because the data read from the replica
might be stale. But even asynchronous replicas can be a great benefit for offloading your write
master where historical analytics and business intelligence applications are concerned.
Typically the last resort for scaling relational databases is to shard the data. Essentially this
means that a dataset is sliced up into meaningful chunks and distributed across multiple
masters, thus linearly increasing write performance.
NOTE The performance gained from sharding increases linearly only under ideal conditions, where data distribution across the shards is equal.
distribution of data across shards and retaining meaningful pieces of data on the same server
is the biggest challenge.
For example, imagine a phone directory in a database with names from A to Z. When you
need more performance, you can simply split up the database into names starting with A to
M and N to Z. This way, you have two databases to write to, thus theoretically doubling the
performance. Figure 4-7 illustrates the principle of sharding RDS databases to achieve better
performance.
However, the limitation of sharding is immediately apparent when you try to perform analyt-
ics as you need to access two databases, join the two tables together, and only then perform
the analytics or BI operation. Figure 4-8 illustrates tables from sharded databases being
joined to an analytical database.
Figure 4-8 Steps Required for Analytics on Sharded Databases
■■ DynamoDB: A NoSQL key/value storage back end that is addressable via HTTP/
HTTPS
■■ Neptune: A NoSQL graphing solution for storing and addressing complex networked
datasets
■■ Redshift: A columnar data warehousing solution that can scale to 2 PB per volume
■■ Redshift Spectrum: A serverless data warehousing solution that can address data
sitting on S3
■■ TimeStream: A time series recording solution for use with IoT and industrial telemetry
■■ Quantum Ledger: A ledger database designed for record streams, banking transactions,
and so on
As you can see, you are simply spoiled for choice when it comes to storing nonrelational
data types in AWS. This chapter focuses on the first two database types, DynamoDB and
ElastiCache, as they are important both for gaining a better understanding of the AWS
environment and for the AWS Certified Developer–Associate exam.
Amazon DynamoDB
DynamoDB is a serverless NoSQL solution that uses a standard REST API model for both
the management functions and data access operations. The DynamoDB back end is designed
to store key/value data accessible via a simple HTTP access model. DynamoDB supports
storing any amount of data and is able to predictably perform even under extreme read and
write scales of 10,000 to 100,000 requests per second from a single table at single-digit
millisecond latency scales. When reading data, DynamoDB has support for eventually
consistent, strongly consistent, and transactional requests. Each request can be augmented
with the JMESPath query language, which gives you the ability to sort and filter the data
both on the client side and on the server side.
The DynamoDB data model has three main components:
■■ Tables
■■ Items
■■ Attributes
Tables
Like many other NoSQL databases, DynamoDB has a distributed back end that enables it to scale linearly and provide your application with the required level of performance. The distribution of the data across the DynamoDB cluster is left up to the user.
When creating a table, you are asked to select a primary key (also called a hash key). The
primary key is used to create hashes that allow the data to be distributed and replicated
across the back end according to the hash. To get the most performance out of DynamoDB,
you should choose a primary key that has a lot of variety. A primary key is also indexed so
that the attributes being stored under a certain key are accessible very quickly (without the
need for a scan of the table).
For example, imagine that you are in charge of a company that makes online games. A table
is used to record all scores from all users across a hundred or so games, each with its own
unique identifiers. Your company has millions of users, each with a unique username. To
select a primary key, you have a choice of either game ID or username. There are more
unique usernames than game IDs, so the best choice would be to select the username as the
primary key as the high level of variety in usernames will ensure that the data is distributed
evenly across the back end.
Optionally, you can also add a sort key to each table to add an additional index that you can
use in your query to sort the data within a table. Depending on the type of data, the sorting
can be temporal (for example, when the sort key is a date stamp), by size (when the sort key
is a value of a certain metric), or by any other arbitrary string.
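As a rough sketch of how such a table could be defined, the following CLI call creates a table keyed on the username as the partition (hash) key, with a date stamp as the sort (range) key. The table and attribute names here are illustrative, not part of the example above:
aws dynamodb create-table \
  --table-name game-scores \
  --attribute-definitions \
    AttributeName=username,AttributeType=S \
    AttributeName=score_date,AttributeType=S \
  --key-schema \
    AttributeName=username,KeyType=HASH \
    AttributeName=score_date,KeyType=RANGE \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5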
A table is essentially just a collection of items that are grouped together for a purpose.
A table is regionally bound and is highly available within the region as the DynamoDB
back end is distributed across all availability zones in a region. Because a table is regionally
bound, the table name must be unique within the region within your account.
Items
An item in a table contains all the attributes for a certain primary key, or for the primary key and sort key if a sort key has been selected on the table. Each item can be up to 400 KB in size and is designed to hold key/value data with any type of payload. Items are accessed via a standard HTTP model in which PUT, GET, UPDATE, and DELETE operations allow you to perform create, read, update, and delete (CRUD) operations. Items can also be retrieved in batches, and a batch operation is issued as a single HTTP method call that can retrieve up to 100 items or write up to 25 items with a collective size not exceeding 16 MB.
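Batch operations are issued with the batch-get-item and batch-write-item commands. As a minimal sketch (the table name, file name, and items are hypothetical), you could save the following request map as batch.json:
{
  "game-scores": [
    { "PutRequest": { "Item": { "username": { "S": "player1" }, "score_date": { "S": "2019-08-20" }, "score": { "N": "1200" } } } },
    { "PutRequest": { "Item": { "username": { "S": "player2" }, "score_date": { "S": "2019-08-20" }, "score": { "N": "900" } } } }
  ]
}
and then submit all the writes in a single call:
aws dynamodb batch-write-item --request-items file://batch.json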
Attributes
An attribute is a payload of data with a distinct key. An attribute can have one of the
following values:
■■ A single scalar value that is a string, a number, a Boolean, a binary, or a null. For example:
{
"name" : "Anthony",
"height" : "6.2"
}
These attributes would be represented in a DynamoDB table as illustrated in Table 4-3.
■■ A document containing nested key/value data, such as a JSON map or list. This attribute would be represented in a DynamoDB table as illustrated in Table 4-4.
■■ A set of multiple scalar values of the same type, for example:
{
"activities" : [ ... ]
}
This attribute would be represented in a DynamoDB table as shown in Table 4-5.
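In DynamoDB's JSON representation, each attribute value carries a type descriptor. The following item document is only an illustration (the attribute names are arbitrary): S and N mark string and number scalars, BOOL a Boolean, M a nested map, L a list, and SS a string set:
{
  "name": { "S": "Anthony" },
  "height": { "N": "6.2" },
  "active": { "BOOL": true },
  "address": { "M": { "city": { "S": "Berlin" } } },
  "nicknames": { "L": [ { "S": "Tony" } ] },
  "activities": { "SS": [ "running", "climbing" ] }
}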
Secondary Indexes
Sometimes the combination of primary key and sort key does not give you enough of an
index to efficiently search through data. You can add two more indexes to each table by
defining the following:
■■ Local secondary index (LSI): The LSI can be considered an additional sort key for sifting through multiple entries of a certain primary key. This is very useful in applications where two ranges (the sort key and the secondary index) are required to retrieve the correct dataset. The LSI consumes some of the provisioned capacity of the table and can thus impact your performance calculations if one is created.
■■ Global secondary index (GSI): The GSI can be considered an additional primary key
on which the data can be accessed. The GSI allows you to pivot a table and access
the data through the key defined in the GSI and get a different view of the data. The
GSI has its own provisioned read and write capacity units that can be set completely
independently of the capacity units provisioned for the table.
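For instance, a GSI can be added to an existing table with the update-table command and given its own throughput. The table, attribute, and index names in this sketch are illustrative:
aws dynamodb update-table \
  --table-name game-scores \
  --attribute-definitions AttributeName=game_id,AttributeType=S \
  --global-secondary-index-updates '[
    { "Create": {
        "IndexName": "game-index",
        "KeySchema": [ { "AttributeName": "game_id", "KeyType": "HASH" } ],
        "Projection": { "ProjectionType": "ALL" },
        "ProvisionedThroughput": { "ReadCapacityUnits": 10, "WriteCapacityUnits": 10 } } }
  ]'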
For example, say that you have industrial sensors that continuously feed data at a rate of 10 MB per second, with each write being approximately 500 bytes in size. Because a write capacity unit (WCU) represents one write per second of an item up to 1 KB in size, each 500-byte write consumes 1 WCU, meaning you need to provision 20,000 WCUs to allow enough performance for all the writes to be captured.
As another example, say you have 50 KB feeds from a clickstream being sent to DynamoDB
at the same 10 MB per second. Each write will now consume 50 WCUs, and at 10 MB per
second, you are getting 200 concurrent writes, which means 10,000 WCUs will be sufficient
to capture all the writes.
With reads, the calculation depends on whether you are reading with strong or eventual consistency, because eventually consistent reads can perform double the work per capacity unit. For example, if an application is reading at a consistent rate of 10 MB per second and performing strongly consistent reads of items 50 KB in size, each read consumes 13 RCUs (at 4 KB per RCU), whereas an eventually consistent read consumes only 7 RCUs. To read 10 MB per second in a strongly consistent manner, you would need an aggregate of 2,600 RCUs, whereas eventually consistent reads would require you to provision only 1,400 RCUs.
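Once the required capacity has been worked out, it can be applied to an existing table with a single call. The table name here is illustrative, and the numbers are taken from the examples above:
aws dynamodb update-table \
  --table-name telemetry \
  --provisioned-throughput ReadCapacityUnits=2600,WriteCapacityUnits=20000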
Global Tables
In DynamoDB, you also have the ability to create a DynamoDB global table (see Figure 4-10).
This is a way to share data in a multi-master replication approach across tables in different
regions. To create a global table, you need to first create tables in each of the regions and
then connect them together in the AWS console or by issuing a command in the AWS CLI
to create a global table from the previously created regional tables. Once a global table is
established, each of the tables subscribes to the DynamoDB stream of each other table in
the global table configuration. This means that a write to one of the tables is replicated to the other regions almost instantly; the replication latency essentially comes down to the network transit time from one region to another.
Figure 4-10 A Global Application Using a DynamoDB Global Table with a Replica in Europe
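As a rough sketch of that workflow (the table and region names are illustrative, and this uses the original create-global-table API), you would first make sure each regional table has streams enabled and then link the tables together:
aws dynamodb update-table --table-name mytable --region eu-west-1 \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

aws dynamodb create-global-table --global-table-name mytable \
  --replication-group RegionName=us-east-1 RegionName=eu-west-1 \
  --region us-east-1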
To create a table with both a partition (hash) key and a sort (range) key from the command line, you can use the aws dynamodb create-table command:
aws dynamodb create-table \
--table-name vegetables \
--attribute-definitions \
AttributeName=name,AttributeType=S AttributeName=type,AttributeType=S \
--key-schema \
AttributeName=name,KeyType=HASH AttributeName=type,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=10
NOTE The DynamoDB CLI commands can be quite extensive; for the sake of clarity, the preceding example uses the Linux command-line \ separator to break the command across multiple lines.
After the table is created, you can use the aws dynamodb put-item command to write items to the table.
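A minimal sketch of such a call, writing the item that is retrieved later in this section (the attribute values mirror the GetItem response shown below), could look like this:
aws dynamodb put-item \
  --table-name vegetables \
  --item '{ "name": {"S": "potato"}, "type": {"S": "tuber"}, "cost": {"N": "1.5"} }'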
■■ "key": {"data type":"value"}: The name, type, and value of the data
If you are following along with the instructions, you can create some more entries in the
table with the put-item command.
When doing test runs, you can also add the --return-consumed-capacity TOTAL switch at
the end of your command to get the number of capacity units the command consumed in
the API response.
Next, to retrieve the item with the primary key potato and sort key tuber from the DynamoDB table, you use the aws dynamodb get-item command.
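The command itself can be sketched as follows, assuming the vegetables table and keys from this example:
aws dynamodb get-item \
  --table-name vegetables \
  --key '{ "name": {"S": "potato"}, "type": {"S": "tuber"} }'
At the HTTP level, the DynamoDB API answers such a request with a response similar to the following: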
HTTP/1.1 200 OK
x-amzn-RequestId: <RequestId>
x-amz-crc32: <Checksum>
Content-Type: application/x-amz-json-1.0
Content-Length: <PayloadSizeBytes>
Date: <Date>
{
"Item": {
"name": { "S": ["potato"] },
"type": { "S": ["tuber"] },
"cost": { "N": ["1.5"] },
}
}
When controlling access to DynamoDB with IAM, you can grant two broad types of permissions:
■■ Administrative access with permissions to create, modify, and delete tables
■■ Data access with specific permissions to read, write, update, or delete items in specific tables
You can also write your application to perform both the administrative and the data access tasks. This gives you the ability to easily self-provision tables from the application, which is especially useful for data whose value is time sensitive and for temporary data, such as sessions or shopping carts in an e-commerce website or an internal report table that gives management a monthly revenue overview.
You can provision as many tables as needed, and if any tables are no longer in use, you can simply delete them. When the data must remain available but is rarely accessed, you can instead reduce the RCU and WCU provisioning to a minimal setting (for example, 5 units each). This way, a reporting engine can still access the historical data, and the cost of keeping the table running is minimal.
For a sales application that records sales metrics each month, the application could be
trusted to create a new table every month with the production capacity units but maintain
the old tables for analytics. Every month, the application would reduce the previous monthly
table's capacity units to whatever would be required for analytics to run.
Because policies give you the ability to granularly control permissions, you can lock down
the application to only one particular table or a set of values within the table by simply
adding a condition on the policy or by using a combination of allow and deny rules.
The policy in Example 4-10 locks down the application to exactly one table by denying
access to everything that is not this table and allowing access to this table. This way, you can
ensure that any kind of misconfiguration will not allow the application to read or write to
any other table in DynamoDB.
Example 4-10 IAM Policy Locking Down Permissions to the Exact DynamoDB Table
{
"Version": "2012-10-17",
"Statement":[{
"Effect":"Allow",
"Action":["dynamodb:*"],
"Resource":["arn:aws:dynamodb:us-east-1:111222333444:table/vegetables"]
},
{
"Effect":"Deny",
"Action":["dynamodb:*"],
"NotResource":["arn:aws:dynamodb:us-east-1:111222333444:table/vegetables"]
}
]
}
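To put a policy like this into effect, you could save it to a file and attach it as an inline policy to the IAM role your application runs under; the role, policy, and file names here are placeholders:
aws iam put-role-policy \
  --role-name vegetables-app-role \
  --policy-name dynamodb-vegetables-only \
  --policy-document file://vegetables-policy.json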
Amazon ElastiCache
ElastiCache is a managed service that helps simplify the deployment of in-memory data
stores in AWS. With in-memory data stores, you can perform caching of frequently retrieved
responses, maintain session state, and, in some cases, run SQL-like databases that support
transaction type queries through scripting.
One of the primary uses for ElastiCache is simple database offloading. Your application is likely to have a high read-to-write ratio, and some requests are possibly made over and over again. If all of these common requests are constantly sent to the back-end database, you might be consuming more database capacity than needed, and that can become very expensive. Instead of constantly retrieving data from the database, you can ensure that frequent responses are cached in an intermediary service that is faster to respond and that can help you reduce the size of the database server. Whether your application requires just a simple place to store values retrieved from the database or a scalable, highly available cluster that offers high-performance complex data types, ElastiCache can deliver the right solution for the right purpose.
Memcached
Memcached is a high-performance, distributed, in-memory caching system. The basic design of the Memcached system is meant for storing simple key/value information. The Memcached service differs from the DynamoDB back end in that each key has only one value. Of course, you can nest multiple values into the value of the key, but there is no index on the data; all the data is stored in memory and is retrievable with microsecond latency.
Memcached is perfectly suited for simple caching, such as offloading database responses where the key is the query and the value is the response. It is also well suited for storing session information for your web application, where the cookie ID can be used as the key and linked with the session state as the value.
ElastiCache offers an easy way to deploy a Memcached cluster in a single availability zone.
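As a rough sketch, a two-node Memcached cluster can be provisioned with a single CLI call; the cluster identifier and node type below are illustrative:
aws elasticache create-cache-cluster \
  --cache-cluster-id session-cache \
  --engine memcached \
  --cache-node-type cache.t3.micro \
  --num-cache-nodes 2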
Redis
When a more advanced in-memory database is required, Redis is the solution. Redis supports running an in-memory database in a more classical approach, with a Multi-AZ pair and read replicas in the cluster. It supports more complex datasets and schema-type data, has the ability to be used as a messaging back end, and offers some transactional data access support through Lua scripting.
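A comparable sketch for Redis uses a replication group so that a primary and a read replica are created with automatic failover enabled; the identifiers and node type are again illustrative:
aws elasticache create-replication-group \
  --replication-group-id app-redis \
  --replication-group-description "Redis primary with one replica" \
  --engine redis \
  --cache-node-type cache.t3.micro \
  --num-cache-clusters 2 \
  --automatic-failover-enabled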
Amazon CloudFront
CloudFront is a serverless content delivery network that can enhance the user experience of
any application running in the AWS cloud, outside the cloud, or on premises. CloudFront
provides you with the ability to cache common responses from your HTTP/HTTPS web
application by caching the responses to GET, HEAD, and OPTIONS HTTP methods. The
data is cached at the AWS edge locations, which are distributed closer to densely populated
areas in more than 100 different locations. Figure 4-11 illustrates the AWS regions and edge
location distribution across the globe.
Figure 4-11 AWS Edge Locations and Regional Edge Caches Across the Globe
CloudFront distinguishes among the following HTTP methods:
■■ GET: A read operation that retrieves a document from the web server
■■ HEAD: A read operation that retrieves only the header of the document
■■ POST: A write operation that is used to send text-based content to a web server
■■ PUT: A write operation that is used to send a file or data blob to a web server
■■ DELETE: A write operation that deletes a file or some content on a web server
A distribution can be configured to cache or forward the following combinations of these methods:
■■ GET and HEAD: Standard caching for documents and headers. Useful for static websites.
■■ GET, HEAD, and OPTIONS: Adds the ability to cache OPTIONS responses from an
origin server.
■■ GET, HEAD, OPTIONS, PUT, PATCH, POST, and DELETE: Terminates all HTTP/
HTTPS sessions at the CloudFront edge location and can increase the performance of
both read and write requests.
In addition, you can control the time-to-live (TTL) of your cache. By controlling the TTL,
you can set a custom way of expiring content when it should be refreshed. CloudFront
distributions support the following options for setting TTL:
■■ Min TTL: When forwarding all headers, this is a required setting. It determines the minimum cache lifetime for your CloudFront distribution and thus the shortest interval at which CloudFront checks the origin for newer versions of the document.
■■ Max TTL: This optional value defines the longest possible period that objects can stay in the cache. It is used to override any cache-control headers sent out by the origin.
■■ Default TTL: This optional value applies only when no specific TTL is set in the headers coming from the origin. It allows the origin to control its own cache behavior and override the default with cache-control headers.
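In a distribution configuration (such as the cloudfront.json file used later in this chapter), both the cached methods and the TTLs are expressed in the DefaultCacheBehavior section. The fragment below is only an illustrative sketch: the origin ID and TTL values are arbitrary, and a complete configuration also needs the remaining required fields shown in Example 4-12:
"DefaultCacheBehavior": {
  "TargetOriginId": "my-origin",
  "ViewerProtocolPolicy": "redirect-to-https",
  "AllowedMethods": {
    "Quantity": 7,
    "Items": ["GET", "HEAD", "OPTIONS", "PUT", "PATCH", "POST", "DELETE"],
    "CachedMethods": { "Quantity": 3, "Items": ["GET", "HEAD", "OPTIONS"] }
  },
  "MinTTL": 0,
  "DefaultTTL": 86400,
  "MaxTTL": 31536000
}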
CloudFront offers the capability to both improve the performance of an application and
decrease the cost of content delivery. For example, when delivering content from S3, the
transfer costs can add up. With CloudFront, the transfer cost for your data is cheaper
per gigabyte. This makes a lot of difference when content that goes viral is hosted on S3.
Imagine a video-sharing service where videos tend to go viral and are getting millions of
views per day. If each video is 10 MB in size, each million views would carry 10 TB of
data transfer out of S3. To achieve the same performance directly from S3, you can turn on S3 Transfer Acceleration, which increases the delivery speed of content to remote regions but roughly doubles the cost of delivery. With CloudFront, by contrast, you can get to less than 50% of the cost of delivering from S3 with Transfer Acceleration, while also reaping the benefit of having the content cached much closer to the user, who will benefit from the decreased latency of your service. Figure 4-12 illustrates the operation of the CloudFront cache.
A study performed by Google found that the traffic to a typical website decreases by 20% if the latency of the web page load increases by 500 ms. A typical website sequentially loads anywhere between 10 and 100 objects when delivering a page, which can translate to a site loading in anywhere from a few seconds (less than 3 seconds is considered good) to tens of seconds for the worst-performing sites. If the latency to request each of those objects is about 100 ms, that alone adds a whole second to a page that loads 10 objects. Using CloudFront can bring the request latency down to single-digit or low double-digit milliseconds, thus drastically improving a web page's load time even without any site content optimization. It should be noted, though, that optimizing the site content makes the biggest difference; however, optimization can require quite a lot of effort, whereas turning on CloudFront can accelerate a site within minutes.
Another great feature that can help you develop and tune content delivery is that CloudFront is addressable via the API. This means you can easily control the behavior of the caching environment from within the application. You have complete control over how headers are forwarded to the origin, you have control over compression, you can modify the responses coming directly out of CloudFront, and you can detect the client type within the cache.
To add some processing power to CloudFront, a distribution can be integrated with Lambda@Edge, which executes predefined functions at the edge location, thus allowing you to include dynamic responses at the point of access to your application. Lambda@Edge functions execute with the same low latency as the content being delivered from CloudFront and can significantly improve the user experience of your application.
CloudFront Security
CloudFront is secure and resilient to L3 and L4 DDoS attacks when used with AWS Shield Standard. Adding the AWS Shield Advanced service to your CloudFront distribution gives you a 24/7 response team to look after your site, allows for custom mitigation of advanced higher-layer DDoS attacks, and protects you from incurring additional costs associated with the increase in capacity when absorbing a DDoS attack. CloudFront can also be integrated with the AWS Web Application Firewall (WAF), which can help mitigate other types of attacks, such as web address manipulations, injection attacks, and web server vulnerabilities (known and zero-day attacks), and provides the ability to implement different types of rules for allowed patterns, sources, and methods.
To secure data in transit, you can use a TLS endpoint over HTTPS. CloudFront seamlessly
integrates with the AWS Certificate Manager (ACM) service, which can automatically
provision, renew, and replace an HTTPS certificate on your distribution at no additional cost.
This service provides a great benefit to your web application because you never need to
worry about renewing, replacing, or paying for an X.509 certificate from a public certificate
authority.
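Requesting such a certificate is a single CLI call; note that a certificate used with CloudFront must be requested in the us-east-1 region, and the domain name below is a placeholder:
aws acm request-certificate \
  --domain-name www.example.com \
  --validation-method DNS \
  --region us-east-1
The certificate ARN returned by this command can then be referenced in the distribution's ViewerCertificate settings.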
You can also use CloudFront to offload all in-transit encryption by terminating TLS at the edge and sending data to an HTTP origin. When sensitive data is involved, you can use field-level encryption, which encrypts only chosen fields being sent to the server, as with a payment form where the credit card details are encrypted but the rest of the information (such as the customer name and address) is sent in clear text to the origin. Field-level encryption uses a set of public and private keys to asymmetrically encrypt and decrypt data across the network and keep the data secure, as illustrated in Figure 4-13.
Figure 4-13 Field-Level Encryption of Sensitive Data (Personally Identifiable Information, Personal Health Information, Confidential Information, and Payments Data) Between User Agents, the CloudFront Distribution, and a Custom Origin
This example shows how to create an OAI and allow access to a specific S3 bucket only through the identity. The command needs two arguments:
■■ CallerReference, which ensures that the request can't be replayed (like a timestamp)
■■ Comment, an arbitrary description of the identity
aws cloudfront create-cloud-front-origin-access-identity \
--cloud-front-origin-access-identity-config \
CallerReference=20190820,Comment=everyonelovesaws
Make sure to capture the OAI ID from the response because you will be using it in your
configuration.
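One way to capture the ID directly, instead of copying it out of the JSON response, is to filter the output with the CLI's --query option when you first create the identity; this sketch reuses the same CallerReference and Comment as above:
aws cloudfront create-cloud-front-origin-access-identity \
  --cloud-front-origin-access-identity-config \
  CallerReference=20190820,Comment=everyonelovesaws \
  --query 'CloudFrontOriginAccessIdentity.Id' --output text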
Now that you have created the origin access identity, you need to add its identifier to the bucket policy of the bucket you want to protect. The following policy allows only the origin access identity with the ID E37NKUHHPJ30OF to access the everyonelovesaws bucket. You apply this bucket policy to the S3 bucket that you previously made public. Example 4-11 shows the policy that allows access for the origin access identity.
{
"Version": "2008-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity E37NKUHHPJ30OF"
},
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::everyonelovesaws/*"
}
]
}
This policy makes your bucket unavailable until you create the distribution in CloudFront.
You can test this by trying to access the bucket through its URL.
To create the CloudFront distribution, you can use the cloudfront.json file shown in Example 4-12 as input to the aws cloudfront create-distribution command. In the file, you need to define at least the caller reference, the origin (including the origin access identity), and the default cache behavior:
{
"CallerReference": "20190820",
"Aliases": {
"Quantity": 0
},
"DefaultRootObject": "index.html",
"Origins": {
"Quantity": 1,
"Items": [
{
"Id": "everyonelovesaws",
"DomainName": "everyonelovesaws.s3.amazonaws.com",
"S3OriginConfig": {
"OriginAccessIdentity": "origin-access-identity/cloudfront/E37NKUHHPJ30OF"
}
}
]
},
"DefaultCacheBehavior": {
"TargetOriginId": "everyonelovesaws",
"ForwardedValues": {
"QueryString": true,
"Cookies": {
"Forward": "none"
}
},
"TrustedSigners": {
"Enabled": false,
"Quantity": 0
},
"ViewerProtocolPolicy": "allow-all",
"MinTTL": 0
},
"Comment": "",
"Enabled": true
}
Save this file in the directory where you are running the CLI and run the aws cloudfront create-distribution command as follows:
aws cloudfront create-distribution \
--distribution-config file://cloudfront.json
This command returns the complete set of JSON settings from the cloudfront.json file, but
the most important thing it returns is the distribution FQDN. Look for the following string
in the response from the last command:
"DomainName": "d1iq7pwkt6nlfb.cloudfront.net"
Now you can browse to the d1iq7pwkt6nlfb.cloudfront.net FQDN and see that your S3 bucket is accessible only through the CloudFront origin access identity. This FQDN can also be used as a CNAME target for your website so that you can serve your content with your custom domain name.
Q&A
The answers to these questions appear in Appendix A. For more practice with exam format
questions, use the Pearson Test Prep Software Online.
1. True or false: In most cases, it is not possible to determine the type of storage to use
simply by looking at the data structure.
2. True or false: A video being delivered via a streaming service should be considered a
static asset.
3. What is the maximum file size that can be sent to the S3 service in one PUT
command?
4. Which types of security documents allow you to limit the access to the S3 bucket?
5. Which types of database engines are supported on RDS?
6. Can an RDS database be resized without service disruption?
7. True or false: In a DynamoDB database, both the management and data access are
available through the same DynamoDB API.
8. True or false: A DynamoDB database always requires you to specify the RCU and
WCU capacities and use AutoScaling.
9. Which service would you recommend to cache commonly returned responses from a
database?
10. What is an origin access identity in CloudFront?