Solutions Partner Technical Onboarding Guide

Solutions Partner
Technical Onboarding Guide

Table of Contents 
 
Snowflake’s Technical Expectations of Solutions Partners

Technical Onboarding Guide Overview

General Tips

Snowflake Basics

Snowflake Architecture
A Deeper Dive
Snowflake Deployment
Snowflake Storage
Snowflake Security
Data Sharing

Working With Snowflake
Snowflake Objects
Accessing and Connecting
Loading and Querying

Snowflake Ecosystem and Competitive Positioning

Snowflake Best Practices
Design
Semi-Structured Data
Loading
Performance/Billing/Usage Optimization
Security
Data Sharing
 
   
Snowflake’s Technical Expectations of Solutions Partners 
Snowflake expects our Solutions Partners to staff their Snowflake engagements with
Consultants who have existing expertise in the Snowflake fundamentals. With this knowledge
base, these Consultants should be able to serve as effective hands-on Snowflake developers
with minimal or no ramp-up time.

Consultants staffed on Snowflake engagements are expected to have a comprehensive
understanding of:
● Snowflake’s architecture
● Snowflake’s advantages vs. both traditional and cloud competitor database and data
warehouse technologies
● Snowflake’s positioning in the data warehousing, data integration, business intelligence,
and analytics markets.
● The basic processes and best practices for loading data into Snowflake
● How Snowflake stores data and optimizes its querying engine
● How to connect to and work with Snowflake using various techniques and technologies
● How to evaluate and optimize Snowflake performance and usage.

In addition, Consultants whom Solutions Partners staff on Snowflake engagements are expected
to be familiar with the Snowflake technical enablement resources available to them. These
resources should be leveraged for first-level technical enablement, engagement ramp-up, and
troubleshooting. These resources include but are not limited to:
● Snowflake Documentation
● Snowflake Resources Guide
● Snowflake Blog
● Snowflake YouTube Channel
● Snowflake Webinars
● Snowflake Lodge Community
Technical Onboarding Guide Overview 
The Snowflake Solutions Partner Technical Onboarding Guide provides its audience with
highlights of the concepts in which any Solutions Partner Consultant working on a Snowflake
engagement is expected to have expertise.

Each set of highlights is followed by links to self-guided learning materials for the concept.
These materials include documentation, videos, blogs, case studies, and more that provide
greater detail on the given topic. They are extensive and technically comprehensive, and the
expectation is that Consultants will leverage them to build additional knowledge of any concept
in which they need greater expertise.

   
General Tips 
● The fastest way to learn Snowflake is to use it.
○ Set up an account.
■ Does your company have a Snowflake Partner Account? If so, work with
your Account Administrator to have a developer account provisioned for
you.
■ No Partner Account? No problem! Sign up for a self-service account.
You’ll receive $400 in free Snowflake credits for use in your first 30 days.
○ Jumpstart your learning journey with tutorials
■ Snowflake in 20 Minutes
■ Bulk Loading from a Local File System
■ Bulk Loading from Amazon S3
■ JSON Basics
■ + Tutorials Available in the UI Worksheet section and on YouTube
● The Snowflake Documentation is your secret to becoming a SnowPro.
○ Very comprehensive--most questions can be answered by reading through the
documentation on the concept
○ A Google search of “Snowflake <Topic/Question>” will often pull up the relevant
section in the Documentation
● The Snowflake Lodge Community is Snowflake's online community and support portal
for Snowflake Customers and Partners. The Lodge hosts:
○ Release updates
○ Important alerts
○ Useful articles posted by Snowflake's team of experts
○ Community forums
● Snowflake Webinars are a great way to get up to speed on everything Snowflake
○ Multiple upcoming Webinars per week + on-demand Webinars
○ Every Tuesday: Snowflake Office Hours/Talk to a Customer Live
○ Every Thursday: Snowflake Live Demos
● The Snowflake Resources Guide is a one-stop shop for everything Snowflake has
published.
○ Case Studies
○ Articles
○ Whitepapers
○ eBooks
○ Migration Guides
○ Blogs
○ Videos
● An online session of a Snowflake fundamentals class was recorded and is available for
you to listen to on demand.
Snowflake Basics 
● Snowflake history:
○ Founded in 2012 in “stealth mode” by industry veterans
○ Initial customers came on board in 2014
○ Became Generally Available in 2015
○ Cloud Providers:
■ Amazon Web Services (AWS)--since inception
■ Microsoft Azure Cloud Platforms--generally available as of September
2018
○ Snowflake’s global presence and deployments continue to grow
■ Currently available on select Regions within each Cloud Provider
■ Regions for each Cloud Provider have been added over time
■ Additional Regions for each Cloud Provider are on the near-future
roadmap
● Snowflake’s mission is to enable every organization to be data-driven.
○ Make Better, Quicker Business Decisions
○ Scale Your Analytics, Scale Your Business
○ Create a Data-Driven Customer Journey
● Snowflake is the only data warehouse built for the cloud
○ Built from scratch to take full advantage of the elasticity offered by the cloud
○ Most cloud data warehouse options have on-premises foundations that have
been modified to provide cloud availability; this limits their elastic capabilities
○ Google BigQuery is another technology built for the cloud; however, it is a query
engine and not a data warehouse.
● As the only data warehouse built for the cloud, Snowflake offers several key
differentiators:
○ Single place for data, both structured and semi-structured
○ Instant and unlimited on-demand scalability in both Storage and Compute
○ Minimal management
○ Instant and live data sharing
○ Pay only for what is used, down to the second
● Snowflake’s unique architecture is known as multi-cluster, shared data
○ The Snowflake architecture is comprised of three distinct layers
■ Storage (Databases)
● Cloud object storage via AWS S3 or Azure Blobs
● The Snowflake data
● Scales infinitely as data requirements grow
● Costs are based on a daily average of all compressed data
storage, including data stored according to the Time Travel
retention policy and Fail-safe
■ Compute (Virtual Warehouses)
● Cloud compute via AWS EC2 or Azure Compute
● The Snowflake brawn/horsepower
● Can scale up and out to handle workloads, even as queries are
running
■ Cloud Services
● Transparent to end users
● The Snowflake brain
○ Security management
○ Infrastructure management
○ Metadata management
○ Query optimization
● Always on with a Snowflake account
○ Snowflake’s Storage (Database) and Compute (Warehouse) layers are
completely separate and independent
■ The layers scale up and down independently
■ A link between the two is established only by a Role provisioned to a
user--a Role with access to a Warehouse may use the Warehouse to
query any data (Database) the same Role has access to
● Snowflake’s processing engine is ANSI SQL, the most familiar and utilized database
querying language. SQL capabilities have been natively built into the product.
● Customer use cases include data lakes, enterprise data warehouses, and consolidation
of data silos/data marts
● A customer does not have to have a preexisting account with AWS or Microsoft Azure in
order to have a Snowflake account or to use Snowflake
● Snowflake Basics Self-Guided Learning Material
○ Get to Know Snowflake as a Company
■ Preparing to Power the Data Economy​ (Video)
■ Snowflake Website
■ About Snowflake
■ Reinventing the Data Warehouse​ (Video)
■ Snowflake Values
■ Business Insider Article on Snowflake's Values Driving Its 2014-2018
Growth
■ GeekWire Article - With Huge New $450m Funding Round, Snowflake
Computing Has Now Raised Almost $1 Billion
■ Sequoia's Series F Investment Thesis
○ Learn the Basics
■ Introduction to Snowflake - Key Concepts & Architecture
■ Getting Started with Snowflake – Key Concepts​ (Video)
■ Overview of Key Features
■ Before You Begin
■ Snowflake Architecture​ (Video)
■ Introduction to Semi-Structured Data
■ Bringing Together Structured and Semi-Structured Data​ (Video)
■ Data Warehouse as Service​ (Video)
■ See Snowflake in 8 Minutes​ (Video)
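
To make the Role-based link between Storage and Compute concrete, the following is a minimal SQL sketch; the role, user, warehouse, and database names (analyst_role, jane_doe, analytics_wh, sales_db) are hypothetical examples rather than objects from this guide.

-- Create a role and a user, then grant the role to the user
CREATE ROLE analyst_role;
CREATE USER jane_doe PASSWORD = '********' DEFAULT_ROLE = analyst_role;
GRANT ROLE analyst_role TO USER jane_doe;

-- Grant the role access to Compute (a Virtual Warehouse)...
GRANT USAGE ON WAREHOUSE analytics_wh TO ROLE analyst_role;

-- ...and to Storage (a Database and Schema); the Role is the only link between the two
GRANT USAGE ON DATABASE sales_db TO ROLE analyst_role;
GRANT USAGE ON SCHEMA sales_db.public TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.public TO ROLE analyst_role;

With these grants in place, the user can use analytics_wh to query sales_db only because both privileges are attached to the same Role.
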
Snowflake Architecture 

A Deeper Dive 
● Snowflake’s unique architecture is known as multi-cluster, shared data
○ The Snowflake architecture is comprised of three distinct layers
■ Storage (Databases)
● Cloud object storage via AWS S3 or Azure Blobs
● The Snowflake data
● Scales infinitely as data requirements grow
● Costs are based on a daily average of all compressed data
storage, including data stored according to the Time Travel
retention policy and Fail-safe
■ Compute (Virtual Warehouses)
● Cloud compute via AWS EC2 or Azure Compute
● The Snowflake brawn/horsepower
● Can scale up and out to handle workloads, even as queries are
running
■ Cloud Services
● Transparent to end users
● The Snowflake brain
○ Security management
○ Infrastructure management
○ Metadata management
○ Query optimization
● Always on with a Snowflake account
○ Multi-cluster means that in addition to multiple Virtual Warehouses in a
deployment, an individual Virtual Warehouse may be built to scale up or down
automatically to include multiple nodes, or clusters, depending on workload
■ A multi-cluster Virtual Warehouse will add clusters automatically based on
query activity and scale down, or turn off clusters, when query activity
slows
■ The Snowflake User does not have to specify which cluster each query
will run on in a multi-cluster warehouse--Snowflake’s query optimizer
makes this decision automatically
● Snowflake’s processing engine is ANSI SQL, the most familiar and utilized database
querying language. SQL capabilities have been natively built into the product.
○ This:
■ Allows customers to leverage the skills they already have
■ Enables interoperability with trusted tools, specifically in data integration
and business intelligence
■ Promotes simplified migration from legacy platforms
○ SQL functionality can be extended via SQL User Defined Functions (UDFs),
Javascript UDFs, session variables, and (in Preview) Stored Procedures
● Snowflake supports structured and semi-structured data within one fully SQL data
warehouse.
○ Semi-structured data strings are stored in a column with a Snowflake data type of
“VARIANT”
○ Snowflake’s storage methodology optimizes semi-structured, or VARIANT,
storage based on repeated elements
○ Just like structured data, semi-structured data can be queried using SQL while
incorporating JSON path notation
● Snowflake uses caching to optimize performance
○ All cache types are invalidated if the underlying data changes.
○ Snowflake has three types of caching:
■ Warehouse caching
● Stores data that has been loaded into Virtual Warehouses during
querying to speed up additional queries on the same data, even if
different aggregates are requested
● May be reset/invalidated if the Virtual Warehouse is suspended
and resumed
● Lives on the Compute (Virtual Warehouse) instance
■ Metadata caching
● Stores statistics and other metadata about database objects and their
micro-partitions (e.g., row counts and column value ranges)
● Used to optimize queries and improve query compile time
● Lives in the Cloud Services layer
■ Results caching
● Stores the results of executed queries for 24 hours; if the underlying
data changes, the entry is invalidated
● Lives in the Cloud Services layer
● Snowflake Architecture - A Deeper Dive Self-Guided Learning Material
○ Snowflake Architecture
○ Data Storage Considerations
○ Virtual Warehouses
○ Cloud Services
○ The Snowflake X-Factor: Separate Metadata Processing​ (Video)
○ Queries
○ UDFs (User-Defined Functions)
○ Semi-Structured Data Types
○ Querying Semi-Structured Data
○ Using Persisted Query Results
○ Accelerating BI Queries with Caching in Snowflake​ (Video)
○ Top 10 Cool Snowflake Features, #10: Snowflake Query Result Sets Available to
Users via History
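
As a minimal sketch of the querying behavior described above (standard SQL over a VARIANT column using JSON path notation), assuming a hypothetical raw_events table and field names:

-- Load a sample JSON document into a VARIANT column
CREATE OR REPLACE TABLE raw_events (v VARIANT);
INSERT INTO raw_events
  SELECT PARSE_JSON('{"user": {"id": 42, "name": "Ada"}, "items": [{"sku": "A1"}, {"sku": "B2"}]}');

-- Dot path notation plus a cast to a relational type
SELECT v:user.name::STRING AS user_name,
       v:user.id::NUMBER   AS user_id
FROM raw_events;

-- FLATTEN explodes the nested array into rows
SELECT f.value:sku::STRING AS sku
FROM raw_events,
     LATERAL FLATTEN(input => v:items) f;
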

   
Snowflake Deployment 
● Snowflake is currently available on two Cloud Providers:
○ Amazon Web Services (AWS)--since inception
○ Microsoft Azure Cloud Platforms--generally available as of September 2018
● A customer does not have to have a preexisting account with AWS or Microsoft Azure in
order to have a Snowflake account or to use Snowflake
● Snowflake is deployed in one Region for a given Cloud Provider
○ The Cloud Provider and Region are selected during Snowflake account setup
○ Currently available on select Regions within each Cloud Provider
○ Regions for each Cloud Provider have been added over time
○ Additional Regions for each Cloud Provider are on the near-future roadmap
○ Each Snowflake account is located in a single region, i.e. multi-region accounts
are not supported.
● Within a Cloud Provider Region, there are several data centers; these are called
Availability Zones.
○ Snowflake automatically replicates a Snowflake account's Storage and Cloud
Services layers across 3 Availability Zones within its Region
○ This replication provides built-in disaster recovery and high availability--protecting
against the risk of a data center within a region going down--all fully managed by
Snowflake
● Snowflake Architecture - Deployment Self-Guided Learning Material
○ Technology Partners - Amazon Web Services
○ Technology Partners - Microsoft Azure
○ Snowflake Regions
○ Snowflake delivers 'High Availability' by using AWS availability zones
○ How to Make Data Protection and High Availability for Analytics Fast and Easy
○ Azure Friday | Azure Availability Zones​ (Video)

   
Snowflake Storage 
● Snowflake is built on scalable cloud blob storage, either Amazon S3 or Azure Blobs.
● Snowflake automatically partitions data in a process called “automatic micro-partitioning”.
○ Automatic in the sense that a partition scheme does not need to be defined up
front--Snowflake determines and creates it automatically
○ Data is partitioned automatically based on the natural ingestion order
○ New partitions are cut mostly based on physical properties, such as the size of
the compressed data and how much memory the decompressed data would take
up.
○ This avoids skews between partitions.
○ The natural ingestion order maintains correlations between columns which could
be useful for pruning.
○ As new partitions are created based on physical properties as opposed to logical
properties, the partitions can overlap in ranges, which also reduces skew.
○ The partitions are “micro” because their size is kept small--usually up to a few
tens of megabytes--which enables very fine-grained pruning using Snowflake
metadata, which behaves almost like an index.
● The Snowflake Storage Layer holds all the diverse data, tables, and query results for
Snowflake.
○ When storing semi-structured data, Snowflake optimizes the storage based on
repeated elements within the semi-structure strings
○ Snowflake stores structured and semi-structured data in the same proprietary file
format
● Snowflake’s storage architecture enables two of Snowflake’s key features:
○ Time Travel
■ Protection from accidental data operations
■ Ability to recover data without the costs of running backups, purchasing
additional hardware to handle backups, incurring database downtime and
the overhead of additional administration
■ Previous versions of data automatically retained--retention period
selected by customer (up to 90 days for Enterprise)
○ Zero-Copy Cloning
■ Metadata-only operation
■ No replication of data!
■ Unmodified data stored only once; modified data stored as new blocks
■ Facilitates:
● The provisioning of real, Production data for Dev and Test without
physically copying the data
● Quickly supporting and creating point-in-time snapshots
● Instantaneous backups with no additional cost
● Snowflake storage costs are based on a daily average of all compressed data storage
○ This includes data stored according to Time Travel retention policy and failsafe
practices
○ There is no difference in storage cost allocation between structured and
semi-structured data
● Snowflake Architecture - Storage Self-Guided Learning Material
○ Micro-Partitions
○ Data Storage Usage
○ Cloning Tables, Schemas, and Databases
○ A Quick Look at Zero-Copy Cloning​​ (Video)
○ Top 10 Cool Snowflake Features, #7: Snowflake Fast Clone
○ Understanding & Using Time Travel
○ Data Protection with Time Travel in Snowflake​ (Video)
○ Storage Costs for Time Travel and Fail-safe
○ Bulk Loading from Amazon S3 Using COPY
○ Bulk Loading from Microsoft Azure Using COPY
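
A minimal SQL sketch of the Time Travel and Zero-Copy Cloning features described above; the prod_db and dev_db objects are hypothetical and '<query_id>' is a placeholder for a real query ID:

-- Zero-copy clone: a metadata-only operation, no data is physically copied
CREATE DATABASE dev_db CLONE prod_db;

-- Time Travel: query a table as it existed at an earlier point in time
SELECT * FROM prod_db.public.orders AT (OFFSET => -3600);            -- one hour ago (seconds)
SELECT * FROM prod_db.public.orders BEFORE (STATEMENT => '<query_id>');

-- Recover an accidentally dropped table within the retention period
DROP TABLE prod_db.public.orders;
UNDROP TABLE prod_db.public.orders;
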

   
Snowflake Security 
● Snowflake uses role-based security--roles are the entities to which privileges can be
granted and revoked.
○ A user can be assigned multiple roles, in which case the user’s privileges are the
combination of the privileges granted to all of the roles that have been assigned
to the user.
○ Roles can be also granted to other roles, creating a hierarchy of roles. The
privileges associated with a role are inherited by any roles above that role in the
hierarchy.
○ When a new Snowflake object is created, all Users with the Role that created the
object will have access to it, as well as all Users and Roles that the creating
Role rolls up to
○ A Snowflake Role is the only thing that connects Storage (Database) to Compute
(Virtual Warehouse)
● Snowflake provides embedded multi-factor authentication across all of its Editions
● All data in Snowflake is automatically encrypted using strong AES 256-bit encryption
regardless of Snowflake Edition; Snowflake’s higher editions--Enterprise, Enterprise for
Sensitive Data, and Virtual Private Snowflake--deliver additional security protections
such as:
○ Support for user SSO (single sign-on) through federated authentication.
(Enterprise and higher)
○ Periodic rekeying of encrypted data. (Enterprise and higher)
○ Support for encrypting data using customer-managed keys. (Enterprise for
Sensitive Data and higher)
○ Support for HIPAA compliance. (Enterprise for Sensitive Data and higher)
○ PCI DSS compliance. (Enterprise for Sensitive Data and higher)
● Snowflake tracks all access and usage automatically so that the Customers don’t have
to do so manually
● Snowflake Architecture - Security Self-Guided Learning Material
○ Snowflake Security Overview​ (Video)
○ Crucial Security Controls for Your Cloud Data Warehouse​ (Video)
○ Summary of Security Features
○ Overview of Access Control
○ Data Encryption
○ Industrial-Strength Security by Default
○ Snowflake Editions
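
A minimal sketch of role-based access control and privilege inheritance as described above; the custom roles and the finance_db database are hypothetical:

USE ROLE SECURITYADMIN;
CREATE ROLE reporting_role;
CREATE ROLE finance_role;

-- finance_role inherits everything granted to reporting_role
GRANT ROLE reporting_role TO ROLE finance_role;

-- Roll the custom roles up to SYSADMIN so privileges are inherited up the hierarchy
GRANT ROLE finance_role TO ROLE SYSADMIN;

-- Privileges granted to reporting_role are inherited by finance_role and SYSADMIN
GRANT USAGE ON DATABASE finance_db TO ROLE reporting_role;
GRANT USAGE ON SCHEMA finance_db.public TO ROLE reporting_role;
GRANT SELECT ON ALL TABLES IN SCHEMA finance_db.public TO ROLE reporting_role;
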

   
Data Sharing 
● Data Sharing is a brand new feature in the data market; it is unique to Snowflake
● Data Sharing is integrated with Snowflake’s Role-based access controls
● Data Sharing Providers:
○ Incur the cost of the data storage; there is no upcharge for shared data--normal
storage costs apply
○ May share a data set with an unlimited number of accounts
○ May set up and manage “Reader Accounts”
■ Accounts for Consumers who are not already Snowflake customers
■ These Consumers are not required to become Snowflake customers in
order to access the shared data
● Data Sharing Consumers:
○ Incur the cost of the compute
○ Must use the ACCOUNTADMIN role to create a database based on the share;
otherwise, the share will not be available in the Consumer account
○ Can query shared objects in the same query that they query their own objects
● Currently, Data Sharing is only supported for Consumer accounts residing in the same
Cloud Provider and Region as the Provider account
● Snowflake Architecture - Data Sharing Self-Guided Learning Material
○ Introduction to Data Sharing
○ Getting Started With Data Sharing
○ Data Sharing for Dummies​ (Webinar)
○ Data Sharing for Dummies​ (eBook)
○ Data Providers
○ Data Consumers
○ New and Improved Snowflake Data Sharing​ (Video)
○ Modern Data Sharing: The Opportunities Are Endless
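
A minimal sketch of the Provider and Consumer sides of a Share as described above; the account identifiers (consumer_account, provider_account) and object names are hypothetical:

-- Provider side (requires ACCOUNTADMIN)
USE ROLE ACCOUNTADMIN;
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;

-- Consumer side (requires ACCOUNTADMIN to create a database from the share)
USE ROLE ACCOUNTADMIN;
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;
SELECT COUNT(*) FROM shared_sales.public.orders;
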

   
Working With Snowflake 

Snowflake Objects 
● The ACCOUNTADMIN role in Snowflake provides a user the ability to administer both
security (i.e., the SECURITYADMIN role) and objects (i.e., the SYSADMIN role) within an
account; this role has overarching privileges across the Snowflake account and should be
handed out with extreme caution
● Data (Storage)
○ All data in Snowflake is maintained in databases.
○ Each database consists of one or more schemas, and, within these schemas,
one or more tables and/or views.
○ Schemas can be thought of as logical groupings of database objects, such as
tables and views by concept or purpose.
○ Snowflake does not place any hard limits on the number of databases, schemas
(within a database), or objects (within a schema) you can create.
● Processing (Compute)
○ Compute in Snowflake is provisioned via Virtual Warehouses
○ Virtual Warehouses can be configured to auto-suspend after a specified period of
inactivity, or auto-resume as soon as a user requests an operation that requires
compute power
■ Auto-suspend / auto-resume can be specified upon initial configuration, or
modified as needed after the fact
■ The right balance of auto-suspend / auto-resume will result in an
optimized bill--the minimum billed time per warehouse startup is 1
minute; after that first minute, usage is billed by the second (a
configuration sketch appears at the end of this section)
● Table Design
○ Snowflake can support both structured and semi-structured data
○ Semi-structured data should be stored, as is, in a table column with a data type
of VARIANT
■ A Snowflake table can contain a VARIANT column alongside columns of
other data types, such as a timestamp
■ Snowflake supports data in VARIANTs up to a maximum size of 16MB
compressed.
○ Snowflake’s proprietary algorithm stores and processes semi-structured data in a
unique way:
■ Repeating keys and paths are further stored as separate physical
columns, similar to regular SQL attributes; currently, this is not true of
data stored in arrays.
■ For data that is mostly regular and uses only native JSON types (strings
and numbers, not timestamps), both storage and query performance for
operations on relational data and data in a VARIANT column is very
similar.
■ Non-native values such as dates and timestamps are stored as strings
when loaded into a VARIANT column, so operations on these values
could be slower and also consume more space than when stored in a
relational column with the corresponding data type.
● Working With Snowflake - Data Objects Self-Guided Learning Material
○ Overview of Warehouses
○ What is a Multi-cluster Warehouse?
○ Tackling High Concurrency with Multi-Cluster Warehouses​ (Video)
○ Databases, Tables & Views
○ Understanding Snowflake Table Structures
○ Semi-Structured Data
○ Using the ACCOUNTADMIN Role
○ Role Hierarchy and Privilege Inheritance
○ Accessing Database Objects
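
A minimal sketch of provisioning the Compute and Storage objects described above, including the auto-suspend / auto-resume settings referenced earlier in this section; all object names are hypothetical:

-- Compute: a Virtual Warehouse that suspends after 5 minutes of inactivity
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 300          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Storage: database -> schema -> table, with a VARIANT column for semi-structured data
CREATE DATABASE weblogs_db;
CREATE SCHEMA weblogs_db.raw;
CREATE TABLE weblogs_db.raw.events (
  loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  payload   VARIANT        -- semi-structured data stored as is (up to 16 MB compressed)
);
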

   
Accessing and Connecting 
● Snowflake is a full SQL database--both structured and semi-structured data can be
queried using SQL syntax
● Snowflake offers a web-based User Interface (UI) in which a Snowflake account’s
Storage, Compute, Queries, and Query History and Performance can be created, accessed,
and managed. The UI is divided into four basic areas:
○ Databases - Storage
○ Warehouses - Compute
○ Worksheet - where queries are built and executed
○ History - where all query history lives and the Query Profiler is accessed
○ Please note:
■ Users with the ACCOUNTADMIN role will see an extra area, Account.
This area allows them to monitor billing & usage, manage users and
roles, set security policies, monitor user sessions, create monitors to
manage and automatically monitor account usage, and create reader
accounts
■ Accounts may now also see a “Shares” area if they have any data shared
with them and they have the right to access it.
● Snowflake offers a variety of connectivity options:
○ SnowSQL: Snowflake’s own command line interface client; can only be
downloaded by an authenticated user from the Snowflake UI
○ ODBC: open database connectivity driver; can only be downloaded by an
authenticated user from the Snowflake UI
○ JDBC: java database connectivity; can be downloaded publicly
○ Python: can be downloaded publicly
○ Node.js: can be downloaded publicly
● The PUT and GET commands cannot be run via the Snowflake User Interface; rather,
they can only be run using the SnowSQL client (see the sketch at the end of this
section). These commands are not currently supported by the ODBC Driver.
● Working With Snowflake - Accessing and Connecting Self-Guided Learning
Material
○ Quick Tour of the Web Interface
○ User Interface Quick Tour​ (Video)
○ Making the Most of the Snowflake Worksheet​ (Video)
○ Connecting to Snowflake
○ Using SnowSQL
○ Connecting to Snowflake with Oracle SQL Developer Data Modeler (SDDM)
(Video)
○ Top 10 Cool Snowflake Features, #9: Connect to Snowflake With JDBC
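
A minimal sketch of staging a local file with PUT and retrieving it with GET from a SnowSQL session, as noted above; the stage name and local file paths are hypothetical:

-- Create an internal stage managed by Snowflake
CREATE STAGE my_internal_stage;

-- Run inside a SnowSQL session; uploads and compresses the local file into the stage
PUT file:///tmp/orders.csv @my_internal_stage AUTO_COMPRESS = TRUE;

-- Verify what has been staged
LIST @my_internal_stage;

-- Download staged files back to the local machine
GET @my_internal_stage file:///tmp/downloads/;
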

   
Loading and Querying 
● Two separate commands can be used to load data into Snowflake:
○ COPY
■ Bulk insert
■ Allows insert on a SELECT against a staged file, but a WHERE clause
cannot be used
■ More performant
○ INSERT
■ Row-by-row insert
■ Allows insert on a SELECT against a staged file, and a WHERE clause
can be used
■ Less performant
● Snowflake also offers a continuous data ingestion service, Snowpipe, to detect and load
streaming data:
○ Snowpipe loads data within minutes after files are added to a stage and
submitted for ingestion.
○ The service provides REST endpoints and uses Snowflake-provided compute
resources to load data and retrieve load history reports.
○ The service can load data from any internal (i.e. Snowflake) or external (i.e. AWS
S3 or Microsoft Azure) stage.
○ With Snowpipe’s serverless compute model, Snowflake manages load capacity,
ensuring optimal compute resources to meet demand. In short, Snowpipe
provides a “pipeline” for loading fresh data in micro-batches as soon as it’s
available.
● To load data into Snowflake, the following must be in place:
○ A Virtual Warehouse
○ A pre-defined target table
○ A Staging location with data staged
○ A File Format
● Snowflake supports loading from the following file/data types:
○ Text Delimited (referenced as CSV in the UI)
○ JSON
○ XML
○ Avro
○ Parquet
○ ORC
● Data must be staged prior to being loaded, either in an Internal Stage (managed by
Snowflake) or an External Stage (self-managed) in AWS S3 or Azure Blob Storage
● As data is loaded:
○ Snowflake compresses the data and converts it into an optimized internal format
for efficient storage, maintenance, and retrieval.
○ Snowflake gathers various statistics for databases, tables, columns, and files and
stores this information in the Metadata Manager in the Cloud Services Layer for
use in query optimization
● In querying data within Snowflake:
○ It is possible to run a query against the Result Cache with no Virtual Warehouse
running and retrieve results if the query is cached
○ Uncached queries will require a Virtual Warehouse to execute
○ The Query Profile can be used to analyze the execution details of a query
■ Available for both in-progress and completed queries
■ Provides:
● A graphical representation of the main components of the
processing plan for the query
● Statistics for each component of the query
● Details and statistics for the overall query
● Working With Snowflake - Loading and Querying Self-Guided Learning Material
○ Summary of Data Loading Features
○ Getting Started - Introduction to Data Loading​ (Video)
○ Loading Data
○ Bulk Load Using COPY
○ COPY INTO <table>
○ INSERT
○ Getting Started - Introduction to Databases and Querying​ (Video)
○ Easily Loading and Analyzing Semi-Structured Data in Snowflake​ (Video)
○ Processing JSON data in Snowflake​ (Video)
○ Queries
○ Analyzing Queries Using Query Profile
○ Using the History Page to Monitor Queries
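
A minimal sketch of a bulk load with COPY using a named File Format and an External Stage, reflecting the prerequisites listed above; the object names, the S3 URL, and the credentials are hypothetical placeholders:

-- A named File Format describing the staged files
CREATE FILE FORMAT csv_ff TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1;

-- An External Stage pointing at self-managed cloud storage
CREATE STAGE orders_stage
  URL = 's3://example-bucket/orders/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')   -- placeholders
  FILE_FORMAT = (FORMAT_NAME = 'csv_ff');

-- Bulk load into a pre-defined target table using a running Virtual Warehouse
COPY INTO sales_db.public.orders
  FROM @orders_stage
  PATTERN = '.*[.]csv[.]gz';

-- A SELECT against the staged files is allowed, but a WHERE clause is not
COPY INTO sales_db.public.orders
  FROM (SELECT $1, $2, TO_DATE($3) FROM @orders_stage);
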

   
Snowflake Ecosystem and Competitive Positioning 
● Snowflake does not provide tools to extract data from source systems and/or visualize
data--it relies upon its Technology Partners to provide those tools
● Snowflake’s relationships with/integrations with Technology Partners are driven largely
by customer requests and needs for them
● Snowflake engages with Technology Partners and works with technologies that are both
cloud and on-premises based
● As most activity in Snowflake revolves around integrating and visualizing data, Data
Integration and Business Intelligence technologies are the most prevalent in the
Snowflake Technology Ecosystem
● Various technologies offer different levels of integrations and advantages with
Snowflake:
○ ELT tools like Talend and Matillion leverage Snowflake's scalable compute for
data transformation by pushing transform processing down to Snowflake
○ BI tools like Tableau and Looker offer native connectivity built into their products,
with Looker leveraging Snowflake’s in-database processing scalable compute for
querying
○ Snowflake has built a custom Spark library that allows the results of a Snowflake
query to be loaded directly into a dataframe
● To fully understand the advantages of Snowflake, one must understand how it compares
with its competitors:
○ On-Premises EDW
■ Instant scalability
■ Separation of compute and storage
■ No need for data distribution
○ Cloud EDW
■ Concurrency
■ Automatic failover and disaster recovery
■ Built for the cloud
○ Hadoop
■ No hardware to manage
■ No need to manage files
■ Native SQL (including on semi-structured)
○ Data Engines
■ No need to manage files
■ Automated cluster management
■ Native SQL
○ Apache Spark
■ No need to manage data files
■ Automated cluster management
■ Full SQL Support
● Snowflake Ecosystem and Competitive Positioning Self-Guided Learning Material
○ Technology Partners
○ Code of Conduct for Competitive Partnerships
○ Overview of the Ecosystem
○ Snowflake Partner Connect
○ Getting Started on Snowflake with Partner Connect​ (Video)
○ Cloud Data Warehouse Buyer’s Guide
○ Snowflake Partners YouTube Playlist​ (Videos)
○ Snowflake Customers YouTube Playlist​ (Videos)

   
Snowflake Best Practices 

Design 
● Snowflake supports proven data modeling techniques, including:
○ 3rd Normal Form
○ Data Vault
○ Star Schema / Snowflake Schema
● Best practice is to use one of these proven techniques, with the optimal one selected
based on the data use case and audience
● Snowflake is unique in its manner of handling constraints; despite this uniqueness, it is
still best practice to use them
○ Snowflake supports defining and maintaining constraints, but only enforces NOT
NULL constraints
○ Referential integrity, uniqueness, and primary keys are not enforced
○ However, many downstream BI and Analytics tools leverage constraints to
optimize the manner in which they write their queries; it is therefore best practice
to define these constraints in Snowflake even though they are not enforced
● Snowflake Best Practices - Design Self-Guided Learning Material
○ Table Design Considerations
○ Data Vault Modeling and Snowflake
○ Top 9 Best Practices for Data Warehouse Development
○ Referential Integrity Constraints
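
A minimal sketch of the constraint guidance above: constraints are defined so downstream BI and Analytics tools can use them, even though only NOT NULL is enforced. The table and column names are hypothetical:

CREATE TABLE dim_customer (
  customer_id   NUMBER NOT NULL PRIMARY KEY,   -- only NOT NULL is enforced
  customer_name STRING
);

CREATE TABLE fact_orders (
  order_id    NUMBER NOT NULL PRIMARY KEY,
  customer_id NUMBER REFERENCES dim_customer (customer_id),  -- informational only
  amount      NUMBER(12,2)
);
-- Snowflake records the PRIMARY KEY and FOREIGN KEY definitions in metadata for
-- downstream tools to read, but does not enforce them.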

Semi-Structured Data 
● In Snowflake, it is best practice to load and store semi-structured data--JSON, XML,
Avro, etc.--into a column with a data type of VARIANT vs. parsing the semi-structured
string into structured columns on source data load
● As many BI and Analytics tools will not be able to parse the VARIANT column on their
end, it is best practice to create a View in Snowflake that completes this parsing for them
and makes the data available to them in the structured format they are expecting. By
completing this via a View rather than a Table, the semi-structured data remains in its
complete form in Snowflake, while appearing to be in structured form to 3rd Party
Applications
● Snowflake Best Practices - Semi-Structured Data Self-Guided Learning Material
○ Storing Semi-structured Data in a VARIANT Column vs. Flattening the Nested
Structure
○ Best Practices for Using Tableau with Snowflake
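
A minimal sketch of the View-based pattern described above, exposing a VARIANT column to BI tools in structured form; the table, view, and field names are hypothetical:

CREATE TABLE raw_clicks (v VARIANT);

CREATE VIEW clicks_structured AS
SELECT
  v:event_id::NUMBER  AS event_id,
  v:user.id::NUMBER   AS user_id,
  v:page::STRING      AS page,
  v:ts::TIMESTAMP_NTZ AS event_ts
FROM raw_clicks;
-- The semi-structured data stays intact in raw_clicks, while BI tools query
-- clicks_structured as if it were an ordinary structured table.
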
Loading 
● When loading data into Snowflake, a COPY command will be more performant than an
INSERT command, as COPY uses bulk and INSERT uses row-by-row
● The number of COPY operations that run in parallel cannot exceed the number of data
files to be loaded. To increase the number of parallel operations, it is best practice to
split large chunks of data into separate files.
○ Optimal file size is 10 to 100 MB compressed.
○ Split files by line to avoid records that span chunks.
○ Splitting large files into a greater number of smaller files distributes the load
among the servers in an active warehouse, thereby increasing performance.
○ The number of data files that are processed in parallel is determined by the
number and capacity of servers in a warehouse.
○ If your source database does not allow you to export data files in smaller chunks,
you can use a third-party utility to split large CSV files.
● While Users control file split and size of data being loaded into Snowflake, Snowflake is
fully in control of how the data is divided into micro-partitions after it is loaded into
Snowflake; Users cannot create or configure these partitions
● Snowflake Best Practices - Loading Self-Guided Learning Material
○ Preparing Your Data Files
○ Planning a Data Load
○ FAQ: Does Snowflake have a recommended file size for loading?

   
Performance/Billing/Usage Optimization 
● For most Snowflake tables, it is best practice to allow Snowflake’s automated
micro-partitioning process to fully manage the table’s micro-partitions.
○ As data is loaded into a Snowflake table, Snowflake co-locates column data with
the same values in the same micro-partition, if possible. This process is called
“natural clustering”.
○ For the most part, Snowflake’s natural clustering produces well-clustered and
performant data in tables.
○ Snowflake uses a mechanism called pruning to limit the number of
micro-partitions scanned by each query
● For very large tables--tables in the multi-terabyte (TB) size range--it may be beneficial to
turn on Clustering Keys
○ Clustering keys enable explicitly clustering data in a table according to one or
more columns/expressions in the table vs. allowing Snowflake’s automated
micro-partitioning process to do so
○ To see performance improvements from clustering keys:
■ A table has to be large enough to reside on many micro-partitions
■ The clustering keys have to provide sufficient filtering to select a subset of
these micro-partitions.
○ Some general indicators that can help determine whether to define clustering
keys for a very large table include:
■ Queries on the table are running slower than expected or have noticeably
degraded over time.
■ The clustering ratio for the table is very low and the clustering depth is
very large.
○ Clustering (and Reclustering) is the technique Snowflake Users have at their
fingertips to improve query performance; however, Clustering/Reclustering may
actually cause some queries to run slower, so it should be carefully considered
and evaluated
● Whether Clustering Keys are used or Snowflake’s automatic process is used, Snowflake
data is divided into and stored in micro-partitions.
○ These micro-partitions are immutable, meaning that once they have been written,
they will never be changed or overwritten; rather, subsequent changes of any
type to the data will be written to additional micro-partitions.
○ In addition to storing Table definitions, Snowflake’s metadata repository stores
references to all of the micro-partition files for each Table and tracks all
versions of the table data within the data retention window
● When configuring and managing Virtual Warehouses:
○ The number of Snowflake credits consumed depends on the size of the
warehouse and how long it runs.
○ The keys to using warehouses effectively and efficiently are:
■ Experiment with different types of queries and different warehouse sizes
to determine the combinations that best meet your specific query needs
and workload.
■ Don’t focus on warehouse size. Snowflake utilizes per-second billing, so
you can run larger warehouses (Large, X-Large, 2X-Large, etc.) and
simply suspend them when not in use.
○ It is important to select an auto-suspend value that balances anticipated Compute
needs against how usage is billed
■ A Warehouse is billed for a minimum of 60 seconds on startup, and then by
the second after the initial 60 seconds
■ An auto-suspend value of “Never” is most likely to result in inefficient
credit consumption
■ Note that if a Warehouse suspends and resumes, the Warehouse cache
may be reset, which could result in slower query performance for common
data sets
○ Snowflake provides Resource Monitors to help control costs and avoid
unexpected credit usage related to using Warehouses
■ Can be used to impose limits on the number of credits that Warehouses
consume within each monthly billing period.
■ When these limits are approached and/or reached, the Resource Monitor can
trigger various actions, such as sending alert notifications and suspending
the Warehouses.
■ Resource Monitors can only be created by account administrators (i.e.
users with the ACCOUNTADMIN role); however, account administrators
can choose to enable users with other roles to view and modify resource
monitors.
● Snowflake Best Practices - Performance/Billing/Usage Optimization Self-Guided
Learning Material
○ Micro-Partitions
○ Data Clustering
○ Top 10 Cool Snowflake Features, #1: Automatic Query Optimization. No Tuning!
○ Clustering Keys
○ When to Set a Clustering Key
○ How Clustering Can Improve Your Query Performance
○ Warehouse Considerations
○ Monitoring Warehouse Load
○ Managing Resource Monitors
○ Managing Costs for Short-Lived Tables
○ Managing Costs for Large, High-Churn Tables
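
A minimal sketch of two of the levers discussed above: a clustering key on a very large table and a Resource Monitor capping monthly credit usage. The object names and the quota value are hypothetical:

-- Define a clustering key on a very large table and check how well it is clustered
ALTER TABLE sales_db.public.orders CLUSTER BY (order_date, region);
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_db.public.orders', '(order_date, region)');

-- Cap monthly credit usage and suspend warehouses when the quota is reached
USE ROLE ACCOUNTADMIN;
CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;
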
Security 
● The ACCOUNTADMIN role is the highest set of permissions in the system and should
be limited to a small number of users
● There should always be at least two Users granted the ACCOUNTADMIN role within a
Snowflake account
● All ACCOUNTADMIN users should have multi-factor authentication enabled
● Objects should not be created using the ACCOUNTADMIN role
● It is a best practice to have the majority of objects owned by SYSADMIN
● Snowflake’s role hierarchy and privilege inheritance should be used to align access to
database objects with business functions in your organization
● Snowflake Best Practices - Security Self-Guided Learning Material
○ Aligning Object Access with Business Functions
○ Using the ACCOUNTADMIN Role

Data Sharing 
● Rather than exporting data and sending it to various consumers, it is best practice for
Snowflake customers to leverage Snowflake’s Data Sharing capability
● A consumer does not have to be a Snowflake customer in order to receive shared
Snowflake data; rather, such consumers can access the data via Reader Accounts
● The ACCOUNTADMIN role must be used to configure Data Shares as well as Reader
Accounts
○ On the Provider side, the ACCOUNTADMIN role must be used to create and
configure Shares
○ On the Consumer side, the ACCOUNTADMIN role must be used to create a
database based on the share
○ If the Consumer will be a Reader Account, the Provider must create, configure,
and manage the Reader Account using the ACCOUNTADMIN role
● Snowflake Best Practices - Data Sharing Self-Guided Learning Material
○ Data Sharing Made Simple
○ Using Secure Views to Control Consumer Access
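
A minimal sketch of sharing a Secure View rather than the underlying table, combining the Data Sharing and Secure View practices above; account and object names are hypothetical:

USE ROLE ACCOUNTADMIN;

-- Only Secure Views can be added to a Share, which also hides the view definition
CREATE SECURE VIEW sales_db.public.orders_by_region AS
  SELECT region, order_date, SUM(amount) AS total_amount
  FROM sales_db.public.orders
  GROUP BY region, order_date;

CREATE SHARE region_share;
GRANT USAGE ON DATABASE sales_db TO SHARE region_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE region_share;
GRANT SELECT ON VIEW sales_db.public.orders_by_region TO SHARE region_share;
ALTER SHARE region_share ADD ACCOUNTS = consumer_account;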
