Data Report Martin Inline Graphics R7 PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

A Unified Data Infrastructure Architecture

Query and Processing


Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
Interpreting the Architecture
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Generate relevant Extract data from Store data in a Present results of


Provide an interface for analysts and data scientists
business and operational systems format accessible to data analysis to
to derive insights (query)
operational data (E) query & processing internal and
systems external users
Execute queries and data models against stored
Deliver to storage,
data, often using distributed compute (processing)
aligning schemas Optimize for low Embed data models
between source cost, scalability, and into operational
and destination (L) analytic workloads systems and
(e.g., column store) applications
Transform data to a
structure ready for In some cases,
analysis (T) provide additional
data structures or
guarantees Describe what Predict what will
happened in the happen in the future
past (including very
recent past) Build data-driven/
ML applications

Coordinate the flow of data and the execution of computations across the full lifecycle

Ensure proper data quality, performance, and governance of all systems and datasets
Three Common Blueprints

Analytic
1 Modern Business Intelligence
Systems

2 Multimodal Data Processing

Operational
3 AI and ML
Systems
1. Modern Business Intelligence Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
2. Multimodal Data Processing Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch, (Snowflake, BigQuery, Redshift)
via CDC Matillion) Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
3. AI and ML Blueprint

Data Transformation Model Training and Development Model Inference

Data Labeling
(Labelbox, Snorkel,
Scale, Sagemaker)

Data Sources
(Data lake + Dataflow Automation
data warehouse + (Airflow, Pachyderm, Elementl, Prefect, Tecton, Kubeflow)
streaming engine)

Query Engines Feature Store Feature Server


(Presto, Hive) (Tecton) (Tecton, Cassandra)

Data Science
Libraries
(Spark, Pandas,
NumPy, Dask)

Data Science Platform Model Batch Predictor


(Jupyter, Databricks, Domino, Sagemaker, DataRobot, Registry (Spark)
H2O, Colab, Deepnote, Noteable) (Algorithmia,
MLflow,
Sagemaker) Online Model Clients
Server
Experiment ML (TF Serving, Ray
Tracking Framework Compiler Serve, Seldon)
(Weights and (Scikit-learn, (TVM)
Biases, Comet, XGBoost, MLlib)
MLflow)
Model
DL Monitoring
Visualization Framework (Fiddler, Arthur,
(Tensorboard, (TensorFlow, Keras, Arize)
Fiddler) PyTorch, H2O)

Model Tuning RL Libraries


(Sigopt, hyperopt, (Gym, Dopamine,
Ray Tune) RLlib, Coach)

Distributed
Processing
(Spark, Ray, Dask,
Distributed TF,
Kubeflow,
Horovod)

You might also like