Data Report Martin Inline Graphics R7 PDF
[Architecture diagram. Components and example tools:]
Applications/ERP (Oracle, Salesforce, Netsuite, ...)
Event Collectors (Segment, Snowplow)
Logs
3rd Party APIs (e.g., Stripe)
File and Object Storage
Data Modeling (dbt, LookML)
Workflow Manager (Airflow, Dagster, Prefect)
Spark Platform (Databricks, EMR)
Python Libs (Pandas, Boto, Dask, Ray, ...)
Batch Query Engine (Hive)
Event Streaming (Confluent/Kafka, Pulsar, AWS Kinesis)
Stream Processing (Databricks/Spark, Confluent/Kafka, Flink)
Data Lake (Databricks/Delta Lake, Iceberg, Hudi, Hive Acid; Parquet, ORC, Avro; S3, GCS, ABS, HDFS)
Data Science Platform (Databricks, Domino, Sagemaker, Dataiku, DataRobot, Anaconda, ...)
Data Science and ML Libraries (Pandas, Numpy, R, Dask, Ray, Spark, Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...)
Ad Hoc Query Engine (Presto, Dremio/Drill, Impala)
Real-time Analytics (Imply/Druid, Altinity/Clickhouse, Rockset)
Embedded Analytics (Sisense, Looker, cube.js)
Augmented Analytics (Thoughtspot, Outlier, Anodot, Sisu)
App Frameworks (Plotly Dash, Streamlit)
Custom Apps
Metadata Management (Collibra, Alation, Hive Metastore, DataHub, ...)
Quality and Testing (Great Expectations)
Entitlements and Security (Privacera, Immuta)
Observability (Unravel, Accel Data, Fiddler)
Interpreting the Architecture
Sources | Ingestion and Transformation | Storage | Query and Processing (Historical, Predictive) | Output
Coordinate the flow of data and the execution of computations across the full lifecycle
Ensure proper data quality, performance, and governance of all systems and datasets
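One way to read the diagram is as an ordered list of stages with two bands that span all of them. The following is a minimal sketch of that reading in Python, not something taken from the report itself. The names `STAGES`, `CROSS_CUTTING`, `Component`, `downstream`, and `group_by_stage` are hypothetical, the keys `orchestration` and `governance` are illustrative labels for the two spanning bands, and the placement of each sample component in a stage is an assumption inferred from the diagram layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stage names in left-to-right order, taken from the diagram's column headers.
STAGES = [
    "Sources",
    "Ingestion and Transformation",
    "Storage",
    "Query and Processing",
    "Output",
]

# The two bands that span every stage. The descriptions are quoted from the
# diagram; the dictionary keys are hypothetical labels chosen for this sketch.
CROSS_CUTTING = {
    "orchestration": "Coordinate the flow of data and the execution of "
                     "computations across the full lifecycle",
    "governance": "Ensure proper data quality, performance, and governance "
                  "of all systems and datasets",
}


@dataclass
class Component:
    """A box in the diagram: a capability plus example tools."""
    name: str
    stage: str
    examples: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject stage names that are not columns of the diagram.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


def downstream(stage: str) -> List[str]:
    """Return the stages to the right of the given stage."""
    return STAGES[STAGES.index(stage) + 1:]


def group_by_stage(components: List[Component]) -> Dict[str, List[str]]:
    """Map every stage, in diagram order, to the component names placed in it."""
    grouped: Dict[str, List[str]] = {stage: [] for stage in STAGES}
    for component in components:
        grouped[component.stage].append(component.name)
    return grouped


# Sample components; the stage assignments are assumptions, not from the report.
COMPONENTS = [
    Component("Event Collectors", "Sources", ["Segment", "Snowplow"]),
    Component("Workflow Manager", "Ingestion and Transformation",
              ["Airflow", "Dagster", "Prefect"]),
    Component("Data Lake", "Storage", ["Delta Lake", "Iceberg", "Hudi"]),
    Component("Ad Hoc Query Engine", "Query and Processing",
              ["Presto", "Dremio/Drill", "Impala"]),
    Component("App Frameworks", "Output", ["Plotly Dash", "Streamlit"]),
]

if __name__ == "__main__":
    for stage, names in group_by_stage(COMPONENTS).items():
        print(f"{stage}: {', '.join(names)}")
```

Keeping the stage order in a single list makes questions such as "what sits downstream of storage?" a simple slice, and validating the stage name in `__post_init__` catches a mistyped column before it silently creates a new group.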
Three Common Blueprints
1. Modern Business Intelligence (Analytic Systems)
2. Multimodal Data Processing
3. AI and ML (Operational Systems)
1. Modern Business Intelligence Blueprint
Sources | Ingestion and Transformation | Storage | Query and Processing (Historical, Predictive) | Output
[Blueprint diagram. The extracted labels are identical to the component list of the architecture diagram at the start of this section.]
2. Multimodal Data Processing Blueprint
Sources | Ingestion and Transformation | Storage | Query and Processing (Historical, Predictive) | Output
[Blueprint diagram. The extracted labels are identical to the component list of the architecture diagram at the start of this section.]
3. AI and ML Blueprint
[Blueprint diagram. Components and example tools:]
Data Labeling (Labelbox, Snorkel, Scale, Sagemaker)
Data Sources (Data lake + data warehouse + streaming engine)
Dataflow Automation (Airflow, Pachyderm, Elementl, Prefect, Tecton, Kubeflow)
Data Science Libraries (Spark, Pandas, NumPy, Dask)
Distributed Processing (Spark, Ray, Dask, Distributed TF, Kubeflow, Horovod)