A cross-industry approach to shorten time-to-value for AI/ML use-case deployment (part 1/3)
Abstract
In this digital age of hyper-competition and collaboration, AI/machine-learning enabled use cases hold the key to continuous and disruptive innovation in the post-pandemic business environment. Every industry faces the same obstacle course: after developing an initial experimental use case, the enterprise needs an agile platform to deploy the model into its production environment quickly. Data scientists and analysts often spend much of their time gathering data before they can use it for advanced analytics. There is a clear need for speed and efficiency in operational ML deployment. Cloud environments such as GCP and Databricks now make it possible to deploy machine-learning enabled use cases in a matter of hours. This document outlines an automated approach for enterprises to deploy machine-learning enabled use cases using the rapid-deployment GCP and Databricks platforms.
Part 1 of 3:
Problem statement
Many enterprises today leverage data science and machine-learning use cases to extract insights from data. These use cases classify existing data into logical categories or predict high-confidence probabilities of business outcomes. Business teams constantly press the data analytics and IT teams to produce useful results faster through practical use cases. Typical examples of these industry use cases include:
- Which product should be recommended to a customer based on their demographics?
- Which factors will increase a customer's propensity to buy?
- How many units of a particular SKU should be shipped to a given location for optimal inventory management?
- What is the likelihood of component failure given associated operating parameters?
Machine-learning use cases typically evolve with business circumstances, experimentation and continually growing requirements from business users in the CMO, CRO, CFO, CIO and COO organizations. While the backlog of requirements grows fast, a large share of the initial effort (around 80%) is spent on data preparation, with the remainder on modeling and prototyping. Simplifying and automating delivery to increase velocity is difficult with many moving parts and non-standard approaches. In this article we explore a holistic approach to how platforms like Databricks and GCP enable simplified, agile deployment of end-to-end production-ready use cases.
Implementation
The exact steps to implement a use case and algorithm will differ based on your organization's requirements; however, the steps given in this article will, overall, closely resemble the steps you need to implement. There are multiple ways organizations can implement machine-learning use cases. At a high level, a common approach can be summarized as follows:
- Use any standard tool such as RStudio, PyCharm or plain Notepad++ to explore a light (small) dataset. Several algorithms can be implemented quickly to test out the datasets under consideration (a minimal sketch follows this list).
- Iteratively develop the machine-learning algorithm in collaborative discussion with business and IT. This step may take several iterations; however, this is the typical way a machine-learning use case is developed and its parameters are finalized.
- Once the algorithm and approach are finalized, the real deployment approach is discussed. GCP and Databricks provide integrated platforms with tooling to quickly ingest the dataset and apply machine-learning algorithms to it via notebooks.
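As a hedged illustration of the prototyping step above, the sketch below trains a quick baseline classifier on a small local extract (assuming Python with pandas and scikit-learn; the CSV file name, feature columns and target label are hypothetical placeholders):

```python
# Quick prototyping sketch on a small local dataset.
# Assumptions: Python 3 with pandas and scikit-learn; the CSV file, feature
# columns and target label are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load a small extract of the data for experimentation.
df = pd.read_csv("customer_sample.csv")

features = ["age", "income", "num_prior_purchases"]
target = "purchased"  # e.g. a propensity-to-buy label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

# A quick baseline model to validate the signal before investing in deployment.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```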
Steps normally considered:
A. Ingest the datasets
Data ingestion into Databricks and GCP is greatly simplified by products such as Infoworks DataFoundry, Fivetran, Syncsort or StreamSets. The data sources can be many: business-support and operational databases such as Oracle and MySQL, flat/CSV files, and SaaS applications like Salesforce, NetSuite, ServiceNow, Marketo, etc. Ingesting all this data into a unified data lake is often time-consuming and hard with typical ETL tools, in many cases requiring custom development, dozens of connectors, slowly changing dimension (SCD Type I/II) handling, or APIs that change over time and break the data onboarding process. Traditional companies use disparate data integration tools that require scores of data engineers to write manual scripts, schedule jobs and triggers, and handle job failures manually. This approach does not lend itself to simplification and scale, and it creates painstaking operational overhead. The use of automated DataOps and orchestration tools is therefore highly recommended to build a strong, unified approach to the data lakehouse.
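For teams that have not yet adopted one of the suites above, the minimal PySpark sketch below shows what landing a single operational table into the raw zone of a Databricks data lake might look like (the JDBC URL, credentials, secret scope, table and target path are hypothetical placeholders, and `dbutils` is Databricks-specific):

```python
# Minimal ingestion sketch for Databricks/PySpark.
# Assumptions: a running SparkSession named `spark`, the MySQL JDBC driver on
# the cluster, and a Databricks secret scope; the URL, credentials, table and
# target path below are hypothetical placeholders.
jdbc_url = "jdbc:mysql://source-db.example.com:3306/sales"

orders_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", "ingest_user")
    .option("password", dbutils.secrets.get("ingest", "mysql-password"))
    .load()
)

# Land the source table as-is into the raw section of the lake (Delta format).
(
    orders_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/raw/sales/orders")
)
```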
Figure 1: Example of data onboarding (ingest framework) as utilized by Databricks
Source: a well-written Databricks.com blog showing the partner ecosystem and applications
Typically, data processing and gathering can be divided into three sections:
- Raw Data
- Processed Data
- Trusted Data
The raw data section is where data from diverse sources lands. These sources could be traditional databases such as MySQL or Oracle, streaming sources such as Kafka, cloud-based data stores such as S3 or GCS, or SaaS applications such as Salesforce, Jira, Freshdesk, ServiceNow, NetSuite, etc.
The processed section is where the data is cleansed and prepared for downstream applications such as business intelligence. This section can also feed machine-learning and artificial-intelligence algorithms.
The final section is the trusted datasets: data that can be consumed directly by business customers, data scientists and, to some extent, IT data analysts.
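A minimal sketch of how data might flow through these three sections on Databricks is shown below (it assumes a running SparkSession named `spark` with Delta; the paths, columns and cleansing rules are hypothetical placeholders):

```python
# Sketch of promoting data across the raw -> processed -> trusted sections.
# Assumptions: Databricks/PySpark with Delta; paths, columns and cleansing
# rules are hypothetical placeholders.
from pyspark.sql import functions as F

# Raw: data landed exactly as received from the source.
raw_orders = spark.read.format("delta").load("/mnt/datalake/raw/sales/orders")

# Processed: typed, de-duplicated and filtered for downstream use.
processed_orders = (
    raw_orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("order_amount") > 0)
)
processed_orders.write.format("delta").mode("overwrite").save(
    "/mnt/datalake/processed/sales/orders"
)

# Trusted: a business-ready aggregate consumed directly by analysts and BI.
daily_revenue = (
    processed_orders.groupBy("order_date")
    .agg(F.sum("order_amount").alias("daily_revenue"))
)
daily_revenue.write.format("delta").mode("overwrite").save(
    "/mnt/datalake/trusted/sales/daily_revenue"
)
```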
There are several considerations for ingestion, such as:
- Incremental and full ingestion
- State management while loading the data, in the event of failure
- Data compaction (see the sketch after this list)
- Data de-duplication
- Dependency management
- Reconciliation of processed datasets
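As a brief illustration of the compaction consideration above, the Delta/Databricks-specific sketch below rewrites many small files into fewer, larger ones (the table path and the Z-order column are hypothetical placeholders):

```python
# Compaction sketch (Delta/Databricks-specific; the path and Z-order column
# are hypothetical placeholders). Frequent incremental loads leave many small
# files; OPTIMIZE rewrites them into fewer, larger files for faster reads.
spark.sql("OPTIMIZE delta.`/mnt/datalake/processed/sales/orders`")

# Optionally co-locate a frequently filtered column to improve data skipping.
spark.sql(
    "OPTIMIZE delta.`/mnt/datalake/processed/sales/orders` ZORDER BY (order_date)"
)
```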
Data scientists and ML modelers can work on data produced from the processed and trusted datasets, as follows:
Figure 2: ML use cases leveraging current data domains (for training, modeling and commercial DevOps)
The main challenges for data ingestion and synchronization are:
- fast initial data load (see the sketch after this list), and
- incremental data load once the initial load is completed.
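The initial load is usually addressed by parallelizing the source read. The minimal sketch below uses Spark's standard JDBC partitioning options to split a large table across concurrent read tasks (the connection details, split column and bounds are hypothetical placeholders, as in the earlier ingestion sketch):

```python
# Partitioned initial-load sketch using Spark's standard JDBC options.
# Assumptions as in the earlier ingestion sketch; the split column and its
# bounds below are hypothetical placeholders.
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db.example.com:3306/sales")
    .option("dbtable", "orders")
    .option("user", "ingest_user")
    .option("password", dbutils.secrets.get("ingest", "mysql-password"))
    .option("partitionColumn", "order_id")  # numeric column to split the read on
    .option("lowerBound", "1")              # smallest value of the split column
    .option("upperBound", "10000000")       # largest value of the split column
    .option("numPartitions", "16")          # number of parallel read tasks ("mappers")
    .load()
)
```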
Simple questions such as how many partitions and how many parallel readers ("mappers") are required can become challenges if the structure of the underlying dataset is not known. Typically, it is recommended that data ingestion, transformation and orchestration tooling provide the following features in some form:
- Data and metadata crawling
  - Schema & data type discovery
  - Data pattern discovery
- Data ingestion
  - Parallel and secured ingestion
  - Data validation and reconciliation
  - Data type conversion
- Data and schema sync
  - Continuous change data capture
  - Continuous schema change capture
  - Continuous merge process (see the sketch after this list)
  - Auto time-axis building
  - Slowly changing dimensions (Type I & II)
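As one hedged illustration of the continuous merge and SCD items above, the sketch below uses the Delta Lake MERGE API available on Databricks (paths and column names are hypothetical placeholders; a Type II implementation would additionally track effective-date and current-flag columns rather than overwriting in place):

```python
# Continuous merge / SCD Type I sketch using the Delta Lake Python API.
# Assumptions: Databricks (or Delta Lake) with existing target and change
# tables; paths and column names are hypothetical placeholders.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/datalake/processed/sales/orders")
changes = spark.read.format("delta").load("/mnt/datalake/raw/sales/orders_cdc")

# Upsert: update rows that already exist, insert the new ones. This is SCD
# Type I (old values are overwritten); Type II would instead close out the
# old row with an end date and insert a new "current" row.
(
    target.alias("t")
    .merge(changes.alias("c"), "t.order_id = c.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```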
These activities can be configured simply using the tools and suites mentioned above, without much scripting or manual coding. In the next part (2 of 3), we will cover the next steps: feature engineering and model building.
----
Authors: VirooPax B. Mirji and Ganesh Walavalkar
P.S. These are personal views and should not be considered as representing any specific company or ecosystem of partners.
-------