A cross-industry approach to shorten time-to-value for AI/ML use-case deployment (part 1/3)
Abstract
In this digital age of hyper-competition and collaboration, AI/machine-learning enabled use cases hold the key to continuous and disruptive innovation in the post-pandemic business environment. Every industry faces the same obstacle course: after developing an initial experimental use case, the enterprise needs an agile platform to deploy the model into its production environment quickly. Data scientists and analysts often spend much of their time gathering data before they can use it for advanced analytics. There is a clear need for speed and efficiency in operational ML deployment. Cloud environments such as GCP and Databricks now make it possible to deploy machine-learning enabled use cases in a matter of hours. This document outlines an automated approach for enterprises to deploy machine-learning enabled use cases using the rapid-deployment GCP and Databricks platforms.
Part 1 of 3:
Problem statement
Many enterprises today leverage data science and machine-learning use cases to extract insights from data. These use cases classify existing data into logical categories or predict high-confidence probabilities of business outcomes. Business teams constantly press the data analytics and IT teams to produce useful results faster through practical use cases. Typical examples of these industry use cases include:
- Which product should be recommended to a customer based on their demographics?
- Which factors will increase a customer's propensity to buy?
- How many units of a particular SKU should be shipped to a given location for optimal inventory management?
- What is the likelihood of component failure given associated operating parameters?
Machine-learning use cases typically evolve with business circumstances, experimentation and continually growing requirements from business users in the CMO, CRO, CFO, CIO and COO organizations. While the backlog of requirements grows fast, a large share of the initial effort (around 80%) is spent on data preparation, with the remainder on modeling and prototyping. Simplifying and automating delivery to increase velocity is difficult with many moving parts and non-standard approaches. In this article we explore a holistic approach to how platforms like Databricks and GCP enable simplified, agile deployment of end-to-end production-ready use cases.
Implementation
The exact steps to implement a use case and algorithm will differ based on your organization's requirements; however, the steps given in this article will, overall, closely resemble the steps you need to implement. There are multiple ways organizations can implement machine-learning use cases. At a high level, a common approach can be summarized as follows:
- Use any standard tool such as RStudio, PyCharm or plain Notepad++ to explore a light (small) dataset. Several algorithms can be implemented quickly to test out the datasets under consideration (a minimal sketch follows this list).
- Iteratively develop the machine-learning algorithm in collaborative discussion with business and IT. This step may take several iterations; however, this is the typical way a machine-learning use case is developed and its parameters are finalized.
- Once the algorithm and approach are finalized, the real deployment approach is discussed. GCP and Databricks provide integrated platforms with tooling to quickly ingest the dataset and apply machine-learning algorithms to it via notebooks.
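As a hedged illustration of the prototyping step above, the sketch below trains a quick baseline classifier on a small local extract (assuming Python with pandas and scikit-learn; the CSV file name, feature columns and target label are hypothetical placeholders):

```python
# Quick prototyping sketch on a small local dataset.
# Assumptions: Python 3 with pandas and scikit-learn; the CSV file, feature
# columns and target label are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load a small extract of the data for experimentation.
df = pd.read_csv("customer_sample.csv")

features = ["age", "income", "num_prior_purchases"]
target = "purchased"  # e.g. a propensity-to-buy label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

# A quick baseline model to validate the signal before investing in deployment.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```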
Steps normally considered:
A. Ingest the datasets
Data ingestion into Databricks and GCP is greatly simplified by products such as Infoworks DataFoundry, Fivetran, Syncsort or StreamSets. The data sources can be many: business-support and operational databases such as Oracle and MySQL, flat/CSV files, and SaaS applications like Salesforce, NetSuite, ServiceNow, Marketo, etc. Ingesting all this data into a unified data lake is often time-consuming and hard with typical ETL tools, in many cases requiring custom development, dozens of connectors, slowly changing dimension (SCD Type I/II) handling, or APIs that change over time and break the data onboarding process. Traditional companies use disparate data integration tools that require scores of data engineers to write manual scripts, schedule jobs and triggers, and handle job failures manually. This approach does not lend itself to simplification and scale, and it creates painstaking operational overhead. The use of automated DataOps and orchestration tools is therefore highly recommended to build a strong, unified approach to the data lakehouse.
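For teams that have not yet adopted one of the suites above, the minimal PySpark sketch below shows what landing a single operational table into the raw zone of a Databricks data lake might look like (the JDBC URL, credentials, secret scope, table and target path are hypothetical placeholders, and `dbutils` is Databricks-specific):

```python
# Minimal ingestion sketch for Databricks/PySpark.
# Assumptions: a running SparkSession named `spark`, the MySQL JDBC driver on
# the cluster, and a Databricks secret scope; the URL, credentials, table and
# target path below are hypothetical placeholders.
jdbc_url = "jdbc:mysql://source-db.example.com:3306/sales"

orders_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", "ingest_user")
    .option("password", dbutils.secrets.get("ingest", "mysql-password"))
    .load()
)

# Land the source table as-is into the raw section of the lake (Delta format).
(
    orders_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/raw/sales/orders")
)
```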
Figure 1: Example of data onboarding (ingest framework) as utilized by Databricks
Source: a well-written Databricks.com blog showing the partner ecosystem and applications
Typically, data processing and gathering can be divided into three sections:
- Raw Data
- Processed Data
- Trusted Data
The raw data section is where data from diverse sources lands. These sources could be traditional databases such as MySQL or Oracle, streaming sources such as Kafka, cloud-based data stores such as S3 or GCS, or SaaS applications such as Salesforce, Jira, Freshdesk, ServiceNow, NetSuite, etc.
The processed section is where the data is cleansed and prepared for downstream applications such as business intelligence. This section can also feed machine-learning and artificial-intelligence algorithms.
The final section is the trusted datasets: data that can be consumed directly by business customers, data scientists and, to some extent, IT data analysts.
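A minimal sketch of how data might flow through these three sections on Databricks is shown below (it assumes a running SparkSession named `spark` with Delta; the paths, columns and cleansing rules are hypothetical placeholders):

```python
# Sketch of promoting data across the raw -> processed -> trusted sections.
# Assumptions: Databricks/PySpark with Delta; paths, columns and cleansing
# rules are hypothetical placeholders.
from pyspark.sql import functions as F

# Raw: data landed exactly as received from the source.
raw_orders = spark.read.format("delta").load("/mnt/datalake/raw/sales/orders")

# Processed: typed, de-duplicated and filtered for downstream use.
processed_orders = (
    raw_orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
    .filter(F.col("order_amount") > 0)
)
processed_orders.write.format("delta").mode("overwrite").save(
    "/mnt/datalake/processed/sales/orders"
)

# Trusted: a business-ready aggregate consumed directly by analysts and BI.
daily_revenue = (
    processed_orders.groupBy("order_date")
    .agg(F.sum("order_amount").alias("daily_revenue"))
)
daily_revenue.write.format("delta").mode("overwrite").save(
    "/mnt/datalake/trusted/sales/daily_revenue"
)
```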
There are several considerations for ingestion, such as:
- Incremental and full ingestion
- State management while loading the data, in the event of failure
- Data compaction (see the sketch after this list)
- Data de-duplication
- Dependency management
- Reconciliation of processed datasets
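As a brief illustration of the compaction consideration above, the Delta/Databricks-specific sketch below rewrites many small files into fewer, larger ones (the table path and the Z-order column are hypothetical placeholders):

```python
# Compaction sketch (Delta/Databricks-specific; the path and Z-order column
# are hypothetical placeholders). Frequent incremental loads leave many small
# files; OPTIMIZE rewrites them into fewer, larger files for faster reads.
spark.sql("OPTIMIZE delta.`/mnt/datalake/processed/sales/orders`")

# Optionally co-locate a frequently filtered column to improve data skipping.
spark.sql(
    "OPTIMIZE delta.`/mnt/datalake/processed/sales/orders` ZORDER BY (order_date)"
)
```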
Data scientists and ML modelers can work on data produced from the processed and trusted datasets, as follows:
Figure 2: ML use cases leveraging current data domains (for training, modeling and commercial DevOps)
The main challenges for data ingestion and synchronization are:
- fast initial data load (see the sketch after this list), and
- incremental data load once the initial load is completed.
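The initial load is usually addressed by parallelizing the source read. The minimal sketch below uses Spark's standard JDBC partitioning options to split a large table across concurrent read tasks (the connection details, split column and bounds are hypothetical placeholders, as in the earlier ingestion sketch):

```python
# Partitioned initial-load sketch using Spark's standard JDBC options.
# Assumptions as in the earlier ingestion sketch; the split column and its
# bounds below are hypothetical placeholders.
orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db.example.com:3306/sales")
    .option("dbtable", "orders")
    .option("user", "ingest_user")
    .option("password", dbutils.secrets.get("ingest", "mysql-password"))
    .option("partitionColumn", "order_id")  # numeric column to split the read on
    .option("lowerBound", "1")              # smallest value of the split column
    .option("upperBound", "10000000")       # largest value of the split column
    .option("numPartitions", "16")          # number of parallel read tasks ("mappers")
    .load()
)
```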
Simple questions such as how many partitions and how many parallel readers ("mappers") are required can become challenges if the structure of the underlying dataset is not known. Typically, it is recommended that data ingestion, transformation and orchestration tooling provide the following features in some form:
- Data and metadata crawling
  - Schema & data type discovery
  - Data pattern discovery
- Data ingestion
  - Parallel and secured ingestion
  - Data validation and reconciliation
  - Data type conversion
- Data and schema sync
  - Continuous change data capture
  - Continuous schema change capture
  - Continuous merge process (see the sketch after this list)
  - Auto time-axis building
  - Slowly changing dimensions (Type I & II)
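As one hedged illustration of the continuous merge and SCD items above, the sketch below uses the Delta Lake MERGE API available on Databricks (paths and column names are hypothetical placeholders; a Type II implementation would additionally track effective-date and current-flag columns rather than overwriting in place):

```python
# Continuous merge / SCD Type I sketch using the Delta Lake Python API.
# Assumptions: Databricks (or Delta Lake) with existing target and change
# tables; paths and column names are hypothetical placeholders.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/datalake/processed/sales/orders")
changes = spark.read.format("delta").load("/mnt/datalake/raw/sales/orders_cdc")

# Upsert: update rows that already exist, insert the new ones. This is SCD
# Type I (old values are overwritten); Type II would instead close out the
# old row with an end date and insert a new "current" row.
(
    target.alias("t")
    .merge(changes.alias("c"), "t.order_id = c.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```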
These activities can be configured simply using the tools and suites mentioned above, without much scripting or manual coding. In the next part (2 of 3), we will cover the next steps: feature engineering and model building.
----
Authors: VirooPax B. Mirji and Ganesh Walavalkar
P.S. These are personal views and should not be considered as representing any specific company or ecosystem of partners.
-------