Machine Learning Systems: Designs that scale
About this ebook
Machine Learning Systems: Designs that scale is an example-rich guide that teaches you how to implement reactive design solutions in your machine learning systems to make them as reliable as a well-built web app.
Foreword by Sean Owen, Director of Data Science, Cloudera
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the Technology
If you’re building machine learning models to be used on a small scale, you don't need this book. But if you're a developer building a production-grade ML application that needs quick response times, reliability, and good user experience, this is the book for you. It collects principles and practices of machine learning systems that are dramatically easier to run and maintain, and that are reliably better for users.
About the Book
Machine Learning Systems: Designs that scale teaches you to design and implement production-ready ML systems. You'll learn the principles of reactive design as you build pipelines with Spark, create highly scalable services with Akka, and use powerful machine learning libraries like MLlib on massive datasets. The examples use the Scala language, but the same ideas and tools work in Java as well.
What's Inside
- Working with Spark, MLlib, and Akka
- Reactive design patterns
- Monitoring and maintaining a large-scale system
- Futures, actors, and supervision
About the Reader
Readers need intermediate skills in Java or Scala. No prior machine learning experience is assumed.
About the Author
Jeff Smith builds powerful machine learning systems. For the past decade, he has been working on building data science applications, teams, and companies as part of various teams in New York, San Francisco, and Hong Kong. He blogs (https://medium.com/@jeffksmithjr), tweets (@jeffksmithjr), and speaks (www.jeffsmith.tech/speaking) about various aspects of building real-world machine learning systems.
Table of Contents
PART 1 - FUNDAMENTALS OF REACTIVE MACHINE LEARNING
- Learning reactive machine learning
- Using reactive tools
PART 2 - BUILDING A REACTIVE MACHINE LEARNING SYSTEM
- Collecting data
- Generating features
- Learning models
- Evaluating models
- Publishing models
- Responding
PART 3 - OPERATING A MACHINE LEARNING SYSTEM
- Delivering
- Evolving intelligence
Copyright
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email:
©2018 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
Development editor: Susanna Kline
Review editor: Aleksandar Dragosavljević
Technical development editor: Kostas Passadis
Project editor: Tiffany Taylor
Copyeditor: Corbin Collins
Proofreader: Katie Tennant
Technical proofreader: Jerry Kuch
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor
ISBN 9781617293337
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 23 22 21 20 19 18
Brief Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this book
About the author
About the cover illustration
1. Fundamentals of reactive machine learning
Chapter 1. Learning reactive machine learning
Chapter 2. Using reactive tools
2. Building a reactive machine learning system
Chapter 3. Collecting data
Chapter 4. Generating features
Chapter 5. Learning models
Chapter 6. Evaluating models
Chapter 7. Publishing models
Chapter 8. Responding
3. Operating a machine learning system
Chapter 9. Delivering
Chapter 10. Evolving intelligence
Getting set up
A reactive machine learning system
Phases of machine learning
Index
List of Figures
List of Tables
List of Listings
Table of Contents
Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this book
About the author
About the cover illustration
1. Fundamentals of reactive machine learning
Chapter 1. Learning reactive machine learning
1.1. An example machine learning system
1.1.1. Building a prototype system
1.1.2. Building a better system
1.2. Reactive machine learning
1.2.1. Machine learning
1.2.2. Reactive systems
1.2.3. Making machine learning systems reactive
1.2.4. When not to use reactive machine learning
Summary
Chapter 2. Using reactive tools
2.1. Scala, a reactive language
2.1.1. Reacting to uncertainty in Scala
2.1.2. The uncertainty of time
2.2. Akka, a reactive toolkit
2.2.1. The actor model
2.2.2. Ensuring resilience with Akka
2.3. Spark, a reactive big data framework
Summary
2. Building a reactive machine learning system
Chapter 3. Collecting data
3.1. Sensing uncertain data
3.2. Collecting data at scale
3.2.1. Maintaining state in a distributed system
3.2.2. Understanding data collection
3.3. Persisting data
3.3.1. Elastic and resilient databases
3.3.2. Fact databases
3.3.3. Querying persisted facts
3.3.4. Understanding distributed-fact databases
3.4. Applications
3.5. Reactivities
Summary
Chapter 4. Generating features
4.1. Spark ML
4.2. Extracting features
4.3. Transforming features
4.3.1. Common feature transforms
4.3.2. Transforming concepts
4.4. Selecting features
4.5. Structuring feature code
4.5.1. Feature generators
4.5.2. Feature set composition
4.6. Applications
4.7. Reactivities
Summary
Chapter 5. Learning models
5.1. Implementing learning algorithms
5.1.1. Bayesian modeling
5.1.2. Implementing Naive Bayes
5.2. Using MLlib
5.2.1. Building an ML pipeline
5.2.2. Evolving modeling techniques
5.3. Building facades
5.3.1. Learning artistic style
5.4. Reactivities
Summary
Chapter 6. Evaluating models
6.1. Detecting fraud
6.2. Holding out data
6.3. Model metrics
6.4. Testing models
6.5. Data leakage
6.6. Recording provenance
6.7. Reactivities
Summary
Chapter 7. Publishing models
7.1. The uncertainty of farming
7.2. Persisting models
7.3. Serving models
7.3.1. Microservices
7.3.2. Akka HTTP
7.4. Containerizing applications
7.5. Reactivities
Summary
Chapter 8. Responding
8.1. Moving at the speed of turtles
8.2. Building services with tasks
8.3. Predicting traffic
8.4. Handling failure
8.5. Architecting response systems
8.6. Reactivities
Summary
3. Operating a machine learning system
Chapter 9. Delivering
9.1. Shipping fruit
9.2. Building and packaging
9.3. Build pipelines
9.4. Evaluating models
9.5. Deploying
9.6. Reactivities
Summary
Chapter 10. Evolving intelligence
10.1. Chatting
10.2. Artificial intelligence
10.3. Reflex agents
10.4. Intelligent agents
10.5. Learning agents
10.6. Reactive learning agents
10.6.1. Reactive principles
10.6.2. Reactive strategies
10.6.3. Reactive machine learning
10.7. Reactivities
10.7.1. Libraries
10.7.2. System data
10.8. Reactive explorations
10.8.1. Users
10.8.2. System dimensions
10.8.3. Applying reactive principles
Summary
Getting set up
Scala
Git code repository
sbt
Spark
Couchbase
Docker
A reactive machine learning system
Phases of machine learning
Index
List of Figures
List of Tables
List of Listings
Foreword
Today’s data scientists and software engineers are spoiled for choice when looking for tools to build machine learning systems. They have a range of new technologies that make it easier than ever to build entire machine learning systems. Considering where we—the machine learning community—started, it’s exciting to see a book that explores how powerful and approachable the current technologies are.
To better understand how we got here, I’d like to share a bit of my own story. They tell me I’m a data scientist, but I think I’m only here by accident. I began as a software person and grew up on Java 1.3 and EJB. I left the software-engineer role at Google a decade ago, although I dabbled in open source and created a recommender system that went on to be part of Apache Mahout in 2009. Its goal was to implement machine learning algorithms on the then-new Apache Hadoop MapReduce framework. The engineering parts were familiar—MapReduce came from Google, after all. The machine learning was new and exciting, but the tools were lacking.
Not knowing any better, and with no formal background in ML, I tried to help build ML at scale. In theory, this was going to open an era of better ML, because more data generally means better models. ML just needed tooling rebuilt on the nascent distributed computing platforms like Hadoop.
Mahout (0.x) was what you’d expect when developers with a lot of engineering background and a little stats background try to build ML tools: JVM-based, modular, scalable, complex, developer-oriented, baroque, and sometimes eccentric in its interpretation of stats concepts. In retrospect, classic Mahout wasn’t interesting because it was a better version of stats tooling. In truth, it was much less usable than, say, R (which I admit having never heard of until 2010). Mahout was interesting because it was built from the beginning to work at web scale, using tooling developed for enterprise software engineering. The collision of stats tooling with new approaches to handling web-scale data gave birth to what became known as data science.
The more I back-filled my missing context about how real statisticians and analysts had been successfully applying ML for decades, thank you very much, the more I realized that the existing world of analytics tooling optimizes for some usages and not others. Python, R, and their ecosystems have rich analytics libraries and visualization tools. They’re not as concerned with issues of scale or production deployment.
Coming from an enterprise software world, I was somewhat surprised that the tooling generally ended at building a model. What about doing something with the model in production? I found this was usually viewed as a separate activity for software engineers to undertake. The engineering community hadn’t settled on clear patterns for product application around Hadoop-related technologies.
In 2012, I spun out a small company, Myrrix, to expand on the core premise of Mahout and make it into a continuously learning, updating service with the ability to serve results from the model in production—not just a library that output coefficients. This became part of Cloudera and was reimagined again, on top of Apache Spark, as Oryx (https://github.com/OryxProject/oryx).
Spark was another game changer for the Hadoop ecosystem. It brought a higher-level, natural functional paradigm to big data software development, more like you’d encounter in Python. It added language bindings to Python and R. It brought a new machine learning library, Spark MLlib. By 2015, the big data ecosystem at large was suddenly much closer to the world of conventional analytics tools.
These and other tools have bridged the worlds of stats and software engineering such that the two now interact regularly. Today’s big data engineer has ready access to Python-only tooling like TensorFlow for deep learning and Seaborn for visualization. The software-engineering culture of version control and testing and strongly typed languages has flowed into the data science community, too.
That brings us back to this book. It doesn’t cover just tools but also the entire job of building a machine learning system. It gets into topics that people used to gloss over, like model serialization and building model servers. The language of the book is primarily Scala, a unique language that is both principled and expressive without sacrificing conveniences like type inference. Scala has been used to build powerful technologies like Spark and Akka, which the book shows you how to use to build machine learning systems. The book also doesn’t ignore the importance of interoperability with Python technologies or portable application builds with Docker.
We’ve come a long way, and there’s farther to go. The person who can master the tools and techniques in this book will be well prepared to play a role in machine learning’s even more exciting future.
SEAN OWEN
DIRECTOR OF DATA SCIENCE, CLOUDERA
Preface
I’ve been working with data for my entire professional career. Following my interests, I’ve worked on ever-more-analytically sophisticated systems as my career has progressed, leading to a focus on machine learning and artificial intelligence systems.
As my work content evolved from more traditional data-warehousing sorts of tasks to building machine learning systems, I was struck by a strange absence. When I was working primarily with databases, I could rely on the rich body of academic and professional literature about how to build databases and applications that interact with them to help me define what a good design was. So, I was confused and surprised to find that machine learning as a field generally lacked this sort of guidance. There were no canonical implementations of anything other than the model learning algorithms. Huge chunks of the system that needed to be built were largely glossed over in the literature. Often, I couldn’t even find a consistent name for a given system component, so my colleagues and I inevitably confused each other with our choices of terminology.
What I wanted was a framework, something like a Ruby on Rails for machine learning, but no such framework seemed to exist.[¹] Barring a commonly accepted framework, I wanted at least some clear design patterns for how to build machine learning systems; but alas, there was no Design Patterns for Machine Learning Systems to be found, either.
¹ Eventually, I came across Sean Owen’s work on Oryx and Simon Chan’s on PredictionIO, which were super-instructive. If you’re interested in the background of machine learning architectures, you’ll benefit from reviewing them both.
So, I built machine learning systems the hard way: by trying things and figuring out what didn’t work. When I needed to invent terminology, I just picked reasonable terms. Over time, I tried to synthesize some of my learnings about what worked for machine learning system design and what didn’t into a coherent whole. Fields like distributed systems and functional programming offered the promise of adding coherence to my views about machine learning systems, but neither was particularly focused on application to machine learning.
Then, I discovered reactive systems design, via reading the Reactive Manifesto (www.reactivemanifesto.org). It was startling in its simple coherence and bold mission. Here was a complete world view of what the challenge of building modern software applications was and a principled way of building applications that met that challenge. I was excited by the promise of the approach and immediately began attempting to apply it to the problems I’d seen in architecting and building machine learning systems.
Poop prediction
This inquiry led me to poop—specifically, to dog poop. I tried to imagine how a naive machine learning system could be refactored into something much better, using the tools from reactive systems design. To do this, I wrote a blog post about a dog poop prediction startup (http://mng.bz/9YK8; see figure).
The post got a surprisingly large and serious response from a wide range of people. I learned two things from that response:
I wasn’t the only one interested in coming up with a principled approach to building machine learning systems.
People really enjoyed talking about machine learning in terms of cartoon animals.
Those insights led to the book you’re reading. In this book, I try to cover a range of issues you’re likely to encounter in building real-world machine learning systems that have to keep customers happy. My focus is on all the stuff you won’t find in other books. I’ve tried to make the book as broad as possible, in the hopes of covering the full responsibilities of the modern data scientist or engineer. I explore how to use general principles and techniques to break down the seemingly unique problems of a given component of a machine learning system. My goal is to be as comprehensive as possible in my coverage of machine learning system components, but that means I can’t be comprehensive on huge topics like model learning algorithms and distributed systems. Instead, I’ve designed examples that provide you with experience building various components of a machine learning system.
I firmly believe that to build a truly powerful machine learning system, you must take a system-level view of the problem. In this book, I provide that high-level perspective and then help you build skills around each of the key components in that system. I learned through my experience as a technical lead and manager that understanding the entire machine learning system and the composition of its components is one of the most important skills a developer of machine learning systems can have. So, the book tries to cover all the different pieces it takes to build up a powerful, real-world machine learning system. Throughout, we’ll take the perspective of teams shipping sophisticated machine learning systems for live users. So, we’ll explore how to build everything in a machine learning system. It’s a big job, and I’m excited that you’re interested in taking it on.
Acknowledgments
A book is the opposite of an academic paper when it comes to attribution. In an academic paper, everyone who ever even grabbed lunch at the lab can get their name on the paper; but in a book, for some reason, we only put one or two names on the cover. But it’s not that simple to pull a book together; lots of people are involved. Here are all the people who made this book happen.
As I mentioned in the preface, the book grew out of (believe it or not) a blog post about dog poop (http://mng.bz/9YK8). I’m immensely grateful to the serious and accomplished people who took my cartoons about dog poop seriously enough to provide useful feedback: Roland Kuhn, Simon Chan, and Sean Owen.
In the early days of the book, the members of the reactive study group and the data team at Intent Media were invaluable in helping me understand where I was trying to take these ideas about building machine learning systems. I’m also indebted to Chelsea Alburger from Intent Media, who provided great early art direction for the book’s visuals.
Thanks go to the team at Manning who took my original ideas and helped them become a book: Frank Pöhlmann, who suggested that there might be a book in this reactive machine learning stuff; Susanna Kline, who dragged me kicking and screaming through the dark forest; Kostas Passadis, who kept me from looking like a complete fool; and Marjan Bace, who green-lit the whole mad endeavor. I also want to thank the technical peer reviewers, led by Aleksandar Dragosavljević: David Andrzejewski, Jose Carlos Estefania Aulet, Óscar Belmonte-Fernández, Tony M. Dubitsky, Vipul Gupta, Jason Hales, Massimo Ilario, Shobha Iyer, Shanker Janakiraman, Jon Lehto, Anuja Kelkar, Alexander Myltsev, Tommy O’Dell, Jean Safar, José San Leandro, Jeff Smith, Chris Snow, Ian Stirk, Fabien Tison, Jeremy Townson, Joseph Wang, and Jonathan Woodard.
Once the book really got rolling, the team at x.ai were immensely helpful in providing a test lab for various ideas and supporting me as I took the book’s ideas on the road in the form of talks. I thank you, Dennis Mortensen, Alex Poon, and everyone on the tech team.
Also, thanks go to anyone who came out to hear one of the talks associated with the book at conferences and meetups. All the feedback provided, in person and online, was instrumental to helping me understand how the material was evolving.
Finally, I thank my illustrator, yifan, without whom the book wouldn’t have been possible. You’ve brought to life my vision of cartoon animals who do machine learning, and now I’m excited to be able to share it with the world.
P.S. Thanks to my muse: nom nom, the data dog. Who’s a good little machine learner? You are!
About this book
This book serves two slightly different audiences. First, it serves software engineers who are interested in machine learning but haven’t built many real-world machine learning systems. I presume such readers want to put their skills into practice by actually building something with machine learning. The book is different from other books you may have picked up on machine learning. In it, you’ll find techniques applicable to building whole production-grade systems, not just naive scripts. We’ll explore the entire range of possible components you might need to implement in a machine learning system, with lots of hard-won tips about common design pitfalls. Along the way, you’ll learn about the various jobs of a machine learning system, in the context of implementing systems that fulfill those needs. So, if you don’t have a lot of background in machine learning, don’t worry that you’ll have to wade through pages of math before you get to build things. The book will have you coding all the way through, often relying on libraries to handle the more complex implementation concerns like model learning algorithms and distributed data processing.
Second, this book serves data scientists who are interested in the bigger picture of machine learning systems. I presume that such readers know the concepts of machine learning but may only have implemented simple machine learning functionality (for example, scripts over files on a laptop). For such readers, the book may introduce you to a range of concerns that you’ve never before considered part of the work of machine learning. In places, I’ll introduce vocabulary to name components of a system that are often neglected in academic machine learning discussions, and then I’ll show you how to implement them. Although the book does get into some powerful programming techniques, I don’t presume that you have deep experience in software engineering, and I’ll introduce all concepts beyond the very basic, in context.
For either type of reader, I assume that you have some interest in reactive systems and how this approach can be used to build better machine learning systems. The reactive perspective on system design underpins every part of the book, so you’ll spend a lot of time examining the properties your system has or doesn’t have, often presuming that real-world problems like server outages and network partitions will occur in your system.
Concretely, this focus on reactive systems means the book contains a fair bit of material on distributed systems and functional programming. The goal of unifying these concerns with the task of building machine learning systems is to give you tools to solve some of the hardest problems in technology today. Again, if you don’t have a background in distributed systems or functional programming, don’t worry: I’ll introduce this material in context with the appropriate motivation. Once you see tools like Scala, Spark, and Akka in action, I hope it will become clear to you how helpful they can be in solving real-world machine learning problems.
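To give a taste of what that looks like in practice, here's a minimal Scala sketch (illustrative only, not code from the book; `predict` and its fallback value are hypothetical) of the kind of failure handling the reactive approach favors: a model-serving call wrapped in a Future, with an explicit recovery path instead of an unhandled exception.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// A hypothetical model-serving call that might fail at runtime.
def predict(features: Vector[Double]): Future[Double] =
  Future {
    if (features.isEmpty) throw new IllegalArgumentException("no features")
    features.sum / features.size
  }

// recover supplies a fallback prediction, so a bad request
// degrades gracefully instead of crashing the caller.
val resilientPrediction: Future[Double] =
  predict(Vector.empty).recover { case _: IllegalArgumentException => 0.0 }

val result = Await.result(resilientPrediction, 2.seconds)
```

The same compose-and-recover style scales up from toy functions like this one to the Akka- and Spark-based components built later in the book.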
How this book is organized
This book is organized into three parts. Part 1 introduces the overall motivation of the book and some of the tools you’ll use:
Chapter 1 introduces machine learning, reactive systems, and the goals of reactive machine learning.
Chapter 2 introduces three of the technologies the book uses: Scala, Spark, and Akka.
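As a hint of the "reacting to uncertainty" theme chapter 2 opens with, here's a small sketch in plain Scala (my own illustration, with hypothetical names; the book's examples differ): missing or malformed raw data is modeled explicitly with Option and Try rather than with nulls and exceptions.

```scala
import scala.util.Try

// Raw readings may be missing or malformed; Try/Option make that explicit.
def parseReading(raw: String): Option[Double] =
  Try(raw.trim.toDouble).toOption

val readings = List("42.0", "", "3.14", "oops")

// Invalid readings simply drop out; no null checks, no try/catch.
val valid = readings.flatMap(parseReading)
val mean  = if (valid.isEmpty) None else Some(valid.sum / valid.size)
```

Encoding uncertainty in the types like this is the starting point for the larger reactive patterns the rest of the book builds on.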
Part 2 forms the bulk of the book. It proceeds component by component, helping you to deeply understand all the things a machine learning system must do, and how you can do them better using reactive techniques:
Chapter 3 discusses the challenges of collecting data and ingesting it into a machine learning system. As part of that, it introduces various concepts around handling uncertain data. It also goes into detail about how to persist data, focusing on properties of distributed databases.
Chapter 4 gets into how you can extract features from raw data and the various ways in which you can compose this functionality.
Chapter 5 covers model learning. You’ll implement your own model learning algorithms and use library implementations. It also covers how to work with model learning algorithms from other languages.
Chapter 6 covers a range of concerns related to evaluating models once they’ve been learned.
Chapter 7 shows how to take learned models and make them available for use. In the service of this goal, this chapter introduces Akka HTTP, microservices, and containerization via Docker.
Chapter 8 is all about using machine learned models to act on the real world. It also introduces an alternative to Akka HTTP for building services: http4s.
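To sketch the flavor of chapter 4's feature generators and feature-set composition (a simplified illustration with hypothetical types — `Instance`, `FeatureGenerator`, and both generators are mine, not the book's): each generator maps an instance to named feature values, and a feature set is just the merge of several generators' outputs.

```scala
// A hypothetical instance type: one document's words.
case class Instance(words: List[String])

// Each generator produces named feature values for an instance.
trait FeatureGenerator {
  def generate(i: Instance): Map[String, Double]
}

object WordCount extends FeatureGenerator {
  def generate(i: Instance) = Map("wordCount" -> i.words.size.toDouble)
}

object AvgWordLength extends FeatureGenerator {
  def generate(i: Instance) =
    Map("avgWordLength" ->
      (if (i.words.isEmpty) 0.0
       else i.words.map(_.length).sum.toDouble / i.words.size))
}

// Composing a feature set is just merging the generators' outputs.
def featureSet(gens: List[FeatureGenerator], i: Instance): Map[String, Double] =
  gens.flatMap(_.generate(i)).toMap

val features = featureSet(List(WordCount, AvgWordLength), Instance(List("nom", "data")))
```

Structuring feature code as small, composable units like this is what lets the pipeline chapters swap, add, and test features independently.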
Finally, part 3 introduces a few more concerns that become relevant once you’ve built a machine learning system and need to keep it running and evolve it into something better:
Chapter 9 shows how to build Scala applications using sbt. It also introduces concepts from continuous delivery.
Chapter 10 shows how to build artificially intelligent agents of various levels of complexity as an example of system evolution. It also covers more techniques for analyzing the reactive properties of a machine learning system.
How should you read this book? If you have good experience in Scala, Spark, and Akka, then you might skip chapter 2. The heart of the book is the journey through the various system components in part 2. Although they’re meant to stand alone as much as possible, it will probably be easiest to follow the flow of the data through the system if you proceed in order from chapter 3 through chapter 8. The final two chapters are separate concerns and can be read in any order (after you’ve read part 2).
Code conventions and downloads
This book contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text.
In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.
The code used in the book can be found on the book’s website, www.manning.com/books/machine-learning-systems, and in this Git repository: http://github.com/jeffreyksmithjr/reactive-machine-learning-systems.
Book forum
Purchase of Machine Learning Systems includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://forums.manning.com/forums/machine-learning-systems. You can also learn more about Manning’s forums and the rules of conduct at https://forums.manning.com/forums/about.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking him some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
Other online resources
For more information about Scala and pointers to various resources on how to learn the language, the language website (www.scala-lang.org) is a good place to start.