Testing Blog
Online Machine Learning Testing == Extreme Testing
Monday, November 24, 2008
Posted by Alek Icev
As you may know, our core vision is to build "the perfect search engine that understands exactly what you mean and gives back exactly what you want." In order to do that we learn from our data, we learn from the past, and we love Machine Learning. Every day we are trying to answer questions like the following.
Is this email spam?
Is this search result relevant?
What product category does that query belong to?
What is the ad that users are most likely to click on for the query “flowers”?
Is this click fraudulent?
Is this ad likely to result in a purchase (not merely a click)?
Is this image pornographic?
Does this page contain malware?
Should this query bring up a maps onebox?
Solving many of these problems requires Machine Learning techniques. For all of them we can build prediction models that learn from the past and try to give the most precise answers to our users. We use a variety of Machine Learning algorithms at Google, and we are experimenting with numerous old and new advancements in this field in order to find the most accurate, fast, and reliable solution for each of the problems that we are attacking. Of course, one of the biggest challenges that we are facing in the Test Engineering community is how we are going to test these algorithms. The amount of data that Google generates goes beyond all of the known boundaries of the environments where current Machine Learning solutions were crafted and tested. We want to open a discussion around ideas for how to test different online machine learning algorithms. From time to time we will present an algorithm along with some ideas on how to test it, and solicit feedback from the wider audience, i.e. try to build a wisdom of the crowds over the testing ideas.
So let's look at the Stochastic Gradient Descent algorithm.
Each weight is updated as

w_i ← w_i + η · (t − f(z)) · x_i

where the X_i are the input values and the W_i are the importance factors (weights) for each value X_i. A positive weight means that risk factor increases the probability of the outcome, while a negative weight means that risk factor decreases the probability of that outcome. t is the target output value, and η is the learning rate (the role of the learning rate is to control the level to which the weights are modified at every iteration). f(z) is the output generated by a function that maps a large input domain onto a small set of output values; in this case f(z) is the logistic function

f(z) = 1 / (1 + e^(−z)), with z = x_0·w_0 + x_1·w_1 + x_2·w_2 + ... + x_k·w_k
The logistic function has nice characteristics: it can take any input and squash it to a value between 0 and 1, which is ideal for predicting probabilities of events that depend on multiple factors (X_i), each with a different importance weight (W_i).
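To make the moving parts concrete, here is a minimal Python sketch of the model and update described above: a logistic output f(z) plus a stochastic weight update driven by the error (t - f(z)). The names (logistic, predict, sgd_update, LEARNING_RATE) and the toy training loop are purely illustrative assumptions, not code from any Google system.

import math
import random

LEARNING_RATE = 0.1  # eta: controls how much each example adjusts the weights


def logistic(z):
    """The logistic function: squashes any real-valued z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """f(z), where z = x_0*w_0 + x_1*w_1 + ... + x_k*w_k."""
    z = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return logistic(z)


def sgd_update(weights, x, t):
    """One stochastic step: nudge every weight by eta * (t - f(z)) * x_i."""
    error = t - predict(weights, x)
    return [w_i + LEARNING_RATE * error * x_i
            for w_i, x_i in zip(weights, x)]


# Toy usage: the target depends only on the first input factor, so its
# weight ends up positive after training.
random.seed(0)
weights = [0.0, 0.0]
for _ in range(1000):
    x = [random.choice([0.0, 1.0]), random.choice([0.0, 1.0])]
    weights = sgd_update(weights, x, t=x[0])
print(weights)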
Stochastic Gradient Descent converges quickly toward a minimum of the error (E) that the function makes on its predictions, and even when the error surface has multiple local minima, the stochastic updates help it keep moving toward the global minimum of the prediction error. So let's go back to the real online world, where we want to give answers (predictions) to our users in milliseconds, and ask how we are going to design automated tests for the Stochastic Gradient Descent algorithm embedded in a live online prediction system. The environment is agile and dynamic: the code is being changed every hour, you want your tests to run on a 24/7 basis, and you want to detect errors upstream in the development process, but you don't want to block development with tests that run for days. On the other side, you want to release new features fast, but the release process has to be error free (imagine a world with Google being down for 5 minutes; that is a global catastrophe, isn't it?!).
So let’s look at some of the test strategies:
Should we try to train the model (the set of importance factors) and test it with a subset of the training data? What if that takes far more than hours, maybe days? Should we try to reduce the set of input factors (X_i) and get convergence (E → 0) on the reduced model?
Should we try to reduce the training data set (the variety of values of X fed to the algorithm), keep the original model, and get convergence at any price? Should we be happy with reducing both the model size and the training set? Do we need to worry about over-fitting in the test environment? Given that the original data is online data and evolves fast, are we going to be satisfied with a fixed test data set, or should we change the input test data frequently? What are the triggers that would make us do so? What else should we do? One such convergence check is sketched below.
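As a very first, concrete starting point, here is a minimal sketch (in Python's unittest style) of one of the strategies above: train on a heavily reduced model with a tiny fixed training set and assert that the prediction error converges below a threshold within a bounded number of epochs. The class name, training data, learning rate, threshold, and epoch budget are all illustrative assumptions, not Google's actual test infrastructure.

import math
import unittest


class SgdConvergenceTest(unittest.TestCase):

    def test_converges_on_reduced_model_and_data(self):
        # Reduced model (two input factors) and a tiny, fixed training set:
        # the outcome simply follows the first factor.
        data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
        weights = [0.0, 0.0]
        eta = 0.1  # learning rate

        for _ in range(20000):  # bounded, so the test cannot run for days
            total_error = 0.0
            for x, t in data:
                z = sum(w * v for w, v in zip(weights, x))
                prediction = 1.0 / (1.0 + math.exp(-z))  # logistic f(z)
                error = t - prediction
                total_error += abs(error)
                weights = [w + eta * error * v for w, v in zip(weights, x)]
            if total_error < 0.3:  # E is close enough to 0 for this sketch
                break
        else:
            self.fail('SGD did not converge on the reduced model and data')


if __name__ == '__main__':
    unittest.main()

The same skeleton generalizes to the other questions: swap in a reduced or periodically refreshed data set, and tighten or relax the error threshold and epoch budget to keep the test within the time you are willing to spend in continuous builds.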
Drop us a note, all ideas are more than welcome.
Intro to Ad Quality Test Challenges
Saturday, May 24, 2008
Posted by Alek Icev
I'd like to take a second and introduce you to the team testing the ads ranking algorithms. We'd like to think that we had a hand in the web's shift to a "content meritocracy". As you know, Google search results are unbiased by human editors, and we don't allow buying a spot at the top of the results list. This idea builds trust with users and allows the community to decide what's important.
Recently, we started applying the same concept to online advertising. We asked ourselves how to bring the same level of "content meritocracy" to online advertising, where everybody pays to have ads displayed on Google and on our partner sites. In other words, we needed to change a system that was predominantly driven by human influence into one that builds its merit based on feedback from the community. The idea was that we would penalize the ranking of paid ads in several circumstances: if few users were clicking on a particular ad, if an ad's landing page was not relevant, or if users didn't like an ad's content. We want to provide our users with absolutely the most relevant ads for their click. In order to make our vision a reality we are building one of the largest online, real-time machine learning labs in the world. We learn from everything: clicks, queries, ads, landing pages, conversions... hundreds of signals.
The Google Ad Prediction System brought new challenges to Test Engineering. We needed to build the abstraction layers and metrics systems that allow us to understand whether the system is organically getting better or regressing. Put another way, we started off lacking the decision tree or perceptron that a bank or credit card company has embedded in its risk analysis, the neural net behind speech recognition, or the latest tweaks on expectation-maximization algorithms needed to predict protein transcription in cells. The amount and variety of data that the Google Ad Prediction models learn from is immense. The time allowed to make a prediction is counted in milliseconds. The amount of computing resources, ads databases, and infrastructure needed to serve predictions for every ad shown today is beyond imagination. Our challenge is to train and test learning models that span clusters of servers and databases, to simulate ads traffic, and to have everything compiled and running from the latest code changes submitted to the huge source depot. And the icing on the cake is to run all of that on a 24/7 schedule.
On top of all the technical challenges, we are also challenging the industry's definition of "testing": we are believers in automated tests that are incorporated upstream into the development process and run on a continuous basis. The Ads Quality Test Engineering group at Google works on bleeding-edge testing infrastructure to test, simulate, and train Google's Ads Prediction systems in real time.