So you’ve got a nice group of players interested in your game. Maybe you’ve even got a possibility of profit coming up! Now the challenge is to keep your success rolling along. You need to identify ways to reach out to customers, and figure out which players could benefit from a promotional offering. It’s time to develop a regression model for our data!

## An Introduction To Regression for Business Analytics

I won’t beat around the bush: there’s a lot to know about regression. It was a mathematical technique cooked up by some of the smartest mathematicians ever – including Gauss, who used it to forecast the locations of planets – so this is not an easy field. But for today’s article, I’m going to just get you started.

First off, most companies have no problem generating ratios. It’s very easy to say:

- “23% of the people who visit our site launch the game.”
- “5.6% of people who play our game purchase something.”
- “The majority of our revenue comes from 5% of the paying players.”

Know what? In most cases, that kind of simple math is enough to get the job done.

*Lesson #1 of Business Analytics: Use the simplest tool that works.*

Why is this the first lesson? Because complex tools are easy to screw up. Feynman once said, “The first principle is that you must not fool yourself, and you are the easiest person to fool.” Using complex tools can create complex and subtle problems that are difficult to anticipate and detect.

## When Do We Want Regression?

Most people get up to the point where they’re doing A/B testing – frankly, this is the best case scenario for “ratio” modelling. You do two tests, A and B. A results in 5% increase in sales, and B results in 6% increase in sales. Therefore B is better!

But ratios become very difficult when you have lots of interdependent variables. Let’s say you are trying to figure out why players cancel your game. You’ve got a hunch that you can predict when a player will cancel by looking at some of a few potential factors, but you don’t know for sure which one is the most relevant.

For example, let’s say we are looking at the number of times a player logged in, the duration of play, the number of guildmates who cancelled recently, the amount of EXP they gained, and the number of achievements they gained.

Modelling all those variables using ratios would take forever! Some of those variables are discrete, but most of them are continuous. You’d have to slot them into buckets (1-5 achievements, 6-10, 11-15, etc) and grade each bucket separately! You’d have to create ratios for every possible permutation of variables and compare them in a gigantic matrix. Blech! There’s got to be a better way.

Well, that’s where regression modelling comes in.

## How Does Regression Work?

I won’t proclaim that I’m a math teacher, so let me describe this in a way that a casual user can appreciate it. A regression model assumes that all your independent variables have some influence on the target (called the dependent variable).

But – and this is important – In order to get started, you first need to come up with a theory of how you think the variables work. Without a theory, your work will be blind and your results may not show anything at all!

This is the great part: if you come up with a theory that doesn’t work, regression modeling will help you confirm it. The fact that regression models can also give negative results can prevent you from spending tons of time researching data that isn’t useful (or is misleading).

Getting back to our model: Let’s presume that each one of these variables has some influence on whether a player decides to cancel. Using the most common kind of regression, Ordinary Least Squares (OLS), we will assume that we can construct a basic algebraic equation that helps us decide if a player will cancel or not. Using OLS, our theory looks like this in algebra:

*(user cancelled) = x + (y _{1} * logins) + (y_{2} * playtime) + (y_{3} * friends_cancelled) + (y_{4} * exp_gained) + (y_{5} * achievements)*

This is exactly the kind of algebra that a computer can solve in no time.

First, let’s go to your data team and ask them to provide a dump of information. But before we do, I need to remind you that __it is vitally important that we get an unbiased sample of data__. A common rookie mistake is to say “Gee, I want to figure out what makes people cancel, so let’s run a report on all the people who cancelled.” That’s bad because it leads to selection bias.

The way to avoid selection bias is to pretend you have no knowledge of the results of the study beforehand. Pretend you live behind the veil of ignorance. Ask your data guys,

“Could you run a report that gives me logins, playtime, friends cancelled, exp_gained, and achievements for all players for the month of July? The report should cover only users who have began playing before July 1st, and should exclude anyone who cancelled in July. Oh – and add as the last column a 1 if the user cancelled during the first week of August or a 0 if they did not.”

The reason this query works is that it meets a few requirements:

- Every user who is involved in this data set has had the same measurements applied to them. Every user who participated in this study supplied an entire month’s worth of data.
- Our dependent variable, “cancelled in August”, is entirely separate from the independent variables.
- Ideally, we’ll get lots and lots of rows of results. The more rows we get, the better our regression software can help us understand our variables.

This is good enough to get started. I’ll pretend that you got a report that looks something like this fake data I cooked up. Next, we need to get the software involved.

## Regression Analysis Software

Next, we need Let’s say you’re rich and your company overflows with money. In that case, buy SPSS, Stata, or Mathematica. Heck, get your company to send you to graduate school! But for the rest of us, there’s a very fun useful open source package called “GRETL”, which is a great place to start for someone who wants to learn. Go download gretl from sourceforge, then download my test dataset, and let’s get started.

First, save your report in CSV format and launch Gretl. Select File | Open Data | Import | text/CSV. Specify the data delimiter and choose the file.

Whoah! All of a sudden, Gretl asks a question “Do you want to give the data a time-series or panel interpretation?” Let’s tell it no for now; time series and panels are topics for a different lesson, maybe not an introduction. Then again, even if I have a potentially complex problem, I’ll try modeling it as a simple one first and only move to a more complex model once the simple one fails.

You should now see a screen that looks like this. It lists seven variables, including an auto-generated constant (basically a row number).

So let’s get modelling! From the top menu, choose Model | Ordinary Least Squares. We now need to tell GRETL about our theory. For the dependent variable, select “Cancelled_”; and for the independent variables choose everything else, then click OK.

You’ll probably see a screen that looks a lot like this. That’s a lot of text and complicated numbers. How can we make sense of it all?

For a beginner, there are two things you should look for in this chart. The little asterisks next to each row of data are a little visual hint as to which variables are most useful – the more stars, the more useful. Second, look at the bottom where it says “p-value was highest for playtime”. This is a suggestion to tell you which variables should be omitted from your model. In this case, the math says that playtime just doesn’t matter – we can’t tell if a customer is going to cancel by looking at their playtime.

In general, any variable with a p-value close to 1 (or lacking stars) should probably be dropped from your model.

Why is that the case? I don’t know in advance; this is where your theory comes in handy. It may be that some people log on a lot when they’re trying to decide whether or not to cancel, whereas others just fade away and forget about the site. You won’t know until you start gathering some raw-in-the-field research. This is something to bring to your game designer or community manager! Eventually, you might discover something fun, like perhaps there are two different kinds of playtime and only one kind is a reliable indicator of cancelling. But for the moment, let’s omit this flawed variable and move on.

To re-run the model without the bad variables, select from the main menu Test | Omit Variables; then select both “playtime” and “experience gained” to be omitted. Click OK. You’ll see another screen like this:

Now you’ve got an awesome model with some really useful variables. Every one of your variables has a really low p-value. The actual algebraic formula you created is this:

*(likelihood of cancelling) = 1.31132 – (0.0470642 * logins) + (0.0567763 * friends_cancelled) – (0.0795353 * achievements)*

So how can we make this work in practice? Let’s graph the output of our formula and see how well it works. From the main menu, select Graphs | Fitted, actual plot | Actual vs Fitted. You’ll see this chart:

Your model basically scores people who actually cancel as 0.6 and higher; and people who don’t cancel as 0.4 and lower. Based on this model, you might want to start offering promotional discounts or freebies to people who score above 0.6 – maybe giving really good incentives if the customer has spent lots of money in the past!

## Next Steps

There’s so much you can do with regression – I want to encourage you to learn, but frankly some parts of it can get daunting and aren’t easily taught. The Khan Academy has some good regression videos, but most of the wikipedia articles on regression have tons of complex math in them and are a bit impenetrable to novices. Gretl has a good manual too, so good luck and happy regressing!