Thursday, March 29, 2012

Multiple Linear Regression (Multivariate Analysis)

Here's your process:

[Figure: generic black-box process]
It's a black box. All you know is that you have multiple process inputs (X) and at least one process output (Y) that you care about. Multivariate analysis is the method by which you analyze how Y varies with your multiple inputs (x1, x2,... xn). There are a lot of ways to go about figuring out how Y relates to those inputs.

One way to go is to turn that black box into a transparent box, where you try to understand the fundamentals from first principles. Say you identify x1 as cell growth and you believe that your cells grow exponentially; you can then try to apply an equation like $Y = Y_0 e^{\mu x_1}$.
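As a rough sketch of what the first-principles route looks like in practice, here is one way to estimate Y0 and µ from data by taking logs, since ln Y = ln Y0 + µ·x1 is a straight line. The data below are simulated, and the "true" values (Y0 = 0.5, µ = 0.03) are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: x1 values and an output that grows exponentially with x1.
rng = np.random.default_rng(0)
x1 = np.linspace(0, 48, 9)
Y0_true, mu_true = 0.5, 0.03           # assumed values, for illustration only
Y = Y0_true * np.exp(mu_true * x1) * np.exp(rng.normal(0, 0.02, x1.size))

# ln(Y) = ln(Y0) + mu * x1, so a straight-line fit on ln(Y) recovers both.
mu_hat, lnY0_hat = np.polyfit(x1, np.log(Y), 1)
Y0_hat = np.exp(lnY0_hat)

print(f"mu ~ {mu_hat:.4f}, Y0 ~ {Y0_hat:.3f}")
```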

But this is large-scale manufacturing. You don't have time for that. You have to supply management with an immediate solution followed by a medium-term solution. What you can do is assume that Y varies linearly with each input.

$$y = mx + b$$
Just like we learned in 8th grade. How can we just say that Y relates to X linearly? Well, for one, I can say whatever I want (it's a free country). Secondly, all smooth curves (exponential, polynomial, logarithmic, asymptotic...) are approximately linear over small ranges... you know, like the proven acceptable range in which you ought to be controlling your manufacturing process.
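To see the "linear over small ranges" claim in numbers, here is a minimal sketch that fits a straight line to an exponential curve over a wide range and over a narrow one, then compares the worst-case miss. The ranges (0 to 5 and 2.0 to 2.2) and the helper name are arbitrary choices for illustration, not anything from a real process.

```python
import numpy as np

def linear_fit_max_rel_error(f, lo, hi, n=50):
    """Fit y = m*x + b to f on [lo, hi]; return the worst residual
    as a fraction of the spread of y over that range."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    m, b = np.polyfit(x, y, 1)
    residual = y - (m * x + b)
    return np.max(np.abs(residual)) / (y.max() - y.min())

wide = linear_fit_max_rel_error(np.exp, 0.0, 5.0)     # wide operating range
narrow = linear_fit_max_rel_error(np.exp, 2.0, 2.2)   # tight operating range

print(f"wide range:   worst error = {wide:.1%} of the output spread")
print(f"narrow range: worst error = {narrow:.1%} of the output spread")
```

The straight line misses badly over the wide range and only slightly over the narrow one, which is the whole argument for assuming linearity inside a tightly controlled range.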

Assuming everything is linear keeps things simple and happens to be rooted in manufacturing reality. What next?

$$y = m_1 x_1 + m_2 x_2 + b$$
Next you start adding more inputs to your equation... applying a different coefficient for each new input. And if you think that a few of your inputs may interact, you can add their interactions like this:
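Here is a small sketch of fitting the two-input version, y = m1·x1 + m2·x2 + b, by ordinary least squares with NumPy. The data are simulated and the "true" coefficients (2.0, -1.5, 4.0) are invented for the example; with real data you would load x1, x2 and y from your own table.

```python
import numpy as np

# Simulated process data (assumption: made-up coefficients plus a little noise).
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 * x1 - 1.5 * x2 + 4.0 + rng.normal(0, 0.1, n)

# One column per input, plus a column of ones for the intercept b.
X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
m1, m2, b = coef

print(f"m1 = {m1:.2f}, m2 = {m2:.2f}, b = {b:.2f}")
```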

$$y = m_1 x_1 + m_2 x_2 + m_{12} x_1 x_2 + b$$
You achieve interactions by multiplying the inputs and giving that product its own coefficient. So now you - the big nerd - have this humongous equation that you need solving. You don't know:
  • Which inputs (x's) to put in the equation
  • What interactions (x1 * x2) to put in the equation
  • What coefficients (m's) to put in the equation

What you're doing with multiple linear regression is picking the right inputs and interactions, then letting your statistical software package brute-force the coefficients (m's) so that the equation fits the data you have with the least error.
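As an illustration of "least error", the sketch below fits the same simulated data twice, once with main effects only and once with an x1·x2 interaction column, and compares the residual sum of squares. Everything here (the data, the 0.8 interaction coefficient, the `fit` helper) is hypothetical; a statistics package would do the same job with more diagnostics.

```python
import numpy as np

# Simulated data where the output really does depend on an x1*x2 interaction.
rng = np.random.default_rng(42)
n = 40
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 * x1 - 1.5 * x2 + 0.8 * x1 * x2 + 4.0 + rng.normal(0, 0.1, n)

def fit(columns):
    """Least-squares fit; returns (coefficients, residual sum of squares)."""
    X = np.column_stack(columns + [np.ones(n)])   # last column is the intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss

coef_main, rss_main = fit([x1, x2])               # no interaction term
coef_int, rss_int = fit([x1, x2, x1 * x2])        # interaction = product column
m12 = coef_int[2]

print(f"RSS without interaction: {rss_main:.1f}")
print(f"RSS with interaction:    {rss_int:.1f}  (m12 ~ {m12:.2f})")
```

A lower residual sum of squares on its own doesn't prove the interaction is real, since adding any column can only lower it; that is why choosing which terms go into the model matters.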

Here's the thing: The fewer rows you have in your data table, the fewer inputs you get to throw into your equation. If you have 10 samples, but 92 inputs, you're going to have to be very selective with what you try in your model.
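To make the rows-versus-inputs point concrete, here is a sketch with 10 samples and 92 candidate inputs where everything is random noise, so there is nothing real to find. The setup is entirely made up; the point is only to show what happens to R² as you throw in more inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_inputs = 10, 92
X_all = rng.normal(size=(n_samples, n_inputs))   # candidate inputs: pure noise
y = rng.normal(size=n_samples)                   # output unrelated to any input

def r_squared(k):
    """R^2 of a least-squares fit using the first k inputs plus an intercept."""
    X = np.column_stack([X_all[:, :k], np.ones(n_samples)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

r2_by_k = {k: r_squared(k) for k in (1, 3, 6, 9)}
for k, r2 in r2_by_k.items():
    print(f"{k} inputs: R^2 = {r2:.3f}")
```

With 9 inputs plus an intercept there are 10 coefficients for 10 rows, so the fit passes through every point even though the inputs are meaningless. A perfect fit on too few rows tells you nothing.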

It's a tough job, but someone's got to do it. And when you finally do (i.e. explain the relationship between, say, cell culture titer and your cell culture process inputs), millions of dollars can literally roll into your company's coffers.

Your alternative is to hire Zymergi and skip that learning curve.

