# Duality and Lagrangian Duality

Duality states that a mathematical optimization problem can be subjected to equality and inequality constraints in the way that:

$$\min_x f(x) \quad \text{subject to} \quad g(x) \le 0, \quad h(x) = 0$$

Now, Lagrangian duality states that there exist multipliers $\lambda \ge 0$ and $\nu$ with which we can form the Lagrangian:

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \lambda\, g(x) + \nu\, h(x)$$

Maximizing the dual function $d(\lambda, \nu) = \min_x \mathcal{L}(x, \lambda, \nu)$ gives a lower bound on the primal optimum, and under strong duality the two optima coincide. Now, keep this in mind. Here’s what we’re going to explain in this post:

There exists a duality between Maximum Entropy and Likelihood.

Therefore:

A logistic regression model can be optimized using Maximum Likelihood.

Why? Because a logistic regressor is simply a binary Maxent model.
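The Lagrangian duality mentioned above can be checked numerically on a toy one-dimensional problem (the problem and all numbers below are invented for illustration): minimize `f(x) = x**2` subject to the inequality constraint `1 - x <= 0`.

```python
# Toy check of Lagrangian duality (invented example).
# Primal: minimize f(x) = x^2  subject to  1 - x <= 0  (optimum: x = 1, f = 1).
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x)  for lam >= 0.

def dual(lam):
    # For fixed lam, L is minimized at x = lam / 2 (set dL/dx = 2x - lam = 0),
    # giving the dual function d(lam) = lam - lam^2 / 4.
    x = lam / 2
    return x**2 + lam * (1 - x)

# Maximize the dual function over a grid of multipliers lam >= 0.
best_lam = max((i * 0.001 for i in range(4001)), key=dual)
dual_opt = dual(best_lam)

primal_opt = 1.0  # primal optimum: x = 1, f(x) = 1

print(best_lam, dual_opt)  # strong duality: dual optimum matches primal optimum
```

Here the dual maximum is attained at a finite multiplier, and its value equals the primal optimum, which is the strong-duality picture we rely on below.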

# Discrimination, Generation, and Likelihood

We have a set of features X and a set of targets y, distributed according to an unknown distribution. We wish to estimate the conditional probability `P(y|X)`; in other words, to build a probabilistic ML model from these data. There are two approaches.

• Discriminative approach: We use tools from learning theory to compute a hypothesis that models the conditional probability `P(y|X)` up to a small error.
• Generative approach: We estimate the probability distribution `P(X|y)` separately for every y. For instance, y could represent the attribute gender and x could represent the attribute height. In this case, according to Bayes’ rule:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}$$

# Maximum Likelihood

Imagine our goal is to estimate the distribution D of discrete data X. To motivate this notion, suppose D is distributed according to one of two possible density functions, q1 and q2. If you’re wondering what a density function is: a Probability Density Function describes a distribution over a continuous space, while a Probability Mass Function describes a distribution over a discrete space, assigning each outcome its probability.

Now, intuitively, if we observe that q1(xi) is typically much larger than q2(xi), we will tend to conclude that D is distributed according to q1.

In general, we consider a set of density functions Q. For q a member of Q, we define:

$$\mathcal{L}(X; q) = \prod_{i=1}^{m} q(x_i)$$

This is what we call the likelihood of X under q. Notice that if the samples are independent (the realization of one does not affect the realization of another), the likelihood is exactly the probability of observing the sample.
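As a small illustration of the intuition above (with two invented candidate distributions), we can compare the likelihood an observed sample receives under q1 and q2:

```python
import math

# Two invented candidate distributions over the outcomes {0, 1, 2}.
q1 = {0: 0.1, 1: 0.2, 2: 0.7}   # favors large values
q2 = {0: 0.7, 1: 0.2, 2: 0.1}   # favors small values

sample = [2, 2, 1, 2, 0, 2, 2, 1]  # observations: mostly large values

def likelihood(xs, q):
    # For independent observations, the likelihood is the product of probabilities.
    return math.prod(q[x] for x in xs)

print(likelihood(sample, q1), likelihood(sample, q2))
# q1 assigns this sample a much higher likelihood, so we prefer q1.
```

This is exactly the "pick the q under which the data looks most probable" principle that maximum likelihood formalizes.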

Now, the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its logarithm. (This is the same property behind the old logarithm tables which people used to multiply large numbers before the advent of calculators.) So we have the log-likelihood:

$$\log \mathcal{L}(X; q) = \sum_{i=1}^{m} \log q(x_i)$$

Maximizing this function is equivalent to minimizing its negation. So think of this as a loss function:

$$\ell(q) = -\frac{1}{m} \sum_{i=1}^{m} \log q(x_i)$$

And thereby we introduce the true risk of q as:

$$R(q) = -\mathbb{E}_{x \sim D}[\log q(x)] = -\sum_{x} D(x) \log q(x)$$

This is what we call cross entropy: like Shannon entropy, but measured between two distributions instead of within one. We can expand this equation like this:

$$R(q) = \mathrm{RE}(D \,\|\, q) + H(D)$$

Where RE denotes relative entropy (the KL divergence) and H is normal Shannon entropy. Since H(D) does not depend on q, minimizing the true risk amounts to minimizing the relative entropy between q and the true distribution D.
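The decomposition of cross entropy into entropy plus relative entropy can be verified numerically on a pair of invented toy distributions:

```python
import math

D = [0.5, 0.3, 0.2]   # the "true" distribution (invented)
q = [0.4, 0.4, 0.2]   # a model distribution (invented)

# Cross entropy H(D, q), Shannon entropy H(D), and relative entropy RE(D || q),
# all in nats.
cross_entropy = -sum(d * math.log(qi) for d, qi in zip(D, q))
entropy       = -sum(d * math.log(d) for d in D)
rel_entropy   =  sum(d * math.log(d / qi) for d, qi in zip(D, q))

print(cross_entropy, entropy + rel_entropy)  # the two sides agree
```

Note that the relative entropy term is non-negative and vanishes exactly when q equals D, which is why minimizing cross entropy drives q toward the true distribution.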

# Maximum Entropy

Another way of approaching the above problem is to use the method of maximum entropy. Here, we start with the fact that we can approximate the true expected value of each feature $f_j$ by its empirical average taken over the given samples. In other words:

$$\mathbb{E}_D[f_j] \approx \frac{1}{m} \sum_{i=1}^{m} f_j(x_i)$$

This leads to the idea of finding a distribution p which satisfies the constraints:

$$\mathcal{P} = \Big\{ p : \sum_{x} p(x)\, f_j(x) = \frac{1}{m} \sum_{i=1}^{m} f_j(x_i) \;\text{ for all } j \Big\}$$

Now, we calculate its distance from the uniform distribution u over the N possible outcomes by calculating the relative entropy, which works out to:

$$\mathrm{RE}(p \,\|\, u) = \log N - H(p)$$

Since the logarithm of N is just a constant, maximizing the entropy H(p) is the same as minimizing the relative entropy to the uniform distribution, and we can write maximum entropy as a dual problem: maximizing the likelihood of the sample over the exponential family Q. Where:

$$Q = \Big\{ q_\lambda : q_\lambda(x) = \frac{1}{Z_\lambda} \exp\Big( \sum_j \lambda_j f_j(x) \Big) \Big\}$$

# Duality between Maxent and Likelihood
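To make the dual concrete, here is a toy sketch (a single feature `f(x) = x` on the four-point space `{0, 1, 2, 3}`, with an invented empirical average): we tune the multiplier λ of a Gibbs-form distribution until the model's expected feature matches the empirical average, which is exactly the maxent constraint.

```python
import math

X = [0, 1, 2, 3]
empirical_mean = 2.1   # assumed empirical average of the feature (invented)

def model_mean(lam):
    # Expected feature value under the Gibbs distribution q_lam(x) ~ exp(lam * x).
    weights = [math.exp(lam * x) for x in X]
    z = sum(weights)  # the normalizer Z_lam
    return sum(x * w / z for x, w in zip(X, weights))

# model_mean is increasing in lam, so bisection finds the matching multiplier.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if model_mean(mid) < empirical_mean:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

print(lam, model_mean(lam))  # the constraint E_q[f] = empirical average is met
```

The resulting q_λ is simultaneously the maximum-entropy distribution subject to the feature constraint and the maximum-likelihood member of the exponential family.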

We finally arrive at this theorem:

Let q* be a probability distribution. Then, the following are equivalent:

1. q* maximizes entropy over the constraint set P.
2. q* maximizes the likelihood of the sample X over the exponential family Q.
3. q* lies in the intersection of P and (the closure of) Q.

And there we are! The relationship between these two explained: the maximum-entropy problem and the maximum-likelihood problem over the corresponding exponential family share the same solution. In particular, a binary Maxent model over features of (x, y) is exactly a logistic regressor, which is why logistic regression can be trained by maximum likelihood.
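Finally, this equivalence is easy to observe on a logistic regression fitted by maximum likelihood: at the optimum, the model's feature expectations match the empirical ones, which is the maxent moment-matching constraint. A toy sketch with invented one-dimensional data:

```python
import math

# Invented, non-separable toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0,   1,   0,   1,   1  ]   # binary targets

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient ascent on the log-likelihood of a logistic regressor.
w, b = 0.0, 0.0
for _ in range(20000):
    ps = [sigmoid(w * x + b) for x in xs]
    grad_w = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
    grad_b = sum(y - p for y, p in zip(ys, ps))
    w += 0.1 * grad_w
    b += 0.1 * grad_b

# Moment matching: at the optimum, sum_i f(x_i) * (y_i - p_i) = 0 per feature.
residual = sum(x * (y - sigmoid(w * x + b)) for x, y in zip(xs, ys))
print(residual)  # near zero: model feature expectations match empirical ones
```

The vanishing residual is the maxent constraint in disguise: the maximum-likelihood logistic regressor reproduces the empirical feature averages, exactly as the theorem predicts.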