# Duality and Lagrangian Duality

Duality states that a mathematical optimization problem can be subjected to equality and inequality constraints in the way that:

$$\min_x f(x) \quad \text{subject to} \quad g(x) \le 0, \quad h(x) = 0$$

Now, Lagrangian duality states that there exist multipliers $\lambda \ge 0$ and $\nu$ with which we can form the Lagrangian:

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \lambda\, g(x) + \nu\, h(x)$$

Maximizing the dual function $d(\lambda, \nu) = \min_x \mathcal{L}(x, \lambda, \nu)$ gives a lower bound on the primal optimum, and under strong duality the two optima coincide. Now, keep this in mind. Here’s what we’re going to explain in this post:

There exists a duality between Maximum Entropy and Likelihood.

Therefore:

A logistic regression model can be optimized using Maximum Likelihood.

Why? Because a logistic regressor is simply a binary Maxent model.
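The Lagrangian duality mentioned above can be checked numerically on a toy one-dimensional problem (the problem and all numbers below are invented for illustration): minimize `f(x) = x**2` subject to the inequality constraint `1 - x <= 0`.

```python
# Toy check of Lagrangian duality (invented example).
# Primal: minimize f(x) = x^2  subject to  1 - x <= 0  (optimum: x = 1, f = 1).
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x)  for lam >= 0.

def dual(lam):
    # For fixed lam, L is minimized at x = lam / 2 (set dL/dx = 2x - lam = 0),
    # giving the dual function d(lam) = lam - lam^2 / 4.
    x = lam / 2
    return x**2 + lam * (1 - x)

# Maximize the dual function over a grid of multipliers lam >= 0.
best_lam = max((i * 0.001 for i in range(4001)), key=dual)
dual_opt = dual(best_lam)

primal_opt = 1.0  # primal optimum: x = 1, f(x) = 1

print(best_lam, dual_opt)  # strong duality: dual optimum matches primal optimum
```

Here the dual maximum is attained at a finite multiplier, and its value equals the primal optimum, which is the strong-duality picture we rely on below.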

# Discrimination, Generation, and Likelihood

We have a set of features X and a set of targets y, distributed according to an unknown distribution. We wish to estimate the conditional probability `P(y|X)`; in other words, to build a probabilistic ML model from these data. There are two approaches.

• Discriminative approach: We use tools from learning theory to compute a hypothesis that models the conditional probability `P(y|X)` up to a small error.
• Generative approach: We estimate the probability distribution `P(X|y)` separately for every y. For instance, y could represent the attribute gender and x could represent the attribute height. In this case, according to Bayes’ rule:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}$$

# Maximum Likelihood

Imagine our goal is to estimate the distribution D of discrete data X. To motivate this notion, suppose D is distributed according to one of two possible density functions, q1 and q2. If you’re wondering what a density function is: a Probability Density Function describes a distribution over a continuous space, while a Probability Mass Function describes a distribution over a discrete space, assigning each outcome its probability.

Now, intuitively, if we observe that q1(xi) is typically much larger than q2(xi), we will tend to conclude that D is distributed according to q1.

In general, we consider a set of density functions Q. For q a member of Q, we define:

$$\mathcal{L}(X; q) = \prod_{i=1}^{m} q(x_i)$$

This is what we call the likelihood of X under q. Notice that if the samples are independent (the realization of one does not affect the realization of another), the likelihood is exactly the probability of observing the sample.
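As a small illustration of the intuition above (with two invented candidate distributions), we can compare the likelihood an observed sample receives under q1 and q2:

```python
import math

# Two invented candidate distributions over the outcomes {0, 1, 2}.
q1 = {0: 0.1, 1: 0.2, 2: 0.7}   # favors large values
q2 = {0: 0.7, 1: 0.2, 2: 0.1}   # favors small values

sample = [2, 2, 1, 2, 0, 2, 2, 1]  # observations: mostly large values

def likelihood(xs, q):
    # For independent observations, the likelihood is the product of probabilities.
    return math.prod(q[x] for x in xs)

print(likelihood(sample, q1), likelihood(sample, q2))
# q1 assigns this sample a much higher likelihood, so we prefer q1.
```

This is exactly the "pick the q under which the data looks most probable" principle that maximum likelihood formalizes.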

Now, the logarithm is monotonically increasing, so maximizing a function is equivalent to maximizing its logarithm. (This is the same property behind the old logarithm tables which people used to multiply large numbers before the advent of calculators.) So we have the log-likelihood:

$$\log \mathcal{L}(X; q) = \sum_{i=1}^{m} \log q(x_i)$$

Maximizing this function is equivalent to minimizing its negation. So think of this as a loss function:

$$\ell(q) = -\frac{1}{m} \sum_{i=1}^{m} \log q(x_i)$$

And thereby we introduce the true risk of q as:

$$R(q) = -\mathbb{E}_{x \sim D}[\log q(x)] = -\sum_{x} D(x) \log q(x)$$

This is what we call cross entropy: like Shannon entropy, but measured between two distributions instead of within one. We can expand this equation like this:

$$R(q) = \mathrm{RE}(D \,\|\, q) + H(D)$$

Where RE denotes relative entropy (the KL divergence) and H is normal Shannon entropy. Since H(D) does not depend on q, minimizing the true risk amounts to minimizing the relative entropy between q and the true distribution D.
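The decomposition of cross entropy into entropy plus relative entropy can be verified numerically on a pair of invented toy distributions:

```python
import math

D = [0.5, 0.3, 0.2]   # the "true" distribution (invented)
q = [0.4, 0.4, 0.2]   # a model distribution (invented)

# Cross entropy H(D, q), Shannon entropy H(D), and relative entropy RE(D || q),
# all in nats.
cross_entropy = -sum(d * math.log(qi) for d, qi in zip(D, q))
entropy       = -sum(d * math.log(d) for d in D)
rel_entropy   =  sum(d * math.log(d / qi) for d, qi in zip(D, q))

print(cross_entropy, entropy + rel_entropy)  # the two sides agree
```

Note that the relative entropy term is non-negative and vanishes exactly when q equals D, which is why minimizing cross entropy drives q toward the true distribution.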

# Maximum Entropy

Another way of approaching the above problem is to use the method of maximum entropy. Here, we start with the fact that we can approximate the true expected value of each feature $f_j$ by its empirical average taken over the given samples. In other words:

$$\mathbb{E}_D[f_j] \approx \frac{1}{m} \sum_{i=1}^{m} f_j(x_i)$$

This leads to the idea of finding a distribution p which satisfies the constraints:

$$\mathcal{P} = \Big\{ p : \sum_{x} p(x)\, f_j(x) = \frac{1}{m} \sum_{i=1}^{m} f_j(x_i) \;\text{ for all } j \Big\}$$

Now, we calculate its distance from the uniform distribution u over the N possible outcomes by calculating the relative entropy, which works out to:

$$\mathrm{RE}(p \,\|\, u) = \log N - H(p)$$

Since the logarithm of N is just a constant, maximizing the entropy H(p) is the same as minimizing the relative entropy to the uniform distribution, and we can write maximum entropy as a dual problem: maximizing the likelihood of the sample over the exponential family Q. Where:

$$Q = \Big\{ q_\lambda : q_\lambda(x) = \frac{1}{Z_\lambda} \exp\Big( \sum_j \lambda_j f_j(x) \Big) \Big\}$$

# Duality between Maxent and Likelihood
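To make the dual concrete, here is a toy sketch (a single feature `f(x) = x` on the four-point space `{0, 1, 2, 3}`, with an invented empirical average): we tune the multiplier λ of a Gibbs-form distribution until the model's expected feature matches the empirical average, which is exactly the maxent constraint.

```python
import math

X = [0, 1, 2, 3]
empirical_mean = 2.1   # assumed empirical average of the feature (invented)

def model_mean(lam):
    # Expected feature value under the Gibbs distribution q_lam(x) ~ exp(lam * x).
    weights = [math.exp(lam * x) for x in X]
    z = sum(weights)  # the normalizer Z_lam
    return sum(x * w / z for x, w in zip(X, weights))

# model_mean is increasing in lam, so bisection finds the matching multiplier.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if model_mean(mid) < empirical_mean:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

print(lam, model_mean(lam))  # the constraint E_q[f] = empirical average is met
```

The resulting q_λ is simultaneously the maximum-entropy distribution subject to the feature constraint and the maximum-likelihood member of the exponential family.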

We finally arrive at this theorem:

Let q* be a probability distribution. Then, the following are equivalent:

1. q* maximizes entropy over the constraint set P.
2. q* maximizes the likelihood of the sample X over the exponential family Q.
3. q* lies in the intersection of P and (the closure of) Q.

And there we are! The relationship between these two explained: the maximum-entropy problem and the maximum-likelihood problem over the corresponding exponential family share the same solution. In particular, a binary Maxent model over features of (x, y) is exactly a logistic regressor, which is why logistic regression can be trained by maximum likelihood.
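Finally, this equivalence is easy to observe on a logistic regression fitted by maximum likelihood: at the optimum, the model's feature expectations match the empirical ones, which is the maxent moment-matching constraint. A toy sketch with invented one-dimensional data:

```python
import math

# Invented, non-separable toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0,   1,   0,   1,   1  ]   # binary targets

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain gradient ascent on the log-likelihood of a logistic regressor.
w, b = 0.0, 0.0
for _ in range(20000):
    ps = [sigmoid(w * x + b) for x in xs]
    grad_w = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
    grad_b = sum(y - p for y, p in zip(ys, ps))
    w += 0.1 * grad_w
    b += 0.1 * grad_b

# Moment matching: at the optimum, sum_i f(x_i) * (y_i - p_i) = 0 per feature.
residual = sum(x * (y - sigmoid(w * x + b)) for x, y in zip(xs, ys))
print(residual)  # near zero: model feature expectations match empirical ones
```

The vanishing residual is the maxent constraint in disguise: the maximum-likelihood logistic regressor reproduces the empirical feature averages, exactly as the theorem predicts.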