Daniel R. Kim, MD

Softmax Classifier Intuition

18 Nov 2016

Say we have some input $x_i$ and its score vector $z = f(x_i, W)$, with one real-valued score $f_j$ per class. We want to convert these scores into a probability distribution: the softmax function maps each score from $\mathbb{R}$ into $[0,1]$, interpreted as the probability of assigning the corresponding class to $x_i$, and the resulting probabilities sum to one.
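As a concrete reference, here is a minimal NumPy sketch of the softmax map (the function and variable names are my own, and the example scores are made up):

```python
import numpy as np

def softmax(z):
    """Map a vector of raw class scores z to a probability distribution."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([3.2, 5.1, -1.7])   # example scores f_j for three classes
p = softmax(z)
print(p, p.sum())                # each entry lies in [0, 1]; the entries sum to 1
```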

Once we have $z$ converted to a probability distribution, we can then calculate the information associated with the event of assigning the correct class to $x_i$, namely

\[L_i = -\log P(y_i | x_i; W) = -\log \frac{\exp(f_{y_i})}{\sum_j \exp(f_j)}.\]

We want the probability of correct classification to be close to $1$, and so we want to find the parameter $W$ that minimizes the event information $L_i$.
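As a sketch of this loss for a single example, assuming a linear score function $f(x_i, W) = W x_i$ (the post does not pin down $f$, so treat the shapes below as illustrative):

```python
import numpy as np

def softmax_loss_single(W, x_i, y_i):
    """Negative log-likelihood / cross-entropy loss L_i for one example.

    W   : (num_classes, num_features) weight matrix
    x_i : (num_features,) input vector
    y_i : integer index of the correct class
    """
    f = W.dot(x_i)                         # raw class scores z = f(x_i, W)
    f = f - np.max(f)                      # shift by the max; see the invariance discussed below
    probs = np.exp(f) / np.sum(np.exp(f))  # softmax distribution
    return -np.log(probs[y_i])             # L_i = -log P(y_i | x_i; W)
```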

Alternatively, $L_i$ is the cross-entropy between the “true” distribution $p=[0,\dots,0,1,0,\dots,0]$ (with the $1$ at index $y_i$) and the estimated softmax distribution $q$. Cross-entropy is the expected codeword length when an encoding optimized for $q$ is used to code transmissions from $p$, namely

\[H(p,q)=-\sum_x p(x) \log q(x) \geq H(p),\]

where the inequality comes from the fact that $q$’s codes are by definition not optimized for $p$. As an aside, the Kullback-Leibler divergence is the expected number of extra bits incurred by using $q$’s codewords instead of $p$’s optimal ones, namely

\[D_{KL}(p\|q) = H(p,q) - H(p).\]
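For a one-hot $p$, the entropy $H(p)$ is zero, so the cross-entropy, the KL divergence, and $L_i$ all coincide. A quick numerical check (the particular distributions here are made up):

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])        # "true" one-hot distribution, correct class y_i = 1
q = np.array([0.2, 0.7, 0.1])        # estimated softmax distribution

H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q); the zero terms drop out
H_p = 0.0                            # entropy of a one-hot distribution is 0
D_kl = H_pq - H_p                    # KL divergence D_KL(p || q)

print(H_pq, D_kl, -np.log(q[1]))     # all three equal -log q(y_i) ≈ 0.357
```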

Maximum-likelihood interpretation. Suppose there is some underlying parameter $W^*$ that gives the best probability of assigning the correct class to $x_i$ that we can expect. We want to find an estimate $W$ of $W^*$ that maximizes the likelihood of the correct-classification event (i.e., running the classifier once and observing $y_i$). The $\log$ function is monotonically increasing, so maximizing the likelihood is equivalent to maximizing the log-likelihood, which in turn is equivalent to minimizing the negative log-likelihood $L_i$.
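Written out over a training set of $N$ examples (assuming, as usual, that the examples are treated as independent):

\[\hat{W} = \arg\max_W \prod_{i=1}^N P(y_i | x_i; W) = \arg\max_W \sum_{i=1}^N \log P(y_i | x_i; W) = \arg\min_W \sum_{i=1}^N L_i.\]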

Softmax cares about the differences between scores, not their absolute values. Observe that for any constant $C > 0$,

\[\sigma(z)_i = \frac{e^{f_i}}{\sum_j e^{f_j}} = \frac{Ce^{f_i}}{C\sum_j e^{f_j}} = \frac{e^{f_i + \log C}}{\sum_j e^{f_j + \log C}}.\]

This means you can translate every score in $z$ by the same constant $\delta = \log C$ and get the same distribution $\sigma(z)$.
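A common practical use of this fact (standard, though not mentioned above) is to choose the shift so that the largest score becomes zero, which keeps the exponentials from overflowing:

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])   # scores this large make np.exp overflow to inf

# Shifting by the max gives the same distribution, since softmax is
# invariant to adding a constant to every score.
p = np.exp(f - np.max(f)) / np.sum(np.exp(f - np.max(f)))
print(p)                              # well-defined probabilities, ~[0, 0, 1] here
```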

Softmax probability values are technically not interpretable; only their relative order is. The regularization strength $\lambda$ influences the magnitude of the weights in $W$: stronger regularization shrinks them. If you scale down $W$ by a factor of $1/2$, say, the scores in $z = f(x_i, W)$ will halve, as will their differences. Exponentiation and normalization (to sum to one) will then make the probability distribution more diffuse than before (imagine squeezing points together along the graph of $e^x$), though the ordering of the probabilities will be the same.
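A quick illustration of this effect, reusing the softmax sketch from above (the scores are made up):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([1.0, -2.0, 0.0])
print(softmax(z))          # ~[0.71, 0.04, 0.26] -- relatively peaked
print(softmax(0.5 * z))    # ~[0.55, 0.12, 0.33] -- more diffuse, same ordering
```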