{{#addbodyclass:tag_math}}
{{stub}}
'''softmax''' (sometimes called softargmax {{comment|(except it's not really like [[argmax]])}}, normalized exponential function, and other things)
* takes a vector of numbers
:: (any scale)
* returns a same-length vector of probabilities
:: all in 0 .. 1
:: that sum to 1.0
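
In formula form (the standard definition, also on the Wikipedia page linked below), entry i of the output is the exponential of input i divided by the sum of the exponentials of all inputs:
: <math>\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}</math>

A minimal sketch of that in Python/numpy {{comment|(subtracting the max is only for numerical stability; it does not change the result)}}:
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    """Turn a vector of arbitrary reals into probabilities that sum to 1.0"""
    e = np.exp(z - np.max(z))   # shift by the max so the exponentials cannot overflow
    return e / e.sum()

print(softmax(np.array([1.0, 0.5, 0.1])))   # ~ [0.5 0.3 0.2], sums to 1.0
</syntaxhighlight>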




<!--
Note that it is ''not'' just normalization, nor is it just a way to bring out the strongest answer.
The exponent in its internals and the "will sum to 1.0" part mean things shift around in a non-linear way,
so even relative probabilities already in 0..1 and summing to 1.0 will change, e.g.
: softmax([1.0, 0.5, 0.1]) ~= 0.5, 0.3, 0.2
: softmax([0.5, 0.3, 0.2]) ~= 0.39, 0.32, 0.29
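
To make that concrete, a quick Python/numpy check of those numbers, with plain normalization (just divide by the sum) next to it for comparison:
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # max subtraction is only for numerical stability
    return e / e.sum()

def normalize(z):
    return z / z.sum()              # plain normalization, for comparison

a = np.array([1.0, 0.5, 0.1])
b = np.array([0.5, 0.3, 0.2])       # already in 0..1 and already summing to 1.0

print(np.round(softmax(a), 2))      # [0.5  0.3  0.2 ]
print(np.round(softmax(b), 2))      # [0.39 0.32 0.29]   (shifted around)
print(np.round(normalize(b), 2))    # [0.5  0.3  0.2 ]   (unchanged)
</syntaxhighlight>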


The name might suggest to you that it is a numerically smoothed maximum. It is not.


It is much closer to [[argmax]]:
a smooth approximation to the function whose value is the index of a vector's largest element.
(The term "softmax" is also used for the closely related LogSumExp function, which ''is'' a smooth maximum. For this reason, some prefer the more accurate term "softargmax", but "softmax" is conventional in machine learning.)
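
One way to see that distinction (a throwaway numpy sketch): scale the inputs up and softmax approaches a one-hot indicator of the argmax, while LogSumExp (divided by the same scale) approaches the max.
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logsumexp(z):
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([2.0, 1.0, 0.5])
for scale in (1, 5, 25):
    print(scale, np.round(softmax(scale * z), 3), round(logsumexp(scale * z) / scale, 3))
# prints roughly:
#  1 [0.629 0.231 0.14 ] 2.464
#  5 [0.993 0.007 0.001] 2.001
# 25 [1.    0.    0.   ] 2.0
# softmax heads toward the one-hot vector picking out index 0 (the argmax),
# logsumexp/scale heads toward max(z) = 2.0
</syntaxhighlight>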


The output can be read as a probability distribution.




If you squint, it is ''something'' like a [[sigmoid]] function (because this is a generalization of the [[logistic function]]),
but it is not directly comparable to transfer functions,
and you can't get an easy plot of it,
exactly ''because'' it takes multiple inputs.
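
(For the two-input case it does collapse to the [[logistic function]], a standard identity, writing the second input as 0:
: <math>\operatorname{softmax}([x, 0])_1 = \frac{e^x}{e^x + e^0} = \frac{1}{1 + e^{-x}}</math>
which is why it counts as a generalization.)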




It is a more generic mathematical tool,
historically seen a bunch in machine learning,
and these days many references you'll find are about its use in neural nets.


In that context it takes activations of any sort
and puts them onto a 0..1 scale sensibly,
mostly as a normalization step that is often used at least in the final layer,
and sometimes at the end of smaller building blocks as well.




When using nets as multiclass classifiers, you would need ''something'' like softmax to be able to respond on all the labels,
and in a way that looks like probabilities.
In part it's just a choice of what you want to show (you could output classification margin scores instead),
in part it's a choice that
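
For illustration, a minimal sketch of that use (the labels and the final-layer activations here are made up):
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

labels = ["cat", "dog", "bird"]              # hypothetical classes
logits = np.array([2.1, 0.4, -1.3])          # hypothetical final-layer activations

probs = softmax(logits)                      # a score for every label, looks like probabilities
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")               # cat: 0.82, dog: 0.15, bird: 0.03
print("predicted:", labels[int(np.argmax(probs))])   # the single hardest answer
</syntaxhighlight>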




-->
https://en.wikipedia.org/wiki/Softmax_function
[[Category:Math on data]]
