Add von Mises-Fisher family

The von Mises-Fisher distribution on `Stiefel(n, k,  𝔽)` and submanifolds is an exponential family distribution with natural parameters $\eta \in \mathbb{F}^{n \times k}$, and density function (wrt the normalized invariant measure)
```math
p(x ; \eta) = c_{n,k,\mathbb{F}}(\eta)^{-1} \exp(\Re(\langle \eta, x \rangle))
```

The names usually given to this family on different submanifolds are:
- Circle: von Mises
- Sphere: von Mises-Fisher/Fisher/Langevin
- Stiefel: Matrix von Mises-Fisher/matrix Fisher/matrix Langevin

> [!NOTE]
> Some authors[^Chikuse1995] define the "matrix Langevin" distribution on the Grassmann manifold as distribution with the above density when the point is a projector matrix. However, upon changing the point to a subspace, one gets the Bingham distribution, so we here keep the distinction that vMF is for Stiefel and its submanifolds and Bingham is for Grassmann and its submanifolds.

> [!NOTE]
> For $\mathbb{F}=\mathbb{C}$, we could equivalently write $`\eta=\begin{bmatrix}\eta_r \\ \eta_i\end{bmatrix} \in \mathbb{R}^{2n\times k}`$ and $`T(x) = \begin{bmatrix} \Re(x) \\ \Im(x)\end{bmatrix}`$ to get the density in the usual exponential family form $`c_{n,k,\mathbb{C}}(\eta)^{-1} \exp(\langle \eta, T(x)\rangle)`$, and a similar approach can be taken when $\mathbb{F}=\mathbb{H}$. But the form above using the real inner product is cleaner.

## Parameterizations

### Canonical

In the canonical parameterization, we have a single $n \times k$ parameter matrix $\eta$.
This corresponds to the $(c)$ parameterization of von Mises-Fisher and the $(F)$ parameter of matrix von Mises-Fisher.
This parameterization is most convenient if the parameter is fixed, and one just needs the unnormalized density.

### SVD

Let $\eta = U D V^\mathrm{H}$ be the SVD of $\eta$.
It can be made unique by choice of sign/phase convention if all singular values are distinct and non-zero.
This parameterization is most convenient if one needs the normalization constant and/or mode.

For the Sphere and Circle, $V=1$, $\mu=U$ is a vector/scalar, and $D=\kappa$ is a positive scalar.

For $\mathrm{SO}(n)$, if $\tilde{U} \tilde{D} \tilde{V}^\mathrm{T}=\eta$ is an SVD decomposition, we use instead $U = \tilde{U} \mathrm{diag}([1, 1, \ldots, \det(\tilde{U})]) \in \mathrm{SO}(n)$, $D = [\tilde{d}_1, \ldots, \det(\tilde{U}) \det(\tilde{V}) \tilde{d}_n]$, and $V = \tilde{V}\mathrm{diag}([1, 1, \ldots, \det(\tilde{V})]) \in \mathrm{SO}(n)$.
Thus the sign of the determinant of $\eta$ is entirely determined by the sign of the final entry in $D$.

### Polar/Mode-Concentration (not planned)

The polar decomposition is $\eta = MK$ for $M \in \mathrm{St}(n, k, \mathbb{F})$ and positive semi-definite $K$.
It is unique under the same conditions as the SVD of $\eta$.
When the SVD is unique, $M$ is the unique mode.
$K$ is positive-definite iff the singular values of $D$ are all non-zero.
Note that $M = U V^\mathrm{H}$ and $K=V D V^\mathrm{H}$, so $\eta^\mathrm{H} \eta = K^2$.
We could call either $D$ or $K$ the concentration parameter, but section 13.2.3 of [^Mardia2009] calls $K$ the concentration parameter.

This parameterization is most convenient if one needs the mode, but for Stiefel in general, it's really no more convenient than the SVD parameterization. Also, on submanifolds of the Stiefel, $M$ may not be a global maximizer. (e.g. it is only the global maximizer on $\mathrm{SO}(n)$ if $M \in \mathrm{SO}(n)$, see below.).

In conclusion, I don't think we should include this parameterization except for the sphere/circle.
But it's useful for talking about some properties of vMF.

## Properties

### Closure

If $`x \sim \mathrm{vMF}_\mathbb{F}(U, D, V)$, and if $L \in \mathrm{U}(n, \mathbb{F})`$, $`R \in \mathrm{U}(k, \mathbb{F})`$, then $`LxR^\mathrm{H} \sim \mathrm{vMF}_\mathbb{F}(LU, D, RV)`$, so the family is closed under left- and right- unitary transformations.
As an example, let $`L = \begin{bmatrix}U & U_\perp\end{bmatrix}^\mathrm{H} \in \mathrm{U}(n, \mathbb{F})`$ and $`R = V^\mathrm{H}`$.
Then $`LxR^\mathrm{H} \sim \mathrm{vMF}_\mathbb{F}(I_{n,k}, D, I_k)`$

### Rotational Symmetry

Let $L \in \mathrm{U}(n, \mathbb{F})$ and $R \in \mathrm{U}(k, \mathbb{F})$ be matrices such that $L^\mathrm{H} U=U$ and $R^\mathrm{H}V=V$. Then $p(LxR^\mathrm{H}; \eta) = p(x; \eta)$. Trivial examples of matrices $L$ and $R$ would be Householder reflectors, e.g. $L = 2 UU^\mathrm{H} - I_n$. In this way vMF distributions exhibit a kind of rotational symmetry. On the sphere, this simplifies to rotational symmetry about the axis including the modal direction.

### Normalization constants

The normalization constant $`c_{n,k,\mathbb{F}}(D) = \int_{\mathcal{M}} \exp(\Re(\langle \eta, x \rangle)) [\mathrm{d}x]`$, where $[\mathrm{d}x]$ is the normalized invariant measure, is only dependent on $D$ and is given by

- `Stiefel(n, k, 𝔽)`: $`c_{n,k,\mathbb{F}}(D) = {}_0F_1^{(2/\mathrm{dim}_\mathbb{F})}(\frac{1}{2}n \mathrm{dim}_\mathbb{F}; \frac{1}{4} D^2)`$[^1]
- `Sphere(n)`:
```math
c_n(\kappa) = c_{n+1,1,\mathbb{R}}(\kappa) = {}_0F_1^{(2)}(\frac{n + 1}{2}; \frac{1}{4} \kappa^2) = I_{(n-1)/2}(\kappa) \frac{\Gamma((n+1)/2)}{(\kappa/2)^{(n-1)/2}} = \frac{1}{A_{n} C_{n+1}(\kappa)}
```
- `Circle()`: $`c(\kappa) = c_{2,1,\mathbb{R}}(\kappa) = {}_0F_1(1; \frac{1}{4} \kappa^2) = I_{0}(\kappa)`$
- $\mathrm{SO}(2), \mathrm{SU}(2)$: $`c(\eta) = c_{2\mathrm{dim}_\mathbb{F}}(\sqrt{\lVert \eta\rVert^2 + 2\Re(\det(\eta))})`$, where $c_n(\cdot)$ is the normalization constant of $\mathbb{S}^n$. 
- $\mathrm{SO}(3)$: $`c(D) = {}_1F_1^{(2)}(\frac{1}{2}, 2; \Lambda(D))`$, where $\Lambda(D)$ is given in [^Wood1993]. This result is derived using the connection between vMF on $\mathrm{SO(3)}$ and the Bingham distribution on $\mathbb{S}^3$. (note that because of the double-cover they have an extra factor of $\frac{1}{2}$, but since we're writing densities wrt the normalized invariant measure here, this factor is eliminated in the normalization.)
where $`{}_pF_q^{(\alpha)}`$ is a (real,scalar-valued) hypergeometric function of matrix argument, $A_{n}$ is the surface area of $\mathbb{S}^{n}$, $C_n(\kappa)$ is defined in the [wikipedia article](https://en.wikipedia.org/wiki/Von_Mises%E2%80%93Fisher_distribution), and $I_\nu$ is the modified Bessel function of the first kind.

> [!NOTE]
> In directional statistics, the density on the sphere is conventionally written wrt the Hausdorff measure (or Lebesgue measure on the circle), which is related to the normalized invariant measure by the constant factor $A_{n}$.

The standard algorithm for computing matrix-argument hypergeometric functions is described in [^KoevEdelman2006]. [HypergeoMat.jl](https://github.com/stla/HypergeoMat.jl) implements this algorithm. It works through computing a truncated series and becomes more expensive for large $D$.

I haven't yet found expressions for the normalization constant for other special cases of $\mathrm{SU}(n, \mathbb{F})$.

### Mode

Wrt the invariant/Hausdorff measures, the mode in the SVD representation is $x^*=UV^\mathrm{H}$, which is just the orthogonal projection to the Stiefel manifold under the Euclidean metric. It is a unique maximizer iff all values in $D$ are unique and non-zero. This follows from application of von Neumann's trace inequality and uniqueness properties of the SVD.

$\mathrm{SU}(n)$'s mode does not appear to have a simple solution unless $\det(\eta) \in \mathbb{R}_{>0}$.

### Moments

For the sphere/circle, the Riemannian mean is known to be the mode $\mu$ when it exists ($\kappa > 0$). [REF]
The intrinsic (Riemannian) variance and higher moments reduce to 1-dimensional integrals but in general have no solution.

Using well-known properties of the exponential family, the extrinsic mean and covariance can be obtained through differentiation of the logarithm of the normalization constant, which we can do here using autodiff. For the sphere, the extrinsic statistics are well-studied; however, Manifolds currently only has an API for *estimating* extrinsic statistics, not for computing them exactly, so for now, these are excluded.

### Median

I haven't seen this given anywhere, though for the sphere/circle it stands to reason it's the same as the Riemannian mean when $\kappa>0$.

### Fitting

#### Maximum-likelihood estimates

Given $m$ draws $x_{(1)}, \ldots, x_{(m)}$, the sufficient statistic $\overline{x} = \frac{1}{m} \sum_{i=1}^m x_{(i)}$ contains all information necessary to compute maximum likelihood estimates.

The key obstacle in MLE for vMF is the (in)tractibility of $c_{n,k,\mathbb{F}}$.

##### Stiefel

Let $\hat{U} \hat{S} \hat{V}^\mathrm{H}=\overline{x}$ be the unique SVD decomposition.
Then $\hat{U}$, $\hat{V}$ are the maximum-likelihood estimates of $U$ and $V$, which follows from von-Neumann's trace inequality.

The MLE of $D$ is then
```math
\hat{D} = \sup_{D}\left(\langle D, \hat{S}\rangle -\log(c_{n,k,\mathbb{F}}(D))\right).
```
For small $D$, we have a first-order approximation
```math
c_{n,k,\mathbb{F}}(D) \approx 1+ \frac{\mathrm{tr}(D^2)}{2 n\mathrm{dim}_\mathbb{F}}.
```
Empirically, an even better approximation for small $D$ is
```math
c_{n,k,\mathbb{F}}(D) \approx \exp\left(\frac{\mathrm{tr}(D^2)}{2 n\mathrm{dim}_\mathbb{F}}\right),
```
which of course has the same first-order approximation.

In both cases, we get  $\hat{D} \approx n \mathrm{dim}_\mathbb{F} \hat{S}$.

In general the MLE $\hat{D}$ needs to be solved via optimization, which I propose to do starting from this initial guess.
However, some authors have proposed other approximations for the normalization constant that could be used in an objective to obtain approximate MLE estimates.

##### Sphere

Note that for the sphere and the complex circle, $\hat{U} = \hat{\mu} = \frac{\overline{x}}{\Vert \overline{x} \rVert}$.
Here we're helped by properties of the [derivative of the hypergeometric function](https://dlmf.nist.gov/16.3.E1), namely
```math
c_{n}'(\kappa) = \frac{n \kappa}{4} c_{n+2}(\kappa),
```
so
```math
\frac{\mathrm{d}}{\mathrm{d} \kappa}\log c_{n}(\kappa) = \frac{c_n'(\kappa)}{c_n(\kappa)} = \frac{I_{(n+1)/2}(\kappa)}{I_{(n-1)/2}(\kappa)} = \rho(\kappa),
```
$\rho$ is termed the mean resultant length and monotonically increases.
So the MLE $\hat{\kappa}$ is the unique solution to $\hat{s} = \rho(\hat{\kappa})$, which can be obtained with iterative root-finding methods.
[^Hornik2014] gives a nice review of the different techniques that have been employed.

##### Special Orthogonal

[^Sei2013] described how to modify the MLE on the Stiefel manifold to obtain MLE on $\mathrm{SO}(n)$.
Here the challenge posed by the intractability of the normalization constant is even worse, and they propose an approach using holonomic gradient descent.

## Random generation

Algorithms for exact random generation have been worked out for some of the submanifolds:

- Circle: [^Best1979]
- Sphere: [^Wood1994]
- `Stiefel(n, k, ℝ)`: The best option seems to be the rejection sampling method of [^Hoff2009], using a proposal derived from [^Wood1994] for von Mises-Fisher on the sphere. It's worth investigating whether Hoff's algorithm straightforwardly generalizes to complex Stiefel.
- $\mathrm{SO}(2)$: This is equivalent to the von Mises distribution on the circle, so the same algorithm can be used.
- $\mathrm{SO}(3)$/$\mathrm{SU}(2)$: Using the connections between $\mathbb{S}^3$ and $\mathrm{SO}(3)$, one finds that a Bingham random variable on $\mathbb{S}^3$ maps to a vMF random variable on $\mathrm{SO}(3)$. Thus one can use a Bingham sampler on $\mathbb{S}^2$ and map to $\mathrm{SO}(3)$.

## Notes

[^Bingham1992] and [^Chikuse2004] seemingly independently proposed the complex generalization of matrix-vMF. While many of its properties included here follow from the same derivations for the real case, the complex version seems much less well studied.

[^1]: For ``\mathbb{F}=\mathbb{H}``, I'm pretty sure this is the correct definition from reading Theorem 4.3 of [^LiXue2009], but before implementing it one should verify that their definition of matrix-argument hypergeometric functions really match to this definition used elsewhere.

## References

[^Chikuse1995]: Chikuse, Yasuko, and Geoffrey S. Watson. "Large sample asymptotic theory of tests for uniformity on the Grassmann manifold." Journal of multivariate analysis 54.1 (1995): 18-31. https://doi.org/10.1006/jmva.1995.1043
[^Mardia2009]: Mardia, Kanti V., and Peter E. Jupp. Directional statistics. John Wiley & Sons, 2009. https://doi.org/10.1002/9780470316979
[^KoevEdelman2006]: Koev, Plamen, and Edelman, Alan. "The efficient evaluation of the hypergeometric function of a matrix argument." Mathematics of Computation 75.254 (2006): 833-846. https://doi.org/10.1090/S0025-5718-06-01824-2 [Reference implementation](https://sites.google.com/sjsu.edu/plamenkoev/home/software/mhg)
[^LiXue2009]: Li, F., & Xue, Y. (2009). Zonal Polynomials and Hypergeometric Functions of Quaternion Matrix Argument. Communications in Statistics - Theory and Methods, 38(8), 1184–1206. https://doi.org/10.1080/03610920802379185
[^Bingham1992]: Bingham, Christopher, Ted Chang, and Donald Richards. "Approximating the matrix Fisher and Bingham distributions: Applications to spherical regression and Procrustes analysis." Journal of Multivariate Analysis 41.2 (1992): 314-337. https://doi.org/10.1016/0047-259X(92)90072-N
[^Chikuse2004]: Chikuse, Yasuko. "Hermite and Laguerre polynomials with complex matrix arguments." Linear algebra and its applications 388 (2004): 91-105. https://doi.org/10.1016/j.laa.2004.02.028
[^Hoff2009]: Hoff, P. D. (2009). Simulation of the Matrix Bingham–von Mises–Fisher Distribution, With Applications to Multivariate and Relational Data. Journal of Computational and Graphical Statistics, 18(2), 438–456. https://doi.org/10.1198/jcgs.2009.07177
[^Best1979]: D. J. Best, N. I. Fisher, Efficient Simulation of the von Mises Distribution, Journal of the Royal Statistical Society Series C: Applied Statistics, Volume 28, Issue 2, June 1979, Pages 152–157, https://doi.org/10.2307/2346732 
[^Wood1994]: Wood, A. T. A. (1994). Simulation of the von mises fisher distribution. Communications in Statistics - Simulation and Computation, 23(1), 157–164. https://doi.org/10.1080/03610919408813161
[^Hornik2014]: Hornik, K., & Grün, B. (2014). movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions. Journal of Statistical Software, 58(10), 1–31. https://doi.org/10.18637/jss.v058.i10
[^Sei2013]: Sei, Tomonari, et al. "Properties and applications of Fisher distribution on the rotation group." Journal of Multivariate Analysis 116 (2013): 440-455. https://doi.org/10.1016/j.jmva.2013.01.010
[^Wood1993]: Wood, Andrew TA. "Estimation of the Concentration Parameters of the Fisher Matrix Distribution on 50 (3) and the Bingham Distribution on Sq, q≥ 2." Australian Journal of Statistics 35.1 (1993): 69-79.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add von Mises-Fisher family #7

Parameterizations

Canonical

SVD

Polar/Mode-Concentration (not planned)

Properties

Closure

Rotational Symmetry

Normalization constants

Mode

Moments

Median

Fitting

Maximum-likelihood estimates

Stiefel

Sphere

Special Orthogonal

Random generation

Notes

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add von Mises-Fisher family #7

Description

Parameterizations

Canonical

SVD

Polar/Mode-Concentration (not planned)

Properties

Closure

Rotational Symmetry

Normalization constants

Mode

Moments

Median

Fitting

Maximum-likelihood estimates

Stiefel

Sphere

Special Orthogonal

Random generation

Notes

References

Footnotes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions