The Newton-Muon Optimizer

5 minute read

Published: April 18, 2026

Muon is almost an implicit Newton method.

Muon has a surprisingly simple update. Given a matrix gradient $G\in\mathbb{R}^{m\times n}$, it applies the matrix sign operator to $G$ to produce an update matrix $Q\in\mathbb{R}^{m\times n}$:

\[Q\propto \operatorname{msgn}(G).\]

Here $\operatorname{msgn}(A)=UV^\top$ for a compact SVD $A=USV^\top$. At first glance, this looks almost too simple. Why should replacing a gradient by its matrix sign be a good idea?

In this blog, we explain one way to see why Muon works: under a layer-wise quadratic model, Muon is almost Newton’s method. The only missing piece is the input data geometry. Newton-Muon puts that piece back. In formulas, Newton-Muon will replace $\operatorname{msgn}(G)$ by $\operatorname{msgn}\!\left(G(ZZ^\top)^{-1}\right)$, where $Z$ stacks the layer inputs.

Local Second-Order Expansion

Let $f$ be the loss function, $W\in\mathbb{R}^{m\times n}$ be the weight matrix of a single layer, $G=\nabla_{W} f(W)\in\mathbb{R}^{m\times n}$ be the gradient, and let $Q\in\mathbb{R}^{m\times n}$ denote an update matrix. The second-order expansion is

\[f(W-Q)\approx f(W)-\operatorname{tr}(QG^\top)+\frac12\,\operatorname{vec}(Q)^\top\mathcal{H}_W\operatorname{vec}(Q), \tag{1}\label{eq:quadratic-model}\]

where $\mathcal{H}_{W}=\nabla^2_{\operatorname{vec}(W)}f(W)\in\mathbb{R}^{mn\times mn}$ is the parameter-space Hessian, and $\operatorname{vec}(\cdot)$ flattens a matrix into a vector.

This is the natural place to start if we want to understand Newton-like behavior. The obstacle, of course, is that the Hessian $\mathcal{H}_{W}$ is intractable. So the next step is to expose some structure.

Kronecker-Factored Curvature

For one sample, let $z_i\in\mathbb{R}^{n}$ be the input and let $y_i=Wz_i\in\mathbb{R}^{m}$ be the output. Write the sample loss as $\ell_i(W)=L_i(y_i)$, and assume the single-sample output curvature is $H_i=\nabla_{y_i}^{2}L_i(y_i)\in\mathbb{R}^{m\times m}$.

Since $y_i=Wz_i$, by the matrix derivative rule, this single-sample curvature $H_i$ contributes the following term to $\mathcal{H}_W$. Here $I_m\in\mathbb{R}^{m\times m}$ is the identity matrix and $\otimes$ denotes the Kronecker product:

\[\nabla_{\operatorname{vec}(W)}^2\ell_i(W) = (z_i\otimes I_m)H_i(z_i^\top\otimes I_m) = (z_iz_i^\top)\otimes H_i \in\mathbb{R}^{mn\times mn}.\]

Averaging these single-sample curvature contributions over $N$ samples gives the empirical parameter-space Hessian $\mathcal{H}_W$ (where we ignore token-coupled curvature in attention mechanisms):

\[\mathcal{H}_W = \frac1N\sum_{i=1}^{N}(z_iz_i^\top)\otimes H_i \in\mathbb{R}^{mn\times mn}.\]

Now, we replace the sample-dependent $H_i$ by a global average curvature $H=N^{-1}\sum_{i=1}^{N}H_i\in\mathbb{R}^{m\times m}$ and decouple it from the input second moment. With $Z=[z_1,\dots,z_N]\in\mathbb{R}^{n\times N}$,

\[\boxed{ \mathcal{H}_W \approx (ZZ^\top/N)\otimes H. } \tag{2}\label{eq:kfac}\]

This is the K-FAC-style approximation; see Optimizing Neural Networks with Kronecker-factored Approximate Curvature.

From Curvature to Displacement

Let $W^{\star}\in\mathbb{R}^{m\times n}$ be a local minimizer near $W$, so $\nabla f(W^{\star})=0$. The integral Taylor expansion of the gradient along the segment from $W^{\star}$ to $W$ gives

\[\operatorname{vec}(G) = \operatorname{vec}(\nabla f(W)) = \left[ \int_{0}^{1} \nabla_{\operatorname{vec}(W)}^{2} f\!\left(W^{\star}+t(W-W^{\star})\right) \,\mathrm{d}t \right] \operatorname{vec}(W-W^{\star}).\]

We replace this path-averaged Hessian by the fixed parameter-space Hessian $\mathcal{H}_{W}$. This gives the local Newton relation

\[\operatorname{vec}(G)\approx \mathcal{H}_{W}\operatorname{vec}(W-W^{\star}). \tag{3}\label{eq:local-newton}\]

Matrix-Sign Representation of the Newton Step

Here comes the magic part.

Under the assumptions in $\eqref{eq:kfac}$ and $\eqref{eq:local-newton}$, treated as exact, and assuming $H\succ0$ and $ZZ^\top\succ0$, the optimal solution $Q^\star$ to the quadratic model $\eqref{eq:quadratic-model}$ can be written directly as:

\[\boxed{ Q^\star = \Sigma_W^{1/2}\operatorname{msgn}\!\left(\Sigma_W^{1/2}G(ZZ^\top)^{-1}\right) } \tag{4}\label{eq:exact-qstar}\]

where $\Sigma_W=(W-W^\star)(W-W^\star)^\top$. The proof is just algebra. We move it to the end of the blog.

Isotropic Displacement Approximation

The expression above still contains $\Sigma_W$, which depends on the unknown displacement $W-W^\star$. To turn the formula into a practical optimizer, adopt the isotropic displacement approximation

\[\Sigma_{W}\propto I_{m}.\]

This proxy treats the unknown displacement directions as isotropic in aggregate.

Then $\Sigma_{W}^{1/2}\propto I_{m}$, and the exact expression $\eqref{eq:exact-qstar}$ reduces to

Newton-Muon

$$ Q^\star\propto \operatorname{msgn}\!\left(G(ZZ^\top)^{-1}\right). $$

If the input second moment is also isotropic, $ZZ^{\top}\propto I_{n}$, where $I_n\in\mathbb{R}^{n\times n}$ is the identity matrix, then we recover standard Muon:

Muon

$$ Q^{\star}\propto \operatorname{msgn}(G). $$

Thus standard Muon is the special case obtained by discarding the right preconditioner induced by the input activations.

A direct descent argument follows. Let $G_{r}=G(ZZ^{\top})^{-1}\in\mathbb{R}^{m\times n}$ denote the right-preconditioned gradient. If $G_{r}=USV^{\top}$ is a compact SVD with $S\succeq 0$ diagonal, then $Q=\operatorname{msgn}(G_{r})=UV^{\top}$ and

\[\operatorname{tr}(G^{\top}Q)=\operatorname{tr}(ZZ^{\top}VSV^{\top})\ge 0,\]

since $ZZ^{\top}\succeq 0$ and $VSV^{\top}\succeq 0$. So, under $ZZ^\top\succ0$, the Newton-Muon direction is a descent direction whenever $G_r\neq0$.

Modded-NanoGPT Result

On Record #4 of the short-track Modded-NanoGPT reproduction, Newton-Muon reaches Muon’s final validation loss using 6% fewer optimization steps and approximately 4% less wall-clock time.

The gain is not a giant architectural change. It is a small optimizer-level correction. But that is exactly why the result is interesting: a simple input-covariance preconditioner gives Muon a cleaner Newton interpretation and a measurable practical improvement.

Proof of the Matrix-Sign Formula

Assume the approximations $\eqref{eq:kfac}$ and $\eqref{eq:local-newton}$ hold exactly, and assume $H\succ0$ and $ZZ^\top\succ0$. The first-order condition for the quadratic model $\eqref{eq:quadratic-model}$ gives

\[\mathcal{H}_W\operatorname{vec}(Q^\star)=\operatorname{vec}(G). \tag{5}\label{eq:proof-stationarity}\]

Combining $\eqref{eq:proof-stationarity}$ with the exact version of $\eqref{eq:local-newton}$ gives $\operatorname{vec}(Q^\star) = \operatorname{vec}(W-W^\star)$, so

\[Q^{\star}=W-W^{\star}. \tag{6}\label{eq:proof-newton-direction}\]

Next, use $\eqref{eq:kfac}$ and $\eqref{eq:proof-newton-direction}$ in $\eqref{eq:proof-stationarity}$, together with $\operatorname{vec}(AXB)=(B^\top\otimes A)\operatorname{vec}(X)$:

\[\operatorname{vec}(G) = \left((ZZ^\top/N)\otimes H\right)\operatorname{vec}(W-W^\star) = \operatorname{vec}\!\left(H(W-W^\star)ZZ^\top/N\right).\]

Thus

\[G(ZZ^\top)^{-1}=H(W-W^\star)/N. \tag{7}\label{eq:preconditioned-gradient-proof}\]

Take the compact SVD $W-W^{\star}=U_QS_QV_Q^{\top}$. Then $\Sigma_{W}^{1/2}=U_QS_QU_Q^{\top}$. Using $\eqref{eq:preconditioned-gradient-proof}$,

\[\Sigma_W^{1/2}G(ZZ^\top)^{-1} = \frac1N\,U_QS_Q(U_Q^\top H U_Q)S_QV_Q^\top.\]

Since $H\succ0$ and $S_Q$ has positive diagonal entries, the middle factor $N^{-1}S_Q(U_Q^\top H U_Q)S_Q$ is symmetric positive definite. Therefore

\[\operatorname{msgn}\!\left(\Sigma_W^{1/2}G(ZZ^\top)^{-1}\right) = U_QV_Q^\top.\]

Finally, because $U_Q^\top U_Q=I_r$,

\[\Sigma_W^{1/2}\operatorname{msgn}\!\left(\Sigma_W^{1/2}G(ZZ^\top)^{-1}\right) = \Sigma_W^{1/2}U_QV_Q^\top = U_QS_QU_Q^\top U_QV_Q^\top = U_QS_QV_Q^\top = W-W^\star.\]

Combining the last display with $\eqref{eq:proof-newton-direction}$ proves $\eqref{eq:exact-qstar}$:

\[\boxed{ Q^\star = \Sigma_W^{1/2}\operatorname{msgn}\!\left(\Sigma_W^{1/2}G(ZZ^\top)^{-1}\right). }\]

This proves the matrix-sign representation of the Newton step.

Citation

@misc{du2026newtonmuonoptimizer,
  title={The Newton-Muon Optimizer},
  author={Zhehang Du and Weijie Su},
  year={2026},
  eprint={2604.01472},
  archivePrefix={arXiv},
  primaryClass={math.OC},
  url={https://arxiv.org/abs/2604.01472},
}