Traditionally, monocular 3D human pose estimation employs a machine learning
model to predict the most likely 3D pose for a given input image. However, a
single image can be highly ambiguous and admit multiple plausible solutions
for the 2D-3D lifting step, which makes single-output 3D pose predictors
overly confident. To address this, we propose \emph{DiffPose}, a conditional
diffusion model that predicts multiple hypotheses for a given input image. In comparison
to similar approaches, our diffusion model is straightforward and avoids
intensive hyperparameter tuning, complex network structures, mode collapse, and
unstable training. Moreover, we tackle a problem of the common two-step
approach, which first estimates a distribution of 2D joint locations via
joint-wise heatmaps and subsequently approximates these heatmaps by their first- or
second-moment statistics. Since such a simplification of the heatmaps removes
valid information about possibly correct, though labeled unlikely, joint
locations, we propose to represent the heatmaps as a set of 2D joint candidate
samples. To extract information about the original distribution from these
samples, we introduce our \emph{embedding transformer}, which conditions the
diffusion model. Experimentally, we show that DiffPose slightly improves upon
the state of the art in multi-hypothesis pose estimation on simple poses and
outperforms it by a large margin on highly ambiguous poses.