Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

Abstract

Accurate 3D reconstruction in endoscopy enables quantitative and holistic lesion characterization within the gastrointestinal (GI) tract. To achieve this, reliable depth and pose estimation is required. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a StyleGAN-based generator and a Variational Autoencoder (VAE). The StyleGAN generator leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. To further enhance pose stability and generalizability, we introduce a prior transfer module that distills motion knowledge from natural scene SLAM systems. Specifically, pose priors from a pretrained SLAM model (supervised on large-scale natural scene datasets) are used to guide the latent distribution of pose through a KL-divergence reparameterization. This mechanism effectively transfers structural motion priors into the endoscopic domain, improving trajectory consistency under challenging conditions. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract’s complex textures and lighting. Extensive evaluations on SimCol, C3VD, and EndoSLAM datasets confirm our framework’s superior performance over published self-supervised methods in endoscopic depth and pose estimation. All data descriptions and code are available at https://github.com/EricXuziang/Self-supervised-with-Latent-Priors.git.

Metadata

Item Type:	Article
Authors/Creators:	Xu, Z. (https://orcid.org/0000-0002-3883-3716); Li, B.; Hu, Y. (https://orcid.org/0000-0002-4856-5014); Zhang, C.; East, J.; Ali, S. (https://orcid.org/0000-0003-1313-3542); Rittscher, J. (https://orcid.org/0000-0002-8528-8298)
Copyright, Publisher and Additional Information:	This is an author produced version of an article published in IEEE Transactions on Medical Imaging, made available via the University of Leeds Research Outputs Policy under the terms of the Creative Commons Attribution License (CC-BY), which permits unrestricted use, distribution and reproduction in any medium, provided the original work is properly cited.
Keywords:	Self-supervised learning, deep learning, endoscopy, monocular depth and pose estimation
Dates:	Published (online): 9 March 2026
Institution:	The University of Leeds
Academic Units:	The University of Leeds > Faculty of Engineering & Physical Sciences (Leeds) > School of Computing (Leeds)
Funding Information:	EPSRC Accounts Payable (grant number UKRI914); Academy of Medical Sciences (grant number SBF0010\1191)
Date Deposited:	16 Mar 2026 12:29
Last Modified:	16 Mar 2026 12:29
Status:	Published online
Publisher:	Institute of Electrical and Electronics Engineers (IEEE)
Identification Number:	10.1109/tmi.2026.3671423
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:238937

Download

Accepted Version

Filename: AcceptedCopy.pdf

Licence: CC-BY 4.0

