📝 Note: As an exception, I include one and only one image dataset, owing to its size (700K scenes) and the remarkable improvement in depth estimation achieved by the Depth Anything V2 ViT-B model fine-tuned on MegaSynth and evaluated on Hypersim. See the results in Table 6.
Nr | Dataset | Venue | Resolution
---|---|---|---
1 | MegaSynth | | 512×512
Nr | Dataset | Venue | Resolution | BoT | C3R | D2U | DP | GC | MoG | POM | RD | UD2 | VDA
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | Spring | (to do) | 1920×1080 | - | T | T | E | T | T | - | - | - | - |
2 | HorizonGS | | 1920×1080 | - | - | - | - | - | - | - | - | - | - |
3 | MVS-Synth | (to do) | 1920×1080 | - | T | - | T | T | T | - | - | - | - |
4 | Mid-Air | (to do) | 1024×1024 | - | - | - | - | T | T | - | - | - | - |
5 | MatrixCity | (to do) | 1000×1000 | - | - | - | - | T | T | - | - | T | - |
6 | SAIL-VOS 3D | (to do) | 1280×800 | - | - | - | T | - | - | - | - | - | - |
7 | BEDLAM | (to do) | 1280×720 | - | T | - | T | - | - | - | - | T | - |
8 | Dynamic Replica | (to do) | 1280×720 | - | T | - | T | T | - | T | - | T | - |
9 | BlinkVision | | 960×540 | - | - | T | - | - | - | - | - | - | - |
10 | PointOdyssey | (to do) | 960×540 | - | T | T | - | - | - | T | E | T | T |
11 | DyDToF | (to do) | 960×540 | - | - | - | - | - | - | - | E | - | - |
12 | IRS | (to do) | 960×540 | - | T | - | T | T | T | - | - | - | T |
13 | Scene Flow | (to do) | 960×540 | - | - | - | - | E | - | - | - | - | - |
14 | 3D Ken Burns | (to do) | 512×512 | - | T | - | T | T | T | - | - | - | - |
15 | TartanAir | (to do) | 640×480 | - | T | T | T | T | T | T | T | T | T |
16 | ParallelDomain-4D | | 640×480 | - | - | - | - | - | - | T | - | - | - |
17 | GTA-SfM | (to do) | 640×480 | - | - | - | - | T | T | - | - | - | - |
18 | MPI Sintel | (to do) | 1024×436 | E | E | E | E | E | E | E | - | E | E |
19 | Virtual KITTI 2 | (to do) | 1242×375 | - | T | - | T | T | - | - | - | - | T |
20 | TartanAir Shibuya | (to do) | 640×360 | E | - | - | - | - | - | - | - | - | - |
Total: T (training) | | | | 0 | 9 | 4 | 8 | 10 | 8 | 4 | 1 | 5 | 4
Total: E (testing) | | | | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 1
- Stereo4D (400 video clips with 16 frames each at 5 fps): LPIPS ≤ 0.242
- Qualitative comparison of four 2D-to-3D video conversion methods: Rank (human perceptual judgment)
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel ≤ 0.079
- NYU-Depth V2: AbsRel ≤ 0.0424 (relative depth)
- NYU-Depth V2: AbsRel ≤ 0.051 (metric depth)
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
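Several of the qualification thresholds above use AbsRel, the mean absolute relative depth error. As a point of reference, here is a minimal pure-Python sketch of that metric; the function name and flat-list inputs are my own illustration, not code from any of the ranked repositories:

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over pixels with gt > 0.

    pred, gt: flat sequences of predicted and ground-truth depths (e.g. metres).
    Pixels with non-positive ground truth are treated as invalid and skipped.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)

# Toy 4-pixel depth map: only the first pixel is off (by 1 m at 2 m depth).
print(abs_rel([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 4.0]))  # → 0.125
```

Real evaluations compute this over aligned depth maps (after scale/shift alignment for relative depth), but the per-pixel formula is the same.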
RK | Model Links: Venue Repository | LPIPS ↓ {Input fr.} Table 1 M2SVid
---|---|---
1 | M2SVid | 0.180 {MF}
2 | SVG | 0.217 {MF}
3 | StereoCrafter | 0.242 {MF}
📝 Note: This ranking is based on my own perceptual judgement of the qualitative comparison results shown in Figure 7. One output frame (right view) is compared with one input frame (left view) from the video clip 22_dogskateboarder, and one output frame (right view) is compared with one input frame (left view) from the video clip scooter-black.
RK | Model Links: Venue Repository | Rank ↓ (human perceptual judgment)
---|---|---
1 | StereoCrafter | 1
2-3 | Immersity AI | 2-3
2-3 | Owl3D | 2-3
4 | Deep3D | 4
📝 Note: 1) See Figure 4. 2) The ranking order is determined in the first instance by a direct comparison of the scores of two models in the same paper. If no paper contains such a direct comparison, or different papers disagree, the ranking order is determined by the best score of the two compared models across all papers shown as data sources in the columns. The DepthCrafter rank is based on the latest version, 1.0.1.
📝 Note: The ranking order is determined in the first instance by a direct comparison of the scores of two models in the same paper. If no paper contains such a direct comparison, or different papers disagree, the ranking order is determined by the best score of the two compared models across all papers shown as data sources in the columns. The Metric3D v2 ViT-Large rank is not based on the 0.134 score, which is probably just an anomaly.
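The tie-breaking rule described in the notes above can be sketched in code. This is a rough illustration under my own assumptions — lower scores are better, each paper is modelled as a dict mapping model name to score, and the function name is hypothetical, not anything used by the repository:

```python
def rank_pair(model_a, model_b, papers):
    """Order two models per the notes' rule (lower score = better).

    papers: list of dicts, each mapping model name -> score from one paper.
    1) If at least one paper scores both models and all such papers agree,
       use that direct comparison.
    2) Otherwise (no direct comparison, or papers disagree), fall back to
       each model's best score across all papers.
    """
    # Papers containing a direct comparison of both models.
    direct = [(p[model_a], p[model_b]) for p in papers
              if model_a in p and model_b in p]
    signs = {(a > b) - (a < b) for a, b in direct if a != b}
    if len(signs) == 1:  # all direct comparisons agree on a winner
        return model_a if signs.pop() == -1 else model_b
    # Fallback: best (lowest) score anywhere decides.
    best_a = min(p[model_a] for p in papers if model_a in p)
    best_b = min(p[model_b] for p in papers if model_b in p)
    return model_a if best_a <= best_b else model_b
```

For example, with `papers = [{"A": 0.20, "B": 0.18}]` the direct comparison decides in favour of `"B"`; with `papers = [{"A": 0.10}, {"B": 0.30}]` there is no direct comparison, so the best scores decide in favour of `"A"`.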