Tony Ng

Research Focus

Generative Systems

Diffusion models for image, video, and audio generation with real-world quality and reliability constraints.

Visual Localization

Learning-based localization that blends geometry with deep representations for AR/VR at scale.

Privacy + Security

Content-concealing descriptors and robust perception for privacy-preserving visual systems.

Now

I build and evaluate diffusion systems for ad creatives at Meta AI Research. I’m interested in controllable generation, scalable data curation, and evaluation frameworks that move beyond surface-level metrics.

I’m open to collaborations on generative media systems, privacy-preserving perception, and robust evaluation.

selected publications

CVPR

NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning

Tony Ng, Hyo Jin Kim, Vincent Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, Chris Sweeney

In CVPR, 2022

In the light of recent analyses on privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction, while maintaining the matching accuracy. We let a feature encoding network and image reconstruction network compete with each other, such that the feature encoder tries to impede the image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.

Reassessing the Limitations of CNN Methods for Camera Pose Regression

Tony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas, Krystian Mikolajczyk

arXiv preprint, 2021

In this paper, we address the problem of camera pose estimation in outdoor and indoor scenarios. In comparison to the currently top-performing methods that rely on 2D to 3D matching, we propose a model that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first analyse why regression methods are still behind the state-of-the-art, and we bridge the performance gap with our new approach. Specifically, we propose a way to overcome the biased training data by a novel training technique, which generates poses guided by a probability distribution from the training set for synthesising new training views. Lastly, we evaluate our approach on two widely used benchmarks and show that it achieves significantly improved performance compared to prior regression-based methods, retrieval techniques as well as 3D pipelines with local feature matching.

ECCV

SOLAR: Second-Order Loss and Attention for Image Retrieval

Tony Ng, Vassileios Balntas, Yurun Tian, Krystian Mikolajczyk

In ECCV, 2020

Recent works in deep-learning have shown that second-order information is beneficial in many computer-vision tasks. Second-order information can be enforced both in the spatial context and the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, that we extend to global descriptors for image retrieval, and is used to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and lead to state-of-the-art results across the public benchmarks.

news

Feb 23, 2026	Two papers were accepted to CVPR 2026: TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models and VecGlypher: Unified Vector Glyph Generation with Language Models.
Dec 10, 2025	New preprint on arXiv, TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (arXiv:2512.02014).
Aug 19, 2024	Started a new role as an AI Research Scientist at Meta, focusing on diffusion models for image, video, and audio generation.
Feb 6, 2023	Joined Synthesia as a Research Engineer, working on controllable video diffusion models for AI dubbing on avatars.
Oct 7, 2022	I completed a second research internship at Reality Labs, this time working on multi-modal understanding (text & geometry) using language models.