Research Focus
Generative Systems
Diffusion models for image, video, and audio generation with real-world quality and reliability constraints.
Visual Localization
Learning-based localization that blends geometry with deep representations for AR/VR at scale.
Privacy + Security
Content-concealing descriptors and robust perception for privacy-preserving visual systems.
Now
I build and evaluate diffusion systems for ad creatives at Meta AI Research. I’m interested in controllable generation, scalable data curation, and evaluation frameworks that move beyond surface-level metrics.
I’m open to collaborations on generative media systems, privacy-preserving perception, and robust evaluation.
selected publications
- NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning
Tony Ng, Hyo Jin Kim, Vincent Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, Chris Sweeney
In CVPR, 2022
In the light of recent analyses on privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction, while maintaining the matching accuracy. We let a feature encoding network and image reconstruction network compete with each other, such that the feature encoder tries to impede the image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.
- Reassessing the Limitations of CNN Methods for Camera Pose Regression
arXiv preprint, 2021
In this paper, we address the problem of camera pose estimation in outdoor and indoor scenarios. In comparison to the currently top-performing methods that rely on 2D to 3D matching, we propose a model that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first analyse why regression methods are still behind the state-of-the-art, and we bridge the performance gap with our new approach. Specifically, we propose a way to overcome the biased training data by a novel training technique, which generates poses guided by a probability distribution from the training set for synthesising new training views. Lastly, we evaluate our approach on two widely used benchmarks and show that it achieves significantly improved performance compared to prior regression-based methods, retrieval techniques as well as 3D pipelines with local feature matching.
- SOLAR: Second-Order Loss and Attention for Image Retrieval
In ECCV, 2020
Recent works in deep learning have shown that second-order information is beneficial in many computer-vision tasks. Second-order information can be enforced both in the spatial context and the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval and use to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks.
news
| Dec 10, 2025 | New preprint, TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models (arXiv:2512.02014). |
| Aug 1, 2024 | Started a new role as an AI Research Scientist at Meta, focusing on diffusion models for image, video, and audio generation. |
| Feb 1, 2023 | Joined Synthesia as a Research Engineer, working on controllable video diffusion models for AI dubbing on avatars. |
| Oct 7, 2022 | I completed a second research internship at Reality Labs, this time working on multi-modal understanding (text & geometry) using language models. |
| Jun 24, 2022 | Our paper NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning was presented at CVPR 2022 in New Orleans, LA. |
Feel free to contact me via email, Twitter or LinkedIn DM :)