Publications
Please also check my Google Scholar profile for an up-to-date list.
Last updated: February 2026.
2026
- TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models. Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong. In CVPR, 2026.
- VecGlypher: Unified Vector Glyph Generation with Language Models. Xiaoke Huang, Bhavul Gauri, Kam Woh Ng, Tony Ng, Mengmeng Xu, Zhiheng Liu, Weiming Ren, Zhaochong An, Zijian Zhou, Haonan Qiu, Yuyin Zhou, Sen He, Ziheng Wang, Tao Xiang, Xiao Han. In CVPR, 2026.
2023
- Systems and Methods for Providing User Experiences on AR/VR Systems. Hyo Jin Kim, Tony Ng, Vincent Lee, F. E. R. Ilg, S. El Ghazzal, Z. Wang, Z. Wang, P. K. Huang. Patent, 2023.
2022
- NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning. Tony Ng, Hyo Jin Kim, Vincent Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, Chris Sweeney. In CVPR, 2022.
In light of recent analyses showing that scene content can be revealed from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other: the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality, with minimal impact on correspondence matching and camera localization performance.
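A minimal sketch of the competing objectives in this kind of adversarial setup (illustrative only, not the paper's implementation; `lam` is a hypothetical trade-off weight):

```python
import numpy as np

def reconstruction_loss(image, reconstructed):
    # Mean squared error between the input image and its reconstruction.
    return float(np.mean((image - reconstructed) ** 2))

def encoder_loss(matching_loss, recon_loss, lam=1.0):
    # The encoder minimises matching error while *maximising* the
    # reconstructor's error, hence the negative sign on recon_loss.
    return matching_loss - lam * recon_loss

def reconstructor_loss(recon_loss):
    # The reconstructor simply tries to drive its own error down.
    return recon_loss

# Toy example: a worse reconstruction lowers the encoder's loss,
# which is exactly the privacy incentive described in the abstract.
img = np.ones((4, 4))
bad_recon = np.zeros((4, 4))
r = reconstruction_loss(img, bad_recon)
assert encoder_loss(0.5, r) < encoder_loss(0.5, 0.0)
```

In practice the two networks are updated alternately on these objectives, so the encoder settles on descriptors that remain discriminative but are hard to invert.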
- OoD-Pose: Camera Pose Regression from Out-of-Distribution Synthetic Views. Tony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas, Krystian Mikolajczyk. In 3DV, 2022.
- SAMPLE-HD: Simultaneous Action and Motion Planning Learning Environment. M. Nazarczuk, Tony Ng, Krystian Mikolajczyk. arXiv preprint, 2022.
2021
- Reassessing the Limitations of CNN Methods for Camera Pose Regression. Tony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas, Krystian Mikolajczyk. arXiv preprint, 2021.
In this paper, we address the problem of camera pose estimation in outdoor and indoor scenarios. In comparison to the currently top-performing methods that rely on 2D-to-3D matching, we propose a model that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first analyse why regression methods are still behind the state-of-the-art, and then bridge the performance gap with our new approach. Specifically, we overcome the bias in the training data with a novel training technique that synthesises new training views, with poses sampled from a probability distribution fitted to the training set. Lastly, we evaluate our approach on two widely used benchmarks and show that it achieves significantly improved performance compared to prior regression-based methods, retrieval techniques, as well as 3D pipelines with local feature matching.
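The pose-sampling idea could be sketched as fitting a simple Gaussian to the training poses and drawing virtual poses from it (a simplification of the paper's distribution-guided view synthesis; `sample_training_poses` is a hypothetical helper, and poses are treated as flat parameter vectors):

```python
import numpy as np

def sample_training_poses(train_poses, n_samples, rng=None):
    # train_poses: (N, P) camera pose parameters (e.g. translation
    # plus a rotation parameterisation). Fit a Gaussian to the
    # training poses and sample new virtual poses from it, so the
    # synthesised views stay close to the training distribution.
    rng = np.random.default_rng(0) if rng is None else rng
    mu = train_poses.mean(axis=0)
    cov = np.cov(train_poses, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_samples)

# Toy example: 50 random 3-parameter poses, 10 sampled virtual poses.
poses = np.random.default_rng(1).random((50, 3))
virtual = sample_training_poses(poses, 10)
assert virtual.shape == (10, 3)
```

Each sampled pose would then be rendered into a synthetic training view, giving the regressor coverage beyond the biased camera trajectory of the original data.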
2020
- SOLAR: Second-Order Loss and Attention for Image Retrieval. Tony Ng, Vassileios Balntas, Yurun Tian, Krystian Mikolajczyk. In ECCV, 2020.
Recent works in deep learning have shown that second-order information is beneficial in many computer vision tasks. Second-order information can be enforced both in the spatial context and in the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval, and which is used to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks.
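The second-order spatial re-weighting can be illustrated with a generic self-attention pass over a feature map (a simplified stand-in for SOLAR's attention block, not its actual architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def second_order_attention(fmap):
    # fmap: (H, W, C) feature map. Each spatial location attends to
    # all others, so its re-weighted value reflects second-order
    # (pairwise) spatial relations rather than local activations alone.
    h, w, c = fmap.shape
    x = fmap.reshape(h * w, c)
    attn = softmax(x @ x.T / np.sqrt(c), axis=-1)  # (HW, HW) affinities
    return (attn @ x).reshape(h, w, c)

# Shape is preserved, so the block can slot into a descriptor network.
assert second_order_attention(np.ones((2, 2, 3))).shape == (2, 2, 3)
```

Because every attention row sums to one, a uniform feature map passes through unchanged; non-uniform maps get re-weighted toward mutually similar, salient locations.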
- HyNet: Learning Local Descriptor with Hybrid Similarity Measure and Triplet Loss. Yurun Tian, Axel Barroso-Laguna, Tony Ng, Vassileios Balntas, Krystian Mikolajczyk. In NeurIPS, 2020.
Recent works show that local descriptor learning benefits from the use of L2 normalisation; however, an in-depth analysis of this effect is lacking in the literature. In this paper, we investigate how L2 normalisation affects the back-propagated descriptor gradients during training. Based on our observations, we propose HyNet, a new local descriptor that leads to state-of-the-art results in matching. HyNet introduces a hybrid similarity measure for the triplet margin loss, a regularisation term constraining the descriptor norm, and a new network architecture that performs L2 normalisation of all intermediate feature maps and the output descriptors. HyNet surpasses previous methods by a significant margin on standard benchmarks that include patch matching, verification, and retrieval, as well as outperforming full end-to-end methods on 3D reconstruction tasks.
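A minimal sketch of a triplet margin loss with in-batch hard-negative mining over L2-normalised descriptors (the generic form this line of work builds on, not HyNet's hybrid similarity itself):

```python
import numpy as np

def l2_normalise(x, axis=-1, eps=1e-8):
    # Project descriptors onto the unit hypersphere.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def triplet_loss_hard_negative(anchors, positives, margin=1.0):
    # anchors, positives: (N, D) descriptors; row i of positives
    # matches row i of anchors, and every other row is a negative.
    a = l2_normalise(anchors)
    p = l2_normalise(positives)
    # Pairwise Euclidean distances between all anchors and positives.
    dists = np.linalg.norm(a[:, None, :] - p[None, :, :], axis=-1)
    pos = np.diag(dists).copy()
    # Mask the matching pairs, then keep the hardest (closest) negative.
    np.fill_diagonal(dists, np.inf)
    hard_neg = dists.min(axis=1)
    return float(np.maximum(0.0, margin + pos - hard_neg).mean())
```

With perfectly matching pairs (positive distance zero), the loss is simply how far the hardest negative falls inside the margin; HyNet's contribution is to replace the plain Euclidean similarity here with a hybrid measure and to constrain the descriptor norms during training.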
- D2D: Keypoint Extraction with Describe to Detect Approach. Yurun Tian, Vassileios Balntas, Tony Ng, Axel Barroso-Laguna, Yiannis Demiris, Krystian Mikolajczyk. In ACCV, 2020.
In this paper, we present a novel approach that exploits the information within the descriptor space to propose keypoint locations. Detect-then-describe, or detect and describe jointly, are two typical strategies for extracting local descriptors. In contrast, we propose an approach that inverts this process: first describe, then detect the keypoint locations. Describe-to-Detect (D2D) leverages successful descriptor models without the need for any additional training. Our method selects keypoints as salient locations with high information content, as defined by the descriptors themselves rather than by independent operators. We perform experiments on multiple benchmarks including image matching, camera localisation, and 3D reconstruction. The results indicate that our method improves the matching performance of various descriptors and that it generalises across methods and tasks.
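The describe-then-detect idea can be sketched by ranking locations of a dense descriptor map by a saliency score (here simply the descriptor norm, a simplified stand-in for the information-content measure used in D2D):

```python
import numpy as np

def describe_to_detect(desc_map, k=5):
    # desc_map: (H, W, D) dense descriptor map from any pretrained
    # descriptor network. No extra training is needed: keypoints are
    # the k locations whose descriptors carry the most "signal",
    # scored here by the per-location descriptor norm.
    saliency = np.linalg.norm(desc_map, axis=-1)
    order = np.argsort(saliency, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(order, saliency.shape)
    return list(zip(ys.tolist(), xs.tolist()))

# Toy example: one location with a non-zero descriptor dominates.
dm = np.zeros((4, 5, 8))
dm[2, 3] = 1.0
assert describe_to_detect(dm, k=1) == [(2, 3)]
```

Swapping in a different descriptor network changes only `desc_map`, which is why this scheme generalises across descriptor methods without retraining a detector.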