Wonjae (Dan) Kim

원재 · ML/HCI Research Scientist


I lead the Embedding & Search team at TwelveLabs, where we build multimodal foundation models for video understanding. I’m the first author of ViLT (cited: 2632), one of the early works that shaped efficient vision-language architectures. Previously, I was a research scientist at Naver AI LAB and Kakao, and I hold an M.Sc. and B.Sc. from Seoul National University.

My current research focuses on:

  • Multimodal Representation Learning (video, audio, text)
  • Large-scale Embedding & Search Systems
  • User Behavior Modeling for Search

We’re Hiring! I’m building a research team at TwelveLabs where your models ship to thousands of customers within months. We’re tackling joint embedding spaces across modalities and containerized asset search—problems that go beyond simple retrieval to true semantic understanding of video structure. If you want to see your work create real-world impact at scale, grab a coffee chat with me. I’m looking for scientists and engineers who are excited to push video-language AI from idea to production. Join us in Seoul →

news

Dec 01, 2025 TwelveLabs releases Marengo 3.0, a new standard for foundation models that understand the world in all its complexity.
Apr 01, 2025 One CVPR-2025 EVAL-FoMo 2 Workshop paper: Emergence of Text Readability in Vision Language Models.
Feb 04, 2025 I’ve started a new chapter at TwelveLabs!
Jan 01, 2025 One ICLR-2025 paper to appear: Probabilistic Language-Image Pre-Training.
Dec 01, 2024 One AAAI-2025 paper to appear: Extract Free Dense Misalignment from CLIP.


selected publications

  1. Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning (cited: 11)
    Wonjae Kim and Yoonho Lee
    In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019
  2. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (cited: 2632)
    Wonjae Kim*, Bokyung Son*, and Ildoo Kim
    In 38th International Conference on Machine Learning (ICML 2021), 18–24 Jul 2021
  3. HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts (cited: 21)
    Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, and Sangdoo Yun
    In 19th European Conference on Computer Vision (ECCV 2024), 2024