Wonjae (Dan) Kim

원재 · ML/HCI Research Scientist


I lead the Embedding & Search team at TwelveLabs, where we build multimodal foundation models for video understanding. I’m the first author of ViLT (cited: 2632), one of the early works that shaped efficient vision-language architectures. Previously, I was a research scientist at Naver AI LAB and Kakao, and I hold an M.Sc. and B.Sc. from Seoul National University.

My current research focuses on:

  • Multimodal Representation Learning (video, audio, text)
  • Large-scale Embedding & Search Systems
  • User Behavior Modeling for Search

We’re Hiring! I’m building a research team at TwelveLabs where your models ship to thousands of customers within months. We’re tackling joint embedding spaces across modalities and containerized asset search—problems that go beyond simple retrieval to true semantic understanding of video structure. If you want to see your work create real-world impact at scale, grab a coffee chat with me. I’m looking for scientists and engineers who are excited to push video-language AI from idea to production. Join us in Seoul →

news

Dec 01, 2025 TwelveLabs releases Marengo 3.0, a new standard for foundation models that understand the world in all its complexity.
Apr 01, 2025 One CVPR-2025 EVAL-FoMo 2 Workshop paper: Emergence of Text Readability in Vision Language Models.
Feb 04, 2025 I’ve started a new chapter at TwelveLabs!
Jan 01, 2025 One ICLR-2025 paper to appear: Probabilistic Language-Image Pre-Training.
Dec 01, 2024 One AAAI-2025 paper to appear: Extract Free Dense Misalignment from CLIP.


selected publications

  1. Learning Dynamics of Attention: Human Prior for Interpretable Machine Reasoning (cited: 11)
    Wonjae Kim and Yoonho Lee
    In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019
  2. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (cited: 2632)
    Wonjae Kim*, Bokyung Son*, and Ildoo Kim
    In 38th International Conference on Machine Learning (ICML 2021), 18–24 Jul 2021
  3. HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts (cited: 21)
    Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, and Sangdoo Yun
    In 19th European Conference on Computer Vision (ECCV 2024), 2024