You'll Be Responsible For:

Assisting in the construction of visual multimodal LLMs, enhancing capabilities in perception, localization, and decision-making ("fast and slow thinking"), including model architecture design, data preprocessing, and feature extraction.
Working on dataset design and network training for tasks like object detection and multimodal image-text matching, and helping develop innovative solutions that improve system performance across various scenarios.
Participating in data analysis, task design, and architecture development for "slow-thinking" Agents. Optimizing algorithms and models to improve the accuracy and efficiency of Agents executing complex tasks in virtual environments.
Managing model deployment and acceleration, and designing and implementing APIs to ensure efficient operation and provide stable support for other systems and applications.

You Might Thrive In This Role If You:

Hold a Master's degree or higher in Computer Science, Software Engineering, Electronics, or a related field.
Have experience in one or more of the following areas: VLA, GUI, CUA, BUA.
Are familiar with building Agents using multimodal LLMs and prompt tuning.
Use LLMs as daily tools and can apply theoretical knowledge to solve practical problems.
Are proficient in algorithms related to NLP, Computer Vision (CV), and/or multimodal LLMs, have experience with Python, PyTorch, and TensorRT, and have skills in one or more of the following: model deployment/acceleration, design, and tuning.
Have strong analytical and problem-solving skills, excellent engineering judgment, and the ability to make quick technical decisions.
Possess outstanding teamwork and communication skills, enabling effective collaboration with business, product, and other technical teams.

Even Better If You:

Have achieved excellent results in competitions like Kaggle or Tianchi.
Have published papers in top-tier conferences or journals such as CVPR, NeurIPS, ICLR, TPAMI, or IJCV.

Why Join Us?

Work in a pure research environment, akin to a "West Point for AI."
Join a team with multiple world championship titles in competitions, with opportunities to contribute to top-tier conference publications.
Work alongside and grow with scientists recognized in the top 2% of Google Scholar.

Research Scientist, Visual Multimodal