Trending Research

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

LeapLabTHU/Absolute-Zero-Reasoner • 6 May 2025

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards.

Mathematical Reasoning

385 stars

4.12 stars / hour

PixelHacker: Image Inpainting with Structural and Semantic Consistency

hustvl/PixelHacker • 29 Apr 2025

Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (potential 116 and 21 categories, respectively).

Denoising, Facial Inpainting

245 stars

1.95 stars / hour

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

maitrix-org/voila • 5 May 2025

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner.

AI Agent, Automatic Speech Recognition, +4 more

233 stars

1.82 stars / hour

Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities

aidc-ai/awesome-unified-multimodal-models • 5 May 2025

Despite their respective successes, these two domains have evolved independently, leading to distinct architectural paradigms: While autoregressive-based architectures have dominated multimodal understanding, diffusion-based models have become the cornerstone of image generation.

Survey, Text-to-Image Generation

1.79 stars / hour

LTX-Video: Realtime Video Latent Diffusion

Lightricks/LTX-Video • 30 Dec 2024

To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space.

Denoising, Image to Video Generation

4,360 stars

1.52 stars / hour

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

ruc-nlpir/webthinker • 30 Apr 2025

Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities.

Navigate

614 stars

1.33 stars / hour

FastVLM: Efficient Vision Encoding for Vision Language Models

apple/ml-fastvlm • 17 Dec 2024

At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency.

227 stars

1.28 stars / hour

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

simular-ai/agent-s • 1 Apr 2025

Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries.

AI Agent, Task Planning

4,514 stars

1.24 stars / hour

LiftFeat: 3D Geometry-Aware Local Feature Matching

lyp-deeplearning/liftfeat • 6 May 2025

We then design a 3D geometry-aware feature lifting module to fuse surface normal feature with raw 2D descriptor feature.

3D geometry, Homography Estimation, +3 more

1.23 stars / hour

PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

facebookresearch/perception_models • 17 Apr 2025

In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.

Ranked #4 on Video Question Answering on NExT-QA

Video Question Answering, Video Understanding

961 stars

0.98 stars / hour
