Research Statement

LLMs & Agents, Multimodal Reasoning, Code Intelligence, and Future Directions

Updated: 2026-05-11

Overview

My research centers on building large language models (LLMs) and agents that can reason, retrieve, and act reliably in complex, knowledge-intensive environments. Beginning with federated learning — where I investigated how to embed group-level fairness constraints into the gradient descent process [1] — I gradually broadened my scope from low-level model training dynamics to high-level cognitive capabilities: how models understand visual structures, manipulate code, plan with tools, and adapt through interaction. This journey reflects a core conviction: that the path toward genuinely intelligent systems requires us to address challenges across the entire stack, from training methodology to emergent multi-agent behavior.

Multimodal Reasoning

A fundamental limitation of current generative models is their struggle with compositional structural constraints: generative numeracy, attribute binding, and part-level relations. In Shape-of-Thought (SoT) [2], I introduced a visual Chain-of-Thought framework that enables progressive object assembly via interleaved textual plans and rendered intermediate states. Rather than producing explicit geometric representations, the model learns shape-assembly logic through coherent 2D projections, without relying on external rendering engines at inference time. I constructed the SoT-26K dataset by converting part-based CAD assets into step-aligned multimodal traces via automated hierarchy decomposition and Blender rendering, and built T2S-CompBench to evaluate both structural integrity and trace faithfulness. Fine-tuning a multimodal autoregressive model on these traces yielded 88.4% on component numeracy and 84.8% on structural topology, approximately 20% above text-only baselines.

Beyond generation, a parallel challenge is enabling models to reason more effectively in complex visual tasks. In DMLR [3], which I co-authored, we proposed a reasoning paradigm that interleaves perceptual steps directly into the latent space. By replacing rigid, sequential processing with dynamic integration, the model mimics human attention shifts and significantly reduces hallucinations in visual reasoning tasks.

Code Intelligence

Beyond perception, reliable code generation and understanding represent a critical frontier for LLM applications. My work on RepoShapley [4] addresses a core challenge in repository-level code completion: the utility of a retrieved context chunk is often interaction-dependent. Some snippets help only when paired with complementary context, while others harm decoding through conflicts. I proposed ChunkShapley, an offline labeling module that estimates signed per-chunk effects via teacher-forced probing, feeds them into a lightweight surrogate game capturing saturation and interference, and computes exact Shapley values for small retrieval sets. The verified keep/drop decisions are then distilled into a single model via discrete control tokens. Experiments across multiple backbones show consistent improvements in completion quality while reducing harmful context and unnecessary retrieval.
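
To make the Shapley step concrete, here is a minimal sketch of exact Shapley attribution over a small set of retrieved chunks. It is an illustration under stated assumptions, not the RepoShapley implementation: the chunk names, the per-chunk effects, the form of the surrogate game (a capped positive gain for saturation plus a pairwise penalty for interference), and the keep/drop threshold at zero are all hypothetical.

```python
from itertools import combinations
from math import factorial

# Hypothetical signed per-chunk effects (stand-ins for probing estimates).
EFFECTS = {"a": 0.5, "b": 0.4, "c": -0.3}
# Hypothetical pairs of chunks that conflict when retrieved together.
CONFLICTS = [frozenset({"a", "c"})]
CAP = 0.6      # assumed saturation ceiling on the total positive gain
PENALTY = 0.2  # assumed cost per conflicting pair


def surrogate_value(coalition):
    """Toy surrogate game: capped gains, additive harms, pairwise conflict penalties."""
    gain = min(sum(max(EFFECTS[c], 0.0) for c in coalition), CAP)
    harm = sum(min(EFFECTS[c], 0.0) for c in coalition)
    clash = PENALTY * sum(1 for pair in CONFLICTS if pair <= coalition)
    return gain + harm - clash


def exact_shapley(players, value):
    """Exact Shapley values by enumerating every coalition (small sets only)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Probability that exactly k other players precede p in a random order,
            # split evenly across the coalitions of that size.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


phi = exact_shapley(sorted(EFFECTS), surrogate_value)
# Assumed decision rule: keep a chunk only if its Shapley value is positive.
decisions = {c: ("keep" if v > 0 else "drop") for c, v in phi.items()}
print(phi, decisions)
```

Enumeration costs grow exponentially with the number of chunks, which is why exact values are practical only for small retrieval sets. In this toy game the Shapley values sum to the value of the full set, as the efficiency property requires.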

Understanding why models fail on certain code tasks, even when surface-level accuracy appears similar, led me to co-author Brewing-to-Resolution [5]. Through layer-wise linear probing and Context-Stripped Decoding, we revealed a universal internal lifecycle governing how LLMs process code: models first "brew" the answer, which becomes linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four causally validated resolution outcomes. Across 16 models and six task families, only 41.5% of cases achieve full resolution, with substantial shares falling into the Overprocessed, Misresolved, and Unresolved categories. These findings indicate that the brewing scaffold is architectural and stable, while resolution success is parametric and variable, a distinction that accuracy-only metrics completely obscure.

Agentic Systems

Scaling individual model capabilities to multi-agent workflows is the next frontier. In Group of Skills (GoSkills) [6], I co-authored an inference-time group-structured skill retrieval framework that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields — without modifying the downstream agent, skill payloads, or execution environment. Evaluated on SkillsBench and ALFWorld, GoSkills preserves visible-requirement coverage under a small skill budget and often improves reward relative to structural retrieval references.
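
As a rough illustration of what a role-labeled execution context might look like as a data structure, the sketch below fills the four contract fields under a skill budget and renders them as text. Everything beyond the four field names is an assumption made for illustration: the class and function names, the skill names, the ranked-list input, and the simple truncation used as the budget rule are hypothetical and are not taken from GoSkills.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ROLES = ("start", "support", "check", "avoid")


@dataclass
class ExecutionContract:
    """Fixed four-field contract; each field holds the skill names assigned to that role."""
    start: List[str] = field(default_factory=list)
    support: List[str] = field(default_factory=list)
    check: List[str] = field(default_factory=list)
    avoid: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the contract as plain text, always in the same field order."""
        lines = []
        for role in ROLES:
            skills = getattr(self, role)
            body = ", ".join(skills) if skills else "(none)"
            lines.append(f"{role.capitalize()}: {body}")
        return "\n".join(lines)


def build_contract(ranked: List[Tuple[str, str]], budget: int) -> ExecutionContract:
    """Keep the top `budget` (role, skill) pairs from a priority-ordered list."""
    contract = ExecutionContract()
    for role, skill in ranked[:budget]:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        getattr(contract, role).append(skill)
    return contract


# Hypothetical ranked group plan; the fourth entry is cut by the budget of 3.
ranked_plan = [
    ("start", "open_drawer"),
    ("check", "verify_object"),
    ("avoid", "drop_item"),
    ("support", "navigate"),
]
contract = build_contract(ranked_plan, budget=3)
print(contract.render())
```

The point of the fixed field order is that the downstream agent always receives the same layout, regardless of which skills were selected.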

Effective agents also require structured reasoning and memory. In ongoing work, I am developing agents that combine retrieval-augmented generation with multi-turn planning and self-critique loops, enabling continuous adaptation in dynamic environments.

Fairness & Efficiency in ML Systems

Intelligence at scale must be equitable intelligence. My early work on FedGF [1] tackled a critical gap in federated learning: prior methods optimized client-level fairness while overlooking demographic-group biases. I proposed embedding demographic-parity constraints into each layer's descent direction, jointly optimizing accuracy, client fairness, and group fairness. FedGF achieves a 78% reduction in group accuracy gaps compared to state-of-the-art methods on FMNIST and CIFAR-10 benchmarks.
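
The group accuracy gap mentioned above can be made concrete with a small sketch. I assume here that the gap is the difference between the best and worst per-group accuracy, and that a reduction is measured relative to a baseline gap. Both definitions, the function names, and the toy data are illustrative assumptions, not taken from the FedGF paper or its results.

```python
def group_accuracy_gap(y_true, y_pred, groups):
    """Difference between the highest and lowest per-group accuracy."""
    correct, total = {}, {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(truth == pred)
    accuracies = [correct[g] / total[g] for g in total]
    return max(accuracies) - min(accuracies)


def relative_reduction(baseline_gap, new_gap):
    """Fraction by which the gap shrinks relative to the baseline gap."""
    return (baseline_gap - new_gap) / baseline_gap


# Toy data (made up for illustration): two demographic groups of four samples each.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
groups = ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"]

gap = group_accuracy_gap(y_true, y_pred, groups)  # g1 accuracy 0.75, g2 accuracy 0.50
print(gap, relative_reduction(gap, 0.10))
```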

Similarly, in q-Boost [7], I co-designed an optimal boost factor algorithm for auto-bidding mechanisms with publisher quality constraints, balancing advertiser and publisher interests within a unified welfare framework. Experiments on Alibaba's AuctionNet dataset demonstrate 2–6% welfare improvements over conventional approaches.

Future Plan

My long-term research agenda is to build reliable, adaptive AI agents that operate effectively in open-ended, social, and knowledge-intensive environments. I see two interconnected thrusts:

1. Cognitive agents. I plan to investigate how LLM-based agents can learn and adapt through interaction — designing algorithms that enable continuous internal strategy updates, principled uncertainty quantification, and human-like decision-making in dynamic social contexts. Key challenges include grounding symbolic reasoning in latent space, building robust world models from noisy observations, and developing verifiable self-correction mechanisms.

2. Multi-agent social intelligence. At the collective level, I aim to build interactive agent societies that serve as high-fidelity simulations for studying coordination, cooperation, and competition at scale. In such environments, we can observe emergent social patterns — norms, trust, strategic adaptation — arising from the accumulation of individual behaviors. These simulations will provide both scientific insight into collective intelligence and engineering foundations for robust multi-agent deployments.

References

  1. Yu Huo*, Yating Li*, and Xiaoying Tang. FedGF: Layer-Wise Federated Learning with Group Fairness Guarantees. ICIC 2025.
  2. Yu Huo*, Siyu Zhang*, Kun Zeng*, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, and Xiaoying Tang. Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought. ICML 2026.
  3. Chengzhi Liu*, Yu Huo*, Yue Fan, Qingyue Wei, Sheng Liu, and Xin Eric Wang. Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space. arXiv preprint arXiv:2512.12623, 2025.
  4. Yu Huo*, Kun Zeng*, Siyu Zhang*, Yuquan Lu, Cheng Yang, Yifu Guo, and Xiaoying Tang. RepoShapley: Shapley-Enhanced Context Filtering for Repository-Level Code Completion. ACL 2026.
  5. Yiyang Guo, Shuai Chen, Yuquan Lu, Jiarui Lin, Zongwei Xu, Jin Lin, Siyu Zhang, Cheng Yang, Jiawei Li, Yiming Li, Yu Huo, and Ruicheng Wang. From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs. COLM 2026 (Submission).
  6. Kun Zeng*, Yu Huo*, Siyu Zhang, Zi Ye, Yuecheng Zhuo, Haoyue Liu, Yuquan Lu, Junhao Wen, and Xiaoying Tang. Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries. arXiv preprint arXiv:2605.06978, 2026.
  7. Huanyu Yan*, Yu Huo*, Min Lu, Weitong Ou, Xingyan Shi, Ruihe Shi, and Xiaoying Tang. Optimal Boost Design for Auto-bidding Mechanism with Publisher Quality Constraints. arXiv preprint arXiv:2508.08772, 2025.