Podcast Episodes
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Episode 1773
🤗 Upvotes: 50 | cs.CL
Authors:
Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng, Yuxuan…
2 weeks, 5 days ago
Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
Episode 1772
🤗 Upvotes: 24 | cs.AI, cs.CL
Authors:
Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang…
2 weeks, 5 days ago
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
Episode 1771
🤗 Upvotes: 23 | cs.LG, cs.AI, cs.CL
Authors:
Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu H…
2 weeks, 5 days ago
Exploration and Exploitation Errors Are Measurable for Language Model Agents
Episode 1770
🤗 Upvotes: 22 | cs.AI
Authors:
Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae …
2 weeks, 5 days ago
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
Episode 1769
🤗 Upvotes: 123 | cs.LG, cs.AI, cs.CL, cs.CV
Authors:
Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, …
2 weeks, 6 days ago
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Episode 1768
🤗 Upvotes: 62 | cs.LG, cs.AI, cs.CL
Authors:
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Ch…
2 weeks, 6 days ago
Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
Episode 1767
🤗 Upvotes: 27 | cs.AI, cs.LG
Authors:
Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu…
2 weeks, 6 days ago
SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
Episode 1766
🤗 Upvotes: 26 | cs.AI
Authors:
Tianyi Wang, Yixia Li, Long Li, Yibiao Chen, Shaohan Huang, Yun Chen, Peng Li, Ya…
2 weeks, 6 days ago
Toward Autonomous Long-Horizon Engineering for ML Research
Episode 1765
🤗 Upvotes: 25 | cs.CL
Authors:
Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua …
2 weeks, 6 days ago
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Episode 1764
🤗 Upvotes: 21 | cs.CL, cs.AI
Authors:
Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe, Céline Hu…
2 weeks, 6 days ago