As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results