Abstract: Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT offer powerful conversation, in-context learning, and code generation abilities across a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common-sense knowledge they have acquired. However, they still struggle with some specialized tasks, either because they saw too little domain-specific data during pre-training or because their neural network computations often make errors on tasks that require exact execution. On the other hand, many existing models and systems (symbolic or neural) already perform some domain-specific tasks very well. However, because their implementations and working mechanisms differ, they are not easily accessible to or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in those outlines to off-the-shelf models and systems with specialized functionality that can complete them. Inspired by this, we introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work, which aimed to improve a single AI model, TaskMatrix.AI focuses on using existing foundation models (as a brain-like central system) and the APIs of other AI models and systems (as sub-task solvers) to accomplish diverse tasks in both digital and physical domains.
As a position paper, we present our vision of how to build such an ecosystem, explain each key component, and use case studies to illustrate both the feasibility of this vision and the main challenges we need to address next.

Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

Dynabench: Rethinking Benchmarking in NLP

Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

AIBench: An Industry Standard AI Benchmark Suite from Internet Services

AIBench: An Industry Standard AI Benchmark Suite

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

TaskBench: Benchmarking Large Language Models for Task Automation

AIBench Training: Balanced Industry-Standard AI Training Benchmarking

Tur[k]ingBench: A Challenge Benchmark for Web Agents

$τ$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence

DataPerf: Benchmarks for Data-Centric AI Development

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

Codabench: Flexible, Easy-to-use, and Reproducible Meta-Benchmark Platform

Personalized Benchmarking with the Ludwig Benchmarking Toolkit

DyPyBench: A Benchmark of Executable Python Software

DLBench: An Experimental Evaluation of Deep Learning Frameworks

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Enriching the Machine Learning Workloads in BigBench