Joschka Braun on Apr 6, 2024

In this post we will assess Anthropic’s newly released tool calling beta API on the Berkeley Function Calling Leaderboard dataset. Specifically, we will compare Anthropic’s most expensive Claude 3 model (claude-3-opus-20240229) and its cheapest one (claude-3-haiku-20240307) with OpenAI’s GPT-4 Turbo model (gpt-4-0125-preview) and GPT-3.5 Turbo model (gpt-3.5-turbo-0125). A huge thanks to the team at Berkeley for preparing this dataset and making it available to the public.

You can find the code to reproduce the results here, and the experiment details are publicly available here. The full results (with a graph) are in the Results section. Note that we will use the terms function and tool interchangeably in this post.
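To make the comparison concrete, here is a minimal sketch of how the same tool definition is passed to both APIs. The `get_weather` tool and its schema are made up for illustration; the model names match the ones benchmarked above, but the exact SDK calls (in particular the Anthropic tools beta endpoint) may differ from the versions used in our benchmark code.

```python
import anthropic
import openai

# Hypothetical tool used only for illustration.
weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string", "description": "City name, e.g. Berlin"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
}

# Anthropic tool calling beta: tool definitions use an `input_schema` field.
anthropic_client = anthropic.Anthropic()
claude_response = anthropic_client.beta.tools.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": weather_schema,
    }],
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
)

# OpenAI function calling: tools are wrapped in {"type": "function", ...}.
openai_client = openai.OpenAI()
gpt_response = openai_client.chat.completions.create(
    model="gpt-4-0125-preview",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": weather_schema,
        },
    }],
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
)
```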

Berkeley Function Calling Leaderboard (BFCL) Dataset

We chose the Abstract Syntax Tree (AST) Evaluation subset of the BFCL dataset, which comprises 1000 out of the total 1700 samples. It contains the following categories (from here); an illustrative sample is sketched after the list:

Simple Function: These 400 samples are the simplest and most commonly seen format: the user provides a single JSON function/tool definition, and exactly one function call is invoked.

Multiple Function: These 200 samples consist of a user question that invokes only one function call out of 2 to 4 JSON tool definitions. The model needs to be capable of selecting the best function to invoke given the user-provided context.

Parallel Function: These 200 samples require invoking multiple function calls in parallel for one user query. The model needs to work out how many function calls need to be made, and the user question can consist of a single sentence or multiple sentences.

Parallel Multiple Function: Each of these 200 samples is a combination of parallel function and multiple function. In other words, the model is provided with multiple tool definitions, and each of the corresponding tools may be invoked zero or more times.
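As an illustration, here is what a sample in the simple category roughly looks like. The tool definition and expected call below are invented for this post and are not taken from the dataset; real BFCL samples follow the same shape but differ in field names and content.

```python
# Hypothetical sample in the style of the BFCL "simple" category (not a real dataset entry).
sample = {
    "question": "What is the area of a circle with a radius of 3?",
    "function": [{
        "name": "circle_area",
        "description": "Compute the area of a circle given its radius.",
        "parameters": {
            "type": "object",
            "properties": {"radius": {"type": "number"}},
            "required": ["radius"],
        },
    }],
    # The model is expected to produce exactly one call such as:
    "expected_call": "circle_area(radius=3)",
}
```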

Abstract Syntax Tree (AST) Evaluation

The full details on how the evaluation process works can be found here. The main idea is to compare the abstract syntax tree (AST) of the function call generated by the model with the AST of the correct answer. Here are the relevant parts:

To evaluate the simple & multiple function categories, the evaluation process compares the generated function call against the given function definition and the possible answers. It extracts the arguments from the AST and checks that each required parameter is present and exactly matches one of the possible answers with the same type.

The parallel and parallel-multiple function AST evaluation extends this idea to support multiple model outputs and possible answers. It applies the simple function evaluation to each generated function call and checks that all function calls are correct. Note that this evaluation is invariant under the order of the function calls.
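The following is a simplified sketch of this AST-based matching, not the BFCL evaluator itself: it parses a Python-style call string with the standard `ast` module, extracts the keyword arguments, and checks them against a per-parameter list of allowed values. The `get_weather` example values are made up.

```python
import ast

def parse_call(call_str: str):
    """Parse a Python-style call string, e.g. 'f(x=1)', into (name, kwargs)."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_str!r}")
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def call_matches(generated: str, possible_answers: dict) -> bool:
    """Check one generated call against {function_name: {param: [allowed values]}}."""
    name, kwargs = parse_call(generated)
    if name not in possible_answers:
        return False
    # Every required parameter must appear with an allowed value of the same type.
    return all(
        param in kwargs
        and any(v == kwargs[param] and type(v) is type(kwargs[param]) for v in allowed)
        for param, allowed in possible_answers[name].items()
    )

# Made-up example:
print(call_matches(
    'get_weather(city="Berlin", unit="celsius")',
    {"get_weather": {"city": ["Berlin"], "unit": ["celsius", "metric"]}},
))  # True
```

For the parallel categories, this per-call check is applied to every generated call, with matching done independently of the order in which the calls appear.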

Results

For simple & multiple tool use, we can see that Haiku beats all the other models! However, both Haiku and Opus struggle to generate multiple function calls in parallel, i.e., within a single API call. It is also interesting that GPT-3.5 Turbo is much closer in performance to GPT-4 Turbo when generating function calls in parallel than when not. Below is a graph of the results, comparing the accuracy of the 4 models at generating the correct function call(s) under the different conditions.

[Plot: accuracy per model and tool-use category]

When analyzing the failure cases for simple & multiple tool use, we see that all models except GPT-3.5 Turbo fail to generate the correct value for a parameter in 15-20% of cases. GPT-3.5 Turbo instead generates multiple function calls in 30-40% of all samples in this category. Together with the fact that GPT-3.5 Turbo closely matches the performance of GPT-4 Turbo in the parallel tool use category, this suggests that GPT-3.5 Turbo is biased towards parallel tool use. In general, the OpenAI models generate the correct number of function calls for parallel & parallel multiple tool use, but still struggle to get every function call correct. On the other hand, Haiku is not able to generate multiple function calls in parallel at all, and Opus generates only one function call in 30% of the parallel tool use cases.
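To check how many function calls a model produced in a single response, one can count the tool-call entries in the API output. A minimal sketch, assuming the `claude_response` and `gpt_response` objects from the earlier example; the field names follow the SDK versions current at the time of writing and may differ in yours.

```python
# Anthropic: tool calls come back as "tool_use" content blocks.
claude_calls = [block for block in claude_response.content if block.type == "tool_use"]

# OpenAI: tool calls are listed on the assistant message (None if there are none).
gpt_calls = gpt_response.choices[0].message.tool_calls or []

print(len(claude_calls), len(gpt_calls))
```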

Conclusion

In conclusion, Haiku is the best model for tool use when only a single function call should be generated. It is better than Opus & GPT-4 Turbo while being almost two orders of magnitude cheaper. However, when you need parallel tool use (i.e., multiple function calls generated in a single API call), GPT-4 Turbo is still the best model, and GPT-3.5 Turbo is worth a try once the desired quality is achievable with GPT-4 Turbo. Notably, GPT-3.5 Turbo appears biased towards generating multiple function calls in parallel, whether or not that is required.