Haiku > GPT-4 Turbo > Opus >>> GPT-3.5 Turbo if not using parallel function calling.
This post compares Anthropic’s Claude 3 Opus (claude-3-opus-20240229) and the cheapest Claude 3 model, Haiku (claude-3-haiku-20240307), with OpenAI’s GPT-4 Turbo model (gpt-4-0125-preview) and GPT-3.5 Turbo model (gpt-3.5-turbo-0125).
A huge thanks to the team at Berkeley for preparing this dataset and making it available to the public.
You can find the code to reproduce the results here, and the experiment details are publicly available here.
The full results (with graph) are in the Results section.
Note: we use "function" and "tool" interchangeably in this post.
Overview of major error types for each tool use & model scenario
Tool Use | Model | Error [%] | Error [#] | Major Error Type | Major Error [#] |
---|---|---|---|---|---|
Simple | Opus | 13% | 51 | Invalid value for parameter | 21 |
Simple | Haiku | 7% | 29 | Invalid value for parameter | 13 |
Simple | GPT-4 T | 11% | 44 | Invalid value for parameter | 17 |
Simple | GPT-3.5 T | 46% | 183 | Wrong number of functions | 162 |
Multiple | Opus | 13% | 26 | Invalid value for parameter | 15 |
Multiple | Haiku | 8% | 17 | Invalid value for parameter | 13 |
Multiple | GPT-4 T | 11% | 21 | Invalid value for parameter | 9 |
Multiple | GPT-3.5 T | 33% | 67 | Wrong number of functions | 47 |
Parallel | Opus | 39% | 78 | Invalid value for parameter | 61 |
Parallel | Haiku | 100% | 200 | Wrong number of functions | 200 |
Parallel | GPT-4 T | 11% | 22 | Could not find a matching function | 18 |
Parallel | GPT-3.5 T | 12% | 24 | Could not find a matching function | 17 |
Parallel Multiple | Opus | 59% | 116 | Wrong number of functions | 65 |
Parallel Multiple | Haiku | 100% | 200 | Wrong number of functions | 200 |
Parallel Multiple | GPT-4 T | 36% | 73 | Could not find a matching function | 59 |
Parallel Multiple | GPT-3.5 T | 43% | 85 | Could not find a matching function | 64 |

- Invalid value for parameter: generated value for parameter was incorrect
- Wrong number of functions: generated too many or too few function calls
- Could not find a matching function: one or more function calls were incorrect
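
To make the three error categories more concrete, here is a minimal sketch of how a checker might assign them. It is not the evaluation code used for these results: the `Call` class, the `classify` function, and the exact matching rules (count first, then function name, then argument values) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Call:
    """Hypothetical representation of one function call: a name plus arguments."""
    name: str
    args: dict = field(default_factory=dict)


def classify(expected: list, generated: list) -> Optional[str]:
    """Return an error category from the table, or None if every call matches.

    Assumed rules (illustrative, not the benchmark's actual logic):
    1. a different number of calls -> "Wrong number of functions"
    2. no generated call with the expected name -> "Could not find a matching function"
    3. right name but different arguments -> "Invalid value for parameter"
    """
    if len(generated) != len(expected):
        return "Wrong number of functions"

    remaining = list(generated)  # calls not yet matched; order is ignored
    for want in expected:
        same_name = [g for g in remaining if g.name == want.name]
        if not same_name:
            return "Could not find a matching function"
        exact = [g for g in same_name if g.args == want.args]
        if not exact:
            return "Invalid value for parameter"
        remaining.remove(exact[0])
    return None


if __name__ == "__main__":
    # Made-up example calls, not taken from the dataset.
    expected = [Call("get_weather", {"city": "Paris"})]
    print(classify(expected, [Call("get_weather", {"city": "Paris"})]))   # None
    print(classify(expected, [Call("get_weather", {"city": "London"})]))  # Invalid value for parameter
    print(classify(expected, []))                                         # Wrong number of functions
```

Checking the call count first mirrors how the table reads: a model that returns a single call when several are expected is counted under "Wrong number of functions" regardless of whether that single call is correct.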