OpenAI just released new versions of GPT-3.5 and GPT-4,
and there’s a lot of interest in their ability to code compared to the previous versions.
With that in mind, I’ve been benchmarking the new models.
Aider
is an open source command line chat tool that lets you work with GPT to edit
code in your local git repo.
To do this, aider needs to be able to reliably recognize when GPT wants to edit
your source code,
determine which files it wants to modify
and accurately apply the changes it’s trying to make.
Doing a good job on this “code editing” task requires a good LLM, good prompting and
a good tool driving the interactions with the LLM.
Aider relies on a
code editing benchmark
to quantitatively evaluate
performance
whenever one of these things changes.
For example,
whenever I change aider’s prompting or the backend which drives LLM conversations,
I run the benchmark to make sure these changes produce improvements (not regressions).
The benchmark uses aider to try and complete
133 Exercism Python coding exercises.
For each exercise, Exercism provides a starting python file with stubs for the needed functions,
a natural language description of the problem to solve
and a test suite to evaluate whether the coder has correctly solved the problem.
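To give a sense of the shape of these exercises, here is an illustrative stub-and-test pair in the Exercism style. The file names, function, and tests are made up for this example and are not taken from the actual benchmark set:

```python
# leap.py -- starting stub handed to GPT (illustrative example,
# not an actual file from the benchmark's exercise set)
def is_leap_year(year):
    pass


# test_leap.py -- pytest suite used to judge the solution
from leap import is_leap_year

def test_year_divisible_by_4_but_not_100():
    assert is_leap_year(1996)

def test_year_divisible_by_100_but_not_400():
    assert not is_leap_year(1900)

def test_year_divisible_by_400():
    assert is_leap_year(2000)
```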
The benchmark gives aider two tries to complete the task:
- On the first try, aider gives GPT the stub code file to edit and the natural language instructions that describe the problem. This reflects how you code with aider. You add your source code files to the chat and ask for changes, which are automatically applied.
- If the test suite fails after the first try, aider gives GPT the test error output and asks it to fix the code. Aider supports this sort of interaction using a command like /run pytest to run and share pytest results in the chat with GPT. You can /run whatever tests/linters/etc make sense for your language/framework/situation. A rough sketch of this two-try loop appears below.
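Here is a minimal sketch of that two-try loop, assuming aider’s --message flag for one-shot, non-interactive requests. The benchmark_exercise() helper below is a hypothetical approximation of the flow, not the actual benchmark harness:

```python
import subprocess

def run_tests():
    """Run pytest and report whether it passed, plus its combined output."""
    result = subprocess.run(["pytest"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def benchmark_exercise(stub_file, instructions):
    """Approximate the benchmark's two-try flow for a single exercise."""
    # Try 1: give GPT the stub file plus the problem description and
    # let aider apply whatever edits GPT proposes.
    subprocess.run(["aider", "--message", instructions, stub_file])
    passed, output = run_tests()
    if passed:
        return "passed on first try"

    # Try 2: share the failing test output (as /run pytest does in an
    # interactive chat) and ask GPT to fix the code.
    subprocess.run(
        ["aider", "--message", f"Tests failed, please fix:\n{output}", stub_file]
    )
    passed, _ = run_tests()
    return "passed on second try" if passed else "failed"
```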
Benchmark results
gpt-4-1106-preview
For now, I have only benchmarked the GPT-4 models using the diff
edit method.
This is the edit format that aider uses by default with gpt-4.
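To make the terminology concrete: with the whole format GPT returns the entire updated file, while the diff format asks it to return only the sections it wants to change as edit blocks. The snippet below is only an illustration of the general shape of such an edit block; the exact markers aider expects have varied across versions:

```python
# Illustrative only: roughly the shape of a "diff" style edit block,
# naming the file and showing just the changed section instead of
# resending the whole file (which is what the "whole" format does).
# The exact marker syntax aider uses may differ from this sketch.
example_diff_edit = """\
greeting.py
<<<<<<< ORIGINAL
def greet(name):
    return "Hello"
=======
def greet(name):
    return f"Hello, {name}!"
>>>>>>> UPDATED
"""
```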
- The new gpt-4-1106-preview model seems 2-2.5X faster than the June GPT-4 model.
- It seems better at producing correct code on the first try. It gets 53% of the coding exercises correct, without needing to see errors from the test suite. Previous models only get 46-47% of the exercises correct on the first try.
- The new model seems to perform similarly (~65%) to the old models (63-64%) after a second chance to correct bugs by reviewing test suite error output.
gpt-3.5-turbo-1106
I benchmarked the GPT-3.5 models with both the whole and diff edit formats.
None of the gpt-3.5 models seem able to effectively use the diff
edit format, including the newest November (1106) model.
The comments below only focus on comparing the whole
edit format results:
- The new gpt-3.5-turbo-1106 model completes the benchmark 3-4X faster than the earlier GPT-3.5 models.
- Its 42% success rate after the first try is comparable to the previous June (0613) model. The new November and previous June models are both worse than the original March (0301) model’s 50% result on the first try.
- The new model’s 56% success rate after the second try seems comparable to the original March model, and somewhat better than the June model’s 50% score.
This is one in a series of reports
that use the aider benchmarking suite to assess and compare the code
editing capabilities of OpenAI’s GPT models.
You can review the other reports
for additional information:
- GPT code editing benchmarks evaluates the March and June versions of GPT-3.5 and GPT-4.
- Code editing speed benchmarks for OpenAI’s “1106” models compares the performance of the new GPT models.
Updates
Last updated 11/14/23.
OpenAI has relaxed rate limits so these results are no longer considered preliminary.