Meta recently unveiled its Llama 3 model in two configurations, with 8 billion and 70 billion parameters, and released both as open models to the AI community. Despite its smaller size, the 70-billion-parameter Llama 3 has demonstrated remarkable capability, as seen on the LMSYS Chatbot Arena leaderboard. So we conducted a comprehensive comparison between Llama 3 and OpenAI's flagship GPT-4 model to assess their performance across a variety of tests. Let's dive into our analysis of Llama 3 versus GPT-4.
Magic Elevator Test
Let's start with the magic elevator test to gauge Llama 3's logical reasoning against GPT-4's. Surprisingly, Llama 3 passes the test, whereas GPT-4 fails to give the correct answer. (Riding up 3 floors from floor 1 stops the elevator on floor 4, an even floor, so it connects to floor 1; taking the stairs 3 floors up from there ends on floor 4.) This outcome is unexpected given that Llama 3 has only 70 billion parameters, while GPT-4 reportedly has around 1.7 trillion.
It’s worth noting that we conducted the test on the GPT-4 model available on ChatGPT (accessible to ChatGPT Plus users), which appears to be using the older GPT-4 Turbo model. However, when we ran the same test on the recently released GPT-4 model (gpt-4-turbo-2024-04-09) via OpenAI Playground, it successfully passed. OpenAI has mentioned that they are rolling out the latest model to ChatGPT, but it might not be available on our account yet.
There is a tall building with a magic elevator in it. When stopping on an even floor, this elevator connects to floor 1 instead. Starting on floor 1, I take the magic elevator 3 floors up. Exiting the elevator, I then use the stairs to go 3 floors up again. Which floor do I end up on?
Winner: Llama 3 70B. GPT-4 fails on ChatGPT Plus, though gpt-4-turbo-2024-04-09 passes in the OpenAI Playground.
Determine the Time Required for Drying
Next, we posed a classic reasoning question to assess the intelligence of both models. Remarkably, both Llama 3 70B and GPT-4 gave the correct answer without resorting to needless arithmetic: the towels dry in parallel under the sun, so 20 towels take the same 1 hour as 15. Well done, Meta!
If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?
Winner: Both Llama 3 70B and GPT-4 (ChatGPT Plus).
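The trick is that drying is a parallel process, not a sequential one. A minimal sketch, assuming there is room to spread every towel out at once:

```python
def drying_time(num_towels: int, hours: float = 1.0) -> float:
    """Towels dry simultaneously under the sun, so the count doesn't matter
    (assuming enough space to lay them all out at once)."""
    return hours

print(drying_time(15))  # 1.0
print(drying_time(20))  # 1.0
```

Models that fail this question typically set up a proportion (20/15 hours) instead of noticing that the count is irrelevant.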
Locate the Apple
Following that, I posed another question to evaluate the reasoning abilities of Llama 3 and GPT-4. In this test, the Llama 3 70B model comes close to the correct answer but fails to mention the box. GPT-4, by contrast, correctly responds that "the apples are still on the ground inside the box," since the basket has no bottom. This round goes to GPT-4.
There is a basket without a bottom in a box, which is on the ground. I put three apples into the basket and move the basket onto a table. Where are the apples?
Winner: GPT-4 (ChatGPT Plus).
Which object has more weight?
Although the question appears straightforward, many AI models trip over it. In this test, however, both Llama 3 70B and GPT-4 gave the accurate response: the kilo of feathers is heavier, since 1 kg is about 2.2 lb. That said, Llama 3 occasionally produces the wrong answer on reruns, which is worth keeping in mind.
What's heavier, a kilo of feathers or a pound of steel?
Winner: Both Llama 3 70B and GPT-4 (ChatGPT Plus).
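The arithmetic behind the answer is a one-line unit conversion:

```python
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound

feathers_kg = 1.0              # a kilo of feathers
steel_kg = 1.0 * KG_PER_LB     # a pound of steel, converted to kilograms
print(feathers_kg > steel_kg)  # True: the kilo of feathers is heavier
```

Models usually fail here by pattern-matching to the classic "a kilo of feathers vs. a kilo of steel" riddle and answering "they weigh the same."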
Determine the Position
Following that, I posed a straightforward logic question, and both models answered correctly: overtaking the runner in second place puts you in second, not first. It's intriguing to see the smaller Llama 3 70B model keeping pace with the top-tier GPT-4 model.
I am in a race and I am overtaken by the second person. What is my new position?
Winner: Both Llama 3 70B and GPT-4 (ChatGPT Plus).
Resolve a Mathematical Equation
Next, we gave both Llama 3 and GPT-4 a challenging math problem to see which model comes out ahead. In this test, GPT-4 excels, arriving at the correct answer effortlessly, while Llama 3 falls short. This outcome isn't unexpected, considering GPT-4's impressive performance on the MATH benchmark. Note that I specifically instructed ChatGPT not to use the Code Interpreter for the calculation.
Determine the sum of the y-coordinates of the four points of intersection of y = x^4 - 5x^2 - x + 4 and y = x^2 - 3x.
Winner: GPT-4 (ChatGPT Plus).
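For reference, the correct answer is 12: setting the two curves equal gives x^4 - 6x^2 + 2x + 4 = 0, and summing y = x^2 - 3x over its four roots (e.g. via Vieta's formulas) yields 12. A short NumPy check:

```python
import numpy as np

# Intersections satisfy x^4 - 5x^2 - x + 4 = x^2 - 3x,
# i.e. x^4 - 6x^2 + 2x + 4 = 0.
roots = np.roots([1, 0, -6, 2, 4])

# Each intersection's y-coordinate is x**2 - 3*x; sum over all four roots.
total = sum(x**2 - 3*x for x in roots)
print(round(float(np.real(total))))  # 12
```

By Vieta, the sum of the roots is 0 and the sum of pairwise products is -6, so sum(x^2) = 0^2 - 2(-6) = 12 and sum(3x) = 0, matching the numerical result.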
Adhere to User Directions
Following user instructions is crucial for an AI model, and Meta's Llama 3 70B model excels in this regard. It successfully generated all 10 sentences ending with the word "mango", while GPT-4 managed only eight.
Generate 10 sentences that end with the word "mango"
Winner: Llama 3 70B.
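Checking this kind of instruction-following is easy to automate. A small sketch (the sample sentences here are made up for illustration, not model outputs):

```python
import re

def ends_with(sentence: str, word: str) -> bool:
    """True if the sentence's final word (ignoring punctuation) matches `word`."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[-1].lower() == word.lower()

samples = [
    "I sliced a perfectly ripe mango.",
    "Nothing beats the taste of fresh mango!",
    "She topped the sticky rice with mango",
]
print(all(ends_with(s, "mango") for s in samples))  # True
```

A checker like this makes it easy to score all 10 generated sentences at once instead of eyeballing them.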
Needle in a Haystack (NIAH) Test
Despite Llama 3's currently short context window, we ran the needle-in-a-haystack (NIAH) test to assess its retrieval capabilities. The Llama 3 70B model, which supports a context length of up to 8K tokens, swiftly located the needle (a random statement) buried in a 35K-character text (roughly 8K tokens). GPT-4 likewise had no difficulty finding it.
While this test utilized a relatively small context, it demonstrates Llama 3’s impressive retrieval capability. We anticipate conducting further evaluations when Meta releases a Llama 3 model with an expanded context window.
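For readers who want to reproduce a test like this, here is a minimal sketch of a haystack builder. The filler sentence, needle, and insertion depth are illustrative assumptions, not the exact text we used:

```python
def build_haystack(filler: str, needle: str, target_chars: int = 35_000,
                   depth: float = 0.5) -> str:
    """Repeat filler text to ~target_chars and bury the needle at `depth` (0-1)."""
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + needle + haystack[pos:]

filler = "The quick brown fox jumps over the lazy dog. "
needle = " The secret passphrase is 'blue-mango-42'. "
prompt = build_haystack(filler, needle) + "\n\nWhat is the secret passphrase?"
print(needle.strip() in prompt)  # True
```

Sweeping `depth` from 0 to 1 (and `target_chars` up to the model's context limit) is how fuller NIAH evaluations map retrieval accuracy across positions.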
Winner: Both Llama 3 70B and GPT-4 (ChatGPT Plus).
Llama 3 versus GPT-4: The Final Decision
In nearly every assessment, the Llama 3 70B model demonstrated remarkable capability, excelling at advanced reasoning, following user instructions, and retrieval. Mathematical calculation is the only area where it falls behind GPT-4. Meta says Llama 3 was trained on a vast coding dataset, suggesting strong coding performance as well.
It's important to remember that we're comparing a much smaller model against GPT-4: Llama 3 70B is a dense model, while GPT-4 is reportedly built on a mixture-of-experts (MoE) architecture comprising eight ~222B-parameter experts. That makes Meta's achievement with the Llama 3 family all the more commendable. With the 400B+ Llama 3 model still in training, even better performance can be expected, possibly surpassing existing AI models.
Llama 3's emergence has significantly raised the bar, and Meta's decision to open-source the model has substantially narrowed the gap between proprietary and open-source models. These assessments were conducted on the Instruct model, and fine-tuned variants of Llama 3 70B should deliver even better performance. With Meta joining OpenAI, Anthropic, and Google in the AI race, the competition has only intensified.