I Tested Gemini Live: It Falls Short Compared to ChatGPT Advanced Voice Mode



After months of anticipation, Gemini Live has finally launched. You can begin using it on your Android phone immediately, as long as you have subscribed to Gemini Advanced. I tested Gemini Live on my OnePlus phone, but it didn’t feel as groundbreaking as the demos presented during Google I/O 2024.

To begin with, Gemini Live currently does not support additional modalities, such as images or real-time camera input, which were key features emphasized in Project Astra. At the moment, it only facilitates free-flowing audio conversations that generally work well, but there are some fundamental issues with the feature’s implementation. We’ll delve into those later, but first, let’s review my interactions with Gemini Live.

Testing the Capabilities of Gemini Live

Interruptions and Multilingual Capability

First and foremost, Gemini Live supports interruptions, and it generally works well, although there are instances where it continues speaking despite being interrupted. You can also engage with Gemini Live in the background, even while your phone is locked.

Additionally, it can converse in multiple languages, effortlessly switching between them. I’ve tried it in English, Hindi, and Bengali, and it worked quite well. You can see the demo below:

Prepping for a Job Interview

I initiated my conversation with Gemini Live by asking for help preparing for a job interview as a developer in the AI field. It inquired whether I would be focusing on research or application development. After I explained that I would be working on a web app to generate AI images, Gemini Live provided me with a list of languages and frameworks I should be familiar with, including Python, PyTorch, TensorFlow Lite, and more.

It also recommended that I brush up on diffusion models, as they are currently a hot topic in image generation within the AI field. Overall, I had a productive conversation with Gemini Live, which provided several useful suggestions.
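To give a sense of what that advice points to, here is a minimal sketch of generating an image with a pre-trained diffusion model using Hugging Face's diffusers library. The specific checkpoint and prompt are my own illustrative choices, not something Gemini Live suggested.

```python
# A minimal sketch of running a pre-trained diffusion model with Hugging Face's
# diffusers library. The checkpoint and prompt are illustrative choices only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # any Stable Diffusion checkpoint works here
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")                  # move to GPU; omit to stay on CPU

image = pipe("a watercolor painting of a cat reading a book").images[0]
image.save("generated.png")
```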

However, it often offers broad, generic advice without diving deeply into the technical aspects of a topic. At times, you have to nudge Gemini to go beyond surface-level discussion. I ran into the same thing when I engaged Gemini Live in a conversation about the My Hero Academia (MHA) and Jujutsu Kaisen (JJK) manga and their spoilers: it relied on repetitive, shallow sentences to convey the context of the stories.

Private Messaging Apps

Next, to test for hallucinations, I explored the topic of privacy and asked which messaging app is best for communicating with anonymous sources. It recommended Signal and advised me to avoid WhatsApp due to its ownership by Facebook. I then pointed out that both apps use the same end-to-end encryption protocol and asked why Signal is considered better than WhatsApp.

Gemini replied, “Facebook is known for collecting a lot of data” and “they want to make more money by showing ads.” I then asked who developed the end-to-end encryption protocol, and it accurately responded with Moxie Marlinspike, the creator of the Signal protocol.
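For the curious, the shared ingredient here is the Signal Protocol's reliance on Diffie-Hellman key agreement over Curve25519. Below is a minimal sketch of that key-exchange primitive using Python's cryptography library; it illustrates the building block, not WhatsApp's or Signal's actual implementation.

```python
# A minimal sketch of X25519 Diffie-Hellman key agreement, the primitive the
# Signal Protocol builds on. Illustrative only; real Signal sessions layer
# X3DH and the Double Ratchet on top of exchanges like this.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

alice_key = X25519PrivateKey.generate()
bob_key = X25519PrivateKey.generate()

# Each side combines its own private key with the other's public key...
alice_shared = alice_key.exchange(bob_key.public_key())
bob_shared = bob_key.exchange(alice_key.public_key())
assert alice_shared == bob_shared  # ...and both arrive at the same secret.

# The raw shared secret is then stretched into a usable session key.
session_key = HKDF(
    algorithm=hashes.SHA256(), length=32, salt=None, info=b"demo session",
).derive(alice_shared)
print(session_key.hex())
```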

Additionally, I informed Gemini that there had been a recent security issue with Signal and asked if it could find more information. It quickly browsed the internet and reported that “there was a vulnerability in the desktop version of Signal that could allow someone to snoop on your files,” but assured me that “it’s been fixed.” Up to this point, I hadn’t noticed any hallucinations from Gemini regarding key facts.

Role-playing

To test its role-playing abilities, I asked Gemini Live to act like an English butler. At first, it spoke in a refined and formal tone, addressing me as “Sir,” but it quickly fell out of character and reverted to its original style. I had to repeatedly remind it to stay in role. Additionally, it can’t perform accents yet, so it wasn’t truly an English butler after all.

Gemini has struggled with following instructions in our earlier tests, and Gemini Live is no different. It easily forgets the role and context as soon as we switch to another topic in the same chat session.
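There is no user-facing fix for this in Gemini Live itself, but for comparison, the Gemini developer API does let you pin a persona with a system instruction. Here is a hypothetical sketch using the google-generativeai Python SDK; the butler wording is my own example.

```python
# A sketch of pinning a persona via a system instruction with the
# google-generativeai SDK. This is the developer API, not Gemini Live;
# the persona text is a made-up example.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="You are a formal English butler. Stay in character "
                       "and address the user as 'Sir' in every reply.",
)
chat = model.start_chat()
print(chat.send_message("What's the weather like for a garden party?").text)
```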

Finding Information

I asked Gemini Live to recommend restaurants where I can find the best biryani in Kolkata, and it suggested Arsalan and Karim’s, which are indeed popular spots for biryani. I also inquired about places to get my laptop repaired, and it provided a few legitimate names. Gemini Live performed quite well when it came to finding information.

Gemini Live vs ChatGPT Advanced Voice Mode

Inspired by Cristiano Giardina’s demo of ChatGPT Advanced Voice Mode, I asked Gemini Live to count from 1 to 10 extremely quickly, but it maintained a standard pace due to the intermediate speech-to-text and text-to-speech conversion.

In contrast, the ChatGPT Advanced Voice Mode demo provided a far more human-like experience, even pausing to catch its breath while counting! That’s the true multimodal experience you get when speech is handled natively as both input and output.

Additionally, Gemini Live struggled to repeat tongue twisters without pausing in between, something that ChatGPT’s Advanced Voice Mode does remarkably well. After that, I asked Gemini Live to speak in David Attenborough’s accent, but as I mentioned earlier, it still can’t perform accents.

In conclusion, while Gemini Live is suitable for casual conversations, it’s not groundbreaking. We’ll need to wait for Google to enable native input/output capabilities on Gemini Live to truly compete with ChatGPT’s Advanced Voice Mode.

Hey Google, Where Is the Emotional Depth in Gemini Live?

When Gemini was launched in December 2023, Google declared it a genuinely native multimodal model. For audio, this means that Gemini can identify the tone and tenor of speech, recognize pronunciation, and detect the speaker’s mood (happy, sad, excited, and so on) by processing raw audio signals natively.

In the native input and output method, the multimodal model tokenizes and processes raw speech directly, and it can generate speech directly as well. This bypasses the intermediate layers of the traditional pipeline, where speech is first transcribed into text by a speech-to-text (STT) engine (losing much of its nuance in the process), the language model then generates a text reply, and that reply is finally voiced by a text-to-speech (TTS) engine.

This traditional approach fails to leverage the advantages of native end-to-end multimodal capabilities, such as understanding speech intonation, expression, and mood. It also adds latency, making the conversation feel robotic rather than natural.
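To make the distinction concrete, here is a stub-level sketch of that cascaded pipeline. The three stage functions are stand-ins of my own, not Gemini or ChatGPT internals; they show why anything the STT stage drops never reaches the language model.

```python
# A minimal sketch of the cascaded STT -> LLM -> TTS pipeline described above.
# The three stages are stand-ins, not real Gemini or ChatGPT APIs; each stage
# only sees the output of the previous one, so everything the STT step drops
# (tone, pace, emotion) is invisible to the language model.

def transcribe(audio: bytes) -> str:
    """Stand-in STT stage: raw speech becomes flat text."""
    return "count from one to ten as fast as you can"

def generate_reply(prompt: str) -> str:
    """Stand-in LLM stage: works purely on text."""
    return "one two three four five six seven eight nine ten"

def synthesize(text: str) -> bytes:
    """Stand-in TTS stage: text is voiced with a fixed, neutral delivery."""
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    text_in = transcribe(audio_in)      # speech nuance is lost at this step
    text_out = generate_reply(text_in)  # the model never hears the user
    return synthesize(text_out)         # the reply's pacing and style are fixed here

print(voice_turn(b"...raw microphone audio..."))
```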

With Gemini Live, we anticipated that Google would deliver a native input/output multimodal experience akin to ChatGPT’s Advanced Voice Mode. However, it’s evident that Gemini Live is still relying on the traditional method for processing audio.

To demonstrate this, I asked Gemini Live to identify an animal from its sound, but it responded that it cannot currently process audio. It was also unable to tell whether I was happy or sad from my raw speech.

To summarize my experience, the current implementation of Gemini Live is not capable of facilitating truly natural conversations. At this stage, Gemini Live feels more like a glorified text-to-speech (TTS) engine powered by a large language model (LLM). It uses Gemini 1.5 Flash to generate quick responses and handles audio with separate TTS and speech-to-text (STT) engines. Overall, it’s a somewhat disappointing experience.
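In other words, you could approximate much of today's Gemini Live by wiring Gemini 1.5 Flash between off-the-shelf STT and TTS components. The sketch below pairs the google-generativeai SDK with the pyttsx3 TTS library purely to illustrate that architecture; the pairing is my own assumption, not Google's actual implementation.

```python
# Illustrative only: an LLM sandwiched between STT and TTS, roughly the
# architecture argued above. The pairing of google-generativeai with pyttsx3
# is my own choice, not Google's stack.
import google.generativeai as genai
import pyttsx3

genai.configure(api_key="YOUR_API_KEY")          # placeholder, not a real key
model = genai.GenerativeModel("gemini-1.5-flash")
tts = pyttsx3.init()

def reply_out_loud(transcribed_text: str) -> None:
    # transcribed_text is assumed to come from an STT engine; by this point
    # the user's tone, pace, and mood have already been discarded.
    response = model.generate_content(transcribed_text)
    tts.say(response.text)                       # neutral, fixed-style voice
    tts.runAndWait()

reply_out_loud("Count from one to ten as fast as you can.")
```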

Have you accessed Gemini Live? How have your conversations been with it? Let us know in the comments section.


