How to Install and Run LLMs Locally on Your Android Phone


While applications such as LM Studio and GPT4All make it easy to run AI models locally on computers, Android phones have had few comparable options. MLC LLM now fills that gap with its MLC Chat Android app, which lets you download and run LLMs directly on your Android device. The app supports small models in the 2B to 8B parameter range, including Llama 3, Gemma, Phi-2, Mistral, and others. Let’s get started with the process.

Note: MLC Chat currently does not use the on-device NPU on Snapdragon devices, so token generation is slow; inference runs entirely on the CPU. That said, some devices, such as the Samsung Galaxy S23 Ultra (powered by the Snapdragon 8 Gen 2), are better optimized for the MLC Chat app and may offer a noticeably smoother experience.

  • Get the MLC Chat app (Free) for Android phones by downloading the APK file (148MB) and installing it.
  • Open the MLC Chat app, where you’ll find a list of AI models including the latest Llama 3 8B model, Phi-2, Gemma 2B, and Mistral 7B.
  • Choose and download the AI model of your preference. For example, I opted for Microsoft’s Phi-2 model due to its compact size and efficient performance.

  • After downloading the model, simply tap the chat button next to it.
  • Now, start conversations with the AI model directly on your Android device without internet connectivity.
  • During my testing, Phi-2 performed well on my phone, albeit with occasional hallucinations. However, Gemma didn’t run, and Llama 3 8B operated sluggishly.

  • Phi-2 generated about 3 tokens per second on my OnePlus 7T, which runs the five-year-old Snapdragon 855+ SoC.

That’s how you can download and run LLMs directly on your Android device. Token generation may be sluggish, but it demonstrates that running AI models locally on Android phones is now practical. At the moment, the app relies on the CPU alone; with Qualcomm AI Stack integration, Snapdragon-powered Android devices could use the dedicated NPU, GPU, and CPU together for significantly better performance.
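
If you’d like to poke at the same MLC-compiled models from a computer, MLC LLM also ships a Python package with an OpenAI-style chat API. The sketch below follows the MLCEngine interface from MLC LLM’s documentation; the model ID string is illustrative and assumes a matching model is published in the mlc-ai Hugging Face collection, so check the published model list before running it.

```python
# Minimal sketch: driving an MLC-compiled model from Python on a desktop,
# using MLC LLM's documented OpenAI-style MLCEngine API.
# Install the mlc_llm package per the MLC LLM docs before running.
from mlc_llm import MLCEngine

# Illustrative model ID; assumed to match a model published by mlc-ai on Hugging Face.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Stream a chat completion, OpenAI-style.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Explain what quantization does to an LLM."}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()
```

The Android app pulls similarly quantized MLC model weights, which is largely why 2B to 8B models fit in a phone’s memory at all.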

On the Apple side, developers are already using the MLX framework for fast local inference on iPhones, reaching nearly 8 tokens per second. Android devices are expected to gain on-device NPU support as well, which should improve performance substantially. According to Qualcomm, the Snapdragon 8 Gen 2 can generate 8.48 tokens per second with a larger 7B model, and should perform even better with a 2B quantized model.
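
To put those numbers in perspective, here is a quick back-of-the-envelope estimate of how long a typical reply would take at each quoted speed; the 200-token reply length is an assumption chosen purely for illustration.

```python
# Rough estimate of response time at the decode speeds quoted above.
# The 200-token reply length is an assumed, illustrative figure.
REPLY_TOKENS = 200

rates_tok_per_s = {
    "Phi-2 on Snapdragon 855+ (CPU only)": 3.0,   # measured in this test
    "MLX on recent iPhones": 8.0,                 # figure cited above
    "7B model on Snapdragon 8 Gen 2": 8.48,       # Qualcomm's figure
}

for label, rate in rates_tok_per_s.items():
    print(f"{label}: ~{REPLY_TOKENS / rate:.0f} s for a {REPLY_TOKENS}-token reply")
```

At 3 tokens per second, a 200-token answer takes over a minute, while the roughly 8-token-per-second figures cut that to under half a minute, which is the practical difference hardware acceleration would make.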

That wraps up our discussion. If you’re interested in interacting with your documents using a local AI model, be sure to check out our dedicated article. If you encounter any difficulties, feel free to share them in the comments below.

