Free software 'llama.cpp' that can run various AI models locally supports multimodal input and enables image explanations, etc.



The open source software '

llama.cpp ' that can run AI models locally now supports image input. You can input images and text at the same time to have the machine answer questions such as 'What is in this image?'

server : vision support via libmtmd by ngxson · Pull Request #12898 · ggml-org/llama.cpp · GitHub
https://github.com/ggml-org/llama.cpp/pull/12898

llama.cpp/docs/multimodal.md at master · ggml-org/llama.cpp · GitHub
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md

Trying out llama.cpp's new vision support
https://simonwillison.net/2025/May/10/llama-cpp-vision/

llama.cpp is software that can run AI models converted to GGUF format , and can run a variety of AI models in a local environment, including Llama, the Gemma series, and DeepSeek-R1. It also supports Windows, macOS, and Linux, and many backends such as Mtal, CUDA, and OpenCL, allowing you to run AI models in a variety of environments.

A change to add an image input function to llama.cpp was merged on May 10, 2025. Below is an example of entering 'what is this?' along with a poster image in the llama.cpp Web UI. A sentence explaining what kind of poster it is is generated.



It has also been

reported that a 4-bit quantized version of Mistral Small 2504 was successfully run on a PC equipped with a Radeon RX 7900 XTX to input images and text simultaneously. In the example below, a photo of a plate full of strawberries is input while a more difficult question is posed : 'How many strawberries are there in the letter 'R'?' According to the reporter, sometimes the answer was correct and sometimes it was incorrect.



Simon Willison, an engineer with a reputation for

measuring AI performance using Pelican , ran a 4-bit quantized version of Gemma 3 using llama.cpp on his laptop and typed the text 'describe it' along with the image of a Pelican below.



The answer is as follows: The features of the photo are explained in detail.

It's a striking photograph of a brown pelican standing in shallow water on a sandy beach. Here's a breakdown of what I see:

Subject: The main focus is a large brown pelican. It's standing with its wings fully extended, creating a dramatic pose.

Pose: The pelican is in a slightly awkward, almost comical stance, with its legs planted firmly in the wet sand and its wings spread wide. It looks like it might be preparing to take off or is simply enjoying the water.

Color & Texture: The pelican is a rich, dark brown with lighter markings on its wings and head. The sand is a muted beige color, and the water reflects the sunlight, creating a shimmering effect. You can see the texture of the sand and the ripples in the water.

Lighting: The lighting appears to be bright and sunny, casting shadows and highlighting the details of the pelican's feathers.

Background: The background is blurred, suggesting a shallow depth of field, which keeps the focus on the pelican. It appears to be a beach scene.

Overall Impression: The image has a dynamic feel due to the pelican's pose and the movement implied by the water. It's a well-composed shot that captures a moment of wildlife in its natural habitat.

Do you want me to zoom in on a particular detail, or perhaps tell you more about pelicans?



Detailed information on llama.cpp is available at the following link, where you can check the source code, installation instructions, etc.

GitHub - ggml-org/llama.cpp: LLM inference in C/C++
https://github.com/ggml-org/llama.cpp



in Software, Posted by log1o_hf