An ambiguous city street, a freshly mown field, and a parked armoured vehicle were among the sample images we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate.
Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images and were highly prone to hallucinations. However, such models have evolved rapidly since then.
To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, we ran 500 geolocation tests, with 20 models each analysing the same set of 25 images.
Our evaluation included older and “deep research” versions of the models, to track how their geolocation capabilities have developed over time. We also included Google Lens, to compare whether LLMs offer a real improvement over traditional reverse image search. While reverse image search tools work differently from LLMs, they remain one of the most effective ways to narrow down an image’s location when starting from scratch.
The Test
We used 25 of our own travel photos to test a range of outdoor scenes in both rural and urban areas, with and without identifiable landmarks such as buildings, mountains, signs or roads. These photos were sourced from every continent, including Antarctica.
The vast majority have not been reproduced here, as we intend to continue using them to evaluate newer models as they are released. Publishing them would compromise the integrity of future tests.
Each LLM was given a photo that had not been published online and contained no metadata. All models then received the same prompt, “Where was this photo taken?”, alongside the image. If an LLM asked for more information, the response was identical: “There is no supporting information. Use this photo alone.”
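A test run like this can be scripted against a model API. Below is a minimal, illustrative sketch using the OpenAI Python client; the model name and file path are placeholders rather than our exact setup, and the other providers’ APIs differ in their details.

```python
# Minimal sketch of a single geolocation test run.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the model name and file path are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()

# Encode the local photo (already stripped of metadata) as base64,
# so no URL or filename context accompanies the image
with open("test_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="o4-mini",  # placeholder: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Where was this photo taken?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```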
We tested the following models:
| Developer | Model | Developer’s Description |
| --- | --- | --- |
| Anthropic | Claude Haiku 3.5 | “fastest model for daily tasks” |
| | Claude Sonnet 3.7 | “our most intelligent model yet” |
| | Claude Sonnet 3.7 (extended thinking) | “enhanced reasoning capabilities for complex tasks” |
| | Claude Sonnet 4.0 | “smart, efficient model for everyday use” |
| | Claude Opus 4.0 | “powerful, large model for complex challenges” |
| Google | Gemini 2.0 Flash | “for everyday tasks plus more features” |
| | Gemini 2.5 Flash | “uses advanced reasoning” |
| | Gemini 2.5 Pro | “best for complex tasks” |
| | Gemini Deep Research | “get in-depth answers” |
| Mistral | Pixtral Large | “frontier-level image understanding” |
| OpenAI | ChatGPT 4o | “great for most tasks” |
| | ChatGPT Deep Research | “designed to perform in-depth, multi-step research using data on the public web” |
| | ChatGPT 4.5 | “good for writing and exploring ideas” |
| | ChatGPT o3 | “uses advanced reasoning” |
| | ChatGPT o4-mini | “fastest at advanced reasoning” |
| | ChatGPT o4-mini-high | “great at coding and visual reasoning” |
| xAI | Grok 3 | “smartest” |
| | Grok 3 DeepSearch | “advanced search and reasoning” |
| | Grok 3 DeeperSearch | “extended search, more reasoning” |
This was not a comprehensive review of all available models, partly due to the pace at which new models and versions are currently being released. For example, we did not assess DeepSeek, as it currently only extracts text from images. Note that in ChatGPT, regardless of which model you select, the “deep research” function is currently powered by a version of o4-mini.
Gemini models have been released in “preview” and “experimental” formats, as well as dated versions like “03-25” and “05-06”. To keep the comparisons manageable, we grouped these variants under their respective base models, e.g. “Gemini 2.5 Pro”.
We also compared each test with the first 10 results from Google Lens’s “visual match” feature, to assess the difficulty of the tests and the usefulness of LLMs in solving them.
We ranked all responses on a scale from 0 to 10, with 10 indicating an accurate and specific identification, such as a neighbourhood, trail, or landmark, and 0 indicating no attempt to identify the location at all.
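Scoring this way yields a 20-model by 25-image grid of values, and ranking the models is then a simple aggregation. A sketch with hypothetical scores (not our actual results):

```python
# Rank models by their average 0-10 geolocation score.
# The numbers here are hypothetical, for illustration only.
scores = {
    "ChatGPT o3":     [10, 8, 7, 9],  # one entry per test image
    "Google Lens":    [9, 7, 6, 8],
    "Gemini 2.5 Pro": [5, 3, 6, 4],
}

ranking = sorted(scores.items(),
                 key=lambda kv: sum(kv[1]) / len(kv[1]),
                 reverse=True)
for model, vals in ranking:
    print(f"{model}: {sum(vals) / len(vals):.1f}")
```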
And the Winner is…
ChatGPT beat Google Lens.
In our tests, ChatGPT o3, o4-mini, and o4-mini-high were the only models to outperform Google Lens in identifying the correct location, though not by a large margin. All other models were less effective when it came to geolocating our test images.
We scored 20 models against 25 images, rating each from 0 (red) to 10 (dark green) for accuracy in geolocating the photos.
Even Google’s own LLM, Gemini, fared worse than Google Lens. Surprisingly, it also scored lower than xAI’s Grok, despite Grok’s well-documented tendency to hallucinate. Gemini’s Deep Research mode scored roughly the same as the three Grok models we tested, with DeeperSearch proving the most effective of xAI’s LLMs.
The best-scoring models from Anthropic and Mistral lagged well behind their current competitors from OpenAI, Google, and xAI. In several cases, even Claude’s most advanced models identified only the continent, while others were able to narrow their responses down to specific parts of a city. The latest Claude model, Opus 4, performed at a similar level to Gemini 2.5 Pro.
Here are some of the highlights from five of our tests.
A Road in the Japanese Mountains
The photo below was taken on the road between Takayama and Shirakawa in Japan. Besides the road and mountains, signs and buildings are also visible.
Gemini 2.5 Pro’s response was not useful. It mentioned Japan, but also Europe, North and South America and Asia. It replied:
“Without any clear, identifiable landmarks, distinctive signage in a recognisable language, or unique architectural styles, it’s very difficult to determine the exact country or specific location.”
In contrast, o3 identified both the architectural style and signage, responding:
“Best guess: a snowy mountain stretch of central Honshu, Japan, somewhere in the Nagano/Toyama area. (Japanese-style houses, kanji on the billboard, and typical expressway barriers give it away.)”
A Field on the Swiss Plateau
This photo was taken near Zurich. It showed no easily recognisable features apart from the mountains in the distance. A reverse image search using Google Lens did not immediately lead to Zurich. Without any context, identifying the location of this photo manually could take some time. So how did the LLMs fare?
Gemini 2.5 Pro stated that the photo showed scenery common to many parts of the world and that it could not narrow it down without more context.
By contrast, ChatGPT excelled at this test. o4-mini identified the “Jura foothills in northern Switzerland”, while o4-mini-high placed the scene “between Zürich and the Jura mountains”.
These answers stood in stark contrast to those from Grok Deep Research, which, despite the visible mountains, confidently stated the photo was taken in the Netherlands. This conclusion appeared to be based on the Dutch name of the account used, “Foeke Postma”, with the model assuming the photo must have been taken there and calling it a “reasonable and well-supported inference”.
An Inner-City Alley Filled with Visual Clues in Singapore
This photo of a narrow alleyway on Circular Road in Singapore provoked a wide range of responses from the LLMs and Google Lens, with scores ranging from 3 (nearby country) to 10 (correct location).
The test served as a good example of how LLMs can outperform Google Lens by focusing on small details in a photo to identify the exact location. Those that answered correctly referenced the writing on the mailbox on the left in the foreground, which revealed the precise address.
While Google Lens returned results from across Singapore and Malaysia, part of ChatGPT o4-mini’s response read: “This appears to be a classic Singapore shophouse arcade – in fact, if you look at the mailboxes on the left you can just make out the label ‘[correct address].’”
Some of the other models noticed the mailbox but could not read the address visible in the image, falsely inferring that it pointed to other locations. Gemini 2.5 Flash responded, “The design of the mailboxes on the left, particularly the ‘G’ for Geylang, points strongly towards Singapore.” Another Gemini model, 2.5 Pro, saw the mailbox but focused instead on what it interpreted as Thai script on a storefront, confidently answering: “The visual evidence strongly suggests the photo was taken in an alleyway in Thailand, likely in Bangkok.”
The Costa Rican Coast
One of the harder tests we gave the models was a photo taken from Playa Langosta on the Pacific coast of Costa Rica, near Tamarindo.
Gemini and Claude performed the worst on this task, with most models either declining to guess or giving incorrect answers. Claude 3.7 Sonnet correctly identified Costa Rica but hedged with other locations, such as Southeast Asia. Grok was the only model to guess the exact location correctly, while several ChatGPT models (Deep Research, o3 and the o4-minis) guessed within 160km of the beach.
An Armoured Vehicle on the Streets of Beirut
This photo was taken on the streets of Beirut and features several details useful for geolocation, including an emblem on the side of the armoured personnel carrier and a partially visible Lebanese flag in the background.
Surprisingly, most models struggled with this test: Claude 4 Opus, billed as a “powerful, large model for complex challenges”, guessed “somewhere in Europe” owing to the “European-style street furniture and building design”, while Gemini and Grok could only narrow the location down to Lebanon. Half of the ChatGPT models responded with Beirut. Only two models, both ChatGPT, referenced the flag.
So Have LLMs Finally Mastered Geolocation?
LLMs can certainly help researchers spot details that Google Lens, or they themselves, might miss.
One clear advantage of LLMs is their ability to search in multiple languages. They also appear to make good use of small clues, such as vegetation, architectural styles or signage. In one test, a photo of a man wearing a life vest in front of a mountain range was correctly located because the model identified part of a company name on his vest and linked it to a nearby boat tour operator.
For tourist areas and scenic landscapes, Google Lens still outperformed most models. When shown a photo of Schluchsee lake in the Black Forest, Germany, Google Lens returned it as the top result, while ChatGPT was the only LLM to correctly identify the lake’s name. In contrast, in urban settings, LLMs excelled at cross-referencing subtle details, while Google Lens tended to fixate on larger, similar-looking structures, such as buildings or ferris wheels, which appear in many other locations.
Heat map showing how each model performed on all 25 tests.
Enhanced Reasoning Modes
You might assume that turning on “deep research” or “extended thinking” functions would result in higher scores. However, on average, Claude and ChatGPT performed worse. Only one Grok model, DeeperSearch, and one Gemini model, Gemini Deep Research, showed improvement. For example, ChatGPT Deep Research was shown a photo of a coastline and took nearly 13 minutes to produce an answer that was about 50km north of the correct location. Meanwhile, o4-mini-high responded in just 39 seconds and gave an answer 15km closer.
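Distance errors like these can be checked with the standard haversine (great-circle) formula; a short sketch with illustrative coordinates:

```python
# Great-circle distance between a model's guess and the true location.
# Coordinates below are illustrative, not an actual test location.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# A guess ~0.45 degrees of latitude to the north is roughly 50km off
print(round(haversine_km(46.00, 7.00, 46.45, 7.00)))  # -> 50
```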
Overall, Gemini was more cautious than ChatGPT, but Claude was the most cautious of all. Claude’s “extended thinking” mode made Sonnet even more conservative than the standard version. In some cases, the regular model would hazard a guess, albeit hedged in probabilistic terms, while with “extended thinking” enabled for the same test, it either declined to guess or offered only vague, region-level responses.
LLMs Continue to Hallucinate
All of the models, at some point, returned answers that were completely wrong. ChatGPT was generally more confident than Gemini, often leading to better answers, but also to more hallucinations.
The likelihood of hallucinations increased when the environment was temporary or had changed over time. In one test, for instance, a beach photo showed a large hotel and a temporary ferris wheel (installed in 2024 and dismantled during winter). Many of the models consistently pointed to a different, more frequently photographed beach with a similar ride, despite clear differences.
Final Thoughts
Your account and prompt history may bias results. In one case, when analysing a photo taken in the Coral Pink Sand Dunes State Park, Utah, ChatGPT o4-mini referenced previous conversations with the account holder: “The user mentioned Durango and Colorado earlier, so I suspect they may have posted a photo from a previous trip.”
Similarly, Grok appeared to draw on a user’s Twitter profile and past tweets, even without explicit prompts to do so.
Video comprehension also remains limited. Most LLMs cannot search for or watch video content, cutting off a rich source of location data. They also struggle with coordinates, often returning rough or simply incorrect responses.
Ultimately, LLMs are no silver bullet. They still hallucinate, and when a photo lacks detail, geolocating it will still be difficult. That said, unlike our controlled tests, real-world investigations often involve more context. While Google Lens accepts only keywords, LLMs can be supplied with far richer information, making them more adaptable.
There is little doubt that, at the rate they are evolving, LLMs will continue to play an increasingly important role in open source research. And as newer models emerge, we will continue to test them.
Infographics by Logan Williams and Merel Zoet