Consistently inconsistent: 5-month study of AI's non-deterministic responses

May 4, 2025 Hugo Huijer

We are witnessing an AI boom where consumers increasingly rely on ChatGPT, Perplexity, and Google Gemini for recommendations. For brands, visibility in these AI responses has become a new frontier of digital marketing. But there's a critical factor many marketers overlook: the fundamentally non-deterministic nature of Large Language Models.

At Trackerly, we wanted to understand exactly how volatile AI recommendations can be. So we conducted a simple experiment. We asked the same question every day for five months across multiple AI platforms: "Which movies are most recommended as 'all-time classics' by AI?"

The results were eye-opening. After tracking 153 answers from ChatGPT alone, we discovered that while the general consensus remained similar, no two answers were identical. The implications for brands are profound -- if you find your product featured prominently in an AI response today, a simple refresh might completely change your visibility tomorrow. This non-deterministic behavior presents both challenges and opportunities for brands seeking visibility in the AI era.

Key takeaways

This study reveals critical insights for brands concerned about their visibility in AI recommendations:

  1. All AI platforms are non-deterministic by nature. Even with identical prompts, responses will vary from query to query.
  2. Consistency varies significantly by platform. DeepSeek and Gemini demonstrated more stable rankings than ChatGPT and Perplexity.
  3. Even well-established topics show volatility. If "best movies" -- a topic with abundant training data across Wikipedia, IMDb, Reddit, and countless film sites -- produces such variable results, imagine the volatility for less-documented topics like emerging brands.
  4. Citation mechanisms don't guarantee stability. Contrary to what might be expected, Perplexity's citation-based approach resulted in the most volatile recommendations.
  5. Platform updates can dramatically impact visibility. Gemini's addition of explanatory text changed the prominence of its recommendations overnight, demonstrating how platform changes can affect brand visibility.

The experiment

Using Trackerly, we've been tracking this simple prompt for 5 months straight, every single day:

Which movies are most recommended as "all-time classics" by AI?

We tracked responses across five leading AI platforms: OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, Perplexity, and DeepSeek. Our data collection revealed fascinating patterns in how these systems respond to identical queries over time.
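
Comparing daily answers requires first reducing each free-text response to an ordered list of titles. As a rough illustration (the function name and list format are our own, and real responses vary widely in structure), a parser for numbered-list answers might look like:

```python
import re

def extract_ranking(response_text):
    """Reduce a numbered-list AI response to an ordered list of titles.

    Assumes lines like "1. Title (Year)"; hyphenated or unnumbered
    formats -- which the platforms do produce -- would need extra handling.
    """
    ranking = []
    for line in response_text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+([^(\-]+)", line)
        if match:
            ranking.append(match.group(1).strip())
    return ranking

sample = """Here are some all-time classics:
1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)"""
print(extract_ranking(sample))  # ['Citizen Kane', 'The Godfather', 'Casablanca']
```

The formatting drift described below is exactly what makes this step fragile: a platform that switches from numbered lists to prose paragraphs changes what a tracker can extract.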

ChatGPT results: The shuffle never stops

ChatGPT showed significant variability in its responses. Even though certain films consistently appeared in the top recommendations, their exact ranking and descriptions changed with every query.

Here are two examples from different dates:

[Image: ChatGPT example response 1] [Image: ChatGPT example response 2]

The top 3 consistently included Citizen Kane, The Godfather, and Casablanca, but their order changed frequently. The remaining seven films that completed the top 10 showed even greater variability, with movies like Pulp Fiction sometimes ranking as high as #4 and other times as low as #10.

Using our Relative Position of First Mention (RPOFM) metric, we can visualize this volatility over time:

[Image: ChatGPT position tracking graph]

Understanding Relative Position of First Mention (RPOFM)

Relative Position of First Mention is calculated by taking the character index at which an item first appears in an AI's response and dividing it by the total character length of that response. This yields a value between 0 and 1 that tells us where in a response a particular item appears, normalized against the total response length: the lower the value, the earlier the item is mentioned.
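
The metric itself is only a few lines of code. A minimal sketch (the function name is ours, and this uses a plain substring match, whereas production tracking would need fuzzier matching):

```python
def rpofm(response_text, item):
    """Relative Position of First Mention: character index of the first
    occurrence of `item`, divided by the response's total character length.
    Returns a value in [0, 1); lower means an earlier mention.
    Returns None when the item is absent from the response."""
    position = response_text.find(item)
    if position == -1:
        return None
    return position / len(response_text)

answer = "The all-time classics are Citizen Kane, The Godfather, and Casablanca."
print(round(rpofm(answer, "The Godfather"), 2))  # 0.57
```

An item mentioned in the opening sentence scores near 0, while one buried at the end scores near 1 -- which is why appending extra text to a response shifts every item's score even when the list itself is unchanged.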

Gemini: More consistent, more verbose

Google's Gemini showed interesting differences in its response patterns. It tended to be more "wordy," often opening with an introduction before providing its list of recommendations.

[Image: Gemini example response 1] [Image: Gemini example response 2]

What's particularly interesting is that Gemini demonstrated significantly more consistency in its rankings than ChatGPT. The Godfather was almost always ranked #3, while Citizen Kane and Casablanca regularly occupied the top two positions, though they occasionally swapped places.

[Image: Gemini position tracking graph]

In February 2025, we observed a notable change in Gemini's behavior -- it began adding approximately 200-word explanations at the end of its responses. This increased the overall length of answers and consequently decreased the relative prominence of the recommended movies:

[Image: Gemini example with added explanations]

Claude: Consistent core, varying format

Claude was added to our tracking in January 2025. While it also showed non-deterministic behavior, its responses maintained a fairly consistent core of recommendations.

[Image: Claude example response 1] [Image: Claude example response 2]

Sometimes Claude would simply list the same top movies in a paragraph, while other times it would categorize recommendations by genre. Regardless of format, the top films remained relatively stable.

Our RPOFM tracking showed Claude maintained a middle ground in terms of consistency -- more stable than ChatGPT and Perplexity, but with more variation than Gemini and DeepSeek in how it presented its recommendations.

[Image: Claude position tracking graph]

Perplexity: Most volatile despite citations

Perplexity presented the most surprising findings. Despite being the only platform that consistently grounds its responses with citations for every query, it demonstrated the highest volatility in recommendations.

In some responses, Perplexity interpreted our question as asking about AI-themed movies, recommending films like Blade Runner and 2001: A Space Odyssey. In others, it provided more traditional classics like The Godfather and Citizen Kane:

[Image: Perplexity example response 1] [Image: Perplexity example response 2]

This variability is clearly visible in the RPOFM chart:

[Image: Perplexity position tracking graph]

There was a brief period from late February to early April 2025 when Perplexity's recommendations aligned more closely with other platforms, but otherwise, its answers fluctuated dramatically despite using the same prompt every time.

DeepSeek: Remarkably consistent despite platform issues

DeepSeek shows plenty of promise, were it not for the platform's spotty reliability. We added DeepSeek tracking on February 13, 2025, but it has been plagued by connection errors ever since. Despite these issues, DeepSeek's recommendations were remarkably consistent:

[Image: DeepSeek example response 1] [Image: DeepSeek example response 2]

Its top five consistently featured The Godfather, Casablanca, Citizen Kane, Gone with the Wind, and Schindler's List, with minimal position changes:

[Image: DeepSeek position tracking graph]

Consistency ranking: LLMs from most to least predictable

Having examined each platform individually, we can now compare their relative consistency -- a crucial factor for brands seeking reliable visibility in AI recommendations, since it indicates which platforms might offer more stable representation.

Based on our five-month study, we can rank the five LLMs according to their consistency in movie recommendations. This ranking was determined by analyzing two key factors:

  1. Consistency of the movies included in the top 3 positions
  2. Consistency in the ordering of those top 3 movies
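
As an illustration of how two such factors can be scored from daily data (these exact metric choices -- pairwise Jaccard overlap for membership, modal-order agreement for ordering -- are ours, not necessarily what Trackerly uses internally):

```python
from collections import Counter
from itertools import combinations

def top3_consistency(daily_top3):
    """Score the two factors above for one platform.

    membership: mean pairwise Jaccard overlap of the daily top-3 sets
    ordering:   share of days whose exact top-3 order matches the modal order
    """
    sets = [frozenset(day) for day in daily_top3]
    pairs = list(combinations(sets, 2))
    membership = sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
    _, modal_count = Counter(tuple(day) for day in daily_top3).most_common(1)[0]
    ordering = modal_count / len(daily_top3)
    return membership, ordering

# Hypothetical data: stable membership, with #1 and #2 occasionally swapped
days = [
    ["Citizen Kane", "Casablanca", "The Godfather"],
    ["Casablanca", "Citizen Kane", "The Godfather"],
    ["Citizen Kane", "Casablanca", "The Godfather"],
]
print(top3_consistency(days))  # (1.0, 0.6666666666666666)
```

A Gemini-like platform scores high on both factors; a Perplexity-like platform, which sometimes swaps out the movies entirely, scores low on membership before ordering is even considered.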

Here's how the platforms ranked from most to least consistent:

1. Gemini

Google's Gemini demonstrated the highest level of consistency in our testing. Not only did it maintain the same three films (Citizen Kane, The Godfather, and Casablanca) in its top positions, but it also showed remarkable stability in their ordering. The Godfather was consistently ranked at #3, while Citizen Kane and Casablanca maintained stable positions at #1 and #2, with only occasional swapping between them. This predictability makes Gemini the most reliable platform for consistent recommendations.

2. DeepSeek

Despite connectivity issues that resulted in gaps in our data collection, DeepSeek showed impressive consistency when it was accessible. Its top five recommendations remained stable throughout the testing period, with minimal position changes among The Godfather, Casablanca, Citizen Kane, Gone with the Wind, and Schindler's List. The ordered stability of its top recommendations puts it firmly in second place.

3. Claude

Anthropic's Claude maintained a consistent core of recommendations but showed more variability in both formatting and ordering than the top two platforms. While the same movies consistently appeared in its recommendations, their exact positions were less stable than in Gemini or DeepSeek responses.

4. OpenAI (ChatGPT)

ChatGPT showed significant variability in its responses. Although Citizen Kane, The Godfather, and Casablanca consistently appeared in its top 3 recommendations, their order was frequently shuffled, making its rankings unpredictable. Furthermore, the remaining films in its top 10 demonstrated even greater volatility, with movies like Pulp Fiction ranging anywhere from #4 to #10 in different responses.

5. Perplexity

Despite being the only platform to consistently ground its responses with citations, Perplexity ranked as the least consistent of all platforms tested. Not only did the order of recommendations vary widely, but the actual movies recommended sometimes shifted entirely based on different interpretations of the same query. At times, Perplexity would recommend films like Blade Runner and 2001: A Space Odyssey, interpreting our prompt as asking about AI-themed movies, while at other times it provided more traditional classics. This fundamental inconsistency in query interpretation, combined with ranking volatility, makes Perplexity the least predictable platform in our study.

Non-deterministic answers: Correlation between training data and consistency

The degree of randomness in AI responses directly correlates with the amount and consistency of training data on a given topic. This relationship has profound implications for brands seeking visibility in AI recommendations.

If you ask any AI provider what 2 + 2 equals, the answer will invariably be 4. This consistency stems from the simplicity of the question and the undisputed nature of the answer across all training data. Mathematical facts, basic definitions, and other unambiguous information tend to yield highly consistent responses.

However, as we move into more subjective or nuanced domains, the non-deterministic nature of LLMs becomes increasingly apparent. Our study on movie recommendations demonstrates this phenomenon perfectly. Despite decades of internet discussions about classic films -- discussions that have presumably been included in these models' training data -- the rankings remain inconsistent.

This observation leads to a critical insight for brands: the more diluted or disputed your presence is in the training data, the more volatile your visibility will be in AI responses. Even established cultural touchstones like "The Godfather" and "Citizen Kane" experience ranking volatility, suggesting that newer or less prominent brands will face even greater challenges in maintaining consistent visibility.

For brands that aren't well-featured in training data, the "randomness" factor increases substantially. Your brand might appear prominently in one response and be completely absent in the next, even with identical prompts.

We have several additional studies planned to explore these dynamics further. The central question for marketers has become: How can I influence these LLMs to consistently include my brand? Trackerly users have achieved promising results by focusing on targeted strategies, though the inherent non-deterministic nature of LLMs will always introduce some level of volatility in brand visibility.

Conclusion

Our findings reveal that despite vast amounts of training data available on classic films -- arguably one of the most well-documented topics on the internet -- AI recommendations remain highly volatile. While certain movies consistently appear in the top results, their exact positions shift constantly.

For marketers and brand managers, this has significant implications. If AI systems can't maintain consistency when recommending widely recognized classics like Citizen Kane or The Godfather, they'll likely struggle even more with lesser-known brands and products.

This non-deterministic behavior means that AI visibility isn't a one-time achievement but requires ongoing monitoring and strategy adjustment. Understanding these patterns through metrics like Relative Position of First Mention (RPOFM) can help brands make better decisions about their AI visibility strategy.

The AI recommendation landscape is still developing, with each platform showing different patterns and levels of consistency. As we continue to track these changes, one thing is clear: the brands that recognize and adapt to this new reality of AI volatility will be best positioned to maintain visibility in an increasingly AI-driven consumer experience.

Want to start tracking your brand's visibility across all major AI platforms? Try Trackerly free for 7 days and get ahead of the competition in the AI recommendation space.

Understand how AI is talking about your brand

Track how different AI models respond to your prompts. Compare OpenAI and Google Gemini responses to increase your visibility in LLMs.

Start monitoring AI responses →