Where Do LLMs Go to Learn? (and How to Get Featured)

December 1, 2024 Hugo Huijer

As large language models (LLMs) like OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude become central to digital experiences, understanding where these models source their information is critical. If your brand isn't represented in their training data, it risks being invisible in AI-generated answers. This guide explores the major LLMs, what they train on, and how to strategically position your brand within these AI ecosystems.

How big LLMs differ from each other

OpenAI's GPT models (which power ChatGPT), Google's Gemini, and Anthropic's Claude are the leading LLMs, but they differ in their data sources, training methods, and goals:

  • OpenAI's GPT (e.g., GPT-4) focuses on generating conversational and creative content. It's trained on a broad dataset, including books, websites, and articles.
  • Google's Gemini leverages the search giant's vast data resources. It aims to provide factually accurate information, integrating real-time web results and data from partners like Reddit.
  • Anthropic's Claude emphasizes safety and neutrality, trained on carefully curated datasets that avoid harmful content while ensuring broad knowledge coverage.

Understanding these differences helps you tailor your AIO (AI Optimization) strategy to each platform's strengths and weaknesses.

How your brand visibility depends on your presence in the LLMs' training data

For an LLM to mention your brand, the brand needs to appear in its training data. Unlike SEO, where visibility depends on ranking algorithms, AIO is about ensuring your brand is referenced in the reputable data sources that feed these models. If you're absent from those datasets, you're effectively invisible in AI-generated conversations.

What content does OpenAI train on?

OpenAI's GPT models use diverse datasets:

  • Public web data: OpenAI's models are trained on extensive web content, including blogs, forums, news sites, and books up until their knowledge cutoff (September 2023 for GPT-4).
  • Licensed data: OpenAI partners with organizations like the Associated Press, Axel Springer, Vox Media, Dotdash Meredith, and Stack Overflow to enhance reliability.
  • Microsoft ecosystem: OpenAI's relationship with Microsoft means its models likely benefit from data within platforms like LinkedIn (professional content, articles) and GitHub (code repositories, developer discussions). This broadens its knowledge in business, technical, and professional spheres.
  • Legal challenges and licensing: OpenAI has faced legal scrutiny over its data practices. For instance, The New York Times sued OpenAI for allegedly using its content without consent, claiming the AI plagiarized articles. This case underscores a shift in AI training towards licensed data and first-party sources, ensuring compliance and quality.

Implication for brands: Focus on high-quality, widely read content. Getting featured in top media outlets, industry-leading blogs, or authoritative forums like Stack Overflow increases your chances of being part of OpenAI's training set. For B2B companies, establishing a strong LinkedIn presence is key.

What content does Gemini (Google) train on?

Gemini benefits from Google's massive data infrastructure:

  • Web crawls: Google leverages data from its search index, prioritizing high-ranking pages.
  • Owned platforms: Data from Google-owned platforms like YouTube and Google Maps likely plays a role, giving Gemini rich multimedia and geographic context. For example, popular YouTube channels or reviews can influence how an AI discusses a brand.
  • Partnerships: Notably, Gemini incorporates data from platforms like Reddit. Google's $60 million/year partnership with Reddit ensures AI access to real-world discussions.

Implication for brands: Beyond traditional SEO, ensure your brand has a presence on platforms Google owns. Engaging on YouTube, contributing to Google Maps reviews, and participating in industry-relevant subreddits can significantly boost visibility.

What content does Anthropic (Claude) train on?

Anthropic's Claude focuses on responsible AI development:

  • Public datasets: Claude's training data includes websites, books, and open-source knowledge bases.
  • Curated sources: Anthropic avoids controversial or low-quality data, focusing on reputable sources that align with their safety-first principles.

Implication for brands: Maintain a strong presence on trusted platforms and ensure your content adheres to high-quality standards. Claude is more likely to include your brand if it's featured in ethical, high-authority discussions.

How do you get featured in the training data?

  1. Create high-quality, authoritative content: Contribute valuable insights to your industry. Articles, whitepapers, and thought leadership pieces on reputable platforms increase your inclusion likelihood.
  2. Leverage community platforms:
    • Reddit: Engaging in industry-relevant subreddits or starting discussions can elevate your brand visibility.
    • Quora: Its authoritative Q&A format makes it a valuable training data source.
    • Stack Overflow and GitHub: For tech brands, contributing to these platforms ensures visibility in AI models used by developers.
  3. Earn media mentions: Get featured in news articles and trusted blogs. AI models prioritize authoritative, well-cited sources.
  4. Understand platform ecosystems:
    • LinkedIn: Important for OpenAI's models. Contribute thought leadership or industry insights.
    • YouTube: Key for Google Gemini. Create engaging content related to your industry.
    • Twitter/X: Elon Musk's Grok AI is trained on tweet data, so maintaining an active, insightful Twitter presence is valuable.

Will LLMs ever run out of training data?

As LLMs grow more sophisticated, a critical challenge emerges: the risk of running out of quality training data. According to Spiceworks, AI models require vast amounts of authentic and unique content to continue improving. Without fresh data, models may stagnate, unable to learn new information or adapt to evolving contexts.

The risk of self-referential training: As AI-generated content proliferates online, there's a growing concern that future LLMs may end up training on their own output. This creates a "snake-eating-its-own-tail" scenario, where the data loop could degrade model quality. AI models thrive on diverse, human-created content that reflects genuine experiences and insights. If models feed on AI-generated material, it can lead to model collapse---a decline in the accuracy and originality of generated responses.
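The degradation loop described above can be illustrated with a toy simulation: repeatedly fit a Gaussian to a finite sample, then draw the next "generation" of data from the fit. This is a simplified statistical sketch, not how any lab actually trains models; the sample size, generation count, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def collapse_simulation(n_samples=30, generations=500, seed=42):
    """Each 'generation' fits a Gaussian to samples drawn from the
    previous generation's fitted model, mimicking AI output feeding
    back into the next round of training data."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original "human" distribution
    variances = []
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(synthetic)   # refit on model output only
        sigma = statistics.stdev(synthetic)
        variances.append(sigma ** 2)
    return variances

history = collapse_simulation()
# The fitted variance tends to drift toward zero over generations:
# diversity is lost, even though each individual fit looks reasonable.
```

Each refit adds estimation noise, and tails of the distribution that a finite sample misses are gone for good, so the model's view of the world narrows generation after generation. That is the intuition behind model collapse.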

Exclusive data licensing deals: To counter this, companies like Google and OpenAI are striking licensing agreements with major content providers. For example, OpenAI's partnerships with The Associated Press, Axel Springer, Dotdash Meredith, and Vox Media grant access to exclusive news archives, ensuring high-quality, up-to-date training material.

Final thoughts

AIO isn't about keyword stuffing or ranking tricks---it's about making sure your brand becomes part of the knowledge that AI models learn from. By understanding where LLMs go to "school," you can strategically position your brand to be part of the conversation, shaping how AI represents you to the world.

In summary, the future of SEO lies in becoming part of the datasets LLMs are trained on. Being on the web isn't enough; your content has to be recognized and included in the data these models learn from.

As AI continues to evolve, those who adapt their strategies to the new world of AIO will be best positioned to thrive.

Understand how AI is talking about your brand

Track how different AI models respond to your prompts. Compare OpenAI and Google Gemini responses to increase your visibility in LLMs.
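The comparison step can be sketched in a few lines of Python. Everything below is hypothetical: the model names, the response texts, and the brand "Acme" are placeholders, and in practice the response strings would come from the OpenAI and Gemini APIs for the same prompt.

```python
import re

def brand_mentions(responses, brand):
    """Count case-insensitive mentions of `brand` in each model's answer."""
    pattern = re.compile(re.escape(brand), re.IGNORECASE)
    return {model: len(pattern.findall(text)) for model, text in responses.items()}

# Hypothetical responses -- in practice, fetch these from each model's API
# using the same prompt (e.g., "What analytics tools do you recommend?").
responses = {
    "gpt-4": "Popular tools include Acme Analytics and two competitors.",
    "gemini": "There are several options; Acme Analytics is one of them. Acme also offers a free tier.",
}
print(brand_mentions(responses, "Acme"))  # → {'gpt-4': 1, 'gemini': 2}
```

Running the same prompt set against each model on a schedule and tracking these counts over time gives a simple baseline for how your visibility differs across LLMs.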

Start monitoring AI responses →