How to Optimize your Videos for ChatGPT, Gemini, Perplexity, and Claude

There are a few ways in which LLMs can surface and interpret video content:

  • They can share information learned from reading video transcripts.
  • They can share a link to a specific video when various factors indicate it is a relevant source for the user’s query.
  • They can analyze the transcripts and metadata of a particular video when directly prompted.

In all of the above cases, the essential way LLMs such as ChatGPT, Gemini, Claude, Perplexity, and Grok interact with video content is via supporting text.

LLMs operate in the modality of written language, and cannot (yet) parse videos as a stream of moving images paired with an audio track.

The reason for this is largely one of expense and resource allocation: it’s enormously expensive and difficult for robots to process video files.

Where 100 words of basic HTML might stretch to 0.8KB, the same words spoken and presented as a 45-second HD video will be about 20MB. That’s 25,000 times more data to say the same thing.

So while we’re at a point where certain advanced crawlers can parse and comprehend video files when prompted, we’re still a long way from a common web crawler being able to engage in such processing with every video it comes across.

As a consequence, the way we must think about optimizing videos for LLMs, for the foreseeable future, needs to be anchored in the optimization of the supporting information that surrounds the video itself.

So how do you make videos visible to AI?

Transcripts. Titles. Descriptions.

Each video should have a transcript, a clear title, and a detailed description that explains what’s in the video.

But, beyond this, these metadata elements need to be visible to LLM crawlers.

This means ensuring they do not rely on executable JavaScript to render and are not hidden inside something like an iframe. In practice, that means markup along the lines of the sketch below.
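The element names and copy here are placeholders rather than a required structure; the point is simply that the title, description, and transcript exist as plain HTML that a crawler can read without running any scripts:

  <!-- Title, description, and transcript shipped as plain HTML, readable without JavaScript -->
  <section class="video-section">
    <h2>How to set up your first project</h2>
    <p>A three-minute walkthrough of creating a project, inviting teammates, and publishing it.</p>

    <!-- The player itself can load however you like; the text around it does not depend on it -->
    <div id="player"></div>

    <h3>Transcript</h3>
    <p>Hi, I'm Sam, and in this video I'll walk you through setting up your first project...</p>
  </section>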

That requirement is a bit of a problem, because 95%+ of videos on the open web rely on JavaScript or iframes for delivery.

At Wistia, we have a solution to this with LLM-friendly embeds.

These embeds include the transcripts for the video as basic HTML text elements within the embed code, and then use JavaScript to overwrite this text with a video player.

Essentially, they provide a text fallback for any crawler or user unable to render and play the video.
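The general shape of that pattern looks something like the sketch below. This is a simplified illustration of the approach rather than Wistia’s actual embed code, and the class name and video URL are placeholders:

  <!-- The transcript ships as plain HTML inside the embed container -->
  <div class="llm-friendly-embed">
    <p class="transcript">
      Hi, I'm Sam, and in this video I'll walk you through setting up your first project...
    </p>
  </div>

  <script>
    // When JavaScript runs, the text fallback is swapped for an actual video player.
    // Crawlers that don't execute JavaScript only ever see the transcript above.
    var container = document.querySelector('.llm-friendly-embed');
    var player = document.createElement('video');
    player.src = 'https://example.com/videos/abc123.mp4'; // placeholder URL
    player.controls = true;
    container.replaceChildren(player);
  </script>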

So if you want to optimize the videos on your website for LLMs, you either need to provide the transcript as plain HTML text next to the video (like you might add to a supporting blog post), or you need to use Wistia LLM embeds.

This is especially true where YouTube is concerned. YouTube embeds use iframes, whose contents LLM crawlers don’t fetch or render, so the only hope you have of getting an LLM to understand the content of an embedded YouTube video is to put the transcript as text on the page, as in the sketch below.
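One way to do that is to pair the iframe with the transcript in the page’s own HTML, for example inside a collapsible details element. The video ID and copy here are placeholders:

  <!-- The iframe's contents are invisible to LLM crawlers; the transcript below is not -->
  <iframe src="https://www.youtube.com/embed/VIDEO_ID"
          title="How to set up your first project"
          allowfullscreen></iframe>

  <!-- Collapsed for human readers, but present in the HTML source for crawlers -->
  <details>
    <summary>Read the full transcript</summary>
    <p>Hi, I'm Sam, and in this video I'll walk you through setting up your first project...</p>
  </details>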

Can LLMs understand YouTube videos based on the transcripts?

What about YouTube videos on YouTube? If explicitly instructed to do so, most LLMs can find a YouTube video’s transcript from its public URL and then read and comprehend its contents.

What they can’t do is process YouTube transcripts in their general training sets in the way they process the rest of the wider publicly available data on the web.

YouTube’s terms of service prohibit bulk scraping or reusing content at scale, which means this is unlikely to change in the future. So simply having your videos available on YouTube does not mean the content will be used to inform ChatGPT or Claude’s understanding.

So if you’re uploading videos directly to YouTube, the most important thing to do is ensure you’re uploading an accurate transcript, in as many languages as are relevant to your audience. This won’t enable LLMs to use your video within their general training sets, but it will mean they can find and reference your video, and comprehend its contents when prompted to do so.

Is this still true with YouTube videos, given that Google owns Gemini and YouTube?

A common assumption is that because Google is the parent company of both YouTube and Gemini, the two must be seamlessly integrated. I’m sorry to say this is not the case.

Gemini may have access to a database of YouTube videos with some additional metadata that ChatGPT does not, but the advantages end there.

It may be the case that a future version of Gemini actively queries the YouTube database and algorithm with any given input, but this presupposes that such functionality would actually produce better conversations and results for Gemini, which is far from guaranteed, given that YouTube is filled with creator-generated content and has no minimum quality threshold.

As things stand, Gemini functions similarly to other LLMs, relying on citations and references throughout the web to determine which videos are likely to be the most relevant for any given query.

The videos that rank in the top spots for any given query will not necessarily correlate with the videos Gemini chooses to highlight under the same instruction.

Will this change? And if so, when?

Of course, it’s reasonable to assume LLMs will, in the not-too-distant future, have the capabilities and raw processing power to read and comprehend video content in a way that’s more in line with human comprehension.

However, this assumes that such functionality would be a valuable use of the limited processing power available, even when that capacity is greatly increased. We might find that getting LLMs to parse encapsulated media files results in only a very minor increase in functional understanding of the videos, to the point that it’s simply not worth it.

Ultimately, the value users gain from more detailed comprehension of videos will determine the trajectory of the development of these tools and their crawlers.

My best guess? Writing in 2026, I think we’re about two years away from LLMs being able to parse and understand JavaScript in the (still fairly limited) way Googlebot can.

When the two-year point hits, Wistia Standard Embeds that use JSON-LD to deliver titles, descriptions, and transcript information within Schema.org mark-up should also start to work for LLMs.
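For reference, that markup looks roughly like the sketch below: a generic Schema.org VideoObject with placeholder values, not Wistia’s exact output. Because the embed’s JavaScript is what places it on the page, a crawler needs to execute that script before it can read it.

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "VideoObject",
    "name": "How to set up your first project",
    "description": "A three-minute walkthrough of creating a project, inviting teammates, and publishing it.",
    "uploadDate": "2025-06-01",
    "thumbnailUrl": "https://example.com/thumbnails/abc123.jpg",
    "contentUrl": "https://example.com/videos/abc123.mp4",
    "transcript": "Hi, I'm Sam, and in this video I'll walk you through setting up your first project..."
  }
  </script>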

We’re probably a further three years away from rendering and parsing of encapsulated video files being the norm. This will be the point at which video optimization moves from a practice of optimizing metadata to optimizing the video file itself, and will mark a further shift in SEO best practice.

The way we optimize video for discovery will look very different in five years. But until LLMs can process video files directly, the rule is simple: if it’s not readable as text, it’s invisible to AI.

January 23, 2026
