Multimodal
Pureframe indexes videos across two modalities: visual content and spoken language. This means a single search query can match a moment because of what someone is showing, what someone is saying, or both.
Two modalities
Visual
Every frame in your video is analyzed and encoded into a high-dimensional vector that captures its visual meaning. This enables searches like:
- "whiteboard with a diagram"
- "person in a blue shirt"
- "outdoor scene at sunset"
- A reference image of a product you want to find in footage
Visual search works even when no one is speaking and there are no text overlays.
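To make the idea of comparing high-dimensional vectors concrete, here is a minimal sketch of ranking frames against a query vector. It assumes cosine similarity as the comparison metric, which is a common choice but is not confirmed for Pureframe; the function names and the toy 3-dimensional vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank_frames(query_vector, frame_vectors):
    """Return (timestamp, score) pairs sorted from best to worst match.

    frame_vectors maps a frame timestamp in seconds to its vector.
    """
    scored = [
        (timestamp, cosine_similarity(query_vector, vector))
        for timestamp, vector in frame_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: real embeddings would have hundreds of dimensions.
frames = {
    0.0: [1.0, 0.0, 0.0],
    1.0: [0.0, 1.0, 0.0],
    2.0: [0.7, 0.7, 0.0],
}
query = [1.0, 0.0, 0.0]
ranked = rank_frames(query, frames)
```

Because the comparison happens between vectors, the same ranking step applies whether the query vector came from text or from a reference image.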
Transcript
Spoken audio is transcribed and each speech segment is encoded separately. This enables searches like:
- "pricing discussion"
- "customer mentions churn"
- "next steps for the project"
Transcript search works even when the visual content is static or uninformative (e.g., a talking-head interview in a plain room).
How they combine
By default, both modalities are active (modes=["video", "transcript"]). Pureframe runs both searches and merges the results, so a segment can rank highly because it matches visually, because of what was said, or both.
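The merge step can be pictured with a small sketch. How Pureframe actually combines scores is not described here, so the strategy below (keep the best score per segment and record every modality that matched) is an assumption, as are the `merge_results` name and the `(segment_id, score)` hit shape.

```python
def merge_results(video_hits, transcript_hits):
    """Combine per-modality hits into one list ranked by score.

    Each hit is a (segment_id, score) pair with a non-negative score.
    A segment found by both modalities keeps its higher score and
    lists both modalities as the reason it matched.
    """
    merged = {}
    sources = (("video", video_hits), ("transcript", transcript_hits))
    for modality, hits in sources:
        for segment_id, score in hits:
            entry = merged.setdefault(segment_id, {"score": 0.0, "matched": []})
            entry["score"] = max(entry["score"], score)
            entry["matched"].append(modality)
    return sorted(
        ((seg, info["score"], info["matched"]) for seg, info in merged.items()),
        key=lambda item: item[1],
        reverse=True,
    )


video_hits = [("seg-1", 0.82), ("seg-2", 0.40)]
transcript_hits = [("seg-2", 0.91), ("seg-3", 0.55)]
merged = merge_results(video_hits, transcript_hits)
```

Taking the maximum is only one option; a weighted sum or rank fusion would reward segments that match in both modalities more strongly.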
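You can restrict a search to a single modality by passing only one mode, for example modes=["video"] for visual matches or modes=["transcript"] for spoken matches.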
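As a minimal sketch of restricting modes, the helper below builds search parameters and validates the mode list. Only the `modes` values "video" and "transcript" come from this page; `build_search_params`, the `query` key, and the returned dictionary shape are hypothetical and do not represent a documented Pureframe client.

```python
ALLOWED_MODES = ("video", "transcript")


def build_search_params(query, modes=None):
    """Hypothetical helper: assemble search parameters for a query.

    When modes is omitted, both modalities are used, mirroring the
    default described above.
    """
    if modes is None:
        modes = list(ALLOWED_MODES)
    if not modes:
        raise ValueError("at least one mode is required")
    unknown = [mode for mode in modes if mode not in ALLOWED_MODES]
    if unknown:
        raise ValueError(f"unsupported modes: {unknown}")
    return {"query": query, "modes": list(modes)}


# Visual matches only: ignores anything that was said.
visual_only = build_search_params("whiteboard with a diagram", modes=["video"])

# Spoken matches only: ignores what is on screen.
spoken_only = build_search_params("pricing discussion", modes=["transcript"])

# Default: both modalities.
both = build_search_params("next steps for the project")
```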
Image search
You can also provide a reference image instead of a text query. Pureframe finds frames that are visually similar to the image you provide — useful when you have an example of what you’re looking for but no words to describe it.
Image search operates on the visual modality only.
Segment types
Each search result segment includes a type field indicating which modality produced the match.
This lets you show users exactly why a moment was surfaced.
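One way to use the type field in an interface is sketched below. It assumes the type values mirror the mode names ("video" and "transcript") and that each segment is a dictionary with hypothetical `start` and `end` keys; none of these details are confirmed on this page.

```python
from collections import defaultdict

# Assumed type values; adjust to whatever the API actually returns.
REASONS = {
    "video": "Matched on what is shown",
    "transcript": "Matched on what is said",
}


def explain(segment):
    """Return a short, user-facing reason for why a segment surfaced."""
    return REASONS.get(segment.get("type"), "Matched")


def group_by_type(segments):
    """Group result segments by their type field."""
    groups = defaultdict(list)
    for segment in segments:
        groups[segment.get("type", "unknown")].append(segment)
    return dict(groups)


results = [
    {"type": "video", "start": 12.0, "end": 18.5},
    {"type": "transcript", "start": 40.0, "end": 47.2},
    {"type": "video", "start": 90.0, "end": 95.0},
]
grouped = group_by_type(results)
labels = [explain(segment) for segment in results]
```

Falling back to a generic label for unrecognized types keeps the display working if additional types are introduced later.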