Multimodal

Pureframe indexes videos across two modalities: visual content and spoken language. This means a single search query can match a moment because of what someone is showing, what someone is saying, or both.

Two modalities

Visual

Every frame in your video is analyzed and encoded into a high-dimensional vector that captures its visual meaning. This enables searches like:

  • "whiteboard with a diagram"
  • "person in a blue shirt"
  • "outdoor scene at sunset"
  • A reference image of a product you want to find in footage

Visual search works even when no one is speaking and there are no text overlays.
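
To make vector-based visual search concrete, here is a minimal Python sketch that ranks frames by cosine similarity to a query vector. The three-dimensional toy vectors and the `cosine_similarity` and `rank_frames` helpers are hypothetical, and cosine similarity is itself an assumption: this page does not describe Pureframe's actual encoder or scoring function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_frames(query_vec, frame_vecs):
    """Return (timestamp, score) pairs sorted from most to least similar."""
    scored = [(ts, cosine_similarity(query_vec, vec)) for ts, vec in frame_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data (assumption): timestamps in seconds mapped to made-up embeddings.
frames = {
    0.0: [1.0, 0.0, 0.0],
    1.0: [0.0, 1.0, 0.0],
    2.0: [0.9, 0.1, 0.0],
}
query = [1.0, 0.0, 0.0]
ranking = rank_frames(query, frames)
```

In this toy example the frame at 0.0 s ranks first because its vector points in the same direction as the query, and the frame at 1.0 s ranks last because it is orthogonal to it.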

Transcript

Spoken audio is transcribed and each speech segment is encoded separately. This enables searches like:

  • "pricing discussion"
  • "customer mentions churn"
  • "next steps for the project"

Transcript search works even when the visual content is static or uninformative (e.g., a talking-head interview in a plain room).

How they combine

By default, both modalities are active (modes=["video", "transcript"]). Pureframe runs both searches and merges the results, so a segment can rank highly because it matches visually, because of what was said, or both.
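
One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below in Python. This illustrates the general idea only: this page does not say which merge strategy Pureframe uses, and the function name, the segment IDs, and the constant `k=60` are all assumptions.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of segment IDs; each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in ranked_lists:
        for rank, segment_id in enumerate(ranking, start=1):
            scores[segment_id] = scores.get(segment_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda seg: scores[seg], reverse=True)

# Hypothetical per-modality results, best match first.
visual_hits = ["seg_a", "seg_b", "seg_c"]
transcript_hits = ["seg_b", "seg_d", "seg_a"]

merged = reciprocal_rank_fusion([visual_hits, transcript_hits])
```

Segments that appear in both lists (`seg_a` and `seg_b` here) accumulate score from each, so they rise above segments matched by only one modality.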

You can restrict to a single modality:

# Search only spoken content
curl -X POST https://api.pureframe.ai/v1/search \
  -H "Authorization: Bearer pf_key_..." \
  -F "query=pricing objection" \
  -F "modes=transcript"

# Search only visual content
curl -X POST https://api.pureframe.ai/v1/search \
  -H "Authorization: Bearer pf_key_..." \
  -F "query=person pointing at screen" \
  -F "modes=video"

You can also provide a reference image instead of a text query. Pureframe finds frames that are visually similar to the image you provide. This is useful when you have an example of what you’re looking for but no words to describe it.

curl -X POST https://api.pureframe.ai/v1/search \
  -H "Authorization: Bearer pf_key_..." \
  -F "image=@reference.jpg"

Image search operates on the visual modality only.

Segment types

Each search result segment includes a type field indicating which modality produced the match:

Value         Matched via
visual        Visual frame embedding
transcript    Speech transcription

This lets you show users exactly why a moment was surfaced.
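
As a rough illustration, a client might use the type field to group results and label each moment for display. The sketch below assumes each segment is a dictionary. Only the `type` key and its two values come from this page; the `start` and `end` keys, the label text, and the helper names are hypothetical.

```python
from collections import defaultdict

# Display labels keyed by the documented type values.
LABELS = {
    "visual": "Matched on what is shown",
    "transcript": "Matched on what is said",
}

def group_by_type(segments):
    """Bucket result segments by the modality that produced the match."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg["type"]].append(seg)
    return dict(grouped)

def explain(segment):
    """Build a one-line, user-facing reason for a single segment."""
    label = LABELS.get(segment["type"], "Matched (unknown modality)")
    return f"{segment['start']:.1f}s-{segment['end']:.1f}s: {label}"

# Hypothetical response data; "start" and "end" are assumed field names.
sample = [
    {"type": "visual", "start": 12.0, "end": 15.5},
    {"type": "transcript", "start": 40.0, "end": 47.2},
    {"type": "visual", "start": 90.0, "end": 93.0},
]
grouped = group_by_type(sample)
reasons = [explain(seg) for seg in sample]
```

Falling back to a generic label for unrecognized type values keeps the display working if further values are ever returned.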