Twelve Labs Model Evaluation

This project evaluates Twelve Labs video foundation models (Pegasus and Marengo 2.7/3.0) using AWS Bedrock. It provides a comprehensive framework for generating rich video metadata, performing semantic retrieval, and executing multi-tiered model evaluations.

Response Format (Pegasus)

The model returns structured JSON with clips and metadata:

{
  "clips": [
    {
      "start_time": 10.5,
      "end_time": 15.2,
      "description": "...",
      "dialogue": "Winter is coming",
      "objects": ["coffee cup", "notebook"],
      "settings": "Office",
      "location": "Central Perk",
      "characters": "Jim and Dwight",
      "actors": "...",
      "actions": "...",
      "narrative": "...",
      "mood": "...",
      "emotion": "...",
      "shot_type": "..."
    }
  ]
}

Prerequisites

  1. Python 3.10+
  2. AWS Credentials configured (via ~/.aws/credentials or environment variables)
  3. AWS Permissions:
    • Access to your S3 bucket (configured via MODEL_EVAL_S3_BUCKET environment variable)
    • Access to AWS Bedrock Runtime
    • Permission to invoke: twelvelabs.pegasus-1-2-v1:0
    • Permission to invoke: twelvelabs.marengo-embed-2-7-v1:0 (and inference profile)
    • Permission to invoke: twelvelabs.marengo-embed-3-0-v1:0 (and inference profile)
    • Permission to invoke: anthropic.claude-3-7-sonnet-20250219-v1:0
    • Permission to invoke: anthropic.claude-opus-4-20250514-v1:0
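
Collectively, that maps to an IAM identity policy along these lines (a minimal sketch; the region, bucket name, and action set are assumptions to adapt to your account):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }
  ]
}

Async embedding jobs and Bedrock evaluation jobs may require additional bedrock actions; consult the Bedrock IAM reference for your setup.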

Installation

  1. Install dependencies:
pip install boto3 yt-dlp

Configuration

Before running any scripts, configure your S3 bucket name using environment variables. See .env.example for a template.

Required Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_EVAL_S3_BUCKET | S3 bucket for evaluation data | model-evaluation-dataset |
| TEST_S3_BUCKET | S3 bucket for test scripts (optional) | Falls back to MODEL_EVAL_S3_BUCKET |
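
For reference, the fallback behavior can be expressed in a few lines (a sketch; each script resolves its own configuration):

import os

# MODEL_EVAL_S3_BUCKET falls back to the documented default;
# TEST_S3_BUCKET falls back to MODEL_EVAL_S3_BUCKET
MODEL_EVAL_S3_BUCKET = os.environ.get("MODEL_EVAL_S3_BUCKET", "model-evaluation-dataset")
TEST_S3_BUCKET = os.environ.get("TEST_S3_BUCKET", MODEL_EVAL_S3_BUCKET)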

Setting Environment Variables

Linux / macOS:

export MODEL_EVAL_S3_BUCKET=your-bucket-name

Windows (PowerShell):

$env:MODEL_EVAL_S3_BUCKET = "your-bucket-name"

Windows (Command Prompt):

set MODEL_EVAL_S3_BUCKET=your-bucket-name

Using .env file:

cp .env.example .env
# Edit .env and set your bucket name

Bucket Naming Rules

S3 bucket names must follow AWS naming rules:

  • 3-63 characters long
  • Only lowercase letters, numbers, hyphens, and periods
  • Cannot start or end with hyphen or period
  • Cannot be formatted as an IP address
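
As a quick sanity check, these rules can be encoded in a short validator (illustrative only, not part of the repository):

import re

def is_valid_bucket_name(name: str) -> bool:
    # 3-63 chars; lowercase letters, digits, hyphens, periods;
    # must start and end with a letter or digit
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        return False
    # Must not be formatted like an IP address
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True

print(is_valid_bucket_name("my-evaluation-bucket"))  # True
print(is_valid_bucket_name("192.168.0.1"))           # False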

Dataset File Configuration

Warning

The evaluation_dataset.jsonl files in pegasus-eval/ and marengo-eval/ contain placeholder values that MUST be replaced before running any scripts.

Placeholders to replace:

| Placeholder | Replace With | Example |
| --- | --- | --- |
| {YOUR_BUCKET_NAME} | Your S3 bucket name | my-evaluation-bucket |
| {YOUR_ACCOUNT_ID} | Your 12-digit AWS account ID | 123456789012 |

Example transformation:

Before:

{"mediaSource": {"s3Location": {"uri": "s3://{YOUR_BUCKET_NAME}/videos/video.mp4", "bucketOwner": "{YOUR_ACCOUNT_ID}"}}}

After:

{"mediaSource": {"s3Location": {"uri": "s3://my-evaluation-bucket/videos/video.mp4", "bucketOwner": "123456789012"}}}

You can use sed to replace placeholders in bulk:

# Linux/macOS
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' pegasus-eval/evaluation_dataset.jsonl
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' marengo-eval/evaluation_dataset.jsonl

AWS Environment Setup

Note

Replace {YOUR_BUCKET_NAME} with your own unique bucket name throughout these examples.

1. Create S3 Bucket

aws s3api create-bucket --bucket {YOUR_BUCKET_NAME} --region us-east-1 --profile sandbox

2. Tag S3 Bucket

aws s3api put-bucket-tagging --bucket {YOUR_BUCKET_NAME} --tagging 'TagSet=[{Key=Name,Value={YOUR_BUCKET_NAME}},{Key=Owner,Value="Your Name"},{Key=Project,Value=model-evaluation}]' --profile sandbox

3. Configure CORS

Apply the CORS policy to allow cross-origin requests (needed for some Bedrock features):

aws s3api put-bucket-cors --bucket {YOUR_BUCKET_NAME} --cors-configuration file://cors-policy.json --profile sandbox
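
The command above expects a cors-policy.json file; if you need to create one, a permissive example looks like this (an assumed starting point; tighten AllowedOrigins before production use):

{
  "CORSRules": [
    {
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT", "HEAD"],
      "AllowedOrigins": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}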

Video Content Preparation

1. Download YouTube Videos

Ensure videos are within Bedrock limits (<2GB and <1 hour).

# Download specific videos
yt-dlp "https://www.youtube.com/watch?v=hB1cIRfpjyU" -o "video_hB1cIRfpjyU.mp4"
yt-dlp "https://www.youtube.com/watch?v=_O6yTfjPfmU" -o "video__O6yTfjPfmU.mp4"

2. Upload to S3

Upload the videos to the videos/ prefix in your bucket.

aws s3 cp video_hB1cIRfpjyU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_hB1cIRfpjyU.mp4 --profile sandbox
aws s3 cp video__O6yTfjPfmU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video__O6yTfjPfmU.mp4 --profile sandbox
aws s3 cp video_dfuPBC-v5NE.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_dfuPBC-v5NE.mp4 --profile sandbox

3. Upload Dataset

Upload your initial dataset file to S3.

aws s3 cp evaluation_dataset.jsonl s3://{YOUR_BUCKET_NAME}/evaluate/evaluation_dataset.jsonl --profile sandbox

Usage (Pegasus Evaluation Pipeline)

For best results, use the dedicated scripts in the pegasus-eval/ directory.

Step 1: Prepare Dataset

Convert your input JSONL into the specific format Pegasus requires.

python pegasus-eval/prepare-dataset.py --input pegasus-eval/evaluation_dataset.jsonl

Step 2: Run Inference & Upload

Execute Pegasus model inference and auto-upload results to S3 for judging. Supports long videos (up to 1 hour) with optimized timeouts.

python pegasus-eval/run-manual-evaluation.py
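
For orientation, a single Pegasus call reduces to something like the following (a sketch: the request body mirrors the dataset format described below, and the extended read timeout is an assumption about how the script accommodates long videos):

import json

import boto3
from botocore.config import Config

# Long videos can take many minutes, so extend the default read timeout
client = boto3.client(
    "bedrock-runtime",
    config=Config(read_timeout=3600, retries={"max_attempts": 1}),
)

body = {
    "inputPrompt": "Describe each clip in this video.",
    "mediaSource": {"s3Location": {
        "uri": "s3://my-evaluation-bucket/videos/video.mp4",
        "bucketOwner": "123456789012",
    }},
}
response = client.invoke_model(
    modelId="twelvelabs.pegasus-1-2-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))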

Step 3: Trigger Bedrock Model-as-Judge

Create an automated evaluation job in AWS Bedrock using Claude 3.7 Sonnet as the judge.

python pegasus-eval/create-judge-job.py
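
Internally this corresponds to Bedrock's CreateEvaluationJob API with an LLM-as-judge configuration. A hedged sketch (job name, role ARN, metric names, and S3 paths are placeholders; the exact configuration lives in the script):

import boto3

bedrock = boto3.client("bedrock")
bedrock.create_evaluation_job(
    jobName="pegasus-judge-job",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "pegasus-inference",
                    "datasetLocation": {"s3Uri": "s3://my-evaluation-bucket/manual/inference.jsonl"},
                },
                "metricNames": ["Builtin.Correctness", "Builtin.Completeness"],
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{"modelIdentifier": "anthropic.claude-3-7-sonnet-20250219-v1:0"}],
            },
        }
    },
    # "Bring your own inference": judge the pre-computed Pegasus outputs
    inferenceConfig={"models": [{"precomputedInferenceSource": {"inferenceSourceIdentifier": "pegasus"}}]},
    outputDataConfig={"s3Uri": "s3://my-evaluation-bucket/judge-jobs/"},
)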

Step 4: Download Results

Retrieve and summarize the judging results once the job is complete.

python pegasus-eval/get-results.py --job-name your-job-name

Step 5: Estimate Costs

The cost calculator now supports aggregating multiple jobs and processing local files.

# Calculate combined cost for Pegasus (Inference + Judge)
python pegasus-eval/calculate-job-cost.py \
  --uri s3://bucket/manual/inference.jsonl s3://bucket/judge-jobs/my-job/ \
  --total-video-minutes 120 \
  --output-json pegasus-eval/results/cost_summary.json

Unified Evaluation Reporting

After running both the Pegasus and Marengo evaluation pipelines, you can generate a single, comprehensive report that combines both models' results.

Step 1: Generate Report

This script uses Claude Opus 4.5 to analyze all technical metrics and draft a professional executive summary.

python generate-report.py

  • Inputs: Reads from pegasus-eval/results/ and marengo-eval/
  • Output: Generates evaluation-report-[TIMESTAMP].md in the project root.

Features

  • Cost & Latency Tracking: The report now includes a detailed breakdown of estimated costs (Pegasus, Marengo, Claude) and latency statistics for every stage of the pipeline.

Usage (Marengo Evaluation Pipeline)

Marengo is an embedding model. Evaluation follows a 3-step workflow in the marengo-eval/ directory.

Step 1: Generate Video Embeddings

Perform async inference to convert video segments into vector embeddings.

python marengo-eval/run-inference.py
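
A single embedding job corresponds to an async Bedrock invocation, roughly as follows (the modelInput schema is an assumption based on the dataset format, and paths are placeholders):

import boto3

runtime = boto3.client("bedrock-runtime")
job = runtime.start_async_invoke(
    modelId="twelvelabs.marengo-embed-2-7-v1:0",
    modelInput={
        "inputType": "video",
        "mediaSource": {"s3Location": {
            "uri": "s3://my-evaluation-bucket/videos/video.mp4",
            "bucketOwner": "123456789012",
        }},
    },
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-evaluation-bucket/embeddings/"}},
)
# Poll with get_async_invoke(invocationArn=...) until the status is Completed
print(job["invocationArn"])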

Step 2: Run Semantic Retrieval

Convert text queries to vectors and perform cosine similarity search against video segments.

# Update inference_output in run-retrieval.py first
python marengo-eval/run-retrieval.py
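
The retrieval itself reduces to cosine similarity between the query embedding and every segment embedding; a minimal sketch with illustrative names:

import numpy as np

def top_k(query_vec: np.ndarray, segment_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    s = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    scores = s @ q
    # Indices of the k most similar segments, best first
    return np.argsort(scores)[::-1][:k]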

Step 3: Evaluate Retrieval Quality

A tiered evaluation using standard math metrics and a Pegasus-enhanced LLM Judge (Claude Opus 4.5).

# Update retrieval_results in run-evaluation.py first
python marengo-eval/run-evaluation.py
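
For orientation, here are minimal reference implementations of two standard retrieval metrics of the kind this step computes (binary relevance assumed; the script's exact metric set may differ):

import math

def reciprocal_rank(ranked_ids, relevant_ids):
    # MRR component: 1 / rank of the first relevant segment (0 if none retrieved)
    for i, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    # Binary-relevance nDCG@k: discounted gain over the ideal ordering
    dcg = sum(1.0 / math.log2(i + 1)
              for i, seg_id in enumerate(ranked_ids[:k], start=1)
              if seg_id in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0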

Marengo Version Comparison (2.7 vs 3.0)

The evaluation framework supports both Marengo 2.7 and 3.0, controlled via the USE_MARENGO_3_0 environment variable.

Version Selection

# Use Marengo 2.7 (default)
python marengo-eval/run-inference.py

# Use Marengo 3.0
export USE_MARENGO_3_0=true
python marengo-eval/run-inference.py

Key Differences

API Structure

  • Marengo 2.7: Uses a simpler API structure with visual-text and audio embeddings
  • Marengo 3.0: Uses a nested structure with separate visual, audio, and optional transcription embeddings

Features

Marengo 3.0 introduces:

  • Dynamic Segmentation: Automatically segments videos based on content changes
  • Transcription Embeddings: Additional embedding type for speech-to-text analysis
  • Multiple Embedding Scopes: Both clip-level and asset-level embeddings
  • Configurable Segmentation: Choose between fixed or dynamic duration methods

Performance Trade-offs

| Aspect | Marengo 2.7 | Marengo 3.0 |
| --- | --- | --- |
| Processing Speed | Faster | Slower (more comprehensive analysis) |
| Embedding Quality | Good | Better (higher mAP, nDCG scores) |
| Text Embedding Latency | ~0.15s | ~0.12s |
| Model Architecture | Simpler | More complex, larger |
| Use Case | Quick processing needs | Quality-critical applications |

Why Marengo 3.0 is Slower

  1. Quality Over Speed: Marengo 3.0 prioritizes embedding quality and accuracy over processing speed
  2. More Comprehensive Analysis: Performs more sophisticated video understanding
  3. Larger Model: More complex architecture requires additional computation time
  4. Enhanced Features: Even with basic options, it performs more under-the-hood processing

Optimization Tips for Marengo 3.0

  • Use specific embedding options instead of all three (visual, audio, transcription)
  • Configure fixed segmentation if dynamic segmentation isn't needed
  • Process shorter video clips when possible
  • Consider the quality vs speed trade-off for your specific use case
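
As an illustration of the first two tips, a restricted request body might look like this (the field names below are hypothetical, not the confirmed 3.0 schema; check the Bedrock model reference for the exact keys):

# Hypothetical Marengo 3.0 request body illustrating the first two tips;
# the actual field names may differ from the Bedrock schema.
model_input = {
    "inputType": "video",
    "mediaSource": {"s3Location": {
        "uri": "s3://my-evaluation-bucket/videos/video.mp4",
        "bucketOwner": "123456789012",
    }},
    "embeddingTypes": ["visual"],                             # visual only, skip audio/transcription
    "segmentation": {"method": "fixed", "durationSec": 10},   # avoid dynamic segmentation
}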

Complete Evaluation Workflow for Both Versions

# 1. Evaluate Marengo 2.7
cd marengo-eval
python run-inference.py
python run-retrieval.py  # Update inference_output path
python run-evaluation.py # Update retrieval_results path

# 2. Evaluate Marengo 3.0
export USE_MARENGO_3_0=true
python run-inference.py
python run-retrieval.py  # Update inference_output path
python run-evaluation.py # Update retrieval_results path

# 3. Generate combined report
cd ..
python generate-report.py

The unified report will include side-by-side comparisons of both Marengo versions, showing performance metrics, latency statistics, and quality scores.

Dataset Format (Input)

The source evaluation_dataset.jsonl should contain:

  • inputPrompt: The question for Pegasus or the search query for Marengo.
  • mediaSource.s3Location.uri: S3 URI of the video file.
  • responseFormat: (Optional) Bedrock JSON schema for Pegasus.
  • referenceResponse: (Optional) Ground truth for the judge (LLM evaluation).
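
Put together, one illustrative record (the reference response here is invented for the example):

{"inputPrompt": "What happens in this video?", "mediaSource": {"s3Location": {"uri": "s3://{YOUR_BUCKET_NAME}/videos/video.mp4", "bucketOwner": "{YOUR_ACCOUNT_ID}"}}, "referenceResponse": "A barista prepares coffee while chatting with a customer."}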

Troubleshooting

General Issues

  • IAM Role: Ensure the role used in CONFIG has a trust relationship with bedrock.amazonaws.com.
  • S3 Access: Verify S3 bucket policies allow Bedrock service access.
  • Model Access: Ensure Twelve Labs Pegasus, Marengo 2.7, Marengo 3.0, Claude 3.7, and Claude Opus 4.5 are enabled in your region.

AWS Bedrock 15-Minute Timeout Limit

AWS Bedrock has a hard 15-minute timeout limit for async invocations that cannot be increased. This affects both Pegasus and Marengo when processing long or complex videos.

Symptoms:

  • Jobs fail exactly at 15 minutes with error: "Something went wrong. Please try again later."
  • Affects videos regardless of file size: both large and small videos can fail if processing takes too long

Solutions for Long Videos:

  1. Video Preprocessing (Recommended)

    • Split videos into 10-minute segments before processing
    • Use ffmpeg to segment without re-encoding:
    ffmpeg -i input.mp4 -c copy -segment_time 600 -f segment output_%03d.mp4
  2. Reduce Embedding Options (Marengo)

    • Process only visual or only audio embeddings instead of both
    • This can reduce processing time by up to 50%
  3. Alternative Approaches

    • Use AWS Batch or Step Functions to orchestrate multiple smaller jobs
    • Contact AWS Support (though this limit typically cannot be increased)
    • Consider using Twelve Labs' direct API if available

Note: Video duration and content complexity affect processing time more than file size. A small video with complex scenes may still exceed the timeout.
