Twelve Labs Model Evaluation

This project evaluates Twelve Labs video foundation models (Pegasus and Marengo 2.7/3.0) using AWS Bedrock. It provides a comprehensive framework for generating rich video metadata, performing semantic retrieval, and executing multi-tiered model evaluations.

Response Format (Pegasus)

The model returns structured JSON with clips and metadata:

{
  "clips": [
    {
      "start_time": 10.5,
      "end_time": 15.2,
      "description": "...",
      "dialogue": "Winter is coming",
      "objects": ["coffee cup", "notebook"],
      "settings": "Office",
      "location": "Central Perk",
      "characters": "Jim and Dwight",
      "actors": "...",
      "actions": "...",
      "narrative": "...",
      "mood": "...",
      "emotion": "...",
      "shot_type": "..."
    }
  ]
}

Prerequisites

  1. Python 3.10+
  2. AWS Credentials configured (via ~/.aws/credentials or environment variables)
  3. AWS Permissions:
    • Access to your S3 bucket (configured via MODEL_EVAL_S3_BUCKET environment variable)
    • Access to AWS Bedrock Runtime
    • Permission to invoke: twelvelabs.pegasus-1-2-v1:0
    • Permission to invoke: twelvelabs.marengo-embed-2-7-v1:0 (and inference profile)
    • Permission to invoke: twelvelabs.marengo-embed-3-0-v1:0 (and inference profile)
    • Permission to invoke: anthropic.claude-3-7-sonnet-20250219-v1:0
    • Permission to invoke: anthropic.claude-opus-4-20250514-v1:0
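
Collectively, that maps to an IAM identity policy along these lines (a minimal sketch; the region, bucket name, and action set are assumptions to adapt to your account):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }
  ]
}

Async embedding jobs and Bedrock evaluation jobs may require additional bedrock actions; consult the Bedrock IAM reference for your setup.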

Installation

  1. Install dependencies:
pip install boto3 yt-dlp

Configuration

Before running any scripts, configure your S3 bucket name using environment variables. See .env.example for a template.

Required Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_EVAL_S3_BUCKET | S3 bucket for evaluation data | model-evaluation-dataset |
| TEST_S3_BUCKET | S3 bucket for test scripts (optional) | Falls back to MODEL_EVAL_S3_BUCKET |
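
For reference, the fallback behavior can be expressed in a few lines (a sketch; each script resolves its own configuration):

import os

# MODEL_EVAL_S3_BUCKET falls back to the documented default;
# TEST_S3_BUCKET falls back to MODEL_EVAL_S3_BUCKET
MODEL_EVAL_S3_BUCKET = os.environ.get("MODEL_EVAL_S3_BUCKET", "model-evaluation-dataset")
TEST_S3_BUCKET = os.environ.get("TEST_S3_BUCKET", MODEL_EVAL_S3_BUCKET)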

Setting Environment Variables

Linux / macOS:

export MODEL_EVAL_S3_BUCKET=your-bucket-name

Windows (PowerShell):

$env:MODEL_EVAL_S3_BUCKET = "your-bucket-name"

Windows (Command Prompt):

set MODEL_EVAL_S3_BUCKET=your-bucket-name

Using .env file:

cp .env.example .env
# Edit .env and set your bucket name

Bucket Naming Rules

S3 bucket names must follow AWS naming rules:

  • 3-63 characters long
  • Only lowercase letters, numbers, hyphens, and periods
  • Cannot start or end with hyphen or period
  • Cannot be formatted as an IP address
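
As a quick sanity check, these rules can be encoded in a short validator (illustrative only, not part of the repository):

import re

def is_valid_bucket_name(name: str) -> bool:
    # 3-63 chars; lowercase letters, digits, hyphens, periods;
    # must start and end with a letter or digit
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        return False
    # Must not be formatted like an IP address
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True

print(is_valid_bucket_name("my-evaluation-bucket"))  # True
print(is_valid_bucket_name("192.168.0.1"))           # False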

Dataset File Configuration

Warning

The evaluation_dataset.jsonl files in pegasus-eval/ and marengo-eval/ contain placeholder values that MUST be replaced before running any scripts.

Placeholders to replace:

| Placeholder | Replace With | Example |
| --- | --- | --- |
| {YOUR_BUCKET_NAME} | Your S3 bucket name | my-evaluation-bucket |
| {YOUR_ACCOUNT_ID} | Your 12-digit AWS account ID | 123456789012 |

Example transformation:

Before:

{"mediaSource": {"s3Location": {"uri": "s3://{YOUR_BUCKET_NAME}/videos/video.mp4", "bucketOwner": "{YOUR_ACCOUNT_ID}"}}}

After:

{"mediaSource": {"s3Location": {"uri": "s3://my-evaluation-bucket/videos/video.mp4", "bucketOwner": "123456789012"}}}

You can use sed to replace placeholders in bulk:

# Linux/macOS
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' pegasus-eval/evaluation_dataset.jsonl
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' marengo-eval/evaluation_dataset.jsonl

AWS Environment Setup

Note

Replace {YOUR_BUCKET_NAME} with your own unique bucket name throughout these examples.

1. Create S3 Bucket

aws s3api create-bucket --bucket {YOUR_BUCKET_NAME} --region us-east-1 --profile sandbox

2. Tag S3 Bucket

aws s3api put-bucket-tagging --bucket {YOUR_BUCKET_NAME} --tagging 'TagSet=[{Key=Name,Value={YOUR_BUCKET_NAME}},{Key=Owner,Value="Your Name"},{Key=Project,Value=model-evaluation}]' --profile sandbox

3. Configure CORS

Apply the CORS policy to allow cross-origin requests (needed for some Bedrock features):

aws s3api put-bucket-cors --bucket {YOUR_BUCKET_NAME} --cors-configuration file://cors-policy.json --profile sandbox
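
The command above expects a cors-policy.json file; if you need to create one, a permissive example looks like this (an assumed starting point; tighten AllowedOrigins before production use):

{
  "CORSRules": [
    {
      "AllowedHeaders": ["*"],
      "AllowedMethods": ["GET", "PUT", "HEAD"],
      "AllowedOrigins": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}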

Video Content Preparation

1. Download YouTube Videos

Ensure videos are within Bedrock limits (<2GB and <1 hour).

# Download specific videos
yt-dlp "https://www.youtube.com/watch?v=hB1cIRfpjyU" -o "video_hB1cIRfpjyU.mp4"
yt-dlp "https://www.youtube.com/watch?v=_O6yTfjPfmU" -o "video__O6yTfjPfmU.mp4"

2. Upload to S3

Upload the videos to the videos/ prefix in your bucket.

aws s3 cp video_hB1cIRfpjyU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_hB1cIRfpjyU.mp4 --profile sandbox
aws s3 cp video__O6yTfjPfmU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video__O6yTfjPfmU.mp4 --profile sandbox
aws s3 cp video_dfuPBC-v5NE.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_dfuPBC-v5NE.mp4 --profile sandbox

3. Upload Dataset

Upload your initial dataset file to S3.

aws s3 cp evaluation_dataset.jsonl s3://{YOUR_BUCKET_NAME}/evaluate/evaluation_dataset.jsonl --profile sandbox

Usage (Pegasus Evaluation Pipeline)

For best results, use the dedicated scripts in the pegasus-eval/ directory.

Step 1: Prepare Dataset

Convert your input JSONL into the specific format Pegasus requires.

python pegasus-eval/prepare-dataset.py --input pegasus-eval/evaluation_dataset.jsonl

Step 2: Run Inference & Upload

Execute Pegasus model inference and auto-upload results to S3 for judging. Supports long videos (up to 1 hour) with optimized timeouts.

python pegasus-eval/run-manual-evaluation.py
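
For orientation, a single Pegasus call reduces to something like the following (a sketch: the request body mirrors the dataset format described below, and the extended read timeout is an assumption about how the script accommodates long videos):

import json

import boto3
from botocore.config import Config

# Long videos can take many minutes, so extend the default read timeout
client = boto3.client(
    "bedrock-runtime",
    config=Config(read_timeout=3600, retries={"max_attempts": 1}),
)

body = {
    "inputPrompt": "Describe each clip in this video.",
    "mediaSource": {"s3Location": {
        "uri": "s3://my-evaluation-bucket/videos/video.mp4",
        "bucketOwner": "123456789012",
    }},
}
response = client.invoke_model(
    modelId="twelvelabs.pegasus-1-2-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read()))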

Step 3: Trigger Bedrock Model-as-Judge

Create an automated evaluation job in AWS Bedrock using Claude 3.7 Sonnet as the judge.

python pegasus-eval/create-judge-job.py
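
Internally this corresponds to Bedrock's CreateEvaluationJob API with an LLM-as-judge configuration. A hedged sketch (job name, role ARN, metric names, and S3 paths are placeholders; the exact configuration lives in the script):

import boto3

bedrock = boto3.client("bedrock")
bedrock.create_evaluation_job(
    jobName="pegasus-judge-job",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "pegasus-inference",
                    "datasetLocation": {"s3Uri": "s3://my-evaluation-bucket/manual/inference.jsonl"},
                },
                "metricNames": ["Builtin.Correctness", "Builtin.Completeness"],
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [{"modelIdentifier": "anthropic.claude-3-7-sonnet-20250219-v1:0"}],
            },
        }
    },
    # "Bring your own inference": judge the pre-computed Pegasus outputs
    inferenceConfig={"models": [{"precomputedInferenceSource": {"inferenceSourceIdentifier": "pegasus"}}]},
    outputDataConfig={"s3Uri": "s3://my-evaluation-bucket/judge-jobs/"},
)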

Step 4: Download Results

Retrieve and summarize the judging results once the job is complete.

python pegasus-eval/get-results.py --job-name your-job-name

Step 5: Estimate Costs

The cost calculator now supports aggregating multiple jobs and processing local files.

# Calculate combined cost for Pegasus (Inference + Judge)
python pegasus-eval/calculate-job-cost.py \
  --uri s3://bucket/manual/inference.jsonl s3://bucket/judge-jobs/my-job/ \
  --total-video-minutes 120 \
  --output-json pegasus-eval/results/cost_summary.json

Unified Evaluation Reporting

After running both the Pegasus and Marengo evaluation pipelines, you can generate a single, comprehensive report that combines both models' results.

Step 1: Generate Report

This script uses Claude Opus 4.5 to analyze all technical metrics and draft a professional executive summary.

python generate-report.py

  • Inputs: Reads from pegasus-eval/results/ and marengo-eval/
  • Output: Generates evaluation-report-[TIMESTAMP].md in the project root.

Features

  • Cost & Latency Tracking: The report now includes a detailed breakdown of estimated costs (Pegasus, Marengo, Claude) and latency statistics for every stage of the pipeline.

Usage (Marengo Evaluation Pipeline)

Marengo is an embedding model. Evaluation follows a 3-step workflow in the marengo-eval/ directory.

Step 1: Generate Video Embeddings

Perform async inference to convert video segments into vector embeddings.

python marengo-eval/run-inference.py
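
A single embedding job corresponds to an async Bedrock invocation, roughly as follows (the modelInput schema is an assumption based on the dataset format, and paths are placeholders):

import boto3

runtime = boto3.client("bedrock-runtime")
job = runtime.start_async_invoke(
    modelId="twelvelabs.marengo-embed-2-7-v1:0",
    modelInput={
        "inputType": "video",
        "mediaSource": {"s3Location": {
            "uri": "s3://my-evaluation-bucket/videos/video.mp4",
            "bucketOwner": "123456789012",
        }},
    },
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-evaluation-bucket/embeddings/"}},
)
# Poll with get_async_invoke(invocationArn=...) until the status is Completed
print(job["invocationArn"])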

Step 2: Run Semantic Retrieval

Convert text queries to vectors and perform cosine similarity search against video segments.

# Update inference_output in run-retrieval.py first
python marengo-eval/run-retrieval.py
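
The retrieval itself reduces to cosine similarity between the query embedding and every segment embedding; a minimal sketch with illustrative names:

import numpy as np

def top_k(query_vec: np.ndarray, segment_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_vec / np.linalg.norm(query_vec)
    s = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    scores = s @ q
    # Indices of the k most similar segments, best first
    return np.argsort(scores)[::-1][:k]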

Step 3: Evaluate Retrieval Quality

A tiered evaluation using standard math metrics and a Pegasus-enhanced LLM Judge (Claude Opus 4.5).

# Update retrieval_results in run-evaluation.py first
python marengo-eval/run-evaluation.py
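
For orientation, here are minimal reference implementations of two standard retrieval metrics of the kind this step computes (binary relevance assumed; the script's exact metric set may differ):

import math

def reciprocal_rank(ranked_ids, relevant_ids):
    # MRR component: 1 / rank of the first relevant segment (0 if none retrieved)
    for i, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant_ids:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    # Binary-relevance nDCG@k: discounted gain over the ideal ordering
    dcg = sum(1.0 / math.log2(i + 1)
              for i, seg_id in enumerate(ranked_ids[:k], start=1)
              if seg_id in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0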

Marengo Version Comparison (2.7 vs 3.0)

The evaluation framework supports both Marengo 2.7 and 3.0, controlled via the USE_MARENGO_3_0 environment variable.

Version Selection

# Use Marengo 2.7 (default)
python marengo-eval/run-inference.py

# Use Marengo 3.0
export USE_MARENGO_3_0=true
python marengo-eval/run-inference.py

Key Differences

API Structure

  • Marengo 2.7: Uses a simpler API structure with visual-text and audio embeddings
  • Marengo 3.0: Uses a nested structure with separate visual, audio, and optional transcription embeddings

Features

Marengo 3.0 introduces:

  • Dynamic Segmentation: Automatically segments videos based on content changes
  • Transcription Embeddings: Additional embedding type for speech-to-text analysis
  • Multiple Embedding Scopes: Both clip-level and asset-level embeddings
  • Configurable Segmentation: Choose between fixed or dynamic duration methods

Performance Trade-offs

| Aspect | Marengo 2.7 | Marengo 3.0 |
| --- | --- | --- |
| Processing Speed | Faster | Slower (more comprehensive analysis) |
| Embedding Quality | Good | Better (higher mAP, nDCG scores) |
| Text Embedding Latency | ~0.15s | ~0.12s |
| Model Architecture | Simpler | More complex, larger |
| Use Case | Quick processing needs | Quality-critical applications |

Why Marengo 3.0 is Slower

  1. Quality Over Speed: Marengo 3.0 prioritizes embedding quality and accuracy over processing speed
  2. More Comprehensive Analysis: Performs more sophisticated video understanding
  3. Larger Model: More complex architecture requires additional computation time
  4. Enhanced Features: Even with basic options, it performs more under-the-hood processing

Optimization Tips for Marengo 3.0

  • Use specific embedding options instead of all three (visual, audio, transcription)
  • Configure fixed segmentation if dynamic segmentation isn't needed
  • Process shorter video clips when possible
  • Consider the quality vs speed trade-off for your specific use case
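
As an illustration of the first two tips, a restricted request body might look like this (the field names below are hypothetical, not the confirmed 3.0 schema; check the Bedrock model reference for the exact keys):

# Hypothetical Marengo 3.0 request body illustrating the first two tips;
# the actual field names may differ from the Bedrock schema.
model_input = {
    "inputType": "video",
    "mediaSource": {"s3Location": {
        "uri": "s3://my-evaluation-bucket/videos/video.mp4",
        "bucketOwner": "123456789012",
    }},
    "embeddingTypes": ["visual"],                             # visual only, skip audio/transcription
    "segmentation": {"method": "fixed", "durationSec": 10},   # avoid dynamic segmentation
}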

Complete Evaluation Workflow for Both Versions

# 1. Evaluate Marengo 2.7
cd marengo-eval
python run-inference.py
python run-retrieval.py  # Update inference_output path
python run-evaluation.py # Update retrieval_results path

# 2. Evaluate Marengo 3.0
export USE_MARENGO_3_0=true
python run-inference.py
python run-retrieval.py  # Update inference_output path
python run-evaluation.py # Update retrieval_results path

# 3. Generate combined report
cd ..
python generate-report.py

The unified report will include side-by-side comparisons of both Marengo versions, showing performance metrics, latency statistics, and quality scores.

Dataset Format (Input)

The source evaluation_dataset.jsonl should contain:

  • inputPrompt: The question for Pegasus or the search query for Marengo.
  • mediaSource.s3Location.uri: S3 URI of the video file.
  • responseFormat: (Optional) Bedrock JSON schema for Pegasus.
  • referenceResponse: (Optional) Ground truth for the judge (LLM evaluation).
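
Put together, one illustrative record (the reference response here is invented for the example):

{"inputPrompt": "What happens in this video?", "mediaSource": {"s3Location": {"uri": "s3://{YOUR_BUCKET_NAME}/videos/video.mp4", "bucketOwner": "{YOUR_ACCOUNT_ID}"}}, "referenceResponse": "A barista prepares coffee while chatting with a customer."}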

Troubleshooting

General Issues

  • IAM Role: Ensure the role used in CONFIG has a trust relationship with bedrock.amazonaws.com.
  • S3 Access: Verify S3 bucket policies allow Bedrock service access.
  • Model Access: Ensure Twelve Labs Pegasus, Marengo 2.7, Marengo 3.0, Claude 3.7, and Claude Opus 4.5 are enabled in your region.

AWS Bedrock 15-Minute Timeout Limit

AWS Bedrock has a hard 15-minute timeout limit for async invocations that cannot be increased. This affects both Pegasus and Marengo when processing long or complex videos.

Symptoms:

  • Jobs fail exactly at 15 minutes with error: "Something went wrong. Please try again later."
  • Affects videos regardless of file size: both large and small videos can fail if processing takes too long

Solutions for Long Videos:

  1. Video Preprocessing (Recommended)

    • Split videos into 10-minute segments before processing
    • Use ffmpeg to segment without re-encoding:
    ffmpeg -i input.mp4 -c copy -segment_time 600 -f segment output_%03d.mp4
  2. Reduce Embedding Options (Marengo)

    • Process only visual or only audio embeddings instead of both
    • This can reduce processing time by up to 50%
  3. Alternative Approaches

    • Use AWS Batch or Step Functions to orchestrate multiple smaller jobs
    • Contact AWS Support (though this limit typically cannot be increased)
    • Consider using Twelve Labs' direct API if available

Note: Video duration and content complexity affect processing time more than file size. A small video with complex scenes may still exceed the timeout.
