This project evaluates Twelve Labs video foundation models (Pegasus and Marengo 2.7/3.0) using AWS Bedrock. It provides a comprehensive framework for generating rich video metadata, performing semantic retrieval, and executing multi-tiered model evaluations.
The model returns structured JSON with clips and metadata:
```json
{
  "clips": [
    {
      "start_time": 10.5,
      "end_time": 15.2,
      "description": "...",
      "dialogue": "Winter is coming",
      "objects": ["coffee cup", "notebook"],
      "settings": "Office",
      "location": "Central Perk",
      "characters": "Jim and Dwight",
      "actors": "...",
      "actions": "...",
      "narrative": "...",
      "mood": "...",
      "emotion": "...",
      "shot_type": "..."
    }
  ]
}
```

Prerequisites:

- Python 3.10+
- AWS credentials configured (via `~/.aws/credentials` or environment variables)
- AWS permissions:
  - Access to your S3 bucket (configured via the `MODEL_EVAL_S3_BUCKET` environment variable)
  - Access to AWS Bedrock Runtime
  - Permission to invoke `twelvelabs.pegasus-1-2-v1:0`
  - Permission to invoke `twelvelabs.marengo-embed-2-7-v1:0` (and its inference profile)
  - Permission to invoke `twelvelabs.marengo-embed-3-0-v1:0` (and its inference profile)
  - Permission to invoke `anthropic.claude-3-7-sonnet-20250219-v1:0`
  - Permission to invoke `anthropic.claude-opus-4-20250514-v1:0`
- Install dependencies: `pip install boto3 yt-dlp`

Before running any scripts, configure your S3 bucket name using environment variables. See `.env.example` for a template.
| Variable | Description | Default |
|---|---|---|
| `MODEL_EVAL_S3_BUCKET` | S3 bucket for evaluation data | `model-evaluation-dataset` |
| `TEST_S3_BUCKET` | S3 bucket for test scripts (optional) | Falls back to `MODEL_EVAL_S3_BUCKET` |
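A script that honors this fallback behavior might resolve the two variables like the hypothetical helper below (the project's actual scripts may read configuration differently):

```python
import os

def resolve_buckets(env=os.environ):
    """Resolve bucket names per the table above: MODEL_EVAL_S3_BUCKET has a
    built-in default, and TEST_S3_BUCKET falls back to MODEL_EVAL_S3_BUCKET."""
    eval_bucket = env.get("MODEL_EVAL_S3_BUCKET", "model-evaluation-dataset")
    test_bucket = env.get("TEST_S3_BUCKET", eval_bucket)
    return eval_bucket, test_bucket
```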
Linux / macOS:

```bash
export MODEL_EVAL_S3_BUCKET=your-bucket-name
```

Windows (PowerShell):

```powershell
$env:MODEL_EVAL_S3_BUCKET = "your-bucket-name"
```

Windows (Command Prompt):

```cmd
set MODEL_EVAL_S3_BUCKET=your-bucket-name
```

Using a `.env` file:

```bash
cp .env.example .env
# Edit .env and set your bucket name
```

S3 bucket names must follow AWS naming rules:
- 3-63 characters long
- Only lowercase letters, numbers, hyphens, and periods
- Cannot start or end with hyphen or period
- Cannot be formatted as an IP address
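The rules above can be checked before any AWS call; a minimal validator sketch (covering only the rules listed, not every AWS edge case):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Check a bucket name against the AWS naming rules listed above."""
    if not 3 <= len(name) <= 63:
        return False
    # Only lowercase letters, digits, hyphens, and periods;
    # must start and end with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    # Must not be formatted as an IP address (e.g. 192.168.5.4).
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    return True
```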
> **Warning:** The `evaluation_dataset.jsonl` files in `pegasus-eval/` and `marengo-eval/` contain placeholder values that MUST be replaced before running any scripts.
Placeholders to replace:
| Placeholder | Replace With | Example |
|---|---|---|
| `{YOUR_BUCKET_NAME}` | Your S3 bucket name | `my-evaluation-bucket` |
| `{YOUR_ACCOUNT_ID}` | Your 12-digit AWS account ID | `123456789012` |
Example transformation:
Before:

```json
{"mediaSource": {"s3Location": {"uri": "s3://{YOUR_BUCKET_NAME}/videos/video.mp4", "bucketOwner": "{YOUR_ACCOUNT_ID}"}}}
```

After:

```json
{"mediaSource": {"s3Location": {"uri": "s3://my-evaluation-bucket/videos/video.mp4", "bucketOwner": "123456789012"}}}
```

You can use `sed` to replace placeholders in bulk:

```bash
# Linux/macOS
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' pegasus-eval/evaluation_dataset.jsonl
sed -i 's/{YOUR_BUCKET_NAME}/my-evaluation-bucket/g; s/{YOUR_ACCOUNT_ID}/123456789012/g' marengo-eval/evaluation_dataset.jsonl
```

> **Note:** Replace `{YOUR_BUCKET_NAME}` with your own unique bucket name throughout these examples.
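On systems without `sed` (e.g. Windows), the same bulk replacement can be done with a short Python sketch; the function below is hypothetical and simply mirrors the `sed` commands:

```python
def fill_placeholders(text: str, bucket: str, account_id: str) -> str:
    """Substitute the dataset placeholders with real values."""
    return (text.replace("{YOUR_BUCKET_NAME}", bucket)
                .replace("{YOUR_ACCOUNT_ID}", account_id))

# Apply in place to both dataset files, e.g. with pathlib:
#   from pathlib import Path
#   for p in ("pegasus-eval/evaluation_dataset.jsonl",
#             "marengo-eval/evaluation_dataset.jsonl"):
#       f = Path(p)
#       f.write_text(fill_placeholders(f.read_text(),
#                    "my-evaluation-bucket", "123456789012"))
```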
```bash
aws s3api create-bucket --bucket {YOUR_BUCKET_NAME} --region us-east-1 --profile sandbox
aws s3api put-bucket-tagging --bucket {YOUR_BUCKET_NAME} --tagging 'TagSet=[{Key=Name,Value={YOUR_BUCKET_NAME}},{Key=Owner,Value="Your Name"},{Key=Project,Value=model-evaluation}]' --profile sandbox
```

Apply the CORS policy to allow cross-origin requests (needed for some Bedrock features):

```bash
aws s3api put-bucket-cors --bucket {YOUR_BUCKET_NAME} --cors-configuration file://cors-policy.json --profile sandbox
```

Ensure videos are within Bedrock limits (under 2 GB and under 1 hour).
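The `put-bucket-cors` command above references a `cors-policy.json` file. If you need to create one, a permissive sketch follows (an assumption, not the repository's exact policy; tighten `AllowedOrigins` for anything beyond local testing):

```json
{
  "CORSRules": [
    {
      "AllowedOrigins": ["*"],
      "AllowedMethods": ["GET", "PUT", "POST", "HEAD"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}
```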
```bash
# Download specific videos
yt-dlp "https://www.youtube.com/watch?v=hB1cIRfpjyU" -o "video_hB1cIRfpjyU.mp4"
yt-dlp "https://www.youtube.com/watch?v=_O6yTfjPfmU" -o "video__O6yTfjPfmU.mp4"
```

Upload the videos to the `videos/` prefix in your bucket:

```bash
aws s3 cp video_hB1cIRfpjyU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_hB1cIRfpjyU.mp4 --profile sandbox
aws s3 cp video__O6yTfjPfmU.mp4 s3://{YOUR_BUCKET_NAME}/videos/video__O6yTfjPfmU.mp4 --profile sandbox
aws s3 cp video_dfuPBC-v5NE.mp4 s3://{YOUR_BUCKET_NAME}/videos/video_dfuPBC-v5NE.mp4 --profile sandbox
```

Upload your initial dataset file to S3:

```bash
aws s3 cp evaluation_dataset.jsonl s3://{YOUR_BUCKET_NAME}/evaluate/evaluation_dataset.jsonl --profile sandbox
```

For best results, use the dedicated scripts in the `pegasus-eval/` directory.
Convert your input JSONL into the specific format Pegasus requires:

```bash
python pegasus-eval/prepare-dataset.py --input pegasus-eval/evaluation_dataset.jsonl
```

Execute Pegasus model inference and auto-upload results to S3 for judging. Supports long videos (up to 1 hour) with optimized timeouts:

```bash
python pegasus-eval/run-manual-evaluation.py
```

Create an automated evaluation job in AWS Bedrock using Claude 3.7 Sonnet as the judge:

```bash
python pegasus-eval/create-judge-job.py
```

Retrieve and summarize the judging results once the job is complete:

```bash
python pegasus-eval/get-results.py --job-name your-job-name
```

The cost calculator supports aggregating multiple jobs and processing local files.
```bash
# Calculate combined cost for Pegasus (inference + judge)
python pegasus-eval/calculate-job-cost.py \
  --uri s3://bucket/manual/inference.jsonl s3://bucket/judge-jobs/my-job/ \
  --total-video-minutes 120 \
  --output-json pegasus-eval/results/cost_summary.json
```

After running both the Pegasus and Marengo evaluation pipelines, you can generate a single, comprehensive report that combines both models' results.
This script uses Claude Opus 4.5 to analyze all technical metrics and draft a professional executive summary.
```bash
python generate-report.py
```

- Inputs: Reads from `pegasus-eval/results/` and `marengo-eval/`
- Output: Generates `evaluation-report-[TIMESTAMP].md` in the project root.
- Cost & Latency Tracking: The report includes a detailed breakdown of estimated costs (Pegasus, Marengo, Claude) and latency statistics for every stage of the pipeline.
Marengo is an embedding model. Evaluation follows a 3-step workflow in the marengo-eval/ directory.
Perform async inference to convert video segments into vector embeddings:

```bash
python marengo-eval/run-inference.py
```

Convert text queries to vectors and perform cosine similarity search against video segments:

```bash
# Update inference_output in run-retrieval.py first
python marengo-eval/run-retrieval.py
```

Run a tiered evaluation using standard math metrics and a Pegasus-enhanced LLM judge (Claude Opus 4.5):

```bash
# Update retrieval_results in run-evaluation.py first
python marengo-eval/run-evaluation.py
```

The evaluation framework supports both Marengo 2.7 and 3.0, controlled via the `USE_MARENGO_3_0` environment variable.
```bash
# Use Marengo 2.7 (default)
python marengo-eval/run-inference.py

# Use Marengo 3.0
export USE_MARENGO_3_0=true
python marengo-eval/run-inference.py
```

- Marengo 2.7: Uses a simpler API structure with `visual-text` and `audio` embeddings
- Marengo 3.0: Uses a nested structure with separate `visual`, `audio`, and optional `transcription` embeddings
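Whichever version produces the embeddings, the retrieval step ranks video segments by cosine similarity between the query vector and each segment vector. A minimal sketch of that ranking, assuming embeddings are plain float lists (the actual scripts may use numpy):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_segments(query_vec, segments):
    """Return (segment_id, score) pairs sorted best-first.

    `segments` is an iterable of (segment_id, embedding) pairs."""
    scored = [(seg_id, cosine_similarity(query_vec, vec))
              for seg_id, vec in segments]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```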
Marengo 3.0 introduces:
- Dynamic Segmentation: Automatically segments videos based on content changes
- Transcription Embeddings: Additional embedding type for speech-to-text analysis
- Multiple Embedding Scopes: Both clip-level and asset-level embeddings
- Configurable Segmentation: Choose between fixed or dynamic duration methods
| Aspect | Marengo 2.7 | Marengo 3.0 |
|---|---|---|
| Processing Speed | Faster | Slower (more comprehensive analysis) |
| Embedding Quality | Good | Better (higher mAP, nDCG scores) |
| Text Embedding Latency | ~0.15s | ~0.12s |
| Model Architecture | Simpler | More complex, larger |
| Use Case | Quick processing needs | Quality-critical applications |
- Quality Over Speed: Marengo 3.0 prioritizes embedding quality and accuracy over processing speed
- More Comprehensive Analysis: Performs more sophisticated video understanding
- Larger Model: More complex architecture requires additional computation time
- Enhanced Features: Even with basic options, performs more under-the-hood processing
- Use specific embedding options instead of all three (`visual`, `audio`, `transcription`)
- Configure fixed segmentation if dynamic segmentation isn't needed
- Process shorter video clips when possible
- Consider the quality-versus-speed trade-off for your specific use case
```bash
# 1. Evaluate Marengo 2.7
cd marengo-eval
python run-inference.py
python run-retrieval.py   # Update inference_output path
python run-evaluation.py  # Update retrieval_results path

# 2. Evaluate Marengo 3.0
export USE_MARENGO_3_0=true
python run-inference.py
python run-retrieval.py   # Update inference_output path
python run-evaluation.py  # Update retrieval_results path

# 3. Generate combined report
cd ..
python generate-report.py
```

The unified report will include side-by-side comparisons of both Marengo versions, showing performance metrics, latency statistics, and quality scores.
The source `evaluation_dataset.jsonl` should contain:

- `inputPrompt`: The question for Pegasus or the search query for Marengo.
- `mediaSource.s3Location.uri`: S3 URI of the video file.
- `responseFormat`: (Optional) Bedrock JSON schema for Pegasus.
- `referenceResponse`: (Optional) Ground truth for the judge (LLM evaluation).
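Combining those fields, one record of the JSONL (a single line per JSON object) might look like the illustrative sketch below; the prompt and reference text are invented, and the optional `responseFormat` schema is omitted:

```json
{"inputPrompt": "What happens in this scene?", "mediaSource": {"s3Location": {"uri": "s3://my-evaluation-bucket/videos/video.mp4", "bucketOwner": "123456789012"}}, "referenceResponse": "Two colleagues argue over a stapler in an office."}
```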
- IAM Role: Ensure the role used in `CONFIG` has a trust relationship with `bedrock.amazonaws.com`.
- S3 Access: Verify S3 bucket policies allow Bedrock service access.
- Model Access: Ensure Twelve Labs Pegasus, Marengo 2.7, Marengo 3.0, Claude 3.7 Sonnet, and Claude Opus 4.5 are enabled in your region.
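For the IAM role, the trust relationship is expressed as a trust policy on the role. A standard sketch allowing Bedrock to assume the role looks like this (consider adding `Condition` keys such as `aws:SourceAccount` to scope it down):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "bedrock.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```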
AWS Bedrock has a hard 15-minute timeout limit for async invocations that cannot be increased. This affects both Pegasus and Marengo when processing long or complex videos.
Symptoms:
- Jobs fail exactly at 15 minutes with error: "Something went wrong. Please try again later."
- Affects videos regardless of file size - both large and small videos can fail if processing takes too long
Solutions for Long Videos:
1. Video Preprocessing (Recommended)
   - Split videos into 10-minute segments before processing
   - Use ffmpeg to segment without re-encoding (with `-c copy`, cuts land on keyframes, so segment lengths are approximate):

     ```bash
     ffmpeg -i input.mp4 -c copy -segment_time 600 -f segment output_%03d.mp4
     ```

2. Reduce Embedding Options (Marengo)
   - Process only visual or only audio embeddings instead of both
   - This can reduce processing time by up to 50%

3. Alternative Approaches
   - Use AWS Batch or Step Functions to orchestrate multiple smaller jobs
   - Contact AWS Support (though this limit typically cannot be increased)
   - Consider using Twelve Labs' direct API if available
Note: Video duration and content complexity affect processing time more than file size. A small video with complex scenes may still exceed the timeout.
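When scripting the preprocessing step, it can help to compute the segment start times up front (e.g. to predict how many Bedrock jobs a video will produce). A hypothetical helper matching the 10-minute ffmpeg split above:

```python
def segment_starts(duration_s: float, segment_s: float = 600.0):
    """Start times (in seconds) of fixed-length segments covering a video."""
    if duration_s <= 0:
        return []
    starts = []
    t = 0.0
    while t < duration_s:
        starts.append(t)
        t += segment_s
    return starts
```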