Sat2Cap: Mapping Fine-grained Text Descriptions from Satellite Images

Aayush Dhakal*, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs

This repository is the official implementation of Sat2Cap [CVPRW, EarthVision 2024, Best Paper Award]. Sat2Cap treats mapping as a zero-shot problem: instead of predicting pre-defined attributes for a satellite image, it attempts to learn the text associated with a given location.

🤗 Pretrained Models

Pretrained Sat2Cap models are available on HuggingFace:

MVRL Remote Sensing Foundation Models

You can load the pretrained model with a single function call:

from sat2cap.utils.load_model import load_sat2cap

# Automatically downloads the checkpoint from HuggingFace Hub
model = load_sat2cap(repo_id='MVRL/sat2cap', filename='sat2cap.ckpt')
model.eval()

Or install huggingface_hub and download manually:

pip install huggingface_hub

from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(repo_id='MVRL/sat2cap', filename='sat2cap.ckpt')

🚀 Quick Start: Text-Image Similarity Demo

See demo.ipynb for a full walkthrough that shows how to:

Load the pretrained Sat2Cap model from HuggingFace
Preprocess a satellite image
Compute cosine similarity scores against a list of text prompts
Visualize the top-matching text descriptions for your satellite image

import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection
from sat2cap.utils.load_model import load_sat2cap

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load pretrained model
model = load_sat2cap(repo_id='MVRL/sat2cap', filename='sat2cap.ckpt').to(device).eval()

# Load CLIP text encoder
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch32')
text_model = CLIPTextModelWithProjection.from_pretrained('openai/clip-vit-base-patch32').to(device).eval()

# Define text prompts
prompts = ['a photo of a forest', 'a photo of a city center', 'a photo of farmland']

# Encode text prompts
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors='pt').to(device)
    text_embeds = text_model(**tokens).text_embeds
    text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)

# Encode a satellite image (supply your own image tensor preprocessed to 224x224)
# img_tensor shape: (1, 3, 224, 224)
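with torch.no_grad():
    img_embeds, _ = model.imo_encoder(img_tensor.to(device))
    # L2-normalize so the dot product below is a true cosine similarity
    img_embeds = img_embeds / img_embeds.norm(p=2, dim=-1, keepdim=True)

# Compute cosine similarities
similarities = (img_embeds @ text_embeds.T).squeeze(0)
best_match = prompts[similarities.argmax().item()]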
print(f'Best matching description: "{best_match}"')
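
To inspect more than the single best match, the similarity scores can be ranked. The helper below is a minimal sketch and not part of the Sat2Cap package: `rank_prompts` is a hypothetical name, and it expects plain Python floats, for example `similarities.tolist()` from the demo above. The scores in the example are made up.

```python
from typing import List, Sequence, Tuple


def rank_prompts(
    scores: Sequence[float], prompts: Sequence[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k prompts with the highest similarity, best first.

    Hypothetical helper for illustration; not part of the Sat2Cap package.
    """
    if len(scores) != len(prompts):
        raise ValueError("scores and prompts must have the same length")
    # Pair each prompt with its score, then sort by score in descending order.
    ranked = sorted(zip(prompts, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Example with made-up scores; in the demo above, pass similarities.tolist().
example_scores = [0.12, 0.31, 0.05]
example_prompts = [
    "a photo of a forest",
    "a photo of a city center",
    "a photo of farmland",
]
for prompt, score in rank_prompts(example_scores, example_prompts, k=2):
    print(f"{score:.2f}  {prompt}")
```

Sorting the full list keeps the helper simple; for a handful of prompts the cost is negligible.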

🏋️‍♀️ Training

Use the run_geo.sh script to train the Sat2Cap model. All necessary hyperparameters can be set inside that bash script.

🔮 Inference

Once you have a trained model, use the generate_map_embedding.py file under evaluations to generate Sat2Cap embeddings for all images of interest. Then use merge_embeddings.py to add location and temporal input to the generated embeddings. Finally, get_similarity.py generates similarity values for a given prompt, and these values can be used to create zero-shot maps.
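
The last step, turning per-image similarity values into a zero-shot map, can be sketched as follows. This is an illustration under assumptions, not the code in get_similarity.py: it assumes the embeddings are stored as an (N, D) NumPy array, the prompt embedding has shape (D,), and each image has an integer grid row and column. The name `similarity_map` and all argument names are hypothetical, and the example data is random.

```python
import numpy as np


def similarity_map(
    image_embeds: np.ndarray,  # (N, D), one embedding per satellite image
    prompt_embed: np.ndarray,  # (D,), embedding of the text prompt
    rows: np.ndarray,          # (N,), integer grid row of each image
    cols: np.ndarray,          # (N,), integer grid column of each image
) -> np.ndarray:
    """Return a 2D grid of cosine similarities; cells with no image are NaN.

    Hypothetical helper for illustration; not part of the Sat2Cap package.
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    image_unit = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    prompt_unit = prompt_embed / np.linalg.norm(prompt_embed)
    sims = image_unit @ prompt_unit  # shape (N,)

    # Place each similarity at its grid position.
    grid = np.full((rows.max() + 1, cols.max() + 1), np.nan)
    grid[rows, cols] = sims
    return grid


# Toy example: four random embeddings laid out on a 2x2 grid (made-up data).
rng = np.random.default_rng(0)
embeds = rng.normal(size=(4, 8))
grid = similarity_map(
    embeds, embeds[0], np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
)
print(grid.shape)
```

The resulting grid can be rendered with any image plotting tool to visualize where the prompt matches most strongly.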

📑 Citation

@inproceedings{dhakal2024sat2cap,
  title={Sat2cap: Mapping fine-grained textual descriptions from satellite images},
  author={Dhakal, Aayush and Ahmad, Adeel and Khanal, Subash and Sastry, Srikumar and Kerner, Hannah and Jacobs, Nathan},
  booktitle={IEEE/ISPRS Workshop: Large Scale Computer Vision for Remote Sensing (EARTHVISION)},
  pages={533--542},
  year={2024}
}

📄 License

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.

🔍 Additional Links

Check out our lab website for other interesting works on geospatial understanding and mapping:

Multi-Modal Vision Research Lab (MVRL) - Link
Related Works from MVRL - Link
