An Open-Source Infrastructure for Privacy-Preserving Longitudinal Web Browsing Data Collection
GESIS Surf is an open-source research infrastructure for privacy-preserving, longitudinal collection of web browsing data at scale. Created by GESIS – Leibniz Institute for the Social Sciences, it combines a browser extension, a REST API backend, and hierarchical session modeling to enable reproducible passive panel studies.
🔗 Looking for the backend? Check out GESIS Surf Backend
- Passive Longitudinal Data Collection - Captures naturalistic browsing behavior over time without interrupting users, enabling large-scale panel studies
- 🏗️ Hierarchical Session Modeling - Preserves the full structure of browsing behavior across windows, tabs, domains, and interactions
- 📄 Content-Level Capture - Records clicks, scrolls, DOM changes, page metadata, and full HTML snapshots per observation
- 🛡️ Privacy-by-Design - Strict opt-in participation, per-domain collection rules, and client-side data minimization at the point of collection
- 🌐 Cross-Browser Support - Works on both Chrome and Firefox via the WebExtension API
- 🔐 Secure Authentication - Token-based authentication with secure session management
- 💾 Client-Side Storage - IndexedDB for local data buffering before transmission
- ♻️ Reproducible Infrastructure - Open-source, self-hostable backend with REST API for transparent and auditable research workflows
- Node.js: >= 18.12.0
- Package Manager: pnpm 9.1.1 or higher
Clone the repository and install dependencies:
git clone git@github.com:gesiscss/gesis_surf_extension.git
cd gesis_surf_extension
npm install
# or
pnpm install
Build the extension for Firefox (default):
pnpm run build:firefox
The compiled files will be in the dist/ directory.
To load the extension in Firefox:
- Navigate to about:debugging#/runtime/this-firefox
- Or go to Firefox > Preferences > Extensions & Themes > Debug Add-ons > Load Temporary Add-on...
- Locate and select the dist/manifest.json file
Build the extension for Google Chrome:
pnpm run build
The compiled files will be in the dist/ directory.
To load the extension in Chrome:
- Open chrome://extensions/
- Enable Developer mode (top-right corner)
- Click Load unpacked
- Select the dist/ directory
For Chrome (with HMR support):
pnpm run dev
For Firefox (with HMR support):
pnpm run dev:firefox
Available scripts:
- pnpm run clean - Clean build artifacts and cache
- pnpm run build - Build for Chrome
- pnpm run build:firefox - Build for Firefox
- pnpm run dev - Start development server (Chrome, with HMR)
- pnpm run dev:firefox - Start development server (Firefox, with HMR)
- pnpm run test - Run tests
- pnpm run type-check - Type-check the entire project
- pnpm run lint - Lint all files
- pnpm run lint:fix - Fix linting issues
- pnpm run prettier - Format code with Prettier
- pnpm run docs - Generate TypeDoc documentation
├── chrome-extension/ # Chrome extension source code
│ ├── lib/ # Core extension logic
│ │ ├── background/ # Service worker/background script
│ │ ├── controllers/ # Core extension controller
│ │ ├── db/ # Database service and configuration
│ │ ├── events/ # Event managers (Tab, Window, Domain, Content)
│ │ ├── handlers/ # Client and shared message handlers
│ │ ├── messages/ # Message interfaces and handlers
│ │ └── services/ # Auth, data collection, session, sync, policy
│ ├── public/ # Static assets (icons, CSS)
│ ├── utils/plugins/ # Vite manifest plugin
│ └── manifest.js # Extension manifest
├── pages/ # UI components and pages
│ ├── content/ # Content script (clicks, scrolls, HTML capture)
│ ├── popup/ # Extension popup (React, MUI, Auth, PrivacyMode)
│ └── utils/ # Shared page assets and ConnectedPage HOC
├── packages/ # Shared packages and utilities
│ ├── dev-utils/ # Manifest parser, logger, and dev utilities
│ ├── hmr/ # Hot module replacement (rollup-based)
│ ├── shared/ # Shared React hooks, storages, HOCs, and services
│ ├── tailwind-config/ # Shared Tailwind CSS configuration
│ └── tsconfig/ # Shared TypeScript configurations
└── docs/ # Generated TypeDoc documentation
- Background Service Worker (lib/background/) - Manages extension lifecycle, coordinates all events and services
- EventManager (lib/events/) - Orchestrates Tab, Window, Domain, and Content event managers
- Content Script (pages/content/) - Injected script capturing clicks, scrolls, and HTML snapshots per page
- Popup UI (pages/popup/) - React interface for user authentication, privacy mode, and settings
- GlobalSessionService (lib/services/globalSession/) - Builds and maintains the hierarchical session model across windows and tabs
- PolicyService (lib/services/policyService/) - Enforces per-domain and per-content collection rules (privacy-by-design)
- AuthService (lib/services/authService/) - Token-based authentication and session management
- DatabaseService (lib/db/) - IndexedDB client-side data buffering before transmission
- DataCollectionService (lib/services/dataCollectionService/) - Aggregates and processes collected interaction data
- SyncService (lib/services/syncService/) - Handles periodic data synchronization to the backend API
- PrivateModeService (lib/services/privateModeService/) - User-controlled privacy mode with timed activation
- MessageHandler (lib/messages/) - Typed message passing between background, content, and popup scripts
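To make the per-domain collection rules concrete, here is a minimal sketch of how a policy check could look. It is not the extension's actual PolicyService: the `HostRule` shape, the `mayCollect` function, and the precedence (block rules win, rules cover subdomains, unknown hosts use a default) are all assumptions made for illustration.

```typescript
// Hypothetical rule shape; the real PolicyService types may differ.
type HostRule = { host: string; action: "allow" | "block" };

// Returns true when `url` may be collected under the given rules.
// Assumptions: rule hosts are lowercase, a rule also covers its subdomains,
// block rules win over allow rules, and hosts without a matching rule
// fall back to `defaultAllow`.
function mayCollect(url: string, rules: HostRule[], defaultAllow = false): boolean {
  let hostname: string;
  try {
    hostname = new URL(url).hostname.toLowerCase();
  } catch {
    return false; // unparsable URLs are never collected
  }
  const matches = rules.filter(
    (rule) => hostname === rule.host || hostname.endsWith("." + rule.host),
  );
  if (matches.some((rule) => rule.action === "block")) return false;
  if (matches.some((rule) => rule.action === "allow")) return true;
  return defaultAllow;
}
```

Defaulting to "do not collect" for unknown hosts keeps the sketch on the conservative side, which seems in line with the opt-in stance described in this README.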
The extension follows a modular architecture:
- Background Script (Service Worker) - Manages extension state and coordinates events
- Content Script - Collects user interaction data from web pages
- Popup UI - Provides user authentication and privacy controls
- Message Passing - Secure communication between background, content, and popup scripts
- IndexedDB - Local storage for data persistence
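The message passing between content script and background worker can be illustrated with a discriminated union plus a runtime guard. This is only a sketch under assumed names: `ClickMessage`, `ScrollMessage`, `isExtensionMessage`, and `describeMessage` are hypothetical and do not reflect the interfaces in lib/messages/.

```typescript
// Hypothetical message shapes; the extension's real interfaces may differ.
type ClickMessage = { type: "click"; tabId: number; selector: string; timestamp: number };
type ScrollMessage = { type: "scroll"; tabId: number; scrollY: number; timestamp: number };
type ExtensionMessage = ClickMessage | ScrollMessage;

// Runtime guard: messages cross a context boundary, so compile-time types
// alone do not prove that an incoming value has the expected shape.
function isExtensionMessage(value: unknown): value is ExtensionMessage {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  if (typeof m["tabId"] !== "number" || typeof m["timestamp"] !== "number") return false;
  if (m["type"] === "click") return typeof m["selector"] === "string";
  if (m["type"] === "scroll") return typeof m["scrollY"] === "number";
  return false;
}

// Handles a validated message; the `type` field narrows the union.
function describeMessage(message: ExtensionMessage): string {
  if (message.type === "click") {
    return `click on ${message.selector} in tab ${message.tabId}`;
  }
  return `scroll to ${message.scrollY}px in tab ${message.tabId}`;
}
```

In a background listener, a guard like this would typically run first, so that malformed messages are dropped before they reach any handler.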
┌─────────────────────┐
│ Browser Extension │
├─────────────────────┤
│ - Content Script │ Collects: clicks, scrolls, HTML snapshots,
│ - Background Worker │ domains, tab/window events,
│ - Popup UI │ session hierarchy, host policy
│ - IndexedDB Storage │
└──────────┬──────────┘
│ HTTPS/Secure
│ Authentication
▼
┌─────────────────────┐
│ Django Backend │
├─────────────────────┤
│ - REST API │ Processes: user registration,
│ - Token Auth │ authentication, data aggregation,
│ - Database │ analysis & reporting
│ - Celery/Redis │
│ - Elasticsearch │
└─────────────────────┘
Data Flow:
- On startup/install, AuthService validates the stored token against /api/user/me/
- If authenticated, HostService syncs the domain blocklist/allowlist from /api/host/hosts/
- GlobalSessionService creates a hierarchical session (global → window → tab → domain) and posts it to /api/session/
- EventManager starts TabEventManager, WindowEventManager, DomainEventManager, and ContentEventManager
- Content script captures clicks, scrolls, and HTML snapshots (with meta tags) and sends them via message passing to the background service worker
- Background worker writes events to IndexedDB via DatabaseService for local buffering
- Events are flushed to the backend API (/api/clicks/, /api/scrolls/, /api/tab/tabs/, /api/domain/domains/)
- HeartbeatService runs every 10 seconds to maintain extension liveness state
- PrivateModeService suspends data collection when the user activates privacy mode
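The global → window → tab → domain hierarchy from the data flow can be pictured as nested records. The sketch below is an assumption-laden illustration, not the GlobalSessionService implementation: every interface and function name is hypothetical, the ID factory is injected instead of calling the uuid package, and merging consecutive pages on the same host into one domain node is a simplification chosen here.

```typescript
// Hypothetical node shapes for the global → window → tab → domain hierarchy.
interface DomainNode { id: string; host: string; startedAt: number }
interface TabNode { id: string; tabId: number; domains: DomainNode[] }
interface WindowNode { id: string; windowId: number; tabs: TabNode[] }
interface GlobalSession { id: string; startedAt: number; windows: WindowNode[] }

// Injected so the sketch has no dependencies; the extension lists uuid for this.
type IdFactory = () => string;

function createGlobalSession(newId: IdFactory, now: number): GlobalSession {
  return { id: newId(), startedAt: now, windows: [] };
}

// Records a navigation, creating missing window and tab nodes on the way down.
function recordNavigation(
  session: GlobalSession,
  windowId: number,
  tabId: number,
  url: string,
  newId: IdFactory,
  now: number,
): DomainNode {
  let win = session.windows.find((w) => w.windowId === windowId);
  if (!win) {
    win = { id: newId(), windowId, tabs: [] };
    session.windows.push(win);
  }
  let tab = win.tabs.find((t) => t.tabId === tabId);
  if (!tab) {
    tab = { id: newId(), tabId, domains: [] };
    win.tabs.push(tab);
  }
  const host = new URL(url).hostname;
  const last = tab.domains[tab.domains.length - 1];
  // Assumption: consecutive pages on the same host share one domain node.
  if (last && last.host === host) return last;
  const domain: DomainNode = { id: newId(), host, startedAt: now };
  tab.domains.push(domain);
  return domain;
}
```

Keeping the tree as plain data means it can be serialized with JSON.stringify before being posted to a session endpoint.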
We welcome contributions! Please see our Contributing Guide for detailed information on:
- 🌿 Branching Strategy - dev → main → prod workflow
- 📝 Commit Conventions - Using Commitizen with Conventional Commits
- 🔍 Code Quality - Pre-commit hooks, linting, and formatting
- 🔀 Pull Request Process - Guidelines and review workflow
- Fork the repository
- Create a feature branch from dev
  git checkout dev && git pull origin dev
  git checkout -b feature/amazing-feature
- Install pre-commit hooks
  pnpm install
  pnpm run prepare
- Commit using Commitizen
  git add .
  pnpm cz
- Push and open a Pull Request targeting dev
This project uses:
- ESLint - For code linting
- Prettier - For code formatting
- TypeScript - For type safety
- Husky - For pre-commit and commit-msg hooks
- lint-staged - For running linters on staged files
- commitlint - Enforces Conventional Commits format on every commit message
Run quality checks:
pnpm run lint
pnpm run lint:fix
pnpm run type-check
pnpm run prettier
- UI: React 18, React Router v6, MUI v6 (Material UI), Emotion, Tailwind CSS
- Build Tools: Vite 6, Turbo (monorepo task runner), Rollup (HMR package)
- Language: TypeScript 5.9
- Storage: IndexedDB via the idb library
- Browser APIs: WebExtension API with webextension-polyfill
- Unique IDs: uuid v11 for session identifier generation
- Code Quality: ESLint (Airbnb TypeScript config), Prettier, Husky, lint-staged, commitlint
- Commit Tooling: Commitizen (cz-conventional-changelog), commitlint (@commitlint/config-conventional)
- Package Manager: pnpm 9.1.1 (workspace monorepo)
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright © 2023-2025 GESIS – Leibniz Institute for the Social Sciences
The GESIS Surf Extension works in conjunction with the GESIS Surf Backend for data processing and storage.
- GESIS Surf Backend - Django REST API for data collection, user management, and research analysis
- Built with Django 4.2 and Python 3.10+
- PostgreSQL for persistent storage
- Celery/Redis for async task processing
- Elasticsearch for fast data retrieval
- Docker-ready deployment
The extension communicates with the backend API for:
| Endpoint | Purpose |
|---|---|
| /api/user/token/ | Authentication token generation |
| /api/user/me/ | User profile and data collection status |
| /api/session/ | Global session hierarchy submission |
| /api/tab/tabs/ | Browser tab event tracking |
| /api/domain/domains/ | Domain classification and event tracking |
| /api/clicks/ | Click event submission |
| /api/scrolls/ | Scroll event submission |
| /api/host/hosts/ | Host blocklist/allowlist sync (policy rules) |
| /api/host/task-result/ | Async host sync task polling |
| /api/selectors/ | Dynamic LLM-based CSS selector retrieval |
| /api/selectors/task-result/ | Async selector task polling |
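As a rough illustration of how buffered events might be mapped onto the click and scroll endpoints listed above, the sketch below turns events into request descriptors. The names `BufferedEvent`, `ApiRequest`, and `toRequests` are hypothetical, and the `Authorization: Token <key>` header format is an assumption based on common Django REST framework setups, not something this README specifies.

```typescript
// Endpoint paths taken from the endpoint list; the mapping itself is illustrative.
const ENDPOINTS = {
  click: "/api/clicks/",
  scroll: "/api/scrolls/",
} as const;

type EventKind = keyof typeof ENDPOINTS;

// Hypothetical shapes for a locally buffered event and an outgoing request.
interface BufferedEvent { kind: EventKind; payload: Record<string, unknown> }
interface ApiRequest { path: string; headers: Record<string, string>; body: string }

// Converts buffered events into request descriptors, preserving their order.
function toRequests(events: BufferedEvent[], token: string): ApiRequest[] {
  // Assumption: DRF-style token header; the real backend may expect another scheme.
  const headers: Record<string, string> = {
    Authorization: `Token ${token}`,
    "Content-Type": "application/json",
  };
  return events.map((event) => ({
    path: ENDPOINTS[event.kind],
    headers: { ...headers },
    body: JSON.stringify(event.payload),
  }));
}
```

Each descriptor could then be handed to `fetch` with `method: "POST"` against the backend's base URL. Building the requests as plain data first keeps the mapping easy to test without a network.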
- Mario Ramirez - Lead Research Software Engineer - @geomario @MarioGesis
- Fernando Guzman - Software Architect Consultant - @Fernando
- Dr. Sebastian Stier - Department Director CSS - @Seb
- Dr. Frank Mangold - Acting Team Lead DDD - @Frank
This extension is designed with privacy in mind. Data collection is:
- Transparent - Users know what data is being collected
- Ethical - Complies with research ethics standards
- Secure - Uses secure authentication and storage mechanisms
- User-Controlled - Includes privacy mode and user controls
For detailed privacy information, please refer to the project's privacy documentation or contact GESIS directly.
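The timed privacy mode can be thought of as a simple expiry timestamp. The following sketch shows one way to model it. `PrivateModeTimer` and its methods are hypothetical names, and passing the current time in explicitly is a design choice made here for testability, not a description of PrivateModeService.

```typescript
// Hypothetical state holder; the real PrivateModeService may persist state differently.
class PrivateModeTimer {
  private activeUntil = 0; // epoch milliseconds; 0 means inactive

  // Activates privacy mode for `minutes`, counted from `now` (epoch milliseconds).
  activate(minutes: number, now: number): void {
    if (!Number.isFinite(minutes) || minutes <= 0) {
      throw new RangeError("minutes must be a positive number");
    }
    this.activeUntil = now + minutes * 60000;
  }

  // Ends privacy mode immediately.
  deactivate(): void {
    this.activeUntil = 0;
  }

  // Collection would be suspended while this returns true.
  isActive(now: number): boolean {
    return now < this.activeUntil;
  }

  // Milliseconds left before collection would resume; 0 when inactive.
  remainingMs(now: number): number {
    return Math.max(0, this.activeUntil - now);
  }
}
```

A collector could call `isActive(Date.now())` before recording each event, so that expiry needs no separate background timer.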
Questions or feedback? Reach out!
- GitHub Issues: Create an issue
- Backend Issues: Backend Repository
- GESIS: https://www.gesis.org/
If you use this software in your research, please cite:
@article{ramirez2025gesis,
title = {GESIS Surf Extension},
author = {Ramirez, Mario and Guzman, Fernando and Stier, Sebastian and Mangold, Frank},
journal = {SoftwareX},
volume = {XX},
pages = {XXXXXX},
year = {2026},
publisher = {Elsevier},
doi = {10.1016/j.softx.2025.xxxxxx}
}
See CITATION.cff for more citation formats.
