Minimal self-hosted analytics collector in Go with a separate TypeScript client library.
Formation Web Analytics Collector is currently an early access release. It is already useful for teams that want to own their analytics pipeline, but it should still be treated as infrastructure for technical adopters rather than a finished product. Expect rough edges, limited ergonomics, and breaking changes while the project matures.
The main way to send data to this collector is the Formation Web Analytics Client. That client library is the intended integration path for browser-based event collection and should be considered part of the recommended setup for this server. The client repository was renamed, but the published npm package is currently still @tryformation/formation-web-analytics-client.
If you decide to use this today, assess the operational and compliance risks carefully. There is no polished UI yet, so setup, validation, and analysis require hands-on work with your own infrastructure and Elasticsearch. You should review your deployment, retention settings, access controls, backups, incident handling, cookie and consent requirements, and privacy notices before collecting production traffic. The mitigation is straightforward: start with a limited rollout, keep your configuration conservative, test your data handling end to end, and make sure your legal and security review matches your specific use case.
The main reason to adopt this approach is control. This project is intended for self-hosting web analytics in the EU in a way that can reduce third-country transfer exposure and support a GDPR-aligned deployment when configured and operated correctly. Instead of sending visitor data to foreign advertising or tracking companies, you keep collection, storage, access, and governance under your own control. That reduces third-party exposure, simplifies data residency choices, and lets you decide exactly what is collected, retained, and shared.
There is no user interface yet. formationxyz.com plans to announce agentic solutions around this, so check the site for updates and more information.
The published container setup is designed around compliant GeoIP distribution. The Docker image does not need to ship a MaxMind database. Instead, the Compose stack expects you to provide your own MaxMind account ID and license key, downloads GeoLite2-City.mmdb into a mounted volume at runtime, and periodically refreshes that database in place.
The collector exposes:
- POST /collect
- POST /batch
- GET /healthz
- GET /readyz
Events are validated, enriched with request metadata, parsed user-agent metadata, parsed language metadata, optional client timezone metadata, and local GeoIP metadata, queued in memory, and flushed to an Elasticsearch data stream via the Bulk API.
Validation and abuse guard rails include request body limits, JSON-only ingest, bounded batch sizes, field-length limits, and payload depth/entry limits.
The default in-memory queue size is 10_000 events and the default maximum bulk batch size is 500 events.
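The queueing behavior described above can be illustrated with a minimal sketch. It assumes a buffered channel as the bounded queue and a non-blocking enqueue that drops events when full; the `Event`, `Queue`, `Enqueue`, and `Drain` names are hypothetical and not taken from the collector's source.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for an enriched analytics event.
type Event struct {
	Name string
}

// Queue is a bounded in-memory buffer backed by a buffered channel.
// The dropped counter is not synchronized; a concurrent collector
// would likely use an atomic counter instead.
type Queue struct {
	ch      chan Event
	dropped int
}

// NewQueue creates a queue that holds at most capacity events.
func NewQueue(capacity int) *Queue {
	return &Queue{ch: make(chan Event, capacity)}
}

// Enqueue adds an event without blocking. When the queue is full the
// event is dropped and counted, so ingest never stalls on a slow sink.
func (q *Queue) Enqueue(e Event) bool {
	select {
	case q.ch <- e:
		return true
	default:
		q.dropped++
		return false
	}
}

// Drain removes up to max already-queued events and returns them as one
// batch, which a flusher could then send in a single bulk request.
func (q *Queue) Drain(max int) []Event {
	batch := make([]Event, 0, max)
	for len(batch) < max {
		select {
		case e := <-q.ch:
			batch = append(batch, e)
		default:
			return batch
		}
	}
	return batch
}

func main() {
	q := NewQueue(10000) // default queue size from the text
	for i := 0; i < 10500; i++ {
		q.Enqueue(Event{Name: "pageview"})
	}
	batch := q.Drain(500) // default maximum bulk batch size from the text
	fmt.Println(len(batch), q.dropped)
}
```

Dropping on a full queue trades completeness for bounded memory, which is one plausible reading of why the defaults are expressed as hard limits.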
Metrics are disabled by default; when enabled, GET /metrics is served on a separate listener configured with METRICS_LISTEN_ADDR.
- Copy .env.example to .env and set Elasticsearch credentials.
- Get your own MaxMind account ID and license key, then set MAXMIND_ACCOUNT_ID and MAXMIND_LICENSE_KEY.
- For non-Docker runs, download/update GeoLite2-City.mmdb locally with geoipupdate and point GEOIP_DB_PATH at it.
- Start local Elasticsearch with docker compose -f docker-compose.elasticsearch.yml up -d.
- Create the default web-analytics data stream, ILM policy, and templates with ./scripts/create-data-stream-and-templates.sh. Use --data-stream-name your-name if you want a different data stream.
- Run make test-backend.
- Run go run ./cmd/collector.
- Or start the deployment stack with docker compose up --build.
- Run make smoke-test for a local collector-to-Elasticsearch verification.
- Run make smoke-test-browser-client for an end-to-end check using the Formation Web Analytics Client in a Docker-managed test container.
Docker Compose now includes a geoipupdate service based on MaxMind's official container image. It downloads GeoLite2-City.mmdb into a shared Docker volume, and the collector waits for that database before starting.
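As an illustration of the startup wait, the following sketch polls for the database file until a timeout expires. It is an assumption about how such a wait could work, not the collector's actual code; `waitForFile` and the half-second poll interval are hypothetical, and the timeout is expressed in seconds to mirror GEOIP_WAIT_TIMEOUT.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForFile polls until path is a non-empty regular file or until
// timeoutSeconds has elapsed. The path is always checked at least once.
func waitForFile(path string, timeoutSeconds int) error {
	deadline := time.Now().Add(time.Duration(timeoutSeconds) * time.Second)
	for {
		info, err := os.Stat(path)
		if err == nil && info.Mode().IsRegular() && info.Size() > 0 {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("geoip database %s not available after %ds", path, timeoutSeconds)
		}
		time.Sleep(500 * time.Millisecond) // hypothetical poll interval
	}
}

func main() {
	// Default GEOIP_DB_PATH from the text; the 2-second timeout is illustrative.
	if err := waitForFile("/data/GeoLite2-City.mmdb", 2); err != nil {
		fmt.Println("startup aborted:", err)
		return
	}
	fmt.Println("GeoIP database found")
}
```

Checking for a non-empty regular file, rather than mere existence, avoids treating a partially created placeholder as a usable database.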
Use your own MaxMind credentials here. The values in .env.example are placeholders only, and you should never commit real MAXMIND_LICENSE_KEY values or any populated local env files to your repository.
This is also the recommended distribution model for public container releases: publish the collector image without bundling the MaxMind database, mount a persistent volume for GeoIP data, and let geoipupdate fetch and periodically refresh the database on the user's deployment using their own credentials.
Attribution: This product includes GeoLite Data created by MaxMind, available from maxmind.com.
Relevant variables:
- ALLOWED_DOMAINS: should list your collector hostnames
- SITE_ORIGIN_MAP: optional per-site origin allowlist in the form marketing:tryformation.com|www.tryformation.com;docs:docs.tryformation.com
- CADDY_DOMAINS: default example analytics.tryformation.com
- CADDY_RATE_LIMIT_EVENTS: default 120
- CADDY_RATE_LIMIT_WINDOW: default 1m
- MAXMIND_ACCOUNT_ID
- MAXMIND_LICENSE_KEY
- GEOIPUPDATE_EDITION_IDS: default GeoLite2-City
- GEOIPUPDATE_FREQUENCY: in hours; 0 means run once and exit
- GEOIP_DB_PATH: default /data/GeoLite2-City.mmdb
- GEOIP_WAIT_TIMEOUT: collector startup wait timeout in seconds
- STORE_IP_METADATA: default false
- SANITIZE_URLS: default true
- REQUIRE_ORIGIN: default true
- REQUIRE_URL_HOST_MATCH: default true
- RATE_LIMIT_PER_MINUTE: default 300
- BLOCKED_USER_AGENTS: default bot,crawler,spider,curl,wget,python-requests,go-http-client
- SUSPECT_USER_AGENTS: default headless,playwright,puppeteer,selenium,phantomjs
If your environment uses egress controls, allow HTTPS redirects to:
mm-prod-geoip-databases.a2649acb697e2c09b632799562c076f2.r2.cloudflarestorage.com
This repository includes a GitHub Actions workflow at .github/workflows/docker-publish.yml that publishes the server container to Docker Hub when a new Git tag is pushed.
The workflow currently pushes:
tryformation/formation-web-analytics-server:<tag>
Required GitHub repository secrets:
- DOCKERHUB_USERNAME: the Docker Hub user or service account name
- DOCKERHUB_PASSWORD: the Docker Hub password or access token for that account
The Docker Hub account behind that token must have permission to push to the tryformation organization repository.
Example release flow:
git tag v0.1.0
git push origin v0.1.0
That will publish tryformation/formation-web-analytics-server:v0.1.0.
The published image does not bundle a MaxMind database. You should keep geoipupdate as a separate runtime sidecar or companion job that downloads GeoLite2-City.mmdb with your own MaxMind credentials and mounts it into the collector container.
Pull a tagged release from Docker Hub with:
docker pull tryformation/formation-web-analytics-server:v0.1.0
Or pull the most recent release tag you want to deploy:
docker pull tryformation/formation-web-analytics-server:<tag>
The collector expects a GeoIP database file to be mounted at /data/GeoLite2-City.mmdb by default. A minimal direct run looks like this:
docker run --rm -p 8080:8080 \
  --env-file .env \
  -v /path/to/geoip:/data:ro \
  tryformation/formation-web-analytics-server:v0.1.0
For production deployments, prefer running the collector together with MaxMind's ghcr.io/maxmind/geoipupdate:latest container, as shown in docker-compose.yml, so the database is fetched and refreshed separately from the collector image.
Start a local Elasticsearch 9 node with:
docker compose -f docker-compose.elasticsearch.yml up -d
The local test cluster is exposed on http://localhost:19920.
Provision the default web-analytics data stream, ILM policy, and templates with:
./scripts/create-data-stream-and-templates.sh
Or specify a different data stream name:
./scripts/create-data-stream-and-templates.sh --data-stream-name your-data-stream
The script creates:
- data stream <name>, where the default is web-analytics
- ILM policy <name>-ilm-policy
- component templates <name>-template-settings and <name>-template-mappings
- index template <name>-template
The mappings are tuned for this collector's analytics event shape: fixed top-level dimensions as keywords/dates/IP or wildcard fields, and a payload field stored as flattened for arbitrary event properties without unbounded mapping growth.
- Contributor workflow and test/run expectations: CONTRIBUTING.md
- Done checklist and current reassessment: docs/definition-of-done.md
- Privacy, consent, and compliance notes for deployers: docs/privacy-and-compliance.md
- Third-party licenses and notices: THIRD_PARTY_NOTICES.md
- Content-Type must be application/json when present.
- Requests larger than MAX_PAYLOAD_BYTES are rejected.
- Batches larger than MAX_EVENTS_PER_REQUEST are rejected.
- Core string fields are bounded by MAX_FIELD_LENGTH (default 10240, about 10 KB).
- payload is bounded by MAX_PAYLOAD_ENTRIES and MAX_PAYLOAD_DEPTH.
- Unknown top-level JSON fields are rejected.
- GEOIP_DB_PATH is required; ingest startup fails without a local database.
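Several of the validation rules above can be sketched with Go's standard JSON decoder. This is a minimal illustration under stated assumptions: the `Event` struct shows only three hypothetical fields rather than the collector's real schema, and `decodeEvent` and `checkPayload` are illustrative names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Event uses hypothetical field names; the collector's real schema may differ.
type Event struct {
	SiteID  string         `json:"site_id"`
	URL     string         `json:"url"`
	Payload map[string]any `json:"payload"`
}

// checkPayload walks a decoded JSON value and fails when container nesting
// exceeds maxDepth or the running count of object keys and array items
// exceeds maxEntries. Scalars add no nesting.
func checkPayload(v any, depth, maxDepth, maxEntries int, entries *int) error {
	var children []any
	switch t := v.(type) {
	case map[string]any:
		for _, c := range t {
			children = append(children, c)
		}
	case []any:
		children = t
	default:
		return nil
	}
	if depth > maxDepth {
		return errors.New("payload too deep")
	}
	*entries += len(children)
	if *entries > maxEntries {
		return errors.New("payload has too many entries")
	}
	for _, c := range children {
		if err := checkPayload(c, depth+1, maxDepth, maxEntries, entries); err != nil {
			return err
		}
	}
	return nil
}

// decodeEvent rejects unknown top-level fields, over-long core strings,
// and payloads that break the depth or entry limits.
func decodeEvent(body string, maxFieldLen, maxDepth, maxEntries int) (Event, error) {
	var e Event
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields() // map-typed payload keys remain free-form
	if err := dec.Decode(&e); err != nil {
		return Event{}, err
	}
	if len(e.SiteID) > maxFieldLen || len(e.URL) > maxFieldLen {
		return Event{}, errors.New("field too long")
	}
	entries := 0
	if err := checkPayload(e.Payload, 1, maxDepth, maxEntries, &entries); err != nil {
		return Event{}, err
	}
	return e, nil
}

func main() {
	// 10240 is the documented MAX_FIELD_LENGTH default; the other limits are illustrative.
	_, err := decodeEvent(`{"site_id":"docs","extra":true}`, 10240, 4, 100)
	fmt.Println("unknown field rejected:", err != nil)
	_, err = decodeEvent(`{"site_id":"docs","payload":{"plan":"pro"}}`, 10240, 4, 100)
	fmt.Println("valid event accepted:", err == nil)
}
```

Request size limits such as MAX_PAYLOAD_BYTES would typically be enforced before decoding, for example by wrapping the request body with `http.MaxBytesReader`.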
- Requests without an Origin header are rejected by default.
- Requests with obviously automated user agents are rejected by default.
- Requests with browser-automation style user agents are accepted but marked as suspect.
- Requests are rate limited per client IP in-memory with RATE_LIMIT_PER_MINUTE.
- Caddy applies edge rate limiting before the request reaches the collector.
- SITE_ORIGIN_MAP can bind each site_id to an explicit set of allowed origins.
- Event url hosts must match the request Origin host by default.
- Query strings and fragments are stripped from url, referrer, referer_header, and path by default.
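The last two rules in the list above, host matching and URL sanitization, can be sketched with Go's `net/url` package. The function names are hypothetical, and comparing hostnames case-insensitively while ignoring ports is an assumption; the collector may compare hosts differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sanitizeURL removes the query string and fragment from raw and returns
// the remaining URL. It also works for bare paths such as "/pricing?x=1".
func sanitizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery = ""
	u.ForceQuery = false
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

// hostsMatch reports whether the event URL and the Origin header refer to
// the same hostname. Ports are ignored and case is folded (an assumption).
func hostsMatch(eventURL, origin string) bool {
	u, err1 := url.Parse(eventURL)
	o, err2 := url.Parse(origin)
	if err1 != nil || err2 != nil || u.Hostname() == "" {
		return false
	}
	return strings.EqualFold(u.Hostname(), o.Hostname())
}

func main() {
	clean, _ := sanitizeURL("https://example.com/pricing?utm_source=x#plans")
	fmt.Println(clean) // https://example.com/pricing

	fmt.Println(hostsMatch("https://example.com/pricing", "https://example.com")) // true
	fmt.Println(hostsMatch("https://evil.test/pricing", "https://example.com"))   // false
}
```

Stripping query strings before indexing keeps tokens, email addresses, and campaign identifiers that sometimes appear in URLs out of the stored events.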
For your production setup, keep analytics.tryformation.com as the collector host and bind each site to a canonical site_id, for example:
ALLOWED_DOMAINS=analytics.tryformation.com,open-rtls.com,formation-xyz.com,tryformation.com
SITE_ORIGIN_MAP=open-rtls:open-rtls.com;formation-xyz:formation-xyz.com;tryformation:tryformation.com
This keeps site_id stable per property and prevents one allowed origin from writing traffic into another site's bucket.
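To make the SITE_ORIGIN_MAP format concrete, here is a minimal parsing sketch. It assumes entries are separated by `;`, the site ID is separated from its origins by the first `:`, and multiple origins are separated by `|`, as in the documented examples. `parseSiteOriginMap` and `originAllowed` are hypothetical names, not the collector's API.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSiteOriginMap turns "site:origin|origin;site:origin" into a map from
// site ID to its allowed origin hosts. Hosts are lowercased for comparison.
func parseSiteOriginMap(raw string) (map[string][]string, error) {
	out := make(map[string][]string)
	for _, entry := range strings.Split(raw, ";") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue // tolerate trailing separators
		}
		site, origins, ok := strings.Cut(entry, ":")
		if !ok || site == "" || origins == "" {
			return nil, fmt.Errorf("invalid SITE_ORIGIN_MAP entry %q", entry)
		}
		for _, o := range strings.Split(origins, "|") {
			if o = strings.TrimSpace(o); o != "" {
				out[site] = append(out[site], strings.ToLower(o))
			}
		}
	}
	return out, nil
}

// originAllowed reports whether originHost is listed for siteID.
func originAllowed(m map[string][]string, siteID, originHost string) bool {
	for _, o := range m[siteID] {
		if o == strings.ToLower(originHost) {
			return true
		}
	}
	return false
}

func main() {
	m, err := parseSiteOriginMap("open-rtls:open-rtls.com;tryformation:tryformation.com|www.tryformation.com")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(originAllowed(m, "tryformation", "www.tryformation.com")) // true
	fmt.Println(originAllowed(m, "open-rtls", "tryformation.com"))        // false
}
```

The second check shows the isolation property: an origin that is valid for one site cannot write events under another site's ID.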
- The collector is intentionally lossy under pressure or prolonged Elasticsearch outages.
- CORS is enforced in both Caddy and the backend.
- /metrics is intentionally not exposed through Caddy.
- Raw IP storage is disabled unless CAPTURE_CLIENT_IP=true.
- GeoIP enrichment still works without storing raw IP metadata because the lookup happens before indexing.