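Deployment Architecture Technical Design

1. Overview

This document describes the deployment architecture of the DataJump platform. The platform is deployed as containers on Kubernetes and supports enterprise-grade features such as multi-environment management, high availability, and elastic scaling.

2. Overall Architecture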

┌─────────────────────────────────────────────────────────────────────────────┐
│                              User Access Layer                              │
│                    ┌─────────────────────────────┐                          │
│                    │     Load Balancer (SLB)     │                          │
│                    └─────────────┬───────────────┘                          │
└──────────────────────────────────┼──────────────────────────────────────────┘
                                   │
┌──────────────────────────────────▼──────────────────────────────────────────┐
│                             Kubernetes Cluster                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                           Ingress Controller                           │ │
│  │                       (Nginx Ingress / Traefik)                        │ │
│  └────────────────────────────────┬───────────────────────────────────────┘ │
│                                   │                                         │
│  ┌────────────────────────────────▼───────────────────────────────────────┐ │
│  │                              API Gateway                               │ │
│  │                     (Kong / Spring Cloud Gateway)                      │ │
│  └────────────────────────────────┬───────────────────────────────────────┘ │
│                                   │                                         │
│      ┌───────────┬───────────┬────┴──────┬───────────┬───────────┐          │
│      ▼           ▼           ▼           ▼           ▼           ▼          │
│     Web      Scheduler  Integration Development   Metadata     Ops &        │
│   Frontend    Service     Service     Service     Service    Monitoring     │
│                                                                             │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                            Worker Node Pool                            │ │
│  │           ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐           │ │
│  │           │Worker-1 │  │Worker-2 │  │Worker-3 │  │Worker-N │           │ │
│  │           └─────────┘  └─────────┘  └─────────┘  └─────────┘           │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                            Infrastructure Layer                             │
│    ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│    │  MySQL   │  │  Redis   │  │  Kafka   │  │    ES    │  │  MinIO   │     │
│    │ Cluster  │  │ Cluster  │  │ Cluster  │  │ Cluster  │  │ Cluster  │     │
│    └──────────┘  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
└─────────────────────────────────────────────────────────────────────────────┘

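3. Multi-Environment Management

3.1 Environment Planning

| Environment | Purpose                   | Resource configuration | Data                      |
|-------------|---------------------------|------------------------|---------------------------|
| DEV         | Development and debugging | Minimal                | Mock data                 |
| TEST        | Functional testing        | Medium                 | Test data                 |
| STAGING     | Pre-release validation    | Close to production    | Subset of production data |
| PROD        | Production                | Full                   | Real data                 |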

3.2 Namespace Isolation

# dev namespace
apiVersion: v1
kind: Namespace
metadata:
  name: datajump-dev
  labels:
    env: dev
---
# test namespace
apiVersion: v1
kind: Namespace
metadata:
  name: datajump-test
  labels:
    env: test
---
# staging namespace
apiVersion: v1
kind: Namespace
metadata:
  name: datajump-staging
  labels:
    env: staging
---
# prod namespace
apiVersion: v1
kind: Namespace
metadata:
  name: datajump-prod
  labels:
    env: prod
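
Namespaces separate the environments but do not cap what each one may consume. As a minimal sketch of how the "minimal" sizing planned for DEV in section 3.1 could be enforced, a ResourceQuota can be attached to the namespace. The quota name and every number below are illustrative assumptions, not values taken from this design.

```yaml
# Hypothetical quota for the dev namespace; all limits are placeholder values.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: datajump-dev-quota   # assumed name
  namespace: datajump-dev
spec:
  hard:
    requests.cpu: "8"        # total CPU requests allowed in the namespace
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "30"               # maximum number of Pods
```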

3.3 Configuration Management

# ConfigMap - application configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: datajump-config
  namespace: datajump-prod
data:
  application.yml: |
    spring:
      profiles:
        active: prod
      datasource:
        url: jdbc:mysql://mysql-cluster:3306/datajump
        hikari:
          maximum-pool-size: 50
      redis:
        cluster:
          nodes: redis-0:6379,redis-1:6379,redis-2:6379
      kafka:
        bootstrap-servers: kafka-0:9092,kafka-1:9092,kafka-2:9092

    scheduler:
      master:
        count: 3
      worker:
        min-count: 5
        max-count: 50

---
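# Secret - sensitive configuration
# The ${...} values are placeholders. Kubernetes does not expand them, so they must be
# substituted (for example with envsubst in the CI pipeline) before the manifest is applied.
apiVersion: v1
kind: Secret
metadata:
  name: datajump-secrets
  namespace: datajump-prod
type: Opaque
stringData:
  db-password: ${DB_PASSWORD}
  redis-password: ${REDIS_PASSWORD}
  jwt-secret: ${JWT_SECRET}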

4. Service Deployment

4.1 Deployment Configuration

# Scheduler service: Master
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-master
  namespace: datajump-prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scheduler-master
  template:
    metadata:
      labels:
        app: scheduler-master
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: scheduler-master
              topologyKey: kubernetes.io/hostname
      containers:
        - name: scheduler-master
          image: datajump/scheduler-master:v1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: datajump-secrets
                  key: db-password
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /app/config
      volumes:
        - name: config
          configMap:
            name: datajump-config
---
# Scheduler service: Worker (autoscaled by the HPA)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scheduler-worker
  namespace: datajump-prod
spec:
  replicas: 5  # initial value only; the HPA below adjusts the replica count at runtime
  selector:
    matchLabels:
      app: scheduler-worker
  template:
    metadata:
      labels:
        app: scheduler-worker
    spec:
      containers:
        - name: scheduler-worker
          image: datajump/scheduler-worker:v1.0.0
          ports:
            - containerPort: 8081
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8081
            initialDelaySeconds: 30
            periodSeconds: 10
---
# Worker HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scheduler-worker-hpa
  namespace: datajump-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scheduler-worker
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

4.2 Service Configuration

# Scheduler service
apiVersion: v1
kind: Service
metadata:
  name: scheduler-master
  namespace: datajump-prod
spec:
  selector:
    app: scheduler-master
  ports:
    - port: 8080
      targetPort: 8080
  type: ClusterIP
---
# Frontend service
apiVersion: v1
kind: Service
metadata:
  name: datajump-web
  namespace: datajump-prod
spec:
  selector:
    app: datajump-web
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP
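
The Ingress in section 4.3 routes /api to a Service named api-gateway on port 8080, which is not defined above. A sketch of what that Service might look like follows, assuming the gateway Pods carry the label app: api-gateway and listen on 8080; both are assumptions, since the gateway Deployment is not part of this document.

```yaml
# API gateway (hypothetical; the selector label and target port are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: api-gateway
  namespace: datajump-prod
spec:
  selector:
    app: api-gateway
  ports:
    - port: 8080          # port referenced by the Ingress backend
      targetPort: 8080    # container port of the gateway Pods
  type: ClusterIP
```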

4.3 Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: datajump-ingress
  namespace: datajump-prod
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  # ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - datajump.example.com
      secretName: datajump-tls
  rules:
    - host: datajump.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: datajump-web
                port:
                  number: 80
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 8080

5. Infrastructure Deployment

5.1 MySQL Cluster

# Uses the MySQL Operator
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mysql-cluster
  namespace: datajump-prod
spec:
  instances: 3
  router:
    instances: 2
  tlsUseSelfSigned: true
  secretName: mysql-root-secret
  datadirVolumeClaimTemplate:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 100Gi
    storageClassName: fast-ssd

5.2 Redis Cluster

# Uses the Redis Operator
apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: RedisCluster
metadata:
  name: redis-cluster
  namespace: datajump-prod
spec:
  clusterSize: 3
  clusterVersion: v7
  persistenceEnabled: true
  kubernetesConfig:
    image: redis:7.0-alpine
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi

5.3 Kafka Cluster

# Uses the Strimzi Operator
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-cluster
  namespace: datajump-prod
spec:
  kafka:
    version: 3.5.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          class: fast-ssd
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: fast-ssd

5.4 Elasticsearch Cluster

# Uses the ECK Operator
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: es-cluster
  namespace: datajump-prod
spec:
  version: 8.10.0
  nodeSets:
    - name: master
      count: 3
      config:
        node.roles: ["master"]
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: fast-ssd
    - name: data
      count: 5
      config:
        node.roles: ["data", "ingest"]
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 500Gi
            storageClassName: fast-ssd

6. Monitoring and Observability

6.1 Prometheus Monitoring

# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: datajump-monitor
  namespace: datajump-prod
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: datajump
  endpoints:
    - port: metrics
      interval: 15s
      path: /actuator/prometheus
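
A ServiceMonitor discovers Services rather than Pods, so each target Service needs the selected label and a port whose name matches `port: metrics`. The Services in section 4.2 define neither. A sketch of a matching Service for scheduler-master follows, assuming the actuator endpoint is served on the application port 8080.

```yaml
# Sketch: scheduler-master Service made discoverable by the ServiceMonitor above.
apiVersion: v1
kind: Service
metadata:
  name: scheduler-master
  namespace: datajump-prod
  labels:
    app.kubernetes.io/part-of: datajump   # matched by the ServiceMonitor selector
spec:
  selector:
    app: scheduler-master
  ports:
    - name: metrics        # port name referenced by the ServiceMonitor endpoint
      port: 8080           # assumption: metrics share the application port
      targetPort: 8080
  type: ClusterIP
```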

6.2 Grafana Dashboards

{
  "dashboard": {
    "title": "DataJump Overview",
    "panels": [
      {
        "title": "Task Success Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(task_success_total[5m])) / sum(rate(task_total[5m])) * 100"
          }
        ]
      },
      {
        "title": "Running Tasks",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(scheduler_running_tasks)"
          }
        ]
      },
      {
        "title": "Worker CPU Usage",
        "type": "timeseries",
        "targets": [
          {
            "expr": "avg(rate(container_cpu_usage_seconds_total{pod=~\"scheduler-worker.*\"}[5m])) * 100"
          }
        ]
      }
    ]
  }
}
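
The success-rate expression from the dashboard can also drive an alert. The following PrometheusRule is a minimal sketch that assumes the Prometheus Operator is installed (as the ServiceMonitor in section 6.1 implies); the rule name, the 95% threshold, the 10-minute duration, and the severity label are all assumptions.

```yaml
# Hypothetical alert built on the "Task Success Rate" expression.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: datajump-alerts          # assumed name
  namespace: datajump-prod
spec:
  groups:
    - name: datajump.tasks
      rules:
        - alert: TaskSuccessRateLow
          # Same ratio as the gauge panel, compared against an assumed threshold
          expr: sum(rate(task_success_total[5m])) / sum(rate(task_total[5m])) * 100 < 95
          for: 10m               # must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "Task success rate has been below 95% for 10 minutes"
```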

6.3 Log Collection

# Fluent Bit DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: datajump-prod
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.1
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config
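
The DaemonSet mounts a ConfigMap named fluent-bit-config that is not shown. A minimal sketch of it follows, under several assumptions: container logs are reachable under /var/log/containers, the target is the ECK-managed cluster from section 5.4 (whose HTTP Service would be named es-cluster-es-http), and the elastic user's password is injected as a hypothetical ES_PASSWORD environment variable that the DaemonSet above does not yet define.

```yaml
# Hypothetical Fluent Bit configuration; host, credentials, and TLS settings are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: datajump-prod
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        multiline.parser  docker, cri
        Tag               kube.*

    [OUTPUT]
        Name               es
        Match              kube.*
        Host               es-cluster-es-http
        Port               9200
        tls                On
        tls.verify         Off
        HTTP_User          elastic
        HTTP_Passwd        ${ES_PASSWORD}
        Logstash_Format    On
        Suppress_Type_Name On
```

Disabling tls.verify is only a shortcut for the self-signed certificate that ECK generates by default; mounting the cluster CA would be the safer choice.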

7. CI/CD Pipeline

7.1 GitLab CI Configuration

# .gitlab-ci.yml
stages:
  - build
  - test
  - security
  - deploy

variables:
  DOCKER_REGISTRY: registry.example.com
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA

build:
  stage: build
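  image: docker:24
  services:
    - docker:24-dind  # the Docker daemon runs as a service container; the job image only needs the CLI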
  script:
    - docker build -t $DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG .
    - docker push $DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG

test:
  stage: test
  image: maven:3.9-eclipse-temurin-17
  script:
    - mvn test
  artifacts:
    reports:
      junit: target/surefire-reports/*.xml

security-scan:
  stage: security
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]  # override the image's trivy entrypoint so the runner can start a shell
  script:
    - trivy image $DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG

deploy-dev:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]  # override the image's kubectl entrypoint so the runner can start a shell
  environment:
    name: dev
  script:
    - kubectl set image deployment/$SERVICE_NAME $SERVICE_NAME=$DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG -n datajump-dev
  only:
    - develop

deploy-prod:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]  # override the image's kubectl entrypoint so the runner can start a shell
  environment:
    name: production
  script:
    - kubectl set image deployment/$SERVICE_NAME $SERVICE_NAME=$DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG -n datajump-prod
  when: manual
  only:
    - main
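
`kubectl set image` returns as soon as the Deployment object is updated, so a deploy job can pass even if the new Pods never become ready. A sketch of the dev job with a rollout check appended is shown below; the 180-second timeout is an assumption, and $SERVICE_NAME is expected to be supplied per service just as in the jobs above.

```yaml
# Sketch: dev deployment that fails the job when the rollout does not complete.
deploy-dev:
  stage: deploy
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  environment:
    name: dev
  script:
    - kubectl set image deployment/$SERVICE_NAME $SERVICE_NAME=$DOCKER_REGISTRY/datajump/$SERVICE_NAME:$IMAGE_TAG -n datajump-dev
    # Blocks until the rollout finishes; exits non-zero on timeout or failure
    - kubectl rollout status deployment/$SERVICE_NAME -n datajump-dev --timeout=180s
  only:
    - develop
```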

7.2 Helm Chart

# Chart.yaml
apiVersion: v2
name: datajump
description: DataJump - One-stop Big Data Platform
version: 1.0.0
appVersion: "1.0.0"

# values.yaml
replicaCount:
  master: 3
  worker: 5
  api: 3

image:
  registry: registry.example.com
  pullPolicy: IfNotPresent

resources:
  master:
    requests:
      memory: "2Gi"
      cpu: "1000m"
    limits:
      memory: "4Gi"
      cpu: "2000m"
  worker:
    requests:
      memory: "4Gi"
      cpu: "2000m"
    limits:
      memory: "8Gi"
      cpu: "4000m"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70

mysql:
  enabled: true
  architecture: replication
  primary:
    persistence:
      size: 100Gi

redis:
  enabled: true
  architecture: cluster
  cluster:
    nodes: 6

kafka:
  enabled: true
  replicaCount: 3
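
The values above match the production sizing. For the smaller environments planned in section 3.1, a per-environment override file is one option. The sketch below is a hypothetical values-dev.yaml that reuses the same keys; the file name and all numbers are assumptions.

```yaml
# values-dev.yaml (hypothetical override file; every number is a placeholder)
replicaCount:
  master: 1
  worker: 1
  api: 1

resources:
  master:
    requests:
      memory: "512Mi"
      cpu: "250m"
    limits:
      memory: "1Gi"
      cpu: "500m"
  worker:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"

autoscaling:
  enabled: false   # a fixed single worker is assumed to be enough for dev
```

Such a file could be layered over the defaults with `helm upgrade --install datajump ./datajump -n datajump-dev -f values-dev.yaml`, where the chart path ./datajump is an assumption.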

8. Disaster Recovery

8.1 Data Backup

# Scheduled backup CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mysql-backup
  namespace: datajump-prod
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: mysql:8.0
              command:
                - /bin/sh
                - -c
                - |
                  mysqldump -h mysql-cluster -u root -p$MYSQL_ROOT_PASSWORD --all-databases | \
                  gzip > /backup/datajump-$(date +%Y%m%d).sql.gz
              env:
                - name: MYSQL_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: mysql-root-secret
                      key: rootPassword  # MySQL Operator root secrets use the keys rootUser, rootHost, rootPassword
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure

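8.2 Multi-AZ Deployment

# Spread Pods across availability zones (fragment of a Pod template, i.e. spec.template.spec of a Deployment)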
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: scheduler-master
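
Spreading Pods across zones protects against zone failures but not against voluntary disruptions such as node drains during upgrades. A PodDisruptionBudget is a common complement. The sketch below assumes that two of the three scheduler-master replicas must stay available; the budget name and that threshold are assumptions.

```yaml
# Hypothetical disruption budget for the scheduler masters.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: scheduler-master-pdb   # assumed name
  namespace: datajump-prod
spec:
  minAvailable: 2              # evictions are refused if fewer than 2 Pods would remain
  selector:
    matchLabels:
      app: scheduler-master
```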

9. Security Configuration

9.1 Network Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: datajump-network-policy
  namespace: datajump-prod
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/part-of: datajump
  policyTypes:
    - Ingress
    - Egress
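  ingress:
    - from:
        # Any Pod in datajump-prod. kubernetes.io/metadata.name is set automatically on
        # every namespace; the namespaces in section 3.2 carry no "name" label.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: datajump-prod
        - podSelector:
            matchLabels:
              app: api-gateway
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: datajump-prod
    # Allow DNS lookups; otherwise Pods under this policy cannot resolve Service names
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53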
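
The policy above only applies to Pods labelled app.kubernetes.io/part-of: datajump, while the Deployments in section 4.1 set only an app label. One way to make sure no Pod in the namespace is left unrestricted is a namespace-wide default-deny policy for inbound traffic, sketched here. The policy name is an assumption, and explicit allow rules (for example for the ingress controller) would still have to be added alongside it.

```yaml
# Hypothetical default-deny policy for all inbound traffic in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress   # assumed name
  namespace: datajump-prod
spec:
  podSelector: {}      # empty selector: applies to every Pod in the namespace
  policyTypes:
    - Ingress          # no ingress rules are listed, so all inbound traffic is denied
```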

9.2 Pod Security Policy

# PodSecurityPolicy was removed in Kubernetes 1.25; this manifest only applies to older clusters.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: datajump-restricted
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:  # required field in the PodSecurityPolicy spec
    rule: RunAsAny
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
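
On clusters running Kubernetes 1.25 or later, where the PodSecurityPolicy API no longer exists, comparable restrictions can be applied through the built-in Pod Security Admission controller by labelling the namespace. The sketch below assumes the "restricted" level is the closest match to the policy above; that mapping is a judgment call, not part of the original design.

```yaml
# Sketch: production namespace with Pod Security Admission labels.
apiVersion: v1
kind: Namespace
metadata:
  name: datajump-prod
  labels:
    env: prod
    pod-security.kubernetes.io/enforce: restricted   # reject Pods that violate the level
    pod-security.kubernetes.io/audit: restricted     # record violations in the audit log
    pod-security.kubernetes.io/warn: restricted      # return warnings to the client
```

Under the restricted level, hostPath volumes are not allowed, so the Fluent Bit DaemonSet from section 6.3 would likely need to run in a separate namespace with a less strict level.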