docs/infrastructure-alibi.md (9 additions, 9 deletions)
@@ -14,11 +14,11 @@ To ensure that results are always reproducible, the machine setup is enforced an
* The user is the only active user on the underlying hardware, eliminating system load that might otherwise be caused by other users.
* The system state corresponds to the one described in the system's initial Puppet manifest. This ensures that no processes or containers from previous users are still running on the hardware, and guarantees a consistent software stack.

-# Installing the AliBI system
+## Installing the AliBI system

The AliBI system relies on a CERN OpenStack VM for the _head node_ (`alibilogin01.cern.ch`) and a bare-metal server as the _compute node_ (`alibicompute01.cern.ch`). The software stack and machine state are formalized using Puppet manifests and fully integrated into the CERN configuration management ecosystem. The setup process is fully described below.

-## AliBI head node
+### AliBI head node

* On `aiadm.cern.ch`, enter the OpenStack _Release Testing_ environment by running
@@ -38,11 +38,11 @@ The AliBI system relies on a CERN OpenStack VM for the _head node_ (`alibilogin0
openstack server set --property landb-alias=alibi alibilogin01
```
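
As a hedged illustration (not part of the original recipe), one way to verify from the same OpenStack environment that the `landb-alias` property was applied could be:

```bash
# Show the VM's properties; landb-alias=alibi should be listed.
openstack server show alibilogin01 -c name -c properties

# Once LANDB/DNS has propagated (this may take a while), the alias should resolve.
host alibi.cern.ch
```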

-## AliBI compute node
+### AliBI compute node

The compute node is a physical machine outside the CERN datacenter, which makes provisioning a bit more complicated.

-### Registrations (only for first time set up)
+#### Registrations (only for first time set up)

* Register the machine in CERN [LANDB](https://network.cern.ch)
* Create an entry for the machine in [Foreman](https://judy.cern.ch/):
@@ -75,7 +75,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes
* Enabled: `YES`
* Hardware Model: `ProLiant DL380 Gen10`

-### Prepare installation
+#### Prepare installation

* Based on the Foreman entry, a provisioning template in the form of a _kickstart file_ is generated; it is updated every time the configuration in Foreman changes.
* Since the compute node is outside of the CERN datacenter, it does not have direct access to this file, so it needs to be downloaded and self-hosted for the duration of the installation (see the sketch below).
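
A minimal sketch of the self-hosting step, assuming the kickstart file has already been saved locally (e.g. from the host's page in Foreman) as `alibicompute01.ks` (file name and port below are placeholders):

```bash
# Serve the kickstart file temporarily over HTTP from a machine that the
# compute node can reach during installation.
mkdir -p /tmp/ks && cp alibicompute01.ks /tmp/ks/
cd /tmp/ks && python3 -m http.server 8080
# The installer can then fetch it from http://<this-host>:8080/alibicompute01.ks
```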
@@ -94,7 +94,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes
* Set Foreman environment to `alibuild/alibi`.

-### Installation
+#### Installation

* Get IPMI/iLO access to the physical server
* Boot the machine in network boot (PXE) mode (see the sketch below)
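
As a hedged sketch (the BMC hostname and credentials below are placeholders; the iLO web console works just as well), the PXE boot could be triggered with `ipmitool`:

```bash
# Select PXE as the next boot device on the remote BMC, then power-cycle.
ipmitool -I lanplus -H alibicompute01-ipmi.cern.ch -U <user> -P '<password>' \
    chassis bootdev pxe
ipmitool -I lanplus -H alibicompute01-ipmi.cern.ch -U <user> -P '<password>' \
    chassis power cycle
```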
@@ -123,7 +123,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes
* At this point you will notice that the `post installation` section of the installation has not been completed automatically. Since all commands are bash, they can be executed manually by copy & paste, or extracted and run as a separate script (see the sketch after this list).
* Afterwards, the machine state should reflect the Puppet manifests and can be fully monitored using the CERN Foreman infrastructure.
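
A minimal sketch of running the post-installation section manually, assuming the kickstart file is available on the machine as `alibicompute01.ks` (a placeholder name) and uses the standard `%post` / `%end` delimiters:

```bash
# Extract the %post section (without its delimiters) into a script and run it.
sed -n '/^%post/,/^%end/p' alibicompute01.ks | sed '1d;$d' > post-install.sh
bash -ex post-install.sh
```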

-## Installation of packages via puppet
+### Installation of packages via puppet

* Packages are installed via Puppet. The configuration / manifests are taken from a special `alibi` branch of a central git repository.
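
As a hedged sketch, once the node is registered in Foreman with the right hostgroup and environment, installing or updating the packages amounts to triggering a Puppet run on the node:

```bash
# Apply the manifests (the alibi branch/environment is selected via Foreman)
# and report what changed.
puppet agent -t
```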

docs/infrastructure-frontend.md (6 additions, 6 deletions)
@@ -4,7 +4,7 @@ layout: main
categories: infrastructure
---

-# Frontend setup
+## Frontend setup

The ALICE build infrastructure is exposed via SSO.
@@ -14,9 +14,9 @@ runs apache and does the reverse proxying to the actual service.
The machine is set up in the CERN/IT Puppet + OpenStack facility in the hostgroup
`alibuild/frontend`.

-# Disaster recovering
+## Disaster recovering

-## Starting the frontend
+### Starting the frontend

The quick recipe to restart the frontend is:
@@ -47,7 +47,7 @@ The quick recipe to restart the frontend is:
and they need to have the right IP address registered there.

-## Enabling / disabling one host in the load balancing
+### Enabling / disabling one host in the load balancing

Machines in the `alibuild/frontend` hostgroup participate in a load-balanced DNS alias. In order to do so, they must be in roger state `production`. To set this:
@@ -68,7 +68,7 @@ You can check their load balanced score with:
/usr/local/sbin/lbclient -d TRACE
```
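
As an illustrative sketch (the hostname is a placeholder and the exact `roger` options may differ; check `roger --help` on `aiadm.cern.ch`), inspecting and setting the roger state might look like:

```bash
# Show the current roger state of one of the frontend machines.
roger show alibuildfrontend01.cern.ch

# Move it (back) into production so it rejoins the load-balanced alias.
roger update alibuildfrontend01.cern.ch --appstate production
```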

-# CERN Single Sign-On (SSO) authentication
+## CERN Single Sign-On (SSO) authentication

Some web applications use Apache's OIDC support to authenticate with CERN SSO. Apache then sets [various `OIDC_CLAIM_*` headers][headers] on the forwarded requests.
@@ -77,7 +77,7 @@ See also [the CERN SSO documentation][cern-sso].
Applications must be configured on the CERN SSO side through the [Application Portal][app-portal] and on the ALICE side through our Puppet-generated Apache configuration, specifically the file `it-puppet-hostgroup-alibuild/data/hostgroup/alibuild/frontend.yaml`.

docs/infrastructure-known-tradeoffs.md (9 additions, 9 deletions)
@@ -6,9 +6,9 @@ categories: infrastructure
This is a list of known issues or tradeoffs in our build infrastructure. We document them and try very hard to find a viable solution to all of them; however, so far each solution has either been unaffordable or has had even worse drawbacks, so we decided to simply live with these issues when they happen. Any contribution to improve the situation is welcome.

-# PR checking
+## PR checking

-## PR checking dies due to external services (e.g. CCDB) being down
+### PR checking dies due to external services (e.g. CCDB) being down

Sometimes checks fail because external services are down. Dealing with them in a proper way would imply mocking the service, but:
17
As a mitigation we run our tests continuously, rebuilding broken tests when there are no pending ones.

-## PR checks can affect each other, even if unrelated
+### PR checks can affect each other, even if unrelated

In order to save time, we run our checks in the same build area, so that we rebuild only what changed between one build and another. Due to limitations in CMake or undetected missing dependencies, we can however end up in a state where a given test interferes with another, in particular:
* When libraries / dictionaries are moved around
* When a missing / implicit dependency is present and the order in which PRs are built in the PR checker happens, by chance, to be a working one.

-## PR checks introduce relocation issues a few days after merging
+### PR checks introduce relocation issues a few days after merging

In order to save time, PR checkers do their best to reuse pre-built tarballs which are downloaded from a central server. However, by design this requires packages to be fully relocatable, in particular:
@@ -36,23 +36,23 @@ Failing that the net result will be that a relocation issue will be present and
Rebuilding a PR twice in two different locations is deemed too expensive.
Doing proper sandboxing requires changing the tools we have to something like Bazel.

-## Errors appear in the PR checker which are not there in local builds
+### Errors appear in the PR checker which are not there in local builds

Some of the recipes use environment variables (in particular `ALIBUILD_O2_TESTS`) to trigger different behaviors, e.g. to increase the amount of testing being done or to enable / disable special features. We should try to minimize their usage; unfortunately, they are still widely used.

-## PRs take long to complete all tests
+### PRs take long to complete all tests

By construction you are limited by the longest path, and even if we try to minimize the amount of work done, one ultimately has to choose between minimizing false negatives and performance. Work is currently being done to reduce unneeded tests, in particular for the analysis. A proper solution would be to use a tool which requires specifying all the hidden dependencies and takes advantage of that. However, this most likely means moving away from CMake, and so far that has not been considered a viable solution.

-# RPM generation
+## RPM generation

-## Updatable RPM packages have conflicting files
+### Updatable RPM packages have conflicting files

Updatable RPMs are generated from the tarballs of the various packages which are also deployed on CVMFS. Those tarballs are built and installed in a separate, per-package location, in order to allow multiple coexisting installations. This means that conflicting files can be introduced without any prior warning at RPM generation time. The alternative, i.e. installing everything in a single location, would either move the problem to its conjugate for the CVMFS installation, or mean that what is installed on CVMFS differs from what is packaged in the updatable RPMs, duplicating CI and debugging issues.
## Externals

-## Old / own version of externals
+### Old / own version of externals

Sometimes the externals provided in alidist are either old, or provide a rebuild of a commonly available tool. In general this happens because we still need to support Run 2 production requirements (including ROOT5 and XRootD3) and we prefer to maintain a single set of tools, rather than split our configuration management.

docs/infrastructure-logs.md (4 additions, 4 deletions)
@@ -15,13 +15,13 @@ which is an SSO protected url exposed by machines in the `alibuild/frontend` pup
For the SSO access you need to be an ALICE member, while for the S3 endpoint you need to be in the `alice-vm-admin`
egroup.

-# Essential operation guides
+## Essential operation guides

* [Creating the bucket](#creating-the-bucket)
* [Updating the policy](#updating-the-policy)
* [Accessing the logs programmatically](#accessing-the-logs-programmatically)

-## Creating the bucket
+### Creating the bucket

Creating the bucket should not be needed unless some disaster happens. The current instructions to do so are:
@@ -32,7 +32,7 @@ Creating the bucket should not be needed unless some disaster happens. The curre
* Set the access policy to the contents of `ali-marathon/s3/alice-build-logs-policy.json`.
* Verify that, using the `ali-bot` access_key / secret_key, you can write files (see the sketch below).
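
A minimal sketch of that verification, assuming `s3cmd` is already configured (e.g. via `s3cmd --configure`) with the `ali-bot` credentials and the CERN S3 endpoint:

```bash
# Upload a small probe object, check that it is listed, then remove it again.
echo "write test" > /tmp/probe.txt
s3cmd put /tmp/probe.txt s3://alice-build-logs/probe.txt
s3cmd ls s3://alice-build-logs/
s3cmd del s3://alice-build-logs/probe.txt
```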

-## Updating the policy
+### Updating the policy

If you need to update the S3 access permission policy, e.g. because the frontend IP changes, edit `ali-marathon/s3/alice-build-logs-policy.json` and then apply it to the `s3://alice-build-logs`
If you get an actual reply, rather than permission denied, it means the machine can access the logs.

-## Accessing the logs programmatically
+### Accessing the logs programmatically

Accessing the logs programmatically can be done via any S3-enabled client, e.g. `s3cmd` (command line) or `boto3` (Python). Ask the usual suspects for the access key and secret. An example of how new logs can be pushed via `boto3` is at <https://github.com/alisw/ali-bot/blob/master/report-pr-errors#L175-L194>.
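
For example, a hedged `s3cmd` sketch for reading logs (the object path is a placeholder; the credentials are the ones mentioned above):

```bash
# List the available logs and download one of them.
s3cmd ls s3://alice-build-logs/
s3cmd get s3://alice-build-logs/<path/to/log.txt> ./log.txt
```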