Commit bf3c490

Fix navigation trees (#133)
They look a bit awkward since the refactor; this change should make them more homogeneous.
Parent: 307a0d0

23 files changed: 130 additions & 127 deletions

docs/aliphysics-ci.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,15 +4,15 @@ layout: main
 categories: developer
 ---

-# Checking the status of the daily builds
+## Checking the status of the daily builds

 Daily builds are built with Jenkins. You can check the status of the current
 build and the logs of previous ones by going to:

 <https://alijenkins.cern.ch/job/daily-builds/job/daily-aliphysics-github/>


-# Trying out a Release Validation
+## Trying out a Release Validation

 If a release validation fails, one can try out the release candidate by sourcing the
 nightly environment from CVMFS:
```
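For context, a sketch of what sourcing the nightly environment might look like in practice. The CVMFS repository path and the package tag below are assumptions, not taken from this page; use the ones given in the release-validation announcement.

```shell
# Sketch only: enter a nightly AliPhysics environment from CVMFS via alienv.
# Both the CVMFS path and the package tag are assumptions.
load_nightly() {
    source /cvmfs/alice.cern.ch/etc/login.sh   # standard ALICE CVMFS setup
    alienv enter AliPhysics/vAN-20240101-1     # hypothetical nightly tag
}
```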

docs/infrastructure-alibi-user-guide.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -112,7 +112,7 @@ TBD

 ## FAQ

-### Which Local Directories Can I Use on The Compute Node?
+### Which Local Directories Can I Use on The Compute Node?
 Fast, local directories are available under
 ```
 $L_HOME=/home/$USER
````

docs/infrastructure-alibi.md

Lines changed: 9 additions & 9 deletions
````diff
@@ -14,11 +14,11 @@ To ensure that results are always reproducible, the machine setup is enforced an
 * The user is the only active user on the underlying hardware, eliminating system load that might otherwise be caused by other users.
 * The system state corresponds to the one described in the system's initial puppet manifest. This ensures that no processes or containers from previous users are still running on the hardware, as well as a consistent software stack.

-# Installing the AliBI system
+## Installing the AliBI system

 The AliBI system relies on a CERN OpenStack VM for the _head node_ (`alibilogin01.cern.ch`) and a bare metal server as _compute node_ (`alibicompute01.cern.ch`). The software stack and machine state are formalized using puppet manifests and fully integrated in the CERN configuration management ecosystem. The setup process is fully described below.

-## AliBI head node
+### AliBI head node

 * On `aiadm.cern.ch` enter the OpenStack _Release Testing_ environment by running
@@ -38,11 +38,11 @@ The AliBI system relies on a CERN OpenStack VM for the _head node_ (`alibilogin0
   openstack server set --property landb-alias=alibi alibilogin01
   ```

-## AliBI compute node
+### AliBI compute node

 The compute node is a physical machine outside the CERN datacenter, which makes provisioning a bit more complicated.

-### Registrations (only for first time set up)
+#### Registrations (only for first time set up)

 * Register the machine in CERN [LANDB](https://network.cern.ch)
 * Create an entry for the machine in [Foreman](https://judy.cern.ch/):
@@ -75,7 +75,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes
   * Enabled: `YES`
   * Hardware Model: `ProLiant DL380 Gen10`

-### Prepare installation
+#### Prepare installation

 * Based on the Foreman entry, a provisioning template in the form of a _kickstart file_ is generated and updated every time the configuration in Foreman is changed.
 * Since the compute node is outside of the CERN datacenter it does not have direct access to this file, so it needs to be downloaded and self-hosted for the duration of the installation.
@@ -94,7 +94,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes

 * Set the Foreman environment to `alibuild/alibi`.

-### Installation
+#### Installation

 * Get IPMI/ILO access to the physical server
 * Boot the machine in network boot (PXE)
@@ -123,7 +123,7 @@ The compute node is a physical machine outside the CERN datacenter, which makes
 * At this point you will notice that the `post installation` section of the installation has not been completed automatically. Since all commands are bash, they can be executed manually by copy & paste, or extracted and run as a separate script.
 * Afterwards the machine state should reflect the puppet manifests and can be fully monitored using the CERN Foreman infrastructure.

-## Installation of packages via puppet
+### Installation of packages via puppet

 * Packages are installed via puppet. The configuration / manifests are taken from a special `alibi` branch of a central git repository:
   [PUPPET-HOSTGROUP](https://gitlab.cern.ch/ai/it-puppet-hostgroup-alibuild/blob/alibi).
@@ -134,9 +134,9 @@ The compute node is a physical machine outside the CERN datacenter, which makes
   puppet agent -t -v
   ```

-## Troubleshooting
+### Troubleshooting

-### Symptom: No allocations can be made, node stuck in "drain" state
+#### Symptom: No allocations can be made, node stuck in "drain" state

 In case `sinfo` shows:
````
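For the drain symptom above, a common Slurm remedy is to resume the node. A hedged sketch, assuming standard Slurm tooling and admin rights; the node name is hypothetical:

```shell
# Sketch: bring a Slurm node out of the "drain" state.
# The node name is hypothetical; read the real one off `sinfo`.
undrain_node() {
    node="$1"                                    # e.g. alibicompute01 (assumed)
    scontrol show node "$node" | grep -i reason  # first see why it was drained
    scontrol update NodeName="$node" State=RESUME
}
```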

docs/infrastructure-alienvobox.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -4,7 +4,7 @@ layout: main
 categories: infrastructure
 ---

-# Register the VOBOX
+## Register the VOBOX

 An AliEn VOBOX first has to be registered in the AliEn LDAP. AliEn
 administrators can do that with two pieces of information:
@@ -15,7 +15,7 @@ administrators can do that with two pieces of information:
 A site certificate and an associated private key will be created.


-# Store credentials in Vault
+## Store credentials in Vault

 First create a policy:

@@ -55,7 +55,7 @@ Alternatively we can read it from a file:
 vault write secret/mysitevobox/host_cert value=@usercert.pem


-# Run the Ansible configuration
+## Run the Ansible configuration

 Our configuration is stored in Ansible. To run it, limiting the run to
 the AliEn VOBOXes only, do the following from the private configuration folder:
```
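The actual command lies outside this hunk. As a generic, hypothetical sketch of an Ansible run limited to one host group (the playbook and group names are assumptions, not from the repository):

```shell
# Hypothetical sketch: limit an Ansible run to the VOBOX host group.
# `site.yml` and the `voboxes` group name are assumptions.
run_voboxes_only() {
    ansible-playbook site.yml --limit voboxes --check --diff  # dry run first
    ansible-playbook site.yml --limit voboxes                 # then for real
}
```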

docs/infrastructure-docker-packer.md

Lines changed: 6 additions & 5 deletions
````diff
@@ -12,13 +12,13 @@ The docker image definitions are available in
 <https://registry.cern.ch/>


-# Rebuilding a Docker image
+## Rebuilding a Docker image

 If an image definition has changed, it must be rebuilt and pushed to the proper
 registry for Nomad to use it in new job allocations.


-## Packer-defined images
+### Packer-defined images

 Newer images have a `packer.json` file, which allows them to be built using
 Hashicorp Packer
@@ -36,7 +36,7 @@ packer build packer.json
 docker push registry.cern.ch/alisw/<image-name>
 ```

-## Dockerfile-defined images
+### Dockerfile-defined images

 Older images without a `packer.json` can be built with:

@@ -45,13 +45,14 @@ docker build -t alisw/<image-name> <image-name>
 ```


-# Conventions
+## Conventions

 The CI uses an image named `<arch>-builder`, where `<arch>` is the architecture of the
 image. The CI system will automatically select the correct image for a given
 architecture, so the image name must match the format exactly.

-https://github.com/alisw/ci-jobs/blob/master/ci/ci.nomad
+The code to infer the image names is
+[here](https://github.com/alisw/ci-jobs/blob/master/ci/ci.nomad).

 [ci-jobs]: https://github.com/alisw/ci-jobs
 [packer]: https://www.packer.io/
````
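Putting the `<arch>-builder` convention and the build/push commands above together, a small sketch; the concrete image name is hypothetical:

```shell
# Compose the CI image name for an architecture, per the <arch>-builder convention.
builder_image() {
    printf 'registry.cern.ch/alisw/%s-builder\n' "$1"
}

# Sketch: rebuild a packer-defined image and push it, as described above.
# Assumes packer and docker are installed and you are logged in to registry.cern.ch.
rebuild_and_push() {
    image="$1"                     # e.g. slc9-builder (hypothetical name)
    cd "$image" || return 1        # directory containing packer.json
    packer build packer.json
    docker push "registry.cern.ch/alisw/$image"
}
```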

docs/infrastructure-frontend.md

Lines changed: 6 additions & 6 deletions
````diff
@@ -4,7 +4,7 @@ layout: main
 categories: infrastructure
 ---

-# Frontend setup
+## Frontend setup

 The ALICE build infrastructure is exposed via SSO.

@@ -14,9 +14,9 @@ runs apache and does the reverse proxying to the actual service.
 The machine is set up in the CERN/IT puppet + OpenStack facility in the hostgroup
 `alibuild/frontend`.

-# Disaster recovery
+## Disaster recovery

-## Starting the frontend
+### Starting the frontend

 The quick recipe to restart the frontend is:

@@ -47,7 +47,7 @@ The quick recipe to restart the frontend is:

 and they need to have the right IP address registered there.

-## Enabling / disabling one host in the load balancing
+### Enabling / disabling one host in the load balancing

 Machines in the `alibuild/frontend` hostgroup participate in a load balanced DNS alias. In order to do so they must be in roger state `production`. To do so:

@@ -68,7 +68,7 @@ You can check their load balanced score with:
 /usr/local/sbin/lbclient -d TRACE
 ```

-# CERN Single Sign-On (SSO) authentication
+## CERN Single Sign-On (SSO) authentication

 Some web applications use Apache's OIDC support to authenticate with CERN SSO. Apache then sets [various `OIDC_CLAIM_*` headers][headers] on the forwarded requests.

@@ -77,7 +77,7 @@ See also [the CERN SSO documentation][cern-sso].
 [headers]: https://auth.docs.cern.ch/user-documentation/oidc/config/
 [cern-sso]: https://auth.docs.cern.ch/applications/application-configuration/

-## Adding a new application
+### Adding a new application

 Applications must be configured on the CERN SSO side through the [Application Portal][app-portal] and on the ALICE side through our Puppet-generated Apache configuration, specifically the file `it-puppet-hostgroup-alibuild/data/hostgroup/alibuild/frontend.yaml`.
````
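The "Enabling / disabling one host in the load balancing" recipe earlier in this file can be sketched as follows; the roger flag syntax and the host name are assumptions, so check `roger update --help` on aiadm before relying on it:

```shell
# Sketch: move a frontend host to roger state "production" so it rejoins the
# load-balanced alias, then inspect its score. Flag names are assumptions.
enable_in_lb() {
    host="$1"                              # e.g. alibuildfrontend01 (assumed)
    roger update --appstate production "$host"
    /usr/local/sbin/lbclient -d TRACE      # verify the load-balancing score
}
```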
8383

docs/infrastructure-jenkins.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -13,7 +13,7 @@ Master nodes are configured through Puppet in the file:

 - [/code/manifests/alibuild/mesos/slave/jenkins.pp](https://gitlab.cern.ch/ai/it-puppet-hostgroup-alibuild/blob/master/code/manifests/mesos/slave/jenkins.pp)

-# Essential Operation Guides:
+## Essential Operation Guides:

 * [Create the Jenkins](#create-the-jenkins-master-only-in-case-of-disaster-recovery)
 * [Starting Jenkins](#starting-jenkins)
@@ -96,7 +96,7 @@ The step by step guide is:

 The `<parameters>` are formatted as in a URL: `<name>=<value>&<name2>=<value2>`.

-## Creating Jenkins agents with guaranteed resources
+### Creating Jenkins agents with guaranteed resources

 This is the main way we deploy Jenkins builders.
 The advantage of fixed builders is that we are never in a situation where, by accident, there is not enough space on the cluster to run a Jenkins build.
@@ -136,7 +136,7 @@ levant render -var-file <name>.yaml | nomad job plan - # make sure job can
 levant render -var-file <name>.yaml | nomad job run - # actually run job
 ```

-## Gotchas and issues:
+### Gotchas and issues:

 * On some systems, the CERN CA is not available by default. You can overcome this by either:
   * Going to <https://ca.cern.ch> and installing all the required CA certificates. In general this is what is needed on macOS.
````
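The levant/nomad commands shown in the hunk above can be wrapped in a small helper. A sketch, assuming `levant` and `nomad` are on `PATH` and `NOMAD_ADDR` points at the cluster; the var-file name is hypothetical:

```shell
# Sketch: render, plan, and submit a fixed Jenkins builder job via levant + nomad.
deploy_builder() {
    varfile="$1"   # e.g. builder01.yaml (hypothetical), from the ci-jobs repo
    levant render -var-file "$varfile" | nomad job plan -  # review changes first
    levant render -var-file "$varfile" | nomad job run -   # then submit
}
```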

docs/infrastructure-known-tradeoffs.md

Lines changed: 9 additions & 9 deletions
```diff
@@ -6,9 +6,9 @@ categories: infrastructure

 This is a list of known issues or tradeoffs in our build infrastructure. We document them and try very hard to find a viable solution to all of them; however, so far the solutions seem to be unaffordable or to have even worse drawbacks, so we decided to simply live with them when they happen. Any contribution to improve the situation is welcome.

-# PR checking
+## PR checking

-## PR checking dies due to external services (e.g. CCDB) being down
+### PR checking dies due to external services (e.g. CCDB) being down

 Sometimes checks fail because external services are down. Dealing with them in a proper way would imply mocking the service, but:

@@ -17,14 +17,14 @@ Sometimes checks fail because external services are down. Dealing with them in a

 As a mitigation we run our tests continuously, rebuilding broken tests when there are no pending ones.

-## PR checks can affect each other, even if unrelated
+### PR checks can affect each other, even if unrelated

 In order to save time, we check our tests in the same build area, so that we rebuild only the changes between one build and another. Due to limitations in CMake or undetected missing dependencies, we can however end up in a state where a given test interferes with another, in particular:

 * When libraries / dictionaries are moved around
 * When a missing / implicit dependency is present and the order in which PRs are built in the PR checker is by chance a working one.

-## PR checks introduce relocation issues a few days after merging
+### PR checks introduce relocation issues a few days after merging

 In order to save time, PR checkers do their best to reuse pre-built tarballs which are downloaded from a central server. However, by design this requires having fully relocatable packages, in particular:

@@ -36,23 +36,23 @@ Failing that, the net result will be that a relocation issue will be present and
 Rebuilding a PR twice in two different locations is deemed too expensive.
 Doing proper sandboxing requires changing our tools to something like Bazel.

-## Errors appear in the PR checker which are not there in local builds
+### Errors appear in the PR checker which are not there in local builds

 Some of the recipes use environment variables (in particular `ALIBUILD_O2_TESTS`) to trigger different behaviors, e.g. to increase the amount of testing being done and enable / disable special features. We should try to minimize their usage; however, unfortunately they are still widely used.

-## PRs take long to complete all tests
+### PRs take long to complete all tests

 By construction you are limited by the longest path, and even if we try to minimize the amount of work done, one ultimately has to choose between minimizing false negatives and performance. Work is currently being done to reduce unneeded tests, in particular for the analysis. A proper solution would be to use a tool which imposes specifying all the hidden dependencies and takes advantage of that. However, this most likely means moving away from CMake, and so far that was not considered viable.

-# RPM generation
+## RPM generation

-## Updatable RPM packages have conflicting files
+### Updatable RPM packages have conflicting files

 Updatable RPMs are generated from the tarballs of the various packages which are also deployed on CVMFS. Those tarballs are built and installed in a separate per-package location, in order to allow multiple coexisting installations. This means that conflicting files can be introduced without any prior warning at RPM generation time. The alternative, i.e. installing everything in a single location, would either move the problem to its conjugate for the CVMFS installation, or it would mean that what is installed on CVMFS differs from what is packaged in the updatable RPMs, duplicating CI and debugging issues.

 ## Externals

-## Old / own version of externals
+### Old / own version of externals

 Sometimes the externals provided in alidist are either old, or provide a rebuild of a commonly available tool. In general this happens because we still need to support Run 2 production requirements (including ROOT5 and XRootD3) and we prefer to maintain a single set of tools, rather than split our configuration management.
```

docs/infrastructure-logs.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -15,13 +15,13 @@ which is an SSO protected url exposed by machines in the `alibuild/frontend` pup

 For the SSO access you need to be an ALICE member, while for the S3 endpoint you need to be in the `alice-vm-admin`
 egroup.

-# Essential operation guides
+## Essential operation guides

 * [Creating the bucket](#creating-the-bucket)
 * [Updating the policy](#updating-the-policy)
 * [Accessing the logs programmatically](#accessing-the-logs-programmatically)

-## Creating the bucket
+### Creating the bucket

 Creating the bucket should not be needed unless some disaster happens. The current instructions to do so are:

@@ -32,7 +32,7 @@ Creating the bucket should not be needed unless some disaster happens. The curre

 * Set the access policy to the contents of `ali-marathon/s3/alice-build-logs-policy.json`.
 * Verify that using the `ali-bot` access_key / secret_key you can write files.

-## Updating the policy
+### Updating the policy

 In case you need to update the S3 access permission policy, e.g. if the frontend IP changes, you need to do so in `ali-marathon/s3/alice-build-logs-policy.json` and then apply it to the `s3://alice-build-logs`

@@ -46,6 +46,6 @@ curl alice-build-logs.s3.cern.ch/test.txt

 If you get an actual reply, rather than permission denied, it means the machine can access the logs.

-## Accessing the logs programmatically
+### Accessing the logs programmatically

 Accessing the logs programmatically can be done via any S3-enabled client, e.g. `s3cmd` (command line) or `boto3` (Python). Ask the usual suspects for the access key and secret. An example of how new logs can be pushed via `boto3` is at <https://github.com/alisw/ali-bot/blob/master/report-pr-errors#L175-L194>.
```
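As a hedged sketch of the `s3cmd` route mentioned above, assuming the `ali-bot` credentials are already configured (e.g. via `s3cmd --configure`); the date-based object prefix is an assumption, not the bucket's actual layout:

```shell
# Build the object URI for a log; the date-based prefix is an assumption.
log_key() {
    printf 's3://alice-build-logs/%s/%s\n' "$1" "$2"
}

# Sketch: upload a build log with s3cmd against the CERN S3 endpoint.
push_log() {
    s3cmd --host=s3.cern.ch --host-bucket='%(bucket)s.s3.cern.ch' \
        put "$1" "$(log_key "$(date +%Y%m%d)" "$1")"
}
```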
