Commit Graph

211 Commits (120ed2b7f3047aba091d2028e05293034f89db23)

Author SHA1 Message Date
Guillaume Abrioux 1019c7bf25 update: support upgrading a subset of nodes
It can be useful in a large cluster deployment to split the upgrade and
only upgrade a group of nodes at a time.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5cf9db2b0)
2021-10-25 20:15:17 +02:00
Guillaume Abrioux 492c2b5389 update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c14e9114ba)
2021-08-17 15:31:41 -04:00
Dimitri Savineau bcf9a2c25e infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issue like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 386661699b)
2021-08-04 11:43:53 -04:00
Dimitri Savineau 561a7c02c0 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 06471a4b82)
2021-08-03 13:57:20 -04:00
Dimitri Savineau 380e0bec83 rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e87a47cf0c)
2021-08-03 12:17:53 -04:00
Benoît Knecht 35ce2bb643 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit d7653dca95)
2021-08-02 15:54:04 +02:00
Guillaume Abrioux b9cc91f622 update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eec38784ec)
2021-07-26 13:39:20 -04:00
Dimitri Savineau 06158c2ac5 common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 738fa9428a)
2021-07-21 10:01:25 -04:00
Dimitri Savineau f9d60644ad common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf6e33346e)
2021-07-21 08:57:53 -04:00
Guillaume Abrioux f3a9135241 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13036115e2)
2021-07-20 11:48:39 -04:00
Dimitri Savineau 876fa07175 rolling_update: check quorum state before upgrade
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 97148dd58c)
2021-07-12 12:59:48 -04:00
Guillaume Abrioux a14a3e56c0 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
2021-07-09 16:32:56 -04:00
Guillaume Abrioux 361f373e18 update: followup on pr #6689
add mising 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4eb4268dee)
2021-07-09 11:34:22 +02:00
Guillaume Abrioux a0087c425b update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
2021-07-09 09:15:42 +02:00
Guillaume Abrioux 22fd0846a7 update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
2021-06-30 20:39:50 +02:00
Guillaume Abrioux ac0a5c1e68 dashboard: fix typo introduced during backport
during backport of c8b92deba1 the pattern
should have been s/monitoring_group_name/grafana_server_group_name/

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964907

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-26 15:23:01 +02:00
Guillaume Abrioux 84a0ed440d update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
2021-05-04 16:02:29 +02:00
Guillaume Abrioux f60516d920 update: followup on 07029e1
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
2021-03-29 10:54:56 +02:00
Guillaume Abrioux f42ee9f940 cephadm_adopt: fetch and write ceph minimal config
This commit makes the playbook fetch the minimal current ceph
configuration and write it later on monitoring nodes so `cephadm` can
proceed with the adoption.
When a monitoring stack was deployed on a dedicated node, it means no
`ceph.conf` file was written, `cephadm` requires a `ceph.conf` in order
to adopt the daemon present on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1939887

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b445df0479)
2021-03-26 15:20:50 +01:00
Alex Schultz 815ea7765f Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:05:33 +01:00
Dimitri Savineau a1ca7f3daa rolling_update: enforce ceph-container-engine
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 48a456dc8c)
2021-02-10 09:58:03 +01:00
Dimitri Savineau 2e9bf2f0fb rolling_update: exclude clients from node-exporter
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 94af3c87d1)
2021-02-10 08:14:01 +01:00
Dimitri Savineau d4024eddbb library: add ceph_osd_flag module
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
2020-12-15 14:12:44 +01:00
Guillaume Abrioux 6b04f1154f common: do not use pipefail when not needed
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`

Fixes: #6090

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86a8889ee3)
2020-12-01 20:18:35 -05:00
Guillaume Abrioux 0c3adbc710 lint: all tasks should be named
Fix ansible-lint 502 error:

[502] All tasks should be named

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 97dd9218dd)
2020-11-24 10:39:03 +01:00
Guillaume Abrioux 630e6be904 lint: commands should not change things
Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b3)
2020-11-24 10:39:03 +01:00
Guillaume Abrioux 1d4cd3328a lint: set pipefail on shell tasks
Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
2020-11-24 10:39:03 +01:00
Dimitri Savineau 0cb9e179f5 rolling_update: fix mgr start with mon collocation
cec994b introduced a regression when a mgr is collocated with a mon.
During the mon upgrade, the mgr service is masked to avoid to be
restarted on packages update.
Then the start mgr task is failing because the service is still masked.
Instead we should unmask it.

Fixes: #5983

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3d3ce26327)
2020-11-03 14:32:42 +01:00
Dimitri Savineau d2114efa4d infrastructure: consume ceph_fs module
bd611a7 introduced the new ceph_fs module but missed some tasks in
rolling_update and shrink-mds playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16afe90806)
2020-11-03 14:32:25 +01:00
Dimitri Savineau 1c6bd9a383 rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
2020-11-03 14:32:09 +01:00
Dimitri Savineau 3bba1fd203 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
2020-11-03 14:32:09 +01:00
Dimitri Savineau a8e2bc087f osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
2020-11-03 14:32:09 +01:00
Dimitri Savineau 4fc2d788b4 library: add ceph_fs module
This adds the ceph_fs ansible module for replacing the command module
usage with the ceph fs command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd611a785b)
2020-10-06 14:59:49 +02:00
Dimitri Savineau 23522a11e4 ceph_key: set state as optional
Most ansible module using a state parameter default to the present
value (when available) instead of using it as a mandatory option.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit abb4023d76)
2020-09-14 15:37:56 -04:00
Dimitri Savineau 7745fd3560 container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
2020-09-10 20:57:16 +02:00
Dimitri Savineau 0c0a930374 ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
2020-09-10 20:57:16 +02:00
Francesco Pantano 8e3ecfd869 Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
2020-09-09 14:54:19 +02:00
Francesco Pantano 8dd8675080 Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 14:53:44 +02:00
Guillaume Abrioux 3a8be20699 rolling_update: remove 'ignore_errors'
There's no need to use `ignore_errors: true` on these tasks.

Using a loop on the task stopping mon daemons allows us to avoid
duplicating this task, the `ignore_errors` isn't needed here because it
won't fail the playbook if one of the ID doesn't exist (shortname vs. fqdn)

Using the right condition on the task starting the mgr daemon allows us
to avoid using an `ignore_errors: true` as well.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit cec994b973)
2020-08-21 16:33:15 +02:00
Guillaume Abrioux 8a7e4193db common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-12 22:57:10 +02:00
Dimitri Savineau 1dd9c43efc rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:43:36 -04:00
Dimitri Savineau 2ce60504bd rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:59:25 -04:00
Dimitri Savineau 8ea3fa1752 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

Finally, this adds the missing service stop task for ceph crash upgrade
workflow.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 10:59:25 -04:00
Guillaume Abrioux e6059fdcd3 ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-07-22 18:47:01 -04:00
Dimitri Savineau 503bc893fb facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-03 06:37:08 +02:00
Guillaume Abrioux 688d5eebf7 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-06-29 17:13:03 -04:00
Dimitri Savineau e6bfdd2e44 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-05-13 16:41:23 -04:00
Guillaume Abrioux 529d99a691 update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6df7887f87)
2020-04-06 18:05:10 +02:00
Guillaume Abrioux a084a2a347 common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-27 09:48:36 +01:00
Guillaume Abrioux e5812fe45b rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-22 11:29:36 -05:00