Commit Graph

206 Commits (0d670c7942d919bab5ea4ad4c31ade1a61be0336)

Author SHA1 Message Date
Guillaume Abrioux 2d38d8266b update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c14e9114ba)
2021-08-17 15:47:38 -04:00
Dimitri Savineau 03ed9e111c infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issue like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 386661699b)
2021-08-04 11:48:13 -04:00
Dimitri Savineau 1044940304 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 06471a4b82)
2021-08-03 14:03:35 -04:00
Dimitri Savineau 5c6921e553 rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e87a47cf0c)
2021-08-03 12:42:28 -04:00
Benoît Knecht bacaa654b1 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit d7653dca95)
2021-08-02 15:54:34 +02:00
Guillaume Abrioux 77171216fb update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eec38784ec)
2021-07-26 14:10:24 -04:00
Dimitri Savineau f433d06a93 rolling_update: check quorum state before upgrade
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 97148dd58c)
2021-07-26 17:49:23 +02:00
Dimitri Savineau 8e939dc377 common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 738fa9428a)
2021-07-21 10:03:36 -04:00
Dimitri Savineau 17b9ff03d2 common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf6e33346e)
2021-07-21 09:54:46 -04:00
Guillaume Abrioux f7882bbc02 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13036115e2)
2021-07-21 09:40:18 -04:00
Guillaume Abrioux f0cd3c4f48 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
2021-07-09 17:29:54 -04:00
Guillaume Abrioux 65ce69567a update: followup on pr #6689
add mising 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4eb4268dee)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux 1179ea8b2f update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux f0413c4a2b update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
2021-06-30 20:40:15 +02:00
Guillaume Abrioux a391dad8e1 dashboard: fix typo introduced during backport
during backport of c8b92deba1 the pattern
should have been s/monitoring_group_name/grafana_server_group_name/

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964907

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ac0a5c1e68)
2021-05-26 18:51:18 +02:00
Guillaume Abrioux 2d59f4579b update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
2021-05-05 09:47:32 +02:00
Guillaume Abrioux 5fd299e358 update: followup on 07029e1
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux 82b934cfc1 rolling_update: unmask monitor service after a failure
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 07029e1bf1)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux a8420d41c6 update: stop ceph-crash service before upgrading
This adds the missing service stop task for ceph-crash upgrade workflow.

It should have been added through commit
`15872e3db1e342238636bc9c8e1aef6bd1d3dcd8` in stable-4.0 but at the time
we backported this patch ceph-crash wasn't implemented yet so the
ceph-crash related content in this patch was removed. Then, ceph-crash
has been implemented later so we are still missing this part of the patch in
stable-4.0.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943471

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-26 16:18:50 +01:00
Alex Schultz 7ddbe74712 Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:16:58 +01:00
Dimitri Savineau 8f26ffdbac rolling_update: enforce ceph-container-engine
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 48a456dc8c)
2021-03-11 13:52:21 +01:00
Dimitri Savineau 3ba27c9387 rolling_update: exclude clients from node-exporter
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 94af3c87d1)
2021-03-11 13:52:02 +01:00
Guillaume Abrioux 858048560e update: fix require-osd-release task
This commit fixes two issues in rolling_update.yml:

- `container_exec_cmd_update_osd` is unset in the `complete osd upgrade`
play so it never runs the command in a container.
- the 'require-osd-release' task is never applied because the condition
  looks for luminous release.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1930164

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-18 22:22:06 +01:00
Guillaume Abrioux 607ef5a7d2 common: do not use pipefail when not needed
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`

Fixes: #6090

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86a8889ee3)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 72fc8877cb lint: all tasks should be named
Fix ansible-lint 502 error:

[502] All tasks should be named

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 97dd9218dd)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 35e738c681 lint: commands should not change things
Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b3)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 92b261df89 lint: set pipefail on shell tasks
Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
2020-12-16 14:05:45 +01:00
Dimitri Savineau 3f16132e44 library: add ceph_osd_flag module
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
2020-12-15 17:36:28 +01:00
Dimitri Savineau f917bb015c ceph_key: set state as optional
Most ansible module using a state parameter default to the present
value (when available) instead of using it as a mandatory option.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit abb4023d76)
2020-12-01 09:53:26 -05:00
Dimitri Savineau 522e183d8f rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 69b51b5f19 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 1185b7e86a osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 43da364188 container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
2020-09-10 20:36:08 -04:00
Guillaume Abrioux 66dde0034b ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-09-10 20:35:04 -04:00
Dimitri Savineau b745c76491 ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
2020-09-10 17:42:28 -04:00
Francesco Pantano 858e50da6b Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
2020-09-09 15:11:24 +02:00
Francesco Pantano 2691e385fb Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 15:11:02 +02:00
Guillaume Abrioux 88c9f6d969 common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-13 14:21:44 +02:00
Dimitri Savineau cbdff5f95b rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:49:15 -04:00
Dimitri Savineau 7a970ac028 rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:49:02 -04:00
Dimitri Savineau 15872e3db1 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

This following comment isn't valid because we should have backported
ceph-crash implementation in stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):

~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 09:43:01 -04:00
Guillaume Abrioux 02e7468b4a update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
2020-07-23 17:26:04 +02:00
Dimitri Savineau 5db4219f26 facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-20 21:23:48 +02:00
Guillaume Abrioux 328db8bee1 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-07-20 21:22:25 +02:00
Dimitri Savineau 8c4865cd14 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-06-03 13:20:03 -04:00
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecation
We can definitely drop it in stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Guillaume Abrioux 4c4b0edfec update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-10 17:16:51 +01:00