Commit Graph

180 Commits (92b261df899ed02a32f0f38b6d71cef524b0705f)

Author SHA1 Message Date
Guillaume Abrioux 92b261df89 lint: set pipefail on shell tasks
Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
2020-12-16 14:05:45 +01:00
Dimitri Savineau 3f16132e44 library: add ceph_osd_flag module
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
2020-12-15 17:36:28 +01:00
Dimitri Savineau f917bb015c ceph_key: set state as optional
Most ansible module using a state parameter default to the present
value (when available) instead of using it as a mandatory option.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit abb4023d76)
2020-12-01 09:53:26 -05:00
Dimitri Savineau 522e183d8f rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 69b51b5f19 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 1185b7e86a osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 43da364188 container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
2020-09-10 20:36:08 -04:00
Guillaume Abrioux 66dde0034b ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-09-10 20:35:04 -04:00
Dimitri Savineau b745c76491 ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
2020-09-10 17:42:28 -04:00
Francesco Pantano 858e50da6b Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
2020-09-09 15:11:24 +02:00
Francesco Pantano 2691e385fb Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 15:11:02 +02:00
Guillaume Abrioux 88c9f6d969 common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-13 14:21:44 +02:00
Dimitri Savineau cbdff5f95b rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:49:15 -04:00
Dimitri Savineau 7a970ac028 rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:49:02 -04:00
Dimitri Savineau 15872e3db1 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

This following comment isn't valid because we should have backported
ceph-crash implementation in stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):

~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 09:43:01 -04:00
Guillaume Abrioux 02e7468b4a update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
2020-07-23 17:26:04 +02:00
Dimitri Savineau 5db4219f26 facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-20 21:23:48 +02:00
Guillaume Abrioux 328db8bee1 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-07-20 21:22:25 +02:00
Dimitri Savineau 8c4865cd14 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-06-03 13:20:03 -04:00
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecation
We can definitely drop it in stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Guillaume Abrioux 4c4b0edfec update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-10 17:16:51 +01:00
Guillaume Abrioux 6e47e96a02 update: use flags noout and nodeep-scrub only
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
2020-01-10 17:16:51 +01:00
Dimitri Savineau f042ece9af rolling_update: run registry auth before upgrading
There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f344fdefe)
2020-01-09 20:16:07 -05:00
Guillaume Abrioux 5062d4094c update: restart iscsigws daemons after upgrade
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).

This commit also removes the task which ensure services are started
because it's already done in the role ceph-iscsigw.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c7708eb458)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux fe8858af38 upgrade: add dashboard deployment
when upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 451c5ca934)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux e4c657d711 update: add default values when setting fact
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9823f319b)
2019-10-29 16:00:21 -04:00
Dimitri Savineau 56f0cf79d9 rolling_update: remove default filter on mds group
There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.

Currently this generates warnings like:

[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2ca79fcc99)
2019-10-28 13:08:33 -04:00
Dimitri Savineau ba4059d15a rolling_update: fix active mds host value
The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returns under the mdsmap structure is based on the OS hostname
so we need to find the right node in the inventory with this value when
doing operation on inventory nodes.

Othewise we could see error like:

The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1f2352c79)
2019-10-28 13:08:33 -04:00
Dimitri Savineau b547ad9e71 rolling_update: fix reset mon_host variable
mon_host should use the inventory hostname and not the node hostname.
Fix creates an issue when the inventory and node hostname are different.

Closes: #4670

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 650bc0c3f0)
2019-10-26 08:20:54 -04:00
Guillaume Abrioux 3625ea6ef8 update: use right node when creating active mds group
This must be consistent with what is used in `name` parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d06057ebd2)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux 73d97f525e update: avoid skipping single mds deployment upgrade
otherwise a single MDS would never be updated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d8ab11d2f8)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux c599af6724 update: skip mds deactivation when no mds in inventory
Let's skip this part of the code if there's no mds node in the
inventory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ec906c3af)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux 4a5d3c3c2d update: add missing quotes
Add missing quote in order to keep consistency.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d72ff8e5e)
2019-10-21 13:26:37 -04:00
Guillaume Abrioux 9bc7f8a7d7 tests: add multimds coverage
This commit makes the all_daemons scenario deploying 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 25b98b2ce3)
2019-10-18 22:09:04 +02:00
Guillaume Abrioux bc3138eff4 upgrade: fix standby_mdss group creation
This commit fixes the standby_mdss group creation by using `{{ item }}`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c4fc8cc878)
2019-10-18 22:09:04 +02:00
Guillaume Abrioux c962d87def update: follow new recommandation to upgrade mds cluster
Refact the mds cluster upgrade code in order to follow the documented
recommandation.
See: https://github.com/ceph/ceph/blob/master/doc/cephfs/upgrading.rst

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1569689

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 71cebf80a6)
2019-10-16 12:59:08 -04:00
Guillaume Abrioux 37fd0b179b update: import ceph-defaults role in first play
Typical error:

```
fatal: [mon0]: FAILED! =>
  msg: |-
    The conditional check 'not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])' failed. The error was: error while evaluating conditional (not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])): 'client_group_name' is undefined
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8138d4193c)
2019-10-07 11:21:23 +02:00
Guillaume Abrioux 9a4fcfabe1 main: exclude client nodes from facts gathering when delegate_facts_host
This commit excludes client nodes from facts gathering, they are not
needed and can speed up this task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 865d2eac9b)
2019-10-07 11:21:23 +02:00
Guillaume Abrioux 4afe1b748c update: reset mon_host after mons upgrade
after all mon are upgraded, let's reset mon_host which is used in the
rest of the playbook for setting `container_exec_cmd` so we are sure to
use the right value.

Typical error:

```
failed: [mds0 -> mon0] (item={u'path': u'/var/lib/ceph/bootstrap-mds/ceph.keyring', u'name': u'client.bootstrap-mds', u'copy_key': True}) => changed=true
  ansible_loop_var: item
  cmd:
  - docker
  - exec
  - ceph-mon-mon2
  - ceph
  - --cluster
  - ceph
  - auth
  - get
  - client.bootstrap-mds
  delta: '0:00:00.016294'
  end: '2019-09-27 13:54:58.828835'
  item:
    copy_key: true
    name: client.bootstrap-mds
    path: /var/lib/ceph/bootstrap-mds/ceph.keyring
  msg: non-zero return code
  rc: 1
  start: '2019-09-27 13:54:58.812541'
  stderr: 'Error response from daemon: No such container: ceph-mon-mon2'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d84160a170)
2019-09-28 09:01:16 +02:00
Sam Choraria 7594bc9181 rolling_update.yml: force ceph-volume scan on osds
The rolling_update.yml playbook fails when scanning ceph-disk osds while
deploying nautilus. The --force flag is required to scan existing osds
and rewrite their json metadata.

Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>
(cherry picked from commit 7cc9f93680)
2019-09-26 14:51:59 -04:00
Guillaume Abrioux 77d24203fa upgrade: accept HEALTH_OK and HEALTH_WARN as valid state
3a100cfa52 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6dce51183b)
2019-06-21 15:47:33 +00:00
Guillaume Abrioux b93064c7c8 rolling_update: fail early if cluster state is not OK
starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3a100cfa52)
2019-06-19 08:41:25 +00:00
Guillaume Abrioux 53dd58e84c rolling_update: only mask and stop unit in mgr part
Otherwise it fails like following:

```
fatal: [mon0]: FAILED! => changed=false
  msg: |-
    Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51b2813e04)
2019-06-19 08:41:25 +00:00
L3D 1daca1ba83 ansible: use 'bool' filter on boolean conditionals
By running ceph-ansible there are a lot ``[DEPRECATION WARNING]`` like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```

Now appended ``| bool`` on a lot of the affected variables.

Sometimes the coding style from ``variable|bool`` changed to ``variable | bool`` *(with spaces at the pipe)*.

Closes: #4022

Signed-off-by: L3D <l3d@c3woc.de>
(cherry picked from commit ab54fe20ec)
2019-06-07 16:05:51 +02:00
Guillaume Abrioux e29fd842a6 rename docker_exec_cmd variable
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e74d80e72f)
2019-05-17 16:05:58 +02:00
Mike Christie 78a55a3df3 igw: Fix rolling update service ordering
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d7ef12910e)
2019-05-10 15:53:44 +02:00
Rishabh Dave 06b3ab2a6b improve coding style
Keywords requiring only one item shouldn't express it by creating a
list with single item.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 739a662c80)

Conflicts:
	roles/ceph-mon/tasks/ceph_keys.yml
	roles/ceph-validate/tasks/check_devices.yml
2019-05-06 15:09:06 +00:00