1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
It might be possible at some point even with osd flags `noout` and
`norebalance` set the PGs states can change depending on the amount of data
written meantime. It means the check for PGs state will fail.
This commit changes the way we set those flags:
we set them before an OSD node upgrade and unset them before the PGs
state check so they can recover.
Fixes: #3961
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.
Currently this generates warnings like:
[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returns under the mdsmap structure is based on the OS hostname
so we need to find the right node in the inventory with this value when
doing operation on inventory nodes.
Othewise we could see error like:
The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
mon_host should use the inventory hostname and not the node hostname.
Fix creates an issue when the inventory and node hostname are different.
Closes: #4670
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit excludes client nodes from facts gathering, they are not
needed and can speed up this task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
after all mon are upgraded, let's reset mon_host which is used in the
rest of the playbook for setting `container_exec_cmd` so we are sure to
use the right value.
Typical error:
```
failed: [mds0 -> mon0] (item={u'path': u'/var/lib/ceph/bootstrap-mds/ceph.keyring', u'name': u'client.bootstrap-mds', u'copy_key': True}) => changed=true
ansible_loop_var: item
cmd:
- docker
- exec
- ceph-mon-mon2
- ceph
- --cluster
- ceph
- auth
- get
- client.bootstrap-mds
delta: '0:00:00.016294'
end: '2019-09-27 13:54:58.828835'
item:
copy_key: true
name: client.bootstrap-mds
path: /var/lib/ceph/bootstrap-mds/ceph.keyring
msg: non-zero return code
rc: 1
start: '2019-09-27 13:54:58.812541'
stderr: 'Error response from daemon: No such container: ceph-mon-mon2'
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The rolling_update.yml playbook fails when scanning ceph-disk osds while
deploying nautilus. The --force flag is required to scan existing osds
and rewrite their json metadata.
Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>
3a100cfa52 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise it fails like following:
```
fatal: [mon0]: FAILED! => changed=false
msg: |-
Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
By running ceph-ansible there are a lot ``[DEPRECATION WARNING]`` like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```
Now appended ``| bool`` on a lot of the affected variables.
Sometimes the coding style from ``variable|bool`` changed to ``variable | bool`` *(with spaces at the pipe)*.
Closes: #4022
Signed-off-by: L3D <l3d@c3woc.de>
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611
Signed-off-by: Mike Christie <mchristi@redhat.com>
Currently only rbd-target-gw service is restarted during an update.
We also need to restart tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
These tasks must be run from a monitor which is upgraded otherwise it
might fail.
See: https://tracker.ceph.com/issues/39355
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
these commands could return something else than 0.
Let's ensure all retries have been done before actually failing.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We do this so that the ceph-config role can most accurately
report the number of osds for the generation of the ceph.conf
file.
We don't want to use ceph-volume to determine the number of
osds because in an upgrade to nautilus ceph-volume won't be able to
accurately count osds created by ceph-disk.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
When upgrading to nautlius run ``ceph-volume simple scan`` and
``ceph-volume simple activate --all`` to migrate any running
ceph-disk osds to ceph-volume.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1656460
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
ceph aliases have been introduced in stable-3.2 during the ceph
deployment. On master this has been removed but we don't handle
this removal in the upgrade from stable-3.2 to master via the
rolling_update playbook.
Also remove the task from purge-docker-cluster missing from
d9e7835
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
all rgw instances should be stopped according to the multiple rgw
instances support added in rolling_update.yml
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add a couple of fixes to allow containerized deployments upgrade support
to upgrade from luminous/mimic to nautilus.
- pass CEPH_CONTAINER_IMAGE and CEPH_CONTAINER_BINARY environment
variable to the ceph_key module,
- fix the docker exec command in 'waiting for the containerized monitor
to join the quorum' task according to the `delegate_to` parameter,
- override `docker_exec_cmd` in `ceph-facts` with `mon_host` when
rolling_update is `True`,
- do not run unnecessarily `create_mds_filesystems.yml` when performing an
upgrade.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
iscsigws were missing.
The 'complete upgrade' couldn't complete because rolling_update was set
to False for iscsigw nodes.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
once the cluster is upgraded to nautilus, we can complete the process by
disallowing pre-nautilus OSDs and enabling all new nautilus-only functionality
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
As of 1c760904b0, ceph-ansible implicitly
bootstrap managers on monitors.
mgrs must be upgraded only after all monitors, therefore, this commit
refact the way mgrs are upgraded to be sure we don't upgrade a mgr
during the monitors upgrade.
This commit also ensure we handle the case were we split managers on
dedicated nodes.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This directory is created by ceph-config node by node.
In the upgrade context we need it to be created on ALL monitors as soon
as the first iteration because of the task right after which creates and sends
the keyrings on all monitors.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This prevents the packaging from restarting services before we do need
to restart them in the rolling update sequence.
We want to handle services restart at rolling_update playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to set osd flags (noout, norebalance) each time we
upgrade a mon.
This commit moves up those tasks (before stopping the mon) so we don't need
to delegate them.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We actually want to ensure the node being upgraded is joining the quorum
instead of the monitor picked up earlier.
Indeed, the `mon_host`is used only in `delegate_to:` so we can still run ceph
commands while the monitor being upgraded is stopped.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the `containerized` parameter in ceph_key module doesn't exist anymore.
This was making the module failing but was hidden because of the
`ignore_errors: True`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
1ac94c048f introduced the support of
multiple rgw instances on a single host but somehow has missed to
implement this feature in rolling_update.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>