we must use the ids instead of device names in the tasks executed in
`post_tasks` for the osd rolling update otherwise it ends up with old
systemd units enabled.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1739209
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since bedc0ab we now manage ceph-osd systemd unit scripts based on ID
instead of device name but it was not present in the shrink-osd
playbook (ceph-disk version).
To keep backward compatibility on deployment that didn't do yet the
transition on OSD id then we should stop unit scripts for both device
and ID.
This commit adds the ulimit nofile container option to get better
performance on ceph-disk commands.
It also fixes an issue when the OSD id matches multiple OSD ids with
the same first digit.
$ ceph-disk list | grep osd.1
/dev/sdb1 ceph data, prepared, cluster ceph, osd.1, block /dev/sdb2
/dev/sdg1 ceph data, prepared, cluster ceph, osd.12, block /dev/sdg2
Finally removing the shrinked OSD directory.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When shrinking an OSD, its corresponding 'prepare container' should be
removed otherwise it prevent from redeploying a new osd because of this
leftover.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Removing the gpt header on devices will ease ceph-disk to ceph-volume
migration when using shrink-osd + add-osd playbooks.
ceph-disk requires GPT header where ceph-volume will complain if GPT
header is present.
That won't break ceph-disk (re)deployment since we check and add the GPT
header if needed when deploying ceph-disk ODs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613735
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20e4852888)
ceph-facts should be run before we play ceph-validate since it has
reference to facts that are set in ceph-facts role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The ceph-agent role was used only for RHCS 2 (jewel) so it's not
usefull anymore.
The current code will fail on CentOS distribution because the rhscon
package is only avaible on Red Hat with the RHCS 2 repository and
this ceph release is supported on stable-3.0 branch.
Resolves: #4020
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7503098ca0)
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611
Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d7ef12910e)
Currently only rbd-target-gw service is restarted during an update.
We also need to restart tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1048627ea)
`lvm_volumes` and/or `devices` variable(s) can be undefined depending on
the scenario chosen.
These tasks should be run only if these variable are defined, otherwise
it ends up with undefined variable errors.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0180738313)
The systemd ceph-osd@.service file used for starting the ceph osd
containers is used in all osd_scenarios.
Currently purging a containerized deployment using the lvm scenario
didn't remove the ceph-osd systemd service.
If the next deployment is a non-containerized deployment, the OSDs
won't be online because the file is still present and override the
one from the package.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7cc626b72d)
otherwise, it will end up with error like following:
```
FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}
```
because facts won't have been gathered.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a440878533)
as of b70d54ac80 the service launched isn't
ceph-rbd-mirror@admin.service.
it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a915308477)
removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.
Typical error:
```
fatal: [mon0]: FAILED! => changed=false
attempts: 3
msg: No package matching 'ceph' is available
```
The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
apt:
update_cache: yes
cache_valid_time: 3600
register: result
until: result is succeeded
```
since the task installing ceph packages has a `update_cache: no` it
fails:
```
- name: install ceph for debian
apt:
name: "{{ debian_ceph_pkgs | unique }}"
update_cache: no
state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
register: result
until: result is succeeded
```
/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3849f30f58)
using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.
on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
cmd: command -v ceph-volume
failed_when_result: false
msg: '[Errno 2] No such file or directory: ''command'': ''command'''
rc: 2
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 89f77589fa)
Let's enforce the default value for `client_update_batch` to 20 since
`ansible_forks` isn't always available.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit ff8dbe114c)
Some people use the switch playbook to perform upgrade so they end up in
the same situation than https://bugzilla.redhat.com/show_bug.cgi?id=1650572
This is applying the same fix as
729744c6a8.
We don't want to fail on key that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".
Signed-off-by: Sébastien Han <seb@redhat.com>
sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speedup
the playbook.
Closes: #3282
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0eb56e36f8)
This task has to be called after the role `ceph-defaults` has been
played, otherwise, `mon_group_name` will never be known.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a12de3e048)
we want a default value for `mon_group_name`, not for
`groups[mon_group_name]`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d0b3cb7f85)
the OSD part of the purge delegates commands on monitor node, we need to
gather monitors facts to know the `ansible_hostname` fact that is used
in the `docker_exec_cmd` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1a4a6ec855)
ceph-defaults relies on facts so we must gather facts before running it.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 62111ff53c)
Recently we introduced the default collocation of mon/mgr without the
need of a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fc6ebd8ebb)
Recently we introduced the collocation of mon and mgr by default, so we
don't need to have an explicit mgrs section for this. This means we have
to remove the mgr container on the mon machines too.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 325a159415)
# Conflicts:
# infrastructure-playbooks/purge-docker-cluster.yml
It's useful when running on CI to see what might remain on the machines.
So we list all the containers and images. We expect the list to be
empty.
We fail if we see containers running.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2bcc00896f)
This commits adds the support for purging cluster that were deployed
with ceph-volume. It also separates nicely with a block intruction the
work to do when lvm is used or not.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1751885bc9)
Json is a type structure which is always typed as a string, where before
this we were declaring a dict, which is not a json valid structure.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1663026
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 896676ee80)
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`, if not filled this will default to `{{
ansible_forks }}`.
NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 268f2cef82)
add iscsi support for both non containerized and containerized
deployment in purge playbooks.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1651054
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78116fa6db)
So we can avoid the following failure:
The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded
We just need to set a default, the next iteration will have a more
complete json since the command won't fail.
Signed-off-by: Sébastien Han <seb@redhat.com>
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d)
each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.
Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018 14:02:30 +0000 (0:00:07.493) 0:02:50.005 *****
fatal: [mon1]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584)
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f)
It's easier lookup a directoriy instead of the block devices,
especially because of ceph-volume and ceph-disk have a different way to
handle devices.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit c14f9b78ff)
Prior to this commit we were only disabling ceph-osd units, but forgot
the ceph.target which is controlling everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled there won't be any conflicts between
the old non-container and the new container units.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cd56dad9fa)
If we mask it we won't be able to start the OSD container since now the
osd container use the osd ID as a name such as: ceph-osd@0
Fixes the error: Failed to execute operation: Cannot send after transport endpoint shutdown
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fe1d09925a)
when upgrading from RHCS 2.5 to 3.2, it fails because the task `create
ceph mgr keyring(s) when mon is containerized` has a when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` is referring to a
mgr node, it means this condition would have never been satisfied.
Then, this condition + `serial: 1` makes the mgr keyring creating skipped on
the first node. Further, the `ceph-mgr` role tries to copy the mgr
keyring (it's not aware we are running `serial: 1`) this leads to a
failure like the following:
```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018 12:03:34 +0000 (0:00:00.296) 0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```
The ceph_key module is idempotent, so there is no need to have such a
condition.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649957
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 73287f91bc)