Commit Graph

384 Commits (1eea339f87da6e3dc7fc074f05c334ee793b04b6)

Author SHA1 Message Date
Guillaume Abrioux 07489c9f8e switch_to_containers: optimize ownership change
As per https://github.com/ceph/ceph-ansible/pull/4323#issuecomment-538420164

using `find` command should be faster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1757400

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-Authored-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit c5d0c90bb7)
2019-10-11 12:19:21 -04:00
Dimitri Savineau 2d40e3923f switch_to_containers: umount osd lockbox partition
When switching from a baremetal deployment to a containerized deployment
we only umount the OSD data partition.
If the OSD is encrypted (dmcrypt: true) then there's an additional
partition (part number 5) used for the lockbox and mount in the
/var/lib/ceph/osd-lockbox/ directory.
Because this partition isn't umount then the containerized OSD aren't
able to start. The partition is still mount by the system and can't be
remount from the container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 19edf707a5)
2019-10-08 09:43:40 +02:00
Dimitri Savineau 2e44b6af74 ceph-config: remove container_binary variable
9e7972a introduced a regression via the container_binary variable
which is undefined.
The CEPH_CONTAINER_BINARY environment variable isn't used at all.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-08 00:44:13 +02:00
Kevin Jones b3abe23493 Set proper ownership command performance improvement
By changing the set ownership command from using the file module in combination with a with_items loop to a raw chown command, we can achieve a 98% performance increase here.

On a ceph cluster with a significant amount of directories and files in /var/lib/ceph, the file module has to run checks on ownership of all those directories and files to determine whether a change is needed.

In this case, we just want to explicitly set the ownership of all these directories and files to the ceph_uid

Added context note to all set proper ownership tasks

Signed-off-by: Kevin Jones <kevinjones@redhat.com>
(cherry picked from commit 47bf47c9d8)
2019-10-01 09:10:28 -04:00
Guillaume Abrioux 787a6e879e update: use ids to restart osds instead of device name
we must use the ids instead of device names in the tasks executed in
`post_tasks` for the osd rolling update otherwise it ends up with old
systemd units enabled.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1739209

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-08-13 13:42:58 +02:00
Dimitri Savineau 343eec7a53 shrink-osd: Stop ceph-disk container based on ID
Since bedc0ab we now manage ceph-osd systemd unit scripts based on ID
instead of device name but it was not present in the shrink-osd
playbook (ceph-disk version).
To keep backward compatibility on deployment that didn't do yet the
transition on OSD id then we should stop unit scripts for both device
and ID.
This commit adds the ulimit nofile container option to get better
performance on ceph-disk commands.
It also fixes an issue when the OSD id matches multiple OSD ids with
the same first digit.

$ ceph-disk list | grep osd.1
 /dev/sdb1 ceph data, prepared, cluster ceph, osd.1, block /dev/sdb2
 /dev/sdg1 ceph data, prepared, cluster ceph, osd.12, block /dev/sdg2

Finally removing the shrinked OSD directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-08-06 09:38:52 +02:00
Guillaume Abrioux d739f41549 shrink-osd: (ceph-disk only) remove prepare container
When shrinking an OSD, its corresponding 'prepare container' should be
removed otherwise it prevent from redeploying a new osd because of this
leftover.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-07-09 09:04:19 -04:00
Guillaume Abrioux 4b49013369 shrink-osd: (ceph-disk only) remove gpt header
Removing the gpt header on devices will ease ceph-disk to ceph-volume
migration when using shrink-osd + add-osd playbooks.
ceph-disk requires GPT header where ceph-volume will complain if GPT
header is present.
That won't break ceph-disk (re)deployment since we check and add the GPT
header if needed when deploying ceph-disk ODs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613735

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-07-09 09:04:19 -04:00
Guillaume Abrioux 8b91905dff purge: ensure no ceph kernel thread is present
This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20e4852888)
2019-06-24 15:36:21 +02:00
Guillaume Abrioux 520f4e9914 add-osd: fix error in validate execution role
ceph-facts should be run before we play ceph-validate since it has
reference to facts that are set in ceph-facts role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-06-24 14:36:18 +02:00
Dimitri Savineau 81de8a8106 remove ceph-agent role and references
The ceph-agent role was used only for RHCS 2 (jewel) so it's not
usefull anymore.
The current code will fail on CentOS distribution because the rhscon
package is only avaible on Red Hat with the RHCS 2 repository and
this ceph release is supported on stable-3.0 branch.

Resolves: #4020

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7503098ca0)
2019-06-17 14:42:08 -04:00
Mike Christie 0a24078bbb igw: Fix rolling update service ordering
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d7ef12910e)
2019-05-10 11:12:50 +02:00
Guillaume Abrioux f1b4874176 Revert "Revert "shrink_osd: use cv zap by fsid to remove parts/lvs""
This reverts commit 043ee8c158.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-10 09:13:10 +02:00
Guillaume Abrioux 043ee8c158 Revert "shrink_osd: use cv zap by fsid to remove parts/lvs"
This reverts commit be59e0b451.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-25 21:27:37 +02:00
Dimitri Savineau 9ff19cc604 rolling_update: restart all ceph-iscsi services
Currently only rbd-target-gw service is restarted during an update.
We also need to restart tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1048627ea)
2019-04-24 23:17:41 +00:00
Guillaume Abrioux c5c354a61a remove all NBSPs char in stable-3.2 branch
this can cause issues, let's replace all of these chars with real
spaces.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-10 13:27:48 +02:00
Guillaume Abrioux 7136f1734e purge: fix lvm-batch purge osd
`lvm_volumes` and/or `devices` variable(s) can be undefined depending on
the scenario chosen.

These tasks should be run only if these variable are defined, otherwise
it ends up with undefined variable errors.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0180738313)
2019-04-03 08:48:39 +02:00
Dimitri Savineau fa6d9c940a rolling_update: Update systemd unit regex for nvme
The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c8442f3705)
2019-04-01 15:22:24 +00:00
Dimitri Savineau 8e2cfd9d24 purge-docker-cluster: Remove ceph-osd service
The systemd ceph-osd@.service file used for starting the ceph osd
containers is used in all osd_scenarios.
Currently purging a containerized deployment using the lvm scenario
didn't remove the ceph-osd systemd service.
If the next deployment is a non-containerized deployment, the OSDs
won't be online because the file is still present and override the
one from the package.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7cc626b72d)
2019-04-01 09:10:29 +00:00
Dimitri Savineau ef9525482b add-osd.yml: Add become flag for ceph-validate
The check_devices task fails if the ceph-validate role isn't executed
as a privileged user (Permission denied).

failed: [osd0] (item=/dev/sdb) => {"changed": false, "err": "Error:
Error opening /dev/sdb: Permission denied\n", "item": "/dev/sdb",
"msg": "Error while getting device information with parted script:
'/sbin/parted -s -m /dev/sdb -- unit 'MiB' print'", "out": "", "rc": 1}

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b23c05ae52)
2019-03-12 14:48:03 +01:00
Guillaume Abrioux 4dd46ec396 add-osd: gather facts in second part of playbook
otherwise, it will end up with error like following:

```
FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}
```

because facts won't have been gathered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a440878533)
2019-03-04 15:48:44 +00:00
Guillaume Abrioux 06ad7e0b57 purge: fix rbd-mirror group name
the default is rbdmirrors in ceph-defaults

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 47ebef374f)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux a8467d8f33 purge: fix rbd mirror purge
as of b70d54ac80 the service launched isn't
ceph-rbd-mirror@admin.service.

it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a915308477)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux 5470e6fa42 purge: do not remove /var/lib/apt/lists/*
removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.

Typical error:
```
fatal: [mon0]: FAILED! => changed=false
  attempts: 3
  msg: No package matching 'ceph' is available
```

The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
  apt:
    update_cache: yes
    cache_valid_time: 3600
  register: result
  until: result is succeeded
```

since the task installing ceph packages has a `update_cache: no` it
fails:

```
- name: install ceph for debian
  apt:
    name: "{{ debian_ceph_pkgs | unique }}"
    update_cache: no
    state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
    default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
  register: result
  until: result is succeeded
```

/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3849f30f58)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux 255eab59ac purge: fix purge of lvm devices
using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.

on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
  cmd: command -v ceph-volume
  failed_when_result: false
  msg: '[Errno 2] No such file or directory: ''command'': ''command'''
  rc: 2
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 89f77589fa)
2019-03-01 22:16:19 +00:00
Noah Watkins be59e0b451 shrink_osd: use cv zap by fsid to remove parts/lvs
Fixes:
  https://bugzilla.redhat.com/show_bug.cgi?id=1569413
  https://bugzilla.redhat.com/show_bug.cgi?id=1572933

Note: rebased

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 9a43674d2e)
2019-02-06 00:37:11 +00:00
Noah Watkins b8c39d7613 Add a ceph-volume aware shrink-osd playbook
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit f5dacbf7de)
2019-01-30 14:58:59 +01:00
Noah Watkins 8f57a95048 Rename ceph-disk version of shrink-osd playbook
This will be replaced by a ceph-volume aware verison.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 0782cfc546)
2019-01-30 14:58:59 +01:00
Giulio Fidente 75855b2d58 Preserve rolling_update backward compatibility with ansible < 2.5
Let's enforce the default value for `client_update_batch` to 20 since
`ansible_forks` isn't always available.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit ff8dbe114c)
2019-01-21 14:28:07 +00:00
Sébastien Han 04d8002614 switch: do not fail on missing key
Some people use the switch playbook to perform upgrade so they end up in
the same situation than https://bugzilla.redhat.com/show_bug.cgi?id=1650572
This is applying the same fix as
729744c6a8.

We don't want to fail on key that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".

Signed-off-by: Sébastien Han <seb@redhat.com>
2019-01-14 18:54:46 +00:00
Guillaume Abrioux 416b503476 introduce new role ceph-facts
sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speedup
the playbook.

Closes: #3282

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0eb56e36f8)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux c3bb76b8e9 purge-container: move facts gathering after ceph-defaults role import
This task has to be called after the role `ceph-defaults` has been
played, otherwise, `mon_group_name` will never be known.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a12de3e048)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux b9bf7c6703 purge-container: fix wrong syntax
we want a default value for `mon_group_name`, not for
`groups[mon_group_name]`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d0b3cb7f85)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux 0ff1260fc1 purge-docker: do not call ceph-osd role
calling ceph-osd role in purge playbook is not needed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ae7f3d66a6)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux c405fd1140 purge: gather monitors facts in OSD purge
the OSD part of the purge delegates commands on monitor node, we need to
gather monitors facts to know the `ansible_hostname` fact that is used
in the `docker_exec_cmd` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1a4a6ec855)
2019-01-07 09:14:10 +01:00
Sébastien Han 37ba313d76 purge-container: gather fact before calling ceph-defaults
ceph-defaults relies on facts so we must gather facts before running it.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 62111ff53c)
2019-01-07 09:14:10 +01:00
Sébastien Han 8e83ecfce1 purge-cluster: add support for mon/mgr collocation
Recently we introduced the default collocation of mon/mgr without the
need of a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fc6ebd8ebb)
2019-01-07 09:14:10 +01:00
Sébastien Han 12d6466582 purge-cluster: remove support for other init system
We only support systemd and use the service module anyway.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3a154fa0ad)
2019-01-07 09:14:10 +01:00
Sébastien Han 782959f094 purge-docker-cluster: add support for mgr/mon collocation
Recently we introduced the collocation of mon and mgr by default, so we
don't need to have an explicit mgrs section for this. This means we have
to remove the mgr container on the mon machines too.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 325a159415)

# Conflicts:
#	infrastructure-playbooks/purge-docker-cluster.yml
2019-01-07 09:14:10 +01:00
Sébastien Han 8ce8d580a4 purge-docker-cluste: add a task to check hosts
It's useful when running on CI to see what might remain on the machines.
So we list all the containers and images. We expect the list to be
empty.

We fail if we see containers running.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2bcc00896f)
2019-01-07 09:14:10 +01:00
Sébastien Han f37c21a9d0 purge-docker-cluster: add ceph-volume support
This commits adds the support for purging cluster that were deployed
with ceph-volume. It also separates nicely with a block intruction the
work to do when lvm is used or not.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1751885bc9)
2019-01-07 09:14:10 +01:00
Sébastien Han 668c7a4db7 fix json data type
Json is a type structure which is always typed as a string, where before
this we were declaring a dict, which is not a json valid structure.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1663026

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 896676ee80)
2019-01-04 12:02:34 +01:00
Guillaume Abrioux dc02156736 update: do not enforce `serial: 1` on client nodes
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`, if not filled this will default to `{{
ansible_forks }}`.

NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 268f2cef82)
2019-01-04 11:59:02 +01:00
Andrew Schoen e55ec6c0f5 purge-cluster: skip tasks that use ceph-volume if it's not installed
This will allow the playbook to be idempotent.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656935

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit ffd56177e7)
2018-12-20 14:03:30 +01:00
Guillaume Abrioux e37a90b5ec purge: add iscsi support
add iscsi support for both non containerized and containerized
deployment in purge playbooks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1651054

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78116fa6db)
2018-12-04 18:04:13 +01:00
Ramana Raja 0ec2ac34e3 rolling_update: fail if less than 3 MONs
... for non-containerized deployments as well.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1655470

Signed-off-by: Ramana Raja <rraja@redhat.com>
(cherry picked from commit cb784c601d)
2018-12-04 16:34:57 +01:00
Sébastien Han 2cea33f7fc rolling_update: default ceph json output to empty dict
So we can avoid the following failure:

The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded

We just need to set a default, the next iteration will have a more
complete json since the command won't fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-29 01:49:05 +00:00
Guillaume Abrioux 292d967d2f update: fix a typo
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d)
2018-11-29 01:49:05 +00:00
Guillaume Abrioux 1f4cf61058 rolling_update: refact set_fact `mon_host`
each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584)
2018-11-29 01:49:05 +00:00
Sébastien Han d4f1f12bd0 rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f)
2018-11-29 01:49:05 +00:00