ceph-ansible

Commit Graph

Author	SHA1	Message	Date
Guillaume Abrioux	82b934cfc1	rolling_update: unmask monitor service after a failure If for some reason the playbook fails after the service was stopped, disabled and masked and before it got restarted, enabled and unmasked, the playbook leaves the service masked, which can confuse users and forces them to unmask the unit manually. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `07029e1bf1`)	2021-03-29 15:22:23 +02:00
Guillaume Abrioux	a8420d41c6	update: stop ceph-crash service before upgrading This adds the missing service stop task for ceph-crash upgrade workflow. It should have been added through commit `15872e3db1e342238636bc9c8e1aef6bd1d3dcd8` in stable-4.0 but at the time we backported this patch ceph-crash wasn't implemented yet so the ceph-crash related content in this patch was removed. Then, ceph-crash has been implemented later so we are still missing this part of the patch in stable-4.0. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943471 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2021-03-26 16:18:50 +01:00
Alex Schultz	7ddbe74712	Use ansible_facts It has come to our attention that using ansible_* vars that are populated with INJECT_FACTS_AS_VARS=True is not very performant. In order to be able to support setting that to off, we need to update the references to use ansible_facts[<thing>] instead of ansible_<thing>. Related: ansible#73654 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406 Signed-off-by: Alex Schultz <aschultz@redhat.com> (cherry picked from commit `a7f2fa73e6`)	2021-03-26 00:16:58 +01:00
Guillaume Abrioux	2cd8c3637c	fix 'command -v' tasks `command -v` is a shell builtin which needs a shell to run. Fixes: #6325 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `14c472707c`)	2021-03-22 13:53:11 +01:00
Guillaume Abrioux	0d0723298f	purge: rm service-cid files This commit makes sure purge playbooks remove those files if for any reason they have been left behind. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1920900 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `b9dd253a4f`)	2021-03-11 13:52:48 +01:00
Guillaume Abrioux	932abbc8cf	switch2container: do not serialize the ceph-crash migration There's no need to slow down the playbook execution time by migrating all the `ceph-crash` instances in a serial way. Let's remove the `serial: 1` so the migration is achieved in a parallel way. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `980a5a7df4`)	2021-03-11 13:52:39 +01:00
Dimitri Savineau	8f26ffdbac	rolling_update: enforce ceph-container-engine When running the rolling_update.yml playbook and adding the dashboard component at the same time, the requirements (like container packages) aren't installed. This could lead to a failure when using authentication on the container registry because the playbook will try to log in to the registry but podman/docker aren't installed yet. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `48a456dc8c`)	2021-03-11 13:52:21 +01:00
Dimitri Savineau	3ba27c9387	rolling_update: exclude clients from node-exporter Since `b105549` we don't install node-exporter on client nodes so we should also exclude the client node from the node-exporter upgrade. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `94af3c87d1`)	2021-03-11 13:52:02 +01:00
Guillaume Abrioux	1b424ad5e9	purge: zap and destroy db and wal devices for lvm batch Those devices (db/wal) are never zapped in lvm batch deployment. Iterating over `dedicated_devices` and `bluestore_wal_devices` fixes this issue. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1922926 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `984191ac7f`)	2021-03-11 13:51:38 +01:00
Guillaume Abrioux	bb1f66cb51	switch2container: fix mon quorum check The current check makes no sense because it checks whether any monitor other than the one being played (either a previous one already converted or a next one that isn't yet converted) is present in the quorum. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `175ffa1b88`)	2021-03-11 13:50:27 +01:00
Guillaume Abrioux	858048560e	update: fix require-osd-release task This commit fixes two issues in rolling_update.yml: - `container_exec_cmd_update_osd` is unset in the `complete osd upgrade` play so it never runs the command in a container. - the 'require-osd-release' task is never applied because the condition looks for luminous release. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1930164 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2021-02-18 22:22:06 +01:00
Guillaume Abrioux	b903446fa4	containers: use --cpus instead of --cpu-quota When using docker 1.13.1, the current condition: ``` {% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%} ``` is wrong because it compares the first digit (1) whereas it should compare the second one. It means we always use `--cpu-quota` although the documentation recommends using `--cpus` when the docker version is 1.13.1 or higher. From the doc: > --cpu-quota=<value> Impose a CPU CFS quota on the container. The number of > microseconds per --cpu-period that the container is limited to before > throttled. As such acting as the effective ceiling. > If you use Docker 1.13 or higher, use --cpus instead. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3e262e072b`)	2021-01-28 16:37:50 -05:00
Guillaume Abrioux	a36eee1852	fs2bs: skip migration when a mix of fs and bs is detected Since the default of `osd_objectstore` has changed as of 3.2, some deployments might have a mix of filestore and bluestore OSDs on a same node. In some specific cases, there's a possibility that a filestore OSD shares a journal/db device with a bluestore OSD. We shouldn't try to redeploy in this context because ceph-volume will complain. (either because in lvm batch you can't pass partition or about gpt header). The safest option is to skip the migration on the node when such a mix is detected or force all osds including those already using bluestore (option `force_filestore_to_bluestore=True` has to be passed as an extra var). If all OSDs are using filestore, then they will be migrated to bluestore. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `e66f12d138`)	2021-01-22 11:37:40 -05:00
Guillaume Abrioux	607ef5a7d2	common: do not use pipefail when not needed Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks where we don't need `set -o pipefail` Fixes: #6090 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `86a8889ee3`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	ba312a5b5d	lint: ignore 302,303,505 errors ignore 302,303 and 505 errors [302] Using command rather than an argument to e.g. file [303] Using command rather than module [505] referenced files must exist they aren't relevant on these tasks. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `195d88fcda`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	8a8a082693	lint: do not use 'local_action' Fix ansible-lint 504 error: [504] Do not use 'local_action', use 'delegate_to: localhost' Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `c948b668eb`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	ace031e86e	lint: trailing whitespace Fix ansible-lint 201 error: [201] Trailing whitespace Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `dfc7e6e4bd`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	72fc8877cb	lint: all tasks should be named Fix ansible-lint 502 error: [502] All tasks should be named Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `97dd9218dd`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	ab62d27c44	lint: use shell only when shell functionality is required Fix ansible-lint 305 error: [305] Use shell only when shell functionality is required Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `11b4bf5083`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	2a0e07cfd7	lint: don't compare to literal true/false Fix ansible lint 601 error: [601] Don't compare to literal True/False Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `2011e4dbc8`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	87d53fea08	lint: variables should have spaces before and after Fix ansible lint 206 error: [206] Variables should have spaces before and after: {{ var_name }} Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `9fba6eecfa`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	35e738c681	lint: commands should not change things Fix ansible lint 301 error: [301] Commands should not change things if nothing needs doing Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `5450de58b3`)	2020-12-16 14:05:45 +01:00
Guillaume Abrioux	92b261df89	lint: set pipefail on shell tasks Fix ansible lint 306 error: [306] Shells that use pipes should set the pipefail option Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `1879c26eb9`)	2020-12-16 14:05:45 +01:00
Dimitri Savineau	3f16132e44	library: add ceph_osd_flag module This adds ceph_osd_flag ansible module for replacing the command module usage with the ceph osd set/unset commands. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `5da593604a`)	2020-12-15 17:36:28 +01:00
Guillaume Abrioux	1ac034a802	switch2containers: do not stop ceph.target in osd play `ceph.target` should be disabled only. Otherwise, in collocation scenario you stop other collocated services in the OSD play which isn't what we want to do. Each daemon has its corresponding play for managing the transition to container. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `0b05620597`)	2020-12-15 17:32:23 +01:00
Guillaume Abrioux	1fcf71dc33	common: drop `fetch_directory` feature This commit drops the `fetch_directory` feature. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `1cc9666c09`)	2020-12-15 17:30:42 +01:00
Guillaume Abrioux	d14723d5b4	mon: refact initial keyring generation adding monitor is no longer possible because we generate a new mon keyring each time the playbook is run. Fixes: #5864 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1902281 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `970c6a4ee6`)	2020-12-01 09:53:26 -05:00
Dimitri Savineau	f917bb015c	ceph_key: set state as optional Most ansible module using a state parameter default to the present value (when available) instead of using it as a mandatory option. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `abb4023d76`)	2020-12-01 09:53:26 -05:00
Dimitri Savineau	ed9c51ff5a	switch2container: chown symlink in mon/mgr plays `fa2bb3a` only fix the symlink owner/group issue in the OSD play. If the OSDs are collocated with other services like MONs and MGRs then the chown command will fail. $ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} + chown: cannot dereference './block': Permission denied Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `35ed9977aa`)	2020-11-16 16:37:04 -05:00
Dimitri Savineau	33f74771d2	switch2container: disable ceph-osd enabled-runtime When deploying the ceph OSD via the packages then the ceph-osd@.service unit is configured as enabled-runtime. This means that each ceph-osd service will inherit from that state. The enabled-runtime systemd state doesn't survive after a reboot. For non containerized deployment the OSDs are still starting after a reboot because there's the ceph-volume@.service and/or ceph-osd.target units that are doing the job. $ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t ceph-osd@.service enabled-runtime ceph-volume@.service enabled ceph-osd.target enabled When switching to containerized deployment we are stopping/disabling ceph-osd@XX.service, ceph-volume and ceph.target and then removing the systemd unit files. But the new systemd units for containerized ceph-osd service will still inherit from ceph-osd@.service unit file. As a consequence, if an OSD host is rebooting after the playbook execution then the ceph-osd services won't come back because they aren't enabled at boot. This patch also adds a reboot and testinfra run after running the switch to container playbook. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `fa2bb3af86`)	2020-11-12 21:08:32 +01:00
Dimitri Savineau	522e183d8f	rolling_update: use ceph health instead of ceph -s The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the cluster health, we're using the health structure in the ceph status output. To optimize this, we could use the ceph health command which contains the same needed information. $ ceph status -f json | wc -c 2001 $ ceph health -f json | wc -c 46 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `acddf4fb67`)	2020-11-03 14:38:49 -05:00
Dimitri Savineau	bcd2797d11	rgw/rbdmirror: use service dump instead of ceph -s The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the rgw/rbdmirror services status, we're only using the servicemap structure in the ceph status output. To optimize this, we could use the ceph service dump command which contains the same needed information. This command returns less information and is slightly faster than the ceph status command. $ ceph status -f json | wc -c 2001 $ ceph service dump -f json | wc -c 1105 $ time ceph status -f json > /dev/null real 0m0.557s user 0m0.516s sys 0m0.040s $ time ceph service dump -f json > /dev/null real 0m0.454s user 0m0.434s sys 0m0.020s Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `3f9081931f`)	2020-11-03 14:38:49 -05:00
Dimitri Savineau	69b51b5f19	monitor: use quorum_status instead of ceph status The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the quorum status, we're only using the quorum_names structure in the ceph status output. To optimize this, we could use the ceph quorum_status command which contains the same needed information. This command returns less information. $ ceph status -f json | wc -c 2001 $ ceph quorum_status -f json | wc -c 957 $ time ceph status -f json > /dev/null real 0m0.577s user 0m0.538s sys 0m0.029s $ time ceph quorum_status -f json > /dev/null real 0m0.544s user 0m0.527s sys 0m0.016s Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `88f91d8c12`)	2020-11-03 14:38:49 -05:00
Dimitri Savineau	1185b7e86a	osds: use pg stat command instead of ceph status The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the pgs state, we're using the pgmap structure in the ceph status output. To optimize this, we could use the ceph pg stat command which contains the same needed information. This command returns less information (only about pgs) and is slightly faster than the ceph status command. $ ceph status -f json | wc -c 2000 $ ceph pg stat -f json | wc -c 240 $ time ceph status -f json > /dev/null real 0m0.529s user 0m0.503s sys 0m0.024s $ time ceph pg stat -f json > /dev/null real 0m0.426s user 0m0.409s sys 0m0.016s The data returned by the ceph status is even bigger when using the nautilus release. $ ceph status -f json | wc -c 35005 $ ceph pg stat -f json | wc -c 240 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `ee50588590`)	2020-11-03 14:38:49 -05:00
Guillaume Abrioux	4a56537680	fs2bs: support `osd_auto_discovery` scenario This commit adds the `osd_auto_discovery` scenario support in the filestore-to-bluestore playbook. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881523 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> Co-authored-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `8b1eeef18a`)	2020-09-29 10:48:36 -04:00
Guillaume Abrioux	25e23b052b	ansible.cfg: remove cfg file in infrastructure-playbooks There's no need to have a copy of this file in the infrastructure-playbooks directory. Playbooks in that directory can be run from the root dir of ceph-ansible. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `f906caa6da`)	2020-09-29 16:31:33 +02:00
Guillaume Abrioux	c0755b1820	ansible.cfg: set force_valid_group_names param As of 2.10, group names containing a dash are invalid. However, setting this option makes it still possible to use a dash in group names and prevents this warning from showing up. It might need to be definitively addressed in a future ansible release. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1880476 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `6938ed1302`)	2020-09-29 16:31:33 +02:00
Dimitri Savineau	05d4e76d42	switch2container: chown symlink for devices If the OSD directory is using symlinks for referencing devices (like block, db, wal for bluestore and journal for filestore) then the chown command could fail to change the owner:group on some system. $ ls -hl /var/lib/ceph/osd/ceph-0/ total 28K lrwxrwxrwx 1 ceph ceph 92 Sep 15 01:53 block -> /dev/ceph-45113532-95ca-471b-bd75-51de46f1339c/osd-data-570a1aee-60c0-44c9-8036-ffed7d67a4e6 -rw------- 1 ceph ceph 37 Sep 15 01:53 ceph_fsid -rw------- 1 ceph ceph 37 Sep 15 01:53 fsid -rw------- 1 ceph ceph 55 Sep 15 01:53 keyring -rw------- 1 ceph ceph 6 Sep 15 01:53 ready -rw------- 1 ceph ceph 3 Sep 15 02:00 require_osd_release -rw------- 1 ceph ceph 10 Sep 15 01:53 type -rw------- 1 ceph ceph 2 Sep 15 01:53 whoami $ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} + chown: cannot dereference './block': Permission denied $ find /var/lib/ceph/osd/ceph-0 -not -user 167 /var/lib/ceph/osd/ceph-0/block Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `da4280e243`)	2020-09-15 15:30:21 -04:00
Dimitri Savineau	dac0415d75	switch2container: remove deb systemd units When running the switch2container playbook on a Debian based system, the systemd unit path isn't the same as on a Red Hat based system. Because the systemd unit files aren't removed, the new container systemd unit isn't taken into account. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `c1af69a7e7`)	2020-09-15 15:30:21 -04:00
Guillaume Abrioux	a88f911155	purge: remove potential socket leftover This commit ensures we remove any socket left by ceph and the `ceph-osd-run.sh` script. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861755 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `5e91e0f3e2`)	2020-09-14 16:51:00 -04:00
Dimitri Savineau	43da364188	container: run engine/common roles on first client We already do this in the site-container.yml playbook because we don't need docker/podman installed on all client nodes and having the container image only on the first client node. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `8ecbdc6ede`)	2020-09-10 20:36:08 -04:00
Guillaume Abrioux	851a89b8fc	purge-cluster: use sysfs method for unmapping rbd devices This way we keep consistency with purge-container-cluster.yml playbook. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `f77fa6e2a4`)	2020-09-10 20:35:16 -04:00
Guillaume Abrioux	66dde0034b	ceph-crash: introduce new role ceph-crash This commit introduces a new role `ceph-crash` in order to deploy everything needed for the ceph-crash daemon. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `9d2f2108e1`)	2020-09-10 20:35:04 -04:00
Dimitri Savineau	b745c76491	ceph-facts: only get fsid when monitor are present When running the rolling_update playbook with an inventory without monitor nodes defined (like external scenario) then we can't retrieve the cluster fsid from the running monitor. In this scenario we have to pass this information manually (group_vars or host_vars). Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `f63022dfec`)	2020-09-10 17:42:28 -04:00
Francesco Pantano	858e50da6b	Add --cluster option on ceph require-osd-release command On DCN environments, or when multiple ceph clusters are configured, we need to specify the cluster name before running the command or the rolling_update playbook will fail during minor updates. Closes: https://bugzilla.redhat.com/1876447 Signed-off-by: Francesco Pantano <fpantano@redhat.com> (cherry picked from commit `cb64df30b6`)	2020-09-09 15:11:24 +02:00
Francesco Pantano	2691e385fb	Fix hosts field in rolling_update playbook when mds are processed In the OSP context, during the rolling update the playbook fails with the following error: ''' ERROR! The field 'hosts' has an invalid value, which includes an undefined variable. The error was: list object has no element 0 ''' This PR just changes the hosts field, providing a valid mons group value. Closes: https://bugzilla.redhat.com/1876803 Signed-off-by: Francesco Pantano <fpantano@redhat.com> (cherry picked from commit `e65f9a5c72`)	2020-09-09 15:11:02 +02:00
Guillaume Abrioux	c7f6d15793	shrink-mds: use mds_to_kill_hostname instead When using fqdn in inventory host file, this task will fail because the mds is registered with its shortname. It means we must use `mds_to_kill_hostname` in this task. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `51c382677d`)	2020-08-18 15:10:06 -04:00
Guillaume Abrioux	886e1d85c7	purge: import ceph-defaults in purge osd play Otherwise, `ceph_volume_debug` variable is undefined Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `33a544644a`)	2020-08-13 14:21:44 +02:00
Guillaume Abrioux	88c9f6d969	common: don't enable debug log on ceph-volume calls by default ceph-volume can generate large logs at some point. debug logs by definition should be enabled only when debugging. Let's make it customizable with a variable which is set to `False` by default. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `448cc280b7`)	2020-08-13 14:21:44 +02:00
Benoît Knecht	8e5d1159e0	purge-cluster: check if rbdmap exists When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the second time on the `ensure rbd devices are unmapped` task, because `rbdmap` isn't installed anymore at that point. This commit adds a check that ensures `rbdmap` is available, and skips the `ensure rbd devices are unmapped` task if it isn't. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch> (cherry picked from commit `a57fd7a090`)	2020-08-06 12:04:51 -04:00
Kevin Coakley	659262f687	Remove ceph-radosgw.target when switching to containerize daemons The task "remove old systemd unit file" under "switching from non-containerized to containerized ceph rgw" only removes the ceph-radosgw@.service file. The task should also remove the ceph-radosgw.target file, like the "remove old systemd unit files" tasks for the mons, mgrs, osds, etc, in order to clean up all of the unused systemd unit files. Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu> (cherry picked from commit `d19e6033b2`)	2020-08-06 09:41:23 -04:00
Guillaume Abrioux	8cf17750ee	shrink_osd: remove osd data directory Otherwise it leaves an empty directory. When shrinking and redeploying multiple OSDs you have no guarantee it will reuse the same osd id. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `8933bfde33`)	2020-08-06 13:10:42 +02:00
Benoît Knecht	4052ab29f2	shrink-osd: various fixes This handles missing /etc/ceph/osd, by ensuring we actually found files in `/etc/ceph/osd` before trying to slurp their content. This also adds a missing `| default(False)` to avoid the following error: ``` fatal: [ceph01]: FAILED! => msg: |- The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted' ``` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416 Signed-off-by: Benoît Knecht <bknecht@protonmail.ch> (cherry picked from commit `fe8fbd3ee2`)	2020-08-06 13:10:42 +02:00
Dimitri Savineau	cbdff5f95b	rolling_update: restart mds after the upgrade In addition to `155e2a2`, the active mds daemon isn't stopped/started correctly, unlike the other services, so that daemon doesn't come back after the upgrade. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `ec0a37a74f`)	2020-07-29 17:49:15 -04:00
Dimitri Savineau	7a970ac028	rolling_update: refact dashboard workflow The dashboard upgrade workflow should follow the same process as the ceph upgrade, otherwise any systemd unit modification won't be applied to the monitoring/dashboard stack. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `a6209bd957`)	2020-07-27 10:49:02 -04:00
Dimitri Savineau	15872e3db1	rolling_update: stop/start instead of restart During the daemon upgrade we're: - stopping the service when it's not containerized - running the daemon role - starting the service when it's not containerized - restarting the service when it's containerized This implementation has multiple issues. 1/ We don't use the same service workflow when using containers or baremetal. 2/ The explicit daemon start isn't required since we're already doing this in the daemon role. 3/ Any non backward compatible changes in the systemd unit template (for containerized deployment) won't work due to the restart usage. This patch refactors the rolling_update playbook by using the same service stop task for both containerized and baremetal deployment at the start of the upgrade play. It removes the explicit service start task because it's already included in the dedicated role. The service restart tasks for containerized deployment are also removed. The following comment isn't valid because we should have backported the ceph-crash implementation in stable-4.0 before this commit, which was not possible because of the needed tag v4.0.25.1 (async release for 4.1z1): ~~Finally, this adds the missing service stop task for ceph crash upgrade workflow.~~ Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `155e2a23d5`)	2020-07-27 09:43:01 -04:00
Guillaume Abrioux	02e7468b4a	update: use tasks_from when including ceph-facts When setting/unsetting osd flags, we can use `tasks_from` when importing the `ceph-facts` role to save some time, given that we only need this role for setting `container_binary`. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `d66b698be2`)	2020-07-23 17:26:04 +02:00
Dimitri Savineau	5db4219f26	facts: explicitly disable facter and ohai By default, ansible gathers facts from facter and ohai if installed on the remote nodes, given we don't need them, let's exclude these facts from our facts gathering Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `c95adc564b`)	2020-07-20 21:23:48 +02:00
Guillaume Abrioux	518f4f579d	rgw: fix multi instances scaleout When rgw and osd are collocated, the current workflow prevents from scaling out the radosgw_num_instances parameter when rerunning the playbook. The environment file used in the rgw systemd template is rendered when executing the `ceph-rgw` role but during a new run of the playbook (in order to scale out rgw instances), handlers are triggered from `ceph-osd` role which is run before `ceph-rgw`, therefore it tries to start the new rgw daemon whereas its corresponding environment file hasn't been rendered yet and fails like following: ``` ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory ``` This commit moves the tasks generating this file in `ceph-config` role so it is generated early. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `7dd68b9ac1`)	2020-07-20 21:23:27 +02:00
Guillaume Abrioux	328db8bee1	rolling_update: add any_errors_fatal If a failure occurs in ceph-validate, the upgrade playbook keeps running where we expect it to fail. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `8f9cdf4b10`)	2020-07-20 21:22:25 +02:00
Dimitri Savineau	a99c94ea11	ceph-osd: remove ceph-osd-run.sh script Since we only have one scenario since nautilus then we can just move the container start command from ceph-osd-run.sh to the systemd unit service. As a result, the ceph-osd-run.sh.j2 template and the ceph_osd_docker_run_script_path variable are removed. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `829990e60d`)	2020-06-23 17:35:01 +02:00
Guillaume Abrioux	4e42503218	docker2podman: make images pulling optional This commit makes the images pulling skipped if podman isn't installed on the machine. In OSP context, the podman installation is done later in the workflow, it means all `podman pull` commands will fail. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `37b20b6525`)	2020-06-22 14:46:38 -04:00
Guillaume Abrioux	085341642e	switch-to-containers: set and unset osd flags The workflow in this playbook should be the same as in rolling_update: we should first set the noout and nodeep-scrub flags before migrating the first osd and unset the osd flags after the last osd is migrated. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `2cfaa056e0`)	2020-06-17 12:15:49 -04:00
Guillaume Abrioux	c847c2f117	switch_to_containers: don't set noup flag We shouldn't set this flag when running switch_to_containers playbook. Otherwise the playbook fails waiting for pgs to be clean. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `b91d60d384`)	2020-06-17 09:24:19 -04:00
Dimitri Savineau	a165edb5ba	switch_to_container: fix osd systemd regex The systemd LOAD and ACTIVE fields could have more than one space between both values. This updates the systemd regex the same way we're using it in different parts of the code. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `50140c9b5d`)	2020-06-16 12:10:36 -04:00
Dimitri Savineau	a97e24fee9	docker2podman: manage dashboard nodes The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus) were not manage during the docker to podman migration. This adds the systemd container template of those services to a dedicated file (systemd.yml) in order to include it in the docker2podman playbook. This also adds the dashboard container images pull from docker to podman. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `252e78b4e4`)	2020-06-03 13:20:24 -04:00
Dimitri Savineau	6f893e5ed9	docker2podman: pull images from docker daemon The docker2podman playbook only installs the podman package and updates the systemd units with the right container_binary value. We never pull the container image so if one service is restarted then the container image will be pulled first before the service can start which could cause longer downstream. To avoid to download the container image from internet again we can just pull it from the local docker daemon. The container_{binding,package,service}_name variables are removed because they are only used in the ceph-container-engine role which isn't call in this playbook. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `d38f21aeba`)	2020-06-03 13:20:24 -04:00
Dimitri Savineau	8c4865cd14	rolling_update: fix rbdmirror group name The rbdmirror group name was using the wrong variable definition. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `c0a213f928`)	2020-06-03 13:20:03 -04:00
Dimitri Savineau	1921ace52d	docker-to-podman: conditional docker commands The docker commands should be based on the container_binary variable, otherwise running the playbook on a host without docker (like podman only) will fail. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829985 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-06-03 13:19:28 -04:00
Dimitri Savineau	46c640c169	filestore-to-bluestore: fix py2 on skipped tasks When using skipped variables with from_json filter and python2 then we need to have a default value otherwise the skipped task will fail. Unexpected templating type error occurred on ({{ (ceph_volume_lvm_list.stdout \| from_json) }}): expected string or buffer Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `2b9edba131`)	2020-04-20 13:38:19 -04:00
Guillaume Abrioux	ba6bd3ca3d	docker2podman: call `container_options_facts.yml` on osd nodes We must call `container_options_facts.yml` from the `ceph-osd` role because ceph-osd-run.sh.j2 needs variables set in this file. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1819681 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `4a4f54f6ee`)	2020-04-02 11:01:14 -04:00
Guillaume Abrioux	32f879de32	purge-container: get all osds id Adding `--all` to the `systemctl list-units` command in order to get all osds id on the node (including stopped osds). Otherwise, it will purge the cluster but there will be leftovers after that. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1814542 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `5e7962ccf6`)	2020-03-31 11:00:41 -04:00
Dimitri Savineau	e2f1a0ade8	doc: update infra playbooks statements We don't need to copy the infrastructure playbooks in the root ceph-ansible directory. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `195944b123`)	2020-03-16 14:43:35 +01:00
Dimitri Savineau	957156c0fe	filestore-to-bluestore: stop ceph-volume services We only disable the ceph-osd services but not the ceph-volume lvm services during the filestore to bluestore migration. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `38a683e5bf`)	2020-03-12 21:10:33 +01:00
Dimitri Savineau	928c792f8d	filestore-to-bluestore: reuse dedicated journal If the filestore configuration was using a dedicated journal with either a partition or a LV/VG then we need to reuse this for bluestore DB. When filestore is using a raw devices then we shouldn't destroy everything (data + journal) but only data otherwise the journal partition won't exist anymore. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `535da53d69`)	2020-03-12 21:10:33 +01:00
Dimitri Savineau	3b0ee83594	shrink-rbdmirror: fix presence after removal We should add retry/delay to check the presence of the rbdmirror daemon in the cluster status because the status takes some time to be updated. Also the metadata.hostname isn't a good key to check because it doesn't reflect the ansible_hostname fact. We should use metadata.id instead. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `d1316ce77b`)	2020-03-03 15:19:45 +01:00
Dimitri Savineau	4b07d97346	shrink-mgr: fix systemd condition This playbook was using mds systemd condition. Also a command task was using pipeline which is not allowed. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `a664159061`)	2020-03-03 15:19:45 +01:00
Dimitri Savineau	92b671bcbe	shrink: don't use localhost node The ceph-facts are running on localhost so if this node is using a different OS/release than the ceph node we can have a mismatch between docker/podman container binary. This commit also reduces the scope of the ceph-facts role because we only need the container_binary tasks. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `08ac2e3034`)	2020-03-03 15:19:45 +01:00
Dimitri Savineau	e037e99bd2	purge: stop rgw instances by iteration It looks like the service module doesn't support wildcards anymore for stopping/disabling multiple services. fatal: [rgw0]: FAILED! => changed=false msg: 'This module does not currently support using glob patterns, found '''' in service name: ceph-radosgw@' ...ignoring Instead we should iterate over the rgw_instances list. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `9d3b49293d`)	2020-03-03 10:31:48 +01:00
Guillaume Abrioux	5a51bd12dc	common: support OSDs with more than 2 digits When running an environment with OSDs whose IDs have more than 2 digits, some tasks don't match the system units and therefore the playbook can fail. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a084a2a347`)	2020-02-28 11:06:47 -05:00
Guillaume Abrioux	d254a8b938	shrink-osd: support shrinking ceph-disk prepared osds This commit adds the ceph-disk prepared osds support Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `1de2bf9991`)	2020-02-26 18:16:48 +01:00
Guillaume Abrioux	21851457d6	shrink-osd: don't run ceph-facts entirely We need to call ceph-facts only for setting `container_binary`. Since this task has been isolated we can use `tasks_from` to only execute the needed task. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `55970b18f1`)	2020-02-26 18:16:48 +01:00
Benoît Knecht	10b3bb2727	infrastructure-playbooks: Run shrink-osd tasks on monitor Instead of running shrink-osd tasks on localhost and delegating most of them to the first monitor, run all of them on the first monitor directly. This has the added advantage of becoming root on the monitor only, not on localhost. Signed-off-by: Benoît Knecht <bknecht@protonmail.ch> (cherry picked from commit `8b3df4e418`)	2020-02-24 16:51:33 -05:00
Guillaume Abrioux	1d2a395aaf	switch_to_containers: increase health check values This commit increases the default values for the following variables consumed in the switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook. This also moves these variables into the `ceph-defaults` role so the user can set different values if needed. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3700aa5385`)	2020-02-10 12:57:17 -05:00
Guillaume Abrioux	cdc3e10cf3	purge/update: remove backward compatibility legacy This was introduced in 3.1 and marked as deprecated. We can definitely drop it in stable-4.0. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `0441812959`)	2020-02-03 09:33:05 -05:00
Guillaume Abrioux	5c3ba0787c	switch_to_containers: exclude clients nodes from facts gathering just like site.yml and rolling_update, let's exclude clients node from the fact gathering. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `332c39376b`)	2020-02-03 09:32:20 -05:00
Dimitri Savineau	487be2675a	filestore-to-bluestore: skip bluestore osd nodes If the OSD node is already using bluestore OSDs then we should skip all the remaining tasks to avoid purging OSD for nothing. Instead we warn the user. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `83c5a1d7a8`)	2020-02-03 15:16:51 +01:00
Guillaume Abrioux	675b6788f4	update: remove legacy tasks These tasks should have been removed with backport #4756 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-02-03 15:16:13 +01:00
wujie1993	dcd4b2955a	purge: fix purge cluster failed Fix the purge cluster failure when local container images do not exist. Purge node-exporter and grafana-server only when dashboard_enabled is set to True. Signed-off-by: wujie1993 qq594jj@gmail.com (cherry picked from commit `d8b0b3cbd9`)	2020-02-03 15:14:56 +01:00
Dimitri Savineau	f982a70f02	filestore-to-bluestore: fix undefine osd_fsid_list If the playbook is used on a host running bluestore OSDs then the osd_fsid_list won't be filled because the bluestore OSDs are reported with 'type: block' via ceph-volume lvm list command but we are looking for 'type: data' (filestore). TASK [zap ceph-volume prepared OSDs] ********* fatal: [xxxxx]: FAILED! => msg: '''osd_fsid_list'' is undefined Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `cd76054f76`)	2020-01-28 22:21:49 -05:00
Dimitri Savineau	0a2927ce5e	filestore-to-bluestore: don't fail when with no PV When the PV is already removed from the devices then we should not fail to avoid errors like: stderr: No PV found on device /dev/sdb. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `a9c2300545`)	2020-01-24 16:14:47 -05:00
Guillaume Abrioux	fd217d9f08	rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node When upgrading from RHCS 3.x where ceph-metrics was deployed on a dedicated node to RHCS 4.0, it fails like following: ``` fatal: [magna005]: FAILED! => changed=false gid: 0 group: root mode: '0755' msg: 'chown failed: failed to look up user ceph' owner: root path: /etc/ceph secontext: unconfined_u:object_r:etc_t:s0 size: 4096 state: directory uid: 0 ``` because we are trying to run `ceph-config` on this node, it doesn't make sense so we should simply run this play on all groups except `[grafana-server]`. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `e5812fe45b`)	2020-01-22 18:28:54 +01:00
Dimitri Savineau	0abea70e29	filestore-to-bluestore: fix osd_auto_discovery When osd_auto_discovery is set then we need to refresh the ansible_devices fact after the filestore OSD purge, otherwise the devices fact won't be populated. Also remove the gpt header on ceph_disk_osds_devices because the devices is empty at this point for osd_auto_discovery. Adding the bool filter when needed. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `bb3eae0c80`)	2020-01-22 10:06:17 +01:00
Dimitri Savineau	e4965e9ea9	filestore-to-bluestore: --destroy with raw devices We still need --destroy when using a raw device otherwise we won't be able to recreate the lvm stack on that device with bluestore. Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151' Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151' /dev/sdd: physical volume not initialized. --> Was unable to complete a new OSD, will rollback changes Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `f995b079a6`)	2020-01-21 18:26:55 +01:00
Guillaume Abrioux	0db611ebf8	shrink-mds: fix condition on fs deletion the new ceph status registered in `ceph_status` will report `fsmap.up` = 0 when it's the last mds given that it's done after we shrink the mds, it means the condition is wrong. Also adding a condition so we don't try to delete the fs if a standby node is going to rejoin the cluster. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3d0898aa5d`)	2020-01-15 11:28:12 +01:00
Guillaume Abrioux	2d85fab02d	osd: support scaling up using --limit This commit leaves add-osd.yml in place but marks the playbook as deprecated. Scaling up OSDs is now possible using --limit. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3496a0efa2`)	2020-01-14 09:12:34 -05:00
Guillaume Abrioux	e034a6da69	docker2podman: use set_fact to override variables play vars have lower precedence than role vars and `set_fact`. We must use a `set_fact` to reset these variables. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `b0c491800a`)	2020-01-10 17:41:27 +01:00
Guillaume Abrioux	02ec088568	docker2podman: force systemd to reload config This is needed after a change is made in systemd unit files. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `1c2ec9fb40`)	2020-01-10 17:41:27 +01:00
Guillaume Abrioux	34c4f5baac	docker2podman: install podman This commit adds a package installation task in order to install podman during the docker-to-podman.yml migration playbook. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `d746575fd0`)	2020-01-10 17:41:27 +01:00
Guillaume Abrioux	4c4b0edfec	update: only run post osd upgrade play on 1 mon There is no need to run these tasks n times from each monitor. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `c878e99589`)	2020-01-10 17:16:51 +01:00

678 Commits (6485e1a69ed90b446c2cf2d9fdbf703cd8105d6d)