Commit Graph

193 Commits (57f9553798f6db71fb31a6bbbe7141f5c7aac387)

Author SHA1 Message Date
Guillaume Abrioux f0413c4a2b update: do not gather facts on each play
There's no benefit to gathering facts again on each play in
rolling_update.yml.
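
A minimal sketch of the pattern, assuming facts are gathered once in an
initial play (group names illustrative):

```yaml
# First play gathers facts once for all nodes.
- hosts: all
  gather_facts: true
  tasks:
    - name: noop
      debug:
        msg: "facts gathered"

# Later plays reuse the cached facts instead of gathering them again.
- hosts: mons
  gather_facts: false
  tasks:
    - name: use a fact gathered by the first play
      debug:
        msg: "{{ ansible_facts['hostname'] }}"
```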

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
2021-06-30 20:40:15 +02:00
Guillaume Abrioux a391dad8e1 dashboard: fix typo introduced during backport
During the backport of c8b92deba1, the pattern should have been
s/monitoring_group_name/grafana_server_group_name/.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964907

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ac0a5c1e68)
2021-05-26 18:51:18 +02:00
Guillaume Abrioux 2d59f4579b update: fix ceph-crash stop task
This is a workaround for an issue in Ansible.
When trying to stop/mask/disable this service in a single task, the stop
doesn't actually happen: the task doesn't fail, but for some reason the
container is still present and running.
The task starting the service in the ceph-crash role then fails because
it can't start the container, since one is already running with the same
name.
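
A hedged sketch of the workaround: stop the service in its own task,
then disable/mask separately (the unit name is an assumption):

```yaml
- name: stop the ceph-crash service
  systemd:
    name: ceph-crash.service  # unit name assumed, may be templated
    state: stopped
  failed_when: false

- name: disable and mask the ceph-crash service
  systemd:
    name: ceph-crash.service
    enabled: false
    masked: true
```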

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
2021-05-05 09:47:32 +02:00
Guillaume Abrioux 5fd299e358 update: followup on 07029e1
The playbook must fail anyway; the `rescue` block was introduced to
unmask the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux 82b934cfc1 rolling_update: unmask monitor service after a failure
If for some reason the playbook fails after the service has been
stopped, disabled, and masked, but before it gets restarted, enabled,
and unmasked, it leaves the service masked. This can confuse users and
forces them to unmask the unit manually.
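
A minimal `block`/`rescue` sketch of this behaviour, assuming a
systemd-managed mon unit:

```yaml
- name: upgrade the monitor
  block:
    - name: mask and stop the mon service
      systemd:
        name: "ceph-mon@{{ ansible_facts['hostname'] }}"
        state: stopped
        masked: true
    # ... upgrade tasks ...
  rescue:
    - name: unmask the unit so a failed run doesn't leave it masked
      systemd:
        name: "ceph-mon@{{ ansible_facts['hostname'] }}"
        masked: false
    - name: the playbook must still fail
      fail:
        msg: "mon upgrade failed, unit has been unmasked"
```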

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 07029e1bf1)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux a8420d41c6 update: stop ceph-crash service before upgrading
This adds the missing service stop task for the ceph-crash upgrade
workflow.

It should have been added through commit
`15872e3db1e342238636bc9c8e1aef6bd1d3dcd8` in stable-4.0, but at the
time that patch was backported, ceph-crash wasn't implemented yet, so
the ceph-crash related content was removed from it. ceph-crash was
implemented later, so this part of the patch was still missing in
stable-4.0.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943471

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-26 16:18:50 +01:00
Alex Schultz 7ddbe74712 Use ansible_facts
It has come to our attention that using ansible_* vars populated with
INJECT_FACTS_AS_VARS=True is not very performant. In order to be able
to support setting that to off, we need to update the references to use
ansible_facts[<thing>] instead of ansible_<thing>.
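
For example (illustrative fact names):

```yaml
# Before: requires INJECT_FACTS_AS_VARS=True
- debug:
    msg: "{{ ansible_hostname }} runs {{ ansible_distribution }}"

# After: also works with inject_facts_as_vars = False
- debug:
    msg: "{{ ansible_facts['hostname'] }} runs {{ ansible_facts['distribution'] }}"
```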

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:16:58 +01:00
Dimitri Savineau 8f26ffdbac rolling_update: enforce ceph-container-engine
When running the rolling_update.yml playbook and adding the dashboard
component at the same time, the requirements (like container packages)
aren't installed.
This can lead to a failure when the container registry requires
authentication, because the playbook tries to log in to the registry
before podman/docker is installed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 48a456dc8c)
2021-03-11 13:52:21 +01:00
Dimitri Savineau 3ba27c9387 rolling_update: exclude clients from node-exporter
Since b105549 we don't install node-exporter on client nodes, so we
should also exclude client nodes from the node-exporter upgrade.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 94af3c87d1)
2021-03-11 13:52:02 +01:00
Guillaume Abrioux 858048560e update: fix require-osd-release task
This commit fixes two issues in rolling_update.yml:

- `container_exec_cmd_update_osd` is unset in the `complete osd upgrade`
play so it never runs the command in a container.
- the 'require-osd-release' task is never applied because the condition
  looks for the luminous release.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1930164

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-18 22:22:06 +01:00
Guillaume Abrioux 607ef5a7d2 common: do not use pipefail when not needed
Let's discard ansible-lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`.

Fixes: #6090

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86a8889ee3)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 72fc8877cb lint: all tasks should be named
Fix ansible-lint 502 error:

[502] All tasks should be named

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 97dd9218dd)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 35e738c681 lint: commands should not change things
Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b3)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 92b261df89 lint: set pipefail on shell tasks
Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option
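
The fix looks like this (a representative task, not one lifted from the
repo):

```yaml
- name: get the running ceph version
  shell: |
    set -o pipefail
    ceph --version | awk '{print $3}'
  args:
    executable: /bin/bash
  register: ceph_version
  changed_when: false
```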

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
2020-12-16 14:05:45 +01:00
Dimitri Savineau 3f16132e44 library: add ceph_osd_flag module
This adds a ceph_osd_flag Ansible module to replace command module
usage for the ceph osd set/unset commands.
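
A usage sketch, assuming the module follows the usual name/state
convention of the other ceph-ansible modules:

```yaml
- name: set the noout flag
  ceph_osd_flag:
    name: noout
    state: present

- name: unset the noout flag
  ceph_osd_flag:
    name: noout
    state: absent
```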

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
2020-12-15 17:36:28 +01:00
Dimitri Savineau f917bb015c ceph_key: set state as optional
Most Ansible modules with a state parameter default to the present
value (when available) instead of making it a mandatory option.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit abb4023d76)
2020-12-01 09:53:26 -05:00
Dimitri Savineau 522e183d8f rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in
variables and/or facts, which can consume resources for nothing.
When checking the cluster health, we only use the health structure in
the ceph status output.
To optimize this, we can use the ceph health command, which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46
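
A sketch of the resulting check, assuming `container_exec_cmd` wraps
the command on containerized deployments (variable names assumed):

```yaml
- name: wait for the cluster to be healthy
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} health -f json"
  register: ceph_health
  until: (ceph_health.stdout | from_json).status == 'HEALTH_OK'
  retries: 30
  delay: 10
  changed_when: false
```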

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 69b51b5f19 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in
variables and/or facts, which can consume resources for nothing.
When checking the quorum status, we only use the quorum_names structure
in the ceph status output.
To optimize this, we can use the ceph quorum_status command, which
contains the same needed information but returns less data.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 1185b7e86a osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in
variables and/or facts, which can consume resources for nothing.
When checking the pgs state, we only use the pgmap structure in the
ceph status output.
To optimize this, we can use the ceph pg stat command, which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 43da364188 container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all the client nodes, only the
container image on the first client node.
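
The pattern amounts to limiting those roles to the first client host
(sketch; group and role names assumed):

```yaml
- hosts: clients
  become: true
  tasks:
    - name: install the container engine only on the first client
      include_role:
        name: ceph-container-engine
      when: inventory_hostname == groups['clients'] | first
```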

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
2020-09-10 20:36:08 -04:00
Guillaume Abrioux 66dde0034b ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-09-10 20:35:04 -04:00
Dimitri Savineau b745c76491 ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory that has no
monitor nodes defined (like the external scenario), we can't retrieve
the cluster fsid from a running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
2020-09-10 17:42:28 -04:00
Francesco Pantano 858e50da6b Add --cluster option on ceph require-osd-release command
In DCN environments, or when multiple Ceph clusters are configured, we
need to specify the cluster name when running the command, otherwise
the rolling_update playbook fails during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
2020-09-09 15:11:24 +02:00
Francesco Pantano 2691e385fb Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just changes the hosts field, providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 15:11:02 +02:00
Guillaume Abrioux 88c9f6d969 common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

Debug logs, by definition, should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-13 14:21:44 +02:00
Dimitri Savineau cbdff5f95b rolling_update: restart mds after the upgrade
In addition to 155e2a2, the active mds daemon isn't stopped/started
correctly, as opposed to the other services, so that daemon doesn't
come back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:49:15 -04:00
Dimitri Savineau 7a970ac028 rolling_update: refact dashboard workflow
The dashboard upgrade workflow should follow the same process as the
ceph upgrade, otherwise any systemd unit modification won't be applied
to the monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:49:02 -04:00
Dimitri Savineau 15872e3db1 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - starting the service when it's not containerized
  - restarting the service when it's containerized

This implementation has multiple issues.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicit daemon start isn't required since we're already
doing this in the daemon role.

3/ Any non-backward-compatible change in the systemd unit template
(for containerized deployments) won't work due to the restart usage.

This patch refactors the rolling_update playbook by using the same
service stop task for both containerized and baremetal deployments at
the start of the upgrade play.
It removes the explicit service start task because it's already
included in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

The following comment isn't valid because ceph-crash would have had to
be backported to stable-4.0 before this commit, which was not possible
because of the needed tag v4.0.25.1 (async release for 4.1z1):

~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 09:43:01 -04:00
Guillaume Abrioux 02e7468b4a update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when
importing the `ceph-facts` role to save some time, given that we only
need this role for setting `container_binary`.
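
A sketch of the import, assuming the `container_binary` logic lives in
its own task file:

```yaml
- name: import ceph-facts role, only the container_binary tasks
  import_role:
    name: ceph-facts
    tasks_from: container_binary
```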

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
2020-07-23 17:26:04 +02:00
Dimitri Savineau 5db4219f26 facts: explicitly disable facter and ohai
By default, Ansible gathers facts from facter and ohai if they are
installed on the remote nodes. Given that we don't need them, let's
exclude these facts from our fact gathering.
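
One way to do this is with the setup module's `gather_subset` option:

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: gather facts, excluding facter and ohai
      setup:
        gather_subset:
          - '!facter'
          - '!ohai'
```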

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-20 21:23:48 +02:00
Guillaume Abrioux 328db8bee1 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps
running whereas we expect it to fail immediately.
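
A minimal sketch of the play-level setting:

```yaml
- hosts: mons
  any_errors_fatal: true
  become: true
  roles:
    - ceph-validate
```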

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-07-20 21:22:25 +02:00
Dimitri Savineau 8c4865cd14 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-06-03 13:20:03 -04:00
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running an environment with OSDs whose IDs have more than 2
digits, some tasks don't match the systemd units and the playbook can
therefore fail.
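
A hedged illustration: match the unit name with an unbounded digit
pattern rather than one or two digits:

```yaml
- name: list ceph-osd units, whatever the OSD id length
  shell: |
    set -o pipefail
    systemctl list-units --no-legend 'ceph-osd@*' | grep -oE 'ceph-osd@[0-9]+\.service'
  args:
    executable: /bin/bash
  register: osd_units
  changed_when: false
```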

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecated.
We can definitely drop it in stable-4.0.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x, where ceph-metrics was deployed on a
dedicated node, to RHCS 4.0, it fails as follows:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

We are trying to run `ceph-config` on this node, which doesn't make
sense, so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Guillaume Abrioux 4c4b0edfec update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times, once from each monitor; a
single monitor is enough.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-10 17:16:51 +01:00
Guillaume Abrioux 6e47e96a02 update: use flags noout and nodeep-scrub only
1. Set the noout and nodeep-scrub flags.
2. Upgrade each OSD node, one by one, waiting for active+clean pgs.
3. After all OSD nodes are upgraded, unset the flags (see the sketch
   below).
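
A compressed sketch of that flow (variable and group names assumed):

```yaml
- name: set flags before upgrading the OSD nodes
  command: "ceph --cluster {{ cluster }} osd set {{ item }}"
  loop:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true

# ... upgrade each OSD node, waiting for active+clean pgs ...

- name: unset flags once all OSD nodes are upgraded
  command: "ceph --cluster {{ cluster }} osd unset {{ item }}"
  loop:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true
```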

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
2020-01-10 17:16:51 +01:00
Dimitri Savineau f042ece9af rolling_update: run registry auth before upgrading
Some tasks in the rolling upgrade playbook use the new container image
and need the registry login to be executed first, otherwise the nodes
won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f344fdefe)
2020-01-09 20:16:07 -05:00
Guillaume Abrioux 5062d4094c update: restart iscsigws daemons after upgrade
In a containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task only
checks that the daemon status is started (e.g. `state: started`).

This commit also removes the task which ensures services are started,
because that's already done in the ceph-iscsigw role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c7708eb458)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux fe8858af38 upgrade: add dashboard deployment
When upgrading from RHCS 3, the dashboard has obviously never been
deployed, which forces us to deploy it manually later.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 451c5ca934)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux e4c657d711 update: add default values when setting fact
This commit adds a default value in the `with_dict` because with
python 2.7, if a task using a `with_dict` has a condition, the loop
expression is evaluated anyway, whereas with python 3 it isn't.
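
i.e. a pattern like the following (variable names illustrative):

```yaml
# Under python 2.7 the with_dict expression is evaluated even when the
# condition is false, so default({}) avoids an undefined-variable error.
- name: iterate over a dict that may be undefined
  debug:
    msg: "{{ item.key }}"
  with_dict: "{{ osd_flags | default({}) }}"
  when: osd_flags is defined
```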

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9823f319b)
2019-10-29 16:00:21 -04:00
Dimitri Savineau 56f0cf79d9 rolling_update: remove default filter on mds group
There's no need to use the default filter on the active/standby
groups, because if the group doesn't exist the play is simply skipped.

Currently this generates warnings like:

[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2ca79fcc99)
2019-10-28 13:08:33 -04:00
Dimitri Savineau ba4059d15a rolling_update: fix active mds host value
The active mds host should be based on the inventory hostname and not
on the ansible hostname.
The value returned under the mdsmap structure is based on the OS
hostname, so we need to find the right node in the inventory with this
value when doing operations on inventory nodes.

Otherwise we could see errors like:

The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1f2352c79)
2019-10-28 13:08:33 -04:00
Dimitri Savineau b547ad9e71 rolling_update: fix reset mon_host variable
mon_host should use the inventory hostname and not the node hostname.
Using the node hostname creates an issue when the inventory and node
hostnames are different.

Closes: #4670

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 650bc0c3f0)
2019-10-26 08:20:54 -04:00
Guillaume Abrioux 3625ea6ef8 update: use right node when creating active mds group
This must be consistent with what is used in the `name` parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d06057ebd2)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux 73d97f525e update: avoid skipping single mds deployment upgrade
Otherwise a single MDS would never be updated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d8ab11d2f8)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux c599af6724 update: skip mds deactivation when no mds in inventory
Let's skip this part of the code if there's no mds node in the
inventory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ec906c3af)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux 4a5d3c3c2d update: add missing quotes
Add missing quotes in order to keep consistency.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d72ff8e5e)
2019-10-21 13:26:37 -04:00
Guillaume Abrioux 9bc7f8a7d7 tests: add multimds coverage
This commit makes the all_daemons scenario deploy 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 25b98b2ce3)
2019-10-18 22:09:04 +02:00