Commit Graph

238 Commits (9f1880464bb41cfaa2b089f35dfbda723ded911c)

Author SHA1 Message Date
Dimitri Savineau 3d3ce26327 rolling_update: fix mgr start with mon collocation
cec994b introduced a regression when a mgr is collocated with a mon.
During the mon upgrade, the mgr service is masked to avoid it being
restarted on package updates.
The start mgr task then fails because the service is still masked.
We should unmask it instead.

Fixes: #5983

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:10:17 +01:00
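
An illustrative sketch of the fix above, assuming the task shape and unit template (this is not the actual rolling_update.yml task):

```yaml
# Hypothetical sketch: unmask the mgr unit as part of starting it,
# instead of leaving it masked after the mon upgrade.
- name: start ceph mgr
  systemd:
    name: "ceph-mgr@{{ ansible_hostname }}"
    state: started
    enabled: yes
    masked: no  # unmask the unit that was masked during the mon upgrade
```
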
Dimitri Savineau 16afe90806 infrastructure: consume ceph_fs module
bd611a7 introduced the new ceph_fs module but missed updating some tasks
in the rolling_update and shrink-mds playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:06:17 +01:00
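
As an illustration, a task converted from the command module to the new module might look like this; the ceph_fs parameters and pool variables shown are assumptions, not the module's documented interface:

```yaml
# Hypothetical sketch of consuming the ceph_fs module
- name: create the cephfs filesystem
  ceph_fs:
    name: "{{ cephfs }}"
    cluster: "{{ cluster }}"
    data: "{{ cephfs_data_pool.name }}"
    metadata: "{{ cephfs_metadata_pool.name }}"
```
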
Dimitri Savineau acddf4fb67 rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts, which can consume resources for nothing.
When checking the cluster health, we only use the health structure in the
ceph status output.
To optimize this, we can use the ceph health command, which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
Dimitri Savineau 88f91d8c12 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts, which can consume resources for nothing.
When checking the quorum status, we only use the quorum_names
structure in the ceph status output.
To optimize this, we can use the ceph quorum_status command, which
contains the same needed information but returns less data.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
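
A sketch of consuming the smaller quorum_status payload (the task shape is an assumption):

```yaml
- name: get quorum status
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} quorum_status --format json"
  register: quorum_status
  changed_when: false

- name: fail if the monitor is not in the quorum
  fail:
    msg: "{{ ansible_hostname }} is not in quorum_names"
  when: ansible_hostname not in (quorum_status.stdout | from_json)['quorum_names']
```
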
Dimitri Savineau ee50588590 osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts, which can consume resources for nothing.
When checking the pgs state, we only use the pgmap structure in the ceph
status output.
To optimize this, we can use the ceph pg stat command, which contains
the same needed information. This command returns less information (only
about pgs) and is slightly faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by ceph status is even bigger when using the
Nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
Dimitri Savineau bd611a785b library: add ceph_fs module
This adds the ceph_fs Ansible module to replace usage of the command
module for running ceph fs commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-10-06 08:02:58 +02:00
Guillaume Abrioux eefe11d90c defaults: change default grafana-server name
This changes the default value of the grafana-server group name.
Some tasks are added in ceph-defaults in order to keep backward
compatibility.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-09-29 07:42:26 +02:00
Dimitri Savineau 50104650e7 add missing boolean filter
Otherwise this will generate an ansible warning about the missing
filter.

[DEPRECATION WARNING]: evaluating xxx as a bare variable, this behaviour
will go away and you might need to add |bool to the expression in the
future.
Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature will
be removed in version 2.12.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-09-28 20:45:01 +02:00
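
The fix amounts to an explicit cast on the conditional; for example (the surrounding task is hypothetical):

```yaml
# Before: bare variable, triggers the deprecation warning
- name: include container tasks
  include_tasks: container.yml
  when: containerized_deployment

# After: explicit boolean filter
- name: include container tasks
  include_tasks: container.yml
  when: containerized_deployment | bool
```
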
Dimitri Savineau 4808523403 rolling_update: remove msgr2 migration
By Pacific we can be sure that users have already completed the msgr2
migration, since it was introduced in Nautilus.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-09-25 19:14:42 +02:00
Dimitri Savineau abb4023d76 ceph_key: set state as optional
Most Ansible modules with a state parameter default to the present
value (when available) instead of making it a mandatory option.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-09-14 14:12:21 -04:00
Dimitri Savineau 8ecbdc6ede container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes, only the container
image on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-09-10 13:19:44 -04:00
Dimitri Savineau f63022dfec ceph-facts: only get fsid when monitors are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like the external scenario), we can't retrieve
the cluster fsid from a running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-09-10 13:19:44 -04:00
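
A sketch of the guarded fact gathering (condition and task shape assumed):

```yaml
# Only query the fsid from a monitor when the inventory actually has one
- name: get current fsid
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} fsid"
  register: current_fsid
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  when: groups.get(mon_group_name, []) | length > 0
```
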
Francesco Pantano e65f9a5c72 Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just changes the hosts field, providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
2020-09-08 11:52:08 -04:00
Francesco Pantano cb64df30b6 Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph clusters are configured,
we need to specify the cluster name before running the command, otherwise
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
2020-09-07 16:31:14 +02:00
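
The shape of the fix, sketched (the task name and release variable are assumptions):

```yaml
- name: set require-osd-release on the target cluster
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd require-osd-release {{ ceph_release }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
```
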
Guillaume Abrioux cec994b973 rolling_update: remove 'ignore_errors'
There's no need to use `ignore_errors: true` on these tasks.

Using a loop on the task stopping mon daemons allows us to avoid
duplicating this task; the `ignore_errors` isn't needed here because it
won't fail the playbook if one of the IDs doesn't exist (shortname vs. fqdn).

Using the right condition on the task starting the mgr daemon allows us
to avoid using an `ignore_errors: true` as well.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-08-21 09:22:36 -04:00
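
A sketch of the looped stop task; how the non-matching unit name is tolerated (here `failed_when: false`) is an assumption:

```yaml
- name: stop ceph mon
  systemd:
    name: "ceph-mon@{{ item }}"
    state: stopped
    enabled: no
    masked: yes
  failed_when: false  # assumed: tolerate whichever of the two IDs doesn't exist
  loop:
    - "{{ ansible_hostname }}"
    - "{{ ansible_fqdn }}"
```
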
Guillaume Abrioux 448cc280b7 common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-08-11 15:03:20 +02:00
Dimitri Savineau ec0a37a74f rolling_update: restart mds after the upgrade
In addition to 155e2a2, the active mds daemon isn't stopped/started
correctly, as opposed to the other services, so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-07-29 16:45:41 -04:00
Dimitri Savineau a6209bd957 rolling_update: refact dashboard workflow
The dashboard upgrade workflow should follow the same process as the ceph
upgrade, otherwise any systemd unit modification won't be applied to the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-07-25 09:35:17 +02:00
Dimitri Savineau 155e2a23d5 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - starting the service when it's not containerized
  - restarting the service when it's containerized

This implementation has multiple issues.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicit daemon start isn't required since we're already
doing this in the daemon role.

3/ Any non-backward-compatible change in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refactors the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployments at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

Finally, this adds the missing service stop task for the ceph-crash
upgrade workflow.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-07-25 09:35:17 +02:00
Guillaume Abrioux 9d2f2108e1 ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-07-21 20:22:12 +02:00
Dimitri Savineau c95adc564b facts: explicitly disable facter and ohai
By default, Ansible gathers facts from facter and ohai if they are
installed on the remote nodes. Given we don't need them, let's exclude
these facts from our fact gathering.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-07-02 17:46:12 +02:00
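
For instance, the setup module's gather_subset can express the exclusion:

```yaml
- name: gather facts without facter and ohai
  setup:
    gather_subset:
      - 'all'
      - '!facter'
      - '!ohai'
```
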
Guillaume Abrioux 8f9cdf4b10 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
whereas we expect it to stop.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-06-29 12:58:53 -04:00
Dimitri Savineau c0a213f928 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-05-13 11:57:42 +02:00
Dimitri Savineau 7b620a22bc rolling_update: require_osd_release pacific
Since [1] we need to set pacific for the required OSD release during the
upgrade.

[1] https://github.com/ceph/ceph/commit/cc99c3bc

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-04-16 15:32:00 -04:00
Guillaume Abrioux 6df7887f87 update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
the `ceph-facts` role to save some time, given that we only need this role
to set `container_binary`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-04-06 17:00:00 +02:00
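
Sketch of the include (the task file name is an assumption):

```yaml
- import_role:
    name: ceph-facts
    tasks_from: container_binary
```
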
Guillaume Abrioux a084a2a347 common: support OSDs with more than 2 digits
When running an environment with OSDs whose IDs have more than 2 digits,
some tasks don't match the systemd units and the playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-27 09:48:36 +01:00
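
The kind of pattern change involved, sketched as a hypothetical task (the real tasks differ):

```yaml
# Match any number of digits in the unit name, not just one or two
- name: collect running osd unit names
  shell: systemctl list-units | grep -oE "ceph-osd@[0-9]+.service"
  register: osd_units
  changed_when: false
  failed_when: false
```
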
Guillaume Abrioux e5812fe45b rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x, where ceph-metrics was deployed on a
dedicated node, to RHCS 4.0, it fails as follows:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

This happens because we are trying to run `ceph-config` on this node; it
doesn't make sense, so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-22 11:29:36 -05:00
Guillaume Abrioux d853da2a68 update: remove legacy
This task is duplicated code, probably legacy; let's remove it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-13 15:18:45 -05:00
Dimitri Savineau 3f344fdefe rolling_update: run registry auth before upgrading
Some tasks in the rolling upgrade playbook use the new container image
and need the registry login to run first, otherwise the nodes won't be
able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-09 16:14:33 -05:00
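
A sketch of the login task run before the upgrade plays; the variable names follow ceph-ansible conventions but are assumptions here:

```yaml
- name: container registry authentication
  command: "{{ container_binary }} login -u {{ ceph_docker_registry_username }} -p {{ ceph_docker_registry_password }} {{ ceph_docker_registry }}"
  changed_when: false
  no_log: true
```
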
Guillaume Abrioux e665d8e239 tests: upgrade from octopus to octopus
on master we can't test an upgrade from stable-4.0/CentOS 7 to
master/CentOS 8.

This commit refactors the upgrade tests so we test upgrading from
master/CentOS 8 to master/CentOS 8 (octopus to octopus).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 11:13:46 +01:00
Guillaume Abrioux c7708eb458 update: restart iscsigws daemons after upgrade
In a containerized context, containers aren't stopped early in the
sequence.
This means they aren't restarted after the upgrade, because the task only
checks that the daemon status is started (eg: `state: started`).

This commit also removes the task which ensures services are started,
because that's already done in the ceph-iscsigw role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-05 13:02:06 -05:00
Guillaume Abrioux 451c5ca934 upgrade: add dashboard deployment
When upgrading from RHCS 3, the dashboard has obviously never been
deployed, which forces us to deploy it manually later.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-05 13:02:06 -05:00
Guillaume Abrioux 0441812959 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecated.
We can definitely drop it in stable-4.0.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-27 10:27:43 -05:00
Guillaume Abrioux c878e99589 update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-20 09:22:19 -05:00
Guillaume Abrioux 548db78b95 update: use flags noout and nodeep-scrub only
1. set the noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, waiting for active+clean pgs,
3. after all osd nodes are upgraded, unset the flags.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
2019-11-20 09:22:19 -05:00
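
The set/unset pair around the OSD upgrade, sketched (the task shape is an assumption):

```yaml
- name: set osd flags before upgrading osd nodes
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd set {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true

- name: unset osd flags after all osd nodes are upgraded
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd unset {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
```
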
Guillaume Abrioux 206ee589d6 update: reset flags before and after each osd node upgrade
Even with the osd flags `noout` and `norebalance` set, PG states can
still change depending on the amount of data written in the meantime,
which means the check for PG states can fail.

This commit changes the way we set those flags:
we set them before each OSD node upgrade and unset them before the PG
state check so the PGs can recover.

Fixes: #3961

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:10:52 -05:00
Guillaume Abrioux e9823f319b update: add default values when setting fact
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-29 14:45:28 -04:00
Dimitri Savineau 2ca79fcc99 rolling_update: remove default filter on mds group
There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.

Currently this generates warnings like:

[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-28 15:02:50 +01:00
Dimitri Savineau f1f2352c79 rolling_update: fix active mds host value
The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returned under the mdsmap structure is based on the OS hostname,
so we need to find the right node in the inventory with this value when
doing operations on inventory nodes.

Otherwise we could see errors like:

The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-28 15:02:50 +01:00
Dimitri Savineau 650bc0c3f0 rolling_update: fix reset mon_host variable
mon_host should use the inventory hostname and not the node hostname;
using the node hostname creates an issue when the inventory and node
hostnames are different.

Closes: #4670

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-25 23:04:35 +02:00
Guillaume Abrioux d06057ebd2 update: use right node when creating active mds group
This must be consistent with what is used in the `name` parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-24 15:15:51 -04:00
Guillaume Abrioux 1122da7f4a update: avoid skipping single mds deployment upgrade
otherwise a single MDS would never be updated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-24 09:28:36 +02:00
Guillaume Abrioux 5ec906c3af update: skip mds deactivation when no mds in inventory
Let's skip this part of the code if there's no mds node in the
inventory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-23 11:06:13 -04:00
Guillaume Abrioux 8d72ff8e5e update: add missing quotes
Add missing quotes in order to keep consistency.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-21 09:19:34 -04:00
Guillaume Abrioux 25b98b2ce3 tests: add multimds coverage
This commit makes the all_daemons scenario deploy 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-18 13:43:13 -04:00
Guillaume Abrioux c4fc8cc878 upgrade: fix standby_mdss group creation
This commit fixes the standby_mdss group creation by using `{{ item }}`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-18 13:43:13 -04:00
Guillaume Abrioux 71cebf80a6 update: follow new recommendation to upgrade mds cluster
Refactor the mds cluster upgrade code in order to follow the documented
recommendation.
See: https://github.com/ceph/ceph/blob/master/doc/cephfs/upgrading.rst

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1569689

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-16 11:23:12 -04:00
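
The documented procedure shrinks the MDS cluster to a single active rank before upgrading; a sketch of that first step (the `cephfs` variable and task shape are assumptions):

```yaml
- name: set max_mds to 1 before upgrading the mds cluster
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} fs set {{ cephfs }} max_mds 1"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
```
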
Guillaume Abrioux 8138d4193c update: import ceph-defaults role in first play
Typical error:

```
fatal: [mon0]: FAILED! =>
  msg: |-
    The conditional check 'not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])' failed. The error was: error while evaluating conditional (not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])): 'client_group_name' is undefined
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-07 09:00:38 +02:00
Guillaume Abrioux 865d2eac9b main: exclude client nodes from facts gathering when delegate_facts_host
This commit excludes client nodes from facts gathering; their facts are
not needed, and skipping them speeds up this task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-10-07 09:00:38 +02:00
Guillaume Abrioux d84160a170 update: reset mon_host after mons upgrade
After all mons are upgraded, let's reset mon_host, which is used in the
rest of the playbook for setting `container_exec_cmd`, so we are sure to
use the right value.

Typical error:

```
failed: [mds0 -> mon0] (item={u'path': u'/var/lib/ceph/bootstrap-mds/ceph.keyring', u'name': u'client.bootstrap-mds', u'copy_key': True}) => changed=true
  ansible_loop_var: item
  cmd:
  - docker
  - exec
  - ceph-mon-mon2
  - ceph
  - --cluster
  - ceph
  - auth
  - get
  - client.bootstrap-mds
  delta: '0:00:00.016294'
  end: '2019-09-27 13:54:58.828835'
  item:
    copy_key: true
    name: client.bootstrap-mds
    path: /var/lib/ceph/bootstrap-mds/ceph.keyring
  msg: non-zero return code
  rc: 1
  start: '2019-09-27 13:54:58.812541'
  stderr: 'Error response from daemon: No such container: ceph-mon-mon2'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-09-28 04:37:47 +02:00
Sam Choraria 7cc9f93680 rolling_update.yml: force ceph-volume scan on osds
The rolling_update.yml playbook fails when scanning ceph-disk osds while
deploying nautilus. The --force flag is required to scan existing osds
and rewrite their json metadata.

Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>
2019-09-26 16:53:25 +02:00
Guillaume Abrioux 6dce51183b upgrade: accept HEALTH_OK and HEALTH_WARN as valid state
3a100cfa52 introduced a check which is a
bit too restrictive; let's accept both HEALTH_OK and HEALTH_WARN.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-06-21 10:08:56 +02:00
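
Sketch of the relaxed pre-flight check (the task shape is an assumption):

```yaml
- name: check cluster health before upgrading
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} health"
  register: ceph_health
  changed_when: false
  failed_when: "'HEALTH_OK' not in ceph_health.stdout and 'HEALTH_WARN' not in ceph_health.stdout"
```
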
Guillaume Abrioux 3a100cfa52 rolling_update: fail early if cluster state is not OK
Starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check the cluster status before trying to upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-06-18 12:45:01 -04:00
Guillaume Abrioux 51b2813e04 rolling_update: only mask and stop unit in mgr part
Otherwise it fails like the following:

```
fatal: [mon0]: FAILED! => changed=false
  msg: |-
    Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-06-18 12:45:01 -04:00
L3D ab54fe20ec ansible: use 'bool' filter on boolean conditionals
When running ceph-ansible there are a lot of ``[DEPRECATION WARNING]`` messages like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```

``| bool`` is now appended to a lot of the affected variables.

Sometimes the coding style changed from ``variable|bool`` to ``variable | bool`` *(with spaces around the pipe)*.

Closes: #4022

Signed-off-by: L3D <l3d@c3woc.de>
2019-06-06 10:21:17 +02:00
Guillaume Abrioux e74d80e72f rename docker_exec_cmd variable
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-16 16:39:13 +02:00
Mike Christie d7ef12910e igw: Fix rolling update service ordering
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Mike Christie <mchristi@redhat.com>
2019-05-10 09:40:52 +02:00
Dimitri Savineau f1048627ea rolling_update: restart all ceph-iscsi services
Currently only the rbd-target-gw service is restarted during an update.
We also need to restart the tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-04-24 07:47:23 +00:00
Rishabh Dave 739a662c80 improve coding style
Keywords requiring only one item shouldn't express it by creating a
list with a single item.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2019-04-23 15:37:07 +02:00
Guillaume Abrioux 7eb42c9e8e update: ensure tasks are executed on an upgraded mon
These tasks must be run from a monitor which has been upgraded, otherwise
they might fail.
See: https://tracker.ceph.com/issues/39355

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-18 11:16:11 +02:00
Guillaume Abrioux ed84325b1d update: ensure ceph command returns 0
These commands could return something other than 0.
Let's ensure all retries have been done before actually failing.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-18 11:16:11 +02:00
Guillaume Abrioux 543d1e2e41 update: set osd flags before upgrading any mon
Typical error:

```
failed: [mon0 -> mon2] (item=noout) => changed=true
  cmd:
  - ceph
  - --cluster
  - ceph
  - osd
  - set
  - noout
  delta: '0:00:00.293756'
  end: '2019-04-17 06:31:57.552386'
  item: noout
  msg: non-zero return code
  rc: 1
  start: '2019-04-17 06:31:57.258630'
  stderr: |-
    Traceback (most recent call last):
      File "/bin/ceph", line 1222, in <module>
        retval = main()
      File "/bin/ceph", line 1146, in main
        sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
      File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs
        cmd['sig'] = parse_funcsig(cmd['sig'])
      File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig
        raise JsonFormat(s)
    ceph_argparse.JsonFormat: unknown type CephBool
  stderr_lines:
  - 'Traceback (most recent call last):'
  - '  File "/bin/ceph", line 1222, in <module>'
  - '    retval = main()'
  - '  File "/bin/ceph", line 1146, in main'
  - '    sigdict = parse_json_funcsigs(outbuf.decode(''utf-8''), ''cli'')'
  - '  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs'
  - '    cmd[''sig''] = parse_funcsig(cmd[''sig''])'
  - '  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig'
  - '    raise JsonFormat(s)'
  - 'ceph_argparse.JsonFormat: unknown type CephBool'
  stdout: ''
  stdout_lines: <omitted>
```

Having mixed versions of monitors seems to cause this error.
Moving these tasks before any monitor gets upgraded seems to be enough
to get around this issue.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-18 11:16:11 +02:00
Andrew Schoen e2529dcd7f rolling_update: ceph commands should use --cluster
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
Andrew Schoen 67453853ff rolling_update: set num_osds to the number of running osds
We do this so that the ceph-config role can most accurately
report the number of osds for the generation of the ceph.conf
file.

We don't want to use ceph-volume to determine the number of
osds because in an upgrade to nautilus ceph-volume won't be able to
accurately count osds created by ceph-disk.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
Andrew Schoen 28c47e4d1b rolling_update: migrate ceph-disk osds to ceph-volume
When upgrading to nautilus run ``ceph-volume simple scan`` and
``ceph-volume simple activate --all`` to migrate any running
ceph-disk osds to ceph-volume.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1656460

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
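
The two migration commands, sketched as tasks (7cc9f93680 above later adds --force to the scan):

```yaml
- name: scan ceph-disk osds with ceph-volume
  command: "ceph-volume --cluster {{ cluster }} simple scan"

- name: activate scanned osds with ceph-volume
  command: "ceph-volume --cluster {{ cluster }} simple activate --all"
```
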
Guillaume Abrioux c1e4529b0e update: fix undefined error when no mgr group is declared
If the mgr group isn't defined in the inventory, that task will fail with
an undefined variable error.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 09:22:35 +02:00
Dimitri Savineau 57b4e76d11 rolling_update: Remove ceph aliases
Ceph aliases were introduced in stable-3.2 during the ceph
deployment. On master they have been removed, but we didn't handle
this removal in the upgrade from stable-3.2 to master via the
rolling_update playbook.
Also remove the task from purge-docker-cluster missing from
d9e7835

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-04-09 16:50:10 +02:00
Dimitri Savineau c8442f3705 rolling_update: Update systemd unit regex for nvme
The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-03-26 12:01:00 +00:00
Guillaume Abrioux 78aac3e96a update: followup on edfdc49
All rgw instances should be stopped, following the multiple rgw
instances support added in rolling_update.yml.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f6e0185146 update: add containerized deployment upgrade support (L->N)
Add a couple of fixes to support upgrading containerized deployments
from luminous/mimic to nautilus:

- pass CEPH_CONTAINER_IMAGE and CEPH_CONTAINER_BINARY environment
variable to the ceph_key module,
- fix the docker exec command in 'waiting for the containerized monitor
to join the quorum' task according to the `delegate_to` parameter,
- override `docker_exec_cmd` in `ceph-facts` with `mon_host` when
rolling_update is `True`,
- do not unnecessarily run `create_mds_filesystems.yml` when performing an upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 1816b876ee update: add missing hosts in facts gathering
iscsigws were missing.
The 'complete upgrade' couldn't complete because rolling_update was set
to False for iscsigw nodes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 45ba90c169 update: remove rbdmirror legacy task
This task is no longer needed for the next release.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 0ea0adf039 update: show all daemon versions at the end
Let's display all daemon versions at the end of the playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f31d6d9485 update: enable new nautilus-only functionality
once the cluster is upgraded to nautilus, we can complete the process by
disallowing pre-nautilus OSDs and enabling all new nautilus-only functionality

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux afdaa70a63 update: enable msgr2 protocol
This commit enables the msgr2 protocol when the cluster is fully upgraded
to nautilus.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux ef096dd021 update: ensure mgrs are upgraded after ALL monitors
As of 1c760904b0, ceph-ansible implicitly
bootstraps managers on monitors.
mgrs must be upgraded only after all monitors; therefore, this commit
refactors the way mgrs are upgraded to be sure we don't upgrade a mgr
during the monitors upgrade.

This commit also ensures we handle the case where managers are split
onto dedicated nodes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 7fa2434f0f update: ensure /var/lib/ceph/bootstrap-rbd-mirror is present
This directory is created by ceph-config node by node.
In the upgrade context we need it to be created on ALL monitors as soon
as the first iteration, because the task right after it creates and sends
the keyrings to all monitors.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 82764afe8d update: mask systemd service units during upgrade
This prevents the packaging from restarting services before we need
to restart them in the rolling update sequence.
We want to handle service restarts in the rolling_update playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 8add55451c update: set osd flags only once
There is no need to set osd flags (noout, norebalance) each time we
upgrade a mon.

This commit moves those tasks up (before stopping the mon) so we don't
need to delegate them.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f7c6f4e0b6 update: fix tasks waiting for the node to join the quorum
We actually want to ensure the node being upgraded is joining the quorum
instead of the monitor picked up earlier.

Indeed, `mon_host` is used only in `delegate_to:`, so we can still run ceph
commands while the monitor being upgraded is stopped.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 32569b79e2 update: remove an old parameter in ceph_key module call
The `containerized` parameter in the ceph_key module doesn't exist anymore.
This was making the module fail, but the failure was hidden because of the
`ignore_errors: True`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux edfdc49488 rolling_update: support multiple rgw instances
1ac94c048f introduced support for
multiple rgw instances on a single host but missed implementing
this feature in rolling_update.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-01-22 13:45:38 +01:00
Giulio Fidente ff8dbe114c Preserve rolling_update backward compatibility with ansible < 2.5
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
2019-01-21 14:05:45 +01:00
Guillaume Abrioux 268f2cef82 update: do not enforce `serial: 1` on client nodes
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`; if not set, this will default to `{{
ansible_forks }}`.

NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-01-02 16:55:08 +00:00
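
A sketch of the resulting play header (group name default assumed):

```yaml
- hosts: "{{ client_group_name | default('clients') }}"
  serial: "{{ client_update_batch | default(ansible_forks) }}"
```
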
Guillaume Abrioux 0eb56e36f8 introduce new role ceph-facts
Sometimes we play the whole `ceph-defaults` role just to access the
default value of some variables. It means we play the `facts.yml` part
of this role while it's not desired. Splitting this role will speed up
the playbook.

Closes: #3282

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-12 11:18:01 +01:00
Rishabh Dave e4f0af2b78 don't use private option for import_role
Since sharing variables amongst roles has been the default behaviour
since Ansible 2.6, the private option has been deprecated; so stop using it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-12-04 23:45:59 +00:00
Ramana Raja cb784c601d rolling_update: fail if less than 3 MONs
... for non-containerized deployments as well.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1655470

Signed-off-by: Ramana Raja <rraja@redhat.com>
2018-12-04 14:28:49 +00:00
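
The guard, sketched (the message text is an assumption):

```yaml
- name: fail when fewer than three monitors are deployed
  fail:
    msg: "An upgrade requires at least 3 monitors."
  when: groups[mon_group_name] | length | int < 3
```
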
Sébastien Han 896676ee80 fix json data type
JSON data is always typed as a string, whereas before this change we
were declaring a dict, which is not a valid JSON structure here.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-04 12:34:54 +01:00
Sébastien Han 1c760904b0 site: collocated mon and mgr by default
This will speed up the deployment and also deploy mon and mgr collocated,
just as recommended.
This won't prevent you from adding more, dedicated machines for mgr if
needed.
needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-03 14:39:43 +01:00
Sébastien Han bb7bfca113 rolling-update: remove old condition
This failure condition was only valid at a time when clusters didn't
have ceph-mgr activated. Since we now collocate the ceph-mgr with the
mon by default, if the daemon wasn't present it will be created during
the upgrade.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-03 14:39:43 +01:00
Guillaume Abrioux a952122c38 rolling_update: create missing keyring only on running mon
Try to create the potentially missing keys only on monitors that are
actually running.
The node currently being played is stopped before this task.
Delegating the command to all nodes but the node currently being played
ensures that the generated keys will be present on all monitors.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-29 16:40:46 +00:00
Sébastien Han 61fb6972ec rolling_update: default ceph json output to empty dict
So we can avoid the following failure:

The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded

We just need to set a default; the next iteration will have a more
complete json since the command won't fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-29 10:46:15 +00:00
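
The fix boils down to defaulting the registered output before parsing; a sketch, assuming the `default` filter's second argument is used so an empty string also falls back to '{}' (fqdn check omitted for brevity):

```yaml
- name: waiting for the monitor to join the quorum...
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} -s --format json"
  register: ceph_health_raw
  until: >
    hostvars[mon_host]['ansible_hostname'] in
    (ceph_health_raw.stdout | default('{}', true) | from_json).get('quorum_names', [])
  retries: 5
  delay: 10
```
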
Guillaume Abrioux 73287f91bc mgr: fix mgr keyring error on rolling_update
When upgrading from RHCS 2.5 to 3.2, it fails because the task `create
ceph mgr keyring(s) when mon is containerized` has a when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` refers to a
mgr node, meaning this condition would never have been satisfied.
Then, this condition + `serial: 1` makes the mgr keyring creation get
skipped on the first node. Further, the `ceph-mgr` role tries to copy the
mgr keyring (it's not aware we are running `serial: 1`), which leads to a
failure like the following:

```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018  12:03:34 +0000 (0:00:00.296)       0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```

The ceph_key module is idempotent, so there is no need to have such a
condition.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649957

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-27 18:19:56 +01:00
Sébastien Han 4f57e44f9c defaults: declare container_binary
Always declare container_binary and assign it a correct value.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han 49e0e19056 rolling_update: update ceph_key task for container
Use the new way to create keys in containerized environments, as introduced by 1098b71bda90db3dad19ac179f0ba900ccb0f953.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han 2814d36c93 infra playbooks: use the right container binary
Use podman or docker, whichever is available; podman will be
prioritized over docker if present.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Guillaume Abrioux 7c99b6df6d update: fix a typo
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-26 18:22:20 +01:00
Guillaume Abrioux af78173584 rolling_update: refact set_fact `mon_host`
Each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact, which
causes a failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-26 18:22:20 +01:00
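
The selection can be sketched with the difference filter (the exact filter chain is an assumption):

```yaml
- name: set_fact mon_host
  set_fact:
    mon_host: "{{ groups[mon_group_name] | difference([inventory_hostname]) | last }}"
```
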
Sébastien Han 4e267bee4f rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that didn't exist in the
previous version. So after an upgrade from, say, Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok for the task to fail, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-26 18:22:20 +01:00
Guillaume Abrioux c783bc70da docker-common: rename role
rename `ceph-docker-common` role to `ceph-container-common`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-12 10:51:48 +01:00