ceph-ansible

Author	SHA1	Message	Date
Guillaume Abrioux	195d88fcda	lint: ignore 302,303,505 errors ignore 302,303 and 505 errors [302] Using command rather than an argument to e.g. file [303] Using command rather than module [505] referenced files must exist. They aren't relevant on these tasks. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-11-23 08:33:47 +01:00
Guillaume Abrioux	9fba6eecfa	lint: variables should have spaces before and after Fix ansible lint 206 error: [206] Variables should have spaces before and after: {{ var_name }} Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-11-23 08:33:47 +01:00
Guillaume Abrioux	5450de58b3	lint: commands should not change things Fix ansible lint 301 error: [301] Commands should not change things if nothing needs doing Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-11-23 08:33:47 +01:00
Guillaume Abrioux	1879c26eb9	lint: set pipefail on shell tasks Fix ansible lint 306 error: [306] Shells that use pipes should set the pipefail option Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-11-23 08:33:47 +01:00
Dimitri Savineau	35ed9977aa	switch2container: chown symlink in mon/mgr plays `fa2bb3a` only fixes the symlink owner/group issue in the OSD play. If the OSDs are collocated with other services like MONs and MGRs then the chown command will fail. $ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} + chown: cannot dereference './block': Permission denied Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-11-16 13:40:57 -05:00
Dimitri Savineau	fa2bb3af86	switch2container: disable ceph-osd enabled-runtime When deploying the ceph OSD via the packages then the ceph-osd@.service unit is configured as enabled-runtime. This means that each ceph-osd service will inherit from that state. The enabled-runtime systemd state doesn't survive after a reboot. For non containerized deployment the OSDs are still starting after a reboot because there's the ceph-volume@.service and/or ceph-osd.target units that are doing the job. $ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t ceph-osd@.service enabled-runtime ceph-volume@.service enabled ceph-osd.target enabled When switching to containerized deployment we are stopping/disabling ceph-osd@XX.service, ceph-volume and ceph.target and then removing the systemd unit files. But the new systemd units for containerized ceph-osd service will still inherit from ceph-osd@.service unit file. As a consequence, if an OSD host is rebooting after the playbook execution then the ceph-osd service won't come back because they aren't enabled at boot. This patch also adds a reboot and testinfra run after running the switch to container playbook. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-11-12 20:05:39 +01:00
Dimitri Savineau	88f91d8c12	monitor: use quorum_status instead of ceph status The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the quorum status, we're only using the quorum_names structure in the ceph status output. To optimize this, we could use the ceph quorum_status command which contains the same needed information. This command returns less information. $ ceph status -f json | wc -c 2001 $ ceph quorum_status -f json | wc -c 957 $ time ceph status -f json > /dev/null real 0m0.577s user 0m0.538s sys 0m0.029s $ time ceph quorum_status -f json > /dev/null real 0m0.544s user 0m0.527s sys 0m0.016s Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-11-03 09:05:33 +01:00
Dimitri Savineau	ee50588590	osds: use pg stat command instead of ceph status The ceph status command returns a lot of information stored in variables and/or facts which could consume resources for nothing. When checking the pgs state, we're using the pgmap structure in the ceph status output. To optimize this, we could use the ceph pg stat command which contains the same needed information. This command returns less information (only about pgs) and is slightly faster than the ceph status command. $ ceph status -f json | wc -c 2000 $ ceph pg stat -f json | wc -c 240 $ time ceph status -f json > /dev/null real 0m0.529s user 0m0.503s sys 0m0.024s $ time ceph pg stat -f json > /dev/null real 0m0.426s user 0m0.409s sys 0m0.016s The data returned by the ceph status is even bigger when using the nautilus release. $ ceph status -f json | wc -c 35005 $ ceph pg stat -f json | wc -c 240 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-11-03 09:05:33 +01:00
Dimitri Savineau	da4280e243	switch2container: chown symlink for devices If the OSD directory is using symlinks for referencing devices (like block, db, wal for bluestore and journal for filestore) then the chown command could fail to change the owner:group on some systems. $ ls -hl /var/lib/ceph/osd/ceph-0/ total 28K lrwxrwxrwx 1 ceph ceph 92 Sep 15 01:53 block -> /dev/ceph-45113532-95ca-471b-bd75-51de46f1339c/osd-data-570a1aee-60c0-44c9-8036-ffed7d67a4e6 -rw------- 1 ceph ceph 37 Sep 15 01:53 ceph_fsid -rw------- 1 ceph ceph 37 Sep 15 01:53 fsid -rw------- 1 ceph ceph 55 Sep 15 01:53 keyring -rw------- 1 ceph ceph 6 Sep 15 01:53 ready -rw------- 1 ceph ceph 3 Sep 15 02:00 require_osd_release -rw------- 1 ceph ceph 10 Sep 15 01:53 type -rw------- 1 ceph ceph 2 Sep 15 01:53 whoami $ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} + chown: cannot dereference './block': Permission denied $ find /var/lib/ceph/osd/ceph-0 -not -user 167 /var/lib/ceph/osd/ceph-0/block Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-09-15 20:05:49 +02:00
Dimitri Savineau	c1af69a7e7	switch2container: remove deb systemd units When running the switch2container playbook on a Debian based system, the systemd unit path isn't the same as on Red Hat based systems. Because the systemd unit files aren't removed, the new container systemd unit isn't taken into account. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-09-15 20:05:49 +02:00
Kevin Coakley	d19e6033b2	Remove ceph-radosgw.target when switching to containerized daemons The task "remove old systemd unit file" under "switching from non-containerized to containerized ceph rgw" only removes the ceph-radosgw@.service file. The task should also remove the ceph-radosgw.target file, like the "remove old systemd unit files" tasks for the mons, mgrs, osds, etc, in order to clean up all of the unused systemd unit files. Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>	2020-08-04 11:08:12 -04:00
Guillaume Abrioux	9d2f2108e1	ceph-crash: introduce new role ceph-crash This commit introduces a new role `ceph-crash` in order to deploy everything needed for the ceph-crash daemon. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-07-21 20:22:12 +02:00
Dimitri Savineau	c95adc564b	facts: explicitly disable facter and ohai By default, ansible gathers facts from facter and ohai if they are installed on the remote nodes. Given we don't need them, let's exclude these facts from our facts gathering. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-07-02 17:46:12 +02:00
Guillaume Abrioux	7dd68b9ac1	rgw: fix multi instances scaleout When rgw and osd are collocated, the current workflow prevents scaling out the radosgw_num_instances parameter when rerunning the playbook. The environment file used in the rgw systemd template is rendered when executing the `ceph-rgw` role but during a new run of the playbook (in order to scale out rgw instances), handlers are triggered from `ceph-osd` role which is run before `ceph-rgw`, therefore it tries to start the new rgw daemon whereas its corresponding environment file hasn't been rendered yet and fails as follows: ``` ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory ``` This commit moves the tasks generating this file in `ceph-config` role so it is generated early. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-07-02 10:39:50 -04:00
Guillaume Abrioux	b91d60d384	switch_to_containers: don't set noup flag We shouldn't set this flag when running switch_to_containers playbook. Otherwise the playbook fails waiting for pgs to be clean. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-06-17 01:32:18 +02:00
Dimitri Savineau	50140c9b5d	switch_to_container: fix osd systemd regex The systemd LOAD and ACTIVE fields could have more than one space between both values. This updates the systemd regex the same way we're using it in different parts of the code. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2020-06-16 17:04:06 +02:00
Guillaume Abrioux	8aed824f71	switch_to_container: refact wait for pg check There is no need to make this check with several steps. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-05-16 07:31:57 +02:00
Guillaume Abrioux	2cfaa056e0	switch-to-containers: set and unset osd flags The workflow in this playbook should be the same as in rolling_update: we should first set noout and nodeep-scrub flags before migrating the first osd and unset osd flags after the last osd is migrated. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-04-06 17:00:00 +02:00
Guillaume Abrioux	3700aa5385	switch_to_containers: increase health check values This commit increases the default values for the following variable consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook. This also moves these variables in `ceph-defaults` role so the user can set different values if needed. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2020-02-07 14:59:14 -05:00
Guillaume Abrioux	332c39376b	switch_to_containers: exclude clients nodes from facts gathering just like site.yml and rolling_update, let's exclude client nodes from fact gathering. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-12-09 10:49:13 -05:00
Dimitri Savineau	39cfe0aa65	switch_to_containers: fix umount ceph partitions When a container is already running on a non containerized node then the umount ceph partition task is skipped. This is due to the container ps command which always returns 0 even if the filter matches nothing. We should run the umount task when: 1/ the container command is failing (not installed) : rc != 0 2/ the container command reports running ceph-osd containers : rc == 0 Also we should not fail on the ceph directory listing. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2019-12-02 09:19:50 +01:00
Dimitri Savineau	19edf707a5	switch_to_containers: umount osd lockbox partition When switching from a baremetal deployment to a containerized deployment we only unmount the OSD data partition. If the OSD is encrypted (dmcrypt: true) then there's an additional partition (part number 5) used for the lockbox and mounted in the /var/lib/ceph/osd-lockbox/ directory. Because this partition isn't unmounted, the containerized OSDs aren't able to start. The partition is still mounted by the system and can't be remounted from the container. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2019-10-08 00:45:52 +02:00
Guillaume Abrioux	fa9b42e98e	switch_to_containers: do not re-set `ceph_uid` This commit refactors the way we set `ceph_uid` fact in `ceph-facts` and removes all `set_fact` tasks for `ceph_uid` in switch-to-containers playbook to avoid duplicated code. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-10-07 14:15:56 +02:00
Guillaume Abrioux	c5d0c90bb7	switch_to_containers: optimize ownership change As per https://github.com/ceph/ceph-ansible/pull/4323#issuecomment-538420164 using `find` command should be faster. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1757400 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> Co-Authored-by: Giulio Fidente <gfidente@redhat.com>	2019-10-07 14:15:56 +02:00
Kevin Jones	47bf47c9d8	Set proper ownership command performance improvement By changing the set ownership command from using the file module in combination with a with_items loop to a raw chown command, we can achieve a 98% performance increase here. On a ceph cluster with a significant amount of directories and files in /var/lib/ceph, the file module has to run checks on ownership of all those directories and files to determine whether a change is needed. In this case, we just want to explicitly set the ownership of all these directories and files to the ceph_uid. Added context note to all set proper ownership tasks. Signed-off-by: Kevin Jones <kevinjones@redhat.com>	2019-08-22 10:26:47 +02:00
Guillaume Abrioux	55420d6253	roles: introduce `ceph-container-engine` role This commit splits the current `ceph-container-common` role. This introduces a new role `ceph-container-engine` which handles the tasks specific to the installation of containers tools (docker/podman). This is needed for the ceph-dashboard implementation for 2 main reasons: 1/ Since the ceph-dashboard stack is only containerized, we must install everything needed to run containers even in non containerized deployments. Splitting this role allows us to not have to call the full `ceph-container-common` role which would run a bunch of unneeded tasks that would have been skipped anyway. 2/ The current implementation would have required to run `ceph-container-common` on all ceph-clients nodes which would have been conflicting with `9d3517c670` (we don't want to run ceph-container-common on all client nodes, see mentioned commit for more details) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-05-22 13:02:10 +02:00
Guillaume Abrioux	e74d80e72f	rename docker_exec_cmd variable This commit renames the `docker_exec_cmd` variable to `container_exec_cmd` so it's more generic. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-05-16 16:39:13 +02:00
Rishabh Dave	739a662c80	improve coding style Keywords requiring only one item shouldn't express it by creating a list with single item. Signed-off-by: Rishabh Dave <ridave@redhat.com>	2019-04-23 15:37:07 +02:00
Dimitri Savineau	150acba8c5	switch-from-non-containerized: stop all osds `e6bfb84` introduced a regression in the switch from non containerized to container deployment. We need to stop all previous OSD services. We just don't need the ceph-disk pattern in the regex. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2019-04-11 16:26:53 -04:00
Guillaume Abrioux	e6bfb843f4	switch_to_containers: remove ceph-disk references as of stable-4.0, ceph-disk is no longer supported. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-04-11 11:57:02 -04:00
Guillaume Abrioux	69310a5cd6	switch_to_containers: support multiple rgw instances per host add multiple rgw instances per host in switch_to_containers playbook. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-02-13 09:42:27 +01:00
Guillaume Abrioux	70f1eea9b2	switch_to_containers: remove non-containerized systemd unit files remove old systemd unit files (non-containerized) during the switch_to_containers transition. We have seen sometimes the unit started is the old one instead of the new systemd unit generated. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-02-13 09:42:27 +01:00
Guillaume Abrioux	4064035a54	switch_to_containers: use ceph binary from container use the ceph binary from the container instead of the host. If the ceph CLI version isn't compatible between host and container image, it can cause the CLI to hang. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-02-13 09:42:27 +01:00
Guillaume Abrioux	7e0a70f7a8	switch_to_containers: do not try to redeploy monitors `ceph-mon` tries to redeploy monitors because it assumes it was not yet deployed since `mon_socket_stat` and `ceph_mon_container_stat` are undefined (indeed, we stop the daemon before calling `ceph-mon` in the switch_to_containers playbook). Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-02-13 09:42:27 +01:00
Guillaume Abrioux	0eb56e36f8	introduce new role ceph-facts sometimes we play the whole role `ceph-defaults` just to access the default value of some variables. It means we play the `facts.yml` part in this role while it's not desired. Splitting this role will speed up the playbook. Closes: #3282 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-12-12 11:18:01 +01:00
Rishabh Dave	2fb12ae554	use pre_tasks and post_tasks when necessary Signed-off-by: Rishabh Dave <ridave@redhat.com>	2018-12-05 08:17:10 +00:00
Rishabh Dave	e4f0af2b78	don't use private option for import_role Since sharing variables amongst roles has been made default since Ansible 2.6, private option has been deprecated; so stop using it. Signed-off-by: Rishabh Dave <ridave@redhat.com>	2018-12-04 23:45:59 +00:00
Sébastien Han	2814d36c93	infra playbooks: use the right container binary Use podman or docker depending on whether they are available or not. podman will be prioritized over docker if present. Signed-off-by: Sébastien Han <seb@redhat.com>	2018-11-27 16:47:40 +00:00
Sébastien Han	c14f9b78ff	switch: do not look for devices anymore It's easier to look up a directory instead of the block devices, especially because ceph-volume and ceph-disk have a different way to handle devices. Signed-off-by: Sébastien Han <seb@redhat.com>	2018-11-23 07:56:23 +00:00
Sébastien Han	cd56dad9fa	switch: disable all ceph units Prior to this commit we were only disabling ceph-osd units, but forgot the ceph.target which is controlling everything and will restart the ceph-osd units at each reboot. Now that everything gets disabled there won't be any conflicts between the old non-container and the new container units. Signed-off-by: Sébastien Han <seb@redhat.com>	2018-11-23 07:56:23 +00:00
Sébastien Han	fe1d09925a	switch: do not mask systemd unit If we mask it we won't be able to start the OSD container since now the osd container use the osd ID as a name such as: ceph-osd@0 Fixes the error: Failed to execute operation: Cannot send after transport endpoint shutdown Signed-off-by: Sébastien Han <seb@redhat.com>	2018-11-23 07:56:23 +00:00
Guillaume Abrioux	c783bc70da	docker-common: rename role rename `ceph-docker-common` role to `ceph-container-common` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-11-12 10:51:48 +01:00
Rishabh Dave	3f62fc585f	don't use "role" or "roles" to include roles Since import_role and include_role are more readable, explicit (about the nature of inclusion) and flexible (allows placing inclusion anywhere) amongst the tasks, use them instead of using roles or role keyword. Besides, these keywords also allow more arguments. Signed-off-by: Rishabh Dave <ridave@redhat.com>	2018-10-31 09:38:59 +01:00
Guillaume Abrioux	d8d3e55006	remove restapi role As of `mimic`, restapi is no longer available because of manager daemon. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-10-30 14:19:13 +01:00
Sébastien Han	9fccffa1ca	switch: allow switch big clusters (more than 99 osds) The current regex had a limitation of 99 OSDs, now this limit has been removed and regardless of the number of OSDs they will all be collected. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430 Signed-off-by: Sébastien Han <seb@redhat.com>	2018-10-10 16:35:30 -04:00
Noah Watkins	8dcc8d1434	Stringify ceph_docker_image_tag This could be a numeric input, but is treated like a string leading to runtime errors. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823 Signed-off-by: Noah Watkins <nwatkins@redhat.com>	2018-10-10 04:26:33 +00:00
Noah Watkins	306e308f13	Avoid using tests as filter Fixes the deprecation warning: [DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of using `result|search` use `result is search`. Signed-off-by: Noah Watkins <nwatkins@redhat.com>	2018-10-10 04:26:33 +00:00
Sébastien Han	bae0f41705	switch: copy initial mon keyring We need to copy this key into /etc/ceph so when ceph-docker-common runs it can fetch it to the ansible server. Previously the task wasn't failing because `fail_on_missing` was False before 2.5, so now it's True hence the failure. Signed-off-by: Sébastien Han <seb@redhat.com>	2018-10-03 13:58:53 +00:00
Guillaume Abrioux	03e76af7b4	switch: add missing call to ceph-handler role Add missing call to the ceph-handler role, otherwise we can't have reference to variables registered from ceph-handler from other roles. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-10-03 13:58:53 +00:00
Guillaume Abrioux	54b02fe187	switch: support migration when cluster is scrubbing Similar to `c13a3c3` we must allow scrubbing when running this playbook. In a cluster with a large number of PGs, some of them can be expected to be scrubbing; it's a normal operation. Preventing scrubbing would force us to set the noscrub flag. This commit allows to switch from non containerized to containerized environment even while PGs are scrubbing. Closes: #3182 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-10-03 13:58:53 +00:00
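Commits 50140c9b5d and 9fccffa1ca both concern the regex used to collect running OSD systemd units. The snippet below is a minimal Python sketch of the idea, not the playbook's own task (which is Ansible YAML). The `systemctl list-units` sample and the helper name `running_osd_units` are hypothetical. It shows a pattern that tolerates any amount of whitespace between the LOAD and ACTIVE columns and matches OSD IDs of any length.

```python
import re
from typing import List

# Hypothetical sample of `systemctl list-units` output (not captured from a
# real host). LOAD and ACTIVE are separated by a variable number of spaces,
# and one OSD id has three digits.
SAMPLE = """\
ceph-osd@0.service    loaded active   running Ceph object storage daemon osd.0
ceph-osd@105.service  loaded    active running Ceph object storage daemon osd.105
ceph-osd@7.service    loaded inactive dead    Ceph object storage daemon osd.7
ceph-mon@mon0.service loaded active   running Ceph cluster monitor daemon
"""

# \s+ accepts one or more whitespace characters between LOAD and ACTIVE.
LOADED_ACTIVE = re.compile(r"loaded\s+active")
# [0-9]+ accepts OSD ids with any number of digits (osd.100 and above too).
OSD_UNIT = re.compile(r"ceph-osd@[0-9]+\.service")


def running_osd_units(listing: str) -> List[str]:
    """Return the ceph-osd@N.service units that are loaded and active."""
    units = []
    for line in listing.splitlines():
        if LOADED_ACTIVE.search(line):
            match = OSD_UNIT.search(line)
            if match:
                units.append(match.group(0))
    return units


if __name__ == "__main__":
    print(running_osd_units(SAMPLE))
    # ['ceph-osd@0.service', 'ceph-osd@105.service']
```

A pattern with a single literal space, such as `loaded active`, would skip the `osd.105` line in the sample above.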

92 Commits (6ba4c8c70b65be7ccd7c6c2e6f0959bd4f058e12)
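Commit 88f91d8c12 replaces `ceph status` with `ceph quorum_status` because the quorum check only reads the `quorum_names` structure. Here is a minimal Python sketch of such a check. It is not the playbook's own code. The JSON payload is a hypothetical, abbreviated sample, and `mon_in_quorum` is a hypothetical helper name. Only the `quorum_names` key is taken from the commit message.

```python
import json

# Hypothetical, abbreviated `ceph quorum_status -f json` payload. Only the
# "quorum_names" key comes from commit 88f91d8c12; the rest is illustrative.
SAMPLE = '{"quorum": [0, 1, 2], "quorum_names": ["mon0", "mon1", "mon2"]}'


def mon_in_quorum(raw_json: str, hostname: str) -> bool:
    """Return True if `hostname` is listed in the quorum_names array."""
    status = json.loads(raw_json)
    # A missing key is treated as "not in quorum" rather than raising.
    return hostname in status.get("quorum_names", [])


if __name__ == "__main__":
    print(mon_in_quorum(SAMPLE, "mon1"))  # True
    print(mon_in_quorum(SAMPLE, "mon3"))  # False
```

Reading one key from the smaller payload is the whole point of the change. The measurements in the commit show 957 bytes for `quorum_status` against 2001 bytes for `ceph status`.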