ceph-ansible

Commit Graph

Author	SHA1	Message	Date
Guillaume Abrioux	787a6e879e	update: use ids to restart osds instead of device name we must use the ids instead of device names in the tasks executed in `post_tasks` for the osd rolling update otherwise it ends up with old systemd units enabled. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1739209 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-08-13 13:42:58 +02:00
Dimitri Savineau	343eec7a53	shrink-osd: Stop ceph-disk container based on ID Since `bedc0ab` we now manage ceph-osd systemd unit scripts based on ID instead of device name but it was not present in the shrink-osd playbook (ceph-disk version). To keep backward compatibility on deployment that didn't do yet the transition on OSD id then we should stop unit scripts for both device and ID. This commit adds the ulimit nofile container option to get better performance on ceph-disk commands. It also fixes an issue when the OSD id matches multiple OSD ids with the same first digit. $ ceph-disk list \| grep osd.1 /dev/sdb1 ceph data, prepared, cluster ceph, osd.1, block /dev/sdb2 /dev/sdg1 ceph data, prepared, cluster ceph, osd.12, block /dev/sdg2 Finally removing the shrinked OSD directory. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>	2019-08-06 09:38:52 +02:00
Guillaume Abrioux	d739f41549	shrink-osd: (ceph-disk only) remove prepare container When shrinking an OSD, its corresponding 'prepare container' should be removed, otherwise this leftover prevents redeploying a new osd. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-07-09 09:04:19 -04:00
Guillaume Abrioux	4b49013369	shrink-osd: (ceph-disk only) remove gpt header Removing the gpt header on devices will ease ceph-disk to ceph-volume migration when using shrink-osd + add-osd playbooks. ceph-disk requires GPT header where ceph-volume will complain if GPT header is present. That won't break ceph-disk (re)deployment since we check and add the GPT header if needed when deploying ceph-disk OSDs. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613735 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-07-09 09:04:19 -04:00
Guillaume Abrioux	8b91905dff	purge: ensure no ceph kernel thread is present This tries to first unmount any cephfs/nfs-ganesha mount point on client nodes, then unmap any mapped rbd devices and finally it tries to remove ceph kernel modules. If it fails it means some resources are still busy and should be cleaned manually before continuing to purge the cluster. This is done early in the playbook so the cluster stays untouched until everything is ready for that operation, otherwise if you try to redeploy a cluster it could end up by getting confused by leftover from previous deployment. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `20e4852888`)	2019-06-24 15:36:21 +02:00
Guillaume Abrioux	520f4e9914	add-osd: fix error in validate execution role ceph-facts should be run before we play ceph-validate since it has reference to facts that are set in ceph-facts role. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-06-24 14:36:18 +02:00
Dimitri Savineau	81de8a8106	remove ceph-agent role and references The ceph-agent role was used only for RHCS 2 (jewel) so it's not useful anymore. The current code will fail on CentOS distribution because the rhscon package is only available on Red Hat with the RHCS 2 repository and this ceph release is supported on stable-3.0 branch. Resolves: #4020 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `7503098ca0`)	2019-06-17 14:42:08 -04:00
Mike Christie	0a24078bbb	igw: Fix rolling update service ordering We must stop tcmu-runner after the other rbd-target-* services because they may need to interact with tcmu-runner during shutdown. There is also a bug in some kernels where IO can get stuck in the kernel and by stopping rbd-target-* first we can make sure all IO is flushed. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611 Signed-off-by: Mike Christie <mchristi@redhat.com> (cherry picked from commit `d7ef12910e`)	2019-05-10 11:12:50 +02:00
Guillaume Abrioux	f1b4874176	Revert "Revert "shrink_osd: use cv zap by fsid to remove parts/lvs"" This reverts commit `043ee8c158`. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-05-10 09:13:10 +02:00
Guillaume Abrioux	043ee8c158	Revert "shrink_osd: use cv zap by fsid to remove parts/lvs" This reverts commit `be59e0b451`. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-04-25 21:27:37 +02:00
Dimitri Savineau	9ff19cc604	rolling_update: restart all ceph-iscsi services Currently only rbd-target-gw service is restarted during an update. We also need to restart tcmu-runner and rbd-target-api services during the ceph iscsi upgrade. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `f1048627ea`)	2019-04-24 23:17:41 +00:00
Guillaume Abrioux	c5c354a61a	remove all NBSPs char in stable-3.2 branch this can cause issues, let's replace all of these chars with real spaces. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2019-04-10 13:27:48 +02:00
Guillaume Abrioux	7136f1734e	purge: fix lvm-batch purge osd `lvm_volumes` and/or `devices` variable(s) can be undefined depending on the scenario chosen. These tasks should be run only if these variables are defined, otherwise it ends up with undefined variable errors. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `0180738313`)	2019-04-03 08:48:39 +02:00
Dimitri Savineau	fa6d9c940a	rolling_update: Update systemd unit regex for nvme The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1). Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828 Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `c8442f3705`)	2019-04-01 15:22:24 +00:00
Dimitri Savineau	8e2cfd9d24	purge-docker-cluster: Remove ceph-osd service The systemd ceph-osd@.service file used for starting the ceph osd containers is used in all osd_scenarios. Currently purging a containerized deployment using the lvm scenario didn't remove the ceph-osd systemd service. If the next deployment is a non-containerized deployment, the OSDs won't be online because the file is still present and override the one from the package. Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `7cc626b72d`)	2019-04-01 09:10:29 +00:00
Dimitri Savineau	ef9525482b	add-osd.yml: Add become flag for ceph-validate The check_devices task fails if the ceph-validate role isn't executed as a privileged user (Permission denied). failed: [osd0] (item=/dev/sdb) => {"changed": false, "err": "Error: Error opening /dev/sdb: Permission denied\n", "item": "/dev/sdb", "msg": "Error while getting device information with parted script: '/sbin/parted -s -m /dev/sdb -- unit 'MiB' print'", "out": "", "rc": 1} Signed-off-by: Dimitri Savineau <dsavinea@redhat.com> (cherry picked from commit `b23c05ae52`)	2019-03-12 14:48:03 +01:00
Guillaume Abrioux	4dd46ec396	add-osd: gather facts in second part of playbook otherwise, it will end up with error like following: ``` FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"} ``` because facts won't have been gathered. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a440878533`)	2019-03-04 15:48:44 +00:00
Guillaume Abrioux	06ad7e0b57	purge: fix rbd-mirror group name the default is rbdmirrors in ceph-defaults Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `47ebef374f`)	2019-03-01 22:16:19 +00:00
Guillaume Abrioux	a8467d8f33	purge: fix rbd mirror purge as of `b70d54ac80` the service launched isn't ceph-rbd-mirror@admin.service. it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a915308477`)	2019-03-01 22:16:19 +00:00
Guillaume Abrioux	5470e6fa42	purge: do not remove /var/lib/apt/lists/* removing the content of this directory seems a bit aggressive and causes a redeployment to fail after a purge on debian based distributions. Typical error: ``` fatal: [mon0]: FAILED! => changed=false attempts: 3 msg: No package matching 'ceph' is available ``` The following task will consider the cache is still valid, so apt doesn't refresh it: ``` - name: update apt cache if cache_valid_time has expired apt: update_cache: yes cache_valid_time: 3600 register: result until: result is succeeded ``` since the task installing ceph packages has a `update_cache: no` it fails: ``` - name: install ceph for debian apt: name: "{{ debian_ceph_pkgs \| unique }}" update_cache: no state: "{{ (upgrade_ceph_packages\|bool) \| ternary('latest','present') }}" default_release: "{{ ceph_stable_release_uca \| default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}" register: result until: result is succeeded ``` /tmp/* isn't specific to ceph as well, so we shouldn't remove everything in this directory. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3849f30f58`)	2019-03-01 22:16:19 +00:00
Guillaume Abrioux	255eab59ac	purge: fix purge of lvm devices using `shell` module seems to be the only way to make this task working on rhel based distribution AND debian based distributions. on ubuntu, using `command` ansible module fails like following (not due to `sudo` usage or not): ``` ok: [osd1] => changed=false cmd: command -v ceph-volume failed_when_result: false msg: '[Errno 2] No such file or directory: ''command'': ''command''' rc: 2 ``` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `89f77589fa`)	2019-03-01 22:16:19 +00:00
Noah Watkins	be59e0b451	shrink_osd: use cv zap by fsid to remove parts/lvs Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1569413 https://bugzilla.redhat.com/show_bug.cgi?id=1572933 Note: rebased Signed-off-by: Noah Watkins <noahwatkins@gmail.com> (cherry picked from commit `9a43674d2e`)	2019-02-06 00:37:11 +00:00
Noah Watkins	b8c39d7613	Add a ceph-volume aware shrink-osd playbook Signed-off-by: Noah Watkins <nwatkins@redhat.com> (cherry picked from commit `f5dacbf7de`)	2019-01-30 14:58:59 +01:00
Noah Watkins	8f57a95048	Rename ceph-disk version of shrink-osd playbook This will be replaced by a ceph-volume aware version. Signed-off-by: Noah Watkins <nwatkins@redhat.com> (cherry picked from commit `0782cfc546`)	2019-01-30 14:58:59 +01:00
Giulio Fidente	75855b2d58	Preserve rolling_update backward compatibility with ansible < 2.5 Let's enforce the default value for `client_update_batch` to 20 since `ansible_forks` isn't always available. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184 Signed-off-by: Giulio Fidente <gfidente@redhat.com> (cherry picked from commit `ff8dbe114c`)	2019-01-21 14:28:07 +00:00
Sébastien Han	04d8002614	switch: do not fail on missing key Some people use the switch playbook to perform upgrades so they end up in the same situation as https://bugzilla.redhat.com/show_bug.cgi?id=1650572 This is applying the same fix as `729744c6a8`. We don't want to fail on keys that are not present since they will get created after the mons are updated. They will be created by the task "create potentially missing keys (rbd and rbd-mirror)". Signed-off-by: Sébastien Han <seb@redhat.com>	2019-01-14 18:54:46 +00:00
Guillaume Abrioux	416b503476	introduce new role ceph-facts sometimes we play the whole role `ceph-defaults` just to access the default value of some variables. It means we play the `facts.yml` part in this role while it's not desired. Splitting this role will speedup the playbook. Closes: #3282 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `0eb56e36f8`)	2019-01-07 09:14:10 +01:00
Guillaume Abrioux	c3bb76b8e9	purge-container: move facts gathering after ceph-defaults role import This task has to be called after the role `ceph-defaults` has been played, otherwise, `mon_group_name` will never be known. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a12de3e048`)	2019-01-07 09:14:10 +01:00
Guillaume Abrioux	b9bf7c6703	purge-container: fix wrong syntax we want a default value for `mon_group_name`, not for `groups[mon_group_name]`. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `d0b3cb7f85`)	2019-01-07 09:14:10 +01:00
Guillaume Abrioux	0ff1260fc1	purge-docker: do not call ceph-osd role calling ceph-osd role in purge playbook is not needed. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `ae7f3d66a6`)	2019-01-07 09:14:10 +01:00
Guillaume Abrioux	c405fd1140	purge: gather monitors facts in OSD purge the OSD part of the purge delegates commands on monitor node, we need to gather monitors facts to know the `ansible_hostname` fact that is used in the `docker_exec_cmd` fact. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `1a4a6ec855`)	2019-01-07 09:14:10 +01:00
Sébastien Han	37ba313d76	purge-container: gather fact before calling ceph-defaults ceph-defaults relies on facts so we must gather facts before running it. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `62111ff53c`)	2019-01-07 09:14:10 +01:00
Sébastien Han	8e83ecfce1	purge-cluster: add support for mon/mgr collocation Recently we introduced the default collocation of mon/mgr without the need of a dedicated mgrs section. This means we have to stop the mgr process on that machine too. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `fc6ebd8ebb`)	2019-01-07 09:14:10 +01:00
Sébastien Han	12d6466582	purge-cluster: remove support for other init system We only support systemd and use the service module anyway. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `3a154fa0ad`)	2019-01-07 09:14:10 +01:00
Sébastien Han	782959f094	purge-docker-cluster: add support for mgr/mon collocation Recently we introduced the collocation of mon and mgr by default, so we don't need to have an explicit mgrs section for this. This means we have to remove the mgr container on the mon machines too. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `325a159415`) # Conflicts: # infrastructure-playbooks/purge-docker-cluster.yml	2019-01-07 09:14:10 +01:00
Sébastien Han	8ce8d580a4	purge-docker-cluste: add a task to check hosts It's useful when running on CI to see what might remain on the machines. So we list all the containers and images. We expect the list to be empty. We fail if we see containers running. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `2bcc00896f`)	2019-01-07 09:14:10 +01:00
Sébastien Han	f37c21a9d0	purge-docker-cluster: add ceph-volume support This commit adds support for purging clusters that were deployed with ceph-volume. It also uses a block instruction to cleanly separate the work to do depending on whether lvm is used. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `1751885bc9`)	2019-01-07 09:14:10 +01:00
Sébastien Han	668c7a4db7	fix json data type Json is a type structure which is always typed as a string, where before this we were declaring a dict, which is not a json valid structure. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1663026 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `896676ee80`)	2019-01-04 12:02:34 +01:00
Guillaume Abrioux	dc02156736	update: do not enforce `serial: 1` on client nodes There is no need to enforce `serial: 1` on client nodes. Let's make it parameterizable by introducing a new extra variable `client_update_batch`, if not filled this will default to `{{ ansible_forks }}`. NOTE: this is only usable as an extra variable passed with `-e client_update_batch=<num>` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `268f2cef82`)	2019-01-04 11:59:02 +01:00
Andrew Schoen	e55ec6c0f5	purge-cluster: skip tasks that use ceph-volume if it's not installed This will allow the playbook to be idempotent. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656935 Signed-off-by: Andrew Schoen <aschoen@redhat.com> (cherry picked from commit `ffd56177e7`)	2018-12-20 14:03:30 +01:00
Guillaume Abrioux	e37a90b5ec	purge: add iscsi support add iscsi support for both non containerized and containerized deployment in purge playbooks. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1651054 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `78116fa6db`)	2018-12-04 18:04:13 +01:00
Ramana Raja	0ec2ac34e3	rolling_update: fail if less than 3 MONs ... for non-containerized deployments as well. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1655470 Signed-off-by: Ramana Raja <rraja@redhat.com> (cherry picked from commit `cb784c601d`)	2018-12-04 16:34:57 +01:00
Sébastien Han	2cea33f7fc	rolling_update: default ceph json output to empty dict So we can avoid the following failure: The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout \| from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout \| from_json)["quorum_names"] ' failed. The error was: No JSON object could be decoded We just need to set a default, the next iteration will have a more complete json since the command won't fail. Signed-off-by: Sébastien Han <seb@redhat.com>	2018-11-29 01:49:05 +00:00
Guillaume Abrioux	292d967d2f	update: fix a typo `hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo. That should be `hostvars[mon_host]['ansible_hostname']` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `7c99b6df6d`)	2018-11-29 01:49:05 +00:00
Guillaume Abrioux	1f4cf61058	rolling_update: refact set_fact `mon_host` each monitor node should select another monitor which isn't itself. Otherwise, one node in the monitor group won't set this fact and causes failure. Typical error: ``` TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] * task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200 Thursday 22 November 2018 14:02:30 +0000 (0:00:07.493) 0:02:50.005 *** fatal: [mon1]: FAILED! => {} MSG: The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2' ``` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `af78173584`)	2018-11-29 01:49:05 +00:00
Sébastien Han	d4f1f12bd0	rolling_update: create rbd and rbd-mirror keyrings During an upgrade ceph won't create keys that did not exist on the previous version. So after an upgrade from, say, Jewel to Luminous, once all the monitors have the new version they should get or create the keys. It's ok to have the task fail, especially for the rbd-mirror key, which only appears in Nautilus. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `4e267bee4f`)	2018-11-29 01:49:05 +00:00
Sébastien Han	26ea96424c	switch: do not look for devices anymore It's easier to look up a directory instead of the block devices, especially because ceph-volume and ceph-disk have different ways of handling devices. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `c14f9b78ff`)	2018-11-29 00:31:47 +01:00
Sébastien Han	57ac7b94c0	switch: disable all ceph units Prior to this commit we were only disabling ceph-osd units, but forgot the ceph.target which is controlling everything and will restart the ceph-osd units at each reboot. Now that everything gets disabled there won't be any conflicts between the old non-container and the new container units. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `cd56dad9fa`)	2018-11-29 00:31:47 +01:00
Sébastien Han	8d0379b4d9	switch: do not mask systemd unit If we mask it we won't be able to start the OSD container since now the osd container use the osd ID as a name such as: ceph-osd@0 Fixes the error: Failed to execute operation: Cannot send after transport endpoint shutdown Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `fe1d09925a`)	2018-11-29 00:31:47 +01:00
Guillaume Abrioux	b72d806f4c	mgr: fix mgr keyring error on rolling_update when upgrading from RHCS 2.5 to 3.2, it fails because the task `create ceph mgr keyring(s) when mon is containerized` has a when condition `inventory_hostname == groups[mon_group_name]\|last`. First, this is incorrect because `inventory_hostname` is referring to a mgr node, it means this condition would have never been satisfied. Then, this condition + `serial: 1` makes the mgr keyring creating skipped on the first node. Further, the `ceph-mgr` role tries to copy the mgr keyring (it's not aware we are running `serial: 1`) this leads to a failure like the following: ``` TASK [ceph-mgr : copy ceph keyring(s) if needed] ************************************************************************************************************************************************************************************************************************************************************************* task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10 Tuesday 27 November 2018 12:03:34 +0000 (0:00:00.296) 0:11:01.290 **** An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring' failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"} ``` The ceph_key module is idempotent, so there is no need to have such a condition. 
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649957 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `73287f91bc`)	2018-11-28 23:11:46 +01:00
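
The OSD id matching issue described in commit 343eec7a53 (a search for `osd.1` also matching `osd.12`) can be illustrated with a short sketch. This is not the playbook's actual fix: the helper name `find_osd_devices` and the regular expression are assumptions made for illustration, and the input is only the two sample `ceph-disk list` lines quoted in that commit message.

```python
import re

# The two sample lines quoted in the commit message for 343eec7a53.
CEPH_DISK_LIST = """\
/dev/sdb1 ceph data, prepared, cluster ceph, osd.1, block /dev/sdb2
/dev/sdg1 ceph data, prepared, cluster ceph, osd.12, block /dev/sdg2
"""


def find_osd_devices(listing, osd_id):
    """Hypothetical helper: return the data devices whose line names exactly osd.<osd_id>."""
    # \b after the id stops "osd.1" from matching inside "osd.12".
    pattern = re.compile(r"\bosd\.%d\b" % osd_id)
    return [line.split()[0] for line in listing.splitlines() if pattern.search(line)]


# Naive substring search, equivalent to `grep osd.1`: matches both OSDs.
naive = [line.split()[0] for line in CEPH_DISK_LIST.splitlines() if "osd.1" in line]

# Anchored search: matches only the requested OSD.
exact = find_osd_devices(CEPH_DISK_LIST, 1)

print(naive)
print(exact)
```

The substring search returns both devices, while the anchored pattern returns only the device that belongs to the requested id.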

380 Commits (e0e9fa47df4ed857947783cd443a3a44365197b2)