Commit Graph

418 Commits (540ff7826e223f1e5f8f62bfa87a93fa6d477c55)

Author SHA1 Message Date
Guillaume Abrioux 7136f1734e purge: fix lvm-batch purge osd
`lvm_volumes` and/or `devices` variable(s) can be undefined depending on
the scenario chosen.

These tasks should be run only if these variable are defined, otherwise
it ends up with undefined variable errors.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0180738313)
2019-04-03 08:48:39 +02:00
Dimitri Savineau fa6d9c940a rolling_update: Update systemd unit regex for nvme
The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c8442f3705)
2019-04-01 15:22:24 +00:00
Dimitri Savineau 8e2cfd9d24 purge-docker-cluster: Remove ceph-osd service
The systemd ceph-osd@.service file used for starting the ceph osd
containers is used in all osd_scenarios.
Currently purging a containerized deployment using the lvm scenario
didn't remove the ceph-osd systemd service.
If the next deployment is a non-containerized deployment, the OSDs
won't be online because the file is still present and override the
one from the package.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7cc626b72d)
2019-04-01 09:10:29 +00:00
Dimitri Savineau ef9525482b add-osd.yml: Add become flag for ceph-validate
The check_devices task fails if the ceph-validate role isn't executed
as a privileged user (Permission denied).

failed: [osd0] (item=/dev/sdb) => {"changed": false, "err": "Error:
Error opening /dev/sdb: Permission denied\n", "item": "/dev/sdb",
"msg": "Error while getting device information with parted script:
'/sbin/parted -s -m /dev/sdb -- unit 'MiB' print'", "out": "", "rc": 1}

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b23c05ae52)
2019-03-12 14:48:03 +01:00
Guillaume Abrioux 4dd46ec396 add-osd: gather facts in second part of playbook
otherwise, it will end up with error like following:

```
FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}
```

because facts won't have been gathered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a440878533)
2019-03-04 15:48:44 +00:00
Guillaume Abrioux 06ad7e0b57 purge: fix rbd-mirror group name
the default is rbdmirrors in ceph-defaults

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 47ebef374f)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux a8467d8f33 purge: fix rbd mirror purge
as of b70d54ac80 the service launched isn't
ceph-rbd-mirror@admin.service.

it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a915308477)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux 5470e6fa42 purge: do not remove /var/lib/apt/lists/*
removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.

Typical error:
```
fatal: [mon0]: FAILED! => changed=false
  attempts: 3
  msg: No package matching 'ceph' is available
```

The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
  apt:
    update_cache: yes
    cache_valid_time: 3600
  register: result
  until: result is succeeded
```

since the task installing ceph packages has a `update_cache: no` it
fails:

```
- name: install ceph for debian
  apt:
    name: "{{ debian_ceph_pkgs | unique }}"
    update_cache: no
    state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
    default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
  register: result
  until: result is succeeded
```

/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3849f30f58)
2019-03-01 22:16:19 +00:00
Guillaume Abrioux 255eab59ac purge: fix purge of lvm devices
using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.

on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
  cmd: command -v ceph-volume
  failed_when_result: false
  msg: '[Errno 2] No such file or directory: ''command'': ''command'''
  rc: 2
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 89f77589fa)
2019-03-01 22:16:19 +00:00
Noah Watkins be59e0b451 shrink_osd: use cv zap by fsid to remove parts/lvs
Fixes:
  https://bugzilla.redhat.com/show_bug.cgi?id=1569413
  https://bugzilla.redhat.com/show_bug.cgi?id=1572933

Note: rebased

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
(cherry picked from commit 9a43674d2e)
2019-02-06 00:37:11 +00:00
Noah Watkins b8c39d7613 Add a ceph-volume aware shrink-osd playbook
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit f5dacbf7de)
2019-01-30 14:58:59 +01:00
Noah Watkins 8f57a95048 Rename ceph-disk version of shrink-osd playbook
This will be replaced by a ceph-volume aware verison.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 0782cfc546)
2019-01-30 14:58:59 +01:00
Giulio Fidente 75855b2d58 Preserve rolling_update backward compatibility with ansible < 2.5
Let's enforce the default value for `client_update_batch` to 20 since
`ansible_forks` isn't always available.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit ff8dbe114c)
2019-01-21 14:28:07 +00:00
Sébastien Han 04d8002614 switch: do not fail on missing key
Some people use the switch playbook to perform upgrade so they end up in
the same situation than https://bugzilla.redhat.com/show_bug.cgi?id=1650572
This is applying the same fix as
729744c6a8.

We don't want to fail on key that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".

Signed-off-by: Sébastien Han <seb@redhat.com>
2019-01-14 18:54:46 +00:00
Guillaume Abrioux 416b503476 introduce new role ceph-facts
sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speedup
the playbook.

Closes: #3282

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0eb56e36f8)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux c3bb76b8e9 purge-container: move facts gathering after ceph-defaults role import
This task has to be called after the role `ceph-defaults` has been
played, otherwise, `mon_group_name` will never be known.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a12de3e048)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux b9bf7c6703 purge-container: fix wrong syntax
we want a default value for `mon_group_name`, not for
`groups[mon_group_name]`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d0b3cb7f85)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux 0ff1260fc1 purge-docker: do not call ceph-osd role
calling ceph-osd role in purge playbook is not needed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ae7f3d66a6)
2019-01-07 09:14:10 +01:00
Guillaume Abrioux c405fd1140 purge: gather monitors facts in OSD purge
the OSD part of the purge delegates commands on monitor node, we need to
gather monitors facts to know the `ansible_hostname` fact that is used
in the `docker_exec_cmd` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1a4a6ec855)
2019-01-07 09:14:10 +01:00
Sébastien Han 37ba313d76 purge-container: gather fact before calling ceph-defaults
ceph-defaults relies on facts so we must gather facts before running it.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 62111ff53c)
2019-01-07 09:14:10 +01:00
Sébastien Han 8e83ecfce1 purge-cluster: add support for mon/mgr collocation
Recently we introduced the default collocation of mon/mgr without the
need of a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fc6ebd8ebb)
2019-01-07 09:14:10 +01:00
Sébastien Han 12d6466582 purge-cluster: remove support for other init system
We only support systemd and use the service module anyway.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 3a154fa0ad)
2019-01-07 09:14:10 +01:00
Sébastien Han 782959f094 purge-docker-cluster: add support for mgr/mon collocation
Recently we introduced the collocation of mon and mgr by default, so we
don't need to have an explicit mgrs section for this. This means we have
to remove the mgr container on the mon machines too.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 325a159415)

# Conflicts:
#	infrastructure-playbooks/purge-docker-cluster.yml
2019-01-07 09:14:10 +01:00
Sébastien Han 8ce8d580a4 purge-docker-cluste: add a task to check hosts
It's useful when running on CI to see what might remain on the machines.
So we list all the containers and images. We expect the list to be
empty.

We fail if we see containers running.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 2bcc00896f)
2019-01-07 09:14:10 +01:00
Sébastien Han f37c21a9d0 purge-docker-cluster: add ceph-volume support
This commits adds the support for purging cluster that were deployed
with ceph-volume. It also separates nicely with a block intruction the
work to do when lvm is used or not.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 1751885bc9)
2019-01-07 09:14:10 +01:00
Sébastien Han 668c7a4db7 fix json data type
Json is a type structure which is always typed as a string, where before
this we were declaring a dict, which is not a json valid structure.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1663026

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 896676ee80)
2019-01-04 12:02:34 +01:00
Guillaume Abrioux dc02156736 update: do not enforce `serial: 1` on client nodes
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`, if not filled this will default to `{{
ansible_forks }}`.

NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 268f2cef82)
2019-01-04 11:59:02 +01:00
Andrew Schoen e55ec6c0f5 purge-cluster: skip tasks that use ceph-volume if it's not installed
This will allow the playbook to be idempotent.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656935

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit ffd56177e7)
2018-12-20 14:03:30 +01:00
Guillaume Abrioux e37a90b5ec purge: add iscsi support
add iscsi support for both non containerized and containerized
deployment in purge playbooks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1651054

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78116fa6db)
2018-12-04 18:04:13 +01:00
Ramana Raja 0ec2ac34e3 rolling_update: fail if less than 3 MONs
... for non-containerized deployments as well.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1655470

Signed-off-by: Ramana Raja <rraja@redhat.com>
(cherry picked from commit cb784c601d)
2018-12-04 16:34:57 +01:00
Sébastien Han 2cea33f7fc rolling_update: default ceph json output to empty dict
So we can avoid the following failure:

The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded

We just need to set a default, the next iteration will have a more
complete json since the command won't fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-29 01:49:05 +00:00
Guillaume Abrioux 292d967d2f update: fix a typo
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d)
2018-11-29 01:49:05 +00:00
Guillaume Abrioux 1f4cf61058 rolling_update: refact set_fact `mon_host`
each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584)
2018-11-29 01:49:05 +00:00
Sébastien Han d4f1f12bd0 rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f)
2018-11-29 01:49:05 +00:00
Sébastien Han 26ea96424c switch: do not look for devices anymore
It's easier lookup a directoriy instead of the block devices,
especially because of ceph-volume and ceph-disk have a different way to
handle devices.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit c14f9b78ff)
2018-11-29 00:31:47 +01:00
Sébastien Han 57ac7b94c0 switch: disable all ceph units
Prior to this commit we were only disabling ceph-osd units, but forgot
the ceph.target which is controlling everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled there won't be any conflicts between
the old non-container and the new container units.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit cd56dad9fa)
2018-11-29 00:31:47 +01:00
Sébastien Han 8d0379b4d9 switch: do not mask systemd unit
If we mask it we won't be able to start the OSD container since now the
osd container use the osd ID as a name such as: ceph-osd@0

Fixes the error:  Failed to execute operation: Cannot send after transport endpoint shutdown

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fe1d09925a)
2018-11-29 00:31:47 +01:00
Guillaume Abrioux b72d806f4c mgr: fix mgr keyring error on rolling_update
when upgrading from RHCS 2.5 to 3.2, it fails because the task `create
ceph mgr keyring(s) when mon is containerized` has a when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` is referring to a
mgr node, it means this condition would have never been satisfied.
Then, this condition + `serial: 1` makes the mgr keyring creating skipped on
the first node. Further, the `ceph-mgr` role tries to copy the mgr
keyring (it's not aware we are running `serial: 1`) this leads to a
failure like the following:

```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018  12:03:34 +0000 (0:00:00.296)       0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```

The ceph_key module is idempotent, so there is no need to have such a
condition.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649957

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 73287f91bc)
2018-11-28 23:11:46 +01:00
Rishabh Dave a74f4204cd remove configuration files for ceph packages on ubuntu clusters
For apt-get, purge command needs to be used, instead of remove command,
to remove related configuration files. Otherwise, packages might be
shown as installed while running dpkg command even after removing them.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1640061
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 640cad3fd8)
2018-11-09 16:50:25 +01:00
Mike Christie 77de54025b igw: stop tcmu-runner on iscsi purge
When the iscsi purge playbook is run we stop the gw and api daemons but
not tcmu-runner which I forgot on the previous PR.

Fixes Red Hat BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1621255

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit b523a44a1a)
2018-11-09 16:50:04 +01:00
Ali Maredia 219fa8f919 infrastructure playbooks: ensure nvme_device is defined in lv-create.yml
Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-10-29 08:41:42 +00:00
Mike Christie 0904860032 igw: stop daemons on purge all calls
When purging the entire igw config (lio and rbd) stop disable the api
and gw daemons.

Fixes Red Hat BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1621255

Signed-off-by: Mike Christie <mchristi@redhat.com>
2018-10-25 12:59:18 +02:00
Sébastien Han 44d0da0dd4 rolling_update: fix upgrade when using fqdn
CLusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used otherwise the upgrade will
fail, looking for a unit that does not exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-19 13:06:56 +00:00
Guillaume Abrioux b8418ebd17 add-osds: followup on 3632b26
Three fixes:

- fix a typo in vagrant_variables that cause a networking issue for
containerized scenario.
- add containerized_deployment: true
- remove a useless block of code: the fact docker_exec_cmd is set in
ceph-defaults which is played right after.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-17 17:07:25 +02:00
Sébastien Han d6e79044ef infra: add a gather-ceph-logs.yml playbook
Add a gather-ceph-logs.yml which will log onto all the machines from
your inventory and will gather ceph logs. This is not intended to work
on containerized environments since the logs are stored in journald.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582280
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-17 13:52:19 +00:00
Sébastien Han fbd878c8d5 infra: rename osd-configure to add-osd and improve it
The playbook has various improvements:

* run ceph-validate role before doing anything
* run ceph-fetch-keys only on the first monitor of the inventory list
* set noup flag so PGs get distributed once all the new OSDs have been
added to the cluster and unset it when they are up and running

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-17 11:26:11 +00:00
Guillaume Abrioux 40b7747af7 remove jewel support
As of now, we should no longer support Jewel in ceph-ansible.
The latest ceph-ansible release supporting Jewel is `stable-3.1`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-12 23:38:17 +00:00
Sébastien Han 9fccffa1ca switch: allow switch big clusters (more than 99 osds)
The current regex had a limitation of 99 OSDs, now this limit has been
removed and regardless the number of OSDs they will all be collected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-10 16:35:30 -04:00
Noah Watkins 8dcc8d1434 Stringify ceph_docker_image_tag
This could be a numeric input, but is treated like a string leading to
runtime errors.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-10-10 04:26:33 +00:00
Noah Watkins 306e308f13 Avoid using tests as filter
Fixes the deprecation warning:

  [DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
  using `result|search` use `result is search`.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-10-10 04:26:33 +00:00
Guillaume Abrioux 79bd06ad28 rolling_update: add ceph-handler role
since the introduction of ceph-handler, it has to be added in
rolling_update playbook as well

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-05 13:48:04 +00:00
Rishabh Dave b5d2ea269f don't use "static" field while including tasks
Instead used "import_tasks" and "include_tasks" to tell whether tasks
must be included statically or dynamically.

Fixes: https://github.com/ceph/ceph-ansible/issues/2998
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-10-04 07:44:28 +00:00
Sébastien Han bae0f41705 switch: copy initial mon keyring
We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't not
failing because `fail_on_missing` was False before 2.5, so now it's True
hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-03 13:58:53 +00:00
Guillaume Abrioux 03e76af7b4 switch: add missing call to ceph-handler role
Add missing call the ceph-handler role, otherwise we can't have
reference to variable registered from ceph-handler from other roles.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-03 13:58:53 +00:00
Guillaume Abrioux 54b02fe187 switch: support migration when cluster is scrubbing
Similar to c13a3c3 we must allow scrubbing when running this playbook.

In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag.

This commit allows to switch from non containerized to containerized
environment even while PGs are scrubbing.

Closes: #3182

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-03 13:58:53 +00:00
Andrew Schoen 9747f3dbd5 purge-cluster: zap devices used with the lvm scenario
Fixes: https://github.com/ceph/ceph-ansible/issues/3156

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-09-28 14:49:56 +02:00
wumingqiao 5da71e1ca1 purge-cluster: recursively remove ceph-related files, symlinks and directories under /etc/systemd/system.
fix: https://github.com/ceph/ceph-ansible/issues/3166

Signed-off-by: wumingqiao <wumingqiao@beyondcent.com>
2018-09-28 14:49:22 +02:00
Rishabh Dave 380168dadc don't use "include" to include tasks
Use "import_tasks" or "include_tasks" instead.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-09-27 17:53:40 +02:00
Guillaume Abrioux 144c92b21f purge: actually remove of /var/lib/ceph/*
38dc20e74b introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.

`/var/lib/ceph/*` files are not purged it means there is a leftover.

When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16  /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23  /entrypoint.sh:
SUCCESS
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-27 17:45:21 +02:00
Guillaume Abrioux 179c4d00d7 rolling_update: ensure pgs_by_state has at least 1 entry
Previous commit c13a3c3 has removed a condition.

This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-25 14:58:54 +00:00
Guillaume Abrioux c13a3c3492 upgrade: consider all 'active+clean' states as valid pgs
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag before a
rolling update which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-25 12:12:06 +00:00
Guillaume Abrioux 57f0b6a476 shrink-osd: follow up on 36fb3cde
- Adds loop in bash to satisfy the 1:n relation between `osd_hosts` and the
different device lists.
- Fixes some container name which were using the host hostname instead
of the actual container one.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 07:27:41 +00:00
Sébastien Han 735e1917db shrink-osd: purge dedicated devices
Once the OSD is destroyed we also have to purge the associated devices,
this means purging journal, db , wal partitions too.

This now works for container and non-container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-18 07:27:41 +00:00
Guillaume Abrioux 4159326a18 shrink-osd: fix purge osd on containerized deployment
ce1dd8d introduced the purge osd on containers but it was incorrect.

`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, it means it was trying to
resolve parent devices from this node where it should be done on OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 18:14:01 +02:00
Sébastien Han 38dc20e74b purge: only purge /var/lib/ceph content
Sometime /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy) so let's remove its content only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-03 10:51:24 +02:00
Ali Maredia 561ec9203d infrastructure-playbooks: add comments for lv_vars.yml
Add comments telling user that devices used in
playbooks must not have GPT/FS/RAID signatures

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Ali Maredia 77eb459a88 infrastructure playbooks: remove lv-create error msg
remove error message when PV creation fails

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Ali Maredia e1ff438800 infrastructure-playbooks: failure msg for pvcreate
Add a message for when PV creation fails.

This message alerts users that FS/GPT/RAID
signatures could still on the device and the
reason for the failures.

`wipefs -a $device` needs to be run to fix this issue.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-28 20:21:42 +00:00
Sébastien Han 2e6e885bb7 rolling_upgrade: set sortbitwise properly
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSD are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSD are sending a version
10 and then run the command to unlock the OSDs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 12:22:32 +00:00
Sébastien Han 77a3a682f3 iscsi group name preserve backward compatibility
Recently we renamed the group_name for iscsi iscsigws where previously
it was named iscsi-gws. Existing deployments with a host file section
with iscsi-gws must continue to work.

This commit adds the old group name as a backoward compatility, no error
from Ansible should be expected, if the hostgroup is not found nothing
is played.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 23:52:19 +02:00
Sébastien Han b738706810 take-over-existing-cluster: do not call var_files
We were using var_files long ago when default variables were not in
ceph-defaults, now the role exists this is not need. Moreover having
these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

Will create collision and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 14:47:04 +02:00
Andrew Schoen 04df3f0802 lv-create: use copy instead of the template module
The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 131796f275 lv-create: add an example logfile_path config option in lv_vars.yml
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen b0bfc17351 lv-teardown: fail silently if lv_vars.yml is not found
This allows user to opt out of using lv_vars.yml and load configuration
from other sources.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 8424858b40 lv-teardown: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen e43eec57bb lv-create: fail silenty if lv_vars.yml is not found
If a user decides to to use the lv_vars.yml file then it should fail
silenty so that configuration can be picked up from other places.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen fde47be13c lv-create: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 35301b35af lv-create: use the template module to write log file
The copy module will not expand the template and render the variables
included, so we must use template.

Creating a temp file and using it locally means that you must run the
playbook with sudo privledges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 909b38da82 infrastructure-playbooks/vars/lv_vars.yaml: minor fixes
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha f65f3ea89f infrastructure-playbooks/lv-create.yml: use tempfile to create logfile
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 65fdad0723 infrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 50a6d8141c infrastructure-playbooks/lv-create.yml: copy without using a template file
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 186c4e11c7 infrastructure-playbooks/lv-create.yml: don't use action to copy
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 9d43806df9 infrastructure-playbooks: standardize variable usage with a space after brackets
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha e0293de3e7 vars/lv_vars.yaml: remove journal_device
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Ali Maredia 1f018d8612 infrastructure-playbooks: playbooks for creating LVs for bucket indexes and journals
These playbooks create and tear down logical
volumes for OSD data on HDDs and for a bucket index and
journals on 1 NVMe device.

Users should follow the guidelines set in var/lv_vars.yaml

After the lv-create.yml playbook is run, output is
sent to /tmp/logfile.txt for copy and paste into
osds.yml

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-16 16:38:23 +02:00
Sébastien Han dad10e8f3f rolling_update: register container osd units
Before running the upgrade, let's call systemd to collect unit names
instead of relaying on the device list. This is more accurate and fix
the osd_auto_discovery scenario too.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-16 11:13:12 +02:00
Jeffrey Zhang 85cc61a6d9 Use /var/lib/ceph/osd folder to filter osd mount point
In some case, use may mount a partition to /var/lib/ceph, and umount
it will be failure and no need to do so too.

Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
2018-08-14 13:00:24 +00:00
Sébastien Han b3266c5be2 rolling_update: set osd sortbitwise
upgrade RHCS 2 -> RHCS 3 will fail if cluster has still set
sortnibblewise,
it stay stuck on "TASK [waiting for clean pgs...]" as RHCS 3 osds will
not start if nibblewise is set.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-07-24 17:19:02 +02:00
Sébastien Han ce1dd8d2b3 shrink-osd: purge osd on containerized deployment
Prior to this commit we were only stopping the container, but now we
also purge the devices.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-07-18 14:26:22 +00:00
Guillaume Abrioux d0746e0858 common: switch from docker module to docker_container
As of ansible 2.4, `docker` module has been removed (was deprecated
since ansible 2.1).
We must switch to `docker_container` instead.

See: https://docs.ansible.com/ansible/latest/modules/docker_module.html#docker-module

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-07-10 20:08:07 +00:00
Vishal Kanaujia 44d514850a Rolling upgrades: Migrate to ceph-key module
This change moves ceph-mgr upgrades to using ceph-key library.
Fixes: #2758

Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
2018-07-03 18:22:14 +02:00
Sébastien Han 20c8065e48 ceph-iscsi: rename group iscsi_gws
Let's try to avoid using dashes as testinfra needs to be able to read
the groups.
Typically, with iscsi-gws we can't add a marker for these iscsi nodes,
using an underscore fixes the issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-06-08 10:21:54 +02:00
Guillaume Abrioux 232a16d77f rolling_update: fix facts gathering delegation
this is kind of follow up on what has been made in #2560.
See #2560 and #2553 for details.

Closes: #2708

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-06-06 16:36:30 +08:00
Vishal Kanaujia 08d9432454 Rolling upgrades should use norebalance flag for OSDs
The rolling upgrades playbook should have norebalance flag set for
OSDs upgrades to wait only for recovery.

Fixes: #2657
Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
2018-06-04 10:59:01 +02:00
Sébastien Han e91648a7af rolling_update: add role ceph-iscsi-gw
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575829
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-26 02:38:47 -07:00
Paul Cuzner 2890b57cfc Add privilege escalation to iscsi purge tasks
Without the escalation, invocation from non-root
users with fail when accessing the rados config
object, or when attempting to log to /var/log

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-05-25 03:50:24 -07:00
Sébastien Han da5b104098 rolling_update: fix get fsid for containers
When running ansible2.4-update_docker_cluster there is an issue on the
"get current fsid" task. The current task only works for
non-containerized deployment but will run all the time (even for
containerized). This currently results in the following error:

TASK [get current fsid] ********************************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-luminous-ansible2.4-update_docker_cluster/rolling_update.yml:214
Tuesday 22 May 2018  22:48:32 +0000 (0:00:02.615)       0:11:01.035 ***********
fatal: [mgr0 -> mon0]: FAILED! => {
    "changed": true,
    "cmd": [
        "ceph",
        "--cluster",
        "test",
        "fsid"
    ],
    "delta": "0:05:00.260674",
    "end": "2018-05-22 22:53:34.555743",
    "rc": 1,
    "start": "2018-05-22 22:48:34.295069"
}

STDERR:

2018-05-22 22:48:34.495651 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000
2018-05-22 22:48:34.495684 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).fault

This is not really representative on the real error since the 'ceph' cli is available on that machine.
On other environments we will have something like "command not found: ceph".

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-23 04:44:12 +02:00
Guillaume Abrioux 9801bde4d4 purge_cluster: fix dmcrypt purge
dmcrypt devices aren't closed properly, therefore, it may fail when
trying to redeploy after a purge.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Closing properly dmcrypt devices allows to redeploy without error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-21 08:23:10 +02:00
Guillaume Abrioux 415dc0a29b take-over: fix bug when trying to override variable
A customer has been facing an issue when trying to override
`monitor_interface` in inventory host file.
In his use case, all nodes had the same interface for
`monitor_interface` name except one. Therefore, they tried to override
this variable for that node in the inventory host file but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of undefined variable.

Typical error:

```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```

Including variables like this `include_vars: group_vars/all.yml` prevent
us from overriding anything in inventory host file because it
overwrites everything you would have defined in inventory.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-18 10:10:08 +02:00