73 Commits (9cad113e2f22132d08208cd58462f11056c41305)

Author SHA1 Message Date
Guillaume Abrioux 9cad113e2f purge_cluster: fix bug when building device list
there are some leftovers on devices when purging osds because of an invalid
device list construction.

typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
    "changed": true,
    "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
    "delta": "0:00:00.015188",
    "end": "2018-05-16 12:41:40.408597",
    "item": "/dev/sda sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:40.393409"
}

STDOUT:

sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600

STDERR:

Error: Could not stat device /dev/sda sda1 - No such file or directory.
```

the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output

eg:

```
changed: [osd3] => (item=/dev/sda1) => {
    "changed": true,
    "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
    "delta": "0:00:00.013634",
    "end": "2018-05-16 12:41:09.068166",
    "item": "/dev/sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:09.054532"
}

STDOUT:

/dev/sda sda1
```

For instance, it results in a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
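
One way to guard against this kind of multi-line output can be sketched in shell. This is only an illustration and not necessarily the change made in this commit: `parent_device` is a hypothetical helper that takes the raw text printed by `lsblk -no pkname <partition>` and keeps only its first line, which is assumed (based on the sample output above) to be the parent disk. Passing `--nodeps` to `lsblk`, so that only the named device is listed, is another option.

```shell
# Hypothetical helper, not part of the playbook.
# $1: raw output of `lsblk -no pkname <partition>`, possibly several lines.
parent_device() {
  # Keep only the first line, assumed to be the parent disk name.
  first_line=$(printf '%s\n' "$1" | head -n 1)
  printf '/dev/%s\n' "$first_line"
}

# Simulated two-line lsblk output for /dev/sda1 (no real device needed).
parent_device "$(printf 'sda\nsda1')"
```

Quoting the command substitution matters here: an unquoted `$(...)` passed to `echo` is word-split, which is how two output lines end up joined as `/dev/sda sda1`.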

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-17 08:37:17 +02:00
Andrew Schoen 08f4875533 ceph_volume: refactor to not run ceph osd destroy
This changes state to action and gives the options 'create'
or 'zap'. The zap parameter is also removed.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Andrew Schoen c6e8f8fb11 purge-cluster: no need to use objectstore for ceph_volume module
When zapping objectstore is not required.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Andrew Schoen c29a75ac7f purge-cluster: use ceph_volume module to zap and destroy OSDs
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Guillaume Abrioux dd0c98c5a2 common: do not use `shell` module when it is not needed
There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux deaf273b25 syntax: change local_action syntax
Use a nicer syntax for `local_action` tasks.
We used to have one-liners like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and keeps the whole playbook consistent.

This also fixes a potential issue with missing quotation marks:

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can fail with a complaint about missing quotes; this change solves that issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux c5b7b37105 purge-cluster: clean some code
Avoid using regexp to match device

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Guillaume Abrioux eeedefdf02 purge-cluster: wipe disk using dd
`bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices, which causes `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.

typical error seen:

```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```
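
A dry-run sketch of the kind of wipe sequence involved is shown below. The three command lines mirror the ones visible in the task output of commit 9cad113e2f above; whether this commit introduced exactly these arguments is an assumption, and `wipe_commands` is a hypothetical helper that only prints the commands instead of running them.

```shell
# Hypothetical dry-run helper: prints the wipe commands for one device.
# Nothing is executed, so it is safe to run on any machine.
wipe_commands() {
  dev=$1
  # Destroy the GPT and MBR data structures on the device.
  printf 'sgdisk -Z %s\n' "$dev"
  # Overwrite the first 200 MiB, where old OSD labels would live.
  printf 'dd if=/dev/zero of=%s bs=1M count=200\n' "$dev"
  # Wait for udev to finish processing the resulting events.
  printf 'udevadm settle --timeout=600\n'
}

wipe_commands /dev/sdb
```

Printing first and executing later makes it easy to review the device list before anything destructive happens.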

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Guillaume Abrioux aaaf980140 purge: fix bug on 'wait_for' task
this task hangs because `{{ inventory_hostname }}` doesn't resolve to an
actual IP address.
Using `hostvars[inventory_hostname]['ansible_default_ipv4']['address']`
should fix this because it will reach the node with its actual IP
address.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-29 11:10:56 +01:00
Guillaume Abrioux 947766e294 purge-cluster: remove usage of `with_fileglob`
`with_fileglob` loops over files on the machine where ansible-playbook
is being run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-21 08:24:11 +01:00
Sébastien Han 2837d0a22e purge: do not reboot by default
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by running the purge playbook with -e
reboot_osd_node=True

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-26 14:18:38 +02:00
Sébastien Han 24b82c2679 purge: fix journal purge
Using a condition when osd_scenario == 'non-collocated' was wrong since
these partitions can also be collocated on a single device. Removing the
check ensures these partitions are purged.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-10 09:57:39 +02:00
Guillaume Abrioux f147b119ed Merge pull request #2014 from ceph/fixes-2
infra: use the pg check in the right place
2017-10-09 20:14:06 +02:00
Sébastien Han 450108fab9 infra: add independent purge-iscsi-gateways.yml
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and is blocking the CI too. So remove it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-09 17:25:44 +02:00
Boris Ranto 64e272d818 purge-cluster: Do not use shell for rm
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.
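
For comparison, a shell-only way to avoid depending on wildcard expansion is to let `find` do the matching. This is a sketch of an alternative technique, not the change made in this commit (which uses the file module); `remove_matching` and its arguments are hypothetical.

```shell
# Hypothetical helper: remove entries in a directory that match a pattern
# without relying on the shell to expand the wildcard.
# $1: directory to look in, $2: name pattern interpreted by find.
remove_matching() {
  dir=$1
  pattern=$2
  # A missing directory is not an error: there is nothing to remove.
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -maxdepth 1 -name "$pattern" -exec rm -rf {} +
}

# Demonstration in a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/ceph-osd.service" "$scratch/unrelated.txt"
remove_matching "$scratch" 'ceph-*'
ls "$scratch"
rm -rf "$scratch"
```

Because the pattern is quoted, the shell never tries to expand it, so a pattern with no matches is not a failure under zsh or any other shell.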

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:54:37 +02:00
Boris Ranto f696cb7637 purge-cluster: Do not fail on systemd commands
systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these.

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:52:56 +02:00
Sébastien Han b6b24a5ca9 iscsi: fix wrong group name for iscsi
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498490
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-05 17:25:32 +02:00
zhangwentao 86a6db0d58 purge-cluster: delete block partitions if using bluestore
2017-09-29 14:04:17 +08:00
Andrew Schoen fccc604f4a purge-cluster: default lvm_volumes if not defined
Most osd scenarios do not use lvm_volumes, so default it in
purge-cluster.yml if it's not defined.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-09-26 15:14:29 -05:00
Guillaume Abrioux c80ba7a307 purge: implement mgr purge
Until now, mgr nodes were not managed by purge-cluster.yml, which
breaks scenarios like purge_cluster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-09-24 21:18:50 +02:00
Sébastien Han ba3e3b6cc7 purge: only purge specific directories for mon
Handles the case when a mon is collocated with an OSD.

Closes: https://github.com/ceph/ceph-ansible/issues/1877
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-13 17:07:04 -06:00
Sébastien Han aa364264cd resync ceph-iscsi-gw with old upstream
Taken from https://github.com/pcuzner/ceph-iscsi-ansible/tree/tcmu-fixes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1454945 and
https://bugzilla.redhat.com/show_bug.cgi?id=1484083
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-12 18:06:10 -06:00
Sébastien Han b9ced956d7 purge: get lockbox mountpoint and unmount it
Prior command was avoiding the lockbox mountpoint and the playbook was
failing with:

rmtree failed: [Errno 30] Read-only file system:
'/var/lib/ceph/osd-lockbox/4e9d8052-87c2-4fde-a56c-b8c108a3eefc/key-management-mode'

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-07 16:31:31 +02:00
Ben England 617d9ee75d dont use devices var anymore, works for osd_auto_discover
2017-08-28 17:27:01 -04:00
Andrew Schoen bed57572cc purge-cluster: adds support for purging lvm osds
This also adds a new testing scenario for purging lvm osds

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-08-23 10:33:35 -05:00
Sébastien Han 9c824b9818 purge: add ability to purge bluestore osd
We now purge block db and/or wal partitions if we find any.

Closes: https://github.com/ceph/ceph-ansible/issues/1770
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-21 18:08:18 +02:00
Sébastien Han 30991b1c0a osd: simplify scenarios
There are only two main scenarios now:

* collocated: everything remains on the same device:
  - data, db, wal for bluestore
  - data and journal for filestore
* non-collocated: dedicated device for some of the components

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-03 10:20:39 +02:00
Sébastien Han fad9d0caec Merge pull request #1690 from yanyixing/master
fix: when osd device is a disk partition
2017-07-26 15:55:29 +02:00
yanyx 2e6233271e fix: when osd device is a disk partition
2017-07-25 21:39:43 +08:00
Sébastien Han 0c18cf199e purge: remove leftover unit files
Closes https://github.com/ceph/ceph-ansible/issues/1672

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-07-25 13:26:28 +02:00
Andrew Schoen 5a3f95dfc1 purge-cluster: check for any running ceph process after purge
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-25 09:30:22 -05:00
Andrew Schoen 26bdd59f5d purge-cluster: we don't support sysv or upstart anymore
Now that ceph-ansible only supports > jewel we don't need
to bother with sysv or upstart

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:14:38 -07:00
Andrew Schoen 7ca2bddcce purge-cluster: do not need to check for running ceph processes
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:12:46 -07:00
Andrew Schoen aac79df3b3 purge-cluster: no need to remove ceph.target
The package uninstalls will stop ceph.target

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:11:03 -07:00
Daniel Lupescu d5e56c481a purge-cluster: fix grep match for NVMe and HP Smart Array devices
raw_device would return invalid block device names for NVMe and HPSA
devices which would cause sgdisk partition deletion to fail

$ echo /dev/nvme1n1p3 | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}'
/dev/nvme1n1p

$ echo /dev/cciss/c0d0p2 |  egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}'
/dev/cciss/c0d0p
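
A sketch of how the partition suffix could be stripped so that the trailing `p` is not left behind. This is an illustration under assumed device naming rules, not the pattern adopted by this commit; `raw_device` is a hypothetical helper.

```shell
# Hypothetical helper: map a partition path to its parent block device.
# NVMe and cciss partitions use a "p<N>" suffix; hd/sd/vd use "<N>" only.
raw_device() {
  printf '%s\n' "$1" | sed -E \
    -e 's#^(/dev/(nvme[0-9]+n[0-9]+|cciss/c[0-9]+d[0-9]+))p[0-9]+$#\1#' \
    -e 's#^(/dev/[hsv]d[a-z]+)[0-9]+$#\1#'
}

raw_device /dev/nvme1n1p3
raw_device /dev/cciss/c0d0p2
raw_device /dev/sda1
```

A path that already names a whole device, such as `/dev/sdb`, matches neither expression and is printed unchanged.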
2017-04-11 16:13:28 +03:00
Sébastien Han c37aaa41f4 playbook: homogenize the way list osd ids
Problem: too many different commands to do the same thing. The 'cut'
command on infrastructure-playbooks/purge-cluster.yml was also wrong.
This sed command from osixia in ceph-docker
https://github.com/ceph/ceph-docker/pull/580/ addresses all the
scenarios.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-03-30 11:51:38 +02:00
Andrew Schoen 4fe6607004 purge-cluster: do not set group name vars at playbook level
This has the behavior of overriding custom values set in group_vars.
I've added defaults to the rest of the group names so that if they are
not overridden in group_vars then defaults will be used.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1354700

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:08 -06:00
Shengjing Zhu 32923fd217 fix grep match pattern for osd ids
Some playbooks use [0-9]*, others use \d+$
The latter is more correct since cluster name may contain numbers.
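
The difference can be sketched with a cluster name that contains a digit. The name `ceph2-12` is a made-up example and `osd_id` is a hypothetical helper; `[0-9]+$` with `grep -E` is used here as a portable stand-in for `\d+$`, which needs PCRE support.

```shell
# Hypothetical helper: extract the OSD id from a "<cluster>-<id>" name.
# Anchoring at the end of the line ignores digits in the cluster name.
osd_id() {
  printf '%s\n' "$1" | grep -oE '[0-9]+$'
}

# Anchored: only the trailing id is returned.
osd_id "ceph2-12"

# Unanchored: every run of digits is returned, one per line.
printf '%s\n' "ceph2-12" | grep -oE '[0-9]+'
```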

Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>
2017-02-20 16:35:56 +08:00
Andrew Schoen 22f52a9dc6 purge-cluster: also purge dmcrypt dedicated journals
See: https://bugzilla.redhat.com/show_bug.cgi?id=1414647

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-15 10:27:17 -06:00
Andrew Schoen c5f561a4e9 purge-cluster: remove calamari-server package
See: https://bugzilla.redhat.com/show_bug.cgi?id=1422134

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves rhbz#1422134
2017-02-14 09:24:02 -06:00
Andrew Schoen 865b4500dc purge-cluster: set a default value for fetch_directory if not defined
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:25:43 -06:00
Andrew Schoen adf6aee643 purge-cluster: remove all include tasks
Including variables from role defaults or files in a group_vars
directory relative to the playbook is a bad practice. We don't want to
do this because including these defaults at the task level overrides
values that would be set in a group_vars directory relative to the
inventory file, which is the correct usage if you wish to override
those default values.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:25:43 -06:00
Andrew Schoen 0476b24af1 purge-cluster: do not use ceph-detect-init
We can not always ensure that ceph-detect-init will be
present on the system.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1418980

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:24:44 -06:00
Sébastien Han 72cd9199ac purge: ability to purge client role
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-07 22:14:18 +01:00
Sébastien Han d5dd658cfa purge: do not stop ceph.target on each daemon
Doing this causes all the daemons to go down at the same time. In a
scenario where we colocate a monitor and an OSD, the OSD will take
some time to go down, which will make the 'umount' task fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han cb57a359ba purge: do not fail on purge ceph files
On systems running docker there is an issue with lxcfs that results in
the find command returning 1 even though it actually did the job.
e.g.: on a system with docker running, find /var will give us the
following error:

find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/systemd-update-utmp.service/devices.deny':
Permission denied
find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/dev-random.mount/devices.allow':
Permission denied
...
...

However ceph files got deleted so we ignore the error.
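
The idea of tolerating a non-zero exit status from `find` can be sketched as follows. The search pattern and the helper name `purge_ceph_files` are hypothetical; the actual command used by the playbook is not shown in this message.

```shell
# Hypothetical helper: delete files whose name contains "ceph" under a
# root directory, and never report failure even if find could not read
# some unrelated subtree.
purge_ceph_files() {
  root=$1
  # Errors are discarded and the exit status is forced to 0.
  find "$root" -type f -name '*ceph*' -delete 2>/dev/null || true
}

# Demonstration in a scratch directory.
scratch=$(mktemp -d)
touch "$scratch/ceph.conf" "$scratch/keep.txt"
purge_ceph_files "$scratch"
ls "$scratch"
rm -rf "$scratch"
```

The trade-off is that genuine failures are hidden as well, so a follow-up check that the files are really gone is worth considering.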

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han e371bd591c purge: fix ubuntu purge when not using systemd
We now rely on the cli tool ceph-detect-init which will tell us the init
system in use on the distribution. We do this instead of the previous
lookup for systemd unit files to call the right task depending on the
init system.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han 0e2e270ab2 purge: allow purge to run multiple times
with_items is evaluated before the when, so in a second run where the
variable is empty it will fail with "'dict object' has no attribute
'stdout_lines'". To fix this we add a default array so with_items does
not fail and the task is skipped by the when.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Andrew Schoen d3cb8dba4e purge-cluster: fix failure when raw_multi_journal is not defined
Because the purge-cluster.yml playbook does not have access to the role's
default vars, we cannot be sure that raw_multi_journal is defined. For
example, if this was purging a dmcrypt journal then raw_multi_journal
might not be defined at all in group_vars/all.yml or
group_vars/osds.yml.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-27 05:23:17 -06:00
Andrew Schoen b2a6f095f1 purge-cluster: fix syntax when deleting dmcrypt devices
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-26 11:28:30 -06:00