Recently we introduced the default collocation of mon/mgr without the
need of a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fc6ebd8ebb)
For apt-get, purge command needs to be used, instead of remove command,
to remove related configuration files. Otherwise, packages might be
shown as installed while running dpkg command even after removing them.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1640061
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 640cad3fd8)
38dc20e74b introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.
`/var/lib/ceph/*` files are not purged it means there is a leftover.
When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.
Typical error (from container output):
```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16 /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23 /entrypoint.sh:
SUCCESS
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Sometime /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy) so let's remove its content only.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
In some case, use may mount a partition to /var/lib/ceph, and umount
it will be failure and no need to do so too.
Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
there is some leftover on devices when purging osds because of a invalid
device list construction.
typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
"changed": true,
"cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
"delta": "0:00:00.015188",
"end": "2018-05-16 12:41:40.408597",
"item": "/dev/sda sda1",
"rc": 0,
"start": "2018-05-16 12:41:40.393409"
}
STDOUT:
sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600
STDERR:
Error: Could not stat device /dev/sda sda1 - No such file or directory.
```
the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output
eg:
```
changed: [osd3] => (item=/dev/sda1) => {
"changed": true,
"cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
"delta": "0:00:00.013634",
"end": "2018-05-16 12:41:09.068166",
"item": "/dev/sda1",
"rc": 0,
"start": "2018-05-16 12:41:09.054532"
}
STDOUT:
/dev/sda sda1
```
For instance, it will result with a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This changes state to action and gives the options 'create'
or 'zap'. The zap parameter is also removed.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
module: wait_for
port: 22
host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
state: started
delay: 10
timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.
This also fix a potential issue about missing quotation :
```
Traceback (most recent call last):
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
main()
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
File "/usr/lib64/python2.7/shlex.py", line 279, in split
return list(lex) File "/usr/lib64/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
`bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices and cause the `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.
typical error seen :
```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
this task hangs because `{{ inventory_hostname }}` doesn't resolv to an
actual ip address.
Using `hostvars[inventory_hostname]['ansible_default_ipv4']['address']`
should fix this because it will reach the node with its actual IP
address.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by default by running the purge playbook with -e
reboot_osd_node=True
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
Using a condition when osd_scenario == 'non-collocated' was wrong since
these partitions can be collocated on a single device also. Removing the
check makes the purge of these partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and blocking the CI too. So removing it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.
Signed-off-by: Sébastien Han <seb@redhat.com>
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.
Signed-off-by: Boris Ranto <branto@redhat.com>
The systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these.
Signed-off-by: Boris Ranto <branto@redhat.com>
unti now, mgr nodes are not managed by purge-cluster.yml, therefore it
breaks scenario like purge_cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior command was avoiding the lockbox mountpoint and the playbook was
failing with:
rmtree failed: [Errno 30] Read-only file system:
'/var/lib/ceph/osd-lockbox/4e9d8052-87c2-4fde-a56c-b8c108a3eefc/key-management-mode'
Signed-off-by: Sébastien Han <seb@redhat.com>
There is only two main scenarios now:
* collocated: everything remains on the same device:
- data, db, wal for bluestore
- data and journal for filestore
* non-collocated: dedicated device for some of the component
Signed-off-by: Sébastien Han <seb@redhat.com>
Problem: too many different commands to do the same thing. The 'cut'
command on infrastructure-playbooks/purge-cluster.yml was also wrong.
This sed command from osixia in ceph-docker
https://github.com/ceph/ceph-docker/pull/580/ addresses all the
scenarios.
Signed-off-by: Sébastien Han <seb@redhat.com>
This has the behavior of overriding custom values set in group_vars.
I've added defaults to the rest of the group names so that if they are
not overridden in group_vars then defaults will be used.
See: https://bugzilla.redhat.com/show_bug.cgi?id=1354700
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Some playbooks use [0-9]*, others use \d+$
The latter is more correct since cluster name may contain numbers.
Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>