Commit Graph

247 Commits (96c049be5b8e478548f68ec7312cd299fcda1bbc)

Author SHA1 Message Date
Guillaume Abrioux c04e67347c update: look for short and fqdn in ceph_health_raw
According to hostname configuration, the task waiting for mons to be in
quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`
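
A minimal Python sketch of this lookup, for illustration only. It assumes the raw health output is JSON with a `quorum_names` list; the key name, the sample data, and the helper `mon_in_quorum` are assumptions and are not taken from the playbook.

```python
import json


def mon_in_quorum(health_raw, shortname, fqdn):
    """Return True if either the short name or the FQDN is listed in quorum."""
    # Assumption: the raw health output is JSON with a "quorum_names" list.
    quorum_names = json.loads(health_raw).get("quorum_names", [])
    return shortname in quorum_names or fqdn in quorum_names


# Hypothetical sample: the monitors are registered under their FQDN.
sample = json.dumps({"quorum_names": ["mon0.example.com", "mon1.example.com"]})
print(mon_in_quorum(sample, "mon0", "mon0.example.com"))
```

Checking only the short name would miss `mon0` in this sample, which is the failure mode described above.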

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-19 10:27:47 +01:00
Andrew Schoen 699c777e68 rolling update: fix undefined jewel_minor_update failure
Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.

The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-13 17:03:05 +01:00
Andrew Schoen 7c7017ebe6 infra: do not include host_vars/* in take-over-existing-cluster.yml
These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-12 11:48:47 +01:00
Guillaume Abrioux 3b2f6c34e4 purge-docker: fix ceph-osd-zap name container
The `zap ceph osd disks` task should iterate over `resolved_parent_device`,
which contains only the base device name, instead of
`combined_devices_list`, which contains the full path name.

This fixes the issue where Docker complains about the container name because
of illegal characters such as `/`:
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```

Having the basename of the device path is enough for the container
name.
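
A small Python sketch of why the basename is enough. The character classes come from the Docker error message above; the `+` quantifier, the full-string match, and the helper names are assumptions made for illustration.

```python
import os
import re

# Character classes taken from the Docker error message above; the "+"
# quantifier and the use of fullmatch are assumptions.
VALID_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def zap_container_name(hostname, device):
    """Build a container name from the device basename (hypothetical helper)."""
    return "ceph-osd-zap-{}-{}".format(hostname, os.path.basename(device))


def is_valid_name(name):
    """Check a name against the pattern quoted in the error message."""
    return VALID_NAME.fullmatch(name) is not None


bad = "ceph-osd-zap-magna074-/dev/sdb1"
good = zap_container_name("magna074", "/dev/sdb1")
print(good, is_valid_name(good), is_valid_name(bad))
```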

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-02 22:09:11 +01:00
Guillaume Abrioux dd0c98c5a2 common: do not use `shell` module when it is not needed
There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux deaf273b25 syntax: change local_action syntax
Use a nicer syntax for `local_action` tasks.
We used to have one-liners like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and helps keep the whole playbook consistent.

This also fixes a potential issue with missing quotation marks:

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

Writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can fail with a complaint about missing quotes; this fix solves that issue.
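
The `No closing quotation` error in the traceback comes from Python's `shlex` parser. The sketch below reproduces that class of failure on a free-form command string; the sample command lines and the helper `try_split` are hypothetical, since the commit message does not show the value that triggered the bug.

```python
import shlex


def try_split(command_line):
    """Split a free-form command line the way a shell-like parser would."""
    try:
        return shlex.split(command_line)
    except ValueError as exc:
        return "error: {}".format(exc)


# Hypothetical rendered command lines; the real offending value is not shown
# in the commit message.
print(try_split("echo abc | tee /tmp/ceph_cluster_uuid.conf"))
print(try_split("echo it's | tee /tmp/ceph_cluster_uuid.conf"))
```

A single unbalanced quote in the rendered string is enough to make the whole command fail to parse.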

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux f372a4232e purge: fix resolve parent device task
This is a typo caused by a leftover.
It was previously written like this:
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to:
`shell: $(lsblk --nodeps -no pkname "{{ item }}") }}")`
because '/dev/' is appended later, in the next task.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-30 17:40:10 +01:00
Guillaume Abrioux c7ec12d49c upgrade: skip luminous tasks for jewel minor update
These tasks are needed only when upgrading to luminous.
They are not needed in a Jewel minor upgrade and, in any case, they fail because
the `ceph versions` command doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-25 18:30:34 +01:00
Sébastien Han 8af7459476 rolling update: add mgr exception for jewel minor updates
When updating from one minor Jewel version to another, the playbook will
fail on the task "fail if no mgr host is present in the inventory".
This can now be worked around by running Ansible with:

-e jewel_minor_update=true

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1535382
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-18 14:06:05 +01:00
Guillaume Abrioux 55298fa80c purge-container: use lsblk to resolv parent device
Using `lsblk` to resolve the parent device is better than just removing the last
char when passing it to the zap container.
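
A short Python sketch of why stripping the last character is fragile. The device paths are hypothetical samples, and `lsblk_parent_command` only builds the argument list for the `lsblk` query without running it.

```python
def parent_by_stripping(partition):
    """Naive approach: drop the last character of the partition path."""
    return partition[:-1]


def lsblk_parent_command(partition):
    """Argument list asking lsblk for the parent kernel name (PKNAME)."""
    return ["lsblk", "-no", "pkname", partition]


# Hypothetical partitions: only the first one is handled correctly.
for part in ("/dev/sdb1", "/dev/sdb10", "/dev/nvme0n1p1"):
    print(part, "->", parent_by_stripping(part))
```

With a two-digit partition number or an NVMe-style name, the naive result is not the parent disk, whereas `lsblk` reports the parent regardless of naming scheme.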

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-17 15:54:20 +01:00
Guillaume Abrioux 58eb045d2f purge-container: remove awk usage in favor of blkid
Avoid using `awk` to get the different devices from the partlabel.
Using `blkid` is more readable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-17 15:54:20 +01:00
Andrew Schoen b613321c21 switch-to-containers: do not fail when stopping the nfs-ganesha service
If we're working with a jewel cluster then this service will not exist.

This is mainly a problem with CI testing because our tests are set up to
work with both jewel and luminous, meaning that even though we want to
test jewel we still have an nfs-ganesha host in the test, causing these
tasks to run.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Andrew Schoen 0b4b60e3c9 switch-to-containers: do not fail when stopping the ceph-mgr daemon
If we are working with a jewel cluster ceph mgr does not exist
and this makes the playbook fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Andrew Schoen 997edea271 rolling_update: do not fail the playbook if nfs-ganesha is not present
The rolling update playbook was attempting to stop the
nfs-ganesha service on nodes where jewel is still installed.
The nfs-ganesha service did not exist in jewel so the task fails.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Guillaume Abrioux c5b7b37105 purge-cluster: clean some code
Avoid using regexp to match device

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Guillaume Abrioux eeedefdf02 purge-cluster: wipe disk using dd
The `bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices, which causes `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.

Typical error seen:

```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Sébastien Han 200785832f rolling_update: do not require root to answer question
There is no need to ask for root on the local action. This will prompt
for a password if the current user is not part of sudoers. That's
unnecessary anyway.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1516947
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-19 14:04:55 +01:00
Guillaume Abrioux aaaf980140 purge: fix bug on 'wait_for' task
This task hangs because `{{ inventory_hostname }}` doesn't resolve to an
actual IP address.
Using `hostvars[inventory_hostname]['ansible_default_ipv4']['address']`
should fix this because it will reach the node with its actual IP
address.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-29 11:10:56 +01:00
Guillaume Abrioux 947766e294 purge-cluster: remove usage of `with_fileglob`
`with_fileglob` loops over files on the machine where ansible-playbook
is being run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-21 08:24:11 +01:00
Guillaume Abrioux d9c1b61092 purge-docker: remove osd disk prepare logs
`with_fileglob` loops over files on the machine that runs the playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-16 14:27:36 +01:00
Sébastien Han 68566444e9
Merge pull request #2142 from squidboylan/master
infra: fix take-over-existing-cluster.yml playbook
2017-11-13 22:06:16 +11:00
Guillaume Abrioux fa675f2ead purge-docker-cluster: ensure old logs are removed
purge-docker-cluster must remove all osd_disk_prepare logs in
`{{ ceph_osd_docker_run_script_path }}`; otherwise, if you purge your
cluster and try to redeploy it, OSDs will fail to start because they
will try to find a partition UUID which doesn't exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510470

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-09 17:49:20 +01:00
Caleb Boylan 41d10a2f64 infra: fix take-over-existing-cluster.yml playbook
The ansible inventory could have more than just ceph-ansible hosts, so
we shouldn't use "hosts: all". Also, only grab one file when getting
the ceph cluster name instead of failing when there is more than one
file in /etc/ceph. Also fix the location of the ceph.conf template.
2017-11-06 15:00:30 -08:00
Sébastien Han 473673ab41 shrink-mon: fix typo in the code doc
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-27 11:59:22 +02:00
Sébastien Han 2837d0a22e purge: do not reboot by default
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by running the purge playbook with -e
reboot_osd_node=True

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-26 14:18:38 +02:00
Guillaume Abrioux f90f2f3a04 purge: containers are not stopped
During the OSD purge, the containers are not stopped because of a typo; as a
result, the devices can't be unmounted later.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-10-25 07:58:00 +02:00
Sébastien Han 4413511b66 all: backward compatibility between stable-2.2 and 3.0
stable-3.0 brought numerous changes in ceph-ansible variables. This PR
aims to maintain backward compatibility for someone running stable-2.2
who upgrades to stable-3.0 but keeps their group_vars untouched.
We will then determine the right options to make sure the upgrade works,
but we expect the new variables to be used.

We will drop this in the near future, maybe 3.1 or 3.2.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-20 11:54:10 +02:00
Guillaume Abrioux 982326373b upgrade: fix upgrade jewel to luminous for nfs nodes
nfs nodes can't be upgraded from jewel to luminous because the ceph-nfs role
is skipped by the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed, the
package is upgraded in the `ceph-nfs` role, so when the condition is evaluated
`ceph_release` is still set to the old version. It means the `when` can't
be satisfied.
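
The condition can be pictured with a small Python sketch. The dict mirrors the `ceph_release_num` lookup quoted above, using the upstream major versions (jewel is 10.x, luminous is 12.x); the helper `role_runs` is hypothetical.

```python
# Major version numbers of the two releases discussed here; the dict layout
# mirrors the lookup in the quoted condition.
ceph_release_num = {"jewel": 10, "luminous": 12}


def role_runs(ceph_release):
    """Evaluate the quoted `when` condition for a detected release."""
    return ceph_release_num[ceph_release] >= ceph_release_num["luminous"]


# Before the role upgrades the package, the detected release is still jewel,
# so the role that would perform the upgrade is skipped.
print(role_runs("jewel"), role_runs("luminous"))
```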

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-10-19 20:54:23 +02:00
Guillaume Abrioux 70034451e9 upgrade: fix upgrade jewel to luminous for mgr nodes
mgr nodes can't be upgraded from jewel to luminous because the ceph-mgr role
is skipped by the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed, the
ceph-mgr package is upgraded in the `ceph-mgr` role, so when the condition is
evaluated `ceph_release` is still set to the old version. It means the `when`
can't be satisfied.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 302e563601cd6820b1ae44fabdfb1506688c7c9b)
2017-10-19 20:54:23 +02:00
Sébastien Han d920d4839d upgrade: support for rbd mirror and nfs
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurrences)
- A bit of cleanup

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-17 10:54:47 +02:00
Sébastien Han 39bf102b64 switch: nicer way to check mon quorum
Reuse the same syntax as rolling_update.yml

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-17 10:54:36 +02:00
Sébastien Han b685aceede Merge pull request #2044 from major/avoid-jinja-in-when
Remove jinja2 delimiters from `when` keys
2017-10-12 22:23:06 +02:00
Major Hayden c01851325e
Remove jinja2 delimiters from `when` keys
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.
2017-10-12 11:27:42 -05:00
Major Hayden 33b200d43a
Suppress yum/dnf/rpm command warnings
Ansible throws warnings when using yum/dnf/rpm with the command
module:

    [WARNING]: Consider using yum module rather than running yum

This patch adds the `warn: no` argument to suppress the warnings
in the Ansible output.
2017-10-12 08:38:05 -05:00
Sébastien Han 13bce287ad infra: replace osd playbook
This playbook can replace a failed OSD in containerized and
non-containerized environments.
The current limitation is that it won't allow you to choose between
filestore/bluestore and will do collocation as well.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-12 11:53:30 +02:00
Sébastien Han 85e13a864c purge-iscsi: fix group name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1500281
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-11 12:52:12 +02:00
Sébastien Han 24b82c2679 purge: fix journal purge
Using the condition `osd_scenario == 'non-collocated'` was wrong since
these partitions can also be collocated on a single device. Removing the
check ensures these partitions are purged.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-10 09:57:39 +02:00
Guillaume Abrioux f147b119ed Merge pull request #2014 from ceph/fixes-2
infra: use the pg check in the right place
2017-10-09 20:14:06 +02:00
Sébastien Han 450108fab9 infra: add independant purge-iscsi-gateways.yml
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and is blocking the CI too. So remove it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-09 17:25:44 +02:00
Sébastien Han 774697ebd8 infra: use the pg check in the right place
Use the pg check before doing the pg check, not on the quorum check.
Also never quote an int when doing a comparison.
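
A generic Python illustration of the hazard with quoted integers: once both sides are strings, the comparison is lexicographic. The helper names are hypothetical, and how a given Ansible/Jinja2 version treats a quoted int is not shown here.

```python
def quoted_less_than(a, b):
    """Compare the values as strings, as happens when ints are quoted."""
    return str(a) < str(b)


def int_less_than(a, b):
    """Compare the values as integers."""
    return int(a) < int(b)


# "9" sorts after "10" as a string, although 9 < 10 as an integer.
print(quoted_less_than(9, 10), int_less_than(9, 10))
```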

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-09 17:25:41 +02:00
Sébastien Han a3e7bcb13f Merge pull request #2013 from ceph/wip-purge-cluster
A couple of purge cluster fixes
2017-10-09 17:18:30 +02:00
Sébastien Han 33a3aa0dda switch: check pgs only when num_pgs > 0
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:42:09 +02:00
Sébastien Han 05f26031ea rolling_update: perform pg check when pgs_num > 0
If num_pgs = 0 the check will never return 0.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:09 +02:00
Sébastien Han c3c63ae539 switch: rework and fix clean pg wait
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:09 +02:00
Sébastien Han c693e95cbf purge-docker: rework device detection
We don't need "devices" and other device variables anymore; the playbook
detects that for us.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:04 +02:00
Sébastien Han 2fb4981ca9 shrink-osd: admin key not needed for container shrink
Also do some cleanup

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 00:20:43 +02:00
Boris Ranto 64e272d818 purge-cluster: Do not use shell for rm
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.
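
As a rough Python analogy (not the playbook's code), expanding the glob first and then removing each match means an empty match is simply a no-op instead of a shell error. The helper `remove_matching` and the temporary directory are assumptions for the demonstration.

```python
import glob
import os
import tempfile


def remove_matching(pattern):
    """Remove every file matching the pattern; no match is not an error."""
    removed = []
    for path in glob.glob(pattern):
        os.remove(path)
        removed.append(path)
    return removed


# Create one scratch file, then remove it; the second call matches nothing.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "a.log"), "w").close()
print(len(remove_matching(os.path.join(workdir, "*.log"))))
print(len(remove_matching(os.path.join(workdir, "*.log"))))
```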

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:54:37 +02:00
Boris Ranto f696cb7637 purge-cluster: Do not fail on systemd commands
systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these failures.

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:52:56 +02:00
Sébastien Han b6b24a5ca9 iscsi: fix wrong group name for iscsi
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498490
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-05 17:25:32 +02:00
Sébastien Han f37e014a65 Merge pull request #1974 from ceph/mgr-upgrade-luminous
upgrade: a support for mgrs
2017-10-03 19:57:31 +02:00