Commit Graph

4029 Commits (89e76e5bafc7bcca7791cf71fdc110e14fd8f44c)
 

Author SHA1 Message Date
Subhachandra Chandra c7e269fcf5 Fix restarting OSDs twice during a rolling update.
During a rolling update, OSDs are restarted twice currently. Once, by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
2018-05-22 19:23:07 +02:00
Alfredo Deza 4d1338b4bf validate: split schema for lvm osd scenario per objecstore
The bluestore lvm osd scenario does not require a journal entry. For
this reason we need to have a separate schema for that and filestore or
notario will fail validation for the bluestore lvm scenario because the
journal key does not exist in lvm_volumes.

Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit d916246bfeb927779fa920bab2e0cc736128c8a7)
2018-05-22 17:57:28 +02:00
Andrew Schoen a9ad8eb5f3 ceph-validate: do not check ceph version on dev or rhcs installs
A dev or rhcs install does not require ceph_stable_release to be set and
instead generates that by looking at the installed ceph-version.
However, at this point in the playbook ceph may not have been installed
yet and ceph-common has not be run.

Fixes: https://github.com/ceph/ceph-ansible/issues/2618

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-21 23:11:04 +02:00
Guillaume Abrioux 9801bde4d4 purge_cluster: fix dmcrypt purge
dmcrypt devices aren't closed properly, therefore, it may fail when
trying to redeploy after a purge.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Closing properly dmcrypt devices allows to redeploy without error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-21 08:23:10 +02:00
Andrew Schoen e7d02a50d8 ceph-validate: move system checks from ceph-common to ceph-validate
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 62c34e3c9d set the python-notario version to >= 0.0.13 in ceph-ansible.spec.in
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen c40ed1c66b site.yml: combine validate play with fact gathering play
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen fd7bb16e2f docs: explain the ceph-validate role and how it validates configuration
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen cf2868f0d1 validate: support validation of osd_auto_discovery
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 8b6097e565 validate: remove objectstore from osd options schema
objectstore is not a valid option, it's osd_objectstore and it's already
validated in install_options

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 645f61c351 ceph-defaults: remove backwards compat for containerized_deployment
The validation module does not get config options with the template
syntax rendered, so we're gonna remove that and just default it to
False. The backwards compat was schedule to be removed in 3.1 anyway.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen c65ea7e9d7 site-docker: validate config before pulling container images
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 890e265fd3 validate: adds a CEPH_RELEASES constant
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen d30a99c350 validate: add support for containerized_deployment
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 5d64eb79c1 validate: show an error and stop the playbook when notario is missing
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 62d6f2d84a site-docker.yml: add config validation play
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen a80a109ac9 site.yml: the validation play must use become: true
The ceph-defaults role expects this.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 12bdb8ef87 docs: add instructions for installing ansible and notario
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen ef48ed4e5a adds a requiremnts.txt file for the project
With the addition of the validate module we need to ensure
that notario is installed. This will be done with the use
of this requirments.txt file and pip.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen dea1ea93d5 tests: use notario>=0.0.13 when testing
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen f84c2ba27b ceph-defaults: fix failing tasks when osd_scenario was not set correctly
When devices is not defined because you want to use the 'lvm'
osd_scenario but you've made a mistake selecting that scenario these
tasks should not fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 91f65e2420 validate: improve error messages when config fails validation
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen d83bdce8a9 site.yml: abort playbook when it fails during config validation
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 1f15a81c48 ceph-defaults: move cephfs vars from the ceph-mon role
We're doing this so we can validate this in the ceph-validate role

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen ffe05872ac validate: only validate cephfs_pools on mon nodes
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 760a1afc21 validate: only validate osd config options on osd hosts
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 4325ccc857 validate: only check mon and rgw config if the node is in those groups
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen b2b905f47e site.yml: remove the testing task that fails the playbook run
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 48c2a4fda8 validate: check rados config options
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 377fe81c10 validate: make sure ceph_stable_release is set to the correct value
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen ba7f09c0a7 ceph-validate: move var checks from ceph-common into this role
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 32bac6b491 ceph-validate: move var checks from ceph-osd into this role
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 29a9dffc83 ceph-validate: move ceph-mon config checks into this role
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen d87a32347f adds a new ceph-validate role
This will be used to validate config given to ceph-ansible.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 692ab26734 validate: validate osd_scenarios
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 4baa8389e0 validate: check monitor options
validates monitor_address, monitor_address_block and monitor_interface

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 4008d700a4 site.yml: move validate task to it's own play
This needs to be in it's own play with ceph-defaults included
so that I can validate things that might be defaulted in that
role.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 9f68dad2ff validate: first pass at validating the install options
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Alfredo Deza 0ace2e9534 site: add validation task
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-05-18 17:58:24 +02:00
Alfredo Deza 86a32071e8 rpm: add python-notario as a dependency for validation
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-05-18 17:58:24 +02:00
Alfredo Deza e33608ec16 library: add a placeholder module for the validate action plugin
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-05-18 17:58:24 +02:00
Alfredo Deza 36dc7c7862 plugins create an action plugin for validation using notario
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-05-18 17:58:24 +02:00
Sébastien Han 2f43e9dab5 defaults: restart_osd_daemon unit spaces
Extra space in systemctl list-units can cause restart_osd_daemon.sh to
fail

It looks like if you have more services enabled in the node space
between "loaded" and "active" get more space as compared to one space
given in command the command[1].

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-18 17:53:47 +02:00
Michael Vollman ed050bf3f6 Do nothing when mgr module is in good state
Check whether a mgr module is supposed to be disabled before disabling
it and whether it is already enabled before enabling it.

Signed-off-by: Michael Vollman <michael.b.vollman@gmail.com>
2018-05-18 15:21:45 +02:00
Guillaume Abrioux 415dc0a29b take-over: fix bug when trying to override variable
A customer has been facing an issue when trying to override
`monitor_interface` in inventory host file.
In his use case, all nodes had the same interface for
`monitor_interface` name except one. Therefore, they tried to override
this variable for that node in the inventory host file but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of undefined variable.

Typical error:

```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```

Including variables like this `include_vars: group_vars/all.yml` prevent
us from overriding anything in inventory host file because it
overwrites everything you would have defined in inventory.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-18 10:10:08 +02:00
Ha Phan fa8e2e7522 Adding mgr_vms variable 2018-05-17 17:30:27 +02:00
Andy McCrae f45662e270 Fix template reference for ganesha.conf
We can simply reference the template name since it exists within the
role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH
or playbooks directory for the file.
2018-05-17 15:23:52 +02:00
Sébastien Han 49a4712485 switch: disable ceph-disk units
During the transition from jewel non-container to container old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed', this is not a problem although operators
might not like to see these units failing. That's why we remove them if
we find them.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-17 08:48:28 +02:00
Guillaume Abrioux a9247c4de7 purge_cluster: wipe all partitions
In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-17 08:37:17 +02:00
Guillaume Abrioux 9cad113e2f purge_cluster: fix bug when building device list
there is some leftover on devices when purging osds because of a invalid
device list construction.

typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
    "changed": true,
    "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
    "delta": "0:00:00.015188",
    "end": "2018-05-16 12:41:40.408597",
    "item": "/dev/sda sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:40.393409"
}

STDOUT:

sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600

STDERR:

Error: Could not stat device /dev/sda sda1 - No such file or directory.
```

the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output

eg:

```
changed: [osd3] => (item=/dev/sda1) => {
    "changed": true,
    "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
    "delta": "0:00:00.013634",
    "end": "2018-05-16 12:41:09.068166",
    "item": "/dev/sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:09.054532"
}

STDOUT:

/dev/sda sda1
```

For instance, it will result with a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-17 08:37:17 +02:00