If OSDs don't restart normally we now also dump info of the crush map,
crush rules, crush tree and pools.
If the monitors don't restart normally we also print the socket status
by calling mon_status and quorum_status.
Signed-off-by: Sébastien Han <seb@redhat.com>
Having callback_plugins, and action plugins in random locations causes
a lot of disparity.
We should centralize this into one place in the plugins directory and
fix up the ansible.cfg to reflect this.
Additionally, since the ansible.cfg already reflects action_plugins, we
don't need a link to action_plugins in the base of the repository.
The `pools` dict defined in `roles/ceph-client/defaults/main.yml`
shouldn't have `{{ ceph_conf_overrides.global.osd_pool_default_pg_num
}}` as default value for `pgs` keys.
For instance, if you want some pools to be created but without explicitely
specifying the pgs for these pools (it means you want to use the
`osd_pool_default_pg_num`), you will be obliged to define
`{{ ceph_conf_overrides.global.osd_pool_default_pg_num }}` anyway while you
wanted to use the current default value already defined in the cluster which is
retrieved early in the playbook and stored in the
`{{ osd_pool_default_pg_num }}` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Running the last portion (insert new default and add new default crush
tasks) of crush_rules.yml only on the last monitor is
wrong since ceph CLI calls usually end up on the master having the
quorum, which is by default the one with the lower IP.
So if we run the command and end up on another mon the creation will
happen on the default crush rule because the particular mon hasn't been
updated.
To fix this we remove the |last on the include and use run_once: true on
certain tasks, then we let the final two tasks run on all the monitors.
Signed-off-by: Sébastien Han <seb@redhat.com>
On releases after jewel the option
'osd_pool_default_crush_replicated_ruleset' does not exist anymore, it's
called osd_pool_default_crush_rule.
Signed-off-by: Sébastien Han <seb@redhat.com>
This was causing a lot of pain with the handlers. Also the
implementation was not ideal since we were assembling files. Everything
can now be done with the ceph_crush module so let's remove that.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now run tests on the newly created ceph_crush module. Now the CI will
create a specific hierarchy for the OSD.
Signed-off-by: Sébastien Han <seb@redhat.com>
Instead of creating the CRUSH hierarchy with Ansible tasks using the
command module we now rely on the ceph_crush module.
Signed-off-by: Sébastien Han <seb@redhat.com>
This module allows us to create Ceph CRUSH hierarchy. The module works
with
hostvars from individual OSD hosts.
Here is an example of the expected configuration in the inventory file:
[osds]
ceph-osd-01 osd_crush_location="{ 'root': 'mon-roottt', 'rack':
'mon-rackkkk', 'pod': 'monpod', 'host': 'localhost' }" # valid case
Then, if create_crush_tree is enabled the module will create the
appropriate CRUSH buckets and their types in Ceph.
Some pre-requesites:
* a 'host' bucket must be defined
* at least two buckets must be defined (this includes the 'host')
Signed-off-by: Sébastien Han <seb@redhat.com>
One could want to add new crush rules while keeping his current default rule.
Fixed it so that it works with all rules defined as "default: false". If multiple rules are defined as default (should not be) then the last rule listed in "crush_rules" is taken as default.
This was taken from the openshift ansible repository here:
https://github.com/leseb/openshift-ansible/tree/master/roles/installer_checkpoint
Rationale:
A complete OpenShift cluster installation is comprised of many different
components which can take 30 minutes to several hours to complete. If
the installation should fail, it could be confusing to understand at
which component the failure occurred. Additionally, it may be desired to
re-run only the component which failed instead of starting over from the
beginning. Components which came after the failed component would also
need to be run individually.
Ceph has a similar situation so we can benefit from that
callback_plugin.
Signed-off-by: Sébastien Han <seb@redhat.com>
As part of fcba2c801a these vars were
removed and no longer do anything:
radosgw_dns_name
radosgw_resolve_cname
This patch removes them from the group_vars files and defaults/main.yml
If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.
Signed-off-by: Sébastien Han <seb@redhat.com>
In case the admin wasn't copied over to the node this command would
fail. So it's safer to run it from a monitor directly.
Signed-off-by: Sébastien Han <seb@redhat.com>
The jobs launches by the CI are not using 'ansible.cfg'.
There are some parameters that should avoid SSH failure that we are used
to see in the CI so far.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The nfs-ganesha package has been fixed as part of this commit:
963b6681df
Once the package is rebuilt this should be good to merge.
This reverts commit e88af3c4cb.
Previously it was necessary to provide a value (eventually an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
The ceph-ansible upstream CI runs severals tests, including a
'idempotency/handlers' test. It means the playbook is run a first time
and then a second time with an other container image version to ensure the
handlers run properly and the containers are well restarted.
This can cause issues.
For instance, in that specific case which drove me to submit this commit,
I've hit the case where `latest` image ships ceph 12.2.3 while the `stable-3.0`
(which is the image used for the second run) ships ceph 12.2.2.
The goal of this test is not to verify we can upgrade from a specific
version to another but to ensure handlers are working even if it's a valid
failure here.
It should be caught by a test dedicated to that usecase.
We just need to have a container image which has a different id for
the upstream CI, we need the same content in container imagebut a different
image id in the registry since the test relies on image id to decide whether
the container should be restarted.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Using updatedb -e doesnt make a permanent change, but will updatedb
without the passed path.
To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.
The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
According to hostname configuration, the task waiting for mons to be in
quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common
Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.
For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
coniguration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first hosts.
To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.
Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.
Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
When used along with delegate, run_once does not belong well. Thus,
using | last always brings the desired result.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now look for any excisting containers, if any we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
Since we have a task to test the handlers we can test a new container to
validate the service restart on a new container image.
Signed-off-by: Sébastien Han <seb@redhat.com>
Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.
The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs
Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.
We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
Now by running the playbook like this:
ansible-playbook site.yml --tags='ceph_update_config'
You can only generate a ceph configuration file on the nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1543434
Signed-off-by: Sébastien Han <seb@redhat.com>
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit fixes a bug that occurs especially for dmcrypt scenarios.
There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.
If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :
```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```
It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because he won't have
filled the `$DOCKER_ENV` environment variable properly.
Adding `--net=host` to the 'disk_list' container fixes this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
While hostname -f will always return an hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.