During the first iteration, the command won't return anything, or it can
simply fail and might not return a valid JSON structure. Ansible will
then fail to parse it in the `from_json` filter, so let's default that variable
to an empty dictionary.
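A minimal sketch of that kind of guard (the `ceph_status` register name is illustrative, not necessarily the one used in the role):
```
- name: set_fact ceph_current_status
  set_fact:
    ceph_current_status: "{{ (ceph_status.stdout | from_json) if ceph_status.get('rc', 1) == 0 and ceph_status.get('stdout', '') | length > 0 else {} }}"
```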
Signed-off-by: Sébastien Han <seb@redhat.com>
This removes a bit of unnecessary code: the check was always wrong
because of the condition 'not ceph_current_status.get('rc', 1) == 0'.
It will never match since `not` applies to a boolean while we are
checking an rc.
Also, even if the check worked, it would be a major blocker during a
complete meltdown. If the whole platform is shut down then nothing will
be up, but files will be present, so this check is definitely wrong.
Signed-off-by: Sébastien Han <seb@redhat.com>
This is false: `./defaults/main.yml` is not supposed to be modified
directly. group_vars and/or host_vars should always be preferred.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This will fix the following yamllint warning:
Variables should have spaces after {{ and before }}
Signed-off-by: Christian Berendt <berendt@betacloud-solutions.de>
change default value of `radosgw_address` to keep consistency with
`monitor_address`.
Moreover, `ceph-validate` checks if the value is '0.0.0.0' to determine
if it has to run `check_eth_rgw.yml`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600227
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to play these tasks on nodes that are not in the rgw
group.
Always playing this code makes `shrink_mon.yml` fail.
Typical error:
```
TASK [ceph-defaults : set_fact _radosgw_address to radosgw_interface - ipv4] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-shrink_mon/roles/ceph-defaults/tasks/set_radosgw_address.yml:21
Thursday 22 November 2018 12:34:51 +0000 (0:00:00.154) 0:00:12.371 *****
fatal: [localhost]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_eth1'
```
Indeed, `radosgw_interface` is the network interface on rgw nodes only. It is
expected that this same interface doesn't exist on `localhost`, so when
running `shrink_mon.yml` the role `ceph-defaults` is called with
`hosts: localhost` and causes the playbook to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It seems Atomic 7.5 already has podman, however it is an old version
(0.4). The podman integration is targeting RHEL 8, so Fedora is
currently the closest to that.
Signed-off-by: Sébastien Han <seb@redhat.com>
During their initialisation both rbd-target-api and rbd-target-gw try to
open /dev/log for their syslog handler. If the device is not present the
service fails to start, so exposing /dev/log from the host in the
container solves that problem.
Signed-off-by: Sébastien Han <seb@redhat.com>
Since 84fcf4639140c390a7f1fcd790ba190503713f86 we now use the container
binary cli to create ceph keys instead of creating a container and
'docker execing' into it.
Signed-off-by: Sébastien Han <seb@redhat.com>
In order to be able to retrieve udev information, we must expose its
socket. As per https://github.com/ceph/ceph/pull/25201, ceph-volume will
start consuming udev output.
Signed-off-by: Sébastien Han <seb@redhat.com>
This is to add a level of granularity.
We can have ceph-specific variables here that users shouldn't have to
change.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add real default value for osd pool size customization.
Ceph itself has an `osd_pool_default_size` default value of `3`.
If users don't specify a pool size in the various pool definitions within
ceph-ansible, we should default to `3`.
By the way, this kind of condition isn't really clear:
```
when:
- rbd_pool_size | default ("")
```
we should try to get the customized value then default to what is in
`osd_pool_default_size` (which has its default value pointing to
`ceph_osd_pool_default_size` (`3`) as well) and compare it to
`ceph_osd_pool_default_size`.
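Roughly what the intended condition looks like (a sketch; the task and variable layout in the role may differ):
```
- name: customize rbd pool size
  command: "ceph --cluster {{ cluster | default('ceph') }} osd pool set rbd size {{ rbd_pool_size | default(osd_pool_default_size) }}"
  changed_when: false
  when: rbd_pool_size | default(osd_pool_default_size) | int != ceph_osd_pool_default_size | int
```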
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
`osd_pool_default_pg_num` parameter is set in `ceph-mon`.
When using ceph-ansible with `--limit` on a specific group of nodes, it
will fail when trying to access this variable since it isn't
defined.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1518696
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
ceph.conf doesn't accept float values.
Typical error seen:
```
$ sudo ceph daemon osd.2 config get osd_memory_target
Can't get admin socket path: unable to get conf option admin_socket for osd.2:
parse error setting 'osd_memory_target' to '7823740108,8' (strict_si_cast:
unit prefix not recognized)
```
This commit ensures the value inserted in ceph.conf will be an integer.
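For illustration, the cast with the `int` filter looks roughly like this (0.2 is just the hyperconverged safety factor used as an example; the real expression in the role may differ):
```
- name: set_fact osd_memory_target (force an integer)
  set_fact:
    osd_memory_target: "{{ (ansible_memtotal_mb * 1048576 * 0.2) | int }}"
```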
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It is safer to use the list filter than the keys() method since the keys
method does have some interoperability issues between python2 and
python3 based ansible/jinja.
Signed-off-by: Boris Ranto <branto@redhat.com>
If you use python3 based ansible then keys() returns a dict_keys object,
not a list of keys. This breaks the installation on such a system. Using
the list filter provides a more robust solution that should work on both
python2 and python3 based ansible. You can find some more information
about the issue, here:
https://github.com/ansible/ansible/issues/19514
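A quick illustration of the difference (`my_dict` is just a placeholder):
```
# python3-based ansible: keys() returns a dict_keys object, not a list
- debug:
    msg: "{{ my_dict.keys() }}"

# works on both python2- and python3-based ansible
- debug:
    msg: "{{ my_dict.keys() | list }}"
```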
Signed-off-by: Boris Ranto <branto@redhat.com>
* The default value of osd_memory_target used by ceph is 4294967296 bytes,
so use the same as ceph-ansible default.
* Convert ansible_memtotal_mb to bytes to calculate osd_memory_target
Signed-off-by: Neha Ojha <nojha@redhat.com>
This error was introduced in the recent refactor of ceph-docker-common
in https://github.com/ceph/ceph-ansible/pull/3251. However, the Ansible
galaxy linter is not happy about it and fails importing the role.
Removing this since it's not used anymore.
Signed-off-by: Sébastien Han <seb@redhat.com>
if the firewalld.service systemd unit is masked, the handler will fail when
trying to restart it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650281
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
since the `ceph-volume` introduction, there is no need to split those tasks.
Let's refactor this part of the code so it's clearer.
By the way, this was breaking rolling_update.yml when `openstack_config:
true` because nothing ensured OSDs were started in the ceph-osd role (in
`openstack_config.yml` there is a check ensuring all OSDs are UP, which was
obviously failing) and resulted in OSDs on the last OSD node not being started
anyway.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Those tasks aren't needed in docker-common since the introduction of
`ceph-infra` role. They are duplicated tasks.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
this is already done in ceph-defaults, there is no need to have this
check in ceph-docker-common.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
this fact is already set in ceph-defaults, there is no need to set it
again in ceph-docker-common
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Instead of looping over a list of packages or repeating the task
separately for different packages, pass the list of packages to the
task performing package management.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
This is needed for Nautilus since the ceph-create-keys script goes away.
(https://github.com/ceph/ceph/pull/21305)
Now, if the module is called with 'state: fetch_initial_keys', it will look up
keys generated by the monitor and write them down on the filesystem to
the right location (/etc/ceph and /var/lib/ceph/bootstrap*).
This is not applicable to containers since keys are generated by the
container only.
Signed-off-by: Sébastien Han <seb@redhat.com>
The firewall setup for igw is not getting set up because iscsi_group_name
does not exist. It should be iscsi_gw_group_name.
Signed-off-by: Mike Christie <mchristi@redhat.com>
The default igw api port is 5000 in the manual setup docs and
ceph-iscsi-config package so this syncs up ansible.
Signed-off-by: Mike Christie <mchristi@redhat.com>
This is needed for Nautilus since the ceph-create-keys script goes away.
(https://github.com/ceph/ceph/pull/21305)
Now, if the module is called with 'state: fetch_initial_keys', it will look up
keys generated by the monitor and write them down on the filesystem to
the right location (/etc/ceph and /var/lib/ceph/bootstrap*).
This is not applicable to containers since keys are generated by the
container only.
Signed-off-by: Sébastien Han <seb@redhat.com>
Use `when: var` rather than `when: var != ""` (or conversely `when: not var` rather than `when: var == ""`).
Signed-off-by: Sébastien Han <seb@redhat.com>
The use of a handler meant that the cache would be updated at the very
end of the play, which doesn't work when adding a development repo and
trying to install right after it. This mostly reverts
53cdddf886 without an actual `git revert`
because that caused other conflicts.
Signed-off-by: Alfredo Deza <adeza@redhat.com>
Update the meta with the relevant support such as:
* ansible version: min 2.4
* distro supported (tested on) centos 7
Signed-off-by: Sébastien Han <seb@redhat.com>
Do not run the linter for these 3:
* we use latest for pip docker-py package
* for ssl keys this is a false positive: since the initial command is a
'shell', it'll always change
* for keystone, we must use shell since the with_items contains pipes
Signed-off-by: Sébastien Han <seb@redhat.com>
Calling command should have changed_when: false, otherwise each time it
runs it will show as 'changed', which is irrelevant.
Commands should not change things if nothing needs doing.
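For example, a read-only call to the command module would carry the flag like this:
```
- name: check ceph health
  command: "ceph --cluster {{ cluster | default('ceph') }} health"
  register: ceph_health
  changed_when: false
```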
Signed-off-by: Sébastien Han <seb@redhat.com>
since the jinja logic has been moved into an ansible task, we can simplify
this part of the code and use `_current_monitor_address`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
So we don't have to loop over `_monitor_addresses` when we need the
monitor address of the current node being played.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
using consecutive set_fact in the playbook instead of complex jinja syntax
makes ceph.conf.j2 more readable.
By the way, jinja can be painful to debug at some point.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The latest ansible version at the moment is 2.7.
We should explicitly require 2.7 on the master branch only.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Let's test ceph-ansible master against ansible 2.7 to catch early any
potential issue with this ansible version.
Closes: #3148
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
run commands in containers when deployments are containerized.
(At the moment, all commands are run on the host only.)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
since `rgw_multisite_endpoint_addr` has a default value of
`{{ ansible_fqdn }}`, it shouldn't be mandatory to set this variable.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
- updated README-MULTISITE
- re-added destroy.yml
- added tasks in ceph-validate to make sure the
rgw multisite vars are set
Signed-off-by: Ali Maredia <amaredia@redhat.com>
We should give users the possibility to set the IP they want as
multisite endpoint, setting the default value to `{{ ansible_fqdn }}` to
not force them to set this variable.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
- remove destroy tasks
- cleanup conditionals and syntax
- remove unnecessary realm pulls
- enable multisite to be tested in automated
testing infra
- add multisite related vars to main.yml and
group_vars
- update README-MULTISITE
- ensure all `radosgw-admin` commands are being run
on a mon
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Since we do not have enough data to put valid upper bounds for the memory
usage of these daemons, do not put artificial limits by default. This will
help us avoid failures like OOM kills due to low default values.
Whenever required, these limits can be manually enforced by the user.
More details in
https://bugzilla.redhat.com/show_bug.cgi?id=1638148
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1638148
Signed-off-by: Neha Ojha <nojha@redhat.com>
we ensure that firewalld is installed and running before adding any
rule. It doesn't make sense anymore not to reload firewalld once the
rules are added.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The playbook has various improvements:
* run ceph-validate role before doing anything
* run ceph-fetch-keys only on the first monitor of the inventory list
* set noup flag so PGs get distributed once all the new OSDs have been
added to the cluster and unset it when they are up and running
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit simplifies the usage of the ceph-fetch-keys role. The role
now has a nicer way to find the various ceph keys and fetch them onto the
ansible server.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
Currently a throw-away container is built to run ceph client
commands to setup users, pools & auth keys. This utilises
the same base ceph container which has all the ceph services
inside it.
This PR allows the use of a separate container if the deployer
wishes - but defaults to use the same full ceph container.
This can be used for different architectures or distributions,
which may support the Ceph client, but not the Ceph server,
and allows the deployer to build and specify a separate client
container if need be.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
a non-skipped task won't have the `skipped` attribute, so the `start
firewalld` task will complain about that.
Indeed, the `skipped` and `rc` attributes won't exist since the first task,
`check firewalld installation on redhat or suse`, won't be skipped in
case of a non-containerized deployment.
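One way to make the condition robust against the missing attribute is to fall back with `get` (a sketch, with an illustrative register name):
```
- name: start firewalld
  service:
    name: firewalld
    state: started
    enabled: yes
  when:
    - not firewalld_pkg_query.get('skipped', false)
    - firewalld_pkg_query.get('rc', 1) == 0
```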
Fixes: #3236
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1541840
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Liberty is no longer available in the UCA. The last available release there
is currently Queens.
Signed-off-by: Christian Berendt <berendt@betacloud-solutions.de>
`ceph_osd_container_stat` might not be set on other osd node.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
As of now, we should no longer support Jewel in ceph-ansible.
The latest ceph-ansible release supporting Jewel is `stable-3.1`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit does a couple of things:
* Avoid code duplication
* Clarify the code
* add more unit tests
* add myself to the author of the module
Signed-off-by: Sébastien Han <seb@redhat.com>
This task was created for ceph-disk based deployments so it's not needed
when OSDs are prepared with ceph-volume.
Signed-off-by: Sébastien Han <seb@redhat.com>
The restart script wasn't working with the current new addition of
ceph-volume in container where now OSDs have the OSD id name in the
container name.
Signed-off-by: Sébastien Han <seb@redhat.com>
Now that the container is named ceph-osd@<id> looking for something that
contains a host is not necessary. This is also backward compatible as it
will continue to match container names with hostname in them.
Signed-off-by: Sébastien Han <seb@redhat.com>
We don't need to pass the device and discover the OSD ID. We have a
task that gathers all the OSD ID present on that machine, so we simply
re-use them and activate them. This also handles the situation when you
have multiple OSDs running on the same device.
Signed-off-by: Sébastien Han <seb@redhat.com>
We don't need to pass the hostname on the container name but we can keep
it simple and just call it ceph-osd-$id.
Signed-off-by: Sébastien Han <seb@redhat.com>
expose_partitions is only needed on ceph-disk OSDs so we don't need to
activate this code when running lvm prepared OSDs.
Signed-off-by: Sébastien Han <seb@redhat.com>
The batch option got added recently; while rebasing this patch it was
necessary to implement it. So now, the batch option can work on
containerized environments.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630977
Signed-off-by: Sébastien Han <seb@redhat.com>
At the moment, all daemons accept connections from 0.0.0.0.
We should at least restrict to public_network and add
cluster_network for OSDs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1541840
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Fixes the deprecation warning:
[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
using `result|search` use `result is search`.
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
These checks will never pass unless ceph_stable_release is passed and
ceph-defaults is run before ceph-validate. Additionally, we don't want
to support deploying jewel upstream at ceph-ansible master.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1637537
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
The check_firewall code isn't working as expected and might break deployments.
This part of the code will be reworked soon.
Let's focus on configure_firewall code for now.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1541840
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Instead used "import_tasks" and "include_tasks" to tell whether tasks
must be included statically or dynamically.
Fixes: https://github.com/ceph/ceph-ansible/issues/2998
Signed-off-by: Rishabh Dave <ridave@redhat.com>
`monitor_address_block` should be read from hostvars[host] instead of
current node being played.
eg:
Let's assume we have:
```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```
the ceph.conf generation task will end up with:
```
fatal: [ceph-mon0]: FAILED! => {}
MSG:
'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```
the reason is that it will assume `monitor_address_block` isn't defined even on
ceph-mon2, because it looks for `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`; therefore it falls through to the default branch of the condition:
```
{%- else -%}
{% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
{% if ip_version == 'ipv4' -%}
{{ hostvars[host][interface][ip_version]['address'] }}
{%- elif ip_version == 'ipv6' -%}
[{{ hostvars[host][interface][ip_version][0]['address'] }}]
{%- endif %}
{%- endif %}
```
`monitor_interface` is set with the default value `'interface'`, so the `interface`
variable is built with 'ansible_' + 'interface'. This makes ansible throw a
confusing message about `'ansible_interface'`.
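A hedged sketch of the per-host lookup the template needs (the real ceph.conf.j2 handles more cases, e.g. ipv6 and the interface fallback; the `ipaddr` filter requires netaddr):
```
{%- if hostvars[host]['monitor_address_block'] is defined -%}
  {# read the value from the host being looked at, not from the node currently being played #}
  {{ hostvars[host]['ansible_all_ipv4_addresses'] | ipaddr(hostvars[host]['monitor_address_block']) | first }}
{%- elif hostvars[host]['monitor_address'] is defined -%}
  {{ hostvars[host]['monitor_address'] }}
{%- endif -%}
```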
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Allow the user to choose between timesyncd, chronyd and ntpd.
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added a note indicating that the NTP daemon type is not used for containerized
deployments.
Fixes issue #3086 on Github
Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
The linux kernel target layer, LIO, does not support an iscsi target
mixing ACLs that have chap enabled and disabled under the same tpg. This
patch adds a check and fails if this type of setup is detected.
This fixes Red Hat BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1615088
Signed-off-by: Mike Christie <mchristi@redhat.com>
The role contains all the handlers for Ceph services. We decided to
leave ceph-defaults role with variables and a few facts only. This is
useful when organizing the site.yml files and also adding the known
variables to infrastructure-playbooks.
Signed-off-by: Sébastien Han <seb@redhat.com>
As per #1013 it appears that BlueStore will soon use THP to lower TLB misses;
also, disabling THP hasn't demonstrated any gains so far.
Closes: https://github.com/ceph/ceph-ansible/issues/1013
Signed-off-by: Sébastien Han <seb@redhat.com>
`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.
On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".
Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
(could this be backported to at least stable-3.0 please?)
Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the cluster's pgs to be happy again before continuing.
Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
...with the exception of the purge operation, since
removing Calamari would still be useful for an old
cluster.
Signed-off-by: John Spray <john.spray@redhat.com>
For now our best guess is to count the number of devices and multiply
by osds_per_device. Ideally we'd like to run ceph-volume lvm batch
--report and get the number of OSDs that way, but currently we need
a ceph.conf in place already before we can do that. There is a tracker
ticket that would allow us to get around the need for a ceph.conf:
http://tracker.ceph.com/issues/36088
Fixes: https://github.com/ceph/ceph-ansible/issues/3135
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
the default value for _rgw_hostname was taken from the current node being
played while it should be taken from the respective node in the loop.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This avoids errors when the chosen osd scenario does not require
setting devices or lvm_volumes. The default values for these are not
set because they exist in the ceph-osd role, not ceph-defaults.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
BlueStore's cache is sized conservatively by default, so that it does
not overwhelm under-provisioned servers. The default is 1G for HDD, and
3G for SSD.
To replace the page cache, as much memory as possible should be given to
BlueStore. This is required for good performance. Since ceph-ansible
knows how much memory a host has, it can set
`bluestore cache size = max(total host memory / num OSDs on this host * safety
factor, 1G)`
Due to fragmentation and other memory use not included in bluestore's
cache, a safety factor of 0.5 for dedicated nodes and 0.2 for
hyperconverged nodes is recommended.
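Expressed as a task, the sizing rule above looks roughly like this (`num_osds` and `safety_factor` are illustrative variable names, and the fact name may differ in the role):
```
# max(total host memory / num OSDs on this host * safety_factor, 1G), everything in bytes
- name: set_fact bluestore cache size
  set_fact:
    bluestore_cache_size: "{{ [ (ansible_memtotal_mb * 1048576 / num_osds * safety_factor) | int, 1073741824 ] | max }}"
```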
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1595003
Signed-off-by: Neha Ojha <nojha@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
The commit:
commit 1164cdc002
Author: Guillaume Abrioux <gabrioux@redhat.com>
Date: Thu Aug 2 11:58:47 2018 +0200
iscsigw: install ceph-iscsi-cli package
installs the cli package but does not start and enable the
rbd-target-api daemon needed for gwcli to communicate with the igw
nodes. This patch just enables and starts it for the non-container
setup. The container setup is already doing this.
This fixes bz https://bugzilla.redhat.com/show_bug.cgi?id=1613963
Signed-off-by: Mike Christie <mchristi@redhat.com>
As of rhel 7.6, it has been decided it doesn't make sense to confine
`ganesha_t` anymore. It means this domain won't exist anymore.
Let's add a `failed_when: false` in order to prevent the deployment from
failing when trying to run this command.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If this is set to anything other than the default value of 1 then the
--osds-per-device flag will be used by the batch command to define how
many osds will be created per device.
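Roughly how the flag ends up on the ceph-volume command line (a sketch, not the exact task or module code):
```
- name: batch create OSDs with ceph-volume
  command: >
    ceph-volume lvm batch --yes
    {{ '--osds-per-device ' ~ osds_per_device if osds_per_device | default(1) | int > 1 else '' }}
    {{ devices | join(' ') }}
```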
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This command line is not supported.
According to official documentation:
```
Note that shell command lines are not directly supported.
If shell command lines are to be used,
they need to be passed explicitly to a shell implementation of some kind.
```
We must run this using /bin/sh instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
let's add ansible_hostname as a default value for rgw_hostname if no
hostname in servicemap matches ansible_fqdn.
Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The reverted commit added quotes that make the keyring unusable
eg:
```
client.john
key: AQAN0RdbAAAAABAAH5D3WgMN9Rxw3M8jkpMIfg==
caps: [mds] ''
caps: [mgr] 'allow *'
caps: [mon] 'allow rw'
caps: [osd] 'allow rw'
```
Trying to import such a keyring and use it will result in:
```
Error EACCES: access denied
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623417
This reverts commit 424815501a.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When ceph-nfs is deployed containerized and ceph-common is not
installed on the host the start_nfs task fails because the rados
command is missing on the host.
Run rados commands from a ceph container instead so that
they will succeed.
Signed-off-by: Tom Barron <tpb@dyncloud.net>
If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.
Signed-off-by: Markos Chandras <mchandras@suse.de>
The dummy client container currently won't work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.
This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.
Currently ppc64le is not supported for Ceph server components.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
A couple of things were wrong in the initial commit:
* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
  never work since the ceph_release fact is set later by the roles that
  follow (either ceph-common or ceph-docker-common sets it)
* we can easily re-use the initial command to check if a cluster is
running, it's more elegant than running it twice.
* set the fact rgw_hostname on rgw nodes only
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
The config_template plugin exists in the ceph-common role so that
config_template will still work with ansible galaxy.
This PR syncs the config_template module from the base of the repo in
plugins/actions to the ceph-common role.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
If there are no services on the cluster, then the 'rgw' attribute could be missing
and the task fails with the following problem:
msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'
We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.
Signed-off-by: Markos Chandras <mchandras@suse.de>
Since commit f422efb1d6 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployment with OpenStack-Ansible
fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"
This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.
Since we are looking to obtain cluster information, the task should be
delegated to a monitor node similar to other tasks in that role
Signed-off-by: Markos Chandras <mchandras@suse.de>
Follow up on 36942af698
"disabled_modules" is always a list, it's the items in the list that
can be dicts in mimic. Many ways to fix this, here's one.
Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
This reverts commit e84f11e99e.
This commit was giving a new failure later during the rolling_update
process. Basically, this was modifying the list of devices and started
impacting the ceph-osd role itself. The modification to accommodate the
osd_auto_discovery parameter should happen outside of the ceph-osd role.
Also we are trying to not play ceph-osd role during the rolling_update
process so we can speed up the upgrade.
Signed-off-by: Sébastien Han <seb@redhat.com>
fqdn configuration possibility caused a lot of trouble, it's adding a
lot of complexity because of multiple cases and the relation between
ceph-ansible and ceph-container. Moreover, there is no benefit for such
a feature.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613155
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the ceph.conf.j2 always assumes the hostname used to register the
radosgw in the servicemap is equivalent to `{{ ansible_hostname }}`
which returns the shortname form.
We need to detect which form of the hostname was used in case of already
deployed cluster and update the ceph.conf accordingly.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1580408
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
there is no need to have all these conditions.
for instance, assuming `mds_group_name` is set to 'mdss':
- `if groups[mds_group_name] is defined` checks if `'mdss'` is present in `{{ groups }}`
- `if {{ mds_group_name }} in group_names` checks if the current node is part of
the group `'mdss'`
- `if inventory_hostname in groups.get(mds_group_name, [])` checks if
the current node is part of the group 'mdss'
The third condition is enough to cover the need of ensuring we are
running on a mds node.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If calamari is already installed and ceph has been upgraded to a higher
version the initialisation will fail later. So if we detect the
calamari-server is too old compared to ceph_rhcs_version we try to update
it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1601755
Signed-off-by: Sébastien Han <seb@redhat.com>
rolling_update relies on the list of devices when performing the restart
of the OSDs. The task that is building the devices list out of the
ansible_devices dict only runs when there are no partitions on the
drives. However, during an upgrade the OSDs are already configured, they
have been prepared and have partitions, so this task won't run and thus
the devices list will be empty, skipping the restart during
rolling_update. We now run the same task under different requirements
when rolling_update is true and build a list when:
* osd_auto_discovery is true
* rolling_update is true
* ansible_devices exists
* no dm/lv are part of the discovery
* the device is not removable
* the device has more than 1 sector
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
This is used with the lvm osd scenario. When using devices you need the
option to set the crush device class for all of the OSDs that are
created from those devices.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This adds the action 'batch' to the ceph-volume module so that we can
run the new 'ceph-volume lvm batch' subcommand. A functional test is
also included.
If devices is defined and osd_scenario is lvm then the 'ceph-volume lvm
batch' command will be used to create the OSDs.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Since the container now simply reads the ceph.conf, we remove all the
unnecessary options.
Also this PR is the foundation to support multiple backends, such as the
new 'beast' from Ceph Mimic.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582411
Signed-off-by: Sébastien Han <seb@redhat.com>
The include does not need a condition on containerized_deployment since
we are already in an include that has the same condition.
Signed-off-by: Sébastien Han <seb@redhat.com>
In environments where we wish to have manual/greater control over
how the bootstrap keyrings are used, we need to able to externally
define what the mgr keyring secret will be and have ceph-ansible
use it, instead of it being autogenerated
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610213
Signed-off-by: Graeme Gillies <ggillies@akamai.com>
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do this the script loops over the list of ceph-osd@ services in the
system. This commit fixes a bug in the regular expression responsible for
extracting the OSD IDs - the prior version used the `[0-9]{1,2}` expression,
which ignores all OSDs whose numbers are greater than 99 (thus
longer than 2 digits). The fix removes the upper limit on the number of digits.
This problem existed in two places in the script.
Closes: #2964
Signed-off-by: Artur Fijalkowski <artur.fijalkowski@ing.com>
This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployment to be done with shortname, we
must keep backward compatibility with clusters already deployed with
fqdn configuration
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This was introduced by
59ee2e8d3b
and made our socket checks impossible to run. The PID could be found,
but the cctid could not.
This happens during upgrades to mimic and on clusters running mimic.
So let's force the admin socket back to the way it was so we can properly check
for existing instances. Also, the $cluster-$name.$pid.$cctid.asok naming
is only needed when running multiple instances of the same daemon,
something ceph-ansible cannot do at the time of writing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1610220
Signed-off-by: Sébastien Han <seb@redhat.com>
Instead of failing the entire purge operation when the rbd command fails
just log an error. This will allow the higher level target and config
cleanup to complete, and the user only has to manually delete the rbd
images.
Signed-off-by: Mike Christie <mchristi@redhat.com>
We were not passing in the ceph conf info into the rbd image removal
command, so if the clustername was not the default, igw purge would fail
due to the rbd rm command failing.
This just fixes the bug by passing in the ceph conf info which has the
clustername to use.
This fixes Red Hat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1601949
Signed-off-by: Mike Christie <mchristi@redhat.com>
The container runs with --rm which means it will be deleted by Docker
when exiting. Also 'docker rm -f' is not idempotent and returns 1 if the
container does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1609007
Signed-off-by: Sébastien Han <seb@redhat.com>
rbd-mirror can't start when deploying jewel because it needs the admin
keyring.
Bringing back this task restores backward compatibility for jewel
deployments.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we want to be backward compatible with releases prior to luminous, we
have to set the rule name according to the default values used in jewel.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This task is run on both containerized *and* non-containerized
deployments.
Let's have a proper title to avoid confusion.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In containerized deployments we now inherit from the
radosgw_civetweb_options options when bootstrapping the container.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582411
Signed-off-by: Sébastien Han <seb@redhat.com>
When distributing ceph-nfs role, creation of rados index object
fails as it assumes availability of client.admin locally.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1607970
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
Check that the interface provided:
* exists in the gathered facts (thus on the system)
* is active
* has an IP address (depending on ip_version)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600227
Signed-off-by: Sébastien Han <seb@redhat.com>
Ansible 2.4 is currently end-of-life.
Ansible 2.5 will go end-of-life after Ansible 2.7 is released.
Fixes: #2901
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Do not run device validation on every hosts, only on OSD nodes.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
We now make sure that:
* devices are actually block special files
* the length of dedicated_device is identical to that of devices
Signed-off-by: Sébastien Han <seb@redhat.com>
Since `V2.6-stable` is available and has packages for `mimic`, let's
update this default value accordingly so nfs nodes can be deployed with
mimic.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Relying on `copy_admin_key` to import created keys on client nodes
obliges us to copy the admin key to those nodes, which is not something we
might want.
We should use the fact `condition_copy_admin_key`, which will be set to
`True` when the delegated node is a mon, which means we can import keys
without taking care of the admin keyring.
Fixes: #2867
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Follow up on #2784
We must check in the generated fact `_disabled_ceph_mgr_modules` to
enable disabled mgr module.
Otherwise, this task will be skipped because it's not comparing the
right list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600155
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
On containerized deployment, if a mon is stopped, the socket is not
purged and can cause failure when a cluster is redeployed after the
purge playbook has been run.
Typical error:
```
fatal: [osd0]: FAILED! => {}
MSG:
'dict object' has no attribute 'osd_pool_default_pg_num'
```
the fact is not set because of this previous failure earlier:
```
ok: [mon0] => {
"changed": false,
"cmd": "docker exec ceph-mon-mon0 ceph --cluster test daemon mon.mon0 config get osd_pool_default_pg_num",
"delta": "0:00:00.217382",
"end": "2018-07-09 22:25:53.155969",
"failed_when_result": false,
"rc": 22,
"start": "2018-07-09 22:25:52.938587"
}
STDERR:
admin_socket: exception getting command descriptions: [Errno 111] Connection refused
MSG:
non-zero return code
```
This failure happens when the ceph-mon service is stopped, indeed, since
the socket isn't purged, it's a leftover which is confusing the process.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When you delete a zone without removing it from the zonegroup, the period update will
fail since that command needs to load the zone and zonegroup to be able to
update the master. The period update fails with an error like this:
radosgw-admin period update --commit
-1 Cannot find zone id= (name=), switching to local zonegroup configuration
-1 Cannot find zone id= (name=)
Signed-off-by: Shilpa Jagannath <smanjara@redhat.com>
As of Kraken, the journal code does not use the hdparm command anymore
so we can remove it from our package dependency list.
Fixes: https://github.com/ceph/ceph-ansible/issues/1402
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit f6910efa24389c264062963b2054c7cd29ffebb3)
The container image recently merged both cluster and mon log into a
single stream. Following this, we now see this warning coming from the
container image:
2018-06-19 13:44:01.542990 7ff75b024700 1 mon.vm02@1(peon).log
v57928205 unable to write to '/var/log/ceph/ceph.log' for channel
'cluster': (2) No such file or directory
So we now tell the mon to not log cluster log on the filesystem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591771
Signed-off-by: Sébastien Han <seb@redhat.com>
We forgot to add mgr_group_name when checking for the mon repo, thus the
conditional on the next task was failing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1598185
Signed-off-by: Sébastien Han <seb@redhat.com>
The data structure has slightly changed on mimic.
Prior to mimic, it used to be:
```
{
"enabled_modules": [
"status"
],
"disabled_modules": [
"balancer",
"dashboard",
"influx",
"localpool",
"prometheus",
"restful",
"selftest",
"zabbix"
]
}
```
From mimic it looks like this:
```
{
"enabled_modules": [
"status"
],
"disabled_modules": [
{
"name": "balancer",
"can_run": true,
"error_string": ""
},
{
"name": "dashboard",
"can_run": true,
"error_string": ""
}
]
}
```
This means we can't simply check if `item` is in
`_ceph_mgr_modules.disabled_modules`.
The idea here is to use the `map(attribute='name')` filter to build a list
when deploying mimic.
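A sketch of building that list (fact names are borrowed from the surrounding messages; the release check is shown for illustration):
```
- name: set_fact _disabled_ceph_mgr_modules
  set_fact:
    _disabled_ceph_mgr_modules: "{{ _ceph_mgr_modules.disabled_modules | map(attribute='name') | list
                                    if ceph_release_num[ceph_release] >= ceph_release_num['mimic']
                                    else _ceph_mgr_modules.disabled_modules }}"
```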
Fixes: #2766
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The container runs for 300 sec, then dies and removes itself thanks to
the '--rm' option, so there is no point in removing it. Also, this was
causing failures under some circumstances.
Closing: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
We now add a default 'rbd' application type to each pool we create. This
will remove the warning: " application not enabled on N pool(s) "
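The equivalent CLI call wrapped in a task would look roughly like this (the `pools` list variable is illustrative):
```
- name: assign rbd application to ceph pools
  command: "ceph --cluster {{ cluster | default('ceph') }} osd pool application enable {{ item.name }} rbd"
  with_items: "{{ pools }}"
  changed_when: false
```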
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1590275
Signed-off-by: Sébastien Han <seb@redhat.com>
The script ceph-osd-run.sh holds the config options to start the
container; if one of these options is modified we must restart the
container. This was not the case before because the 'notify' flag
wasn't present.
Closing: https://bugzilla.redhat.com/show_bug.cgi?id=1596061
Signed-off-by: Sébastien Han <seb@redhat.com>
When using a module there is no need to apply this Ansible option. The
module will handle the idempotency on its own. So the module decides
whether or not the task has changed during the execution.
Signed-off-by: Sébastien Han <seb@redhat.com>
Allow overriding the mode of keyring files in /etc/ceph. The default value is the same as it was (0600),
but this variable allows the user to override it (e.g. set it to 0640).
Signed-off-by: George Shuklin <george.shuklin@gmail.com>
During 226f80c22b only Debian package
installs had the correct state set to ensure packages were upgraded when
the "upgrade_ceph_packages" var was set to true.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
--net=host was hardcoded in the startup line so even though
mon_docker_net_host was set to False the net option would always be
activated.
mon_docker_net_host is set to True by default so this commit does not
change the behaviour.
Signed-off-by: Sébastien Han <seb@redhat.com>
Depending on your setup, ceph-mgr might get restarted multiple times.
When this is done too fast, systemd will prevent further restarts because of
the limits configured in the ceph-mgr systemd unit file.
Resetting the failure count will prevent this problem. The reset is done before
the restart so in case of a real problem during the restart it still fails.
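A minimal sketch of that ordering (the unit name is assumed to be `ceph-mgr@<hostname>`):
```
- name: reset failed count for ceph-mgr
  command: "systemctl reset-failed ceph-mgr@{{ ansible_hostname }}"
  changed_when: false

- name: restart ceph-mgr
  service:
    name: "ceph-mgr@{{ ansible_hostname }}"
    state: restarted
```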
Fixes: #2768
Signed-off-by: Christian Zunker <christian.zunker@codecentric.cloud>
Currently we expect firewalld to be enabled and running if
configure_firewall is set to True. Let's enforce that.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1589146
Signed-off-by: Sébastien Han <seb@redhat.com>
As discussed with the cores, the current limits are too low and should
be bumped to higher value.
So now by default monitors get 3GB and OSDs get 5GB.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1591876
Signed-off-by: Sébastien Han <seb@redhat.com>
The 'dummy' container is created only on the first client node, which means we
must seek to destroy this container only on this node, otherwise this
can cause a failure like the following:
```
fatal: [192.168.24.8]: FAILED! => {"changed": false, "cmd": ["docker", "rm",
"-f", "ceph-create-keys"], "delta": "0:00:00.023692", "end": "2018-06-12
20:56:07.261278", "msg": "non-zero return code", "rc": 1, "start":
"2018-06-12 20:56:07.237586", "stderr": "Error response from daemon: No such
container: ceph-create-keys", "stderr_lines": ["Error response from daemon: No
such container: ceph-create-keys"], "stdout": "", "stdout_lines": []}
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1590746
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this patch if you were running on a Red Hat system,
ceph-ansible would try to configure firewalld for you without the
operator's consent.
Now you can enable or disable the fw configuration by setting
configure_firewall to either true or false.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1589146
Signed-off-by: Sébastien Han <seb@redhat.com>
The current secure cluster play runs with all the monitors. The rerun
of this task is unnecessary and can be skipped.
Fixes: #2737
Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
combining `run_once: true` with `inventory_hostname ==
groups.get(client_group_name) | first` might cause a bug when the only
node being run is not the first in the group.
In a deployment with a single client node it might cause an issue because
sometimes the keyring won't be created since the task could be definitively
skipped.
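One way to avoid the mismatch is to drop the equality check and pin the single execution with `delegate_to` (a sketch; `keyring_name` is a placeholder and the real task differs):
```
# instead of pairing run_once with
#   when: inventory_hostname == groups.get(client_group_name) | first
# pin the single execution to the first client explicitly:
- name: create cephx keyring once
  command: "{{ docker_exec_cmd | default('') }} ceph auth get-or-create client.{{ keyring_name }}"
  delegate_to: "{{ groups.get(client_group_name) | first }}"
  run_once: true
```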
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Let's try to avoid using dashes as testinfra needs to be able to read
the groups.
Typically, with iscsi-gws we can't add a marker for these iscsi nodes,
using an underscore fixes the issue.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now have the ability to deploy a containerized version of ceph-iscsi.
The result is similar to the non-containerized version, you simply have
3 containers running for the following services:
* rbd-target-api
* rbd-target-gw
* tcmu-runner
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1508144
Signed-off-by: Sébastien Han <seb@redhat.com>
Potential error if someone doesn't pass the mode in the `keys` dict for
client nodes:
```
fatal: [client2]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'mode'
The error appears to have been in '/home/guits/ceph-ansible/roles/ceph-client/tasks/create_users_keys.yml': line 117, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: get client cephx keys
^ here
exception type: <class 'ansible.errors.AnsibleUndefinedVariable'>
exception: 'dict object' has no attribute 'mode'
```
adding a default value will prevent the deployment from failing because of this.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Functional tests are broken when testing against 'dev' release (ceph).
Adding a dummy value here will make it possible to run ceph-ansible CI
against dev ceph release.
Typical error:
```
> if request.node.get_marker("from_luminous") and ceph_release_num[ceph_stable_release] < ceph_release_num['luminous']:
E KeyError: 'dev'
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fd1487d93f21b609a637053f5b33cd2a4e408d00)
We need to do this because on dev or rhcs installs ceph_stable_release
is not mandatory and the firewall check tasks have a task that is
conditional based off the installed version of ceph. If we perform those
checks after package install then they will not fail on dev or rhcs
installs.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
the `docker_exec_cmd` fact set in the client role when there is no monitor
in the inventory is wrong: `ceph-client-{{ hostname }}` is never created so
it will fail anyway.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When configuring openstack, the created keyrings aren't copied over to
all monitor nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Refactor of 8704144e31.
There is no need to have duplicated tasks for this. The rgw pools
creation should be delegated to a monitor node so we don't have to care
whether the admin keyring is present on the rgw node.
By the way, only one task is needed to create the pools, we just need to
use the `docker_exec_cmd` fact already defined in `ceph-defaults` to
achieve it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1550281
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The initial keyring is generated locally on the ansible server and the snippet works well for both v2 and v3 of python.
I don't see any reason why we should explicitly invoke `python2` instead of just `python`.
In some setups, `python2` is not symlinked to `python`, while `python` and `python3` refer to v2 and v3 respectively.
Signed-off-by: Ha Phan <thanhha.work@gmail.com>
Prior to this commit the firewall tasks were not opening the ceph-mgr
ports. This would lead to unclean configuration since the ceph-mgr
daemons can not connect to the OSDs.
This commit opens the right ports on the ceph-mgr nodes to talk with the
OSDs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526400
Signed-off-by: Sébastien Han <seb@redhat.com>
The ceph command has to be executed from one of the monitor containers
if no admin copy is present on the RGWs; the task then has to be delegated.
Adds test to check proper RGW pool creation for Docker container scenarios.
Signed-off-by: Jorge Tudela <jtudelag@redhat.com>
Since the openstack_config.yml has been moved to `ceph-osd` we must move
this `set_fact` in ceph-osd otherwise the tasks in
`openstack_config.yml` using `openstack_keys` will actually use the
defaults value from `ceph-defaults`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1585139
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The first 14.x tag has been cut so this needs to be added so that
version detection will still work on the master branch of ceph.
Fixes: https://github.com/ceph/ceph-ansible/issues/2671
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This is a follow up on #2628.
Even with the openstack pools creation moved later in the playbook,
there is still an issue because OSDs are not all UP when trying to
create pools.
Adding a task which checks for all OSDs to be UP with a `retries/until`
condition should definitively fix this issue.
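A sketch of such a retries/until check (the JSON path shown matches the pre-Nautilus `ceph -s -f json` output; retry counts are illustrative):
```
- name: wait for all OSDs to be up
  command: "{{ docker_exec_cmd | default('') }} ceph --cluster {{ cluster | default('ceph') }} -s -f json"
  register: ceph_status
  delegate_to: "{{ groups[mon_group_name][0] }}"
  retries: 60
  delay: 10
  until: >-
    (ceph_status.stdout | from_json).osdmap.osdmap.num_osds | int > 0 and
    (ceph_status.stdout | from_json).osdmap.osdmap.num_osds ==
    (ceph_status.stdout | from_json).osdmap.osdmap.num_up_osds
```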
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When playing the ceph-mds role, mon nodes have set a fact with the default
pg num for osd pools, so we can simply default to this value for cephfs
pools (the `cephfs_pools` variable).
At the moment the variable definition for `cephfs_pools` looks like:
```
cephfs_pools:
- { name: "{{ cephfs_data }}", pgs: "" }
- { name: "{{ cephfs_metadata }}", pgs: "" }
```
and we have a task in `ceph-validate` to ensure `pgs` has been set to a
valid value.
We could simply avoid this check by setting the default value of `pgs`
to `hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num']` and
leave users the possibility to override this value.
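That default would look roughly like this (a sketch of the variable definition described above):
```
cephfs_pools:
  - { name: "{{ cephfs_data }}", pgs: "{{ hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num'] }}" }
  - { name: "{{ cephfs_metadata }}", pgs: "{{ hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num'] }}" }
```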
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1581164
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
in `ceph-osd` there is no need to set `docker_exec_cmd` since the only
place where this fact is used is in `openstack_config.yml`, which
delegates all docker commands to a monitor node. It means we need the
`docker_exec_cmd` fact that has been set referring to `ceph-mon-*`
containers; this fact is already set earlier in `ceph-defaults`.
By the way, when collocating an OSD with a MON it fails because the container
`ceph-osd-{{ ansible_hostname }}` doesn't exist.
Removing this task will allow collocating an OSD with a MON.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1584179
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When collocating mds on a monitor node, cephfs will fail
because `docker_exec_cmd` is reset to `ceph-mds-monXX`, which is
incorrect because we need to delegate the task to `ceph-mon-monXX`.
In addition, it wouldn't have worked since `ceph-mds-monXX` container
isn't started yet.
Moving the task earlier in the `ceph-mds` role will fix this issue.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your
inventory and assign them a value. Once the rgw container starts it'll
pick the info and add itself to the right zone.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637
Signed-off-by: Sébastien Han <seb@redhat.com>
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move cephfs pools creation in `ceph-mds` role.
[1] e59258943b/src/mon/OSDMonitor.cc (L5673)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move openstack pools creation at the end of `ceph-osd` role.
[1] e59258943b/src/mon/OSDMonitor.cc (L5673)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.
It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.
Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a.
Now profiles drives the setting of rgw keystone *.
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
The LVM lvcreate fails if the disk already has a GPT header.
We create GPT header regardless of OSD scenario. The fix is to
skip header creation for lvm scenario.
fixes: https://github.com/ceph/ceph-ansible/issues/2592
Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
During a rolling update, OSDs are restarted twice currently. Once, by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
A dev or rhcs install does not require ceph_stable_release to be set and
instead generates that by looking at the installed ceph-version.
However, at this point in the playbook ceph may not have been installed
yet and ceph-common has not been run.
Fixes: https://github.com/ceph/ceph-ansible/issues/2618
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
The validation module does not get config options with the template
syntax rendered, so we're gonna remove that and just default it to
False. The backwards compat was scheduled to be removed in 3.1 anyway.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
When devices is not defined because you want to use the 'lvm'
osd_scenario, but you've made a mistake selecting that scenario, these
tasks should not fail.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
An extra space in the systemctl list-units output can cause restart_osd_daemon.sh to
fail.
It looks like if you have more services enabled on the node, the space
between "loaded" and "active" gets wider than the single space
given in the command[1].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
Check whether a mgr module is supposed to be disabled before disabling
it and whether it is already enabled before enabling it.
Signed-off-by: Michael Vollman <michael.b.vollman@gmail.com>
We can simply reference the template name since it exists within the
role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH
or playbooks directory for the file.
To make the package installation more efficient we should install
packages as a list rather than as individual tasks or using a
"with_items" loop. The package managers can handle a list passed to them
to install in one go.
We can use a specified list and substitute any packages that are not to
be installed with the ceph-common package, which is installed on every
package install, then apply the unique filter to the package install
list.
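A minimal sketch of the idea (group names and list contents are illustrative, not the actual change): entries not needed on the host fall back to ceph-common, and `unique` collapses the duplicates before a single `package` call installs everything.
```
- name: install ceph packages in one transaction
  package:
    name: "{{ ceph_pkgs | unique }}"
    state: present
  vars:
    # hypothetical list; entries not wanted on this host resolve to
    # ceph-common, which unique then collapses into a single item
    ceph_pkgs:
      - ceph-common
      - "{{ 'ceph-osd' if inventory_hostname in groups.get('osds', []) else 'ceph-common' }}"
      - "{{ 'ceph-mds' if inventory_hostname in groups.get('mdss', []) else 'ceph-common' }}"
```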
There is no need to stat for created mgr keyrings since they are created
anyway when deploying a ceph cluster > jewel. In case of a jewel
deployment we won't enter that block.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This file is a leftover from PR ceph/ceph-ansible#2516
It is not used anymore so it can be removed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Until all the mons have been updated to Luminous, there is no way to
create a key. So we should do the key creation in the mon role only if
we are not part of an update.
If we are then the key creation is done after the mons upgrade to
Luminous.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Sébastien Han <seb@redhat.com>
Trying to mask the target when `/etc/systemd/system/target.service` doesn't
exist seems to be a bug.
There is no need to mask a unit file which doesn't exist.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The order of fs.aio-max-nr (which is hard-coded to 1048576) means that
if you set fs.aio-max-nr in os_tuning_params it will effectively be
ignored for bluestore scenarios.
To resolve this we should move the setting of fs.aio-max-nr above the
setting of os_tuning_params, in this way the operator can define the
value of fs.aio-max-nr to be something other than 1048576 if they want
to.
Additionally, we can make the sysctl settings happen in 1 task rather
than multiple.
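A hedged sketch of the combined task (the sysctl_file path and the exact list construction are assumptions, not the actual change); since the loop applies items in order, a user-provided fs.aio-max-nr in os_tuning_params is applied after the hard-coded default and therefore wins.
```
- name: apply kernel tuning in a single task
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    sysctl_file: /etc/sysctl.d/ceph-tuning.conf
  with_items: "{{ [{'name': 'fs.aio-max-nr', 'value': '1048576'}] + os_tuning_params }}"
```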
trying to set the default value for pg_num to
`hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num'])` will
break in case of external client nodes deployment.
the `pg_num` attribute should be mandatory and be tested in future
`ceph-validate` role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
On containerized deployment,
when upgrading from jewel to luminous, mgr keyring creation fails because the
command to create mgr keyring is executed on a container that is still
running jewel since the container is restarted later to run the new
image, therefore, it fails with bad entity error.
To get around this situation, we can delegate the command to create
these keyrings on the first monitor when we are running the playbook on the last monitor.
That way we ensure we will issue the command on a container that has
been properly restarted with the new image.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The Debian and SuSE installs for nfs-ganesha on the non-rhcs repository
require you to use allow_unauthenticated for Debian, and disable_gpg_check
for SuSE. The nfs-ganesha-rgw package already does this, but the
nfs-ganesha-ceph package will fail to install because of this same
issue.
This PR moves the installations to happen when the appropriate flags are
set to True (nfs_obj_gw & nfs_file_gw), but does it per distro (one for
SuSE and one for Debian) so that the appropriate flag can be passed to
ignore the GPG check.
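A hedged sketch of the per-distro split (task layout assumed, not the actual change):
```
- name: install nfs-ganesha-ceph on Debian
  apt:
    name: nfs-ganesha-ceph
    state: present
    allow_unauthenticated: yes
  when:
    - ansible_os_family == 'Debian'
    - nfs_file_gw | bool

- name: install nfs-ganesha-ceph on SUSE
  zypper:
    name: nfs-ganesha-ceph
    state: present
    disable_gpg_check: yes
  when:
    - ansible_os_family == 'Suse'
    - nfs_file_gw | bool
```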
When 'ceph_nfs_disable_caching' is set to True, disable attribute
caching done by Ganesha for all Ganesha exports.
Signed-off-by: Ramana Raja <rraja@redhat.com>
If we are in a middle of an update we want to get the new package
version being installed so the task that copies the repo files should
not be skipped.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572032
Signed-off-by: Sébastien Han <seb@redhat.com>
The apt-cache update can fail due to transient issues related to the
action being a network operation. To reduce the impact of these
transient failures this patch adds a retry to the update_cache task.
However, the apt_repository tasks which would perform an apt_update
won't retry the apt_update on a failure in the same way, as such this PR
moves the apt_update into an individual task, once per role.
Finally, the apt_repository tasks no longer have a changed_when: false,
and the apt_cache update is only performed once per role, if the
repositories change. Otherwise the cache is updated on the "apt" install
tasks if the cache_timeout has been reached.
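A minimal sketch of the retry pattern (retry counts are arbitrary examples):
```
- name: update apt cache, retrying transient network failures
  apt:
    update_cache: yes
  register: apt_cache_result
  until: apt_cache_result is succeeded
  retries: 5
  delay: 2
```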
The value in `docker_exec_client_cmd` doesn't allow checking for
existing pools because it is set with the wrong entrypoint.
It means the check was going to fail anyway, even if the pools actually exist.
Using jinja syntax to set `docker_exec_cmd` allows us to handle the case
where you don't have monitors in your inventory.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If openstack_pools contains an application key it will be used to apply
this application pool type to a pool.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1562220
Signed-off-by: Sébastien Han <seb@redhat.com>
As of ceph 12.2.5 the type of the parameter `type` is not a name anymore but
an id, therefore an `int` is expected otherwise it will fail with the
following error
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The last mon creates the keys with a particular mode, while copying them
to the other mons (first and second) we must re-use the mode that was
set.
The same applies for the client node, the slurp preserves the initial
'item' so we can get the mode for the copy.
Signed-off-by: Sébastien Han <seb@redhat.com>
This key is created after the last mon is up so there is no need to try
to push it from the first mon. The initial mon container is not creating
the mgr key, ansible does. So this key will never exist.
The key will go into the fetch dir once the last mon is up, then when
the ceph-mgr plays it will try to get it from the fetch directory.
Signed-off-by: Sébastien Han <seb@redhat.com>
During the initial bootstrap of the first mon, the monmap file is
destroyed so it's not available and ansible will never find it.
Signed-off-by: Sébastien Han <seb@redhat.com>
Useful for software that does data collection/monitoring, like collectd.
They can connect to the socket and then retrieve information.
Even though the sockets are exposed now, I'm keeping the docker exec to
check the socket, this will allow newer version of ceph-ansible to work
with older versions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1563280
Signed-off-by: Sébastien Han <seb@redhat.com>
We now have the ability to detect the uid/gid of the ceph user depending
on the distribution we are running on and so we are doing non-container
deployments.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now bindmount with the :z option at the end of the -v command, so
this will basically run the exact same command as we used to run. So to
speak:
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
Signed-off-by: Sébastien Han <seb@redhat.com>
This fixes the case where the playbook died and never removed the
container. So now, once the container exits it will remove itself from
the container list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
If the user has set copy_admin_key to true we assume he/she wants to
import the key in Ceph and not only create the key on the filesystem.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-authtool does not support raw arguments, so we have to quote the caps
declaration, e.g. allow 'bla bla' instead of allow bla bla.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit does a couple of things:
* use a common.yml file that contains things that can be played on both
container and non-container
* refactor the ability to copy the admin key to the nodes
Signed-off-by: Sébastien Han <seb@redhat.com>
Red Hat is now using tags[3,latest] for image rhceph/rhceph-3-rhel7.
Because of this, the ceph_uid conditional passes for Debian
when 'ceph_docker_image_tag: latest' on RH deployments.
I've added an additional task to check for rhceph image specifically,
and also updated the RH family task for ceph/daemon [centos|fedora] tags.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
When installing rhcs on Debian systems the red hat repos must have the
highest priority so we avoid packages conflicts and install the rhcs
version.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1565850
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no need to check for a running cluster n*nodes times in
`ceph-defaults` so let's add a `run_once: true` to save some resources
and time.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Regardless of whether the partition is 'ceph' or something else, we don't want
to be as strict as checking for a particular partition.
If the drive has a partition, we just don't do anything.
This solves the case where the server reboots, disks get a different
/dev/sda (node) allocation. In this case, prior to restarting the server
/dev/sda was an OSD, but now it's /dev/sdb and the other way around.
In such scenario, we will try to prepare the OSD and create a new
partition, so let's not mess around with devices that have partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498303
Signed-off-by: Sébastien Han <seb@redhat.com>
allow_multimds will be officially deprecated in Mimic, specify it
only for all versions of Ceph where it was declared stable. Going
forward, specify only max_mds.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
NFS-ganesha cannot start if the nfs-server service
is running. This commit stops nfs-server in case it
is running on a (debian, redhat, suse) node before
the nfs-ganesha service starts up
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Add a variable, ceph_nfs_disable_caching, that if set to true
disables ganesha's directory and attribute caching as much as
possible.
Also, disable caching done by ganesha, when 'nfs_file_gw'
variable is true, i.e., when Ganesha is used as CephFS's gateway.
This is the recommended Ganesha setting as libcephfs already caches
information. And doing so helps avoid cache incoherency issues
especially with clustered ganesha over CephFS.
Fixes: https://tracker.ceph.com/issues/23393
Signed-off-by: Ramana Raja <rraja@redhat.com>
If people keep using mon_cap, osd_cap, etc., the playbook will
translate this old syntax on the fly.
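A hedged illustration of the two layouts (key names and caps values are made up): the first entry uses the old flat `*_cap` keys that get translated, the second the current `caps` dictionary.
```
keys:
  - name: client.legacy
    mon_cap: "allow r"                 # old syntax, translated on the fly
    osd_cap: "allow rwx pool=images"
  - name: client.current
    caps:                              # current syntax
      mon: "allow r"
      osd: "allow rwx pool=images"
```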
Signed-off-by: Sébastien Han <seb@redhat.com>
backward compatibility with `ceph_mon_docker_interface` and
`ceph_mon_docker_subnet` was not working since there was no lookup on
`monitor_interface` and `public_network`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this change, if a user had ceph-test-12.2.1 installed, and
upgraded to ceph v12.2.3 or newer, the RPM upgrade process would
fail.
The problem is that the ceph-test RPM did not depend on an exact version
of ceph-common until v12.2.3.
In Ceph v12.2.3, ceph-{osdomap,kvstore,monstore}-tool binaries moved
from ceph-test into ceph-base. When ceph-test is not yet up-to-date, Yum
encounters package conflicts between the older ceph-test and newer
ceph-base.
When all users have upgraded beyond Ceph < 12.2.3, this is no longer
relevant.
According to our recent change, we now use "CentOS" as the latest
container image. We need to reflect this in the ceph_uid.
Signed-off-by: Sébastien Han <seb@redhat.com>
TripleO deployment failed when the monitors are not managed
by TripleO itself, with:
FAILED! => {"msg": "list object has no element 0"}
The failing play item was introduced by
f46217b69a .
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1552327
Signed-off-by: Attila Fazekas <afazekas@redhat.com>
because of `serial: 1`, it can be an issue when the playbook is being
run on client nodes.
Since the refactor of `ceph-client` we skip the role `ceph-defaults` on
every node except the first client node, it means that the task is not
going to be played because of `run_once: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refactors this role so we don't have to pull the container image
on client nodes just to create pools and keys.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1550977
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This seems to be a leftover.
This commit removes an unnecessary 'set linux permissions' on
`/var/lib/ceph`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This check is alone in `ceph-docker-common` since a previous code
refactor.
Moving this check in `ceph-defaults` allows us to run `ceph-clients`
without having to run `ceph-docker-common` even in non-containerized
deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this patch, the certificates were being generated on a single
node only (because of run_once: true). Thus certificates were not
distributed to all the gateway nodes.
This would require a second ansible run to work. This patch fixes the
creation and the keys' distribution on all the nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540845
Signed-off-by: Sébastien Han <seb@redhat.com>
This update will resolve the error `'cephfs' is undefined` in multimds
container deployments.
See: roles/ceph-mon/tasks/create_mds_filesystems.yml. The same last two tasks
are present there, and actually need to happen in that role since "{{ cephfs }}"
gets defined in roles/ceph-mon/defaults/main.yml, and not
roles/ceph-mds/defaults/main.yml.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
Copy the admin key when configured nfs_file_gw (but not nfs_obj_gw). Also,
copy/setup RGW related directories only when configured as nfs_obj_gw.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This variable is needed for containerized clusters and is required for
the ceph-docker-common role. Typically the is_atomic variable is set in
site-docker.yml.sample though so if ceph-docker-common is used outside
of that playbook it needs to be set in another way. Moving the creation of
the variable inside this role means playbooks don't need to worry
about setting it.
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1558252
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Since the approach to creating a ceph.conf file has changed, and now
no-longer relies on assembling config file fragments in /etc/ceph/ceph.d
we can avoid the conf_overrides rendering on the local host and skip out
the tasks related to that, instead using just the config_template task
to configure the file directly.
When creating pools, it's crucial to expose all the options available as
part of the pool creation command. As explained in:
http://docs.ceph.com/docs/jewel/rados/operations/pools/
Signed-off-by: Sébastien Han <seb@redhat.com>
If OSDs don't restart normally we now also dump info of the crush map,
crush rules, crush tree and pools.
If the monitors don't restart normally we also print the socket status
by calling mon_status and quorum_status.
Signed-off-by: Sébastien Han <seb@redhat.com>
The `pools` dict defined in `roles/ceph-client/defaults/main.yml`
shouldn't have `{{ ceph_conf_overrides.global.osd_pool_default_pg_num
}}` as default value for `pgs` keys.
For instance, if you want some pools to be created but without explicitly
specifying the pgs for these pools (it means you want to use the
`osd_pool_default_pg_num`), you will be obliged to define
`{{ ceph_conf_overrides.global.osd_pool_default_pg_num }}` anyway while you
wanted to use the current default value already defined in the cluster which is
retrieved early in the playbook and stored in the
`{{ osd_pool_default_pg_num }}` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Running the last portion (insert new default and add new default crush
tasks) of crush_rules.yml only on the last monitor is
wrong since ceph CLI calls usually end up on the master having the
quorum, which is by default the one with the lowest IP.
So if we run the command and end up on another mon the creation will
happen on the default crush rule because the particular mon hasn't been
updated.
To fix this we remove the |last on the include and use run_once: true on
certain tasks, then we let the final two tasks run on all the monitors.
Signed-off-by: Sébastien Han <seb@redhat.com>
On releases after jewel the option
'osd_pool_default_crush_replicated_ruleset' does not exist anymore, it's
called osd_pool_default_crush_rule.
Signed-off-by: Sébastien Han <seb@redhat.com>
This was causing a lot of pain with the handlers. Also the
implementation was not ideal since we were assembling files. Everything
can now be done with the ceph_crush module so let's remove that.
Signed-off-by: Sébastien Han <seb@redhat.com>
Instead of creating the CRUSH hierarchy with Ansible tasks using the
command module we now rely on the ceph_crush module.
Signed-off-by: Sébastien Han <seb@redhat.com>
One could want to add new crush rules while keeping their current default rule.
This was fixed so that it works with all rules defined as "default: false". If multiple rules are defined as default (which should not happen), the last rule listed in "crush_rules" is taken as the default.
As part of fcba2c801a these vars were
removed and no longer do anything:
radosgw_dns_name
radosgw_resolve_cname
This patch removes them from the group_vars files and defaults/main.yml
If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.
Signed-off-by: Sébastien Han <seb@redhat.com>
In case the admin key wasn't copied over to the node, this command would
fail. So it's safer to run it from a monitor directly.
Signed-off-by: Sébastien Han <seb@redhat.com>
That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The nfs-ganesha package has been fixed as part of this commit:
963b6681df
Once the package is rebuilt this should be good to merge.
This reverts commit e88af3c4cb.
Previously it was necessary to provide a value (possibly an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
Using `updatedb -e` doesn't make a permanent change; it only runs updatedb
once without the passed path.
To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
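A hedged sketch of one way to do that (the PRUNEPATHS format varies between distributions, so this is an assumption rather than the actual change); the negative lookahead keeps the edit idempotent.
```
- name: exclude /var/lib/ceph from updatedb indexing
  replace:
    path: /etc/updatedb.conf
    regexp: '^(PRUNEPATHS\s*=\s*")(?!.*/var/lib/ceph)(.*)"$'
    replace: '\1/var/lib/ceph \2"'
```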
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.
The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common
Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.
For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
configuration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first host's.
To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.
Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.
Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
When used along with delegation, run_once does not behave well. Thus,
using `| last` always brings the desired result.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now look for any existing containers; if any, we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs
Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.
We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit fixes a bug that occurs especially for dmcrypt scenarios.
There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.
If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :
```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```
It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because it won't have
filled the `$DOCKER_ENV` environment variable properly.
Adding `--net=host` to the 'disk_list' container fixes this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
While hostname -f will always return a hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml
Signed-off-by: Greg Charot <gcharot@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
module: wait_for
port: 22
host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
state: started
delay: 10
timeout: 500
```
is nicer and a good way to keep consistency across the whole
playbook.
This also fixes a potential issue with missing quotation marks:
```
Traceback (most recent call last):
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
main()
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
File "/usr/lib64/python2.7/shlex.py", line 279, in split
return list(lex) File "/usr/lib64/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it complains about missing quotes; this fix solves that issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
With two public networks configured, we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" the install process consistently became
broken, trying to find the docker registry on the second network, and not
finding the mon container.
But without spaces,
"NETWORK_ADDR_1,NETWORK_ADDR_2", the install succeeds.
So the containerized install is more particular about the formatting of this line.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
Description of problem: The 'get osd id' task goes through all 10 retries (and their respective timeouts) to make sure that the number of OSDs in the osd directory matches the number of devices.
This happens always, regardless if the setup and deployment is correct.
Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.
How reproducible: 100%
Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster
Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
"attempts": 10,
"changed": false,
"cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
"delta": "0:00:00.002717",
"end": "2018-01-21 18:10:31.237933",
"failed": true,
"failed_when_result": false,
"rc": 0,
"start": "2018-01-21 18:10:31.235216"
}
STDOUT:
0
1
2
Expected results:
There aren't any (or just a few) timeouts while the OSDs are found
Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.
Basically this line:
until: osd_id.stdout_lines|length == devices|unique|length
Means in this 2 OSD case it is trying to ensure the following incorrect condition:
until: 2 == 0
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.
Fixes: #2111
This allows us to use host-specific variables in ceph_conf_overrides variable. For example, this fixes usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
Sometimes the playbook gets stuck because, even with the `--connect-timeout=`
option, the connection to the existing ceph cluster never times out.
As a workaround, using the `timeout` command provided by coreutils will
actually time out if we can't connect to the cluster.
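A minimal sketch of the workaround (timeout values chosen arbitrarily here):
```
- name: check for an existing cluster fsid, bounded by coreutils timeout
  command: timeout 10 ceph --connect-timeout 3 --cluster {{ cluster }} fsid
  register: ceph_current_fsid
  failed_when: false
  changed_when: false
```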
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537003
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is to keep backward compatibility with stable-2.2 and satisfy the
check "verify dedicated devices have been provided" in
`check_mandatory_vars.yml`. This check is looking for
`dedicated_devices` so we need to default its value to
`raw_journal_devices` when `raw_multi_journal` is set to `True`.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1536098
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Some systems that were deployed with old tools can leave units named
"ceph-radosgw@radosgw.gateway.service". As a consequence, they will
prevent the new unit from starting.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1509584
Signed-off-by: Sébastien Han <seb@redhat.com>
Currently, we can define crush location for each host but only crush roots and crush rules are created. This commit automates other routines for a complete solution:
1) Creates rack type crush buckets defined in {{ ceph_crush_rack }} of each osd host. If it's not defined by user then a rack named 'default_rack_{{ ceph_crush_root }}' would be added and used in next steps.
2) Move rack type crush buckets defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
3) Move hosts defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
On a non-collocated scenario, if a drive is faulty we can't really
remove it from the list of 'devices' without messing up or having to
re-arrange the order of the 'dedicated_devices'. We want to keep this
device list ordered. This will prevent the activation from failing on a
device that we know is faulty but that we can't remove yet, so as not to
mess up the dedicated_devices-to-devices mapping.
Signed-off-by: Sébastien Han <seb@redhat.com>
Having handlers in both ceph-defaults and ceph-docker-common roles can make the
playbook restart services twice. Handlers can be triggered a first
time because of a change in ceph.conf and a second time because a new
image has been pulled.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There wasn't any good way to implement this.
We had several options and none of them were ideal since handlers cannot
be triggered cross-roles.
We could have achieved that by doing:
* option 1 was to add a dependency in the meta of the ceph-docker-common
role. We had that long ago and we decided to stop, so everything is
managed via site.yml
* option 2 was to import files from another role. This is messy and we
don't do that anywhere in the current code base, and we will continue not to.
There is option 3 where we pull the image from the ceph-config role.
This is not suitable either, since the docker command won't be available
unless you run an Atomic distro. This would also mean that you're trying to
pull twice: first in ceph-config, a second time in ceph-docker-common.
The only option I came up with was to duplicate a bit of the ceph-config
handlers code.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
On containerized deployment, `docker_exec_cmd` is not set before the
task which tries to retrieve the current fsid is played, which means it
considers there is no existing fsid and tries to generate a new one.
Typical error:
```
ok: [mon0 -> mon0] => {
"changed": false,
"cmd": [
"ceph",
"--connect-timeout",
"3",
"--cluster",
"test",
"fsid"
],
"delta": "0:00:00.179909",
"end": "2018-01-09 10:36:58.759846",
"failed": false,
"failed_when_result": false,
"rc": 1,
"start": "2018-01-09 10:36:58.579937"
}
STDERR:
Error initializing cluster client: Error('error calling conf_read_file: errno EINVAL',)
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Previously we were using ceph_conf_overrides, however this doesn't play
nicely with software like TripleO that uses ceph_conf_overrides inside its
own code. For now, and since this is the only occurrence of this, we can
ensure no logs through the ceph conf template.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1532619
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no reason why we can't use crush rules when deploying
containers. So we move the include into main.yml so it can be called.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-create-keys is idempotent so it's not an issue to run it each time
we play ansible. This also fix issues where the 'creates' arg skips the
task and no keys get generated on newer version, e.g during an upgrade.
Closes: https://github.com/ceph/ceph-ansible/issues/2228
Signed-off-by: Sébastien Han <seb@redhat.com>
When upgrading from OSP11 to OSP12 container, ceph-ansible attempts to
disable the RGW service provided by the overcloud image. The task
attempts to stop/disable ceph-rgw@{{ ansible-hostname }} and
ceph-radosgw@{{ ansible-hostname }}.service. The actual service name is
ceph-radosgw@radosgw.$name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525209
Signed-off-by: Sébastien Han <seb@redhat.com>
the gpt label creation doesn't work even with parted module.
This commit fixes the gpt label creation by using parted command
instead.
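A hedged sketch of what the CLI-based task could look like (task layout assumed, not the actual change):
```
- name: create a gpt label with the parted CLI
  command: parted --script {{ item }} mklabel gpt
  with_items: "{{ devices }}"
```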
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We have a scenario when we switch from non-container to containers. This
means we don't know anything about the ceph partitions associated to an
OSD. Normally in a containerized context we have files containing the
preparation sequence. From these files we can get the capabilities of
each OSD. As a last resort we use a ceph-disk call inside a dummy bash
container to discover the ceph journal on the current osd.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525612
Signed-off-by: Sébastien Han <seb@redhat.com>
This resolves the following error:
E: There were unauthenticated packages and -y was used without
--allow-unauthenticated
Signed-off-by: Sébastien Han <seb@redhat.com>
The name docker_version is very generic and is also used by other
roles. As a result, there may be name conflicts. To avoid this a
ceph_ prefix should be used for this fact. Since it is an internal
fact renaming is not a problem.
making `osd_pool_default_pg_num` mandatory is a bit aggressive and is
unrelated when you just want to create users keyrings.
Closes: #2241
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The entrypoint to generate user keyrings is `ceph-authtool`, therefore
it cannot expand `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyrings are properly filled when
`user_config: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add the variables ceph_osd_docker_cpuset_cpus and
ceph_osd_docker_cpuset_mems, so that a user may specify
the CPUs and memory nodes of NUMA systems on which OSD
containers are run.
Provides an example in osds.yaml.sample to guide the user
based on sample `lscpu` output since cpuset-mems refers
to the memory by NUMA node only while cpuset-cpus can
refer to individual vCPUs within a NUMA node.
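For illustration only (the values are examples, not defaults), pinning OSD containers to the first four cores of NUMA node 0 could look like:
```
ceph_osd_docker_cpuset_cpus: "0-3"
ceph_osd_docker_cpuset_mems: "0"
```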
If a deployer uses an interface name with a dash/hyphen in it, such
as 'br-storage' for the monitor_interface group_var, the ceph.conf.j2
template fails to find the right facts. It looks for
'ansible_br-storage' but only 'ansible_br_storage' exists.
This patch converts the interface name to underscores when the
template does the fact lookup.
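A hedged sketch of the lookup with the conversion applied (the fact name `_monitor_address` is hypothetical):
```
- name: resolve the address even when the interface name contains a dash
  set_fact:
    _monitor_address: "{{ hostvars[inventory_hostname]['ansible_' + (monitor_interface | replace('-', '_'))]['ipv4']['address'] }}"
```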
The CI complains because of `ceph_uid` fact which doesn't exist since
the docker image tag used in the CI doesn't match with this condition.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is particularly useful in CI environments where you don't have
the option of adding extra devices or volumes to the host. It is also
a simple change to support loopback devices
In case where docker CLI is available but docker is not running, we
don't want to trigger the restart of the daemons.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since some daemons now install their own packages the task checking the
ceph version fails on Debian systems. So the 'ceph-common' package must
be installed on all the machines.
Signed-off-by: Sébastien Han <seb@redhat.com>
This task was originally needed to fix a docker installation issue
(see: #1030). This has been fixed, therefore it can be removed.
Fixes: #2199
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
A recent change [1] required that the openstack_keys
param always contain an acls list. However, it's
possible it might not contain that list. Thus, this
param sets a default for that list to be empty if it
is not in the structure as defined by the user.
[1] d65cbaa539
We were activating dmcrypt devices with the wrong command. Basically the
first task executes the wrong activate command. The task fails but
continues because of the 'failed_when: false'. Then the right activation
sequence is being done by the next task.
Signed-off-by: Sébastien Han <seb@redhat.com>
when `ceph-rbd-mirror.target` is not enabled, the service won't start
after a reboot because there is a dependency between these two units.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If ceph-ansible deploys a Ceph cluster with "openstack_config: true"
and sets the openstack_keys map to have certain ACLs or permissions,
the requested ACLs or permissions are only set on one of the monitor
nodes [2] when they should be set on all of them.
This patch solves [3] the above issue by having the chmod and setfacl
tasks iterate the list of mon nodes (including the mon node that the
task was delegated to) to apply the chmod or setfacl to the keys in
openstack_keys.
[1]
```
openstack_keys:
- { name: client.openstack, key: "$(ceph-authtool --gen-print-key)", mon_cap: "allow r", osd_cap: "allow class-read object_prefix rbd_children, allow rwx pool=images, allow rwx pool=vms, allow rwx pool=volumes, allow rwx pool=backups", mode: "0600", acls: ["u:nova:r--", "u:cinder:r--", "u:glance:r--", "u:gnocchi:r--"] }
```
[2]
```
$ ansible mons -m shell -b -a "ls -l /etc/ceph/ceph.client.openstack.keyring ; getfacl /etc/ceph/ceph.client.openstack.keyring"
192.168.1.26 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.29 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
group::r--
other::r--getfacl: Removing leading '/' from absolute path names
192.168.1.23 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
group::r--
other::r--getfacl: Removing leading '/' from absolute path names
$
```
[3]
```
(undercloud) [stack@hci-director ceph-ansible]$ ansible mons -m shell -b -a "ls -l /etc/ceph/ceph.client.openstack.keyring ; getfacl /etc/ceph/ceph.client.openstack.keyring"
192.168.1.25 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.29 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.27 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
(undercloud) [stack@hci-director ceph-ansible]$
```
Add support for the openSUSE distributions. The required packages
are available either in the distribution repositories or in the
OBS one.
Signed-off-by: Markos Chandras <mchandras@suse.de>
When we consume the distribution packages, we don't have the choice of
which version to install, so we shouldn't require that variable to be
set. Distributions normally provide only one version of Ceph in the
official repositories so we get whatever they provide.
Signed-off-by: Markos Chandras <mchandras@suse.de>
openSUSE Leap 42.3 provides support for Ceph Luminous in both the
distribution package and the latest available version in the OBS
repository so add these as the only available installation methods for
openSUSE.
Signed-off-by: Markos Chandras <mchandras@suse.de>
Like 80d32dec, the path to the fact is not correct.
In any case, we will retrieve the IP address in hostvars, the variable
is the way we get the interface name according to where it has been set
(eg.: inventory host file vs. group_vars/)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to have a condition on this task; this test should
always run since the result will be interpreted later.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This will prevent ceph-ansible from using a loop device while it
shouldn't in auto_discovery mode.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The path to the fact is not correct.
In any case, we will retrieve the IP address in hostvars, the variable
is the way we get the interface name according to where it has been set
(eg.: inventory host file vs. group_vars/)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Use `devices` variable instead of `ansible_devices`, otherwise it means
we are not using the devices which have been 'auto discovered'
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current code will also return lvm devices such as /dev/dm-2, this
kind of device type is not supported by ceph-disk at the moment. Now we
just ignore them.
Signed-off-by: Sébastien Han <seb@redhat.com>
- One cannot run scripts directly from a filesystem mounted with the `noexec`
option. But one can run scripts as arguments to `bash/sh`.
Signed-off-by: Arano-kai <captcha.is.evil@gmail.com>
During the initial implementation of this 'old' thing we were falling
into this issue without noticing
(https://github.com/moby/moby/issues/30341) and were blindly using --rm;
now that this is fixed, the prepare container disappears and thus activation
fails.
I'm fixing this for old jewel images.
Also this fixes the machine reboot case where the docker logs are
purged. In the old scenario, we now store the log locally in the same
directory as the ceph-osd-run.sh script.
Signed-off-by: Sébastien Han <seb@redhat.com>
Setting monitor_interface in group_vars/all.yml makes the
hostvars[host]['monitor_interface'] non-existent.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1507922
Signed-off-by: Sébastien Han <seb@redhat.com>
Only chmod or setfacl the requested keyring(s) in the
openstack_keys data structure when the mode or acls keys
of that data structure exist.
User may specify four permission combinations for the
keyring file(s): 1. only set ACL, 2. only set mode,
3. set neither mode nor ACL, 4. set mode and then ACL.
Fixes: #2092
Use "ceph_tcmalloc_max_total_thread_cache" to set the
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES value inside /etc/default/ceph for
Debian installs, or /etc/sysconfig/ceph for Red Hat/CentOS installs.
By default this is set to 0, so the default package value will be used,
if specified this value will be changed to match the variable, and ceph
osd services will be restarted.
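For example (value picked arbitrarily), raising the cache to 128 MiB instead of keeping the package default:
```
ceph_tcmalloc_max_total_thread_cache: 134217728
```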
There was a huge resync from luminous to jewel in ceph-docker:
https://github.com/ceph/ceph-docker/pull/797
This change brought a new handy function to discover partitions tied to
an OSD. This function doesn't exist in the old image so the
ceph-osd-run.sh script breaks when trying to deploy Jewel OSD with that
old Jewel image version.
Signed-off-by: Sébastien Han <seb@redhat.com>
stable-3.0 brought numerous changes in ceph-ansible variables, this PR
aims to maintain backward compatibility for someone running stable-2.2
upgrading to stable-3.0 but keeps its groups_vars untouched.
We will then determine the right options to make sure the upgrade works
but we are expecting that new variables should be used.
We will drop this in a near future, maybe 3.1 or 3.2.
Signed-off-by: Sébastien Han <seb@redhat.com>
This feature isn't available before luminous; therefore, we need to play
these tasks only on luminous and newer, otherwise the playbook will fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3f3d4b9c727d06154c422d445fc2a245aceaed89)
3a58757 introduced an issue for Jewel deployments, since this role is
skipped, `enabled_ceph_mgr_modules.stdout` doesn't exist, therefore, it
ends up with an attribute error.
Uses `.get()` to retrieve `stdout` with a default value so it won't fail
if this attribute doesn't exist (jewel).
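A hedged sketch of the guarded condition (task layout assumed, not the actual change):
```
- name: enable ceph mgr module(s)
  command: "{{ docker_exec_cmd | default('') }} ceph mgr module enable {{ item }}"
  with_items: "{{ ceph_mgr_modules }}"
  when: item not in enabled_ceph_mgr_modules.get('stdout', '')
```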
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In Jewel, we don't use bootstrap-rbd keyring for rbd-mirror nodes, it
results with a socket path/name different according to which ceph
release you are deploying.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit add new osd scenarios, it aims to simplify the CI setup and
brings a better coverage on the OSD scenarios.
We decided to differentiate between filestore and bluestore, thinking
ahead when filestore won't be supported anymore.
So we now have two classes of tests:
* Filestore
* Bluestore
In each of those classes we have container and non-container.
Then for each we test the following:
* collocated
* collocated dmcrypt
* non-collocated
* non-collocated dmcrypt
* auto discovery collocated
* auto discovery collocated dmcrypt
This gives us a nice coverage and also reduces the footprint on the CI.
We are now up to 4 scenarios, each containing 6 OSD VMs.
Signed-off-by: Sébastien Han <seb@redhat.com>
When doing collocation the condition "inventory_hostname in play_hosts"
is breaking the restart workflow.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurence)
- A bit of cleanup
Signed-off-by: Sébastien Han <seb@redhat.com>
This will solve the following issue when starting docker containers on ubuntu:
invalid argument "1\u00a0" for --cpus=1 : failed to parse 1 as a rational number
Closes-bug: #2056
1. add the variables to docker_collocation
2. trigger the check when a MDS is part of the inventory file, not when
we run on an MDS...
Signed-off-by: Sébastien Han <seb@redhat.com>
The container deployment is serialized, adding this task as a best
effort. If docker is already present we pull the image otherwise we wait
for the role to play.
Signed-off-by: Sébastien Han <seb@redhat.com>
on jewel `ceph-rbd-mirror.target` isn't enabled, therefore, if the node
is rebooted, the service doesn't get started.
from ceph-rbd-mirror unit file:
```
[Install]
WantedBy=ceph-rbd-mirror.target
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This patch simplifies the checks and installation tasks for NTP.
Debian and Red Hat had a check for NTP's presence but would then
install NTP right afterwards anyways. In addition, there were
tasks for atomic that weren't used anywhere else in the role.
This patch also uses a dynamic include to reduce delays from
skipped tasks.
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.
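An illustration of the change (the task itself is hypothetical):
```
# before: jinja2 delimiters inside when, which triggers the warning
- name: install ntp
  package:
    name: ntp
  when: "{{ ntp_service_enabled }}"

# after: bare expression, no warning
- name: install ntp
  package:
    name: ntp
  when: ntp_service_enabled
```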
We now have a variable called ceph_pools that is mandatory when
deploying an MDS.
It's a dictionary that maps a pool name to a PG count. The PG count is
mandatory and must be set; the playbook will fail otherwise.
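A hypothetical illustration of the expected shape (pool names and PG counts are examples only, not defaults):
```
ceph_pools:
  cephfs_data: 64
  cephfs_metadata: 16
```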
Closes: https://github.com/ceph/ceph-ansible/issues/2017
Signed-off-by: Sébastien Han <seb@redhat.com>
Ansible throws warnings when using yum/dnf/rpm with the command
module:
[WARNING]: Consider using yum module rather than running yum
This patch adds the `warn: no` argument to suppress the warnings
in the Ansible output.
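A minimal sketch of the pattern (the task shown is illustrative):
```
- name: query an installed package with rpm
  command: rpm -q ceph-common
  args:
    warn: no
  register: ceph_common_rpm
  changed_when: false
  failed_when: false
```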
The `always_run` key is deprecated and being removed in Ansible 2.4.
Using it causes a warning to be displayed:
[DEPRECATION WARNING]: always_run is deprecated.
This patch changes all instances of `always_run` to use the `always`
tag, which causes the task to run each time the playbook runs.
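An illustration of the swap (the task itself is hypothetical):
```
# before: deprecated, warns since Ansible 2.4
- name: gather selinux status
  command: sestatus
  always_run: true

# after
- name: gather selinux status
  command: sestatus
  tags: always
```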
Modern versions of Ansible can handle a list of packages passed
directly to the package modules. This patch optimizes the package
install process by passing the list of packages directly to the
module.
This is causing unknown issues when trying to start a dmcrypt container.
Basically the container is stuck at mount when opening the LUKS device. It
is still unknown why this causes trouble, but we need to move
forward. Also, this doesn't seem to help in any way to fix the race
condition we've seen.
Here is the log for dmcrypt:
cryptsetup 1.7.4 processing "cryptsetup --debug --verbose --key-file
key luksClose fbf8887d-8694-46ca-b9ff-be79a668e2a9"
Running command close.
Locking memory.
Installing SIGINT/SIGTERM handler.
Unblocking interruption on signal.
Allocating crypt device context by device
fbf8887d-8694-46ca-b9ff-be79a668e2a9.
Initialising device-mapper backend library.
dm version [ opencount flush ] [16384] (*1)
dm versions [ opencount flush ] [16384] (*1)
Detected dm-crypt version 1.14.1, dm-ioctl version 4.35.0.
Device-mapper backend running with UDEV support enabled.
dm status fbf8887d-8694-46ca-b9ff-be79a668e2a9 [ opencount flush ]
[16384] (*1)
Releasing device-mapper backend.
Trying to open and read device /dev/sdc1 with direct-io.
Allocating crypt device /dev/sdc1 context.
Trying to open and read device /dev/sdc1 with direct-io.
Initialising device-mapper backend library.
dm table fbf8887d-8694-46ca-b9ff-be79a668e2a9 [ opencount flush
securedata ] [16384] (*1)
Trying to open and read device /dev/sdc1 with direct-io.
Crypto backend (gcrypt 1.5.3) initialized in cryptsetup library
version 1.7.4.
Detected kernel Linux 3.10.0-693.el7.x86_64 x86_64.
Reading LUKS header of size 1024 from device /dev/sdc1
Key length 32, device size 1943016847 sectors, header size 2050
sectors.
Deactivating volume fbf8887d-8694-46ca-b9ff-be79a668e2a9.
dm status fbf8887d-8694-46ca-b9ff-be79a668e2a9 [ opencount flush ]
[16384] (*1)
Udev cookie 0xd4d14e4 (semid 32769) created
Udev cookie 0xd4d14e4 (semid 32769) incremented to 1
Udev cookie 0xd4d14e4 (semid 32769) incremented to 2
Udev cookie 0xd4d14e4 (semid 32769) assigned to REMOVE task(2) with
flags (0x0)
dm remove fbf8887d-8694-46ca-b9ff-be79a668e2a9 [ opencount flush
retryremove ] [16384] (*1)
fbf8887d-8694-46ca-b9ff-be79a668e2a9: Stacking NODE_DEL [verify_udev]
Udev cookie 0xd4d14e4 (semid 32769) decremented to 1
Udev cookie 0xd4d14e4 (semid 32769) waiting for zero
Signed-off-by: Sébastien Han <seb@redhat.com>
in addition to c4dcdaa20 this commit adds the missing condition on
install tasks for debian_rhcs deployment. Without them, these tasks are
played on any kind of deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
* DBus on host should include ganesha service file
* to allow ganesha container to respond on DBus it needs to run
in --privileged mode (ganesha folks contacted to look at this)
* ceph_nfs_include_exports_dir variable replaced with more general
ceph_nfs_dynamic_exports
This is something that does not belong in `ceph-common`; it is
too specific to the `ceph-iscsi-gw` role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is something that does not belong in `ceph-common`; it is
too specific to the `ceph-nfs` role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Make the `ceph-mgr` role handle the installation of the `ceph-mgr`
package itself, because it's complicated to manage depending on whether
we are going to install jewel vs. luminous.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
During the initial play, the docker command doesn't exist yet and so
there are no stdout_lines for the command. So `get` allows us to fix this by
declaring an empty array if the command fails.
Signed-off-by: Sébastien Han <seb@redhat.com>
Use an intermediate variable to build the final `dedicated_devices` list
to avoid duplicate entries in that array. (We need a 1:1 relation between
`dedicated_devices` and `devices` since we are using a `with_together`
later.)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
With jewel, `bootstrap_rbd_keyring` is not set because of this condition:
```
when:
- ceph_release_num.{{ ceph_release }} >= ceph_release_num.luminous
```
Therefore, the task `try to fetch ceph config and keys` will fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
* Change version from 2 to 3.
* use ceph_rhcs_cdn_debian_repo_version to use other repositories along
  with ceph_rhcs_cdn_debian_repo
Signed-off-by: Sébastien Han <seb@redhat.com>
`ceph_release` isn't available at this step of the playbook because it
is set later based on the installed binaries.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1486062
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The value of doing this is fairly low compared to the added value.
So we remove these tasks, if rbd pool on Jewel doesn't have the right PG
value you can always increase it.
Signed-off-by: Sébastien Han <seb@redhat.com>
All keyrings were getting copied to all nodes.
This commit fixes a leftover from a previous code refactor.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498583
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
iscsi-gw needs a 'rbd' pool to configure iscsi target.
Note: I could have used the facts already set in `ceph-mon` but I voluntarily
didn't, to avoid creating a dependency between these two roles.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refactors the code regarding all `set_osd_pool_default_*`
related tasks by avoiding usage of useless `set_fact` to determine
whether a key is present in `ceph_conf_overrides`.
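The pattern boils down to a single lookup with defaults instead of a chain of conditional set_fact tasks; a hedged sketch (exact placement and default value assumed):
```
osd_pool_default_size: "{{ ceph_conf_overrides.get('global', {}).get('osd_pool_default_size', 3) }}"
```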
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The rbd pool is the default pool that gets created during ceph cluster
initialization. If we act on the rbd related operations too early, the
rbd pool does not exist yet. Move the call to perform rbd operations
to a later stage after other pools have been created.
The rbd_pool.yml playbook has all the operations related to the rbd pool.
Replace the always_run (deprecated) directive with check_mode.
Most of the ceph related tasks only need to run once. The run_once directive
executes the task on the first host.
The ceph sub-command to delete a pool is delete (not rm).
The changes submitted here were tested with this ceph version.
ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)
This upload includes these changes:
- Use the fail module (instead of assert).
- From luminous release, the rbd pool is no longer created by default.
Delete the code to create the rbd pool for luminous release
- Conform the .yml files to use the suggested syntax.
The commands are executed on the mcp nodes and I think shell ansible module
is the right one to use. The command module is used to execute commands on
remote nodes. I can make the change to use command module if that is
preferred.
This is to ensure `docker_exec_cmd` fact is set with the correct value
in case of daemons collocation
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Also we now play ceph-config to have everything being generated for new
daemons bootstrap during upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1497959
Signed-off-by: Sébastien Han <seb@redhat.com>
`generate_crt|bool|default(false)` won't apply the default value;
`generate_crt|default(false)|bool` will.
Signed-off-by: Sébastien Han <seb@redhat.com>
The task which sets `ceph_current_fsid` in `facts.yml` in case of containerized
deployment, will definitely fail because `docker_exec_cmd` is not set
yet.
This commit simply makes `facts.yml` play after `check_socket.yml` so
`docker_exec_cmd` is set properly.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refactors the ceph-mds role.
The goal here is to create cephfs in `ceph-mon` for both containerized
and non-containerized cases so we don't need the admin keyring on mds
nodes anymore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1488999
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Using systemd module allows us to do in one task what we did in three
tasks:
- enable unit file,
- issue a `daemon-reload`,
- start the service
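A hedged sketch of such a task (the unit name is an assumed example):
```
- name: enable and start the ceph-mds service
  systemd:
    name: "ceph-mds@{{ ansible_hostname }}"
    state: started
    enabled: yes
    daemon_reload: yes
```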
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Running the socket check on all the hosts will override the default
value of docker_exec_cmd, leaving it with the last value (currently
rbd-mirror), as a result the subsequent docker_exec_cmd usage for the
Signed-off-by: Sébastien Han <seb@redhat.com>
There is a bug in the rbd mirror unit file, the upstream fix is here:
https://github.com/ceph/ceph/pull/17969.
This should be reverted once the patch is merged and backport is done.
Signed-off-by: Sébastien Han <seb@redhat.com>
This fixes the error :
```
The conditional check 'sestatus.stdout != 'Disabled'' failed.
```
that occurs when running on non rhel based system since the
`sestatus` fact is registered only on rhel based distribution.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Specify the timeout flag to ceph-create-keys, which causes it to time out
if a monitor quorum isn't achieved. This overrides the default timeout
of 10 minutes, causing ceph-ansible to fail faster in the event of cluster
network issues.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
We have seen issues with leftover sockets. So now, if a socket is found
we also check if it's accessed by a process. If so, we can run the
handler, if not we remove it and continue the playbook.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
It's sad but we can not rely on the prepare container anymore since the
logs are flushed after reboot. So inspecting the container does not return
anything.
Now, instead we use an ephemeral container to look for the
journal/block.db/block.wal (depending if filestore or bluestore) and
build the activate command accordingly.
Signed-off-by: Sébastien Han <seb@redhat.com>
need to use `hostvars[host]['XXX']` to retrieve the monitor
interface and/or radosgw interface.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1493920
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>