This key is created after the last mon is up so there is no need to try
to push it from the first mon. The initia mon container is not creating
the mgr key, ansible does. So this key will never exist.
The key will go into the fetch dir once the last mon is up, then when
the ceph-mgr plays it will try to get it from the fetch directory.
Signed-off-by: Sébastien Han <seb@redhat.com>
During the initial bootstrap of the first mon, the monmap file is
destroyed so it's not available and ansible will never find it.
Signed-off-by: Sébastien Han <seb@redhat.com>
Useful for softwares that do data collection/monitoring like collectd.
They can connect to the socket and then retrieve information.
Even though the sockets are exposed now, I'm keeping the docker exec to
check the socket, this will allow newer version of ceph-ansible to work
with older versions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1563280
Signed-off-by: Sébastien Han <seb@redhat.com>
We now have the ability to detect the uid/gid of the ceph user depending
on the distribution we are running on and so we are doing non-container
deployements.
Signed-off-by: Sébastien Han <seb@redhat.com>
We know bindmount with the :z option at the end of the -v command so
this will basically run the exact same command as we used to run. So to
speak:
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
Signed-off-by: Sébastien Han <seb@redhat.com>
This fixes the case where the playbook died and never removed the
container. So now, once the container exits it will remove itself from
the container list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
If the user has set copy_admin_key to true we assume he/she wants to
import the key in Ceph and not only create the key on the filesystem.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-authtool does not support raw arguements so we have to quote caps
declaration like this allow 'bla bla' instead of allow bla bla
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit does a couple of things:
* use a common.yml file that contains things that can be played on both
container and non-container
* refactor the ability to copy the admin key to the nodes
Signed-off-by: Sébastien Han <seb@redhat.com>
Red Hat is now using tags[3,latest] for image rhceph/rhceph-3-rhel7.
Because of this, the ceph_uid conditional passes for Debian
when 'ceph_docker_image_tag: latest' on RH deployments.
I've added an additional task to check for rhceph image specifically,
and also updated the RH family task for ceph/daemon [centos|fedora]tags.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
When installing rhcs on Debian systems the red hat repos must have the
highest priority so we avoid packages conflicts and install the rhcs
version.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1565850
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no need to check for a running cluster n*nodes time in
`ceph-defaults` so let's add a `run_once: true` to save some resources
and time.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Regardless if the partition is 'ceph' or something else, we don't want
to be as strick as checking for a particular partition.
If the drive has a partition, we just don't do anything.
This solves the case where the server reboots, disks get a different
/dev/sda (node) allocation. In this case, prior to restarting the server
/dev/sda was an OSD, but now it's /dev/sdb and the other way around.
In such scenario, we will try to prepare the OSD and create a new
partition, so let's not mess around with devices that have partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498303
Signed-off-by: Sébastien Han <seb@redhat.com>
allow_multimds will be officially deprecated in Mimic, specify it
only for all versions of Ceph where it was declared stable. Going
forward, specify only max_mds.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
NFS-ganesha cannot start is the nfs-server service
is running. This commit stops nfs-server in case it
is running on a (debian, redhat, suse) node before
the nfs-ganesha service starts up
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Add a variable, ceph_nfs_disable_caching, that if set to true
disables ganesha's directory and attribute caching as much as
possible.
Also, disable caching done by ganesha, when 'nfs_file_gw'
variable is true, i.e., when Ganesha is used as CephFS's gateway.
This is the recommended Ganesha setting as libcephfs already caches
information. And doing so helps avoid cache incoherency issues
especially with clustered ganesha over CephFS.
Fixes: https://tracker.ceph.com/issues/23393
Signed-off-by: Ramana Raja <rraja@redhat.com>
If people keep on using the mon_cap, osd_cap etc the playbook will
translate this old syntax on the flight.
Signed-off-by: Sébastien Han <seb@redhat.com>
backward compatibility with `ceph_mon_docker_interface` and
`ceph_mon_docker_subnet` was not working since there wasn't lookup on
`monitor_interface` and `public_network`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this change, if a user had ceph-test-12.2.1 installed, and
upgraded to ceph v12.2.3 or newer, the RPM upgrade process would
fail.
The problem is that the ceph-test RPM did not depend on an exact version
of ceph-common until v12.2.3.
In Ceph v12.2.3, ceph-{osdomap,kvstore,monstore}-tool binaries moved
from ceph-test into ceph-base. When ceph-test is not yet up-to-date, Yum
encounters package conflicts between the older ceph-test and newer
ceph-base.
When all users have upgraded beyond Ceph < 12.2.3, this is no longer
relevant.
According to our recent change, we now use "CentOS" as a latest
container image. We need to reflect this on the ceph_uid.
Signed-off-by: Sébastien Han <seb@redhat.com>
Tripleo deployment failed when the monitors not manged
by tripleo itself with:
FAILED! => {"msg": "list object has no element 0"}
The failing play item was introduced by
f46217b69a .
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1552327
Signed-off-by: Attila Fazekas <afazekas@redhat.com>
because of `serial: 1`, it can be an issue when the playbook is being
run on client nodes.
Since the refact of `ceph-client` we skip the role `ceph-defaults` on
every node except the first client node, it means that the task is not
going to be played because of `run_once: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refacts this role so we don't have to pull container image
on client nodes just to create pools and keys.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1550977
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This seems to be a leftover.
This commit removes an unnecessary 'set linux permissions' on
`/var/lib/ceph`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This check is alone in `ceph-docker-common` since a previous code
refactor.
Moving this check in `ceph-defaults` allows us to run `ceph-clients`
without having to run `ceph-docker-common` even in non-containerized
deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this patch, the certificates where being generated on a single
node only (because of the run_once: true). Thus certificates were not
distributed on all the gateway nodes.
This would require a second ansible run to work. This patches fix the
creation and keys's distribution on all the nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540845
Signed-off-by: Sébastien Han <seb@redhat.com>
This update will resolve error['cephfs' is undefined.] in multimds container deployments.
See: roles/ceph-mon/tasks/create_mds_filesystems.yml. The same last two tasks are present there, and actully need to happen in that role since "{{ cephfs }}" gets defined in
roles/ceph-mon/defaults/main.yml, and not roles/ceph-mds/defaults/main.yml.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
Copy the admin key when configured nfs_file_gw (but not nfs_obj_gw). Also,
copy/setup RGW related directories only when configured as nfs_obj_gw.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This variable is needed for containerized clusters and is required for
the ceph-docker-common role. Typically the is_atomic variable is set in
site-docker.yml.sample though so if ceph-docker-common is used outside
of that playbook it needs set in another way. Moving the creation of
the variable inside this role means playbooks don't need to worry
about setting it.
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1558252
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Since the approach to creating a ceph.conf file has changed, and now
no-longer relies on assembling config file fragments in /etc/ceph/ceph.d
we can avoid the conf_overrides rendering on the local host and skip out
the tasks related to that, instead using just the config_template task
to configure the file directly.
When creating pools, it's crucial to expose all the options available as
part of the pool creation command. As explained in:
http://docs.ceph.com/docs/jewel/rados/operations/pools/
Signed-off-by: Sébastien Han <seb@redhat.com>
If OSDs don't restart normally we now also dump info of the crush map,
crush rules, crush tree and pools.
If the monitors don't restart normally we also print the socket status
by calling mon_status and quorum_status.
Signed-off-by: Sébastien Han <seb@redhat.com>
The `pools` dict defined in `roles/ceph-client/defaults/main.yml`
shouldn't have `{{ ceph_conf_overrides.global.osd_pool_default_pg_num
}}` as default value for `pgs` keys.
For instance, if you want some pools to be created but without explicitely
specifying the pgs for these pools (it means you want to use the
`osd_pool_default_pg_num`), you will be obliged to define
`{{ ceph_conf_overrides.global.osd_pool_default_pg_num }}` anyway while you
wanted to use the current default value already defined in the cluster which is
retrieved early in the playbook and stored in the
`{{ osd_pool_default_pg_num }}` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Running the last portion (insert new default and add new default crush
tasks) of crush_rules.yml only on the last monitor is
wrong since ceph CLI calls usually end up on the master having the
quorum, which is by default the one with the lower IP.
So if we run the command and end up on another mon the creation will
happen on the default crush rule because the particular mon hasn't been
updated.
To fix this we remove the |last on the include and use run_once: true on
certain tasks, then we let the final two tasks run on all the monitors.
Signed-off-by: Sébastien Han <seb@redhat.com>
On releases after jewel the option
'osd_pool_default_crush_replicated_ruleset' does not exist anymore, it's
called osd_pool_default_crush_rule.
Signed-off-by: Sébastien Han <seb@redhat.com>
This was causing a lot of pain with the handlers. Also the
implementation was not ideal since we were assembling files. Everything
can now be done with the ceph_crush module so let's remove that.
Signed-off-by: Sébastien Han <seb@redhat.com>
Instead of creating the CRUSH hierarchy with Ansible tasks using the
command module we now rely on the ceph_crush module.
Signed-off-by: Sébastien Han <seb@redhat.com>
One could want to add new crush rules while keeping his current default rule.
Fixed it so that it works with all rules defined as "default: false". If multiple rules are defined as default (should not be) then the last rule listed in "crush_rules" is taken as default.
As part of fcba2c801a these vars were
removed and no longer do anything:
radosgw_dns_name
radosgw_resolve_cname
This patch removes them from the group_vars files and defaults/main.yml
If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.
Signed-off-by: Sébastien Han <seb@redhat.com>
In case the admin wasn't copied over to the node this command would
fail. So it's safer to run it from a monitor directly.
Signed-off-by: Sébastien Han <seb@redhat.com>
That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The nfs-ganesha package has been fixed as part of this commit:
963b6681df
Once the package is rebuilt this should be good to merge.
This reverts commit e88af3c4cb.
Previously it was necessary to provide a value (eventually an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
Using updatedb -e doesnt make a permanent change, but will updatedb
without the passed path.
To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.
The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common
Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.
For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
coniguration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first hosts.
To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.
Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.
Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
When used along with delegate, run_once does not belong well. Thus,
using | last always brings the desired result.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now look for any excisting containers, if any we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs
Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.
We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit fixes a bug that occurs especially for dmcrypt scenarios.
There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.
If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :
```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```
It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because he won't have
filled the `$DOCKER_ENV` environment variable properly.
Adding `--net=host` to the 'disk_list' container fixes this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
While hostname -f will always return an hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml
Signed-off-by: Greg Charot <gcharot@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
module: wait_for
port: 22
host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
state: started
delay: 10
timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.
This also fix a potential issue about missing quotation :
```
Traceback (most recent call last):
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
main()
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
File "/usr/lib64/python2.7/shlex.py", line 279, in split
return list(lex) File "/usr/lib64/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
With two public networks configured - we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" install process consistently became
broken, trying to find docker registry on second network, and not
finding mon container.
but without spaces
"NETWORK_ADDR_1,NETWORK_ADDR_2" install succeeds
so, containerized install is more peculiar with formatting of this line
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
Description of problem: The 'get osd id' task goes through all the 10 times (and its respective timeouts) to make sure that the number of OSDs in the osd directory match the number of devices.
This happens always, regardless if the setup and deployment is correct.
Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.
How reproducible: 100%
Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster
Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
"attempts": 10,
"changed": false,
"cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
"delta": "0:00:00.002717",
"end": "2018-01-21 18:10:31.237933",
"failed": true,
"failed_when_result": false,
"rc": 0,
"start": "2018-01-21 18:10:31.235216"
}
STDOUT:
0
1
2
Expected results:
There aren't any (or just a few) timeouts while the OSDs are found
Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.
Basically this line:
until: osd_id.stdout_lines|length == devices|unique|length
Means in this 2 OSD case it is trying to ensure the following incorrect condition:
until: 2 == 0
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.
Fixes: #2111
This allows us to use host-specific variables in ceph_conf_overrides variable. For example, this fixes usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
Sometime the playbook gets stuck because even with `--connect-timeout=`
option, the connexion to the existing ceph cluster never timeout.
As a workaround, using `timeout` command provided by coreutils will
actually timeout if we can't connect to the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537003
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is to keep backward compatibility with stable-2.2 and satisfy the
check "verify dedicated devices have been provided" in
`check_mandatory_vars.yml`. This check is looking for
`dedicated_devices` so we need to default it's value to
`raw_journal_devices` when `raw_multi_journal` is set to `True`.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1536098
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Some systems that were deployed with old tools can leave units named
"ceph-radosgw@radosgw.gateway.service". As a consequence, they will
prevent the new unit to start.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1509584
Signed-off-by: Sébastien Han <seb@redhat.com>
Currently, we can define crush location for each host but only crush roots and crush rules are created. This commit automates other routines for a complete solution:
1) Creates rack type crush buckets defined in {{ ceph_crush_rack }} of each osd host. If it's not defined by user then a rack named 'default_rack_{{ ceph_crush_root }}' would be added and used in next steps.
2) Move rack type crush buckets defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
3) Move hosts defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
On a non-collocated scenario, if a drive is faulty we can't really
remove it from the list of 'devices' without messing up or having to
re-arrange the order of the 'dedicated_devices'. We want to keep this
device list ordered. This will prevent the activation failing on a
device that we know is failing but we can't remove it yet to not mess up
the dedicated_devices mapping with devices.
Signed-off-by: Sébastien Han <seb@redhat.com>
Having handlers in both ceph-defaults and ceph-docker-common roles can make the
playbook restarting two times services. Handlers can be triggered first
time because of a change in ceph.conf and a second time because a new
image has been pulled.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This wasn't any good choice to implement this.
We had several options and none of them were ideal since handlers can
not be triggered cross-roles.
We could have achieved that by doing:
* option 1 was to add a dependancy in the meta of the ceph-docker-common
role. We had that long ago and we decided to stop so everything is
managed via site.yml
* option 2 was to import files from another role. This is messy and we
don't that anywhere in the current code base. We will continue to do so.
There is option 3 where we pull the image from the ceph-config role.
This is not suitable as well since the docker command won't be available
unless you run Atomic distro. This would also mean that you're trying to
pull twice. First time in ceph-config, second time in ceph-docker-common
The only option I came up with was to duplicate a bit of the ceph-config
handlers code.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
When containerized deployment, `docker_exec_cmd` is not set before the
task which try to retrieve the current fsid is played, it means it
considers there is no existing fsid and try to generate a new one.
Typical error:
```
ok: [mon0 -> mon0] => {
"changed": false,
"cmd": [
"ceph",
"--connect-timeout",
"3",
"--cluster",
"test",
"fsid"
],
"delta": "0:00:00.179909",
"end": "2018-01-09 10:36:58.759846",
"failed": false,
"failed_when_result": false,
"rc": 1,
"start": "2018-01-09 10:36:58.579937"
}
STDERR:
Error initializing cluster client: Error('error calling conf_read_file: errno EINVAL',)
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Previously we were using ceph_conf_overrides however this doesn't play
nice for softwares like TripleO that uses ceph_conf_overrides inside its
own code. For now, and since this is the only occurence of this, we can
ensure no logs through the ceph conf template.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1532619
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no reasons why we can't use crush rules when deploying
containers. So moving the inlcude in the main.yml so it can be called.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-create-keys is idempotent so it's not an issue to run it each time
we play ansible. This also fix issues where the 'creates' arg skips the
task and no keys get generated on newer version, e.g during an upgrade.
Closes: https://github.com/ceph/ceph-ansible/issues/2228
Signed-off-by: Sébastien Han <seb@redhat.com>
When upgrading from OSP11 to OSP12 container, ceph-ansible attempts to
disable the RGW service provided by the overcloud image. The task
attempts to stop/disable ceph-rgw@{{ ansible-hostname }} and
ceph-radosgw@{{ ansible-hostname }}.service. The actual service name is
ceph-radosgw@radosgw.$name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525209
Signed-off-by: Sébastien Han <seb@redhat.com>
the gpt label creation doesn't work even with parted module.
This commit fixes the gpt label creation by using parted command
instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We have a scenario when we switch from non-container to containers. This
means we don't know anything about the ceph partitions associated to an
OSD. Normally in a containerized context we have files containing the
preparation sequence. From these files we can get the capabilities of
each OSD. As a last resort we use a ceph-disk call inside a dummy bash
container to discover the ceph journal on the current osd.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525612
Signed-off-by: Sébastien Han <seb@redhat.com>
This resolves the following error:
E: There were unauthenticated packages and -y was used without
--allow-unauthenticated
Signed-off-by: Sébastien Han <seb@redhat.com>
The name docker_version is very generic and is also used by other
roles. As a result, there may be name conflicts. To avoid this a
ceph_ prefix should be used for this fact. Since it is an internal
fact renaming is not a problem.
making `osd_pool_default_pg_num` mandatory is a bit agressive and is
unrelated when you just want to create users keyrings.
Closes: #2241
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the entrypoint to generate users keyring is `ceph-authtool`, therefore,
it can expand the `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyring are properly filled when
`user_config: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add the variables ceph_osd_docker_cpuset_cpus and
ceph_osd_docker_cpuset_mems, so that a user may specify
the CPUs and memory nodes of NUMA systems on which OSD
containers are run.
Provides a example in osds.yaml.sample to guide user
based on sample `lscpu` output since cpuset-mems refers
to the memory by NUMA node only while cpuset-cpus can
refer to individual vCPUs within a NUMA node.
If a deployer uses an interface name with a dash/hyphen in it, such
as 'br-storage' for the monitor_interface group_var, the ceph.conf.j2
template fails to find the right facts. It looks for
'ansible_br-storage' but only 'ansible_br_storage' exists.
This patch converts the interface name to underscores when the
template does the fact lookup.
The CI complains because of `ceph_uid` fact which doesn't exist since
the docker image tag used in the CI doesn't match with this condition.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is particularly useful in CI environments where you dont have
the option of adding extra devices or volumes to the host. It is also
a simple change to support loopback devices
In case where docker CLI is available but docker is not running, we
don't want to trigger the restart of the daemons.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since some daemons now install their own packages the task checking the
ceph version fails on Debian systems. So the 'ceph-common' package must
be installed on all the machines.
Signed-off-by: Sébastien Han <seb@redhat.com>
This task was originally needed to fix a docker installation issue
(see: #1030). This has been fixed, therefore it can be removed.
Fixes: #2199
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
A recent change [1] required that the openstack_keys
param always containe an acls list. However, it's
possible it might not contain that list. Thus, this
param sets a default for that list to be empty if it
is not in the structure as defined by the user.
[1] d65cbaa539
We were activating dmcrypt devices with the wrong command. Basically the
first task execute the wrong activate command. The task fails but
continues because of the 'failed_when: false'. Then the right activation
sequence is being done by the next task.
Signed-off-by: Sébastien Han <seb@redhat.com>
when `ceph-rbd-mirror.target` is not enabled, the service won't start
after a reboot because there is a dependency between these two units.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If ceph-ansible deploys a Ceph cluster with "openstack_config: true"
and sets the openstack_keys map to have certain ACLs or permissions,
the requested ACLs or permissions are only set on one of the monitor
nodes [2] when they should be set on all of them.
This patch solves [3] the above issue by having the chmod and setfacl
tasks iterate the list of mon nodes (including the mon node that the
task was delegated to) to apply the chmod of setfacl to the keys in
openstack_keys.
[1]
```
openstack_keys:
- { name: client.openstack, key: "$(ceph-authtool --gen-print-key)", mon_cap: "allow r", osd_cap: "allow class-read object_prefix rbd_children, allow rwx pool=images, allow rwx pool=vms, allow rwx pool=volumes, allow rwx pool=backups", mode: "0600", acls: ["u:nova:r--", "u:cinder:r--", "u:glance:r--", "u:gnocchi:r--"] }
```
[2]
```
$ ansible mons -m shell -b -a "ls -l /etc/ceph/ceph.client.openstack.keyring ; getfacl /etc/ceph/ceph.client.openstack.keyring"
192.168.1.26 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.29 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
group::r--
other::r--getfacl: Removing leading '/' from absolute path names
192.168.1.23 | SUCCESS | rc=0 >>
-rw-r--r--. 1 root root 253 Nov 3 20:30 /etc/ceph/ceph.client.openstack.keyring
user::rw-
group::r--
other::r--getfacl: Removing leading '/' from absolute path names
$
```
[3]
```
(undercloud) [stack@hci-director ceph-ansible]$ ansible mons -m shell -b -a "ls -l /etc/ceph/ceph.client.openstack.keyring ; getfacl /etc/ceph/ceph.client.openstack.keyring"
192.168.1.25 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.29 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
192.168.1.27 | SUCCESS | rc=0 >>
-rw-r-----+ 1 root root 253 Nov 14 01:12 /etc/ceph/ceph.client.openstack.keyring
user::rw-
user:glance:r--
user:nova:r--
user:cinder:r--
user:gnocchi:r--
group::---
mask::r--
other::---getfacl: Removing leading '/' from absolute path names
(undercloud) [stack@hci-director ceph-ansible]$
```
Add support for the openSUSE distributions. The required packages
are available either in the distribution repositories or in the
OBS one.
Signed-off-by: Markos Chandras <mchandras@suse.de>
When we consume the distribution packages, we don't have the choise on
which version to install, so we shouldn't require that variable to be
set. Distributions normally provide only one version of Ceph in the
official repositories so we get whatever they provide.
Signed-off-by: Markos Chandras <mchandras@suse.de>
openSUSE Leap 42.3 provides support for Ceph Luminous in both the
distribution package and the latest available version in the OBS
repository so add these as the only available installation methods for
openSUSE.
Signed-off-by: Markos Chandras <mchandras@suse.de>
Like 80d32dec, the path to the fact is not correct.
In any case, we will retrieve the IP address in hostvars, the variable
is the way we get the interface name according where it has been set
(eg.: inventory host file vs. group_vars/)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
there is no need to have a condition on this task, this test should be
always run since the result will be interpreted later.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This will prevent ceph-ansible from using a loop device while it
shouldn't in auto_discovery mode.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The path to the fact is not correct.
In any case, we will retrieve the IP address in hostvars, the variable
is the way we get the interface name according where it has been set
(eg.: inventory host file vs. group_vars/)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Use `devices` variable instead of `ansible_devices`, otherwise it means
we are not using the devices which have been 'auto discovered'
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current code will also return lvm devices such as /dev/dm-2, this
kind of device type is not supported by ceph-disk at the moment. Now we
just ignore them.
Signed-off-by: Sébastien Han <seb@redhat.com>
- One can not run scripts directly in place, that mounted with `noexec`
option. But one can run scripts as arguments for `bash/sh`.
Signed-off-by: Arano-kai <captcha.is.evil@gmail.com>
During the initial implementation of this 'old' thing we were falling
into this issue without noticing
https://github.com/moby/moby/issues/30341 and where blindly using --rm,
now this is fixed the prepare container disappears and thus activation
fail.
I'm fixing this for old jewel images.
Also this fixes the machine reboot case where the docker logs are
purgend. In the old scenario, we now store the log locally in the same
directory as the ceph-osd-run.sh script.
Signed-off-by: Sébastien Han <seb@redhat.com>
Setting monitor_interface in group_vars/all.yml makes the
hostvars[host]['monitor_interface'] non-existing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1507922
Signed-off-by: Sébastien Han <seb@redhat.com>
Only chmod or setfacl the requested keyring(s) in the
opentack_keys data structure when the mode or acls keys
of that data structure exist.
User may specify four permission combinations for the
keyring file(s): 1. only set ACL, 2. only set mode,
3. set neither mode nor ACL, 4. set mode and then ACL.
Fixes: #2092
Use "ceph_tcmalloc_max_total_thread_cache" to set the
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES value inside /etc/default/ceph for
Debian installs, or /etc/sysconfig/ceph for Red Hat/CentOS installs.
By default this is set to 0, so the default package value will be used,
if specified this value will be changed to match the variable, and ceph
osd services will be restarted.
There was a huge resync from luminous to jewel in ceph-docker:
https://github.com/ceph/ceph-docker/pull/797
This change brought a new handy function to discover partitions tight to
an OSD. This function doesn't exist in the old image so the
ceph-osd-run.sh script breaks when trying to deploy Jewel OSD with that
old Jewel image version.
Signed-off-by: Sébastien Han <seb@redhat.com>
stable-3.0 brought numerous changes in ceph-ansible variables, this PR
aims to maintain backward compatibility for someone running stable-2.2
upgrading to stable-3.0 but keeps its groups_vars untouched.
We will then determine the right options to make sure the upgrade works
but we are expecting that new variables should be used.
We will drop this in a near future, maybe 3.1 or 3.2.
Signed-off-by: Sébastien Han <seb@redhat.com>
This feature isn't available before luminous, therefore, we need to play
them only on luminous and after otherwise the playbook will fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3f3d4b9c727d06154c422d445fc2a245aceaed89)
3a58757 introduced an issue for Jewel deployments, since this role is
skipped, `enabled_ceph_mgr_modules.stdout` doesn't exist, therefore, it
ends up with an attribute error.
Uses `.get()` to retrieve `stdout` with a default value so it won't fail
if this attribute doesn't exist (jewel).
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In Jewel, we don't use bootstrap-rbd keyring for rbd-mirror nodes, it
results with a socket path/name different according to which ceph
release you are deploying.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit add new osd scenarios, it aims to simplify the CI setup and
brings a better coverage on the OSD scenarios.
We decided to differentiate between filestore and bluestore, thinking
ahead when filestore won't be supported anymore.
So we now have two classes of tests:
* Filestore
* Bluestore
In each of those classes we have container and non-container.
Then for each we test the following:
* collocated
* collocated dmcrypt
* non-collocated
* non-collocated dmcrypt
* auto discovery collocated
* auto discovery collocated dmcrypt
This gives us a nice coverage and also reduces the footprint on the CI.
We are now up to 4 scenarios, each containing 6 OSD VMs.
Signed-off-by: Sébastien Han <seb@redhat.com>
When doing collocation the condition "inventory_hostname in play_hosts"
is breaking the restart workflow.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurence)
- A bit of cleanup
Signed-off-by: Sébastien Han <seb@redhat.com>
This will solve the following issue when starting docker containers on ubuntu:
invalid argument "1\u00a0" for --cpus=1 : failed to parse 1 as a rational number
Closes-bug: #2056
1. add the variables to docker_collocation
2. trigger the check when a MDS is part of the inventory file, not when
we run on an MDS...
Signed-off-by: Sébastien Han <seb@redhat.com>
The container deployment is serialized, adding this task as a best
effort. If docker is already present we pull the image otherwise we wait
for the role to play.
Signed-off-by: Sébastien Han <seb@redhat.com>
on jewel `ceph-rbd-mirror.target` isn't enabled, therefore, if the node
is rebooted, the service doesn't get started.
from ceph-rbd-mirror unit file:
```
[Install]
WantedBy=ceph-rbd-mirror.target
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This patch simplifies the checks and installation tasks for NTP.
Debian and Red Hat had a check for NTP's presence but would then
install NTP right afterwards anyways. In addition, there were
tasks for atomic that weren't used anywhere else in the role.
This patch also uses a dynamic include to reduce delays from
skipped tasks.
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.