Extra space in systemctl list-units can cause restart_osd_daemon.sh to
fail
It looks like if you have more services enabled in the node space
between "loaded" and "active" get more space as compared to one space
given in command the command[1].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317
Signed-off-by: Sébastien Han <seb@redhat.com>
Check whether a mgr module is supposed to be disabled before disabling
it and whether it is already enabled before enabling it.
Signed-off-by: Michael Vollman <michael.b.vollman@gmail.com>
We can simply reference the template name since it exists within the
role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH
or playbooks directory for the file.
To make the package installation more efficient we should install
packages as a list rather than as individual tasks or using a
"with_items" loop. The package managers can handle a list passed to them
to install in one go.
We can use a specified list and substitute any packages that are not to
be installed with the ceph-common package, which is installed on every
package install, then apply the unique filter to the package install
list.
There is no need to stat for created mgr keyrings since they are created
anyway when deploying a ceph cluster > jewel. In case of a jewel
deployment we won't enter that block.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This file is a leftover from PR ceph/ceph-ansible#2516
It is not used anymore so it can be removed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Until all the mons haven't been updated to Luminous, there is no way to
create a key. So we should do the key creation in the mon role only if
we are not part of an update.
If we are then the key creation is done after the mons upgrade to
Luminous.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Sébastien Han <seb@redhat.com>
trying to mask target when `/etc/systemd/system/target.service` doesn't
exist seems to be a bug.
There is no need to mask a unit file which doesn't exist.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The order of fs.aio-max-nr (which is hard-coded to 1048576) means that
if you set fs.aio-max-nr in os_tuning_params it will effectively be
ignored for bluestore scenarios.
To resolve this we should move the setting of fs.aio-max-nr above the
setting of os_tuning_params, in this way the operator can define the
value of fs.aio-max-nr to be something other than 1048576 if they want
to.
Additionally, we can make the sysctl settings happen in 1 task rather
than multiple.
trying to set the default value for pg_num to
`hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num'])` will
break in case of external client nodes deployment.
the `pg_num` attribute should be mandatory and be tested in future
`ceph-validate` role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
On containerized deployment,
when upgrading from jewel to luminous, mgr keyring creation fails because the
command to create mgr keyring is executed on a container that is still
running jewel since the container is restarted later to run the new
image, therefore, it fails with bad entity error.
To get around this situation, we can delegate the command to create
these keyrings on the first monitor when we are running the playbook on the last monitor.
That way we ensure we will issue the command on a container that has
been well restarted with the new image.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The Debian and SuSE installs for nfs-ganesha on the non-rhcs repository
requires you to allow_unauthenticated for Debian, and disable_gpg_check
for SuSE. The nfs-ganesha-rgw package already does this, but the
nfs-ganesha-ceph package will fail to install because of this same
issue.
This PR moves the installations to happen when the appropriate flags are
set to True (nfs_obj_gw & nfs_file_gw), but does it per distro (one for
SuSE and one for Debian) so that the appropriate flag can be passed to
ignore the GPG check.
When 'ceph_nfs_disable_caching' is set to True, disable attribute
caching done by Ganesha for all Ganesha exports.
Signed-off-by: Ramana Raja <rraja@redhat.com>
If we are in a middle of an update we want to get the new package
version being installed so the task that copies the repo files should
not be skipped.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572032
Signed-off-by: Sébastien Han <seb@redhat.com>
The apt-cache update can fail due to transient issues related to the
action being a network operation. To reduce the impact of these
transient failures this patch adds a retry to the update_cache task.
However, the apt_repository tasks which would perform an apt_update
won't retry the apt_update on a failure in the same way, as such this PR
moves the apt_update into an individual task, once per role.
Finally, the apt_repository tasks no longer have a changed_when: false,
and the apt_cache update is only performed once per role, if the
repositories change. Otherwise the cache is updated on the "apt" install
tasks if the cache_timeout has been reached.
the value in `docker_exec_client_cmd` doesn't allow to check for
existing pools because it's set with a wrong value for the entrypoint
that is going to be used.
It means the check were going to fail anyway even if pools actually exist.
Using jinja syntax to set `docker_exec_cmd` allows to handle the case
where you don't have monitors in your inventory.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If openstack_pools contains an application key it will be used to apply
this application pool type to a pool.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1562220
Signed-off-by: Sébastien Han <seb@redhat.com>
As of ceph 12.2.5 the type of the parameter `type` is not a name anymore but
an id, therefore an `int` is expected otherwise it will fail with the
following error
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The last mon creates the keys with a particular mode, while copying them
to the other mons (first and second) we must re-use the mode that was
set.
The same applies for the client node, the slurp preserves the initial
'item' so we can get the mode for the copy.
Signed-off-by: Sébastien Han <seb@redhat.com>
This key is created after the last mon is up so there is no need to try
to push it from the first mon. The initia mon container is not creating
the mgr key, ansible does. So this key will never exist.
The key will go into the fetch dir once the last mon is up, then when
the ceph-mgr plays it will try to get it from the fetch directory.
Signed-off-by: Sébastien Han <seb@redhat.com>
During the initial bootstrap of the first mon, the monmap file is
destroyed so it's not available and ansible will never find it.
Signed-off-by: Sébastien Han <seb@redhat.com>
Useful for softwares that do data collection/monitoring like collectd.
They can connect to the socket and then retrieve information.
Even though the sockets are exposed now, I'm keeping the docker exec to
check the socket, this will allow newer version of ceph-ansible to work
with older versions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1563280
Signed-off-by: Sébastien Han <seb@redhat.com>
We now have the ability to detect the uid/gid of the ceph user depending
on the distribution we are running on and so we are doing non-container
deployements.
Signed-off-by: Sébastien Han <seb@redhat.com>
We know bindmount with the :z option at the end of the -v command so
this will basically run the exact same command as we used to run. So to
speak:
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
Signed-off-by: Sébastien Han <seb@redhat.com>
This fixes the case where the playbook died and never removed the
container. So now, once the container exits it will remove itself from
the container list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
If the user has set copy_admin_key to true we assume he/she wants to
import the key in Ceph and not only create the key on the filesystem.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-authtool does not support raw arguements so we have to quote caps
declaration like this allow 'bla bla' instead of allow bla bla
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1568157
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit does a couple of things:
* use a common.yml file that contains things that can be played on both
container and non-container
* refactor the ability to copy the admin key to the nodes
Signed-off-by: Sébastien Han <seb@redhat.com>
Red Hat is now using tags[3,latest] for image rhceph/rhceph-3-rhel7.
Because of this, the ceph_uid conditional passes for Debian
when 'ceph_docker_image_tag: latest' on RH deployments.
I've added an additional task to check for rhceph image specifically,
and also updated the RH family task for ceph/daemon [centos|fedora]tags.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
When installing rhcs on Debian systems the red hat repos must have the
highest priority so we avoid packages conflicts and install the rhcs
version.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1565850
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no need to check for a running cluster n*nodes time in
`ceph-defaults` so let's add a `run_once: true` to save some resources
and time.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Regardless if the partition is 'ceph' or something else, we don't want
to be as strick as checking for a particular partition.
If the drive has a partition, we just don't do anything.
This solves the case where the server reboots, disks get a different
/dev/sda (node) allocation. In this case, prior to restarting the server
/dev/sda was an OSD, but now it's /dev/sdb and the other way around.
In such scenario, we will try to prepare the OSD and create a new
partition, so let's not mess around with devices that have partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498303
Signed-off-by: Sébastien Han <seb@redhat.com>
allow_multimds will be officially deprecated in Mimic, specify it
only for all versions of Ceph where it was declared stable. Going
forward, specify only max_mds.
Signed-off-by: Douglas Fuller <dfuller@redhat.com>
NFS-ganesha cannot start is the nfs-server service
is running. This commit stops nfs-server in case it
is running on a (debian, redhat, suse) node before
the nfs-ganesha service starts up
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1508506
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Add a variable, ceph_nfs_disable_caching, that if set to true
disables ganesha's directory and attribute caching as much as
possible.
Also, disable caching done by ganesha, when 'nfs_file_gw'
variable is true, i.e., when Ganesha is used as CephFS's gateway.
This is the recommended Ganesha setting as libcephfs already caches
information. And doing so helps avoid cache incoherency issues
especially with clustered ganesha over CephFS.
Fixes: https://tracker.ceph.com/issues/23393
Signed-off-by: Ramana Raja <rraja@redhat.com>
If people keep on using the mon_cap, osd_cap etc the playbook will
translate this old syntax on the flight.
Signed-off-by: Sébastien Han <seb@redhat.com>
backward compatibility with `ceph_mon_docker_interface` and
`ceph_mon_docker_subnet` was not working since there wasn't lookup on
`monitor_interface` and `public_network`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this change, if a user had ceph-test-12.2.1 installed, and
upgraded to ceph v12.2.3 or newer, the RPM upgrade process would
fail.
The problem is that the ceph-test RPM did not depend on an exact version
of ceph-common until v12.2.3.
In Ceph v12.2.3, ceph-{osdomap,kvstore,monstore}-tool binaries moved
from ceph-test into ceph-base. When ceph-test is not yet up-to-date, Yum
encounters package conflicts between the older ceph-test and newer
ceph-base.
When all users have upgraded beyond Ceph < 12.2.3, this is no longer
relevant.
According to our recent change, we now use "CentOS" as a latest
container image. We need to reflect this on the ceph_uid.
Signed-off-by: Sébastien Han <seb@redhat.com>
Tripleo deployment failed when the monitors not manged
by tripleo itself with:
FAILED! => {"msg": "list object has no element 0"}
The failing play item was introduced by
f46217b69a .
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1552327
Signed-off-by: Attila Fazekas <afazekas@redhat.com>
because of `serial: 1`, it can be an issue when the playbook is being
run on client nodes.
Since the refact of `ceph-client` we skip the role `ceph-defaults` on
every node except the first client node, it means that the task is not
going to be played because of `run_once: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refacts this role so we don't have to pull container image
on client nodes just to create pools and keys.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1550977
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This seems to be a leftover.
This commit removes an unnecessary 'set linux permissions' on
`/var/lib/ceph`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This check is alone in `ceph-docker-common` since a previous code
refactor.
Moving this check in `ceph-defaults` allows us to run `ceph-clients`
without having to run `ceph-docker-common` even in non-containerized
deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Prior to this patch, the certificates where being generated on a single
node only (because of the run_once: true). Thus certificates were not
distributed on all the gateway nodes.
This would require a second ansible run to work. This patches fix the
creation and keys's distribution on all the nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540845
Signed-off-by: Sébastien Han <seb@redhat.com>
This update will resolve error['cephfs' is undefined.] in multimds container deployments.
See: roles/ceph-mon/tasks/create_mds_filesystems.yml. The same last two tasks are present there, and actully need to happen in that role since "{{ cephfs }}" gets defined in
roles/ceph-mon/defaults/main.yml, and not roles/ceph-mds/defaults/main.yml.
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
Copy the admin key when configured nfs_file_gw (but not nfs_obj_gw). Also,
copy/setup RGW related directories only when configured as nfs_obj_gw.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
This variable is needed for containerized clusters and is required for
the ceph-docker-common role. Typically the is_atomic variable is set in
site-docker.yml.sample though so if ceph-docker-common is used outside
of that playbook it needs set in another way. Moving the creation of
the variable inside this role means playbooks don't need to worry
about setting it.
fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1558252
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Since the approach to creating a ceph.conf file has changed, and now
no-longer relies on assembling config file fragments in /etc/ceph/ceph.d
we can avoid the conf_overrides rendering on the local host and skip out
the tasks related to that, instead using just the config_template task
to configure the file directly.
When creating pools, it's crucial to expose all the options available as
part of the pool creation command. As explained in:
http://docs.ceph.com/docs/jewel/rados/operations/pools/
Signed-off-by: Sébastien Han <seb@redhat.com>
If OSDs don't restart normally we now also dump info of the crush map,
crush rules, crush tree and pools.
If the monitors don't restart normally we also print the socket status
by calling mon_status and quorum_status.
Signed-off-by: Sébastien Han <seb@redhat.com>
The `pools` dict defined in `roles/ceph-client/defaults/main.yml`
shouldn't have `{{ ceph_conf_overrides.global.osd_pool_default_pg_num
}}` as default value for `pgs` keys.
For instance, if you want some pools to be created but without explicitely
specifying the pgs for these pools (it means you want to use the
`osd_pool_default_pg_num`), you will be obliged to define
`{{ ceph_conf_overrides.global.osd_pool_default_pg_num }}` anyway while you
wanted to use the current default value already defined in the cluster which is
retrieved early in the playbook and stored in the
`{{ osd_pool_default_pg_num }}` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Running the last portion (insert new default and add new default crush
tasks) of crush_rules.yml only on the last monitor is
wrong since ceph CLI calls usually end up on the master having the
quorum, which is by default the one with the lower IP.
So if we run the command and end up on another mon the creation will
happen on the default crush rule because the particular mon hasn't been
updated.
To fix this we remove the |last on the include and use run_once: true on
certain tasks, then we let the final two tasks run on all the monitors.
Signed-off-by: Sébastien Han <seb@redhat.com>
On releases after jewel the option
'osd_pool_default_crush_replicated_ruleset' does not exist anymore, it's
called osd_pool_default_crush_rule.
Signed-off-by: Sébastien Han <seb@redhat.com>
This was causing a lot of pain with the handlers. Also the
implementation was not ideal since we were assembling files. Everything
can now be done with the ceph_crush module so let's remove that.
Signed-off-by: Sébastien Han <seb@redhat.com>
Instead of creating the CRUSH hierarchy with Ansible tasks using the
command module we now rely on the ceph_crush module.
Signed-off-by: Sébastien Han <seb@redhat.com>
One could want to add new crush rules while keeping his current default rule.
Fixed it so that it works with all rules defined as "default: false". If multiple rules are defined as default (should not be) then the last rule listed in "crush_rules" is taken as default.
As part of fcba2c801a these vars were
removed and no longer do anything:
radosgw_dns_name
radosgw_resolve_cname
This patch removes them from the group_vars files and defaults/main.yml
If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.
Signed-off-by: Sébastien Han <seb@redhat.com>
In case the admin wasn't copied over to the node this command would
fail. So it's safer to run it from a monitor directly.
Signed-off-by: Sébastien Han <seb@redhat.com>
That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The nfs-ganesha package has been fixed as part of this commit:
963b6681df
Once the package is rebuilt this should be good to merge.
This reverts commit e88af3c4cb.
Previously it was necessary to provide a value (eventually an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
Using updatedb -e doesnt make a permanent change, but will updatedb
without the passed path.
To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.
The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common
Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.
For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
coniguration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first hosts.
To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.
Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.
Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
When used along with delegate, run_once does not belong well. Thus,
using | last always brings the desired result.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now look for any excisting containers, if any we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs
Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.
We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit fixes a bug that occurs especially for dmcrypt scenarios.
There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.
If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :
```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```
It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because he won't have
filled the `$DOCKER_ENV` environment variable properly.
Adding `--net=host` to the 'disk_list' container fixes this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
While hostname -f will always return an hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml
Signed-off-by: Greg Charot <gcharot@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
module: wait_for
port: 22
host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
state: started
delay: 10
timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.
This also fix a potential issue about missing quotation :
```
Traceback (most recent call last):
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
main()
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
File "/usr/lib64/python2.7/shlex.py", line 279, in split
return list(lex) File "/usr/lib64/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
With two public networks configured - we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" install process consistently became
broken, trying to find docker registry on second network, and not
finding mon container.
but without spaces
"NETWORK_ADDR_1,NETWORK_ADDR_2" install succeeds
so, containerized install is more peculiar with formatting of this line
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
Description of problem: The 'get osd id' task goes through all the 10 times (and its respective timeouts) to make sure that the number of OSDs in the osd directory match the number of devices.
This happens always, regardless if the setup and deployment is correct.
Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.
How reproducible: 100%
Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster
Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
"attempts": 10,
"changed": false,
"cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
"delta": "0:00:00.002717",
"end": "2018-01-21 18:10:31.237933",
"failed": true,
"failed_when_result": false,
"rc": 0,
"start": "2018-01-21 18:10:31.235216"
}
STDOUT:
0
1
2
Expected results:
There aren't any (or just a few) timeouts while the OSDs are found
Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.
Basically this line:
until: osd_id.stdout_lines|length == devices|unique|length
Means in this 2 OSD case it is trying to ensure the following incorrect condition:
until: 2 == 0
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.
Fixes: #2111
This allows us to use host-specific variables in ceph_conf_overrides variable. For example, this fixes usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
Sometime the playbook gets stuck because even with `--connect-timeout=`
option, the connexion to the existing ceph cluster never timeout.
As a workaround, using `timeout` command provided by coreutils will
actually timeout if we can't connect to the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537003
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is to keep backward compatibility with stable-2.2 and satisfy the
check "verify dedicated devices have been provided" in
`check_mandatory_vars.yml`. This check is looking for
`dedicated_devices` so we need to default it's value to
`raw_journal_devices` when `raw_multi_journal` is set to `True`.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1536098
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Some systems that were deployed with old tools can leave units named
"ceph-radosgw@radosgw.gateway.service". As a consequence, they will
prevent the new unit to start.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1509584
Signed-off-by: Sébastien Han <seb@redhat.com>
Currently, we can define crush location for each host but only crush roots and crush rules are created. This commit automates other routines for a complete solution:
1) Creates rack type crush buckets defined in {{ ceph_crush_rack }} of each osd host. If it's not defined by user then a rack named 'default_rack_{{ ceph_crush_root }}' would be added and used in next steps.
2) Move rack type crush buckets defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
3) Move hosts defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
On a non-collocated scenario, if a drive is faulty we can't really
remove it from the list of 'devices' without messing up or having to
re-arrange the order of the 'dedicated_devices'. We want to keep this
device list ordered. This will prevent the activation failing on a
device that we know is failing but we can't remove it yet to not mess up
the dedicated_devices mapping with devices.
Signed-off-by: Sébastien Han <seb@redhat.com>
Having handlers in both ceph-defaults and ceph-docker-common roles can make the
playbook restarting two times services. Handlers can be triggered first
time because of a change in ceph.conf and a second time because a new
image has been pulled.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This wasn't any good choice to implement this.
We had several options and none of them were ideal since handlers can
not be triggered cross-roles.
We could have achieved that by doing:
* option 1 was to add a dependancy in the meta of the ceph-docker-common
role. We had that long ago and we decided to stop so everything is
managed via site.yml
* option 2 was to import files from another role. This is messy and we
don't that anywhere in the current code base. We will continue to do so.
There is option 3 where we pull the image from the ceph-config role.
This is not suitable as well since the docker command won't be available
unless you run Atomic distro. This would also mean that you're trying to
pull twice. First time in ceph-config, second time in ceph-docker-common
The only option I came up with was to duplicate a bit of the ceph-config
handlers code.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
When containerized deployment, `docker_exec_cmd` is not set before the
task which try to retrieve the current fsid is played, it means it
considers there is no existing fsid and try to generate a new one.
Typical error:
```
ok: [mon0 -> mon0] => {
"changed": false,
"cmd": [
"ceph",
"--connect-timeout",
"3",
"--cluster",
"test",
"fsid"
],
"delta": "0:00:00.179909",
"end": "2018-01-09 10:36:58.759846",
"failed": false,
"failed_when_result": false,
"rc": 1,
"start": "2018-01-09 10:36:58.579937"
}
STDERR:
Error initializing cluster client: Error('error calling conf_read_file: errno EINVAL',)
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Previously we were using ceph_conf_overrides however this doesn't play
nice for softwares like TripleO that uses ceph_conf_overrides inside its
own code. For now, and since this is the only occurence of this, we can
ensure no logs through the ceph conf template.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1532619
Signed-off-by: Sébastien Han <seb@redhat.com>
There is no reasons why we can't use crush rules when deploying
containers. So moving the inlcude in the main.yml so it can be called.
Signed-off-by: Sébastien Han <seb@redhat.com>
ceph-create-keys is idempotent so it's not an issue to run it each time
we play ansible. This also fix issues where the 'creates' arg skips the
task and no keys get generated on newer version, e.g during an upgrade.
Closes: https://github.com/ceph/ceph-ansible/issues/2228
Signed-off-by: Sébastien Han <seb@redhat.com>
When upgrading from OSP11 to OSP12 container, ceph-ansible attempts to
disable the RGW service provided by the overcloud image. The task
attempts to stop/disable ceph-rgw@{{ ansible-hostname }} and
ceph-radosgw@{{ ansible-hostname }}.service. The actual service name is
ceph-radosgw@radosgw.$name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525209
Signed-off-by: Sébastien Han <seb@redhat.com>
the gpt label creation doesn't work even with parted module.
This commit fixes the gpt label creation by using parted command
instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We have a scenario when we switch from non-container to containers. This
means we don't know anything about the ceph partitions associated to an
OSD. Normally in a containerized context we have files containing the
preparation sequence. From these files we can get the capabilities of
each OSD. As a last resort we use a ceph-disk call inside a dummy bash
container to discover the ceph journal on the current osd.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525612
Signed-off-by: Sébastien Han <seb@redhat.com>
This resolves the following error:
E: There were unauthenticated packages and -y was used without
--allow-unauthenticated
Signed-off-by: Sébastien Han <seb@redhat.com>
The name docker_version is very generic and is also used by other
roles. As a result, there may be name conflicts. To avoid this a
ceph_ prefix should be used for this fact. Since it is an internal
fact renaming is not a problem.
making `osd_pool_default_pg_num` mandatory is a bit agressive and is
unrelated when you just want to create users keyrings.
Closes: #2241
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the entrypoint to generate users keyring is `ceph-authtool`, therefore,
it can expand the `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyring are properly filled when
`user_config: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>