Commit Graph

1679 Commits (4a8986459f2ed7e077390162d0df431a3321a478)

Author SHA1 Message Date
Andy McCrae 2779d2a850 Adjust /etc/updatedb.conf to not parse /var/lib/ceph
Using updatedb -e doesnt make a permanent change, but will updatedb
without the passed path.

To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
2018-02-20 11:32:56 +01:00
Andy McCrae e88af3c4cb [TEST] Test setting up correct systemd file for nfs-ganesha
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.

The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
2018-02-20 10:49:37 +01:00
Paul Bourke 463b5c6b22 Remove redundant task to check if atomic
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common

Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
2018-02-19 10:10:46 +01:00
Andy McCrae 59a4335a56 Restart services if handler called
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.

For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
coniguration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first hosts.

To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.

Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.

Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
2018-02-16 10:40:20 +01:00
Sébastien Han c816a9282c container: osd remove run_once
When used along with  delegate, run_once does not belong well. Thus,
using | last always brings the desired result.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han d47d02a5eb docker-common: fix container restart on new image
We now look for any excisting containers, if any we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han ebc195487c default: remove duplicate code
This is already defined in ceph-defaults.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Caleb Boylan 0be60456ce osd: Add support for multipath disks
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs

Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
2018-02-09 18:06:25 +01:00
Andy McCrae b4dbc862d6 Set application for OpenStack pools
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.

We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
2018-02-09 17:15:55 +01:00
Sébastien Han 22f843e3d4 default: define 'osd_scenario' variable
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-08 17:42:12 +01:00
Guillaume Abrioux e537779bb3 osd: fix osd restart when dmcrypt
This commit fixes a bug that occurs especially for dmcrypt scenarios.

There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.

If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :

```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```

It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because he won't have
filled the `$DOCKER_ENV` environment variable properly.

Adding `--net=host` to the 'disk_list' container fixes this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-08 15:45:13 +01:00
Giulio Fidente bdcc52b96d Check for docker sockets named after both _hostname or _fqdn
While hostname -f will always return an hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
2018-02-06 14:16:54 +01:00
Greg Charot a6d1922a2e mon: Fixed crush_rule_config for containerised deployment.
Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml

Signed-off-by: Greg Charot <gcharot@redhat.com>
2018-02-06 05:12:59 +01:00
Guillaume Abrioux dd0c98c5a2 common: do not use `shell` module when it is not needed
There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux deaf273b25 syntax: change local_action syntax
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.

This also fix a potential issue about missing quotation :

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)                                                                                                                                                                                                                                                                                                            File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Sébastien Han 6f9dd26caa config: remove any spaces in public_network or cluster_network
With two public networks configured - we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" install process consistently became
broken, trying to find docker registry on second network, and not
finding mon container.

but without spaces
"NETWORK_ADDR_1,NETWORK_ADDR_2" install succeeds
so, containerized install is more peculiar with formatting of this line

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-30 17:47:15 +01:00
Sébastien Han 5132cc3de4 Do not search osd ids if ceph-volume
Description of problem: The 'get osd id' task goes through all the 10 times (and its respective timeouts) to make sure that the number of OSDs in the osd directory match the number of devices.

This happens always, regardless if the setup and deployment is correct.

Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.

How reproducible: 100%

Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster

Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
    "attempts": 10,
    "changed": false,
    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
    "delta": "0:00:00.002717",
    "end": "2018-01-21 18:10:31.237933",
    "failed": true,
    "failed_when_result": false,
    "rc": 0,
    "start": "2018-01-21 18:10:31.235216"
}

STDOUT:

0
1
2

Expected results:
There aren't any (or just a few) timeouts while the OSDs are found

Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.

Basically this line:

    until: osd_id.stdout_lines|length == devices|unique|length

Means in this 2 OSD case it is trying to ensure the following incorrect condition:

    until: 2 == 0

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
2018-01-30 14:44:38 +01:00
Andy McCrae 481173f203 Add default for radosgw_keystone_ssl
This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.

Fixes: #2111
2018-01-30 11:30:23 +01:00
Guillaume Abrioux f1232b33fd Revert "monitor_interface: document need to use monitor_address when using IPv6"
This reverts commit 10b91661ce.

This reverts also the same comment added in
1359869497
2018-01-29 14:43:24 +01:00
Eduard Egorov 93e9f3723b config: add host-specific ceph_conf_overrides evaluation and generation.
This allows us to use host-specific variables in ceph_conf_overrides variable. For example, this fixes usage of such variables (e.g. 'nss db path' having {{ ansible_hostname }} inside) in ceph_conf_overrides for rados gateway configuration (see profiles/rgw-keystone-v3) - issue #2157.

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2018-01-26 10:15:03 +01:00
Guillaume Abrioux ec16cbdb1a defaults: avoid getting stuck (ceph --connect-timeout)
Sometime the playbook gets stuck because even with `--connect-timeout=`
option, the connexion to the existing ceph cluster never timeout.

As a workaround, using `timeout` command provided by coreutils will
actually timeout if we can't connect to the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537003

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-25 10:15:59 +01:00
Andrew Schoen 79473badfe ceph-osd: adds dmcrypt to the lvm scenario
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-24 14:10:08 +01:00
Guillaume Abrioux 9306a1789c osds: change default value for `dedicated_devices`
This is to keep backward compatibility with stable-2.2 and satisfy the
check "verify dedicated devices have been provided" in
`check_mandatory_vars.yml`. This check is looking for
`dedicated_devices` so we need to default it's value to
`raw_journal_devices` when `raw_multi_journal` is set to `True`.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1536098

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-22 18:02:51 +01:00
Sébastien Han f88795e843 rgw: disable legacy unit
Some systems that were deployed with old tools can leave units named
"ceph-radosgw@radosgw.gateway.service". As a consequence, they will
prevent the new unit to start.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1509584
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-18 14:12:18 +01:00
Andrew Schoen fb4a6dc9a4 docs for the crush_device_class option of lvm_volumes
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-17 13:49:29 +01:00
Andrew Schoen 6cbb56a3b6 ceph-osd: adds the crush_device_class param to the lvm scenario
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-17 13:49:29 +01:00
Eduard Egorov 7d7080df6c crush: create rack type buckets and build crush tree according to {{ osd_crush_location }}.
Currently, we can define crush location for each host but only crush roots and crush rules are created. This commit automates other routines for a complete solution:
  1) Creates rack type crush buckets defined in {{ ceph_crush_rack }} of each osd host. If it's not defined by user then a rack named 'default_rack_{{ ceph_crush_root  }}' would be added and used in next steps.
  2) Move rack type crush buckets defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.
  3) Move hosts defined in {{ ceph_crush_rack }} into crush roots defined in {{ ceph_crush_root }} of each osd host.

Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2018-01-11 17:42:18 +01:00
Sébastien Han 6db4aea453 osd: skip devices marked as '/dev/dead'
On a non-collocated scenario, if a drive is faulty we can't really
remove it from the list of 'devices' without messing up or having to
re-arrange the order of the 'dedicated_devices'. We want to keep this
device list ordered. This will prevent the activation failing on a
device that we know is failing but we can't remove it yet to not mess up
the dedicated_devices mapping with devices.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-11 17:34:32 +01:00
Guillaume Abrioux 70401f955b container: trigger handlers on systemd file change
When a systemd unit file is changed we should trigger handlers to
restart the services.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-10 16:46:42 +01:00
Guillaume Abrioux b29a42cba6 handlers: avoid duplicate handler
Having handlers in both ceph-defaults and ceph-docker-common roles can make the
playbook restarting two times services. Handlers can be triggered first
time because of a change in ceph.conf and a second time because a new
image has been pulled.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-10 16:46:42 +01:00
Sébastien Han 8a19a83354 container: restart container when there is a new image
This wasn't any good choice to implement this.
We had several options and none of them were ideal since handlers can
not be triggered cross-roles.
We could have achieved that by doing:

* option 1 was to add a dependancy in the meta of the ceph-docker-common
role. We had that long ago and we decided to stop so everything is
managed via site.yml

* option 2 was to import files from another role. This is messy and we
don't that anywhere in the current code base. We will continue to do so.

There is option 3 where we pull the image from the ceph-config role.
This is not suitable as well since the docker command won't be available
unless you run Atomic distro. This would also mean that you're trying to
pull twice. First time in ceph-config, second time in ceph-docker-common

The only option I came up with was to duplicate a bit of the ceph-config
handlers code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-10 16:46:42 +01:00
Guillaume Abrioux 900f447c82 containers: fix bug when looking for existing cluster
When containerized deployment, `docker_exec_cmd` is not set before the
task which try to retrieve the current fsid is played, it means it
considers there is no existing fsid and try to generate a new one.

Typical error:

```
ok: [mon0 -> mon0] => {
    "changed": false,
    "cmd": [
        "ceph",
        "--connect-timeout",
        "3",
        "--cluster",
        "test",
        "fsid"
    ],
    "delta": "0:00:00.179909",
    "end": "2018-01-09 10:36:58.759846",
    "failed": false,
    "failed_when_result": false,
    "rc": 1,
    "start": "2018-01-09 10:36:58.579937"
}

STDERR:

Error initializing cluster client: Error('error calling conf_read_file: errno EINVAL',)
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-10 16:23:18 +01:00
Sébastien Han c2e04623a5 container: change the way we force no logs inside the container
Previously we were using ceph_conf_overrides however this doesn't play
nice for softwares like TripleO that uses ceph_conf_overrides inside its
own code. For now, and since this is the only occurence of this, we can
ensure no logs through the ceph conf template.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1532619
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-10 16:21:47 +01:00
Guillaume Abrioux acfbebe67e defaults: rename check_socket files for containers
When containerized deployment, we are not looking for a socket but for a
running container.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-10 15:44:47 +01:00
Sébastien Han f0787e64da mon: use crush rules for non-container too
There is no reasons why we can't use crush rules when deploying
containers. So moving the inlcude in the main.yml so it can be called.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-10 15:21:36 +01:00
Sébastien Han 97f520bc74 containers: bump memory limit
A default value of 4GB for MDS is more appropriate and 3GB for OSD also.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1531607
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-09 11:26:50 +01:00
Sébastien Han 0b55abe3d0 mon: always run ceph-create-keys
ceph-create-keys is idempotent so it's not an issue to run it each time
we play ansible. This also fix issues where the 'creates' arg skips the
task and no keys get generated on newer version, e.g during an upgrade.

Closes: https://github.com/ceph/ceph-ansible/issues/2228
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-21 13:50:01 +01:00
Sébastien Han ad54e19262 rgw: disable legacy rgw service unit
When upgrading from OSP11 to OSP12 container, ceph-ansible attempts to
disable the RGW service provided by the overcloud image. The task
attempts to stop/disable ceph-rgw@{{ ansible-hostname }} and
ceph-radosgw@{{ ansible-hostname }}.service. The actual service name is
ceph-radosgw@radosgw.$name

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525209
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-21 13:48:42 +01:00
Guillaume Abrioux 895949d6c4 osd: fix check gpt
the gpt label creation doesn't work even with parted module.
This commit fixes the gpt label creation by using parted command
instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Sébastien Han bbc79765f3 osd: best effort if no device is found during activation
We have a scenario when we switch from non-container to containers. This
means we don't know anything about the ceph partitions associated to an
OSD. Normally in a containerized context we have files containing the
preparation sequence. From these files we can get the capabilities of
each OSD. As a last resort we use a ceph-disk call inside a dummy bash
container to discover the ceph journal on the current osd.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1525612
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-19 14:40:48 +01:00
Sébastien Han dfbef8361d nfs: fix package install for debian/suss systems
This resolves the following error:
E: There were unauthenticated packages and -y was used without
--allow-unauthenticated

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-19 13:30:49 +01:00
Christian Berendt 50a848dc40 Rename fact docker_version to ceph_docker_version
The name docker_version is very generic and is also used by other
roles. As a result, there may be name conflicts. To avoid this a
ceph_ prefix should be used for this fact. Since it is an internal
fact renaming is not a problem.
2017-12-15 20:12:21 +01:00
Markos Chandras 162b7d2b23 roles: ceph-mgr: Install the ceph-mgr package on SUSE
The ceph-mgr package name is identical to RedHat so add the SUSE family
to the existing task.
2017-12-15 09:22:14 +01:00
Guillaume Abrioux a24fd1cfd9 client: don't make `osd_pool_default_pg_num` mandatory
making `osd_pool_default_pg_num` mandatory is a bit agressive and is
unrelated when you just want to create users keyrings.

Closes: #2241

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-14 17:22:07 +01:00
Guillaume Abrioux ab1dd3027a client: don't try to generate keys
the entrypoint to generate users keyring is `ceph-authtool`, therefore,
it can expand the `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyring are properly filled when
`user_config: true`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-14 17:22:07 +01:00
Guillaume Abrioux 26afe46e13 docker: add missing condition for selinux tasks
on `client` and `mds` roles, it tries to set selinux even on non rhel
based distributions.`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-14 17:00:14 +01:00
Sébastien Han 7eaf444328 default: look for the right return code on socket stat in-use
As reported in https://github.com/ceph/ceph-ansible/issues/2254, the
check with fuser is not ideal. If fuser is not available the return code
is 127. Here we want to make sure that we looking for the correct return
code, so 1.

Closes: https://github.com/ceph/ceph-ansible/issues/2254
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-14 16:59:14 +01:00
John Fulton 8cba44262c Add flags for OSD 'docker run --cpuset-{cpus,mems}'
Add the variables ceph_osd_docker_cpuset_cpus and
ceph_osd_docker_cpuset_mems, so that a user may specify
the CPUs and memory nodes of NUMA systems on which OSD
containers are run.

Provides a example in osds.yaml.sample to guide user
based on sample `lscpu` output since cpuset-mems refers
to the memory by NUMA node only while cpuset-cpus can
refer to individual vCPUs within a NUMA node.
2017-12-14 16:39:35 +01:00
Eduard Egorov a8a2c13f6a firewall: add mds, nfs, restapi and iscsi ports, remove 'configure_firewall' variable used for conditional execution. Include the task only on rpm-based systems.
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2017-12-12 23:44:55 +01:00
Eduard Egorov 6a5e0da30d firewall: configure firewalld if it's already installed on the host (#2192).
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2017-12-12 23:44:55 +01:00