Commit Graph

3590 Commits (cfe8e51d992b820de7be12604c4430e7ba18c4c5)
 

Author SHA1 Message Date
Sébastien Han 73c4846744 mon: use ceph_crush module in the playbook
Instead of creating the CRUSH hierarchy with Ansible tasks using the
command module we now rely on the ceph_crush module.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-06 15:24:31 +00:00
Sébastien Han 5fac3784f7 add ceph_crush module
This module allows us to create Ceph CRUSH hierarchy. The module works
with
hostvars from individual OSD hosts.
Here is an example of the expected configuration in the inventory file:

[osds]
ceph-osd-01 osd_crush_location="{ 'root': 'mon-roottt', 'rack':
'mon-rackkkk', 'pod': 'monpod', 'host': 'localhost' }"  # valid case

Then, if create_crush_tree is enabled the module will create the
appropriate CRUSH buckets and their types in Ceph.

Some pre-requesites:

* a 'host' bucket must be defined
* at least two buckets must be defined (this includes the 'host')

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-06 15:24:31 +00:00
Greg Charot 78c1f1938f mons: Current crush_rule playbook does not work if there is no default rule defined (default: true).
One could want to add new crush rules while keeping his current default rule.
Fixed it so that it works with all rules defined as "default: false". If multiple rules are defined as default (should not be) then the last rule listed in "crush_rules" is taken as default.
2018-03-06 15:24:31 +00:00
Greg Charot 77f9c1df10 no reason the ceph-ansible ansible default provided crush_rule_hdd rule should be set as rack root + default ruleset 2018-03-06 15:24:31 +00:00
Greg Charot 50afc3fbf3 We don't want to automatically move the rbd pool to the new default crush rule. This operation shall be performed by the cluster operator. 2018-03-06 15:24:31 +00:00
Sébastien Han f2e0ceed78 add support for installation checkpoint
This was taken from the openshift ansible repository here:
https://github.com/leseb/openshift-ansible/tree/master/roles/installer_checkpoint

Rationale:

A complete OpenShift cluster installation is comprised of many different
components which can take 30 minutes to several hours to complete. If
the installation should fail, it could be confusing to understand at
which component the failure occurred. Additionally, it may be desired to
re-run only the component which failed instead of starting over from the
beginning. Components which came after the failed component would also
need to be run individually.

Ceph has a similar situation so we can benefit from that
callback_plugin.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-06 15:21:40 +00:00
Andy McCrae 04ca685ba7 Remove vars that are no longer used
As part of fcba2c801a these vars were
removed and no longer do anything:

radosgw_dns_name
radosgw_resolve_cname

This patch removes them from the group_vars files and defaults/main.yml
2018-03-06 09:16:25 +01:00
jtudelag c3267b77b7 Makes use of docker_exec_cmd in ceph-mon role.
Keeps consistency inside the role and among roles.
Makes the code more readable.
2018-03-05 12:48:35 +00:00
Sébastien Han cb0f598965 common: run updatedb task on debian systems only
The command doesn't exist on Red Hat systems so it's better to skip it
instead of ignoring the error.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-02 20:59:10 +00:00
Sébastien Han 7f19df8196 rgw: add cluster name option to the handler
If the cluster name is different than 'ceph', the command will fail so
we need to pass the cluster name.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-02 20:59:10 +00:00
Sébastien Han fd94840a6e ci: add copy_admin_key test to container scenario
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-02 20:59:10 +00:00
Sébastien Han 9c85280602 rgw: ability to copy ceph admin key on containerized
If we now set copy_admin_key while running a containerized scenario, the
ceph admin key will be copied on the node.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-02 20:59:10 +00:00
Sébastien Han 67f46d8ec3 rgw: run the handler on a mon host
In case the admin wasn't copied over to the node this command would
fail. So it's safer to run it from a monitor directly.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-03-02 20:59:10 +00:00
Guillaume Abrioux 1e283bf69b tests: make CI jobs using 'ansible.cfg'
The jobs launches by the CI are not using 'ansible.cfg'.
There are some parameters that should avoid SSH failure that we are used
to see in the CI so far.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-28 13:51:52 +01:00
Guillaume Abrioux 6d35bc9bde client: use `ceph_uid` fact to set uid/gid on admin key
That task is failing on containerized deployment because `ceph:ceph`
doesn't exist.
The idea here is to use the `{{ ceph_uid }}` to set the ownerships for
the admin keyring when containerized_deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540578

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-26 15:52:05 +01:00
Grant Slater 1e1b26ca4d mds: fix ansible_service_mgr typo
This commit fixes a typo introduced by 4671b9e74e
2018-02-26 13:05:14 +01:00
Andy McCrae c33dae7509 Revert "[TEST] Test setting up correct systemd file for nfs-ganesha"
The nfs-ganesha package has been fixed as part of this commit:
963b6681df

Once the package is rebuilt this should be good to merge.

This reverts commit e88af3c4cb.
2018-02-26 10:23:42 +01:00
Giulio Fidente a83e1aeea3 Make rule_name optional when defining items in openstack_pools
Previously it was necessary to provide a value (eventually an
empty string) for the "rule_name" key for each item in
openstack_pools. This change makes that optional and defaults to
empty string when not given.
2018-02-23 15:11:53 +01:00
Sébastien Han 165d9dec10 remove kernel.pid_max
This is now managed by Ceph packages.

See: https://github.com/ceph/ceph/pull/18544/files

http://tracker.ceph.com/issues/21929

Closes: https://github.com/ceph/ceph-ansible/issues/2410

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-23 13:57:57 +01:00
Guillaume Abrioux 4a8986459f tests: change ceph_docker_image_tag for 2nd run
The ceph-ansible upstream CI runs severals tests, including a
'idempotency/handlers' test. It means the playbook is run a first time
and then a second time with an other container image version to ensure the
handlers run properly and the containers are well restarted.
This can cause issues.
For instance, in that specific case which drove me to submit this commit,
I've hit the case where `latest` image ships ceph 12.2.3 while the `stable-3.0`
(which is the image used for the second run) ships ceph 12.2.2.

The goal of this test is not to verify we can upgrade from a specific
version to another but to ensure handlers are working even if it's a valid
failure here.
It should be caught by a test dedicated to that usecase.

We just need to have a container image which has a different id for
the upstream CI, we need the same content in container imagebut a different
image id in the registry since the test relies on image id to decide whether
the container should be restarted.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-23 13:54:32 +01:00
Guillaume Abrioux 707458c979 ci: add tripleo scenario testing
This should help to see earlier any failure in a tripleo deployment scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-23 13:54:32 +01:00
Andy McCrae 2779d2a850 Adjust /etc/updatedb.conf to not parse /var/lib/ceph
Using updatedb -e doesnt make a permanent change, but will updatedb
without the passed path.

To make this change more permanent we should update the
/etc/updatedb.conf file to include /var/lib/ceph.
2018-02-20 11:32:56 +01:00
Andy McCrae e88af3c4cb [TEST] Test setting up correct systemd file for nfs-ganesha
Don't merge this.
Test to see if we copy over the nfs-ganesha-lock.service.debian8 file
properly, whether the Xenial CI job will work.

The upstream download.ceph.com nfs-ganesha package should be fixed for
xenial (which is in progress).
2018-02-20 10:49:37 +01:00
Guillaume Abrioux c04e67347c update: look for short and fqdn in ceph_health_raw
According to hostname configuration, the task waiting for mons to be in
quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-19 10:27:47 +01:00
Paul Bourke 463b5c6b22 Remove redundant task to check if atomic
This fact is already set in site-docker.yml so there's no need to check
it again in ceph-docker-common

Signed-off-by: Paul Bourke <paul.bourke@oracle.com>
2018-02-19 10:10:46 +01:00
Andy McCrae 59a4335a56 Restart services if handler called
This patch fixes an issue where if hosts have different service lists,
it will prevent restarting changes on services that run later on.

For example, hostA in the mons and rgws group would initiate a config
change and restart of services on all mons and rgws hosts, even though
a separate hostB (which is only in the rgws group) has not had its
configuration changed yet. Additionally, when the second host has its
coniguration changed as part of the ceph-rgw role, it will not initiate
a restart since its inventory name != the first hosts.

To fix this we should run the restart once (using run_once: True)
as long as the host has called the handler. This will ensure that even
if only 1 host has called the handler it will initiate a restart on all
hosts that have called the handler.

Additionally, we add a var that is set when the handler runs, this will
ensure that only hosts that have called the handler get restarted.

Includes minor fix to remove unrequired "inventory_hostname in
play_hosts" when: clause. This is no longer required since the handlers
were changed. The host calling the handler will be in play_hosts
already.
2018-02-16 10:40:20 +01:00
Sébastien Han c816a9282c container: osd remove run_once
When used along with  delegate, run_once does not belong well. Thus,
using | last always brings the desired result.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han d47d02a5eb docker-common: fix container restart on new image
We now look for any excisting containers, if any we compare their
running image with the latest pulled container image.
For OSDs, we iterate over the list of running OSDs, this handles the
case where the first OSD of the list has been updated (runs the new
image) and not the others.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1526513
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han ebc195487c default: remove duplicate code
This is already defined in ceph-defaults.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han 7d690878df test: add test for containers resources changes
We change the ceph_mon_docker_memory_limit on the second run, this
should trigger a restart of services.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Sébastien Han 79864a8936 test: add test for restart on new container image
Since we have a task to test the handlers we can test a new container to
validate the service restart on a new container image.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-14 02:01:29 +01:00
Andrew Schoen 699c777e68 rolling update: fix undefined jewel_minor_update failure
Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.

The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-13 17:03:05 +01:00
Andrew Schoen 7c7017ebe6 infra: do not include host_vars/* in take-over-existing-cluster.yml
These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-12 11:48:47 +01:00
Caleb Boylan 0be60456ce osd: Add support for multipath disks
Multipath disks have partitions with a different format than what
ceph-ansible currently supports, this update makes ceph-ansible
aware of that format so multipath disks can be used as OSDs

Signed-off-by: Caleb Boylan <caleb.boylan@ormuco.com>
2018-02-09 18:06:25 +01:00
Andy McCrae b4dbc862d6 Set application for OpenStack pools
Since Luminous we need to set the application tag for each pool,
otherwise a CEPH_WARNING is generated when the pools are in use.

We should assign the OpenStack pools to their default which would be
"rbd". When updating to Luminous this would happen automatically to the
vms, images, backups and volumes pools, but for new deploys this is not
the case.
2018-02-09 17:15:55 +01:00
Sébastien Han ff90661033 site: ability to only generate a ceph.conf on the machines
Now by running the playbook like this:

ansible-playbook site.yml --tags='ceph_update_config'

You can only generate a ceph configuration file on the nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1543434
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-09 14:07:58 +01:00
Sébastien Han 22f843e3d4 default: define 'osd_scenario' variable
osd_scenario does not exist in the ceph-default role so if we try to
play ceph-default on an OSD node, the playbook will fail with undefined
variable.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-02-08 17:42:12 +01:00
Guillaume Abrioux e537779bb3 osd: fix osd restart when dmcrypt
This commit fixes a bug that occurs especially for dmcrypt scenarios.

There is an issue where the 'disk_list' container can't reach the ceph
cluster because it's not launched with `--net=host`.

If this container can't reach the cluster, it will hang on this step
(when trying to retrieve the dm-crypt key) :

```
+common_functions.sh:448: open_encrypted_part(): ceph --cluster abc12 --name \
client.osd-lockbox.9138767f-7445-49e0-baad-35e19adca8bb --keyring \
/var/lib/ceph/osd-lockbox/9138767f-7445-49e0-baad-35e19adca8bb/keyring \
config-key get dm-crypt/osd/9138767f-7445-49e0-baad-35e19adca8bb/luks
+common_functions.sh:452: open_encrypted_part(): base64 -d
+common_functions.sh:452: open_encrypted_part(): cryptsetup --key-file \
-luksOpen /dev/sdb1 9138767f-7445-49e0-baad-35e19adca8bb
```

It means the `ceph-run-osd.sh` script won't be able to start the
`osd_disk_activate` process in ceph-container because he won't have
filled the `$DOCKER_ENV` environment variable properly.

Adding `--net=host` to the 'disk_list' container fixes this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1543284

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-08 15:45:13 +01:00
Grant fff438fb27 Update Documentation example link to 3.0
Update the Documentation link from 2.2 -> 3.0

Helpful for newbies.
2018-02-07 16:34:45 +01:00
Giulio Fidente bdcc52b96d Check for docker sockets named after both _hostname or _fqdn
While hostname -f will always return an hostname including its
domain part and -s without the domain part, the behavior when
no arguments are given can include or not include the domain part
depending on how the system is configured; the socket name might
not match the instance name then.
2018-02-06 14:16:54 +01:00
Greg Charot a6d1922a2e mon: Fixed crush_rule_config for containerised deployment.
Was called too early, container was not yet started so the commands failed.
Moved the section after include docker/main.yml

Signed-off-by: Greg Charot <gcharot@redhat.com>
2018-02-06 05:12:59 +01:00
Guillaume Abrioux 3b2f6c34e4 purge-docker: fix ceph-osd-zap name container
the `zap ceph osd disks` task should iter on `resolved_parent_device`
instead of `combined_devices_list` which contain only the base device
name (vs. full path name in `combined_devices_list`).

this fixes the issue where docker complain about container name because
of illegal characters such as `/` :
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```

having the the basename of the device path is enough for the container
name.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-02 22:09:11 +01:00
Guillaume Abrioux dd0c98c5a2 common: do not use `shell` module when it is not needed
There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux deaf273b25 syntax: change local_action syntax
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.

This also fix a potential issue about missing quotation :

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)                                                                                                                                                                                                                                                                                                            File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Sébastien Han b1a3c6e4cc osd: resync group_vars file
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-30 17:47:15 +01:00
Sébastien Han 6f9dd26caa config: remove any spaces in public_network or cluster_network
With two public networks configured - we found that with
"NETWORK_ADDR_1, NETWORK_ADDR_2" install process consistently became
broken, trying to find docker registry on second network, and not
finding mon container.

but without spaces
"NETWORK_ADDR_1,NETWORK_ADDR_2" install succeeds
so, containerized install is more peculiar with formatting of this line

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1534003
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-30 17:47:15 +01:00
Guillaume Abrioux f372a4232e purge: fix resolve parent device task
This is a typo caused by leftover.
It was previously written like this :
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to :
`shell: $(lsblk --nodeps -no pkname "{{ item }}") }}")`
because we are appending later the '/dev/' in the next task.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-30 17:40:10 +01:00
Sébastien Han 5132cc3de4 Do not search osd ids if ceph-volume
Description of problem: The 'get osd id' task goes through all the 10 times (and its respective timeouts) to make sure that the number of OSDs in the osd directory match the number of devices.

This happens always, regardless if the setup and deployment is correct.

Version-Release number of selected component (if applicable): Surely the latest. But any ceph-ansible version that contains ceph-volume support is affected.

How reproducible: 100%

Steps to Reproduce:
1. Use ceph-volume (LVM) to deploy OSDs
2. Avoid using anything in the 'devices' section
3. Deploy the cluster

Actual results:
TASK [ceph-osd : get osd id _uses_shell=True, _raw_params=ls /var/lib/ceph/osd/ | sed 's/.*-//'] **********************************************************************************************************************************************
task path: /Users/alfredo/python/upstream/ceph/src/ceph-volume/ceph_volume/tests/functional/lvm/.tox/xenial-filestore-dmcrypt/tmp/ceph-ansible/roles/ceph-osd/tasks/start_osds.yml:6
FAILED - RETRYING: get osd id (10 retries left).
FAILED - RETRYING: get osd id (9 retries left).
FAILED - RETRYING: get osd id (8 retries left).
FAILED - RETRYING: get osd id (7 retries left).
FAILED - RETRYING: get osd id (6 retries left).
FAILED - RETRYING: get osd id (5 retries left).
FAILED - RETRYING: get osd id (4 retries left).
FAILED - RETRYING: get osd id (3 retries left).
FAILED - RETRYING: get osd id (2 retries left).
FAILED - RETRYING: get osd id (1 retries left).
ok: [osd0] => {
    "attempts": 10,
    "changed": false,
    "cmd": "ls /var/lib/ceph/osd/ | sed 's/.*-//'",
    "delta": "0:00:00.002717",
    "end": "2018-01-21 18:10:31.237933",
    "failed": true,
    "failed_when_result": false,
    "rc": 0,
    "start": "2018-01-21 18:10:31.235216"
}

STDOUT:

0
1
2

Expected results:
There aren't any (or just a few) timeouts while the OSDs are found

Additional info:
This is happening because the check is mapping the number of "devices" defined for ceph-disk (in this case it would be 0) to match the number of OSDs found.

Basically this line:

    until: osd_id.stdout_lines|length == devices|unique|length

Means in this 2 OSD case it is trying to ensure the following incorrect condition:

    until: 2 == 0

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1537103
2018-01-30 14:44:38 +01:00
Andy McCrae 481173f203 Add default for radosgw_keystone_ssl
This should default to False. The default for Keystone is not to use PKI
keys, additionally, anybody using this setting had to have been manually
setting it before.

Fixes: #2111
2018-01-30 11:30:23 +01:00
Guillaume Abrioux f1232b33fd Revert "monitor_interface: document need to use monitor_address when using IPv6"
This reverts commit 10b91661ce.

This reverts also the same comment added in
1359869497
2018-01-29 14:43:24 +01:00