Commit Graph

3845 Commits (f463d1838eddc851ad81905dfc8412dcc6953ced)
 

Author SHA1 Message Date
Guillaume Abrioux f463d1838e mgr: wait for all mgr to be available
Before managing mgr modules, we must ensure all mgrs are available;
otherwise we can hit a failure like the following:

```
stdout:Error ENOENT: all mgr daemons do not support module 'restful', pass --force to force enablement
```

This happens because not all mgrs are available yet when we try to
manage mgr modules.
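As an illustration only, the wait can be sketched in Python. It assumes the mgr map has already been parsed from `ceph mgr dump -f json` into a dict with an `available` flag, an `active_name` and a `standbys` list of `{"name": ...}` entries. The helper names, the `expected` list and the retry parameters are hypothetical; the actual fix is an Ansible task, not this code.

```python
import time
from typing import Callable, Iterable


def all_mgrs_available(mgr_dump: dict, expected: Iterable[str]) -> bool:
    """Return True when the active mgr plus the standbys cover every expected mgr."""
    if not mgr_dump.get("available", False):
        return False
    seen = {mgr_dump.get("active_name", "")}
    seen.update(s.get("name", "") for s in mgr_dump.get("standbys", []))
    return set(expected) <= seen


def wait_for_mgrs(fetch_mgr_dump: Callable[[], dict], expected: Iterable[str],
                  retries: int = 30, delay: float = 1.0) -> bool:
    """Poll the (hypothetical) fetch function until all expected mgrs show up."""
    expected = list(expected)
    for _ in range(retries):
        if all_mgrs_available(fetch_mgr_dump(), expected):
            return True
        time.sleep(delay)
    return False
```

Only once this returns True would it be safe to enable modules such as `restful`.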

This should have been cherry-picked from
41f7518c1b, but there are too many changes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-07-11 10:02:25 +02:00
Guillaume Abrioux 652374636e nfs: add coverage on `ganesha_conf_overrides`
This commit adds the `ganesha_conf_overrides` variable to CI testing.
This fixes the test `test_nfs_config_override`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-07-10 14:24:52 +02:00
Guillaume Abrioux 24810e0da2 tests: fix purge scenarios names
This commit fixes the purge_* scenario names in stable-3.1

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-07-10 11:57:22 +02:00
Guillaume Abrioux 13602e426d tests: add missing variables in collocation scenario
Add:

ceph_origin: repository
ceph_repository: community

in all.yml for the collocation scenario (non-container).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-23 11:14:00 +02:00
Guillaume Abrioux 018297957e tests: fix path to inventory host file in tox-update.ini
The path included `/{env:CONTAINER_DIR:}`, which is already added in the
`changedir=` section. That led to a wrong path, so the initial deployment
couldn't complete.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-22 13:54:05 +02:00
Guillaume Abrioux bf17099964 tests: split update in a dedicated tox.ini file
This commit splits the update scenario into a dedicated tox.ini file.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-21 09:25:15 +02:00
Guillaume Abrioux 4cc08f7e1d tests: use INVENTORY env variable in tox
Use the `INVENTORY` variable to run against the right inventory host
file depending on which OS we are running on.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-05-20 13:36:18 +02:00
Guillaume Abrioux d63b1c993d tests: add back testinfra testing
136bfe0 removed testinfra testing from all scenarios except all_daemons.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d106c2c58)
2019-04-04 14:26:58 +00:00
Guillaume Abrioux 9a8c1d4081 tests: pin pytest-xdist to 1.27.0
It looks like newer versions of pytest-xdist require pytest>=4.4.0.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ba0a95211c)
2019-04-04 14:26:58 +00:00
Dimitri Savineau 8cad54e0ef tox: Fix container purge jobs
On containerized CI jobs, the playbook executed is purge-cluster.yml,
but it should be purge-docker-cluster.yml.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd0869cd01)
2019-04-04 09:14:05 +02:00
Guillaume Abrioux dd77affe7f tests: fix shrink_mon scenario
since the node names have changed recently (the 'ceph-' prefix has been
removed), we must change the name in the shrink_mon playbook command
here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-03 10:03:10 +02:00
Guillaume Abrioux a80ea0a929 tests: fix shrink_osd scenario
The wrong image version was used to run the shrink_osd playbook.
In stable-3.1 we should use a luminous image, not nautilus, which doesn't
have the ceph-disk binary anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-03 09:48:04 +02:00
Guillaume Abrioux 7926eebebf tests: disable nfs scenario
The packages are broken, so let's disable this scenario until this is solved.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-03 07:27:42 +00:00
Guillaume Abrioux f4f41d62ce tests: test idempotency only on all_daemons job
There's no need to test this on all scenarios.
Testing idempotency on all_daemons should be enough and allows us to save
precious CI resources.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 136bfe096c)
2019-04-03 07:27:42 +00:00
Guillaume Abrioux 64bee9cb86 osd: backward compatibility with old disk_list.sh location
Since all files in the container image have moved to `/opt/ceph-container`,
this check must look for both the new AND the old path so it stays backward
compatible. Otherwise it could end up templating an inconsistent
`ceph-osd-run.sh`.
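A minimal Python sketch of the "new path first, old path as fallback" lookup. The two default paths are hypothetical placeholders (this commit only says files moved under `/opt/ceph-container`), and in ceph-ansible the check runs against the container image rather than the local filesystem.

```python
import os
from typing import Optional, Sequence

# Hypothetical locations, for illustration only.
NEW_PATH = "/opt/ceph-container/bin/disk_list.sh"
OLD_PATH = "/disk_list.sh"


def find_disk_list(candidates: Sequence[str] = (NEW_PATH, OLD_PATH)) -> Optional[str]:
    """Return the first existing candidate, preferring the new layout."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # neither layout found: caller should not template blindly
```

Checking both paths means images built before and after the move both produce a consistent `ceph-osd-run.sh`.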

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 987bdac963)
2019-04-02 11:09:46 +02:00
Guillaume Abrioux 69cda84a21 iscsi-gws: remove a leftover
Remove a leftover introduced by 9d590f4.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d4b3c1d409)
2019-03-28 15:36:26 +00:00
Guillaume Abrioux ff243781c5 iscsi: fix permission denied error
Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
  msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```

`become: True` is not needed on the task `copy crt file(s) to gateway nodes`,
since it's already set in the main playbook (site.yml/site-container.yml).

The problem is that the files get generated in the 'fetch_directory' as the
root user, because the task uses 'delegate_to' and we run the playbook with
`become: True` (from the main playbook).

The idea here is to create the files as the ansible user so we can read them
later to copy them to the remote machine.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d590f4339)
2019-03-28 15:36:26 +00:00
Guillaume Abrioux d9895338d0 tests: rename all nodes name
remove the 'ceph-' prefix in order to have the same names in all
branches.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-28 13:39:54 +00:00
Guillaume Abrioux 9df795abdc tests: use memory backend for cache fact
force ansible to generate facts for each run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a1bafdc21)
2019-03-05 10:06:08 +01:00
Guillaume Abrioux 7c51657c58 tests: remove lvm_batch scenario
this scenario doesn't exist in stable-3.1

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-04 16:26:56 +01:00
Guillaume Abrioux a16ab0cad5 tests: refact all stable-3.1 testing
Refactor the testing on stable-3.1 the same way it has been done for
stable-3.2 and master.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-04 14:44:57 +01:00
Patrick Donnelly cb92299756 use shortname in keyring path
socket.gethostname may return an FQDN. Problem found on Linode.
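A minimal Python sketch of the idea: strip the domain part before building the path. The helper names are hypothetical; the path follows Ceph's usual `/var/lib/ceph/$type/$cluster-$id/keyring` layout, but which daemon's keyring this commit touches is not stated here.

```python
import socket


def short_hostname(name: str = "") -> str:
    """Return only the host part; socket.gethostname() may return an FQDN."""
    name = name or socket.gethostname()
    return name.split(".", 1)[0]


def daemon_keyring_path(daemon_type: str, cluster: str = "ceph", name: str = "") -> str:
    """Build /var/lib/ceph/<type>/<cluster>-<shortname>/keyring."""
    return f"/var/lib/ceph/{daemon_type}/{cluster}-{short_hostname(name)}/keyring"
```

For example, `daemon_keyring_path("mds", "ceph", "mds0.members.linode.com")` yields `/var/lib/ceph/mds/ceph-mds0/keyring`.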

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 8cd0308f5f)
2019-01-30 15:01:04 +01:00
Rishabh Dave b39345751f ceph-common: disable unrequired NTP services
When one of the currently supported NTP services has been set up,
disable the rest of the NTP services on Ceph nodes.
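A minimal Python sketch of the selection logic: given the chosen daemon, list the units of all the others so they can be disabled. The mapping to systemd unit names is a hypothetical example; real unit names vary by distribution (e.g. `ntp` vs `ntpd`).

```python
from typing import Dict, List

# Hypothetical mapping of daemon types to systemd unit names.
NTP_SERVICES: Dict[str, str] = {
    "ntpd": "ntpd",
    "chronyd": "chronyd",
    "timesyncd": "systemd-timesyncd",
}


def services_to_disable(chosen: str = "timesyncd") -> List[str]:
    """Return the units of every supported NTP daemon except the chosen one."""
    if chosen not in NTP_SERVICES:
        raise ValueError(f"unsupported NTP daemon: {chosen!r}")
    return sorted(unit for name, unit in NTP_SERVICES.items() if name != chosen)
```

Running two time daemons at once makes them fight over the clock, which is why only the chosen one should stay enabled.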

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1651875
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 6fa757d343)
2019-01-14 16:37:35 +01:00
Rishabh Dave ada7a400c2 ceph-common: merge ntp_debian.yml and ntp_rpm.yml
Merge ntp_debian.yml and ntp_rpm.yml into one (the new file is called
setup_ntp.yml) since they are almost identical.

Since this is an "as is" backport of the original commit, it also
adds support for multiple NTP daemons (namely, chronyd &
timesyncd). This keeps all branches consistent, since the backport
for stable-3.2 was auto-merged by mergify despite conflicts.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b03ab60742)
2019-01-14 16:37:35 +01:00
Benjamin Cherian bb41a7da20 Add support for different NTP daemons
Allow the user to choose between timesyncd, chronyd and ntpd.
Installation will default to timesyncd since it is distributed as
part of the systemd installation on most distros.
Added a note indicating that the NTP daemon type is not used for
containerized deployments.

Fixes issue #3086 on Github

Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
(cherry picked from commit 85071e6e53)
2019-01-14 16:37:35 +01:00
Sébastien Han c34027c3ba rolling_update: do not fail on missing keys
We don't want to fail on keys that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-29 15:50:07 +01:00
Guillaume Abrioux 741ef74629 update: fix a typo
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d)
2018-11-26 19:36:30 +00:00
Guillaume Abrioux 9022f83450 rolling_update: refact set_fact `mon_host`
Each monitor node should select another monitor that isn't itself.
Otherwise, one node in the monitor group won't set this fact, which causes
a failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```
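A small Python model of the intended selection, assuming the monitor group is a plain list of inventory names; `pick_other_mon` is a hypothetical name, not the playbook's code. Every monitor picks the first group member that is not itself, so no node is left without a value, and the result is then used directly as a `hostvars` key.

```python
from typing import List, Optional


def pick_other_mon(me: str, mons: List[str]) -> Optional[str]:
    """Return the first monitor in the group that is not the current host."""
    for mon in mons:
        if mon != me:
            return mon
    return None  # single-monitor group: there is no other monitor to pick
```

With `mons = ["mon0", "mon1", "mon2"]`, mon0 picks mon1 while mon1 and mon2 pick mon0, and a lookup such as `hostvars[mon_host]["ansible_hostname"]` then succeeds for every node.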

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584)
2018-11-26 19:36:30 +00:00
Sébastien Han 5c9aa5ed66 rolling_update: create rbd and rbd-mirror keyrings
During an upgrade, Ceph won't create keys that did not exist in the
previous version. So after an upgrade from, say, Jewel to Luminous, once
all the monitors run the new version they should get or create the
keys. It's OK for the task to fail, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f)
2018-11-26 19:36:30 +00:00
Sébastien Han 46a2701b5e ceph_key: add a get_key function
When checking whether a key exists, we also have to ensure that the key
exists on the filesystem: the key can change in Ceph while an outdated
version remains on the filesystem. This commit solves that issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 691f373543)
2018-11-26 19:36:30 +00:00
Jairo Llopis a5aca6ebbc Fix problem with ceph_key in python3
A pretty basic problem caused by the removal of `dict.iteritems()` in Python 3.
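A minimal illustration of the fix pattern; `caps_to_args` is a hypothetical helper written for this example, not the ceph_key module's actual code. `dict.items()` exists on both Python 2 and 3, whereas `iteritems()` raises `AttributeError` on Python 3.

```python
def caps_to_args(caps: dict) -> list:
    """Flatten {'mon': 'allow r', ...} into ['mon', 'allow r', ...]."""
    args = []
    # Python 2 only (AttributeError on Python 3): for k, v in caps.iteritems():
    for entity, cap in caps.items():  # works on Python 2 and 3
        args.extend([entity, cap])
    return args
```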

Signed-off-by: Jairo Llopis <yajo.sk8@gmail.com>
(cherry picked from commit fc20973c2b)
2018-10-26 16:23:34 +02:00
Guillaume Abrioux 10403b76e3 tox: fix a typo
The line setting `ANSIBLE_CONFIG` obviously contains a typo introduced
by 1e283bf69b.

`ANSIBLE_CONFIG` has to point to a path only (path to an ansible.cfg)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a0cceb3e44)
2018-10-26 16:22:46 +02:00
Sébastien Han d814644c4a rolling_update: fix upgrade when using fqdn
Clusters that were deployed using 'mon_use_fqdn' have a different unit
name, so it must be used during the upgrade; otherwise the upgrade will
fail, looking for a unit that does not exist.
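A minimal Python sketch of how the unit name depends on `mon_use_fqdn`, assuming the usual `ceph-mon@<id>.service` template unit where the id is either the short hostname or the FQDN. `mon_unit_name` is a hypothetical helper, not the playbook's code.

```python
def mon_unit_name(hostname: str, fqdn: str, mon_use_fqdn: bool) -> str:
    """Return the systemd unit a monitor was deployed under."""
    mon_id = fqdn if mon_use_fqdn else hostname
    return f"ceph-mon@{mon_id}.service"
```

Restarting `ceph-mon@mon0.service` on a cluster deployed with `mon_use_fqdn` targets a unit that was never created, which is the failure described above.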

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 44d0da0dd4)
2018-10-24 12:42:14 +00:00
Guillaume Abrioux 7c9699ad51 tests: do not install lvm2 on atomic host
We need to detect whether we are running on an Atomic host so we don't try
to install the lvm2 package.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d2ca24eca8)
2018-10-16 14:35:08 +02:00
Alfredo Deza f4a5551bfd tests: install lvm2 before setting up ceph-volume/LVM tests
Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit 3e488e8298)
2018-10-16 14:35:08 +02:00
Noah Watkins e089f46607 Stringify ceph_docker_image_tag
This could be a numeric input, but it is treated like a string, leading to
runtime errors.
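A minimal Python illustration: an unquoted tag such as `3.0` in YAML is loaded as a float, and concatenating it to a string raises `TypeError`. `image_ref` is a hypothetical helper and `ceph/daemon` is just an example image name.

```python
def image_ref(image: str, tag) -> str:
    """Join image and tag; `tag` may be an int or float if left unquoted in YAML."""
    # str + float raises TypeError, so stringify the tag first.
    return image + ":" + str(tag)
```

Stringifying avoids the crash but cannot recover lost digits: a YAML `3.10` is already the float `3.1`, so `str()` gives `"3.1"`. Quoting the tag in the variables file remains the safest input.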

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 8dcc8d1434)
2018-10-16 14:35:08 +02:00
Noah Watkins 75c9130865 Avoid using tests as filter
Fixes the deprecation warning:

  [DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
  using `result|search` use `result is search`.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 306e308f13)
2018-10-16 14:35:08 +02:00
Andy McCrae ee1b6dd83c Sync config_template with upstream for Ansible 2.6
The original_basename option in the copy module was renamed to
_original_basename in Ansible 2.6+. This PR resyncs the config_template
module so that it works with both Ansible 2.6+ and earlier versions.

Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.

Closes: #2843
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit a1b3d5b7c3)
2018-10-15 22:00:35 +00:00
Sébastien Han d0b03f6faa switch: copy initial mon keyring
We need to copy this key into /etc/ceph so that when ceph-docker-common runs
it can fetch it to the Ansible server. Previously the task wasn't failing
because `fail_on_missing` was False before Ansible 2.5; it is now True,
hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bae0f41705)
2018-10-15 13:59:21 +02:00
Guillaume Abrioux da05c1fd31 switch: support migration when cluster is scrubbing
Similar to c13a3c3, we must allow scrubbing when running this playbook.

In a cluster with a large number of PGs, some of them can be expected to be
scrubbing; it's a normal operation.
Preventing scrubbing would force us to set the noscrub flag.

This commit allows switching from a non-containerized to a containerized
environment even while PGs are scrubbing.

Closes: #3182

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54b02fe187)
2018-10-15 13:59:21 +02:00
Guillaume Abrioux 75c2b83e43 defaults: fix osd containers handler
`ceph_osd_container_stat` might not be set on other OSD nodes.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.

This should have been backported, but it's part of a large refactor
in master that can't be backported.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-15 10:33:56 +02:00
Sébastien Han 513608cebe switch: allow switch big clusters (more than 99 osds)
The current regex had a limit of 99 OSDs. This limit has been removed,
and all OSDs will now be collected regardless of their number.
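An illustrative Python comparison of a two-digit-capped pattern and an unbounded one. Both patterns and the sample unit names are hypothetical stand-ins, not the switch playbook's actual regex.

```python
import re

# Capped at two digits, similar in spirit to the old limitation.
OLD = re.compile(r"ceph-osd@(\d{1,2})\b")
# One or more digits: no upper bound on the OSD id.
NEW = re.compile(r"ceph-osd@(\d+)\b")

units = "ceph-osd@7.service ceph-osd@42.service ceph-osd@123.service"
old_ids = OLD.findall(units)
new_ids = NEW.findall(units)
```

Here `old_ids` silently misses OSD 123, while `new_ids` contains all three ids.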

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9fccffa1ca)
(cherry picked from commit d5e57af23d)
2018-10-15 10:33:56 +02:00
Guillaume Abrioux 4e4184e579 defaults: fix osd handlers that are never triggered
`run_once: true` + `inventory_hostname == groups.get(osd_group_name) |
last` is a bad combination: if the single node the task runs on isn't the
last one, the task will always be skipped.
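A toy Python model (not Ansible code) of why the handler never fires. It relies only on the documented `run_once` behaviour of evaluating the task on the first host of the current batch; `osd_handler_runs` is a hypothetical name.

```python
from typing import List


def osd_handler_runs(batch_hosts: List[str], osd_group: List[str]) -> bool:
    """Model `run_once: true` combined with `inventory_hostname == osd_group | last`."""
    # run_once: the task is evaluated on the first host of the batch only.
    host = batch_hosts[0]
    # The `when` condition then compares that single host with the group's last member.
    return host == osd_group[-1]
```

With the group `["osd0", "osd1", "osd2"]`, the handler only runs if `osd2` happens to be the first host in the batch, which is almost never the case.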

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-03 14:09:39 +00:00
Guillaume Abrioux ba6c3a8e6b config: look up for monitor_address_block in hostvars
`monitor_address_block` should be read from hostvars[host] instead of
from the current node being played.

eg:

Let's assume we have:

```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```

the ceph.conf generation task will end up with:

```
fatal: [ceph-mon0]: FAILED! => {}

MSG:

'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```

The reason is that it assumes `monitor_address_block` isn't defined, even for
ceph-mon2, because it looks up `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`; therefore it falls through to the default branch:

```
    {%- else -%}
      {% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
      {% if ip_version == 'ipv4' -%}
        {{ hostvars[host][interface][ip_version]['address'] }}
      {%- elif ip_version == 'ipv6' -%}
        [{{ hostvars[host][interface][ip_version][0]['address'] }}]
      {%- endif %}
    {%- endif %}
```

`monitor_interface` has the default value `'interface'`, so the `interface`
variable is built as 'ansible_' + 'interface'. This makes Ansible throw a
confusing message about `'ansible_interface'`.
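A Python sketch of the corrected lookup order, reading every setting from `hostvars[host]` rather than from the node being played. It is IPv4-only for brevity, `mon_address` is a hypothetical name, and `ansible_all_ipv4_addresses` is the standard Ansible fact assumed to be present for block matching.

```python
import ipaddress
from typing import Dict


def mon_address(host: str, hostvars: Dict[str, dict]) -> str:
    """Resolve a monitor's IPv4 address from hostvars[host], not the current node."""
    hv = hostvars[host]  # always read the *target* host's variables
    if "monitor_address" in hv:
        return hv["monitor_address"]
    if "monitor_address_block" in hv:
        block = ipaddress.ip_network(hv["monitor_address_block"])
        for addr in hv.get("ansible_all_ipv4_addresses", []):
            if ipaddress.ip_address(addr) in block:
                return addr
        raise ValueError(f"{host}: no address found in {block}")
    # Default branch, as in the template: build the fact name from monitor_interface.
    fact = "ansible_" + hv.get("monitor_interface", "interface").replace("-", "_")
    return hv[fact]["ipv4"]["address"]
```

With the inventory above, each monitor resolves through its own setting. A host with none of the three variables falls into the default branch and fails on the missing `ansible_interface` fact, the same confusing name seen in the error.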

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6130bc841d)
2018-10-02 21:54:09 +00:00
Guillaume Abrioux 79a5725cf6 purge: actually remove of /var/lib/ceph/*
38dc20e74b introduced a bug in the purge
playbooks because using `*` with the `command` module doesn't work: the
`command` module doesn't run through a shell, so the wildcard isn't expanded.

`/var/lib/ceph/*` files are not purged, which means leftovers remain.

When trying to redeploy a cluster, the deployment failed because the monitor
daemon detected an existing keyring and therefore assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16  /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23  /entrypoint.sh:
SUCCESS
```
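A Python illustration of the underlying pitfall, not the playbook's fix. Like Ansible's `command` module, `subprocess.run()` with an argument list starts the program without a shell, so a literal `*` reaches it unexpanded. Expanding the wildcard explicitly, here with `glob`, avoids this. Both helper names are hypothetical.

```python
import glob
import os
import shutil
import subprocess
import sys


def arg_as_received(arg: str) -> str:
    """Echo back an argument exactly as a child process sees it (no shell involved)."""
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def purge_contents(directory: str) -> None:
    """Remove every entry under `directory`, expanding '*' explicitly with glob."""
    # Like a shell's '*', glob skips dotfiles by default.
    for path in glob.glob(os.path.join(directory, "*")):
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```

`arg_as_received("/var/lib/ceph/*")` returns the pattern unchanged, which is why a `rm -rf /var/lib/ceph/*` passed through `command` leaves the files in place.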

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 144c92b21f)
2018-09-27 21:42:43 +02:00
Matthew Vernon 0bb13cff08 restart_osd_daemon.sh.j2 - use `+` rather than `{1,}` in regex
`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.

Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 806461ac6e)
2018-09-26 21:38:36 +00:00
Matthew Vernon d701c192e0 restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
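A Python transliteration of the check (the real script is a Jinja2-templated shell script). It assumes `pgmap` is the object from `ceph -s -f json`, with `num_pgs` and a `pgs_by_state` list of `{"state_name", "count"}` entries; `all_pgs_clean` is a hypothetical name.

```python
def all_pgs_clean(pgmap: dict) -> bool:
    """True if the PGs whose state contains 'active+clean' account for all PGs."""
    clean = sum(
        entry["count"]
        for entry in pgmap.get("pgs_by_state", [])
        if "active+clean" in entry["state_name"]
    )
    return clean == pgmap["num_pgs"]
```

A PG in `active+clean+scrubbing+deep` now counts as good, while one in `active+recovering` still blocks the next restart.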

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648)
2018-09-26 21:38:36 +00:00
Guillaume Abrioux fdc2d7681d rolling_update: ensure pgs_by_state has at least 1 entry
Previous commit c13a3c3 removed a condition.

This commit brings back this condition, which is essential to ensure we
won't hit a false positive result in the `when` condition of the check
PGs task.
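A Python sketch of why the guard matters, assuming the same `pgmap` shape as `ceph -s -f json`. An empty `pgs_by_state` list sums to 0, which equals a `num_pgs` of 0 and would wrongly pass. The commit does not say when the list is empty, so this only shows the arithmetic; `pgs_ok` is a hypothetical name.

```python
def pgs_ok(pgmap: dict) -> bool:
    """PG check with the 'at least one pgs_by_state entry' guard restored."""
    states = pgmap.get("pgs_by_state", [])
    if len(states) == 0:
        # sum([]) == 0 could equal num_pgs == 0 and yield a false positive
        return False
    clean = sum(s["count"] for s in states if s["state_name"].startswith("active+clean"))
    return clean == pgmap.get("num_pgs", 0)
```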

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 179c4d00d7)
2018-09-26 10:58:51 +00:00
Guillaume Abrioux f008f40628 upgrade: consider all 'active+clean' states as valid pgs
In a cluster with a large number of PGs, some of them can be expected to be
scrubbing; it's a normal operation.
Preventing scrubbing would force us to set the noscrub flag before a
rolling update, which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c13a3c3492)
2018-09-25 14:13:16 +00:00
Giulio Fidente 7d2a13f8c7 Fix version check in ceph.conf template
We need to look for ceph_release when comparing with release names,
not ceph_version.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1631789
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit 6126210e0e)
2018-09-24 12:32:32 +00:00