before managing mgr modules, we must ensure all mgr are available
otherwise we can hit failure like following:
```
stdout:Error ENOENT: all mgr daemons do not support module 'restful', pass --force to force enablement
```
It happens because all mgr are not yet available when trying to manage
with mgr modules.
This should have been cherry-picked from
41f7518c1b but there's too much changes.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds `ganesha_conf_overrides` variable in CI testing.
This fixes the test `test_nfs_config_override`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the path had `/{env:CONTAINER_DIR:}` which is already added in
`changedir=` section. That led to a wrong path so the initial deployment
couldn't complete.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
let's use `INVENTORY` variable to run against the right inventory host
regarding which OS we are running on.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
looks like newer version of pytest-xdist requires pytest>=4.4.0
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ba0a95211c)
On containerized CI jobs the playbook executed is purge-cluster.yml
but it should be set to purge-docker-cluster.yml
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd0869cd01)
since the node names have changed recently (the 'ceph-' prefix has been
removed), we must change the name in the shrink_mon playbook command
here.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the wrong image version was used to run shrink_osd playbook.
in stable-3.1 we should use a luminous image, not nautilus which doesn't
have ceph-disk binary anymore.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
there's no need to test this on all scenarios.
testing idempotency on all_daemons should be enough and allow us to save
precious resources for the CI.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 136bfe096c)
Since all files in container image have moved to `/opt/ceph-container`
this check must look for new AND the old path so it's backward
compatible. Otherwise it could end up by templating an inconsistent
`ceph-osd-run.sh`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 987bdac963)
Typical error:
```
fatal: [iscsi-gw0]: FAILED! =>
msg: 'an error occurred while trying to read the file ''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'': [Errno 13] Permission denied: b''/home/guits/ceph-ansible/tests/functional/all_daemons/fetch/e5f4ab94-c099-4781-b592-dbd440a9d6f3/iscsi-gateway.key'''
```
`become: True` is not needed on the following task:
`copy crt file(s) to gateway nodes`.
Since it's already set in the main playbook (site.yml/site-container.yml)
The thing is that the files get generated in the 'fetch_directory' with
root user because there is a 'delegate_to' + we run the playbook with
`become: True` (from main playbook).
The idea here is to create files under ansible user so we can open them
later to copy them on the remote machine.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d590f4339)
socket.gethostname may return a FQDN. Problem found in Linode.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 8cd0308f5f)
When one of the currently supported NTP services has been set up,
disable rest of the NTP services on Ceph nodes.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1651875
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 6fa757d343)
Merge ntp_debian.yml and ntp_rpm.yml into one (the new file is called
setup_ntp.yml) since they are almost identical.
Since this is as a "as it is" backport for the original commit, it also
adds the feature of supporting multiple NTP daemons (namely, chronyd &
timesyncd). This is to maintain consistency across all branches
since the backport for stable-3.2 was auto-merged by mergify despite
of conflicts.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b03ab60742)
Allow user to choose between timesyncd, chronyd and ntpd
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added note indicating NTP daemon type is not used for containerized
deployments.
Fixes issue #3086 on Github
Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
(cherry picked from commit 85071e6e53)
We don't want to fail on key that are not present since they will get
created after the mons are updated. They will be created by the task
"create potentially missing keys (rbd and rbd-mirror)".
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7c99b6df6d)
each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.
Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018 14:02:30 +0000 (0:00:07.493) 0:02:50.005 *****
fatal: [mon1]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit af78173584)
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 4e267bee4f)
When checking if a key exists we also have to ensure that the key exists
on the filesystem, the key can change on Ceph but still have an outdated
version on the filesystem. This solves this issue.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 691f373543)
the line setting `ANSIBLE_CONFIG` obviously contains a typo introduced
by 1e283bf69b
`ANSIBLE_CONFIG` has to point to a path only (path to an ansible.cfg)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a0cceb3e44)
CLusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used otherwise the upgrade will
fail, looking for a unit that does not exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 44d0da0dd4)
we need to detect whether we are running on atomic host to not try to
install lvm2 package.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d2ca24eca8)
Fixes the deprecation warning:
[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
using `result|search` use `result is search`.
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
(cherry picked from commit 306e308f13)
The original_basename option in the copy module changed to be
_original_basename in Ansible 2.6+, this PR resyncs the config_template
module to allow this to work with both Ansible 2.6+ and before.
Additionally, this PR removes the _v1_config_template.py file, since
ceph-ansible no longer supports versions of Ansible before version 2,
and so we shouldn't continue to carry that code.
Closes: #2843
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
(cherry picked from commit a1b3d5b7c3)
We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't not
failing because `fail_on_missing` was False before 2.5, so now it's True
hence the failure.
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit bae0f41705)
Similar to c13a3c3 we must allow scrubbing when running this playbook.
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag.
This commit allows to switch from non containerized to containerized
environment even while PGs are scrubbing.
Closes: #3182
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54b02fe187)
`ceph_osd_container_stat` might not be set on other osd node.
We must ensure we are on the last node before trying to evaluate
`ceph_osd_container_stat`.
This should have been backported but it's part of a too important
refact in master that can't be backported.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current regex had a limitation of 99 OSDs, now this limit has been
removed and regardless the number of OSDs they will all be collected.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 9fccffa1ca)
(cherry picked from commit d5e57af23d)
`run_once: true` + `inventory_hostname == groups.get(osd_group_name) |
last` is a bad combination since if the only node being run isn't the
last, the task will be definitly skipped.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
`monitor_address_block` should be read from hostvars[host] instead of
current node being played.
eg:
Let's assume we have:
```
[mons]
ceph-mon0 monitor_address=192.168.1.10
ceph-mon1 monitor_interface=eth1
ceph-mon2 monitor_address_block=192.168.1.0/24
```
the ceph.conf generation task will end up with:
```
fatal: [ceph-mon0]: FAILED! => {}
MSG:
'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_interface'
```
the reason is that it will assume `monitor_address_block` isn't defined even on
ceph-mon2 because looking for `monitor_address_block` instead of
`hostvars[host]['monitor_address_block']`, therefore it enters in the condition as default value:
```
{%- else -%}
{% set interface = 'ansible_' + (monitor_interface | replace('-', '_')) %}
{% if ip_version == 'ipv4' -%}
{{ hostvars[host][interface][ip_version]['address'] }}
{%- elif ip_version == 'ipv6' -%}
[{{ hostvars[host][interface][ip_version][0]['address'] }}]
{%- endif %}
{%- endif %}
```
`monitor_interface` is set with default value `'interface'` so the `interface`
variable is built with 'ansible_' + 'interface'. It makes ansible throwing a
confusing message about `'ansible_interface'`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1635303
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6130bc841d)
38dc20e74b introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.
`/var/lib/ceph/*` files are not purged it means there is a leftover.
When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.
Typical error (from container output):
```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16 /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23 /entrypoint.sh:
SUCCESS
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 144c92b21f)
`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 806461ac6e)
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.
On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".
Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
(could this be backported to at least stable-3.0 please?)
Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648)
Previous commit c13a3c3 has removed a condition.
This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 179c4d00d7)
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag before a
rolling update which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.
This commit allows an upgrade even while PGs are scrubbing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c13a3c3492)