Instead of reusing the condition 'inventory_hostname in groups[osds]'
on each device facts tasks then we can move all the tasks into a
dedicated file and set the condition on the import_tasks statement.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d704b05e52)
We currently don't check if the logical volume used in lvm_volumes list
for either bluestore data/db/wal or filestore data/journal exist.
We're only doing this on raw devices for batch scenario.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 55bca07cb6)
When using dedicated devices for db/journal/wal objecstore with
ceph-volume lvm batch then we should also validate that those devices
exist and don't use a gpt partition table in addition of the devices
and lvm_volume.data variables.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 808e7106de)
Instead of using findmnt command to find the device associated to the
root mount point then we can use the ansible_mounts fact.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7e50380f7f)
Instead of doing two parted calls we can check first if the device exist
and then test the partition table.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 14d458b3b4)
2888c08 introduced a regression as the check_devices tasks file was
only included based on the devices variable.
But that file also validate some devices from the lvm_volumes variable.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1906022
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ac0342b72e)
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we can to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8e4ef7d6da)
The alertmanager, grafana and prometheus configuration file are
generated with the template module which doesn't allow for using
config overrides.
Instead we could use the config_template plugin action and add a
new variable for overrides (one for each component).
With this patch, one should be able to add configuration to
prometheus with the following:
---
alertmanager_conf_overrides:
global:
smtp_smarthost: 'localhost:25'
...
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1902999
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5a41026347)
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 738fa9428a)
When using python 2 and the task with a loop is skipped then it generates
an error.
Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf6e33346e)
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13036115e2)
It's better to fail the playbook so the user is aware the straw2
migration has failed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):
```
crush map has legacy tunables (require firefly, min is hammer)
```
because straw bucket is a firefly feature it needs to be converted to
straw2.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
When deploying dashboard with ssl certificates generated by
ceph-ansible, we enforce the CN to 'ceph-dashboard' which can makes
application such alertmanager complain like following:
`err="Post https://mgr0:8443/api/prometheus_receiver: x509: certificate is valid for ceph-dashboard, not mgr0" context_err="context deadline exceeded"`
The idea here is to add alternative names matching all mgr/mon instances
in the certificate so this error won't appear in logs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978869
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 72a0336c71)
The ceph crash insatll checkpoint callback was missing in the main
playbooks.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 993d06c4d9)
Instead of reusing the condition 'inventory_hostname in groups[osds]'
on each device facts tasks then we can move all the tasks into a
dedicated file and set the condition on the import_tasks statement.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d704b05e52)
We currently don't check if the logical volume used in lvm_volumes list
for either bluestore data/db/wal or filestore data/journal exist.
We're only doing this on raw devices for batch scenario.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 55bca07cb6)
When using dedicated devices for db/journal/wal objecstore with
ceph-volume lvm batch then we should also validate that those devices
exist and don't use a gpt partition table in addition of the devices
and lvm_volume.data variables.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 808e7106de)
Instead of using findmnt command to find the device associated to the
root mount point then we can use the ansible_mounts fact.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7e50380f7f)
Instead of doing two parted calls we can check first if the device exist
and then test the partition table.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 14d458b3b4)
2888c08 introduced a regression as the check_devices tasks file was
only included based on the devices variable.
But that file also validate some devices from the lvm_volumes variable.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1906022
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ac0342b72e)
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 037d8cd05e)
When calling the `ceph_key` module with `state: info`, if the ceph
command called fails, the actual error is hidden by the module which
makes it pretty difficult to troubleshoot.
The current code always states that if rc is not equal to 0 the keyring
doesn't exist.
`state: info` should always return the actual rc, stdout and stderr.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964889
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d58500ade0)
All ceph daemons need to have the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
environment variable set to 128MB by default in container setup.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970913
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9758e3c513)
It was requested for us to update our alerting definitions to include a
slow OSD Ops health check.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1951664
Signed-off-by: Boris Ranto <branto@redhat.com>
(cherry picked from commit 2491d4e004)
There's no benefit to gather facts again on each play in
rolling_update.yml
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 26a7256c4c)
This adds a github workflow for checking the signed off line in commit
messages.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8c09497567)
This adds the ansible --syntax-check test in the ansible-lint workflow
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ed423ad88)
There's no need to copy this keyring when using nfs with mds
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8dbee99882)
When running the switch-to-containers playbook with multisite enabled,
the fact "rgw_instances" is only set for the node being processed
(serial: 1), the consequence of that is that the set_fact of
'rgw_instances_all' can't iterate over all rgw node in order to look up
each 'rgw_instances_host'.
Adding a condition checking whether hostvars[item]["rgw_instances_host"]
is defined fixes this issue.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967926
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8279d14d32)
When monitors and rgw are collocated with multisite enabled, the
rolling_update playbook fails because during the workflow, we run some
radosgw-admin commands very early on the first mon even though this is
the monitor being upgraded, it means the container doesn't exist since
it was stopped.
This block is relevant only for scaling out rgw daemons or initial
deployment. In rolling_update workflow, it is not needed so let's skip
it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970232
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f7166cccbf)