If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 97148dd58c)
It's better to fail the playbook so the user is aware the straw2
migration has failed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):
```
crush map has legacy tunables (require firefly, min is hammer)
```
because straw bucket is a firefly feature it needs to be converted to
straw2.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we can to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8e4ef7d6da)
Add any_errors_fatal: true in cephadm-adopt playbook.
We should stop the playbook execution when a task throws an error.
Otherwise it can lead to unexpected behavior.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1976179
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3b804a61dd)
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 037d8cd05e)
There's no benefit to gather facts again on each play in
rolling_update.yml
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
Starting RHCS 5, there's no ISO available anymore.
This removes all ISO variables and the ceph_repository_type variable.
Closes: #6626
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a05730b38a)
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 26a7256c4c)
If multi-realms were deployed with several instances belonging to the same
realm and zone using the same port on different nodes, the service id
expected by cephadm will be the same and therefore only one service will
be deployed. We need to create a service called
`<node>.<realm>.<zone>.<port>` to be sure the service name will be unique
and well deployed on the expected node in order to preserve backward
compatibility with the rgws instances that were deployed with
ceph-ansible.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 31311b03ed)
We need to support rgw multisite deployments.
This commit makes the adoption playbook support this kind of deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fc784fc44c)
This is an unsupported configuration since there
are issues with RGW+NFS upgraded from Nautilus to Pacific.
This approach might be seen as a bit aggressive but it is preferable
to wait before upgrading in that case.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970003
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When no `[mgrs]` group is defined in the inventory, mgr daemon are
implicitly collocated with monitors.
This task currently relies on the length of the mgr group in order to
tell cephadm to deploy mgr daemons.
If there's no `[mgrs]` group defined in the inventory, it will ask
cephadm to deploy 0 mgr daemon which doesn't make sense and will throw
an error.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970313
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f9a73149a4)
0990ae4109 changed the filter in
selectattr() from 'match' to 'equalto' but due to an incompatibility with
the Jinja2 version for python 2.7 on el7 we must stick to using 'match'
filter.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d6745e9cd9)
using 'match' filter in that task will lead to bad behavior if I have
the following node names for instance:
- node1
- node11
- node111
with `selectattr('name', 'match', inventory_hostname)` it will match
'node1' along with 'node11' and 'node111'.
using 'equalto' filter will make sure we only match the target node.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1963066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0990ae4109)
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
ceph-ansible leaves a ceph-crash container in containerized deployment.
It means we end up with 2 ceph-crash containers running after the
migration playbook is complete.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1954614
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 22c18e82f0)
Due to a recent breaking change in ceph, this command must be modified
to add the <svc_id> parameter.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1f40c12502)
When migrating from a cluster with no MDS nodes deployed,
`{{ cephfs_data_pool.name }}` doesn't exist so we need to create a pool
for storing nfs export objects.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1950403
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bb7d37fb6a)
When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.
This commit makes the playbook chown only the files corresponding to the
daemon being migrated.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ddbc11c4a9)
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.
Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d4267051f)
the adoption playbook should use `radosgw_num_instances` in order to
determine how much rgw instance it should set recreate.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943170
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1ffc4df6b6)
This play doesn't nothing else than stopping/removing rgw daemons.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ee44d86072)
when running docker-to-podman playbook, there's no need to call
`ceph-config` and `ceph-rgw` from the role `ceph-handler`.
It can even have side effects when coming from a baremetal cluster that
was previously migrated using the switch-to-containers playbook. Indeed
it might complain about missing .target systemd unit since they are
removed during that migration.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944999
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 70f19be367)
this adds a small documentation in the header of the playbook in order
to explain what is the goal of this playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 36b4227dcd)
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 07029e1bf1)
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant. In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.
Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
`command -v` is a bash script which needs a shell to run.
Fixes: #6325
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 14c472707c)
This commit makes the playbook fetch the minimal current ceph
configuration and write it later on monitoring nodes so `cephadm` can
proceed with the adoption.
When a monitoring stack was deployed on a dedicated node, it means no
`ceph.conf` file was written, `cephadm` requires a `ceph.conf` in order
to adopt the daemon present on the node.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1939887
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b445df0479)
If the legacy name `grafana-server` is still being used when upgrading
from Nautilus to Pacific, the task that sets the fact `rolling_update`
to `true` doesn't run on the node(s) included in that group. Indeed the
play where we set this fact (`rolling_update`) only runs on the group
`monitoring_group_name | default('monitoring')`.
As a workaround, we can run earlier the task which converts the
`grafana-server` group name to `monitoring`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935554
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6ccc8b4722)
This commit makes sure purge playbooks remove those file if for any reason they
have been left.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1920900
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b9dd253a4f)
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 980a5a7df4)
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
since master is now deploying quincy, we must update this.
Otherwise, it will fail like following:
```
Error EPERM: require_osd_release cannot be lowered once it has been set
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's no reason to not use the ceph_osd_flag module to set/unset osd
flags.
Also if there's no OSD nodes in the inventory then we don't need to
execute the set/unset play.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of doing some scripting via the shell module, we can use the
parted ansible module to check the boot flag on partitions.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Those devices (db/wal) are never zapped in lvm batch deployment.
Iterating over `dedicated_devices` and `bluestore_wal_devices` fixes
this issue.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1922926
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When rerunning the cephadm-adopt.yml playbook the radosgw realm,
zonegroup and zone tasks will fail because the task isn't
idempotent.
Using the radosgw ansible modules solves that problem.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
If the cephadm-adopt.yml fails during the first execution and some
daemons have already been adopted by cephadm then we can't rerun
the playbook because the old container won't exist anymore.
Error: no container with name or ID ceph-mon-xxx found: no such container
If the daemons are adopted then the old systemd unit doesn't exist anymore
so any call to that unit with systemd will fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918424
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>