When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.
This commit makes the playbook chown only the files corresponding to the
daemon being migrated.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.
Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the adoption playbook should use `radosgw_num_instances` in order to
determine how much rgw instance it should set recreate.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943170
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
this adds a small documentation in the header of the playbook in order
to explain what is the goal of this playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
when running docker-to-podman playbook, there's no need to call
`ceph-config` and `ceph-rgw` from the role `ceph-handler`.
It can even have side effects when coming from a baremetal cluster that
was previously migrated using the switch-to-containers playbook. Indeed
it might complain about missing .target systemd unit since they are
removed during that migration.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944999
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit makes the playbook fetch the minimal current ceph
configuration and write it later on monitoring nodes so `cephadm` can
proceed with the adoption.
When a monitoring stack was deployed on a dedicated node, it means no
`ceph.conf` file was written, `cephadm` requires a `ceph.conf` in order
to adopt the daemon present on the node.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1939887
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If the legacy name `grafana-server` is still being used when upgrading
from Nautilus to Pacific, the task that sets the fact `rolling_update`
to `true` doesn't run on the node(s) included in that group. Indeed the
play where we set this fact (`rolling_update`) only runs on the group
`monitoring_group_name | default('monitoring')`.
As a workaround, we can run earlier the task which converts the
`grafana-server` group name to `monitoring`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935554
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant. In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.
Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
This commit makes sure purge playbooks remove those file if for any reason they
have been left.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1920900
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
since master is now deploying quincy, we must update this.
Otherwise, it will fail like following:
```
Error EPERM: require_osd_release cannot be lowered once it has been set
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's no reason to not use the ceph_osd_flag module to set/unset osd
flags.
Also if there's no OSD nodes in the inventory then we don't need to
execute the set/unset play.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of doing some scripting via the shell module, we can use the
parted ansible module to check the boot flag on partitions.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Those devices (db/wal) are never zapped in lvm batch deployment.
Iterating over `dedicated_devices` and `bluestore_wal_devices` fixes
this issue.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1922926
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When rerunning the cephadm-adopt.yml playbook the radosgw realm,
zonegroup and zone tasks will fail because the task isn't
idempotent.
Using the radosgw ansible modules solves that problem.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
If the cephadm-adopt.yml fails during the first execution and some
daemons have already been adopted by cephadm then we can't rerun
the playbook because the old container won't exist anymore.
Error: no container with name or ID ceph-mon-xxx found: no such container
If the daemons are adopted then the old systemd unit doesn't exist anymore
so any call to that unit with systemd will fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918424
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on a same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain. (either
because in lvm batch you can't pass partition or about gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current check makes no sense because it checks any of other monitor
than the one being played (either a previous one already converted or a
next that isn't yet converted) is present on the quorum.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Instead of iterate over the host list for adding the node/label to the
host orchestrator configuration then we can do it parallelly.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds cephadm_adopt ansible module for replacing the command module
usage with the cephadm adopt command.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`
Fixes: #6090
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We should always use the ceph_volume ansible module when possible.
This patch replace the ceph-volume inventory and lvm {list,zap} commands
called via the command/shell modules by the corresponding call with the
ceph_volume module.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds cephadm_bootstrap ansible module for replacing the command module
usage with the cephadm bootstrap command.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds ceph_osd ansible module for replacing the command module
usage with the ceph osd destroy/down/in/out/purge/rm commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds ceph_mgr_module ansible module for replacing the command module
usage with the ceph mgr module enable/disable commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
`ceph.target` should be disabled only. Otherwise, in collocation
scenario you stop other collocated services in the OSD play which isn't
what we want to do. Each daemon has its corresponding play for managing
the transition to container.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This adds ceph_volume_simple_{activate,scan} ansible modules for replacing
the command module usage with the ceph-volume simple activate/scan commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
adding monitor is no longer possible because we generate a new mon
keyring each time the playbook is run.
Fixes: #5864
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
ignore 302,303 and 505 errors
[302] Using command rather than an argument to e.g. file
[303] Using command rather than module
[505] referenced files must exist
they aren't relevant on these tasks.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>