When deploying the ceph OSDs via the packages, the ceph-osd@.service
unit is configured as enabled-runtime.
This means that each ceph-osd service instance inherits that state.
The enabled-runtime systemd state doesn't survive a reboot.
For non-containerized deployments the OSDs still start after a
reboot because the ceph-volume@.service and/or ceph-osd.target
units do the job.
$ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t
ceph-osd@.service enabled-runtime
ceph-volume@.service enabled
ceph-osd.target enabled
When switching to containerized deployment we stop/disable
ceph-osd@XX.service, ceph-volume and ceph.target and then remove the
systemd unit files.
But the new systemd units for the containerized ceph-osd services still
inherit from the ceph-osd@.service unit state.
As a consequence, if an OSD host reboots after the playbook execution,
the ceph-osd services won't come back because they aren't enabled at
boot.
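The fix is essentially to enable the new container units explicitly
instead of relying on the inherited enabled-runtime state. A minimal
sketch of that idea, assuming an `osd_ids` list gathered earlier in the
play (the variable name is hypothetical):
```
# Hedged sketch: make sure each containerized ceph-osd@ unit is enabled at
# boot instead of inheriting the enabled-runtime state.
- name: enable ceph-osd@ container units at boot
  systemd:
    name: "ceph-osd@{{ item }}"
    state: started
    enabled: true
    daemon_reload: true
  with_items: "{{ osd_ids }}"  # hypothetical list of OSD ids on the node
```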
This patch also adds a reboot and testinfra run after running the switch
to container playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fa2bb3af86)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.
$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46
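For the health check itself, a minimal sketch of what the lighter
command can look like in a task (`container_exec_cmd` and `cluster`
are assumed to be set elsewhere):
```
# Hedged sketch: use "ceph health" instead of "ceph status" for the check.
- name: get ceph health
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} health -f json"
  register: ceph_health
  changed_when: false

- name: fail when the cluster is in HEALTH_ERR state
  fail:
    msg: "The cluster is in HEALTH_ERR state"
  when: (ceph_health.stdout | from_json)['status'] == 'HEALTH_ERR'
```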
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the rgw/rbdmirror services status, we're only using the
servicemap structure in the ceph status output.
To optimize this, we could use the ceph service dump command which contains
the same needed information.
This command returns less information and is slightly faster than the ceph
status command.
$ ceph status -f json | wc -c
2001
$ ceph service dump -f json | wc -c
1105
$ time ceph status -f json > /dev/null
real 0m0.557s
user 0m0.516s
sys 0m0.040s
$ time ceph service dump -f json > /dev/null
real 0m0.454s
user 0m0.434s
sys 0m0.020s
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f9081931f)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.
$ ceph status -f json | wc -c
2001
$ ceph quorum_status -f json | wc -c
957
$ time ceph status -f json > /dev/null
real 0m0.577s
user 0m0.538s
sys 0m0.029s
$ time ceph quorum_status -f json > /dev/null
real 0m0.544s
user 0m0.527s
sys 0m0.016s
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.
$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null
real 0m0.529s
user 0m0.503s
sys 0m0.024s
$ time ceph pg stat -f json > /dev/null
real 0m0.426s
user 0m0.409s
sys 0m0.016s
The data returned by the ceph status is even bigger when using the
nautilus release.
$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240
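For the 'wait for clean pgs' check, a minimal sketch of what this can
look like, assuming the nautilus `ceph pg stat -f json` layout
(pg_summary / num_pg_by_state) and a hypothetical `pgs_check` register:
```
# Hedged sketch: wait until every pg is active+clean, using the much smaller
# "ceph pg stat" output instead of the full "ceph status".
- name: waiting for clean pgs
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} pg stat -f json"
  register: pgs_check
  changed_when: false
  retries: 30
  delay: 10
  until: >
    (pgs_check.stdout | from_json).pg_summary.num_pgs ==
    ((pgs_check.stdout | from_json).pg_summary.num_pg_by_state
      | selectattr('name', 'search', '^active\+clean')
      | map(attribute='num') | list | sum)
```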
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
This commit adds the `osd_auto_discovery` scenario support in the
filestore-to-bluestore playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881523
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8b1eeef18a)
There's no need to have a copy of this file in the
infrastructure-playbooks directory.
Playbooks in that directory can be run from the root dir of
ceph-ansible.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f906caa6da)
As of Ansible 2.10, group names containing a dash are invalid.
However, setting this option still makes it possible to use a dash in
group names and prevents this warning from showing up.
It might need to be addressed for good in a future release.
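For reference, a sketch of the kind of ansible.cfg setting involved
(the exact key name is an assumption based on the Ansible 2.8+
configuration; it is not spelled out in this commit):
```
# Hedged sketch of the ansible.cfg knob that keeps dashed group names working.
[defaults]
force_valid_group_names = ignore
```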
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1880476
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6938ed1302)
When running the switch2container playbook on a Debian based system,
the systemd unit path isn't the same as on a Red Hat based system.
Because the old systemd unit files aren't removed, the new container
systemd units aren't taken into account.
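A minimal sketch of the idea, assuming the usual packaged unit
locations (/usr/lib/systemd/system on Red Hat, /lib/systemd/system on
Debian):
```
# Hedged sketch: remove the packaged unit file from the right path per OS family.
- name: remove the old ceph-osd systemd unit file
  file:
    path: "{{ '/usr/lib/systemd/system' if ansible_os_family == 'RedHat' else '/lib/systemd/system' }}/ceph-osd@.service"
    state: absent
```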
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c1af69a7e7)
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all the client nodes; having the
container image only on the first client node is enough.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
This way we keep consistency with purge-container-cluster.yml playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4)
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
When running the rolling_update playbook with an inventory without
monitor nodes defined (like the external scenario), we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
In DCN environments, or when multiple ceph clusters are configured,
we need to specify the cluster name before running the command, or
the rolling_update playbook will fail during minor updates.
Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
In the OSP context, during the rolling update the playbook fails
with the following error:
'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''
This PR just changes the hosts field, providing a valid mons group
value.
Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
When using FQDNs in the inventory host file, this task fails because
the mds is registered with its shortname.
It means we must use `mds_to_kill_hostname` in this task.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677d)
ceph-volume can generate large logs at some point.
Debug logs, by definition, should be enabled only when debugging.
Let's make it customizable with a variable which is set to `False` by
default.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.
This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.
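A minimal sketch of that guard (task names are illustrative; the
`rbdmap unmap-all` sub-command is assumed to be available in this ceph
release):
```
# Hedged sketch: only unmap rbd devices when the rbdmap tool is still present.
- name: check if rbdmap is available
  command: which rbdmap
  register: rbdmap_check
  failed_when: false
  changed_when: false

- name: ensure rbd devices are unmapped
  command: rbdmap unmap-all
  when: rbdmap_check.rc == 0
```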
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a090)
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.
Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b2)
Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33)
This handles a missing /etc/ceph/osd by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.
This also adds a missing `| default(False)` to avoid the following error:
```
fatal: [ceph01]: FAILED! =>
msg: |-
The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2)
In addition to 155e2a2, the active mds daemon isn't stopped/started
correctly, as opposed to the other services, so that daemon doesn't come
back after the upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
The dashboard upgrade workflow should follow the same process as the ceph
upgrade, otherwise any systemd unit modification won't be applied to the
monitoring/dashboard stack.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
During the daemon upgrade we're
- stopping the service when it's not containerized
- running the daemon role
- starting the service when it's not containerized
- restarting the service when it's containerized
This implementation has multiple issues.
1/ We don't use the same service workflow when using containers
or baremetal.
2/ The explicit daemon start isn't required since we're already
doing this in the daemon role.
3/ Any non backward compatible changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.
This patch refactors the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployments at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.
The following comment isn't valid because we should have backported the
ceph-crash implementation to stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):
~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
When setting/unsetting osd flags, we can use `tasks_from` when importing
the `ceph-facts` role to save some time, given that we only need this role
to set `container_binary`.
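A minimal sketch of the pattern (the task file name is taken from the
role layout at the time and may differ):
```
# Hedged sketch: only run the container_binary task from ceph-facts.
- import_role:
    name: ceph-facts
    tasks_from: container_binary
```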
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
By default, Ansible gathers facts from facter and ohai if they are
installed on the remote nodes. Given we don't need them, let's exclude
these facts from our fact gathering.
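A minimal sketch of what the exclusion can look like with an explicit
setup task (the play context is assumed):
```
# Hedged sketch: gather facts while skipping the facter and ohai collectors.
- name: gather facts
  setup:
    gather_subset:
      - 'all'
      - '!facter'
      - '!ohai'
```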
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
When rgw and osd are collocated, the current workflow prevents scaling
out the radosgw_num_instances parameter when rerunning the
playbook.
The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role, but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from the `ceph-osd`
role, which runs before `ceph-rgw`. It therefore tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet, and fails like this:
```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```
This commit moves the tasks generating this file into the `ceph-config`
role so it is generated early enough.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
If a failure occurs in ceph-validate, the upgrade playbook keeps running
whereas we expect it to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
Since we only have one scenario since nautilus, we can just move
the container start command from ceph-osd-run.sh to the systemd unit
service.
As a result, the ceph-osd-run.sh.j2 template and the
ceph_osd_docker_run_script_path variable are removed.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 829990e60d)
This commit skips the image pulling if podman isn't installed
on the machine.
In the OSP context, the podman installation is done later in the workflow,
which means all `podman pull` commands would fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 37b20b6525)
The workflow in this playbook should be the same as in rolling_update:
we should first set the noout and nodeep-scrub flags before migrating the
first OSD and unset the flags after the last OSD is migrated.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e0)
We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d384)
The systemd LOAD and ACTIVE fields could have more than one space between
both values.
This updates the systemd regex the same way we're already using it in other
parts of the code.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5d)
The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus)
were not managed during the docker to podman migration.
This adds the systemd container template of those services to a dedicated
file (systemd.yml) in order to include it in the docker2podman playbook.
This also adds the dashboard container images pull from docker to podman.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 252e78b4e4)
The docker2podman playbook only installs the podman package and updates
the systemd units with the right container_binary value.
We never pull the container image, so if one service is restarted then
the container image will be pulled first before the service can start,
which could cause a longer downtime.
To avoid downloading the container image from the internet again, we can
just pull it from the local docker daemon.
The container_{binding,package,service}_name variables are removed
because they are only used in the ceph-container-engine role, which
isn't called in this playbook.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d38f21aeba)
The rbdmirror group name was using the wrong variable definition.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
The docker commands should be based on the container_binary variable,
otherwise running the playbook on a host without docker (podman
only, for instance) will fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829985
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When using skipped variables with the from_json filter and python2, we
need to have a default value, otherwise the skipped task will fail.
Unexpected templating type error occurred on
({{ (ceph_volume_lvm_list.stdout | from_json) }}): expected string or
buffer
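A minimal sketch of the fix (the fact name `ceph_volume_osds` is
hypothetical; the point is the `default('{}')` before `from_json`):
```
# Hedged sketch: default the registered stdout so from_json always gets a string,
# even when the previous task was skipped.
- name: set osd list from ceph-volume lvm list
  set_fact:
    ceph_volume_osds: "{{ ceph_volume_lvm_list.stdout | default('{}') | from_json }}"
```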
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2b9edba131)
We must call `ceph-osd` role from `container_options_facts.yml` because
ceph-osd-run.sh.j2 needs variables set in this file.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1819681
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a4f54f6ee)
Adding `--all` to the `systemctl list-units` command in order to get
*all* OSD ids on the node (including stopped OSDs). Otherwise, it will
purge the cluster but there will be leftovers after that.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1814542
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5e7962ccf6)
We don't need to copy the infrastructure playbooks in the root
ceph-ansible directory.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 195944b123)
We only disable the ceph-osd services but not the ceph-volume lvm
services during the filestore to bluestore migration.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 38a683e5bf)
If the filestore configuration was using a dedicated journal with either
a partition or an LV/VG, we need to reuse it for the bluestore DB.
When filestore is using a raw device, we shouldn't destroy
everything (data + journal) but only the data, otherwise the journal
partition won't exist anymore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 535da53d69)
We should add retry/delay to check the presence of the rbdmirror daemon
in the cluster status because the status takes some time to be updated.
Also the metadata.hostname isn't a good key to check because it doesn't
reflect the ansible_hostname fact. We should use metadata.id instead.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d1316ce77b)
This playbook was using an mds systemd condition.
Also, a command task was using a pipeline, which is not allowed.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a664159061)
The ceph-facts role runs on localhost, so if this node is using a
different OS/release than the ceph nodes, we can have a mismatch in the
docker/podman container binary.
This commit also reduces the scope of the ceph-facts role because we only
need the container_binary tasks.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 08ac2e3034)
It looks like the service module doesn't support wildcards anymore
for stopping/disabling multiple services.
fatal: [rgw0]: FAILED! => changed=false
msg: 'This module does not currently support using glob patterns,
found ''*'' in service name: ceph-radosgw@*'
...ignoring
Instead we should iterate over the rgw_instances list.
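A minimal sketch of that loop (the unit name pattern is assumed from
the rgw role conventions):
```
# Hedged sketch: stop each radosgw instance explicitly instead of using a glob.
- name: stop ceph rgw instances
  service:
    name: "ceph-radosgw@rgw.{{ ansible_hostname }}.{{ item.instance_name }}"
    state: stopped
    enabled: false
  with_items: "{{ rgw_instances }}"
```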
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9d3b49293d)
When running an environment with OSDs whose IDs have more than 2 digits,
some tasks don't match the systemd units and therefore the playbook can fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55970b18f1)
Instead of running shrink-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.
This has the added advantage of becoming root on the monitor only, not
on localhost.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b3df4e418)
This commit increases the default values for the following variables
consumed in the switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables into the `ceph-defaults` role so the user
can set different values if needed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3700aa5385)
This was introduced in 3.1 and marked as deprecated.
We can definitely drop it in stable-4.0.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
Just like site.yml and rolling_update, let's exclude the client nodes from
the fact gathering.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSDs for nothing.
Instead we warn the user.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 83c5a1d7a8)
Fix purge cluster failing when local container images do not exist.
Purge node-exporter and grafana-server only when dashboard_enabled is set to True.
Signed-off-by: wujie1993 qq594jj@gmail.com
(cherry picked from commit d8b0b3cbd9)
If the playbook is used on a host running bluestore OSDs, the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' by the ceph-volume lvm list command while we are looking
for 'type: data' (filestore).
TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
msg: '''osd_fsid_list'' is undefined
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cd76054f76)
When the PV is already removed from the devices then we should not fail
to avoid errors like:
stderr: No PV found on device /dev/sdb.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a9c2300545)
When upgrading from RHCS 3.x, where ceph-metrics was deployed on a
dedicated node, to RHCS 4.0, it fails as follows:
```
fatal: [magna005]: FAILED! => changed=false
gid: 0
group: root
mode: '0755'
msg: 'chown failed: failed to look up user ceph'
owner: root
path: /etc/ceph
secontext: unconfined_u:object_r:etc_t:s0
size: 4096
state: directory
uid: 0
```
This is because we are trying to run `ceph-config` on this node; it doesn't
make sense, so we should simply run this play on all groups except
`[grafana-server]`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
When osd_auto_discovery is set, we need to refresh the
ansible_devices fact after the filestore OSD purge,
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices list is empty at this point for osd_auto_discovery.
This also adds the bool filter where needed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bb3eae0c80)
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.
Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
/dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f995b079a6)
The new ceph status registered in `ceph_status` reports `fsmap.up` =
0 when it's the last mds, given that it's gathered after we shrink the mds,
which means the condition is wrong. This also adds a condition so we don't
try to delete the fs if a standby node is going to rejoin the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d0898aa5d)
This commit leaves add-osd.yml in place but marks the playbook as
deprecated.
Scaling up OSDs is now possible using --limit
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3496a0efa2)
play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b0c491800a)
This is needed after a change is made in systemd unit files.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1c2ec9fb40)
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d746575fd0)
There is no need to run these tasks n times from each monitor.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
1. Set the noout and nodeep-scrub flags.
2. Upgrade each OSD node, one by one, waiting for active+clean pgs.
3. After all OSD nodes are upgraded, unset the flags (as sketched below).
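A minimal sketch of the flag handling around the OSD upgrade plays
(delegated to the first monitor; `container_exec_cmd`, `cluster` and
`mon_group_name` are assumed to be set elsewhere):
```
# Hedged sketch: set the flags before the first OSD node is upgraded...
- name: set osd flags
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd set {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true

# ...upgrade each OSD node and wait for active+clean pgs...

# ...then unset them once the last OSD node is done.
- name: unset osd flags
  command: "{{ container_exec_cmd | default('') }} ceph --cluster {{ cluster }} osd unset {{ item }}"
  with_items:
    - noout
    - nodeep-scrub
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
```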
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
We only need the container_binary fact. Because we're not
gathering the facts from all nodes, the purge fails trying to get
one of the grafana facts.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a09d1c38bf)
There are some tasks using the new container image during the rolling
upgrade playbook that need the registry login to be executed first, otherwise
the nodes won't be able to pull the container image.
Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f344fdefe)
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 747555dfa6)
When using FQDNs in the inventory, that playbook fails because some tasks
use the result of ceph osd tree (which returns shortnames) to get
some data in hostvars[].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9ca6b05b)
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 931a842f21)
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise errors like "Device is
busy" will show up.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8056514134)
We only need to set `container_binary`.
Let's use `tasks_from` option.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0ae0a9ce28)
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 77b39d235b)
This commit prevents shrinking an mds node when max_mds wouldn't be
honored after that operation.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfe5a04bf)
This commit adds a task to ensure device mappers are properly closed when
the lvm batch scenario is used.
Otherwise, OSDs can't be redeployed, given that the devices are rejected
by ceph-volume because they are locked.
A condition `devices | default([]) | length > 0` is added so these dm
devices are removed only when using the lvm batch scenario.
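A minimal sketch of that task (the `osd_dm_devices` list is
hypothetical; in practice the mapper names come from the zapped OSDs):
```
# Hedged sketch: close leftover device mappers, only for the lvm batch scenario.
- name: ensure all dm devices are closed
  command: "dmsetup remove --force {{ item }}"
  with_items: "{{ osd_dm_devices | default([]) }}"  # hypothetical list of dm names
  failed_when: false
  when: devices | default([]) | length > 0
```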
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8e6ef818a2)
Otherwise, sometimes it can take a while for an OSD to be seen as down
and causes the `ceph osd purge` command to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51d601193e)
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e3305e6bb6)
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4833b85e04)
In a containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task only
checks that the daemon status is started (e.g. `state: started`).
This commit also removes the task which ensures services are started,
because it's already done in the ceph-iscsigw role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c7708eb458)
When upgrading from RHCS 3, the dashboard has obviously never been deployed,
which forces us to deploy it manually later.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 451c5ca934)
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commit adds podman support for purging the dashboard resources
in the purge-cluster playbook.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 89f6cc54a2)
Since we now support podman, let's rename the playbook so it's more
generic.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7bc7e3669d)
If the new mon/osd node doesn't have python installed then we need to
execute the tasks from raw_install_python.yml.
Closes: #4368
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34b03d1873)
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.
We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0
Also we should not fail on the ceph directory listing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 39cfe0aa65)
If the container binary is podman, we shouldn't try to stop docker here.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b18476a1a6)
In order to be able to call the container_binary tasks without having to
run the whole ceph-facts role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fe5ffe589e)
All containers are removed when systemd stops them.
There is no need to call this module in purge container playbook.
This commit also removes all docker_image tasks and removes all container
images in the final cleanup play.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776736
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d23383a820)
This is needed to avoid following error:
```
ERROR! The requested handler 'restart ceph mons' was not found in either the main handlers list nor in the listening handlers list
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a43a872105)
let's use `client_group_name` instead of hardcoding the name.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7fe0d55eff)
We must import this role in the first play, otherwise the first call to
`client_group_name` fails.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6526a25ab5)
In a containerized context, using the binary provided by the Atomic OS won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an option, but for a large cluster with hundreds
of client nodes, that would require pulling the image on each of them
just to unmap the rbd devices.
Let's use the sysfs method in order to avoid any issue related to the ceph
version shipped on the host.
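A minimal sketch of the sysfs approach (setups using the single-major
scheme may need /sys/bus/rbd/remove_single_major instead):
```
# Hedged sketch: unmap every mapped rbd device through sysfs, so no ceph CLI
# (or container image) is needed on the client node.
- name: unmap rbd devices (sysfs)
  shell: |
    for dev in /sys/bus/rbd/devices/*; do
      [ -e "$dev" ] || continue
      echo "$(basename "$dev")" > /sys/bus/rbd/remove
    done
  args:
    executable: /bin/bash
```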
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3cfcc7a105)
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9823f319b)