This commit fixes two issues in rolling_update.yml:
- `container_exec_cmd_update_osd` is unset in the `complete osd upgrade`
play so it never runs the command in a container.
- the 'require-osd-release' task is never applied because the condition
looks for luminous release.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1930164
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using docker 1.13.1, the current condition:
```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```
is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although documentation recommend
using `--cpus` when docker version is 1.13.1 or higher.
From the doc:
> --cpu-quota=<value> Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3e262e072b)
Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on a same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain. (either
because in lvm batch you can't pass partition or about gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e66f12d138)
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`
Fixes: #6090
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86a8889ee3)
ignore 302,303 and 505 errors
[302] Using command rather than an argument to e.g. file
[303] Using command rather than module
[505] referenced files must exist
they aren't relevant on these tasks.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 195d88fcda)
Fix ansible-lint 504 error:
[504] Do not use 'local_action', use 'delegate_to: localhost'
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c948b668eb)
Fix ansible-lint 502 error:
[502] All tasks should be named
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 97dd9218dd)
Fix ansible-lint 305 error:
[305] Use shell only when shell functionality is required
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 11b4bf5083)
Fix ansible lint 206 error:
[206] Variables should have spaces before and after: {{ var_name }}
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9fba6eecfa)
Fix ansible lint 301 error:
[301] Commands should not change things if nothing needs doing
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b3)
Fix ansible lint 306 error:
[306] Shells that use pipes should set the pipefail option
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
`ceph.target` should be disabled only. Otherwise, in collocation
scenario you stop other collocated services in the OSD play which isn't
what we want to do. Each daemon has its corresponding play for managing
the transition to container.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b05620597)
adding monitor is no longer possible because we generate a new mon
keyring each time the playbook is run.
Fixes: #5864
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1902281
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 970c6a4ee6)
Most ansible module using a state parameter default to the present
value (when available) instead of using it as a mandatory option.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit abb4023d76)
fa2bb3a only fix the symlink owner/group issue in the OSD play. If the
OSDs are collocated with other services like MONs and MGRs then the
chown command will fail.
$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 35ed9977aa)
When deploying the ceph OSD via the packages then the ceph-osd@.service
unit is configured as enabled-runtime.
This means that each ceph-osd service will inherit from that state.
The enabled-runtime systemd state doesn't survive after a reboot.
For non containerized deployment the OSD are still starting after a
reboot because there's the ceph-volume@.service and/or ceph-osd.target
units that are doing the job.
$ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t
ceph-osd@.service enabled-runtime
ceph-volume@.service enabled
ceph-osd.target enabled
When switching to containerized deployment we are stopping/disabling
ceph-osd@XX.servive, ceph-volume and ceph.target and then removing the
systemd unit files.
But the new systemd units for containerized ceph-osd service will still
inherit from ceph-osd@.service unit file.
As a consequence, if an OSD host is rebooting after the playbook execution
then the ceph-osd service won't come back because they aren't enabled at
boot.
This patch also adds a reboot and testinfra run after running the switch
to container playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fa2bb3af86)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.
$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit acddf4fb67)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the rgw/rbdmirror services status, we're only using the
servicmap structure in the ceph status output.
To optimize this, we could use the ceph service dump command which contains
the same needed information.
This command returns less information and is slightly faster than the ceph
status command.
$ ceph status -f json | wc -c
2001
$ ceph service dump -f json | wc -c
1105
$ time ceph status -f json > /dev/null
real 0m0.557s
user 0m0.516s
sys 0m0.040s
$ time ceph service dump -f json > /dev/null
real 0m0.454s
user 0m0.434s
sys 0m0.020s
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f9081931f)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.
$ ceph status -f json | wc -c
2001
$ ceph quorum_status -f json | wc -c
957
$ time ceph status -f json > /dev/null
real 0m0.577s
user 0m0.538s
sys 0m0.029s
$ time ceph quorum_status -f json > /dev/null
real 0m0.544s
user 0m0.527s
sys 0m0.016s
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.
$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null
real 0m0.529s
user 0m0.503s
sys 0m0.024s
$ time ceph pg stat -f json > /dev/null
real 0m0.426s
user 0m0.409s
sys 0m0.016s
The data returned by the ceph status is even bigger when using the
nautilus release.
$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
This commit adds the `osd_auto_discovery` scenario support in the
filestore-to-bluestore playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881523
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8b1eeef18a)
There's no need ot have a copy of this file in infrastructure-playbooks
directory.
playbooks in that directory can be run from the root dir of
ceph-ansible.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f906caa6da)
As of 2.10, group names containing a dash are invalid.
However, setting this option makes it still possible to use a dash in
group names and prevent this warning to show up.
It might need to be definitely addressed in a future ansible release.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1880476
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6938ed1302)
When running the switch2container playbook on a Debian based system
then the systemd unit path isn't the same than Red Hat based system.
Because the systemd unit files aren't removed then the new container
systemd unit isn't take in count.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c1af69a7e7)
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
This way we keep consistency with purge-container-cluster.yml playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4)
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.
Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
In the OSP context, during the rolling update the playbook fails
with the following error:
'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''
This PR just change the hosts field providing a valid mons group
value.
Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
When using fqdn in inventory host file, this task will fail because the
mds is registered with its shortname.
It means we must use `mds_to_kill_hostname` in this task.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677d)
ceph-volume can generate large logs at some point.
debug logs by definition should be enabled only when debugging.
Let's make it customizable with a variable which is set to `False` by
default.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.
This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a090)
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.
Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b2)
Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33)
This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.
This also add a missing `| default(False)` to avoid fowlloing error:
```
fatal: [ceph01]: FAILED! =>
msg: |-
The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2)
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
During the daemon upgrade we're
- stopping the service when it's not containerized
- running the daemon role
- start the service when it's not containerized
- restart the service when it's containerized
This implementation has multiple issue.
1/ We don't use the same service workflow when using containers
or baremetal.
2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.
3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.
This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.
This following comment isn't valid because we should have backported
ceph-crash implementation in stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):
~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.
The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:
```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```
This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)