With this commit, upgrading a cluster from Nautilus to Pacific with
active rgw multisite replication will be blocked.
This is because a lot of bugs are currently present in Pacific regarding
RGW multisite.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2063702
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
ceph-facts roles makes decisions based on the fact `rolling_update` so
it must be called before we run this role.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Change needed in order to support --limit on mon nodes.
Otherwise, a call to `hostvars[groups[mon_group_name][0]]['_current_monitor_address']`
throws an error:
```
"The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute '_current_monitor_address'"
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304#c28
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
when using --limit osds, the play before and after osd upgrade are
skipped because we use `hosts: "{{ mon_group_name | default('mons') }}[0]"`
using `hosts: "{{ osds_group_name | default('osds') }}" with
`delegate_to` to the first monitor addresses this issue.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It can be useful in a large cluster deployment to split the upgrade and
only upgrade a group of nodes at a time.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.
play1:
register: balancer_status
play2:
register: balancer_status <-- skipped
play3:
when: (balancer_status.stdout | from_json)['active'] | bool
This leads to issue like:
The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.
$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)
TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021 15:55:29 +0000 (0:00:00.484) 0:00:15.802 *********
fatal: [client0]: FAILED! =>
msg: '''dict object'' has no attribute ''mons'''
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.
This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When using python 2 and the task with a loop is skipped then it generates
an error.
Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):
```
crush map has legacy tunables (require firefly, min is hammer)
```
because straw bucket is a firefly feature it needs to be converted to
straw2.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Starting RHCS 5, there's no ISO available anymore.
This removes all ISO variables and the ceph_repository_type variable.
Closes: #6626
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If the legacy name `grafana-server` is still being used when upgrading
from Nautilus to Pacific, the task that sets the fact `rolling_update`
to `true` doesn't run on the node(s) included in that group. Indeed the
play where we set this fact (`rolling_update`) only runs on the group
`monitoring_group_name | default('monitoring')`.
As a workaround, we can run earlier the task which converts the
`grafana-server` group name to `monitoring`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935554
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant. In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.
Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
since master is now deploying quincy, we must update this.
Otherwise, it will fail like following:
```
Error EPERM: require_osd_release cannot be lowered once it has been set
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`
Fixes: #6090
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds ceph_volume_simple_{activate,scan} ansible modules for replacing
the command module usage with the ceph-volume simple activate/scan commands.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
cec994b introduced a regression when a mgr is collocated with a mon.
During the mon upgrade, the mgr service is masked to avoid to be
restarted on packages update.
Then the start mgr task is failing because the service is still masked.
Instead we should unmask it.
Fixes: #5983
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
bd611a7 introduced the new ceph_fs module but missed some tasks in
rolling_update and shrink-mds playbooks.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.
$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.
$ ceph status -f json | wc -c
2001
$ ceph quorum_status -f json | wc -c
957
$ time ceph status -f json > /dev/null
real 0m0.577s
user 0m0.538s
sys 0m0.029s
$ time ceph quorum_status -f json > /dev/null
real 0m0.544s
user 0m0.527s
sys 0m0.016s
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.
$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null
real 0m0.529s
user 0m0.503s
sys 0m0.024s
$ time ceph pg stat -f json > /dev/null
real 0m0.426s
user 0m0.409s
sys 0m0.016s
The data returned by the ceph status is even bigger when using the
nautilus release.
$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This adds the ceph_fs ansible module for replacing the command module
usage with the ceph fs command.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This change default value of grafana-server group name.
Adding some tasks in ceph-defaults in order to keep backward
compatibility.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise this will generate an ansible warning about the missing
filter.
[DEPRECATION WARNING]: evaluating xxx as a bare variable, this behaviour
will go away and you might need to add |bool to the expression in the
future.
Also see CONDITIONAL_BARE_VARS configuration toggle.. This feature will
be removed in version 2.12.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
In Pacific we're are sure that users already achieved the msgr2 because
that was introduced in Nautilus.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Most ansible module using a state parameter default to the present
value (when available) instead of using it as a mandatory option.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>