Commit Graph

675 Commits (dbe940f1a7ded3ad1c68ce83877e1e265413d196)

Author SHA1 Message Date
Guillaume Abrioux dbe940f1a7 purge: ceph-crash purge fixes
This fixes the service file removal and makes the playbook
call `systemctl reset-failed` on the service because in Ceph
Nautilus, ceph-crash doesn't handle `SIGTERM` signal.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2055992

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2f11982590)
(cherry picked from commit 7a570c719e)
2022-05-09 13:45:16 +02:00
Guillaume Abrioux 48a8b1cc34 update: move a set_fact
ceph-facts roles makes decisions based on the fact `rolling_update` so
it must be called before we run this role.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5edcc4214)
2021-11-03 11:50:50 +01:00
Guillaume Abrioux a9a7c35a74 update: support --limit on monitor nodes
Change needed in order to support --limit on mon nodes.
Otherwise, a call to `hostvars[groups[mon_group_name][0]]['_current_monitor_address']`
throws an error:

```
"The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute '_current_monitor_address'"
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304#c28

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 82eee4303b)
2021-10-29 01:41:13 +02:00
Per Abildgaard Toft c5e4851a3f shrink-osd: fix regression because of a wrong regex
968891f449 introduced a regression.
The regex is wrong because it doesn't allow to shrink osds with id
greater than 9

Fixes: #6950

Signed-off-by: Per Abildgaard Toft <per@minfejl.dk>
(cherry picked from commit 84118a3063)
2021-10-26 16:39:33 +02:00
Guillaume Abrioux 3f4abb09b4 shrink-osd: check osd id format
This adds a check early in order to ensure the format of osd ids passed
is correct.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2005734

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 968891f449)
2021-10-26 16:39:33 +02:00
Guillaume Abrioux 2c9fc7f517 rolling_update: modify default health_osd_check_*
let's do more retries with a shorter delay.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 50a21d695e)
2021-10-26 16:39:09 +02:00
Guillaume Abrioux 3dd96da652 rolling_update: fix pre and post osd upgrade play
when using --limit osds, the play before and after osd upgrade are
skipped because we use `hosts: "{{ mon_group_name | default('mons') }}[0]"`
using `hosts: "{{ osds_group_name | default('osds') }}" with
`delegate_to` to the first monitor addresses this issue.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fc9f87c45f)
2021-10-25 23:22:35 +02:00
Guillaume Abrioux dc1a4c29ea update: support upgrading a subset of nodes
It can be useful in a large cluster deployment to split the upgrade and
only upgrade a group of nodes at a time.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2014304

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5cf9db2b0)
2021-10-25 23:22:35 +02:00
Seena Fallah 0a93de938b purge: add remove_docker tag
This can help to skip docker removal tasks

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
(cherry picked from commit ff39c8d70b)
2021-09-14 20:50:06 +02:00
Seena Fallah 0ede37b2ec purge: add container_binary needed for zap osds
`container_binary` isn't set anymore in the purge osd play because of a
regression introduced by 60aa70a.
The CI didn't catch it because the play purging node-exporter sets this
variable for all nodes before we run the purge osd play.

This commit fixes this regression.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
(cherry picked from commit a51ce767ca)
2021-09-09 14:40:53 +02:00
Dimitri Savineau 0d670c7942 purge-dashboard: remove cid files
This adds the service cid file cleanup as supported in the classic purge
playbook since b9dd253

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cddc23f511)
2021-09-08 12:05:33 -04:00
Guillaume Abrioux 20583e83dd containers: introduce target systemd unit
This adds ceph-*.target systemd unit files support for containerized
deployments.
This also fixes a regression introduced by PR #6719 (rgw and nfs systemd
units not getting purged)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1962748

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 09ef465f62)
2021-08-18 13:43:01 -04:00
Guillaume Abrioux 2d38d8266b update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c14e9114ba)
2021-08-17 15:47:38 -04:00
Dimitri Savineau 712a9c4403 switch2container: fix mon quorum check
This was reverted by 7ddbe74

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1990733

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-10 10:03:02 +02:00
VasishtaShastry 4ae9f321ac Fixes typo in rgw-add-users-buckets playbook
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
(cherry picked from commit 478d9fdcb6)
2021-08-09 14:31:55 -04:00
Dimitri Savineau e1e22933a7 add-osd: use container_exec_cmd fact from mon host
Because we're delegating the task to the first monitor node, we need to be
sure that the container_exec_cmd fact is the one from that node too otherwise
we could have a mismatch on the ceph-mon container name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1990772

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-09 15:48:23 +02:00
Dimitri Savineau 03ed9e111c infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issue like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 386661699b)
2021-08-04 11:48:13 -04:00
Dimitri Savineau 1044940304 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 06471a4b82)
2021-08-03 14:03:35 -04:00
Dimitri Savineau 5c6921e553 rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e87a47cf0c)
2021-08-03 12:42:28 -04:00
Benoît Knecht bacaa654b1 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit d7653dca95)
2021-08-02 15:54:34 +02:00
Guillaume Abrioux 77171216fb update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eec38784ec)
2021-07-26 14:10:24 -04:00
Guillaume Abrioux 907fb08956 purge: support osd_auto_discovery
This adds a task that zaps by osd id so we can support the scenario
where osds were deployed with `osd_auto_discovery` is true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876860

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4144074a50)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux 3dcfbc2edf purge: merge playbooks
This refactor merges the two playbooks so we only have to maintain 1
playbook.
(Symlink the old purge-container-cluster.yml playbook for backward
 compatibility).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 17cd83bf3a)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux e4fea521d9 purge: drop variables from 'hosts' sections
Those variables are useless given this is not possible to override them.
Let's replace them with the hardcoded name instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6b50401d0c)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux cf812d06e3 purge: reindent playbook
This commit reindents the playbook.
Also improve readability by adding an extra line between plays.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 60aa70a128)
2021-07-26 17:53:06 +02:00
Dimitri Savineau f433d06a93 rolling_update: check quorum state before upgrade
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 97148dd58c)
2021-07-26 17:49:23 +02:00
Dimitri Savineau c8ca73f620 infra: add playbook to purge dashboard/monitoring
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we can to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8e4ef7d6da)
2021-07-26 17:48:32 +02:00
Dimitri Savineau 8e939dc377 common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 738fa9428a)
2021-07-21 10:03:36 -04:00
Dimitri Savineau 17b9ff03d2 common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf6e33346e)
2021-07-21 09:54:46 -04:00
Guillaume Abrioux f7882bbc02 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13036115e2)
2021-07-21 09:40:18 -04:00
Guillaume Abrioux f0cd3c4f48 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
2021-07-09 17:29:54 -04:00
Guillaume Abrioux 65ce69567a update: followup on pr #6689
add mising 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4eb4268dee)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux 1179ea8b2f update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux 595a61c137 purge: add monitoring group in final cleanup play
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 037d8cd05e)
2021-07-02 14:37:18 -04:00
Guillaume Abrioux f0413c4a2b update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
2021-06-30 20:40:15 +02:00
Guillaume Abrioux 8802dcf05f shrink-mgr: modify existing mgr check
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 26a7256c4c)
2021-06-30 16:12:07 +02:00
Dimitri Savineau 25ea12d31d switch2container: run ceph-validate role
This adds the ceph-validate role before starting the switch to a containerized
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1968177

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fc160b3be1)
2021-06-30 09:32:34 +02:00
Guillaume Abrioux a391dad8e1 dashboard: fix typo introduced during backport
during backport of c8b92deba1 the pattern
should have been s/monitoring_group_name/grafana_server_group_name/

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964907

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ac0a5c1e68)
2021-05-26 18:51:18 +02:00
Guillaume Abrioux 0eed858952 fs2bs: use match filter in selectattr()
0990ae4109 changed the filter in
selectattr() from 'match' to 'equalto' but due to an incompatibility with
the Jinja2 version for python 2.7 on el7 we must stick to using 'match'
filter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d6745e9cd9)
2021-05-26 09:58:19 +02:00
Guillaume Abrioux abebf9b23e fs2bs: fix wrong filter when setting osd_ids
using 'match' filter in that task will lead to bad behavior if I have
the following node names for instance:

- node1
- node11
- node111

with `selectattr('name', 'match', inventory_hostname)` it will match
'node1' along with 'node11' and 'node111'.

using 'equalto' filter will make sure we only match the target node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1963066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0990ae4109)
2021-05-26 09:58:19 +02:00
Guillaume Abrioux 2d59f4579b update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
2021-05-05 09:47:32 +02:00
Guillaume Abrioux 650964a8c7 fs2bs: add a final play
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.

Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d4267051f)
2021-04-28 08:55:34 +02:00
Guillaume Abrioux b4340b71d9 switch-to-containers: only chown corresponding files
When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.

This commit makes the playbook chown only the files corresponding to the
daemon being migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ddbc11c4a9)
2021-04-15 05:24:44 +02:00
Guillaume Abrioux 3ef9690cd1 docker2podman: skip some role imports from handler
when running docker-to-podman playbook, there's no need to call
`ceph-config` and `ceph-rgw` from the role `ceph-handler`.
It can even have side effects when coming from a baremetal cluster that
was previously migrated using the switch-to-containers playbook. Indeed
it might complain about missing .target systemd unit since they are
removed during that migration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944999

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 70f19be367)
2021-04-12 13:30:31 +02:00
Guillaume Abrioux c0c90c6747 docker2podman: add documentation/header
this adds a small documentation in the header of the playbook in order
to explain what is the goal of this playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 36b4227dcd)
2021-04-12 09:45:20 +02:00
Guillaume Abrioux 3bf2c45123 switch_to_containers: support iscsigws migration
This adds the iscsigws migration to containers.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=<bz-number>

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c74c27321)
2021-04-09 15:28:27 +02:00
Guillaume Abrioux 5fd299e358 update: followup on 07029e1
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux 82b934cfc1 rolling_update: unmask monitor service after a failure
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 07029e1bf1)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux a8420d41c6 update: stop ceph-crash service before upgrading
This adds the missing service stop task for ceph-crash upgrade workflow.

It should have been added through commit
`15872e3db1e342238636bc9c8e1aef6bd1d3dcd8` in stable-4.0 but at the time
we backported this patch ceph-crash wasn't implemented yet so the
ceph-crash related content in this patch was removed. Then, ceph-crash
has been implemented later so we are still missing this part of the patch in
stable-4.0.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943471

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-26 16:18:50 +01:00
Alex Schultz 7ddbe74712 Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:16:58 +01:00