Commit Graph

665 Commits (0d670c7942d919bab5ea4ad4c31ade1a61be0336)

Author SHA1 Message Date
Dimitri Savineau 0d670c7942 purge-dashboard: remove cid files
This adds the service cid file cleanup as supported in the classic purge
playbook since b9dd253

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cddc23f511)
2021-09-08 12:05:33 -04:00
Guillaume Abrioux 20583e83dd containers: introduce target systemd unit
This adds ceph-*.target systemd unit files support for containerized
deployments.
This also fixes a regression introduced by PR #6719 (rgw and nfs systemd
units not getting purged)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1962748

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 09ef465f62)
2021-08-18 13:43:01 -04:00
Guillaume Abrioux 2d38d8266b update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c14e9114ba)
2021-08-17 15:47:38 -04:00
Dimitri Savineau 712a9c4403 switch2container: fix mon quorum check
This was reverted by 7ddbe74

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1990733

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-10 10:03:02 +02:00
VasishtaShastry 4ae9f321ac Fixes typo in rgw-add-users-buckets playbook
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
(cherry picked from commit 478d9fdcb6)
2021-08-09 14:31:55 -04:00
Dimitri Savineau e1e22933a7 add-osd: use container_exec_cmd fact from mon host
Because we're delegating the task to the first monitor node, we need to be
sure that the container_exec_cmd fact is the one from that node too otherwise
we could have a mismatch on the ceph-mon container name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1990772

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-09 15:48:23 +02:00
Dimitri Savineau 03ed9e111c infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issue like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 386661699b)
2021-08-04 11:48:13 -04:00
Dimitri Savineau 1044940304 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 06471a4b82)
2021-08-03 14:03:35 -04:00
Dimitri Savineau 5c6921e553 rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e87a47cf0c)
2021-08-03 12:42:28 -04:00
Benoît Knecht bacaa654b1 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit d7653dca95)
2021-08-02 15:54:34 +02:00
Guillaume Abrioux 77171216fb update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eec38784ec)
2021-07-26 14:10:24 -04:00
Guillaume Abrioux 907fb08956 purge: support osd_auto_discovery
This adds a task that zaps by osd id so we can support the scenario
where osds were deployed with `osd_auto_discovery` is true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876860

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4144074a50)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux 3dcfbc2edf purge: merge playbooks
This refactor merges the two playbooks so we only have to maintain 1
playbook.
(Symlink the old purge-container-cluster.yml playbook for backward
 compatibility).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 17cd83bf3a)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux e4fea521d9 purge: drop variables from 'hosts' sections
Those variables are useless given this is not possible to override them.
Let's replace them with the hardcoded name instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6b50401d0c)
2021-07-26 17:53:06 +02:00
Guillaume Abrioux cf812d06e3 purge: reindent playbook
This commit reindents the playbook.
Also improve readability by adding an extra line between plays.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 60aa70a128)
2021-07-26 17:53:06 +02:00
Dimitri Savineau f433d06a93 rolling_update: check quorum state before upgrade
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 97148dd58c)
2021-07-26 17:49:23 +02:00
Dimitri Savineau c8ca73f620 infra: add playbook to purge dashboard/monitoring
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we can to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8e4ef7d6da)
2021-07-26 17:48:32 +02:00
Dimitri Savineau 8e939dc377 common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 738fa9428a)
2021-07-21 10:03:36 -04:00
Dimitri Savineau 17b9ff03d2 common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf6e33346e)
2021-07-21 09:54:46 -04:00
Guillaume Abrioux f7882bbc02 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13036115e2)
2021-07-21 09:40:18 -04:00
Guillaume Abrioux f0cd3c4f48 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c396122ad9)
2021-07-09 17:29:54 -04:00
Guillaume Abrioux 65ce69567a update: followup on pr #6689
add mising 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4eb4268dee)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux 1179ea8b2f update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eee576477c)
2021-07-09 11:34:46 +02:00
Guillaume Abrioux 595a61c137 purge: add monitoring group in final cleanup play
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 037d8cd05e)
2021-07-02 14:37:18 -04:00
Guillaume Abrioux f0413c4a2b update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c77d0094c)
2021-06-30 20:40:15 +02:00
Guillaume Abrioux 8802dcf05f shrink-mgr: modify existing mgr check
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 26a7256c4c)
2021-06-30 16:12:07 +02:00
Dimitri Savineau 25ea12d31d switch2container: run ceph-validate role
This adds the ceph-validate role before starting the switch to a containerized
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1968177

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fc160b3be1)
2021-06-30 09:32:34 +02:00
Guillaume Abrioux a391dad8e1 dashboard: fix typo introduced during backport
during backport of c8b92deba1 the pattern
should have been s/monitoring_group_name/grafana_server_group_name/

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1964907

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ac0a5c1e68)
2021-05-26 18:51:18 +02:00
Guillaume Abrioux 0eed858952 fs2bs: use match filter in selectattr()
0990ae4109 changed the filter in
selectattr() from 'match' to 'equalto' but due to an incompatibility with
the Jinja2 version for python 2.7 on el7 we must stick to using 'match'
filter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d6745e9cd9)
2021-05-26 09:58:19 +02:00
Guillaume Abrioux abebf9b23e fs2bs: fix wrong filter when setting osd_ids
using 'match' filter in that task will lead to bad behavior if I have
the following node names for instance:

- node1
- node11
- node111

with `selectattr('name', 'match', inventory_hostname)` it will match
'node1' along with 'node11' and 'node111'.

using 'equalto' filter will make sure we only match the target node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1963066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0990ae4109)
2021-05-26 09:58:19 +02:00
Guillaume Abrioux 2d59f4579b update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3db1ea7ec4)
2021-05-05 09:47:32 +02:00
Guillaume Abrioux 650964a8c7 fs2bs: add a final play
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.

Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d4267051f)
2021-04-28 08:55:34 +02:00
Guillaume Abrioux b4340b71d9 switch-to-containers: only chown corresponding files
When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.

This commit makes the playbook chown only the files corresponding to the
daemon being migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ddbc11c4a9)
2021-04-15 05:24:44 +02:00
Guillaume Abrioux 3ef9690cd1 docker2podman: skip some role imports from handler
when running docker-to-podman playbook, there's no need to call
`ceph-config` and `ceph-rgw` from the role `ceph-handler`.
It can even have side effects when coming from a baremetal cluster that
was previously migrated using the switch-to-containers playbook. Indeed
it might complain about missing .target systemd unit since they are
removed during that migration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944999

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 70f19be367)
2021-04-12 13:30:31 +02:00
Guillaume Abrioux c0c90c6747 docker2podman: add documentation/header
this adds a small documentation in the header of the playbook in order
to explain what is the goal of this playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 36b4227dcd)
2021-04-12 09:45:20 +02:00
Guillaume Abrioux 3bf2c45123 switch_to_containers: support iscsigws migration
This adds the iscsigws migration to containers.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=<bz-number>

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2c74c27321)
2021-04-09 15:28:27 +02:00
Guillaume Abrioux 5fd299e358 update: followup on 07029e1
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9ddb972fe)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux 82b934cfc1 rolling_update: unmask monitor service after a failure
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked and which can make users
confused and forces them to unmask the unit manually.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 07029e1bf1)
2021-03-29 15:22:23 +02:00
Guillaume Abrioux a8420d41c6 update: stop ceph-crash service before upgrading
This adds the missing service stop task for ceph-crash upgrade workflow.

It should have been added through commit
`15872e3db1e342238636bc9c8e1aef6bd1d3dcd8` in stable-4.0 but at the time
we backported this patch ceph-crash wasn't implemented yet so the
ceph-crash related content in this patch was removed. Then, ceph-crash
has been implemented later so we are still missing this part of the patch in
stable-4.0.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943471

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-26 16:18:50 +01:00
Alex Schultz 7ddbe74712 Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:16:58 +01:00
Guillaume Abrioux 2cd8c3637c fix 'command -v' tasks
`command -v` is a bash script which needs a shell to run.

Fixes: #6325

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 14c472707c)
2021-03-22 13:53:11 +01:00
Guillaume Abrioux 0d0723298f purge: rm service-cid files
This commit makes sure purge playbooks remove those file if for any reason they
have been left.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1920900

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b9dd253a4f)
2021-03-11 13:52:48 +01:00
Guillaume Abrioux 932abbc8cf switch2container: do not serialize the ceph-crash migration
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 980a5a7df4)
2021-03-11 13:52:39 +01:00
Dimitri Savineau 8f26ffdbac rolling_update: enforce ceph-container-engine
When running the rolling_update.yml playbook and adding the dashboard
component in the same time then the requirement (like container packages)
aren't installed.
This could lead to a failure in case of using authentication on the
container registry because the playbook will try to login on the registry
but podman/docker aren't yet installed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 48a456dc8c)
2021-03-11 13:52:21 +01:00
Dimitri Savineau 3ba27c9387 rolling_update: exclude clients from node-exporter
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 94af3c87d1)
2021-03-11 13:52:02 +01:00
Guillaume Abrioux 1b424ad5e9 purge: zap and destroy db and wal devices for lvm batch
Those devices (db/wal) are never zapped in lvm batch deployment.
Iterating over `dedicated_devices` and `bluestore_wal_devices` fixes
this issue.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1922926

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 984191ac7f)
2021-03-11 13:51:38 +01:00
Guillaume Abrioux bb1f66cb51 switch2container: fix mon quorum check
The current check makes no sense because it checks any of other monitor
than the one being played (either a previous one already converted or a
next that isn't yet converted) is present on the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 175ffa1b88)
2021-03-11 13:50:27 +01:00
Guillaume Abrioux 858048560e update: fix require-osd-release task
This commit fixes two issues in rolling_update.yml:

- `container_exec_cmd_update_osd` is unset in the `complete osd upgrade`
play so it never runs the command in a container.
- the 'require-osd-release' task is never applied because the condition
  looks for luminous release.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1930164

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-18 22:22:06 +01:00
Guillaume Abrioux b903446fa4 containers: use --cpus instead --cpu-quota
When using docker 1.13.1, the current condition:

```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```

is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although documentation recommend
using `--cpus` when docker version is 1.13.1 or higher.

From the doc:
> --cpu-quota=<value>	Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3e262e072b)
2021-01-28 16:37:50 -05:00
Guillaume Abrioux a36eee1852 fs2bs: skip migration when a mix of fs and bs is detected
Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on a same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain. (either
because in lvm batch you can't pass partition or about gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e66f12d138)
2021-01-22 11:37:40 -05:00