Commit Graph

785 Commits (46e719fda34fe1365bed7260b6e50df741063060)

Author SHA1 Message Date
Seena Fallah 67389d08d4 cephadm-adopt: use cephadm_ssh_user for ssh user
Use cephadm_ssh_user to set a custom (non-root) user for cephadm to ssh to the hosts.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-08-18 09:10:56 +02:00
Guillaume Abrioux c14e9114ba update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-08-17 14:41:17 -04:00
VasishtaShastry 478d9fdcb6 Fixes typo in rgw-add-users-buckets playbook
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
2021-08-09 15:35:55 +02:00
Guillaume Abrioux 930fc4c850 adopt: import rgw ssl certificate into kv store
Without this, when rgw is managed by cephadm, it fails to start because
the ssl certificate isn't present in the kv store.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1987010
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1988404

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-05 13:02:25 -04:00
Dimitri Savineau 7c38e64681 cephadm-adopt: remove nfs pool and namespace
This has been removed from the code (orch apply name).
The default pool name is now .nfs

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-05 16:59:54 +02:00
Dimitri Savineau 386661699b infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and switch2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is overwritten by the skipped
task's result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issues like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-04 17:39:54 +02:00
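The overwrite described in 386661699b can be reproduced outside Ansible. The Python sketch below models registered variables as a plain dict. The `register` helper and the dedicated variable name `balancer_status_update` are hypothetical and only illustrate the idea; they are not taken from the playbooks.

```python
import json

registered = {}  # stand-in for Ansible's per-host registered variables


def register(name, stdout=None, skipped=False):
    """Hypothetical helper: a skipped task still overwrites the variable."""
    if skipped:
        registered[name] = {"skipped": True, "changed": False}
    else:
        registered[name] = {"stdout": stdout, "changed": False}


def balancer_is_active(name):
    """Mimic `(var.stdout | from_json)['active'] | bool`."""
    return bool(json.loads(registered[name].get("stdout"))["active"])


# Shared name: the skipped task in "play2" erases the result from "play1".
register("balancer_status", stdout='{"active": true}')
register("balancer_status", skipped=True)
try:
    balancer_is_active("balancer_status")
    shared_name_failed = False
except TypeError:
    # No stdout on a skipped result, so there is nothing to parse as JSON.
    shared_name_failed = True

# Dedicated name (hypothetical): the skipped task writes to a different variable.
register("balancer_status_update", stdout='{"active": true}')
register("balancer_status", skipped=True)
dedicated_name_active = balancer_is_active("balancer_status_update")
```

With one variable per playbook, a skipped task in an included role can no longer clobber the value the later conditional depends on.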
Dimitri Savineau 06471a4b82 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-02 15:51:01 +02:00
Dimitri Savineau e87a47cf0c rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there are no
monitor nodes at all (like ganesha standalone, external clients, etc.)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-02 15:47:56 +02:00
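The failure in e87a47cf0c comes from looking up a group that the inventory does not define. A minimal Python sketch of the guard, assuming inventory groups behave like a dict keyed by group name (the `first_monitor` helper and the host names are illustrative assumptions, not playbook code):

```python
def first_monitor(groups, mon_group_name="mons"):
    """Return the first monitor host, or None when the group is absent or empty."""
    hosts = groups.get(mon_group_name, [])
    return hosts[0] if hosts else None


# Inventory with only external clients: there is no 'mons' key at all.
clients_only = {"clients": ["client0"]}
with_mons = {"mons": ["mon0", "mon1", "mon2"], "clients": ["client0"]}

# Direct indexing, similar to groups['mons'][0], raises on the clients-only inventory.
try:
    clients_only["mons"][0]
    direct_lookup_ok = True
except KeyError:
    direct_lookup_ok = False

guarded_clients = first_monitor(clients_only)  # nothing to query, skip the version check
guarded_mons = first_monitor(with_mons)
```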
Benoît Knecht d7653dca95 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2021-07-28 14:04:54 +02:00
Guillaume Abrioux eec38784ec update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-26 18:11:22 +02:00
Guillaume Abrioux 4144074a50 purge: support osd_auto_discovery
This adds a task that zaps by osd id so we can support the scenario
where osds were deployed with `osd_auto_discovery` set to true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876860

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Guillaume Abrioux 17cd83bf3a purge: merge playbooks
This refactor merges the two playbooks so we only have to maintain 1
playbook.
(Symlink the old purge-container-cluster.yml playbook for backward
 compatibility).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Guillaume Abrioux 6b50401d0c purge: drop variables from 'hosts' sections
Those variables are useless given that it is not possible to override them.
Let's replace them with the hardcoded names instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Dimitri Savineau 738fa9428a common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but those tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-21 09:55:21 -04:00
Dimitri Savineau cf6e33346e common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-21 08:17:58 +02:00
Guillaume Abrioux 13036115e2 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-20 07:37:07 +02:00
Guillaume Abrioux 60aa70a128 purge: reindent playbook
This commit reindents the playbook.
Also improve readability by adding an extra line between plays.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-13 09:47:30 -04:00
Dimitri Savineau a305296384 cephadm-adopt: enable osd memory autotune for HCI
This enables the osd_memory_target_autotune option on HCI environment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1973149

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-12 18:17:37 +02:00
Dimitri Savineau 97148dd58c rolling_update: check quorum state before upgrade
If one of the monitors is out of the quorum then nothing prevents the upgrade
playbook from running.
We only check that we have at least three monitor nodes but we should also
check that those monitor nodes are actually present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-12 18:16:22 +02:00
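The quorum check motivated in 97148dd58c can be sketched in Python as follows. It assumes the JSON printed by `ceph quorum_status --format json` carries a `quorum_names` list and that monitor names match the inventory host names; the sample payload and host names are made up for illustration.

```python
import json


def monitors_out_of_quorum(quorum_status_json, expected_monitors):
    """Return the expected monitors that are missing from the quorum."""
    status = json.loads(quorum_status_json)
    in_quorum = set(status.get("quorum_names", []))
    return sorted(set(expected_monitors) - in_quorum)


expected = ["mon0", "mon1", "mon2"]

# Three monitors are defined, but only two of them are in the quorum.
degraded = json.dumps({"quorum_names": ["mon0", "mon1"]})
healthy = json.dumps({"quorum_names": ["mon0", "mon1", "mon2"]})

missing_when_degraded = monitors_out_of_quorum(degraded, expected)
missing_when_healthy = monitors_out_of_quorum(healthy, expected)
```

Counting monitor nodes alone would accept the degraded case; comparing names against the quorum does not.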
Guillaume Abrioux c396122ad9 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 11:44:06 -04:00
Guillaume Abrioux 4eb4268dee update: followup on pr #6689
add missing 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 10:01:45 +02:00
Guillaume Abrioux eee576477c update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 08:28:46 +02:00
Dimitri Savineau aeb9f562e5 cephadm-adopt: set application on ganesha pool
Set the nfs application to the ganesha pool.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1956840

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-08 20:35:58 +02:00
Dimitri Savineau 8e4ef7d6da infra: add playbook to purge dashboard/monitoring
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we want to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-06 09:02:37 +02:00
Guillaume Abrioux 3b804a61dd cephadm_adopt: add any_errors_fatal on play
Add any_errors_fatal: true in cephadm-adopt playbook.
We should stop the playbook execution when a task throws an error.
Otherwise it can lead to unexpected behavior.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1976179

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-02 22:15:07 +02:00
Guillaume Abrioux 037d8cd05e purge: add monitoring group in final cleanup play
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-02 13:37:15 -04:00
Dimitri Savineau a05730b38a rhcs: remove ISO install method
Starting with RHCS 5, there's no ISO available anymore.
This removes all ISO variables and the ceph_repository_type variable.

Closes: #6626

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-06-30 18:03:03 +02:00
Guillaume Abrioux 26a7256c4c shrink-mgr: modify existing mgr check
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-29 14:53:19 +02:00
Guillaume Abrioux 31311b03ed cephadm-adopt/rgw: add host target in svc_id
If multi-realms were deployed with several instances belonging to the same
realm and zone using the same port on different nodes, the service id
expected by cephadm will be the same and therefore only one service will
be deployed. We need to create a service called
`<node>.<realm>.<zone>.<port>` to be sure the service name will be unique
and well deployed on the expected node in order to preserve backward
compatibility with the rgws instances that were deployed with
ceph-ansible.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-29 14:41:09 +02:00
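The naming collision described in 31311b03ed can be illustrated with a short Python sketch. The `<node>.<realm>.<zone>.<port>` form comes from the commit message; the node-less `<realm>.<zone>.<port>` form used for comparison, the helper name, and the sample values are assumptions.

```python
def rgw_service_id(node, realm, zone, port):
    """Build a service id in the `<node>.<realm>.<zone>.<port>` form."""
    return f"{node}.{realm}.{zone}.{port}"


# Two instances in the same realm and zone, same port, on different nodes.
instances = [
    ("rgw0", "realm1", "zone1", 8080),
    ("rgw1", "realm1", "zone1", 8080),
]

# Assumed node-less id: both instances collapse into a single service.
ids_without_node = {f"{realm}.{zone}.{port}" for _, realm, zone, port in instances}

# Including the node keeps one service per instance.
ids_with_node = {rgw_service_id(*instance) for instance in instances}
```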
Dimitri Savineau fc160b3be1 switch2container: run ceph-validate role
This adds the ceph-validate role before starting the switch to a containerized
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1968177

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-06-28 18:06:53 +02:00
Guillaume Abrioux fc784fc44c cephadm-adopt: support rgw multisite adoption
We need to support rgw multisite deployments.
This commit makes the adoption playbook support this kind of deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-23 22:01:59 +02:00
Guillaume Abrioux f9a73149a4 cephadm-adopt: fix mgr placement hosts task
When no `[mgrs]` group is defined in the inventory, mgr daemons are
implicitly collocated with monitors.
This task currently relies on the length of the mgr group in order to
tell cephadm to deploy mgr daemons.
If there's no `[mgrs]` group defined in the inventory, it will ask
cephadm to deploy 0 mgr daemons which doesn't make sense and will throw
an error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970313

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-14 10:38:37 +02:00
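A minimal Python sketch of the fallback implied by f9a73149a4, assuming inventory groups behave like a dict of host lists. The `mgr_placement_hosts` helper and the host names are hypothetical.

```python
def mgr_placement_hosts(groups):
    """Use the [mgrs] group when it has hosts, otherwise fall back to [mons]."""
    mgrs = groups.get("mgrs", [])
    return mgrs if mgrs else groups.get("mons", [])


# No [mgrs] group: mgr daemons are implicitly collocated with the monitors.
collocated = {"mons": ["mon0", "mon1", "mon2"]}
dedicated = {"mons": ["mon0", "mon1", "mon2"], "mgrs": ["mgr0", "mgr1"]}

collocated_hosts = mgr_placement_hosts(collocated)
dedicated_hosts = mgr_placement_hosts(dedicated)

# Relying on the length of the mgr group alone would request 0 daemons here.
naive_count = len(collocated.get("mgrs", []))
```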
Guillaume Abrioux d6745e9cd9 fs2bs: use match filter in selectattr()
0990ae4109 changed the filter in
selectattr() from 'match' to 'equalto' but due to an incompatibility with
the Jinja2 version for python 2.7 on el7 we must stick to using 'match'
filter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-26 08:14:38 +02:00
Guillaume Abrioux 0990ae4109 fs2bs: fix wrong filter when setting osd_ids
using 'match' filter in that task will lead to bad behavior if I have
the following node names for instance:

- node1
- node11
- node111

with `selectattr('name', 'match', inventory_hostname)` it will match
'node1' along with 'node11' and 'node111'.

using 'equalto' filter will make sure we only match the target node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1963066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-25 16:59:30 +02:00
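The difference between the two tests in 0990ae4109 can be shown with plain Python. This sketch uses `re.match` as a stand-in for the `match` test (both anchor the pattern at the start of the string only) and `==` as a stand-in for `equalto`; it is an analogy, not the playbook code.

```python
import re

node_names = ["node1", "node11", "node111"]
inventory_hostname = "node1"

# 'match' semantics: the pattern only has to match at the beginning of the name.
selected_with_match = [n for n in node_names if re.match(inventory_hostname, n)]

# 'equalto' semantics: the whole name has to be identical.
selected_with_equalto = [n for n in node_names if n == inventory_hostname]
```

Where `equalto` is unavailable, as on the older Jinja2 mentioned in d6745e9cd9, anchoring the pattern with a trailing `$` is one way to get the same effect with `match`; whether the playbook does this is not stated in the commit messages.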
Guillaume Abrioux 2c77d0094c update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-22 08:33:44 +02:00
Guillaume Abrioux 3db1ea7ec4 update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-04 13:06:47 +02:00
Guillaume Abrioux 22c18e82f0 cephadm_adopt: fix ceph-crash migration
ceph-ansible leaves a ceph-crash container in containerized deployment.
It means we end up with 2 ceph-crash containers running after the
migration playbook is complete.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1954614

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-28 19:53:01 +02:00
Guillaume Abrioux 1f40c12502 cephadm_adopt: fix rgw placement task
Due to a recent breaking change in ceph, this command must be modified
to add the <svc_id> parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-27 13:37:56 +02:00
Guillaume Abrioux bb7d37fb6a cephadm_adopt: create a 'nfs-ganesha' pool
When migrating from a cluster with no MDS nodes deployed,
`{{ cephfs_data_pool.name }}` doesn't exist so we need to create a pool
for storing nfs export objects.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1950403

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-27 13:37:56 +02:00
Guillaume Abrioux ddbc11c4a9 switch-to-containers: only chown corresponding files
When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.

This commit makes the playbook chown only the files corresponding to the
daemon being migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-14 21:32:20 +02:00
Guillaume Abrioux 3d4267051f fs2bs: add a final play
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.

Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-14 14:56:02 +02:00
Guillaume Abrioux a9220654f5 cephadm_adopt: support nfs-ganesha adoption
This commit adds the nfs-ganesha adoption support in the
`cephadm-adopt.yml` playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944504

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-12 14:43:19 +02:00
Guillaume Abrioux 1ffc4df6b6 cephadm_adopt: modify placement policy for rgw
the adoption playbook should use `radosgw_num_instances` in order to
determine how many rgw instances it should recreate.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1943170

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-12 14:43:19 +02:00
Guillaume Abrioux ee44d86072 cephadm_adopt: fix a typo
This play does nothing other than stopping/removing rgw daemons.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-12 14:43:19 +02:00
Guillaume Abrioux 36b4227dcd docker2podman: add documentation/header
this adds a short documentation header to the playbook in order
to explain the goal of this playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-12 09:30:26 +02:00
Guillaume Abrioux 70f19be367 docker2podman: skip some role imports from handler
when running docker-to-podman playbook, there's no need to call
`ceph-config` and `ceph-rgw` from the role `ceph-handler`.
It can even have side effects when coming from a baremetal cluster that
was previously migrated using the switch-to-containers playbook. Indeed
it might complain about missing .target systemd units since they are
removed during that migration.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1944999

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-09 15:28:50 +02:00
Guillaume Abrioux 2c74c27321 switch_to_containers: support iscsigws migration
This adds the iscsigws migration to containers.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=<bz-number>

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-09 13:37:55 +02:00
Guillaume Abrioux e9ddb972fe update: followup on 07029e1
Playbook must fail anyway, the `rescue` block has been introduced for
unmasking the unit after the playbook has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-26 21:27:02 +01:00
Guillaume Abrioux 14c472707c fix 'command -v' tasks
`command -v` is a shell builtin which needs a shell to run.

Fixes: #6325

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-18 20:29:05 +01:00
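The point of 14c472707c is that `command -v` has to go through a shell. A Python sketch of the same idea, assuming a POSIX `sh` is available on the host; the `resolve_with_shell` helper and the bogus binary name are illustrative only.

```python
import subprocess


def resolve_with_shell(name):
    """Run `command -v NAME` through sh and return the resolved path, or None."""
    result = subprocess.run(
        ["sh", "-c", 'command -v "$1"', "sh", name],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


sh_path = resolve_with_shell("sh")
missing = resolve_with_shell("definitely-not-a-real-binary-xyz")
```

Passing the name as a positional argument (`$1`) instead of interpolating it into the script avoids quoting problems.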
Guillaume Abrioux 07029e1bf1 rolling_update: unmask monitor service after a failure
if for some reason the playbook fails after the service was
stopped, disabled and masked and before it got restarted, enabled and
unmasked, the playbook leaves the service masked, which can confuse
users and forces them to unmask the unit manually.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917680

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-18 15:22:38 +01:00
Guillaume Abrioux b445df0479 cephadm_adopt: fetch and write ceph minimal config
This commit makes the playbook fetch the minimal current ceph
configuration and write it later on monitoring nodes so `cephadm` can
proceed with the adoption.
When a monitoring stack was deployed on a dedicated node, it means no
`ceph.conf` file was written, `cephadm` requires a `ceph.conf` in order
to adopt the daemon present on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1939887

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-17 17:39:12 +01:00
Guillaume Abrioux af95595c82 adopt: convert legacy grafana-server groupname early
This is a follow up on PR #6332

cephadm-adopt.yml playbook is affected by the same bug

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1938658

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-17 16:04:11 +01:00
Guillaume Abrioux 6ccc8b4722 update: convert legacy grafana-server groupname early
If the legacy name `grafana-server` is still being used when upgrading
from Nautilus to Pacific, the task that sets the fact `rolling_update`
to `true` doesn't run on the node(s) included in that group. Indeed the
play where we set this fact (`rolling_update`) only runs on the group
`monitoring_group_name | default('monitoring')`.
As a workaround, we can run earlier the task which converts the
`grafana-server` group name to `monitoring`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935554

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-03-15 15:25:48 +01:00
Alex Schultz a7f2fa73e6 Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
2021-03-08 20:54:02 +01:00
Guillaume Abrioux b9dd253a4f purge: rm service-cid files
This commit makes sure purge playbooks remove those files if for any reason they
have been left.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1920900

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-12 10:01:31 +01:00
Guillaume Abrioux 980a5a7df4 switch2container: do not serialize the ceph-crash migration
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-11 21:36:23 +01:00
Dimitri Savineau 950a6ae406 cephadm-adopt: remove prometheus workaround
This was fixed by [1][2]

[1] https://tracker.ceph.com/issues/45120
[2] https://github.com/ceph/ceph/commit/252d4b30

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-02-10 13:51:41 +01:00
Dimitri Savineau 48a456dc8c rolling_update: enforce ceph-container-engine
When running the rolling_update.yml playbook and adding the dashboard
component at the same time, the requirements (like container packages)
aren't installed.
This could lead to a failure when using authentication on the
container registry because the playbook will try to log in to the registry
but podman/docker aren't yet installed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1903504
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918650

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-02-10 08:17:11 +01:00
Dimitri Savineau 94af3c87d1 rolling_update: exclude clients from node-exporter
Since b105549 we don't install node-exporter on client nodes so we should
also exclude the client node from the node-exporter upgrade.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-02-09 14:41:13 +01:00
Guillaume Abrioux b9cdee40a2 update: update ceph release pattern in complete upgrade play
since master is now deploying quincy, we must update this.
Otherwise, it will fail as follows:

```
Error EPERM: require_osd_release cannot be lowered once it has been set
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-06 00:34:14 +01:00
Guillaume Abrioux 44fbadb50c rolling_update: pg check refactor
There's no need to achieve this in two tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-06 00:34:14 +01:00
Dimitri Savineau 76a663245d cephadm-adopt: use ceph_osd_flag module
There's no reason to not use the ceph_osd_flag module to set/unset osd
flags.
Also if there's no OSD nodes in the inventory then we don't need to
execute the set/unset play.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-02-03 08:29:31 +01:00
Dimitri Savineau 36fc04eaab purge-cluster: use parted ansible module
Instead of doing some scripting via the shell module, we can use the
parted ansible module to check the boot flag on partitions.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-02-03 08:28:22 +01:00
Guillaume Abrioux 984191ac7f purge: zap and destroy db and wal devices for lvm batch
Those devices (db/wal) are never zapped in lvm batch deployment.
Iterating over `dedicated_devices` and `bluestore_wal_devices` fixes
this issue.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1922926

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-02-01 13:01:58 -05:00
Dimitri Savineau 2734a12d44 cephadm-adopt: use radosgw modules for idempotency
When rerunning the cephadm-adopt.yml playbook the radosgw realm,
zonegroup and zone tasks will fail because the task isn't
idempotent.
Using the radosgw ansible modules solves that problem.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-01-29 21:07:39 +01:00
Dimitri Savineau 6886700a00 cephadm-adopt: make the playbook idempotent
If the cephadm-adopt.yml fails during the first execution and some
daemons have already been adopted by cephadm then we can't rerun
the playbook because the old container won't exist anymore.

Error: no container with name or ID ceph-mon-xxx found: no such container

If the daemons are adopted then the old systemd unit doesn't exist anymore
so any call to that unit with systemd will fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1918424

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-01-29 21:07:39 +01:00
Guillaume Abrioux e835b08f8f fs2bs: remove a legacy fact
since cf7345f143, we don't need to set
this fact anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-01-28 16:26:46 +01:00
Dimitri Savineau 13427eddac cephadm-adopt: add grafana group conversion
The grafana group conversion task wasn't present in the cephadm-adopt.yml
playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1917530

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-01-18 20:52:58 +01:00
Guillaume Abrioux e66f12d138 fs2bs: skip migration when a mix of fs and bs is detected
Since the default of `osd_objectstore` has changed as of 3.2, some
deployments might have a mix of filestore and bluestore OSDs on the same
node. In some specific cases, there's a possibility that a filestore OSD
shares a journal/db device with a bluestore OSD. We shouldn't try to
redeploy in this context because ceph-volume will complain (either
because in lvm batch you can't pass a partition or about the gpt header).
The safest option is to skip the migration on the node when such a mix
is detected or force all osds including those already using bluestore
(option `force_filestore_to_bluestore=True` has to be passed as an extra var).
If all OSDs are using filestore, then they will be migrated to
bluestore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1875777

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-01-12 14:40:25 -05:00
Guillaume Abrioux 175ffa1b88 switch2container: fix mon quorum check
The current check makes no sense because it checks whether any monitor
other than the one being played (either a previous one already converted or
a later one that isn't yet converted) is present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-01-11 14:42:45 -05:00
Dimitri Savineau 5b6f907a72 cephadm: remove loop on host add tasks
Instead of iterating over the host list to add the node/label to the
host orchestrator configuration, we can do it in parallel.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-16 15:14:28 +01:00
Dimitri Savineau 0108c9f941 purge-container-cluster: always prune force
Since podman 2.x, there's now a confirmation when running podman
container prune command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-09 14:46:45 -05:00
Dimitri Savineau 08f118077f library: add cephadm_adopt module
This adds cephadm_adopt ansible module for replacing the command module
usage with the cephadm adopt command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-02 09:15:44 +01:00
Guillaume Abrioux 86a8889ee3 common: do not use pipefail when not needed
Let's discard the ansible lint error 306 and add a "# noqa 306" on tasks
where we don't need `set -o pipefail`

Fixes: #6090

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-12-01 15:07:09 -05:00
Dimitri Savineau cf7345f143 consume ceph_volume module when possible
We should always use the ceph_volume ansible module when possible.
This patch replaces the ceph-volume inventory and lvm {list,zap} commands
called via the command/shell modules by the corresponding call with the
ceph_volume module.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-01 17:54:10 +01:00
Dimitri Savineau c3ed124d31 library: add cephadm_bootstrap module
This adds cephadm_bootstrap ansible module for replacing the command module
usage with the cephadm bootstrap command.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-01 10:30:05 +01:00
Dimitri Savineau 5da593604a library: add ceph_osd_flag module
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-12-01 10:29:11 +01:00
Dimitri Savineau 0b5b1de963 library: add ceph_osd module
This adds ceph_osd ansible module for replacing the command module
usage with the ceph osd destroy/down/in/out/purge/rm commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-30 16:53:45 +01:00
Dimitri Savineau eaf0ebfc85 library: add ceph_mgr_module module
This adds ceph_mgr_module ansible module for replacing the command module
usage with the ceph mgr module enable/disable commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-30 16:52:02 +01:00
Guillaume Abrioux 0b05620597 switch2containers: do not stop ceph.target in osd play
`ceph.target` should be disabled only. Otherwise, in collocation
scenario you stop other collocated services in the OSD play which isn't
what we want to do. Each daemon has its corresponding play for managing
the transition to container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-27 09:34:01 -05:00
Dimitri Savineau 3baac5ad5b library: add ceph_volume_simple_{activate,scan}
This adds ceph_volume_simple_{activate,scan} ansible modules for replacing
the command module usage with the ceph-volume simple activate/scan commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-25 10:09:42 +01:00
Guillaume Abrioux 970c6a4ee6 mon: refact initial keyring generation
adding monitor is no longer possible because we generate a new mon
keyring each time the playbook is run.

Fixes: #5864

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-25 09:34:44 +01:00
Guillaume Abrioux 195d88fcda lint: ignore 302,303,505 errors
ignore 302,303 and 505 errors

[302] Using command rather than an argument to e.g. file
[303] Using command rather than module
[505] referenced files must exist

they aren't relevant on these tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux c948b668eb lint: do not use 'local_action'
Fix ansible-lint 504 error:

[504] Do not use 'local_action', use 'delegate_to: localhost'

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux dfc7e6e4bd lint: trailing whitespace
Fix ansible-lint 201 error:

[201] Trailing whitespace

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 97dd9218dd lint: all tasks should be named
Fix ansible-lint 502 error:

[502] All tasks should be named

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 11b4bf5083 lint: use shell only when shell functionality is required
Fix ansible-lint 305 error:

[305] Use shell only when shell functionality is required

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 2011e4dbc8 lint: don't compare to literal true/false
Fix ansible lint 601 error:

[601] Don't compare to literal True/False

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 9fba6eecfa lint: variables should have spaces before and after
Fix ansible lint 206 error:

[206] Variables should have spaces before and after: {{ var_name }}

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 5450de58b3 lint: commands should not change things
Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Guillaume Abrioux 1879c26eb9 lint: set pipefail on shell tasks
Fix ansible-lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-11-23 08:33:47 +01:00
Dimitri Savineau 35ed9977aa switch2container: chown symlink in mon/mgr plays
fa2bb3a only fixes the symlink owner/group issue in the OSD play. If the
OSDs are collocated with other services like MONs and MGRs then the
chown command will fail.

$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-16 13:40:57 -05:00
Dimitri Savineau fa2bb3af86 switch2container: disable ceph-osd enabled-runtime
When the ceph OSDs are deployed via packages, the ceph-osd@.service
unit is configured as enabled-runtime.
This means that each ceph-osd service inherits that state.
The enabled-runtime systemd state doesn't survive a reboot.
For non-containerized deployments the OSDs still start after a
reboot because the ceph-volume@.service and/or ceph-osd.target
units take care of starting them.

$ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t
ceph-osd@.service     enabled-runtime
ceph-volume@.service  enabled
ceph-osd.target       enabled

When switching to a containerized deployment we stop/disable
ceph-osd@XX.service, ceph-volume and ceph.target and then remove the
systemd unit files.
But the new systemd units for the containerized ceph-osd services still
inherit from the ceph-osd@.service unit file.

As a consequence, if an OSD host reboots after the playbook execution
then the ceph-osd services won't come back because they aren't enabled at
boot.

This patch also adds a reboot and testinfra run after running the switch
to container playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-12 20:05:39 +01:00
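The enabled-runtime state described in the commit above can be spotted by scanning `systemctl list-unit-files` output. The following is a minimal sketch, not part of the playbook: the helper name `units_in_state` is hypothetical, and the sample text is the listing quoted in the commit message.

```python
def units_in_state(list_unit_files_output: str, state: str) -> list:
    """Return unit names whose second column equals `state`."""
    units = []
    for line in list_unit_files_output.splitlines():
        fields = line.split()
        # Expect at least "<unit> <state>"; skip blank or malformed lines.
        if len(fields) >= 2 and fields[1] == state:
            units.append(fields[0])
    return units


# Listing copied from the commit message above.
sample = """\
ceph-osd@.service     enabled-runtime
ceph-volume@.service  enabled
ceph-osd.target       enabled
"""

print(units_in_state(sample, "enabled-runtime"))  # ['ceph-osd@.service']
```

Any unit reported here would lose its enablement on reboot, which is the situation the commit addresses.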
Dimitri Savineau 3e49258377 rolling_update: always run cv simple scan/activate
There's no need to use a condition on the ceph release for the
ceph-volume simple commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-10 14:01:10 +01:00
Dimitri Savineau 3d3ce26327 rolling_update: fix mgr start with mon collocation
cec994b introduced a regression when a mgr is collocated with a mon.
During the mon upgrade, the mgr service is masked to avoid being
restarted on package updates.
The start mgr task then fails because the service is still masked.
Instead we should unmask it.

Fixes: #5983

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:10:17 +01:00
Dimitri Savineau 16afe90806 infrastructure: consume ceph_fs module
bd611a7 introduced the new ceph_fs module but missed some tasks in
rolling_update and shrink-mds playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:06:17 +01:00
Dimitri Savineau acddf4fb67 rolling_update: use ceph health instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the cluster health, we're using the health structure in the
ceph status output.
To optimize this, we could use the ceph health command which contains
the same needed information.

$ ceph status -f json | wc -c
2001
$ ceph health -f json | wc -c
46

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
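As an illustration of why the smaller payload from the commit above is sufficient, here is a minimal sketch that decides cluster health from `ceph health -f json` output alone. It assumes the JSON carries a top-level `status` field; the helper name `cluster_is_healthy` and the sample payload are hypothetical and were not captured from a real cluster.

```python
import json


def cluster_is_healthy(health_json: str, accepted=("HEALTH_OK",)) -> bool:
    """Return True when the `status` field is one of the accepted states."""
    health = json.loads(health_json)
    return health.get("status") in accepted


# Illustrative payload (assumed shape), not real cluster output.
sample = '{"status": "HEALTH_WARN", "checks": {}, "mutes": []}'

print(cluster_is_healthy(sample))                                        # False
print(cluster_is_healthy(sample, accepted=("HEALTH_OK", "HEALTH_WARN")))  # True
```

The `accepted` parameter is a design choice of this sketch, so that a caller can tolerate a warning state during an upgrade.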
Dimitri Savineau 3f9081931f rgw/rbdmirror: use service dump instead of ceph -s
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the rgw/rbdmirror services status, we're only using the
servicemap structure in the ceph status output.
To optimize this, we could use the ceph service dump command which contains
the same needed information.
This command returns less information and is slightly faster than the ceph
status command.

$ ceph status -f json | wc -c
2001
$ ceph service dump -f json | wc -c
1105
$ time ceph status -f json > /dev/null

real	0m0.557s
user	0m0.516s
sys	0m0.040s
$ time ceph service dump -f json > /dev/null

real	0m0.454s
user	0m0.434s
sys	0m0.020s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
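A rough sketch of reading daemon registrations from `ceph service dump -f json`, as discussed in the commit above. The layout assumed here (`services` → service name → `daemons`, with a `summary` entry alongside the daemon ids) is an assumption about the output format, and both the helper name `registered_daemons` and the sample payload are hypothetical.

```python
import json


def registered_daemons(service_dump_json: str, service: str) -> list:
    """List daemon ids registered for `service` in a service dump."""
    dump = json.loads(service_dump_json)
    daemons = dump.get("services", {}).get(service, {}).get("daemons", {})
    # "summary" is assumed to be a bookkeeping key, not a daemon id.
    return sorted(name for name in daemons if name != "summary")


# Illustrative payload (assumed shape), not real cluster output.
sample = json.dumps({
    "epoch": 3,
    "services": {
        "rgw": {"daemons": {"summary": "", "rgw0": {}, "rgw1": {}}},
    },
})

print(registered_daemons(sample, "rgw"))         # ['rgw0', 'rgw1']
print(registered_daemons(sample, "rbd-mirror"))  # []
```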
Dimitri Savineau 88f91d8c12 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
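The quorum check from the commit above only needs the `quorum_names` list. A minimal sketch, assuming that field is a top-level list of monitor names in the `ceph quorum_status -f json` output; the helper name `mon_in_quorum`, the monitor names, and the sample payload are hypothetical.

```python
import json


def mon_in_quorum(quorum_status_json: str, mon_name: str) -> bool:
    """Return True when `mon_name` appears in the `quorum_names` list."""
    status = json.loads(quorum_status_json)
    return mon_name in status.get("quorum_names", [])


# Illustrative payload (assumed shape), not real cluster output.
sample = '{"quorum": [0, 1], "quorum_names": ["mon0", "mon1"]}'

print(mon_in_quorum(sample, "mon0"))  # True
print(mon_in_quorum(sample, "mon2"))  # False
```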
Dimitri Savineau ee50588590 osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-11-03 09:05:33 +01:00
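To show how the pg state check from the commit above can work on the smaller `ceph pg stat -f json` payload, here is a minimal sketch. It assumes the output exposes a `pg_summary` object with `num_pgs` and a `num_pg_by_state` list of `name`/`num` pairs; the exact layout may differ between releases. The helper name `all_pgs_active_clean` and the sample payloads are hypothetical.

```python
import json


def all_pgs_active_clean(pg_stat_json: str) -> bool:
    """Return True when every pg is in a state starting with active+clean."""
    summary = json.loads(pg_stat_json).get("pg_summary", {})
    total = summary.get("num_pgs", 0)
    # States such as "active+clean+scrubbing" are counted as clean here.
    clean = sum(
        state["num"]
        for state in summary.get("num_pg_by_state", [])
        if state["name"].startswith("active+clean")
    )
    # An empty cluster (no pgs) is not treated as clean.
    return total > 0 and clean == total


# Illustrative payloads (assumed shape), not real cluster output.
healthy = json.dumps({"pg_summary": {
    "num_pgs": 8,
    "num_pg_by_state": [{"name": "active+clean", "num": 8}],
}})
degraded = json.dumps({"pg_summary": {
    "num_pgs": 8,
    "num_pg_by_state": [
        {"name": "active+clean", "num": 6},
        {"name": "active+undersized+degraded", "num": 2},
    ],
}})

print(all_pgs_active_clean(healthy))   # True
print(all_pgs_active_clean(degraded))  # False
```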