Commit Graph

744 Commits (d8b293cc9ba313a294855b01c78f2c895c3c52e2)

Author SHA1 Message Date
Guillaume Abrioux 0a3b916ee7 cephadm-adopt: add no_log: true
Let's add a `no_log: true` on the `cephadm registry-login` task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-09-28 08:11:03 +02:00
Guillaume Abrioux d12efa1ab4 adopt: stop iscsi services in the first place
If old containers are still running, it can make tcmu-runner process
unable to open devices and there's nothing else to do than restarting
the container.

Also, as per discussion with iscsi experts, iscsi should be migrated before
OSDs. (the client should be closed before the server)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2000412

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-09-27 19:46:37 +02:00
Daniel Pivonka 1c50dc29cf cephadm-adopt: set cephadm registry login info
registry login info needs to be stored in cluster for cephadm and future hosts

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2000103
Signed-off-by: Daniel Pivonka <dpivonka@redhat.com>
2021-09-13 11:14:22 +02:00
Seena Fallah ff39c8d70b purge: add remove_docker tag
This can help to skip docker removal tasks

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-09-09 13:25:45 +02:00
Seena Fallah a51ce767ca purge: add container_binary needed for zap osds
`container_binary` isn't set anymore in the purge osd play because of a
regression introduced by 60aa70a.
The CI didn't catch it because the play purging node-exporter sets this
variable for all nodes before we run the purge osd play.

This commit fixes this regression.

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-09-09 11:12:02 +02:00
Dimitri Savineau cddc23f511 purge-dashboard: remove cid files
This adds the service cid file cleanup as supported in the classic purge
playbook since b9dd253

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-09-08 15:40:46 +02:00
Dimitri Savineau 2630f8d47a cephadm-adopt: fix orch host add with FQDN
When a node is configured with FQDN as the hostname value then the
`ceph orch host add` command will fail because the `ansible_hostname` used
by that command contains the short hostname which won't match the current
hostname (FQDN)
Instead we can use the ansible_nodename fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1997083

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-26 15:50:32 -04:00
Dimitri Savineau 8ba6101bbb cephadm-adopt: remove ceph-nfs.target
This systemd target doesn't exist at all.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-18 20:08:22 +02:00
Guillaume Abrioux 09ef465f62 containers: introduce target systemd unit
This adds ceph-*.target systemd unit files support for containerized
deployments.
This also fixes a regression introduced by PR #6719 (rgw and nfs systemd
units not getting purged)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1962748

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-08-18 11:08:50 -04:00
Seena Fallah 67389d08d4 cephadm-adopt: use cephadm_ssh_user for ssh user
Use cephadm_ssh_user to set custom user (not root) for cephadm to ssh to the hosts

Signed-off-by: Seena Fallah <seenafallah@gmail.com>
2021-08-18 09:10:56 +02:00
Guillaume Abrioux c14e9114ba update: gather facts only one time
this play doesn't need to gather facts from localhost

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-08-17 14:41:17 -04:00
VasishtaShastry 478d9fdcb6 Fixes typo in rgw-add-users-buckets playbook
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
2021-08-09 15:35:55 +02:00
Guillaume Abrioux 930fc4c850 adopt: import rgw ssl certificate into kv store
Without this, when rgw is managed by cephadm, it fails to start because
the ssl certificate isn't present in the kv store.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1987010
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1988404

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-05 13:02:25 -04:00
Dimitri Savineau 7c38e64681 cephadm-adopt: remove nfs pool and namespace
This has been removed from the code (orch apply name).
The default pool name is now .nfs

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-05 16:59:54 +02:00
Dimitri Savineau 386661699b infra: use dedicated variables for balancer status
The balancer status is registered during the cephadm-adopt, rolling_update
and swith2container playbooks. But it is also used in the ceph-handler role
which is included in those playbooks too.
Even if the ceph-handler tasks are skipped for rolling_update and
switch2container, the balancer_status variable is erased with the skip task
result.

play1:
  register: balancer_status
play2:
  register: balancer_status <-- skipped
play3:
  when: (balancer_status.stdout | from_json)['active'] | bool

This leads to issue like:

The conditional check '(balancer_status.stdout | from_json)['active'] | bool'
failed. The error was: Unexpected templating type error occurred on
({% if (balancer_status.stdout | from_json)['active'] | bool %} True
{% else %} False {% endif %}): expected string or buffer.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1982054

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-04 17:39:54 +02:00
Dimitri Savineau 06471a4b82 osds: use osd pool ls instead of osd dump command
The ceph osd pool ls detail command is a subset of the ceph osd dump
command.

$ ceph osd dump --format json|wc -c
10117
$ ceph osd pool ls detail --format json|wc -c
4740

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-02 15:51:01 +02:00
Dimitri Savineau e87a47cf0c rolling_update: get ceph version when mons exist
eec3878 introduced a regression for upgrade scenarios where there's no
monitor nodes at all (like ganesha standalone, external clients, etc..)

TASK [get the ceph release being deployed] ************************************
task path: infrastructure-playbooks/rolling_update.yml:121
Thursday 29 July 2021  15:55:29 +0000 (0:00:00.484)       0:00:15.802 *********
fatal: [client0]: FAILED! =>
  msg: '''dict object'' has no attribute ''mons'''

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-08-02 15:47:56 +02:00
Benoît Knecht d7653dca95 infrastructure-playbooks: Get Ceph info in check mode
In the `set osd flags` block, run the Ceph commands that gather information
from the cluster (and don't make any changes to it) even when running in check
mode.

This allows the tasks that depend on the variables set by those tasks to
succeed in check mode.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2021-07-28 14:04:54 +02:00
Guillaume Abrioux eec38784ec update: check the ceph release
Check early which Ceph release is going to be deployed and fail if it
doesn't correspond to the ceph-ansible version being used.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1978643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-26 18:11:22 +02:00
Guillaume Abrioux 4144074a50 purge: support osd_auto_discovery
This adds a task that zaps by osd id so we can support the scenario
where osds were deployed with `osd_auto_discovery` is true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1876860

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Guillaume Abrioux 17cd83bf3a purge: merge playbooks
This refactor merges the two playbooks so we only have to maintain 1
playbook.
(Symlink the old purge-container-cluster.yml playbook for backward
 compatibility).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Guillaume Abrioux 6b50401d0c purge: drop variables from 'hosts' sections
Those variables are useless given this is not possible to override them.
Let's replace them with the hardcoded name instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-22 10:49:44 -04:00
Dimitri Savineau 738fa9428a common: remove unnecessary run_once statements
1303611 introduced tasks for disabling the pg_autoscaler on pools and
the balancer but thoses tasks are already executed on the first monitor
node so we don't need to add the run_once statement.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-21 09:55:21 -04:00
Dimitri Savineau cf6e33346e common: fix py2 pool_list from_json when skipped
When using python 2 and the task with a loop is skipped then it generates
an error.

Unexpected templating type error occurred on
({{ (pool_list.stdout | from_json)['pools'] }}): expected string or buffer

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-21 08:17:58 +02:00
Guillaume Abrioux 13036115e2 common: disable/enable pg_autoscaler
The PG autoscaler can disrupt the PG checks so the idea here is to
disable it and re-enable it back after the restart is done.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-20 07:37:07 +02:00
Guillaume Abrioux 60aa70a128 purge: reindent playbook
This commit reindents the playbook.
Also improve readability by adding an extra line between plays.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-13 09:47:30 -04:00
Dimitri Savineau a305296384 cephadm-adopt: enable osd memory autotune for HCI
This enables the osd_memory_target_autotune option on HCI environment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1973149

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-12 18:17:37 +02:00
Dimitri Savineau 97148dd58c rolling_update: check quorum state before upgrade
If one a the monitor is out of the quorum then nothing prevents the upgrade
playbook to run.
We only check if we have at least three monitor nodes but we should also
check if those monitor nodes are correctly present in the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1952571

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-12 18:16:22 +02:00
Guillaume Abrioux c396122ad9 update: fail the playbook if straw2 conversion failed
It's better to fail the playbook so the user is aware the straw2
migration has failed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 11:44:06 -04:00
Guillaume Abrioux 4eb4268dee update: followup on pr #6689
add mising 'osd' command.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 10:01:45 +02:00
Guillaume Abrioux eee576477c update: convert straw bucket
After an upgrade, the presence of straw buckets will produce the
following warning (HEALTH_WARN):

```
crush map has legacy tunables (require firefly, min is hammer)
```

because straw bucket is a firefly feature it needs to be converted to
straw2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967964

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-09 08:28:46 +02:00
Dimitri Savineau aeb9f562e5 cephadm-adopt: set application on ganesha pool
Set the nfs application to the ganesha pool.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1956840

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-08 20:35:58 +02:00
Dimitri Savineau 8e4ef7d6da infra: add playbook to purge dashboard/monitoring
The dashboard/monitoring stack can be deployed via the dashboard_enabled
variable. But there's nothing similar if we can to remove that part only
and keep the ceph cluster up and running.
The current purge playbooks remove everything.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786691

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-07-06 09:02:37 +02:00
Guillaume Abrioux 3b804a61dd cephadm_adopt: add any_errors_fatal on play
Add any_errors_fatal: true in cephadm-adopt playbook.
We should stop the playbook execution when a task throws an error.
Otherwise it can lead to unexpected behavior.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1976179

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-02 22:15:07 +02:00
Guillaume Abrioux 037d8cd05e purge: add monitoring group in final cleanup play
This adds the monitoring group in the "final cleanup play" so any cid
files generated are well removed when purging the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1974536

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-07-02 13:37:15 -04:00
Dimitri Savineau a05730b38a rhcs: remove ISO install method
Starting RHCS 5, there's no ISO available anymore.
This removes all ISO variables and the ceph_repository_type variable.

Closes: #6626

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-06-30 18:03:03 +02:00
Guillaume Abrioux 26a7256c4c shrink-mgr: modify existing mgr check
Do not rely on the inventory aliases in order to check if the selected
manager to be removed is present.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967897

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-29 14:53:19 +02:00
Guillaume Abrioux 31311b03ed cephadm-adopt/rgw: add host target in svc_id
If multi-realms were deployed with several instances belonging to the same
realm and zone using the same port on different nodes, the service id
expected by cephadm will be the same and therefore only one service will
be deployed. We need to create a service called
`<node>.<realm>.<zone>.<port>` to be sure the service name will be unique
and well deployed on the expected node in order to preserve backward
compatibility with the rgws instances that were deployed with
ceph-ansible.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-29 14:41:09 +02:00
Dimitri Savineau fc160b3be1 switch2container: run ceph-validate role
This adds the ceph-validate role before starting the switch to a containerized
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1968177

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2021-06-28 18:06:53 +02:00
Guillaume Abrioux fc784fc44c cephadm-adopt: support rgw multisite adoption
We need to support rgw multisite deployments.
This commit makes the adoption playbook support this kind of deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1967455

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-23 22:01:59 +02:00
Guillaume Abrioux f9a73149a4 cephadm-adopt: fix mgr placement hosts task
When no `[mgrs]` group is defined in the inventory, mgr daemon are
implicitly collocated with monitors.
This task currently relies on the length of the mgr group in order to
tell cephadm to deploy mgr daemons.
If there's no `[mgrs]` group defined in the inventory, it will ask
cephadm to deploy 0 mgr daemon which doesn't make sense and will throw
an error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1970313

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-06-14 10:38:37 +02:00
Guillaume Abrioux d6745e9cd9 fs2bs: use match filter in selectattr()
0990ae4109 changed the filter in
selectattr() from 'match' to 'equalto' but due to an incompatibility with
the Jinja2 version for python 2.7 on el7 we must stick to using 'match'
filter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-26 08:14:38 +02:00
Guillaume Abrioux 0990ae4109 fs2bs: fix wrong filter when setting osd_ids
using 'match' filter in that task will lead to bad behavior if I have
the following node names for instance:

- node1
- node11
- node111

with `selectattr('name', 'match', inventory_hostname)` it will match
'node1' along with 'node11' and 'node111'.

using 'equalto' filter will make sure we only match the target node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1963066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-25 16:59:30 +02:00
Guillaume Abrioux 2c77d0094c update: do not gather facts on each play
There's no benefit to gather facts again on each play in
rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-22 08:33:44 +02:00
Guillaume Abrioux 3db1ea7ec4 update: fix ceph-crash stop task
This is a workaround for an issue in ansible.
When trying to stop/mask/disable this service in one task, the stop
didn't actually happen, the task doesn't fail but for some reason the
container is still present and running.
Then the task starting the service in the role ceph-crash fails because
it can't start the container since it's already running with the same
name.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1955393

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-05-04 13:06:47 +02:00
Guillaume Abrioux 22c18e82f0 cephadm_adopt: fix ceph-crash migration
ceph-ansible leaves a ceph-crash container in containerized deployment.
It means we end up with 2 ceph-crash containers running after the
migration playbook is complete.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1954614

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-28 19:53:01 +02:00
Guillaume Abrioux 1f40c12502 cephadm_adopt: fix rgw placement task
Due to a recent breaking change in ceph, this command must be modified
to add the <svc_id> parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-27 13:37:56 +02:00
Guillaume Abrioux bb7d37fb6a cephadm_adopt: create a 'nfs-ganesha' pool
When migrating from a cluster with no MDS nodes deployed,
`{{ cephfs_data_pool.name }}` doesn't exist so we need to create a pool
for storing nfs export objects.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1950403

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-27 13:37:56 +02:00
Guillaume Abrioux ddbc11c4a9 switch-to-containers: only chown corresponding files
When collocating daemons, if we chown all files under `/var/lib/ceph` it
can cause issues for the collocated daemons that wouldn't have been
migrated yet.

This commit makes the playbook chown only the files corresponding to the
daemon being migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-14 21:32:20 +02:00
Guillaume Abrioux 3d4267051f fs2bs: add a final play
This removes the fact `skipped_nodes` which is useless when we run with
`--limit` since it gets reset when a new iteration is made.

Instead, let's print within a final play which node has been skipped
reusing the `skip_this_node` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2021-04-14 14:56:02 +02:00