Commit Graph

665 Commits (ae452a86dc3ee8f79a548c08cc2a9121d307403e)

Author SHA1 Message Date
Francesco Pantano 8dd8675080 Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 14:53:44 +02:00
Guillaume Abrioux 3a8be20699 rolling_update: remove 'ignore_errors'
There's no need to use `ignore_errors: true` on these tasks.

Using a loop on the task stopping mon daemons allows us to avoid
duplicating this task, the `ignore_errors` isn't needed here because it
won't fail the playbook if one of the ID doesn't exist (shortname vs. fqdn)

Using the right condition on the task starting the mgr daemon allows us
to avoid using an `ignore_errors: true` as well.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit cec994b973)
2020-08-21 16:33:15 +02:00
Guillaume Abrioux 81d116b0ac shrink-mds: use mds_to_kill_hostname instead
When using fqdn in inventory host file, this task will fail because the
mds is registered with its shortname.

It means we must use `mds_to_kill_hostname` in this task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677d)
2020-08-18 15:09:57 -04:00
Guillaume Abrioux 004155d407 purge-cluster: use sysfs method for unmapping rbd devices
This way we keep consistency with purge-container-cluster.yml playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4)
2020-08-17 09:50:08 -04:00
Guillaume Abrioux 56d2b62e00 purge: import ceph-defaults in purge osd play
Otherwise, `ceph_volume_debug` variable is undefined

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 33a544644a)
2020-08-12 22:57:10 +02:00
Guillaume Abrioux 8a7e4193db common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-12 22:57:10 +02:00
Benoît Knecht 5d06c0eda9 purge-cluster: check if rbdmap exists
When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.

This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a090)
2020-08-06 12:01:50 -04:00
Kevin Coakley 92b400f433 Remove ceph-radosgw.target when switching to containerize daemons
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.

Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b2)
2020-08-06 11:43:12 -04:00
Guillaume Abrioux bd3439db75 shrink_osd: remove osd data directory
Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33)
2020-08-06 13:09:38 +02:00
Benoît Knecht ccefe7da9f shrink-osd: various fixes
This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.

This also add a missing `| default(False)` to avoid fowlloing error:

```
fatal: [ceph01]: FAILED! =>
  msg: |-
    The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2)
2020-08-06 13:09:38 +02:00
Dimitri Savineau 1dd9c43efc rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:43:36 -04:00
Dimitri Savineau 2ce60504bd rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:59:25 -04:00
Dimitri Savineau 8ea3fa1752 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

Finally, this adds the missing service stop task for ceph crash upgrade
workflow.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 10:59:25 -04:00
Guillaume Abrioux e6059fdcd3 ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-07-22 18:47:01 -04:00
Dimitri Savineau 0178114f3b cephadm: set the command as a fact
Set the cephadm cmd as a fact instead of rewriting the same command
over and over.
This also fix an issue when using docker as container engine because
the --docker cephadm parameter should be use before the subcommand
not after.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5ef965c4dc)
2020-07-20 22:48:07 -04:00
Dimitri Savineau b7fd3bc844 cephadm: add playbook
This adds a new playbook for deploying ceph via cephadm.

This also adds a new dedicated tox file for CI purpose.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 957903d561)
2020-07-16 12:00:14 -04:00
Dimitri Savineau a22855319b cephadm-adopt: delegate task for orch apply
This is a partial revert of b38019e because we don't want to execute
the whole play on the monitor otherwise if we have some empty group
like rgws or mdss then the orchestrator commands will still be
executed.
Instead we should keep the real target group name at play level and
delegate the orchestator commands to the monitor. The whole play
will be skipped is the group is empty.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9596494911)
2020-07-16 10:50:53 -04:00
Dimitri Savineau 585b3e476c cephadm-adopt: inform users about cephadm
Print a message at the end of the playbook to inform users that they
don't have to user ceph-ansible playbooks anymore as everything else
need to be done via cephadm (day 2 operation).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 75ae1b7e90)
2020-07-15 17:57:41 -04:00
Dimitri Savineau 4e4748b58d cephadm-adopt: refresh the service/daemon list
When reporting the orchestrator service/daemon list at the end of the
playbook, we can use the --refresh option otherwise we could have
an outdated output.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7164426456)
2020-07-15 17:57:41 -04:00
Dimitri Savineau bc2aebaa26 Revert "cephadm-adopt: remove the cephadm script"
This reverts commit c3bbc6b13c.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ceac81cd24)
2020-07-15 17:57:41 -04:00
Dimitri Savineau 48baf63bc2 cephadm-adopt: wait for monitor in quorum
After adopting a monitor we need to wait that monitor to join back
the quorum before moving to the next node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0c3a2b72ff)
2020-07-13 10:17:56 -04:00
Dimitri Savineau 980d1a8365 cephadm-adopt: add osd flags during adoption
Like rolling_update or switch2container playbooks, we need to set/unset
some osd flags before and after the OSD daemons adoption.
This also adds a task for waiting for clean pgs at then of an OSd node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d3b3c8948e)
2020-07-13 10:17:56 -04:00
Dimitri Savineau f4a9f00f20 cephadm-adopt: add iscsi support
The iSCSI support has been added recently in cephadm.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9fe2694711)
2020-07-13 10:17:56 -04:00
Dimitri Savineau d8a8d74625 cephadm-adopt: remove the cephadm script
At the end of the process when don't need the cephadm script.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c3bbc6b13c)
2020-07-13 10:17:56 -04:00
Dimitri Savineau 90f974abb0 cephadm-adopt: show orchestrator status
At the end of the playbook we can show the orchestrator status like
we do with the ceph status in initial deployment.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 381201a394)
2020-07-13 10:17:56 -04:00
Dimitri Savineau c5009101f1 cephadm-adopt: use placement parameter
It's better to use the --placement parameter when using ceph orch apply
commands to avoid confusion in the parameters.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 91a6c79e41)
2020-07-10 14:53:39 -04:00
Dimitri Savineau 3b9ff9ae26 cephadm-adopt: use custom dashboard images
cephadm uses default value for dashboard container images which need to
be customized by ansible for upstream or downstream purpose.
This feature wasn't present when cephadm-adopt.yml has been designed.
Also set the container_image_base variable for upgrade purpose.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f2d997396e)
2020-07-10 11:08:30 -04:00
Dimitri Savineau f4d62212c6 cephadm-adopt: run orch apply from monitors
It looks like we can't run the ceph orch apply commands on nodes other
than monitors even if it used to work in the past.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b38019e3ca)
2020-07-10 11:08:30 -04:00
Dimitri Savineau 9d6a33e114 cephadm-adopt: don't fail on systemd reset-failed
If the systemd service exists successfully then we don't need to reset
the failed state.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 27efcbc0e5)
2020-07-10 11:08:30 -04:00
Dimitri Savineau 0af87be5fc cephadm-adopt: copy client.admin keyring
The ceph config assimilate-conf command requires the client.admin
keyring which isn't present on all nodes most of the time.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fd36433826)
2020-07-10 11:08:30 -04:00
Guillaume Abrioux cdf61540d8 rgw: fix multi instances scaleout
When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.

The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:

```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```

This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
2020-07-03 06:37:34 +02:00
Dimitri Savineau 503bc893fb facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-03 06:37:08 +02:00
Guillaume Abrioux 688d5eebf7 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-06-29 17:13:03 -04:00
Dimitri Savineau c3e89983fc Add playbook for converting cluster to cephadm
The commit adds a new playbook for converting an existing ceph cluster
deployed by ceph-ansible to the cephadm orchestrator.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 548ff26256)
2020-06-29 09:45:22 -04:00
Dimitri Savineau 51cfb89501 ceph-osd: remove ceph-osd-run.sh script
Since we only have one scenario since nautilus then we can just move
the container start command from ceph-osd-run.sh to the systemd unit
service.
As a result, the ceph-osd-run.sh.j2 template and the
ceph_osd_docker_run_script_path variable are removed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 829990e60d)
2020-06-23 17:35:24 +02:00
Guillaume Abrioux a7fc4af06e docker2podman: make images pulling optional
This commit makes the images pulling skipped if podman isn't installed
on the machine.

In OSP context, the podman installation is done later in the workflow,
it means all `podman pull` commands will fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 37b20b6525)
2020-06-22 14:19:44 -04:00
Guillaume Abrioux 4fe8e12484 switch_to_containers: don't set noup flag
We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d384)
2020-06-17 09:24:02 -04:00
Dimitri Savineau b219b1abed switch_to_container: fix osd systemd regex
The systemd LOAD and ACTIVE fileds could have more than one space between
both values.
This update the systemd regex the same way we're using it in different
part of the code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5d)
2020-06-16 18:10:28 +02:00
Guillaume Abrioux c67b3d3530 switch_to_container: refact wait for pg check
There is no need to make this check with several steps.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8aed824f71)
2020-05-22 17:05:22 +02:00
Dimitri Savineau e6bfdd2e44 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-05-13 16:41:23 -04:00
Dimitri Savineau 9a7af0ce6a docker2podman: manage dashboard nodes
The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus)
were not manage during the docker to podman migration.

This adds the systemd container template of those services to a dedicated
file (systemd.yml) in order to include it in the docker2podman playbook.

This also adds the dashboard container images pull from docker to podman.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 252e78b4e4)
2020-05-13 16:41:11 -04:00
Dimitri Savineau 0114457e13 docker2podman: pull images from docker daemon
The docker2podman playbook only installs the podman package and updates
the systemd units with the right container_binary value.

We never pull the container image so if one service is restarted then
the container image will be pulled first before the service can start
which could cause longer downstream.

To avoid to download the container image from internet again we can just
pull it from the local docker daemon.

The container_{binding,package,service}_name variables are removed
because they are only used in the ceph-container-engine role which
isn't call in this playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d38f21aeba)
2020-05-13 16:41:11 -04:00
Dimitri Savineau 1e351bcdd7 filestore-to-bluestore: fix py2 on skipped tasks
When using skipped variables with from_json filter and python2 then we
need to have a default value otherwise the skipped task will fail.

Unexpected templating type error occurred on
({{ (ceph_volume_lvm_list.stdout | from_json) }}): expected string or
buffer

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2b9edba131)
2020-04-20 13:38:06 -04:00
Guillaume Abrioux 9b2d55c007 switch-to-containers: set and unset osd flags
The workflow in this playbook should be the same than in rolling_update,
we should first set noout and nodeep-scrub flags before migrating the
first osd and unset osd flags after the last osd is migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e0)
2020-04-06 18:05:10 +02:00
Guillaume Abrioux 529d99a691 update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6df7887f87)
2020-04-06 18:05:10 +02:00
Guillaume Abrioux 3b7459b3d9 docker2podman: call `container_options_facts.yml` on osd nodes
We must call `ceph-osd` role from `container_options_facts.yml` because
ceph-osd-run.sh.j2 needs variables set in this file.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1819681

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a4f54f6ee)
2020-04-02 10:17:50 -04:00
Guillaume Abrioux 4a9007ce3c remove *docker*.yml symlinks
This commits removes these two symlinks.
They were there for backward compatibility and were marked deprecated as
of stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9219991441)
2020-03-31 11:59:20 -04:00
Guillaume Abrioux 5272a0d1fc purge-container: get *all* osds id
Adding `--all` to the `systemctl list-units` command in order to get
*all* osds id on the node (including stoppped osds). Otherwise, it will
purge the cluster but there will be leftover after that.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1814542

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5e7962ccf6)
2020-03-31 10:58:48 -04:00
Dimitri Savineau 1b094acf24 container: remove ulimit nofile parameter
Since Ceph Octopus is python3 only we don't need to specify the max open
files anymore with the container engine.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 64701437de)
2020-03-30 09:22:28 -04:00
Guillaume Abrioux a94035e957 purge-container: clean legacy code
This commit removes a register which isn't used in this playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-03-12 09:45:12 -04:00
Dimitri Savineau 38a683e5bf filestore-to-bluestore: stop ceph-volume services
We only disable the ceph-osd services but not the ceph-volume lvm
services during the filestore to bluestore migration.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-05 17:53:32 -05:00
Dimitri Savineau d1316ce77b shrink-rbdmirror: fix presence after removal
We should add retry/delay to check the presence of the rbdmirror daemon
in the cluster status because the status takes some time to be updated.
Also the metadata.hostname isn't a good key to check because it doesn't
reflect the ansible_hostname fact. We should use metadata.id instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-03 10:32:15 +01:00
Dimitri Savineau a664159061 shrink-mgr: fix systemd condition
This playbook was using mds systemd condition.
Also a command task was using pipeline which is not allowed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-03 10:32:15 +01:00
Dimitri Savineau 08ac2e3034 shrink: don't use localhost node
The ceph-facts are running on localhost so if this node is using a
different OS/release that the ceph node we can have a mismatch between
docker/podman container binary.
This commit also reduces the scope of the ceph-facts role because we only
need the container_binary tasks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-03 10:32:15 +01:00
Dimitri Savineau 9d3b49293d purge: stop rgw instances by iteration
It looks like that the service module doesn't support wildcard anymore
for stopping/disabling multiple services.

fatal: [rgw0]: FAILED! => changed=false
  msg: 'This module does not currently support using glob patterns,
        found ''*'' in service name: ceph-radosgw@*'
...ignoring

Instead we should iterate over the rgw_instances list.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-02 16:32:06 +01:00
Guillaume Abrioux a084a2a347 common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-27 09:48:36 +01:00
Guillaume Abrioux 1de2bf9991 shrink-osd: support shrinking ceph-disk prepared osds
This commit adds the ceph-disk prepared osds support

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-26 11:45:41 -05:00
Guillaume Abrioux 55970b18f1 shrink-osd: don't run ceph-facts entirely
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-26 11:45:41 -05:00
Dimitri Savineau 535da53d69 filestore-to-bluestore: reuse dedicated journal
If the filestore configuration was using a dedicated journal with either
a partition or a LV/VG then we need to reuse this for bluestore DB.

When filestore is using a raw devices then we shouldn't destroy
everything (data + journal) but only data otherwise the journal
partition won't exist anymore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-25 16:07:21 +01:00
Dimitri Savineau 195944b123 doc: update infra playbooks statements
We don't need to copy the infrastructure playbooks in the root
ceph-ansible directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-25 15:27:52 +01:00
Benoît Knecht 8b3df4e418 infrastructure-playbooks: Run shrink-osd tasks on monitor
Instead of running shring-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.

This has the added advantage of becoming root on the monitor only, not
on localhost.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
2020-02-19 11:16:30 -05:00
Dimitri Savineau 100e3a044e purge-cluster: update package list to remove
We only support python3 so renaming all ceph python packages.
Some ceph packages were missing from the list (ceph-mon, ceph-osd or
rbd-mirror) or didn't exist anymore (ceph-fs-common, libcephfs1).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-17 11:33:15 +01:00
Guillaume Abrioux 3700aa5385 switch_to_containers: increase health check values
This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-07 14:59:14 -05:00
wujie1993 d8b0b3cbd9 purge: fix purge cluster failed
Fix purge cluster failed when local container images does not exist.

Purge node-exporter and grafana-server only when dashboard_enabled is set to True.

Signed-off-by: wujie1993 qq594jj@gmail.com
2020-01-31 12:09:46 -05:00
Dimitri Savineau cd76054f76 filestore-to-bluestore: fix undefine osd_fsid_list
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).

TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
  msg: '''osd_fsid_list'' is undefined

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-28 02:42:39 +01:00
Dimitri Savineau 83c5a1d7a8 filestore-to-bluestore: skip bluestore osd nodes
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-27 18:08:00 +01:00
Dimitri Savineau a9c2300545 filestore-to-bluestore: don't fail when with no PV
When the PV is already removed from the devices then we should not fail
to avoid errors like:

stderr: No PV found on device /dev/sdb.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-24 20:56:08 +01:00
Guillaume Abrioux e5812fe45b rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-22 11:29:36 -05:00
Dimitri Savineau bb3eae0c80 filestore-to-bluestore: fix osd_auto_discovery
When osd_auto_discovery is set then we need to refresh the
ansible_devices fact between after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices is empty at this point for osd_auto_discovery.
Adding the bool filter when needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-22 09:36:09 +01:00
Dimitri Savineau f995b079a6 filestore-to-bluestore: --destroy with raw devices
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.

Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
 stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  /dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-21 11:37:39 -05:00
Guillaume Abrioux 3e262e072b containers: use --cpus instead --cpu-quota
When using docker 1.13.1, the current condition:

```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```

is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although documentation recommend
using `--cpus` when docker version is 1.13.1 or higher.

From the doc:
> --cpu-quota=<value>	Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-16 13:51:43 -05:00
Guillaume Abrioux 3d0898aa5d shrink-mds: fix condition on fs deletion
the new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds given that it's done after we shrink the mds,
it means the condition is wrong. Also adding a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-15 10:40:14 +01:00
Guillaume Abrioux d853da2a68 update: remove legacy
This task is a code duplicate, probably a legacy, let's remove it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-13 15:18:45 -05:00
Guillaume Abrioux 3496a0efa2 osd: support scaling up using --limit
This commit lets add-osd.yml in place but mark the deprecation of the
playbook.
Scaling up OSDs is now possible using --limit

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-13 09:59:08 -05:00
Guillaume Abrioux b0c491800a docker2podman: use set_fact to override variables
play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-10 10:21:29 -05:00
Guillaume Abrioux 1c2ec9fb40 docker2podman: force systemd to reload config
This is needed after a change is made in systemd unit files.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-10 10:21:29 -05:00
Guillaume Abrioux d746575fd0 docker2podman: install podman
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-10 10:21:29 -05:00
Dimitri Savineau a09d1c38bf purge-iscsi-gateways: don't run all ceph-facts
We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-10 15:46:15 +01:00
Dimitri Savineau 3f344fdefe rolling_update: run registry auth before upgrading
There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-09 16:14:33 -05:00
Dimitri Savineau 747555dfa6 shrink-rgw: refact global workflow
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-09 19:02:17 +01:00
Guillaume Abrioux 0ae0a9ce28 shrink-mds: do not play ceph-facts entirely
We only need to set `container_binary`.
Let's use `tasks_from` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 10:39:27 -05:00
Guillaume Abrioux 77b39d235b shrink-mds: use fact from delegated node
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 10:06:43 -05:00
Guillaume Abrioux 38278a6bb5 shrink-mds: fix filesystem removal task
This commit deletes the filesystem when no more MDS is present after
shrinking operation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 10:06:43 -05:00
Guillaume Abrioux 2cfe5a04bf shrink-mds: ensure max_mds is always honored
This commit prevent from shrinking an mds node when max_mds wouldn't be
honored after that operation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 10:06:43 -05:00
Dimitri Savineau 931a842f21 purge-iscsi-gateways: remove node from dashboard
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-08 14:17:56 +01:00
Dimitri Savineau 42366f0a6c purge-container-cluster: prune exited containers
Remove all stopped/exited containers.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-08 11:13:46 +01:00
Guillaume Abrioux e665d8e239 tests: upgrade from octopus to octopus
on master we can't test upgrade from stable-4.0/CentOS 7 to
master/CentOS 8.

This commit refact the upgrade so we test upgrade from master/CentOS 8
to master/CentOS 8 (octopus to octopus)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-08 11:13:46 +01:00
Guillaume Abrioux 8056514134 filestore-to-bluestore: umount partitions before zapping them
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise error like "Device is
busy" will show up.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-19 09:22:25 +01:00
Guillaume Abrioux 8e6ef818a2 filestore-to-bluestore: ensure all dm are closed
This commit adds a task to ensure device mappers are well closed when
lvm batch scenario is used.
Otherwise, OSDs can't be redeployed given that devices that are rejected
by ceph-volume because they are locked.

Adding a condition `devices | default([]) | length > 0` to remove these
dm only when using lvm batch scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-11 09:04:41 -05:00
Guillaume Abrioux 51d601193e filestore-to-bluestore: force OSDs to be marked down
Otherwise, sometimes it can take a while for an OSD to be seen as down
and causes the `ceph osd purge` command to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-11 09:04:41 -05:00
Guillaume Abrioux e3305e6bb6 filestore-to-bluestore: do not use --destroy
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-11 09:04:41 -05:00
Guillaume Abrioux 4833b85e04 filestore-to-bluestore: add non containerized support
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-11 09:04:41 -05:00
Guillaume Abrioux 6d9ca6b05b shrink-osd: support fqdn in inventory
When using fqdn in inventory, that playbook fails because of some tasks
using the result of ceph osd tree (which returns shortname) to get
some datas in hostvars[].

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-09 10:52:38 -05:00
Guillaume Abrioux 332c39376b switch_to_containers: exclude clients nodes from facts gathering
just like site.yml and rolling_update, let's exclude clients node from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-09 10:49:13 -05:00
Guillaume Abrioux c7708eb458 update: restart iscsigws daemons after upgrade
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).

This commit also removes the task which ensure services are started
because it's already done in the role ceph-iscsigw.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-05 13:02:06 -05:00
Guillaume Abrioux 451c5ca934 upgrade: add dashboard deployment
when upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-05 13:02:06 -05:00
Dimitri Savineau 89f6cc54a2 purge-cluster: add podman support
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commits adds the podman support on purging the dashboard resources
in the purge-cluster playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-04 14:15:12 -05:00
Guillaume Abrioux f5a81b1790 purge: fix symlink to purge-container-cluster
ceph/ceph-ansible#4805 introduced a symlink to
purge-container-cluster.yml playbook which is broken.

This commit fixes it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-04 09:38:34 +01:00
Guillaume Abrioux 7bc7e3669d purge: rename playbook (container)
Since we now support podman, let's rename the playbook so it's more
generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 11:10:21 -05:00
Guillaume Abrioux b18476a1a6 purge: do not try to stop docker when binary is podman
If the container binary is podman, we shouldn't try to stop docker here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 13:29:52 +01:00