Commit Graph

677 Commits (36fceeaaff4a05f23b1b176e4166080614ab7fab)

Author SHA1 Message Date
Guillaume Abrioux 8cf17750ee shrink_osd: remove osd data directory
Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33)
2020-08-06 13:10:42 +02:00
Benoît Knecht 4052ab29f2 shrink-osd: various fixes
This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.

This also add a missing `| default(False)` to avoid fowlloing error:

```
fatal: [ceph01]: FAILED! =>
  msg: |-
    The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2)
2020-08-06 13:10:42 +02:00
Dimitri Savineau cbdff5f95b rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:49:15 -04:00
Dimitri Savineau 7a970ac028 rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:49:02 -04:00
Dimitri Savineau 15872e3db1 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

This following comment isn't valid because we should have backported
ceph-crash implementation in stable-4.0 before this commit, which was not
possible because of the needed tag v4.0.25.1 (async release for 4.1z1):

~~Finally, this adds the missing service stop task for ceph crash upgrade
workflow.~~

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 09:43:01 -04:00
Guillaume Abrioux 02e7468b4a update: use tasks_from when including ceph-facts
When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
2020-07-23 17:26:04 +02:00
Dimitri Savineau 5db4219f26 facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-20 21:23:48 +02:00
Guillaume Abrioux 518f4f579d rgw: fix multi instances scaleout
When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.

The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:

```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```

This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
2020-07-20 21:23:27 +02:00
Guillaume Abrioux 328db8bee1 rolling_update: add any_errors_fatal
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
2020-07-20 21:22:25 +02:00
Dimitri Savineau a99c94ea11 ceph-osd: remove ceph-osd-run.sh script
Since we only have one scenario since nautilus then we can just move
the container start command from ceph-osd-run.sh to the systemd unit
service.
As a result, the ceph-osd-run.sh.j2 template and the
ceph_osd_docker_run_script_path variable are removed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 829990e60d)
2020-06-23 17:35:01 +02:00
Guillaume Abrioux 4e42503218 docker2podman: make images pulling optional
This commit makes the images pulling skipped if podman isn't installed
on the machine.

In OSP context, the podman installation is done later in the workflow,
it means all `podman pull` commands will fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 37b20b6525)
2020-06-22 14:46:38 -04:00
Guillaume Abrioux 085341642e switch-to-containers: set and unset osd flags
The workflow in this playbook should be the same than in rolling_update,
we should first set noout and nodeep-scrub flags before migrating the
first osd and unset osd flags after the last osd is migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e0)
2020-06-17 12:15:49 -04:00
Guillaume Abrioux c847c2f117 switch_to_containers: don't set noup flag
We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d384)
2020-06-17 09:24:19 -04:00
Dimitri Savineau a165edb5ba switch_to_container: fix osd systemd regex
The systemd LOAD and ACTIVE fileds could have more than one space between
both values.
This update the systemd regex the same way we're using it in different
part of the code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5d)
2020-06-16 12:10:36 -04:00
Dimitri Savineau a97e24fee9 docker2podman: manage dashboard nodes
The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus)
were not manage during the docker to podman migration.

This adds the systemd container template of those services to a dedicated
file (systemd.yml) in order to include it in the docker2podman playbook.

This also adds the dashboard container images pull from docker to podman.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 252e78b4e4)
2020-06-03 13:20:24 -04:00
Dimitri Savineau 6f893e5ed9 docker2podman: pull images from docker daemon
The docker2podman playbook only installs the podman package and updates
the systemd units with the right container_binary value.

We never pull the container image so if one service is restarted then
the container image will be pulled first before the service can start
which could cause longer downstream.

To avoid to download the container image from internet again we can just
pull it from the local docker daemon.

The container_{binding,package,service}_name variables are removed
because they are only used in the ceph-container-engine role which
isn't call in this playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d38f21aeba)
2020-06-03 13:20:24 -04:00
Dimitri Savineau 8c4865cd14 rolling_update: fix rbdmirror group name
The rbdmirror group name was using the wrong variable definition.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
2020-06-03 13:20:03 -04:00
Dimitri Savineau 1921ace52d docker-to-podman: conditional docker commands
The docker commands should be based on the container_binary variable
otherwise running the playbook on a host without docker (like podman
only) will failed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829985

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-06-03 13:19:28 -04:00
Dimitri Savineau 46c640c169 filestore-to-bluestore: fix py2 on skipped tasks
When using skipped variables with from_json filter and python2 then we
need to have a default value otherwise the skipped task will fail.

Unexpected templating type error occurred on
({{ (ceph_volume_lvm_list.stdout | from_json) }}): expected string or
buffer

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2b9edba131)
2020-04-20 13:38:19 -04:00
Guillaume Abrioux ba6bd3ca3d docker2podman: call `container_options_facts.yml` on osd nodes
We must call `ceph-osd` role from `container_options_facts.yml` because
ceph-osd-run.sh.j2 needs variables set in this file.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1819681

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4a4f54f6ee)
2020-04-02 11:01:14 -04:00
Guillaume Abrioux 32f879de32 purge-container: get *all* osds id
Adding `--all` to the `systemctl list-units` command in order to get
*all* osds id on the node (including stoppped osds). Otherwise, it will
purge the cluster but there will be leftover after that.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1814542

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5e7962ccf6)
2020-03-31 11:00:41 -04:00
Dimitri Savineau e2f1a0ade8 doc: update infra playbooks statements
We don't need to copy the infrastructure playbooks in the root
ceph-ansible directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 195944b123)
2020-03-16 14:43:35 +01:00
Dimitri Savineau 957156c0fe filestore-to-bluestore: stop ceph-volume services
We only disable the ceph-osd services but not the ceph-volume lvm
services during the filestore to bluestore migration.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 38a683e5bf)
2020-03-12 21:10:33 +01:00
Dimitri Savineau 928c792f8d filestore-to-bluestore: reuse dedicated journal
If the filestore configuration was using a dedicated journal with either
a partition or a LV/VG then we need to reuse this for bluestore DB.

When filestore is using a raw devices then we shouldn't destroy
everything (data + journal) but only data otherwise the journal
partition won't exist anymore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 535da53d69)
2020-03-12 21:10:33 +01:00
Dimitri Savineau 3b0ee83594 shrink-rbdmirror: fix presence after removal
We should add retry/delay to check the presence of the rbdmirror daemon
in the cluster status because the status takes some time to be updated.
Also the metadata.hostname isn't a good key to check because it doesn't
reflect the ansible_hostname fact. We should use metadata.id instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d1316ce77b)
2020-03-03 15:19:45 +01:00
Dimitri Savineau 4b07d97346 shrink-mgr: fix systemd condition
This playbook was using mds systemd condition.
Also a command task was using pipeline which is not allowed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a664159061)
2020-03-03 15:19:45 +01:00
Dimitri Savineau 92b671bcbe shrink: don't use localhost node
The ceph-facts are running on localhost so if this node is using a
different OS/release that the ceph node we can have a mismatch between
docker/podman container binary.
This commit also reduces the scope of the ceph-facts role because we only
need the container_binary tasks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 08ac2e3034)
2020-03-03 15:19:45 +01:00
Dimitri Savineau e037e99bd2 purge: stop rgw instances by iteration
It looks like that the service module doesn't support wildcard anymore
for stopping/disabling multiple services.

fatal: [rgw0]: FAILED! => changed=false
  msg: 'This module does not currently support using glob patterns,
        found ''*'' in service name: ceph-radosgw@*'
...ignoring

Instead we should iterate over the rgw_instances list.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9d3b49293d)
2020-03-03 10:31:48 +01:00
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
Guillaume Abrioux d254a8b938 shrink-osd: support shrinking ceph-disk prepared osds
This commit adds the ceph-disk prepared osds support

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1de2bf9991)
2020-02-26 18:16:48 +01:00
Guillaume Abrioux 21851457d6 shrink-osd: don't run ceph-facts entirely
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55970b18f1)
2020-02-26 18:16:48 +01:00
Benoît Knecht 10b3bb2727 infrastructure-playbooks: Run shrink-osd tasks on monitor
Instead of running shring-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.

This has the added advantage of becoming root on the monitor only, not
on localhost.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b3df4e418)
2020-02-24 16:51:33 -05:00
Guillaume Abrioux 1d2a395aaf switch_to_containers: increase health check values
This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3700aa5385)
2020-02-10 12:57:17 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecation
We can definitely drop it in stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Guillaume Abrioux 5c3ba0787c switch_to_containers: exclude clients nodes from facts gathering
just like site.yml and rolling_update, let's exclude clients node from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
2020-02-03 09:32:20 -05:00
Dimitri Savineau 487be2675a filestore-to-bluestore: skip bluestore osd nodes
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 83c5a1d7a8)
2020-02-03 15:16:51 +01:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
wujie1993 dcd4b2955a purge: fix purge cluster failed
Fix purge cluster failed when local container images does not exist.

Purge node-exporter and grafana-server only when dashboard_enabled is set to True.

Signed-off-by: wujie1993 qq594jj@gmail.com
(cherry picked from commit d8b0b3cbd9)
2020-02-03 15:14:56 +01:00
Dimitri Savineau f982a70f02 filestore-to-bluestore: fix undefine osd_fsid_list
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).

TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
  msg: '''osd_fsid_list'' is undefined

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cd76054f76)
2020-01-28 22:21:49 -05:00
Dimitri Savineau 0a2927ce5e filestore-to-bluestore: don't fail when with no PV
When the PV is already removed from the devices then we should not fail
to avoid errors like:

stderr: No PV found on device /dev/sdb.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a9c2300545)
2020-01-24 16:14:47 -05:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Dimitri Savineau 0abea70e29 filestore-to-bluestore: fix osd_auto_discovery
When osd_auto_discovery is set then we need to refresh the
ansible_devices fact between after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices is empty at this point for osd_auto_discovery.
Adding the bool filter when needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bb3eae0c80)
2020-01-22 10:06:17 +01:00
Dimitri Savineau e4965e9ea9 filestore-to-bluestore: --destroy with raw devices
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.

Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
 stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  /dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f995b079a6)
2020-01-21 18:26:55 +01:00
Guillaume Abrioux 0db611ebf8 shrink-mds: fix condition on fs deletion
the new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds given that it's done after we shrink the mds,
it means the condition is wrong. Also adding a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d0898aa5d)
2020-01-15 11:28:12 +01:00
Guillaume Abrioux 2d85fab02d osd: support scaling up using --limit
This commit lets add-osd.yml in place but mark the deprecation of the
playbook.
Scaling up OSDs is now possible using --limit

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3496a0efa2)
2020-01-14 09:12:34 -05:00
Guillaume Abrioux e034a6da69 docker2podman: use set_fact to override variables
play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b0c491800a)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 02ec088568 docker2podman: force systemd to reload config
This is needed after a change is made in systemd unit files.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1c2ec9fb40)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 34c4f5baac docker2podman: install podman
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d746575fd0)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 4c4b0edfec update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-10 17:16:51 +01:00
Guillaume Abrioux 6e47e96a02 update: use flags noout and nodeep-scrub only
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
2020-01-10 17:16:51 +01:00
Dimitri Savineau f00ee1244f purge-iscsi-gateways: don't run all ceph-facts
We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a09d1c38bf)
2020-01-10 16:21:53 +01:00
Dimitri Savineau f042ece9af rolling_update: run registry auth before upgrading
There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f344fdefe)
2020-01-09 20:16:07 -05:00
Dimitri Savineau 84276f2fe3 shrink-rgw: refact global workflow
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 747555dfa6)
2020-01-09 21:39:23 +01:00
Guillaume Abrioux 6e7fe62ad5 shrink-osd: support fqdn in inventory
When using fqdn in inventory, that playbook fails because of some tasks
using the result of ceph osd tree (which returns shortname) to get
some datas in hostvars[].

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9ca6b05b)
2020-01-08 16:16:21 -05:00
Dimitri Savineau e4798e22a8 purge-iscsi-gateways: remove node from dashboard
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 931a842f21)
2020-01-08 19:29:59 +01:00
Guillaume Abrioux 86bb734397 filestore-to-bluestore: umount partitions before zapping them
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise error like "Device is
busy" will show up.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8056514134)
2020-01-08 11:41:48 -05:00
Guillaume Abrioux 27b1fc8981 shrink-mds: do not play ceph-facts entirely
We only need to set `container_binary`.
Let's use `tasks_from` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0ae0a9ce28)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux edbb207680 shrink-mds: use fact from delegated node
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 77b39d235b)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux 0eaa66f394 shrink-mds: fix filesystem removal task
This commit deletes the filesystem when no more MDS is present after
shrinking operation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 38278a6bb5)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux bfd26e7f78 shrink-mds: ensure max_mds is always honored
This commit prevent from shrinking an mds node when max_mds wouldn't be
honored after that operation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfe5a04bf)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux 19068659c7 filestore-to-bluestore: ensure all dm are closed
This commit adds a task to ensure device mappers are well closed when
lvm batch scenario is used.
Otherwise, OSDs can't be redeployed given that devices that are rejected
by ceph-volume because they are locked.

Adding a condition `devices | default([]) | length > 0` to remove these
dm only when using lvm batch scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8e6ef818a2)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 99ac694cc0 filestore-to-bluestore: force OSDs to be marked down
Otherwise, sometimes it can take a while for an OSD to be seen as down
and causes the `ceph osd purge` command to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51d601193e)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 586f6f6262 filestore-to-bluestore: do not use --destroy
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e3305e6bb6)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux d2b1506712 filestore-to-bluestore: add non containerized support
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4833b85e04)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 5062d4094c update: restart iscsigws daemons after upgrade
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).

This commit also removes the task which ensure services are started
because it's already done in the role ceph-iscsigw.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c7708eb458)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux fe8858af38 upgrade: add dashboard deployment
when upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 451c5ca934)
2019-12-11 08:48:34 -05:00
Dimitri Savineau 3b26df8c75 purge-cluster: add podman support
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commits adds the podman support on purging the dashboard resources
in the purge-cluster playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 89f6cc54a2)
2019-12-04 18:00:07 -05:00
Guillaume Abrioux 1c03d2b526 purge: rename playbook (container)
Since we now support podman, let's rename the playbook so it's more
generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7bc7e3669d)
2019-12-04 09:12:41 -05:00
Dimitri Savineau 98392be368 add-{mon,osd}: run raw install python tasks
If the new mon/osd node doesn't have python installed then we need to
execute the tasks from raw_install_python.yml.

Closes: #4368

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34b03d1873)
2019-12-04 10:59:39 +01:00
Dimitri Savineau a325ff61e8 switch_to_containers: fix umount ceph partitions
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.

We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0

Also we should not fail on the ceph directory listing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 39cfe0aa65)
2019-12-03 15:58:36 +01:00
Guillaume Abrioux 1e7fd9fe36 purge: do not try to stop docker when binary is podman
If the container binary is podman, we shouldn't try to stop docker here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b18476a1a6)
2019-12-03 09:57:11 -05:00
Guillaume Abrioux 6592caab08 facts: isolate container_binary facts
in order to be able to call container_binary without having to run the
whole ceph-facts role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fe5ffe589e)
2019-12-03 09:57:11 -05:00
Guillaume Abrioux 1f30327688 purge: remove docker_* task
All containers are removed when systemd stops them.
There is no need to call this module in purge container playbook.

This commit also removes all docker_image task and remove all container
images in the final cleanup play.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776736

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d23383a820)
2019-12-03 09:57:11 -05:00
Guillaume Abrioux 88d060f6e1 docker2podman: import ceph-handler role
This is needed to avoid following error:

```
ERROR! The requested handler 'restart ceph mons' was not found in either the main handlers list nor in the listening handlers list
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a43a872105)
2019-12-03 10:44:48 +01:00
Guillaume Abrioux 3bd8129859 docker2podman: do not hardcode group name
let's use `client_group_name` instead of hardcoding the name.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7fe0d55eff)
2019-12-03 10:44:48 +01:00
Guillaume Abrioux c5145ccf25 docker2podman: import ceph-defaults in first play
We must import this role in the first play otherwise the first call to
`client_group_name`fails.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6526a25ab5)
2019-12-03 10:44:48 +01:00
Guillaume Abrioux 15b78ae252 purge: use sysfs to unmap rbd devices
in containerized context, using the binary provided in atomic os won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an idea but for large cluster with hundreds
of client nodes, that would require to pull the image of each of them
just to unmap the rbd devices.

Let's use the sysfs method in order to avoid any issue related to ceph
version that is shipped on the host.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3cfcc7a105)
2019-11-14 10:49:38 -05:00
Guillaume Abrioux e4c657d711 update: add default values when setting fact
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e9823f319b)
2019-10-29 16:00:21 -04:00
Dimitri Savineau 56f0cf79d9 rolling_update: remove default filter on mds group
There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.

Currently this generates warnings like:

[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2ca79fcc99)
2019-10-28 13:08:33 -04:00
Dimitri Savineau ba4059d15a rolling_update: fix active mds host value
The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returns under the mdsmap structure is based on the OS hostname
so we need to find the right node in the inventory with this value when
doing operation on inventory nodes.

Othewise we could see error like:

The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1f2352c79)
2019-10-28 13:08:33 -04:00
Dimitri Savineau b547ad9e71 rolling_update: fix reset mon_host variable
mon_host should use the inventory hostname and not the node hostname.
Fix creates an issue when the inventory and node hostname are different.

Closes: #4670

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 650bc0c3f0)
2019-10-26 08:20:54 -04:00
Dimitri Savineau ff3bea871d add-mon: add missing become flag
Without the become flag set to true, we can't executed the roles
successfully.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 77b212833e)
2019-10-26 08:18:27 -04:00
Guillaume Abrioux 3625ea6ef8 update: use right node when creating active mds group
This must be consistent with what is used in `name` parameter.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d06057ebd2)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux 73d97f525e update: avoid skipping single mds deployment upgrade
otherwise a single MDS would never be updated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d8ab11d2f8)
2019-10-25 09:42:52 +02:00
Guillaume Abrioux c599af6724 update: skip mds deactivation when no mds in inventory
Let's skip this part of the code if there's no mds node in the
inventory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ec906c3af)
2019-10-25 09:42:52 +02:00
Dimitri Savineau f36306ebf4 add-{mon,osd}: add ceph-container-engine role
The ceph-container-engine role is missing from both playbooks so the
container engine (docker, podman) isn't install resulting in a failure
on the added nodes.

fatal: [xxxxx]: FAILED! => changed=false
  cmd: docker --version
  msg: '[Errno 2] No such file or directory'
  rc: 2

Closes: #4634

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bfb1d6be12)
2019-10-24 20:01:04 -04:00
Guillaume Abrioux 4a5d3c3c2d update: add missing quotes
Add missing quote in order to keep consistency.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8d72ff8e5e)
2019-10-21 13:26:37 -04:00
Dimitri Savineau 703c834dab Move the dashboard playbook in the main directory
The [group|host]_vars directories are ignored for the dashboard playbook
when the inventory file directory doesn't contain those directories.

Closes: #4601
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1761612

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8426856262)
2019-10-18 19:32:42 -04:00
Guillaume Abrioux 9bc7f8a7d7 tests: add multimds coverage
This commit makes the all_daemons scenario deploying 3 mds in order to
cover the multimds case.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 25b98b2ce3)
2019-10-18 22:09:04 +02:00
Guillaume Abrioux bc3138eff4 upgrade: fix standby_mdss group creation
This commit fixes the standby_mdss group creation by using `{{ item }}`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c4fc8cc878)
2019-10-18 22:09:04 +02:00
Guillaume Abrioux c962d87def update: follow new recommandation to upgrade mds cluster
Refact the mds cluster upgrade code in order to follow the documented
recommandation.
See: https://github.com/ceph/ceph/blob/master/doc/cephfs/upgrading.rst

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1569689

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 71cebf80a6)
2019-10-16 12:59:08 -04:00
Dimitri Savineau 0b49538621 Execute common roles once on all nodes
The common roles don't need to be executed again on each group plays
(like mons, osds, etc..).
We only need to execute them during the first play. That wat, we will
apply the changes on all nodes in parallel instead of doing it once per
group.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 68a3dac7cd)
2019-10-16 10:41:32 -04:00
Dimitri Savineau fd759f97fa dashboard: disable facts gathering
This is already done in the main playbooks but absent in the dashboard
playbook.
The facts are already gathered during the first play of the main
playbooks so we don't need to doing twice.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5ae7304ace)
2019-10-14 09:45:11 +02:00
Guillaume Abrioux ebfe7f31ed dashboard: if no host is available, let's just skip these plays.
If there is no host available, let's just skip these plays.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1759917

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b245bd007)
2019-10-09 14:47:36 -04:00
Dimitri Savineau 5f91be8740 switch_to_containers: umount osd lockbox partition
When switching from a baremetal deployment to a containerized deployment
we only umount the OSD data partition.
If the OSD is encrypted (dmcrypt: true) then there's an additional
partition (part number 5) used for the lockbox and mount in the
/var/lib/ceph/osd-lockbox/ directory.
Because this partition isn't umount then the containerized OSD aren't
able to start. The partition is still mount by the system and can't be
remount from the container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 19edf707a5)
2019-10-08 00:57:05 +00:00
Guillaume Abrioux b325cc386e switch_to_containers: do not re-set `ceph_uid`
This commit refacts the way we set `ceph_uid` fact in `ceph-facts` and
removes all `set_fact` tasks for `ceph_uid` in switch-to-containers playbook
to avoid duplicated code.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fa9b42e98e)
2019-10-07 10:18:17 -04:00
Guillaume Abrioux 468aa5d63b switch_to_containers: optimize ownership change
As per https://github.com/ceph/ceph-ansible/pull/4323#issuecomment-538420164

using `find` command should be faster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1757400

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-Authored-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit c5d0c90bb7)
2019-10-07 10:18:17 -04:00
Guillaume Abrioux 37fd0b179b update: import ceph-defaults role in first play
Typical error:

```
fatal: [mon0]: FAILED! =>
  msg: |-
    The conditional check 'not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])' failed. The error was: error while evaluating conditional (not delegate_facts_host | bool or inventory_hostname in groups.get(client_group_name, [])): 'client_group_name' is undefined
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8138d4193c)
2019-10-07 11:21:23 +02:00
Guillaume Abrioux 9a4fcfabe1 main: exclude client nodes from facts gathering when delegate_facts_host
This commit excludes client nodes from facts gathering, they are not
needed and can speed up this task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 865d2eac9b)
2019-10-07 11:21:23 +02:00
Dimitri Savineau ec1c57f690 dashboard: remove useless block section
The block section were used with the dashboard_enabled condition when
the code was included in the main playbooks.
Because this condition isn't present in the dashboard playbook anymore
we can remove the block section.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf47594b47)
2019-10-04 13:28:37 +02:00
Guillaume Abrioux 9a79ed1bf0 rgw: refact tasks directory layout
This commit moves containerized deployment related files to `./tasks/`
directory. This is needed to make `docker-to-podman.yml` working since
we use `tasks_from:` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e08194dd67)
2019-10-01 18:50:51 +02:00
Guillaume Abrioux 7f902994b3 rbdmirror: refact tasks directory layout
This commit moves containerized deployment related files to `./tasks/`
directory. This is needed to make `docker-to-podman.yml` working since
we use `tasks_from:` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c69816c6b7)
2019-10-01 18:50:51 +02:00
Guillaume Abrioux d7a06c67db iscsigw: refact tasks directory layout
This commit moves containerized deployment related files to `./tasks/
directory. This is needed to make `docker-to-podman.yml` working since
we use `tasks_from:` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4636f3f7e2)
2019-10-01 18:50:51 +02:00
Guillaume Abrioux b564c37696 upgrade: add an infra playbook to migrate systemd units to podman
this commit adds a new playbook to force systemd units for containers to
use podman instead of docker.
This is needed in the rhel8 upgrade context so after the base OS is upgraded
containers can be started using podman.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f2017dcda2)
2019-10-01 18:50:51 +02:00
Guillaume Abrioux 4afe1b748c update: reset mon_host after mons upgrade
after all mon are upgraded, let's reset mon_host which is used in the
rest of the playbook for setting `container_exec_cmd` so we are sure to
use the right value.

Typical error:

```
failed: [mds0 -> mon0] (item={u'path': u'/var/lib/ceph/bootstrap-mds/ceph.keyring', u'name': u'client.bootstrap-mds', u'copy_key': True}) => changed=true
  ansible_loop_var: item
  cmd:
  - docker
  - exec
  - ceph-mon-mon2
  - ceph
  - --cluster
  - ceph
  - auth
  - get
  - client.bootstrap-mds
  delta: '0:00:00.016294'
  end: '2019-09-27 13:54:58.828835'
  item:
    copy_key: true
    name: client.bootstrap-mds
    path: /var/lib/ceph/bootstrap-mds/ceph.keyring
  msg: non-zero return code
  rc: 1
  start: '2019-09-27 13:54:58.812541'
  stderr: 'Error response from daemon: No such container: ceph-mon-mon2'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d84160a170)
2019-09-28 09:01:16 +02:00
Harald Jensås 5fea830414 Replace ipaddr() with ips_in_ranges()
This change implements a filter_plugin that is used in the
ceph-facts, ceph-validate roles and infrastucture-playbooks.
The new filter plugin will return a list of all IP address
that reside in any one of the given IP ranges. The new filter
replaces the use of the ipaddr filter.

ceph.conf already support a comma separated list of CIDRs
for the public_network and cluster_network options.

Changes: [1] and [2] introduced a regression in ceph-ansible
where public_network can no longer be a comma separated list
of cidrs.

With this change a comma separated list of subnet CIDRs can
also be used for monitor_address_block and radosgw_address_block.

[1] commit: d67230b2a2
[2] commit: 20e4852888

Related-To: https://bugs.launchpad.net/tripleo/+bug/1840030
Related-To: https://bugzilla.redhat.com/show_bug.cgi?id=1740283

Closes: #4333
Please backport to stable-4.0

Signed-off-by: Harald Jensås <hjensas@redhat.com>
(cherry picked from commit e695efcaf7)
2019-09-27 17:49:46 +02:00
Sam Choraria 7594bc9181 rolling_update.yml: force ceph-volume scan on osds
The rolling_update.yml playbook fails when scanning ceph-disk osds while
deploying nautilus. The --force flag is required to scan existing osds
and rewrite their json metadata.

Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>
(cherry picked from commit 7cc9f93680)
2019-09-26 14:51:59 -04:00
Guillaume Abrioux 96dafd676c infrastructure-playbooks: add filestore-to-bluestore.yml
This playbook helps to migrate all osds on a node from filestore to
bluestore backend.
Note that *ALL* osd on the specified osd nodes will be shrinked and
redeployed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3f9ccdaa8a)
2019-09-26 16:21:54 +02:00
Guillaume Abrioux 26e0f4db97 lv-create: fix a typo
This commit fixes a typo.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c785ad3637)
2019-09-26 16:21:54 +02:00
Mehdy 8c37894109 shrink-rgw.yml: fix confirmation play's name
the confirmation play's name should confirm removing rgw instead of
monitor

Signed-off-by: Mehdy Khoshnoody <mehdy.khoshnoody@gmail.com>
(cherry picked from commit 9fa98d79fd)
2019-09-25 16:37:44 +02:00
Dimitri Savineau a5775be7c4 shrink-mon: search mon in the quorum_names list
If we're looking at the mon hostname in the ceph status output then
there's some scenarios where this could be true.
If we collocate some services (mons, mgrs, etc..) then the hostname of
the monitor to shrink will still be present in the ceph status (like
in mgrs or other).
Instead we should check the hostame only in the mon part of the output.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 734c0dc310)
2019-09-18 14:47:40 +00:00
Kevin Jones 3a8de9cc36 Set proper ownership command performance improvement
By changing the set ownership command from using the file module in combination with a with_items loop to a raw chown command, we can achieve a 98% performance increase here.

On a ceph cluster with a significant amount of directories and files in /var/lib/ceph, the file module has to run checks on ownership of all those directories and files to determine whether a change is needed.

In this case, we just want to explicitly set the ownership of all these directories and files to the ceph_uid

Added context note to all set proper ownership tasks

Signed-off-by: Kevin Jones <kevinjones@redhat.com>
(cherry picked from commit 47bf47c9d8)
2019-08-22 12:59:58 +02:00
Guillaume Abrioux 236020fb2b shrink-mon: refact 'verify the monitor is out of the cluster' task
use `from_json` filter instead of a `| python` so we can get rid of the
`shell` module usage here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5573f17e76)
2019-08-19 18:47:14 +00:00
Rishabh Dave b28ed96378 use pre_tasks and post_tasks in shrink-mon.yml too
This commit should've been part of commit
2fb12ae554.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 2034387f57)
2019-08-19 18:47:14 +00:00
Guillaume Abrioux 2f77704591 common: use discovered_interpreter_python fact
in order to use the right binary name when using python cli in command
or shell module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13815ad3ca)
2019-08-19 18:47:14 +00:00
Dimitri Savineau f9d9ffac8f dashboard: run dashboard role on mgr/mon nodes
We don't need to execute the ceph-dashboard role on the nodes present
in the grafana-server group. This one is dedicated to the grafana and
prometheus stack.
The ceph-dashboard needs to executed where the ceph-mgr is running. It
is either on the dedicated mgr nodes or if mgr and mon are collocated
implicitly on the mon nodes.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16939eff9e)
2019-08-08 13:47:09 +02:00
Rishabh Dave 72a062b6fa add a playbook the remove rgw from a given node
Add a playbook named shrink-rgw.yml to infrastructure-playbooks/ that
can remove a RGW from a node in an already deployed Ceph cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 632a44bdf2)
2019-07-31 15:25:15 -04:00
Rishabh Dave 8ca88b41cc infra-playbooks: rewite a condition for better readability
Use facility built-in in Ansible to check whether a command was executed
successfully rather looking at its return value.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 5aecdd3ba6)
2019-07-29 15:52:29 +02:00
Guillaume Abrioux d0ad1cf0f1 dashboard: use dedicated group only
There's no need to add complexity and trying to fallback on other group.
Let's deploy dashboard on all nodes present in grafana-server group.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d67230b2a2)
2019-07-29 15:46:58 +02:00
Dimitri Savineau dd87db70ca dashboard: move code into a dedicated playbook
Move dashboard, grafana/prometheus and node-exporter plays into a
dedicated playbook in infrastructure-playbook directory.
To avoid using 'dashboard_enabled | bool' condition multiple time
in the main playbook we can just import the dashboard playbook or
not.
This patch also allows to use an unique dashboard playbook for
both baremetal and container playbooks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 43135840b1)
2019-07-29 15:46:58 +02:00
Dimitri Savineau 43d625b59a Remove NBSP characters
Some NBSP are still present in the yaml files.
Adding a test in travis CI.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 07c6695d16)
2019-07-26 16:23:41 -04:00
Guillaume Abrioux bee8a31afe shrink-rbdmirror: check if rbdmirror is well removed from cluster
This commits adds a check to ensure the daemon has been removed from the
cluster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 916dc1f52f)
2019-07-16 15:02:49 +02:00
Rishabh Dave 0a15d1d112 add a playbook that removes rbd-mirror from a node
Add a playbook named "shrink-rbdmirror.yml" in infrastructure-playbooks/
that removes a RBD Mirror from a node in an already deployed Ceph
cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit c4824acb19)
2019-07-16 15:02:49 +02:00
Rishabh Dave 6197d1c8d9 add a playbook that removes manager from a node
Add a playbook, named "shrink-mgr.yml", in infrastructure-playbooks/
that removes a MGR from a node in an already deployed Ceph cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f4ea75051b)
2019-07-09 15:00:56 +00:00
Guillaume Abrioux 85a448429d shrink-mds: refact post tasks
This commit refacts the way we check the "mds_to_kill" node is well
stopped.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 7df62fde34)
2019-07-09 12:07:47 +02:00
Rishabh Dave 38c2785e95 add a playbook that removes mds from a node
Add a playbook, named "shrink-mds.yml", in infrastructure-playbooks/
that removes a MDS from a node in an already deployed Ceph cluster.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 235b1fccc6)
2019-07-09 12:07:47 +02:00
Mike Christie cf6050d4e6 igw: Support new ceph-iscsi package during purge
The ceph-iscsi-config and ceph-iscsi-cli packages were combined into
ceph-iscsi and its APIs changed. This fixes up the iscsi purge task to
support the new API and old one.

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit b163206db7)
2019-07-04 00:04:04 +00:00
Guillaume Abrioux 0a0cdc0963 purge: ensure no ceph kernel thread is present
This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 20e4852888)
2019-06-24 13:20:50 +02:00
Guillaume Abrioux 77d24203fa upgrade: accept HEALTH_OK and HEALTH_WARN as valid state
3a100cfa52 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6dce51183b)
2019-06-21 15:47:33 +00:00
Dimitri Savineau aa197f77fc remove ceph restapi references
The ceph restapi configuration was only available until Luminous
release so we don't need those leftovers for nautilus+.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da8b7ab7fb)
2019-06-20 15:15:10 -04:00
Guillaume Abrioux b93064c7c8 rolling_update: fail early if cluster state is not OK
starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3a100cfa52)
2019-06-19 08:41:25 +00:00
Guillaume Abrioux 53dd58e84c rolling_update: only mask and stop unit in mgr part
Otherwise it fails like following:

```
fatal: [mon0]: FAILED! => changed=false
  msg: |-
    Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51b2813e04)
2019-06-19 08:41:25 +00:00
Dimitri Savineau 6e565b251d remove ceph-agent role and references
The ceph-agent role was used only for RHCS 2 (jewel) so it's not
usefull anymore.
The current code will fail on CentOS distribution because the rhscon
package is only avaible on Red Hat with the RHCS 2 repository and
this ceph release is supported on stable-3.0 branch.

Resolves: #4020

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 7503098ca0)
2019-06-17 15:56:00 -04:00
L3D 1daca1ba83 ansible: use 'bool' filter on boolean conditionals
By running ceph-ansible there are a lot ``[DEPRECATION WARNING]`` like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```

Now appended ``| bool`` on a lot of the affected variables.

Sometimes the coding style from ``variable|bool`` changed to ``variable | bool`` *(with spaces at the pipe)*.

Closes: #4022

Signed-off-by: L3D <l3d@c3woc.de>
(cherry picked from commit ab54fe20ec)
2019-06-07 16:05:51 +02:00
Dimitri Savineau 7a384e7ec2 purge-cluster: clean all ceph repo files
We currently only purge rh_storage yum repository file but depending
on the ceph_repository value we are using, the ceph repository file
could have a different name.

Resolves: #4056

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 44c63903ca)
2019-06-07 12:05:40 +00:00
guihecheng a6312ba9bc Add section for purging rgw loadbalancer in purge-cluster.yml
Signed-off-by: guihecheng <guihecheng@cmiot.chinamobile.com>
(cherry picked from commit 59e702ec39)
2019-06-06 19:44:30 +00:00
Guillaume Abrioux 16c6d530c6 roles: introduce `ceph-container-engine` role
This commit splits the current `ceph-container-common` role.

This introduces a new role `ceph-container-engine` which handles the
tasks specific to the installation of containers tools (docker/podman).

This is needed for the ceph-dashboard implementation for 2 main reasons:

1/ Since the ceph-dashboard stack is only containerized, we must install
everything needed to run containers even in non containerized
deployments. Splitting this role allows us to not have to call the full
`ceph-container-common` role which would run a bunch of unneeded tasks
that would have been skipped anyway.

2/ The current implementation would have required to run
`ceph-container-common` on all ceph-clients nodes which would have been
conflicting with 9d3517c670 (we don't want
to run ceph-container-common on all client nodes, see mentioned commit
for more details)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55420d6253)
2019-05-22 15:24:11 -04:00
Guillaume Abrioux d83db2c8ed switch to ansible 2.8
- remove private attribute with import_role.
- update documentation.
- update rpm spec requirement.
- fix MagicMock python import in unit tests.

Closes: #3765

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 72d8315299)
2019-05-21 09:17:46 +02:00
Dimitri Savineau 023cdffd95 purge-docker-cluster: don't remove data on atomic
Because we don't manage the docker service on atomic (yet) via the
ceph-container-common role then we can't stop docker dans remove
the data.
For now let's do that only for non atomic hosts.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 638604929b)
2019-05-17 10:44:52 -04:00
Guillaume Abrioux e29fd842a6 rename docker_exec_cmd variable
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e74d80e72f)
2019-05-17 16:05:58 +02:00
Zack Cerza 0496ce8e5c purge-docker-cluster.yml: Default lvm_volumes
We were failing when that variable is unset; purge-cluster.yml contains
this workaround.

Signed-off-by: Zack Cerza <zack@redhat.com>
(cherry picked from commit 9b4339a2ba)
2019-05-17 16:05:58 +02:00
Boris Ranto 5ac7559736 Merge cephmetrics/dashboard-ansible repo
This commit will merge dashboard-ansible installation scripts with
ceph-ansible. This includes several new roles to setup ceph-dashboard
and the underlying technologies like prometheus and grafana server.

Signed-off-by: Boris Ranto & Zack Cerza <team-gmeno@redhat.com>
Co-authored-by: Zack Cerza <zcerza@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2f141a6e80)
2019-05-17 16:05:58 +02:00
wumingqiao 30b1ca9aeb shrink_osd: mark all osd(s) out in one command
Signed-off-by: wumingqiao <wumingqiao@beyondcent.com>
(cherry picked from commit 5320aa11c4)
2019-05-15 21:44:30 -04:00
Dimitri Savineau 1e23d853f9 purge-docker-cluster: remove docker data
We never clean the content of /var/lib/docker so we can still have
some data present in this directory after run the purge playbook.
Pip isn't used anymore.
Also update the docker package name (especially the python binding
one).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 168d7cd016)
2019-05-14 11:00:30 +02:00
Dimitri Savineau 6814fd5ce5 gather-ceph-logs: fix logs list generation
The shell module doesn't have a stdout_lines attributes. Instead of
using the shell module, we can use the find modules.

Also adding `become: false` to the local tmp directory creation
otherwise we won't have enough right to fetch the files into this
directory.

Resolves: #3966

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ea1f8f551c)
2019-05-13 10:33:26 -04:00
Mike Christie 78a55a3df3 igw: Fix rolling update service ordering
We must stop tcmu-runner after the other rbd-target-* services
because they may need to interact with tcmu-runner during shutdown.
There is also a bug in some kernels where IO can get stuck in the
kernel and by stopping rbd-target-* first we can make sure all IO is
flushed.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit d7ef12910e)
2019-05-10 15:53:44 +02:00
Rishabh Dave b6d5352783 remove infrastructure-playbooks/rgw-standalone.yml
We don't need infrastructure-playbooks/rgw-standalone.yml since
site.yml.sample and site-cotainer.yml.sample can add a new RGW node to
an already deployed Ceph cluster.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 6e8fb2b3ea)
2019-05-07 13:11:48 +02:00
letterwuyu 27a8179cd8 Fix comment content
Signed-off-by: lishuhao letterwuyu@gmail.com
(cherry picked from commit d57f6fcdc6)
2019-05-07 11:11:22 +02:00
Rishabh Dave 06b3ab2a6b improve coding style
Keywords requiring only one item shouldn't express it by creating a
list with single item.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 739a662c80)

Conflicts:
	roles/ceph-mon/tasks/ceph_keys.yml
	roles/ceph-validate/tasks/check_devices.yml
2019-05-06 15:09:06 +00:00
Dimitri Savineau 92340d049c rolling_update: restart all ceph-iscsi services
Currently only rbd-target-gw service is restarted during an update.
We also need to restart tcmu-runner and rbd-target-api services
during the ceph iscsi upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1659611

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f1048627ea)
2019-04-30 12:09:52 -04:00