Commit Graph

556 Commits (e2f1a0ade85d1b44a315b41b6f59c31924bc1c04)

Author SHA1 Message Date
Dimitri Savineau e2f1a0ade8 doc: update infra playbooks statements
We don't need to copy the infrastructure playbooks in the root
ceph-ansible directory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 195944b123)
2020-03-16 14:43:35 +01:00
Dimitri Savineau 957156c0fe filestore-to-bluestore: stop ceph-volume services
We only disable the ceph-osd services but not the ceph-volume lvm
services during the filestore to bluestore migration.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 38a683e5bf)
2020-03-12 21:10:33 +01:00
Dimitri Savineau 928c792f8d filestore-to-bluestore: reuse dedicated journal
If the filestore configuration was using a dedicated journal with either
a partition or a LV/VG then we need to reuse this for bluestore DB.

When filestore is using a raw device then we shouldn't destroy
everything (data + journal) but only data otherwise the journal
partition won't exist anymore.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 535da53d69)
2020-03-12 21:10:33 +01:00
Dimitri Savineau 3b0ee83594 shrink-rbdmirror: fix presence after removal
We should add retry/delay to check the presence of the rbdmirror daemon
in the cluster status because the status takes some time to be updated.
Also the metadata.hostname isn't a good key to check because it doesn't
reflect the ansible_hostname fact. We should use metadata.id instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d1316ce77b)
2020-03-03 15:19:45 +01:00
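
The retry/delay idea described in the shrink-rbdmirror commit above can be sketched roughly as follows. This is only an illustration, not the playbook's actual task: the `wait_until_absent` and `is_listed` helpers, the default retry count and delay, and the shape of the `daemons` list are assumptions made for the example. Only the `metadata.id` key comes from the commit message.

```python
import time
from typing import Callable, Dict, List


def is_listed(daemons: List[Dict], daemon_id: str) -> bool:
    """Return True if a daemon with this metadata.id is still reported.

    `daemons` is a hypothetical, already-parsed list of daemon entries;
    the real cluster status layout may differ.
    """
    return any(d.get("metadata", {}).get("id") == daemon_id for d in daemons)


def wait_until_absent(is_present: Callable[[], bool],
                      retries: int = 5,
                      delay: float = 2.0) -> bool:
    """Poll until is_present() returns False.

    Returns True as soon as the daemon is gone, or False once all
    retries are used up. Sleeps `delay` seconds between attempts.
    """
    for attempt in range(retries):
        if not is_present():
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Matching on `metadata.id` rather than `metadata.hostname` follows the commit's reasoning that the hostname field does not reflect the `ansible_hostname` fact.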
Dimitri Savineau 4b07d97346 shrink-mgr: fix systemd condition
This playbook was using mds systemd condition.
Also a command task was using pipeline which is not allowed.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a664159061)
2020-03-03 15:19:45 +01:00
Dimitri Savineau 92b671bcbe shrink: don't use localhost node
The ceph-facts are running on localhost so if this node is using a
different OS/release than the ceph node we can have a mismatch between
docker/podman container binary.
This commit also reduces the scope of the ceph-facts role because we only
need the container_binary tasks.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 08ac2e3034)
2020-03-03 15:19:45 +01:00
Dimitri Savineau e037e99bd2 purge: stop rgw instances by iteration
It looks like the service module doesn't support wildcards anymore
for stopping/disabling multiple services.

fatal: [rgw0]: FAILED! => changed=false
  msg: 'This module does not currently support using glob patterns,
        found ''*'' in service name: ceph-radosgw@*'
...ignoring

Instead we should iterate over the rgw_instances list.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9d3b49293d)
2020-03-03 10:31:48 +01:00
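
As a rough sketch of the "iterate instead of glob" approach from the purge commit above: each instance is expanded to an explicit unit name rather than relying on `ceph-radosgw@*`. The `rgw_units` helper, the plain-string instance names, and the exact unit naming are assumptions for illustration only; the real `rgw_instances` entries and unit names may look different.

```python
from typing import Iterable, List


def rgw_units(instances: Iterable[str]) -> List[str]:
    """Expand instance names into explicit unit names, one per instance.

    Rejects glob characters, mirroring the module error quoted in the
    commit message. The naming scheme here is hypothetical.
    """
    units = []
    for name in instances:
        if any(ch in name for ch in "*?["):
            raise ValueError(f"glob pattern not supported: {name!r}")
        units.append(f"ceph-radosgw@{name}")
    return units
```

Each returned name can then be stopped and disabled individually, which avoids passing a wildcard to the service module.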
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
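
To illustrate the kind of mismatch the commit above describes, the sketch below compares a pattern limited to two digits with one that accepts any number of digits. Both patterns and the unit names are assumptions made for the example; the patterns actually used in the playbooks are not shown in the commit message.

```python
import re
from typing import List, Pattern

# Hypothetical narrow pattern: only matches OSD ids 0-99.
TWO_DIGITS = re.compile(r"ceph-osd@[0-9]{1,2}\.service")
# Pattern accepting ids of any length.
ANY_DIGITS = re.compile(r"ceph-osd@[0-9]+\.service")


def matching_units(pattern: Pattern[str], units: List[str]) -> List[str]:
    """Return the unit names that the pattern matches in full."""
    return [u for u in units if pattern.fullmatch(u)]


units = ["ceph-osd@7.service", "ceph-osd@42.service", "ceph-osd@100.service"]
narrow = matching_units(TWO_DIGITS, units)   # misses ceph-osd@100.service
wide = matching_units(ANY_DIGITS, units)     # matches all three units
```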
Guillaume Abrioux d254a8b938 shrink-osd: support shrinking ceph-disk prepared osds
This commit adds the ceph-disk prepared osds support

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1de2bf9991)
2020-02-26 18:16:48 +01:00
Guillaume Abrioux 21851457d6 shrink-osd: don't run ceph-facts entirely
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55970b18f1)
2020-02-26 18:16:48 +01:00
Benoît Knecht 10b3bb2727 infrastructure-playbooks: Run shrink-osd tasks on monitor
Instead of running shrink-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.

This has the added advantage of becoming root on the monitor only, not
on localhost.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b3df4e418)
2020-02-24 16:51:33 -05:00
Guillaume Abrioux 1d2a395aaf switch_to_containers: increase health check values
This commit increases the default values for the following variables
consumed in the switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables to the `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3700aa5385)
2020-02-10 12:57:17 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecated.
We can definitely drop it in stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Guillaume Abrioux 5c3ba0787c switch_to_containers: exclude clients nodes from facts gathering
just like site.yml and rolling_update, let's exclude clients node from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
2020-02-03 09:32:20 -05:00
Dimitri Savineau 487be2675a filestore-to-bluestore: skip bluestore osd nodes
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 83c5a1d7a8)
2020-02-03 15:16:51 +01:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
wujie1993 dcd4b2955a purge: fix purge cluster failed
Fix purge cluster failure when local container images do not exist.

Purge node-exporter and grafana-server only when dashboard_enabled is set to True.

Signed-off-by: wujie1993 qq594jj@gmail.com
(cherry picked from commit d8b0b3cbd9)
2020-02-03 15:14:56 +01:00
Dimitri Savineau f982a70f02 filestore-to-bluestore: fix undefine osd_fsid_list
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).

TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
  msg: '''osd_fsid_list'' is undefined

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cd76054f76)
2020-01-28 22:21:49 -05:00
Dimitri Savineau 0a2927ce5e filestore-to-bluestore: don't fail when with no PV
When the PV is already removed from the devices then we should not fail
to avoid errors like:

stderr: No PV found on device /dev/sdb.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a9c2300545)
2020-01-24 16:14:47 -05:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails as follows:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

This happens because we are trying to run `ceph-config` on this node, which
doesn't make sense, so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Dimitri Savineau 0abea70e29 filestore-to-bluestore: fix osd_auto_discovery
When osd_auto_discovery is set then we need to refresh the
ansible_devices fact after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices list is empty at this point for osd_auto_discovery.
Also add the bool filter where needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bb3eae0c80)
2020-01-22 10:06:17 +01:00
Dimitri Savineau e4965e9ea9 filestore-to-bluestore: --destroy with raw devices
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.

Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
 stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  /dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f995b079a6)
2020-01-21 18:26:55 +01:00
Guillaume Abrioux 0db611ebf8 shrink-mds: fix condition on fs deletion
The new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds, given that it's done after we shrink the mds,
so the condition is wrong. Also add a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3d0898aa5d)
2020-01-15 11:28:12 +01:00
Guillaume Abrioux 2d85fab02d osd: support scaling up using --limit
This commit leaves add-osd.yml in place but marks the playbook as
deprecated.
Scaling up OSDs is now possible using --limit

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3496a0efa2)
2020-01-14 09:12:34 -05:00
Guillaume Abrioux e034a6da69 docker2podman: use set_fact to override variables
play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b0c491800a)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 02ec088568 docker2podman: force systemd to reload config
This is needed after a change is made in systemd unit files.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1c2ec9fb40)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 34c4f5baac docker2podman: install podman
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d746575fd0)
2020-01-10 17:41:27 +01:00
Guillaume Abrioux 4c4b0edfec update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-10 17:16:51 +01:00
Guillaume Abrioux 6e47e96a02 update: use flags noout and nodeep-scrub only
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
2020-01-10 17:16:51 +01:00
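
The three steps listed in the commit above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions: `upgrade_osd_nodes` and its callback parameters are hypothetical names, and the real playbook runs Ansible tasks rather than Python callables. Only the flag names and the ordering come from the commit message.

```python
from typing import Callable, Iterable

FLAGS = ("noout", "nodeep-scrub")


def upgrade_osd_nodes(nodes: Iterable[str],
                      set_flag: Callable[[str], None],
                      unset_flag: Callable[[str], None],
                      upgrade_node: Callable[[str], None],
                      wait_for_clean: Callable[[], None]) -> None:
    """Run the OSD upgrade in the order described by the commit."""
    # 1. set noout and nodeep-scrub before touching any OSD node
    for flag in FLAGS:
        set_flag(flag)
    # 2. upgrade nodes one by one, waiting for active+clean pgs each time
    for node in nodes:
        upgrade_node(node)
        wait_for_clean()
    # 3. once every node is upgraded, unset the flags
    for flag in FLAGS:
        unset_flag(flag)
```

Passing the cluster operations in as callables keeps the ordering logic separate from how the flags are actually set, which also makes the sequence easy to check with stand-in functions.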
Dimitri Savineau f00ee1244f purge-iscsi-gateways: don't run all ceph-facts
We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a09d1c38bf)
2020-01-10 16:21:53 +01:00
Dimitri Savineau f042ece9af rolling_update: run registry auth before upgrading
There are some tasks using the new container image during the rolling
upgrade playbook that need to execute the registry login first, otherwise
the nodes won't be able to pull the container image.

Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 3f344fdefe)
2020-01-09 20:16:07 -05:00
Dimitri Savineau 84276f2fe3 shrink-rgw: refact global workflow
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 747555dfa6)
2020-01-09 21:39:23 +01:00
Guillaume Abrioux 6e7fe62ad5 shrink-osd: support fqdn in inventory
When using fqdn in inventory, that playbook fails because of some tasks
using the result of ceph osd tree (which returns the short name) to get
some data in hostvars[].

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9ca6b05b)
2020-01-08 16:16:21 -05:00
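
One way to picture the fqdn problem from the commit above: a short name reported by the cluster has to be mapped back to the inventory entry, which may be a fully qualified name. The `inventory_host_for` helper and the host names below are hypothetical, and this is not necessarily how the playbook resolves the lookup.

```python
from typing import Iterable, Optional


def inventory_host_for(short_name: str,
                       inventory_hosts: Iterable[str]) -> Optional[str]:
    """Find the inventory entry whose short name equals `short_name`.

    An inventory host matches if it is identical to the short name or
    if the part before its first dot is. Returns None when nothing matches.
    """
    for host in inventory_hosts:
        if host == short_name or host.split(".", 1)[0] == short_name:
            return host
    return None
```

With an inventory of `["osd0.example.com", "osd1.example.com"]`, a lookup for `osd1` resolves to the fqdn entry, so the result can be used as a key where the bare short name would not be found.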
Dimitri Savineau e4798e22a8 purge-iscsi-gateways: remove node from dashboard
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 931a842f21)
2020-01-08 19:29:59 +01:00
Guillaume Abrioux 86bb734397 filestore-to-bluestore: umount partitions before zapping them
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise errors like "Device is
busy" will show up.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8056514134)
2020-01-08 11:41:48 -05:00
Guillaume Abrioux 27b1fc8981 shrink-mds: do not play ceph-facts entirely
We only need to set `container_binary`.
Let's use `tasks_from` option.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0ae0a9ce28)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux edbb207680 shrink-mds: use fact from delegated node
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 77b39d235b)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux 0eaa66f394 shrink-mds: fix filesystem removal task
This commit deletes the filesystem when no more MDS is present after
shrinking operation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 38278a6bb5)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux bfd26e7f78 shrink-mds: ensure max_mds is always honored
This commit prevents shrinking an mds node when max_mds wouldn't be
honored after that operation.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfe5a04bf)
2020-01-08 11:18:45 -05:00
Guillaume Abrioux 19068659c7 filestore-to-bluestore: ensure all dm are closed
This commit adds a task to ensure device mappers are well closed when
lvm batch scenario is used.
Otherwise, OSDs can't be redeployed given that devices are rejected
by ceph-volume because they are locked.

Adding a condition `devices | default([]) | length > 0` to remove these
dm only when using lvm batch scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8e6ef818a2)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 99ac694cc0 filestore-to-bluestore: force OSDs to be marked down
Otherwise, sometimes it can take a while for an OSD to be seen as down
and causes the `ceph osd purge` command to fail.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51d601193e)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 586f6f6262 filestore-to-bluestore: do not use --destroy
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e3305e6bb6)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux d2b1506712 filestore-to-bluestore: add non containerized support
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 4833b85e04)
2019-12-11 16:37:21 +01:00
Guillaume Abrioux 5062d4094c update: restart iscsigws daemons after upgrade
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).

This commit also removes the task which ensures services are started
because it's already done in the role ceph-iscsigw.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c7708eb458)
2019-12-11 08:48:34 -05:00
Guillaume Abrioux fe8858af38 upgrade: add dashboard deployment
When upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 451c5ca934)
2019-12-11 08:48:34 -05:00
Dimitri Savineau 3b26df8c75 purge-cluster: add podman support
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commit adds the podman support for purging the dashboard resources
in the purge-cluster playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 89f6cc54a2)
2019-12-04 18:00:07 -05:00
Guillaume Abrioux 1c03d2b526 purge: rename playbook (container)
Since we now support podman, let's rename the playbook so it's more
generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7bc7e3669d)
2019-12-04 09:12:41 -05:00
Dimitri Savineau 98392be368 add-{mon,osd}: run raw install python tasks
If the new mon/osd node doesn't have python installed then we need to
execute the tasks from raw_install_python.yml.

Closes: #4368

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34b03d1873)
2019-12-04 10:59:39 +01:00
Dimitri Savineau a325ff61e8 switch_to_containers: fix umount ceph partitions
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.

We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0

Also we should not fail on the ceph directory listing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 39cfe0aa65)
2019-12-03 15:58:36 +01:00
Guillaume Abrioux 1e7fd9fe36 purge: do not try to stop docker when binary is podman
If the container binary is podman, we shouldn't try to stop docker here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b18476a1a6)
2019-12-03 09:57:11 -05:00