When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If the filestore configuration was using a dedicated journal with either
a partition or a LV/VG then we need to reuse this for bluestore DB.
When filestore is using a raw devices then we shouldn't destroy
everything (data + journal) but only data otherwise the journal
partition won't exist anymore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790479
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of running shring-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.
This has the added advantage of becoming root on the monitor only, not
on localhost.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
We only support python3 so renaming all ceph python packages.
Some ceph packages were missing from the list (ceph-mon, ceph-osd or
rbd-mirror) or didn't exist anymore (ceph-fs-common, libcephfs1).
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Fix purge cluster failed when local container images does not exist.
Purge node-exporter and grafana-server only when dashboard_enabled is set to True.
Signed-off-by: wujie1993 qq594jj@gmail.com
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).
TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
msg: '''osd_fsid_list'' is undefined
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When the PV is already removed from the devices then we should not fail
to avoid errors like:
stderr: No PV found on device /dev/sdb.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:
```
fatal: [magna005]: FAILED! => changed=false
gid: 0
group: root
mode: '0755'
msg: 'chown failed: failed to look up user ceph'
owner: root
path: /etc/ceph
secontext: unconfined_u:object_r:etc_t:s0
size: 4096
state: directory
uid: 0
```
because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When osd_auto_discovery is set then we need to refresh the
ansible_devices fact between after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices is empty at this point for osd_auto_discovery.
Adding the bool filter when needed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.
Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
/dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When using docker 1.13.1, the current condition:
```
{% if (container_binary == 'docker' and ceph_docker_version.split('.')[0] is version_compare('13', '>=')) or container_binary == 'podman' -%}
```
is wrong because it compares the first digit (1) whereas it should
compare the second one.
It means we always use `--cpu-quota` although documentation recommend
using `--cpus` when docker version is 1.13.1 or higher.
From the doc:
> --cpu-quota=<value> Impose a CPU CFS quota on the container. The number of
> microseconds per --cpu-period that the container is limited to before
> throttled. As such acting as the effective ceiling.
> If you use Docker 1.13 or higher, use --cpus instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the new ceph status registered in `ceph_status` will report `fsmap.up` =
0 when it's the last mds given that it's done after we shrink the mds,
it means the condition is wrong. Also adding a condition so we don't try
to delete the fs if a standby node is going to rejoin the cluster.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit lets add-osd.yml in place but mark the deprecation of the
playbook.
Scaling up OSDs is now possible using --limit
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We only need to have the container_binary fact. Because we're not
gathering the facts from all nodes then the purge fails trying to get
one of the grafana fact.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
There's some tasks using the new container image during the rolling
upgrade playbook that needs to execute the registry login first otherwise
the nodes won't be able to pull the container image.
Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit deletes the filesystem when no more MDS is present after
shrinking operation.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit prevent from shrinking an mds node when max_mds wouldn't be
honored after that operation.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
on master we can't test upgrade from stable-4.0/CentOS 7 to
master/CentOS 8.
This commit refact the upgrade so we test upgrade from master/CentOS 8
to master/CentOS 8 (octopus to octopus)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise error like "Device is
busy" will show up.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a task to ensure device mappers are well closed when
lvm batch scenario is used.
Otherwise, OSDs can't be redeployed given that devices that are rejected
by ceph-volume because they are locked.
Adding a condition `devices | default([]) | length > 0` to remove these
dm only when using lvm batch scenario.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise, sometimes it can take a while for an OSD to be seen as down
and causes the `ceph osd purge` command to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using fqdn in inventory, that playbook fails because of some tasks
using the result of ceph osd tree (which returns shortname) to get
some datas in hostvars[].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task is
just checking the daemon status is started (eg: `state: started`).
This commit also removes the task which ensure services are started
because it's already done in the role ceph-iscsigw.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
when upgrading from RHCS 3, dashboard has obviously never been deployed
and it forces us to deploy it later manually.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commits adds the podman support on purging the dashboard resources
in the purge-cluster playbook.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
ceph/ceph-ansible#4805 introduced a symlink to
purge-container-cluster.yml playbook which is broken.
This commit fixes it.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
All containers are removed when systemd stops them.
There is no need to call this module in purge container playbook.
This commit also removes all docker_image task and remove all container
images in the final cleanup play.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776736
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is needed to avoid following error:
```
ERROR! The requested handler 'restart ceph mons' was not found in either the main handlers list nor in the listening handlers list
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We must import this role in the first play otherwise the first call to
`client_group_name`fails.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>