This commit leaves add-osd.yml in place but marks the playbook as
deprecated.
Scaling up OSDs is now possible using --limit.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Play vars have lower precedence than role vars and `set_fact`.
We must use a `set_fact` to reset these variables.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a package installation task in order to install podman
during the docker-to-podman.yml migration playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We only need the container_binary fact. Because we're not
gathering the facts from all nodes, the purge fails trying to get
one of the grafana facts.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Some tasks in the rolling upgrade playbook use the new container image,
so the registry login must be executed first; otherwise
the nodes won't be able to pull the container image.
Unable to find image 'xxx.io/foo/bar:latest' locally
Trying to pull repository xxx.io/foo/bar ...
/usr/bin/docker-current: Get https://xxx.io/v2/foo/bar/manifests/latest:
unauthorized
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of running the ceph roles against localhost we should do it
on the first mon.
The ansible and inventory hostname of the rgw nodes could be different.
Ensure that the rgw instance to remove is present in the cluster.
Fix rgw service and directory path.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The command is delegated on the first monitor so we must use the fact
`container_binary` from this node.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit deletes the filesystem when no more MDS is present after
shrinking operation.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1787543
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit prevents shrinking an MDS node when max_mds would no longer
be honored after that operation.
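The guard can be pictured as a simple count comparison. This is a minimal sketch, assuming the current number of MDS daemons and the `max_mds` value are already known; the function name `can_shrink_mds` is hypothetical and not taken from the playbook.

```python
def can_shrink_mds(mds_count: int, max_mds: int) -> bool:
    """Return True if removing one MDS still leaves at least max_mds daemons.

    Both arguments are assumed to come from the cluster's filesystem map;
    how they are fetched is outside the scope of this sketch.
    """
    return mds_count - 1 >= max_mds
```

For example, with three MDS daemons and `max_mds` set to 2, one daemon can be removed, but a second removal would be refused.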
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using the ceph dashboard with iscsi gateways nodes we also need to
remove the nodes from the ceph dashboard list.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1786686
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
On master we can't test upgrade from stable-4.0/CentOS 7 to
master/CentOS 8.
This commit refactors the upgrade so we test upgrade from master/CentOS 8
to master/CentOS 8 (octopus to octopus).
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When an OSD is stopped, it leaves partitions mounted.
We must umount them before zapping them, otherwise errors like "Device is
busy" will show up.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a task to ensure device mappers are properly closed when
the lvm batch scenario is used.
Otherwise, OSDs can't be redeployed because the devices are rejected
by ceph-volume as locked.
The condition `devices | default([]) | length > 0` is added so these
dm are removed only when using the lvm batch scenario.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise, it can sometimes take a while for an OSD to be seen as down,
which causes the `ceph osd purge` command to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Do not use `--destroy` when zapping a device.
Otherwise, it destroys VGs while they are still needed to redeploy the
OSDs.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds the non containerized context support to the
filestore-to-bluestore.yml infrastructure playbook.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using fqdn in the inventory, that playbook fails because some tasks
use the result of ceph osd tree (which returns the shortname) to get
some data from hostvars[].
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In containerized context, containers aren't stopped early in the
sequence.
It means they aren't restarted after the upgrade because the task
only checks that the daemon is started (eg: `state: started`).
This commit also removes the task which ensures services are started
because it's already done in the role ceph-iscsigw.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When upgrading from RHCS 3, the dashboard has never been deployed,
which forces us to deploy it manually later.
This commit adds the dashboard deployment as part of the upgrade to
RHCS 4.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779092
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The podman support was added to the purge-container-cluster playbook but
containers are always used for the dashboard even on non containerized
deployment.
This commit adds the podman support on purging the dashboard resources
in the purge-cluster playbook.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
ceph/ceph-ansible#4805 introduced a symlink to
purge-container-cluster.yml playbook which is broken.
This commit fixes it.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
All containers are removed when systemd stops them.
There is no need to call this module in purge container playbook.
This commit also removes all docker_image tasks and removes all container
images in the final cleanup play.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776736
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is needed to avoid the following error:
```
ERROR! The requested handler 'restart ceph mons' was not found in either the main handlers list nor in the listening handlers list
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We must import this role in the first play otherwise the first call to
`client_group_name` fails.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.
We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0
Also we should not fail on the ceph directory listing.
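The key point above is that a return code of 0 alone says nothing about whether containers were found. A minimal sketch of that distinction follows; `rc` and `stdout` stand for the return code and output of the container `ps` command, and the function name is hypothetical.

```python
def osd_containers_running(rc: int, stdout: str) -> bool:
    """Tell whether a `ps -q` style listing actually reported containers.

    rc == 0 alone is not enough: the listing succeeds even when the filter
    matches nothing, so the output lines must be inspected as well.
    """
    if rc != 0:
        # The container command itself failed (for example, not installed).
        return False
    return any(line.strip() for line in stdout.splitlines())
```

A task condition can then combine the return code and this result instead of relying on the return code only.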
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
Even with the osd flags `noout` and `norebalance` set, the PG states can
still change at some point, depending on the amount of data written in the
meantime. When that happens, the check for PGs state fails.
This commit changes the way we set those flags:
we set them before an OSD node upgrade and unset them before the PGs
state check so they can recover.
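The resulting ordering can be sketched as a flat list of steps. This is only an illustration of the sequence described above: the "upgrade" and "wait" entries are placeholders rather than real commands, and the function name is hypothetical.

```python
from typing import List

FLAGS = ("noout", "norebalance")


def upgrade_plan(osd_nodes: List[str]) -> List[str]:
    """Build the ordered list of steps for a rolling OSD upgrade.

    Flags are set before each node is upgraded and unset again before the
    PG state check, so PGs can recover while the check is waiting.
    """
    steps: List[str] = []
    for node in osd_nodes:
        steps += [f"ceph osd set {flag}" for flag in FLAGS]
        steps.append(f"upgrade {node}")  # placeholder, not a real command
        steps += [f"ceph osd unset {flag}" for flag in FLAGS]
        steps.append("wait for PGs to be active+clean")  # placeholder
    return steps
```

Calling `upgrade_plan(["osd0", "osd1"])` yields the set/upgrade/unset/wait cycle once per node.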
Fixes: #3961
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In containerized context, using the binary provided in atomic os won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an idea, but for large clusters with hundreds
of client nodes, that would require pulling the image on each of them
just to unmap the rbd devices.
Let's use the sysfs method in order to avoid any issue related to ceph
version that is shipped on the host.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If the new mon/osd node doesn't have python installed then we need to
execute the tasks from raw_install_python.yml.
Closes: #4368
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit adds a default value in the `with_dict` because when using
python 2.7, if a task using a `with_dict` has a condition, it is
evaluated anyway whereas in python 3 it isn't.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766499
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's no need to use the default filter on active/standby groups
because if the group doesn't exist then the play is just skipped.
Currently this generates warnings like:
[WARNING]: Could not match supplied host pattern, ignoring: |
[WARNING]: Could not match supplied host pattern, ignoring: default([])
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The active mds host should be based on the inventory hostname and not on
the ansible hostname.
The value returned under the mdsmap structure is based on the OS hostname
so we need to find the right node in the inventory with this value when
doing operations on inventory nodes.
Otherwise we could see an error like:
The task includes an option with an undefined variable. The error was:
"hostvars[foobar]" is undefined
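The lookup described above can be sketched as a search over the inventory. This is a minimal illustration, assuming `hostvars` maps inventory hostnames to gathered facts and that the OS hostname is stored under the `ansible_hostname` key; the function name is hypothetical.

```python
from typing import Dict, Optional


def inventory_name_for(os_hostname: str, hostvars: Dict[str, dict]) -> Optional[str]:
    """Find the inventory hostname whose gathered hostname matches os_hostname.

    Returns None when no inventory node reports that hostname, which lets
    the caller fail with a clear message instead of an undefined variable.
    """
    for inventory_name, facts in hostvars.items():
        if facts.get("ansible_hostname") == os_hostname:
            return inventory_name
    return None
```

Indexing `hostvars` with the returned name, rather than with the OS hostname directly, avoids the undefined variable error shown above.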
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
mon_host should use the inventory hostname and not the node hostname.
Otherwise this creates an issue when the inventory and node hostnames are
different.
Closes: #4670
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph-container-engine role is missing from both playbooks so the
container engine (docker, podman) isn't installed, resulting in a failure
on the added nodes.
fatal: [xxxxx]: FAILED! => changed=false
cmd: docker --version
msg: '[Errno 2] No such file or directory'
rc: 2
Closes: #4634
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The [group|host]_vars directories are ignored for the dashboard playbook
when the inventory file directory doesn't contain those directories.
Closes: #4601
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1761612
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The common roles don't need to be executed again in each group play
(like mons, osds, etc.).
We only need to execute them during the first play. That way, we will
apply the changes on all nodes in parallel instead of doing it once per
group.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This is already done in the main playbooks but absent in the dashboard
playbook.
The facts are already gathered during the first play of the main
playbooks so we don't need to do it twice.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
When switching from a baremetal deployment to a containerized deployment
we only umount the OSD data partition.
If the OSD is encrypted (dmcrypt: true) then there's an additional
partition (part number 5) used for the lockbox and mounted in the
/var/lib/ceph/osd-lockbox/ directory.
Because this partition isn't unmounted, the containerized OSDs aren't
able to start. The partition is still mounted by the system and can't be
remounted from the container.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit refactors the way we set `ceph_uid` fact in `ceph-facts` and
removes all `set_fact` tasks for `ceph_uid` in switch-to-containers playbook
to avoid duplicated code.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit excludes client nodes from facts gathering: their facts are
not needed, and skipping them speeds up this task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The block section was used with the dashboard_enabled condition when
the code was included in the main playbooks.
Because this condition isn't present in the dashboard playbook anymore
we can remove the block section.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit moves containerized deployment related files to `./tasks/`
directory. This is needed to make `docker-to-podman.yml` work since
we use `tasks_from:` option.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds a new playbook to force systemd units for containers to
use podman instead of docker.
This is needed in the rhel8 upgrade context so after the base OS is upgraded
containers can be started using podman.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
After all mons are upgraded, let's reset mon_host which is used in the
rest of the playbook for setting `container_exec_cmd` so we are sure to
use the right value.
Typical error:
```
failed: [mds0 -> mon0] (item={u'path': u'/var/lib/ceph/bootstrap-mds/ceph.keyring', u'name': u'client.bootstrap-mds', u'copy_key': True}) => changed=true
ansible_loop_var: item
cmd:
- docker
- exec
- ceph-mon-mon2
- ceph
- --cluster
- ceph
- auth
- get
- client.bootstrap-mds
delta: '0:00:00.016294'
end: '2019-09-27 13:54:58.828835'
item:
copy_key: true
name: client.bootstrap-mds
path: /var/lib/ceph/bootstrap-mds/ceph.keyring
msg: non-zero return code
rc: 1
start: '2019-09-27 13:54:58.812541'
stderr: 'Error response from daemon: No such container: ceph-mon-mon2'
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This change implements a filter_plugin that is used in the
ceph-facts, ceph-validate roles and infrastructure-playbooks.
The new filter plugin will return a list of all IP addresses
that reside in any one of the given IP ranges. The new filter
replaces the use of the ipaddr filter.
ceph.conf already supports a comma separated list of CIDRs
for the public_network and cluster_network options.
Changes: [1] and [2] introduced a regression in ceph-ansible
where public_network can no longer be a comma separated list
of CIDRs.
With this change a comma separated list of subnet CIDRs can
also be used for monitor_address_block and radosgw_address_block.
[1] commit: d67230b2a2
[2] commit: 20e4852888
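The behaviour of such a filter can be sketched with the standard library. This is not the plugin's code: it assumes the `ipaddress` module rather than whatever the actual plugin relies on, and the function name here is illustrative.

```python
import ipaddress
from typing import Iterable, List


def ips_in_ranges(ip_addresses: Iterable[str], ip_ranges: Iterable[str]) -> List[str]:
    """Return the addresses that fall inside at least one of the CIDR ranges."""
    # strict=False tolerates ranges written with host bits set.
    networks = [ipaddress.ip_network(r.strip(), strict=False) for r in ip_ranges]
    matches = []
    for ip in ip_addresses:
        addr = ipaddress.ip_address(ip)
        # A membership test across IP versions simply evaluates to False.
        if any(addr in net for net in networks):
            matches.append(ip)
    return matches
```

A comma separated `public_network` value can be passed as `public_network.split(",")`, which is how a list of subnets would reach the function in this sketch.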
Related-To: https://bugs.launchpad.net/tripleo/+bug/1840030
Related-To: https://bugzilla.redhat.com/show_bug.cgi?id=1740283
Closes: #4333
Please backport to stable-4.0
Signed-off-by: Harald Jensås <hjensas@redhat.com>
The rolling_update.yml playbook fails when scanning ceph-disk osds while
deploying nautilus. The --force flag is required to scan existing osds
and rewrite their json metadata.
Signed-off-by: Sam Choraria <sam.choraria@bbc.co.uk>
This playbook helps to migrate all osds on a node from filestore to
bluestore backend.
Note that *ALL* osds on the specified osd nodes will be shrunk and
redeployed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we look for the mon hostname anywhere in the ceph status output then
there are some scenarios where it still matches after the mon is removed.
If we collocate some services (mons, mgrs, etc.) then the hostname of
the monitor to shrink will still be present in the ceph status (like
in mgrs or other).
Instead we should check the hostname only in the mon part of the output.
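The difference between the two checks can be sketched as follows. This assumes the status is available as JSON with a top-level `quorum_names` list; that key and the function name are assumptions made for illustration and may need adjusting for a given Ceph release.

```python
import json
from typing import List


def mon_still_in_quorum(status_json: str, mon_hostname: str) -> bool:
    """Check the monitor list only, not the whole status output.

    A plain substring search over the status would also match the hostname
    when it appears in another section, such as the manager map.
    """
    status = json.loads(status_json)
    quorum_names: List[str] = status.get("quorum_names", [])
    return mon_hostname in quorum_names
```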
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
By changing the set ownership command from using the file module in combination with a with_items loop to a raw chown command, we can achieve a 98% performance increase here.
On a ceph cluster with a significant number of directories and files in /var/lib/ceph, the file module has to run checks on ownership of all those directories and files to determine whether a change is needed.
In this case, we just want to explicitly set the ownership of all these directories and files to the ceph_uid.
Added a context note to all set proper ownership tasks.
Signed-off-by: Kevin Jones <kevinjones@redhat.com>
use `from_json` filter instead of a `| python` so we can get rid of the
`shell` module usage here.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We don't need to execute the ceph-dashboard role on the nodes present
in the grafana-server group. This one is dedicated to the grafana and
prometheus stack.
The ceph-dashboard needs to be executed where the ceph-mgr is running. It
is either on the dedicated mgr nodes or if mgr and mon are collocated
implicitly on the mon nodes.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Add a playbook named shrink-rgw.yml to infrastructure-playbooks/ that
can remove a RGW from a node in an already deployed Ceph cluster.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
There's no need to add complexity by trying to fall back on another group.
Let's deploy dashboard on all nodes present in grafana-server group.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Move dashboard, grafana/prometheus and node-exporter plays into a
dedicated playbook in infrastructure-playbook directory.
To avoid using the 'dashboard_enabled | bool' condition multiple times
in the main playbook we can just import the dashboard playbook or
not.
This patch also allows using a single dashboard playbook for
both baremetal and container playbooks.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Use the facility built into Ansible to check whether a command was executed
successfully rather than looking at its return value.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Add a playbook named "shrink-rbdmirror.yml" in infrastructure-playbooks/
that removes a RBD Mirror from a node in an already deployed Ceph
cluster.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Add a playbook, named "shrink-mgr.yml", in infrastructure-playbooks/
that removes a MGR from a node in an already deployed Ceph cluster.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
This commit refactors the way we check that the "mds_to_kill" node is
properly stopped.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Add a playbook, named "shrink-mds.yml", in infrastructure-playbooks/
that removes a MDS from a node in an already deployed Ceph cluster.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1677431
Signed-off-by: Rishabh Dave <ridave@redhat.com>
The ceph-iscsi-config and ceph-iscsi-cli packages were combined into
ceph-iscsi and its APIs changed. This fixes up the iscsi purge task to
support the new API and old one.
Signed-off-by: Mike Christie <mchristi@redhat.com>
This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation. Otherwise, if you try to redeploy
a cluster, it could end up getting confused by leftovers from the previous
deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
3a100cfa52 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.
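The relaxed check amounts to a membership test on the reported health state. Here is a minimal sketch, assuming the cluster status is available as JSON with the state under `health.status`; that layout and the function name are assumptions, not taken from the playbook.

```python
import json

ACCEPTED_HEALTH = {"HEALTH_OK", "HEALTH_WARN"}


def cluster_ready_for_upgrade(status_json: str) -> bool:
    """Return True when the reported health is acceptable for an upgrade.

    A missing health section is treated as not ready rather than as an error.
    """
    status = json.loads(status_json)
    return status.get("health", {}).get("status") in ACCEPTED_HEALTH
```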
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise it fails as follows:
```
fatal: [mon0]: FAILED! => changed=false
msg: |-
Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The ceph restapi configuration was only available until Luminous
release so we don't need those leftovers for nautilus+.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We currently only purge rh_storage yum repository file but depending
on the ceph_repository value we are using, the ceph repository file
could have a different name.
Resolves: #4056
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Running ceph-ansible produces a lot of ``[DEPRECATION WARNING]`` messages like this one:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```
``| bool`` is now appended to many of the affected variables.
In some places the coding style also changed from ``variable|bool`` to ``variable | bool`` *(with spaces around the pipe)*.
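One way to see why an explicit conversion is useful: in plain Python, any non-empty string, including "false", is truthy. The sketch below approximates an explicit boolean conversion for common inputs; it is an illustration under that assumption, not Ansible's own code, and the function name is hypothetical.

```python
def as_bool(value) -> bool:
    """Approximate how an explicit boolean conversion treats common inputs.

    Only a small set of spellings count as true; everything else,
    including the string "false", is false.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.lower()
    return value in ("yes", "on", "1", "true", 1)
```

With this helper, `as_bool("false")` is false while `bool("false")` is true, which is the ambiguity an explicit conversion removes.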
Closes: #4022
Signed-off-by: L3D <l3d@c3woc.de>
The ceph-agent role was used only for RHCS 2 (jewel) so it's not
useful anymore.
The current code will fail on CentOS distribution because the rhscon
package is only available on Red Hat with the RHCS 2 repository and
this ceph release is supported on the stable-3.0 branch.
Resolves: #4020
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit splits the current `ceph-container-common` role.
This introduces a new role `ceph-container-engine` which handles the
tasks specific to the installation of container tools (docker/podman).
This is needed for the ceph-dashboard implementation for 2 main reasons:
1/ Since the ceph-dashboard stack is only containerized, we must install
everything needed to run containers even in non containerized
deployments. Splitting this role allows us to not have to call the full
`ceph-container-common` role which would run a bunch of unneeded tasks
that would have been skipped anyway.
2/ The current implementation would have required running
`ceph-container-common` on all ceph-clients nodes which would have been
conflicting with 9d3517c670 (we don't want
to run ceph-container-common on all client nodes, see mentioned commit
for more details)
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>