Commit Graph

96 Commits (3ea39b5db30243ff83992ae989664f1a30db95b5)

Author SHA1 Message Date
Alex Schultz 7ddbe74712 Use ansible_facts
It has come to our attention that using ansible_* vars that are
populated with INJECT_FACTS_AS_VARS=True is not very performant.  In
order to be able to support setting that to off, we need to update the
references to use ansible_facts[<thing>] instead of ansible_<thing>.

Related: ansible#73654
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1935406
Signed-off-by: Alex Schultz <aschultz@redhat.com>
(cherry picked from commit a7f2fa73e6)
2021-03-26 00:16:58 +01:00
Guillaume Abrioux 932abbc8cf switch2container: do not serialize the ceph-crash migration
There's no need to slow down the playbook execution time by migrating
all the `ceph-crash` instances in a serial way. Let's remove the
`serial: 1` so the migration is achieved in a parallel way.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 980a5a7df4)
2021-03-11 13:52:39 +01:00
Guillaume Abrioux bb1f66cb51 switch2container: fix mon quorum check
The current check makes no sense because it checks any of other monitor
than the one being played (either a previous one already converted or a
next that isn't yet converted) is present on the quorum.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1909011

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 175ffa1b88)
2021-03-11 13:50:27 +01:00
Guillaume Abrioux ba312a5b5d lint: ignore 302,303,505 errors
ignore 302,303 and 505 errors

[302] Using command rather than an argument to e.g. file
[303] Using command rather than module
[505] referenced files must exist

they aren't relevant on these tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 195d88fcda)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 87d53fea08 lint: variables should have spaces before and after
Fix ansible lint 206 error:

[206] Variables should have spaces before and after: {{ var_name }}

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9fba6eecfa)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 35e738c681 lint: commands should not change things
Fix ansible lint 301 error:

[301] Commands should not change things if nothing needs doing

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5450de58b3)
2020-12-16 14:05:45 +01:00
Guillaume Abrioux 92b261df89 lint: set pipefail on shell tasks
Fix ansible lint 306 error:

[306] Shells that use pipes should set the pipefail option

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1879c26eb9)
2020-12-16 14:05:45 +01:00
Dimitri Savineau 3f16132e44 library: add ceph_osd_flag module
This adds ceph_osd_flag ansible module for replacing the command module
usage with the ceph osd set/unset commands.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5da593604a)
2020-12-15 17:36:28 +01:00
Guillaume Abrioux 1ac034a802 switch2containers: do not stop ceph.target in osd play
`ceph.target` should be disabled only. Otherwise, in collocation
scenario you stop other collocated services in the OSD play which isn't
what we want to do. Each daemon has its corresponding play for managing
the transition to container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1901865

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0b05620597)
2020-12-15 17:32:23 +01:00
Dimitri Savineau ed9c51ff5a switch2container: chown symlink in mon/mgr plays
fa2bb3a only fix the symlink owner/group issue in the OSD play. If the
OSDs are collocated with other services like MONs and MGRs then the
chown command will fail.

$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1896448

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 35ed9977aa)
2020-11-16 16:37:04 -05:00
Dimitri Savineau 33f74771d2 switch2container: disable ceph-osd enabled-runtime
When deploying the ceph OSD via the packages then the ceph-osd@.service
unit is configured as enabled-runtime.
This means that each ceph-osd service will inherit from that state.
The enabled-runtime systemd state doesn't survive after a reboot.
For non containerized deployment the OSD are still starting after a
reboot because there's the ceph-volume@.service and/or ceph-osd.target
units that are doing the job.

$ systemctl list-unit-files|egrep '^ceph-(volume|osd)'|column -t
ceph-osd@.service     enabled-runtime
ceph-volume@.service  enabled
ceph-osd.target       enabled

When switching to containerized deployment we are stopping/disabling
ceph-osd@XX.servive, ceph-volume and ceph.target and then removing the
systemd unit files.
But the new systemd units for containerized ceph-osd service will still
inherit from ceph-osd@.service unit file.

As a consequence, if an OSD host is rebooting after the playbook execution
then the ceph-osd service won't come back because they aren't enabled at
boot.

This patch also adds a reboot and testinfra run after running the switch
to container playbook.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1881288

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fa2bb3af86)
2020-11-12 21:08:32 +01:00
Dimitri Savineau 69b51b5f19 monitor: use quorum_status instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the quorum status, we're only using the quorum_names
structure in the ceph status output.
To optimize this, we could use the ceph quorum_status command which contains
the same needed information.
This command returns less information.

$ ceph status -f json  | wc -c
2001
$ ceph quorum_status -f json  | wc -c
957
$ time ceph status -f json > /dev/null

real	0m0.577s
user	0m0.538s
sys	0m0.029s
$ time ceph quorum_status -f json > /dev/null

real	0m0.544s
user	0m0.527s
sys	0m0.016s

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 88f91d8c12)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 1185b7e86a osds: use pg stat command instead of ceph status
The ceph status command returns a lot of information stored in variables
and/or facts which could consume resources for nothing.
When checking the pgs state, we're using the pgmap structure in the ceph
status output.
To optimize this, we could use the ceph pg stat command which contains
the same needed information.
This command returns less information (only about pgs) and is slightly
faster than the ceph status command.

$ ceph status -f json | wc -c
2000
$ ceph pg stat -f json | wc -c
240
$ time ceph status -f json > /dev/null

real	0m0.529s
user	0m0.503s
sys	0m0.024s
$ time ceph pg stat -f json > /dev/null

real	0m0.426s
user	0m0.409s
sys	0m0.016s

The data returned by the ceph status is even bigger when using the
nautilus release.

$ ceph status -f json | wc -c
35005
$ ceph pg stat -f json | wc -c
240

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ee50588590)
2020-11-03 14:38:49 -05:00
Dimitri Savineau 05d4e76d42 switch2container: chown symlink for devices
If the OSD directory is using symlinks for referencing devices (like
block, db, wal for bluestore and journal for filestore) then the chown
command could fail to change the owner:group on some system.

$ ls -hl /var/lib/ceph/osd/ceph-0/
total 28K
lrwxrwxrwx 1 ceph ceph 92 Sep 15 01:53 block -> /dev/ceph-45113532-95ca-471b-bd75-51de46f1339c/osd-data-570a1aee-60c0-44c9-8036-ffed7d67a4e6
-rw------- 1 ceph ceph 37 Sep 15 01:53 ceph_fsid
-rw------- 1 ceph ceph 37 Sep 15 01:53 fsid
-rw------- 1 ceph ceph 55 Sep 15 01:53 keyring
-rw------- 1 ceph ceph  6 Sep 15 01:53 ready
-rw------- 1 ceph ceph  3 Sep 15 02:00 require_osd_release
-rw------- 1 ceph ceph 10 Sep 15 01:53 type
-rw------- 1 ceph ceph  2 Sep 15 01:53 whoami
$ find /var/lib/ceph/osd/ceph-0 -not -user 167 -execdir chown 167:167 {} +
chown: cannot dereference './block': Permission denied
$ find /var/lib/ceph/osd/ceph-0 -not -user 167
/var/lib/ceph/osd/ceph-0/block

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit da4280e243)
2020-09-15 15:30:21 -04:00
Dimitri Savineau dac0415d75 switch2container: remove deb systemd units
When running the switch2container playbook on a Debian based system
then the systemd unit path isn't the same than Red Hat based system.
Because the systemd unit files aren't removed then the new container
systemd unit isn't take in count.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c1af69a7e7)
2020-09-15 15:30:21 -04:00
Guillaume Abrioux 66dde0034b ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-09-10 20:35:04 -04:00
Kevin Coakley 659262f687 Remove ceph-radosgw.target when switching to containerize daemons
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.

Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b2)
2020-08-06 09:41:23 -04:00
Dimitri Savineau 5db4219f26 facts: explicitly disable facter and ohai
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
2020-07-20 21:23:48 +02:00
Guillaume Abrioux 518f4f579d rgw: fix multi instances scaleout
When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.

The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:

```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```

This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
2020-07-20 21:23:27 +02:00
Guillaume Abrioux 085341642e switch-to-containers: set and unset osd flags
The workflow in this playbook should be the same than in rolling_update,
we should first set noout and nodeep-scrub flags before migrating the
first osd and unset osd flags after the last osd is migrated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e0)
2020-06-17 12:15:49 -04:00
Guillaume Abrioux c847c2f117 switch_to_containers: don't set noup flag
We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d384)
2020-06-17 09:24:19 -04:00
Dimitri Savineau a165edb5ba switch_to_container: fix osd systemd regex
The systemd LOAD and ACTIVE fileds could have more than one space between
both values.
This update the systemd regex the same way we're using it in different
part of the code.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5d)
2020-06-16 12:10:36 -04:00
Guillaume Abrioux 1d2a395aaf switch_to_containers: increase health check values
This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3700aa5385)
2020-02-10 12:57:17 -05:00
Guillaume Abrioux 5c3ba0787c switch_to_containers: exclude clients nodes from facts gathering
just like site.yml and rolling_update, let's exclude clients node from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
2020-02-03 09:32:20 -05:00
Dimitri Savineau a325ff61e8 switch_to_containers: fix umount ceph partitions
When a container is already running on a non containerized node then the
umount ceph partition task is skipped.
This is due to the container ps command which always returns 0 even if
the filter matches nothing.

We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0

Also we should not fail on the ceph directory listing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 39cfe0aa65)
2019-12-03 15:58:36 +01:00
Dimitri Savineau 5f91be8740 switch_to_containers: umount osd lockbox partition
When switching from a baremetal deployment to a containerized deployment
we only umount the OSD data partition.
If the OSD is encrypted (dmcrypt: true) then there's an additional
partition (part number 5) used for the lockbox and mount in the
/var/lib/ceph/osd-lockbox/ directory.
Because this partition isn't umount then the containerized OSD aren't
able to start. The partition is still mount by the system and can't be
remount from the container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 19edf707a5)
2019-10-08 00:57:05 +00:00
Guillaume Abrioux b325cc386e switch_to_containers: do not re-set `ceph_uid`
This commit refacts the way we set `ceph_uid` fact in `ceph-facts` and
removes all `set_fact` tasks for `ceph_uid` in switch-to-containers playbook
to avoid duplicated code.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fa9b42e98e)
2019-10-07 10:18:17 -04:00
Guillaume Abrioux 468aa5d63b switch_to_containers: optimize ownership change
As per https://github.com/ceph/ceph-ansible/pull/4323#issuecomment-538420164

using `find` command should be faster.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1757400

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-Authored-by: Giulio Fidente <gfidente@redhat.com>
(cherry picked from commit c5d0c90bb7)
2019-10-07 10:18:17 -04:00
Kevin Jones 3a8de9cc36 Set proper ownership command performance improvement
By changing the set ownership command from using the file module in combination with a with_items loop to a raw chown command, we can achieve a 98% performance increase here.

On a ceph cluster with a significant amount of directories and files in /var/lib/ceph, the file module has to run checks on ownership of all those directories and files to determine whether a change is needed.

In this case, we just want to explicitly set the ownership of all these directories and files to the ceph_uid

Added context note to all set proper ownership tasks

Signed-off-by: Kevin Jones <kevinjones@redhat.com>
(cherry picked from commit 47bf47c9d8)
2019-08-22 12:59:58 +02:00
Guillaume Abrioux 16c6d530c6 roles: introduce `ceph-container-engine` role
This commit splits the current `ceph-container-common` role.

This introduces a new role `ceph-container-engine` which handles the
tasks specific to the installation of containers tools (docker/podman).

This is needed for the ceph-dashboard implementation for 2 main reasons:

1/ Since the ceph-dashboard stack is only containerized, we must install
everything needed to run containers even in non containerized
deployments. Splitting this role allows us to not have to call the full
`ceph-container-common` role which would run a bunch of unneeded tasks
that would have been skipped anyway.

2/ The current implementation would have required to run
`ceph-container-common` on all ceph-clients nodes which would have been
conflicting with 9d3517c670 (we don't want
to run ceph-container-common on all client nodes, see mentioned commit
for more details)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55420d6253)
2019-05-22 15:24:11 -04:00
Guillaume Abrioux e29fd842a6 rename docker_exec_cmd variable
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e74d80e72f)
2019-05-17 16:05:58 +02:00
Rishabh Dave 06b3ab2a6b improve coding style
Keywords requiring only one item shouldn't express it by creating a
list with single item.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 739a662c80)

Conflicts:
	roles/ceph-mon/tasks/ceph_keys.yml
	roles/ceph-validate/tasks/check_devices.yml
2019-05-06 15:09:06 +00:00
Dimitri Savineau f90c051589 switch-from-non-containerized: stop all osds
e6bfb84 introduced a regression in the switch from non containerized
to container deployment.
We need to stop all previous OSDs services. We just don't need the
ceph-disk pattern in the regex.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 150acba8c5)
2019-04-12 00:45:21 +00:00
Guillaume Abrioux f5478dcc0b switch_to_containers: remove ceph-disk references
as of stable-4.0, ceph-disk is no longer supported.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e6bfb843f4)
2019-04-12 00:45:21 +00:00
Guillaume Abrioux 69310a5cd6 switch_to_containers: support multiple rgw instances per host
add multiple rgw instances per host in switch_to_containers playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 70f1eea9b2 switch_to_containers: remove non-containerized systemd unit files
remove old systemd unit files (non-containerized) during the
switch_to_containers transition.

We have seen sometimes the unit started is the old one instead of the
new systemd unit generated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 4064035a54 switch_to_containers: use ceph binary from container
use the ceph binary from the container instead of the host.
If the ceph CLI version isn't compatible between host and container
image, it can cause the CLI to hang.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 7e0a70f7a8 switch_to_containers: do not try to redeploy monitors
`ceph-mon` tries to redeploy monitors because it assumes it was not yet
deployed since `mon_socket_stat` and `ceph_mon_container_stat` are
undefined (indeed, we stop the daemon before calling `ceph-mon` in the
switch_to_containers playbook).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 0eb56e36f8 introduce new role ceph-facts
sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speedup
the playbook.

Closes: #3282

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-12 11:18:01 +01:00
Rishabh Dave 2fb12ae554 use pre_tasks and post_tasks when necessary
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-12-05 08:17:10 +00:00
Rishabh Dave e4f0af2b78 don't use private option for import_role
Since sharing variables amongst roles has been made default since
Ansible 2.6, private option has been deprecated; so stop using it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-12-04 23:45:59 +00:00
Sébastien Han 2814d36c93 infra playbooks: use the right container binary
Use podman or docker wether they are available or not. podman will be
prioritized over docker if present.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han c14f9b78ff switch: do not look for devices anymore
It's easier lookup a directoriy instead of the block devices,
especially because of ceph-volume and ceph-disk have a different way to
handle devices.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Sébastien Han cd56dad9fa switch: disable all ceph units
Prior to this commit we were only disabling ceph-osd units, but forgot
the ceph.target which is controlling everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled there won't be any conflicts between
the old non-container and the new container units.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Sébastien Han fe1d09925a switch: do not mask systemd unit
If we mask it we won't be able to start the OSD container since now the
osd container use the osd ID as a name such as: ceph-osd@0

Fixes the error:  Failed to execute operation: Cannot send after transport endpoint shutdown

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Guillaume Abrioux c783bc70da docker-common: rename role
rename `ceph-docker-common` role to `ceph-container-common`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-12 10:51:48 +01:00
Rishabh Dave 3f62fc585f don't use "role" or "roles" to include roles
Since import_role and include_role are more readable, explicit (about
the nature of inclusion) and flexible (allows placibf inclusion
anywhere) amongst the tasks, use them instead of using roles or role
keyword. Besides, these keywords also allow more arguments.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-10-31 09:38:59 +01:00
Guillaume Abrioux d8d3e55006 remove restapi role
As of `mimic`, restapi is no longer available because of manager daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-30 14:19:13 +01:00
Sébastien Han 9fccffa1ca switch: allow switch big clusters (more than 99 osds)
The current regex had a limitation of 99 OSDs, now this limit has been
removed and regardless the number of OSDs they will all be collected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-10 16:35:30 -04:00
Noah Watkins 8dcc8d1434 Stringify ceph_docker_image_tag
This could be a numeric input, but is treated like a string leading to
runtime errors.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-10-10 04:26:33 +00:00