purge-docker-cluster must remove all osd_disk_prepare logs in
`{{ ceph_osd_docker_run_script_path }}`, otherwise if you purge your
cluster and try to redeploy it, osds will fail to start since because it
will try to retrieve find a partition uuid which doesn't exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510470
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The ansible inventory could have more than just ceph-ansible hosts, so
we shouldnt use "hosts: all", also only grab one file when getting
the ceph cluster name instead of failing when there is more than one
file in /etc/ceph. Also fix location of the ceph.conf template
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by default by running the purge playbook with -e
reboot_osd_node=True
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
During purge osd, the containers are not stopped because of a typo, as a
result, all the devices can't be unmounted later.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
stable-3.0 brought numerous changes in ceph-ansible variables, this PR
aims to maintain backward compatibility for someone running stable-2.2
upgrading to stable-3.0 but keeps its groups_vars untouched.
We will then determine the right options to make sure the upgrade works
but we are expecting that new variables should be used.
We will drop this in a near future, maybe 3.1 or 3.2.
Signed-off-by: Sébastien Han <seb@redhat.com>
nfs nodes can't be upgraded from jewel to luminous because ceph-nfs role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
package is upgraded in `ceph-nfs` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
mgr nodes can't be upgraded from jewel to luminous because ceph-mgr role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
ceph-mgr package is upgraded in `ceph-mgr` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 302e563601cd6820b1ae44fabdfb1506688c7c9b)
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurence)
- A bit of cleanup
Signed-off-by: Sébastien Han <seb@redhat.com>
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.
Ansible throws warnings when using yum/dnf/rpm with the command
module:
[WARNING]: Consider using yum module rather than running yum
This patch adds the `warn: no` argument to suppress the warnings
in the Ansible output.
This playbook can replace failed OSD in containerized and
non-containerized env.
The current limitation is that it won't allow you to choose between
filestore/bluestore and will do collocation as well.
Signed-off-by: Sébastien Han <seb@redhat.com>
Using a condition when osd_scenario == 'non-collocated' was wrong since
these partitions can be collocated on a single device also. Removing the
check makes the purge of these partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and blocking the CI too. So removing it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.
Signed-off-by: Sébastien Han <seb@redhat.com>
Use the pg check before doing the pg check, not on the quorum check.
Also never quote int when doing comparaison.
Signed-off-by: Sébastien Han <seb@redhat.com>
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.
Signed-off-by: Boris Ranto <branto@redhat.com>
The systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these.
Signed-off-by: Boris Ranto <branto@redhat.com>
Also we now play ceph-config to have everything being generated for new
daemons bootstrap during upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1497959
Signed-off-by: Sébastien Han <seb@redhat.com>
Even if the task is skipped, ansible registers the var as 'skipped' so
this task the task using this variable for its next usage.
Signed-off-by: Sébastien Han <seb@redhat.com>
rbd-mirror containers are not stopped in purge-docker-cluster playbook
because of the wrong name used.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
unti now, mgr nodes are not managed by purge-cluster.yml, therefore it
breaks scenario like purge_cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The old name is used in `rolling_update.yml` and
`purge-docker-cluster.yml`, it breaks the
`test_rgw_service_is_running()` test.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commits allows us to run
switch-from-non-containerized-to-containerized-ceph-daemons.yml multiple
times.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489353
Signed-off-by: Sébastien Han <seb@redhat.com>
If devices is passed through an extra var this register won't work so
let's only register the var is devices is not defined.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489099
Signed-off-by: Sébastien Han <seb@redhat.com>
Prior command was avoiding the lockbox mountpoint and the playbook was
failing with:
rmtree failed: [Errno 30] Read-only file system:
'/var/lib/ceph/osd-lockbox/4e9d8052-87c2-4fde-a56c-b8c108a3eefc/key-management-mode'
Signed-off-by: Sébastien Han <seb@redhat.com>
we need to force the value of `docker` variable which is initially set
to `false` since it's a migration from non-containerized to
containerized cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We must mask the image so we are sure that even if the system reboots
then the OSDs won't start.
Also remove Ceph udev rules if found on the system prior to deploy
containers. If we don't do this we are exposed to conflicts between udev
rules and sytemd unit files.
Also add the CI will now test the migration from a non-containerized cluster to a
containerized cluster.
Signed-off-by: Sébastien Han <seb@redhat.com>
Monitor removal from the monmap is not immediate, so let's wait a little
bit and then fail if the monitor is still in the monmap.
We try twice in total with 10 sec intervals.
Signed-off-by: Sébastien Han <seb@redhat.com>
Move untested/with few confidence playbooks in a untested-by-ci
directory.
Also removing this directory from the package build.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1461551
Signed-off-by: Sébastien Han <seb@redhat.com>
Prior to this patch, we were applying the osd flags like this:
"
General pre tasks
Set flags
Upgrade OSDs on a host
Unset flags <-- this triggers pending scrub to start
Set flags
Upgrade OSDs on a hosts
Unset flags <-- this triggers pending scrub to start
.
.
.
General post tasks
"
Now instead, we apply the flag once before starting the OSD update and
unset them once the last OSD is finished.
"
General pre tasks
Set flags and wait for any scrubs to finish
Upgrade OSDs on a host
Upgrade OSDs on a host
.
.
.
Unset flags
General post tasks
"
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1450754
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
In our test case we don't have any pgs, thus the check fails. The check
always returns an empty array, which makes the comparaison failing.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit eases the use of the
infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook. We basically run it with a couple of pre-tasks and then we let
the playbook run the docker roles.
It obviously expect to have proper variables configured in order to
work.
Signed-off-by: Sébastien Han <seb@redhat.com>
In the switch to containers migration there were broken references
to ceph_mon_docker_subnet variable, replaced with public_network.
Also fixes references to ceph_mon_docker_extra_env setting for it
a default as it could be undefined.
There is only two main scenarios now:
* collocated: everything remains on the same device:
- data, db, wal for bluestore
- data and journal for filestore
* non-collocated: dedicated device for some of the component
Signed-off-by: Sébastien Han <seb@redhat.com>
This will give us more flexibility and avoid a lot of useless when
skipping all tasks from a non-desired role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
remove `ceph_mon_docker_interface` and use `monitor_interface` instead
for both containerized and non-containerized deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The new test in the checks PGs are no longer working on distributions
where /bin/sh isn't linked to /bin/bash.
Fix: #1619
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Avoid screen scrapping by rewriting `waiting for clean pgs` tasks like it is
done in 304de48.
Use the json output returned by `ceph -s` instead
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We need to include ceph_docker_registry when removing containers/images
because if we don't it will assume docker.io which is not always where
the image originated from, causing the playbook to fail.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
If we're purging a containerized cluster that did not use the
raw_multi_journal OSD scenario then raw_journal_devices will not be
defined which causes the playbook to fail.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1455187
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
`ceph-docker-common`:
At the moment there is a lot of duplicated tasks in each
`./roles/ceph-<role>/tasks/docker/main.yml` that could be refactored in
`./roles/ceph-docker-common/tasks/main.yml`.
`*_containerized_deployment` variables:
All `*_containerized_deployment` have been refactored to a single
variable `containerized_deployment`
duplicate `cephx` variables in `group_vars/* have been removed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Problem: we are delegating the set/unset flag to a monitor node but we
try to call an osd container
Solution: use the right container name.
Signed-off-by: Sébastien Han <seb@redhat.com>
Without this, we don't test the mgr role so we need to add it.
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Problem: too many different commands to do the same thing. The 'cut'
command on infrastructure-playbooks/purge-cluster.yml was also wrong.
This sed command from osixia in ceph-docker
https://github.com/ceph/ceph-docker/pull/580/ addresses all the
scenarios.
Signed-off-by: Sébastien Han <seb@redhat.com>
Doing so will override any values set for these in the group_vars
directory relative to the users inventory.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Doing so at playbook level overrides whatever values might be set for
these in the user's group_vars directory that's relative to their
inventory.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This has the behavior of overriding custom values set in group_vars.
I've added defaults to the rest of the group names so that if they are
not overridden in group_vars then defaults will be used.
See: https://bugzilla.redhat.com/show_bug.cgi?id=1354700
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic, so keyword "skip" to fix it.
Exit the playbook if the user not define devices in both host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic err, so add keyword "skip" to fix it.
Exit the playbook if the user not define devices in both host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
host_vars/default.yml
load device partition file in directory host_vars
1) if the user define host_vars/hostname.yml load the devices partition on this file.
2) otherwise load host_vars/default.yml for default
The task waiting for the monitor to join the quorum... , the result for ceph -s | grep monmap only contain monmap, not included quorum:
# ceph -s --cluster ceph | grep monmap
monmap e1: 3 mons at {sh-office-ceph-1=10.12.10.34:6789/0,sh-office-ceph-2=10.12.10.35:6789/0,sh-office-ceph-3=10.12.10.36:6789/0}
If want to get monitor, should use this:
# ceph -s --cluster ceph | grep election
election epoch 80, quorum 0,1 sh-office-ceph-1,sh-office-ceph-2
ceph verison: 10.2.5
We now run the container and waits until it dies. Prior to this we were
stopping it before completion so not all the devices where zapped.
Signed-off-by: Sébastien Han <seb@redhat.com>
Some playbooks use [0-9]*, others use \d+$
The latter is more correct since cluster name may contain numbers.
Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>
This is to allow for ceph-installer usage of this playbook and
to ensure that you have the correct keys locally when bootstrapping.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>