When setting/unsetting osd flags, we can use `tasks_from` when importing
`ceph-facts` role to save some times given that we only need this role
for setting `container_binary`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d66b698be2)
This commit adds crlf between each task.
It makes the playbook more readable.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ef9fb68bc)
when rerunning lvm_setup.yml on existing cluster with OSDs already
deployed, it fails like following:
```
fatal: [osd0]: FAILED! => changed=false
msg: Sorry, no shrinking of data-lv2 to 0 permitted.
```
because we are asking `lvol` module to create a volume on an empty VG
with size extents = `100%FREE`.
The default behavior of `lvol` is to shrink the volume if the LV's current
size is greater than the requested size.
Given the requested size is calculated like this:
`size_requested = size_percent * this_vg['free'] / 100`
in our case, it is similar to:
`size_requested = 100 * 0 / 100` which basically means `0`
So the current LV size is well greater than the requested size which
leads the module to attempt to shrink it to 0 which isn't obviously now
allowed.
Adding `shrink: false` to the module calls fixes this issue.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 218f4ae361)
This commit fixes these tasks when --limit is used.
It makes sure the fact is set on right nodes even when the playbook is
run with `--limit`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f8a951f50c)
The ceph-dashboard role is executed on the mgr nodes so the TLS cert/key
files are copied to those nodes.
But we are running importing the cert/key files into the ceph
configuration on the monitor.
Closes: #5557
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2b8ebf1457)
By default, ansible gathers facts from facter and ohai if installed on
the remote nodes, given we don't need them, let's exclude these facts
from our facts gathering
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c95adc564b)
This variable isn't consumed by the container so we can remove it.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1361e84a4e)
When rgw and osd are collocated, the current workflow prevents from
scaling out the radosgw_num_instances parameter when rerunning the
playbook.
The environment file used in the rgw systemd template is rendered when
executing the `ceph-rgw` role but during a new run of the playbook (in
order to scale out rgw instances), handlers are triggered from `ceph-osd`
role which is run before `ceph-rgw`, therefore it tries to start the new
rgw daemon whereas its corresponding environment file hasn't been
rendered yet and fails like following:
```
ceph-radosgw@rgw.ceph4osd3.rgw1.service failed to run 'start-pre' task: No such file or directory
```
This commit moves the tasks generating this file in `ceph-config` role
so it is generated early.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851906
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7dd68b9ac1)
We need to set the mgr dashboard server ip address before restarting the
dashboard module otherwise we can try to bind the dashboard module on an
already used address.
We already do this configuration for the dashboard port value and ssl
setup so we should do the same for server address too.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1851455
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 03cd75845f)
If a failure occurs in ceph-validate, the upgrade playbook keeps running
where we expect it to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8f9cdf4b10)
Since [1] if a rgw user already exists then the radosgw-admin user create
command will return an error instead of modifying the current user.
We were already doing separated tasks for create and get operation but
only for multisite configuration but it's not enough.
Instead we should do the get task first and depending on the result
execute the create.
This commit also adds missing run_once and delegate_to statement.
[1] https://github.com/ceph/ceph/commit/269e9b9
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ac0f68ccf0)
This commit makes all jobs authenticating to docker hub in order to
avoid the rate limit.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 40307f810c)
This commit adds a note about `stable-3.0` `stable-3.1` branches which
are deprecated and not maintained anymore.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit bbe30bcc69)
This commit updates the documentation to add a note about containerized
deployments.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e61488507b)
This fixes a long standing fail in ceph-volumes lvm test suite.
Otherwise the default behaviour should not change.
Signed-off-by: Jan Fajerski <jfajerski@suse.com>
(cherry picked from commit 1fe8e819f9)
This changes the way we are running the podman containers via systemd.
They are now in dettached mode and Type/PIDFile set.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1834974
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d43769dc2a)
Since we only have one scenario since nautilus then we can just move
the container start command from ceph-osd-run.sh to the systemd unit
service.
As a result, the ceph-osd-run.sh.j2 template and the
ceph_osd_docker_run_script_path variable are removed.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 829990e60d)
This commit makes the playbook copying self-signed generated certificate
to monitors.
When mons and mgrs are deployed on dedicated nodes the playbook will
fail when trying to import certificate and key files since they are
generated on mgrs whereas we try to import them from a monitor.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1846995
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b7539eb275)
This commit makes the zap function idempotent, especially when using
lvm_volumes variable.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1845668
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3f47236470)
When using docker container engine then the systemd unit scripts only
use a dependency on the docker daemon via the After parameter.
But if docker is restarted on a live system then the ceph systemd units
should wait for the docker daemon to be fully restarted.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1846830
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bd22f1d1ec)
This commit makes the images pulling skipped if podman isn't installed
on the machine.
In OSP context, the podman installation is done later in the workflow,
it means all `podman pull` commands will fail.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1849559
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 37b20b6525)
This commit adds a chapter about the ceph upgrade process.
Closes: #5393
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit e41487dbce)
The workflow in this playbook should be the same than in rolling_update,
we should first set noout and nodeep-scrub flags before migrating the
first osd and unset osd flags after the last osd is migrated.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cfaa056e0)
We shouldn't set this flag when running switch_to_containers playbook.
Otherwise the playbook fails waiting for pgs to be clean.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843569
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit b91d60d384)
When a container image managed by podman isn't tag anymore then the
RepoDigests field when inspecting the image doesn't return any value.
This is different from docker workflow and it breaks the ceph-ansible
container upgrade when collocated multiple services and using a non
fix container tag (like latest or 4).
$ podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/ceph/daemon latest 680c9c0d38c3 8 days ago 957 MB
<none> <none> 011ee108bfc9 2 months ago 1.01 GB
$ podman inspect 680c9c0d38c3 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:20cf789235e23ddaf38e109b391d1496bb88011239d16862c4c106d0e05fea9e"
$ podman inspect 011ee108bfc9 | jq .[0].RepoDigests[0]
null
Because this field returns "null" then the ansible task trying to
determine this value is failing
-----------------------------
fatal: [foo]: FAILED! =>
msg: |-
The task includes an option with an undefined variable. The error
was: None has no element 0
The error appears to be in
'roles/ceph-container-common/tasks/fetch_image.yml': line 137,
column 3, but may be elsewhere in the file depending on the exact
syntax problem.
The offending line appears to be:
- name: set_fact ceph_osd_image_repodigest_before_pulling
^ here
-----------------------------
We don't have this behaviour with docker.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/ceph/daemon latest 680c9c0d38c3 8 days ago 928 MB
docker.io/ceph/daemon <none> 011ee108bfc9 2 months ago 986 MB
$ docker inspect 680c9c0d38c3 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:45e6f28bb67c81b826acb64fad5c0da1cac3dffb41a88992fe4ca2be79575fa6"
$ docker inspect 011ee108bfc9 | jq .[0].RepoDigests[0]
"docker.io/ceph/daemon@sha256:b393a73309d72e43ca7d65cd3519036007947671e373eb59aa75a46185c52231"
Instead we should just get the Id field.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1844496
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cdb30bd125)
The systemd LOAD and ACTIVE fileds could have more than one space between
both values.
This update the systemd regex the same way we're using it in different
part of the code.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1843500
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 50140c9b5d)
When using an untrusted TLS certificate (like self-signed) on grafana
then the grafana dashboards update subcommand will fail.
One solution could be to trust the TLS certificate.
The other one is to disable the TLS verification on the grafana API.
Closes: #5324
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b20519efd0)
We were only adding the endpoints to the master zone but not to the
zonegroup.
This patch fixes the issue.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1839228
Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 0175c205fa)
The condition on this task is wrong, we have to check whether
`target_size_ratio` is set in the pool definition instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8c7a48832c)
always set these facts on monitor nodes whatever we run with `--limit`.
Otherwise, playbook will fail when using `--limit` on nodes where these
facts are used on a delegated task to monitor.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5e81843e9)
The dashboard nodes (alertmanager, grafana, node-exporter, and prometheus)
were not manage during the docker to podman migration.
This adds the systemd container template of those services to a dedicated
file (systemd.yml) in order to include it in the docker2podman playbook.
This also adds the dashboard container images pull from docker to podman.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829389
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 252e78b4e4)
The docker2podman playbook only installs the podman package and updates
the systemd units with the right container_binary value.
We never pull the container image so if one service is restarted then
the container image will be pulled first before the service can start
which could cause longer downstream.
To avoid to download the container image from internet again we can just
pull it from the local docker daemon.
The container_{binding,package,service}_name variables are removed
because they are only used in the ceph-container-engine role which
isn't call in this playbook.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit d38f21aeba)
The rbdmirror group name was using the wrong variable definition.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c0a213f928)
The current ganesha log directory is only present in the container
and not bind mount on the host.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 222fe4abd8)
The docker commands should be based on the container_binary variable
otherwise running the playbook on a host without docker (like podman
only) will failed.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1829985
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Same fix as `ceph-rgw` for `rgw_create_pools` pool names that contain Jinja
templates.
See #5348 for details.
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 444b46ea24)
It is common to set templated pool names in `rgw_create_pools`, e.g.
```yaml
rgw_create_pools:
"{{ rgw_zone }}.rgw.buckets.index":
pg_num: 16
size: 3
type: replicated
```
This worked fine with Ansible 2.8, but broke in Ansible 2.9 due to a change in
the way `with_dict` works [1].
This commit replaces the use of `with_dict` with
```yaml
loop: "{{ rgw_create_pools | dict2items }}"
```
which works as intended and expands the template in the pool name.
[1]: https://docs.ansible.com/ansible/latest/porting_guides/porting_guide_2.9.html#loopsCloses#5348
Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit d2b7670c7d)
When using radosgw_interface and IPv6 setup then the _radosgw_address
fact doesn't use square brackets compared to the radosgw_address and
radosgw_address_block configuration.
Closes: #5325
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ed4f23d530)
This change allows the operator to refresh the
ceph dashboard admin role on multiple ceph-ansible
executions.
In the current state the role is set only when the
user is created, and there's no way to change it if
the user exists.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1826002
Signed-off-by: fmount <fpantano@redhat.com>
(cherry picked from commit 5eb363e033)
this commit removes the task which enable application on cephfs pools.
See: https://tracker.ceph.com/issues/43761Fixes: #5278
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 86dc6f8206)
Otherwise it might try to use the system installed version of ansible
when there's one available.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9acb5e6d)
Make all of the hosts start at 1 and not 0,
also make some minor changes in scenario 3 to
remova an inconsistency.
Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit bd1440f2cd)
The '==' jinja2 operator (or 'equalto') has been introduced in jinja2
2.8.
On EL7, jinja2 version is 2.7 so the operator isn't present creating
templating error like:
The error was: TemplateRuntimeError: no test named '=='
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1747206
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 34e6e8e06c)
Since ea2b654d9 we're not running the rados command from the monitor
nodes but from the ganesha node. Unfortunately we don't have the
required keyring on that node to run the rados command as we don't
import the right keyring.
This commit restores the workflow for internal ganesha deployment like
before ea2b654d9 but keeps the rados commands from the ganesha node for
external deployment until we have a better design.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8a890306ad)