Commit Graph

5346 Commits (7745fd3560b3ef52b216c1d9f3c5677f80d9de0a)
 

Author SHA1 Message Date
Dimitri Savineau 7745fd3560 container: run engine/common roles on first client
We already do this in the site-container.yml playbook because we don't
need docker/podman installed on all client nodes and having the
container image only on the first client node.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 8ecbdc6ede)
2020-09-10 20:57:16 +02:00
Dimitri Savineau 0c0a930374 ceph-facts: only get fsid when monitor are present
When running the rolling_update playbook with an inventory without
monitor nodes defined (like external scenario) then we can't retrieve
the cluster fsid from the running monitor.
In this scenario we have to pass this information manually (group_vars
or host_vars).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1877426

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f63022dfec)
2020-09-10 20:57:16 +02:00
Dimitri Savineau dd05d8ba90 tests: use grafana from quay.io
This changes the grafana container image regitry from docker.io to
quay.io to avoid rate limit.
This also adds the missing container image values for docker2podman
and podman scenarios.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 98c9afceb9)
2020-09-10 17:30:37 +02:00
Guillaume Abrioux 218aedaab6 tests: migrate to quay.ceph.io registry
in order to avoid docker.io rate limiting

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2cbb7de3b2)
2020-09-10 17:30:37 +02:00
Francesco Pantano 8e3ecfd869 Add --cluster option on ceph require-osd-release command
On DCN environments, or when multiple ceph cluster are configured,
we need to specify the cluster name before running the command or
the rolling_update playbook will fail during minor updates.

Closes: https://bugzilla.redhat.com/1876447
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit cb64df30b6)
2020-09-09 14:54:19 +02:00
Francesco Pantano 8dd8675080 Fix hosts field in rolling_update playbook when mds are processed
In the OSP context, during the rolling update the playbook fails
with the following error:

'''
ERROR! The field 'hosts' has an invalid value, which includes an
undefined variable. The error was: list object has no element 0
'''

This PR just change the hosts field providing a valid mons group
value.

Closes: https://bugzilla.redhat.com/1876803
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit e65f9a5c72)
2020-09-09 14:53:44 +02:00
Niko Smeds a41c572785 Enable HAProxy backend checks for Ceph RGW
Add the `check` option to server definitions to enable basic HAProxy health
checks for Ceph RADOS gateway backends.

Currently traffic will be forwarded to unhealthly `radosgw.service` servers.
These changes resolve the issue.

Signed-off-by: Niko Smeds nikosmeds@gmail.com
(cherry picked from commit a951c1a3f0)
2020-09-02 09:54:52 -04:00
Guillaume Abrioux 3a8be20699 rolling_update: remove 'ignore_errors'
There's no need to use `ignore_errors: true` on these tasks.

Using a loop on the task stopping mon daemons allows us to avoid
duplicating this task, the `ignore_errors` isn't needed here because it
won't fail the playbook if one of the ID doesn't exist (shortname vs. fqdn)

Using the right condition on the task starting the mgr daemon allows us
to avoid using an `ignore_errors: true` as well.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit cec994b973)
2020-08-21 16:33:15 +02:00
Guillaume Abrioux 2cc7ed21f0 ceph_key: refact the code and minor fixes
This commit refactors the code to remove a duplicate condition and it
makes the `state: absent` code idempotent

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 13e2311cbe)
2020-08-21 13:56:11 +02:00
Guillaume Abrioux f22078b16d tests: add more coverage for test_ceph_key
This commit adds more coverage regarding the testing of ceph_key module

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 27ca884d99)
2020-08-21 13:56:11 +02:00
Guillaume Abrioux 8fde5f7396 dashboard: refact admin user creation task
this commit splits this task in order to avoid using a `shell` module.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 54d3e9650f)
2020-08-21 13:55:54 +02:00
Dimitri Savineau dcf915f597 tests: reenable nfs-ganesha testing
This re-adds the nfs-ganesha testing in non containerized deployment.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 6c11695fbe)
2020-08-20 12:22:32 -04:00
George Shuklin c0d98878ff Make 'disable ssl for dashboard task' idempotent.
This should reduce number of 'changed' tasks during convergence test.

Signed-off-by: George Shuklin <george.shuklin@gmail.com>
(cherry picked from commit 73d4bb6bd6)
2020-08-20 17:16:45 +02:00
Rafał Wądołowski 21a37e23b3 Comment out ceph_custom_key
Since there is a check if ceph_custom_key is defined, there is no reason
to define it by default.

Signed-off-by: Rafał Wądołowski <rwadolowski@cloudferro.com>
(cherry picked from commit 55cd6e83e4)
2020-08-20 13:43:44 +02:00
Guillaume Abrioux 4685b411de tests: move erasure pool testing in lvm_osds
This commit moves the erasure pool creation testing from `all_daemons`
to `lvm_osds` so we can decrease the number of osd nodes we spawn so the
OVH Jenkins slaves aren't less overwhelmed when a `all_daemons` based
scenario is being tested.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8476beb5b1)
2020-08-20 11:55:40 +02:00
John Fulton 489efd5689 Set default permission for prometheus config files
Regardless of the outcome of Ansible 2.9.12 issue 71200
we can set a default permission for these files.

Closes: https://github.com/ceph/ceph-ansible/issues/5677

Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit 95dee6f1ca)
2020-08-18 18:04:17 -04:00
Guillaume Abrioux 81d116b0ac shrink-mds: use mds_to_kill_hostname instead
When using fqdn in inventory host file, this task will fail because the
mds is registered with its shortname.

It means we must use `mds_to_kill_hostname` in this task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1869837

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 51c382677d)
2020-08-18 15:09:57 -04:00
Guillaume Abrioux 3fad1677d6 infra: only install logrotate on right nodes
For intsance, there is no need to install logrotate on clients nodes.

This also ensure logrotate is installed only for containerized
deployments since the packaging has an explicit dependency to logrotate

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ed11ea3ee)
2020-08-18 11:10:57 -04:00
Guillaume Abrioux 8f18411770 travis: enforce ansible-lint 4.2.0
Let's pin to 4.2.0

(because of ansible/ansible-lint/issues/966)

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 04d77dcaeb)
2020-08-18 16:37:51 +02:00
Dimitri Savineau e9c6028eb9 ceph-rgw: allow specifying crush rule on pool
We already support specifiying a custom crush rule during pool creation
in ceph-osd role but not in ceph-rgw role.
This patch adds the missing code to implement this feature.
Note this is only available for replicated pool not erasure. The rule
must also exist prior the pool creation.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1855439

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cb8f0237e1)
2020-08-17 23:00:13 +02:00
Dimitri Savineau 8ebe813428 container: don't install the engine on all clients
We only need the container engine to be installed on the first clients
node in order to execute the pools/keys operation. We already do the
same worflow with the ceph-container-common role which pull the ceph
container image.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9805589ef9)
2020-08-17 22:59:40 +02:00
Guillaume Abrioux 004155d407 purge-cluster: use sysfs method for unmapping rbd devices
This way we keep consistency with purge-container-cluster.yml playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f77fa6e2a4)
2020-08-17 09:50:08 -04:00
Ali Maredia 63d991dc3d rgw: allow rgws to be concurrently with or without multisite
Allows rgws in a ceph cluster to be run with
multisite and without multisite at the same time.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
(cherry picked from commit 5c1f4b1a1e)
2020-08-17 13:56:45 +02:00
Guillaume Abrioux 2609da6ce7 infra: add missing tag
This commit adds the missing `with_pkg` tag on the logrotate
installation task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e1cb385740)
2020-08-13 10:09:31 -04:00
Guillaume Abrioux 56d2b62e00 purge: import ceph-defaults in purge osd play
Otherwise, `ceph_volume_debug` variable is undefined

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 33a544644a)
2020-08-12 22:57:10 +02:00
Guillaume Abrioux 29d4c42f80 infra: add log rotation support (containers)
This commit adds the log rotation support via logrotate in containerized
deployments.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1848388

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit f1aa6cea21)
2020-08-12 22:57:10 +02:00
Guillaume Abrioux 8a7e4193db common: don't enable debug log on ceph-volume calls by default
ceph-volume can generate large logs at some point.

debug logs by definition should be enabled only when debugging.

Let's make it customizable with a variable which is set to `False` by
default.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 448cc280b7)
2020-08-12 22:57:10 +02:00
Guillaume Abrioux 223254e8bf nfs: do not copy rgw keyring when `nfs_obj_gw` is true
This keyring shouldn't be copied when `nfs_obj_gw` is `True` if the
cluster doesn't contain a rgw node, which can be the case given we are
using `nfs_obj_gw` instead of `nfs_file_gw` (cephfs vs. object), the
deployment will fail trying to copy a key that doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit dd4b5b0328)
2020-08-12 14:57:56 -04:00
raul 5fc3af5f4d rgw: support 1+ rgw instance in `radosgw_frontend_port`
Change the radosgw_frontend_port to take in account more than 1 RGW instance,
in it's original form `radosgw_frontend_port: radosgw_frontend_port | int`,
it configured the 8080 port to all instances, with the following modification
`radosgw_frontend_port: radosgw_frontend_port | int + item|int` we increase in
1 the port count.

Co-authored-by: Daniel Parkes <dparkes@redhat.com>
Signed-off-by: raul <rmahique@redhat.com>
(cherry picked from commit 110eaf5f9f)
2020-08-12 14:57:35 -04:00
Guillaume Abrioux 4369833008 tests: test iscsigw against stable build
This commit makes the ci using stable build for testing iscsigw in
stable-5.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-08-12 12:19:18 -04:00
Benoît Knecht 5d06c0eda9 purge-cluster: check if rbdmap exists
When running `infrastructure-playbooks/purge-cluster.yml` twice, it fails the
second time on the `ensure rbd devices are unmapped` task, because `rbdmap`
isn't installed anymore at that point.

This commit adds a check that ensures `rbdmap` is available, and skips the
`ensure rbd devices are unmapped` task if it isn't.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit a57fd7a090)
2020-08-06 12:01:50 -04:00
Kevin Coakley 92b400f433 Remove ceph-radosgw.target when switching to containerize daemons
The task "remove old systemd unit file" under "switching from
non-containerized to containerized ceph rgw" only removes
the ceph-radosgw@.service file. The task should also remove
the ceph-radosgw.target file, like the "remove old systemd unit
files" tasks for the mons, mgrs, osds, etc, in order to clean up
all of the unused systemd unit files.

Signed-off-by: Kevin Coakley <kcoakley@sdsc.edu>
(cherry picked from commit d19e6033b2)
2020-08-06 11:43:12 -04:00
Guillaume Abrioux bd3439db75 shrink_osd: remove osd data directory
Otherwise it leaves an empty directory.
When shrinking and redeploying multiple OSDs you have no guarantee it
will reuse the same osd id.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8933bfde33)
2020-08-06 13:09:38 +02:00
Guillaume Abrioux 8632db7cb8 tests: refact shrink_osd scenario
This adds more coverage on the shrink_osd scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 7efea219d6)
2020-08-06 13:09:38 +02:00
Guillaume Abrioux d4dc674fa4 tox: split shrink_osd scenario
Let's split this scenario with a dedicated tox ini file.

This is for testing in two ways:

1/ shrinking OSDs one by one
2/ shrinking multiple OSDs with a single call of the playbook

ceph-build related PR: ceph/ceph-build#1629

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 78e4faf077)
2020-08-06 13:09:38 +02:00
Benoît Knecht ccefe7da9f shrink-osd: various fixes
This handles missing /etc/ceph/osd, by ensuring we actually found files in
`/etc/ceph/osd` before trying to slurp their content.

This also add a missing `| default(False)` to avoid fowlloing error:

```
fatal: [ceph01]: FAILED! =>
  msg: |-
    The conditional check 'ceph_osd_data_json[item.2]['encrypted'] | bool' failed. The error was: error while evaluating conditional (ceph_osd_data_json[item.2]['encrypted'] | bool): 'dict object' has no attribute 'encrypted'
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1862416

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit fe8fbd3ee2)
2020-08-06 13:09:38 +02:00
Dimitri Savineau 92a2a2cf32 pytest: register ceph_crash mark
Otherwise we see some pytest warning.

PytestUnknownMarkWarning: Unknown pytest.mark.ceph_crash - is this a typo?
You can register custom marks to avoid this warning - for details,
see https://docs.pytest.org/en/latest/mark.html

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 03d4620269)
2020-08-06 09:41:54 +02:00
Guillaume Abrioux e0dc56b73c config: only add related rgw section
there's no need to add each rgw section on all rgw nodes.
With this commit, only related rgw section are rendered.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0a581a6e60)
2020-08-05 09:50:09 -04:00
Dimitri Savineau f9a24e2541 dashboard: allow remote TLS cert/key copy
When using TLS on the ceph dashboard or grafana services, we can provide
the TLS certificate and key.
Those files should be present on the ansible controller and they will be
copyied to the right node(s).
In some situation, the TLS certificate and key could be already present
on the target node and not on the ansible controller.
For this scenario, we just need to copy the files locally (on each remote
host).

This patch adds the dashboard_tls_external variable (with default to
false) to allow users to achieve this scenario when configuring this
variable to true.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1860815

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0d0f1e71df)
2020-08-04 14:01:59 +02:00
Dimitri Savineau 1dd9c43efc rolling_update: restart mds after the upgrade
In addition of 155e2a2, the active mds daemons isn't stop/start
correctly as opposed as the other services so that daemon doesn't come
back after the upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1861688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ec0a37a74f)
2020-07-29 17:43:36 -04:00
Dimitri Savineau 2ce60504bd rolling_update: refact dashboard workflow
The dashboard upgrade workflow should do the same process than the ceph
upgrade otherwise any systemd unit modification won't be apply on the
monitoring/dashboard stack.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a6209bd957)
2020-07-27 10:59:25 -04:00
Dimitri Savineau 8ea3fa1752 rolling_update: stop/start instead of restart
During the daemon upgrade we're
  - stopping the service when it's not containerized
  - running the daemon role
  - start the service when it's not containerized
  - restart the service when it's containerized

This implementation has multiple issue.

1/ We don't use the same service workflow when using containers
or baremetal.

2/ The explicity daemon start isn't required since we'are already
doing this in the daemon role.

3/ Any non backward changes in the systemd unit template (for
containerized deployment) won't work due to the restart usage.

This patch refacts the rolling_update playbook by using the same service
stop task for both containerized and baremetal deployment at the start
of the upgrade play.
It removes the explicit service start task because it's already included
in the dedicated role.
The service restart tasks for containerized deployment are also
removed.

Finally, this adds the missing service stop task for ceph crash upgrade
workflow.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1859173

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 155e2a23d5)
2020-07-27 10:59:25 -04:00
Dimitri Savineau 56cf7168fa ceph-handler: remove iscsigws restart scripts
The iscsigws restart scripts for tcmu-runner and rbd-target-{api,gw}
services only call the systemctl restart command.
We don't really need to copy a shell script to do it when we can use
the ansible service module instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cbe79428e6)
2020-07-25 09:34:25 +02:00
Dimitri Savineau 2faed4c204 podman: always remove container on start
In case of failure, the systemd ExecStop isn't executed so the container
isn't removed. After a reboot of a failed node, the container doesn't
start because the old container is still present in created state.
We should always try to remove the container in ExecStartPre for this
situation.
A normal reboot doesn't trigger this issue and this also doesn't affect
nodes running containers via docker.
This behaviour was introduced by d43769d.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1858865

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 47b7c00287)
2020-07-24 12:47:01 -04:00
Dimitri Savineau d5974086dd ceph-facts: remove mds_name fact
The mds_name fact always gets the ansible_hostname value so we don't
need to have a dedicated fact for this and use the ansible_hostname fact
instead.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4e84b4beed)
2020-07-24 10:50:44 -04:00
Dimitri Savineau c694454f82 ceph-handler: add missing condition on ceph-crash
The ceph-crash tasks present in the ceph-handler role don't need to be
executed on all nodes.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 18e3c7a0a2)
2020-07-22 18:47:01 -04:00
Guillaume Abrioux c0b32e4a79 crash: rm container in ExecPreStart even with docker
We should ensure the container is removed in `ExecPreStart` even when
`{{ container_binary }}` is docker.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 39bb279a53)
2020-07-22 18:47:01 -04:00
Guillaume Abrioux e6059fdcd3 ceph-crash: introduce new role ceph-crash
This commit introduces a new role `ceph-crash` in order to deploy
everything needed for the ceph-crash daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 9d2f2108e1)
2020-07-22 18:47:01 -04:00
Guillaume Abrioux 0b5a2648e3 tests: lvm_setup.yml, add carriage return
This commit adds crlf between each task.
It makes the playbook more readable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8ef9fb68bc)
2020-07-22 18:46:49 -04:00
Guillaume Abrioux a4efb521f5 tests: (lvm_setup.yml), don't shrink lvol
when rerunning lvm_setup.yml on existing cluster with OSDs already
deployed, it fails like following:

```
fatal: [osd0]: FAILED! => changed=false
  msg: Sorry, no shrinking of data-lv2 to 0 permitted.
```

because we are asking `lvol` module to create a volume on an empty VG
with size extents = `100%FREE`.

The default behavior of `lvol` is to shrink the volume if the LV's current
size is greater than the requested size.

Given the requested size is calculated like this:

`size_requested = size_percent * this_vg['free'] / 100`

in our case, it is similar to:

`size_requested = 100 * 0 / 100` which basically means `0`

So the current LV size is well greater than the requested size which
leads the module to attempt to shrink it to 0 which isn't obviously now
allowed.

Adding `shrink: false` to the module calls fixes this issue.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 218f4ae361)
2020-07-22 18:46:49 -04:00