Commit Graph

5185 Commits (84d9dd3c418a0c6d6a0cdccbe3b2ea3dada2b561)
 

Author SHA1 Message Date
Dimitri Savineau 9d4f90c8b4 Add ansible 2.9 support
This commit adds ansible 2.9 support in addition of 2.8.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit aefba82a2e)
2020-03-03 10:31:48 +01:00
Dimitri Savineau 8cc2f8f21e ceph-validate: start with ansible version test
It doesn't make sense to start validating configuration if the ansible
version isn't the good one.
This commit moves the check_system task as the first task in the
ceph-validate role.
The ansible version test tasks are moved at the top of this file.
Also moving the iscsi kernel tests from check_system to check_iscsi
file.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1a77dd7e91)
2020-03-03 10:31:48 +01:00
Guillaume Abrioux 5a51bd12dc common: support OSDs with more than 2 digits
When running environment with OSDs having ID with more than 2 digits,
some tasks don't match the system units and therefore, playbook can fail.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1805643

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a084a2a347)
2020-02-28 11:06:47 -05:00
Guillaume Abrioux 96b7857347 requirements: enforce ansible version requirement
See https://github.com/advisories/GHSA-3m93-m4q6-mc6v

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit a2d2e70ac2)
2020-02-27 09:56:55 -05:00
Guillaume Abrioux d254a8b938 shrink-osd: support shrinking ceph-disk prepared osds
This commit adds the ceph-disk prepared osds support

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796453

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 1de2bf9991)
2020-02-26 18:16:48 +01:00
Guillaume Abrioux 21851457d6 shrink-osd: don't run ceph-facts entirely
We need to call ceph-facts only for setting `container_binary`.
Since this task has been isolated we can use `tasks_from` to only execute the
needed task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 55970b18f1)
2020-02-26 18:16:48 +01:00
Benoît Knecht 10b3bb2727 infrastructure-playbooks: Run shrink-osd tasks on monitor
Instead of running shring-osd tasks on localhost and delegating most of
them to the first monitor, run all of them on the first monitor
directly.

This has the added advantage of becoming root on the monitor only, not
on localhost.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 8b3df4e418)
2020-02-24 16:51:33 -05:00
Francesco Pantano f5e2a69134 Configure ceph dashboard backend and dashboard_frontend_vip
This change introduces a new set of tasks to configure the
ceph dashboard backend and listen just on the mgr related
subnet (and not on '*'). For the same reason the proper
server address is added in both prometheus and alertmanger
systemd units.
This patch also adds the "dashboard_frontend_vip" parameter
to make sure we're able to support the HA model when multiple
grafana instances are deployed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792230
Signed-off-by: Francesco Pantano <fpantano@redhat.com>
(cherry picked from commit 15ed9eebf1)
2020-02-24 16:50:19 -05:00
Dimitri Savineau ca2003fbcc ceph-rgw: increase connection timeout to 10
5s as a connection timeout could be low in some setup. Let's increase
it to 10s.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 44e750ee5d)
2020-02-24 14:41:19 -05:00
Dimitri Savineau 3617543517 containers: add KillMode=none to systemd templates
Because we are relying on docker|podman for managing containers then we
don't need systemd to manage the process (like kill).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5a03e0ee1c)
2020-02-18 12:10:35 -05:00
Ali Maredia 7d2a217270 rgw: extend automatic rgw pool creation capability
Add support for erasure code pools.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1731148

Signed-off-by: Ali Maredia <amaredia@redhat.com>
Co-authored-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1834c1e48d)
2020-02-17 17:44:53 -05:00
Florian Faltermeier 17b405eb10 ceph-rgw-loadbalancer: Fix SSL newline issue
The ad7a5da commit introduced a regression when using TLS on haproxy
via the haproxy_frontend_ssl_certificate variable.
This cause the "stats socket" and the "tune.ssl.default-dh-param"
parameters to be on the same line resulting haproxy failing to start.

[ALERT] 351/140240 (21388) : parsing [xxxxx] : 'stats socket' : unknown
keyword 'tune.ssl.default-dh-param'. Registered
[ALERT] 351/140240 (21388) : Fatal errors found in configuration.

Fixes: #4869

Signed-off-by: Florian Faltermeier <florian.faltermeier@uibk.ac.at>
(cherry picked from commit 9d081e2453)
2020-02-17 11:36:09 -05:00
Dimitri Savineau df43f32248 ceph-defaults: remove bootstrap_dirs_xxx vars
Both bootstrap_dirs_owner and bootstrap_dirs_group variables aren't
used anymore in the code.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c644ea9041)
2020-02-17 11:15:33 -05:00
Dimitri Savineau 0deb5b0706 rgw: don't create user on secondary zones
The rgw user creation for the Ceph dashboard integration shouldn't be
created on secondary rgw zones.

Closes: #4707
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794351

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 16e12bf2bb)
2020-02-17 16:43:56 +01:00
Dimitri Savineau 5f7778d59a ceph-{mon,osd}: move default crush variables
Since ed36a11 we move the crush rules creation code from the ceph-mon to
the ceph-osd role.
To keep the backward compatibility we kept the possibility to set the
crush variables on the mons side but we didn't move the default values.
As a result, when using crush_rule_config set to true and wanted to use
the default values for crush_rules then the crush rule ansible task
creation will fail.

"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"

This patch move the default crush variables from ceph-mon to ceph-osd
role but also use those default values when nothing is defined on the
mons side.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798864

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1fc6b33714)
2020-02-17 10:18:56 -05:00
Dimitri Savineau f78981bbf7 ceph-grafana: fix grafana_{crt,key} condition
The grafana_{crt,key} aren't boolean variables but strings. The default
value is an empty string so we should do the conditional on the string
length instead of the bool filter

Closes: #5053

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 15bd4cd189)
2020-02-17 10:18:39 -05:00
Dimitri Savineau a9a533e398 ceph-prometheus: add alertmanager HA config
When using multiple alertmanager nodes (via the grafana-server group)
then we need to specify the other peers in the configuration.

https://prometheus.io/docs/alerting/alertmanager/#high-availability
https://github.com/prometheus/alertmanager#high-availability

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792225

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit b9d975385c)
2020-02-17 16:18:20 +01:00
John Fulton 9c97179fc1 The _filtered_clients list should intersect with ansible_play_batch
Client configuration with --limit fails without this patch
because certain tasks are only done to the first host in the
_filtered_clients list and it's likely that first host will
not be included in what's sepcified with --limit. To fix this
the _filtered_clients list should be built from all clients
in the inventory that are also in the running play.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798781

Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit e4bf4857f5)
2020-02-17 10:03:57 -05:00
Dimitri Savineau b01b255414 ceph-nfs: add nfs-ganesha-rados-urls package
Since nfs-ganesha 2.8.3 the rados-urls library has been move to a
dedicated package.
We don't have the same nfs-ganesha 2.8.x between the community and rhcs
repositories.

community: 2.8.1
rhcs: 2.8.3

As a workaround we will install that package only for rhcs setup.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 0a3e85e8ca)
2020-02-17 10:00:44 -05:00
Dimitri Savineau 6864d04fdf ceph-nfs: fix ceph_nfs_ceph_user variable
The ceph_nfs_ceph_user variable is a string for the ceph-nfs role but a
list in ceph-client role.
6a6785b introduced a confusion between both variable type in the ceph-nfs
role for external ceph with ganesha.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1801319

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 10951eeea8)
2020-02-17 15:27:30 +01:00
Dimitri Savineau e4e1b386b0 dashboard: allow configuring multiple grafana host
When using multiple grafana hosts then we push set the grafana and
prometheus URL and push the dashboard layout to a single node.

grafana_server_addrs is the list of all grafana nodes and used during
the ceph-dashboard role (on mgr/mon nodes).
grafana_server_addr is the current grafana node used during the
ceph-grafana and ceph-prometheus role (on grafana-server nodes).

We don't have the grafana_server_addr fact duplication code between
external vs collocated nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1784011

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit c6e96699f7)
2020-02-12 19:56:31 -05:00
Guillaume Abrioux 1d2a395aaf switch_to_containers: increase health check values
This commit increases the default values for the following variable
consumed in switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook.
This also moves these variables in `ceph-defaults` role so the user can
set different values if needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783223

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3700aa5385)
2020-02-10 12:57:17 -05:00
Guillaume Abrioux cdc3e10cf3 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecation
We can definitely drop it in stable-4.0

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 0441812959)
2020-02-03 09:33:05 -05:00
Stanley Lam 0336a1476f Add option for HAproxy to act a SSL frontend termination point for loadbalanced RGW instances.
Signed-off-by: Stanley Lam <stanleylam_604@hotmail.com>
(cherry picked from commit ad7a5dad3f)
2020-02-03 09:32:43 -05:00
Guillaume Abrioux 5c3ba0787c switch_to_containers: exclude clients nodes from facts gathering
just like site.yml and rolling_update, let's exclude clients node from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
2020-02-03 09:32:20 -05:00
Dimitri Savineau 0dbca448d1 ceph-handler: Use /proc/net/unix for rgw socket
If for some reason, there's an old rgw socket file present in the
/var/run/ceph/ directory then the test command could fail with

test: xxxxxxxxx.asok: binary operator expected

$ ls -hl /var/run/ceph/
total 0
srwxr-xr-x. ceph-client.rgw.rgw0.rgw0.68.94153614631472.asok
srwxr-xr-x. ceph-client.rgw.rgw0.rgw0.68.94240997655088.asok

We can check the radosgw socket in /proc/net/unix to avoid using wildcard
in the socket name.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 60cbfdc2a6)
2020-02-03 09:31:52 -05:00
Dimitri Savineau 487be2675a filestore-to-bluestore: skip bluestore osd nodes
If the OSD node is already using bluestore OSDs then we should skip
all the remaining tasks to avoid purging OSD for nothing.
Instead we warn the user.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1790472

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 83c5a1d7a8)
2020-02-03 15:16:51 +01:00
Dimitri Savineau 460d3557d7 ceph-container-engine: lvm2 on OSD nodes only
Since de8f2a9 the lvm2 package installation has been moved from ceph-osd
role to ceph-container-engine role.
But the scope wasn't limited to the OSD nodes only.
This commit fixes this behaviour.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit fa8aa8c864)
2020-02-03 15:16:32 +01:00
Guillaume Abrioux 675b6788f4 update: remove legacy tasks
These tasks should have been removed with backport #4756

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793564

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-02-03 15:16:13 +01:00
Dimitri Savineau 80f1b0feb0 ceph-common: rhcs 4 repositories for rhel 7
RHCS 4 is available for both RHEL 7 and 8 so we should also enable the
cdn repositories for that distribution.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1796853

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 9b40a959b9)
2020-02-03 15:15:35 +01:00
Mike Christie 76753e64f9 iscsi: Fix crashes during rolling update
During a rolling update we will run the ceph iscsigw tasks that start
the daemons then run the configure_iscsi.yml tasks which can create
iscsi objects like targets, disks, clients, etc. The problem is that
once the daemons are started they will accept confifguration requests,
or may want to update the system themself. Those operations can then
conflict with the configure_iscsi.yml tasks that setup objects and we
can end up in crashes due to the kernel being in a unsupported state.

This could also happen during creation, but is less likely due to no
objects being setup yet, so there are no watchers or users accessing the
gws yet. The fix in this patch works for both update and initial setup.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1795806

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 77f3b5d51b)
2020-02-03 15:15:15 +01:00
wujie1993 dcd4b2955a purge: fix purge cluster failed
Fix purge cluster failed when local container images does not exist.

Purge node-exporter and grafana-server only when dashboard_enabled is set to True.

Signed-off-by: wujie1993 qq594jj@gmail.com
(cherry picked from commit d8b0b3cbd9)
2020-02-03 15:14:56 +01:00
Guillaume Abrioux 1b33c5358e config: fix external client scenario
When no monitor group is present in the inventory, this task fails.
This affects only non-containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e7bc079405)
2020-01-31 13:37:10 +01:00
Guillaume Abrioux e3cd719ebe tests: add external_clients scenario
This commit adds a new 'external ceph clients' scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 641729357e)
2020-01-31 13:37:10 +01:00
Dimitri Savineau 3daea719b6 ceph-defaults: remove rgw from ceph_conf_overrides
The [rgw] section in the ceph.conf file or via the ceph_conf_overrides
variable doesn't exist and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794552

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2f07b85131)
2020-01-29 14:34:34 +01:00
Guillaume Abrioux bc6777c6df dashboard: add quotes when passing password to the CLI
Otherwise, if the variables contains a '$' it will be interpreted as a BASH
variable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8c3759f8ce)
2020-01-29 14:15:41 +01:00
Guillaume Abrioux 2e7d7b70ed tests: set dashboard|grafana_admin_password
Set these 2 variables in all test scenarios where `dashboard_enabled` is
`True`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c040199c8f)
2020-01-29 14:15:41 +01:00
Guillaume Abrioux 8a907cb1ca validate: fail if dashboard|grafana_admin_password aren't set
This commit adds a task to make sure user set a custom password for
`grafana_admin_password` and `dashboard_admin_password` variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1795509

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 99328545de)
2020-01-29 14:15:41 +01:00
Dimitri Savineau 9da917501b ceph-facts: fix _container_exec_cmd fact value
When using different name between the inventory_hostname and the
ansible_hostname then the _container_exec_cmd fact will get a wrong
value based on the inventory_hostname instead of the ansible_hostname.
This happens when the ceph cluster is already running (update/upgrade).

Later the container exec commands will fail because the container name
is wrong.

We should always set the _container_exec_cmd based on the
ansible_hostname fact.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1795792

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1fcafffdad)
2020-01-29 11:48:44 +01:00
Dimitri Savineau 44eadcd05e tox: set extras vars for filestore-to-bluestore
The ansible extra variables aren't set with the ansible-playbook
command running the filestore-to-bluestore playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a27290bf98)
2020-01-28 22:21:49 -05:00
Dimitri Savineau f982a70f02 filestore-to-bluestore: fix undefine osd_fsid_list
If the playbook is used on a host running bluestore OSDs then the
osd_fsid_list won't be filled because the bluestore OSDs are reported
with 'type: block' via ceph-volume lvm list command but we are looking
for 'type: data' (filestore).

TASK [zap ceph-volume prepared OSDs] *********
fatal: [xxxxx]: FAILED! =>
  msg: '''osd_fsid_list'' is undefined

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cd76054f76)
2020-01-28 22:21:49 -05:00
Guillaume Abrioux d5dca5087a tests: add 'all_in_one' scenario
Add new scenario 'all_in_one' in order to catch more collocated related
issues.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3e7dbb4b16)
2020-01-27 17:54:39 -05:00
Guillaume Abrioux 0d2af6ebf3 fix calls to `container_exec_cmd` in ceph-osd role
We must call `container_exec_cmd` from the right monitor node otherwise
the value of the fact might mistmatch between the delegated node and the
node being played.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794900

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2f919f8971)
2020-01-27 17:54:39 -05:00
Dimitri Savineau 0a2927ce5e filestore-to-bluestore: don't fail when with no PV
When the PV is already removed from the devices then we should not fail
to avoid errors like:

stderr: No PV found on device /dev/sdb.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit a9c2300545)
2020-01-24 16:14:47 -05:00
Guillaume Abrioux 9fb69e13ed handler: read container_exec_cmd value from first mon
Given that we delegate to the first monitor, we must read the value of
`container_exec_cmd` from this node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792320

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit eb9112d8fb)
2020-01-23 18:34:14 +01:00
Vytenis Sabaliauskas 4152a1a862 ceph-facts: Fix for 'running_mon is undefined' error, so that
fact 'running_mon' is set once 'grep' successfully exits with 'rc == 0'

Signed-off-by: Vytenis Sabaliauskas <vytenis.sabaliauskas@protonmail.com>
(cherry picked from commit ed1eaa1f38)
2020-01-23 11:24:24 -05:00
Dimitri Savineau add4089e30 site-container: don't skip ceph-container-common
On HCI environment the OSD and Client nodes are collocated. Because we
aren't running the ceph-container-common role on the client nodes except
the first one (for keyring purpose) then the ceph-role execution fails
due to undefined variables.

Closes: #4970
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794195

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 671b1aba3c)
2020-01-23 16:27:31 +01:00
Guillaume Abrioux fd217d9f08 rolling_update: support upgrading 3.x + ceph-metrics on a dedicated node
When upgrading from RHCS 3.x where ceph-metrics was deployed on a
dedicated node to RHCS 4.0, it fails like following:

```
fatal: [magna005]: FAILED! => changed=false
  gid: 0
  group: root
  mode: '0755'
  msg: 'chown failed: failed to look up user ceph'
  owner: root
  path: /etc/ceph
  secontext: unconfined_u:object_r:etc_t:s0
  size: 4096
  state: directory
  uid: 0
```

because we are trying to run `ceph-config` on this node, it doesn't make
sense so we should simply run this play on all groups except
`[grafana-server]`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1793885

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e5812fe45b)
2020-01-22 18:28:54 +01:00
Dimitri Savineau 0abea70e29 filestore-to-bluestore: fix osd_auto_discovery
When osd_auto_discovery is set then we need to refresh the
ansible_devices fact between after the filestore OSD purge
otherwise the devices fact won't be populated.
Also remove the gpt header on ceph_disk_osds_devices because
the devices is empty at this point for osd_auto_discovery.
Adding the bool filter when needed.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1729267

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit bb3eae0c80)
2020-01-22 10:06:17 +01:00
Dimitri Savineau e4965e9ea9 filestore-to-bluestore: --destroy with raw devices
We still need --destroy when using a raw device otherwise we won't be
able to recreate the lvm stack on that device with bluestore.

Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-bdc67a84-894a-4687-b43f-bcd76317580a /dev/sdd
 stderr: Physical volume '/dev/sdd' is already in volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  Unable to add physical volume '/dev/sdd' to volume group 'ceph-b7801d50-e827-4857-95ec-3291ad6f0151'
  /dev/sdd: physical volume not initialized.
--> Was unable to complete a new OSD, will rollback changes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1792227

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit f995b079a6)
2020-01-21 18:26:55 +01:00