5175 Commits (5e7962ccf6c9fa35f5611888826131c4e9e8f043)
 

Author SHA1 Message Date
Dimitri Savineau 014f51c2a4 ceph-defaults: exclude md devices from discovery
The md devices (software RAID) aren't excluded from the devices list in
the auto discovery scenario.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1764601

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-05 10:14:25 +01:00
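The md exclusion described in the commit above can be sketched roughly as follows. This is an illustration only, not the playbook's actual filter: the `MD_DEVICE` pattern and the `discoverable_devices` helper are hypothetical names, and the sketch assumes md devices are named `md` followed by digits.

```python
import re

# Assumption: software RAID devices show up as md0, md127, and so on.
MD_DEVICE = re.compile(r"^md\d+$")


def discoverable_devices(device_names):
    """Return /dev paths for every device name that is not an md device."""
    return [f"/dev/{name}" for name in device_names if not MD_DEVICE.match(name)]


print(discoverable_devices(["sda", "sdb", "md0", "md127", "nvme0n1"]))
# prints ['/dev/sda', '/dev/sdb', '/dev/nvme0n1']
```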
Dimitri Savineau 89f6cc54a2 purge-cluster: add podman support
Podman support was added to the purge-container-cluster playbook, but
containers are always used for the dashboard, even on non-containerized
deployments.
This commit adds podman support for purging the dashboard resources
in the purge-cluster playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-04 14:15:12 -05:00
Dimitri Savineau 4a6d19dae2 tests: reduce max_mds from 3 to 2
Having a max_mds value equal to the number of mds nodes generates a
warning in the ceph cluster status:

cluster:
id:     6d3e49a4-ab4d-4e03-a7d6-58913b8ec00a
health: HEALTH_WARN
        insufficient standby MDS daemons available
(...)
services:
  mds:     cephfs:3 {0=mds1=up:active,1=mds0=up:active,2=mds2=up:active}

Let's use 2 active and 1 standby mds.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-04 14:07:29 -05:00
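The arithmetic behind the max_mds change above is small enough to spell out. This sketch assumes one MDS daemon per node and a single filesystem; `standby_mds` is a hypothetical helper, not something from the test suite.

```python
def standby_mds(mds_nodes: int, max_mds: int) -> int:
    """Number of MDS daemons left as standby once max_mds ranks are active."""
    return max(mds_nodes - max_mds, 0)


print(standby_mds(3, 3))  # prints 0: no standby, which triggers the warning
print(standby_mds(3, 2))  # prints 1: two active and one standby
```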
Guillaume Abrioux f5a81b1790 purge: fix symlink to purge-container-cluster
ceph/ceph-ansible#4805 introduced a symlink to
purge-container-cluster.yml playbook which is broken.

This commit fixes it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-04 09:38:34 +01:00
Guillaume Abrioux 7bc7e3669d purge: rename playbook (container)
Since we now support podman, let's rename the playbook so it's more
generic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 11:10:21 -05:00
Guillaume Abrioux a8d76d72d7 dashboard: use fqdn url for active alert
When the short hostname is used, the active alert URL is built with it
and fails to connect to the server.

This commit changes the template in order to use the fqdn.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1765485

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 14:30:32 +01:00
Guillaume Abrioux b18476a1a6 purge: do not try to stop docker when binary is podman
If the container binary is podman, we shouldn't try to stop docker here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 13:29:52 +01:00
Guillaume Abrioux fe5ffe589e facts: isolate container_binary facts
in order to be able to call container_binary without having to run the
whole ceph-facts role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 13:29:52 +01:00
Guillaume Abrioux d23383a820 purge: remove docker_* task
All containers are removed when systemd stops them.
There is no need to call this module in purge container playbook.

This commit also removes all docker_image tasks and removes all container
images in the final cleanup play.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776736

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-03 13:29:52 +01:00
Stanley Lam ad7a5dad3f Add option for HAproxy to act as an SSL frontend termination point for load-balanced RGW instances.
Signed-off-by: Stanley Lam <stanleylam_604@hotmail.com>
2019-12-02 16:54:33 -05:00
Guillaume Abrioux a43a872105 docker2podman: import ceph-handler role
This is needed to avoid the following error:

```
ERROR! The requested handler 'restart ceph mons' was not found in either the main handlers list nor in the listening handlers list
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-02 09:11:12 -05:00
Guillaume Abrioux 7fe0d55eff docker2podman: do not hardcode group name
Let's use `client_group_name` instead of hardcoding the name.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-02 09:11:12 -05:00
Guillaume Abrioux 6526a25ab5 docker2podman: import ceph-defaults in first play
We must import this role in the first play, otherwise the first call to
`client_group_name` fails.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1777829

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-12-02 09:11:12 -05:00
Dimitri Savineau 39cfe0aa65 switch_to_containers: fix umount ceph partitions
When a container is already running on a non-containerized node, the
umount ceph partition task is skipped.
This is due to the container ps command, which always returns 0 even if
the filter matches nothing.

We should run the umount task when:
1/ the container command is failing (not installed) : rc != 0
2/ the container command reports running ceph-osd containers : rc == 0

Also we should not fail on the ceph directory listing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-02 09:19:50 +01:00
Dimitri Savineau 5bd1cf40eb ceph-osd: wait for all osds once
cf8c6a3 moved the 'wait for all osds' task from openstack_config to the
main tasks list.
But the openstack_config code was executed only on the last OSD node.
We don't need to run this check on every OSD node, so we set
run_once to true on that task.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-27 13:05:42 -05:00
Guillaume Abrioux 23b1f43897 facts: avoid duplicated element in devices list
When using `osd_auto_discovery`, `devices` is built multiple times due
to multiple runs of the `ceph-facts` role. It ends up with duplicate
instances of the same device in the list.

Using `unique` filter when building the list fixes this issue.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-27 16:35:41 +01:00
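The duplication described in the commit above, and the effect of de-duplicating while the list is built, can be sketched like this. The `unique` function here is a plain-Python stand-in written for illustration, not the Ansible filter itself, and the device paths are made up.

```python
def unique(items):
    """Drop repeated items while keeping the order of first appearance."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Two runs of the fact-gathering step append the same devices twice.
devices = []
for _ in range(2):
    devices += ["/dev/sda", "/dev/sdb"]

print(devices)          # prints ['/dev/sda', '/dev/sdb', '/dev/sda', '/dev/sdb']
print(unique(devices))  # prints ['/dev/sda', '/dev/sdb']
```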
Guillaume Abrioux cc0c1ce301 dashboard: only print dashboard url of the grafana-server node
This commit makes the ceph-dashboard role print the ceph-dashboard URL
only for the nodes present in the grafana-server group.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1762163

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-27 10:28:23 -05:00
Guillaume Abrioux 0441812959 purge/update: remove backward compatibility legacy
This was introduced in 3.1 and marked as deprecated.
We can definitely drop it in stable-4.0.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-27 10:27:43 -05:00
Dimitri Savineau 3f29b243ea tests: fix cluster health status
The current ceph cluster health is in warning state:

health: HEALTH_WARN
        13 pool(s) have no replicas configured
        2 pool(s) have non-power-of-two pg_num

Because we're using only 1 replica, we need to disable the redundancy
check.
The pool pg num should be a power of two (like 16).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-27 16:20:17 +01:00
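The power-of-two requirement on pg_num mentioned in the commit above can be checked with a one-line bit test. This is a generic sketch; `is_power_of_two` is a hypothetical helper, not part of the test configuration.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


for pg_num in (8, 12, 16):
    print(pg_num, is_power_of_two(pg_num))
# prints:
# 8 True
# 12 False
# 16 True
```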
Guillaume Abrioux f19a2aef1a Revert "tox-podman: use centos 8 vagrant image"
This reverts commit 19e9a06ab1.
2019-11-27 16:19:58 +01:00
Dimitri Savineau cf8c6a3849 ceph-osd: wait for all osd before crush rules
When creating crush rules with the device class parameter, we need to be
sure that all OSDs are up and running because the device class list is
populated with this information.
This is now enabled for all scenarios, not only openstack_config.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-27 07:43:07 +01:00
Dimitri Savineau 55adc10be3 ceph-grafana: remove ipv6 brackets on wait_for
The wait_for ansible module doesn't support brackets around an IPv6
address, so we need to remove them.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1769710

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-26 10:08:17 +01:00
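The bracket removal described in the commit above amounts to the following string handling. This is a plain-Python sketch for illustration; the role does this in its own templating, and `strip_ipv6_brackets` is a hypothetical name.

```python
def strip_ipv6_brackets(address: str) -> str:
    """Return the address without surrounding [ ] if both are present."""
    if address.startswith("[") and address.endswith("]"):
        return address[1:-1]
    return address


print(strip_ipv6_brackets("[2001:db8::10]"))  # prints 2001:db8::10
print(strip_ipv6_brackets("192.168.0.10"))    # prints 192.168.0.10
```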
Guillaume Abrioux 5353ab8a23 tests: revert vagrant_variable file name detection
This commit reverts the following change:

fcf181342a (diff-23b6f443c01ea2efcb4f36eedfea9089R7-R14)

This change is causing CI failures, so this commit is intended to unblock the CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-25 11:22:04 +01:00
Dimitri Savineau dd97353574 travis: add python 3.7 and 3.8
Add both python 3.7 and 3.8 in the travis matrix testing.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-25 09:17:19 +01:00
Guillaume Abrioux 33bfb10af9 nfs: remove legacy file
this file is provided by the packaging (nfs-ganesha) so there's no need
to maintain it in ceph-ansible

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-22 05:11:41 +01:00
Guillaume Abrioux d06158e9d9 nfs: do not run privileged nfs container
At the moment, we bind-mount the dbus socket from the host, which requires
running the container with --privileged.
Since we now run a dedicated dbus daemon inside the same container, we
can stop running privileged nfs-ganesha containers.

Related ceph-container PR : ceph/ceph-container#1517

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1725254

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-22 05:11:41 +01:00
Guillaume Abrioux c878e99589 update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-20 09:22:19 -05:00
Guillaume Abrioux 548db78b95 update: use flags noout and nodeep-scrub only
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
2019-11-20 09:22:19 -05:00
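The three steps listed in the commit above can be laid out as an ordered plan. This sketch only builds a list of strings: `ceph osd set` and `ceph osd unset` are the real CLI subcommands, while the "upgrade" and "wait" entries are placeholders for playbook work, and `upgrade_plan` is a hypothetical helper.

```python
FLAGS = ("noout", "nodeep-scrub")


def upgrade_plan(osd_nodes):
    """Return the ordered steps: set flags, upgrade node by node, unset flags."""
    plan = [f"ceph osd set {flag}" for flag in FLAGS]
    for node in osd_nodes:
        plan.append(f"upgrade {node}")              # placeholder step
        plan.append("wait for active+clean pgs")    # placeholder step
    plan += [f"ceph osd unset {flag}" for flag in FLAGS]
    return plan


for step in upgrade_plan(["osd0", "osd1"]):
    print(step)
```

The flags are set once for the whole run and unset only after the last node, which matches the ordering in the commit message.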
Dimitri Savineau 19e9a06ab1 tox-podman: use centos 8 vagrant image
Switch the podman scenario from atomic centos 7 to centos 8 (not atomic)

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-20 10:34:34 +01:00
VasishtaShastry 72c43cc5d9 Fixes failure of cephfs configuration using --limit
Configuring cephfs on an existing cluster using --limit used to fail
at two different tasks when running site-docker.yml.
This commit addresses both of those tasks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1773489
Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
2019-11-18 16:44:47 +01:00
Dimitri Savineau d7fd769b6d container: add always tag on gather fact tasks
If we execute the site-container.yml playbook with specific tags (like
ceph_update_config), we need to be sure to gather the facts, otherwise
we will see an error like:

The task includes an option with an undefined variable. The error was:
'ansible_hostname' is undefined

This commit also adds missing 'gather_facts: false' to mons plays.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1754432

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-14 11:50:24 -05:00
Dimitri Savineau ef2cb99f73 ceph-osd: add device class to crush rules
This adds device class support to crush rules when using the class key
in the rule dict via the create-replicated sub command.
If the class key isn't specified then we use the create-simple sub
command for backward compatibility.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1636508

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-14 16:25:46 +01:00
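The sub-command selection described in the commit above can be sketched as a small function. The `class` key comes from the commit message; the other rule keys (`name`, `root`, `type`) and the `crush_rule_command` helper are assumptions made for this example, not the role's actual task.

```python
def crush_rule_command(rule):
    """Pick create-replicated when a device class is given, else create-simple."""
    base = ["ceph", "osd", "crush", "rule"]
    if "class" in rule:
        return base + ["create-replicated", rule["name"], rule["root"],
                       rule["type"], rule["class"]]
    # No class key: keep the older sub-command for backward compatibility.
    return base + ["create-simple", rule["name"], rule["root"], rule["type"]]


print(" ".join(crush_rule_command(
    {"name": "fast", "root": "default", "type": "host", "class": "ssd"})))
# prints ceph osd crush rule create-replicated fast default host ssd
print(" ".join(crush_rule_command(
    {"name": "legacy", "root": "default", "type": "host"})))
# prints ceph osd crush rule create-simple legacy default host
```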
Dimitri Savineau ed36a11eab move crush rule creation from mon to osd role
If we want to create crush rules with the create-replicated sub command
and device class then we need to have the OSD created before the crush
rules otherwise the device classes won't exist.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-14 16:25:46 +01:00
Dimitri Savineau 3e29b8d5ff ceph-defaults: pin prometheus container tags
In addition to the grafana container tag change, we need to do the same
for the prometheus container stack based on the release present in the
OSE 4.1 container image.

$ docker run --rm openshift4/ose-prometheus-node-exporter:v4.1 --version
node_exporter, version 0.17.0
  build user:       root@67fee13ed48f
  build date:       20191023-14:38:12
  go version:       go1.11.13
$ docker run --rm openshift4/ose-prometheus-alertmanager:4.1 --version
alertmanager, version 0.16.2
  build user:       root@70b79a3f29b6
  build date:       20191023-14:57:30
  go version:       go1.11.13
$ docker run --rm openshift4/ose-prometheus:4.1 --version
prometheus, version 2.7.2
  build user:       root@12da054778a3
  build date:       20191023-14:39:36
  go version:       go1.11.13

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-14 16:11:14 +01:00
VasishtaShastry 9a1f1626c3 Evades validation of ceph_repository_type in containerized scenario
This will prevent failure of site-docker.yml with configs in doc.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1769760

Signed-off-by: VasishtaShastry <vipin.indiasmg@gmail.com>
Co-Authored-By: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-14 15:53:22 +01:00
Guillaume Abrioux b717b5f736 ceph_key: restore file mode after a key is fetched
When `import_key` is enabled and the key already exists, it is only
fetched using the ceph cli. If the mode specified in the `ceph_key` task is
different from what is applied by the ceph cli, the mode isn't restored because
we don't call `module.set_fs_attributes_if_different()` before
`module.exit_json(**result)`.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1734513

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-14 14:58:37 +01:00
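The idea of restoring the requested mode before returning, as described in the commit above, can be shown with the standard library alone. This is a stand-in for illustration: it does not use the Ansible module API, and `restore_mode_if_different` is a hypothetical helper.

```python
import os
import stat
import tempfile


def restore_mode_if_different(path, wanted_mode):
    """Apply wanted_mode to path if it differs; return True when changed."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current != wanted_mode:
        os.chmod(path, wanted_mode)
        return True
    return False


# Simulate a key file written by another tool with a looser mode.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    key_path = handle.name
os.chmod(key_path, 0o644)

print(restore_mode_if_different(key_path, 0o600))  # mode differed, so it is reset
print(restore_mode_if_different(key_path, 0o600))  # already correct, nothing to do
os.unlink(key_path)
```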
Guillaume Abrioux 16bcef4f28 tests: add time command in vagrant_up.sh
monitor how long it takes to get all VMs up and running

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 15:47:46 +01:00
Guillaume Abrioux 1a5d32dda5 tests: remove legacy in tox-update.ini
This variable isn't used in tox-update.ini so this commit removes it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:27:28 -05:00
Guillaume Abrioux edb8629bdb tests: upgrade from nautilus to octopus in master
test upgrades from nautilus to octopus instead of mimic to octopus.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:27:28 -05:00
Guillaume Abrioux 206ee589d6 update: reset flags before and after each osd node upgrade
Even with the osd flags `noout` and `norebalance` set, the PG states can
change at some point, depending on the amount of data written in the
meantime. When that happens, the check for PG states fails.

This commit changes the way we set those flags:
we set them before an OSD node upgrade and unset them before the PGs
state check so they can recover.

Fixes: #3961

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:10:52 -05:00
Javier Pena 19a43ff261 Fixes for Makefile
- Set default mock configuration to epel-8-x86_64, to match the
  default dist value.
- Add support for alpha tags, like the recently added v5.0.0alpha1

Signed-off-by: Javier Pena <jpena@redhat.com>
2019-11-08 09:09:30 -05:00
Guillaume Abrioux db77fbda15 tests: add coverage on purge playbook
This commit adds a playbook to be played before we run the purge playbook.
It first creates an rbd image, then maps an rbd device on client0 so the
purge playbook will try to unmap it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:06:11 -05:00
Guillaume Abrioux 3cfcc7a105 purge: use sysfs to unmap rbd devices
In a containerized context, using the binary provided in atomic os won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an idea, but for large clusters with hundreds
of client nodes, that would require pulling the image on each of them
just to unmap the rbd devices.

Let's use the sysfs method in order to avoid any issue related to ceph
version that is shipped on the host.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-11-08 09:06:11 -05:00
Dimitri Savineau 4a065cebd7 ceph-validate: add rbdmirror validation
When ceph_rbd_mirror_configure is set to true we need to ensure that
the required variables aren't empty.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1760553

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-07 08:57:43 -05:00
Dimitri Savineau 60cbfdc2a6 ceph-handler: Use /proc/net/unix for rgw socket
If for some reason, there's an old rgw socket file present in the
/var/run/ceph/ directory then the test command could fail with

test: xxxxxxxxx.asok: binary operator expected

$ ls -hl /var/run/ceph/
total 0
srwxr-xr-x. ceph-client.rgw.rgw0.rgw0.68.94153614631472.asok
srwxr-xr-x. ceph-client.rgw.rgw0.rgw0.68.94240997655088.asok

We can check the radosgw socket in /proc/net/unix to avoid using a wildcard
in the socket name.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-07 14:41:11 +01:00
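The /proc/net/unix check described in the commit above can be sketched as follows. The function takes the file content as a string so it can be tried without a running radosgw; the sample content, the matching rule, and the `rgw_socket_listed` name are assumptions for this example, not the handler's actual command.

```python
def rgw_socket_listed(proc_net_unix: str, socket_dir: str = "/var/run/ceph/") -> bool:
    """True when an rgw .asok socket under socket_dir appears in the table."""
    for line in proc_net_unix.splitlines():
        fields = line.split()
        # The path is the 8th column and is missing for unnamed sockets.
        if len(fields) < 8:
            continue
        path = fields[7]
        if path.startswith(socket_dir) and "rgw" in path and path.endswith(".asok"):
            return True
    return False


sample = "\n".join([
    "Num       RefCount Protocol Flags    Type St Inode Path",
    "0000000000000000: 00000002 00000000 00010000 0001 01 23456 "
    "/var/run/ceph/ceph-client.rgw.rgw0.rgw0.68.94240997655088.asok",
    "0000000000000000: 00000002 00000000 00000000 0002 01 34567",
])
print(rgw_socket_listed(sample))  # prints True
```

Only sockets that are currently open appear in this table, so a stale socket file left on disk does not affect the result.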
Dimitri Savineau 34b03d1873 add-{mon,osd}: run raw install python tasks
If the new mon/osd node doesn't have python installed then we need to
execute the tasks from raw_install_python.yml.

Closes: #4368

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-07 14:04:26 +01:00
Dimitri Savineau ece46d33be ceph-osd: fix fs.aio-max-nr sysctl condition
[1] introduced a regression on the fs.aio-max-nr sysctl value condition.
The enable key isn't a boolean but a string because the expression isn't
evaluated.
The string "(osd_objectstore == 'bluestore')" is always true
because the item.enable condition matches any non-empty string. So the
sysctl value was applied for both the filestore and bluestore backends.

[2] added the bool filter to the condition, but the filter always returns
false on such a string, so the sysctl wasn't applied at all.

This commit fixes the enable key value by evaluating the value instead
of using the string.

[1] https://github.com/ceph/ceph-ansible/commit/08a2b58
[2] https://github.com/ceph/ceph-ansible/commit/ab54fe2

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-07 13:51:48 +01:00
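The two failure modes described in the commit above can be reproduced in plain Python. `ansible_like_bool` is a simplified stand-in for the `bool` filter written for this example, not the filter's real implementation.

```python
def ansible_like_bool(value):
    """Simplified stand-in: only a few literal strings count as true."""
    if isinstance(value, bool):
        return value
    return str(value).lower() in ("yes", "on", "1", "true")


# The expression was never evaluated, so 'enable' is just text.
enable = "(osd_objectstore == 'bluestore')"
print(bool(enable))               # prints True: any non-empty string is truthy
print(ansible_like_bool(enable))  # prints False: the text is not a true literal

# Evaluating the expression gives a real boolean that follows the backend.
osd_objectstore = "filestore"
print(osd_objectstore == "bluestore")  # prints False
```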
Dimitri Savineau 02df2ab5ea tests/requirements: bump testinfra and pytest
Starting with testinfra 3.1, the ansible ssh connections use the ssh
backend instead of paramiko, along with persistent connections.
pytest 4.6 is the last release to support python 2.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-11-04 09:09:49 -05:00
Dimitri Savineau 2037fb87b6 ceph-defaults: pin grafana container tag to 5.2.4
The latest grafana container tag uses the grafana 6.x release, which could
cause issues with the ceph dashboard integration.
Considering that the grafana container in RHCS 3 is based on 5.x, we
should use the same version.

$ docker run --rm rhceph/rhceph-3-dashboard-rhel7:3 -v
Version 5.2.4 (commit: unknown-dev)

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-31 18:44:51 -04:00
Dimitri Savineau 9a996aef7f ceph-osd: Remove ulimit nofile on container start
Although this improves ceph-disk/ceph-volume performance, it also
impacts the ceph-osd process.
The ceph-osd process shouldn't use the 1024:4096 value for the max open
files.
This commit removes the ulimit option from the container engine; this kind
of change is done on the container side instead [1].

[1] https://github.com/ceph/ceph-container/pull/1497

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1702285

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-10-31 10:42:09 -04:00