- clean some leftover.
- move nfs_ganesha_[stable|dev] in group_vars so dev_setup.yml can modify them.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
To address this warning:
```
[DEPRECATION WARNING]: evaluating nfs_ganesha_dev as a bare variable, this
behaviour will go away and you might need to add |bool to the expression in the
future
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Alphabetized ceph_repository_uca keys due to errors validating when
using UCA/queens repository on Ubuntu 16.04
An exception occurred during task execution. To see the full
traceback, use -vvv. The error was:
SchemaError: -> ceph_stable_repo_uca schema item is not
alphabetically ordered
Closes: #4154
Signed-off-by: Gabriel Ramirez <gabrielramirez1109@gmail.com>
This tries to first unmount any cephfs/nfs-ganesha mount point on client
nodes, then unmap any mapped rbd devices and finally it tries to remove
ceph kernel modules.
If it fails it means some resources are still busy and should be cleaned
manually before continuing to purge the cluster.
This is done early in the playbook so the cluster stays untouched until
everything is ready for that operation, otherwise if you try to redeploy
a cluster it could end up by getting confused by leftover from previous
deployment.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1337915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There's two big issues with the current OSD restart script.
1/ We try to test if the ceph osd daemon socket exists but we use a
wildcard for the socket name : /var/run/ceph/*.asok.
This fails because we usually have multiple ceph osd sockets (or
other ceph daemon collocated) present in /var/run/ceph directory.
Currently the test fails with:
bash: line xxx: [: too many arguments
But it doesn't stop the script execution.
Instead we can specify the full ceph osd socket name because we
already know the OSD id.
2/ The container filter pattern is wrong and could matches multiple
containers resulting the script to fail.
We use the filter with two different patterns. One is with the device
name (sda, sdb, ..) and the other one is with the OSD id (ceph-osd-0,
ceph-osd-15, ..).
In both case we could match more than needed.
$ docker container ls
CONTAINER ID IMAGE NAMES
958121a7cc7d ceph-daemon:latest ceph-osd-strg0-sda
589a982d43b5 ceph-daemon:latest ceph-osd-strg0-sdb
46c7240d71f3 ceph-daemon:latest ceph-osd-strg0-sdaa
877985ec3aca ceph-daemon:latest ceph-osd-strg0-sdab
$ docker container ls -q -f "name=sda"
958121a7cc7d
46c7240d71f3
877985ec3aca
$ docker container ls
CONTAINER ID IMAGE NAMES
2db399b3ee85 ceph-daemon:latest ceph-osd-5
099dc13f08f1 ceph-daemon:latest ceph-osd-13
5d0c2fe8f121 ceph-daemon:latest ceph-osd-17
d6c7b89db1d1 ceph-daemon:latest ceph-osd-1
$ docker container ls -q -f "name=ceph-osd-1"
099dc13f08f1
5d0c2fe8f121
d6c7b89db1d1
Adding an extra '$' character at the end of the pattern solves the
problem.
Finally removing the get_container_osd_id function because it's not
used in the script at all.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ansible_lsb fact is based on the lsb package (lsb-base,
lsb-release or redhat-lsb-core).
If the package isn't installed on the remote host then the fact isn't
populated.
--------
"ansible_lsb": {},
--------
Switching to the ansible_distribution_release fact instead.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
3a100cfa52 introduced a check which is a
bit too restrictive, let's accept HEALTH_OK and HEALTH_WARN.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
As per bz1718981, this commit adds higher values to check
the quorum status. This is helpful for several OSP deployments
that fail during the scale up.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1718981
Signed-off-by: fpantano <fpantano@redhat.com>
The ceph-volume lvm list command takes ages to complete when having
a lot of LV devices on containerized deployment.
For instance, with 25 OSDs on a node it takes 3 mins 44s to list the
OSD.
Adding the max open files limit to the container engine cli when
executing the ceph-volume command seems to improve a lot thee
execution time ~30s.
This was impacting the OSDs creation with ceph-volume (both filestore
and bluestore) when using multiple LV devices.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1702285
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We already set the become flag to true at a play level in the site*
playbooks so we don't need to set it at a task level.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
`parted_results` isn't used anymore in the playbook.
By the way, `parted` seems to cause issue because it changes the
ownership on devices:
```
root@osd0 ~]# ls -l /dev/sdc*
brw-rw----. 1 root disk 8, 32 Jun 11 08:53 /dev/sdc
brw-rw----. 1 ceph ceph 8, 33 Jun 11 08:53 /dev/sdc1
brw-rw----. 1 ceph ceph 8, 34 Jun 11 08:53 /dev/sdc2
[root@osd0 ~]# parted -s /dev/sdc print
Model: ATA QEMU HARDDISK (scsi)
Disk /dev/sdc: 53.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 1075MB 1074MB ceph block.db
2 1075MB 2149MB 1074MB ceph block.db
[root@osd0 ~]# #We can see ownerships have changed from ceph:ceph to root:disk:
[root@osd0 ~]# ls -l /dev/sdc*
brw-rw----. 1 root disk 8, 32 Jun 11 08:57 /dev/sdc
brw-rw----. 1 root disk 8, 33 Jun 11 08:57 /dev/sdc1
brw-rw----. 1 root disk 8, 34 Jun 11 08:57 /dev/sdc2
[root@osd0 ~]#
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
starting an upgrade if the cluster isn't HEALTH_OK isn't a good idea.
Let's check for the cluster status before trying to upgrade.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Otherwise it fails like following:
```
fatal: [mon0]: FAILED! => changed=false
msg: |-
Unable to enable service ceph-mgr@mon0: Failed to execute operation: Cannot send after transport endpoint shutdown
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commits adds the support of the installer phase for dashboard,
grafana and node-exporter roles.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The ceph restapi configuration was only available until Luminous
release so we don't need those leftovers for nautilus+.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
ceph-dashboard should be deployed on either a dedicated mgr node or a
mon if they are collocated.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Because we're using vagrant, a ssh config file will be created for
each nodes with options like user, host, port, identity, etc...
But via tox we're override ANSIBLE_SSH_ARGS to use this file. This
remove the default value set in ansible.cfg.
Also adding PreferredAuthentications=publickey because CentOS/RHEL
servers are configured with GSSAPIAuthenticationis enabled for ssh
server forcing the client to make a PTR DNS query.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
CI is facing issues where docker pull reach the timeout, let's increase
this to avoid CI failures.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since timesyncd is not available on RHEL-based OSs, change the default
to chronyd for RHEL-based OSs. Also, chronyd is chrony on Ubuntu, so
set the Ansible fact accordingly.
Fixes: https://github.com/ceph/ceph-ansible/issues/3628
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Ubuntu-based CI jobs often fail with error code 404 while installing
NTP daemons. Updating cache beforehand should fix the issue.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
The definitions of cephfs pools should match openstack pools.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Co-Authored-by: Simone Caronni <simone.caronni@teralytics.net>
if we don't assign the rbd application tag on this pool,
the cluster will get `HEALTH_WARN` state like following:
```
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
application not enabled on pool 'rbd'
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We're using fuser command to see if a process is using a ceph unix
socket file. But the fuser command runs through every PID present in
/proc/<PID> to see if one of them is using the file.
On a system running thousands processes, the fuser command can take
a long time to finish.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1717011
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Add a variable to support the allow_embedding support.
See ceph/ceph-ansible/issues/4084 for details.
Fixes: #4084
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This setting must be set to something resolvable.
See: ceph/ceph-ansible/issues/4085 for details
Fixes: #4085
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
069076b introduced a bug in the systemd unit script template. This
commit fixes the options used by the node-exporter container.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Instead of using the modprobe command from the path in the systemd
unit script, we can use the modprobe ansible module.
That way we don't have to manage the binary path based on the linux
distribution.
Resolves: #4072
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Few fixes on systemd unit templates for node_exporter and
alertmanager container parameters.
Added the ability to use a dedicated instance to deploy the
dashboard components (prometheus and grafana).
This commit also introduces the grafana_group_name variable
to refer grafana group and keep consistency with the other
groups.
During the integration with TripleO some grafana/prometheus
template variables resulted undefined. This commit adds the
ability to check if the group exist and create, accordingly,
different job groups in prometheus template.
Signed-off-by: fmount <fpantano@redhat.com>
This can be seen as a regression for customers who were used to deploy
in offline environment with custom repositories.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1673254
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When using podman, the systemd unit scripts don't have a dependency
on the network. So we're not sure that the network is up and running
when the containers are starting.
With docker this behaviour is already handled because the systemd
unit scripts depend on docker service which is started after the
network.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We currently only purge rh_storage yum repository file but depending
on the ceph_repository value we are using, the ceph repository file
could have a different name.
Resolves: #4056
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This add support for rgw loadbalancer based on HAProxy and Keepalived.
We define a single role ceph-rgw-loadbalancer and include HAProxy and
Keepalived configurations all in this.
A single haproxy backend is used to balance all RGW instances and
a single frontend is exported via a single port, default 80.
Keepalived is used to maintain the high availability of all haproxy
instances. You are free to use any number of VIPs. A single VIP is
shared across all keepalived instances and there will be one
master for one VIP, selected sequentially, and others serve as
backups.
This assumes that each keepalived instance is on the same node as
one haproxy instance and we use a simple check script to detect
the state of each haproxy instance and trigger the VIP failover
upon its failure.
Signed-off-by: guihecheng <guihecheng@cmiot.chinamobile.com>
By running ceph-ansible there are a lot ``[DEPRECATION WARNING]`` like these:
```
[DEPRECATION WARNING]: evaluating containerized_deployment as a bare variable,
this behaviour will go away and you might need to add |bool to the expression
in the future. Also see CONDITIONAL_BARE_VARS configuration toggle.. This
feature will be removed in version 2.12. Deprecation warnings can be disabled
by setting deprecation_warnings=False in ansible.cfg.
```
Now appended ``| bool`` on a lot of the affected variables.
Sometimes the coding style from ``variable|bool`` changed to ``variable | bool`` *(with spaces at the pipe)*.
Closes: #4022
Signed-off-by: L3D <l3d@c3woc.de>
Currently we're only able to use podman on ubuntu if podman's
installation is done manually before the ceph-ansible execution
because the deb package is present in an external repository.
We already manage the docker-ce installation via an external
repository so we should be able to allow the podman installation
with the same mechanism too.
https://github.com/containers/libpod/blob/master/install.md#ubuntuResolves: #3947
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Otherwise content in /run/udev is mislabeled and prevent some services
like NetworkManager from starting.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the rhel8 image used is an outdated beta version, it is not worth it to
maintain this image upstream, since it's possible to test podman with a
newer version of centos/atomic-host image.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since the split between container-engine and container-common roles,
the tags and condition were not updated to reflect the change.
- ceph-container-engine needs with_pkg tag
- ceph-container-common needs fetch_container_images
- we don't need to pull the container image in a dedicated task for
atomic host. We can now use the ceph-container-common role.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
789cef7 introduces a regression in the ganesha configuration file
generation. The new config_template module version broke it.
But the ganesha.conf file isn't an ini file and doesn't really
need to use the config_template module. Instead we can use the
classic template module.
Resolves: #4045
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
If there is no mgrs and mons in the inventory, it will fail with the following error:
```
ERROR! The field 'hosts' has an invalid value, which includes an undefined variable. The error was: 'dict object' has no attribute 'mons'
The error appears to be in '/home/guits/ceph-ansible/site-docker.yml.sample': line 539, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- hosts: '{{ (groups["mgrs"] | default(groups["mons"]))[0] }}'
^ here
We could be wrong, but this one looks like it might be an issue with
missing quotes. Always quote template expression brackets when they
start a value. For instance:
with_items:
- {{ foo }}
Should be written as:
with_items:
- "{{ foo }}"
```
let's add an `omit` so it just display this message instead:
```
PLAY [[]] *******************
skipping: no hosts matched
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>