For some time now we have seen failures in the CI for containerized
scenarios because the VMs run out of disk space.
The images used default to only 3 GB for the root partition,
which is not much.
Typical error seen:
```
STDERR:
failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```
Indeed, on the machine we can see:
```
Every 2.0s: df -h Tue May 29 17:21:13 2018
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 3.0G 3.0G 14M 100% /
```
The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.
```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
Logical volume atomicos/root successfully resized.
```
```
-bash-4.2# xfs_growfs /
meta-data=/dev/mapper/atomicos-root isize=512 agcount=4, agsize=192000 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=768000, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 768000 to 2543616
```
```
-bash-4.2# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 9.7G 1.4G 8.4G 14% /
```
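As a sanity check on the numbers above, here is a small sketch that converts the reported extents and XFS blocks to GiB. It assumes the volume group uses LVM's default 4 MiB extent size, which the output does not state; the 4096-byte block size is the `bsize` reported by `xfs_growfs`. The helper names are illustrative only.

```python
# Assumption: the volume group uses LVM's default extent size of 4 MiB.
EXTENT_MIB = 4
# Block size (bsize) reported by xfs_growfs in the output above.
BLOCK_BYTES = 4096


def extents_to_gib(extents):
    """Convert a number of LVM extents to GiB."""
    return extents * EXTENT_MIB / 1024


def blocks_to_gib(blocks):
    """Convert a number of XFS data blocks to GiB."""
    return blocks * BLOCK_BYTES / 2**30


print(f"LV:  {extents_to_gib(750):.2f} GiB -> {extents_to_gib(2484):.2f} GiB")
print(f"XFS: {blocks_to_gib(768000):.2f} GiB -> {blocks_to_gib(2543616):.2f} GiB")
```

Under that assumption both tools agree: the logical volume and the filesystem grow from about 2.93 GiB to about 9.70 GiB.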
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In the CI we frequently see failures like the following:
`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`
It seems the fastest mirror detection is sometimes counterproductive and
causes yum to fail.
This fix has been added to `setup.yml`.
Until now this playbook was only used just before running `testinfra`;
it can also be run before ceph-ansible so we can add some
provisioning tasks.
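The message does not spell out the fix itself. One common way to stop fastest mirror detection is to set `enabled=0` in the plugin configuration (`/etc/yum/pluginconf.d/fastestmirror.conf` on CentOS 7). Below is a minimal Python sketch of that edit, assuming this is what the fix amounts to; the helper name and the sample content are hypothetical.

```python
import configparser
import io


def disable_fastestmirror(conf_text):
    """Return the plugin configuration with enabled=0 in [main].

    Note: configparser drops comments, so this is only an illustration.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("main"):
        parser.add_section("main")
    parser.set("main", "enabled", "0")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical sample content for the plugin configuration file.
sample = "[main]\nenabled=1\nverbose=0\n"
print(disable_fastestmirror(sample))
```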
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Erwan Velu <evelu@redhat.com>
Let's move this variable to group_vars/all.yml in all testing scenarios,
in accordance with commit 1f15a81c48, so
we keep the playbook and the tests consistent.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.
The idea here is to move the OpenStack pool creation to the end of the `ceph-osd` role.
[1] e59258943b/src/mon/OSDMonitor.cc (L5673)
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.
It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.
Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a.
Now profiles drive the setting of rgw keystone *.
Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
As of ceph 12.2.5 the parameter `type` is no longer a name but
an id; an `int` is therefore expected, otherwise the command fails.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In the multimds case we must check the number of MDSs that are up instead of
just checking whether the hostname of the node is in the fsmap.
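A rough sketch of the difference between the two checks, in Python. The shape of the `fsmap` section used here (`up`, `up:standby`, `by_rank`) is an assumption about the cluster status JSON, and the function names and sample data are hypothetical.

```python
import json


def count_mds_up(status_json):
    """Count active plus standby MDS daemons reported in the fsmap."""
    fsmap = json.loads(status_json)["fsmap"]
    return fsmap.get("up", 0) + fsmap.get("up:standby", 0)


def hostname_in_ranks(status_json, hostname):
    """Old-style check: is this hostname listed among the ranked MDSs?"""
    fsmap = json.loads(status_json)["fsmap"]
    return any(d.get("name") == hostname for d in fsmap.get("by_rank", []))


# Hypothetical status for two MDS nodes: one active, one standby.
sample = json.dumps({
    "fsmap": {
        "up": 1,
        "up:standby": 1,
        "by_rank": [{"rank": 0, "name": "mds0", "status": "up:active"}],
    }
})

print(count_mds_up(sample))                # both daemons are counted
print(hostname_in_ranks(sample, "mds1"))   # the standby node is missed
```

With several MDS nodes, the per-hostname lookup misses a standby daemon, while the count can be compared against the number of MDS hosts in the inventory.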
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
These are already handled by ceph-client/defaults/main.yml so the keys
will be created once user_config is set to True.
Signed-off-by: Sébastien Han <seb@redhat.com>
Now that we are using ceph_volume_zap the partitions are
kept around and should be able to be reused.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Prior to this patch, the certificates were being generated on a single
node only (because of the run_once: true). Thus certificates were not
distributed on all the gateway nodes.
This would require a second ansible run to work. This patch fixes the
creation and distribution of the keys on all the nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1540845
Signed-off-by: Sébastien Han <seb@redhat.com>
We should stop putting everything in 'all'. It is too easy, and it is
also error prone for those who separate variables by host
type, which is what you should do.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now run tests on the newly created ceph_crush module. Now the CI will
create a specific hierarchy for the OSD.
Signed-off-by: Sébastien Han <seb@redhat.com>
The ceph-ansible upstream CI runs several tests, including an
'idempotency/handlers' test. It means the playbook is run a first time
and then a second time with another container image version to ensure the
handlers run properly and the containers are properly restarted.
This can cause issues.
For instance, in that specific case which drove me to submit this commit,
I've hit the case where `latest` image ships ceph 12.2.3 while the `stable-3.0`
(which is the image used for the second run) ships ceph 12.2.2.
The goal of this test is not to verify that we can upgrade from one specific
version to another but to ensure the handlers work, even if the failure
here is a valid one.
Such a failure should be caught by a test dedicated to that use case.
For the upstream CI we just need a container image with the same content
but a different image id in the registry, since the test relies on the
image id to decide whether the container should be restarted.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since we have a task to test the handlers we can test a new container to
validate the service restart on a new container image.
Signed-off-by: Sébastien Han <seb@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have one-liners like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
  module: wait_for
  port: 22
  host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
  state: started
  delay: 10
  timeout: 500
```
is nicer and helps keep the whole playbook consistent.
This also fixes a potential issue with missing quotation marks:
```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)
  File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
Writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it fails on missing quotes; this fix solves the issue.
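The traceback ends in `shlex`, which refuses to split a string with an unbalanced quote. A minimal reproduction of that error in Python, using made-up command strings (the exact string that tripped the module is not shown in this message):

```python
import shlex


def try_split(command):
    """Split a command line, reporting the shlex error instead of raising."""
    try:
        return shlex.split(command)
    except ValueError as exc:
        return f"ValueError: {exc}"


# Hypothetical inputs: one with an unterminated quote, one balanced.
print(try_split("echo 'unterminated"))
print(try_split("echo 'balanced quotes'"))
```

The first call reports `No closing quotation`, the same message as in the traceback; the second splits cleanly.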
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The --crush-device-class flag for ceph-volume is not available in luminous so let's
remove this testing option for now until it's more widely available.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
When deploying Jewel from master we still need to enable this code since
the container image has such a check. This check still exists because
ceph-disk is not able to create a GPT label on a drive that does not
have one.
Signed-off-by: Sébastien Han <seb@redhat.com>
The entrypoint used to generate user keyrings is `ceph-authtool`; therefore,
it can't expand the `$(ceph-authtool --gen-print-key)` inside the
container. Users must generate a keyring themselves.
This commit also adds a check to ensure keyrings are properly filled in when
`user_config: true`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add a missing test `test_rbd_mirror_service_is_running_from_luminous()`.
Also use `bash -c "<cmd>"` to make testinfra aware that, later in
the upgrade process, we are running the `luminous` ceph release, so we
must skip the rbd tests related to the `jewel` ceph release.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
ceph-ansible is now being tested against ansible2.2 and ansible2.4. We
need to update tox.ini so we use the right version of testinfra
depending on which ansible version we are using.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit adds new OSD scenarios. It aims to simplify the CI setup and
bring better coverage of the OSD scenarios.
We decided to differentiate between filestore and bluestore, thinking
ahead to when filestore won't be supported anymore.
So we now have two classes of tests:
* Filestore
* Bluestore
In each of those classes we have container and non-container.
Then for each we test the following:
* collocated
* collocated dmcrypt
* non-collocated
* non-collocated dmcrypt
* auto discovery collocated
* auto discovery collocated dmcrypt
This gives us a nice coverage and also reduces the footprint on the CI.
We are now up to 4 scenarios, each containing 6 OSD VMs.
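The resulting matrix can be sketched in a few lines of Python. The scenario naming scheme below is hypothetical; only the two classes, the container split, and the six OSD variants come from this message.

```python
from itertools import product

object_stores = ["filestore", "bluestore"]
deployments = ["container", "non-container"]
osd_variants = [
    "collocated",
    "collocated dmcrypt",
    "non-collocated",
    "non-collocated dmcrypt",
    "auto discovery collocated",
    "auto discovery collocated dmcrypt",
]

# One scenario per (object store, deployment) pair; each scenario runs
# one OSD VM per variant. The "store-deployment" names are made up.
scenarios = {
    f"{store}-{deployment}": list(osd_variants)
    for store, deployment in product(object_stores, deployments)
}

for name, vms in scenarios.items():
    print(f"{name}: {len(vms)} OSD VMs")
```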
Signed-off-by: Sébastien Han <seb@redhat.com>
1. add the variables to docker_collocation
2. trigger the check when a MDS is part of the inventory file, not when
we run on an MDS...
Signed-off-by: Sébastien Han <seb@redhat.com>
`vagrant reload` is serialized and takes a lot of time compared to a simple reboot.
See the benchmarks below for 3 VMs:
[leseb@rick docker]$ time ANSIBLE_SSH_ARGS="-F
/home/leseb/reproduce-ci/tmp.zgGC7d5mIC/build/workspace/ceph-ansible/tests/functional/centos/7/docker/vagrant_ssh_config" ansible-playbook -i /home/leseb/reproduce-ci/tmp.zgGC7d5mIC/build/workspace/ceph-ansible/tests/functional/centos/7/docker/hosts reboot.yml
PLAY [mons]
****************************************************************************************************************************************************************************************************
TASK [Gathering Facts]
*****************************************************************************************************************************************************************************************
ok: [mon1]
ok: [mon2]
ok: [mon0]
TASK [restart machine]
*****************************************************************************************************************************************************************************************
changed: [mon2]
changed: [mon1]
changed: [mon0]
TASK [wait for server to boot]
*********************************************************************************************************************************************************************************
ok: [mon2 -> localhost]
ok: [mon0 -> localhost]
ok: [mon1 -> localhost]
TASK [uptime]
**************************************************************************************************************************************************************************************************
changed: [mon2]
changed: [mon0]
changed: [mon1]
PLAY RECAP
*****************************************************************************************************************************************************************************************************
mon0 : ok=4 changed=2 unreachable=0
failed=0
mon1 : ok=4 changed=2 unreachable=0
failed=0
mon2 : ok=4 changed=2 unreachable=0
failed=0
real 0m35.112s
user 0m5.737s
sys 0m1.849s
[leseb@rick docker]$ time vagrant reload
==> mon0: Halting domain...
==> mon0: Starting domain.
==> mon0: Waiting for domain to get an IP address...
==> mon0: Waiting for SSH to become available...
==> mon0: Creating shared folders metadata...
==> mon0: Rsyncing folder:
/home/leseb/reproduce-ci/tmp.zgGC7d5mIC/build/workspace/ceph-ansible/tests/functional/centos/7/docker/
=> /home/vagrant/sync
==> mon0: Machine already provisioned. Run `vagrant provision` or use
the `--provision`
==> mon0: flag to force provisioning. Provisioners marked to run always
will still run.
==> mon1: Halting domain...
==> mon1: Starting domain.
==> mon1: Waiting for domain to get an IP address...
==> mon1: Waiting for SSH to become available...
==> mon1: Creating shared folders metadata...
==> mon1: Rsyncing folder:
/home/leseb/reproduce-ci/tmp.zgGC7d5mIC/build/workspace/ceph-ansible/tests/functional/centos/7/docker/
=> /home/vagrant/sync
==> mon1: Machine already provisioned. Run `vagrant provision` or use
the `--provision`
==> mon1: flag to force provisioning. Provisioners marked to run always
will still run.
==> mon2: Halting domain...
==> mon2: Starting domain.
==> mon2: Waiting for domain to get an IP address...
==> mon2: Waiting for SSH to become available...
==> mon2: Creating shared folders metadata...
==> mon2: Rsyncing folder:
/home/leseb/reproduce-ci/tmp.zgGC7d5mIC/build/workspace/ceph-ansible/tests/functional/centos/7/docker/
=> /home/vagrant/sync
==> mon2: Machine already provisioned. Run `vagrant provision` or use
the `--provision`
==> mon2: flag to force provisioning. Provisioners marked to run always
will still run.
real 1m31.850s
user 0m7.387s
sys 0m0.796s
Reboot via Ansible: 0m35.112s
Reboot via vagrant: 1m31.850s
The Ansible reboot takes a bit more than 1/3 of the time, so we save close to 2/3.
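The ratio can be worked out from the two `real` timings above; a short Python sketch (the helper name is illustrative):

```python
def to_seconds(minutes, seconds):
    """Convert a `time` style XmY.ZZZs value to seconds."""
    return minutes * 60 + seconds


ansible_s = to_seconds(0, 35.112)   # real time of the Ansible reboot
vagrant_s = to_seconds(1, 31.850)   # real time of `vagrant reload`

ratio = ansible_s / vagrant_s
print(f"Ansible takes {ratio:.0%} of the vagrant time, saving {1 - ratio:.0%}")
print(f"Absolute saving: {vagrant_s - ansible_s:.3f}s for 3 VMs")
```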
Signed-off-by: Sébastien Han <seb@redhat.com>
We now have a variable called ceph_pools that is mandatory when
deploying a MDS.
It's a dictionary that contains a pool name and a PG count. The PG count is
mandatory and must be set; the playbook will fail otherwise.
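A sketch of the kind of validation this implies, in Python. The exact shape of `ceph_pools` is not given here, so the list of `{"name": ..., "pgs": ...}` entries and the function name are assumptions made for illustration.

```python
def validate_ceph_pools(ceph_pools):
    """Raise ValueError when an entry lacks a name or a usable PG count."""
    for pool in ceph_pools:
        name = pool.get("name")
        pgs = pool.get("pgs")
        if not name:
            raise ValueError("every pool needs a name")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(pgs, bool) or not isinstance(pgs, int) or pgs <= 0:
            raise ValueError(f"pool {name!r}: PG count must be a positive integer")


# Hypothetical pool definitions; the key names are assumed.
validate_ceph_pools([{"name": "cephfs_data", "pgs": 8}])  # passes silently

try:
    validate_ceph_pools([{"name": "cephfs_metadata"}])  # PG count missing
except ValueError as exc:
    print(f"rejected: {exc}")
```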
Closes: https://github.com/ceph/ceph-ansible/issues/2017
Signed-off-by: Sébastien Han <seb@redhat.com>
The `always_run` key is deprecated and being removed in Ansible 2.4.
Using it causes a warning to be displayed:
[DEPRECATION WARNING]: always_run is deprecated.
This patch changes all instances of `always_run` to use the `always`
tag, which causes the task to run each time the playbook runs.
- the rbd-mirror unit systemd name is not the same when running jewel vs
luminous.
- servicemap is not available on jewel.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since we introduced the collocation testing scenario, we need to adapt
the current tests to it.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we don't do this the client will create pools with a replica size of 3 since
osd_pool_default_size was missing from ceph-override.json. This was making
switch_to_containers fail.
Signed-off-by: Sébastien Han <seb@redhat.com>
Shared folder is not required for tests.
We should avoid hitting the error:
```
uninitialized constant VagrantPlugins::ProviderLibvirt::Action::ShareFolders
```
Also, disabling it might reduce the time needed for the VMs to start
in certain cases.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If two environments use the same subnet, we will run into trouble
because of IP address conflicts.
This commit ensures each scenario has a unique subnet for both the public and cluster
networks so we can set up several test environments at a time on the same hypervisor.
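A quick way to check for such conflicts is to compare every pair of subnets. The sketch below uses Python's `ipaddress` module; the scenario names and subnets are made up for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_scenarios(subnets):
    """Return pairs of scenario names whose subnets overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical scenario names and public subnets.
public_subnets = {
    "scenario_a": "192.168.1.0/24",
    "scenario_b": "192.168.2.0/24",
    "scenario_c": "192.168.1.0/24",
}
print(overlapping_scenarios(public_subnets))
```

An empty list means every scenario has its own subnet; the same check applies to the cluster networks.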
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The path `/dev/disk/by-path/pci-0000:00:01.1-ata-1.0` doesn't exist.
It has to be changed to `/dev/disk/by-path/pci-0000:00:01.1-ata-1`
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit refactors the code of all `set_osd_pool_default_*`
related tasks by dropping the unnecessary `set_fact` used to determine
whether a key is present in `ceph_conf_overrides`.
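The idea of testing for the key directly, rather than storing an intermediate fact first, can be sketched in Python as follows. The nested `global` section and the function name are assumptions; the fallback of 3 is Ceph's default replica count.

```python
CEPH_DEFAULT_POOL_SIZE = 3  # replica count used when nothing overrides it


def osd_pool_default_size(ceph_conf_overrides):
    """Use the override when the key is present, else the default."""
    # Assumption: overrides are grouped by ceph.conf section, e.g. "global".
    global_section = ceph_conf_overrides.get("global", {})
    return int(global_section.get("osd_pool_default_size", CEPH_DEFAULT_POOL_SIZE))


print(osd_pool_default_size({"global": {"osd_pool_default_size": 1}}))
print(osd_pool_default_size({}))
```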
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we don't bootstrap the mgr after the mon and the osds handler are
called, we will never be able to reach a clean state since the pgs
stats are handled by the mgr. This also happens when doing daemon
collocation.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1493920
Signed-off-by: Sébastien Han <seb@redhat.com>
This test doesn't work at the moment and needs to be fixed.
Disable it temporarily to avoid errors in the CI.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
On a container env, machines don't have any ceph binaries so we need to
use a container to run the commands.
Signed-off-by: Sébastien Han <seb@redhat.com>
Delete these before creating them in case they are left around in a purge
cluster testing scenario. The purge-cluster.yml playbook does not
currently remove partitions used for journals.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
The partition only needs to be created and given a gpt label so that a
PARTUUID will exist on the partition.
This task also makes the purge_lvm_osds scenario fail on the second
deployment after purging.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Prior to this patch the activation sequence for autodetection was
always skipped because we were asking to activate on a device without
partitions, which doesn't make sense.
We also fix the way we look up a device: since the data partition is
always numbered 1, we take the min element of the dict.
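In Python terms the lookup amounts to the following sketch. The partition names and sizes are hypothetical stand-ins for the facts gathered on a device.

```python
# Hypothetical partition facts for one device, keyed by partition name.
partitions = {
    "sdb2": {"size": "5.00 GB"},   # e.g. a journal partition
    "sdb1": {"size": "95.00 GB"},  # the data partition
}

# The data partition is always numbered 1, so its name sorts first;
# min() over a dict compares its keys.
data_partition = min(partitions)
print(data_partition)
```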
Closes: https://github.com/ceph/ceph-ansible/issues/1782
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We need to force the value of the `docker` variable, which is initially set
to `false` since it's a migration from non-containerized to
containerized cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>