Commit Graph

4120 Commits (1144df2852c1a51ca72b7b83a5beebd9a9ac4348)
 

Author SHA1 Message Date
Matthew Vernon 04f4991648 restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
2018-09-24 10:33:46 +00:00
Matthew Vernon aa97ecf048 restart_osd_daemon.sh.j2 - Reset RETRIES between calls of check_pgs
Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the clusters' pgs to be happy again before continuing.

Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
2018-09-24 08:20:32 +00:00
John Spray 26bfef4107 Remove Calamari-related pieces
...with the exception of the purge operation, since
removing Calamari would still be useful for an old
cluster.

Signed-off-by: John Spray <john.spray@redhat.com>
2018-09-21 11:00:18 +01:00
Norbert Illés bd82c380c4 vagrantfile: fix references to OpenStack settings
In case of an OpenStack "box", the Vagrantfile intend to check the
existence of os_networks and os_floating_ip_pool settings in
vagrant_variables.yml and pass them to the provider if they are set.
Due to two typos in the Vagrantfile this is not working as it checks the
wrong variable names.
This commit fixes the typos so these settings can be used.

Signed-off-by: Norbert Illés <illesnorbi@gmail.com>
2018-09-21 07:00:03 +00:00
Andrew Schoen 16ccac83fe ceph-config: calculate num_osds for the lvm batch scenario
For now our best guess is to count the number of devices and multiply
by osds_per_device. Ideally we'd like to run ceph-volume lvm batch
--report and get the number of OSDs that way, but currently we need
a ceph.conf in place already before we can do that. There is a tracker
ticket that would allow os to get around the need for a ceph.conf:
http://tracker.ceph.com/issues/36088

Fixes: https://github.com/ceph/ceph-ansible/issues/3135

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-09-20 15:41:52 +00:00
Guillaume Abrioux 6d6fd514e0 config: set default _rgw_hostname value to respective host
the default value for _rgw_hostname was took from the current node being
played while it should be took from the respective node in the loop.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 20:10:34 +02:00
Andrew Schoen 8afad35f5a ceph-config: default devices and lvm_volumes when setting num_osds
This avoids errors when the osd scenario choosen does not require
setting devices or lvm_volumes. The default values for these are not
set because they exist in the ceph-osd role, not ceph-defaults.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-09-18 17:02:33 +00:00
Neha Ojha 27027a17d3 osd: add osd memory target option
BlueStore's cache is sized conservatively by default, so that it does
not overwhelm under-provisioned servers. The default is 1G for HDD, and
3G for SSD.

To replace the page cache, as much memory as possible should be given to
BlueStore. This is required for good performance. Since ceph-ansible
knows how much memory a host has, it can set

`bluestore cache size = max(total host memory / num OSDs on this host * safety
factor, 1G)`

Due to fragmentation and other memory use not included in bluestore's
cache, a safety factor of 0.5 for dedicated nodes and 0.2 for
hyperconverged nodes is recommended.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1595003

Signed-off-by: Neha Ojha <nojha@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 10:12:46 +00:00
Guillaume Abrioux 57f0b6a476 shrink-osd: follow up on 36fb3cde
- Adds loop in bash to satisfy the 1:n relation between `osd_hosts` and the
different device lists.
- Fixes some container name which were using the host hostname instead
of the actual container one.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 07:27:41 +00:00
Guillaume Abrioux 98c210d757 site-docker: fix undefined variable error
`mon_group_name` isn't defined here, we must hardcode it.

Typical error:

```
The task includes an option with an undefined variable. The error was: 'mon_group_name' is undefined
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 07:27:41 +00:00
Sébastien Han 735e1917db shrink-osd: purge dedicated devices
Once the OSD is destroyed we also have to purge the associated devices,
this means purging journal, db , wal partitions too.

This now works for container and non-container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-18 07:27:41 +00:00
Mike Christie 8fcd63cc50 igw: enable and start rbd-target-api
The commit:

commit 1164cdc002
Author: Guillaume Abrioux <gabrioux@redhat.com>
Date:   Thu Aug 2 11:58:47 2018 +0200

    iscsigw: install ceph-iscsi-cli package

installs the cli package but does not start and enable the
rbd-target-api daemon needed for gwcli to communicate with the igw
nodes. This patch just enables and starts it for the non-container
setup. The container setup is already doing this.

This fixes bz https://bugzilla.redhat.com/show_bug.cgi?id=1613963

Signed-off-by: Mike Christie <mchristi@redhat.com>
2018-09-13 19:35:45 +00:00
Guillaume Abrioux 3382c5226c tests: fix monitor_address for shrink_osd scenario
b89cc1746 introduced a typo. This commit fixes it

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 18:14:01 +02:00
Guillaume Abrioux 4159326a18 shrink-osd: fix purge osd on containerized deployment
ce1dd8d introduced the purge osd on containers but it was incorrect.

`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, it means it was trying to
resolve parent devices from this node where it should be done on OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 18:14:01 +02:00
Guillaume Abrioux 7a61771539 doc: update lvm doc
As of e3820a2 the creation of logical volumes is now supported by ceph-ansible.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 15:29:24 +00:00
Guillaume Abrioux a6f77340fd nfs: ignore error on semanage command for ganesha_t
As of rhel 7.6, it has been decided it doesn't make sense to confine
`ganesha_t` anymore. It means this domain won't exist anymore.

Let's add a `failed_when: false` in order to make the deployment not
failing when trying to run this command.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1626070

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 13:06:47 +02:00
Guillaume Abrioux 8f2c660d25 tests: pin sphinx version to 1.7.9
using sphinx 1.8.0 breaks our doc test CI job.

Typical error:

```
Exception occurred:
  File
  "/home/jenkins-build/build/workspace/ceph-ansible-docs-pull-requests/docs/.tox/docs/lib/python2.7/site-packages/sphinx/highlighting.py",  line 26, in <module>
      from sphinx.ext import doctest
      SyntaxError: unqualified exec is not allowed in function 'run' it contains a nested function with free variables (doctest.py, line 97)
```

See: https://github.com/sphinx-doc/sphinx/issues/5417

Pinning to 1.7.9 to fix our CI.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 13:05:43 +02:00
Andrew Schoen b36f3e06b5 ceph_volume: adds the osds_per_device parameter
If this is set to anything other than the default value of 1 then the
--osds-per-device flag will be used by the batch command to define how
many osds will be created per device.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-09-12 20:27:14 +00:00
Guillaume Abrioux 1c88c444a3 mon: fix `ExecStartPre` option in systemd unit file
This command line is not supported.
According to official documentation:

```
Note that shell command lines are not directly supported.
If shell command lines are to be used,
they need to be passed explicitly to a shell implementation of some kind.
```

We must run this using /bin/sh instead.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-11 10:48:21 +02:00
Guillaume Abrioux 9ff26e80f2 defaults: add a default value to rgw_hostname
let's add ansible_hostname as a default value for rgw_hostname if no
hostname in servicemap matches ansible_fqdn.

Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-10 12:07:44 +02:00
Guillaume Abrioux 6954ac184f tests: do not upgrade ceph release for switch_to_containers scenario
Using `UPDATE_*` environment variables here will make an upgrade of the
ceph release when running switch_to_containers scenario which is not
correct.

Eg:
If ceph luminous was first deployed, then we should switch to ceph
luminous containers, not to mimic.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-09 14:25:14 +02:00
Guillaume Abrioux ecbd3e4558 Revert "client: add quotes to the dict values"
This commit is adding quotes that make keyring unusuable

eg:

```
client.john
        key: AQAN0RdbAAAAABAAH5D3WgMN9Rxw3M8jkpMIfg==
        caps: [mds] ''
        caps: [mgr] 'allow *'
        caps: [mon] 'allow rw'
        caps: [osd] 'allow rw'
```

Trying to import such a keyring and use it will result:

```
Error EACCES: access denied
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1623417

This reverts commit 424815501a.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-07 17:21:55 +00:00
Tom Barron bf8f589958 run rados cmd in container if containerized deployment
When ceph-nfs is deployed containerized and ceph-common is not
installed on the host the start_nfs task fails because the rados
command is missing on the host.

Run rados commands from a ceph container instead so that
they will succeed.

Signed-off-by: Tom Barron <tpb@dyncloud.net>
2018-09-03 17:06:00 +00:00
Markos Chandras 217f35dbdb roles: ceph-rgw: Enable the ceph-radosgw target
If the ceph-radosgw target is not enabled, then enabling the
ceph-radosgw@ service has no effect since nothing will pull
it on the next reboot. As such, we need to ensure that the
target is enabled.

Signed-off-by: Markos Chandras <mchandras@suse.de>
2018-09-03 15:48:58 +02:00
Sébastien Han 38dc20e74b purge: only purge /var/lib/ceph content
Sometime /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy) so let's remove its content only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-03 10:51:24 +02:00
Alfredo Deza 58b2308036 tests: use new 'num_osds' variable in tests
Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-08-31 21:23:20 +00:00
Alfredo Deza e5fcb0d2d2 tests: allow defining arbitrary number of OSDs
Some tests might want to set this since number of devices will not
necessarily map to number of OSDs

Signed-off-by: Alfredo Deza <adeza@redhat.com>
2018-08-31 21:23:20 +00:00
Andy McCrae 772e6b9be2 Dont run client dummy container on non-x86_64 hosts
The dummy client container currently wont work on non-x86_64 hosts.
This PR creates a filtered client group that contains only hosts
that are x86_64 - which can then be the group to run the
dummy container against.

This is for the specific case of a containerized_deployment where
there is a mixture of non-x86_64 hosts and x86_64 hosts. As such
the filtered group will contain all hosts when running with
containerized_deployment: false.

Currently ppc64le is not supported for Ceph server components.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
2018-08-31 11:34:00 +00:00
Ali Maredia 561ec9203d infrastructure-playbooks: add comments for lv_vars.yml
Add comments telling user that devices used in
playbooks must not have GPT/FS/RAID signatures

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Ali Maredia 77eb459a88 infrastructure playbooks: remove lv-create error msg
remove error message when PV creation fails

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Sébastien Han 124fc727f4 doc: remove old statement
We have been supporting multiple devices for journalin containerized
deployments for a while now and forgot about this.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622393
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-28 13:31:57 -07:00
Sébastien Han 9ba670567e remove warning for unsupported variables
As promised, these will go unsupported for 3.1 so let's actually remove
them :).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622729
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-28 13:31:57 -07:00
Ali Maredia e1ff438800 infrastructure-playbooks: failure msg for pvcreate
Add a message for when PV creation fails.

This message alerts users that FS/GPT/RAID
signatures could still on the device and the
reason for the failures.

`wipefs -a $device` needs to be run to fix this issue.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-28 20:21:42 +00:00
Sébastien Han ae5ebeeb00 sites: fix conditonnal
Same problem again... ceph_release_num[ceph_release] is only set in
ceph-docker-common/common roles so putting the condition on that role
will never work. Removing the condition.

The downside of this is we will be installing packages and then skip the
role on the node.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622210
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-27 22:11:15 +02:00
Sébastien Han 30cfeb5427 site-docker.yml: remove useless condition
If we play site-docker.yml, we are already in a
containerized_deployment. So the condition is not needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-23 16:13:54 +02:00
Sébastien Han 7012835d2b ci: stop using different images on the same run
There is no point of using hosts running on atomic AND centos hosts. So
let's run containerized scenarios on Atomic only.

This solves this error here:

```
fatal: [client2]: FAILED! => {
    "failed": true
}

MSG:

The conditional check 'ceph_current_status.rc == 0' failed. The error was: error while evaluating conditional (ceph_current_status.rc == 0): 'dict object' has no attribute 'rc'

The error appears to have been in '/home/jenkins-build/build/workspace/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/roles/ceph-defaults/tasks/facts.yml': line 74, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

- name: set_fact ceph_current_status (convert to json)
  ^ here
```

From https://2.jenkins.ceph.com/view/ceph-ansible-stable3.1/job/ceph-ansible-nightly-luminous-stable-3.1-ooo_collocation/37/consoleFull#1765217701b5dd38fa-a56e-4233-a5ca-584604e56e3a

What's happening here is all the hosts excepts the clients are running atomic, so here: https://github.com/ceph/ceph-ansible/blob/master/site-docker.yml.sample#L62
The condition will skipped all the nodes excepts the clients, thus when running ceph-default, the task "is ceph running already?" is skipped but the task above needs the rc of the skipped task.
This is not an error from the playbook, it's a CI setup issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-23 16:13:54 +02:00
Sébastien Han 6d7fa99ff7 defaults: fix rgw_hostname
A couple if things were wrong in the initial commit:

* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is set in the roles after. So
either ceph-common or ceph-docker-common set it

* we can easily re-use the initial command to check if a cluster is
running, it's more elegant than running it twice.

* set the fact rgw_hostname on rgw nodes only

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-22 17:46:00 +02:00
Sébastien Han 0d448da695 vagrant: move variable samples to contrib
Let's clean up the root of the repo a bit

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 23:54:24 +02:00
Sébastien Han a2ad2fb3d5 rm ceph-aio-no-vagrant.sh
Script is out dated.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 23:54:24 +02:00
Sébastien Han 017aa6ef20 remove monitor_keys_example file
This file is not needed, if you want to generate a key you can run:

python -c "import os ; import struct ; import time; import base64 ; key
= os.urandom(16) ; header =
struct.pack('<hiih',1,int(time.time()),0,len(key)) ;
print(base64.b64encode(header + key).decode())"

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 23:54:24 +02:00
Andy McCrae 18684b7209 Sync config_template with base plugin
The config_template plugin exists in the ceph-common role so that
config_template will still work with ansible galaxy.

This PR syncs the config_template module from the base of the repo in
plugins/actions to the ceph-common role.

Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
2018-08-21 16:10:33 +00:00
Sébastien Han 2e6e885bb7 rolling_upgrade: set sortbitwise properly
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSD are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSD are sending a version
10 and then run the command to unlock the OSDs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 12:22:32 +00:00
Sébastien Han 77a3a682f3 iscsi group name preserve backward compatibility
Recently we renamed the group_name for iscsi iscsigws where previously
it was named iscsi-gws. Existing deployments with a host file section
with iscsi-gws must continue to work.

This commit adds the old group name as a backoward compatility, no error
from Ansible should be expected, if the hostgroup is not found nothing
is played.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 23:52:19 +02:00
Sébastien Han 8c70a5b197 osd: fix ceph_release
We need ceph_release in the condition, not ceph_stable_release

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1619255
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 20:14:56 +02:00
Sébastien Han b738706810 take-over-existing-cluster: do not call var_files
We were using var_files long ago when default variables were not in
ceph-defaults, now the role exists this is not need. Moreover having
these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

Will create collision and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 14:47:04 +02:00
Markos Chandras 126e2e3f92 roles: ceph-defaults: Check if 'rgw' attribute exists for rgw_hostname
If there are no services on the cluster, then the 'rgw' could be missing
and the task is failing with the following problem:

msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'

We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.

Signed-off-by: Markos Chandras <mchandras@suse.de>
2018-08-20 11:37:45 +02:00
Markos Chandras 37e50114de roles: ceph-defaults: Delegate cluster information task to monitor node
Since commit f422efb1d6 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployment with OpenStack-Ansible

fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"

This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.

Since we are looking to obtain cluster information, the task should be
delegated to a monitor node similar to other tasks in that role

Signed-off-by: Markos Chandras <mchandras@suse.de>
2018-08-20 11:37:45 +02:00
Dardo D Kleiner f6519e4003 mgr: improve/fix disabled modules check
Follow up on 36942af698

"disabled_modules" is always a list, it's the items in the list that
can be dicts in mimic.  Many ways to fix this, here's one.

Signed-off-by: Dardo D Kleiner <dardokleiner@gmail.com>
2018-08-20 11:23:58 +02:00
Andrew Schoen 04df3f0802 lv-create: use copy instead of the template module
The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen f5a4c89869 tests: cat the contents of lv-create.log in infra_lv_create
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00