3902 Commits (1c88c444a30570dfeacc03a9424a10b088e7b344)
 

Author SHA1 Message Date
Guillaume Abrioux 9d5265fe11 osds: wait for osds to be up before creating pools
This is a follow up on #2628.
Even with the openstack pools creation moved later in the playbook,
there is still an issue because OSDs are not all UP when trying to
create pools.

Adding a task which checks for all OSDs to be UP with a `retries/until`
condition should definitively fix this issue.
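
The `retries/until` idea can be sketched outside Ansible as a simple polling loop. This is only an illustration, not the task added by the commit: the function name, the retry count, the delay and the `get_osd_counts` callable (which would return the total and "up" OSD counts, however they are obtained) are all hypothetical.

```python
import time
from typing import Callable, Tuple


def wait_for_osds_up(
    get_osd_counts: Callable[[], Tuple[int, int]],
    retries: int = 60,          # hypothetical default
    delay: float = 10.0,        # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every known OSD is up, or give up after `retries` attempts."""
    for attempt in range(retries):
        num_osds, num_up_osds = get_osd_counts()
        # "until" condition: at least one OSD exists and all of them are up.
        if num_osds > 0 and num_up_osds == num_osds:
            return True
        # Only wait if another attempt will follow.
        if attempt < retries - 1:
            sleep(delay)
    return False


# Simulated cluster where OSDs come up one after another (no real waiting).
samples = iter([(3, 1), (3, 2), (3, 3)])
print(wait_for_osds_up(lambda: next(samples), retries=5, sleep=lambda _: None))
```

Pool creation would only be attempted once such a check returns true, which is the ordering the commit relies on.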

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-06-01 15:46:52 +02:00
Guillaume Abrioux 0b67f42feb Makefile: followup on #2585
Fix a typo in the `tag` target: double quotes are missing here.

Without them, the `make tag` command fails like this:

```
if [[ "v3.0.35" ==  ]]; then \
            echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \
            exit 1; \
        fi
/bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator
/bin/sh: -c: line 0: syntax error near `;'
/bin/sh: -c: line 0: `if [[ "v3.0.35" ==  ]]; then     echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35";     exit 1; fi'
make: *** [tag] Error 2
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-06-01 12:50:03 +02:00
Guillaume Abrioux c68126d6fd mdss: do not make pg_num a mandatory params
When playing the ceph-mds role, mon nodes have already set a fact with
the default pg num for osd pools, so we can simply default to this value
for cephfs pools (`cephfs_pools` variable).

At the moment the variable definition for `cephfs_pools` looks like:

```
cephfs_pools:
  - { name: "{{ cephfs_data }}", pgs: "" }
  - { name: "{{ cephfs_metadata }}", pgs: "" }
```

and we have a task in `ceph-validate` to ensure `pgs` has been set to a
valid value.

We could simply avoid this check by setting the default value of `pgs`
to `hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num']` and
leaving users the possibility to override this value.
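
The fallback behaviour described here can be sketched in a few lines of Python. This is an illustration of the defaulting logic only, not the Jinja expression used in the role; `resolve_pool_pgs` and the sample pool names are hypothetical, and `default_pg_num` stands in for the fact read from the first monitor.

```python
from typing import Dict, List


def resolve_pool_pgs(pools: List[Dict[str, object]], default_pg_num: int) -> List[Dict[str, object]]:
    """Use each pool's own `pgs` when set, otherwise fall back to the default."""
    resolved = []
    for pool in pools:
        pgs = pool.get("pgs")
        if pgs is None or str(pgs).strip() == "":
            value = default_pg_num  # nothing set by the user: use the mon default
        else:
            value = int(pgs)        # explicit user override wins
        resolved.append({"name": pool["name"], "pgs": value})
    return resolved


pools = [
    {"name": "cephfs_data", "pgs": ""},        # hypothetical pool names
    {"name": "cephfs_metadata", "pgs": "16"},
]
print(resolve_pool_pgs(pools, default_pg_num=8))
```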

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1581164

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-30 16:20:34 +02:00
Guillaume Abrioux 34e646e767 osds: do not set docker_exec_cmd fact
In `ceph-osd` there is no need to set `docker_exec_cmd`, since the only
place where this fact is used is `openstack_config.yml`, which delegates
all docker commands to a monitor node. This means we need the
`docker_exec_cmd` fact that refers to the `ceph-mon-*` containers, and
this fact is already set earlier in `ceph-defaults`.

By the way, when collocating an OSD with a MON it fails because the container
`ceph-osd-{{ ansible_hostname }}` doesn't exist.

Removing this task makes it possible to collocate an OSD with a MON.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1584179

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-30 16:17:29 +02:00
Guillaume Abrioux 6f489015e4 tests: fix broken symlink
`requirements2.5.txt` is pointing to `tests/requirements2.4.txt` while
it should point to `requirements2.4.txt` since they are in the same
directory.
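
Why the old target was broken can be shown with plain path arithmetic: a relative symlink target is resolved against the directory that contains the link. The sketch below assumes both files live in a `tests/` directory (inferred from the old target, not stated explicitly), and `resolve_symlink` is a hypothetical helper.

```python
import posixpath


def resolve_symlink(link_path: str, target: str) -> str:
    """Return the path a symlink target resolves to, relative targets included."""
    if posixpath.isabs(target):
        return posixpath.normpath(target)
    # Relative targets are interpreted from the directory holding the link.
    return posixpath.normpath(posixpath.join(posixpath.dirname(link_path), target))


# Old target: resolves one directory too deep.
print(resolve_symlink("tests/requirements2.5.txt", "tests/requirements2.4.txt"))
# Fixed target: resolves to the sibling file.
print(resolve_symlink("tests/requirements2.5.txt", "requirements2.4.txt"))
```

The first call yields `tests/tests/requirements2.4.txt`, a path that does not exist, while the second yields `tests/requirements2.4.txt`.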

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-30 16:13:47 +02:00
Guillaume Abrioux 34f7042852 tests: resize root partition when atomic host
For some time we have been seeing failures in the CI for containerized
scenarios because VMs run out of space at some point.

The default in the images used is to have only 3 GB for the root
partition, which is not a lot.

Typical error seen:

```
STDERR:

failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device
```

Indeed, on the machine we can see:
```
Every 2.0s: df -h                                                                                                                                                                                                                                       Tue May 29 17:21:13 2018
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  3.0G  3.0G   14M 100% /
```

The idea here is to expand this partition with all the available space
remaining by issuing an `lvresize` followed by an `xfs_growfs`.

```
-bash-4.2# lvresize -l +100%FREE /dev/atomicos/root
  Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents).
  Logical volume atomicos/root successfully resized.
```

```
-bash-4.2# xfs_growfs /
meta-data=/dev/mapper/atomicos-root isize=512    agcount=4, agsize=192000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=768000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 768000 to 2543616
```

```
-bash-4.2# df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/atomicos-root  9.7G  1.4G  8.4G  14% /
```
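
The figures in the three outputs above are consistent with each other, which can be checked with a little arithmetic. The sketch assumes 4 MiB LVM extents (the LVM default, not shown in the output) and uses the 4096-byte block size reported by `xfs_growfs`.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
EXTENT_SIZE = 4 * MIB    # assumption: default LVM extent size
XFS_BLOCK_SIZE = 4096    # bsize=4096 in the xfs_growfs output


def extents_to_gib(extents: int) -> float:
    """Size of a logical volume in GiB for a given number of extents."""
    return extents * EXTENT_SIZE / GIB


def blocks_to_gib(blocks: int) -> float:
    """Size of an XFS data section in GiB for a given number of blocks."""
    return blocks * XFS_BLOCK_SIZE / GIB


print(extents_to_gib(750))      # before lvresize
print(extents_to_gib(2484))     # after lvresize
print(blocks_to_gib(768000))    # before xfs_growfs
print(blocks_to_gib(2543616))   # after xfs_growfs
```

Both the extent counts and the block counts come out at about 2.93 GiB before and about 9.70 GiB after, matching the sizes `lvresize` and `df -h` report.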

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-30 10:54:35 +02:00
Erwan Velu 1fca827724 CONTRIBUTING.md: Initial release
As per issue #2623, it is important to define the commit guidelines.
This commit is about adding a first version of it.

Fixes: #2653
Signed-off-by: Erwan Velu <erwan@redhat.com>
2018-05-30 09:38:27 +02:00
Guillaume Abrioux 98cb6ed8f6 tests: avoid yum failures
In the CI we often see failures like the following:

`Failure talking to yum: Cannot find a valid baseurl for repo:
base/7/x86_64`

It seems the fastest mirror detection is sometimes counterproductive and
leads yum to fail.

This fix has been added to `setup.yml`.
Until now this playbook was only used just before playing `testinfra`;
it can also be run before ceph-ansible so that we can add some
provisioning tasks.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Erwan Velu <evelu@redhat.com>
2018-05-28 22:04:35 +02:00
Ha Phan 144b2fcebc python-netaddr is required to generate ceph.conf
ceph-config: add netaddr to python requirements

netaddr is required to generate ceph.conf, let's add this requirement in `requirements.txt`

Signed-off-by: Ha Phan <thanhha.work@gmail.com>
2018-05-28 10:11:59 +02:00
Sébastien Han e91648a7af rolling_update: add role ceph-iscsi-gw
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575829
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-26 02:38:47 -07:00
Paul Cuzner 2890b57cfc Add privilege escalation to iscsi purge tasks
Without the escalation, invocation from non-root
users will fail when accessing the rados config
object, or when attempting to log to /var/log

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-05-25 03:50:24 -07:00
Guillaume Abrioux 608ea947a9 mds: move mds fs pools creation
When collocating mds on a monitor node, the cephfs pools creation will
fail because `docker_exec_cmd` is reset to `ceph-mds-monXX`, which is
incorrect because we need to delegate the task to `ceph-mon-monXX`.
In addition, it wouldn't have worked since the `ceph-mds-monXX` container
isn't started yet.

Moving the task earlier in the `ceph-mds` role will fix this issue.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-25 11:16:56 +02:00
Sébastien Han 1c084efb3c rgw: container add option to configure multi-site zone
You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your
inventory and assign them a value. Once the rgw container starts it'll
pick the info and add itself to the right zone.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-24 11:32:05 -07:00
Guillaume Abrioux 828848017c playbook: follow up on #2553
Since we fixed the `gather and delegate facts` task, this exception is
not needed anymore. It's a leftover that should be removed to save some
time when deploying a cluster with a large client number.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-24 10:09:01 -07:00
Sébastien Han 3c32280ca1 group_vars: resync group_vars
The previous commit changed the content of roles/$ROLE/default/main.yml
so we have to regenerate the group_vars files.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-24 09:39:38 -07:00
Guillaume Abrioux 3a0e168a76 mdss: move cephfs pools creation in ceph-mds
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.

The idea here is to move cephfs pools creation in `ceph-mds` role.

[1] e59258943b/src/mon/OSDMonitor.cc (L5673)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-24 09:39:38 -07:00
Guillaume Abrioux a10e73d78d tests: move cephfs_pools variable
Let's move this variable to group_vars/all.yml in all testing scenarios,
in accordance with commit 1f15a81c48, so
we keep consistency between the playbook and the tests.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-24 09:39:38 -07:00
Guillaume Abrioux 564a662baf osds: move openstack pools creation in ceph-osd
When deploying a large number of OSD nodes it can be an issue because the
protection check [1] won't pass since it tries to create pools before all
OSDs are active.

The idea here is to move openstack pools creation at the end of `ceph-osd` role.

[1] e59258943b/src/mon/OSDMonitor.cc (L5673)

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-24 09:39:38 -07:00
Guillaume Abrioux f8260119cd defaults: resync sample files with actual defaults
6644dba5e3 and
1f15a81c48 introduced some changes
in the default variables files, but it seems we forgot to
regenerate the sample files.
This commit aims to resync the content of `all.yml.sample`,
`mons.yml.sample` and `rhcs.yml.sample`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-24 09:39:38 -07:00
Luigi Toscano 43e96c1f98 ceph-radosgw: disable NSS PKI db when SSL is disabled
The NSS PKI database is needed only if radosgw_keystone_ssl
is explicitly set to true, otherwise the SSL integration is
not enabled.

It is worth noting that the PKI support was removed from Keystone
starting from the Ocata release, so some code paths should be
changed anyway.

Also, remove radosgw_keystone, which is not useful anymore.
This variable was used until fcba2c801a.
Now profiles drive the setting of rgw keystone *.

Signed-off-by: Luigi Toscano <ltoscano@redhat.com>
2018-05-23 23:24:09 -07:00
Sébastien Han bf9593bced rhcs: bump version to 3.0 for stable 3.1
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1519835
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-23 14:42:39 -07:00
Vishal Kanaujia ef5f52b1f3 Skip GPT header creation for lvm osd scenario
The LVM lvcreate fails if the disk already has a GPT header.
We currently create a GPT header regardless of the OSD scenario. The fix
is to skip header creation for the lvm scenario.

fixes: https://github.com/ceph/ceph-ansible/issues/2592

Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
2018-05-23 11:44:09 -07:00
Sébastien Han da5b104098 rolling_update: fix get fsid for containers
When running ansible2.4-update_docker_cluster there is an issue on the
"get current fsid" task. The current task only works for
non-containerized deployment but will run all the time (even for
containerized). This currently results in the following error:

TASK [get current fsid] ********************************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-luminous-ansible2.4-update_docker_cluster/rolling_update.yml:214
Tuesday 22 May 2018  22:48:32 +0000 (0:00:02.615)       0:11:01.035 ***********
fatal: [mgr0 -> mon0]: FAILED! => {
    "changed": true,
    "cmd": [
        "ceph",
        "--cluster",
        "test",
        "fsid"
    ],
    "delta": "0:05:00.260674",
    "end": "2018-05-22 22:53:34.555743",
    "rc": 1,
    "start": "2018-05-22 22:48:34.295069"
}

STDERR:

2018-05-22 22:48:34.495651 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000
2018-05-22 22:48:34.495684 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).fault

This is not really representative of the real error since the 'ceph' cli is available on that machine.
On other environments we will have something like "command not found: ceph".

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-23 04:44:12 +02:00
Subhachandra Chandra c7e269fcf5 Fix restarting OSDs twice during a rolling update.
During a rolling update, OSDs are currently restarted twice: once by the
handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks
in the rolling_update playbook. This change turns off restarts by the handler.
Further, the restart initiated by the rolling_update playbook is more
efficient, as it restarts all the OSDs on a host as one operation and waits
for them to rejoin the cluster. The restart task in the handler restarts one
OSD at a time and waits for it to join the cluster.
2018-05-22 19:23:07 +02:00
Alfredo Deza 4d1338b4bf validate: split schema for lvm osd scenario per objectstore
The bluestore lvm osd scenario does not require a journal entry. For
this reason we need separate schemas for bluestore and filestore;
otherwise notario will fail validation for the bluestore lvm scenario
because the journal key does not exist in lvm_volumes.
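
The idea of one schema per objectstore can be sketched in plain Python without notario. This is not the validation code from the commit: the `REQUIRED_KEYS` table and `validate_lvm_volumes` are hypothetical, and the `data` key is an assumption (only the `journal` key is named above).

```python
from typing import Dict, List

# Hypothetical per-objectstore requirements for each lvm_volumes entry.
REQUIRED_KEYS = {
    "filestore": {"data", "journal"},
    "bluestore": {"data"},  # no journal entry needed
}


def validate_lvm_volumes(lvm_volumes: List[Dict[str, str]], objectstore: str) -> List[str]:
    """Return a list of error messages; an empty list means the config is valid."""
    errors = []
    required = REQUIRED_KEYS[objectstore]
    for index, volume in enumerate(lvm_volumes):
        for key in sorted(required - set(volume)):
            errors.append(f"lvm_volumes[{index}]: missing required key '{key}'")
    return errors


volumes = [{"data": "data-lv1"}]
print(validate_lvm_volumes(volumes, "bluestore"))
print(validate_lvm_volumes(volumes, "filestore"))
```

With a single shared schema that always requires `journal`, the bluestore case would be rejected, which is the failure the commit describes.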

Signed-off-by: Alfredo Deza <adeza@redhat.com>
(cherry picked from commit d916246bfeb927779fa920bab2e0cc736128c8a7)
2018-05-22 17:57:28 +02:00
Andrew Schoen a9ad8eb5f3 ceph-validate: do not check ceph version on dev or rhcs installs
A dev or rhcs install does not require ceph_stable_release to be set and
instead generates that by looking at the installed ceph-version.
However, at this point in the playbook ceph may not have been installed
yet and ceph-common has not been run.

Fixes: https://github.com/ceph/ceph-ansible/issues/2618

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-21 23:11:04 +02:00
Guillaume Abrioux 9801bde4d4 purge_cluster: fix dmcrypt purge
dmcrypt devices aren't closed properly, so redeploying after a purge
may fail.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Properly closing dmcrypt devices allows redeploying without error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-21 08:23:10 +02:00
Andrew Schoen e7d02a50d8 ceph-validate: move system checks from ceph-common to ceph-validate
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 62c34e3c9d set the python-notario version to >= 0.0.13 in ceph-ansible.spec.in
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen c40ed1c66b site.yml: combine validate play with fact gathering play
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen fd7bb16e2f docs: explain the ceph-validate role and how it validates configuration
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen cf2868f0d1 validate: support validation of osd_auto_discovery
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 8b6097e565 validate: remove objectstore from osd options schema
objectstore is not a valid option, it's osd_objectstore and it's already
validated in install_options

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 645f61c351 ceph-defaults: remove backwards compat for containerized_deployment
The validation module does not get config options with the template
syntax rendered, so we're gonna remove that and just default it to
False. The backwards compat was scheduled to be removed in 3.1 anyway.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen c65ea7e9d7 site-docker: validate config before pulling container images
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 890e265fd3 validate: adds a CEPH_RELEASES constant
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen d30a99c350 validate: add support for containerized_deployment
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 5d64eb79c1 validate: show an error and stop the playbook when notario is missing
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 62d6f2d84a site-docker.yml: add config validation play
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen a80a109ac9 site.yml: the validation play must use become: true
The ceph-defaults role expects this.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 12bdb8ef87 docs: add instructions for installing ansible and notario
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen ef48ed4e5a adds a requirements.txt file for the project
With the addition of the validate module we need to ensure
that notario is installed. This will be done with the use
of this requirements.txt file and pip.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen dea1ea93d5 tests: use notario>=0.0.13 when testing
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen f84c2ba27b ceph-defaults: fix failing tasks when osd_scenario was not set correctly
When `devices` is not defined because you want to use the 'lvm'
osd_scenario but you've made a mistake selecting that scenario, these
tasks should not fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 91f65e2420 validate: improve error messages when config fails validation
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen d83bdce8a9 site.yml: abort playbook when it fails during config validation
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 1f15a81c48 ceph-defaults: move cephfs vars from the ceph-mon role
We're doing this so we can validate this in the ceph-validate role

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen ffe05872ac validate: only validate cephfs_pools on mon nodes
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 760a1afc21 validate: only validate osd config options on osd hosts
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00
Andrew Schoen 4325ccc857 validate: only check mon and rgw config if the node is in those groups
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-05-18 17:58:24 +02:00