4380 Commits (724620ed3dc76bc8dbddb50538ff64b82b3025a7)
 

Author SHA1 Message Date
Guillaume Abrioux 724620ed3d add-osd: fix fact gathering in add-osd
This commit makes this playbook gather facts from all nodes except
clients.
When collocating OSDs on other nodes, it can fail as follows:

```
fatal: [vm252-11]: FAILED! => {
    "msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"
}
```

In that case, a fact from a RGW node is called when rendering the
`ceph.conf.j2` but it fails because facts are gathered only from mon and
osd nodes.
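
As a rough illustration of the failure mode (plain Python standing in for the template lookup, with hypothetical host names and a hypothetical helper), a lookup of `ansible_hostname` breaks as soon as one referenced host has no gathered facts:

```python
from typing import Dict, List


def hostnames_for_conf(hostvars: Dict[str, Dict[str, str]], hosts: List[str]) -> List[str]:
    """Collect the hostname fact of every host referenced by the template."""
    return [hostvars[host]["ansible_hostname"] for host in hosts]


# Facts gathered from mon and osd nodes only: the rgw node has none.
partial = {
    "mon0": {"ansible_hostname": "mon0"},
    "osd0": {"ansible_hostname": "osd0"},
    "rgw0": {},
}
# Facts gathered from every non-client node.
complete = {**partial, "rgw0": {"ansible_hostname": "rgw0"}}
```

With `partial`, the lookup for `rgw0` raises a `KeyError`, which is the Python analogue of the "has no attribute" error above; with `complete`, all three hostnames are returned.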

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1806765

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-04-07 11:19:53 -04:00
Guillaume Abrioux 8ccf91c1f0 add-osd: unset noup flag after last osd is deployed
This commit fixes a bug when using the `add-osd.yml` playbook.
The `noup` flag is set early but was never unset before the "wait for pgs
clean" check, so the playbook always failed because OSDs were never
seen UP.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1816023

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-04-07 11:19:53 -04:00
Guillaume Abrioux a8f5e43624 ceph_key: fetch key when needed
Fetch the key when it is present in the cluster but not on the node.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ccfa249919)
2020-04-03 16:19:03 -04:00
Guillaume Abrioux 323d4f8f0b ceph_key: fix idempotency when no secret is passed
553584cbd0 introduced a regression when no
secret is passed, it overwrites the secret each time the task is run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 003defec03)
2020-04-03 16:19:03 -04:00
Guillaume Abrioux b107dcf80b ceph_key: remove 'update' state
With this change, the state `present` is enough to update a keyring.
If the keyring already exists, it will be updated if the caps or secret
passed to the module are different.
If the keyring doesn't exist, it will be created.
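
The decision described above can be sketched as follows. This is only an illustration, not the module's code: the function name, the argument names, and the shape of the `existing` dict are assumptions.

```python
from typing import Dict, Optional


def plan_keyring_action(
    existing: Optional[Dict[str, object]],
    caps: Dict[str, str],
    secret: str,
) -> str:
    """Return what state=present would do for a keyring (hypothetical helper)."""
    if existing is None:
        # The keyring doesn't exist yet: create it.
        return "create"
    if existing.get("caps") != caps or existing.get("secret") != secret:
        # The keyring exists but its caps or secret differ: update it.
        return "update"
    # Nothing differs, so the task reports no change.
    return "unchanged"
```

For example, `plan_keyring_action(None, {"mon": "allow r"}, "s3cr3t")` yields `"create"`, while passing an `existing` entry with identical caps and secret yields `"unchanged"`.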

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1808367

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 553584cbd0)
2020-04-03 16:19:03 -04:00
Dimitri Savineau edfeb98593 tests: add mgr nodes to shrink_mon inventory
Since 306ce82 we explicitly fail when there's no mgr node present in the
inventory.

fatal: [mon0]: FAILED! => {
    "changed": false
}

MSG:

Please add a mgr host to your inventory.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-04-02 22:02:35 +02:00
Guillaume Abrioux d4ffe21225 osd: support changing default rule even when osd_crush_location isn't defined
Creating crush rules even with no crush hierarchy configuration is a
valid scenario, so we shouldn't be bound to the first task result (which
configures the crush hierarchy) to be able to add new crush rules.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1816989

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5b0476385c)
2020-03-31 23:04:03 +02:00
Dimitri Savineau 586c6e8afe Add site-container.yml symlink
This adds a symlink to the site-docker.yml.sample playbook.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-03-31 23:00:49 +02:00
Guillaume Abrioux 3b1794a0fd switch_to_containers: exclude clients nodes from facts gathering
Just like site.yml and rolling_update, let's exclude client nodes from
the fact gathering.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 332c39376b)
(cherry picked from commit 5c3ba0787c)
2020-03-30 11:10:29 -04:00
Guillaume Abrioux cfe77bc51f main: exclude client nodes from facts gathering when delegate_facts_host
This commit excludes client nodes from facts gathering; their facts are not
needed, and skipping them speeds up this task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 865d2eac9b)
2020-03-30 11:10:29 -04:00
John Fulton 658d9cadfd The _filtered_clients list should intersect with ansible_play_batch
Client configuration with --limit fails without this patch
because certain tasks are only done to the first host in the
_filtered_clients list and it's likely that first host will
not be included in what's specified with --limit. To fix this
the _filtered_clients list should be built from all clients
in the inventory that are also in the running play.
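
A minimal sketch of the intersection described above, in plain Python rather than the playbook's own expression (the helper name and host names are hypothetical):

```python
from typing import List


def filtered_clients(inventory_clients: List[str], play_batch: List[str]) -> List[str]:
    """Keep only the clients that are part of the running play, in inventory order."""
    in_play = set(play_batch)
    return [host for host in inventory_clients if host in in_play]


# Simulates a run limited to client1 and client2.
clients = filtered_clients(
    ["client0", "client1", "client2"],
    ["client1", "client2"],
)
first_client = clients[0]
```

Here `first_client` is `client1`, a host that is actually in the play, so tasks restricted to the first client still run under `--limit`.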

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798781

Signed-off-by: John Fulton <fulton@redhat.com>
(cherry picked from commit e4bf4857f5)
2020-03-30 11:10:29 -04:00
Guillaume Abrioux 6006985466 defaults: remove legacy comment
This is no longer true, let's remove this comment given that this option
is not ignored in containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e551b5ba1a)
2020-03-26 12:08:31 -04:00
Guillaume Abrioux c60967f045 docker-common: remove legacy tasks for ntp configuration
Those tasks aren't needed in docker-common since the introduction of
`ceph-infra` role. They are duplicated tasks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1810376

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit cd0195c562)
2020-03-25 13:53:25 -04:00
Guillaume Abrioux a0f01db800 tests: add inventory host for 4.0 upgrade job
This inventory is intended to be used in the upgrade scenario in
stable-4.0 branch.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-03-04 23:18:43 +01:00
Guillaume Abrioux d2d241f21d tests: modify add-osd job
This commit modifies the way we test add-osd scenario given that the
playbook add-osd.yml is broken at the moment.

As a workaround we can use main playbook with `--limit` to achieve this
operation.

Note: This commit is intended to be reverted once we get a fix.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-03-03 11:08:22 +01:00
Dimitri Savineau 2d2cec99fc tests: pg num should be a power of two number
This patch changes the pg_num value of the rgw pools foo and bar to be
a power of two.
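
For reference, a power-of-two check can be written as a one-line bit test. This helper is illustrative only and is not part of the test suite:

```python
def is_power_of_two(value: int) -> bool:
    """True when value is a positive integer with exactly one bit set."""
    return value > 0 and (value & (value - 1)) == 0
```

Values such as 16 or 512 pass the check, while 12 does not.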

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-17 14:52:09 -05:00
Benoît Knecht 87034b1fb6 ceph-rgw: Fix customize pool size "when" condition
In 3c31b19ab3, I fixed the `customize pool
size` task by replacing `item.size` with `item.value.size`. However, I
missed the same issue in the `when` condition.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 3842aa1a30)
2020-02-17 11:53:58 -05:00
Benoît Knecht 874c94c59e ceph-rgw: Fix custom pool size setting
RadosGW pools can be created by setting

```yaml
rgw_create_pools:
  .rgw.root:
    pg_num: 512
    size: 2
```

for instance. However, doing so would create pools of size
`osd_pool_default_size` regardless of the `size` value. This was due to
the fact that the Ansible task used

```
{{ item.size | default(osd_pool_default_size) }}
```

as the pool size value, but `item.size` is always undefined; the
correct variable is `item.value.size`.
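
The difference can be reproduced outside Ansible with a small sketch. It assumes the loop yields `key`/`value` pairs, as implied by `item.value.size`; the default size of 3 is an illustrative value, not taken from the commit.

```python
rgw_create_pools = {".rgw.root": {"pg_num": 512, "size": 2}}
osd_pool_default_size = 3  # illustrative default, assumed for this sketch

# Looping over a dict this way yields one {"key": ..., "value": ...} item per pool.
items = [{"key": name, "value": spec} for name, spec in rgw_create_pools.items()]


def wrong_size(item: dict) -> int:
    # "size" is not a top-level key of the item, so the default always wins.
    return item.get("size", osd_pool_default_size)


def right_size(item: dict) -> int:
    # The pool settings live under "value".
    return item["value"].get("size", osd_pool_default_size)
```

For the `.rgw.root` item, `wrong_size` falls back to the default while `right_size` returns the configured size of 2.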

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 3c31b19ab3)
2020-02-17 11:53:58 -05:00
Dimitri Savineau db8902d444 ceph-{mon,osd}: move default crush variables
Since ed36a11 we moved the crush rules creation code from the ceph-mon to
the ceph-osd role.
To keep backward compatibility we kept the possibility to set the
crush variables on the mons side, but we didn't move the default values.
As a result, when crush_rule_config is set to true and the default
values for crush_rules are used, the crush rule creation task
fails.

"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute
'crush_rules'"

This patch moves the default crush variables from the ceph-mon to the ceph-osd
role and also uses those default values when nothing is defined on the
mons side.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1798864

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 1fc6b33714)
2020-02-17 16:23:33 +01:00
Dimitri Savineau 306ce82358 ceph-validate: fail if no mgr host is present
We already stop the upgrade playbook (rolling_update.yml) if there's
no mgr node present so we should also do the same for initial
deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1788644

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-11 13:27:10 -05:00
Dimitri Savineau 553fb1ed1e ceph-mon: use interactive session with aliases
When using ceph aliases with commands that require manual intervention
to stop (like using Ctrl+c), the command will keep running inside the
container.
For handling this, we should use the interactive session option (-it)
with the docker commands.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1797874

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-02-05 15:29:51 +01:00
Mike Christie c2a9397474 iscsi: Fix crashes during rolling update
During a rolling update we will run the ceph iscsigw tasks that start
the daemons then run the configure_iscsi.yml tasks which can create
iscsi objects like targets, disks, clients, etc. The problem is that
once the daemons are started they will accept configuration requests,
or may want to update the system themselves. Those operations can then
conflict with the configure_iscsi.yml tasks that set up objects, and we
can end up in crashes due to the kernel being in an unsupported state.

This could also happen during creation, but is less likely due to no
objects being set up yet, so there are no watchers or users accessing the
gws yet. The fix in this patch works for both update and initial setup.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1795806

Signed-off-by: Mike Christie <mchristi@redhat.com>
(cherry picked from commit 77f3b5d51b)
2020-02-03 15:15:53 +01:00
Guillaume Abrioux b7a21d94d3 tests: retry to fire up VMs on vagrant failure
Add a script to retry several times to fire up VMs to avoid vagrant
failures.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Andrew Schoen <aschoen@redhat.com>
(cherry picked from commit 1ecb3a9352)
2020-02-03 10:20:19 +01:00
Guillaume Abrioux d437593e85 config: fix external client scenario
When no monitor group is present in the inventory, this task fails.
This affects only non-containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit e7bc079405)
2020-02-03 10:20:19 +01:00
Guillaume Abrioux 523a93b0e1 tests: add external_clients scenario
This commit adds a new 'external ceph clients' scenario.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 641729357e)
2020-02-03 10:20:19 +01:00
Guillaume Abrioux b6744fd82a validate: allow running ceph-ansible 3.2 against ansible 2.7
This commit allows ceph-ansible 3.2 to be run against ansible 2.7

However, note that running stable-3.2 against ansible 2.7 doesn't get
any testing upstream, so this might break the playbook. Only ansible 2.6 is
officially supported.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1781635

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-31 10:07:48 -05:00
Guillaume Abrioux ce7503a3a6 tests: add 'all_in_one' scenario
Add new scenario 'all_in_one' in order to catch more collocated related
issues.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3e7dbb4b16)
2020-01-31 11:26:40 +01:00
Guillaume Abrioux cf748e729f update: remove legacy tasks
These tasks should have been removed with backport #4756

Note:
This should have been backported from master but it's not possible
because of too many changes between master and stable-3.2.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1740463

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-29 09:25:15 -05:00
Dimitri Savineau 13e0f7d341 ceph-defaults: remove rgw from ceph_conf_overrides
The [rgw] section in the ceph.conf file or via the ceph_conf_overrides
variable doesn't exist and has no effect.
To apply overrides to all radosgw instances we should use either the
[global] or [client] sections.
Overrides per radosgw instance should still use the
[client.rgw.{instance-name}] section.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1794552

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 2f07b85131)
2020-01-29 14:19:17 +01:00
Guillaume Abrioux 726b3f220b defaults: change monitor|radosgw_address default values
To avoid confusion, let's change the default value from `0.0.0.0` to
`x.x.x.x`.
Users might think setting `0.0.0.0` will make the daemon bind on all
interfaces.

Fixes: #4827

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit fc02fc98eb)
2020-01-14 17:22:35 +01:00
Dimitri Savineau 071b950325 tox: allow copy admin key for purge scenario
This is enabled in the group_vars/clients file but it's overridden in
extra vars by tox.
Let's do it like that for now.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-13 14:50:29 -05:00
Guillaume Abrioux 01095f1f4c tests: add coverage on purge playbook
This commit adds a playbook to be played before we run the purge playbook.
It first creates an rbd image, then maps an rbd device on client0 so the
purge playbook will try to unmap it.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit db77fbda15)
2020-01-13 14:50:29 -05:00
Guillaume Abrioux 5db0b239f6 purge: use sysfs to unmap rbd devices
In a containerized context, using the binary provided in atomic os won't
work because it's an old version provided by ceph-common based on
10.2.5.
Using a container could be an idea, but for large clusters with hundreds
of client nodes, that would require pulling the image on each of them
just to unmap the rbd devices.

Let's use the sysfs method in order to avoid any issue related to ceph
version that is shipped on the host.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1766064

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 3cfcc7a105)
2020-01-13 14:50:29 -05:00
Guillaume Abrioux bcd7fee18d update: only run post osd upgrade play on 1 mon
There is no need to run these tasks n times from each monitor.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit c878e99589)
2020-01-13 13:42:01 -05:00
Guillaume Abrioux 09f295e89c update: use flags noout and nodeep-scrub only
1. set noout and nodeep-scrub flags,
2. upgrade each OSD node, one by one, wait for active+clean pgs
3. after all osd nodes are upgraded, unset flags
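
The ordering of the three steps above can be sketched as a small driver. This is a hypothetical illustration, not the playbook: `upgrade` and `pgs_clean` are stand-in callables, and a real run would poll for clean PGs with retries instead of checking once.

```python
from typing import Callable, List

FLAGS = ("noout", "nodeep-scrub")


def upgrade_osd_nodes(
    nodes: List[str],
    upgrade: Callable[[str], None],
    pgs_clean: Callable[[], bool],
) -> List[str]:
    """Return the ordered list of actions performed during the OSD upgrade."""
    log: List[str] = []
    # 1. set the flags once, before touching any OSD node
    for flag in FLAGS:
        log.append(f"set {flag}")
    # 2. upgrade the nodes one by one, checking PG state after each
    for node in nodes:
        upgrade(node)
        log.append(f"upgrade {node}")
        if not pgs_clean():
            raise RuntimeError(f"pgs not active+clean after {node}")
        log.append("pgs active+clean")
    # 3. unset the flags only after every node has been upgraded
    for flag in FLAGS:
        log.append(f"unset {flag}")
    return log
```

Calling it with two nodes produces the flag operations exactly once each, around the per-node upgrade and check steps.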

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Co-authored-by: Rachana Patel <racpatel@redhat.com>
(cherry picked from commit 548db78b95)
2020-01-13 13:42:01 -05:00
Dimitri Savineau 7ce33f4865 ceph-defaults: exclude rbd devices from discovery
The RBD devices aren't excluded from the devices list in the LVM auto
discovery scenario.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1783908

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 6f0556f015)
2020-01-13 12:06:35 -05:00
Dimitri Savineau aea4257807 ceph-osd: wait for all osds once
cf8c6a3 moves the 'wait for all osds' task from openstack_config to the
main tasks list.
But the openstack_config code was executed only on the last OSD node.
We don't need to do this check on all OSD nodes, so we need to set
run_once to true on that task.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 5bd1cf40eb)
2020-01-13 16:54:01 +01:00
Dimitri Savineau 9a42fe580f ceph-osd: wait for all osd before crush rules
When creating crush rules with device class parameter we need to be sure
that all OSDs are up and running because the device class list is
populated with this information.
This is now enabled for all scenarios, not only openstack_config.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit cf8c6a3849)
2020-01-13 16:54:01 +01:00
Dimitri Savineau 8b2659bf6d rolling_update: create crush rule after osd play
When upgrading from jewel to luminous we can execute the crush rule tasks
only after the 'osd require-osd-release luminous' command has been run.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-13 16:54:01 +01:00
Dimitri Savineau af57597df6 ceph-osd: add device class to crush rules
This adds device class support to crush rules when using the class key
in the rule dict via the create-replicated sub command.
If the class key isn't specified then we use the create-simple sub
command for backward compatibility.
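
A sketch of the sub command selection, assuming a rule dict with `name`, `root` and `type` keys next to the optional `class` key (only `class` is named in this message; the other key names and the helper itself are hypothetical):

```python
from typing import Dict, List


def crush_rule_command(rule: Dict[str, str]) -> List[str]:
    """Build the ceph CLI arguments for creating one crush rule."""
    base = ["ceph", "osd", "crush", "rule"]
    if "class" in rule:
        # A device class was requested: use create-replicated.
        return base + [
            "create-replicated",
            rule["name"],
            rule["root"],
            rule["type"],
            rule["class"],
        ]
    # No class key: keep the previous create-simple behaviour.
    return base + ["create-simple", rule["name"], rule["root"], rule["type"]]
```

A rule with `class: ssd` results in a `create-replicated` call ending in `ssd`, while a rule without the key results in a `create-simple` call.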

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1636508

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ef2cb99f73)
2020-01-13 16:54:01 +01:00
Dimitri Savineau 0ac43d83f4 move crush rule creation from mon to osd role
If we want to create crush rules with the create-replicated sub command
and device class then we need to have the OSD created before the crush
rules otherwise the device classes won't exist.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit ed36a11eab)
2020-01-13 16:54:01 +01:00
Dimitri Savineau 255be99bc5 ceph-validate: add rbdmirror validation
When ceph_rbd_mirror_configure is set to true we need to ensure that
the required variables aren't empty.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1760553

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
(cherry picked from commit 4a065cebd7)
2020-01-13 16:53:32 +01:00
Dimitri Savineau 2436044369 switch_to_containers: set GUID on lockbox part
The ceph lockbox partition (part number 5) used with non lvm scenarios
and in non containerized deployments doesn't have a valid PARTUUID.
The value is set to 00000000-0000-0000-0000-000000000000 for each OSD
device.

$ blkid -t PARTLABEL="ceph lockbox" -o value -s PARTUUID
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000
00000000-0000-0000-0000-000000000000

When switching to containerized deployment we manually mount the lockbox
partition by using the PARTUUID.
Unfortunately, because most of the time we have multiple OSDs on the same
node, we can't have the right symlink in /dev/disk/by-partuuid because it
will point to only one partition.

/dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000 -> ../../sdb5

After the switch_to_containers playbook, only one OSD will restart
correctly and the others will try to access the wrong device, causing
errors like 'xxxx is still in use'.
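
The collision can be modelled with a dict keyed by PARTUUID. This is only a sketch: the device names are examples and the helper is hypothetical. Which partition ends up behind the symlink on a real system is not modelled here; the point is that only one can.

```python
from typing import Dict, List, Tuple

NULL_UUID = "00000000-0000-0000-0000-000000000000"


def by_partuuid_links(partitions: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each PARTUUID to a device, as a by-partuuid directory would."""
    links: Dict[str, str] = {}
    for device, partuuid in partitions:
        # A symlink name can exist only once, so identical PARTUUIDs collide.
        links[partuuid] = device
    return links


lockboxes = [("sdb5", NULL_UUID), ("sdc5", NULL_UUID), ("sdd5", NULL_UUID)]
links = by_partuuid_links(lockboxes)
```

Three lockbox partitions go in, but `links` holds a single entry, so two of the three OSDs would resolve the wrong device.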

When deploying with containers and dmcrypt OSDs we force a PARTUUID
value during the ceph-disk prepare task.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616159

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-13 16:52:55 +01:00
Dimitri Savineau 58ffae3117 ceph-mds: allow directory fragmentation
We need to explicitly enable the allow_dirfrags flag on cephfs pool
after upgrading to Luminous.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1776233

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-13 16:52:11 +01:00
Guillaume Abrioux 881056fa9d facts: avoid duplicated element in devices list
When using `osd_auto_discovery`, `devices` is built multiple times due
to multiple runs of `ceph-facts` role. It ends up with duplicate
instances of the same device in the list.

Using `unique` filter when building the list fixes this issue.
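
In plain Python terms, the effect is an order-preserving de-duplication. The device paths and helper name below are hypothetical:

```python
from typing import List


def unique_devices(devices: List[str]) -> List[str]:
    """Drop repeated entries while keeping the first occurrence order."""
    return list(dict.fromkeys(devices))


# Simulates the list after the role has run twice on the same host.
devices = unique_devices(["/dev/sdb", "/dev/sdc", "/dev/sdb", "/dev/sdc"])
```

After the call, `devices` contains each path once, however many times the list was rebuilt.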

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 23b1f43897)
2020-01-13 15:47:02 +01:00
Guillaume Abrioux 195c49eaa9 tests: add shrink-osd-legacy testing
This commit reintroduces testing against ceph-disk deployed osds.

In stable-3.2 which is the most common version used at customers
(downstream pov), a bunch of OSDs are still deployed using ceph-disk.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2020-01-09 09:24:22 +01:00
Guillaume Abrioux ca728dcd70 shrink-osd: support fqdn in inventory
When using fqdn in inventory, that playbook fails because some tasks
use the result of ceph osd tree (which returns the shortname) to get
some data in hostvars[].
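
One way to picture the mismatch, without claiming this is how the playbook resolves it, is a lookup that accepts either the short name or an FQDN whose first label matches. The helper and host names are hypothetical:

```python
from typing import Dict, Optional


def find_inventory_host(short_name: str, hostvars: Dict[str, dict]) -> Optional[str]:
    """Return the inventory name matching a short hostname, if any."""
    for inventory_name in hostvars:
        # Match an exact name, or an FQDN whose first label is the short name.
        if inventory_name == short_name or inventory_name.split(".", 1)[0] == short_name:
            return inventory_name
    return None


hostvars = {"osd0.example.com": {}, "osd1.example.com": {}}
```

A direct `hostvars["osd0"]` lookup fails with this inventory, whereas `find_inventory_host("osd0", hostvars)` returns `osd0.example.com`.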

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1779021

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 6d9ca6b05b)
2020-01-09 09:24:22 +01:00
Dimitri Savineau 193ce4f572 ceph-iscsi: add ceph-iscsi stable repositories
This commit adds support for the ceph-iscsi stable repository when
using ceph_repository community instead of always using the devel
repositories.
We're still using the devel repositories for rtslib and tcmu-runner in
both cases (dev and community).

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2020-01-08 17:47:52 +01:00
Guillaume Abrioux d606ad0bac ansible.cfg: do not enforce PreferredAuthentications
There's no need to enforce PreferredAuthentications by default.
Users can still choose to override the ansible.cfg with any additional
parameter like this one to fit their infrastructure.

Fixes: #4826

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit d682412e2a)
2019-12-11 08:56:05 -05:00
Dimitri Savineau 56a7537f48 ceph-osd: update systemd unit script
The systemd unit script wasn't updated with the new container name
format (without the hostname).
We now have the same start/stop docker commands for all scenarios.
During the device to id OSD migration we need to be sure that the
old containers with the hostname are stopped.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1780688

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-12-10 23:59:13 +01:00