Commit Graph

776 Commits (9675e146ee4f4b6162adf6d5d5b995dafcbeafde)

Author SHA1 Message Date
Andrew Schoen e2529dcd7f rolling_update: ceph commands should use --cluster
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
Andrew Schoen 67453853ff rolling_update: set num_osds to the number of running osds
We do this so that the ceph-config role can most accurately
report the number of osds for the generation of the ceph.conf
file.

We don't want to use ceph-volume to determine the number of
osds because in an upgrade to nautilus ceph-volume won't be able to
accurately count osds created by ceph-disk.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
Andrew Schoen 28c47e4d1b rolling_update: migrate ceph-disk osds to ceph-volume
When upgrading to nautlius run ``ceph-volume simple scan`` and
``ceph-volume simple activate --all`` to migrate any running
ceph-disk osds to ceph-volume.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1656460

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2019-04-18 10:55:11 +02:00
Rishabh Dave d5967af7fb allow adding a monitor to a deployed cluster
Add a playbook that deploys a new monitor on a new node, adds that node
to the Ceph cluster and the monitor to the quorum and updates the ceph
configuration file on OSD nodes.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2019-04-15 10:00:50 +02:00
Dimitri Savineau eb658b3af6 purge-cluster: remove python-ceph-argparse package
When using purge-cluster playbook with nautilus, there's still the
python-ceph-argparse package installed on the host preventing to
reinstall a ceph cluster with a different version (like luminous or
mimic)

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-04-15 09:15:08 +02:00
Dimitri Savineau 150acba8c5 switch-from-non-containerized: stop all osds
e6bfb84 introduced a regression in the switch from non containerized
to container deployment.
We need to stop all previous OSDs services. We just don't need the
ceph-disk pattern in the regex.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-04-11 16:26:53 -04:00
Guillaume Abrioux a1254d767c purge: remove references to ceph-disk
as of stable-4.0, ceph-disk is no longer supported.
These tasks aren't needed anymore.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 11:57:02 -04:00
Guillaume Abrioux 73aa788459 shrink-osd: remove legacy playbook
as of stable-4.0, ceph-disk is no longer supported.
Let's remove this legacy version of the playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 11:57:02 -04:00
Guillaume Abrioux e6bfb843f4 switch_to_containers: remove ceph-disk references
as of stable-4.0, ceph-disk is no longer supported.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 11:57:02 -04:00
Guillaume Abrioux 4d35e9eeed osd: remove variable osd_scenario
As of stable-4.0, the only valid scenario is `lvm`.
Thus, this makes this variable useless.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 11:57:02 -04:00
Guillaume Abrioux c1e4529b0e update: fix undefined error when no mgr group is declared
if mgr group isn't defined in inventory, that task will fail with
undefined error.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-11 09:22:35 +02:00
Dimitri Savineau 57b4e76d11 rolling_update: Remove ceph aliases
ceph aliases have been introduced in stable-3.2 during the ceph
deployment. On master this has been removed but we don't handle
this removal in the upgrade from stable-3.2 to master via the
rolling_update playbook.
Also remove the task from purge-docker-cluster missing from
d9e7835

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-04-09 16:50:10 +02:00
Rishabh Dave 192fea0fec add-osds: don't hardcode group names
Instead of hardcoding group names, import ceph-defaults earlier. Also,
rectify a minor mistake in vagrant_varaibles.yml for containerized
version of add_osds.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2019-04-08 10:09:00 +02:00
Guillaume Abrioux 0180738313 purge: fix lvm-batch purge osd
`lvm_volumes` and/or `devices` variable(s) can be undefined depending on
the scenario chosen.

These tasks should be run only if these variable are defined, otherwise
it ends up with undefined variable errors.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-04-02 14:56:24 +02:00
Dimitri Savineau 7cc626b72d purge-docker-cluster: Remove ceph-osd service
The systemd ceph-osd@.service file used for starting the ceph osd
containers is used in all osd_scenarios.
Currently purging a containerized deployment using the lvm scenario
didn't remove the ceph-osd systemd service.
If the next deployment is a non-containerized deployment, the OSDs
won't be online because the file is still present and override the
one from the package.

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-03-28 18:26:38 +00:00
Guillaume Abrioux f55e2b08be remove all NBSPs on master branch
Similar to #3658

Since there's too many changes between master and stable branches let's
commit directly in each branches instead of trying to backport this
commit.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-28 11:57:55 +00:00
Dimitri Savineau c8442f3705 rolling_update: Update systemd unit regex for nvme
The systemd unit regex doesn't handle nvme devices (/dev/nvmeXn1).

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1687828

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-03-26 12:01:00 +00:00
Guillaume Abrioux 78aac3e96a update: followup on edfdc49
all rgw instances should be stopped according to the multiple rgw
instances support added in rolling_update.yml

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f6e0185146 update: add containerized deployment upgrade support (L->N)
Add a couple of fixes to allow containerized deployments upgrade support
to upgrade from luminous/mimic to nautilus.

- pass CEPH_CONTAINER_IMAGE and CEPH_CONTAINER_BINARY environment
variable to the ceph_key module,
- fix the docker exec command in 'waiting for the containerized monitor
to join the quorum' task according to the `delegate_to` parameter,
- override `docker_exec_cmd` in `ceph-facts` with `mon_host` when
rolling_update is `True`,
- do not run unnecessarily `create_mds_filesystems.yml` when performing an
upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 1816b876ee update: add missing hosts in facts gathering
iscsigws were missing.
The 'complete upgrade' couldn't complete because rolling_update was set
to False for iscsigw nodes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 45ba90c169 update: remove rbdmirror legacy task
This task is no longer needed for next release.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 0ea0adf039 update: show all daemons version at the end
Let's display all daemons version at the end of the playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f31d6d9485 update: enable new nautilus-only functionality
once the cluster is upgraded to nautilus, we can complete the process by
disallowing pre-nautilus OSDs and enabling all new nautilus-only functionality

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux afdaa70a63 update: enable msgr2 protocol
This commit enable the msgr2 protocol when the cluster is fully upgraded
to nautilus

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux ef096dd021 update: ensure mgrs are upgraded after ALL monitors
As of 1c760904b0, ceph-ansible implicitly
bootstrap managers on monitors.
mgrs must be upgraded only after all monitors, therefore, this commit
refact the way mgrs are upgraded to be sure we don't upgrade a mgr
during the monitors upgrade.

This commit also ensure we handle the case were we split managers on
dedicated nodes.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 7fa2434f0f update: ensure /var/lib/ceph/bootstrap-rbd-mirror is present
This directory is created by ceph-config node by node.
In the upgrade context we need it to be created on ALL monitors as soon
as the first iteration because of the task right after which creates and sends
the keyrings on all monitors.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 82764afe8d update: mask systemd service units during upgrade
This prevents the packaging from restarting services before we do need
to restart them in the rolling update sequence.
We want to handle services restart at rolling_update playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 8add55451c update: set osd flags only once
There is no need to set osd flags (noout, norebalance) each time we
upgrade a mon.

This commit moves up those tasks (before stopping the mon) so we don't need
to delegate them.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux f7c6f4e0b6 update: fix tasks waiting for the node to join the quorum
We actually want to ensure the node being upgraded is joining the quorum
instead of the monitor picked up earlier.

Indeed, the `mon_host`is used only in `delegate_to:` so we can still run ceph
commands while the monitor being upgraded is stopped.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Guillaume Abrioux 32569b79e2 update: remove an old parameter in ceph_key module call
the `containerized` parameter in ceph_key module doesn't exist anymore.
This was making the module failing but was hidden because of the
`ignore_errors: True`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-25 16:02:56 -04:00
Dimitri Savineau b23c05ae52 add-osd.yml: Add become flag for ceph-validate
The check_devices task fails if the ceph-validate role isn't executed
as a privileged user (Permission denied).

failed: [osd0] (item=/dev/sdb) => {"changed": false, "err": "Error:
Error opening /dev/sdb: Permission denied\n", "item": "/dev/sdb",
"msg": "Error while getting device information with parted script:
'/sbin/parted -s -m /dev/sdb -- unit 'MiB' print'", "out": "", "rc": 1}

Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
2019-03-09 05:54:46 +00:00
Guillaume Abrioux a440878533 add-osd: gather facts in second part of playbook
otherwise, it will end up with error like following:

```
FAILED! => {"msg": "'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_hostname'"}
```

because facts won't have been gathered.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1670663

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-04 14:44:27 +01:00
Guillaume Abrioux 47ebef374f purge: fix rbd-mirror group name
the default is rbdmirrors in ceph-defaults

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-01 20:31:14 +00:00
Guillaume Abrioux a915308477 purge: fix rbd mirror purge
as of b70d54ac80 the service launched isn't
ceph-rbd-mirror@admin.service.

it's now `ceph-rbd-mirror@rbd-mirror.{{ ansible_hostname }}`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-01 20:31:14 +00:00
Guillaume Abrioux 3849f30f58 purge: do not remove /var/lib/apt/lists/*
removing the content of this directory seems a bit agressive and cause a
redeployment to fail after a purge on debian based distrubition.

Typical error:
```
fatal: [mon0]: FAILED! => changed=false
  attempts: 3
  msg: No package matching 'ceph' is available
```

The following task will consider the cache is still valid, so apt
doesn't refresh it:
```
- name: update apt cache if cache_valid_time has expired
  apt:
    update_cache: yes
    cache_valid_time: 3600
  register: result
  until: result is succeeded
```

since the task installing ceph packages has a `update_cache: no` it
fails:

```
- name: install ceph for debian
  apt:
    name: "{{ debian_ceph_pkgs | unique }}"
    update_cache: no
    state: "{{ (upgrade_ceph_packages|bool) | ternary('latest','present') }}"
    default_release: "{{ ceph_stable_release_uca | default('') }}{{ ansible_distribution_release ~ '-backports' if ceph_origin == 'distro' and ceph_use_distro_backports else '' }}"
  register: result
  until: result is succeeded
```

/tmp/* isn't specific to ceph as well, so we shouldn't remove everything
in this directory.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-01 20:31:14 +00:00
Guillaume Abrioux 89f77589fa purge: fix purge of lvm devices
using `shell` module seems to be the only way to make this task working
on rhel based distribution AND debian based distributions.

on ubuntu, using `command` ansible module fails like following
(not due to `sudo` usage or not):
```
ok: [osd1] => changed=false
  cmd: command -v ceph-volume
  failed_when_result: false
  msg: '[Errno 2] No such file or directory: ''command'': ''command'''
  rc: 2
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1653307

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-03-01 20:31:14 +00:00
Guillaume Abrioux 69310a5cd6 switch_to_containers: support multiple rgw instances per host
add multiple rgw instances per host in switch_to_containers playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 70f1eea9b2 switch_to_containers: remove non-containerized systemd unit files
remove old systemd unit files (non-containerized) during the
switch_to_containers transition.

We have seen sometimes the unit started is the old one instead of the
new systemd unit generated.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 4064035a54 switch_to_containers: use ceph binary from container
use the ceph binary from the container instead of the host.
If the ceph CLI version isn't compatible between host and container
image, it can cause the CLI to hang.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
Guillaume Abrioux 7e0a70f7a8 switch_to_containers: do not try to redeploy monitors
`ceph-mon` tries to redeploy monitors because it assumes it was not yet
deployed since `mon_socket_stat` and `ceph_mon_container_stat` are
undefined (indeed, we stop the daemon before calling `ceph-mon` in the
switch_to_containers playbook).

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-02-13 09:42:27 +01:00
John Fulton 37b5d1084a Make python print statements python3 compatible
The restart_osd_daemon.sh generated from the j2 template
contains a python call which uses 'print x' instead of
'print(x)'. Add the missing parentheses to make this call
compatible with both 2 and 3.

Also add parentheses to other python print calls found
in roles/ceph-client/defaults/main.yml and
infrastructure-playbooks/cluster-os-migration.yml.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1671721
Signed-off-by: John Fulton <fulton@redhat.com>
2019-02-01 15:23:27 +00:00
Noah Watkins 9a43674d2e shrink_osd: use cv zap by fsid to remove parts/lvs
Fixes:
  https://bugzilla.redhat.com/show_bug.cgi?id=1569413
  https://bugzilla.redhat.com/show_bug.cgi?id=1572933

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
2019-01-24 16:34:13 +01:00
Guillaume Abrioux edfdc49488 rolling_update: support multiple rgw instance
1ac94c048f introduced the support of
multiple rgw instances on a single host but somehow has missed to
implement this feature in rolling_update.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-01-22 13:45:38 +01:00
Giulio Fidente ff8dbe114c Preserve rolling_update backward compatibility with ansible < 2.5
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
2019-01-21 14:05:45 +01:00
guihecheng 1ac94c048f rgw: add support for multiple rgw instances on a single host
With this, we could have multiple rgw instances on a single host
with a single run, don't have to use rgw-standalone.yml which does not
seems able to bind ports separately.
If you want to have multiple rgw instances, just change 'radosgw_instances'
to the number you want, which defaults to 1.
Not compatible with Multi-Site yet.

Signed-off-by: guihecheng <guihecheng@cmiot.chinamobile.com>
2019-01-18 11:12:28 +01:00
Guillaume Abrioux 268f2cef82 update: do not enforce `serial: 1` on client nodes
There is no need to enforce `serial: 1` on client nodes.
Let's make it parameterizable by introducing a new *extra* variable
`client_update_batch`, if not filled this will default to `{{
ansible_forks }}`.

NOTE: this is only usable as an extra variable passed with
`-e client_update_batch=<num>`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650184

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2019-01-02 16:55:08 +00:00
Daniel-Pivonka ba149972be Example ceph_add_users_buckets playbook
This is example playbook will show how to bulk add rgw users and buckets

Signed-off-by: Daniel-Pivonka <dpivonka@redhat.com>
2018-12-20 14:23:25 +01:00
Guillaume Abrioux d7e77012ef retry on packages and repositories failures
add register/until on all packaging related tasks to avoid non valid CI
failure.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-19 14:48:27 +00:00
Noah Watkins 110049e825 playbook: report storage device inventory
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-12-18 10:51:31 +01:00
Andrew Schoen ffd56177e7 purge-cluster: skip tasks that use ceph-volume if it's not installed
This will allow the playbook to be idempotent.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1656935

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-12-13 11:27:27 +01:00
Guillaume Abrioux a12de3e048 purge-container: move facts gathering after ceph-defaults role import
This task has to be called after the role `ceph-defaults` has been
played, otherwise, `mon_group_name` will never be known.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-12 16:50:24 +00:00
Guillaume Abrioux d0b3cb7f85 purge-container: fix wrong syntax
we want a default value for `mon_group_name`, not for
`groups[mon_group_name]`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-12 11:33:57 +01:00
Guillaume Abrioux 0eb56e36f8 introduce new role ceph-facts
sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speedup
the playbook.

Closes: #3282

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-12 11:18:01 +01:00
Guillaume Abrioux ae7f3d66a6 purge-docker: do not call ceph-osd role
calling ceph-osd role in purge playbook is not needed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-11 09:59:25 +01:00
Guillaume Abrioux 1a4a6ec855 purge: gather monitors facts in OSD purge
the OSD part of the purge delegates commands on monitor node, we need to
gather monitors facts to know the `ansible_hostname` fact that is used
in the `docker_exec_cmd` fact.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han 62111ff53c purge-container: gather fact before calling ceph-defaults
ceph-defaults relies on facts so we must gather facts before running it.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han fc6ebd8ebb purge-cluster: add support for mon/mgr collocation
Recently we introduced the default collocation of mon/mgr without the
need of a dedicated mgrs section. This means we have to stop the mgr
process on that machine too.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han 3a154fa0ad purge-cluster: remove support for other init system
We only support systemd and use the service module anyway.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han 325a159415 purge-docker-cluster: add support for mgr/mon collocation
Recently we introduced the collocation of mon and mgr by default, so we
don't need to have an explicit mgrs section for this. This means we have
to remove the mgr container on the mon machines too.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han 2bcc00896f purge-docker-cluste: add a task to check hosts
It's useful when running on CI to see what might remain on the machines.
So we list all the containers and images. We expect the list to be
empty.

We fail if we see containers running.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Sébastien Han 1751885bc9 purge-docker-cluster: add ceph-volume support
This commits adds the support for purging cluster that were deployed
with ceph-volume. It also separates nicely with a block intruction the
work to do when lvm is used or not.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-11 09:59:25 +01:00
Rishabh Dave 2fb12ae554 use pre_tasks and post_tasks when necessary
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-12-05 08:17:10 +00:00
Rishabh Dave e4f0af2b78 don't use private option for import_role
Since sharing variables amongst roles has been made default since
Ansible 2.6, private option has been deprecated; so stop using it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-12-04 23:45:59 +00:00
Ramana Raja cb784c601d rolling_update: fail if less than 3 MONs
... for non-containerized deployments as well.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1655470

Signed-off-by: Ramana Raja <rraja@redhat.com>
2018-12-04 14:28:49 +00:00
Sébastien Han 896676ee80 fix json data type
Json is a type structure which is always typed as a string, where before
this we were declaring a dict, which is not a json valid structure.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-04 12:34:54 +01:00
Guillaume Abrioux 78116fa6db purge: add iscsi support
add iscsi support for both non containerized and containerized
deployment in purge playbooks.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1651054

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-12-03 17:35:21 +01:00
Sébastien Han 1c760904b0 site: collocated mon and mgr by default
This will speed up the deployment and also deploy mon and mgr collocated
just as recommended.
This won't prevent you of adding more and dedicaded machines for mgr if
needed.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-03 14:39:43 +01:00
Sébastien Han bb7bfca113 rolling-update: remove old condition
This failure condition was only valid at the time where clusters didn't
have ceph-mgr activated. Now since we collocate the ceph-mgr with the
mon by default, if the daemon wasn't present it will be created during
the upgrade.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-12-03 14:39:43 +01:00
Guillaume Abrioux a952122c38 rolling_update: create missing keyring only on running mon
try to create the potentially missing keys only on monitors that are
actually running.
The current node being played is stopped before this task.
By the way, delegating the command on all nodes but the current node
being played ensures that the generated keys will be present on all
monitors.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-29 16:40:46 +00:00
Sébastien Han 61fb6972ec rolling_update: default ceph json output to empty dict
So we can avoid the following failure:

The conditional check 'hostvars[mon_host]['ansible_hostname'] in (ceph_health_raw.stdout | from_json)["quorum_names"] or hostvars[mon_host]['ansible_fqdn'] in (ceph_health_raw.stdout | from_json)["quorum_names"]
' failed. The error was: No JSON object could be decoded

We just need to set a default, the next iteration will have a more
complete json since the command won't fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-29 10:46:15 +00:00
Guillaume Abrioux 73287f91bc mgr: fix mgr keyring error on rolling_update
when upgrading from RHCS 2.5 to 3.2, it fails because the task `create
ceph mgr keyring(s) when mon is containerized` has a when condition
`inventory_hostname == groups[mon_group_name]|last`.
First, this is incorrect because `inventory_hostname` is referring to a
mgr node, it means this condition would have never been satisfied.
Then, this condition + `serial: 1` makes the mgr keyring creating skipped on
the first node. Further, the `ceph-mgr` role tries to copy the mgr
keyring (it's not aware we are running `serial: 1`) this leads to a
failure like the following:

```
TASK [ceph-mgr : copy ceph keyring(s) if needed] ***************************************************************************************************************************************************************************************************************************************************************************
task path: /usr/share/ceph-ansible/roles/ceph-mgr/tasks/common.yml:10
Tuesday 27 November 2018  12:03:34 +0000 (0:00:00.296)       0:11:01.290 ******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: AnsibleFileNotFound: Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'
failed: [magna021] (item={u'dest': u'/var/lib/ceph/mgr/local-magna021/keyring', u'name': u'/etc/ceph/local.mgr.magna021.keyring', u'copy_key': True}) => {"changed": false, "item": {"copy_key": true, "dest": "/var/lib/ceph/mgr/local-magna021/keyring", "name": "/etc/ceph/local.mgr.magna021.keyring"}, "msg": "Could not find or access '~/ceph-ansible-keys/48d78ac1-e0d6-4e35-ab3e-772aea7828fc//etc/ceph/local.mgr.magna021.keyring'"}
```

The ceph_key module is idempotent, so there is no need to have such a
condition.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1649957

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-27 18:19:56 +01:00
Sébastien Han e5d5dffeb5 shrink-osd: add missing CEPH_BINARY
We need to add the right binary to do the docker exec.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han 4f57e44f9c defaults: declare container_binary
Always declare container_binary and assign it a correct value.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han 49e0e19056 rolling_update: update ceph_key task for container
Use the new way to create keys on containerized env as introduced by: 1098b71bda90db3dad19ac179f0ba900ccb0f953

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Sébastien Han 2814d36c93 infra playbooks: use the right container binary
Use podman or docker wether they are available or not. podman will be
prioritized over docker if present.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-27 16:47:40 +00:00
Guillaume Abrioux 7c99b6df6d update: fix a typo
`hostvars[groups[mon_host]]['ansible_hostname']` seems to be a typo.
That should be `hostvars[mon_host]['ansible_hostname']`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-26 18:22:20 +01:00
Guillaume Abrioux af78173584 rolling_update: refact set_fact `mon_host`
each monitor node should select another monitor which isn't itself.
Otherwise, one node in the monitor group won't set this fact and causes
failure.

Typical error:
```
TASK [create potentially missing keys (rbd and rbd-mirror) when mon is containerized] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-update_docker_cluster/rolling_update.yml:200
Thursday 22 November 2018  14:02:30 +0000 (0:00:07.493)       0:02:50.005 *****
fatal: [mon1]: FAILED! => {}

MSG:

The task includes an option with an undefined variable. The error was: 'dict object' has no attribute u'mon2'
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-26 18:22:20 +01:00
Sébastien Han 4e267bee4f rolling_update: create rbd and rbd-mirror keyrings
During an upgrade ceph won't create keys that were not existing on the
previous version. So after the upgrade of let's Jewel to Luminous, once
all the monitors have the new version they should get or create the
keys. It's ok to have the task fails, especially for the rbd-mirror
key, which only appears in Nautilus.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1650572
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-26 18:22:20 +01:00
Sébastien Han c14f9b78ff switch: do not look for devices anymore
It's easier lookup a directoriy instead of the block devices,
especially because of ceph-volume and ceph-disk have a different way to
handle devices.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Sébastien Han cd56dad9fa switch: disable all ceph units
Prior to this commit we were only disabling ceph-osd units, but forgot
the ceph.target which is controlling everything and will restart the
ceph-osd units at each reboot.
Now that everything gets disabled there won't be any conflicts between
the old non-container and the new container units.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Sébastien Han fe1d09925a switch: do not mask systemd unit
If we mask it we won't be able to start the OSD container since now the
osd container use the osd ID as a name such as: ceph-osd@0

Fixes the error:  Failed to execute operation: Cannot send after transport endpoint shutdown

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-23 07:56:23 +00:00
Guillaume Abrioux c783bc70da docker-common: rename role
rename `ceph-docker-common` role to `ceph-container-common`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-11-12 10:51:48 +01:00
Rishabh Dave 90f222f6a5 add quotes around package names added in da6f384
Add quotes around package names added in the commit
da6f384223 so that the difference between
the Ansible variables and package names is clear.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-11-09 12:59:08 +00:00
Rishabh Dave d72340abbe pass the list of packages to package management modules
Instead of looping over a list of packages or repeating the task
separately for different packages, pass the list of packages to the
task performing package management.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-11-09 12:59:08 +00:00
Sébastien Han 53910de43b ceph_key: add fetch_initial_keys capability
This is needed for Nautilus since the ceph-create-keys script goes away.
(https://github.com/ceph/ceph/pull/21305)
Now the module if called with 'state: fetch_initial_keys' will lookup
keys generated by the monitor and write them down on the filesystem to
the right location (/etc/ceph and /var/lib/ceph/boostrap*).
This is not applicable to container since keys are generated by the
container only.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-09 12:45:52 +01:00
Mike Christie b523a44a1a igw: stop tcmu-runner on iscsi purge
When the iscsi purge playbook is run we stop the gw and api daemons but
not tcmu-runner which I forgot on the previous PR.

Fixes Red Hat BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1621255

Signed-off-by: Mike Christie <mchristi@redhat.com>
2018-11-09 10:02:16 +01:00
Noah Watkins b848d2be4c don't use "role" or "roles" to include roles
see 3f62fc585f

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-11-08 17:45:37 +01:00
Noah Watkins 9c47950961 Fix comments in shrink-osd-ceph-disk playbook
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-11-08 17:45:37 +01:00
Noah Watkins f5dacbf7de Add a ceph-volume aware shrink-osd playbook
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-11-08 17:45:37 +01:00
Noah Watkins 0782cfc546 Rename ceph-disk version of shrink-osd playbook
This will be replaced by a ceph-volume aware verison.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-11-08 17:45:37 +01:00
Sébastien Han b82995df58 Revert "ceph_key: add fetch_initial_keys capability"
This reverts commit 17883e09ba.
2018-11-08 13:34:47 +00:00
Sébastien Han 17883e09ba ceph_key: add fetch_initial_keys capability
This is needed for Nautilus since the ceph-create-keys script goes away.
(https://github.com/ceph/ceph/pull/21305)
Now the module if called with 'state: fetch_initial_keys' will lookup
keys generated by the monitor and write them down on the filesystem to
the right location (/etc/ceph and /var/lib/ceph/boostrap*).
This is not applicable to container since keys are generated by the
container only.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-11-08 13:32:18 +00:00
Rishabh Dave da6f384223 don't loop over a task using package management modules
For tasks using (Ansible) modules for package management utilities,
pass the list of packages to be installed instead of repeating the task
for each package. Using the latter manner of installing a list of
packages leads to a deprecation warning by ansible-playbook command.

Fixes: https://github.com/ceph/ceph-ansible/issues/3293
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-11-08 08:38:10 +00:00
Rishabh Dave 640cad3fd8 remove configuration files for ceph packages on ubuntu clusters
For apt-get, purge command needs to be used, instead of remove command,
to remove related configuration files. Otherwise, packages might be
shown as installed while running dpkg command even after removing them.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1640061
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-11-07 15:52:53 +01:00
Guillaume Abrioux f7d4651186 playbook: remove jinja syntax in when statement
this syntax in deprecated

Closes: #3281

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-31 13:45:41 +01:00
Rishabh Dave 3f62fc585f don't use "role" or "roles" to include roles
Since import_role and include_role are more readable, explicit (about
the nature of inclusion) and flexible (allows placibf inclusion
anywhere) amongst the tasks, use them instead of using roles or role
keyword. Besides, these keywords also allow more arguments.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-10-31 09:38:59 +01:00
Rishabh Dave 8edbda96df use blocks directives to group tasks
Using block directives simplifies the playbooks and makes them more
readable.

Fixes: https://github.com/ceph/ceph-ansible/issues/2835
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-10-31 09:37:43 +01:00
Guillaume Abrioux d8d3e55006 remove restapi role
As of `mimic`, restapi is no longer available because of manager daemon.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-30 14:19:13 +01:00
Ali Maredia 219fa8f919 infrastructure playbooks: ensure nvme_device is defined in lv-create.yml
Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-10-29 08:41:42 +00:00
Mike Christie 0904860032 igw: stop daemons on purge all calls
When purging the entire igw config (lio and rbd) stop disable the api
and gw daemons.

Fixes Red Hat BZ
https://bugzilla.redhat.com/show_bug.cgi?id=1621255

Signed-off-by: Mike Christie <mchristi@redhat.com>
2018-10-25 12:59:18 +02:00
Sébastien Han 44d0da0dd4 rolling_update: fix upgrade when using fqdn
CLusters that were deployed using 'mon_use_fqdn' have a different unit
name, so during the upgrade this must be used otherwise the upgrade will
fail, looking for a unit that does not exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1597516
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-19 13:06:56 +00:00
Guillaume Abrioux b8418ebd17 add-osds: followup on 3632b26
Three fixes:

- fix a typo in vagrant_variables that cause a networking issue for
containerized scenario.
- add containerized_deployment: true
- remove a useless block of code: the fact docker_exec_cmd is set in
ceph-defaults which is played right after.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-17 17:07:25 +02:00
Sébastien Han d6e79044ef infra: add a gather-ceph-logs.yml playbook
Add a gather-ceph-logs.yml which will log onto all the machines from
your inventory and will gather ceph logs. This is not intended to work
on containerized environments since the logs are stored in journald.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582280
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-17 13:52:19 +00:00
Sébastien Han fbd878c8d5 infra: rename osd-configure to add-osd and improve it
The playbook has various improvements:

* run ceph-validate role before doing anything
* run ceph-fetch-keys only on the first monitor of the inventory list
* set noup flag so PGs get distributed once all the new OSDs have been
added to the cluster and unset it when they are up and running

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-17 11:26:11 +00:00
Guillaume Abrioux 40b7747af7 remove jewel support
As of now, we should no longer support Jewel in ceph-ansible.
The latest ceph-ansible release supporting Jewel is `stable-3.1`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-12 23:38:17 +00:00
Sébastien Han 9fccffa1ca switch: allow switch big clusters (more than 99 osds)
The current regex had a limitation of 99 OSDs, now this limit has been
removed and regardless the number of OSDs they will all be collected.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-10 16:35:30 -04:00
Noah Watkins 8dcc8d1434 Stringify ceph_docker_image_tag
This could be a numeric input, but is treated like a string leading to
runtime errors.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1635823

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-10-10 04:26:33 +00:00
Noah Watkins 306e308f13 Avoid using tests as filter
Fixes the deprecation warning:

  [DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
  using `result|search` use `result is search`.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>
2018-10-10 04:26:33 +00:00
Guillaume Abrioux 79bd06ad28 rolling_update: add ceph-handler role
since the introduction of ceph-handler, it has to be added in
rolling_update playbook as well

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-05 13:48:04 +00:00
Rishabh Dave b5d2ea269f don't use "static" field while including tasks
Instead used "import_tasks" and "include_tasks" to tell whether tasks
must be included statically or dynamically.

Fixes: https://github.com/ceph/ceph-ansible/issues/2998
Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-10-04 07:44:28 +00:00
Sébastien Han bae0f41705 switch: copy initial mon keyring
We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't not
failing because `fail_on_missing` was False before 2.5, so now it's True
hence the failure.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-10-03 13:58:53 +00:00
Guillaume Abrioux 03e76af7b4 switch: add missing call to ceph-handler role
Add missing call the ceph-handler role, otherwise we can't have
reference to variable registered from ceph-handler from other roles.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-03 13:58:53 +00:00
Guillaume Abrioux 54b02fe187 switch: support migration when cluster is scrubbing
Similar to c13a3c3 we must allow scrubbing when running this playbook.

In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag.

This commit allows to switch from non containerized to containerized
environment even while PGs are scrubbing.

Closes: #3182

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-10-03 13:58:53 +00:00
Andrew Schoen 9747f3dbd5 purge-cluster: zap devices used with the lvm scenario
Fixes: https://github.com/ceph/ceph-ansible/issues/3156

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-09-28 14:49:56 +02:00
wumingqiao 5da71e1ca1 purge-cluster: recursively remove ceph-related files, symlinks and directories under /etc/systemd/system.
fix: https://github.com/ceph/ceph-ansible/issues/3166

Signed-off-by: wumingqiao <wumingqiao@beyondcent.com>
2018-09-28 14:49:22 +02:00
Rishabh Dave 380168dadc don't use "include" to include tasks
Use "import_tasks" or "include_tasks" instead.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
2018-09-27 17:53:40 +02:00
Guillaume Abrioux 144c92b21f purge: actually remove of /var/lib/ceph/*
38dc20e74b introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.

`/var/lib/ceph/*` files are not purged it means there is a leftover.

When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.

Typical error (from container output):

```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16  /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23  /entrypoint.sh:
SUCCESS
```

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-27 17:45:21 +02:00
Guillaume Abrioux 179c4d00d7 rolling_update: ensure pgs_by_state has at least 1 entry
Previous commit c13a3c3 has removed a condition.

This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-25 14:58:54 +00:00
Guillaume Abrioux c13a3c3492 upgrade: consider all 'active+clean' states as valid pgs
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag before a
rolling update which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.

This commit allows an upgrade even while PGs are scrubbing.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-25 12:12:06 +00:00
Guillaume Abrioux 57f0b6a476 shrink-osd: follow up on 36fb3cde
- Adds loop in bash to satisfy the 1:n relation between `osd_hosts` and the
different device lists.
- Fixes some container name which were using the host hostname instead
of the actual container one.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-18 07:27:41 +00:00
Sébastien Han 735e1917db shrink-osd: purge dedicated devices
Once the OSD is destroyed we also have to purge the associated devices,
this means purging journal, db , wal partitions too.

This now works for container and non-container.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-18 07:27:41 +00:00
Guillaume Abrioux 4159326a18 shrink-osd: fix purge osd on containerized deployment
ce1dd8d introduced the purge osd on containers but it was incorrect.

`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, it means it was trying to
resolve parent devices from this node where it should be done on OSD
nodes.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-09-13 18:14:01 +02:00
Sébastien Han 38dc20e74b purge: only purge /var/lib/ceph content
Sometime /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy) so let's remove its content only.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-09-03 10:51:24 +02:00
Ali Maredia 561ec9203d infrastructure-playbooks: add comments for lv_vars.yml
Add comments telling user that devices used in
playbooks must not have GPT/FS/RAID signatures

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Ali Maredia 77eb459a88 infrastructure playbooks: remove lv-create error msg
remove error message when PV creation fails

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-29 21:10:20 +00:00
Ali Maredia e1ff438800 infrastructure-playbooks: failure msg for pvcreate
Add a message for when PV creation fails.

This message alerts users that FS/GPT/RAID
signatures could still on the device and the
reason for the failures.

`wipefs -a $device` needs to be run to fix this issue.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-28 20:21:42 +00:00
Sébastien Han 2e6e885bb7 rolling_upgrade: set sortbitwise properly
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSD are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSD are sending a version
10 and then run the command to unlock the OSDs.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-21 12:22:32 +00:00
Sébastien Han 77a3a682f3 iscsi group name preserve backward compatibility
Recently we renamed the group_name for iscsi iscsigws where previously
it was named iscsi-gws. Existing deployments with a host file section
with iscsi-gws must continue to work.

This commit adds the old group name as a backoward compatility, no error
from Ansible should be expected, if the hostgroup is not found nothing
is played.

Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 23:52:19 +02:00
Sébastien Han b738706810 take-over-existing-cluster: do not call var_files
We were using var_files long ago when default variables were not in
ceph-defaults, now the role exists this is not need. Moreover having
these two var files added:

- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml

Will create collision and override necessary variables.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-20 14:47:04 +02:00
Andrew Schoen 04df3f0802 lv-create: use copy instead of the template module
The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 131796f275 lv-create: add an example logfile_path config option in lv_vars.yml
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen b0bfc17351 lv-teardown: fail silently if lv_vars.yml is not found
This allows user to opt out of using lv_vars.yml and load configuration
from other sources.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 8424858b40 lv-teardown: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen e43eec57bb lv-create: fail silenty if lv_vars.yml is not found
If a user decides to to use the lv_vars.yml file then it should fail
silenty so that configuration can be picked up from other places.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen fde47be13c lv-create: set become: true at the playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Andrew Schoen 35301b35af lv-create: use the template module to write log file
The copy module will not expand the template and render the variables
included, so we must use template.

Creating a temp file and using it locally means that you must run the
playbook with sudo privledges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 909b38da82 infrastructure-playbooks/vars/lv_vars.yaml: minor fixes
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha f65f3ea89f infrastructure-playbooks/lv-create.yml: use tempfile to create logfile
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 65fdad0723 infrastructure-playbooks/lv-create.yml: add lvm_volumes to suggested paste
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 50a6d8141c infrastructure-playbooks/lv-create.yml: copy without using a template file
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 186c4e11c7 infrastructure-playbooks/lv-create.yml: don't use action to copy
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha 9d43806df9 infrastructure-playbooks: standardize variable usage with a space after brackets
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Neha Ojha e0293de3e7 vars/lv_vars.yaml: remove journal_device
Signed-off-by: Neha Ojha <nojha@redhat.com>
2018-08-16 16:38:23 +02:00
Ali Maredia 1f018d8612 infrastructure-playbooks: playbooks for creating LVs for bucket indexes and journals
These playbooks create and tear down logical
volumes for OSD data on HDDs and for a bucket index and
journals on 1 NVMe device.

Users should follow the guidelines set in var/lv_vars.yaml

After the lv-create.yml playbook is run, output is
sent to /tmp/logfile.txt for copy and paste into
osds.yml

Signed-off-by: Ali Maredia <amaredia@redhat.com>
2018-08-16 16:38:23 +02:00
Sébastien Han dad10e8f3f rolling_update: register container osd units
Before running the upgrade, let's call systemd to collect unit names
instead of relaying on the device list. This is more accurate and fix
the osd_auto_discovery scenario too.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-08-16 11:13:12 +02:00
Jeffrey Zhang 85cc61a6d9 Use /var/lib/ceph/osd folder to filter osd mount point
In some case, use may mount a partition to /var/lib/ceph, and umount
it will be failure and no need to do so too.

Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
2018-08-14 13:00:24 +00:00
Sébastien Han b3266c5be2 rolling_update: set osd sortbitwise
upgrade RHCS 2 -> RHCS 3 will fail if cluster has still set
sortnibblewise,
it stay stuck on "TASK [waiting for clean pgs...]" as RHCS 3 osds will
not start if nibblewise is set.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-07-24 17:19:02 +02:00
Sébastien Han ce1dd8d2b3 shrink-osd: purge osd on containerized deployment
Prior to this commit we were only stopping the container, but now we
also purge the devices.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-07-18 14:26:22 +00:00
Guillaume Abrioux d0746e0858 common: switch from docker module to docker_container
As of ansible 2.4, `docker` module has been removed (was deprecated
since ansible 2.1).
We must switch to `docker_container` instead.

See: https://docs.ansible.com/ansible/latest/modules/docker_module.html#docker-module

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-07-10 20:08:07 +00:00
Vishal Kanaujia 44d514850a Rolling upgrades: Migrate to ceph-key module
This change moves ceph-mgr upgrades to using ceph-key library.
Fixes: #2758

Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
2018-07-03 18:22:14 +02:00
Sébastien Han 20c8065e48 ceph-iscsi: rename group iscsi_gws
Let's try to avoid using dashes as testinfra needs to be able to read
the groups.
Typically, with iscsi-gws we can't add a marker for these iscsi nodes,
using an underscore fixes the issue.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-06-08 10:21:54 +02:00
Guillaume Abrioux 232a16d77f rolling_update: fix facts gathering delegation
this is kind of follow up on what has been made in #2560.
See #2560 and #2553 for details.

Closes: #2708

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-06-06 16:36:30 +08:00
Vishal Kanaujia 08d9432454 Rolling upgrades should use norebalance flag for OSDs
The rolling upgrades playbook should have norebalance flag set for
OSDs upgrades to wait only for recovery.

Fixes: #2657
Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
2018-06-04 10:59:01 +02:00
Sébastien Han e91648a7af rolling_update: add role ceph-iscsi-gw
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575829
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-26 02:38:47 -07:00
Paul Cuzner 2890b57cfc Add privilege escalation to iscsi purge tasks
Without the escalation, invocation from non-root
users with fail when accessing the rados config
object, or when attempting to log to /var/log

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004

Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
2018-05-25 03:50:24 -07:00
Sébastien Han da5b104098 rolling_update: fix get fsid for containers
When running ansible2.4-update_docker_cluster there is an issue on the
"get current fsid" task. The current task only works for
non-containerized deployment but will run all the time (even for
containerized). This currently results in the following error:

TASK [get current fsid] ********************************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-luminous-ansible2.4-update_docker_cluster/rolling_update.yml:214
Tuesday 22 May 2018  22:48:32 +0000 (0:00:02.615)       0:11:01.035 ***********
fatal: [mgr0 -> mon0]: FAILED! => {
    "changed": true,
    "cmd": [
        "ceph",
        "--cluster",
        "test",
        "fsid"
    ],
    "delta": "0:05:00.260674",
    "end": "2018-05-22 22:53:34.555743",
    "rc": 1,
    "start": "2018-05-22 22:48:34.295069"
}

STDERR:

2018-05-22 22:48:34.495651 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000
2018-05-22 22:48:34.495684 7f89482c6700  0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).fault

This is not really representative on the real error since the 'ceph' cli is available on that machine.
On other environments we will have something like "command not found: ceph".

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-23 04:44:12 +02:00
Guillaume Abrioux 9801bde4d4 purge_cluster: fix dmcrypt purge
dmcrypt devices aren't closed properly, therefore, it may fail when
trying to redeploy after a purge.

Typical errors:

```
ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command
'/sbin/blkid' returned non-zero exit status 2
```

```
ceph-disk: Error: unable to read dm-crypt key:
/var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf:
/etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key
```

Closing properly dmcrypt devices allows to redeploy without error.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-21 08:23:10 +02:00
Guillaume Abrioux 415dc0a29b take-over: fix bug when trying to override variable
A customer has been facing an issue when trying to override
`monitor_interface` in inventory host file.
In his use case, all nodes had the same interface for
`monitor_interface` name except one. Therefore, they tried to override
this variable for that node in the inventory host file but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of undefined variable.

Typical error:

```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```

Including variables like this `include_vars: group_vars/all.yml` prevent
us from overriding anything in inventory host file because it
overwrites everything you would have defined in inventory.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-18 10:10:08 +02:00
Sébastien Han 49a4712485 switch: disable ceph-disk units
During the transition from jewel non-container to container old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed', this is not a problem although operators
might not like to see these units failing. That's why we remove them if
we find them.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-17 08:48:28 +02:00
Guillaume Abrioux a9247c4de7 purge_cluster: wipe all partitions
In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-17 08:37:17 +02:00
Guillaume Abrioux 9cad113e2f purge_cluster: fix bug when building device list
there is some leftover on devices when purging osds because of a invalid
device list construction.

typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
    "changed": true,
    "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
    "delta": "0:00:00.015188",
    "end": "2018-05-16 12:41:40.408597",
    "item": "/dev/sda sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:40.393409"
}

STDOUT:

sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600

STDERR:

Error: Could not stat device /dev/sda sda1 - No such file or directory.
```

the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output

eg:

```
changed: [osd3] => (item=/dev/sda1) => {
    "changed": true,
    "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
    "delta": "0:00:00.013634",
    "end": "2018-05-16 12:41:09.068166",
    "item": "/dev/sda1",
    "rc": 0,
    "start": "2018-05-16 12:41:09.054532"
}

STDOUT:

/dev/sda sda1
```

For instance, it will result with a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-17 08:37:17 +02:00
Sébastien Han d80a871a07 rolling_update: move osd flag section
During a minor update from a jewel to a higher jewel version (10.2.9 to
10.2.10 for example) osd flags don't get applied because they were done
in the mgr section which is skipped in jewel since this daemons does not
exist.
Moving the set flag section after all the mons have been updated solves
that problem.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071
Co-authored-by: Tomas Petr <tpetr@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-17 08:17:16 +02:00
Guillaume Abrioux 1b4c3f292d rolling_update: fix dest path for mgr keys fetching
the role `ceph-mgr` that is played later in the playbook fails because
the destination path for the fetched keys is wrong.
This patch fix the destination path used in the task `fetch ceph mgr
key(s)` so there is no mismatch.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-15 19:30:34 +02:00
Guillaume Abrioux 3b89f1bfb1 rolling_update: get fsid in mgr pre_task
{{ fsid }} points to {{ cluster_uuid.stdout }} which is not defined in
this part of the rolling_update playbook.
Since we need to call {{ fsid }} we must get the fsid and register it to
`cluster_uuid`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-05-15 09:01:42 +02:00
Sébastien Han 52fc8a0385 rolling_update: move mgr key creation
Until all the mons haven't been updated to Luminous, there is no way to
create a key. So we should do the key creation in the mon role only if
we are not part of an update.
If we are then the key creation is done after the mons upgrade to
Luminous.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-05-15 09:01:42 +02:00
Guillaume Abrioux adeecc51f8 switch: fix ceph_uid fact for osd
In addition to b324c17 this commit fix the ceph uid for osd role in the
switch from non containerized to containerized playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-04-30 08:15:18 +02:00
Sébastien Han 5fa92804f9 switch: resolve device path so we can umount the osd data dir
If we don't do this, umounting devices declared like this
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001

will fail like:

umount: /dev/disk/by-id/ata-QEMU_HARDDISK_QM000011: mountpoint not found

Since we append '1' (partition 1), this won't work.
So we need to resolved the link to get something like /dev/sdb and then
append 1 to /dev/sdb1

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-04-30 08:15:18 +02:00
Sébastien Han 767abb5de0 switch: fix ceph_uid fact
Latest is now centos not ubuntu anymore so the condition was wrong.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-04-30 08:15:18 +02:00
Sébastien Han 85732d11b9 mon/client: remove acl code
Applying ACL on the keyrings is not used anymore so let's remove this
code.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-04-23 18:34:58 +02:00
Sébastien Han 66c1ea8cd5 shrink-osd: ability to shrink NVMe drives
Now if the service name contains nvme we know we need to remove the last
2 character instead of 1.

If nvme then osd_to_kill_disks is nvme0n1, we need nvme0
If ssd or hdd then osd_to_kill_disks is sda1, we need sda

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1561456
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-04-20 15:08:29 +02:00
Sébastien Han 641f141c0f selinux: remove chcon calls
We know bindmount with the :z option at the end of the -v command so
this will basically run the exact same command as we used to run. So to
speak:

chcon -Rt svirt_sandbox_file_t /var/lib/ceph

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-04-19 14:59:37 +02:00
Sébastien Han 473939d215 infra: add playbook example for ceph_key module
Helper playbook to manage CephX keys.

Signed-off-by: Sébastien Han <seb@redhat.com>
2018-04-11 12:18:34 +02:00
Andrew Schoen 08f4875533 ceph_volume: refactor to not run ceph osd destroy
This changes state to action and gives the options 'create'
or 'zap'. The zap parameter is also removed.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Andrew Schoen c6e8f8fb11 purge-cluster: no need to use objectstore for ceph_volume module
When zapping objectstore is not required.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Andrew Schoen c29a75ac7f purge-cluster: use ceph_volume module to zap and destroy OSDs
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-04-10 14:19:21 +02:00
Randy J. Martinez d1f2d64b15 purge-docker: added conditionals needed to successfully re-run purge
Added 'ignore_errors: true' to multiple lines which run docker commands; even in cases where docker is no longer installed. Because of this, certain tasks in the purge-docker-cluster.yml will cause the playbook to fail if re-run and stop the purge. This leaves behind a dirty environment, and a playbook which can no longer be run.
Fix Regex line 275: Sometimes 'list-units' will output 4 spaces between loaded+active. The update will account for both scenarios.
purge fetch_directory: in other roles fetch_directory is hard linked ex.: "{{ fetch_directory }}"/"{{ somedir }}". That being said, fetch_directory will never have a trailing slash in the all.yml so this task was never being run(causing failures when trying to re-deploy).

Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
2018-04-10 13:39:14 +02:00
Guillaume Abrioux e32a177af8 purge-docker: remove redundant task
The `remove_packages` prompt is redundant to the `ireallymeanit` prompt
since it does exactly the same thing. I guess the only goal of this task
was to make a break to warn user about `--skip-tags=with_pkg` feature.
This warning should be part of the first prompt.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-04-03 11:54:42 +02:00
Andy McCrae 60d4b75f51 Cleanup plugins directories and references
Having callback_plugins, and action plugins in random locations causes
a lot of disparity.

We should centralize this into one place in the plugins directory and
fix up the ansible.cfg to reflect this.

Additionally, since the ansible.cfg already reflects action_plugins, we
don't need a link to action_plugins in the base of the repository.
2018-03-14 11:15:39 +01:00
jtudelag 691f7c5146 Adds handy ceph aliases whe containerized installations.
Same approach as openshift-ansible etcdctl:

* https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/etcd/tasks/auxiliary/drop_etcdctl.yml
* https://github.com/openshift/openshift-ansible/blob/release-3.7/roles/etcd/etcdctl.sh
2018-03-08 13:56:39 +01:00
Guillaume Abrioux c04e67347c update: look for short and fqdn in ceph_health_raw
According to hostname configuration, the task waiting for mons to be in
quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-19 10:27:47 +01:00
Andrew Schoen 699c777e68 rolling update: fix undefined jewel_minor_update failure
Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.

The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-13 17:03:05 +01:00
Andrew Schoen 7c7017ebe6 infra: do not include host_vars/* in take-over-existing-cluster.yml
These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-02-12 11:48:47 +01:00
Guillaume Abrioux 3b2f6c34e4 purge-docker: fix ceph-osd-zap name container
the `zap ceph osd disks` task should iter on `resolved_parent_device`
instead of `combined_devices_list` which contain only the base device
name (vs. full path name in `combined_devices_list`).

this fixes the issue where docker complain about container name because
of illegal characters such as `/` :
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```

having the the basename of the device path is enough for the container
name.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-02-02 22:09:11 +01:00
Guillaume Abrioux dd0c98c5a2 common: do not use `shell` module when it is not needed
There is no need here to use `shell` instead of `command`

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux deaf273b25 syntax: change local_action syntax
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```

The usual syntax:
```
    local_action:
      module: wait_for
      port: 22
      host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
      state: started
      delay: 10
      timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.

This also fix a potential issue about missing quotation :

```
Traceback (most recent call last):
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
    main()
  File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
    rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
  File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
  File "/usr/lib64/python2.7/shlex.py", line 279, in split
    return list(lex)                                                                                                                                                                                                                                                                                                            File "/usr/lib64/python2.7/shlex.py", line 269, in next
    token = self.get_token()
  File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
    raw = self.read_token()
  File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
    raise ValueError, "No closing quotation"
ValueError: No closing quotation
```

writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-31 10:45:34 +01:00
Guillaume Abrioux f372a4232e purge: fix resolve parent device task
This is a typo caused by leftover.
It was previously written like this :
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to :
`shell: $(lsblk --nodeps -no pkname "{{ item }}") }}")`
because we are appending later the '/dev/' in the next task.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-30 17:40:10 +01:00
Guillaume Abrioux c7ec12d49c upgrade: skip luminous tasks for jewel minor update
These tasks are needed only when upgrading to luminous.
They are not needed in Jewel minor upgrade and by the way, they fail because
`ceph versions` command doesn't exist.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-25 18:30:34 +01:00
Sébastien Han 8af7459476 rolling update: add mgr exception for jewel minor updates
When update from a minor Jewel version to another, the playbook will
fail on the task "fail if no mgr host is present in the inventory".
This now can be worked around by running Ansible with_items

-e jewel_minor_update=true

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1535382
Signed-off-by: Sébastien Han <seb@redhat.com>
2018-01-18 14:06:05 +01:00
Guillaume Abrioux 55298fa80c purge-container: use lsblk to resolv parent device
Using `lsblk` to resolv the parent device is better than just removing the last
char when passing it to the zap container.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-17 15:54:20 +01:00
Guillaume Abrioux 58eb045d2f purge-container: remove awk usage in favor of blkid
Avoid using `awk` to get the different devices from the partlabel.
Using `blkid` is more readable.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2018-01-17 15:54:20 +01:00
Andrew Schoen b613321c21 switch-to-containers: do not fail when stopping the nfs-ganesha service
If we're working with a jewel cluster then this service will not exist.

This is mainly a problem with CI testing because our tests are setup to
work with both jewel and luminous, meaning that eventhough we want to
test jewel we still have a nfs-ganesha host in the test causing these
tasks to run.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Andrew Schoen 0b4b60e3c9 switch-to-containers: do not fail when stopping the ceph-mgr daemon
If we are working with a jewel cluster ceph mgr does not exist
and this makes the playbook fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Andrew Schoen 997edea271 rolling_update: do not fail the playbook if nfs-ganesha is not present
The rolling update playbook was attempting to stop the
nfs-ganesha service on nodes where jewel is still installed.
The nfs-ganesha service did not exist in jewel so the task fails.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2018-01-06 14:07:55 +01:00
Guillaume Abrioux c5b7b37105 purge-cluster: clean some code
Avoid using regexp to match device

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Guillaume Abrioux eeedefdf02 purge-cluster: wipe disk using dd
`bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices and cause the `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.

typical error seen :

```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-12-20 17:42:45 +01:00
Sébastien Han 200785832f rolling_update: do not require root to answer question
There is no need to ask for root on the local action. This will prompt
for a password the current user is not part of sudoers. That's
  unnecessary anyways.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1516947
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-12-19 14:04:55 +01:00
Guillaume Abrioux aaaf980140 purge: fix bug on 'wait_for' task
this task hangs because `{{ inventory_hostname }}` doesn't resolv to an
actual ip address.
Using `hostvars[inventory_hostname]['ansible_default_ipv4']['address']`
should fix this because it will reach the node with its actual IP
address.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-29 11:10:56 +01:00
Guillaume Abrioux 947766e294 purge-cluster: remove usage of `with_fileglob`
`with_fileglob` loops over files on the machine where ansible-playbook
is being run.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-21 08:24:11 +01:00
Guillaume Abrioux d9c1b61092 purge-docker: remove osd disk prepare logs
`with_fileglob` loops over files on the machine that runs the playbook.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-16 14:27:36 +01:00
Sébastien Han 68566444e9
Merge pull request #2142 from squidboylan/master
infra: fix take-over-existing-cluster.yml playbook
2017-11-13 22:06:16 +11:00
Guillaume Abrioux fa675f2ead purge-docker-cluster: ensure old logs are removed
purge-docker-cluster must remove all osd_disk_prepare logs in
`{{ ceph_osd_docker_run_script_path }}`, otherwise if you purge your
cluster and try to redeploy it, osds will fail to start since because it
will try to retrieve find a partition uuid which doesn't exist.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510470

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-11-09 17:49:20 +01:00
Caleb Boylan 41d10a2f64 infra: fix take-over-existing-cluster.yml playbook
The ansible inventory could have more than just ceph-ansible hosts, so
we shouldnt use "hosts: all", also only grab one file when getting
the ceph cluster name instead of failing when there is more than one
file in /etc/ceph. Also fix location of the ceph.conf template
2017-11-06 15:00:30 -08:00
Sébastien Han 473673ab41 shrink-mon: fix typo in the code doc
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-27 11:59:22 +02:00
Sébastien Han 2837d0a22e purge: do not reboot by default
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by default by running the purge playbook with -e
reboot_osd_node=True

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-26 14:18:38 +02:00
Guillaume Abrioux f90f2f3a04 purge: containers are not stopped
During purge osd, the containers are not stopped because of a typo, as a
result, all the devices can't be unmounted later.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-10-25 07:58:00 +02:00
Sébastien Han 4413511b66 all: backward compatibility between stable-2.2 and 3.0
stable-3.0 brought numerous changes in ceph-ansible variables, this PR
aims to maintain backward compatibility for someone running stable-2.2
upgrading to stable-3.0 but keeps its groups_vars untouched.
We will then determine the right options to make sure the upgrade works
but we are expecting that new variables should be used.

We will drop this in a near future, maybe 3.1 or 3.2.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-20 11:54:10 +02:00
Guillaume Abrioux 982326373b upgrade: fix upgrade jewel to luminous for nfs nodes
nfs nodes can't be upgraded from jewel to luminous because ceph-nfs role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
package is upgraded in `ceph-nfs` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-10-19 20:54:23 +02:00
Guillaume Abrioux 70034451e9 upgrade: fix upgrade jewel to luminous for mgr nodes
mgr nodes can't be upgraded from jewel to luminous because ceph-mgr role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
ceph-mgr package is upgraded in `ceph-mgr` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 302e563601cd6820b1ae44fabdfb1506688c7c9b)
2017-10-19 20:54:23 +02:00
Sébastien Han d920d4839d upgrade: support for rbd mirror and nfs
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurence)
- A bit of cleanup

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-17 10:54:47 +02:00
Sébastien Han 39bf102b64 switch: nicer way to check mon quorum
re-use the same syntax as rolling_udate.yml

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-17 10:54:36 +02:00
Sébastien Han b685aceede Merge pull request #2044 from major/avoid-jinja-in-when
Remove jinja2 delimiters from `when` keys
2017-10-12 22:23:06 +02:00
Major Hayden c01851325e
Remove jinja2 delimiters from `when` keys
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.
2017-10-12 11:27:42 -05:00
Major Hayden 33b200d43a
Suppress yum/dnf/rpm command warnings
Ansible throws warnings when using yum/dnf/rpm with the command
module:

    [WARNING]: Consider using yum module rather than running yum

This patch adds the `warn: no` argument to suppress the warnings
in the Ansible output.
2017-10-12 08:38:05 -05:00
Sébastien Han 13bce287ad infra: replace osd playbook
This playbook can replace failed OSD in containerized and
non-containerized env.
The current limitation is that it won't allow you to choose between
filestore/bluestore and will do collocation as well.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-12 11:53:30 +02:00
Sébastien Han 85e13a864c purge-iscsi: fix group name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1500281
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-11 12:52:12 +02:00
Sébastien Han 24b82c2679 purge: fix journal purge
Using a condition when osd_scenario == 'non-collocated' was wrong since
these partitions can be collocated on a single device also. Removing the
check makes the purge of these partitions.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-10 09:57:39 +02:00
Guillaume Abrioux f147b119ed Merge pull request #2014 from ceph/fixes-2
infra: use the pg check in the right place
2017-10-09 20:14:06 +02:00
Sébastien Han 450108fab9 infra: add independant purge-iscsi-gateways.yml
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and blocking the CI too. So removing it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-09 17:25:44 +02:00
Sébastien Han 774697ebd8 infra: use the pg check in the right place
Use the pg check before doing the pg check, not on the quorum check.
Also never quote int when doing comparaison.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-09 17:25:41 +02:00
Sébastien Han a3e7bcb13f Merge pull request #2013 from ceph/wip-purge-cluster
A couple of purge cluster fixes
2017-10-09 17:18:30 +02:00
Sébastien Han 33a3aa0dda switch: check pgs only when num_pgs > 0
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:42:09 +02:00
Sébastien Han 05f26031ea rolling_update: perform pg check when pgs_num > 0
If num_pgs = 0 the check will never return 0.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:09 +02:00
Sébastien Han c3c63ae539 switch: rework and fix clean pg wait
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:09 +02:00
Sébastien Han c693e95cbf purge-docker: rework device detection
we don't need "devices" and other device variable anymore, the playbook
detects that for us.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 03:39:04 +02:00
Sébastien Han 2fb4981ca9 shrink-osd: admin key not needed for container shrink
Also do some clean

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-07 00:20:43 +02:00
Boris Ranto 64e272d818 purge-cluster: Do not use shell for rm
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:54:37 +02:00
Boris Ranto f696cb7637 purge-cluster: Do not fail on systemd commands
The systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these.

Signed-off-by: Boris Ranto <branto@redhat.com>
2017-10-06 22:52:56 +02:00
Sébastien Han b6b24a5ca9 iscsi: fix wrong group name for iscsi
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1498490
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-05 17:25:32 +02:00
Sébastien Han f37e014a65 Merge pull request #1974 from ceph/mgr-upgrade-luminous
upgrade: a support for mgrs
2017-10-03 19:57:31 +02:00
Sébastien Han 99466e79a1 upgrade: a support for mgrs
Also we now play ceph-config to have everything being generated for new
daemons bootstrap during upgrade.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1497959
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-03 16:57:31 +02:00
Sébastien Han 3bd341f6c0 osd: container use id instead of dev name
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1494127
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-10-03 14:44:00 +02:00
Sébastien Han 3c2c31a591 Merge pull request #1964 from vatelzh/master
purge-cluster: delete block partitions if using bluestore
2017-10-02 12:10:26 +02:00
Sébastien Han b9050d6229 update: fix var register
Even if the task is skipped, ansible registers the var as 'skipped' so
this task the task using this variable for its next usage.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-29 14:27:55 +02:00
zhangwentao 86a6db0d58 purge-cluster: delete block partitions if using bluestore 2017-09-29 14:04:17 +08:00
Sébastien Han a0a5b174ba rolling_update: clarify mon quorum command
Cleaner.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-29 01:19:46 +02:00
Sébastien Han bd5471b940 update: complete luminous upgrade
Once we complete the upgrade to Luminous, we must issue a specific
command. For more info read:
http://ceph.com/community/new-luminous-upgrade-complete/

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-28 21:05:00 +02:00
Sébastien Han 68f1f99ee9 update: nicer way to wait for clean pgs
More comprhensive and friendly to read.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-28 14:46:26 +02:00
Andrew Schoen fccc604f4a purge-cluster: default lvm_volumes if not defined
Most osd scenarios do not use lvm_volumes, so default it in
purge-cluster.yml if it's not defined.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-09-26 15:14:29 -05:00
Guillaume Abrioux fcb6454e04 rbd-mirror: fix systemd unit in purge-docker
rbd-mirror containers are not stopped in purge-docker-cluster playbook
because of the wrong name used.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-09-24 21:18:50 +02:00
Guillaume Abrioux c80ba7a307 purge: implement mgr purge
unti now, mgr nodes are not managed by purge-cluster.yml, therefore it
breaks scenario like purge_cluster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-09-24 21:18:50 +02:00
Guillaume Abrioux 7195b08718 update: update rgw systemd unit name
The old name is used in `rolling_update.yml` and
`purge-docker-cluster.yml`, it breaks the
`test_rgw_service_is_running()` test.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-09-24 14:58:55 +02:00
Sébastien Han 6bac613611 shrink: support for container
We can now shrink mon and osds on containerized deployment.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492115
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-20 16:25:07 +02:00
Sébastien Han 7fedc8ebf4 Merge pull request #1891 from ceph/clarify-update
rolling_update: clarify update doc
2017-09-15 07:08:49 -06:00
Sébastien Han fe1d84d395 Merge pull request #1892 from ceph/purge-dmcrypt-col
purge: only purge specific directories for mon
2017-09-13 17:57:06 -06:00
Sébastien Han ba3e3b6cc7 purge: only purge specific directories for mon
Handles the case when a mon is collocated with an OSD.

Closes: https://github.com/ceph/ceph-ansible/issues/1877
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-13 17:07:04 -06:00
Sébastien Han 82c4848ec4 Merge pull request #1885 from ceph/shrink-osd
shrink-osd: fix when multiple osds
2017-09-13 16:12:49 -06:00
Sébastien Han 92f9be963b rolling_update: clarify update doc
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1490188
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-13 15:46:29 -06:00
Sébastien Han 3031e51778 shrink-osd: fix when multiple osds
The loop was being built properly so we were always getting the last
item as osd host.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1490355
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-13 15:20:11 -06:00
Sébastien Han aa364264cd resync ceph-iscsi-gw with old upstream
Taken from https://github.com/pcuzner/ceph-iscsi-ansible/tree/tcmu-fixes

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1454945 and
https://bugzilla.redhat.com/show_bug.cgi?id=1484083
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-12 18:06:10 -06:00
Sébastien Han 477f86e305 switch to container: fix ceph nfs
The service is nfs-ganesha where ceph-nfs@{{ ansible_hostname }} will be
the name of the container.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-08 22:43:50 +02:00
Sébastien Han fdacac9fa0 switch: make osd collection idempotent
This commits allows us to run
switch-from-non-containerized-to-containerized-ceph-daemons.yml multiple
times.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489353
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-08 11:31:47 +02:00
Sébastien Han e46440e19c switch-from-non-containerized-to-containerized: fix devices
If devices is passed through an extra var this register won't work so
let's only register the var is devices is not defined.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489099
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-07 23:18:14 +02:00
Sébastien Han b9ced956d7 purge: get lockbox mountpoint and unmount it
Prior command was avoiding the lockbox mountpoint and the playbook was
failing with:

rmtree failed: [Errno 30] Read-only file system:
'/var/lib/ceph/osd-lockbox/4e9d8052-87c2-4fde-a56c-b8c108a3eefc/key-management-mode'

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-07 16:31:31 +02:00
Guillaume Abrioux d987d26719 tests: force docker variable for switch-to-containers scenario
we need to force the value of `docker` variable which is initially set
to `false` since it's a migration from non-containerized to
containerized cluster.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-09-06 18:03:52 +02:00
Sébastien Han b7db600caa switch-from-non-containerized-to-containerized: mask unit files
We must mask the image so we are sure that even if the system reboots
then the OSDs won't start.

Also remove Ceph udev rules if found on the system prior to deploy
containers. If we don't do this we are exposed to conflicts between udev
rules and sytemd unit files.

Also add the CI will now test the migration from a non-containerized cluster to a
containerized cluster.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-05 15:20:31 +02:00
Sébastien Han 579b95fd8a shrink-mon: wait a little bit for the mon to be out
Monitor removal from the monmap is not immediate, so let's wait a little
bit and then fail if the monitor is still in the monmap.
We try twice in total with 10 sec intervals.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-04 23:08:57 +02:00
Sébastien Han 54d7a81241 infra playbook: move untested scenario to a new dir
Move untested/with few confidence playbooks in a untested-by-ci
directory.
Also removing this directory from the package build.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1461551
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-01 19:58:24 +02:00
Sébastien Han 298a63c437 shrink mon and osd
Rework shrinking a monitor and an OSD playbook. Also adding test
scenario.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1366807
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-09-01 19:12:00 +02:00
Sébastien Han e0a264c7e9 osd: allow multi dedicated journals for containers
Fix: https://bugzilla.redhat.com/show_bug.cgi?id=1475820
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-30 12:34:06 +02:00
Ben England 617d9ee75d dont use devices var anymore, works for osd_auto_discover 2017-08-28 17:27:01 -04:00
Sébastien Han 0205f6d645 rolling_update: nicer way to set osd flags
Prior to this patch, we were applying the osd flags like this:

"
General pre tasks
Set flags
Upgrade OSDs on a host
Unset flags <-- this triggers pending scrub to start
Set flags
Upgrade OSDs on a hosts
Unset flags <-- this triggers pending scrub to start
.
.
.
General post tasks
"

Now instead, we apply the flag once before starting the OSD update and
unset them once the last OSD is finished.

"
General pre tasks
Set flags and wait for any scrubs to finish
Upgrade OSDs on a host
Upgrade OSDs on a host
.
.
.
Unset flags
General post tasks
"

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1450754
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-08-25 18:21:28 +02:00
Sébastien Han 4a4a20f07d rolling update: skip pg check if num_pgs = 0
In our test case we don't have any pgs, thus the check fails. The check
always returns an empty array, which makes the comparaison failing.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-24 08:50:49 +02:00
Alfredo Deza e651469a2a Merge pull request #1797 from ceph/purge-lvm
adds purge support for the lvm_osds osd scenario
2017-08-23 14:28:29 -04:00
Sébastien Han f2499ff5ac Merge pull request #1788 from ceph/improve-switch
switch-from-non-containerized-to-containerized: simplify
2017-08-23 19:47:26 +02:00
Sébastien Han 4f0ecb7f30 switch-from-non-containerized-to-containerized: simplify
This commit eases the use of the
infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook. We basically run it with a couple of pre-tasks and then we let
the playbook run the docker roles.

It obviously expect to have proper variables configured in order to
work.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-23 18:39:45 +02:00
Andrew Schoen bed57572cc purge-cluster: adds support for purging lvm osds
This also adds a new testing scenario for purging lvm osds

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-08-23 10:33:35 -05:00
Sébastien Han 1ac0969c28 Merge pull request #1778 from ceph/fix-1770
purge: add ability to purge bluestore osd
2017-08-22 23:56:36 +02:00
Giulio Fidente 2c01de4350 Default cluster to ceph in switch to containers 2017-08-22 13:13:36 +02:00
Giulio Fidente f0423b1804 Parse ceph_docker_registry in switch to containers
Defaults it to docker.io as it was for backward compatibility.
2017-08-22 13:11:27 +02:00
Giulio Fidente a59b84d5c9 Assume mon_docker_privileged false in switch to containers 2017-08-22 13:01:25 +02:00
Giulio Fidente 0106fa6835 Consume public_network vs ceph_mon_docker_subnet
In the switch to containers migration there were broken references
to ceph_mon_docker_subnet variable, replaced with public_network.

Also fixes references to ceph_mon_docker_extra_env setting for it
a default as it could be undefined.
2017-08-21 18:34:24 +02:00
Giulio Fidente 386303d42e Extend set_uid fact to support RH Ceph images 2017-08-21 18:32:08 +02:00
Sébastien Han 9c824b9818 purge: add ability to purge bluestore osd
We now purge block db and/or wal partitions if we find any.

Closes: https://github.com/ceph/ceph-ansible/issues/1770
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-21 18:08:18 +02:00
Andrew Schoen d2f4d3666f Merge pull request #1725 from ceph/simplify-osd-scenario
osd: simply osd scenario declaration
2017-08-03 09:31:57 -05:00
Sébastien Han 671f2cd4bc Merge pull request #1738 from yanyixing/nvmepart
fix for nvme part path
2017-08-03 13:37:10 +02:00
yanyx d506fad056 fix for nvme part path 2017-08-03 17:37:52 +08:00
Sébastien Han 30991b1c0a osd: simplify scenarios
There is only two main scenarios now:

* collocated: everything remains on the same device:
  - data, db, wal for bluestore
  - data and journal for filestore
* non-collocated: dedicated device for some of the component

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-03 10:20:39 +02:00
Sébastien Han fdc6aebd62 infrastructure-playbooks: update with ceph-defaults roles
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-08-02 17:12:20 +02:00
Guillaume Abrioux 7a333d05ce Add handlers for containerized deployment
Until now, there is no handlers for containerized deployments.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-08-02 17:12:20 +02:00
Guillaume Abrioux 5adbf0fdaa Move role dependencies in site.yml/site-docker.yml
This will give us more flexibility and avoid a lot of useless when
skipping all tasks from a non-desired role.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-08-02 17:12:14 +02:00
Guillaume Abrioux 206c7a16d0 rolling_update: refact code
Refact rolling_update playbook.
Add ceph-client upgrade.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-08-02 11:10:51 +02:00
yanyx d0a17b11b2 change the partition's ownership 2017-07-27 11:55:30 +08:00
Sébastien Han fad9d0caec Merge pull request #1690 from yanyixing/master
fix: when osd device is a disk partition
2017-07-26 15:55:29 +02:00
yanyx 2e6233271e fix: when osd device is a disk partition 2017-07-25 21:39:43 +08:00
Sébastien Han 0c18cf199e purge: remove leftover unit files
Closes https://github.com/ceph/ceph-ansible/issues/1672

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-07-25 13:26:28 +02:00
Guillaume Abrioux 828f88403e Update: Avoid screen scraping in rolling update
since luminous has revamped the `ceph -s` output, we need to avoid screen
scraping.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-07-12 15:02:39 +02:00
Guillaume Abrioux 896d62d78b Refact: remove ceph_mon_docker_interface variable
remove `ceph_mon_docker_interface` and use `monitor_interface` instead
for both containerized and non-containerized deployment.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-07-04 18:08:59 +02:00
Guillaume Abrioux 73141118d0 Make the new check PGs working with /bin/sh
The new test in the checks PGs are no longer working on distributions
where /bin/sh isn't linked to /bin/bash.

Fix: #1619
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-06-22 17:59:38 +02:00
David Galloway 127b5ad9b4 infra: Create a backup of ceph.conf when taking over existing cluster
Signed-off-by: David Galloway <dgallowa@redhat.com>
2017-06-21 09:53:09 -04:00
David Galloway 40ed2d7be6 infra: Fix ceph.conf creation when taking over existing cluster
Fixes bug introduced in https://github.com/ceph/ceph-ansible/pull/1330

The "stat ceph.conf" task was basically using the stat module on a
string instead of the ceph.conf filename.  This caused the "generate
ceph configuration file" task to fail.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1463382

Signed-off-by: David Galloway <dgallowa@redhat.com>
2017-06-21 09:52:01 -04:00
Andrew Schoen e2104acb62 rolling_update: set health_mon_check_delay to 15
The old value of 10 did not give enough time for a containerized mon to
pass the health check.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-06-13 08:56:44 -05:00
Guillaume Abrioux 5af9bb432c rewrite check pgs clean tasks
Avoid screen scrapping by rewriting `waiting for clean pgs` tasks like it is
done in 304de48.

Use the json output returned by `ceph -s` instead

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-06-13 09:48:56 +02:00
Andrew Schoen 59992c54cc purge-docker-cluster: include ceph_docker_registry
We need to include ceph_docker_registry when removing containers/images
because if we don't it will assume docker.io which is not always where
the image originated from, causing the playbook to fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-06-02 09:49:17 -05:00
Sébastien Han fdc7866072 Merge pull request #1469 from ceph/refact_code
Docker: Refact code
2017-06-02 12:40:25 +02:00
Andrew Schoen f7677e4393 purge-docker-cluster: pip is only used on Debian
We only need to purge packages installed by pip on Debian systems.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-05-31 09:03:44 -05:00
Andrew Schoen 8e322d4825 purge-docker-cluster: default raw_journal_devices to []
If we're purging a containerized cluster that did not use the
raw_multi_journal OSD scenario then raw_journal_devices will not be
defined which causes the playbook to fail.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1455187

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-05-25 07:30:25 -05:00
Guillaume Abrioux ddfe019342 Refact code
`ceph-docker-common`:
  At the moment there is a lot of duplicated tasks in each
  `./roles/ceph-<role>/tasks/docker/main.yml` that could be refactored in
  `./roles/ceph-docker-common/tasks/main.yml`.

`*_containerized_deployment` variables:
  All `*_containerized_deployment` have been refactored to a single
  variable `containerized_deployment`

duplicate `cephx` variables in `group_vars/* have been removed.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-05-24 15:55:41 +02:00
Sébastien Han 90389864d8 rolling-update: set/unset flags on the right container
Problem: we are delegating the set/unset flag to a monitor node but we
try to call an osd container

Solution: use the right container name.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-05-22 09:38:08 +02:00
Sébastien Han b93ffe637b Merge pull request #1476 from WingkaiHo/improve-shrink-osd.yml
improve shrink-osd.yml can shrink osd when disk damage
2017-04-27 11:01:27 +02:00
WingkaiHo 0b9f322ca0 improve shrink-osd.yml can shrink osd when disk damage 2017-04-27 10:26:26 +08:00
Andrew Schoen 5a3f95dfc1 purge-cluster: check for any running ceph process after purge
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-25 09:30:22 -05:00
Andrew Schoen 26bdd59f5d purge-cluster: we don't support sysv or upstart anymore
Now that ceph-ansible only supports > jewel we don't need
to bother with sysv or upstart

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:14:38 -07:00
Andrew Schoen 7ca2bddcce purge-cluster: do not need to check for running ceph processes
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:12:46 -07:00
Andrew Schoen aac79df3b3 purge-cluster: no need to remove ceph.target
The package uninstalls will stop ceph.target

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-04-21 15:11:03 -07:00
Sébastien Han dfd8f4d96e test: add mgr section to the host inventory file
Without this, we don't test the mgr role so we need to add it.

Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-04-15 00:16:10 +02:00
Sébastien Han 17ac1fd464 Merge pull request #1443 from WingkaiHo/osds-journal-migrate
Migrate osd(s) journal to ssd
2017-04-13 16:45:57 +02:00
WingkaiHo 9fba41b4ce Migrate osd(s) journal to ssd 2017-04-13 11:05:58 +08:00
Daniel Lupescu d5e56c481a purge-cluster: fix grep match for NVMe and HP Smart Array devices
raw_device would return invalid block device names for NVMe and HPSA
devices which would cause sgdisk partition deletion to fail

$ echo /dev/nvme1n1p3 | egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}'
/dev/nvme1n1p

$ echo /dev/cciss/c0d0p2 |  egrep -o '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p){1,2}'
/dev/cciss/c0d0p
2017-04-11 16:13:28 +03:00
Sébastien Han c37aaa41f4 playbook: homogenize the way list osd ids
Problem: too many different commands to do the same thing. The 'cut'
command on infrastructure-playbooks/purge-cluster.yml was also wrong.
This sed command from osixia in ceph-docker
https://github.com/ceph/ceph-docker/pull/580/ addresses all the
scenarios.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-03-30 11:51:38 +02:00
Sébastien Han 35a90ae283 Merge pull request #1386 from WingkaiHo/master
Create recover-osds-after-ssd-journal-failure.yml
2017-03-28 09:50:39 +02:00
Konstantin Shalygin 1662976fc0
Resolve issues when groups names not in default value. 2017-03-27 21:44:30 +07:00
WingkaiHo ac1498b0d7 Merge https://github.com/ceph/ceph-ansible 2017-03-27 10:50:38 +08:00
WingkaiHo ebb56ccebf command module instead shell 2017-03-23 17:38:41 +08:00
WingkaiHo 2d44c1cee6 remove service enable 2017-03-23 15:28:14 +08:00
WingkaiHo 14c189fee5 break it into lines since you already use the string block synta and fix disable it here and enable again in later task 2017-03-23 14:49:10 +08:00
WingkaiHo 62c37042fe remove this detection and simply rely on {{ cluster }} 2017-03-23 09:22:06 +08:00
WingkaiHo 3d10c5981e fix some pelling mistakes and wirting format, use full device path for device name 2017-03-22 17:48:34 +08:00
WingkaiHo 1e670bdeb0 This assumes ceph as a cluster name. We need detect the name of the cluster 2017-03-22 10:09:06 +08:00
WingkaiHo 83a1ac0c67 This assumes ceph as a cluster name. We need detect the name of the cluster 2017-03-22 10:06:11 +08:00
WingkaiHo 19f9e200d7 Add auto detect the ceph cluster name 2017-03-22 10:00:44 +08:00
WingkaiHo 8602166f6e Ansible will include host_vars/ansible_hostname.yml itself, no need this task IMO. 2017-03-21 13:50:27 +08:00
WingkaiHo 55725fd01d fix some syntax error 2017-03-21 11:19:25 +08:00
WingKai Ho 7445113dc4 Create recover-osds-after-ssd-journal-failure.yml
This playbook use to recover Ceph OSDs after ssd journal failure.
2017-03-21 11:08:25 +08:00
Anthony D'Atri 6c4911276e Enhance clean PG check to catch active+clean+scrubbing and active+clean+scrubbing+deep
Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
2017-03-19 00:23:26 -07:00
Daniel Marks 77edd3d40a Fixing tabs that are breaking the syntax check
With the merge of PR #1336 the syntax check fails. This commit replaces
the tabs with proper indentation.
2017-03-15 14:15:15 +01:00
Sébastien Han 38ab6de602 Merge pull request #1336 from WingkaiHo/master
Load a variable file for devices partition
2017-03-15 11:55:26 +01:00
Sébastien Han 8320c14191 Merge pull request #1317 from ibotty/harmonize-docker-names
harmonize docker names
2017-03-14 18:20:20 +01:00
Andrew Schoen e81d690aa0 switch-to-containers: do not include group vars or role defaults
Doing so will override any values set for these in the group_vars
directory relative to the users inventory.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:09 -06:00
Andrew Schoen cf702b05cf purge-docker-cluster: do not include role defaults or group vars
Doing so at playbook level overrides whatever values might be set for
these in the user's group_vars directory that's relative to their
inventory.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:09 -06:00
Andrew Schoen aef54d89d9 switch-to-containers: do not set group name vars at playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:09 -06:00
Andrew Schoen 7289acb6b3 purge-docker-cluster: do not set group names vars at playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:08 -06:00
Andrew Schoen 46f26bec13 rolling-update: do not set group name vars at playbook level
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:08 -06:00
Andrew Schoen 4fe6607004 purge-cluster: do not set group name vars at playbook level
This has the behavior of overriding custom values set in group_vars.
I've added defaults to the rest of the group names so that if they are
not overridden in group_vars then defaults will be used.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1354700

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-03-08 08:57:08 -06:00
WingKai Ho 0d134b4ad9 Update make-osd-partitions.yml
change
2017-03-08 17:46:37 +08:00
WingKai Ho e2d06068f4 Update make-osd-partitions.yml
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic, so keyword "skip" to fix it. 
Exit the playbook if the user not define devices  in both  host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
2017-03-06 15:43:09 +08:00
WingKai Ho 2861a483d7 Update make-osd-partitions.yml
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic err, so add keyword "skip" to fix it. 

Exit the playbook if the user not define devices  in both  host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
host_vars/default.yml
2017-03-06 10:33:22 +08:00
WingKai Ho 4cc489f2ba Update make-osd-partitions.yml
fix syntactic error
2017-03-03 17:26:53 +08:00
WingKai Ho 102befa927 Update make-osd-partitions.yml
Remove capital `L`
2017-03-02 14:06:41 +08:00
WingKai Ho c3f170e758 Update make-osd-partitions.yml
there is an extra space between 'custom' and 'layout'
2017-03-02 12:24:44 +08:00
WingKai Ho 2967772f6a Load a variable file for devices parrition
load device partition file in directory host_vars

1) if the user define host_vars/hostname.yml load the devices  partition on this file.
2) otherwise load host_vars/default.yml for default
2017-03-01 17:27:57 +08:00
yangyimincn 8b36cbac64 Update rolling_update.yml
The task waiting for the monitor to join the quorum... , the result for ceph -s | grep monmap only contain monmap, not included quorum:

# ceph -s --cluster ceph | grep monmap
     monmap e1: 3 mons at {sh-office-ceph-1=10.12.10.34:6789/0,sh-office-ceph-2=10.12.10.35:6789/0,sh-office-ceph-3=10.12.10.36:6789/0}

If want to get monitor, should use this:

# ceph -s --cluster ceph | grep election
            election epoch 80, quorum 0,1 sh-office-ceph-1,sh-office-ceph-2

ceph verison: 10.2.5
2017-02-28 16:56:02 +08:00
Sébastien Han 4639d89231 infra: fix cluster name detection
The previous command was returning /etc/ceph/ceph.conf, we only need
'ceph' to be returned.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-23 15:40:34 -05:00
Tobias Florek 931027e6f7 harmonize docker names
Created containers now are named more or less in the form of

    <ansible role>-<ansible_hostname>
2017-02-23 09:15:05 +01:00
Sébastien Han 3b633d5ddc purge-docker: re-implement zap devices
We now run the container and waits until it dies. Prior to this we were
stopping it before completion so not all the devices where zapped.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-21 15:56:09 -05:00
Sébastien Han a002508a91 purge-docker: also purge journal devices
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-21 15:54:36 -05:00
Andrew Schoen 5622c94e8b rolling-update: do not use upstart to stop mons when using systemd
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-21 12:31:26 -06:00
Shengjing Zhu 32923fd217 fix grep match pattern for osd ids
Some playbooks use [0-9]*, others use \d+$
The latter is more correct since cluster name may contain numbers.

Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>
2017-02-20 16:35:56 +08:00
Andrew Schoen 22f52a9dc6 purge-cluster: also purge dmcrypt dedicated journals
See: https://bugzilla.redhat.com/show_bug.cgi?id=1414647

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-15 10:27:17 -06:00
Andrew Schoen 3964929a56 rgw-standalone: also fetch keys from mons
This is to allow for ceph-installer usage of this playbook and
to ensure that you have the correct keys locally when bootstrapping.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-14 16:12:59 -06:00
Andrew Schoen c5f561a4e9 purge-cluster: remove calamari-server package
See: https://bugzilla.redhat.com/show_bug.cgi?id=1422134

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves rhbz#1422134
2017-02-14 09:24:02 -06:00
Sébastien Han c2f1dca823 docker: use a better method to pull images
We changed the way we declare image.
Prior to this patch we must have a "user/image:tag"
format, which is incompatible with non docker-hub registry where you
usually don't have a "user". On the docker hub a "user" is also
identified as a namespace, so for Ceph the user was "ceph".

Variables have been simplified with only:

* ceph_docker_image
* ceph_docker_image_tag

1. For docker hub images: ceph_docker_name: "ceph/daemon" will give
you the 'daemon' image of the 'ceph' user.

2. For non docker hub images: ceph_docker_name: "daemon" will simply
give you the "daemon" image.

Infrastructure playbooks have been modified as well.
The file group_vars/all.docker.yml.sample has been removed as well.
It is hard to maintain since we have to generate it manually. If
you want to configure specific variables for a specific daemon simply
edit group_vars/$DAEMON.yml

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1420207
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-09 17:57:18 +01:00
Andrew Schoen 5ddfc4f85c Merge pull request #1284 from ceph/BZ-1418980
purge-cluster: do not use ceph-detect-init
2017-02-08 08:46:03 -06:00
Andrew Schoen 4ff5908758 Merge pull request #1289 from ceph/fix-1286
rolling-update: detect init system properly
2017-02-08 06:31:30 -06:00
Andrew Schoen 865b4500dc purge-cluster: set a default value for fetch_directory if not defined
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:25:43 -06:00
Andrew Schoen adf6aee643 purge-cluster: remove all include tasks
Including variables from role defaults or files in a group_vars
directory relative to the playbook is a bad practice. We don't want to
do this because including these defaults at the task level overrides
values that would be set in a group_vars directory relative to the
inventory file, which is the correct usage if you wish to override
those default values.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:25:43 -06:00
Andrew Schoen 0476b24af1 purge-cluster: do not use ceph-detect-init
We can not always ensure that ceph-detect-init will be
present on the system.

See: https://bugzilla.redhat.com/show_bug.cgi?id=1418980

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-02-08 06:24:44 -06:00
Sébastien Han 8f94bfb498 rolling-update: detect init system properly
Simply use the ansible_service_mgr fact.

Closes: #1286

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-08 08:52:05 +01:00
Sébastien Han c34d0a9d28 purge-docker: force image deletion
even if non-runnin containers are using this image as a reference.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-07 22:14:21 +01:00
Sébastien Han 72cd9199ac purge: ability to purge client role
Signed-off-by: Sébastien Han <seb@redhat.com>
2017-02-07 22:14:18 +01:00
Guillaume Abrioux 76ddcbc271 Remove support of releases prior to Jewel.
According to #1216, we need to simply the code by removing the
support of anything before Jewel.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
2017-01-31 11:00:54 +01:00
Sébastien Han d5dd658cfa purge: do not stop ceph.target on each daemon
Doing this cause some all the daemons to go down at the same time. In a
scenario where we colocate a monitor and an osd, this osds will take
some time to go down which will make the 'umount' task fail.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han cb57a359ba purge: do not fail on purge ceph files
On systems running docker there is an issue with lxfs that results in
the find command returning 1 but actually did the job.
e.g: on a system with docker runnning find /var will give us the
following error:

find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/systemd-update-utmp.service/devices.deny':
Permission denied
find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/dev-random.mount/devices.allow':
Permission denied
...
...

However ceph files got deleted so we ignore the error.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han e371bd591c purge: fix ubuntu purge when not using systemd
We now rely on the cli tool ceph-detect-init which will tell us the init
system in used on the distribution. We do this instead of the previous
lookup for systemd unit files to call the right task depending on the
init system.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han 0e2e270ab2 purge: allow purge to run multiple times
with_items is evaluated before the when so in a second run where the
variable is empty if will fail with "'dict object' has no attribute
'stdout_lines'". To fix this we had a default array so with_items does
not fail and the task is skipped with the when.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-30 14:31:56 +01:00
Sébastien Han 0d2e580768 Merge pull request #1250 from ceph/new-tests
CI testing updates
2017-01-27 14:30:45 +01:00
Andrew Schoen d3cb8dba4e purge-cluster: fix failure when raw_multi_journal is not defined
Because the purge-cluster.yml playbook does not have access to the roles
default vars then we can be sure that raw_multi_journal is defined. For
example, if this was purging a dmcrypt journal then raw_multi_journal
might not be defined at all in group_vars/all.yml or
group_vars/osds.yml.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-27 05:23:17 -06:00
Ivan Font 0298354137 Update to use consistent docker extra env vars
This playbook was still referencing the old version of the
ceph_*_docker_extra_env but only for Ceph MONs and Ceph NFS. This
playbook was not kept up-to-date when updating the
ceph_*_docker_extra_env variables to add the '-e' option to docker.
That's because the addition of '-e' breaks this playbook as it requires
a comma separated list of variables for the 'env:' docker module
parameter. Therefore this change just makes the playbook consistently
broken by referencing the same variable throughout.
2017-01-26 15:57:34 -08:00
Andrew Schoen b2a6f095f1 purge-cluster: fix syntax when deleting dmcrypt devices
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-26 11:28:30 -06:00
Sébastien Han 73ca1a7a00 purge: remove dm-crypt devices
When running encrypted OSDs, an encrypted device mapper is used (because
created by the crypsetup tool). So before attempting to remove all the
partitions on a device we must delete all the encrypted device mappers,
then we can delete all the partitions.

Signed-off-by: Sébastien Han <seb@redhat.com>

 Please enter the commit message for your changes. Lines starting
2017-01-25 22:32:46 +01:00
Sébastien Han adeb3decf3 purge: remove zap_block_devs variable
The name of this variable was a bit confusing since its activation will
zap all the block devices no matter which osd scenario we are using.
Removing this variable and applying a condition on the OSD scenario is
now feasible and easier since we import group_vars variable files for
OSDs.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-18 10:55:01 +01:00
Sébastien Han b7fcbe5ca2 purge: cosmetic cleanup
Just applying our writing syntax convention in the playbook.

Signed-off-by: Sébastien Han <seb@redhat.com>
2017-01-18 10:53:21 +01:00
Andrew Schoen dd8389cdf7 purge-cluster: do not include ceph-osd and ceph-common defaults for osds
When purging OSDs we do not need to include these defaults as nothing in
the following tasks uses them. Also, it has the side effect of
overwriting any variables defined in group_vars files that are relative
to the inventory you are using with the default values. That behavior
was causing the CI tests to fail.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-10 16:57:58 -06:00
Andrew Schoen 321cea8ba9 purge-cluster: get journal partitions after zapping osd disks
In my testing zapping the osd disks deleted the journal
partitions, making the 'zap ceph journal partitions' task fail because
the partitions it found previously do not exist anymore.

This moves the task that finds the journal partitions after 'zap osd disks'
to catch any partitions ceph-disk might have missed.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-03 15:57:17 -06:00
Andrew Schoen c9e5914377 purge-cluster: use ignore_errors: true when including group_vars files
Using failed_when will still throw an exception and stop the playbook if
the file you're trying to include doesn't exist.

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2017-01-03 15:57:17 -06:00
Sébastien Han cb1c06901e Merge pull request #1171 from cbodley/wip-libcephfs2
bump package version to libcephfs2
2017-01-03 10:48:56 +01:00
Shengjing Zhu 2dc2e1d48c infrastructure playbook: add make osd partition
Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>
2016-12-15 22:03:38 +08:00
Casey Bodley acaf01ac17 purge-cluster: add new version of libcephfs2
the libcephfs version was bumped to 2, so we need to check for that as
well when we're removing all ceph packages

Signed-off-by: Casey Bodley <cbodley@redhat.com>
2016-12-09 16:54:06 -05:00
Sébastien Han 9dac195200 take-over: use more precise ceph.conf detection
Prior to this patch we were just looking for any *.conf file which
sometimes could results in multiple matches. The new command looks for a
.conf file that must contain [global] and 'fsid' patterns. This will
definitely get us the ceph.conf file. We can not directly use ceph.conf
because of a different cluster name.

Signed-off-by: Sébastien Han <seb@redhat.com>
2016-12-06 16:02:48 +01:00
Sébastien Han 4444d7d78e git: update gitignore
* ignore yml files in general
* refactor based on commit f8e043b6ea5ac4e886532d4f2f675c507b44b955 that
changed directory layouts

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ec5c6f5da566611c4e0b88f925cbd26dc90368d6)
2016-12-06 10:18:19 +01:00
chenyanshan 7eab2529ed this patch fix the regex pattern in infrastructure-playbooks/shrink-osd.yml when the osd's pid num is bigger than 9999
Signed-off-by: chenyanshan <yanshanchen@139.com>
2016-12-05 13:40:38 +08:00
Guillaume Abrioux b2b7222b3a [shrink-mon]: force playbook to fail if there is only one mon
The playbook will fail if only 1 mon is in the cluster
and advise to use the `purge-cluster` playbook instead.

Fix #1083
2016-11-25 11:20:11 +01:00
Guillaume Abrioux a680707f6f All `include_vars` need to have `*.yml`, `*.yaml` or `*.json` extension.
As introduced in the following PR:
- https://github.com/ansible/ansible/pull/17207
we need to refactor our code.
2016-11-24 14:03:49 +01:00
Sébastien Han 829e2b6598 Merge pull request #1077 from font/rolling_update
Support containerized rolling update
2016-11-22 16:56:46 +01:00
Sébastien Han 38e846e542 rolling_update: clarify "serial" usage
Prior to this commit the serial variable was poorly documented. Now we
are making clear that this value should be left untouched as the rolling
update mechanism should happen serially.

Solves: bz-1396742
Signed-off-by: Sébastien Han <seb@redhat.com>
2016-11-21 14:42:46 +01:00
Ken Dreyer adfdf6871e remove apache support for RGW
libfcgi is dead upstream (http://tracker.ceph.com/issues/16784)

The RGW developers intend to remove libfcgi support entirely before the
Luminous release.

Since libfcgi gets little-to-no developer attention or testing, remove
it entirely from ceph-ansible.
2016-11-18 13:13:12 -07:00
Ivan Font 255e816e28 Rolling update changes for containerized deployments
Separate out systemd restart tasks for containerized and
non-containerized deployments

Signed-off-by: Ivan Font <ifont@redhat.com>
2016-11-17 11:25:25 -08:00
Ivan Font e72f08080d Warn user when upgrading cluster with only one mon
Signed-off-by: Ivan Font <ifont@redhat.com>
2016-11-17 11:25:25 -08:00
Ivan Font 3ff17f1c8f Support containerized rolling update
- Update rolling update playbook to support containerized deployments
  for mons, osds, mdss, and rgws
- Skip checking if existing cluster is running when performing a rolling
  update
- Fixed bug where we were failing to start the mds container because it
  was missing the admin keyring. The admin keyring was missing because
  it was not being pushed from the mon host to the ansible host due to
  the keyring not being available before running the copy_configs.yml
  task include file. Now we forcefully wait for the admin keyring to be
  generated before continuing with the copy_configs.yml task include file
- Skip pre_requisite.yml when running on atomic host. This technically
  no longer requires specifying to skip tasks containing the with_pkg tag
- Add missing variables to all.docker.sample
- Misc. cleanup

Signed-off-by: Ivan Font <ifont@redhat.com>
2016-11-17 11:25:25 -08:00
Alfredo Deza 60ce2311b8 rolling_update: bump retries for osd_check/retries to 20 minutes
Signed-off-by: Alfredo Deza <adeza@redhat.com>

Resolves: rhbz#1395073
2016-11-17 10:43:58 -05:00
Sébastien Han 81a72cb85d Merge pull request #1068 from ceph/v2.2
moving to ansible v2.2 compatibility
2016-11-16 16:33:40 +01:00
Andrew Schoen 5f44b118b8 rolling update: stop RGWs before upgrade and start afterwards
Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves: rhbz#1394929
2016-11-14 14:47:12 -06:00
Andrew Schoen ded9d9dfd3 rolling update: stop MDSs before upgrading and start afterwards
Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves: rhbz#1394929
2016-11-14 14:47:12 -06:00
Andrew Schoen 5429c5f8c5 rolling update: stop MONs before upgrading and start afterwards
Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves: rhbz#1394929
2016-11-14 14:47:12 -06:00
Andrew Schoen 66f09bdac4 rolling update: stop OSDs before upgrading
This avoids a bug where OSDs are sometimes restarted twice on
upgrades which leaves the OSD process running but not marked up.

See:

https://bugzilla.redhat.com/show_bug.cgi?id=1394928
https://bugzilla.redhat.com/show_bug.cgi?id=1391675
https://bugzilla.redhat.com/show_bug.cgi?id=1394929

Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves: rhbz#1394929
2016-11-14 14:46:58 -06:00
Sébastien Han 991341f525 rolling_update: add variable to upgrade ceph
My stupid self removed this crucial variable here: 217ce3ca thinking it
was another hard coded variable import where this is actually the
trigger for the upgrade.

Closes: #1071

Signed-off-by: Sébastien Han <seb@redhat.com>
2016-11-04 17:31:02 +01:00
Sébastien Han a2fcd222d2 moving to ansible v2.2 compatibility
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-Authored-By: Julien Francoz julien@francoz.net
2016-11-04 10:09:38 +01:00
Andrew Schoen 8262ce5e40 rolling update: fix restarts of radosgw
Signed-off-by: Andrew Schoen <aschoen@redhat.com>

Resolves: rhbz#1391675
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
2016-11-03 14:36:42 -05:00
Eduard Egorov ab5c9f2a67 Adjust 'devices' list check for being not defined in purge-cluster playbook (see PR #1024)
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2016-11-03 06:36:42 +00:00
Leseb 899c8b309f Merge pull request #1024 from eduardegorov/egorove_make_devices_optional
Make {{ devices }} list optional
2016-11-02 15:12:02 +01:00
Eduard Egorov e5473ee565 Fix typos
Signed-off-by: Eduard Egorov <eduard.egorov@icl-services.com>
2016-11-01 12:29:21 +00:00