Three fixes:
- fix a typo in vagrant_variables that cause a networking issue for
containerized scenario.
- add containerized_deployment: true
- remove a useless block of code: the fact docker_exec_cmd is set in
ceph-defaults which is played right after.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add a gather-ceph-logs.yml which will log onto all the machines from
your inventory and will gather ceph logs. This is not intended to work
on containerized environments since the logs are stored in journald.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582280
Signed-off-by: Sébastien Han <seb@redhat.com>
The playbook has various improvements:
* run ceph-validate role before doing anything
* run ceph-fetch-keys only on the first monitor of the inventory list
* set noup flag so PGs get distributed once all the new OSDs have been
added to the cluster and unset it when they are up and running
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1624962
Signed-off-by: Sébastien Han <seb@redhat.com>
As of now, we should no longer support Jewel in ceph-ansible.
The latest ceph-ansible release supporting Jewel is `stable-3.1`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current regex had a limitation of 99 OSDs, now this limit has been
removed and regardless the number of OSDs they will all be collected.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1630430
Signed-off-by: Sébastien Han <seb@redhat.com>
Fixes the deprecation warning:
[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
using `result|search` use `result is search`.
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
Instead used "import_tasks" and "include_tasks" to tell whether tasks
must be included statically or dynamically.
Fixes: https://github.com/ceph/ceph-ansible/issues/2998
Signed-off-by: Rishabh Dave <ridave@redhat.com>
We need to copy this key into /etc/ceph so when ceph-docker-common runs
it can fetch it to the ansible server. Previously the task wasn't not
failing because `fail_on_missing` was False before 2.5, so now it's True
hence the failure.
Signed-off-by: Sébastien Han <seb@redhat.com>
Add missing call the ceph-handler role, otherwise we can't have
reference to variable registered from ceph-handler from other roles.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Similar to c13a3c3 we must allow scrubbing when running this playbook.
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag.
This commit allows to switch from non containerized to containerized
environment even while PGs are scrubbing.
Closes: #3182
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
38dc20e74b introduced a bug in the purge
playbooks because using `*` in `command` module doesn't work.
`/var/lib/ceph/*` files are not purged it means there is a leftover.
When trying to redeploy a cluster, it failed because monitor daemon was
detecting existing keyring, therefore, it assumed a cluster already
existed.
Typical error (from container output):
```
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16 /entrypoint.sh: Existing mon, trying to rejoin cluster...
Sep 26 13:18:16 mon0 docker[31316]: 2018-09-26 13:18:16.9323937f15b0d74700 -1 auth: unable to find a keyring on /etc/ceph/test.client.admin.keyring,/etc/ceph/test.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,:(2) No such file or directory
Sep 26 13:18:23 mon0 docker[31316]: 2018-09-26 13:18:23 /entrypoint.sh:
SUCCESS
```
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1633563
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Previous commit c13a3c3 has removed a condition.
This commit brings back this condition which is essential to ensure we
won't hit a false positive result in the `when` condition for the check
PGs task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
In cluster with a large number of PGs, it can be expected some of them
scrubbing, it's a normal operation.
Preventing from scrubbing operation force to set noscrub flag before a
rolling update which is a problem because it pauses an important data
integrity operation until the end of the rolling upgrade.
This commit allows an upgrade even while PGs are scrubbing.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1616066
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
- Adds loop in bash to satisfy the 1:n relation between `osd_hosts` and the
different device lists.
- Fixes some container name which were using the host hostname instead
of the actual container one.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Once the OSD is destroyed we also have to purge the associated devices,
this means purging journal, db , wal partitions too.
This now works for container and non-container.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1572933
Signed-off-by: Sébastien Han <seb@redhat.com>
ce1dd8d introduced the purge osd on containers but it was incorrect.
`resolve parent device` and `zap ceph osd disks` tasks must be delegated to
their respective OSD nodes.
Indeed, they were run on the ansible node, it means it was trying to
resolve parent devices from this node where it should be done on OSD
nodes.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1612095
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Sometime /var/lib/ceph is mounted on a device so we won't be able to
remove it (device busy) so let's remove its content only.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1615872
Signed-off-by: Sébastien Han <seb@redhat.com>
Add a message for when PV creation fails.
This message alerts users that FS/GPT/RAID
signatures could still on the device and the
reason for the failures.
`wipefs -a $device` needs to be run to fix this issue.
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Running 'osd set sortbitwise' when we detect a version 12 of Ceph is
wrong. When OSD are getting updated, even though the package is updated
they won't send their updated version (12) and will stick with 10 if the
command is not applied. So we have to check if OSD are sending a version
10 and then run the command to unlock the OSDs.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
Recently we renamed the group_name for iscsi iscsigws where previously
it was named iscsi-gws. Existing deployments with a host file section
with iscsi-gws must continue to work.
This commit adds the old group name as a backoward compatility, no error
from Ansible should be expected, if the hostgroup is not found nothing
is played.
Close: https://bugzilla.redhat.com/show_bug.cgi?id=1619167
Signed-off-by: Sébastien Han <seb@redhat.com>
We were using var_files long ago when default variables were not in
ceph-defaults, now the role exists this is not need. Moreover having
these two var files added:
- roles/ceph-defaults/defaults/main.yml
- group_vars/all.yml
Will create collision and override necessary variables.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1555305
Signed-off-by: Sébastien Han <seb@redhat.com>
The copy module does in fact do variable interpolation so we do not need
to use the template module or keep a template in the source.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
If a user decides to to use the lv_vars.yml file then it should fail
silenty so that configuration can be picked up from other places.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
The copy module will not expand the template and render the variables
included, so we must use template.
Creating a temp file and using it locally means that you must run the
playbook with sudo privledges, which I don't think we want to require.
This introduces a logfile_path variable that the user can use to control
where the logfile is written to, defaulting to the cwd.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
These playbooks create and tear down logical
volumes for OSD data on HDDs and for a bucket index and
journals on 1 NVMe device.
Users should follow the guidelines set in var/lv_vars.yaml
After the lv-create.yml playbook is run, output is
sent to /tmp/logfile.txt for copy and paste into
osds.yml
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Before running the upgrade, let's call systemd to collect unit names
instead of relaying on the device list. This is more accurate and fix
the osd_auto_discovery scenario too.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613626
Signed-off-by: Sébastien Han <seb@redhat.com>
In some case, use may mount a partition to /var/lib/ceph, and umount
it will be failure and no need to do so too.
Signed-off-by: Jeffrey Zhang <zhang.lei.fly@gmail.com>
upgrade RHCS 2 -> RHCS 3 will fail if cluster has still set
sortnibblewise,
it stay stuck on "TASK [waiting for clean pgs...]" as RHCS 3 osds will
not start if nibblewise is set.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600943
Signed-off-by: Sébastien Han <seb@redhat.com>
Let's try to avoid using dashes as testinfra needs to be able to read
the groups.
Typically, with iscsi-gws we can't add a marker for these iscsi nodes,
using an underscore fixes the issue.
Signed-off-by: Sébastien Han <seb@redhat.com>
this is kind of follow up on what has been made in #2560.
See #2560 and #2553 for details.
Closes: #2708
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The rolling upgrades playbook should have norebalance flag set for
OSDs upgrades to wait only for recovery.
Fixes: #2657
Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com>
Without the escalation, invocation from non-root
users with fail when accessing the rados config
object, or when attempting to log to /var/log
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
When running ansible2.4-update_docker_cluster there is an issue on the
"get current fsid" task. The current task only works for
non-containerized deployment but will run all the time (even for
containerized). This currently results in the following error:
TASK [get current fsid] ********************************************************
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-luminous-ansible2.4-update_docker_cluster/rolling_update.yml:214
Tuesday 22 May 2018 22:48:32 +0000 (0:00:02.615) 0:11:01.035 ***********
fatal: [mgr0 -> mon0]: FAILED! => {
"changed": true,
"cmd": [
"ceph",
"--cluster",
"test",
"fsid"
],
"delta": "0:05:00.260674",
"end": "2018-05-22 22:53:34.555743",
"rc": 1,
"start": "2018-05-22 22:48:34.295069"
}
STDERR:
2018-05-22 22:48:34.495651 7f89482c6700 0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000
2018-05-22 22:48:34.495684 7f89482c6700 0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).fault
This is not really representative on the real error since the 'ceph' cli is available on that machine.
On other environments we will have something like "command not found: ceph".
Signed-off-by: Sébastien Han <seb@redhat.com>
A customer has been facing an issue when trying to override
`monitor_interface` in inventory host file.
In his use case, all nodes had the same interface for
`monitor_interface` name except one. Therefore, they tried to override
this variable for that node in the inventory host file but the
take-over-existing-cluster playbook was failing when trying to generate
the new ceph.conf file because of undefined variable.
Typical error:
```
fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"}
```
Including variables like this `include_vars: group_vars/all.yml` prevent
us from overriding anything in inventory host file because it
overwrites everything you would have defined in inventory.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
During the transition from jewel non-container to container old ceph
units are disabled. ceph-disk can still remain in some cases and will
appear as 'loaded failed', this is not a problem although operators
might not like to see these units failing. That's why we remove them if
we find them.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846
Signed-off-by: Sébastien Han <seb@redhat.com>
In order to ensure there is no leftover after having purged a cluster,
we must wipe all partitions properly.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
there is some leftover on devices when purging osds because of a invalid
device list construction.
typical error:
```
changed: [osd3] => (item=/dev/sda sda1) => {
"changed": true,
"cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print | grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600",
"delta": "0:00:00.015188",
"end": "2018-05-16 12:41:40.408597",
"item": "/dev/sda sda1",
"rc": 0,
"start": "2018-05-16 12:41:40.393409"
}
STDOUT:
sgdisk -Z /dev/sda sda1
dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200
udevadm settle --timeout=600
STDERR:
Error: Could not stat device /dev/sda sda1 - No such file or directory.
```
the devices list in the task `resolve parent device` isn't built
properly because the command used to resolve the parent device doesn't
return the expected output
eg:
```
changed: [osd3] => (item=/dev/sda1) => {
"changed": true,
"cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")",
"delta": "0:00:00.013634",
"end": "2018-05-16 12:41:09.068166",
"item": "/dev/sda1",
"rc": 0,
"start": "2018-05-16 12:41:09.054532"
}
STDOUT:
/dev/sda sda1
```
For instance, it will result with a devices list like:
`['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']`
where we expect to have:
`['/dev/sda', '/dev/sdb', '/dev/sdc']`
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
During a minor update from a jewel to a higher jewel version (10.2.9 to
10.2.10 for example) osd flags don't get applied because they were done
in the mgr section which is skipped in jewel since this daemons does not
exist.
Moving the set flag section after all the mons have been updated solves
that problem.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1548071
Co-authored-by: Tomas Petr <tpetr@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
the role `ceph-mgr` that is played later in the playbook fails because
the destination path for the fetched keys is wrong.
This patch fix the destination path used in the task `fetch ceph mgr
key(s)` so there is no mismatch.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
{{ fsid }} points to {{ cluster_uuid.stdout }} which is not defined in
this part of the rolling_update playbook.
Since we need to call {{ fsid }} we must get the fsid and register it to
`cluster_uuid`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Until all the mons haven't been updated to Luminous, there is no way to
create a key. So we should do the key creation in the mon role only if
we are not part of an update.
If we are then the key creation is done after the mons upgrade to
Luminous.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1574995
Signed-off-by: Sébastien Han <seb@redhat.com>
In addition to b324c17 this commit fix the ceph uid for osd role in the
switch from non containerized to containerized playbook.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we don't do this, umounting devices declared like this
/dev/disk/by-id/ata-QEMU_HARDDISK_QM00001
will fail like:
umount: /dev/disk/by-id/ata-QEMU_HARDDISK_QM000011: mountpoint not found
Since we append '1' (partition 1), this won't work.
So we need to resolved the link to get something like /dev/sdb and then
append 1 to /dev/sdb1
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
Now if the service name contains nvme we know we need to remove the last
2 character instead of 1.
If nvme then osd_to_kill_disks is nvme0n1, we need nvme0
If ssd or hdd then osd_to_kill_disks is sda1, we need sda
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1561456
Signed-off-by: Sébastien Han <seb@redhat.com>
We know bindmount with the :z option at the end of the -v command so
this will basically run the exact same command as we used to run. So to
speak:
chcon -Rt svirt_sandbox_file_t /var/lib/ceph
Signed-off-by: Sébastien Han <seb@redhat.com>
This changes state to action and gives the options 'create'
or 'zap'. The zap parameter is also removed.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Added 'ignore_errors: true' to multiple lines which run docker commands; even in cases where docker is no longer installed. Because of this, certain tasks in the purge-docker-cluster.yml will cause the playbook to fail if re-run and stop the purge. This leaves behind a dirty environment, and a playbook which can no longer be run.
Fix Regex line 275: Sometimes 'list-units' will output 4 spaces between loaded+active. The update will account for both scenarios.
purge fetch_directory: in other roles fetch_directory is hard linked ex.: "{{ fetch_directory }}"/"{{ somedir }}". That being said, fetch_directory will never have a trailing slash in the all.yml so this task was never being run(causing failures when trying to re-deploy).
Signed-off-by: Randy J. Martinez <ramartin@redhat.com>
The `remove_packages` prompt is redundant to the `ireallymeanit` prompt
since it does exactly the same thing. I guess the only goal of this task
was to make a break to warn user about `--skip-tags=with_pkg` feature.
This warning should be part of the first prompt.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Having callback_plugins, and action plugins in random locations causes
a lot of disparity.
We should centralize this into one place in the plugins directory and
fix up the ansible.cfg to reflect this.
Additionally, since the ansible.cfg already reflects action_plugins, we
don't need a link to action_plugins in the base of the repository.
According to hostname configuration, the task waiting for mons to be in
quorum might fail.
The idea here is to look for both shortname and fqdn in
`ceph_health_raw` instead of just `ansible_hostname`
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1546127
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Variables set at the play level with ``vars`` do
not carry over into the next play in the playbook.
The var jewel_minor_update was set in a previous play but
used in this one and was failing because it was not defined.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1544029
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
These are better collected by ansible automatically. This would also
fail if the host_var file didn't exist.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
the `zap ceph osd disks` task should iter on `resolved_parent_device`
instead of `combined_devices_list` which contain only the base device
name (vs. full path name in `combined_devices_list`).
this fixes the issue where docker complain about container name because
of illegal characters such as `/` :
```
"/usr/bin/docker-current: Error response from daemon: Invalid container
name (ceph-osd-zap-magna074-/dev/sdb1), only [a-zA-Z0-9][a-zA-Z0-9_.-]
are allowed.","See '/usr/bin/docker-current run --help'."
""
```
having the the basename of the device path is enough for the container
name.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Use a nicer syntax for `local_action` tasks.
We used to have oneliner like this:
```
local_action: wait_for port=22 host={{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }} state=started delay=10 timeout=500 }}
```
The usual syntax:
```
local_action:
module: wait_for
port: 22
host: "{{ hostvars[inventory_hostname]['ansible_default_ipv4']['address'] }}"
state: started
delay: 10
timeout: 500
```
is nicer and kind of way to keep consistency regarding the whole
playbook.
This also fix a potential issue about missing quotation :
```
Traceback (most recent call last):
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 213, in <module>
main()
File "/tmp/ansible_wQtWsi/ansible_module_command.py", line 185, in main
rc, out, err = module.run_command(args, executable=executable, use_unsafe_shell=shell, encoding=None, data=stdin)
File "/tmp/ansible_wQtWsi/ansible_modlib.zip/ansible/module_utils/basic.py", line 2710, in run_command
File "/usr/lib64/python2.7/shlex.py", line 279, in split
return list(lex) File "/usr/lib64/python2.7/shlex.py", line 269, in next
token = self.get_token()
File "/usr/lib64/python2.7/shlex.py", line 96, in get_token
raw = self.read_token()
File "/usr/lib64/python2.7/shlex.py", line 172, in read_token
raise ValueError, "No closing quotation"
ValueError: No closing quotation
```
writing `local_action: shell echo {{ fsid }} | tee {{ fetch_directory }}/ceph_cluster_uuid.conf`
can cause trouble because it's complaining with missing quotes, this fix solves this issue.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1510555
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is a typo caused by leftover.
It was previously written like this :
`shell: echo /dev/$(lsblk -no pkname "{{ item }}") }}")`
and has been rewritten to :
`shell: $(lsblk --nodeps -no pkname "{{ item }}") }}")`
because we are appending later the '/dev/' in the next task.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1540137
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
These tasks are needed only when upgrading to luminous.
They are not needed in Jewel minor upgrade and by the way, they fail because
`ceph versions` command doesn't exist.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
When update from a minor Jewel version to another, the playbook will
fail on the task "fail if no mgr host is present in the inventory".
This now can be worked around by running Ansible with_items
-e jewel_minor_update=true
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1535382
Signed-off-by: Sébastien Han <seb@redhat.com>
Using `lsblk` to resolv the parent device is better than just removing the last
char when passing it to the zap container.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Avoid using `awk` to get the different devices from the partlabel.
Using `blkid` is more readable.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
If we're working with a jewel cluster then this service will not exist.
This is mainly a problem with CI testing because our tests are setup to
work with both jewel and luminous, meaning that eventhough we want to
test jewel we still have a nfs-ganesha host in the test causing these
tasks to run.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
The rolling update playbook was attempting to stop the
nfs-ganesha service on nodes where jewel is still installed.
The nfs-ganesha service did not exist in jewel so the task fails.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
`bluestore_purge_osd_non_container` scenario is failing because it
keeps old osd_uuid information on devices and cause the `ceph-disk activate`
to fail when trying to redeploy a new cluster after a purge.
typical error seen :
```
2017-12-13 14:29:48.021288 7f6620651d00 -1
bluestore(/var/lib/ceph/tmp/mnt.2_3gh6/block) _check_or_set_bdev_label
bdev /var/lib/ceph/tmp/mnt.2_3gh6/block fsid
770080e2-20db-450f-bc17-81b55f167982 does not match our fsid
f33efff0-2f07-4203-ad8d-8a0844d6bda0
```
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to ask for root on the local action. This will prompt
for a password the current user is not part of sudoers. That's
unnecessary anyways.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1516947
Signed-off-by: Sébastien Han <seb@redhat.com>
this task hangs because `{{ inventory_hostname }}` doesn't resolv to an
actual ip address.
Using `hostvars[inventory_hostname]['ansible_default_ipv4']['address']`
should fix this because it will reach the node with its actual IP
address.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
purge-docker-cluster must remove all osd_disk_prepare logs in
`{{ ceph_osd_docker_run_script_path }}`, otherwise if you purge your
cluster and try to redeploy it, osds will fail to start since because it
will try to retrieve find a partition uuid which doesn't exist.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1510470
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The ansible inventory could have more than just ceph-ansible hosts, so
we shouldnt use "hosts: all", also only grab one file when getting
the ceph cluster name instead of failing when there is more than one
file in /etc/ceph. Also fix location of the ceph.conf template
Rebooting servers is really intrusive and perhaps this is not what the
operator wants. So we disable the reboot by default now. Note that the
reboot might not happen all the time.
It can be enabled by default by running the purge playbook with -e
reboot_osd_node=True
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1505011
Signed-off-by: Sébastien Han <seb@redhat.com>
During purge osd, the containers are not stopped because of a typo, as a
result, all the devices can't be unmounted later.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
stable-3.0 brought numerous changes in ceph-ansible variables, this PR
aims to maintain backward compatibility for someone running stable-2.2
upgrading to stable-3.0 but keeps its groups_vars untouched.
We will then determine the right options to make sure the upgrade works
but we are expecting that new variables should be used.
We will drop this in a near future, maybe 3.1 or 3.2.
Signed-off-by: Sébastien Han <seb@redhat.com>
nfs nodes can't be upgraded from jewel to luminous because ceph-nfs role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
package is upgraded in `ceph-nfs` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
mgr nodes can't be upgraded from jewel to luminous because ceph-mgr role
is skipped because of the condition `when:
"ceph_release_num[ceph_release] >= ceph_release_num.luminous"`. Indeed,
ceph-mgr package is upgraded in `ceph-mgr` role, therefore,
`ceph_release` is still set to the old version. It means the when can't
be satisfied.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 302e563601cd6820b1ae44fabdfb1506688c7c9b)
- Add upgrade support for rbd mirror and nfs daemons.
- Only works with systemd (remove sysvinit and upstart occurence)
- A bit of cleanup
Signed-off-by: Sébastien Han <seb@redhat.com>
This patch changes the `when:` keys so that they have no jinja2
delimiters. This avoids Ansible warnings which could turn into
errors in a future Ansible release.
Ansible throws warnings when using yum/dnf/rpm with the command
module:
[WARNING]: Consider using yum module rather than running yum
This patch adds the `warn: no` argument to suppress the warnings
in the Ansible output.
This playbook can replace failed OSD in containerized and
non-containerized env.
The current limitation is that it won't allow you to choose between
filestore/bluestore and will do collocation as well.
Signed-off-by: Sébastien Han <seb@redhat.com>
Using a condition when osd_scenario == 'non-collocated' was wrong since
these partitions can be collocated on a single device also. Removing the
check makes the purge of these partitions.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1499871
Signed-off-by: Sébastien Han <seb@redhat.com>
The current inclusion of purge-iscsi-gateways.yml in purge-cluster.yml
is not working well and blocking the CI too. So removing it from
purge-cluster.yml and re-add the original purge-iscsi-gateways.yml.
Signed-off-by: Sébastien Han <seb@redhat.com>
Use the pg check before doing the pg check, not on the quorum check.
Also never quote int when doing comparaison.
Signed-off-by: Sébastien Han <seb@redhat.com>
The shell wildcard expansion of non-existing paths fails on zsh making
the whole script fail. We can use file module with with_fileglob to
alleviate the problem instead.
Signed-off-by: Boris Ranto <branto@redhat.com>
The systemd can't stop services if the unit files were removed before
the cluster was purged. We should just ignore these.
Signed-off-by: Boris Ranto <branto@redhat.com>
Also we now play ceph-config to have everything being generated for new
daemons bootstrap during upgrade.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1497959
Signed-off-by: Sébastien Han <seb@redhat.com>
Even if the task is skipped, ansible registers the var as 'skipped' so
this task the task using this variable for its next usage.
Signed-off-by: Sébastien Han <seb@redhat.com>
rbd-mirror containers are not stopped in purge-docker-cluster playbook
because of the wrong name used.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
unti now, mgr nodes are not managed by purge-cluster.yml, therefore it
breaks scenario like purge_cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The old name is used in `rolling_update.yml` and
`purge-docker-cluster.yml`, it breaks the
`test_rgw_service_is_running()` test.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commits allows us to run
switch-from-non-containerized-to-containerized-ceph-daemons.yml multiple
times.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489353
Signed-off-by: Sébastien Han <seb@redhat.com>
If devices is passed through an extra var this register won't work so
let's only register the var is devices is not defined.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1489099
Signed-off-by: Sébastien Han <seb@redhat.com>
Prior command was avoiding the lockbox mountpoint and the playbook was
failing with:
rmtree failed: [Errno 30] Read-only file system:
'/var/lib/ceph/osd-lockbox/4e9d8052-87c2-4fde-a56c-b8c108a3eefc/key-management-mode'
Signed-off-by: Sébastien Han <seb@redhat.com>
we need to force the value of `docker` variable which is initially set
to `false` since it's a migration from non-containerized to
containerized cluster.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We must mask the image so we are sure that even if the system reboots
then the OSDs won't start.
Also remove Ceph udev rules if found on the system prior to deploy
containers. If we don't do this we are exposed to conflicts between udev
rules and sytemd unit files.
Also add the CI will now test the migration from a non-containerized cluster to a
containerized cluster.
Signed-off-by: Sébastien Han <seb@redhat.com>
Monitor removal from the monmap is not immediate, so let's wait a little
bit and then fail if the monitor is still in the monmap.
We try twice in total with 10 sec intervals.
Signed-off-by: Sébastien Han <seb@redhat.com>
Move untested/with few confidence playbooks in a untested-by-ci
directory.
Also removing this directory from the package build.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1461551
Signed-off-by: Sébastien Han <seb@redhat.com>
Prior to this patch, we were applying the osd flags like this:
"
General pre tasks
Set flags
Upgrade OSDs on a host
Unset flags <-- this triggers pending scrub to start
Set flags
Upgrade OSDs on a hosts
Unset flags <-- this triggers pending scrub to start
.
.
.
General post tasks
"
Now instead, we apply the flag once before starting the OSD update and
unset them once the last OSD is finished.
"
General pre tasks
Set flags and wait for any scrubs to finish
Upgrade OSDs on a host
Upgrade OSDs on a host
.
.
.
Unset flags
General post tasks
"
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1450754
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
In our test case we don't have any pgs, thus the check fails. The check
always returns an empty array, which makes the comparaison failing.
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit eases the use of the
infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
playbook. We basically run it with a couple of pre-tasks and then we let
the playbook run the docker roles.
It obviously expect to have proper variables configured in order to
work.
Signed-off-by: Sébastien Han <seb@redhat.com>
In the switch to containers migration there were broken references
to ceph_mon_docker_subnet variable, replaced with public_network.
Also fixes references to ceph_mon_docker_extra_env setting for it
a default as it could be undefined.
There is only two main scenarios now:
* collocated: everything remains on the same device:
- data, db, wal for bluestore
- data and journal for filestore
* non-collocated: dedicated device for some of the component
Signed-off-by: Sébastien Han <seb@redhat.com>
This will give us more flexibility and avoid a lot of useless when
skipping all tasks from a non-desired role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
remove `ceph_mon_docker_interface` and use `monitor_interface` instead
for both containerized and non-containerized deployment.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The new test in the checks PGs are no longer working on distributions
where /bin/sh isn't linked to /bin/bash.
Fix: #1619
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Avoid screen scrapping by rewriting `waiting for clean pgs` tasks like it is
done in 304de48.
Use the json output returned by `ceph -s` instead
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We need to include ceph_docker_registry when removing containers/images
because if we don't it will assume docker.io which is not always where
the image originated from, causing the playbook to fail.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
If we're purging a containerized cluster that did not use the
raw_multi_journal OSD scenario then raw_journal_devices will not be
defined which causes the playbook to fail.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1455187
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
`ceph-docker-common`:
At the moment there is a lot of duplicated tasks in each
`./roles/ceph-<role>/tasks/docker/main.yml` that could be refactored in
`./roles/ceph-docker-common/tasks/main.yml`.
`*_containerized_deployment` variables:
All `*_containerized_deployment` have been refactored to a single
variable `containerized_deployment`
duplicate `cephx` variables in `group_vars/* have been removed.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Problem: we are delegating the set/unset flag to a monitor node but we
try to call an osd container
Solution: use the right container name.
Signed-off-by: Sébastien Han <seb@redhat.com>
Without this, we don't test the mgr role so we need to add it.
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Problem: too many different commands to do the same thing. The 'cut'
command on infrastructure-playbooks/purge-cluster.yml was also wrong.
This sed command from osixia in ceph-docker
https://github.com/ceph/ceph-docker/pull/580/ addresses all the
scenarios.
Signed-off-by: Sébastien Han <seb@redhat.com>
Doing so will override any values set for these in the group_vars
directory relative to the users inventory.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Doing so at playbook level overrides whatever values might be set for
these in the user's group_vars directory that's relative to their
inventory.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This has the behavior of overriding custom values set in group_vars.
I've added defaults to the rest of the group names so that if they are
not overridden in group_vars then defaults will be used.
See: https://bugzilla.redhat.com/show_bug.cgi?id=1354700
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic, so keyword "skip" to fix it.
Exit the playbook if the user not define devices in both host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
When ansible do not load the file host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml it will show syntactic err, so add keyword "skip" to fix it.
Exit the playbook if the user not define devices in both host_vars/{{ ansible_hostname }}.yml and host_vars/default.yml
host_vars/default.yml
load device partition file in directory host_vars
1) if the user define host_vars/hostname.yml load the devices partition on this file.
2) otherwise load host_vars/default.yml for default
The task waiting for the monitor to join the quorum... , the result for ceph -s | grep monmap only contain monmap, not included quorum:
# ceph -s --cluster ceph | grep monmap
monmap e1: 3 mons at {sh-office-ceph-1=10.12.10.34:6789/0,sh-office-ceph-2=10.12.10.35:6789/0,sh-office-ceph-3=10.12.10.36:6789/0}
If want to get monitor, should use this:
# ceph -s --cluster ceph | grep election
election epoch 80, quorum 0,1 sh-office-ceph-1,sh-office-ceph-2
ceph verison: 10.2.5
We now run the container and waits until it dies. Prior to this we were
stopping it before completion so not all the devices where zapped.
Signed-off-by: Sébastien Han <seb@redhat.com>
Some playbooks use [0-9]*, others use \d+$
The latter is more correct since cluster name may contain numbers.
Signed-off-by: Shengjing Zhu <zsj950618@gmail.com>
This is to allow for ceph-installer usage of this playbook and
to ensure that you have the correct keys locally when bootstrapping.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
We changed the way we declare image.
Prior to this patch we must have a "user/image:tag"
format, which is incompatible with non docker-hub registry where you
usually don't have a "user". On the docker hub a "user" is also
identified as a namespace, so for Ceph the user was "ceph".
Variables have been simplified with only:
* ceph_docker_image
* ceph_docker_image_tag
1. For docker hub images: ceph_docker_name: "ceph/daemon" will give
you the 'daemon' image of the 'ceph' user.
2. For non docker hub images: ceph_docker_name: "daemon" will simply
give you the "daemon" image.
Infrastructure playbooks have been modified as well.
The file group_vars/all.docker.yml.sample has been removed as well.
It is hard to maintain since we have to generate it manually. If
you want to configure specific variables for a specific daemon simply
edit group_vars/$DAEMON.yml
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1420207
Signed-off-by: Sébastien Han <seb@redhat.com>
Including variables from role defaults or files in a group_vars
directory relative to the playbook is a bad practice. We don't want to
do this because including these defaults at the task level overrides
values that would be set in a group_vars directory relative to the
inventory file, which is the correct usage if you wish to override
those default values.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
According to #1216, we need to simply the code by removing the
support of anything before Jewel.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Doing this cause some all the daemons to go down at the same time. In a
scenario where we colocate a monitor and an osd, this osds will take
some time to go down which will make the 'umount' task fail.
Signed-off-by: Sébastien Han <seb@redhat.com>
On systems running docker there is an issue with lxfs that results in
the find command returning 1 but actually did the job.
e.g: on a system with docker runnning find /var will give us the
following error:
find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/systemd-update-utmp.service/devices.deny':
Permission denied
find:
'/var/lib/lxcfs/cgroup/devices/lxc/x1/system.slice/dev-random.mount/devices.allow':
Permission denied
...
...
However ceph files got deleted so we ignore the error.
Signed-off-by: Sébastien Han <seb@redhat.com>
We now rely on the cli tool ceph-detect-init which will tell us the init
system in used on the distribution. We do this instead of the previous
lookup for systemd unit files to call the right task depending on the
init system.
Signed-off-by: Sébastien Han <seb@redhat.com>
with_items is evaluated before the when so in a second run where the
variable is empty if will fail with "'dict object' has no attribute
'stdout_lines'". To fix this we had a default array so with_items does
not fail and the task is skipped with the when.
Signed-off-by: Sébastien Han <seb@redhat.com>
Because the purge-cluster.yml playbook does not have access to the roles
default vars then we can be sure that raw_multi_journal is defined. For
example, if this was purging a dmcrypt journal then raw_multi_journal
might not be defined at all in group_vars/all.yml or
group_vars/osds.yml.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
This playbook was still referencing the old version of the
ceph_*_docker_extra_env but only for Ceph MONs and Ceph NFS. This
playbook was not kept up-to-date when updating the
ceph_*_docker_extra_env variables to add the '-e' option to docker.
That's because the addition of '-e' breaks this playbook as it requires
a comma separated list of variables for the 'env:' docker module
parameter. Therefore this change just makes the playbook consistently
broken by referencing the same variable throughout.
When running encrypted OSDs, an encrypted device mapper is used (because
created by the crypsetup tool). So before attempting to remove all the
partitions on a device we must delete all the encrypted device mappers,
then we can delete all the partitions.
Signed-off-by: Sébastien Han <seb@redhat.com>
Please enter the commit message for your changes. Lines starting
The name of this variable was a bit confusing since its activation will
zap all the block devices no matter which osd scenario we are using.
Removing this variable and applying a condition on the OSD scenario is
now feasible and easier since we import group_vars variable files for
OSDs.
Signed-off-by: Sébastien Han <seb@redhat.com>
When purging OSDs we do not need to include these defaults as nothing in
the following tasks uses them. Also, it has the side effect of
overwriting any variables defined in group_vars files that are relative
to the inventory you are using with the default values. That behavior
was causing the CI tests to fail.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
In my testing zapping the osd disks deleted the journal
partitions, making the 'zap ceph journal partitions' task fail because
the partitions it found previously do not exist anymore.
This moves the task that finds the journal partitions after 'zap osd disks'
to catch any partitions ceph-disk might have missed.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
Using failed_when will still throw an exception and stop the playbook if
the file you're trying to include doesn't exist.
Signed-off-by: Andrew Schoen <aschoen@redhat.com>
the libcephfs version was bumped to 2, so we need to check for that as
well when we're removing all ceph packages
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Prior to this patch we were just looking for any *.conf file which
sometimes could results in multiple matches. The new command looks for a
.conf file that must contain [global] and 'fsid' patterns. This will
definitely get us the ceph.conf file. We can not directly use ceph.conf
because of a different cluster name.
Signed-off-by: Sébastien Han <seb@redhat.com>
* ignore yml files in general
* refactor based on commit f8e043b6ea5ac4e886532d4f2f675c507b44b955 that
changed directory layouts
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit ec5c6f5da566611c4e0b88f925cbd26dc90368d6)
Prior to this commit the serial variable was poorly documented. Now we
are making clear that this value should be left untouched as the rolling
update mechanism should happen serially.
Solves: bz-1396742
Signed-off-by: Sébastien Han <seb@redhat.com>
libfcgi is dead upstream (http://tracker.ceph.com/issues/16784)
The RGW developers intend to remove libfcgi support entirely before the
Luminous release.
Since libfcgi gets little-to-no developer attention or testing, remove
it entirely from ceph-ansible.
- Update rolling update playbook to support containerized deployments
for mons, osds, mdss, and rgws
- Skip checking if existing cluster is running when performing a rolling
update
- Fixed bug where we were failing to start the mds container because it
was missing the admin keyring. The admin keyring was missing because
it was not being pushed from the mon host to the ansible host due to
the keyring not being available before running the copy_configs.yml
task include file. Now we forcefully wait for the admin keyring to be
generated before continuing with the copy_configs.yml task include file
- Skip pre_requisite.yml when running on atomic host. This technically
no longer requires specifying to skip tasks containing the with_pkg tag
- Add missing variables to all.docker.sample
- Misc. cleanup
Signed-off-by: Ivan Font <ifont@redhat.com>
My stupid self removed this crucial variable here: 217ce3ca thinking it
was another hard coded variable import where this is actually the
trigger for the upgrade.
Closes: #1071
Signed-off-by: Sébastien Han <seb@redhat.com>
- Updates to allow running infrastructure-playbooks both from within its
directory or root directory of ceph-ansible.
Signed-off-by: Ivan Font <ifont@redhat.com>
- Separated out one large playbook into multiple playbooks to run
host-type by host-type i.e. mdss, rgws, rbdmirrors, nfss, osds, mons.
- Combined common tasks into one shared task for all hosts where
applicable
- Fixed various bugs
Signed-off-by: Ivan Font <ivan.font@redhat.com>
Prior to this change we were purging all the partitions on the device
when using the raw_journal_devices scenario.
This was breaking deployments where other partitions are used for other
purposes (ie: OS system).
Signed-off-by: Sébastien Han <seb@redhat.com>
This commit introduces the ability to configure delays and retries for
cluster health checks, for both monitors and OSDs.
Signed-off-by: Sébastien Han <seb@redhat.com>
Users have reported this task to hang. Since this command is not
required to perform the upgrade, we remove it.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Add all relevant group_vars files in containerized purge cluster
playbook and ignore errors if file may not exist.
- Also fixing indentation issues.
Signed-off-by: Ivan Font <ivan.font@redhat.com>
Since we have a couple of infrastructure related playbooks
(additionnally to the roles we are using to deploy Ceph), it makes sense
to have them located in a separate directory.
Signed-off-by: Sébastien Han <seb@redhat.com>