ceph-ansible

Commit Graph

Author	SHA1	Message	Date
Guillaume Abrioux	156daf1018	tests: increase memory to 1024Mb for centos7_cluster scenario We see more and more failures like `fatal: [mon0]: UNREACHABLE! => {}` in the `centos7_cluster` scenario. Since we have 30Gb RAM on hypervisors, we can give monitors a bit more RAM. By the way, nodes in the containerized cluster testing scenario already have 1024Mb of memory allocated. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `bbb8691335`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-11 17:55:04 +02:00
Guillaume Abrioux	18e794217d	client: keyrings aren't created when single client node combining `run_once: true` with `inventory_hostname == groups.get(client_group_name) \| first` might cause bug when the only node being run is not the first in the group. In a deployment with a single client node it might cause issue because sometimes keyring won't be created since the task could be definitively skipped. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `090ecff94e`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-08 15:34:34 +02:00
Guillaume Abrioux	fd10fcedff	tests: update ooo inventory hostfile Update the inventory host file for the tripleo testing scenario so it uses the same parameters as tripleo CI. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `28d21b4e9c`)	2018-06-07 18:27:00 +02:00
Guillaume Abrioux	c35203da88	client: add a default value for keyring file Potential error if someone doesn't pass the mode in the `keys` dict for client nodes: ``` fatal: [client2]: FAILED! => {} MSG: The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'mode' The error appears to have been in '/home/guits/ceph-ansible/roles/ceph-client/tasks/create_users_keys.yml': line 117, column 3, but may be elsewhere in the file depending on the exact syntax problem. The offending line appears to be: - name: get client cephx keys ^ here exception type: <class 'ansible.errors.AnsibleUndefinedVariable'> exception: 'dict object' has no attribute 'mode' ``` Adding a default value will prevent the deployment from failing because of this. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `8a653cacd5`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-07 18:27:00 +02:00
Guillaume Abrioux	7bcb005e6b	client: use dummy created container when there is no mon in inventory the `docker_exec_cmd` fact set in client role when there is no monitor in inventory is wrong, `ceph-client-{{ hostname }}` is never created so it will fail anyway. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `7b156deb67`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-07 13:48:33 +02:00
Guillaume Abrioux	48e7cc506c	tests: improve mds tests the expected number of mds daemon consist of number of daemons that are 'up' + number of daemons 'up:standby'. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `c94ada69e8`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-07 17:45:29 +08:00
Guillaume Abrioux	9d50874d38	osd: copy openstack keys over to all mon When configuring openstack, the created keyrings aren't copied over to all monitors nodes. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1588093 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `433ecc7cbc`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-07 10:59:46 +02:00
Guillaume Abrioux	c533556935	rolling_update: fix facts gathering delegation this is kind of follow up on what has been made in #2560. See #2560 and #2553 for details. Closes: #2708 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `232a16d77f`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-07 09:46:42 +02:00
Sébastien Han	5199300a6b	test: do not always copy admin key The admin key must be copied on the osd nodes only when we test the shrink scenario. Shrink relies on ceph-disk commands that require the admin key on the node where it's being executed. Now we only copy the key when running on the shrink-osd scenario. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `41b4632abc`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-06 10:50:56 +02:00
Patrick Donnelly	4c5042ae28	change max_mds default to 1 Otherwise, with the removal of mds_allow_multimds, the default of 3 will be set on every new FS. Introduced by: `c8573fe0d7` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1583020 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> (cherry picked from commit `91f9da530f`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-06 10:47:04 +02:00
Guillaume Abrioux	f940163ab5	tests: fix rgw tests `41b4632` has introduced a change in functional tests. Since the admin keyring isn't copied on rgw nodes anymore in tests, let's use the rgw keyring to run them. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `47276764f7`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-06 15:41:13 +08:00
Guillaume Abrioux	a558d8aef3	rgw: refact rgw pools creation Refact of `8704144e31` There is no need to have duplicated tasks for this. The rgw pools creation should be delegated to a monitor node so we don't have to care whether the admin keyring is present on the rgw node. By the way, only one task is needed to create the pools, we just need to use the `docker_exec_cmd` fact already defined in `ceph-defaults` to achieve it. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1550281 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `2cf06b515f`)	2018-06-06 11:30:29 +08:00
jtudelag	36b2c4a527	rgws: renames create_pools variable with rgw_create_pools. Renamed to be consistent with the role (rgw) and have a meaningful name. Signed-off-by: Jorge Tudela <jtudelag@redhat.com> (cherry picked from commit `600e1e2c26`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-05 18:56:24 +02:00
jtudelag	1d94d12c9f	Adds RGWs pool creation to containerized installation. ceph command has to be executed from one of the monitor containers if not admin copy present in RGWs. Task has to be delegated then. Adds test to check proper RGW pool creation for Docker container scenarios. Signed-off-by: Jorge Tudela <jtudelag@redhat.com> (cherry picked from commit `8704144e31`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-06-05 18:56:24 +02:00
Guillaume Abrioux	cd6ef8e9ec	tests: skip disabling fastest mirror detection on atomic host There is no need to execute this task on atomic hosts. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `f0cd4b0651`)	2018-06-05 16:02:54 +02:00
Erwan Velu	51a7eb5a70	ceph-defaults: Enable local epel repository During the tests, the remote epel repository is generating a lots of errors leading to broken jobs (issue #2666) This patch is about using a local repository instead of a random one. To achieve that, we make a preliminary install of epel-release, remove the metalink and enforce a baseurl to our local http mirror. That should speed up the build process but also avoid the random errors we face. This patch is part of a patch series that tries to remove all possible yum failures. Signed-off-by: Erwan Velu <erwan@redhat.com> (cherry picked from commit `493f615eae`)	2018-06-05 16:02:54 +02:00
Andy McCrae	c90535ecce	Fix template reference for ganesha.conf We can simply reference the template name since it exists within the role that we are calling. We don't need to check the ANSIBLE_ROLE_PATH or playbooks directory for the file. Signed-off-by: Lionel Sausin <ls@initiatives.fr>	2018-06-04 10:21:17 +02:00
Andrew Schoen	53dfd050c5	ceph-defaults: add the nautilus 14.x entry to ceph_release_num The first 14.x tag has been cut so this needs to be added so that version detection will still work on the master branch of ceph. Fixes: https://github.com/ceph/ceph-ansible/issues/2671 Signed-off-by: Andrew Schoen <aschoen@redhat.com> (cherry picked from commit `c2423e2c48`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-01 19:21:59 +02:00
Guillaume Abrioux	28319698e2	mons: move set_fact of openstack_keys in ceph-osd Since the openstack_config.yml has been moved to `ceph-osd` we must move this `set_fact` in ceph-osd otherwise the tasks in `openstack_config.yml` using `openstack_keys` will actually use the defaults value from `ceph-defaults`. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1585139 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `aae37b44f5`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-01 18:37:18 +02:00
Guillaume Abrioux	9c91bb8b2c	osds: wait for osds to be up before creating pools This is a follow up on #2628. Even with the openstack pools creation moved later in the playbook, there is still an issue because OSDs are not all UP when trying to create pools. Adding a task which checks for all OSDs to be UP with a `retries/until` condition should definitively fix this issue. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `9d5265fe11`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-01 17:07:35 +02:00
Guillaume Abrioux	01701eed96	Makefile: followup on #2585 Fix a typo in `tag` target, double quote are missing here. Without them, the `make tag` command fails like this: ``` if [[ "v3.0.35" == ]]; then \ echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; \ exit 1; \ fi /bin/sh: -c: line 0: unexpected argument `]]' to conditional binary operator /bin/sh: -c: line 0: syntax error near `;' /bin/sh: -c: line 0: `if [[ "v3.0.35" == ]]; then echo "e5f2df8 on stable-3.0 is already tagged as v3.0.35"; exit 1; fi' make: *** [tag] Error 2 ``` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `0b67f42feb`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-06-01 12:51:53 +02:00
Ken Dreyer	b035697246	Makefile: add "make tag" command Add a new "make tag" command. This automates some common operations: 1) Automatically determine the next Git tag version number to create. For example: "3.2.0beta1 -> "3.2.0beta2" "3.2.0rc1 -> "3.2.0rc2" "3.2.0" -> "3.2.1" 2) Create the Git tag, and print instructions for the user to push it to GitHub. 3) Sanity check that HEAD is a stable-* branch or master (bail on everything else). 4) Sanity check that HEAD is not already tagged. Note, we will still need to tag manually once each time we change the format, for example when moving from tagging "betas" to tagging "rcs", or "rcs" to "stable point releases". Signed-off-by: Ken Dreyer <kdreyer@redhat.com> Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `fcea568495`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-31 11:15:12 +02:00
Sébastien Han	2ac720d2c2	rgw: container add option to configure multi-site zone You can now use RGW_ZONE and RGW_ZONEGROUP on each rgw host from your inventory and assign them a value. Once the rgw container starts it'll pick the info and add itself to the right zone. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1551637 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `1c084efb3c`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-31 11:10:01 +02:00
Guillaume Abrioux	4f0850adf1	mon: remove check on pg_num for cephfs_pools It should have been backported from `29a9dff` but for better clarity I think it's better to create a new commit for this. `c68126d6` aims to not make `pgs` attribute mandatory for each element of `cephfs_pools`. Therefore, we must remove the check in `roles/ceph-mon/tasks/check_mandatory_vars.yml`. This task has been removed by `29a9dff` but I've chosen to not backport this commit since it's part of a bunch of commits belonging to a PR implementing `ceph-validate` role. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-30 21:45:03 +02:00
Guillaume Abrioux	4328e0b42a	mdss: do not make pg_num a mandatory params When playing ceph-mds role, mon nodes have set a fact with the default pg num for osd pools, we can simply default to this value for cephfs pools (`cephfs_pools` variable). At the moment the variable definition for `cephfs_pools` looks like: ``` cephfs_pools: - { name: "{{ cephfs_data }}", pgs: "" } - { name: "{{ cephfs_metadata }}", pgs: "" } ``` and we have a task in `ceph-validate` to ensure `pgs` has been set to a valid value. We could simply avoid this check by setting the default value of `pgs` to `hostvars[groups[mon_group_name][0]]['osd_pool_default_pg_num']` and let to users the possibility to override this value. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1581164 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `c68126d6fd`)	2018-05-30 21:45:03 +02:00
Guillaume Abrioux	77b02fe720	tests: fix broken symlink `requirements2.5.txt` is pointing to `tests/requirements2.4.txt` while it should point to `requirements2.4.txt` since they are in the same directory. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `6f489015e4`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-30 20:22:58 +02:00
Guillaume Abrioux	6ee4b228ba	osds: do not set docker_exec_cmd fact in `ceph-osd` there is no need to set `docker_exec_cmd` since the only place where this fact is used is in `openstack_config.yml` which delegate all docker command to a monitor node. It means we need the `docker_exec_cmd` fact that has been set referring to `ceph-mon-*` containers, this fact is already set earlier in `ceph-defaults`. By the way, when collocating an OSD with a MON it fails because the container `ceph-osd-{{ ansible_hostname }}` doesn't exist. Removing this task will allow to collocate an OSD with a MON. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1584179 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `34e646e767`)	2018-05-30 20:20:43 +02:00
Guillaume Abrioux	e1c1017e15	tests: resize root partition when atomic host For a few moment we can see failures in the CI for containerized scenarios because VMs are running out of space at some point. The default in the images used is to have only 3Gb for root partition which doesn't sound like a lot. Typical error seen: ``` STDERR: failed to register layer: Error processing tar file(exit status 1): open /usr/share/zoneinfo/Atlantic/Canary: no space left on device ``` Indeed, on the machine we can see: ``` Every 2.0s: df -h Tue May 29 17:21:13 2018 Filesystem Size Used Avail Use% Mounted on /dev/mapper/atomicos-root 3.0G 3.0G 14M 100% / ``` The idea here is to expand this partition with all the available space remaining by issuing an `lvresize` followed by an `xfs_growfs`. ``` -bash-4.2# lvresize -l +100%FREE /dev/atomicos/root Size of logical volume atomicos/root changed from <2.93 GiB (750 extents) to 9.70 GiB (2484 extents). Logical volume atomicos/root successfully resized. ``` ``` -bash-4.2# xfs_growfs / meta-data=/dev/mapper/atomicos-root isize=512 agcount=4, agsize=192000 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=0 spinodes=0 data = bsize=4096 blocks=768000, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=2560, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 data blocks changed from 768000 to 2543616 ``` ``` -bash-4.2# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/atomicos-root 9.7G 1.4G 8.4G 14% / ``` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `34f7042852`)	2018-05-30 14:40:50 +02:00
Guillaume Abrioux	92f0de3792	tests: avoid yum failures In the CI we often see failures like the following: `Failure talking to yum: Cannot find a valid baseurl for repo: base/7/x86_64` It seems the fastest mirror detection is sometimes counterproductive and leads yum to fail. This fix has been added in `setup.yml`. Until now this playbook was only used just before playing `testinfra`; it can also be used before running ceph-ansible so we can add some provisioning tasks. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> Co-authored-by: Erwan Velu <evelu@redhat.com> (cherry picked from commit `98cb6ed8f6`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-30 14:40:50 +02:00
Guillaume Abrioux	220d528e8b	mds: move mds fs pools creation When collocating mds on a monitor node, the cephfs pools creation will fail because `docker_exec_cmd` is reset to `ceph-mds-monXX`, which is incorrect because we need to delegate the task to `ceph-mon-monXX`. In addition, it wouldn't have worked since the `ceph-mds-monXX` container isn't started yet. Moving the task earlier in the `ceph-mds` role will fix this issue. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `608ea947a9`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-25 03:57:31 -07:00
Paul Cuzner	bdff7204f2	Add privilege escalation to iscsi purge tasks Without the escalation, invocation from non-root users will fail when accessing the rados config object, or when attempting to log to /var/log Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1549004 Signed-off-by: Paul Cuzner <pcuzner@redhat.com> (cherry picked from commit `2890b57cfc`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-25 03:52:06 -07:00
Guillaume Abrioux	0d3bce95e1	playbook: follow up on #2553 Since we fixed the `gather and delegate facts` task, this exception is not needed anymore. It's a leftover that should be removed to save some time when deploying a cluster with a large client number. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `828848017c`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:33:11 +02:00
Andrew Schoen	49f6d3cbec	ceph-defaults: move cephfs vars from the ceph-mon role We're doing this so we can validate this in the ceph-validate role Signed-off-by: Andrew Schoen <aschoen@redhat.com> (cherry picked from commit `1f15a81c48`)	2018-05-24 21:29:42 +02:00
Sébastien Han	1fe587441f	group_vars: resync group_vars The previous commit changed the content of roles/$ROLE/default/main.yml so we have to re generate the group_vars files. Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `3c32280ca1`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:29:42 +02:00
Guillaume Abrioux	683bec9eb2	mdss: move cephfs pools creation in ceph-mds When deploying a large number of OSD nodes it can be an issue because the protection check [1] won't pass since it tries to create pools before all OSDs are active. The idea here is to move cephfs pools creation in `ceph-mds` role. [1] `e59258943b/src/mon/OSDMonitor.cc (L5673)` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `3a0e168a76`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:29:42 +02:00
Guillaume Abrioux	b00a3cf790	tests: move cephfs_pools variable let's move this variable in group_vars/all.yml in all testing scenarios accordingly to this commit `1f15a81c48` so we keep consistency between the playbook and the tests. Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a10e73d78d`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:29:42 +02:00
Guillaume Abrioux	873abdbf0c	osds: move openstack pools creation in ceph-osd When deploying a large number of OSD nodes it can be an issue because the protection check [1] won't pass since it tries to create pools before all OSDs are active. The idea here is to move openstack pools creation at the end of `ceph-osd` role. [1] `e59258943b/src/mon/OSDMonitor.cc (L5673)` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1578086 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `564a662baf`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:29:42 +02:00
Guillaume Abrioux	4487eaba40	defaults: resync sample files with actual defaults `6644dba5e3` and `1f15a81c48` introduced some changes in the default variable files but it seems we've forgotten to regenerate the sample files. This commit aims to resync the content of `all.yml.sample`, `mons.yml.sample` and `rhcs.yml.sample` Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `f8260119cd`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 21:29:42 +02:00
Luigi Toscano	d7f0ea33c9	ceph-radosgw: disable NSS PKI db when SSL is disabled The NSS PKI database is needed only if radosgw_keystone_ssl is explicitly set to true, otherwise the SSL integration is not enabled. It is worth noting that the PKI support was removed from Keystone starting from the Ocata release, so some code paths should be changed anyway. Also, remove radosgw_keystone, which is not useful anymore. This variable was used until `fcba2c801a`. Now profiles drives the setting of rgw keystone *. Signed-off-by: Luigi Toscano <ltoscano@redhat.com> (cherry picked from commit `43e96c1f98`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-24 15:41:42 +02:00
Sébastien Han	7b2cefd9f8	rhcs: bump version to 3.0 for stable 3.1 Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1519835 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `bf9593bced`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-23 14:43:32 -07:00
Vishal Kanaujia	0e0bd09b1f	Skip GPT header creation for lvm osd scenario The LVM lvcreate fails if the disk already has a GPT header. We create GPT header regardless of OSD scenario. The fix is to skip header creation for lvm scenario. fixes: https://github.com/ceph/ceph-ansible/issues/2592 Signed-off-by: Vishal Kanaujia <vishal.kanaujia@flipkart.com> (cherry picked from commit `ef5f52b1f3`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-23 13:37:09 -07:00
Sébastien Han	37693870df	rolling_update: fix get fsid for containers When running ansible2.4-update_docker_cluster there is an issue on the "get current fsid" task. The current task only works for non-containerized deployment but will run all the time (even for containerized). This currently results in the following error: TASK [get current fsid] ****************************************************** task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-luminous-ansible2.4-update_docker_cluster/rolling_update.yml:214 Tuesday 22 May 2018 22:48:32 +0000 (0:00:02.615) 0:11:01.035 ********* fatal: [mgr0 -> mon0]: FAILED! => { "changed": true, "cmd": [ "ceph", "--cluster", "test", "fsid" ], "delta": "0:05:00.260674", "end": "2018-05-22 22:53:34.555743", "rc": 1, "start": "2018-05-22 22:48:34.295069" } STDERR: 2018-05-22 22:48:34.495651 7f89482c6700 0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).connect protocol feature mismatch, my 83ffffffffffff < peer 481dff8eea4fffb missing 400000000000000 2018-05-22 22:48:34.495684 7f89482c6700 0 -- 192.168.17.10:0/1022712 >> 192.168.17.12:6789/0 pipe(0x7f8944067010 sd=4 :42654 s=1 pgs=0 cs=0 l=1 c=0x7f894405d510).fault This is not really representative on the real error since the 'ceph' cli is available on that machine. On other environments we will have something like "command not found: ceph". Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `da5b104098`)	2018-05-22 23:22:38 -07:00
Subhachandra Chandra	747b545af4	Fix restarting OSDs twice during a rolling update. During a rolling update, OSDs are restarted twice currently. Once, by the handler in roles/ceph-defaults/handlers/main.yml and a second time by tasks in the rolling_update playbook. This change turns off restarts by the handler. Further, the restart initiated by the rolling_update playbook is more efficient as it restarts all the OSDs on a host as one operation and waits for them to rejoin the cluster. The restart task in the handler restarts one OSD at a time and waits for it to join the cluster. (cherry picked from commit `c7e269fcf5`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-22 23:22:38 -07:00
Sébastien Han	ddafad3f32	switch: disable ceph-disk units During the transition from jewel non-container to container old ceph units are disabled. ceph-disk can still remain in some cases and will appear as 'loaded failed', this is not a problem although operators might not like to see these units failing. That's why we remove them if we find them. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1577846 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `49a4712485`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-22 17:08:01 -07:00
Guillaume Abrioux	ec528b9278	purge_cluster: fix dmcrypt purge dmcrypt devices aren't closed properly, therefore, it may fail when trying to redeploy after a purge. Typical errors: ``` ceph-disk: Cannot discover filesystem type: device /dev/sdb1: Command '/sbin/blkid' returned non-zero exit status 2 ``` ``` ceph-disk: Error: unable to read dm-crypt key: /var/lib/ceph/osd-lockbox/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf: /etc/ceph/dmcrypt-keys/c6e01af1-ed8c-4d40-8be7-7fc0b4e104cf.luks.key ``` Closing properly dmcrypt devices allows to redeploy without error. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `9801bde4d4`)	2018-05-22 16:44:06 +02:00
Guillaume Abrioux	17ee4e92f0	purge_cluster: wipe all partitions In order to ensure there is no leftover after having purged a cluster, we must wipe all partitions properly. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `a9247c4de7`)	2018-05-22 16:44:06 +02:00
Guillaume Abrioux	7d0e072da4	purge_cluster: fix bug when building device list there is some leftover on devices when purging osds because of a invalid device list construction. typical error: ``` changed: [osd3] => (item=/dev/sda sda1) => { "changed": true, "cmd": "# if the disk passed is a raw device AND the boot system disk\n if parted -s \"/dev/sda sda1\" print \| grep -sq boot; then\n echo \"Looks like /dev/sda sda1 has a boot partition,\"\n echo \"if you want to delete specific partitions point to the partition instead of the raw device\"\n echo \"Do not use your system disk!\"\n exit 1\n fi\n echo sgdisk -Z \"/dev/sda sda1\"\n echo dd if=/dev/zero of=\"/dev/sda sda1\" bs=1M count=200\n echo udevadm settle --timeout=600", "delta": "0:00:00.015188", "end": "2018-05-16 12:41:40.408597", "item": "/dev/sda sda1", "rc": 0, "start": "2018-05-16 12:41:40.393409" } STDOUT: sgdisk -Z /dev/sda sda1 dd if=/dev/zero of=/dev/sda sda1 bs=1M count=200 udevadm settle --timeout=600 STDERR: Error: Could not stat device /dev/sda sda1 - No such file or directory. ``` the devices list in the task `resolve parent device` isn't built properly because the command used to resolve the parent device doesn't return the expected output eg: ``` changed: [osd3] => (item=/dev/sda1) => { "changed": true, "cmd": "echo /dev/$(lsblk -no pkname \"/dev/sda1\")", "delta": "0:00:00.013634", "end": "2018-05-16 12:41:09.068166", "item": "/dev/sda1", "rc": 0, "start": "2018-05-16 12:41:09.054532" } STDOUT: /dev/sda sda1 ``` For instance, it will result with a devices list like: `['/dev/sda sda1', '/dev/sdb', '/dev/sdc sdc1']` where we expect to have: `['/dev/sda', '/dev/sdb', '/dev/sdc']` Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1492242 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `9cad113e2f`)	2018-05-22 16:44:06 +02:00
Sébastien Han	831491f7d6	defaults: restart_osd_daemon unit spaces Extra spaces in systemctl list-units output can cause restart_osd_daemon.sh to fail. It looks like when more services are enabled on the node, the gap between "loaded" and "active" grows beyond the single space assumed in the command[1]. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1573317 Signed-off-by: Sébastien Han <seb@redhat.com> (cherry picked from commit `2f43e9dab5`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-22 09:43:10 +02:00
Michael Vollman	e1aa85f04c	Do nothing when mgr module is in good state Check whether a mgr module is supposed to be disabled before disabling it and whether it is already enabled before enabling it. Signed-off-by: Michael Vollman <michael.b.vollman@gmail.com> (cherry picked from commit `ed050bf3f6`) Signed-off-by: Sébastien Han <seb@redhat.com>	2018-05-18 16:53:39 +02:00
Guillaume Abrioux	fb0304230b	take-over: fix bug when trying to override variable A customer has been facing an issue when trying to override `monitor_interface` in inventory host file. In his use case, all nodes had the same interface for `monitor_interface` name except one. Therefore, they tried to override this variable for that node in the inventory host file but the take-over-existing-cluster playbook was failing when trying to generate the new ceph.conf file because of undefined variable. Typical error: ``` fatal: [srvcto103cnodep01]: FAILED! => {"failed": true, "msg": "'dict object' has no attribute u'ansible_bond0.15'"} ``` Including variables like this `include_vars: group_vars/all.yml` prevent us from overriding anything in inventory host file because it overwrites everything you would have defined in inventory. Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1575915 Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com> (cherry picked from commit `415dc0a29b`) Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>	2018-05-18 11:21:33 +02:00
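
The version bump described in the "make tag" commit (`b035697246`) can be sketched in a few lines. This is only an illustration of the rule stated in that commit message ("3.2.0beta1" -> "3.2.0beta2", "3.2.0rc1" -> "3.2.0rc2", "3.2.0" -> "3.2.1"), not the Makefile's actual implementation; the function name `next_tag` is hypothetical.

```python
import re


def next_tag(version: str) -> str:
    """Return the next tag by incrementing the trailing integer.

    Hypothetical helper: it only mirrors the examples given in the
    commit message, it is not taken from the ceph-ansible Makefile.
    """
    match = re.search(r"(\d+)$", version)
    if match is None:
        raise ValueError(f"no trailing number in {version!r}")
    bumped = int(match.group(1)) + 1
    # Keep everything before the trailing number untouched.
    return version[:match.start()] + str(bumped)


if __name__ == "__main__":
    for tag in ("3.2.0beta1", "3.2.0rc1", "3.2.0"):
        print(tag, "->", next_tag(tag))
```

As the commit notes, a format change (for example from betas to rcs) still has to be tagged by hand; a trailing-integer bump like this one cannot infer it.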

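The device list problem from commit `7d0e072da4` can be reproduced in isolation. The sketch below assumes the parent-name command may return several whitespace-separated names (such as `sda sda1`) and keeps only the first one; the helper names and the sample inputs are hypothetical and chosen to match the lists quoted in the commit message, not the playbook's real fix.

```python
from typing import Iterable, List


def parent_device(pkname_output: str) -> str:
    """Build a /dev path from the first name in the command output."""
    names = pkname_output.split()
    if not names:
        raise ValueError("empty output, cannot resolve a parent device")
    return "/dev/" + names[0]


def build_device_list(outputs: Iterable[str]) -> List[str]:
    """Resolve each output to its parent device, dropping duplicates."""
    devices: List[str] = []
    for output in outputs:
        device = parent_device(output)
        if device not in devices:
            devices.append(device)
    return devices


if __name__ == "__main__":
    # Assumed raw outputs, one per resolved partition.
    print(build_device_list(["sda sda1", "sdb", "sdc sdc1"]))
```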
3820 Commits (ebc901c6af67300f7b7b8da1b2d0a74147798da5)