When configuring grafana/prometheus embed in the mgr/dashboard, we need
to use the address of the grafana-server node and not the current
hostname because mgr/dashboard and grafana/prometheus could be present
on different hosts.
We should instead rely on the grafana_server_addr variable and remove
the dashboard_url.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
There's no need to add complexity by trying to fall back on another group.
Let's deploy dashboard on all nodes present in grafana-server group.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
The current port value for alertmanager, grafana, node-exporter and
prometheus is hardcoded in the roles so it's not possible to change the
port binding of those services.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
We are currently using an incorrect default port for the dashboard. Upstream
uses 8443 instead of 8234 by default. This should get us closer to the
upstream project.
Signed-off-by: Boris Ranto <branto@redhat.com>
This commit moves some old variables into ceph-defaults so we can move
the `use_new_ceph_iscsi` fact into the ceph-facts role.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This is necessary when configuring RGW with SSL because
in addition to passing specific frontend options, civetweb
appends the 's' character to the binding port and beast uses
ssl_endpoint instead of endpoint.
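
The two behaviours above can be sketched as follows, in Python for
illustration only. The function name `rgw_frontend_option` and its
parameters are hypothetical, and other frontend options (such as the
certificate path) are deliberately left out.

```python
def rgw_frontend_option(frontend: str, address: str, port: int, ssl: bool) -> str:
    """Return the binding option for an RGW frontend (illustrative sketch)."""
    if frontend == "civetweb":
        # civetweb marks an SSL port by appending 's' to the port number.
        suffix = "s" if ssl else ""
        return f"port={address}:{port}{suffix}"
    if frontend == "beast":
        # beast keeps the port as-is but changes the option name.
        key = "ssl_endpoint" if ssl else "endpoint"
        return f"{key}={address}:{port}"
    raise ValueError(f"unknown frontend: {frontend}")
```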
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1722071
Signed-off-by: Giulio Fidente <gfidente@redhat.com>
The ansible_lsb fact is based on the lsb package (lsb-base,
lsb-release or redhat-lsb-core).
If the package isn't installed on the remote host then the fact isn't
populated.
--------
"ansible_lsb": {},
--------
Switching to the ansible_distribution_release fact instead.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
As per bz1718981, this commit adds higher values to check
the quorum status. This is helpful for several OSP deployments
that fail during the scale up.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1718981
Signed-off-by: fpantano <fpantano@redhat.com>
Since timesyncd is not available on RHEL-based OSs, change the default
to chronyd for RHEL-based OSs. Also, chronyd is chrony on Ubuntu, so
set the Ansible fact accordingly.
Fixes: https://github.com/ceph/ceph-ansible/issues/3628
Signed-off-by: Rishabh Dave <ridave@redhat.com>
The definitions of cephfs pools should match openstack pools.
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Co-Authored-by: Simone Caronni <simone.caronni@teralytics.net>
Add a variable to support the allow_embedding option.
See ceph/ceph-ansible/issues/4084 for details.
Fixes: #4084
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
A few fixes to the systemd unit templates for node_exporter and
alertmanager container parameters.
Added the ability to use a dedicated instance to deploy the
dashboard components (prometheus and grafana).
This commit also introduces the grafana_group_name variable
to refer to the grafana group and keep consistency with the other
groups.
During the integration with TripleO, some grafana/prometheus
template variables ended up undefined. This commit adds the
ability to check whether the group exists and to create different
job groups in the prometheus template accordingly.
Signed-off-by: fmount <fpantano@redhat.com>
This adds support for an rgw loadbalancer based on HAProxy and Keepalived.
We define a single role ceph-rgw-loadbalancer and include HAProxy and
Keepalived configurations all in this.
A single haproxy backend is used to balance all RGW instances and
a single frontend is exported via a single port, default 80.
Keepalived is used to maintain the high availability of all haproxy
instances. You are free to use any number of VIPs. A single VIP is
shared across all keepalived instances and there will be one
master for one VIP, selected sequentially, and others serve as
backups.
This assumes that each keepalived instance is on the same node as
one haproxy instance and we use a simple check script to detect
the state of each haproxy instance and trigger the VIP failover
upon its failure.
Signed-off-by: guihecheng <guihecheng@cmiot.chinamobile.com>
This commit allows deploying an internal ganesha with an external ceph
cluster.
This requires defining `external_cluster_mon_ips` as a comma-separated
list of external monitors.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1710358
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to have default values for these variables in each role
since there are no corresponding host groups.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This commit renames the `docker_exec_cmd` variable to
`container_exec_cmd` so it's more generic.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
This adds support for podman in dashboard-related roles. It also drops
the creation of custom network for the dashboard-related roles as this
functionality works in a different way with podman.
Signed-off-by: Boris Ranto <branto@redhat.com>
This commit will merge dashboard-ansible installation scripts with
ceph-ansible. This includes several new roles to set up ceph-dashboard
and the underlying technologies like prometheus and grafana server.
Signed-off-by: Boris Ranto & Zack Cerza <team-gmeno@redhat.com>
Co-authored-by: Zack Cerza <zcerza@redhat.com>
Co-authored-by: Guillaume Abrioux <gabrioux@redhat.com>
RHCS 4 will be based on Nautilus and only usable on RHEL 8.
Updated the default ceph_rhcs_version to 4 and update the rhcs
repositories to rhcs 4 with RHEL 8.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
This commit refactors the msgr2 protocol introduction.
If it's a fresh install, let's go with v2 only.
If we upgrade to nautilus, we should go with v2+v1 syntax to ensure
nothing breaks.
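
As a rough sketch of the two cases (Python, illustration only): the helper
name `mon_host_entry` is hypothetical, and the exact string written to
ceph.conf is an assumption. Ports 3300 (msgr2) and 6789 (msgr1) are the
standard monitor ports.

```python
def mon_host_entry(ip: str, fresh_install: bool) -> str:
    """Build one monitor address entry (assumed format, for illustration)."""
    if fresh_install:
        # New cluster: advertise the v2 protocol only.
        return f"[v2:{ip}:3300]"
    # Upgrade to nautilus: keep v1 alongside v2 so older clients still connect.
    return f"[v2:{ip}:3300,v1:{ip}:6789]"
```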
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Currently we only support ansible 2.7.
We plan to use 2.8 when it is released, so we have to support both
2.7 and 2.8.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1700548
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
Because 5c98e361df could be seen as a non
backward compatible change, this commit reverts it and brings back package
dependencies installation support.
Let's just modify the default value instead.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
These packages aren't needed anymore.
They were needed for ceph-init-detect, but ceph-init-detect doesn't exist
anymore.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1683885
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
As discussed in ceph/ceph#26599, beast is now the default frontend
for rados gateway with nautilus release.
Add rgw_thread_pool_size variable with 512 as default value and keep
backward compatibility with num_threads option when using civetweb.
Update radosgw_civetweb_num_threads to reflect rgw_thread_pool_size
change.
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The Ubuntu Cloud Archive-related (UCA) defaults in
roles/ceph-defaults/defaults/main.yml were commented out, which means
if you set `ceph_repository` to "uca", you get undefined variable
errors, e.g.
```
The task includes an option with an undefined variable. The error was: 'ceph_stable_repo_uca' is undefined
The error appears to have been in '/nfs/users/nfs_m/mv3/software/ceph-ansible/roles/ceph-common/tasks/installs/debian_uca_repository.yml': line 6, column 3, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: add ubuntu cloud archive repository
^ here
```
Unfortunately, uncommenting these results in some other breakage,
because further roles were written that use the fact of
`ceph_stable_release_uca` being defined as a proxy for "we're using
UCA", and so try to install packages from the bionic-updates/queens
release, for example, which doesn't work. So there are a few `apt` tasks
that need modifying to not use `ceph_stable_release_uca` unless
`ceph_origin` is `repository` and `ceph_repository` is `uca`.
Closes: #3475
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
Currently the default crush rule value is added to the ceph config
on the mon nodes as an extra configuration applied after the template
generation via the ansible ini module.
This implies two behaviors:
1/ On each ceph-ansible run, the ceph.conf will be regenerated via
ceph-config+template and then ceph-mon+ini_file. This leads to
unnecessary daemon restarts.
2/ When other ceph daemons are collocated on the monitor nodes
(like mgr or rgw), the default crush rule value will be erased by
the ceph.conf template (mon -> mgr -> rgw).
This patch adds the osd_pool_default_crush_rule config to the ceph
template and only for the monitor nodes (like crush_rules.yml).
The default crush rule id is read (if it exists) from the current ceph
configuration.
The default configuration is -1 (ceph default).
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1638092
Signed-off-by: Dimitri Savineau <dsavinea@redhat.com>
The previous approach was wrong.
Checking if `item.key` is in `osd_auto_discovery_exclude` (`['dm-',
'loop']`) is incorrect because a full device name never equals one of
these patterns. Therefore,
the condition will return `True` whatever the device we are checking.
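
The difference can be seen in a small Python sketch (the device names are
made up, and the actual fix may use a regex match rather than the
substring test shown here):

```python
exclude = ["dm-", "loop"]
devices = ["sda", "dm-0", "loop1"]

# List membership compares whole strings, so "dm-0" never equals "dm-"
# and nothing is ever excluded.
wrong = [d for d in devices if d not in exclude]

# Testing each pattern as a substring of the device name is what the
# exclusion list is meant to express.
right = [d for d in devices if not any(pattern in d for pattern in exclude)]
```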
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add a new `osd_auto_discovery_exclude` to give the possibility of
excluding some devices in auto_discovery scenario.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
- also add `--foreground` which seems to fix some issue we are facing when
using timeout with `podman`.
- use this fact in the `is ceph running already?` task.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
nfs-ganesha v2.5 and 2.6 have hit EOL. Install nfs-ganesha v2.7
stable that is currently being maintained.
Signed-off-by: Ramana Raja <rraja@redhat.com>
With this, we can have multiple rgw instances on a single host
with a single run, without having to use rgw-standalone.yml, which does not
seem able to bind ports separately.
If you want to have multiple rgw instances, just change 'radosgw_instances'
to the number you want, which defaults to 1.
Not compatible with Multi-Site yet.
Signed-off-by: guihecheng <guihecheng@cmiot.chinamobile.com>
Sometimes we play the whole role `ceph-defaults` just to access the
default value of some variables. It means we play the `facts.yml` part
in this role while it's not desired. Splitting this role will speed up
the playbook.
Closes: #3282
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
We want to test podman on f29 non-atomic, atomic is not a hard
requirement. However, if you want to get podman then you will have to
install it first before running the playbook.
Signed-off-by: Sébastien Han <seb@redhat.com>
change default value of `radosgw_address` to keep consistency with
`monitor_address`.
Moreover, `ceph-validate` checks if the value is '0.0.0.0' to determine
if it has to run `check_eth_rgw.yml`.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1600227
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
There is no need to play these tasks on nodes that are not in the rgw
group.
Always playing this code makes `shrink_mon.yml` fail.
Typical error:
```
TASK [ceph-defaults : set_fact _radosgw_address to radosgw_interface - ipv4] ***
task path: /home/jenkins-build/build/workspace/ceph-ansible-prs-dev-shrink_mon/roles/ceph-defaults/tasks/set_radosgw_address.yml:21
Thursday 22 November 2018 12:34:51 +0000 (0:00:00.154) 0:00:12.371 *****
fatal: [localhost]: FAILED! => {}
MSG:
The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute u'ansible_eth1'
```
Indeed, `radosgw_interface` is the network interface on rgw only. It is
expected that this same interface doesn't exist on `localhost`, so, when
running `shrink_mon.yml`, the role `ceph-defaults` is called in
`hosts: localhost` and causes the playbook to fail.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It seems Atomic 7.5 has podman already, however this is an old version
(0.4). The podman integration is targeting RHEL 8, so Fedora is
currently the closest to that.
Signed-off-by: Sébastien Han <seb@redhat.com>
This is to add a granularity level.
We can have ceph specific variables that user shouldn't have to change
here.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Add real default value for osd pool size customization.
Ceph itself has an `osd_pool_default_size` default value of `3`.
If users don't specify a pool size in the various pool definitions within
ceph-ansible, we should default to `3`.
By the way, this kind of condition isn't really clear:
```
when:
- rbd_pool_size | default ("")
```
we should try to get the customized value then default to what is in
`osd_pool_default_size` (which has its default value pointing to
`ceph_osd_pool_default_size` (`3`) as well) and compare it to
`ceph_osd_pool_default_size`.
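
The intended lookup order might look like this in Python (illustration
only; `pool_size_to_apply` and the `size` key are hypothetical names, not
the role's actual tasks):

```python
CEPH_OSD_POOL_DEFAULT_SIZE = 3  # Ceph's own default pool size


def pool_size_to_apply(pool, osd_pool_default_size=CEPH_OSD_POOL_DEFAULT_SIZE):
    """Return the size to set explicitly, or None if Ceph's default applies."""
    # Prefer the customized value, then fall back to osd_pool_default_size.
    size = pool.get("size", osd_pool_default_size)
    # Only act when the result differs from what Ceph would use anyway.
    return size if size != CEPH_OSD_POOL_DEFAULT_SIZE else None
```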
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
`osd_pool_default_pg_num` parameter is set in `ceph-mon`.
When using ceph-ansible with `--limit` on a specific group of nodes, it
will fail when trying to access this variable since it wouldn't be
defined.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1518696
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
It is safer to use the list filter than the keys() method since the keys
method does have some interoperability issues between python2 and
python3 based ansible/jinja.
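
A small Python 3 sketch of the underlying difference (the dictionary
contents are made up; the Jinja `list` filter corresponds to the
`list(...)` call here):

```python
facts = {"eth0": {}, "eth1": {}}

# On Python 3, keys() returns a view object that cannot be indexed,
# whereas Python 2 returned a plain list.
try:
    facts.keys()[0]
    keys_indexable = True
except TypeError:
    keys_indexable = False

# Converting to a list yields an indexable list on both interpreters.
first = list(facts)[0]
```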
Signed-off-by: Boris Ranto <branto@redhat.com>
* The default value of osd_memory_target used by ceph is 4294967296 bytes,
so use the same as ceph-ansible default.
* Convert ansible_memtotal_mb to bytes to calculate osd_memory_target
Signed-off-by: Neha Ojha <nojha@redhat.com>
if firewalld.service systemd unit is masked, the handler will fail when
trying to restart it.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1650281
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
So we don't have to loop over `_monitor_addresses` when we need the
monitor address of the current node being played.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
using consecutive set_fact in the playbook instead of complex jinja syntax
makes ceph.conf.j2 more readable.
By the way, jinja can be painful to debug at some point.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Currently a throw-away container is built to run ceph client
commands to set up users, pools & auth keys. This utilises
the same base ceph container which has all the ceph services
inside it.
This PR allows the use of a separate container if the deployer
wishes - but defaults to use the same full ceph container.
This can be used for different architectures or distributions,
which may support the Ceph client, but not Ceph server,
and allows the deployer to build and specify a separate client
container if need be.
Signed-off-by: Andy McCrae <andy.mccrae@gmail.com>
Liberty is no longer available in the UCA. The last available release there
is currently Queens.
Signed-off-by: Christian Berendt <berendt@betacloud-solutions.de>
As of now, we should no longer support Jewel in ceph-ansible.
The latest ceph-ansible release supporting Jewel is `stable-3.1`.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Fixes the deprecation warning:
[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of
using `result|search` use `result is search`.
Signed-off-by: Noah Watkins <nwatkins@redhat.com>
Check firewall isn't working as expected and might break deployments.
This part of the code will be reworked soon.
Let's focus on configure_firewall code for now.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1541840
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Allow user to choose between timesyncd, chronyd and ntpd
Installation will default to timesyncd since it is distributed as
part of the systemd installation for most distros.
Added note indicating NTP daemon type is not used for containerized
deployments.
Fixes issue #3086 on Github
Signed-off-by: Benjamin Cherian <benjamin_cherian@amat.com>
The role contains all the handlers for Ceph services. We decided to
leave ceph-defaults role with variables and a few facts only. This is
useful when organizing the site.yml files and also adding the known
variables to infrastructure-playbooks.
Signed-off-by: Sébastien Han <seb@redhat.com>
As per #1013 it appears that BS will soon use THP to lower TLB misses,
also disabling THP hasn't demonstrated any gains so far.
Closes: https://github.com/ceph/ceph-ansible/issues/1013
Signed-off-by: Sébastien Han <seb@redhat.com>
`+` is more idiomatic for "one or more" in a regex than `{1,}`; the
latter was introduced in a previous fix for an incorrect `{1,2}`
restriction.
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
the checks it does is that the number of pgs in the state
"active+clean" is equal to the total number of pgs in the cluster.
On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".
Fix this by considering as good any pg whose state contains
active+clean. Do this as an integer comparison to num_pgs in pgmap.
(could this be backported to at least stable-3.0 please?)
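
The check described above can be sketched in Python (illustration only;
the script itself is shell, and this assumes the `pgmap` section of
`ceph -s -f json` with its `pgs_by_state` and `num_pgs` fields):

```python
def all_pgs_active_clean(pgmap: dict) -> bool:
    """True when every pg is in a state that contains active+clean."""
    good = sum(
        entry["count"]
        for entry in pgmap["pgs_by_state"]
        # Matches active+clean, active+clean+scrubbing, +snaptrim, etc.
        if "active+clean" in entry["state_name"]
    )
    # Integer comparison against the total number of pgs.
    return good == int(pgmap["num_pgs"])
```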
Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
Previously RETRIES was set (by default to 40) once at the start of the
script; this meant that it would only ever wait for up to 40 lots of
30s across *all* the OSDs on a host before bombing out. In fact, we
want to be prepared to wait for the same amount of time after each OSD
restart for the clusters' pgs to be happy again before continuing.
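
In outline, the change amounts to resetting the retry budget inside the
per-OSD loop. A Python sketch (illustration only; `restart_all` and
`cluster_is_healthy` are hypothetical names, and the 30s sleep between
polls is omitted):

```python
def restart_all(osds, cluster_is_healthy, retries=40):
    """Wait on each OSD with its own retry budget; return those that timed out."""
    failed = []
    for osd in osds:
        remaining = retries  # reset per OSD, not once for the whole host
        while remaining > 0 and not cluster_is_healthy(osd):
            remaining -= 1
        if remaining == 0:
            failed.append(osd)
    return failed
```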
Closes: #3154
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
BlueStore's cache is sized conservatively by default, so that it does
not overwhelm under-provisioned servers. The default is 1G for HDD, and
3G for SSD.
To replace the page cache, as much memory as possible should be given to
BlueStore. This is required for good performance. Since ceph-ansible
knows how much memory a host has, it can set
`bluestore cache size = max(total host memory / num OSDs on this host * safety
factor, 1G)`
Due to fragmentation and other memory use not included in bluestore's
cache, a safety factor of 0.5 for dedicated nodes and 0.2 for
hyperconverged nodes is recommended.
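
A worked version of the formula in Python (illustration only; the function
name is hypothetical, and treating "1G" as 1 GiB is an assumption):

```python
GIB = 1024 ** 3


def bluestore_cache_size(total_host_memory: int, num_osds: int,
                         hyperconverged: bool = False) -> int:
    """Cache size in bytes: per-OSD share times safety factor, floored at 1G."""
    safety_factor = 0.2 if hyperconverged else 0.5
    return max(int(total_host_memory / num_osds * safety_factor), GIB)
```

For example, a dedicated node with 64 GiB of memory and 8 OSDs gets
64 / 8 * 0.5 = 4 GiB per OSD.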
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1595003
Signed-off-by: Neha Ojha <nojha@redhat.com>
Co-Authored-by: Guillaume Abrioux <gabrioux@redhat.com>
let's add ansible_hostname as a default value for rgw_hostname if no
hostname in servicemap matches ansible_fqdn.
Fixes: #3063
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1622505
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
A couple of things were wrong in the initial commit:
* ceph_release_num[ceph_release] >= ceph_release_num['luminous'] will
never work since the ceph_release fact is set in the roles after. So
either ceph-common or ceph-docker-common set it
* we can easily re-use the initial command to check if a cluster is
running, it's more elegant than running it twice.
* set the fact rgw_hostname on rgw nodes only
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1618678
Signed-off-by: Sébastien Han <seb@redhat.com>
If there are no services on the cluster, then the 'rgw' could be missing
and the task is failing with the following problem:
msg": "The task includes an option with an undefined variable.
The error was: 'dict object' has no attribute 'rgw'
We fix this by checking the existence of the 'rgw' attribute. If it's
missing, we skip the task since the role already contains code to set
a good default rgw_hostname.
Signed-off-by: Markos Chandras <mchandras@suse.de>
Since commit f422efb1d6 ("config: ensure
rgw section has the correct name") we observe the following failures in
new Ceph deployment with OpenStack-Ansible
fatal: [aio1_ceph-rgw_container-fc588f0a]: FAILED! => {"changed": false,
"cmd": "ceph --cluster ceph -s -f json", "msg": "[Errno 2] No such file
or directory"
This is because the task executes 'ceph' but at this point no package
installation has happened. Packages are normally installed in the
'ceph-common' role which runs after the 'ceph-defaults' one.
Since we are looking to obtain cluster information, the task should be
delegated to a monitor node similar to other tasks in that role
Signed-off-by: Markos Chandras <mchandras@suse.de>
The fqdn configuration possibility caused a lot of trouble; it adds a
lot of complexity because of multiple cases and the relation between
ceph-ansible and ceph-container. Moreover, there is no benefit to such
a feature.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1613155
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
the ceph.conf.j2 always assumes the hostname used to register the
radosgw in the servicemap is equivalent to `{{ ansible_hostname }}`
which returns the shortname form.
We need to detect which form of the hostname was used in case of already
deployed cluster and update the ceph.conf accordingly.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1580408
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since the container now simply reads the ceph.conf, we remove all the
unnecessary options.
Also this PR is the foundation to support multiple backends, such as the
new 'beast' from Ceph Mimic.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1582411
Signed-off-by: Sébastien Han <seb@redhat.com>
deployment.
restart_osd_daemon.sh is used to discover and restart all OSDs on a
host. To do so, the script loops over the list of ceph-osd@ services in the
system. This commit fixes a bug in the regular expression responsible for
extracting OSDs: the prior version uses the `[0-9]{1,2}` expression,
which ignores all OSDs whose numbers are greater than 99 (thus
longer than 2 digits). The fix removes the upper limit on the number of digits.
This problem existed in two places in the script.
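
The effect of the quantifier can be checked with a short Python sketch
(the unit names and the surrounding pattern are made up for illustration;
only the `[0-9]{1,2}` part comes from the script):

```python
import re

units = ["ceph-osd@7.service", "ceph-osd@42.service", "ceph-osd@123.service"]

old = re.compile(r"^ceph-osd@([0-9]{1,2})\.service$")  # at most two digits
new = re.compile(r"^ceph-osd@([0-9]+)\.service$")      # one or more digits


def osd_ids(pattern, names):
    """Return the OSD ids captured from the names the pattern matches."""
    return [m.group(1) for m in map(pattern.match, names) if m]


old_ids = osd_ids(old, units)  # OSD 123 is silently skipped
new_ids = osd_ids(new, units)
```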
Closes: #2964
Signed-off-by: Artur Fijalkowski <artur.fijalkowski@ing.com>
This commit ensures we are backward compatible with fqdn deployments.
Since ceph-container enforces deployment to be done with shortname, we
must keep backward compatibility with clusters already deployed with
fqdn configuration
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
Since `V2.6-stable` is available and has packages for `mimic`, let's
update this default value accordingly so nfs nodes can be deployed with
mimic.
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
As of Kraken, the journal code does not use the hdparm command anymore
so we can remove it from our package dependency list.
Fixes: https://github.com/ceph/ceph-ansible/issues/1402
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit f6910efa24389c264062963b2054c7cd29ffebb3)
We now add a default 'rbd' application type to each pool we create. This
will remove the warning: " application not enabled on N pool(s) "
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1590275
Signed-off-by: Sébastien Han <seb@redhat.com>
keyring files in /etc/ceph. Default value is the same as it was (0600),
but this variable allows the user to override it (e.g. set it to 0640).
Signed-off-by: George Shuklin <george.shuklin@gmail.com>
Depending on your setup, ceph-mgr might get restarted multiple times.
When this is done too fast, systemd will prevent further restarts because of
configured limits in the ceph-mgr systemd unit file.
Resetting the failure count will prevent this problem. The reset is done before
the restart so in case of a real problem during the restart it still fails.
Fixes: #2768
Signed-off-by: Christian Zunker <christian.zunker@codecentric.cloud>
Prior to this patch if you were running on a Red Hat system,
ceph-ansible would try to configure firewalld for you without the
operator's consent.
Now you can enable or disable the fw configuration by setting
configure_firewall to either true or false.
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1589146
Signed-off-by: Sébastien Han <seb@redhat.com>