The cost of complexity: Ansible AWX
2024-5-5 21:16:33 Author: www.adainese.it

For several years now, Ansible AWX has been on my ToDo list. I have always postponed dealing with it due to its complexity, but today I find myself having to tackle it. I knew it would be challenging, but I didn’t imagine just how much.

This post aims to summarize the steps to install a development instance of AWX and discuss the enormous, unnecessary, and harmful complexity that our infrastructures have reached.

AWX Operator

First and foremost, it’s essential to note that AWX is available via containers on the Kubernetes platform. My attempts to run the application in standalone mode or as a simple container on Docker were futile.

I chose to prepare an Ubuntu Linux 22.04 VM on which I installed Minikube. Minikube, in turn, installs a reduced version of Kubernetes.

To summarize:

For the details of the individual steps, I refer you to the two links by Christopher Hart at the end of the post.

The reflection I want to make concerns the complexity of the environment needed to run what is, in the end, an orchestrator of playbooks. The probability that something will go wrong in the setup is very high, and debugging any problem requires expertise in many, too many, different environments. Not to mention how to update, back up, restore, or design the disaster recovery process for such a solution.

Reboot

After a few hours of use, I realize that my instance of AWX is no longer working correctly. The reason becomes apparent almost immediately: the disk space of the Ubuntu Linux system is exhausted. I extend the disk and restart the system… but AWX doesn’t seem to start.
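A disk that fills up silently is easy to catch early. The following is a minimal sketch, not part of the original setup: the function name disk_used_pct and the THRESHOLD value are assumptions chosen for illustration.

```shell
# Print the percentage of used space for the filesystem that holds the given path.
# Relies on the POSIX output format of `df -P` (capacity is the 5th column).
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

THRESHOLD=90   # hypothetical alert level, tune to taste
used=$(disk_used_pct /)

if [ "$used" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is ${used}% full" >&2
else
  echo "/ is ${used}% full"
fi

# Depending on the driver, the Minikube node may have its own filesystem.
# The same kind of check can be run inside it, for example:
#   minikube ssh "df -h /var"
```

Run from cron or a systemd timer, a check like this would have flagged the problem before AWX stopped responding.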

In order, I verify:

  • Docker on Ubuntu Linux;
  • Minikube;
  • Docker within Kubernetes;
  • The AWX containers within Kubernetes.

minikube status
minikube stop
minikube start
minikube kubectl -- get pods -A
minikube service list
kubectl logs --namespace=awx -p svc/awx-service
kubectl get services --namespace awx

After a few minutes, the AWX service responds, but it is not exposed. A few more minutes, and I manage to reach the login page again.

Ansible Galaxy

The playbooks I wrote use the cisco.ios collection, which is installed by default with Ansible, but it seems not to exist in AWX. I discover that there is a specific format for defining the dependencies of an AWX project. You need to create the collections/requirements.yml file indicating the necessary dependencies:

collections:
- name: cisco.ios
  version: 8.0.0
  source: https://galaxy.ansible.com

I rerun the playbook and move on to the next error.

paramiko vs pylibssh

The next error indicates the use of paramiko instead of ansible-pylibssh:

TASK [cisco_ios_system : SETTING FACTS] ****************************************
[WARNING]: ansible-pylibssh not installed, falling back to paramiko
[WARNING]: ansible-pylibssh not installed, falling back to paramiko
ok: [sw1.example.com]
ok: [sw2.example.com]
TASK [cisco_ios_system : CONFIGURING HOSTNAME AND DOMAIN NAME] *****************
fatal: [sw2.example.com]: FAILED! => {"changed": false, "msg": "No existing session"}
fatal: [sw1.example.com]: FAILED! => {"changed": false, "msg": "No existing session"}

The cisco.ios collection, however, requires ansible-pylibssh, which is not present by default in Ansible or AWX; both use paramiko instead. With little hope, I force the use of pylibssh by configuring in the inventory:

ansible_network_cli_ssh_type: libssh

I rerun the playbook, and indeed the error reports the absence of the library:

TASK [cisco_ios_system : CONFIGURING HOSTNAME AND DOMAIN NAME] *****************
fatal: [sw1.example.com]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (ansible-pylibssh) on automation-job-11-m85n8's Python /usr/bin/python3. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter"}
fatal: [sw2.example.com]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (ansible-pylibssh) on automation-job-11-m85n8's Python /usr/bin/python3. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter"}

The error above also shows that the playbook is executed from a container created specifically for each execution: automation-job-11-m85n8. Investigating, I discover that the container is created for each job from the image quay.io/ansible/awx-ee. The image is defined in AWX under Administration -> Execution Environments.

So I need a custom awx-ee image containing the ansible-pylibssh library. After some research, I find that someone has had the same problem as me and has already created an image, although it’s a year old. Before embarking on creating a new image, in quick & dirty mode, I decide to use that image to see if I can get a working prototype.
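For reference, building such an image usually goes through ansible-builder. The sketch below is an assumption-laden starting point, not the procedure used in this post: the directory name, the image tag registry.example.com/awx-ee-pylibssh:1.0, and the choice of quay.io/ansible/awx-ee:latest as base image are all hypothetical, and the definition has not been verified against the current awx-ee image.

```shell
# Write a minimal execution environment definition (ansible-builder schema version 3).
EE_DIR="awx-ee-pylibssh"                             # hypothetical working directory
EE_TAG="registry.example.com/awx-ee-pylibssh:1.0"    # hypothetical image tag

mkdir -p "$EE_DIR"
cat > "$EE_DIR/execution-environment.yml" <<'EOF'
---
version: 3
images:
  base_image:
    name: quay.io/ansible/awx-ee:latest
dependencies:
  python:
    - ansible-pylibssh
  galaxy:
    collections:
      - name: cisco.ios
EOF

# Build only if ansible-builder is available; Docker is assumed as the runtime.
if command -v ansible-builder >/dev/null 2>&1; then
  ansible-builder build \
    --file "$EE_DIR/execution-environment.yml" \
    --context "$EE_DIR/context" \
    --container-runtime docker \
    --tag "$EE_TAG"
else
  echo "ansible-builder not found (pip install ansible-builder)" >&2
fi
```

The resulting image would then have to be pushed to a registry the cluster can reach before it can be referenced from AWX.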

In Administration -> Execution Environments, I add a new image as follows:

  • Name: AWX EE w pylibssh
  • Image: quay.io/repository/azzeddineslimani/eev_awx_cisco
  • Pull: Always pull container before running

I associate this new image with the playbook, restart, and get a new error:

Receptor detail:
Error creating pod: container failed to start, ImagePullBackOff

I discover it’s a Kubernetes error caused by the inability to download the image. I check manually and indeed see that the image is not present in the Docker image store within Kubernetes.
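One way to run such a check, shown here as a sketch rather than the exact command used at the time, is to list the images known to the Minikube node and filter for the one of interest. The helper name image_cached is hypothetical.

```shell
# Image name taken from the post; adjust it to your own repository.
IMAGE="quay.io/azzeddineslimani/eev_awx_cisco"

# Reads an image listing on stdin and succeeds if the given name appears in it.
image_cached() {
  grep -F -q -- "$1"
}

# `minikube image ls` prints the images available inside the Minikube node.
if command -v minikube >/dev/null 2>&1; then
  if minikube image ls | image_cached "$IMAGE"; then
    echo "cached: $IMAGE"
  else
    echo "not cached: $IMAGE"
  fi
fi
```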

I try the manual way and force the download:

minikube image pull quay.io/azzeddineslimani/eev_awx_cisco

A few minutes, and the image is correctly downloaded.

I rerun the playbook, but the error is the same.

$ kubectl logs -f -n awx automation-job-18-gcwfn
Error from server (BadRequest): container "worker" in pod "automation-job-18-gcwfn" is waiting to start: trying and failing to pull image

The error is mine: I assumed that not specifying the image version would download the latest available (latest), but it is not the case. I correct the URL using quay.io/azzeddineslimani/eev_awx_cisco:latest and rerun the playbook.
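Mistakes in image references are easy to make when copying from a registry web page. The helper below is a small sketch with a hypothetical name, pull_ref. It appends an explicit tag when none is given. It also rewrites the quay.io/repository/... form, on the assumption that this is the path of the quay.io web page rather than a pull reference.

```shell
# Turn an image location into an explicit pull reference.
# Assumption: "quay.io/repository/<ns>/<name>" is the quay.io web page path,
# while the pull reference is "quay.io/<ns>/<name>".
pull_ref() {
  ref=$1
  case $ref in
    quay.io/repository/*) ref="quay.io/${ref#quay.io/repository/}" ;;
  esac
  last=${ref##*/}   # final path component, where a tag or digest would appear
  case $last in
    *:* | *@*) printf '%s\n' "$ref" ;;         # tag or digest already present
    *)         printf '%s:latest\n' "$ref" ;;  # append an explicit :latest tag
  esac
}

pull_ref quay.io/repository/azzeddineslimani/eev_awx_cisco
pull_ref quay.io/azzeddineslimani/eev_awx_cisco:latest
```

Both calls print the same reference, the one that finally went into the Image field.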

Same error.

I come across the following description:

If you believe everything is appropriately configured, try pulling the image directly from the command line (using the docker image pull command) with the same values that are specified in your application manifest. If this works, you know the image is accessible, and that the root cause of the problem lies somewhere in Kubernetes. In this case, lack of available resources or networking configuration issues are your most likely culprit.

Since the image is correct and exists, I deduce that the error is application-related. I delete the image reference, recreate it, and associate it with the playbook again.

I rerun the playbook and encounter the next error.

Bastion host

The next error concerns the inaccessibility of devices:

fatal: [sw1.example.com]: FAILED! => {"changed": false, "msg": "ssh connection failed: ssh connect failed: Timeout connecting to 169.254.1.111"}
fatal: [sw2.example.com]: FAILED! => {"changed": false, "msg": "ssh connection failed: ssh connect failed: Timeout connecting to 169.254.1.112"}

The reason is simple: my lab environment requires passing through a bastion host that authenticates by key. Without containers the fix would be immediate, but containers have no persistence, and that includes the automation container created for each job. The solution is the same as in the previous section: I need to create a custom awx-ee image.

Once again, I go for the quick & dirty way. My goal is to get a working prototype and then figure out how to cleanly resolve the errors I’ve encountered.

I add a route so that the Ubuntu Linux system can directly reach the devices without the need for the bastion host.

I run the playbook and move on to the new error.

Authentication

The next error concerns authentication to the devices:

fatal: [sw1.example.com]: FAILED! => {"changed": false, "msg": "ssh connection failed: Failed to authenticate public key: Access denied for 'none'. Authentication that can continue: publickey,keyboard-interactive,password"}
fatal: [sw2.example.com]: FAILED! => {"changed": false, "msg": "ssh connection failed: Failed to authenticate public key: Access denied for 'none'. Authentication that can continue: publickey,keyboard-interactive,password"}

Here too, the error is mine: I chose the wrong credential type (Ansible Tower: Credentials). The Network type is meant for devices that use a local connection, while the cisco.ios collection has moved to network_cli mode, which requires the Machine type. I recreate the credentials using the correct type, update the playbook, and rerun it:

PLAY RECAP *********************************************************************
sw1.example.com            : ok=26   changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
sw2.example.com            : ok=26   changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Finally, after about 8 hours of work, I get the result.

Conclusions

I’m definitely getting older, but I also believe I have some good reasons on my side: if installing an orchestrator to automate a series of tasks requires the level of complexity described in this post, perhaps we are doing something wrong.

Not only do we have difficulty finding the necessary skills to manage such a system, not only are we ignoring the basic rules that security would impose, but we have no idea how complex applications interact with each other. And we don’t even have the tools to understand it anymore.

Am I exaggerating?

Perhaps, but for the project I’m working on, I need:

  • network engineers, who do the low-level design for the involved devices;
  • automation engineers, who write automations based on the given requirements;
  • Kubernetes specialists who manage the orchestrator for me;
  • application specialists (AWX in this specific case) who help me understand if and how I can adapt the orchestrator according to the project’s needs;
  • gurus who can understand the language of each of the roles above, creating a single team that speaks and works together. In this specific case, I remember a discussion between a Kubernetes/OpenShift expert and a network expert on the topic of routing: the same word, with two completely different meanings.

Good luck.

References


Source: https://www.adainese.it/blog/2024/05/05/the-cost-of-complexity-ansible-awx/