Common problems
This document describes how to identify and resolve common Capact problems that might occur.
Action
In this section, you can find common Action failures that might occur.
Action does not have status
Symptoms:
- Action status is empty.
Debugging steps:
Check the Engine logs. You can grep the logs using the Action name to narrow down the number of log entries. During the initial processing, the Engine tries to update the Action. A common problem is that the Engine has wrong or missing RBAC permissions.
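To check the Engine logs with kubectl, you can use a command like the one below. This is only a sketch: it assumes Capact is installed in the capact-system Namespace and that the Engine Deployment is named capact-engine; verify the actual names with kubectl get deployments -n capact-system.
kubectl logs -n capact-system deployment/capact-engine | grep ${ACTION_NAME}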
Action stuck in the BeingRendered phase
Rendering a more complex workflow may take a few minutes. An Action in the BeingRendered phase for more than 15 minutes may mean that it is stuck.
Symptoms:
- An Action was created more than 15 minutes ago. To check the AGE column, run:
kubectl get actions.core.capact.io ${ACTION_NAME} -n ${ACTION_NAMESPACE}
Debugging steps:
Check the Engine logs. You can grep the logs using the Action name to narrow down the number of log entries. During the render process, manifests are downloaded from the Public Hub. A common problem is that the Public Hub is unreachable for the Engine. See the Unreachable Gateway section to resolve this issue.
Action in the Failed phase
An Action may fail for a variety of reasons. The first thing you need to do is check the status message.
Debugging steps:
Check the Action status message (see the command after the list below). If the status message contains:
while fetching latest Interface revision string: cannot find the latest revision for Interface "cap.interfac.db.install" (giving up - exceeded 15 retries)
then:
- Ensure that the Public Hub is populated and manifests can be fetched.
- Ensure that the ActionRef is not misspelled.
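To read the status message, describe the Action and check the Status section:
kubectl describe actions.core.capact.io ${ACTION_NAME} -n ${ACTION_NAMESPACE}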
Check the Engine logs. You can grep the logs using the Action name to narrow down the number of log entries. A common problem is that the Engine doesn't have proper permissions to schedule Action execution, e.g. it cannot create a ServiceAccount, Secret, or Argo Workflow. Ensure that the k8s-engine-role ClusterRole in the capact-system Namespace has all the necessary permissions.
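To inspect the rules granted to the Engine, you can print the ClusterRole definition. This assumes the ClusterRole is named exactly k8s-engine-role; if it is not, list the ClusterRoles with kubectl get clusterroles | grep engine to find the right one.
kubectl get clusterrole k8s-engine-role -o yaml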
Clean up Action execution pods
After Action execution, there are a lot of Pods with the name pattern {ACTION_NAME}-{RANDOM_10_DIGITS} in the Completed state.
NAME READY STATUS RESTARTS AGE
mattermost-1602179194 0/2 Completed 0 14d
mattermost-2270774275 0/2 Completed 0 14d
mattermost-823541112 0/2 Completed 0 14d
mattermost-470211537 0/2 Completed 0 14d
mattermost-1030672350 0/2 Completed 0 14d
mattermost-147207013 0/2 Completed 0 14d
mattermost-2768336525 0/2 Completed 0 14d
mattermost-3634435893 0/2 Completed 0 14d
mattermost-4236050029 0/2 Completed 0 14d
mattermost-2282111071 0/2 Completed 0 14d
mattermost-3762917690 0/2 Completed 0 14d
mattermost-4129897782 0/2 Completed 0 14d
mattermost-1307838837 0/2 Completed 0 14d
mattermost-2309417707 0/2 Completed 0 14d
mattermost-1619688498-1 1/1 Running 0 12d
mattermost-1619688498-0 1/1 Running 0 12d
Those Pods were created by the Argo Workflow, and each of them represents an executed Action step, e.g. creating a database, creating a user in the database, etc. For failed Actions, they are useful to debug the root cause of an error. For a successfully executed Action, you can remove them. To remove only the Argo Workflow Pods, run:
kubectl delete workflows.argoproj.io {ACTION_NAME} -n {ACTION_NAMESPACE}
To remove the Action and all resources associated with it (Argo Workflow Pods, ServiceAccount, user input data, etc.), run:
capact action delete {ACTION_NAME} -n {ACTION_NAMESPACE}
Wrong Implementation was selected
Actions may define their dependencies via Interfaces. Depending on the cluster Policy configuration, every time a user runs an Action, a different Implementation may be picked for a given Interface.
Symptoms:
- The rendered Action workflow contains an Implementation which should not be used.
- The executed Action creates resources in an unexpected destination. For example, it deployed PostgreSQL on the cluster instead of provisioning an RDS instance on the AWS side.
Debugging steps:
Check if the proper Policy exists and has the proper configuration. Read the Policy overview document to get familiar with the syntax and the available set of features.
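To view the current cluster Policy, you can use the Capact CLI. This is only a sketch and assumes your CLI version provides the policy command:
capact policy get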
If you use cloud solutions, such as GCP or AWS, you need to specify the TypeInstance ID in the cluster Policy. This TypeInstance must hold a subscription which allows provisioning a given service on the hyperscaler side. If the TypeInstance doesn't exist, the Engine will ignore this configuration. Check if a TypeInstance with the given ID exists.
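To check if the TypeInstance exists, you can try to fetch it by ID. Again a sketch, assuming your CLI version provides the typeinstance command and {TYPE_INSTANCE_ID} is the ID used in the Policy:
capact typeinstance get {TYPE_INSTANCE_ID}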
Namespace stuck in the Terminating state
If the Namespace has been marked for deletion and the Capact components were removed before, the Namespace may become stuck in the Terminating state. This is typically because the Capact Engine cannot execute its clean-up logic and remove the finalizer from the Action resource. To resolve it, remove the finalizer from the Action:
kubectl patch actions.core.capact.io ${ACTION_NAME} -n ${ACTION_NAMESPACE} -p '{"metadata":{"finalizers":null}}' --type=merge
Unreachable Gateway
Gateway aggregates GraphQL APIs from the Capact Engine, Public Hub, and Local Hub. If one of the aggregated components is not working properly, Gateway does not work either.
Symptoms:
- Gateway responds with the 502 status code.
- Gateway logs contain a message similar to:
while introspecting GraphQL schemas: while introspecting schemas with retry: while introspecting schemas: invalid character 'l' looking for beginning of value
Debugging steps:
Restart the Gateway. For the component name, use gateway.
Check if the Public Hub logs contain the information that it started. They should contain a message similar to:
INFO GraphQL API is listening {"endpoint":"http://:8080/graphql"}
Check if the Local Hub logs contain the information that it started. They should contain a message similar to:
INFO GraphQL API is listening {"endpoint":"http://:8080/graphql"}
Check if the Engine logs contain the information that it started. They should contain a message similar to:
engine-graphql httputil/server.go:47 Starting HTTP server {"server": "graphql", "addr": ":8080"}
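The commands below sketch how to perform these steps. They assume Capact is installed in the capact-system Namespace and that the components are deployed as capact-gateway, capact-hub-public, capact-hub-local, and capact-engine; verify the actual names with kubectl get deployments -n capact-system.
kubectl rollout restart deployment/capact-gateway -n capact-system
kubectl logs -n capact-system deployment/capact-hub-public | grep "GraphQL API is listening"
kubectl logs -n capact-system deployment/capact-hub-local | grep "GraphQL API is listening"
kubectl logs -n capact-system deployment/capact-engine | grep "Starting HTTP server"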