Troubleshooting Guide

This section provides hints and useful commands to help you troubleshoot traffic-related and Kubernetes-related problems. These two types of issues are closely related, because both the control plane and the data plane software are containerized and deployed as Kubernetes services in SD-Fabric. Please refer to Architecture and Design for further details.

SONiC troubleshooting

Can’t reboot into SONiC, loops on ONIE installer mode

Sometimes a SONiC installation is incomplete or problematic, and reinstalling it doesn't result in a working system.

If this is the case, reboot into ONIE Rescue mode and use parted to delete all the SONiC-related partitions, then reinstall the SONiC image.

K8s troubleshooting

We assume that kubectl is already installed on your local machine. The first step is to set up the proper kubeconfig file to access the k8s cluster you want to troubleshoot:

$ export KUBECONFIG=~/kubeconfig/dev-sdfabric-menlo
$ kubectl config use-context dev-sdfabric-menlo
  Switched to context "dev-sdfabric-menlo".

You can get the list of the k8s namespaces using the kubectl get command:

$ kubectl get namespaces
  ...
  kube-node-lease            Active   68d
  kube-public                Active   68d
  kube-system                Active   68d
  security-scan              Active   68d
  sdfabric                   Active   26h

We assume that SD-Fabric resources are deployed under the sdfabric namespace, so make sure that this namespace has been properly created (other namespaces may be created as well; please check your overarching chart).

If the deployment is not successful, a first check is to make sure there are enough available nodes in the target cluster. You can check the available nodes with the kubectl get nodes command:

$ kubectl get nodes -o wide
  NAME       STATUS   ROLES                      AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION             CONTAINER-RUNTIME
  compute1   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.74   <none>        Ubuntu 18.04.6 LTS             5.4.0-73-generic           docker://20.10.9
  compute2   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.72   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
  compute3   Ready    controlplane,etcd,worker   39d   v1.18.8   10.76.28.68   <none>        Ubuntu 18.04.5 LTS             5.4.0-73-generic           docker://19.3.15
  leaf1      Ready    worker                     39d   v1.18.8   10.76.28.70   <none>        Debian GNU/Linux 10 (buster)   4.19.0-12-2-amd64          docker://18.9.8
  leaf2      Ready    worker                     39d   v1.18.8   10.76.28.71   <none>        Debian GNU/Linux 10 (buster)   4.19.0-12-2-amd64          docker://18.9.8

You should have at least 3+N available nodes, where N depends on the deployed network topology. Please note that ONOS cannot be scheduled on the network devices (these are special worker nodes), and different ONOS instances cannot share the same worker node (the same applies to Atomix).

Some basic pods should be present in every deployment. You can get the list of the pods by using kubectl get pods -n sdfabric:
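The node check can be reduced to a single number. The following is a minimal sketch, assuming the default column layout of kubectl get nodes (NAME first, STATUS second); count_ready_nodes is a hypothetical helper name and is not part of SD-Fabric.

```shell
# Hypothetical helper (not part of SD-Fabric): counts Ready nodes.
# Reads the output of `kubectl get nodes --no-headers` on stdin.
count_ready_nodes() {
  # Column 2 is STATUS. Cordoned nodes show "Ready,SchedulingDisabled",
  # so only an exact "Ready" is counted as available.
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes --no-headers | count_ready_nodes
```

Compare the printed number with the 3+N nodes required by your topology. Note that this counts all Ready nodes, including the network devices, on which ONOS and Atomix cannot be scheduled.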

$ kubectl get pods -n sdfabric -o wide
  NAME                                                        READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES
  onos-tost-atomix-0                                          1/1     Running   0          6h31m   10.72.106.161   compute3   <none>           <none>
  onos-tost-atomix-1                                          1/1     Running   0          6h31m   10.72.111.229   compute1   <none>           <none>
  onos-tost-atomix-2                                          1/1     Running   0          6h31m   10.72.75.254    compute2   <none>           <none>
  onos-tost-onos-classic-0                                    1/1     Running   0          98m     10.72.106.133   compute3   <none>           <none>
  onos-tost-onos-classic-1                                    1/1     Running   0          6h31m   10.72.111.207   compute1   <none>           <none>
  onos-tost-onos-classic-2                                    1/1     Running   0          6h31m   10.72.75.247    compute2   <none>           <none>
  onos-tost-onos-classic-onos-config-loader-ddc9d68bb-lq97t   1/1     Running   0          6h19m   10.72.106.190   compute3   <none>           <none>
  stratum-bwlvh                                               1/1     Running   0          6h31m   10.76.28.70     leaf1      <none>           <none>
  stratum-gh842                                               1/1     Running   0          6h31m   10.76.28.71     leaf2      <none>           <none>

3 Atomix nodes and 3 ONOS nodes are needed for HA. onos-config-loader is equally important, because without it ONOS cannot be properly configured. The number of Stratum pods depends on the deployed topology. If the status of the pods is not Running, you can check the events published by the k8s components to get a first idea of what is happening:

$ kubectl get events -n sdfabric --sort-by='.lastTimestamp'
  LAST SEEN   TYPE      REASON              OBJECT                                                           MESSAGE
  12m         Normal    Scheduled           pod/telegraf-75b959574d-sl8qb                                    Successfully assigned tost/telegraf-75b959574d-sl8qb to compute3
  12m         Normal    SuccessfulCreate    replicaset/telegraf-75b959574d                                   Created pod: telegraf-75b959574d-sl8qb
  12m         Normal    ScalingReplicaSet   deployment/telegraf                                              Scaled up replica set telegraf-75b959574d to 1
  12m         Normal    Pulled              pod/telegraf-75b959574d-sl8qb                                    Container image "telegraf:1.17" already present on machine
  12m         Normal    AddedInterface      pod/telegraf-75b959574d-sl8qb                                    Add eth0 [10.72.106.153/32]
  12m         Normal    Started             pod/telegraf-75b959574d-sl8qb                                    Started container telegraf
  12m         Normal    Created             pod/telegraf-75b959574d-sl8qb                                    Created container telegraf
  ...

The option --sort-by='.lastTimestamp' is typically used to get the events sorted by time. The previous command reports all the events that happened in the sdfabric namespace. If you want more insight into a specific pod, you can use the kubectl describe pods command:

$ kubectl describe pods -n sdfabric onos-tost-onos-classic-0
  Name:         onos-tost-onos-classic-0
  Namespace:    sdfabric
  Priority:     0
  Node:         compute3/10.76.28.68
  Start Time:   Mon, 11 Oct 2021 10:35:43 +0200
  ...
  Events:
    Type     Reason          Age   From               Message
    ...
    {"message":"pending"}
    org.onosproject.segmentrouting is not yet ready

The Events section typically provides useful information about the issues the pod is facing.

Both ONOS and Atomix define readiness probes, which make sure that the pods are ready before any configuration takes place. As a consequence, if the probes fail for a given pod, the output of kubectl get pods shows 0/1 under the READY column next to its name. We report in ONOS pod not ready (1) and ONOS pod not ready (2) two scenarios frequently faced by the SD-Fabric developers.
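To spot failing pods quickly, the pod list can be filtered. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS); not_ready_pods is a hypothetical helper name and is not part of SD-Fabric.

```shell
# Hypothetical helper (not part of SD-Fabric): lists pods that need attention.
# Reads the output of `kubectl get pods --no-headers` on stdin and prints
# NAME, READY and STATUS for every pod that is not Running or whose
# ready-container count is lower than its total (e.g. 0/1).
not_ready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage (requires access to the cluster):
#   kubectl -n sdfabric get pods --no-headers | not_ready_pods
```

An empty result means every pod is Running with all of its containers ready. Pods of finished jobs (status Completed) would also be listed, so review the result before acting on it.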

Logs of the SD-Fabric pods can be accessed by using the kubectl logs command:

$ kubectl -n sdfabric logs onos-tost-onos-classic-0
  2021-10-12 04:46:17,955 INFO  [EventAdminConfigurationNotifier] Sending Event Admin notification (configuration successful) to org/ops4j/pax/logging/Configuration
  ...
  2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl] Changes to perform:
  2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl]   Region: root
  2021-10-12 04:46:18,991 INFO  [FeaturesServiceImpl]     Bundles to install:

ONOS Troubleshooting

You can get the ONOS CLI by establishing an SSH connection to port 8101 (the default password is karaf):

$ kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101
# Run the next command in another terminal, or send the port-forward to the background with its output redirected to /dev/null
$ ssh -p 8101 karaf@localhost
  The authenticity of host '[localhost]:8101 ([127.0.0.1]:8101)' can't be established.
  RSA key fingerprint is SHA256:Mlaax9tHmIR6WwK0B3okC1O4mpAuoXjI7Z5+KKelxOo.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added '[localhost]:8101' (RSA) to the list of known hosts.
  Password authentication
  Password:
  Welcome to Open Network Operating System (ONOS)!
       ____  _  ______  ____
      / __ \/ |/ / __ \/ __/
     / /_/ /    / /_/ /\ \
     \____/_/|_/\____/___/

  Documentation: wiki.onosproject.org
  Tutorials:     tutorials.onosproject.org
  Mailing lists: lists.onosproject.org

  Come help out! Find out how at: contribute.onosproject.org

  Hit '<tab>' for a list of available commands
  and '[cmd] --help' for help on a specific command.
  Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

  karaf@root >

Alternatively, if it is not possible to establish an SSH connection with the ONOS pods, you can use the kubectl exec command on the target pod:

$ kubectl -n sdfabric exec -it onos-tost-onos-classic-0 -- bash apache-karaf-4.2.14/bin/client
  Welcome to Open Network Operating System (ONOS)!
       ____  _  ______  ____
      / __ \/ |/ / __ \/ __/
     / /_/ /    / /_/ /\ \
     \____/_/|_/\____/___/

  Documentation: wiki.onosproject.org
  Tutorials:     tutorials.onosproject.org
  Mailing lists: lists.onosproject.org

  Come help out! Find out how at: contribute.onosproject.org

  Hit '<tab>' for a list of available commands
  and '[cmd] --help' for help on a specific command.
  Hit '<ctrl-d>' or type 'logout' to exit ONOS session.

  karaf@root

You can attach to the ONOS logs by using the log:tail command:

karaf@root > log:tail
  20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine1 -> device:leaf1
  20:19:40.188 DEBUG [DefaultRoutingHandler] device:spine2 -> device:leaf1
  20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf1 -> device:spine1
  20:19:40.188 DEBUG [DefaultRoutingHandler] device:leaf2 -> device:spine1

The command continuously displays the log entries, which is useful for a live debugging session. Complete ONOS logs can be accessed by using the kubectl logs command as explained in the previous section. If the logs are not enough to identify the problem, you can inspect the ONOS state by issuing specific CLI commands. The section Frequently Used Commands lists a few commands we frequently use when troubleshooting SD-Fabric.

Pipeline Walk-through

Note

More information on the pipeline walk-through is coming soon.

onos-diagnostics

If you can't figure out what is going wrong, you can seek help on the SD-Fabric developer mailing list sdfabric-dev@opennetworking.org or reach out on the sdfabric-dev Slack channel. There are a few things we would like you to attach:

  • Issue description

  • Environment description, such as SD-Fabric version, switch model and SDE version

  • Steps to reproduce, as detailed as possible

  • Diagnostics.

We have built a tool, onos-diagnostics, to help you easily collect and package ONOS diagnostics. The tool collects various information from the running ONOS cluster and packages it into one easy-to-share archive file. This tool is distributed as part of the ONOS software itself (under the bin directory), but is also available as part of a small archive of remote tools to administer an ONOS cluster (onos-admin-*.tar.gz).

Alternatively, it is possible to use onos-diagnostics-k8s in Kubernetes-enabled environments. The tool produces the same results as onos-diagnostics and relies only on kubectl commands. The tool needs to know the name of the namespace, which can be provided through the option -s. Then, you have to provide the names of the target pods. To avoid having to specify these names as part of the command, you can export the ONOS_PODS environment variable. Here's an example of how to set the variable:

$ export ONOS_PODS="onos-0 onos-1 onos-2"
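Rather than typing the pod names by hand, they could be derived from the cluster. The following is a minimal sketch, assuming the ONOS pods follow the onos-classic-<index> naming shown in the pod listing earlier in this guide; onos_pod_names is a hypothetical helper name and is not part of the ONOS tooling.

```shell
# Hypothetical helper (not part of the ONOS tooling): builds a value for ONOS_PODS.
# Reads one pod name per line on stdin and prints, on a single
# space-separated line, the names that end in "onos-classic-<index>".
# Helper pods such as the onos-config-loader are skipped by the pattern.
onos_pod_names() {
  awk '$1 ~ /onos-classic-[0-9]+$/ {
    names = names (names == "" ? "" : " ") $1
  }
  END { print names }'
}

# Usage (requires access to the cluster):
#   export ONOS_PODS="$(kubectl -n sdfabric get pods --no-headers \
#     -o custom-columns=NAME:.metadata.name | onos_pod_names)"
```

If your deployment names its ONOS pods differently, adjust the pattern accordingly.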

The tool needs to know the Karaf home (path from the mount point). To avoid having to specify this path as part of the command, you can export the KARAF_HOME environment variable:

$ export KARAF_HOME="apache-karaf-4.2.14"

Once done, the onos-diagnostics-k8s tool can be run as follows:

$ onos-diagnostics-k8s -s sdfabric

The -n option lets you name the resulting archive file, which helps to distinguish between different cluster instances, e.g.:

# This will produce archive file /tmp/delta-pod-diags.tar.gz
$ onos-diagnostics-k8s -s sdfabric -n delta-pod

By default, onos-diagnostics-k8s uses ONOS_PROFILE to collect the diagnostics. You can tailor the behavior of the command to your needs by specifying a different profile; for SD-Fabric we suggest using TRELLIS_PROFILE. The resulting /tmp/*-diags.tar.gz file will contain all relevant information about the ONOS cluster.

The following is an example of a complete onos-diagnostics-k8s command:

$ DIAGS_PROFILE=TRELLIS_PROFILE onos-diagnostics-k8s -k apache-karaf-4.2.14 -s sdfabric onos-tost-onos-classic-0 onos-tost-onos-classic-1 onos-tost-onos-classic-2

UP4 Troubleshooting

Note

More information on UP4 troubleshooting is coming soon.

Common Issues

Note

Here is a list of common issues. More details on each case are coming soon.

ImagePullBackOff

ONOS pod not ready (1)

ONOS pod not ready (2)

ONOS pods not configured

Packet-In not working

Device offline

Frequently Used Commands

This subsection introduces a few commands we frequently use when troubleshooting SD-Fabric.

ONOS

To execute the following ONOS CLI commands:

  • Set up Kubernetes port forwarding with kubectl -n sdfabric port-forward onos-tost-onos-classic-0 8101

  • Log in to the ONOS CLI with ssh -p 8101 karaf@localhost. The default password is karaf.

ONOS basics

  • flows: List flow tables. -s for simplified output.

  • groups: List group tables. -s for simplified output.

  • devices: List device information. -s for simplified output.

  • ports: List port information. -e to list enabled ports only.

  • links: List discovered links

  • hosts: List discovered hosts. -s for simplified output.

  • netcfg: List network configuration

  • interfaces: List interface configuration

trellis-control

  • sr-pr-list: List current recovery phase of each device

  • sr-device-subnets: List device-subnet mapping

fabric-tna

  • slices: List network slices

  • tcs: List traffic classes of given slice

up4

  • read-entities -a: Print UPF entities installed in the UPF dataplane. More options are available. See read-entities --help

Stratum

To execute the following BF Shell commands:

  • Log in to the Stratum switch via ssh.

  • Attach to the Stratum Docker container with docker attach `docker ps | grep stratum-bfrt | awk '{print $1}'`

    • Hit enter for the prompt

    • Use <Ctrl-P><Ctrl-Q> to exit the container. Do not use <Ctrl-C> since it will terminate the process.

BF Shell

  • pm.show: List port configurations. -a to list all ports.