You can migrate NCP-managed Kubernetes clusters and TAS foundations from Manager mode to Policy mode.

When NCP is upgraded, NSX resources created in Manager mode can be migrated to NSX Policy, allowing NCP to operate in Policy mode. In this documentation, migration is also called "import"; the two terms mean the same thing. The feature to migrate TAS foundations is available starting with NCP 4.1.2, and the feature to migrate TKGI clusters and vanilla Kubernetes clusters is available starting with NCP 4.0.

Prerequisites

  1. Migrating a Kubernetes cluster or TAS foundation can take some time and requires a control plane downtime on NSX (no create, update, or delete operations are allowed). The duration of the downtime depends on which of the two migration strategies you use:
    1. Strategy 1 (recommended, and required for TAS): Schedule a control plane downtime on all clusters (running in Manager mode or Policy mode) simultaneously. The downtime lasts until all clusters are migrated to Policy mode. The benefit is that you can use the NSX Backup and Restore feature during Failure and Recovery.
    2. Strategy 2: Schedule a control plane downtime on one cluster at a time. After a cluster is migrated, NCP is started in Policy mode on that cluster. NSX Backup and Restore cannot be used with this strategy because other clusters may create new workloads on NSX while the current cluster is being migrated. Use this strategy only if it is acceptable to discard the cluster when both migration and rollback fail.
  2. NSX Managers can be shared by multiple clusters or foundations. Among clusters/foundations that share the same NSX Managers, only one can be migrated to Policy at a time.
  3. All manually-created DFW sections that are present between any NCP-created sections must be moved outside the range of NCP-created DFW sections. See Dealing with DFW Sections Created by NSX Admin.
  4. All user-created rules inside NCP-created sections must be moved out of the NCP-created sections. See Dealing with DFW Sections Created by NSX Admin.
  5. NCP-created LoadBalancer Virtual Servers in Manager mode must not have more than 255 rules. This means that Kubernetes must not have more than 255 Ingress rules on the default LoadBalancer. If there are more than 255 Ingress rules, you must split Ingress into multiple LoadBalancer CRDs while NCP is running in Manager mode.
  6. If you are using a load balancer in front of the NSX unified appliances (UAs), you must attach a source IP persistence profile to the load balancer so that all API calls made by the migration script reach the same NSX appliance. Remove this persistence profile after all foundations/clusters are migrated to the Policy API.
  7. If there are Gateway Firewall rules on the top Tier-0 that can block traffic from containers, using BYPASS as the NAT Firewall Match for SNAT Rules property avoids firewall enforcement on NCP-created SNAT rules (see the configuration sketch after this list). Otherwise, review the Gateway Firewall rules to ensure that traffic from containers is not accidentally blocked.
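
For vanilla Kubernetes deployments, this property is set in the NCP configuration rather than the Ops Manager UI. A minimal sketch, assuming the natfirewallmatch option in the [nsx_v3] section of ncp.ini (delivered through the nsx-ncp-config config map) controls this behavior:

```
# Fragment of ncp.ini from the nsx-ncp-config config map (assumed layout).
[nsx_v3]
# Skip Gateway Firewall enforcement for NCP-created SNAT rules.
natfirewallmatch = BYPASS
```

In TAS/TKGI, the equivalent setting is the NAT Firewall Match for SNAT Rules property described above.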

Limitations and Caveats

  1. Scenarios in which NSX Managers are shared across product types (TAS foundations, TKGI clusters, and vanilla Kubernetes clusters) are not yet supported. For example, the same NSX Manager nodes cannot be used to run one or more TAS foundations and one or more TKGI clusters.
  2. After migration, when NCP restarts in Policy mode, it reconciles the existing workloads in the TAS foundation or TKGI cluster with NSX. The reconciliation time is proportional to the existing workload size. During reconciliation, operations such as pod or app instance creation might fail, and there could also be significant delays in creating or deleting NSX resources mapped to TAS and TKGI entities. For very large clusters or foundations, where NCP resource usage is close to the scale limits, we recommend adding at least 45 minutes to the maintenance window to allow NCP to reconcile with the backend. NCP automatically retries the failed operations after reconciliation completes.
  3. After migration to Policy, SNAT rules that NCP created in Manager mode use BYPASS as their firewall_match value. SNAT rules that NCP creates after the migration use the firewall_match value that you have configured.

Process Details

There are two categories of NSX resources that are migrated when a Kubernetes/TKGI cluster or TAS foundation is migrated to Policy mode:
  1. Shared NSX Resources: These NSX resources are created manually by the admin and are provided to NCP via the Ops Manager UI in TAS/TKGI, or via the nsx-ncp-config Kubernetes config map in vanilla Kubernetes. They can be shared among foundations and clusters. They must be specified manually, before the foundation/cluster migration begins, in a YAML file called the user spec (see Sample user-spec.yaml and the illustrative sketch after this list).
  2. NCP-created NSX Resources: These NSX resources are created by NCP in response to foundation/cluster workloads. They are inferred automatically during migration.
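
The exact schema of the user spec is defined in Sample user-spec.yaml; the fragment below is only a hypothetical illustration of the idea. The field names and values here are invented for this example and are not authoritative:

```
# Hypothetical user spec fragment -- field names are illustrative only;
# consult Sample user-spec.yaml for the real schema.
shared_resources:
  tier0_router: shared-t0            # Tier-0 router shared across clusters
  external_ip_pools:
    - snat-ip-pool                   # admin-created pool referenced in NCP config
```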

NOTE: The NCP pod can work in Manager mode only when all NCP-created NSX resources are in Manager mode. Likewise, the NCP pod can work in Policy mode only when all NCP-created NSX resources are in Policy mode.

Migration of vanilla Kubernetes clusters is driven by a Kubernetes job named "nsx-ncp-migrate-mp2p"; migration of TKGI clusters and TAS foundations is driven by an errand named "migrate-mp2p". This job/errand executes a Python program that migrates either the shared or the NCP-created NSX resources. The Python program runs in two modes: migration (see "Migration Mode") and rollback (see "Rollback Mode"). Before the errand/job is triggered, you must first stop NCP in all the clusters that share the NSX network (both Manager mode and Policy mode clusters) and then create an NSX backup. Detailed steps are provided in migrating a Kubernetes cluster or TAS foundation.
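
A minimal operational sketch of this sequence, assuming NCP runs as a deployment named nsx-ncp in the nsx-system namespace and that the migration job manifest ships with your NCP distribution (all three names vary by installation):

```
# Vanilla Kubernetes: stop NCP in every cluster that shares the NSX network,
# take an NSX backup, then create the migration job and follow its logs.
kubectl -n nsx-system scale deployment nsx-ncp --replicas=0
# ... create an NSX backup from the NSX Manager UI or API ...
kubectl -n nsx-system apply -f nsx-ncp-migrate-mp2p.yaml   # manifest path is illustrative
kubectl -n nsx-system logs -f job/nsx-ncp-migrate-mp2p

# TKGI / TAS: run the errand against the relevant BOSH deployment.
bosh -d <deployment-name> run-errand migrate-mp2p
```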

Migration Mode

Migration mode runs in four logically separate phases.

Phase 1

Retrieve all the NSX resources from the Manager API using the Search API. Filter the resources based on the cluster tag (when migrating NCP-created resources) or on the shared resources specified in the user spec file (when migrating shared resources). Then generate the request bodies to be sent to the migration server. If any request cannot be generated, no NSX resource is migrated.
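
NCP tags the Manager resources it creates with the ncp/cluster tag scope, so the kind of lookup this phase performs can be approximated with the NSX Search API. A hedged sketch; the manager address, credentials, and cluster name are placeholders, and the program's real queries may differ:

```
# Approximate the Phase 1 lookup: find Manager resources tagged for one cluster.
# The '/' in the tag scope is escaped per the search query syntax.
curl -k -u 'admin:<password>' \
  'https://<nsx-manager>/api/v1/search/query?query=tags.scope:ncp\/cluster%20AND%20tags.tag:<cluster-name>'
```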

Possible issues:
  • Connectivity issues with NSX
  • Kubernetes API Server does not contain a resource that is needed to migrate an NSX Manager resource

Phase 2

Send the migration requests created in Phase 1 to the migration coordinator service running in NSX. Once NSX processes a request successfully, the MP IDs of the NSX resources migrated by that request are stored on the local disk (these are called "migration records"). If an issue occurs, the program rolls back to Manager mode all the NSX resources migrated during the current execution, using the MP IDs stored on the local disk.

Possible issues:
  • Connectivity issues with NSX
  • Migration API returns an error

Phase 3

Infer the updates that should be made to the NSX resources migrated in the current execution. These updates cover only the tags and/or display names of the NSX resources. If an update cannot be inferred (for example, because the corresponding Kubernetes resource is missing), all the NSX resources are rolled back.

Possible issues:
  • Connectivity issues with NSX.
  • Kubernetes API Server does not contain a resource that is needed to update an NSX Policy resource.

Phase 4

Update the NSX Policy resources with the information inferred in Phase 3. If an NSX resource cannot be updated at that time, store the updated Policy resource body and the Policy resource URL on the local disk.
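
The updates applied in this phase are ordinary Policy API writes. A hedged sketch of what one such update might look like, using an invented segment ID and display name (the real program computes both from the migrated resources):

```
# Illustrative Policy API update: PATCH the display_name and tags of a
# migrated resource. The resource path and body are examples only.
curl -k -u 'admin:<password>' -X PATCH \
  -H 'Content-Type: application/json' \
  -d '{"display_name": "seg-default-pod-net",
       "tags": [{"scope": "ncp/cluster", "tag": "<cluster-name>"}]}' \
  'https://<nsx-manager>/policy/api/v1/infra/segments/<segment-id>'
```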

Possible issues:
  • Connectivity issues with NSX.

See the section "Failure and Recovery" if any issue is encountered during migration.

Rollback Mode

In this mode, the Python program tries to roll back all the NSX resources whose MP IDs are present in the migration records on local storage (see Migration Phase 2). It deletes the migration records of NSX resources once they are rolled back successfully. If a failure occurs during rollback, execution stops and the errand/job must be run again.

As soon as the program starts, it automatically runs in rollback mode if it finds any migration records on local storage (see Migration Phase 2).

Failure and Recovery

The migration process may fail to finish because of an external issue such as a power failure, disk exhaustion, a connectivity issue, or a functional issue. In such scenarios, there are ways to recover.

We recommend first checking the logs of the migration errand/job (example commands follow this list). They will likely indicate the next action to take, which is one of the following:
  • (Default resolution if the logs do not indicate an action) Run the migration errand/job again.
    • If the previous failure occurred in Phase 1, 2, or 3, the migration errand/job tries to roll back the NSX resources using the migration records (see Phase 2). Repeat this until all the NSX resources are rolled back.
    • If the previous failure occurred in Phase 4, the migration errand/job tries again to update the NSX resources in Policy mode. Repeat this until all the NSX resources are successfully updated.
  • Run NCP in Manager mode and then try the migration again. If the migration errand/job is unable to migrate the cluster, NCP needs to run in Manager mode again on this cluster/foundation. However, doing so renders the NSX backup void, so skip the migration of this cluster temporarily. Once all the other clusters are migrated to Policy mode, start NCP in Manager mode on this cluster and in Policy mode on the others. Wait at least 60 minutes, and then follow the migration steps again from the beginning to retry this cluster's migration.
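
A sketch of the log-inspection and re-run commands, reusing the assumed names from the earlier sketch (adjust the namespace, job name, and manifest path to your installation). A completed Kubernetes job cannot be re-triggered in place, so the re-run deletes and recreates it:

```
# Inspect the migration job/errand logs for the suggested next action.
kubectl -n nsx-system logs job/nsx-ncp-migrate-mp2p

# Re-run the job: delete the completed/failed job, then recreate it.
kubectl -n nsx-system delete job nsx-ncp-migrate-mp2p
kubectl -n nsx-system apply -f nsx-ncp-migrate-mp2p.yaml   # manifest path is illustrative

# TKGI / TAS: re-run the errand.
bosh -d <deployment-name> run-errand migrate-mp2p
```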

In the rare event that recovery is not possible with the above steps, restore the NSX Manager to the backup point created before any cluster was migrated to Policy, restart NCP in all the clusters in the same mode they were in when the NSX backup was taken, and attempt the migration again.