Here you will learn how to get started using VMware Spring Cloud® Data Flow for Kubernetes (SCDF for Kubernetes) to quickly create a data pipeline.

Start the SCDF shell

First, download and start the SCDF shell. For information about using the shell to connect to the SCDF for Kubernetes server, see Use the SCDF Shell.
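
If you do not already have the shell JAR, you can download it from Maven Central. This example assumes version 2.9.2, matching the command below; substitute the version that corresponds to your SCDF for Kubernetes installation:

$ wget https://repo.maven.apache.org/maven2/org/springframework/cloud/spring-cloud-dataflow-shell/2.9.2/spring-cloud-dataflow-shell-2.9.2.jar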

$ java -jar spring-cloud-dataflow-shell-2.9.2.jar --dataflow.uri=http://data-flow.example.com
...

dataflow:>

Import the stream applications

Next, import the Spring Cloud Stream application starters by using the SCDF shell app import command. You can choose between two versions of the starters, depending on your message broker: "RabbitMQ + Docker" and "Apache Kafka + Docker". For more information about the application starters, see the Register Supported Applications and Tasks section of the Spring Cloud Data Flow Reference Guide.

To use the RabbitMQ starters, run the following command:

dataflow:>app import https://dataflow.spring.io/rabbitmq-docker-latest

To use the Apache Kafka starters, run the following command:

dataflow:>app import https://dataflow.spring.io/kafka-docker-latest
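
To verify that the import succeeded, run the shell's app list command. The imported sources, processors, and sinks, including the http, splitter, and log applications used below, should appear in the output:

dataflow:>app list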

Create the stream definition

After importing the application starters, you can use three of the imported applications (the http source, the splitter processor, and the log sink) to create a stream that consumes data through an HTTP POST request, splits the data into words, and writes the results to the log.

Create the stream using the SCDF shell stream create command:

dataflow:>stream create --name words --definition "http | splitter --expression=payload.split(' ') | log"
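
The double quotes around the definition allow the SpEL expression to contain a space. The same pattern works for other delimiters; for example, this hypothetical variant creates a stream named lines that splits on commas instead:

dataflow:>stream create --name lines --definition "http | splitter --expression=payload.split(',') | log"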

Deploy the stream

Finally, deploy the stream using the SCDF shell stream deploy command. To enable external HTTP requests to reach the http application, you must specify a deployment property that requests a LoadBalancer for the application's service:

dataflow:>stream deploy words --properties deployer.http.kubernetes.createLoadBalancer=true
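
Deployment properties follow the pattern deployer.&lt;app name&gt;.kubernetes.&lt;property&gt;, and several can be combined in a comma-separated list. As an illustrative sketch (the memory value here is an assumption, not a recommendation), the following also sets a memory limit on the http application:

dataflow:>stream deploy words --properties "deployer.http.kubernetes.createLoadBalancer=true,deployer.http.kubernetes.limits.memory=512Mi"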

To see the application pods deployed as part of the stream, run the following command in a different terminal window or tab:

$ kubectl get pods -l role=spring-app
NAME                                 READY   STATUS    RESTARTS   AGE
words-http-v1-7c977c4965-5q27m       1/1     Running   0          5m8s
words-log-v1-855f6ddd69-slmnn        1/1     Running   0          5m8s
words-splitter-v1-5466fdf6d4-lrz2c   1/1     Running   0          5m8s

At the SCDF shell, run the stream list command to check the status of the stream:

dataflow:>stream list
╔═══════════╤═══════════╤═══════════════════════════════════════════════════════╤═════════════════════════════════════════╗
║Stream Name│Description│                   Stream Definition                   │                 Status                  ║
╠═══════════╪═══════════╪═══════════════════════════════════════════════════════╪═════════════════════════════════════════╣
║words      │           │http | splitter --expression="payload.split(' ')" | log│The stream has been successfully deployed║
╚═══════════╧═══════════╧═══════════════════════════════════════════════════════╧═════════════════════════════════════════╝

Use the deployed data pipeline

To view the log output of the log sink, run the following command in a new terminal window or tab:

$ kubectl logs -f deployment/words-log-v1

Next, look up the IP address assigned to the http application Service resource:

$ kubectl get service words-http-v1
NAME            TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)          AGE
words-http-v1   LoadBalancer   10.47.244.113   104.197.218.242   8080:31383/TCP   4m4s

You can access the http application at the IP address shown under EXTERNAL-IP, on port 8080.
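
If your cluster cannot provision LoadBalancer IPs (the EXTERNAL-IP column remains <pending>), you can reach the application through a port-forward instead:

$ kubectl port-forward service/words-http-v1 8080:8080

With the port-forward running, substitute http://localhost:8080 for the external IP address in the POST request below.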

At the SCDF shell, send a POST request to the words-http-v1 application, using the assigned IP address:

dataflow:>http post --target http://104.197.218.242:8080 --data "This is a test"
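
Equivalently, you can send the POST request from outside the SCDF shell with curl:

$ curl -X POST -H "Content-Type: text/plain" -d "This is a test" http://104.197.218.242:8080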

In the terminal window or tab that is viewing the log output of the log sink application, watch for the processed data to appear:

2020-06-03 19:56:59.163  INFO 1 --- [container-0-C-1] log-sink                                 : This
2020-06-03 19:56:59.169  INFO 1 --- [container-0-C-1] log-sink                                 : is
2020-06-03 19:56:59.171  INFO 1 --- [container-0-C-1] log-sink                                 : a
2020-06-03 19:56:59.255  INFO 1 --- [container-0-C-1] log-sink                                 : test

Destroy the data pipeline

At the SCDF shell, run the stream undeploy command to undeploy the stream:

dataflow:>stream undeploy words
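
Undeploying the stream deletes the application pods. To confirm, run the earlier pod listing again; after the pods finish terminating, it returns no matching resources (the message below assumes the default namespace):

$ kubectl get pods -l role=spring-app
No resources found in default namespace.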

Then run the stream destroy command to destroy the stream definition:

dataflow:>stream destroy words

Learn more about streams and tasks

To learn more about creating streams, see the Getting Started with Stream Processing guide. To learn about creating tasks, see the Getting Started with Task Processing guide.
