What is Apache NiFi
Apache NiFi is an easy-to-use, powerful, and reliable system for automating the flow of data between software systems. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
It is highly configurable, has a web-based user interface, and can track data from beginning to end. It also lets us build our own processors, which supports rapid development and effective testing.
GCP and Apache NiFi Integration
Apache NiFi uses processors to process and transform data, so connecting to GCP products means using NiFi processors. NiFi ships with a variety of prebuilt processors. For GCP, these include processors for Google Cloud Pub/Sub and Google Cloud Storage, such as ConsumeGCPubSub, which reads messages from a Pub/Sub subscription, and FetchGCSObject, which fetches files from a GCS bucket. For more information on processors, see https://nifi.apache.org/docs.html
We can also build custom processors to connect with other GCP products. In the demo below, a custom processor connects Apache NiFi with Google Cloud Dataflow.
Designing a Workflow in NiFi Using Dataflow
Apache NiFi provides a template mechanism for creating reusable, customizable flows. We will demonstrate how to use NiFi to run a Google Cloud Dataflow job that counts words and a second job that sorts them lexicographically.
- The first step is to list files from a GCS bucket. We add the ListGCSBucket processor from the list of prebuilt processors, as highlighted in steps 1 and 2. After adding the processor to the canvas, as shown in step 3, we configure it with the GCS bucket and the folders from which to fetch text files.
- Following similar steps, to keep only text files, we select the prebuilt RouteOnAttribute processor and add our regex filter to its configuration.
- For the third and fourth steps, we will see how to use custom processors to run Dataflow jobs on Google Cloud.
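To make the filtering step concrete, here is a minimal Python sketch of the kind of check a regex filter on file names performs. It is not NiFi code: the pattern, the function name `route_text_files`, and the sample object names are all assumptions made for illustration. In NiFi itself, the condition is configured as a property on the RouteOnAttribute processor.

```python
import re

# Hypothetical pattern: keep object names that end in ".txt" (case-insensitive).
TEXT_FILE_PATTERN = re.compile(r".*\.txt$", re.IGNORECASE)


def route_text_files(object_names):
    """Split names into (matched, unmatched), similar to a two-way route."""
    matched, unmatched = [], []
    for name in object_names:
        if TEXT_FILE_PATTERN.fullmatch(name):
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


# Illustrative listing, standing in for what a bucket listing step might emit.
listing = ["input/a.txt", "input/b.csv", "input/notes.TXT", "input/txt.bak"]
text_files, other_files = route_text_files(listing)
print(text_files)
```

Only the matched names would continue to the Dataflow steps; the unmatched ones would be dropped or routed elsewhere.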
Apache NiFi lets us write custom processors and configure them, which makes it easy to extend and enables rapid development, as shown in the image below.
As part of the third and fourth steps, we take our Dataflow templates stored in Google Cloud Storage and map them in the configuration of the Dataflow processor above.
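The custom processor's internals are not shown here, but launching a Dataflow template stored in GCS generally comes down to a call to the Dataflow REST API's `templates:launch` method. The sketch below only builds such a request and does not send it. The project, region, bucket paths, job name, template parameters, and the `build_launch_request` helper are all hypothetical, and obtaining an OAuth access token is assumed to happen elsewhere.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

DATAFLOW_API = "https://dataflow.googleapis.com/v1b3"


def build_launch_request(project, location, template_gcs_path,
                         job_name, parameters, access_token):
    """Build (but do not send) a templates:launch request for a GCS template."""
    # The template location is passed as the gcsPath query parameter.
    query = urlencode({"gcsPath": template_gcs_path})
    url = (f"{DATAFLOW_API}/projects/{quote(project)}"
           f"/locations/{quote(location)}/templates:launch?{query}")
    # The request body carries the job name and the template's own parameters.
    body = json.dumps({"jobName": job_name, "parameters": parameters})
    return Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# All values below are placeholders, not the ones used in the demo.
request = build_launch_request(
    project="my-project",
    location="us-central1",
    template_gcs_path="gs://my-bucket/templates/word-count",
    job_name="word-count-from-nifi",
    parameters={"inputFile": "gs://my-bucket/input/a.txt",
                "output": "gs://my-bucket/output/counts"},
    access_token="ACCESS_TOKEN",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the job. The processor configuration described above plays the same role as the arguments here: it supplies the template path and the job parameters.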
Finally, after adding all the processors and configuring the Dataflow processors with our Dataflow job templates (Word Count and Lexicographic Sort), we have a full flow template, as shown in the image below. We can start the entire workflow from the Operate palette of Apache NiFi.
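For reference, the logic of the two jobs can be sketched in a few lines of plain Python. This is not the Dataflow pipeline code behind the templates, only a local illustration of what "word count" followed by "lexicographic sort" computes. The tokenization rule and the sample lines are assumptions.

```python
import re
from collections import Counter


def count_words(lines):
    """First job: count how often each word occurs (case-insensitive)."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


def sort_lexicographically(counts):
    """Second job: order the (word, count) pairs by word."""
    return sorted(counts.items())


lines = ["the quick brown fox", "the lazy dog", "The end"]
for word, n in sort_lexicographically(count_words(lines)):
    print(f"{word}: {n}")
```

In the flow, the second job reads the output of the first, which is why the two Dataflow processors are chained one after the other.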
We can also view a summary of all the processors on the NiFi Summary page.
To conclude, Apache NiFi is a highly configurable data flow management tool with a web-based user interface. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. To connect with GCP products such as Pub/Sub, we can use the built-in processors, and where no built-in processor exists we can build our own.
By comparison, Airflow is a platform to programmatically author, schedule, and monitor data pipelines, with workflows authored as directed acyclic graphs (DAGs) of tasks. Airflow can be classified as a tool in the Workflow Manager category, while Apache NiFi is grouped under Stream Processing.
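To illustrate the idea of a workflow as a directed acyclic graph of tasks, the sketch below models the four steps of the demo flow as a DAG and derives a valid execution order. It uses Python's standard `graphlib` module, not Airflow's or NiFi's API, and the task names are made up for this example.

```python
from graphlib import TopologicalSorter

# Each key is a task; its set holds the tasks that must finish first.
flow = {
    "filter_text_files": {"list_gcs_files"},
    "word_count_job": {"filter_text_files"},
    "lexicographic_sort_job": {"word_count_job"},
}

# A topological order runs every task after all of its dependencies.
order = list(TopologicalSorter(flow).static_order())
print(order)
```

A workflow manager schedules tasks in such an order, whereas NiFi keeps every processor in the graph running and streams data between them as it arrives.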