Stream tweets by keyword using a Kafka producer, Kafka Streams, and a Dockerized HBase instance for persistence.
- From the main menu, select **File | Open**. Alternatively, click **Open** or **Import** on the welcome screen.
- In the dialog that opens, select the `pom.xml` file of the project you want to open and click **OK**.
- In the next dialog, click **Open as Project**.

IntelliJ IDEA opens and syncs the Maven project in the IDE. If you need to adjust importing options when you open the project, refer to the Maven settings.
- Confluent Platform 5.2 or later
- Confluent CLI
- Java 1.8 or 1.11 to run Confluent Platform
  - macOS Java 8 installation:

    ```
    brew tap adoptopenjdk/openjdk
    brew cask install adoptopenjdk8
    ```
- Maven to compile the client Java code (if using IntelliJ, Maven comes bundled with the IDE, so you can skip this step)
- Docker
- Apache HBase Sink Connector (writes data from a Kafka topic to a table in the specified HBase instance):

  ```
  confluent-hub install confluentinc/kafka-connect-hbase:latest
  ```
Apply for a Twitter Developer Account to receive access tokens and keys for the Twitter API. Once the account is approved, keys and tokens can be generated by creating an App in the Twitter Developer dashboard.

To create an app:

1. **Apps** -> **Create an app** -> fill out the App details form.
2. Copy the keys and tokens into the `twitter.properties` file.
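For reference, `twitter.properties` might look like the following sketch. The property names here are assumptions for illustration; align them with whatever the Twitter client in this project actually reads:

```properties
# Hypothetical property names -- match them to the Twitter client's expectations.
consumerKey=YOUR_CONSUMER_KEY
consumerSecret=YOUR_CONSUMER_SECRET
token=YOUR_ACCESS_TOKEN
tokenSecret=YOUR_ACCESS_TOKEN_SECRET
```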
- Create a Dockerized HBase instance:
  1. Get the Docker image:

     ```
     docker pull aaionap/hbase:1.2.0
     ```
  2. Start the HBase Docker image:

     ```
     docker run -d --name hbase --hostname hbase -p 2182:2181 -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 aaionap/hbase:1.2.0
     ```
  3. Add the entry `127.0.0.1 hbase` to `/etc/hosts`.
- Run the Kafka producer and Kafka Streams application:

  ```
  sh run.sh
  ```
- Check HBase for data:
  1. Start the HBase shell:

     ```
     docker exec -it hbase /bin/bash entrypoint.sh
     ```
  2. Verify that the table `popular-tweets-avro` exists by running `list` in the shell. The output should be:

     ```
     TABLE
     popular-tweets-avro
     1 row(s) in 0.2750 seconds
     => ["popular-tweets-avro"]
     ```
  3. Verify the table received data:

     ```
     scan 'popular-tweets-avro'
     ```
- Clean up resources:
  1. Delete the connector:

     ```
     confluent local unload hbase
     ```
  2. Stop Confluent:

     ```
     confluent local stop
     ```
  3. Delete the Dockerized HBase instance:

     ```
     docker stop hbase
     docker rm -f hbase
     ```
The producer ingests tweets from the Twitter API, filtered by a configured list of search terms.

```java
public class TwitterProducer {
    List<String> terms = Lists.newArrayList("conspiracy", "conspiracyTheory", "fakenews");
}
```
All messages conform to a schema (class) defined in `/resources/avro/Tweets.avsc` before they are sent to the Kafka topic in Avro format.
```java
public Tweets(java.lang.CharSequence tweet, java.lang.CharSequence userName, java.lang.Integer userNumFollowers) {
    this.tweet = tweet;
    this.userName = userName;
    this.userNumFollowers = userNumFollowers;
}
```
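To make the shape of these records concrete, here is a minimal, self-contained stand-in for the generated class (illustration only; the real class is generated from `Tweets.avsc` and carries additional Avro machinery such as the schema and serialization hooks):

```java
// Minimal stand-in for the Avro-generated Tweets class -- illustration only.
// The real class is generated from Tweets.avsc by the Avro Maven plugin.
public class TweetsDemo {
    static final class Tweets {
        final CharSequence tweet;
        final CharSequence userName;
        final Integer userNumFollowers;

        Tweets(CharSequence tweet, CharSequence userName, Integer userNumFollowers) {
            this.tweet = tweet;
            this.userName = userName;
            this.userNumFollowers = userNumFollowers;
        }
    }

    public static void main(String[] args) {
        // Build a record the same way the producer would before serializing it to Avro.
        Tweets t = new Tweets("the moon landing was staged", "some_user", 125000);
        System.out.println(t.userName + " (" + t.userNumFollowers + " followers): " + t.tweet);
    }
}
```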
To create the code-generated class, compile the Java class from the `Tweets.avsc` file:

```
mvn clean compile package
```
The stream ingests messages from the Kafka topic and filters on the user's number of followers. After filtering, the messages are persisted to a new Kafka topic.
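The filtering step reduces to a simple predicate on the follower count. A minimal stdlib sketch, assuming a hypothetical threshold of 10,000 followers (the actual cutoff lives in the stream code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class PopularTweetFilter {
    // Assumed threshold; the real value is defined in the Kafka Streams topology.
    static final int MIN_FOLLOWERS = 10_000;

    // Keep only tweets from users with at least MIN_FOLLOWERS followers.
    static boolean isPopular(Integer followers) {
        return followers != null && followers >= MIN_FOLLOWERS;
    }

    public static void main(String[] args) {
        List<Integer> followerCounts = List.of(120, 50_000, 9_999, 1_000_000);
        List<Integer> kept = followerCounts.stream()
                .filter(PopularTweetFilter::isPopular)
                .collect(Collectors.toList());
        System.out.println(kept); // [50000, 1000000]
    }
}
```

Inside the Kafka Streams topology, the same predicate would sit in a `filter((key, tweet) -> ...)` step between the source topic and the sink topic.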
A single configuration file, `hbase-avro.json`, configures which (filtered) Kafka topic the HBase sink connector subscribes to in order to persist messages to an HBase table.
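As a sketch, assuming the option names of the Confluent HBase sink connector (verify the exact keys against the connector documentation for your version), `hbase-avro.json` could look like the following; `hbase:2182` matches the `/etc/hosts` entry and ZooKeeper port mapping from the Docker steps above:

```json
{
  "name": "hbase",
  "config": {
    "connector.class": "io.confluent.connect.hbase.HBaseSinkConnector",
    "tasks.max": "1",
    "topics": "popular-tweets-avro",
    "hbase.zookeeper.quorum": "hbase",
    "hbase.zookeeper.property.clientPort": "2182",
    "auto.create.tables": "true",
    "table.name.format": "popular-tweets-avro"
  }
}
```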