Real-time machine learning with TensorFlow, Kafka, and MemSQL

How to build a simple machine learning pipeline that allows you to stream and classify simultaneously, while also supporting SQL queries

TensorFlow has emerged as one of the leading machine learning libraries, and when combined with an operational database, it provides the foundation for quickly building sophisticated machine learning workflows.

In this post, we will explore a machine learning workflow using a speed dating dataset. The overall objective of this demonstration is to compare the machine-suggested matches with those a person might choose directly by looking at different people’s profiles. The dataset comes from a speed dating experiment on Kaggle.

As part of the workflow, we will detail how you can use MemSQL Pipelines to stream data from Kafka in real time into the database. Upon ingesting the data, we will incorporate TensorFlow to train and classify data simultaneously using some of the built-in TensorFlow algorithms. Finally, we’ll see how well the machine determines matches.

This overall architecture provides a template for creating more complex machine learning workflows with new datasets.

A machine learning workflow with TensorFlow

Our architecture consists of training and classification data streamed through Kafka and stored in a persistent, queryable database. In this case we will use MemSQL and take advantage of the Pipelines function to run TensorFlow operations on the stream before persisting it to the database.

[Figure: machine learning workflow with MemSQL]

On the Kafka side, we set up two Kafka topics, Classification and Training. Raw training and classification data is streamed from these Kafka topics into our MemSQL Pipeline.

On the database side, we create a database called speed_dating_matches, and within that database we create two tables, dating_training and dating_results.

  • dating_training is a single-row table where we place the output of the training evaluation to show training in action
  • dating_results is a table containing all of the data about a potential date as well as whether it was determined that this date is a match
    • isMatch = 1 means the date was a match
    • isMatch = 0 means the date was not a match
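
As a rough sketch of the two tables described above, the following uses Python's built-in sqlite3 module as a stand-in for MemSQL so that it runs anywhere. Only the table names and the isMatch column come from this article; every other column name and value is a hypothetical placeholder, and the DDL used with MemSQL may well differ.

```python
import sqlite3

# In-memory SQLite stands in for the speed_dating_matches database
# (assumption: the demo itself uses MemSQL, whose DDL and types may differ).
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE dating_training (
    -- Single-row table: id is pinned to 1 so each update overwrites the row.
    id INTEGER PRIMARY KEY CHECK (id = 1),
    accuracy REAL,          -- hypothetical evaluation metric
    updated_at TEXT         -- hypothetical timestamp
);
CREATE TABLE dating_results (
    person_name TEXT,       -- hypothetical column
    age INTEGER,            -- hypothetical column
    hometown TEXT,          -- hypothetical column
    isMatch INTEGER CHECK (isMatch IN (0, 1))  -- 1 = match, 0 = no match
);
""")

# Each training evaluation replaces the single row instead of appending.
# The accuracy values are made up for illustration.
for accuracy in (0.71, 0.78):
    conn.execute(
        "INSERT OR REPLACE INTO dating_training (id, accuracy, updated_at) "
        "VALUES (1, ?, datetime('now'))",
        (accuracy,),
    )

conn.execute("INSERT INTO dating_results VALUES ('Person A', 24, 'Boston', 1)")

training_rows = conn.execute("SELECT accuracy FROM dating_training").fetchall()
result_count = conn.execute("SELECT COUNT(*) FROM dating_results").fetchone()[0]
```

Pinning the primary key to a single value is one simple way to keep dating_training at one row, so a query against it always shows the latest training evaluation.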

Next we will create two Pipelines, speed_dating_training and speed_dating_results, which stream in the data from the Kafka topics, train or classify using that data, and place the final result in the corresponding table.
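
The step in which a Pipeline trains or classifies before writing to a table can be pictured as a small program that reads records from an input stream and writes enriched records to an output stream. The sketch below is only an illustration of that shape. It assumes, for simplicity, that records arrive as newline-delimited CSV text, and the exact byte framing MemSQL uses for Kafka messages should be checked against the MemSQL documentation. The function names, the column layout, and the threshold rule are all hypothetical, and classify_row is a placeholder where the TensorFlow model would be called.

```python
import csv
import io


def classify_row(row):
    """Hypothetical stand-in for the TensorFlow classifier: returns 0 or 1."""
    # Placeholder rule (assumption): the last field is a similarity score,
    # and a score of at least 0.5 counts as a match.
    return 1 if float(row[-1]) >= 0.5 else 0


def transform(in_stream, out_stream):
    """Read CSV records, append an isMatch value, and write CSV records."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        writer.writerow(row + [classify_row(row)])


# Demonstration with in-memory streams instead of a live Kafka topic.
sample = io.StringIO("Person A,Person B,0.8\nPerson A,Person C,0.2\n")
out = io.StringIO()
transform(sample, out)
result = out.getvalue()
```

In a deployed script, the same transform function would be called with sys.stdin and sys.stdout, so that the database can pipe each batch through it.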

Applying machine learning to predict matches

In the speed dating data, each participant distributes 100 priority points across six traits: attractiveness, intelligence, fun, shared interests, sincerity, and ambition.

It also includes biographical and interest information on hometown, study interests (data was collected from college students), and hobbies such as movies, yoga, travel, and video games.

The training data is a set of pairings with predetermined outcomes, and the classification data is the set of pairings for which we want the model to predict a match. With this information, we can look at who matched in the training data, and use our own answers to the questions to see whom we might match with.

From there, we can ask more detailed questions: What does the average person look for in terms of dating attributes and interests? How does the average person differ from the people I match with?

We can also query across the entire dataset or query a subset of the dataset that was determined to be a match.
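
To make the two kinds of query concrete, here is a minimal sketch that again uses Python's built-in sqlite3 module in place of MemSQL. The dating_results table name and the isMatch column come from this article; the other columns and all of the rows are invented for illustration.

```python
import sqlite3

# SQLite stands in for MemSQL here (assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dating_results ("
    "person_name TEXT, attractiveness REAL, intelligence REAL, isMatch INTEGER)"
)

# Made-up rows: priority points given to two of the six traits, plus the outcome.
rows = [
    ("P1", 20.0, 30.0, 1),
    ("P2", 40.0, 10.0, 0),
    ("P3", 30.0, 20.0, 1),
]
conn.executemany("INSERT INTO dating_results VALUES (?, ?, ?, ?)", rows)

# Query across the entire dataset.
avg_all = conn.execute(
    "SELECT AVG(attractiveness) FROM dating_results"
).fetchone()[0]

# Query only the subset that was determined to be a match.
avg_matches = conn.execute(
    "SELECT AVG(attractiveness) FROM dating_results WHERE isMatch = 1"
).fetchone()[0]
```

Comparing avg_all with avg_matches is the kind of question raised above: how the average person differs from the people flagged as matches.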

Using built-in TensorFlow models

TensorFlow comes with a number of built-in models to choose from. These include:

  • DNNClassifier
  • DNNRegressor
  • DNNLinearCombinedClassifier
  • DNNLinearCombinedRegressor
  • LinearClassifier
  • LinearRegressor

We will choose the linear classifier for the purpose of this demonstration, and base our model inputs on a combination of the following data types.

  • CSV field names. The CSV field names are the names that will be used when reading your CSV into Pandas dataframes.
  • TensorFlow categorical feature columns. Categorical features are values that cannot be naturally represented as a number. Features like country of residence, occupation, or alma mater are all examples of categorical features. One of the great features of TensorFlow is that you do not need to know in advance how many distinct values a given category will have; it handles creating the sparse vectors for you. See the “Base Categorical Feature Columns” section of the TensorFlow Linear Model Tutorial in the TensorFlow documentation.
  • TensorFlow continuous feature columns. Continuous features are anything that can be represented by a number. Features like age, salary, and maximum running speed are all examples of things that could be represented using a continuous feature column. For more information, see the “Base Continuous Feature Columns” section of the TensorFlow Linear Model Tutorial.
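
The difference between the two kinds of feature column can be illustrated without TensorFlow. The sketch below is plain Python, not TensorFlow's API: it hashes each categorical value into a fixed number of buckets, which is one common way to avoid knowing the vocabulary in advance, and passes continuous values through as floats. The field names, the bucket count, and the helper functions are all assumptions made for illustration.

```python
import zlib

HASH_BUCKETS = 1000  # assumed bucket count, chosen arbitrarily for illustration


def categorical_index(value, buckets=HASH_BUCKETS):
    """Map a category string to a bucket index without a predefined vocabulary."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return zlib.crc32(value.encode("utf-8")) % buckets


def encode(record, categorical_fields, continuous_fields):
    """Return (sparse, dense) encodings of one record.

    sparse maps each categorical field to its single active bucket index;
    dense lists the continuous fields as floats, in the order given.
    """
    sparse = {name: categorical_index(record[name]) for name in categorical_fields}
    dense = [float(record[name]) for name in continuous_fields]
    return sparse, dense


# Hypothetical CSV row, already read into a dict keyed by CSV field name.
record = {"hometown": "Boston", "field_of_study": "Law", "age": "24"}
sparse, dense = encode(record, ["hometown", "field_of_study"], ["age"])
```

Each categorical field ends up as one active index in a 1,000-wide vector, which is why a sparse representation is a natural fit, while age stays a single number.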

Putting training and classification data to work

In this example, people in the speed dating dataset are represented as a vector composed of how they ranked traits, completed biographical info, and listed interests:

Person <traits, biographical info, interests>

Training data is represented as

<Person A, Person B, 0>

where the final value is 0 for no match or 1 for a match.

Classification data is passed through as

<Person A, Bryan>

where the outcome is a 0 or 1 based on a predicted match.
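
One way to picture these records in code is shown below. This is a sketch under stated assumptions rather than the structure used in the demo: the class names, field names, and sample values are hypothetical, and the only facts taken from the article are the three parts of a person vector, the 100-point budget across traits, and the 0 or 1 label.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Person:
    traits: Dict[str, int]      # priority points per trait, summing to 100
    biography: Dict[str, str]   # biographical info, e.g. hometown
    interests: Dict[str, int]   # interest ratings, e.g. yoga

    def __post_init__(self):
        # The dataset has each person spread exactly 100 points over the traits.
        if sum(self.traits.values()) != 100:
            raise ValueError("trait priority points must sum to 100")


@dataclass
class Pairing:
    person_a: Person
    person_b: Person
    is_match: Optional[int] = None  # 0 or 1 for training data, None if unlabeled

    def is_training_example(self):
        """Training records carry a label; classification records do not."""
        return self.is_match is not None


# Made-up sample values for illustration only.
traits = {"attractiveness": 20, "intelligence": 20, "fun": 20,
          "shared_interests": 15, "sincerity": 15, "ambition": 10}
person_a = Person(traits, {"hometown": "Boston"}, {"yoga": 7})
person_b = Person(traits, {"hometown": "Austin"}, {"yoga": 3})

training_record = Pairing(person_a, person_b, 0)   # <Person A, Person B, 0>
classification_record = Pairing(person_a, person_b)  # label to be predicted
```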

In the following diagram, we can see that the training data is passed through to train the linear classifier model and the classification data is passed through the TensorFlow model to output a 0 or 1 based on the likelihood of a match.

[Figure: TensorFlow linear classifier with MemSQL]
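
To show the train-then-classify flow in miniature, the sketch below implements a tiny linear classifier (logistic regression trained by stochastic gradient descent) in plain Python. It is a stand-in written for illustration and is not TensorFlow's LinearClassifier; the two numeric features, the training examples, and the hyperparameters are all invented.

```python
import math


def predict_probability(weights, bias, features):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.5):
    """Fit weights and bias with stochastic gradient descent on log loss."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict_probability(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def classify(weights, bias, features):
    """Output 1 for a predicted match and 0 otherwise."""
    return 1 if predict_probability(weights, bias, features) >= 0.5 else 0


# Hypothetical features per pairing: [interest overlap, trait similarity].
training_data = [
    ([0.9, 0.8], 1),
    ([0.8, 0.9], 1),
    ([0.1, 0.2], 0),
    ([0.2, 0.1], 0),
]
weights, bias = train(training_data)

likely_match = classify(weights, bias, [0.85, 0.85])
unlikely_match = classify(weights, bias, [0.15, 0.15])
```

The split mirrors the workflow above: labeled pairings go through train, and unlabeled pairings go through classify to receive a 0 or 1.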

Predicting love with TensorFlow and MemSQL

With this infrastructure in place we can add our own information into the mix. In this case we can feed dating information for an individual into the classification workflow and predict the likelihood of a match. To assess the validity, one could then look at the matches to see if they are representative of what one might have chosen directly.

The overall architecture provides a number of advantages. It supports simple streaming of new data through Kafka, draws on out-of-the-box TensorFlow models, and persists data in a format that can be easily queried with SQL. Fundamentally, it provides the ability to stream data into MemSQL and classify simultaneously. For more on this, see the TensorFlow documentation on serving a TensorFlow model.

If you would like to see a demonstration of this application in action, feel free to check out this 10-minute video, “Real-time Machine Learning with TensorFlow, Kafka, and MemSQL,” from Strata Data Conference New York 2017.

Gary Orenstein leads marketing strategy, growth, communications, and customer engagement at MemSQL. Prior to MemSQL, Gary was the Chief Marketing Officer at Fusion-io where he led global marketing activities. He holds a bachelor’s degree from Dartmouth College and a master's in business administration from The Wharton School at the University of Pennsylvania.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2017 IDG Communications, Inc.