How We Segmented Both Customers and Trainers Using Machine Learning

Dec 16, 2021

picture with Thomas Menzefricke from FlexIt, a client in this project

Introduction

Due to the COVID-19 pandemic, many traditionally in-person services have shifted towards virtual delivery. After getting used to our new virtual lives (which now include work, education, and medical services), we have realized how convenient and effective these services are!

In this virtual era, FlexIt offers Virtual Personal Training services and has seen significant growth in the fitness industry as a supplement to traditional in-person personal training offerings.

FlexIt was founded in 2018 by Columbia Business School alumnus Austin Cohen. Aiming to, in the words of Cohen, “make fitness accessible to everyone”, the company offers high-quality online personal training at competitive price points and has become one of the leading companies in the virtual fitness industry.

During our semester in Columbia Business School’s Analytics in Action class, we worked with FlexIt to streamline their product offering using management consulting methods and data analysis. Our team consisted of three MBA students with backgrounds in management consulting and corporate management, and two data scientists from the School of Engineering and Applied Sciences with backgrounds in computer science and data analytics.

Our problems

We wanted to help FlexIt use data and sophisticated analytical tools to make their operations more efficient and effective. Our team spoke at length with members of the FlexIt team and found two critical use cases to which we could apply our methods in order to improve the quality of FlexIt’s Virtual Personal Training product and increase FlexIt’s revenue.

But, before explaining the problems, let’s take a look at FlexIt’s customer journey.

Step 1: Potential customers arrive at FlexIt’s website (either organically or through one of FlexIt’s paid partnerships).

Step 2: They first take the “Goal Quiz”, which is designed to assess the goals and preferences of the customers.

Step 3: After the quiz, they enter their contact information to receive their results (thereby becoming Leads).

Step 4: They are guided through the process of creating an account and registering for their first workout.

Step 5: After the trial, they can decide whether or not to purchase a package of workouts.

After speaking to the FlexIt team and considering the customer journey, we defined the two problems to solve.

Problem 1: Increase the quality of the “Leads”: customers who submit their contact information — Getting these leads is critical to FlexIt’s future growth and understanding the demographics of key groups enables FlexIt to optimize their marketing.

Problem 2: Increase the rate of “Conversions”: customers who purchase packages — The purchase of packages is a major revenue source for FlexIt, and directly correlates to future company growth.

Our approach

To address the two problems above, we created the three analytical modules listed below.

<Module 1 Customer data analysis>

We thought that analyzing the data from the customer goals quiz would help us better understand our customers and lead to more effective marketing activities. In order to do so, we considered an analysis approach in which the answers to each customer’s goal quiz were clustered into several customer groups.

In addition, we ran an A/B test to identify which questions would be effective in gaining a better understanding of customers. In the past, customer data was collected using three questions; by experimenting with new questions, we analyzed whether it was possible to separate the customer groups more clearly.

<Module 2 Trainer data analysis>

For trainers as well as customers, we thought we might be able to identify trainers who are likely or unlikely to lead to customer conversions by conducting a cluster analysis using the data. Specifically, we categorized trainers into several groups by analyzing information such as their specialty training areas and the contents of their self-introductions.

In addition, since information on trainers not previously held by FlexIt may be useful for this analysis, a new survey of trainers was designed and valid data items were identified.

<Module 3 Matching (customer & trainer) analysis>

We believe that if we can identify the optimal trainer characteristics for each customer, we will be able to provide more satisfactory services to our customers and increase future conversion rates. The analyses in modules 1 and 2 above reveal the characteristics of both the customers and the trainers, and by combining them we expected to be able to propose a better matching method. Specifically, by analyzing the past training results and conversions of each customer group and each trainer group, we identified which combinations work best.

The above explanation can be summarized into a schematic diagram as follows

Analyses and Solutions

Module 1 Customer Analysis

If you visit FlexIt’s website, you’ll find a goal quiz with several questions. After you finish the goal quiz, FlexIt provides you with a personalized training plan recommendation.

While the quiz helps people start their workout journey with FlexIt more easily, it also helps FlexIt better understand its customers. In the first module of our project, we used de-identified data from users’ quiz responses to analyze and segment users, in order to inform marketing and matching.

At first, FlexIt used a three-question goal quiz. Its questions are as follows.

What is your wellness goal? (motivation)
How active are you? (experience)
What kind of trainer do you like? (preferred trainers)

Analyzing with K-means clustering and decision tree classification, we found that the quiz includes important features for such a fitness training platform to segment users. The important features are users’ experience and motivation to work out.

This is well demonstrated by the decision tree model (Figure 1). This model was built by selecting the features that best separate people who have converted from those who have not (people are considered to have converted if they have purchased at least one training package). Before building the model, we balanced the data with SMOTE, and we limited the tree to a maximum of six leaves to avoid overfitting. Because the tree is built from the top down, with the most important features selected first, we can see that wellness goals and activeness are more significant than preferred trainers in separating converted from unconverted people. Moreover, if we receive the quiz responses of a new user, we may use this model to predict whether they are more or less likely to convert. For example, people whose wellness goals are improving overall fitness or getting stronger are more likely to convert.
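To make the idea of "most important features first" concrete, here is a minimal sketch of how a tree can choose its first split by impurity reduction. It is an illustration only, not the model we built: the toy records, the field names (goal, activity, trainer_style) and the multiway Gini split are assumptions made for this example, and SMOTE balancing and the six-leaf limit are left out.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(rows, labels, feature):
    """How much grouping the rows by one categorical feature lowers impurity."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    after = sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return gini(labels) - after

def rank_features(rows, labels):
    """Features sorted from most to least useful for the first split."""
    gains = {f: impurity_reduction(rows, labels, f) for f in rows[0]}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Hypothetical quiz responses; 1 = converted, 0 = did not convert.
rows = [
    {"goal": "get stronger", "activity": "high", "trainer_style": "calm"},
    {"goal": "get stronger", "activity": "high", "trainer_style": "tough"},
    {"goal": "get stronger", "activity": "low", "trainer_style": "calm"},
    {"goal": "get stronger", "activity": "low", "trainer_style": "tough"},
    {"goal": "relax", "activity": "high", "trainer_style": "calm"},
    {"goal": "relax", "activity": "high", "trainer_style": "tough"},
    {"goal": "relax", "activity": "low", "trainer_style": "calm"},
    {"goal": "relax", "activity": "low", "trainer_style": "tough"},
]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

ranked = rank_features(rows, labels)
for feature, gain in ranked:
    print(f"{feature}: impurity reduction {gain:.3f}")
```

In this made-up data the wellness goal gives the largest impurity reduction, so a tree would split on it first. A library implementation would typically use binary splits on encoded features rather than the multiway split shown here.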

Although, due to the limited number of features, the model’s performance is not satisfactory, the analyses set a path for us to improve the efficiency of the quiz to segment users.

Figure 1. Decision tree classification on whether people convert or not

Therefore, in order to improve the efficiency of the quiz, we proposed a new quiz and conducted an A/B test of the quiz. In the new quiz, two questions related to experience and motivation were kept and two new questions about experience and motivation were added. They are as follows.

What is your wellness goal? (motivation)
How often do you work out weekly? (experience)
Are you currently a gym member? (experience)
How often do you feel stressed? (motivation)

In the A/B test, 70% of people received the original three-question quiz and 30% received the new four-question quiz. The A/B test ran for three weeks, after which we used the collected data for our analyses. We kept track of each question’s drop-off rate, the user registration rate after taking the quiz, and the user conversion rate.
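A 70/30 split like this can be sketched as follows. This is a hedged illustration rather than FlexIt's actual implementation: the hash-based bucketing, the function names (assign_variant, funnel_rates) and the sample records are all assumptions, and per-question drop-off tracking is omitted.

```python
import hashlib
from collections import defaultdict

def assign_variant(user_id, new_quiz_share=0.30):
    """Deterministically place a visitor in one arm of the test.

    Hashing the ID means the same visitor always sees the same quiz.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # value in [0, 1)
    return "new_quiz" if bucket < new_quiz_share else "original_quiz"

def funnel_rates(records):
    """Registration and conversion rates among quiz takers, per variant."""
    totals = defaultdict(lambda: {"takers": 0, "registered": 0, "converted": 0})
    for record in records:
        t = totals[record["variant"]]
        t["takers"] += 1
        t["registered"] += int(record["registered"])
        t["converted"] += int(record["converted"])
    return {
        variant: {
            "registration_rate": t["registered"] / t["takers"],
            "conversion_rate": t["converted"] / t["takers"],
        }
        for variant, t in totals.items()
    }

# Hypothetical outcomes for a handful of quiz takers.
records = [
    {"variant": "original_quiz", "registered": True, "converted": True},
    {"variant": "original_quiz", "registered": True, "converted": False},
    {"variant": "original_quiz", "registered": False, "converted": False},
    {"variant": "original_quiz", "registered": False, "converted": False},
    {"variant": "new_quiz", "registered": True, "converted": True},
    {"variant": "new_quiz", "registered": False, "converted": False},
]
rates = funnel_rates(records)
print(rates)
```

Deterministic assignment is one common design choice because it keeps a returning visitor in the same arm without storing any state.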

To segment users, we reduced the one-hot encoded features from 21 to 11 dimensions with SVD, keeping the variance explained for each question above 70%, and then performed K-means clustering. When we clustered the responses to the first three questions into five groups, the results showed distinguishable patterns among the clusters (Figure 2). Based on the results, we summarized each group’s characteristics along four dimensions and found that they also help us understand why people convert, which provides insights for marketing strategies (Figure 3).
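The encoding and clustering steps can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: the SVD step is skipped to keep the example dependency-free, the responses and answer labels are invented, and the farthest-first initialization is our choice for reproducibility rather than what a library such as scikit-learn does by default.

```python
def one_hot(responses, questions):
    """Turn categorical answers into 0/1 vectors, one column per answer."""
    columns = [(q, a) for q in questions
               for a in sorted({r[q] for r in responses})]
    vectors = [[1.0 if r[q] == a else 0.0 for q, a in columns]
               for r in responses]
    return vectors, columns

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, max_iterations=50):
    """Lloyd's algorithm with a deterministic farthest-first start."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(
            squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignments, centroids

# Hypothetical quiz responses with two of the quiz questions.
responses = [
    {"goal": "get stronger", "activity": "very active"},
    {"goal": "get stronger", "activity": "very active"},
    {"goal": "get stronger", "activity": "somewhat active"},
    {"goal": "reduce stress", "activity": "not active"},
    {"goal": "reduce stress", "activity": "not active"},
    {"goal": "reduce stress", "activity": "somewhat active"},
]
vectors, columns = one_hot(responses, ["goal", "activity"])
assignments, centroids = kmeans(vectors, k=2)
print(assignments)
```

Each centroid is the share of cluster members who picked each answer, which is the kind of per-question distribution that makes clusters easy to describe.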

Figure 2. Distribution of users in each cluster regarding each question

Figure 3. Characteristics of clusters of users

The three clusters of users highlighted in Figure 3 have better-than-average conversion rates. We found that people in these clusters tend to convert because they have clear motivations to use FlexIt, regardless of workout experience. For example, people in cluster 2 are motivated by a lack of money or time, or by some dissatisfaction with their gym experience (which leads to few gym visits), and they would like to find alternative ways to work out, like FlexIt. People in cluster 5 are gym members who pursue personalized training with FlexIt. With these clustering results, we proceeded to test coupon sensitivity on different user segments; that test is still ongoing. We hope that the clustering results, together with the coupon sensitivity analysis, will provide insights that help improve the company’s marketing strategies and its matching mechanism between users and trainers.

Module 2 Trainer Analysis

On FlexIt’s website, there are many personal trainers for customers to choose from. Information about each trainer is also available on the website: specialties, certification, languages, and a short biography.

The customer journey is: after registration, a customer can pick a trainer based on their profile and take a discounted first session, then decide whether or not to repurchase.

Our hypothesis is that whether a customer picks a trainer depends on what the customer sees about that trainer on the website, whereas whether the customer then subscribes to the trainer depends much more on their actual experience of, and feelings about, the session.

Our mission for module 2 is to find the trainers who 1) are active as trainers for FlexIt and 2) have an especially strong rate of client repurchase.

The existing trainer information contains the following:

Biography
Favorite exercise
Preferred music style
Favorite cheat meal
Quote to live by
Gender
Languages spoken
Trainer specialties

As mentioned, only part of the information above is shown to clients. To analyze it, we did some feature engineering on the trainer data: 1) we used natural language processing (sentiment analysis) to extract features from the biography; 2) we grouped the 26 specialties into 7 category labels; 3) considering feature importance and missing data in the other fields, we kept only these two features.

Here are some examples of our feature engineering:

Biography (6 categories):

Passionate
Scientific fitness
Very specialized in some area
Capable of many specialties
Specialized in recovery/nutrition/diet
Tend to push customers in fitness

Specialties (7 categories):

Shaping
Aerobic exercise (Cardio)
Power/Strength training
Recovery
Athletic training
Diet/nutrition
Post Natal
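The specialty grouping described above can be sketched as a simple lookup. Note that the raw specialty names in RAW_TO_CATEGORY are hypothetical placeholders (the actual 26 specialties are not listed here), and the binary encoding is one assumed way to turn the seven categories into features for clustering.

```python
# The seven specialty categories from the list above.
CATEGORIES = [
    "Shaping",
    "Cardio",
    "Power/Strength training",
    "Recovery",
    "Athletic training",
    "Diet/nutrition",
    "Post Natal",
]

# Hypothetical raw specialty labels mapped to a category (illustrative only).
RAW_TO_CATEGORY = {
    "Toning": "Shaping",
    "HIIT": "Cardio",
    "Running": "Cardio",
    "Weightlifting": "Power/Strength training",
    "Injury rehab": "Recovery",
    "Sports performance": "Athletic training",
    "Meal planning": "Diet/nutrition",
    "Postnatal fitness": "Post Natal",
}

def encode_specialties(raw_specialties):
    """Return a 0/1 vector with one entry per category.

    Raw labels that are not in the mapping are ignored.
    """
    present = {RAW_TO_CATEGORY[s] for s in raw_specialties
               if s in RAW_TO_CATEGORY}
    return [1 if category in present else 0 for category in CATEGORIES]

example = encode_specialties(["HIIT", "Running", "Meal planning"])
print(dict(zip(CATEGORIES, example)))
```

Several raw specialties can collapse into the same category, so a trainer listing both "HIIT" and "Running" still gets a single Cardio flag.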

We used K-means clustering to create trainer segments, in the same way as in the customer analysis.

The next part of module 2 focuses on two metrics: 1) the conversion (repurchase) rate and 2) the TWS rate (the proportion of trainers with sessions).
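As a rough sketch, the two metrics could be computed per trainer cluster like this. The record layout and the numbers are invented for illustration, and we assume here that the conversion rate is repurchases divided by first sessions, which may differ from the exact definition used in the analysis.

```python
from collections import defaultdict

def cluster_metrics(trainers):
    """TWS rate and conversion (repurchase) rate for each trainer cluster."""
    totals = defaultdict(lambda: {
        "trainers": 0, "with_sessions": 0,
        "first_sessions": 0, "repurchases": 0,
    })
    for trainer in trainers:
        t = totals[trainer["cluster"]]
        t["trainers"] += 1
        t["with_sessions"] += 1 if trainer["first_sessions"] > 0 else 0
        t["first_sessions"] += trainer["first_sessions"]
        t["repurchases"] += trainer["repurchases"]
    return {
        cluster: {
            # Share of trainers in the cluster who had at least one session.
            "tws_rate": t["with_sessions"] / t["trainers"],
            # Share of first sessions that led to a repurchase (assumed definition).
            "conversion_rate": (t["repurchases"] / t["first_sessions"]
                                if t["first_sessions"] else 0.0),
        }
        for cluster, t in totals.items()
    }

# Hypothetical trainer records.
trainers = [
    {"cluster": 1, "first_sessions": 4, "repurchases": 3},
    {"cluster": 1, "first_sessions": 0, "repurchases": 0},
    {"cluster": 1, "first_sessions": 0, "repurchases": 0},
    {"cluster": 1, "first_sessions": 0, "repurchases": 0},
    {"cluster": 2, "first_sessions": 10, "repurchases": 2},
    {"cluster": 2, "first_sessions": 10, "repurchases": 3},
]
metrics = cluster_metrics(trainers)
print(metrics)
```

Keeping the two rates separate matters because a cluster can be rarely picked yet convert well once picked, or the reverse.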

Our results show that the Specialties feature has a similar impact on the conversion rate and the TWS rate, while the Biography feature affects these two metrics in different ways:

1) Specialties

To explain the distribution of the conversion rate and TWS rate in each segment, we listed the feature distribution of each segment (the darker a cell is, the more strongly that feature is represented relative to the average):

We found that cluster 2 trainers, who had the lowest conversion rate, list few specialties overall; what’s more, some specialties (recovery, nutrition, and post natal) are significant for conversion.

2) Biography

Note: Cluster 1 has a high conversion rate but a low TWS rate.

An intuitive explanation is that, when picking trainers, clients read the biography and avoid trainers who mention giving clients “tough love”. These pushier trainers nevertheless satisfy the clients they do get, so even though they are less likely to be picked, they do best at teaching fitness.

To gather more information about trainers, we designed a new survey to collect trainer information that might be significant for conversion:

Age
Years of experience
Number of clients
Customer preference (beginner, intermediate, or advanced in fitness)
Specialties (limited to a maximum of 5)
Wellness goal in fitness
Time availability per day (morning, afternoon, evening)
Time availability per week (in hours)

Only some of the trainers answered the survey, but we still found several questions that could better identify conversion:

Morning availability
Number of clients
Amount of availability

The explanations are also intuitive. 1) Morning availability matters for clients who only have time for morning sessions, which could help with a future matching algorithm and customer analysis. 2) A trainer’s total number of clients is a better measure of professionalism than age: age is correlated with years of experience, but most trainers are young. 3) More available time slots mean more time that a trainer can devote to teaching, and availability is positively related to conversion.

Module 3 Matching Analysis

We conducted a matching analysis using the original quiz data, which has a large sample size, and the data from the trainer profiles. Specifically, we looked for good and bad matches between the customer and trainer clusters obtained in Modules 1 and 2.

We counted the matches for each combination of user cluster and trainer cluster, and calculated the probability that the user booked the same trainer again among those matches.
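A minimal sketch of that counting step is shown below. The record fields (customer_cluster, trainer_cluster, repeated) and the sample matches are assumptions made for the example, not the actual data schema.

```python
from collections import defaultdict

def repeat_rate_table(matches):
    """Match count and repeat rate for each (customer, trainer) cluster pair."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [matches, repeats]
    for match in matches:
        pair = (match["customer_cluster"], match["trainer_cluster"])
        counts[pair][0] += 1
        counts[pair][1] += 1 if match["repeated"] else 0
    return {
        pair: {"matches": n, "repeat_rate": repeats / n}
        for pair, (n, repeats) in counts.items()
    }

# Hypothetical first sessions and whether the user came back to the trainer.
matches = [
    {"customer_cluster": 2, "trainer_cluster": 1, "repeated": True},
    {"customer_cluster": 2, "trainer_cluster": 1, "repeated": True},
    {"customer_cluster": 2, "trainer_cluster": 1, "repeated": False},
    {"customer_cluster": 2, "trainer_cluster": 3, "repeated": False},
    {"customer_cluster": 5, "trainer_cluster": 3, "repeated": True},
    {"customer_cluster": 5, "trainer_cluster": 3, "repeated": False},
]
table = repeat_rate_table(matches)
for pair, stats in sorted(table.items()):
    print(pair, stats)
```

Reporting the match count next to each rate helps flag pairs whose rate rests on very few sessions.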

Some combinations showed extremely low repeat probabilities and others very high ones. If we can avoid the low-value matches and direct customers to the high-value matches, we can expect to increase our conversion rates.

We were not able to use the new customer clusters this time due to the insufficient sample size of the new quiz data, but we will be able to find more detailed matching rules in the future when we have more data.

We ran simulations to measure the effect of improving the accuracy of our matching. Specifically, we fixed the total number of sessions and the number of customers in each cluster, and optimized the allocation of trainers. In the simulation, this increased the conversion rate by a factor of 1.5. This is an optimistic figure, but it shows the high potential of matching optimization.
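To illustrate the kind of simulation involved, here is a small sketch that compares an even allocation with a greedy one. It is not the optimization we ran: the cluster names, demand, capacities and repeat rates are invented, and the greedy rule is a simple heuristic that is not guaranteed to find the best allocation.

```python
def expected_conversions(allocation, repeat_rate):
    """Expected number of repeat customers for a given session allocation."""
    return sum(n * repeat_rate[pair] for pair, n in allocation.items())

def even_allocation(demand, capacity):
    """Baseline: spread each customer cluster evenly over trainer clusters."""
    return {(c, t): n / len(capacity)
            for c, n in demand.items() for t in capacity}

def greedy_allocation(demand, capacity, repeat_rate):
    """Fill the best-converting pairs first, within demand and capacity."""
    remaining_demand = dict(demand)
    remaining_capacity = dict(capacity)
    allocation = {}
    for pair in sorted(repeat_rate, key=repeat_rate.get, reverse=True):
        customer_cluster, trainer_cluster = pair
        n = min(remaining_demand[customer_cluster],
                remaining_capacity[trainer_cluster])
        if n > 0:
            allocation[pair] = n
            remaining_demand[customer_cluster] -= n
            remaining_capacity[trainer_cluster] -= n
    return allocation

# Hypothetical inputs: sessions per customer cluster, sessions each trainer
# cluster can take, and the repeat rate of every cluster pair.
demand = {"A": 10, "B": 10}
capacity = {"T1": 10, "T2": 10}
repeat_rate = {
    ("A", "T1"): 0.4, ("A", "T2"): 0.1,
    ("B", "T1"): 0.2, ("B", "T2"): 0.3,
}

baseline = expected_conversions(even_allocation(demand, capacity), repeat_rate)
optimized = expected_conversions(
    greedy_allocation(demand, capacity, repeat_rate), repeat_rate)
uplift = optimized / baseline
print(f"baseline={baseline:.1f} optimized={optimized:.1f} uplift={uplift:.2f}x")
```

The uplift printed here comes only from the made-up inputs and says nothing about the real figure; an exact version of this problem could be solved as a transportation (linear programming) problem instead of greedily.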

Summary

Through the various analyses in this project, we arrived at the following findings and suggestions. We hope that these will be turned into actual initiatives in the future, leading to the improvement of FlexIt’s service quality and the further prosperity of the company.

Constructed a model to predict the future conversion rate of customers based on their answers to a goals quiz using decision tree analysis
In addition to the existing goals quiz, we found that by adding new questions about customer motivation and training experience, we could more clearly separate the customer groups
By applying K-means clustering to customers’ answers to the goals quiz, we found that customers can be broken down into five categories. Clarifying the characteristics of each customer group provides a basis for future marketing activities

By using natural language processing and K-means analysis of the trainers’ self-introductions and information on their training areas of expertise, the trainers were broken down into about five clusters
Found new metrics for trainers that affect customer conversion, such as whether they offer training slots in the morning and how many people they have trained in the past

Identified the best (or worst) match of customers and trainers by analyzing past conversions for each cluster of customers and trainers identified in modules 1 and 2
Revealed that if functions and services that can match customers and trainers are introduced in the future, the existing conversion rate could be improved by up to 1.5 times

We’d like to thank FlexIt for their great help in this project. In particular, we’d like to thank Thomas Menzefricke for his help in designing the analysis, providing data, and many other aspects of the project. Thank you very much.

In addition, we want to thank Daniel Guetta and Brett Martin, the professors in charge of this course, and the teaching assistants for their helpful advice and for guiding the project in the right direction. Without their support, we wouldn’t have been able to realize such a wonderful project. We’d like to express our sincere gratitude.

picture with our professors Daniel Guetta and Brett Martin