How do you make decisions when faced with a multitude of data?
Decision-making is central to business activity and management. It always involves an element of risk and relies primarily on the expert knowledge of the business teams. Data analysis techniques can augment and refine this essential knowledge, particularly when it is hampered by a large volume of seemingly heterogeneous data. In such situations, no single decision appears satisfactory for every scenario. What is needed is to segment the data into subsets that are homogeneous enough for a decision to be relevant to the whole group. When this segmentation is too complex to achieve by simply inspecting the data, powerful clustering techniques can be used to automate the search for these groups.
An effective solution for marketing segmentation
Suppose, for example, that we have a large amount of (anonymized!) data on potential customers, including their preferences and expectations regarding products. The goal is to identify which of our products these customers might be interested in, or how we should evolve our products to appeal to them. In other words, we want to use the information contained in the data to make decisions about product design and marketing strategy. The problem is that the customers, as described by the dataset, do not seem to have much in common at first glance. No general trend emerges, and any decision seems to apply to only a portion of them, without it being easy to say which portion. Yet it must be possible to sort these customers into groups such that customers within the same group have enough in common for a decision to be relevant to all of them! The difficulty is that the large number of characteristics describing these customers either prevents us from finding any groups or leads us to find too many, and ultimately we don't know what to prioritize to move the analysis forward.
Let's consider the following diagram representing customers (depicted as points), described here by only two characteristics, which are satisfaction ratings on two products A and B.
This diagram highlights two fairly distinct groups of customers: those who like both products a lot, and those who don't like A and moderately like B. In this situation, a metric such as the average satisfaction with product A may be of no interest and may even be misleading, because it corresponds to practically no actual customer. Basing a decision on it risks producing an outcome that suits no one, even though it is the "average" decision. What is needed here is to identify the two groups highlighted by the diagram and then consider the averages (or any other indicator) within each group rather than globally.
The case above may seem obvious, and one may wonder why sophisticated techniques are needed when simply plotting the data reveals the groups, as this diagram does. However, there are only two characteristics here (the satisfaction ratings for products A and B)! In reality, there are often dozens or even hundreds, and in any case too many for any graphical representation to be possible, whether in a plane or in 3D space. We are then blind: even if homogeneous groups exist, we are unable to discern them.
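To make the point about misleading averages concrete, here is a minimal Python sketch. The ratings are invented for illustration (a hypothetical 0 to 10 scale, not data from this article), and the two groups are assumed to be already known.

```python
from statistics import mean

# Hypothetical satisfaction ratings (product A, product B) on a 0-10 scale.
# Group 1 likes both products a lot; group 2 dislikes A and moderately likes B.
group_1 = [(9.0, 9.5), (8.5, 9.0), (9.5, 8.5), (9.0, 9.0)]
group_2 = [(1.5, 5.0), (2.0, 5.5), (1.0, 4.5), (1.5, 5.0)]
customers = group_1 + group_2

# Average satisfaction with product A, globally and within each group.
global_avg_a = mean(a for a, _ in customers)
avg_a_group_1 = mean(a for a, _ in group_1)
avg_a_group_2 = mean(a for a, _ in group_2)

# How far is the customer closest to the global average from that average?
closest_gap = min(abs(a - global_avg_a) for a, _ in customers)

print(global_avg_a)                  # 5.25
print(avg_a_group_1, avg_a_group_2)  # 9.0 1.5
print(closest_gap)                   # 3.25
```

With these made-up numbers, the global average for product A is 5.25, yet no customer rates A within three points of that value. The per-group averages (9.0 and 1.5) describe the customers far better.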
Clustering is the solution to this problem. It does precisely what the eye does on the diagram above, but for customers described by a large number of characteristics, which lets us benefit from all the richness of information contained in the data. Each customer is assigned to a group whose homogeneity can be measured. Each group can then be analyzed independently of the others, so that the decisions made are relevant to the entire group while the total number of decisions stays limited.
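As an illustration of assigning each customer to a group, here is a minimal sketch of one well-known clustering method, k-means (Lloyd's algorithm), written with the Python standard library only. The five characteristics per customer and their values are invented, the function name `kmeans` is ours, and the deterministic "farthest point" initialization is a simplification chosen for readability. In practice a Data Scientist would more likely rely on an established library.

```python
from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Minimal k-means sketch with a deterministic farthest-point start."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(mean(col) for col in zip(*members))
    return labels, centres

# Hypothetical customers, each described by five characteristics
# (too many to plot directly on a plane).
customers = [
    (9, 8, 9, 7, 8),
    (8, 9, 8, 8, 9),
    (9, 9, 7, 8, 8),
    (2, 5, 1, 3, 2),
    (1, 4, 2, 2, 3),
    (2, 5, 2, 3, 1),
]

labels, centres = kmeans(customers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each customer receives a group label, and the centre of each group gives a per-group summary that can support a decision for that group.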
An enrichment of expert knowledge
Clustering techniques can be implemented by a Data Scientist. Several methods exist, the main ones being detailed in the article "What is clustering? The 3 methods to know", and they can be tested side by side and then compared. Each method has its advantages and disadvantages. Some require the desired number of groups to be fixed in advance; others determine by themselves the number of groups that yields the greatest homogeneity. The Data Scientist knows how to select and adapt these techniques according to the nature of the data, the constraints they impose, and the practical objectives to be achieved.
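When a method requires the number of groups to be fixed in advance, one common workaround is to try several values and compare the homogeneity of the results. The sketch below does this with a small k-means and the silhouette coefficient, which is one possible quality measure among others and not necessarily the one used in the article cited above. The data points, the function names `kmeans` and `silhouette`, and the range of values tried are all assumptions made for illustration.

```python
from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Minimal k-means sketch; returns one group label per point."""
    centres = [points[0]]
    while len(centres) < k:  # deterministic farthest-point start
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = tuple(mean(col) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean tight, well-separated groups.

    Assumes at least two non-empty groups and points that are not all identical.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member groups
            continue
        # a: mean distance to the other members of the point's own group
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other group
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical 2D data with three visually obvious groups.
points = [
    (1.0, 1.0), (1.5, 1.0), (1.0, 1.5), (1.5, 1.5),
    (9.0, 1.0), (9.5, 1.0), (9.0, 1.5), (9.5, 1.5),
    (5.0, 9.0), (5.5, 9.0), (5.0, 9.5), (5.5, 9.5),
]

# Try several numbers of groups and keep the one with the best score.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

On this toy dataset the score is highest for three groups, which matches what the eye would see on a plot. On real data the scores are usually less clear-cut, and the choice remains a judgment made together with domain experts.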
These techniques are not intended to replace the expert but rather to enrich their knowledge by highlighting subsets that simply cannot be identified by reading the dataset alone. Use cases are very diverse and apply to any domain, as long as data is available. The resulting classification brings clarity that complements the knowledge of domain experts and makes decision-making faster and more efficient. It can speed up the search for a segmentation and serve as a basis for marketers' analyses. Clustering can also facilitate the implementation of predictive models such as neural networks or random forests, for example by helping to address thorny issues related to missing data. These different aspects will be covered in future articles.