Блог Datawiz.io

Math as an Oracle for Retailers. Correlation and linear models

Antonina KelmanMarch 10, 2016, 4:09 p.m.

Could we predict or influence in some way future sales of the shop? Is it possible to find products, which sales have a great influence on total turnover and total store sale? How math and statistics could be useful in retail?

Math and statistics are really powerful, it’s certain fact and everybody knows it, but less understands it. Let`s have a look where you really see math?

Generally nowadays math has only one purpose - break student’s brain at the high school. Students, who, by some way enjoy it, go to physics, computer science, and stock trading.

Here, in Datawiz.io, we manage to find such math geeks. And start to search way how to apply our knowledge to something more real, to something touchable, something that can bring benefit to society. So, we hack around and find retail industry. Retail is great, has a really lot of numbers to play - waterfall of numbers - products sales, receipts, customers behaviour, their social profiles etc. and we put all this things to work.

Simple visualization of sales data - it`s really boring, let QlickView and Tableu do it, anyway boring things are their expertise. Traditional analytics only gonna show you yesterday’s data that looks great in executive reports, but it never gonna show you tomorrow.

Machine learning models are different - you can interact with model, change parameters and look what gonna happen. Model gives you the possibility of control and requires your participation. It’s like you are playing with future, speaking with oracle. Models show you consequences of different factors combination. For ex. if tomorrow will be raining and Paciato beat Mayweather how many customers will come to my website?

In this article, we start with more simple examples.

As the examples for this article we took 2 stores. Let's name them: Shop 1, Shop 2.

We'll use from this stores such data: 1. sales data quantity per category, 2. product receipt penetration data per category.

Correlation models

Is it possible to find which products sales are correlated with total turnover and how sales of just few products will effect on total store sale?

We start with building of 4 tables, 2 per each shop, where: in rows - weeks; in columns - category; in cell sold product quantity or receipts quantity.

First of all let's find correlated sales. Sales that change together. The higher correlation coefficient - the higher likeliness.

There are 3 different variants.

If two products sales is correlated it`s means - increasing sale of product A will lead to the growth of sale of product B.

If not correlated, correlation coefficient near 0, means - Sale of Product A can dance up and down and Product B does not care at all, his sale has own life.

If product has reverse correlation - increasing sale of product A will lead to falling sale of product B. For ex. more we sell Coke, less we sell Pepsi.

So, we know that shop sells lot of milk, we bring hypothesis that milk and turnover has strong correlation. We plot milk sales data at blue (left chart) and turnover data in red (right chart).

plotPair <- function(x,y,namesX,namesY){ par(mfrow=c(2,1)) plot(x,type='l',col='red',main=namesX,xlab='') plot(y,type='l',col='blue',main=namesY,xlab='') par(mfrow = c(3,2), oma = c(5,4,0,0) + 1, mar = c(0,0,1,1) + 1 ) layout(matrix(c(1,2,3,4,5,5), 3,2, byrow = TRUE)) plot(x,type='l',col='blue',ylab='log sales qty',main=namesX,xlab='') plot(y,type='l',col='red',main=namesY,xlab='') hist(x,col='blue',main=names(x),breaks=20) hist(y,col='red',main=names(y),breaks=20) m<-lm(y~x) plot(x,y,xlab=namesX,ylab=namesY) abline(m,col='green',lwd=3) } plotPair(x=shop1_df$Milk.and.Dairy', y=moloko_df$sum, namesX='Milk.and.Dairy', namesY='Turnover') If we compare left and right charts, they are almost the same as we can see, they are normally distributed and distribution also looks the same. At bottom of chart we plot in horizontal axis Milk and in Vertical turnover and obviously we see linear dependence.

So, our hypothesis is true, milk and turnover has strong correlation.

It means customers who come for milk will buy some other products as well. So as you understood for shop it`s really important to keep attractive price for milk, it pulls in customers.

But how to find all correlated products? How do product sales correlate with turnover? Let's build correlation matrix.

corr<-function(df){ cr <-cor(df, use="complete.obs") par(cex = 0.9) corrplot.mixed(corr=cr,upper="ellipse", tl.pos="lt", col = colorpanel(50, "red", "gray60", "blue4"), cl.cex=0.5,tl.cex=1.1) }

Product sales correlation in Shop 1 Correlations close to zero are plotted as circular and gray (using the color scheme we specified), while magnitudes away from zero produce ellipses that are increasingly tighter and blue for positive correlation and red for negative.

We can see from the matrix the top products with highest correlation with turnover for Shop 1 are:

  • Bread and Bakery 0.977
  • Non-food products 0.950
  • Milk and Dairy 0.934
  • Sausages 0.930
  • Snacks 0.870
  • Tobacco.products 0.835
  • Confectionery 0.802
  • Diabetic meals 0.794
  • Grocery 0.782


Bread, packages (non-food products), milk, cakes have a strong correlation that easy to explain. Almost everybody who is coming to shop buys these products. So they will have a strong correlation in any shops around the world. More customers you have, more such kind of products you’ll sell.

But snacks, cigarette, and candies are different in this case. It shows that Shop 1 has strong group of customers who especially come for that products.

Also we can make some more customer-centric, not a product centric analyzes. Instead of product sale we put a product receipt penetration data per category (how many receipts, with current product do we have).

Product penetration correlation in Shop 1 As we can see top categories and numbers almost the same.


  • Bread and Bakery 0.986
  • Sausages 0.961
  • Non-food products 0.956
  • Milk and Dairy 0.944
  • Confectionery 0.867
  • Snacks 0.864
  • Tobacco.products 0.858
  • Meat 0.829
  • Diabetic meals 0.812

But would be situation the same in Shop 2?

Product sales correlation in Shop 2 Top categories in Shop 2 are:

  • Non-food products 0.966
  • Bread and Bakery 0.943
  • Milk and Dairy 0.908
  • Diabetic meals 0.882
  • Sausages 0.840
  • Soft drinks 0.837
  • Tobacco products 0.835
  • Confectionery 0.775
  • Alcohol 0.773

So Top categories changed, you can see 2 new categories:

  • Soft drinks
  • Alcohol

It could depend on geographical situation of this two shops, or other factors. Maybe near 1st shop was situated competitors who sell alcohol, but near current we don’t have it, so here alcohol is in list.

Product penetration correlation in Shop 2 Top categories for penetration correlation - Shop 2:

  • Non-food products 0.975
  • Bread and Bakery 0.968
  • Milk and Dairy 0.948
  • Sausages 0.927
  • Diabetic meals 0.905
  • Confectionery 0.899
  • Tobacco products 0.858
  • Soft drinks 0.819
  • Alcohol 0.785

Not only correlated relations has a value, non-correlated relations are important too. Oil doesn’t correlate with any products it means doesn’t matter how shop sale chart is moving, oil sale will has own movement. Shop has customers who come to buy oil and nothing else, and they rarely buy something else. This facts are valuable for better planning of traffic flow in the shop.

Some of the product has reverse correlation like fish and snacks. We can explain that people get use to buy beer+snacks or beer+fish. So they buy snacks or fish, but very rarely together. The same situation is with ice-cream and cakes.

This facts are valuable for better planning of product placement, traffic flow in the shop.

By more careful examining of correlation between products we can find lot of such artefacts, that can be rather useful.

This analysis shows what is happening now, and how different parts are connected.

But which products are rainmakers? How strong they and their combination could influence the total shop turnover?v Easiest way to find it - is to build simple linear model, easy to build, easy to analyze.

Linear models

So we want to get one equation what will describe shop’s sale structure and will be easy to model.

For the same dataset, we apply features selection algorithm.

feature_selection <- function(df){ drops <- c('Non-food.products','Fruits','Non-food.products') df <- df[,!names(df) %in% drops] control <- rfeControl(functions=rfFuncs, method="cv", number=10) results <- rfe(df[1:ncol(df)-1], df$sum, sizes=c(1:ncol(df)-1), rfeControl=control) return(results) } results <- feature_selection(df=shop1_df) predictors(results) plot(results, type=c("g", "o"))

For Shop 1 product sales output is:

  1. Bread and Bakery
  2. Snacks
  3. Milk and Dairy
  4. Coffee tea
  5. Grocery
  6. Sausages
  7. Preservations
  8. Confectionery
  9. Soft drinks
  10. Meat
  11. Diabetic meals
  12. Alcohol
  13. Tobacco products
  14. Concomitant food
  15. Refrigerated meals
  16. Vegetable oil

From chart we could see how many variables with different kind of error are generating. The best is to use 9 variables, but it`s really inconvenient for analytics, so we try top 6.

create formula moloco_formula <- sum~ Bread and Bakery+ Snacks+ Milk and Dairy+ Coffee tea+ Grocery+ Sausages

Apply linear model for Shop 1 product sales shop1_lm <- linear_models(moloco_formula,shop1_df) coefplot(shop1_lm, intercept=FALSE, outerCI=1.96, lwdOuter=1.5) Coefficients values, the more far away our coefficient from zero, more influence to output it has.

cv.lm(df=shop1_df, shop1_lm, m=10) #run cross validation test > sqrt(1629697) #mse [1] 1277 #square root from mse Print coefficients >shop1_lm$coefficients (Intercept) 5.53e+02 Bread and Bakery 3.20e+00 Snacks 4.60e+00 Milk and Dairy 1.18e+00 Coffee tea 9.34e-03 Grocery 1.23e+00 Sausages 3.58e+00 So our equation looks like:

Turnover +-1277 = 5530 + 3.2*Bread and Bakery + 4.6*Snacks +1.18*Milk and Dairy + 0.0934*Coffee tea + 1.23 *Grocery + 3.58*Sausages

”Coffee tea” has a very small influence to final output and we can remove it from the equation without losing accuracy.

Final equation:

Turnover +- 1277 = 5530 + 3.2 *Bread and Bakery + 4.6 *Snacks +1.18 *Milk and Dairy + 1.23 *Grocery + 3.58 *Sausages

We found what 5 products are turnover rainmakers for Shop 1. They have huge impact on final output so even some small increasing or decreasing among their sales will have huge impact for output. We think this information is of huge value.

If we need influence not on turnover but on receipt quantity, we just use other dataset with products penetration.
Feature selection

  1. Bread and Bakery
  2. Snacks
  3. Grocery
  4. Milk and Dairy
  5. Confectionery
  6. Sausages
  7. Coffee tea
  8. Preservations
  9. Tobacco products
  10. Soft drinks
  11. Meat
  12. Concomitant.food
  13. Diabetic meals
  14. Alcohol

shop1_formula <- sum~ Bread and Bakery+ Snacks+ Grocery+ Milk and Dairy+ Confectionery+ Sausages > sqrt(1033597) [1] 1017 > shop1_lm$coefficients (Intercept) 840.105 Bread and Bakery 3.792 Snacks 3.283 Grocery 0.741 Milk and Dairy 0.924 Confectionery 0.142 Sausages 2.358 Have a equation:

Receipt_qty +-1017 = 840 + 3.79 *Bread and Bakery + 3.29 *Snacks + 0.7*Grocery + 0.92\*Milk.and.Dairy + 0.14*Confectionery + 2.358 *Sausages

Let`s try to compare this 2 equations for Shop 1:

Turnover +- 1277 = 5530 + 3.2 *Bread and Bakery + 4.6 *Snacks +1.18 *Milk and Dairy + 1.23 *Grocery +3.58*Sausages


Receipt_qty+-1017 = 840 + 3.79 *Bread and Bakery + 3.29 *Snacks + 0.7*Grocery + 0.92 *Milk.and.Dairy +0.14*Confectionery + 2.358 *Sausages

Coefficients in first equation is higher than in second, so increasing of product sales will have stronger influence on turnover other than receipt quantity in case of Shop 1. We see new product category Confectionery with small influence appear at 2nd equation.

But for some other shop these 2 equations can be rather different. We could check it at the example of Shop 2.

For Shop 2 product sales output is:

  1. Diabetic meals
  2. Bread and Bakery
  3. Preservations
  4. Milk and Dairy
  5. Coffee tea
  6. Grocery
  7. Sausages
  8. Concomitant food
  9. Cakes
  10. Pet products
  11. Eggs
  12. Refrigerated meals
  13. Soft drinks
  14. Tobacco products
  15. Goods excluded
  16. Confectionery
  17. Alcohol
  18. Vegetable oil
  19. Snacks
  20. Fish and Seafoods
  21. Meat

shop2_formula = sum ~ Diabetic.meals+ Bread.and.Bakery+ Preservations+ > sqrt(9591095) [1] 3097 > shop2_lm$coefficients (Intercept) -1612.10 Diabetic.meals 27.77 Bread.and.Bakery 5.24 Preservations 1.44

Have such equation for Shop 2

Turnover +- 3097 = -1612.10 + 27.77 *Diabetic.meals + 5.24 *Bread.and.Bakery + 1.44 *Preservations

For Shop 2 product penetration output is:

  1. Bread and Bakery
  2. Milk and Dairy
  3. Diabetic meals
  4. Confectionery
  5. Sausages
  6. Preservations
  7. Coffee tea
  8. Grocery
  9. Eggs
  10. Cakes
  11. Concomitant food
  12. Refrigerated meals
  13. Snacks
  14. Tobacco products
  15. Soft drinks
  16. Vegetable oil
  17. Alcohol
  18. Goods excluded
  19. Pet products
  20. Meat
  21. Fish and Seafoods

shop2_formula = sum ~ Bread.and.Bakery+ Milk.and.Dairy+ Diabetic.meals+ Confectionery+ Sausages > sqrt(3382240) [1] 1839 > shop2_lm$coefficients (Intercept) 270.43 Bread.and.Bakery 2.24 Milk.and.Dairy 1.37 Diabetic.meals 14.81 Confectionery 1.21 Sausages 1.31

Receipt_qty+-1839 = 270.43 + 2.24 *Bread and Bakery + 1.37 *Milk.and.Dairy + 14.81*Diabetic.meals + 1.21*Confectionery + 1.31 *Sausages

If we have a look on 2 equations at Shop 2, we could see almost another situation comparing to Shop1. Turnover +- 3097 = -1612.10 + 27.77 *Diabetic.meals + 5.24 *Bread.and.Bakery +1.44 *Preservations

Receipt_qty+-1839 = 270.43 + 2.24 *Bread and Bakery + 1.37 *Milk.and.Dairy + 14.81*Diabetic.meals + 1.21*Confectionery + 1.31 *Sausages

Only 3 categories of products have influence on turnover, and Diabetic meals has really huge impact (it is because of Meals for Children that in Shop 2 were in the category of Diabetic meals). Receipt quantity in Shop 2 has 5 rainmakers. As you could see Preservations have influence on turnover, but are not so important in receipt quantity.

As an conclusion, examples of math models mentioned in the article are rather simple, but practical cases with correlation and linear models could give lots of useful analytics for retailers. Our approaches could help to measure what products are important for total turnover of the shop, plan sales and influence on customer behavior.

Stay Connected