Launching dynamic pricing:
How to evaluate the pilot test results
and not go wrong


Example 1. Self-service checkouts and the checkout area categories

In 2021, a retailer launched an A/B test in two hypermarkets. A dynamic pricing system managed prices in the test hypermarket; in the control hypermarket, prices were managed manually.
In the last two months of the pilot, sales of two categories, "chewing gum" and "chocolate", showed a clear negative trend and fell by around 20%. The pilot covered all categories except "ultra-fresh products", but only these two categories declined.
In the control store, sales of the "chewing gum" and "chocolate" categories didn't change during these two months.

Our experts analyzed the prices together with the retailer's team and didn't detect any price issues. Therefore, we began to investigate what could have changed on the retailer's side.

What was the reason for the decrease? Two months before the end of the test, the retailer replaced a quarter of the cash registers in the test hypermarket with self-service checkouts that had no checkout-area displays. Chewing gum and chocolate were displayed precisely in the checkout areas, so their sales decreased in proportion to the reduced number of checkout areas.

In the control hypermarket, self-service checkouts had been in use since its opening, so sales of goods from checkout areas didn't change during the pilot.

1 - During the pilot, the hypermarket chain found out that self-service checkouts bring not only advantages but also a negative effect on sales in the checkout area.
The company fixed the issue by adding displays to the checkout areas.

2 - Summing up the results of the pilot required more complex calculations. The experts had to calculate the losses caused by self-service checkouts in the test hypermarket and remove these losses from the resulting figures.

Example 2. The appearance of a competitor and a switch of strategy

A pharmacy chain launched a pilot.
For the A/B test, we chose two similar pharmacies with stable, strong sales and profitability in a city of half a million inhabitants. The goal of the pilot was to increase the profit of the test pharmacy compared to the control one.

The first two months of the pilot were successful; the pharmacy's sales and profits were growing.

The pilot pharmacy was located at a busy intersection; the location ensured a lot of customer visits.
Two months after the start of the pilot, a new pharmacy opened nearby, and the new competitor started to take away customer traffic and revenue.

The pharmacy chain's team had to switch from the initial goal, gross profit growth, to protecting revenue and traffic.
They were forced to adopt a completely different, competitive pricing strategy that took the new competitor's prices into account in price calculations.

1 - The chain succeeded in retaining traffic and revenue with dynamic competitive pricing; in this sense, the pilot was evaluated as successful.

2 - They couldn't estimate the potential increase in gross profit from dynamic pricing during the pilot test. Such an estimate required a control pharmacy with a newly opened competitor's pharmacy nearby.
The chain had no such pharmacy in the city.

3 - To estimate the potential increase in gross profit with dynamic pricing, the retailer had to run a new pilot, with a different test pharmacy.

Example 3. Hypermarket in a residential area and a new residential complex

A hypermarket chain ran an A/B test of dynamic pricing.
For the test, we chose two hypermarkets in a city of half a million inhabitants. The stores were similar in terms of dynamics and sales structure over the past year. Dynamic pricing was launched in the hypermarket in a residential area; the downtown hypermarket served as the control.

The results of the pilot were impressive. The increase in sales and other indicators of the pilot store compared to the control store significantly exceeded the dynamic pricing results in other cities. These figures looked suspiciously good, since no new "revolutionary" pricing strategies were used in the test.

With the help of the hypermarket team, we discovered an essential fact: a new residential complex located near the test store was rehabbed in the middle of the pilot.

Then we analyzed the hypermarkets' sales for the past three years. We discovered that the test hypermarket had been operating in the residential area almost since the district appeared. Its sales statistics for these three years revealed a clear trend: a small annual increase in the number of receipts, revenue, and profit. As the area was built up and the number of inhabitants of the district grew, the test hypermarket's sales performance improved.
This trend was not noticeable in one year of statistics, only in several years of data.
The downtown control hypermarket did not have such a "natural growth in sales" trend.

1 - Evaluating the pilot results required more complex calculations that took the revealed growth trend into account.

2 - With the growth trend taken into account, the true increase in sales due to dynamic pricing was approximately equal to the results in other cities. That is, the real growth was less significant than the initial figures showed.
Plots of the test and control hypermarkets' revenue over the past three years. Revenue of the test store (red line) grows faster than inflation. Initially, the control hypermarket's revenue was larger than the test hypermarket's, but in 2021 the test hypermarket overtook the control store. There is a clear trend of increasing revenue.
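
One possible way to account for such a growth trend is to extrapolate the pre-pilot trend and compare the pilot-period figure with the value the trend alone would predict. The sketch below is only an illustration under stated assumptions: the revenue numbers are hypothetical, a straight-line trend is assumed, and the function names `linear_fit` and `trend_adjusted_uplift` are invented for this example. It is not the calculation used in the pilot described above.

```python
def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def trend_adjusted_uplift(history, actual):
    """Relative gap between actual revenue and the trend's next-period forecast."""
    slope, intercept = linear_fit(history)
    expected = intercept + slope * len(history)  # forecast for the next period
    return actual / expected - 1.0


# Hypothetical revenue per period before the pilot (steady natural growth).
history = [100.0, 102.0, 104.0, 106.0]
pilot_revenue = 113.4  # hypothetical revenue in the pilot period

naive = pilot_revenue / history[-1] - 1.0          # ignores the trend
adjusted = trend_adjusted_uplift(history, pilot_revenue)
print(f"naive: {naive:.1%}, trend-adjusted: {adjusted:.1%}")
```

With these made-up numbers, the naive comparison overstates the effect because part of the growth would have happened anyway.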
  • Nikita Tsukanov
    CEO Imprice — dynamic pricing software,
    The examples above are stories about external and internal factors that added complexity to the evaluation of the pilot results.

    The appearance of a competitor next to the pilot store is difficult to predict or prevent.
    But the installation of self-checkouts or a long-term sales growth trend is another matter. Such information is available to the retailer's team, and experts can use it to prepare and manage the pilot.
    Thus, it is important to know what information can be essential for the dynamic pricing pilot and which points one needs to check and control.

    Our experts have prepared a core checklist of parameters and incidents that one should monitor and control to prevent distortion of the A/B test results.
The A/B test helps evaluate the effect of an innovation.
In this article, the innovation is dynamic pricing.

To get the correct evaluation, we should equalize all the parameters of the test and control stores, except for the innovation being tested.
That is, the control and pilot stores should differ only in the type of pricing during the pilot.
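
When the stores really do differ only in pricing, the effect can be estimated by comparing how each store changed relative to its own pre-pilot level. The following is a minimal sketch of that comparison, assuming hypothetical revenue figures; the function name `relative_uplift` is illustrative and does not come from the article.

```python
def relative_uplift(test_before, test_during, control_before, control_during):
    """Growth of the test store relative to the growth of the control store."""
    test_growth = test_during / test_before
    control_growth = control_during / control_before
    return test_growth / control_growth - 1.0


# Hypothetical revenue figures, for illustration only.
uplift = relative_uplift(
    test_before=1000.0, test_during=1100.0,
    control_before=2000.0, control_during=2040.0,
)
print(f"Uplift attributed to pricing: {uplift:.1%}")
```

The control store's own growth (2% here) is treated as what would have happened without the innovation, which is only valid if the parameters below are kept equal in both stores.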

Parameter 1. Identical promos

The retailer should choose test and control stores that will have identical promotions during the pilot.

Ideally, these stores have the same promotions in all mutually influencing categories, not only in the pilot categories.
Simplified example:

A retailer included the category "pasta" in the pilot. That is, they launched dynamic pricing for noodles and pasta.
The category "sauces" was not included in the pilot. Pricing specialists continued to manage the prices of sauces manually.

During the pilot, they ran a promotion for sauces in the test hypermarket. Pesto, bolognese and other "dressings" for pasta and noodles were offered at half the regular price. Many shoppers decided to take the opportunity and try the sauces. They also bought pasta, because the sauces were intended for cooking pasta dishes.

The control store didn't have any discounts for sauces at that time.
At the end of the pilot, the test hypermarket's sales, revenue, and gross profit in the "pasta" category significantly exceeded the performance of the control hypermarket.
But the real reason for the increase was the sauce promotion, not dynamic pricing.

The reverse situation is also possible. If the "sauce promos" run only in the control store, the results of dynamic pricing in the test store will be underestimated or will even appear negative.
To-dos for the pilot project manager:
  • Choose test and control stores with identical promos.
  • Or change the promotions in the test and control stores to make them identical for the duration of the pilot.
  • Or exclude from the pilot all categories with different promos in the test and control stores. This option is risky, because the excluded "promo categories" can affect the sales of the pilot categories.

Parameter 2. Identical discounts complementing promos

Stores often run extra promotions and discounts that are not included in the promo catalog: "Buy two for the price of one", "Gift with purchase" (GWP), and so on.
In addition to identical catalog promos, the pilot and control stores should have identical "additional promotions/discounts".

To-dos for the pilot project manager:
  • Choose test and control stores with identical "extra promotions and discounts".
  • Or change the "extra promotions and discounts" in the test and control stores to make them identical for the duration of the pilot.
  • Or exclude from the pilot all categories with different "additional promotions and discounts" in the test and control stores. This option is risky, because the excluded "promo categories" can affect the sales of the pilot categories.

Parameter 3. The same inventory management (out-of-stock control)

During the pilot, the pilot and control stores should have the same availability of goods in the pilot categories.

A store's stock of the test categories significantly affects the pilot's result. If a product is out of stock, its sales will be zero at any price.
Simplified example:

The retailer launched the same promos in the pilot and control stores. During the pilot, one store received far fewer promoted goods than the other.
The first store sold out of the "promo items" in two days. The second store sold these goods for three weeks, because it had enough stock.

Thus, revenue and sales of promotional items in the second store were several times higher than in the first store.

1 - It is important to maintain the same stock levels in the pilot and control stores during the pilot.

2 - It is important to manage inventory and calculate the optimal stock balance in a centralized and identical way, using the same algorithms, in the pilot and control stores during the pilot. Ideally, the inventory management process is automated.
The worst situation is when every store manages its inventory on its own.

The skills of a particular employee will affect the store's supply efficiency. "In one store they make orders well, in another store they do it poorly."

Such an uncertainty factor can ruin the A/B test. Sales will be affected not only by the type of pricing ("manual" and dynamic), but also by the ordering skills of the store employees.

In such a case, "improvements" in the pilot store may be caused by the fact that a talented employee managed orders and fixed the stock-out problem. Conversely, if a low-skilled employee was responsible for orders, sales may decrease due to a lack of goods.
To-dos for the pilot project manager:
  • Choose test and control stores with a centralized stock management system that uses the same algorithms for both stores.
  • Or, for the duration of the pilot, make the ordering system and out-of-stock control the same in the test and control stores (centralization, automation, the same ordering algorithm).

Parameter 4. The same on-shelf availability

During the pilot, the availability of goods from the pilot categories on the shelves of the test and control stores should be the same.

"The product is in the warehouse of the store" is not equal to "the product is on the shelf of the store."

A supermarket chain ran a dynamic pricing pilot and monitored the stores' performance daily.
The pricing analyst noticed repeated cases in which an item with stable, good sales suddenly stopped selling despite being in stock.

We started to download lists of such SKUs and clarify the issue "in the field" with the store's employees. In almost 100% of cases, we discovered that the goods had been left on pallets after acceptance. The item was in the warehouse, but it was not on the shelf. Therefore, the SKU's sales fell to zero on "out-of-shelf days".
The chain managers extended the pilot and ran daily monitoring of "abnormally frozen sales" in the test and control stores. The result:

1 – They soon reduced on-shelf availability issues for the pilot categories' items to almost zero in both supermarkets. This had a positive impact on the sales of both the test and control stores.

2 – This helped to correctly evaluate the increase in sales and profits due to dynamic pricing at the end of the pilot. For that, experts compared the dynamics of the test and control supermarkets' figures.
If the item is in stock, but the store employees do not put it on the shelf on time:

1 - This instantly leads to a drop in sales.

2 - This complicates the work of artificial intelligence in pricing.
Machine learning algorithms see non-zero inventory and zero sales, so they conclude that the price is inefficient. This data issue slows down the algorithms' learning process.

3 - This leads to incorrect result calculations.
The absence of goods on the shelves won't be recorded. At the end of the pilot, the data will show low sales, though the goods were in stock. The experts will draw an incorrect conclusion based on this incorrect data:

The lack of goods on the shelf (out-of-shelf, on-shelf availability issues) in the control store leads to overestimating the resulting increase in sales due to dynamic pricing.

The lack of product on the shelf in the test store leads to underestimating the resulting increase in sales due to dynamic pricing. If the item wasn't on a shelf, it was impossible to buy it. But as the data does not contain any information about on-shelf availability, experts will conclude that the reason for the decrease in sales is in prices.
To-dos for the pilot project manager:
  • During the pilot, control the display of goods on the shelf in the test and control stores.

    It is impossible to quickly "repair" the delivery process "store's warehouse - store's shelf" in the entire retail chain.
    On the other hand, it is quite possible to create an "on-shelf availability issues monitoring system" for test categories in two stores.

To reduce on-shelf availability issues, the Imprice platform provides an automated report "Abnormally frozen sales". This tool is simple but quite effective.

A store receives the following information from the report:

Beer "Heineken 0.0",
-- average daily sales: 3.85 cans,
-- balance in the store's warehouse: 586 pieces, in stock for at least three days,
-- sold in the past two days: 0 cans,
-- sold in the current week: 3 cans.
On-shelf availability issues are suspected.

Cat food,
-- average daily sales: 4 packs,
-- balance in the store's warehouse: 185 pieces, in stock for at least three days,
-- sold in the past two days: 0 packs,
-- sold in the current week: 2 packs.
On-shelf availability issues are suspected.

Despite its simplicity, this report indicates the problem quite accurately and allows the retailer to eliminate issues efficiently.
The responsible store employee receives such a list of "suspected items" daily and then manually checks what has happened to these items on the shelf.
As practice shows, daily monitoring of "suspected items" on the shelf significantly reduces the number of lines in the report in a short time. That is, the number of on-shelf availability issues quickly decreases to nearly zero.
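
The article does not describe how the report is computed, so the following is only a guess at the kind of rule that could flag such items: the SKU normally sells, stock has been available for several days, and yet nothing was sold recently. The field names, the thresholds (`min_avg_daily_sales`, `min_days_in_stock`), and the third SKU are hypothetical and are not taken from the Imprice platform.

```python
from dataclasses import dataclass


@dataclass
class SkuStats:
    name: str
    avg_daily_sales: float   # long-run average units sold per day
    stock_units: int         # current balance in the store's warehouse
    days_in_stock: int       # how long that balance has been available
    sold_last_2_days: int    # units sold over the two most recent days


def is_suspected(sku, min_avg_daily_sales=1.0, min_days_in_stock=3):
    """Flag SKUs that normally sell, are in stock, but have stopped selling."""
    return (
        sku.stock_units > 0
        and sku.days_in_stock >= min_days_in_stock
        and sku.avg_daily_sales >= min_avg_daily_sales
        and sku.sold_last_2_days == 0
    )


report_input = [
    SkuStats("Heineken 0.0", 3.85, 586, 3, 0),
    SkuStats("Cat food", 4.0, 185, 3, 0),
    # Hypothetical slow seller: zero sales in two days is normal for it.
    SkuStats("Slow seller", 0.2, 40, 10, 0),
]
suspected = [sku.name for sku in report_input if is_suspected(sku)]
print(suspected)
```

The minimum average sales threshold is there so that naturally slow-moving items do not flood the list that the store employee has to check by hand.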

Parameter 5. Identical efficiency of merchandising (special case of on-shelf availability)

During the pilot, the retailer should ensure the same quality of display of goods and the same availability of price tags in the test and control stores.
An "unexpected" drop in sales in a control or test store during the pilot can happen in the following cases:

Absence of price tags. For many shoppers, this is equal to the absence of goods on the shelf.
The balance on the shelf is less than the "minimum required number of facings" ("minimum shelf stock to make the SKU visible"). For many shoppers, this is equal to the absence of goods on the shelf.

Shaving products category, Gillette shaving cartridges. The manufacturer's merchandisers are responsible for the display.

If few Gillette cartridges are displayed in the checkout area, their sales drop sharply:
Some shoppers do not notice them and therefore forget to buy them.
Other shoppers simply cannot find the cartridges in their usual place next to the checkouts. They are not ready to leave the line, go back into the store, and look for the shaving accessories section.
To-dos for the pilot project manager:
  • During the dynamic pricing pilot, control and "equalize" the quality of displaying test categories' goods in the test and control stores:
    the presence of price tags,
    the number of facings,
    the total balance of SKUs on the shelf.
    Otherwise, sales will be affected not only by the type of pricing ("manual" and dynamic), but also by the skills of store employees and manufacturers' merchandisers.

Parameter 6. The same cost prices

During the pilot, the cost prices of goods from the test categories in the test and control stores should be the same.

Another option: it should be possible to calculate the resulting increase in gross profit, using the same cost price for the test and control stores.
Simplified example:

At the beginning of the pilot, in the test store there was a large stock of product A with a cost price of 100 euros. This stock was enough for the entire pilot time.
The control store had a small stock of item A. From the beginning of the pilot, the store began to receive the product A at a cost price of 150 euros.

One month before the pilot, the gross profit of the A item in both the test and control stores was 5000 euros.
During the month of the pilot, the test store sold 100 units of product A at a "dynamic" price of 190 euros.
Gross profit was calculated as follows:
100 * (190 - 100) = 9000 euros

The control store sold 100 units of product A at a "manual" price of 200 euros.
Gross profit was calculated as follows:
100 * (200 - 150) = 5000 euros

With this calculation method, it seems that gross profit increased by 4000 euros in the test store and by 0 euros in the control store, and that dynamic pricing therefore increased gross profit by 80%. In reality, there was no growth, only a difference in cost prices.

The reverse case is also possible, when the increase in gross profit due to dynamic pricing is underestimated because of the higher cost price in the test store.
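
The arithmetic of the simplified example can be rechecked with a single common cost price for both stores. The sketch below reuses the figures from the example; the choice of 150 euros as the common cost price and the function name `gross_profit` are assumptions made for illustration.

```python
def gross_profit(units, price, cost):
    """Gross profit for one SKU: units sold times the per-unit margin."""
    return units * (price - cost)


# Figures from the simplified example: each store uses its own cost price.
test_reported = gross_profit(units=100, price=190, cost=100)
control_reported = gross_profit(units=100, price=200, cost=150)

# Same sales, but both stores evaluated with one common cost price.
# Assumption: the newer cost price of 150 euros is used for both.
common_cost = 150
test_common = gross_profit(units=100, price=190, cost=common_cost)
control_common = gross_profit(units=100, price=200, cost=common_cost)

print("own cost prices:   ", test_reported, control_reported)
print("common cost price: ", test_common, control_common)
```

Once both stores are evaluated with the same cost price, the apparent advantage of the test store disappears, which is why the cost basis has to be fixed before the results are compared.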
To-dos for the pilot project manager:
  • For the pilot, choose categories with the same cost prices in the test and control stores.
  • Or choose categories for which one can use the same cost prices in the calculations of resulting gross profit in the test and control stores.
If neither option is possible for a category, it cannot be included in the pilot, because its gross profit will be affected not only by the type of pricing ("manual" and dynamic), but also by differences in cost prices.

Parameter 7. Identical volume of "loyalty points" write-off

During the pilot, the test and control stores should have the same volume of payments made with loyalty points.

"If two stores have similar revenue, the volume of payments made with loyalty points is also similar," our intuition tells us. As practice shows, this is not the case.
Why can the amount of points written off vary greatly between stores with the same revenue?
For example, because of cashiers and their scripts.

In one store, they tell shoppers their amount of accumulated points and offer to write them off.

In another store, they do not provide such information on checkouts and do not offer to use the customer's points.
The shopper has to ask the cashier: "How many points do I have on my loyalty card? I would like to write them off!" Some consumers are shy, some simply forget about the points.
The result is that fewer shoppers use their loyalty points.
Differences in points write-offs can lead to underestimating or overestimating the resulting gross profit, just as differences in cost prices can.

To-dos for the pilot project manager:
  • Choose test and control stores in which the amount paid with loyalty points is the same (or is the same share of the stores' revenue).
  • Or remove written-off loyalty points from the resulting calculations of gross profit (remove written-off points from receipts).

Parameter 8. Identical functions of the store and its warehouse, the same write-offs

The test and control stores should perform the same list of functions and have the same amount of write-offs.

1 - A hypermarket also served as a warehouse for the chain's online store. Therefore, its assortment in non-food categories was much wider than in other hypermarkets.
It is incorrect to compare sales of frying pans in two stores, if
one store offers 50 types of frying pans,
and another offers only five types of pans.
Therefore, the hypermarket that functioned as the online store's warehouse was not included in the pilot.

2 - A pharmacy was a "transshipment warehouse" for other pharmacies. For this reason, its assortment was abnormally wide compared to other stores of this chain. Its volume of write-offs was abnormally large for the same reason; this distorted the pharmacy's financial results. This pharmacy was not included in the pilot.

3 - A pharmacy chain had a rule to record every write-off as a sale for 1 cent.
This distorted the results of dynamic pricing and added complexity to the training of algorithms.
The decision was to remove receipts with write-offs from the sales data and from the resulting calculations.
To-dos for the pilot project manager:
  • Choose test and control stores with similar write-off amounts (or with write-offs making up the same share of the stores' revenue).
  • Or remove write-offs from the data for analysis and from the calculations of the pilot's resulting indicators.
  • Do not include in the pilot stores with functions or an assortment that are unusual for the chain.
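
Removing write-offs recorded as 1-cent sales, as in the third case above, could look roughly like the sketch below. It assumes a hypothetical data layout in which each receipt line carries a unit price in cents; the field names and the sample lines are invented for illustration.

```python
WRITE_OFF_PRICE_CENTS = 1  # the chain recorded each write-off as a sale for 1 cent


def drop_write_offs(receipt_lines):
    """Keep only lines that are real sales, not write-offs booked at 1 cent."""
    return [
        line for line in receipt_lines
        if line["unit_price_cents"] > WRITE_OFF_PRICE_CENTS
    ]


# Hypothetical receipt lines; prices are stored in cents to avoid float issues.
receipt_lines = [
    {"sku": "aspirin", "units": 2, "unit_price_cents": 450},
    {"sku": "expired_syrup", "units": 10, "unit_price_cents": 1},
    {"sku": "bandage", "units": 1, "unit_price_cents": 199},
]
cleaned = drop_write_offs(receipt_lines)
print([line["sku"] for line in cleaned])
```

The same cleaned data would then be used both for training the pricing algorithms and for the final calculation, so that the two stay consistent.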

Parameter 9. Identical share of wholesale purchases, identical share of business purchases

The share of wholesale and business purchases in the revenue of the test and control stores should be the same.

The test hypermarket was located next to a business center with offices, warehouses, and manufacturing spaces. For this reason, business customers often made office and small-scale wholesale purchases at the price for "ordinary" shoppers, and they paid at the regular checkouts.
In the control hypermarket, there were significantly fewer small-scale wholesale and office purchases.

Wholesale purchases created "noise" in the data and complicated the training of ML pricing algorithms. On a certain day, sales of an item could grow not due to price optimization, but because a "wholesale/office customer" came.

The solution was to remove "abnormal receipts" from the sales data of both stores. Receipts were considered "abnormal" if their amount or number of units exceeded certain values.
To-dos for the pilot project manager:
  • Choose test and control stores with the same share of wholesale sales.
  • Or remove abnormally large receipts from the data being analyzed and from the final results.
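
A filter for "abnormal receipts" can be as simple as two thresholds, one on the receipt amount and one on the number of units. The article does not state the values that were used, so the thresholds and the sample receipts below are purely hypothetical.

```python
MAX_AMOUNT = 500.0  # hypothetical threshold on the receipt total
MAX_UNITS = 50      # hypothetical threshold on units per receipt


def is_abnormal(receipt, max_amount=MAX_AMOUNT, max_units=MAX_UNITS):
    """A receipt is abnormal if its amount or its unit count exceeds a limit."""
    return receipt["amount"] > max_amount or receipt["units"] > max_units


# Hypothetical receipts from one store.
receipts = [
    {"id": 1, "amount": 42.5, "units": 9},
    {"id": 2, "amount": 1310.0, "units": 24},   # large amount
    {"id": 3, "amount": 95.0, "units": 120},    # many units
]
kept = [r["id"] for r in receipts if not is_abnormal(r)]
print(kept)
```

Whatever thresholds are chosen, the same ones should be applied to both the test and control stores.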
  • Nikita Tsukanov
    CEO Imprice — dynamic pricing software,
    The algorithm for A/B testing of dynamic pricing in offline retail is as follows:

    Choose identical or highly similar stores.

    Start dynamic pricing in one (test) store. Leave price management unchanged in the other (control) store.

    During the pilot, monitor whether the pricing method remains the only difference in how the stores operate.

    At the end of the pilot, measure the dynamic pricing result. That is, calculate the difference in the target indicators of the control and pilot stores.

    If these steps are executed correctly, one can be confident that the increase in profits was caused by dynamic pricing, not by other factors.

    Let's review the tools that large retail chains can use to search for "identical stores" for the pilot, and what small chains can do if they do not have any "identical stores".

Data Science Approach. Selecting Stores for A/B Testing with Machine Learning Algorithms

Large chains that have thousands of stores (like Conad, for example) can select control and test stores using machine learning algorithms. The general search principle is as follows:

1 - Algorithms study the overall sales dynamics of the chain's stores.

2 - By clustering time series, they identify and combine into groups (clusters) stores with highly similar sales dynamics.

3 - Within each cluster, they compare the stores' sales shares of specific product groups and categories.
That is, the algorithms seek stores with similar overall dynamics that have no fundamental differences in sales structure. They study both the shares of the categories themselves and the dynamics of these shares.

4 - Then the algorithms look for stores with similar key performance indicators: a similar number of cash receipts (traffic), average order value, revenue, and margin.

5 - Finally, they compare the descriptive characteristics of similar stores:
store format,
location ("downtown / residential area"),
street retail or shopping center,
the number of competitors nearby,
specialized services of the store ("a click-and-collect option", "the store is a warehouse of an online store").
For such an analysis, the algorithms need a database of stores with their properties filled in.

The result is pairs of stores with the most similar indicators and characteristics.
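
The first step of this search, finding stores with similar sales dynamics, can be illustrated with a much simpler stand-in for time-series clustering: ranking all store pairs by the Pearson correlation of their sales series. This is a deliberate simplification and not the clustering pipeline described above. The store names and weekly figures are hypothetical, and category shares, KPIs, and descriptive characteristics are not modeled.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equally long numeric series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rank_store_pairs(sales_by_store):
    """Return (correlation, store_1, store_2) tuples, most similar pair first."""
    pairs = [
        (pearson(sales_by_store[s1], sales_by_store[s2]), s1, s2)
        for s1, s2 in combinations(sorted(sales_by_store), 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical weekly sales for three stores.
sales = {
    "store_a": [100, 110, 105, 120, 130, 125],
    "store_b": [200, 221, 209, 242, 258, 251],
    "store_c": [150, 140, 155, 135, 150, 138],
}
best = rank_store_pairs(sales)[0]
print(best[1], best[2], round(best[0], 3))
```

Correlation ignores the absolute level of sales, so `store_a` and `store_b` are matched even though one sells twice as much; the later steps of the search (KPIs and store characteristics) are what would catch such differences.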

Simplified Approach. Choosing Stores and Categories for A/B Testing in a Small Retail Chain

For small chains, up to several tens or hundreds of stores, it is usually faster and more efficient to employ the "semi-manual" approach.
The reasons:

1 – Often the quality and volume of the retailer's data are lower than the level required for the machine learning algorithms' work.
The data contain "noise" and "gaps", and there are no directories of the stores' descriptive characteristics.

2 – With a small number of stores, one can quickly compare the essential stores' characteristics "manually", and get a reasonable accuracy of the conclusions.

3 – Due to the small number of stores, some chains don't have two similar stores. All stores of the chain differ significantly in certain indicators and characteristics.
  • Step 1. Compare stores' sales dynamics.
    Display sales charts for all stores; the minimum period is 1 year.

    Compare the dynamics of the stores' performance indicators: revenue, number of purchases, average purchase amount, units per transaction, and so on. Clustering algorithms can help detect stores with similar performance.
    When there are few stores, one can even find the stores with similar dynamics visually.

    To discover hidden trends, one should analyze the sales dynamics for several years. As mentioned above, hidden trends are cases when the store is slightly but steadily growing or "falling".
    Example 3 at the beginning of the article (about how the new residential complex affected sales during the pilot) shows why considering trends is crucial for the pilot evaluation:
An example of a hidden trend. Because the number of district inhabitants grew, within a few years the test hypermarket (red line) surpassed the control hypermarket's performance (blue line).
  • Unfortunately, it is a common situation when a small chain has no stores with perfectly similar dynamics.

    In such a case, the result of Step 1 is a set of stores whose dynamics are only somewhat similar.
An example of the absence of perfectly similar stores. In the highlighted zones, the dynamics of all stores differ significantly.

  • Step 2. Compare all available descriptive characteristics of the stores chosen in Step 1.
    It is more efficient if the retailer performs this "manual" step itself. For an external contractor, this task could be quite difficult.

    Why is this step the second, not the first?
    The database of store characteristics is often poor.
    "We only have 30 stores. There is nothing to describe. Our managers keep all the information in their heads."

    Descriptive characteristics of the store do not always affect its sales.
    Obviously, it is better not to compare a hypermarket with a supermarket. But if two stores are located in shopping centers, this does not mean they are similar, because shopping centers can differ significantly. Therefore, the dynamics of stores in these centers can also be significantly different.
  • Step 3. Choose the assortment of goods for the pilot.
    Sometimes it makes perfect sense to include only part of the assortment in the pilot.
    This can be recommended in the following situations:

    1) The retail chain does not have two sufficiently similar stores.
    In this case, one can choose specific product categories to run the pilot. The dynamics of these categories should be the same in two stores.

    2) It is wise to exclude from the pilot assortment products that are packed manually by the store.
    For example, the store cuts some cheeses and sausages, packs them in plastic, and sticks a price tag on each pack. During the pilot of dynamic pricing, the store has to change prices often. To change the price of packed food, they need to re-weigh EVERY pack and stick a new price tag on it. This labor cost is much higher than replacing one price tag for all goods with the same SKU.

    3) It is advisable to exclude from the pilot ready-made meals produced by the supermarket.
    Otherwise, the store can have the difficulties with price tags described above. The second reason is that such SKUs often have incorrect cost prices.

    4) It is recommended to exclude fruits and vegetables from the pilot.
    In these categories, sales are highly dependent on the quality of the batch. If the pricing algorithm learns on a batch of high-quality cucumbers and then a low-quality batch arrives, the price will be suboptimal.

    5) It is better to exclude categories that have a "low season" or a decline in sales during the pilot.
    The category's sales will be low anyway, and it will be difficult to evaluate the results.
  • Step 4. Check Parameters 1-9 according to the checklist from this article.
    The result is a pair of stores and a list of categories in these stores with the most similar indicators and characteristics.