Launching Dynamic Pricing:
How to Accurately Assess Pilot Test Outcomes and Avoid Missteps


Example 1. Self-service Checkouts and Checkout Area Categories

In 2021, a retailer conducted an A/B test in two hypermarkets. A dynamic pricing platform managed the prices in the test hypermarket, while the control hypermarket's prices were managed manually.
During the final two months of the pilot, a noticeable negative trend emerged in the sales of two categories, with a decline of around 20%. The pilot test encompassed all categories, excluding "ultra-fresh products," but only the "chewing gum" and "chocolate" categories experienced a decrease in sales.
In the control store, there was no change in sales for these two categories during this period.

Our experts, in collaboration with the retailer's team, analyzed the prices and found no pricing issues. Consequently, we began to investigate potential changes on the retailer's side.

What caused the decrease? Two months before the end of the test, the retailer replaced a quarter of the cash registers in the test hypermarket with self-service checkouts that had no checkout areas. Chewing gum and chocolate were displayed precisely in the checkout areas, so their sales fell in proportion to the reduction in the number of checkout areas.

In the control hypermarket, self-service checkouts had been in place since its opening, so the sales of items in the checkout areas remained unchanged during the pilot.

1 - During the pilot, the hypermarket chain found that self-service checkouts bring not only advantages but also a negative effect on sales in the checkout area.
The company fixed the issue by adding displays to the checkout areas.

2 - Assessing the pilot results necessitated more sophisticated calculations. Experts had to estimate the losses from self-service checkouts in the test hypermarket and exclude these losses from the final figures.

Example 2. Competitor Emergence and Strategy Shift

A pharmacy chain initiated a pilot test.
For the A/B test, two comparable pharmacies were selected in a city with a population of half a million, both demonstrating consistently robust sales and profitability. The pilot's objective was to increase the test pharmacy's gross profit relative to the control pharmacy.
The first two months of the pilot proved successful, with the pilot pharmacy experiencing growth in sales and profits.

The pilot pharmacy was situated at a bustling intersection, ensuring a high volume of customer visits. However, two months into the pilot, a competitor's new pharmacy opened nearby, attracting customer traffic and revenue away from the pilot pharmacy.

The pharmacy chain's team had to shift their focus from the original goal of gross profit growth to preserving and protecting revenue and traffic. As a result, they adopted a different competitive pricing strategy, considering the new competitor's prices when calculating prices.

1 - The chain managed to maintain traffic and revenue through dynamic competitive pricing, deeming the pilot successful in this regard.

2 - The potential increase in gross profit due to dynamic pricing could not be accurately assessed during the pilot test. Such an assessment would require a control pharmacy with a newly opened competitor nearby.
However, no such pharmacy existed within the city.

3 - In order to estimate the potential increase in gross profit with dynamic pricing, the retailer had to conduct a new pilot using a different test pharmacy.

Example 3. Hypermarket in a Residential Area and New Residential Development

A hypermarket chain conducted an A/B test for dynamic pricing.
Two hypermarkets were selected within a city of half a million residents, both exhibiting similar sales dynamics and structure over the past year. Dynamic pricing was implemented in the hypermarket located in a residential area, while the downtown hypermarket served as the control store.

The results of the pilot were impressive. The increase in sales and other performance indicators for the test store compared to the control store substantially surpassed dynamic pricing outcomes in other cities. These figures appeared suspiciously favorable, as no novel "revolutionary" pricing strategies were employed in the test.

Collaborating with the hypermarket team, we uncovered a critical detail: a new residential complex had been completed near the test store midway through the pilot.

Upon analyzing the hypermarkets' sales data from the past three years, we found that the test hypermarket had been operating in the residential area nearly since the district's inception. Assessing the test hypermarket's sales statistics over this period revealed a consistent trend of modest annual increases in sales volume, revenue, and profit. As the area developed and the population grew, the test hypermarket's sales performance improved. This trend was not evident in one-year data but became apparent over several years.
The control hypermarket in downtown did not exhibit this "natural sales growth" trend.

1 - Evaluating the pilot results required more complex calculations that took the revealed growth trend into account.

2 - Once the growth trend was taken into account, the actual increase in sales attributable to dynamic pricing was roughly equivalent to the results in other cities. In other words, the true growth was less significant than the raw figures suggested.
Plots of the test and control hypermarkets' revenue over the past three years. The test store's revenue (red line) grows faster than inflation. Initially, the control hypermarket's revenue was higher than the test hypermarket's, but in 2021 the test hypermarket overtook the control store. There is a clear upward trend in revenue.
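The trend adjustment described in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the calculation used in the pilot: the function names, the revenue figures, and the choice of an average year-over-year growth rate as the "trend" are all hypothetical.

```python
# Hypothetical illustration: all figures are invented, not taken from the pilot.

def average_yearly_growth(revenues):
    """Mean year-over-year growth rate of a revenue series (oldest first)."""
    rates = [later / earlier - 1 for earlier, later in zip(revenues, revenues[1:])]
    return sum(rates) / len(rates)


def trend_adjusted_uplift(test_history, control_history, test_pilot, control_pilot):
    """Uplift of the test store over the control store, net of each store's own trend."""
    # Revenue each store would be expected to reach from its own trend alone.
    test_expected = test_history[-1] * (1 + average_yearly_growth(test_history))
    control_expected = control_history[-1] * (1 + average_yearly_growth(control_history))
    # Growth beyond the trend, test minus control.
    return (test_pilot / test_expected - 1) - (control_pilot / control_expected - 1)


test_history = [100.0, 110.0, 121.0]     # grows about 10% a year as the district develops
control_history = [130.0, 130.0, 130.0]  # flat downtown store
test_pilot, control_pilot = 139.755, 130.0

# Naive comparison: pilot year against the previous year, test minus control.
naive_uplift = (test_pilot / test_history[-1] - 1) - (control_pilot / control_history[-1] - 1)
adjusted_uplift = trend_adjusted_uplift(test_history, control_history, test_pilot, control_pilot)
print(f"naive: {naive_uplift:.1%}, trend-adjusted: {adjusted_uplift:.1%}")
```

With these invented numbers, most of the naive uplift comes from the store's pre-existing growth, which is the effect the example describes.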
  • Nikita Tsukanov
    CEO Imprice — dynamic pricing software,
    The examples mentioned previously highlight external and internal factors that added complexity to the evaluation of pilot results.

    While the emergence of a competitor near the pilot store is difficult to predict or avoid, factors such as the installation of self-checkouts or a long-term sales growth trend are information available to the retailer. Pricing experts can therefore use such facts to plan and manage the pilot effectively. It is crucial to understand what information is essential for the dynamic pricing pilot and which aspects need to be monitored and controlled.

    Our experts have prepared a core checklist of parameters and incidents that should be monitored and controlled to prevent distortion of the A/B test results.
The A/B test serves to assess the impact of an innovation.
In this article, the innovation is dynamic pricing.

To obtain an accurate evaluation, it is necessary to equalize all parameters of the test and control stores, except for the innovation being tested.
In other words, the control and pilot stores should only differ in their pricing approach during the pilot.
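
When that condition holds, the effect of the innovation can be read as the difference between how the target metric changed in the test store and how it changed in the control store. The sketch below shows one common way to express this (a difference-in-differences style comparison of relative changes). The function names and revenue figures are hypothetical, and the article does not state which exact formula was used.

```python
# Hypothetical illustration: function names and figures are assumptions.

def relative_change(before, during):
    """Relative change of a metric between the pre-pilot and pilot periods."""
    return during / before - 1


def pilot_effect(test_before, test_during, control_before, control_during):
    """Change in the test store minus change in the control store."""
    return (relative_change(test_before, test_during)
            - relative_change(control_before, control_during))


effect = pilot_effect(
    test_before=200_000, test_during=220_000,        # test store: +10%
    control_before=250_000, control_during=255_000,  # control store: +2%
)
print(f"estimated effect of dynamic pricing: {effect:.1%}")
```

Subtracting the control store's change removes whatever affected both stores equally (seasonality, inflation), which is why the two stores must differ only in pricing.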

Parameter 1. Identical promos

For a successful A/B test, retailers should select test and control stores that have identical promotions during the pilot.

Ideally, all mutually influencing categories should have the same promotions, not just the pilot categories.
Simplified example:

A retailer included the "pasta" category in the pilot, implementing dynamic pricing for noodles and pasta. The "sauces" category was not part of the pilot, with pricing specialists continuing to manage sauce prices manually.

During the pilot, a promotion for sauces was introduced in the test hypermarket. Pesto, bolognese, and other accompaniments for pasta and noodles were offered at half the regular price. Many customers took advantage of the promotion and purchased pasta to cook with the discounted sauces.

The control store did not offer any discounts on sauces during this time.
At the end of the pilot, the test hypermarket's sales, revenue, and gross profit for the "pasta" category significantly outperformed the control hypermarket.
However, the real reason for the increase was the sauce promotion, not dynamic pricing.

The opposite situation is also possible: if the "sauce promos" were only offered in the control store, the test store's dynamic pricing results could be underestimated or even negative.
To-dos for the pilot project manager:
  • Choose test and control stores with identical promos.
  • Or change the promotions in the test and control stores to make them identical for the duration of the pilot.
  • Or exclude from the pilot all categories with different promos in the test and control stores. This option is risky, because the excluded "promo categories" can affect sales of the pilot categories.
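
A quick way to spot categories with differing promotions is to compare the two stores' promo calendars. The sketch below assumes a hypothetical representation of a calendar as a set of (category, week number) pairs. The data and the function name are invented for illustration.

```python
# Hypothetical promo calendars: sets of (category, week number) with an active promotion.

def promo_mismatches(test_promos, control_promos):
    """Categories whose promo calendar differs between the two stores."""
    # Symmetric difference: promotions present in exactly one of the stores.
    return sorted({category for category, _week in test_promos ^ control_promos})


test_promos = {("sauces", 14), ("coffee", 14), ("coffee", 15)}
control_promos = {("coffee", 14), ("coffee", 15)}

mismatched = promo_mismatches(test_promos, control_promos)
print(mismatched)
```

Categories on this list are candidates for aligning promotions or for exclusion, keeping in mind that they may still influence related pilot categories, as in the pasta and sauces example.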

Parameter 2. Identical Discounts Complementing Promos

In retail stores, there are often additional promotions and discounts beyond the standard promotional catalog. These may include "buy two for the price of one," "gift with purchase" (GWP), and similar offers. To ensure a valid A/B test, both the pilot and control stores should have identical supplementary promotions and discounts.

To-dos for the pilot project manager:
  • Choose a test and control store with identical "extra promotions and discounts".
  • Or change "extra promotions and discounts" in the test and control stores to make them identical for the pilot time.
  • Or exclude from the pilot all categories with different "additional promotions and discounts" in the test and control stores. This option is risky, because the excluded "promo categories" can affect sales of the pilot categories.

Parameter 3. The Same Inventory Management (Out-of-stock Control)

During the pilot, both the pilot and control stores should maintain identical availability of goods within the test categories.
The inventory levels of the test categories can significantly impact the pilot results, as out-of-stock items will result in zero sales regardless of price.
Simplified example:

The retailer had identical promotions in both the pilot and control stores. However, during the pilot, one store received far fewer promotional items than the other.
The first store sold out of promotional items in two days. The second store continued to sell these items for three weeks due to sufficient stock levels.

Result: the revenue and sales of promotional items in the second store were significantly higher than in the first store.

1 - Ensuring the same stock levels in both the pilot and control stores during the pilot is crucial.

2 - It is important to manage inventory and calculate the optimal stock balance in a centralized and identical way, utilizing the same algorithms, in the pilot and control stores during the pilot. Ideally, the inventory management process should be automated.
The least desirable situation is when each store independently manages its inventory.

The proficiency of individual employees can impact a store's supply efficiency, with some employees excelling at order management while others struggle.

This uncertainty factor can compromise the A/B test, as sales will be influenced not only by the pricing strategy (manual vs. dynamic) but also by the order management skills of store employees.

In such cases, "improvements" in the pilot store may be attributable to a talented employee effectively managing orders and addressing stock-out issues. Conversely, if a less skilled employee is responsible for orders, it may lead to declining sales due to insufficient stock.
To-dos for the pilot project manager:
  • Choose test and control stores with a centralized stock management system that uses the same algorithms for both stores.
  • Or, for the duration of the pilot, make the ordering system and out-of-stock control identical in the test and control stores (centralization, automation, the same ordering algorithm).

Parameter 4. The same on-shelf availability

During the pilot, the availability of goods from the pilot categories on the shelves of the test and control stores should be the same.

"The product is in the warehouse of the store" is not equal to "the product is on the shelf of the store."

A supermarket chain ran a dynamic pricing pilot and monitored the stores' performance dynamics daily.
The pricing analyst noticed repeated cases in which an item with stable, good sales suddenly stopped selling despite being in stock.

We started to download lists of such SKUs and investigate the issue in the field with the store's employees. In almost 100% of cases, we discovered that after goods receipt the item had been left on pallets. The item was in the warehouse but not on the shelf. As a result, the SKU's sales fell to zero on those "out-of-shelf" days.
The chain managers extended the pilot time, and ran daily monitoring of "abnormally frozen sales" in the test and control stores. The result:

1 – On-shelf availability issues for items in the pilot categories soon fell to almost zero in both supermarkets. This had a positive effect on sales in both the test and control stores.

2 – This made it possible to correctly evaluate the increase in sales and profits due to dynamic pricing at the end of the pilot. To do so, the experts compared the dynamics of the test and control supermarkets' figures.
If the item is in stock, but the store employees do not put it on the shelf on time:

1 - This instantly leads to a drop in sales.

2 - This complicates the work of artificial intelligence in pricing.
Machine learning algorithms see non-zero inventory and zero sales, so they conclude that the price is inefficient. This data issue slows down the algorithms' learning process.

3 - This leads to wrong result calculations.
The absence of goods on the shelves is not recorded. At the end of the pilot, the data will show low sales even though the goods were in stock. Based on this incorrect data, the experts will reach an incorrect conclusion:

The lack of goods on the shelf (out-of-shelf, on-shelf availability issues) in the control store leads to overestimating the resulting increase in sales due to dynamic pricing.

The lack of product on the shelf in the test store leads to underestimating the resulting increase in sales due to dynamic pricing. If the item wasn't on a shelf, it was impossible to buy it. But as the data does not contain any information about on-shelf availability, experts will conclude that the reason for the decrease in sales is in prices.
To-dos for the pilot project manager:
  • During the pilot, monitor how goods are displayed on the shelves in the test and control stores.

    It is impossible to quickly fix the "store warehouse to store shelf" delivery process across an entire retail chain.
    On the other hand, it is quite feasible to set up a monitoring system for on-shelf availability issues in the test categories of two stores.

To reduce on-shelf availability issues, the Imprice platform provides an automated report "Abnormally frozen sales". This tool is simple but quite effective.

A store receives the following information from the report:

Beer "Heineken 0.0",
-- average daily sales: 3.85 cans,
-- balance in the store's warehouse: 586 pieces, which arrived at least three days ago,
-- 0 cans of "Heineken 0.0" sold in the past 2 days,
-- 3 cans of "Heineken 0.0" sold in the current week.
On-shelf availability issues are suspected.

Cat food,
-- average sales are 4 packs per day,
-- warehouse balance of 185 pieces, appeared at least three days ago,
-- sold zero packs in two days,
-- weekly sales were 2 packs.
On-shelf availability issues are suspected.

Despite its simplicity, this report indicates the problem quite accurately and allows the retailer to eliminate issues efficiently.
The responsible store employee receives this list of "suspected items" daily and then manually checks what has happened to them on the shelf.
As practice shows, daily monitoring of "suspected items" on the shelf significantly reduces the number of lines in the report in a short time. That is, the number of on-shelf availability issues quickly decreases to nearly zero.
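
The logic behind such a report can be sketched in a few lines. This is not the Imprice implementation: the rule and its thresholds (a minimum average of one unit per day, stock on hand for at least three days, zero sales over the last two days) are assumptions inferred from the sample report, and the last two rows of data are invented for contrast.

```python
from dataclasses import dataclass


@dataclass
class SkuSnapshot:
    name: str
    avg_daily_sales: float  # long-run average units sold per day
    stock_units: int        # units currently in the store's warehouse
    days_in_stock: int      # days since the current stock arrived
    sold_last_2_days: int   # units sold over the two most recent days


def is_frozen(sku, min_avg_daily_sales=1.0, min_days_in_stock=3):
    """Flag a SKU that normally sells and is in stock, yet has stopped selling."""
    return (
        sku.avg_daily_sales >= min_avg_daily_sales
        and sku.stock_units > 0
        and sku.days_in_stock >= min_days_in_stock
        and sku.sold_last_2_days == 0
    )


def frozen_sales_report(skus):
    """Names of SKUs suspected of on-shelf availability issues."""
    return [sku.name for sku in skus if is_frozen(sku)]


skus = [
    SkuSnapshot("Heineken 0.0", 3.85, 586, 3, 0),
    SkuSnapshot("Cat food", 4.0, 185, 3, 0),
    SkuSnapshot("Sparkling water", 6.0, 240, 5, 11),  # invented row: selling normally
    SkuSnapshot("Olive oil", 2.0, 0, 0, 0),           # invented row: genuinely out of stock
]
suspected = frozen_sales_report(skus)
print(suspected)
```

The stock check matters: an item with zero stock and zero sales is an out-of-stock case (Parameter 3), not an on-shelf availability case.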

Parameter 5. Identical efficiency of merchandising (special case of on-shelf availability)

During the pilot, the quality of product display and the availability of price tags should be the same in the test and control stores.
An "unexpected" drop in sales in a control or test store during the pilot can happen in the following cases:

Absence of price tags. For many shoppers, this is equal to the absence of goods on the shelf.
The balance on the shelf is less than the "minimum required number of facings" ("minimum shelf stock to make the SKU visible"). For many shoppers, this is equal to the absence of goods on the shelf.

Example: the shaving products category, Gillette shaving cartridges. The manufacturer's merchandisers are responsible for the display.

If only a few Gillette cartridges are displayed in the checkout area, their sales drop sharply:
Some shoppers do not notice them and therefore forget to buy them.
Other shoppers simply cannot find the cartridges in their usual place next to the checkouts and are not willing to leave the line and go back into the store to look for the shaving section.
To-dos for the pilot project manager:
  • During the dynamic pricing pilot, control and "equalize" the quality of displaying test categories' goods in the test and control stores:
    the presence of price tags,
    the number of facings,
    the total balance of SKUs on the shelf.
    Otherwise, sales will be affected not only by the type of pricing ("manual" and dynamic), but also by the skills of store employees and manufacturers' merchandisers.

Parameter 6. The same cost prices

During the pilot, the cost prices of goods from the test categories in the test and control stores should be the same.

Another option: it should be possible to calculate the resulting increase in gross profit, using the same cost price for the test and control stores.
Simplified example:

At the beginning of the pilot, in the test store there was a large stock of product A with a cost price of 100 euros. This stock was enough for the entire pilot time.
The control store had a small stock of item A. From the beginning of the pilot, the store began to receive the product A at a cost price of 150 euros.

One month before the pilot, the gross profit of the A item in both the test and control stores was 5000 euros.
During the month of the pilot, the test store sold 100 units of product A at a "dynamic" price of 190 euros.
Gross profit was calculated as follows:
100* (190-100) = 9000 euros

The control store sold 100 units of product A at a "manual" price of 200 euros.
Gross profit was calculated as follows:
100* (200-150) = 5000 euros

With this calculation method, it seems the increase in gross profit in the test store was 4000 euros, and in the control store 0 euros. And therefore, dynamic pricing increased gross profit by 80%. In reality, there was no growth, only a difference in cost prices.

The reverse case is also possible, when the increase in gross profit due to dynamic pricing is underestimated because of the higher cost price in the test store.
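
The arithmetic of this example can be reproduced and then repeated with one common cost price for both stores. The prices, units, and cost prices below come from the example. The choice of which common cost price to apply (here, the test store's 100 euros) is an assumption made for illustration.

```python
def gross_profit(units, price, cost_price):
    """Gross profit in euros for a single product."""
    return units * (price - cost_price)


UNITS = 100
TEST_PRICE, CONTROL_PRICE = 190, 200
TEST_COST, CONTROL_COST = 100, 150

# Each store with its own cost price, as in the example.
test_own = gross_profit(UNITS, TEST_PRICE, TEST_COST)
control_own = gross_profit(UNITS, CONTROL_PRICE, CONTROL_COST)
apparent_gap = test_own - control_own

# Both stores recalculated with one common cost price (assumption: the test store's).
common_cost = TEST_COST
test_common = gross_profit(UNITS, TEST_PRICE, common_cost)
control_common = gross_profit(UNITS, CONTROL_PRICE, common_cost)
adjusted_gap = test_common - control_common

print(apparent_gap, adjusted_gap)
```

Once the cost price is equalized, the test store's apparent 4000-euro advantage disappears, which confirms that it came from the cost difference rather than from pricing.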
To-dos for the pilot project manager:
  • For the pilot, choose categories with the same cost prices in the test and control stores.
  • Or choose categories for which one can use the same cost prices in the calculations of resulting gross profit in the test and control stores.
If neither option is possible for a category, it cannot be included in the pilot, because its gross profit would be affected not only by the type of pricing ("manual" or dynamic) but also by differences in cost prices.

Parameter 7. Identical volume of "loyalty points" write-off

During the pilot, the test and control stores should have the same volume of payments made with loyalty points.

Intuition suggests that if two stores have similar revenue, the volume of payments made with loyalty points will also be similar. As practice shows, this is not the case.
Why can the amount of points written off vary greatly between stores with the same revenue?
For example, because of cashiers and their scripts.

In one store, they tell shoppers their amount of accumulated points and offer to write them off.

In another store, they do not provide such information on checkouts and do not offer to use the customer's points.
The shopper has to ask the cashier: "How many points do I have on my loyalty card? I would like to write them off!" Some consumers are shy, some simply forget about the points.
The result is that fewer shoppers use their loyalty points.
Differences in points write-offs can lead to underestimating or overestimating the resulting gross profit, just as differences in cost prices can.

To-dos for the pilot project manager:
  • Choose test and control stores where the amount paid with loyalty points is the same (or makes up the same share of each store's revenue).
  • Or remove written-off loyalty points from the final gross profit calculations (remove written-off points from receipts).

Parameter 8. Identical functions of the store and its warehouse, the same write-offs

The test and control stores should perform the same list of functions and have the same amount of write-offs.

1 - The hypermarket worked also as a warehouse for the chain's online store. Therefore, its assortment in non-food categories was much wider than in other hypermarkets.
It is incorrect to compare sales of frying pans in two stores if
one store offers 50 types of frying pans
and the other offers only five.
Therefore, the hypermarket that doubled as the online store's warehouse was not included in the pilot.

2 - The pharmacy was a "transshipment warehouse" for other pharmacies. For this reason, its assortment was abnormally wide compared to other stores of this chain. Its volume of write-offs was abnormally large for the same reason; this warped the pharmacy's financial results. This pharmacy was not included in the pilot.

3 - The pharmacy chain had a rule to formalize all write-offs as a sale for 1 cent.
This warped the results of dynamic pricing, and added complexity to the training of algorithms.
The decision was to remove receipts with write-offs from the sales data and from the resulting calculations.
To-dos for the pilot project manager:
  • Choose test and control stores with similar write-off amounts (or with write-offs making up the same share of each store's revenue).
  • Or remove write-offs from the data used for analysis and from the calculations of the pilot's final indicators.
  • Do not include in the pilot stores whose functions or assortment are unusual for the chain.

Parameter 9. Identical share of wholesale purchases, identical share of business purchases

The share of wholesale and business purchases in the revenue of the test and control stores should be the same.

The test hypermarket was located next to a business center with offices, warehouses, and manufacturing spaces. For this reason, business customers often made office and small-scale wholesale purchases at the regular prices for "ordinary" shoppers and paid at the regular checkouts.
In the control hypermarket, there were significantly fewer small-scale wholesale and office purchases.

Wholesale purchases created "noise" in the data and complicated the training of ML pricing algorithms. On a given day, sales of an item could grow not because of price optimization but because a "wholesale/office customer" came in.

The solution was to remove "abnormal receipts" from the sales data of both stores. Receipts were treated as "abnormal" if their amount or number of units exceeded certain values.
To-dos for the pilot project manager:
  • Choose test and control stores with the same share of wholesale sales.
  • Or remove abnormally large receipts from the data being analyzed and from the final results.
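
Filtering "abnormal receipts" can be sketched as a simple threshold rule on the receipt amount or the number of units. The thresholds (500 euros, 50 units), the receipt layout, and the data are hypothetical. In practice the cut-off values would have to be chosen from the chain's own receipt distribution.

```python
# Hypothetical receipts: (receipt_id, total_amount_eur, total_units)

def is_abnormal(receipt, max_amount=500.0, max_units=50):
    """A receipt is abnormal if its amount OR its unit count exceeds a threshold."""
    _receipt_id, amount, units = receipt
    return amount > max_amount or units > max_units


def split_receipts(receipts, **thresholds):
    """Separate ordinary receipts from suspected wholesale/office purchases."""
    normal = [r for r in receipts if not is_abnormal(r, **thresholds)]
    abnormal = [r for r in receipts if is_abnormal(r, **thresholds)]
    return normal, abnormal


receipts = [
    ("R1", 42.50, 12),
    ("R2", 980.00, 30),   # large amount: possibly an office purchase
    ("R3", 120.00, 200),  # many units: possibly small-scale wholesale
    ("R4", 75.10, 18),
]
normal, abnormal = split_receipts(receipts)
print([r[0] for r in normal], [r[0] for r in abnormal])
```

The same thresholds should be applied to both the test and control stores, so that the filter itself does not introduce a difference between them.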
  • Nikita Tsukanov
    CEO Imprice — dynamic pricing software,
    The procedure for conducting A/B testing of dynamic pricing in offline retail consists of the following steps:

    Select stores that are identical or highly similar.

    Implement dynamic pricing in one (test) store, while maintaining the existing pricing management approach in the other (control) store.

    Throughout the pilot, ensure that the pricing methodology remains the sole differentiating factor in the stores' operations.

    Upon completion of the pilot, assess the results of dynamic pricing by calculating the difference in the target metrics between the control and test stores.

    We will now examine the tools large retail chains can utilize to identify "identical stores" for the pilot, as well as the options available to smaller chains lacking "identical stores."

Data Science Approach. Selecting Stores for A/B Testing with Machine Learning Algorithms

Large retail chains with thousands of stores, such as Conad, can leverage machine learning algorithms to select suitable control and test stores. The general search methodology involves the following steps:

1 - Algorithms study the overall sales dynamics of the chain's stores.

2 - By clustering time series, they identify and group together stores with highly similar sales dynamics into clusters.

3 - Within each cluster, the algorithms compare the proportions of sales attributed to specific product groups and categories among the stores. This process involves examining both the shares of the categories themselves and the dynamics of these shares, with the goal of identifying stores with similar overall dynamics and no significant differences in sales structure.

4 - Next, algorithms search for stores with comparable key performance indicators, such as the number of transactions (traffic), average order value, revenue, and profitability.

5 - Finally, they compare the descriptive characteristics of similar stores, including:
store format,
location ("downtown / residential area"),
street retail or shopping center,
the number of nearby competitors,
specialized services of the store ("a click-and-collect option", "the store is a warehouse of an online store").
For this analysis, algorithms work with a database of stores that have their properties accurately recorded.

The outcome is pairs of stores exhibiting the most similar indicators and characteristics.
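
As a much-simplified stand-in for the first two steps above, the sketch below compares normalized sales series pairwise and picks the closest pair of stores. It does not perform clustering and ignores steps 3 to 5. The store names, the sales data, and the choice of Euclidean distance on mean-normalized series are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def normalize(series):
    """Scale a sales series by its mean so stores of different size are comparable."""
    mean = sum(series) / len(series)
    return [value / mean for value in series]


def distance(a, b):
    """Euclidean distance between two normalized series of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))


def most_similar_pair(sales_by_store):
    """Return the two store names whose normalized sales dynamics are closest."""
    return min(
        combinations(sorted(sales_by_store), 2),
        key=lambda pair: distance(sales_by_store[pair[0]], sales_by_store[pair[1]]),
    )


monthly_sales = {  # invented data
    "store_A": [100, 105, 98, 120, 140, 110],
    "store_B": [210, 220, 205, 250, 290, 228],  # about twice store_A, same shape
    "store_C": [100, 95, 90, 85, 80, 75],       # steady decline
    "store_D": [150, 150, 160, 150, 155, 150],  # roughly flat
}
pair = most_similar_pair(monthly_sales)
print(pair)
```

Normalizing by the mean lets a large and a small store match when their dynamics have the same shape. Whether stores of very different size should be paired at all is decided by the later steps that compare KPIs and descriptive characteristics.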


Simplified Approach: Selecting Stores and Categories for A/B Testing in a Small Retail Chain

For small retail chains with only tens or hundreds of stores, a "semi-manual" approach is often faster and more efficient. This is due to several reasons:

1 – The quality and volume of the retailer's data may be insufficient for effective machine learning algorithm operation.
Data may be noisy or contain gaps, and there may not be a comprehensive directory of descriptive store characteristics.

2 – With a limited number of stores, essential store characteristics can be compared manually, yielding reasonably accurate conclusions.

3 – In some cases, small retail chains may not have two similar stores. All stores within the chain might differ significantly in specific indicators and characteristics.
  • Step 1. Compare sales dynamics of stores.
    Display sales charts for all stores, covering a minimum period of one year.

    Compare the dynamics of store performance, such as revenue, number of purchases, average purchase amount, units per transaction, and so on. Clustering algorithms can help identify stores with similar performance. When there are only a few stores, visually identifying stores with similar dynamics is also possible.

    To uncover hidden trends, analyze sales dynamics over several years. As previously mentioned, hidden trends involve cases where a store is experiencing slow but consistent growth or decline. Example 3 at the beginning of the article (regarding the impact of a new residential complex on sales during the pilot) demonstrates why considering trends is essential for pilot evaluation:
An example of a hidden trend. Because the district's population grew, within a few years the test hypermarket (red line) surpassed the control hypermarket (blue line) in performance.
  • Unfortunately, small chains often lack stores with perfectly similar dynamics.

    In such cases, the outcome of Step 1 is to identify stores with somewhat similar dynamics.
An example of the absence of perfectly similar stores. In the highlighted zones, the dynamics of all stores differ significantly.

  • Step 2. Compare all available descriptive characteristics of the stores selected in Step 1.
    It is more efficient for the retailer to carry out this "manual" step in-house. For an external contractor, this task can be quite challenging.

    Why is this step the second, not the first?
    The store characteristic database is often limited.
    "We have only 30 stores. There isn't much to describe. Our managers retain all the information in their heads."

    Descriptive store characteristics do not always influence sales.
    While it is preferable not to compare a hypermarket with a supermarket, two stores located in shopping centers may not be similar due to significant differences between the centers. Thus, the dynamics of stores in these centers can also vary greatly.

  • Step 3. Select the product assortment for the pilot.
    Sometimes, it makes sense to include only a portion of the assortment in the pilot. Situations in which this is recommended include:

    1) The retail chain does not have two sufficiently similar stores.
    In this case, one can choose specific product categories to run the pilot. The dynamics of these categories should be the same in two stores.

    2) It is prudent to exclude manually packed products from the pilot assortment.
    For example, the store cuts some cheeses and sausages, wraps them in plastic, and sticks a price tag on each pack. During a dynamic pricing pilot, the store has to change prices often. To change the price of packed food, staff need to re-weigh EVERY pack and attach a new price tag. This labor cost is much higher than replacing a single price tag for all goods with the same SKU.

    3) It is advisable to exclude ready-made meals produced by the supermarket from the pilot.
    Otherwise, the store may face the price tag difficulties described above. The second reason is that such SKUs often have incorrect cost prices.

    4) It is recommended to exclude fruits and vegetables from the pilot.
    In these categories, sales are highly dependent on the quality of the batch. If the pricing algorithm learns on a batch of high quality cucumbers, and then a low quality batch comes, the price will be suboptimal.

    5) Exclude categories experiencing a "low season" or sales decline during the pilot.
    Sales in these categories will be low regardless, making it challenging to evaluate the results.
  • Step 4. Check Parameters 1-9 according to the checklist provided in this article.
    The outcome is a pair of stores and a list of categories within these stores that exhibit the most similar indicators and characteristics.