During a recent test of an experimental luxury chatbot, a prompt for a “colourful floral kaftan” returned a matching result from Australian brand Camilla. Great! There was only one problem: Camilla wasn’t associated with the company that provided the chatbot, which was ostensibly created to recommend products from its own coffers. Oops.
Artificial intelligence-generated mistakes like these, which produce results that are either totally imagined or simply not intended, are often called ‘AI hallucinations’. At best, they risk frustrating customers and eroding brand trust, and at worst, they can perpetuate biases or generate harmful content. Brands experimenting with generative AI need to tread carefully — but how?
AI hallucinations are mistakes caused by AI engines having insufficient or inaccurate information, says Simon Bain, CEO of AI-based cybersecurity company OmniIndex. Some fabricate information outright, with no basis in fact; this entirely made-up result is typically referred to as an “open hallucination”, says Sahil Agarwal, co-founder and CEO of AI safety and security company Enkrypt AI.
There can also be more nuanced mistakes, such as when a search for “vegan leather handbag” returns results that include animal leather handbags, based on a misinterpretation of the taxonomy of “vegan leather” or on a computer vision model that perceives the look of leather on a vegan leather bag, says Faisa Hussein, product manager of e-commerce search and discovery company Groupby, whose clients include Rebag and various department stores.
There can be mistakes that are factually accurate but not intended by the retailer, like the Camilla product recommendation; or, a model might identify a product as being made by both the brand name and its parent company (such as an Old Navy product additionally attributed to Gap Inc, or a Tommy Hilfiger piece also labelled as coming from PVH), because it is bringing in context from the outside world. These unintended results are typically referred to as “closed domain” hallucinations, in which the response is from outside the brand’s domain or data set, Agarwal says.
More harmful errors can happen. If an AI model for a beauty retailer, for instance, is trained on a limited scope of skin tones, it can become biased, and provide incorrect or incomplete recommendations to a customer whose skin colour doesn’t fit within that training data. “At best, this could limit the diversity of choices available to consumers. At worst, this could perpetuate harmful stereotypes and marginalise potential customers,” Bain says.
The human-like moniker of ‘hallucination’ for a computer’s mistakes is fitting; results often appear confident and eloquent, so they can seem true, says Simon Langevin, VP of e-commerce products at Coveo, which provides AI-informed product discovery tech for brands including Clinique, Sam Edelman and River Island. This can also make them hard to detect.
The public relations risk of hallucinations isn’t totally theoretical. This spring, Google’s AI search results suggested adding glue to a pizza recipe. A Chevrolet dealership’s chatbot recommended cars from competitor Tesla. And an Air Canada passenger successfully argued that the airline had to honour an erroneous discount offered by its chatbot. A movie trailer for the upcoming Francis Ford Coppola film Megalopolis was recently pulled by Lionsgate after it turned out that the featured critics’ quotes were fake; many internet sleuths hypothesised that they were AI generated, seemingly another consequence of plausible-sounding hallucinations.
Fashion is already exposed. Many brands and retailers have generative AI pilots up and running, often using external tools from startups and big tech partners. Kering recently tested a chatbot named ‘Madeline’, designed to offer shopping recommendations and styling advice (the tool is now under maintenance). Brunello Cucinelli just unveiled an AI-focused site where customer queries are answered in the tone of the brand. LVMH has recognised a number of startups that use generative AI, and its AI Factory is exploring multiple internal uses. Rent the Runway, meanwhile, enables conversational search, Google is virtually dressing e-commerce models and intimates brand Adore Me is enabling customers to generate prints for customised bra and panty sets. The list goes on.
Especially for specialty brands, Langevin says, “the brand image and the expertise you have is one of the biggest buffers you have against [mass retailers such as] Amazon. And in fashion, the knowledge that you have differentiates you. If you end up telling customers random stuff and send them in the wrong direction, you are breaking that trust.”
Separating fact from fiction
As companies begin testing and deploying generative AI tools, there are ways to mitigate hallucinations. The first step, Agarwal recommends, is to identify the least risky application. He calls this “use case prioritisation”: companies identify the applications where hallucinations are likely to cause the fewest problems. An unsupervised, customer-facing scenario, for example, is likely riskier than something internal with significant oversight.
It’s for this reason, in part, that LVMH’s AI Factory avoids customer-facing uses of generative AI for now, and that it is leaning on data science and “traditional” AI more than experimental generative AI, as AI Factory director Axel De Goursac recently told Vogue Business. “We are not so keen to put the AI models directly in front of customers because we think that our competitive advantage is our client-advisor workforce,” De Goursac said.
Tools that extract existing data, rather than generate something entirely new, are less risky, says Jon Apgar, lead engineer at Groupby, which recently unveiled an offering called ‘Enrich AI’ to help retailers correct and standardise product information using generative AI. “It’s not making stuff out of thin air, such as writing product descriptions or generating product imagery. It’s just structuring information so that it’s easier to find.”
As part of this, it helps to begin by keeping the goals and the structure simple. “The more variables you put in the entire journey, the more likely it is to hallucinate,” Langevin says. (A recent prompt on Rent the Runway’s AI shopping search chatbot for, “What is Rent the Runway’s revenue?” avoided disclosing private information, but instead returned results including Oscar de la Renta earrings and a Rails cargo skirt. Proprietary information is protected, but the model was still a bit stumped.)
Even though there can be an instinct to apply the technology to a broad array of products and use cases, keeping the initial scope smaller is likely to generate a more accurate result. “Being crisp, clear and concise on the instructions you give is critical,” says Groupby director of product Arvind Natarajan.
Brands and retailers that use a customer-facing chatbot will want to ground the model in their own accurate, specific data to help mitigate results drawn from outside sources or invented content. The right prompt is necessary even if the model is grounded. For example, the prompt engineer might specify that answers should be given as if they came from an employee of the brand, without mentioning competing brands, advises Langevin. (A recent query on Cucinelli’s new site to name the brand’s top competitors passed this test, responding, in part: “We do not address inquiries outside [the scope of Brunello Cucinelli]. Please let us know if there is anything else you would like to explore within our dedicated topics.”)
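To make the grounding idea concrete, here is a minimal sketch of how a brand-constrained prompt might be assembled; the catalogue, brand name and `generate` stub are hypothetical stand-ins, not any vendor’s actual implementation.

```python
# Minimal sketch (hypothetical): ground a shopping assistant in the brand's own
# catalogue and constrain behaviour via the system prompt. `generate` stands in
# for whatever hosted LLM call a retailer's vendor actually provides.

CATALOGUE = [
    {"name": "Silk floral kaftan", "brand": "ExampleBrand", "price": "£450"},
    {"name": "Linen wrap dress", "brand": "ExampleBrand", "price": "£320"},
]

SYSTEM_PROMPT = (
    "You are a shopping assistant for ExampleBrand. Answer as an employee of "
    "the brand. Recommend only products from the catalogue below. Never mention "
    "or recommend competing brands. If no catalogue item matches, say so "
    "instead of inventing one.\n\nCatalogue:\n"
)

def build_prompt(customer_query: str) -> str:
    catalogue_text = "\n".join(
        f"- {p['name']} ({p['brand']}, {p['price']})" for p in CATALOGUE
    )
    return f"{SYSTEM_PROMPT}{catalogue_text}\n\nCustomer: {customer_query}"

def generate(prompt: str) -> str:
    # Placeholder for the real model call.
    raise NotImplementedError

# answer = generate(build_prompt("Do you have a colourful floral kaftan?"))
```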
Companies that can’t invest in training on their own data might turn to specialised, smaller models built to perform specific tasks, rather than to the large language models of AI giants such as OpenAI (maker of ChatGPT) or Google (maker of Gemini), Bain says.
As it’s still early days, some might find it helpful to use a hybrid approach, in which the results of traditional information-retrieval searches (keyword matching or other established techniques) are bedazzled with generative AI. This process is called ‘retrieval-augmented generation’: the model ingests the retrieved data and then summarises and explains it in a more polished, conversational way, Langevin says.
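As an illustration only, the retrieval-augmented pattern might look roughly like the sketch below, with a traditional search step feeding the model only what it is allowed to talk about; the product data and matching logic are simplified assumptions.

```python
# Hypothetical retrieval-augmented generation (RAG) sketch: retrieve matching
# products with a traditional search step, then ask the model only to rephrase
# and summarise what was retrieved rather than invent new facts.

PRODUCTS = [
    {"name": "Vegan leather handbag", "material": "polyurethane", "price": "£180"},
    {"name": "Calf leather tote", "material": "calf leather", "price": "£540"},
]

def retrieve(query: str) -> list[dict]:
    # Stand-in for keyword or vector search over the product index.
    terms = query.lower().split()
    return [p for p in PRODUCTS if all(t in p["name"].lower() for t in terms)]

def build_generation_prompt(query: str) -> str:
    hits = retrieve(query)
    if not hits:
        return "Tell the customer no matching products were found."
    context = "; ".join(f"{p['name']} ({p['material']}, {p['price']})" for p in hits)
    # The model is asked to polish and explain only the retrieved facts.
    return (
        "Summarise these products for a customer in a friendly tone, "
        f"using only this information: {context}"
    )

print(build_generation_prompt("vegan leather handbag"))
```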
Experts also recommend that companies fact-check, through means including human validation, A/B testing or additional algorithms. For example, Enkrypt has a system that reviews outputs to ensure the information exists in the data set before a final answer is served, so if an employee searches for sales figures in a region that doesn’t exist, the model won’t hallucinate results from a store there. Groupby also has a two-step quality-assurance model, in which a second model checks the first one’s work.
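A highly simplified version of that kind of pre-serve check, written here as an assumption rather than a description of Enkrypt’s or Groupby’s actual systems, might look like this:

```python
# Hypothetical fact-check gate: before serving an answer, confirm that the
# entity it references actually exists in the retailer's data set.

KNOWN_REGIONS = {"uk", "france", "japan", "us"}

def validate_region(region: str) -> str | None:
    """Return a refusal message if the region is unknown, else None."""
    if region.lower() not in KNOWN_REGIONS:
        return f"No stores on record for '{region}'; no sales figures available."
    return None

def serve_sales_answer(region: str, model_answer: str) -> str:
    # Second-step check: refuse to pass through an answer about a region that
    # is not in the data set, rather than risk serving a hallucinated figure.
    refusal = validate_region(region)
    return refusal if refusal else model_answer

print(serve_sales_answer("Atlantis", "Sales were up 12% last quarter."))
print(serve_sales_answer("Japan", "Sales were up 12% last quarter."))
```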
There’s also the common practice of ‘human in the loop’, in which results are supervised before being used. Adore Me and Estée Lauder Companies both use this approach to generate (and edit, if necessary) marketing or product copy. For a new service that lets customers design prints to use on bra and panty sets, Adore Me has an automated system that checks for infractions such as copyrighted material, brand names or inappropriate content; if the system detects anything potentially risky, Adore Me’s team verifies the content before the item goes into production. If a retailer uses generative AI to optimise product details arriving in a shipment from a distributor, such as price, description, category, colour consistency and brand name, the merchandiser can relatively easily fact-check before introducing the products to the catalogue, Langevin says. Groupby has amassed a proprietary library of taxonomy across brands that has been validated and annotated by humans; Hussein refers to this as its “golden data set”.
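The automated screening that routes risky generations to a person could, in spirit, look like the sketch below; the flagged-term list and routing logic are illustrative assumptions, not Adore Me’s actual system.

```python
# Hypothetical human-in-the-loop gate: auto-approve clearly safe generated
# content and queue anything flagged as risky for a human reviewer.

FLAGGED_TERMS = {"nike", "disney", "gucci"}  # illustrative brand/IP terms only

def needs_human_review(generated_text: str) -> bool:
    text = generated_text.lower()
    return any(term in text for term in FLAGGED_TERMS)

def route(generated_text: str) -> str:
    if needs_human_review(generated_text):
        return "queued_for_human_review"
    return "auto_approved"

print(route("Pastel polka-dot print"))            # auto_approved
print(route("A print featuring the Gucci logo"))  # queued_for_human_review
```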
A consistent theme in the AI renaissance is transparency, spanning the information that the models are trained on and the fact that AI was used at all. Groupby’s Apgar says the company has also begun asking its models to explain their rationale: how they arrived at a given result. This means that in addition to assigning attributes to products, the technology will also provide its “chain of thought”, or reasoning. This has helped the team track and mitigate mistakes, he says.
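To illustrate the idea, a product-enrichment prompt might ask for the attribute and the rationale together, so both can be stored and audited; the JSON structure below is an assumption for the sketch, not Groupby’s format.

```python
# Hypothetical sketch: ask the model to return both the assigned attribute and
# its reasoning, so merchandisers can trace how a value was derived.

import json

PROMPT_TEMPLATE = (
    "Assign a 'material' attribute to this product and explain your reasoning. "
    "Respond as JSON with keys 'material' and 'rationale'.\n"
    "Product description: {description}"
)

def parse_enrichment(model_output: str) -> dict:
    """Parse the model's JSON reply, keeping the rationale for auditing."""
    result = json.loads(model_output)
    # Store result["rationale"] alongside result["material"] so mistakes can be
    # traced back to the model's stated reasoning.
    return result

example_output = (
    '{"material": "vegan leather", '
    '"rationale": "Description mentions polyurethane, an animal-free leather alternative."}'
)
print(parse_enrichment(example_output))
```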
For customer-facing tools, experts recommend that brands and retailers disclose not only that AI is being used, but that the results may be inaccurate. There is the option to let customers provide feedback on results, but this is less helpful, Langevin says, as response rates are typically low.
The key point, Agarwal advises, is to manage expectations. AI will likely never be 100 per cent accurate, he cautions, so it’s important not to trust the model blindly. Perfection, after all, is generally too good to be true.