🆕 Haystack 2.30 is here! Pass a plain string to any ChatGenerator

Improve Retrieval by Embedding Meaningful Metadata


Notebook by Stefano Fiorucci

In this notebook, I do some experiments on embedding meaningful metadata to improve Document retrieval.

%%capture
! pip install wikipedia haystack-ai sentence-transformers-haystack rich
import rich

Load data from Wikipedia

We are going to download the Wikipedia pages related to some bands, using the python library wikipedia.

These pages are converted into Haystack Documents.

some_bands="""The Beatles
Rolling stones
Dire Straits
The Cure
The Smiths""".split("\n")
import wikipedia
wikipedia.set_user_agent("haystack-cookbook (https://github.com/deepset-ai/haystack-cookbook)")
from haystack.dataclasses import Document

raw_docs=[]

for title in some_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

🔧 Setup the experiment

Utility functions to create Pipelines

The indexing Pipeline transforms the Documents and stores them (with vectors) in a Document Store. The retrieval Pipeline takes a query as input and perform the vector search.

I build some utility functions to create different indexing and retrieval Pipelines.

In fact, I am interested in comparing the standard approach (where we only embed text) with the embedding metadata strategy (we embed text + meaningful metadata).

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack_integrations.components.embedders.sentence_transformers import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice
def create_indexing_pipeline(document_store, metadata_fields_to_embed):

  indexing = Pipeline()
  indexing.add_component("cleaner", DocumentCleaner())
  indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))

  # in the following componente, we can specify the parameter `metadata_fields_to_embed`, with the metadata to embed
  indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model="thenlper/gte-large",
                                                                              device=ComponentDevice.from_str("cuda:0"),
                                                                              meta_fields_to_embed=metadata_fields_to_embed)
  )
  indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

  indexing.connect("cleaner", "splitter")
  indexing.connect("splitter", "doc_embedder")
  indexing.connect("doc_embedder", "writer")

  return indexing
def create_retrieval_pipeline(document_store):

  retrieval = Pipeline()
  retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                                            device=ComponentDevice.from_str("cuda:0")))
  retrieval.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3))

  retrieval.connect("text_embedder", "retriever")

  return retrieval

Create the Pipelines

Let’s define 2 Document Stores, to compare the different approaches.

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
document_store_w_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function="cosine")

Now, I create the 2 indexing pipelines and run them.

indexing_pipe_std = create_indexing_pipeline(document_store=document_store, metadata_fields_to_embed=[])

# here we specify the fields to embed
# we select the field `title`, containing the name of the band
indexing_pipe_w_embedded_metadata = create_indexing_pipeline(document_store=document_store_w_embedded_metadata, metadata_fields_to_embed=["title"])
indexing_pipe_std.run({"cleaner":{"documents":raw_docs}})
indexing_pipe_w_embedded_metadata.run({"cleaner":{"documents":raw_docs}})
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



modules.json:   0%|          | 0.00/385 [00:00<?, ?B/s]



README.md:   0%|          | 0.00/67.9k [00:00<?, ?B/s]



sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/619 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/670M [00:00<?, ?B/s]



Loading weights:   0%|          | 0/391 [00:00<?, ?it/s]



tokenizer_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]



vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]



tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]



special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]



Batches:   0%|          | 0/9 [00:00<?, ?it/s]



Batches:   0%|          | 0/9 [00:00<?, ?it/s]





{'writer': {'documents_written': 266}}
print(len(document_store.filter_documents()))
print(len(document_store_w_embedded_metadata.filter_documents()))
266
266

Create the 2 retrieval pipelines.

retrieval_pipe_std = create_retrieval_pipeline(document_store=document_store)

retrieval_pipe_w_embedded_metadata = create_retrieval_pipeline(document_store=document_store_w_embedded_metadata)

🧪 Run the experiment!

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Document(id=d8abc68a38bc4fc151ccc3ced628feaff73a0aaff9b61a5627908112c3e77727, content: 'The Beatles were an English
rock band formed in Liverpool in 1960. The band comprised John Lennon, P...', meta: {'title': 'The Beatles', 'url':
'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 0, 
'split_idx_start': 0}, score: 0.8530507324777759)
The Beatles were an English rock band formed in Liverpool in 1960. The band comprised John Lennon, Paul McCartney, 
George Harrison and Ringo Starr. 

Document(id=b8b6bb46c73e300c0849bf73965c7813677a9c1cab6736f5916d529281c285d6, content: 'All of the disparate 
influences on their first two albums coalesced into a bright, joyous, original ...', meta: {'title': 'The Beatles',
'url': 'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 26, 
'split_idx_start': 22615}, score: 0.8525623451911699)
All of the disparate influences on their first two albums coalesced into a bright, joyous, original sound, filled 
with ringing guitars and irresistible melodies." That "ringing guitar" sound was primarily the product of 
Harrison's 12-string electric Rickenbacker, a prototype given to him by the manufacturer, which made its debut on 
the record. ==== 1964 world tour, meeting Bob Dylan and stand on civil rights ==== Touring internationally in June 
and July, the Beatles staged 37 shows over 27 days in Denmark, the Netherlands, Hong Kong, Australia and New 
Zealand. In August and September, they returned to the US, with a 30-concert tour of 23 cities. Generating intense 
interest once again, the month-long tour attracted between 10,000 and 20,000 people to each 30-minute performance 
in cities from New York to San Francisco.
In August, journalist Al Aronowitz arranged for the Beatles to meet Bob Dylan. Visiting the band in their New York 
hotel suite, Dylan introduced them to cannabis. Gould points out the musical and cultural significance of this 
meeting, before which the musicians' respective fan bases were "perceived as inhabiting two separate subcultural 
worlds": Dylan's audience of "college kids with artistic or intellectual leanings, a dawning political and social 
idealism, and a mildly bohemian style" contrasted with their fans, "veritable 'teenyboppers' – kids in high school 
or grade school whose lives were totally wrapped up in the commercialised popular culture of television, radio, pop
records, fan magazines and teen fashion. To many of Dylan's followers in the folk music scene, the Beatles were 
seen as idolaters, not idealists."
Within six months of the meeting, according to Gould, "Lennon would be making records on which he openly imitated 
Dylan's nasal drone, brittle strum, and introspective vocal persona"; and six months after that, Dylan began 
performing with a backing band and electric instrumentation, and "dressed in the height of Mod fashion". As a 
result, Gould continues, the traditional division between folk and rock enthusiasts "nearly evaporated", as the 
Beatles' fans began to mature in their outlook and Dylan's audience embraced the new, youth-driven pop culture.
During the 1964 US tour, the group were confronted with racial segregation in the country at the time. When 
informed that the venue for their 11 September concert, the Gator Bowl in Jacksonville, Florida, was segregated, 
the Beatles said they would refuse to perform unless the audience was integrated. Lennon stated: "We never play to 
segregated audiences and we aren't going to start now ... I'd sooner lose our appearance money." City officials 
relented and agreed to allow an integrated show. The group also cancelled their reservations at the whites-only 
Hotel George Washington in Jacksonville. For their subsequent US tours in 1965 and 1966, the Beatles included 
clauses in contracts stipulating that shows be integrated. ==== Beatles for Sale, Help! and Rubber Soul ====
According to Gould, the Beatles' fourth studio LP, Beatles for Sale, evidenced a growing conflict between the 
commercial pressures of their global success and their creative ambitions. They had intended the album, recorded 
between August and October 1964, to continue the format established by A Hard Day's Night which, unlike their first
two LPs, contained only original songs. They had nearly exhausted their backlog of songs on the previous album, 
however, and given the challenges constant international touring posed to their songwriting efforts, Lennon 
admitted, "Material's becoming a hell of a problem". As a result, six covers from their extensive repertoire were 
chosen to complete the album. Released in early December, its eight original compositions stood out, demonstrating 
the growing maturity of the Lennon–McCartney songwriting partnership.
In early 1965, following a dinner with Lennon, Harrison and their wives, Harrison's dentist, John Riley, secretly 
added LSD to their coffee. 

Document(id=3aae0d3d5dcd3ae21f711e61d773ab0aa93ba31323ee02cc459d3a390f8a3c7a, content: '==== Early residencies and 
UK popularity ==== Allan Williams, the Beatles' unofficial manager, arran...', meta: {'title': 'The Beatles', 
'url': 'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 15, 
'split_idx_start': 5786}, score: 0.8482544134070036)
==== Early residencies and UK popularity ==== Allan Williams, the Beatles' unofficial manager, arranged a residency
for them in Hamburg. They auditioned and hired drummer Pete Best in mid-August 1960. The band, now a five-piece, 
departed Liverpool for Hamburg four days later, contracted to club owner Bruno Koschmider for what would be a 
3+12-month residency. Beatles historian Mark Lewisohn writes: "They pulled into Hamburg at dusk on 17 August, the 
time when the red-light area comes to life ... flashing neon lights screamed out the various entertainment on 
offer, while scantily clad women sat unabashed in shop windows waiting for business opportunities."
Koschmider had converted a couple of strip clubs in the red light Reeperbahn district of St. Pauli into music 
venues and initially placed the Beatles at the Indra Club. After closing Indra due to noise complaints, he moved 
them to the Kaiserkeller in October. When he learned they had been performing at the rival Top Ten Club in breach 
of their contract, he gave them one month's termination notice, and reported the underage Harrison, who had 
obtained permission to stay in Hamburg by lying to the German authorities about his age. The authorities arranged 
for Harrison's deportation in late November. One week later, Koschmider had McCartney and Best arrested for arson 
after they set fire to a condom in a concrete corridor; the authorities deported them. Lennon returned to Liverpool
in early December, while Sutcliffe remained in Hamburg until late February with his German fiancée Astrid 
Kirchherr, who took the first semi-professional photos of the Beatles.
During the next two years, the Beatles were resident for periods in Hamburg, where they used Preludin both 
recreationally and to maintain their energy through all-night performances. In 1961, during their second Hamburg 
engagement, Kirchherr cut Sutcliffe's hair in the "exi" (existentialist) style, later adopted by the other Beatles.

❌ the retrieved Documents seem irrelevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Document(id=3c6428f9f52bbd44000ea4f4741bfdce5d36dd8346aef048e4f117f82bd75e40, content: 'The next day, they 
travelled to Bangor for his Transcendental Meditation retreat. On 27 August, thei...', meta: {'title': 'The 
Beatles', 'url': 'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 47, 
'split_idx_start': 43054}, score: 0.8650703581853868)
The next day, they travelled to Bangor for his Transcendental Meditation retreat. On 27 August, their manager's 
assistant, Peter Brown, phoned to inform them that Epstein had died. The coroner ruled the death an accidental 
carbitol overdose, although it was widely rumoured to be a suicide. His death left the group disoriented and 
fearful about the future. Lennon recalled: "We collapsed. I knew that we were in trouble then. I didn't really have
any misconceptions about our ability to do anything other than play music, and I was scared. 

Document(id=40181fad156bda72866d364ccbf1bdb0591aa6042b70729973bec1e60267bf8e, content: 'He eventually negotiated a 
one-month early release in exchange for one last recording session in Ham...', meta: {'title': 'The Beatles', 
'url': 'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 20, 
'split_idx_start': 9495}, score: 0.858983150084857)
He eventually negotiated a one-month early release in exchange for one last recording session in Hamburg backing 
Tony Sheridan. On their return to Germany in April, a distraught Kirchherr met them at the airport with news of 
Sutcliffe's death the previous day from a brain haemorrhage. Martin's first recording session with the Beatles took
place at EMI Recording Studios (later Abbey Road Studios) in London on 6 June 1962. 

Document(id=6bad03d52fbb917df6a5c5d2b3947310b0df41c9ace6742798cf7d79212b5654, content: 'Preston received label 
billing on the "Get Back" single – the only musician ever to receive that ack...', meta: {'title': 'The Beatles', 
'url': 'https://en.wikipedia.org/wiki/The_Beatles', 'source_id': 
'c14b28e1c0df58af707301f1a8564df840f3726f5e92447679830c6bc897825f', 'page_number': 1, 'split_id': 62, 
'split_idx_start': 51994}, score: 0.8548119991711911)
Preston received label billing on the "Get Back" single – the only musician ever to receive that acknowledgment on 
an official Beatles release. After the rehearsals, the band could not agree on a location to film a concert, 
rejecting several ideas, including a boat at sea, a lunatic asylum, the Libyan desert and the Colosseum. 

✅ the first Document is relevant

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Document(id=883fcd200df5f57d06cb97715779c19ecf3c3989e228fa1bb4110587dce80242, content: 'They became the 
highest-earning live act of 2021, surpassing Taylor Swift; since 2018 the two have t...', meta: {'title': 'The 
Rolling Stones', 'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 83, 
'split_idx_start': 69977}, score: 0.8395461912617969)
They became the highest-earning live act of 2021, surpassing Taylor Swift; since 2018 the two have traded the top 
two spots. The band began a new tour in 2022, with Jordan on drums.

Document(id=ea3a026b7caae21787240aa4ba3b5989a66afdd5835eaf544ca0d4b15e628f93, content: 'It was later extended to go
from May to July 2018, adding fourteen new dates in the UK and Europe, m...', meta: {'title': 'The Rolling Stones',
'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 79, 
'split_idx_start': 66965}, score: 0.8299663989351334)
It was later extended to go from May to July 2018, adding fourteen new dates in the UK and Europe, making it the 
band's first UK tour since 2006. In November 2018, the Stones announced plans to bring the No Filter Tour to US 
stadiums in 2019, with 13 shows set to run from April to June. In March 2019, it was announced that Jagger would be
undergoing heart valve replacement surgery, forcing the band to postpone the 17-date North American leg of their No
Filter Tour. On 4 April 2019, it was announced that Jagger had completed his heart valve procedure in New York, was
recovering (in hospital) and could be released in the following few days. On 16 May, the Rolling Stones announced 
that the No Filter Tour would resume on 21 June with the 17 postponed dates rescheduled up to the end of August. In
March 2020, the No Filter Tour was postponed due to the COVID-19 pandemic.
The Rolling Stones—featuring Jagger, Richards, Watts, and Wood at their homes—were one of the headline acts on 
Global Citizen's One World: Together at Home on-line and on-screen concert on 18 April 2020, a global event 
featuring dozens of artists and comedians to support frontline healthcare workers and the World Health Organization
during the COVID-19 pandemic. On 23 April, Jagger announced through his Facebook page the release (the same day at 
5:00 pm BST) of the single "Living in a Ghost Town", a new Rolling Stones song recorded in London and Los Angeles 
in 2019 and finished in isolation (part of the new material that the band were recording in the studio before the 
COVID-19 lockdown), a song that the band "thought would resonate through the times we're living in" and their first
original one since 2012. 
Document(id=3f0508cfeba59648a532b8205b75f8856b6db5baa4ddbc078e4879020484dbb9, content: 'Four months later, it was 
reported that Wyman would return for a song, more than 30 years after his ...', meta: {'title': 'The Rolling 
Stones', 'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 85, 
'split_idx_start': 70474}, score: 0.8226799875605302)
Four months later, it was reported that Wyman would return for a song, more than 30 years after his departure from 
the band. In August 2023, media outlets reported, based on an advertisement in a local UK newspaper, that a new 
Stones album might be released in September 2023. 

❌ the retrieved Documents seem irrelevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Document(id=ea3a026b7caae21787240aa4ba3b5989a66afdd5835eaf544ca0d4b15e628f93, content: 'It was later extended to go
from May to July 2018, adding fourteen new dates in the UK and Europe, m...', meta: {'title': 'The Rolling Stones',
'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 79, 
'split_idx_start': 66965}, score: 0.8347962830885718)
It was later extended to go from May to July 2018, adding fourteen new dates in the UK and Europe, making it the 
band's first UK tour since 2006. In November 2018, the Stones announced plans to bring the No Filter Tour to US 
stadiums in 2019, with 13 shows set to run from April to June. In March 2019, it was announced that Jagger would be
undergoing heart valve replacement surgery, forcing the band to postpone the 17-date North American leg of their No
Filter Tour. On 4 April 2019, it was announced that Jagger had completed his heart valve procedure in New York, was
recovering (in hospital) and could be released in the following few days. On 16 May, the Rolling Stones announced 
that the No Filter Tour would resume on 21 June with the 17 postponed dates rescheduled up to the end of August. In
March 2020, the No Filter Tour was postponed due to the COVID-19 pandemic.
The Rolling Stones—featuring Jagger, Richards, Watts, and Wood at their homes—were one of the headline acts on 
Global Citizen's One World: Together at Home on-line and on-screen concert on 18 April 2020, a global event 
featuring dozens of artists and comedians to support frontline healthcare workers and the World Health Organization
during the COVID-19 pandemic. On 23 April, Jagger announced through his Facebook page the release (the same day at 
5:00 pm BST) of the single "Living in a Ghost Town", a new Rolling Stones song recorded in London and Los Angeles 
in 2019 and finished in isolation (part of the new material that the band were recording in the studio before the 
COVID-19 lockdown), a song that the band "thought would resonate through the times we're living in" and their first
original one since 2012. 
Document(id=883fcd200df5f57d06cb97715779c19ecf3c3989e228fa1bb4110587dce80242, content: 'They became the 
highest-earning live act of 2021, surpassing Taylor Swift; since 2018 the two have t...', meta: {'title': 'The 
Rolling Stones', 'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 83, 
'split_idx_start': 69977}, score: 0.8192363429987537)
They became the highest-earning live act of 2021, surpassing Taylor Swift; since 2018 the two have traded the top 
two spots. The band began a new tour in 2022, with Jordan on drums.

Document(id=3f0508cfeba59648a532b8205b75f8856b6db5baa4ddbc078e4879020484dbb9, content: 'Four months later, it was 
reported that Wyman would return for a song, more than 30 years after his ...', meta: {'title': 'The Rolling 
Stones', 'url': 'https://en.wikipedia.org/wiki/The_Rolling_Stones', 'source_id': 
'f6fdabc7d1b1a48945296ccae41190adca558a73682abcd007ddc6aff4f50277', 'page_number': 1, 'split_id': 85, 
'split_idx_start': 70474}, score: 0.819044859805792)
Four months later, it was reported that Wyman would return for a song, more than 30 years after his departure from 
the band. In August 2023, media outlets reported, based on an advertisement in a local UK newspaper, that a new 
Stones album might be released in September 2023. 

✅ some Documents are relevant

⚠️ Notes of caution

  • This technique is not a silver bullet
  • It works well when the embedded metadata are meaningful and distinctive
  • I would say that the embedded metadata should be meaningful from the perspective of the embedding model. For example, I don’t expect embedding numbers to work well.