DataTalks.Club

Relevant Search

by Doug Turnbull, John Berryman

The book of the week from 12 Jul 2021 to 16 Jul 2021

Relevant Search demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines.

Questions and Answers

xnot

How do you see the likes of Vespa/Jina in this space? Are they viable candidates to base your first search system on, versus ES/Solr?

Doug Turnbull

Sure, I think they are great tech. The Lucene world is falling behind in a lot of information retrieval areas, such as good dense vector retrieval and thinking in terms of multiple reranking phases. These technologies are good rethinkings of the space (and will hopefully push ES/Solr to be better too)

xnot

Which piece of tech are you most excited about in this space?

Doug Turnbull

🤔 I still like Elasticsearch a lot. Elastic has invested a ton in scaling the product, and it also has forks, i.e. if something happened to Elastic, I feel comfortable that some other version of Elasticsearch (such as Open Distro) would take over. I also really like Elastic the company, the people there, the culture. Maybe because I know them well as people 😂. So it would take a lot for me to switch. I am also comfortable extending the engine in great depth to do what I need (such as with the Elasticsearch Learning to Rank plugin). Some of the features of newer tech do find their way into community plugins, etc. for ES.
That said, it does take longer for cutting edge features to make it into the more mature product. And the JSON syntax can be very verbose if things reach a certain level of complexity. They are also obviously biased more towards analytics than search. I give them lots of feedback, and maybe sometimes they listen, lol.

Doug Turnbull

Lately I have pondered how well a completely unsharded (but replicated) PyLucene cluster would work, as then I could manage other data structures on the side for different needs, and most of my colleagues are Pythonistas

xnot

I've always felt that companies with a huge amount of traffic can have a big advantage when it comes to being able to provide a superior search experience. Do you agree?

xnot

As a follow up, how can we close that gap?

Doug Turnbull

Certainly agree. It's the secret sauce. I think there's a famous phrase that with enough data you don't need fancy models, you can do much more with simpler methods

Doug Turnbull

Closing the gap? Probably the lowest hanging fruit is investing in crowdsourcing, for example from a firm like Supahands (http://supahands.com)

Andrew Kornilov

Q: Any plans for the second edition? Some Elasticsearch examples are already a bit outdated

Doug Turnbull

They finally have a new Space Jam movie... hehe I won't rule it out!
A second edition would be quite an undertaking, as I've learned so much since then. I'd want to incorporate so much more about relevance measurement, dense vector retrieval, learning to rank, more machine learning... probably more query understanding types of things

David Cox

I'm curious about the contexts / types of organizations for which this book is relevant?

Doug Turnbull

Hmm, I can't think of any that aren't? Maybe risk-averse orgs where 'experimentation' is not advisable, like regulatory data that just has to meet certain laws. Also when you're using search for just log analytics

Kshitiz

Hi Doug Turnbull and John,
Q) I know that search is one of the fundamental components of the web. Apart from the common use case, which is searching for content on the web, are there any other use cases for search algorithms?

Doug Turnbull

there are a ton, because (a) every field is an index which is not common in other databases and (b) the rich package of text analytics.
For example, I've used it to automate drug recovery, or to do "fuzzy joins" on different databases that have different variants of names or identifiers that can be normalized with text matching.
Also when you write flat data, but you have no idea how you'll look it up later, and could want to look it up on any attribute
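
To make the "fuzzy join" idea concrete, here is a minimal sketch in plain Python. It assumes the normalization is done by a hand-written function rather than a search engine analyzer; the `noise` word list and the sample rows are hypothetical and only for illustration.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Reduce a name to a comparable key: ASCII, lowercase, sorted tokens."""
    # Strip accents, e.g. "é" -> "e"
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase and keep only alphanumeric tokens
    tokens = re.findall(r"[a-z0-9]+", ascii_name.lower())
    # Hypothetical noise words; sorting makes word order irrelevant
    noise = {"inc", "ltd", "llc", "corp", "co"}
    return " ".join(sorted(t for t in tokens if t not in noise))


def fuzzy_join(left, right):
    """Join two lists of (id, name) rows on the normalized name."""
    index = {}
    for right_id, right_name in right:
        index.setdefault(normalize_name(right_name), []).append(right_id)
    return [
        (left_id, right_id)
        for left_id, left_name in left
        for right_id in index.get(normalize_name(left_name), [])
    ]


# Made-up rows from two systems that spell the same names differently
crm = [(1, "Acme, Inc."), (2, "Café Noir Ltd")]
billing = [("a", "ACME inc"), ("b", "Cafe Noir"), ("c", "Globex Corp")]
matches = fuzzy_join(crm, billing)
print(matches)
```

In a search engine the same normalization would typically live in the analysis chain of the indexed field, so the join becomes an ordinary query.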

Kshitiz

I have another question -
Q) What are the bottlenecks/improvement areas in the field of search?

Doug Turnbull

bottlenecks: creating the index is time consuming. Generally a search engine is much more read scalable than write scalable

Doug Turnbull

improvement areas: dense vector retrieval (like embeddings) is an active area of data structures research, like the "approximate nearest neighbors" topic

Doug Turnbull

Shameless plug: if you like my book, be sure to also check out AI Powered Search, which I've contributed 2 chapters to so far (on Learning to Rank and gathering search training data from clicks)

Doug Turnbull

and ping me if you're curious about working with me at Shopify!

Wendy Mak

at what point would you decide it's worthwhile to add your custom ML on top of Elasticsearch?

Doug Turnbull

Usually, when I can trust the training data. But there are lots of forms of "ML". There's augmenting search with ML-generated assets. There's Learning to Rank, i.e. ML-based ranking. There are also tools like dense vector indices that help you do embedding-based retrieval. So you might have some ML-driven enrichment activity making the docs more findable, while the ranking itself is manual, for example

Wendy Mak

also, maintaining a large ES index can be a real pain. Are there optimizations you can do around this? (And what sort of metrics would you use to decide a certain piece of information is too 'old' and can be moved off to a different storage?)

Doug Turnbull

Ah interesting, hmm... I've seen some patterns of hot / cold indices. Often this happens in log analytics. In product search, usually merchants explicitly delete their old products.

Doug Turnbull

maybe you are thinking of something like document search, where something is old and stale? I am a big fan of trusting users' behavior. Is there a way to know whether users find the document useful? Do they share it? Copy the link? Does the link show up on Slack? Can you see if it's still used?

Wendy Mak

yeah, I am sort of thinking of document search. I used to work for a newspaper publisher and we used ES for recommendation + various other stuff (and the dev in charge of it had a lot of headaches trying to find the correct balance of what to keep or not in the index)

Alexey Grigorev

What do you think about Solr vs elasticsearch? Are they almost the same or elasticsearch is a bit ahead?

Alexey Grigorev

Which one would you recommend for a new project and why?

Andrew Kornilov

or solr is a bit ahead 🙂

Alexey Grigorev

Indeed!

Doug Turnbull

I usually think people worry too hard about what their current search engine is, with some exceptions - stop worrying about Solr vs Elasticsearch

Doug Turnbull

Solr == a bit buggier, weirder, rarely had a Solr project where I haven't looked at Solr source code
ES == more modern, more opinionated, not going to bend the project to your will as easily, but lets you do almost anything you need in a sane way

Andrew Kornilov

and they (elastic) also break things quite often

Doug Turnbull

true, though they say "we are now breaking X". And then I'm like "stop taking my toys away" 😂

Andrew Kornilov

my upgrade of Elastic (between minor versions!) - aggregations latency

Andrew Kornilov

but they managed to fix :-)

Andrew Kornilov

well, "fix" == find a workaround

Doug Turnbull

ouch

Alexey Grigorev

Okay, thanks! And what about managed elasticsearch from AWS? It seems to be the easiest option to get started

Andrew Kornilov

if you are not an AWS-certified guy, this one is way easier to start with

Alexey Grigorev

That's nice, thank you!

Dmitriy Shvadskiy

With tons of work/research going into vector search, what do you think of Learning to Rank? Is it still relevant? What are good use cases?

Doug Turnbull

Depends. If by Learning to Rank you mean "the problem of ranking solved by ML", then yes. In the end, we're just training neural nets against the same loss functions.
If you mean "the traditional family of models" (LambdaMART, RankSVM, etc.)... I think both are pretty well understood. In particular, RankSVM is nice because it's a set of linear weights, so easy to interpret/understand.
Plus, vector search is only one part of a search solution; the inverted index is still valuable for many use cases. So all those features combined into a final ranking function is super valuable.
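
As a rough illustration of "a set of linear weights" combining several signals into one ranking function, here is a minimal sketch. The feature names, the weights, and the candidate scores are all hypothetical; in a RankSVM-style setup the weights would be learned from relevance judgments rather than written by hand.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    bm25: float        # lexical score from the inverted index
    vector_sim: float  # similarity from vector retrieval
    popularity: float  # hypothetical business signal in [0, 1]


# Hypothetical weights; a linear learning-to-rank model would learn these.
WEIGHTS = {"bm25": 0.6, "vector_sim": 1.5, "popularity": 0.3}


def score(c: Candidate) -> float:
    """Weighted sum of features: each weight reads as that signal's importance."""
    return (
        WEIGHTS["bm25"] * c.bm25
        + WEIGHTS["vector_sim"] * c.vector_sim
        + WEIGHTS["popularity"] * c.popularity
    )


def rerank(candidates):
    """Order candidates by the combined score, best first."""
    return sorted(candidates, key=score, reverse=True)


# Made-up candidates, e.g. the union of lexical and vector retrieval results
candidates = [
    Candidate("a", bm25=2.0, vector_sim=0.2, popularity=0.1),
    Candidate("b", bm25=1.0, vector_sim=0.9, popularity=0.5),
    Candidate("c", bm25=0.5, vector_sim=0.4, popularity=0.9),
]
ranked = [c.doc_id for c in rerank(candidates)]
print(ranked)
```

Because the model is linear, inspecting `WEIGHTS` directly shows how much each signal contributes, which is the interpretability benefit mentioned above.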

Dmitriy Shvadskiy

Thanks. I meant "the problem of ranking solved by ML". But the second part makes sense. Don't change it if it works 😀

Bayram Kapti

Apologies if this is too rookie, but I'm brand new to search experiences.
Where does someone start for improving search experience on a given platform?
What's the place of AI and ML for improving search experience?
Are categorization & labeling considered part of improving search experience or are they completely different areas of expertise?

Doug Turnbull

No worries!
Best place to start:

  • How will you define "good", and can you get data that defines what 'good' is for a query? There are different strategies. One is to use expert human raters if, for example, your app is very domain specific. Another is to use clicks, etc., which involves its own complexities (see Ch 11 of AI Powered Search). Another is crowdsourcing... The strategy is very specific to the problem
  • Then try to iterate and get better!

Like all kinds of ML, it comes down to the underlying data: how good / clean is the data for an ML problem? At Shopify, all our merchants have vastly different ways they use our underlying data. So that's a challenge for us that would lead us to lean too hard on some parts of the content to really "learn" anything to help merchants.

Absolutely, categorization / labeling are HUGE. We have a team member that just manages a team that does this.

Bayram Kapti

Thanks, appreciate the answers

Dmitriy Shvadskiy

What is a good way to scale judgment collection or ranking feedback for enterprise search? Clicks do not mean the same thing as in online shopping scenarios. Is thumbs up/thumbs down on the search result a good way to collect feedback? I had to build a judgment list for 1 experiment and it is a slow and painful process

Doug Turnbull

Enterprise search is tough in part because nobody at your company is really that directly incentivized to make themselves/their content findable. If anything, they have the opposite incentive. They want to not be bothered, lol. Contrast this with web search, where everyone is trying to SEO their content to get in front of as many eyeballs as possible

Doug Turnbull

It's also domain specific, making things tougher 😬

Doug Turnbull

For a large org, with a heavily used app, you could use clicks for your head queries. BUT of course, as you get down the long tail, that gets tougher

Dmitriy Shvadskiy

Definitely agree about incentives. It is a pain

Doug Turnbull

yeah not my favorite domain...
I would probably gather feedback informally myself from qualitative feedback and turn it into judgments, rather than make users rate items

Doug Turnbull

And likely give people a very simple button to complain about a whole page of results (with very few fields to fill out, if any). Then when they click, capture the query + results. And if you can, you can find that person, and get more info, and turn that into relevance judgments
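
One possible shape for that complaint button, sketched under assumptions: the record fields, the `capture_complaint` and `to_judgments` helpers, and the 0/1 grading scheme are all hypothetical, not something described in the discussion.

```python
import json
import time


def capture_complaint(query, result_ids, user_id=None, comment=""):
    """Build one feedback record for a 'these results are bad' button click."""
    return {
        "timestamp": time.time(),
        "query": query,
        "results": list(result_ids),  # the exact page the user saw, in order
        "user_id": user_id,           # lets you follow up with the person
        "comment": comment,           # optional, so the form stays nearly empty
    }


def to_judgments(record, bad_ids, grade_bad=0, grade_ok=1):
    """After a follow-up, turn a record into (query, doc, grade) rows.

    Assumption: anything the person did not flag is graded as acceptable.
    """
    return [
        (record["query"], doc_id, grade_bad if doc_id in bad_ids else grade_ok)
        for doc_id in record["results"]
    ]


# Made-up example: one complaint, then judgments after talking to the user
record = capture_complaint("vacation policy", ["doc7", "doc2", "doc9"], user_id="u42")
line = json.dumps(record)  # one JSON line per complaint, ready to append to a log
judgments = to_judgments(record, bad_ids={"doc7", "doc9"})
print(judgments)
```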

Dmitriy Shvadskiy

Great. Thank you for the idea

Doink

Transformer-based search engines vs ES-based search engines: which to choose when, and what types of challenges does each have? I ask because I think there are lots of BERT models used for search

Doug Turnbull

I think in a way BERT-like search will take over for first-pass search that's text heavy. However, in my experience, no technique ever really "dies" in search. So lots of classic relevance approaches that people use to explicitly manage and understand queries, like simple dictionaries/taxonomies managed by specialists, will continue, I think. In part this is because of the often specialized nature of the search app. Another reason is the focus on precision: in many cases, the kind of explicit control that a traditional taxonomy-based approach built on an index like ES provides is preferred for really fine-grained precision

Doug Turnbull

Also not all search is very text-centric, the text might be unreliable/noisy, and BERT seems ideal for text passages

Kim Falk

Hey Doug Turnbull, What is the difference between a search engine and a recommender system?

Doug Turnbull

I have no idea 😆

Doug Turnbull

Probably the difference is really how explicit and active the user's intent is from the user's PoV, but implementation-wise they are just on one spectrum

Kim Falk

Good answer 🙂

Alexey Grigorev

Doug Turnbull maybe Kim Falk knows? I've heard he's answering questions in August here.

Alexey Grigorev

Do you know how it works? The query is "movie where a man wakes up on the same day" and it correctly identifies it as "Groundhog Day"

Doug Turnbull

Google just amazes sometimes 🙂. But it's a long-form query, and likely in the realm of BERT/transformers/question answering kinds of solutions.

Doug Turnbull

To me, I'm always first interested in what the loss function was to train such a thing, and the specific training task, predicting missing terms in a piece of text by randomly masking them, is quite interesting. Link.

Alexey Grigorev

Even before defining the loss function, I imagined they needed a lot of training data for that. Do you think just relying on clicks would be sufficient in this case?

Alexey Grigorev

Perhaps there's a separate flow for movies and books

Doug Turnbull

yes Google has looooooots of training data (and lots of smart people working on the prob). At high scale, like web search, they probably can rely on clicks.
I doubt there's a separate flow, I bet you can generalize these patterns regardless of topic...

Lalit Pagaria

How to evaluate a search system?
How to achieve microsecond search results on a BERT-based system with TBs of data?

Doug Turnbull

I'm not sure you can achieve microsecond search results using BERT / TBs of data 🙂

Doug Turnbull

Link

Doug Turnbull

On evaluation

Lalit Pagaria

I thought there might be a magic way to achieve it 😄

Shankar Somayajula

Doug Turnbull
How do you incorporate fuzzy matches within the search function? Search for relevance using the entire phrase and overlay with "# of partial hits based on keywords" within the phrase? Do you need to maintain a set of tags or keywords (based on topics, say) for each inventory item to match against the user ask? Is there a hierarchy of sorts within the various independent search tasks while combining them to get the final result set?

Doug Turnbull

I prefer to learn common misspellings by looking at query -> query refinements in logs. Or by suggesting spellchecking and seeing what they click through / what they don't. Then you can build a dictionary of these common corrections
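
A minimal sketch of mining query -> query refinements for a correction dictionary. It assumes the log has already been grouped into per-user sessions of consecutive queries, and that a small edit distance between consecutive queries signals a spelling fix; the thresholds `min_count` and `max_distance` and the sample sessions are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mine_corrections(sessions, min_count=2, max_distance=2):
    """sessions: list of query lists, each in the order the user typed them."""
    pair_counts = Counter()
    for queries in sessions:
        # Look at each query and the one the user typed right after it
        for first, second in zip(queries, queries[1:]):
            if first != second and edit_distance(first, second) <= max_distance:
                pair_counts[(first, second)] += 1
    corrections = {}
    # Most frequent refinement wins for each original query
    for (first, second), count in pair_counts.most_common():
        if count >= min_count and first not in corrections:
            corrections[first] = second
    return corrections


# Made-up sessions
sessions = [
    ["iphnoe case", "iphone case"],
    ["iphnoe case", "iphone case", "iphone 12 case"],
    ["red shoes", "blue shoes"],
    ["sneekers", "sneakers"],
]
corrections = mine_corrections(sessions)
print(corrections)
```

Here "sneekers" is not learned because it was only seen once, and "red shoes" -> "blue shoes" is ignored because the edit is too large to be a typo.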

Shankar Somayajula

Doug Turnbull I don't mean misspellings, but regular words where the match is partial, not a full or exact match. So if the user gives "auburn hair color with lavender perfume"... match to "auburn hair color" with any scent/perfume, "red hair color with fast drying feature", etc.

WingCode

Hi Doug Turnbull,

  1. How do you handle:
    a. Different languages? Example: input text is in English but the documents to be queried are in Spanish.
    b. Phonetic variations?
    c. Transliterations in search?
  2. How do you search:
    d. Text input over media content (image, video, audio)? Example: Input -> "Lots of explosions"; Results -> <Image of an explosion>, <Mission Impossible movie>, <Sound of dynamite exploding>;
    e. Media versus media (search for similar images given an image, video given a video, audio given an audio)?

Doug Turnbull

Ah, these are all great and each very complicated questions. For 1b & 1c, assuming there are no homonyms, a good starting point is a simple dictionary approach. For example, I've done that for British <-> US English variants
Everything else feels very much like a deep learning problem... You build a model transforming each to a vector space, and try to build an embedding that moves more similar content together and pushes dissimilar content apart, using whatever training data you have.

Doug Turnbull

then with that embedding space, I'd use an ANN index (or hack one out of an inverted index if possible)
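
To show what the lookup in a shared embedding space does, here is a small sketch. It uses exact brute-force cosine similarity as a stand-in for an ANN index (an ANN index approximates the same result faster), and the three-dimensional vectors are made up; real embeddings would come from a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec, index, k=2):
    """Exact nearest neighbors by scanning every item (no ANN structure)."""
    scored = sorted(
        ((cosine(query_vec, vec), key) for key, vec in index.items()), reverse=True
    )
    return [key for _, key in scored[:k]]


# Hypothetical embeddings: text, image, video, and audio share one space
index = {
    "image:explosion.jpg": [0.9, 0.1, 0.0],
    "movie:mission-impossible": [0.8, 0.3, 0.1],
    "audio:birdsong.wav": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded text "lots of explosions"
top = nearest(query_vec, index, k=2)
print(top)
```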

WingCode

Thank you for the answers Doug 🙂

To take part in the book of the week event:

  • Register in our Slack
  • Join the #book-of-the-week channel
  • Ask as many questions as you'd like
  • The book authors answer questions from Monday till Thursday
  • On Friday, the authors decide who wins free copies of their book
