Questions and Answers
Hey, Doug Turnbull! Thanks for doing this!
The main thing I remember from my AI courses a couple decades ago is the A* algorithm. Has that been unseated as the “ultimate” search algorithm? Will it ever be?
oh ha! A* is about graph search, no? This book is about natural language search 🙂 Of which I doubt there ever will be an ultimate algorithm given how diverse the space and domains are
Hi all, I’m excited to be here. Hopefully I can drag fellow authors here as well!
Hi Doug Turnbull , good to have you here again 🙂
- What are the commonly used “dials and knobs” in a search engine to fine-tune its behaviour? Example: synonym groups to handle domain-level business jargon. Who usually controls these “dials and knobs”? Is it the data scientists, the business team, or someone else?
Hey WingCode -
- Basically anything that defines the structure of the index and how it’s queried is fair game. Maybe some major groupings:
◦ Stages of ranking, from first-pass retrieval to later reranking against different criteria / loss functions
◦ Synonyms, stemming, lemmatization, any kind of NLP between the content and the index (or between the query and querying the index)
◦ Any kind of statistic that might indicate quality (page rank, sales, clicks, etc)
◦ Really the limit is your imagination! - I find it’s best if the domain expert manages direct synonyms, but the relevance/data team has to decide exactly how they interface into the main algorithm (rough sketch below)
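To make the synonyms/analysis knob a bit more concrete, here is a minimal sketch of wiring a domain-expert-maintained synonym list into an Elasticsearch analysis chain. The index name, field, and synonym pairs are all made up, and it assumes a local Elasticsearch:

```python
import requests

# Hypothetical index settings: the synonym list itself comes from the domain
# expert; the relevance/data team decides where it sits in the analysis chain.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "domain_synonyms": {
                    "type": "synonym",
                    # made-up examples of domain jargon
                    "synonyms": ["laptop, notebook", "tv, television"],
                }
            },
            "analyzer": {
                "product_text": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "domain_synonyms"],
                }
            },
        }
    },
    "mappings": {"properties": {"title": {"type": "text", "analyzer": "product_text"}}},
}

# Create the index (assumes Elasticsearch is running locally)
requests.put("http://localhost:9200/products", json=settings)
```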
- What are the characteristics of your dream search engine? example: For me personally, it is not using any of the facets or “sort by” options. The search engine knows my favorite color is red and usually I look out for the cheapest product out there.
If by search engine you mean the underlying, programmable search index technology used to build a search solution, I want:
- A math-oriented, not text-match-oriented, API (see Vespa’s ranking steps)
- An ability to mix traditional sparse and dense vector indices for hybrid retrieval (sketch after this list)
- Doing all those things at high speed
- Declarative configuration, not programmatic configuration, so we can iterate on the search solution independent of the end application
- Built-in ability to execute arbitrary Python code at query and index time with the classic data science toolkit
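A toy sketch of the “math-oriented, hybrid” idea - just blending a lexical (BM25-style) score with a dense cosine similarity. The weighting and defaults here are invented; real engines (e.g. Vespa ranking expressions) let you declare this kind of math server-side:

```python
import numpy as np

def hybrid_score(bm25_score: float, query_vec: np.ndarray, doc_vec: np.ndarray,
                 alpha: float = 0.7) -> float:
    """Blend a sparse (lexical) score with a dense vector similarity.
    alpha is a made-up tuning knob you'd fit to your own data."""
    cosine = float(np.dot(query_vec, doc_vec) /
                   (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
    return alpha * bm25_score + (1 - alpha) * cosine
```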
Since you mentioned Vespa, I’m curious if you would advise picking it over ES as a base search stack? 🙂
Probably yes these days, but I usually don’t recommend people go through extensive search rewrites just for the sake of the underlying index…
Thank you Doug for the detailed answers
When is a good time to start thinking about investing in LTR capabilities in your search stack?
I think you should always think about it, because the limiting factor is training data, and you would want good training data for a non-LTR solution anyway. Once you figure out the training data side the LTR optimization becomes “easy”
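Roughly, that training data boils down to graded (query, document) pairs plus whatever features the ranker will learn from - a judgment list. A tiny sketch with invented values:

```python
# Each row: query, candidate doc, relevance grade (from clicks or human
# judges), and ranking features. All values are made up for illustration.
training_rows = [
    {"query": "red shoes", "doc_id": "p1", "grade": 3,
     "features": {"bm25_title": 8.2, "price": 49.0, "ctr": 0.12}},
    {"query": "red shoes", "doc_id": "p7", "grade": 0,
     "features": {"bm25_title": 2.1, "price": 19.0, "ctr": 0.01}},
    {"query": "usb cable", "doc_id": "p3", "grade": 2,
     "features": {"bm25_title": 5.5, "price": 9.0, "ctr": 0.08}},
]
```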
Great, does the book cover the event analytics side of gathering relevant training data?
Yes! Very much so
Awesome
What are the low-hanging fruits in AI-powered search - stuff with the highest ROI in a short amount of time?
Anything around query understanding. Can you classify queries into categories? Types of intents, etc based on simple click statistics?
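As a rough sketch of the click-statistics idea (hypothetical log schema): if most clicks for a query land in one category, that category is a reasonable guess at the query’s intent.

```python
from collections import Counter, defaultdict

def classify_queries(click_log, min_share=0.6):
    """click_log: iterable of (query, clicked_category) pairs.
    Returns {query: category} for queries where one category dominates."""
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1
    labels = {}
    for query, cats in counts.items():
        category, n = cats.most_common(1)[0]
        if n / sum(cats.values()) >= min_share:
            labels[query] = category
    return labels

# e.g. classify_queries([("iphone case", "accessories"),
#                        ("iphone case", "accessories"),
#                        ("iphone case", "phones")])
```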
Thanks! I’m looking forward to this part in the book. I have struggled with query understanding because of lack of click data. Simple methods like string matching lead to a lot of weird edge cases
That and spelling correction. I have tried the standard stuff like edit distance and metaphone based algos, but they still fall short of expectations
On a similar note, what are the things which will take a long time to give results, but in the end will be worth all the effort?
This question is really hard as it’s so domain-specific. IMO you really gotta work to spike ideas with an experiment before digging too deep. I’ve seen teams invest really heavily upfront, but not see payoff at the end. That’s a big thing to avoid.
Quite a few AI capabilities in search require a decent amount of data. How do you deal with this if you are starting from scratch?
Data as in clickstream data?
Well earlier I mentioned query understanding as an obvious win. But this classification can also be done through manual labelers (given a sufficiently well formulated task). Of course it breaks down as queries grow in complexity or domain specificity, but that’s a good start.
Ah yes, does the book cover the proper ways to approach manual labelling? That space also seems to have exploded (so many tools!)
Sadly we don’t cover this that much, but it’s a great topic!
Hey Doug Turnbull, any thoughts on assessing and tackling position bias through empirical methods like Randpair vs theoretical methods built into LTR models?
We found that the position biases calculated using methods like EM didn’t line up with what we found with Randpair - so wanted to know what’s more typical in industry. Thanks!
I haven’t messed with these, interesting!
Usually my approach is to debias the training data itself using a click model, so I’m not overly coupled to the LTR model itself and these sorts of assumptions…
Because then you can take that training data and study it independent of any model, and decide whether it reflects reality
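A minimal sketch of that debiasing step, using a simple position-based propensity and inverse-propensity weighting. The propensity numbers here are invented; in practice you would estimate them (e.g. via EM or result randomization):

```python
# Estimated probability a user even examined each position (made-up numbers).
examine_prob = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}

def debiased_label(clicked: bool, position: int) -> float:
    """Weight each click by 1 / P(examined at this position), so clicks deep
    in the results count for more than clicks at rank 1."""
    if not clicked:
        return 0.0
    return 1.0 / examine_prob.get(position, 0.1)
```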
In practice, search relevancy can be highly subjective, which makes results hard to evaluate and optimise for actual users. Do you think that “AI-powered search” is affected by this more/equally/less than “traditional” (keyword-based) search?
yes it can be highly subjective. I think it means that you have to know, given a keyword, the many possible intents it could have and instead of ranking to one of those intents, you give them a mixture of them. Then as your confidence grows in their intent (through personalization? or just better knowledge about queries), you zero-in on one of the intents…
I wrote an article about that! https://opensourceconnections.com/blog/2019/09/05/diversity-vs-relevance/
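One crude way to serve a mixture of intents is to interleave per-intent ranked lists - a sketch for illustration, not a method from the article:

```python
from itertools import zip_longest

def blend_intents(*ranked_lists):
    """Round-robin interleave results that were ranked separately per intent,
    dropping duplicates. A crude stand-in for real diversification."""
    seen, blended = set(), []
    for group in zip_longest(*ranked_lists):
        for doc in group:
            if doc is not None and doc not in seen:
                seen.add(doc)
                blended.append(doc)
    return blended

# e.g. blend_intents(["bank-finance-1", "bank-finance-2"], ["bank-river-1"])
```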
Speaking of labels, what do you think about clickstream data vs manual assessment (via platforms like mturk and similar)? What are the pros and cons for each? And how can we combine the two?
Clickstream data will be like implicit data, where users tell you what they wouldn’t say out loud - whether because they don’t want to say it out loud, or because they’re not conscious of it!
Mechanical Turk is great to overcome cold-start problems. But it might not reflect the reality of your use case/app. Especially if your app is domain-specific.
“Combining” is tough, rather I use them as different perspectives on the problem.
How do we know it’s time to add ML to our search pipeline?
And finally, what’s the easiest way of adding ML to our search pipeline? Let’s say we already have search on our platform, and use Solr or Elasticsearch for indexing all the documents.
This is really tough, and depends so much on the org. Some teams become Lucene experts and get into the nitty gritty of modifying the guts of the search engine. These teams tend to be very engineering heavy and don’t mind doing this kind of work. This solution is nice because it turns your search engine into a “one stop shop”, without needing extra services, to solve your search problems.
But of course, if you’re more data scientist heavy, you’d prefer to work in Python as much as possible 🙂 Such teams tend to build search services that front the search system. The nice thing about this is it’s not a single point of failure; you can fail over to Solr or Elasticsearch if the service becomes unavailable.
Of course, my dream solution would be a search service that lets me host and run the Python stack as the query side, exposing the underlying data structures to my Python code. Or something that lets me deploy TensorFlow or other models into the engine. The closest things out there are Vespa or Jina from what I’ve seen…
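A bare-bones sketch of that “service in front of the engine” pattern, with a plain-engine fallback when the ML rerank path fails. The URL, index, and rerank function are all hypothetical:

```python
import requests

ES_URL = "http://localhost:9200/products/_search"  # hypothetical index

def rerank_with_model(query_text, hits):
    # Placeholder for the real ML rerank (LTR model, embedding similarity, ...)
    return hits

def search(query_text: str, size: int = 10):
    body = {"query": {"match": {"title": query_text}}, "size": 50}
    hits = requests.post(ES_URL, json=body, timeout=2).json()["hits"]["hits"]
    try:
        hits = rerank_with_model(query_text, hits)
    except Exception:
        pass  # fall back to the engine's own ranking
    return hits[:size]
```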
I’ll ask a question! What does your search, discovery, or recommendations platform look like at a high level? What do you like, dislike about it? (For example, do you just use Elasticsearch, or do you use Vespa fronted by a Java service? etc etc)
Is it a question for you Doug Turnbull or for us? 😆
Oh for all of you sorry 😛
Maybe Cristian Javier Martinez can say a few words about OLX =)
The one I’m working on right now is pretty simple at a high level. LB-> Gateway -> Reco/Search Backend -> ES/Redis
The backend is written in Go. Using the official ES client lib to connect to it. Redis is used for caching user historical features / recently watched items, which we then use for reco.
Don’t use any vector search engine right now. Content size is small enough that we got away with an in-memory ANN index which gets built on service startup (this is used to serve reco).
What I don’t like - quality of the query understanding + spelling correction components, lack of good-quality labelled data to build good classifiers, ES’s JSON-based DSL is a pain, want to eventually decouple ANN from this system.
Doug Turnbull thanks for doing this!
I’m curious to understand how the best teams measure and understand performance of their search systems on an ongoing basis. What are the dashboards and alerts they’ll set up, and how do they use those to make incremental improvements to their search models?
Good question! It takes a lot of effort on different fronts and a deep appreciation of the pros and cons of different metrics:
- Human ratings, including judgments (evaluation of relevance of each result) and whole SERP evaluation (how ‘good’ does this search results page look)
- General search conversion rates over time (though these can be influenced by factors like checkout or product page design)
- Search CTR (understanding this is a combination of relevance, UX, perf; rough sketch of computing it after this list). Another flaw here is users won’t click if they get their answer from the Search UX itself
- Roundtrip latency to the user and other performance metrics, like p90 latency, etc. Super critical and highly correlated with performance
- Best and worst performing queries -> a great product person can analyze to see what patterns you do well / do poorly with
- Typeahead success: after users clickthrough to a typeahead query suggestion, do they take a follow on action or was the click in the end perhaps not so great
- Content performance: what content does well / poorly in the search system? Are there areas where the content itself needs to be tuned to be more findable
Probably a ton of others, but the really great teams work really, really hard here. It takes a lot of great data work to make use of these metrics!
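For the CTR metric above, a minimal sketch of the aggregation over a hypothetical clickstream log:

```python
from collections import defaultdict

def query_ctr(events, min_searches=50):
    """events: iterable of (query, event_type) with event_type in
    {"search", "click"}. Returns CTR per query with enough volume."""
    searches, clicks = defaultdict(int), defaultdict(int)
    for query, event_type in events:
        if event_type == "search":
            searches[query] += 1
        elif event_type == "click":
            clicks[query] += 1
    return {q: clicks[q] / n for q, n in searches.items() if n >= min_searches}
```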
This is super interesting. Quick follow-up: how do teams quantify best and worst performing queries?
Hi Doug Turnbull, interesting book! I have some basic questions for you.
- Could you please explain what a search engine actually is?
omg I have no idea. Some people use the term to mean just the technology that serves results from sparse (classic inverted) or dense vector (approximate nearest neighbor) indices. This is a piece of infrastructure
Other people use the term to refer to a full search solution that solves a specific problem. In the latter case, lots of pieces of infra can be involved - not just the index-serving part, but also query understanding and reranking layers.
- What kind of ML models are usually applied to search problems?
Classically, GBDTs (Gradient Boosted Decision Trees) have worked well and fast for ranking, as these play well with existing technologies. But increasingly, the problem can be distilled to a similarity function modeled in a nearest-neighbor index. This similarity might be the result of an embedding generated from a deep learning or other model.
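A sketch of the GBDT side using LightGBM’s ranking objective - the feature values and grades are invented, and the tiny dataset is only there to show the shape of the API:

```python
import numpy as np
import lightgbm as lgb  # assumes LightGBM is installed

# One row per (query, doc) pair: [bm25_title, ctr] - made-up features.
X = np.array([[8.2, 0.12], [2.1, 0.01], [5.5, 0.08], [1.0, 0.02]])
y = np.array([3, 0, 2, 0])   # relevance grades
group = [2, 2]               # first 2 rows belong to query 1, next 2 to query 2

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)
scores = ranker.predict(X)   # higher score = rank higher
```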
- What was the most exciting AI-powered search problem you have been working on?
Aside from Shopify, I got to work on the Elasticsearch Learning to Rank plugin that helps power Wikipedia search. Exciting from a tech, data, and impact perspective 🙂 Also very high scale!
cool! Thank you for your answers 🙂
Doug Turnbull any thoughts on the right way to combine filtering and nearest neighbor search? do we do the former first and then the latter or the other way around? is there a way to do both reliably?
I actually don’t know 😅
With a faster NN index, I’d like to do the NN part first as it helps improve recall. Then filter out candidates. But I think combining the two is still an art, and very much an open area of research
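The “NN first, then filter” approach usually means over-fetching from the ANN index and post-filtering; a sketch with a made-up oversampling factor and a hypothetical ann_index.search API:

```python
def filtered_knn(ann_index, query_vec, predicate, k=10, oversample=5):
    """Post-filtering: pull more neighbors than needed from the ANN index,
    drop the ones failing the filter, keep the top k survivors. A very
    selective filter needs a bigger oversample (or pre-filtering instead)."""
    candidates = ann_index.search(query_vec, k * oversample)  # hypothetical API
    return [doc for doc in candidates if predicate(doc)][:k]
```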