Questions and Answers
How do you see the likes of Vespa/Jina in this space? Are they viable candidates to base your first search system on, vs ES/Solr?
Sure, I think they are great tech. And the Lucene world is falling behind in a lot of information retrieval areas, such as doing good dense vector retrieval and thinking in terms of multiple reranking phases. These technologies are good rethinkings of the space (hopefully they push ES/Solr to be better too)
Which piece of tech are you most excited about in this space?
🤔 I still like Elasticsearch a lot. Elastic has invested a ton in scaling the product, and it also has forks, i.e. if something happened to Elastic, I feel comfortable that some other version of Elasticsearch (e.g. Open Distro) would take over. I also really like Elastic the company, the people there, the culture. Maybe cause I know them well as people 😂. So it would take a lot for me to switch. I'm also comfortable extending the engine in great depth to do what I need (such as the Elasticsearch Learning to Rank plugin). Some of the features of newer tech do find their way into community plugins, etc. for ES.
That said, it does take longer for cutting edge features to make it into the more mature product. And the JSON syntax can be very verbose if things reach a certain level of complexity. They are also obviously biased more towards analytics than search. I give them lots of feedback, and maybe sometimes they listen, lol.
Lately I have pondered how well a completely unsharded (but replicated) PyLucene cluster would work, as then I could manage other data structures on the side for different needs, and most of my colleagues are Pythonistas
I’ve always felt that companies with huge amount of traffic can have a big advantage when it comes to being able to provide a superior search experience. Do you agree?
As a follow up, how can we close that gap?
Certainly agree. It’s the secret sauce. I think there’s a famous phrase that with enough data you don’t need fancy models, you can do much more with simpler methods
Closing the gap? Probably the lowest hanging fruit is investing in crowdsourcing like from a firm like Supahands (http://supahands.com)
Q: Any plans for a second edition? Some Elasticsearch examples are a bit outdated already
They finally have a new Space Jam movie… hehe I won’t rule it out!
A second edition would be quite an undertaking, as I’ve learned so much since then. I’d want to incorporate so much more about relevance measurement, about dense vector retrieval, learning to rank, more machine learning… probably more query understanding types of things
I’m curious the contexts / types of organizations for which this book is relevant?
hmm I can’t think of any that aren’t? Maybe risk-averse orgs where ‘experimentation’ is not advisable, like regulatory data that just has to meet certain laws. Also when you’re using search just for log analytics
Hi Doug Turnbull and John,
Q) I know that search is like one of the fundamental components of the web. Apart from the common use case, which is searching for content on the web, are there any other use cases for search algorithms?
there are a ton, because (a) every field is an index, which is not common in other databases, and (b) search engines ship a rich package of text analytics.
For example I’ve used it to automate drug recovery, or to do “fuzzy joins” on different databases that have different variants of names or identifiers that can be normalized with text matching.
Also when you write flat data, but you have no idea how you’ll look it up later, and could want to look up on any attribute
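To make the “fuzzy join” use case above concrete, here is a minimal sketch (my own illustration, not from the original answer) that matches name variants across two record sets. Python’s stdlib `difflib` stands in for the tokenizing/normalizing text analysis a real search engine would do at scale; the `fuzzy_join` helper name is hypothetical.

```python
import difflib

def fuzzy_join(left, right, cutoff=0.8):
    """Match records across two databases whose name fields are slightly
    different variants. Lowercasing is the 'normalization' step; difflib's
    similarity ratio stands in for a search engine's fuzzy text matching."""
    norm = {r.lower(): r for r in right}  # normalized form -> original
    matches = {}
    for name in left:
        hit = difflib.get_close_matches(name.lower(), list(norm), n=1, cutoff=cutoff)
        if hit:
            matches[name] = norm[hit[0]]
    return matches
```

In practice you would index the right-hand table in the search engine and query it with each left-hand name, which scales far better than this pairwise scan.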
I have another question -
Q) What are the bottlenecks/improvement areas in the field of search?
bottlenecks: creating the index is time consuming. Generally a search engine is much more read scalable than write scalable
improvement areas: dense vector retrieval (like embeddings) is an active area of data structures research. Like the “Approximate nearest neighbors” topic
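For context on the “approximate nearest neighbors” topic: exact nearest-neighbor retrieval is a linear scan over every vector, and that O(N) cost is exactly what ANN structures (HNSW, IVF, etc.) trade accuracy to avoid. A minimal sketch of the exact baseline, purely illustrative (helper names are my own):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def brute_force_knn(query, docs, k=2):
    """Exact k-nearest-neighbors by cosine similarity: score every document,
    sort, take the top k. ANN indices approximate this in sub-linear time."""
    scored = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```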
Shameless plug, if you like my book, be sure to also check out AI Powered Search which I’ve contributed 2 chapters to so far (on Learning to Rank and gathering search training data from clicks)
and ping me if you’re curious about working with me at Shopify!
at what point would you decide it’s worthwhile to add your custom ML on top of elasticsearch?
Usually, when I can trust the training data. But there’s lots of forms of “ML”. There’s augmenting search with ML generated assets. And there’s Learning to Rank, ML based ranking. There’s also tools like dense vector indices that help you do embedding based retrieval. So you might have some enrichment activity that is ML driven making the docs more findable, but the ranking itself might be manual, for example
also, maintaining a large ES index can be a real pain– are there optimizations you can do around this? (and what sort of metrics would you use to decide a certain piece of information is too ‘old’ and can be moved off to a different storage?)
Ah interesting, hmm… I’ve seen some patterns of hot / cold indices. Often this happens in log analytics. In product search, usually merchants explicitly delete their old products.
maybe you are thinking like document search, something is old and stale? I am a big fan of trusting user’s behavior. Is there a way to know whether users find the document useful? Do they share it? Copy the link? Does the link show up on slack? Can you see if it’s still used?
yeah, I am sort of thinking of document search– I used to work for a newspaper publisher and we use ES for recommendation + various other stuff (and the dev in charge of it had a lot of headaches trying to find the correct balance of what to keep or not in the index)
What do you think about Solr vs elasticsearch? Are they almost the same or elasticsearch is a bit ahead?
Which one would you recommend for a new project and why?
or solr is a bit ahead 🙂
I usually think people worry too much about what their current search engine is, with some exceptions: stop worrying about Solr vs Elasticsearch
Solr == a bit buggier and weirder; I’ve rarely had a Solr project where I haven’t looked at the Solr source code
ES == more modern, more opinionated, not going to bend the project to your will as easily, but lets you do almost anything you need in a sane way
and they (elastic) also break things quite often
true, though they say “we are now breaking X”. And then I’m like “stop taking my toys away” 😂
my upgrade of Elastic (between minor versions!) hurt aggregations latency
but they managed to fix it :-)
well, “fix” == find a workaround
Okay, thanks! And what about managed elasticsearch from AWS? It seems to be the easiest option to get started
if you are not an AWS-certified person, this one is way easier to start with
That’s nice, thank you!
With tons of work/research going into vector search, what do you think of Learning to Rank? Is it still relevant? What are good use cases?
Depends if by Learning to Rank you mean “the problem of ranking solved by ML” -> then yes. In the end, we’re just training neural nets against the same loss functions.
If you mean “the traditional family of models” (LambdaMART, RankSVM, etc)… I think both are pretty well understood. In particular RankSVM is nice because it’s a set of linear weights, so easy to interpret/understand.
Plus vector search is only one part of a search solution, the inverted index is still valuable for many use cases. So all those features combined into a final ranking function is super valuable.
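On the interpretability point about RankSVM: it learns linear weights over pairwise feature differences, so the final score is a plain dot product and each weight reads as a feature’s contribution. A rough sketch of the standard pairwise transform (hypothetical helper names, not from any particular library):

```python
def pairwise_transform(judged):
    """Turn graded judgments into the pairwise examples RankSVM trains on:
    within each query, every (better, worse) document pair becomes a feature
    difference vector labeled +1, plus the flipped pair labeled -1."""
    X, y = [], []
    for docs in judged.values():  # docs: list of (grade, feature_vector)
        for gi, fi in docs:
            for gj, fj in docs:
                if gi > gj:
                    diff = [a - b for a, b in zip(fi, fj)]
                    X.append(diff)
                    y.append(1)
                    X.append([-d for d in diff])
                    y.append(-1)
    return X, y

def score(weights, features):
    """Linear ranking score: just a dot product, hence easy to interpret."""
    return sum(w * f for w, f in zip(weights, features))
```

A linear SVM fit on `(X, y)` yields the weight vector you then apply with `score` at query time.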
Thanks. I meant “the problem of ranking solved by ML”. But second part makes sense. Don’t change it if it works 😀
Apologies if this is too rookie, but I’m brand new to search experiences.
Where does someone start for improving search experience on a given platform?
What’s the place of AI and ML for improving search experience?
Are categorization & labeling considered part of improving search experience or are they completely different areas of expertise?
Best place to start:
- How will you define “good”, and can you get data that defines what ‘good’ is for a query? There are different strategies. One is to use expert human raters if, for example, your app is very domain specific. Another is to use clicks, etc, which involves its own complexities (see Ch 11 of AI Powered Search). Another is crowdsourcing… The strategy is very specific to the problem
- Then try to iterate and get better!
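Once you have judgments, you need a metric to “iterate and get better” against. A common choice is NDCG; here is a minimal sketch using the standard formula (function names are my own):

```python
import math

def dcg(grades):
    """Discounted cumulative gain: higher grades earlier count for more."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(ranked_grades, k=10):
    """NDCG@k: DCG of the actual result order divided by the DCG of the
    ideal (grade-sorted) order, so 1.0 means a perfectly ranked page."""
    ideal = sorted(ranked_grades, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_grades[:k]) / denom if denom else 0.0
```

Averaging this over a representative set of judged queries gives a single number to track as you tune relevance.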
Like all kinds of ML, it comes down to the underlying data: how good / clean the data is for the ML problem. At Shopify, all our merchants have vastly different ways they use our underlying data. So that’s a challenge for us: we can’t lean too hard on some parts of the content to really “learn” anything to help merchants.
Absolutely categorization / labeling are HUGE. We have a team member that just manages a team that does this.
Thanks, appreciate the answers
What is a good way to scale judgment collection or ranking feedback for enterprise search? Clicks do not mean the same thing as in online shopping scenarios. Is thumbs up / thumbs down on the search result a good way to collect feedback? I had to build a judgment list for one experiment and it is a slow and painful process
Enterprise search is tough in part because nobody in your company is really that incentivized to directly make themselves/their content findable. If anything, they have the opposite incentive: they want to not be bothered, lol. Contrast this with web search, where everyone is trying to SEO their content to get in front of as many eyeballs as possible
It’s also domain specific, making things tougher 😬
For a large org, with a heavily used app, you could use clicks for your head queries. BUT of course, as you get down the long tail, that gets tougher
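As an illustration of turning head-query clicks into judgments, here is a deliberately naive sketch (my own helper, not a recommended production approach): real systems correct for position bias with click models, while this just computes click-through rate and drops long-tail pairs without enough impressions.

```python
from collections import defaultdict

def judgments_from_clicks(sessions, min_impressions=50):
    """Naive CTR-based judgments for head queries. `sessions` is an
    iterable of (query, doc_id, clicked) tuples from search logs.
    Pairs with too few impressions are dropped, since long-tail CTR
    is too noisy to trust."""
    imps = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc, clicked in sessions:
        imps[(query, doc)] += 1
        clicks[(query, doc)] += int(clicked)
    return {
        key: clicks[key] / imps[key]
        for key in imps
        if imps[key] >= min_impressions
    }
```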
Definitely agree about incentives. It is a pain
yeah not my favorite domain…
I would probably gather feedback informally myself from qualitative feedback and turn it into judgments, rather than make users rate items
And likely give people a very simple button to complain about a whole page of results (with very few fields to fill out, if any). Then when they click, capture the query + results. And if you can, you can find that person, and get more info, and turn that into relevance judgments
Great. Thank you for the idea
Transformer-based search engines vs ES-based search engines: which do you choose when, and what types of challenges does each have? I ask because I think there are lots of BERT models used for search
I think in a way BERT-like search will take over for first-pass search that’s text heavy. However, in my experience, no technique ever really “dies” in search. So lots of classic relevance approaches that people use to explicitly manage and understand queries, like simple dictionaries/taxonomies managed by specialists, I think will continue. In part this is because of the often specialized nature of the search app. Another reason is the focus on precision: in many cases the explicit control of a traditional taxonomy-based approach, built on an index like ES, is preferred to get really fine-grained precision
Also, not all search is very text-centric, and the text might be unreliable/noisy, whereas BERT really shines on text passages
Hey Doug Turnbull, What is the difference between a search engine and a recommender system?
I have no idea 😆
Probably the difference is really how explicit and active the intent is from the user’s PoV, but implementation-wise they are just on one spectrum
Good answer 🙂
Doug Turnbull maybe Kim Falk knows? I’ve heard he’s answering questions in August here.
Do you know how it works? The query is “movie where a man wakes up on the same day” and it correctly identifies it as “Groundhog Day”
google just amazes sometimes 🙂. But it’s a long-form query, so likely in the realm of BERT/transformers/question answering kinds of solutions.
To me, I’m always first interested in what the loss function was to train such a thing, and the specific training task to predict missing terms in a piece of text by randomly masking them is quite interesting. Link.
Even before defining the loss function, I imagined they needed a lot of training data for that. Do you think just relying on clicks would be sufficient in this case?
Perhaps there’s a separate flow for movies and books
yes google has looooooots of training data (and lots of smart people working on the prob). At high scale, like web search, they probably can rely on clicks.
I doubt there’s a separate flow, I bet you can generalize these patterns regardless of topic…
How to evaluate search system?
How do you achieve microsecond search results on a BERT-based system with TBs of data?
I’m not sure you can achieve microsecond search results using BERT / TBs of data 🙂
I thought there might be magic way to achieve it 😄
How do you incorporate fuzzy matches within the search function? Search for relevance using the entire phrase and overlay with “# of partial hits based on keywords” within the phrase? Do you need to maintain a set of tags or keywords (based on topics, say) for each inventory item to match against the user’s ask? Is there a hierarchy of sorts among the various independent search tasks when combining to get the final result set?
I prefer to learn common misspellings by looking at query -> query refinements in log. Or by suggesting spellchecking and seeing what they click thru/what they don’t. Then you can build a dictionary of these common corrections
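A toy sketch of mining that correction dictionary from query -> refinement pairs in the logs (hypothetical function names, illustrating the idea only): keep a rewrite only when users have made it often enough to trust.

```python
from collections import Counter, defaultdict

def learn_corrections(refinements, min_count=3):
    """Build a correction dictionary from (original_query, refined_query)
    pairs mined from logs, e.g. a user typed one thing then immediately
    re-typed it. For each original, keep the most common rewrite, and
    only if it was observed at least `min_count` times."""
    counts = defaultdict(Counter)
    for original, rewrite in refinements:
        counts[original][rewrite] += 1
    return {
        q: rewrites.most_common(1)[0][0]
        for q, rewrites in counts.items()
        if rewrites.most_common(1)[0][1] >= min_count
    }
```

At query time the dictionary feeds a “did you mean” suggestion or a silent rewrite, and the click-through on those suggestions gives you further feedback.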
Doug Turnbull, I don’t mean misspellings but regular words where the match is partial, not a full or exact match. So if a user gives “auburn hair color with lavender perfume” … match to “auburn hair color” with any scent/perfume, “red hair color with fast drying feature”, etc.
Hi Doug Turnbull ,
- How do you handle:
a. Different languages? Example: Input text is in English but the documents to be queried over are in Spanish.
b. Phonetic variations?
c. Transliterations in search?
- How do you search:
d. Text input over media content (image, video, audio). Example: Input -> “Lots of explosion”; Results <Image of an explosions>, <Mission Impossible movie>, <Sound of dynamites exploding>;
e. Media versus media (search for similar images given an image, video given a video, audio given an audio clip).
Ah, these are all great and each very complicated questions. For (b) and (c), assuming there are no homonyms, a good starting point is a simple dictionary approach. For example, I’ve done that for British <–> US English variants
Everything else feels very much like a deep learning problem… You build a model transforming each modality to a vector space, and try to build an embedding that moves more similar content together and pushes dissimilar content apart, using whatever training data you have.
then with that embedding space, I’d use an ANN index (or hack one out of an inverted index if possible)
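The “pull similar together, push dissimilar apart” training objective described above is commonly implemented as a triplet (margin) loss; frameworks like PyTorch ship it as `TripletMarginLoss`. The bare arithmetic, as an illustrative sketch in pure Python:

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on embedding vectors: zero once the anchor is
    closer to the positive (similar) item than to the negative (dissimilar)
    item by at least `margin`; positive otherwise, driving training."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)
```

Training minimizes this over (anchor, positive, negative) triples drawn from your data, e.g. (query, clicked doc, skipped doc); the resulting vectors then go into the ANN index mentioned above.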
Thank you for the answers Doug 🙂