TransWikia.com

How to build a textual search engine?

Data Science Asked on August 26, 2020

I have an HTML string and want to find out whether a word I supply is relevant to that string.

Relevancy could be measured based on frequency in the text.

An example to illustrate my problem:

this is an awesome bike store
bikes can be purchased online.
the bikes we own rock.
check out our bike store now

Now I want to test a few other words:

bike repairs
dog poo

"bike repairs" should be marked as relevant, whereas "dog poo" should not.

Questions:

  • How could this be done?
  • How do I filter out ambiguous words like "in" or "or"?

Thanks for your ideas!

I guess it’s something Google does to figure out what keywords are relevant to a website. I am basically trying to reproduce their on-page rankings.

2 Answers

  • pre-process your documents (some of the steps may be skipped)
  • use a Vector Space model to represent documents (you may use TF, TF-IDF or other weighting schemes)
  • do the same with the query: preprocess and represent it in the vector space
  • find the most similar documents by computing the vector similarity (e.g. using the cosine similarity)

That's an outline of the Information Retrieval process.
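The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production approach: the stop-word list, the crude plural-stripping `stem` helper, the `relevance` function name and the `PAGE` sample are all assumptions made for the example. It uses plain term frequency (TF), because TF-IDF only becomes meaningful once you have a collection of documents to compute document frequencies from.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "or", "be", "can",
              "we", "our", "this", "out", "now"}


def stem(token):
    # Assumption: stripping a trailing "s" stands in for a real stemmer.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def preprocess(html):
    """Strip tags, lowercase, tokenize, drop stop words, crudely stem."""
    text = re.sub(r"<[^>]+>", " ", html)        # naive tag removal
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(query, html):
    """Represent query and document as TF vectors and compare them."""
    return cosine_similarity(Counter(preprocess(query)),
                             Counter(preprocess(html)))


# Hypothetical sample page built from the lines in the question.
PAGE = (
    "<p>this is an awesome bike store</p>"
    "<p>bikes can be purchased online.</p>"
    "<p>the bikes we own rock.</p>"
    "<p>check out our bike store now</p>"
)

if __name__ == "__main__":
    for query in ("bike repairs", "dog poo"):
        print(query, relevance(query, PAGE))
```

A query that shares at least one non-stop-word term with the page gets a positive score, while one with no overlap scores zero. Where to put the threshold for "relevant" is a judgment call that depends on your data.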

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze is a very good book to get started in IR.


Or just use Apache Solr to get everything you need out of the box (or Apache Lucene, the library Solr is built on, to build your own application).

Answered by Alexey Grigorev on August 26, 2020

I remember playing with Elasticsearch a long time ago (the website is very different now from what I remember). Its documentation has some material on dealing with human language.

Be warned that Elasticsearch is like a big bazooka for your problem. If your problem is very simple, you may want to build something from scratch. There are some docs on the web about that.

Answered by eri0o on August 26, 2020
