In a standard search engine, words are the primary material used to match user queries to relevant documents. In a semantic search engine, much of the focus shifts from the space of words to the space of concepts. The engine not only recognizes words, but also understands what they mean and how they relate to one another in a document.
We rely on a number of major components to accomplish the tasks described above. First, we use an extensive ontology coupled with a lexicon and an onomasticon. The ontology describes knowledge as a graph of concepts (the nodes) connected to one another by typed relations (the edges). It is paired with a lexicon that lists, for a particular language, the words used to express each concept. Finally, we use a large onomasticon which captures individual entities and their names. This is where we represent lists of people, places, organizations, brands, works of art and more.
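The three components can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class name Ontology, the relation labels is_a and located_in, and the sample entries are hypothetical and do not describe the engine's actual data model.

```python
from collections import defaultdict


class Ontology:
    """Concepts are nodes; typed relations are directed, labeled edges."""

    def __init__(self):
        # concept -> set of (relation, target concept) pairs
        self.edges = defaultdict(set)

    def relate(self, source, relation, target):
        self.edges[source].add((relation, target))

    def related(self, concept, relation):
        # Targets reachable from `concept` through one edge of the given type.
        return {t for r, t in self.edges.get(concept, ()) if r == relation}


ontology = Ontology()
ontology.relate("robbery", "is_a", "crime")          # hypothetical relation label
ontology.relate("Italy", "located_in", "Europe")     # hypothetical relation label

# Lexicon: for each language, expressions mapped to the concepts they can express.
lexicon = {
    "en": {
        "hypertension": {"hypertension"},
        "high blood pressure": {"hypertension"},
    },
}

# Onomasticon: names mapped to the individual entities they may denote.
onomasticon = {
    "chicago": {"Chicago (city)", "Chicago (musical)"},
}
```

Keeping the lexicon separate from the ontology means a new language can be added by supplying a new lexicon without touching the concept graph.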
Once a document has been retrieved from the Internet, we extract useful information from it in a step called indexing. During indexing, the text is first tokenized and lemmatized, and each word is tagged with its part of speech. This produces a “normalized” form of each sentence which can be matched against the expressions present in our lexicon. For each expression (i.e., sequence of words), possible matching concepts are identified. In some cases, a particular expression may mean multiple things.
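The normalization and matching steps can be sketched as follows. This is a toy illustration under stated assumptions: the LEMMAS table stands in for a real lemmatizer, part-of-speech tagging is omitted, the concept identifiers are invented, and the longest-match-first strategy is one plausible choice rather than the engine's documented behavior.

```python
import re

# Toy lemma table (assumption); a real system would use a full lemmatizer.
LEMMAS = {"funds": "fund", "bonds": "bond", "rated": "rate"}

# Toy lexicon (assumption): normalized expressions -> candidate concepts.
LEXICON = {
    ("mutual", "fund"): {"mutual_fund"},
    ("bond",): {"bond_financial", "james_bond"},
}
MAX_LEN = max(len(expression) for expression in LEXICON)


def normalize(text):
    """Lowercase, tokenize on alphanumeric runs, then lemmatize each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def match_expressions(lemmas):
    """Scan left to right, preferring the longest lexicon expression at each position."""
    matches = []
    i = 0
    while i < len(lemmas):
        for n in range(min(MAX_LEN, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in LEXICON:
                matches.append((i, key, LEXICON[key]))
                i += n
                break
        else:
            i += 1  # no expression starts at this position
    return matches


lemmas = normalize("Find the top rated High Yield Bond mutual funds.")
matches = match_expressions(lemmas)
```

Here the expression "bond" comes back with two candidate concepts, which is exactly the ambiguous case the text mentions.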
A critical step in this process is disambiguation: when an expression is ambiguous, it is necessary to select the most likely meaning. As an example, in the text “Find the top rated High Yield Bond mutual funds”, the word “bond” means the financial instrument, not the fictional character “James Bond”. That case seems easy, but another example will drive the point home: “I saw the musical Chicago in Munich last night”. Here both “Chicago” and “Munich” can mean a city or a work of art (musical or movie). Once each expression’s meaning has been resolved, concepts are associated with words in the text and given a score. The score for each concept reflects how important the concept is to this document.
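One simple way to picture disambiguation is a context-overlap heuristic: each candidate meaning is scored by how many of its cue words appear near the ambiguous expression. The sketch below is a deliberately naive stand-in, not the method the engine uses; the SENSE_CUES table and the concept identifiers are hypothetical.

```python
# Hypothetical cue words per candidate meaning (assumption for illustration).
SENSE_CUES = {
    "bond_financial": {"yield", "fund", "mutual", "rate", "investment"},
    "james_bond": {"film", "spy", "movie", "agent", "007"},
}


def disambiguate(candidates, context_lemmas):
    """Return the candidate whose cue words overlap most with the context."""
    context = set(context_lemmas)
    # Sorting makes tie-breaking deterministic (alphabetical order).
    scores = {
        candidate: len(SENSE_CUES.get(candidate, set()) & context)
        for candidate in sorted(candidates)
    }
    best = max(scores, key=scores.get)
    return best, scores


context = ["find", "the", "top", "rate", "high", "yield", "bond", "mutual", "fund"]
best, scores = disambiguate({"bond_financial", "james_bond"}, context)
```

A heuristic this simple would struggle with the “Chicago in Munich” sentence, where the decisive cues are the word “musical” and the preposition “in”; that is why the text calls disambiguation a critical step.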
The Power of Concepts
Shifting focus from words to concepts has several benefits. First, the use of concepts provides an elegant way to handle synonyms. For example, the expressions “high blood pressure” and “hypertension” are equally considered when searching for documents about this concept.
Second, because each concept is related to other concepts in the ontology, the document now also carries implied knowledge. As an example, consider a document containing the text “Foreigners should take care when traveling in the south of Italy, where highway robbery occasionally takes place.” This text indirectly talks about criminality in Europe since “robbery” is a type of crime and “Italy” is in Europe. At search time, this implied knowledge can be leveraged to return relevant documents even when they do not contain the words in the query. In the example above, a search for “criminality in Europe” will consider the aforementioned document as possibly relevant even though it does not contain the word “Europe” or “criminality”. That’s the power of implied knowledge.
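The Italy example can be sketched as a walk up the ontology. The BROADER table below is a hypothetical, flattened view of “is a type of” and “is located in” relations, and the function names are invented for this illustration.

```python
# Hypothetical broader-concept links (assumption): "robbery" is a type of
# crime, "Italy" is located in Europe.
BROADER = {
    "robbery": {"crime"},
    "Italy": {"Europe"},
}


def implied_concepts(explicit):
    """Collect every concept reachable from the explicit ones via BROADER links."""
    seen = set(explicit)
    stack = list(explicit)
    while stack:
        concept = stack.pop()
        for broader in BROADER.get(concept, ()):
            if broader not in seen:
                seen.add(broader)
                stack.append(broader)
    return seen - set(explicit)


def is_candidate(query_concepts, explicit):
    """A document is a candidate if it covers the query explicitly or implicitly."""
    covered = set(explicit) | implied_concepts(explicit)
    return set(query_concepts) <= covered


document_concepts = {"robbery", "Italy", "highway"}
implied = implied_concepts(document_concepts)
relevant = is_candidate({"crime", "Europe"}, document_concepts)
```

The query “criminality in Europe” resolves to the concepts crime and Europe, neither of which is explicit in the document, yet both are found among its implied concepts.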
Each keyword and concept is scored individually to reflect how important it is in this particular document relative to other documents that contain it (either explicitly or implicitly).
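The text does not give the scoring formula. As one familiar possibility, a TF-IDF-style weight captures the same idea: a concept scores higher when it is frequent in this document and rare across the collection. The function below is a sketch under that assumption, with invented counts.

```python
import math


def concept_scores(counts_in_doc, doc_frequency, n_docs):
    """TF-IDF-style weights (an assumed formula, not the engine's own).

    counts_in_doc: concept -> occurrences in this document
    doc_frequency: concept -> number of documents containing the concept
    n_docs:        total number of documents in the collection
    """
    total = sum(counts_in_doc.values())
    return {
        concept: (count / total) * math.log(n_docs / doc_frequency[concept])
        for concept, count in counts_in_doc.items()
    }


scores = concept_scores(
    counts_in_doc={"hypertension": 3, "crime": 1},
    doc_frequency={"hypertension": 10, "crime": 500},
    n_docs=1000,
)
```

In this made-up collection, hypertension appears in few documents and often in this one, so it outweighs crime, which is both rarer in the document and common across the collection.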
Finally, semantic indexing provides a language-independent representation of the document which can be used to search in one language and retrieve documents in other languages.
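Cross-language retrieval follows naturally if the index is keyed by concept rather than by word. In the sketch below, the concept identifier C_HYPERTENSION, the document identifiers and the two tiny lexicons are all hypothetical.

```python
# Per-language lexicons (assumption) mapping expressions to a shared concept ID.
LEXICONS = {
    "en": {
        "high blood pressure": "C_HYPERTENSION",
        "hypertension": "C_HYPERTENSION",
    },
    "fr": {
        "hypertension artérielle": "C_HYPERTENSION",
    },
}

# Concept-keyed index (assumption): documents in any language share the same keys.
INDEX = {
    "C_HYPERTENSION": {"doc_en_1", "doc_fr_7"},
}


def search(expression, language):
    """Resolve the expression to a concept, then look the concept up in the index."""
    concept = LEXICONS[language].get(expression.lower())
    return INDEX.get(concept, set())


english_hits = search("High blood pressure", "en")
french_hits = search("hypertension artérielle", "fr")
```

Both queries return the same set of documents because the language only matters when mapping words to concepts, not when consulting the index.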
Once words in a document are associated with concepts, it is often useful to also classify the document according to various criteria. The categories in which the document is placed can be used to filter results or influence their rank during the search phase. A primary application of classification is to filter documents that may contain spam or mature or potentially offensive content. This content may be filtered in accordance with the user's preferences (e.g., safe-search settings). A second application is to determine the general topics covered in a document. For example, is it about sports, politics or health? A document that covers football-related injuries may be associated with general topics such as sports and health. We also want to know the document’s genre. It is useful to know whether a document is a biography, FAQ, guide, review or recipe, since we can use this information when selecting documents at search time. For example, a search for “dark chocolate cake recipe” should return recipes which contain ingredients and preparation instructions, even if the word “recipe” is not present in the document.
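How classification labels might be used at search time can be sketched as a simple filter. The IndexedDoc record, its field names and the flag values are assumptions for illustration; always dropping spam is also an assumption, since the text only says that such content may be filtered.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    doc_id: str
    topics: set = field(default_factory=set)   # e.g. {"sports", "health"}
    genre: str = ""                            # e.g. "recipe", "review"
    flags: set = field(default_factory=set)    # e.g. {"spam"}, {"mature"}


def filter_results(docs, safe_search=True, genre=None):
    """Drop flagged documents, then optionally keep only one genre."""
    blocked = {"spam"}
    if safe_search:
        blocked |= {"mature", "offensive"}
    kept = []
    for doc in docs:
        if doc.flags & blocked:
            continue
        if genre is not None and doc.genre != genre:
            continue
        kept.append(doc)
    return kept


docs = [
    IndexedDoc("cake-1", topics={"food"}, genre="recipe"),
    IndexedDoc("cake-2", topics={"food"}, genre="review"),
    IndexedDoc("cake-3", topics={"food"}, genre="recipe", flags={"spam"}),
]
recipes = filter_results(docs, genre="recipe")
```

For the “dark chocolate cake recipe” query, the genre label lets the engine favor the first document even if the word “recipe” never appears in its text.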
In the final stage of indexing, we use all the information described above to rate the document in its entirety. We associate several scores with the document, reflecting its intrinsic and extrinsic attributes. The intrinsic value of a document can be determined by a combination of many attributes such as its format, length, vocabulary level, average sentence complexity, spelling quality and overall topic coherence. Independently of this, a document’s perceived value may depend on the trustworthiness of the domain where it is hosted, how many other pages link to the document, and how often people click on it when it is offered as a result.
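The text does not say how the individual scores are combined. One plausible sketch, assuming every signal has already been normalized to the range 0 to 1, is a weighted mean within each group followed by a blend of the two groups. The signal names, weights and the intrinsic_share parameter are all hypothetical.

```python
def weighted_mean(signals, weights):
    """Weighted average of normalized signals (each assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in signals)
    return sum(signals[name] * weights[name] for name in signals) / total_weight


def document_score(intrinsic, extrinsic, weights, intrinsic_share=0.5):
    """Blend intrinsic quality with extrinsic (perceived) value."""
    intrinsic_score = weighted_mean(intrinsic, weights)
    extrinsic_score = weighted_mean(extrinsic, weights)
    return intrinsic_share * intrinsic_score + (1 - intrinsic_share) * extrinsic_score


# Hypothetical equal weights for a handful of the attributes named in the text.
WEIGHTS = {
    "spelling_quality": 1.0,
    "topic_coherence": 1.0,
    "domain_trust": 1.0,
    "inbound_links": 1.0,
}

score = document_score(
    intrinsic={"spelling_quality": 1.0, "topic_coherence": 0.5},
    extrinsic={"domain_trust": 0.5, "inbound_links": 0.5},
    weights=WEIGHTS,
)
```

Keeping the two groups separate until the last step makes it easy to tune how much a well-written page on an unknown domain should count against a mediocre page on a trusted one.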