Note! Defining Semantics, NLP, LSI and AI

By Kathleen Dahlgren, PhD
Dr. Dahlgren is the Founder and CTO (Chief Technology Officer) of Cognition Technologies.

Helping users efficiently and accurately locate the complete universe of documents relevant to their interest continues to be a major challenge on the Web. Using Search engines like Google, users often miss documents they want to retrieve (under-retrieval), and they also see many documents that are not relevant to their query (over-retrieval). In fact, only about 20% of what a typical search engine retrieves is relevant (80% is irrelevant), and it misses 80% of the desired information.
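These two failure modes correspond to the standard retrieval measures of precision (how much of what was retrieved is relevant) and recall (how much of what is relevant was retrieved). A minimal sketch of the arithmetic follows; the document IDs and counts are made up to match the 20% figures above and do not come from any real test.

```python
def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical collection: 10 relevant documents, of which the engine
# finds only 2, alongside 8 irrelevant ones.
relevant = {f"rel{i}" for i in range(10)}
retrieved = {"rel0", "rel1"} | {f"junk{i}" for i in range(8)}

print(precision(retrieved, relevant))  # 0.2 (over-retrieval: 80% irrelevant)
print(recall(retrieved, relevant))     # 0.2 (under-retrieval: 80% missed)
```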

Let’s look at an example. On a Wikipedia search using Google as the Search engine (i.e. the Google choice on the Search engine pull-down), the query “strike up an intimate friendship” returns mostly irrelevant documents containing phrases such as:

(1) “living it up”
(2) “strike justice upon one”
(3) “it struck him that”

Google returns irrelevant information because it doesn’t understand semantics (i.e. knowledge of meaning). Frequent words such as “strike” have many meanings. In this particular query “strike up” means “start” or “inaugurate”, not “live it up” as in (1) above, “mete out” as in (2) above, or “occur to one” as in (3) above.

In addition, Google misses many relevant documents, such as those which say “strike up a close friendship” rather than “strike up an intimate friendship”. Google missed these relevant documents because it didn’t know what “intimate” meant, and didn’t know that “close” in this context means the same thing (or expresses the same concept).

These problems, over-retrieval and under-retrieval, can be solved with technology that has “semantic” understanding, that is, an understanding of meaning. Software with semantic Natural Language Processing (semantic NLP), or linguistic semantics, can retrieve just the right documents because it interprets the meaning of the query before searching.

Once the meanings of the words in a query are determined, software with semantic NLP can look for those same meanings in the stored index of the document base (which has also been semantically analyzed), and return documents that use “strike” and “intimate” with those specific meanings rather than documents that contain those words in any of their other meanings (thus avoiding the bad retrievals). The software can also find synonyms of those specific meanings.
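The idea of indexing meanings rather than words can be sketched in a few lines of Python. This is only a toy under loud assumptions: the sense inventory, the concept IDs and the three documents are hypothetical, and sense selection is faked with a hard-coded phrase lookup where a real system would disambiguate from context. It is not Cognition’s implementation.

```python
# Hypothetical sense inventory: each phrase maps to a concept ID.
# Synonymous phrases ("intimate"/"close" friendship) share one ID.
SENSES = {
    "strike up": "START",
    "struck him that": "OCCUR_TO",
    "strike justice upon": "METE_OUT",
    "intimate friendship": "CLOSE_FRIENDSHIP",
    "close friendship": "CLOSE_FRIENDSHIP",
}


def concepts(text):
    """Return the set of concept IDs whose phrases occur in the text."""
    lowered = text.lower()
    return {cid for phrase, cid in SENSES.items() if phrase in lowered}


def build_index(docs):
    """Map each concept ID to the IDs of the documents that express it."""
    index = {}
    for doc_id, text in docs.items():
        for cid in concepts(text):
            index.setdefault(cid, set()).add(doc_id)
    return index


def search(index, query):
    """Return documents that express every concept found in the query."""
    wanted = concepts(query)
    if not wanted:
        return set()
    return set.intersection(*(index.get(cid, set()) for cid in wanted))


DOCS = {
    "d1": "They strike up a close friendship at school.",
    "d2": "It struck him that the door was open.",
    "d3": "The court will strike justice upon one who steals.",
}

index = build_index(DOCS)
print(search(index, "strike up an intimate friendship"))  # {'d1'}
```

Because the query and the documents are both reduced to concept IDs, d1 is found even though it says “close” rather than “intimate”, and d2 and d3 are skipped even though they contain forms of “strike”.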

Semantic NLP technologies also make use of taxonomies that classify things into categories. Such a taxonomy starts with very general categories such as “animal”, “vegetable” and “mineral”, and breaks down all the concepts of English into a tree structure. For example, the vehicle category is broken into a sub-tree like the one below:
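As an illustration, a small vehicle sub-tree can be sketched in Python as a child-to-parent table. The category names here are hypothetical, chosen only for the example; they are not taken from Cognition’s actual taxonomy.

```python
# Hypothetical fragment of a concept taxonomy, stored as child -> parent.
PARENT = {
    "car": "vehicle",
    "truck": "vehicle",
    "boat": "vehicle",
    "sedan": "car",
    "pickup": "truck",
}


def ancestors(concept):
    """Return the chain of increasingly general categories above a concept."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, category):
    """True if the concept is the category or a specific instance of it."""
    return concept == category or category in ancestors(concept)


print(ancestors("sedan"))         # ['car', 'vehicle']
print(is_a("pickup", "vehicle"))  # True
print(is_a("boat", "car"))        # False
```

With a structure like this, a query about vehicles can match a document that only mentions a pickup, since the more specific concept sits below the more general one in the tree.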

In summary, semantic NLP enables an application or technology to “understand” word and phrase meanings within the context in which they are used. Semantic NLP technology distinguishes the various meanings of ambiguous words from each other, knows which meanings of words are synonymous with meanings of other words, and knows which meanings of words are specific instances of more general concepts. In addition, semantic NLP enables a computer to recognize phrases and their relationships with other words and phrases. For example, a computer technology with semantic NLP knows that “Securities and Exchange Commission” is a phrase, and that it is synonymous with “SEC”.
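The phrase example above can be approximated with a simple lookup table that maps every known surface form to one concept ID. This is a rough sketch only: the table and the ID `ORG_SEC` are invented for illustration, and a plain lookup like this cannot tell the agency from other uses of “sec” (for instance as an abbreviation of “second”), which is exactly the kind of ambiguity semantic analysis is meant to resolve.

```python
import re

# Hypothetical phrase table: each known surface form maps to one concept ID.
PHRASES = {
    "securities and exchange commission": "ORG_SEC",
    "sec": "ORG_SEC",
}


def normalize(text):
    """Replace known phrases with their concept ID, longest phrase first."""
    result = text.lower()
    for phrase in sorted(PHRASES, key=len, reverse=True):
        # \b keeps "sec" from matching inside longer words like "second".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, PHRASES[phrase], result)
    return result


a = normalize("The Securities and Exchange Commission fined the firm.")
b = normalize("The SEC fined the firm.")
print(a)       # the ORG_SEC fined the firm.
print(a == b)  # True
```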

Comparison with other technologies

NLP (natural language processing) is a broader field that includes semantic NLP. Other aspects of this field include parsing technologies (which may or may not have semantic knowledge). In parsing, the grammatical structure of sentences is analyzed. Cognition has both parsing and semantic NLP. A recent entrant into the semantic NLP field, Powerset, has built its technology on a parsing foundation, with very little semantic knowledge. Hakia, another semantic technology company, does not publicly disclose what it has, but it appears that its technology is built around a well-developed taxonomy (reasoning from the general to the particular).

Other NLP technologies on the market do not include linguistic semantics or parsing, but instead attempt to teach the computer meaning through statistical methods. One such method is “Latent Semantic Indexing” (LSI), which tries to infer meaning and synonymy by counting word co-occurrences. For example, it computes that many documents frequently mention “ball”, “bat”, “field”, and “baseball” together, and therefore assumes that any documents mentioning those words frequently have the same topic (namely, baseball). NLP and LSI are subfields of Artificial Intelligence, which endeavors to teach computers to behave intelligently. Language is one intelligent behavior; automated learning, vision and manufacture (robotics) are others also pursued in the field of Artificial Intelligence.
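The co-occurrence counting described above can be sketched as follows. Note that this is a simplification: full LSI goes further and applies a singular value decomposition to a term-document matrix, which this sketch does not do. The four toy documents are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count how many documents each unordered pair of words shares."""
    counts = Counter()
    for text in docs:
        # Sort the distinct words so each pair has one canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


DOCS = [
    "ball bat field baseball",
    "baseball bat ball glove",
    "field ball baseball inning",
    "stock bond market",
]

counts = cooccurrence_counts(DOCS)
print(counts[("ball", "baseball")])  # 3
print(counts[("ball", "bat")])       # 2
print(counts[("ball", "stock")])     # 0
```

Words that keep appearing together (“ball” and “baseball”) get high counts and are treated as belonging to one topic; no meaning is assigned to any individual word along the way.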


5 Responses to “Note! Defining Semantics, NLP, LSI and AI”

  1. Charles Knight says:

    Paul writes:

    The example query, “strike up an intimate friendship”, goes too far, IMHO. Even if the search engine understands the meanings of the words, the phrase is still too vague to be useful. The functionality implied here seems to skip a level that would be much more useful. If you could make a query like, “forums that do not charge a fee on which I can meet smart people who like to talk about personal relationships”, return relevant results, that would be better.

  2. vicaya says:

    Geez, how about demonstrating some understanding of ROC and/or precision/recall next time you write about how NLP/LSI/AI would help? Using these anecdotal examples insults the intelligence of readers who actually care about alt search engines. Most studies show that NLP/LSI or whatever semantics stuff doesn’t help *much* for the overall ROC for web search, and comes with much higher computation cost.

  3. Christian Hempelmann says:

    Dear Kathleen -

    Thanks for mentioning hakia here. I just wanted to clarify one point. Our semantic resources are actually transparent. You can interact with a limited dataset at http://labs.hakia.com/hakia-lab-onto.html. There is more information about the underlying theory, Ontological Semantics, and its application to Web search here: http://www.ontologicalsemantics.com/.

    Best,
    Dr. Kiki Hempelmann, Chief Scientific Officer, hakia

  4. Kathleen Dahlgren says:

    My blog post was intended to be a high-level (i.e. simple) description of various approaches to Search. Readers will find greater detail in Cognition’s Technical Overview whitepaper, which can be downloaded from our Website at www.cognition.com.

    In response to the concerns about higher computation costs, statistical semantic approaches such as LSI do indeed experience exponential growth in computation resources, because they have to compare all the co-occurrences in an entire document base. Semantic NLP based upon linguistic approaches, on the other hand, does not have that problem. CognitionSearch, for example, indexes documents about half as fast as a typical pattern-matcher technology, and the function is linear as it scales. Its algorithms use bottom-up interpretation, word by word and sentence by sentence.

    To answer the concerns about scientific measures, precision and recall are standard measures of Search engine performance. Precision is a measure of retrieval accuracy calculated by dividing the total number of relevant retrievals by the number of all retrievals generated by the Search. Recall is a measure of the extent to which relevant material in the total document base is found. It is calculated by dividing the number of relevant retrievals by the total number of potentially relevant retrievals in the document base.

    Pattern-matching technologies perform with both low precision and low recall (typically under 20% for both). TREC (the Text Retrieval Conference), sponsored by the National Institute of Standards and Technology (NIST), is a recognized source of precision/recall testing for various technologies, including pattern-matching and statistical approaches. In TREC’s legal track competition in 2007, there were 13 technologies participating. Their precision performance ranged from under 1% to 23% and their recall performance ranged from under 1% to 22%.

    While Cognition did not participate in the TREC competition in 2007 (but is participating in 2008), it did conduct its own internal precision/recall tests on a wide variety of document bases (similar to the TREC data) and Websites. These included the National Library of Medicine’s MEDLINE™, the public domain Enron fraud case, the public domain Microsoft anti-trust case, the BBC World News Website (http://news.bbc.co.uk/), and the Global Issues Website (http://www.globalissues.com), among others. For each test, 50 queries that were considered likely to be asked by users of the data/Website were formulated and posed to a CognitionSearch Search function on the site’s documents. Relevancy was judged for a sample of 20 or fewer retrievals and extrapolated. Cognition’s precision exceeded 90%. Recall was measured relatively. In other words, full recall was taken to be the total of all relevant retrievals returned by any of the Search engines used in the particular test. Cognition’s relative recall in these tests exceeded 90%.

    In summary, Search that employs Semantic NLP technology will achieve significantly better precision and recall than pattern-matching or statistical approaches.

    1 [TREC Conference 2007]

  5. Sam says:

    Hi Kathleen,

    As you know, for your numbers to be compared to the TREC legal track, it would be helpful and more accurate to run your test against the same data, as that is the only way to do a proper apples-to-apples comparison.

    Even if the data you used is similar to the TREC data, it is not a valid comparison, which is why TREC participants all use the same baseline data for the competition. The numbers you cite are, as far as I am concerned, not valid, as they do not use the same data.

    Your technology sounds very interesting, but all the posts I have seen from your company seem to be simply “look at me too” posts, invoking the names of your “competitors” to get hits, rather than instructive posts about your technology and where it fits into the wider world.

    The above post is an example; it really wasn’t that helpful.

    Cheers,

    Sam

 

© 2008 altsearchengines.com