

Exploring semantic technology in search 


Dr Jordi Atserias, Fundació Barcelona Media-Yahoo! Research


12/6, 12:00-13:00; 13/6, 11:00-13:00; and 14/6, 11:00-13:00.

This seminar will explore the uses of Natural Language Processing (NLP) technologies in Information Retrieval (IR) to improve information access. There has been a great deal of work in NLP over the past several decades. Although we are still far from understanding natural language, new practical products are making use of NLP and semantic technologies (Siri, Google Translate, Powerset, Watson, ...).


Most NLP processing involves adding some level of information (annotations) that should help to clarify the document's intended meaning. While there has been hope that this work would translate into better search, this has not yet been clearly realized. Search engines are still hampered by their limited understanding of user queries and the content of the Web.


Semantic search is difficult because language, its structure, and its relation to the world and human activities are complex and only partially understood. However, important advances have occurred in the last ten years that make this study completely different today. The Web has emerged as a huge available corpus with a fast-growing structured component (Web 3.0), cloud computing is emerging as the solution for scaling up processing, etc.


The Semantic Search Group at the Yahoo! Research Lab in Barcelona is a multidisciplinary team spanning Information Retrieval (IR), Natural Language Processing (NLP), machine learning (ML) and the Semantic Web (SW) that has been exploring this view in recent years. As a result, several prototypes have been developed: Yahoo! Correlator / Yahoo! Quest / DeepSearch / TimeExplorer.


We foresee three main components which are crucial for addressing this “semantic gap”: Machine learning (more complex and scalable models), Data (large amounts of annotated or partially annotated data) and Tasks (new ways of browsing and accessing the data and supporting your answers).


During this seminar we will try to show the state-of-the-art techniques used to address these challenges, and the lessons learned from both research and industrial points of view.



-          Introduction to Semantic Search

A general introduction to the different components involved in semantic search (NLP, IR and Semantic Web) and an exploration of the new trends in semantic search technology in industry.

o   Some hints about Natural Language Processing

o   Some hints about Information Retrieval Ranking

o   Some hints about Semantic Web

o   Semantic search trends in industry: In recent years, new industrial applications using semantic search technology have appeared: Siri, Watson, Powerset, Wolfram Alpha, ...

-          New Challenges, New Data

One of the game changers in semantic search has been the availability of new types of data (user-generated content, Semantic Web data) whose volume and nature have pushed NLP technology to new architectures (cloud computing). In this section we will present these new types of data and the emerging architectures.

o   Web content: Heterogeneous, multilingual, multimedia

o   Semantic Web data (Freebase, YAGO, Wikipedia) and technologies (microformats, RDF)

o   NLP frameworks and scalability: Hadoop, S4, and the NLP frameworks UIMA AS / GATE on the cloud

-          New Challenges, New Shallow NLP  tasks (error-prone, highly redundant information)

These new data have also changed the traditional NLP and IR tasks. In this session we will analyse these changes and present the current approaches to these new tasks:

o   From News/Books text processing to User Generated Content (Twitter, email, blogs) processing

o   From Sentiment analysis to Opinion Mining/ Online Reputation

o   From Document Ranking to Entity Ranking / Sentence Ranking / RDF ranking

o   From Named Entity Recognition to Named Entity Disambiguation

o   From Information Extraction to Open Information Extraction


[Baeza et al.’2008] Ricardo Baeza-Yates, Massimiliano Ciaramita, Peter Mika, and Hugo Zaragoza “Towards Semantic Search” NLDB 2008, LNCS 5039, 2008.

[Blanco et al. 2011] Roi Blanco, Peter Mika, Sebastiano Vigna “Effective and Efficient Entity Search in RDF Data” ISWC 2011

[Ciaramita et al. 2008] Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza, Jordi Atserias “Learning to Tag and Tagging to Learn: A Case Study on Wikipedia” Journal IEEE Intelligent Systems, Volume 23 Issue 5, September 2008

[Deep Search] 

[Demartini et al’2010] Demartini,  Missen, Blanco, Zaragoza, “Entity Summarization of News Articles” SIGIR 2010

[Demartini et al’2012] Demartini, Mika, Tran, P. de Vries “From Expert finding to Entity Search on Web”, ECIR 2012 tutorial


[Fader et al 2011] Anthony Fader, Stephen Soderland and Oren Etzioni “Identifying Relations for Open Information Extraction” Empirical Methods in Natural Language Processing 2011


[Google Translate]


[Matthews et al’2010 ] Matthews, Tolchinsky, Blanco, Atserias, Mika, Zaragoza  “Searching through time in the New York Times”


[Mesquita et al. 2010 ] Filipe Mesquita, Yuval Merhav, Denilson Barbosa “Extracting information networks from the blogosphere: state-of-the-art and challenges”. ICWSM 2010

[Linked Data]

[Pasca’2012] Marius Pasca "Web-Based Open-Domain Information Extraction" tutorial EACL 2012 web


[Mihalcea and Csomai 2007] Rada Mihalcea, Andras Csomai “Wikify!: linking documents to encyclopedic knowledge”, CIKM, 2007

[Ratinov et al, 2011] L. Ratinov and D. Roth and D. Downey and M. Anderson, “Local and Global Algorithms for Disambiguation to Wikipedia”. ACL (2011)




[Time Explorer]



[Yahoo! Quest]


[Workshop NLP4UGC]

[Web Says]

[Wu et al 2010] Fei Wu, Daniel S. Weld “Open Information Extraction using Wikipedia” ACL, 2010









Planned students' seminars

Created at 5/20/2012 8:02 PM  by Antonio Cisternino 
Last modified at 5/31/2012 11:02 AM  by Pierpaolo Degano