I See Dead Code

… as sounding brass, or a tinkling cymbal.


Yes, Master.

May 24th, 2009 · 1 Comment

My Master's thesis “Integration of Light-weight Semantics into a Syntax Query Formalism” is available from the SALSA project pages.

Abstract

In the Computational Linguistics community, much work is put into the creation of large, high-quality linguistic resources, often with complex annotation. In order to make these resources accessible to nontechnical audiences, formalisms for searching and filtering are needed.

The TIGER query language can, by describing partial structures, be used to search treebanks with syntactic annotation. Recently, augmented treebanks have been published, including the SALSA corpus which features frame semantic annotation on top of syntactic structure. Query languages, however, need to keep up with newly introduced annotation, allowing it to be searchable and easy to access.

We design an extension for the TIGER language which allows searching for frame structures along with syntactic annotation. To achieve this, the TIGER object model is expanded to include frame semantics, while remaining fully backwards-compatible.

Finally, these extensions have been added to our own implementation of TIGER, which includes novel indexing features not found in the original work of Lezius (2002a).

What does it all mean?

In the most basic sense of all, the TIGER query language allows specification of nodes (which are flat feature structures) and relations between these nodes. So far, only syntactic nodes (words and phrases) and syntactic relations (dominance, precedence and structure sharing) were supported in the query language, while the underlying annotation formalism had been extended to include frame semantics as well. My conservative extension of the query language introduces types and relations for frame semantics. This makes it possible to express linguistic queries such as “Find all sentences where the role TOPIC in the frame STATEMENT is realized by a PP with the preposition ‘über’”, which was not possible previously:

{frame="Statement"} > #r:{role="Topic"} &
#pp:[cat="PP"] >AC [word="über"] &
#r > #pp & arity(#r, 1)
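To make the query a bit more tangible, here is a minimal Python sketch of how it could be evaluated by brute force over a tiny hand-made graph. This is not the thesis implementation: the node names, the `NODES`/`EDGES` encoding and all helper functions are made up for illustration, and I assume that `>` means direct dominance, `>AC` direct dominance via an edge labelled AC, and `arity(#r, 1)` "exactly one child".

```python
from itertools import product

# Nodes are flat feature structures: plain attribute-value dicts.
# All node ids and the toy sentence are hypothetical.
NODES = {
    "f1": {"frame": "Statement"},
    "r1": {"role": "Topic"},
    "r2": {"role": "Speaker"},
    "pp": {"cat": "PP"},
    "w1": {"word": "Er"},
    "w3": {"word": "über"},
    "w4": {"word": "Politik"},
}

# Edges as (parent, label, child); None marks an unlabelled edge.
EDGES = {
    ("f1", None, "r1"),
    ("f1", None, "r2"),
    ("r1", None, "pp"),
    ("r2", None, "w1"),
    ("pp", "AC", "w3"),
    ("pp", "NK", "w4"),
}


def candidates(description):
    """Return ids of all nodes whose features include the description."""
    return [
        node_id
        for node_id, features in NODES.items()
        if all(features.get(k) == v for k, v in description.items())
    ]


def dominates(parent, child, label=None):
    """Direct dominance, optionally restricted to one edge label."""
    return any(
        p == parent and c == child and (label is None or lab == label)
        for p, lab, c in EDGES
    )


def arity(node_id):
    """Number of children of a node."""
    return sum(1 for p, _, _ in EDGES if p == node_id)


def find_matches():
    """Try every combination of candidate nodes against all constraints."""
    results = []
    for f, r, pp, w in product(
        candidates({"frame": "Statement"}),
        candidates({"role": "Topic"}),
        candidates({"cat": "PP"}),
        candidates({"word": "über"}),
    ):
        if (
            dominates(f, r)
            and dominates(pp, w, "AC")
            and dominates(r, pp)
            and arity(r) == 1
        ):
            results.append({"r": r, "pp": pp})
    return results
```

Calling `find_matches()` on this toy graph yields a single binding for `#r` and `#pp`. Trying every combination of candidates is of course the naive way to do it; a real evaluator has to be much smarter about which combinations it even looks at.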

What the abstract cryptically calls “novel indexing features” are improvements to the candidate selection for relation checks. Candidate selection now exploits some graph-theoretic notions as rough filters before the actual relation checks, which can be quite expensive. All in all, the implementation is generally faster than TIGERSearch (the original implementation by Lezius) for complex queries; for simple queries it is slower, because our node index is slower.
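The general idea of a rough filter can be sketched in a few lines of Python. Note that the filter shown here (comparing node depths before a dominance check) is only an assumed example that I picked because it is easy to explain; it is not necessarily one of the filters used in the thesis, and the tree, `PARENT`, `may_dominate` and the other names are hypothetical.

```python
# Toy tree as a child -> parent map (hypothetical example data).
PARENT = {
    "vp": "s",
    "w_er": "s",
    "pp": "vp",
    "np": "pp",
    "w_ueber": "pp",
    "w_politik": "np",
}


def depth(node):
    """Distance from the node to the root."""
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d


# Precomputed once per graph, so the filter itself is a dict lookup.
DEPTH = {n: depth(n) for n in set(PARENT) | set(PARENT.values())}


def may_dominate(a, b):
    """Rough filter: a can only dominate b if a sits strictly higher.

    Never rejects a true dominance pair, but lets some false ones through.
    """
    return DEPTH[a] < DEPTH[b]


def dominates(a, b):
    """Exact check: walk up from b and look for a (stands in for the costly part)."""
    node = b
    while node in PARENT:
        node = PARENT[node]
        if node == a:
            return True
    return False


def dominance_pairs(nodes):
    """Return all dominance pairs and the number of exact checks performed."""
    pairs, exact_checks = [], 0
    for a in nodes:
        for b in nodes:
            if not may_dominate(a, b):
                continue  # discarded cheaply, no exact check needed
            exact_checks += 1
            if dominates(a, b):
                pairs.append((a, b))
    return pairs, exact_checks
```

The important property is that the filter is sound: it may let through pairs that later fail the exact check, but it never throws away a pair that would have passed. In this toy tree the exact check is cheap as well, so the savings are purely illustrative.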

Can I try it out?

A demo system for the original and extended query language is online on the CoLi webservers in Saarbrücken. In terms of features, this is the latest version; since then, I have committed one bugfix to the query evaluator.

Will you continue work on it?

Hopefully, yes. Current directions of ongoing development include:

  • In Progress:
    • Client-side rendering of trees in the query front-end (using the HTML5 canvas)
  • Planned:
    • Custom-written node index
    • Relations between graphs and nodes in different graphs

I’m also running some experiments for massively parallel constraint evaluation using GPUs, but that might not lead anywhere and depends on the availability of special hardware.

Thanks

Again, special thanks go to Martin Lazarov and Armin Schmidt, who both read the full draft version and provided many comments and corrections.

Tags: coli · studies

1 response so far ↓

  • 1 Armin Schmidt // May 26, 2009 at 08:55

    Congrats! I’ll be happy to also proof-read you phd thesis, soon :-)
