Thanks to Torsten Kilias and Alexander Löser of the Beuth University of Applied Sciences in Berlin for the following guest post about their INDREX project and its integration with Impala for integrated management of textual and relational data.
Textual data is a core source of information in the enterprise. Example demands arise from sales departments (monitor and identify leads), human resources (identify professionals with capabilities in ‘xyz’), market research (campaign monitoring from the social web), product development (incorporate feedback from customers), and the medical domain (anamnesis).
In this post, we describe In-Database Relation Extraction (INDREX), a system that transforms text data into relational data with Impala (the open source analytic database for Apache Hadoop), with low overhead needed for extraction, linking, and organization. Read this paper for complete details about our approach and our implementation.
Introduction to INDREX
Currently, mature systems exist either for managing relational data (such as Impala) or for extracting information from text (such as Stanford CoreNLP). As a result, the user must ship data between systems. Moreover, transforming textual data into a relational representation requires "glue" and development time to bind different system landscapes and data models seamlessly. Finally, domain-specific relation extraction is an iterative task: it requires the user to continuously adapt extraction rules and semantics in both the extraction system and the database system. As a result, many projects that combine textual data with existing relational data are likely to fail or become infeasible.
In contrast, INDREX is a single system for managing textual and relational data together. The figure below shows two possible system configurations for INDREX; both read and transform textual data into generic relations in an ETL approach and store results in base tables in the Parquet file format. Once generic relations are loaded, the INDREX user defines queries on base tables to extract higher-level semantics or to join them with other relational data.
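To make the base tables concrete, a Parquet-backed table of generic spans might look like the following sketch. The schema is illustrative only; the table and column names are our assumptions, not the actual INDREX schema:

```sql
-- Hypothetical base table of generic spans produced by the ETL step.
-- The real INDREX schema may differ in names and layout.
CREATE TABLE spans (
  doc_id       BIGINT,  -- document the span belongs to
  span_id      BIGINT,  -- unique span id within the corpus
  begin_pos    INT,     -- character offset where the span starts
  end_pos      INT,     -- character offset where the span ends
  span_type    STRING,  -- e.g. 'document:sentence', 'entity:person'
  covered_text STRING   -- the raw character sequence of the span
)
STORED AS PARQUET;
```

Storing these generic relations once in Parquet lets every later, domain-specific query run against the same compressed base tables without re-running the linguistic pipeline.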
The first configuration imitates state-of-the-art batch processing systems for text mining, such as Stanford CoreNLP, IBM's SystemT, or GATE. These systems use HDFS for integrating text data and loading relational data from an RDBMS. The user poses queries against HDFS-based data with proprietary query languages or by writing transformation rules with user-defined functions in languages like Pig Latin, Java, or JAQL.
The second configuration utilizes Impala and INDREX. This approach permits users to describe relation extraction tasks, such as linking entities in documents to relational data, within a single system and with SQL. To provide this functionality, INDREX extends Impala with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relations. As a result:
- The user can utilize structured data from tables in Impala to adapt extraction rules to the target domain
- No additional system is needed for relation extraction, and
- INDREX can benefit from the full power of Impala’s built-in indexing, query optimization, and security model
Figure 1 shows the initial ETL (Steps 1, 2, and 3) and compares two implementations. Both setups read text data from HDFS sequence files (Step 1) and perform the same base linguistic operations (Step 2). The system writes the output, generic and domain-independent relations, to highly compressed Parquet files on HDFS (Step 3). The user then refines these generic relations into domain-specific relations in an iterative process. One implementation uses MapReduce on Hadoop (Step 4.1) and writes results into HDFS sequence files (Step 5.1); the user analyzes the result from HDFS (Step 6) and may refine the query. INDREX also permits a much faster approach in Impala (Step 4.2), in which the user inspects results (Step 5.2) and refines the query directly. This approach benefits from Impala's query optimization.
Each document, such as an email or a web page, comprises one or more sequences of characters. For instance, we denote the interval of characters from the last sentence with the type document:sentence. We call a sequence of characters with an attached semantic meaning a span. Our data model is based on spans and permits the user (or the application) to map multiple spans to an n-ary relation. For example, the relation PersonCareerAge(Person, Position, Age) requires mapping spans to three attribute values. Our model is flexible enough to hold additional important structures for common relation extraction tasks, such as document structures, shallow or deep syntactic structures, or structures for resolving entities within and across documents (see details in our paper).
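One way to picture the span-to-relation mapping is a table in which each attribute of the n-ary relation points to one span. The sketch below is an assumption for illustration, not the actual INDREX representation:

```sql
-- Hypothetical representation of the n-ary relation
-- PersonCareerAge(Person, Position, Age): each attribute
-- references one span by its id.
CREATE TABLE person_career_age (
  doc_id        BIGINT,  -- document the relation was extracted from
  person_span   BIGINT,  -- span holding the person mention
  position_span BIGINT,  -- span holding the position mention
  age_span      BIGINT   -- span holding the age mention
)
STORED AS PARQUET;
```

Because attributes reference spans rather than copied strings, the same span can participate in several relations, and character offsets stay available for later queries.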
In practice, either a batch-based ETL process, or an ad-hoc query, may generate such spans.
The diagram above illustrates an example corpus of three documents. Each document consists of a single sentence. The figure shows the output after common language-specific annotations, such as tokenization, part-of-speech (POS) tagging, phrase chunking, dependencies, constituencies (between the phrases), OIE-relationship candidates, entities, and semantic relationships. Each horizontal bar represents an annotation with one span and each arrow an annotation with multiple spans.
The semantic relationship PersonCareerAge is an annotation with three spans. The spans “Torsten Kilias” in document 26 and “Torsten” in document 27 are connected by a cross-document co-reference annotation. The query in the right-bottom corner shows a join between a table, including positions about persons in an organization, and spans about relations of these persons that have been extracted from the three documents shown in the diagram.
Joining Text with Tables and Extracting Domain-Dependent Relations
Domain-dependent operations combine the base features from above with existing entities from structured data (also called entity linkage). The user may add domain-specific rules for extracting relationship types, attribute types, or attribute value ranges. In INDREX, these domain-specific operations are part of an iterative query process and are based on SQL. (Recall the example corpus, in which the user joins spans with persons and their positions from an existing table.)
The example query below in Impala demonstrates the extraction of the binary relationship type PersonCareer(Person, Position). After an initial training, writing these queries is as simple as writing SQL queries in Impala.
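As an illustration of the shape such a query can take, the sketch below links person-entity spans to an existing relational table. All table and column names (spans, employees, covered_text) are hypothetical stand-ins, not INDREX's actual schema or UDFs:

```sql
-- Sketch of a PersonCareer(Person, Position) extraction via entity linkage.
-- `spans` holds generic entity spans; `employees` is an existing
-- relational table with person names and their positions.
SELECT p.covered_text AS person,
       e.position     AS position
FROM   spans p
JOIN   employees e
  ON   lower(p.covered_text) = lower(e.name)  -- link mention to table row
WHERE  p.span_type = 'entity:person';
```

The key point is that linkage is an ordinary SQL join: structured data constrains which text mentions count as valid extractions, with no second system involved.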
Aggregation and Group By Queries from Text Data
INDREX supports standard SQL aggregations, such as COUNT, SUM, MIN, and MAX, directly on text data and within Impala.
The query above shows the process of extracting a person's age. In our simple example, the user expresses the age argument of the person as an apposition, marked by a comma. The query applies span and consolidation operators (explained in our paper) and finally shows a distribution of age information for persons appearing in the text.
The query below shows a simple person-age extractor. In addition, it provides a distribution of age-group information in a corpus. The outer statement computes the aggregation for each age-group observation, grouped and ordered by age. The nested statement projects two strings, the person and the age, from conditions in the WHERE clause that implement the heuristic of a syntactic apposition, in this case a comma separating the person from the age.
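A sketch of such a nested aggregation query follows. The span adjacency condition stands in for INDREX's actual span operators, and the table and column names are our assumptions for illustration:

```sql
-- Sketch: distribution of ages for persons mentioned in the corpus.
-- The inner query pairs a person span with the number span that
-- directly follows it after ", " (the apposition heuristic);
-- the outer query aggregates per age and orders the result.
SELECT age,
       count(*) AS persons
FROM (
  SELECT p.covered_text              AS person,
         cast(a.covered_text AS INT) AS age
  FROM   spans p
  JOIN   spans a
    ON   p.doc_id    = a.doc_id
   AND   a.begin_pos = p.end_pos + 2   -- age starts right after "<person>, "
  WHERE  p.span_type = 'entity:person'
  AND    a.span_type = 'token:number'
) person_age
GROUP BY age
ORDER BY age;
```

Because the whole pipeline is one SQL statement, Impala can optimize the extraction and the aggregation together rather than as separate jobs.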