The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW: An Implementation and Evaluation

Jakob Halskov

    Research output: Book/Report > Ph.D. thesis > Research


    Abstract

    The research object of this thesis is the so-called knowledge patterns (KPs) and their usefulness in automatically extracting specific semantic relations from unannotated and uncategorized text on the WWW so as to facilitate semi-automatic updating and extension of existing ontological and terminological resources. The main contribution of the thesis is the implementation of a complete ontology extension framework called WWW2REL which is 100% based on a knowledge-poor, domain-independent processing of WWW text snippets and includes the three stages of pattern discovery, pattern filtering and relation instance ranking. Unlike most comparable systems, WWW2REL is highly portable, can be applied to any semantic relation type and operates directly on uncategorized WWW text snippets. The system is tested on the biomedical UMLS Metathesaurus for four different relation types and manually evaluated by four domain experts. It is demonstrated that high precision in the task of knowledge discovery from a noisy text source can be achieved using a very simple instance relevance measure and two ranking heuristics. In contrast, many comparable systems operate on richly annotated academic text and tend to apply heuristics which are custom-tailored to a specific domain and/or relation type. When selecting the overall best ranking scheme, average system performance across all four relation types ranges from 70% to 65% of the maximum possible F-score for the top 10 and top 50 relation instances, respectively. Finally, the thesis experiments also examine the portability of individual knowledge patterns and of the ranking heuristics. It is concluded that synonymy KPs are the most domain-independent, closely followed by ISA KPs, whereas patterns for "may_prevent" and especially "induces" are more dependent on the domain. Empirical experiments also suggest that a ranking heuristic which penalizes relation instances whose arguments occur frequently in a general language corpus can be highly effective, but may need to be adapted to the domain in question.
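
The abstract mentions a ranking heuristic that penalizes relation instances whose arguments are frequent in a general language corpus, but it does not give the formula. The following is only a minimal sketch of that idea under stated assumptions: the function name `rank_instances`, the `support` counts, the log-based penalty and the `penalty_weight` parameter are all hypothetical and are not taken from the thesis or from WWW2REL.

```python
import math


def rank_instances(instances, general_freq, penalty_weight=1.0):
    """Rank candidate relation instances, best first.

    instances: dict mapping (arg1, arg2) -> support count (assumed here to be
        the number of snippets in which some pattern matched the pair).
    general_freq: dict mapping a term -> its frequency in a general language
        corpus (assumed input; unseen terms count as frequency 0).
    penalty_weight: hypothetical knob controlling how strongly frequent
        general-language arguments are penalized.
    """
    ranked = []
    for (arg1, arg2), support in instances.items():
        # Penalty grows with the log frequency of each argument in general language.
        penalty = sum(math.log1p(general_freq.get(arg, 0)) for arg in (arg1, arg2))
        score = support / (1.0 + penalty_weight * penalty)
        ranked.append(((arg1, arg2), score))
    # Highest score first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Illustrative data only; these counts are invented for the example.
candidates = {
    ("aspirin", "myocardial infarction"): 5,
    ("water", "disease"): 5,
}
general_corpus_freq = {"water": 10000, "disease": 2000}

ranking = rank_instances(candidates, general_corpus_freq)
```

With equal support, the pair whose arguments are rare in general language ranks first. Setting `penalty_weight` per domain is one way such a heuristic might be adapted, in line with the abstract's remark that the penalty may need domain-specific tuning.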
    Original language: English
    Place of Publication: Frederiksberg
    Publisher: Copenhagen Business School [Phd]
    Number of pages: 289
    ISBN (Print): 9788759383377
    Publication status: Published - 2007
    Series: PhD series
    Number: 28.2007
    ISSN: 0906-6934

    Cite this

    Halskov, Jakob. / The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW : An Implementation and Evaluation. Frederiksberg : Copenhagen Business School [Phd], 2007. 289 p. (PhD series; No. 28.2007).
    @phdthesis{bb151060861411dc8979000ea68e967b,
    title = "The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW: An Implementation and Evaluation",
    abstract = "The research object of this thesis is the so-called knowledge patterns (KPs) and their usefulness in automatically extracting specific semantic relations from unannotated and uncategorized text on the WWW so as to facilitate semi-automatic updating and extension of existing ontological and terminological resources. The main contribution of the thesis is the implementation of a complete ontology extension framework called WWW2REL which is 100{\%} based on a knowledge-poor, domain-independent processing of WWW text snippets and includes the three stages of pattern discovery, pattern filtering and relation instance ranking. Unlike most comparable systems, WWW2REL is highly portable, can be applied to any semantic relation type and operates directly on uncategorized WWW text snippets. The system is tested on the biomedical UMLS Metathesaurus for four different relation types and manually evaluated by four domain experts. It is demonstrated that high precision in the task of knowledge discovery from a noisy text source can be achieved using a very simple instance relevance measure and two ranking heuristics. In contrast, many comparable systems operate on richly annotated academic text and tend to apply heuristics which are custom-tailored to a specific domain and/or relation type. When selecting the overall best ranking scheme, average system performance across all four relation types ranges from 70{\%} to 65{\%} of the maximum possible F-score for the top 10 and top 50 relation instances, respectively. Finally, the thesis experiments also examine the portability of individual knowledge patterns and of the ranking heuristics. It is concluded that synonymy KPs are the most domain-independent, closely followed by ISA KPs, whereas patterns for {"}may_prevent{"} and especially {"}induces{"} are more dependent on the domain. Empirical experiments also suggest that a ranking heuristic which penalizes relation instances whose arguments occur frequently in a general language corpus can be highly effective, but may need to be adapted to the domain in question.",
    keywords = "PhD theses, Computational linguistics, Terminology, Terminology extraction, Internet, Automatic retrieval, Ontologies, WWW2REL",
    author = "Jakob Halskov",
    year = "2007",
    language = "English",
    isbn = "9788759383377",
    series = "PhD series",
    number = "28.2007",
    publisher = "Copenhagen Business School [Phd]",
    address = "Denmark",

    }

    The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW : An Implementation and Evaluation. / Halskov, Jakob.

    Frederiksberg : Copenhagen Business School [Phd], 2007. 289 p. (PhD series; No. 28.2007).


    TY - BOOK

    T1 - The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW

    T2 - An Implementation and Evaluation

    AU - Halskov, Jakob

    PY - 2007

    Y1 - 2007

    AB - The research object of this thesis is the so-called knowledge patterns (KPs) and their usefulness in automatically extracting specific semantic relations from unannotated and uncategorized text on the WWW so as to facilitate semi-automatic updating and extension of existing ontological and terminological resources. The main contribution of the thesis is the implementation of a complete ontology extension framework called WWW2REL which is 100% based on a knowledge-poor, domain-independent processing of WWW text snippets and includes the three stages of pattern discovery, pattern filtering and relation instance ranking. Unlike most comparable systems, WWW2REL is highly portable, can be applied to any semantic relation type and operates directly on uncategorized WWW text snippets. The system is tested on the biomedical UMLS Metathesaurus for four different relation types and manually evaluated by four domain experts. It is demonstrated that high precision in the task of knowledge discovery from a noisy text source can be achieved using a very simple instance relevance measure and two ranking heuristics. In contrast, many comparable systems operate on richly annotated academic text and tend to apply heuristics which are custom-tailored to a specific domain and/or relation type. When selecting the overall best ranking scheme, average system performance across all four relation types ranges from 70% to 65% of the maximum possible F-score for the top 10 and top 50 relation instances, respectively. Finally, the thesis experiments also examine the portability of individual knowledge patterns and of the ranking heuristics. It is concluded that synonymy KPs are the most domain-independent, closely followed by ISA KPs, whereas patterns for "may_prevent" and especially "induces" are more dependent on the domain. Empirical experiments also suggest that a ranking heuristic which penalizes relation instances whose arguments occur frequently in a general language corpus can be highly effective, but may need to be adapted to the domain in question.

    KW - PhD theses

    KW - Computational linguistics

    KW - Terminology

    KW - Terminology extraction

    KW - Internet

    KW - Automatic retrieval

    KW - Ontologies

    KW - WWW2REL

    UR - https://primo.kb.dk/permalink/f/10k3fbj/CBS01000324280

    M3 - Ph.D. thesis

    SN - 9788759383377

    T3 - PhD series

    BT - The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW

    PB - Copenhagen Business School [Phd]

    CY - Frederiksberg

    ER -

    Halskov J. The Semi-automatic Expansion of Existing Terminological Ontologies using Knowledge Patterns Discovered on the WWW: An Implementation and Evaluation. Frederiksberg: Copenhagen Business School [Phd], 2007. 289 p. (PhD series; No. 28.2007).