Semi-automatic extraction of multiword terms from domain-specific corpora
Abstract
Purpose: A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.
Design/methodology/approach: The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts.
Findings: By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary.
Originality/value: The paper presents methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.
Keywords:
Digital documents / Data analysis / Evaluation / Information retrieval / Data processing / Foreign languages / Data retrieval / Document handling
Source:
Electronic Library, 2018, 36, 3, 550-567
Publisher:
- Emerald Group Publishing Ltd, Bingley
Funding / projects:
- Serbian Language and Its Resources: Theory, Description and Applications (RS-178006)
- Infrastructure for Technology Enhanced Learning in Serbia (RS-47003)
DOI: 10.1108/EL-06-2017-0128
ISSN: 0264-0473
WoS: 000434773400011
Scopus: 2-s2.0-85047317443
Collections
Institution/Community
Poljoprivredni fakultet
TY  - JOUR
AU  - Pajić, Vesna
AU  - Vujicić-Stanković, Stasa
AU  - Stanković, Ranka
AU  - Pajić, Miloš
PY  - 2018
UR  - http://aspace.agrif.bg.ac.rs/handle/123456789/4786
AB  - Purpose A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts. Design/methodology/approach The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts. Findings By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary. Originality/value The paper presents methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.
PB  - Emerald Group Publishing Ltd, Bingley
T2  - Electronic Library
T1  - Semi-automatic extraction of multiword terms from domain-specific corpora
EP  - 567
IS  - 3
SP  - 550
VL  - 36
DO  - 10.1108/EL-06-2017-0128
ER  -
@article{
author = "Pajić, Vesna and Vujicić-Stanković, Stasa and Stanković, Ranka and Pajić, Miloš",
year = "2018",
abstract = "Purpose A hybrid approach is presented, which combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts. Design/methodology/approach The method is designed to be domain and language independent, focusing on languages with rich morphology. Here, it is used for extracting multiword terms from texts in Serbian, belonging to the agricultural engineering domain, as a use case. Predefined syntactic structures were used for multiword terms. For each structure, a finite state transducer was developed, which recognizes text sequences having that structure and outputs the sequence in a normalized form, so that different inflectional forms of the same multiword term can be counted properly. Term candidates were further filtered by their frequencies and evaluated by two domain experts. Findings By using language resources, such as electronic dictionaries and grammars, 928 multiword terms were extracted out of 1,523 multiword terms that were recognized as candidates from a corpus having 42,260 different simple word forms; 870 of these were new, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich the dictionary. Originality/value The paper presents methodology that can significantly contribute to the development of terminology lexicons in different areas. In this particular use case, some important agricultural engineering concepts were extracted from the text, but this approach could be used for other domains and languages as well.",
publisher = "Emerald Group Publishing Ltd, Bingley",
journal = "Electronic Library",
title = "Semi-automatic extraction of multiword terms from domain-specific corpora",
pages = "550-567",
number = "3",
volume = "36",
doi = "10.1108/EL-06-2017-0128"
}
Pajić, V., Vujicić-Stanković, S., Stanković, R., & Pajić, M. (2018). Semi-automatic extraction of multiword terms from domain-specific corpora. Electronic Library, 36(3), 550-567. https://doi.org/10.1108/EL-06-2017-0128
Pajić V, Vujicić-Stanković S, Stanković R, Pajić M. Semi-automatic extraction of multiword terms from domain-specific corpora. Electronic Library. 2018;36(3):550-567. doi:10.1108/EL-06-2017-0128
Pajić, Vesna, Vujicić-Stanković, Stasa, Stanković, Ranka, Pajić, Miloš, "Semi-automatic extraction of multiword terms from domain-specific corpora" in Electronic Library, 36, no. 3 (2018):550-567, https://doi.org/10.1108/EL-06-2017-0128