Research Infrastructure for Diachronic Czech Studies

Identification code: LM2015081

Acronym of Research Infrastructure: RIDICS

Research areas: social science and humanities (major), information and communication technologies / e-infrastructures (minor)

Hosting Institution: The Institute of the Czech Language of the Academy of Sciences of the Czech Republic, v. v. i.

Legal Representative: PhDr. Martin Prošek, Ph.D.

Partner Institution: Czech Technical University in Prague, Faculty of Electrical Engineering

Responsible person: Dalibor Lehečka

Contact: vokabular@ujc.cas.cz

Conditions of access

Research infrastructure data and tools are freely available to all researchers, currently via the web application Vokabulář webový at the address http://vokabular.ujc.cas.cz. The research infrastructure can be contacted via the e-mail address: vokabular@ujc.cas.cz.

Users are obliged to acknowledge all sources and tools obtained via the web application in every publication, including qualification, Ph.D. and habilitation theses, in the following recommended form:

“The present work makes use of the sources of the Research Infrastructure for Diachronic Czech Studies (RIDICS, http://vokabular.ujc.cas.cz).”

When using a specific source available within The Web Vocabulary (Vokabulář webový), the usual rules for citing web pages and contributions are to be followed.

Description of the research infrastructure

The Research Infrastructure for Diachronic Czech Studies (RIDICS) will co-create and operate two complementary web portals that facilitate and inspire research in diachronic Czech studies (i.e. the study of Czech from the earliest periods up to the late 18th century) and in related fields of the humanities. The first pillar will be a research web portal designed for excellent research. It will provide access to a large number of diverse, scientifically processed and analysed primary and secondary sources with detailed metadata, which will be gradually supplemented with lemmatization, morphological tagging, etc. The portal will provide a virtual environment to facilitate research on various aspects of Czech history: Czech language, culture, art, etc. The primary emphasis will be on data accessibility and on the development of tools suitable for linguistic research (e.g. in orthography, phonology, morphology, word derivation, semantics, lexicology, onomastics, syntax, dialectology, translatology and intertextual issues). The portal will also serve as a useful resource for literary scholars, classical philologists, historians, historians of science and art, and other specialists, primarily in the humanities (e.g. philosophy, history of medicine, history of law, biblical studies, geography and genealogy). Besides the primary sources, the research portal will also provide other material (e.g. modern diachronic dictionaries and scholarly literature) for more complex historically oriented research. RIDICS will also develop and provide quality tools for accessing the offered materials (full-text search, corpus analysis tools, etc.). The collected materials will also provide data for ICT specialists, especially for the development of tools that process nonstandard language data, analyse them automatically and search for relationships among them.

The second pillar of the research infrastructure (hereinafter referred to as “RI”) will be a community web portal. It will enable researchers to share their research output with scholars, students and the general public alike (to store and make accessible scholarly works as well as electronic editions of primary sources), to keep the community informed about events in the respective fields, to discuss scholarly issues, etc., and thereby to inspire further research both in their own and in other related fields. The community portal will engage a number of specialists in diachronic Czech studies (language and literature) and in the Czech Middle Ages and Early Modern period to provide and share resources and expert discussion, and thus to support research coordination and establish (interdisciplinary) collaboration. RIDICS will also develop and provide tools for the preparation of primary sources (a template for electronic editions, software for automated transcription) and for the research proper. These will be available in the form of web services, individual programs, or add-ons for programs with which researchers work on an everyday basis (text editors).

The process of integrating various materials will draw on available resources, and the intended web portals will aggregate data from these resources – e.g. topically and temporally relevant bibliographical data from the Institute of the Czech Language, the Institute of Czech Literature, and the Institute of History AS CR.

The research portal will provide the following services:

  1. access to digital copies of the resources, if permitted by copyright and other rights owners
  2. access to full texts of primary and secondary sources in continuous form, i.e. in the form of coherent webpage text, or in e-book form in the PDF and EPUB formats
  3. full text corpus access, with the employment of gradual lemmatization and morphological tagging
  4. access to modern diachronic dictionaries (in the form of searchable full text, possibly digital images) or lexical databases
  5. access to period diachronic dictionaries (in the form of searchable full text – transcribed, possibly also transliterated – and digital images)
  6. access to digitised period Czech grammars in the form of digital images furnished with metadata, possibly as full texts, if available
  7. access to the same source in alternative forms, if available (digital images, transliterated and transcribed versions)
  8. gradual lemmatization and morphological tagging of texts written in historical Czech
  9. development and maintenance of a program for defining custom standards of lemmatization and morphological tagging
  10. spell check dictionaries of historical Czech for word processors (Microsoft Word, OpenOffice/LibreOffice Writer)
  11. development and maintenance of a tool to process electronic editions (an add-on for the Microsoft Word editor)
  12. development and maintenance of the Transcriptorium program to process transliterated and transcribed versions of editions of primary sources
  13. generating phonological variants of historical words
  14. a searchable database of topically appropriate bibliographical records

The community portal will provide the following services:

  1. storage and tools for the publication of users’ scholarly works
  2. storage of primary sources provided by users/researchers
  3. full text search of provided scholarly works and primary sources
  4. moderated discussion forums
  5. topical and/or authorial bibliographical registers
  6. a topically appropriate record of scholarly conferences and other events (public lectures, etc.)

Function of the RI

RIDICS will become the exclusive, publicly accessible internet research environment focusing on historical Czech comprehensively, i.e. opening up not only primary sources in machine-readable form, but also other related resources, such as modern diachronic dictionaries and research outputs. The infrastructure will provide a large number of linguistic resources (editions, dictionaries) processed specifically for its purposes. The primary sources are often unique works that are accessible only with great difficulty. The research portal will employ source data in XML (Extensible Markup Language) corresponding to the TEI P5 (Text Encoding Initiative) standard, which will guarantee their standardization and seamless exchange with other platforms, or their transformation into other formats. Transferring resources to machine-readable form will make them available even to less experienced researchers; moreover, it will enable their further computer processing (analyses, statistics, e-book output, etc.).
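As a minimal sketch of how source data in TEI P5 XML could be read for further processing, the following Python example parses a small invented TEI fragment with the standard library only. The sample document, its title and the function name extract_title_and_paragraphs are illustrative assumptions and do not represent actual RIDICS data or tools; only the TEI namespace is taken from the TEI standard.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; real editions carry far richer metadata and markup.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample edition</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph of the edition.</p>
      <p>Second paragraph of the edition.</p>
    </body>
  </text>
</TEI>"""


def extract_title_and_paragraphs(tei_xml: str) -> tuple[str, list[str]]:
    """Return the edition title and the plain text of each body paragraph."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS
    )
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.findall("tei:text/tei:body/tei:p", TEI_NS)
    ]
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = extract_title_and_paragraphs(SAMPLE_TEI)
    print(title)
    for paragraph in paragraphs:
        print("-", paragraph)
```

Because the structure is standardized, the same few lines could feed a full-text index, a statistics script or an e-book converter without knowing anything about the particular edition.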

The analysis and tagging of the linguistic data within RIDICS is based on the assumption that the language system of a particular period (whether historical or modern) can be adequately processed only with tools that correspond to that system (e.g. by having a different set of morphological categories). Consequently, changes in the language system must be taken into account from the earliest stages of designing natural language processing tools. Contemporary language processing tools usually lack this aspect.
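The idea of period-specific category sets can be illustrated with a short Python sketch. The class Tagset, the two inventories and the category names are simplified assumptions made for illustration (Old Czech had, for example, a dual number and aorist and imperfect tenses that modern Czech lacks); they are not the RIDICS tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagset:
    """Morphological categories assumed valid for one language period."""

    period: str
    numbers: frozenset[str]
    tenses: frozenset[str]

    def accepts(self, number: str, tense: str) -> bool:
        """Check whether a verb tag is expressible in this period's system."""
        return number in self.numbers and tense in self.tenses


# Simplified, illustrative inventories (assumptions, not a full description).
OLD_CZECH = Tagset(
    period="old_czech",
    numbers=frozenset({"singular", "dual", "plural"}),
    tenses=frozenset({"present", "aorist", "imperfect", "past", "future"}),
)
MODERN_CZECH = Tagset(
    period="modern_czech",
    numbers=frozenset({"singular", "plural"}),
    tenses=frozenset({"present", "past", "future"}),
)

if __name__ == "__main__":
    for tagset in (OLD_CZECH, MODERN_CZECH):
        ok = tagset.accepts("dual", "aorist")
        print(f"{tagset.period}: dual aorist accepted = {ok}")
```

A tagger built only on the modern inventory would have no way to label a dual aorist form, which is why the period's own category set has to be part of the design from the start.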

One of the aims of the infrastructure will be to explore intertextual relations between diverse primary and secondary sources, which will make the provided data easier to understand and process. In contrast to the study of modern languages, interlinking of the relevant data is especially important for research on a historical language, because researchers can rely on their own language competence only to a limited degree. Foreign researchers for whom Czech is not their first language will benefit even more from such interconnection. Such an interconnection between the respective materials (e.g. modern Czech diachronic dictionaries, full texts, bibliographies, scholarly literature) has not yet been employed in any of the known and publicly accessible sources. The mutual interlinking of relevant sources and/or passages from other documents (e.g. between a quotation from a primary source cited in a scholarly article and the same quotation in an e-edition) will make it possible to reveal heretofore unknown connections.
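One simple way to link a quotation in a scholarly article to the same passage in an e-edition is to compare normalized text. The following Python sketch assumes that an edition is available as a mapping from passage identifiers to text; the function names, identifiers and sample sentences are hypothetical, and real linking of historical texts would also have to cope with spelling variation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def find_quotation(quote: str, passages: dict[str, str]) -> list[str]:
    """Return identifiers of passages that contain the quotation as whole words."""
    needle = normalize(quote)
    if not needle:
        return []
    # Padding with spaces prevents matches that start or end inside a word.
    padded_needle = f" {needle} "
    return [
        passage_id
        for passage_id, passage in passages.items()
        if padded_needle in f" {normalize(passage)} "
    ]


if __name__ == "__main__":
    edition = {
        "p1": "In the beginning, the scribe wrote: a plain example sentence.",
        "p2": "Another passage without it.",
    }
    print(find_quotation("A plain   example sentence", edition))
```

Exact matching after normalization is only the simplest option; fuzzy matching or matching on lemmas would be natural refinements once lemmatized texts are available.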

When developing applications, linguistic tools and user aids, RIDICS will concentrate on making them easy to use even for technically less advanced researchers, who are true experts in the linguistic aspects of a researched topic but would otherwise be discouraged from using overly complicated or incomprehensible tools.

Materials and tools will be made freely available, which will lead to the creation of more primary sources to be subsequently used in further research.

RIDICS will provide data and other groundwork for research in:

  1. phonological, morphological, syntactical and semantic shifts in historical Czech
  2. orthographical system in historical Czech
  3. lexical system of historical Czech
  4. terminology of fields represented in primary texts (medicine, law, philosophy, etc.)
  5. lexicographical methods in period dictionaries
  6. relations among various lexicographical fields
  7. development of the grammatical description of Czech
  8. changes of literary forms and genres
  9. historical realia (individuals, places)
  10. transfer and changes of motifs
  11. topical relationships among primary texts
  12. relationship between a literary monument and its witnesses (textual variability)
  13. relationship between a Czech translation and a foreign-language original

The research portal will also provide researchers with software tools for primary research, particularly for processing electronic editions of primary texts (an electronic edition template, the Transcriptorium program, and databases, applicable to different historical periods, for converting a transliterated copy into a transcribed form).
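The conversion of a transliterated copy into a transcribed form with period-specific rule sets can be sketched as a table-driven substitution. The Python example below is only an illustration under stated assumptions: the period labels, the rule tables and the function transcribe are hypothetical and heavily simplified, and they do not describe how the Transcriptorium program works. Real rules depend on the period, the scribe and the editorial principles, and many spellings are ambiguous.

```python
# Hypothetical, heavily simplified rule tables keyed by orthographic period.
RULES_BY_PERIOD: dict[str, dict[str, str]] = {
    # Older digraph spelling (illustrative subset).
    "digraph": {"cz": "č", "rz": "ř", "w": "v"},
    # Later printed orthography (illustrative subset).
    "brethren": {"au": "ou", "w": "v", "g": "j", "ſ": "s"},
}


def transcribe(word: str, period: str) -> str:
    """Rewrite a transliterated word in one left-to-right pass.

    At each position the longest matching rule wins, and rewritten output
    is never rewritten again, so the rules cannot feed into one another.
    """
    rules = RULES_BY_PERIOD[period]
    keys = sorted(rules, key=len, reverse=True)
    output: list[str] = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                output.append(rules[key])
                i += len(key)
                break
        else:
            # No rule applies here: copy the character unchanged.
            output.append(word[i])
            i += 1
    return "".join(output)


if __name__ == "__main__":
    print(transcribe("rzeka", "digraph"))
    print(transcribe("geſt", "brethren"))
```

Keeping the rules in data rather than in code is what makes one program applicable to different historical periods: supporting another period means adding another table.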

RIDICS will also function in higher education as a source of study materials: period Czech grammars, primary sources (scanned and in transliterated and transcribed form – this will not only facilitate effective searching but will also serve palaeographic studies) and modern Czech diachronic dictionaries. It will also be a source of knowledge of historical Czech and related fields: an overview of historical phonological changes, a formal description of Czech diachronic morphology, topical papers in the scholarly literature section, a searchable bibliography database, and moderated discussions. University students will make use of these portals while working on their undergraduate and postgraduate theses. The community portal will enable students to publish the output of their research. As part of their university studies, students will participate in the creation of the infrastructure content and software components, and in so doing they will become acquainted with the field of historical Czech and its computational processing, the editorial principles of digital editions, Old Czech morphology and its formal description, etc.

Interdisciplinary collaboration will be facilitated firstly by a resource repertoire, which will form part of the research portal, and secondly by the community portal, which will bring together researchers in various fields.

RIDICS responds to the current state of affairs in diachronic Czech studies, where no consolidated source of information is available on existing and ongoing research. Similarly, a register of existing electronic editions of primary sources and of editions currently in preparation is lacking. RIDICS will also accelerate and improve the processing of electronic editions of primary sources. The editions can subsequently be employed in further research in the field, e.g. in the form of corpus data. Both web portals and the tools for researchers will serve to create more data for research in diachronic Czech studies and related disciplines, which will result in greater relevance of subsequent research and, last but not least, in increased interest in the field.

Relationship of the RI to the International Research Area

For foreign researchers, RIDICS will be the main source of primary materials for the study of historical Czech, which are otherwise usually not available in the required quality and quantity. Because primary sources will be made available together with their foreign-language elements (especially Latin and German in dictionaries and scholarly literature), this material will also become a source of knowledge for the study of other national languages.

As a new RI, RIDICS will cooperate, especially in the digitization of primary and secondary sources, with international RIs such as DARIAH (Digital Research Infrastructure for the Arts and Humanities, https://www.dariah.eu), DiXiT (Digital Scholarly Editions Initial Training Network; http://dixit.uni-koeln.de) or ENeL (European Network of Lexicography; http://www.elexicography.eu).

Through research on automatic lemmatization and morphological tagging of historical Czech based on a formal description of historical morphology (rather than on the adaptation of tools for modern languages), RIDICS will provide a platform for cooperation with researchers working with historical texts, especially those written in West Slavic languages.

Utilization and the RI output

The planned output of RIDICS is in accord with the long-term research objectives of the Institute of the Czech Language (ICL), especially with research on the Old and Middle Czech lexis and the publication of its results in the form of electronic dictionaries, and with the analysis and edition of Old and Middle Czech literary works and the building of text corpora from them.

The added value of the RI lies especially in analysing the possibilities of intertextual linking of various data and in making the data accessible in order to enable the study of historical Czech. The development of the tools used for creating electronic editions (MS Word template, Transcriptorium) will make the preparation of primary sources for research purposes in the given field better and faster. The tool for lemmatization and morphological tagging of historical Czech will offer a gradually built morphological description of the surveyed period, including the tendencies in the development of individual changes, and will make research work with the linguistic material less demanding (e.g. searching, or focusing on specific phenomena at the morphological or morphosyntactic level). A detailed and uniform treatment of historical dictionaries will allow individual works to be compared and the development of the Czech lexicographic tradition to be observed. The community web portal will enable scholars from various disciplines to work closely together. Moderated forum discussions are intended to ensure that anybody who asks a question gets an answer.

Considering that RIDICS is specifically focused on historical Czech and related fields of study, no research results are known at this moment that could be employed in the commercial sector.

The intended outputs of the RI will be employed in the humanities, especially in Czech language and literary studies, history, history of art, auxiliary sciences of history and other history-oriented fields. Other outputs will be employed in fields of the humanities focused on teaching. The RI will have a significant impact on information technologies (automated morphological tagging and lemmatization). In 2016–2019, the RIDICS team members will organize 6 lectures for both scholarly and general audiences and 2 colloquia, and will participate in 8 conferences and seminars organized by other institutions. Remote presentations of the RI in the form of video recordings or video conferences are also planned. At least 1 publication will come out in a public journal and 1 media appearance will take place. Annually, approximately 5 contributions describing the results of the RI development will be presented in the form of scholarly journal articles, conference papers and lectures intended for an academic audience. Furthermore, applied research outputs are planned, namely software and other program applications, whose number will depend on the progress of the RI realization. In 2016–2019, we are planning to host at least 2 workshops or scholarly colloquia aimed at collecting initial information and requirements from future RI users and at informing them of the progress of the RI realization.

Research and other cooperation within RI

There is currently no alternative infrastructure with the intended purpose at either the national or the international level. A synergic effect can be expected in relation to the LINDAT/CLARIN and CNC (Czech National Corpus) projects, whose purpose is complementary in the area of Czech language research. RIDICS and CNC intend to share their data and metadata and to cooperate on the development of software tools. Collaboration is also planned with other institutions and research infrastructures, especially in the aggregation of bibliographic records (BHCL – Bibliography of the History of the Czech Lands; CLB – Bibliography of Czech Literary Studies).

RIDICS will cooperate with other research institutions and universities whose research encompasses diachronic Czech studies. Among these are, e.g., the Institute of Czech Literature AS CR, the Centre for Medieval Studies and the Centre for Classical Studies at the Institute of Philosophy of the Czech Academy of Sciences, the Faculties of Arts at Masaryk University and the University of Ostrava, Philosophische Fakultät – Universität Tübingen, Philologisch-Kulturwissenschaftliche Fakultät – Universität Wien, etc. The cooperation will take the form of consultations on specific research topics, specialist preparation of research data, guest lectures, seminars, etc. The RI also plans to contribute to the preparation of new projects (e.g. grant projects) through consultations on source preparation, and to participate in ongoing projects.

Copyright © 2006–2017, Department of Language Development, Institute of the Czech Language of the Czech Academy of Sciences, v. v. i.
Search software © 2006–2017, Boris Lehečka; graphic design © 2006–2017, Irena Fuková

Vokabulář webový was launched 10 years, 8 months and 9 days ago; data version: 1.1.2
This website is supported by project No. LM2015081 of the Ministry of Education, Youth and Sports,
“Research Infrastructure for Diachronic Czech Studies” (acronym RIDICS), within the Projects of Large Infrastructures for Research, Experimental Development and Innovation programme.