Research Infrastructure for Diachronic Czech Studies
Identification code: LM2015081
Acronym of Research Infrastructure: RIDICS
Research areas: social science and humanities (major), information and communication technologies
/ e-infrastructures (minor)
Hosting Institution: The Institute of the Czech Language of the Academy of Sciences of the Czech Republic,
v. v. i.
Legal Representative: PhDr. Martin Prošek, Ph.D.
Partner Institution: Czech Technical University in Prague, Faculty of Electrical Engineering
Responsible person: Dalibor Lehečka
Contact: vokabular@ujc.cas.cz
Conditions of access
Research infrastructure data and tools are freely available to all researchers, currently
via the web application Vokabulář webový at the address http://vokabular.ujc.cas.cz. The research infrastructure can be contacted via the e-mail address: vokabular@ujc.cas.cz.
The user is obliged to acknowledge all sources or tools obtained via the web application
in every publication as well as qualification, Ph.D. or habilitation thesis, namely
in the following recommended form:
“The present work makes use of the sources of the Research Infrastructure for Diachronic
Czech Studies (RIDICS, http://vokabular.ujc.cas.cz).”
When using a specific source available within The Web Vocabulary (Vokabulář webový),
the usual rules for citing web pages and contributions are to be followed.
Description of the research infrastructure
Research Infrastructure for Diachronic Czech Studies (RIDICS) will co-create and operate
two complementary web portals facilitating and inspiring the research in the field
of diachronic Czech studies (i.e. Czech from the earliest periods up to the late 18th
century) and other related fields in humanities. The first pillar will be the research
web portal designed for excellent research, which provides access to a vast number
of miscellaneous scientifically processed and analysed primary and secondary sources
provided with detailed metadata that will be gradually supplemented with lemmatization,
morphological tagging, etc. The portal will provide a virtual environment to facilitate
research on various aspects of Czech history: Czech language, culture, art,
etc. The primary emphasis will be laid on data accessibility and the development
of the tools suitable for linguistic research (e.g. in the areas of orthography, phonology,
morphology, word derivation, semantics, lexicology, onomastics, syntax, dialectology,
translatology, intertextual issues, etc.). The portal will also serve as a useful
resource for literary scholars, classical philologists, historians, science and art
historians, and other specialists, primarily in the humanities (e.g. philosophy, history
of medicine, history of law, biblical studies, geography, genealogy, etc.). Besides
the primary sources, the research portal will also provide other material (e.g. modern
diachronic dictionaries, scholarly literature) for more complex historically oriented
research. RIDICS will also develop and provide quality tools for accessing the offered
materials (full text search, corpus analysis tools, etc.). Collected materials will
also provide data for ICT specialists, especially as regards the development of tools
for processing nonstandard language data, analysing them automatically and searching
for relationships among them.
The second pillar of the research infrastructure (hereinafter referred to as “RI”)
will be represented by the community web portal that will enable the researchers to
share their research output with scholars, students and the general public alike (store
and make accessible scholarly works as well as electronic editions of primary sources)
and also to keep the community informed about events in the respective fields, to
discuss scholarly issues, etc., and by doing so to inspire further research both in
their own and in other related fields. The community portal will engage a number of specialists
in diachronic Czech studies (language and literature) and in the Czech Middle Ages and Early
Modern period to provide and share resources and expert discussion, and thus
support research coordination and establish (interdisciplinary) collaboration. RIDICS
will also develop and provide tools for the preparation of primary sources (template
for electronic editions, software for automated transcription) and for the research
proper. These are available in the form of web services, individual programs or add-ons
for programs with which researchers work on an everyday basis (text editors).
The process of integrating various materials will draw on available resources and
the intended web portals will aggregate data from these resources – e.g. topically
and temporally relevant bibliographical data from the Institute of the Czech Language,
the Institute of Czech Literature, and the Institute of History AS CR.
The research portal will provide the following services:
- access to digital copies of the resources, if permitted by copyright and other rights
owners
- access to full texts of primary and secondary sources in continuous form, i.e. in
the form of coherent webpage text, or in e-book form in the PDF and EPUB formats
- full text corpus access, with the employment of gradual lemmatization and morphological
tagging
- access to modern diachronic dictionaries (in the form of searchable full text, possibly
digital images) or lexical databases
- access to period diachronic dictionaries (in the form of searchable full text – transcribed,
possibly also transliterated – and digital images)
- access to digitised period Czech grammars in the form of digital images furnished
with metadata, possibly as full texts, if available
- access to the same source in alternative forms, if available (digital images, transliterated
and transcribed versions)
- gradual lemmatization and morphological tagging of texts written in historical Czech
- development and maintenance of a program for creating one's own standards of lemmatization and
morphological tagging
- spell check dictionaries of historical Czech for word processors (Microsoft Word,
OpenOffice/LibreOffice Writer)
- development and maintenance of a tool to process electronic editions (an add-on for
the Microsoft Word editor)
- development and maintenance of the Transcriptorium program to process transliterated
and transcribed versions of primary sources editions
- generating phonological variants of historical words
- a searchable database of topically appropriate bibliographical records
The community portal will provide the following services:
- storage and tools for the publication of users’ scholarly works
- storage of primary sources provided by users/researchers
- full text search of provided scholarly works and primary sources
- moderated discussion forums
- topical and/or authorial bibliographical registers
- topically appropriate record of scholarly conferences and other undertakings (public
lectures etc.)
Function of the RI
RIDICS will become the only publicly accessible internet research environment
focusing comprehensively on historical Czech, i.e. opening up not only primary sources
in a machine readable form, but also other related resources, such as modern diachronic
dictionaries and research outputs. The infrastructure will provide a large number
of linguistic resources (editions, dictionaries), processed specifically for its purposes.
Regarding the primary resources, these are often unique works, accessible only with
great difficulty. The research portal will employ source data in the XML format
(Extensible Markup Language) corresponding to the TEI P5 (Text Encoding Initiative) standard,
which will guarantee their standardization and seamless exchange with other platforms,
or their transformation into other formats. Transferring resources to computer-readable
form will make them available even to the less experienced researchers; moreover,
it will enable their further computer processing (analyses, statistics, e-book output,
etc.).
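As a rough illustration of the kind of further computer processing that TEI P5 source data permits, the following minimal Python sketch reads the title and the paragraph texts from a TEI document. The sample document and the function name read_edition are invented for this example and are not taken from RIDICS data or tools; only the TEI namespace URI and the standard-library calls are real.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature edition, invented for this sketch only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample edition</title></titleStmt>
      <publicationStmt><p>Illustrative only</p></publicationStmt>
      <sourceDesc><p>Invented for this sketch</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First <hi rend="italic">sample</hi> paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def read_edition(xml_text):
    """Return (title, list of paragraph texts) from a TEI document string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        namespaces=TEI_NS,
    )
    paragraphs = []
    for p in root.iterfind("tei:text/tei:body/tei:p", TEI_NS):
        # Join text from nested elements and normalize whitespace.
        paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = read_edition(SAMPLE)
    print(title)
    for paragraph in paragraphs:
        print(paragraph)
```

The same extracted text could then be passed on to statistics, search indexing or e-book generation; real editions would carry far richer markup than this sample.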
The analysis and tagging of the linguistic data within RIDICS is based on the assumption
that a language system of a particular period (both historical and modern) can be
adequately processed only when employing tools corresponding to such a system (e.g.
by having a different set of morphological categories). As a result of this, changes
in a language system must be taken into account from the earliest stages of the process
during which the tools for natural language processing are designed. In the contemporary
language processing tools, such an aspect is usually missing.
One of the aims of the infrastructure will be to explore intertextual relations between
miscellaneous primary and secondary sources which will contribute to easier understanding
and processing of the provided data. As opposed to the study of modern languages,
interlinking of the relevant data is especially important for the research of a historical
language because in this case, researchers are able to use their language competence
only to a limited degree. Foreign researchers for whom Czech is not their first language
will benefit even more from such data interconnection. Such an interconnection between
the respective materials (e.g. modern Czech diachronic dictionaries, full texts,
bibliographies, scholarly literature) has not yet been employed in any of the known
and publicly accessible sources. The mutual interlinking of the relevant sources and/or
passages from other documents (e.g. between a quotation from the primary source cited
in a scholarly article and the same quote in an e-edition) will make it possible to reveal
heretofore unknown connections.
During the process of developing applications, linguistic tools and user aids, RIDICS
will concentrate on making them easy to use even for the technically less advanced
researchers who are true experts in the linguistic aspects of a researched topic
but would otherwise be discouraged from using overly complicated or incomprehensible tools.
Materials and tools will be made freely available which will lead to the creation
of more primary sources to be consequently used in further research.
RIDICS will provide data and other groundwork for research in:
- phonological, morphological, syntactical and semantic shifts in historical Czech
- orthographical system in historical Czech
- lexical system of historical Czech
- terminology of fields represented in primary texts (medicine, law, philosophy, etc.)
- lexicographical methods in period dictionaries
- relations among various lexicographical fields
- development of grammar description of Czech
- changes of literary forms and genres
- historical realia (individuals, places)
- transfer and changes of motifs
- topical relationships among primary texts
- relationship between a literary monument and its witnesses (textual variability)
- relationship between a Czech translation and a foreign-language original
The research portal will also provide researchers with software tools for primary
research, particularly for processing electronic editions of primary texts (an electronic
edition template, the Transcriptorium program, and databases, applicable to different
historical periods, for the transfer from a transliterated copy into a transcribed
form).
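The idea of period-specific databases for converting a transliterated copy into a transcribed form can be sketched as a simple rule-based replacement. The sketch below is only an assumption about how such a conversion might work in principle: the period labels, the rule tables and the function name transcribe are invented for illustration and do not describe the Transcriptorium program or any RIDICS data. Real conversion rules depend on context and on the individual source.

```python
# Hypothetical rule tables, one per invented period label.
# Each table maps a transliterated spelling to a transcribed one.
RULES_BY_PERIOD = {
    "period_a": {"cz": "č", "rz": "ř", "w": "v"},
    "period_b": {"w": "v"},
}


def transcribe(word, period):
    """Apply the period's rules left to right, preferring the longest match."""
    rules = RULES_BY_PERIOD[period]
    # Try longer spellings first so that digraphs win over single letters.
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule applies at this position: keep the character as it is.
            out.append(word[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for form, period in [("czas", "period_a"), ("czas", "period_b")]:
        print(form, period, transcribe(form, period))
```

Keeping the rules in a separate table per period means the same procedure can serve different historical periods simply by switching the table, which is the design point the sketch is meant to show.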
RIDICS will also function in higher education as a source of study materials:
period Czech grammars, primary sources (scanned, in transliterated and transcribed
form – this will not only facilitate an effective search but will also serve for paleographic
studies), modern Czech diachronic dictionaries; and a source of the knowledge of historical
Czech and related fields: an overview of historical phonological changes, formal description
of Czech diachronic morphology, topical papers in the scholarly literature section,
a searchable bibliography database, moderated discussions. University students will
make use of these portals while working on their undergraduate and postgraduate theses.
The community portal will enable students to publish the output of their research. As
a part of their university studies, students will participate in the creation of the
infrastructure content and software components, and by so doing, they will become
acquainted with the field of historical Czech and its computational processing, the
editorial principles of the digital editions, the Old-Czech morphology and its formal
description, etc.
Interdisciplinary collaboration will be facilitated firstly by a resource repertoire
which will form part of the research portal, and secondly by the community portal bringing
together researchers in various fields.
RIDICS responds to the current state of affairs in diachronic Czech studies, where
no consolidated source of information is available on existing and currently running
research. Similarly, a register of existing electronic editions of primary sources
and of editions currently in preparation is lacking. RIDICS will also accelerate and improve
the processing of electronic editions of primary sources. The editions can subsequently be
employed in further research in the field, e.g. in the form of corpus data. Both web
portals and the tools for research workers will serve to create more data for research
in diachronic Czech studies and related disciplines, which will result in greater
relevance of subsequent research and, last but not least, in an increased interest
in the field.
Relationship of the RI to the International Research Area
For foreign researchers, RIDICS will be the main source of primary materials for the
study of historical Czech which are usually not available in the required quality
and number. Regarding primary sources being made available together with their foreign
language elements (especially Latin and German in dictionaries and scholarly literature),
such material will become the source of knowledge even for other national languages.
RIDICS as a new RI will cooperate with international RIs such as DARIAH (Digital Research
Infrastructure for the Arts and Humanities; https://www.dariah.eu), DiXiT (Digital Scholarly Editions Initial Training Network; http://dixit.uni-koeln.de) or ENeL (European Network of e-Lexicography; http://www.elexicography.eu),
especially in the area of digitization of primary and secondary sources.
Through research on automatic lemmatization and morphological tagging
of historical Czech based on a formal description of historical morphology (rather than
on the adaptation of tools for modern languages), RIDICS will provide a platform for
cooperation with researchers working with historical texts, especially those written in the
West Slavic languages.
Utilization and the RI output
The planned output of RIDICS is in accord with the long-term research objectives of the
Institute of the Czech Language (ICL), especially with the research on Old and Middle Czech
lexis and the publication of its results in the form of electronic dictionaries, and with the
analysis and edition of Old and Middle Czech literary works and the building of text corpora from them.
The added value of the RI lies especially in analysing possibilities of intertextual
linking of various data and making them accessible in order to enable the study of
historical Czech. The development of the tools used for creating electronic editions
(MS Word template, Transcriptorium) will make the preparation of primary sources for
research purposes in the given field better and faster. The tool for
lemmatization and morphological tagging of historical Czech will offer a gradually
built morphological description of the surveyed period, including the tendencies in
the development of individual changes, and will make research work with the linguistic
material less demanding (e.g. searching, focusing on specific phenomena at the morphological
or morphosyntactic level). A detailed and uniform treatment of historical dictionaries
will allow for the comparison of the individual works; the development of the Czech
lexicographic tradition can also be observed. The community web portal will enable
scholars from various disciplines to work closely together. Moderated forum discussions
are intended to ensure that anyone who asks a question receives an answer.
Considering that RIDICS is specifically focused on historical Czech and the related
fields of study, no research results are known at this moment that could be employed
in the commercial sector.
The intended outputs of the RI will be employed in the humanities,
especially in Czech language and literary studies, history, history of art, the auxiliary
sciences of history and other history-oriented fields in the humanities. Other outputs
will be employed in fields of the humanities focused on teaching. The RI will have a significant
impact on the area of information technologies (automated morphological tagging and lemmatization).
In 2016–2019, the RIDICS team members will organize 6 lectures for both the scholarly
and the general public and 2 colloquia, and will participate in 8 conferences and seminars
organized by other institutions. We are also planning remote presentations of the RI in
the form of video recordings or video conferences. At least 1 publication will come
out in a public journal and 1 media appearance will take place. Annually, approximately
5 contributions describing the results of the RI development will be presented in
the form of scholarly journal articles, conference papers and lectures intended for
an academic audience. Furthermore, we are planning applied research outputs, namely
in the form of software and other program applications, whose number will depend
on the progress of the RI's realization. In 2016–2019, we are planning to host at least
2 workshops or scholarly colloquia, aimed at collecting initial information and requirements
from future RI users and at informing them of the progress of the RI's realization.
Research and other cooperation within RI
There is currently no alternative infrastructure with the intended purpose, at either
the national or the international level. We can speak of a synergistic effect in relation
to the LINDAT/CLARIN and CNC projects, whose purpose is complementary to Czech language
research. RIDICS and CNC intend to share their data and metadata and to cooperate on
the development of software tools. Collaboration is also planned with other institutions
and research infrastructures, especially in the area of aggregating bibliographic records
(BHCL – Bibliography of the History of the Czech Lands; CLB – Bibliography of Czech
Literary Studies).
RIDICS will cooperate with other research institutions and universities whose research
encompasses diachronic Czech studies. Among these are, e.g., the Institute of Czech
Literature AS CR, the Centre for Medieval Studies or the Centre for Classical Studies at the
Institute of Philosophy of the Czech Academy of Sciences, the Faculties of Arts at Masaryk
University and the University of Ostrava, Philosophische Fakultät – Universität Tübingen,
Philologisch-Kulturwissenschaftliche Fakultät – Universität Wien, etc. The cooperation
will take the form of consultations on specific research topics, specialist preparation
of research data, guest lectures, running seminars, etc. The RI also plans
to contribute to the preparation of new projects (e.g. grant projects) through consultations
on source preparation, as well as to participate in projects under way.