Abstracts

Wicked witches and wise wizards: Children’s literature as an interdisciplinary meeting point in digital research

Anna Čermáková

Gender is one of the fundamental structuring principles of our society. Gender is a social construction and as such changes over time. The change is visible through various social practices and legislation. How we conceive of gender, and how that conception develops, is manifested in discourse. Essentially, gender construction is reproduced and negotiated through language. Children’s literature is not just “literature” directed at child audiences; it also presents an important formative discourse. In this talk, I am going to investigate the gendered social structure of 19th-century and contemporary children’s literature. I will show how a corpus linguistic approach makes it possible to identify different layers of society and how characteristics of fictional social structures, which may include the likes of wicked witches and wise wizards, are shared across children’s books. I use two major data sources. For the analysis of the 19th century I use ChiLit, the 19th Century Children’s Literature corpus (4.4 million words, available from CLiC). For the contemporary data, I use a bigger corpus (12.9 million words) of contemporary children’s literature drawn from texts published after 2000 by Oxford University Press (the Oxford Children’s Corpus, OCC). The findings from these two datasets will be further contextualised by other resources. This paper addresses questions relevant for discourse analysis, literary analysis, stylistics, gender and childhood studies, and also social history. It aims to situate corpus linguistics across these fields and within the digital humanities more widely. I will argue in favour of incorporating a qualitative dimension through digital ‘close reading’ supported by state-of-the-art tools like CLiC.

Data protection in the human sciences

Arja Kuula-Luumi

Changes in data protection have been visible in many ways in the use of various services since the EU General Data Protection Regulation began to be applied in Europe in spring 2018. On top of getting to grips with the Regulation, we had more to learn from 1 January 2019, when the national Data Protection Act, which supplements and specifies the Regulation, entered into force. Both also substantially affect research that involves the processing of personal data.

My presentation covers the key points of applying data protection legislation to the processing of research data that contain personal data. These include, for example, determining the data controller for a study, choosing the legal basis for processing personal data, and informing research participants about the processing of their personal data. In addition, I explain the rights of the data subject (the research participant) under the General Data Protection Regulation and describe how these rights can be restricted on the basis of the national Data Protection Act. My focus is on data collected directly from research participants, but at the end of the presentation I also briefly discuss how the Regulation applies to the use of social media data in research. I illustrate the changed regulatory environment of data protection with concrete examples.

From bits and numbers to explanations – doing research on Internet-based big data

Veronika Laippala

The Internet is a constantly growing source of information that has already brought dramatic changes and possibilities to science. For instance, thanks to the billions of words available online, the quality of many natural language processing (NLP) systems, such as machine translation, has improved tremendously, and people’s beliefs, cultural changes and entire nations’ mindscapes can be explored on an unprecedented scale (see Tiedemann et al. 2016; Koplenig 2017; Lagus et al. 2018). Importantly, almost anyone can write on the Internet. Therefore, the web provides access to languages, language users, and communication settings that otherwise could not be studied (see Biber and Egbert 2018).

Paradoxically, the Internet’s extreme size and diversity also complicate its use as research data. Many Internet-based language resources, such as the Corpus of Global Web-Based English (GloWbE) or the web-crawled Finnish Internet Parsebank developed by our research group, are composed of billions of words. Even searching these databases requires specific tools, and analysing the search results in particular may not be straightforward. For instance, the Finnish word köyhä ‘poor’ has 209 609 occurrences in the Finnish Parsebank, and its English counterpart has 312 974 hits in GloWbE. These language resources readily provide bits and numbers, but how can they be explained?

In my talk, I will present some of the work our research group has done to adapt Internet-based data collections to research questions in the humanities, where numeric frequency results are just the beginning of the analysis. In particular, I will discuss our newly launched project on improving the usability of Internet-based big data, “A piece of news, an opinion or something else? Different texts and their automatic detection from the multilingual Internet”. The ultimate objective of the project is to develop a system that could automatically detect different text varieties, or registers (Biber 1988), such as user manuals, news, and encyclopedia articles, from online data. Currently, a Google search, for instance, can return an overwhelming number of documents of mostly unknown origin, and, similarly, the origins of the documents in web-crawled big data language collections are typically unknown. However, in order to explain research results obtained from these collections, information on the kinds of texts included in the data would be very useful, if not essential.

Identifying registers from the Internet involves a number of challenges. An essential prerequisite would be information on the registers to be detected. But what kinds of texts is the Internet composed of? A second concern, then, is that online texts do not follow the traditional print media boundaries (see Biber and Egbert 2018). For example, how can one distinguish texts that neutrally report scientific findings from those that use the information to persuade the reader? Additionally, text classification is typically based on manually labeled example documents representing the categories to be detected. However, developing this training data is very time-consuming and needs to be done separately for each language. Would it be possible to detect registers without all this manual work?


Biber, D. 1988. Variation across speech and writing. Cambridge University Press. Cambridge.

Biber, D. and J. Egbert 2018. Register variation online. Cambridge University Press. Cambridge.

Koplenig, A. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets—Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities, 32(1), 169–188.

Lagus, K., M. Pantzar, and M. Ruckenstein 2018. Kansallisen tunnemaiseman rakentuminen: Pelon ja ilon rytmit verkkokeskusteluissa. Kulutustutkimus. Nyt 1-2/2018.

Tiedemann, J., F. Cap, J. Kanerva, F. Ginter, S. Stymne, R. Östling, and M. Weller-Di Marco 2016. Phrase-based SMT for Finnish with more data, better models and alternative alignment and translation tools. Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, 391–398. Berlin, Germany. Association for Computational Linguistics.

Last updated: 1 March 2019