Beáta Megyesi
Professor in Computational Linguistics (on leave of absence) at the Department of Linguistics and Philology
- Telephone: +46 18 471 78 60
- E-mail: Beata.Megyesi@lingfil.uu.se
- Visiting address: Engelska parken, Thunbergsvägen 3H
- Postal address: Box 635, 751 26 UPPSALA
- Leave of absence: 2023-08-01 - 2024-07-31
- ORCID: 0000-0002-4838-6518
Short presentation
I am a professor of computational linguistics, currently on leave from Uppsala University to hold a professorship at Stockholm University.
My main research areas are natural language processing and digital philology. I conduct research on historical cryptology, developing methods for automatically breaking historical ciphers. I also develop tools for analyzing historical and modern texts across genres, enabling large-scale quantitative studies in the humanities and social sciences.
Keywords
- digital humanities
- historical cryptology
- natural language processing
Biography
Education
- Professor of Computational Linguistics, Department of Linguistics and Philology, Uppsala University, 2021
- Associate Professor in Computational Linguistics, Department of Linguistics and Philology, Uppsala University, 2013
- PhD in Speech Communication, Department of Speech, Music and Hearing, KTH, 2002
- B.A. in Computational Linguistics, Department of Linguistics, Stockholm University, 2000
Appointments
Present:
- Vice-chair and member of the Linguistics review panel at the Swedish Research Council, 2021-2023
- Member of the nominating committee of the Northern European Association for Language Technology – NEALT, 2022-2025
- Vice-chair and member of the board of the Center for Digital Humanities, Uppsala University, 2021-2023
Past:
- President of the Northern European Association for Language Technology – NEALT, 2020-2021
- Head of Department of Linguistics and Philology, 2009-2018
- Director of the English Park Campus, Uppsala University, 2017-2018
- Vice-president of the Northern European Association for Language Technology – NEALT, 2018-2019
- Member of the board at the Department of Linguistics and Philology, 2007-2009, 2010-2012, 2012-2015, 2016-2018
- Member of the board of the faculty of languages, Uppsala University, 2008-2011, 2011-2014, 2019-2020
- Director of studies at the Department of Linguistics and Philology, 2007-2009
- Program coordinator for the Language Technology Program, Uppsala University, 2004-2007
- Member of the board at the Department of Speech, Music and Hearing, 2003-2004
Teaching
Basic level courses
- Languages, computers, and text processing (in Swedish)
- Advisor for Language Technology Project, 7.5 ECTS
- BA thesis supervision
Advanced level courses
- Research and Development, 15 ECTS
- Digital Philology, 5/7.5 ECTS
- Thesis work in language technology, 30 ECTS
- Advisor for Language Technology Project, 7.5 ECTS
- Master thesis supervision
PhD education
- Co-supervisor for Eva Petterson and Mojgan Seraji
Other things I like: my twins, traveling, Amnesty International, workouts such as skiing, piloxing and pump, books, cello, chocolate, margaritas and cosmos, ladies of jazz, The Bridges of Madison County, and of course my dearest best friends (girls, you know who you are!), and my (often empty) not-to-do list...
Things I don't like: greed, injustice, and master suppression techniques
Research
Research interests
- Historical Cryptology
- Digital Philology focusing on the automatic analysis of historical texts and student writings
- PoS tagging, morphological analysis, chunking, shallow parsing for different types of languages
- Parallel corpora and treebanks
- Text categorization
Projects
- DECRYPT: Decryption of historical manuscripts (PI, Vetenskapsrådet: 2018-2024)
- DECODE: Automatic decoding of historical manuscripts (PI, Vetenskapsrådet: 2015-2017)
- SweLL - L2 infrastructure: Research Infrastructure for Swedish as a second language (RJ, 2017-2019)
- SWE-CLARIN - SWEGRAM: Automatic annotation and analysis of Swedish texts (Swedish Research Council, 2014-2018, 2019-2022)
- Swedish treebank
- Grammar extraction
- Basic Language Resource Kit for Swedish
Publications
Selection of publications
- Proceedings of the 5th International Conference on Historical Cryptology (2022)
- Identifying Cleartext in Historical Ciphers (2022)
- The DECODE Database of Historical Ciphers and Keys: Version 2 (2022)
- Lost in Transcription of Graphic Signs in Ciphers (2022)
- What Was Encoded in Historical Cipher Keys in the Early Modern Era? (2022)
- Unsupervised Alphabet Matching in Historical Encrypted Manuscript Images (2021)
- Deciphering Papal Ciphers from the 16th to the 18th Century (2021)
- Transcription of Historical Ciphers and Keys (2021)
- A Web-based Interactive Transcription Tool for Encrypted Manuscripts (2020)
- Proceedings of the 3rd International Conference on Historical Cryptology (2020)
- Transcription of Historical Ciphers and Keys (2020)
- Decryption of historical manuscripts (2020)
- Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays (2020)
- Proceedings of the Workshop on NLP and Pseudonymisation (2019)
- Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts (2019)
- The DECODE Database: Collection of Historical Ciphers and Keys (2019)
- SWEGRAM: Annotering och analys av svenska texter (2019)
- Pseudonymization of Language Learner Data (2019)
- Matching Keys and Encrypted Manuscripts (2019)
- The SweLL Language Learner Corpus: From Design to Annotation (2019)
- Proceedings of the 1st International Conference on Historical Cryptology (2018)
- Learner Corpus Anonymization in the Age of GDPR (2018)
- The HistCorp Collection of Historical Corpora and Resources (2018)
- Annotation of learner corpora (2018)
- Transcription of Encoded Manuscripts with Image Processing Techniques (2017)
- SWEGRAM (2017)
- Annotating Errors in Student Texts (2017)
- The Uppsala Corpus of Student Writings (2016)
- A Friend in Need? (2016)
- Proceedings of the 20th Nordic Conference of Computational Linguistics (2015)
- A Multilingual Evaluation of Three Spelling Normalization Methods for Historical Text (2014)
- Professional language in Swedish clinical text (2014)
- The Secrets of the Copiale Cipher (2011)
- Swedish CLARIN Activities (2009)
- Using Parallel Corpora in Teaching and Research (2009)
- Language Resources and Tools for Swedish: A Survey (2008)
- Cultivating a Swedish Treebank (2008)
- General-Purpose Text Categorization Applied to the Medical Domain (2007)
- The Swedish-Turkish Parallel Corpus and Tools for its Creation (2007)
- Single Malt or Blended? A Study in Multilingual Parser Optimization (2007)
- A Study on Automatically Extracted Keywords in Text Categorization (2006)
- Exploring the Prosody-Syntax Interface in Conversations (2003)
- Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project (2002)
Recent publications
- Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023) (2023)
- Historical Language Models in Cryptanalysis: Case Studies on English and German (2023)
- What is the Code for the Code? Historical Cryptology Terminology (2023)
- Towards Data-effective Educational Question Generation with Prompt-based Learning (2023)
- Proceedings of the 5th International Conference on Historical Cryptology (2022)
All publications
Articles
- Keys with nomenclatures in the early modern Europe (2022)
- Few shots are all you need (2022)
- Deciphering Papal Ciphers from the 16th to the 18th Century (2021)
- Decryption of historical manuscripts (2020)
- The SweLL Language Learner Corpus: From Design to Annotation (2019)
- Parallel corpora and Universal Dependencies for Turkic (2015)
- Professional language in Swedish clinical text (2014)
- Bootstrapping a Persian Dependency Treebank (2012)
- The Secrets of the Copiale Cipher (2011)
- Shallow Parsing with PoS Taggers and Linguistic Features (2002)
Books
- Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023) (2023)
- Proceedings of the 5th International Conference on Historical Cryptology (2022)
- Proceedings of the 3rd International Conference on Historical Cryptology (2020)
- Proceedings of the Workshop on NLP and Pseudonymisation (2019)
- Proceedings of the 1st International Conference on Historical Cryptology (2018)
- Proceedings of the 20th Nordic Conference of Computational Linguistics (2015)
- Resourceful Language Technology (2008)
Chapters
- Supporting Research Environment for Less Explored Languages (2008)
- Cultivating a Swedish Treebank (2008)
Conferences
- Historical Language Models in Cryptanalysis: Case Studies on English and German (2023)
- What is the Code for the Code? Historical Cryptology Terminology (2023)
- Towards Data-effective Educational Question Generation with Prompt-based Learning (2023)
- Identifying Cleartext in Historical Ciphers (2022)
- The DECODE Database of Historical Ciphers and Keys: Version 2 (2022)
- Lost in Transcription of Graphic Signs in Ciphers (2022)
- What Was Encoded in Historical Cipher Keys in the Early Modern Era? (2022)
- Unsupervised Alphabet Matching in Historical Encrypted Manuscript Images (2021)
- Revealing Secrets from the Past: Studying Historical Ciphers (2021)
- Key Design in the Early Modern Era in Europe (2021)
- A Web-based Interactive Transcription Tool for Encrypted Manuscripts (2020)
- Transcription of Historical Ciphers and Keys (2020)
- Automatic Key Structure Extraction (2020)
- Towards Privacy by Design in Learner Corpora Research: A Case of On-the-fly Pseudonymization of Swedish Learner Essays (2020)
- Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts (2019)
- The DECODE Database: Collection of Historical Ciphers and Keys (2019)
- Pseudonymization of Language Learner Data (2019)
- Matching Keys and Encrypted Manuscripts (2019)
- Learner Corpus Anonymization in the Age of GDPR (2018)
- The HistCorp Collection of Historical Corpora and Resources (2018)
- Annotation of learner corpora (2018)
- Transcription of Encoded Manuscripts with Image Processing Techniques (2017)
- SWEGRAM (2017)
- Annotating Errors in Student Texts (2017)
- Swe-Clarin (2016)
- The Uppsala Corpus of Student Writings (2016)
- A Friend in Need? (2016)
- Ranking Relevant Verb Phrases Extracted from Historical Text (2015)
- Automatic Morphosyntactic Analysis of Clinical Text (2014)
- Verb Phrase Extraction in a Historical Context (2014)
- A Multilingual Evaluation of Three Spelling Normalization Methods for Historical Text (2014)
- EACL - Expansion of Abbreviations in CLinical text (2014)
- Normalization of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting (2013)
- An SMT Approach to Automatic Annotation of Historical Texts (2013)
- Parsing the Past - Identification of Verb Constructions in Historical Text (2012)
- Rule-Based Normalisation of Historical Text – a Diachronic Study (2012)
- A Basic Language Resource Kit for Persian (2012)
- Dependency Parsers for Persian (2012)
- The Copiale Cipher (2011)
- Using Parallel Corpora in Data-Driven Teaching of Turkish in Sweden (2010)
- The English-Swedish-Turkish Parallel Treebank (2010)
- Swedish CLARIN Activities (2009)
- The Open Source Tagger HunPoS for Swedish (2009)
- Using Parallel Corpora in Teaching and Research (2009)
- Language Resources and Tools for Swedish: A Survey (2008)
- Swedish-Turkish Parallel Treebank (2008)
- The Swedish-Turkish Parallel Corpus and Tools for its Creation (2007)
- Single Malt or Blended? A Study in Multilingual Parser Optimization (2007)
- Bootstrapping a Swedish Treebank Using Cross-Corpus Harmonization and Annotation Projection (2007)
- A Study on Automatically Extracted Keywords in Text Categorization (2006)
- Building a Swedish-Turkish Parallel Corpus (2006)
- Using Linguistic Data for Genre Classification (2005)
- Exploring the Prosody-Syntax Interface in Conversations (2003)
- The Acoustic and Morpho-Syntactic Context of Prosodic Boundaries in Dialogs (2003)
- Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project (2002)
- Silence and Discourse Context in Read Speech and Dialogues in Swedish (2002)
- Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish (2002)
- Data-Driven Methods for Building a Swedish Treebank (2002)
- A Comparative Study of Pauses in Dialogues and Read Speech (2001)
- Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish (2001)
- Data-Driven Methods for PoS tagging and Chunking of Swedish (2001)
- Phrasal Parsing by Using Data-Driven PoS Taggers (2001)
- Pausing in Dialogues and Read Speech: Speaker's Production and Listeners Interpretation (2001)
- Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora (2000)
- Towards a Finite-State Parser for Swedish (2000)
- Improving Brill's PoS Tagger for an Agglutinative Language (1999)
- Brill's PoS Tagger with Extended Lexical Templates for Hungarian (1999)
Reports
- SweLL Pseudonymization Guidelines (2021)
- Transcription of Historical Ciphers and Keys (2021)
- SweLL transcription guidelines, L2 essays (2021)
- Transcription of Historical Ciphers and Keys (2020)
- SWEGRAM: Annotering och analys av svenska texter (2019)
- Survey on Swedish Language Resources (2008)
- The Open Source Tagger HunPoS for Swedish (2008)
- Supporting Research Environment for Swedish and Turkish (2008)
- General-Purpose Text Categorization Applied to the Medical Domain (2007)
- Changing the tokenization in Talbanken to SUC2.0 (2007)
- Converting SUC2.0 to XCES with stand-off annotation (2007)