CHERN Training School blog 
By: Junhua Zhu (Centre for East Asian Studies, University of Turku – doctoral researcher)

This blog post is based on Eric Zhang’s webinar “Contemporary Chinese Digital Materials in the Big-data Era”, which was arranged by CHERN; the Nordic Institute for Asian Studies (NIAS), University of Copenhagen; and the Centre for East Asian Studies (CEAS), University of Turku. Eric is a junior researcher at LeidenAsiaCentre with a background in political science and international relations. His current work mainly involves research on China’s economic footprint in Europe, the impact of digital technologies, and cybersecurity, with regional focuses on China and Eastern Europe. He has a particular interest in incorporating computational methods into policy-relevant academic research. You can reach him via email,

Covid-19-related regulations in China are still quite tight, which creates significant obstacles to conducting research on China for most researchers residing outside the country (and sometimes even for those inside it). Skyrocketing flight ticket prices, at least two weeks of compulsory quarantine, and multiple required test documents, although aimed at keeping the virus at bay, prevent China researchers from entering the country and carrying out fieldwork, observations, or interviews. While these traditional social science methods are unavailable, digital methods and materials become more important. Yet researchers who are used to relying on semi-structured interviews as their main method may not know what materials are available or how to access them. This is why CHERN, together with NIAS and CEAS, organized a webinar and invited Eric Zhang to introduce new types of data and data collection methods to those in need of alternative methodologies.


The whole process of using digital materials for scientific research involves the following steps: 1) identifying the sources for data collection, 2) obtaining the data (web scraping, APIs, databases), 3) storing and cleaning the raw data, and finally 4) interpreting and analyzing the data. The latter two steps are similar to how data are handled in interview-based research. Thus, this blog post focuses on the first two steps and presents the three types of new digital materials, and their respective data collection methods, that Eric introduced during the webinar.


  1. Governmental web pages

Governmental web pages are an important data source for those interested in policy analysis, thematic analysis of government narratives, responses, and attitudes, or other text-based research on governmental documents. Eric took the web pages of the Ministry of Foreign Affairs (MFA) as an example to show how this public source can be fully utilized. To start with, those who plan to analyze China’s foreign policy can easily find the official transcripts of MFA regular press conferences and spokespersons’ remarks, which are very important materials if you are interested in China’s foreign policy narratives or the official positions taken by the Chinese government on different issues in international relations. Another merit of this source is that the textual data is available in all six official UN languages, namely Arabic, Chinese, English, French, Russian, and Spanish. This is convenient for China Studies researchers who do not usually use Chinese texts as primary sources.


How to collect the data, then? According to Eric, MFA textual data can be collected with basic HTML web-scraping tools, such as rvest (for R users) or Beautiful Soup (for Python users). The scraping process is also rather straightforward. You first apply a link-scraping function to all title pages of the material you are collecting. After acquiring all the links to the transcript web pages, you then build a function to scrape the desired textual data. Once you obtain your raw data, you can start processing and analyzing it, for example by separating journalists’ questions from the MFA’s answers. The computational method does not sound too difficult, does it? Note that the availability of certain materials may vary depending on the language. For instance, the Chinese transcripts of regular press conferences are available for approximately the most recent four years, while the English ones are available for approximately two years. Nonetheless, some earlier press conferences are available through the Wayback Machine web archive (Mochtak & Turcsanyi, 2021). With the Wayback Machine’s API, researchers can extract information that has been collected in its archive over the archived period.
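To make these steps more concrete, here is a minimal Python sketch using Beautiful Soup. It is not Eric’s actual code. The CSS selector, the "Outlet: text" paragraph layout, and all function names are assumptions made for illustration, so inspect the real pages and adjust them. The last two helpers build a query for the Wayback Machine’s CDX API and turn its JSON response into snapshot links.

```python
import re
from urllib.parse import urlencode, urljoin

from bs4 import BeautifulSoup

# Hypothetical layout assumptions; inspect the real pages and adjust.
LINK_SELECTOR = "ul.list a[href]"  # where transcript links sit on a title page
SPEAKER = re.compile(r"^([A-Z][^:]{0,40}):\s*(.*)$", re.DOTALL)  # "Outlet: text"


def extract_links(index_html, base_url):
    """Step 1: collect absolute transcript links from one title page."""
    soup = BeautifulSoup(index_html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select(LINK_SELECTOR)]


def extract_paragraphs(page_html):
    """Step 2: return the non-empty paragraph texts of one transcript page."""
    soup = BeautifulSoup(page_html, "html.parser")
    texts = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return [t for t in texts if t]


def split_questions_answers(paragraphs, spokesperson):
    """Step 3: pair each labelled question with the spokesperson's answer."""
    pairs, role = [], None
    for text in paragraphs:
        match = SPEAKER.match(text)
        if match:
            label, body = match.groups()
            if label == spokesperson:
                role = "answer"
            else:
                role = "question"
                pairs.append({"outlet": label, "question": "", "answer": ""})
        else:
            body = text  # an unlabelled paragraph continues the current turn
        if role and pairs:
            pairs[-1][role] = (pairs[-1][role] + " " + body).strip()
    return pairs


def wayback_cdx_query(page_url, year_from, year_to):
    """Build a Wayback Machine CDX query listing captures of page_url."""
    params = {"url": page_url, "output": "json", "from": year_from, "to": year_to}
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)


def snapshot_urls(cdx_rows):
    """Turn a CDX JSON response (header row plus data rows) into snapshot URLs."""
    if not cdx_rows:
        return []
    header, rows = cdx_rows[0], cdx_rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in rows]
```

In practice you would download each page first (for example with the requests library) and pass the HTML text to these functions. The speaker pattern is deliberately naive: any paragraph that begins with a short capitalized phrase followed by a colon is treated as a new speaker, so the output should be spot-checked by hand.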


  2. Chinese news media

The second type of digital material comes from the Chinese news media. It is quite common to scrape media websites directly, but researchers can also turn to third-party databases. As Eric explained, Wisers, Factiva, and Ringdata are among the most popular databases that might provide the data you are looking for, although they either charge a fee or require an institutional subscription. Chinese news media outlets can be used as a source for various research topics. For example, researchers in the field of International Relations may be interested in the role media outlets play in promoting China’s strategic narratives and building its image. They are also great sources if you are intrigued by the content of those outlets or by how ordinary Chinese citizens react to it. In addition, different outlets may have distinct thematic coverage or stylistic patterns, which is also a fascinating phenomenon waiting to be explored. As Eric mentioned, Chinese international media outlets, including Xinhua, Global Times, and CGTN, have in recent years intensified their efforts to promote Chinese narratives since the notion of “telling the Chinese story” was introduced in 2014 as part of the general effort to enhance China’s soft power. What narratives do China’s media outlets try to promote to international audiences? And how does the promotion of those narratives serve China’s wider geopolitical interests? Such questions not only leave Western communication regulators scratching their heads but are also wonderful topics for academic debate and scientific research. The content from those outlets thus becomes a significant source of research data.


  3. Chinese social media

The last type of digital material comes from social media platforms. It is valuable data for a wide spectrum of topics, such as public opinion and discourse, propaganda, censorship, and disinformation campaigns. Furthermore, there is a wide range of platforms to choose from, including Weibo, WeChat, Zhihu, TikTok, and Bilibili. These platforms, characterized by different groups of users, can offer different perspectives depending on your research interests. For instance, if you are interested in the (relatively) general public opinion of Chinese netizens, Weibo might be a suitable source, given its huge user base. Or, if you are focusing on the perceptions of the younger generation, Bilibili is probably the one to turn to. As Eric pointed out, however, no single social media platform can be seen as representative of ‘the general public’. It is therefore wise to formulate platform-specific questions before starting the data collection process.


Final remarks

While digital materials are gaining importance and popularity amid the pandemic, you should note that the methods used to collect them (e.g. scraping) are not always legally permitted, depending on the region where you reside and the platform from which you collect the data. It is therefore important to study the relevant regulations before you start. Moreover, certain data sources may detect scraping or mass downloading. Where a data source takes a tougher position against scraping, the scraper may be tracked and identified and could end up facing a lawsuit. This potential risk applies not only to research on China but to all research that uses scraping as a data collection method. If you need more information or help, the formal route is to always consult the website’s terms of agreement before performing web scraping, and to contact the research ethics committee at your institution; informally, you can ask the researcher community for advice. All in all, under the current circumstances, where it is difficult to obtain first-hand research data, do not panic: you have all the digital materials introduced above. But do remember to be a little more careful.
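As one small, practical way of being careful, a scraper can check a site’s robots.txt file and pause between requests. The sketch below uses Python’s standard urllib.robotparser module; the user-agent name, the default delay, and the function names are assumptions for illustration. Note that robots.txt is only a machine-readable signal of the site’s preferences: it does not replace reading the terms of agreement or consulting your ethics committee.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "research-scraper"  # hypothetical name; identify your project honestly


def load_rules(robots_txt):
    """Parse the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls, parser, user_agent=USER_AGENT):
    """Keep only URLs that robots.txt permits this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def request_delay(parser, user_agent=USER_AGENT, default_seconds=5):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    return parser.crawl_delay(user_agent) or default_seconds


def fetch_politely(urls, parser, fetch, sleep=time.sleep):
    """Fetch permitted URLs one by one, pausing after each request."""
    delay = request_delay(parser)
    results = {}
    for url in allowed_urls(urls, parser):
        results[url] = fetch(url)  # fetch is any function mapping a URL to page content
        sleep(delay)
    return results
```

Here fetch is passed in as a parameter so that the download step (for example, a small wrapper around the requests library) stays separate from the politeness logic.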



Mochtak, M., & Turcsanyi, R. Q. (2021). Studying Chinese Foreign Policy Narratives: Introducing the Ministry of Foreign Affairs Press Conferences Corpus. Journal of Chinese Political Science, 26, 743–761.

Author Bio:
Junhua Zhu is a doctoral researcher at the Centre for East Asian Studies, University of Turku. He previously received his master’s degree from Lund University, and his current research focuses on AI ethics, particularly in the Chinese context.
