
License: MIT | License: CC BY-NC-ND 4.0


krpoltext provides convenient R access to two large-scale Korean political text corpora described in:

Lim, T.H. (2025). South Korean Election Campaign Booklet and Party Statements Corpora. Scientific Data, 12, 1030. https://doi.org/10.1038/s41597-025-05220-4

Corpus | Period | Candidates / Entries | Description
Election Campaign Booklets | 2000–2022 | 49,678 candidates | Manifesto booklets from candidates in presidential, National Assembly, and local elections, available in original and enriched variants
Party Statements | 2003–2022 | 83,201 statements | Official statements and leadership meeting minutes from the two major parties

Data is hosted on the Open Science Framework (DOI: 10.17605/OSF.IO/RCT9Y). If a managed artifact is needed, it is downloaded automatically on first use in interactive sessions; non-interactive sessions should use a local file path or a pre-populated cache.

campaign_booklet is available in two public variants:

  • original: the original krpoltext corpus artifact
  • enriched: the same document-row universe plus conservative NEC linkage fields such as huboid, sg_id, sg_typecode, link_status, matcher_version, and nec_snapshot_id

load_campaign_booklet() defaults to variant = "original". Use variant = "enriched" for NEC-aligned workflows such as kr-elections-mcp.

Release Notes

Recent package changes are summarized in:

Installation

# install.packages("remotes")
remotes::install_github("taehyun-lim/krpoltext")

Quick Start

library(krpoltext)

# Load datasets from managed Parquet artifacts when available
ps <- load_party_statements(format = "parquet")
cb <- load_campaign_booklet(format = "parquet")
cb_enriched <- load_campaign_booklet(format = "parquet", variant = "enriched")

# Default campaign_booklet loads the original corpus artifact
cb[, c("code", "party", "text")]

# Use the enriched variant for NEC-linked workflows
cb_enriched[, c("code", "huboid", "sg_id", "sg_typecode", "link_status")]

# Explore metadata
metadata("campaign_booklet")
metadata("campaign_booklet", variant = "enriched")
schema("campaign_booklet")
schema("campaign_booklet", variant = "enriched")
metadata("party_statements")
schema("party_statements")

# Filter documents
docs_2020 <- get_docs("party_statements", year = 2020)
conservative <- get_docs("party_statements", year = 2018:2022, conservative = 1)
strict_subset <- get_docs(
  "party_statements",
  year = 2020,
  .select = c("year", "title", "text"),
  .strict = TRUE
)

# Campaign booklets: filter an already-loaded table by office
assembly <- get_docs("campaign_booklet", office = "national_assembly", .data = cb)

Data Download

Managed artifacts are available from OSF in both CSV and Parquet formats. The load_*() helpers can use managed Parquet artifacts, while download_data() remains a CSV-prefetch helper. You can also point the loaders at local CSV or Parquet files.

For campaign_booklet, the historical unsuffixed filenames remain the original artifact:

  • sk_election_campaign_booklet_v2022.csv
  • sk_election_campaign_booklet_v2022.parquet

The enriched artifact uses suffixed filenames:

  • sk_election_campaign_booklet_enriched_v2022.csv
  • sk_election_campaign_booklet_enriched_v2022.parquet

# Use managed Parquet explicitly
ps <- load_party_statements(format = "parquet")
cb <- load_campaign_booklet(format = "parquet")
cb_enriched <- load_campaign_booklet(format = "parquet", variant = "enriched")

# Or use CSV explicitly
ps <- load_party_statements(format = "csv")
cb <- load_campaign_booklet(format = "csv")
cb_enriched <- load_campaign_booklet(format = "csv", variant = "enriched")

# Prefetch CSV caches for both datasets
download_data()

# Provide a local file path instead
ps <- load_party_statements(path = "~/Downloads/sk_party_statements_v2022.csv")
ps <- load_party_statements(path = "~/Downloads/sk_party_statements_v2022.parquet")

Data is cached as compressed RDS in tools::R_user_dir("krpoltext", "cache") and verified via SHA-256 checksums. Subsequent loads take ~2 seconds.

Integration with quanteda

library(quanteda)

corp <- as_quanteda_corpus(ps, docid_field = "id")
toks <- tokens(corp, remove_punct = TRUE)
dfm_obj <- dfm(toks)
topfeatures(dfm_obj, 20)

Functions

Function | Description
load_campaign_booklet() | Load the campaign booklet corpus
load_party_statements() | Load the party statements corpus
metadata() | Dataset metadata (columns, versions, citation)
schema() | Column-level schema and artifact metadata
get_docs() | Filter documents and optionally select columns
filter_docs() | Apply strict filters to an in-memory table
select_vars() | Select columns from an in-memory table
as_quanteda_corpus() | Convert to a quanteda corpus object
download_data() | Download datasets from OSF
clear_cache() | Remove cached data files

Static Data API

Dataset metadata and download links are available as a static JSON API via GitHub Pages, with no server required:

Endpoint | Description
/data/index.json | Resource index (files, versions, SHA-256, download URLs)
/data/metadata.json | Dataset descriptions and citation info
/data/schema/campaign_booklet.json | Column schema for the original campaign booklet artifact
/data/schema/campaign_booklet_enriched.json | Column schema for the enriched campaign booklet artifact
/data/schema/party_statements.json | Column schema for party statements

API overview and fallback URLs: https://taehyun-lim.github.io/krpoltext/data-api.html

If GitHub Pages temporarily returns 404, the same resource index is also available here: https://raw.githubusercontent.com/taehyun-lim/krpoltext/gh-pages/data/index.json
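The fallback described above can be scripted. The following is a minimal Python sketch, not part of krpoltext: the helper names `http_get_json` and `fetch_index` are hypothetical, and the primary URL is an assumption formed by joining the GitHub Pages base used elsewhere on this page with the `/data/index.json` endpoint. It tries each URL in order and returns the first JSON document that loads.

```python
import json
import urllib.request

# Assumption: Pages base URL from this page joined with the /data/index.json endpoint.
PRIMARY = "https://taehyun-lim.github.io/krpoltext/data/index.json"
# Fallback URL exactly as given in the text above.
FALLBACK = "https://raw.githubusercontent.com/taehyun-lim/krpoltext/gh-pages/data/index.json"


def http_get_json(url, timeout=30):
    """Download one URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_index(urls=(PRIMARY, FALLBACK), get=http_get_json):
    """Try each URL in order; return the first JSON document that loads."""
    last_error = None
    for url in urls:
        try:
            return get(url)
        except (OSError, ValueError) as err:
            # OSError covers HTTP 404 and network failures raised by urllib;
            # ValueError covers a body that is not valid JSON.
            last_error = err
    raise RuntimeError(f"no index URL could be loaded: {last_error}")


# Example call (performs network access):
# index = fetch_index()
```

Passing the downloader as the `get` argument keeps the retry logic separate from the network call, so a different HTTP client can be swapped in.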

R (without installing the package):

api <- "https://taehyun-lim.github.io/krpoltext/data/metadata.json"
meta <- jsonlite::fromJSON(api)
url <- meta$party_statements$download_urls$csv
tmp <- tempfile(fileext = ".csv")
download.file(url, tmp, mode = "wb")
dt <- data.table::fread(tmp, encoding = "UTF-8")

Python:

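import pandas as pd
import requests

# Fetch the dataset metadata, then read the CSV directly from its download URL
resp = requests.get("https://taehyun-lim.github.io/krpoltext/data/metadata.json", timeout=30)
resp.raise_for_status()
url = resp.json()["party_statements"]["download_urls"]["csv"]
df = pd.read_csv(url)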
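Because the resource index is described as listing SHA-256 values, a file downloaded outside the package can be checked against its published checksum. The following is a minimal Python sketch under stated assumptions: `sha256_of_file` and `verify_sha256` are hypothetical helper names, and the expected digest is assumed to be copied by hand from `/data/index.json`, since the key layout of that file is not documented here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large corpus files never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected_hex):
    """Return path if its digest matches expected_hex, else raise ValueError."""
    actual = sha256_of_file(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return path


# Example call, with the digest copied from the resource index (placeholder value):
# verify_sha256("sk_party_statements_v2022.csv", "<sha256 from index.json>")
```

A mismatch raises an error instead of returning False, so a corrupted or truncated download stops a script before the data is loaded.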

Function reference: https://taehyun-lim.github.io/krpoltext/reference/index.html

Guides and examples: https://taehyun-lim.github.io/krpoltext/articles/index.html

Citation

If you use this data in academic work, please cite the Data Descriptor paper:

Lim, T.H. (2025). South Korean Election Campaign Booklet and Party Statements Corpora. Scientific Data, 12, 1030. https://doi.org/10.1038/s41597-025-05220-4

And the data repository:

Lim, T.H. (2024). South Korean Election Campaign Booklet Corpus and Party Statements Corpus. OSF. https://doi.org/10.17605/OSF.IO/RCT9Y

For the R package itself, cite:

Lim, T.H. (2026). krpoltext: Korean Political Text Corpora for R. R package version 0.2.0. Zenodo. https://doi.org/10.5281/zenodo.18704318

You can also retrieve the current package citation in R:

citation("krpoltext")

License