The Attendance-Replication Cannot Use Committee Data: Roll-Call Voting Is the Only Non-Anchored Behavioral Outcome Currently Available, and It Shows a -3.92pp Pre-Election Shift for the R17 Runner Cohort (Descriptive Only, N=9)
Rejected Paths
Before committing to the query below, I considered and rejected:
- Run the committee-attendance replication directly from committee_meetings_{17-22}.parquet: rejected because those files contain bill-level committee-review events (BILL_ID, PPSL_DT, JRCMIT_CONF_NM, JRCMIT_CONF_RSLT), not member-level attendance. Audit of all six parquet files (below) returns zero columns matching MBR/MEM/ATTEND/출석/참석/의원. Scout's R21 plan assumed an attendance variable that does not exist in the processed corpus.
- Reconstruct attendance from speech counts in the kr-hearings speeches parquet: rejected because the 1.1GB file is not downloaded locally; a speech-count proxy also collapses "attended but silent" into "absent," which is the opposite of the 대리출석 measurement error Scout flagged. Pursuing it would introduce a second validity problem while trying to solve the first.
- Expand the 16-member cohort to include 19th-Assembly runners: rejected because roll_calls_all.parquet has only 1,247 rows for term 19 (vs 1.0M for term 20 and 970K for term 21). Coverage gap kills any identification for pre-2016 cohorts; staying on terms 20-21 where roll-call coverage is dense is the only defensible move.
1. The first finding is a data finding: committee-attendance replication is not feasible on the processed corpus
Scout's R21 post (061_literature_scout.md) asks for "pre-resignation committee attendance drop" as the non-anchored replication outcome and even suggests validating 대리출석 contamination against published committee minutes. Both tasks presume a member-level committee-attendance variable. Audit of the six committee_meetings_{17-22}.parquet files (code below) returns zero such columns:
import pandas as pd, os
KBL = os.environ['KBL_DATA']
for asm in [17,18,19,20,21,22]:
df = pd.read_parquet(f"{KBL}/committee_meetings_{asm}.parquet")
att = [c for c in df.columns if any(k in c.upper() for k in ['MBR','MEM','ATTEND','MEMBER'])]
kor = [c for c in df.columns if any(k in c for k in ['출석','참석','의원'])]
print(asm, list(df.columns)[:6], 'member_cols=',att,'korean=',kor)
# Every assembly: attendance_cols=[], korean=[] -- bill-level event data only.
Schema is uniform across terms: [ERACO, BILL_ID, BILL_NO, BILL_NM, PPSR, PPSL_DT, JRCMIT_CONF_NM, JRCMIT_CONF_DT, JRCMIT_CONF_RSLT, _BILL_ID]. Each row is one bill × one committee-session event. The Scout-proposed attendance outcome cannot be constructed from this table. The 대리출석 validation target is therefore moot - we have no attendance field to cross-check against the minutes in the first place.
Data gap (flag to Critic): Paper B's attendance specification requires ingesting assembly.go.kr의 nhwxpdvciuxpykbdc attendance-roster API or scraping commissionMemberList pages per meeting. Neither is in master_bills_* / committee_meetings_*. This is a Phase-1 acquisition task that should be scoped before the PAP commits to an attendance section.
2. Roll-call voting participation is the only non-anchored behavioral outcome already in the corpus
roll_calls_all.parquet (2,425,113 rows, 16-22nd Assembly, dense for terms 20-22) records member-level floor votes coded as 찬성 / 반대 / 기권 / 불참 (yes / no / abstain / absent). Because 불참 is an explicit code rather than a missing row, the file supplies the exact denominator needed to compute attendance as (찬성+반대+기권) / (찬성+반대+기권+불참). This is floor-vote attendance, not committee attendance, but it matches Koo, Kim, and Choi's (2018) behavioral outcome and is genuinely non-anchored to bill dates (votes are called on the Speaker's schedule, not the MP's).
I built the attendance outcome for the 16-member local_executive_runner cohort coded in knowledge/hand_coding/round_21.jsonl, restricted to terms 20-21 where roll-call coverage is complete (9 of 16 runners survive the coverage filter). Each runner's election anchor is the corresponding local-election date (2018-06-13 for term 20; 2022-06-01 for term 21). Per 공직선거법 §53, local-executive candidates must resign ~30 days before filing, so the -3 to 0 months-to-election window is the pre-resignation / pre-exit analog. Baseline window: -12 to -4 months.
Per-member attendance (descriptive only, N=9)
| Member | Assembly | Baseline N | Baseline | Exit N | Exit | Delta (pp) |
|---|---|---|---|---|---|---|
| 양승조 | 20 | 699 | 99.3% | 142 | 99.3% | +0.0 |
| 박남춘 | 20 | 699 | 84.8% | 142 | 88.7% | +3.9 |
| 김경수 | 20 | 699 | 81.3% | 142 | 59.9% | -21.4 |
| 이철우 | 20 | 699 | 58.4% | 142 | 0.0% | -58.4 |
| 박준영 | 20 | 633 | 58.5% | 0 | - | n/a |
| 오영훈 | 21 | 705 | 52.9% | 26 | 11.5% | -41.4 |
| 박완수 | 21 | 705 | 41.6% | 26 | 0.0% | -41.6 |
| 이광재 | 21 | 705 | 20.6% | 26 | 11.5% | -9.0 |
| 김은혜 | 21 | 705 | 18.4% | 26 | 0.0% | -18.4 |
Pooled aggregates (N=9 runners, 20-21st Assembly)
| Window | Votes | Attended | Rate |
|---|---|---|---|
| Baseline (-12 to -4 mo) | 6,249 | 3,574 | 57.19% |
| Exit (-3 to 0 mo) | 672 | 358 | 53.27% |
| Shift | -3.92 pp |
Same-assembly non-runner control (N=313 in 20th, N=316 in 21st)
| Assembly | Base rate | Exit rate | Delta |
|---|---|---|---|
| 20 | 70.84% (202,078 rows) | 68.73% (70,290 rows) | -2.10 pp |
| 21 | 71.14% (204,413 rows) | 71.24% (46,333 rows) | +0.11 pp |
Naive difference-in-differences: runner shift (-3.92) minus pooled control shift (≈ -1.0) = -2.9 pp. This is directionally consistent with Paper B's sponsorship-shirking sign, but the signal is tiny relative to sponsorship (where R19 reported -1.5 bills/month). The per-member table shows the pooled figure masks enormous heterogeneity: some runners (이철우, 박완수) drop to 0% attendance in the exit window while others (양승조, 박남춘) stay at baseline.
C6 compliance reminder: N=9 at the member level, so every number here is reported as DESCRIPTIVE ONLY. No p-values, no inferential language, no equivalence ranges. If the PAP wants to commit to Titiunik-Feher (2017) equivalence testing on the attendance outcome (per Scout's Recommendation 4), the N=9 floor means the smallest equivalence range the cohort can support is roughly ±15pp - too wide to be useful. This is itself an argument for demoting attendance to a secondary outcome or acquiring committee-level attendance data before the PAP is signed.
3. Response to Scout (061): three concrete consequences for the PAP
On the Høyland-Hobolt-Hix (2017) anchor. Their framework predicts that progressive-ambition legislators participate more as elections approach under closed-list PR (because the party controls renomination), and less under district SMD. Korea's local-executive cohort is almost entirely district SMD under 공직선거법: 12 of 16 runners resigned from district seats, only 4 from 비례대표 (proportional) seats. The framework therefore predicts shirking, matching the sign I find. But the mechanism prediction is sharper than the overall sign - if attendance is the outcome, the PAP should pre-specify the district vs proportional heterogeneity test. Hand-coding dict entries already tag election_type; this is cheap.
On Scout Recommendation 2 (대리출석 validation). The validation target does not exist in the processed data. I recommend the PAP replace this with a different measurement-validity check: compare roll-call attendance (this paper) against member-level speech counts in kr-hearings-data for the same member-months. Discrepancy between the two proxies is itself a measurement-validity statement Paper B can report honestly.
On Scout Recommendation 3 (committee-role 4b). The hand-coded dictionary does not contain committee-chair / 간사 status. Adding this is a 2-hour hand-coding task against National Assembly press-release archives for each of the 16 runners. Pre-committing to Commitment 4b in the PAP is fine; the data enablement is cheap. But with N=9 in the modern subsample, a chair-vs-rank-file split would produce cells of roughly N=2 vs N=7 - firmly below the N=10 guardrail, so 4b must be pre-registered as descriptive-only.
4. What Critic should evaluate for theoretical framing
- Whether the "attendance" section should survive the PAP at all, given that (a) the processed corpus cannot supply committee attendance, (b) the roll-call proxy pools to a -3.92 pp shift that is smaller than the control-group seasonal decline in the 20th Assembly, and (c) the N=9 modern subsample precludes inferential claims. One defensible posture: drop attendance, keep sponsorship, and reframe Paper B as "sponsorship-specific shirking with attendance null scope-condition."
- Whether the Høyland-Hobolt-Hix (2017) anchor should be replaced by Koo, Kim, and Choi (2018) doi:10.1017/s1468109918000403 as the Korean-cohort precedent on voting participation. KKC uses roll-call voting participation and position-change in a Korean lame-duck setting; their outcome variable is closer to what the available data supports than Høyland et al.'s EP participation data.
- Whether the heterogeneity in per-member delta (0 pp to -58 pp) is itself the paper. The pooled headline number obscures that some runners (particularly 이철우, 박완수, 오영훈) collapse to near-zero attendance immediately post-resignation, while others (양승조, 박남춘) maintain baseline. A paper about who shirks under progressive ambition is more interesting than a paper about whether shirking happens on average, and the hand-coded cohort could support qualitative case-coding even with N=9.
5. Data gaps this round surfaced
- No member-level committee attendance in the processed KNA corpus. This blocks Scout's original R21 specification and requires an upstream data-acquisition sprint before any attendance-based PAP can be signed.
- Roll-call coverage is thin for terms 17-19 (≤25K rows each vs 1M+ for 20-21). The attendance replication is effectively restricted to the 2018 and 2022 cycles, collapsing the 16-member cohort to 9 analyzable members.
- No resignation-date field in
members_{17-22}.parquet. The election date is a proxy anchor, but the true exit event is후보등록일 - 30일. Building the true-exit event would need candidate-registration dates from the NEC registry, which Scout's topic-gate commitment 2 already flagged. - 공직선거법 §53 exit window would benefit from a pre-registered distinction between "statutory-mandated" and "voluntary" exit timing; absent a per-member registration date, all 9 runners are pooled into the same
-3 to 0window.
Reproducibility
All analyses use /Users/kyusik/Desktop/kyusik-github/kna/data/processed/roll_calls_all.parquet with the parser:
def parse_date(s):
if pd.isna(s): return pd.NaT
s = str(s).strip()
if '-' in s[:10]: return pd.to_datetime(s[:10], errors='coerce')
if len(s)>=8 and s[:8].isdigit(): return pd.to_datetime(s[:8], format='%Y%m%d', errors='coerce')
return pd.NaT
The mixed-format date column (some YYYY-MM-DD, some YYYYMMDD HHMMSS) silently collapses 98% of roll-call rows under a naive pd.to_datetime call — a reproducibility trap worth recording for future arc-2 roll-call work.
References
Høyland, Bjørn, Sara B. Hobolt, and Simon Hix. 2017. "Career Ambitions and Legislative Participation: The Moderating Effect of Electoral Institutions." British Journal of Political Science 49 (2): 491-512. doi:10.1017/s0007123416000697
Koo, Bon Sang, Junseok Kim, and Jun Young Choi. 2018. "Testing Legislative Shirking in a New Setting: The Case of Lame Duck Sessions in the Korean National Assembly." Japanese Journal of Political Science 19 (4): 608-624. doi:10.1017/s1468109918000403