Patent and Solow Growth Panel Data

Data extraction via web, cleaning in Python and Econometric Analysis in R

Contents

This project serves as an opportunity for me to gain experience in Python for:

  • data extraction via URL from OECD remotely
  • data cleaning of large dataset from Penn World Data (PWT)
  • data merging of two different sources into a workable penel dataset of 8 countryies.

Then, in a integrated work environment (RStudio), we export the extracted/filtered/cleaned/merged dataset into .cvs files to then analyze the data in R.

Results

Panel Dataset

Below are the final results of this project, which consist of the constructed panel dataset covering 8 countries from 1980–2019. The dataset integrates OECD patent data with Penn World Table macroeconomic indicators, cleaned and merged into a consistent format. It includes measures of output (GDP), capital, effective labor, total factor productivity (TFP), and patent activity, along with their computed growth rates. The full data, panel_complete.csv is displayed in the table below.

Besides the core panel dataset, I also generated panel_lags.csv, which adds one-, two-, and three-year lags of patent growth for fixed-effects regressions.

Time Series

Behind the Scenes (Complete Codes)

Python Data Extraction/Clean

OECD Data Set

library(reticulate)
library(dplyr)
library(ggplot2)
library(plm)
library(mFilter)
library(reticulate)
library(lmtest)
library(sandwich)
library(plm)
library(zoo)
library(broom)
library(reshape2)
library(ggrepel)
library(fixest)

We extract data from OECD remotely and PWT locally. First we set up our python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import xml.etree.ElementTree as ET
from io import StringIO

Now we extract data from filtered OECD data query.

url_data = (
    "https://sdmx.oecd.org/public/rest/data/"
    "OECD.STI.PIE,DSD_PATENTS@DF_PATENTS,1.0/"
    ".A.GR.PATN.PRIORITY.CAN+FRA+SGP+KOR+DEU+GBR+CHN+JPN+USA..INVENTOR.._Z._T"
    "?startPeriod=1980&endPeriod=2019&dimensionAtObservation=AllDimensions"
)

response_data = requests.get(url_data)
print("Status:", response_data.status_code)
## Status: 200

Parse into DataFrame (XML $\to$ Pandas)

root = ET.fromstring(response_data.text)
ns = {"generic": "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/generic"}

records = []

for obs in root.findall(".//generic:Obs", ns):
    obs_dict = {}
    
    # Extract all keys (country, date type, etc.)
    for val in obs.findall("./generic:ObsKey/generic:Value", ns):
        obs_dict[val.attrib["id"]] = val.attrib["value"]
    
    # Extract the observed value
    val_elem = obs.find("./generic:ObsValue", ns)
    obs_dict["value"] = val_elem.attrib.get("value") if val_elem is not None else None
    
    records.append(obs_dict)

df_patents = pd.DataFrame(records)

print(df_patents.head())
##   TIME_PERIOD PATENT_AUTHORITIES  ... OECD_TECHNOLOGY_PATENT              value
## 0        2011                6F0  ...                     _T  3445.668701171875
## 1        2011                6F0  ...                     _T     14546.98828125
## 2        2000                6F0  ...                     _T    13811.802734375
## 3        2000                6F0  ...                     _T   14547.6337890625
## 4        2000                6F0  ...                     _T  851.2003784179688
## 
## [5 rows x 13 columns]
print("Shape:", df_patents.shape)
## Shape: (720, 13)

We want REF_AREA, TIME_PERIOD, value

df_clean = df_patents[["REF_AREA", "TIME_PERIOD", "value"]].copy()
df_clean["TIME_PERIOD"] = df_clean["TIME_PERIOD"].astype(int)
df_clean["value"] = df_clean["value"].astype(float)

print(df_clean.head())
##   REF_AREA  TIME_PERIOD         value
## 0      KOR         2011   3445.668701
## 1      JPN         2011  14546.988281
## 2      DEU         2000  13811.802734
## 3      USA         2000  14547.633789
## 4      CAN         2000    851.200378
print("Shape:", df_clean.shape)
## Shape: (720, 3)
# Step 2a: aggregate by country + year (sum across authorities etc.)
df_agg = df_clean.groupby(["TIME_PERIOD", "REF_AREA"], as_index=False)["value"].sum()

# Step 2b: pivot to wide format
df_panel = df_agg.pivot(index="TIME_PERIOD", columns="REF_AREA", values="value")

print(df_panel.head())
## REF_AREA             CAN      CHN  ...      SGP           USA
## TIME_PERIOD                        ...                       
## 1980         1240.127747   5.0000  ...  15.4524  43367.697266
## 1981         1362.677399   3.1666  ...   5.3929  41946.530273
## 1982         1332.704742   3.0000  ...   3.7858  43240.144043
## 1983         1423.382584  11.3333  ...   5.5000  42562.730957
## 1984         1569.370361  11.2834  ...  16.7333  44211.058105
## 
## [5 rows x 9 columns]
print("Shape:", df_panel.shape)
## Shape: (40, 9)

Check empty

print(df_panel.isna().sum())
## REF_AREA
## CAN    0
## CHN    0
## DEU    0
## FRA    0
## GBR    0
## JPN    0
## KOR    0
## SGP    0
## USA    0
## dtype: int64

Let’s peek at the data:

df_panel.plot(title="Patents by Country (Priority filings, 1980–2019)")
plt.xlabel("TIME_PERIOD")
plt.ylabel("Patent counts")
plt.show()

Looking good now we proceed similarly for our second data set.

PWT Data Set

Here we clean and extract data from Penn World Table We begin with understanding the data structure.

xls = pd.ExcelFile("pwt1001.xlsx")
print(xls.sheet_names)
## ['Info', 'Legend', 'Data']
# output ['Info', 'Legend', 'Data']; 
df_legend = pd.read_excel("pwt1001.xlsx", sheet_name="Legend")
print("Variabe-Defn:")
## Variabe-Defn:
print(df_legend.head())
##           Variable name        Variable definition
## 0  Identifier variables                        NaN
## 1           countrycode  3-letter ISO country code
## 2               country               Country name
## 3         currency_unit              Currency unit
## 4                  year                       Year

For convenience let’s save this definition list for reference:

df_legend.to_csv("PWD_def", index = False)
print("def list saved as PWD_def.csv")
## def list saved as PWD_def.csv

After some considerations we will use:

  • $Y_{i,t} :=$ rgdpo, i.e., Output-side real GDP at current PPPs (in mil. 2017US$).
  • $K_{i,t} :=$ rnna, i.e., Capital stock at constant 2017 national prices (in mil. 2017US$).
  • $L^{\text{effective}}_{i,t} \equiv$ emp $\times$ avh $\times$ hc, i.e., Number of persons engaged (in millions), Average annual hours worked by persons engaged, and the Human capital index.
  • $A_{i,t} :=$ rtfpna, i.e., TFP at constant national prices (2017=1).

Let’s have these variables filtered:

df_pwt = pd.read_excel("pwt1001.xlsx", sheet_name="Data")
df_pwt_small = df_pwt[[
    "countrycode", "year", "rgdpo", "rnna", "emp", "avh", "hc", "rtfpna"
]].copy()
print(df_pwt_small.head())
##   countrycode  year  rgdpo  rnna  emp  avh  hc  rtfpna
## 0         ABW  1950    NaN   NaN  NaN  NaN NaN     NaN
## 1         ABW  1951    NaN   NaN  NaN  NaN NaN     NaN
## 2         ABW  1952    NaN   NaN  NaN  NaN NaN     NaN
## 3         ABW  1953    NaN   NaN  NaN  NaN NaN     NaN
## 4         ABW  1954    NaN   NaN  NaN  NaN NaN     NaN
print(df_pwt_small.shape)
## (12810, 8)

In particular, we want the same countries as OECD dataset from 1980 to 2019:

keep_countries = ["CAN","CHN","DEU","FRA","GBR","JPN","KOR","SGP","USA"]

df_pwt_filtered = df_pwt_small[
    (df_pwt_small["countrycode"].isin(keep_countries)) &
    (df_pwt_small["year"].between(1980, 2019))
].copy()
print(df_pwt_filtered["countrycode"].unique())
## ['CAN' 'CHN' 'DEU' 'FRA' 'GBR' 'JPN' 'KOR' 'SGP' 'USA']
print(df_pwt_filtered["year"].min(), "", df_pwt_filtered["year"].max())
## 1980 → 2019
print(df_pwt_filtered.shape)
## (360, 8)

Pivoting them:

# pivot each; I have deleted my testing process where I pint the head of each pivots
df_gdp = df_pwt_filtered.pivot(index="year", columns="countrycode", values="rgdpo")
df_cap = df_pwt_filtered.pivot(index="year", columns="countrycode", values="rnna")
df_pwt_filtered["eff_labour"] = df_pwt_filtered["emp"] * df_pwt_filtered["avh"] * df_pwt_filtered["hc"]
df_labor = df_pwt_filtered.pivot(index="year", columns="countrycode", values="eff_labour")
df_tfp = df_pwt_filtered.pivot(index="year", columns="countrycode", values="rtfpna")
# data from OECD set
df_patents_long = df_clean.rename(columns={
    "REF_AREA": "countrycode",
    "TIME_PERIOD": "year",
    "value": "patents"
})
df_patents_long["year"] = df_patents_long["year"].astype(int)
# merging
df_merged = pd.merge(
    df_pwt_filtered[["countrycode","year","rgdpo","rnna","eff_labour", "rtfpna"]],
    df_patents_long.groupby(["countrycode","year"], as_index=False)["patents"].sum(),
    on=["countrycode","year"],
    how="inner"
# name columns
)
df_merged = df_merged.rename(columns={
    "rgdpo": "Y",
    "rnna": "K",
    "eff_labour": "L",
    "rtfpna": "A",
    "patents": "PA"
})
print(df_merged.head())  
##   countrycode  year            Y         K             L         A           PA
## 0         CAN  1980  687911.0625  0.272389  62465.899649  0.981520  1240.127747
## 1         CAN  1981  711280.5625  0.288998  64049.639423  0.979795  1362.677399
## 2         CAN  1982  693443.5000  0.299985  61399.551015  0.966170  1332.704742
## 3         CAN  1983  717049.3125  0.310637  62030.711466  0.973799  1423.382584
## 4         CAN  1984  767394.6875  0.321711  64472.861291  0.993447  1569.370361
print(df_merged.shape)
## (360, 7)
df_merged.to_csv("PWDnOECD.csv", index=False)
print("Saved as PWDnOECD.csv")
## Saved as PWDnOECD.csv
panel <- read.csv("PWDnOECD.csv")
head(panel)
##   countrycode year        Y         K        L         A       PA
## 1         CAN 1980 687911.1 0.2723894 62465.90 0.9815204 1240.128
## 2         CAN 1981 711280.6 0.2889977 64049.64 0.9797946 1362.677
## 3         CAN 1982 693443.5 0.2999851 61399.55 0.9661703 1332.705
## 4         CAN 1983 717049.3 0.3106366 62030.71 0.9737986 1423.383
## 5         CAN 1984 767394.7 0.3217106 64472.86 0.9934467 1569.370
## 6         CAN 1985 786765.8 0.3343326 67462.81 0.9967136 1565.460

We have thus completed our job in python. Now we move onto some elementary econometric analysis in R.

R Econometric Analysis

Time Series of the Data.

Let’s visualize and understand the time series trend of each of the variables:

okabe_ito <- c(
  "#E69F00", "#56B4E9", "#009E73", "#F0E442",
  "#0072B2", "#D55E00", "#CC79A7", "#999999"
)

variables <- c("Y", "K", "L", "A", "PA")
titles <- c(
  Y  = "GDP by Country (1980–2019)",
  K  = "Capital by Country (1980–2019)",
  L  = "Effective Labour by Country (1980–2019)",
  A  = "TFP Index by Country (1980–2019)",
  PA = "Patents by Country (1980–2019)"
)

for (var in variables) {
  p <- ggplot(panel, aes(x = year, y = .data[[var]], color = countrycode)) +
    geom_line(size = 0.8) +
    scale_color_manual(values = rep(okabe_ito, length.out = length(unique(panel$countrycode)))) +
    theme_minimal(base_size = 12) +
    labs(title = titles[[var]], x = "Year", y = var, color = "Country") +
    theme(
      legend.position = "bottom",
      legend.title = element_blank(),
      plot.title = element_text(size = 13, face = "bold", hjust = 0.5),
      axis.title = element_text(size = 11),
      axis.text = element_text(size = 9),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "grey80", size = 0.3)
    )
  print(p)
}

In particular, if we look at the Patents by Country plot, we can see the Singapore’s total patents nearly doesn’t grow. To look closer, we plot:

ggplot(panel[panel$countrycode == "SGP", ], aes(x = year, y = PA)) +
  geom_line(size = 1, color = "steelblue") +
  theme_minimal(base_size = 14) +
  labs(title = "Patents in Singapore 1980 to 2019",
       x = "Year", y = "Patents (PA)")

Growth Rate Time Series

Of course, we can further construct comparison of growth rate as follows:

mathematically speaking, we want for each variable $X$, \(gX_{i,t} = \ln(X_{i,t}) - \ln(X_{i,t-1}),\) which can be done easily with the ‘lag’ function:

panel_g <- panel %>%
  # 1) Make sure columns are numeric
  mutate(across(c(Y, K, L, A, PA), ~as.numeric(.))) %>%
  
  # 2) Sort properly
  arrange(countrycode, year) %>%
  
  # 3) Compute log-diff growths as percent
  group_by(countrycode) %>%
  mutate(
    gY   = 100 * c(NA, diff(log(Y))),
    gK   = 100 * c(NA, diff(log(K))),
    gL   = 100 * c(NA, diff(log(L))),
    gA   = 100 * c(NA, diff(log(A))),
    gPA  = 100 * c(NA, diff(log(PA)))
  ) %>%
  ungroup() %>%
  
  # 4) Keep only the growth-rate panel
  select(countrycode, year, gY, gK, gL, gA, gPA)

# for safety we now create a seperate dataset combining the two
panel_complete <- panel %>%
  left_join(panel_g, by = c("countrycode", "year"))

Quick check

ctry <- "CAN"

check <- panel_complete %>%
  filter(countrycode == ctry, year %in% c(1980, 1981)) %>%
  summarise(
    manual = 100 * (log(Y[year == 1981]) - log(Y[year == 1980])),
    stored = gY[year == 1981],
    equal  = all.equal(
      100 * (log(Y[year == 1981]) - log(Y[year == 1980])),
      gY[year == 1981]
    )
  )

print(check)
##     manual   stored equal
## 1 3.340739 3.340739  TRUE

Now by same logic we then have:

growth_vars <- c("gY", "gK", "gL", "gA", "gPA")
titles <- c(
  gY  = "GDP Growth by Country (1981–2019)",
  gK  = "Capital Growth by Country (1981–2019)",
  gL  = "Effective Labour Growth by Country (1981–2019)",
  gA  = "TFP Growth by Country (1981–2019)",
  gPA = "Patent Growth by Country (1981–2019)"
)

for (var in growth_vars) {
  p <- ggplot(panel_complete %>% filter(year >= 1981),
              aes(x = year, y = .data[[var]], color = countrycode)) +
    geom_line(size = 0.8) +
    scale_color_manual(values = rep(okabe_ito, length.out = length(unique(panel_complete$countrycode)))) +
    theme_minimal(base_size = 12) +
    labs(title = titles[[var]], x = "Year", y = "Growth Rate (%)", color = "Country") +
    theme(
      legend.position = "bottom",
      legend.title = element_blank(),
      plot.title = element_text(size = 13, face = "bold", hjust = 0.5),
      axis.title = element_text(size = 11),
      axis.text = element_text(size = 9),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(color = "grey80", size = 0.3)
    )
  print(p)
}

Regression

gA v.s. gPA

Let’s see if gA can be explained by gPA. The model of our interest is:

\[gA_{i,t} \;=\; \gamma_1\, gPA_{i,t-\ell} \;+\; \mu_i \;+\; \tau_t \;+\; u_{i,t}\]

where:

  • $\mu_i :=$ country fixed effects (absorbs time-invariant differences in patent regimes, institutions, etc.
  • $\tau_t :=$ year fixed effects (absorbs global shocks common to all countries).
  • Such that we cluster at the country level (serial correlation within $i$)
  • $l \in \lbrace 0,1,2,3 \rbrace := $ is the lag of patent growth.

Let’s first make a delayed from in which countrycode is a respected

lag_n <- function(x, n = 1) {
  x <- as.numeric(x)
  if (n <= 0) return(x)
  c(rep(NA_real_, n), head(x, -n))
}

panel_lags <- panel_complete %>%
  mutate(
    year = as.numeric(year),
    gPA  = as.numeric(gPA)
  ) %>%
  group_by(countrycode) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    gPA_lag1 = lag_n(gPA, 1),
    gPA_lag2 = lag_n(gPA, 2),
    gPA_lag3 = lag_n(gPA, 3)
  ) %>%
  ungroup()
m0 <- feols(gA ~ gPA       | countrycode + year, cluster = ~countrycode, data = panel_lags)
m1 <- feols(gA ~ gPA_lag1  | countrycode + year, cluster = ~countrycode, data = panel_lags)
m2 <- feols(gA ~ gPA_lag2  | countrycode + year, cluster = ~countrycode, data = panel_lags)
m3 <- feols(gA ~ gPA_lag3  | countrycode + year, cluster = ~countrycode, data = panel_lags)

etable(m0, m1, m2, m3, keep = "gPA")
##                               m0              m1              m2
## Dependent Var.:               gA              gA              gA
##                                                                 
## gPA             0.0041* (0.0014)                                
## gPA_lag1                         0.0024 (0.0016)                
## gPA_lag2                                         0.0041 (0.0045)
## gPA_lag3                                                        
## Fixed-Effects:  ---------------- --------------- ---------------
## countrycode                  Yes             Yes             Yes
## year                         Yes             Yes             Yes
## _______________ ________________ _______________ _______________
## S.E.: Clustered  by: countrycode by: countrycode by: countrycode
## Observations                 351             342             333
## R2                       0.32948         0.32818         0.33661
## Within R2                0.00321         0.00114         0.00329
## 
##                              m3
## Dependent Var.:              gA
##                                
## gPA                            
## gPA_lag1                       
## gPA_lag2                       
## gPA_lag3        0.0016 (0.0022)
## Fixed-Effects:  ---------------
## countrycode                 Yes
## year                        Yes
## _______________ _______________
## S.E.: Clustered by: countrycode
## Observations                324
## R2                      0.32662
## Within R2               0.00050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fixed-effects regressions of TFP growth (gA) on patent growth (gPA) reveal that, although the contemporaneous coefficient on gPA is statistically significant, the explanatory power is negligible. The within $R^2$ values ($\approx 0.001–0.003$) show that after accounting for country fixed effects (capturing persistent cross-country differences) and year fixed effects (capturing global shocks), patent growth explains almost none of the remaining within-country, over-time variation in TFP growth. In other words, most of the “fit” in these models comes from the dummies, not from patents themselves, suggesting that patents have very limited relevance as an instrument for TFP in this panel.

End Remark

Subsequent Analysis can be applied to this dataset. But we will stop here; as the target of this project is for me to gain exposure to data cleaning/extraction and to show my competence for such skills.