---
title: "Getting Started with realestatebr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with realestatebr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.asp = 0.618,
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
```

## Introduction

This vignette provides a minimal introduction to the `realestatebr` package, showing how to use its core functions. Since `realestatebr` returns `tibble` objects by default, we recommend using it together with the `dplyr` package, though conversion to `data.table` is trivial.

```{r}
library(realestatebr)
library(dplyr)
```
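As a minimal sketch of the `data.table` conversion mentioned above, the chunk below converts a small hand-made tibble. The `toy` object and its values are made up for illustration and only stand in for a dataset returned by the package; the conversion is skipped if `data.table` is not installed.

```{r}
# Toy tibble standing in for a dataset returned by the package
# (hypothetical values, for illustration only)
toy <- tibble::tibble(
  date = as.Date(c("2024-01-01", "2024-02-01")),
  value = c(1.5, 2.5)
)

# Convert only if data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  toy_dt <- data.table::as.data.table(toy)
  class(toy_dt)
}
```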

The code below defines a common theme for all plots in this vignette. It is needed only to reproduce the plots exactly as shown; otherwise it is optional and can be omitted.

```{r setup, message = FALSE}
#| code-fold: true
library(ggplot2)

color_palette <- c(
  "#1E3A5F",
  "#DD6B20",
  "#2C7A7B",
  "#D69E2E",
  "#805AD5",
  "#C53030"
)

theme_series <- function() {
  theme_minimal(
    # swap for other font if needed
    base_family = "Avenir",
    base_size = 10
  ) +
    theme(
      plot.title = element_text(size = 16),
      panel.grid.minor = element_blank(),
      panel.grid.major.x = element_blank(),
      axis.line.x = element_line(color = "gray10", linewidth = 0.5),
      axis.ticks.x = element_line(color = "gray10", linewidth = 0.5),
      axis.title.x = element_blank(),
      legend.position = "bottom",
      palette.color.discrete = color_palette
    )
}
```

```{r}
#| include: false
library(knitr)
library(kableExtra)
```

`realestatebr` provides a unified interface to Brazilian real estate data from
multiple public sources. All datasets are returned as tidy `tibble` objects.

## Core Interface

The package is centered on one key function, `get_dataset(name, table)`, which retrieves any dataset by name. Without a `table` argument it returns the default table; use `table` to select a specific sub-table.

- Use the main function, `get_dataset()`, to retrieve datasets.

```{r}
#| eval: false
# Default table
abecip <- get_dataset("abecip")

# Specific table
sbpe <- get_dataset("abecip", table = "units")
```

To explore which datasets are available, use `list_datasets()` and `get_dataset_info()`.

- **`list_datasets()`** returns a catalogue of all available datasets and their
tables.

```{r}
ds <- list_datasets()
```

```{r}
#| echo: false
ds |>
  select(name, title, source, available_tables, frequency) |>
  kable() |>
  kable_styling(bootstrap_options = "striped") |>
  scroll_box(width = "100%", height = "400px")
```

- **`get_dataset_info()`** shows available tables and metadata for a given
dataset.

```{r}
#| eval: false
info <- get_dataset_info("abecip")
names(info$categories)
#> [1] "sbpe"  "units"  "cgi"
```

### The `source` Argument

The `source` argument of `get_dataset()` controls where the data comes from. The default (`"auto"`) checks the local cache first, then falls back to the GitHub release. Typically, the best option is the default or `"github"`. Choosing `"fresh"` downloads the data directly from the original source: this guarantees the most recent data but is slower.

```{r}
#| eval: false
get_dataset("abecip", source = "cache") # local cache (instant, works offline)
get_dataset("abecip", source = "github") # GitHub release
get_dataset("abecip", source = "fresh") # direct from the original source
```

Cache files are stored in the user data directory and can be inspected with
`list_cached_files()` or cleared with `clear_user_cache()`.

## Example: Housing Credit Cycle

SBPE (Sistema Brasileiro de Poupança e Empréstimo) is the primary funding
mechanism for residential mortgages in Brazil. The `sbpe` table from `abecip` tracks deposits into and withdrawals from savings accounts, which help finance real estate construction and acquisition.

```{r}
sbpe <- get_dataset("abecip", table = "sbpe")

glimpse(sbpe)
```

The plot below shows the annual net savings flow in recent years.

```{r}
#| code-fold: true
# Annual net savings flow
sbpe_annual <- sbpe |>
  filter(date >= as.Date("2019-01-01")) |>
  mutate(year = lubridate::year(date)) |>
  summarise(net_flow = sum(sbpe_netflow, na.rm = TRUE) / 1e3, .by = year) |>
  mutate(
    label_num = format(round(net_flow, 1)),
    ypos = if_else(net_flow > 0, net_flow + 10, net_flow - 10)
  )

ggplot(sbpe_annual, aes(year, net_flow)) +
  geom_col(fill = color_palette[1], alpha = 0.9, width = 0.8) +
  geom_text(aes(y = ypos, label = label_num), size = 3) +
  geom_hline(yintercept = 0) +
  scale_x_continuous(breaks = 2019:2026) +
  labs(
    title = "Annual Net Savings Flow (SBPE)",
    x = NULL,
    y = "R$ billions"
  ) +
  theme_series()
```

The companion table `"units"` contains monthly counts of financed units.

```{r}
units <- get_dataset("abecip", table = "units")

glimpse(units)
```

The plot shows the number of units financed per month together with a LOESS trend line.

```{r}
#| code-fold: true
# SBPE units financed per month
units_recent <- units |>
  filter(date >= as.Date("2019-01-01"))

ggplot(units_recent, aes(date, units_total)) +
  geom_point(alpha = 0.5, size = 0.8, color = color_palette[1]) +
  geom_smooth(
    color = color_palette[1],
    lwd = 0.7,
    se = FALSE,
    method = stats::loess,
    method.args = list(span = 0.4)
  ) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(
    title = "Monthly Financed Units",
    y = "Units"
  ) +
  theme_series()
```

## Example: Real Estate Credit Portfolio

The `bcb_realestate` dataset imports all real estate statistics from the [Brazilian Central Bank](https://www.bcb.gov.br/estatisticas/mercadoimobiliario). This is a relatively large dataset, and exploring it can be cumbersome. Each observation is uniquely identified by `date` and `series_info`. The helper columns `v1`, `v2`, ..., `v5`, `abbrev_state`, `category`, and `type` are provided to simplify working with the dataset.

The code below shows how to access a specific series and also how to fetch a group of related series.

```{r}
bcb <- get_dataset("bcb_realestate")

# Get a specific series
sfh_pf <- bcb |>
  filter(series_info == "credito_estoque_carteira_credito_pf_sfh_br")

# Get all the related series for 'estoque_carteira_credito_pf'
credit_stock <- bcb |>
  filter(
    category == "credito",
    type == "estoque",
    v1 == "carteira",
    v2 == "credito",
    v3 == "pf",
    # v4 is not filtered, so all credit lines are kept
    v5 == "br"
  )

# The helper columns essentially split the 'series_info' column, allowing
# for easier filtering. The filter above is equivalent to filtering by regex
credit_stock <- bcb |>
  filter(grepl(
    "(?<=credito_estoque_carteira_credito_pf_).+_br$",
    series_info,
    perl = TRUE
  ))
```
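To make the relationship between `series_info` and the helper columns concrete, the sketch below splits the identifier used above on underscores and labels the parts. It assumes the helper columns correspond one-to-one to these underscore-separated parts, which is inferred from the filter in the previous chunk and may not hold for every series in the dataset.

```{r}
# Series identifier taken from the example above
id <- "credito_estoque_carteira_credito_pf_sfh_br"

# Split on underscores and name each part after the matching helper column
# (assumed mapping, inferred from the filters shown earlier)
parts <- strsplit(id, "_", fixed = TRUE)[[1]]
names(parts) <- c("category", "type", "v1", "v2", "v3", "v4", "v5")
parts

# The same identifier is matched by the regex used in the previous chunk
grepl("(?<=credito_estoque_carteira_credito_pf_).+_br$", id, perl = TRUE)
```

Under this reading, `v4` holds the credit line (`"sfh"` here), which is why leaving it out of the filter keeps every credit line.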

The single series shows only the values for SFH (a specific credit line).

```{r}
#| code-fold: true
ggplot(sfh_pf, aes(date, value / 1e9)) +
  geom_line(lwd = 0.7, color = color_palette[1]) +
  labs(title = "SFH", y = "R$ (billions)") +
  theme_series()
```

The grouped series show the entire household credit stock by credit line.

```{r}
#| code-fold: true
credit_labels <- c(
  "Home Equity" = "home-equity",
  "Comercial" = "comercial",
  "Livre" = "livre",
  "FGTS" = "fgts",
  "SFH" = "sfh"
)

credit_stock <- credit_stock |>
  mutate(
    credit_line_label = factor(
      v4,
      levels = credit_labels,
      labels = names(credit_labels)
    )
  )

ggplot(credit_stock, aes(date, value / 1e9)) +
  geom_area(aes(fill = credit_line_label), alpha = 0.9) +
  scale_fill_manual(values = rev(color_palette[1:5])) +
  scale_x_date(expand = expansion(mult = c(0.01))) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "Real Estate Credit Stock",
    subtitle = "Household real estate credit stock (total debt) by credit line",
    y = "R$ (billions)",
    fill = NULL
  ) +
  theme_series()
```

As a final warning, note that dates in the `bcb_realestate` dataset follow the `YYYY-MM-DD` format and default to the last day of the month (e.g. `2023-01-31`). This can cause issues when merging with other datasets, since the first day of the month is the more common convention (e.g. `2023-01-01`).

To avoid this, use `lubridate::floor_date(date, 'month')`. Future versions of `realestatebr` might provide this as a default behavior.
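The sketch below illustrates the issue and the fix on two small hand-made tables. The objects `month_end` and `month_start` and all their values are made up for illustration; they only mimic the two date conventions and are not real data from the package.

```{r}
library(dplyr)

# Toy month-end series in the style of bcb_realestate (made-up values)
month_end <- tibble(
  date = as.Date(c("2023-01-31", "2023-02-28", "2023-03-31")),
  value = c(10, 11, 12)
)

# Toy month-start series, the more common convention (made-up values)
month_start <- tibble(
  date = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
  other = c(1, 2, 3)
)

# Joining directly finds no matching dates
nrow(inner_join(month_end, month_start, by = "date"))

# Align to the first day of the month before joining
aligned <- month_end |>
  mutate(date = lubridate::floor_date(date, "month")) |>
  inner_join(month_start, by = "date")

nrow(aligned)
```

Flooring the month-end dates, rather than shifting the month-start dates forward, keeps the result compatible with the first-of-month convention used by most other datasets.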

## Reference (all datasets)

The available datasets are listed below.

| Dataset | Source | Tables | Status |
|---------|--------|--------|--------|
| `abecip` | ABECIP | `sbpe`, `units`, `cgi` | Active |
| `abrainc` | ABRAINC / FIPE | `indicator`, `radar`, `leading` | Active |
| `bcb_realestate` | Banco Central do Brasil | `accounting`, `application`, `indices`, `sources`, `units` | Active |
| `bcb_series` | Banco Central do Brasil | `core`, `primary`, `secondary`, `tertiary`, `full` | Active |
| `fgv_ibre` | FGV IBRE | — | Active |
| `rppi` | FIPE/ZAP, IVGR, IGMI, IQA, IVAR, SECOVI-SP | `sale`, `rent`, `fipezap`, `ivgr`, `igmi`, `iqa`, `iqaiw`, `ivar`, `secovi_sp` | Active |
| `rppi_bis` | Bank for International Settlements | `selected`, `detailed_monthly`, `detailed_quarterly`, `detailed_annual`, `detailed_halfyearly` | Active |
| `secovi` | SECOVI-SP | `condo`, `rent`, `launch`, `sale` | Active |