Using R to Combine Multiple txt Files into One csv File

Task

Use R to combine several hundred txt files, spread across multiple folders, into a single csv file.

Dataset

Government work reports from provinces and cities across China, 2001-2021.

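Task breakdown

1. Use list.files to get the list of file paths
2. Define the functions we need
   - readtext::readtext() to read the report text
   - a year function and a province function
3. For each file path, use the functions from step 2 to derive the year and province fields and build a tibble
4. Use bind_cols to merge the results of step 2 and step 3 into one tibble
5. Use readr::write_csv() to save the result to data.csv
6. Inspect data.csv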

The data are stored in the province folder, which can be downloaded as a dataset.

1. List of txt paths

Use the list.files function to list:

- the folder paths
- the file paths

Folder paths inside province

dirs <- list.files('province', full.names = TRUE)
head(dirs)

## [1] "province/上海"   "province/云南"   "province/内蒙古" "province/北京"  
## [5] "province/吉林"   "province/四川"

File paths inside all province folders

files <- list.files(dirs, full.names = TRUE)
head(files)

## [1] "province/上海/2003年上海政府工作报告.txt"
## [2] "province/上海/2004年上海政府工作报告.txt"
## [3] "province/上海/2005年上海政府工作报告.txt"
## [4] "province/上海/2006年上海政府工作报告.txt"
## [5] "province/上海/2007年上海政府工作报告.txt"
## [6] "province/上海/2008年上海政府工作报告.txt"

There are 617 txt files in total

length(files)

## [1] 617
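The two-step listing above picks up every entry inside each province folder. As a minimal sketch, assuming you want only files ending in .txt, list.files can do the whole walk in one call with recursive = TRUE and a pattern. The folder and file names below are made up for the demo and live in a temporary directory, not in the real dataset.

```r
# Build a small throwaway tree that mimics province/<name>/<file>.txt
# (all names here are hypothetical demo names)
root <- file.path(tempdir(), "province_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "shanghai"), recursive = TRUE)
dir.create(file.path(root, "beijing"))
writeLines("report text", file.path(root, "shanghai", "2003_report.txt"))
writeLines("report text", file.path(root, "beijing", "2004_report.txt"))
writeLines("not a report", file.path(root, "beijing", "notes.md"))

# recursive = TRUE walks every subfolder; pattern keeps only .txt files
demo_files <- list.files(root, pattern = "\\.txt$",
                         recursive = TRUE, full.names = TRUE)
length(demo_files)
```

The notes.md file is skipped because it does not match the pattern, so stray files in a province folder would not end up in the csv.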

2.1 Reading the txt files with readtext

Use readtext::readtext to read all the txt files in one call

txts_df <- readtext::readtext(files)
head(txts_df)

## readtext object consisting of 6 documents and 0 docvars.
## # Description: df [6 × 2]
##   doc_id                     text                         
##   <chr>                      <chr>                        
## 1 2003年上海政府工作报告.txt "\"  各位代表， 现在\"..."   
## 2 2004年上海政府工作报告.txt "\" 各位代表：\n\n  \"..."   
## 3 2005年上海政府工作报告.txt "\"各位代表：\n\n　　现\"..."
## 4 2006年上海政府工作报告.txt "\"各位代表：\n　　上海\"..."
## 5 2007年上海政府工作报告.txt "\"　　政府工作报告\n　\"..."
## 6 2008年上海政府工作报告.txt "\"\n\t政府工作报告\n\n\"..."

Check that the text column has 617 entries.

length(txts_df[['text']])

## [1] 617
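If readtext is not available, a similar two-column table can be built with base R. This is a swapped-in alternative, not the method used in this post: the function name read_txt_files is hypothetical, and the sketch assumes the files are UTF-8 encoded.

```r
# Hypothetical helper: read each file into one string and return
# a data frame with the same doc_id / text columns as readtext
read_txt_files <- function(paths, encoding = "UTF-8") {
  texts <- vapply(paths, function(p) {
    # Read all lines, then join them back with newlines
    paste(readLines(p, encoding = encoding, warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(doc_id = basename(paths), text = texts,
             stringsAsFactors = FALSE)
}

# Demo on two temporary files
p1 <- tempfile(fileext = ".txt")
p2 <- tempfile(fileext = ".txt")
writeLines(c("line 1", "line 2"), p1)
writeLines("single line", p2)

demo_df <- read_txt_files(c(p1, p2))
# Rows come back in the same order as the input paths
stopifnot(identical(demo_df$doc_id, basename(c(p1, p2))))
nrow(demo_df)
```

The order check matters because the later merge step matches rows by position.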

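2.2 Defining the helper functions

In the combined csv we want to keep four fields:

- the txt file name
- the year
- the province (or municipality) name
- the text of the work report

The year and the province have to be derived from the file name with helper functions. The year, for example, is the first four characters of the file name: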

s<-basename("province/上海/2003年上海政府工作报告.txt") 
substr(s, 1, 4)

## [1] "2003"

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

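# Extract the four-digit year from the start of the file name
year_func <- function(filepath){
  year <- filepath %>% 
    basename() %>% 
    substr(1, 4)
  return (year)
}

# Extract the province name: drop the fixed suffix, then skip the "YYYY年" prefix
name_func <- function(file){
  file <- basename(file)
  name <- gsub('政府工作报告.txt', '', file, fixed = TRUE) 
  name <- stringr::str_sub(name, start=6)
  return (name)
}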

file <- "province/上海/2003年上海政府工作报告.txt"
year_func(file)

## [1] "2003"

name_func(file)

## [1] "上海"
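Both helpers depend on fixed character positions, so a file whose name does not follow the "YYYY年<province>政府工作报告.txt" pattern would silently produce a wrong year or province. A minimal sketch of a stricter variant, assuming that naming pattern holds for every report: parse_report_name is a hypothetical name, and the second and third paths are made-up examples.

```r
# Hypothetical helper: parse year and province with one regular expression,
# returning NA for file names that do not match the expected pattern
parse_report_name <- function(path) {
  file <- basename(path)
  pattern <- "^([0-9]{4})年(.+)政府工作报告\\.txt$"
  matched <- grepl(pattern, file)
  data.frame(
    year = ifelse(matched, sub(pattern, "\\1", file), NA_character_),
    province = ifelse(matched, sub(pattern, "\\2", file), NA_character_),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_report_name(c(
  "province/上海/2003年上海政府工作报告.txt",
  "province/内蒙古/2010年内蒙古政府工作报告.txt",  # made-up file name
  "province/readme.txt"                            # made-up non-report file
))
parsed
```

Rows with NA can then be inspected or dropped before the csv is written.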

txts_df is a readtext object, a special kind of data frame. Next, apply the year and province functions to its doc_id column to build a second tibble.

year_province_df <- tibble(
  year = year_func(txts_df$doc_id),
  province = lapply(txts_df$doc_id, name_func) %>% unlist()
  )

head(year_province_df)

## # A tibble: 6 × 2
##   year  province
##   <chr> <chr>   
## 1 2003  上海    
## 2 2004  上海    
## 3 2005  上海    
## 4 2006  上海    
## 5 2007  上海    
## 6 2008  上海

4. Merging the two tibbles

Use bind_cols() to merge the two tibbles

res_df <- bind_cols(year_province_df, txts_df)
head(res_df)

## # A tibble: 6 × 4
##   year  province doc_id                     text                                
##   <chr> <chr>    <chr>                      <chr>                               
## 1 2003  上海     2003年上海政府工作报告.txt "  各位代表， 现在，我代表上海市人… 
## 2 2004  上海     2004年上海政府工作报告.txt " 各位代表：\n\n    现在，我代表上… 
## 3 2005  上海     2005年上海政府工作报告.txt "各位代表：\n\n　　现在，我代表上海…
## 4 2006  上海     2006年上海政府工作报告.txt "各位代表：\n　　上海市国民经济和社…
## 5 2007  上海     2007年上海政府工作报告.txt "　　政府工作报告\n　　――2007年1月2…
## 6 2008  上海     2008年上海政府工作报告.txt "\n\t政府工作报告\n\n\t——2008年1月2…
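bind_cols() matches rows purely by position, so it relies on both tables being in the same order. As a hedged alternative sketch, the new columns can be computed directly from doc_id with dplyr::mutate, which keeps each year and province on the row it was derived from. The two-row table below is made-up demo data, not the real reports.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for txts_df with the same doc_id / text columns
texts <- tibble(
  doc_id = c("2003年上海政府工作报告.txt", "2004年上海政府工作报告.txt"),
  text = c("first report", "second report")
)

combined <- texts %>%
  mutate(
    # First four characters are the year
    year = substr(doc_id, 1, 4),
    # Skip the "YYYY年" prefix, then drop the fixed suffix
    province = sub("政府工作报告.txt", "", substring(doc_id, 6), fixed = TRUE)
  ) %>%
  select(year, province, doc_id, text)

combined
```

The final select() only reorders the columns to match the layout of res_df.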

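5. Saving to csv

Use readr::write_csv(x, file, col_names) to write data.csv

- x: the data object to save
- file: path of the csv file
- col_names: whether to write the column names as a header row

write_csv always uses a comma as the delimiter; readr::write_delim takes a delim argument if a different separator is needed.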

?readr::write_csv

readr::write_csv(x = res_df,
                 file = 'data.csv',
                 col_names = TRUE)
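The report texts contain line breaks and may contain commas, so it is worth confirming that such fields survive the trip through csv. A minimal sketch with made-up data, written to a temporary file rather than data.csv: write_csv wraps such fields in double quotes, and read_csv restores them as a single value.

```r
library(readr)
library(tibble)

# Made-up one-row table whose text holds a comma, a newline and quotes
demo <- tibble(
  doc_id = "demo.txt",
  text = "line one, with a comma\nline two with \"quotes\""
)

demo_path <- tempfile(fileext = ".csv")
write_csv(demo, demo_path)

# The raw file spans more physical lines than the table has rows,
# because the quoted field contains a line break
raw_lines <- readLines(demo_path)
length(raw_lines)

# Reading it back gives one row with the original text
back <- read_csv(demo_path, show_col_types = FALSE)
identical(back$text, demo$text)
```

This is why counting lines in data.csv with a text editor will not give the number of reports; count rows after reading it back instead.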

6. Checking data.csv

Try reading data.csv back in

df <- readr::read_csv('data.csv')

## Rows: 617 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): province, doc_id, text
## dbl (1): year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df)

## # A tibble: 6 × 4
##    year province doc_id                     text                                
##   <dbl> <chr>    <chr>                      <chr>                               
## 1  2003 上海     2003年上海政府工作报告.txt "各位代表， 现在，我代表上海市人民… 
## 2  2004 上海     2004年上海政府工作报告.txt "各位代表：\n\n    现在，我代表上海…
## 3  2005 上海     2005年上海政府工作报告.txt "各位代表：\n\n　　现在，我代表上海…
## 4  2006 上海     2006年上海政府工作报告.txt "各位代表：\n　　上海市国民经济和社…
## 5  2007 上海     2007年上海政府工作报告.txt "　　政府工作报告\n　　――2007年1月2…
## 6  2008 上海     2008年上海政府工作报告.txt "\n\t政府工作报告\n\n\t——2008年1月2…
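Two things differ from res_df in the output above: year is now a double rather than a character, and the leading spaces of the first two texts are gone, because read_csv guesses column types and trims whitespace by default. A minimal sketch of a stricter read, using made-up data in a temporary file, and assuming you want every column back as character with whitespace preserved:

```r
library(readr)
library(tibble)

# Made-up row with a year stored as text and leading spaces in the text
demo <- tibble(year = "2003", text = "  leading spaces")
demo_path <- tempfile(fileext = ".csv")
write_csv(demo, demo_path)

# Default read: year is guessed as numeric, leading spaces are trimmed
loose <- read_csv(demo_path, show_col_types = FALSE)

# Stricter read: force character columns and keep whitespace as written
strict <- read_csv(
  demo_path,
  col_types = cols(.default = col_character()),
  trim_ws = FALSE
)

class(loose$year)
class(strict$year)
identical(strict$text, demo$text)
```

Whether the trimmed or the untrimmed version is preferable depends on the later analysis; for text mining the leading spaces are usually harmless either way.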