{ "cells": [ { "cell_type": "markdown", "id": "06573cfa", "metadata": {}, "source": [ "pandas是最有用的Python数据分析库, 两个数据类型DataFrame和Series,值的我们反复接触、学习和实验,逐渐的将pandas独特的语法掌握。\n", "\n", "
\n", "\n", "## DataFrame是什么?\n", "DataFrame是一种二维数据类型, 跟 Excel中的数据结构差不多, 即\n", "\n", "- 每行代表一条记录\n", "- 每列代表一个字段(变量)\n", "\n", "每列(行),看做一个序列Series,在pandas中就是pd.Series类型的数据, 由很多pd.Sereis组成的数据就是DataFrame。\n", "\n", "![](img/base_01_pandas_5_0.png)" ] }, { "cell_type": "markdown", "id": "fdb1f460", "metadata": {}, "source": [ "## 一、创建DataFrame\n", "创建DataFrame有两种方式\n", "\n", "1. 使用pd.read_csv()/ pd.read_excel() 从csv、xlsx等文件导入, 得到DataFrame\n", "2. 使用pd.DataFrame()函数构造DataFrame\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "f730781a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "data = {'name': ['David', 'Mary', 'Jessica', 'John'],\n", " 'age': ['25', '30', '31', '20'],\n", " 'gender': ['male', 'female', 'female', 'male']}\n", " \n", "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "markdown", "id": "d5cee295", "metadata": {}, "source": [ "
\n", "\n", "## 二、选中某行(列)\n", "\n", "1. ``df.loc[row, :]`` 选中第row行所有数据\n", "2. ``df.loc[row:row, :]`` 选中第row行所有数据\n", "3. ``df.loc[row1: row2, ]`` 选中从row1至row2行的所有数据\n", "4. ``df.loc[:, col]`` 选中第col列所有数据\n", "5. ``df.loc[:, [col]]`` 选中第col列所有数据\n", "6. ``df.loc[:, cols]`` 选中cols列的所有数据" ] }, { "cell_type": "code", "execution_count": 14, "id": "655c203c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name Jessica\n", "age 31\n", "gender female\n", "Name: 2, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#选中第二行所有数据,返回结果为Series\n", "df.loc[2, :]" ] }, { "cell_type": "code", "execution_count": 15, "id": "c94609ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
2Jessica31female
\n", "
" ], "text/plain": [ " name age gender\n", "2 Jessica 31 female" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#选中第二行所有数据,返回结果为DataFrame\n", "df.loc[2:2, :]" ] }, { "cell_type": "code", "execution_count": 16, "id": "336921c5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 male\n", "1 female\n", "2 female\n", "3 male\n", "Name: gender, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#选中某列所有数据,返回结果为Series\n", "df.loc[:, 'gender']" ] }, { "cell_type": "code", "execution_count": 18, "id": "20490184", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gender
0male
1female
2female
3male
\n", "
" ], "text/plain": [ " gender\n", "0 male\n", "1 female\n", "2 female\n", "3 male" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#选中某列所有数据,返回结果为DataFrame\n", "df.loc[:, ['gender']]" ] }, { "cell_type": "code", "execution_count": 17, "id": "53def89c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegender
025male
130female
231female
320male
\n", "
" ], "text/plain": [ " age gender\n", "0 25 male\n", "1 30 female\n", "2 31 female\n", "3 20 male" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#选中某几列所有数据,返回结果为DataFrame\n", "df.loc[:, ['age', 'gender']]" ] }, { "cell_type": "markdown", "id": "439ae03b", "metadata": {}, "source": [ "
\n", "\n", "## 三、新建某行(列)\n", "- ``df[colname] = col_data`` 新增一列(字段)数据\n", "- ``df.loc[row] = row_data`` 新增一条行(记录)数据" ] }, { "cell_type": "code", "execution_count": 19, "id": "a8dc829c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#复制实验数据\n", "df2 = df.copy()\n", "df2" ] }, { "cell_type": "code", "execution_count": 20, "id": "643626f1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendernation
0David25maleUS
1Mary30femaleUK
2Jessica31femaleUK
3John20maleUS
\n", "
" ], "text/plain": [ " name age gender nation\n", "0 David 25 male US\n", "1 Mary 30 female UK\n", "2 Jessica 31 female UK\n", "3 John 20 male US" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#新增nation字段列\n", "df2['nation'] = ['US', 'UK', 'UK', 'US']\n", "df2" ] }, { "cell_type": "code", "execution_count": 22, "id": "1a208785", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegendernation
0David25maleUS
1Mary30femaleUK
2Jessica31femaleUK
3John20maleUS
4Robert22maleUS
\n", "
" ], "text/plain": [ " name age gender nation\n", "0 David 25 male US\n", "1 Mary 30 female UK\n", "2 Jessica 31 female UK\n", "3 John 20 male US\n", "4 Robert 22 male US" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#新建(增)一行数据\n", "df2.loc[4] = ['Robert', 22, 'male', 'US']\n", "df2" ] }, { "cell_type": "markdown", "id": "ae1dac2f", "metadata": {}, "source": [ "
\n", "\n", "## 四、删除\n", "1. df.drop_duplicates() 删除重复的数据\n", "2. df.drop() 删除某行(列)" ] }, { "cell_type": "code", "execution_count": 39, "id": "9021c4db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
4John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male\n", "4 John 20 male" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3 = df.copy()\n", "df3.loc[4] = ['John', 20, 'male']\n", "df3" ] }, { "cell_type": "code", "execution_count": 40, "id": "edeea7fa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 根据name字段删除重复的数据\n", "#inplace=True,操作后会修改df3原始数据\n", "#df3.drop_duplicates(inplace=True, subset='name')\n", "\n", "df3.drop_duplicates(subset='name')" ] }, { "cell_type": "code", "execution_count": 47, "id": "4fea4b80", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#删除某行(索引值)的数据\n", "#inplace=True,操作后会修改df3原始数据\n", "#df3.drop(1, inplace=True)\n", "df3.drop(4)" ] }, { "cell_type": "code", "execution_count": 41, "id": "41787e8a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameage
0David25
1Mary30
2Jessica31
3John20
4John20
\n", "
" ], "text/plain": [ " name age\n", "0 David 25\n", "1 Mary 30\n", "2 Jessica 31\n", "3 John 20\n", "4 John 20" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#删除gender字段列\n", "#inplace=True,操作后会修改df3原始数据\n", "#df3.drop('gender', axis=1, inplace=True)\n", "df3.drop('gender', axis=1)" ] }, { "cell_type": "markdown", "id": "f0bea87e", "metadata": {}, "source": [ "
\n", "\n", "## 五、行列重命名\n", "\n", "使用 df.rename() 重命名dataframe" ] }, { "cell_type": "code", "execution_count": 48, "id": "4e219f97", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age gender\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df4 = df.copy()\n", "df4" ] }, { "cell_type": "code", "execution_count": 49, "id": "5754043a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagesex
0David25male
1Mary30female
2Jessica31female
3John20male
\n", "
" ], "text/plain": [ " name age sex\n", "0 David 25 male\n", "1 Mary 30 female\n", "2 Jessica 31 female\n", "3 John 20 male" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#重命名 字段(列)名\n", "df4.rename(columns = {'gender': 'sex'}, inplace=True)\n", "df4" ] }, { "cell_type": "code", "execution_count": 50, "id": "caa3e9c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameagegender
aDavid25male
bMary30female
cJessica31female
cJohn20male
\n", "
" ], "text/plain": [ " name age gender\n", "a David 25 male\n", "b Mary 30 female\n", "c Jessica 31 female\n", "c John 20 male" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 重命名行索引index\n", "df4.rename(index = {0: 'a', \n", " 1:'b', \n", " 2:'c', \n", " 3:'c'})" ] }, { "cell_type": "markdown", "id": "af403dd3", "metadata": {}, "source": [ "
\n", "\n", "## 六、替换" ] }, { "cell_type": "code", "execution_count": 54, "id": "eab9a2ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Student1Student2Student3
0OKsPerfectAcceptable
1AwfulAwfulPerfect
2AcceptableOKPoor
\n", "
" ], "text/plain": [ " Student1 Student2 Student3\n", "0 OKs Perfect Acceptable\n", "1 Awful Awful Perfect\n", "2 Acceptable OK Poor" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df5 = pd.DataFrame({\"Student1\":['OKs','Awful','Acceptable'], \n", " \"Student2\":['Perfect','Awful','OK'], \n", " \"Student3\":['Acceptable','Perfect','Poor']})\n", "\n", "df5" ] }, { "cell_type": "code", "execution_count": 55, "id": "3afd2456", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Student1Student2Student3
0OKs43
1004
2321
\n", "
" ], "text/plain": [ " Student1 Student2 Student3\n", "0 OKs 4 3\n", "1 0 0 4\n", "2 3 2 1" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 将评级单词改为对应数字序列\n", "df5.replace(['Awful', 'Poor', 'OK', 'Acceptable', 'Perfect'], \n", " [0, 1, 2, 3, 4]) " ] }, { "cell_type": "code", "execution_count": 56, "id": "2dfb0f9b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Student1Student2Student3
0243
1004
2321
\n", "
" ], "text/plain": [ " Student1 Student2 Student3\n", "0 2 4 3\n", "1 0 0 4\n", "2 3 2 1" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 将评级单词改为对应数字序列\n", "# 使用正则表达式匹配OK为2\n", "df5.replace(['Awful', 'Poor', 'OK*', 'Acceptable', 'Perfect'], \n", " [0, 1, 2, 3, 4],\n", " regex=True) " ] }, { "cell_type": "markdown", "id": "b2a42e99", "metadata": {}, "source": [ "
\n", "\n", "## 七、 识别日期\n", "\n", "```python\n", "import pandas as pd\n", "\n", "#自动识别日期数据\n", "pd.read_csv('yourFile', parse_dates=True)\n", "\n", "#或指定日期所在列的字段名\n", "pd.read_csv('yourFile', parse_dates=['columnName'])\n", "\n", "#指定日期字段的日期格式进行解析\n", "dateparser = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')\n", "pd.read_csv(infile, parse_dates=['columnName'], date_parser=dateparse)\n", "```" ] }, { "cell_type": "markdown", "id": "3d4ca74f", "metadata": {}, "source": [ "
\n", "\n", "## 八、保存\n", "\n", "```python\n", "df.to_csv('myDataFrame.csv', encoding='utf-8')\n", "df.to_excel('myDataFrame.xlsx')\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }