1. Introduction
Pandas is one of the most popular data analysis libraries in Python. It provides fast, flexible, and expressive data structures designed to make data cleaning and analysis simple and intuitive. In data science competitions, such as those hosted on Kaggle and Tianchi, fluency with Pandas is one of the key factors for success. This tutorial starts from the basic concepts and works up to advanced techniques and competition strategy, helping you master Pandas and become a strong competitor.
2. Pandas Basics
Core Data Structures
Pandas has two primary data structures: Series and DataFrame.
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, and so on). The axis labels are collectively referred to as the index.
```python
import pandas as pd
import numpy as np

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
```
Output:
```
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
```
A DataFrame is a two-dimensional labeled data structure whose columns can hold different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used data structure in Pandas.
```python
# Create a DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 42],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
```
Output:
```
    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin
3  Linda   42    London
```
Basic Operations
Viewing Data
```python
# View the first few rows
print(df.head())
# View the last few rows
print(df.tail())
# Statistical summary of the numeric columns
print(df.describe())
# Basic information about the DataFrame
# (info() prints directly and returns None, so no print() is needed)
df.info()
```
Selecting Data
```python
# Select a single column
print(df['Name'])
# Select multiple columns
print(df[['Name', 'Age']])
# Select rows with iloc (by position)
print(df.iloc[0])    # first row
print(df.iloc[0:2])  # first two rows
# Select rows with loc (by label)
print(df.loc[0])     # first row
print(df.loc[0:2])   # first three rows (note: loc includes the end label)
# Conditional selection
print(df[df['Age'] > 30])
```
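A more readable alternative for conditional selection, especially when combining several conditions, is DataFrame.query(), which takes the condition as a string expression:
```python
# Boolean-mask style and query() style produce the same result
print(df[(df['Age'] > 30) & (df['City'] != 'Paris')])
print(df.query("Age > 30 and City != 'Paris'"))
```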
Sorting Data
```python
# Sort by age, ascending
print(df.sort_values('Age'))
# Sort by age, descending
print(df.sort_values('Age', ascending=False))
```
3. Loading and Exploring Data
In competitions, data is usually provided in CSV, Excel, or other formats. Pandas offers several ways to load it.
Loading Data
```python
# Load data from a CSV file
# (assuming a file named 'train.csv')
train_df = pd.read_csv('train.csv')
# Load data from an Excel file
# train_df = pd.read_excel('train.xlsx')
# Look at the first rows
print(train_df.head())
# Shape of the data (rows, columns)
print(train_df.shape)
# Column names
print(train_df.columns)
```
Exploring Data
```python
# Data types
print(train_df.dtypes)
# Statistical summary of the numeric columns
print(train_df.describe())
# Value counts of a categorical column
# (assuming a categorical column named 'Category')
print(train_df['Category'].value_counts())
# Missing values
print(train_df.isnull().sum())
# Unique values
# (assuming a column named 'Product')
print(train_df['Product'].unique())
```
Correlation Analysis
```python
# Compute pairwise correlations between numeric columns
# (numeric_only=True avoids errors on non-numeric columns in pandas >= 2.0)
correlation = train_df.corr(numeric_only=True)
print(correlation)

# Visualize the correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
```
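In a competition, a quick way to shortlist promising numeric features is to rank them by absolute correlation with the target. A minimal sketch, assuming the training data has a numeric target column named 'target' (adjust the name for your dataset):
```python
# Rank numeric features by absolute correlation with the target.
# 'target' is a placeholder column name -- replace it with your dataset's label.
target_corr = (
    train_df.corr(numeric_only=True)['target']
    .drop('target')                 # remove the self-correlation of 1.0
    .abs()
    .sort_values(ascending=False)
)
print(target_corr.head(10))         # the ten features most correlated with the target
```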
4. Data Cleaning
Data cleaning is a crucial step in data analysis competitions. Raw data typically contains missing values, outliers, and duplicates that need to be handled properly.
Handling Missing Values
```python
# Check for missing values
print(train_df.isnull().sum())
# Drop rows containing missing values
train_df_dropna = train_df.dropna()
# Drop columns containing missing values
train_df_dropna_col = train_df.dropna(axis=1)

# Fill missing values
# Fill numeric columns with the mean
train_df_fillna = train_df.copy()
numeric_cols = train_df_fillna.select_dtypes(include=['int64', 'float64']).columns
train_df_fillna[numeric_cols] = train_df_fillna[numeric_cols].fillna(train_df_fillna[numeric_cols].mean())
# Fill categorical columns with the mode
categorical_cols = train_df_fillna.select_dtypes(include=['object']).columns
for col in categorical_cols:
    train_df_fillna[col] = train_df_fillna[col].fillna(train_df_fillna[col].mode()[0])

# Forward fill (useful for time series);
# note fillna(method='ffill') is deprecated in favor of .ffill()
train_df_ffill = train_df.ffill()
# Backward fill
train_df_bfill = train_df.bfill()
# Interpolation (for numeric columns)
train_df_interpolate = train_df.copy()
for col in numeric_cols:
    train_df_interpolate[col] = train_df_interpolate[col].interpolate()
```
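It is often convenient to wrap these checks into a small report showing both the count and the percentage of missing values per column. A minimal sketch (the helper name `missing_report` is just an illustration):
```python
def missing_report(df):
    """Return a table of missing-value counts and percentages, worst columns first."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': df.isnull().mean() * 100,
    })
    # Keep only columns that actually have missing values
    return report[report['missing_count'] > 0].sort_values('missing_pct', ascending=False)

print(missing_report(train_df))
```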
Handling Outliers
```python
# Detect outliers with a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x=train_df['Age'])
plt.title('Boxplot of Age')
plt.show()

# Detect outliers with Z-scores
from scipy import stats
z_scores = stats.zscore(train_df['Age'], nan_policy='omit')  # ignore NaN when computing
abs_z_scores = np.abs(z_scores)
outliers = (abs_z_scores > 3)
print(train_df[outliers])

# Detect outliers with the IQR method
Q1 = train_df['Age'].quantile(0.25)
Q3 = train_df['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = (train_df['Age'] < lower_bound) | (train_df['Age'] > upper_bound)
print(train_df[outliers])

# Handle the outliers
# Drop them
train_df_no_outliers = train_df[~outliers]
# Or cap them at the boundary values
train_df_capped = train_df.copy()
train_df_capped['Age'] = np.where(train_df_capped['Age'] < lower_bound, lower_bound,
                                  np.where(train_df_capped['Age'] > upper_bound, upper_bound,
                                           train_df_capped['Age']))
```
Handling Duplicates
```python
# Count duplicate rows
print(train_df.duplicated().sum())
# Inspect the duplicate rows
print(train_df[train_df.duplicated()])
# Drop duplicate rows
train_df_unique = train_df.drop_duplicates()
# Drop duplicates based on specific columns
train_df_unique_id = train_df.drop_duplicates(subset=['ID'])
```
5. Data Transformation and Feature Engineering
Feature engineering is the key step for improving model performance in competitions. Pandas provides powerful tools for creating new features and transforming data.
Creating New Features
```python
# Suppose we have a DataFrame containing dates
data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
        'Sales': [100, 150, 200, 175]}
df = pd.DataFrame(data)
# Convert the date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Extract year, month, day, day of week, etc. from the date
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['DayOfYear'] = df['Date'].dt.dayofyear
df['WeekOfYear'] = df['Date'].dt.isocalendar().week
print(df)

# Create interaction features between numeric columns
df['Sales_Squared'] = df['Sales'] ** 2
df['Sales_Log'] = np.log(df['Sales'])

# Create combinations of categorical features
# Suppose we have a DataFrame with products and categories
data = {'Product': ['A', 'B', 'A', 'C', 'B'],
        'Category': ['X', 'Y', 'X', 'Z', 'Y'],
        'Price': [10, 20, 15, 30, 25]}
df = pd.DataFrame(data)
# Combine product and category into one feature
df['Product_Category'] = df['Product'] + '_' + df['Category']
print(df)
```
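One caveat with raw date parts: month 12 and month 1 are adjacent in time but far apart numerically. A common competition trick is cyclical (sine/cosine) encoding. A minimal, self-contained sketch (the `dates` frame here is constructed just for illustration):
```python
# Cyclical encoding maps a periodic feature onto a circle,
# so that month 12 and month 1 end up close together.
dates = pd.DataFrame({'Date': pd.date_range('2021-01-01', periods=6, freq='D')})
dates['Month'] = dates['Date'].dt.month
dates['Month_sin'] = np.sin(2 * np.pi * dates['Month'] / 12)
dates['Month_cos'] = np.cos(2 * np.pi * dates['Month'] / 12)
dates['DayOfWeek'] = dates['Date'].dt.dayofweek
dates['DoW_sin'] = np.sin(2 * np.pi * dates['DayOfWeek'] / 7)
dates['DoW_cos'] = np.cos(2 * np.pi * dates['DayOfWeek'] / 7)
print(dates)
```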
Data Type Conversion
```python
# Check the data types
print(df.dtypes)
# Bin a numeric column into categories
df['Price_Category'] = pd.cut(df['Price'], bins=[0, 15, 25, 100], labels=['Low', 'Medium', 'High'])
# Label-encode a categorical column
df['Product_Code'] = df['Product'].astype('category').cat.codes
# One-hot encode with get_dummies
df_onehot = pd.get_dummies(df, columns=['Product', 'Category'])
print(df_onehot)
```
Feature Scaling and Standardization
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization (Z-score)
scaler = StandardScaler()
df['Price_Standardized'] = scaler.fit_transform(df[['Price']])
# Normalization (min-max scaling)
min_max_scaler = MinMaxScaler()
df['Price_Normalized'] = min_max_scaler.fit_transform(df[['Price']])
print(df)
```
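One caution for competitions: fit the scaler on the training data only, then apply that same fitted scaler to the validation/test data; otherwise statistics leak from the evaluation set into training. A minimal sketch of the pattern:
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

train_part, test_part = train_test_split(df, test_size=0.4, random_state=42)
scaler = StandardScaler()
# fit_transform on the training split only...
train_scaled = scaler.fit_transform(train_part[['Price']])
# ...then transform (not fit_transform) the held-out split with the same statistics
test_scaled = scaler.transform(test_part[['Price']])
```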
Advanced Feature Engineering
```python
# Suppose we have a DataFrame with several numeric features
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [10, 20, 30, 40, 50],
        'Feature3': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Create polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)
poly_df = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(df.columns))
print(poly_df)

# Create aggregation features (given a grouping variable)
data = {'Group': ['A', 'A', 'B', 'B', 'A'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Compute per-group mean, median, standard deviation, etc.
group_stats = df.groupby('Group')['Value'].agg(['mean', 'median', 'std', 'min', 'max'])
group_stats.columns = ['Group_' + col for col in group_stats.columns]
# Merge the aggregation features back into the original DataFrame
df = df.merge(group_stats, left_on='Group', right_index=True)
print(df)
```
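A related and very cheap aggregation feature is frequency (count) encoding, where each category is replaced by how often it occurs; it often helps tree-based models in competitions. A sketch on the same `df`:
```python
# Frequency encoding: map each group to its number of occurrences
df['Group_Freq'] = df.groupby('Group')['Group'].transform('count')
# An equivalent formulation via value_counts + map
df['Group_Freq2'] = df['Group'].map(df['Group'].value_counts())
print(df[['Group', 'Group_Freq', 'Group_Freq2']])
```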
6. Aggregation and Group-wise Analysis
Competitions frequently require grouping and aggregating data to extract useful information.
Basic Grouping Operations
```python
# Suppose we have a DataFrame of sales data
data = {'Region': ['North', 'South', 'East', 'West', 'North', 'South'],
        'Product': ['A', 'B', 'A', 'C', 'B', 'A'],
        'Sales': [100, 150, 200, 175, 125, 225],
        'Quantity': [10, 15, 20, 17, 12, 22]}
df = pd.DataFrame(data)
# Group by region and compute total sales
region_sales = df.groupby('Region')['Sales'].sum()
print(region_sales)
# Group by region and product and compute total sales
region_product_sales = df.groupby(['Region', 'Product'])['Sales'].sum()
print(region_product_sales)
# Group by region and compute several statistics at once
region_stats = df.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'std'],
    'Quantity': ['sum', 'mean']
})
print(region_stats)
```
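Note that `agg` with multiple functions produces a column MultiIndex (('Sales', 'sum'), ('Sales', 'mean'), ...), which is awkward to merge back into a flat table. A common pattern is to flatten the column names:
```python
# Flatten the ('Sales', 'sum')-style MultiIndex columns into 'Sales_sum', etc.
region_stats.columns = ['_'.join(col) for col in region_stats.columns]
region_stats = region_stats.reset_index()
print(region_stats)
```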
Advanced Grouping Operations
```python
# Use transform for group-wise calculations that keep the original shape
df['Sales_Mean_By_Region'] = df.groupby('Region')['Sales'].transform('mean')
df['Sales_Deviation_From_Mean'] = df['Sales'] - df['Sales_Mean_By_Region']
print(df)

# Use filter to keep or drop whole groups
# Keep regions whose average sales exceed 150
high_sales_regions = df.groupby('Region').filter(lambda x: x['Sales'].mean() > 150)
print(high_sales_regions)

# Use apply for custom group-wise operations
def range_func(x):
    return x.max() - x.min()
region_sales_range = df.groupby('Region')['Sales'].apply(range_func)
print(region_sales_range)

# Use pivot_table to create a pivot table
pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print(pivot)

# Use crosstab to create a cross-tabulation
cross_tab = pd.crosstab(df['Region'], df['Product'], values=df['Sales'], aggfunc='sum', normalize=False)
print(cross_tab)
```
Time-Series Grouping
```python
# Create a DataFrame of daily dates and sales
data = {'Date': pd.date_range(start='2021-01-01', end='2021-12-31', freq='D'),
        'Sales': np.random.randint(100, 1000, size=365)}
df = pd.DataFrame(data)
# Use the date as the index
df.set_index('Date', inplace=True)

# Group by month and compute total sales
# (in pandas >= 2.2 the alias 'ME' is preferred over 'M', and 'QE' over 'Q')
monthly_sales = df.resample('M')['Sales'].sum()
print(monthly_sales)
# Group by quarter and compute average sales
quarterly_avg_sales = df.resample('Q')['Sales'].mean()
print(quarterly_avg_sales)
# Group by week and compute several statistics
weekly_stats = df.resample('W').agg({
    'Sales': ['sum', 'mean', 'std', 'min', 'max']
})
print(weekly_stats.head())

# Use rolling windows to compute moving averages
df['Sales_7D_MA'] = df['Sales'].rolling(window=7).mean()
df['Sales_30D_MA'] = df['Sales'].rolling(window=30).mean()
print(df.head(10))
```
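Closely related to rolling windows are lag and difference features, which let a model see recent history; they are staples of time-series competitions. A sketch on the same daily `df`:
```python
# Lag features: the value 1 day and 7 days earlier
df['Sales_lag_1'] = df['Sales'].shift(1)
df['Sales_lag_7'] = df['Sales'].shift(7)
# Difference feature: day-over-day change
df['Sales_diff_1'] = df['Sales'].diff(1)
print(df.head(10))
```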
7. Data Visualization
Visualization is an essential tool for understanding data and presenting results in competitions. Pandas integrates seamlessly with plotting libraries such as Matplotlib and Seaborn.
Basic Plotting
```python
# Create an example DataFrame
data = {'Year': [2010, 2011, 2012, 2013, 2014, 2015],
        'Sales': [100, 150, 200, 175, 225, 275],
        'Expenses': [80, 90, 100, 110, 120, 130]}
df = pd.DataFrame(data)
# Use Pandas' built-in plotting
# Line plot
df.plot(x='Year', y=['Sales', 'Expenses'], kind='line', figsize=(10, 6))
plt.title('Sales and Expenses Over Time')
plt.ylabel('Amount ($)')
plt.grid(True)
plt.show()
# Bar chart
df.plot(x='Year', y='Sales', kind='bar', figsize=(10, 6))
plt.title('Sales by Year')
plt.ylabel('Sales ($)')
plt.grid(True)
plt.show()
# Scatter plot
df.plot(x='Year', y='Sales', kind='scatter', figsize=(10, 6))
plt.title('Sales by Year')
plt.grid(True)
plt.show()
# Histogram
df['Sales'].plot(kind='hist', bins=5, figsize=(10, 6))
plt.title('Distribution of Sales')
plt.xlabel('Sales ($)')
plt.grid(True)
plt.show()
# Box plot
df[['Sales', 'Expenses']].plot(kind='box', figsize=(10, 6))
plt.title('Distribution of Sales and Expenses')
plt.ylabel('Amount ($)')
plt.grid(True)
plt.show()
```
Advanced Visualization
```python
# Use Seaborn for more advanced visualizations
import seaborn as sns
# Create a more complex example DataFrame
np.random.seed(42)
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=200),
    'Value1': np.random.normal(0, 1, size=200),
    'Value2': np.random.normal(5, 2, size=200),
    'Group': np.random.choice(['Group 1', 'Group 2'], size=200)
}
df = pd.DataFrame(data)
# Count plot of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='Category', hue='Group', data=df)
plt.title('Count by Category and Group')
plt.grid(True)
plt.show()
# Box plot of distributions by category
plt.figure(figsize=(10, 6))
sns.boxplot(x='Category', y='Value1', hue='Group', data=df)
plt.title('Distribution of Value1 by Category and Group')
plt.grid(True)
plt.show()
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='Category', y='Value1', hue='Group', data=df, split=True)
plt.title('Distribution of Value1 by Category and Group')
plt.grid(True)
plt.show()
# Scatter-plot matrix
sns.pairplot(df, hue='Group', vars=['Value1', 'Value2'])
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()
# Heatmap of the correlation matrix
corr = df[['Value1', 'Value2']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
```
Time-Series Visualization
```python
# Create a time-series DataFrame
np.random.seed(42)
date_rng = pd.date_range(start='2020-01-01', end='2021-12-31', freq='D')
ts_data = {
    'Date': date_rng,
    'Value': np.cumsum(np.random.randn(len(date_rng))) + 100
}
df = pd.DataFrame(ts_data)
df.set_index('Date', inplace=True)
# Plot the time series
plt.figure(figsize=(12, 6))
df['Value'].plot()
plt.title('Time Series Data')
plt.ylabel('Value')
plt.grid(True)
plt.show()
# Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform the decomposition
decomposition = seasonal_decompose(df['Value'], model='additive', period=365)
# Plot the components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.suptitle('Seasonal Decomposition of Time Series', y=1.02)
plt.show()
# Plot rolling statistics
rolling_mean = df['Value'].rolling(window=30).mean()
rolling_std = df['Value'].rolling(window=30).std()
plt.figure(figsize=(12, 6))
plt.plot(df['Value'], label='Original')
plt.plot(rolling_mean, label='Rolling Mean (30 days)')
plt.plot(rolling_std, label='Rolling Std (30 days)')
plt.title('Rolling Statistics')
plt.legend()
plt.grid(True)
plt.show()
```
8. Advanced Pandas Techniques
Mastering a few advanced Pandas techniques lets you process data and extract features far more efficiently in competitions.
Performance Optimization
```python
# Create a large DataFrame for demonstration
np.random.seed(42)
large_df = pd.DataFrame({
    'id': range(1, 1000001),
    'value1': np.random.rand(1000000),
    'value2': np.random.rand(1000000),
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], size=1000000)
})

# Use appropriate dtypes to reduce memory usage
print("Memory usage before:", large_df.memory_usage(deep=True).sum() / 1024**2, "MB")
# Downcast numeric columns to smaller dtypes
large_df['value1'] = pd.to_numeric(large_df['value1'], downcast='float')
large_df['value2'] = pd.to_numeric(large_df['value2'], downcast='float')
# Convert the categorical column to the category dtype
large_df['category'] = large_df['category'].astype('category')
print("Memory usage after:", large_df.memory_usage(deep=True).sum() / 1024**2, "MB")

# Iterating over a DataFrame: iterrows() vs. itertuples()
# Note: %%time is a Jupyter cell magic; run each timed snippet in its own cell.
# iterrows() is slow but flexible
%%time
sum_value1 = 0
for index, row in large_df.head(10000).iterrows():
    sum_value1 += row['value1']
print("Sum using iterrows:", sum_value1)

# itertuples() is faster
%%time
sum_value1 = 0
for row in large_df.head(10000).itertuples():
    sum_value1 += row.value1
print("Sum using itertuples:", sum_value1)

# Vectorized operations are faster still
%%time
sum_value1 = large_df['value1'].head(10000).sum()
print("Sum using vectorization:", sum_value1)

# apply() runs a Python function per element -- convenient, but not truly vectorized
%%time
large_df['value1_squared'] = large_df['value1'].apply(lambda x: x**2)
print("Value1 squared using apply")

# The vectorized equivalent is much faster
%%time
large_df['value1_squared_vec'] = large_df['value1']**2
print("Value1 squared using vectorization")
```
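These ideas are often bundled into a reusable downcasting helper in competition notebooks. A minimal sketch (simplified: it only downcasts numeric columns and converts low-cardinality object columns to category; the name `reduce_mem_usage` is just a convention, not a library function):
```python
def reduce_mem_usage(df):
    """Downcast numeric columns and categorize low-cardinality object columns."""
    for col in df.columns:
        col_type = df[col].dtype
        if pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(col_type):
            # Note: downcasting floats can lose precision
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif col_type == object and df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')
    return df

large_df = reduce_mem_usage(large_df)
print("Memory after helper:", large_df.memory_usage(deep=True).sum() / 1024**2, "MB")
```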
Complex Operations
```python
# Create an example DataFrame
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Tags': [['python', 'data'], ['java', 'web'], ['python', 'ml'], ['r', 'stats'], ['python', 'web']]
}
df = pd.DataFrame(data)

# Explode the list column into one row per tag
df_exploded = df.explode('Tags')
print("Exploded DataFrame:")
print(df_exploded)
# Pivot the exploded tags into indicator columns
df_pivoted = df_exploded.pivot_table(index='ID', columns='Tags', aggfunc='size', fill_value=0)
print("\nPivoted DataFrame:")
print(df_pivoted)

# Merging multiple DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})
df3 = pd.DataFrame({'ID': [3, 4, 5], 'Value3': ['P', 'Q', 'R']})
# Inner join
inner_merge = pd.merge(df1, df2, on='ID', how='inner')
print("\nInner merge:")
print(inner_merge)
# Outer join
outer_merge = pd.merge(df1, df2, on='ID', how='outer')
print("\nOuter merge:")
print(outer_merge)
# Left join
left_merge = pd.merge(df1, df2, on='ID', how='left')
print("\nLeft merge:")
print(left_merge)
# Right join
right_merge = pd.merge(df1, df2, on='ID', how='right')
print("\nRight merge:")
print(right_merge)
# Chain merges across three DataFrames
multi_merge = df1.merge(df2, on='ID', how='outer').merge(df3, on='ID', how='outer')
print("\nMulti-way merge:")
print(multi_merge)

# Concatenating DataFrames with concat
concat_vertical = pd.concat([df1, df2, df3], axis=0)
print("\nVertical concatenation:")
print(concat_vertical)
concat_horizontal = pd.concat([df1.set_index('ID'), df2.set_index('ID'), df3.set_index('ID')], axis=1)
print("\nHorizontal concatenation:")
print(concat_horizontal)
```
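When merging competition tables, it is easy to silently duplicate rows if a join key is not as unique as you think. `merge` has two built-in safety nets worth knowing: `validate` asserts the expected key relationship, and `indicator` records where each row came from:
```python
# validate raises MergeError if 'ID' is not unique on both sides;
# indicator adds a '_merge' column with 'left_only' / 'right_only' / 'both'
checked = pd.merge(df1, df2, on='ID', how='outer',
                   validate='one_to_one', indicator=True)
print(checked)
```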
Advanced Indexing and Selection
```python
# Create a DataFrame with a MultiIndex
arrays = [
    ['A', 'A', 'B', 'B', 'C', 'C'],
    [1, 2, 1, 2, 1, 2]
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df_multi = pd.DataFrame(np.random.randn(6, 2), index=index, columns=['Value1', 'Value2'])
print("Multi-index DataFrame:")
print(df_multi)

# Selecting data
# All rows whose first level is 'A'
print("\nRows with first level 'A':")
print(df_multi.loc['A'])
# The row with first level 'A' and second level 1
print("\nRow with first level 'A' and second level 1:")
print(df_multi.loc[('A', 1)])
# Use xs to select across a level
print("\nUsing xs to select second level 1:")
print(df_multi.xs(1, level='second'))

# Swap index levels
df_swapped = df_multi.swaplevel('first', 'second')
print("\nAfter swapping levels:")
print(df_swapped)

# Stacking and unstacking
df_stacked = df_multi.stack()
print("\nStacked DataFrame:")
print(df_stacked.head())
df_unstacked = df_stacked.unstack()
print("\nUnstacked DataFrame:")
print(df_unstacked.head())
```
9. Competition Strategy
Technical command of Pandas alone is not enough for competitions; you also need to know how to apply it effectively to the problem at hand.
Understanding the Problem
```python
# Suppose we have a competition dataset
# First, load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
# Check the shapes
print("Training data shape:", train_df.shape)
print("Test data shape:", test_df.shape)
# Look at the first rows of the training data
print("\nFirst few rows of training data:")
print(train_df.head())
# Check the data types
print("\nData types:")
print(train_df.dtypes)
# Check for missing values
print("\nMissing values in training data:")
print(train_df.isnull().sum())
# Look at the distribution of the target variable
# (assuming the target column is named 'target')
print("\nTarget variable distribution:")
print(train_df['target'].value_counts())
# For a numeric target (e.g., regression), look at its summary statistics
if train_df['target'].dtype in ['int64', 'float64']:
    print("\nTarget variable statistics:")
    print(train_df['target'].describe())

    # Plot the distribution of the target variable
    plt.figure(figsize=(10, 6))
    sns.histplot(train_df['target'], kde=True)
    plt.title('Distribution of Target Variable')
    plt.show()
```
Feature Selection
```python
# Assume data cleaning and feature engineering are done,
# and we now want to select the most informative features.
# For classification problems, chi-square tests, mutual information, etc. can be used
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
# X is the feature matrix, y the target
X = train_df.drop('target', axis=1)
y = train_df['target']
# Select the 10 best features with the chi-square test
# (chi2 requires non-negative feature values)
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("Top 10 features using chi-square test:")
print(selected_features)
# Select features with mutual information
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("\nTop 10 features using mutual information:")
print(selected_features)

# For regression problems, use f_regression, mutual information regression, etc.
from sklearn.feature_selection import f_regression, mutual_info_regression
# Select features with the F statistic
selector = SelectKBest(score_func=f_regression, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("\nTop 10 features using f-regression:")
print(selected_features)
# Select features with mutual information regression
selector = SelectKBest(score_func=mutual_info_regression, k=10)
X_new = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]
print("\nTop 10 features using mutual information regression:")
print(selected_features)

# Model-based feature selection
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# For classification
if y.nunique() < 20:  # heuristic: fewer than 20 distinct values => classification
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1][:10]
    print("\nTop 10 features using Random Forest Classifier:")
    print(X.columns[indices])
else:  # For regression
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X, y)
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1][:10]
    print("\nTop 10 features using Random Forest Regressor:")
    print(X.columns[indices])
# Visualize the feature importances
plt.figure(figsize=(12, 8))
plt.title('Feature Importances')
plt.bar(range(len(indices)), importances[indices], align='center')  # len(indices), not X.shape[1]
plt.xticks(range(len(indices)), X.columns[indices], rotation=90)
plt.tight_layout()
plt.show()
```
Cross-Validation and Model Evaluation
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, mean_squared_error, make_scorer

# For classification
if y.nunique() < 20:  # heuristic: fewer than 20 distinct values => classification
    # Use stratified K-fold cross-validation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    # Random forest classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # Cross-validated accuracy
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    print("Cross-validation accuracy scores:", scores)
    print("Mean accuracy:", scores.mean())
    print("Standard deviation:", scores.std())

    # Other evaluation metrics
    scores = cross_val_score(model, X, y, cv=cv, scoring='f1_weighted')
    print("\nCross-validation F1 scores:", scores)
    print("Mean F1:", scores.mean())

    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc_ovr')
    print("\nCross-validation AUC scores:", scores)
    print("Mean AUC:", scores.mean())

else:  # For regression
    # Use K-fold cross-validation
    cv = KFold(n_splits=5, shuffle=True, random_state=42)

    # Random forest regressor
    model = RandomForestRegressor(n_estimators=100, random_state=42)

    # Cross-validated RMSE (scikit-learn returns negative MSE)
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    print("Cross-validation RMSE scores:", rmse_scores)
    print("Mean RMSE:", rmse_scores.mean())
    print("Standard deviation:", rmse_scores.std())

    # Mean absolute error
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    mae_scores = -scores
    print("\nCross-validation MAE scores:", mae_scores)
    print("Mean MAE:", mae_scores.mean())

    # R² score
    scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
    print("\nCross-validation R² scores:", scores)
    print("Mean R²:", scores.mean())
```
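Beyond scoring, competition workflows usually also need out-of-fold (OOF) predictions, where each training row is predicted by a model that never saw it, both for honest evaluation and as inputs for stacking. scikit-learn's cross_val_predict gives a minimal version; a sketch for the classification branch above, reusing its `model` and `cv`:
```python
from sklearn.model_selection import cross_val_predict

# Each entry of oof_pred comes from the fold in which that row was held out
oof_pred = cross_val_predict(model, X, y, cv=cv)
print("OOF accuracy:", accuracy_score(y, oof_pred))
```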
Model Ensembling
```python
from sklearn.ensemble import VotingClassifier, VotingRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# For classification
if y.nunique() < 20:  # heuristic: fewer than 20 distinct values => classification
    # Define several models
    model1 = RandomForestClassifier(n_estimators=100, random_state=42)
    model2 = LogisticRegression(max_iter=1000, random_state=42)
    model3 = SVC(probability=True, random_state=42)
    model4 = KNeighborsClassifier(n_neighbors=5)
    model5 = DecisionTreeClassifier(random_state=42)

    # Build a voting classifier
    voting_clf = VotingClassifier(
        estimators=[
            ('rf', model1),
            ('lr', model2),
            ('svc', model3),
            ('knn', model4),
            ('dt', model5)
        ],
        voting='soft'  # vote with predicted probabilities
    )

    # Train the ensemble
    voting_clf.fit(X_train, y_train)

    # Evaluate on the validation set
    y_pred = voting_clf.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print("Voting Classifier Accuracy:", accuracy)

    # Compare the individual models
    for name, model in voting_clf.named_estimators_.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)
        print(f"{name} Accuracy:", accuracy)

else:  # For regression
    # Define several models
    model1 = RandomForestRegressor(n_estimators=100, random_state=42)
    model2 = LinearRegression()
    model3 = SVR()
    model4 = KNeighborsRegressor(n_neighbors=5)
    model5 = DecisionTreeRegressor(random_state=42)

    # Build a voting regressor
    voting_reg = VotingRegressor(
        estimators=[
            ('rf', model1),
            ('lr', model2),
            ('svr', model3),
            ('knn', model4),
            ('dt', model5)
        ]
    )

    # Train the ensemble
    voting_reg.fit(X_train, y_train)

    # Evaluate on the validation set
    y_pred = voting_reg.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print("Voting Regressor RMSE:", rmse)

    # Compare the individual models
    for name, model in voting_reg.named_estimators_.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        print(f"{name} RMSE:", rmse)
```
10. A Complete Worked Example
Let's walk through a full example of using Pandas in a data analysis competition, based on the Kaggle "Titanic: Machine Learning from Disaster" dataset.
Data Loading and Initial Exploration
```python
# Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
# Check the shapes
print("Training data shape:", train_df.shape)
print("Test data shape:", test_df.shape)
# First rows of the training data
print("\nFirst few rows of training data:")
print(train_df.head())
# Data types
print("\nData types:")
print(train_df.dtypes)
# Missing values
print("\nMissing values in training data:")
print(train_df.isnull().sum())
# Distribution of the target variable
print("\nSurvived distribution:")
print(train_df['Survived'].value_counts())
# Survival rate by sex
print("\nSurvival rate by sex:")
print(train_df.groupby('Sex')['Survived'].mean())
# Survival rate by passenger class
print("\nSurvival rate by Pclass:")
print(train_df.groupby('Pclass')['Survived'].mean())
# Survival rate by port of embarkation
print("\nSurvival rate by Embarked:")
print(train_df.groupby('Embarked')['Survived'].mean())
```
Data Cleaning and Feature Engineering
```python
# Combine train and test so preprocessing is applied consistently
train_df['Dataset'] = 'Train'
test_df['Dataset'] = 'Test'
test_df['Survived'] = np.nan  # the test set has no Survived column; add it as NaN
combined_df = pd.concat([train_df, test_df], ignore_index=True)

# Handle missing values
# Fill missing Age with the median (assign back instead of inplace=True,
# which triggers chained-assignment warnings on a column selection)
combined_df['Age'] = combined_df['Age'].fillna(combined_df['Age'].median())
# Fill missing Fare with the median
combined_df['Fare'] = combined_df['Fare'].fillna(combined_df['Fare'].median())
# Fill missing Embarked with the mode
combined_df['Embarked'] = combined_df['Embarked'].fillna(combined_df['Embarked'].mode()[0])
# Cabin is mostly missing, so create a feature indicating whether it was recorded
combined_df['Has_Cabin'] = combined_df['Cabin'].notna().astype(int)

# Extract the title from Name (raw string avoids an invalid-escape warning)
combined_df['Title'] = combined_df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Consolidate rare titles
combined_df['Title'] = combined_df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
combined_df['Title'] = combined_df['Title'].replace('Mlle', 'Miss')
combined_df['Title'] = combined_df['Title'].replace('Ms', 'Miss')
combined_df['Title'] = combined_df['Title'].replace('Mme', 'Mrs')

# Family size feature
combined_df['FamilySize'] = combined_df['SibSp'] + combined_df['Parch'] + 1
# Traveling-alone feature
combined_df['IsAlone'] = (combined_df['FamilySize'] == 1).astype(int)
# Age bins
combined_df['AgeBin'] = pd.cut(combined_df['Age'], bins=[0, 12, 20, 40, 60, 100], labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
# Fare bins
combined_df['FareBin'] = pd.qcut(combined_df['Fare'], q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Encode categorical variables as numbers
combined_df['Sex_Code'] = combined_df['Sex'].map({'male': 0, 'female': 1})
combined_df['Embarked_Code'] = combined_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
combined_df['Title_Code'] = combined_df['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})
combined_df['AgeBin_Code'] = combined_df['AgeBin'].cat.codes
combined_df['FareBin_Code'] = combined_df['FareBin'].cat.codes

# Inspect the processed data
print("\nProcessed data:")
print(combined_df.head())
# Missing values after processing
print("\nMissing values after processing:")
print(combined_df.isnull().sum())
```
Feature Selection and Model Training
```python
# Choose the features
features = ['Pclass', 'Sex_Code', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Code',
            'Has_Cabin', 'Title_Code', 'FamilySize', 'IsAlone', 'AgeBin_Code', 'FareBin_Code']
# Split the combined data back into train and test
train_df = combined_df[combined_df['Dataset'] == 'Train']
test_df = combined_df[combined_df['Dataset'] == 'Test']
X = train_df[features]
y = train_df['Survived'].astype(int)  # back to integer labels (the concat made the column float)
X_test = test_df[features]
# Hold out part of the training data for validation
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a random forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_model.fit(X_train, y_train)
# Evaluate on the validation set
from sklearn.metrics import accuracy_score
y_pred = rf_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Random Forest Validation Accuracy:", accuracy)
# Inspect the feature importances
importances = rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 8))
plt.title('Feature Importances')
plt.bar(range(len(features)), importances[indices], align='center')
plt.xticks(range(len(features)), [features[i] for i in indices], rotation=90)
plt.tight_layout()
plt.show()
# Train several models for comparison
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVC': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Naive Bayes': GaussianNB()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    print(f"{name} Validation Accuracy:", accuracy)
```
Model Tuning and Prediction
```python
# Tune hyperparameters with grid search
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Build the grid search
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
# Run the search
grid_search.fit(X_train, y_train)
# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
# Evaluate the best model on the validation set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print("Tuned Random Forest Validation Accuracy:", accuracy)
# Predict on the test set
test_predictions = best_model.predict(X_test)
# Build the submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_predictions.astype(int)
})
# Save it
submission.to_csv('submission.csv', index=False)
print("\nSubmission file created:")
print(submission.head())
```
11. Summary and Recommended Resources
Summary
This tutorial has walked through how to use the Pandas library in data analysis competitions, from basic concepts to advanced techniques, covering data loading, cleaning, transformation, visualization, feature engineering, model training, and evaluation. After working through it, you should be able to:
1. Load and explore data fluently with Pandas
2. Apply data cleaning and preprocessing techniques
3. Create effective features and perform feature selection
4. Visualize and analyze data with Pandas
5. Apply advanced Pandas techniques to work more efficiently
6. Develop sound competition strategies and model evaluation methods
7. Put all of this into practice through a complete worked example
Remember: in data analysis competitions there is no one-size-fits-all method. Every problem is unique and calls for adapting your approach to the characteristics of the data and the requirements of the task. Continuous learning and practice are the keys to improving.
Recommended Resources
Official Documentation
• Pandas official documentation
• NumPy official documentation
• Scikit-learn official documentation
Books
• "Python for Data Analysis" by Wes McKinney (the creator of Pandas)
• "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
• "Data Science from Scratch" by Joel Grus
Online Courses
• Kaggle Learn
• "Applied Data Science with Python" on Coursera
• "Python for Data Science and Machine Learning Bootcamp" on Udemy
Competition Platforms
• Kaggle
• Tianchi
• DrivenData
• Zindi
Communities and Blogs
• Towards Data Science (Medium)
• KDnuggets
• DataCamp Community
GitHub Repositories
• Pandas Examples
• Awesome Pandas
• Kaggle Kernels
With these resources you can continue to deepen your Pandas skills and achieve better results in data analysis competitions. Good luck on your data science journey!