Python数据分析-异常数据处理


导入pandas和numpy

import pandas as pd
import numpy as np

1.导入数据并查看属性

data = pd.read_csv("data/Online_Retail_Fake.csv")  # 导入数据
data.shape # 查看shape属性

outputs:

(541910, 8)
data.size # 查看size属性

outputs:

4335280
data  # 查看数据

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 NaN 17850.0 United Kingdom
2 536365 84406B NaN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

data.info() # 查看元数据信息

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541910 entries, 0 to 541909
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541910 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540454 non-null  object 
 3   Quantity     541910 non-null  int64  
 4   InvoiceDate  541910 non-null  object 
 5   UnitPrice    541907 non-null  float64
 6   CustomerID   406828 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

2.异常数据处理

2.1空值

data.isnull().sum() # 查看是否有空值
InvoiceNo           0
StockCode           1
Description      1456
Quantity            0
InvoiceDate         0
UnitPrice           3
CustomerID     135082
Country             1
dtype: int64

可以看到各个属性列的空值个数

2.2 重复数据

data.duplicated()

output:

0         False
1         False
2         False
3         False
4         False
          ...  
541905    False
541906    False
541907    False
541908    False
541909    False
Length: 541910, dtype: bool
data.duplicated().sum()

outputs:

5268

可以看出有较多重复数据

data[data.duplicated()].head()

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
517 536409 21866 UNION JACK FLAG LUGGAGE TAG 1 2010/12/1 11:45 1.25 17908.0 United Kingdom
527 536409 22866 HAND WARMER SCOTTY DOG DESIGN 1 2010/12/1 11:45 2.10 17908.0 United Kingdom
537 536409 22900 SET 2 TEA TOWELS I LOVE LONDON 1 2010/12/1 11:45 2.95 17908.0 United Kingdom
539 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom
555 536412 22327 ROUND SNACK BOXES SET OF 4 SKULLS 1 2010/12/1 11:49 2.95 17920.0 United Kingdom

可以看出InvoiceNo为536409的信息较多,需要单独分析

查看InvoiceNo为536409的数据

data[data["InvoiceNo"] == "536409"]

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
483 536409 90199C 5 STRAND GLASS NECKLACE CRYSTAL 3 2010/12/1 11:45 6.35 17908.0 United Kingdom
484 536409 21479 WHITE SKULL HOT WATER BOTTLE 1 2010/12/1 11:45 3.75 17908.0 United Kingdom
485 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom
486 536409 22785 SQUARECUSHION COVER PINK UNION FLAG 1 2010/12/1 11:45 6.75 17908.0 United Kingdom
... ... ... ... ... ... ... ...
537 536409 22900 SET 2 TEA TOWELS I LOVE LONDON 1 2010/12/1 11:45 2.95 17908.0 United Kingdom
538 536409 21980 PACK OF 12 RED RETROSPOT TISSUES 12 2010/12/1 11:45 0.29 17908.0 United Kingdom
539 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom
540 536409 22112 CHOCOLATE HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom
data[data.duplicated(keep=False)][data[data.duplicated(keep=False)]["InvoiceNo"] == "536409"]

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
485 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom
489 536409 22866 HAND WARMER SCOTTY DOG DESIGN 1 2010/12/1 11:45 2.10 17908.0 United Kingdom
494 536409 21866 UNION JACK FLAG LUGGAGE TAG 1 2010/12/1 11:45 1.25 17908.0 United Kingdom
517 536409 21866 UNION JACK FLAG LUGGAGE TAG 1 2010/12/1 11:45 1.25 17908.0 United Kingdom
521 536409 22900 SET 2 TEA TOWELS I LOVE LONDON 1 2010/12/1 11:45 2.95 17908.0 United Kingdom
527 536409 22866 HAND WARMER SCOTTY DOG DESIGN 1 2010/12/1 11:45 2.10 17908.0 United Kingdom
537 536409 22900 SET 2 TEA TOWELS I LOVE LONDON 1 2010/12/1 11:45 2.95 17908.0 United Kingdom
539 536409 22111 SCOTTIE DOG HOT WATER BOTTLE 1 2010/12/1 11:45 4.95 17908.0 United Kingdom

可直接获取如上重复数据

2.3 异常处理

2.3.1 空值处理

data.dropna()

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2010/12/1 8:26 7.65 17850.0 United Kingdom
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2010/12/1 8:26 4.25 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

406825 rows × 8 columns

如上实现某一行有一个空值则去掉

data.dropna(how="all")

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 NaN 17850.0 United Kingdom
2 536365 84406B NaN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

如上实现某一行全为空值才去掉

data.dropna(axis=1)

outputs:

InvoiceNo Quantity InvoiceDate
0 536365 6 2010/12/1 8:26
1 536365 6 2010/12/1 8:26
2 536365 8 2010/12/1 8:26
3 536365 6 2010/12/1 8:26
4 536365 6 2010/12/1 8:26
... ... ... ...
541905 581587 6 2011/12/9 12:50
541906 581587 4 2011/12/9 12:50
541907 581587 4 2011/12/9 12:50
541908 581587 3 2011/12/9 12:50
541909 581587 3 2011/12/9 12:50

541910 rows × 3 columns

如上实现按列去除

data.dropna(axis=1, how="all")

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 NaN 17850.0 United Kingdom
2 536365 84406B NaN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

如上实现按列去除全为空值的列

data.fillna(value=0)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 0.00 17850.0 United Kingdom
2 536365 84406B 0 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

如上实现空值填充为0

data.fillna(method="ffill")

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
2 536365 84406B WHITE METAL LANTERN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

如上实现空值填充为上面的数据

data.fillna(method="bfill")

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 2.75 17850.0 United Kingdom
2 536365 84406B KNITTED UNION FLAG HOT WATER BOTTLE 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

如上实现空值填充为下面的数据

2.3.2 重复值处理

data.drop_duplicates()

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 NaN 17850.0 United Kingdom
2 536365 84406B NaN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

去除重复数据

data

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 NaN 17850.0 United Kingdom
2 536365 84406B NaN 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

data = data.fillna(method="bfill")
data

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 2.75 17850.0 United Kingdom
2 536365 84406B KNITTED UNION FLAG HOT WATER BOTTLE 8 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 2011/12/9 12:50 4.15 12680.0 France
541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 2011/12/9 12:50 4.15 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France
541909 581587 22138 Wrong booking 3 2011/12/9 12:50 4.95 12680.0 France

541910 rows × 8 columns

data.drop_duplicates(["InvoiceNo", "UnitPrice"])

outputs:

InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010/12/1 8:26 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010/12/1 8:26 3.39 17850.0 United Kingdom
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2010/12/1 8:26 7.65 17850.0 United Kingdom
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2010/12/1 8:26 4.25 17850.0 United Kingdom
... ... ... ... ... ... ... ... ...
541897 581587 22728 ALARM CLOCK BAKELIKE PINK 4 2011/12/9 12:50 3.75 12680.0 France
541903 581587 23256 CHILDRENS CUTLERY SPACEBOY 4 2011/12/9 12:50 4.15 12680.0 France
541904 581587 22613 PACK OF 20 SPACEBOY NAPKINS 12 2011/12/9 12:50 0.85 12680.0 France
541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 2011/12/9 12:50 2.10 12680.0 France
541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 2011/12/9 12:50 4.95 12680.0 France

204727 rows × 8 columns

如上实现去除 InvoiceNoUnitPrice相同的数据


文章作者: 易安
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 易安 !
评论
  目录