先来个小测试

1
2
3
4
import pandas as pd


print("Setup complete.")
Setup complete.

1.

In the cell below, create a DataFrame fruits that looks like this:

1
2
3
# Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruits.
fruits = pd.DataFrame({'Apples': [30], 'Bananas': [21]})
print(fruits)
   Apples  Bananas
0      30       21

2.

Create a dataframe fruit_sales that matches the diagram below:

1
2
3
# Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruit_sales.
fruit_sales = pd.DataFrame({'Apples': [35,41], 'Bananas': [21,34]},index= ['2017 Sales','2018 Sales'])
print(fruit_sales)
            Apples  Bananas
2017 Sales      35       21
2018 Sales      41       34

3.

Create a variable ingredients with a Series that looks like:

1
2
3
4
5
Flour     4 cups
Milk 1 cup
Eggs 2 large
Spam 1 can
Name: Dinner, dtype: object
1
2
ingredients = pd.Series(['4 cups','1 cup', '2 large', '1 can'],index=['Flour','Milk','Eggs','Spam'],name='Dinner',dtype=object)
ingredients
Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

4.

Read the following csv dataset of wine reviews into a DataFrame called reviews:

The filepath to the csv file is ../input/wine-reviews/winemag-data_first150k.csv. The first few lines look like:

1
2
3
,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,"This tremendous 100% varietal wine[...]",Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and[...]",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
1
reviews = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv',index_col=0)

reviews = pd.read_csv(‘../input/wine-reviews/winemag-data_first150k.csv’,index_col=0)

1
2
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals

Cows Goats
Year 1 12 22
Year 2 20 19

In the cell below, write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv.

1
animals.to_csv('cows_and_goats.csv')
1
2
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)
reviews

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

我们将数据当对象来访问

1
reviews.country
0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

如果我们有一个 Python 字典,我们可以使用索引 ([]) 运算符访问它的值。我们可以对 DataFrame 中的列执行相同的操作:

1
reviews['country']
0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

这是从 DataFrame 中选择特定 Series 的两种方法。这两种方法在语法上都不太有效,但索引操作符[]的优点是可以处理列名中的保留字符(例如,如果我们有一个country providence列,reviews.country providence就不起作用)。

pandas 系列看起来不像一本精美的dictionary吗?几乎是这样,所以毫不奇怪,要深入到单个特定值,我们只需要再次使用索引运算符 [] :

1
reviews['country'][0]
'Italy'

Indexing in pandas

索引运算符和属性选择很好,因为它们就像在 Python 生态系统的其它部分一样工作。作为新手,这使得它们很容易上手和使用。然而,pandas有自己的访问操作符,loc和iloc。对于更高级的操作,你应该使用它们。

Index-based selection

Pandas 索引以两种范式之一工作。第一种是基于索引的选择:根据数据在数据中的数字位置来选择数据。iloc遵循这一范例。

要选择 DataFrame 中的第一行数据,我们可以使用以下命令:

1
reviews.iloc[0]
country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O’Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

loc 和 iloc 都是行第一、列第二。这与我们在原生 Python 中所做的相反,即列在前,行在后。

这意味着检索行稍微容易一些,而检索列稍微困难一些。要使用 iloc 获取列,我们可以执行以下操作:

1
reviews.iloc[:, 0]
0            Italy
1         Portugal
2               US
3               US
4               US
            ...   
129966     Germany
129967          US
129968      France
129969      France
129970      France
Name: country, Length: 129971, dtype: object

就其本身而言,:操作符(也来自本地Python)表示 “一切”。然而,当与其他选择器组合时,它可以用来表示值的范围。例如,要从第一、第二和第三行中选择country列,我们可以这样做

1
reviews.iloc[:3, 0]
0       Italy
1    Portugal
2          US
Name: country, dtype: object
1
reviews.iloc[999:-99999, 0]
999             US
1000            US
1001         Italy
1002            US
1003        France
           ...    
29967       France
29968           US
29969    Argentina
29970           US
29971           US
Name: country, Length: 28973, dtype: object

It’s also possible to pass a list:

1
reviews.iloc[[0, 1, 2], 0]
0       Italy
1    Portugal
2          US
Name: country, dtype: object

最后,值得注意的是,负数可以用于选择。这将从数值的末尾开始向前计数。例如,这里是数据集的最后五个元素。

1
reviews.iloc[-5:]

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

Label-based selection

属性选择的第二种模式是loc操作符所遵循的模式:基于标签的选择。在这种模式下,重要的是数据索引值,而不是它的位置。

例如,要获得reviews中的第一个条目,我们现在要做如下操作:

1
reviews.loc[0, 'country']
'Italy'

iloc在概念上比loc简单,因为它忽略了数据集的索引。当我们使用 iloc 时,我们把数据集当作一个大矩阵(一个列表的列表),一个我们必须按位置进行索引的矩阵。由于您的数据集通常具有有意义的索引,因此使用loc通常会更容易。例如,这里有一个使用loc更容易的操作:

1
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

taster_name taster_twitter_handle points
0 Kerin O’Keefe @kerinokeefe 87
1 Roger Voss @vossroger 87
2 Paul Gregutt @paulgwine 87
3 Alexander Peartree NaN 87
4 Paul Gregutt @paulgwine 87
... ... ... ...
129966 Anna Lee C. Iijima NaN 90
129967 Paul Gregutt @paulgwine 90
129968 Roger Voss @vossroger 90
129969 Roger Voss @vossroger 90
129970 Roger Voss @vossroger 90

129971 rows × 3 columns

Choosing between loc and iloc

在 loc 和 iloc 之间选择或转换时,有一个 “小问题 “值得注意,那就是这两种方法使用的索引方案略有不同。

iloc 使用 Python stdlib 索引方案,其中包含范围的第一个元素,而排除最后一个元素。因此,0:10 将选择条目 0,……9。而loc包含最后一个元素,因此0:10将选择0,…,10。

为什么会有这样的变化?请记住,loc可以索引任何stdlib类型:例如字符串。如果我们有一个索引值为 Apples, …, Potatoes, … 的 DataFrame,并且我们想选择 “所有按字母顺序排列的苹果和土豆之间的水果选择”,那么索引 df.loc[‘Apples’:’Potatoes’] 要比索引 df.loc[‘Apples’, ‘Potatoet’] 这样的东西方便得多 (t 在字母表的 s 后面)。

当DataFrame索引是一个简单的数字列表(例如0,…,1000)时,这种情况尤其令人困惑。在这种情况下,df.iloc[0:1000]将返回1000个条目,而df.loc[0:1000]将返回1001个条目!要使用 loc 得到 1000 个元素,您需要再低一级,请求 df.loc[0:999]。

否则,使用loc的语义和使用iloc的语义是一样的。

Manipulating the index

基于标签的选择功能来自索引中的标签。重要的是,我们使用的索引并非一成不变。我们可以以任何我们认为合适的方式操作索引。

我们可以使用set_index()方法来完成这项工作。下面是当我们将set_index设置为title字段时发生的情况:

1
reviews.set_index("title")

country description designation points price province region_1 region_2 taster_name taster_twitter_handle variety winery
title
Nicosia 2013 Vulkà Bianco (Etna) Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe White Blend Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro) Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Portuguese Red Quinta dos Avidagos
Rainstorm 2013 Pinot Gris (Willamette Valley) US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Pinot Gris Rainstorm
St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore) US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN Riesling St. Julian
Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley) US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ...
Dr. H. Thanisch (Erben Müller-Burggraef) 2013 Brauneberger Juffer-Sonnenuhr Spätlese Riesling (Mosel) Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
Citation 2004 Pinot Noir (Oregon) US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Pinot Noir Citation
Domaine Gresser 2013 Kritt Gewurztraminer (Alsace) France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Gewürztraminer Domaine Gresser
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Pinot Gris Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace) France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Gewürztraminer Domaine Schoffit

129971 rows × 12 columns

如果您可以为数据集找到一个比当前索引更好的索引,那么这将非常有用。

Conditional selection

到目前为止,我们已经使用 DataFrame 本身的结构属性对各种数据进行索引。然而,为了利用数据做有趣的事情,我们经常需要根据条件提出问题。

例如,假设我们对意大利生产的优于平均水平的葡萄酒特别感兴趣。

我们可以首先检查每种葡萄酒是否是意大利的:

1
reviews.country == 'Italy'
0          True
1         False
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

此操作根据每条记录的国家/地区生成一系列 True/False 布尔值。然后可以在 loc 内部使用此结果来选择相关数据:

1
reviews.loc[reviews.country == 'Italy']

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
13 Italy This is dominated by oak and oak-driven aromas... Rosso 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Masseria Setteporte 2012 Rosso (Etna) Nerello Mascalese Masseria Setteporte
22 Italy Delicate aromas recall white flower and citrus... Ficiligno 87 19.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Baglio di Pianetto 2007 Ficiligno White (Sicilia) White Blend Baglio di Pianetto
24 Italy Aromas of prune, blackcurrant, toast and oak c... Aynat 87 35.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Canicattì 2009 Aynat Nero d'Avola (Sicilia) Nero d'Avola Canicattì
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129929 Italy This luminous sparkler has a sweet, fruit-forw... NaN 91 38.0 Veneto Prosecco Superiore di Cartizze NaN NaN NaN Col Vetoraz Spumanti NV Prosecco Superiore di... Prosecco Col Vetoraz Spumanti
129943 Italy A blend of Nero d'Avola and Syrah, this convey... Adènzia 90 29.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... Red Blend Baglio del Cristo di Campobello
129947 Italy A blend of 65% Cabernet Sauvignon, 30% Merlot ... Symposio 90 20.0 Sicily & Sardinia Terre Siciliane NaN Kerin O’Keefe @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... Red Blend Feudo Principi di Butera
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

19540 rows × 13 columns

该DataFrame有~20,000行。原来的数据约为130,000行。这意味着大约15%的葡萄酒来自意大利。

我们还想知道哪些葡萄酒比平均水平好。葡萄酒的评价标准为80-100分,因此这可能意味着至少获得90分的葡萄酒。

我们可以用”&”将这两个问题联系起来:

1
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
120 Italy Slightly backward, particularly given the vint... Bricco Rocche Prapó 92 70.0 Piedmont Barolo NaN NaN NaN Ceretto 2003 Bricco Rocche Prapó (Barolo) Nebbiolo Ceretto
130 Italy At the first it was quite muted and subdued, b... Bricco Rocche Brunate 91 70.0 Piedmont Barolo NaN NaN NaN Ceretto 2003 Bricco Rocche Brunate (Barolo) Nebbiolo Ceretto
133 Italy Einaudi's wines have been improving lately, an... NaN 91 68.0 Piedmont Barolo NaN NaN NaN Poderi Luigi Einaudi 2003 Barolo Nebbiolo Poderi Luigi Einaudi
135 Italy The color is just beginning to show signs of b... Sorano 91 60.0 Piedmont Barolo NaN NaN NaN Giacomo Ascheri 2001 Sorano (Barolo) Nebbiolo Giacomo Ascheri
140 Italy A big, fat, luscious wine with plenty of toast... Costa Bruna 90 26.0 Piedmont Barbera d'Alba NaN NaN NaN Poderi Colla 2005 Costa Bruna (Barbera d'Alba) Barbera Poderi Colla
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129929 Italy This luminous sparkler has a sweet, fruit-forw... NaN 91 38.0 Veneto Prosecco Superiore di Cartizze NaN NaN NaN Col Vetoraz Spumanti NV Prosecco Superiore di... Prosecco Col Vetoraz Spumanti
129943 Italy A blend of Nero d'Avola and Syrah, this convey... Adènzia 90 29.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... Red Blend Baglio del Cristo di Campobello
129947 Italy A blend of 65% Cabernet Sauvignon, 30% Merlot ... Symposio 90 20.0 Sicily & Sardinia Terre Siciliane NaN Kerin O’Keefe @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... Red Blend Feudo Principi di Butera
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

6648 rows × 13 columns

Pandas内置了一些条件选择器,我们在此重点介绍其中的两个。

第一个是 “isin”。”isin “让您选择其值 “位于 “一个值列表中的数据。例如,我们可以用它来选择仅来自意大利或法国的葡萄酒:

1
reviews.loc[reviews.country.isin(['Italy', 'France'])]

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
7 France This dry and restrained wine offers spice in p... NaN 87 24.0 Alsace Alsace NaN Roger Voss @vossroger Trimbach 2012 Gewurztraminer (Alsace) Gewürztraminer Trimbach
9 France This has great depth of flavor with its fresh ... Les Natures 87 27.0 Alsace Alsace NaN Roger Voss @vossroger Jean-Baptiste Adam 2012 Les Natures Pinot Gris... Pinot Gris Jean-Baptiste Adam
11 France This is a dry wine, very spicy, with a tight, ... NaN 87 30.0 Alsace Alsace NaN Roger Voss @vossroger Leon Beyer 2012 Gewurztraminer (Alsace) Gewürztraminer Leon Beyer
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129964 France Initially quite muted, this wine slowly develo... Domaine Saint-Rémy Herrenweg 90 NaN Alsace Alsace NaN Roger Voss @vossroger Domaine Ehrhart 2013 Domaine Saint-Rémy Herren... Gewürztraminer Domaine Ehrhart
129965 France While it's rich, this beautiful dry wine also ... Seppi Landmann Vallée Noble 90 28.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Rieflé-Landmann 2013 Seppi Landmann Va... Pinot Gris Domaine Rieflé-Landmann
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

41633 rows × 13 columns

第二个是 isnull (及其同伴 notnull)。这些方法可让您突出显示空(或非空)(NaN) 的值。例如,要过滤掉数据集中缺少价格标签的葡萄酒,我们将执行以下操作:

1
reviews.loc[reviews.price.notnull()]

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
5 Spain Blackberry and raspberry aromas show a typical... Ars In Vitro 87 15.0 Northern Spain Navarra NaN Michael Schachner @wineschach Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot Tandem
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

120975 rows × 13 columns

分配数据

另一方面,将数据分配给 DataFrame 很容易。可以分配一个常量值:

1
2
reviews['critic'] = 'everyone'
reviews['critic']
0         everyone
1         everyone
2         everyone
3         everyone
4         everyone
            ...   
129966    everyone
129967    everyone
129968    everyone
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

或者使用可迭代的值:

1
2
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']
0         129971
1         129970
2         129969
3         129968
4         129967
           ...  
129966         5
129967         4
129968         3
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64

TEST

1
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)
1
reviews.head()

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

1.从description中选择描述列,并将结果分配给变量 desc。

1
2
desc = reviews['description']
desc
0         Aromas include tropical fruit, broom, brimston...
1         This is ripe and fruity, a wine that is smooth...
2         Tart and snappy, the flavors of lime flesh and...
3         Pineapple rind, lemon pith and orange blossom ...
4         Much like the regular bottling from 2012, this...
                                ...                        
129966    Notes of honeysuckle and cantaloupe sweeten th...
129967    Citation is given as much as a decade of bottl...
129968    Well-drained gravel soil gives this wine its c...
129969    A dry style of Pinot Gris, this is crisp with ...
129970    Big, rich and off-dry, this is powered by inte...
Name: description, Length: 129971, dtype: object

2.从评论的描述列中选择第一个值,将其分配给变量first_description。

1
2
first_description = desc[0]
first_description
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."

3.从reviews中选择第一行数据(第一条记录),并将其分配给变量first_row。

1
2
first_row = reviews.loc[0]
first_row
country                                                              Italy
description              Aromas include tropical fruit, broom, brimston...
designation                                                   Vulkà Bianco
points                                                                  87
price                                                                  NaN
province                                                 Sicily & Sardinia
region_1                                                              Etna
region_2                                                               NaN
taster_name                                                  Kerin O’Keefe
taster_twitter_handle                                         @kerinokeefe
title                                    Nicosia 2013 Vulkà Bianco  (Etna)
variety                                                        White Blend
winery                                                             Nicosia
Name: 0, dtype: object

4.从评论中的描述列中选择前 10 个值,将结果分配给变量first_descriptions。

提示:将输出格式化为 pandas 系列。

1
2
first_descriptions = reviews['description'].loc[0:9]
first_descriptions
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
3    Pineapple rind, lemon pith and orange blossom ...
4    Much like the regular bottling from 2012, this...
5    Blackberry and raspberry aromas show a typical...
6    Here's a bright, informal red that opens with ...
7    This dry and restrained wine offers spice in p...
8    Savory dried thyme notes accent sunnier flavor...
9    This has great depth of flavor with its fresh ...
Name: description, dtype: object

5.Select the records with index labels 1, 2, 3, 5, and 8, assigning the result to the variable sample_reviews.

In other words, generate the following DataFrame:

1
2
sample_reviews = reviews.loc[[1,2,3,5,8]]
sample_reviews

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
5 Spain Blackberry and raspberry aromas show a typical... Ars In Vitro 87 15.0 Northern Spain Navarra NaN Michael Schachner @wineschach Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... Tempranillo-Merlot Tandem
8 Germany Savory dried thyme notes accent sunnier flavor... Shine 87 12.0 Rheinhessen NaN NaN Anna Lee C. Iijima NaN Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... Gewürztraminer Heinz Eifel

创建一个变量 df,其中包含索引标签为 0、1、10 和 100 的记录的 Country、Province、Region_1 和 Region_2 列。换句话说,生成以下 DataFrame:

1
2
df = reviews[['country','province','region_1','region_2']].loc[[0,1,10,100]]
df

country province region_1 region_2
0 Italy Sicily & Sardinia Etna NaN
1 Portugal Douro NaN NaN
10 US California Napa Valley Napa
100 US New York Finger Lakes Finger Lakes

7.Create a variable df containing the country and variety columns of the first 100 records.

Hint: you may use loc or iloc. When working on the answer this question and the several of the ones that follow, keep the following “gotcha” described in the tutorial:

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded.
loc, meanwhile, indexes inclusively.

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.iloc[0:999].

1
2
df = reviews[['country','variety']].loc[0:99]
df

country variety
0 Italy White Blend
1 Portugal Portuguese Red
2 US Pinot Gris
3 US Riesling
4 US Pinot Noir
... ... ...
95 France Gamay
96 France Gamay
97 US Riesling
98 Italy Sangiovese
99 US Bordeaux-style Red Blend

100 rows × 2 columns

8.创建一个包含意大利葡萄酒评论的 DataFrame italian_wines。提示:reviews.country 等于什么?

1
2
italian_wines = reviews[reviews['country']=='Italy']
italian_wines

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
6 Italy Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria NaN Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato Terre di Giurfo
13 Italy This is dominated by oak and oak-driven aromas... Rosso 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Masseria Setteporte 2012 Rosso (Etna) Nerello Mascalese Masseria Setteporte
22 Italy Delicate aromas recall white flower and citrus... Ficiligno 87 19.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Baglio di Pianetto 2007 Ficiligno White (Sicilia) White Blend Baglio di Pianetto
24 Italy Aromas of prune, blackcurrant, toast and oak c... Aynat 87 35.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Canicattì 2009 Aynat Nero d'Avola (Sicilia) Nero d'Avola Canicattì
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129929 Italy This luminous sparkler has a sweet, fruit-forw... NaN 91 38.0 Veneto Prosecco Superiore di Cartizze NaN NaN NaN Col Vetoraz Spumanti NV Prosecco Superiore di... Prosecco Col Vetoraz Spumanti
129943 Italy A blend of Nero d'Avola and Syrah, this convey... Adènzia 90 29.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... Red Blend Baglio del Cristo di Campobello
129947 Italy A blend of 65% Cabernet Sauvignon, 30% Merlot ... Symposio 90 20.0 Sicily & Sardinia Terre Siciliane NaN Kerin O’Keefe @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... Red Blend Feudo Principi di Butera
129961 Italy Intense aromas of wild cherry, baking spice, t... NaN 90 30.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato COS
129962 Italy Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia NaN Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola Cusumano

19540 rows × 13 columns

9.创建一个 DataFrame top_oceania_wines,其中包含来自澳大利亚或新西兰的葡萄酒的至少 95 分(满分 100 分)的所有评论。

1
2
top_oceania_wines = reviews[reviews.country.isin(['Australia', 'New Zealand']) & (reviews.points>=95)]
top_oceania_wines

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
345 Australia This wine contains some material over 100 year... Rare 100 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards
346 Australia This deep brown wine smells like a damp, mossy... Rare 98 350.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... Muscadelle Chambers Rosewood Vineyards
348 Australia Deep mahogany. Dried fig and black tea on the ... Grand 97 100.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Grand Muscat (R... Muscat Chambers Rosewood Vineyards
349 Australia RunRig is always complex, and the 2012 doesn't... RunRig 97 225.0 South Australia Barossa NaN Joe Czerwinski @JoeCz Torbreck 2012 RunRig Shiraz-Viognier (Barossa) Shiraz-Viognier Torbreck
356 Australia Dusty, firm, powerful: just a few apt descript... Georgia's Paddock 95 85.0 Victoria Heathcote NaN Joe Czerwinski @JoeCz Jasper Hill 2013 Georgia's Paddock Shiraz (Hea... Shiraz Jasper Hill
360 Australia Bacon and tapenade elements merge easily on th... Descendant 95 125.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2012 Descendant Shiraz-Viognier (Baro... Shiraz-Viognier Torbreck
365 Australia The Taylor family selected Clare Valley for it... St. Andrews Single Vineyard Release 95 60.0 South Australia Clare Valley NaN Joe Czerwinski @JoeCz Wakefield 2013 St. Andrews Single Vineyard Rel... Shiraz Wakefield
14354 Australia This wine's concentrated dark fruit shows in t... Old Vine 95 60.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Kaesler 2006 Old Vine Shiraz (Barossa Valley) Shiraz Kaesler
16538 Australia Rich, dense and intense, this is a big, muscul... The Family Tree 95 65.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Lambert 2013 The Family Tree Shiraz (Barossa V... Shiraz Lambert
28573 Australia Astralis has become one of Australia's top col... Astralis 95 350.0 South Australia Clarendon NaN Joe Czerwinski @JoeCz Clarendon Hills 2005 Astralis Syrah (Clarendon) Syrah Clarendon Hills
34502 Australia This prodigious wine showcases Barossa's abili... The Relic 98 135.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Standish 2006 The Relic Shiraz (Barossa Valley) Shiraz Standish
34506 Australia If Standish's Relic is the feminine side of Sh... The Standish Single Vineyard 96 135.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Standish 2005 The Standish Single Vineyard Shi... Shiraz Standish
38988 Australia Penfolds Bin 707 has leapt in quality over the... Bin 707 95 200.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Penfolds 2009 Bin 707 Cabernet Sauvignon (Sout... Cabernet Sauvignon Penfolds
39059 Australia The Taylor family selected Clare Valley for it... St. Andrews Single Vineyard Release 95 60.0 South Australia Clare Valley NaN Joe Czerwinski @JoeCz Wakefield 2013 St. Andrews Single Vineyard Rel... Shiraz Wakefield
39961 Australia As unevolved as they are, the dense and multil... Grange 96 185.0 South Australia South Australia NaN NaN NaN Penfolds 1996 Grange Shiraz (South Australia) Shiraz Penfolds
39962 Australia Seamless luxury from stem to stern, this ‘baby... RWT 95 70.0 South Australia Barossa Valley NaN NaN NaN Penfolds 1998 RWT Shiraz (Barossa Valley) Shiraz Penfolds
45809 Australia The 2007 Astralis impresses for its combinatio... Astralis 95 225.0 South Australia Clarendon NaN Joe Czerwinski @JoeCz Clarendon Hills 2007 Astralis Syrah (Clarendon) Syrah Clarendon Hills
56953 Australia This inky, embryonic wine deserves to be cella... Grange 99 850.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Penfolds 2010 Grange Shiraz (South Australia) Shiraz Penfolds
56956 Australia You may have to scour the country to secure so... Andelmonde 97 95.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Standish 2012 Andelmonde Shiraz (Barossa Valley) Shiraz Standish
56957 Australia Thorn Clarke has taken its Shiraz to a new lev... Ron Thorn Single Vineyard 96 89.0 South Australia Barossa NaN Joe Czerwinski @JoeCz Thorn Clarke 2012 Ron Thorn Single Vineyard Sh... Shiraz Thorn Clarke
56959 Australia Is this the Yin to Grange's Yang? The wines ar... Hill of Grace 96 820.0 South Australia Eden Valley NaN Joe Czerwinski @JoeCz Henschke 2010 Hill of Grace Shiraz (Eden Valley) Shiraz Henschke
59977 Australia This is a top example of the classic Australia... The Peake 96 150.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz Hickinbotham 2013 The Peake Cabernet-Shiraz (M... Cabernet-Shiraz Hickinbotham
59984 Australia This is a throwback to those brash, flavor-exu... One 95 95.0 South Australia Langhorne Creek NaN Joe Czerwinski @JoeCz Heartland 2013 One Red (Langhorne Creek) Red Blend Heartland
67096 Australia Just a tiny serving of this dark nectar will l... Calliope Rare 98 86.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz R.L. Buller & Son NV Calliope Rare Tokay (Ruth... Tokay R.L. Buller & Son
67101 Australia This Muscat is the color of dark coffee, with ... Rare 95 300.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscat (Ru... Muscat Chambers Rosewood Vineyards
76392 Australia When the alcohol levels are reined in to appro... Georgia's Paddock 95 85.0 Victoria Heathcote NaN Joe Czerwinski @JoeCz Jasper Hill 2012 Georgia's Paddock Shiraz (Hea... Shiraz Jasper Hill
77028 Australia This has all the size and weight you've come t... Grange 98 850.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Penfolds 2008 Grange Shiraz (South Australia) Shiraz Penfolds
77036 Australia RWT (unromantically derived from “Red Wine Tri... RWT 96 150.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Penfolds 2009 RWT Shiraz (Barossa Valley) Shiraz Penfolds
77037 Australia Winemaker Dave Powell is no longer with Torbre... RunRig 96 225.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2007 RunRig Shiraz-Viognier (Barossa ... Shiraz-Viognier Torbreck
77042 Australia This is likely the most ageworthy Shiraz winem... Eligo 95 100.0 South Australia Barossa NaN Joe Czerwinski @JoeCz John Duval Wines 2010 Eligo Shiraz (Barossa) Shiraz John Duval Wines
77044 Australia The fruit for this offering comes from the Gre... R Reserve 95 105.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Kilikanoon 2009 R Reserve Shiraz (Barossa Valley) Shiraz Kilikanoon
77046 Australia With aromas and flavors that range widely from... The Factor 95 125.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2007 The Factor Shiraz (Barossa Valley) Shiraz Torbreck
83357 Australia A throwback to the monster Shiraz style of old... Grange 96 500.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Penfolds 2004 Grange Shiraz (South Australia) Shiraz Penfolds
84815 Australia The Factor is always one of Torbreck's biggest... The Factor 95 125.0 South Australia Barossa NaN Joe Czerwinski @JoeCz Torbreck 2012 The Factor Shiraz (Barossa) Shiraz Torbreck
84816 Australia Nashwauk is Kaesler's McLaren Vale project, fi... Beacon 95 145.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz Nashwauk 2010 Beacon Shiraz (McLaren Vale) Shiraz Nashwauk
87128 Australia This full-bodied, muscular Shiraz is built for... Amery Vineyard Block 6 96 120.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz Kay Brothers 2012 Amery Vineyard Block 6 Shira... Shiraz Kay Brothers
87137 Australia Perhaps the best young wine I've tasted from M... NaN 95 84.0 Western Australia Margaret River NaN Joe Czerwinski @JoeCz Moss Wood 2011 Cabernet Sauvignon (Margaret Ri... Cabernet Sauvignon Moss Wood
87143 Australia This is wonderfully complex and aromatic, with... St. Andrews Single Vineyard Release 95 60.0 South Australia Clare Valley NaN Joe Czerwinski @JoeCz Wakefield 2012 St. Andrews Single Vineyard Rel... Shiraz Wakefield
91851 New Zealand This full-bodied, richly tannic wine delivers.... Homage 95 100.0 Hawke's Bay NaN NaN Joe Czerwinski @JoeCz Trinity Hill 2013 Homage Syrah (Hawke's Bay) Syrah Trinity Hill
98386 Australia One of the more approachable of the d'Arenberg... Little Venice Single Vineyard 95 85.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz D'Arenberg 2010 Little Venice Single Vineyard ... Shiraz D'Arenberg
99318 Australia From vines planted in 1912, this has been an i... Mount Edelstone Vineyard 95 200.0 South Australia Eden Valley NaN Joe Czerwinski @JoeCz Henschke 2014 Mount Edelstone Vineyard Shiraz ... Shiraz Henschke
99330 Australia This Cabernet equivalent to Grange has explode... Bin 707 95 500.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Penfolds 2014 Bin 707 Cabernet Sauvignon (Sout... Cabernet Sauvignon Penfolds
99340 Australia This rich, opulent wine carries its massive oa... Les Amis 95 185.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2013 Les Amis Grenache (Barossa Valley) Grenache Torbreck
109427 Australia This wine is dark brown in hue with a greenish... Rare 99 300.0 Victoria Rutherglen NaN Joe Czerwinski @JoeCz Chambers Rosewood Vineyards NV Rare Muscadelle... Muscadelle Chambers Rosewood Vineyards
109434 Australia D'Arenberg's lineup of single-vineyard Shiraze... The Swinging Malaysian Single Vineyard 96 85.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz D'Arenberg 2010 The Swinging Malaysian Single ... Shiraz D'Arenberg
122421 Australia Despite this wine's weight and richness, it re... Amon-Ra Unfiltered 96 110.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Glaetzer 2010 Amon-Ra Unfiltered Shiraz (Baros... Shiraz Glaetzer
122430 Australia These blends are traditional in Australia—they... Anaperenna 95 80.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Glaetzer 2010 Anaperenna Shiraz-Cabernet Sauvi... Shiraz-Cabernet Sauvignon Glaetzer
122507 New Zealand This blend of Cabernet Sauvignon (62.5%), Merl... SQM Gimblett Gravels Cabernets/Merlot 95 79.0 Hawke's Bay NaN NaN Joe Czerwinski @JoeCz Squawking Magpie 2014 SQM Gimblett Gravels Cab... Bordeaux-style Red Blend Squawking Magpie
122939 Australia Full-bodied and plush yet vibrant and imbued w... The Factor 98 125.0 South Australia Barossa Valley NaN Joe Czerwinski @JoeCz Torbreck 2013 The Factor Shiraz (Barossa Valley) Shiraz Torbreck

Summary Functions and Maps

介绍

在上一教程中,我们学习了如何从 DataFrame 或 Series 中选择相关数据。正如我们在练习中演示的那样,从数据表示中提取正确的数据对于完成工作至关重要。

然而,数据并不总是以我们想要的格式从内存中直接取出。有时,我们必须自己再做一些工作来重新格式化数据,以完成手头的任务。本教程将介绍我们可以对数据进行的不同操作,以获得 “恰到好处 “的输入。

1
2
3
4
import pandas as pd
import numpy as np

reviews = pd.read_csv('datasets/wine/winemag-data-130k-v2.csv',index_col=0)
1
reviews

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Summary functions

Pandas 提供了许多简单的“摘要函数”(不是官方名称),它们以某种有用的方式重组数据。例如,考虑describe()方法:

1
reviews.points.describe()
count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

该方法生成给定列属性的高级摘要。它是类型感知的,这意味着它的输出会根据输入的数据类型而改变。上面的输出仅对数值数据有意义;对于字符串数据,我们得到的结果如下:

1
reviews.taster_name.describe()
count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

如果你想得到DataFrame或Series中某一列的一些特定的简单汇总统计,通常有一个有用的pandas函数可以实现。

例如,我们可以使用mean()函数来查看所分配分数的平均值(例如,平均评分的葡萄酒的表现如何):

1
reviews.points.mean()
88.44713820775404

要查看唯一值的列表,我们可以使用 unique() 函数:

1
reviews.taster_name.unique()
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

要查看唯一值的列表以及它们在数据集中出现的频率,我们可以使用 value_counts() 方法:

1
reviews.taster_name.value_counts()
taster_name
Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: count, dtype: int64

Maps

映射(map)是从数学中借用的术语,指的是将一组值 “映射 “到另一组值的函数。在数据科学中,我们经常需要从现有数据中创建新的表示方法,或者将数据从现在的格式转换为我们希望的格式。映射就是处理这些工作的工具,因此它对于完成工作极为重要!

有两种映射方法您会经常用到。

map()是第一个,也是稍微简单的一个。例如,假设我们想将葡萄酒得到的分数重新平均为0:

1
2
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean) #输入points,函数为points-points_mean
0        -1.447138
1        -1.447138
2        -1.447138
3        -1.447138
4        -1.447138
            ...   
129966    1.552862
129967    1.552862
129968    1.552862
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

您传递给map()的函数应该期望从Series中得到一个单独的值(在上面的示例中是一个点值),并返回该值的转换版本。 map()返回一个新的Series,其中所有的值都已被您的函数转换。

如果我们想通过在每一行上调用自定义方法来转换整个DataFrame,则apply()是等价的方法。

1
2
3
4
5
def remean_points(row):
row.points = row.points - review_points_mean
return row

reviews.apply(remean_points, axis='columns')

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco -1.447138 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos -1.447138 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN -1.447138 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest -1.447138 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block -1.447138 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 1.552862 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 1.552862 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 1.552862 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 1.552862 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 1.552862 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

如果我们在调用views.apply()时使用的是axis=’index’,那么我们就不需要传递一个函数来转换每一行,而是需要给出一个函数来转换每一列。请注意,map()和apply()分别返回新的、转换后的Series和DataFrames。它们不会修改被调用的原始数据。如果我们查看第一行的评论,我们可以看到它仍然具有原始的点值。

1
reviews.head(1)

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia

Pandas 提供了许多常见的内置映射操作。例如,以下是重新定义点列的更快方法:

1
2
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean
0        -1.447138
1        -1.447138
2        -1.447138
3        -1.447138
4        -1.447138
            ...   
129966    1.552862
129967    1.552862
129968    1.552862
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

在这段代码中,我们在左侧的大量值(系列中的所有值)和右侧的单个值(平均值)之间执行运算。Pandas查看该表达式,并计算出我们必须从数据集中的每个值中减去该平均值。

如果我们在等长序列之间执行这些操作,Pandas也会明白该怎么做。例如,结合数据集中的国家和地区信息的简单方法如下:

1
reviews.country + " - " + reviews.region_1
0                     Italy - Etna
1                              NaN
2           US - Willamette Valley
3         US - Lake Michigan Shore
4           US - Willamette Valley
                    ...           
129966                         NaN
129967                 US - Oregon
129968             France - Alsace
129969             France - Alsace
129970             France - Alsace
Length: 129971, dtype: object

这些运算符比map()或apply()更快,因为它们使用了pandas内置的加速功能。所有标准的 Python 运算符 (>, <, ==, 等等) 都以这种方式工作。

然而,它们不如map()或apply()灵活,后者可以做更高级的事情,比如应用条件逻辑,而这是加减法无法做到的。

TEST

1
2
import pandas as pd
reviews = pd.read_csv('datasets/wine/winemag-data-130k-v2.csv',index_col=0)

1.review DataFrame 中points列的中位数是多少?

1
reviews['points'].median()
88.0

2.What countries are represented in the dataset? (Your answer should not include any duplicates.)

1
2
countries=reviews['country'].unique()
countries
array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

3.How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
每个国家在数据集中出现的频率是多少?创建一个 Series Reviews_per_country,将国家映射到该国家的葡萄酒reviews数量。

1
2

reviews['country'].value_counts()
country
US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14
Serbia                       12
Czech Republic               12
Macedonia                    12
Cyprus                       11
India                         9
Switzerland                   7
Luxembourg                    6
Bosnia and Herzegovina        2
Armenia                       2
Slovakia                      1
China                         1
Egypt                         1
Name: count, dtype: int64

Create variable centered_price containing a version of the price column with the mean price subtracted. (Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)

创建变量 centered_price,其中包含减去平均价格的价格列版本。 (注意:这种“居中”转换是应用各种机器学习算法之前的常见预处理步骤。)

1
2
a = reviews['price'].mean()
reviews['price'].map(lambda p: p-a)
0               NaN
1        -20.363389
2        -21.363389
3        -22.363389
4         29.636611
            ...    
129966    -7.363389
129967    39.636611
129968    -5.363389
129969    -3.363389
129970   -14.363389
Name: price, Length: 129971, dtype: float64

I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

我是一个经济实惠的葡萄酒买家。哪种酒是“最划算的”?使用数据集中积分价格比最高的葡萄酒的标题创建一个变量 deal_wine。

1
2
3
# reviews[[(reviews['points']/reviews['price']).idxmax()]]
reviews['title'][(reviews['points']/reviews['price']).idxmax()]

'Bandit NV Merlot (California)'

6.There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let’s ignore the capitalized versions of these words.)

在描述一瓶葡萄酒时,您可以使用的词汇有限。葡萄酒更可能是 “热带 “还是 “果味”?创建一个 “descriptor_counts “系列,计算这两个词在数据集的 “description “列中分别出现的次数。(为简单起见,让我们忽略这些词的大写版本)。

1
2
3
4
5
6
7
8
9
10
11
12
13
import re

def trop_ocurr(row):
return re.search("tropical",row.description)==None

def frui_ocurr(row):
return re.search("fruity",row.description)==None

a = reviews.apply(trop_ocurr, axis='columns').value_counts()[False]
b = reviews.apply(frui_ocurr, axis='columns').value_counts()[False]

pd.Series([a,b],index=['tropical', 'fruity'])

0    3607
1    9090
dtype: int64
1
2
3
4
############答案的方法##########
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
tropical    3607
fruity      9090
dtype: int64
  1. We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

我们希望将这些酒评放在我们的网站上,但从80分到100分的评分系统太难懂了–我们希望将其转化为简单的星级评分。95分及以上为3星,85分以上95分以下为2星。其他分数为1星。

此外,加拿大葡萄酒商协会在网站上购买了大量广告,因此任何来自加拿大的葡萄酒都应该自动获得3星,无论分数高低。

创建一个 “star_ratings “序列,其中包含数据集中每条review对应的星级数。

1
2
3
4
5
6
7
8
9
a = pd.Series()
def rate(row):
# row['points']=3 if (row['country']=='Canada' or row['points']>=95) else 2 if(row['points']>=85) else 1
return 3 if (row['country']=='Canada' or row['points']>=95) else 2 if(row['points']>=85) else 1
# return row

# reviews.apply(rate, axis='columns')['points']
reviews.apply(rate, axis='columns')

0         2
1         2
2         2
3         2
4         2
         ..
129966    2
129967    2
129968    2
129969    2
129970    2
Length: 129971, dtype: int64
1
2
3
4
5
6
7
8
9
10
11
12
###官方答案###
def stars(row):
if row.country == 'Canada':
return 3
elif row.points >= 95:
return 3
elif row.points >= 85:
return 2
else:
return 1

reviews.apply(stars, axis='columns')
0         2
1         2
2         2
3         2
4         2
         ..
129966    2
129967    2
129968    2
129969    2
129970    2
Length: 129971, dtype: int64

Grouping and Sorting

映射允许我们对数据帧或系列中的数据进行转换,一次只转换整个列中的一个值。然而,我们经常希望对数据进行分组,然后对数据所在的组进行特定操作。

正如您将了解到的,我们通过groupby()操作来实现这一点。我们还将介绍一些其他主题,例如索引DataFrames的更复杂方法,以及如何排序数据。

分组分析

到目前为止,我们经常使用的一个函数是 value_counts() 函数。我们可以通过执行以下操作来复制 value_counts() 的作用:

1
2
3
import pandas as pd

reviews = pd.read_csv('datasets/wine/winemag-data-130k-v2.csv',index_col=0)
1
2
3
4
print(reviews.groupby('points').points.count())
print(reviews.groupby('points').price.count())
print(reviews.groupby('price').price.count())
print(reviews.groupby('price').points.count())
points
80       397
81       692
82      1836
83      3025
84      6480
85      9530
86     12600
87     16933
88     17207
89     12226
90     15410
91     11359
92      9613
93      6489
94      3758
95      1535
96       523
97       229
98        77
99        33
100       19
Name: points, dtype: int64
points
80       395
81       680
82      1772
83      2886
84      6099
85      8902
86     11745
87     15767
88     16014
89     11324
90     14361
91     10564
92      8871
93      5935
94      3449
95      1406
96       482
97       207
98        69
99        28
100       19
Name: price, dtype: int64
price
4.0        11
5.0        46
6.0       120
7.0       433
8.0       892
         ... 
1900.0      1
2000.0      2
2013.0      1
2500.0      2
3300.0      1
Name: price, Length: 390, dtype: int64
price
4.0        11
5.0        46
6.0       120
7.0       433
8.0       892
         ... 
1900.0      1
2000.0      2
2013.0      1
2500.0      2
3300.0      1
Name: points, Length: 390, dtype: int64

groupby()创建了一组评论,这些评论为给定的葡萄酒分配了相同的分值。然后,对于每一组,我们抓取points()列并计算它出现的次数。 value_counts()只是groupby()操作的快捷方式。

我们可以对这些数据使用我们之前使用过的任何汇总函数。例如,要获得每个点值类别中最便宜的葡萄酒,我们可以执行以下操作:

1
reviews.groupby('points').price.min()
points
80      5.0
81      5.0
82      4.0
83      4.0
84      4.0
85      4.0
86      4.0
87      5.0
88      6.0
89      7.0
90      8.0
91      7.0
92     11.0
93     12.0
94     13.0
95     20.0
96     20.0
97     35.0
98     50.0
99     44.0
100    80.0
Name: price, dtype: float64

您可以将我们生成的每个组视为DataFrame的一个片段,其中仅包含值匹配的数据。我们可以使用apply()方法直接访问该DataFrame,然后以我们认为合适的方式处理数据。例如,我们可以从数据集中的每个酒庄中选择第一款葡萄酒的名称:

1
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
winery
1+1=3                                     1+1=3 NV Rosé Sparkling (Cava)
10 Knots                            10 Knots 2010 Viognier (Paso Robles)
100 Percent Wine              100 Percent Wine 2015 Moscato (California)
1000 Stories           1000 Stories 2013 Bourbon Barrel Aged Zinfande...
1070 Green                  1070 Green 2011 Sauvignon Blanc (Rutherford)
                                             ...                        
Órale                       Órale 2011 Cabronita Red (Santa Ynez Valley)
Öko                    Öko 2013 Made With Organically Grown Grapes Ma...
Ökonomierat Rebholz    Ökonomierat Rebholz 2007 Von Rotliegenden Spät...
àMaurice               àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                                    Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

为了进行更细粒度的控制,您还可以按多列进行分组。举个例子,以下是我们如何按国家和省份挑选最好的葡萄酒的方法:

1
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
country province
Argentina Mendoza Province Argentina If the color doesn't tell the full story, the ... Nicasia Vineyard 97 120.0 Mendoza Province Mendoza NaN Michael Schachner @wineschach Bodega Catena Zapata 2006 Nicasia Vineyard Mal... Malbec Bodega Catena Zapata
Other Argentina Take note, this could be the best wine Colomé ... Reserva 95 90.0 Other Salta NaN Michael Schachner @wineschach Colomé 2010 Reserva Malbec (Salta) Malbec Colomé
Armenia Armenia Armenia Deep salmon in color, this wine offers a bouqu... Estate Bottled 88 15.0 Armenia NaN NaN Mike DeSimone @worldwineguys Van Ardi 2015 Estate Bottled Rosé (Armenia) Rosé Van Ardi
Australia Australia Other Australia Writes the book on how to make a wine filled w... Sarah's Blend 93 15.0 Australia Other South Eastern Australia NaN NaN NaN Marquis Philips 2000 Sarah's Blend Red (South ... Red Blend Marquis Philips
New South Wales Australia De Bortoli's Noble One is as good as ever in 2... Noble One Bortytis 94 32.0 New South Wales New South Wales NaN Joe Czerwinski @JoeCz De Bortoli 2007 Noble One Bortytis Semillon (N... Sémillon De Bortoli
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Uruguay Juanico Uruguay This mature Bordeaux-style blend is earthy on ... Preludio Barrel Select Lote N 77 90 45.0 Juanico NaN NaN Michael Schachner @wineschach Familia Deicas 2004 Preludio Barrel Select Lot... Red Blend Familia Deicas
Montevideo Uruguay A rich, heady bouquet offers aromas of blackbe... Monte Vide Eu Tannat-Merlot-Tempranillo 91 60.0 Montevideo NaN NaN Michael Schachner @wineschach Bouza 2015 Monte Vide Eu Tannat-Merlot-Tempran... Red Blend Bouza
Progreso Uruguay Rusty in color but deep and complex in nature,... Etxe Oneko Fortified Sweet Red 90 46.0 Progreso NaN NaN Michael Schachner @wineschach Pisano 2007 Etxe Oneko Fortified Sweet Red Tan... Tannat Pisano
San Jose Uruguay Baked, sweet, heavy aromas turn earthy with ti... El Preciado Gran Reserva 87 50.0 San Jose NaN NaN Michael Schachner @wineschach Castillo Viejo 2005 El Preciado Gran Reserva R... Red Blend Castillo Viejo
Uruguay Uruguay Cherry and berry aromas are ripe, healthy and ... Blend 002 Limited Edition 91 22.0 Uruguay NaN NaN Michael Schachner @wineschach Narbona NV Blend 002 Limited Edition Tannat-Ca... Tannat-Cabernet Franc Narbona

425 rows × 13 columns

另一个值得一提的 groupby() 方法是 agg(),它允许您同时在 DataFrame 上运行一堆不同的函数。例如,我们可以生成数据集的简单统计摘要,如下所示:

1
reviews.groupby(['country']).price.agg([len, min, max])

len min max
country
Argentina 3800 4.0 230.0
Armenia 2 14.0 15.0
Australia 2329 5.0 850.0
Austria 3345 7.0 1100.0
Bosnia and Herzegovina 2 12.0 13.0
Brazil 52 10.0 60.0
Bulgaria 141 8.0 100.0
Canada 257 12.0 120.0
Chile 4472 5.0 400.0
China 1 18.0 18.0
Croatia 73 12.0 65.0
Cyprus 11 11.0 21.0
Czech Republic 12 15.0 45.0
Egypt 1 NaN NaN
England 74 25.0 95.0
France 22093 5.0 3300.0
Georgia 86 9.0 40.0
Germany 2165 5.0 775.0
Greece 466 8.0 79.0
Hungary 146 10.0 764.0
India 9 10.0 20.0
Israel 505 8.0 150.0
Italy 19540 5.0 900.0
Lebanon 35 13.0 75.0
Luxembourg 6 16.0 30.0
Macedonia 12 15.0 20.0
Mexico 70 8.0 108.0
Moldova 59 8.0 42.0
Morocco 28 14.0 40.0
New Zealand 1419 7.0 130.0
Peru 16 10.0 68.0
Portugal 5691 5.0 1000.0
Romania 120 4.0 320.0
Serbia 12 15.0 42.0
Slovakia 1 16.0 16.0
Slovenia 87 7.0 90.0
South Africa 1401 5.0 330.0
Spain 6645 4.0 770.0
Switzerland 7 21.0 160.0
Turkey 90 14.0 120.0
US 54504 4.0 2013.0
Ukraine 14 6.0 13.0
Uruguay 109 10.0 130.0

有效使用 groupby() 将使您能够利用数据集做很多真正强大的事情。

Multi-indexes

在我们迄今为止看到的所有示例中,我们一直在使用单标签索引的DataFrame或Series对象。
groupby()略有不同,根据我们运行的操作,它有时会产生所谓的多索引。
多索引与常规索引的不同之处在于它具有多个级别。例如:

1
2
3
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed
# reviews['description'][0].__len__()

len
country province
Argentina Mendoza Province 3264
Other 536
Armenia Armenia 2
Australia Australia Other 245
New South Wales 85
... ... ...
Uruguay Juanico 12
Montevideo 11
Progreso 11
San Jose 3
Uruguay 24

425 rows × 1 columns

1
2
mi = countries_reviewed.index
type(mi)
pandas.core.indexes.multi.MultiIndex

多级索引有几种处理分层结构的方法,而单级索引则没有这些方法。它们还需要两层标签来获取一个值。对于刚接触pandas的用户来说,处理多索引输出是一个常见的 “疑难杂症”。

在pandas文档的MultiIndex/Advanced Selection部分详细介绍了多索引的用例和使用说明。

然而,一般来说,你最常使用的多索引方法是转换回普通索引的reset_index()方法:

1
countries_reviewed.reset_index()

country province len
0 Argentina Mendoza Province 3264
1 Argentina Other 536
2 Armenia Armenia 2
3 Australia Australia Other 245
4 Australia New South Wales 85
... ... ... ...
420 Uruguay Juanico 12
421 Uruguay Montevideo 11
422 Uruguay Progreso 11
423 Uruguay San Jose 3
424 Uruguay Uruguay 24

425 rows × 3 columns

Sorting

再次查看 countries_reviewed,我们可以看到分组返回的数据是按照索引顺序而不是值的顺序排列的。也就是说,在输出分组结果时,行的顺序取决于索引中的值,而不是数据中的值。

为了按照我们想要的顺序获得数据,我们可以自己对数据进行排序。sort_values()方法在这方面非常方便。

1
2
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

country province len
179 Greece Muscat of Kefallonian 1
192 Greece Sterea Ellada 1
194 Greece Thraki 1
354 South Africa Paardeberg 1
40 Brazil Serra do Sudeste 1
... ... ... ...
409 US Oregon 5373
227 Italy Tuscany 5897
118 France Bordeaux 5941
415 US Washington 8639
392 US California 36247

425 rows × 3 columns

sort_values() 默认为升序排序,其中最低的值排在前面。然而,大多数时候我们想要降序排序,即数字较大的排在前面。事情是这样的:

1
countries_reviewed.sort_values(by='len', ascending=False)

country province len
392 US California 36247
415 US Washington 8639
118 France Bordeaux 5941
227 Italy Tuscany 5897
409 US Oregon 5373
... ... ... ...
101 Croatia Krk 1
247 New Zealand Gladstone 1
357 South Africa Piekenierskloof 1
63 Chile Coelemu 1
149 Greece Beotia 1

425 rows × 3 columns

要按索引值排序,请使用配套方法 sort_index()。此方法具有相同的参数和默认顺序:

1
countries_reviewed.sort_index()

country province len
0 Argentina Mendoza Province 3264
1 Argentina Other 536
2 Armenia Armenia 2
3 Australia Australia Other 245
4 Australia New South Wales 85
... ... ... ...
420 Uruguay Juanico 12
421 Uruguay Montevideo 11
422 Uruguay Progreso 11
423 Uruguay San Jose 3
424 Uruguay Uruguay 24

425 rows × 3 columns

最后,要知道您一次可以按多个列进行排序:

1
countries_reviewed.sort_values(by=['country', 'len'])

country province len
1 Argentina Other 536
0 Argentina Mendoza Province 3264
2 Armenia Armenia 2
6 Australia Tasmania 42
4 Australia New South Wales 85
... ... ... ...
421 Uruguay Montevideo 11
422 Uruguay Progreso 11
420 Uruguay Juanico 12
424 Uruguay Uruguay 24
419 Uruguay Canelones 43

425 rows × 3 columns

TEST

1
2
3
4
import pandas as pd

reviews = pd.read_csv('datasets/wine/winemag-data-130k-v2.csv')
reviews.head()

Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

1.

Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.

数据集中最常见的酒评人是谁?创建一个 “系列”,其索引是数据集中的 “taster_twitter_handle “类别,其值计算每个人写了多少评论。

1
reviews.groupby('taster_twitter_handle')['taster_twitter_handle'].count()
taster_twitter_handle
@AnneInVino          3685
@JoeCz               5147
@bkfiona               27
@gordone_cellars     4177
@kerinokeefe        10776
@laurbuzz            1835
@mattkettmann        6332
@paulgwine           9532
@suskostrzewa        1085
@vboone              9537
@vossroger          25514
@wawinereport        4966
@wineschach         15134
@winewchristina         6
@worldwineguys       1005
Name: taster_twitter_handle, dtype: int64

2.

What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).

我花多少钱可以买到最好的葡萄酒?创建一个 “系列”,其索引是葡萄酒的价格,其值是该价格的葡萄酒在评论中得到的最高分。按价格升序排序(这样4.0美元在最上面,3300.0美元在最下面)。

1
reviews.groupby('price')['points'].max()
price
4.0       86
5.0       87
6.0       88
7.0       91
8.0       91
          ..
1900.0    98
2000.0    97
2013.0    91
2500.0    96
3300.0    88
Name: points, Length: 390, dtype: int64

3.

What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.

每种葡萄酒的最低和最高价格是多少?创建一个DataFrame,其索引为数据集中的variety类别,其值为minmax值。

1
2
price_extremes = reviews.groupby('variety')['price'].agg([min,max])
price_extremes

min max
variety
Abouriou 15.0 75.0
Agiorgitiko 10.0 66.0
Aglianico 6.0 180.0
Aidani 27.0 27.0
Airen 8.0 10.0
... ... ...
Zinfandel 5.0 100.0
Zlahtina 13.0 16.0
Zweigelt 9.0 70.0
Çalkarası 19.0 19.0
Žilavka 15.0 15.0

707 rows × 2 columns

4.

What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

最贵的葡萄酒品种是什么?创建一个变量sorted_varieties,其中包含上一个问题中数据帧的副本,根据最低价格,然后根据最高价格,按降序排列(打破并列关系)。

1
price_extremes.sort_values(by=['min', 'max'],ascending=False)

min max
variety
Ramisco 495.0 495.0
Terrantez 236.0 236.0
Francisa 160.0 160.0
Rosenmuskateller 150.0 150.0
Tinta Negra Mole 112.0 112.0
... ... ...
Roscetto NaN NaN
Sauvignon Blanc-Sauvignon Gris NaN NaN
Tempranillo-Malbec NaN NaN
Vital NaN NaN
Zelen NaN NaN

707 rows × 2 columns

5.

Create a Series whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.

创建一个 “系列”,其索引为reviewers,其值为该reviewer给出的平均review分数。提示:您将需要taster_namepoints列。

1
2
3
# reviews.groupby('taster_name')['points'].agg(['mean'])
reviewer_mean_ratings = reviews.groupby('taster_name').apply(lambda df: df['points'].mean())
reviewer_mean_ratings
taster_name
Alexander Peartree    85.855422
Anna Lee C. Iijima    88.415629
Anne Krebiehl MW      90.562551
Carrie Dykes          86.395683
Christina Pickard     87.833333
Fiona Adams           86.888889
Jeff Jenssen          88.319756
Jim Gordon            88.626287
Joe Czerwinski        88.536235
Kerin O’Keefe         88.867947
Lauren Buzzeo         87.739510
Matt Kettmann         90.008686
Michael Schachner     86.907493
Mike DeSimone         89.101167
Paul Gregutt          89.082564
Roger Voss            88.708003
Sean P. Sullivan      88.755739
Susan Kostrzewa       86.609217
Virginie Boone        89.213379
dtype: float64

Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe() method to see a summary of the range of values.

不同评论人给出的平均分是否存在明显差异?运行下面的单元格,使用 describe()方法查看数值范围的摘要。

1
reviewer_mean_ratings.describe()
count    19.000000
mean     88.233026
std       1.243610
min      85.855422
25%      87.323501
50%      88.536235
75%      88.975256
max      90.562551
dtype: float64

6.

What combination of countries and varieties are most common? Create a Series whose index is a MultiIndexof {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.

什么国家和品种的组合最常见?创建一个 Series,其索引是{country, variety} 对的 MultiIndex。例如,产自美国的黑比诺应该映射到{"美国", "黑比诺"}。将 Series中的值按照葡萄酒数量降序排序。

1
2
reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
# reviews.head()
country  variety                 
US       Pinot Noir                  9885
         Cabernet Sauvignon          7315
         Chardonnay                  6801
France   Bordeaux-style Red Blend    4725
Italy    Red Blend                   3624
                                     ... 
Mexico   Cinsault                       1
         Grenache                       1
         Merlot                         1
         Rosado                         1
Uruguay  White Blend                    1
Length: 1612, dtype: int64

Data Types and Missing Values

Dtypes

DataFrame 或 Series 中列的数据类型称为 dtype。

您可以使用 dtype 属性来获取特定列的类型。例如,我们可以获取评论 DataFrame 中价格列的 dtype:

1
2
import pandas as pd
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)
1
reviews.price.dtype
dtype('float64')

或者,dtypes 属性返回 DataFrame 中每列的 dtype:

1
reviews.dtypes
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

float64 表示使用 64 位浮点数;int64 表示使用类似大小的整数,以此类推。

需要注意的一个特殊情况是(在这里可以很清楚地看到),完全由字符串组成的列没有自己的类型,而是被赋予了对象类型。

我们可以使用 astype() 函数将一种类型的列转换为另一种类型,只要这种转换是合理的。例如,我们可以将积分列从现有的 int64 数据类型转换为 float64 数据类型:

1
reviews.points.astype('float64')
0         87.0
1         87.0
2         87.0
3         87.0
4         87.0
          ... 
129966    90.0
129967    90.0
129968    90.0
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

DataFrame 或 Series 索引也有自己的数据类型:

1
reviews.index.dtype
dtype('int64')

Pandas 还支持更奇特的数据类型,例如分类数据和时间序列数据。

Missing data

缺失值的输入值为 NaN,即 “Not a Number(非数值)”的缩写。由于技术原因,这些 NaN 值始终是 float64 类型。

Pandas 提供了一些专门针对缺失数据的方法。要选择 NaN 条目,可以使用 pd.isnull()(或其同伴 pd.notnull())。使用方法如下

1
reviews[pd.isnull(reviews.country)]

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
913 NaN Amber in color, this wine has aromas of peach ... Asureti Valley 87 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Gotsa Family Wines 2014 Asureti Valley Chinuri Chinuri Gotsa Family Wines
3131 NaN Soft, fruity and juicy, this is a pleasant, si... Partager 83 NaN NaN NaN NaN Roger Voss @vossroger Barton & Guestier NV Partager Red Red Blend Barton & Guestier
4243 NaN Violet-red in color, this semisweet wine has a... Red Naturally Semi-Sweet 88 18.0 NaN NaN NaN Mike DeSimone @worldwineguys Kakhetia Traditional Winemaking 2012 Red Natur... Ojaleshi Kakhetia Traditional Winemaking
9509 NaN This mouthwatering blend starts with a nose of... Theopetra Malagouzia-Assyrtiko 92 28.0 NaN NaN NaN Susan Kostrzewa @suskostrzewa Tsililis 2015 Theopetra Malagouzia-Assyrtiko W... White Blend Tsililis
9750 NaN This orange-style wine has a cloudy yellow-gol... Orange Nikolaevo Vineyard 89 28.0 NaN NaN NaN Jeff Jenssen @worldwineguys Ross-idi 2015 Orange Nikolaevo Vineyard Chardo... Chardonnay Ross-idi
... ... ... ... ... ... ... ... ... ... ... ... ... ...
124176 NaN This Swiss red blend is composed of four varie... Les Romaines 90 30.0 NaN NaN NaN Jeff Jenssen @worldwineguys Les Frères Dutruy 2014 Les Romaines Red Red Blend Les Frères Dutruy
129407 NaN Dry spicy aromas of dusty plum and tomato add ... Reserve 89 22.0 NaN NaN NaN Michael Schachner @wineschach El Capricho 2015 Reserve Cabernet Sauvignon Cabernet Sauvignon El Capricho
129408 NaN El Capricho is one of Uruguay's more consisten... Reserve 89 22.0 NaN NaN NaN Michael Schachner @wineschach El Capricho 2015 Reserve Tempranillo Tempranillo El Capricho
129590 NaN A blend of 60% Syrah, 30% Cabernet Sauvignon a... Shah 90 30.0 NaN NaN NaN Mike DeSimone @worldwineguys Büyülübağ 2012 Shah Red Red Blend Büyülübağ
129900 NaN This wine offers a delightful bouquet of black... NaN 91 32.0 NaN NaN NaN Mike DeSimone @worldwineguys Psagot 2014 Merlot Merlot Psagot

63 rows × 13 columns

替换缺失值是一种常见操作。 Pandas 为这个问题提供了一个非常方便的方法:fillna()。 fillna() 提供了几种不同的策略来减少此类数据。例如,我们可以简单地将每个 NaN 替换为“Unknown”:

1
reviews.region_2.fillna("Unknown")
0                   Unknown
1                   Unknown
2         Willamette Valley
3                   Unknown
4         Willamette Valley
                ...        
129966              Unknown
129967         Oregon Other
129968              Unknown
129969              Unknown
129970              Unknown
Name: region_2, Length: 129971, dtype: object

或者,我们可以使用数据库中给定记录之后某个时间出现的第一个非空值来填充每个缺失值。这称为回填策略。

或者,我们可能有一个非空值需要替换。例如,假设自本数据集发布以来,评论员 Kerin O’Keefe 已将其 Twitter 手柄从 @kerinokeefe 更改为 @kerino。使用 replace() 方法是在数据集中反映这一情况的方法之一:

1
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
0             @kerino
1          @vossroger
2         @paulgwine 
3                 NaN
4         @paulgwine 
             ...     
129966            NaN
129967    @paulgwine 
129968     @vossroger
129969     @vossroger
129970     @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

这里值得一提的是 Replace() 方法,因为它可以方便地替换丢失的数据,这些数据在数据集中被赋予某种哨兵值:诸如“未知”、“未公开”、“无效”等。

TEST

1
2
import pandas as pd
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)

1.数据集中pionts列的数据类型是什么?

1
reviews['points'].dtype
dtype('int64')

2.从points列中的条目创建一个系列,但将条目转换为字符串。提示:字符串在 Python 中是 str

1
reviews.points.astype('str')
0         87
1         87
2         87
3         87
4         87
          ..
129966    90
129967    90
129968    90
129969    90
129970    90
Name: points, Length: 129971, dtype: object

3.有时价格列为空。数据集中有多少评论缺少价格?

1
reviews['price'].isnull().sum()
8996

4.最常见的葡萄酒产区有哪些?创建一个Series,计算 region_1 字段中每个值出现的次数。这个字段经常缺少数据,因此用unknown替换缺少的值。按降序排序。输出结果应如下所示:

1
2
3
4
5
6
Unknown                    21247
Napa Valley 4480
...
Bardolino Superiore 1
Primitivo del Tarantino 1
Name: region_1, Length: 1230, dtype: int64
1
2
reviews['region_1'] = reviews['region_1'].fillna("Unknown")
reviews['region_1'].value_counts()
region_1
Unknown                    21247
Napa Valley                 4480
Columbia Valley (WA)        4124
Russian River Valley        3091
California                  2629
                           ...  
Lamezia                        1
Trentino Superiore             1
Grave del Friuli               1
Vin Santo di Carmignano        1
Paestum                        1
Name: count, Length: 1230, dtype: int64

Renaming and Combining

Renaming

我们在这里介绍的第一个函数是 rename(),它允许您更改索引名称和/或列名称。例如,要更改数据集中的分数列以进行评分,我们将执行以下操作:

1
2
import pandas as pd
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)
1
reviews.rename(columns={'points': 'score'})

country description designation score price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

rename()可以让你通过分别指定索引或列关键字参数来重命名索引或列值。它支持多种输入格式,但通常 Python 字典是最方便的。下面是一个使用它重命名索引中某些元素的示例。

1
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

你可能会经常重命名列,但很少重命名索引值。为此,set_index() 通常更方便。

行索引和列索引都可以有自己的名称属性。可以使用赠送的 rename_axis() 方法来更改这些名称。例如

1
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Combining

在对数据集执行操作时,我们有时需要将不同的 DataFrames 和/或 Series 以非同寻常的方式组合起来。Pandas 有三种核心方法可以实现这一目的。按照复杂程度递增的顺序,它们是 concat()join()merge()merge() 的大部分功能也可以通过 join() 更简单地实现。

最简单的合并方法是 concat()。如果给定一个元素列表,该函数会将这些元素沿着一个轴挤在一起。

当我们在不同的 DataFrame 或 Series 对象中拥有相同字段(列)的数据时,这种方法非常有用。例如:YouTube 视频数据集根据来源国(例如本例中的加拿大和英国)对数据进行了分割。如果我们想同时研究多个国家,可以使用 concat() 将它们合并在一起:

1
2
3
4
canadian_youtube = pd.read_csv("datasets/Trending_YouTube_Video_Statistics/CAvideos.csv")
british_youtube = pd.read_csv("datasets/Trending_YouTube_Video_Statistics/GBvideos.csv")

pd.concat([canadian_youtube,british_youtube])

video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé i...
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las...
2 5qpjK5DgCt4 17.14.11 Racist Superman | Rudy Mancuso, King Bach & Le... Rudy Mancuso 23 2017-11-12T19:05:24.000Z racist superman|"rudy"|"mancuso"|"king"|"bach"... 3191434 146035 5339 8181 https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg False False False WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3 d380meD0W0M 17.14.11 I Dare You: GOING BALD!? nigahiga 24 2017-11-12T18:01:41.000Z ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... 2095828 132239 1989 17518 https://i.ytimg.com/vi/d380meD0W0M/default.jpg False False False I know it's been a while since we did this sho...
4 2Vv-BfVoq4g 17.14.11 Ed Sheeran - Perfect (Official Music Video) Ed Sheeran 10 2017-11-09T11:04:14.000Z edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... 33523622 1634130 21082 85067 https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg False False False 🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38911 l884wKofd54 18.14.06 Enrique Iglesias - MOVE TO MIAMI (Official Vid... EnriqueIglesiasVEVO 10 2018-05-09T07:00:01.000Z Enrique Iglesias feat. Pitbull|"MOVE TO MIAMI"... 25066952 268088 12783 9933 https://i.ytimg.com/vi/l884wKofd54/default.jpg False False False NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ...
38912 IP8k2xkhOdI 18.14.06 Jacob Sartorius - Up With It (Official Music V... Jacob Sartorius 10 2018-05-11T17:09:16.000Z jacob sartorius|"jacob"|"up with it"|"jacob sa... 1492219 61998 13781 24330 https://i.ytimg.com/vi/IP8k2xkhOdI/default.jpg False False False THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new...
38913 Il-an3K9pjg 18.14.06 Anne-Marie - 2002 [Official Video] Anne-Marie 10 2018-05-08T11:05:08.000Z anne|"marie"|"anne-marie"|"2002"|"two thousand... 29641412 394830 8892 19988 https://i.ytimg.com/vi/Il-an3K9pjg/default.jpg False False False Get 2002 by Anne-Marie HERE ▶ http://ad.gt/200...
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First... Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first...
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album 'Light of Mine' out now: http://ky...

79797 rows × 16 columns

join()可让您组合具有共同索引的不同 DataFrame 对象。例如,要提取加拿大和英国同一天流行的视频,我们可以执行以下操作:

1
2
3
4
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN ... tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 18.04.01 PNn8sECd7io Markiplier 20 2018-01-03T19:33:53.000Z getting over it|"markiplier"|"funny moments"|"... 835930 47058 1023 8250 https://i.ytimg.com/vi/PNn8sECd7io/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,323 Solo Wins! 18.09.03 DvPW66IFhMI AlexRamiGaming 20 2018-03-09T07:15:52.000Z PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... 212838 5199 542 11 https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,330 Solo Wins! 18.10.03 EXEaMjFeiEk AlexRamiGaming 20 2018-03-10T06:26:17.000Z PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... 200764 5620 537 45 https://i.ytimg.com/vi/EXEaMjFeiEk/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 MOST ANTICIPATED VIDEO (Timber Frame House Raising) 17.20.12 bYvQmusLaxw Pure Living for Life 24 2017-12-20T02:49:11.000Z timber frame|"timber framing"|"timber frame ra... 79152 7761 159 1965 https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17.21.12 bYvQmusLaxw Pure Living for Life 24 2017-12-20T02:49:11.000Z timber frame|"timber framing"|"timber frame ra... 232762 15515 329 3601 https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
😲She Is So Nervous But BLOWS The ROOF After Taking on OPERA Song! | Britain´s Got Talent 2018 18.02.05 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 713400 4684 260 266 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18.29.04 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 231906 1924 78 146 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18.30.04 WttN1Z0XF4k How Talented 24 2018-04-28T19:40:58.000Z bgt|"bgt 2018"|"britain got talent"|"britain´s... 476253 3417 176 240 https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 18.07.05 Wt9Gkpmbt44 TheBigJackpot 24 2018-05-07T06:58:59.000Z Slot Machine|"win"|"Gambling"|"Big Win"|"raja"... 28973 2167 175 10 https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE 18.04.04 Az72jrKbANA Right Side Broadcasting Network 25 2018-04-03T23:12:37.000Z YouTube shooter|"YouTube active shooter"|"acti... 103513 1722 181 76 https://i.ytimg.com/vi/Az72jrKbANA/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

40900 rows × 28 columns

此处 lsuffixrsuffix 参数是必需的,因为数据在英国和加拿大数据集中具有相同的列名称。如果这不是真的(因为,比如说,我们事先重命名了它们),我们就不需要它们。

TEST

1
2
import pandas as pd
reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0)
1
reviews.head()

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

1.Region_1Region_2 是数据集中区域设置列的非常无信息的名称。创建reviews副本,并将这些列分别重命名为regionlocale

1
reviews.rename(columns={'region_1': 'region','region_2':'locale'})

country description designation points price province region locale taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

2.将数据集中的索引名称设置为 wines

1
reviews.rename_axis("wines", axis='columns')

wines country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129966 Germany Notes of honeysuckle and cantaloupe sweeten th... Brauneberger Juffer-Sonnenuhr Spätlese 90 28.0 Mosel NaN NaN Anna Lee C. Iijima NaN Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... Riesling Dr. H. Thanisch (Erben Müller-Burggraef)
129967 US Citation is given as much as a decade of bottl... NaN 90 75.0 Oregon Oregon Oregon Other Paul Gregutt @paulgwine Citation 2004 Pinot Noir (Oregon) Pinot Noir Citation
129968 France Well-drained gravel soil gives this wine its c... Kritt 90 30.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Gresser 2013 Kritt Gewurztraminer (Als... Gewürztraminer Domaine Gresser
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

3.Things on Reddit 数据集包含了来自 reddit.com 上部分排名靠前的论坛(”subreddits”)的产品链接。运行下面的单元格,加载在 /r/gaming 子论坛上提到的产品的数据框,以及在 r//movies 子论坛上提到的产品的数据框。

1
2
3
4
gaming_products = pd.read_csv("datasets/top-things/reddits/g/gaming.csv")
gaming_products['subreddit'] = "r/gaming"
movie_products = pd.read_csv("datasets/top-things/reddits/m/movies.csv")
movie_products['subreddit'] = "r/movies"

创建一个包含 Reddit 子版块中提到的产品(either subreddit)的 DataFrame

1
gaming_products.head()

name category amazon_link total_mentions subreddit_mentions subreddit
0 BOOMco Halo Covenant Needler Blaster Toys & Games https://www.amazon.com/BOOMco-Halo-Covenant-Ne... 4.0 4 r/gaming
1 Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... Electronics https://www.amazon.com/Raspberry-Model-A1-2GHz... 19.0 3 r/gaming
2 CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... Electronics https://www.amazon.com/CanaKit-Raspberry-Suppl... 7.0 3 r/gaming
3 Panasonic K-KJ17MCA4BA Advanced Individual Cel... Electronics https://www.amazon.com/Panasonic-Advanced-Indi... 29.0 2 r/gaming
4 Mayflash GameCube Controller Adapter for Wii U... Electronics https://www.amazon.com/GameCube-Controller-Ada... 24.0 2 r/gaming
1
pd.concat([gaming_products,movie_products])

name category amazon_link total_mentions subreddit_mentions subreddit
0 BOOMco Halo Covenant Needler Blaster Toys & Games https://www.amazon.com/BOOMco-Halo-Covenant-Ne... 4.0 4 r/gaming
1 Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... Electronics https://www.amazon.com/Raspberry-Model-A1-2GHz... 19.0 3 r/gaming
2 CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... Electronics https://www.amazon.com/CanaKit-Raspberry-Suppl... 7.0 3 r/gaming
3 Panasonic K-KJ17MCA4BA Advanced Individual Cel... Electronics https://www.amazon.com/Panasonic-Advanced-Indi... 29.0 2 r/gaming
4 Mayflash GameCube Controller Adapter for Wii U... Electronics https://www.amazon.com/GameCube-Controller-Ada... 24.0 2 r/gaming
... ... ... ... ... ... ...
298 Welcome to Night Vale CD: A Novel Books https://www.amazon.com/Welcome-Night-Vale-CD-N... 1.0 1 r/movies
299 Ran (StudioCanal Collection) [Blu-ray] Movies & TV https://www.amazon.com/StudioCanal-Collection-... 1.0 1 r/movies
300 The Art of John Alvin Books https://www.amazon.com/Art-John-Alvin-Andrea/d... 1.0 1 r/movies
301 Apocalypto [Blu-ray] Movies & TV https://www.amazon.com/Apocalypto-Blu-ray-Rudy... 1.0 1 r/movies
302 Cinelinx: A Card Game for People Who Love Movi... Toys & Games https://www.amazon.com/Cinelinx-Card-Game-Peop... 1.0 1 r/movies

796 rows × 6 columns

4.Kaggle 上的举重数据库数据集包括一张用于举重比赛的 CSV 表和一张单独的举重参赛者表。运行下面的单元格将这些数据集加载到数据框中:

1
2
powerlifting_meets = pd.read_csv("datasets/openpowerlifting/v1/meets.csv")
powerlifting_competitors = pd.read_csv("datasets/openpowerlifting/v1/openpowerlifting.csv")
1
powerlifting_competitors

MeetID Name Sex Equipment Age Division BodyweightKg WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
0 0 Angie Belk Terry F Wraps 47.0 Mst 45-49 59.60 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
1 0 Dawn Bogart F Single-ply 42.0 Mst 40-44 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
2 0 Dawn Bogart F Single-ply 42.0 Open Senior 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
3 0 Dawn Bogart F Raw 42.0 Open Senior 58.51 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
4 0 Destiny Dula F Raw 18.0 Teen 18-19 63.68 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
386409 8481 William Barabas M Multi-ply NaN Elite 113.58 125 NaN NaN NaN NaN NaN 347.50 347.50 2 202.60
386410 8481 Justin Zottl M Multi-ply NaN Elite 119.02 125 NaN NaN NaN NaN NaN 322.50 322.50 3 185.77
386411 8481 Jake Anderson M Multi-ply NaN Elite 120.29 125 NaN NaN NaN NaN NaN 367.50 367.50 1 211.17
386412 8481 Jeff Bumanglag M Multi-ply NaN Elite 126.73 140 NaN NaN NaN NaN NaN 320.00 320.00 3 181.85
386413 8481 Shane Hammock M Multi-ply NaN Elite 129.46 140 NaN NaN NaN NaN NaN 362.50 362.50 2 205.18

386414 rows × 17 columns

两个表都包含对 MeetID 的引用,MeetID 是数据库中包含的每次会议(竞赛)的唯一键。使用它,生成一个将两个表合并为一个的数据集。

1
2
3
4
left = powerlifting_meets.set_index(['MeetID'])
right = powerlifting_competitors.set_index(['MeetID'])
left.join(right)
# powerlifting_meets.join(powerlifting_competitors)

MeetPath Federation Date MeetCountry MeetState MeetTown MeetName Name Sex Equipment ... WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
MeetID
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Angie Belk Terry F Wraps ... 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply ... 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply ... 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Raw ... 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Destiny Dula F Raw ... 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals William Barabas M Multi-ply ... 125 NaN NaN NaN NaN NaN 347.50 347.50 2 202.60
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Justin Zottl M Multi-ply ... 125 NaN NaN NaN NaN NaN 322.50 322.50 3 185.77
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Jake Anderson M Multi-ply ... 125 NaN NaN NaN NaN NaN 367.50 367.50 1 211.17
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Jeff Bumanglag M Multi-ply ... 140 NaN NaN NaN NaN NaN 320.00 320.00 3 181.85
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Shane Hammock M Multi-ply ... 140 NaN NaN NaN NaN NaN 362.50 362.50 2 205.18

386414 rows × 23 columns

完成

1


test for vue

aaaaaaa

sssssss