学习pandas(从kaggle上)
先来个小测试
1 | import pandas as pd |
Setup complete.
1.
In the cell below, create a DataFrame fruits
that looks like this:
1 | # Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruits. |
Apples Bananas
0 30 21
2.
Create a dataframe fruit_sales
that matches the diagram below:
1 | # Your code goes here. Create a dataframe matching the above diagram and assign it to the variable fruit_sales. |
Apples Bananas
2017 Sales 35 21
2018 Sales 41 34
3.
Create a variable ingredients
with a Series that looks like:
1 | Flour 4 cups |
1 | ingredients = pd.Series(['4 cups','1 cup', '2 large', '1 can'],index=['Flour','Milk','Eggs','Spam'],name='Dinner',dtype=object) |
Flour 4 cups
Milk 1 cup
Eggs 2 large
Spam 1 can
Name: Dinner, dtype: object
4.
Read the following csv dataset of wine reviews into a DataFrame called reviews
:
The filepath to the csv file is ../input/wine-reviews/winemag-data_first150k.csv
. The first few lines look like:
1 | ,country,description,designation,points,price,province,region_1,region_2,variety,winery |
1 | reviews = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv',index_col=0) |
reviews = pd.read_csv(‘../input/wine-reviews/winemag-data_first150k.csv’,index_col=0)
1 | animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2']) |
Cows | Goats | |
---|---|---|
Year 1 | 12 | 22 |
Year 2 | 20 | 19 |
In the cell below, write code to save this DataFrame to disk as a csv file with the name cows_and_goats.csv
.
1 | animals.to_csv('cows_and_goats.csv') |
1 | reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0) |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
我们将数据当对象来访问
1 | reviews.country |
0 Italy
1 Portugal
2 US
3 US
4 US
...
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
如果我们有一个 Python 字典,我们可以使用索引 ([]) 运算符访问它的值。我们可以对 DataFrame 中的列执行相同的操作:
1 | reviews['country'] |
0 Italy
1 Portugal
2 US
3 US
4 US
...
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
这是从 DataFrame 中选择特定 Series 的两种方法。这两种方法在语法上都不太有效,但索引操作符[]的优点是可以处理列名中的保留字符(例如,如果我们有一个country providence列,reviews.country providence就不起作用)。
pandas 系列看起来不像一本精美的dictionary吗?几乎是这样,所以毫不奇怪,要深入到单个特定值,我们只需要再次使用索引运算符 [] :
1 | reviews['country'][0] |
'Italy'
Indexing in pandas
索引运算符和属性选择很好,因为它们就像在 Python 生态系统的其它部分一样工作。作为新手,这使得它们很容易上手和使用。然而,pandas有自己的访问操作符,loc和iloc。对于更高级的操作,你应该使用它们。
Index-based selection
Pandas 索引以两种范式之一工作。第一种是基于索引的选择:根据数据在数据中的数字位置来选择数据。iloc遵循这一范例。
要选择 DataFrame 中的第一行数据,我们可以使用以下命令:
1 | reviews.iloc[0] |
country Italy
description Aromas include tropical fruit, broom, brimston...
designation Vulkà Bianco
points 87
price NaN
province Sicily & Sardinia
region_1 Etna
region_2 NaN
taster_name Kerin O’Keefe
taster_twitter_handle @kerinokeefe
title Nicosia 2013 Vulkà Bianco (Etna)
variety White Blend
winery Nicosia
Name: 0, dtype: object
loc 和 iloc 都是行第一、列第二。这与我们在原生 Python 中所做的相反,即列在前,行在后。
这意味着检索行稍微容易一些,而检索列稍微困难一些。要使用 iloc 获取列,我们可以执行以下操作:
1 | reviews.iloc[:, 0] |
0 Italy
1 Portugal
2 US
3 US
4 US
...
129966 Germany
129967 US
129968 France
129969 France
129970 France
Name: country, Length: 129971, dtype: object
就其本身而言,:操作符(也来自本地Python)表示 “一切”。然而,当与其他选择器组合时,它可以用来表示值的范围。例如,要从第一、第二和第三行中选择country列,我们可以这样做
1 | reviews.iloc[:3, 0] |
0 Italy
1 Portugal
2 US
Name: country, dtype: object
1 | reviews.iloc[999:-99999, 0] |
999 US
1000 US
1001 Italy
1002 US
1003 France
...
29967 France
29968 US
29969 Argentina
29970 US
29971 US
Name: country, Length: 28973, dtype: object
It’s also possible to pass a list:
1 | reviews.iloc[[0, 1, 2], 0] |
0 Italy
1 Portugal
2 US
Name: country, dtype: object
最后,值得注意的是,负数可以用于选择。这将从数值的末尾开始向前计数。例如,这里是数据集的最后五个元素。
1 | reviews.iloc[-5:] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
Label-based selection
属性选择的第二种模式是loc操作符所遵循的模式:基于标签的选择。在这种模式下,重要的是数据索引值,而不是它的位置。
例如,要获得reviews中的第一个条目,我们现在要做如下操作:
1 | reviews.loc[0, 'country'] |
'Italy'
iloc在概念上比loc简单,因为它忽略了数据集的索引。当我们使用 iloc 时,我们把数据集当作一个大矩阵(一个列表的列表),一个我们必须按位置进行索引的矩阵。由于您的数据集通常具有有意义的索引,因此使用loc通常会更容易。例如,这里有一个使用loc更容易的操作:
1 | reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']] |
taster_name | taster_twitter_handle | points | |
---|---|---|---|
0 | Kerin O’Keefe | @kerinokeefe | 87 |
1 | Roger Voss | @vossroger | 87 |
2 | Paul Gregutt | @paulgwine | 87 |
3 | Alexander Peartree | NaN | 87 |
4 | Paul Gregutt | @paulgwine | 87 |
... | ... | ... | ... |
129966 | Anna Lee C. Iijima | NaN | 90 |
129967 | Paul Gregutt | @paulgwine | 90 |
129968 | Roger Voss | @vossroger | 90 |
129969 | Roger Voss | @vossroger | 90 |
129970 | Roger Voss | @vossroger | 90 |
129971 rows × 3 columns
Choosing between loc and iloc
在 loc 和 iloc 之间选择或转换时,有一个 “小问题 “值得注意,那就是这两种方法使用的索引方案略有不同。
iloc 使用 Python stdlib 索引方案,其中包含范围的第一个元素,而排除最后一个元素。因此,0:10 将选择条目 0,……9。而loc包含最后一个元素,因此0:10将选择0,…,10。
为什么会有这样的变化?请记住,loc可以索引任何stdlib类型:例如字符串。如果我们有一个索引值为 Apples, …, Potatoes, … 的 DataFrame,并且我们想选择 “所有按字母顺序排列的苹果和土豆之间的水果选择”,那么索引 df.loc[‘Apples’:’Potatoes’] 要比索引 df.loc[‘Apples’, ‘Potatoet’] 这样的东西方便得多 (t 在字母表的 s 后面)。
当DataFrame索引是一个简单的数字列表(例如0,…,1000)时,这种情况尤其令人困惑。在这种情况下,df.iloc[0:1000]将返回1000个条目,而df.loc[0:1000]将返回1001个条目!要使用 loc 得到 1000 个元素,您需要再低一级,请求 df.loc[0:999]。
否则,使用loc的语义和使用iloc的语义是一样的。
Manipulating the index
基于标签的选择功能来自索引中的标签。重要的是,我们使用的索引并非一成不变。我们可以以任何我们认为合适的方式操作索引。
我们可以使用set_index()方法来完成这项工作。下面是当我们将set_index设置为title字段时发生的情况:
1 | reviews.set_index("title") |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
title | ||||||||||||
Nicosia 2013 Vulkà Bianco (Etna) | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | White Blend | Nicosia |
Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Portuguese Red | Quinta dos Avidagos |
Rainstorm 2013 Pinot Gris (Willamette Valley) | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Pinot Gris | Rainstorm |
St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore) | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | Riesling | St. Julian |
Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley) | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Dr. H. Thanisch (Erben Müller-Burggraef) 2013 Brauneberger Juffer-Sonnenuhr Spätlese Riesling (Mosel) | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
Citation 2004 Pinot Noir (Oregon) | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Pinot Noir | Citation |
Domaine Gresser 2013 Kritt Gewurztraminer (Alsace) | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Gewürztraminer | Domaine Gresser |
Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Pinot Gris | Domaine Marcel Deiss |
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace) | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Gewürztraminer | Domaine Schoffit |
129971 rows × 12 columns
如果您可以为数据集找到一个比当前索引更好的索引,那么这将非常有用。
Conditional selection
到目前为止,我们已经使用 DataFrame 本身的结构属性对各种数据进行索引。然而,为了利用数据做有趣的事情,我们经常需要根据条件提出问题。
例如,假设我们对意大利生产的优于平均水平的葡萄酒特别感兴趣。
我们可以首先检查每种葡萄酒是否是意大利的:
1 | reviews.country == 'Italy' |
0 True
1 False
2 False
3 False
4 False
...
129966 False
129967 False
129968 False
129969 False
129970 False
Name: country, Length: 129971, dtype: bool
此操作根据每条记录的国家/地区生成一系列 True/False 布尔值。然后可以在 loc 内部使用此结果来选择相关数据:
1 | reviews.loc[reviews.country == 'Italy'] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
13 | Italy | This is dominated by oak and oak-driven aromas... | Rosso | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Masseria Setteporte 2012 Rosso (Etna) | Nerello Mascalese | Masseria Setteporte |
22 | Italy | Delicate aromas recall white flower and citrus... | Ficiligno | 87 | 19.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Baglio di Pianetto 2007 Ficiligno White (Sicilia) | White Blend | Baglio di Pianetto |
24 | Italy | Aromas of prune, blackcurrant, toast and oak c... | Aynat | 87 | 35.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Canicattì 2009 Aynat Nero d'Avola (Sicilia) | Nero d'Avola | Canicattì |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129929 | Italy | This luminous sparkler has a sweet, fruit-forw... | NaN | 91 | 38.0 | Veneto | Prosecco Superiore di Cartizze | NaN | NaN | NaN | Col Vetoraz Spumanti NV Prosecco Superiore di... | Prosecco | Col Vetoraz Spumanti |
129943 | Italy | A blend of Nero d'Avola and Syrah, this convey... | Adènzia | 90 | 29.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Baglio del Cristo di Campobello 2012 Adènzia R... | Red Blend | Baglio del Cristo di Campobello |
129947 | Italy | A blend of 65% Cabernet Sauvignon, 30% Merlot ... | Symposio | 90 | 20.0 | Sicily & Sardinia | Terre Siciliane | NaN | Kerin O’Keefe | @kerinokeefe | Feudo Principi di Butera 2012 Symposio Red (Te... | Red Blend | Feudo Principi di Butera |
129961 | Italy | Intense aromas of wild cherry, baking spice, t... | NaN | 90 | 30.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | COS 2013 Frappato (Sicilia) | Frappato | COS |
129962 | Italy | Blackberry, cassis, grilled herb and toasted a... | Sàgana Tenuta San Giacomo | 90 | 40.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... | Nero d'Avola | Cusumano |
19540 rows × 13 columns
该DataFrame有~20,000行。原来的数据约为130,000行。这意味着大约15%的葡萄酒来自意大利。
我们还想知道哪些葡萄酒比平均水平好。葡萄酒的评价标准为80-100分,因此这可能意味着至少获得90分的葡萄酒。
我们可以用”&”将这两个问题联系起来:
1 | reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
120 | Italy | Slightly backward, particularly given the vint... | Bricco Rocche Prapó | 92 | 70.0 | Piedmont | Barolo | NaN | NaN | NaN | Ceretto 2003 Bricco Rocche Prapó (Barolo) | Nebbiolo | Ceretto |
130 | Italy | At the first it was quite muted and subdued, b... | Bricco Rocche Brunate | 91 | 70.0 | Piedmont | Barolo | NaN | NaN | NaN | Ceretto 2003 Bricco Rocche Brunate (Barolo) | Nebbiolo | Ceretto |
133 | Italy | Einaudi's wines have been improving lately, an... | NaN | 91 | 68.0 | Piedmont | Barolo | NaN | NaN | NaN | Poderi Luigi Einaudi 2003 Barolo | Nebbiolo | Poderi Luigi Einaudi |
135 | Italy | The color is just beginning to show signs of b... | Sorano | 91 | 60.0 | Piedmont | Barolo | NaN | NaN | NaN | Giacomo Ascheri 2001 Sorano (Barolo) | Nebbiolo | Giacomo Ascheri |
140 | Italy | A big, fat, luscious wine with plenty of toast... | Costa Bruna | 90 | 26.0 | Piedmont | Barbera d'Alba | NaN | NaN | NaN | Poderi Colla 2005 Costa Bruna (Barbera d'Alba) | Barbera | Poderi Colla |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129929 | Italy | This luminous sparkler has a sweet, fruit-forw... | NaN | 91 | 38.0 | Veneto | Prosecco Superiore di Cartizze | NaN | NaN | NaN | Col Vetoraz Spumanti NV Prosecco Superiore di... | Prosecco | Col Vetoraz Spumanti |
129943 | Italy | A blend of Nero d'Avola and Syrah, this convey... | Adènzia | 90 | 29.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Baglio del Cristo di Campobello 2012 Adènzia R... | Red Blend | Baglio del Cristo di Campobello |
129947 | Italy | A blend of 65% Cabernet Sauvignon, 30% Merlot ... | Symposio | 90 | 20.0 | Sicily & Sardinia | Terre Siciliane | NaN | Kerin O’Keefe | @kerinokeefe | Feudo Principi di Butera 2012 Symposio Red (Te... | Red Blend | Feudo Principi di Butera |
129961 | Italy | Intense aromas of wild cherry, baking spice, t... | NaN | 90 | 30.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | COS 2013 Frappato (Sicilia) | Frappato | COS |
129962 | Italy | Blackberry, cassis, grilled herb and toasted a... | Sàgana Tenuta San Giacomo | 90 | 40.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... | Nero d'Avola | Cusumano |
6648 rows × 13 columns
Pandas内置了一些条件选择器,我们在此重点介绍其中的两个。
第一个是 “isin”。”isin “让您选择其值 “位于 “一个值列表中的数据。例如,我们可以用它来选择仅来自意大利或法国的葡萄酒:
1 | reviews.loc[reviews.country.isin(['Italy', 'France'])] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
7 | France | This dry and restrained wine offers spice in p... | NaN | 87 | 24.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Trimbach 2012 Gewurztraminer (Alsace) | Gewürztraminer | Trimbach |
9 | France | This has great depth of flavor with its fresh ... | Les Natures | 87 | 27.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Jean-Baptiste Adam 2012 Les Natures Pinot Gris... | Pinot Gris | Jean-Baptiste Adam |
11 | France | This is a dry wine, very spicy, with a tight, ... | NaN | 87 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Leon Beyer 2012 Gewurztraminer (Alsace) | Gewürztraminer | Leon Beyer |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129964 | France | Initially quite muted, this wine slowly develo... | Domaine Saint-Rémy Herrenweg | 90 | NaN | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Ehrhart 2013 Domaine Saint-Rémy Herren... | Gewürztraminer | Domaine Ehrhart |
129965 | France | While it's rich, this beautiful dry wine also ... | Seppi Landmann Vallée Noble | 90 | 28.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Rieflé-Landmann 2013 Seppi Landmann Va... | Pinot Gris | Domaine Rieflé-Landmann |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
41633 rows × 13 columns
第二个是 isnull (及其同伴 notnull)。这些方法可让您突出显示空(或非空)(NaN) 的值。例如,要过滤掉数据集中缺少价格标签的葡萄酒,我们将执行以下操作:
1 | reviews.loc[reviews.price.notnull()] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
5 | Spain | Blackberry and raspberry aromas show a typical... | Ars In Vitro | 87 | 15.0 | Northern Spain | Navarra | NaN | Michael Schachner | @wineschach | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... | Tempranillo-Merlot | Tandem |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
120975 rows × 13 columns
分配数据
另一方面,将数据分配给 DataFrame 很容易。可以分配一个常量值:
1 | reviews['critic'] = 'everyone' |
0 everyone
1 everyone
2 everyone
3 everyone
4 everyone
...
129966 everyone
129967 everyone
129968 everyone
129969 everyone
129970 everyone
Name: critic, Length: 129971, dtype: object
或者使用可迭代的值:
1 | reviews['index_backwards'] = range(len(reviews), 0, -1) |
0 129971
1 129970
2 129969
3 129968
4 129967
...
129966 5
129967 4
129968 3
129969 2
129970 1
Name: index_backwards, Length: 129971, dtype: int64
TEST
1 | reviews = pd.read_csv("datasets/wine/winemag-data-130k-v2.csv", index_col=0) |
1 | reviews.head() |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
1.从description中选择描述列,并将结果分配给变量 desc。
1 | desc = reviews['description'] |
0 Aromas include tropical fruit, broom, brimston...
1 This is ripe and fruity, a wine that is smooth...
2 Tart and snappy, the flavors of lime flesh and...
3 Pineapple rind, lemon pith and orange blossom ...
4 Much like the regular bottling from 2012, this...
...
129966 Notes of honeysuckle and cantaloupe sweeten th...
129967 Citation is given as much as a decade of bottl...
129968 Well-drained gravel soil gives this wine its c...
129969 A dry style of Pinot Gris, this is crisp with ...
129970 Big, rich and off-dry, this is powered by inte...
Name: description, Length: 129971, dtype: object
2.从评论的描述列中选择第一个值,将其分配给变量first_description。
1 | first_description = desc[0] |
"Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."
3.从reviews中选择第一行数据(第一条记录),并将其分配给变量first_row。
1 | first_row = reviews.loc[0] |
country Italy
description Aromas include tropical fruit, broom, brimston...
designation Vulkà Bianco
points 87
price NaN
province Sicily & Sardinia
region_1 Etna
region_2 NaN
taster_name Kerin O’Keefe
taster_twitter_handle @kerinokeefe
title Nicosia 2013 Vulkà Bianco (Etna)
variety White Blend
winery Nicosia
Name: 0, dtype: object
4.从评论中的描述列中选择前 10 个值,将结果分配给变量first_descriptions。
提示:将输出格式化为 pandas 系列。
1 | first_descriptions = reviews['description'].loc[0:9] |
0 Aromas include tropical fruit, broom, brimston...
1 This is ripe and fruity, a wine that is smooth...
2 Tart and snappy, the flavors of lime flesh and...
3 Pineapple rind, lemon pith and orange blossom ...
4 Much like the regular bottling from 2012, this...
5 Blackberry and raspberry aromas show a typical...
6 Here's a bright, informal red that opens with ...
7 This dry and restrained wine offers spice in p...
8 Savory dried thyme notes accent sunnier flavor...
9 This has great depth of flavor with its fresh ...
Name: description, dtype: object
5.Select the records with index labels 1
, 2
, 3
, 5
, and 8
, assigning the result to the variable sample_reviews
.
In other words, generate the following DataFrame:
1 | sample_reviews = reviews.loc[[1,2,3,5,8]] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
5 | Spain | Blackberry and raspberry aromas show a typical... | Ars In Vitro | 87 | 15.0 | Northern Spain | Navarra | NaN | Michael Schachner | @wineschach | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... | Tempranillo-Merlot | Tandem |
8 | Germany | Savory dried thyme notes accent sunnier flavor... | Shine | 87 | 12.0 | Rheinhessen | NaN | NaN | Anna Lee C. Iijima | NaN | Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe... | Gewürztraminer | Heinz Eifel |
创建一个变量 df,其中包含索引标签为 0、1、10 和 100 的记录的 Country、Province、Region_1 和 Region_2 列。换句话说,生成以下 DataFrame:
1 | df = reviews[['country','province','region_1','region_2']].loc[[0,1,10,100]] |
country | province | region_1 | region_2 | |
---|---|---|---|---|
0 | Italy | Sicily & Sardinia | Etna | NaN |
1 | Portugal | Douro | NaN | NaN |
10 | US | California | Napa Valley | Napa |
100 | US | New York | Finger Lakes | Finger Lakes |
7.Create a variable df
containing the country
and variety
columns of the first 100 records.
Hint: you may use loc
or iloc
. When working on the answer this question and the several of the ones that follow, keep the following “gotcha” described in the tutorial:
iloc
uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded.loc
, meanwhile, indexes inclusively.
This is particularly confusing when the DataFrame index is a simple numerical list, e.g.
0,...,1000
. In this casedf.iloc[0:1000]
will return 1000 entries, whiledf.loc[0:1000]
return 1001 of them! To get 1000 elements usingloc
, you will need to go one lower and ask fordf.iloc[0:999]
.
1 | df = reviews[['country','variety']].loc[0:99] |
country | variety | |
---|---|---|
0 | Italy | White Blend |
1 | Portugal | Portuguese Red |
2 | US | Pinot Gris |
3 | US | Riesling |
4 | US | Pinot Noir |
... | ... | ... |
95 | France | Gamay |
96 | France | Gamay |
97 | US | Riesling |
98 | Italy | Sangiovese |
99 | US | Bordeaux-style Red Blend |
100 rows × 2 columns
8.创建一个包含意大利葡萄酒评论的 DataFrame italian_wines。提示:reviews.country 等于什么?
1 | italian_wines = reviews[reviews['country']=='Italy'] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
6 | Italy | Here's a bright, informal red that opens with ... | Belsito | 87 | 16.0 | Sicily & Sardinia | Vittoria | NaN | Kerin O’Keefe | @kerinokeefe | Terre di Giurfo 2013 Belsito Frappato (Vittoria) | Frappato | Terre di Giurfo |
13 | Italy | This is dominated by oak and oak-driven aromas... | Rosso | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Masseria Setteporte 2012 Rosso (Etna) | Nerello Mascalese | Masseria Setteporte |
22 | Italy | Delicate aromas recall white flower and citrus... | Ficiligno | 87 | 19.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Baglio di Pianetto 2007 Ficiligno White (Sicilia) | White Blend | Baglio di Pianetto |
24 | Italy | Aromas of prune, blackcurrant, toast and oak c... | Aynat | 87 | 35.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Canicattì 2009 Aynat Nero d'Avola (Sicilia) | Nero d'Avola | Canicattì |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129929 | Italy | This luminous sparkler has a sweet, fruit-forw... | NaN | 91 | 38.0 | Veneto | Prosecco Superiore di Cartizze | NaN | NaN | NaN | Col Vetoraz Spumanti NV Prosecco Superiore di... | Prosecco | Col Vetoraz Spumanti |
129943 | Italy | A blend of Nero d'Avola and Syrah, this convey... | Adènzia | 90 | 29.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Baglio del Cristo di Campobello 2012 Adènzia R... | Red Blend | Baglio del Cristo di Campobello |
129947 | Italy | A blend of 65% Cabernet Sauvignon, 30% Merlot ... | Symposio | 90 | 20.0 | Sicily & Sardinia | Terre Siciliane | NaN | Kerin O’Keefe | @kerinokeefe | Feudo Principi di Butera 2012 Symposio Red (Te... | Red Blend | Feudo Principi di Butera |
129961 | Italy | Intense aromas of wild cherry, baking spice, t... | NaN | 90 | 30.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | COS 2013 Frappato (Sicilia) | Frappato | COS |
129962 | Italy | Blackberry, cassis, grilled herb and toasted a... | Sàgana Tenuta San Giacomo | 90 | 40.0 | Sicily & Sardinia | Sicilia | NaN | Kerin O’Keefe | @kerinokeefe | Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... | Nero d'Avola | Cusumano |
19540 rows × 13 columns
9.创建一个 DataFrame top_oceania_wines,其中包含来自澳大利亚或新西兰的葡萄酒的至少 95 分(满分 100 分)的所有评论。
1 | top_oceania_wines = reviews[reviews.country.isin(['Australia', 'New Zealand']) & (reviews.points>=95)] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
345 | Australia | This wine contains some material over 100 year... | Rare | 100 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
346 | Australia | This deep brown wine smells like a damp, mossy... | Rare | 98 | 350.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscadelle... | Muscadelle | Chambers Rosewood Vineyards |
348 | Australia | Deep mahogany. Dried fig and black tea on the ... | Grand | 97 | 100.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Grand Muscat (R... | Muscat | Chambers Rosewood Vineyards |
349 | Australia | RunRig is always complex, and the 2012 doesn't... | RunRig | 97 | 225.0 | South Australia | Barossa | NaN | Joe Czerwinski | @JoeCz | Torbreck 2012 RunRig Shiraz-Viognier (Barossa) | Shiraz-Viognier | Torbreck |
356 | Australia | Dusty, firm, powerful: just a few apt descript... | Georgia's Paddock | 95 | 85.0 | Victoria | Heathcote | NaN | Joe Czerwinski | @JoeCz | Jasper Hill 2013 Georgia's Paddock Shiraz (Hea... | Shiraz | Jasper Hill |
360 | Australia | Bacon and tapenade elements merge easily on th... | Descendant | 95 | 125.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2012 Descendant Shiraz-Viognier (Baro... | Shiraz-Viognier | Torbreck |
365 | Australia | The Taylor family selected Clare Valley for it... | St. Andrews Single Vineyard Release | 95 | 60.0 | South Australia | Clare Valley | NaN | Joe Czerwinski | @JoeCz | Wakefield 2013 St. Andrews Single Vineyard Rel... | Shiraz | Wakefield |
14354 | Australia | This wine's concentrated dark fruit shows in t... | Old Vine | 95 | 60.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Kaesler 2006 Old Vine Shiraz (Barossa Valley) | Shiraz | Kaesler |
16538 | Australia | Rich, dense and intense, this is a big, muscul... | The Family Tree | 95 | 65.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Lambert 2013 The Family Tree Shiraz (Barossa V... | Shiraz | Lambert |
28573 | Australia | Astralis has become one of Australia's top col... | Astralis | 95 | 350.0 | South Australia | Clarendon | NaN | Joe Czerwinski | @JoeCz | Clarendon Hills 2005 Astralis Syrah (Clarendon) | Syrah | Clarendon Hills |
34502 | Australia | This prodigious wine showcases Barossa's abili... | The Relic | 98 | 135.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Standish 2006 The Relic Shiraz (Barossa Valley) | Shiraz | Standish |
34506 | Australia | If Standish's Relic is the feminine side of Sh... | The Standish Single Vineyard | 96 | 135.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Standish 2005 The Standish Single Vineyard Shi... | Shiraz | Standish |
38988 | Australia | Penfolds Bin 707 has leapt in quality over the... | Bin 707 | 95 | 200.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Penfolds 2009 Bin 707 Cabernet Sauvignon (Sout... | Cabernet Sauvignon | Penfolds |
39059 | Australia | The Taylor family selected Clare Valley for it... | St. Andrews Single Vineyard Release | 95 | 60.0 | South Australia | Clare Valley | NaN | Joe Czerwinski | @JoeCz | Wakefield 2013 St. Andrews Single Vineyard Rel... | Shiraz | Wakefield |
39961 | Australia | As unevolved as they are, the dense and multil... | Grange | 96 | 185.0 | South Australia | South Australia | NaN | NaN | NaN | Penfolds 1996 Grange Shiraz (South Australia) | Shiraz | Penfolds |
39962 | Australia | Seamless luxury from stem to stern, this ‘baby... | RWT | 95 | 70.0 | South Australia | Barossa Valley | NaN | NaN | NaN | Penfolds 1998 RWT Shiraz (Barossa Valley) | Shiraz | Penfolds |
45809 | Australia | The 2007 Astralis impresses for its combinatio... | Astralis | 95 | 225.0 | South Australia | Clarendon | NaN | Joe Czerwinski | @JoeCz | Clarendon Hills 2007 Astralis Syrah (Clarendon) | Syrah | Clarendon Hills |
56953 | Australia | This inky, embryonic wine deserves to be cella... | Grange | 99 | 850.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Penfolds 2010 Grange Shiraz (South Australia) | Shiraz | Penfolds |
56956 | Australia | You may have to scour the country to secure so... | Andelmonde | 97 | 95.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Standish 2012 Andelmonde Shiraz (Barossa Valley) | Shiraz | Standish |
56957 | Australia | Thorn Clarke has taken its Shiraz to a new lev... | Ron Thorn Single Vineyard | 96 | 89.0 | South Australia | Barossa | NaN | Joe Czerwinski | @JoeCz | Thorn Clarke 2012 Ron Thorn Single Vineyard Sh... | Shiraz | Thorn Clarke |
56959 | Australia | Is this the Yin to Grange's Yang? The wines ar... | Hill of Grace | 96 | 820.0 | South Australia | Eden Valley | NaN | Joe Czerwinski | @JoeCz | Henschke 2010 Hill of Grace Shiraz (Eden Valley) | Shiraz | Henschke |
59977 | Australia | This is a top example of the classic Australia... | The Peake | 96 | 150.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | Hickinbotham 2013 The Peake Cabernet-Shiraz (M... | Cabernet-Shiraz | Hickinbotham |
59984 | Australia | This is a throwback to those brash, flavor-exu... | One | 95 | 95.0 | South Australia | Langhorne Creek | NaN | Joe Czerwinski | @JoeCz | Heartland 2013 One Red (Langhorne Creek) | Red Blend | Heartland |
67096 | Australia | Just a tiny serving of this dark nectar will l... | Calliope Rare | 98 | 86.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | R.L. Buller & Son NV Calliope Rare Tokay (Ruth... | Tokay | R.L. Buller & Son |
67101 | Australia | This Muscat is the color of dark coffee, with ... | Rare | 95 | 300.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscat (Ru... | Muscat | Chambers Rosewood Vineyards |
76392 | Australia | When the alcohol levels are reined in to appro... | Georgia's Paddock | 95 | 85.0 | Victoria | Heathcote | NaN | Joe Czerwinski | @JoeCz | Jasper Hill 2012 Georgia's Paddock Shiraz (Hea... | Shiraz | Jasper Hill |
77028 | Australia | This has all the size and weight you've come t... | Grange | 98 | 850.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Penfolds 2008 Grange Shiraz (South Australia) | Shiraz | Penfolds |
77036 | Australia | RWT (unromantically derived from “Red Wine Tri... | RWT | 96 | 150.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Penfolds 2009 RWT Shiraz (Barossa Valley) | Shiraz | Penfolds |
77037 | Australia | Winemaker Dave Powell is no longer with Torbre... | RunRig | 96 | 225.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2007 RunRig Shiraz-Viognier (Barossa ... | Shiraz-Viognier | Torbreck |
77042 | Australia | This is likely the most ageworthy Shiraz winem... | Eligo | 95 | 100.0 | South Australia | Barossa | NaN | Joe Czerwinski | @JoeCz | John Duval Wines 2010 Eligo Shiraz (Barossa) | Shiraz | John Duval Wines |
77044 | Australia | The fruit for this offering comes from the Gre... | R Reserve | 95 | 105.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Kilikanoon 2009 R Reserve Shiraz (Barossa Valley) | Shiraz | Kilikanoon |
77046 | Australia | With aromas and flavors that range widely from... | The Factor | 95 | 125.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2007 The Factor Shiraz (Barossa Valley) | Shiraz | Torbreck |
83357 | Australia | A throwback to the monster Shiraz style of old... | Grange | 96 | 500.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Penfolds 2004 Grange Shiraz (South Australia) | Shiraz | Penfolds |
84815 | Australia | The Factor is always one of Torbreck's biggest... | The Factor | 95 | 125.0 | South Australia | Barossa | NaN | Joe Czerwinski | @JoeCz | Torbreck 2012 The Factor Shiraz (Barossa) | Shiraz | Torbreck |
84816 | Australia | Nashwauk is Kaesler's McLaren Vale project, fi... | Beacon | 95 | 145.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | Nashwauk 2010 Beacon Shiraz (McLaren Vale) | Shiraz | Nashwauk |
87128 | Australia | This full-bodied, muscular Shiraz is built for... | Amery Vineyard Block 6 | 96 | 120.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | Kay Brothers 2012 Amery Vineyard Block 6 Shira... | Shiraz | Kay Brothers |
87137 | Australia | Perhaps the best young wine I've tasted from M... | NaN | 95 | 84.0 | Western Australia | Margaret River | NaN | Joe Czerwinski | @JoeCz | Moss Wood 2011 Cabernet Sauvignon (Margaret Ri... | Cabernet Sauvignon | Moss Wood |
87143 | Australia | This is wonderfully complex and aromatic, with... | St. Andrews Single Vineyard Release | 95 | 60.0 | South Australia | Clare Valley | NaN | Joe Czerwinski | @JoeCz | Wakefield 2012 St. Andrews Single Vineyard Rel... | Shiraz | Wakefield |
91851 | New Zealand | This full-bodied, richly tannic wine delivers.... | Homage | 95 | 100.0 | Hawke's Bay | NaN | NaN | Joe Czerwinski | @JoeCz | Trinity Hill 2013 Homage Syrah (Hawke's Bay) | Syrah | Trinity Hill |
98386 | Australia | One of the more approachable of the d'Arenberg... | Little Venice Single Vineyard | 95 | 85.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | D'Arenberg 2010 Little Venice Single Vineyard ... | Shiraz | D'Arenberg |
99318 | Australia | From vines planted in 1912, this has been an i... | Mount Edelstone Vineyard | 95 | 200.0 | South Australia | Eden Valley | NaN | Joe Czerwinski | @JoeCz | Henschke 2014 Mount Edelstone Vineyard Shiraz ... | Shiraz | Henschke |
99330 | Australia | This Cabernet equivalent to Grange has explode... | Bin 707 | 95 | 500.0 | South Australia | South Australia | NaN | Joe Czerwinski | @JoeCz | Penfolds 2014 Bin 707 Cabernet Sauvignon (Sout... | Cabernet Sauvignon | Penfolds |
99340 | Australia | This rich, opulent wine carries its massive oa... | Les Amis | 95 | 185.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2013 Les Amis Grenache (Barossa Valley) | Grenache | Torbreck |
109427 | Australia | This wine is dark brown in hue with a greenish... | Rare | 99 | 300.0 | Victoria | Rutherglen | NaN | Joe Czerwinski | @JoeCz | Chambers Rosewood Vineyards NV Rare Muscadelle... | Muscadelle | Chambers Rosewood Vineyards |
109434 | Australia | D'Arenberg's lineup of single-vineyard Shiraze... | The Swinging Malaysian Single Vineyard | 96 | 85.0 | South Australia | McLaren Vale | NaN | Joe Czerwinski | @JoeCz | D'Arenberg 2010 The Swinging Malaysian Single ... | Shiraz | D'Arenberg |
122421 | Australia | Despite this wine's weight and richness, it re... | Amon-Ra Unfiltered | 96 | 110.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Glaetzer 2010 Amon-Ra Unfiltered Shiraz (Baros... | Shiraz | Glaetzer |
122430 | Australia | These blends are traditional in Australia—they... | Anaperenna | 95 | 80.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Glaetzer 2010 Anaperenna Shiraz-Cabernet Sauvi... | Shiraz-Cabernet Sauvignon | Glaetzer |
122507 | New Zealand | This blend of Cabernet Sauvignon (62.5%), Merl... | SQM Gimblett Gravels Cabernets/Merlot | 95 | 79.0 | Hawke's Bay | NaN | NaN | Joe Czerwinski | @JoeCz | Squawking Magpie 2014 SQM Gimblett Gravels Cab... | Bordeaux-style Red Blend | Squawking Magpie |
122939 | Australia | Full-bodied and plush yet vibrant and imbued w... | The Factor | 98 | 125.0 | South Australia | Barossa Valley | NaN | Joe Czerwinski | @JoeCz | Torbreck 2013 The Factor Shiraz (Barossa Valley) | Shiraz | Torbreck |
Summary Functions and Maps
介绍
在上一教程中,我们学习了如何从 DataFrame 或 Series 中选择相关数据。正如我们在练习中演示的那样,从数据表示中提取正确的数据对于完成工作至关重要。
然而,数据并不总是以我们想要的格式从内存中直接取出。有时,我们必须自己再做一些工作来重新格式化数据,以完成手头的任务。本教程将介绍我们可以对数据进行的不同操作,以获得 “恰到好处 “的输入。
1 | import pandas as pd |
1 | reviews |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
Summary functions
Pandas 提供了许多简单的“摘要函数”(不是官方名称),它们以某种有用的方式重组数据。例如,考虑describe()方法:
1 | reviews.points.describe() |
count 129971.000000
mean 88.447138
std 3.039730
min 80.000000
25% 86.000000
50% 88.000000
75% 91.000000
max 100.000000
Name: points, dtype: float64
该方法生成给定列属性的高级摘要。它是类型感知的,这意味着它的输出会根据输入的数据类型而改变。上面的输出仅对数值数据有意义;对于字符串数据,我们得到的结果如下:
1 | reviews.taster_name.describe() |
count 103727
unique 19
top Roger Voss
freq 25514
Name: taster_name, dtype: object
如果你想得到DataFrame或Series中某一列的一些特定的简单汇总统计,通常有一个有用的pandas函数可以实现。
例如,我们可以使用mean()函数来查看所分配分数的平均值(例如,平均评分的葡萄酒的表现如何):
1 | reviews.points.mean() |
88.44713820775404
要查看唯一值的列表,我们可以使用 unique() 函数:
1 | reviews.taster_name.unique() |
array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
'Christina Pickard'], dtype=object)
要查看唯一值的列表以及它们在数据集中出现的频率,我们可以使用 value_counts() 方法:
1 | reviews.taster_name.value_counts() |
taster_name
Roger Voss 25514
Michael Schachner 15134
Kerin O’Keefe 10776
Virginie Boone 9537
Paul Gregutt 9532
Matt Kettmann 6332
Joe Czerwinski 5147
Sean P. Sullivan 4966
Anna Lee C. Iijima 4415
Jim Gordon 4177
Anne Krebiehl MW 3685
Lauren Buzzeo 1835
Susan Kostrzewa 1085
Mike DeSimone 514
Jeff Jenssen 491
Alexander Peartree 415
Carrie Dykes 139
Fiona Adams 27
Christina Pickard 6
Name: count, dtype: int64
Maps
映射(map)是从数学中借用的术语,指的是将一组值 “映射 “到另一组值的函数。在数据科学中,我们经常需要从现有数据中创建新的表示方法,或者将数据从现在的格式转换为我们希望的格式。映射就是处理这些工作的工具,因此它对于完成工作极为重要!
有两种映射方法您会经常用到。
map()是第一个,也是稍微简单的一个。例如,假设我们想将葡萄酒得到的分数重新平均为0:
1 | review_points_mean = reviews.points.mean() |
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138
...
129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
您传递给map()的函数应该期望从Series中得到一个单独的值(在上面的示例中是一个点值),并返回该值的转换版本。 map()返回一个新的Series,其中所有的值都已被您的函数转换。
如果我们想通过在每一行上调用自定义方法来转换整个DataFrame,则apply()是等价的方法。
1 | def remean_points(row): |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | -1.447138 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | -1.447138 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | -1.447138 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | -1.447138 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | -1.447138 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 1.552862 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 1.552862 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 1.552862 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 1.552862 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 1.552862 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
如果我们在调用views.apply()时使用的是axis=’index’,那么我们就不需要传递一个函数来转换每一行,而是需要给出一个函数来转换每一列。请注意,map()和apply()分别返回新的、转换后的Series和DataFrames。它们不会修改被调用的原始数据。如果我们查看第一行的评论,我们可以看到它仍然具有原始的点值。
1 | reviews.head(1) |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
Pandas 提供了许多常见的内置映射操作。例如,以下是重新定义点列的更快方法:
1 | review_points_mean = reviews.points.mean() |
0 -1.447138
1 -1.447138
2 -1.447138
3 -1.447138
4 -1.447138
...
129966 1.552862
129967 1.552862
129968 1.552862
129969 1.552862
129970 1.552862
Name: points, Length: 129971, dtype: float64
在这段代码中,我们在左侧的大量值(系列中的所有值)和右侧的单个值(平均值)之间执行运算。Pandas查看该表达式,并计算出我们必须从数据集中的每个值中减去该平均值。
如果我们在等长序列之间执行这些操作,Pandas也会明白该怎么做。例如,结合数据集中的国家和地区信息的简单方法如下:
1 | reviews.country + " - " + reviews.region_1 |
0 Italy - Etna
1 NaN
2 US - Willamette Valley
3 US - Lake Michigan Shore
4 US - Willamette Valley
...
129966 NaN
129967 US - Oregon
129968 France - Alsace
129969 France - Alsace
129970 France - Alsace
Length: 129971, dtype: object
这些运算符比map()或apply()更快,因为它们使用了pandas内置的加速功能。所有标准的 Python 运算符 (>, <, ==, 等等) 都以这种方式工作。
然而,它们不如map()或apply()灵活,后者可以做更高级的事情,比如应用条件逻辑,而这是加减法无法做到的。
TEST
1 | import pandas as pd |
1.review DataFrame 中points列的中位数是多少?
1 | reviews['points'].median() |
88.0
2.What countries are represented in the dataset? (Your answer should not include any duplicates.)
1 | countries=reviews['country'].unique() |
array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
'China', 'Egypt'], dtype=object)
3.How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.
每个国家在数据集中出现的频率是多少?创建一个 Series Reviews_per_country,将国家映射到该国家的葡萄酒reviews数量。
1 |
|
country
US 54504
France 22093
Italy 19540
Spain 6645
Portugal 5691
Chile 4472
Argentina 3800
Austria 3345
Australia 2329
Germany 2165
New Zealand 1419
South Africa 1401
Israel 505
Greece 466
Canada 257
Hungary 146
Bulgaria 141
Romania 120
Uruguay 109
Turkey 90
Slovenia 87
Georgia 86
England 74
Croatia 73
Mexico 70
Moldova 59
Brazil 52
Lebanon 35
Morocco 28
Peru 16
Ukraine 14
Serbia 12
Czech Republic 12
Macedonia 12
Cyprus 11
India 9
Switzerland 7
Luxembourg 6
Bosnia and Herzegovina 2
Armenia 2
Slovakia 1
China 1
Egypt 1
Name: count, dtype: int64
Create variable centered_price containing a version of the price column with the mean price subtracted. (Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)
创建变量 centered_price,其中包含减去平均价格的价格列版本。 (注意:这种“居中”转换是应用各种机器学习算法之前的常见预处理步骤。)
1 | a = reviews['price'].mean() |
0 NaN
1 -20.363389
2 -21.363389
3 -22.363389
4 29.636611
...
129966 -7.363389
129967 39.636611
129968 -5.363389
129969 -3.363389
129970 -14.363389
Name: price, Length: 129971, dtype: float64
I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
我是一个经济实惠的葡萄酒买家。哪种酒是“最划算的”?使用数据集中积分价格比最高的葡萄酒的标题创建一个变量 deal_wine。
1 | # reviews[[(reviews['points']/reviews['price']).idxmax()]] |
'Bandit NV Merlot (California)'
6.There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts
counting how many times each of these two words appears in the description
column in the dataset. (For simplicity, let’s ignore the capitalized versions of these words.)
在描述一瓶葡萄酒时,您可以使用的词汇有限。葡萄酒更可能是 “热带 “还是 “果味”?创建一个 “descriptor_counts “系列,计算这两个词在数据集的 “description “列中分别出现的次数。(为简单起见,让我们忽略这些词的大写版本)。
1 | import re |
0 3607
1 9090
dtype: int64
1 | ############答案的方法########## |
tropical 3607
fruity 9090
dtype: int64
- We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a series star_ratings
with the number of stars corresponding to each review in the dataset.
我们希望将这些酒评放在我们的网站上,但从80分到100分的评分系统太难懂了–我们希望将其转化为简单的星级评分。95分及以上为3星,85分以上95分以下为2星。其他分数为1星。
此外,加拿大葡萄酒商协会在网站上购买了大量广告,因此任何来自加拿大的葡萄酒都应该自动获得3星,无论分数高低。
创建一个 “star_ratings “序列,其中包含数据集中每条review对应的星级数。
1 | a = pd.Series() |
0 2
1 2
2 2
3 2
4 2
..
129966 2
129967 2
129968 2
129969 2
129970 2
Length: 129971, dtype: int64
1 | ###官方答案### |
0 2
1 2
2 2
3 2
4 2
..
129966 2
129967 2
129968 2
129969 2
129970 2
Length: 129971, dtype: int64
Grouping and Sorting
映射允许我们对数据帧或系列中的数据进行转换,一次只转换整个列中的一个值。然而,我们经常希望对数据进行分组,然后对数据所在的组进行特定操作。
正如您将了解到的,我们通过groupby()操作来实现这一点。我们还将介绍一些其他主题,例如索引DataFrames的更复杂方法,以及如何排序数据。
分组分析
到目前为止,我们经常使用的一个函数是 value_counts() 函数。我们可以通过执行以下操作来复制 value_counts() 的作用:
1 | import pandas as pd |
1 | print(reviews.groupby('points').points.count()) |
points
80 397
81 692
82 1836
83 3025
84 6480
85 9530
86 12600
87 16933
88 17207
89 12226
90 15410
91 11359
92 9613
93 6489
94 3758
95 1535
96 523
97 229
98 77
99 33
100 19
Name: points, dtype: int64
points
80 395
81 680
82 1772
83 2886
84 6099
85 8902
86 11745
87 15767
88 16014
89 11324
90 14361
91 10564
92 8871
93 5935
94 3449
95 1406
96 482
97 207
98 69
99 28
100 19
Name: price, dtype: int64
price
4.0 11
5.0 46
6.0 120
7.0 433
8.0 892
...
1900.0 1
2000.0 2
2013.0 1
2500.0 2
3300.0 1
Name: price, Length: 390, dtype: int64
price
4.0 11
5.0 46
6.0 120
7.0 433
8.0 892
...
1900.0 1
2000.0 2
2013.0 1
2500.0 2
3300.0 1
Name: points, Length: 390, dtype: int64
groupby()创建了一组评论,这些评论为给定的葡萄酒分配了相同的分值。然后,对于每一组,我们抓取points()列并计算它出现的次数。 value_counts()只是groupby()操作的快捷方式。
我们可以对这些数据使用我们之前使用过的任何汇总函数。例如,要获得每个点值类别中最便宜的葡萄酒,我们可以执行以下操作:
1 | reviews.groupby('points').price.min() |
points
80 5.0
81 5.0
82 4.0
83 4.0
84 4.0
85 4.0
86 4.0
87 5.0
88 6.0
89 7.0
90 8.0
91 7.0
92 11.0
93 12.0
94 13.0
95 20.0
96 20.0
97 35.0
98 50.0
99 44.0
100 80.0
Name: price, dtype: float64
您可以将我们生成的每个组视为DataFrame的一个片段,其中仅包含值匹配的数据。我们可以使用apply()方法直接访问该DataFrame,然后以我们认为合适的方式处理数据。例如,我们可以从数据集中的每个酒庄中选择第一款葡萄酒的名称:
1 | reviews.groupby('winery').apply(lambda df: df.title.iloc[0]) |
winery
1+1=3 1+1=3 NV Rosé Sparkling (Cava)
10 Knots 10 Knots 2010 Viognier (Paso Robles)
100 Percent Wine 100 Percent Wine 2015 Moscato (California)
1000 Stories 1000 Stories 2013 Bourbon Barrel Aged Zinfande...
1070 Green 1070 Green 2011 Sauvignon Blanc (Rutherford)
...
Órale Órale 2011 Cabronita Red (Santa Ynez Valley)
Öko Öko 2013 Made With Organically Grown Grapes Ma...
Ökonomierat Rebholz Ökonomierat Rebholz 2007 Von Rotliegenden Spät...
àMaurice àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
为了进行更细粒度的控制,您还可以按多列进行分组。举个例子,以下是我们如何按国家和省份挑选最好的葡萄酒的方法:
1 | reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()]) |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | province | |||||||||||||
Argentina | Mendoza Province | Argentina | If the color doesn't tell the full story, the ... | Nicasia Vineyard | 97 | 120.0 | Mendoza Province | Mendoza | NaN | Michael Schachner | @wineschach | Bodega Catena Zapata 2006 Nicasia Vineyard Mal... | Malbec | Bodega Catena Zapata |
Other | Argentina | Take note, this could be the best wine Colomé ... | Reserva | 95 | 90.0 | Other | Salta | NaN | Michael Schachner | @wineschach | Colomé 2010 Reserva Malbec (Salta) | Malbec | Colomé | |
Armenia | Armenia | Armenia | Deep salmon in color, this wine offers a bouqu... | Estate Bottled | 88 | 15.0 | Armenia | NaN | NaN | Mike DeSimone | @worldwineguys | Van Ardi 2015 Estate Bottled Rosé (Armenia) | Rosé | Van Ardi |
Australia | Australia Other | Australia | Writes the book on how to make a wine filled w... | Sarah's Blend | 93 | 15.0 | Australia Other | South Eastern Australia | NaN | NaN | NaN | Marquis Philips 2000 Sarah's Blend Red (South ... | Red Blend | Marquis Philips |
New South Wales | Australia | De Bortoli's Noble One is as good as ever in 2... | Noble One Bortytis | 94 | 32.0 | New South Wales | New South Wales | NaN | Joe Czerwinski | @JoeCz | De Bortoli 2007 Noble One Bortytis Semillon (N... | Sémillon | De Bortoli | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Uruguay | Juanico | Uruguay | This mature Bordeaux-style blend is earthy on ... | Preludio Barrel Select Lote N 77 | 90 | 45.0 | Juanico | NaN | NaN | Michael Schachner | @wineschach | Familia Deicas 2004 Preludio Barrel Select Lot... | Red Blend | Familia Deicas |
Montevideo | Uruguay | A rich, heady bouquet offers aromas of blackbe... | Monte Vide Eu Tannat-Merlot-Tempranillo | 91 | 60.0 | Montevideo | NaN | NaN | Michael Schachner | @wineschach | Bouza 2015 Monte Vide Eu Tannat-Merlot-Tempran... | Red Blend | Bouza | |
Progreso | Uruguay | Rusty in color but deep and complex in nature,... | Etxe Oneko Fortified Sweet Red | 90 | 46.0 | Progreso | NaN | NaN | Michael Schachner | @wineschach | Pisano 2007 Etxe Oneko Fortified Sweet Red Tan... | Tannat | Pisano | |
San Jose | Uruguay | Baked, sweet, heavy aromas turn earthy with ti... | El Preciado Gran Reserva | 87 | 50.0 | San Jose | NaN | NaN | Michael Schachner | @wineschach | Castillo Viejo 2005 El Preciado Gran Reserva R... | Red Blend | Castillo Viejo | |
Uruguay | Uruguay | Cherry and berry aromas are ripe, healthy and ... | Blend 002 Limited Edition | 91 | 22.0 | Uruguay | NaN | NaN | Michael Schachner | @wineschach | Narbona NV Blend 002 Limited Edition Tannat-Ca... | Tannat-Cabernet Franc | Narbona |
425 rows × 13 columns
另一个值得一提的 groupby() 方法是 agg(),它允许您同时在 DataFrame 上运行一堆不同的函数。例如,我们可以生成数据集的简单统计摘要,如下所示:
1 | reviews.groupby(['country']).price.agg([len, min, max]) |
len | min | max | |
---|---|---|---|
country | |||
Argentina | 3800 | 4.0 | 230.0 |
Armenia | 2 | 14.0 | 15.0 |
Australia | 2329 | 5.0 | 850.0 |
Austria | 3345 | 7.0 | 1100.0 |
Bosnia and Herzegovina | 2 | 12.0 | 13.0 |
Brazil | 52 | 10.0 | 60.0 |
Bulgaria | 141 | 8.0 | 100.0 |
Canada | 257 | 12.0 | 120.0 |
Chile | 4472 | 5.0 | 400.0 |
China | 1 | 18.0 | 18.0 |
Croatia | 73 | 12.0 | 65.0 |
Cyprus | 11 | 11.0 | 21.0 |
Czech Republic | 12 | 15.0 | 45.0 |
Egypt | 1 | NaN | NaN |
England | 74 | 25.0 | 95.0 |
France | 22093 | 5.0 | 3300.0 |
Georgia | 86 | 9.0 | 40.0 |
Germany | 2165 | 5.0 | 775.0 |
Greece | 466 | 8.0 | 79.0 |
Hungary | 146 | 10.0 | 764.0 |
India | 9 | 10.0 | 20.0 |
Israel | 505 | 8.0 | 150.0 |
Italy | 19540 | 5.0 | 900.0 |
Lebanon | 35 | 13.0 | 75.0 |
Luxembourg | 6 | 16.0 | 30.0 |
Macedonia | 12 | 15.0 | 20.0 |
Mexico | 70 | 8.0 | 108.0 |
Moldova | 59 | 8.0 | 42.0 |
Morocco | 28 | 14.0 | 40.0 |
New Zealand | 1419 | 7.0 | 130.0 |
Peru | 16 | 10.0 | 68.0 |
Portugal | 5691 | 5.0 | 1000.0 |
Romania | 120 | 4.0 | 320.0 |
Serbia | 12 | 15.0 | 42.0 |
Slovakia | 1 | 16.0 | 16.0 |
Slovenia | 87 | 7.0 | 90.0 |
South Africa | 1401 | 5.0 | 330.0 |
Spain | 6645 | 4.0 | 770.0 |
Switzerland | 7 | 21.0 | 160.0 |
Turkey | 90 | 14.0 | 120.0 |
US | 54504 | 4.0 | 2013.0 |
Ukraine | 14 | 6.0 | 13.0 |
Uruguay | 109 | 10.0 | 130.0 |
有效使用 groupby() 将使您能够利用数据集做很多真正强大的事情。
Multi-indexes
在我们迄今为止看到的所有示例中,我们一直在使用单标签索引的DataFrame或Series对象。
groupby()略有不同,根据我们运行的操作,它有时会产生所谓的多索引。
多索引与常规索引的不同之处在于它具有多个级别。例如:
1 | countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len]) |
len | ||
---|---|---|
country | province | |
Argentina | Mendoza Province | 3264 |
Other | 536 | |
Armenia | Armenia | 2 |
Australia | Australia Other | 245 |
New South Wales | 85 | |
... | ... | ... |
Uruguay | Juanico | 12 |
Montevideo | 11 | |
Progreso | 11 | |
San Jose | 3 | |
Uruguay | 24 |
425 rows × 1 columns
1 | mi = countries_reviewed.index |
pandas.core.indexes.multi.MultiIndex
多级索引有几种处理分层结构的方法,而单级索引则没有这些方法。它们还需要两层标签来获取一个值。对于刚接触pandas的用户来说,处理多索引输出是一个常见的 “疑难杂症”。
在pandas文档的MultiIndex/Advanced Selection部分详细介绍了多索引的用例和使用说明。
然而,一般来说,你最常使用的多索引方法是转换回普通索引的reset_index()方法:
1 | countries_reviewed.reset_index() |
country | province | len | |
---|---|---|---|
0 | Argentina | Mendoza Province | 3264 |
1 | Argentina | Other | 536 |
2 | Armenia | Armenia | 2 |
3 | Australia | Australia Other | 245 |
4 | Australia | New South Wales | 85 |
... | ... | ... | ... |
420 | Uruguay | Juanico | 12 |
421 | Uruguay | Montevideo | 11 |
422 | Uruguay | Progreso | 11 |
423 | Uruguay | San Jose | 3 |
424 | Uruguay | Uruguay | 24 |
425 rows × 3 columns
Sorting
再次查看 countries_reviewed,我们可以看到分组返回的数据是按照索引顺序而不是值的顺序排列的。也就是说,在输出分组结果时,行的顺序取决于索引中的值,而不是数据中的值。
为了按照我们想要的顺序获得数据,我们可以自己对数据进行排序。sort_values()方法在这方面非常方便。
1 | countries_reviewed = countries_reviewed.reset_index() |
country | province | len | |
---|---|---|---|
179 | Greece | Muscat of Kefallonian | 1 |
192 | Greece | Sterea Ellada | 1 |
194 | Greece | Thraki | 1 |
354 | South Africa | Paardeberg | 1 |
40 | Brazil | Serra do Sudeste | 1 |
... | ... | ... | ... |
409 | US | Oregon | 5373 |
227 | Italy | Tuscany | 5897 |
118 | France | Bordeaux | 5941 |
415 | US | Washington | 8639 |
392 | US | California | 36247 |
425 rows × 3 columns
sort_values() 默认为升序排序,其中最低的值排在前面。然而,大多数时候我们想要降序排序,即数字较大的排在前面。事情是这样的:
1 | countries_reviewed.sort_values(by='len', ascending=False) |
country | province | len | |
---|---|---|---|
392 | US | California | 36247 |
415 | US | Washington | 8639 |
118 | France | Bordeaux | 5941 |
227 | Italy | Tuscany | 5897 |
409 | US | Oregon | 5373 |
... | ... | ... | ... |
101 | Croatia | Krk | 1 |
247 | New Zealand | Gladstone | 1 |
357 | South Africa | Piekenierskloof | 1 |
63 | Chile | Coelemu | 1 |
149 | Greece | Beotia | 1 |
425 rows × 3 columns
要按索引值排序,请使用配套方法 sort_index()。此方法具有相同的参数和默认顺序:
1 | countries_reviewed.sort_index() |
country | province | len | |
---|---|---|---|
0 | Argentina | Mendoza Province | 3264 |
1 | Argentina | Other | 536 |
2 | Armenia | Armenia | 2 |
3 | Australia | Australia Other | 245 |
4 | Australia | New South Wales | 85 |
... | ... | ... | ... |
420 | Uruguay | Juanico | 12 |
421 | Uruguay | Montevideo | 11 |
422 | Uruguay | Progreso | 11 |
423 | Uruguay | San Jose | 3 |
424 | Uruguay | Uruguay | 24 |
425 rows × 3 columns
最后,要知道您一次可以按多个列进行排序:
1 | countries_reviewed.sort_values(by=['country', 'len']) |
country | province | len | |
---|---|---|---|
1 | Argentina | Other | 536 |
0 | Argentina | Mendoza Province | 3264 |
2 | Armenia | Armenia | 2 |
6 | Australia | Tasmania | 42 |
4 | Australia | New South Wales | 85 |
... | ... | ... | ... |
421 | Uruguay | Montevideo | 11 |
422 | Uruguay | Progreso | 11 |
420 | Uruguay | Juanico | 12 |
424 | Uruguay | Uruguay | 24 |
419 | Uruguay | Canelones | 43 |
425 rows × 3 columns
TEST
1 | import pandas as pd |
Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
1.
Who are the most common wine reviewers in the dataset? Create a Series
whose index is the taster_twitter_handle
category from the dataset, and whose values count how many reviews each person wrote.
数据集中最常见的酒评人是谁?创建一个 “系列”,其索引是数据集中的 “taster_twitter_handle “类别,其值计算每个人写了多少评论。
1 | reviews.groupby('taster_twitter_handle')['taster_twitter_handle'].count() |
taster_twitter_handle
@AnneInVino 3685
@JoeCz 5147
@bkfiona 27
@gordone_cellars 4177
@kerinokeefe 10776
@laurbuzz 1835
@mattkettmann 6332
@paulgwine 9532
@suskostrzewa 1085
@vboone 9537
@vossroger 25514
@wawinereport 4966
@wineschach 15134
@winewchristina 6
@worldwineguys 1005
Name: taster_twitter_handle, dtype: int64
2.
What is the best wine I can buy for a given amount of money? Create a Series
whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0
dollars is at the top and 3300.0
dollars is at the bottom).
我花多少钱可以买到最好的葡萄酒?创建一个 “系列”,其索引是葡萄酒的价格,其值是该价格的葡萄酒在评论中得到的最高分。按价格升序排序(这样4.0
美元在最上面,3300.0
美元在最下面)。
1 | reviews.groupby('price')['points'].max() |
price
4.0 86
5.0 87
6.0 88
7.0 91
8.0 91
..
1900.0 98
2000.0 97
2013.0 91
2500.0 96
3300.0 88
Name: points, Length: 390, dtype: int64
3.
What are the minimum and maximum prices for each variety
of wine? Create a DataFrame
whose index is the variety
category from the dataset and whose values are the min
and max
values thereof.
每种葡萄酒的最低和最高价格是多少?创建一个DataFrame
,其索引为数据集中的variety
类别,其值为min
和max
值。
1 | price_extremes = reviews.groupby('variety')['price'].agg([min,max]) |
min | max | |
---|---|---|
variety | ||
Abouriou | 15.0 | 75.0 |
Agiorgitiko | 10.0 | 66.0 |
Aglianico | 6.0 | 180.0 |
Aidani | 27.0 | 27.0 |
Airen | 8.0 | 10.0 |
... | ... | ... |
Zinfandel | 5.0 | 100.0 |
Zlahtina | 13.0 | 16.0 |
Zweigelt | 9.0 | 70.0 |
Çalkarası | 19.0 | 19.0 |
Žilavka | 15.0 | 15.0 |
707 rows × 2 columns
4.
What are the most expensive wine varieties? Create a variable sorted_varieties
containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
最贵的葡萄酒品种是什么?创建一个变量sorted_varieties
,其中包含上一个问题中数据帧的副本,根据最低价格,然后根据最高价格,按降序排列(打破并列关系)。
1 | price_extremes.sort_values(by=['min', 'max'],ascending=False) |
min | max | |
---|---|---|
variety | ||
Ramisco | 495.0 | 495.0 |
Terrantez | 236.0 | 236.0 |
Francisa | 160.0 | 160.0 |
Rosenmuskateller | 150.0 | 150.0 |
Tinta Negra Mole | 112.0 | 112.0 |
... | ... | ... |
Roscetto | NaN | NaN |
Sauvignon Blanc-Sauvignon Gris | NaN | NaN |
Tempranillo-Malbec | NaN | NaN |
Vital | NaN | NaN |
Zelen | NaN | NaN |
707 rows × 2 columns
5.
Create a Series
whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name
and points
columns.
创建一个 “系列”,其索引为reviewers,其值为该reviewer给出的平均review分数。提示:您将需要taster_name
和points
列。
1 | # reviews.groupby('taster_name')['points'].agg(['mean']) |
taster_name
Alexander Peartree 85.855422
Anna Lee C. Iijima 88.415629
Anne Krebiehl MW 90.562551
Carrie Dykes 86.395683
Christina Pickard 87.833333
Fiona Adams 86.888889
Jeff Jenssen 88.319756
Jim Gordon 88.626287
Joe Czerwinski 88.536235
Kerin O’Keefe 88.867947
Lauren Buzzeo 87.739510
Matt Kettmann 90.008686
Michael Schachner 86.907493
Mike DeSimone 89.101167
Paul Gregutt 89.082564
Roger Voss 88.708003
Sean P. Sullivan 88.755739
Susan Kostrzewa 86.609217
Virginie Boone 89.213379
dtype: float64
Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe()
method to see a summary of the range of values.
不同评论人给出的平均分是否存在明显差异?运行下面的单元格,使用 describe()
方法查看数值范围的摘要。
1 | reviewer_mean_ratings.describe() |
count 19.000000
mean 88.233026
std 1.243610
min 85.855422
25% 87.323501
50% 88.536235
75% 88.975256
max 90.562551
dtype: float64
6.
What combination of countries and varieties are most common? Create a Series
whose index is a MultiIndex
of {country, variety}
pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}
. Sort the values in the Series
in descending order based on wine count.
什么国家和品种的组合最常见?创建一个 Series
,其索引是{country, variety}
对的 MultiIndex
。例如,产自美国的黑比诺应该映射到{"美国", "黑比诺"}
。将 Series
中的值按照葡萄酒数量降序排序。
1 | reviews.groupby(['country', 'variety']).size().sort_values(ascending=False) |
country variety
US Pinot Noir 9885
Cabernet Sauvignon 7315
Chardonnay 6801
France Bordeaux-style Red Blend 4725
Italy Red Blend 3624
...
Mexico Cinsault 1
Grenache 1
Merlot 1
Rosado 1
Uruguay White Blend 1
Length: 1612, dtype: int64
Data Types and Missing Values
Dtypes
DataFrame 或 Series 中列的数据类型称为 dtype。
您可以使用 dtype 属性来获取特定列的类型。例如,我们可以获取评论 DataFrame 中价格列的 dtype:
1 | import pandas as pd |
1 | reviews.price.dtype |
dtype('float64')
或者,dtypes 属性返回 DataFrame 中每列的 dtype:
1 | reviews.dtypes |
country object
description object
designation object
points int64
price float64
province object
region_1 object
region_2 object
taster_name object
taster_twitter_handle object
title object
variety object
winery object
dtype: object
float64 表示使用 64 位浮点数;int64 表示使用类似大小的整数,以此类推。
需要注意的一个特殊情况是(在这里可以很清楚地看到),完全由字符串组成的列没有自己的类型,而是被赋予了对象类型。
我们可以使用 astype() 函数将一种类型的列转换为另一种类型,只要这种转换是合理的。例如,我们可以将积分列从现有的 int64 数据类型转换为 float64 数据类型:
1 | reviews.points.astype('float64') |
0 87.0
1 87.0
2 87.0
3 87.0
4 87.0
...
129966 90.0
129967 90.0
129968 90.0
129969 90.0
129970 90.0
Name: points, Length: 129971, dtype: float64
DataFrame 或 Series 索引也有自己的数据类型:
1 | reviews.index.dtype |
dtype('int64')
Pandas 还支持更奇特的数据类型,例如分类数据和时间序列数据。
Missing data
缺失值的输入值为 NaN,即 “Not a Number(非数值)”的缩写。由于技术原因,这些 NaN 值始终是 float64 类型。
Pandas 提供了一些专门针对缺失数据的方法。要选择 NaN 条目,可以使用 pd.isnull()(或其同伴 pd.notnull())。使用方法如下
1 | reviews[pd.isnull(reviews.country)] |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
4243 | NaN | Violet-red in color, this semisweet wine has a... | Red Naturally Semi-Sweet | 88 | 18.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Kakhetia Traditional Winemaking 2012 Red Natur... | Ojaleshi | Kakhetia Traditional Winemaking |
9509 | NaN | This mouthwatering blend starts with a nose of... | Theopetra Malagouzia-Assyrtiko | 92 | 28.0 | NaN | NaN | NaN | Susan Kostrzewa | @suskostrzewa | Tsililis 2015 Theopetra Malagouzia-Assyrtiko W... | White Blend | Tsililis |
9750 | NaN | This orange-style wine has a cloudy yellow-gol... | Orange Nikolaevo Vineyard | 89 | 28.0 | NaN | NaN | NaN | Jeff Jenssen | @worldwineguys | Ross-idi 2015 Orange Nikolaevo Vineyard Chardo... | Chardonnay | Ross-idi |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
124176 | NaN | This Swiss red blend is composed of four varie... | Les Romaines | 90 | 30.0 | NaN | NaN | NaN | Jeff Jenssen | @worldwineguys | Les Frères Dutruy 2014 Les Romaines Red | Red Blend | Les Frères Dutruy |
129407 | NaN | Dry spicy aromas of dusty plum and tomato add ... | Reserve | 89 | 22.0 | NaN | NaN | NaN | Michael Schachner | @wineschach | El Capricho 2015 Reserve Cabernet Sauvignon | Cabernet Sauvignon | El Capricho |
129408 | NaN | El Capricho is one of Uruguay's more consisten... | Reserve | 89 | 22.0 | NaN | NaN | NaN | Michael Schachner | @wineschach | El Capricho 2015 Reserve Tempranillo | Tempranillo | El Capricho |
129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
63 rows × 13 columns
替换缺失值是一种常见操作。 Pandas 为这个问题提供了一个非常方便的方法:fillna()。 fillna() 提供了几种不同的策略来减少此类数据。例如,我们可以简单地将每个 NaN 替换为“Unknown”:
1 | reviews.region_2.fillna("Unknown") |
0 Unknown
1 Unknown
2 Willamette Valley
3 Unknown
4 Willamette Valley
...
129966 Unknown
129967 Oregon Other
129968 Unknown
129969 Unknown
129970 Unknown
Name: region_2, Length: 129971, dtype: object
或者,我们可以使用数据库中给定记录之后某个时间出现的第一个非空值来填充每个缺失值。这称为回填策略。
或者,我们可能有一个非空值需要替换。例如,假设自本数据集发布以来,评论员 Kerin O’Keefe 已将其 Twitter 手柄从 @kerinokeefe 更改为 @kerino。使用 replace() 方法是在数据集中反映这一情况的方法之一:
1 | reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino") |
0 @kerino
1 @vossroger
2 @paulgwine
3 NaN
4 @paulgwine
...
129966 NaN
129967 @paulgwine
129968 @vossroger
129969 @vossroger
129970 @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object
这里值得一提的是 Replace() 方法,因为它可以方便地替换丢失的数据,这些数据在数据集中被赋予某种哨兵值:诸如“未知”、“未公开”、“无效”等。
TEST
1 | import pandas as pd |
1.数据集中pionts
列的数据类型是什么?
1 | reviews['points'].dtype |
dtype('int64')
2.从points
列中的条目创建一个系列,但将条目转换为字符串。提示:字符串在 Python 中是 str
。
1 | reviews.points.astype('str') |
0 87
1 87
2 87
3 87
4 87
..
129966 90
129967 90
129968 90
129969 90
129970 90
Name: points, Length: 129971, dtype: object
3.有时价格列为空。数据集中有多少评论缺少价格?
1 | reviews['price'].isnull().sum() |
8996
4.最常见的葡萄酒产区有哪些?创建一个Series,计算 region_1
字段中每个值出现的次数。这个字段经常缺少数据,因此用unknown
替换缺少的值。按降序排序。输出结果应如下所示:
1 | Unknown 21247 |
1 | reviews['region_1'] = reviews['region_1'].fillna("Unknown") |
region_1
Unknown 21247
Napa Valley 4480
Columbia Valley (WA) 4124
Russian River Valley 3091
California 2629
...
Lamezia 1
Trentino Superiore 1
Grave del Friuli 1
Vin Santo di Carmignano 1
Paestum 1
Name: count, Length: 1230, dtype: int64
Renaming and Combining
Renaming
我们在这里介绍的第一个函数是 rename(),它允许您更改索引名称和/或列名称。例如,要更改数据集中的分数列以进行评分,我们将执行以下操作:
1 | import pandas as pd |
1 | reviews.rename(columns={'points': 'score'}) |
country | description | designation | score | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
rename()
可以让你通过分别指定索引或列关键字参数来重命名索引或列值。它支持多种输入格式,但通常 Python 字典是最方便的。下面是一个使用它重命名索引中某些元素的示例。
1 | reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}) |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
firstEntry | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
secondEntry | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
你可能会经常重命名列,但很少重命名索引值。为此,set_index() 通常更方便。
行索引和列索引都可以有自己的名称属性。可以使用赠送的 rename_axis() 方法来更改这些名称。例如
1 | reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns') |
fields | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
wines | |||||||||||||
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
Combining
在对数据集执行操作时,我们有时需要将不同的 DataFrames 和/或 Series 以非同寻常的方式组合起来。Pandas 有三种核心方法可以实现这一目的。按照复杂程度递增的顺序,它们是 concat()
、join()
和 merge()
。merge()
的大部分功能也可以通过 join()
更简单地实现。
最简单的合并方法是 concat()
。如果给定一个元素列表,该函数会将这些元素沿着一个轴挤在一起。
当我们在不同的 DataFrame 或 Series 对象中拥有相同字段(列)的数据时,这种方法非常有用。例如:YouTube 视频数据集根据来源国(例如本例中的加拿大和英国)对数据进行了分割。如果我们想同时研究多个国家,可以使用 concat()
将它们合并在一起:
1 | canadian_youtube = pd.read_csv("datasets/Trending_YouTube_Video_Statistics/CAvideos.csv") |
video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | n1WpP7iowLc | 17.14.11 | Eminem - Walk On Water (Audio) ft. Beyoncé | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem's new track Walk on Water ft. Beyoncé i... |
1 | 0dBIkQ4Mz1M | 17.14.11 | PLUSH - Bad Unboxing Fan Mail | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|"bad unboxing"|"unboxing"|"fan mail"|"id... | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | False | False | False | STill got a lot of packages. Probably will las... |
2 | 5qpjK5DgCt4 | 17.14.11 | Racist Superman | Rudy Mancuso, King Bach & Le... | Rudy Mancuso | 23 | 2017-11-12T19:05:24.000Z | racist superman|"rudy"|"mancuso"|"king"|"bach"... | 3191434 | 146035 | 5339 | 8181 | https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg | False | False | False | WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http... |
3 | d380meD0W0M | 17.14.11 | I Dare You: GOING BALD!? | nigahiga | 24 | 2017-11-12T18:01:41.000Z | ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"... | 2095828 | 132239 | 1989 | 17518 | https://i.ytimg.com/vi/d380meD0W0M/default.jpg | False | False | False | I know it's been a while since we did this sho... |
4 | 2Vv-BfVoq4g | 17.14.11 | Ed Sheeran - Perfect (Official Music Video) | Ed Sheeran | 10 | 2017-11-09T11:04:14.000Z | edsheeran|"ed sheeran"|"acoustic"|"live"|"cove... | 33523622 | 1634130 | 21082 | 85067 | https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg | False | False | False | 🎧: https://ad.gt/yt-perfect\n💰: https://atlant... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
38911 | l884wKofd54 | 18.14.06 | Enrique Iglesias - MOVE TO MIAMI (Official Vid... | EnriqueIglesiasVEVO | 10 | 2018-05-09T07:00:01.000Z | Enrique Iglesias feat. Pitbull|"MOVE TO MIAMI"... | 25066952 | 268088 | 12783 | 9933 | https://i.ytimg.com/vi/l884wKofd54/default.jpg | False | False | False | NEW SONG - MOVE TO MIAMI feat. Pitbull (Click ... |
38912 | IP8k2xkhOdI | 18.14.06 | Jacob Sartorius - Up With It (Official Music V... | Jacob Sartorius | 10 | 2018-05-11T17:09:16.000Z | jacob sartorius|"jacob"|"up with it"|"jacob sa... | 1492219 | 61998 | 13781 | 24330 | https://i.ytimg.com/vi/IP8k2xkhOdI/default.jpg | False | False | False | THE OFFICIAL UP WITH IT MUSIC VIDEO!Get my new... |
38913 | Il-an3K9pjg | 18.14.06 | Anne-Marie - 2002 [Official Video] | Anne-Marie | 10 | 2018-05-08T11:05:08.000Z | anne|"marie"|"anne-marie"|"2002"|"two thousand... | 29641412 | 394830 | 8892 | 19988 | https://i.ytimg.com/vi/Il-an3K9pjg/default.jpg | False | False | False | Get 2002 by Anne-Marie HERE ▶ http://ad.gt/200... |
38914 | -DRsfNObKIQ | 18.14.06 | Eleni Foureira - Fuego - Cyprus - LIVE - First... | Eurovision Song Contest | 24 | 2018-05-08T20:32:32.000Z | Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... | 14317515 | 151870 | 45875 | 26766 | https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg | False | False | False | Eleni Foureira represented Cyprus at the first... |
38915 | 4YFo4bdMO8Q | 18.14.06 | KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... | SuperDuperKyle | 10 | 2018-05-11T04:06:35.000Z | Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... | 607552 | 18271 | 274 | 1423 | https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg | False | False | False | Debut album 'Light of Mine' out now: http://ky... |
79797 rows × 16 columns
join()
可让您组合具有共同索引的不同 DataFrame 对象。例如,要提取加拿大和英国同一天流行的视频,我们可以执行以下操作:
1 | left = canadian_youtube.set_index(['title', 'trending_date']) |
video_id_CAN | channel_title_CAN | category_id_CAN | publish_time_CAN | tags_CAN | views_CAN | likes_CAN | dislikes_CAN | comment_count_CAN | thumbnail_link_CAN | ... | tags_UK | views_UK | likes_UK | dislikes_UK | comment_count_UK | thumbnail_link_UK | comments_disabled_UK | ratings_disabled_UK | video_error_or_removed_UK | description_UK | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | trending_date | |||||||||||||||||||||
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 | 18.04.01 | PNn8sECd7io | Markiplier | 20 | 2018-01-03T19:33:53.000Z | getting over it|"markiplier"|"funny moments"|"... | 835930 | 47058 | 1023 | 8250 | https://i.ytimg.com/vi/PNn8sECd7io/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
#1 Fortnite World Rank - 2,323 Solo Wins! | 18.09.03 | DvPW66IFhMI | AlexRamiGaming | 20 | 2018-03-09T07:15:52.000Z | PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... | 212838 | 5199 | 542 | 11 | https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
#1 Fortnite World Rank - 2,330 Solo Wins! | 18.10.03 | EXEaMjFeiEk | AlexRamiGaming | 20 | 2018-03-10T06:26:17.000Z | PS4 Battle Royale|"PS4 Pro Battle Royale"|"Bat... | 200764 | 5620 | 537 | 45 | https://i.ytimg.com/vi/EXEaMjFeiEk/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
#1 MOST ANTICIPATED VIDEO (Timber Frame House Raising) | 17.20.12 | bYvQmusLaxw | Pure Living for Life | 24 | 2017-12-20T02:49:11.000Z | timber frame|"timber framing"|"timber frame ra... | 79152 | 7761 | 159 | 1965 | https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17.21.12 | bYvQmusLaxw | Pure Living for Life | 24 | 2017-12-20T02:49:11.000Z | timber frame|"timber framing"|"timber frame ra... | 232762 | 15515 | 329 | 3601 | https://i.ytimg.com/vi/bYvQmusLaxw/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
😲She Is So Nervous But BLOWS The ROOF After Taking on OPERA Song! | Britain´s Got Talent 2018 | 18.02.05 | WttN1Z0XF4k | How Talented | 24 | 2018-04-28T19:40:58.000Z | bgt|"bgt 2018"|"britain got talent"|"britain´s... | 713400 | 4684 | 260 | 266 | https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18.29.04 | WttN1Z0XF4k | How Talented | 24 | 2018-04-28T19:40:58.000Z | bgt|"bgt 2018"|"britain got talent"|"britain´s... | 231906 | 1924 | 78 | 146 | https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
18.30.04 | WttN1Z0XF4k | How Talented | 24 | 2018-04-28T19:40:58.000Z | bgt|"bgt 2018"|"britain got talent"|"britain´s... | 476253 | 3417 | 176 | 240 | https://i.ytimg.com/vi/WttN1Z0XF4k/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 | 18.07.05 | Wt9Gkpmbt44 | TheBigJackpot | 24 | 2018-05-07T06:58:59.000Z | Slot Machine|"win"|"Gambling"|"Big Win"|"raja"... | 28973 | 2167 | 175 | 10 | https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE | 18.04.04 | Az72jrKbANA | Right Side Broadcasting Network | 25 | 2018-04-03T23:12:37.000Z | YouTube shooter|"YouTube active shooter"|"acti... | 103513 | 1722 | 181 | 76 | https://i.ytimg.com/vi/Az72jrKbANA/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
40900 rows × 28 columns
此处 lsuffix
和 rsuffix
参数是必需的,因为数据在英国和加拿大数据集中具有相同的列名称。如果这不是真的(因为,比如说,我们事先重命名了它们),我们就不需要它们。
TEST
1 | import pandas as pd |
1 | reviews.head() |
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
1.Region_1
和 Region_2
是数据集中区域设置列的非常无信息的名称。创建reviews
副本,并将这些列分别重命名为region
和locale
。
1 | reviews.rename(columns={'region_1': 'region','region_2':'locale'}) |
country | description | designation | points | price | province | region | locale | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
2.将数据集中的索引名称设置为 wines
。
1 | reviews.rename_axis("wines", axis='columns') |
wines | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129966 | Germany | Notes of honeysuckle and cantaloupe sweeten th... | Brauneberger Juffer-Sonnenuhr Spätlese | 90 | 28.0 | Mosel | NaN | NaN | Anna Lee C. Iijima | NaN | Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ... | Riesling | Dr. H. Thanisch (Erben Müller-Burggraef) |
129967 | US | Citation is given as much as a decade of bottl... | NaN | 90 | 75.0 | Oregon | Oregon | Oregon Other | Paul Gregutt | @paulgwine | Citation 2004 Pinot Noir (Oregon) | Pinot Noir | Citation |
129968 | France | Well-drained gravel soil gives this wine its c... | Kritt | 90 | 30.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Gresser 2013 Kritt Gewurztraminer (Als... | Gewürztraminer | Domaine Gresser |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
3.Things on Reddit 数据集包含了来自 reddit.com 上部分排名靠前的论坛(”subreddits”)的产品链接。运行下面的单元格,加载在 /r/gaming
子论坛上提到的产品的数据框,以及在 r//movies
子论坛上提到的产品的数据框。
1 | gaming_products = pd.read_csv("datasets/top-things/reddits/g/gaming.csv") |
创建一个包含 Reddit 子版块中提到的产品(either subreddit
)的 DataFrame
。
1 | gaming_products.head() |
name | category | amazon_link | total_mentions | subreddit_mentions | subreddit | |
---|---|---|---|---|---|---|
0 | BOOMco Halo Covenant Needler Blaster | Toys & Games | https://www.amazon.com/BOOMco-Halo-Covenant-Ne... | 4.0 | 4 | r/gaming |
1 | Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... | Electronics | https://www.amazon.com/Raspberry-Model-A1-2GHz... | 19.0 | 3 | r/gaming |
2 | CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... | Electronics | https://www.amazon.com/CanaKit-Raspberry-Suppl... | 7.0 | 3 | r/gaming |
3 | Panasonic K-KJ17MCA4BA Advanced Individual Cel... | Electronics | https://www.amazon.com/Panasonic-Advanced-Indi... | 29.0 | 2 | r/gaming |
4 | Mayflash GameCube Controller Adapter for Wii U... | Electronics | https://www.amazon.com/GameCube-Controller-Ada... | 24.0 | 2 | r/gaming |
1 | pd.concat([gaming_products,movie_products]) |
name | category | amazon_link | total_mentions | subreddit_mentions | subreddit | |
---|---|---|---|---|---|---|
0 | BOOMco Halo Covenant Needler Blaster | Toys & Games | https://www.amazon.com/BOOMco-Halo-Covenant-Ne... | 4.0 | 4 | r/gaming |
1 | Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... | Electronics | https://www.amazon.com/Raspberry-Model-A1-2GHz... | 19.0 | 3 | r/gaming |
2 | CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... | Electronics | https://www.amazon.com/CanaKit-Raspberry-Suppl... | 7.0 | 3 | r/gaming |
3 | Panasonic K-KJ17MCA4BA Advanced Individual Cel... | Electronics | https://www.amazon.com/Panasonic-Advanced-Indi... | 29.0 | 2 | r/gaming |
4 | Mayflash GameCube Controller Adapter for Wii U... | Electronics | https://www.amazon.com/GameCube-Controller-Ada... | 24.0 | 2 | r/gaming |
... | ... | ... | ... | ... | ... | ... |
298 | Welcome to Night Vale CD: A Novel | Books | https://www.amazon.com/Welcome-Night-Vale-CD-N... | 1.0 | 1 | r/movies |
299 | Ran (StudioCanal Collection) [Blu-ray] | Movies & TV | https://www.amazon.com/StudioCanal-Collection-... | 1.0 | 1 | r/movies |
300 | The Art of John Alvin | Books | https://www.amazon.com/Art-John-Alvin-Andrea/d... | 1.0 | 1 | r/movies |
301 | Apocalypto [Blu-ray] | Movies & TV | https://www.amazon.com/Apocalypto-Blu-ray-Rudy... | 1.0 | 1 | r/movies |
302 | Cinelinx: A Card Game for People Who Love Movi... | Toys & Games | https://www.amazon.com/Cinelinx-Card-Game-Peop... | 1.0 | 1 | r/movies |
796 rows × 6 columns
4.Kaggle 上的举重数据库数据集包括一张用于举重比赛的 CSV 表和一张单独的举重参赛者表。运行下面的单元格将这些数据集加载到数据框中:
1 | powerlifting_meets = pd.read_csv("datasets/openpowerlifting/v1/meets.csv") |
1 | powerlifting_competitors |
MeetID | Name | Sex | Equipment | Age | Division | BodyweightKg | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Angie Belk Terry | F | Wraps | 47.0 | Mst 45-49 | 59.60 | 60 | NaN | 47.63 | NaN | 20.41 | NaN | 70.31 | 138.35 | 1 | 155.05 |
1 | 0 | Dawn Bogart | F | Single-ply | 42.0 | Mst 40-44 | 58.51 | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
2 | 0 | Dawn Bogart | F | Single-ply | 42.0 | Open Senior | 58.51 | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
3 | 0 | Dawn Bogart | F | Raw | 42.0 | Open Senior | 58.51 | 60 | NaN | NaN | NaN | 95.25 | NaN | NaN | 95.25 | 1 | 108.29 |
4 | 0 | Destiny Dula | F | Raw | 18.0 | Teen 18-19 | 63.68 | 67.5 | NaN | NaN | NaN | 31.75 | NaN | 90.72 | 122.47 | 1 | 130.47 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
386409 | 8481 | William Barabas | M | Multi-ply | NaN | Elite | 113.58 | 125 | NaN | NaN | NaN | NaN | NaN | 347.50 | 347.50 | 2 | 202.60 |
386410 | 8481 | Justin Zottl | M | Multi-ply | NaN | Elite | 119.02 | 125 | NaN | NaN | NaN | NaN | NaN | 322.50 | 322.50 | 3 | 185.77 |
386411 | 8481 | Jake Anderson | M | Multi-ply | NaN | Elite | 120.29 | 125 | NaN | NaN | NaN | NaN | NaN | 367.50 | 367.50 | 1 | 211.17 |
386412 | 8481 | Jeff Bumanglag | M | Multi-ply | NaN | Elite | 126.73 | 140 | NaN | NaN | NaN | NaN | NaN | 320.00 | 320.00 | 3 | 181.85 |
386413 | 8481 | Shane Hammock | M | Multi-ply | NaN | Elite | 129.46 | 140 | NaN | NaN | NaN | NaN | NaN | 362.50 | 362.50 | 2 | 205.18 |
386414 rows × 17 columns
两个表都包含对 MeetID
的引用,MeetID 是数据库中包含的每次会议(竞赛)的唯一键。使用它,生成一个将两个表合并为一个的数据集。
1 | left = powerlifting_meets.set_index(['MeetID']) |
MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName | Name | Sex | Equipment | ... | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MeetID | |||||||||||||||||||||
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Angie Belk Terry | F | Wraps | ... | 60 | NaN | 47.63 | NaN | 20.41 | NaN | 70.31 | 138.35 | 1 | 155.05 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Single-ply | ... | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Single-ply | ... | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Raw | ... | 60 | NaN | NaN | NaN | 95.25 | NaN | NaN | 95.25 | 1 | 108.29 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Destiny Dula | F | Raw | ... | 67.5 | NaN | NaN | NaN | 31.75 | NaN | 90.72 | 122.47 | 1 | 130.47 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | William Barabas | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 347.50 | 347.50 | 2 | 202.60 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Justin Zottl | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 322.50 | 322.50 | 3 | 185.77 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Jake Anderson | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 367.50 | 367.50 | 1 | 211.17 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Jeff Bumanglag | M | Multi-ply | ... | 140 | NaN | NaN | NaN | NaN | NaN | 320.00 | 320.00 | 3 | 181.85 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Shane Hammock | M | Multi-ply | ... | 140 | NaN | NaN | NaN | NaN | NaN | 362.50 | 362.50 | 2 | 205.18 |
386414 rows × 23 columns
完成
1 |
test for vue
aaaaaaa