第二章

原文:Chapter 2

译者:飞龙

协议:CC BY-NC-SA 4.0

  1. # 通常的开头
  2. import pandas as pd
  3. # 使图表更大更漂亮
  4. pd.set_option('display.mpl_style', 'default')
  5. pd.set_option('display.line_width', 5000)
  6. pd.set_option('display.max_columns', 60)
  7. figsize(15, 5)

我们将在这里使用一个新的数据集,来演示如何处理更大的数据集。 这是来自 NYC Open Data 的 311 个服务请求的子集。

  1. complaints = pd.read_csv('../data/311-service-requests.csv')

2.1 里面究竟有什么?(总结)

当你查看一个大型数据框架,而不是显示数据框架的内容,它会显示一个摘要。 这包括所有列,以及每列中有多少非空值。

  1. complaints
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 111069 entries, 0 to 111068
  3. Data columns (total 52 columns):
  4. Unique Key 111069 non-null values
  5. Created Date 111069 non-null values
  6. Closed Date 60270 non-null values
  7. Agency 111069 non-null values
  8. Agency Name 111069 non-null values
  9. Complaint Type 111069 non-null values
  10. Descriptor 111068 non-null values
  11. Location Type 79048 non-null values
  12. Incident Zip 98813 non-null values
  13. Incident Address 84441 non-null values
  14. Street Name 84438 non-null values
  15. Cross Street 1 84728 non-null values
  16. Cross Street 2 84005 non-null values
  17. Intersection Street 1 19364 non-null values
  18. Intersection Street 2 19366 non-null values
  19. Address Type 102247 non-null values
  20. City 98860 non-null values
  21. Landmark 95 non-null values
  22. Facility Type 110938 non-null values
  23. Status 111069 non-null values
  24. Due Date 39239 non-null values
  25. Resolution Action Updated Date 96507 non-null values
  26. Community Board 111069 non-null values
  27. Borough 111069 non-null values
  28. X Coordinate (State Plane) 98143 non-null values
  29. Y Coordinate (State Plane) 98143 non-null values
  30. Park Facility Name 111069 non-null values
  31. Park Borough 111069 non-null values
  32. School Name 111069 non-null values
  33. School Number 111052 non-null values
  34. School Region 110524 non-null values
  35. School Code 110524 non-null values
  36. School Phone Number 111069 non-null values
  37. School Address 111069 non-null values
  38. School City 111069 non-null values
  39. School State 111069 non-null values
  40. School Zip 111069 non-null values
  41. School Not Found 38984 non-null values
  42. School or Citywide Complaint 0 non-null values
  43. Vehicle Type 99 non-null values
  44. Taxi Company Borough 117 non-null values
  45. Taxi Pick Up Location 1059 non-null values
  46. Bridge Highway Name 185 non-null values
  47. Bridge Highway Direction 185 non-null values
  48. Road Ramp 184 non-null values
  49. Bridge Highway Segment 223 non-null values
  50. Garage Lot Name 49 non-null values
  51. Ferry Direction 37 non-null values
  52. Ferry Terminal Name 336 non-null values
  53. Latitude 98143 non-null values
  54. Longitude 98143 non-null values
  55. Location 98143 non-null values
  56. dtypes: float64(5), int64(1), object(46)

2.2 选择列和行

为了选择一列,使用列名称作为索引,像这样:

  1. complaints['Complaint Type']
  1. 0 Noise - Street/Sidewalk
  2. 1 Illegal Parking
  3. 2 Noise - Commercial
  4. 3 Noise - Vehicle
  5. 4 Rodent
  6. 5 Noise - Commercial
  7. 6 Blocked Driveway
  8. 7 Noise - Commercial
  9. 8 Noise - Commercial
  10. 9 Noise - Commercial
  11. 10 Noise - House of Worship
  12. 11 Noise - Commercial
  13. 12 Illegal Parking
  14. 13 Noise - Vehicle
  15. 14 Rodent
  16. ...
  17. 111054 Noise - Street/Sidewalk
  18. 111055 Noise - Commercial
  19. 111056 Street Sign - Missing
  20. 111057 Noise
  21. 111058 Noise - Commercial
  22. 111059 Noise - Street/Sidewalk
  23. 111060 Noise
  24. 111061 Noise - Commercial
  25. 111062 Water System
  26. 111063 Water System
  27. 111064 Maintenance or Facility
  28. 111065 Illegal Parking
  29. 111066 Noise - Street/Sidewalk
  30. 111067 Noise - Commercial
  31. 111068 Blocked Driveway
  32. Name: Complaint Type, Length: 111069, dtype: object

要获得DataFrame的前 5 行,我们可以使用切片:df [:5]

这是一个了解数据框架中存在什么信息的很好方式 - 花一点时间来查看内容并获得此数据集的感觉。

  1. complaints[:5]
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
1 26593698 10/31/2013 02:01:04 AM NaN NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 11378 58 AVENUE 58 AVENUE 58 PLACE 59 STREET NaN NaN BLOCKFACE MASPETH NaN Precinct Open 10/31/2013 10:01:04 AM NaN 05 QUEENS QUEENS 1009349 201984 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.721041 -73.909453 (40.721040535628305, -73.90945306791765)
2 26594139 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10032 4060 BROADWAY BROADWAY WEST 171 STREET WEST 172 STREET NaN NaN ADDRESS NEW YORK NaN Precinct Closed 10/31/2013 10:00:24 AM 10/31/2013 02:39:42 AM 12 MANHATTAN MANHATTAN 1001088 246531 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.843330 -73.939144 (40.84332975466513, -73.93914371913482)
3 26595721 10/31/2013 01:56:23 AM 10/31/2013 02:21:48 AM NYPD New York City Police Department Noise - Vehicle Car/Truck Horn Street/Sidewalk 10023 WEST 72 STREET WEST 72 STREET COLUMBUS AVENUE AMSTERDAM AVENUE NaN NaN BLOCKFACE NEW YORK NaN Precinct Closed 10/31/2013 09:56:23 AM 10/31/2013 02:21:10 AM 07 MANHATTAN MANHATTAN 989730 222727 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.778009 -73.980213 (40.7780087446372, -73.98021349023975)
4 26590930 10/31/2013 01:53:44 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Condition Attracting Rodents Vacant Lot 10027 WEST 124 STREET WEST 124 STREET LENOX AVENUE ADAM CLAYTON POWELL JR BOULEVARD NaN NaN BLOCKFACE NEW YORK NaN N/A Pending 11/30/2013 01:53:44 AM 10/31/2013 01:59:54 AM 10 MANHATTAN MANHATTAN 998815 233545 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.807691 -73.947387 (40.80769092704951,

我们可以组合它们来获得一列的前五行。

  1. complaints['Complaint Type'][:5]
  1. 0 Noise - Street/Sidewalk
  2. 1 Illegal Parking
  3. 2 Noise - Commercial
  4. 3 Noise - Vehicle
  5. 4 Rodent
  6. Name: Complaint Type, dtype: object

并且无论我们以什么方向:

  1. complaints[:5]['Complaint Type']
  1. 0 Noise - Street/Sidewalk
  2. 1 Illegal Parking
  3. 2 Noise - Commercial
  4. 3 Noise - Vehicle
  5. 4 Rodent
  6. Name: Complaint Type, dtype: object

2.3 选择多列

如果我们只关心投诉类型和区,但不关心其余的信息怎么办? Pandas 使它很容易选择列的一个子集:只需将所需列的列表用作索引。

  1. complaints[['Complaint Type', 'Borough']]
  1. <class 'pandas.core.frame.DataFrame'>
  2. Int64Index: 111069 entries, 0 to 111068
  3. Data columns (total 2 columns):
  4. Complaint Type 111069 non-null values
  5. Borough 111069 non-null values
  6. dtypes: object(2)

这会向我们展示总结,我们可以获取前 10 列:

  1. complaints[['Complaint Type', 'Borough']][:10]
Complaint Type Borough
0 Noise - Street/Sidewalk QUEENS
1 Illegal Parking QUEENS
2 Noise - Commercial MANHATTAN
3 Noise - Vehicle MANHATTAN
4 Rodent MANHATTAN
5 Noise - Commercial QUEENS
6 Blocked Driveway QUEENS
7 Noise - Commercial QUEENS
8 Noise - Commercial MANHATTAN
9 Noise - Commercial BROOKLYN

2.4 什么是最常见的投诉类型?

这是个易于回答的问题,我们可以调用.value_counts()方法:

  1. complaints['Complaint Type'].value_counts()
  1. HEATING 14200
  2. GENERAL CONSTRUCTION 7471
  3. Street Light Condition 7117
  4. DOF Literature Request 5797
  5. PLUMBING 5373
  6. PAINT - PLASTER 5149
  7. Blocked Driveway 4590
  8. NONCONST 3998
  9. Street Condition 3473
  10. Illegal Parking 3343
  11. Noise 3321
  12. Traffic Signal Condition 3145
  13. Dirty Conditions 2653
  14. Water System 2636
  15. Noise - Commercial 2578
  16. ...
  17. Opinion for the Mayor 2
  18. Window Guard 2
  19. DFTA Literature Request 2
  20. Legal Services Provider Complaint 2
  21. Open Flame Permit 1
  22. Snow 1
  23. Municipal Parking Facility 1
  24. X-Ray Machine/Equipment 1
  25. Stalled Sites 1
  26. DHS Income Savings Requirement 1
  27. Tunnel Condition 1
  28. Highway Sign - Damaged 1
  29. Ferry Permit 1
  30. Trans Fat 1
  31. DWD 1
  32. Length: 165, dtype: int64

如果我们想要最常见的 10 个投诉类型,我们可以这样:

  1. complaint_counts = complaints['Complaint Type'].value_counts()
  2. complaint_counts[:10]
  1. HEATING 14200
  2. GENERAL CONSTRUCTION 7471
  3. Street Light Condition 7117
  4. DOF Literature Request 5797
  5. PLUMBING 5373
  6. PAINT - PLASTER 5149
  7. Blocked Driveway 4590
  8. NONCONST 3998
  9. Street Condition 3473
  10. Illegal Parking 3343
  11. dtype: int64

但是还可以更好,我们可以绘制出来!

  1. complaint_counts[:10].plot(kind='bar')
  1. <matplotlib.axes.AxesSubplot at 0x7ba2290>

第二章 - 图1