Titanic 作为Kaggle中最经典的机器学习入门算法,总结了机器学习的思路
导入数据 1 2 3 import  numpy as  np import  pandas as  pd 
1 2 3 4 5 6 7 8 import  osfor  dirname, _, filenames in  os.walk('datasets/titanic' ):    for  filename in  filenames:         print (os.path.join(dirname, filename)) 
datasets/titanic\gender_submission.csv
datasets/titanic\test.csv
datasets/titanic\train.csv
简单看一眼数据 1 2 train_data = pd.read_csv("datasets/titanic/train.csv" ) train_data.head() 
  
    
      PassengerId 
      Survived 
      Pclass 
      Name 
      Sex 
      Age 
      SibSp 
      Parch 
      Ticket 
      Fare 
      Cabin 
      Embarked 
     
   
  
    
      0 
      1 
      0 
      3 
      Braund, Mr. Owen Harris 
      male 
      22.0 
      1 
      0 
      A/5 21171 
      7.2500 
      NaN 
      S 
     
    
      1 
      2 
      1 
      1 
      Cumings, Mrs. John Bradley (Florence Briggs Th... 
      female 
      38.0 
      1 
      0 
      PC 17599 
      71.2833 
      C85 
      C 
     
    
      2 
      3 
      1 
      3 
      Heikkinen, Miss. Laina 
      female 
      26.0 
      0 
      0 
      STON/O2. 3101282 
      7.9250 
      NaN 
      S 
     
    
      3 
      4 
      1 
      1 
      Futrelle, Mrs. Jacques Heath (Lily May Peel) 
      female 
      35.0 
      1 
      0 
      113803 
      53.1000 
      C123 
      S 
     
    
      4 
      5 
      0 
      3 
      Allen, Mr. William Henry 
      male 
      35.0 
      0 
      0 
      373450 
      8.0500 
      NaN 
      S 
     
   
 
1 2 test_data = pd.read_csv("datasets/titanic/test.csv" ) test_data.head() 
  
    
      PassengerId 
      Pclass 
      Name 
      Sex 
      Age 
      SibSp 
      Parch 
      Ticket 
      Fare 
      Cabin 
      Embarked 
     
   
  
    
      0 
      892 
      3 
      Kelly, Mr. James 
      male 
      34.5 
      0 
      0 
      330911 
      7.8292 
      NaN 
      Q 
     
    
      1 
      893 
      3 
      Wilkes, Mrs. James (Ellen Needs) 
      female 
      47.0 
      1 
      0 
      363272 
      7.0000 
      NaN 
      S 
     
    
      2 
      894 
      2 
      Myles, Mr. Thomas Francis 
      male 
      62.0 
      0 
      0 
      240276 
      9.6875 
      NaN 
      Q 
     
    
      3 
      895 
      3 
      Wirz, Mr. Albert 
      male 
      27.0 
      0 
      0 
      315154 
      8.6625 
      NaN 
      S 
     
    
      4 
      896 
      3 
      Hirvonen, Mrs. Alexander (Helga E Lindqvist) 
      female 
      22.0 
      1 
      1 
      3101298 
      12.2875 
      NaN 
      S 
     
   
 
用原始的方法看一下性别对幸存的影响 1 2 3 4 women = train_data.loc[train_data.Sex == 'female' ]["Survived" ] rate_women = sum (women)/len (women) print ("% of women who survived:" , rate_women)
% of women who survived: 0.7420382165605095
1 2 3 4 men = train_data.loc[train_data.Sex == 'male' ]["Survived" ] rate_men = sum (men)/len (men) print ("% of men who survived:" , rate_men)
% of men who survived: 0.18890814558058924
使用决策树的方式对数据进行预测 原理如下
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 from  sklearn.ensemble import  RandomForestClassifiery = train_data["Survived" ] features = ["Pclass" , "Sex" , "SibSp" , "Parch" ] X = pd.get_dummies(train_data[features]) X_test = pd.get_dummies(test_data[features]) model = RandomForestClassifier(n_estimators=100 , max_depth=5 , random_state=1 ) model.fit(X, y) predictions = model.predict(X_test) output = pd.DataFrame({'PassengerId' : test_data.PassengerId, 'Survived' : predictions}) output.to_csv('submission.csv' , index=False ) print ("Your submission was successfully saved!" )
Your submission was successfully saved!
1 train_data.describe(include='all' ) 
  
    
      PassengerId 
      Survived 
      Pclass 
      Name 
      Sex 
      Age 
      SibSp 
      Parch 
      Ticket 
      Fare 
      Cabin 
      Embarked 
     
   
  
    
      count 
      891.000000 
      891.000000 
      891.000000 
      891 
      891 
      714.000000 
      891.000000 
      891.000000 
      891 
      891.000000 
      204 
      889 
     
    
      unique 
      NaN 
      NaN 
      NaN 
      891 
      2 
      NaN 
      NaN 
      NaN 
      681 
      NaN 
      147 
      3 
     
    
      top 
      NaN 
      NaN 
      NaN 
      Braund, Mr. Owen Harris 
      male 
      NaN 
      NaN 
      NaN 
      347082 
      NaN 
      B96 B98 
      S 
     
    
      freq 
      NaN 
      NaN 
      NaN 
      1 
      577 
      NaN 
      NaN 
      NaN 
      7 
      NaN 
      4 
      644 
     
    
      mean 
      446.000000 
      0.383838 
      2.308642 
      NaN 
      NaN 
      29.699118 
      0.523008 
      0.381594 
      NaN 
      32.204208 
      NaN 
      NaN 
     
    
      std 
      257.353842 
      0.486592 
      0.836071 
      NaN 
      NaN 
      14.526497 
      1.102743 
      0.806057 
      NaN 
      49.693429 
      NaN 
      NaN 
     
    
      min 
      1.000000 
      0.000000 
      1.000000 
      NaN 
      NaN 
      0.420000 
      0.000000 
      0.000000 
      NaN 
      0.000000 
      NaN 
      NaN 
     
    
      25% 
      223.500000 
      0.000000 
      2.000000 
      NaN 
      NaN 
      20.125000 
      0.000000 
      0.000000 
      NaN 
      7.910400 
      NaN 
      NaN 
     
    
      50% 
      446.000000 
      0.000000 
      3.000000 
      NaN 
      NaN 
      28.000000 
      0.000000 
      0.000000 
      NaN 
      14.454200 
      NaN 
      NaN 
     
    
      75% 
      668.500000 
      1.000000 
      3.000000 
      NaN 
      NaN 
      38.000000 
      1.000000 
      0.000000 
      NaN 
      31.000000 
      NaN 
      NaN 
     
    
      max 
      891.000000 
      1.000000 
      3.000000 
      NaN 
      NaN 
      80.000000 
      8.000000 
      6.000000 
      NaN 
      512.329200 
      NaN 
      NaN