full$Title <- gsub('(.*, )|(\\..*)', '', full$Name)
...
[956] "Master" "Mrs" "Miss" "Mr" "Mr"
[961] "Mrs" "Miss" "Mr" "Miss" "Mr"
[966] "Miss" "Mr" "Mr" "Mrs" "Mr"
[971] "Miss" "Master" "Mr" "Mr" "Mr"
[976] "Mr" "Mr" "Miss" "Miss" "Miss"
[981] "Master" "Mrs" "Mr" "Mrs" "Mr"
[986] "Mr" "Mr" "Mrs" "Mr" "Miss"
[991] "Mr" "Mrs" "Mr" "Mr" "Mr"
[996] "Mrs" "Mr" "Mr" "Mr" "Mr"
----------------------------
The raw data (full$Name) looked like this:
[991] "Nancarrow, Mr. William Henry"
[992] "Stengel, Mrs. Charles Emil Henry (Annie May Morris)"
[993] "Weisz, Mr. Leopold"
[994] "Foley, Mr. William"
[995] "Johansson Palmquist, Mr. Oskar Leander"
[996] "Thomas, Mrs. Alexander (Thamine Thelma\")\""
[997] "Holthen, Mr. Johan Martin"
[998] "Buckley, Mr. Daniel"
[999] "Ryan, Mr. Edward"
[1000] "Willer, Mr. Aaron (Abi Weller\")\""
This appears to be an application of the gsub() function with a regular expression.
For the meaning of the expression, see the answer at the address below.
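To see what the pattern does, each half of it can be applied on its own. The following is a minimal sketch, not part of the original kernel: the vector `name` and the intermediate variables `no_surname` and `no_tail` are names made up for illustration, and the three sample strings are copied from the raw data above.

```r
# Sample names copied from the raw data shown above
name <- c("Nancarrow, Mr. William Henry",
          "Stengel, Mrs. Charles Emil Henry (Annie May Morris)",
          "Johansson Palmquist, Mr. Oskar Leander")

# First alternative only: delete everything up to and including ", "
no_surname <- gsub('(.*, )', '', name)
print(no_surname)
# "Mr. William Henry"
# "Mrs. Charles Emil Henry (Annie May Morris)"
# "Mr. Oskar Leander"

# Second alternative only: delete from the first period to the end
no_tail <- gsub('(\\..*)', '', name)
print(no_tail)
# "Nancarrow, Mr" "Stengel, Mrs" "Johansson Palmquist, Mr"

# Both alternatives joined with |, as in the original line
title <- gsub('(.*, )|(\\..*)', '', name)
print(title)
# "Mr" "Mrs" "Mr"
```

Because gsub() replaces every match, both the part before the title and the part after it are removed in a single call, leaving only the title.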
http://www.talkstats.com/threads/how-to-extract-titles-out-of-a-full-name-using-gsub.69337/
Match any character (.) zero or more times (*) up to a comma and a space (, ) OR (|) a period (\\.) followed by any character (.) zero or more times (*).
Essentially, (.*, ) eats up the string up to and including the comma and space, while (\\..*) eats up the string from the period onward.
I would say there are more robust ways to extract the title. I'd use an extraction rather than a subbing approach. Base R can do extraction, but it's more complicated than the stringi package, which has the stri_extract_all_regex function. -- trinker
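The extraction approach mentioned in the answer could look roughly like the sketch below. It uses base R (regexpr() and regmatches()) rather than stringi so that no extra package is needed. The pattern, the sample vector `name`, and the made-up string "No title here" are assumptions for illustration, not code from the answer or the original kernel.

```r
# Three names from the raw data plus one made-up string without a title
name <- c("Nancarrow, Mr. William Henry",
          "Thomas, Mrs. Alexander (Thamine Thelma\")\"",
          "Johansson Palmquist, Mr. Oskar Leander",
          "No title here")

# Match the text between ", " and the next period.
# perl = TRUE is required for the lookbehind (?<=, ) and lookahead (?=\\.)
m <- regexpr("(?<=, )[^.]+(?=\\.)", name, perl = TRUE)

# regmatches() returns only the matched elements, so fill the result
# by index and leave NA where nothing matched
title <- rep(NA_character_, length(name))
title[m != -1] <- regmatches(name, m)
print(title)
# "Mr" "Mrs" "Mr" NA

# For comparison, the substitution approach returns an unmatched
# string unchanged instead of flagging it
print(gsub('(.*, )|(\\..*)', '', "No title here"))
# "No title here"
```

One practical difference between the two approaches is visible in the last element: extraction yields NA when a name has no recognizable title, whereas substitution silently passes the whole string through as if it were a title.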
For text data analysis, it seems better to study regular expressions first and then do the analysis.