Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
I have been trying to solve this for days. I found a similar problem here, "How can i vectorize list using sklearn DictVectorizer", but the solution there is overly simplified.

I would like to fit some features into a logistic regression model to predict 'chinese' or 'non-chinese'. From a raw_name I extract two features: 1) the last name, and 2) a list of two-letter substrings of the last name; for example, 'Chan' gives ['ch', 'ha', 'an']. But it seems DictVectorizer doesn't accept a list as a value in the dictionary. Following the link above, I wrote a function list_to_dict, which successfully returns dict elements such as:

{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}

but I have no idea how to incorporate that into the my_dict = ... step before applying the DictVectorizer.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
import re
import random
from random import randint
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer

lr = LogisticRegression()
dv = DictVectorizer()

# Get csv file into data frame
data = pd.read_csv("V2-1_2000Records_Processed_SEP2015.csv", header=0, encoding="utf-8")
df = DataFrame(data)

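# Shuffle the rows of the data frame
df_shuffled = df.iloc[np.random.permutation(len(df))]
# reset_index returns a new frame, so the result has to be assigned
df_shuffled = df_shuffled.reset_index(drop=True)

# Assign X and y variables from the shuffled frame
X = df_shuffled.raw_name.values
y = df_shuffled.chineseScan.values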

# Feature extraction functions
def feature_full_last_name(nameString):
    try:
        last_name = nameString.rsplit(None, 1)[-1]
        if len(last_name) > 1: # not accept name with only 1 character
            return last_name
        else: return None
    except: return None

def feature_twoLetters(nameString):
    placeHolder = []
    try:
        for i in range(0, len(nameString)):
            x = nameString[i:i+2]
            if len(x) == 2:
                placeHolder.append(x)
        return placeHolder
    except: return []

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring='+str(i)] = True
        return substring_dict
    except: return None

list_example = ['co', 'or', 'rn', 'ns']
print list_to_dict(list_example)

# Transform format of X variables, and spit out a numpy array for all features
my_dict = [{'two-letter-substrings': feature_twoLetters(feature_full_last_name(i)), 
    'last-name': feature_full_last_name(i), 'dummy': 1} for i in X]

print my_dict[3]

Output:

{'substring=co': True, 'substring=or': True, 'substring=rn': True, 'substring=ns': True}
{'dummy': 1, 'two-letter-substrings': [u'co', u'or', u'rn', u'ns'], 'last-name': u'corns'}

Sample data:

Raw_name    chineseScan
Jack Anderson    non-chinese
Po Lee    chinese

1 Answer

If I have understood correctly, you want a way to encode list values so that the resulting feature dictionary is something DictVectorizer can use. (One year too late, but) something like this can be used, depending on the case:

my_dict_list = []

for i in X:
    # create a new feature dictionary
    feat_dict = {}
    # add the features that are straightforward
    feat_dict['last-name'] = feature_full_last_name(i)
    feat_dict['dummy'] = 1

    # for the features that have a list of values iterate over the values and
    # create a custom feature for each value
    for two_letters in feature_twoLetters(feature_full_last_name(i)):
        # make sure the naming is unique enough so that no other feature
        # unrelated to this will have the same name/ key
        feat_dict['two-letter-substrings-' + two_letters] = True

    # save it to the list of feature dictionaries that will be passed to DictVectorizer
    my_dict_list.append(feat_dict)

print my_dict_list

from sklearn.feature_extraction import DictVectorizer
dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(my_dict_list)
print transformed_x

Output:

[{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
[[ 1.  1.  0.  1.  0.  1.  0.  1.  1.  1.  1.  1.]
 [ 1.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]]
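The same idea can also reuse the list_to_dict function from the question: build the simple features first, then merge the expanded substring keys in with dict.update. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used above. The helper names last_name, two_letter_substrings and name_to_features are made up for illustration, and the two names are the ones from the sample data.

```python
def last_name(raw_name):
    """Return the last whitespace-separated token, or None if it has fewer than 2 characters."""
    parts = raw_name.split()
    if not parts or len(parts[-1]) < 2:
        return None
    return parts[-1]


def two_letter_substrings(text):
    """Return all overlapping 2-character substrings of text (empty list for None or '')."""
    if not text:
        return []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def list_to_dict(substrings, prefix="substring="):
    """Turn a list of values into one boolean feature per value."""
    return {prefix + s: True for s in substrings}


def name_to_features(raw_name):
    """Build a flat feature dict: scalar features first, then the expanded list feature."""
    name = last_name(raw_name)
    features = {"dummy": 1}
    if name is not None:
        # Skip the key entirely when there is no usable last name
        features["last-name"] = name
    features.update(list_to_dict(two_letter_substrings(name)))
    return features


feature_dicts = [name_to_features(n) for n in ["Jack Anderson", "Po Lee"]]
print(feature_dicts[1])
```

Every value in the resulting dictionaries is a string, a number or a boolean, so the list can be handed to DictVectorizer.fit_transform in the same way as my_dict_list above.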

Another thing you could do (though I don't recommend it), if you don't want to create as many features as there are values in your lists, is something like this:

# sorting the values would be a good idea
feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
# or 
feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True

but the first one means that you can't have any duplicate values, and probably neither makes a good feature, especially if you need fine-tuned and detailed ones. Also, two rows now share the feature only when they have exactly the same combination of two-letter substrings, which is much less likely than sharing a few individual substrings, so the classification probably won't do well.
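To make that last point concrete, here is a small sketch (Python 3, no scikit-learn needed) that compares which feature keys two similar names have in common under each encoding. 'Chan' comes from the question; 'Chang' and the helper names are assumptions chosen for illustration.

```python
def two_letter_substrings(text):
    """Return all overlapping 2-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def per_value_keys(name):
    """Feature keys under the recommended encoding: one key per substring."""
    return {"two-letter-substrings-" + s for s in two_letter_substrings(name)}


def joined_key(name):
    """Feature keys under the joined encoding: a single key for the whole sorted list."""
    return {" ".join(sorted(two_letter_substrings(name)))}


a, b = "Chan", "Chang"
shared_per_value = per_value_keys(a) & per_value_keys(b)
shared_joined = joined_key(a) & joined_key(b)

print(len(shared_per_value))  # number of feature keys both names share
print(len(shared_joined))
```

With one key per substring the two names overlap on 'Ch', 'ha' and 'an', so a model can generalise from one to the other. With the joined key they have nothing in common, because "Ch an ha" and "Ch an ha ng" are simply two different strings.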

Output:

[{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
[{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
[[ 1.  0.  1.  1.  0.]
 [ 0.  1.  1.  0.  1.]]
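Coming back to the per-value encoding from the first snippet, and to the original goal of predicting 'chinese' or 'non-chinese', a rough end-to-end sketch could look like the following. It assumes Python 3 and a reasonably recent scikit-learn. The function name_to_features is a hypothetical helper, and apart from the two rows from the sample data the names and labels are made up purely so the example runs. Four rows are far too few to train anything meaningful.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def name_to_features(raw_name):
    """Build one feature dict per name: the last name plus one key per two-letter substring."""
    last = raw_name.split()[-1]
    features = {"last-name": last}
    for i in range(len(last) - 1):
        features["two-letter-substrings-" + last[i:i + 2]] = True
    return features


# Toy rows: the first two are from the sample data, the rest are invented for this sketch.
toy_names = ["Jack Anderson", "Po Lee", "Mary Johnson", "Wei Chan"]
toy_labels = ["non-chinese", "chinese", "non-chinese", "chinese"]

# DictVectorizer one-hot encodes string values and passes booleans through as 1.0 / 0.0
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit([name_to_features(n) for n in toy_names], toy_labels)

# Keys that were never seen during fit are silently ignored at prediction time
prediction = model.predict([name_to_features("Amy Chang")])
print(prediction[0])
```

Wrapping both steps in a pipeline keeps the vectorizer's vocabulary and the classifier together, so new names go through exactly the same encoding that was used for training.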
