
Machine learning with sparse, high-dimensional and large datasets

Published Feb 08, 2019 | Last updated Aug 06, 2019

High-dimensional datasets arise in diverse areas ranging from computational advertising to natural language processing. Learning in such high dimensions can be limited by computation and/or memory. To this end, dimensionality reduction methods like (linear/nonlinear) principal component analysis or manifold learning are used to transform the data into a lower dimensional space. However, these dimensionality reduction techniques are sometimes not applicable, e.g., for sparse datasets whose features are independent and whose data lie on multiple lower dimensional manifolds.

In this article, we discuss and implement an approach to learning over such sparse, high-dimensional datasets. Data dimensionality and data size are the two facets of this problem. To reduce data dimensionality, feature hashing offers a scalable and computationally efficient feature representation. A large dataset may still pose computational and memory limitations on a personal computer. To this end, we utilize mini-batch gradient descent to train the model iteratively.

In this post, we use Python tools like Pandas, Numpy and Scikit-Learn to apply feature hashing and mini-batch learning to sparse, high-dimensional and large datasets. Code snippets with detailed comments are included for practical implementation within Python's data science framework.

Data creation and storage

We demonstrate a practical use of feature hashing and mini-batch learning on a binary classification problem. For this problem, we created a synthetic dataset of 1.0 million data points. Each data point comprises a bag of alphanumeric tokens along with a label. To create the dataset, we first generated 65464 alphanumeric features by randomly sampling letters and digits. We then generated the data for each of the two classes (0 and 1) separately, with each class drawing exclusively on half of the features, i.e., 32732. To induce sparsity, no feature is assigned to more than 0.3% of the data points. Note that this construction yields perfectly separable classes.
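The following is a minimal sketch of how such a dataset could be generated. The token length, the number of tokens per data point, the record format ("label,token1,token2,...") and the file name train_data.csv are assumptions for illustration; the original generation script is not shown in the post.

import random
import string

# Assumed parameters; the original generation script is not part of the post.
n_features = 65464          # total number of alphanumeric tokens
n_samples = 1_000_000       # total number of data points
tokens_per_sample = 20      # assumed bag size per data point
chars = string.ascii_lowercase + string.digits

# Create the pool of alphanumeric features (tokens).
random.seed(0)
features = list({''.join(random.choices(chars, k=8)) for _ in range(n_features)})

# Split the features in half, one half per class, so the classes are perfectly separable.
half = len(features) // 2
class_features = {0: features[:half], 1: features[half:]}

# Write records as "label,token1,token2,..." (assumed record format).
# With these numbers, each feature appears in roughly 0.03% of the data points,
# comfortably within the 0.3% sparsity cap described above.
with open('train_data.csv', 'w') as f:
    for i in range(n_samples):
        label = i % 2
        bag = random.sample(class_features[label], tokens_per_sample)
        f.write(str(label) + ',' + ','.join(bag) + '\n')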

Mini-batch learning with feature hashing

The following code demonstrates a practical implementation of iterative training with feature hashing. The function train(model, lines) updates the model over one epoch using mini-batch training. The training dataset is stored in lines as a list of strings, with each string being one record. It is a generic function that is applicable to any model class within sklearn that supports the partial_fit() method. Popular classification models that support partial_fit() include logistic regression, support vector machines, and artificial neural networks. The function performs mini-batch training with a batch size of 1000 samples. In each iteration, it constructs a pandas DataFrame out of 1000 samples from lines, computes the feature hashing according to a pre-defined hashing dimension, and updates the model.

The second function, hash_representation(df), is used by the train() function. It computes the feature hashing of the input strings, which are stored as a column of the input df, and returns the resulting feature representation together with the class labels as a new DataFrame.

See detailed comments within the functions for better understanding.

import math

import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.utils import shuffle

def hash_representation(df):
    
    def vectorize(line):
        ''' This function transforms a line comprising a bag of words into its hash representation. 
        Input to this function is a line that comprises comma-separated words.
        Output is a numpy array which is the hash representation of the line according to a hash 
        template h (global object). 
        '''
        # Construct a dictionary comprising each word in the line as keys and 1 as its value. 
        D = [{key: 1 for key in line.split('\n')[0].split(',')}]
        # Transform D to a numpy array according to the hash template h. 
        return h.transform(D).toarray()[0]
    
    # Obtain the hash representation of the input strings. 
    s = df['inputs'].apply(vectorize)
    # Give non-numeric names to the columns of the resulting data frame. 
    columns = ['col_' + str(i) for i in range(len(s.values[0]))]
    # (pd.DataFrame.from_items is deprecated; build the frame from the arrays directly.)
    new_df = pd.DataFrame(list(s.values), index=s.index, columns=columns)
    new_df['labels'] = df['labels']
    # Return a dataframe that has the hash representation of the inputs as features along with the class labels. 
    return new_df
def train(model, lines):

    ''' This function updates the model over 1 epoch of the training data.     
    Inputs to this function are:    
    model ~ an sklearn model object for which the partial_fit() method is valid.
    lines ~ a list of strings, where each string is a record, example or measurement.     
    Output of this function is     
    model ~ the updated model.     
    '''
    
    ''' We construct a pandas DataFrame out of these lines and use its str 
    functionality to create new columns; one for the input strings and the other for 
    the output labels. '''
    df = pd.DataFrame(lines, columns=['raw_data'])
    # Split each record only on the first comma: the label comes first and the
    # remaining comma-separated tokens form the input string.
    parts = df['raw_data'].str.split(',', n=1)
    df['labels'] = parts.apply(lambda x: x[0]).astype(int)
    df['inputs'] = parts.apply(lambda x: x[1])
    del df['raw_data']
    
    ''' We shuffle the data to ensure randomness and avoid any cycles in training. '''
    df = shuffle(df)
    ''' The batch size specifies the number of samples on which each model update is based. '''
    batch_size = 1000
    # Total number of batches.
    total_batches = math.ceil(df.shape[0] / batch_size)
    
    # Iterate over each batch. 
    for batch_numb in range(total_batches):
        st = batch_numb * batch_size # Starting index (inclusive) of the batch. 
        en = (1 + batch_numb) * batch_size # End index (exclusive) of the batch. 
        ''' The last batch may comprise fewer data points than batch_size. 
            In that case, we clip the end index to the size of the data. 
        '''
        en = min(en, df.shape[0]) 
        ''' For the subset of lines comprising the current batch, 
            we construct a new data frame that comprises the hash representation of the 
            inputs along with the outputs. 
        '''
        tmp_df = hash_representation(df.iloc[st:en])
        tmp_labels = tmp_df['labels'].copy()
        del tmp_df['labels']
        
        # Model update using the current batch. 
        model.partial_fit(np.array(tmp_df), np.array(tmp_labels), classes=[0, 1])
        
    # Return the model after updating it with 1 epoch. 
    return model
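
As an illustration, a minimal driver for these functions could look as follows. The file name train_data.csv, the 80% training slice and the record format are assumptions carried over from the data-creation sketch above; the hashing dimension and the network architecture anticipate the choices discussed in the next sections.

from sklearn.feature_extraction import FeatureHasher
from sklearn.neural_network import MLPClassifier

# Hashing template used globally by hash_representation(); the dimension is the one chosen later in the post.
h = FeatureHasher(n_features=12800)

# Any sklearn model that supports partial_fit() works; the architecture is discussed in the next section.
model = MLPClassifier(hidden_layer_sizes=(100,))

# Read the raw records ("label,token1,token2,...") and take the first 80% for training (assumed split).
with open('train_data.csv') as f:
    lines = f.readlines()
train_lines = lines[:800_000]

# One epoch of mini-batch training.
model = train(model, train_lines)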

Model selection and hyperparameters

To demonstrate the effectiveness of our mini-batch learning with feature hashing approach, we split the dataset into 80% training, 10% validation and 10% testing. Note that the 10% test split comprises 100,000 samples, which is sufficient for performance evaluation. We evaluated logistic regression, support vector machines and artificial neural networks with multiple architectural choices. We found that artificial neural networks yield the best validation performance. While deeper architectures perform slightly better, an ANN with one hidden layer of 100 neurons yields validation performance that is sufficiently good to demonstrate our ideas. We also found that the validation performance changes only slightly beyond the first epoch, suggesting that the validation performance after one epoch is representative of the model's performance. In our subsequent experiments, we therefore used an ANN with one hidden layer of 100 neurons as our model and trained it for just one epoch.
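
A minimal sketch of how the selected model could be trained and validated is given below; the evaluate() helper, as well as the train_lines and val_lines splits, are hypothetical and not part of the original code.

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Selected model: a single hidden layer of 100 neurons, trained for one epoch.
model = MLPClassifier(hidden_layer_sizes=(100,))
model = train(model, train_lines)  # 80% training split, prepared as in the driver above

def evaluate(model, lines):
    ''' Hypothetical helper: accuracy of the model over a held-out list of records. '''
    df = pd.DataFrame(lines, columns=['raw_data'])
    parts = df['raw_data'].str.split(',', n=1)
    df['labels'] = parts.apply(lambda x: x[0]).astype(int)
    df['inputs'] = parts.apply(lambda x: x[1])
    hashed = hash_representation(df)
    y = hashed.pop('labels')
    return model.score(np.array(hashed), np.array(y))

print('validation accuracy:', evaluate(model, val_lines))  # val_lines: the 10% validation split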

Optimal hashing dimension

Feature hashing can result in hashing collisions, where two different data points yield the same hashed representation. A collision between samples of one class and samples of the other deteriorates classification accuracy. As the hashing dimension increases, smaller and smaller differences between inputs lead to different hashed representations; hence, the number of collisions decreases. Deficiencies in training, possible model bias and hashing collisions are the important sources of error in this binary classification problem. While deficiencies in training and possible model bias can also be investigated, we aim here to demonstrate the impact of the hashing dimension on classification performance. A higher hashing dimension is likely to improve classification accuracy, albeit at the cost of computation and memory.
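
As a small, self-contained illustration of collisions (not part of the original experiments), hashing more distinct tokens than there are output columns forces several tokens to share a column:

from sklearn.feature_extraction import FeatureHasher

# Hash 200 distinct tokens into only 64 columns; by the pigeonhole principle many must collide.
hasher = FeatureHasher(n_features=64)
columns = {}
for tok in (f'tok{i}' for i in range(200)):
    col = hasher.transform([{tok: 1}]).nonzero()[1][0]
    columns.setdefault(col, []).append(tok)

shared = {col: toks for col, toks in columns.items() if len(toks) > 1}
print(f'{len(shared)} of 64 columns are shared by more than one token')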

The following creates the feature hashing template from sklearn.feature_extraction, with the hashing dimension as its parameter.

h = FeatureHasher(n_features=hashing_dimension)

We performed numerical experiments with different hashing dimensions and obtained the following validation accuracies.

Hashing dimension    Validation accuracy
1600                 73.9%
3200                 80.4%
6400                 86.8%
12800                91.6%
19200                93.5%
25600                94.8%

Note from the table that the validation performance improves with the hashing dimension, but with diminishing returns. A higher hashing dimension makes training computationally more expensive and requires a larger memory size or a smaller batch size. Hence, the optimal choice of hashing dimension depends on the computing resources and on the point beyond which the returns become negligible. For our case, we chose a hashing dimension of 12800 and evaluated the test performance, which resulted in a test accuracy of 91.4%.

Conclusions

We presented feature hashing, a scalable and robust dimensionality reduction method that may be the only dimensionality reduction option in certain problems and may yield superior performance compared to the popular PCA or manifold learning. We also demonstrated a complete framework for feature hashing with mini-batch training within sklearn. Our numerical experiments demonstrated the effectiveness of this approach and implementation.
