HiDi: Pipelines for Embeddings

HiDi is a library for high-dimensional embedding generation for collaborative filtering applications.

Why HiDi?

We created HiDi because generating embeddings for collaborative filtering applications is a labor-intensive process. It involves many data transformations, each of which requires special consideration to get a good result. HiDi simplifies the process by breaking the work into small steps, each of which can be executed in a pipeline.

The unit of work in HiDi is a Transformer. Transformers need only implement one function, transform.
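
To make the idea concrete, here is a minimal, library-free sketch of the transformer pattern. It does not use HiDi's own classes: `LowercaseTransform`, `UniqueTransform`, and `run_pipeline` are hypothetical names, and the `transform(io, **kwargs)` signature returning an `(io, kwargs)` pair is an assumption made for illustration, so check HiDi's API reference before relying on it.

```python
class LowercaseTransform(object):
    """Hypothetical transformer: lowercases every string in the input."""

    def transform(self, io, **kwargs):
        # Assumed contract: take the previous step's output, return the new
        # output along with any keyword arguments to pass downstream.
        return [s.lower() for s in io], kwargs


class UniqueTransform(object):
    """Hypothetical transformer: drops repeats, keeping first-seen order."""

    def transform(self, io, **kwargs):
        seen = set()
        unique = []
        for item in io:
            if item not in seen:
                seen.add(item)
                unique.append(item)
        return unique, kwargs


def run_pipeline(transforms, io=None, **kwargs):
    """Feed each step's output into the next step, in order."""
    for step in transforms:
        io, kwargs = step.transform(io, **kwargs)
    return io


result = run_pipeline([LowercaseTransform(), UniqueTransform()],
                      ["A", "b", "a", "B"])
print(result)
```

Because every step exposes the same single method, steps can be reordered, removed, or swapped without touching the code that runs them.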

OK, How Do I Use It?

The following example builds and runs a complete pipeline, from reading raw data to writing embeddings.

from hidi import inout, clean, matrix, pipeline


# CSV file with link_id and item_id columns
in_files = ['hidi/examples/data/user-item.csv']

# File to write output data to
outfile = 'embeddings.csv'

transforms = [
    inout.ReadTransform(in_files),      # Read data from disk
    clean.DedupeTransform(),            # Dedupe it
    matrix.SparseTransform(),           # Make a sparse user*item matrix
    matrix.SimilarityTransform(),       # To item*item similarity matrix
    matrix.SVDTransform(),              # Perform SVD dimensionality reduction
    matrix.ItemsMatrixToDFTransform(),  # Make a DataFrame with an index
    inout.WriteTransform(outfile)       # Write results to csv
]

pl = pipeline.Pipeline(transforms)
pl.run()
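
For intuition about what the first few steps produce, the following plain-Python sketch walks a tiny data set through deduplication, a user*item structure, and an item*item similarity. It is an illustration only and does not call HiDi: the sample pairs are made up, and Jaccard overlap is an assumed similarity measure chosen for simplicity, which may differ from what `matrix.SimilarityTransform` computes.

```python
from itertools import combinations

# Hypothetical (user, item) interaction pairs, including one duplicate row.
pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"),
         ("u2", "b"), ("u2", "c"), ("u1", "a")]

# Dedupe: a set keeps each (user, item) pair exactly once.
unique_pairs = set(pairs)

# Sparse user*item matrix, stored as item -> set of users who touched it.
users_by_item = {}
for user, item in unique_pairs:
    users_by_item.setdefault(item, set()).add(user)

# Item*item similarity. Jaccard overlap is an assumption for this sketch.
similarity = {}
for left, right in combinations(sorted(users_by_item), 2):
    shared = users_by_item[left] & users_by_item[right]
    either = users_by_item[left] | users_by_item[right]
    similarity[(left, right)] = len(shared) / float(len(either))

for item_pair in sorted(similarity):
    print(item_pair, similarity[item_pair])
```

Items that share the same users score close to 1, and the later SVD step would compress a matrix of such scores into a lower-dimensional embedding per item.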

Setup

Requirements

HiDi is tested against CPython 2.7, 3.4, 3.5, and 3.6. It may work with other versions of CPython, but they are untested.

Installation

To install HiDi, run:

$ pip install hidi