HiDi: Pipelines for Latent Factor Modeling
HiDi is a library for high-dimensional latent factor modeling for collaborative filtering applications.
Why HiDi?
We created HiDi because modeling latent factors for collaborative filtering applications is a labor-intensive process. It involves many data transformations, each of which requires special consideration to get a good result. HiDi simplifies the process by breaking the work into small steps, each of which can be executed in a pipeline.
The unit of work in HiDi is a Transformer. A Transformer needs to implement only one function, transform.
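As a rough illustration of that contract, the sketch below defines two stand-in transformers and chains them. It does not import HiDi: the `transform(data, **kwargs)` signature, the `(data, kwargs)` return value, the class names, and the `run_pipeline` helper are all assumptions made for the example, not HiDi's documented API.

```python
class LowercaseTransform(object):
    """Hypothetical transformer: lowercases every string it receives."""

    def transform(self, data, **kwargs):
        # Assumed contract: take the previous step's output and return the
        # new output plus any keyword arguments to pass along.
        return [item.lower() for item in data], kwargs


class UniqueTransform(object):
    """Hypothetical transformer: drops repeats, keeping first-seen order."""

    def transform(self, data, **kwargs):
        seen = set()
        unique = []
        for item in data:
            if item not in seen:
                seen.add(item)
                unique.append(item)
        return unique, kwargs


def run_pipeline(transforms, data=None, **kwargs):
    """Hypothetical stand-in for a pipeline: feed each output to the next step."""
    for step in transforms:
        data, kwargs = step.transform(data, **kwargs)
    return data


result = run_pipeline([LowercaseTransform(), UniqueTransform()],
                      ["A", "b", "a", "B"])
print(result)  # ['a', 'b']
```

Because each step exposes the same single method, steps can be reordered, swapped, or tested in isolation without touching the rest of the chain.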
Ok, How Do I Use It?
This will get you started.
from hidi import inout, clean, matrix, pipeline

# CSV file with link_id and item_id columns
in_files = ['hidi/examples/data/user-item.csv']

# File to write output data to
outfile = 'latent-factors.csv'

transforms = [
    inout.ReadTransform(in_files),      # Read data from disk
    clean.DedupeTransform(),            # Dedupe it
    matrix.SparseTransform(),           # Make a sparse user*item matrix
    matrix.SimilarityTransform(),       # To item*item similarity matrix
    matrix.SVDTransform(),              # Perform SVD dimensionality reduction
    matrix.ItemsMatrixToDFTransform(),  # Make a DataFrame with an index
    inout.WriteTransform(outfile)       # Write results to csv
]

pl = pipeline.Pipeline(transforms)
pl.run()
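To make the first few steps more concrete, here is a minimal pure-Python sketch of what deduplicating the rows, building a user*item structure, and deriving an item*item matrix might look like on toy data. It is not HiDi's implementation: the sample rows are invented, and plain co-occurrence counts stand in for whatever similarity measure `SimilarityTransform` actually computes. The SVD and output steps are omitted.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (link_id, item_id) rows, including one duplicate.
rows = [
    ("u1", "a"), ("u1", "b"),
    ("u2", "a"), ("u2", "b"), ("u2", "c"),
    ("u1", "a"),
]

# Dedupe: keep each (link_id, item_id) pair once.
pairs = sorted(set(rows))

# Sparse user*item matrix, stored as link_id -> set of item_ids.
items_by_link = defaultdict(set)
for link_id, item_id in pairs:
    items_by_link[link_id].add(item_id)

# Item*item co-occurrence: the number of links that contain both items.
# This is an assumed stand-in for a similarity score.
cooccurrence = defaultdict(int)
for items in items_by_link.values():
    for left, right in combinations(sorted(items), 2):
        cooccurrence[(left, right)] += 1

for pair in sorted(cooccurrence):
    print(pair, cooccurrence[pair])
```

On this toy input, items `a` and `b` co-occur under both links, so they end up with the highest count. A dimensionality-reduction step such as SVD would then compress each item's row of such a matrix into a short vector of latent factors.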
Setup
Requirements
HiDi is tested against CPython 2.7, 3.4, 3.5, and 3.6. It may work with other versions of CPython.