HiDi: Pipelines for Embeddings
HiDi is a library for high-dimensional embedding generation for collaborative filtering applications.
We created HiDi because generating embeddings for collaborative filtering applications is a labor-intensive process that involves many data transformations, each of which requires special consideration to get a good result. HiDi simplifies the process by breaking the work into small steps, each of which can be executed in a pipeline.
The unit of work in HiDi is a Transformer. Transformers need only implement one function, transform.
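As a rough illustration of that idea, the sketch below uses stand-in classes rather than HiDi's own. The names `StripTransform`, `UniqueTransform`, and `run_pipeline`, and the single-argument `transform` signature, are assumptions made for this example only; HiDi's actual base class and signature are covered in Writing Custom Transforms.

```python
class StripTransform:
    """Hypothetical transformer: trims whitespace from each item."""

    def transform(self, data):
        return [item.strip() for item in data]


class UniqueTransform:
    """Hypothetical transformer: drops repeats, keeping first-seen order."""

    def transform(self, data):
        seen = set()
        result = []
        for item in data:
            if item not in seen:
                seen.add(item)
                result.append(item)
        return result


def run_pipeline(transformers, data):
    """Feed the output of each transformer into the next one."""
    for transformer in transformers:
        data = transformer.transform(data)
    return data


cleaned = run_pipeline(
    [StripTransform(), UniqueTransform()],
    [" a", "b ", "a", "b"],
)
print(cleaned)
```

Because each step exposes the same one-method interface, steps can be reordered, swapped, or tested in isolation without touching the code that runs them.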
Ok, How Do I Use It?¶
This will get you started.
from hidi import inout, clean, matrix, pipeline

# CSV file with link_id and item_id columns
in_files = ['hidi/examples/data/user-item.csv']

# File to write output data to
outfile = 'embeddings.csv'

transforms = [
    inout.ReadTransform(in_files),       # Read data from disk
    clean.DedupeTransform(),             # Dedupe it
    matrix.SparseTransform(),            # Make a sparse user*item matrix
    matrix.SimilarityTransform(),        # To item*item similarity matrix
    matrix.SVDTransform(),               # Perform SVD dimensionality reduction
    matrix.ItemsMatrixToDFTransform(),   # Make a DataFrame with an index
    inout.WriteTransform(outfile)        # Write results to csv
]

pl = pipeline.Pipeline(transforms)
pl.run()
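To give a feel for what the first few steps compute, here is a small pure-Python sketch of deduplication, the user*item grouping, and an item*item similarity. It is not HiDi's implementation: the sample rows are made up, and using raw co-occurrence counts as the similarity measure is an assumption for illustration, since `matrix.SimilarityTransform` may use a different measure. The SVD and output steps are left out.

```python
from collections import defaultdict

# Hypothetical (link_id, item_id) rows, including one duplicate.
rows = [
    ("u1", "a"), ("u1", "b"), ("u1", "a"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "b"), ("u3", "c"),
]

# Dedupe: keep each (link_id, item_id) pair once.
unique_rows = sorted(set(rows))

# User*item structure: the set of items seen for each link_id.
items_by_user = defaultdict(set)
for link_id, item_id in unique_rows:
    items_by_user[link_id].add(item_id)

# Item*item similarity (assumed here to be a co-occurrence count):
# how many users have both items.
similarity = defaultdict(int)
for items in items_by_user.values():
    for first in items:
        for second in items:
            if first != second:
                similarity[(first, second)] += 1

for pair in sorted(similarity):
    print(pair, similarity[pair])
```

Items "a" and "b" appear together for two users, so they score higher than "b" and "c", which share only one.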
HiDi is tested against CPython 2.7, 3.4, 3.5, and 3.6. It may work with other versions of CPython.
To install HiDi, run:
$ pip install hidi
- HiDi: Pipelines for Embeddings
- Pipeline Module
- Inout Module
- Matrix Module
- Clean Module
- Forking Module
- Writing Custom Transforms