Content-related search: classifying shopping products

Time: 2015-06-19 10:31:14

Tags: algorithm machine-learning classification

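My client has given me a new (non-traditional) task that involves machine learning. Since I have never worked with machine learning beyond some small data-mining jobs, I need your help.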

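My task is to classify the products on any shopping website by gender (who the product is meant for), age group, and so on. The training data we can get is the product title, the keywords (available in the HTML of the product page), and the product description.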

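I did a lot of R&D. I found image recognition APIs (cloudsight, vufind) that return details about a product image, but they did not fully meet the requirement. I also tried Google's suggested queries, searched through many machine learning algorithms, and finally...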

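I came across the "decision tree learning algorithm", but I could not figure out how it applies to my problem. I tried the "PlayingTennis" dataset, but could not work out what to do with it.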

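Can you give me some direction on where to start this journey? Should I focus on the decision tree learning algorithm, or would you suggest that I focus on classifying products based on context?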

If you ask, I will share in detail what I have found so far while researching this problem.

2 answers:

Answer 0 (score: 2)

I would suggest doing the following:

1. Go through the items in your dataset and classify them manually (decide which gender each item is for). Store each decision so that you can link each item in the original dataset to its target class.
2. Develop an algorithm that converts each item in your dataset into a feature vector, that is, a vector of numbers (more on how to do this later).
3. Convert your whole dataset, together with the assigned classes, into a dataset that looks like this:

       Feature_1, Feature_2, Feature_3, ..., Gender
       value_1, value_2, value_3, ..., male

   It is a good idea to store it in a CSV file, since you can then load and process it in different machine learning tools (more about those later).
4. Load the dataset you created at step 3 into the machine learning tool of your choice and try to come up with the best model that can classify the items in your dataset by gender.
5. Store the model created at step 4. It will be part of your production system.
6. Develop production code that takes an unclassified product, creates a feature vector from it, and passes that feature vector to the model you saved at step 5. The result of this operation should be a predicted gender.

Details

If there are too many items (say, tens of thousands) in your original dataset, it may be impractical to classify them all yourself. You can use Amazon Mechanical Turk to simplify the task. If you are unable to use it (the last time I checked, you had to have a USA address), you can classify just a few hundred items to start working on your model and classify the rest later to improve accuracy (the more training data you use, the better the accuracy, but only up to a certain point).

How to extract features from a dataset

- If a keyword has a form like tag=true/false, it is a boolean feature.
- If a keyword has a form like tag=42, it is a numerical or ordinal feature. For example, it can be a price value or a price range (0-10, 10-50, 50-100, etc.).
- If a keyword has a form like tag=string_value, you can convert it into a categorical value.
- The class (gender) is simply a boolean value, 0/1.

You can experiment a bit with how you extract your features, since it may influence the resulting accuracy.

How to extract features from product description

There are different ways to convert text into a feature vector. Look for TF-IDF algorithms or something similar.

Machine learning tools

You can use one of the existing machine learning libraries and hack together some code that loads your CSV dataset, trains a model, and checks its accuracy, but at first I would suggest using something like Weka. It has a more or less intuitive UI, and you can quickly start experimenting with different machine learning algorithms, converting features in your dataset from strings to categories or from real values to ordinal values, and so on. A good thing about Weka is that it has a Java API, so you can automate the whole data conversion process, train models programmatically, etc.

What algorithms to choose

I would suggest using a decision tree algorithm like C4.5. It is fast and shows good results on a wide range of machine learning tasks.

Additionally, you can use an ensemble of classifiers. There are various algorithms that combine several classifiers (google for boosting or random forest to find out more). They usually give better results but work more slowly, since you need to run a single feature vector through several algorithms.

Another trick you can use to make your algorithm more accurate is to use models that work on different sets of features (say, one algorithm uses features extracted from tags and another uses data extracted from the product description). You can then combine them with an algorithm like stacking to come up with a final result.

For classification based on features extracted from text, you can try the Naive Bayes algorithm or SVM. They both show good results in text classification.
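
As a rough sketch of steps 2 and 3 and of the feature-extraction rules above, the following Python shows one way labelled products could be turned into a CSV of feature vectors. The sample products, the field names (`on_sale`, `price`, `category`), and the hand-rolled TF-IDF weighting are all assumptions made for illustration; a library implementation of TF-IDF would normally be the better choice.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical hand-labelled products (step 1); every field name is assumed.
PRODUCTS = [
    {"keywords": {"on_sale": "true", "price": "25", "category": "shirts"},
     "description": "classic cotton shirt for men", "gender": "male"},
    {"keywords": {"on_sale": "false", "price": "40", "category": "dresses"},
     "description": "floral summer dress for women", "gender": "female"},
    {"keywords": {"on_sale": "false", "price": "60", "category": "shoes"},
     "description": "leather shoes for men", "gender": "male"},
]


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(len(docs) / doc_freq[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for empty descriptions
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


def keyword_vector(keywords, categories):
    """Boolean tag -> 0/1, numerical tag -> float, string tag -> one-hot."""
    on_sale = 1.0 if keywords["on_sale"] == "true" else 0.0
    price = float(keywords["price"])
    one_hot = [1.0 if keywords["category"] == c else 0.0 for c in categories]
    return [on_sale, price] + one_hot


def build_dataset(products):
    """Build the header and rows: keyword features, TF-IDF features, class."""
    categories = sorted({p["keywords"]["category"] for p in products})
    vocab, text_vectors = tf_idf_vectors(
        [tokenize(p["description"]) for p in products])
    header = (["on_sale", "price"]
              + ["category=" + c for c in categories]
              + ["tfidf:" + tok for tok in vocab]
              + ["gender"])
    rows = [keyword_vector(p["keywords"], categories) + vec + [p["gender"]]
            for p, vec in zip(products, text_vectors)]
    return header, rows


def to_csv(header, rows):
    """Serialize the dataset as CSV text, ready to load into another tool."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(*build_dataset(PRODUCTS)))
```

Note that a word appearing in every description (here "for") gets a TF-IDF weight of zero, which is the intended effect: it carries no information about the class. In production code (step 6), the vocabulary and category list learned from the training data would have to be stored and reused so that new products map to the same columns.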

Answer 1 (score: 0)

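Consider a support vector classifier (SVC), or, as a term to Google, a support vector machine (SVM). If you have a large training set (which I suspect you do), search for "fast" or "scalable" implementations.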
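
To give a feel for what a linear SVM does on text, here is a minimal sketch in plain Python. It assumes bag-of-words features, omits the bias term for brevity, and trains with a Pegasos-style stochastic sub-gradient method; that training method, the toy data, and all function names are my own choices for illustration, not something prescribed above. For real data, an established implementation such as LIBLINEAR or LIBSVM is the sensible route.

```python
import random
from collections import Counter

# Toy, hand-labelled product titles (hypothetical data).
TRAINING = [
    ("men leather wallet", "male"),
    ("men cotton shirt", "male"),
    ("boys sneakers", "male"),
    ("women floral dress", "female"),
    ("women silk scarf", "female"),
    ("girls sandals", "female"),
]


def featurize(text):
    """Bag-of-words counts stored as a sparse dict {token: count}."""
    return Counter(text.lower().split())


def score(weights, features):
    """Dot product between the weight vector and a sparse feature dict."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())


def train_linear_svm(samples, labels, lam=0.01, epochs=50, seed=0):
    """Minimize the L2-regularized hinge loss; labels must be +1 or -1."""
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(samples)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = labels[i] * score(weights, samples[i])
            # Regularization: shrink every weight towards zero.
            for tok in weights:
                weights[tok] *= 1.0 - eta * lam
            # Hinge loss: update only when the margin is violated.
            if margin < 1.0:
                for tok, val in samples[i].items():
                    weights[tok] = weights.get(tok, 0.0) + eta * labels[i] * val
    return weights


def fit(training):
    """Train on (text, gender) pairs, mapping male -> +1 and female -> -1."""
    samples = [featurize(text) for text, _ in training]
    labels = [1 if gender == "male" else -1 for _, gender in training]
    return train_linear_svm(samples, labels)


def predict(weights, text):
    """Classify by the sign of the decision function."""
    return "male" if score(weights, featurize(text)) > 0 else "female"


if __name__ == "__main__":
    model = fit(TRAINING)
    print(predict(model, "cotton shirt men"))
```

Because the features are sparse dicts, the cost of each update depends on the number of distinct words seen so far rather than on a fixed vocabulary size, which is part of why linear SVMs scale reasonably well to large text collections.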