What methods to create singular content classification from inconsistent inbound info?

Question

I am attempting to aggregate professional profile info from multiple sources, imposing a consistent taxonomy. Specifically, the current problem is how to impose a preferred taxonomy on profiles with inconsistent or absent in-bound taxonomy terms.

The primary source of profile info is biography pages on people's employer websites. Some of those sites state employees' multiple specialist topics, some make only narrative biographies available, and some do both. I have collected all available info, using Python's Scrapy, into CSV files, one per company, with one person per row. Where topics are available, they now reside at my end in a single comma-separated field/string.

Example: in one sheet, cell S7 is: "Analytics Applications,Big Data,Cognitive Computing,Competitive Intelligence,eDiscovery,Enterprise Content Management (ECM),Information Architecture,Market Research,Product Information Management (PIM)"
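As a minimal sketch of how such a cell could be broken into individual terms before any mapping, the Python below splits the comma-separated field and counts how often each term occurs. The column names `name` and `topics` and the two sample rows are assumptions for illustration, not the real sheet layout. It also assumes no single term contains a comma.

```python
import csv
import io
from collections import Counter


def split_topics(cell):
    """Split one comma-separated topics cell into a list of clean terms."""
    if not cell:
        return []
    return [term.strip() for term in cell.split(",") if term.strip()]


# Hypothetical two-row sheet; the real CSVs will have different columns.
sample_csv = io.StringIO(
    "name,topics\n"
    'Person A,"Big Data,Cognitive Computing,Market Research"\n'
    "Person B,\n"
)

rows = list(csv.DictReader(sample_csv))

# Map each person to their list of inbound terms (empty when none are given).
topics_by_person = {row["name"]: split_topics(row["topics"]) for row in rows}

# Count how often each inbound term appears across all profiles.
term_counts = Counter(
    term for terms in topics_by_person.values() for term in terms
)
```

A frequency count like `term_counts` may help show which inbound terms are common enough to be worth mapping by hand.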

The problem is severalfold:

Taxonomy terms across companies are inconsistent (eg. "Cognitive Computing" in the above example may, to another company, be "AI").
Some companies use far too many terms in total (eg. one company alone uses approx 450 tags in total).
Often, none are available at all.
As biography narratives describe more than just employees' specialist topics (eg. education and upbringing background), their usefulness in automation may be questionable.

My goal is to create a taxonomy that categorises all the collected person bios in a much more harmonious, consistent and briefer fashion.

The system setup is PHP/MySQL/WordPress. Profile CSVs are imported into WordPress, and the system can run PHP functions on the content during import, not just on the info in WordPress after import.

The total profile count is approx 4,500, so manual taxonomisation is unappealing. I have therefore examined AI/machine learning techniques. I am not strictly a developer and certainly not a data scientist or mathematician.

So far, I have found that text classification tests carried out using Aylien and Monkey Learn yield poor results. In each case, the output is not granular enough, ie. it turns in-bound terms or bios about granular topics like cloud computing infrastructure and data centres into overly basic terms like "Computers & Internet". Aylien uses the off-the-shelf IPTC NewsCodes taxonomy, and I understand I can train Monkey Learn myself. I like the idea of using a standardised off-the-shelf taxonomy like NewsCodes, but a) the results are questionable, and b) it may not be granular enough for my needs.

At this point, I have decided to draw up my preferred hierarchy of taxonomy terms, approx 230, which should each speak roughly to the swathe of inconsistent in-bound terms and profiles (in other words, correlate to the people's topics). That seemed like an important step, assuming I need to steer this manually. But I'm struggling to grasp how to actually implement that correlation.

So, I am looking for some guidance on best methods.

One idea I am toying with is to put my own preferred taxonomy into WordPress as taxonomy terms and, alongside each, store a cluster of related terms from the actual source material, so that if one of the related terms is found in a person's inbound data, the corresponding term from my preferred taxonomy is assigned. But I'm not sure whether this is particularly efficient, or even wise.
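The cluster idea can be sketched roughly as a lookup table. This is an illustration in Python rather than the PHP import hook the site would actually need, and the preferred terms and clusters shown are hypothetical. Matching is exact and case-insensitive, and anything unmatched is collected for manual review.

```python
# Hypothetical clusters: preferred term -> inbound source terms that map to it.
CLUSTERS = {
    "Artificial Intelligence": ["AI", "Cognitive Computing", "Machine Learning"],
    "Data & Analytics": ["Big Data", "Analytics Applications"],
}


def build_lookup(clusters):
    """Invert the clusters into a case-insensitive inbound -> preferred map."""
    lookup = {}
    for preferred, variants in clusters.items():
        # The preferred term should also match itself.
        for variant in [preferred, *variants]:
            lookup[variant.casefold()] = preferred
    return lookup


def assign_terms(inbound_terms, lookup):
    """Return (preferred terms assigned, inbound terms with no mapping)."""
    assigned, unmatched = [], []
    for term in inbound_terms:
        preferred = lookup.get(term.strip().casefold())
        if preferred is None:
            unmatched.append(term)
        elif preferred not in assigned:
            assigned.append(preferred)
    return assigned, unmatched


LOOKUP = build_lookup(CLUSTERS)
assigned, unmatched = assign_terms(
    ["Big Data", "cognitive computing", "eDiscovery", "AI"], LOOKUP
)
# assigned  -> ["Data & Analytics", "Artificial Intelligence"]
# unmatched -> ["eDiscovery"]
```

Keeping the `unmatched` list is a deliberate choice in this sketch: it shows which source terms still need a home in the clusters, so the mapping can be grown as more company sheets are imported.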

This is my first time on the Data Science group at StackExchange. I apologise if I have shot wide of the mark here at all.

grldsndrs · Answer

You could import each company's data into its own table and then develop regular expression scripts to change specific expressions into your own taxonomy.
https://en.m.wikipedia.org/wiki/Regular_expression
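
A rough sketch of what such regular expression rules might look like, written in Python for illustration (the same patterns should carry over to PHP's `preg_match` with little change). The rules and preferred terms below are hypothetical examples, not a tested rule set.

```python
import re

# Hypothetical rules: each pattern maps matching text to one preferred term.
RULES = [
    (
        re.compile(r"\b(ai|artificial intelligence|cognitive computing)\b",
                   re.IGNORECASE),
        "Artificial Intelligence",
    ),
    (
        re.compile(r"\bcloud\b|\bdata cent(?:re|er)s?\b", re.IGNORECASE),
        "Cloud Infrastructure",
    ),
]


def classify(text):
    """Return the preferred terms whose pattern occurs anywhere in text."""
    found = []
    for pattern, preferred in RULES:
        if pattern.search(text) and preferred not in found:
            found.append(preferred)
    return found


tags_from_field = classify("Cognitive Computing,Big Data")
tags_from_bio = classify(
    "She advises clients on cloud computing infrastructure and data centres."
)
```

The `\b` word boundaries keep short patterns such as `ai` from matching inside longer words. Because the patterns search free text, the same rules could be tried on narrative biographies as well as on tag fields, though the false-positive rate there would need checking.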
