How to create a system to detect text structure of a file?

Question

Let's say I want to create a machine learning system. I have a lot of log files of a few types (F1, F2, ..., Fn), and I get a new log file that may contain errors or missing data.

How do I classify it into one of these types, or flag it as an anomaly if it doesn't belong to any of them?

I thought about anomaly detection, but couldn't figure out how to extract structural information from the text classes (F1, F2, etc.).

Also, what kind of structural information should I extract from text files?

Each document in these input classes contains 100 to 1000 lines of code.

I looked into Linting or DeepCode ...

A sample log file looks like this:

11-02-11 16:47:35,985 +0000 E Activity class {com.trackingeng/LandingActivity} does not exist.
12-02-11 17:47:35,985 +0000 I Starting: Intent { act=android.intent.action.MAIN
.....

A log file may also contain a stack trace like this:

Error:
    Error detail 1
    Error detail 2
    ....
Non-Error:
    .....
Warning:
    .....
and similar to this.

Any help on which direction to look in is greatly appreciated.

BenP · Answer

Based on your current examples, you have an ML system that does work across multiple systems and users and generates logs.
The logs can be grouped into classes (e.g. $F_1, F_2, \ldots, F_n$) by:

i) the operating system used (e.g. Android, Windows, etc.), and
ii) the error or message generated. The error messages are distinguished by their tab indentation.

If you cannot work out the log class with simple rules, you could engineer features with regular expressions that capture the source OS and the tab-indentation structure of the error lines.

|---------------------|------------------|
| Error detail        |     n_tabs       |
|---------------------|------------------|
|  string_error       |         0        |
|---------------------|------------------|
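
As a rough sketch of what such regex features could look like in Python: the patterns below are guesses based only on the two sample snippets in the question (a timestamped line with a one-letter level, and `Error:`-style section headers followed by indented detail lines). The function names `line_features` and `file_features`, the feature names, and the reading of level `E` as "error" are all assumptions made for illustration, so adjust them to your real formats.

```python
import re
from collections import Counter

# Assumed layout, copied from the sample in the question:
# "DD-MM-YY HH:MM:SS,mmm +ZZZZ <one-letter level> <message>"
TIMESTAMPED = re.compile(
    r"^\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} [+-]\d{4} (?P<level>[A-Z]) (?P<msg>.*)$"
)
# Assumed section header at column 0, e.g. "Error:" or "Non-Error:".
SECTION = re.compile(r"^(?P<name>[A-Za-z][A-Za-z-]*):\s*$")


def line_features(line):
    """Return structural features for a single log line."""
    stripped = line.lstrip(" \t")
    indent = line[: len(line) - len(stripped)]  # leading whitespace only
    m = TIMESTAMPED.match(line)
    if m:
        kind = "timestamped"
    elif SECTION.match(line):
        kind = "section"
    elif indent and stripped:
        kind = "detail"  # indented line, e.g. under "Error:"
    else:
        kind = "other"
    return {
        "kind": kind,
        "level": m.group("level") if m else None,
        "n_tabs": indent.count("\t"),
        "n_spaces": indent.count(" "),
        # Crude OS hint; a real system would need a better signal.
        "mentions_android": "android" in line.lower(),
    }


def file_features(text):
    """Aggregate per-line features into one fixed-length summary per file."""
    feats = [line_features(ln) for ln in text.splitlines() if ln.strip()]
    total = len(feats) or 1  # avoid division by zero on empty files
    kinds = Counter(f["kind"] for f in feats)
    levels = Counter(f["level"] for f in feats if f["level"])
    return {
        "n_lines": len(feats),
        "frac_timestamped": kinds["timestamped"] / total,
        "frac_section": kinds["section"] / total,
        "frac_detail": kinds["detail"] / total,
        "frac_error_level": levels["E"] / total,  # assumes "E" means error
        "max_tabs": max((f["n_tabs"] for f in feats), default=0),
        "mentions_android": any(f["mentions_android"] for f in feats),
    }


# Small sample stitched together from the snippets in the question.
SAMPLE = (
    "11-02-11 16:47:35,985 +0000 E Activity class "
    "{com.trackingeng/LandingActivity} does not exist.\n"
    "12-02-11 17:47:35,985 +0000 I Starting: Intent { act=android.intent.action.MAIN\n"
    "Error:\n"
    "    Error detail 1\n"
    "\tError detail 2\n"
)

if __name__ == "__main__":
    print(file_features(SAMPLE))
```

Each file then becomes one row of numeric features, which you could feed to whatever classifier or anomaly detector you end up choosing.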
