QMiner

 $ npm install qminer --save

Word count

It's the 'Hello World!' example of text mining tools. QMiner can count words; however, the library is better suited to more sophisticated text mining applications!

Let's define a schema with a text field and push some example sentences into the store. The frequency of the keywords is provided by the "keyword" aggregate, which returns a weighted and sorted vector of keywords: [ (yellow, 0.60), (pen, 0.49), (green, 0.37), (blue, 0.37), (marker, 0.31) ]. QMiner automatically discards unimportant connecting words (stop words) such as "this" and "is".
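To make the idea concrete without the interactive example, here is a minimal plain-JavaScript sketch of counting words while discarding stop words. It does not use QMiner and does not reproduce the weights of its keyword aggregate. The sample sentences, the stop word list, and the function names countWords and sortedKeywords are all assumptions made for illustration.

```javascript
// Plain-JavaScript sketch; it does not call QMiner or reproduce its weighting.
// Hypothetical sample sentences (the originals are not shown on this page).
const sentences = [
    "This is a yellow pen.",
    "This is a green marker.",
    "This pen is blue.",
    "This is a yellow marker."
];

// A tiny, hand-picked stop word list (assumption).
const stopwords = new Set(["this", "is", "a"]);

function countWords(texts) {
    const counts = {};
    for (const text of texts) {
        // Lowercase, then split on anything that is not a letter or digit.
        const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
        for (const token of tokens) {
            if (stopwords.has(token)) { continue; }
            counts[token] = (counts[token] || 0) + 1;
        }
    }
    return counts;
}

// Sort by descending count; ties are broken alphabetically so the order is stable.
function sortedKeywords(counts) {
    return Object.entries(counts)
        .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
}

const keywords = sortedKeywords(countWords(sentences));
console.log(keywords);
```

Raw counts are used here instead of weights, so the numbers will not match the vector quoted above; the point is only that stop words never reach the result.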

Play with it on RunKit!

Keyword search

Let's see which tweets include the keyword 'pen' OR 'yellow'!

We define the schema and push data into the store. Note that we index keywords by adding the keys: [] configuration array to the schema description. The query is then run to retrieve the list of matching records from the store. All four records are included in the result.

Now replace the second line of the query object with "text": "pen", "text": "yellow" and hit Run. The result will now have two records that include both 'pen' AND 'yellow'.
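The difference between the OR and the AND query can be sketched with a small inverted index in plain JavaScript. This is not QMiner's index or query syntax. The tweets and the helper names buildIndex, searchAny, and searchAll are hypothetical, with the data chosen so that OR matches all four records and AND matches two, as described above.

```javascript
// Plain-JavaScript sketch of an inverted index; not QMiner's implementation.
// Hypothetical tweets (the originals are not shown on this page).
const sampleTweets = [
    "I like my yellow pen",
    "The marker is yellow",
    "This pen is blue",
    "A yellow marker and a yellow pen"
];

// Map each lowercase word to the set of record ids that contain it.
function buildIndex(records) {
    const index = new Map();
    records.forEach((text, id) => {
        for (const word of text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
            if (!index.has(word)) { index.set(word, new Set()); }
            index.get(word).add(id);
        }
    });
    return index;
}

// OR: union of the id sets of all query words.
function searchAny(index, words) {
    const hits = new Set();
    for (const word of words) {
        for (const id of index.get(word) || []) { hits.add(id); }
    }
    return [...hits].sort((a, b) => a - b);
}

// AND: only ids that appear in the id set of every query word.
function searchAll(index, words) {
    const sets = words.map((word) => index.get(word) || new Set());
    if (sets.length === 0) { return []; }
    return [...sets[0]]
        .filter((id) => sets.every((s) => s.has(id)))
        .sort((a, b) => a - b);
}

const keywordIndex = buildIndex(sampleTweets);
console.log(searchAny(keywordIndex, ["pen", "yellow"]));
console.log(searchAll(keywordIndex, ["pen", "yellow"]));
```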

Play with it on RunKit!

Nearest neighbor

Let's see which tweet is the most similar to an input tweet!

We define the schema and push data into the store. Then we create a feature space over the text of all the training tweets. The query is then used to create an ordered list of similar tweets. Here's the similarity vector: 0.71, 0.05, 0.02, 0.62. The first and the last tweets from the store are the most similar to the query tweet, while the second and third are not very similar.

In the query tweet, try making the words 'pen' and 'marker' plural and hit Run. Suddenly only the first tweet is computed as being similar to the query (0.97, 0, 0, 0), even though content-wise not much has changed. Now add the following to the feature space definition after the last 'text': , tokenizer: { type: "unicode", stopwords: "en", stemmer: "porter" } and hit Run! The similarities become the same as they were originally, because stemmer: "porter" makes sure these minor differences are ignored.
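Why plurals hurt the similarity, and why stemming repairs it, can be sketched with bag-of-words cosine similarity in plain JavaScript. This is not QMiner's feature space, and the numbers will not match the vectors quoted above. The two texts are hypothetical, and stripPlural is a crude stand-in for a stemmer, not the Porter algorithm.

```javascript
// Plain-JavaScript sketch of bag-of-words cosine similarity; not QMiner's feature space.
// `normalize` is a hypothetical hook standing in for a stemmer.
function toVector(text, normalize) {
    const vector = new Map();
    for (const raw of text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
        const token = normalize ? normalize(raw) : raw;
        vector.set(token, (vector.get(token) || 0) + 1);
    }
    return vector;
}

// Cosine of the angle between two sparse term-count vectors.
function cosine(a, b) {
    let dot = 0;
    for (const [token, weight] of a) { dot += weight * (b.get(token) || 0); }
    const norm = (v) => Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
    const denominator = norm(a) * norm(b);
    return denominator === 0 ? 0 : dot / denominator;
}

// Crude plural stripping: NOT the Porter stemmer, just enough for this example.
const stripPlural = (token) =>
    (token.length > 3 && token.endsWith("s") ? token.slice(0, -1) : token);

const storedTweet = "yellow pen and green marker";   // hypothetical stored tweet
const queryTweet = "yellow pens and green markers";  // same content, plural nouns

// Without normalization 'pen'/'pens' and 'marker'/'markers' are unrelated tokens.
const plain = cosine(toVector(storedTweet), toVector(queryTweet));
// With normalization both texts map to the same bag of words.
const stemmed = cosine(toVector(storedTweet, stripPlural), toVector(queryTweet, stripPlural));
console.log(plain.toFixed(2), stemmed.toFixed(2));
```

In this toy case the similarity rises from roughly 0.6 to 1 once the plural forms are folded together, which mirrors the effect described above.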

Play with it on RunKit!

Text streams

QMiner is also able to process streaming data. Here's a custom-defined stream aggregate. For native aggregates, check the 'Time series' menu tab.

We define the schema as usual. We simulate our stream by creating the s vector. Each element of the vector has to comply with the pre-defined schema. Then we define a custom JavaScript stream aggregate that counts the frequency of the names.

Note that when ingesting the stream, QMiner keeps only the model in memory and discards the data itself. The frequency of the two selected names is displayed after each new data point comes in:
"[0] John:1 --- Mary:null"
"[1] John:1 --- Mary:1"
"[2] John:1 --- Mary:1"
"[3] John:1 --- Mary:1"
"[4] John:1 --- Mary:2"
"[5] John:1 --- Mary:2"

Now let's change this example to do word count. First change the fields on lines 7-9 to { name: 'text', type: 'string' }. Now let's make sure each element of the stream complies with the schema, so update lines 17-23 to look like s.push(ps.newRecord({ text: 'John'}));. Replace line 29 with data[rec.text] = data[rec.text] == undefined ? 1 : data[rec.text] + 1;. Finally, replace line 45 with console.log(stream.saveJson());. Now try it out!

The resulting output that counts frequent words looks like:
{John: 1}
{John: 1, Mary: 1}
{Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 1, Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 2, Jack: 1, Jill: 1, John: 1, Mary: 2}
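The shape of such a word-count aggregate can be sketched in plain JavaScript. The sketch is not wired to QMiner: makeWordCounter is a hypothetical factory, the onAdd callback name is an assumption about how the aggregate receives records, and the stream of names is inferred from the output listed above.

```javascript
// Plain-JavaScript sketch of the aggregate's shape; not wired to QMiner.
// `makeWordCounter` and the `onAdd` callback name are assumptions for illustration.
function makeWordCounter() {
    const data = {};
    return {
        // Called once per incoming record; only the counts are kept,
        // the record itself is not stored.
        onAdd(rec) {
            data[rec.text] = data[rec.text] == undefined ? 1 : data[rec.text] + 1;
        },
        // Returns a snapshot copy of the current counts.
        saveJson() {
            return Object.assign({}, data);
        }
    };
}

// Simulated stream, in the order implied by the output listed above.
const names = ["John", "Mary", "Jill", "Jack", "Mary", "Andy", "Andy"];
const counter = makeWordCounter();
for (const name of names) {
    counter.onAdd({ text: name });
    console.log(counter.saveJson());
}
```

Plain objects keep insertion order, so the keys may print in a different order than in the listing above; the counts are what matter.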

Play with it on RunKit!

Check out QMiner at Nodejs interactive!