$ npm install qminer --save

Word count

It's the 'Hello World!' example of text mining tools. QMiner can count words, but the library is better suited to more sophisticated text mining applications!

Let's define a schema with a text field and push some example sentences into the store. The frequency of the keywords is provided by the "keywords" aggregate, which returns a weighted and sorted vector of keywords: [ (yellow, 0.60), (pen, 0.49), (green, 0.37), (blue, 0.37), (marker, 0.31) ]. QMiner automatically discards unimportant connecting words (stop words) such as "this" and "is".

Play with it on RunKit!

var qm = require('qminer');

// create the base object with the desired schema
var base = new qm.Base({
    mode: 'createClean',
    schema: [{
        name: 'tweets',
        fields: [{ name: 'text', type: 'string' }]
    }]
});

// push the data
let tweetStore = base.store('tweets');
tweetStore.push({text: "This pen is green."});
tweetStore.push({text: "This pen is yellow."});
tweetStore.push({text: "This pen is blue."});
tweetStore.push({text: "This marker is yellow."});

// get the distribution of keywords
let distribution = tweetStore.allRecords.aggr({ name: "test", type: "keywords", field: "text" });

// output the sorted keyword-weight pairs
distribution.keywords.forEach((obj) => {
    console.log(obj.keyword, obj.weight);
});
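To make the idea concrete without QMiner, here is a minimal plain-JavaScript sketch of a word count that drops stop words, run over the same four sentences. It is not how the "keywords" aggregate is implemented: the aggregate returns weights rather than raw counts, so its ordering can differ (it ranks 'yellow' above 'pen'). The STOP_WORDS list and the countWords helper are assumptions made for illustration only.

```javascript
// Hypothetical stop-word list for this sketch; it is not QMiner's list.
const STOP_WORDS = new Set(['this', 'is']);

// Count how often each non-stop word occurs across all sentences.
function countWords(sentences) {
    const counts = new Map();
    for (const sentence of sentences) {
        // lowercase the text and keep runs of letters as tokens
        const tokens = sentence.toLowerCase().match(/[a-z]+/g) || [];
        for (const token of tokens) {
            if (STOP_WORDS.has(token)) continue;
            counts.set(token, (counts.get(token) || 0) + 1);
        }
    }
    // sort by descending count; ties keep first-seen order
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const sentences = [
    "This pen is green.",
    "This pen is yellow.",
    "This pen is blue.",
    "This marker is yellow."
];

countWords(sentences).forEach(([word, count]) => console.log(word, count));
```

With raw counts, 'pen' (3) comes first and 'yellow' (2) second, followed by the words that occur once.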

Keyword search

Let's see which tweet includes the keyword 'pen' OR 'yellow'!

We define the schema and push data into the store. Note that we index keywords by adding the keys: [] configuration vector to the schema description. Running the query retrieves the list of matching records from the store. All four records are included in the result.

Now replace the second line of the query object with "text": "pen", "text": "yellow" and hit Run. The result will now have two records that include both 'pen' AND 'yellow'.
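The difference between the OR query and the AND query described above can be sketched in plain JavaScript with a small inverted index. This is only an illustration of the matching semantics, not QMiner's query engine or its API. The tweets below are hypothetical (the example's own data is not shown here), and buildIndex, searchAny and searchAll are names invented for this sketch.

```javascript
// Hypothetical tweets, chosen so that the OR and AND queries differ visibly.
const tweets = [
    "This pen is yellow.",
    "A yellow pen and a notebook.",
    "This pen is blue.",
    "This marker is yellow."
];

// Build an inverted index: word -> set of positions of tweets containing it.
function buildIndex(docs) {
    const index = new Map();
    docs.forEach((doc, id) => {
        const tokens = doc.toLowerCase().match(/[a-z]+/g) || [];
        for (const token of tokens) {
            if (!index.has(token)) index.set(token, new Set());
            index.get(token).add(id);
        }
    });
    return index;
}

// OR: a tweet matches if it contains at least one of the words.
function searchAny(index, words) {
    const hits = new Set();
    for (const word of words) {
        for (const id of index.get(word) || []) hits.add(id);
    }
    return [...hits].sort((a, b) => a - b);
}

// AND: a tweet matches only if it contains every word.
function searchAll(index, words) {
    if (words.length === 0) return [];
    const sets = words.map((word) => index.get(word) || new Set());
    return [...sets[0]].filter((id) => sets.every((s) => s.has(id)));
}

const index = buildIndex(tweets);
console.log(searchAny(index, ['pen', 'yellow']));  // all four tweets
console.log(searchAll(index, ['pen', 'yellow']));  // only the first two tweets
```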

Play with it on RunKit!


Nearest neighbor

Let's see which tweet is the most similar to an input tweet!

We define the schema and push data into the store. Then we create a feature space over the text of all the training tweets. The query is then used to create an ordered list of similar tweets. Here's the similarity vector: 0.71, 0.05, 0.02, 0.62. The first and the last tweets from the store are the most similar to the query tweet, while the second and third are not very similar.

In the query tweet, try making the words 'pen' and 'marker' plural and hit Run. Suddenly only the first tweet is computed as being similar to the query (0.97, 0, 0, 0), even though not much has changed content-wise. Now add the following to the feature space definition after the last 'text': , tokenizer: { type: "unicode", stopwords: "en", stemmer: "porter" } and hit Run! The similarities return to their original values, because stemmer: "porter" makes sure these minor differences are ignored.
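Why a plural form breaks the match, and why stemming repairs it, can be shown with a small bag-of-words cosine similarity in plain JavaScript. This is a sketch under loud assumptions: it does not use QMiner's feature space, it has no stop-word removal or weighting, and naiveStem is a crude stand-in that only strips a trailing "s", not the Porter stemmer. The query tweet is hypothetical, so the numbers will not match the ones quoted above.

```javascript
// Split text into lowercase word tokens.
function tokenize(text) {
    return text.toLowerCase().match(/[a-z]+/g) || [];
}

// Crude stand-in for a real stemmer (NOT the Porter algorithm):
// it only strips a trailing "s" from words longer than three letters.
function naiveStem(token) {
    return token.length > 3 && token.endsWith('s') ? token.slice(0, -1) : token;
}

// Turn text into a bag-of-words vector (word -> count).
function toVector(text, normalize = (t) => t) {
    const vec = new Map();
    for (const token of tokenize(text)) {
        const key = normalize(token);
        vec.set(key, (vec.get(key) || 0) + 1);
    }
    return vec;
}

// Cosine similarity between two sparse vectors.
function cosine(a, b) {
    let dot = 0;
    for (const [key, value] of a) dot += value * (b.get(key) || 0);
    const norm = (v) => Math.sqrt([...v.values()].reduce((s, x) => s + x * x, 0));
    const denom = norm(a) * norm(b);
    return denom === 0 ? 0 : dot / denom;
}

const stored = "This marker is yellow.";
const query = "I like yellow markers.";  // hypothetical query tweet

// Without stemming, "marker" and "markers" are different words.
console.log(cosine(toVector(stored), toVector(query)));                        // 0.25
// With the naive stemmer, both map to "marker" and the overlap doubles.
console.log(cosine(toVector(stored, naiveStem), toVector(query, naiveStem)));  // 0.5
```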

Play with it on RunKit!


Text streams

QMiner is also able to process streaming data. Here's a custom-defined stream aggregate. For native aggregates, check the 'Time series' menu tab.

We define the schema as usual. We simulate our stream by creating the s vector. Each element of the vector has to comply with the predefined schema. Then we define a custom JavaScript stream aggregate that counts the frequency of the names.

var qm = require('qminer');
// create the base object
let base = new qm.Base({
    mode: 'createClean',
    schema: [{
        name: 'People',
        fields: [
            { name: 'Name', type: 'string', primary: true },
            { name: 'Gender', type: 'string' }
        ]}
    ]});
let ps = base.store('People');

// create a custom stream object
let s = [];
// each element of the object has to comply with the base schema definition
s.push(ps.newRecord({ Name: 'John', Gender: 'Male' }));
s.push(ps.newRecord({ Name: 'Mary', Gender: 'Female' }));
s.push(ps.newRecord({ Name: 'Jill', Gender: 'Female' }));
s.push(ps.newRecord({ Name: 'Jack', Gender: 'Male' }));
s.push(ps.newRecord({ Name: 'Mary', Gender: 'Female' }));
s.push(ps.newRecord({ Name: 'Andy', Gender: 'Male' }));
s.push(ps.newRecord({ Name: 'Andy', Gender: 'Male' }));

// create your custom stream aggregate
var stream = new qm.StreamAggr(base, new function () {
    var data = {};
    this.onAdd = function (rec) {
        data[rec.Name] = data[rec.Name] == undefined ? 1 : data[rec.Name] + 1;
    };
    this.saveJson = function (limit) {
        return data;
    };
    this.getFloat = function (name) {
        return data[name] == undefined ? null : data[name];
    };
    this.getInteger = function (name) {
        return data[name] == undefined ? null : data[name];
    };
});

// start ingesting the stream
s.forEach((obj, idx) => {
    stream.onAdd(obj);
    console.log("[" + idx + "] John:" + stream.getFloat("John") + " --- Mary:" + stream.getFloat("Mary"));
});

Note that when ingesting the stream, QMiner keeps only the model in memory and discards the data itself. The frequency of the two selected names is displayed after each new data point comes in:
"[0] John:1 --- Mary:null"
"[1] John:1 --- Mary:1"
"[2] John:1 --- Mary:1"
"[3] John:1 --- Mary:1"
"[4] John:1 --- Mary:2"
"[5] John:1 --- Mary:2"
"[6] John:1 --- Mary:2"

Now let's change this example to do word count. First change the fields on lines 7-9 to { name: 'text', type: 'string' }. Now let's make sure each element of the stream complies with the schema, so update lines 17-23 to look like s.push(ps.newRecord({ text: 'John'}));. Replace line 29 with data[rec.text] = data[rec.text] == undefined ? 1 : data[rec.text] + 1;. Finally, replace line 45 with console.log(stream.saveJson());. Now try it out!

The resulting output, a running count of each word seen so far, looks like:
{John: 1}
{John: 1, Mary: 1}
{Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 1, Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 2, Jack: 1, Jill: 1, John: 1, Mary: 2}
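The counting logic of the modified aggregate can also be tried on its own, without QMiner. The sketch below is plain JavaScript: makeCounter is a hypothetical helper that borrows the onAdd and saveJson method names from the example, but it does not use qm.StreamAggr, and the records are simple objects rather than store records.

```javascript
// Minimal stand-in for the custom aggregate: it keeps only the counts
// (the "model") and never stores the records themselves.
function makeCounter() {
    const data = {};
    return {
        // update the count for the word carried by the incoming record
        onAdd(rec) {
            data[rec.text] = data[rec.text] === undefined ? 1 : data[rec.text] + 1;
        },
        // return a copy, so callers cannot mutate the model
        saveJson() {
            return { ...data };
        }
    };
}

const counter = makeCounter();
const words = ['John', 'Mary', 'Jill', 'Jack', 'Mary', 'Andy', 'Andy'];

// simulate the stream: print the model after every incoming record
words.forEach((text) => {
    counter.onAdd({ text });
    console.log(counter.saveJson());
});
```

The counts match the listing above. A plain object prints its keys in insertion order, so the order of the names may differ from the alphabetical order shown there.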

Play with it on RunKit!

Check out QMiner at Nodejs interactive!