$ npm install qminer --save
Word count
It's the 'Hello World!' example used for text mining tools. QMiner can count words; however, more sophisticated text mining applications are better suited to the library!
Let's define a schema with text and push some example sentences into the store. The frequency
of the keywords is provided by the "keyword" aggregate which returns a weighted and sorted
vector of keywords: [ (yellow, 0.60), (pen, 0.49), (green, 0.37), (blue, 0.37), (marker, 0.31) ].
QMiner automatically discards unimportant connecting words (stop words) such as "this" and "is".
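To make the idea concrete, here is a minimal sketch in plain JavaScript of counting keywords while skipping stop words. It does not use the QMiner API: the `countKeywords` function, the tiny `STOP_WORDS` list, and the sample sentences are all assumptions made for this illustration, and it returns raw counts rather than the weights produced by the "keyword" aggregate.

```javascript
// Plain JavaScript sketch; this is NOT the QMiner API.
// STOP_WORDS is a tiny hypothetical list chosen for this illustration.
const STOP_WORDS = new Set(['this', 'is', 'a', 'the', 'and']);

// Returns [word, count] pairs sorted by descending count (ties alphabetically).
function countKeywords(sentences) {
    const counts = new Map();
    for (const sentence of sentences) {
        // Lowercase and split on anything that is not a letter or digit.
        const tokens = sentence.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
        for (const token of tokens) {
            if (STOP_WORDS.has(token)) continue; // discard stop words
            counts.set(token, (counts.get(token) || 0) + 1);
        }
    }
    return [...counts.entries()].sort(
        (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
    );
}

// Hypothetical sample sentences, not the ones used in the original example.
const sampleSentences = ['This is a yellow pen', 'This is a green pen', 'The yellow marker'];
const keywordCounts = countKeywords(sampleSentences);
console.log(keywordCounts);
```

A weighted variant would normalise these counts; how QMiner computes its weights is not shown here.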
Play with it on RunKit!
Keyword search
Let's see which tweets include the keyword 'pen' OR 'yellow'!
We define the schema and push data into the store. Note that we index keywords by adding the
keys: [] configuration vector to the schema description. The query is run
to retrieve the list of records from the store. All four records are included in the result.
Now replace the second line of the query object with "text": "pen", "text": "yellow"
and hit Run. The result will now have two records that include both 'pen' AND 'yellow'.
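The difference between the OR and the AND query can be sketched with a small inverted index in plain JavaScript. This is not how QMiner implements its key index; `buildIndex`, `searchAny`, `searchAll`, and the four sample tweets are hypothetical names and data chosen so that the OR query matches all four records and the AND query matches two.

```javascript
// Plain JavaScript sketch of keyword search; this is NOT the QMiner API.

// Maps each lowercase token to the set of document ids that contain it.
function buildIndex(docs) {
    const index = new Map();
    docs.forEach((text, id) => {
        const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
        for (const token of tokens) {
            if (!index.has(token)) index.set(token, new Set());
            index.get(token).add(id);
        }
    });
    return index;
}

// OR query: ids of documents containing at least one of the words.
function searchAny(index, words) {
    const hits = new Set();
    for (const word of words) {
        for (const id of index.get(word) || []) hits.add(id);
    }
    return [...hits].sort((a, b) => a - b);
}

// AND query: ids of documents containing every one of the words.
function searchAll(index, words) {
    if (words.length === 0) return [];
    const sets = words.map((word) => index.get(word) || new Set());
    return [...sets[0]]
        .filter((id) => sets.every((set) => set.has(id)))
        .sort((a, b) => a - b);
}

// Hypothetical tweets, not the data from the original example.
const sampleTweets = [
    'A yellow pen',
    'A green pen',
    'A yellow marker',
    'A blue pen and a yellow pen'
];
const tweetIndex = buildIndex(sampleTweets);
const orHits = searchAny(tweetIndex, ['pen', 'yellow']);
const andHits = searchAll(tweetIndex, ['pen', 'yellow']);
console.log(orHits, andHits);
```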
Play with it on RunKit!
Nearest neighbor
Let's see which tweet is the most similar to an input tweet!
We define the schema and push data in the store. Then we create a feature space over the text
of all the training tweets. The query is then used to create an ordered list of similar tweets.
Here's the similarity vector: 0.71, 0.05, 0.02, 0.62. The first and the last tweets
from the store are the most similar to the query tweet while the second and third are not too similar.
In the query tweet, try making the words 'pen' and 'marker' plural and hit Run. Suddenly only the first
tweet is computed as being similar to the query (0.97, 0, 0, 0), even though
content-wise not much changed. Now add the following to the feature space definition after the last
'text': , tokenizer: { type: "unicode", stopwords: "en", stemmer: "porter" } and hit Run!
The similarities now become the same as originally, because the stemmer: porter setting makes sure these
minor differences are ignored.
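Why stemming matters for similarity can be sketched with a bag-of-words cosine similarity in plain JavaScript. This is only an illustration under stated assumptions: it does not use QMiner's feature space, the crude "strip one trailing s" rule is a stand-in for a real Porter stemmer, and the helper names and sample texts are hypothetical.

```javascript
// Plain JavaScript sketch; this is NOT the QMiner feature space.

// Tokenizes text; when `stem` is true, applies a crude stand-in for a
// stemmer that strips one trailing "s" from words longer than 3 characters.
function tokenize(text, stem) {
    const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
    if (!stem) return tokens;
    return tokens.map((t) => (t.length > 3 && t.endsWith('s') ? t.slice(0, -1) : t));
}

// Builds a term -> count map (a bag-of-words vector).
function termCounts(tokens) {
    const counts = new Map();
    for (const token of tokens) counts.set(token, (counts.get(token) || 0) + 1);
    return counts;
}

// Cosine similarity between two bag-of-words vectors.
function cosine(a, b) {
    let dot = 0;
    for (const [term, count] of a) dot += count * (b.get(term) || 0);
    const norm = (v) => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
    const denom = norm(a) * norm(b);
    return denom === 0 ? 0 : dot / denom;
}

function similarity(x, y, stem) {
    return cosine(termCounts(tokenize(x, stem)), termCounts(tokenize(y, stem)));
}

// Hypothetical texts: the query differs from the stored text only by plurals.
const storedText = 'yellow pen marker';
const queryText = 'yellow pens markers';
const simWithoutStemming = similarity(storedText, queryText, false); // about 0.33
const simWithStemming = similarity(storedText, queryText, true);     // about 1
console.log(simWithoutStemming, simWithStemming);
```

Without stemming, only 'yellow' overlaps, so the plural forms count as unrelated words; with the stand-in stemmer the two texts map to the same terms.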
Play with it on RunKit!
Text streams
QMiner is also able to process streaming data. Here's a custom-defined stream aggregate. For native aggregates check the 'Time series' menu tab.
We define the schema as usual. We simulate our stream by creating the s vector.
Each element of the vector has to comply with the pre-defined schema. Then we define a custom
JavaScript stream aggregate that counts the frequency of the names.
Note that when ingesting the stream, QMiner only keeps in memory the model and discards
the data itself. The frequency of the two selected names is displayed after each new data
point comes in:
"[0] John:1 --- Mary:null"
"[1] John:1 --- Mary:1"
"[2] John:1 --- Mary:1"
"[3] John:1 --- Mary:1"
"[4] John:1 --- Mary:2"
"[5] John:1 --- Mary:2"
Now let's change this example to do word count. First change the fields on lines 7-9 to
{ name: 'text', type: 'string' }. Now let's make sure each element of the
stream complies with the schema, so update lines 17-23 to look like
s.push(ps.newRecord({ text: 'John'}));. Replace line 29 with
data[rec.text] = data[rec.text] == undefined ? 1 : data[rec.text] + 1;.
Finally, replace line 45 with console.log(stream.saveJson());. Now try it out!
The resulting output that counts frequent words looks like:
{John: 1}
{John: 1, Mary: 1}
{Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 1}
{Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 1, Jack: 1, Jill: 1, John: 1, Mary: 2}
{Andy: 2, Jack: 1, Jill: 1, John: 1, Mary: 2}
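The pattern of keeping only the model in memory can be sketched in plain JavaScript without QMiner. In this sketch, `makeWordCounter` is a hypothetical helper, `onAdd` is an assumed name for the per-record callback, `saveJson` mirrors the call mentioned above, and the input stream is made-up data rather than the stream from the original example.

```javascript
// Plain JavaScript sketch of a streaming word counter; this is NOT the
// QMiner stream aggregate API, it only mimics its general shape.
function makeWordCounter() {
    const data = {}; // the "model": word -> count; records are not stored

    return {
        // Called once per incoming record (assumed callback name).
        onAdd(rec) {
            data[rec.text] = data[rec.text] === undefined ? 1 : data[rec.text] + 1;
        },
        // Returns a copy of the current counts.
        saveJson() {
            return Object.assign({}, data);
        }
    };
}

// Hypothetical input stream, not the data from the original example.
const wordCounter = makeWordCounter();
const snapshots = [];
for (const text of ['John', 'Mary', 'Jill', 'Mary']) {
    wordCounter.onAdd({ text });              // ingest one record
    snapshots.push(wordCounter.saveJson());   // snapshot the model
    console.log(snapshots[snapshots.length - 1]);
}
```

Each record is discarded after `onAdd` returns; only the counts object grows, which is what lets the counter run over an unbounded stream.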
Play with it on RunKit!