
Commit cf28435

Initial commit
1 parent 7681edc commit cf28435

7 files changed: +335 −2 lines changed

README.md

+85 −2
@@ -1,2 +1,85 @@
Removed:

# feed-visualizer

Feed Visualizer is a tool that can cluster RSS/Atom feed items based on semantic similarity and generate interactive visualization. This tool can be used go generate 'semantic summary' of any website by reading it's RSS/Atom feed. The below image how the visualization generated by Feed Visualizer looks like.

Added:
## Introduction

Feed Visualizer is a tool that can cluster RSS/Atom feed items based on semantic similarity and generate an interactive visualization.
This tool can be used to generate a 'semantic summary' of any website by reading its RSS/Atom feed. Shown below is an image of what the visualization generated by Feed Visualizer looks like. If you like this tool, please consider giving it a ⭐ on GitHub!

![](sample_visualization.gif)

Interactive Demos:

* Visualization created from [Martin Fowler's Atom Feed](https://martinfowler.com/feed.atom):
[https://ashishware.com/static/martin_fowler_viz.html](https://ashishware.com/static/martin_fowler_viz.html)

* Visualization created from [BBC's RSS Feed](http://feeds.bbci.co.uk/news/rss.xml):
[https://ashishware.com/static/martin_fowler_viz.html](https://ashishware.com/static/martin_fowler_viz.html)

## Quick Start

Clone the repo:

```bash
git clone https://github.com/code2k13/feed-visualizer.git
```

Navigate to the newly created directory:

```bash
cd feed-visualizer
```

Install the required modules:

```bash
pip install -r requirements.txt
```
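If you prefer to keep these packages isolated (sentence-transformers pulls in fairly heavy dependencies such as PyTorch), you can run the install inside a virtual environment. This is standard Python tooling, not something the repo requires:

```bash
# optional: create and activate an isolated environment, then install
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```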

> Typically an RSS or Atom file only contains recent items from the website. This is where I would highly recommend the [wayback_machine_downloader](https://github.com/hartator/wayback-machine-downloader) tool; follow the instructions on its page to install it.

The command below downloads archived copies of the public RSS feed from [NASA](https://www.nasa.gov/rss/dyn/breaking_news.rss) for the first half of 2021 and saves them to a folder named 'nasa':

```bash
wayback_machine_downloader https://www.nasa.gov/rss/dyn/breaking_news.rss -s -f 202101 -t 202106 -d nasa
```

> Alternatively, you can simply create a new folder and put all your RSS or Atom files in it (if you already have them)! Make sure to point your config to this folder (see the next step).

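The tool scans the input directory recursively, so the feed files can either sit at the top level or be nested in subfolders. A purely illustrative layout (the folder and file names below are made up):

```
nasa/
├── 2021-01/breaking_news.rss
├── 2021-03/breaking_news.rss
└── 2021-06/breaking_news.rss
```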
Now we need to create a config file for Feed Visualizer. The config file contains the path to the input directory, the name of the output directory, and a few other settings (discussed later) that control the output of the tool. This is what a sample configuration file looks like:

```json
{
  "input_directory": "nasa",
  "output_directory": "nasa_output",
  "pretrained_model": "all-mpnet-base-v2",
  "clust_dist_threshold": 4,
  "tsne_iter": 8000,
  "text_max_length": 2048
}
```

Now it's time to run the tool:

```bash
python3 visualize.py -c config.json
```

Once the above command completes, you should see *visualization.html* and *data.csv* files in the output folder (nasa_output). Copy these files to a web server (or use a simple static server like [http-server](https://www.npmjs.com/package/http-server)) and open the visualization.html page in a browser. You should see something like this:

![nasa](nasa_visualization.png)
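To preview the output locally without setting up a real web server, either of the following should work (assuming you have Node.js or Python 3 available; the port is arbitrary):

```bash
# option 1: http-server via npx
npx http-server nasa_output -p 8080

# option 2: Python's built-in static file server
python3 -m http.server 8080 --directory nasa_output
```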

## Config settings

Here is some information on what each config setting does:

```json
{
  "input_directory": "Path to the input directory. It may contain subfolders, but should only contain RSS or Atom files.",
  "output_directory": "Path to the output directory where the visualization will be stored. The directory is created if it does not exist; its contents are always overwritten.",
  "pretrained_model": "Name of the pretrained sentence-transformers model. A list of all valid model names is at https://www.sbert.net/docs/pretrained_models.html#model-overview",
  "clust_dist_threshold": "Integer distance threshold above which clusters are not merged (passed to AgglomerativeClustering). There is no single correct value here; experiment!",
  "tsne_iter": "Integer number of iterations for t-SNE (higher is better).",
  "text_max_length": "Integer number of characters to read from each item's content/description for semantic encoding."
}
```
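As an illustration of the kind of experimentation this encourages, here is a hypothetical variant of the sample config that trades some quality for speed by using a smaller pretrained model and fewer t-SNE iterations. The values are only a starting point, not a recommendation:

```json
{
  "input_directory": "nasa",
  "output_directory": "nasa_output_fast",
  "pretrained_model": "all-MiniLM-L6-v2",
  "clust_dist_threshold": 4,
  "tsne_iter": 2000,
  "text_max_length": 1024
}
```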
## Issues/Feature Requests/Bugs

You can reach out to me on [👨‍💼 LinkedIn](https://www.linkedin.com/in/ashish-patil-66bb568/) or [🗨️ Twitter](https://twitter.com/patilsaheb) to report issues/bugs or to request features!

nasa_visualization.png

20.5 KB

requirements.txt

+4
@@ -0,0 +1,4 @@
feedparser
sentence-transformers
pandas
nltk
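Note that visualize.py also imports bs4 (BeautifulSoup), which is not listed above and is not an obvious transitive dependency of these packages; if the import fails, you may additionally need:

```bash
pip install beautifulsoup4
```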

sample_config.json

+8
@@ -0,0 +1,8 @@
{
    "input_directory": "nasa",
    "output_directory": "nasa_output",
    "pretrained_model": "all-mpnet-base-v2",
    "clust_dist_threshold": 4,
    "tsne_iter": 8000,
    "text_max_length": 2048
}

sample_visualization.gif

456 KB

visualization.html

+123
@@ -0,0 +1,123 @@
<head>
    <!-- Load d3.js, plotly.js and Bootstrap into the DOM -->
    <script src="https://d3js.org/d3.v7.min.js"></script>
    <script src='https://cdn.plot.ly/plotly-2.11.1.min.js'></script>
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet"
        integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3" crossorigin="anonymous">
    <meta name="viewport" content="width=device-width, initial-scale=1">
</head>

<body>
    <div class="container">
        <div class="jumbotron">
            <p class="lead" style="padding-top:20px">
                Add information about the visualization here !
            </p>
        </div>
        <div id="clusters"> </div>
        <div id='myDiv' style="width:90%;height:800px"> </div>
    </div>
    <script>
        //let color = d3.scaleOrdinal(d3.schemeCategory10);
        let color = d3.scaleSequential(d3.interpolateRainbow);
        let cluster_count = 1
        let csv_data = null;

        // Single scatter trace; points are filled in from data.csv by processData().
        let trace1 = {
            x: [],
            y: [],
            marker: {
                size: [],
                color: []
            },
            text: [],
            mode: 'markers',
            name: 'Markers and Text',
            textposition: 'top',
            type: 'scatter',
        };
        let data = [trace1]
        let layout = {
            showlegend: false,
            xaxis: {
                showgrid: false,
                zeroline: false
            },
            yaxis: {
                showgrid: false,
                showline: false,
                zeroline: false
            }
        }

        function makeplot() {
            // Cache-bust data.csv, build the cluster legend, then draw the plot.
            d3.csv("data.csv" + '?' + Math.floor(Math.random() * 1000)).then((d) => {
                csv_data = d
                let clusterNumbers = d.map(a => parseInt(a.cluster))
                cluster_count = Math.max(...clusterNumbers)
                d3.select('#clusters')
                    .selectAll('span')
                    .data(d3.range(0, cluster_count))
                    .enter()
                    .append('span')
                    .style("background-color", function (d) { return color(d / cluster_count) })
                    .style("padding", "2px")
                    .style("border", "1px solid grey")
                    .style("min-width", "25px")
                    .style("display", "inline-block")
                    .style("color", "white")
                    .style("text-shadow", "1px 1px grey")
                    .style("margin", "1px")
                    .style("border-radius", "2px")
                    .attr("data-clusterId", function (d) { return d })
                    .on("mouseover", function (e, d) {
                        // Highlight the hovered cluster and grey out all other points.
                        let newTrace = JSON.parse(JSON.stringify(trace1));
                        let new_colors = newTrace.marker.color.map(function (c, idx) {
                            return csv_data[idx].cluster == d ? color(d / cluster_count) : "#e0eeeeee"
                        })
                        newTrace.marker.color = new_colors
                        drawPlot('myDiv', [newTrace], layout)
                    })
                    .on("mouseout", function () {
                        drawPlot('myDiv', [trace1], layout)
                    })
                    .text(function (d) {
                        return d;
                    });

                d.forEach(element => {
                    processData(element)
                });
                drawPlot('myDiv', data, layout)
            })
        };

        function drawPlot(placeholder, data, layout) {
            Plotly.newPlot(placeholder, data, layout).then(function () {
                document.getElementById(placeholder).on('plotly_click', function (data) {
                    // Clicking a point opens the corresponding feed item in a new tab.
                    window.open(csv_data[data.points[0].pointIndex].url, "_blank")
                });
            })
        }

        function processData(row) {
            trace1.x.push(row.x)
            trace1.y.push(row.y)
            trace1.text.push(row.label.replace("Bliki:", ""))
            trace1.marker.size.push((parseInt(row.count) + 8) * 1.2)
            trace1.marker.color.push(color(parseInt(row.cluster) / cluster_count))
            console.log(row.cluster)
        }
        makeplot()
    </script>
</body>
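For reference, this page loads data.csv (written by visualize.py) and expects at least the columns x, y, label, count, url and cluster, along with the unnamed index column that pandas writes by default. The rows below are made-up values, purely for illustration:

```csv
,x,y,label,count,url,cluster
0,12.4,-3.1,NASA Selects New Mission,2,https://www.nasa.gov/example-item-1,0
1,11.8,-2.7,Space Station Science Update,0,https://www.nasa.gov/example-item-2,0
```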

visualize.py

+115
@@ -0,0 +1,115 @@
#!/usr/bin/env python3

import argparse
import csv
import glob
import json
import os
import shutil

import feedparser
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE
from tqdm import tqdm

parser = argparse.ArgumentParser(description='Generates cool visualization from Atom/RSS feeds !')
parser.add_argument('-c', '--configuration', required=True,
                    help='location of configuration file.')
args = parser.parse_args()

with open(args.configuration, 'r') as config_file:
    config = json.load(config_file)

semantic_encoder_model = SentenceTransformer(config["pretrained_model"])


def get_all_entries(path):
    # Parse every feed file under the input directory (recursively) and collect
    # {link: [title, title + summary/content]} for each item.
    all_entries = {}
    files = glob.glob(path + "/**/**/*.*", recursive=True)
    for file in tqdm(files, desc='Reading posts from files'):
        feed = feedparser.parse(file)
        for entry in feed['entries']:
            if 'summary' in entry:
                all_entries[entry['link']] = [
                    entry['title'], entry['title'] + " " + entry['summary']]
            elif 'content' in entry:
                all_entries[entry['link']] = [
                    entry['title'], entry['title'] + " " + entry['content'][0]['value']]
    return all_entries


def generate_text_for_entry(raw_text, entry_counts):
    # Strip HTML from an item's text and count how often each linked URL is referenced.
    output = []
    raw_text = raw_text.replace("\n", " ")
    soup = BeautifulSoup(raw_text, features="html.parser")
    output.append(soup.text)
    for link in BeautifulSoup(raw_text, parse_only=SoupStrainer('a'), features="html.parser"):
        if link.has_attr('href'):
            url = link['href']
            if url in entry_counts:
                entry_counts[url] = entry_counts[url] + 1
            else:
                entry_counts[url] = 0
    return ' '.join(output)


def generate_embeddings(entries, entry_counts):
    # Encode each item's cleaned text with the sentence-transformers model and
    # append the resulting embedding to its entry.
    sentences = [generate_text_for_entry(
        entries[a][1][0:config["text_max_length"]], entry_counts) for a in entries]
    print('Generating embeddings ...')
    embeddings = semantic_encoder_model.encode(sentences)
    print('Generating embeddings ... Done !')
    index = 0
    for uri in entries:
        entries[uri].append(embeddings[index])
        index = index + 1
    return entries


def get_coordinates(entries):
    # Project the embeddings to 2D with t-SNE, then cluster the projected points,
    # so the plotted coordinates and the cluster assignments come from the same run.
    X = np.array([entries[e][-1] for e in entries])
    tsne = TSNE(n_iter=config["tsne_iter"], init='pca', learning_rate='auto')
    coords = tsne.fit_transform(X)
    clustering_model = AgglomerativeClustering(
        distance_threshold=config["clust_dist_threshold"], n_clusters=None)
    clusters = clustering_model.fit_predict(coords)
    return [c[0] for c in coords], [c[1] for c in coords], clusters


def main():
    all_entries = get_all_entries(config["input_directory"])
    entry_counts = {}
    entry_texts = []
    distinct_entries = {}
    # Drop items that share a title with an item already seen.
    for k in all_entries.keys():
        if all_entries[k][0] not in entry_texts:
            distinct_entries[k] = all_entries[k]
            entry_texts.append(all_entries[k][0])

    all_entries = distinct_entries
    entries = generate_embeddings(all_entries, entry_counts)
    print('Creating clusters ...')
    x, y, cluster_info = get_coordinates(entries)
    print('Creating clusters ... Done !')
    labels = [entries[k][0] for k in entries]

    counts = [entry_counts[k] if k in entry_counts else 0 for k in entries]

    df = pd.DataFrame({'x': x, 'y': y, 'label': labels,
                       'count': counts, 'url': entries.keys(), 'cluster': cluster_info})

    if not os.path.exists(config["output_directory"]):
        os.makedirs(config["output_directory"])
    df.to_csv(config["output_directory"] + "/data.csv")

    shutil.copy('visualization.html', config["output_directory"])
    print('Visualization generation is complete !!')


if __name__ == "__main__":
    main()
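Assuming feeds have been downloaded into a ./nasa folder as described in the Quick Start, the script can be run directly with the bundled sample_config.json:

```bash
python3 visualize.py -c sample_config.json
```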

0 commit comments
