data_diff v1.0 Track Changes on Websites

data_diff helps track changes to a website. First, crawl the website with the bundled web crawler or any other crawling method and save the results in JSON format, then process the results with this tool.

After the website has been updated with new URLs and features, crawl it again and process the results in the same way.

The tool can then show what changed on the website by computing the difference between each pair of data sets.

It can also show the differences across multiple data sets.
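
As a rough illustration of the idea (not the tool's actual code), each crawled record can be reduced to a fingerprint so that two crawls can be compared as sets. The sketch below assumes each crawl result is a JSON array of objects with a `url` field. The field name, the `fingerprint` and `load_data_set` helpers, and the use of `hashlib` in place of mmh3 are all assumptions made for this example.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable hex digest for one crawled record (a dict)."""
    # sort_keys makes the digest independent of key order in the JSON object.
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def load_data_set(json_text):
    """Map fingerprint -> record for a crawl result given as JSON text."""
    records = json.loads(json_text)
    return {fingerprint(r): r for r in records}


# Two hypothetical crawls of the same site, before and after an update.
crawl_before = '[{"url": "/home"}, {"url": "/about"}]'
crawl_after = '[{"url": "/home"}, {"url": "/about"}, {"url": "/login"}]'

before = load_data_set(crawl_before)
after = load_data_set(crawl_after)

# Records whose fingerprint appears only in the second crawl are new content.
new_urls = sorted(after[h]["url"] for h in set(after) - set(before))
print(new_urls)
```

Fingerprinting whole records rather than URLs alone means that a page whose crawled fields changed also shows up as a difference, which may or may not match how data_diff itself stores data.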

 

Usage

Usage: delta.py [options]

Options:
  -h, --help            show this help message and exit
  -l, --list            list all available data sets
  -p FILE data_source, --process=FILE data_source
                        process file into data base
  -o [HASH 1] [HASH 2], --diff-overlap=[HASH 1] [HASH 2]
                        find the overlap of 2 data sets
  -m [HASH 1] [HASH 2], --diff-minus=[HASH 1] [HASH 2]
                        find the data in data set 1 but excluded from data set
                        2
  -c [HASH 1] [HASH 2], --diff-combine=[HASH 1] [HASH 2]
                        find the combined data set from data set 1 and data
                        set 2
  -D, --delete-all      Delete all data
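
The three diff options read like set intersection, difference, and union. Here is a minimal sketch of those semantics, assuming each data set is simply a set of URL strings. The function names are hypothetical and are not taken from delta.py.

```python
def diff_overlap(a, b):
    """Items present in both data sets (compare --diff-overlap)."""
    return a & b


def diff_minus(a, b):
    """Items in data set 1 but not in data set 2 (compare --diff-minus)."""
    return a - b


def diff_combine(a, b):
    """All items from either data set (compare --diff-combine)."""
    return a | b


# Two hypothetical crawls of the same site.
old_crawl = {"/home", "/about", "/contact"}
new_crawl = {"/home", "/about", "/login"}

print(sorted(diff_overlap(old_crawl, new_crawl)))  # ['/about', '/home']
print(sorted(diff_minus(new_crawl, old_crawl)))    # ['/login']
print(sorted(diff_minus(old_crawl, new_crawl)))    # ['/contact']
print(sorted(diff_combine(old_crawl, new_crawl)))  # ['/about', '/contact', '/home', '/login']
```

The minus operation is not symmetric: with the newer crawl first it lists pages that were added, and with the older crawl first it lists pages that were removed.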

 

Installation

  • Elasticsearch 1.7.2
  • Python 2.7
    • pip install elasticsearch (2.2.0)
    • pip install Scrapy (1.0.6)
    • pip install mmh3 (2.3.1)
    • pip install urllib3 (1.14)

(The versions listed pass the tests. Other versions have not been verified.)

More Information: here

Download: data_diff v1.0 (https://github.com/pangbo-1988/data_diff)

Thanks to Bo Pang for sharing this tool with us.

MaxiSoler

www.artssec.com @maxisoler