Arrow V - Command line crawler

Version    Comments    Download
1.0                    Download

ArrowV is a quick, manual command-line crawler. It is heavily inspired by the Firefox plug-in "Navicrawler".

Basically, you "jump" from website to website, and the program collects all the hyperlinks and builds a graph.

OK, great, but how does it work?

First, you need Python (I tested on v2.7.2; I'm not sure it will work on v3, sorry).

Open your terminal (cmd or a Linux shell).

Launch ArrowV like this:

>> python ArrowV.py

This launches the program, and a very shiny piece of ASCII art welcomes you.

Then launch the 'navcom' component:

>> Arrow V : navcom

You should now be at the navcom prompt.

Navcom

Navcom is the component developed to crawl the web manually.

Jump command

The first step is to jump to a website. For example, let's jump to www.utc.fr (my university ;)

Arrow V  [Navcom] >  : jump http://www.utc.Fr
http://www.utc.Fr > Downloading
http://www.utc.Fr > Downloaded
http://www.utc.Fr > Analyzing Page
** You Are at http://www.utc.Fr **
** Here are links founded **
> [0] : u'http://abc-innovation.utc.fr'
> [1] : u'http://interactions.utc.fr'
> [2] : u'http://utcenligne.utc.fr'
> [3] : u'http://www.tremplin-utc.asso.fr'
> [4] : u'http://bibliotheque.utc.fr'
> [5] : u'http://wwwassos.utc.fr'
> [6] : u'http://ent.utc.fr'
> [7] : u'http://www.facebook.com'
> [8] : u'http://twitter.com'
> [9] : u'http://www.youtube.com'

Don't run away. Let me explain.

The web page has been analysed to find all the hyperlinks (<a>) it contains. Internally, the program has created the graph.

The hyperlinks found are displayed, and you can jump to any of them just by using the number in front of the link.
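To make the analysis step more concrete, here is a minimal sketch of how the links of a page could be collected and stored in a graph. It is not ArrowV's actual code: it is written for Python 3 (ArrowV itself was tested on 2.7), the names LinkCollector, extract_sites and record_page are made up for the illustration, and reducing each link to its site root is only an assumption based on the output shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_sites(page_url, html):
    """Return the distinct external sites linked from the page, in page order.

    Assumption: links are reduced to scheme + host, and links that stay
    on the current site are ignored.
    """
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc.lower()
    sites = []
    for href in collector.hrefs:
        parts = urlparse(urljoin(page_url, href))  # resolve relative links
        host = parts.netloc.lower()
        if parts.scheme not in ("http", "https") or not host or host == own_host:
            continue
        site = parts.scheme + "://" + host
        if site not in sites:
            sites.append(site)
    return sites


def record_page(graph, page_url, html):
    """Store the page's outgoing links in an adjacency-list graph and return them."""
    sites = extract_sites(page_url, html)
    graph[page_url] = sites
    for site in sites:
        graph.setdefault(site, [])  # make sure every target is a node too
    return sites
```

The graph here is a plain dictionary mapping each URL to the list of sites it links to, which is enough to number the links for display.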

For example, if I want to jump to http://twitter.com now, I can type:

Arrow V  [Navcom] >  : jump 8
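A jump therefore accepts either a full URL or a link number. The following Python 3 sketch shows one way that choice could be made; resolve_jump_target is a hypothetical helper, not a function taken from ArrowV.

```python
def resolve_jump_target(argument, current_links):
    """Turn the argument of `jump` into a URL.

    A bare number is treated as an index into the links listed for the
    current page; anything else is treated as a URL.
    """
    if argument.isdigit():
        index = int(argument)
        if index >= len(current_links):
            raise ValueError("no link numbered %d on this page" % index)
        return current_links[index]
    return argument
```

With the ten links listed for http://www.utc.Fr above, resolve_jump_target("8", links) gives back http://twitter.com.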

Scan command

With the scan command, navcom analyses all the neighboring hyperlinks to build the network.

For example, if you jump to http://www.utc.fr and perform a scan, the program will automatically "jump" to every neighbor to analyse it, then come back to the starting point.

Arrow V  [Navcom] > @www.utc.Fr : scan
http://abc-innovation.utc.fr > Downloading
http://abc-innovation.utc.fr > Downloaded
http://abc-innovation.utc.fr > Analyzing Page
http://interactions.utc.fr > Downloading
http://interactions.utc.fr > Downloaded
http://interactions.utc.fr > Analyzing Page
http://utcenligne.utc.fr > Downloading
http://utcenligne.utc.fr > Downloaded
http://utcenligne.utc.fr > Analyzing Page
http://www.tremplin-utc.asso.fr > Downloading
http://www.tremplin-utc.asso.fr > Downloaded
http://www.tremplin-utc.asso.fr > Analyzing Page
http://bibliotheque.utc.fr > Downloading
http://bibliotheque.utc.fr > Downloaded
http://bibliotheque.utc.fr > Analyzing Page
http://wwwassos.utc.fr > Downloading
http://wwwassos.utc.fr > Downloaded
http://wwwassos.utc.fr > Analyzing Page
http://ent.utc.fr > Downloading
http://ent.utc.fr > Downloaded
http://ent.utc.fr > Analyzing Page
http://www.facebook.com > Downloading
http://www.facebook.com > Downloaded
http://www.facebook.com > Analyzing Page
http://twitter.com > Downloading
http://twitter.com > Downloaded
http://twitter.com > Analyzing Page
http://www.youtube.com > Downloading
http://www.youtube.com > Downloaded
http://www.youtube.com > Analyzing Page

The -d option sets the distance. By default, scan only looks at direct neighbors (so distance = 1). If you use

Arrow V  [Navcom] > @www.utc.Fr : scan -d 2

the scan will look at the neighbors and the neighbors' neighbors (understand?)
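If the distance idea is still fuzzy, it can be pictured as a breadth-first walk that stops after a given number of hops. The Python 3 sketch below is only an illustration under that assumption, not ArrowV's real scan; fetch_links is a hypothetical callback that returns the links of a page (for instance by downloading and analysing it).

```python
from collections import deque


def scan(start_url, fetch_links, distance=1):
    """Breadth-first crawl of every page within `distance` hops of start_url.

    fetch_links(url) must return the list of URLs linked from url.
    Returns a dict mapping each analysed page to its outgoing links.
    """
    graph = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue  # already analysed, do not download it twice
        graph[url] = fetch_links(url)
        if depth < distance:
            # Only pages closer than the limit contribute new pages to visit.
            queue.extend((link, depth + 1) for link in graph[url])
    return graph
```

With distance=1 only the starting page and its direct neighbors are analysed; with distance=2 the neighbors' neighbors are analysed as well.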

Gephi Stream

I'm a Gephi lover, and I particularly love Gephi's Stream function (<3)

So you can activate the connection with Gephi by typing upgephi.

Make sure that Gephi is launched and the Streaming Master Server is running.

If you want to stop the streaming, just use downgephi.

Arrow V  [Navcom] > @www.utc.Fr : upgephi
Gephi connector is ON
Arrow V  [Navcom] > @www.utc.Fr : downgephi
Gephi connector is OFF
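For the curious, here is a rough Python 3 sketch of what pushing one hyperlink to Gephi could look like. It is not the code behind upgephi. The "an" (add node) and "ae" (add edge) event names follow the JSON format of Gephi's Graph Streaming plugin as I understand it, while the address in GEPHI_URL and the helper names link_events, encode_events and push_link are assumptions to adapt to your own setup.

```python
import json
import urllib.request

# Assumed address of Gephi's Streaming Master Server; adjust host, port
# and workspace name to match your own Gephi instance.
GEPHI_URL = "http://localhost:8080/workspace0?operation=updateGraph"


def link_events(source, target):
    """Build the events that add two nodes and a directed edge between them."""
    edge_id = source + "->" + target
    return [
        {"an": {source: {"label": source}}},
        {"an": {target: {"label": target}}},
        {"ae": {edge_id: {"source": source, "target": target, "directed": True}}},
    ]


def encode_events(events):
    """Serialise the events as one JSON object per line, encoded as UTF-8 bytes."""
    return "".join(json.dumps(event) + "\r\n" for event in events).encode("utf-8")


def push_link(source, target, url=GEPHI_URL):
    """POST the events for one hyperlink to a running Gephi instance."""
    payload = encode_events(link_events(source, target))
    request = urllib.request.Request(url, data=payload)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling push_link only makes sense while Gephi and its streaming server are running; link_events and encode_events can be tried on their own.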

Other commands

info : to check where you are now

history : to see your jump history

map : to see the current map

save <name> : to save your session (as a GDF file)
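As an idea of what a saved session might contain, this Python 3 sketch writes a small graph in the GDF format that Gephi can open. It is a guess at the shape of the file, not ArrowV's own save routine: to_gdf and save_session are hypothetical names, the graph is assumed to be a dictionary mapping each URL to the URLs it links to, and URLs containing commas are not handled.

```python
def to_gdf(graph):
    """Serialise an adjacency-list graph ({url: [linked urls]}) as GDF text."""
    # Collect every URL once, whether it appears as a source or a target.
    nodes = []
    for source, targets in graph.items():
        for url in [source] + list(targets):
            if url not in nodes:
                nodes.append(url)
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    lines += ["%s,%s" % (url, url) for url in nodes]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN")
    for source, targets in graph.items():
        lines += ["%s,%s,true" % (source, target) for target in targets]
    return "\n".join(lines) + "\n"


def save_session(graph, name):
    """Write the graph to <name>.gdf and return the file name."""
    filename = name + ".gdf"
    with open(filename, "w") as handle:
        handle.write(to_gdf(graph))
    return filename
```

The file has two sections: a node list introduced by nodedef> and an edge list introduced by edgedef>, one entry per line.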

What the FAQ

Hey! There are some bugs / Hey, I wanna change something

I know there are bugs and I'm sorry in advance; I'll try to fix them ASAP.

But please, feel free to change the code. It's in Python and it's open source!

If a lot of people are interested, I'll try to put it on GitHub / SourceForge / Google Code.

Why the name ArrowV?

Because I love Wing Commander III (a good old game that shows lolcats will rule the universe).

This article was updated on 20/06/20