Why page scrape?

When you open a web page in a browser, you see all the pretty pictures and advertisements. What if you want only part of the page, without the advertisements, or you do not have a GUI-based terminal? You can get that with what is known as a page scrape. A page scrape extracts just the data we want without ever having to look at the page in a GUI browser. What really happens is that we print only the text we want from the page to the screen.



Take the following page:


$ firefox http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html


But all we really want is this part of the page.


Now suppose we want to do everything from the command line, where all we can use is text characters. No problem, but we still have to use the mouse to highlight what we need and then paste it to the screen or into a file. Here is the same page as text, scrolled down to the data we want:

$ lynx "http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html"





But then you would like to go one step further and have the computer visit the page, then get and save the data for you. We can accomplish this with something as simple as a shell script.

logic:
# Get the page and dump it as plain text to standard output.
lynx -width 1000 -dump "http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html" |
# Search that text for some particular non-repeated text (here, the sign name).
grep "$hsign" |
# Take that result and wrap it to a column 60 characters wide.
fold -sw 60
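
To try the grep and fold steps without visiting the page, here is a minimal sketch that feeds them a saved sample instead of live lynx output. The sample_page function and its Aries line are made up for illustration; the Virgo line is the text quoted in this post.

```shell
#!/bin/sh
# Stand-in for the text that "lynx -dump" would print.
# sample_page is a hypothetical helper; the ARIES line is placeholder text.
sample_page() {
cat <<'EOF'
ARIES (March 21-April 19). Placeholder text for another sign.
VIRGO (Aug. 23-Sept. 22). Instead of looking to relationships to make you happy, look to them to make you conscious of what has been weighing heavily inside you at an unconscious level.
EOF
}

# The sign to look for, in upper case as it appears in the page text.
hsign="VIRGO"

# Keep only the line that mentions the sign, then wrap it at 60 columns,
# breaking at spaces (-s) rather than in the middle of a word.
sample_page | grep "$hsign" | fold -sw 60
```

Once that looks right, swapping sample_page for the real lynx command should give the same shape of output, assuming the page still puts each sign on one line.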

So if we put that logic in a shell script, the results might look like this:

 $ ghpcl.sh
Enter your horoscope sign:
   _
  ' `:--.--.
     |  |  |_     Virgo-  The Virgin
     |  |  | )
     |  |  |/
          (J

Today's date: 12/05/14
Today's horoscope for:
   VIRGO (Aug. 23-Sept. 22). Instead
of looking to relationships to make
you happy, look to them to make you
conscious of what has been weighing
heavily inside you at an unconscious
level.

We have done this with many web pages. You may have to vary how you do it, though; it takes a little trial and error to get what you want. What is really neat is that you can combine several different page scrapes and make your own newsletter of sorts. (Earlier articles have explained how to do that in detail.) Now you do not have to pore over many web pages just to get what you want. The computer will have done that for you. Plus, you can use a text-to-speech program to turn the words into speech, then save the speech to an audio file for later listening on your music player.

$ lynx -width 1000 -dump "http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html" | grep "VIRGO" | fold -sw 60 > tts.txt
$ text2wave tts.txt -o tts.wav
$ play tts.wav

tts.wav:

 File Size: 746k      Bit Rate: 256k
  Encoding: Signed PCM   
  Channels: 1 @ 16-bit  
Samplerate: 16000Hz     
Replaygain: off        
  Duration: 00:00:23.30 

In:100%  00:00:23.30 [00:00:00.00] Out:373k  [      |      ] Hd:2.4 Clip:0   
Done.
$
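
The three commands above can be wrapped up so that they stop early when a tool is missing or when grep finds nothing. This is only a sketch: the function names need and scrape_to_wav are made up for illustration, and it assumes lynx and text2wave (from Festival) are the tools in use, as in the commands above.

```shell
#!/bin/sh
# Page to scrape (the same one used throughout this post).
URL="http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html"

# Return 0 if every named command is installed; otherwise report what is missing.
need() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Scrape one sign and save it as speech.
# $1 = sign in upper case, $2 = base name for the .txt and .wav files.
scrape_to_wav() {
    need lynx text2wave || return 1
    lynx -width 1000 -dump "$URL" | grep "$1" | fold -sw 60 > "$2.txt"
    # Only synthesize speech if grep actually found some text.
    if [ ! -s "$2.txt" ]; then
        echo "no text found for $1" >&2
        return 1
    fi
    text2wave "$2.txt" -o "$2.wav"
}
```

With those functions loaded, something like scrape_to_wav VIRGO tts && play tts.wav would do the same job as the three separate commands.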

Starting your own podiobooks would be the next level.


[code]
# Get today's horoscope
echo "--------------------------------------------"
# Character width used when wrapping the text
cw=60
# If no sign is entered, use Virgo as the default.
hsign=${1:-Virgo}
# Set the sign to upper case so it matches the page text and the symbol file name
hsign=$(echo "$hsign" | tr '[:lower:]' '[:upper:]')
# Print the symbol text from an existing file
cat ~/signs/"$hsign"
# Show the date
printf "Today's date: "
date +%D
# Print out the data
echo "Today's horoscope for:"
lynx -width 1000 -dump "http://www.creators.com/lifestylefeatures/horoscopes/holiday-mathis-weekly.html" | grep "$hsign" | fold -sw "$cw"
echo "--------------------------------------------"
[/code]
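
As a rough sketch of the newsletter idea mentioned earlier, the script can be run once per sign and the results collected in one file. The function name build_newsletter and the SCRAPER variable are made up for illustration, and the sketch assumes the script above is saved as ghpcl.sh somewhere on your PATH.

```shell
#!/bin/sh
# Command to run for each sign. ghpcl.sh is the script name used in this post;
# set SCRAPER to something else to test the loop without network access.
: "${SCRAPER:=ghpcl.sh}"

# Build one text file from several runs of the scraper.
# $1 = output file, remaining arguments = the signs to include.
build_newsletter() {
    out=$1
    shift
    # Start with an empty file, then append one section per sign.
    : > "$out"
    for sign in "$@"; do
        "$SCRAPER" "$sign" >> "$out"
    done
}
```

For example, build_newsletter newsletter.txt Virgo Leo would write both horoscopes, one after the other, to newsletter.txt.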

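Symbol files (one file per sign, kept in ~/signs and named after the sign in upper case, for example ~/signs/VIRGO):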


 .-"-._.-"-._.-   Aquarius-  The Water Bearer
 .-"-._.-"-._.-


  .-.   .-.
  (_  \ /  _)    Aries-  The Ram
       |
       |

      .--.
     /   _`.     Cancer-  The Crab
    (_) ( )
   '.    /
     `--'

    \      /_)    Capricorn-  The Goat
     \    /`.
      \  /   ;
       \/ __.'

    ._____.
      | |        Gemini-  The Twins
      | |
     _|_|_
    '     '

      .--.
     (    )       Leo-  The Lion
    (_)  /
        (_,


        __
   ___.'  '.___   Libra-  The Balance
   ____________


     `-.    .-'   Pisces-  The Fishes
        :  :
      --:--:--
        :  :
     .-'    `-.

          ...
          .':     Sagittarius-  The Archer
        .'
    `..'
    .'`.

   _
  ' `:--.--.
     |  |  |      Scorpius-  The Scorpion
     |  |  |
     |  |  |  ..,
           `---':

    .     .
    '.___.'      Taurus-  The Bull
    .'   `.  
   :       : 
   :       :
    `.___.'

   _
  ' `:--.--.
     |  |  |_     Virgo-  The Virgin
     |  |  | )
     |  |  |/
          (J
