Wednesday, September 17, 2014

Scraping CNN.

Decided to take my score-scraping script and apply it to other media. Say we just want the headlines from CNN.com. I used essentially the same set of instructions on a local news site and got results. You will want to use the same rough logic, or pseudocode:

Dump web page as ascii (text) to a disk file.
Read the disk file in one line at a time, but ignore all lines until you get to what is needed.
Now continue reading one line at a time, but ignore certain lines.
With each line you read in, output what is needed while editing out unwanted characters.
Stop output when you get to a point where nothing else is needed.
Finish reading the file one line at a time.
End.
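
The steps above can be sketched as a small shell function. This is only a rough sketch: the name `extract_between` and its two marker arguments are made up for illustration, it reads an already-dumped page from standard input rather than fetching anything, and it simply stops reading at the stop marker instead of reading on to the end of the file. Skipping blank lines stands in for "ignore certain lines".

```shell
#!/bin/bash
# extract_between START STOP
# Hypothetical helper: print the line containing START and the lines that
# follow it, and stop at the first later line containing STOP.
extract_between() {
    local start=$1 stop=$2 printing=0 line
    while read -r line; do
        if [ "$printing" -eq 0 ]; then
            # ignore everything until the start marker shows up
            case $line in
                *"$start"*) printing=1; echo "$line" ;;
            esac
        else
            case $line in
                *"$stop"*) break ;;      # nothing else is needed
                "") ;;                   # an example of a line to ignore
                *) echo "$line" ;;
            esac
        fi
    done
}

# Made-up input standing in for a dumped web page
printf '%s\n' "menu" "THE LATEST" "* Headline one" "" "* Headline two" "Weather" "footer" |
    extract_between "THE LATEST" "Weather"
```

With a real dump it could be called as `extract_between "THE LATEST" "Weather" < rawcnn.txt`.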

You might get results like this (ignore the rows of equals signs):

=========
THE LATEST

* Source: Joan Rivers' doc took selfie
* Freak accident kills hero bus driver
* IOS 8 is live: How to get it
* Five iOS 8 features you'll love
* NEW Billionaire tells big named computer company: Innovate
* Obama stands firm: No ground troops
* Kerry heckled during testimony
* NEW Stocks hit record; thank Yellen
* NEW Panthers star takes leave
* Vikings: Peterson must stay away
* NEW Virus coming to a state near you
* Dowd inspires edible-pot campaign
* Wrongly convicted man gets a statue
* China blacks out CNN's report
* He mistakenly calls 911, then ...
* Surprise! Mendes, Gosling have baby

OPINION
...
...
===============

Then we can add it to our report.sh homemade newspaper. For details see:
http://computoman.blogspot.com/2013/12/create-your-own-newspages.html

#--------------------------------------------------------
# cnn.sh
echo "<h3>CNN Headlines</h3>" >> report.html
# creates cnn.txt
./cnn.sh > cnn.txt
# wrap the headlines in a single <pre> block so the line breaks survive
echo "<pre>" >> report.html
cat cnn.txt >> report.html
echo "</pre>" >> report.html
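
Since the headlines end up inside an HTML page, characters such as `&` and `<` in a headline could confuse the browser. A minimal sketch of escaping them first, assuming a helper named `html_escape` (hypothetical, not part of the original report.sh):

```shell
#!/bin/bash
# html_escape: hypothetical filter that reads text on stdin and replaces
# the characters that are special in HTML with their entities.
html_escape() {
    # '&' is handled first so the entities added afterwards are not re-escaped
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Made-up headline for illustration
echo 'AT&T <beats> estimates' | html_escape
```

It could stand in for the `cat` line, for example `html_escape < cnn.txt >> report.html`.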


[code]
#!/bin/bash
####################################
# CNN Headline Grabber
#
#===============================
# Assignments
# --------------------------------
datafile="rawcnn.txt"
flag=0      # 1 while we are inside the headline section
a=0         # index of the current line in fdata
# end assignments
#=================================
#
# Get data file
#---------------------------------
elinks -dump "www.cnn.com" > "$datafile"
#=================================
#
# Extract and display data
#---------------------------------
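while read -r line
do
    fdata[a]=$line      # keep a copy of every raw line
    # Start printing at the "THE LATEST" header
    if echo "$line" | grep -q "THE LATEST"; then
        # header: clear the screen only when writing to a terminal
        [ -t 1 ] && clear
        flag=1
    fi
    if [ "$flag" -eq 1 ]; then
        if echo "$line" | grep -q "Weather"; then
            # the Weather section marks the end of the headlines
            flag=0
        elif echo "$line" | grep -q "IMG"; then
            :   # skip image placeholder lines
        else
            # quoting "$line" keeps a leading * from expanding to file names;
            # sed strips the bracketed link markers
            echo "$line" | sed 's/\[.*\]//'
        fi
    fi
    a=$((a + 1))
done < "$datafile"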
# footer
echo ---------------------------------------------
echo
#===================================
# End.
####################################

[/code]
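
One thing to watch in the script above: `s/\[.*\]//` is greedy, so on a line with two bracketed markers it also removes the text between them. A possible alternative, assuming the link markers in the dump are purely numeric like `[12]` (that is an assumption about the dump format, so check your own rawcnn.txt), with the hypothetical name `strip_link_refs`:

```shell
#!/bin/bash
# strip_link_refs: hypothetical filter that removes only numeric
# markers such as [12], leaving other bracketed text alone.
strip_link_refs() {
    sed 's/\[[0-9][0-9]*\]//g'
}

# Made-up line with two markers; the greedy pattern would also eat "NEW "
echo '* [12]NEW [13]Stocks hit record' | strip_link_refs
```

The `g` flag removes every marker on the line, not just the first one.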
