Scraping CNN.
I decided to take my score-scraping script and apply it to other media. Say, for example, we just want the headlines from CNN.com. I used essentially the same set of instructions on a local news site and got results. You will want to use the same rough logic, or pseudocode:
Dump the web page as ASCII (text) to a disk file.
Read the disk file one line at a time, but ignore all lines until you reach what is needed.
Continue reading one line at a time, but ignore certain lines.
From each line you keep, output what is needed while editing out unwanted characters.
Stop output when you reach a point where nothing else is needed.
Finish reading the file one line at a time.
End.
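The logic above can be sketched as a small bash filter. This is only a rough illustration: the function name extract_section, its three arguments, and the sample text are made up for the example and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: print the lines from a start marker up to (but not
# including) a stop marker, dropping any line that contains the skip text.
extract_section() {
    local start="$1" stop="$2" skip="$3" printing=0 line
    while IFS= read -r line; do
        # Switch output on at the start marker and off at the stop marker
        if [[ $line == *"$start"* ]]; then printing=1; fi
        if [[ $printing -eq 1 && $line == *"$stop"* ]]; then printing=0; fi
        # Skip unwanted lines; strip [n] link markers from the ones we keep
        if [[ $printing -eq 1 && $line != *"$skip"* ]]; then
            printf '%s\n' "$line" | sed 's/\[[^]]*\]//g'
        fi
    done
}

# Made-up sample standing in for a dumped web page
sample='Site menu
THE LATEST
* [1]First headline
[IMG]
* [2]Second headline
Weather
Footer text'

printf '%s\n' "$sample" | extract_section "THE LATEST" "Weather" "IMG"
```

Because the markers are passed in as arguments, the same loop could be pointed at another site just by changing the three strings.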
Then we can add it to our report.sh homemade newspaper. For details see:
http://computoman.blogspot.com/2013/12/create-your-own-newspages.html
#--------------------------------------------------------
# cnn.sh
echo "<h3>CNN Headlines</h3>" >> report.html
# creates cnn.txt
./cnn.sh > cnn.txt
echo "<pre>" >> report.html
cat cnn.txt >> report.html
echo "</pre>" >> report.html
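Headlines can contain characters such as & or <, which have special meaning in HTML even inside a pre block. Here is a minimal sketch of one way to escape them before they go into the report. The html_section function and the sample headline are assumptions for illustration and are not part of the original report.sh.

```shell
#!/bin/bash
# Hypothetical helper (not in the original report.sh): print one report
# section with the file contents HTML-escaped. The title is printed as-is.
html_section() {
    local title="$1" file="$2"
    printf '<h3>%s</h3>\n<pre>\n' "$title"
    # Escape & first so the entities added for < and > are not re-escaped
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' "$file"
    printf '</pre>\n'
}

# Made-up sample headline standing in for cnn.txt
tmpfile="$(mktemp)"
printf '%s\n' '* Q&A: is 3 < 5?' > "$tmpfile"
html_section "CNN Headlines" "$tmpfile"
rm -f "$tmpfile"
```

In report.sh it might be called as html_section "CNN Headlines" cnn.txt >> report.html.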
[code]
#!/bin/bash
####################################
# CNN Headline Grabber
#
#===============================
# Assignments
# --------------------------------
datafile="rawcnn.txt"
flag=0
# end assignments
#=================================
#
# Get data file
#---------------------------------
elinks -dump "www.cnn.com" > "$datafile"
#=================================
#
# Extract and display data
#---------------------------------
# read trims leading and trailing blanks; -r keeps backslashes literal
while read -r line
do
    # header: start printing at "THE LATEST"
    # (no "clear" here, since the output is redirected to a file)
    if [[ $line == *"THE LATEST"* ]]; then
        flag=1
    fi
    if [ "$flag" -eq 1 ]; then
        if [[ $line == *"Weather"* ]]; then
            # stop printing once the weather section starts
            flag=0
        elif [[ $line != *"IMG"* ]]; then
            # skip image placeholders; strip [n] link markers from the rest
            # (quoting "$line" keeps the "*" bullets from expanding to file names)
            printf '%s\n' "$line" | sed 's/\[[^]]*\]//g'
        fi
    fi
done < "$datafile"
# footer
echo ---------------------------------------------
echo
#===================================
# End.
####################################
[/code]
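A text dump wraps long lines, so a headline can end up split over two lines. One possible clean-up step is sketched below. It assumes headlines start with "* " and that section titles contain no lowercase letters; any other line is glued onto the line before it. The join_wrapped function and the sample text are made up for the example.

```shell
#!/bin/bash
# Hypothetical post-processing step: join continuation lines onto the
# previous line. Assumes bullets start with "* " and section titles
# contain no lowercase letters.
join_wrapped() {
    local line buf="" have=0
    while IFS= read -r line; do
        if [[ $line == "* "* || $line != *[[:lower:]]* ]]; then
            # New bullet or section title: flush what we have buffered
            if [[ $have -eq 1 ]]; then printf '%s\n' "$buf"; fi
            buf="$line"; have=1
        else
            # Anything else is treated as the tail of the previous line
            if [[ $have -eq 1 ]]; then buf="$buf $line"; else buf="$line"; have=1; fi
        fi
    done
    if [[ $have -eq 1 ]]; then printf '%s\n' "$buf"; fi
}

# Made-up sample standing in for the script output
wrapped='THE LATEST
* A long headline that was
wrapped
* A short headline
OPINION'

printf '%s\n' "$wrapped" | join_wrapped
```

It could be run as ./cnn.sh | join_wrapped > cnn.txt. A non-bullet line that follows a section title would be joined to the title, so the rule may need adjusting for other pages.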
You might get results like this (ignore the rows of equals signs):
=========
THE LATEST
* Source: Joan Rivers' doc took selfie
* Freak accident kills hero bus
driver
* IOS 8 is live: How to get it
* Five iOS 8 features you'll love
* NEW Billionaire tells big named computer company: Innovate
* Obama stands firm: No ground troops
* Kerry heckled during testimony
* NEW Stocks hit record; thank Yellen
* NEW Panthers star takes leave
* Vikings: Peterson must stay away
* NEW Virus coming to a state near you
* Dowd inspires edible-pot campaign
* Wrongly convicted man gets a statue
* China blacks out CNN's report
* He mistakenly calls 911, then ...
* Surprise! Mendes, Gosling have baby
OPINION
...
...
===============