kshWeb ftpSpider
Web Crawler

 

This script performs FTP site crawling, or spidering: it goes to a specified URL and crawls through the FTP site, detecting child directories. The output of this program is a list of FTP directories in URL format.
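
The level-by-level crawl can be sketched on a local directory tree, which stands in here for the FTP site. This is only an illustration of the idea, not the script itself: the function name crawl_tree is hypothetical, directory names are assumed to contain no whitespace, and no visited list is kept because a plain directory tree has no cycles.

```shell
# Sketch only: visit a local directory tree one level at a time.
# crawl_tree is a hypothetical name; paths must not contain whitespace.
crawl_tree()
{
	WORK="${1}"			# directories to visit on this pass
	while [ -n "${WORK}" ]
	do
		NEXT=""			# directories found for the next pass
		for DIR in ${WORK}
		do
			echo "${DIR}"
			for CHILD in "${DIR}"/*/
			do
				# An unmatched glob stays literal and fails the -d test
				[ -d "${CHILD}" ] && NEXT="${NEXT} ${CHILD%/}"
			done
		done
		WORK="${NEXT}"
	done
}

# Build a small throwaway tree and crawl it
DEMO=`mktemp -d`
mkdir -p "${DEMO}/pub/tools" "${DEMO}/pub/docs" "${DEMO}/incoming"
crawl_tree "${DEMO}"
rm -rf "${DEMO}"
```

The sketch prints the top directory first, then its children, then their children, the same order in which the work list grows in the script.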

The character-based WWW browser Lynx is required to run this application and must be installed first.
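
A quick way to confirm that Lynx is on the search path before running the spider might look like the following. It is a sketch and not part of the original script; the helper name have_cmd is hypothetical, and a POSIX shell that supports "command -v" is assumed.

```shell
# Hypothetical pre-flight check; not part of ftpspider itself.
have_cmd()
{
	# Succeeds when the named command can be found by the shell
	command -v "${1}" > /dev/null 2>&1
}

if have_cmd lynx
then
	echo "lynx found: `command -v lynx`"
else
	echo "lynx not found; install it before running the spider" >&2
fi
```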

Cut and paste the following script into a file called "ftpspider.ksh" on your system, or click on ftpspider.ksh to download a file containing the script.

#!/bin/ksh
#############################################################
# Program:	ftpspider.sh
#               Copyright 1999
#
# Description:	This program accepts a single URL from the 
#		command line and then crawls or spiders the ftp
#		site to find all child directories associated
#		with it.  This program returns a list of
#		directories as standard output.
#
# Author:	Dana French (dfrench@mtxia.com)
#		 
#		(405) 936-2342
#
# Date:		03/22/99
#
#############################################################
# Modifications:
#
# 07/07/98	Version: 1.0
#		Original Code
################################################################
syntax()
{
echo ""
echo "No URL was specified"
echo ""
echo "Syntax:
	ftpspider.sh [-?][-v] \"ftp://some.domain.com/some/directory\"
		-v Verbose Mode
"
}
################################################################
spider()
{
while read PAGEURL PARENT
do
	if [[ ${VERBOSE} -eq 1 ]]
	then
		echo "crawling: ${PAGEURL}"
		echo "	parent: ${PARENT}"
	else
		echo "${PAGEURL}"
	fi

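	# Dump the listing with lynx, keep the lines marked "Directory",
	# take the text after the "[n]" link marker, and prefix it with
	# the URL of the page being crawled.
	VAR_LIST=`${CMD_LYNX} -dump "${PAGEURL}" \
	| grep -i "Directory" \
	| grep -iv "current Directory" \
	| cut -d"]" -f2 \
	| sed -e "s|^|${PAGEURL}/|g" \
	| sort -u`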

# 	echo "${VAR_LIST}"

	if [[ "_${VAR_LIST}" != "_" ]]
	then
		if [[ ${VERBOSE} -eq 1 ]]
		then
			echo "${VAR_LIST}" | sed -e "s/^/	child: /g"
		fi
		echo "${VAR_LIST}" | sed -e "s|$|	${PAGEURL}|g" >> ${TMPLIST}
	fi
	echo "${PAGEURL}	${PARENT}" >> ${RUNLIST}
done < ${WORKLIST}
}
################################################################

CMD_LYNX="lynx"
VERBOSE="0"

case "_${1}" in
	"_-?" ) syntax; exit;;
	"_-v" ) VERBOSE="1"; shift;;
esac

PAGEURL="${1}"
if [[ "_${PAGEURL}" = "_" ]]
then
	syntax
	exit 1
fi

SITE=`echo "${PAGEURL}" | cut -d"/" -f1-3`
TOPDIR=`echo "${PAGEURL}" | sed -e "s|${SITE}||g;s|^/||g;s|/$||g"`

URLTOP="${SITE}/${TOPDIR}"
if [[ "_${TOPDIR}" = "_" ]]
then
	URLTOP="${SITE}"
fi

RUNLIST="/tmp/runlist${$}.tmp"
WORKLIST="/tmp/worklist${$}.tmp"
TMPLIST="/tmp/tmplist${$}.tmp"
NEWLIST="/tmp/newlist${$}.tmp"
RUNTMP="/tmp/runtmp${$}.tmp"
WORKTMP="/tmp/worktmp${$}.tmp"
NEWTMP="/tmp/newtmp${$}.tmp"

rm -f ${RUNLIST}
rm -f ${WORKLIST}
rm -f ${TMPLIST}
rm -f ${NEWLIST}
rm -f ${RUNTMP}
rm -f ${WORKTMP}
rm -f ${NEWTMP}

echo "${URLTOP}" > ${WORKLIST}

LINES=`wc -l < ${WORKLIST}`
while [ ${LINES} -gt 0 ]
do
	spider
	# spider only creates TMPLIST when it finds a child directory
	touch ${TMPLIST}
	sort ${TMPLIST} | uniq > ${NEWLIST}
	cp ${RUNLIST} ${TMPLIST}
	sort ${TMPLIST} | uniq > ${RUNLIST}

	cut -d"	" -f1 < ${RUNLIST} | sort | uniq > ${RUNTMP}
	cut -d"	" -f1 < ${NEWLIST} | sort | uniq > ${NEWTMP}
	comm -13 ${RUNTMP} ${NEWTMP} > ${WORKTMP}
	rm -f ${WORKLIST}
	touch ${WORKLIST}
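	while read LINK
	do
		# The trailing tab anchors the match to the whole URL field,
		# so ".../pub" does not also select ".../pub2".
		grep -i "^${LINK}	" ${NEWLIST} >> ${WORKLIST}
	done < ${WORKTMP}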
	rm -f ${RUNTMP}
	rm -f ${NEWTMP}
	rm -f ${WORKTMP}

	rm -f ${TMPLIST}
	rm -f ${NEWLIST}
	LINES=`wc -l < ${WORKLIST}`
done

rm -f ${RUNLIST}
rm -f ${WORKLIST}
rm -f ${TMPLIST}
rm -f ${NEWLIST}
################################################################
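
The step that decides what is left to crawl relies on "comm -13", which prints only the lines found in the second file and not in the first. The sketch below isolates that step. The helper name uncrawled and the ftp.example.com URLs are hypothetical, and both input files are assumed to be sorted, as comm requires.

```shell
# Sketch of the "what is left to crawl" step from the main loop.
# uncrawled is a hypothetical helper; both inputs must be sorted.
uncrawled()
{
	# ${1} = URLs already crawled, ${2} = URLs just discovered
	comm -13 "${1}" "${2}"
}

DONE=`mktemp`
FOUND=`mktemp`
printf '%s\n' "ftp://ftp.example.com" \
	"ftp://ftp.example.com/pub" | sort -u > "${DONE}"
printf '%s\n' "ftp://ftp.example.com/pub" \
	"ftp://ftp.example.com/pub/tools" | sort -u > "${FOUND}"

# Prints only the URL that has been found but not yet crawled
uncrawled "${DONE}" "${FOUND}"
rm -f "${DONE}" "${FOUND}"
```

When this difference is empty, the work list has no lines and the main loop ends.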

For information regarding this page, contact Dana French (dfrench@mtxia.com)