2007
DOI: 10.1002/asi.20704

Extracting accurate and complete results from search engines: Case study Windows Live

Abstract: Although designed for general Web searching, commercial search engines are also used in Webometrics and related research to produce estimated hit counts or lists of URLs matching a query. Unfortunately, they do not return all matching URLs for a search, and their hit count estimates are unreliable. In this article, we assess whether it is possible to obtain complete lists of matching URLs from Windows Live, and whether any of its hit count estimates are robust. As part of this, we introduce two new method…

Cited by 58 publications (36 citation statements)
References 25 publications
“…These represent the three current major search engine families, and although combining their results would probably cover less than half of the Web (perhaps under 16% each), it still gives a significant amount of data. A main difficulty at this stage derives from the fact that for a given query, a search engine returns a maximum of 1,000 matches and automatically filters its results to avoid apparently redundant pages (Thelwall, 2008). The redundancy problem is particularly acute when searching for copies of a single joke because these copies are, by definition, large chunks of similar or identical text.…”
Section: Gathering URLs and Assessing the Web Presence of the Meme (mentioning)
confidence: 99%
“…Using this software, we were also able to automatically split the queries whose results exceeded the maximum of 1,000 hits permitted when using Yahoo! for this purpose (Thelwall, 2008b). In this way, we obtained up to 19,619 inlinks for each query.…”
Section: Methods (mentioning)
confidence: 99%
“…Instead of hit count estimates, complete lists of matching URLs can be obtained. This is an improvement because the hit count estimates can be unreliable (Thelwall, 2008; Uyar, 2009) and because additional information can be extracted from URL lists, as described below. Full URL lists can be extracted from the results pages manually but this can be time-consuming so the use of software like Webometric Analyst (http://lexiurl.wlv.ac.uk) is recommended to automate this although this program is only able to use Bing.…”
Section: Lists and Counts of Web Pages Citing the Documents (mentioning)
confidence: 99%
“…A problem arises if there are more than 1,000 results because search engines never return any results after the 1,000th. The "query splitting" technique has been designed to resolve this issue by automatically constructing new queries to retrieve additional results (Thelwall, 2008). This is available in Webometric Analyst.…”
Section: Lists and Counts of Web Pages Citing the Documents (mentioning)
confidence: 99%
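
The citation statements above describe query splitting only in outline: when a query hits the engine's result cap, new queries are constructed automatically to reach further matches. The Python sketch below illustrates that general idea under stated assumptions and is not the algorithm from Thelwall (2008) or the Webometric Analyst implementation. The toy engine `make_mock_engine`, the function `split_query`, the `split_words` list, and the cap of 3 are all hypothetical stand-ins; a real engine caps at about 1,000 results and would be reached through its own API.

```python
from typing import Callable, Dict, FrozenSet, List, Set

SearchFn = Callable[[FrozenSet[str], FrozenSet[str]], List[str]]


def make_mock_engine(pages: Dict[str, FrozenSet[str]], cap: int) -> SearchFn:
    """Build a toy search function that never returns more than `cap` URLs.

    `pages` maps each URL to the set of words it contains (hypothetical data).
    """
    def search(required: FrozenSet[str], excluded: FrozenSet[str]) -> List[str]:
        hits = [
            url
            for url, words in sorted(pages.items())
            if required <= words and not (excluded & words)
        ]
        return hits[:cap]  # results beyond the cap are silently dropped

    return search


def split_query(
    search: SearchFn,
    required: FrozenSet[str],
    excluded: FrozenSet[str],
    split_words: List[str],
    cap: int,
) -> Set[str]:
    """Collect URLs for a query, splitting it whenever the cap is reached.

    If a query returns `cap` results, it may have been truncated, so two
    narrower queries are issued: one that also requires the next split word
    and one that excludes it. Together they cover the original query.
    """
    results = search(required, excluded)
    found = set(results)
    if len(results) < cap or not split_words:
        return found  # not truncated, or no words left to split on
    word, rest = split_words[0], split_words[1:]
    found |= split_query(search, required | {word}, excluded, rest, cap)
    found |= split_query(search, required, excluded | {word}, rest, cap)
    return found
```

With eight toy pages that all contain "joke" and a cap of 3, a single query returns only 3 URLs, while `split_query` with split words `["a", "b", "c"]` recovers all eight. In practice the choice of split words matters: a word that appears in almost all or almost none of the matching pages barely reduces the size of the larger half.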