Using Iris Investigate Pivot Engine to Collect Bulk Screenshots

There are times when an analyst would like to examine and capture images of a website (or set of websites) that may be of interest.

This may be a matter of generally understanding “what a site is all about,” or there may be more specific goals such as confirming that a set of sites share a common “look-and-feel,” and thus are likely linked/centrally coordinated. 

When it comes to doing this, consider three basic approaches:

a) Manually Collecting Screenshots: At its most basic, an analyst may simply visit and manually screenshot each site of interest using a regular web browser and a simple screen grabbing tool, perhaps routing those visits through a proxy server. 

That approach, while straightforward, may be risky, slow, and ultimately counterproductive:

  • The investigator may stumble across a malware-infested site and the investigator’s own system may end up getting infected
  • Manually visiting a long list of sites “site-by-site-by-site…” tends to be a bit onerous
  • An investigative target may be situationally aware and notice the anomalous visits, and may suspect that they’re being made for investigatory purposes. When put “under the microscope” that way, some site operators may “run for cover” or adopt various countermeasures (such as requiring a particular referrer for access).

b) Automatically Collecting Screenshots: The collection of screenshots for multiple sites can be automated using a tool such as Pyppeteer. That automation can reduce the tedium of manually collecting pages, but most web designers aren’t focused on making “screenshot friendly” websites. For example, many web designers create websites that use scripting, animation, and other technologies that can be tricky to screenshot cleanly. We’ve previously documented numerous examples where Pyppeteer ran into issues while attempting to screenshot a set of college websites (see Appendix V of the “Bang_Question” report). This isn’t to say that it’s impossible to learn to automatically screenshot most websites, but is that really how you want to dedicate your time? Fortunately, there is a third option.

c) Using DomainTools Iris Investigate to Collect Screenshots: DomainTools Iris Investigate may be an easier, more powerful, and less-likely-to-be-noticed solution. The one notable limitation to be aware of is that Iris Investigate’s screenshot-taking is limited to the main registrable domain itself (e.g., it isn’t intended for taking screenshots of an arbitrary page buried deep in a domain’s page tree).

In this blog post, we’ll first talk about using Iris Investigate to manually collect screenshots one domain at a time. We’ll move on to show you how you can easily and efficiently queue hundreds of domains for bulk screenshots with Iris Investigate’s pivot engine.

Manual Domain-by-Domain Screenshots in Iris Investigate

If you’re looking at an individual registered domain in Iris Investigate, a screenshot may have previously been collected. When that’s the case, you’ll see the most recent screenshot as part of the default Iris Investigate display. For example:

Illustration 1. Sample Screenshot of a Previously Collected Screenshot of the University of Oregon's website

If your focus is primarily on screenshots (rather than all aspects of a domain name), you can tailor your interface by closing the two default panes and selecting “Screenshot History” from the menu at the bottom of the page. This gives screenshots a more prominent display. See the next two screenshots:

Illustration 2. Closing Currently-Unneeded Panels & Selecting Screenshot History from the Bottom Menu

Our interface now looks like Illustration 3, below. Now Iris Investigate is “all about screenshots:”

Illustration 3. Our Focus Is Now the Screenshot History Panel

Part of the power of the Iris Investigate screenshot capability is:

  • If it’s been some time since a screenshot was collected for a site, you can click “Queue Screenshot for Update” to request that a site’s screenshot be refreshed.
  • Alternatively, if you want to see what the site previously looked like, you can browse the historical screenshots that have already been collected (see the screenshot history panel on the left hand side of your screen). Click on one of the earlier historical screenshots if you’d like to inspect it more closely.

Once you’re done with the current site and you’re ready to move on to a new site, just enter the new site’s name in the search bar and hit the blue search button:

Illustration 4. Continuing Our Hypothetical Investigation with Another Sample Site

Taking Screenshots for Hundreds of Sites with Iris Investigate’s Pivot Engine

Let’s assume, for example, that we’re interested in “Native Sovereign Nation” (NSN) dot gov websites, a program announced 20 years ago.

For example, perhaps a history-of-the-Internet researcher is curious if those sites were popular, well-accepted, and are still used and relied on today; or have more flexibly-named sites in some other top level domain (TLD) largely supplanted the NSN dot gov program? Or are there differences by region of the country? We’re not going to do that study in this article, but we will show you how a researcher could collect screenshots for all the NSN dot gov websites.

We can get a list of all 8,149 current dot gov domains (including both NSN domains and other dot gov domains) by saying:

$ wget https://raw.githubusercontent.com/cisagov/dotgov-data/main/current-full.csv
$ wc -l current-full.csv
8150         <-- includes a "header" row that doesn't count.

Looking at the 2nd field in that file, we can see the breakdown by type of entity:

$ cut -d, -f2 < current-full.csv | sort | uniq -c | sort -nr
   3888 City
   1406 County
   1161 Federal - Executive
   1145 State
    219 Tribal
    166 Independent Intrastate
    117 Federal - Legislative
     25 Federal - Judicial
     22 Interstate
      1 Domain Type

For this current hypothetical study, our focus is on just the 219 Tribal NSN dot gov domains. We can extract those by saying:

$ grep "Tribal" current-full.csv | cut -d, -f1 > nsn-subset.txt
$ wc -l nsn-subset.txt
219
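
The `grep` above matches “Tribal” anywhere on a line, so it would also pick up a row whose domain name or some other field happened to contain that word. A minimal sketch of a stricter alternative follows, assuming (as the `cut` output above suggests) that the domain name is field 1 and the domain type is field 2, and that neither of those two fields contains an embedded comma. The function name `extract_by_type` is ours, purely for illustration:

```shell
# extract_by_type TYPE FILE
# Print field 1 (assumed: domain name) of every row whose field 2
# (assumed: domain type) is exactly TYPE. The header row is skipped.
extract_by_type() {
    awk -F, -v type="$1" 'NR > 1 && $2 == type { print $1 }' "$2"
}
```

For example, `extract_by_type Tribal current-full.csv > nsn-subset.txt` should yield the same list as the `grep` version whenever “Tribal” appears only in the domain type column.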

We’re now ready to run screenshots for that subset of dot gov sites. To do so, hit the “Advanced” button in Iris Investigate:

Illustration 5. Selecting Advanced Search

The Advanced Search Panel will then open. Select the “in” operator from the pull-down menu (as highlighted in the following screenshot). Ensure you’re searching for “Domains” in the left-hand pull-down menu, as we are here. Then, copy and paste our list of NSN dot gov domains where indicated. Hit the blue search button to find matching domains.

Illustration 6. Specifying Our Advanced Search

When that search finishes, select the Pivot Engine tab on the black bar at the bottom of the screen (highlighted in red below):

Illustration 7. Selecting the Pivot Engine Panel

You’ll then see a display like the following. Note the little checkbox on the blue menu bar (highlighted in red). Check that box. 

Checking the box will select ALL the domains shown in the table. (If there are any domains you DON’T want to screenshot, you can scroll down and uncheck those names as exceptions.)

Illustration 8. Selecting All Results in the Pivot Engine by Default

Once you’ve got the set of domains selected, a new bar will appear (highlighted in red in the following). Click “Queue Screenshots” (over on the right side of that bar) to proceed.

Illustration 9. New Processing Options Are Now Shown

When the domains have been queued for updated screenshots, you should see a confirmation as highlighted in red below:

Illustration 10. Confirmation That the Screenshots Have Been Queued for Processing

Screenshots will then be processed asynchronously. After a bit, screenshots for your sites will normally be ready to view.

However, if DomainTools tries to screenshot the specified site only to find that literally nothing has changed since the previous successful screenshot, “duplicate detection” kicks in. In that case, you’ll still see the previous screenshot (including the original date associated with that collection).

That doesn’t mean that your request wasn’t run! Rather, if you roll over the top entry in the screenshot history, you’ll see a popup explaining when the most recent screenshot was collected and checked, even if the main screenshot date shown hasn’t changed.

Illustration 11. Screenshot showing duplicate detection reporting in action

Acknowledgments

Thanks to Mr. Grant Cole, Principal Product Manager at DomainTools, for pointing out the power of the pivot engine for bulk screenshot work!

Appendix: FAQs

Q1) “Can I have Iris Investigate take a screenshot of a particular URL (such as a subordinate page on a site), rather than just the default top level page associated with a registrable domain?”

DomainTools only supports taking screenshots of the default top level page of registrable domains. If you supply more than that, your query will automatically be reduced to just a basic registrable domain.

Q2) “What’s the maximum number of registrable domains I can ask to have screenshotted in a single bulk query?”

The maximum number of registered domains that can be queued for bulk screenshotting in a single run is 1,000, but we generally recommend running batches of 500 domains or fewer.
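
If a list is longer than the recommended batch size, it can be cut into paste-ready chunks before it goes into Advanced Search. Here is a minimal sketch using the standard `split` utility. The function name `batch_domains` and the `.batch-` output suffix are illustrative choices of ours, not anything Iris Investigate requires:

```shell
# batch_domains FILE [SIZE]
# Split FILE (one domain per line) into FILE.batch-aa, FILE.batch-ab, ...
# with at most SIZE lines in each piece. SIZE defaults to 500.
batch_domains() {
    split -l "${2:-500}" "$1" "$1.batch-"
}
```

Running `batch_domains big-list.txt` on a hypothetical 1,200-line `big-list.txt` would write three files: two with 500 domains each and one with the remaining 200.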

Q3) “If I do a single bulk request for 500 domains, does that one request ‘cost’ one query from my account’s quota or will I get charged for each individual domain that’s queued to be screenshotted?”

Bulk queries get charged per bulk query, whether that bulk query is for one domain or a thousand (quite the bargain, eh?).

Q4) “Can I bulk download all the most-recent screenshots for a set of domains I’ve queued up, perhaps in the form of a compressed archive file?”

Unfortunately, no. You can right click on a screenshot you’re viewing to save a copy to your workstation, but only image-by-image.

Q5) “After queuing domains for screenshotting, the application says to check back in a ‘few hours.’ Will it really take that long for my screenshots to be run?”

Most sites you may want to screenshot will load fast, but it can take a minute for a complex and heavily loaded site to fully download and finish painting a requested page, and we can’t take a screenshot until the page has finished doing so. If we were to try to overly-rush that process, it’s easy to end up with a screenshot of a page that’s still in the process of getting “put together.”

The time required to pull a batch of screenshots can also depend on a variety of other factors, including how many other customers have also queued requests for screenshots. Updated screenshots MAY be available just a few minutes after you’ve queued them, but the “few hours” reference is meant as a worst-case “upper bound” for a particularly busy period with a particularly slow set of targeted pages.

Q6) “Is it possible to ask to screenshot a site only if a particular condition is met, such as the site having a risk score above a certain value, or a site having moved to a new IP address since the last time a screenshot was taken?”

You’d need to do that preprocessing yourself prior to submitting domains for screenshotting – that sort of preprocessing cannot be done automatically as part of the screenshot workflow at this time.
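
One way to do that preprocessing, sketched under loud assumptions: suppose you have already gathered the domains and their risk scores into a two-column CSV file (`domain,risk_score`, with a header row). That layout, the file itself, and the function name `filter_by_score` are all hypothetical. They are not an Iris Investigate export format:

```shell
# filter_by_score FILE MIN
# FILE is a HYPOTHETICAL two-column CSV: domain,risk_score (header row first).
# Print the domains whose score is numerically greater than or equal to MIN.
filter_by_score() {
    awk -F, -v min="$2" 'NR > 1 && $2 + 0 >= min + 0 { print $1 }' "$1"
}
```

The output is one domain per line, ready to paste into an Advanced Search “in” query as described earlier.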

Q7) “Some sites may return a different ‘version’ of their web page for ‘small screens’ (such as smartphones) vs ‘large screens’ (such as desktops). Can I change the sort of browser/device the screenshotting tool declares itself to be?”

Not at this time. Currently, the screenshotter presents as an 800×600 pixel device running Chrome under Linux with JavaScript support, so the screenshotter will get a “large screen” experience when visiting most sites for a screenshot.

Q8) “Some sites may return ‘localized’ or ‘internationalized’ versions of their pages for queries made from different parts of the world. Can I ask for queries to be made from a Spanish-speaking area, or from a Germanic-speaking area, for example?”

Currently DomainTools screenshots are taken from cloud provider IPs located in the United States only.

Q9) “Some sites may want visitors to click a box to accept GDPR policies, or to complete a Captcha in order to be allowed access. Will the DomainTools screenshot tool do that sort of thing?”

The DomainTools screenshotter won’t interact with Captchas or a site’s scripts/forms, no.

Q10) “Will all domains automatically be screenshotted as soon as they’re ‘noticed’?”

No. Many of the more popular domains will quickly end up being screenshotted by someone, but we do not automatically screenshot “all” domains by default.

Q11) “I’d like to get a screenshot taken of my domain. How can I ask for that to happen?”

DomainTools customers can use Iris Investigate to queue sites for screenshotting/rescreenshotting at their discretion. 

Q12) “I’d like to opt-out of having ANY screenshots of my site(s) taken by DomainTools. Do you have some mechanism to control that (perhaps looking for a robots.txt file)? Or perhaps I could get a publicly disclosed list of the IP addresses you screenshot from? If so, I can then just block access from those IPs myself…”

We do not currently offer a screenshot “opt-out” program – we suspect that if we were to offer one, some of the most enthusiastic users might be nefarious sites, ditto if we were to routinely disclose and update the specific IPs used for screenshotting. That said, sites that might like to discourage screenshot collection might consider requiring at least minimal “user interaction” for access (remember that our screenshot tool intentionally doesn’t interact with Captchas or scripts/forms).

Q13) “My content is intended for mature audiences only. If you screenshot it, will you ensure minors don’t get access to it?”

DomainTools-collected screenshots are non-public and we do not knowingly sell services to minors. We also only crawl top level (default) home pages, so if you have an interstitial page asking visitors to confirm their age, that will limit what gets collected.

Q14) “You happen to have taken some screenshots of our site from a period of time when we’d just been hacked or were otherwise having problems. Can we get those screenshots manually deleted? Or perhaps those screenshots eventually get automatically purged?”

DomainTools normally retains all the screenshots it collects, but if exceptional circumstances apply, get in touch with us and describe the problem you’re attempting to resolve. (And no, we don’t automatically age out any screenshots by default.)

Q15) “Are DomainTools screenshots collected in a way that makes them suitable for use in criminal prosecutions or civil litigation? For example, are screenshots ‘pristine’/‘un-tinkered with’, or do you modify them in some way, perhaps blurring some content or annotating the screenshots with a timestamp or watermark?”

Our screenshots are meant for investigatory purposes but are NOT intended for use as evidence in civil litigation or criminal prosecutions per se. For example, we don’t compute hashes for images we collect, nor do we maintain chain-of-custody logs for screenshots. Screenshots DO receive an innocuous watermark at the bottom as part of the collection process.

Q16) “What if I want to take a screenshot of a server running on a non-default port? For example, what if I want to screenshot http://www.asnt.org:8080/Test.html – would that be possible?”

No, that URL would be simplified/rewritten to just asnt.org.
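
If your starting material is a list of full URLs, you may prefer to trim them yourself so you can see what will be submitted. The sketch below only strips each URL down to its hostname. It does not attempt the further reduction to a registrable domain (www.asnt.org to asnt.org), since doing that correctly in general requires the Public Suffix List. The function name `url_host` is illustrative, and this is not a description of how Iris Investigate itself rewrites queries:

```shell
# url_host: read URLs on stdin (one per line) and print only the hostname,
# by removing the scheme, the path/query/fragment, any userinfo, and the port.
# Bracketed IPv6 literals are not handled.
url_host() {
    sed -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' \
        -e 's|[/?#].*$||' \
        -e 's|^[^@]*@||' \
        -e 's|:[0-9]*$||'
}
```

For instance, `echo 'http://www.asnt.org:8080/Test.html' | url_host` prints `www.asnt.org`.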

Q17) “Can I get a screenshot for the default web server running on a raw IP address that doesn’t require SNI, or do I always need to specify a domain name?”

You’ll always need to specify a domain name, and the domain name needs to be reachable over IPv4. (You may be able to discover relevant domains associated with an IP address of interest using DNSDB passive DNS.)

Q18) “I’ve got additional questions you haven’t addressed above – who can I ask about them?”