142 comments
  • jadell6y

    It seems he's doing something with header detection. I used Puppeteer to play around with the site and various configurations I use when scraping.

    In headless Chrome, the "Accept-Language" header is not sent. In Puppeteer, one can force the header to be sent by doing:

      page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' })
    
    
    However, Puppeteer sends that header as lowercase:

      accept-language: en-US,en;q=0.9
    
    
    So it seems the detection is as simple as: if there is no 'Accept-Language' header (case-sensitive), then "Headless Chrome"; else, "Not Headless Chrome".

    This is a completely server-side check, which is why he can say the fpcollect client-side javascript library isn't involved.

    Here are some curl commands that demonstrate:

    Detected: not headless

      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
      -H 'Accept-Language: en-US,en;q=0.9'
    
    Detected: headless

      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36'
    
    Detected: headless

      curl 'https://arh.antoinevastel.com/bots/areyouheadless' \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36' \
      -H 'accept-language: en-US,en;q=0.9'
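
    For illustration, here is a minimal Node sketch of the kind of server-side check hypothesized above (my own guess at the logic, not the author's actual code):

      // Hypothetical: decide "headless" based on the presence of a
      // case-sensitively spelled Accept-Language request header.
      // Node lowercases names in req.headers, so use req.rawHeaders instead.
      const http = require('http');

      http.createServer((req, res) => {
        // rawHeaders is a flat [name, value, name, value, ...] array that
        // preserves the original header casing.
        const hasAcceptLanguage = req.rawHeaders.some(
          (h, i) => i % 2 === 0 && h === 'Accept-Language'
        );
        res.end(hasAcceptLanguage
          ? 'You are not Chrome headless'
          : 'You are Chrome headless');
      }).listen(8080);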
    • jadell6y

      As a followup, if you have the ability to modify the Chrome/Chromium command line arguments, using the following option completely fools the detection:

        --lang=en-US,en;q=0.9
      
      You can prove this with the following Puppeteer script:

        (async () => {
            const puppeteer = require('puppeteer');
            const browserOpts = {
                headless: true,
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                    '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3803.0 Safari/537.36',
                    // THIS IS THE KEY BIT!
                    '--lang=en-US,en;q=0.9',
                ],
            };
        
            const browser = await puppeteer.launch(browserOpts);
            const page = await browser.newPage();
            await page.goto('https://arh.antoinevastel.com/bots/areyouheadless');
            await page.screenshot({ path: 'areyouheadless.png' });
            await browser.close();
        })();
  • Pinbenterjamin6y

    I run the division at my company that builds crawlers for websites with public records. We scrape this information on-demand when a case is requested, and we handle an enormous volume of different sites (or sources as we call them). We recently passed 700 total custom scrapers.

    Recently, we have seen a spike in sites that detect and block our crawlers with some sort of Javascript we cannot identify. We use headless Chrome and Selenium to build out most of our integrations, and I'm starting to wonder if the science of blocking scraping is getting more popular...

    I don't think what I'm doing is subversive at all, we're running background checks on people, and we can reduce business costs by eliminating error-prone researchers with smart scrapers that run all day.

    I don't want to seem like the bad guy here, but what if I wanted to do the opposite of this research? Where do I start? Study the chromium source? Can anyone recommend a few papers?

    • curun1r6y

      > Where do I start? Study the chromium source?

      I'm curious why you'd jump straight to browser detection as the most likely culprit. When I was doing scraping, the far more common case was bot detection by origin and access patterns. It's just very difficult to make an automated scraper look like a residential or business user.

      Where do you run your scraping operation? Is it in AWS or some other hosting provider? That alone will get you blocked quickly by a lot of sites. Do you rate limit, including adding random jitter to mimic the way a human might use a browser?

      There are scraping services available that essentially use a network of browsers on residential connections with their extension installed to get around scraping detection. It's much slower, but it's much more reliable. We also had some success by signing up with a bunch of the VPN providers (PIA, NordVPN, ExpressVPN, etc.) and cycling through their servers frequently. Anything to avoid creating patterns that look automated or being tied to an IP that can be blacklisted. I'd start there before I'd worry about hacky javascript detection like in this story being what's tripping you up.
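
      As a rough illustration of the jitter and IP-rotation ideas (my own sketch; the proxy addresses and URLs below are placeholders), something like this in Puppeteer keeps requests from forming a regular, easily fingerprinted cadence:

        // Sketch: randomized delays between page loads, and a different
        // proxy per browser launch via Chromium's --proxy-server flag.
        const puppeteer = require('puppeteer');

        const proxies = ['http://198.51.100.10:3128', 'http://198.51.100.11:3128'];
        const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
        const jitter = () => 5000 + Math.random() * 15000; // 5-20s between pages

        (async () => {
          const urls = ['https://example.com/page1', 'https://example.com/page2'];
          for (const url of urls) {
            const proxy = proxies[Math.floor(Math.random() * proxies.length)];
            const browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
            const page = await browser.newPage();
            await page.goto(url);
            // ...extract whatever is needed here...
            await browser.close();
            await sleep(jitter());
          }
        })();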

      • Pinbenterjamin6y

        According to the NDA with my company I can't reveal anything about the architecture beyond the fact that it is hosted locally on a homebuilt distributed system that randomly chooses from a pool of 120 residential IPs.

        We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.

        Some sites are just so damn good at detecting us, and I just don't get it.

        • nutjob26y

          They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (ie their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.

          The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.

          • Pinbenterjamin6y

            That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.

            • UweSchmidt6y

              In the grand scheme of things all of this is a wasteful process. Maybe you could direct your worklife towards other challenges that are more rewarding for society and equally profitable?

              • pault6y

                I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.

              • dang6y

                That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

                https://news.ycombinator.com/newsguidelines.html

              • formercoder6y

                OP is being reasonably compensated for something that is perfectly legal.

        • arpa6y

          A pool of 120 residential IPs is way too small; at that scale, usage patterns emerge quickly. Go for thousands, or even better, hundreds of thousands. Outsource the residential proxy system to Luminati or Oxylabs.

        • underwater6y

          This sounds, at best, ethically dubious and at worst illegal. Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

          Given that you run this division, there is a good chance you are personally liable.

          • Pinbenterjamin6y

            We have an enormous legal team that communicates constantly with end points to ensure they are aware of our scraping. And as I said in another comment, we store no results other than what is already available to anyone else using the web.

            We've had this division for many many years, and before my time we paid another company to do this. There's no legal issues.

            • underwater6y

              Your legal team is in contact with them, but their security is actively trying to block you? That doesn't make sense.

              Computer security laws are very broad. It doesn't matter if it's just a website that the public can access. If you're accessing it in a manner that they don't want AND you're aware of that, then I struggle to see how your lawyers can justify it.

              > Computer hacking is broadly defined as intentionally accesses a computer without authorization or exceeds authorized access.

              https://definitions.uslegal.com/c/computer-hacking/

              Hiding your user agent because you know they don't want automated retrieval of information is "without authorisation".

          • Havoc6y

            >Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

            Don't think connecting a computer to a private network to suck up subscriber data is comparable to scraping publicly accessible internet content.

            • 3xblah6y

              These fear mongering comments always ignore the notice provision in the CFAA. Web scraping publicly accessible information is not "illegal" under the CFAA. That law, at most, only makes someone who continues scraping after being asked to stop potentially culpable.

              First, the accuser needs to, at least, send a cease and desist letter to the accused asking them to stop accessing the protected computer. Second, the accused needs to ignore that request and keep accessing the protected computer.

              Is it possible to build a solid CFAA case when those two things do not happen? I cannot find any examples.

              https://iapp.org/news/a/can-a-cease-and-desist-notice-create...

            • underwater6y

              My understanding of the case is that he was charged with evading JSTOR security, not for accessing the MIT network.

          • morpheuskafka6y

            Although his charges were ridiculous, they involved physically connecting to a secure network without permission, not just scraping the public part of pages from his own networks.

          • abcpassword6y

            He’s definitely personally liable. I was tasked with a similar thing at work and refused. Prison time isn’t worth the paycheck.

      • brlewis6y

        > rate limit, including adding random jitter to mimic the way a human might use a browser

        Even if you aren't trying to disguise anything, adding some randomness helps avoid one particular bad pattern with operations on a network. I recall the pattern being called "network synchronization" but I can't get good search results for that.

    • floatingatoll6y

      Reducing your business costs by scraping a public access website is often considered an alternative to paying the business costs of the website operator.

      Are you saving money at the expense of the site operator by scraping their site for public records, or are you saving money as well as the site operator?

      If you're costing them money to reduce your own bottom line without their express written consent, that makes you "the bad guy". Offsetting costs onto an unwitting, non-consenting third party is an unethical approach to doing business.

      I interpret your request as a similar problem to "help me with my homework problem". I could dig up papers and studies, but at the end of the day, you need to go do your homework. Reach out to each municipality and figure out a business arrangement with them that satisfies your needs. It's possible they do not wish you to perform this activity, in which case you will either need to violate their intent for your own profit using scraping or accede to their wishes and stop scraping their municipality. That's your homework as a for-profit business.

      • Pinbenterjamin6y

        I don't empathize with your viewpoint because, whether it's a web scraper, or a person, the work is exactly the same. There's no additional volume, or extra steps. We just emulate a worker.

        We measure the value in FTEs, and when a researcher quits, we do not replace them if the appropriate FTEs have been reached with projects.

        It's a major benefit to the business not only because we don't have to pay another employee, but we can reduce training costs, and costs incurred by mistakes. We can also adjust execution of one of these agents, which normally would require rearrangement of work instructions, and retraining.

        These are public records, 90% of them do not have integrations for automated systems, and those that do, we utilize. They are typically search boxes with results. We are not circumventing any type of cost that would otherwise be incurred.

        We do not log any of the results, store them locally, or maintain any of the PII with each search. If a case was searched 20 minutes ago, and comes up again, we rerun the entire thing just as a human would.

        Finally, to your point about 'help me with my homework', I consider posting on the HN forums homework for this type of research. There are a diverse set of talented developers on here with esoteric experience. The fact that an article related to the work I do came up on here, I thought, was an excellent opportunity to seek advice and perspective.

        • pault6y

          Don't be discouraged by the spiteful kneejerk reactions in this thread. HN is a diverse place and some commenters get triggered by an association with one of their pet peeves and launch into a rant without taking time to assess the nuance of your position. I've been the butt of this behavior a few times and it can be pretty toxic.

        • floatingatoll6y

          Sadly, you are correct to have realized that many posters on HN are so naive that they will offer you $0/hour consulting for your for-profit business. Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this, but I don't much respect it, either.

          • TeMPOraL6y

            What you call being naïve, I'd call being a good human being. Skilled professionals willing to freely share knowledge are a great thing. BTW, it's literally the foundation of our industry and the whole point of the Open Source movement.

            If it reduces market for some consultants, well, sucks to be them, they'd better find a different way of providing value. Not every value needs to be captured and priced. A world in which all value was captured and priced would really suck.

          • danShumway6y

            I'm glad that sites like Wikipedia, StackOverflow, and HN exist. I don't think the world is a worse place because they exist, and I respect the people who post there.

            This is the same attitude that says, "why would someone just give away Open Source software when they could build a SaaS business instead?"

            • floatingatoll6y

              I don’t think Stackoverflow for “how can I avoid paying a municipality a reasonable public records fee” should exist, but I do endorse Stackoverflow in general. You’ll have to do what you will with that; generalizing my point to “all Stackoverflow” is certainly wrong, though.

          • jimnotgym6y

            >Posting on the HN forums means you "don't have to pay another employee" that's an expert in the field. I can't do much to prevent this

            Sometimes the answer tells you much more about what skills you need to be hiring. Sometimes they give you a lead.

      • stickfigure6y

        Public records are public.

        The fact that some government organizations make it hard to retrieve public records is a flaw in the system. I'd be in favor of a national law requiring all public records to be published in machine-readable form.

        In the mean time, it is our civic responsibility to conspire to circumvent these misbehaving public services.

        • floatingatoll6y

          If such a national law were passed with funding guaranteed for open publication of records, I would endorse your point of view.

          No such funding exists, and municipalities are regularly denied tax increases by their voters for any reason — much less public records publication that would often embarrass and humiliate those same voters.

          So in essence you're asking them to cut public services and staffing in order to give hundreds of dollars of IT costs a month to for-profit businesses who can't be bothered to pay some small fraction of their revenue for the costs of delivering those records.

          It is our civic responsibility to republish those records for free as citizens. Doing so for profit at the expense of citizens is unethical.

          If OP republishes all records received in a freely-downloadable, unrestricted form, then I would happily help them fix their scrapers. They, of course, do not.

          • ijpoijpoihpiuoh6y

            Often what the municipalities are doing for public records is harder and more expensive than just publishing an API. So the funding excuse doesn't really pass muster with me.

            • floatingatoll6y

              Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

              The public records are public. Charging for them is, by the above arguments, immoral. Therefore, not only the municipalities but also the businesses profiting from those public records owe us their scraped data, for free, without regard for profit concerns.

              Not one for-profit business does so. Why is their immoral action acceptable, when the same action by a municipality is not?

              • stickfigure6y

                There's nothing immoral about charging for content that you've aggregated. People sell dictionaries.

                The problem here is that instead of building APIs (or just posting to FTP sites), governments are building offices and funding staff to answer snail mail requests. Or building sophisticated web forms and search engines.

                It's obvious how we got to this point (before the internet, you obtained public records by walking into an office) but it's long past time to change. We don't need fancy web forms to search and find data; cut all that out and just provide data in machine readable form to anyone who wants it.

                Someone will build a pretty commercial interface to public records data. Chances are, they can do it for less than the 8-figure sum required for UI development in the public sector. Win-win.

                • pyrale6y

                  It is not obvious to me that reducing the cost of consulting public data is necessarily a good thing. Just because this data is accessible does not mean it should always also be accessible inexpensively. For example: trial records should be public, but it would probably not be nice to have your entire judicial record displayed in people's glasses.

                  • cameronbrown6y

                    Disagree. It's inherently in the public interest to have access to this data as easily as possible. If it's too embarrassing then that's a cultural problem.

                • z3t46y

                  Some "public" records are in the gray area as in; should or should they not (black and white) be published. For example salaries, the employer might forbid disclosing salaries, but anyone can just request anyone's salary from the government because its public. But if they could be downloaded from an FTP ...

              • reaperducer6y

                Can you name a single for-profit public records scraper who republishes the parsed data scraped without charging for data access?

                Currently? Not off the top of my head. But there was one that scraped municipal records in a large midwest city and made them public for free because they were confusing to get to otherwise.

                Unfortunately, the company was bought by a larger company and that portion of what they did was shut down.

              • devxvda6y

                Loveland (now apparently called Landgrid). https://landgrid.com/

        • huhtenberg6y

          Public records are published based on certain demand assumptions.

          If a real-world demand for, say, some GIS data is hundreds of requests per day, then a crawler that comes in with hundreds of requests PER MINUTE will obviously stress the infrastructure. Adjusting infrastructure to cope is not an instant process, nor is it a sure thing to begin with given all the budgeting formalities. So your "civic duty" will ultimately result in destruction of these services, because they simply don't have the means to deal with such thoughtless activism.

          • ijpoijpoihpiuoh6y

            You've made an unfounded assumption -- that is, that the person you're responding to is scraping irresponsibly. If they are, as they say, simply replacing human researchers with the equivalent bots, then the net load change from automation is zero, or possibly even negative.

      • satyrnein6y

        Imagine if search engines had to "reach out to each [site owner] and figure out a business arrangement with them." The world decided that opt out via robots.txt was a better approach.

        If the municipality wants to get the information out, this could be a win-win, just like search engines were. Do check for robots.txt, though!

        • floatingatoll6y

          We found at one job that approximately one quarter of well-known search engines blatantly use robots.txt noindex declarations as a list of URLs to index, and one openly mocked us for asking them to stop.

          Voluntary honor systems don’t work, because there’s no way to compel non-compliers to stop other than standard “anti-attacker arms race” approaches, such as the obstacle described at the head of this thread.

          • jdc6y

            It sounds like scraping is a big problem for you guys. What kind of outfit is it, if you don't mind me asking?

            • floatingatoll6y

              Drop me an email and I’m happy to describe further.

        • samcal6y

          Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

          • vageli6y

            > Well, the search engines decided that robots.txt was the better approach for them. Which makes sense, since they want control over as much data as possible, that's their profit motive. The jury is still out on whether that's a long-term win-win social contract between search engine companies and the world.

            Are you really arguing that the internet would be _more_ accessible if search engines had to reach out to every site they wanted to crawl?

            How many companies out there complain about being scraped by Google? How many companies benefit from search-driven traffic?

            • gnud6y

              The alternative would have been opt-in instead of opt-out. Everything excluded by default, except what robots.txt allows you to index.

              Naturally, Google didn't want that.

        • klenwell6y

          I would assume that any site that was implementing JS-level blocks also has the appropriate robots.txt file in place.

          • greglindahl6y

            That's not true on the actual web, however.

            The best example is a large number of unimportant sites that send 429 errors for /robots.txt if they think it's a scraper. A 4xx result for robots.txt is considered to mean no robots.txt for most crawlers. So the website is getting the reverse of what it thought it was getting.
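
            To make the irony concrete, here's a sketch of the convention most crawlers follow (parseRobots, ALLOW_ALL, and DISALLOW_ALL are hypothetical placeholders for your own rule handling):

              // A 2xx robots.txt is parsed; a 4xx (including 429) is usually
              // treated as "no robots.txt", i.e. everything is allowed.
              async function fetchRobotsRules(origin) {
                const res = await fetch(`${origin}/robots.txt`);
                if (res.ok) return parseRobots(await res.text());
                if (res.status >= 400 && res.status < 500) return ALLOW_ALL;
                return DISALLOW_ALL; // many crawlers back off on 5xx errors
              }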

      • jdc6y

        Why privilege traffic based on its source (whether it's from a human or Selenium)? If some resources are expensive to serve, you can rate limit them.

        • crispyporkbites6y

          Because some information is more valuable than the sum of its parts.

    • jadell6y

      To the siblings wondering about reaching out to the sites and offering to pay for the data: I'm not parent poster, but where I work, we absolutely have reached out. We've even offered to build and maintain the systems/APIs/etc we'd need at our own expense in addition to paying for the data. None of the companies we've reached out to seem interested in providing easy access to their data.

    • ProCicero6y

      I'm very curious to know how you are able to get precise and accurate enough identification information from public websites to be able to credibly run a "background check" on someone. I used to work in the criminal justice system, and had unlimited access to every single criminal case initiated in my state going back for almost 40 years. It's difficult enough for a trained person to do it by hand, let alone automating it. How do you provide any guarantee of accuracy?

      • ilikehurdles6y

        My wife's SSN/credit history/online identities have in the past been mistakenly tied up with her sibling's. This has since been corrected with all the appropriate agencies and organizations.

        However, from what I've noticed of search results over time, these background check (AND identity verification) sites crawl each other and create a kind of feedback loop, as I've been noticing that some of these pages will falsely report parts of her sibling's background among her own, and falsely flag her as having certain ugly events in her past that don't actually belong to her. This is concerning, as her career area cares a lot about employees having a clean background, and employers using these cheap automated options see cheap, inaccurate results. She has a squeaky clean background with a high credit score and impressive educational credentials, while her sibling has had run-ins with the law and bad debts. I'm concerned about how this will affect her future career prospects.

        Beyond background checks, identity verification is a big concern as well. You may have noticed some services ask you to confirm certain facts about your past (street names of where you've lived, schools you attended, jobs and cars you've held). When pulling her credit bureau reports, some of these verifications required confirming facts about her sibling rather than her own in order to gain access.

        Like I said, these issues have been fixed with all the "official" record-keeping organizations; however, since the fix, I've been noticing increasing issues with the original mistakes propagating to 3rd-party background-check organizations.

        These services cause more problems than they solve, and should require consent, oversight, and civil or criminal penalties associated with a failure to meet high quality standards.

        • brlewis6y

          > These services cause more problems than they solve, and should require consent, oversight, and civil or criminal penalties associated with a failure to meet high quality standards.

          Existing law does not proscribe recklessly sharing damaging false information about people?

      • Pinbenterjamin6y

        There are a number of ways we do this.

        First, the process of automating a source is not as simple as 'grab data, send to person that creates the case'.

        We have many layers of precaution and validation, both by humans and by other automated systems, that help guarantee accuracy.

        On top of this, even public records have reporting rules in the industry. There are dates, specific charges, charge types (misdemeanor/felony), and a battery of other rules that the information is processed through in order to ensure we do not report information that we are not allowed to report.

        We always lean to the side of throwing a case to a human. In the circumstance that anything new, unrecognized, or even slightly off happens, we toss the case to a team that processes the information by hand. At that point, we are simply a scraper for information and we cut out the step of having a human order and retrieve results.

        We do not go back 40 years. Industry standards dictate that most records older than 7 years are expunged from Employment background checks. And most of our clients don't care about more than 3 years worth, with exceptions like Murder, Federal Crimes, and some obviously heinous things.

        We also run a number of other tests, outside of public records to provide full background data. We have integrations with major labs to schedule drug screens, we allow those who are having a background check run on them to fill out an application to provide reasoning and information from their point of view to allow customers to empathize with an employee.

        We also have a robust dispute system. The person having a background check run on them receives the report before the client requesting it in order to review the results and dispute anything they find wrong. These cases are always handled by a human, and often involve intensive research, no cost spared, to ensure the accuracy of the report.

        There's a plethora of other things I'm missing, but if you have any specific questions, I'm happy to answer.

        *EDIT

        To clarify, there is a lot of information in public records. It isn't unclear or ambiguous at all. Motor Vehicle and Court records are extremely in-depth and spare no detail.

      • heyoni6y

        I would imagine a live person audits the information collected by the scrapers, thereby eliminating the hassle of collecting it from multiple different sources.

        As a private person, we only have access to court documents on a state or county basis. Any central database we have access to would be made by scrapers.

    • slaymaker19076y

      I personally think that it may be ethically questionable to be making background checks easier. There is a reason why the right to be forgotten is becoming a thing in various jurisdictions, and lack of easy access to sensitive data is one countermeasure that tries to balance the need for public access to data against the right to privacy for individuals.

      • Pinbenterjamin6y

        I don't have a perspective on the ethics of easier background checks. We run employment checks; the ultimate decision of whether to hire ALWAYS falls to the customer. I've seen plenty of former criminals get hired. It's a workplace culture 'thing'.

        The right to be forgotten is alive and well most of the time; 90% of our clients don't look at information further back than a few years. I feel like that is a fair assessment of someone's behavior.

        • colechristensen6y

          "Just" providing the data doesn't absolve you of responsibility for the decisions others make using it.

          There is a point where data collection becomes unethical, and treating everything as fine as long as it's legal makes for a shitty society. (I.e., legislating behavior should be a last resort, not the first judgement of right and wrong.)

          I don't know precisely where that point is, but automated scraping of social media is probably past it (automated scraping of judicial records? probably OK).

          • Pinbenterjamin6y

            I still don't agree. The whole reason this business exists is to remove the cost from all the industries that need to run background checks.

            I think the extent and reason for the checks aren't apparent. So I'll give a few examples where we have high volume and I hope that will enlighten you as to the reason why there are so many players in the industry.

            The highest volume checks are around the medical and teaching fields. We often run 6-month, to one year recurring checks on teachers and doctors to ensure licenses and certifications are still active. As well as necessary immunizations to work in their environments.

            Do you expect a low margin industry like teaching to staff a full time employee to do nothing but run background checks? They want them done and the schools have access to the information, it's just much easier for them to pay us a few dollars an employee and get a nice report than do the legwork themselves.

            Additionally, incurring the cost of access for the relevant data is a barrier for companies without a bunch of cash laying around.

            We don't solicit companies with incriminating information about their employees; it's a necessary part of a safe environment.

            • colechristensen6y

              shrug not trying to imply it is all bad. The nature of the information is important. Professional certification or licensing checks are obviously harmless.

              What isn't harmless is gathering information about the private lives of people (even when done in the public eye) in ways that are difficult, labor intensive, or impossible without automation.

    • turtlebits6y

      The right thing to do would be to reach out to those sites and see if they have paid options for getting the data you need.

      • hannasanarion6y

        And what happens when they ignore you? I've reached out to tons of website operators to ask for machine-readable access to their data for academic, personal, and professional projects; I have never gotten a reply and have had to resort to scraping.

        • ohithereyou6y

          I can second this for public records websites.

          A previous company I worked for aggregated publicly recorded mortgage data. The mortgage data was scraped from municipal sites on a nightly basis because it was not available as a bulk download or purchasable option.

          We had requested on several occasions a service we could pay for in order to get a bulk download of this data, but the municipalities did not have the know-how to provide this, as they were using systems from a private vendor for which requesting modifications was prohibitively expensive. As a result, we worked hand in glove with the municipalities to ensure we were not stressing their infrastructure when we did this scraping, and I think that's the best we were able to do in this case.

      • Pinbenterjamin6y

        Well, when that option is available, as in the case of something like SAMBA WEB MVR, we absolutely opt for that instead, and pay our dues.

    • ok_coo6y

      If it suits your needs, please consider using Common Crawl instead.

      http://commoncrawl.org/

    • 3xblah6y

      If you are not doing anything subversive then can you share with us some examples of the sites that are selectively blocking you? And give us an example of the public information that is sought. Perhaps disclosing any of the sites puts you at disadvantage versus potential competitors? It does not make much sense to block access to public information, assuming you are not interfering with others' access.

    • bdcravens6y

      My understanding is that Selenium injects Javascript in the page, whether you're using a headless browser or not. The best bet would be to switch away from Selenium and write the code using something like Puppeteer.

      If you do want to stick with Selenium, you're better off studying the chromedriver source than Chromium itself.
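
      For context, the sort of injected artifacts sites are commonly reported to look for can be checked with a small client-side sketch like this (the exact "$cdc_" variable name varies by chromedriver build; this is not the article's code):

        // Rough sketch of common automation checks.
        function looksAutomated() {
          // Standard flag exposed by WebDriver-controlled browsers.
          if (navigator.webdriver) return true;
          // chromedriver has historically attached "$cdc_..."-style globals
          // to the document object.
          for (const key of Object.keys(window.document)) {
            if (key.startsWith('$cdc_') || key.startsWith('$wdc_')) return true;
          }
          return false;
        }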

    • Buttons8406y

      Have you tried running your browsers in virtual frame buffers? Do they still get detected?

    • pault6y

      Have you tried writing a chrome extension and running it in a desktop browser instance? It's super easy to set up and shouldn't appear any different than a regular user if you rate limit and add some randomness to the input events.

    • katzgrau6y

    I run an ad delivery platform (hey, we're both popular) and I detect and block bots because they tend to inadvertently drive up engagement counts on ad campaigns, creating a situation where publishers can't be confident in their numbers. Some clients have their own tech to do the same.

    • huhtenberg6y

      If you don't inspect and respect robots.txt, you shouldn't be surprised by sites actively blocking your crawlers. Ditto for when you try and work around crawling restrictions by hiding behind real browser UAs.

    • elorant6y

      Have you tried loading a full browser session? Not just headless.

      • paganel6y

        Not the OP, but I did that about 12 years ago, with Firefox. My boss at the time had asked me to parse some public institution website that was quite difficult to write a parser for directly in Python, so in the end we just decided to write a quick extension for Firefox and let an instance of it run on a spare computer. That public institution website had some JS bug that would cause FF to gobble up memory pretty fast, but we also solved that by automatically restarting FF at certain intervals (or when we noticed something was off).

        Not sure if people do this sort of thing nowadays.

        • pault6y

          When I'm doing personal scraping, I just write a chrome extension. You can find boilerplates that are super easy to set up, and they persist in a background thread between page loads. It's really easy to collect the data and log it in the console or send it to a local API or database. It's the lowest effort method of scraping I know, and you can monitor it while it runs to make sure it doesn't get hung up on some edge case.
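
          A minimal sketch of that setup (the selectors and the localhost endpoint are made up for illustration) is just a content script, declared under "content_scripts" in manifest.json, that posts page data to a local server:

            // content.js - runs on every matching page you browse.
            const rows = [...document.querySelectorAll('.listing')].map(el => ({
              title: el.querySelector('h2')?.textContent.trim(),
              url: el.querySelector('a')?.href,
            }));

            // Hypothetical local collector; could also just console.log(rows).
            fetch('http://localhost:3000/scraped', {
              method: 'POST',
              headers: { 'Content-Type': 'application/json' },
              body: JSON.stringify({ page: location.href, rows }),
            });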

        • elorant6y

          Sure we do. Through Selenium. You can either load a full browser session, or a headless one. But headless sessions are identifiable.

      • bdcravens6y

        Selenium injects predictable Javascript in both situations.

    • disiplus6y

    I run a scraper on a Craigslist-style marketplace for my country. They now have one of those commercial scraping protections, which I trivially escape by basically adding a random string to the URL. Figure out how they work and then create a workaround; I think most of them use one of those commercial solutions.

    I do my scraping just for myself. Maybe if I scaled it up they would detect me.

    • dnautics6y

    Putting my 'bad guy' hat on, I would think about automating via a Sikuli script if you had to (but only if you had to).

  • lol7686y

    There are additional tests included in https://arh.antoinevastel.com/javascripts/fpCollect.min.js that do not exist in the GitHub repository over at https://github.com/antoinevastel/fp-collect.

      redPill: function() {
          for (var e = performance.now(), n = 0, t = 0, r = [], o = performance.now(); o - e < 50; o = performance.now()) r.push(Math.floor(1e6 * Math.random())), r.pop(), n++;
          e = performance.now();
          for (var a = performance.now(); a - e < 50; a = performance.now()) localStorage.setItem("0", "constant string"), localStorage.removeItem("0"), t++;
          return 1e3 * Math.round(t / n)
        },
        redPill2: function() {
          function e(n, t) {
            return n < 1e-8 ? t : n < t ? e(t - Math.floor(t / n) * n, n) : n == t ? n : e(t, n)
          }
          for (var n = performance.now() / 1e3, t = performance.now() / 1e3 - n, r = 0; r < 10; r++) t = e(t, performance.now() / 1e3 - n);
          return Math.round(1 / t)
        },
        redPill3: function() {
          var e = void 0;
          try {
            for (var n = "", t = [Math.abs, Math.acos, Math.asin, Math.atanh, Math.cbrt, Math.exp, Math.random, Math.round, Math.sqrt, isFinite, isNaN, parseFloat, parseInt, JSON.parse], r = 0; r < t.length; r++) {
              var o = [],
                a = 0,
                i = performance.now(),
                c = 0,
                u = 0;
              if (void 0 !== t[r]) {
                for (c = 0; c < 1e3 && a < .6; c++) {
                  for (var d = performance.now(), s = 0; s < 4e3; s++) t[r](3.14);
                  var m = performance.now();
                  o.push(Math.round(1e3 * (m - d))), a = m - i
                }
                var l = o.sort();
                u = l[Math.floor(l.length / 2)] / 5
              }
              n = n + u + ","
            }
            e = n
          } catch (t) {
            e = "error"
          }
          return e
        }
      };
    • jadell6y

      It doesn't seem to be using the Javascript. Looking at the page source, it has already made the determination before the Javascript runs.

      If I load the page source in Chrome, it already includes the "You are not Chrome headless" message, but when I run it in a scraper I maintain, the page source loads with the "You are Chrome headless" message, even without running any Javascript.

    • Hitton6y

      So it's just measuring computation speed of math calculations? That doesn't sound very reliable.

  • ggreer6y

    https://arh.antoinevastel.com/javascripts/fpCollect.min.js contains some functions called redPill that aren't in the normal fpCollect library. redPill3 measures the time of some JS functions and sends that data to the backend. Here's a chart of redPill3's timing data on my computer: https://i.imgur.com/c8iuV6I.png

    Those are averages of multiple runs on a Core i7-8550U running Chromium 75.0.3770.90 on Ubuntu 19.04.

    isNaN and isFinite are much slower in headless mode, but other functions like parseFloat and parseInt aren't. My guess is that the backend is comparing the relative times that certain functions take. If isNaN and isFinite take the same time as parseFloat, then you're not in headless mode. If those functions take 6x longer than parseFloat, you're in headless mode.

    I don't know if this holds true for non x86 architectures or other platforms.
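
    If that guess is right, a detection sketch along those lines (my reconstruction, not the site's actual backend logic) might compare per-function medians like so:

      // Hedged reconstruction of the suspected timing-ratio check.
      function medianTime(fn, iterations = 4000, samples = 50) {
        const times = [];
        for (let s = 0; s < samples; s++) {
          const start = performance.now();
          for (let i = 0; i < iterations; i++) fn(3.14);
          times.push(performance.now() - start);
        }
        times.sort((a, b) => a - b);
        return times[Math.floor(times.length / 2)];
      }

      const ratio = medianTime(isNaN) / medianTime(parseFloat);
      // Hypothetical threshold: roughly equal medians look like normal Chrome;
      // a large gap (say, > 3x) looks like headless.
      const probablyHeadless = ratio > 3;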

    • TeMPOraL6y

      Huh. Your chart reminded me of an experiment I did 10 years ago to test if you could distinguish whether an image request was triggered by <img> tag, vs. user clicking on a link (or entering its URL in the address bar). I created a test page and asked people on the Internet to visit it, and then analyzed PHP & server logs.

      Unexpectedly, it turned out that the Accept header was perfect for this. The final chart was this:

      https://i.imgur.com/ZA8qD8t.png

      ("link" means clicking on an URL or entering it manually; "embedded" means <img> tag)

      Makes me wonder whether the Accept header is still useful for fingerprinting in general, and for distinguishing between headless and headful(?) browsers in particular.
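
      For reference, the server-side check that experiment suggests is tiny (a sketch only; exact Accept values vary by browser and version):

        // Navigations (clicked/typed URLs) advertise text/html in Accept;
        // <img> subresource requests advertise image/* instead.
        function classifyRequest(acceptHeader = '') {
          if (acceptHeader.includes('text/html')) return 'link';
          if (acceptHeader.startsWith('image/')) return 'embedded';
          return 'unknown';
        }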

  • itake6y

    This would be more interesting if the author explained this technique.

    People who are knowledgeable enough will dig into the webpage, but everyone else should expect disappointment.

    • zxcvbn40386y

      Agreed that the article is poorly written for the Hacker News crowd. It would be nice to have a description of his technique so the merits/faults could be analyzed without everyone conducting a reverse-engineering effort.

      It’s sad to have a smart guy like this dedicating his academic career to something this inconsequential. Anyone with enough incentive is going to be able to defeat any technique this guy dreams up. Anyone who might benefit on paper from detecting a headless browser isn’t going to want this because any possibility of a false positive is a missed impression, or a missed sales opportunity, or an ADA lawsuit (US), or an angry customer.

    • mcescalante6y

      I'm not sure if diving deep into the page will yield results of how it's done. The page's javascript does a POST to a backend with the browser's fingerprint, and the server does all the "magic" where we can't see it. Unless there is new fingerprint info that is being sent to the server that wasn't around before, I'm skeptical about the javascript in the page revealing the full technique.

      • shawnz6y

        He claims the fingerprint library's techniques aren't used for the check, though, so surely there must be an observable difference between the POST requests from headless and non-headless browsers.

        Edit: According to other commenters there are checks in the included version of the library which are not in the release version.

      • jadell6y

        The "You are/are not" message seems to be included in the page source before any Javascript runs. Is it possible there are detectable differences in the original HTTP request itself?

        • pbhjpbhj6y

          My guess is he's looking at XSS mitigations or similar that aren't in headless?

          If it were doing something like relying on CSS being non-blocking (I don't know that it is), that would be a server-side detection... but that would seem to work even against spoofing.

          But he says that if you spoof a non-Chromium browser (Safari) he can't tell. So he's looking first at the UA?? That's weird.

        • bsmith06y

          Yep, you got it. Check out the top comment on this thread.

      • nfRfqX5n6y

        That's the only way to do it these days... although the payload is not hashed or obfuscated in any way, so it would be extremely easy to fake if it's even being stored in a DB or memory somewhere; otherwise you can just copy the request exactly as is.

    • m0006y

      I don't think there's actually anything concrete to discuss. After reading the post, my feeling is that the author is probably fishing for HN geeks that would be intrigued to test his headless detection mechanism. Which would be borderline antisocial behaviour towards the HN community. This would better fit under the "Ask HN" label.

      Moreover, I'm not even sure how useful this community testing would be academically. Black-box testing is great stuff for CTF competitions. But any decent academic venue would dismiss systems that can't withstand white-box testing as security-through-obscurity.

    • 6y
      [deleted]
    • krlx6y

      Well, if I were him, I guess I would prefer to have an accepted paper about this new technique before releasing everything to the public.

    • 6y
      [deleted]
  • ryandrake6y

    Out of all the zero-sum tech arms races (increasingly complex DRM, SPAM senders/blockers, software crackers vs. copy protection, code obfuscation) this one seems to me to be the stupidest. Here we have people putting data out in public for free, for anyone to access, and then agonizing over how someone accesses it. If some data is your company's secret sauce, your competitive advantage, don't put it out on the Internet. If your data is not your competitive advantage, then why bother wasting all this development effort stopping browsers from browsing it? So much waste on both sides.

    • jadell6y

      I agonize about this every day, since a large part of my job is aggregating data from many sites that seem hell-bent on not letting anyone access it without going one-form-at-a-time through their crap UI.

      The thing is, we would gladly pay these companies for an API or even just a periodic data-dump of what we need. We've even offered to some of them to write and maintain the API for them. They're not interested, for various industry-specific reasons.

      I often wonder how much developer time and money are wasted in total between them blocking and devs working around their blocks.

      • lyxsus16y

        Sometimes when I'm thinking about it and what 95% of developers are working on, it feels like a planet-wide charity project against unemployment.

        • cameronbrown6y

          I think it's a fairly well-known thing that 'junk jobs' tend to spring up in response to supply. I think it's a bizarre cultural thing.

      • driverdan6y

        The travel industry is highly protective of its data. It's my understanding that they consider it proprietary and only sell it to those who they deem worthy.

    • black_puppydog6y

      Who says this is zero-sum? In the limit, it seems like a lose/lose situation to me. Or, at "best", modest rewards for [spammers|scrapers|...] at tremendous cost (spread out over the population) in lost usability, compute cycles, and development effort.

    • skybrian6y

      This is like saying, if you're going to give people free samples, why not give away the whole grocery store?

      Wanting to give out limited free samples inevitably leads to making sure you are giving out samples to people and not bots and not too much to each person, and that leads to user tracking.

      Compare with the arms race between newspapers and incognito mode:

      https://www.blog.google/outreach-initiatives/google-news-ini...

  • foob6y

    I'm the other half of the cat and mouse game that Antoine is referring to, and I just wrote another rebuttal that people here might find interesting [1]. It goes into a little more detail about what his test site is actually doing, and also walks through the process of writing a Puppeteer script to bypass the tests.

    [1] https://www.tenantbase.com/tech/blog/cat-and-mouse/

  • eastendguy6y

    All this can be avoided (from a scraper's perspective) by using the Selenium IDE++ project. It adds a command line to Chrome and Firefox to run scripts. See https://ui.vision/docs#cmd and https://ui.vision/docs/selenium-ide/web-scraping

    => Using Chrome directly is slower, but undetectable.

    • rivercam6y

      I have been using the UI Vision extension for a few months now. It is not very fast, but it always works. It can extract text and data from images and canvas elements, too.

  • fjp6y

    I work in telecom and we interface with large carriers like AT&T, Verizon, etc. We use headless browsers to automate processes using their 15-year-old admin portals, since the carriers simply refuse to provide an API, or one that works acceptably.

    Thankfully they're also so technologically slow that they never change the websites or do any kind of headless detection. It works, and allows us to offer automated [process] to our customers, but it seems so fragile. Just give us a damn API.

  • nprateem6y

    Well the user agent of chrome headless contains 'HeadlessChrome' according to this site [1]. Sure enough when I spoof my user agent to the first in the list it magically determines I'm using headless Chrome.

    He basically says he's inspecting user agents:

    > Under the hood, I only verify if browsers pretending to be Chromium-based are who they pretend to be. Thus, if your Chrome headless pretends to be Safari, I won’t catch it with my technique.

    Maybe I should apply for a PhD too.

    [1] https://user-agents.net/browsers/headless-chrome

    • jsnell6y

      If a browser claims to be Headless Chrome, you believe it. Nobody has a reason to lie about that. The interesting question is the opposite case: is somebody claiming to be a normal Chrome, but is actually Headless Chrome (or an automated member of some other browser family, or not a browser at all but e.g. a Python script).

      So if you take a Headless Chrome instance but change the User-Agent to match that of a normal Chrome, does the detector think it's not headless?

      • jadell6y

        The detector still thinks it's headless even if you spoof the user-agent.

  • born2discover6y

    The author seems to be making use of the fpscanner [1] and fp-collect [2] libraries for achieving his task, though he doesn't seem to explain how exactly the detection is done.

    [1]: https://github.com/antoinevastel/fpscanner

    [2]: https://github.com/antoinevastel/fp-collect

  • AznHisoka6y

    I think there might be a market for "human crawlers". Just like people use Mechanical Turk to get humans to beat CAPTCHAs, you could use it to get humans to visit a web page for you and return its HTML source. There are of course residential proxy services (e.g. HolaVPN), but they can still technically be detected.

    • TomMarius6y

      Why would you do that when you can automate it?

      • AznHisoka6y

        Because of the issues the article described: detection of headless crawlers/bots/etc

        • driverdan6y

          You can automate a regular browser. It doesn't have to be headless.

          • AznHisoka6y

            Unfortunately, there are some sites that can even detect regular automated browser sessions.

        • TomMarius6y

          You can still simulate mouse and keyboard.

  • bsmith06y

    This straight up crashes my scraper's browser, using Puppeteer, the extra-stealth plugin, etc.

  • sieabahlpark6y

    Wouldn't these render the Brave browser unusable on some sites?

    • rubbingalcohol6y

      Why would it do that? Brave is just a Chromium fork afaik.

  • CaliforniaKarl6y

    I’ve participated in a number of Stanford research studies, and what the author is doing here is similar to part of it.

    The studies in which I've participated always start with a statement of what they are generally looking for in a participant. You then take a survey that confirms if you are qualified. You are then given a release to sign (and keep a copy of), which states what you'll be doing and provides an IRB contact. You then go through the study.

    At the end of your participation, you are asked “What do you think the study is about?”, and then you were told the real purpose of the study. Eventually the paper(s) is/are published, with hypothesis, methodology, and results.

    This seems similar: You decide if you want to participate, and are participating; the only thing that’s missing is the final paper.