[00:18] *** tomwsmf-a has joined #archiveteam-bs [00:57] *** ndiddy has joined #archiveteam-bs [00:57] *** JesseW has joined #archiveteam-bs [00:59] *** BlueMaxim has joined #archiveteam-bs [01:22] *** JesseW has quit IRC (Ping timeout: 370 seconds) [03:35] *** will has quit IRC (Ping timeout: 244 seconds) [03:35] *** will has joined #archiveteam-bs [03:45] *** JesseW has joined #archiveteam-bs [04:32] *** Sk1d has quit IRC (Ping timeout: 250 seconds) [04:39] *** Sk1d has joined #archiveteam-bs [04:56] *** tomwsmf-a has quit IRC (Read error: Operation timed out) [05:10] Is this software worth buying? [05:10] http://www.scrapebox.com [05:11] no [05:14] just curious, why don't you think it's worth buying? [05:14] SEO tool? [05:15] that's a spamming tool. [05:15] how is it considered a spamming tool? [05:16] "comment poster" [05:16] oh. didn't see that :P [05:17] It could be useful for archiving though. there are some urls that we won't get to but google will because they are linked to from somewhere else on the internet. [05:18] if they're not linked to they're probably not worth grabbing, or are risky to grab. [05:19] What do you mean by risky? [05:20] Also, it's on the black hat forums. xD [05:20] http://www.blackhatworld.com/seo/scrapebox-the-ultimate-serp-scraper-auto-blog-commenter-with-prstorm-mode.129096/ [05:21] lots of sites leak contents to google that aren't meant to be public, have authentication strings in them, just things that generally shouldn't be in archives. if you're limited to just things you can reach on a public site you can at least use that as a defence if things blow up in your face. [05:21] yes, that's your cue to jet out of the situation as quickly as possible. [05:22] If something has authentication strings in them then why would they be on google? [05:22] linked to from elsewhere, googlebot has an amazing ability to go places it shouldn't. 
[05:24] it would appear they harvest URLs from some sort of user input though I can't work out where. I've seen things in Google's index that most definitely weren't linked to from anywhere, yet there they are. [05:26] Interesting. I wonder if they could potentially be using user data from chrome users... [05:27] I still stand by what I said though. [05:28] There is plenty of good stuff that is found by Google that we won't get to. [05:29] generally you won't have luck scraping google, I've managed to get soft-banned (captcha on every search) just by how methodical my search terms are. [05:30] Also, if someone wants to archive pages or files that are under a certain subdirectory, they will most likely find that the site won't let them view the file listings for the directories... If they used inurl: and Google they could at least get some of it. [05:31] I wonder why this software is getting praised then... [05:32] To be honest, I would be fine with doing captchas, but I probably wouldn't use it very heavily... [05:34] Is there an open source alternative to this that is user friendly? [05:34] what are your goals? [05:35] Get a list of URLs from the search results. It might be nice to have the ability to get them from tons of search engines at once. [05:39] To be honest, I would be a lot less hesitant about buying it if there wasn't a 1 PC per license rule. [05:40] there's a very high probability that it's in itself malware anyway. [05:40] don't give money to people making blackhat tools. [05:41] there are a whole bunch of captcha solvers out there [05:43] We could download the software and try to decompile it. It requires activation, but that wouldn't matter if we aren't actually using it. [05:43] http://www.scrapebox.com/payment-received [05:46] Ducky_: are there any open source captcha solvers? [05:47] there are mostly apis that use a combination of programs and humans. 
Can't think of any os off the top of my head [05:49] *** Rotab has joined #archiveteam-bs [05:55] google will still very quickly hard-ban you if you are solving captchas. [05:56] bing has an API you can make a few thousand calls a month on, but their results are pretty poor. [05:58] wow really you can get actually banned from Google? [05:58] IP? [05:58] what if you did it at a uni [06:05] Then I guess your uni would get banned. [06:11] Or either Google would contact your uni, or your uni would contact Google. And then your uni would take steps to stop you from doing it. [06:19] *** Start has quit IRC (Ping timeout: 260 seconds) [06:25] *** blahah has quit IRC (Quit: Connection closed for inactivity) [06:27] *** Start has joined #archiveteam-bs [06:29] *** mutoso has joined #archiveteam-bs [06:33] *** mutoso has quit IRC (Read error: Connection reset by peer) [06:39] *** metalcamp has joined #archiveteam-bs [06:57] *** Start has quit IRC (Ping timeout: 260 seconds) [07:04] *** mutoso has joined #archiveteam-bs [07:05] *** mutoso has quit IRC (Read error: Connection reset by peer) [07:06] *** Start has joined #archiveteam-bs [07:11] *** JesseW has quit IRC (Ping timeout: 370 seconds) [07:18] *** mutoso has joined #archiveteam-bs [07:27] *** mutoso_ has joined #archiveteam-bs [07:27] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [07:32] *** mutoso has quit IRC (Read error: Operation timed out) [07:34] note to self, webarchiveplayer doesn't like opening million URL warcs [07:40] *** mutoso has joined #archiveteam-bs [07:40] *** mutoso has quit IRC (Read error: Connection reset by peer) [07:43] A lot of valid responses to "it breaks when I do this" is "don't do that" [07:51] *** mutoso has joined #archiveteam-bs [07:56] *** mutoso has quit IRC (Read error: Connection reset by peer) [08:01] *** mutoso has joined #archiveteam-bs [08:01] *** mutoso has quit IRC (Read error: Connection reset by peer) [08:07] *** mutoso has joined #archiveteam-bs [08:07] 
*** mutoso has quit IRC (Read error: Connection reset by peer) [08:18] *** mutoso has joined #archiveteam-bs [08:31] *** mutoso has quit IRC (Read error: Connection reset by peer) [08:33] i can work with that [08:37] *** mutoso has joined #archiveteam-bs [08:38] *** schbirid has joined #archiveteam-bs [08:39] *** mutoso has quit IRC (Read error: Connection reset by peer) [08:44] *** mutoso has joined #archiveteam-bs [08:47] i'm at 706k items now [08:53] can anyone grab this: https://www.youtube.com/watch?v=61cZlkjIyaA [08:54] *** mutoso has quit IRC (Read error: Connection reset by peer) [08:58] Seems to be blocked in UK, France and Germany [09:00] Ducky_: yes, you will get a "we can't complete this request" message. [09:01] https://kb.nsd.org/assets/googleerror.jpg [09:30] godane, got it. Give me a second [09:32] godane, http://harrycross.ovh/downloads/JosephTheUnderground.mp4 [09:34] *** mutoso has joined #archiveteam-bs [09:40] thanks [09:45] *** mutoso has quit IRC (Read error: Connection reset by peer) [09:50] *** mutoso has joined #archiveteam-bs [09:50] *** mutoso has quit IRC (Read error: Connection reset by peer) [10:03] *** mutoso has joined #archiveteam-bs [10:05] *** mutoso has quit IRC (Read error: Connection reset by peer) [10:21] *** mutoso has joined #archiveteam-bs [10:26] *** mutoso has quit IRC (Read error: Connection reset by peer) [10:27] i'm going after Free North Korea Radio mp3s [10:28] there are more of them than what is listed [10:36] *** mutoso has joined #archiveteam-bs [10:36] *** mutoso has quit IRC (Read error: Connection reset by peer) [10:42] *** bwn has quit IRC (Read error: Operation timed out) [10:52] *** bwn has joined #archiveteam-bs [11:06] *** mutoso has joined #archiveteam-bs [11:11] *** mutoso has quit IRC (Read error: Connection reset by peer) [11:22] *** mutoso has joined #archiveteam-bs [11:24] *** mutoso has quit IRC (Read error: Connection reset by peer) [11:40] *** mutoso has joined #archiveteam-bs [11:42] *** mutoso 
has quit IRC (Read error: Connection reset by peer) [11:47] *** mutoso has joined #archiveteam-bs [11:57] *** mutoso_ has joined #archiveteam-bs [11:57] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [11:59] *** mutoso has quit IRC (Read error: Operation timed out) [12:02] *** mutoso has joined #archiveteam-bs [12:02] *** mutoso has quit IRC (Read error: Connection reset by peer) [12:30] *** mutoso has joined #archiveteam-bs [12:32] *** mutoso has quit IRC (Read error: Connection reset by peer) [12:43] *** mutoso has joined #archiveteam-bs [12:46] *** mutoso has quit IRC (Read error: Connection reset by peer) [13:07] *** BlueMaxim has quit IRC (Quit: Leaving) [13:12] *** mutoso has joined #archiveteam-bs [13:12] *** mutoso has quit IRC (Read error: Connection reset by peer) [13:17] *** mutoso has joined #archiveteam-bs [13:18] *** mutoso has quit IRC (Read error: Connection reset by peer) [13:35] *** mutoso has joined #archiveteam-bs [13:38] *** mutoso has quit IRC (Read error: Connection reset by peer) [13:54] *** mutoso has joined #archiveteam-bs [13:56] *** Start has quit IRC (Quit: Disconnected.) 
[13:59] *** mutoso has quit IRC (Read error: Connection reset by peer) [14:10] *** mutoso has joined #archiveteam-bs [14:25] *** mutoso has quit IRC (Read error: Connection reset by peer) [14:35] *** mutoso has joined #archiveteam-bs [14:36] *** mutoso has quit IRC (Read error: Connection reset by peer) [14:46] *** mutoso has joined #archiveteam-bs [14:48] *** mutoso has quit IRC (Read error: Connection reset by peer) [15:06] *** mutoso has joined #archiveteam-bs [15:16] *** mutoso_ has joined #archiveteam-bs [15:16] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [15:21] *** mutoso has quit IRC (Read error: Operation timed out) [15:25] *** Start has joined #archiveteam-bs [15:26] *** mutoso has joined #archiveteam-bs [15:27] *** JesseW has joined #archiveteam-bs [15:38] *** mutoso has quit IRC (Read error: Connection reset by peer) [15:43] *** mutoso has joined #archiveteam-bs [15:47] *** mutoso has quit IRC (Read error: Connection reset by peer) [15:59] *** mutoso has joined #archiveteam-bs [16:00] *** mutoso has quit IRC (Read error: Connection reset by peer) [16:04] *** JesseW has quit IRC (Ping timeout: 370 seconds) [16:06] *** mutoso has joined #archiveteam-bs [16:06] *** Start has quit IRC (Quit: Disconnected.) 
[16:14] *** mutoso has quit IRC (Read error: Connection reset by peer) [16:20] *** mutoso has joined #archiveteam-bs [16:20] *** mutoso has quit IRC (Read error: Connection reset by peer) [16:26] *** mutoso has joined #archiveteam-bs [16:27] *** mutoso has quit IRC (Read error: Connection reset by peer) [16:32] *** mutoso has joined #archiveteam-bs [16:33] *** Start has joined #archiveteam-bs [16:36] *** mutoso has quit IRC (Read error: Connection reset by peer) [16:41] *** mutoso has joined #archiveteam-bs [16:41] *** mutoso has quit IRC (Read error: Connection reset by peer) [17:01] *** mutoso has joined #archiveteam-bs [17:02] *** JW_work has quit IRC (Read error: Operation timed out) [17:05] *** mutoso has quit IRC (Read error: Operation timed out) [17:06] *** tomwsmf-a has joined #archiveteam-bs [17:10] *** JW_work has joined #archiveteam-bs [17:15] *** mutoso has joined #archiveteam-bs [17:15] >Page cannot be displayed due to robots.txt. [17:15] *checks robots.txt* [17:15] Internal Server Error [17:16] seriously, tone it down with the robots.txt crap, archive.org :( [17:19] Sanqui: what URL? [17:19] http://www.ma-mari.net/robots.txt [17:19] but there's no need to check warcs for me, I archived this site myself a year ago [17:22] hm, that's probably worth sending to info@ — it really shouldn't be blocking based on robots.txt *not being present* [17:24] *** SimpBrain has joined #archiveteam-bs [17:25] *** SimpBrain has quit IRC (Quit: Leaving) [17:38] *** Start has quit IRC (Quit: Disconnected.) 
[17:43] *** mutoso has quit IRC (Read error: Operation timed out) [17:47] *** mutoso has joined #archiveteam-bs [17:51] *** zenguy has quit IRC (Read error: Operation timed out) [17:53] *** mutoso has quit IRC (Read error: Connection reset by peer) [17:55] *** zenguy has joined #archiveteam-bs [17:59] *** mutoso has joined #archiveteam-bs [18:00] *** mutoso has quit IRC (Read error: Connection reset by peer) [18:06] *** mutoso has joined #archiveteam-bs [18:16] *** mutoso has quit IRC (Read error: Connection reset by peer) [18:29] *** mutoso has joined #archiveteam-bs [18:36] *** mutoso has quit IRC (Read error: Connection reset by peer) [18:41] *** mutoso has joined #archiveteam-bs [18:43] *** mutoso has quit IRC (Read error: Connection reset by peer) [18:48] *** mutoso has joined #archiveteam-bs [18:55] *** Simpbrain has joined #archiveteam-bs [18:56] *** mutoso_ has joined #archiveteam-bs [19:00] *** mutoso has quit IRC (Read error: Operation timed out) [19:02] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [19:18] *** mutoso has joined #archiveteam-bs [19:36] *** mutoso_ has joined #archiveteam-bs [19:38] *** mutoso has quit IRC (Read error: Operation timed out) [19:40] *** Start has joined #archiveteam-bs [19:40] *** robink has quit IRC (Ping timeout: 633 seconds) [19:41] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [19:44] *** Start has quit IRC (Client Quit) [19:53] *** robink has joined #archiveteam-bs [19:56] *** zenguy has quit IRC (Ping timeout: 246 seconds) [19:59] *** mutoso has joined #archiveteam-bs [20:06] *** schbirid has quit IRC (Quit: Leaving) [20:10] *** mutoso has quit IRC (Read error: Connection reset by peer) [20:17] *** zenguy has joined #archiveteam-bs [20:23] *** mutoso has joined #archiveteam-bs [20:23] *** mutoso has quit IRC (Read error: Connection reset by peer) [20:50] https://twitter.com/EFF/status/735925111732076545 HELL YES [20:53] *** mutoso has joined #archiveteam-bs [20:53] HCross: awesome! 
[20:58] *** mutoso has quit IRC (Read error: Connection reset by peer) [21:10] *** mutoso has joined #archiveteam-bs [21:11] *** mutoso has quit IRC (Read error: Connection reset by peer) [21:16] *** mutoso has joined #archiveteam-bs [21:22] *** mutoso_ has joined #archiveteam-bs [21:23] *** mutoso_ has quit IRC (Read error: Connection reset by peer) [21:26] *** mutoso has quit IRC (Read error: Operation timed out) [21:53] *** mutoso has joined #archiveteam-bs [21:54] *** mutoso has quit IRC (Read error: Connection reset by peer) [21:59] *** mutoso has joined #archiveteam-bs [22:00] *** zenguy has quit IRC (Read error: Operation timed out) [22:04] *** metalcamp has quit IRC (Read error: Connection reset by peer) [22:04] *** metalcamp has joined #archiveteam-bs [22:04] *** Rye has quit IRC (Ping timeout: 244 seconds) [22:04] *** mutoso has quit IRC (Read error: Connection reset by peer) [22:07] *** Rye has joined #archiveteam-bs [22:13] *** Start has joined #archiveteam-bs [22:26] *** mutoso has joined #archiveteam-bs [22:31] *** zenguy has joined #archiveteam-bs [22:35] *** VADemon has joined #archiveteam-bs [22:44] *** BlueMaxim has joined #archiveteam-bs [22:47] *** dashcloud has quit IRC (Read error: Operation timed out) [22:51] *** dashcloud has joined #archiveteam-bs [23:04] *** bsmith093 has quit IRC (Ping timeout: 250 seconds) [23:21] Sanqui: I know that feel [23:22] basically unless robots.txt is 200 or 404 it denies access [23:22] Been like that for me for maybe 6 months [23:22] I hope it doesn't delete old archives [23:22] interesting — I hadn't realized that was so new [23:23] I don't know if it's a bug or not, or if I'm just noticing it more [23:23] archive.is doesn't respect robots.txt [23:23] I'm pretty confident that IA's interpretation of robots.txt merely denies access, rather than permanently removing anything, as they will restore access if the most recent robots.txt permits it. [23:24] Anyone know what happened to longurl.org? 
[23:25] trying to find a url unshortener with https [23:30] why are you looking for a url unshortener? [23:35] try this shell recipe: curl -I http://bit.ly/12345 | grep 'Location:' [23:36] will that follow to the last location if there are multiple redirects? [23:37] no [23:37] :( [23:37] but you can do it again! [23:37] dammit my irc client I can't work out if that's an eye or an ell [23:37] -L is follow location [23:37] maybe -Eye is headers? [23:37] curl -I as in big i, gets the headers [23:37] ty [23:37] you could copy and paste it into your terminal [23:45] *** bsmith093 has joined #archiveteam-bs [23:48] actually dash big eye makes a HEAD request, which may or may not produce the same results as a more typical GET or POST request. [23:48] thanks, cap'm actually [23:49] eh, it's a regular pain point in urlteam, which is why I felt obligated to stick my oar in [23:49] you can use dash little eye to get the headers and the response [23:50] [19:37:27] dammit my irc client I can't work out if that's an eye or an ell [23:50] I hate fonts like that
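Editor's sketch, not something posted in the channel: the curl flags discussed above (-I for headers via HEAD, -L to follow redirects) can be combined so that a chain of redirects is followed in one go. The helper name last_location is hypothetical, and the sketch assumes each redirect header has the usual "Location: <url>" shape with a space after the colon. It may print a relative path if the server sends one.

```shell
#!/bin/sh
# last_location (hypothetical helper): read HTTP response headers on stdin
# and print the value of the last Location header seen, i.e. the final
# redirect target in a chain that curl -L followed.
last_location() {
  # Strip carriage returns (HTTP headers end in CRLF), then remember the
  # most recent Location value; header names are case-insensitive.
  tr -d '\r' | awk 'tolower($1) == "location:" { loc = $2 }
                    END { if (loc != "") print loc }'
}

# Usage (needs network access, so it is left commented out):
#   curl -sIL http://bit.ly/12345 | last_location
#
# As noted in the log, -I sends a HEAD request, which some servers answer
# differently from GET. A GET-based variant that still prints only headers:
#   curl -sL -o /dev/null -D - http://bit.ly/12345 | last_location
```

curl can also report the final URL itself through its --write-out variable url_effective, for example: curl -sIL -o /dev/null -w '%{url_effective}\n' http://bit.ly/12345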