#archiveteam-bs 2016-05-26,Thu

Time Nickname Message
00:18 🔗 tomwsmf-a has joined #archiveteam-bs
00:57 🔗 ndiddy has joined #archiveteam-bs
00:57 🔗 JesseW has joined #archiveteam-bs
00:59 🔗 BlueMaxim has joined #archiveteam-bs
01:22 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
03:35 🔗 will has quit IRC (Ping timeout: 244 seconds)
03:35 🔗 will has joined #archiveteam-bs
03:45 🔗 JesseW has joined #archiveteam-bs
04:32 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:39 🔗 Sk1d has joined #archiveteam-bs
04:56 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
05:10 🔗 hook54321 Is this software worth buying?
05:10 🔗 hook54321 http://www.scrapebox.com
05:11 🔗 xmc no
05:14 🔗 hook54321 just curious, why don't you think it's worth buying?
05:14 🔗 murk SEO tool?
05:15 🔗 murk that's a spamming tool.
05:15 🔗 hook54321 how is it considered a spamming tool?
05:16 🔗 murk "comment poster"
05:16 🔗 hook54321 oh. didn't see that :P
05:17 🔗 hook54321 It could be useful for archiving though. there are some urls that we won't get to but google will because they are linked to from somewhere else on the internet.
05:18 🔗 murk if they're not linked to they're probably not worth grabbing, or are risky to grab.
05:19 🔗 hook54321 What do you mean by risky?
05:20 🔗 hook54321 Also, it's on the black hat forums. xD
05:20 🔗 hook54321 http://www.blackhatworld.com/seo/scrapebox-the-ultimate-serp-scraper-auto-blog-commenter-with-prstorm-mode.129096/
05:21 🔗 murk lots of sites leak contents to google that aren't meant to be public, have authentication strings in them, just things that generally shouldn't be in archives. if you're limited to just things you can reach on a public site you can at least use that as a defence if things blow up in your face.
05:21 🔗 murk yes, that's your cue to jet out of the situation as quickly as possible.
05:22 🔗 hook54321 If something has authentication strings in them then why would they be on google?
05:22 🔗 murk linked to from elsewhere, googlebot has an amazing ability to go places it shouldn't.
05:24 🔗 murk it would appear they harvest URLs from some sort of user input though I can't work out where. I've seen things in Google's index that most definitely weren't linked to from anywhere, yet there they are.
05:26 🔗 hook54321 Interesting. I wonder if they could potentially be using user data from chrome users...
05:27 🔗 hook54321 I still stand by what I said though.
05:28 🔗 hook54321 There is plenty of good stuff that is found by Google that we won't get to.
05:29 🔗 murk generally you won't have luck scraping google, I've managed to get soft-banned (captcha on every search) just by how methodical my search terms are.
05:30 🔗 hook54321 Also, if someone wants to archive pages or files that are under a certain subdirectory, they will most likely find that the site won't let them view the file listings for the directories... If they used inurl: and Google they could at least get some of it.
05:31 🔗 hook54321 I wonder why this software is getting praised then...
05:32 🔗 hook54321 To be honest, I would be fine with doing captchas, but I probably wouldn't use it very heavily...
05:34 🔗 hook54321 Is there an open source alternative to this that is user friendly?
05:34 🔗 xmc what are your goals?
05:35 🔗 hook54321 Get a list of URLs from the search results. It might be nice to have the ability to get them from tons of search engines at once.
05:39 🔗 hook54321 To be honest, I would be a lot less hesitant about buying it if there wasn't a 1 PC per license rule.
05:40 🔗 murk there's a very high probability that it's in itself malware anyway.
05:40 🔗 murk don't give money to people making blackhat tools.
05:41 🔗 Ducky_ there are a whole bunch of captcha solvers out there
05:43 🔗 hook54321 We could download the software and try to decompile it. It requires activation, but that wouldn't matter if we aren't actually using it.
05:43 🔗 hook54321 http://www.scrapebox.com/payment-received
05:46 🔗 hook54321 Ducky_: are there any open source captcha solvers?
05:47 🔗 Ducky_ there are mostly apis that use a combination of programs and humans. Can't think of any open source ones off the top of my head
05:49 🔗 Rotab has joined #archiveteam-bs
05:55 🔗 murk google will still very quickly hard-ban you if you are solving captchas.
05:56 🔗 murk bing has an API you can make a few thousand calls a month on, but their results are pretty poor.
05:58 🔗 Ducky_ wow, really, you can actually get banned from Google?
05:58 🔗 Ducky_ IP?
05:58 🔗 Ducky_ what if you did it at a uni
06:05 🔗 hook54321 Then I guess your uni would get banned.
06:11 🔗 hook54321 Or Google would contact your uni, or your uni would contact Google. And then your uni would take steps to stop you from doing it.
06:19 🔗 Start has quit IRC (Ping timeout: 260 seconds)
06:25 🔗 blahah has quit IRC (Quit: Connection closed for inactivity)
06:27 🔗 Start has joined #archiveteam-bs
06:29 🔗 mutoso has joined #archiveteam-bs
06:33 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
06:39 🔗 metalcamp has joined #archiveteam-bs
06:57 🔗 Start has quit IRC (Ping timeout: 260 seconds)
07:04 🔗 mutoso has joined #archiveteam-bs
07:05 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
07:06 🔗 Start has joined #archiveteam-bs
07:11 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:18 🔗 mutoso has joined #archiveteam-bs
07:27 🔗 mutoso_ has joined #archiveteam-bs
07:27 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
07:32 🔗 mutoso has quit IRC (Read error: Operation timed out)
07:34 🔗 HCross note to self, webarchiveplayer doesn't like opening million-URL warcs
07:40 🔗 mutoso has joined #archiveteam-bs
07:40 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
07:43 🔗 SketchCow A lot of the time, a valid response to "it breaks when I do this" is "don't do that"
07:51 🔗 mutoso has joined #archiveteam-bs
07:56 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:01 🔗 mutoso has joined #archiveteam-bs
08:01 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:07 🔗 mutoso has joined #archiveteam-bs
08:07 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:18 🔗 mutoso has joined #archiveteam-bs
08:31 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:33 🔗 midas i can work with that
08:37 🔗 mutoso has joined #archiveteam-bs
08:38 🔗 schbirid has joined #archiveteam-bs
08:39 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:44 🔗 mutoso has joined #archiveteam-bs
08:47 🔗 godane i'm at 706k items now
08:53 🔗 godane can anyone grab this: https://www.youtube.com/watch?v=61cZlkjIyaA
08:54 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
08:58 🔗 HCross Seems to be blocked in UK, France and Germany
09:00 🔗 murk Ducky_: yes, you will get a "we can't complete this request" message.
09:01 🔗 murk https://kb.nsd.org/assets/googleerror.jpg
09:30 🔗 HCross godane, got it. Give me a second
09:32 🔗 HCross godane, http://harrycross.ovh/downloads/JosephTheUnderground.mp4
09:34 🔗 mutoso has joined #archiveteam-bs
09:40 🔗 godane thanks
09:45 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
09:50 🔗 mutoso has joined #archiveteam-bs
09:50 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
10:03 🔗 mutoso has joined #archiveteam-bs
10:05 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
10:21 🔗 mutoso has joined #archiveteam-bs
10:26 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
10:27 🔗 godane i'm going after Free North Korea Radio mp3s
10:28 🔗 godane there are more of them than what is listed
10:36 🔗 mutoso has joined #archiveteam-bs
10:36 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
10:42 🔗 bwn has quit IRC (Read error: Operation timed out)
10:52 🔗 bwn has joined #archiveteam-bs
11:06 🔗 mutoso has joined #archiveteam-bs
11:11 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:22 🔗 mutoso has joined #archiveteam-bs
11:24 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:40 🔗 mutoso has joined #archiveteam-bs
11:42 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
11:47 🔗 mutoso has joined #archiveteam-bs
11:57 🔗 mutoso_ has joined #archiveteam-bs
11:57 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
11:59 🔗 mutoso has quit IRC (Read error: Operation timed out)
12:02 🔗 mutoso has joined #archiveteam-bs
12:02 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
12:30 🔗 mutoso has joined #archiveteam-bs
12:32 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
12:43 🔗 mutoso has joined #archiveteam-bs
12:46 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
13:07 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:12 🔗 mutoso has joined #archiveteam-bs
13:12 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
13:17 🔗 mutoso has joined #archiveteam-bs
13:18 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
13:35 🔗 mutoso has joined #archiveteam-bs
13:38 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
13:54 🔗 mutoso has joined #archiveteam-bs
13:56 🔗 Start has quit IRC (Quit: Disconnected.)
13:59 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:10 🔗 mutoso has joined #archiveteam-bs
14:25 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:35 🔗 mutoso has joined #archiveteam-bs
14:36 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
14:46 🔗 mutoso has joined #archiveteam-bs
14:48 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
15:06 🔗 mutoso has joined #archiveteam-bs
15:16 🔗 mutoso_ has joined #archiveteam-bs
15:16 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
15:21 🔗 mutoso has quit IRC (Read error: Operation timed out)
15:25 🔗 Start has joined #archiveteam-bs
15:26 🔗 mutoso has joined #archiveteam-bs
15:27 🔗 JesseW has joined #archiveteam-bs
15:38 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
15:43 🔗 mutoso has joined #archiveteam-bs
15:47 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
15:59 🔗 mutoso has joined #archiveteam-bs
16:00 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
16:04 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:06 🔗 mutoso has joined #archiveteam-bs
16:06 🔗 Start has quit IRC (Quit: Disconnected.)
16:14 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
16:20 🔗 mutoso has joined #archiveteam-bs
16:20 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
16:26 🔗 mutoso has joined #archiveteam-bs
16:27 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
16:32 🔗 mutoso has joined #archiveteam-bs
16:33 🔗 Start has joined #archiveteam-bs
16:36 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
16:41 🔗 mutoso has joined #archiveteam-bs
16:41 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
17:01 🔗 mutoso has joined #archiveteam-bs
17:02 🔗 JW_work has quit IRC (Read error: Operation timed out)
17:05 🔗 mutoso has quit IRC (Read error: Operation timed out)
17:06 🔗 tomwsmf-a has joined #archiveteam-bs
17:10 🔗 JW_work has joined #archiveteam-bs
17:15 🔗 mutoso has joined #archiveteam-bs
17:15 🔗 Sanqui >Page cannot be displayed due to robots.txt.
17:15 🔗 Sanqui *checks robots.txt*
17:15 🔗 Sanqui Internal Server Error
17:16 🔗 Sanqui seriously, tone it down with the robots.txt crap, archive.org :(
17:19 🔗 JW_work Sanqui: what URL?
17:19 🔗 Sanqui http://www.ma-mari.net/robots.txt
17:19 🔗 Sanqui but there's no need to check warcs for me, I archived this site myself a year ago
17:22 🔗 JW_work hm, that's probably worth sending to info@ — it really shouldn't be blocking based on robots.txt *not being present*
17:24 🔗 SimpBrain has joined #archiveteam-bs
17:25 🔗 SimpBrain has quit IRC (Quit: Leaving)
17:38 🔗 Start has quit IRC (Quit: Disconnected.)
17:43 🔗 mutoso has quit IRC (Read error: Operation timed out)
17:47 🔗 mutoso has joined #archiveteam-bs
17:51 🔗 zenguy has quit IRC (Read error: Operation timed out)
17:53 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
17:55 🔗 zenguy has joined #archiveteam-bs
17:59 🔗 mutoso has joined #archiveteam-bs
18:00 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
18:06 🔗 mutoso has joined #archiveteam-bs
18:16 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
18:29 🔗 mutoso has joined #archiveteam-bs
18:36 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
18:41 🔗 mutoso has joined #archiveteam-bs
18:43 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
18:48 🔗 mutoso has joined #archiveteam-bs
18:55 🔗 Simpbrain has joined #archiveteam-bs
18:56 🔗 mutoso_ has joined #archiveteam-bs
19:00 🔗 mutoso has quit IRC (Read error: Operation timed out)
19:02 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
19:18 🔗 mutoso has joined #archiveteam-bs
19:36 🔗 mutoso_ has joined #archiveteam-bs
19:38 🔗 mutoso has quit IRC (Read error: Operation timed out)
19:40 🔗 Start has joined #archiveteam-bs
19:40 🔗 robink has quit IRC (Ping timeout: 633 seconds)
19:41 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
19:44 🔗 Start has quit IRC (Client Quit)
19:53 🔗 robink has joined #archiveteam-bs
19:56 🔗 zenguy has quit IRC (Ping timeout: 246 seconds)
19:59 🔗 mutoso has joined #archiveteam-bs
20:06 🔗 schbirid has quit IRC (Quit: Leaving)
20:10 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
20:17 🔗 zenguy has joined #archiveteam-bs
20:23 🔗 mutoso has joined #archiveteam-bs
20:23 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
20:50 🔗 HCross https://twitter.com/EFF/status/735925111732076545 HELL YES
20:53 🔗 mutoso has joined #archiveteam-bs
20:53 🔗 arkiver HCross: awesome!
20:58 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
21:10 🔗 mutoso has joined #archiveteam-bs
21:11 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
21:16 🔗 mutoso has joined #archiveteam-bs
21:22 🔗 mutoso_ has joined #archiveteam-bs
21:23 🔗 mutoso_ has quit IRC (Read error: Connection reset by peer)
21:26 🔗 mutoso has quit IRC (Read error: Operation timed out)
21:53 🔗 mutoso has joined #archiveteam-bs
21:54 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
21:59 🔗 mutoso has joined #archiveteam-bs
22:00 🔗 zenguy has quit IRC (Read error: Operation timed out)
22:04 🔗 metalcamp has quit IRC (Read error: Connection reset by peer)
22:04 🔗 metalcamp has joined #archiveteam-bs
22:04 🔗 Rye has quit IRC (Ping timeout: 244 seconds)
22:04 🔗 mutoso has quit IRC (Read error: Connection reset by peer)
22:07 🔗 Rye has joined #archiveteam-bs
22:13 🔗 Start has joined #archiveteam-bs
22:26 🔗 mutoso has joined #archiveteam-bs
22:31 🔗 zenguy has joined #archiveteam-bs
22:35 🔗 VADemon has joined #archiveteam-bs
22:44 🔗 BlueMaxim has joined #archiveteam-bs
22:47 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:51 🔗 dashcloud has joined #archiveteam-bs
23:04 🔗 bsmith093 has quit IRC (Ping timeout: 250 seconds)
23:21 🔗 Ducky_ Sanqui: I know that feel
23:22 🔗 Ducky_ basically unless robots.txt is 200 or 404 it denies access
23:22 🔗 Ducky_ Been like that for me for maybe 6 months
23:22 🔗 Ducky_ I hope it doesn't delete old archives
23:22 🔗 JW_work interesting — I hadn't realized that was so new
23:23 🔗 Ducky_ I don't know if it's a bug or not, or if I'm just noticing it more
23:23 🔗 Ducky_ archive.is doesn't respect robots.txt
23:23 🔗 JW_work I'm pretty confident that IA's interpretation of robots.txt merely denies access, rather than permanently removing anything, as they will restore access if the most recent robots.txt permits it.
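The rule Ducky_ describes above (anything other than a 200 or 404 for robots.txt blocks playback) can be sketched roughly as follows. This is only an illustration of the behaviour reported in the channel, not archive.org's actual code. The function name `playback_allowed` and the `ia_archiver` user agent are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def playback_allowed(status, body, page_url, agent="ia_archiver"):
    """Guess whether a page would be shown, given the robots.txt response.

    status: HTTP status code returned for /robots.txt
    body:   text of robots.txt (only used when status is 200)
    agent:  user agent to check rules for (an assumption for this sketch)
    """
    if status == 404:
        # No robots.txt at all: nothing is disallowed.
        return True
    if status != 200:
        # Per the chat, any other status (e.g. a 500) is treated as a block.
        return False
    # A real robots.txt was served, so apply its rules.
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch(agent, page_url)
```

With Sanqui's case, a 500 Internal Server Error for robots.txt, `playback_allowed(500, "", "http://www.ma-mari.net/")` returns `False`, which matches the "Page cannot be displayed due to robots.txt" message.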
23:24 🔗 Ducky_ Anyone know what happened to longurl.org?
23:25 🔗 Ducky_ trying to find a url unshortener with https
23:30 🔗 JW_work why are you looking for a URL unshortener?
23:35 🔗 xmc try this shell recipe: curl -sI http://bit.ly/12345 | grep -i '^location:'
23:36 🔗 Ducky_ will that follow to the last location if there are multiple redirects?
23:37 🔗 xmc no
23:37 🔗 Ducky_ :(
23:37 🔗 xmc but you can do it again!
23:37 🔗 Ducky_ dammit my irc client I can't work out if that's an eye or an ell
23:37 🔗 Ducky_ -L is follow location
23:37 🔗 Ducky_ maybe -Eye is headers?
23:37 🔗 xmc curl -I as in big i, gets the headers
23:37 🔗 Ducky_ ty
23:37 🔗 xmc you could copy and paste it into your terminal
23:45 🔗 bsmith093 has joined #archiveteam-bs
23:48 🔗 JW_work actually dash big eye makes a HEAD request, which may or may not produce the same results as a more typical GET or POST request.
23:48 🔗 xmc thanks, cap'm actually
23:49 🔗 JW_work eh, it's a regular pain point in urlteam, which is why I felt obligated to stick my oar in
23:49 🔗 JW_work you can use dash little eye to get the headers and the response
23:50 🔗 Frogging [19:37:27] <Ducky_> dammit my irc client I can't work out if that's an eye or an ell
23:50 🔗 Frogging I hate fonts like that
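
The curl discussion above covers two points: a single `curl -I` only shows the first hop, and `-I` sends a HEAD request, which may not behave like a GET. A minimal sketch of following the whole chain hop by hop is below. The names `fetch_headers` and `unshorten`, and the ten-hop limit, are made up for this example. On the command line, something like `curl -sIL -o /dev/null -w '%{url_effective}\n' URL` should report the final URL, still using HEAD.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that carry a Location header pointing to the next hop.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_headers(url, method="HEAD", timeout=10):
    """Send one request without following redirects.

    Returns (status, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request(method, path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def unshorten(url, fetch=fetch_headers, max_hops=10):
    """Follow Location headers one hop at a time.

    Returns the list of every URL visited, ending with the final one.
    max_hops guards against redirect loops.
    """
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain
```

If a shortener answers HEAD differently from GET, as JW_work warns, the same loop can be run with GET by passing `fetch=lambda u: fetch_headers(u, method="GET")`.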
