[00:00] *** Kitaru has quit IRC (Quit: This computer has gone to sleep)
[00:03] *** BlueMaxim has joined #archiveteam
[00:08] *** Kitaru has joined #archiveteam
[00:18] *** Kitaru has quit IRC (Quit: This computer has gone to sleep)
[00:25] What useragent does archivebot use?
[00:38] erm?
[00:38] why do you ask?
[00:38] each of the bots can use their own
[00:59] *** yipdw has quit IRC (Quit: No Ping reply in 180 seconds.)
[01:02] *** yipdw has joined #archiveteam
[01:03] *** JesseW has joined #archiveteam
[01:10] *** j08nY has quit IRC (Quit: Leaving)
[01:17] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[01:17] FalconK: I want to check what a site looks like with the UA before archiving it.
[01:18] :)
[01:18] the default is in the source code somewhere, over on github
[01:32] FalconK: Default/1 ?
[01:33] *** fie has joined #archiveteam
[01:35] *** Aranje has joined #archiveteam
[01:37] *** Kitaru has joined #archiveteam
[01:50] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[01:56] *** Aranje has joined #archiveteam
[02:29] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[02:32] Downloaded: 41395 files, 748G in 4d 5h 8m 25s (2.10 MB/s)
[02:32] WHEEEEEEEEEEEEEEEEEEEEEEEEEE
[02:34] such speed
[02:34] *** Frogging sets mode: +o yipdw
[02:48] *** philpem has quit IRC (Ping timeout: 260 seconds)
[02:49] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[03:12] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[03:15] *** BartoCH has joined #archiveteam
[03:23] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[03:24] *** dashcloud has quit IRC (Read error: Operation timed out)
[03:27] *** dashcloud has joined #archiveteam
[03:52] *** RichardG has joined #archiveteam
[04:05] *** JesseW has joined #archiveteam
[04:05] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:09] hook54321: https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/pipeline.py#L111-L114
[04:10] the "and not Mozilla ..." bit is there to satisfy checks for /Mozilla/ etc
[04:13] *** Sk1d has joined #archiveteam
[04:28] *** VADemon has joined #archiveteam
[04:33] *** Stiletto has quit IRC ()
[04:48] *** Kitaru has quit IRC (Quit: This computer has gone to sleep)
[04:49] *** DiscantX has joined #archiveteam
[05:06] *** Kitaru has joined #archiveteam
[05:10] *** Kitaru has quit IRC (Client Quit)
[05:17] *** dashcloud has quit IRC (Read error: Operation timed out)
[05:18] *** DiscantX has quit IRC (Read error: Operation timed out)
[05:20] *** dashcloud has joined #archiveteam
[05:32] *** Stiletto has joined #archiveteam
[05:33] *** robink has joined #archiveteam
[05:50] *** Deewiant_ has joined #archiveteam
[05:51] *** aschmitz_ has joined #archiveteam
[05:51] *** LordNigh2 has joined #archiveteam
[05:52] *** aschmitz has quit IRC (hub.se efnet.portlane.se)
[05:52] *** d_rebel has quit IRC (hub.se efnet.portlane.se)
[05:52] *** Lord_Nigh has quit IRC (hub.se efnet.portlane.se)
[05:52] *** wp494 has quit IRC (hub.se efnet.portlane.se)
[05:52] *** Gfy has quit IRC (hub.se efnet.portlane.se)
[05:52] *** Deewiant has quit IRC (hub.se efnet.portlane.se)
[05:52] *** thefinn93 has quit IRC (hub.se efnet.portlane.se)
[05:52] *** espes__ has quit IRC (hub.se efnet.portlane.se)
[05:52] *** xhdr has quit IRC (hub.se efnet.portlane.se)
[05:52] *** Fletcher_ has quit IRC (hub.se efnet.portlane.se)
[05:52] *** xhdr- has joined #archiveteam
[05:55] *** d_rebel has joined #archiveteam
[05:55] *** wp494 has joined #archiveteam
[05:55] *** thefinn93 has joined #archiveteam
[05:55] *** Fletcher_ has joined #archiveteam
[06:03] *** Gfy_ has joined #archiveteam
[06:07] *** LordNigh2 is now known as Lord_Nigh
[06:09] *** mutoso has quit IRC (Read error: Operation timed out)
[06:09] *** mutoso has joined #archiveteam
[06:15] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:16] *** Deewiant_ is now known as Deewiant
[06:17] *** espes__ has joined #archiveteam
[06:34] *** Start has quit IRC (Quit: Disconnected.)
[06:37] *** Start has joined #archiveteam
[07:03] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[07:13] *** bwn has quit IRC (Ping timeout: 244 seconds)
[07:19] *** bwn has joined #archiveteam
[07:27] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[07:30] *** Lord_Nigh has joined #archiveteam
[07:55] *** RichardG has quit IRC (Read error: Operation timed out)
[07:55] *** RichardG has joined #archiveteam
[08:14] *** DiscantX has joined #archiveteam
[08:28] *** redlob has quit IRC (Quit: ZNC - http://znc.in)
[08:33] *** redlob has joined #archiveteam
[08:33] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:33] *** dashcloud has joined #archiveteam
[09:01] *** Wuked has joined #archiveteam
[09:02] *** RichardG has quit IRC (Read error: Operation timed out)
[09:02] *** RichardG has joined #archiveteam
[09:14] *** WinterFox has joined #archiveteam
[09:18] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:23] *** dashcloud has joined #archiveteam
[09:26] *** DiscantX has quit IRC (Read error: Operation timed out)
[09:43] *** Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[09:58] *** Wuked has joined #archiveteam
[10:11] *** RichardG has quit IRC (Read error: Operation timed out)
[10:11] *** RichardG has joined #archiveteam
[10:31] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[10:33] since thomas is finished, can anyone set the default back to urlteam?
[10:57] *** zhongfu_ has joined #archiveteam
[10:58] *** RichardG has quit IRC (Read error: Operation timed out)
[10:58] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[10:58] *** RichardG has joined #archiveteam
[11:23] *** RichardG has quit IRC (Read error: Operation timed out)
[11:23] *** RichardG has joined #archiveteam
[11:26] *** zhongfu_ is now known as zhongfu
[11:32] Medowar: there are still items for Thomas?
[11:53] *** RichardG has quit IRC (Read error: Operation timed out)
[11:53] *** RichardG has joined #archiveteam
[12:13] Thomas is gone though
[12:31] *** j08nY has joined #archiveteam
[12:44] Ah has it finally shut?
[12:48] Yes. I forget the date but it was in the past week
[12:49] I was still getting results from it a day or so back
[12:49] Nevermind, I think we got a good chunk
[13:08] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:15] *** dashcloud has joined #archiveteam
[13:16] *** RichardG has quit IRC (Read error: Operation timed out)
[13:16] *** RichardG has joined #archiveteam
[13:44] *** RichardG has quit IRC (Read error: Operation timed out)
[13:44] *** RichardG has joined #archiveteam
[13:47] *** hellow has joined #archiveteam
[13:47] *** hellow is now known as bayesianp
[13:49] http://www.examiner.com/ is closing on July 10
[13:49] see: http://wikipediocracy.com/forum/viewtopic.php?f=21&t=7869
[13:50] Has anyone archived it yet?
[13:51] ^ robots.txt is quite restrictive but it has a /sitemap.xml
[13:51] It's being crawled right now bayesianp
[13:55] VADemon: I only see a HTML sitemap... where's the xml?
[13:58] Sorry, I should've copy-pasted the URL: http://www.examiner.com/sitemapindex.xml which leads to other pages containing the actual links
[14:01] Igloo: are you using the sitemap.xml for the crawl?
[14:01] It's in our archivebot added by SketchCow
[14:02] I don't know what it was seeded with but I don't think so
[14:03] code to make web archive based on day: curl -L -s http://www.examiner.com/html_sitemap/content/2010/01/01 | grep '^
One thing: If anyone wants to run a newsbuddy grabber, we really really could do with a few more now
[15:02] just let me know if you want to help, please
[15:02] HCross: I'll get back to you when I got some time
[15:03] ok. thanks
[15:03] maybe some time during the summer
[15:03] I can run one HCross
[15:03] Tell me what you need
[15:03] Igloo, #newsgrabber
[15:04] *** BlueMaxim has joined #archiveteam
[15:07] *** Kitaru has joined #archiveteam
[15:10] *** zxtx has left Leaving
[15:29] *** RichardG has quit IRC (Read error: Operation timed out)
[15:29] *** RichardG has joined #archiveteam
[15:30] *** JesseW has joined #archiveteam
[15:35] *** BlueMaxim has quit IRC (Quit: Leaving)
[15:38] *** Aranje has joined #archiveteam
[15:41] ==============
[15:41] 1. A bug in the archive uploader script meant our uploads weren't being derived/integrated. A fix is coming very shortly.
[15:42] *** metalcamp has joined #archiveteam
[15:42] 2. A bug with the ROBOTS.TXT being misread is fixed and stuff will be seen again
[15:42] ==============
[15:43] *** Kitaru_ has joined #archiveteam
[15:43] *** Kitaru has quit IRC (Ping timeout: 258 seconds)
[15:47] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[16:00] *** Start has quit IRC (Quit: Disconnected.)
[16:09] *** SilSte has quit IRC (Ping timeout: 194 seconds)
[16:16] *** Kitaru_ has quit IRC (Quit: This computer has gone to sleep)
[16:24] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:46] *** morbus_ has joined #archiveteam
[16:48] *** Morbus has quit IRC (Read error: Operation timed out)
[16:53] *** Kitaru has joined #archiveteam
[17:11] *** j08nY has quit IRC (Ping timeout: 633 seconds)
[17:19] I'm having problems uploading files to internet archive using the ia tool - my uploads are being disconnected after about 300MB or so of an upload.
[17:20] requests.exceptions.ConnectionError: ('Connection aborted.', error(32, 'Broken pipe'))
[17:20] uploading WEB-20160629153930591-00003-32710~db~8443.warc.gz: [################################] 411/494 - 00:00:30
[17:21] is it just a RST or is there other stuff happening to the connection also?
[17:22] It is reproducable - I've not had any files > 300MB succeed in the last 20-30 mins
[17:22] hm
[17:22] maybe you're hitting an angry s3 node
[17:23] possibly
[17:23] Turning on --debug doesn't really provide any interesting information
[17:23] tcpdump? :)
[17:29] Angry S3 Node is my spirit animal
[17:29] *** bauruine has quit IRC (Ping timeout: 260 seconds)
[17:31] tcpdump just says tons of RST's from IA.
[17:31] 35.197587 207.241.224.50 -> 10.0.0.13 TCP 54 http > 57043 [RST] Seq=959 Win=0 Len=0
[17:31] 35.197707 207.241.224.50 -> 10.0.0.13 TCP 54 http > 57043 [RST] Seq=959 Win=0 Len=0
[17:31] 35.197767 207.241.224.50 -> 10.0.0.13 TCP 54 http > 57043 [RST] Seq=959 Win=0 Len=0
[17:31] huuuh
[17:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:32] I've got a pretty fast connection - could I be overloading the S3 endpoint?
[17:32] maybe its disk is full?
[17:33] what's the dns name of the s3 endpoint again?
[17:33] yea, perhaps. Something along those lines maybe.
[17:33] s3.us.archive.org
[17:34] s3.us.archive.org is an alias for s3-lb1.us.archive.org.
[17:34] hm
[17:35] I'll just wait a while and try again later. If it's a full disk or something that's temporary, it'll resolve itself.
[17:35] swebb, watching newsbuddy and ill let you know if he has any issues
[17:35] ok
[17:35] I'm only trying to upload like 3.5 GB, so I can wait. :)
[17:35] hes pushed out over 1.2TB today fine
[17:38] *** philpem has joined #archiveteam
[17:39] *** Stilett0 has joined #archiveteam
[17:39] *** dashcloud has joined #archiveteam
[17:39] *** Stiletto has quit IRC (Read error: Operation timed out)
[17:40] HCross: Newsbuddy is hardcore. Does IA have the capcity for that?!
[17:43] Weird, but I used tc to throttle my connection to 1mbps and the transfer was killed after approx 30s. I then throttled to 2mbps and again, dropped after 30s.
[17:44] it also seems to slow down pretty dramatically just before the disconnect.
[17:46] Igloo, yes. Sometimes. It depends what part of the IA is on fire
[17:47] My crawl of gawker media is 122G compressed so-far.
[17:50] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[17:52] *** bauruine has joined #archiveteam
[17:54] *** Kitaru_ has joined #archiveteam
[17:56] *** Kitaru has quit IRC (Ping timeout: 258 seconds)
[18:19] *** RichardG has quit IRC (Read error: Operation timed out)
[18:19] *** RichardG has joined #archiveteam
[18:26] *** Chorca has joined #archiveteam
[18:43] *** RichardG has quit IRC (Read error: Operation timed out)
[18:43] *** RichardG has joined #archiveteam
[19:13] *** RichardG has quit IRC (Read error: Operation timed out)
[19:13] *** RichardG has joined #archiveteam
[19:14] *** DiscantX has joined #archiveteam
[19:19] https://techcrunch.com/2016/07/07/vyclone-hits-the-deadpool/
[19:40] *** RichardG has quit IRC (Read error: Operation timed out)
[19:40] *** RichardG has joined #archiveteam
[19:40] *** Gfy_ is now known as Gfy
[19:45] uploads are working again.
[19:45] (for me)
[19:46] Working for me, Currently pumping ~100mbit at IA
[19:47] Yea, me too. I'm not sure at the data rate though. I've got a gig fibre connection, so probably somewhere around 100mbit
[20:15] *** RichardG has quit IRC (Read error: Operation timed out)
[20:15] *** RichardG has joined #archiveteam
[20:49] *** DiscantX has quit IRC (Read error: Operation timed out)
[20:50] *** j08nY has joined #archiveteam
[20:57] *** Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[21:09] *** Aranje has joined #archiveteam
[21:17] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[21:33] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:36] *** dashcloud has joined #archiveteam
[21:57] *** Swizzle has joined #archiveteam
[21:59] *** DoomTay has joined #archiveteam
[22:01] *** maseck has quit IRC (Read error: Operation timed out)
[22:21] *** Start has joined #archiveteam
[22:43] *** JordanJ2 has quit IRC (ZNC - http://znc.in)
[23:00] *** Stilett0 has quit IRC (Read error: Operation timed out)
[23:09] *** RichardG has quit IRC (Read error: Operation timed out)
[23:09] *** RichardG has joined #archiveteam
[23:36] *** RichardG has quit IRC (Read error: Operation timed out)
[23:36] *** RichardG has joined #archiveteam
[23:46] *** j08nY has quit IRC (Remote host closed the connection)
[23:52] *** BlueMaxim has joined #archiveteam
[23:54] *** maseck has joined #archiveteam
[23:55] *** DoomTay has quit IRC (Ping timeout: 268 seconds)
[23:59] *** Kitaru_ has quit IRC (Quit: This computer has gone to sleep)
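
Note on the sitemap index mentioned at [13:58]: the log says sitemapindex.xml "leads to other pages containing the actual links". A minimal sketch of walking that two-level structure follows, assuming the files use the standard sitemaps.org XML format (the log does not confirm this for examiner.com). The function names and the example.com sample data are hypothetical, and fetching is left out so the sketch only covers the parsing step.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol. Assumption: the site's
# sitemap files declare this namespace, as most sitemap generators do.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml: str) -> List[str]:
    """Return the <loc> URL of each child sitemap listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def page_urls(urlset_xml: str) -> List[str]:
    """Return the <loc> URL of each page listed in an ordinary sitemap."""
    root = ET.fromstring(urlset_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real site.
    sample_index = (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>"
        "<sitemap><loc>http://example.com/sitemap-2.xml</loc></sitemap>"
        "</sitemapindex>"
    )
    for url in child_sitemaps(sample_index):
        print(url)
```

Each URL returned by child_sitemaps would be downloaded in turn and passed to page_urls, and the combined list could then serve as a seed list for a crawl.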