[00:18] *** DoomTay has joined #archiveteam [00:19] *** Stiletto has joined #archiveteam [00:19] *** tomwsmf-a has joined #archiveteam [00:24] *** DiscantX has joined #archiveteam [00:30] *** JesseW has joined #archiveteam [00:57] *** VADemon has quit IRC (Quit: left4dead) [00:57] *** DiscantX has quit IRC (Read error: Operation timed out) [01:12] *** JesseW has quit IRC (Ping timeout: 370 seconds) [01:18] *** ats has quit IRC (Ping timeout: 244 seconds) [01:23] *** Stiletto has quit IRC (Ping timeout: 244 seconds) [01:24] *** Coderjoe has quit IRC (Read error: Operation timed out) [01:28] *** Coderjoe has joined #archiveteam [01:34] *** Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~) [01:36] *** philpem has quit IRC (Ping timeout: 260 seconds) [02:01] *** coretx has quit IRC (Read error: Operation timed out) [02:02] *** RichardG has quit IRC (Read error: Operation timed out) [02:02] *** RichardG has joined #archiveteam [02:04] *** coretx has joined #archiveteam [02:05] *** JesseW has joined #archiveteam [02:10] *** tomwsmf-a has quit IRC (Read error: Operation timed out) [02:18] *** Stiletto has joined #archiveteam [02:36] *** ats has joined #archiveteam [02:45] *** RichardG has quit IRC (Read error: Operation timed out) [02:45] *** RichardG has joined #archiveteam [02:46] *** ats has quit IRC (Read error: Operation timed out) [02:52] *** ats has joined #archiveteam [03:09] *** RichardG has quit IRC (Read error: Operation timed out) [03:09] *** RichardG has joined #archiveteam [03:31] *** Froggypwn has joined #archiveteam [03:33] *** Coderjoe has quit IRC (Read error: Operation timed out) [03:48] *** Kitaru_ has joined #archiveteam [03:52] *** RichardG has quit IRC (Ping timeout: 370 seconds) [03:54] *** Swizzle has quit IRC (Quit: Leaving) [03:57] *** RichardG has joined #archiveteam [04:01] *** Coderjoe has joined #archiveteam [04:05] *** Sk1d has quit IRC (Ping timeout: 194 seconds) [04:08] *** RichardG has quit IRC (Ping timeout: 260 seconds) 
[04:11] *** Sk1d has joined #archiveteam [04:12] *** RichardG has joined #archiveteam [04:13] *** Kitaru_ has quit IRC (Quit: This computer has gone to sleep) [04:25] *** db48x` has quit IRC (Read error: Connection reset by peer) [04:30] *** GLaDOS has quit IRC (Ping timeout: 260 seconds) [04:31] *** db48x has joined #archiveteam [05:05] *** metalcamp has joined #archiveteam [05:12] *** Trevor has joined #archiveteam [05:13] *** Trevor has quit IRC (Client Quit) [05:31] *** Aranje has quit IRC (Quit: Three sheets to the wind) [06:13] *** dashcloud has quit IRC (Read error: Operation timed out) [06:16] *** dashcloud has joined #archiveteam [06:54] *** BlueMaxim has quit IRC (Quit: Leaving) [07:01] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) [07:07] Delightfully weird history: https://motherboard.vice.com/read/the-secret-nuclear-history-of-cat-videos [07:13] *** JesseW has quit IRC (Ping timeout: 370 seconds) [07:27] *** RichardG has joined #archiveteam [07:55] *** DoomTay has quit IRC (Quit: Page closed) [08:16] *** WinterFox has joined #archiveteam [08:25] *** Gfy has quit IRC (Read error: Operation timed out) [08:32] *** Gfy has joined #archiveteam [08:50] *** Wuked has joined #archiveteam [08:50] *** DiscantX has joined #archiveteam [08:57] *** zhongfu_ has joined #archiveteam [08:57] *** zhongfu has quit IRC (Ping timeout: 260 seconds) [09:04] *** zhongfu_ has quit IRC (Ping timeout: 260 seconds) [09:04] *** DiscantX has quit IRC (Read error: Operation timed out) [09:05] *** zhongfu has joined #archiveteam [09:12] *** DiscantX has joined #archiveteam [09:26] *** BlueMaxim has joined #archiveteam [09:39] *** atomotic has joined #archiveteam [09:51] *** zhongfu has quit IRC (Remote host closed the connection) [09:58] *** Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) [10:00] *** Wuked has joined #archiveteam [10:07] *** Wuked has quit IRC (Quit: My Mac has gone to sleep. 
ZZZzzz…) [10:14] *** zhongfu has joined #archiveteam [10:17] *** Wuked has joined #archiveteam [10:20] *** Wuked has quit IRC (Client Quit) [10:22] *** Wuked has joined #archiveteam [10:24] *** Wuked has quit IRC (Client Quit) [10:30] *** Wuked has joined #archiveteam [10:32] *** zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.) [10:32] *** Wuked has quit IRC (Client Quit) [10:32] *** GLaDOS has joined #archiveteam [10:34] *** zhongfu has joined #archiveteam [10:45] *** Wuked has joined #archiveteam [10:47] *** Wuked has quit IRC (Client Quit) [10:57] *** morbus_ has quit IRC (Read error: Operation timed out) [11:09] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [11:16] *** Wuked has joined #archiveteam [12:05] *** BlueMaxim has quit IRC (Quit: Leaving) [12:10] *** BlueMaxim has joined #archiveteam [12:23] *** DiscantX has quit IRC (Read error: Operation timed out) [13:38] *** BlueMaxim has quit IRC (Quit: Leaving) [13:48] *** VADemon has joined #archiveteam [13:58] *** Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) [14:10] *** Wuked has joined #archiveteam [14:17] *** Start has quit IRC (Quit: Disconnected.) [14:59] *** WinterFox has quit IRC (Read error: Operation timed out) [15:05] *** philpem has joined #archiveteam [15:06] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [15:07] *** BartoCH has joined #archiveteam [15:12] *** Wuked has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [15:21] *** r3c0d3x has quit IRC (Ping timeout: 260 seconds) [15:23] *** r3c0d3x has joined #archiveteam [15:25] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [15:25] *** BartoCH has joined #archiveteam [15:26] *** BartoCH has quit IRC (Client Quit) [15:27] *** BartoCH has joined #archiveteam [15:54] *** Start has joined #archiveteam [15:59] *** Start has quit IRC (Quit: Disconnected.) 
[16:03] *** ploop has quit IRC (Ping timeout: 244 seconds) [16:04] *** ploop has joined #archiveteam [16:15] *** DoomTay has joined #archiveteam [16:18] *** JesseW has joined #archiveteam [16:33] JesseW: JW_work: I'll have a look at examiner.com [16:34] not sure how much we could get in 2 days [16:34] Isn't ArchiveBot already doing it, even if it's probably not going off of the sitemaps? [16:34] DoomTay: it's slow [16:35] ArchiveBot is one host crawling a site. Warrior projects are many hosts downloading a site in a more organized item-based fashion [16:36] We are now saving around 1 TB of news with NewsBuddy every day! [16:37] however in the latter case, the speed is still limited because without rate limiting that's just about the same thing as a DDoS. [16:37] still faster and more efficient generally though. but necessitates writing scripts that are aware of the site structure [16:40] *** JesseW has quit IRC (Ping timeout: 370 seconds) [16:42] !a http://www.postnewspapers.com.au/ [16:42] i put it in archivebot [16:48] *** dashcloud has quit IRC (Ping timeout: 244 seconds) [16:49] *** dashcloud has joined #archiveteam [16:56] Frogging: #archivebot is *much* slower than nearly any source site, even with a delay of zero, due to the administrative back and forth with the central server. [17:01] yes, there's that too [17:48] VideoBot is recording livestream again. [17:48] VideoBot can be found at #videobot [17:49] It is especially handy to record and save Twitter videos directly to the Internet Archive as video item [17:49] vine and periscope are also supported. periscope needs a fix though [17:50] Coming up soon for VideoBot is following a twitter hashtag and downloading all twitter and/or periscope videos using that hashtag and uploading those videos to IA as video items. 
[17:50] All videos are also saved into the Wayback Machine [18:45] *** Start has joined #archiveteam [19:08] niiice [19:36] *** dashcloud has quit IRC (Ping timeout: 244 seconds) [19:37] *** VADemon has quit IRC (Quit: left4dead) [19:39] *** DiscantX has joined #archiveteam [19:40] *** dashcloud has joined #archiveteam [19:46] *** Start has quit IRC (Quit: Disconnected.) [19:47] *** Start has joined #archiveteam [19:52] *** Start has quit IRC (Quit: Disconnected.) [20:07] *** DiscantX has quit IRC (Read error: Operation timed out) [20:08] *** mutoso has quit IRC (Quit: leaving) [20:18] *** mutoso has joined #archiveteam [20:37] *** dxrt has quit IRC (Read error: Operation timed out) [20:38] *** jspiros has quit IRC (Read error: Operation timed out) [20:41] *** dxrt has joined #archiveteam [20:45] *** Kitaru has joined #archiveteam [21:12] wait what wait what back and forth is there for archivebot in the middle of a crawl? [21:13] what I see is more like the site being crawled throttles large amounts of requests from a single IP. [21:13] and the deduplication lookups take a significant amount of cycles, which can't be parallelized because of python [21:15] FalconK: it's because it has to synchronously (!) contact the controller for every request [21:16] it probably doesn't have to do that [21:16] I bet we could make that better [21:16] it's managing the crawl locally, so what it exchanges with the controller can't be more than status updates [21:17] the code is unfortunately complicated though [21:17] I think I remember someone, probably Asparagir, saying that the only thing stopping rollout of any big updates is the high number of long-term jobs underway [21:18] hmmmmmmmmmmmmmmmmmm [21:18] complex. 
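Note on the controller round-trip discussed above: the suggestion is that a crawler managing its crawl locally only needs to send status updates, so those updates need not block each request. The following is a minimal sketch of that idea, not ArchiveBot's actual code. `StatusReporter`, `send_batch`, and `max_batch` are hypothetical names introduced for illustration, and the real controller protocol is not described in this log.

```python
import queue
import threading


class StatusReporter:
    """Buffers status updates and sends them from a background thread.

    Hypothetical helper: `send_batch` stands in for whatever call would
    actually deliver updates to the controller.
    """

    def __init__(self, send_batch, max_batch=100):
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def report(self, update):
        """Called from the crawl loop; only enqueues, so it returns at once."""
        self._queue.put(update)

    def close(self):
        """Flush everything still queued, then stop the background thread."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._thread.join()

    def _run(self):
        while True:
            item = self._queue.get()  # block until there is work
            if item is None:
                return
            batch = [item]
            stop = False
            # Drain whatever else is already waiting, up to max_batch.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self._send_batch(batch)
            if stop:
                return
```

With this shape the crawl loop would call `reporter.report(...)` after each request and `reporter.close()` at the end of the job. The trade-off is that the controller sees updates slightly late and in batches rather than one per request.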
[21:18] that is a much bigger problem than just serviceability [21:18] so a couple days ago the PDU on ananiel died [21:18] I am told there was a great popping, and crackling, and smoke issued [21:19] and with it, about 14 long-running jobs [21:19] now they are started from the beginning [21:19] on the bright side, ananiel was updated before taking on new work [21:20] but if you set a stopfile, you can expect it to halt sometime between a few hours and 4 months from then (and it can't service small jobs while waiting for the long jobs to go away) [21:21] initially, one would suppose that the trick is to make it into a distributed workload [21:21] *** bzc6p has joined #archiveteam [21:21] *** swebb sets mode: +o bzc6p [21:21] * FalconK ! [21:21] bzc6p! you're the one with vt.idiota.hu, yes? [21:22] um, yes [21:22] *** robink has quit IRC (Ping timeout: 633 seconds) [21:22] why? [21:22] im still up for running a pipeline [21:22] I'm trying to figure out why you use it for url lists, and not like pastebin [21:22] and then remove the A record for it after the job starts but before it is completed [21:22] I think he lost the job and can't remember the exact URL [21:23] there have been a few of those jobs [21:23] and sometimes the thing running them dies a painful death due to power failures, the epoll_wait bug in wpull, or something else [21:23] and then the next pipeline that takes up the job just tries infinitely to pull the URL list, and fails because the A record is gone and there's nothing there anymore [21:24] and the job takes up a job slot and keeps reporting infinitely [21:24] I'm not sure the A record disappearing is his fault. Right now the site as a whole can't be reached. [21:24] when I get one of those jobs and notice this condition, the only way to fix it is to add a hosts entry for vt.idiota.hu so it goes to like google.com, so it can "finish" the job [21:24] yeah [21:25] ... wait, is it URL lists, or were we trying to archive the site? 
[21:25] because it looked like URL lists [21:25] Stop. [21:25] So. [21:26] The list resides on my computer, I start a webserver and announce my IP to a dyndns site. [21:26] I wait until it finishes and then shut down my server. [21:26] *** nwf_ has quit IRC (Read error: Operation timed out) [21:26] It happened twice out of like 14 cases that there was an outage meanwhile and the task got stuck. [21:27] Let me remark that it is a bug in the ArchiveBot software, which was also admitted by yipdw. [21:27] I do it this way because I don't want to use third parties unless necessary. [21:27] yes, you're triggering a bug [21:27] * FalconK nods [21:28] But if you think it is a considerable problem that sometimes that happens, I may consider doing it in a way that this bug is very less likely to happen. [21:28] it's probably worth fixing the bug, since that is not the only case that can trigger it, but [21:28] *** nwf_ has joined #archiveteam [21:28] erstwhile, I know I've had to clear it on my pipeline ~5 times, and when it happens on a pipeline that is less watched, it takes quite a while [21:29] on the other hand, the next time you bring the host up, if it stays that way for longer than the DNS failure is cached, that pipeline will expunge the job [21:29] but, this method also risks keeping the DNS failure cached on the same pipeline that takes up your new job [21:29] in which case both jobs will be stuck [21:30] * FalconK shrugs [21:30] Please tell me more about how your pipeline had to be cleared 5 times while I think I've never (or only once) targeted a job to your pipeline [21:30] but in #archiveteam-bs [21:30] you're right, what's really needed is a limit on DNS retries [21:30] (I use Frogging, now Luckolors pipelines) [21:30] it's not a big deal. I just wanted to understand what it was you were trying to do. 
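Note on the "limit on DNS retries" proposed above: the stuck-job condition arises when a pipeline retries forever against a host that no longer resolves. The following is a minimal sketch of a bounded retry, not the actual ArchiveBot or wpull fix. `resolve_with_limit`, `DnsRetriesExhausted`, and the default values are hypothetical; only `socket.gethostbyname` and `socket.gaierror` come from the Python standard library.

```python
import socket
import time


class DnsRetriesExhausted(Exception):
    """Raised when a host still fails to resolve after the retry budget."""


def resolve_with_limit(host, max_attempts=5, delay=0.0,
                       resolver=socket.gethostbyname):
    """Try to resolve `host` at most `max_attempts` times, then give up.

    `resolver` is injectable so the behaviour can be exercised without
    touching the network.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return resolver(host)
        except socket.gaierror as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay)  # pause before the next attempt
    # The budget is spent: surface a failure instead of looping forever,
    # so the caller can mark the job as failed and free its slot.
    raise DnsRetriesExhausted(
        f"{host} did not resolve after {max_attempts} attempts"
    ) from last_error
```

A caller fetching a URL list would catch `DnsRetriesExhausted` and abort the job rather than keep reporting indefinitely.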
[21:32] what's probably happening is that the job gets taken up by ananiel after the pipeline it was on dies (because ananiel is so big that condition is probable, though once I noticed it on Cadbury too) [21:32] which guarantees the bug is triggered. [21:32] so thanks for explaining! perhaps this is the next thing I should attack. [21:33] * bzc6p is a bit confused [21:33] I think I'll just upload it to some drop site and shit [21:33] then it won't happen and won't trigger problems [21:34] that is probably easier, at any rate [21:34] * bzc6p doesn't like using third parties as our main goal is saving shit from third parties [21:34] the bug is still a bug that needs fixing [21:34] well the URL list is hopefully ephemeral and wherever you put it is probably unrelated to the crawl target [21:35] *** bzc6p sets mode: +o FalconK [21:35] ty [21:36] ------------------------- [21:36] it is an interesting point, though, that someone would be fairly easily able to DoS archivebot with these jobs [21:36] * FalconK goes back to writing [21:36] In fact, I came here to inform you that 8086.net doesn't give a crap on us saving their stuff [21:36] #archiveteam-bs [21:36] https://secure.8086.net/portal/viewticket.php?tid=NTM-405143&c=9JIYc46J [21:37] or, in fact, they do, instantly activated CloudFlare but don't mind deleting everything [21:38] You have heard ArchiveTeam News [21:39] "There is no way you can archive the >billion pages on the site and trying to do so is causing issues for other users on the site." [21:40] I can hear SketchCow saying "yeah, deleting everything will also cause issues for other users of the site" [21:40] Why did Timothy post the same thing several times? [21:42] After wumpus reported he hasn't received reply, Timothy decided to send the letter to three contact addresses, and after one and a half day without reply, he decided to send it to the uppercase initial email addresses too, in case their shitty mailserver is case-sensitive. 
That's 6 times altogether. [21:45] *** jspiros has joined #archiveteam [21:56] How much has been saved of DNSHistory before then anyway? [21:57] 0% [21:58] Ow [21:59] In fact, nearly 0,02% [21:59] And now he's deliberately blocking our efforts to get more... [21:59] That's just evil [22:06] Where's my hug [22:07] * JW_work points to the giant hug monster gathering dust in the corner [22:10] *** dashcloud has quit IRC (Read error: Operation timed out) [22:12] *** TC01_ is now known as TC01 [22:12] Who hosed it off last [22:12] *** Lune has joined #archiveteam [22:13] *** dashcloud has joined #archiveteam [22:16] * JW_work points to the giant hoses coming out of it, used for self-washing [22:17] I wonder what makes a person do that, enable cloudflare anti-flood for a site they plan to shut down in 2 days due to funding. [22:17] perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see. [22:18] that seems likely, yeah [22:23] *** bzc6p has left [22:26] lol are these the dns people? [22:27] is it on ATW? [22:28] you can still reply to the ticket, by the way [22:28] even if it is closed [22:29] [00:17] perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see. [22:29] I very strongly doubt there's a CDN involved here [22:41] well cloudflare [22:44] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [22:47] *** aschmitz_ has quit IRC (Read error: Operation timed out) [22:48] *** aschmitz_ has joined #archiveteam [23:12] you don't pay cloudflare for bandwidth [23:12] that's not how their business model works [23:12] so that's a non-argument :P [23:16] *** K4k has quit IRC (Quit: WeeChat 1.5) [23:16] *** K4k has joined #archiveteam [23:19] *** robink has joined #archiveteam [23:21] I'm able to write some scripts for a little project for examiner.com tomorrow. 
But I'm not sure if we have enough time to save everything [23:23] Wait, what makes you think it's in danger? [23:26] DoomTay: they announced it's closing on the 10th [23:27] Oh... [23:39] *** tomwsmf-a has joined #archiveteam [23:46] *** Start has joined #archiveteam [23:47] bloody french [23:59] *** DoomTay has quit IRC (Quit: Page closed)