#archiveteam 2016-07-08,Fri

Time Nickname Message
00:18 🔗 DoomTay has joined #archiveteam
00:19 🔗 Stiletto has joined #archiveteam
00:19 🔗 tomwsmf-a has joined #archiveteam
00:24 🔗 DiscantX has joined #archiveteam
00:30 🔗 JesseW has joined #archiveteam
00:57 🔗 VADemon has quit IRC (Quit: left4dead)
00:57 🔗 DiscantX has quit IRC (Read error: Operation timed out)
01:12 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
01:18 🔗 ats has quit IRC (Ping timeout: 244 seconds)
01:23 🔗 Stiletto has quit IRC (Ping timeout: 244 seconds)
01:24 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
01:28 🔗 Coderjoe has joined #archiveteam
01:34 🔗 Froggypwn has quit IRC (~ Trillian Astra - www.trillian.im ~)
01:36 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
02:01 🔗 coretx has quit IRC (Read error: Operation timed out)
02:02 🔗 RichardG has quit IRC (Read error: Operation timed out)
02:02 🔗 RichardG has joined #archiveteam
02:04 🔗 coretx has joined #archiveteam
02:05 🔗 JesseW has joined #archiveteam
02:10 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
02:18 🔗 Stiletto has joined #archiveteam
02:36 🔗 ats has joined #archiveteam
02:45 🔗 RichardG has quit IRC (Read error: Operation timed out)
02:45 🔗 RichardG has joined #archiveteam
02:46 🔗 ats has quit IRC (Read error: Operation timed out)
02:52 🔗 ats has joined #archiveteam
03:09 🔗 RichardG has quit IRC (Read error: Operation timed out)
03:09 🔗 RichardG has joined #archiveteam
03:31 🔗 Froggypwn has joined #archiveteam
03:33 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
03:48 🔗 Kitaru_ has joined #archiveteam
03:52 🔗 RichardG has quit IRC (Ping timeout: 370 seconds)
03:54 🔗 Swizzle has quit IRC (Quit: Leaving)
03:57 🔗 RichardG has joined #archiveteam
04:01 🔗 Coderjoe has joined #archiveteam
04:05 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:08 🔗 RichardG has quit IRC (Ping timeout: 260 seconds)
04:11 🔗 Sk1d has joined #archiveteam
04:12 🔗 RichardG has joined #archiveteam
04:13 🔗 Kitaru_ has quit IRC (Quit: This computer has gone to sleep)
04:25 🔗 db48x` has quit IRC (Read error: Connection reset by peer)
04:30 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
04:31 🔗 db48x has joined #archiveteam
05:05 🔗 metalcamp has joined #archiveteam
05:12 🔗 Trevor has joined #archiveteam
05:13 🔗 Trevor has quit IRC (Client Quit)
05:31 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
06:13 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:16 🔗 dashcloud has joined #archiveteam
06:54 🔗 BlueMaxim has quit IRC (Quit: Leaving)
07:01 🔗 RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
07:07 🔗 JesseW Delightfully weird history: https://motherboard.vice.com/read/the-secret-nuclear-history-of-cat-videos
07:13 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:27 🔗 RichardG has joined #archiveteam
07:55 🔗 DoomTay has quit IRC (Quit: Page closed)
08:16 🔗 WinterFox has joined #archiveteam
08:25 🔗 Gfy has quit IRC (Read error: Operation timed out)
08:32 🔗 Gfy has joined #archiveteam
08:50 🔗 Wuked has joined #archiveteam
08:50 🔗 DiscantX has joined #archiveteam
08:57 🔗 zhongfu_ has joined #archiveteam
08:57 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
09:04 🔗 zhongfu_ has quit IRC (Ping timeout: 260 seconds)
09:04 🔗 DiscantX has quit IRC (Read error: Operation timed out)
09:05 🔗 zhongfu has joined #archiveteam
09:12 🔗 DiscantX has joined #archiveteam
09:26 🔗 BlueMaxim has joined #archiveteam
09:39 🔗 atomotic has joined #archiveteam
09:51 🔗 zhongfu has quit IRC (Remote host closed the connection)
09:58 🔗 Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…)
10:00 🔗 Wuked has joined #archiveteam
10:07 🔗 Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
10:14 🔗 zhongfu has joined #archiveteam
10:17 🔗 Wuked has joined #archiveteam
10:20 🔗 Wuked has quit IRC (Client Quit)
10:22 🔗 Wuked has joined #archiveteam
10:24 🔗 Wuked has quit IRC (Client Quit)
10:30 🔗 Wuked has joined #archiveteam
10:32 🔗 zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
10:32 🔗 Wuked has quit IRC (Client Quit)
10:32 🔗 GLaDOS has joined #archiveteam
10:34 🔗 zhongfu has joined #archiveteam
10:45 🔗 Wuked has joined #archiveteam
10:47 🔗 Wuked has quit IRC (Client Quit)
10:57 🔗 morbus_ has quit IRC (Read error: Operation timed out)
11:09 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
11:16 🔗 Wuked has joined #archiveteam
12:05 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:10 🔗 BlueMaxim has joined #archiveteam
12:23 🔗 DiscantX has quit IRC (Read error: Operation timed out)
13:38 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:48 🔗 VADemon has joined #archiveteam
13:58 🔗 Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…)
14:10 🔗 Wuked has joined #archiveteam
14:17 🔗 Start has quit IRC (Quit: Disconnected.)
14:59 🔗 WinterFox has quit IRC (Read error: Operation timed out)
15:05 🔗 philpem has joined #archiveteam
15:06 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
15:07 🔗 BartoCH has joined #archiveteam
15:12 🔗 Wuked has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:21 🔗 r3c0d3x has quit IRC (Ping timeout: 260 seconds)
15:23 🔗 r3c0d3x has joined #archiveteam
15:25 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
15:25 🔗 BartoCH has joined #archiveteam
15:26 🔗 BartoCH has quit IRC (Client Quit)
15:27 🔗 BartoCH has joined #archiveteam
15:54 🔗 Start has joined #archiveteam
15:59 🔗 Start has quit IRC (Quit: Disconnected.)
16:03 🔗 ploop has quit IRC (Ping timeout: 244 seconds)
16:04 🔗 ploop has joined #archiveteam
16:15 🔗 DoomTay has joined #archiveteam
16:18 🔗 JesseW has joined #archiveteam
16:33 🔗 arkiver JesseW: JW_work: I'll have a look at examiner.com
16:34 🔗 Frogging not sure how much we could get in 2 days
16:34 🔗 DoomTay Isn't ArchiveBot already doing it, even if it's probably not going off of the sitemaps?
16:34 🔗 Frogging DoomTay: it's slow
16:35 🔗 Frogging ArchiveBot is one host crawling a site. Warrior projects are many hosts downloading a site in a more organized item-based fashion
16:36 🔗 arkiver We are now saving around 1 TB of news with NewsBuddy every day!
16:37 🔗 Frogging however in the latter case, the speed is still limited because without rate limiting that's just about the same thing as a DDoS.
16:37 🔗 Frogging still faster and more efficient generally though. but necessitates writing scripts that are aware of the site structure
16:40 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:42 🔗 godane !a http://www.postnewspapers.com.au/
16:42 🔗 godane i put it in archivebot
16:48 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
16:49 🔗 dashcloud has joined #archiveteam
16:56 🔗 JW_work Frogging: #archivebot is *much* slower than nearly any source site, even with a delay of zero, due to the administrative back and forth with the central server.
17:01 🔗 Frogging yes, there's that too
17:48 🔗 arkiver VideoBot is recording livestream again.
17:48 🔗 arkiver VideoBot can be found at #videobot
17:49 🔗 arkiver It is especially handy to record and save Twitter videos directly to the Internet Archive as video item
17:49 🔗 arkiver vine and periscope are also supported. periscope needs a fix though
17:50 🔗 arkiver Coming up soon for VideoBot is following a twitter hashtag and downloading all twitter and/or periscope videos using that hashtag and uploading those videos to IA as video items.
17:50 🔗 arkiver All videos are also saved into the Wayback Machine
18:45 🔗 Start has joined #archiveteam
19:08 🔗 DFJustin niiice
19:36 🔗 dashcloud has quit IRC (Ping timeout: 244 seconds)
19:37 🔗 VADemon has quit IRC (Quit: left4dead)
19:39 🔗 DiscantX has joined #archiveteam
19:40 🔗 dashcloud has joined #archiveteam
19:46 🔗 Start has quit IRC (Quit: Disconnected.)
19:47 🔗 Start has joined #archiveteam
19:52 🔗 Start has quit IRC (Quit: Disconnected.)
20:07 🔗 DiscantX has quit IRC (Read error: Operation timed out)
20:08 🔗 mutoso has quit IRC (Quit: leaving)
20:18 🔗 mutoso has joined #archiveteam
20:37 🔗 dxrt has quit IRC (Read error: Operation timed out)
20:38 🔗 jspiros has quit IRC (Read error: Operation timed out)
20:41 🔗 dxrt has joined #archiveteam
20:45 🔗 Kitaru has joined #archiveteam
21:12 🔗 FalconK wait, what back and forth is there for archivebot in the middle of a crawl?
21:13 🔗 FalconK what I see is more like the site being crawled throttles large amounts of requests from a single IP.
21:13 🔗 FalconK and the deduplication lookups take a significant amount of cycles, which can't be parallelized because of python
21:15 🔗 Frogging FalconK: it's because it has to synchronously (!) contact the controller for every request
21:16 🔗 FalconK it probably doesn't have to do that
21:16 🔗 FalconK I bet we could make that better
21:16 🔗 FalconK it's managing the crawl locally, so what it exchanges with the controller can't be more than status updates
21:17 🔗 FalconK the code is unfortunately complicated though
21:17 🔗 DoomTay I think I remember someone, probably Asparagir, saying that the only thing stopping rollout of any big updates is the high number of long-term jobs underway
21:18 🔗 FalconK hmmmmmmmmmmmmmmmmmm
21:18 🔗 FalconK complex.
21:18 🔗 FalconK that is a much bigger problem than just serviceability
21:18 🔗 FalconK so a couple days ago the PDU on ananiel died
21:18 🔗 FalconK I am told there was a great popping, and crackling, and smoke issued
21:19 🔗 FalconK and with it, about 14 long-running jobs
21:19 🔗 FalconK now they are started from the beginning
21:19 🔗 FalconK on the bright side, ananiel was updated before taking on new work
21:20 🔗 FalconK but if you set a stopfile, you can expect it to halt sometime between a few hours and 4 months from then (and it can't service small jobs while waiting for the long jobs to go away)
21:21 🔗 FalconK initially, one would suppose that the trick is to make it into a distributed workload
21:21 🔗 bzc6p has joined #archiveteam
21:21 🔗 swebb sets mode: +o bzc6p
21:21 🔗 * FalconK !
21:21 🔗 FalconK bzc6p! you're the one with vt.idiota.hu, yes?
21:22 🔗 bzc6p um, yes
21:22 🔗 robink has quit IRC (Ping timeout: 633 seconds)
21:22 🔗 bzc6p why?
21:22 🔗 HCross im still up for running a pipeline
21:22 🔗 FalconK I'm trying to figure out why you use it for url lists, and not like pastebin
21:22 🔗 FalconK and then remove the A record for it after the job starts but before it is completed
21:22 🔗 DoomTay I think he lost the job and can't remember the exact URL
21:23 🔗 FalconK there have been a few of those jobs
21:23 🔗 FalconK and sometimes the thing running them dies a painful death due to power failures, the epoll_wait bug in wpull, or something else
21:23 🔗 FalconK and then the next pipeline that takes up the job just tries infinitely to pull the URL list, and fails because the A record is gone and there's nothing there anymore
21:24 🔗 FalconK and the job takes up a job slot and keeps reporting infinitely
21:24 🔗 DoomTay I'm not sure the A record disappearing is his fault. Right now the site as a whole can't be reached.
21:24 🔗 FalconK when I get one of those jobs and notice this condition, the only way to fix it is to add a hosts entry for vt.idiota.hu so it goes to like google.com, so it can "finish" the job
21:24 🔗 FalconK yeah
21:25 🔗 FalconK ... wait, is it URL lists, or were we trying to archive the site?
21:25 🔗 FalconK because it looked like URL lists
21:25 🔗 bzc6p Stop.
21:25 🔗 bzc6p So.
21:26 🔗 bzc6p The list resides on my computer, I start a webserver and announce my IP to a dyndns site.
21:26 🔗 bzc6p I wait until it finishes and then shut down my server.
21:26 🔗 nwf_ has quit IRC (Read error: Operation timed out)
21:26 🔗 bzc6p It happened twice out of like 14 cases that there was an outage meanwhile and the task got stuck.
21:27 🔗 bzc6p Let me remark that it is a bug in the ArchiveBot software, which was also admitted by yipdw.
21:27 🔗 bzc6p I do it this way because I don't want to use third parties unless necessary.
21:27 🔗 FalconK yes, you're triggering a bug
21:27 🔗 * FalconK nods
21:28 🔗 bzc6p But if you think it is a considerable problem that this sometimes happens, I may consider doing it in a way that makes this bug much less likely to happen.
21:28 🔗 FalconK it's probably worth fixing the bug, since that is not the only case that can trigger it, but
21:28 🔗 nwf_ has joined #archiveteam
21:28 🔗 FalconK meanwhile, I know I've had to clear it on my pipeline ~5 times, and when it happens on a pipeline that is less watched, it takes quite a while
21:29 🔗 FalconK on the other hand, the next time you bring the host up, if it stays that way for longer than the DNS failure is cached, that pipeline will expunge the job
21:29 🔗 FalconK but, this method also risks keeping the DNS failure cached on the same pipeline that takes up your new job
21:29 🔗 FalconK in which case both jobs will be stuck
21:30 🔗 * FalconK shrugs
21:30 🔗 bzc6p Please tell me more about how your pipeline had to be cleared 5 times while I think I've never (or only once) targeted a job to your pipeline
21:30 🔗 bzc6p but in #archiveteam-bs
21:30 🔗 FalconK you're right, what's really needed is a limit on DNS retries
21:30 🔗 bzc6p (I use Frogging, now Luckolors pipelines)
21:30 🔗 FalconK it's not a big deal. I just wanted to understand what it was you were trying to do.
21:32 🔗 FalconK what's probably happening is that the job gets taken up by ananiel after the pipeline it was on dies (because ananiel is so big that condition is probable, though once I noticed it on Cadbury too)
21:32 🔗 FalconK which guarantees the bug is triggered.
21:32 🔗 FalconK so thanks for explaining! perhaps this is the next thing I should attack.
21:33 🔗 * bzc6p is a bit confused
21:33 🔗 bzc6p I think I'll just upload it to some drop site and shit
21:33 🔗 bzc6p then it won't happen and won't trigger problems
21:34 🔗 FalconK that is probably easier, at any rate
21:34 🔗 * bzc6p doesn't like using third parties as our main goal is saving shit from third parties
21:34 🔗 FalconK the bug is still a bug that needs fixing
21:34 🔗 FalconK well the URL list is hopefully ephemeral and wherever you put it is probably unrelated to the crawl target
21:35 🔗 bzc6p sets mode: +o FalconK
21:35 🔗 FalconK ty
21:36 🔗 bzc6p -------------------------
21:36 🔗 FalconK it is an interesting point, though, that someone would be fairly easily able to DoS archivebot with these jobs
21:36 🔗 * FalconK goes back to writing
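[Editor's sketch] The "limit on DNS retries" FalconK mentions at 21:30 could look roughly like the following. This is not ArchiveBot or wpull code; `resolve_with_limit`, `UrlListUnavailable`, and the default retry counts are hypothetical names and values chosen for illustration. The idea is only that a job whose URL-list host no longer resolves fails after a bounded number of attempts instead of holding a job slot forever.

```python
import socket
import time
from urllib.parse import urlsplit


class UrlListUnavailable(Exception):
    """Hypothetical error: the URL-list host never resolved within the budget."""


def resolve_with_limit(url, max_attempts=3, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve the host in `url`, giving up after `max_attempts` DNS failures.

    `resolver` and `sleep` are injectable so the behaviour can be tested
    without touching the network (an assumption of this sketch).
    """
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in {url!r}")

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Success: hand back whatever the resolver returned.
            return resolver(host, None)
        except socket.gaierror as exc:
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts; no sleep after the last one.
                sleep(delay * attempt)

    # Budget exhausted: fail the job instead of retrying forever.
    raise UrlListUnavailable(
        f"{host} did not resolve after {max_attempts} attempts"
    ) from last_error
```

A pipeline using something like this would catch `UrlListUnavailable` and mark the job as failed, which frees the slot. It does not address the cached-DNS-failure case discussed at 21:29.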
21:36 🔗 bzc6p In fact, I came here to inform you that 8086.net doesn't give a crap on us saving their stuff
21:36 🔗 HCross #archiveteam-bs
21:36 🔗 bzc6p https://secure.8086.net/portal/viewticket.php?tid=NTM-405143&c=9JIYc46J
21:37 🔗 bzc6p or, in fact, they do, instantly activated CloudFlare but don't mind deleting everything
21:38 🔗 bzc6p You have heard ArchiveTeam News
21:39 🔗 bzc6p "There is no way you can archive the >billion pages on the site and trying to do so is causing issues for other users on the site."
21:40 🔗 bzc6p I can hear SketchCow saying "yeah, deleting everything will also cause issues for other users of the site"
21:40 🔗 DoomTay Why did Timothy post the same thing several times?
21:42 🔗 bzc6p After wumpus reported he hadn't received a reply, Timothy decided to send the letter to three contact addresses, and after one and a half days without a reply, he decided to send it to the uppercase-initial email addresses too, in case their shitty mailserver is case-sensitive. That's 6 times altogether.
21:45 🔗 jspiros has joined #archiveteam
21:56 🔗 DoomTay How much has been saved of DNSHistory before then anyway?
21:57 🔗 bzc6p 0%
21:58 🔗 DoomTay Ow
21:59 🔗 bzc6p In fact, nearly 0.02%
21:59 🔗 DoomTay And now he's deliberately blocking our efforts to get more...
21:59 🔗 DoomTay That's just evil
22:06 🔗 SketchCow Where's my hug
22:07 🔗 * JW_work points to the giant hug monster gathering dust in the corner
22:10 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:12 🔗 TC01_ is now known as TC01
22:12 🔗 SketchCow Who hosed it off last
22:12 🔗 Lune has joined #archiveteam
22:13 🔗 dashcloud has joined #archiveteam
22:16 🔗 * JW_work points to the giant hoses coming out of it, used for self-washing
22:17 🔗 FalconK I wonder what makes a person do that, enable cloudflare anti-flood for a site they plan to shut down in 2 days due to funding.
22:17 🔗 FalconK perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see.
22:18 🔗 JW_work that seems likely, yeah
22:23 🔗 bzc6p has left
22:26 🔗 Lune lol are these the dns people?
22:27 🔗 ranma is it on ATW?
22:28 🔗 joepie91 you can still reply to the ticket, by the way
22:28 🔗 joepie91 even if it is closed
22:29 🔗 joepie91 [00:17] <FalconK> perhaps they didn't like paying the CDN charges for our crawl. maybe it was more bandwidth than they normally ever see.
22:29 🔗 joepie91 I very strongly doubt there's a CDN involved here
22:41 🔗 FalconK well cloudflare
22:44 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
22:47 🔗 aschmitz_ has quit IRC (Read error: Operation timed out)
22:48 🔗 aschmitz_ has joined #archiveteam
23:12 🔗 joepie91 you don't pay cloudflare for bandwidth
23:12 🔗 joepie91 that's not how their business model works
23:12 🔗 joepie91 so that's a non-argument :P
23:16 🔗 K4k has quit IRC (Quit: WeeChat 1.5)
23:16 🔗 K4k has joined #archiveteam
23:19 🔗 robink has joined #archiveteam
23:21 🔗 arkiver I'm able to write some scripts for a little project for examiner.com tomorrow. But I'm not sure if we have enough time to save everything
23:23 🔗 DoomTay Wait, what makes you think it's in danger?
23:26 🔗 Kitaru DoomTay: they announced it's closing on the 10th
23:27 🔗 DoomTay Oh...
23:39 🔗 tomwsmf-a has joined #archiveteam
23:46 🔗 Start has joined #archiveteam
23:47 🔗 Lune bloody french
23:59 🔗 DoomTay has quit IRC (Quit: Page closed)
