#archiveteam 2016-05-02,Mon


Time Nickname Message
00:00 🔗 WinterFox has joined #archiveteam
00:19 🔗 swebb has quit IRC (Ping timeout: 370 seconds)
00:23 🔗 philpem has quit IRC (Ping timeout: 260 seconds)
00:24 🔗 BlueMaxim has joined #archiveteam
00:31 🔗 swebb has joined #archiveteam
00:31 🔗 swebb has quit IRC (Client Quit)
00:34 🔗 ariscop has joined #archiveteam
00:47 🔗 BlueMaxim has quit IRC (Quit: Leaving)
00:49 🔗 atlogbot has joined #archiveteam
00:50 🔗 BlueMaxim has joined #archiveteam
00:50 🔗 swebb has joined #archiveteam
00:57 🔗 swebb has quit IRC (Ping timeout: 246 seconds)
00:57 🔗 atlogbot has quit IRC (Read error: Operation timed out)
01:09 🔗 Froggypwn has quit IRC (Ping timeout: 1208 seconds)
01:15 🔗 W1nterFox has joined #archiveteam
01:21 🔗 swebb has joined #archiveteam
01:22 🔗 WinterFox has quit IRC (Read error: Operation timed out)
01:24 🔗 Froggypwn has joined #archiveteam
01:33 🔗 Ymgve has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 altlabel has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 Darkstar has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 PotcFdk has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 PurpleSym has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 sHATNER has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 i0npulse has quit IRC (hub.dk irc.homelien.no)
01:33 🔗 Meeh has quit IRC (hub.dk irc.homelien.no)
01:37 🔗 Meeh_ has joined #archiveteam
01:40 🔗 ndiddy has joined #archiveteam
01:49 🔗 swebb has quit IRC (Ping timeout: 246 seconds)
01:55 🔗 PotcFdk has joined #archiveteam
01:57 🔗 Darkstar has joined #archiveteam
01:59 🔗 i0npulse has joined #archiveteam
02:05 🔗 Vito` huh
02:05 🔗 Vito` k5 shut down?
02:07 🔗 JesseW What's k5?
02:07 🔗 Vito` their current robots.txt is blocking IA, although I assume it's been around long enough there are captures
02:08 🔗 xmc kuro5hin
02:09 🔗 JesseW oh, wow. I mean, I hadn't thought about it in years, which is probably the point, but still -- wow.
02:10 🔗 JesseW kuro5hin was the slashdot alternative for a while, wasn't it?
02:10 🔗 JesseW before reddit, then hackernews took that title
02:11 🔗 xmc yep
02:11 🔗 JesseW I'm sorry we didn't run an archivebot job on it. :-(
02:12 🔗 MrRadar Maybe we should have a project to scan the top xxx,000 domains' robots.txt files to see if they block the IA and then put them in ArchiveBot (or similar) to ensure they're captured at least once
02:13 🔗 JesseW MrRadar: That makes sense to me, although we should probably discuss it in a non-logged channel. :-)
02:13 🔗 MrRadar Good point
02:13 🔗 xmc better hide from the cops
02:13 🔗 dashcloud doesn't the internet census file have the robots.txt, or am I mistaken?
02:14 🔗 JesseW Yes, the wayback machine will always capture robots.txt files, even if the robots.txt file blocks IA
02:14 🔗 JesseW I presume if someone contacts IA directly, they can get that hidden too, but I doubt many people have.
02:15 🔗 JesseW I suggest #robotstxtsucks as a good channel name
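[Editor's note] MrRadar's idea of scanning robots.txt files to see whether they block the IA could be sketched roughly as below. This is only a sketch under assumptions: the log never names the crawler token, so the "ia_archiver" user-agent is an assumption, and blocks_agent is a hypothetical helper name. It works on robots.txt text you have already fetched and does no network access itself.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt, agent="ia_archiver", url="/"):
    """Return True if this robots.txt text disallows `agent` from fetching `url`.

    The default agent token "ia_archiver" is an assumption, not taken from the log.
    """
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    sample = "User-agent: ia_archiver\nDisallow: /\n"
    print(blocks_agent(sample))
```

A real scan would loop this over a list of domains and queue the blocked ones for a one-off capture; how the domain list is obtained is left open here, as it is in the log.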
02:35 🔗 ralphdnak has joined #archiveteam
02:54 🔗 bwn has quit IRC (Read error: Operation timed out)
03:11 🔗 swebb has joined #archiveteam
03:20 🔗 atlogbot has joined #archiveteam
03:46 🔗 bwn has joined #archiveteam
04:32 🔗 altlabel has joined #archiveteam
04:33 🔗 _Crocatow has quit IRC (Quit: Later Ya'll)
04:40 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:47 🔗 Sk1d has joined #archiveteam
04:56 🔗 balrog has quit IRC (Read error: Operation timed out)
04:57 🔗 balrog has joined #archiveteam
05:18 🔗 ralphdnak has quit IRC (Read error: Operation timed out)
05:30 🔗 SketchCow has quit IRC (Read error: Connection reset by peer)
05:39 🔗 schbirid has joined #archiveteam
05:42 🔗 SketchCow has joined #archiveteam
05:50 🔗 kcaj has quit IRC (Quit: ZNC - 1.6.0 - http://znc.in)
05:56 🔗 kcaj has joined #archiveteam
05:56 🔗 RichardG_ has joined #archiveteam
06:00 🔗 RichardG has quit IRC (Read error: Operation timed out)
06:11 🔗 Meroje has quit IRC (Quit: bye!)
06:11 🔗 Meroje has joined #archiveteam
06:33 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
06:39 🔗 ariscop has quit IRC (Read error: Operation timed out)
06:50 🔗 wyatt8740 has quit IRC (Ping timeout: 246 seconds)
06:50 🔗 PurpleSym has joined #archiveteam
06:52 🔗 redlob has quit IRC (Remote host closed the connection)
06:53 🔗 wyatt8740 has joined #archiveteam
06:57 🔗 redlob has joined #archiveteam
07:25 🔗 Honno has joined #archiveteam
07:25 🔗 Honno_ has joined #archiveteam
07:29 🔗 ralphdnak has joined #archiveteam
07:30 🔗 Honno_ has quit IRC (Quit: Leaving)
07:31 🔗 Honno_ has joined #archiveteam
07:32 🔗 Honno has quit IRC (Read error: Operation timed out)
07:42 🔗 brayden_ has joined #archiveteam
07:44 🔗 brayden has quit IRC (Read error: Operation timed out)
08:14 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
08:25 🔗 wyatt8740 has joined #archiveteam
08:31 🔗 philpem has joined #archiveteam
08:31 🔗 ariscop has joined #archiveteam
08:49 🔗 atomotic has joined #archiveteam
08:53 🔗 bwn has quit IRC (Read error: Operation timed out)
09:08 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
09:10 🔗 bwn has joined #archiveteam
09:10 🔗 wyatt8740 has joined #archiveteam
09:33 🔗 fmope_ has quit IRC (Remote host closed the connection)
09:33 🔗 fmope_ has joined #archiveteam
09:38 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
09:45 🔗 BartoCH has joined #archiveteam
09:52 🔗 BartoCH has quit IRC (Read error: Connection timed out)
09:52 🔗 BartoCH has joined #archiveteam
09:55 🔗 pfallenop has quit IRC (Ping timeout: 260 seconds)
09:56 🔗 pfallenop has joined #archiveteam
10:10 🔗 pfallenop has quit IRC (Ping timeout: 260 seconds)
10:11 🔗 pfallenop has joined #archiveteam
10:18 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
10:20 🔗 pfallenop has quit IRC (Remote host closed the connection)
10:30 🔗 RichardG has joined #archiveteam
10:32 🔗 pfallenop has joined #archiveteam
10:32 🔗 RichardG_ has quit IRC (Ping timeout: 272 seconds)
10:42 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
10:51 🔗 wyatt8740 has joined #archiveteam
11:00 🔗 metalcamp has joined #archiveteam
11:16 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
11:34 🔗 RichardG has quit IRC (Ping timeout: 272 seconds)
11:53 🔗 W1nterFox has quit IRC (Remote host closed the connection)
12:00 🔗 vitzli has joined #archiveteam
12:02 🔗 RichardG has joined #archiveteam
12:06 🔗 atomotic has joined #archiveteam
12:15 🔗 vitzli has quit IRC (Quit: Leaving)
12:43 🔗 pfallenop has quit IRC (Ping timeout: 244 seconds)
12:46 🔗 xXx_ndidd has joined #archiveteam
12:47 🔗 ndizzle has joined #archiveteam
12:48 🔗 xXx_ndidd has quit IRC (Read error: Operation timed out)
12:50 🔗 pfallenop has joined #archiveteam
12:58 🔗 ndiddy has quit IRC (Read error: Operation timed out)
13:02 🔗 RichardG has quit IRC (Ping timeout: 499 seconds)
13:07 🔗 RichardG has joined #archiveteam
13:09 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
13:11 🔗 oli has quit IRC (Read error: Operation timed out)
13:14 🔗 oli has joined #archiveteam
13:17 🔗 BartoCH has joined #archiveteam
13:20 🔗 BlueMaxim has quit IRC (Quit: Leaving)
14:13 🔗 MMovie2 has joined #archiveteam
14:14 🔗 MMovie has quit IRC (Read error: Operation timed out)
14:31 🔗 Start has quit IRC (Quit: Disconnected.)
14:35 🔗 ralphdnak has quit IRC (Read error: Connection reset by peer)
14:39 🔗 ralphdnak has joined #archiveteam
14:39 🔗 ralphdnak has quit IRC (Read error: Connection reset by peer)
14:40 🔗 ralphdnak has joined #archiveteam
14:41 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:12 🔗 dashcloud has quit IRC (Read error: Operation timed out)
15:17 🔗 xXx_ndidd has joined #archiveteam
15:19 🔗 dashcloud has joined #archiveteam
15:29 🔗 Start has joined #archiveteam
15:29 🔗 ndizzle has quit IRC (Read error: Operation timed out)
15:34 🔗 atomotic has joined #archiveteam
15:53 🔗 bsmith093 has quit IRC (Ping timeout: 370 seconds)
15:55 🔗 bsmith093 has joined #archiveteam
15:55 🔗 JesseW has joined #archiveteam
16:04 🔗 atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
16:05 🔗 SimpBrain has joined #archiveteam
16:06 🔗 Start has quit IRC (Quit: Disconnected.)
16:16 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:29 🔗 Atluxity any concurrency limit on yuku-grab? :D
16:32 🔗 Atluxity only that and google code running now?
16:32 🔗 Atluxity and urlteam ofc
16:32 🔗 Start has joined #archiveteam
16:33 🔗 HCross There is a chance GameFront might be coming back though, waiting on arkiver to look at it more
16:35 🔗 HCross With the wrong size download thing, I tested multithreading each download, and it seemed to finish
16:35 🔗 HCross but wget-lua can't do it
17:11 🔗 godane has quit IRC (Quit: Leaving.)
17:17 🔗 VADemon has joined #archiveteam
17:26 🔗 bwn has quit IRC (Read error: Operation timed out)
17:31 🔗 Neurosplo has joined #archiveteam
17:32 🔗 ralphdnak has quit IRC (Read error: Operation timed out)
17:35 🔗 FalconK has quit IRC (Ping timeout: 260 seconds)
17:37 🔗 Start has quit IRC (Quit: Disconnected.)
17:38 🔗 FalconK has joined #archiveteam
17:38 🔗 FalconK has quit IRC (Read error: Connection reset by peer)
17:38 🔗 FalconK has joined #archiveteam
17:42 🔗 MrRadar Good news about Kuro5hin. The person running the site showed up on Hacker News and said that it disappeared because his hosting provider shut down the datacenter his server was in: https://news.ycombinator.com/item?id=11612648
17:42 🔗 MrRadar He is planning to bring the site back in some form, though it may be a static archive
17:42 🔗 xmc oh damn
17:42 🔗 xmc "good" news
17:43 🔗 MrRadar Also someone further down in the thread was running a scraper on them and has a partial archive of the Kuro5hin user diaries (but not the stories): https://news.ycombinator.com/item?id=11609514
17:46 🔗 atomotic has joined #archiveteam
17:48 🔗 bwn has joined #archiveteam
17:52 🔗 xarph has joined #archiveteam
18:18 🔗 Start has joined #archiveteam
18:46 🔗 godane has joined #archiveteam
18:51 🔗 metal_cam has joined #archiveteam
18:52 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
19:15 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
19:25 🔗 Ymgve has joined #archiveteam
19:35 🔗 scyther_ has joined #archiveteam
19:35 🔗 scyther_ has quit IRC (Connection closed)
19:44 🔗 Start has quit IRC (Quit: Disconnected.)
19:46 🔗 jessehick has joined #archiveteam
20:42 🔗 ariscop has quit IRC (Read error: Operation timed out)
20:59 🔗 Neurosplo has quit IRC ()
21:05 🔗 schbirid has quit IRC (Quit: Leaving)
21:35 🔗 xarph has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
21:46 🔗 Emcy has quit IRC (Read error: Connection reset by peer)
21:46 🔗 ariscop has joined #archiveteam
22:13 🔗 SimpBrain has quit IRC (Remote host closed the connection)
22:20 🔗 zino arkiver: Sent you a mail with the naming details for the next two uploads. Have a look at it when you can.
22:21 🔗 Honno_ has quit IRC (Read error: Operation timed out)
22:25 🔗 arkiver zino: I'll have a look!
22:26 🔗 arkiver #videobot now supports twitter.com and vine.co videos.
22:26 🔗 arkiver The videos will be downloaded as WARCs and as 'normal' video items, which will be uploaded to IA.
22:27 🔗 Emcy has joined #archiveteam
22:34 🔗 Ravenloft has joined #archiveteam
22:36 🔗 metal_cam has quit IRC (Ping timeout: 244 seconds)
23:00 🔗 redlob has quit IRC (Read error: Operation timed out)
23:02 🔗 redlob has joined #archiveteam
23:17 🔗 vOYtEC has quit IRC (Read error: Connection reset by peer)
23:17 🔗 vOYtEC has joined #archiveteam
23:20 🔗 justsome has joined #archiveteam
23:22 🔗 justsome hello
23:22 🔗 arkiver hi
23:23 🔗 justsome who do i talk to about the project?
23:25 🔗 JW_work justsome: what project?
23:27 🔗 justsome the archiveteam project, i have a question about it
23:27 🔗 JW_work archiveteam does a bunch of things, but this is the right place to ask questions about it, yes.
23:28 🔗 justsome ok well then... is there a list of sites that the team has recently archived somewhere? i couldn't find it on the wiki...
23:29 🔗 JW_work justsome: http://archiveteam.org , scroll down to "Recently finished"?
23:31 🔗 justsome ah sorry i mean specifically FTP sites actually... i went to the internet archive page linked from the wiki, and all i see are daily/hourly archive files, there is no single list i can find
23:31 🔗 justsome i want to know what FTP sites have been grabbed, say, within the past few months
23:32 🔗 JW_work ah, the FTP project. probably worth asking #effteepee also.
23:33 🔗 justsome it's been silent for two days now i think
23:33 🔗 JW_work heh
23:33 🔗 justsome that's why i came here...
23:34 🔗 justsome your wiki tells people to go on IRC to talk about the project... and there's almost nobody talking about the project on IRC...
23:34 🔗 JW_work It doesn't look like there's an up to date index that I can find, either.
23:35 🔗 JW_work You can extract one yourself from the cdx files, e.g. https://archive.org/download/archiveteam_ftp_20160416204923/archiveteam_ftp_20160416204923.cdx.gz
23:35 🔗 JW_work but that's quite a bit of a hassle.
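[Editor's note] Extracting a site list from one of those cdx files might look something like the sketch below. It assumes the file is a space-separated CDX whose header line names the columns with single letters, with "a" marking the original URL, and that it actually contains records for the grab, which the log itself later calls into question for FTP WARCs. The function names are hypothetical.

```python
import gzip
from urllib.parse import urlsplit


def hosts_from_cdx(lines):
    """Collect the distinct hostnames found in the original-URL column of CDX lines."""
    hosts = set()
    url_col = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CDX":
            # Header looks like " CDX N b a m s k r M S V g";
            # the letters after "CDX" name the columns in order.
            cols = fields[1:]
            url_col = cols.index("a") if "a" in cols else None
            continue
        if url_col is None or len(fields) <= url_col:
            continue
        host = urlsplit(fields[url_col]).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


def hosts_from_cdx_file(path):
    """Same as hosts_from_cdx, but reads a local .cdx.gz file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return hosts_from_cdx(fh)
```

Run over every item in the collection, this would give a per-item host list, which is the "quite a bit of a hassle" part.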
23:35 🔗 justsome so how does each batch work? there are so many uploads per day on that site
23:35 🔗 justsome is it just spread out so you don't upload too much at once? how does it work?
23:35 🔗 JW_work arkiver may be able to answer more usefully about specific details. I'm not that familiar with the FTP grab.
23:36 🔗 JW_work uploads on which site?
23:36 🔗 HCross https://github.com/ArchiveTeam/ftp-items might give some clues
23:36 🔗 justsome the site you linked to
23:37 🔗 JW_work archive.org? Yes, there are a lot of uploads to it. :-)
23:37 🔗 justsome no i mean why are there so many "archiveteam" files on archive.org *per day*
23:37 🔗 justsome specifically the FTP files
23:38 🔗 justsome or is each one one site or something?
23:38 🔗 JW_work *ahhh*. Well, we pack the data we grab into big collections, called megawarcs. And each one gets a separate item.
23:38 🔗 JW_work For the current grab, I don't think they are separated by site, no.
23:39 🔗 JW_work the reason there were lots uploaded on Saturday is that the pipeline between our staging server and the Internet Archive was jammed a bit, and unjammed on that day (I think).
23:40 🔗 JW_work actually, cancel that — I think that was a different project.
23:41 🔗 JW_work But the reason for the multiple items is just to avoid having painfully large things in each item. It makes it easier on the Wayback Machine (which parses them for display)
23:41 🔗 justsome and makes it harder to search for something... way harder.
23:42 🔗 JW_work it's not particularly arranged for ease of searching, in that form.
23:42 🔗 slpeeds has joined #archiveteam
23:42 🔗 Frogging I thought IA had a Web tool that would search through CDXs for things
23:43 🔗 JW_work There may be one — I don't remember what it's called.
23:43 🔗 arkiver Before we started the FTP grab as WARCs I emailed about it with SketchCow
23:43 🔗 arkiver I gave him a sample of how a FTP WARC would look and asked if the wayback machine would get support for FTP sites
23:44 🔗 xmc ah neat
23:44 🔗 arkiver SketchCow let me know I could start the FTP grab.
23:44 🔗 arkiver Currently FTP WARCs are not recognized by the cdx-creator
23:44 🔗 JW_work cool
23:44 🔗 arkiver So basically you'd have to download these WARCs and go through them
23:44 🔗 Frogging so they can't be searched than
23:44 🔗 Frogging then*
23:44 🔗 arkiver No.
23:45 🔗 arkiver Not at the moment
23:45 🔗 JW_work Someone certainly could create an index though.
23:45 🔗 JW_work Or fix cdx-creator (which I presume is planned eventually)
23:45 🔗 arkiver But if the wayback machine has support for them, this is a better way to save them than in tar files I think
23:46 🔗 arkiver I'll ask around a bit more about FTP support for the wayback machine
23:46 🔗 JW_work it looks like the grab from Feb 2015 separated the items by site
23:47 🔗 justsome is anyone doing these sites in particular: http://archiveteam.org/index.php?title=FTP/List
23:47 🔗 fmope_ has quit IRC (Ping timeout: 864 seconds)
23:49 🔗 arkiver ErkDog is currently scanning a lot of university FTPs, that should give us some work soon
23:49 🔗 arkiver We grabbed the Canadian FTPs insofar as they were still online
23:49 🔗 arkiver not sure about the others
23:49 🔗 justsome where are the canadian FTP archives?
23:49 🔗 arkiver in the WARCs
23:50 🔗 justsome ... and that's why not having an index/search method isn't helpful.
23:50 🔗 arkiver see the above story
23:50 🔗 JW_work justsome: it'll come (or maybe you could write it)?
23:51 🔗 justsome do you only grab these sites once?
23:53 🔗 justsome i'm assuming that every day/hour's grab is different in that case...
23:53 🔗 justsome (that is, of different sites)
23:53 🔗 justsome or is there a way to optimize my manual search, so to speak?
23:53 🔗 arkiver Sites can be rescanned; any new files, or files whose size has changed, will then be re-added to the lists to be grabbed
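[Editor's note] The rescan rule arkiver describes (new files, or files whose size changed, get queued again) can be illustrated with a small comparison of two listings. This is not the project's actual code; the dict-of-sizes representation and the function name are assumptions made for the example.

```python
def files_to_regrab(previous, current):
    """Return paths that are new in `current` or whose size differs from `previous`.

    Both arguments map a file path to its size in bytes.
    """
    return sorted(
        path for path, size in current.items()
        if previous.get(path) != size
    )


if __name__ == "__main__":
    old_scan = {"/pub/a.txt": 10, "/pub/b.txt": 20}
    new_scan = {"/pub/a.txt": 10, "/pub/b.txt": 25, "/pub/c.txt": 5}
    print(files_to_regrab(old_scan, new_scan))
```

Files that vanished between scans are simply not reported, and a file changed without a size change would be missed, which follows from comparing sizes only.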
23:53 🔗 justsome no i mean how should i browse through the WARCs
23:53 🔗 arkiver What are you looking for exactly?
23:53 🔗 justsome the canadian sites
23:54 🔗 justsome is there a date range, or something...? some search tip?
23:54 🔗 arkiver yeah, but what in those FTPs?
23:54 🔗 arkiver we have this with file lists https://github.com/ArchiveTeam/ftp-queue/tree/master/archive
23:54 🔗 arkiver I'm not sure if that list is complete at the moment
23:55 🔗 JW_work that repo should probably get added to the wiki page
23:56 🔗 justsome hmmm... the site i'm focused on doesn't seem to be in that list, yet it's in the list i linked to (the one on the wiki)
23:57 🔗 arkiver which site is it
23:57 🔗 JW_work probably worth making a list of *all* the sites in the wiki list that aren't in the ftp-queue list.
23:58 🔗 JW_work presumably they are ones that went down before we could get to them?
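[Editor's note] JW_work's suggested list of wiki sites that are missing from the ftp-queue list is essentially a set difference on hostnames. A minimal sketch follows, assuming both lists have already been pulled out as plain strings (bare hostnames or ftp:// URLs); how to scrape the wiki page or the repo is not covered, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_host(entry):
    """Reduce 'ftp://Host/path' or a bare 'Host' to a lowercase hostname."""
    entry = entry.strip()
    if "://" not in entry:
        # Give bare hostnames a scheme so urlsplit can find the host part.
        entry = "ftp://" + entry
    return (urlsplit(entry).hostname or "").lower()


def missing_sites(wiki_entries, queue_entries):
    """Return hostnames present in the wiki list but absent from the queue list."""
    queued = {normalize_host(e) for e in queue_entries}
    wanted = {normalize_host(e) for e in wiki_entries}
    return sorted(h for h in wanted - queued if h)
```

For example, missing_sites(["ftp://support.crtc.gc.ca/"], []) would report support.crtc.gc.ca as not yet queued.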
23:59 🔗 justsome the site i'm referring to may be taken down soon... which is why i'm worried about it in the first place: support.crtc.gc.ca
