[00:23] *** bzc6p_ has joined #archiveteam
[00:29] *** bzc6p has quit IRC (Read error: Operation timed out)
[00:31] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:34] *** raven_ has left WeeChat 1.2
[00:34] HCross: tried poking arkiver ?
[00:36] arkiver, you around?
[00:36] *** dashcloud has joined #archiveteam
[00:39] hi
[00:39] my knowledge of rsync is not very good
[00:40] *** JesseW has joined #archiveteam
[00:51] http://archiveteam.org/index.php?title=Dev/Staging
[00:53] HCross: those are the directions I followed last time.
[00:54] yeah already been given them
[00:58] *** pwnsrv has quit IRC (Read error: Operation timed out)
[00:58] *** ex-parrot has quit IRC (Read error: Operation timed out)
[01:08] *** remsen has joined #archiveteam
[01:23] *** ex-parrot has joined #archiveteam
[01:49] *** VADemon has joined #archiveteam
[01:51] *** Ungstein has joined #archiveteam
[02:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:15] *** primus104 has quit IRC (Leaving.)
[02:18] *** dashcloud has joined #archiveteam
[03:16] *** bwn has quit IRC (Read error: Operation timed out)
[03:20] *** Ghost_of_ has joined #archiveteam
[03:42] *** xk_id_ has quit IRC (Remote host closed the connection)
[03:58] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[04:02] *** nertzy has joined #archiveteam
[04:12] *** VADemon has quit IRC (Read error: Connection reset by peer)
[04:26] *** aaaaaaaaa has quit IRC (Leaving)
[04:33] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[04:33] *** nertzy has joined #archiveteam
[04:45] *** antonizoo has joined #archiveteam
[04:50] *** Ghost_of_ has quit IRC (Quit: Leaving)
[05:09] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[05:10] *** nertzy has joined #archiveteam
[05:20] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[05:25] *** za3k has joined #archiveteam
[05:25] Okay, ran some numbers on GitHub. I'm downloading the repos, my estimate is that there are currently 25 million repos and 7 million gists. The repos (minus forks) look to be around 110TB
[05:26] It is impossible to get a comprehensive list of gists, but I'll publicly post the list of repos when it finishes downloading (will be another week or so)
[05:26] Shallow vs deep checkout isn't that significant, nor is compression
[05:27] I haven't done any testing with dropping the largest repos
[05:28] Oh, my numbers are from a uniform sample of 1000 repos out of the first 8 million that have downloaded so far
[05:29] Hovering at around 40% forks on the sample
[05:30] cool, glad you are doing this!
[05:30] Well, it's around where I have to stop, since I don't have 2TB free, let alone 100TB, but I thought I'd post it at least to save the next person time.
[05:31] if you'd like to write it up on the ArchiveTeam wiki, that might be a good place for it. The secret word is: yahoosucks
[05:32] I'll write it up once the repos finish downloading and I can post a link.
[05:33] Err sorry, I'm downloading the /LIST/ of repos, as I don't have space for the repos.
[05:34] But I should be able to maintain that, similar to what githubarchive.com is doing.
[05:35] Rather less exciting.
[05:39] still valuable.
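Editor's sketch: the GitHub estimate above extrapolates from a uniform sample of 1000 repos to roughly 25 million. The log does not show how the numbers were computed, so the following is only one plausible way to do that arithmetic. The function name and the synthetic sample are hypothetical; the per-repo size is made up and chosen so that the result lands near the 110TB and 40% figures quoted in the channel. It also assumes the sampled repos are representative of all repos, which the log itself qualifies (the sample came from the first 8 million listed).

```python
def estimate_non_fork_storage(sample, total_repos):
    """Extrapolate storage for non-fork repos from a uniform random sample.

    sample: list of (size_bytes, is_fork) tuples (hypothetical input format).
    total_repos: estimated number of repos in the whole population.
    Returns (fork_fraction, estimated_non_fork_count, estimated_total_bytes).
    """
    non_fork_sizes = [size for size, is_fork in sample if not is_fork]
    fork_fraction = 1 - len(non_fork_sizes) / len(sample)
    mean_size = sum(non_fork_sizes) / len(non_fork_sizes)
    # Scale the sampled non-fork share up to the full population.
    estimated_non_forks = total_repos * len(non_fork_sizes) / len(sample)
    return fork_fraction, estimated_non_forks, estimated_non_forks * mean_size


# Synthetic sample, NOT real data: 400 forks and 600 non-forks of ~7.3 MB each.
synthetic_sample = [(0, True)] * 400 + [(7_333_333, False)] * 600

if __name__ == "__main__":
    forks, count, total = estimate_non_fork_storage(synthetic_sample, 25_000_000)
    print(f"fork fraction: {forks:.0%}")
    print(f"non-fork repos: {count:,.0f}")
    print(f"estimated size: {total / 1e12:.0f} TB")
```

Read the other way round, 110TB spread over about 15 million non-fork repos implies an average of a little over 7 MB per repo. A mean like that can be dominated by a few very large repos, which fits the remark in the channel that dropping the largest repos had not been tested yet.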
[05:44] *** za3k has quit IRC (Quit: Page closed)
[05:57] NEEDS MORE EMOJIS
[06:13] *** BlueMaxim has joined #archiveteam
[06:18] * JesseW can't type EMOJIS on my keyboard
[06:52] 💩💩💩💩
[06:52] only required emoji
[06:53] OH WAIT 🖕
[06:53] NICE
[07:20] *** WubTheCap has quit IRC (Quit: Leaving)
[07:29] *** MMovie1 has quit IRC (Read error: Connection reset by peer)
[07:31] *** _vOYtEC has joined #archiveteam
[07:32] *** vOYtEC_ has quit IRC (Read error: Connection reset by peer)
[07:32] *** khaoohs_ has joined #archiveteam
[07:34] *** SmileyG has joined #archiveteam
[07:34] *** Stiletto has joined #archiveteam
[07:35] *** MMovie has joined #archiveteam
[07:37] *** Baljem has joined #archiveteam
[07:37] *** K4k has joined #archiveteam
[07:38] *** RichardG_ has joined #archiveteam
[07:38] *** Selanda_ has joined #archiveteam
[07:38] *** ex-parro1 has joined #archiveteam
[07:38] *** swebb_ has joined #archiveteam
[07:38] *** RichardG has quit IRC (Ping timeout: 253 seconds)
[07:38] *** Stilett0 has quit IRC (Ping timeout: 253 seconds)
[07:38] *** Baljem_ has quit IRC (Ping timeout: 253 seconds)
[07:38] *** jk[SVP] has quit IRC (Ping timeout: 253 seconds)
[07:38] *** swebb has quit IRC (Ping timeout: 253 seconds)
[07:38] *** thefinn93 has quit IRC (Write error: Broken pipe)
[07:38] *** K4k__ has quit IRC (Read error: Operation timed out)
[07:38] *** balrog has quit IRC (Read error: Operation timed out)
[07:38] *** Selanda has quit IRC (Read error: Operation timed out)
[07:38] *** bai has quit IRC (Ping timeout: 243 seconds)
[07:38] *** khaoohs has quit IRC (Read error: Operation timed out)
[07:38] *** swebb_ is now known as swebb
[07:38] *** Gfy has quit IRC (Ping timeout: 490 seconds)
[07:38] *** ex-parrot has quit IRC (Write error: Broken pipe)
[07:38] *** jmtd has quit IRC (Killed (hub.efnet.us (Nick change collision)))
[07:38] *** jk[[SVP]] has joined #archiveteam
[07:38] *** jk[[SVP]] is now known as jk[SVP]
[07:38] *** bai has joined #archiveteam
[07:38] *** Smiley has quit IRC (Read error: Operation timed out)
[07:39] *** jmtd has joined #archiveteam
[07:40] *** wp494 has quit IRC (Ping timeout: 498 seconds)
[07:41] *** wp494 has joined #archiveteam
[07:50] *** thefinn93 has joined #archiveteam
[07:50] *** scyther has joined #archiveteam
[07:50] *** Ungstein has quit IRC (Quit: Leaving.)
[07:51] *** Ungstein has joined #archiveteam
[07:54] *** Start has quit IRC (Quit: Disconnected.)
[07:56] *** Start has joined #archiveteam
[07:58] *** Gfy has joined #archiveteam
[08:00] arkiver: ok, added
[08:05] *** bwn has joined #archiveteam
[08:26] *** primus104 has joined #archiveteam
[08:39] *** balrog has joined #archiveteam
[08:56] *** JesseW has quit IRC (Read error: Operation timed out)
[09:18] *** primus104 has quit IRC (Leaving.)
[10:15] *** bwn has quit IRC (Read error: Connection reset by peer)
[10:15] *** bwn has joined #archiveteam
[10:31] *** notjack has joined #archiveteam
[10:38] There are a lot of outdated pages on this wiki :-P
[10:43] And almost no links from finished projects to IA items either.
[10:45] I think I need to spend a day just updating the wiki at some point.
[10:45] *** bzc6p_ is now known as bzc6p
[10:47] It's a shame i can't edit the main-page either, but I guess it makes sense :)
[10:48] notjack: thank you for that. Once in a while I also do some maintenance. It's best if we immediately update project pages on status change.
[10:49] bzc6p: Nice. Yeah, I'll try and go through Deathwatch and Fire Drill and check statuses as well.
[10:49] *** remsen has quit IRC (Read error: Operation timed out)
[10:51] Wrt links to IA items: We could either extend the infobox itself or {{saved}} and {{partiallysaved}}. Opinions?
[10:56] {{saved}} and {{partiallysaved}}
[10:56] I've updated Deathwatch to include some deadlines that have passed.
[10:56] PurpleSym: Reasonable, but not sure if people would find that. Also, we've got used to having an "Archives" section.
[10:57] Maybe we can make http://archiveteam.org/index.php?title=Projects up to date
[10:58] Make the page more visible from the frontpage
[10:58] Sure, a section is fine too. But the infobox is more visible imo.
[10:58] we should also update this list http://archiveteam.org/index.php?title=Warrior_projects
[11:00] *** Sk1d has joined #archiveteam
[11:00] I think only Warrior projects is much behind. Projects is quite up-to-date.
[11:03] looks like we're almost ready for the google code project
[11:03] Atluxity: then you can really bring in the big guns ;)
[11:04] *** bwn has quit IRC (Read error: Operation timed out)
[11:05] bzc6p: yeah, and Warrior projects is I think one of the most important pages on the wiki
[11:05] Moved CyberPunkReview.com from Fire Drill to Deathwatch
[11:05] A good list of everything we've done
[11:06] arkiver: agreed.
[11:06] Will Google Code probably block us with captchas?
[11:06] Nah, I don't think they will
[11:06] I've run it quite fast here and didn't get any captchas
[11:07] And from earlier projects it never looked like Google was really actively working on stopping us from grabbing their sites
[11:07] I have the feeling they even delayed the shutdown of baraza for us
[11:10] We had to do tricks when discovering blogger
[11:10] I don't remember baraza
[11:11] google search is also protected by captchas. That's why I thought about that, but if not, that's good.
[11:12] arkiver: niiiie! :D
[11:13] +c
[11:15] Moved FriendFeed from Fire Drill to Deathwatch
[11:15] I like you notjack
[11:15] Atluxity: Thanks :)
[11:20] Moved The Grid from Fire Drill to Deathwatch
[11:22] Moved Nakido from Fire Drill to Deathwatch
[11:29] Moved GameSpy, 1UP and Ugo from Fire Drill to Deathwatch
[11:31] Moved Ovi Store from Fire Drill to Deathwatch
[11:35] *** remsen has joined #archiveteam
[11:35] Moved Blip.tv from Fire Drill to Deathwatch
[11:36] And that looks like the Fire Drill is updated @all
[11:45] *** bwn has joined #archiveteam
[11:46] *** atomotic has joined #archiveteam
[11:47] *** primus104 has joined #archiveteam
[11:53] notjack: I think the sites gone for good could not only be struck out but removed from Fire Drill.
[11:54] bzc6p: I left the sites there for historical purposes and so it could be obvious what I changed, but I can remove those entries if you think that would be better.
[11:55] The list is long enough. Important information can be moved to Deathwatch along with the mention. It's just my opinion, though.
[12:01] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:01] Removed items moved to Deathwatch from Fire Drill
[12:02] Updated titles to be more understandable in Deathwatch
[12:05] *** nertzy has joined #archiveteam
[12:06] *** WinterFox has quit IRC (Remote host closed the connection)
[12:11] *** wooza has joined #archiveteam
[12:14] Hi, i was just listening to "The Spendiferous Story of Archive Team - Jason Scott - PDA2011.mp3" and it reminded me i was going to ask in this channel
[12:14] could someone point me in the direction for all the backups of alexa top-1m.csv.zip files, some used to be on archive.org but had access restriction, is there an ftp with them somewhere?
[12:19] anyone?
[12:22] *** bzc6p_ has joined #archiveteam
[12:24] *** BlueMaxim has quit IRC (Leaving)
[12:25] *** bzc6p has quit IRC (Read error: Operation timed out)
[12:26] wooza: this is IRC, it may take a while for people to respond :)
[12:27] ok
[12:27] do you have any leads?
[12:29] @joepie91 i had seen some on archive.org in the past but some were missing also, thought someone might have an ftp or dir with them
[12:31] wooza: I have no idea, personally.
[12:31] best to wait around until somebody comes along who does
[12:31] :p
[12:31] @joepie91: is there someone i could email?
[12:32] archiveteam@archiveteam.org goes to whom?
[12:36] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[12:37] *** bzc6p_ is now known as bzc6p
[12:38] wooza: afaik that's read by Jason Scott
[12:40] wooza: I found this: https://ia802605.us.archive.org/13/items/MostPopular1MillionWebsitesIn2008/
[12:40] but I guess you need several issues?
[12:40] bzc6p: i already saw that, there is daily backups somewhere
[12:40] bzc6p: thanks anyway
[12:40] on IA?
[12:41] bzc6p: i have seen some where its daily on archive.org but they have restricted access, not sure if they exist still
[12:41] wooza: It probably exists. Rumour has it that IA doesn't delete a byte ever, just hides.
[12:42] bzc6p: some are restricted access though, so need a ftp or dir etc...
[12:42] If this is the case, SketchCow (Jason Scott), who is an IA employee, can confirm, but I doubt he will give it away, as IA probably has a reason to hide that.
[12:43] bzc6p: its just alexa rank data that has been public? ....
[12:43] bzc6p: its just im stupid and didnt set up a vps 10 years ago to back it up daily
[12:43] Now SketchCow has been pinged and might reply. If not, try emailing.
[12:43] bzc6p: someone else did though so im trying to get that ;p
[12:44] bzc6p: is SketchCow jason?
[12:44] Yes.
[12:44] bzc6p: you pinged him?
[12:45] you too, by mentioning him
[12:45] bzc6p: oh i havent used irc since i was young so dont remember how it works much lol
[12:45] :)
[12:45] bzc6p: thanks anyway though
[12:46] when you do "name: hi name bla bla" are you supposed to put the @ and stuff in if they are op? or just their username and no symbols?
[12:47] I think a smart IRC client understands both.
[12:48] what about the : is that needed?
[12:49] yes
[12:49] ok
[12:49] for these talks, please go to #archiveteam-bs
[12:49] I don't think so.
[12:49] sorry, no it's not needed to ping someone
[12:49] yeah
[12:49] @arkiver: wops my bad, did you have any info on the alexa top-1m.csv.zip backups?
[12:50] no
[12:51] seems like my hose-rig is down for a little
[13:00] *** VADemon has joined #archiveteam
[13:01] *** schbirid has joined #archiveteam
[13:01] *** sigkell has joined #archiveteam
[13:08] *** bwn has quit IRC (Read error: Operation timed out)
[13:19] *** SimpBrain has quit IRC (Leaving)
[13:24] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[13:25] *** Stiletto has joined #archiveteam
[13:35] *** scyther has quit IRC (Quit: Leaving)
[13:48] *** RichardG_ is now known as RichardG
[13:58] *** notjack_ has joined #archiveteam
[14:00] *** notjack has quit IRC (Ping timeout: 240 seconds)
[14:01] *** notjack_ is now known as notjack
[14:11] *** wooza has quit IRC (Quit: Page closed)
[14:18] *** remsen has quit IRC (Read error: Operation timed out)
[14:56] *** atomotic has joined #archiveteam
[15:04] *** antomatic has quit IRC (Read error: Operation timed out)
[15:18] *** xk_id has joined #archiveteam
[15:23] *** Emcy has joined #archiveteam
[15:49] *** antomatic has joined #archiveteam
[16:03] *** nertzy has joined #archiveteam
[16:11] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:17] *** Jonimus has quit IRC (Read error: Operation timed out)
[16:21] *** zenguy_pc has quit IRC (Remote host closed the connection)
[16:35] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[17:11] So Internet Archive is recording channels on and off
[17:11] ^ television channels
[17:12] FRANCE24 was last recorded 20140522 and IA started recording it 20151114 again
[17:13] I'm not sure what IA's reason is to stop recording these channels
[17:14] but I really think it's bad for an archive to just start and stop recordings of channels like this
[17:15] IA has now totally missed the first reports on 20151113 of the shootings
[17:16] A reason for stopping the recording in 2014 might be that FRANCE24 recordings are almost never watched
[17:17] And FRANCE24 is definitely not the only channel being recorded on and off currently
[17:18] As far as I know the whole point of recording these channels is to preserve news and important events.
[17:18] The way IA is currently recording some channels, this point is totally missed
[17:19] *** zenguy_pc has joined #archiveteam
[17:19] Recording channels is expensive though, so that might be a reason to stop recording when videos are not watched often.
[17:20] *** xk_id has quit IRC (Remote host closed the connection)
[17:20] SketchCow: ^
[17:21] And it's always better to get a bit of a television channel than nothing.
[17:22] But shutting down the recording of FRANCE24, one of the only or maybe the only (?) recorded french television channel, is bad I think
[17:23] Maybe IA should try to spread number of recordings more equally over the countries.
[17:29] it takes about 100gb per day to archive TV channels
[17:29] thats 3tb a month
[17:30] *** xk_id has joined #archiveteam
[17:31] about 40 to 50TB after encodes to other formats + originals
[17:31] a year
[17:45] Ahem.
[17:45] *** notjack has quit IRC (Ping timeout: 240 seconds)
[17:46] I will waste the amount of time in this discussion to simply say, there is no situation where IA is able to record a channel where they go "Ah, fuck that. Stop recording."
[17:46] And when you guys speculate about "choices" the archive is making, for example in this case, it's high comedy.
[17:47] I am positive they lost a node, or a channel stopped being unencrypted, or so on.
[17:47] So, they're recording things "on and off" like Archive Team is saving sites "on and off"
[17:53] if I was to have access to livestreams for a couple of norwegian tv-channels, mostly cause of my norwegian IP, would that be something IA would be interested in? Norway have a compulsory archiving law covering tv broadcast, but that does not mean its easily accessible.
[17:53] There needs to be a discussion. I'll have one when I'm onsite on Tuesday.
[18:01] *** JesseW has joined #archiveteam
[18:11] Brought it up in the Slack channel because hey Slack
[18:13] *** xk_id has quit IRC (Remote host closed the connection)
[18:15] *** xk_id has joined #archiveteam
[18:16] thanks
[18:38] *** Emcy has quit IRC (Read error: Connection reset by peer)
[18:40] *** SimpBrain has joined #archiveteam
[18:41] SketchCow: if its on and off then me saving stuff like Korea Today makes more sense now
[18:41] and all of the other news channel stuff
[18:43] i'm also now getting something thats special
[18:43] Sega Germany Dreamcast Pre-Launch "Press Data and Material".
[18:44] that disc is only released to press members
[18:45] ah i've seen that posted too
[18:46] http://academictorrents.com/ looks pretty inactive
[18:54] *** aaaaaaaaa has joined #archiveteam
[19:02] SketchCow: So is it ok if I check every now and then if channels stopped recording, in case something went wrong with nodes or encryption?
[19:05] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:07] So currently the IA is not grabbing any Dutch/Belgian channels. BVN is a Dutch/Belgian tv channel which is also freely available in the US by the Eutelsat 113 West A satellite. http://www.bvn.tv/ontvangst
[19:08] SketchCow: Do you think IA can record that channel?
[19:10] arkiver: where is this list of recorded channels?
[19:13] *** dashcloud has joined #archiveteam
[19:14] godane: I don't know how many tv channels you think need 100 gb/day
[19:15] I think one tv channel needs not more than 4 terabytes a year in a medium quality.
[19:15] (11 GB/day)
[19:16] Say, 100 channels of the world would cost 400 TB/year.
[19:16] If I were a bit more wealthy, I'd do 3 channels plus some radio myself, of my country. Maybe a bit later.
[19:17] i'm basing this on 4GB per hour MPEG1/2 encoding
[19:17] thats what was used for 9/11 video
[19:17] * bzc6p switches to archiveteam-bs
[19:33] *** nertzy has joined #archiveteam
[19:45] France24 is FTA on satellite in Europe, if there's any lack of coverage it wouldn't be impossible to set up a recording elsewhere
[19:45] It has no live captioning, which makes searching and indexing less simple, though
[19:51] *** cvb has joined #archiveteam
[19:52] *** remsen has joined #archiveteam
[19:52] [Also comes in three versions - French, English, and Arabic]
[19:59] 4GB per hour is about 9-10 megabits per second, Godane - few news channels broadcast at anything like that bitrate. only the occasinosl HD network would get anywhere close.
[19:59] *occasional
[20:03] i was wrong
[20:03] *** Boppen has quit IRC (Read error: Connection reset by peer)
[20:03] i thought i said 4gb one time as the mpg
[20:03] its about 3gb for 2 hours
[20:03] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[20:04] *** Boppen has joined #archiveteam
[20:04] Works out to about 3.5mbps - that's good broadcast-quality standard definition MPEG-2, for example.
[20:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:08] *** scyther has joined #archiveteam
[20:08] *** schbirid has quit IRC (Quit: Leaving)
[20:09] France24 on Astra2 _right now_ swings between about 2.5 to 3.2mbps on average - doesn't seem to jump far above/below those rates.
[20:10] *** dashcloud has joined #archiveteam
[20:16] that seems low. 1080*1920*24bits is around 50 megabits for 1 frame
[20:18] anyway this is more -bs
[20:19] FRANCE24 is recorded in 704*480
[20:20] aaaaaa - don't underestimate broadcast compression, it's literally at ratios of 100:1 or more
[20:22] *** vvvvv has joined #archiveteam
[20:22] *** vvvvv has quit IRC (Connection closed)
[20:23] It's 720x576 in Europe - higher resolution due to Her Majesty's PAL standard :)
[20:23] *** Froggypwn has quit IRC (Read error: Connection reset by peer)
[20:27] *** Rickster has quit IRC (Read error: Connection reset by peer)
[20:36] *** mr-b has quit IRC (Read error: Operation timed out)
[20:44] *** Froggypwn has joined #archiveteam
[20:44] *** xk_id has quit IRC (Remote host closed the connection)
[21:04] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:08] *** dashcloud has joined #archiveteam
[21:22] *** VADemon has quit IRC (left4dead)
[21:26] *** Ravenloft has joined #archiveteam
[21:27] *** xk_id has joined #archiveteam
[21:51] if there is a archivebot grab going on for lowendbox/talk, might be good if that is paused as they are under quite a large ddos
[21:51] hey. can anyone recursively archive this webpage: http://mgvs.org/public/
[21:53] *** JesseW has quit IRC (Leaving.)
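Editor's sketch: the bitrate figures traded earlier in the channel (4GB per hour, 3GB for 2 hours, one raw 1080p frame, and the rough compression ratio) can be checked with a few lines of arithmetic. The helper names below are hypothetical. The sketch assumes decimal gigabytes (10^9 bytes), 24 bits per pixel as quoted in the channel, and a frame rate of 30 frames per second, which nobody in the log states.

```python
GB = 10**9  # decimal gigabyte; an assumption, the log does not say GB vs GiB


def mbps(size_bytes, seconds):
    """Average bitrate in megabits per second for a file of the given size."""
    return size_bytes * 8 / seconds / 1e6


def raw_mbps(width, height, bits_per_pixel=24, fps=30):
    """Uncompressed video bitrate in megabits per second (fps is an assumption)."""
    return width * height * bits_per_pixel * fps / 1e6


# One uncompressed 1080p frame at 24 bits per pixel, in megabits.
RAW_1080P_FRAME_MEGABITS = 1080 * 1920 * 24 / 1e6

if __name__ == "__main__":
    print(f"4GB per hour:       {mbps(4 * GB, 3600):.1f} Mbps")   # about 8.9
    print(f"3GB per 2 hours:    {mbps(3 * GB, 7200):.1f} Mbps")   # about 3.3
    print(f"raw 1080p frame:    {RAW_1080P_FRAME_MEGABITS:.1f} megabits")
    # Raw 704x480 video against a stream of roughly 3 Mbps.
    print(f"compression ratio:  {raw_mbps(704, 480) / 3.0:.0f}:1")
```

With binary gigabytes the first two results come out nearer 9.5 and 3.6 Mbps, which matches the "9-10 megabits" and "3.5mbps" quoted above. Under these assumptions the ratio for 704*480 at about 3 Mbps is roughly 80:1, the same order of magnitude as the 100:1 mentioned in the channel.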
[21:58] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:01] *** dashcloud has joined #archiveteam
[22:20] *** Gfy has left
[22:29] *** Sk1d has quit IRC (Quit: ZNC - http://znc.in)
[22:30] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:51] *** ironman_ has quit IRC (Quit: Connection closed for inactivity)
[23:02] *** JesseW has joined #archiveteam
[23:06] *** remsen has quit IRC (Read error: Operation timed out)
[23:08] *** remsen has joined #archiveteam
[23:08] *** Boppen has quit IRC (Read error: Connection reset by peer)
[23:09] *** Boppen has joined #archiveteam
[23:26] Re the France 24 talk earlier, about an hour or two after the attacks I started grabbing their English web stream through Archive Bot and continued for several hours afterwards
[23:27] I should probably download the warcs and stitch them together and then upload it as a video back to the Archive
[23:33] MrRadar: awesome
[23:43] *** MrRadar_ has joined #archiveteam
[23:44] *** MrRadar has quit IRC (Read error: Operation timed out)
[23:45] *** MrRadar_ is now known as MrRadar
[23:51] *** bwn has joined #archiveteam
[23:51] That could be a good ad-hoc warrior project - maybe something that picks some kind of streaming video source in the country matching the IP address of the warrior, and records for a certain amount of time.
[23:52] or fetch on-demand news bulletins (e.g. BBC1 UK Six O'Clock News, 13/11/2015) from local on-demand sources.
[23:52] Newswarrior.
[23:55] Could grab radio news bulletins too, or talk radio stations, etc
[23:56] Might be a nice 'slow'/always-on type project that guaranteed there'd always be something for warriors to do, in some way