[00:02] Pete Chivani.
[00:02] Now I'm spelling it wrong.
[00:02] YOU DID THIS TO ME
[00:02] sdlkfslfkjdsf
[00:04] So, this hotel network? Suuuuuuucks.
[00:04] Can't believe I paid for it.
[00:04] win 8
[00:08] OK, feeling better.
[00:08] His name is Pete Chvany.
[00:08] "Computer Networks: The Heralds of Resource Sharing". 1972.
[00:08] It's on archive.org.
[00:28] thanks so much
[02:02] 33 on the ACT
[02:02] I'm pretty excited
[02:02] :B
[02:02] * BlueMax stabs underscor
[02:15] http://en.wikipedia.org/wiki/Super_High_Me
[03:49] underscor: is that out of 32?
[03:49] :P
[03:49] chronomex: 36
[03:49] :P
[03:49] I kid I kid
[03:50] good job
[03:50] Thanks :D
[03:50] I think I got something around 33 too
[03:50] so ... you're in good company
[03:50] haha
[03:50] 99th percentile, fuck yeah
[03:50] * chronomex currently packaging up symbian for the torrent
[03:50] 99th percentile on ACT, 94th percentile on SAT, and 3.05 GPA
[03:51] One of these is not like the other
[03:51] :V
[03:51] 94th? :(
[03:51] Supposedly
[03:51] I couldn't find any concrete numbers anywhere
[03:51] you disappoint
[03:51] I got a 2010
[03:51] 2010 is last year
[03:51] Whatever percentile that is
[03:51] 1380 on the old scale
[03:51] mmmm
[03:53] Either way, I'm pretty happy
[03:53] Except for my GPA
[03:53] lol
[03:53] heh I got like 1510 on the old scale
[03:53] But that's because I hate mundane work
[03:53] and I spend all my time on archiveteam and other fun things
[03:53] instead of doing homework
[03:53] :<
[03:53] DFJustin: That's almost perfect
[03:54] :P
[03:54] archiveteam is a good thing to spend your time on
[03:54] Hopefully this internship thing in august works out well too
[05:58] ndurner: Able to get a stats update when you have a minute?
[05:58] :)
[06:08] will do :-)
[10:44] useless fact of the day, the number of domains starting with each character in alexa's top 1M list http://pastebin.com/gpPxZWZY
[10:44] M as in million, not thousand
[10:48] and on place 744459 there is "_live.it"...
[10:49] i wonder if i should try to grab 100,000 robots.txt per day instead of 10,000
[11:34] Spirit_: that's a lot of suicide notes
[12:58] bbot_: hm?
[13:00] as in, jason's "robots.txt is a suicide note" essay
[13:03] ah yes
[13:03] currently thinking how to make it nicely accessible
[13:04] maybe after each scrape, check which files were changed or are new/gone and put that information in a database
[13:05] well, let's see if i can get 100000 down instead
[13:27] Hi there. :-)
[13:29] h
[13:29] i
[13:32] I'm just having a look, what you guys are doing is so great. But I assume you've been receiving lots of thanks lately. :-P
[13:35] careful if you look too much, i am afraid some guys in here wear no pants
[13:35] actually it is not that often that people come here i think but i am just a side peobn
[13:35] peon
[13:45] What is the problem with wearing a skirt?
[13:46] Female archivist here.
[13:47] ha, i never knew
[13:47] girls in skirts are cool
[13:52] brb
[14:04] 17962 files so far
[14:04] i estimate 70k, since i always got ~7k from 10k
[14:05] so something around 5-6 hours, that is great
[14:11] back
[14:29] any word on yahoo video?
[14:30] sadcarrot: What kind of words are you looking for? :)
[14:30] lol
[14:30] bash question: if i get an error, i would like it to be in $result, result=$(diff -q $yesterday $today)
[14:30] the good kinds!
[14:30] any hint?
[14:30] i mean this is my line "result=$(diff -q $yesterday $today)"
[14:30] i can no longer rsync my yahoo video
[14:30] but if a file is missing, i get an error and $result is empty
[14:30] so, just wanted to verify that it's complete
[14:30] (password doesn't work)
[14:30] wait a second
[14:31] sadcarrot: Oh, well - it's best if you'd check with SketchCow on that
[14:31] yes
[14:31] result=$(diff -q $yesterday $today 2>&1)
[14:31] thanks :P
[14:32] gotcha
[14:39] does anyone have a tested and proven method to identify true HTML files from bash? many sites serve random crap pages when i ask them for a robots.txt
[14:40] i am afraid that "file" might misclassify some
[14:46] grep "<" ?
[14:47] that character is in txt files
[14:47] file seems to do a good job actually
[14:48] http://pastebin.com/raw.php?i=2ymRsydX
[14:49] seems like people like to round-robin and serve different files too, meh
[14:59] Spirit_: Did you check the mime type?
[15:36] look for a doctype
[15:37] <!DOCTYPE html>
[15:37] lots of people leave them off though
[15:43] In other news: I did a little experimenting based on Coderjoe's idea for a whois archiver. http://whoisarchive.heroku.com/
[15:45] The whois/domain lookup archiver looks cool. :-)
[15:49] There is a paid service that does the same, though: http://www.domaintools.com/
[15:56] i think i will go with "file"
[15:56] i don't like whois archiving, especially not indexed by search engines
[15:56] actually, i wish whois would vanish
[15:57] for privacy
[15:57] Spirit_: Why?
[15:57] It's the same thing as a business license, or a car registration
[15:57] because i do not want john doe to google my name and find domain x and y
[15:57] They're all public information
[15:57] in the US maybe
[15:58] Then buy a domain privacy thing
[15:58] Well, com and net are administered in the US, so... :P
[15:58] yeah, but fuck that! :P
[15:58] Convince your local ccTLD to get rid of whois
[15:58] and there you go
[15:58] yea, that's good information to archive
[15:59] The compilation,
[15:59] repackaging, dissemination or other use of this Data is expressly
[15:59] prohibited without the prior written consent of VeriSign.
[15:59] says many (all?) com whois'
[15:59] well, yea. they would say that
[15:59] Registrant Organization:ARCHIVE TEAM IS GO
[15:59] hahaha
[15:59] they want to have control
[16:00] heh
[16:00] Server Name: FRIENDSTER.COM.ZZZZZ.GET.LAID.AT.WWW.SWINGINGCOMMUNITY.COM
[16:00] IP Address: 69.41.185.226
[16:00] Referral URL: http://domainhelp.opensrs.net
[16:00] Registrar: TUCOWS.COM CO.
[16:00] Whois Server: whois.tucows.com
[16:00] What?!?!
[16:00] http://whoisarchive.heroku.com/friendster.com/20110628142730.txt
[16:01] heh
[16:01] brb
[16:02] that would be the database lookup i guess
[16:05] hm, do i want to delete html responses?
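[The two bash tricks traded above — pulling diff's error output into $result and using file(1) to screen out HTML error pages served in place of robots.txt — fit together roughly like this. A minimal sketch only; the per-day snapshot paths and filenames are invented for illustration.]

    # hypothetical paths for two daily robots.txt snapshots of the same host
    yesterday="robotstxt/20110627/example.com.txt"
    today="robotstxt/20110628/example.com.txt"

    # 2>&1 inside the substitution means errors such as
    # "No such file or directory" land in $result instead of being lost
    result=$(diff -q "$yesterday" "$today" 2>&1)

    # `file -b --mime-type` prints just the type, e.g. "text/html" or "text/plain",
    # so HTML error pages masquerading as robots.txt can be flagged or skipped
    mime=$(file -b --mime-type "$today")
    if [ "$mime" = "text/html" ]; then
        echo "$today looks like an HTML page, not a robots.txt" >&2
    fi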
[16:06] bash: /bin/ls: Argument list too long :(
[16:06] xargs
[16:06] find whatever -print0 | xargs -0 rm
[16:07] sorry, completely unrelated to the deletion
[16:09] but find was a good suggestion, thanks
[16:10] or not
[16:10] bash: /usr/bin/find: Argument list too long
[16:10] find robotstxt2/files/*/*/20110628 | wc -l
[16:10] there was a trick with echo for this, hm
[16:20] any time the argument list is too long, use find
[16:20] find whatever -print0 | xargs -0 wc -l
[16:21] sorry, for that you'll want find whatever -print0 | xargs -0 ls -l | wc -l
[16:21] annoying, but
[16:22] anyway, I'm late
[16:22] bbl
[16:25] thanks, that works
[16:25] underscor: hey man
[16:25] underscor: can you check the status of my yahoo vid upload?
[16:25] i guess -print0 does not buffer like without
[16:26] sadcarrot: Were you uploading to me or to rsync.net?
[16:27] "(If you have reviews, I'd begin the process of archiving them via a Word document." http://wheredangerlives.blogspot.com/2011/06/professor-is-dead-long-live-netflix.html
[16:28] netflix reviews that is
[16:28] underscor: datadump.textfiles.com
[16:29] You'll have to talk to SketchCow then
[16:29] oh ok
[16:32] 55k files down
[16:33] about 7/10ths through the 100k list
[17:50] db48x:
[17:50] $ time find files/*/*/20110628 -print0 | xargs -0 ls -l | wc -l
[17:50] bash: /usr/bin/find: Argument list too long
[17:50] :]
[17:50] i guess 64k is a limit
[18:27] SketchCow: ping
[18:27] as for the bitsavers stuff ... are you familiar with Manx?
[18:48] alard: is there a problem with your Google Groups script?
[18:48] now let's see if 7z likes to pack these files
[18:50] ndurner: No, it's just switched off.
[18:51] My connection is currently busy with downloading Friendster user connections and uploading the other Friendster data.
[18:51] I'll probably turn ggroups back on when those things are done.
[18:57] ah, ok
[18:58] can you upload your script somewhere so that someone else can jump in?
[18:58] seems to work
[18:58] (also, having the code for that kind of trickery might help future projects)
[19:01] ndurner: the ggroups script?
[19:40] alard: yes
[20:06] ndurner: Sorry for the delay, I had to find my notes on ipv6 tunnels first.
[20:06] https://gist.github.com/30cff29b602b818d018c#file_instructions.txt
[20:06] thanks!
[20:06] https://gist.github.com/30cff29b602b818d018c#file_ggroups_zipdl_ipv6.sh
[20:23] underscor: Google Groups update:
[20:23] directories: TOTAL: 243898, NEW: 105872, PROCESSING: 15, DONE_DIR: 138011
[20:23] completion rate: directories: 337/hr, groups: 865/hr
[20:23] groups: TOTAL: 1245968, NEW: 767342, PROCESSING: 44, ERROR: 10944, ADULT: 4236, DONE_GRP: 463402
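[A note on the "Argument list too long" exchange between 16:06 and 17:50 above: the error comes from the shell, not from find or ls. The glob files/*/*/20110628 is expanded before the command runs, and the expansion overflows ARG_MAX, which limits the total bytes of arguments plus environment rather than a fixed number of files. The fix is to hand find a short starting path and let it do the matching itself. A rough sketch, reusing the robotstxt2/files layout from the chat; the -path and -name patterns are illustrative.]

    # fails once the glob expands past ARG_MAX -- find never even starts:
    #   find robotstxt2/files/*/*/20110628 | wc -l

    # let find walk the tree instead; only two short arguments are passed,
    # so this counts the entries under the 20110628 snapshot directories
    find robotstxt2/files -path '*/20110628/*' | wc -l

    # same idea for listing or deleting huge sets of files: -print0 / xargs -0
    # keeps odd filenames intact and batches the arguments under the limit
    find robotstxt2/files -type f -name 'robots.txt' -print0 | xargs -0 ls -l | wc -l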
[21:23] marceloan: Hi, have you been able to upload your twaud.io files yet?
[21:24] Or haven't you been able to contact SketchCow?
[21:28] Hi
[21:30] alard: No and no.
[21:30] Ah.
[21:30] alard: What compression should I use?
[21:30] No compression, I guess.
[21:31] alard: I have to send all the data unzipped?
[21:31] You can try bzip or gzip, but it probably won't help. mp3's are already pretty compressed.
[21:31] If it helps, you could rsync it to me and then I'll upload it along with my part.
[21:32] Yes, how can I do it?
[21:32] Is rsync okay?
[21:32] I have to use Linux?
[21:33] No, you can also use cwRsync, the Windows version.
[21:34] That? http://www.itefix.no/cwrsync/
[21:34] Yes. And then you probably don't want the server, just the client.
[21:40] 3.6MB, downloading... 10 minutes left...
[21:40] Ah, that takes a while.
[21:41] That gives me the time to figure out how I can set up an rsyncd server.
[21:53] Ok, I installed it.
[21:56] Great. Let's continue in a private message.
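[For context, a hand-off like the one alard and marceloan arrange above might look roughly like the sketch below: a tiny rsync daemon on the receiving side and a plain rsync push from the sender (cwRsync on Windows uses the same syntax with Cygwin-style paths). The host name, module name, port, and directories are all invented for illustration.]

    # --- receiving side: minimal rsyncd config, run as an unprivileged user ---
    cat > /tmp/rsyncd.conf <<'EOF'
    port = 8873
    use chroot = false
    [twaudio]
        path = /data/twaudio-incoming
        read only = false
    EOF
    rsync --daemon --config=/tmp/rsyncd.conf

    # --- sending side: push the whole directory, resumable and with progress ---
    rsync -avP /cygdrive/c/twaudio/ rsync://archive.example.org:8873/twaudio/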