[00:46] til about rsync's -K
[00:46] i've used symlinks with cfv a bunch but not with rsync i guess
[01:57] I'll take it. yeah
[02:26] I have successfully made it to MI!
[02:26] (alive)
[02:28] Dec_29_2011_19:21:00 SketchCow gui77: ext2 or thereabouts
[02:29] ext2 has some really funky limits, and no journaling
[02:29] I'd recommend ext3/4 or JFS
[02:29] (jfs is about as solid as you can get as far as data integrity if you lose power/blow up/knock out the sata cable/etc)
[02:31] Oh
[02:31] chronomex already beat me to saying that
[02:31] * undersco2 reading through logs
[02:39] I just applied for the stephen hawking thing
[02:39] job
[02:39] shit for pay but seems like an adventure
[02:50] i got 98 episodes of systm backed up
[02:50] almost done
[02:55] There's a stephen hawking thing?
[02:58] yeah
[02:58] saw it on slashdot
[02:58] http://www.news.com.au/breaking-news/stephen-hawking-wants-you-to-find-him-a-voice/story-e6frfku0-1226233018306#ixzz1hxAssZOx
[03:01] Be careful, hawking's kind of a dick
[03:01] But I mean, he can't, you know, hit you
[03:01] I've heard rumors of such :(
[03:02] haha
[03:16] hahaha
[03:41] godane: where will these eps be when you're done uploading them?
[04:18] Just took a shipment of 1.4gb of geocities.
[04:18] yay
[04:19] woot
[04:20] Mmm geocities
[04:22] this is new geocities beyond what you currently had?
[04:26] http://www.archive.org/details/geocities-jcn-pack
[04:26] Yes
[04:27] is the geocities stuff in a standard enough format to diff the new stuff against it?
[04:27] and easily see what's old/new
[04:27] that's amazing
[04:29] No idea.
[04:29] But I have to move to the next thing.
[04:29] Although you're right, I should add a .txt.
[04:35] cat geocities.jcn-grab.tar.bz2 | bunzip2 -- | tar vtf - | awk '{print $3," ",$4," ",$5," ",$6}' > geocities.jcn-grab.txt
[04:39] i have a script i found and tweaked that pulls fanfiction.net stories and puts them in a folder named for the author; how do i sort those folders by first letter, for example books/author would go in books/A/author
[04:39] there are about 50k folders at this point so automated is preferred
[04:46] for i in {A..Z}; do echo mkdir "$i" && echo mv "$i"* "$i"/; done
[04:46] that should work for uppercase, separate thing needed with {a..z} for lowercase
[04:46] few typos in there too heh
[04:47] shopt -s nocaseglob
[04:48] apparently can be used to turn off case sensitivity in globbing, so that should help
[04:50] http://stackoverflow.com/questions/156916/case-insensitive-glob-on-zsh-bash
[04:50] http://www.archive.org/details/geocities-jcn-pack&reCache=1 Done
[04:50] nice
[04:50] undersco2: jfs vs ext? i've looked into jfs but just for performance and ended up with ext3/xfs. dunno about data integrity.
[05:04] jfs is a bit slower
[05:04] but it runs like a tank
[05:04] highly recommended
[05:05] chronomex: would you say using fossil / zfs-fuse would be too extreme, say if one's goal is data integrity?
[05:05] I used ZFS-fuse for a while
[05:05] it turned many of my symlinks into 0000-permission empty files.
[05:05] not recommended
[05:10] ouch
[05:11] yeah it's this weird good/bad tradeoff with that stuff, namely zfs and btrfs. like they're supposed to be super good in terms of hashing your data and detecting random bit flip errors, but their various implementations have quite a bit of instability in other areas
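A rough sketch of the author-folder sorting asked about at [04:39], as a Python alternative to the shell loop: it moves each folder into a bucket named after its first letter, case-insensitively. The "books" path and the flat one-folder-per-author layout are assumptions taken from the example in the question, not from the actual script.

    #!/usr/bin/env python
    # Sketch: move books/<author> into books/<first letter>/<author>, case-insensitively.
    # "books" and the one-folder-per-author layout are assumed from the question above.
    import os
    import shutil

    ROOT = "books"  # hypothetical top-level directory holding the ~50k author folders
    BUCKETS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | {"0-9"}

    for name in sorted(os.listdir(ROOT)):
        src = os.path.join(ROOT, name)
        # skip plain files and the bucket directories themselves (A, B, ..., 0-9)
        if not os.path.isdir(src) or name in BUCKETS:
            continue
        first = name[0].upper()
        bucket = first if first.isalpha() else "0-9"  # lump digits and punctuation together
        dest = os.path.join(ROOT, bucket)
        if not os.path.isdir(dest):
            os.makedirs(dest)
        shutil.move(src, os.path.join(dest, name))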
[05:16] I tried btrfs too
[05:16] accidentally filled up my disk
[05:16] couldn't delete anything because it had nowhere to copy-on-write the metadata
[05:16] The big thing now is how do I deal with these google groups.
[05:16] I think I have to move things around.
[05:16] I'm inclined to wait until January, let more people finish filtering in.
[05:16] SketchCow: google groups? hm. I think we should go after yahoo groups maybe.
[05:16] We're done, dude.
[05:16] We have them all.
[05:17] all the messages?
[05:17] don't be silly.
[05:17] No, no, the files.
[05:18] root@teamarchive-0:/2/FTP/swebb/GOOGLEGROUPS/groups-k# ls
[05:18] k19tcnh2-pages.zip kauri-files.zip kiska-pages.zip kongoni-pages.zip
[05:18] k2chocolategenova-pages.zip kauri-pages.zip kismamak-es-vedonok-csoportja-pages.zip konwersacje-jezykowe-aegee-krakow-files.zip
[05:18] kadez-pages.zip kaybedecek-bir-yok-ki-files.zip kjsce_electronics_2005-files.zip ko-se-slaze-da-je-marija-petrovic-naj-naj-razredna-pages.zip
[05:18] kahrmann-blog-files.zip kaybedecek-bir-yok-ki-pages.zip kjsce_electronics_2005-pages.zip kroatologija2010-pages.zip
[05:18] kaliann-pages.zip kenya-chess-forum-pages.zip kkaiengineer-pages.zip kup4ino-files.zip
[05:18] kaliteyonetimi2011-pages.zip kenya-welcome-files.zip knjizevnost-pages.zip kurdistan-dolamari-pages.zip
[05:18] kamash--pages.zip kenya-welcome-pages.zip knowledgeharvester-pages.zip kurniandikojaya-files.zip
[05:18] kami-roke-pages.zip khuyen-mai-vnn-pages.zip koan2008-files.zip
[05:19] kanisiusinfo-files.zip kik555-pages.zip koan2008-pages.zip
[05:19] katotakkyu-pages.zip kinhte34tsnt1992-1997-pages.zip kolna3rab-pages.zip
[05:19] see
[05:20] I see
[05:21] it'd be really bitchen to suck all the messages out of yahoo groups
[05:21] Archive: koan2008-files.zip
[05:21] --------- ---------- ----- ----
[05:21] 7148029 2011-07-18 00:35 koan2008/03 It's Easy.mp3
[05:21] Length Date Time Name
[05:21] 2535424 2011-07-18 00:35 koan2008/15 The Wind.mp3
[05:21] 217168 2011-07-18 00:35 koan2008/Archive Welcome sunset copy.jpg
[05:21] 219648 2011-07-18 00:35 koan2008/Koan2008 Goes Google Group Newletter & FAQs.doc
[05:21] --------- ---------- ----- ----
[05:21] 6156 2011-07-18 00:35 koan2008/__welcome.txt
[05:21] 958 2011-07-18 00:35 koan2008/1-a-cup-of-tea.txt
[05:21] Archive: koan2008-pages.zip
[05:21] Length Date Time Name
[05:21] 12507 2011-07-18 00:35 koan2008/10-the-last-poem-of-hoshin-commentary-enough-is-enough.txt
[05:21] 10228 2011-07-18 00:35 koan2008/11-the-story-of-shunkai-extra-innocence.txt
[05:21] 2310 2011-07-18 00:35 koan2008/12-happy-chinaman.txt
[05:21] 2175 2011-07-18 00:35 koan2008/13-a-buddha.txt
[05:21] 6841 2011-07-18 00:35 koan2008/14-muddy-road-commentary-mind-body-i-the-fascinating-body.txt
[05:23] Anyway, I really am inclined to make directories of these.
[05:23] Like an item with, say, 200 of these groups.
[05:24] But it'll still be a few thousand details I'm worried about.
[05:24] SOMEONE MENTIONED THIS IN AN ARTICLE
[05:28] i got 33gb of systm
[05:29] large and hd version
[05:29] large only back when that was the highest bitrate version there
[05:30] hd after episode 36
[05:45] HEY EVERYONE
[05:46] http://www.prelinger.com/dmapa.html
[05:47] I am unqualified
[05:55] is google groups planned to be mirrored?
[06:05] way to go, Jason, you're gonna apply right?
[06:07] I doubt that
[06:48] filemaker?
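On SketchCow's "an item with, say, 200 of these groups" idea above, a minimal sketch of what that batching might look like: pair each group's -pages.zip/-files.zip and split the result into directories of 200 groups, one per future item. The source path is the one from the ls paste; the destination and the item naming scheme are invented for illustration.

    #!/usr/bin/env python
    # Sketch only: batch the google groups zips into directories of ~200 groups each.
    # DEST and the directory naming are made up; adjust before running anything.
    import glob
    import os
    import shutil

    SRC = "/2/FTP/swebb/GOOGLEGROUPS/groups-k"   # path from the ls output above
    DEST = "/2/ITEMS"                            # hypothetical staging area
    BATCH = 200                                  # groups per item, per the discussion

    groups = {}
    for z in sorted(glob.glob(os.path.join(SRC, "*.zip"))):
        # kauri-files.zip and kauri-pages.zip both belong to the group "kauri"
        name = os.path.basename(z).rsplit("-", 1)[0]
        groups.setdefault(name, []).append(z)

    names = sorted(groups)
    for i in range(0, len(names), BATCH):
        item_dir = os.path.join(DEST, "googlegroups-k-%03d" % (i // BATCH))
        os.makedirs(item_dir)
        for g in names[i:i + BATCH]:
            for z in groups[g]:
                shutil.move(z, item_dir)   # use shutil.copy() instead to keep the originals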
[06:49] once was apple's user-facing db
[06:49] i know what it is. i'm just surprised it still sees use
[06:50] ah
[06:50] i've seen it before but i always have to look it up
[06:58] I'm WAY too busy
[06:58] I can't live in SF
[07:07] yeah... geography is my problem as well... plus a couple other requirements
[07:42] uploading ctrl alt chicken
[07:43] a very brief cooking show on rev3
[07:44] http://www.youtube.com/watch?v=IlBmbt8IVv4
[07:48] Working on my audio soundscape for Barcelona art group
[07:48] Oh yeah, 2:49am before a Jan 1 deadline
[07:48] You know it
[07:49] Should be 40 minutes long.
[07:49] 4 minutes done so far.
[08:14] i'm getting very slow upload speed to archive.org
[08:15] it's like 40 kbytes/s
[08:15] i should be getting 200 kbytes/s
[08:16] http://www.archive.org/download/jamendo-002843/02.mp3
[08:16] They do runs at night
[08:16] I can't use that track, but that track is awesome.
[08:27] SketchCow: so archive.org gets hammered at night?
[08:28] that's why it's slow
[08:33] i assumed you moved to sf when you got the IA position?
[08:34] or is that mostly telepresence
[08:34] nobody can make SketchCow stay in one place
[08:34] nobody
[08:38] nobody puts baby in corner >_>
[08:40] Nobody tells smash to stop smash
[08:41] Bam bam bam bam
[08:41] Soundscape up to 7 minutes out of 40.
[08:41] nobody makes google release a functioning copy of android
[08:41] Nobody makes rape truck stop raping or stop being a truck
[08:42] what the fuck is rape truck
[08:42] 2,620,000 results in google image search for rape truck
[08:42] 452 for "rape truck"
[08:42] these are all vans
[08:43] 28c3: How governments have tried to block Tor 1:25:40 https://www.youtube.com/watch?v=DX46Qv_b7F4
[08:44] are Google groups messages being downloaded?
[08:45] http://i46.photobucket.com/albums/f102/edou812/DONK.jpg
[08:45] archiveteam mobile is here
[08:45] awwwwyeah
[08:47] so since questions about google groups messages are going unanswered i take it they're not being grabbed
[08:48] that's a safe bet
[08:48] Not at the moment.
[08:48] The files and pages were
[08:49] There's no indication the messages are going away
[08:50] ah alright
[09:02] f
[09:02] oops
[09:03] where should one look for a complete archive of Usenet messages?
[09:03] I don't understand how complete my uni's news server is
[09:15] Nemo_bis: google groups is as complete of one as there can be i think
[09:15] Nemo_bis: there's been some stuff about people finding long lost usenet archives and sending them to google and they add them to google groups
[09:15] yep, but it's not easy to download
[09:21] yeah ;/
[09:23] maybe something using NNTP
[09:41] a little of the source data is on archive.org http://www.archive.org/details/utzoo-wiseman-usenet-archive
[09:51] yep, saw it, but it's very little
[09:51] it's also very old
[09:51] clearly worthless
[09:51] I tried the command line but couldn't find any tool which easily allows downloading all the content of all groups
[09:51] so I used a GUI program but it was very slow
[09:51] downloaded some 600 MiB in a night IIRC
[09:52] back when google groups first started they mentioned using some kind of publicly available CD-ROMs for some of it too; those may be around someplace
[09:52] hmm
[09:52] the stuff they got from dejanews is not available though
[09:53] http://tidbits.com/article/3229
[09:53] is this it?
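On the "maybe something using NNTP" idea above, since no ready-made tool for pulling every group off a news server turned up: an untested sketch using Python's stdlib nntplib to dump a single group into an mbox. The server and group names are placeholders; a real run would loop over the server's group list and throttle itself.

    #!/usr/bin/env python
    # Untested sketch: pull every available article from one newsgroup into an mbox
    # using only the stdlib (Python 2 era). Server and group names are placeholders.
    import mailbox
    import nntplib

    SERVER = "news.example.edu"   # your university's news server
    GROUP = "comp.misc"           # placeholder; iterate over server.list() for everything

    server = nntplib.NNTP(SERVER)
    resp, count, first, last, name = server.group(GROUP)
    box = mailbox.mbox(GROUP + ".mbox")

    for num in range(int(first), int(last) + 1):
        try:
            resp, number, msgid, lines = server.article(str(num))
        except nntplib.NNTPError:
            continue   # expired or cancelled article numbers are normal; skip them
        box.add("\n".join(lines) + "\n")

    box.flush()
    server.quit()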
[09:54] probably
[10:02] I suspect ftp.sterling.com is no longer active
[10:22] Nemo_bis: i know emacs has newsreader things and you can do anything in emacs lisp
[10:22] there's gotta be some fairly easily scriptable nntp clients/libraries. python or perl should have a bunch, probably also ruby
[10:22] looks like my uni's news server was started in 1996 only, no idea whether this makes it useful
[10:22] if not just some existing client
[10:23] too hard for me, I'm a newbie
[10:23] Nemo_bis: that's a place to start
[10:23] Nemo_bis: http://learnpythonthehardway.org
[10:23] aww
[10:23] get un-newbie'd :P
[10:24] sudo apt-get install unnewbieator
[10:25] also, what format should I look for? the software I tried saved in some sqlitedb format
[10:25] mbox is good
[10:25] maildir is good
[10:25] anything that can be turned into mbox or maildir is good
[10:30] nothing else?
[10:30] example: http://toolserver.org/~nemobis/tmp/italia.venezia.discussioni.sqlitedb
[10:30] can't remember the software name
[10:32] sqlite is at least super open. so you might have to write your own thing to turn it into an mbox but it's totally doable
[10:34] yes, it seems to store them in mbox format somehow, you can read emails in plain text
[10:34] wonder what's in the sqlite db then
[10:36] http://web.archive.org/web/20110514012530/http://groups.google.com/group/google.public.support.general/msg/d88f36fb3e2c0aac
[10:39] ironically you can't access this message on live google groups anymore, they've made the whole newsgroup non-public for some reason
[10:39] DFJustin: good link
[10:39] that's odd
[10:39] but yeah dejanews and various people in universities sending backups. there's some guy talking about his story of some usenet archives that made the news rounds not too long ago
[10:40] First 20 minutes finished!
[10:40] \o/
[10:40] woo!
[10:51] wait
[10:52] jan 1st is at least 48 hrs away or so
[10:52] Yep, I'm ahead of crunch
[10:52] Poor crunch
[10:52] ah, good
[10:53] well, around irc i've seen people talking about going out drinking tonight. either people like to go out drinking on thursdays or they're forgetting days.
[11:03] I go drinking wednesdays :)
[11:14] hahahahahah
[11:14] I sent in the beta to the people
[11:14] Someone just mailed me, basically saying "Please, do you have a phone number? There has been a great misunderstanding and we need to talk."
[11:14] Ha ha
[11:15] You know the best part? This project I made? Fucking awesome. Even if there's a huge misunderstanding and they want none of it, it's getting released.
[11:19] in about 5 hours i will have ctrl alt chicken uploaded
[11:52] SketchCow: project?
[17:56] it's great. you'll love it.
[20:13] :-\
[20:13] http://arstechnica.com/tech-policy/news/2011/12/godaddy-wins-and-loses-move-your-domain-day-over-sopa.ars
[20:13] (note that the story is just using DNS server info...)
[20:16] Coderjoe: I'm going to guess most of GoDaddy's customers do not give a shit about SOPA
[20:16] and it's not because they're malicious
[20:17] they just don't give a shit about all these cyberlaw things
[20:17] I'm not sure how to make them care, either
[20:17] there is speculation that the transfer-out count is low because they only looked at the one day and not the full counts since the whole thing kicked off.
[20:18] plus there is speculation that godaddy themselves transferred in a number of domains they were holding elsewhere.
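Going back to the sqlitedb question earlier in this stretch ("you might have to write your own thing to turn it into an mbox"), a sketch of what that conversion could look like. The table and column names are pure guesses since the newsreader's schema isn't known; inspect it with the sqlite3 shell's .schema command first and adjust the query.

    #!/usr/bin/env python
    # Guesswork sketch: copy raw articles out of a newsreader's sqlite database into
    # an mbox. The table/column names ("messages", "raw") are invented; check the real
    # schema with `sqlite3 italia.venezia.discussioni.sqlitedb .schema` and adjust.
    import mailbox
    import sqlite3

    DB = "italia.venezia.discussioni.sqlitedb"   # the toolserver example above
    OUT = "italia.venezia.discussioni.mbox"

    db = sqlite3.connect(DB)
    box = mailbox.mbox(OUT)

    for (raw,) in db.execute("SELECT raw FROM messages"):   # hypothetical schema
        # assumes each row holds one full article (headers + body) as text;
        # a BLOB column or odd encodings may need extra handling here
        box.add(raw)

    box.flush()
    db.close()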
[20:18] and then the whole slowdown of transfer auth code emails
[20:18] too many people don't care about what the government is doing, online or off
[20:19] I think that even after you factor that in it does not impact GoDaddy's numbers at all
[20:19] "that's boring. let's watch (sport)!"
[20:19] their reputation is already pretty shit, so
[20:19] we actually use GoDaddy as an SSL certificate authority
[20:19] at work
[20:20] we will not switch for the next few years, because SSL certificates are expensive as fuck
[20:20] how much?
[20:20] I don't know if this will be enough impetus to not renew with GoDaddy, but I'm going to bring it up in the next SCRUM
[20:20] bsmith093: for a wildcard domain? a ton
[20:20] er, wildcard certificate
[20:21] $200/year
[20:21] quick irc question: how do i register my nick on this server?
[20:23] oh, actually, never mind
[20:23] I thought we had our GoDaddy SSL certificate for a few years; turns out it expires September 9, 2012
[20:23] that's not that much of an investment
[20:25] bsmith093: efnet doesn't run registration services
[20:25] well that's annoying
[20:25] at all. even channels.
[20:26] though I am surprised by the occasional appearance of chanfix
[20:26] if a channel goes opless, that bot pops in and ops someone (probably the person with the oldest join time)
[20:27] http://www.efnet.org/chanfix/
[20:27] that's the only IRC service I know of, anyway
[20:28] anyway, http://fanficdownloader-fork.googlecode.com/files/fanficdownloader-fork0.0.1.7z is the thing i'm using for the fanfiction.net downloads. unpack it, cd into it, and run ./automate x01 or some other x file; those are the split link lists for the download scripts, and they're what the rest of the folders are. i'm doing x00, which is the first 200K stories
[20:29] i didn't write it, but i tweaked the config enough that it's faster to just distribute that
[20:29] have you verified that it actually grabs everything you want?
[20:30] multi-chapter stories, author profiles, etc
[20:30] also whether it records request/response metadata
[20:30] the stories anyway; the reviews can wait, and i can figure out calibre's rename-epub-by-tags later
[20:30] admittedly on fanfiction.net some of the response headers are not very useful
[20:30] multi-chapter yes, profiles no, but this is good for now. also no warc, AFAIK
[20:31] I'll look into making one that uses wget-warc
[20:31] great, please do
[20:43] SketchCow: hey any update on getting me an rsync slot to start uploading? i'm running out of space xD
[21:10] Ctrl_Alt_Chicken done: www.archive.org/details/Ctrl_Alt_Chicken
[21:20] hey guys when running the upload script for mobileme do i need to manually erase already uploaded files?
[21:22] no, it should skip them on future uploads
[21:22] they will stay around
[21:22] right right, but i meant, after i upload the files, i don't need them anymore, right?
[21:22] if so it'd be better to delete them once they're finished to make room for more
[21:23] or should i keep even the ones i've already uploaded?
[21:23] I'd prefer you didn't, but I will defer to someone more involved with memac
[21:24] cool mate thanks :) it's my first time so i don't really know how everything works yet
[21:24] aye
[21:49] i added time travel support to the slitaz-tank project
[21:50] this way the source can be touched to update its timestamp when the local clock is behind
[21:52] one alt topic: Friendliness - Understanding - Camaraderie - Kloseness
[21:53] not feeling it, arrith
[21:53] anyone have experience downloading an entire subreddit?
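On the "I'll look into making one that uses wget-warc" remark above, a hedged sketch of one shape that could take: shell out to a WARC-capable wget (the wget-warc patches, or wget 1.14+) so each fetch also writes a .warc.gz. The fanfiction.net URL pattern is the usual /s/<story id>/<chapter>/ form; the story id and output names here are made up.

    #!/usr/bin/env python
    # Sketch: fetch a fanfiction.net chapter through wget so a WARC is written too.
    # Requires a wget build with --warc-file support; ids and filenames are illustrative.
    import subprocess

    def grab_chapter(story_id, chapter):
        url = "http://www.fanfiction.net/s/%s/%d/" % (story_id, chapter)
        subprocess.check_call([
            "wget",
            "--warc-file=ffnet-%s-%d" % (story_id, chapter),   # -> ffnet-<id>-<n>.warc.gz
            "--output-document=ffnet-%s-%d.html" % (story_id, chapter),
            "--wait=1",   # stay polite to the site
            url,
        ])

    if __name__ == "__main__":
        grab_chapter("1234567", 1)   # hypothetical story id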
[21:53] keep trying :P
[21:53] chronomex: haha, tough room
[21:54] Toughest room there is
[21:55] ahh
[21:55] well dang, something i had fear'd:
[21:55] "
[21:55] "This script made me figure out that reddit lists only the last 1000 posts! Older posts are hidden. If you have a direct link to them, fine, otherwise they are gone :( So this script will only list 40 pages. This is a limitation of reddit.
[21:55] fuckers
[21:55] i remember there was some limit, forgot where. now how to get around it hmm
[21:57] "There's no limit on the number of saved links, just a limit on listing sizes. For some reason (probably performance) reddit will never return a list of more than 1000 things (whether they are submissions, posts, messages, or saved links). They are still there, but you can't see them, unless you remove (unsave, in the case of saved links) the most recent ones. You could also try sorting in another way, of course, but that will still return 1000 things at most.
[21:57] "
[21:57] http://www.reddit.com/r/help/comments/jhxmr/why_are_we_limited_in_the_number_of_links_we_can/
[21:58] could maybe links be scraped off search engines?
[21:59] and perhaps urls checked just by bruteforce to find real (existing) combinations? in that case the /jhxmr/ part
[21:59] although that would take ages
[21:59] yeah a guy suggested doing a search engine thing in /r/help, but google supposedly misses a bunch
[22:00] actually, bruteforcing the key is an interesting idea
[22:00] since i do have time on my side (i think)
[22:01] actually dangit, that's a big number. 36^5
[22:02] ~60 million possible ids to check, so a ton of requests even if each one is tiny
[22:03] a reddit mirror is starting to be more and more something i want to look into
[22:06] arrith: how did you get 36^5?
[22:06] http://rorr.im/about
[22:06] these guys do it by scraping the content as soon as it's up and just adding it to a rolling database
[22:07] gui77: seems to be 5 characters in the id, and the id seems to be made up of the lowercase alphabet plus either the digits 0-9 or 1-9. 26 letters in the alphabet plus 10 digits (0-9) gives 36.
[22:07] ah saw those in my search results but i assumed they wouldn't have a db up to dl.. i'll look at their site now
[22:07] arrith: of course, how did i not see that haha
[22:07] they open sourced some python modules to mirror content
[22:07] maybe some can be helpful
[22:07] that's an idea
[22:08] you could also think about asking reddit directly, they might very well be sympathetic to the cause and make it easier for us
[22:08] yeah, maybe for AT. me as just some guy, they don't seem to be into
[22:09] "Removing the 1000-pages crawl limit" http://groups.google.com/group/reddit-dev/browse_thread/thread/cbf05aa83dd03de5
[22:09] AT?
[22:09] gui77: archive team
[22:10] woo rorrim is in python
[22:10] ah ok
[22:11] well i suppose crawling rorr.im to fill in gaps is one option
[22:13] getting teh old stuff is teh issue, right?
[22:13] *the
[22:13] yep
[22:13] periodic new crawling can get all the new stuff
[22:16] just like for the ffnet archive
[22:17] there will be gaps and they probably will never be filled, but the near end will just keep filling up and crawling that will work beautifully
[22:17] yeah but there's a LOT of backlog on reddit
[22:18] well practically most of reddit as it is now wouldn't be scrapeable, right?
[22:22] what if you crawl existing reddit pages for more reddit links?
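For the 1000-entry listing cap being discussed above, a small sketch that shows the limit directly: page through a subreddit's /new.json listing with the "after" cursor until reddit stops returning results. The endpoint and parameters are reddit's public JSON listing API; the subreddit name is a placeholder, and the ~1000 figure is the one quoted in the thread, not something this snippet enforces.

    #!/usr/bin/env python
    # Sketch: walk /r/<subreddit>/new.json with the "after" cursor and count how many
    # submissions reddit will actually list -- per the discussion it tops out around
    # 1000 no matter how much older content exists. Subreddit name is a placeholder.
    import json
    import time
    import urllib2

    def list_new(subreddit):
        ids, after = [], None
        while True:
            url = "http://www.reddit.com/r/%s/new.json?limit=100" % subreddit
            if after:
                url += "&after=" + after
            req = urllib2.Request(url, headers={"User-Agent": "archiveteam-sketch/0.1"})
            listing = json.load(urllib2.urlopen(req))["data"]
            ids.extend(child["data"]["id"] for child in listing["children"])
            after = listing["after"]
            if not after:        # reddit stops paginating here, roughly 1000 entries in
                break
            time.sleep(2)        # stay well under the rate limit
        return ids

    print(len(list_new("archiveteam")))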
[22:24] gui77: that'd be a good bet
[22:25] any decent (worth having) content will be linked to multiple times in teh future, for sure
[22:25] *the
[22:25] it'd take time but if you did that, the only stuff you wouldn't have would be stuff that is old, unpopular, and never linked to. doesn't seem like a major loss
[22:26] goal is to get it all
[22:26] yeah, i know, but better than nothing i guess :/
[22:27] yeah
[22:27] ooh i have an idea
[22:27] reddit.com/random
[22:27] that gives you a random submission
[22:28] it'll redirect to something like reddit.com/tb/xxxxx where xxxxx is the 5-char code
[22:28] could you maybe just hit those repeatedly?
[22:29] haha
[22:29] yeah, that's probably a decent bit more feasible than enumerating the ids
[22:30] yeah. although i'm not sure if that'll return those of any subreddit or just those you're subscribed to
[22:30] oh btw you don't happen to know how i can check that the upload script finished uploading a profile for mobileme?
[22:32] ah can't say that i do, there might be a status page somewhere
[22:36] Hm, the whole reddit discussion reminds me. Reddit is kind of a url shortener: redd.it/nwcbb
[22:38] good point. similar tactics to archiving those would be a good idea to look into
[22:40] Haha I have a feeling they won't like it if they get a couple thousand requests per second from a single IP
[22:40] sleep 5
[22:40] ms
[22:40] oh no of course they won't. but if it's distributed, and slow
[22:41] the old stuff won't run away, we can take our time downloading
[22:41] But what if users submit faster than we download?
[22:41] soultcer: cross that bridge when we come to it :P
[22:41] although i think it'd be easier to get in touch with people over at reddit. helping us out somehow might be easier for them than having to deal with the added load
[22:41] soultcer: besides getting the older stuff, i see that as the most pressing issue
[22:41] soultcer: we don't need thousands of requests per second to scrape only new content, do we?
[22:42] "Give us a database dump or we (D)DoS you!" <-- :D
[22:42] From my perspective only the "reddit content id" to "target URL" mapping is of interest, but I guess you want stuff like subreddit, name of submitter, points, too?
[22:43] soultcer: yeah. i mean i'd be for taking what we can get
[22:44] maybe doing a faster crawl on the url level and then going back over at the page level
[22:44] did we back up berliOS?
[22:44] I think so
[22:44] http://archiveteam.org/index.php?title=BerliOS
[22:45] godane: they ended up announcing they weren't really going down but switching to new leadership. so it wasn't really necessary
[22:45] godane: although it hasn't been confirmed if there's anything that might be lost in the switch
[22:45] one thing i remembered is google reader keeps a kind of cache, so google reader might be something
[22:45] Anyone else have a problem with wiki sites and Firefox 9 where the wiki navigation bar is suddenly on the bottom of the page?
[22:48] man i really need to free some disk space :/
[22:48] soultcer: wiki's fine for me on ff9
[22:49] Weird
[22:50] tried clearing cache and all?
[22:50] Yup
[22:51] Might just be an addon of mine
[22:51] i figure the internet archive probably has some bits of reddit and it looks fine except nsfw pages seem to not have been crawled due to a "you must be at least eighteen to view this reddit" page
[22:53] anyone know how to gracefully stop the upload script?
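On the reddit.com/random idea from earlier in this stretch, a sketch of harvesting the redirect without following it: request /random over plain httplib and pull the base-36 id out of the Location header. Whether /random draws from every subreddit or only popular/subscribed ones was left open in the discussion, so treat this as a guess at the mechanism rather than a verified sampler.

    #!/usr/bin/env python
    # Sketch: collect random submission ids by requesting reddit.com/random and reading
    # the redirect's Location header instead of following it. Coverage of /random is
    # unverified, per the discussion above.
    import httplib
    import re
    import time

    def random_submission_id():
        conn = httplib.HTTPConnection("www.reddit.com")
        conn.request("GET", "/random", headers={"User-Agent": "archiveteam-sketch/0.1"})
        resp = conn.getresponse()
        location = resp.getheader("location", "")
        conn.close()
        m = re.search(r"/(?:tb|comments)/([0-9a-z]+)", location)
        return m.group(1) if m else None

    seen = set()
    for _ in range(10):          # tiny demo; a real crawl would run long and slow
        sid = random_submission_id()
        if sid:
            seen.add(sid)
        time.sleep(2)
    print(seen)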
[22:53] gui77: pretty sure you do ctrl-c and it finishes up the last thing it's working on, check the docs though
[22:53] arrith: doesn't ctrl-c immediately stop it?
[22:54] what docs? the wiki?
[22:55] the readme is very short and vague
[22:55] I read the reddit post there - the reddit person at the end says they'd much prefer to work out a way to get a data dump than to require use of a scraper
[22:56] dashcloud: yeah i suppose contacting them for a data dump would be a good first step
[22:57] how does the Archive Team do that kind of thing btw? just anyone or are there people that've done it before and are good at it?
[22:59] dashcloud: Which post? Can you give me a link?
[23:00] http://groups.google.com/group/reddit-dev/browse_thread/thread/cbf05aa83dd03de5
[23:00] some research guy asks about the 1000 post limit
[23:01] Thanks
[23:01] thank arrith - that's how I saw the link
[23:28] arrith: you were rigt about ctrl-c. it immediately stops it, but since it's rsync it'll just pick up where it left off!
[23:28] *right
[23:52] gui77: ah good
[23:53] yeah that guy is just some guy too, as in not part of an org or something that reddit might want to help out
[23:53] ohh
[23:53] "tl;dr If you need old data, we'd much rather work out a way to get you a data dump than to have you scrape. "
[23:55] one thing that doesn't cover is stuff that has been deleted from reddit