[00:02] Has anyone contacted this Jason Scott fellow about his 4chan archive?
[00:03] He claimed to have 10 million threads back in 2009
[00:03] http://ascii.textfiles.com/archives/2083
[00:03] he's here as SketchCow
[00:03] oh
[00:03] as far as I know he was persuaded not to spread it for the time being, but eventually the Internet Archive will get
[00:03] I understand
[00:03] it
[00:04] I also have a newer 109 GB snapshot of images and posts
[00:04] I just shot the guy who runs rbt.asia an email about his archive
[00:04] he archives /w/, /soc/, /mu/, /cgl/, and /g/
[00:05] I've got some /mu/ threads from 2011 backed up
[00:05] about a gig. Unfortunately in HTML
[00:05] but they work
[00:05] there was some talk recently about the chanarchive.org collection too, don't recall the outcome
[00:07] Is there a suggested compression setting scheme for 7zip anywhere?
[00:07] I have ultra on, but I see there are other options.
[00:09] also set LZMA2; you can play with the dictionary, word, and block size, but ultra normally does the best imho.
[00:12] alright
[00:12] I have hundreds of GB of 4chanarchive in httrack format
[00:13] cool
[00:14] I've been backing up a few MediaWikis in the last few days
[00:19] heh, Google Reader does not canonicalize https:// and http:// feed URLs for WordPress blogs
[00:19] that was confusing
[00:19] guess we'll have to grab everything ;)
[02:52] is there a wget-lua with gzip support?
[02:52] something like https://github.com/kravietz/wget-gzip
[02:54] or https://github.com/ptolts/wget-with-gzip-compression
[02:57] which for some reason is forked off a 10-year-old wget :/
[03:26] someone have a good channel name for the greader grab? :)
[03:36] also, can someone fork https://github.com/ludios/greader-grab into ArchiveTeam and give `github.com/ivan` write access?
[03:38] also, is there a convenient existing thing that could be used for collecting .opml files and feed URLs from users?
[03:38] perhaps a pastebin under archiveteam control
[03:42] http://paste.archivingyoursh.it/
[03:43] so i'm getting g4 confessions of a booth babe
[03:48] DFJustin: thanks
[04:07] ivan`: done
[04:07] https://github.com/archiveteam/greader-grab
[04:09] "howdoireadgoogle grants 1 user push access to 1 repository" heh thanks :)
[04:12] I should experiment with my own universal-tracker instance, right?
[04:13] I can set up a test tracker for you
[04:13] Just give me a few items
[04:14] thanks, will let you know when I have something useful
[08:08] SketchCow, earlier you talked about a 'hint' field in metadata so you can tell IA how large something is going to be
[08:30] x-archive-size-hint:19327352832
[08:33] Do I have to send that in the header or is that a metadata.csv thing?
[08:35] it would have to be in the header; all the metadata.csv stuff is of the form x-archive-meta-xxxx
[08:35] and if you're uploading multiple files it needs to be in the first request that creates the bucket
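As a concrete sketch of the above, here is roughly what that first, bucket-creating upload could look like using Python's requests library. The access keys, item identifier, and filename are placeholders, and x-archive-meta-mediatype is only an example of the metadata.csv-style headers mentioned in the log:

```python
# Rough sketch: first PUT to IA's S3-like endpoint, creating the bucket
# and hinting its eventual total size up front. Keys, identifier, and
# filename are placeholders, not a real item.
import requests

ACCESS_KEY = "YOUR_IA_ACCESS_KEY"   # placeholder
SECRET_KEY = "YOUR_IA_SECRET_KEY"   # placeholder
item = "example-archiveteam-item"   # placeholder bucket/identifier

with open("grab.warc.gz", "rb") as f:
    resp = requests.put(
        f"https://s3.us.archive.org/{item}/grab.warc.gz",
        data=f,
        headers={
            "authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
            "x-amz-auto-make-bucket": "1",         # this first PUT creates the bucket
            "x-archive-size-hint": "19327352832",  # expected total in bytes (18 GiB)
            # metadata.csv-style fields travel as x-archive-meta-* headers:
            "x-archive-meta-mediatype": "web",
        },
    )
resp.raise_for_status()
```

As noted above, the hint only makes sense on the request that creates the bucket; later PUTs into the same item don't need it.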
[09:04] is #livingandloving too obscure a reference for the channel name? ;)
[09:21] http://www.archiveteam.org/index.php?title=Google_Reader
[09:53] ivan`, what does your wget --version look like
[09:53] mine has gzip support via libz
[09:53] unless that only works for making the gz warc files
[09:53] https://twitter.com/at_warrior/status/336052787404238848 let's get this thing on the road
[09:54] GLaDOS: oh you're alive!
[09:55] hi
[09:55] 1. we need a channel?
[09:57] 2. the takeout gives you something not listed on that site
[09:57] i guess you would want the subscriptions.xml
[09:58] {"files":[{"name":"subscriptions.xml","size":3484}]}
[09:58] Don't know if that worked either...
[09:58] https://twitter.com/at_warrior/status/336058209263562752 fixed
[09:59] lol hmmm
[09:59] yeah but have you done an upload?
[09:59] {"files":[{"name":"subscriptions.xml","size":3484}]} << that's not a helpful return page
[09:59] I never really used reader
[10:00] ivan`: ples asplen
[10:00] when I upload, that's what I get back
[10:00] Even a "Thanks for your upload" would be better.
[10:02] So yeah, we are asking users for OPML files, yet Google Takeout doesn't provide those.
[10:03] omf_: right, only the warc
[10:03] Smiley: if you run all of the JavaScript, it's friendlier
[10:03] I'll try to fix it for the other case, it was a rush job
[10:04] ivan`: did the second time
[10:04] and still got that page back D:
[10:04] Oh, it didn't load the rest, weird.
[10:04] Ah ok, that's better :)
[10:05] GLaDOS: "I have always believed that technology should do the hard work—discovery, organization, communication—so users can do what makes them happiest: living and loving, not messing with annoying computers!" https://investor.google.com/corporate/2012/ceo-letter.html
[10:05] yeah, I need to serve all the JavaScript from my domain
[10:05] Smiley: takeout provides the OPML file inside the .zip
[10:05] hue
[10:06] 7 json files + 1 xml
[10:06] I like Google Reader, still use it every day... suppose I really need to hop to an alternative sooner rather than later, though...
[10:07] ivan`: not for me :/
[10:07] the .xml is the OPML file
[10:07] is it missing in your zip?
[10:07] i have the .xml file
[10:08] nowhere does it say anything about opml.... D:
[10:08] don't expect readers to read.
[10:08] :/
[10:15] For the URL collector, after you upload a file, is the process done?
[10:15] yes, I'll go mention this
[10:16] Please :-)
[10:16] I added the spokenword.org archive of RSS feeds
[10:16] thanks
[10:21] I've started backing up submissions to my machine; I have 5 so far
[10:23] you can get the entire list of feeds if you find a way to crawl https://www.google.com/reader/directory/search?q=english
[10:23] (and other keywords)
[10:24] nice find
[10:24] there's also the recommendations feature
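Nobody in the log posts an actual crawler for that directory endpoint, so the following is only a guess at the shape of one: the start paging parameter, the page size, and the feed/http... pattern in the response are all assumptions that would need checking against a real (and probably logged-in) session.

```python
# Speculative sketch of a directory crawl; the paging parameter,
# page size, and response format are assumptions, and the endpoint
# likely requires an authenticated Google session in practice.
import re
import time
import urllib.request

KEYWORDS = ["english", "news", "music"]           # seed keywords, extend freely
FEED_RE = re.compile(r'feed/(https?://[^"\s]+)')  # assumed to appear in the markup

feeds = set()
for keyword in KEYWORDS:
    for start in range(0, 400, 20):  # assumed 20 results per page
        url = ("https://www.google.com/reader/directory/search"
               f"?q={keyword}&start={start}")
        with urllib.request.urlopen(url) as resp:
            page = resp.read().decode("utf-8", "replace")
        hits = FEED_RE.findall(page)
        if not hits:
            break            # no more results for this keyword
        feeds.update(hits)
        time.sleep(1)        # stay polite to the endpoint

for feed in sorted(feeds):
    print(feed)
```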
[10:26] we really need a channel though - #googleread? #donereading?
[10:26] I like #donereading
[10:27] #readingisfundamental ?
[10:27] :)
[10:28] just call it #googleburner
[10:30] #fahrenheit451
[10:30] donereading is pretty clever
[10:31] and subdued instead of irritated
[10:32] * ivan` updates the wiki
[12:52] music.aol.com and www.theboot.com have been backed up. The other AOL music sites are in progress
[13:18] So Rapidshare might be closing down soon(ish)
[13:32] heh
[16:38] Not sure if this is in Archive Team's boundaries, but there is a building, a museum, which is slated to be demolished. It's only 12 years old, so it's a little unusual. I'm offering the architectural AutoCAD drawings and specifications. http://mafa.noneinc.com The ReadMe.nfo has some links to newspaper articles on the issue.
[16:45] if you've got the items in your possession, you should reach out to SketchCow
[16:45] I didn't realize the items were already uploaded to that site
[16:51] I am grabbing those few files now
[16:51] Yes, it's that 23 MB RAR file. Haven't heard of any group working to save architectural drawings. Typically there are copyright concerns, as with everything else, but as this is a building which is to be demolished prematurely, I'm wondering if it makes for a good example to see if anyone wants to get into the conversation.
[16:52] hey, are we backing up tumblr yet?
[16:53] no
[16:57] asie: not enough space
[16:57] plain and simple
[16:57] hneio: at least part of it, we managed a part of geocities
[16:57] and i have this feeling tumblr will go through the same fate
[16:57] geocities web 2.0
[17:00] the google reader grab will grab tumblr's text content, heh
[17:01] then someone can buy the exabytes of disks needed to store all the porn
[17:06] lol
[17:32] having a few upload problems... server keeps hitting max connections.
[17:33] blueskin: hmmmmm, we are too successful.
[17:33] Just leave it running and it'll eventually go through.
[17:39] well, at least it shows plenty of people working, indeed.
[17:40] Smiley: will the upload server remain up for some days after the deadline?
[17:46] hneio: upload server is ours
[17:46] it'll remain there until there's nothing left to upload.
[17:51] archive all the things!
[17:52] Indeed.
[18:21] I know the rsync server is swamped due to Formspring.
[18:22] I took a shot at implementing an exponential backoff with failed attempts.
[18:22] I posted a pull request on seesaw.
[18:22] But my naive attempt doesn't work with concurrent items.
[18:25] Maybe someone else will find this helpful, or at least a step in the right direction. (A sketch of the idea is at the end of this log.)
[18:28] how about raising the connection limit? ;)
[19:13] i found the gwbush interview with zdtv
[19:13] or techtv
[19:14] this was before the election
[23:10] hi folks, I'd like to remind everyone that there is an AOL archiving project (yes, that AOL, the one you used to dial into) in the works, and we could really use your help in #aohell. Happy to answer your questions here or there.
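Closing the loop on the exponential-backoff discussion from [18:22]: the pull request itself isn't quoted in this log, but the underlying pattern is simple enough to sketch. The rsync flags, delay base, retry cap, and function name below are illustrative choices, not the actual seesaw patch:

```python
# Sketch of exponential backoff with jitter around an rsync upload.
# The command line and limits are illustrative, not the seesaw patch.
import random
import subprocess
import time

def upload_with_backoff(src, dest, max_attempts=8, base_delay=30, cap=3600):
    for attempt in range(max_attempts):
        result = subprocess.run(["rsync", "-av", "--partial", src, dest])
        if result.returncode == 0:
            return True
        # sleep base_delay * 2^attempt seconds, capped, with jitter so
        # many stalled clients don't all retry at the same moment
        delay = min(cap, base_delay * (2 ** attempt))
        time.sleep(delay * random.uniform(0.5, 1.5))
    return False
```

The jitter matters as much as the doubling: with many warriors hammering one rsync server, randomizing each retry time keeps the stalled clients from all reconnecting in the same second and hitting the connection limit again.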