[00:11] i'm tracking down original diggnation hd episodes
[00:12] :-D
[01:33] http://archive.org/details/messmame
[02:51] ATZ0_: hmm?
[05:22] just got an email from puush regarding "important changes"
[05:22] will post more if anything of interest
[05:23] " * Stop offering permanent storage, and files will expire after not being accessed for:
[05:23] - Free users: 1 month
[05:23] - Pro users: up to 6 months"
[05:23] "How this will affect you after the 1st of August 2013:
[05:23] * We are going to start expiring files. At this point, any files which haven't been recently viewed by anyone will be automatically deleted after 1 month, or up to 6 months for pro users."
[05:23] and " * If you wish to grab a copy of your files before this begins, you can download an archive from your My Account page (Account -> Settings -> Pools -> Export)."
[05:23] seems a lot like imgur-style expiration to me, except on a more extreme scale
[05:24] if we were to start a project, it'd have to evolve into something like the urlteam project
[05:25] imgur expires posts? didn't know that
[05:26] it looks like puush uses incremental IDs
[05:28] yeah, they do after 6 months IIRC
[05:28] (re. imgur)
[05:29] * xmc nods
[05:34] it should be easy to archive what exists already and then over the long term archive what's uploaded afterwards
[05:35] provided it's done in urlteam style
[05:44] any thoughts?
[05:44] channel name's probably going to be hard to come up with
[05:45] #pushharder
[05:46] You know, we wouldn't have to archive everything initially..
[05:46] We'd just have to 'access' the file.
[05:52] good point
[05:52] but I don't think we'd be able to keep that up, depending on how many files they have
[05:57] probably better off in the long term just to grab anything we can
[05:57] in case they decide to make the limits even shorter if we went through with the plan of just accessing files
[05:59] (which would suck for both us and users)
[06:14] Besides, gobs of data is more fun
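A minimal sketch of the "just access the file" idea floated above, assuming puu.sh serves uploads at http://puu.sh/<id> with short, roughly sequential alphanumeric IDs (that is the channel's guess, not a confirmed URL scheme); the alphabet, ID length, and pacing below are all placeholders:

#!/usr/bin/env python
# Hypothetical sketch only: walk a range of puu.sh IDs and request each one
# so that it counts as "recently viewed" under the new expiry rules.
# The http://puu.sh/<id> URL shape and the ID alphabet are assumptions.
import itertools
import string
import time

import requests

ALPHABET = string.digits + string.ascii_letters  # assumed ID alphabet

def candidate_ids(length):
    """Yield every ID of the given length built from the assumed alphabet."""
    for combo in itertools.product(ALPHABET, repeat=length):
        yield ''.join(combo)

def touch(item_id):
    """HEAD one item; 200 means it exists (and has now been accessed)."""
    resp = requests.head('http://puu.sh/' + item_id,
                         allow_redirects=True, timeout=15)
    return resp.status_code == 200

if __name__ == '__main__':
    for item_id in candidate_ids(4):
        if touch(item_id):
            print(item_id)      # log live IDs for a later full grab
        time.sleep(0.1)         # be polite to the service

As the discussion notes, grabbing the files outright is probably the safer long-term plan; merely touching them only helps while the expiry window stays at a month or more.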
[07:12] Here is the shutdown notice - http://allthingsd.com/20130706/microsoft-quietly-shuts-down-msn-tv-once-known-as-webtv/
[07:12] Closes at the end of September
[07:12] from looking at a hosted site it should not be a problem to grab; we just need to build a username list
[07:13] This has pages going back to the late 90s I believe
[07:24] i'd be surprised if most of them weren't from the 90s
[07:24] hm :/
[07:25] Just looking at the markup for some of those sites tells a story. I like finding shit like this
[07:28] http://news.uscourts.gov/pacer-survey-shows-rise-user-satisfaction
[08:06] they wouldn't be paying so much otherwise?
[08:10] so how do you guys find sites, anyway
[08:11] by which i mean, how do you get a list of websites or whatever hosted on a given service
[08:15] Depends on the site - some, you just have to go with brute force
[08:15] Others, you can scrape and discover users easily
[08:18] ersi: what about stuff like webtv and free webhosts?
[08:18] i.e. arbitrary usernames
[08:18] no standard format, no links between pages
[08:19] i might put a page together about this on the wiki
[08:21] and i'm finding old ODP data (from about 2009, i needed it once and never deleted it) quite useful
[08:24] winr4r, ODP data?
[08:26] omf_: Open Directory Project
[08:27] they offer dumps of their data, about 1.9 gigabytes
[08:27] (uncompressed)
[08:39] http://archiveteam.org/index.php?title=MSN_TV
[08:41] ============
[08:41] The code has been prepared to run the hell out of.
[08:41] To help out, join #jenga
[08:41] Xanga has 8 days left, and we've yet to download 4 million users.
[09:07] http://archiveteam.org/index.php?title=Puu.sh
[09:07] wiki page for puu.sh now up
[09:16] winr4r: I'd look into searching through search engines and then Common Crawl. Then I'd go brute-forcing usernames
[09:17] ersi: are there any scrapable search engines?
[09:17] bing used to have a useful API, doesn't now
[09:18] What does the shape of the URLs you need look like, winr4r?
[09:18] I can pull stuff out of wayback
[09:20] underscor: anything from community.webtv.net or community-X.webtv.net for values of X = 1..4
[09:21] winr4r: I know there's an "old script" alard made for scraping Google
[09:21] somewhere
[09:40] http://farm8.staticflickr.com/7433/9228353492_aa9169e927_k.jpg
[09:40] Mmmmm
[09:40] Explosion-y goodness from July 4th
[10:10] http://paste.archivingyoursh.it/nuxopefaci.py
[10:10] wrote that just now; it takes a list of shortened URLs on stdin and outputs non-shortened URLs on stdout
[10:11] dunno if anyone else would find it useful, but there it is
[10:11] i had a big list of t.co URLs from a twitter search, needed to convert to real URLs
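The paste itself isn't reproduced in the log, but the behaviour described (shortened URLs on stdin, resolved URLs on stdout) amounts to something like the following reconstruction; it uses the requests library and is not the actual nuxopefaci.py:

#!/usr/bin/env python
# Reads shortened URLs (t.co, bit.ly, ...) from stdin, follows redirects,
# and prints the final destination URL to stdout, one per line.
import sys

import requests

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    try:
        # A HEAD request is usually enough; resp.url is the URL we ended
        # up at after all redirects were followed.
        resp = requests.head(url, allow_redirects=True, timeout=20)
        print(resp.url)
    except requests.RequestException as exc:
        sys.stderr.write('failed to resolve %s: %s\n' % (url, exc))

Some shorteners refuse HEAD requests; swapping in requests.get(url, stream=True, allow_redirects=True) follows the redirect chain without pulling down the body.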
[10:13] tbh i don't know if MSN TV will even merit using the warrior, rather than one guy with a fast pipe and wget
[10:19] oh shit
[10:20] apparently yeah some people link some disgusting shit what the fuck
[10:20] haha
[10:21] okay so one of the URLs to which a t.co link resolved was a google groups search with the query string "12+year+old+daughter+sex"
[10:22] am i fucked?
[10:24] i don't know how the fuck that showed up in a search for "webtv.net" on twitter, but it did
[10:25] Pack your things
[10:25] Before the vans arrive
[10:27] http://community-2.webtv.net/@HH!17!BF!62DA2CCF370F/TvFoutreach/COUNTDOWNTO666/
[10:39] Hi all, interested in the project. I was wondering if there is a standalone archiver? Got a load of Linux servers and a few Windows servers with a shed load of bandwidth going spare every month
[10:39] JackWS: xanga? yes, there is
[10:39] You can run the projects as standalone.
[10:40] https://github.com/ArchiveTeam/xanga-grab
[10:40] there aren't installation instructions there
[10:40] install pip (python), pip install seesaw, clone the project repo, run ./get-wget-lua.sh, then run-pipeline pipeline.py YOURNAMEHERE --concurrent amountofthreads --disable-web-server
[10:40] ...yeah i was about to say something like that :)
[10:41] Should write something for it on the wiki.
[10:41] the dependency instructions at https://github.com/ArchiveTeam/greader-grab will probably work just as well for xanga-grab
[10:41] Or just commit a README.md
[10:48] thanks for the info
[10:48] ill take a look
[10:48] :D
[10:58] http://www.faroo.com/hp/api/api.html
[10:58] well this exists
[11:00] Cool
[11:00] English, German and Chinese results
[11:00] How specific
[11:01] oh, scratch that, it seems it doesn't support "site:" queries
[11:14] Hi, having some issues compiling wget-lua for xanga-grab, anyone know what causes this issue? http://www.hastebin.com/yetorupupa.vbs
[11:20] googling the error i'm seeing that it happens when you don't have -ldl in LDFLAGS, but it's clear that you do
[11:32] Damn, I have to go afk. If anyone finds what the issue is, please pm me.
[11:34] Rainbow: yup, paging GLaDOS
[11:34] I have no idea when it comes to building
[11:34] my bad!
[13:00] \o/ Fixed it!
[13:00] Rainbow: how?
[13:01] A leftover lua install seemed to cause it
[13:01] Odd as it sounds
[13:01] ah :)
[13:01] WARNING:tornado.general:Connect error on fd 6: ECONNREFUSED
[13:26] Why would I be getting "New item: Step 1 of 8 No HTTP response received from tracker."?
[13:26] tracker down?
[13:26] working on my machine
[13:26] just not on my server :?
[13:28] you sure there's no outbound filtering?
[13:28] Should not be
[13:28] what ports is it wanting to use?
[13:29] not sure
[13:30] It just uses port 80
[13:31] testing it in a VM before I deploy it onto a few servers
[13:38] ah I got it
[13:38] [screen is terminating]
[13:38] when trying to run
[13:44] ah got it running
[13:44] was just being funny I think
[13:45] are you able to enable the graph on the webserver site? would be nice to see how much it is using
[15:26] sup
[15:33] hey
[15:46] how's it going, antomatic?
[15:47] ah, can't complain. Just sitting here staring at the Xanga leaderboard. :)
[16:56] do we have a tool that breaks a megawarc back up into the original warcs?
[16:59] db48x: https://pypi.python.org/pypi/Warcat/ ?
[16:59] not quite
[16:59] it can extract records from a warc (or a megawarc)
[17:00] but the original warc was a series of related records
[17:01] metadata about the process used to create the warc, each request as it was made, and each response received
[17:01] i'll pass on that question, then
[17:03] the warc viewer is pretty good
[17:03] but I don't want to use wget to spider a site being served up by the warc viewer's proxy server
[17:05] warc-to-zip is interesting, but alas it requires byte offsets
[17:06] I can get the start addresses of the response records, but not their lengths
[17:06] db48x: https://github.com/alard/megawarc "megawarc restore megafoo.warc.gz"
[17:07] iirc it creates a file bit-for-bit identical to the original source
[17:07] is that what you're looking for?
[17:07] ah, that sounds promising
[17:12] I will have to update the description on the warc ecosystem page
[17:21] ooh, warcproxy
[17:21] I was meaning to write that
[17:21] cool that someone else did!
[17:21] now to bend it to my will
[17:24] heh
[17:39] well, not now, maybe later.
[17:47] xmc: thanks, btw
[17:48] that turned out to be precisely what I needed
[17:48] my pleasure
[17:48] excellent
[17:49] we ought to get something set up so that people can reclaim their data by putting in the site url
[17:50] not a bad idea at all
[17:50] hmm, there are 444 of these megawarcs; I had to download all the idx files to find the one containing the site I wanted
[17:51] not sure I have 22 TB just lying around
[17:52] @_@
[17:52] might be more reasonable to patch up the megawarc program to submit range requests to the Archive and reassemble that way
[17:53] that's what warc-to-zip does
[17:53] you give it the url of a warc and a byte range, and it gives you a zip
[17:53] ah cool
[17:54] looks like the json files have the best information
[17:54] {"target":{"container":"warc","offset":0,"size":29265692},"src_offsets":{"entry":0,"data":512,"next_entry":29266432},"header_fields":{"uid":1001,"chksum":0,"uname":"","gname":"","size":29265692,"devmajor":0,"name":"20130526205026/posterous.com-vividturtle.posterous.com-20130522-061616.warc.gz","devminor":0,"gid":1001,"mtime":1369567781.0,"mode":420,"linkname":"","type":"0"},"header_base64":"MjAxMzA1MjYyMDUwMjYvcG9zdGVyb3VzLmNvbS12aXZpZHR1
[18:02] yes, very nice
[18:02] using the offset and offset+size as the byte range I get a very nice zip
[18:03] so it would just be a matter of parsing the filenames from the json indexes to get the site urls
[18:04] fantastique
[18:06] precisimo
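A sketch of doing that by hand rather than through warc-to-zip: read the megawarc's .json index (assumed here to be newline-delimited, one entry per packed file, as the sample entry above suggests), find the entry whose original filename mentions the site you want, then fetch just those bytes with an HTTP Range request. The item URL and index filename below are placeholders, not real identifiers:

#!/usr/bin/env python
# Pull a single original warc.gz back out of a megawarc using the offset
# and size recorded in the .json index, via an HTTP Range request.
import json

import requests

# Placeholder: the actual .megawarc.warc.gz download URL on archive.org
MEGAWARC_URL = 'https://archive.org/download/EXAMPLE/EXAMPLE.megawarc.warc.gz'

def find_entry(index_path, site):
    """Return the first index entry whose packed filename mentions `site`."""
    with open(index_path) as index:
        for line in index:
            entry = json.loads(line)
            if site in entry['header_fields']['name']:
                return entry
    return None

def fetch_original(entry, out_path):
    """Download only the byte range holding the original warc.gz."""
    offset = entry['target']['offset']
    size = entry['target']['size']
    headers = {'Range': 'bytes=%d-%d' % (offset, offset + size - 1)}
    resp = requests.get(MEGAWARC_URL, headers=headers, timeout=120)
    resp.raise_for_status()
    with open(out_path, 'wb') as out:
        out.write(resp.content)

if __name__ == '__main__':
    entry = find_entry('EXAMPLE.megawarc.json', 'vividturtle.posterous.com')
    if entry is not None:
        fetch_original(entry, 'vividturtle.posterous.com.warc.gz')

Parsing header_fields.name across all 444 indexes would also give the site-to-megawarc mapping that the "reclaim your data by site url" idea above needs, without downloading the 22 TB of megawarcs themselves.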
[19:45] I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
[19:46] Heh, my WebTV Philips/Magnavox client is in my recycling
[21:50] [14:45:46.746] I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
[21:50] we're aware
[21:51] also, puu.sh has now been added to the navbox
[21:51] (channel for those that weren't awake at 4 AM CDT: #pushharder)
[22:16] posterous still remains on the tracker and in warriors for whatever reason
[22:17] what gives, if I can ask?