#archiveteam 2013-07-07,Sun


Time Nickname Message
00:11 🔗 godane i'm tracking down original diggnation hd episodes
00:12 🔗 godane :-D
01:33 🔗 SketchCow http://archive.org/details/messmame
02:51 🔗 balrog ATZ0_: hmm?
05:22 🔗 wp494 just got an email from puush regarding "important changes"
05:22 🔗 wp494 will post more if anything of interest
05:23 🔗 wp494 " * Stop offering permanent storage, and files will expire after not being accessed for:
05:23 🔗 wp494 - Free users: 1 month
05:23 🔗 wp494 - Pro users: up to 6 months"
05:23 🔗 wp494 "How this will affect you after the 1st of August 2013:
05:23 🔗 wp494 * We are going to start expiring files. At this point, any files which haven't been recently viewed by anyone will be automatically deleted after 1 month, or up to 6 months for pro users."
05:23 🔗 wp494 and " * If you wish to grab a copy of your files before this begins, you can download an archive from your My Account page (Account -> Settings -> Pools -> Export)."
05:23 🔗 wp494 seems a lot like imgur-style expiration to me, except on a more extreme scale
05:24 🔗 wp494 if we were to start a project, it'd have to evolve into something like the urlteam project
05:25 🔗 xmc imgur expires posts? didn't know that
05:26 🔗 winr4r it looks like puush uses incremental IDs
05:28 🔗 wp494 yeah, they do after 6 months IIRC
05:28 🔗 wp494 (re. imgur)
05:29 🔗 * xmc nods
05:34 🔗 wp494 it should be easy to archive what exists already and then over the long-term archive what's uploaded afterwards
05:35 🔗 wp494 provided it's done in urlteam style
05:44 🔗 wp494 any thoughts?
05:44 🔗 wp494 channel name's probably going to be hard to come up with
05:45 🔗 GLaDOS #pushharder
05:46 🔗 GLaDOS You know, we wouldn't have to archive everything initially..
05:46 🔗 GLaDOS We'd just have to 'access' the file.
05:52 🔗 wp494 good point
05:52 🔗 wp494 but I wouldn't think we'd be able to keep it up depending on how many files they have
05:57 🔗 wp494 probably better off in the long term just to grab anything we can
05:57 🔗 wp494 in case they decide to make the limits even shorter if we were to go through with the plan of just accessing
05:59 🔗 wp494 (which would suck for both us and users)
06:14 🔗 underscor Besides, gobs of data is more fun
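
A minimal sketch of the keep-alive idea GLaDOS floats above: fetch each known puu.sh URL so its last-accessed timestamp resets. The URL list filename, User-Agent string, and one-second delay are illustrative assumptions, not anything the project settled on.

```python
#!/usr/bin/env python
# Sketch only: "access" each known puu.sh file so it isn't expired for
# inactivity. Assumes a plain-text list of puu.sh URLs, one per line;
# the default filename and User-Agent are made up for illustration.
import sys
import time
import urllib2

def touch(url):
    """GET the file so the server sees a recent access; return the status."""
    req = urllib2.Request(url, headers={'User-Agent': 'keep-alive sketch'})
    try:
        resp = urllib2.urlopen(req, timeout=30)
        resp.read(1)  # read a byte so the file is actually served
        return resp.getcode()
    except urllib2.HTTPError as e:
        return e.code
    except urllib2.URLError:
        return None

if __name__ == '__main__':
    url_list = sys.argv[1] if len(sys.argv) > 1 else 'puush-urls.txt'
    for line in open(url_list):
        url = line.strip()
        if url:
            print url, touch(url)
            time.sleep(1)  # be gentle with the site
```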
07:12 🔗 omf_ Here is the shutdown notice - http://allthingsd.com/20130706/microsoft-quietly-shuts-down-msn-tv-once-known-as-webtv/
07:12 🔗 omf_ Closes at the end of september
07:12 🔗 omf_ from looking at a hosted site it should not be a problem to grab; we just need to build a username list
07:13 🔗 omf_ This has pages going back to the late 90s I believe
07:24 🔗 winr4r i'd be surprised if most of them weren't from the 90s
07:24 🔗 winr4r hm :/
07:25 🔗 omf_ Just looking at the markup for some of those sites tells a story. I like finding shit like this
07:28 🔗 poqpoq http://news.uscourts.gov/pacer-survey-shows-rise-user-satisfaction
08:06 🔗 Nemo_bis they wouldn't be paying so much otherwise?
08:10 🔗 winr4r so how do you guys find sites, anyway
08:11 🔗 winr4r by which i mean, how do you get a list of websites or whatever hosted on a given service
08:15 🔗 ersi Depends on the site - some, you just have to go with brute force
08:15 🔗 ersi Others, you can scrape and discover users easily
08:18 🔗 winr4r ersi: what about stuff like webtv and free webhosts?
08:18 🔗 winr4r i.e. arbitrary usernames
08:18 🔗 winr4r no standard format, no links between pages
08:19 🔗 winr4r i might put a page together about this on the wiki
08:21 🔗 winr4r and i'm finding old ODP data (from about 2009, i needed it once and never deleted it) quite useful
08:24 🔗 omf_ winr4r, ODP data?
08:26 🔗 winr4r omf_: Open Directory Project
08:27 🔗 winr4r they offer dumps of their data, about 1.9 gigabytes
08:27 🔗 winr4r (uncompressed)
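
One way ODP data can be mined for sites on a given host, roughly as winr4r describes: scan the uncompressed content dump for ExternalPage entries. This is a sketch; the dump filename (content.rdf.u8) and the ExternalPage attribute are from memory and may need checking against the actual dump.

```python
#!/usr/bin/env python
# Sketch: pull every ODP-listed URL that mentions a given host out of the
# uncompressed content dump, scanning line by line instead of parsing the
# full RDF. Element/attribute names are assumptions from memory.
import re
import sys

EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]+)"')

def urls_for_host(dump_path, host):
    with open(dump_path) as f:
        for line in f:
            m = EXTERNAL_PAGE.search(line)
            if m and host in m.group(1):
                yield m.group(1)

if __name__ == '__main__':
    # e.g. python odp_urls.py content.rdf.u8 webtv.net
    for url in urls_for_host(sys.argv[1], sys.argv[2]):
        print url
```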
08:39 🔗 winr4r http://archiveteam.org/index.php?title=MSN_TV
08:41 🔗 GLaDOS ============
08:41 🔗 GLaDOS The code has been prepared to run the hell out of.
08:41 🔗 GLaDOS To help out, join #jenga
08:41 🔗 GLaDOS Xanga has 8 days left, and we've yet to download 4 million users.
09:07 🔗 wp494 http://archiveteam.org/index.php?title=Puu.sh
09:07 🔗 wp494 wiki page for puu.sh now up
09:16 🔗 ersi winr4r: I'd look into searching through search engines and then Common Crawl. Then I'd go brute-forcing usernames
09:17 🔗 winr4r ersi: are there any scrapable search engines?
09:17 🔗 winr4r bing used to have a useful API, doesn't now
09:18 🔗 underscor What does the shape of the urls you need look like, winr4r?
09:18 🔗 underscor I can pull stuff out of wayback
09:20 🔗 winr4r underscor: anything from community.webtv.net or community-X.webtv.net for values of X = 1..4
09:21 🔗 ersi winr4r: I know there's an "old script" alard made for scraping Google
09:21 🔗 ersi somewhere
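
A rough sketch of the Wayback route underscor offers: ask the public CDX API for every captured URL under the community.webtv.net hosts winr4r lists. The query parameters here reflect my reading of that API and may need adjusting.

```python
#!/usr/bin/env python
# Sketch: list captured URLs for the community(-N).webtv.net hosts via the
# Wayback Machine's CDX API. Parameter names are my best understanding of
# that API, not something confirmed in the channel.
import urllib
import urllib2

HOSTS = ['community.webtv.net'] + ['community-%d.webtv.net' % i for i in range(1, 5)]

def cdx_urls(host):
    params = urllib.urlencode({
        'url': host + '/*',    # everything under this host
        'fl': 'original',      # return only the original URL column
        'collapse': 'urlkey',  # collapse repeat captures of the same URL
    })
    resp = urllib2.urlopen('http://web.archive.org/cdx/search/cdx?' + params, timeout=120)
    for line in resp.read().splitlines():
        if line.strip():
            yield line.strip()

if __name__ == '__main__':
    for host in HOSTS:
        for url in cdx_urls(host):
            print url
```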
09:40 🔗 underscor http://farm8.staticflickr.com/7433/9228353492_aa9169e927_k.jpg
09:40 🔗 underscor Mmmmm
09:40 🔗 underscor Explosion-y goodness from July 4th
10:10 🔗 winr4r http://paste.archivingyoursh.it/nuxopefaci.py
10:10 🔗 winr4r wrote that just now, takes list of shortened URLs on stdin, outputs non-shortened URLs on stdout
10:11 🔗 winr4r dunno if anyone else would find it useful, but there it is
10:11 🔗 winr4r i had a big list of t.co URLs from a twitter search, needed to convert to real URLs
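
The paste above is winr4r's actual script; as a stand-in, a minimal sketch of the same idea (shortened URLs on stdin, resolved URLs on stdout) could be as simple as letting urllib2 follow the redirects and reporting where it landed.

```python
#!/usr/bin/env python
# Sketch, not winr4r's paste: read shortened URLs (t.co etc.) from stdin,
# follow redirects with urllib2, and print the final URL to stdout.
import sys
import urllib2

def resolve(url):
    try:
        # urllib2 follows 301/302 redirects itself; geturl() is the URL we
        # finally ended up at.
        return urllib2.urlopen(url, timeout=20).geturl()
    except Exception:
        return url  # print unresolvable URLs unchanged

if __name__ == '__main__':
    for line in sys.stdin:
        url = line.strip()
        if url:
            print resolve(url)
```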
10:13 🔗 winr4r tbh i don't know if MSN TV will even merit using warrior, rather than one guy with a fast pipe and wget
10:19 🔗 winr4r oh shit
10:20 🔗 winr4r apparently yeah some people link some disgusting shit what the fuck
10:20 🔗 ersi haha
10:21 🔗 winr4r okay so one of the URLs to which a t.co link resolved was a google groups search with the query string "12+year+old+daughter+sex"
10:22 🔗 winr4r am i fucked?
10:24 🔗 winr4r i don't know how the fuck that showed up in a search for "webtv.net" on twitter, but it did
10:25 🔗 ersi Pack your things
10:25 🔗 ersi Before the vans arrive
10:27 🔗 winr4r http://community-2.webtv.net/@HH!17!BF!62DA2CCF370F/TvFoutreach/COUNTDOWNTO666/
10:39 🔗 JackWS Hi all, interested in the project. I was wondering if there is a standalone archiver? Got a load of Linux servers and a few Windows servers with a shed load of bandwidth going spare every month
10:39 🔗 winr4r JackWS: xanga? yes, there is
10:39 🔗 GLaDOS You can run the projects as standalone.
10:40 🔗 winr4r https://github.com/ArchiveTeam/xanga-grab
10:40 🔗 winr4r there aren't installation instructions there
10:40 🔗 GLaDOS install pip (python), pip install seesaw, clone the project repo, run ./get-wget-lua.sh, then run-pipeline pipeline.py YOURNAMEHERE --concurrent amountofthreads --disable-web-server
10:40 🔗 winr4r ...yeah i was about to say something like that :)
10:41 🔗 GLaDOS Should write something for it on the wiki.
10:41 🔗 winr4r the dependency instructions at https://github.com/ArchiveTeam/greader-grab will probably work just as well for xanga-grab
10:41 🔗 ersi Or just commit a README.md
10:48 🔗 JackWS thanks for the info
10:48 🔗 JackWS ill take a look
10:48 🔗 winr4r :D
10:58 🔗 winr4r http://www.faroo.com/hp/api/api.html
10:58 🔗 winr4r well this exists
11:00 🔗 ersi Cool
11:00 🔗 BlueMax English, German and Chinese results
11:00 🔗 BlueMax How specific
11:01 🔗 winr4r oh, scratch that, it seems it doesn't support "site:" queries
11:14 🔗 Rainbow Hi, having some issues compiling wget-lua for xanga-grab, anyone know what causes this issue? http://www.hastebin.com/yetorupupa.vbs
11:20 🔗 winr4r googling the error i'm seeing that it happens when you don't have -ldl in LDFLAGS, but it's clear that you do
11:32 🔗 Rainbow Damn, I have to go afk. If anyone finds what the issue is, please pm me.
11:34 🔗 winr4r Rainbow: yup, paging GLaDOS
11:34 🔗 GLaDOS I have no idea when it comes to building
11:34 🔗 winr4r my bad!
13:00 🔗 Rainbow \o/ Fixed it!
13:00 🔗 winr4r Rainbow: how?
13:01 🔗 Rainbow Left over lua install seemed to cause it
13:01 🔗 Rainbow Odd as it sounds
13:01 🔗 winr4r ah :)
13:01 🔗 IceGuest_ WARNING:tornado.general:Connect error on fd 6: ECONNREFUSED
13:26 🔗 JackWS Why would I be getting New item: Step 1 of 8 No HTTP response received from tracker. ?
13:26 🔗 winr4r tracker down?
13:26 🔗 JackWS working on my machine
13:26 🔗 JackWS just not on my server :?
13:28 🔗 winr4r you sure there's no outbound filtering?
13:28 🔗 JackWS Should not be
13:28 🔗 JackWS what ports is it wanting to use?
13:29 🔗 winr4r not sure
13:30 🔗 GLaDOS It just uses port 80
13:31 🔗 JackWS testing it in a VM before I deploy it onto a few servers
13:38 🔗 JackWS ah I got it
13:38 🔗 JackWS [screen is terminating]
13:38 🔗 JackWS when trying to run
13:44 🔗 JackWS ah got it running
13:44 🔗 JackWS was just being funny I think
13:45 🔗 JackWS are you able to enable the graph on the webserver site? would be nice to see how much it is using
15:26 🔗 WiK sup
15:33 🔗 antomatic hey
15:46 🔗 WiK hows it going antomatic ?
15:47 🔗 antomatic ah, can't complain. Just sitting here staring at the Xanga leaderboard. :)
16:56 🔗 db48x do we have a tool that breaks a megawarc back up into the original warcs?
16:59 🔗 winr4r db48x: https://pypi.python.org/pypi/Warcat/ ?
16:59 🔗 db48x not quite
16:59 🔗 db48x it can extract records from a warc (or a megawarc)
17:00 🔗 db48x but the original warc was a series of related records
17:01 🔗 db48x metadata about the process used to create the warc, each request as it was made, and each response received
17:01 🔗 winr4r i'll pass on that question, then
17:03 🔗 db48x the warc viewer is pretty good
17:03 🔗 db48x but I don't want to use wget to spider a site being served up by the warc viewer's proxy server
17:05 🔗 db48x warc-to-zip is interesting, but alas it requires byte offsets
17:06 🔗 db48x I can get the start addresses of the response records, but not their lengths
17:06 🔗 xmc db48x: https://github.com/alard/megawarc "megawarc restore megafoo.warc.gz"
17:07 🔗 xmc iirc it creates a file bit-for-bit identical to the original source
17:07 🔗 xmc is that what you're looking for?
17:07 🔗 db48x ah, that sounds promising
17:12 🔗 db48x I will have to update the description on the warc ecosystem page
17:21 🔗 xmc ooh, warcproxy
17:21 🔗 xmc I was meaning to write that
17:21 🔗 xmc cool that someone else did!
17:21 🔗 xmc now to bend it to my will
17:24 🔗 db48x heh
17:39 🔗 xmc well, not now, maybe later.
17:47 🔗 db48x xmc: thanks, btw
17:48 🔗 db48x that turned out to be precisely what I needed
17:48 🔗 xmc my pleasure
17:48 🔗 xmc excellent
17:49 🔗 db48x we ought to get something set up so that people can reclaim their data by putting in the site url
17:50 🔗 xmc not a bad idea at all
17:50 🔗 db48x hmm, there are 444 of these megawarcs; I had to download all the idx files to find the one containing the site I wanted
17:51 🔗 db48x not sure I have 22 tb just laying around
17:52 🔗 xmc @_@
17:52 🔗 xmc might be more reasonable to patch up the megawarc program to submit range-requests to the Archive and reassemble that way
17:53 🔗 db48x that's what warc-to-zip does
17:53 🔗 db48x you give it the url of a warc and a byte range, and it gives you a zip
17:53 🔗 xmc ah cool
17:54 🔗 db48x looks like the json files have the best information
17:54 🔗 db48x {"target":{"container":"warc","offset":0,"size":29265692},"src_offsets":{"entry":0,"data":512,"next_entry":29266432},"header_fields":{"uid":1001,"chksum":0,"uname":"","gname":"","size":29265692,"devmajor":0,"name":"20130526205026/posterous.com-vividturtle.posterous.com-20130522-061616.warc.gz","devminor":0,"gid":1001,"mtime":1369567781.0,"mode":420,"linkname":"","type":"0"},"header_base64":"MjAxMzA1MjYyMDUwMjYvcG9zdGVyb3VzLmNvbS12aXZpZHR1
18:02 🔗 db48x yes, very nice
18:02 🔗 db48x using the offset and offset+size as the byte range I get a very nice zip
18:03 🔗 db48x so it would just be a matter of parsing the filenames from the json indexes to get the site urls
18:04 🔗 xmc fantastique
18:06 🔗 db48x precisimo
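
A minimal sketch of that lookup, assuming the megawarc .json index is one JSON object per line shaped like the record quoted above: take each contained warc's filename plus its offset and size, and emit the byte range to hand to warc-to-zip or an HTTP Range request. The index filename in the comment and the container check are assumptions.

```python
#!/usr/bin/env python
# Sketch: walk a megawarc .json index (one JSON object per line, as in the
# record pasted above) and print each contained warc's filename with the
# inclusive byte range of its data inside the megawarc's warc.gz.
import gzip
import json
import sys

def entries(index_path):
    opener = gzip.open if index_path.endswith('.gz') else open
    with opener(index_path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

if __name__ == '__main__':
    # e.g. python megawarc_ranges.py archiveteam-posterous-XXXXX.megawarc.json.gz
    for entry in entries(sys.argv[1]):
        if entry['target'].get('container') != 'warc':
            continue  # assumed: non-warc members live in the .tar instead
        name = entry['header_fields']['name']
        offset = entry['target']['offset']
        size = entry['target']['size']
        print '%s\t%d-%d' % (name, offset, offset + size - 1)
```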
19:45 🔗 arkhive I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
19:46 🔗 arkhive Heh, my WebTV Philips/Magnavox client is in my recycling
21:50 🔗 wp494 [14:45:46.746] <arkhive> I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
21:50 🔗 wp494 we're aware
21:51 🔗 wp494 also, puu.sh has now been added to the navbox
21:51 🔗 wp494 (channel for those that weren't awake at 4 AM CDT: #pushharder)
22:16 🔗 wp494 posterous still remains on the tracker and in warriors for whatever reason
22:17 🔗 wp494 what gives, if I can ask?
