Time | Nickname | Message
00:11 | godane | i'm tracking down original diggnation hd episodes
00:12 | godane | :-D
01:33 | SketchCow | http://archive.org/details/messmame
02:51 | balrog | ATZ0_: hmm?
05:22 | wp494 | just got an email from puush regarding "important changes"
05:22 | wp494 | will post more if anything of interest
05:23 | wp494 | " * Stop offering permanent storage, and files will expire after not being accessed for:
05:23 | wp494 | - Free users: 1 month
05:23 | wp494 | - Pro users: up to 6 months"
05:23 | wp494 | "How this will affect you after the 1st of August 2013:
05:23 | wp494 | * We are going to start expiring files. At this point, any files which haven't been recently viewed by anyone will be automatically deleted after 1 month, or up to 6 months for pro users."
05:23 | wp494 | and " * If you wish to grab a copy of your files before this begins, you can download an archive from your My Account page (Account -> Settings -> Pools -> Export)."
05:23 | wp494 | seems a lot like imgur-style expiration to me, except on a more extreme scale
05:24 | wp494 | if we were to start a project, it'd have to evolve into something like the urlteam project
05:25 | xmc | imgur expires posts? didn't know that
05:26 | winr4r | it looks like puush uses incremental IDs
05:28 | wp494 | yeah, they do after 6 months IIRC
05:28 | wp494 | (re. imgur)
05:29 | * | xmc nods
05:34 | wp494 | it should be easy to archive what exists already and then over the long-term archive what's uploaded afterwards
05:35 | wp494 | provided if done in urlteam style
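winr4r's point about incremental IDs is what would make a urlteam-style grab workable: walk the keyspace from one end and record what resolves. A minimal Python sketch of that idea follows; the ID alphabet and length are guesses rather than anything confirmed about puu.sh, and a real project would shard ranges out through a tracker instead of looping on one machine.

    # Sketch only: walk an assumed puu.sh-style keyspace, urlteam-style, and
    # note which IDs resolve. Alphabet and length are guesses, not confirmed.
    import itertools
    import urllib.error
    import urllib.request

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def candidate_ids(length):
        # every ID of the given length over the assumed alphabet, in order
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)

    def probe(item_id):
        # return the final URL if the item exists, None if the request fails
        try:
            with urllib.request.urlopen("http://puu.sh/" + item_id, timeout=30) as resp:
                return resp.geturl()
        except urllib.error.URLError:
            return None

    for item_id in candidate_ids(4):
        hit = probe(item_id)
        if hit:
            print(item_id, hit)

Rate limiting and writing the responses into WARCs are deliberately left out; the point is just the enumeration.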
05:44 | wp494 | any thoughts?
05:44 | wp494 | channel name's probably going to be hard to come up with
05:45 | GLaDOS | #pushharder
05:46 | GLaDOS | You know, we wouldn't have to archive everything initially..
05:46 | GLaDOS | We'd just have to 'access' the file.
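GLaDOS's alternative relies on expiry being keyed to last access: fetch every known URL now and then and nothing ever crosses the one-month line. A rough sketch, assuming a plain GET counts as a "view" (unverified) and that urls.txt is a placeholder list of known puu.sh links, one per line:

    # Sketch: keep known files "accessed" so they don't expire. Assumes a GET
    # counts as a view; urls.txt is a placeholder input file.
    import time
    import urllib.request

    for line in open("urls.txt"):
        url = line.strip()
        if not url:
            continue
        try:
            urllib.request.urlopen(url, timeout=30).read(1)  # one byte should be enough to register the hit
            print("ok", url)
        except Exception as err:
            print("failed", url, err)
        time.sleep(1)  # be gentle with the host

As wp494 goes on to say, this only works for as long as someone keeps running it, which is why grabbing the files outright is the safer plan.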
05:52 | wp494 | good point
05:52 | wp494 | but I wouldn't think we'd be able to keep it up depending on how many files they have
05:57 | wp494 | probably better off in the long term just to grab anything we can
05:57 | wp494 | in case they decide to make the limits even shorter if we were to go through with the plan of just accessing
05:59 | wp494 | (which would suck for both us and users)
06:14 | underscor | Besides, gobs of data is more fun
07:12 | omf_ | Here is the shutdown notice - http://allthingsd.com/20130706/microsoft-quietly-shuts-down-msn-tv-once-known-as-webtv/
07:12 | omf_ | Closes at the end of september
07:12 | omf_ | from looking at a hosted site it should not be a problem to grab we just need to build a username list
07:13 | omf_ | This has pages going back to the late 90s I believe
07:24 | winr4r | i'd be surprised if most of them weren't from the 90s
07:24 | winr4r | hm :/
07:25 | omf_ | Just looking at the markup for some of those sites tells a story. I like finding shit like this
07:28 | poqpoq | http://news.uscourts.gov/pacer-survey-shows-rise-user-satisfaction
08:06 | Nemo_bis | they wouldn't be paying so much otherwise?
08:10 | winr4r | so how do you guys find sites, anyway
08:11 | winr4r | by which i mean, how do you get a list of websites or whatever hosted on a given service
08:15 | ersi | Depends on the site - some, you just have to go with brute force
08:15 | ersi | Others, you can scrape and discover users easily
08:18 | winr4r | ersi: what about stuff like webtv and free webhosts?
08:18 | winr4r | i.e. arbitrary usernames
08:18 | winr4r | no standard format, no links between pages
08:19 | winr4r | i might put a page together about this on the wiki
08:21 | winr4r | and i'm finding old ODP data (from about 2009, i needed it once and never deleted it) quite useful
08:24 | omf_ | winr4r, ODP data?
08:26 | winr4r | omf_: Open Directory Project
08:27 | winr4r | they offer dumps of their data, about 1.9 gigabytes
08:27 | winr4r | (uncompressed)
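For reference, the ODP dumps are one huge RDF-flavoured XML file (content.rdf.u8 in the copies I've seen), with every listed site appearing as a quoted URL, so pulling out all the webtv.net entries is close to a plain pattern match. A rough sketch along those lines; the dump filename and the choice to regex the raw text rather than parse the XML are both assumptions and shortcuts:

    # Sketch: grep an ODP/DMOZ RDF dump for webtv.net URLs to seed a site list.
    # Pass the dump file (e.g. content.rdf.u8) as the first argument.
    import re
    import sys

    URL_RE = re.compile(r'"(https?://[^"]*webtv\.net[^"]*)"')

    seen = set()
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        for line in dump:
            for url in URL_RE.findall(line):
                if url not in seen:
                    seen.add(url)
                    print(url)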
08:39 | winr4r | http://archiveteam.org/index.php?title=MSN_TV
08:41 | GLaDOS | ============
08:41 | GLaDOS | The code has been prepared to run the hell out of.
08:41 | GLaDOS | To help out, join #jenga
08:41 | GLaDOS | Xanga has 8 days left, and we've yet to download 4 million users.
09:07 | wp494 | http://archiveteam.org/index.php?title=Puu.sh
09:07 | wp494 | wiki page for puu.sh now up
09:16 | ersi | winr4r: I'd look into searching through search engines and then Common Crawl. Then I'd go brute-forcing usernames
09:17 | winr4r | ersi: are there any scrapable search engines?
09:17 | winr4r | bing used to have a useful API, doesn't now
09:18 | underscor | What does the shape of the urls you need look like, winr4r?
09:18 | underscor | I can pull stuff out of wayback
09:20 | winr4r | underscor: anything from community.webtv.net or community-X.webtv.net for values of X = 1..4
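One concrete way to do what underscor is offering, even from outside the Archive: ask the Wayback Machine's CDX index for everything it has captured under those five hosts and use that as the seed list. A sketch, assuming the public CDX endpoint at web.archive.org/cdx/search/cdx and its fl/collapse parameters behave as documented:

    # Sketch: list every URL the Wayback Machine has captured under the WebTV
    # community hosts, as a seed list for the grab.
    import urllib.parse
    import urllib.request

    CDX = "https://web.archive.org/cdx/search/cdx"
    HOSTS = ["community.webtv.net"] + ["community-%d.webtv.net" % i for i in range(1, 5)]

    for host in HOSTS:
        query = urllib.parse.urlencode({
            "url": host + "/*",    # everything under this host
            "fl": "original",      # only the original-URL column
            "collapse": "urlkey",  # one row per unique URL
        })
        with urllib.request.urlopen(CDX + "?" + query, timeout=120) as resp:
            for line in resp:
                print(line.decode("utf-8", "replace").strip())

Deduplicating on the first path component of the output would get close to the username list omf_ mentioned.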
09:21 | ersi | winr4r: I know there's an "old script" alard made for scraping Google
09:21 | ersi | somewhere
09:40 | underscor | http://farm8.staticflickr.com/7433/9228353492_aa9169e927_k.jpg
09:40 | underscor | Mmmmm
09:40 | underscor | Explosion-y goodness from July 4th
10:10 | winr4r | http://paste.archivingyoursh.it/nuxopefaci.py
10:10 | winr4r | wrote that just now, takes list of shortened URLs on stdin, outputs non-shortened URLs on stdout
10:11 | winr4r | dunno if anyone else would find it useful, but there it is
10:11 | winr4r | i had a big list of t.co URLs from a twitter search, needed to convert to real URLs
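The paste above isn't reproduced in the log, so here is a stand-in for what a script matching that description would look like: shortened URLs on stdin, resolved URLs on stdout, letting the HTTP library follow the redirects. This is a guess at the approach, not winr4r's actual code:

    # Sketch of a stdin-to-stdout redirect resolver (not the original paste).
    import sys
    import urllib.request

    for line in sys.stdin:
        short = line.strip()
        if not short:
            continue
        try:
            with urllib.request.urlopen(short, timeout=30) as resp:
                print(resp.geturl())  # urlopen has already followed the redirects
        except Exception as err:
            print("ERROR %s (%s)" % (short, err), file=sys.stderr)

Run as "python resolve.py < tco-urls.txt > real-urls.txt" (filenames made up), which is the shape of the job described above.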
10:13 | winr4r | tbh i don't know if MSN TV will even merit using warrior, rather than one guy with a fast pipe and wget
10:19 | winr4r | oh shit
10:20 | winr4r | apparently yeah some people link some disgusting shit what the fuck
10:20 | ersi | haha
10:21 | winr4r | okay so one of the URLs to which a t.co link resolved was a google groups search with the query string "12+year+old+daughter+sex"
10:22 | winr4r | am i fucked?
10:24 | winr4r | i don't know how the fuck that showed up in a search for "webtv.net" on twitter, but it did
10:25 | ersi | Pack your things
10:25 | ersi | Before the vans arrive
10:27 | winr4r | http://community-2.webtv.net/@HH!17!BF!62DA2CCF370F/TvFoutreach/COUNTDOWNTO666/
10:39 | JackWS | Hi all, interested in the project. I was wondering if there is a standalone archiver? Got a load of Linux servers and a few Windows servers with a shed load of bandwidth going spare every month
10:39 | winr4r | JackWS: xanga? yes, there is
10:39 | GLaDOS | You can run the projects as standalone.
10:40 | winr4r | https://github.com/ArchiveTeam/xanga-grab
10:40 | winr4r | there aren't installation instructions there
10:40 | GLaDOS | install pip (python), pip install seesaw, clone the project repo, run ./get-wget-lua.sh, then run-pipeline pipeline.py YOURNAMEHERE --concurrent amountofthreads --disable-web-server
10:40 | winr4r | ...yeah i was about to say something like that :)
10:41 | GLaDOS | Should write something for it on the wiki.
10:41 | winr4r | the dependency instructions at https://github.com/ArchiveTeam/greader-grab will probably work just as well for xanga-grab
10:41 | ersi | Or just commit a README.md
10:48 | JackWS | thanks for the info
10:48 | JackWS | ill take a look
10:48 | winr4r | :D
10:58 | winr4r | http://www.faroo.com/hp/api/api.html
10:58 | winr4r | well this exists
11:00 | ersi | Cool
11:00 | BlueMax | English, German and Chinese results
11:00 | BlueMax | How specific
11:01 | winr4r | oh, scratch that, it seems it doesn't support "site:" queries
11:14 | Rainbow | Hi, having some issues compiling wget-lua for xanga-grab, anyone know what causes this issue? http://www.hastebin.com/yetorupupa.vbs
11:20 | winr4r | googling the error i'm seeing that it happens when you don't have -ldl in LDFLAGS, but it's clear that you do
11:32 | Rainbow | Damn, I have to go afk. If anyone finds what the issue is, please pm me.
11:34 | winr4r | Rainbow: yup, paging GLaDOS
11:34 | GLaDOS | I have no idea when it comes to building
11:34 | winr4r | my bad!
13:00 | Rainbow | \o/ Fixed it!
13:00 | winr4r | Rainbow: how?
13:01 | Rainbow | Left over lua install seemed to cause it
13:01 | Rainbow | Odd as it sounds
13:01 | winr4r | ah :)
13:01 | IceGuest_ | WARNING:tornado.general:Connect error on fd 6: ECONNREFUSED
13:26 | JackWS | Why would I be getting New item: Step 1 of 8 No HTTP response received from tracker. ?
13:26 | winr4r | tracker down?
13:26 | JackWS | working on my machine
13:26 | JackWS | just not on my server :?
13:28 | winr4r | you sure there's no outbound filtering?
13:28 | JackWS | Should not be
13:28 | JackWS | what ports is it wanting to use?
13:29 | winr4r | not sure
13:30 | GLaDOS | It just uses port 80
13:31 | JackWS | testing it in a VM before I deploy it onto a few servers
13:38 | JackWS | ah I got it
13:38 | JackWS | [screen is terminating]
13:38 | JackWS | when trying to run
13:44 | JackWS | ah got it running
13:44 | JackWS | was just being funny I think
13:45 | JackWS | are you able to enable the graph on the webserver site? would be nice to see how much it is using
15:26 | WiK | sup
15:33 | antomatic | hey
15:46 | WiK | hows it going antomatic ?
15:47 | antomatic | ah, can't complain. Just sitting here staring at the Xanga leaderboard. :)
16:56 | db48x | do we have a tool that breaks a megawarc back up into the original warcs?
16:59 | winr4r | db48x: https://pypi.python.org/pypi/Warcat/ ?
16:59 | db48x | not quite
16:59 | db48x | it can extract records from a warc (or a megawarc)
17:00 | db48x | but the original warc was a series of related records
17:01 | db48x | metadata about the process used to create the warc, each request as it was made, and each response received
17:01 | winr4r | i'll pass on that question, then
17:03 | db48x | the warc viewer is pretty good
17:03 | db48x | but I don't want to use wget to spider a site being served up by the warc viewer's proxy server
17:05 | db48x | warc-to-zip is interesting, but alas it requires byte offsets
17:06 | db48x | I can get the start addresses of the response records, but not their lengths
17:06 | xmc | db48x: https://github.com/alard/megawarc "megawarc restore megafoo.warc.gz"
17:07 | xmc | iirc it creates a file bit-for-bit identical to the original source
17:07 | xmc | is that what you're looking for?
17:07 | db48x | ah, that sounds promising
17:12 | db48x | I will have to update the description on the warc ecosystem page
17:21 | xmc | ooh, warcproxy
17:21 | xmc | I was meaning to write that
17:21 | xmc | cool that someone else did!
17:21 | xmc | now to bend it to my will
17:24 | db48x | heh
17:39 | xmc | well, not now, maybe later.
17:47 | db48x | xmc: thanks, btw
17:48 | db48x | that turned out to be precisely what I needed
17:48 | xmc | my pleasure
17:48 | xmc | excellent
17:49 | db48x | we ought to get something set up so that people can reclaim their data by putting in the site url
17:50 | xmc | not a bad idea at all
17:50 | db48x | hmm, there are 444 of these megawarcs; I had to download all the idx files to find the one containing the site I wanted
17:51 | db48x | not sure I have 22 tb just laying around
17:52 | xmc | @_@
17:52 | xmc | might be more reasonable to patch up the megawarc program to submit range-requests to the Archive and reassemble that way
17:53 | db48x | that's what warc-to-zip does
17:53 | db48x | you give it the url of a warc and a byte range, and it gives you a zip
17:53 | xmc | ah cool
17:54 | db48x | looks like the json files have the best information
17:54 | db48x | {"target":{"container":"warc","offset":0,"size":29265692},"src_offsets":{"entry":0,"data":512,"next_entry":29266432},"header_fields":{"uid":1001,"chksum":0,"uname":"","gname":"","size":29265692,"devmajor":0,"name":"20130526205026/posterous.com-vividturtle.posterous.com-20130522-061616.warc.gz","devminor":0,"gid":1001,"mtime":1369567781.0,"mode":420,"linkname":"","type":"0"},"header_base64":"MjAxMzA1MjYyMDUwMjYvcG9zdGVyb3VzLmNvbS12aXZpZHR1
18:02 | db48x | yes, very nice
18:02 | db48x | using the offset and offset+size as the byte range I get a very nice zip
18:03 | db48x | so it would just be a matter of parsing the filenames from the json indexes to get the site urls
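That record has everything needed for the range trick db48x describes: target.offset and target.size locate the original per-site warc.gz inside the megawarc, and header_fields.name carries the original filename, which encodes the site. A small sketch of the lookup, assuming the .json index holds one such object per line (as the quoted record suggests); the index path and search string are placeholders passed on the command line:

    # Sketch: find a site's original warc inside a megawarc via the .json index
    # and print the byte range to hand to warc-to-zip or an HTTP Range request.
    # Assumes one JSON object per line, shaped like the record quoted above.
    import json
    import sys

    def find_ranges(index_path, needle):
        with open(index_path, encoding="utf-8") as index:
            for line in index:
                entry = json.loads(line)
                if entry["target"]["container"] != "warc":
                    continue  # only entries stored in the .warc.gz part
                name = entry["header_fields"]["name"]
                if needle in name:
                    offset = entry["target"]["offset"]
                    size = entry["target"]["size"]
                    yield name, offset, offset + size - 1  # inclusive HTTP-style range

    for name, first, last in find_ranges(sys.argv[1], sys.argv[2]):
        print("%s\tbytes=%d-%d" % (name, first, last))

Feeding the printed range to warc-to-zip, or to a plain HTTP Range request against the megawarc on archive.org, is the "reclaim your data by site url" service in miniature.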
18:04 | xmc | fantastique
18:06 | db48x | precisimo
19:45 | arkhive | I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
19:46 | arkhive | Heh, my WebTV Philips/Magnavox client is in my recycling
21:50 | wp494 | [14:45:46.746] <arkhive> I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
21:50 | wp494 | we're aware
21:51 | wp494 | also, puu.sh has now been added to the navbox
21:51 | wp494 | (channel for those that weren't awake at 4 AM CDT: #pushharder)
22:16 | wp494 | posterous still remains on the tracker and in warriors for whatever reason
22:17 | wp494 | what gives, if I can ask?