#archiveteam 2016-08-17,Wed

Time	Nickname	Message
00:00 ^🔗		laufwerkf has joined #archiveteam
00:17 ^🔗		tfgbd_znc has quit IRC (Ping timeout: 633 seconds)
00:29 ^🔗		kristian_ has quit IRC (Leaving)
00:49 ^🔗		JesseW has joined #archiveteam
00:58 ^🔗		DoomTay has joined #archiveteam
01:26 ^🔗		ZeoNet has quit IRC (Read error: Operation timed out)
01:52 ^🔗		BlueMaxim has joined #archiveteam
02:24 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
02:27 ^🔗		Zialus is now known as RMF|away
02:31 ^🔗	nicolas17	so...
02:32 ^🔗	nicolas17	how do I help with archiving stuff? the Warrior?
02:52 ^🔗	dashcloud	that's probably the easiest way
02:55 ^🔗	nicolas17	I just ran it as a docker container in a VPS
02:56 ^🔗	nicolas17	took me a while to get the web UI to work (port forwarding and all)
02:56 ^🔗	nicolas17	I set it to "archiveteam's choice" and it seems its choice was urlteam, which is frequently giving "no tasks available" :o
02:58 ^🔗	nicolas17	the web UI is super slick
03:07 ^🔗		laufwerkf has quit IRC ()
03:09 ^🔗	nicolas17	what project does have tasks?
03:09 ^🔗	nicolas17	I have a lot of bandwidth and I'd like to use it :P
03:15 ^🔗		JesseW has joined #archiveteam
03:28 ^🔗	DoomTay	Doesn't the warrior interface display all possible tasks?
03:32 ^🔗	nicolas17	it shows all possible projects, and most of the ones I try say they have no tasks from the tracker
03:32 ^🔗	DoomTay	nicolas17: How about Orkut? That's going kaput next month.
03:33 ^🔗		RichardG has quit IRC (Ping timeout: 370 seconds)
03:34 ^🔗	nicolas17	the throttling and project switching is pretty annoying... like if I switch to orkut, it doesn't start any new orkut task because there are too many concurrent tasks from another project running already... but all those tasks are sleeping! ("No items available currently. Trying again in 120 seconds")
03:35 ^🔗	JesseW	nicolas17: they will eventually time out, and it will load new tasks from orkut
03:36 ^🔗	JesseW	there certainly are various things that could be improved about the warrior, though
03:36 ^🔗	nicolas17	more like keep sleeping because of heavy tracker rate limiting on the orkut project :P
03:36 ^🔗	JesseW	when I looked into it, I got stuck trying to set up a testing environment
03:36 ^🔗	JesseW	nicolas17: sure, but at least they'll be waiting on orkut
03:37 ^🔗	JesseW	also, if you can, URLteam can always use people investigating shorteners -- then I can add more to the tracker, and there will be more work to do
03:38 ^🔗	nicolas17	I'm on a gigabit pipe doing pretty brief 50KB/s bursts and then sleeping, it's a bit frustrating
03:40 ^🔗		tomwsmf has quit IRC (Read error: Operation timed out)
03:42 ^🔗	DoomTay	What's your ISP?
03:42 ^🔗	nicolas17	I'm running it on a VPS
03:44 ^🔗	nicolas17	maybe I should run an archivebot node instead? :P
03:44 ^🔗	bwn	nicolas17: <HCross2> Putting another call out. We really could do with a few more newsbuddy grabbers. If anyone has a fast, stable connection and is willing to help, just come into #newsgrabber and let myself or arkiver know please
03:44 ^🔗	*	nicolas17 still reading the wiki
03:44 ^🔗	JesseW	nicolas17: no new archivebot nodes for now
03:45 ^🔗	nicolas17	oki
03:53 ^🔗	nicolas17	will the orkut grab finish in time with the current rate limiting? it seems like adding more warriors would make no difference
03:54 ^🔗		RichardG has joined #archiveteam
04:13 ^🔗		ravetcofx has quit IRC (Read error: Connection reset by peer)
04:16 ^🔗		ravetcofx has joined #archiveteam
04:20 ^🔗		DoomTay has quit IRC (DoomTay)
04:23 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
04:30 ^🔗		Sk1d has joined #archiveteam
05:35 ^🔗		barblfish has joined #archiveteam
05:36 ^🔗	barblfish	According to the wiki article on DNS History, "the site is a zombie"
05:36 ^🔗	barblfish	I just took a peek and the site looks to be working normally, including search
05:37 ^🔗		barblfish has quit IRC (Client Quit)
05:38 ^🔗		barblfish has joined #archiveteam
05:39 ^🔗	JesseW	barblfish: good! maybe the interest prompted the site owner to keep it running
05:40 ^🔗	JesseW	barblfish: feel free to update the wiki page, mentioning that the site seems to be working (but mention exactly what you did and didn't try, as other pieces may still be broken).
05:40 ^🔗	barblfish	Probably should TRY to archive it at a "leisurely" pace just in case. For whatever reason, the closure notice is still there
05:40 ^🔗	JesseW	the magic word is yahoosucks
05:40 ^🔗	barblfish	K
05:41 ^🔗	JesseW	barblfish: yeah, having an individual run a grab-site instance at, say 1 request per couple of minutes (in a random order, with a random delay) is probably worth doing
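
The throttled grab JesseW describes (random order, random delay, roughly one request every couple of minutes) can be sketched as below. This is only an illustration of the scheduling idea, not how grab-site itself is implemented; the function names, the 60 to 180 second delay window, and the injected `fetch`/`sleep` callables are all assumptions made for the example.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def polite_schedule(
    urls: Iterable[str],
    min_delay: float = 60.0,   # assumed lower bound, in seconds
    max_delay: float = 180.0,  # assumed upper bound, in seconds
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, float]]:
    """Yield (url, delay) pairs: URLs in shuffled order, each with a random delay."""
    rng = rng or random.Random()
    shuffled = list(urls)
    rng.shuffle(shuffled)
    for url in shuffled:
        yield url, rng.uniform(min_delay, max_delay)


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
    **schedule_options,
) -> None:
    """Fetch each URL once, pausing for the scheduled delay after every request."""
    for url, delay in polite_schedule(urls, **schedule_options):
        fetch(url)
        sleep(delay)
```

Passing `sleep` in as a parameter keeps the sketch easy to dry-run: substitute a function that only records the delays and the whole schedule can be inspected without waiting.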
05:46 ^🔗		barblfish has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 48.0/20160726073904])
06:02 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
06:07 ^🔗		nicolas17 has quit IRC (Read error: Operation timed out)
07:44 ^🔗		Selavi has joined #archiveteam
07:48 ^🔗		JesseW has joined #archiveteam
08:25 ^🔗		MMovie1 has joined #archiveteam
08:27 ^🔗		MMovie has quit IRC (Read error: Operation timed out)
08:42 ^🔗		Morbus has quit IRC (Ping timeout: 255 seconds)
08:45 ^🔗		Morbus has joined #archiveteam
09:08 ^🔗		Honno has joined #archiveteam
10:07 ^🔗		WinterFox has joined #archiveteam
10:10 ^🔗		JesseW has quit IRC (Read error: Operation timed out)
10:25 ^🔗		SketchCow has quit IRC (Read error: Operation timed out)
10:28 ^🔗		SketchCow has joined #archiveteam
10:28 ^🔗		swebb sets mode: +o SketchCow
11:03 ^🔗		ats has quit IRC (Quit: Lost terminal)
11:05 ^🔗		ats has joined #archiveteam
12:59 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
13:00 ^🔗		RMF|away is now known as Zialus
13:05 ^🔗		WinterFox has quit IRC (Read error: Operation timed out)
13:21 ^🔗		DoomTay has joined #archiveteam
13:30 ^🔗		ats has quit IRC (Quit: leaving)
13:36 ^🔗		ats has joined #archiveteam
13:36 ^🔗	joepie91	Reddit thread regarding Google Code deleting tarballs in 5 months: https://www.reddit.com/r/programming/comments/4y4epv/about_5_months_from_now_the_tarballs_from_google/
13:45 ^🔗	arkiver	joepie91: you're talking about the Google Code Archive shutting down too?
13:46 ^🔗	joepie91	seems so
13:46 ^🔗	joepie91	have not read the thread carefully
13:46 ^🔗	joepie91	just passing it on
13:46 ^🔗	arkiver	yeah, it looks like it
13:47 ^🔗	arkiver	When we're done with the 'original' google code we'll do the google code archive too
13:48 ^🔗		bauruine has quit IRC (Ping timeout: 260 seconds)
13:53 ^🔗		bauruine has joined #archiveteam
13:58 ^🔗	DoomTay	Since ArchiveBot seems to not handle LEGO.com videos properly, I'm going to try my hand at archiving videos at http://web.archive.org/web/20160616230429/http://www.lego.com/en-us/chima/videos "manually". And I just figured out how to do that
14:07 ^🔗	arkiver	how are you going to do that?
14:07 ^🔗	arkiver	let's move this to #archiveteam-bs also
14:08 ^🔗	DoomTay	Yeaah....
14:09 ^🔗	DoomTay	Can't
14:31 ^🔗		Sneakyimp has joined #archiveteam
14:53 ^🔗	voltagex	DoomTay: come over to -bs, also, youtube-dl should save those for you.
14:53 ^🔗	voltagex	arkiver, joepie91: just emailed Chris DiBona about getting in touch with ArchiveTeam re: Google Code, we'll see how that goes.
14:54 ^🔗	voltagex	the #googlecodeblue wiki needs some TLC.
14:54 ^🔗	voltagex	tracker seems to be down, also.
14:54 ^🔗	DoomTay	It looks like I'm banned from -bs. Also, I already tried youtube-dl with ArchiveBot. no luck.
14:56 ^🔗	voltagex	no, you'd only be able to use it on the live site
14:56 ^🔗	voltagex	if the videos ain't in archive.org, they ain't in archive.org.
14:56 ^🔗	voltagex	I wonder what you did to get banned from bs
14:57 ^🔗	DoomTay	Apparently a history of "saying galactically dumb shit"
14:59 ^🔗	voltagex	oh well, live and learn
14:59 ^🔗	voltagex	what are you trying to save exactly?
14:59 ^🔗	voltagex	if it's missing files in archive.org you'd have to go back to the source
14:59 ^🔗	voltagex	if they're gone there, YouTube or you're too late.
15:01 ^🔗	DoomTay	I'm trying to save videos off of http://www.lego.com/en-us/chima/videos . Problem is their video player is powered by AngularJS, and the player is set up "on the fly"
15:02 ^🔗	DoomTay	And using youtube-dl is also a no go: "unsupported URL"
15:03 ^🔗	voltagex	correct
15:04 ^🔗	voltagex	post a correctly formatted issue / request on https://github.com/rg3/youtube-dl/issues
15:06 ^🔗	voltagex	the only other hint I'll give you is look at the network traffic for manifest.f4m
15:22 ^🔗		JesseW has joined #archiveteam
15:33 ^🔗	nwf	Hey channel. I have Internet2 at my disposal and a huge stash of unused disk space; can I be of assistance for google code or some other project? Ideally your answer is something like "Yes, please run aria2 on each URL in the list at $URL." ;)
15:36 ^🔗	JesseW	nwf: join #newsgrabber and ask about being a grabber
15:37 ^🔗	JesseW	nwf: also, check out iabackup.archiveteam.org for a use for your disk space
15:37 ^🔗	JesseW	and THANK YOU!
15:37 ^🔗	JesseW	feel free to ask here if you have questions
15:38 ^🔗	nwf	Thanks. :)
15:39 ^🔗	JesseW	you can also run a #warrior, but we don't have any project ATM that needs help, I think. But that could change anytime.
15:40 ^🔗	Sanqui	you can run an archivebot pipeline
15:40 ^🔗	Sanqui	reliable long term ones are always wanted
15:41 ^🔗	nwf	Whazzat?
15:41 ^🔗	Sanqui	on-demand archiver of small-to-medium or at-risk websites
15:42 ^🔗	Sanqui	see #archivebot, http://archiveteam.org/index.php?title=ArchiveBot
15:42 ^🔗	nwf	Sounds neat. Who has authority to push to the queue? (I don't want there to be risk to my hosting organization.)
15:43 ^🔗	Sanqui	trusted users from here, though the bar is set pretty low
15:43 ^🔗	Sanqui	and sometimes questionable stuff is archived
15:43 ^🔗	Sanqui	so if that's of concern, it's fine
15:43 ^🔗	nwf	Well, it just means I need to ask the admins for permission / give them a heads up that network security might come after them for a particular IP address.
15:44 ^🔗	JesseW	we're also not accepting new #archivebot pipelines right now, according to yipdw (who maintains the list)
15:44 ^🔗	Sanqui	oh
15:44 ^🔗	Sanqui	I didn't know that, alright
15:44 ^🔗		nicolas17 has joined #archiveteam
15:45 ^🔗	JesseW	yeah, new archivebot pipelines are blocked by various code changes (i'm not certain exactly what)
15:45 ^🔗	Sanqui	well archivebot needs to be rewritten, I know that, but it's trudging along anyway :P
15:45 ^🔗	JesseW	but AFAIK, #newsgrabber is actively looking for new pipelines, and #iabackup, while inactive, is still accepting new storage
15:47 ^🔗	JesseW	Sanqui: http://archiveteam.org/index.php?title=ArchiveBot#Volunteer_a_Node see the note at the top
15:47 ^🔗	Sanqui	got it
16:03 ^🔗		DoomTay has quit IRC (Quit: Page closed)
16:08 ^🔗		JesseW has quit IRC (Ping timeout: 370 seconds)
16:21 ^🔗		DoomTay has joined #archiveteam
17:04 ^🔗		AlexLehm has joined #archiveteam
17:33 ^🔗		kristian_ has joined #archiveteam
18:06 ^🔗		JW_work has quit IRC (Quit: Leaving.)
18:07 ^🔗		JW_work has joined #archiveteam
18:10 ^🔗	r3c0d3x	http://www.npr.org/sections/ombudsman/2016/08/17/489516952/npr-website-to-get-rid-of-comments NPR is removing comments from articles, but the comments will still be alive through Disqus. Are we planning on addressing this? (i.e. dumping threads from disqus for each article, or perhaps something else..?)
18:11 ^🔗	r3c0d3x	Quote from the article: "All existing comments on the site will disappear. That is because while comments look as though they exist on the NPR.org pages, they actually live within Disqus, an outside moderation platform used by NPR. So when the commenting software is removed, the archival comments go with it, Montgomery said, adding that it is not possible to remove the comment system but leave the old comments. Individual users will still be able
18:11 ^🔗	r3c0d3x	to see an archive of their own comments in their Disqus accounts."
18:30 ^🔗		JW_work has quit IRC (Read error: Connection reset by peer)
18:31 ^🔗		JW_work has joined #archiveteam
18:55 ^🔗		GLaDOS has quit IRC (Read error: Operation timed out)
18:56 ^🔗		GLaDOS has joined #archiveteam
19:03 ^🔗		SmileyG has quit IRC (Remote host closed the connection)
19:13 ^🔗		JW_work1 has joined #archiveteam
19:15 ^🔗		JW_work has quit IRC (Read error: Operation timed out)
19:16 ^🔗		Smiley has joined #archiveteam
19:28 ^🔗		ats has quit IRC (reeeeboooooooot)
19:59 ^🔗		AlexLehm has quit IRC (Ping timeout: 260 seconds)
20:06 ^🔗		tomwsmf has joined #archiveteam
20:07 ^🔗		SirCmpwn has quit IRC (Read error: Operation timed out)
20:10 ^🔗		ats has joined #archiveteam
20:24 ^🔗		SirCmpwn has joined #archiveteam
20:25 ^🔗		kristian_ has quit IRC (Leaving)
20:31 ^🔗		pfallenop has quit IRC (Read error: Operation timed out)
20:37 ^🔗		mr-b has quit IRC (Read error: Operation timed out)
20:40 ^🔗		mr-b has joined #archiveteam
20:40 ^🔗		pfallenop has joined #archiveteam
20:42 ^🔗		DoomTay has quit IRC (Quit: Page closed)
20:45 ^🔗		mr-b has quit IRC (Ping timeout: 246 seconds)
21:02 ^🔗		kristian_ has joined #archiveteam
21:02 ^🔗		mr-b has joined #archiveteam
21:06 ^🔗		Honno has quit IRC (Read error: Operation timed out)
21:35 ^🔗		robink has quit IRC (Ping timeout: 501 seconds)
22:08 ^🔗		robink has joined #archiveteam
22:38 ^🔗		pfallenop has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		nicolas17 has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		SketchCow has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Morbus has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		zenguy has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		superkuh has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		dashcloud has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		chazchaz has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		winr5r has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		MrRadar has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		RedType_ has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		zino has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		arkiver has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Peetz0r_ has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Infreq has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		aschmitz has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		gibigiana has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		w0rp has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		HCross has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		indrora has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		dxrt has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Zebranky has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		ranma has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		antomatic has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		hook54321 has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		luckcolor has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		ErkDog has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Cameron_D has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		dcmorton has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		is- has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		Jogie has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		mistym- has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		swebb has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		atlogbot has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		dserodio has quit IRC (ny.us.hub irc.servercentral.net)
22:38 ^🔗		filippo__ has quit IRC (ny.us.hub irc.servercentral.net)
22:40 ^🔗		andromed1 has quit IRC (Read error: Connection reset by peer)
22:55 ^🔗		JW_work has joined #archiveteam
23:04 ^🔗		JW_work1 has quit IRC (Read error: Operation timed out)
23:05 ^🔗		DoomTay has joined #archiveteam
23:05 ^🔗		pfallenop has joined #archiveteam
23:05 ^🔗		nicolas17 has joined #archiveteam
23:05 ^🔗		SketchCow has joined #archiveteam
23:05 ^🔗		Morbus has joined #archiveteam
23:05 ^🔗		zenguy has joined #archiveteam
23:05 ^🔗		superkuh has joined #archiveteam
23:05 ^🔗		dashcloud has joined #archiveteam
23:05 ^🔗		chazchaz has joined #archiveteam
23:05 ^🔗		winr5r has joined #archiveteam
23:05 ^🔗		MrRadar has joined #archiveteam
23:05 ^🔗		RedType_ has joined #archiveteam
23:05 ^🔗		zino has joined #archiveteam
23:05 ^🔗		arkiver has joined #archiveteam
23:05 ^🔗		Infreq has joined #archiveteam
23:05 ^🔗		Peetz0r_ has joined #archiveteam
23:05 ^🔗		indrora has joined #archiveteam
23:05 ^🔗		aschmitz has joined #archiveteam
23:05 ^🔗		gibigiana has joined #archiveteam
23:05 ^🔗		w0rp has joined #archiveteam
23:05 ^🔗		HCross has joined #archiveteam
23:05 ^🔗		irc.servercentral.net sets mode: +oooo SketchCow chazchaz arkiver HCross
23:05 ^🔗		dxrt has joined #archiveteam
23:05 ^🔗		Zebranky has joined #archiveteam
23:05 ^🔗		ranma has joined #archiveteam
23:05 ^🔗		antomatic has joined #archiveteam
23:05 ^🔗		hook54321 has joined #archiveteam
23:05 ^🔗		luckcolor has joined #archiveteam
23:05 ^🔗		ErkDog has joined #archiveteam
23:05 ^🔗		Cameron_D has joined #archiveteam
23:05 ^🔗		dcmorton has joined #archiveteam
23:05 ^🔗		irc.servercentral.net sets mode: +oooo dxrt antomatic luckcolor dcmorton
23:05 ^🔗		is- has joined #archiveteam
23:05 ^🔗		mistym- has joined #archiveteam
23:05 ^🔗		swebb has joined #archiveteam
23:05 ^🔗		atlogbot has joined #archiveteam
23:05 ^🔗		dserodio has joined #archiveteam
23:05 ^🔗		filippo__ has joined #archiveteam
23:05 ^🔗		irc.servercentral.net sets mode: +oo mistym- swebb
23:05 ^🔗		swebb sets mode: +o brayden_
23:05 ^🔗		swebb sets mode: +o Atluxity
23:05 ^🔗		swebb sets mode: +o DFJustin
23:05 ^🔗		swebb sets mode: +o beardicus
23:05 ^🔗		swebb sets mode: +o midas
23:05 ^🔗		swebb sets mode: +o SadDM
23:05 ^🔗		swebb sets mode: +o balrog
23:05 ^🔗		swebb sets mode: +o edsu
23:05 ^🔗		swebb sets mode: +o joepie91
23:05 ^🔗		swebb sets mode: +o altlabel
23:05 ^🔗		swebb sets mode: +o Jonimoose
23:05 ^🔗		swebb sets mode: +o xmc
23:08 ^🔗		dxrt has quit IRC (Ping timeout: 370 seconds)
23:10 ^🔗		dxrt has joined #archiveteam
23:13 ^🔗		max has joined #archiveteam
23:14 ^🔗	max	i have a site that may have historical significance and i am thinking of shutting it down. who should i talk to about potentially getting it archived efficiently?
23:15 ^🔗	Frogging	What's the site?
23:15 ^🔗	max	www.ytmnd.com
23:16 ^🔗	xmc	o my
23:16 ^🔗	nicolas17	...okay yes that has historical / internet culture significance o.O
23:16 ^🔗	Frogging	o.o
23:16 ^🔗	max	it isn't really cost-effective to host anymore
23:16 ^🔗	xmc	yea we can hold it
23:16 ^🔗	JW_work	max: thank you for considering how best to archive it
23:16 ^🔗	xmc	<3
23:16 ^🔗	max	i could spend the time to try to get it all virtualized, but i think it would only prolong the inevitable death
23:17 ^🔗	nicolas17	max: how much bandwidth is it eating?
23:17 ^🔗	JW_work	the best way would be to make a copy of the whole site database, and ship/upload that to archive.org as an item
23:17 ^🔗	JW_work	(we can help if you have questions)
23:18 ^🔗	JW_work	if that's not feasible (and maybe as an alternative), we can make a scrape of it before it goes down, which will get copied into the Wayback Machine
23:18 ^🔗	max	nicolas17: probably less than 10mbps on average, mainly the costs are colocation fees at the moment since the hardware is aging
23:18 ^🔗	xmc	imo a scrape would be best in any case
23:18 ^🔗		howdoicom has joined #archiveteam
23:18 ^🔗	xmc	guided by a list of valid sites
23:18 ^🔗	xmc	warc it up
23:18 ^🔗	JW_work	it'd just be nice to have the raw database, too, in case someone else wants to host it again later
23:18 ^🔗		BlueMaxim has joined #archiveteam
23:18 ^🔗	xmc	yeah
23:19 ^🔗	JW_work	but yeah, both — both would be best
23:19 ^🔗	nicolas17	max: I meant in GB/month (a constant 10mbps would mean 3TB/mo)
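
nicolas17's estimate can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 Mbit = 1,000,000 bits, 1 TB = 10^12 bytes) and a 30-day month; the function name is made up for the example.

```python
def monthly_transfer_tb(rate_mbps: float, days: float = 30.0) -> float:
    """Convert a constant rate in Mbit/s into terabytes moved over `days` days."""
    bytes_per_second = rate_mbps * 1_000_000 / 8   # bits to bytes
    seconds = days * 24 * 60 * 60
    return bytes_per_second * seconds / 1e12       # bytes to TB (decimal)


print(f"{monthly_transfer_tb(10):.2f} TB/month")
```

A constant 10 Mbit/s comes out at about 3.24 TB over 30 days, which matches the "3TB/mo" figure quoted in the channel.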
23:19 ^🔗	max	nicolas17: i haven't looked and i get billed at 95th percentile
23:19 ^🔗	nicolas17	JW_work: bothisgood.gif
23:19 ^🔗	JW_work	exactly
23:19 ^🔗	max	the content drive is currently 1.7T, i think i'd probably need to anonymize the db at the very least, remove private messages and stuff
23:20 ^🔗	max	at the very least, i could write a script to create a list of every unique URL on the entire site
23:20 ^🔗	Frogging	JW_work: someone should probably write some scripts
23:21 ^🔗	JW_work	well, if you're willing, I'm pretty certain archive.org would be delighted to get a non-anonymized version of the drive and keep it private for a couple of decades or so
23:21 ^🔗	max	to be fair, there is probably a ton of dmca violations, and horrific nsfw stuff
23:21 ^🔗	JW_work	1.7T is not particularly painfully large for us
23:21 ^🔗	max	i figured
23:22 ^🔗	nicolas17	JW_work: I heard you guys wanted a copy of Mapillary in case they go under...
23:22 ^🔗	max	the database is pretty large
23:22 ^🔗	JW_work	yep, it'd be great to have that, too
23:22 ^🔗	nicolas17	Mapillary staff told me they have 200TB of photos, so yeah, 1.7TB is small XD
23:23 ^🔗	JW_work	yeah, 200TB is in the range where we'd need to discuss with IA staff before dropping it on them :-)
23:23 ^🔗	max	but 1.5TB of our content is probably homemade drawings of sonic the hedgehog having sex with tails
23:24 ^🔗	Frogging	that's fine
23:24 ^🔗	JW_work	eh, we're still glad to have it
23:24 ^🔗	max	db is mysql and around 180gb. it has historical view data for every site dating back to 2004 i think
23:24 ^🔗	JW_work	that would be awesome to have
23:25 ^🔗	nicolas17	you mean like access logs? data scientists are drooling right now
23:25 ^🔗	JW_work	:-)
23:25 ^🔗	max	it's more like date, site_id, view_counter
23:25 ^🔗	max	but yeah some neat stuff could be done with it
23:27 ^🔗	max	i wonder if a warc would be able to faithfully encapsulate/play back a ytmnd
23:27 ^🔗	max	it uses a flash loader because at the time it was the only way you could gaplessly loop WAV files
23:28 ^🔗	nicolas17	warc as a format should support it, a naive scraper trying to create the warc would have trouble with the Flash though
23:28 ^🔗	Frogging	max: How long can you keep it online for?
23:28 ^🔗	xmc	if you want to make a html5 version that plays nicely in the archive, people from the future would appreciate it
23:28 ^🔗	xmc	if you don't want to, that's fine
23:29 ^🔗	max	Frogging: indefinitely
23:29 ^🔗	Frogging	thanks
23:29 ^🔗	max	this is pretty preliminary, but if i dont give it to someone it will just sit on a hard drive in my closet forever which seems pretty lame
23:30 ^🔗	nicolas17	I think Google made a Flash-to-HTML5 converter (mainly for Flash ads to work on mobile), it would be interesting to see if it can handle ytmnd .swf's
23:30 ^🔗	DoomTay	I once tried to make a script to convert the things to HTML5, but I got absolutely nowhere with it
23:30 ^🔗	nicolas17	(actually kind of Flash-to-JSON which is then interpreted by an HTML5/Javascript player)
23:30 ^🔗	max	ytmnd just has 1 swf for the player and everything else is standard image/audio formats
23:31 ^🔗	max	i made a prelim html5 version in 2011 but audio support wasnt very good back then
23:31 ^🔗	nicolas17	oh :o
23:31 ^🔗	max	and that was the last time i really worked on the site
23:31 ^🔗	nicolas17	I thought you had vector swf animations and stuff
23:34 ^🔗	max	it's a glorified flash intro and then is just used to play sound
23:34 ^🔗	max	i.e. waits until the gif and audio are loaded before playing either
23:37 ^🔗	DoomTay	We should probably give max the secret phrase so he can make a page about this
23:37 ^🔗	nicolas17	xmc: ok, there is no way you can throw a generic warc grabber at this; there is a swf loader that gets a json to know what wav and jpg to load
23:38 ^🔗	nicolas17	so if you want to scrape, custom script it is
23:38 ^🔗	xmc	yea
23:38 ^🔗	xmc	that was my gut feeling
23:38 ^🔗	nicolas17	http://picard.ytmnd.com/info/508/json
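
A custom scraper along the lines nicolas17 describes would need to read each site's JSON and pull out the sound and image URLs the loader fetches. The log does not show the structure of that JSON, so the sketch below avoids guessing field names: it walks whatever was parsed and keeps any string that looks like a media URL. The suffix list and the sample payload are assumptions invented for illustration, not the real ytmnd format.

```python
import json
from typing import Any, Iterator

# Assumed media suffixes; the channel only mentions wav, jpg and gif.
ASSET_SUFFIXES = (".wav", ".jpg", ".jpeg", ".gif")


def iter_asset_urls(node: Any) -> Iterator[str]:
    """Recursively walk parsed JSON and yield strings that look like media URLs."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_asset_urls(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_asset_urls(value)
    elif isinstance(node, str):
        path = node.split("?", 1)[0].lower()  # ignore any query string
        if node.startswith(("http://", "https://")) and path.endswith(ASSET_SUFFIXES):
            yield node


# Hypothetical payload, made up for this example only.
sample = json.loads("""
{
  "site": {
    "title": "demo",
    "sound": {"url": "http://example.invalid/a.wav"},
    "image": {"url": "http://example.invalid/b.gif?x=1"}
  }
}
""")

asset_urls = list(iter_asset_urls(sample))
```

The resulting URL list could then be handed to whatever tool writes the WARC, alongside the page URLs themselves.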
23:38 ^🔗	xmc	this sounds like a good job for the warrior
23:42 ^🔗		kristian_ has quit IRC (Leaving)
23:43 ^🔗	max	turns out if i just change the default from flash to html5, it seems to work fine now
23:44 ^🔗	max	less flashy since there's no status or anything, but it lets you see the site at least
23:47 ^🔗	DoomTay	Ha ha, flashy
23:51 ^🔗	ErkDog	omg I grew up on YTMND
23:51 ^🔗	nicolas17	ErkDog: what were the consequences? :P
23:52 ^🔗	ErkDog	max, is it PHP / MySQL?
23:53 ^🔗	ErkDog	guess I could read, lol
23:54 ^🔗	max	yeah common lamp stack
23:54 ^🔗	ErkDog	lol I mirrored this from YTMND 1,000 years ago
23:54 ^🔗	ErkDog	http://erkdog.netho.tk/picard/
23:55 ^🔗	ErkDog	YTMND was basically audio memes before memes were even a thing
23:55 ^🔗	ErkDog	well is*
23:55 ^🔗	ErkDog	it's like a meme w/ audio / animation except memes didn't exist back then
23:57 ^🔗	max	we just called them fads, very few of them had the staying power of something like dickbutt
