00:08 <SketchCow> Fan Fiction archiving now happening - 830gb of data.
00:15 <nitro2k01> Holy crap. A million monkeys on a million typewriters for a million years...
00:15 <nitro2k01> (would create more data than that, assuming the poor things could figure out how to type)
00:16 <chronomex> correct
00:16 <chronomex> 150T
00:16 <chronomex> er, that's one monkey at 50ish wpm for a million years
00:17 <chronomex> whatever
00:18 <SketchCow> Fuck those little bastards
00:18 <nitro2k01> My subtle hint was that the majority of the archived fanfiction would be written by the human equivalent of monkeys
00:27 <Coderjoe> I've run across fiction where the premise seemed interesting, but I couldn't finish it because I had to mentally rewrite every sentence in order to really make sense of it.
00:27 <Coderjoe> oh, and the number of times I've seen mixups beteen clothes/cloths and breathe/breath
00:27 <Coderjoe> argh. nothing like a grammar/spelling nitpick messing up in their complaint. :D
00:35 <arrith1> nice, ffnet needs a good archiving
00:46 <underscor> I have a really hard time reading fanfiction with typos
00:53 <SketchCow> I'm just savin' it, I ain't judgin' it.
00:53 <SketchCow> I've got the two partitions down to 83% and 49% so crisis over
00:54 <underscor> (this is the part where I find another TB of data just laying around somewhere
00:54 <underscor> )
00:54 <godane> i remember some really good firefly fanfiction on there
00:55 <SketchCow> Definitely one of those times I wish I could assign a boring task to an underling.
00:55 <SketchCow> http://archive.org/details/hackercons-notacon-2007
00:55 <SketchCow> Hundreds of hacker con speeches. Just have to type in names of the presenters, and the talks.
00:59 <godane> SketchCow: Famicoman beat you to this: http://archive.org/details/notacon4video
01:00 <SketchCow> Yeah, Famicoman made a non-described, blown-up pile of derived video
01:01 <SketchCow> Compare the information you get there to, say, http://archive.org/details/hackercons-notacon-2007-brickipedia
01:02 <godane> of course he used ftp to upload them
01:03 <godane> i see what you mean
01:03 <godane> if i had faster upload i may have put it all in one item with lots of .txt files for descs
01:03 <Coderjoe> ugh
01:03 <godane> i did that with mostly twit podcasts
01:03 <Coderjoe> all in one item
01:04 <godane> stuff like diggnation shouldn't have been done that way
01:05 <godane> my rule is to keep items under 5-6gb
01:05 <SketchCow> Yeah
01:05 <SketchCow> See, I wouldn't do it that way at all.
01:05 <SketchCow> Anyway, I'm doing such by re-doing them as you can see.
01:06 <godane> ok
01:06 <godane> then you remove the anarchivism ones
01:06 <SketchCow> I'm not really famous for letting others half-done jobs dictate my not doing it.
01:06 <SketchCow> No, they're adorable.
01:06 <SketchCow> I'm actually not allowed to.
01:06 <godane> ok
01:06 <Coderjoe> I kinda like one item per video/episode/talk/whatever. though I see the PDA vids were tossed up in two items
01:06 <SketchCow> Yes, which I had nothing to do with.
01:06 <SketchCow> I do one episode an item
01:06 <Coderjoe> i know
01:07 <SketchCow> http://archive.org/details/securityjustice
01:07 <SketchCow> See? One episode an item
01:07 <SketchCow> When I have a chance, I'll go back and inject their descriptions in.
01:08 <godane> i may have something to upload
01:08 <godane> firefly fanfiction audio drama
01:08 <SketchCow> http://archive.org/details/securityjustice-25
01:08 <SketchCow> Then they'll look like that.
01:10 <godane> with data de-duplication i don't think it matters how much its uploaded
01:10 <godane> or change to be neat
01:10 <Coderjoe> ia doesn't do dedup
01:10 <godane> it doesn't
01:10 <godane> but i thought it did
01:10 <SketchCow> It does not.
01:11 <godane> now i see storage is going to be a problem
01:11 <SketchCow> he's so adorable? can we keep him?
01:12 <godane> i just don't like 240gb of diggnation being on there like 20 times or something
01:13 <godane> of course there dedup maybe very hard since we are talking 1000s have hard drives
01:13 <godane> *of
01:14 <SketchCow> So adorable
01:16 <godane> i may have to do a panic download of the signal now
01:18 <godane> lots of audio podcasts: http://signal.serenityfirefly.com/mmx/series/
01:18 <DFJustin> shiny
01:19 <godane> i figure that i need to call on you guys to backup those podcasts
01:21 <godane> too much for my hard drives right now
01:21 <DFJustin> are they going anywhere soon
01:21 <godane> don't know
01:21 <SketchCow> Especially if you continue this terrible habit of shoving dozens of discrete episodes and broadcasts into one big gloppy item
01:22 <godane> but its been 8 seasons so fare
01:22 <godane> *far
01:22 <Coderjoe> some of it is not godane
01:25 <SketchCow> The Library of Congress, the Preserving Virtual Worlds Project, and a bunch of others have jumped into my project.
01:26 <arrith1> SketchCow: is http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012 / JSP2012 getting its own name, site and wiki?
01:39 <SketchCow> yes
01:39 <arrith1> SketchCow: why is there a focus one just one month?
01:39 <SketchCow> this is just a prelim scratchpad
01:40 <SketchCow> Ask that second question in english
01:41 <godane> looks like archive.org hates my firefly parody i have uploaded
01:43 <arrith1> SketchCow: why is it "30 days dedicated to solving a problem" which might mean actually solving the problem within that time, instead of making an organization within that time
01:43 <SketchCow> You know what the world doesn't need?
01:43 <SketchCow> Another organization
01:44 <arrith1> AT is kind of an org, but it works
01:44 <SketchCow> You say that
01:44 <SketchCow> But every time we have a vote, someone dies
01:44 <SketchCow> A child, usually
01:44 <SketchCow> Usually
02:00 <Famicoman> With you all as my witness, I am changing my ways
02:01 <solo> i regret to report that all videos removed by youtube users prior to july 1st 2012 have been irrevocably deleted
02:04 <solo> strangely, videos taken offline for copyright infringement have been preserved
02:08 <shaqfu> I don't have time to read all of textfiles.com right now
02:08 <shaqfu> But a friend, when I mentioned JSTP to him, said that there used to be a floating list of file formats on BBSes
02:08 <shaqfu> Is this still extant?
02:17 <SketchCow> Yes
02:18 <SketchCow> All this exists.
02:18 <shaqfu> Awesome
03:50 <S[h]O[r]T> am i the only one who doesnt understand this just solve the problem project
03:51 <S[h]O[r]T> i seem to understand from it, trying to gather people to figure out what file formats a bunch of random crap is in? or make something useful out of all the stuff that has been archived in some displayable format?
03:58 <balrog> S[h]O[r]T: the goal is to document as many formats as possible
03:59 <balrog> Figure out how to decode them and such
03:59 <balrog> How the actual data is stored in these files
04:01 <arrith1> i think it also extends to physical media. maybe like "how to solder your own kit to get data off a disc_x type disc"
04:03 <solo> or how to dump the firmware of your television
04:03 <balrog> arrith1: That's something I'm trying to work with with the discferret project
04:04 <balrog> solo: That's ... annoying because hardware needed to dump a lot of firmware is expensive and requires nasty proprietary software
04:05 <arrith1> balrog: wow discferret is wild
04:05 <arrith1> "The source code and CAD files for the DiscFerret design are completely open-sourced: the hardware and software are released under the GNU GPL (in the case of the board, microcode, and firmware) or the Apache Public Licence (in the case of the DiscFerret Hardware Access Library)"
04:06 <solo> did anyone archive revver or livevideo?
04:06 <balrog> arrith1: We're looking for help software side
04:07 <balrog> If anyone's good at software architecture and willing to help, and has time, stop by the IRC
04:07 <balrog> :)
04:08 <arrith1> discferret plus a big archive of fileformat info could make quite the killer ArchiveTeam member disaster kit
04:10 <balrog> The software we're starting work on is intended to handle other data sources too :)
04:11 <balrog> Unfortunately we're just starting out and I'm not all that good at designing it yet
04:15 <arrith1> balrog: the software, hardware, or both?
04:15 <balrog> Software.
04:15 <arrith1> ah
04:16 <balrog> Hardware is pretty solid, if not a bit slow. That's going to be fixed with a hardware revision, but if you're interested rev-1 is available now.
04:16 <balrog> Another thing we're working on fixing is a somewhat high price
04:16 <balrog> (thanks to a lot of components, a slightly overdesigned power supply, and hand assembly)
04:27 <arrith1> balrog: yeah high prices would be good to fix. i'm totally hw ignorant but maybe there's some way to use more commodity components? there are lots of arduinos and raspberry pi competitors
04:28 <balrog> Well you have to record a stream of data at a high rate. Current design is based on an FPGA and a microcontroller and thats how it will be. But the current power stuff is somewhat overkill
04:28 <balrog> Most drives don't need 2A output :)
04:29 <balrog> I have to see if it's feasible to power a drive externally and how much that would reduce the cost
04:29 <balrog> It's nice though to have a single unit that can power both itself and the drive
04:32 <arrith1> ah yeah, almost like an external hdd case all wrapped up
04:32 <arrith1> dang fpgas are always expensive
04:38 <balrog> The fpga isn't the worst
04:39 <balrog> It's $12 or so
04:39 <balrog> The USB 2.0 microcontroller will be about $6-$7
04:39 <balrog> You get nickel and dimed to death on the smaller parts.
05:03 <Coderjoe> balrog: and the memory?
05:03 <balrog> We found a somewhat cheaper source.
05:04 <balrog> We're thinking of doing an sdram based design. Would mean much more memory at a lower price at an increase in microcode complexity
05:04 <balrog> (need an sdram controller)
05:06 <arrith1> balrog: oh that's pretty good
05:09 <balrog> The current price is around $250 for a fully assembled board which I feel is a bit much
05:09 <balrog> I'd like to get it toward $100, hopefully to $150 if not lower
05:10 <DFJustin> kryoflux recommends you power the drive separately and they sell an adapter for that purpose
05:11 <DFJustin> I used a gutted 3.5" HDD enclosure for power
05:13 <balrog> DFJustin: The discferret power components can power 2-3 3.5" drives easily as it is now
05:13 <balrog> Which I feel is overkill
05:13 <balrog> It's overdesigned. It's extremely robust but I don't think that's necessary.
05:14 <balrog> Half the capacity would still power even a 5.25" drive
05:15 <balrog> The kryoflux just pulls power off the 5V USB
05:16 <joepie91> this seems relevant here:
05:16 <joepie91> Google Video stopped taking uploads in May 2009. Later this summer we'll be moving the remaining hosted content to YouTube. Google Video users have until August 20 to migrate, delete or download their content. We'll then move all remaining Google Video content to YouTube as private videos that users can access in the YouTube video manager. For more details, please see our post on the YouTube blog.
05:17 <balrog> joepie91: Link?
05:17 <joepie91> tl;dr google video videos will become unavailable for public viewing unless the uploader specifically makes it public
05:17 <joepie91> http://googleblog.blogspot.nl/2012/07/spring-cleaning-in-summer.html
05:17 <balrog> I see
05:17 <balrog> UGH why
05:18 <joepie91> no idea :/
05:18 <balrog> If they were public before they should stay as such
05:18 <joepie91> but that seems like a LOT of potential for huge data loss
05:18 <balrog> (I think)
05:18 <balrog> Yeah :(
05:18 <joepie91> or rather, public data loss
05:18 <joepie91> and I mean *huge*
05:18 <balrog> SketchCow: ^
05:19 <balrog> DFJustin: Anyway my point was that maybe we don't even need to have drive power support. Will have to check how much extra cost that adds.
05:19 <arrith1> i linked stuff about that earlier :)
05:19 <balrog> Ah...
05:19 <arrith1> he's going to check with archive.org people, since i guess archive.org has been working on youtube
05:19 <arrith1> if archive.org doesn't get it all then i guess AT can spring into action
05:22 <p4nd4> O hai
05:23 <joepie91> ohai
05:24 <joepie91> arrith1: alright
05:24 <SketchCow> -bs
05:25 <SketchCow> Wow, you filled 5 screens with discussion of hardware
05:25 <arrith1> oops
05:25 <arrith1> wait, well it's sorta related to the just solve it stuff, which is kind of #archiveteam related
05:26 <arrith1> but yeah k -bs
05:28 <SketchCow> It's only sort of related
05:28 <SketchCow> -bs
05:29 <arrith1> SketchCow: is archive.org doing for Google Video what they did for stage6?
05:29 <p4nd4> Stage6 was awesome
05:33 <SketchCow> Archive.org didn't do stage6, we did
05:33 <SketchCow> One of us did.
05:33 <Coderjoe> i did
05:33 <Coderjoe> i wish I had gotten more of it
05:34 <Coderjoe> particularly more user-generated content, as opposed to all those music videos, tv shows, and movies :(
05:35 <SketchCow> I am torn on the google video
05:36 <SketchCow> I'll spend another day thinking about it.
05:45 <joepie91> also, for those that missed it - meebo is shutting down on july 11, instructions for downloading your recorded chatlogs for your meebo account until that date are available at http://www.meebo.com/support/article/175/
05:45 <joepie91> lots of big things shutting down lately :(
05:47 <joepie91> on that note - SketchCow, does archiveteam keep some kind of RSS feed that provides a list of services that will be shut down soon?
05:47 <joepie91> or similar
05:47 <joepie91> (preferably including archival status, of course :)
05:48 <arrith1> joepie91: there are pages for that on the wiki
05:48 <arrith1> mainly deathwatch i think
05:48 <arrith1> and the frontpage
05:48 <joepie91> alright, but is there some kind of feed that can for example be automatically retrieved?
05:48 <joepie91> I can think of some interesting things to do with that
05:49 <arrith1> one could cobble together a script that looks for changes to specific portions of the site from the overall wiki changes rss feed
05:49 <arrith1> i'm not aware of something that does that currently
05:49 <joepie91> hrm.. that would be hacky, and probably break when the page layout changes :|
05:49 <arrith1> yep
05:49 <arrith1> wikis are tricky like that ;/
06:02 <Nemo_bis> what portions of the site? of course there are solutions
06:04 <Nemo_bis> do you just want something like this? http://archiveteam.org/index.php?title=Deathwatch&feed=atom&action=history
06:04 <Nemo_bis> otherwise there's plenty of IRC-RC based services
06:06 <joepie91> Nemo_bis: no, that is literally just a feed of changes
06:06 <joepie91> I mean a feed that announces new site clousers
06:06 <joepie91> closures *
06:06 <Nemo_bis> a feed doesn't announce anything
06:06 <joepie91> sigh
06:07 <joepie91> ..
06:07 <joepie91> a feed that has as its items newly announced site closures
06:07 <Nemo_bis> so this is not "portions of the wiki"
06:07 <joepie91> no
06:07 <Nemo_bis> anyway you can construct it from the feed, or make the wiki page machine-readable
06:08 <joepie91> I never said anything about the wiki, it was arrith1 coming up with that suggestion
06:08 <joepie91> yes, which would break if the page layout changes
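A minimal sketch of the polling idea arrith1 and Nemo_bis describe, assuming only curl and diff are available; the feed URL is the Deathwatch history feed Nemo_bis pasted above, the filenames are made up, and a real version would still need to turn raw feed diffs into the closure announcements joepie91 wants:

    #!/bin/sh
    # Hypothetical poller: re-fetch the Deathwatch history Atom feed and show what changed.
    FEED='http://archiveteam.org/index.php?title=Deathwatch&feed=atom&action=history'
    touch deathwatch-old.xml                    # empty baseline on the first run
    curl -s "$FEED" -o deathwatch-new.xml       # grab the current feed
    diff deathwatch-old.xml deathwatch-new.xml  # new entries show up here for re-announcing
    mv deathwatch-new.xml deathwatch-old.xml    # latest copy becomes the next baseline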
06:09 <arrith1> i refer to the wiki since it's basically the only place info is, besides say live irc channels
06:12 <Coderjoe> or the AT twitter account
06:12 <Coderjoe> @archiveteam
06:13 <Coderjoe> but that's generally after details have been worked out and grunts are needed.
06:13 <arrith1> ah yeah
06:15 <arrith1> joepie91: what were you thinking of using the feed for?
06:18 <SketchCow> No feed
06:18 <SketchCow> Should be fixed? Yes.
06:23 <joepie91> arrith1: I had a few ideas, actually
06:23 <joepie91> mailing list, widget, possibly irc integration
06:24 <joepie91> anything else I can think of
06:24 <joepie91> just a way for people to easily keep track of services that are shutting down, that look less intimidating to the average user than a wiki page
06:24 <arrith1> having a twitter account specifically for sites confirmed going down could work. then have a bot announce that in irc, etc
06:26 <joepie91> this is actually interesting:
06:26 <joepie91> Archived but not available
06:26 <joepie91> Google Video
06:26 <joepie91> http://www.archiveteam.org/index.php?title=Archives
06:27 <joepie91> does that imply being fully archived?
06:27 <joepie91> or only partially?
06:27 <joepie91> arrith1: that would limit you to very short messages though
06:28 <arrith1> joepie91: it would. but at the end of the short message maybe have a url to a special part of the wiki with specially formatted messages or something
06:29 <aggro> IIRC, it was partially archived, back when Google first said they were going to just delete all of the videos. Apparently the AT bandwidth hive was too much even for Google, and they caved :P Now it looks like they're just moving videos over. Making them private, but not deleted.
06:30 <arrith1> joepie91: it's on some archive.org servers somewhere i think. but GV seemed to back down and said they were keeping the site up.
06:30 <arrith1> joepie91: but now that recent announcement of GV coming down, that's probably going to be reevaluated
06:30 <aggro> (12:47:55 AM) SketchCow: I am torn on the google video
06:30 <aggro> (12:48:14 AM) SketchCow: I'll spend another day thinking about it.
06:33 <joepie91> mmm
06:33 <joepie91> arrith1: may be better to have a dedicated page without all the wiki overhead
06:34 <arrith1> joepie91: making something community accessible without a wiki gets tricky. i mean you could do like hg/git but that's quite a barrier to entry vs a wiki in terms of novice users
06:35 <joepie91> hm.
06:35 <joepie91> I'll have a think about it.
06:39 <joepie91> also, on an unrelated note, I've heard some people on various irc networks complain about certain stories getting removed from fanfiction for some reason
06:39 <joepie91> does anyone know more about that?
06:40 <arrith1> joepie91: people in #fanfriction might know
07:16 <ersi> Even though Google will move all the videos over to YouTube (they say at least) - I'm a bit in the mood to try to download it anyhow
07:17 <ersi> I mean, we ate MobileMe (even though I guess largely thanks to Kenneth/Heroku)
07:17 <Coderjoe> https://github.com/ArchiveTeam/googlegrape iirc
07:18 <ersi> yeah
07:18 <C-Keen> what's your preferred tool to archive an entire site? wget? which magic options do you use?
07:18 <Coderjoe> though that was pre-warc
07:18 <ersi> Coderjoe: wget with WARC support. The last part *is* important :)
07:19 <Coderjoe> C-Keen: wget. options depend on site, but we like warc
07:19 <Coderjoe> ersi: wrong target
07:19 * C-Keen looks up warc
07:19 <ersi> WARC is Web Archive format, it saves the HTTP Request + Response. It's a format used by the largest Archive places.
07:19 <ersi> Coderjoe: wrong target?
07:20 <C-Keen> ersi: I see
07:20 <Coderjoe> ersi: i think you meant C-Keen not me
07:20 <ersi> Ah, I totally missed that I tab-completed to you instead of C-Keen :p
07:21 <Coderjoe> figured
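A minimal sketch of the kind of invocation ersi and Coderjoe mean, assuming a WARC-capable wget build (trunk at the time); the URL and WARC filename are placeholders and, as Coderjoe says, the real options depend on the site:

    # Recursive mirror that also writes every request/response pair to a WARC
    wget --mirror --wait 1 \
         --warc-file=example-site --warc-cdx \
         "http://example.com/"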
07:21 <C-Keen> ok so I shall build a wget from trunk...no problem.
07:23 <Coderjoe> has the gnulib build stopper been fixed?
07:23 <ersi> you could take a short cut and use a get-wget-warc.sh script from.. I can't remember which project is the freshest.. I think it might be MobileMe/MeMac - that'll make a wget-warc version that works very easily (I tried compiling wget-trunk a month ago.. didn't end well :P)
07:24 <Coderjoe> i know misty mentioned a patch, but i don't know if it was accepted yet
07:25 <Coderjoe> yeah i think memac was the latest update
07:25 <Coderjoe> in order to get the regex support
07:27 <C-Keen> let's see
07:31 <ersi> C-Keen: Script's available at https://github.com/ArchiveTeam/mobileme-grab
07:31 <ersi> You want the "get-wget-warc.sh" one :-)
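Roughly, the shortcut ersi suggests; the repository URL and script name are the ones mentioned above, while the exact behaviour of the script (fetching and building a WARC-capable wget next to the grab scripts) is an assumption:

    git clone https://github.com/ArchiveTeam/mobileme-grab.git
    cd mobileme-grab
    ./get-wget-warc.sh   # builds a wget binary with WARC support for the grab scripts to use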
07:36 <C-Keen> ersi: trunk built
07:37 <ersi> with the above script? or by itself? :)
07:38 <C-Keen> by itself
07:39 <ersi> neat!
07:39 <C-Keen> hm, I wonder whether I should tell wget to rewrite links so I can view the site locally
07:40 <ersi> no, don't do that
07:40 <Coderjoe> you can. wget will save the unmodified version to the warc
07:40 <ersi> oh, nice
07:40 <Coderjoe> (and it does the modification of the files at the end of the run anyway)
07:40 <ersi> otherwise I'd use https://github.com/alard/warc-proxy to proxy the content of the WARC :)
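A hedged sketch of what Coderjoe describes: link rewriting for local viewing can be combined with WARC output in one run, since the WARC records are written from the unmodified responses. The options and names below are illustrative, not a recommendation from the channel:

    wget --mirror --convert-links --adjust-extension \
         --warc-file=example-site \
         "http://example.com/"
    # Files on disk get rewritten links for local browsing;
    # the records inside example-site.warc.gz stay as served.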
07:43 <C-Keen> also the site I want to archive is using some kind of blog software so it contains links to page.html?p=1234. In previous attempts this turns out to be broken as the pages get downloaded as "page.html?p=1234" but of course the browser will always load the "page.html"
07:43 <C-Keen> how do you deal with this?
07:45 <p4nd4> Those HTML endings are probably PHP or ASP rewritten by an apache module C-Keen
07:46 <p4nd4> Oh actually, nevermind
07:46 <p4nd4> Wrong channel
07:47 <p4nd4> What blog software is it using? If you can find it you can check how URL's get rewritten and perhaps revert it because the original URLs should still work
07:58 <C-Keen> good question
07:58 <C-Keen> I will investigate
07:58 <p4nd4> :)
07:58 <p4nd4> Wappalyzer
07:59 <brayden> This is terribly interesting.
08:00 <C-Keen> Wappalyzer?
08:00 <C-Keen> ah cool
08:01 <ersi> C-Keen: Nothing wrong with getting "page.html?p=X" pages. As long as the content differs and have some other meaning than just page.html.. One can always do a rewrite serverside if you want to present it later
08:02 <C-Keen> ersi: ack. I just hoped for some already existing magic to do so
08:02 <C-Keen> p4nd4: heh wappalyzer is cool, it says some wordpress cms
08:02 <ersi> I mean, from the archiving Point of View, there's nothing wrong with saving them down as "page.html?p=1234"
08:03 <ersi> And there's solutions for dealing with that, if you want to present that material later as well :)
08:06 <C-Keen> heh archiving entire sites feels good ;)
08:06 <ersi> It sure does!
08:08 <C-Keen> now to something completely different. If I want to help on archiving huge sites, I can run the archive team warrior. but I am connected with asymmetric DSL which means I get 6Mbit/s down but only 200KB/s up, so while downloading gigabytes is fast getting these gigabytes off my machines will take (almost) forever
08:08 <ersi> indeed, unfortunately that's the case for many
08:14 <p4nd4> In theory you could find an injection vulnerability
08:15 <p4nd4> And clone their database
08:15 <p4nd4> and set up your own wordpress site with a cloned database
08:15 <p4nd4> That way you'd have an exact copy
08:15 <brayden> lol
08:15 <brayden> that's evil
08:15 <p4nd4> It's evil if you cause harm
08:15 <brayden> and also difficult
08:15 <p4nd4> not difficult
08:15 <ersi> No, you'd have an exact copy of the internal state. Not the external one
08:15 <ersi> You'd miss all static content and graphical representation
08:16 <brayden> You'd have to the important part, the posts table.
08:16 <brayden> and you can crawl *.jpg,*.png etc. at a later stage
08:16 <p4nd4> You could clone the whole database, including settings, posts, comments
08:16 <p4nd4> Everything
08:16 * ersi sighs and rolls his eyes
08:16 <p4nd4> Do you use some backup software to clone stuff btw?
08:17 <p4nd4> Like a crawler to recreate pages and links?
08:17 <brayden> lol
08:17 <brayden> could use wget with spider
08:17 <brayden> but that only checks if links exist.
08:17 <brayden> Or download it, and parse it for links.
08:18 <brayden> wget can user-agent spoof, as can curl as well
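A small sketch of the two things brayden mentions; with --spider the pages are only checked for existence rather than saved, and the user-agent string here is invented:

    # Recursive link check, two levels deep, nothing kept on disk
    wget --spider -r -l 2 "http://example.com/"
    # The same check presenting a browser-like user agent
    wget --spider -r -l 2 --user-agent="Mozilla/5.0 (compatible; hypothetical-crawler)" "http://example.com/"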
08:18 <p4nd4> Yes that's what I meant, you could crawl it, find all in-links, follow them and crawl them as well for in-links
08:18 <p4nd4> And map them to each other
08:18 <p4nd4> But it'd be static
08:18 <brayden> Depends on the content I guess.
08:19 <brayden> If it is a personal blog of some sort then maybe the post thumbnails etc. don't matter quite as much as the words.
08:19 <p4nd4> I'm just wondering if you're using a software already
08:19 <p4nd4> or if maybe that'd be a good project for me to start writing?
08:19 <brayden> AFAIK there's a lua script that is adapted for archiving.
08:19 <p4nd4> Ah alright
08:19 <C-Keen> it's already a mess with all these links to some CDN, as you cannot distinguish data belonging to the site from other things anymore
08:19 <brayden> :(
08:19 <C-Keen> in the past people just hosted their stuff on their servers
08:20 <p4nd4> :(
08:20 <p4nd4> Yeah
08:20 <brayden> can't you just include only the CDN's links? or are they just IPs, not hostnames?
08:20 <C-Keen> brayden: but how do you know that some aws.amazon.com is essential for the content?
08:21 <brayden> I don't know but it is probably more essential than random hotlinks.
08:21 <p4nd4> There are also many kinds of CDNs
08:21 <p4nd4> as well as private CDNs
08:21 <C-Keen> true
08:21 <p4nd4> Hard to distinguish
08:21 <C-Keen> then there is hacker news which is another bag of hate as they create dynamically expiring links on their pages *grrr*
08:22 <Coderjoe> wget has a page requisites option
08:22 <C-Keen> Coderjoe: true
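A sketch of the option Coderjoe is referring to, with host-spanning limited to an explicit list so assets on a CDN host can come along without the crawl wandering off across the web; every hostname here is a placeholder and the exact option mix would depend on the site:

    wget --recursive --no-parent --page-requisites --convert-links \
         --span-hosts --domains=example.com,cdn.example.net \
         --warc-file=example-blog \
         "http://example.com/blog/"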
08:22 <Coderjoe> there is also the lua option, at least if using the picplz version of wget-warc-lua
08:23 <C-Keen> lua?
08:23 <brayden> lua is a scripting language
08:23 <Coderjoe> a scripting language. with the lua addition to wget, you can write a hook script for generating the list of links for wget to crawl
08:23 <brayden> Very simple to use.
08:23 <C-Keen> I know lua I don't see the connection
08:23 <C-Keen> ah
08:23 <brayden> oh
08:24 <p4nd4> I've done that with a PHP script as well, fetch a site and crawl all in-links
08:24 <Coderjoe> the hook function would be passed the page that was just downloaded, and it can parse it and return a list of links
08:24 <Coderjoe> (for examples, you can see the picplz usage)
08:24 <p4nd4> I was just thinking, instead of just stripping it of content and taking the links it could store the page as well, then fetch all sites it links to, and save them as well, and replace all links to link to the stored versions
08:25 <p4nd4> And do that recursively
08:25 <p4nd4> But the problem would be CDNs and remotely included scripts etc
08:25 <C-Keen> yep
08:25 <Coderjoe> p4nd4: but the wget-warc-lua solution allows you to add it all into one warc file during a single run
08:25 <p4nd4> Oh
08:25 <p4nd4> I'm not very familiar with archiving, I'm just brainstorming
08:25 <p4nd4> :)
08:26 <C-Keen> why is archiving the headers important as well?
08:26 <Coderjoe> and with your recursive link following, at least without some sort of limitation, you will wind up trying to download all of the intarwebs
08:27 <p4nd4> no
08:27 <p4nd4> I said in-links
08:27 <p4nd4> As in, links within the same domain
08:27 <p4nd4> :c
08:27 <Coderjoe> the reason for warc files is because that is what the wayback machine takes
08:27 <p4nd4> Ahh
08:27 <brayden> lol
08:27 <brayden> Once I decided to use Xenu's link sleuth on Google
08:28 <brayden> Even with a fairly small depth I still ended up downloading the internetz
08:28 <Coderjoe> i saw no mention of "in-links" in your description
08:29 <Coderjoe> I had a friend that was looking to mirror one of the gamespy-hosted sites and wound up trying to download all of gamespy on my isdn connection
08:29 <brayden> :o
08:29 <Coderjoe> (he only had dialup at the time, so he used the ssh access into my server to do this)
08:30 <Coderjoe> simply with a forgotten wget option
08:31 <Coderjoe> I ended up creating a wget group, putting wget in that group, setting it 0750, and omitting him from that group
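Roughly what that lockdown looks like, run as root on the shared server; the path and group name follow Coderjoe's description, the exact commands are an assumption:

    groupadd wget                  # dedicated group for the binary
    chgrp wget /usr/bin/wget       # hand the binary to that group
    chmod 0750 /usr/bin/wget       # users outside the group can no longer run it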
08:33 <Coderjoe> while this discussion has been generally on-topic, I would like to point out that there is an #archiveteam-bs channel for offtopic chatter
08:34 <Coderjoe> and with that, I am going to get some sleep
08:34 <C-Keen> sorry
08:34 <C-Keen> good night
08:41 <ersi> Well, this was borderline off-topic - I'd say it's mostly on topic
08:41 <ersi> no need for the sorry :)
08:43 <p4nd4> Coderjoe: "(10:24:17 AM) p4nd4: I've done that with a PHP script as well, fetch a site and crawl all in-links" :p
08:43 <ersi> shrug
13:15 <alard> mistym: Maybe you've found it already, but a real solution to the Wget bootstrap problem is to remove the line $build_aux/missing from bootstrap.conf.
13:16 <mistym> alard: Yep, I noticed - thanks! The problem was patched upstream in gnulib, so doing a bootstrap-sync works too.
13:17 <alard> What is bootstrap-sync?
13:17 <mistym> It replaces the package's copy of the bootstrap script with gnulib's copy.
13:17 <alard> It's the bootstrap.conf file in the Wget repository that needs fixing, as far as I can see.
13:18 <mistym> Maybe I'm remembering wrong. Anyway - thanks!
13:32 <SketchCow> Ha ha, some archivists are coming out of the woodwork to criticize Just Solve the Problem.
13:35 <SketchCow> No, I am not a new registry in "competition" for "mindshare" on the issue. I am a chaos agent, like we've been with Archive Team (different than archive.org, by the way), turning the theoretical and the progressive into the real. When Archive Team started, people sniffed how we were using WGET instead of some properly standards compliant web archive format. Within a short time, WE CHANGED WGET TO SUPPORT WARC. And I can assure you, our ability to
13:40 <X-Scale> It's just a shame the old look and feel of Google Groups is going away. I wish there was an efficient way of preserving it (for possible use in future projects).
13:43 <SketchCow> http://blogs.loc.gov/digitalpreservation/2012/07/rescuing-the-tangible-from-the-intangible/ - that'sa lotsa cds
14:00 <godane> SketchCow: All 2006 episodes of dl.tv is up
14:00 <godane> all of 2005 but episode 6
14:01 <godane> the 2005 ones was at great risk of being delete forever
14:02 <godane> everything past episode 30 i got from mevio
15:01 <balrog> SketchCow: Sorry! I was trying to keep the discussion to the formats and the software but arrith1 started asking hardware questions :) anyway I want to keep the focus on the software.
15:01 <balrog> Software for decoding stuff in general. Not limited to floppy disks
16:11 <SketchCow> If an Archive Team member wanted to attend http://www.digitalpreservation.gov/meetings/ndiipp12.html - I'd endorse you
16:52 <underscor> Oh, man. I'll be in Vegas
16:52 <underscor> Partying it up with SketchCow
17:06 <qbc> the big casinos with the odds-fixes are in the city of london and on wall street though...now those are places to par-TAY
18:47 <arkhive> More Google Spring Cleaning. Including Google Video. http://googleblog.blogspot.com/2012/07/spring-cleaning-in-summer.html
18:48 <C-Keen> yep
18:48 <arkhive> "Google Video users have until August 20 to migrate, delete or download their content. We'll then move all remaining Google Video content to YouTube as private videos that users can access in the YouTube video manager."
18:49 <C-Keen> so a lot of stuff will just stay inaccessible if noone shows up to republish it?
18:49 <C-Keen> am I reading this correctly?
18:49 <arkhive> Should we start downloading them again? It makes sense to since most videos (I assume) will be private.
18:49 <arkhive> ya
18:49 <arkhive> That's how I read it.
18:50 <C-Keen> well then...
18:51 <arkhive> you can make them public if you'd like.
18:51 <arkhive> Last link: http://youtube-global.blogspot.com/2012/07/google-video-content-moving-to-youtube.html
18:52 <arkhive> What does everyone else think about this?
18:53 <SmileyG> they are keeping your content. This is a *good* thing. They have given like a years notice, this is a *good* thing.
18:53 <SmileyG> aren't most google video, videos private anyway?
18:54 <omf_> Google has never released hard numbers on public vs private videos so it is really unknown
18:54 <SmileyG> wtf. Go to google.com/videohp ; click "I'm feeling lucky" without typing anything -> goes to the doodles page o_O
18:55 <SmileyG> Infact, you seem unable to search google video?
19:02 <arkhive> site:video.google.com bird
19:02 <arkhive> into the search field
19:03 <arkhive> replace bird with desired
19:03 <SmileyG> yah :<
19:15 <arkhive> So, Good.net , iWork.com , possibly Google Video , and the stuff listed on the 'deathwatch' wiki page.
19:15 <qbc> video.google's an issue to corporatists who wish more control over truths which detract from their image .. in that regard, a migration is no surprise--i've wondered why they weren't quicker in their fascism in fact
20:27 <Coderjoe> SketchCow: isn't criticism what archivists do best?
20:28 <mistym> Coderjoe: I think debating over how best to approach solving the problem, without ever solving the problem, is what archivists do best
20:28 <mistym> Archivists are also very good at criticizing archivists
20:28 <BlueMax> mistym also apparently archivists are politicians
20:28 <BlueMax> Jason Scott for President of Earth 3012
20:35 <mistym> Why wait 1000 years?
20:46 <Nemo_bis> to prove he's been archived well
20:46 <Nemo_bis> and digitally preserved in a good format
20:46 <Nemo_bis> thanks to his projects
21:09 <nitro2k01> SmileyG: Typing anything brings up results with default settings, so "I'm feeling lucky" is effectively dead
21:15 <_case> Coderjoe & mistym: it's an academia thing
21:15 <chronomex> stupid academics
21:16 <mistym> _case: I am not in academia and I can say for sure I see it outside academic archives!
21:18 <_case> mistym: oh for sure. just saying archivists [speaking as one] are at their core - academics [for better or worse].
21:19 <mistym> This is true. (Also speaking as one.)
21:24 <_case> my god. they got you too.
21:27 <Tephra> as someone who got my foot in both academia and archiving i would agree somewhat
21:38 <joepie91> <SmileyG> aren't most google video, videos private anyway?
21:38 <joepie91> possibly - the problem here is that those that are public (sometimes for very good reason) also become private
23:18 <arkhive> My connection was lost..Is there any place I can look back at the chat log to see if anyone else is interested in those projects?
23:20 <arkhive> found it..nevermind