04:33 <ArtimusAg> Not sure if off-topic, but if this is relevant to current projects: I've started personally archiving VGMusic.com's MIDIs and organizing them per game, since there doesn't seem to be a proper archive of its contents
04:40 <chronomex> cool
06:13 <Vito``> tef: from my experience running private bookmarking and caching/archiving services (we donated the CSS parsing code to wget), you increasingly need a "real browser" to do a good job of caching/archiving a page/site.
06:14 <Vito``> tef: I actually work off of three different "archives" for any site we cache: we take a screenshot, we cache it with wget, and we're working on caching a static representation as captured from within the browser
06:14 <Vito``> none can feed back into wayback machine yet, but it's on the to-do list
06:29 <Coderjoe> that's the curse of "Web 2.0" designs :-\
06:47 <instence> The biggest problem right now is sites that use AJAX calls to dynamically load data. wget doesn't have a JavaScript interpreter, so when it hits pages like this it just sees an empty DIV and goes nowhere.
06:49 <Vito``> yeah, I expect to completely replace wget with phantomjs at some point
06:50 <Vito``> well, except for single-file mirroring, like a PDF or something
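A minimal sketch of the headless-browser approach Vito`` is describing, not his actual setup: it writes a small PhantomJS helper (save_page.js and the example URL are made up for illustration) that loads a page, lets its JavaScript and AJAX calls run, then saves the rendered DOM and a screenshot.

cat > save_page.js <<'EOF'
// Load a page in PhantomJS, let its JavaScript run, then save the rendered DOM and a screenshot.
var page = require('webpage').create();
var fs = require('fs');
var system = require('system');
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('failed to load ' + url);
        phantom.exit(1);
    }
    // give AJAX-loaded content a moment to arrive before capturing
    setTimeout(function () {
        fs.write('page.html', page.content, 'w');  // DOM as the browser sees it after scripts ran
        page.render('page.png');                   // screenshot of the rendered page
        phantom.exit(0);
    }, 3000);
});
EOF
phantomjs save_page.js 'http://example.com/some-ajax-heavy-page'

This is roughly why a "real browser" capture sees content that wget never does: the HTML wget saves is the pre-JavaScript page, while page.content here is the DOM after the scripts have filled in those empty DIVs.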
09:43 <godane> chronomex: if vgmusic.com is just MIDI files then I may look at archiving it so we have a full warc.gz of it
09:51 <BlueMaxim> godane: it pretty much is, to my knowledge
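For reference, a rough sketch of the kind of wget WARC mirror godane is talking about; the flags are standard wget, but the exact invocation is an assumption, not what he actually ran.

# recursive mirror of vgmusic.com, page requisites included, everything recorded into a WARC
wget --mirror --page-requisites --adjust-extension \
     --warc-file=vgmusic.com-$(date +%Y%m%d) --warc-cdx \
     -o wget.log http://www.vgmusic.com/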
13:11 <hiker1> I am trying to download a site that uses assets on a subdomain. I used --span-hosts and --domains, but now it's making a duplicate copy of the site under the www. hostname. I set -D to include tinypic.com so that it would download hotlinked images, but it seems to have downloaded some of tinypic's web pages too.
13:20 <ersi> AFAIK the images from tinypic are hosted on a subdomain
13:20 <ersi> like i.tinypic.com or something like that
13:23 <hiker1> yes
13:24 <hiker1> But how do I tell it not to access ^tinypic.com and only access *.tinypic.com?
13:25 <hiker1> --domains and --exclude-domains don't appear to accept wildcards or regexes
13:30 <schbirid1> correct
13:49 <hiker1> How can I avoid downloading from the wrong domain, then?
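One hedged workaround: -D/--domains does suffix matching rather than wildcards, so listing the concrete image host (e.g. i.tinypic.com) instead of the whole registrable domain keeps tinypic.com's own pages out of the crawl. example.com here stands in for the site actually being mirrored.

# span hosts, but only onto the listed hostnames; "i.tinypic.com" matches the image
# subdomain without also admitting tinypic.com or www.tinypic.com
wget --mirror --page-requisites --span-hosts \
     --domains=example.com,www.example.com,i.tinypic.com \
     http://example.com/

If the host names aren't predictable, newer wget builds (1.14 and later) also have --accept-regex/--reject-regex for filtering whole URLs, which may be the cleaner fix.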
17:54 <SketchCow> Hooray, Boxing Day
18:15 <Nemo_bis> SketchCow: would it be useful to email you a list of the magazines (searches/keywords) I uploaded, so that when you have time you can create collections, darken them, or do whatever you like with them?
18:29 <godane> SketchCow: I'm up to 2011.08.31 of Attack of the Show
18:29 <godane> also I'm uploading the vgmusic.com warc.gz right now
19:01 <hiker1> godane: Andriasang.com appears stable now, if you were still willing to try to grab a copy
19:08 <godane> I'm grabbing it
19:09 <godane> I grabbed the articles
19:09 <godane> but the images I will have to try next
19:12 <godane> uploaded: http://archive.org/details/vgmusic.com-20121226-mirror
19:48 <godane> hiker1: http://archive.org/details/andriasang.com-articles-20121224-mirror
19:58 <hiker1> godane: What commands did you use to mirror the site?
20:20 <godane> I made an index file first
20:20 <hiker1> using what command?
20:21 <godane> wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log
20:22 <godane> I had to do it this way because there were way too many images to mirror the whole thing
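The command again with the flags annotated; the annotations are added here, the command itself is godane's, and $website is assumed to hold the site name.

# -x               : force a directory hierarchy even for single-file downloads
# -i index.txt     : read the list of URLs to fetch from index.txt
# --warc-file=...  : record every request/response into a WARC named after the site and date
# --warc-cdx       : also write a CDX index alongside the WARC
# -E               : --adjust-extension, save text/html pages with an .html suffix
# -o wget.log      : write wget's log to a file instead of the terminal
wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log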
20:22
🔗
|
hiker1 |
You append http://andriasang.com to that command, right? |
20:23
🔗
|
godane |
its all from http://andriasang.com |
20:23
🔗
|
hiker1 |
And will that grab the html articles, just not the images? |
20:24
🔗
|
godane |
*I had to add http://andriasang.com to all urls since there local urls |
20:25
🔗
|
hiker1 |
I don't understand what you mean by that. How did you add it to all the urls? |
20:25
🔗
|
godane |
with sed |
20:26
🔗
|
hiker1 |
to start, that first command grabs all the html files, correct? |
20:26
🔗
|
godane |
when i may my index.txt file from a dump of the pages you get urls without http like this: /?date=2007-11-05 |
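godane doesn't show the exact sed invocation; a minimal sketch that would do what he describes, turning the site-relative urls in index.txt into absolute ones, might look like this.

# prefix every site-relative URL (e.g. /?date=2007-11-05) with the host,
# giving http://andriasang.com/?date=2007-11-05
sed -i 's|^/|http://andriasang.com/|' index.txt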
20:26
🔗
|
hiker1 |
and ignores images because you did not use --page-requisites |
20:27
🔗
|
hiker1 |
Am I correct in saying that? |
20:28
🔗
|
godane |
i just grabbed what was listed in my index.txt |
20:28
🔗
|
godane |
there is one image in there |
20:28
🔗
|
godane |
http://andriasang.com/u/anoop/avatar_full.1351839050.jpg |
20:28
🔗
|
hiker1 |
Does running this command save html files, or just save an index? `wget -x -i index.txt --warc-file=$website-articles-$(date +%Y%m%d) --warc-cdx -E -o wget.log` |
20:29
🔗
|
godane |
it saves html files |
20:29
🔗
|
godane |
i got the index.txt file from another warc of the pages |
20:29
🔗
|
DFJustin |
ultraman cooking what |
20:30
🔗
|
hiker1 |
godane: Could you explain that? How did you get the index file to begin with? |
20:32
🔗
|
godane |
i think i grabed it by: zcat *.warc.gz | grep -ohP 'href='[^'>]+' |
20:33
🔗
|
godane |
i did this to my pages warc.gz |
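Written out fully, that extraction step might look like the sketch below. It assumes the article links are single-quoted hrefs, as in godane's pattern, and that pages-*.warc.gz is the page-listing WARC; neither detail is confirmed in the log.

# pull every single-quoted href out of the page-listing WARCs, strip the
# href='...' wrapper, de-duplicate, and save the result as the new URL list
zcat pages-*.warc.gz \
  | grep -ohP "href='[^'>]+'" \
  | sed -e "s/^href='//" -e "s/'$//" \
  | sort -u > index.txt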
20:33
🔗
|
hiker1 |
How'd you get the warc.gz to begin with? |
20:34
🔗
|
godane |
for i in $(seq 1 895); do |
20:34
🔗
|
godane |
echo "http://andriasang.com/?page=$i" >> index.txt |
20:34
🔗
|
godane |
done |
20:36
🔗
|
hiker1 |
So that gives you a list of all the pages. How then did you get the warc.gz/index.txt with the full urls and with the urls by date? |
20:36
🔗
|
godane |
i then did this: wget -x -i index.txt --warc-file=andrisasang.com-$(date +%Y%m%d) --warc-cdx -E -o wget.log |
20:39 <hiker1> So you end up downloading the page listings twice in this process?
20:39 <hiker1> the first time to get all the urls, then the second time to get the real warc file with all the articles?
20:39 <godane> no
20:40 <godane> the first time it was the pages
20:40 <godane> then all the urls of the articles
20:40 <hiker1> Did you then merge the two together?
20:40 <godane> the dates and pages would also be in the articles dump too
20:41 <hiker1> oh, ok
20:43 <hiker1> How do you plan to get the images?
20:43
🔗
|
godane |
by grabing the urls like how i grabed the images |
20:44
🔗
|
hiker1 |
Will you then be able to merge the two warc files so that the images can be viewed in the articles? |
20:45
🔗
|
godane |
the way back machine can handler multiable warcs |
20:45
🔗
|
hiker1 |
Can you use the wayback machine to read these from the web? Or do you mean by running a private copy of the wayback machine? |
20:46
🔗
|
godane |
you can use warc-proxy to do it locally |
20:46
🔗
|
hiker1 |
and just load both warc files from that? |
20:46
🔗
|
godane |
yes |
20:48
🔗
|
hiker1 |
Thank you for explaining this to me. I was having a hard time understand the process. I really appreciate the help. |
22:14
🔗
|
hiker1 |
godane: How do you handle grabbing CSS or images embedded in CSS? |
22:28
🔗
|
godane |
i sadly don't know how to grab stuff in css |
22:28
🔗
|
godane |
even with wget |
22:28
🔗
|
godane |
cause i don't know if wget grabs urls in css |
22:30
🔗
|
Nemo_bis |
the requisites option maybe? |
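For what it's worth, wget 1.12 and later do parse CSS and follow url(...) references inside downloaded stylesheets when page requisites are requested, so something along these lines should pick up the CSS and the images it pulls in. This is a sketch built on the same index.txt approach, not a tested recipe, and the WARC name is made up.

# -p / --page-requisites fetches the CSS, scripts and images needed to render each page,
# including images referenced from inside the downloaded stylesheets
wget -x -i index.txt -p -E \
     --warc-file=andriasang.com-requisites-$(date +%Y%m%d) --warc-cdx \
     -o wget-requisites.log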
22:32
🔗
|
godane |
i can't grab the full website in one warc |
22:39
🔗
|
hiker1 |
Why can't you? |
22:40
🔗
|
godane |
it was 2.8gb big and was still going when i was doing it the first time |
22:40
🔗
|
hiker1 |
is that too large for one wget? |
22:41
🔗
|
godane |
4gb is the limit on one warc.gz |
22:41
🔗
|
godane |
it was getting there and it bothered me |
22:41
🔗
|
hiker1 |
oh. |
22:43
🔗
|
godane |
there is over 317000+ images in that site |
22:44 <ersi> that's a few
22:45 <chronomex> yeah that'll add up
22:46 <hiker1> wow
22:46 <godane> I may have to do another grab of the images later
22:46 <hiker1> What do you mean?
22:46 <godane> there were a lot of images that had no folder/url path in them
22:47 <godane> it was just the file name
22:47 <hiker1> I thought you were only grabbing html files right now
22:47 <godane> the html was already done
22:47 <godane> http://archive.org/details/andriasang.com-articles-20121224-mirror
22:48 <godane> that's the html articles
22:48 <godane> there were about 30 articles that gave a 502 Bad Gateway error
22:49 <godane> I was only able to get 4 of them on a retry
22:49 <godane> I limit the warc.gz file size to 1G
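If the worry is a single oversized warc.gz, wget can also cap and split the WARC output itself with --warc-max-size. A sketch assuming the same index.txt approach; the file names are illustrative, not godane's.

# --warc-max-size starts a new, numbered WARC once the current one reaches the limit,
# so a large crawl ends up spread over several warc.gz files instead of one huge one
wget -x -i index.txt -E \
     --warc-file=andriasang.com-images-$(date +%Y%m%d) --warc-cdx \
     --warc-max-size=1073741824 \
     -o wget-images.log
# 1073741824 bytes = 1 GiB per WARC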