00:14 <omf_> ivan`, you just need rss feeds
00:14 <omf_> or domain names
00:16 <ivan`> omf_: I need feeds, but I can infer the feed URLs based on subdomains or a /path or /user/path
00:17 <ivan`> there are also some 'obvious' feeds like /rss.xml /atom.xml that would be nice to get
00:17 <ivan`> I can make a pattern for those too
00:17 <ivan`> the domains will be the sites listed in http://www.archiveteam.org/index.php?title=Google_Reader and more, if I go look for them tomorrow
00:17 <arrith1> omf_: for example one thing that can be in a url is a username which can be used to infer feeds
00:20 <ivan`> will domain prefixes + url filtering script work for you?
00:21 <ivan`> or, I could skip the domain prefixes and have the script filter all 160 billion URLs
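
A minimal sketch (Python) of the kind of URL-filtering script being discussed here; the feed patterns are illustrative guesses, not ivan`'s actual rules:

    import re
    import sys

    # Illustrative feed patterns: "obvious" feed paths plus /user/<name>
    # paths from which a feed URL could be inferred. The real rules would
    # come from the sites listed on the archiveteam.org wiki page.
    FEED_RE = re.compile(
        r"/(rss|atom|index)\.xml$"   # /rss.xml, /atom.xml, /index.xml
        r"|/feeds?(/|$)"             # /feed, /feeds/...
        r"|/users?/[^/]+"            # /user/<name> -> a feed can be inferred
    )

    # Read one URL per line and keep the ones that look like feeds
    # (or that would let us infer a feed URL).
    for line in sys.stdin:
        if FEED_RE.search(line.strip()):
            sys.stdout.write(line)

Fed from a stream, e.g. bzcat urls.bz2 | python filter_feeds.py > feed_urls.txt, this also covers the "filter all 160 billion URLs" case, since it holds nothing in memory.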
00:26 <omf_> ivan`, did posterous have rss feeds?
00:30 <ivan`> yes
00:30 <ivan`> I've grabbed them, but almost all were 404 in reader's cache
00:30 <omf_> I assume you loaded that 9.8 million url list
00:30 <ivan`> yes
00:30 <omf_> what about the live journal list we got
00:30 <ivan`> IA list has been imported, #donereading is working on an lj crawler
00:31 <ivan`> is there another lj list?
00:31 <omf_> xanga users
00:31 <omf_> not that I know of
00:31 <ivan`> by IA list I meant wayback
00:31 <ivan`> xanga has been imported
00:31 <ivan`> hm, not the new discoveries though
00:34 <arrith1> also #donereading on a blogspot crawler
00:35 <omf_> what about reddit subreddits
00:36 <ivan`> done
00:36 <joepie91> anyone that natively speaks something that isn't Dutch or English, want to help out with the VPS panel I'm working on?
00:37 <joepie91> https://www.transifex.com/projects/p/cvm
00:37 <joepie91> (if a language is missing, tell me and I'll add it :P)
00:38 <omf_> I assume you also loaded in the alexa and quantcast url lists
00:38 <ivan`> I did not know those existed
00:39 <ivan`> is there a dump or do I have to query something?
00:39 <arrith1> omf_: yeah any url list ideas you have would be greatly appreciated
00:40 <omf_> here let me link you
00:40 <omf_> top 1 million sites http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
00:42 <omf_> and here is the quantcast top 1 million sites https://ak.quantcast.com/quantcast-top-million.zip
00:42 <ivan`> thanks
00:42 <ivan`> so, will you be able to run through the cuil data?
00:42 <omf_> it will take time
00:42 <ivan`> okay
00:43 <omf_> what about the url lists from the url shorteners
00:44 <ivan`> good idea
00:44 <omf_> http://urlte.am/
00:44 <omf_> what did you get out of common crawl
00:45 <ivan`> 2.4 billion URLs, a lot of stuff imported from there
00:45 <ivan`> I have a 22GB bz2 of their URLs
00:47 <omf_> that seems small, the common crawl url index with no content is 200GB compressed
00:48 <ivan`> https://github.com/trivio/common_crawl_index claims 5 billion URLs
00:49 <ivan`> I got 2.4 billion when running the included Python program
00:49 <ivan`> 5 billion URLs compressed can't take up 200GB
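
A back-of-envelope check of the figures above (all numbers taken from the conversation):

    # bytes per compressed URL implied by each claim
    print(22 * 1024**3 / 2.4e9)    # ivan`'s dump: ~9.8 bytes/URL
    print(200 * 1024**3 / 5e9)     # quoted index: ~43 bytes/URL

~43 bytes per compressed URL is well above what bare URLs compress to, so the 200GB figure presumably covers more than the URLs themselves (the index also stores per-URL location metadata), which would reconcile the two claims.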
00:52 <ivan`> urlteam torrent has 1 seed uploading at 1KB/s
00:53 <omf_> I think that is all on IA as well
00:54 <ivan`> I see only 2011 dumps on IA
00:59 <ivan`> I will bbl, gotta sleep
01:10 <omf_> ivan`, I got a few smaller but interesting lists
01:10 <omf_> universities - http://doors.stanford.edu/universities.html
01:12 <omf_> http://catalog.data.gov/dataset/us-department-of-education-ed-internet-domains
04:40 <Coderjoe> wow
04:40 <Coderjoe> http://www.forensicswiki.org/wiki/Main_Page
04:41 <Coderjoe> just stumbled on this while looking for information on disk image formats
04:43 <omf_> Coderjoe, that is a good site
04:52 <Coderjoe> is there a simple fuse program that can take a raw disk image and export the partitions? (I've tried guestfs and vdfuse and both do not like me)
04:52 <omf_> you mean mount it?
04:52 <omf_> fuseiso
04:53 <Coderjoe> the ultimate goal is to mount a partition in the image without using loop or dm
04:53 <omf_> fuseiso blah.iso dir/
04:53 <Coderjoe> not an iso
04:53 <omf_> why not just create a chroot and mount it in there?
04:54 <Coderjoe> eh?
04:55 <Coderjoe> i have a rawhardriveimage.img file made using dd. the first 512 bytes contain an MBR partition table with one partition (in this case).
04:55 <Coderjoe> I want to use ntfs-3g on the partition in the image to access files in the image
04:56 <Coderjoe> without dding out the partition and without using loop devices and/or device mapper. (IE: without needing root)
04:57 <Coderjoe> (errata: the image is also compressed with bzip2 and being accessed through avfs at the moment)
04:59 <omf_> Coderjoe, you tried guestmount from guestfs
04:59 <omf_> no root required
04:59 <Coderjoe> yes
05:00 <Coderjoe> gave stupid error that I cannot determine the cause of
05:01 <Coderjoe> also, it probes deeper than I really care for. (it determines filesystem types and everything)
05:21 <arrith1> Coderjoe: yeah that site is pretty good for documenting disk recovery stuff/ ddrescue (aka gddrescue) vs dd_rescue for example
05:23 <arrith1> Coderjoe: depending on a system's config, you can usually interact with loop stuff without root, at least on ubuntu. might be some group thing
05:23 <arrith1> kpartx and losetup with offsets
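
Finding the offsets that kpartx/losetup need can also be done by reading the MBR directly — a minimal Python sketch (the filename is Coderjoe's; the layout constants are standard MBR):

    import struct

    # The partition table is 4 x 16-byte entries at offset 446 of the
    # 512-byte MBR; the boot signature 0x55AA sits at offset 510.
    with open("rawhardriveimage.img", "rb") as f:
        mbr = f.read(512)
    assert mbr[510:512] == b"\x55\xaa", "no MBR boot signature"

    SECTOR = 512
    for i in range(4):
        entry = mbr[446 + 16 * i : 462 + 16 * i]
        ptype = entry[4]                              # partition type byte
        lba_start, sectors = struct.unpack("<II", entry[8:16])
        if ptype and sectors:
            print(f"partition {i + 1}: type 0x{ptype:02x}, "
                  f"byte offset {lba_start * SECTOR}, "
                  f"size {sectors * SECTOR}")

The printed offset is what losetup -o takes (kpartx does this parsing itself); whether either works without root depends on the group setup arrith1 mentions.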
05:23 <godane> looks like the twinkies are coming back
05:23 <godane> july 15th i hear
05:27 <arrith1> not in time for 4th of july aw
07:28 <godane> SketchCow: this is all you: http://www.atlasobscura.com/articles/object-of-intrigue-mickey-mouse-gas-mask
07:31 <godane> in that it's something to show at one of your speeches
07:39 <ivan`> omf_: nice, thanks
10:41 <ivan`> http://reader.aol.com/
10:42 <ivan`> they even implement a Reader-style API
10:42 <joepie91> hah
10:42 <joepie91> "so Google doesn't want to do Reader? fine, we'll do it then"
10:43 <norbert79> too bad it refuses to work
10:44 <norbert79> Doesn't do anything in FF 21
10:44 <norbert79> Close, but no cigar
10:45 <norbert79> on the other hand, Deep Space Nine - The Fallen is a great UT1-engine-based game
10:46 <norbert79> but on an unrelated note really
12:17 <Smiley> Anyone here good with building ec2 images?
12:17 * Smiley wants to make a xanga one
12:26 <Smiley> I honestly have no clue where to start
12:27 <Smiley> I'm thinking fire up a default debian install
12:27 <Smiley> then install the extra bits on top
12:36 <ivan`> why bother with an image if you can just use ssh to execute a setup script on each box
12:37 <joepie91> so
12:37 <joepie91> have we archived scribd yet
12:37 <Smiley> ivan`: ok, then i need a setup script ;)
12:37 <Smiley> I just need some automated way of firing up a load of instances.
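
A minimal sketch of ivan`'s ssh approach in Python — the hostnames, remote user, and setup.sh are placeholders:

    import subprocess

    # Run one setup script on each freshly booted instance over ssh,
    # instead of baking a custom EC2 image.
    HOSTS = [
        "ec2-198-51-100-1.compute-1.amazonaws.com",
        "ec2-198-51-100-2.compute-1.amazonaws.com",
    ]

    for host in HOSTS:
        with open("setup.sh", "rb") as script:
            # 'bash -s' makes the remote shell read the script from stdin
            subprocess.run(
                ["ssh", "-i", ".ssh/amazon.pem", f"admin@{host}", "bash -s"],
                stdin=script,
                check=True,
            )

(On EC2, publickey failures like the one below often come down to the wrong remote username for the AMI: admin for Debian images, ec2-user for Amazon Linux, ubuntu for Ubuntu.)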
12:38 <ivan`> joepie91: no, and +1 on that, there's a lot of good stuff there
12:40 <Smiley> considering we have 2 dying projects at the mo, you can feel free but don't expect much help.
12:41 <joepie91> I really don't like Scribd :/
12:42 <Smiley> wtf why can't I ssh into these ec2 instances o_O
12:42 <Smiley> debug1: Authentications that can continue: publickey
12:42 <Smiley> debug1: Trying private key: ./.ssh/amazon.pem
12:42 <Smiley> debug1: read PEM private key done: type RSA
12:42 <Smiley> So it is reading mah key :/
12:43 <norbert79> joepie91: I also wondered how a PHP and flash based webpage could be archived well with all functionalities and documents inside, especially since document access requires an active username on scribd :)
12:43 <Smiley> eh, where have my ssh rules gone o_O
12:46 <joepie91> norbert79: as I said, I don't like Scribd
12:47 <norbert79> joepie91: I understand, I was just curious, as I really would like to see some solutions to such
12:48 <norbert79> joepie91: one of the webpages I used to visit, all around former Luftwaffe (http://luftarchiv.de) was once my target for full dump, but I failed badly...
12:48 <joepie91> obvious solution would be automated account creation and downloading
12:48 <joepie91> but that'd probably require you to spend a few bucks on breaking captchas
12:49 <norbert79> I wonder if any webpage would offer a sharing of their whole webpage for free, like a backup of the page running the stuff
12:49 <joepie91> how do you mean?
12:50 <norbert79> I mean like a webpage we would like to archive, based on PHP, would run on a server within a subdirectory, and we would ask kindly and be given the content of that directory
12:50 <norbert79> that would be nice
12:50 <norbert79> if webpage owners were willing to share
12:51 <norbert79> without the sensitive data of course
12:51 <joepie91> I was actually thinking about that
12:51 <joepie91> perhaps a framework should exist for sites to offer a 'backup' file
12:51 <Smiley> ok who uses the pipeline/seesaw stuff?
12:51 <joepie91> in a standardized format
12:52 <joepie91> something that doesn't take long to implement
12:52 <norbert79> joepie91: Aye, that's what I was thinking too
12:52 <norbert79> joepie91: Like I run a Wiki, I would be happy sharing that without the sensitive stuff, including a full dump, like the SQL dump
12:53 <norbert79> but it would be nice if a method existed to put those packages into a VM image to make it all work again
12:53 <joepie91> not necessarily even a VM image
12:53 <joepie91> just some standardized machine-readable format
12:53 <norbert79> aye
12:53 <norbert79> Just thinking about this
12:53 <joepie91> speaking of which, mind speccing that out a bit?
12:53 <joepie91> as to what kind of data you would need to store
12:53 <joepie91> etc
12:53 <joepie91> how it'd be structured
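
One purely illustrative shape for such a 'backup' manifest — every field name here is invented for the sketch, not taken from any spec:

    import json

    # Hypothetical machine-readable backup manifest a site could publish.
    manifest = {
        "format_version": 1,
        "site": "https://example.org/wiki",
        "engine": {"name": "mediawiki", "notes": "runs in a subdirectory"},
        "dumps": [
            {"kind": "sql", "path": "dumps/wiki.sql.gz",
             "excludes": ["user", "session"]},  # keep sensitive tables out
            {"kind": "files", "path": "dumps/uploads.tar.gz"},
        ],
    }
    print(json.dumps(manifest, indent=2))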
12:53 <norbert79> hmm
12:53 <norbert79> In my case I have a mediawiki running
12:53 <joepie91> I'll have a think over it as well
12:53 <norbert79> I have some specific changes
12:54 <joepie91> perhaps combining the ideas might yield a nice result
12:54 <norbert79> like I use a captcha, but I use a bash script running as a cronjob to make random pictures that replace the old captcha pictures
12:54 <norbert79> I use almost all the generic things, it has short URLs
12:54 <norbert79> but it runs within one subdirectory instead of /var/www
12:55 <norbert79> so it's a bit modded
12:55 <norbert79> but nothing serious
12:56 <norbert79> joepie91: https://telehack.tk/wiki
12:56 <norbert79> but a Wiki is an easier thing as there are already methods available
12:56 <norbert79> the issue is with non-standard CMS engines
12:57 <norbert79> like anything written manually
12:57 <norbert79> there should be more APIs
12:57 <norbert79> for more common solutions and easier dumping
12:57 <norbert79> but of course why would homepage owners wish to share their content this easily
18:49 <godane> i got a shit ton of maximum pc disks today
18:50 <winr4r> godane: excellent
19:04 <godane> i'm going to upload my cnn20 disk
19:06 <Smiley> metric or imperial shitton?
19:06 <Smiley> ;)
19:23 <ersi> "Facebook login | 100% anonymous!"
19:23 <ersi> lol'd
19:30 <joepie91> where?
19:37 <ersi> doesn't really matter.. but in the corner to the right at https://trigd.com/
19:53 <godane> uploaded: https://archive.org/details/cdrom-cnn20
23:45 <joepie91> http://cryto.net/~joepie91/manual.html
23:47 <xmc> interesting.