[00:14] ivan`, you just need rss feeds
[00:14] or domain names
[00:16] omf_: I need feeds, but I can infer the feed URLs based on subdomains or a /path or /user/path
[00:17] there are also some 'obvious' feeds like /rss.xml /atom.xml that would be nice to get
[00:17] I can make a pattern for those too
[00:17] the domains will be the sites listed in http://www.archiveteam.org/index.php?title=Google_Reader and more, if I go look for them tomorrow
[00:17] omf_: for example, one thing that can be in a URL is a username, which can be used to infer feeds
[00:20] will domain prefixes + a url filtering script work for you?
[00:21] or, I could skip the domain prefixes and have the script filter all 160 billion URLs
[00:26] ivan`, did posterous have rss feeds?
[00:30] yes
[00:30] I've grabbed them, but almost all were 404 in Reader's cache
[00:30] I assume you loaded that 9.8 million url list
[00:30] yes
[00:30] what about the livejournal list we got
[00:30] the IA list has been imported, #donereading is working on an lj crawler
[00:31] is there another lj list?
[00:31] xanga users
[00:31] not that I know of
[00:31] by IA list I meant wayback
[00:31] xanga has been imported
[00:31] hm, not the new discoveries though
[00:34] also #donereading on a blogspot crawler
[00:35] what about reddit subreddits
[00:36] done
[00:36] anyone that natively speaks something that isn't Dutch or English, want to help out with the VPS panel I'm working on?
[00:37] https://www.transifex.com/projects/p/cvm
[00:37] (if a language is missing, tell me and I'll add it :P)
[00:38] I assume you also loaded in the alexa and quantcast url lists
[00:38] I did not know those existed
[00:39] is there a dump or do I have to query something?
[00:39] omf_: yeah, any url list ideas you have would be greatly appreciated
[00:40] here, let me link you
[00:40] top 1 million sites: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
[00:42] and here is the quantcast top 1 million sites: https://ak.quantcast.com/quantcast-top-million.zip
[00:42] thanks
[00:42] so, will you be able to run through the cuil data?
[00:42] it will take time
[00:42] okay
[00:43] what about the url lists from the url shorteners
[00:44] good idea
[00:44] http://urlte.am/
[00:44] what did you get out of common crawl
[00:45] 2.4 billion URLs, a lot of stuff imported from there
[00:45] I have a 22GB bz2 of their URLs
[00:47] that seems small; the common crawl url index with no content is 200GB compressed
[00:48] https://github.com/trivio/common_crawl_index claims 5 billion URLs
[00:49] I got 2.4 billion when running the included Python program
[00:49] 5 billion URLs compressed can't take up 200GB
[00:52] the urlteam torrent has 1 seed uploading at 1KB/s
[00:53] I think that is all on IA as well
[00:54] I see only 2011 dumps on IA
[00:59] I will bbl, gotta sleep
[01:10] ivan`, I got a few smaller but interesting lists
[01:10] universities - http://doors.stanford.edu/universities.html
[01:12] http://catalog.data.gov/dataset/us-department-of-education-ed-internet-domains
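
A rough sketch of the kind of URL-filtering script discussed above, for pulling 'obvious' feed URLs like /rss.xml and /atom.xml out of a large URL list (for example the Alexa, Quantcast, or Common Crawl dumps linked above). The exact pattern and the pipe-through-stdin usage are assumptions, not what was actually run:

    import re
    import sys

    # 'Obvious' feed paths mentioned above, plus a guess at common /feed/-style
    # paths -- this pattern is an assumption and would need tuning per site.
    FEED_PATTERN = re.compile(r'(/rss\.xml|/atom\.xml|/feeds?/|/rss/?|/atom/?)$',
                              re.IGNORECASE)

    def looks_like_feed(url):
        """Return True if the URL matches one of the candidate feed patterns."""
        return FEED_PATTERN.search(url.strip()) is not None

    if __name__ == '__main__':
        # Hypothetical usage: bzcat all-urls.bz2 | python filter_feeds.py > feeds.txt
        for line in sys.stdin:
            if looks_like_feed(line):
                sys.stdout.write(line)

Domain prefixes from the Alexa/Quantcast lists could then be combined with the same candidate paths to generate feed URLs to try, as discussed above.
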
[04:40] wow
[04:40] http://www.forensicswiki.org/wiki/Main_Page
[04:41] just stumbled on this while looking for information on disk image formats
[04:43] Coderjoe, that is a good site
[04:52] is there a simple fuse program that can take a raw disk image and export the partitions? (I've tried guestfs and vdfuse and both do not like me)
[04:52] you mean mount it?
[04:52] fuseiso
[04:53] the ultimate goal is to mount a partition in the image without using loop or dm
[04:53] fuseiso blah.iso dir/
[04:53] not an iso
[04:53] why not just create a chroot and mount it in there?
[04:54] eh?
[04:55] i have a rawhardriveimage.img file made using dd. the first 512 bytes contain an MBR partition table with one partition (in this case).
[04:55] I want to use ntfs-3g on the partition in the image to access files in the image
[04:56] without dding out the partition and without using loop devices and/or device mapper (i.e. without needing root)
[04:57] (errata: the image is also compressed with bzip2 and being accessed through avfs at the moment)
[04:59] Coderjoe, have you tried guestmount from guestfs?
[04:59] no root required
[04:59] yes
[05:00] it gave a stupid error that I cannot determine the cause of
[05:01] also, it probes deeper than I really care for. (it determines filesystem types and everything)
[05:21] Coderjoe: yeah, that site is pretty good for documenting disk recovery stuff, e.g. ddrescue (aka gddrescue) vs dd_rescue
[05:23] Coderjoe: depending on a system's config, you can usually interact with loop stuff without root, at least on ubuntu. might be some group thing
[05:23] kpartx and losetup with offsets
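
For the raw-image case Coderjoe describes, here is a minimal sketch of reading the MBR partition table out of the first 512 bytes to get each partition's byte offset and length. It assumes classic 512-byte sectors and a plain (non-GPT) MBR, and it only computes the numbers; it does not mount anything:

    import struct
    import sys

    SECTOR_SIZE = 512  # assumption: classic 512-byte sectors

    def mbr_partitions(image_path):
        """Yield (slot, type, byte_offset, byte_length) for each used MBR entry."""
        with open(image_path, 'rb') as f:
            mbr = f.read(SECTOR_SIZE)
        if mbr[510:512] != b'\x55\xaa':
            raise ValueError('no MBR boot signature in %s' % image_path)
        for slot in range(4):
            entry = mbr[446 + slot * 16:446 + (slot + 1) * 16]
            part_type = entry[4]
            if part_type == 0x00:
                continue  # unused slot
            start_lba, sector_count = struct.unpack('<II', entry[8:16])
            yield slot + 1, part_type, start_lba * SECTOR_SIZE, sector_count * SECTOR_SIZE

    if __name__ == '__main__':
        for slot, part_type, offset, length in mbr_partitions(sys.argv[1]):
            print('partition %d: type 0x%02x, offset %d, length %d'
                  % (slot, part_type, offset, length))

The printed offset is what you would hand to losetup -o or kpartx as mentioned above (or to any FUSE tool that accepts an offset), though the loop-device route still needs root, which was the thing Coderjoe wanted to avoid.
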
[05:23] looks like the twinkies are coming back
[05:23] july 15th i hear
[05:27] not in time for 4th of july, aw
[07:28] SketchCow: this is all you: http://www.atlasobscura.com/articles/object-of-intrigue-mickey-mouse-gas-mask
[07:31] in that it's something to show at one of your speeches
[07:39] omf_: nice, thanks
[10:41] http://reader.aol.com/
[10:42] they even implement a Reader-style API
[10:42] hah
[10:42] "so Google doesn't want to do Reader? fine, we'll do it then"
[10:43] too bad it refuses to work
[10:44] Doesn't do anything in FF 21
[10:44] Close, but no cigar
[10:45] on the other hand, Deep Space Nine: The Fallen is a great UT1-engine-based game
[10:46] but that's on an unrelated note, really
[12:17] Anyone here good with building ec2 images?
[12:17] * Smiley wants to make a xanga one
[12:26] I honestly have no clue where to start
[12:27] I'm thinking fire up a default debian install
[12:27] then install the extra bits on top
[12:36] why bother with an image if you can just use ssh to execute a setup script on each box
[12:37] so
[12:37] have we archived scribd yet
[12:37] ivan`: ok, then i need a setup script ;)
[12:37] I just need some automated way of firing up a load of instances.
[12:38] joepie91: no, and +1 on that, there's a lot of good stuff there
[12:40] considering we have 2 dying projects at the mo, you can feel free but don't expect much help.
[12:41] I really don't like Scribd :/
[12:42] wtf, why can't I ssh into these ec2 instances o_O
[12:42] debug1: Authentications that can continue: publickey
[12:42] debug1: Trying private key: ./.ssh/amazon.pem
[12:42] debug1: read PEM private key done: type RSA
[12:42] So it is reading mah key :/
[12:43] joepie91: I also wondered how a PHP and flash based webpage could be archived well with all functionality and documents inside, especially since document access requires an active username on scribd :)
[12:43] eh, where have my ssh rules gone o_O
[12:46] norbert79: as I said, I don't like Scribd
[12:47] joepie91: I understand, I was just curious, as I really would like to see some solutions for that
[12:48] joepie91: one of the webpages I used to visit, all about the former Luftwaffe (http://luftarchiv.de), was once my target for a full dump, but I failed badly...
[12:48] obvious solution would be automated account creation and downloading
[12:48] but that'd probably require you to spend a few bucks on breaking captchas
[12:49] I wonder if any webpage would offer sharing of their whole site for free, like a backup of the page running the stuff
[12:49] how do you mean?
[12:50] I mean, say a PHP-based webpage we would like to archive runs on a server within a subdirectory; we would ask kindly and be given the contents of that directory
[12:50] that would be nice
[12:50] if webpage owners would be willing to share
[12:50] without the sensitive data of course
[12:51] I was actually thinking about that
[12:51] perhaps a framework should exist for sites to offer a 'backup' file
[12:51] ok who uses the pipeline/seesaw stuff?
[12:51] in a standardized format
[12:51] something that doesn't take long to implement
[12:52] joepie91: Aye, that's what I was thinking about too
[12:52] joepie91: Like, I run a wiki; I would be happy sharing that without the sensitive stuff, including a full dump, like the SQL dump
[12:52] but it would be nice if a method existed to put those packages into a VM image to make it work again
[12:53] not necessarily even a VM image
[12:53] just some standardized machine-readable format
[12:53] aye
[12:53] Just thinking about this
[12:53] speaking of which, mind speccing that out a bit?
[12:53] as to what kind of data you would need to store
[12:53] etc
[12:53] how it'd be structured
[12:53] hmm
[12:53] In my case I have a mediawiki running
[12:53] I'll have a think over it as well
[12:53] I have some specific changes
[12:53] perhaps combining the ideas might yield a nice result
[12:54] like, I use a captcha, but with a bash script that runs as a cronjob to generate random pictures replacing the old captcha pictures
[12:54] I use almost all the generic settings; it has short URLs
[12:54] but it runs within one subdirectory instead of /var/www
[12:54] so it's a bit modded
[12:55] but nothing serious
[12:55] joepie91: https://telehack.tk/wiki
[12:56] but a wiki is an easier case, as there are already methods available
[12:56] the issue is with non-standard CMS engines
[12:56] like anything written manually
[12:57] there should be more APIs
[12:57] for more common solutions and easier dumping
[12:57] but of course, why would homepage owners wish to share their content this easily
[18:49] i got a shit ton of Maximum PC discs today
[18:50] godane: excellent
[19:04] i'm going to upload my cnn20 disc
[19:06] metric or imperial shitton?
[19:06] ;)
[19:23] "Facebook login | 100% anonymous!"
[19:23] lol'd
[19:30] where?
[19:37] doesn't really matter.. but in the corner to the right at https://trigd.com/
[19:53] uploaded: https://archive.org/details/cdrom-cnn20
[23:45] http://cryto.net/~joepie91/manual.html
[23:47] interesting.
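
As a follow-up to the "mind speccing that out a bit?" exchange at 12:51-12:57 above: a purely hypothetical first stab, in Python, at the kind of standardized machine-readable 'backup' manifest a site owner could publish alongside a sanitized dump. Every field name here is invented for illustration; no such format exists yet:

    import json

    # Hypothetical manifest for a site-provided backup package; the layout and
    # field names are invented for illustration, not an existing standard.
    manifest = {
        "format_version": 1,
        "site": "https://telehack.tk/wiki",
        "engine": "mediawiki",  # or "custom" for hand-written CMSes
        "generated": "2013-06-24T12:55:00Z",
        "contents": [
            {"path": "dump/wiki.sql.gz", "type": "database-dump", "sanitized": True},
            {"path": "uploads/", "type": "user-files"},
            {"path": "LocalSettings.php.example", "type": "config-template"},
        ],
        "excluded": ["user passwords", "session/secret keys", "private user data"],
        "notes": "short URLs, runs from a subdirectory, custom captcha cronjob",
    }

    # Write the manifest next to the dump so an archiver can discover it.
    with open("backup-manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

The 'contents' entries mirror what norbert79 says he would be willing to share (an SQL dump minus sensitive data, plus enough configuration hints to bring the wiki back up), which is what an archiver would need to reconstruct a working copy without a full VM image.
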