00:14 <omf_> ivan`, you just need rss feeds
00:14 <omf_> or domain names
00:16 <ivan`> omf_: I need feeds, but I can infer the feed URLs based on subdomains or a /path or /user/path
00:17 <ivan`> there are also some 'obvious' feeds like /rss.xml /atom.xml that would be nice to get
00:17 <ivan`> I can make a pattern for those too
00:17 <ivan`> the domains will be the sites listed in http://www.archiveteam.org/index.php?title=Google_Reader and more, if I go look for them tomorrow
00:17 <arrith1> omf_: for example one thing that can be in a url is a username which can be used to infer feeds
00:20 <ivan`> will domain prefixes + url filtering script work for you?
00:21 <ivan`> or, I could skip the domain prefixes and have the script filter all 160 billion URLs
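
A minimal sketch (Python) of the kind of URL-filtering script being discussed here; the feed patterns are illustrative guesses, not ivan`'s actual rules:

    import re
    import sys

    # Illustrative feed patterns: "obvious" feed paths plus /user/<name>
    # paths from which a feed URL could be inferred. The real rules would
    # come from the sites listed on the archiveteam.org wiki page.
    FEED_RE = re.compile(
        r"/(rss|atom|index)\.xml$"   # /rss.xml, /atom.xml, /index.xml
        r"|/feeds?(/|$)"             # /feed, /feeds/...
        r"|/users?/[^/]+"            # /user/<name> -> a feed can be inferred
    )

    # Read one URL per line and keep the ones that look like feeds
    # (or that would let us infer a feed URL).
    for line in sys.stdin:
        if FEED_RE.search(line.strip()):
            sys.stdout.write(line)

Fed from a stream, e.g. bzcat urls.bz2 | python filter_feeds.py > feed_urls.txt, this also covers the "filter all 160 billion URLs" case, since it holds nothing in memory.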
00:26 <omf_> ivan`, did posterous have rss feeds?
00:30 <ivan`> yes
00:30 <ivan`> I've grabbed them, but almost all were 404 in reader's cache
00:30 <omf_> I assume you loaded that 9.8 million url list
00:30 <ivan`> yes
00:30 <omf_> what about the live journal list we got
00:30 <ivan`> IA list has been imported, #donereading is working on an lj crawler
00:31 <ivan`> is there another lj list?
00:31 <omf_> xanga users
00:31 <omf_> not that I know of
00:31 <ivan`> by IA list I meant wayback
00:31 <ivan`> xanga has been imported
00:31 <ivan`> hm, not the new discoveries though
00:34 <arrith1> also #donereading on a blogspot crawler
00:35 <omf_> what about reddit subreddits
00:36 <ivan`> done
00:36 <joepie91> anyone that natively speaks something that isn't Dutch or English, want to help out with the VPS panel I'm working on?
00:37 <joepie91> https://www.transifex.com/projects/p/cvm
00:37 <joepie91> (if a language is missing, tell me and I'll add it :P)
00:38 <omf_> I assume you also loaded in the alexa and quantcast url lists
00:38 <ivan`> I did not know those existed
00:39 <ivan`> is there a dump or do I have to query something?
00:39 <arrith1> omf_: yeah any url list ideas you have would be greatly appreciated
00:40 <omf_> here let me link you
00:40 <omf_> top 1 million sites http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
00:42 <omf_> and here is the quantcast top 1 million sites https://ak.quantcast.com/quantcast-top-million.zip
00:42 <ivan`> thanks
00:42 <ivan`> so, will you be able to run through the cuil data?
00:42 <omf_> it will take time
00:42 <ivan`> okay
00:43 <omf_> what about the url lists from the url shorteners
00:44 <ivan`> good idea
00:44 <omf_> http://urlte.am/
00:44 <omf_> what did you get out of common crawl
00:45 <ivan`> 2.4 billion URLs, a lot of stuff imported from there
00:45 <ivan`> I have a 22GB bz2 of their URLs
00:47 <omf_> that seems small, the common crawl url index with no content is 200GB compressed
00:48 <ivan`> https://github.com/trivio/common_crawl_index claims 5 billion URLs
00:49 <ivan`> I got 2.4 billion when running the included Python program
00:49 <ivan`> 5 billion URLs compressed can't take up 200GB
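
A back-of-envelope check of the figures above (all numbers taken from the conversation):

    # bytes per compressed URL implied by each claim
    print(22 * 1024**3 / 2.4e9)    # ivan`'s dump: ~9.8 bytes/URL
    print(200 * 1024**3 / 5e9)     # quoted index: ~43 bytes/URL

~43 bytes per compressed URL is well above what bare URLs compress to, so the 200GB figure presumably covers more than the URLs themselves (the index also stores per-URL location metadata), which would reconcile the two claims.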
00:52 <ivan`> urlteam torrent has 1 seed uploading at 1KB/s
00:53 <omf_> I think that is all on IA as well
00:54 <ivan`> I see only 2011 dumps on IA
00:59 <ivan`> I will bbl, gotta sleep
01:10 <omf_> ivan`, I got a few smaller but interesting lists
01:10 <omf_> universities - http://doors.stanford.edu/universities.html
01:12 <omf_> http://catalog.data.gov/dataset/us-department-of-education-ed-internet-domains
04:40 <Coderjoe> wow
04:40 <Coderjoe> http://www.forensicswiki.org/wiki/Main_Page
04:41 <Coderjoe> just stumbled on this while looking for information on disk image formats
04:43 <omf_> Coderjoe, that is a good site
04:52 <Coderjoe> is there a simple fuse program that can take a raw disk image and export the partitions? (I've tried guestfs and vdfuse and both do not like me)
04:52 <omf_> you mean mount it?
04:52 <omf_> fuseiso
04:53 <Coderjoe> the ultimate goal is to mount a partition in the image without using loop or dm
04:53 <omf_> fuseiso blah.iso dir/
04:53 <Coderjoe> not an iso
04:53 <omf_> why not just create a chroot and mount it in there?
04:54 <Coderjoe> eh?
04:55 <Coderjoe> i have a rawhardriveimage.img file made using dd. the first 512 bytes contain an MBR partition table with one partition (in this case).
04:55 <Coderjoe> I want to use ntfs-3g on the partition in the image to access files in the image
04:56 <Coderjoe> without dding out the partition and without using loop devices and/or device mapper. (IE: without needing root)
04:57 <Coderjoe> (errata: the image is also compressed with bzip2 and being accessed through avfs at the moment)
04:59 <omf_> Coderjoe, you tried guestmount from guestfs
04:59 <omf_> no root required
04:59 <Coderjoe> yes
05:00 <Coderjoe> gave stupid error that I cannot determine the cause of
05:01 <Coderjoe> also, it probes deeper than I really care for. (it determines filesystem types and everything)
05:21 <arrith1> Coderjoe: yeah that site is pretty good for documenting disk recovery stuff/ ddrescue (aka gddrescue) vs dd_rescue for example
05:23 <arrith1> Coderjoe: depending on a system's config, you can usually interact with loop stuff without root, at least on ubuntu. might be some group thing
05:23 <arrith1> kpartx and losetup with offsets
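
Finding the offsets that kpartx/losetup need can also be done by reading the MBR directly — a minimal Python sketch (the filename is Coderjoe's; the layout constants are standard MBR):

    import struct

    # The partition table is 4 x 16-byte entries at offset 446 of the
    # 512-byte MBR; the boot signature 0x55AA sits at offset 510.
    with open("rawhardriveimage.img", "rb") as f:
        mbr = f.read(512)
    assert mbr[510:512] == b"\x55\xaa", "no MBR boot signature"

    SECTOR = 512
    for i in range(4):
        entry = mbr[446 + 16 * i : 462 + 16 * i]
        ptype = entry[4]                              # partition type byte
        lba_start, sectors = struct.unpack("<II", entry[8:16])
        if ptype and sectors:
            print(f"partition {i + 1}: type 0x{ptype:02x}, "
                  f"byte offset {lba_start * SECTOR}, "
                  f"size {sectors * SECTOR}")

The printed offset is what losetup -o takes (kpartx does this parsing itself); whether either works without root depends on the group setup arrith1 mentions.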
05:23 <godane> looks like the twinkies are coming back
05:23 <godane> july 15th i hear
05:27 <arrith1> not in time for 4th of july aw
07:28 <godane> SketchCow: this is all you: http://www.atlasobscura.com/articles/object-of-intrigue-mickey-mouse-gas-mask
07:31 <godane> in that it's something to show at one of your speeches
07:39 <ivan`> omf_: nice, thanks
10:41 <ivan`> http://reader.aol.com/
10:42 <ivan`> they even implement a Reader-style API
10:42 <joepie91> hah
10:42 <joepie91> "so Google doesn't want to do Reader? fine, we'll do it then"
10:43 <norbert79> too bad it refuses to work
10:44 <norbert79> Doesn't do anything in FF 21
10:44 <norbert79> Close, but no cigar
10:45 <norbert79> on the other hand, Deep Space Nine - The Fallen is a great UT1-engine-based game
10:46 <norbert79> but on an unrelated note really
12:17 <Smiley> Anyone here good with building ec2 images?
12:17 * Smiley wants to make a xanga one
12:26 <Smiley> I honestly have no clue where to start
12:27 <Smiley> I'm thinking fire up a default debian install
12:27 <Smiley> then install the extra bits on top
12:36 <ivan`> why bother with an image if you can just use ssh to execute a setup script on each box
12:37 <joepie91> so
12:37 <joepie91> have we archived scribd yet
12:37 <Smiley> ivan`: ok, then i need a setup script ;)
12:37 <Smiley> I just need some automated way of firing up a load of instances.
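
A minimal sketch of ivan`'s ssh approach in Python — the hostnames, remote user, and setup.sh are placeholders:

    import subprocess

    # Run one setup script on each freshly booted instance over ssh,
    # instead of baking a custom EC2 image.
    HOSTS = [
        "ec2-198-51-100-1.compute-1.amazonaws.com",
        "ec2-198-51-100-2.compute-1.amazonaws.com",
    ]

    for host in HOSTS:
        with open("setup.sh", "rb") as script:
            # 'bash -s' makes the remote shell read the script from stdin
            subprocess.run(
                ["ssh", "-i", ".ssh/amazon.pem", f"admin@{host}", "bash -s"],
                stdin=script,
                check=True,
            )

(On EC2, publickey failures like the one below often come down to the wrong remote username for the AMI: admin for Debian images, ec2-user for Amazon Linux, ubuntu for Ubuntu.)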
12:38 <ivan`> joepie91: no, and +1 on that, there's a lot of good stuff there
12:40 <Smiley> considering we have 2 dying projects at the mo, you can feel free but don't expect much help.
12:41 <joepie91> I really don't like Scribd :/
12:42 <Smiley> wtf why can't I ssh into these ec2 instances o_O
12:42 <Smiley> debug1: Authentications that can continue: publickey
12:42 <Smiley> debug1: Trying private key: ./.ssh/amazon.pem
12:42 <Smiley> debug1: read PEM private key done: type RSA
12:42 <Smiley> So it is reading mah key :/
12:43 <norbert79> joepie91: I also wondered how a PHP and flash based webpage could be archived well with all functionalities and documents inside, especially since document access requires an active username on scribd :)
12:43 <Smiley> eh, where have my ssh rules gone o_O
12:46 <joepie91> norbert79: as I said, I don't like Scribd
12:47 <norbert79> joepie91: I understand, I was just curious, as I really would like to see some solutions to such
12:48 <norbert79> joepie91: one of the webpages I used to visit, all around former Luftwaffe (http://luftarchiv.de) was once my target for full dump, but I failed badly...
12:48 <joepie91> obvious solution would be automated account creation and downloading
12:48 <joepie91> but that'd probably require you to spend a few bucks on breaking captchas
12:49 <norbert79> I wonder if any webpage would offer a sharing of their whole webpage for free, like a backup of the page running the stuff
12:49 <joepie91> how do you mean?
12:50 <norbert79> I mean like a webpage we would like to archive, based on PHP, would run on a server within a subdirectory, and we would ask kindly and be given the content of that directory
12:50 <norbert79> that would be nice
12:50 <norbert79> if webpage owners were willing to share
12:51 <norbert79> without the sensitive data of course
12:51 <joepie91> I was actually thinking about that
12:51 <joepie91> perhaps a framework should exist for sites to offer a 'backup' file
12:51 <Smiley> ok who uses the pipeline/seesaw stuff?
12:51 <joepie91> in a standardized format
12:52 <joepie91> something that doesn't take long to implement
12:52 <norbert79> joepie91: Aye, that's what I was thinking too
12:52 <norbert79> joepie91: Like I run a Wiki, I would be happy sharing that without the sensitive stuff, including a full dump, like the SQL dump
12:53 <norbert79> but it would be nice if a method existed to put those packages into a VM image to make it all work again
12:53 <joepie91> not necessarily even a VM image
12:53 <joepie91> just some standardized machine-readable format
12:53 <norbert79> aye
12:53 <norbert79> Just thinking about this
12:53 <joepie91> speaking of which, mind speccing that out a bit?
12:53 <joepie91> as to what kind of data you would need to store
12:53 <joepie91> etc
12:53 <joepie91> how it'd be structured
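
One purely illustrative shape for such a 'backup' manifest — every field name here is invented for the sketch, not taken from any spec:

    import json

    # Hypothetical machine-readable backup manifest a site could publish.
    manifest = {
        "format_version": 1,
        "site": "https://example.org/wiki",
        "engine": {"name": "mediawiki", "notes": "runs in a subdirectory"},
        "dumps": [
            {"kind": "sql", "path": "dumps/wiki.sql.gz",
             "excludes": ["user", "session"]},  # keep sensitive tables out
            {"kind": "files", "path": "dumps/uploads.tar.gz"},
        ],
    }
    print(json.dumps(manifest, indent=2))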
12:53 <norbert79> hmm
12:53 <norbert79> In my case I have a mediawiki running
12:53 <joepie91> I'll have a think over it as well
12:53 <norbert79> I have some specific changes
12:54 <joepie91> perhaps combining the ideas might yield a nice result
12:54 <norbert79> like I use a captcha, but I use a bash script running as a cronjob to make random pictures that replace the old captcha pictures
12:54 <norbert79> I use almost all the generic things, it has short URLs
12:54 <norbert79> but it runs within one subdirectory instead of /var/www
12:55 <norbert79> so it's a bit modded
12:55 <norbert79> but nothing serious
12:56 <norbert79> joepie91: https://telehack.tk/wiki
12:56 <norbert79> but a Wiki is an easier thing as there are already methods available
12:56 <norbert79> the issue is with non-standard CMS engines
12:57 <norbert79> like anything written manually
12:57 <norbert79> there should be more APIs
12:57 <norbert79> for more common solutions and easier dumping
12:57 <norbert79> but of course why would homepage owners wish to share their content this easily
18:49 <godane> i got a shit ton of maximum pc disks today
18:50 <winr4r> godane: excellent
19:04 <godane> i'm going to upload my cnn20 disk
19:06 <Smiley> metric or imperial shitton?
19:06 <Smiley> ;)
19:23 <ersi> "Facebook login | 100% anonymous!"
19:23 <ersi> lol'd
19:30 <joepie91> where?
19:37 <ersi> doesn't really matter.. but in the corner to the right at https://trigd.com/
19:53 <godane> uploaded: https://archive.org/details/cdrom-cnn20
23:45 <joepie91> http://cryto.net/~joepie91/manual.html
23:47 <xmc> interesting.