#archiveteam 2015-06-08,Mon


Time Nickname Message
00:05 🔗 primus104 has quit IRC (Leaving.)
01:15 🔗 Ymgve has quit IRC (Ping timeout: 506 seconds)
01:16 🔗 Ymgve has joined #archiveteam
01:17 🔗 d6e has left :3
01:30 🔗 Jonimus has joined #archiveteam
01:42 🔗 schbirid2 has joined #archiveteam
01:45 🔗 schbirid has quit IRC (Read error: Operation timed out)
01:46 🔗 Emcy has quit IRC (Ping timeout: 306 seconds)
01:54 🔗 aaaaaaaaa has joined #archiveteam
02:03 🔗 mistym has quit IRC (Remote host closed the connection)
02:04 🔗 mistym has joined #archiveteam
02:04 🔗 Ymgve has quit IRC ()
02:16 🔗 Coderjoe has quit IRC (Read error: Connection reset by peer)
02:16 🔗 Coderjoe has joined #archiveteam
02:24 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
02:25 🔗 aaaaaaaaa has joined #archiveteam
02:28 🔗 bzc6p__ has joined #archiveteam
02:32 🔗 bzc6p_ has quit IRC (Read error: Operation timed out)
02:35 🔗 mistym has quit IRC (Remote host closed the connection)
02:45 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
02:46 🔗 aaaaaaaaa has joined #archiveteam
02:47 🔗 BlueMaxim has joined #archiveteam
03:00 🔗 JesseW has joined #archiveteam
03:20 🔗 mistym has joined #archiveteam
04:03 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
04:04 🔗 aaaaaaaaa has joined #archiveteam
04:17 🔗 mr_rippit has joined #archiveteam
04:17 🔗 ripvanwin has quit IRC (Read error: Connection reset by peer)
04:22 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
04:26 🔗 mistym has quit IRC (Remote host closed the connection)
04:40 🔗 Emcy has joined #archiveteam
04:59 🔗 mistym has joined #archiveteam
05:43 🔗 JesseW has quit IRC (Quit: Leaving.)
05:45 🔗 godane has quit IRC (Read error: Operation timed out)
05:47 🔗 JesseW has joined #archiveteam
06:08 🔗 godane has joined #archiveteam
06:51 🔗 mistym has quit IRC (Remote host closed the connection)
07:11 🔗 JesseW has quit IRC (Quit: Leaving.)
07:20 🔗 bzc6p__ is now known as bzc6p
07:29 🔗 primus104 has joined #archiveteam
07:31 🔗 rolf has joined #archiveteam
07:33 🔗 Laverne has quit IRC (Read error: Operation timed out)
07:51 🔗 mistym has joined #archiveteam
08:03 🔗 mistym has quit IRC (Read error: Operation timed out)
08:53 🔗 mistym has joined #archiveteam
09:01 🔗 mistym has quit IRC (Ping timeout: 483 seconds)
09:25 🔗 rolf has quit IRC (Leaving...)
09:28 🔗 rolf has joined #archiveteam
09:32 🔗 primus104 has quit IRC (Leaving.)
09:33 🔗 rolf has quit IRC (Leaving...)
10:05 🔗 SadDM has quit IRC (Ping timeout: 370 seconds)
10:17 🔗 rolf has joined #archiveteam
10:17 🔗 rolf has quit IRC (Client Quit)
10:51 🔗 sirdancea has quit IRC (Remote host closed the connection)
10:55 🔗 Ymgve has joined #archiveteam
11:00 🔗 sirdancea has joined #archiveteam
12:18 🔗 primus104 has joined #archiveteam
12:19 🔗 Morbus has quit IRC (Quit: http://www.disobey.com/)
12:24 🔗 Morbus has joined #archiveteam
12:24 🔗 p9ne has joined #archiveteam
12:45 🔗 p9ne has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
12:55 🔗 primus104 has quit IRC (Leaving.)
13:00 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
13:30 🔗 McGEE has quit IRC (Quit: Connection closed for inactivity)
13:31 🔗 sankin has joined #archiveteam
13:37 🔗 p9ne has joined #archiveteam
13:45 🔗 vOYtEC has quit IRC (Ping timeout: 362 seconds)
13:57 🔗 primus104 has joined #archiveteam
14:32 🔗 mistym has joined #archiveteam
14:36 🔗 bzc6p_ has joined #archiveteam
14:40 🔗 bzc6p has quit IRC (Read error: Operation timed out)
14:40 🔗 mistym has quit IRC (Remote host closed the connection)
14:58 🔗 JesseW has joined #archiveteam
15:09 🔗 mistym has joined #archiveteam
15:27 🔗 JesseW has quit IRC (Quit: Leaving.)
15:40 🔗 sankin has quit IRC (Leaving.)
15:46 🔗 mistym has quit IRC (Remote host closed the connection)
16:03 🔗 mistym has joined #archiveteam
16:11 🔗 sirdancea has quit IRC (Read error: Operation timed out)
16:21 🔗 lytv has quit IRC (Read error: Connection reset by peer)
16:30 🔗 nox has quit IRC ()
16:32 🔗 lexicon has joined #archiveteam
16:43 🔗 lytv has joined #archiveteam
16:47 🔗 aaaaaaaaa has joined #archiveteam
16:47 🔗 philpem has joined #archiveteam
16:48 🔗 p9ne has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
16:57 🔗 Mayonaise has quit IRC (Read error: Operation timed out)
16:58 🔗 bzc6p_ is now known as bzc6p
16:58 🔗 joepie91_ has quit IRC (Read error: Operation timed out)
16:58 🔗 tephra has quit IRC (Read error: Operation timed out)
16:58 🔗 closure has quit IRC (Read error: Operation timed out)
16:59 🔗 joepie91 has joined #archiveteam
17:03 🔗 closure has joined #archiveteam
17:03 🔗 Mayonaise has joined #archiveteam
17:06 🔗 nox has joined #archiveteam
17:07 🔗 aaaaaaaa_ has joined #archiveteam
17:09 🔗 tephra has joined #archiveteam
17:13 🔗 aaaaaaaaa has quit IRC (Ping timeout: 370 seconds)
17:14 🔗 aaaaaaaa_ is now known as aaaaaaaaa
17:24 🔗 mistym has quit IRC (Remote host closed the connection)
17:27 🔗 mistym has joined #archiveteam
17:30 🔗 mistym has quit IRC (Remote host closed the connection)
17:46 🔗 primus104 has quit IRC (Leaving.)
17:46 🔗 McGEE has joined #archiveteam
17:50 🔗 mistym has joined #archiveteam
17:51 🔗 Jonimus has quit IRC (Ping timeout: 370 seconds)
18:22 🔗 sirdancea has joined #archiveteam
18:56 🔗 mistym has quit IRC (Remote host closed the connection)
19:01 🔗 mistym has joined #archiveteam
19:02 🔗 db48x has joined #archiveteam
19:21 🔗 mistym has quit IRC (Remote host closed the connection)
19:22 🔗 mistym has joined #archiveteam
19:31 🔗 SimpBrain has joined #archiveteam
19:37 🔗 neku has joined #archiveteam
19:37 🔗 neku Hi
19:38 🔗 Kazzy hey
19:38 🔗 aaaaaaaaa hello
19:39 🔗 neku Was told by a friend (wub) to go here, I run Pomf.se and would need to archive it.
19:40 🔗 deafnet has joined #archiveteam
19:41 🔗 bzc6p neku: it's basically a file sharing service, right?
19:41 🔗 neku bzc6p, correct
19:42 🔗 bzc6p At first glance, I guess it doesn't have an index of uploaded files.
19:42 🔗 Kazzy possible to get a list of all URLs/files?
19:42 🔗 bzc6p Does it?
19:42 🔗 neku It does not have an index, however files are public.
19:43 🔗 bzc6p Hey. You said you run it.
19:43 🔗 neku I do..
19:44 🔗 bzc6p Erm... so you know the structure the most.
19:45 🔗 neku As said above, it does not have an index list, however files are not private as anyone can view them once they get a hold of the link.
19:45 🔗 Kazzy judging by the fact that each file seems to keep its original extension, bruteforcing is out of the question
19:45 🔗 aaaaaaaaa I think what he was asking is whether you have or can generate a list of all the file links.
19:45 🔗 Kazzy archival will require some sort of 'inside knowledge' and a list of files
19:46 🔗 bzc6p neku: I'd be surprised if a site admin couldn't provide a list of stored files.
19:47 🔗 neku Well of course I can, it's all in a database, however my question is how would I archive this in the best way.
19:47 🔗 WubTheCap has joined #archiveteam
19:47 🔗 aaaaaaaaa neku: the basic way to archive a site is in a warc file. This records both the request to the website and the response.
19:47 🔗 deafnet wubbois
19:48 🔗 WubTheCap So if it hasn't become clear yet, it seems Pomf.se has to shut down soon due to lack of money and CDN issues
19:48 🔗 WubTheCap It's like what happened to MediaCrush, which was archived
19:48 🔗 WubTheCap http://archiveteam.org/index.php?title=MediaCrush
19:48 🔗 deafnet neku: split up the 4tb into zips or something
19:48 🔗 bzc6p How big is this, approximately?
19:48 🔗 WubTheCap 1.6 million files, 4 TB
19:48 🔗 aaaaaaaaa there are tools that can do that, like wpull and wget. The most basic method is to prepare a list of urls and feed it to a warc aware tool.
19:48 🔗 Kazzy neku: what's the total size? 4TB (assuming raid1)
19:49 🔗 Kazzy oh, there we go
19:49 🔗 Kazzy too big for archivebot, possibly a warrior project? (cc arkiver)
19:49 🔗 WubTheCap I guess neku can make an index file from the database's filenames
19:49 🔗 deafnet it took us 8 mins to get to this point
19:50 🔗 aaaaaaaaa apparently, I'm a autist neckbeard
19:50 🔗 neku WubTheCap, could work I guess
19:50 🔗 McGEE has quit IRC (Quit: Connection closed for inactivity)
19:51 🔗 WubTheCap There's some 404s in that database though, because of removed malware files and keeping a database record to prevent uploading them again
19:51 🔗 aaaaaaaaa those can be recorded in the warc too.
19:51 🔗 WubTheCap But, would it still be a better idea to contact info@archive.org for this?
19:52 🔗 bzc6p aaaaaaaaa: +1
19:52 🔗 bzc6p (autist neckbeard)
19:52 🔗 mistym has quit IRC (Remote host closed the connection)
19:53 🔗 aaaaaaaaa WubTheCap: are you concerned about the space?
19:53 🔗 WubTheCap aaaaaaaaa: It was SketchCow's recommendation yesterday
19:53 🔗 bzc6p Well, content should be warced because original URLs should be available through wayback machine
19:53 🔗 WubTheCap Yeah that's true
19:54 🔗 WubTheCap Also I don't know why but Wayback Machine doesn't like manually inputted a.pomf.se URLs
19:54 🔗 bzc6p ArchiveTeam uses 50GB warc pieces, that would result in 80 files. Neither size nor num of files is extremely large I think. However, SketchCow was who suggested contacting IA.
19:54 🔗 WubTheCap WARCs are a different thing though
19:54 🔗 aaaaaaaaa I think the concern may be the cost. I think 4 TB is $8000 in costs for them
19:54 🔗 Kazzy WubTheCap: http://a.pomf.se/robots.txt
19:54 🔗 Kazzy that's why
19:54 🔗 WubTheCap I have no idea about the file makeup on Pomf.se, but I assume most of them are screenshots under 2 MB or maybe WebM videos under 10 MB
19:55 🔗 WubTheCap Kazzy: ia_archiver though, also it used to allow everything
19:55 🔗 bzc6p aaaaaaaaa: This is archiveteam. I don't remember IA ever rejecting 4 TB.
19:56 🔗 bzc6p I don't even think IA counts the terabytes we upload. It's "nothing" on this scale.
19:56 🔗 xmc 4tb is awkwardly large to put in a single item
19:56 🔗 WubTheCap MediaCrush was split into some 60 chunks I think
19:56 🔗 xmc an IA item, afaik, lives as one thing on a computer
19:56 🔗 bzc6p I thought it should be done like AT does: 50GB warcs per item
19:56 🔗 bzc6p then go to a collection
19:56 🔗 xmc so 4TB would kind of monopolize a disk
19:56 🔗 WubTheCap e.g. https://archive.org/details/mediacrush_coldstorage_part_1
19:57 🔗 xmc so split it up for making IA's life easier
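
The splitting arithmetic discussed above (4 TB in 50 GB pieces giving 80 files) can be sketched in a few lines of Python. This is only an illustration, not the tooling ArchiveTeam actually uses: the function names are hypothetical, decimal units (1 TB = 10^12 bytes) are assumed, and the greedy grouping is one simple way to keep each item under the size limit.

```python
TB = 10**12  # decimal terabyte (assumption; the log does not say which unit is meant)
GB = 10**9   # decimal gigabyte


def item_count(total_bytes, item_bytes=50 * GB):
    """Number of items needed if every item is filled up to item_bytes."""
    # Integer ceiling division avoids floating-point rounding.
    return (total_bytes + item_bytes - 1) // item_bytes


def split_into_items(files, item_bytes=50 * GB):
    """Greedily group (name, size) pairs, in order, into items of at most item_bytes.

    A single file larger than item_bytes ends up in an item of its own.
    """
    items, current, current_size = [], [], 0
    for name, size in files:
        # Start a new item when adding this file would exceed the limit.
        if current and current_size + size > item_bytes:
            items.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        items.append(current)
    return items


if __name__ == "__main__":
    print(item_count(4 * TB))
```

With many small files, as assumed in the discussion, the greedy grouping should land close to the 80-item estimate, since each item fills almost completely before the next one starts.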
19:57 🔗 arkiver I think we should make this a warrior project
19:57 🔗 arkiver 4TB is fine
19:58 🔗 bzc6p If WubTheCap, neku don't have the space to do the scraping alone, it's obviously a Warrior project
19:58 🔗 WubTheCap bzc6p: There's 8 TB storage on Pomf's colocated server
19:59 🔗 WubTheCap 4 TB in use
19:59 🔗 arkiver I'd rather it be a warrior project than a .tar packup project
19:59 🔗 bzc6p if they do, we can give clear instructions on how to do it
19:59 🔗 bzc6p ok
19:59 🔗 WubTheCap 4x4 TB actually
19:59 🔗 bzc6p how?
19:59 🔗 WubTheCap 2x4 TB RAID1 | 2x4 TB RAID1
19:59 🔗 arkiver ok
20:00 🔗 arkiver neku: are you able to provide us with a list of all files?
20:00 🔗 arkiver we'll take care of the rest then
20:00 🔗 bzc6p Sorry for interrupting gentlemen, but I think it's time to go to a new channel
20:00 🔗 WubTheCap arkiver: Uploads are still enabled though, do you want to wait for those to be disabled?
20:00 🔗 deafnet suggestion #pomfret
20:00 🔗 Kazzy a moving target is harder to hit, WubTheCap
20:00 🔗 arkiver Yes, we should do that
20:01 🔗 neku waiting until I disable uploading would probably be good
20:01 🔗 arkiver neku: ok, we'll do that
20:01 🔗 arkiver can you then provide us with a list of the files?
20:02 🔗 arkiver (after uploading is disabled)
20:02 🔗 neku At that point I will provide a list of all files
20:03 🔗 Kazzy suggesting movement to #pomfret, to keep this channel clean ^^
20:03 🔗 arkiver neku: ok, thank you!
20:04 🔗 bzc6p Right
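
The file list agreed on above could be produced roughly as follows: take the stored filenames from the database and turn each into a public URL, one per line. This is a minimal sketch under stated assumptions. The base URL `http://a.pomf.se/` is inferred from the a.pomf.se links mentioned in the discussion, and the function names and the idea that the database yields plain filenames are hypothetical.

```python
from urllib.parse import quote

# Assumption: files are served directly under this host, as the a.pomf.se
# links in the discussion suggest. Adjust if the real layout differs.
BASE_URL = "http://a.pomf.se/"


def build_url_list(filenames, base_url=BASE_URL):
    """Turn stored filenames into public URLs, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for name in filenames:
        name = name.strip()
        if not name or name in seen:
            continue
        seen.add(name)
        # Percent-encode characters that are not safe in a URL path.
        urls.append(base_url + quote(name))
    return urls


def write_url_list(urls, path):
    """Write one URL per line, the usual input shape for list-driven crawlers."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")


if __name__ == "__main__":
    # Hypothetical filenames standing in for rows from the database.
    for url in build_url_list(["abc.png", "abc.png", "my file.webm"]):
        print(url)
```

Entries for removed files can stay in the list, since, as noted earlier in the log, the resulting 404 responses can be recorded in the WARC as well.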
20:13 🔗 mistym has joined #archiveteam
20:13 🔗 deafnet has left
20:26 🔗 Jonimus has joined #archiveteam
20:37 🔗 philpem has quit IRC (Remote host closed the connection)
20:38 🔗 rolfb has joined #archiveteam
20:48 🔗 mistym has quit IRC (Remote host closed the connection)
20:48 🔗 rolfb has quit IRC (Leaving...)
20:56 🔗 rolfb has joined #archiveteam
20:57 🔗 rolfb has quit IRC (Linkinus - http://linkinus.com)
21:06 🔗 mistym has joined #archiveteam
21:25 🔗 schbirid2 http://archiveteam.org/index.php?title=MediaCrush could use a link to the archive
22:07 🔗 sirdancea has quit IRC (Read error: Operation timed out)
22:12 🔗 SimpBrain has quit IRC (Quit: Leaving)
23:24 🔗 neku has quit IRC (Quit: Leaving)
23:25 🔗 Start so now that apple music's been introduced, looks like beats music doesn't have much longer to live
23:25 🔗 Ymgve has quit IRC ()
23:26 🔗 Start http://www.beatsmusic.com/robots.txt
23:26 🔗 Start heh
23:27 🔗 Start most of the content appears to be on on.beatsmusic.com
23:27 🔗 Start https://encrypted.google.com/search?q=site%3Aon.beatsmusic.com
23:48 🔗 primus104 has joined #archiveteam
