#archiveteam-bs 2016-12-13,Tue

Time Nickname Message
00:02 🔗 i336_ mmm. anxiety has a tendency to make the "kinda uncomfortably long time" thing end up being measured in seconds or low single-digit minutes, but I do get what you're talking about, and I'll see if I can stretch that to longer. thanks again
00:04 🔗 i336_ SketchCow: I also wanted to clarify one last thing. ex.ua has been officially closed, so the web interface no longer works. with rover.info, that's not the case. I personally believe rover.info will close too. the thing is, they're one and the same website - once you login to rover.info you see the EX logo and the site is identical. IMHO, rover.info is what we should have been mirroring all along. it
00:04 🔗 i336_ does require cookie management, but it provides complete access to the site.
00:04 🔗 i336_ SketchCow: right now we're just getting references to uploaded files, not the conversations. it's arguable this is a "99% vs 100%" thing, but it's also arguable that archiving the file references and archiving the conversations produces a very different dataset.
00:04 🔗 xmc yes. patience is a skill that must be learned.
00:05 🔗 xmc i336_: did you read me earlier when i said "trust us, the what.cd data is safe (if inaccessible)"?
00:05 🔗 i336_ xmc: yes, and I was really happy to learn about that :P
00:05 🔗 xmc ok
00:06 🔗 i336_ so you're saying that rover.info might also happen next?
00:06 🔗 xmc i am not saying that
00:06 🔗 i336_ ok.
00:06 🔗 Stiletto has joined #archiveteam-bs
00:06 🔗 xmc i have said nothing about ex.ua or rover.info
00:06 🔗 i336_ right
00:08 🔗 i336_ SketchCow: just reread what I said, I don't think I made this part clear - the API methods I've found only provide access to file listings. if we hit rover.info, we get a) files b) folders c) collections d) threads e) individual messages f) user avatars ...etc etc etc. I must be frank and say that I'm really frustrated that we aren't indexing that.
00:09 🔗 i336_ with the API methods we only get files, folders, and the first 100 items in collections. some users have collections with thousands of folders in them. (collections contain folders contain files)
00:09 🔗 xmc hey maybe you should make a channel for this project and discuss it there with interested parties
00:10 🔗 i336_ xmc: SketchCow isn't in #exexbaby
00:10 🔗 xmc then he's not interested in the project
00:11 🔗 xmc or he hasn't heard of it yet
00:11 🔗 xmc my money's on the first
00:11 🔗 i336_ I can understand that. I'm poking him because I don't know how else to convince arkiver that rover.info is a good idea
00:12 🔗 xmc that doesn't make any sense
00:12 🔗 i336_ what happened was that I first found the XSPFs, then I found the r_view URLs, and then the next day I found out about rover.info. each discovery completely superseded the one before it
00:12 🔗 xmc -> #exexbaby
00:12 🔗 i336_ ok
00:23 🔗 SketchCow You're concerning arkiver
00:23 🔗 SketchCow that takes talent.
00:31 🔗 krazedkat has quit IRC (Quit: Leaving)
00:33 🔗 * i336_ is sad now
00:42 🔗 SketchCow Pick yourself up and work on a coherent, limited, directed effort to save some portion of ex.ua.
00:42 🔗 * i336_ sighs in frustration
00:43 🔗 i336_ the thing is, I've been trying to say that ex.ua is not where the data is. rover.info is another domain run by the same company that provides full access to the site content. all the conversations etc,
00:43 🔗 i336_ s/,/./
00:43 🔗 i336_ I was wrong in thinking ex.ua was where it's at. rover.info IS. it requires cookie management to access, but provides a superset of the data on ex.ua.
00:43 🔗 i336_ unfortunately I did not make this discovery initially.
00:44 🔗 i336_ I don't really know what else to say. :(
00:44 🔗 i336_ I'm sorry if there's something here that I'm not getting or something I'm not picking up on.
00:45 🔗 i336_ another thing - saving rover.info is actually simpler than saving ex.ua
00:46 🔗 i336_ just save all URLs that return 200, ignore everything that sends a 302
00:46 🔗 i336_ if there's pagination in the page, send &per=200&p=... until you're on the last page, and then move on
00:46 🔗 i336_ done
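The crawl rule i336_ describes (keep every URL that answers 200, ignore 302 redirects, and step through pagination with &per=200&p=...) could be sketched roughly as below. This is only an illustration under assumptions: the `fetch` and `is_last_page` callables are hypothetical stand-ins for an HTTP client and a page parser, page numbering is assumed to start at 0, and the base URL is assumed to already carry a query string. None of this is confirmed by the log.

```python
from typing import Callable, Dict, Tuple


def crawl_paginated(
    fetch: Callable[[str], Tuple[int, str]],
    base_url: str,
    is_last_page: Callable[[str], bool],
    per: int = 200,
    max_pages: int = 10000,
) -> Dict[str, str]:
    """Collect page bodies for one paginated listing.

    fetch(url) is a hypothetical helper returning (status_code, body).
    is_last_page(body) is a hypothetical helper that inspects the HTML.
    """
    saved: Dict[str, str] = {}
    for page in range(max_pages):  # assumption: pages are numbered from 0
        url = f"{base_url}&per={per}&p={page}"
        status, body = fetch(url)
        if status != 200:
            # Anything other than 200 (for example a 302) is not saved.
            break
        saved[url] = body
        if is_last_page(body):
            break
    return saved
```

The `max_pages` cap is there only so that a listing whose last page is never detected cannot loop forever.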
00:47 🔗 trs80 i336: I think you've made the point about rover.info several times now. given the timezone differences, just chill and wait for people to respond
00:48 🔗 trs80 if they're interested, they'll join #exexbaby. if not, you're just spamming the channel and people will tune you out
00:48 🔗 i336_ alright then. sorry for the spam. I'll wait it out then.
00:48 🔗 i336_ and I'll try and get in touch with arkiver
00:51 🔗 SketchCow It's OK, only one user had to die
00:53 🔗 i336_ lol
00:55 🔗 robink has quit IRC (Read error: Connection reset by peer)
00:59 🔗 i336_ I'm curious if johansch can come back in here, or if he's gone for good
01:00 🔗 vantec Read topic, insert knowledge
01:00 🔗 * i336_ +1 knowledge!
01:00 🔗 * i336_ updates question
01:00 🔗 i336_ do bans ever get revoked?
01:00 🔗 xmc they tend to age out manually
01:01 🔗 i336_ right. I see.
01:01 🔗 xmc speaking of
01:01 🔗 xmc sets mode: -b *!uid118096@*
01:01 🔗 xmc sets mode: -b *!4f8dff3d@ag-255-61.sta.ji.cz
01:01 🔗 xmc sets mode: -b *!*Thunderbi@*.res.bhn.net
01:02 🔗 xmc sets mode: -b *!*webchat@*.res.bhn.net
01:02 🔗 godane i336_: the best i can see doing with a 50GB per month cap is to use archivebot
01:04 🔗 i336_ yeah
01:05 🔗 ndiddy has joined #archiveteam-bs
01:12 🔗 godane so i found this website: http://radio.garden/
01:13 🔗 robink has joined #archiveteam-bs
01:15 🔗 SketchCow https://archive.org/details/cratediggers?&sort=publicdate is going to slowly grow overnight
02:04 🔗 ndiddy has quit IRC (Quit: Leaving)
02:11 🔗 BlueMaxim has quit IRC (Quit: Leaving)
02:11 🔗 BlueMaxim has joined #archiveteam-bs
02:17 🔗 Stiletto has quit IRC (Read error: Operation timed out)
02:22 🔗 DFJustin "It also allegedly obtained and allegedly posted the allegedly uncut 5-minute "footage" of President George W. Bush allegedly sitting in a Florida classroom as the 9/11 attacks happened."
02:22 🔗 DFJustin that's a lot of allegedly
02:29 🔗 Asparagir I noticed that too.
02:44 🔗 robink has quit IRC (Read error: Connection reset by peer)
02:47 🔗 SketchCow I wrote a thing that is now going through 2.5 million items in the "audio uploads" section
02:47 🔗 SketchCow And if it has a cover, it's going into another collection
02:47 🔗 SketchCow Lotta audiobooks, turns out
02:48 🔗 SketchCow Occasional calls to jihad
02:56 🔗 hook54321 has quit IRC (Quit: Updating details, brb)
02:57 🔗 hook54321 has joined #archiveteam-bs
02:58 🔗 hook54321 has quit IRC (Client Quit)
02:58 🔗 robink has joined #archiveteam-bs
03:00 🔗 hook54321 has joined #archiveteam-bs
03:08 🔗 hook54321 has quit IRC (Quit: Updating details, brb)
03:11 🔗 arkiver SketchCow: we're now at 2.4 TB for the exua grab, can you please start the upload on FOS?
03:14 🔗 Stiletto has joined #archiveteam-bs
03:15 🔗 i336_ arkiver: I'm currently doing some work on figuring out the concrete details about how to parse rover.info's HTML. is there any interest in also adding rover.info to the crawl project?
03:17 🔗 i336_ arkiver: if there is, I've taken a look at setting up my own project tracker, and it's admittedly over my head. I have a small request. could you set up an unlisted project on the tracker that sends me say 5 or 6 specific URLs every time I request it, so I can have a go at building and tuning wget+lua to fetch rover.info?
03:17 🔗 i336_ (I realize sending the same URLs every time may be tricky)
03:26 🔗 i336_ woops, wrong channel (again), my apologies. I'll try and be more attentive
03:26 🔗 hook54321 has joined #archiveteam-bs
03:30 🔗 hook54321 what are the most reliable efnet servers?
03:47 🔗 vantec Your own leaf server, lol
03:51 🔗 hook54321 leaf server...?? what's that?
03:54 🔗 lain_ has joined #archiveteam-bs
04:07 🔗 Asparagir hook54321: I use irc.choopa.net, seems okay.
04:19 🔗 SketchCow arkiver: What is IN it
04:26 🔗 SketchCow What. Is. In. It
04:42 🔗 SketchCow Kenshin: Hey hey
04:42 🔗 Kenshin SketchCow: what's up?
04:43 🔗 SketchCow Are you doing parallel ex.ua stuff
04:43 🔗 Kenshin nope
04:43 🔗 SketchCow Good
04:43 🔗 SketchCow Avoid. We are not doing this project into IA servers
04:43 🔗 Kenshin k
04:44 🔗 Kenshin i usually only touch projects that can be hit hard
04:44 🔗 Kenshin normal projects the guys already hit hard enough
04:44 🔗 SketchCow Understood
04:49 🔗 SketchCow Are you able to kill a project off the tracker?
04:51 🔗 SketchCow We're really seeing the best of me today
04:59 🔗 yipdw yes, I do
04:59 🔗 yipdw but then, so does arkiver, so might just let him make the call
05:00 🔗 SketchCow I am deleting all data arriving on the machine
05:01 🔗 SketchCow related to exua
05:01 🔗 SketchCow So if we can stop it, that would be good
05:01 🔗 yipdw O_o
05:01 🔗 yipdw uh ok
05:01 🔗 i336_ SketchCow: is it gone yet? can I have a copy?
05:01 🔗 SketchCow exua was a needless distraction from gov backup
05:02 🔗 i336_ if there's anything left I wouldn't mind downloading it... it'll take me a really long time but I'd appreciate it
05:02 🔗 SketchCow No.
05:02 🔗 i336_ :(
05:19 🔗 SketchCow sets mode: +b *!*i336@*.lnse3.ken.bigpond.net.au
05:19 🔗 i336_ was kicked by SketchCow (i336_)
05:24 🔗 SketchCow Wow, 9gb came in after the delete
05:24 🔗 SketchCow I wonder if the pipes have stopped.
05:31 🔗 * xmc slurps
05:32 🔗 SketchCow root@teamarchive0:/1/CHFOO/warrior/exua# du -sh .
05:32 🔗 SketchCow 2.1G .
05:32 🔗 SketchCow So that's still happening.
05:32 🔗 SketchCow (I mean, I realize the jobs have stopped)
05:34 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
05:42 🔗 Sk1d has joined #archiveteam-bs
05:46 🔗 SketchCow In other news, I have scripts sorting audio on archive.org pretty intensely.
06:41 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
06:54 🔗 hook54321 SketchCow: what's ex.ua?
06:55 🔗 SketchCow shhh
06:57 🔗 hook54321 nvm, I searched in the logs
06:57 🔗 hook54321 go back to sleep now if I woke you up :P
06:58 🔗 SketchCow I'm doing this massive language sort using scripts
06:58 🔗 hook54321 Why isn't it going on archive.org?
07:03 🔗 kristian_ has joined #archiveteam-bs
07:05 🔗 SketchCow It's garbage
07:08 🔗 hook54321 Like, pastebin but more useless?
07:09 🔗 SketchCow Yes
07:09 🔗 SketchCow Megaupload, but for the Ukraine
07:10 🔗 hook54321 Yeah, I won't ever need to access anything from it then
07:10 🔗 SketchCow Busted multiple times
07:10 🔗 hook54321 Nor will most people
07:11 🔗 SketchCow Still working by a trick of IP geofiltering
07:11 🔗 hook54321 Busted for what exactly? Piracy?
07:11 🔗 SketchCow Probably so mobbed up it has a seat in the restaurant you knock people out of if it wants it
07:11 🔗 SketchCow megapiracy
07:11 🔗 SketchCow petabytes
07:12 🔗 SketchCow In the roughest way, I tried to think "well, if it had some Ukrainian culture, maybe"
07:12 🔗 SketchCow But it doesn't.
07:14 🔗 hook54321 It's kinda sad that some internet culture is largely hosted exclusively on pastebin and similar sites
07:15 🔗 hook54321 Are there any Ukrainian people involved in archiveteam?
07:38 🔗 GE has joined #archiveteam-bs
08:33 🔗 ravetcofx has quit IRC (Read error: Operation timed out)
08:56 🔗 BlueMaxim has quit IRC (Quit: Leaving)
10:30 🔗 kristian_ has quit IRC (Quit: Leaving)
10:57 🔗 arkiver SketchCow: it was all metadata and preview images.
10:58 🔗 arkiver Project is removed from the tracker and github
11:27 🔗 GE has quit IRC (Remote host closed the connection)
12:47 🔗 VADemon has joined #archiveteam-bs
13:00 🔗 GE has joined #archiveteam-bs
14:24 🔗 ravetcofx has joined #archiveteam-bs
14:30 🔗 SketchCow Again, sorry for the misunderstanding and miscommunication, arkiver
14:30 🔗 SketchCow All on me.
14:59 🔗 ravetcofx has quit IRC (Read error: Operation timed out)
15:25 🔗 Start has quit IRC (Quit: Disconnected.)
16:11 🔗 Honno has joined #archiveteam-bs
16:19 🔗 xmc what exploded while i was sleeping
16:28 🔗 SketchCow Everybody's dead
16:29 🔗 xmc that explains why it's so cold here
16:30 🔗 SketchCow And delicious
16:30 🔗 xmc damn fine cup of coffee
16:30 🔗 xmc speaking of, i went to the diner where twin peaks was filmed
16:30 🔗 xmc as billed, they have good coffee and cherry pie
16:32 🔗 SketchCow As I begin the basic run against the 2,500,000 items in the audio inbox, some amazing crap is starting to emerge.
16:32 🔗 SketchCow Best of all, my routine is running without me, classifying languages and moving bulks of uploads never regarded before.
16:32 🔗 xmc what kind of tasks do you perform on the things?
16:33 🔗 SketchCow Well, when someone uploads a lot, I can't just do a "move them all" command because it chokes the metadata manager.
16:33 🔗 xmc ah
16:33 🔗 SketchCow So I have a script that's got the list of one guy who uploaded 10,189 russian language audiobooks and is now shifting those over to a collection, over an hour.
16:33 🔗 SketchCow Another is searching for Arabic language items and classifying them, because they tend to be a bit of a mess.
16:34 🔗 SketchCow I'm also finding where one guy uploaded a pile of one theme
16:34 🔗 xmc beautiful
16:34 🔗 SketchCow And I have scripts that say "find everything from this guy and make a collection"
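The slow bulk move described above (working through one uploader's list over an hour instead of issuing a single "move them all" command) might look something like the following. It is a sketch only: `move_one` is a hypothetical callable standing in for whatever actually reassigns an item to a collection, and the batch size and pause are made-up values, not anything stated in the log.

```python
import time
from typing import Callable, Iterable, List


def move_in_batches(
    identifiers: Iterable[str],
    move_one: Callable[[str], None],
    batch_size: int = 50,
    pause_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Move items one at a time, pausing after every full batch.

    move_one is a hypothetical helper that moves a single item.
    sleep is injectable so the pacing can be tested without waiting.
    """
    moved: List[str] = []
    for index, identifier in enumerate(identifiers):
        # Pause before starting each new batch (but not before the first).
        if index and index % batch_size == 0:
            sleep(pause_seconds)
        move_one(identifier)
        moved.append(identifier)
    return moved
```

Spreading the work out this way trades total run time for a steady, low load on whatever service handles the metadata changes.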
16:34 🔗 SketchCow https://archive.org/details/audioboo_ru
16:34 🔗 xmc useful, that
16:35 🔗 SketchCow 6,449 audiobooks in russian.
16:35 🔗 xmc holy moly
16:35 🔗 SketchCow I know it will eventually have 10,500 in there.
16:35 🔗 SketchCow New ones come in with every refresh you do, due to that script.
16:36 🔗 SketchCow https://archive.org/details/cratediggers is my workspace. It'll grow and shrink
16:39 🔗 SketchCow Because I'll notice trends like "oh, someone uploaded 2300 2-hour chill mixes"
16:39 🔗 SketchCow And that becomes a collection
16:39 🔗 SketchCow Also, this find a language trick is now running in the texts uploads section, classifying thousands of Arabic texts and mislabelled arabic texts.
16:40 🔗 xmc how does it figure the language?
16:40 🔗 SketchCow Finds arabic characters
16:40 🔗 xmc oh, easy enough
16:40 🔗 xmc that's right you go for the automated 85% solution because it's better than the manual 100%-but-actually-only-3%-gets-done solution
16:41 🔗 SketchCow Always
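A rough version of the "finds arabic characters" heuristic could be written as follows. This is a guess at the general idea, not SketchCow's actual script: the Unicode ranges are the standard Arabic blocks, while the function names and the 50% threshold are assumptions chosen for illustration.

```python
# Standard Unicode blocks covering Arabic script and its presentation forms.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_char(ch: str) -> bool:
    """Return True if the character falls in one of the Arabic blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ARABIC_RANGES)


def looks_arabic(text: str, min_ratio: float = 0.5) -> bool:
    """Guess whether text (e.g. an item title) is mostly Arabic script.

    min_ratio is an assumed threshold: the share of letters that must be
    Arabic before the text is classified as Arabic.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic_count = sum(1 for ch in letters if is_arabic_char(ch))
    return arabic_count / len(letters) >= min_ratio
```

A script-based check like this cannot tell Arabic from other languages written in the same script (Persian or Urdu, for instance), which fits the "automated 85% solution" described above.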
16:41 🔗 SketchCow new collection of all this audio horseshit is http://archive.org/details/folksoundomy by the way
16:41 🔗 SketchCow I'm throwing in ecollections and adding new ones and so on
16:43 🔗 Honno has quit IRC (Quit: Leaving)
16:44 🔗 Honno has joined #archiveteam-bs
17:51 🔗 SketchCow I'm happy to say Russ Kick is joining us
17:51 🔗 SketchCow He is going to Archivebot archivebot like Scotty
17:53 🔗 Kaz The memory hole guy? (quick google)
17:53 🔗 xmc correct
18:55 🔗 SketchCow Keep an eye out for him
20:02 🔗 BlueMaxim has joined #archiveteam-bs
20:03 🔗 BlueMaxim has quit IRC (Client Quit)
20:03 🔗 BlueMaxim has joined #archiveteam-bs
20:05 🔗 HCross2 arkiver: can you run multiple FTP chunkers on one FTP server to speed it up?
20:06 🔗 arkiver not right now
20:06 🔗 arkiver but I need to update the FTP project
20:06 🔗 arkiver to make discovery also warrior compatible
20:06 🔗 HCross2 Chunking 200tb will take a long time
20:06 🔗 godane agree
20:06 🔗 arkiver that will be after the 23 december though
20:06 🔗 arkiver depends on the number of files?
20:07 🔗 HCross2 True. Hopefully it's like 100 2tb files :p
20:10 🔗 Igloo I've looked at NOAA
20:10 🔗 Igloo It's not.
20:10 🔗 Igloo Few KB txt files, csv files
20:10 🔗 squires you probably know, but it's no large files
20:10 🔗 Igloo Occasional zip files
20:10 🔗 squires *not
20:11 🔗 HCross2 Fun
20:11 🔗 squires generally it's a file per day, week, month, etc... for each and every type of dataset
20:11 🔗 squires of which there are hundreds
20:11 🔗 Igloo I tried squirting some through archivebot, It didn't like it much.
20:11 🔗 squires and separate date-sorted files per locale
20:11 🔗 squires ie, individual weather station or town
20:11 🔗 HCross2 Igloo: my local grab site very promptly fell over too
20:12 🔗 HCross2 Times out on dir listing
20:12 🔗 Igloo My big dedicated is currently crunching through the same URL list
20:12 🔗 Igloo However, I have noticed inconsistencies with the FTP server
20:12 🔗 Igloo From one IP the directory is valid
20:12 🔗 Igloo From another it is not.
20:12 🔗 HCross2 They may IP and connection limit
20:13 🔗 HCross2 We had that problem massively on one FTP grab before
20:13 🔗 Igloo Maybe. I've got 2 concurrent conns
20:13 🔗 Igloo From one IP at the mo.
20:13 🔗 Igloo I need to think how to effectively chunk this up in a manner which is going to be manageable to a) download b) get into IA
20:14 🔗 Igloo Waiting to see how big this test WARC is and go from there really
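One simple way to think about the chunking question is to group a file listing greedily so that each chunk stays under a size budget. The sketch below only illustrates that idea: it is not how the FTP project's chunker works, and both the `(path, size)` input format and the budget are assumptions.

```python
from typing import Iterable, List, Tuple


def chunk_by_size(
    files: Iterable[Tuple[str, int]],
    max_bytes: int,
) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs into chunks.

    A chunk is closed as soon as adding the next file would push it past
    max_bytes. A single file larger than max_bytes gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(path)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

With many files of only a few KB, as described for NOAA above, the number of files per chunk may matter as much as the byte total, so a real version would probably also cap the file count.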
20:22 🔗 SketchCow Climate change work, go to #cheetoflee
21:05 🔗 Start has joined #archiveteam-bs
21:10 🔗 Ravenloft has joined #archiveteam-bs
21:24 🔗 Meroje has quit IRC (Quit: bye!)
21:25 🔗 Meroje has joined #archiveteam-bs
21:40 🔗 Start has quit IRC (Quit: Disconnected.)
21:44 🔗 Start has joined #archiveteam-bs
21:51 🔗 glass3 has joined #archiveteam-bs
22:11 🔗 Start has quit IRC (Remote host closed the connection)
22:15 🔗 Start has joined #archiveteam-bs
22:23 🔗 Start has quit IRC (Quit: Disconnected.)
23:04 🔗 GE has quit IRC (Remote host closed the connection)
23:43 🔗 Start has joined #archiveteam-bs
23:55 🔗 DiscantX has joined #archiveteam-bs
