#archiveteam-bs 2016-03-31,Thu

Time Nickname Message
00:35 🔗 hawc145 has joined #archiveteam-bs
00:39 🔗 HCross has quit IRC (Ping timeout: 370 seconds)
00:50 🔗 JesseW has joined #archiveteam-bs
00:55 🔗 BlueMaxim has joined #archiveteam-bs
01:00 🔗 Cameron_D has quit IRC (Ping timeout: 370 seconds)
01:02 🔗 phuzion has quit IRC (Ping timeout: 370 seconds)
01:02 🔗 phuzion has joined #archiveteam-bs
01:07 🔗 MrRadar has quit IRC (Ping timeout: 370 seconds)
01:19 🔗 Cameron_D has joined #archiveteam-bs
01:20 🔗 ErkDog has quit IRC (Read error: Operation timed out)
01:21 🔗 ErkDog has joined #archiveteam-bs
01:28 🔗 useretail has quit IRC (Ping timeout: 244 seconds)
01:30 🔗 tomwsmf-a has quit IRC (Read error: Operation timed out)
01:50 🔗 ohhdemgir has quit IRC (Ping timeout: 1208 seconds)
01:51 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
01:58 🔗 ohhdemgir has joined #archiveteam-bs
02:02 🔗 MrRadar has joined #archiveteam-bs
02:03 🔗 JesseW has joined #archiveteam-bs
02:08 🔗 dashcloud has quit IRC (Read error: Operation timed out)
02:11 🔗 dashcloud has joined #archiveteam-bs
02:27 🔗 JesseW bsmith093: verified that the uploaded MD5s match what I have.
02:27 🔗 JesseW I think I'm going to delete my copy of the old tarball now.
02:51 🔗 Stiletto is now known as Stilett0
03:07 🔗 bsmith093 JesseW: i will too, i could use the space
03:14 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
03:16 🔗 BlueMaxim has joined #archiveteam-bs
03:46 🔗 JesseW since the IA provided torrent won't include the full data, it would probably be good to make a torrent
03:50 🔗 bsmith094 JesseW: on it
03:50 🔗 JesseW :-)
03:50 🔗 bsmith094 serious'y, though
03:50 🔗 bsmith094 JesseW: Seriously though, thanks for your help.
03:51 🔗 bwn_ has quit IRC (Ping timeout: 492 seconds)
03:53 🔗 JesseW You are very welcome! I still need to combine the csvs into an sqlite db.
03:53 🔗 JesseW If you want to hack up a shell command to do that, I'd be grateful.
03:54 🔗 JesseW There's one csv, called inventory.csv, in each directory -- I need to run the sqlite import on each one.
03:55 🔗 bsmith093 JesseW: maybe a for loop
03:55 🔗 bsmith093 JesseW: at least something to get them all in one directory, named for the folder they were a database *of*
03:56 🔗 bsmith093 JesseW: also, throw that in the fos directory when it's done, i'd want to include that for analysis.
03:56 🔗 bsmith093 :)
03:59 🔗 JesseW sure
04:06 🔗 bsmith093 JesseW: Possibly a very stupid question, but can you tell a for loop to go alphabetically? is that a thing
04:06 🔗 toad2 has quit IRC (Read error: Operation timed out)
04:07 🔗 bsmith093 JesseW: it might be faster to just import one by one, and use the command history to speed it along.
04:08 🔗 toad1 has joined #archiveteam-bs
04:10 🔗 Atluxity bsmith093: pipe the input to the for loop via sort-command
04:10 🔗 Atluxity bsmith093: do you have an example?
04:13 🔗 bsmith093 Atluxity: trying to import 30 something csv files, into one sql db file, recursively.
04:13 🔗 JesseW well, I think the `find` command is probably the right answer for finding the csvs
04:13 🔗 JesseW and it's not 30.
04:13 🔗 bsmith093 all in different sub folders of the same path
04:13 🔗 JesseW It's more like thousands
04:13 🔗 JesseW one per *folder*
04:13 🔗 bsmith093 JesseW: yes, per A B C folder, right?
04:14 🔗 JesseW nooo
04:14 🔗 toad1 has quit IRC (Read error: Operation timed out)
04:14 🔗 JesseW one per folder in Fanfiction
04:14 🔗 bsmith093 wait, you went with CATEGORIES?! oy
04:14 🔗 JesseW it was the easiest way to generate them
04:15 🔗 JesseW so yeah, many thousands
04:16 🔗 bsmith093 so yeah, maybe: for file in */inventory.csv; do sqlite3 -csv final.db ".import '$file' metadata"; done
04:16 🔗 bsmith093 i've found several gui tools that do this, such as razor sql, free 30 day trial, full functions
04:17 🔗 JesseW 89,472 to be specific.
04:17 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:17 🔗 JesseW well, sqlite has .import
04:18 🔗 toad1 has joined #archiveteam-bs
04:24 🔗 Sk1d has joined #archiveteam-bs
04:26 🔗 bsmith093 JesseW: here http://pastebin.com/DQD5h0Li found this
04:27 🔗 JesseW thx
04:29 🔗 bsmith093 if you have php, and a for loop for the files, that should work
04:31 🔗 bsmith093 turns out this is a rather edge case problem!
04:31 🔗 JesseW heh
04:32 🔗 JesseW I'm actually going to do it this way: cat baz | awk '{print ".import \""$0"\" metadata\nselect \""$0"\", count(*) from metadata;"}' | sqlite3 -csv metadata.sqlite
04:32 🔗 JesseW (baz contains a list of paths to the csvs)
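For comparison, the same bulk import can be sketched in Python rather than awk piped into sqlite3. This is only an illustration under assumptions: the function name, the column list, and the idea that every inventory.csv is a headerless file with the same number of columns are all hypothetical, since the log does not show the CSV layout. The `metadata` table name is taken from the command above.

```python
import csv
import sqlite3
from pathlib import Path


def import_inventories(root, db_path, columns):
    """Load every inventory.csv found under root into one 'metadata' table.

    columns is a hypothetical list of column names; the real layout of the
    CSVs is not shown in the log.
    """
    conn = sqlite3.connect(db_path)
    col_defs = ", ".join('"%s" TEXT' % c for c in columns)
    conn.execute("CREATE TABLE IF NOT EXISTS metadata (%s)" % col_defs)
    placeholders = ", ".join("?" for _ in columns)
    total = 0
    # sorted() walks the category folders in alphabetical order
    for csv_path in sorted(Path(root).rglob("inventory.csv")):
        with open(csv_path, newline="", encoding="utf-8") as fh:
            # rows with an unexpected number of fields are skipped
            rows = [r for r in csv.reader(fh) if len(r) == len(columns)]
        with conn:  # one transaction per file keeps the inserts fast
            conn.executemany(
                "INSERT INTO metadata VALUES (%s)" % placeholders, rows
            )
        total += len(rows)
    conn.close()
    return total
```

Committing once per file instead of once per row is the main design choice here; per-row commits are usually what makes SQLite bulk loads slow.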
04:34 🔗 bsmith093 JesseW: i know nothing about awk, and regex(?) scares the pants off me.
04:34 🔗 JesseW :-)
04:34 🔗 JesseW no regexes in there, actually
04:34 🔗 bsmith093 thats actually the simplest-looking regex i've ever seen!
04:35 🔗 JesseW up to 13,000
04:35 🔗 bsmith093 ah, well thats why, then.
04:36 🔗 JesseW how many files is it in total, again?
04:36 🔗 bsmith093 why not just one huge csv-file-list fiel?>
04:36 🔗 bsmith093 *file ?
04:36 🔗 Atluxity JesseW: wc -l baz
04:36 🔗 JesseW Atluxity: that's a count of categories, not individual files.
04:36 🔗 Atluxity ah
04:37 🔗 JesseW I have a list of files, but it takes long enough to *run* wc -l that I thought I'd ask bsmith093 to run it again instead of me. :-)
04:37 🔗 bsmith093 man, everything about this project is huge, isn't it?
04:37 🔗 Atluxity :)
04:37 🔗 Atluxity big data baby
04:37 🔗 JesseW eh, huge-*ish*
04:37 🔗 bsmith093 JesseW: um, run what, you have the csv's
04:38 🔗 JesseW just a count of inventory.txt
04:43 🔗 JesseW up to 257,265
04:43 🔗 bsmith093 JesseW: i had to rebuild it, because i forgot to add Fanfiction_misc.zip
04:44 🔗 JesseW bsmith093: could you update the description on https://archive.org/details/FanfictionNearlyCompleteArchive to link to the repack?
04:44 🔗 bsmith093 probably a faster way, but i figured why take chances
04:44 🔗 JesseW rebuild what, the torrent?
04:45 🔗 JesseW it looks like there are less than 10 million stories
04:46 🔗 bsmith093 JesseW: no, the inventory file.
04:47 🔗 JesseW ah, I see.
04:47 🔗 * JesseW is now reading the Minesweeper fanfic
04:48 🔗 bsmith093 JesseW: 6,845,581 lines.
04:48 🔗 bsmith093 JesseW: read the zoo tycoon fanfic, bring brain bleach.
04:49 🔗 JesseW ha
04:50 🔗 JesseW well if there are less than 7 million, then I'm about 7% done
04:50 🔗 JesseW ~ 567,000 entered
05:03 🔗 bsmith093 JesseW: rebuilding md5, btw, is there a way to add a line to an md5sum-generated file, that's very tedious to re-do every time.
05:03 🔗 bsmith093 JesseW: also reuploading inventory zip file
05:04 🔗 JesseW sure, open it in a text editor (like notepad) and paste it in. :-)
05:05 🔗 JesseW but you don't really need to make the md5.txt file -- IA generates them itself, in _files.xml
05:05 🔗 bsmith093 well now i feel really stupid :P
05:06 🔗 JesseW ;-P
05:06 🔗 bsmith093 btw, sudo pip install ia , archive.org cli interface
05:07 🔗 bsmith093 works great, if you have an account, just get your secret key, for creds, run ia configure
05:07 🔗 JesseW I know, it's very nice -- I've contributed to it
05:07 🔗 bsmith093 oh, right, forgot, great work!
05:07 🔗 bsmith093 inventory.zip uploaded.
05:08 🔗 bsmith093 i also changed the old tar description to point to this upload.
05:08 🔗 bsmith093 JesseW: how goes the sql import?
05:09 🔗 JesseW 618,000
05:09 🔗 bsmith093 JesseW: roughly how many per second
05:09 🔗 JesseW pretty slow
05:11 🔗 JesseW done about 7,000 of the 89,000 categories
05:11 🔗 JesseW I think I'm going to turn off the counting
05:12 🔗 bsmith093 might speed it up, a bit anyway
05:14 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
05:15 🔗 BlueMaxim has joined #archiveteam-bs
05:20 🔗 bsmith093 JesseW: protip, when screenlog.0 is annoying to parse, do screen -S name, then run "script -f logname command" inside it, works much better.
05:21 🔗 JesseW hm
05:21 🔗 JesseW yeah, I need to get more familiar with script
05:21 🔗 bsmith093 JesseW: -f is flush, it's basically a realtime log file
05:22 🔗 JesseW removing the count is a LOT faster
05:22 🔗 bsmith093 you worked with the python IA package, so you might know, is there a way to verify uploads after the fact?
05:23 🔗 JesseW what do you mean by "verify"?
05:23 🔗 bsmith093 the -v option, i never uploaded the fanfic grab with it, is it too late now to verify?
05:23 🔗 JesseW https://archive.org/metadata/fanfictiondotnet_repack/files/0
05:25 🔗 JesseW https://github.com/jjjake/internetarchive/blob/master/internetarchive/cli/ia_upload.py
05:26 🔗 JesseW https://github.com/jjjake/internetarchive/blob/master/internetarchive/item.py#L492
05:27 🔗 JesseW It's just checking the md5
05:27 🔗 JesseW so you can do that after the fact
05:27 🔗 JesseW see my first link
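Since the check described above is just an MD5 comparison, it can be done after the fact with a few lines of Python. This is a minimal sketch, not the `ia` tool's own code: the function names are made up, and it assumes the item's _files.xml lists each file as a `<file name="...">` element with an `<md5>` child.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs do not need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5s(files_xml_text):
    """Map file name -> md5, assuming <file name="..."><md5>..</md5></file>."""
    root = ET.fromstring(files_xml_text)
    result = {}
    for file_el in root.iter("file"):
        md5 = file_el.findtext("md5")
        if md5:
            result[file_el.get("name")] = md5.strip().lower()
    return result


def find_mismatches(local_dir, files_xml_text):
    """Return names whose local copy is missing or hashes differently."""
    bad = []
    for name, md5 in sorted(expected_md5s(files_xml_text).items()):
        local = Path(local_dir) / name
        if not local.is_file() or md5_of_file(local) != md5:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every file listed in the XML exists locally with a matching hash.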
05:31 🔗 JetBalsa has quit IRC (Ping timeout: 250 seconds)
05:31 🔗 JetBalsa has joined #archiveteam-bs
05:37 🔗 metalcamp has joined #archiveteam-bs
05:40 🔗 JesseW 25,000 categories done
05:44 🔗 VADemon has joined #archiveteam-bs
05:45 🔗 JesseW 31,000
05:50 🔗 bsmith093 whoo, progress!
05:53 🔗 JesseW 41,000
05:55 🔗 bsmith093 ~38 minutes to go, at current speed
05:57 🔗 JesseW 45,000
05:59 🔗 bsmith093 JesseW: ~1200/min
05:59 🔗 bsmith093 i'm crazy bored, also timestamps rule!
06:02 🔗 JesseW if you're bored, go analyze some url shorteners. :-)
06:03 🔗 JesseW http://archiveteam.org/index.php?title=URLTeam
06:03 🔗 JesseW 52,000
06:05 🔗 bsmith093 i just noticed i have neither wireshark or virtualbox on the machine. fixing.
06:06 🔗 JesseW heh. good things to fix
06:06 🔗 JesseW it is now working on Harry Potter
06:06 🔗 JesseW and done with that
06:06 🔗 bsmith093 the largest plurality of stories
06:06 🔗 bsmith093 damn!
06:06 🔗 JesseW 55,000
06:07 🔗 JesseW I want to clean up my (non-virtual) desk top -- but there's a lack of room to clean it off into. :-/
06:09 🔗 bsmith093 attics are great for that :P
06:10 🔗 JesseW heh -- sadly, no attic
06:10 🔗 bsmith093 JesseW: can i still run the urlteam thing without the warrior?
06:11 🔗 JesseW 59,000
06:11 🔗 JesseW bsmith093: you bet
06:11 🔗 JesseW most of the big contributors do, I think
06:11 🔗 JesseW ask Atluxity or johtso about doing so
06:16 🔗 JetBalsa has quit IRC (Ping timeout: 250 seconds)
06:16 🔗 JetBalsa has joined #archiveteam-bs
06:17 🔗 bsmith093 k urlteam grabber is running in screen, whoo! how often does it phone home with its results?
06:17 🔗 JesseW after each batch, usually 50 items
06:17 🔗 Frogging ^
06:19 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
06:20 🔗 BlueMaxim has joined #archiveteam-bs
06:21 🔗 JesseW 71,000
06:27 🔗 JetBalsa has quit IRC (Read error: Operation timed out)
06:28 🔗 bsmith093 i'm also running the new fanfic id's through fanficfare, grabbing everything from 10-12 million
06:28 🔗 JetBalsa has joined #archiveteam-bs
06:29 🔗 bsmith093 ls -aR | wc -l returns 71537 files so far
06:30 🔗 bsmith093 956505 id's to go
06:33 🔗 JesseW nice!
06:33 🔗 bsmith093 plus i no longer have the .hack sign problem, that character was added to the unsafe chars list,
06:33 🔗 JesseW I'm reading through: https://www.fanfiction.net/s/1106180/1/ -- which is quite good
06:33 🔗 bsmith093 if it would be the first character, it's now an underscore by default
06:36 🔗 bsmith093 JesseW: https://www.fanfiction.net/game/Minesweeper/?&srt=1&lan=1&r=10&len=10
06:36 🔗 bsmith093 2 minesweeper stories over 10K words
06:36 🔗 bsmith093 you are reading the other one
06:36 🔗 JesseW heh
06:37 🔗 JesseW finished making the database
06:37 🔗 JesseW now counting items in it
06:39 🔗 bsmith093 bet your ass i'm putting both of those minesweeper stories into calibre
06:40 🔗 JesseW :-)
06:41 🔗 JesseW apparently we aren't the only ones to like it: " and for that one Minesweeper fic I wrote years ago that got kind of famous. "
06:41 🔗 JesseW http://www.whoaisnotme.net/anakinmcfly/fanfic.htm
06:41 🔗 bsmith093 also seriously, this. Easily the 3rd or 4th most amazing thing I've ever read, given the source material. https://www.fanfiction.net/s/10983213/1/The-True-Love-Loophole
06:43 🔗 JesseW 6,704,321 in the database
06:43 🔗 JesseW 4.7GB
06:43 🔗 JesseW sending it up to FOS now.
06:45 🔗 JesseW probably about 20 minutes
06:47 🔗 Honno has joined #archiveteam-bs
06:50 🔗 bsmith093 JesseW: i found some random fanfics you might find hilarious
06:50 🔗 JesseW are they not online?
06:50 🔗 bsmith093 most of them aren't
06:51 🔗 JesseW toss them on FOS -- my IRC client doesn't like file transfers
06:51 🔗 JesseW thanks, though
06:51 🔗 bsmith093 that explains a lot
06:52 🔗 bsmith093 annnnd done!
06:52 🔗 bsmith093 in fanfic repack, for consistency
06:53 🔗 JesseW nods
06:54 🔗 JesseW 9 minutes for the db
06:55 🔗 JesseW now I'm checking if there are any fics with over 200 chapters
06:56 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
06:58 🔗 BlueMaxim has joined #archiveteam-bs
06:58 🔗 bsmith093 there are
07:00 🔗 JesseW The one with the most chapters is CentiStories, with 985.
07:01 🔗 bsmith093 the matrix fanfic in that folder i sent, is from the agent's perspective.
07:01 🔗 bsmith093 somewhere in that mess of stories is a 15 MB fanfic
07:01 🔗 bsmith093 thats not a typo
07:01 🔗 JesseW heh
07:02 🔗 bsmith093 also this https://www.fanfiction.net/s/4112682/1/The-Subspace-Emissary-s-Worlds-Conquest
07:02 🔗 JesseW 4 entries with weird values for Rating.
07:03 🔗 JesseW The rest are T, K, K+ and M.
07:04 🔗 JesseW 3 million T, one million each of the others
07:04 🔗 bsmith093 what values are weird?
07:05 🔗 JesseW 3.2 million Completed, 3.4 million In-Progress
07:06 🔗 bsmith093 how are you getting this info so fats?!
07:06 🔗 bsmith093 fast!
07:06 🔗 bsmith093 what acn read this?
07:06 🔗 JesseW SQL, my dear, SQL!
07:06 🔗 bsmith093 gui?
07:06 🔗 bsmith093 i have sqliteman and sqlbrowser
07:06 🔗 JesseW Here are the weird values for Rating:
07:06 🔗 JesseW Eigentlich G, aber wegen einem Satz, einer zeile aus ei
07:06 🔗 JesseW PG-13 for language, I suppose
07:06 🔗 JesseW PG...should be higher, because it's d
07:06 🔗 JesseW Viol
07:06 🔗 JesseW uh... just a bit of swearing. just being careful, ya know?
07:07 🔗 JesseW One each.
07:07 🔗 JesseW I just use the sqlite3 command shell.
07:07 🔗 bsmith093 what are the id numbers for those, they are probably acient
07:07 🔗 JesseW probably; I'll let you find them.
07:07 🔗 bsmith093 i can't type anymore :P
07:08 🔗 JesseW there are also 79 that I failed to extract anything from.
07:08 🔗 bsmith093 totally blank entries
07:08 🔗 bsmith093 ???
07:08 🔗 JesseW yeah, except for the path.
07:08 🔗 JesseW select * from metadata where Status = "";
07:08 🔗 JesseW will show them
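The counts quoted above are the kind of thing a GROUP BY query produces. Here is a hedged sketch using Python's built-in sqlite3 module: the `metadata` table and the `Rating` and `Status` columns are named in the log, but the function name and everything else about the schema are assumptions.

```python
import sqlite3


def summarize(conn):
    """Return (rows per Rating, rows per Status, count of blank Status)."""
    by_rating = conn.execute(
        "SELECT Rating, COUNT(*) FROM metadata "
        "GROUP BY Rating ORDER BY COUNT(*) DESC, Rating"
    ).fetchall()
    by_status = conn.execute(
        "SELECT Status, COUNT(*) FROM metadata "
        "GROUP BY Status ORDER BY COUNT(*) DESC, Status"
    ).fetchall()
    # single quotes mark a string literal in standard SQL
    blank = conn.execute(
        "SELECT COUNT(*) FROM metadata WHERE Status = ''"
    ).fetchone()[0]
    return by_rating, by_status, blank
```

Odd one-off Rating values like the ones listed above show up at the bottom of `by_rating`, each with a count of 1.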
07:08 🔗 JesseW BTW, the database is now up at FOS
07:09 🔗 JesseW metadata.sqlite
07:09 🔗 JesseW in the usual directory
07:09 🔗 bsmith093 what files, example? grabbing it already
07:09 🔗 bsmith093 38 minties to go
07:09 🔗 bsmith093 dear $deity the typos are multiplying!?
07:14 🔗 VADemon has quit IRC (Quit: left4dead)
07:19 🔗 VADemon has joined #archiveteam-bs
07:20 🔗 VADemon has quit IRC (Client Quit)
07:20 🔗 JesseW the 79 appear to be empty files.
07:22 🔗 VADemon has joined #archiveteam-bs
07:23 🔗 bsmith093 damn. oh well
07:24 🔗 bsmith093 20 minutes to go grabbing that db.
07:24 🔗 VADemon has quit IRC (Client Quit)
07:24 🔗 bsmith093 solid 2 MB/s though so can't complain, even though it could be going 5 times faster!
07:24 🔗 VADemon has joined #archiveteam-bs
07:27 🔗 VADemon_ has joined #archiveteam-bs
07:27 🔗 VADemon_ has quit IRC (Read error: Connection reset by peer)
07:28 🔗 VADemon_ has joined #archiveteam-bs
07:29 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
07:29 🔗 VADemon has quit IRC (Ping timeout: 250 seconds)
07:32 🔗 VADemon_ has quit IRC (Read error: Connection reset by peer)
07:54 🔗 schbirid has joined #archiveteam-bs
08:00 🔗 bwn has joined #archiveteam-bs
08:04 🔗 wyatt8750 has joined #archiveteam-bs
08:04 🔗 wyatt8740 has quit IRC (Read error: Connection reset by peer)
08:17 🔗 hawc145 is now known as HCross
08:22 🔗 bwn has quit IRC (Ping timeout: 1208 seconds)
08:23 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
08:52 🔗 VADemon has joined #archiveteam-bs
09:05 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
09:07 🔗 BlueMaxim has joined #archiveteam-bs
09:11 🔗 ohhdemgir has quit IRC (Ping timeout: 260 seconds)
09:33 🔗 Kaz_ has joined #archiveteam-bs
09:33 🔗 Kaz has quit IRC (Read error: Operation timed out)
10:04 🔗 ohhdemgir has joined #archiveteam-bs
10:46 🔗 Honno has quit IRC (Read error: Connection reset by peer)
10:53 🔗 Honno has joined #archiveteam-bs
11:01 🔗 RedType_ has quit IRC (Ping timeout: 258 seconds)
11:33 🔗 Honno VADemon, hey you were spot on with the virtualization thing in BIOS, thanks ^^
11:45 🔗 VADemon Honno: does it work now? :O
11:45 🔗 Honno yeah VADemon
11:45 🔗 SilSte has quit IRC (Ping timeout: 492 seconds)
11:46 🔗 VADemon That's fantastic!
11:46 🔗 VADemon Warrior Helper +1
11:47 🔗 Honno VADemon, if I turn it on, get on the browser, and start running the warrior, then turn off that tab, will the warrior still be archiving?
11:50 🔗 VADemon Yes, the web page is just for controlling the warrior
11:50 🔗 Honno Sweet
11:50 🔗 VADemon The archiving runs as long as the virtual machine is running
11:50 🔗 Honno yea
11:51 🔗 Honno o livejournal is being archived huh
11:57 🔗 VADemon Yeah and "SCRIPTS ONLY" does exclude warriors
12:04 🔗 godane so i found a French magazine called 20 Minutes
12:04 🔗 godane it has pdfs going back to at least 2012
12:06 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
12:16 🔗 Honno chfoo, whats the difference between archiveteam_gamemaker_20141118080519.cdx.gz, archiveteam_gamemaker_20141118080519.cdx.idx, gamemaker_20141118080519.megawarc.json.gz and gamemaker_20141118080519.megawarc.warc.os.cdx.gz, for this archive?
12:16 🔗 Honno https://archive.org/download/archiveteam_gamemaker_20141118080519
12:17 🔗 Honno For all the other parts as well, they have these files, which is the index thingy I should use?
12:17 🔗 Honno I'm using the warc.os.cdx.gz and it seems fine, but it takes ages to load any page so I'm thinking it's not the right index file?
12:28 🔗 joepie91 Honno: the .idx files are indexes for the corresponding .warc files - that is, they contain newline-delimited information about each request/response and where in the warc to find it
12:28 🔗 joepie91 er
12:28 🔗 joepie91 cdx files, sorry
12:29 🔗 joepie91 not sure about the idx
12:29 🔗 Honno There are two cdx files there
12:29 🔗 Honno Which one should I be using?
12:29 🔗 joepie91 I -think- the megawarc is the right one
12:29 🔗 joepie91 sec
12:29 🔗 Honno yeah I think it is as well
12:29 🔗 Honno This just takes ages to load a page
12:29 🔗 joepie91 okay, yes
12:29 🔗 joepie91 Honno: so
12:29 🔗 joepie91 Honno: one second
12:29 🔗 Honno It's a massive warc collection tho, amounts to 600gbish
12:29 🔗 Honno 50gb per warc
12:29 🔗 ersi cute
12:31 🔗 joepie91 Honno: http://storage2.static.itmages.com/i/16/0331/h_1459427518_7459135_2a03278340.png
12:31 🔗 joepie91 Honno: notice how the first few files match the name of the item
12:31 🔗 joepie91 ie. in the archive.org/details/XXX url
12:32 🔗 joepie91 so it's just the metadata for that
12:32 🔗 * ersi drops jaw
12:32 🔗 joepie91 the description, tags, uploader, and so on
12:32 🔗 Honno Ah joepie91, thanks yeah, thats what I guessed but I wasn't sure how the archive team/most people upload warcs like this
12:32 🔗 joepie91 Honno: well, that info is added by IA itself automatically :)
12:32 🔗 Honno and I don't know how warcs work or how to do anything and aggh
12:32 🔗 Honno mhmk
12:32 🔗 ersi It's often in MegaWARCs, with WARC's inside
12:32 🔗 joepie91 Honno: only the blue part is what archiveteam added
12:32 🔗 joepie91 Honno: how are you currently trying to use/read the megawarc?
12:33 🔗 Honno joepie91, I'm using pywb, storing it as a collection (didn't concat everything)
12:33 🔗 Honno should I do that for faster speeds?
12:33 🔗 joepie91 I haven't used pywb, but does it read the index files automatically? (cc ersi)
12:33 🔗 Honno yeah
12:33 🔗 Honno I think?
12:33 🔗 Honno it says it finds my index files
12:33 🔗 joepie91 because then it shouldn't be slow
12:33 🔗 joepie91 Honno: did it identify the .warc.os.cdx.gz?
12:33 🔗 joepie91 ie. the last item
12:33 🔗 joepie91 in that list
12:33 🔗 Honno yep
12:34 🔗 Honno only those work for it btw
12:34 🔗 joepie91 strange, then it shouldn't be slow
12:34 🔗 Honno all the others it just says "no cdx found"
12:34 🔗 joepie91 that would make sense :p
12:34 🔗 Honno yeah heh sorry
12:35 🔗 joepie91 but yeah, I haven't used pywb... but if it keeps being slow, then maybe there's a bug?
12:35 🔗 dashcloud has quit IRC (Ping timeout: 250 seconds)
12:36 🔗 Honno joepie91, I don't know anything about warcs really, so I take it they work by "recreating" every page by doing the GET requests for each piece of content that the original site would have done, rather than store static stuff?
12:36 🔗 Honno do I make sense haha
12:39 🔗 Honno so I just checked, it took 3.5 minutes to load this page http://sandbox.yoyogames.com/games/174569-innoquous-4
12:39 🔗 Honno on the warc file
12:41 🔗 dashcloud has joined #archiveteam-bs
12:47 🔗 joepie91 Honno: a WARC file is basically just a big file of HTTP requests and responses. when a site is archived, every request and response is added to the end of the WARC
12:48 🔗 joepie91 Honno: the key is that it stores EVERYTHING
12:48 🔗 joepie91 including HTTP headers
12:48 🔗 Honno joepie91, mhm, sweet
12:48 🔗 joepie91 and, in the case of IA's crawler, even DNS requests
12:48 🔗 joepie91 so a WARC viewer can fully recreate the response for a given URL, status code and headers and all
12:48 🔗 joepie91 by reading them out of the WARC fil
12:48 🔗 joepie91 file*
12:49 🔗 joepie91 that's why it's used by IA; it retains all the important metadata, whereas a simple .html file wouldn't
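To make the "big file of requests and responses" description concrete, here is a minimal sketch of a reader for an uncompressed WARC stream. It assumes the usual record layout (a `WARC/x.y` version line, header lines, a blank line, then exactly `Content-Length` bytes of payload) and ignores gzip, so it is an illustration rather than a replacement for a real WARC library. The function names are made up for this example.

```python
def iter_warc_records(stream):
    """Yield (warc_headers, payload_bytes) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # the payload length is stated up front, so no scanning is needed
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def split_http_response(payload):
    """Split a 'response' record payload into status line, headers, body."""
    head, _, body = payload.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    http_headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        http_headers[name.strip().lower()] = value.strip()
    return lines[0], http_headers, body
```

Because the original status line and headers sit inside the record, a viewer can replay the response as it was served, which is the point made above.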
12:49 🔗 Honno Yeah
12:49 🔗 Honno joepie91, so uh, how can I web scrape from this warc?
12:49 🔗 joepie91 Honno: 'web scrape' in what sense?
12:50 🔗 Honno joepie91, look for data in specific html elements of pages, ie "<div id="developer_name">", and store the text inside
12:50 🔗 Honno I can do it with live websites
12:50 🔗 joepie91 ahh
12:51 🔗 joepie91 Honno: well, two options
12:51 🔗 Honno But it takes so long to load stuff locally, it seems impractical with warcs
12:51 🔗 joepie91 Honno: the loading time is unrelated to it being a WARC
12:51 🔗 joepie91 lookup in WARCs is very quick if you have an index file
12:51 🔗 joepie91 as it contains the exact positions of every request
12:51 🔗 joepie91 but your options are basically
12:51 🔗 joepie91 1. use something like pywb, then scrape like a regular site
12:52 🔗 Honno Yeah I was trying 1, but alas the loading times
12:52 🔗 joepie91 2. use a WARC library, read out the WARC file directly and work from that (a bit faster, but also more work)
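The reason an index makes lookups quick can be sketched as follows. This assumes the common CDX layout, where the first line is a legend such as ` CDX N b a m s k r M S V g` and, in that legend, `a` is the original URL, `S` the compressed record size, `V` the byte offset and `g` the WARC file name. It also assumes each record is its own gzip member, as is usual for .warc.gz files. The function names are hypothetical.

```python
import gzip


def parse_cdx(lines):
    """Return one dict per CDX line, keyed by the letters in the legend."""
    legend = None
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX"):
            legend = line.split()[1:]  # e.g. ['N', 'b', 'a', ...]
            continue
        if legend is None:
            raise ValueError("CDX legend line not found")
        entries.append(dict(zip(legend, line.split())))
    return entries


def find_entry(entries, url):
    """Linear search by original URL; real tools binary-search the sorted key."""
    for entry in entries:
        if entry.get("a") == url:
            return entry
    return None


def read_record(warc_path, entry):
    """Seek straight to the record and decompress only that gzip member."""
    with open(warc_path, "rb") as fh:
        fh.seek(int(entry["V"]))
        return gzip.decompress(fh.read(int(entry["S"])))
```

Only `S` bytes are read no matter how large the WARC is, so a 50 GB file costs one seek per page rather than a full scan.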
12:52 🔗 * joepie91 wonders who develops pywb anyway
12:52 🔗 Honno WARC library joepie91? like IA's warc library?
12:53 🔗 joepie91 Honno: anything that reads WARC in your language of choice
12:53 🔗 joepie91 :P
12:53 🔗 Honno joepie91, I got this generated from one WARC right http://puu.sh/o0ENB/e5063f18a8.txt
12:53 🔗 joepie91 Honno: anyway, try filing a bug in pywb
12:53 🔗 joepie91 on*
12:53 🔗 joepie91 about the slowness
12:53 🔗 Honno but how do I like, get the content of pages?
12:53 🔗 joepie91 it might just be a bug
12:53 🔗 Honno mhm I will thanks
12:54 🔗 joepie91 Honno: I know more about WARC as a format than about the existing libraries for reading / writing it, so I'm probably not the best person to ask about how to work with it
12:54 🔗 joepie91 :P
12:54 🔗 Honno aight, thanks for all your help :)
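For the scraping question above, once the HTML body of a response has been pulled out of a WARC record, the text inside an element such as `<div id="developer_name">` can be collected with the standard library alone. This is a rough sketch with made-up class and function names; it tracks nesting depth, so well-formed markup is assumed.

```python
from html.parser import HTMLParser

# elements that never have a closing tag, so they must not affect depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementTextById(HTMLParser):
    """Collect the text inside the first-level element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of_element(html, target_id):
    parser = ElementTextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Working on the stored HTML directly avoids going through a replay server for every page, which is the "more work but faster" route mentioned above.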
13:01 🔗 RedType has joined #archiveteam-bs
13:18 🔗 metalcamp has joined #archiveteam-bs
13:21 🔗 joepie91 http://www.theverge.com/2016/3/10/11195370/hot-wheels-pc-restored-patriot-computer
13:22 🔗 Stilett0 is now known as Stiletto
14:22 🔗 ohhdemgir Watching videos like ~ https://youtu.be/2RHEaRlJedA?t=290 ~ thinking, 'that stuff should be archived....' -_-
14:37 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
14:44 🔗 Stiletto has joined #archiveteam-bs
15:06 🔗 Honno has quit IRC (Ping timeout: 492 seconds)
15:24 🔗 signius has quit IRC (Read error: Operation timed out)
15:32 🔗 underscor has joined #archiveteam-bs
15:33 🔗 bsmith093 has quit IRC (Ping timeout: 633 seconds)
15:39 🔗 underscor has quit IRC (http://www.mibbit.com ajax IRC Client)
15:40 🔗 undersco2 has joined #archiveteam-bs
15:56 🔗 marvinw is now known as ivan`
15:58 🔗 ivan` https://github.com/chfoo/wpull/issues/319 FYI all the WARCs made by grab-site and wpull (concurrency > 1) don't really work in pywb. might be worth looking into if someone is in a python bugfixing mood
15:58 🔗 ivan` bug submitter wants to know if there is a more working WARC reader, too. the open wayback repo on github didnt seem to have any docs
15:59 🔗 ivan` https://github.com/iipc/openwayback/wiki oh there it is
16:07 🔗 JesseW has joined #archiveteam-bs
16:10 🔗 Honno has joined #archiveteam-bs
16:11 🔗 Honno has quit IRC (Read error: Connection reset by peer)
16:11 🔗 Honno has joined #archiveteam-bs
16:23 🔗 bsmith093 has joined #archiveteam-bs
16:24 🔗 JetBalsa has quit IRC (Quit: - nbs-irc 2.39 - www.nbs-irc.net -)
16:30 🔗 Kazzy Anyone else running warriors through docker? trying to find out if i'm having gui issues or the warrior isn't taking jobs
16:32 🔗 signius has joined #archiveteam-bs
16:33 🔗 Kazzy nvm, chrome is bad
16:46 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
16:54 🔗 wyatt8750 is now known as wyatt8740
16:55 🔗 SimpBrain has quit IRC (Ping timeout: 246 seconds)
17:04 🔗 JW_work1 has joined #archiveteam-bs
17:09 🔗 chazchaz has quit IRC (Read error: Operation timed out)
17:10 🔗 chazchaz has joined #archiveteam-bs
17:11 🔗 SimpBrain has joined #archiveteam-bs
17:26 🔗 robink has quit IRC (Ping timeout: 260 seconds)
17:38 🔗 robink has joined #archiveteam-bs
18:04 🔗 joepie91 https://i.imgur.com/XaZdF6V.jpg
18:04 🔗 joepie91 light++
18:04 🔗 bwn has joined #archiveteam-bs
18:06 🔗 bwn has quit IRC (Client Quit)
18:06 🔗 bwn has joined #archiveteam-bs
18:21 🔗 bwn has quit IRC (Read error: Operation timed out)
18:35 🔗 joepie91 FYI
18:35 🔗 joepie91 buncha free books available only today: http://www.versobooks.com/blogs/2575-psst-downloading-isn-t-stealing-for-today
18:47 🔗 bwn has joined #archiveteam-bs
19:08 🔗 alfie joepie91: grab 'em for me? nowhere to keep them rn :P (feckin housemove)
19:22 🔗 schbirid has quit IRC (Quit: Leaving)
19:32 🔗 ikreymer has joined #archiveteam-bs
19:47 🔗 bsmith093 sqlite is being stupid, nothing i do actually returns anything
19:49 🔗 ikreymer ivan`: re: wpull warcs with concurrency > 1, i found the issue (kind of a stupid bug) and will have a fix for pywb soon
19:49 🔗 ikreymer ivan`: the issue is with the cdx creation, thanks for reporting it, will let you know when an update is out
19:52 🔗 ikreymer also, for anyone interested, after this bugfix release, the next release of pywb (and hopefully WebArchivePlayer) will support Python 3.3+ as well
19:58 🔗 JW_work1 bsmith093: semicolons?
19:59 🔗 bsmith093 JW_work1: literally the only thing i've managed to do is somehow tell sqlite3 to completely scrub the db file, so i'm redownloading it anyway.
20:00 🔗 bsmith093 2.5 hours to go
20:01 🔗 bsmith093 .schema returns nothing. but again i somehow managed to delete the file and replace it with 800 bytes of semi random sql.
20:01 🔗 JW_work1 has quit IRC (Read error: Operation timed out)
20:02 🔗 JW_work has joined #archiveteam-bs
20:03 🔗 bsmith093 any sql people here? i have a massive sqlite file, that i'm trying to read. .schema returns nothing, literally. and sqlite3 managed to overwrite the db with sql statements. 5gb, just gone.
20:03 🔗 JW_work :-(
20:04 🔗 bsmith093 JW_work: i don't get it.
20:05 🔗 bsmith093 and of course my eta is going UP
20:05 🔗 JW_work well, first you'll have to re-download the database (and probably keep a copy)
20:05 🔗 bsmith093 :(
20:05 🔗 bsmith093 on it
20:05 🔗 JW_work next, .schema doesn't need a semicolon after it, but all select statements *do*
20:06 🔗 bsmith093 all i did was, in the sqlite3 shell, .open metadata.sqlite
20:06 🔗 bsmith093 was that it?
20:06 🔗 JW_work yeah, that's … not right
20:06 🔗 JW_work pass the database in on the command line, i.e. sqlite3 metadata.sqlite
20:06 🔗 bsmith093 ah, thats where i screwed up.
20:07 🔗 bsmith093 still, you'd think open filename, means either create it or open it if it exists.
20:07 🔗 JW_work yeah, it does seem like that should work
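One way to make sure a multi-gigabyte download cannot be clobbered while poking at it is to open it read-only. The sketch below uses Python's sqlite3 module with a `file:` URI and `mode=ro`; the function names are made up, and it is offered as a precaution rather than a diagnosis of what went wrong above.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file read-only; never creates or modifies it."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(conn):
    """Rough equivalent of checking .schema: names of the tables present."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```

With `mode=ro`, a missing file raises an error instead of silently creating an empty database, and any write attempt fails with `sqlite3.OperationalError`.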
20:08 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
20:09 🔗 bsmith093 JW_work: can i do anything with a partial copy, or do i seriously have to wait 3 hours to grab the full thing?
20:09 🔗 JW_work IDK. you can try it
20:10 🔗 bsmith093 JW_work: disk image is malformed, so at least it's reading it.
20:10 🔗 bsmith093 oh well, ill just wait.
20:13 🔗 JW_work has quit IRC (Quit: Leaving.)
20:23 🔗 JW_work has joined #archiveteam-bs
20:25 🔗 joepie91 looks like Reddit got an NSL
20:25 🔗 ikreymer has quit IRC (Quit: Page closed)
20:33 🔗 JW_work has quit IRC (Quit: Leaving.)
20:34 🔗 alfie joepie91: yeah, just saw your link in #archivebot :/
20:34 🔗 JW_work has joined #archiveteam-bs
20:36 🔗 alfie joepie91: i mean, i'm not surprised, but... :/
20:47 🔗 JW_work has quit IRC (Quit: Leaving.)
20:49 🔗 JW_work has joined #archiveteam-bs
20:49 🔗 RedType actually joepie91 reddit has decided not to comment on their removal of warrant canaries
20:50 🔗 alfie RedType: reddit has decided to comment on their lack of comment.
20:50 🔗 RedType damn spez's comment on treading a fine line
20:50 🔗 RedType Even with the canaries, we're treading a fine line. The whole thing is icky, which is why we joined Twitter in pushing back.
20:50 🔗 RedType it sounds like it answers the comment above, /but it doesnt actually/
20:50 🔗 RedType i love it
20:51 🔗 alfie RedType: https://np.reddit.com/r/announcements/comments/4cqyia/for_your_reading_pleasure_our_2015_transparency/d1kpn4k?context=4 is the most important comment, IMO
20:51 🔗 joepie91 so, for those using DigitalOcean: https://twitter.com/joepie91/status/715642213129175040
20:51 🔗 RedType yeah, that's what i was referring to
20:51 🔗 alfie joepie91: glaaad, i migraaaateeeeedddd
20:51 🔗 RedType commenting on not commenting on their removal of the warrant canary
20:51 🔗 JW_work has quit IRC (Client Quit)
20:52 🔗 alfie RedType: the removal of the canary is pretty fuckin conclusive, though
20:53 🔗 xmc joepie91: a whole penny, wow
20:53 🔗 JW_work has joined #archiveteam-bs
21:05 🔗 joepie91 xmc: it's not about the penny
21:05 🔗 joepie91 it's about the fact that it's SLA credit
21:05 🔗 joepie91 if it was expired for me, it will almost certainly be expired for everybody
21:05 🔗 joepie91 including people who have a TON of SLA credits
21:05 🔗 joepie91 this is just not okay
21:09 🔗 RichardG has quit IRC (Read error: Operation timed out)
21:09 🔗 RichardG has joined #archiveteam-bs
21:21 🔗 Kazzy hm
21:22 🔗 Kazzy i got the $100 credit from github student kit a while back, haven't received an email about that expiring yet though
21:22 🔗 Kazzy ..nvm just unlocked my phone and got the gmail notification
21:25 🔗 alfie Kazzy: ooh, 100USD, that could buy you... fuck all, DO are expensive as balls :P
21:26 🔗 Kazzy at $5/mo that lasts a very long time!, still have $16 left of it now
21:27 🔗 alfie Kazzy: yeah, but... cirrus.alfiepates.me costs me about £22 a year, so :P
21:30 🔗 BlueMaxim has joined #archiveteam-bs
21:33 🔗 Kazzy I dropped mine after a year, a .com is like £8/year so no point keeping the .me really
21:33 🔗 alfie Kazzy: as in, the server behind it :P
21:33 🔗 alfie (yes, i use FQDNs in general conversation. yes, you should too ;) )
21:33 🔗 alfie yeah, i plan to migrate alfiepates.me to alfiepates.com
21:33 🔗 Kazzy ah, fair enough
21:33 🔗 alfie run both domains for a year, then stab alfiepates.me in the back
21:34 🔗 Kazzy £22/yr surely doesn't get you much, 512mb/1cpu at some cheap host?
21:34 🔗 Kazzy ah, ramnode
21:34 🔗 alfie yeup, ramnode, of course :P
21:34 🔗 alfie cheap, decent enough for alfiepates.me and the other things i run on it
21:35 🔗 alfie mail.alfiepates.me is on a £8/m ramnode box (spam/AV is apparently memory hungry)
21:35 🔗 alfie yes, i run my own email. :P
21:37 🔗 HCross tbh, I wouldn't use fancy names for servers
21:37 🔗 HCross I use WhatTheThingDoes.domain.tld
21:38 🔗 HCross so storage.harrycross.me is storage. newsgrabber.harrycross.me is newsgrabbing etc etc
21:38 🔗 alfie HCross: all my servers do all sorts
21:38 🔗 joepie91 ^
21:38 🔗 alfie whereas, like, LP-AP1.networktld is obviously this laptop, etc
21:38 🔗 HCross ah, I just set up multiple records pointing at the same thing, makes it easier to get at what I want
21:38 🔗 alfie HCross: I do that too :P cname is a wonderful thing
21:38 🔗 Kazzy all my vm's at home are vaguely named correctly
21:39 🔗 alfie the server itself gets a hostname, then I cname the other stuff to it
21:39 🔗 xmc my computers are named randomly
21:39 🔗 HCross although I use GApps for mail, and some fancy anycast DNS setup
21:39 🔗 Kazzy although sickrage turned into /everything to do with content acquisition ever/
21:39 🔗 joepie91 my services and servers have a many-to-many relationship
21:39 🔗 joepie91 many servers run many services
21:39 🔗 joepie91 so the only reasonable naming scheme is unique names for each server
21:39 🔗 joepie91 that are easily identifiable
21:39 🔗 joepie91 because tasks can be spread across servers
21:39 🔗 joepie91 :p
21:40 🔗 alfie so, like, "cirrus" is that box, "stratus" is my other VPS box, "alexandria" is my storage server, and so on.
21:40 🔗 zino My actual machines at home are named n1 through n12. VMs get descriptive names.
21:40 🔗 HCross atm, I need. ThisThingIsProbablyUsingALotOfCPU.domain.tld :P
21:40 🔗 joepie91 oh man
21:40 🔗 joepie91 I am so happy with my new lighting
21:40 🔗 joepie91 :D
21:40 🔗 zino :-D
21:40 🔗 alfie joepie91: can you actually see now?
21:40 🔗 joepie91 I now have daylight-level illumination
21:41 🔗 Kazzy shoulda got some swish hue lights, joepie91
21:41 🔗 joepie91 3x20W LED
21:41 🔗 Kazzy don't think i've touched a light switch in weeks
21:41 🔗 joepie91 Kazzy: um. no?
21:41 🔗 joepie91 "sorry, I can't turn on my lights, my light switch is updating"
21:41 🔗 joepie91 :P
21:41 🔗 alfie lol
21:42 🔗 xmc LEDs are great, i have a house of friends who never turn off the lights in their front room because they are basically free to leave on
21:42 🔗 Kazzy never happens :< the switch still works
21:42 🔗 alfie i am considering the LIFX bulbs, mind. was recommended them on freenode :P
21:42 🔗 joepie91 until it does
21:42 🔗 joepie91 xmc: 60W is still non-negligible
21:42 🔗 joepie91 :p
21:42 🔗 joepie91 also, picture: https://i.imgur.com/XaZdF6V.jpg
21:43 🔗 xmc joepie91: sure ... but when you consider the very low cost of power here + not stubbing your toe, it becomes pretty easy to justify
21:43 🔗 xmc not bad
21:43 🔗 joepie91 that's two out of three bras
21:43 🔗 joepie91 bars*
21:43 🔗 joepie91 they run alongside my wall
21:43 🔗 joepie91 with one of those stand-alone lamp switches
21:43 🔗 xmc power in my city is about $0.05 USD/kWh
21:43 🔗 joepie91 and a cable running along the wall
21:43 🔗 joepie91 because it's temporary
21:46 🔗 joepie91 pictures incoming...
21:47 🔗 joepie91 https://imgur.com/a/To8nP
21:47 🔗 joepie91 such professional
21:47 🔗 joepie91 :p
21:47 🔗 Kazzy slight tangent, anyone in the UK having issues getting onto microsoft.com? cc HCross
21:48 🔗 HCross2 Fine from my Virgin line
21:48 🔗 Sk2d has joined #archiveteam-bs
21:49 🔗 Kazzy zz, maybe the first time virgin's been better than BT at something? :)
21:50 🔗 HCross2 Fine from M247 in Manchester too
21:52 🔗 Kazzy you got colo there?
21:53 🔗 HCross2 Nah, just a VPS
21:54 🔗 Sk1d has quit IRC (hub.se irc.du.se)
21:54 🔗 Boppen has quit IRC (hub.se irc.du.se)
21:58 🔗 kvieta has quit IRC (Read error: Operation timed out)
21:58 🔗 antomati_ has joined #archiveteam-bs
21:58 🔗 swebb sets mode: +o antomati_
21:58 🔗 Stiletto is now known as Stilett0
21:59 🔗 bwn_ has joined #archiveteam-bs
21:59 🔗 alfie i ought to go sleep, night all
21:59 🔗 phuzion has quit IRC (Read error: Operation timed out)
21:59 🔗 antomatic has quit IRC (Write error: Broken pipe)
22:01 🔗 phuzion has joined #archiveteam-bs
22:02 🔗 beardicus has quit IRC (Read error: Operation timed out)
22:02 🔗 bwn has quit IRC (Read error: Operation timed out)
22:02 🔗 ivan` has quit IRC (Ping timeout: 635 seconds)
22:04 🔗 Honno has quit IRC (Ping timeout: 492 seconds)
22:04 🔗 beardicus has joined #archiveteam-bs
22:05 🔗 lysobit has quit IRC (Read error: Operation timed out)
22:09 🔗 Sk2d is now known as Sk1d
22:10 🔗 GLaDOS has quit IRC (Ping timeout: 633 seconds)
22:12 🔗 GLaDOS has joined #archiveteam-bs
22:12 🔗 lysobit has joined #archiveteam-bs
22:12 🔗 Stilett0 has quit IRC (Ping timeout: 260 seconds)
22:15 🔗 godane you guys may be getting some old web-based radio shows
22:15 🔗 godane one of them is called Web Talk Guys
22:16 🔗 godane i'm getting shows going back to 2002
22:16 🔗 Jonimus has quit IRC (Ping timeout: 633 seconds)
22:19 🔗 kvieta has joined #archiveteam-bs
22:19 🔗 kvieta has quit IRC (Excess Flood)
22:20 🔗 Jonimus has joined #archiveteam-bs
22:20 🔗 swebb sets mode: +o Jonimus
22:21 🔗 marvinw has joined #archiveteam-bs
22:26 🔗 kvieta has joined #archiveteam-bs
22:26 🔗 kvieta has quit IRC (Excess Flood)
22:27 🔗 kvieta has joined #archiveteam-bs
22:35 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
22:35 🔗 BlueMaxim has joined #archiveteam-bs
22:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:48 🔗 dashcloud has joined #archiveteam-bs
23:11 🔗 undersco2 has quit IRC (Leaving)
23:32 🔗 dashcloud has quit IRC (Read error: Operation timed out)
23:35 🔗 dashcloud has joined #archiveteam-bs
23:47 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
23:48 🔗 BlueMaxim has joined #archiveteam-bs
