#archiveteam-bs 2017-12-21,Thu

zinojrwr, not that it helps, but I'm sending all my symphaty and good thoughts. [00:05]
jrwrIm going to hack up some cgi-bin
and compile php7 on this box
wouldn't be the first time I've worked around shit like this (LOOKING AT YOU, FERAL)
[00:05]
........... (idle for 51mn)
so I turned on file based cache
it should help /some/
I'm working with apache some to get this working its being a PITA, I suspect a apache module doing this
[00:56]
***BnAboyZ has joined #archiveteam-bs [01:08]
.... (idle for 19mn)
Somebody2BTW, regarding WARC uploads going into the Wayback Machine -- I've now gotten confirmation that it is still a trusted-uploaders-only process (which isn't surprising).
JAA is trusted, and ivan as well, presumably.
[01:27]
jrwrSo SketchCow, I'm stuck, the PHP is too old to update mediawiki, Apache is not behaving with the CGI override due to mod_security being forced on, and overall the entire account is limited to 7 (confirmed) connections concurrent (thats whats causing the resource limit pages currently)
I've added the static file cache and it is helping
[01:42]
SketchCowThere'll be some roughness as we figure out what to do. [01:44]
jrwrYa
its using 2.6 as its kernel....
[01:44]
SketchCowBut if I have intelligent requests for the host, I'm sure they can help. [01:44]
jrwrOk, So the main one is can I have my limits increased for the number of CGI scripts run at one time. I keep getting resource limit errors on top of this error log: [Wed Dec 20 20:41:37 2017] [error] mod_hostinglimits:Error on LVE enter: LVE(527) HANDLER(application/x-httpd-php5) HOSTNAME(archiveteam.org) URL(/index.php) TID(318310) errno (7) Read more: http://e.cloudlinux.com/MHL-E2BIG min_uid (0) [01:46]
SketchCowWell, assemble them all in one place for me.
I mean after a day of looking it COMPLETELY over
And then I'll bring it to TQ and see what they thing
think
[01:46]
jrwrOk [01:47]
SketchCowNo sense in piecemealing
Also, let me glance at the cpanel
[01:47]
jrwrOk
Ya, its pretty much those two issues I have with it, I'm compiling them into a google sheet for tracking
[01:49]
jacketchaWow. I was having problems compiling WARC files in javascript and was going to ask if there was a preexisting API for something like that, but I can barely even read what you guys are saying. [01:58]
jrwrIm poking the poor wiki very hard
whats up jacketcha
WARC reading in javascript, hrm
[01:58]
jacketchayeah [01:59]
jrwrWell, the WARC standard is pretty simple overall
its all about the indexing and lookups that make it fast and JavaScript is not that great at it.
https://www.npmjs.com/package/node-warc
have some node
its /javascript/
[01:59]
jacketchathanks
I was planning to add it to my chrome extension
[02:00]
jrwrah
im logging off for now SketchCow, See ya in the morning
[02:01]
***robink has quit IRC (Ping timeout: 246 seconds) [02:13]
jacketchaok, so can somebody explain to me how warc files work
sorry for being dumb
i whonestly have no idea
*honestly
[02:15]
Froggingdo you have a more specific question? [02:16]
jacketchaHow is the data structured? I am going to assume that it isn't just copying in the HTML source code after the headers are added. [02:20]
FroggingIt stores the full response headers and body
That includes responses containing binary data, HTML, CSS, plain text, whatever
[02:20]
jacketchaIs there any specific order to that? [02:22]
FroggingRecords can be in any order AFAIK [02:24]
jacketchaGreat, so it'll work just fine with asynchronous saving. Thanks, that was actually really helpful. [02:25]
Frogginghttp://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [02:28]
jacketchathanks! [02:28]
Froggingthere's WARC 1.1 (which is the latest version) linked there too [02:29]
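A minimal sketch of what Frogging describes, reading response records (which can appear in any order) out of a WARC. It assumes Python's warcio package, which is not mentioned in the chat (the chat only links node-warc for JavaScript):

```python
# pip install warcio   (an assumption; any WARC-reading library works the same way)
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            body = record.content_stream().read()  # response body: HTML, CSS, binary, whatever
            print(url, status, len(body))
```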
***bithippo has joined #archiveteam-bs [02:38]
bithippoThinking about grabbing Imgur. All of it. Anything I should keep in mind prior to putting it in cold storage?
(iterating over ever permutation of image urls based on how Imgur generates image urls)
[02:43]
jacketchaIs the way Imgur generates urls known?
Better question, is the source of data for the RNG Imgur uses known?
[02:47]
***robink has joined #archiveteam-bs [02:48]
bithippohttps://blog.imgur.com/2013/01/18/more-characters-in-filenames/
"Choosing 5 characters from 26 lowercase letters + 26 uppercase letters + 10 numerical digests, leaves us with 916,132,832 possible combinations (625). Upgrading to 7 characters gives us 3,521,614,606,208 (3.52 trillion) possibilities."
404->check back in the future, 200->WARC gz
Etag header on a request is the MD5 of the image
[02:50]
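A sketch of the probe-and-save loop bithippo describes (404 means check back later, 200 means save it, ETag is reportedly the image MD5). The i.imgur.com URL pattern, the .jpg extension, and the delay are assumptions, not from the chat:

```python
import itertools
import string
import time

import requests

ALPHABET = string.ascii_letters + string.digits    # 62 characters, per the blog post quoted above
IMAGE_URL = "https://i.imgur.com/{}.jpg"            # assumed direct-image URL pattern

def probe(image_id):
    # 200 -> grab it (the chat notes the ETag header is the MD5 of the image);
    # 404 -> come back later. In practice Imgur may redirect removed IDs rather
    # than return a plain 404, so this check is a simplification.
    r = requests.head(IMAGE_URL.format(image_id), timeout=30)
    return r.headers.get("ETag") if r.status_code == 200 else None

for combo in itertools.product(ALPHABET, repeat=7):  # 62**7 ≈ 3.52 trillion candidate IDs
    image_id = "".join(combo)
    etag = probe(image_id)
    if etag:
        print(image_id, etag)
    time.sleep(0.5)  # throttling and IP bans, not bandwidth, are the real constraint
```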
jacketchaThat is still 3 trillion web requests [02:53]
bithippo¯\_(ツ)_/¯
Alternatives? Besides waiting until Imgur runs out of runway and then its more pressing :/
(ie Twitpic 2.0)
My only frustration is that the URL isn't deterministic from a hash of the image, so it's possible an image exists, is deleted, and then replaced without any way to know
[02:53]
jacketchaUnless it was already archived
Look, it's a good idea in practice, but here's the thing
[02:58]
bithippoAhh, truth [02:58]
jacketchaImgur gets around 17.3611111111 new images per second
That would place it at around 2687000000 images today
[02:58]
bithippoLe sigh. [03:01]
jacketchaIt gets worse
That means you have a roughly 0.076300228743465619423705658937702969914635168416766807608280491104220976146849283303277891127907878357259164917744009861455363906114286574925748247085571170136781627322552461470824865159611513732687195% chance of getting an image every time you send a request
Don't be fooled by the high precision, the accuracy of your plan is very low.
But, there is a way to make it higher.
Much higher, in fact
If you can figure out the source of the data that Imgur uses for its random number generation algorithms, you can at least grab the newest images
[03:01]
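The hit rate jacketcha quotes follows directly from those two figures (his image-count estimate divided by the 7-character ID space):

```python
existing_images = 2687000000.0   # jacketcha's estimate of images on Imgur at the time
id_space = 62 ** 7               # 3,521,614,606,208 possible 7-character IDs
print(existing_images / id_space)  # ≈ 0.000763, i.e. roughly a 0.076% chance per random request
```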
bithippoSounds workable using their API to get latest image paths and then working backwards [03:06]
jacketchapossibly
But if it is randomly generated, even pseudorandomly generated, you're still screwed.
or
you know
you could just email them
and ask
[03:07]
bithippo"Hi. I will take one Imgur pls. Will be over with hard drives shortly."
Appreciate the input!
[03:10]
jacketchaNo problem [03:11]
hold up
If a Warrior or ArchiveBot finds a WARC file, is it added to the collection of WARC files or is it added into the WARC file of the site it is located on?
[03:19]
godanei'm up to 18k items this month
this year has been slower then last year
i just hope i can get it past 100k for the year
https://archive.org/details/@chris85?&and[]=addeddate:2017
it 97,682 so far
[03:23]
jacketchawoah
maybe I should start counting mine
and putting it in actual warc files
[03:24]
***pizzaiolo has quit IRC (Remote host closed the connection) [03:33]
robink has quit IRC (Ping timeout: 246 seconds) [03:45]
Somebody2jacketcha: if a HTTP request returns a WARC file, and that HTTP request and response is being stored into a WARC file,
then, yes, you'll have nested WARC-formatted data
AFAIK, no WARC-recording tool will automatically un-nest it (and that would probably not be a good idea in any case)
[03:48]
jacketchawait
so that means that there possibly could be an archive of the entire internet floating around the wayback machine somewhere, but nobody would ever know because it was nested.
[03:55]
SketchCowThis is.....
See, this is one of the things
You are asking... well, you're asking for a college course in how WARC works
It's sort of on topic and sort of off
It's certainly sucking all the air out of the room
It's nice to see people talking
[04:01]
jacketchaso nested warc files are basically politics
got it
[04:01]
***bithippo has quit IRC (Ping timeout: 260 seconds) [04:11]
SketchCowNo.
You're wandering into a welding shop going "So.... why cold welds"
[04:14]
jacketchathat seems very accurate [04:16]
***bithippo has joined #archiveteam-bs [04:20]
kyounko has joined #archiveteam-bs [04:33]
..... (idle for 22mn)
qw3rty117 has joined #archiveteam-bs [04:55]
qw3rty116 has quit IRC (Read error: Operation timed out) [05:01]
..... (idle for 21mn)
bithippo has quit IRC (Quit: Page closed) [05:22]
Stiletto has quit IRC (Read error: Operation timed out) [05:29]
Stilett0 has joined #archiveteam-bs
BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[05:34]
...... (idle for 26mn)
wp494 has quit IRC (Ping timeout: 250 seconds)
wp494 has joined #archiveteam-bs
[06:02]
zgrant has left
wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
wp494 has joined #archiveteam-bs
[06:12]
kimmer1 has joined #archiveteam-bs
midas2 has quit IRC (Ping timeout: 1212 seconds)
kimmer12 has quit IRC (Ping timeout: 633 seconds)
[06:27]
midas2 has joined #archiveteam-bs [06:47]
........ (idle for 37mn)
wp494_ has joined #archiveteam-bs [07:24]
ZexaronS has quit IRC (Read error: Connection reset by peer)
ZexaronS has joined #archiveteam-bs
wp494 has quit IRC (Read error: Operation timed out)
[07:29]
wp494_ has quit IRC (Ping timeout: 633 seconds)
odemg has quit IRC (Read error: Operation timed out)
[07:39]
odemg has joined #archiveteam-bs [07:46]
wp494 has joined #archiveteam-bs [07:53]
robink has joined #archiveteam-bs [08:03]
jacketchaI wonder how many times jquery has been archived
By this point, there must be at least a hundred copies made of it each week
[08:06]
............. (idle for 1h2mn)
***jacketcha has quit IRC (Read error: Connection reset by peer)
jacketcha has joined #archiveteam-bs
[09:09]
Mateon1 has quit IRC (Ping timeout: 245 seconds)
Mateon1 has joined #archiveteam-bs
[09:16]
.... (idle for 19mn)
jrwrSWEET BABY JESUS
someone, I got php7 to run
on this holy shit old host
[09:35]
PurpleSymIs this a dedicated server, jrwr? [09:39]
jrwrnot even close
its a shared host running linux 2.6 on a old cpanel
running a god old apache + php 5.3
I override the mod_security and mod_suphp all to fux to get PHP scripts to run with a custom statically linked php binary I made
[09:39]
PurpleSymWtf? 2.6 EOL’d years ago. [09:41]
jrwrI'm making do with what I have
its where its staying
I'm making its own little world on this webhost
jrwr compiles memcached
now
comes the fun part
I'm going to update mediawiki
[09:43]
jacketchagood luck
i can't even update windows without doing a ritual to please the tech gods
[09:46]
jrwrthis is dark magic
php does not like doing this
I have a plan
to compile php with memcached, and then run a little memcached server so mediawiki can cache objects
[09:50]
jacketchanah
you don't even need php
just do what I do and use 437 IFTTT applets as your server
with a touch of github pages
[09:53]
jrwrlol
this is the archiveteam wiki I'm working on
[09:54]
jacketchaIs the ArchiveTeam wiki archived? [09:55]
IglooMostly :p [09:55]
jacketchaYou know what I want to try? My school has unlimited storage on all google accounts under its organization. I wonder how far they would let me push that. [09:57]
jrwrIts staying where it is for now
for ~reasons~
[09:58]
jacketchais it because you missed a semicolon somewhere but there isn't a really good php linter yet
oh no
i just remembered i have midterms
gn
[09:58]
.... (idle for 18mn)
jrwroh man
thats a ton better
[10:19]
***jacketcha has quit IRC (Remote host closed the connection) [10:19]
jrwrArchive team is now running on mediawiki 1.30.0 [10:19]
***jacketcha has joined #archiveteam-bs
jacketcha has quit IRC (Remote host closed the connection)
[10:21]
jacket has joined #archiveteam-bs [10:29]
fie has joined #archiveteam-bs [10:43]
fie has quit IRC (Read error: Connection reset by peer) [10:48]
.... (idle for 19mn)
pizzaiolo has joined #archiveteam-bs [11:07]
jrwrIgloo: better huh? [11:14]
Igloojrwr: miles and miles [11:16]
jrwrYa
Response times are sub 200ms
Before they were 1400ms
[11:16]
JAAjrwr: Well done! Much, much better. <3 [11:20]
jrwrThanks [11:20]
JAASomebody2: Ah, makes sense. Thanks for checking. [11:21]
jrwrI woke up from a strange dream at 3am (flying a airplane and somewhat crashing it)
And then had a brainwave on how to get php working correctly
Been up since then, work is going to be hell today
[11:21]
.......... (idle for 46mn)
***BlueMaxim has quit IRC (Leaving) [12:08]
....... (idle for 32mn)
IglooOk, So it looks like we can iterate through the numbers [12:40]
JAAFor user profiles, maybe. For characters, no way. [12:41]
Igloo212 million users? Unlikely [12:41]
JAABut it should be fairly simple to scrape them from https://www.saintsrow.com/community/characters/mostrecent
The question is, how do we get the actual characters (not just the images)?
[12:42]
Smileyis there a log of what has gone before?
I have a 'archiveteam' account registered along with my personal one
not sure why
maybe i just suggested scraping for SR3
[12:42]
Igloohttps://www.saintsrow.com/users/show/212300001 appears to be the lowest. https://www.saintsrow.com/users/show/213056573 latest
2300001 3056573 ~705,000 user profiles?
Those are easy
[12:46]
........ (idle for 35mn)
jrwrSketchCow: email fixed
confirmed working with password resets being sent to a gmail account
[13:22]
SketchCowGreat [13:23]
jrwrhttps://usercontent.irccloud-cdn.com/file/brVInVWJ/image.png
you can see when I dropped in the php changes
[13:32]
..... (idle for 20mn)
joepie91: it doesnt work like that
this whole box is from 2011
[13:52]
joepie91ah, just an old cpanel then that doesn't support it, or? [13:53]
jrwrya
I have methods and apis
im patching it in
[14:02]
***icedice has joined #archiveteam-bs [14:05]
icedice has quit IRC (Ping timeout: 250 seconds) [14:13]
JAAjrwr: LOL, that graph is beautiful! [14:13]
.... (idle for 15mn)
jrwrThanks
its going up on my wall
[14:28]
........... (idle for 50mn)
JAA: Igloo
guess what
SSL BITCHES
[15:18]
JAAYiss [15:19]
jrwrhttps://www.ssllabs.com/ssltest/analyze.html?d=archiveteam.org
fucking A rating!
[15:19]
MrRadar2:D :D :D https://i.imgur.com/CloHYLR.png [15:20]
jrwrwith Strict Transport Security (HSTS) on (left it pretty short just in case) [15:20]
IglooJust need a redirect now ;-) [15:20]
JAA^ [15:20]
jrwrna, not going to enforce it
HSTS is enough
[15:21]
***zgrant has joined #archiveteam-bs [15:23]
jrwrfuck it
done
SketchCow: SSL is now installed
anything else?
[15:24]
Igloo:)
hehe, all my home stuff with LE gets A rating too
Which is bonza
[15:24]
SketchCowI think that's all I can think of
Someone proposed some sort of theme upgrade
But it all seems just fine to me now.
[15:26]
jrwrah
its fine
I /might/ get bored and add in a new editor but the new editor requires all kinds of crazy
[15:31]
SketchCowIf people come up with things, we'll consider them now that it's possible
Generally, someone complaining they can't work on the Wiki because they miss a gimgaw is focused on the wrong things.
[15:32]
jrwrYa
I am using the file based cache built into mw
so bots and stuff all get served static pages
I feel like I just refurbed my 1984 Chrysler lebaron convertible (I own one) https://drive.google.com/file/d/1AQqXNiluKTk5xuCYStfVexiH1LLUOYaLLQ/view?usp=sharing
[15:33]
IglooNice car [15:43]
jrwr900$
runs great, and talks to you
https://www.youtube.com/watch?v=nGuRS-L2BN0
[15:45]
I love the old DEC speech Synths
sound better then software ones
[15:53]
........ (idle for 35mn)
godaneso another box of tapes i bought is shipped [16:28]
..... (idle for 22mn)
***dd0a13f37 has joined #archiveteam-bs [16:50]
godaneso this happened: http://mashable.com/2017/12/20/sesame-street-irc-macarthur-grant-refugee-middle-east/#fIS9la5_bSq7 [16:53]
...... (idle for 25mn)
***schbirid has joined #archiveteam-bs
jacket has quit IRC (Read error: Connection reset by peer)
jacket has joined #archiveteam-bs
[17:18]
dd0a13f37aria2c is a mystery
if I have it use 1 connection or 10, I still get about 2r/s
if I split it up across 6 command windows, 12r/s
might have to do with the fact that it's split across multiple IPs though
Anyone know a good tool to do this automatically? Split up http requests over multiple proxies?
[17:23]
....... (idle for 34mn)
***bithippo has joined #archiveteam-bs [17:59]
......... (idle for 43mn)
ola_norskis there some sort of code available to look at on how IA get urls from warcs?
or does it convert first, somehow, then do _that_ ?
i was kind of expecting warcs to be a kind of archive with an index, not all data being in a single file :/
e.g containing something i could open in gedit etc..
[18:42]
dd0a13f37iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/index.html [18:50]
bithippohttps://github.com/recrm/ArchiveTools/wiki/warc-extractor [18:51]
dd0a13f37Can someone help me?
I'm trying to archive a site with aria2c. When using multiple concurrent connections, I get about 2r/s. The speed is around 50 kbit, despite my internet connection being much faster.
When using multiple instances over multiple IPs, it's not much better. The individual speeds for some drop down to 5kbit.
[18:52]
bithippoRecommend using https://github.com/ludios/grab-site to archive instead [18:54]
dd0a13f37I am currently running 50 instances of aria2 with 20 concurrent connections each. I get 5r/s total. [18:54]
bithippoIf you need a WARC file, request/response headers, etc. [18:54]
dd0a13f37This is abysmally low, what gives? [18:54]
bithippoYou are most likely being throttled by IP [18:54]
dd0a13f37But I'm spreading it over 50 different IPs. [18:54]
bithippo(even if distributing across multiple IPs)
Sliding window of bytes in the webserver config. Initial requests are fast, subsequent requests slow down if you try to firehose
What's the hostname?
[18:54]
dd0a13f37ratsit.se
Or my hostname?
[18:55]
bithippoNope, site hostname. Checking something. [18:55]
dd0a13f37So how can it throttle them to 2-5 KiB/s when using 50 different IPs, but 50 KiB/s when using 1? [18:55]
bithippoAre these all anonymous web requests? Or are you signed in/setting a cookie to be logged in to fetch data? [18:57]
dd0a13f37These are all anonymous requests from tor exit nodes. No cookies are stored. [18:58]
bithippoCould be throttling by tor IPs. I did that at my last gig on our Nginx servers.
(Tor requests were notoriously bad scraping actors in our case)
[18:58]
dd0a13f37But I used tor IPs before too. And those were at 50KiB/s, not 2-5KiB/s. [18:58]
bithippoI don't have a good answer unfortunately :/ Lots of variables that could be causing it. What's the purpose of using Tor to perform the requests? [18:59]
dd0a13f37The only logical explanation is my connection being the bottleneck, but that would put it at around 2 mbit, which is way too slow
Because I don't want to get in any trouble for the scraping, and they could ban my IP
Now it jumped up to 12 resp/s, which was my previous peak when using 6 different IPs.
[18:59]
bithippoIs a cloud provider VM out of the question with a slow concurrency rate?
2 requests per second, say.
[19:01]
dd0a13f37It could be on their end too.
I could just leave the computer on over night, but I would prefer not to pay any money.
Could it be they just serve 12 connections at the same time?
Now down to 7 again. Sure is a mystery what is going on...
And now back up to 14. At this speed, it will take 6 hours, which is slow but acceptable.
[19:01]
bithippoI can rip it for you and provide a torrent file when I'm done. [19:07]
dd0a13f37It went up to 34 now.
How? With grab-site?
They might block the IP being used in that case
[19:08]
bithippo10 second wait between requests
It'll take a while, but it'll finish eventually.
Have to head out, leave me a note here if that's a plan
[19:09]
dd0a13f3710 seconds would take half a year for the whole site, and it would change during the time, so I don't think that's a good idea
But if it continues to be this fast then it should be done in a few hours, which is good.
[19:11]
Well, the only logical explanation is some advanced throttling algorithm in place. I can't find any other explanation for why it's so slow. [19:21]
https://pastebin.com/VkCS1yJ1 It apparently got faster over time, with a peak of 46 resp/s, before slowing down. [19:27]
.... (idle for 18mn)
ola_norskany sqlite geninouses savvy to very basic sqlite reational database structure in here who wouldn't mind if ask some questions?
relational*
[19:45]
dd0a13f37I have a basic knowledge, shoot [19:46]
ola_norskdd0a13f37: thanks. if you please take a look at the SQL here, (just page search for 'sqlqueries') https://github.com/DuckHP/twario-warrior-tool/blob/master/src/twario/sqlitetwario.py
i'm sure that sql could be done better, and i think you'll agree. Sadly my sql is quite shit
to optimize storage etc..i mean
and speed etc
[19:47]
dd0a13f37The schema? [19:50]
ola_norskaye [19:50]
dd0a13f37You can add a constraint for TweetUserId so it has to have a corresponding entry in Users
And Users should either have id INTEGER PRIMARY KEY, or TweetUserID should be a username
[19:50]
ola_norskyeah been thinking that so i made a 'users' table
ok
[19:51]
dd0a13f37Display name isn't stable, but it might be overkill to provision for that
search for foreign key constraint
[19:52]
ola_norskaye, i'm not even sure yet if 'tweep' reads display names [19:52]
dd0a13f37https://sqlite.org/foreignkeys.html http://www.sqlitetutorial.net/sqlite-foreign-key/
Well, if you want to archive avatars etc it might be neat to have. You could have three tables, but it might be overkill
tweets - tweet text, date, username
users - username (not unique), date, avatar, displayname
or wait, that makes two
[19:52]
***Valentine has quit IRC (Read error: Connection reset by peer) [19:54]
ola_norskavatar might be doable [19:54]
dd0a13f37And then just do SELECT * FROM users WHERE username = ... LIMIT 1 [19:54]
ola_norskty [19:54]
dd0a13f37not sure about the syntax [19:55]
ola_norskthe requests sql i can figure out i think, but i suck at schema/structure :/ [19:58]
***BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io) [19:58]
ola_norski'll check out that link you posted. thanks [19:59]
dd0a13f37Well, have one users table that for each username can have multiple entries (e.g. if they change their avatar you get a new entry with same username)
And one tweets table, since they are immutable
[20:00]
***Valentine has joined #archiveteam-bs [20:00]
ola_norske.g if a tweet is identical, just have a 'content' table perhaps, and refereance that? [20:03]
dd0a13f37If you're ever building your own scraper, it seems like mobile.twitter.com is more pleasant to work with
view-source:https://twitter.com/jack view-source:https://mobile.twitter.com/jack
[20:03]
ola_norski'm just "re-doing" a tool called 'tweep' [20:03]
dd0a13f37As in, modifying it? [20:04]
ola_norskaye, this: https://github.com/haccer/tweep ..it seems to work quite well, but could use some tweaking [20:05]
dd0a13f37If you want a complete archive, you could probably crawl pretty nicely. Start off by the timeline, then see what accounts and hashtags you find. Then traverse those accounts and hashtags, see what accounts and hashtags you find.
>The --fruit feature will display Tweets that might contain sensitive info
uh
[20:05]
ola_norskaye, have not tested that yet, but i've been thinking of removing it
basically 'user' and 'search words' is my focus
not exactly too keen on archiving 'doxing' tweets
[20:07]
dd0a13f37Well, why bother taking it out? Just don't use it, or remove all documentation references to it if you're really concerned. [20:08]
ola_norskaye [20:08]
dd0a13f37I think mobile.twitter.com is better. It shows 30 tweets/page instead of 20, and the pages are faster to download [20:09]
hook54321JAA: Are you still grabbing the Catalonia cameras that update every 5 minutes or so? [20:10]
ola_norskdd0a13f37: it seems to require a signin/account [20:10]
JAAhook54321: Yeah
I think so, at least.
Let me check.
[20:10]
hook54321lol [20:10]
ola_norskdd0a13f37: i deliberaly made myself banned from twitter :/ [20:10]
hook54321I need to start recording the cameras I was recording again [20:10]
JAAYep, it's still grabbing... something. [20:11]
dd0a13f37mobile.twitter.com doesn't need an account [20:11]
JAAHaven't looked at the content in a long time though. [20:11]
dd0a13f37https://mobile.twitter.com/jack?max_id=938593014343024639 works just fine for me [20:11]
ola_norskdd0a13f37: so it's just because i'm using desktop browser then?
dd0a13f37: that link worked btw
dd0a13f37: doh, i got a "join today to see it all" when scrolling
[20:12]
hook54321JAA: Should I grab this whole youtube channel? https://www.youtube.com/user/gencat [20:13]
dd0a13f37I am using tor browser with JS disabled. [20:14]
ola_norskdd0a13f37: i think if 'twario/tweep' is made a bit less agressive, it wouldn't need to be 'torified' [20:16]
dd0a13f37https://mobile.twitter.com/jack?max_id=743833014343024639 I can go quite a bit back
Why castrate your perfectly working tweet scraping tool? Requests can use proxies, or multiple.
[20:16]
ola_norskdd0a13f37: with original 'tweep' it seemed to stop at half a year or so back in time at search word
they could, but it would eventually get noticed i think if it's running continuously :/
[20:17]
dd0a13f37different users go differently far back https://mobile.twitter.com/realDonaldTrump?max_id=793833014343024639 [20:18]
ola_norski don't mean users, but e.g one word [20:19]
dd0a13f37There are many tor exit nodes. [20:19]
ola_norskhow could a python script be _fully_ torifyed? If it could be done without using a virtual machine, that would be cool :D [20:21]
dd0a13f37torsocks python ./myscript [20:21]
ola_norskty [20:21]
dd0a13f37Or you can just have requests use a proxy
torsocks -i for guaranteed fresh ip
[20:22]
***BnAboyZ has joined #archiveteam-bs [20:24]
hook54321What collection should I upload that channel to? There's like 400 videos.... [20:27]
ola_norskdd0a13f37: will definetly test that. And i'm guessing just the tiny bit of extra time storing to an sqlitedb counts as tiny bit of it being nice-ifyed :D [20:28]
dd0a13f37The time the request takes will, unless you're using twisted/multithreading [20:29]
ola_norskdd0a13f37: the reason 'local capture time' column is in tweets i think i put in for exactly that purpose, since JAA pointed out that 'tweep' itself does not seem to be correct at keeping times
aye
[20:30]
dd0a13f37The mobile search url query string is ... interesting...
https://mobile.twitter.com/hashtag/EU?src=hash
https://mobile.twitter.com/search?q=EU&next_cursor=TWEET-943937901217370114-943937901217370114-BD1UO2FFu9QAAAAAAAAVfAAAAAcAAABWQABAAAAIAAAAAAAAQgAAAAAAAJAAAAAAAAAAABAAAAQAAAAAAAAAAiAAAQAAAAAAABAAAAAAAAAACBAAIAAAAAQAAAAAAIAAAAAACAAIAAAAAAAAAAAAAAACAAAhAAAAAAAAACAgAAAAAAAAAAAIAAAAAAAAAAAAAAAAgAAIAAAAAAFAAIIAAACCAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAEAAAAAACgCAAAAgAABwAAABAAAAAAAAAAIAAAAARAAEAAAAAAAA
AAAAAAIAAAAgAAAAAAAAAAAAAAAAACAAAABAAAAABAAAAAAAAQAAAAQAAEAABAAAEAAEAQAAAAAgAAAAAAAAAAAAAwACAAAAAAAAAAAAAAAAABQAAAAAAAAAAAAAAAAAACQAACAAAAAAAAAIAAQACAAAAFABAAAAAAAQkAAAEAAAAAAAAoAAAAAAAAAAAAAACAAAAAAAAAAAAAAAAIAAAAAICAAAAAAAAAAAAAAEAAAAAAEAAACAAAAAAAAAEAAAAAAAAAgAAAAAAQAEAAQAAAAAAAAABAUAAAEAAAAAAABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIAAAAAAAAAAIQAAAACACAQAAQAAIAAAAIAAAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQBAAAAwAAAAAAAAAAAAAAAAAA
AAAAAQAAAAAAAAAAAgAAAAEAAAAAAACABAAAAAAAAAAAAAAEAAQAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAQAAAAAAAAAAEAAAAAAAAAQAAAAAABABAAAAAAACAAAQAAAAAAAAAAAAAAAAAgAAAAAIAABACIAAAAAAAAAIAAAQAAAAAAAAA%3D%3D-R-0
(that was one link)
[20:31]
hook54321oh dear [20:31]
dd0a13f37the base64 encoded part is some kind of bitmask [20:31]
ola_norskmy eyes!
i think i've seen that garbled shit before, at tweep crashing :/
when adding 'loggin' module, that looks exactly like the output given on the line where it stopped
logging*
[20:31]
dd0a13f37hmm, strange
because there is no base64 encoding or anything of the sort in tweep
[20:34]
ola_norskmaybe i have the output..one sec
seems i've deleted the log, I'll risk trying to run the same command one more time. brb
[20:34]
jrwrgood news
I enabled the new editor toolbar in the wiki (cc SketchCow )
[20:38]
ola_norskdd0a13f37: could it be compressed stuff, like in 'header: gz' crap? [20:40]
dd0a13f37No, it's base64
run base64 -d | xxd, then paste it in
You'll see most of the bytes only have one bit set
[20:41]
SketchCowHurrah [20:42]
dd0a13f37Since it doesn't do anything if you change the numbers at the beginning (max id), the max_id parameter is in there too [20:42]
ola_norskdd0a13f37: all i know that mess of "AAAAAAAA" was the end of the log line when i last tested tweep. And also where it apprently failed. [20:42]
dd0a13f37Not at the beginning, since that's the same across requests [20:43]
ola_norski'm running the same command now, and will pastebin (when) it fails [20:43]
jrwrSketchCow: its snazzy [20:44]
ola_norskdd0a13f37: for all i know it might've been some nasty character(s) that did it [20:45]
jrwrmakes it a /little/ simpler to edit pages [20:46]
***bithippo has quit IRC (Quit: Page closed) [20:48]
BartoCH has quit IRC (Ping timeout: 260 seconds) [20:54]
dd0a13f37Oh, regular twitter has that same AAAAAAAAAAA mess, just not as the requested URL [20:59]
JAAI think it does. Load a page in your browser (with JS enabled), enable dev console, scroll to the bottom, check out the requests that happen in the background.
I think tweep just tries to imitate what the browser would do.
[21:01]
ola_norskhmm
it has not crashed yet here like last time, but seems like a lot of people love to tweet 'netneutrality' these days. So it's not even done with this month. I think last time it crashed at about this years month of may tweets
holy shit people have been tweeting 'netneutrality' lol
[21:02]
dd0a13f37Twitter gets 6k tweets/sec, with 20 tweets/request archiving this is in the realm of possibility [21:05]
ola_norskdd0a13f37: aye :D and using webarchive.io, or wget/curl with requests to web.archive.org/save/ is quite futile :D [21:06]
dd0a13f37You would need to do a few hundred requests per second. The problem is archiving all those avatars, if you saturate a 1gbit/s line you can afford to archive 20kbit avatars assuming no overhead or IP bans
But the avatars aren't very important, are they?
[21:07]
ola_norskreconstructing the links to the tweets is more important [21:07]
dd0a13f37That's possible too, all the info is in the HTML [21:08]
ola_norskand 'tweep' captures the id
aye
[21:08]
dd0a13f37Does tweep have a mode where it can just show you all the tweets being done? [21:08]
ola_norskit does by default [21:09]
dd0a13f37Without narrowing down to a hashtag? Does it get 100%? [21:09]
ola_norski don't know. It does a fuck of a lot of tweets though :D
if why it stopped could be worked out, i bet it could do 100%
[21:10]
jrwrfor the wikidump nerds At 04:00 on Friday. a copy of the wiki's XML + Images are uploaded to the IA
for good measure
[21:12]
ola_norskright now i'm just doing "python tweep -s 'netneutrality' > tweets.txt" ..to see if it eventually stops like last time. For all i know, piping to a textfile is what did it. [21:12]
dd0a13f37But can you just run python tweep > t.txt? [21:13]
SketchCowAND THE WINNER OF THE "DO YOU WANT THIS" SWEEPSTAKES FOR DECEMBER 21 IS [21:14]
ola_norskwith '-s <search word>' , yes [21:14]
SketchCow...hundreds of gigs of funeral recordings in mp3 [21:14]
***BartoCH has joined #archiveteam-bs [21:14]
dd0a13f37But without -s parameter?
You could cheat and just use the X most common words, but that's not a nice solution
[21:14]
ola_norskthen it asks for parameters i think .. I think it's either '-u (user)' or '-s (word(s))' possible
either one of those are required.. _i think_
[21:15]
dd0a13f37Then a full scrape is difficult, or at least harder [21:16]
jrwrSketchCow: your collection never ceases to amaze me [21:16]
JAAjrwr: Yay, finally. The last such dumps were uploaded in 2014 or 15. [21:16]
jrwrthey get dumped here after processing https://www.archiveteam.org/dumps/ [21:17]
ola_norskdd0a13f37: with a 'users' table i could be easier though perhaps..or :D [21:17]
jrwronly keeps one [21:17]
ola_norskdd0a13f37: it* [21:17]
jrwrthe backup log for it is in https://www.archiveteam.org/backup.log [21:17]
dd0a13f37Well, there's still a few users who are never mentioned by others, never use certain hashtags, and never use certain words [21:18]
ola_norskdd0a13f37: yeah [21:18]
JAANeat [21:18]
ola_norskdd0a13f37: not to mentioned banned, yet mentioned, and private..etc. i guess
dd0a13f37: i don't have much experience using tweep, so i don't even know how it behaves on finding disabled accounts, or banned users :/
[21:19]
dd0a13f37If you're scraping in realtime, that doesn't matter.. it would be one hell of a tweet to get banned in under 5 milliseconds [21:22]
ola_norskdd0a13f37: i think it just goes back in time from point of start [21:22]
dd0a13f37You'll never keep up, better to go in realtime [21:23]
ola_norskdd0a13f37: that would need some genoius at python threads i think..and perhaps faster bandwitch than mine :D
bandwidth*
[21:23]
dd0a13f37twisted-http is fast, no? [21:24]
***jacketcha has joined #archiveteam-bs
pizzaiolo has quit IRC (Read error: Operation timed out)
[21:24]
ola_norskdd0a13f37: i do not know..tweep uses 'request' / 'urllib(3?)' i think [21:25]
dd0a13f37one results page is 8.5k gzipped, contains 20 tweets, at 6k tweets/sec this gives 20 mbit/s [21:26]
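The 20 mbit/s figure follows from those numbers:

```python
tweets_per_sec = 6000.0   # dd0a13f37's firehose figure
tweets_per_page = 20      # tweets on one results page
page_kb = 8.5             # gzipped page size

pages_per_sec = tweets_per_sec / tweets_per_page        # 300 requests/s
mbit_per_sec = pages_per_sec * page_kb * 8 / 1000.0     # ≈ 20.4 Mbit/s
print(pages_per_sec, mbit_per_sec)
```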
ola_norskdd0a13f37: it wouldn't surprise me if in certain senarios a hashtag were quicker than i could process [21:26]
dd0a13f37Yeah, requests without threading. [21:26]
***jacket has quit IRC (Ping timeout: 248 seconds) [21:27]
dd0a13f37But I think caching will make such attempts impossible, if you do the same query multiple times you'll get the same result [21:27]
ola_norskwhen using crontab wget, i had to cut time from 5 minutes to 3 minutes between each web.archive.org/save/ request..just to have a chance [21:27]
dd0a13f37You can do those in the background. Fire and forget. But IA won't like it [21:29]
ola_norskdd0a13f37: going "upwards" in time in a twitter feed is most likely the best solution. But my grasp of how to do that..is weak :D [21:29]
dd0a13f37I think archiving twitter is an insanity project anyway, better to just wait for library of congress to get their shit together [21:30]
ola_norskdd0a13f37: i just focous on hastags, like netneutrality :D [21:30]
dd0a13f37that's probably possible [21:30]
ola_norskdd0a13f37: entire twitter, or twitter by even years or months..yeah, some congress would've have to do that :D [21:31]
jrwrand now I rest from poking the wiki really hard over the last 24hr [21:31]
***jacketcha has quit IRC (Read error: Operation timed out) [21:33]
ola_norskdd0a13f37: tweets containing 'netneutrality' been scrolling on my screen for 'since i said i started the command' , and i'm still on 2017-12-19 :/
dd0a13f37: though i expect it will speed up when getting past the 14th a bit
[21:33]
dd0a13f37You could archive faster if you modify it to use twisted [21:34]
ola_norskjust by the protocol stuff or using threading? [21:35]
dd0a13f37What?
>Twisted is an event-driven networking engine
https://twistedmatrix.com/documents/current/api/twisted.web.client.html
[21:36]
ola_norskso its beatifulsoup that's bottleneck, or? [21:37]
dd0a13f37No, requests
and that it's not using requests with threads
[21:38]
ola_norskcould it "re-use" already established connections? because that is one thing that pisses me off about tweep. It seems to do one connection per damn tweet [21:40]
dd0a13f37yeah [21:40]
ola_norsk..or at least try, like wget
ty
[21:40]
jrwranyway JAA I figured once a week is a good backup for a low traffic wiki [21:41]
dd0a13f37or apparently asyncio is recommended [21:42]
JAAYeah, sounds reasonable.
ola_norsk, dd0a13f37: It would probably be easiest to reimplement the whole thing based on aiohttp or similar.
[21:42]
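A minimal sketch of the aiohttp approach JAA suggests: a single shared session keeps a connection pool, so the scraper stops opening one connection per tweet. The URL list and cursor handling are placeholders, not tweep's actual logic:

```python
import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()

async def main(urls):
    # one ClientSession = one connection pool, so requests reuse connections
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, page in zip(urls, pages):
            print(url, len(page))

urls = ["https://mobile.twitter.com/jack?max_id=938593014343024639"]  # placeholder list
asyncio.get_event_loop().run_until_complete(main(urls))
```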
ola_norskola_norsk taking notes of all :D [21:44]
JAAI've written scrapers with aiohttp before, it's really nice. [21:44]
ola_norskgot git? :D [21:45]
JAAHTTP/2 support would be even better.
No, haven't shared it yet.
It's on my list for the holidays, uploading all my grabs and the corresponding code.
[21:45]
ola_norskJAA: feel free to punch in some stuff :) https://github.com/DuckHP/twario-warrior-tool
i have to the get database thingy working first i guess, before i do anything else :/
(that, and making sure it doesn't freeze)
[21:46]
dd0a13f37http://www.sqlalchemy.org/ [21:47]
JAAWhat did you change so far?
Also, port to Python 3 please.
[21:48]
ola_norskJAA: i've barely (not really) touched tweep itself so far :/ [21:48]
JAAAh ok [21:49]
ola_norskbah...I've not python'ed in years..2.7 is new to me :D [21:49]
JAAOh please, Python 3 was released in 2008. :-P [21:50]
ola_norskwhy is there not a python script that converts e.g 'print "shit"' ? [21:51]
dd0a13f372to3? [21:51]
JAA2to3
That should handle the most obvious stuff.
[21:51]
ola_norskgood [21:52]
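For reference, 2to3 ships with Python and rewrites the obvious syntax differences in place (`2to3 -w tweep.py` writes the changes back to the file). A sketch of the kind of change it makes:

```python
import sys

# Python 2 source:
#     print "shit"
#     print >> sys.stderr, "oops"
#
# what `2to3 -w script.py` turns it into:
print("shit")
print("oops", file=sys.stderr)
```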
***icedice has joined #archiveteam-bs [21:53]
ola_norskthe need of porting an interpreted language...that's a travesty by itself :( [21:54]
JAAWell, it's necessary because they cleaned up a ton of poorly designed stuff in Python 3. [21:54]
dd0a13f37it truly boggles the mind
yet c can remain source compatible for 28 years and counting
[21:55]
JAAYeah, let's compare C to Python...
And I doubt that C was as stable in the early stages of development.
Let's discuss that again when Python is 45 years old.
[21:55]
ola_norsk:D [21:56]
dd0a13f37But python is old by now. The 2 to 3 migration was a complete catastrophe. [21:56]
JAAWell yeah, many of those things (e.g. string vs. unicode distinction) should've been fixed earlier.
But they waited and accumulated all those things and then made one big backwards-incompatible release.
Which makes sense, otherwise you'd have to keep changing the code all the time.
[21:57]
hook54321Should these videos be uploaded to community video, or a different collection? https://www.youtube.com/user/gencat [21:58]
JAAAnyway, this is getting way too offtopic for this channel. [21:58]
dd0a13f37C was standardized in 1989, and k&r was released in 1978 - 11 years [21:58]
ola_norsk:D [21:58]
dd0a13f37python was released in 1991
it was not standardized by 2002
[21:58]
***JAA changes topic to: Lengthy Archive Team and archive discussions here | Offtopic: #archiveteam-ot | <godane> SketchCow: your porn tapes are getting digitized right now
schbirid has quit IRC (Quit: Leaving)
[21:59]
dd0a13f37Or, to be fair, python2 was released in 2000, and it wasn't standardized by 2011 [21:59]
JAA-> #archiveteam-ot [22:00]
dd0a13f37I didn't know we had an offtopic channel for the offtopic channel [22:01]
JAAThis isn't the offtopic channel, it was always about lengthy discussions (because #archiveteam is limited to announcements).
And -ot is new, just opened last week I think.
[22:01]
DFJustinoh great another channel [22:02]
hook54321lol [22:02]
dd0a13f37#archiveteam-ot-bs when? [22:04]
***icedice2 has joined #archiveteam-bs [22:06]
JAAhook54321: Community video sounds reasonable to me. Are you uploading each video as its own item? If so, you should probably ask info@ to create a collection of all of them in the end. [22:08]
***icedice has quit IRC (Ping timeout: 250 seconds) [22:08]
hook54321Each of them there own item yeah. I'm using tubeup to do it. I'll email info@ when it's done I guess. [22:09]
***ola_norsk has quit IRC (R.I.P dear known Python :( https://youtu.be/uy9Mc_ozoP4) [22:10]
Smileyis -bs not the off topic channel tho?!
Smiley so confuse. fuck that.
[22:12]
dd0a13f37When do Igloo's pipelines upload? As part of
Archiveteam: Archivebot GO Pack?
[22:14]
JAAYes
All pipelines do, except astrid's and FalconK's.
[22:15]
dd0a13f37So why can't I find a certain !ao job in it? [22:16]
JAALet's go to #archivebot. [22:16]
***pizzaiolo has joined #archiveteam-bs [22:19]
...... (idle for 27mn)
icedice has joined #archiveteam-bs
icedice2 has quit IRC (Ping timeout: 245 seconds)
icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
icedice2 has quit IRC (Client Quit)
kristian_ has joined #archiveteam-bs
icedice has joined #archiveteam-bs
[22:46]
icedice2 has joined #archiveteam-bs
icedice has quit IRC (Ping timeout: 245 seconds)
[23:06]
......... (idle for 40mn)
jacketcha has joined #archiveteam-bs [23:48]
jacketchaHey, does anybody know if there is a node.js implementation of the Warrior program? [23:56]
