#archiveteam-bs 2017-12-13,Wed

Time	Nickname	Message
00:00 ^🔗	ola_norsk	i'm not sure how to formulate it. E.g from 2006 until today, a file with e.g https://twitter.com/drkbri/status/940731016880312321
00:01 ^🔗	ola_norsk	on each line
00:01 ^🔗	JAA	A text file filled with tweet URLs, you mean?
00:01 ^🔗	ola_norsk	yeah
00:01 ^🔗	JAA	Well, one such URL is something like 60 bytes.
00:02 ^🔗	JAA	Idk how many URLs you want to store.
00:02 ^🔗	JAA	If storage size is a concern, you could compress this massively, of course.
00:03 ^🔗	JAA	You could just store 'username id' lines, for example.
00:03 ^🔗	ola_norsk	there's roughly 30 tweets in 3 minutes of netneutrality hashtag
00:03 ^🔗	astrid	you dont need to store the username, the ids are globally unique
00:03 ^🔗	astrid	you can replace it in the url by any text you want
00:03 ^🔗	JAA	Oh, it just redirects. TIL
00:04 ^🔗	JAA	Was it always like this though?
00:04 ^🔗	JAA	I thought it used to 404.
00:04 ^🔗	ola_norsk	can it be ballparked to NOT be above 160GB file?
00:04 ^🔗	JAA	Ok yeah, just the ID then.
00:04 ^🔗	JAA	And even that is highly compressible since it's only digits.
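The point about IDs being globally unique can be sketched as a pair of helpers: one pulls the numeric ID out of a tweet URL, the other rebuilds a working URL from the ID alone. This is a minimal sketch; the function names and the `_` placeholder username are assumptions for illustration, relying on the redirect behaviour described above.

```python
import re

# Matches the numeric status ID in a tweet URL; the username part is ignored.
_STATUS_RE = re.compile(r"twitter\.com/[^/]+/status/(\d+)")


def tweet_id_from_url(url: str) -> int:
    """Extract the numeric tweet ID from a tweet URL."""
    match = _STATUS_RE.search(url)
    if match is None:
        raise ValueError(f"not a tweet URL: {url!r}")
    return int(match.group(1))


def url_from_tweet_id(tweet_id: int, placeholder: str = "_") -> str:
    """Rebuild a tweet URL from the ID alone.

    The placeholder username is hypothetical; per the discussion above,
    any text in that position is expected to redirect to the real tweet.
    """
    return f"https://twitter.com/{placeholder}/status/{tweet_id}"
```

Storing only the output of `tweet_id_from_url` keeps one integer per tweet, and the URL can be rebuilt whenever it is needed.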
00:05 ^🔗	JAA	10 tweets per minute = about 53 million tweets in 10 years
00:05 ^🔗	ola_norsk	* 60 bytes ?
00:06 ^🔗	ola_norsk	i have dyscalculia :D
00:06 ^🔗	JAA	Well yeah, if you want to store the entire URL every time.
00:06 ^🔗	JAA	That would be 3.2 GB or so.
00:06 ^🔗	JAA	If you store username + ID, that probably reduces by half.
00:07 ^🔗	JAA	If you only keep the ID, another factor of ~2.
00:07 ^🔗	JAA	And if you compress that file, you probably get it down to less than 100 MB.
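The ballpark above can be reproduced with a few lines of arithmetic. This is a rough sketch using the figures from the conversation (10 tweets per minute, 10 years, about 60 bytes per full URL); the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 60 * 24 * 365.25  # average year, leap days included


def estimate_tweets(per_minute: float, years: float) -> float:
    """Number of tweets at a constant rate over the given number of years."""
    return per_minute * MINUTES_PER_YEAR * years


def size_gb(count: float, bytes_per_item: float) -> float:
    """Uncompressed size in decimal gigabytes."""
    return count * bytes_per_item / 1e9


tweets = estimate_tweets(per_minute=10, years=10)
full_url_gb = size_gb(tweets, bytes_per_item=60)
print(f"{tweets / 1e6:.1f} million tweets, {full_url_gb:.2f} GB as full URLs")
```

Even the uncompressed full-URL list stays far below a 160 GB disk at that rate.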
00:08 ^🔗	zino	zfs with lz4 turned on would automatically make that tiny.
00:09 ^🔗	ola_norsk	as long as it doesn't surpass the machine im doing it on it'll be fine. But yeah, i'm thinking several users might've made several tweets with that word. Maybe sqlite could be useful then
00:09 ^🔗	zino	You probably want to avoid sqlite for large databases.
00:09 ^🔗	JAA	^
00:10 ^🔗	ola_norsk	doesn't sqlite go into terabytes?
00:10 ^🔗	zino	I mean it can, and then the database suddenly is corrupt.
00:10 ^🔗	PoorHomie	Plus it's going to kill your disk doing so
00:11 ^🔗	ola_norsk	PoorHomie: only this machine has SSD, the working machine is good old spinning crap :D
00:12 ^🔗	ola_norsk	work machine*
00:12 ^🔗	JAA	I don't know what you're trying to do really.
00:12 ^🔗	JAA	If you want to store the tweets to search through them afterwards, use a proper database.
00:12 ^🔗	JAA	If you just want to compile the tweet URLs, just use a (compressed) text file.
00:13 ^🔗	ola_norsk	goal: get link to ALL tweets containing any mention of 'netneutrality'
00:13 ^🔗	astrid	just url?
00:13 ^🔗	ola_norsk	as far back as it goes
00:13 ^🔗	JAA	username + ID of 1000 tweets is around 14.6 kB gzipped.
00:14 ^🔗	ola_norsk	astrid, yeah. From that, it could either be made to a warc or fed to wayback..or?
00:14 ^🔗	JAA	Yes
00:14 ^🔗	ola_norsk	with wget --get-requisites, wayback will apparently even save images
00:15 ^🔗	JAA	If you only keep the IDs, that reduces to 7.7 kB gzipped.
00:15 ^🔗	JAA	ola_norsk: Yeah, don't do that for that many URLs.
00:15 ^🔗	JAA	Use wget or wpull or whichever tool you prefer to create WARCs, then upload those to IA, and they'll get included in the Wayback Machine.
00:16 ^🔗	ola_norsk	JAA: there's 'Sleep' to limit excessive /save/ requests though
00:16 ^🔗	ola_norsk	hmm ok
00:17 ^🔗	JAA	I mean, you can try using /save, but creating WARCs directly will be much more efficient.
00:17 ^🔗	ola_norsk	but, 6 hours of '#netneutrality' tweets, is quite a lot
00:17 ^🔗	ola_norsk	even just 6 hours
00:17 ^🔗	JAA	It will also be possible to download the entire archives at once. The WARCs of stuff saved through the WM are not downloadable.
00:19 ^🔗	ola_norsk	3 hours of '#netneutrality' is ~ 290MB
00:19 ^🔗	ola_norsk	as warc
00:19 ^🔗	JAA	Yeah, the WARCs will be quite large. You'll probably want to upload to IA while you're still grabbing.
00:20 ^🔗	ola_norsk	anyway to concatenate warc files?
00:21 ^🔗	ola_norsk	e.g daily captures
00:21 ^🔗	JAA	Yes, you can just concatenate them with e.g. cat.
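Plain concatenation works because a WARC file is just a sequence of records, and a gzipped WARC is a sequence of gzip members, which gzip readers decode back to back. A minimal sketch of the same thing `cat` does, with a hypothetical helper name:

```python
import shutil
from pathlib import Path
from typing import Iterable


def concat_files(parts: Iterable[Path], dest: Path) -> None:
    """Byte-for-byte concatenation of the given files, like `cat a b > dest`."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

For daily captures, passing the parts in sorted filename order keeps the records in chronological order, assuming the filenames start with the date.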
00:22 ^🔗	ola_norsk	cat -R *
00:22 ^🔗	astrid	or you can just upload them to the same item and not bother even cat'ing them
00:22 ^🔗	ola_norsk	it still goes to wayback if warc?
00:22 ^🔗	astrid	yea
00:22 ^🔗	astrid	well, if it gets blessed
00:22 ^🔗	ola_norsk	ty
00:23 ^🔗	ola_norsk	astrid: what does that entail?
00:23 ^🔗	ola_norsk	reviewed?
00:23 ^🔗	astrid	someone with admin waves a magic wang over it
00:23 ^🔗	astrid	i really don't know tbh
00:23 ^🔗	ola_norsk	it still gets uploaded though?
00:24 ^🔗	astrid	yes that is all post-upload
00:24 ^🔗	ola_norsk	ok
00:26 ^🔗	ola_norsk	it's bound to be more likely to be blessed than doing /save/ requests every fucking 3 minutes though :D
00:27 ^🔗	ola_norsk	not to mention safer, as when some damn construction company decided to cut my power last tuesday for 2 hours
00:29 ^🔗	ola_norsk	what the world needs next is WARC tasks distributed via DHT
00:29 ^🔗		astrid has left ][
00:29 ^🔗	JAA	Yeah, you're largely independent from IA while grabbing, too. And you're archiving way more URLs than just the first page of the hashtag.
00:30 ^🔗	ola_norsk	the 'requisites' could be acquired after though, at least, some of them
00:30 ^🔗	ola_norsk	(and the FDS¤"#!#¤ t.co link could some day get translated into real links)
00:31 ^🔗	JAA	ola_norsk: Just ran another test. 10k tweet IDs are 75 kB gzipped, and it took about 9 minutes to grab those.
00:32 ^🔗	ola_norsk	basically' in matter of 'netneutrality', it's not so much as digging out meme pictures, but 'PRO' and 'CON' tweets, i guess
00:32 ^🔗	ez	twitter isn't very keen on mirroring, especially of historical data
00:32 ^🔗	JAA	You could probably reduce the size a bit more by using a more efficient compression algorithm.
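One way to check the compression claims is to write IDs one per line and compare gzip against a stronger algorithm such as LZMA. The sketch below uses synthetic random 18-digit IDs, which is an assumption: real tweet IDs from one hashtag share long prefixes and will likely compress better than random ones. Both helper names are hypothetical.

```python
import gzip
import lzma
import random
from typing import Dict, Iterable, List


def compressed_sizes(ids: Iterable[int]) -> Dict[str, int]:
    """Size in bytes of an ID-per-line file, raw and under two compressors."""
    raw = "\n".join(str(i) for i in ids).encode("ascii") + b"\n"
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw, compresslevel=9)),
        "lzma": len(lzma.compress(raw, preset=9)),
    }


def synthetic_ids(count: int, seed: int = 0) -> List[int]:
    """Sorted random 18-digit numbers standing in for real tweet IDs."""
    rng = random.Random(seed)
    return sorted(rng.randrange(10**17, 10**18) for _ in range(count))


print(compressed_sizes(synthetic_ids(1000)))
```

Running it on a real ID list instead of `synthetic_ids` gives numbers comparable to the ones quoted above.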
00:32 ^🔗	ola_norsk	ez: aye, they make money off of selling it
00:32 ^🔗	JAA	ez: Yep, and that's exactly why we should do it.
00:33 ^🔗	ez	in terms of storage its about 100GB a month or so, archiving wise mirroring twitter isnt hard
00:34 ^🔗	ola_norsk	in terms of using webarchive though, 3 hours of scrolling a hashtag is ~290 megabytes :/
00:34 ^🔗	ez	i'd prefer twitter to be dataset
00:34 ^🔗	ez	people want those archives for research, not to page through
00:34 ^🔗	JAA	ez: That doesn't sound right.
00:34 ^🔗	JAA	There are around 500 million tweets per day.
00:35 ^🔗	ola_norsk	what year did twitter start?
00:35 ^🔗	JAA	The text of that alone would be 70 GB already.
00:35 ^🔗	JAA	(At 140 characters, but they increased the limit recently.)
00:35 ^🔗	JAA	Plus all the metadata, images, and videos.
00:35 ^🔗	ez	JAA: its an old ballpark from my last attempt in 2015 or so, indeed the number could be much higher
00:36 ^🔗	ola_norsk	ez: here's my "success (NOT)" at using webarchive https://webrecorder.io/ola_norsk/twitter-hashtags
00:36 ^🔗	JAA	ola_norsk: 2006
00:36 ^🔗	ola_norsk	JAA: about the time of height of 'netneutrality' issue then?
00:37 ^🔗	ez	JAA: yea, the images and clips might pose trouble, or not. i dont really have clear idea how big % those are per tweet
00:37 ^🔗	ola_norsk	~30 tweets per. 3rd minute for ~10 years ..
00:38 ^🔗	ez	well, am not sure of the wisdom mirroring special hashtag, especially a political buzzword
00:38 ^🔗	ola_norsk	if 3 hours of that = ~290MB of stuff..
00:38 ^🔗	ez	in europe its been called variously 'regulation of state telecoms', and related deregulation of those in mid 2000s
00:39 ^🔗	JAA	ola_norsk: Yep, 53 million tweets. That's up to 7.4 GB raw text.
00:39 ^🔗	JAA	ola_norsk: Obviously, the website is much, MUCH larger.
00:40 ^🔗	ola_norsk	aye, even if subtracting shitty repsted lame memes, it's still a biggy i think
00:40 ^🔗	ola_norsk	reposted*
00:41 ^🔗	ez	you can generally count 10% of the raw number, theres not much point storing it raw, unless you need to build fast reverse index
00:41 ^🔗	ola_norsk	it's basically a chore just getting the urls quicker than they're inputted i guess :/ https://en.wikipedia.org/wiki/Infinite_monkey_theorem
00:41 ^🔗	ez	10% is what it generally compressed to with general algos, and 5% with specialized (very slow) ones
00:41 ^🔗	JAA	ez: Hmm, actually, that 500 million per day figure is from 2013. It's probably significantly higher than that now.
00:44 ^🔗	ez	JAA: yea. better way to estimate this is simply pulling the realtime feed and piping it through gzip
00:44 ^🔗	ola_norsk	i'm betting if even using /save/ per new tweet in a established hashtag, wayback would be too slow
00:44 ^🔗	ez	twitter generally allows for pulling the realtime feed, the biggest problem is the history
00:44 ^🔗	ez	no sane api, only page scraping
00:45 ^🔗	ola_norsk	not if you pay them..
00:45 ^🔗	JAA	There probably is a sane API, but $$$$$.
00:45 ^🔗	ola_norsk	aye
00:46 ^🔗	ez	JAA: anyhow, the current rate is something like 6k a second or something like that
00:46 ^🔗	ola_norsk	basically, if you have the funds to make them listen you could say, "hey, gives me all tweets with the word 'cat' in them'..
00:47 ^🔗	ez	with this distribution in length
00:47 ^🔗	ez	https://thenextweb.com/wp-content/blogs.dir/1/files/2012/01/Aig565bCAAAYgkB.png
00:47 ^🔗	JAA	ez: That's the rate from 2013.
00:48 ^🔗	JAA	I wonder what that distribution looks like for the past month.
00:48 ^🔗		second has quit IRC (Quit: WeeChat 1.4)
00:48 ^🔗	JAA	(Since they increased the limit.)
00:48 ^🔗		sec0nd is now known as second
00:49 ^🔗	ola_norsk	the problem i'm seeing with scraping is that there should be a second process capturing newly entered tweets
00:50 ^🔗	ola_norsk	e.g running back 10 years is well and good, but in the meantime there's 1000's of new ones :/
00:51 ^🔗	ola_norsk	maybe there should be a tweep v2
00:52 ^🔗	JAA	I'm sure that's possible since the browser displays that "3 new tweets" bar.
00:52 ^🔗	ola_norsk	in tweep?
00:53 ^🔗	ez	ola_norsk: its super annoying tasks only the datamining companies bother with
00:53 ^🔗	JAA	No, but in a similar way.
00:53 ^🔗	ez	the stream allows you only certain rate under a filter
00:53 ^🔗	ez	so you need shitton of accounts with disparate filters
00:54 ^🔗	ola_norsk	ez: some data is worth bother with for regular people i think :)
00:54 ^🔗	ez	its the sort of thing its just easier to pay for than trying to awkwardly skirt the rules (indian and russian media companies still do, coz they're fairly comfortable with blackhat social media stuff)
00:54 ^🔗	ola_norsk	IMO public data entry should be public, but yeah
00:55 ^🔗	ez	twitter is still one of the most open guys in the town
00:55 ^🔗	ez	but yea, all big 3 will laugh at you and gaslight you if you ask for something like this
00:55 ^🔗	ez	its a lifeblood of internet advertising, and you want it for free?
00:56 ^🔗	ola_norsk	for all we know, people've already typed in one of Shakespeare's books..
00:56 ^🔗	ola_norsk	to twitter
00:57 ^🔗	ez	ola_norsk: no, its a really interesting corpus for ML training
00:57 ^🔗	ez	you can easily make a chatbot with decent sample of twitters history, and a lot of folks do.
00:58 ^🔗	ola_norsk	decent is not perfect
00:58 ^🔗	ez	but i like the idea of what happens when a 1TB torrent appears on piratebay with all of twitters history
00:58 ^🔗	ez	(IA wouldnt survive the heat of doing that)
00:58 ^🔗	ez	legal heat at least
00:59 ^🔗	ola_norsk	how could they get legal heat of archiving public tweets? :/
00:59 ^🔗	ez	for starters, you didnt ask individual users if they allow you to archive their tweets
00:59 ^🔗	ola_norsk	i'm not doubting, just asking what reasons
00:59 ^🔗	ola_norsk	what does the twitter EULA say?
01:00 ^🔗	ez	its the whole dirty secret biz of data mining
01:00 ^🔗	ez	by using twitter, you allow TWITTER, and its partners to use your data
01:00 ^🔗	ez	but nobody else
01:00 ^🔗	JAA	Could Twitter actually do anything about it though? I assume you retain the copyright when posting.
01:01 ^🔗	ola_norsk	i wish these guys were on IRC https://discord.gg/Qb9TSZ (Legal Masses)
01:01 ^🔗	ez	JAA: they could, and they do. citing "privacy concerns"
01:01 ^🔗	ez	which is hilarious case of gaslighting
01:01 ^🔗	JAA	I mean, I could think of various things they could do regarding scraping the data, but what could they do about the data release itself legally?
01:02 ^🔗	ola_norsk	here in norway there is the 'right to be forgotten', but that does not trump unwillingness to let memory go
01:02 ^🔗	ez	they take it down, they have fairly clear rules you are allowed to release data only with their approval. there are provisions for small samples which are useful only for statistics, but not more whole-picturesque things
01:02 ^🔗	JAA	Oh dear, time to switch the topic before we get into that discussion again.
01:03 ^🔗	JAA	ez: But on what legal basis would they take it down?
01:03 ^🔗	JAA	Their rules don't matter much if they aren't enforceable legally.
01:03 ^🔗	ola_norsk	let's just archive it all, and hear who pisses about it :D
01:03 ^🔗	ez	its basically this https://twittercommunity.com/t/sharing-social-graph-dataset-for-research-purposes/77998
01:03 ^🔗	ez	and thats JUST the social graph
01:04 ^🔗	ez	not even the tweets
01:04 ^🔗	JAA	Their rules don't matter much if they aren't enforceable legally.
01:05 ^🔗	ez	JAA: its basically same how jstor can paywall public domain works. they legally bully, but it wouldn't pass rigorous scrutiny
01:05 ^🔗	ez	the issue of user's consent remains
01:05 ^🔗	ez	twitter couldn't sue you in the end, but Kanye West easily could
01:05 ^🔗	ola_norsk	i don't know man, if i see it, i might screenshot it
01:06 ^🔗	JAA	Yes, that's exactly my point. The users could certainly do something about it because they hold the copyright to that content and didn't consent to it being distributed in that way. But I don't see what Twitter could do about it (ignoring the company's accounts).
01:06 ^🔗	JAA	Anyway...
01:06 ^🔗	JAA	A datapoint: there are about 27 hours between tweets 940737669532758016 and 940331105898577920.
01:07 ^🔗	ola_norsk	archive first, delete whatever later..it's futile to archive after the fact
01:07 ^🔗	JAA	Clearly these aren't just numeric IDs, but there's something more complex going on.
01:07 ^🔗	JAA	It could be a five-digit random number at the end.
01:07 ^🔗	JAA	That would mean that there were 4 billion tweets in 27 hours.
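An alternative reading of that datapoint: Twitter has publicly described its tweet IDs as "Snowflake" IDs, in which the bits above the lowest 22 hold a millisecond timestamp counted from a custom epoch, and the low 22 bits hold machine and sequence numbers rather than a random suffix. The sketch below decodes the timestamp under that assumption; the epoch constant and the 22-bit shift come from the published Snowflake layout, not from anything in this conversation.

```python
from datetime import datetime, timezone

# Assumed from Twitter's published Snowflake layout: milliseconds since this
# epoch sit above the 22 low bits (machine ID and per-millisecond sequence).
TWITTER_EPOCH_MS = 1288834974657
TIMESTAMP_SHIFT = 22


def snowflake_time(tweet_id: int) -> datetime:
    """Creation time (UTC) encoded in a Snowflake-style tweet ID."""
    ms = (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


newer = snowflake_time(940737669532758016)
older = snowflake_time(940331105898577920)
print(newer, older, newer - older)
```

Under this layout the gap between two IDs reflects elapsed time, not the number of tweets posted in between, so the 4 billion figure would not follow from the ID difference.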
01:07 ^🔗	ez	ola_norsk: this data, at least in terms of archive would definitely survive easily as a torrent
01:07 ^🔗	ez	IA would only seed it initially :)
01:08 ^🔗	ola_norsk	ez: i'm only 75% into the topic
01:09 ^🔗	ola_norsk	too much tech for me, i be grabbing them links!
01:09 ^🔗		ola_norsk has quit IRC (skål!)
01:12 ^🔗	ez	JAA: btw, regarding the historical stats. pundits claim that twitter TPM has stalled since 2013
01:12 ^🔗	ez	and kept around the 5-8k/sec figure since then
01:12 ^🔗	ez	http://www.businessinsider.com/twitter-tweets-per-day-appears-to-have-stalled-2015-6
01:13 ^🔗	ez	(before that there was hockeystick growth apparently). nobody made a sigmoid curve yet to verify tho
01:15 ^🔗		ola_norsk has joined #archiveteam-bs
01:15 ^🔗	JAA	I see.
01:15 ^🔗	*	ola_norsk plain forgot
01:15 ^🔗	ola_norsk	here's output http://paste.ubuntu.com/26173713/
01:16 ^🔗	ola_norsk	quite a mess, but better than nothing
01:17 ^🔗	JAA	ez: Hmm, they appear to be basing that entire article just on the 500 million figure stated somewhere on Twitter. :-\|
01:17 ^🔗	ez	JAA: no, i google various sites which claim to know current TPM
01:17 ^🔗	ez	and they all show 5k, 6k, 7k
01:17 ^🔗	JAA	ola_norsk: Well yeah, it's messy. I'd only keep users and tweet IDs if you really intend to grab the entire history.
01:17 ^🔗	ez	but yea, the BI article just compares two years which is a poor sample and argument
01:18 ^🔗		ola_norsk has quit IRC (https://youtu.be/EPHPu4PV-Bw)
01:19 ^🔗	ez	this one claims even decline since peak in 2014, http://uk.businessinsider.com/tweets-on-twitter-is-in-serious-decline-2016-2
01:20 ^🔗	ez	could be just a case of BI having an agenda to portray twitter that way tho
01:21 ^🔗	JAA	ez: Regarding the total size of Twitter, they apparently have at least 500 PB of storage: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html
01:21 ^🔗	JAA	Quite an interesting article in general, really.
01:23 ^🔗	ez	i wish google would elaborate on their technicals more
01:24 ^🔗	ez	compared to twitter, the amount of traffic google (ie yt) gets is real scary
01:25 ^🔗	Frogging	I'd rather not have my content used for advancing algorithms to manipulate people with advertising. so good thing I don't post anything, I guess :)
01:25 ^🔗	ez	well, "pirating" their data would amount to opening a pandoras box
01:25 ^🔗	robogoat	Twitter 500 PB?
01:26 ^🔗	robogoat	Sounds like more than I would expect.
01:26 ^🔗	ez	abuse of the data is inevitable, but at least everyone should get equal opportunity to do good or evil
01:26 ^🔗	ez	not just the highest bidder
01:28 ^🔗		Stiletto has quit IRC (Read error: Connection reset by peer)
01:28 ^🔗		ZexaronS has joined #archiveteam-bs
01:28 ^🔗	ez	robogoat: i'd expect few PBs at most, for the actual content. the number being inflated by massive duplication on the edges
01:29 ^🔗	JAA	Yep, that number includes even cold storage.
01:29 ^🔗	robogoat	Yeah,
01:30 ^🔗	ez	google said they're 10 exa live, 5 on top
01:30 ^🔗	robogoat	If you're talking 1PB replicated 500 times.
01:30 ^🔗	ez	*5 on tape
01:30 ^🔗	JAA	They claim to be processing 10s of PB per day on another blog post.
01:30 ^🔗	ez	in 2013
01:30 ^🔗	robogoat	Google I wouldn't be surprised.
01:30 ^🔗	JAA	YouTube alone is ~1 EB.
01:31 ^🔗	JAA	(Really rough order of magnitude estimate)
01:32 ^🔗	ez	not sure if anyone has posted numbers since 2013
01:32 ^🔗	ez	but i suspect google is at the end of sigmoid too, in user adoption anyway
01:32 ^🔗	ez	they definitely had to have a bump with 1080p/4k which wasnt as prevalent in 2013
01:35 ^🔗		Stilett0 has joined #archiveteam-bs
01:41 ^🔗		CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
01:49 ^🔗	Somebody2	hook54321: yes, afaik the offer is still open.
01:49 ^🔗	hook54321	Did they contact ArchiveTeam specifically, or?
01:50 ^🔗		ranavalon has quit IRC (Quit: Leaving)
01:50 ^🔗	Somebody2	No.
01:50 ^🔗	hook54321	Who did they direct the offer towards?
01:51 ^🔗	Somebody2	But the site itself said that for years: https://web.archive.org/web/20150203022400/http://www.autistics.org/
01:52 ^🔗	Somebody2	And I have friends-of-a-friend contact with the custodian.
01:52 ^🔗	hook54321	Custodian being the website owner?
01:52 ^🔗	Somebody2	Yep.
01:55 ^🔗	hook54321	I have a Facebook friend in common with them, but I don't really know that Facebook friend personally.
01:58 ^🔗	Somebody2	Nods.
02:12 ^🔗		pizzaiolo has joined #archiveteam-bs
02:13 ^🔗		pizzaiolo has quit IRC (Client Quit)
02:29 ^🔗		Soni has quit IRC (Read error: Operation timed out)
02:32 ^🔗		closure has quit IRC (Read error: Operation timed out)
02:34 ^🔗		closure has joined #archiveteam-bs
02:35 ^🔗		svchfoo1 sets mode: +o closure
02:37 ^🔗		Valentin- has quit IRC (Ping timeout: 506 seconds)
02:37 ^🔗		dashcloud has quit IRC (Remote host closed the connection)
02:38 ^🔗		dashcloud has joined #archiveteam-bs
02:41 ^🔗		Asparagir has joined #archiveteam-bs
02:42 ^🔗		Asparagir has quit IRC (Client Quit)
02:47 ^🔗		MrRadar has quit IRC (Quit: Rebooting)
02:54 ^🔗		Stilett0 has quit IRC ()
03:06 ^🔗		ZexaronS has quit IRC (Read error: Operation timed out)
03:09 ^🔗		MrRadar has joined #archiveteam-bs
03:25 ^🔗	hook54321	Somebody2: I can try contacting the owner unless you think it would be easier for you to contact them.
03:25 ^🔗	hook54321	It's kinda difficult to email people about domain names because it oftentimes gets seen as the "Your domain is expiring soon" spam
03:34 ^🔗		zhongfu has quit IRC (Remote host closed the connection)
03:50 ^🔗	Somebody2	hook54321: Probably better to work out what you are proposing in some more detail, first.
03:50 ^🔗	Somebody2	Let's take this to PM.
04:06 ^🔗		qw3rty117 has joined #archiveteam-bs
04:12 ^🔗		qw3rty116 has quit IRC (Read error: Operation timed out)
04:37 ^🔗		zhongfu has joined #archiveteam-bs
05:05 ^🔗		du_ has quit IRC (Quit: Page closed)
05:11 ^🔗		Yurume has quit IRC (Read error: Operation timed out)
05:17 ^🔗		Yurume has joined #archiveteam-bs
06:08 ^🔗		sep332 has quit IRC (Read error: Operation timed out)
06:08 ^🔗		sep332 has joined #archiveteam-bs
06:25 ^🔗		Nugamus has quit IRC (Ping timeout: 260 seconds)
06:34 ^🔗		kimmer2 has quit IRC (Ping timeout: 633 seconds)
06:55 ^🔗		jschwart has quit IRC (Quit: Konversation terminated!)
07:49 ^🔗		me is now known as yipdw
08:02 ^🔗		Mateon1 has joined #archiveteam-bs
08:17 ^🔗		ndiddy has quit IRC ()
08:42 ^🔗		godane has joined #archiveteam-bs
09:25 ^🔗		Soni has joined #archiveteam-bs
09:28 ^🔗		tuluu has quit IRC (Read error: Operation timed out)
09:28 ^🔗		tuluu has joined #archiveteam-bs
10:00 ^🔗		beardicus has quit IRC (bye)
10:00 ^🔗		beardicus has joined #archiveteam-bs
10:40 ^🔗		BnAboyZ has quit IRC (Quit: The Lounge - https://thelounge.github.io)
12:06 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
12:27 ^🔗		pizzaiolo has joined #archiveteam-bs
12:32 ^🔗		pizzaiolo has quit IRC (pizzaiolo)
12:34 ^🔗		pizzaiolo has joined #archiveteam-bs
12:55 ^🔗		refeed has joined #archiveteam-bs
12:55 ^🔗		refeed has quit IRC (Client Quit)
13:15 ^🔗		ranavalon has joined #archiveteam-bs
13:16 ^🔗		ranavalon has quit IRC (Read error: Connection reset by peer)
13:16 ^🔗		ranavalon has joined #archiveteam-bs
13:25 ^🔗		purplebot has quit IRC (Quit: ZNC - http://znc.in)
13:25 ^🔗		PurpleSym has quit IRC (Quit: *)
13:28 ^🔗		PurpleSym has joined #archiveteam-bs
13:29 ^🔗		purplebot has joined #archiveteam-bs
13:33 ^🔗	ThisAsYou	Are we going to do Bitchute after vidme?
13:35 ^🔗	JAA	Are they shutting down?
13:45 ^🔗	godane	so i'm uploading my tgif abc woc 1998-12-11 tape i have
13:46 ^🔗	godane	i'm slowly uploading christmas shows i have for myspleen
14:14 ^🔗	ThisAsYou	JAA I heard they are in a similar situation as VidMe (barely making it) and now have an influx of VidMe users to make things worse
14:26 ^🔗	JAA	Makes sense considering they're essentially saying "Please come back, Vidme!": https://twitter.com/bitchute/status/936804311492734977
14:26 ^🔗	JAA	On the other hand, they also retweeted messages from former Vidme users saying they're switching to BitChute, e.g. https://twitter.com/ErickAlden/status/937022050761433088
15:32 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
15:37 ^🔗		RichardG has joined #archiveteam-bs
16:01 ^🔗	JAA	SketchCow: FYI, I'm about to upload two huge ArchiveBot WARCs to FOS. 28 and 35 GB or something like that. That pipeline (not mine) was using a buggy version of wpull, so I'm trying to fix it.
16:01 ^🔗	JAA	Not sure if this is even problematic or anything, just wanted to let you know.
16:04 ^🔗	JAA	And regarding the rsyncd config, if you don't mind giving me your config file from FOS, I'd like to play around to see if I can figure out a proper solution to the overwriting issue.
16:05 ^🔗	JAA	I also want to test if --ignore-existing even works at all with a write-only target. Since the client can't get a list of files on the server, it's possible that it won't change anything.
16:16 ^🔗		du_ has joined #archiveteam-bs
16:32 ^🔗		cloudfunn has joined #archiveteam-bs
16:59 ^🔗		kimmer12 has joined #archiveteam-bs
17:02 ^🔗		kimmer12 has quit IRC (Read error: Connection reset by peer)
17:02 ^🔗		kimmer13 has joined #archiveteam-bs
17:04 ^🔗		kimmer1 has quit IRC (Ping timeout: 633 seconds)
17:05 ^🔗	SketchCow	Go ahead and try
17:05 ^🔗		kimmer13 has quit IRC (Read error: Connection reset by peer)
17:06 ^🔗	SketchCow	1.5tb free on FOS at the moment, more free today
17:06 ^🔗	JAA	Uploads are done already. :-)
17:08 ^🔗		kimmer1 has joined #archiveteam-bs
17:08 ^🔗		Stilett0 has joined #archiveteam-bs
17:19 ^🔗		Valentine has joined #archiveteam-bs
17:20 ^🔗	SketchCow	Yeah, just saying we went from MANGA CRISIS to ok
17:22 ^🔗	JAA	Sweet
17:40 ^🔗	arkiver	https://pineapplefund.org/
18:17 ^🔗		Stilett0 is now known as Stiletto
18:24 ^🔗		kimmer12 has joined #archiveteam-bs
18:30 ^🔗		ola_norsk has joined #archiveteam-bs
18:30 ^🔗		kimmer13 has joined #archiveteam-bs
18:31 ^🔗	ola_norsk	anyone know if there's a way to sort the playlist on a 'community video(s)' item?
18:31 ^🔗		kimmer1 has quit IRC (Ping timeout: 633 seconds)
18:31 ^🔗	ola_norsk	or, re-sort it, i guess
18:34 ^🔗	ola_norsk	i'm guessing it's sorted by filenames. E.g, if all files starts with date like '20170101_' , would it be possible to reverse that sorting; so that the playlist is basically reversed, showing newest -> oldest
18:35 ^🔗		kimmer12 has quit IRC (Ping timeout: 633 seconds)
18:41 ^🔗	ola_norsk	i'm guessing one rather messy workaround would using 'ia move' command, to append e.g '0001_' , '0002_' to the filenames based on date. But might there be a better way?
18:54 ^🔗	ola_norsk	going with prefixed filenames would require every filename in an item to be renamed as well, if a newer file were to be added to it :/
19:04 ^🔗		ndiddy has joined #archiveteam-bs
19:08 ^🔗	ola_norsk	JAA: btw, tweep is working like a mofo :D But, i'm kind of worried about there not being a way to regulate/randomize its frequency of requests. So perhaps running it through proxies might be better?
19:10 ^🔗	ola_norsk	JAA: I'm guessing it would be noticeable, at some point by some twitter admin, even if it's faking user agent. if it's left running for weeks and months :/
19:20 ^🔗	SketchCow	How about #archiveteam-offtopic
19:21 ^🔗	ola_norsk	that
19:22 ^🔗	ola_norsk	SketchCow: i don't want to be OP there though :/
19:24 ^🔗	ola_norsk	SketchCow: How about archiveteam-ot , for short, and similarities to -bs ?
19:33 ^🔗		schbirid has joined #archiveteam-bs
19:41 ^🔗	ola_norsk	SketchCow: though i'm not seeing how Internet Archive playlist sorting and archiving tweets is not within 'Off-Topic and Lengthy Archive Team and Archive Discussions'
19:47 ^🔗		jschwart has joined #archiveteam-bs
19:52 ^🔗		Smiley has quit IRC (Ping timeout: 255 seconds)
19:55 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
19:55 ^🔗		Smiley has joined #archiveteam-bs
20:53 ^🔗		icedice has joined #archiveteam-bs
21:04 ^🔗		BlueMaxim has joined #archiveteam-bs
21:04 ^🔗	SketchCow	-ot might work
21:05 ^🔗	SketchCow	You did the thing
21:05 ^🔗	SketchCow	I was suggesting the channel, not suggesting you were saying something for the channel.
21:05 ^🔗	ola_norsk	Me doing something useful; That's rare :D
21:06 ^🔗	ola_norsk	btw, for some reason i just naturally figured "bs" to stand for "bullshit(ting)" :/
21:07 ^🔗	ola_norsk	anyway, i can't be OP though, regardless of what it's called
21:07 ^🔗	BlueMaxim	that's pretty much what the original purpose of this channel was :P
21:08 ^🔗	PoorHomie	lol
21:08 ^🔗	PoorHomie	PurpleSym already grabbed op in -ot
21:08 ^🔗	ola_norsk	he's gotta log off some time
21:08 ^🔗	ola_norsk	or maybe he's the OP it needs, who knows
21:09 ^🔗	*	ola_norsk just knows ola_norsk is not OP material
21:11 ^🔗	ola_norsk	(or she)
21:12 ^🔗	ola_norsk	channel squatting is quite useless, since there's a multitude of variations that's between "-ot" and "-offtopic" :D
21:26 ^🔗	ola_norsk	PoorHomie: he or she might've just joined and gotten OP by default, like i did
21:40 ^🔗	Frogging	archiveteam-bs-bs
21:41 ^🔗		Asparagir has joined #archiveteam-bs
21:44 ^🔗	ola_norsk	so much bs :D
21:47 ^🔗	ola_norsk	if there's #archiveteam , and #archiveteam-bs , how on topic is required here?
21:48 ^🔗	BlueMaxim	well I mean I haven't been around a while so I might be wrong
21:48 ^🔗		Asparagir has quit IRC (Asparagir)
21:48 ^🔗	BlueMaxim	but I always thought this channel was for general nothing talk until it needed to get serious
21:55 ^🔗	SketchCow	Buck stops with me
21:55 ^🔗	SketchCow	People discussing endless what ifs and theories about archiving and saving things
21:55 ^🔗	SketchCow	= OK
21:55 ^🔗	SketchCow	People going off for hours about how to make a good wiki software suite
21:55 ^🔗	SketchCow	! OK
21:55 ^🔗	SketchCow	People going off about bitcoin
21:56 ^🔗	SketchCow	! OK
21:56 ^🔗	ola_norsk	fair enough :D
21:57 ^🔗	SketchCow	I'm just going to clean shit up
21:58 ^🔗	SketchCow	People who do good work are either being driven away or can't focus on what's needed
22:00 ^🔗	SketchCow	So guess who has the bat
22:00 ^🔗	ola_norsk	a vidme user wrote this "What you are doing means a lot to me. And I agree that there is no such thing as "safe" digital data, sadly."
22:03 ^🔗		jschwart has quit IRC (Quit: Konversation terminated!)
22:10 ^🔗	godane	SketchCow: i'm starting to upload Joystiq WoW Insider Show
22:10 ^🔗	godane	i have 16gb of that
22:12 ^🔗		pizzaiolo has quit IRC (Read error: Operation timed out)
22:15 ^🔗	SketchCow	Great
22:16 ^🔗	godane	metadata maybe a problem until episode 139
22:16 ^🔗		ranavalon has quit IRC (Read error: Connection reset by peer)
22:17 ^🔗		ranavalon has joined #archiveteam-bs
22:20 ^🔗	ola_norsk	ez: you said the other day that sqlite would not be ideal to use for capturing twitter data. Could 'H2' be better?
22:23 ^🔗	ola_norsk	ez: speed and stability would in my case be worth more than databasesize, since i'm pretty much just looking to reconstruct links to individual tweets, and take it from there
22:25 ^🔗	schbirid	just use postgres
22:26 ^🔗	ola_norsk	schbirid: does it make a single file database?
22:26 ^🔗	schbirid	no
22:27 ^🔗	ola_norsk	i only have experience with mysql and sqlite :/
22:27 ^🔗	schbirid	what kind of volume do you expect?
22:27 ^🔗	ivan	never a bad time to learn postgres
22:28 ^🔗	schbirid	gtg
22:28 ^🔗		schbirid has quit IRC (Quit: Leaving)
22:28 ^🔗	ola_norsk	schbirid: there was made some calculations 1-2 days ago , by JAA
22:28 ^🔗	ola_norsk	oh
22:29 ^🔗	ez	ola_norsk: its fine if you have some subset, but for archiving the database is too bloaty
22:30 ^🔗	ez	datasets like that are just flat "log" files, aggressively compressed
22:30 ^🔗	ola_norsk	ez would writing into a mounted gzip file be too slow?
22:31 ^🔗	ez	no, you just pipe output of whatever dumper you have
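The "just pipe the output" approach can be sketched as a small function that appends lines to a gzip file. Opening in append mode starts a new gzip member each run, and gzip readers decode all members in sequence, so repeated runs extend the same log. The function name and file layout are assumptions for illustration.

```python
import gzip
from typing import Iterable


def append_lines(lines: Iterable[str], path: str) -> int:
    """Append text lines to a gzip-compressed log; returns the number written."""
    written = 0
    # "at" = append, text mode: each call adds a new gzip member to the file.
    with gzip.open(path, "at", encoding="utf-8") as out:
        for line in lines:
            out.write(line.rstrip("\n") + "\n")
            written += 1
    return written
```

Calling `append_lines(sys.stdin, "tweets.log.gz")` from a script would behave much like piping the dumper's output through `gzip >> tweets.log.gz`.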
22:31 ^🔗	ez	again, this matters only if youre scraping whole twitter, and the 1:20 compression ratio helps a lot with logistics
22:32 ^🔗	ez	doesnt make sense if you use some highly specific filter
22:34 ^🔗	ez	ola_norsk: if you want to reconstruct UI ("links") for the scraped data and make a web interface for it, yea, i'd probably go with uncompressed db
22:34 ^🔗	ola_norsk	filter is every tweet containing a word, e.g "netneutrality" ..Basically there's no way i can store it as warc. So focusing on using tweep, which seems to get the text and IDs, and reconstruct those IDs into links
22:34 ^🔗	ez	perhaps not even bother with db, and just save it all as rendered pages
22:34 ^🔗	ez	ie warc
22:34 ^🔗	ola_norsk	my harddrive is 160GB :D
22:35 ^🔗	ez	hard to say, but gut feeling is that thats plenty enough for something so specific
22:35 ^🔗	ez	if you were to do, say, 'trump' as a keyword, youd probably need far more
22:36 ^🔗	ola_norsk	problem is i'd need some way to check for duplicate entries, in case of e.g powerfail :/
22:37 ^🔗	ez	well, log style dumps work that way. append operation is already guaranteed to be atomic by the filesystem
22:37 ^🔗	ola_norsk	from what i know of sqlite, it doesn't store anything unless it's solidly entered into the database
22:37 ^🔗	ez	so the mirroring script you make just looks at the end of the log to fetch last logged entry and continues from there
22:37 ^🔗	ez	as for sqlite, same applies, you just select max() etc
22:38 ^🔗	ez	sqlite guarantees write order, so it behaves like a (much less efficient, but with nifty query language) log
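A minimal sketch of the sqlite variant, using Python's standard `sqlite3` module: making the tweet ID the primary key lets `INSERT OR IGNORE` skip duplicates after a restart, and `SELECT MAX(id)` gives the resume point. The table layout and function names are assumptions for illustration.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def open_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database with one row per tweet ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, username TEXT)"
    )
    return conn


def store(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> None:
    """Insert (id, username) rows; IDs already present are silently skipped."""
    with conn:  # one transaction: committed on success, rolled back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweets (id, username) VALUES (?, ?)", rows
        )


def last_id(conn: sqlite3.Connection) -> Optional[int]:
    """Highest stored tweet ID, or None if the table is empty."""
    return conn.execute("SELECT MAX(id) FROM tweets").fetchone()[0]
```

When scraping backwards through history rather than forwards, the resume point would be `MIN(id)` instead of `MAX(id)`.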
22:39 ^🔗	ez	ola_norsk: again, my gut feeling is for something so super specific, sqlite is plenty fine
22:39 ^🔗	ez	tho my idea of how many tweets are out there could be off, i'm just expecting couple hundred millions, not more
22:41 ^🔗	ola_norsk	some would indeed be deleted, user banned, profile set to private, etc. So looking for something that is as fast as possible to store it, then validate later if need be.
22:41 ^🔗	ola_norsk	without wasting too much storage each day
22:42 ^🔗	ez	well, depends on what you do in the end
22:42 ^🔗	ez	most people scrape twitter on scale like this for sentiment tracking
22:42 ^🔗	ez	twitter itself has best api for that. you give it keyword, it throws a realtime feed back at you.
22:42 ^🔗	ola_norsk	plan is to upload to IA every 24hour, afer 24h capture.
22:43 ^🔗	ola_norsk	twitter api seems intentionally limited..
22:43 ^🔗	ola_norsk	after*
22:43 ^🔗	ez	not sure if its intentional. im pretty sure keeping reverse search index for everything would be monumental task
22:43 ^🔗	ez	so they dont, and just track 7 day window
22:45 ^🔗	ola_norsk	i think the free API is limited to ~3000 tweets
22:45 ^🔗	ola_norsk	for e.g one user
22:45 ^🔗	ez	am talking about the realtime feed
22:45 ^🔗	ez	yes, the rest of the api is worthless
22:45 ^🔗	ez	you're better off scraping for that
22:47 ^🔗	ez	wow, that hashtag
22:47 ^🔗	ez	what a cesspool of FUD
22:48 ^🔗	ola_norsk	'#netneutrality" ?
22:48 ^🔗	ola_norsk	:D
22:48 ^🔗	ez	yea
22:48 ^🔗	ola_norsk	lol, yeah
22:48 ^🔗	ola_norsk	so, i'm not saving all those pics :D
22:49 ^🔗	ola_norsk	'tweep' seems to get the Id, the user, and perhaps also resolved (fucking!) t.co links though :D
22:49 ^🔗	ola_norsk	and text
22:49 ^🔗	ez	ola_norsk: i think that archiving twitter in immediate future is not viable, until sb commits to play a constant whack-a-mole with a fleet of proxies
22:49 ^🔗	ez	and accounts
22:50 ^🔗	ez	but writing a bot which just archives whatever it is the most controversial trend at any given time might be viable
22:50 ^🔗	ola_norsk	that's my main worry. The speed tweep seems to go at is bound to be noticed by some jerk at twitter that might ban my ip or something
22:51 ^🔗	ola_norsk	if i leave it running for weeks and months
22:52 ^🔗	ez	id not worry about tweep much as long you use it just for a keyword
22:53 ^🔗	ez	tweep itself is a bit troublesome because of what you said - it sees only whatever is in search index, and only with fairly long poll delays
22:55 ^🔗	ez	twitter is now somewhat famous for "banning" (they call it deranking, but so far the tweets simply vanish) from search results
22:56 ^🔗	ola_norsk	tweep seems to be more focused on just specified user, not so much words or phrases
22:56 ^🔗	ola_norsk	IMO
22:56 ^🔗	ez	well, it does only what the webui do
22:56 ^🔗	ez	and the web ui is very restricted in scope, yes
22:56 ^🔗	ola_norsk	there's e.g no option to limit frequency
22:57 ^🔗	ez	iirc it just hammers
22:57 ^🔗	ez	one-request-at-any-given time
22:57 ^🔗	ez	thats fairly benign by hammering standards
22:57 ^🔗	ola_norsk	aye, and imagine if 'lol' or 'omg' was used as searchword..and left running :D
22:58 ^🔗	ez	if you were hammering 500 requests in parallel, they would probably raise some triggers
22:58 ^🔗	ola_norsk	aye
22:59 ^🔗	ola_norsk	i doubt they'd go "wow, that Firefox user is clicking mighty fast!" :D
23:02 ^🔗	ola_norsk	ez the 'anonymity' of tweep seems to be only faking user agent :/
23:02 ^🔗	ola_norsk	a static hardcoded as such
23:02 ^🔗	ez	twitter is a lot like reddit. they're generally lenient towards bots on plain http level. instead, they just expose an api so crappy it makes any mass scrape very, very awkward
23:03 ^🔗	ola_norsk	unless you pay them..
23:07 ^🔗	ola_norsk	i'm thinking an extremely THIN vm where every request goes via Tor or open proxies might even be safer than e.g letting tweep run for 24 hours :/
23:07 ^🔗	ez	just run tweep via torify
23:08 ^🔗	ez	if you want to play a blackhat whack-a-mole like this though, i suggest you first modify it to support proxies directly
23:08 ^🔗	ez	just using a single request at a time, and your own ip, i'd consider fairly legit ... anything beyond that, you're skirting net etiquette a bit
23:09 ^🔗	ola_norsk	i appreciate etiquette :D , im not blackhat :D
23:10 ^🔗	ola_norsk	(within reason)
23:11 ^🔗	ola_norsk	it's why i'd at least like the tool to have request delay
23:11 ^🔗	ez	well, its the street equivalent of a peaceful protest which dissolves in an hour, and a violent black bloc. they both might have just cause, but the latter is impatient, angry, and more likely to catch bystanders in the paving block and tear gas crossfire.
23:12 ^🔗	ola_norsk	i'd prefer walking away with the shit slowly :D
23:12 ^🔗	ola_norsk	'better late than never' :)
23:13 ^🔗	ez	ola_norsk: request delay is not generally necessary
23:13 ^🔗	ez	what is good manners is so called backoff-delay
23:14 ^🔗	ez	ola_norsk: https://gist.github.com/ezdiy/17855d7421bbb416cbb3d8e0e1caf213#file-vidme-py-L21
23:14 ^🔗	ez	this is my vidme scraper for example
23:15 ^🔗	ez	it hammers, until something goes wrong with the api, and starts to exponentially increase the delay
23:15 ^🔗	ez	for as long there is error
23:15 ^🔗	ez	worst thing to do is knowing you broke something and just keep blindly hammering anyway
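
The backoff behaviour ez describes can be sketched as below. This is not the code from the linked gist; it is a generic illustration, and `fetch_with_backoff`, its parameters, and the catch-all `except Exception` are assumptions made for the example. Requests go out back to back while they succeed, and the wait doubles after each consecutive failure, up to a cap.

```python
import time


def fetch_with_backoff(fetch, base=1.0, cap=300.0, max_tries=8,
                       sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each error."""
    delay = base
    for attempt in range(max_tries):
        try:
            return fetch()  # no delay at all while things work
        except Exception:   # assumption: narrow this to real API errors
            if attempt == max_tries - 1:
                raise       # give up instead of hammering forever
            sleep(delay)
            delay = min(delay * 2, cap)  # 1s, 2s, 4s, ... capped
```

The `sleep` parameter exists only so the waiting can be replaced in tests; a scraper would pass its own request function as `fetch` and leave the rest at the defaults.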
23:16 ^🔗	ola_norsk	lol..like going beyong date in a shell script? :D
23:16 ^🔗	ola_norsk	1 etc
23:16 ^🔗	ola_norsk	beyond*
23:17 ^🔗	ola_norsk	my sh script was so rotten i could smell it lol
23:18 ^🔗	ola_norsk	ez: http://paste.ubuntu.com/26179607/
23:19 ^🔗	ez	ola_norsk: fixed delay helps in a pinch (especially in places like sh), but is not quite ideal either
23:20 ^🔗	ez	ola_norsk: yea, for channel i'd not worry about it much
23:20 ^🔗	ola_norsk	aye
23:22 ^🔗	ola_norsk	i personally do not care about vidme users, though i feel sorry for the many who moved there from youtube thinking it would be a 'free haven'
23:23 ^🔗	ola_norsk	commercialized haven, that is
23:23 ^🔗	ez	i suspect vidme enjoyed a lot of popularity on account of its built-in youtube-dl support
23:24 ^🔗	ez	ie its very easy to "double post" on there
23:24 ^🔗	ola_norsk	aye, and a ton of youtube users put their links into that; and when done, cancelled their youtube channels
23:25 ^🔗	ez	not sure if that model would prevail. youtube tends to tell sites who are doing that "you guys, would you please stop doing that?" when they get big enough
23:35 ^🔗	ola_norsk	ez: one problem of Vidme might'VE been https://imgur.com/a/DWXj3
23:36 ^🔗	ola_norsk	ez: never saw a single damn ad, except on profile/video pages..
23:37 ^🔗	ola_norsk	ez: never heard back from the regarding that issue report though
23:37 ^🔗	ola_norsk	them*
23:46 ^🔗		kristian_ has joined #archiveteam-bs
23:48 ^🔗	ez	ola_norsk: i didnt follow vidme in recent past
23:48 ^🔗	ez	but as of january, vidme had no ads, their model was that of paid subscriptions/tips
23:54 ^🔗	ola_norsk	ez: then i wished help@vidme.com would've just said so (in september), instead of acting like they did :D
23:55 ^🔗	ez	iirc they made some vague promises
23:55 ^🔗	ez	in fairly recent times, but as i said, i didnt follow at the time, you better just google reddit convos or something
23:55 ^🔗	ola_norsk	'fake it until you make it' i guess :D
23:56 ^🔗	ola_norsk	damnit, they made me second-guess adblock :/
23:56 ^🔗	ez	but yea, vid.me was a very .com startup in a lot of ways
23:56 ^🔗	ola_norsk	aye
23:57 ^🔗	ez	over-reliance on "users will come", with a mediocre product, and they didn't come. happened a lot in the 90s.
23:57 ^🔗	ez	i honestly have no idea what the bitchute guys are doing
23:57 ^🔗	ola_norsk	quickly make a warrior job for bitchute..it's got magnetlinks :D
23:57 ^🔗	ola_norsk	aye
23:57 ^🔗	ez	they might get traction if they position themselves as non-profit
23:58 ^🔗	ez	so people will be willing to seed their webtorrents
23:58 ^🔗	ez	if they dont do that, everybody will be like 'fuck no, why should i help a commercial company lower their opex?'
23:58 ^🔗	ola_norsk	it's a viable thing though.. using webtorrent
23:58 ^🔗	ez	sure
23:58 ^🔗	ez	i seed IA torrents
23:59 ^🔗	ola_norsk	webtorrent player could even alleviate some outgoing data on IA i think
23:59 ^🔗	ez	as i consider IA mostly as a non-profit endeavor
23:59 ^🔗	ez	honestly, webtorrent is massive clusterfuck
